Re: How to lookup by a key in an RDD

2015-11-01 Thread Gylfi
Hi. You may want to look into Indexed RDDs: https://github.com/amplab/spark-indexedrdd Regards, Gylfi.
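A minimal sketch of both options (hypothetical data; the IndexedRDD usage follows that project's README and may differ between versions):

// Plain Spark: lookup on a pair RDD returns all values for a key,
// scanning only the partition that holds it when a partitioner is set.
import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(Seq((1L, "a"), (2L, "b"))).partitionBy(new HashPartitioner(4))
val values: Seq[String] = pairs.lookup(1L)

// IndexedRDD (separate package): per-partition hash index for fast point lookups.
import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._
val indexed = IndexedRDD(pairs).cache()
val value: Option[String] = indexed.get(1L)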

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-11-01 Thread Reynold Xin
Thanks for reporting it, Sjoerd. You might have a different version of Janino brought in from somewhere else. This should fix your problem: https://github.com/apache/spark/pull/9372 Can you give it a try? On Tue, Oct 27, 2015 at 9:12 PM, Sjoerd Mulder wrote: > No the

Re: job hangs when using pipe() with reduceByKey()

2015-11-01 Thread Gylfi
Hi. What is slow exactly? In the first code version, when you run the persist() + count() you store the result in RAM; the map + reduceByKey is then done on in-memory data. In the latter case (all in one line) you are doing both steps at the same time. So you are saying that if you sum up the time to
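A rough sketch of the two variants being compared (the external script, the parsing, and the input RDD[String] named data are all hypothetical; the only point is where persist() + count() forces materialization):

// Variant 1: materialize the pipe() output first, then aggregate the cached data.
val piped = data.pipe("./myscript.sh").persist()
piped.count()                                          // runs the pipe and caches its output
val counts1 = piped.map(line => (line, 1)).reduceByKey(_ + _)

// Variant 2: one lineage; the pipe runs in the same stage as the map side of the shuffle.
val counts2 = data.pipe("./myscript.sh").map(line => (line, 1)).reduceByKey(_ + _)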

[Spark MLlib] about linear regression issue

2015-11-01 Thread Zhiliang Zhu
Dear All, For N-dimensional linear regression, when the number of labeled training points (or the rank of the labeled-point space) is less than N, then from a mathematical perspective the weights of the trained linear model may not be unique. However, the output of model.weight() by Spark may be with
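A small illustration of the rank-deficient case with MLlib (hypothetical data; Spark 1.x API):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Two training points in a 3-dimensional feature space: the design matrix has rank 2 < 3,
// so infinitely many weight vectors fit the data exactly and the returned one is just one solution.
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 1.0)),
  LabeledPoint(2.0, Vectors.dense(0.0, 1.0, 1.0))))
val model = LinearRegressionWithSGD.train(data, 100)
println(model.weights)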

Re: Spark 1.5 on CDH 5.4.0

2015-11-01 Thread Deenar Toraskar
Hi guys, I have documented the steps involved in getting Spark 1.5.1 to run on CDH 5.4.0 here; let me know if it works for you as well: https://www.linkedin.com/pulse/running-spark-151-cdh-deenar-toraskar-cfa?trk=hp-feed-article-title-publish Looking forward to CDH 5.5, which supports Spark 1.5.x out

Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-01 Thread Romi Kuntsman
[adding dev list since it's probably a bug, but I'm not sure how to reproduce it so I can open a bug about it] Hi, I have a standalone Spark 1.4.0 cluster with 100s of applications running every day. From time to time, the applications crash with the following error (see below). But at the same

Re: apply simplex method to fix linear programming in spark

2015-11-01 Thread Zhiliang Zhu
Hi Ted Yu, Thanks very much for your kind reply. Do you just mean that in Spark there is no specific package for the simplex method? Then I may try to implement it myself, though I have not yet decided whether it is convenient to do in Spark before finally fixing it. Thank you, Zhiliang On Monday, November 2,

apply simplex method to fix linear programming in spark

2015-11-01 Thread Zhiliang Zhu
Dear All, I am facing a typical linear programming issue, and I know the simplex method is specifically for solving LP problems. May I ask whether there is already some mature package in Spark for the simplex method? Thank you very much~ Best Wishes! Zhiliang

Re: apply simplex method to fix linear programming in spark

2015-11-01 Thread Ted Yu
A brief search of the code base shows the following: TODO: Add simplex constraints to allow alpha in (0,1). ./mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala I guess the answer to your question is no. FYI On Sun, Nov 1, 2015 at 9:37 AM, Zhiliang Zhu

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
Good idea. With the dates sorting correctly alphabetically I should be able to do something similar with strings. On Sun, Nov 1, 2015 at 4:06 PM, Jörn Franke wrote: > Try with max date, in your case it could make more sense to represent the > date as int > > Sent from my
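A sketch of that idea (assumes the partition column is a yyyy-MM-dd string, so the lexicographic max is also the chronological max; names follow the example layout from the thread):

import org.apache.spark.sql.functions.max
val df = sqlContext.read.parquet("data")
val lastDate = df.select(max("date")).first().getString(0)
val latest = df.filter(df("date") === lastDate)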

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
It seems pretty fast, but if I have 2 partitions and 10MM records I do have to dedupe (distinct) 10MM records. A direct way to just find out what the 2 partitions are would be much faster. Spark knows it, but it's not exposed. On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers

Re: issue with spark.driver.maxResultSize parameter in spark 1.3

2015-11-01 Thread karthik kadiyam
Did anyone have an issue setting the spark.driver.maxResultSize value? On Friday, October 30, 2015, karthik kadiyam wrote: > Hi Shahid, > > I played around with spark driver memory too. In the conf file it was set > to "--driver-memory 20G" first. When I changed the
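For reference, a sketch of the two usual places to set it (illustrative value; it has to be set before the SparkContext is created):

// On the command line: spark-submit --conf spark.driver.maxResultSize=2g ...
// Or programmatically:
val conf = new org.apache.spark.SparkConf().set("spark.driver.maxResultSize", "2g")
val sc = new org.apache.spark.SparkContext(conf)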

Sort Merge Join

2015-11-01 Thread Alex Nastetsky
Hi, I'm trying to understand SortMergeJoin (SPARK-2213). 1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For example, in the code below, the two datasets have different numbers of partitions, but it still does a SortMergeJoin after a "hashpartitioning". CODE: val
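A small way to reproduce the observation (hypothetical data; assumes Spark 1.5, where sort-merge join is toggled by spark.sql.planner.sortMergeJoin):

sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
val left  = sqlContext.range(0, 1000000).repartition(8)
val right = sqlContext.range(0, 1000000).repartition(16)
// Both sides get re-shuffled by the join key (the "hashpartitioning" exchange),
// so the differing partition counts do not by themselves force a ShuffledHashJoin.
left.join(right, "id").explain()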

Re: job hangs when using pipe() with reduceByKey()

2015-11-01 Thread hotdog
Yes, the first code takes only 30 minutes, but with the second method I have waited 5 hours and it has only finished 10%.

Re: [Spark MLlib] about linear regression issue

2015-11-01 Thread DB Tsai
For constraints like all weights >= 0, people use LBFGS-B, which is supported in our optimization library, Breeze. https://github.com/scalanlp/breeze/issues/323 However, Spark's LiR implementation doesn't support constraints. I do see this is useful given we're experimenting
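For anyone who wants to try it outside of MLlib, a rough sketch against Breeze directly. This assumes Breeze's LBFGSB class with a (lowerBounds, upperBounds) constructor, and that x: DenseMatrix[Double], y: DenseVector[Double], and n = number of features are already defined; treat the API details as unverified:

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.optimize.{DiffFunction, LBFGSB}

// Least squares with the box constraint w >= 0 (no upper bound).
val solver = new LBFGSB(DenseVector.zeros[Double](n), DenseVector.fill(n)(Double.PositiveInfinity))
val loss = new DiffFunction[DenseVector[Double]] {
  def calculate(w: DenseVector[Double]) = {
    val r = x * w - y                 // residuals
    (r dot r, (x.t * r) * 2.0)        // (objective, gradient)
  }
}
val weights = solver.minimize(loss, DenseVector.zeros[Double](n))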

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
I agree the max date will satisfy the latest-date requirement, but it does not satisfy the second-to-last-date requirement you mentioned. Just for your information, before you invest too much in the partitioned table, I want to warn you that it has memory issues (both on executors and driver

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert, You should be able to see whether it requires scanning the whole dataset by running "explain" on the query. The physical plan should say something about it. I wonder if you are trying the distinct-sort-by-limit approach or the max-date approach? Best Regards, Jerry On Sun, Nov 1, 2015 at 4:25 PM,

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
I was going for the distinct approach, since I want it to be general enough to also solve other related problems later. The max-date approach is likely to be faster, though. On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam wrote: > Hi Koert, > > You should be able to see if it requires

Occasionally getting RpcTimeoutException

2015-11-01 Thread Jake Yoon
Hi Sparkers. I am very new to Spark, and I am occasionally getting RpcTimeoutException with the following error.
> 15/11/01 22:19:46 WARN HeartbeatReceiver: Removing executor 0 with no recent heartbeats: 321792 ms exceeds timeout 30 ms
> 15/11/01 22:19:46 ERROR TaskSchedulerImpl: Lost
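If the executors are alive but stalled (long GC pauses, heavy tasks) rather than actually lost, one common mitigation is to raise the relevant timeouts (illustrative values; set them before the context starts):

val conf = new org.apache.spark.SparkConf()
  .set("spark.network.timeout", "600s")              // covers several RPC timeouts, including the heartbeat timeout
  .set("spark.executor.heartbeatInterval", "60s")    // must stay well below spark.network.timeout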

spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
Hello all, I am trying to get familiar with Spark SQL partitioning support. My data is partitioned by date, like this:
data/date=2015-01-01
data/date=2015-01-02
data/date=2015-01-03
...
Let's say I would like a batch process to read data for the latest date only. How do I proceed? Generally
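For reference, a minimal sketch of reading such a layout with the DataFrame API (assuming the files are Parquet; Spark 1.5):

val df = sqlContext.read.parquet("data")               // the date=... directory names show up as a "date" column
val latest = df.filter(df("date") === "2015-01-03")    // the literal is what the rest of the thread discusses computing dynamically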

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jörn Franke
Try with max date; in your case it could make more sense to represent the date as an int. Sent from my iPhone > On 01 Nov 2015, at 21:03, Koert Kuipers wrote: > > hello all, > i am trying to get familiar with spark sql partitioning support. > > my data is partitioned by date,

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert, the physical plan looks like it is doing the right thing: partitioned table hdfs://user/koert/test, reading the date from the directory names, hash partitioning and aggregating the date to find the distinct dates, and finally shuffling the dates for the sort and limit 1 operations. This is my understanding of the

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert, If the partitioned table is implemented properly, I would think "select distinct(date) as dt from table order by dt DESC limit 1" would return the latest date without scanning the whole dataset. I haven't tried it myself. It would be great if you can report back whether this actually

Re: streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?

2015-11-01 Thread Akhil Das
You can use .saveAsObjectFiles("hdfs://sigmoid/twitter/status/") since you want to store the Status objects; for every batch it will create a directory under /status (the name will mostly be the timestamp). Since the data is small (hardly a couple of MBs for a 1-second interval) it will not overwhelm
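A minimal sketch of that (the HDFS path is a placeholder; assumes the twitter4j credentials are already set as system properties):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(sc, Seconds(1))
val statuses = TwitterUtils.createStream(ssc, None)                  // DStream[twitter4j.Status]
statuses.saveAsObjectFiles("hdfs://namenode:8020/twitter/status")    // one directory per batch, suffixed with the batch time
ssc.start()
ssc.awaitTermination()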

RE: Sort Merge Join

2015-11-01 Thread Cheng, Hao
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For example, in the code below, the two datasets have different numbers of partitions, but it still does a SortMergeJoin after a "hashpartitioning". [Hao:] A distributed JOIN operation (either HashBased or SortBased Join)

RE: sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found

2015-11-01 Thread Sun, Rui
Tom, Have you set the “MASTER” env variable on your machine? What is the value if set? From: Tom Stewart [mailto:stewartthom...@yahoo.com.INVALID] Sent: Friday, October 30, 2015 10:11 PM To: user@spark.apache.org Subject: sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found I have

Re: Unable to use saveAsSequenceFile

2015-11-01 Thread Akhil Das
Make sure your firewall isn't blocking the requests. Thanks Best Regards On Sat, Oct 24, 2015 at 5:04 PM, Amit Singh Hora wrote: > Hi All, > > I am trying to write an RDD as a sequence file into my Hadoop cluster but > getting connection time out again and again. I can ping

Re: Error : - No filesystem for scheme: spark

2015-11-01 Thread Jean-Baptiste Onofré
Hi, do you have something special in conf/spark-defaults.conf (especially for the eventLog directory)? Regards JB On 11/02/2015 07:48 AM, Balachandar R.A. wrote: Can someone tell me at what point this error could come? In one of my use cases, I am trying to use hadoop custom input format.

Re: java how to configure streaming.dstream.DStream<> saveAsTextFiles() to work with hdfs?

2015-11-01 Thread Akhil Das
How are you submitting your job? You need to make sure HADOOP_CONF_DIR is pointing to your Hadoop configuration directory (with the core-site.xml and hdfs-site.xml files). If you have them set properly, then make sure you are giving the full HDFS URL, like:
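For example (the original snippet was cut off; hostname and port here are placeholders for the actual namenode address from fs.defaultFS in core-site.xml):

dstream.saveAsTextFiles("hdfs://namenode-host:8020/user/spark/output/prefix")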

RE: SparkR job with >200 tasks hangs when calling from web server

2015-11-01 Thread Sun, Rui
I guess that this is not related to SparkR, but something wrong in Spark Core. Could you try your application logic within spark-shell (you would have to use the Scala DataFrame API) instead of the SparkR shell, to see if this issue still happens? -Original Message- From: rporcio

Re: issue with spark.driver.maxResultSize parameter in spark 1.3

2015-11-01 Thread shahid ashraf
Is your process getting killed? If yes, then try to check using dmesg. On Mon, Nov 2, 2015 at 8:17 AM, karthik kadiyam <karthik.kadiyam...@gmail.com> wrote: > Did anyone have an issue setting the spark.driver.maxResultSize value? > > On Friday, October 30, 2015, karthik kadiyam

RE: How to set memory for SparkR with master="local[*]"

2015-11-01 Thread Sun, Rui
Hi Matej, For the convenience of SparkR users who start SparkR without using bin/sparkR (for example, in RStudio), https://issues.apache.org/jira/browse/SPARK-11340 enables setting of “spark.driver.memory” (and also other similar options, like spark.driver.extraClassPath,

Re: Spark Streaming: how to StreamingContext.queueStream

2015-11-01 Thread Akhil Das
You can do something like this:
val rddQueue = scala.collection.mutable.Queue(rdd1, rdd2, rdd3)
val qDstream = ssc.queueStream(rddQueue)
Thanks Best Regards On Sat, Oct 24, 2015 at 4:43 AM, Anfernee Xu wrote: > Hi, > > Here's my situation, I have some kind of offline
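A slightly fuller, self-contained version of the same idea (assumes rdd1, rdd2, rdd3 are RDDs that were already built offline; the batch interval is arbitrary):

import scala.collection.mutable
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val rddQueue = mutable.Queue(rdd1, rdd2, rdd3)       // by default one RDD is dequeued per batch
val qDstream = ssc.queueStream(rddQueue)
qDstream.foreachRDD(rdd => println(rdd.count()))
ssc.start()
ssc.awaitTermination()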

Spark Streaming data checkpoint performance

2015-11-01 Thread Thúy Hằng Lê
Hi Spark gurus, I am evaluating Spark Streaming. In my application I need to maintain cumulative statistics (e.g. the total running word count), so I need to call the updateStateByKey function on every micro-batch. After setting those things up, I got the following behavior: * The Processing Time
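For context, a typical cumulative word count with updateStateByKey looks roughly like this (the checkpoint path and the input DStream named lines are placeholders); the mandatory checkpointing it requires is usually where the extra per-batch cost shows up:

ssc.checkpoint("hdfs://namenode:8020/checkpoints")    // required by updateStateByKey, written every batch

val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .updateStateByKey[Int](updateFunc)
counts.print()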

Re: Caching causes later actions to get stuck

2015-11-01 Thread Sampo Niskanen
Hi, Any ideas what's going wrong or how to fix it? Do I have to downgrade to 0.9.x to be able to use Spark? Best regards, Sampo Niskanen, Lead developer / Wellmo, sampo.niska...@wellmo.com, +358 40 820 5291 On Fri, Oct 30, 2015 at 4:57 PM, Sampo Niskanen

Error : - No filesystem for scheme: spark

2015-11-01 Thread Balachandar R.A.
Can someone tell me at what point this error could come? In one of my use cases, I am trying to use a Hadoop custom input format. Here is my code:
val hConf: Configuration = sc.hadoopConfiguration
hConf.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)

Re: Running 2 spark application in parallel

2015-11-01 Thread Akhil Das
Have a look at the dynamic resource allocation listed here https://spark.apache.org/docs/latest/job-scheduling.html Thanks Best Regards On Thu, Oct 22, 2015 at 11:50 PM, Suman Somasundar < suman.somasun...@oracle.com> wrote: > Hi all, > > > > Is there a way to run 2 spark applications in
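The relevant settings look roughly like this (illustrative values; dynamic allocation also requires the external shuffle service to be running on each worker):

val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")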