Re: what does Submitting ... missing tasks from Stage mean?

2015-02-23 Thread shahab
Thanks Imran, but I would appreciate it if you could explain what this means and what causes it to happen. I do need it. If there is any documentation somewhere, you can simply direct me there so I can try to understand it myself. best, /Shahab On Sat, Feb 21, 2015 at 12:26 AM, Imran Rashid

shuffle data taking immense disk space during ALS

2015-02-23 Thread Antony Mayi
Hi, This has already been briefly discussed here in the past, but there seem to be more questions... I am running a bigger ALS task with input data of ~40GB (~3 billion ratings). The data is partitioned into 512 partitions and I am also using default parallelism set to 512. The ALS runs with

Re: Missing shuffle files

2015-02-23 Thread Anders Arpteg
No, unfortunately we're not making use of dynamic allocation or the external shuffle service. Hoping that we could reconfigure our cluster to make use of it, but since it requires changes to the cluster itself (and not just the Spark app), it could take some time. Unsure if task 450 was acting as

Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
Hi, If I repartition my data by a factor equal to the number of worker instances, will the performance be better or worse? As far as I understand, the performance should be better, but in my case it is becoming worse. I have a single-node standalone cluster; is it because of this? Am I guaranteed

Can not run spark shell on yarn on Windows 7

2015-02-23 Thread quangnguyenbh
I am in the same situation as this person; has anyone faced and fixed this error? http://stackoverflow.com/questions/24610462/why-does-spark-shell-master-yarn-client-fail-yet-pyspark-master-yarn-seems Thanks, Quang.

Re: Spark SQL odbc on Windows

2015-02-23 Thread Denny Lee
Makes complete sense - I became a fan of Spark for pretty much the same reasons. Best of luck, eh?! On Mon Feb 23 2015 at 12:08:49 AM Francisco Orchard forch...@gmail.com wrote: Hi Denny Ashic, You are putting us on the right direction. Thanks! We will try following your advice and

Re: Need some help to create user defined type for ML pipeline

2015-02-23 Thread Jaonary Rabarisoa
Hi Joseph, Thank you for your feedback. I've managed to define an image type by following the VectorUDT implementation. I have another question about the definition of a user-defined transformer. The unary transformer is private to spark ml. Do you plan to provide a developer API for transformers? On

Re: Executor lost with too many temp files

2015-02-23 Thread Sameer Farooqui
Hi Marius, Are you using the sort or hash shuffle? Also, do you have the external shuffle service enabled (so that the Worker JVM or NodeManager can still serve the map spill files after an Executor crashes)? How many partitions are in your RDDs before and after the problematic shuffle

Re: Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
Hi Sameer, I'm still using Spark 1.1.1; I think the default is hash shuffle. No external shuffle service. We are processing gzipped JSON files, and the number of partitions equals the number of input files. In my current data set we have ~850 files that amount to 60 GB (so ~600 GB uncompressed). We have 5

Re: How to integrate HBASE on Spark

2015-02-23 Thread Deepak Vohra
Or, use the SparkOnHBase lab: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ From: Ted Yu yuzhih...@gmail.com To: Akhil Das ak...@sigmoidanalytics.com Cc: sandeep vura sandeepv...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Monday, February

Movie Recommendation tutorial

2015-02-23 Thread poiuytrez
Hello, I am following the Movie recommendation with MLlib tutorial (https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html). However, I get an RMSE that is much larger than what's written at step 7: The best model was trained with rank = 8 and lambda = 1.0, and numIter =

Re: Force RDD evaluation

2015-02-23 Thread Nicholas Pritchard
Thanks, Sean! Yes, I agree that this logging would still have some cost and so would not be used in production. On Sat, Feb 21, 2015 at 1:37 AM, Sean Owen so...@cloudera.com wrote: I think the cheapest possible way to force materialization is something like rdd.foreachPartition(i => None) I
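
For reference, a minimal sketch of the foreachPartition trick Sean describes; the RDD is assumed to be defined (and persisted) elsewhere:

    import org.apache.spark.rdd.RDD

    // Runs one task per partition but returns nothing to the driver, so the
    // whole RDD is computed (and cached, if persist() was called) at minimal cost.
    def materialize[T](rdd: RDD[T]): Unit =
      rdd.foreachPartition(_ => ())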

Re: Repartition and Worker Instances

2015-02-23 Thread Sameer Farooqui
In Standalone mode, a Worker JVM starts an Executor. Inside the Executor there are slots for task threads. The slot count is configured by the num_cores setting. Generally, oversubscribe this: if you have 10 free CPU cores, set num_cores to 20. On Monday, February 23, 2015, Deep Pradhan

Re: Posting to the list

2015-02-23 Thread Nicholas Chammas
Nabble is a third-party site. If you send stuff through Nabble, Nabble has to forward it along to the Apache mailing list. If something goes wrong with that, you will have a message show up on Nabble that no-one saw. The reverse can also happen, where something actually goes out on the list and

Re: Launching Spark cluster on EC2 with Ubuntu AMI

2015-02-23 Thread Nicholas Chammas
I know that Spark EC2 scripts are not guaranteed to work with custom AMIs but still, it should work… Nope, it shouldn’t, unfortunately. The Spark base AMIs are custom-built for spark-ec2. No other AMI will work unless it was built with that goal in mind. Using a random AMI from the Amazon

Re: How Broadcast variable scale?.

2015-02-23 Thread Guillermo Ortiz
Thanks, I'll read that paper. We haven't tried with a cluster that big, but supposedly we will in the future, and I was worried about it. I'll comment something if we finally do, but it's not going to be tomorrow :) 2015-02-23 17:38 GMT+01:00 Mosharaf Chowdhury mosharafka...@gmail.com: Hi

Re: Repartition and Worker Instances

2015-02-23 Thread Sameer Farooqui
In general you should first figure out how many task slots are in the cluster and then repartition the RDD to maybe 2x that number. So if you have 100 slots, then RDDs with a partition count of 100-300 would be normal. But the size of each partition also matters: you want a task to operate on a
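
As a concrete sketch of that rule of thumb (the slot count and multiplier here are illustrative, not prescriptive):

    val taskSlots = 100                         // total cores across all executors
    val tuned = rdd.repartition(taskSlots * 2)  // roughly 2x the slot count
    // If the resulting partitions become tiny, lower the multiplier; each task
    // should still operate on a reasonably sized chunk of data.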

Re: Spark SQL odbc on Windows

2015-02-23 Thread Robin East
Have you looked at Kylin? http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/#.VOtXUUsqnUk Pretty new but has the backing of eBay. On 23 Feb 2015, at 15:38, Denny Lee denny.g@gmail.com wrote: Makes complete sense - I became a fan of Spark for pretty

Re: Movie Recommendation tutorial

2015-02-23 Thread poiuytrez
What do you mean?

Re: sorting output of join operation

2015-02-23 Thread Imran Rashid
sortByKey() is probably the easiest way: import org.apache.spark.SparkContext._ joinedRdd.map{ case (word, (file1Counts, file2Counts)) => (file1Counts, (word, file1Counts, file2Counts)) }.sortByKey() On Mon, Feb 23, 2015 at 10:41 AM, Anupama Joshi anupama.jo...@gmail.com wrote: Hi , To
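
Spelled out, and sorted descending since the question asks for the most frequent words first, a sketch assuming joinedRdd has the RDD[(String, (Int, Int))] shape from the thread:

    import org.apache.spark.SparkContext._  // pair-RDD functions in Spark 1.x

    val sorted = joinedRdd
      .map { case (word, (c1, c2)) => (c1, (word, c1, c2)) }  // key by the file1 count
      .sortByKey(ascending = false)                           // most frequent first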

Re: Running Example Spark Program

2015-02-23 Thread Deepak Vohra
The Spark cluster has no memory allocated. Memory: 0.0 B Total, 0.0 B Used   From: Surendran Duraisamy 2013ht12...@wilp.bits-pilani.ac.in To: user@spark.apache.org Sent: Sunday, February 22, 2015 6:00 AM Subject: Running Example Spark Program Hello All, I am new to Apache Spark,

Re: How Broadcast variable scale?.

2015-02-23 Thread Mosharaf Chowdhury
Hi Guillermo, The current broadcast algorithm in Spark approximates the one described in the Section 5 of this paper http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf. It is expected to scale sub-linearly; i.e., O(log N), where N is the number of machines in your cluster. We

Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
Hi guys, I keep running into a strange problem where my jobs start to fail with the dreaded “Resubmitted (resubmitted due to lost executor)” because of having too many temp files from previous runs. Both /var/run and /spill have enough disk space left, but after a given amount of jobs have

How Broadcast variable scale?.

2015-02-23 Thread Guillermo Ortiz
I'm looking for information about how broadcast variables scale in Spark and what algorithm is used. I have found http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf I don't know if it's talking about the current version (1.2.1) because the file was created in 2010. I

RE: spark streaming window operations on a large window size

2015-02-23 Thread Shao, Saisai
I don't think current Spark Streaming supports window operations which go beyond its available memory. Internally, Spark Streaming keeps all the data belonging to the effective window in memory; if the memory is not enough, BlockManager will discard the blocks by LRU policy, so something

sorting output of join operation

2015-02-23 Thread Anupama Joshi
Hi, To simplify my problem - I have 2 files from which I am reading words. The o/p is like: file 1: aaa 4, bbb 6, ddd 3; file 2: ddd 2, bbb 6, ttt 5. If I do file1.join(file2) I get (ddd,(3,2)), (bbb,(6,6)). If I want to sort the output by the number of occurrences of the word in file1, how do I achieve

Re: How to integrate HBASE on Spark

2015-02-23 Thread Ted Yu
Installing hbase on hadoop cluster would allow hbase to utilize features provided by hdfs, such as short circuit read (See '90.2. Leveraging local data' under http://hbase.apache.org/book.html#perf.hdfs). Cheers On Sun, Feb 22, 2015 at 11:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: If

Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
How is a task slot different from the # of Workers? "...so don't read into any performance metrics you've collected to extrapolate what may happen at scale." I did not get you on this. Thank You On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui same...@databricks.com wrote: In general you should first

On app upgrade, restore sliding window data.

2015-02-23 Thread Matus Faro
Hi, Our application is being designed to operate at all times on a large sliding window (day+) of data. The operations performed on the window of data will change fairly frequently and I need a way to save and restore the sliding window after an app upgrade without having to wait the duration of

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
I'm looking at my YARN container logs for some of the executors which appear to be failing (with the missing shuffle files). I see exceptions that say client.TransportClientFactory: Found inactive connection to host/ip:port, closing it. Right after that I see shuffle.RetryingBlockFetcher: Exception

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
Just to close the loop in case anyone runs into the same problem I had. By setting --hadoop-major-version=2 when using the ec2 scripts, everything worked fine. Darin. - Original Message - From: Darin McBeath ddmcbe...@yahoo.com.INVALID To: Mingyu Kim m...@palantir.com; Aaron Davidson

Re: Missing shuffle files

2015-02-23 Thread Anders Arpteg
Sounds very similar to what I experienced, Corey. Something that seems to at least help with my problems is to have more partitions. I am already fighting between ending up with too many partitions in the end and having too few in the beginning. By coalescing as late as possible and avoiding too few

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
This also triggers an interesting question: how can I do this locally in code if I want? For example: I have RDDs A and B which share the same partitioning; then if I want to join A to B, I might just want to do a map-side join (although B itself might be big, B's local partition is known to be small
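
One common way to express the map-side join Shuai mentions is a broadcast lookup table; this sketch assumes the smaller side fits in executor memory (bigRdd and smallRdd are hypothetical pair RDDs), while a shuffle-free variant for co-partitioned RDDs comes up later in the thread:

    val smallMap = sc.broadcast(smallRdd.collectAsMap())  // shipped once per executor
    val joined = bigRdd.flatMap { case (k, v) =>
      smallMap.value.get(k).map(w => (k, (v, w)))         // emit only matching keys
    }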

Requested array size exceeds VM limit

2015-02-23 Thread insperatum
Hi, I'm using MLlib to train a random forest. It's working fine to depth 15, but if I use depth 20 I get a java.lang.OutOfMemoryError: Requested array size exceeds VM limit on the driver, from the collectAsMap operation in DecisionTree.scala, around line 642. It doesn't happen until a good hour

Re: Requested array size exceeds VM limit

2015-02-23 Thread Sean Owen
It doesn't mean 'out of memory'; it means 'you can't allocate a byte[] over 2GB in the JVM'. Something is serializing a huge block somewhere. There are a number of related JIRAs and discussions on JIRA and this mailing list; have a browse of those first for the back story. On Mon, Feb 23, 2015 at

Re: RDD groupBy

2015-02-23 Thread Vijayasarathy Kannan
You are right. I was looking at the wrong logs. I ran it on my local machine and saw that the println actually wrote the vertexIds. I was then able to find the same in the executors' logs in the remote machine. Thanks for the clarification. On Mon, Feb 23, 2015 at 2:00 PM, Sean Owen

Re: How to integrate HBASE on Spark

2015-02-23 Thread sandeep vura
Hi Deepak, Thanks for posting the link. It looks like it supports only Cloudera distributions, as stated on GitHub. We are using an Apache Hadoop multi-node cluster, not a Cloudera distribution. Please confirm whether I can use it on an Apache Hadoop cluster. Regards, Sandeep.v On Mon, Feb 23, 2015

RDD groupBy

2015-02-23 Thread kvvt
In the snippet below, graph.edges.groupBy[VertexId](f1).foreach { edgesBySrc => f2(edgesBySrc).foreach { vertexId => println(vertexId) } } f1 is a function that determines how to group the edges (in my case it groups by source vertex) f2 is another

Re: Access time to an element in cached RDD

2015-02-23 Thread Sean Owen
It may involve accessing an element of an RDD from a remote machine and copying it back to the driver. That and the small overhead of job scheduling could be a millisecond. You're comparing to just reading an entry from memory, which is of course faster. I don't think you should think of an RDD as

Access time to an element in cached RDD

2015-02-23 Thread shahab
Hi, I just wonder what would be the access time to take one element from a cached RDD? If I have understood correctly, access to RDD elements is not as fast as accessing e.g. a HashMap, and it could take milliseconds compared to nanoseconds in a HashMap, which is quite a significant difference if

Re: RDD groupBy

2015-02-23 Thread Sean Owen
Here, println isn't happening on the driver. Are you sure you are looking at the right machine's logs? Yes, this may be parallelized over many machines. On Mon, Feb 23, 2015 at 6:37 PM, kvvt kvi...@vt.edu wrote: In the snippet below, graph.edges.groupBy[VertexId](f1).foreach { edgesBySrc =>

Re: How to print more lines in spark-shell

2015-02-23 Thread Sean Owen
I'd imagine that myRDD.take(10).foreach(println) is the most straightforward thing but yeah you can probably change shell default behavior too. On Mon, Feb 23, 2015 at 7:15 PM, Mark Hamstra m...@clearstorydata.com wrote: That will produce very different output than just the 10 items that Manas

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tathagata Das
There are different kinds of checkpointing going on. updateStateByKey requires RDD checkpointing, which can be enabled only by calling sparkContext.setCheckpointDir. But that does not enable Spark Streaming driver checkpoints, which are necessary for recovering from driver failures. That is
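
The two knobs TD distinguishes, as a sketch (paths illustrative):

    sc.setCheckpointDir("hdfs:///checkpoints/rdd")   // RDD checkpointing, e.g. for updateStateByKey
    ssc.checkpoint("hdfs:///checkpoints/streaming")  // driver/metadata checkpoints, for driver recovery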

Re: How to print more lines in spark-shell

2015-02-23 Thread Mark Hamstra
Yes, if you're willing to add an explicit foreach(println), then that is the simplest solution. Else changing maxPrintString should modify the default output of the Scala/Spark REPL. On Mon, Feb 23, 2015 at 11:25 AM, Sean Owen so...@cloudera.com wrote: I'd imagine that

How to print more lines in spark-shell

2015-02-23 Thread Manas Kar
Hi experts, I am using Spark 1.2 from CDH 5.3. When I issue commands like myRDD.take(10), the result gets truncated after 4-5 records. Is there a way to configure this to show more items? ..Manas

RE: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Oleg Shirokikh
Dear Patrick, Thanks a lot again for your help. "What happens if you submit from the master node itself on ec2 (in client mode), does that work? What about in cluster mode?" If I SSH to the machine with the Spark master, then everything works - shell, and regular submit in both client and cluster

Re: How to print more lines in spark-shell

2015-02-23 Thread Mark Hamstra
That will produce very different output than just the 10 items that Manas wants. This is essentially a Scala shell issue, so this should apply: http://stackoverflow.com/questions/9516567/settings-maxprintstring-for-scala-2-9-repl On Mon, Feb 23, 2015 at 10:25 AM, Akhil Das
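
From the linked Stack Overflow answer, the incantation looks roughly like this (Scala 2.9-era REPL; the exact path may differ in other versions):

    scala> :power
    scala> vals.isettings.maxPrintString = 10000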

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Mingyu Kim
Cool, we will start from there. Thanks Aaron and Josh! Darin, it's likely because the DirectOutputCommitter is compiled with Hadoop 1 classes and you're running it with Hadoop 2. org.apache.hadoop.mapred.JobContext used to be a class in Hadoop 1, and it became an interface in Hadoop 2. Mingyu

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
If you call reduceByKey(), internally Spark will introduce a shuffle operation, no matter whether the data is already partitioned locally; Spark itself does not know the data is already well partitioned. So if you want to avoid the shuffle, you have to write the code explicitly to avoid this, from my

Re: Efficient way of scoring all items and users in an ALS model

2015-02-23 Thread Xiangrui Meng
You can use rdd.cartesian then find top-k by key to distribute the work to executors. There is a trick to boost the performance: you need to blockify user/product features and then use native matrix-matrix multiplication. There is a relevant PR from Deb: https://github.com/apache/spark/pull/3098 .
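
A naive sketch of the cartesian-plus-top-k idea, without the blockified matrix-multiplication optimization Xiangrui mentions; userFeatures and productFeatures are assumed to have the RDD[(Int, Array[Double])] shape of MLlib's MatrixFactorizationModel:

    val k = 10
    val scores = userFeatures.cartesian(productFeatures).map {
      case ((user, uf), (product, pf)) =>
        (user, (product, uf.zip(pf).map { case (a, b) => a * b }.sum))  // dot product
    }
    val topK = scores.aggregateByKey(List.empty[(Int, Double)])(
      (acc, x) => (x :: acc).sortBy(-_._2).take(k),  // best k within a partition
      (a, b) => (a ++ b).sortBy(-_._2).take(k)       // merge partial top-k lists
    )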

Re: shuffle data taking immense disk space during ALS

2015-02-23 Thread Xiangrui Meng
Did you try to use a smaller number of partitions (user/product blocks)? Did you use implicit feedback? In the current implementation, we only do checkpointing with implicit feedback. We should adopt the checkpoint strategy implemented in LDA:

Re: Movie Recommendation tutorial

2015-02-23 Thread Xiangrui Meng
Which Spark version did you use? Btw, there are three datasets from MovieLens. The tutorial used the medium one (1 million). -Xiangrui On Mon, Feb 23, 2015 at 8:36 AM, poiuytrez guilla...@databerries.com wrote: What do you mean?

Re: Query data in Spark RRD

2015-02-23 Thread Tathagata Das
You could build a REST API, but you may have issues if you want to return arbitrary binary data. A more complex but robust alternative is to use an RPC library like Akka, Thrift, etc. TD On Mon, Feb 23, 2015 at 12:45 AM, Nikhil Bafna nikhil.ba...@flipkart.com wrote: Tathagata - Yes,

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
Aaron. Thanks for the class. Since I'm currently writing Java-based Spark applications, I tried converting your class to Java (it seemed pretty straightforward). I set up the use of the class as follows: SparkConf conf = new SparkConf() .set(spark.hadoop.mapred.output.committer.class,
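
In Scala, the configuration Darin describes would look roughly like this; DirectOutputCommitter is the custom class from this thread and is assumed to be on the application classpath:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.hadoop.mapred.output.committer.class",
           classOf[DirectOutputCommitter].getName)
    val sc = new SparkContext(conf)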

Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
Hi All, I am running a simple PageRank program, but it is slow. And I dug out that part of the reason is that a shuffle happens when I call a union operation, even though both RDDs share the same partitioner: Below is my test code in spark shell: import org.apache.spark.HashPartitioner

Re: Spark Performance on Yarn

2015-02-23 Thread Lee Bierman
Thanks for the suggestions. I removed the persist call from the program. Doing so, I started it with: spark-submit --class com.xxx.analytics.spark.AnalyticsJob --master yarn /tmp/analytics.jar --input_directory hdfs://ip:8020/flume/events/2015/02/ This takes all the defaults and only runs 2

Re: Need some help to create user defined type for ML pipeline

2015-02-23 Thread Xiangrui Meng
Yes, we are going to expose the developer API. There was a long discussion in the PR: https://github.com/apache/spark/pull/3637. So we marked them package private and are looking for feedback on how to improve it. Please implement your classes under `spark.ml` for now and let us know your feedback.

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
Thanks. I think my problem might actually be the other way around. I'm compiling with Hadoop 2, but when I start up Spark using the ec2 scripts, I don't specify --hadoop-major-version, and the default is 1. I'm guessing that if I make that a 2 it might work correctly. I'll try it and

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
In the book Learning Spark: So here it means that no shuffle happens across the network, but a shuffle is still done locally? Even if that is the case, why would union trigger a shuffle? I think union should just append the RDDs together. From: Shao, Saisai [mailto:saisai.s...@intel.com]

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
I think some RDD APIs like zipPartitions or others can do this as you want. You might check the docs. Thanks Jerry From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Monday, February 23, 2015 1:35 PM To: Shao, Saisai Cc: user@spark.apache.org Subject: RE: Union and reduceByKey will trigger
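
A sketch of the zipPartitions approach Jerry suggests, assuming a and b are pair RDDs with identical partitioning (same partitioner and partition count) and that b's local partition fits in memory:

    val joinedLocally = a.zipPartitions(b) { (iterA, iterB) =>
      val lookup = iterB.toMap               // b's local partition, held in memory
      iterA.flatMap { case (k, v) =>
        lookup.get(k).map(w => (k, (v, w)))  // per-partition join, no shuffle
      }
    }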

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
I've no context of this book, but AFAIK union will not trigger a shuffle, as it just puts the partitions together; the operator reduceByKey() will actually trigger the shuffle. Thanks Jerry From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Monday, February 23, 2015 12:26 PM To: Shao, Saisai Cc:

Spark configuration

2015-02-23 Thread King sami
Hi Experts, I am new to Spark, so I want to run it locally on my machine with Ubuntu as the OS. I downloaded the latest version of Spark. I ran this command to start it: ./sbin/start-master.sh but an error occurred: starting org.apache.spark.deploy.master.Master, logging to

Re: Spark configuration

2015-02-23 Thread Sean Owen
It sounds like you downloaded the source distribution perhaps, but have not built it. That's what the message is telling you. See http://spark.apache.org/docs/latest/building-spark.html Or maybe you intended to get a binary distribution. On Mon, Feb 23, 2015 at 10:40 PM, King sami

Re: Spark configuration

2015-02-23 Thread Shlomi Babluki
I guess you downloaded the source code. You can build it with the following command: mvn -DskipTests clean package Or just download a compiled version. Shlomi On 24 Feb 2015, at 00:40, King sami kgsam...@gmail.com wrote: Hi Experts, I am new to Spark, so I want to run it locally

Re: Pyspark save Decison Tree Module with joblib/pickle

2015-02-23 Thread Sebastián Ramírez
In your log it says: pickle.PicklingError: Can't pickle type 'thread.lock': it's not found as thread.lock As far as I know, you can't pickle Spark models. If you go to the documentation for Pickle you can see that you can pickle only simple Python structures and code (written in Python), at

Re: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Imran Rashid
I think you're getting tripped up by lazy evaluation and the way stage boundaries work (admittedly it's pretty confusing in this case). It is true that up until recently, if you unioned two RDDs with the same partitioner, the result did not have the same partitioner. But that was just fixed here:
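
A quick way to check that behavior on a given build, assuming a and b were both partitioned with the same HashPartitioner:

    val u = a.union(b)
    println(u.partitioner)  // Some(...) only on builds where union preserves the partitioner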

Re: Movie Recommendation tutorial

2015-02-23 Thread Krishna Sankar
1. The RMSE varies a little bit between the versions. 2. Partitioned the training, validation, test sets like so: - training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6) - validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and (x[3] % 10) < 8) - test =

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
I've got the opposite problem with regards to partitioning. I've got over 6000 partitions for some of these RDDs, which immediately blows the heap somehow; I'm still not exactly sure how. If I coalesce them down to about 600-800 partitions, I get the problems where the executors are dying without

FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Oleg Shirokikh
Patrick, I haven't changed the configs much. I just executed the ec2 script to create a 1-master, 2-slave cluster. Then I try to submit the jobs from a remote machine, leaving all defaults configured by the Spark scripts as default. I've tried to change configs as suggested in other mailing-list and stack

Re: Where to look for potential causes for Akka timeout errors in a Spark Streaming Application?

2015-02-23 Thread Emre Sevinc
Hello Todd, Thank you for your suggestion! I have first tried increasing the Driver memory to 2G and it worked without any problems, but I will also test with the parameters and values you've shared. Kind regards, Emre Sevinç http://www.bigindustries.be/ On Fri, Feb 20, 2015 at 3:25 PM, Todd

Re: How to integrate HBASE on Spark

2015-02-23 Thread sandeep vura
Hi Akhil, I had installed Spark on the Hadoop cluster itself. All of my clusters are on the same network. Thanks, Sandeep.v On Mon, Feb 23, 2015 at 1:08 PM, Akhil Das ak...@sigmoidanalytics.com wrote: If you are having both the clusters on the same network, then i'd suggest you installing it on

Re: Periodic Broadcast in Apache Spark Streaming

2015-02-23 Thread Tathagata Das
You could do something like this. def rddTransformationUsingBroadcast(rdd: RDD[...]): RDD[...] = { val broadcastToUse = getBroadcast() // get the reference to a broadcast variable, new or existing. rdd.map { .. } // use broadcast variable }
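
In a streaming job such a helper would typically be applied through transform, so that getBroadcast() is re-evaluated for every batch; a sketch with hypothetical names (getBroadcast and lookup are application-defined):

    val out = dstream.transform { rdd =>
      val b = getBroadcast()            // returns a new or cached broadcast variable
      rdd.map(x => lookup(x, b.value))  // per-record use of the broadcast
    }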

Re: How to diagnose could not compute split errors and failed jobs?

2015-02-23 Thread Tathagata Das
Could you find the executor logs on the executor where that task was scheduled? That may provide more information on what caused the error. Also take a look at where the block in question was stored, and where the task was scheduled. You will need to enable log4j INFO level logs for this

Sorting the output for spark join

2015-02-23 Thread AJ614
Hi, To simplify my problem - I have 2 files from which I am reading words. The o/p is like: file 1: aaa 4, bbb 6, ddd 3; file 2: ddd 2, bbb 6, ttt 5. If I do file1.join(file2) I get (ddd,(3,2)), (bbb,(6,6)). If I want to sort the output by the number of occurrences of the word in file1, how do I

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Sean Owen
I reached a similar conclusion about checkpointing. It requires your entire computation to be serializable, even all of the 'local' bits. Which makes sense. In my case I do not use checkpointing and it is fine to restart the driver in the case of failure and not try to recover its state. What I

Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Patrick Wendell
What happens if you submit from the master node itself on ec2 (in client mode), does that work? What about in cluster mode? It would be helpful if you could print the full command that the executor is failing. That might show that spark.driver.host is being set strangely. IIRC we print the launch

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tobias Pfeiffer
Sean, thanks for your message! On Mon, Feb 23, 2015 at 6:03 PM, Sean Owen so...@cloudera.com wrote: What I haven't investigated is whether you can enable checkpointing for the state in updateStateByKey separately from this mechanism, which is exactly your question. What happens if you set a

Re: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Akhil Das
Just make sure you meet the following: 1. Set spark.driver.host to your local ip (Where you runs your code, and it should be accessible from the cluster) 2. Make sure no firewall/router configurations are blocking/filtering the connection between your laptop and the cluster. Best way to test
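
Point 1 expressed as configuration, a sketch with an illustrative address:

    val conf = new SparkConf()
      .set("spark.driver.host", "203.0.113.5")  // an IP of your machine reachable from the cluster
      .set("spark.driver.port", "7777")         // optionally pin the port for firewall rules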

Re: Query data in Spark RRD

2015-02-23 Thread Tathagata Das
You will have to build a split infrastructure - a front end that takes the queries from the UI and sends them to the backend, and the backend (running the Spark Streaming app) will actually run the queries on tables created in the contexts. The RPCs necessary between the frontend and backend will

Re: Query data in Spark RRD

2015-02-23 Thread Nikhil Bafna
Tathagata - Yes, I'm thinking along that line. The problem is how to send the query to the backend. Bundle an HTTP server into the Spark Streaming job that will accept the parameters? -- Nikhil Bafna On Mon, Feb 23, 2015 at 2:04 PM, Tathagata Das t...@databricks.com wrote: You will have to build a

spark streaming window operations on a large window size

2015-02-23 Thread avilevi3
Hi guys, does Spark Streaming support window operations on a sliding window whose data is larger than the available memory? We would like to know. Currently we are using Kafka as input, but we could change that if needed. thanks Avi

Re: Spark SQL Where IN support

2015-02-23 Thread Paolo Platter
I was speaking about the 1.2 version of Spark. Paolo From: Paolo Platter mailto:paolo.plat...@agilelab.it Sent: Monday, 23 February 2015, 10:41 To: user@spark.apache.org mailto:user@spark.apache.org Hi guys, Is the IN operator supported in Spark SQL over Hive Metastore? Thanks Paolo

Spark SQL Where IN support

2015-02-23 Thread Paolo Platter
Hi guys, Is the “IN” operator supported in Spark SQL over Hive Metastore ? Thanks Paolo

Memory problems when calling pipe()

2015-02-23 Thread Juan Rodríguez Hortalá
Hi, I'm having problems using pipe() from a Spark program written in Java, where I call a Python script, running on a YARN cluster. The problem is that the job fails when YARN kills the container because the Python script is going beyond the memory limits. I get something like this in the log:
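
One setting that often matters in this situation is the YARN memory overhead, which reserves container headroom for non-JVM memory such as the piped Python process; the value below is illustrative:

    val conf = new SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "1024")  // MB outside the executor heap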

Pyspark save Decison Tree Module with joblib/pickle

2015-02-23 Thread Jaggu
Hi Team, I was trying to save a DecisionTree model from PySpark using joblib. It is giving me the following error: http://pastebin.com/82CFhPNn . Any clue how to resolve this or save the model? Best regards Jagan

Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-23 Thread Cheng Lian
(Move to user list.) Hi Kannan, You need to set mapred.map.tasks to 1 in hive-site.xml. The reason is this line of code https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68, which overrides spark.default.parallelism. Also,

Getting to proto buff classes in Spark Context

2015-02-23 Thread necro351 .
Hello, I am trying to deserialize some data encoded using proto buff from within Spark and am getting class-not-found exceptions. I have narrowed the program down to something very simple that shows the problem exactly (see 'The Program' below) and hopefully someone can tell me the easy fix :)

Re: Movie Recommendation tutorial

2015-02-23 Thread Xiangrui Meng
Try to set lambda to 0.1. -Xiangrui On Mon, Feb 23, 2015 at 3:06 PM, Krishna Sankar ksanka...@gmail.com wrote: The RMSE varies a little bit between the versions. Partitioned the training, validation, test sets like so: training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6) validation =

Re: Pyspark save Decison Tree Module with joblib/pickle

2015-02-23 Thread Xiangrui Meng
FYI, in 1.3 we support save/load tree models in Scala and Java. We will add save/load support to Python soon. -Xiangrui On Mon, Feb 23, 2015 at 2:57 PM, Sebastián Ramírez sebastian.rami...@senseta.com wrote: In your log it says: pickle.PicklingError: Can't pickle type 'thread.lock': it's not
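
The 1.3-era Scala API Xiangrui refers to looks like this (the path is illustrative and model is a trained DecisionTreeModel):

    import org.apache.spark.mllib.tree.model.DecisionTreeModel

    model.save(sc, "hdfs:///models/myTree")
    val restored = DecisionTreeModel.load(sc, "hdfs:///models/myTree")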

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
I *think* this may have been related to the default memory overhead setting being too low. I raised the value to 1G and tried my job again, but I had to leave the office before it finished. It did get further, but I'm not exactly sure if that's just because I raised the memory. I'll see tomorrow-

Re_ Re_ Does Spark Streaming depend on Hadoop_(4)

2015-02-23 Thread bit1...@163.com
I am going crazy from frequent mail rejections, so I created a new thread. SMTP error, DOT: 552 spam score (5.7) exceeded threshold (FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_REPLY,HTML_FONT_FACE_BAD,HTML_MESSAGE,RCVD_IN_BL_SPAMCOP_NET,SPF_PASS Hi Silvio and Ted, I know there is a configuration parameter to

Re: Re: About FlumeUtils.createStream

2015-02-23 Thread bit1...@163.com
Thanks both of you guys on this! bit1...@163.com From: Akhil Das Date: 2015-02-24 12:58 To: Tathagata Das CC: user; bit1129 Subject: Re: About FlumeUtils.createStream I see, thanks for the clarification TD. On 24 Feb 2015 09:56, Tathagata Das t...@databricks.com wrote: Akhil, that is

Re: Getting to proto buff classes in Spark Context

2015-02-23 Thread Ted Yu
bq. Caused by: java.lang.ClassNotFoundException: com.rick.reports.Reports$SensorReports Is the Reports$SensorReports class in rick-processors-assembly-1.0.jar? Thanks On Mon, Feb 23, 2015 at 8:43 PM, necro351 . necro...@gmail.com wrote: Hello, I am trying to deserialize some data encoded

Re: Task not serializable exception

2015-02-23 Thread Kartheek.R
I could trace where the problem is. If I run without any threads, it works fine. When I allocate threads, I run into the Not Serializable problem. But I need to have threads in my code. Any help please!!! This is my code: object SparkKart { def parseVector(line: String): Vector[Double] = {

Re: Re: About FlumeUtils.createStream

2015-02-23 Thread bit1...@163.com
The behavior is exactly what I expected. Thanks Akhil and Tathagata! bit1...@163.com From: Akhil Das Date: 2015-02-24 13:32 To: bit1129 CC: Tathagata Das; user Subject: Re: Re: About FlumeUtils.createStream That depends on how many machines you have in your cluster. Say you have 6 workers and

Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
You mean SPARK_WORKER_CORES in /conf/spark-env.sh? On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui same...@databricks.com wrote: In Standalone mode, a Worker JVM starts an Executor. Inside the Exec there are slots for task threads. The slot count is configured by the num_cores setting.

Re: Re: About FlumeUtils.createStream

2015-02-23 Thread bit1...@163.com
Hi Akhil, Tathagata, This leads me to another question: for the Spark Streaming and Kafka integration, if there is more than one Receiver in the cluster, such as val streams = (1 to 6).map ( _ => KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2) ), then these Receivers will
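
The usual companion to that pattern is to union the receiver streams into a single DStream, as in the Spark Streaming documentation:

    val streams = (1 to 6).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    }
    val unified = ssc.union(streams)  // one logical stream fed by six receivers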

Performance Instrumentation for Spark Jobs

2015-02-23 Thread Neil Ferguson
Hi all I wanted to share some details about something I've been working on with the folks on the ADAM project: performance instrumentation for Spark jobs. We've added a module to the bdg-utils project ( https://github.com/bigdatagenomics/bdg-utils) to enable Spark users to instrument RDD

Re: Re: Does Spark Streaming depend on Hadoop?

2015-02-23 Thread bit1...@163.com
[hadoop@hadoop bin]$ sh submit.log.streaming.kafka.complicated.sh Spark assembly has been built with Hive, including Datanucleus jars on classpath Start to run MyKafkaWordCount Exception in thread main java.net.ConnectException: Call From hadoop.master/192.168.26.137 to hadoop.master:9000

Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Franc Carter
Is your laptop behind a NAT? I got bitten by a similar issue and (I think) it was because I was behind a NAT that did not forward the public IP back to my private IP unless the connection originated from my private IP. cheers On Tue, Feb 24, 2015 at 5:20 AM, Oleg Shirokikh o...@solver.com
