Largest input data set observed for Spark.

2014-03-20 Thread Usman Ghani
All, what is the largest input data set y'all have come across that has been successfully processed in production using Spark? Ballpark?

Re: Relation between DStream and RDDs

2014-03-20 Thread Pascal Voitot Dev
If I may add my contribution to this discussion, assuming I understand your question correctly... A DStream is a discretized stream: it discretizes the data stream over windows of time (according to the project code and the paper I've read). So when you write: JavaStreamingContext stcObj = new
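
The message is cut off at the constructor call, but the discretization it describes can be sketched as follows (a minimal Scala equivalent of the JavaStreamingContext setup; host, port, and batch interval are placeholder assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // The batch duration (2 seconds here) is the window over which the
    // incoming stream is discretized: one RDD per interval.
    val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Placeholder socket source; each batch interval yields one RDD.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd =>
      println(s"this 2-second batch contains ${rdd.count()} lines")
    }
    ssc.start()
    ssc.awaitTermination()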

Re: Relation between DStream and RDDs

2014-03-20 Thread Sanjay Awatramani
@TD: I do not need multiple RDDs in a DStream in every batch. On the contrary, my logic would work fine if there is only 1 RDD. But then the description for functions like reduce and count (Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source
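
The doc text quoted above can be exercised directly; a minimal sketch (placeholder source, names assumed):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("CountSketch").setMaster("local[2]"), Seconds(2))
    val lines = ssc.socketTextStream("localhost", 9999)

    // count() returns a new DStream whose RDD for each batch holds a single
    // element: the number of elements in that batch's source RDD.
    val counts = lines.count()
    counts.foreachRDD(rdd => assert(rdd.count() == 1))
    ssc.start()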

Error while reading from HDFS Simple application

2014-03-20 Thread Laeeq Ahmed
VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CreateSnapshotRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet; What can be the cause of this error? Regards, Laeeq Ahmed, PhD Student, HPCViz, KTH.
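
This VerifyError usually indicates two protobuf-java versions on the classpath (Hadoop 2.2+ requires 2.5.x, while builds against older Hadoop pull in 2.4.x). A hedged build.sbt sketch of one way to pin a single version; the Spark and Hadoop versions shown are illustrative assumptions:

    libraryDependencies ++= Seq(
      ("org.apache.spark" %% "spark-core" % "0.9.0-incubating")
        .exclude("org.apache.hadoop", "hadoop-client"),
      "org.apache.hadoop" % "hadoop-client" % "2.2.0"
    )
    // Pin one protobuf version so getUnknownFields resolves consistently.
    dependencyOverrides += "com.google.protobuf" % "protobuf-java" % "2.5.0"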

Re: Spark worker threads waiting

2014-03-20 Thread sparrow
This is what the web UI looks like: [image: Inline image 1] I also tail all the worker logs, and these are the last entries before the waiting begins: 14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, minRequest: 10066329 14/03/20 13:29:10 INFO
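
For context on those log lines: 50331648 bytes is the 48 MB default of spark.reducer.maxMbInFlight, the cap on concurrent shuffle-fetch data. If the wait turns out to be fetch-bound, this is one knob to try; a hedged sketch with an illustrative value:

    // Assumption: the stall follows the BlockFetcherIterator lines, i.e. the
    // shuffle-fetch phase. Doubling the in-flight cap is illustrative only.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("ShuffleFetchSketch")
      .set("spark.reducer.maxMbInFlight", "96")
    val sc = new org.apache.spark.SparkContext(conf)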

Re: Accessing the reduce key

2014-03-20 Thread Mayur Rustagi
Why are you trying to reduceByKey? Are you looking to work on the data sequentially? If I understand correctly, you are looking to filter your data using the bloom filter; each bloom filter is tied to the key that instantiates it. Following are some of the options: *partition* your data by key
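
A minimal sketch of the "partition your data by key" option, with a toy single-hash stand-in for a real Bloom filter (the BloomFilter class here is hypothetical; use a proper library in practice):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Toy stand-in for a real Bloom filter implementation.
    class BloomFilter(bits: Int = 1 << 20) extends Serializable {
      private val set = new java.util.BitSet(bits)
      def add(v: String): Unit = set.set((v.hashCode & 0x7fffffff) % bits)
      def mightContain(v: String): Boolean = set.get((v.hashCode & 0x7fffffff) % bits)
    }

    // After partitionBy, all values for a key live in one partition, so each
    // filter is built within a single thread. Note this buffers a partition
    // in memory via toList.
    def buildFilters(pairs: RDD[(String, String)]): RDD[(String, BloomFilter)] =
      pairs
        .partitionBy(new HashPartitioner(pairs.partitions.length))
        .mapPartitions { iter =>
          iter.toList.groupBy(_._1).iterator.map { case (key, kvs) =>
            val bf = new BloomFilter()
            kvs.foreach { case (_, v) => bf.add(v) }
            (key, bf)
          }
        }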

Re: sort order after reduceByKey / groupByKey

2014-03-20 Thread Mayur Rustagi
That's expected. I think sortByKey is an option too, probably a better one. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Mar 20, 2014 at 3:20 PM, Ameet Kini ameetk...@gmail.com wrote: val rdd2 = rdd.partitionBy(my

Re: sort order after reduceByKey / groupByKey

2014-03-20 Thread Ameet Kini
I saw that, but I don't need a global sort, only an intra-partition sort. Ameet On Thu, Mar 20, 2014 at 3:26 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: That's expected. I think sortByKey is an option too, probably a better one. Mayur Rustagi Ph: +1 (760) 203 3257
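
A sketch of what an intra-partition sort can look like: order each partition's pairs locally, without a global shuffle (the function name and key type are assumptions):

    import org.apache.spark.rdd.RDD

    // Sort within each partition only, keeping the existing partitioning.
    def sortWithinPartitions[V](rdd: RDD[(Int, V)]): RDD[(Int, V)] =
      rdd.mapPartitions(
        iter => iter.toList.sortBy(_._1).iterator,
        preservesPartitioning = true)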

Re: in SF until Friday

2014-03-20 Thread Mayur Rustagi
Would love to... but I am in NY till Friday :( Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Mar 19, 2014 at 7:34 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I'm in San Francisco until Friday for a

Re: Accessing the reduce key

2014-03-20 Thread Surendranauth Hiraman
Mayur, To be a little clearer: for creating the Bloom Filters, I don't think broadcast variables are the way to go, though they would definitely work for using the Bloom Filters to filter data. The reason is that the creation needs to happen in a single thread. Otherwise, some type of
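
Continuing the Bloom filter sketch from earlier in this thread: once the filters are built (single-threaded per key, as above), a broadcast variable fits the filtering pass well. Here `pairs`, `data`, and `sc` are placeholder names carried over from that sketch:

    // Collect the per-key filters (assumed small enough) and broadcast them.
    val filters = buildFilters(pairs).collectAsMap()
    val bcast = sc.broadcast(filters)
    val kept = data.filter { case (k, v) =>
      bcast.value.get(k).exists(_.mightContain(v))
    }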

Re: Hadoop streaming like feature for Spark

2014-03-20 Thread Jaonary Rabarisoa
Thank you Ewen. RDD.pipe is what I need, and it works like a charm. On the other hand, RDD.mapPartitions seems interesting, but I can't figure out how to make it work. Jaonary On Thu, Mar 20, 2014 at 4:54 PM, Ewen Cheslack-Postava m...@ewencp.org wrote: Take a look at RDD.pipe(). You
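
A side-by-side sketch of the two approaches from this thread (local master and sample data are assumptions):

    val sc = new org.apache.spark.SparkContext(
      new org.apache.spark.SparkConf().setAppName("PipeSketch").setMaster("local[2]"))
    val nums = sc.parallelize(1 to 100)

    // RDD.pipe: each element's toString goes to the command's stdin, one line
    // per element; each stdout line becomes an element of the result.
    val piped = nums.pipe("grep 7")

    // RDD.mapPartitions: the same per-partition hook, but in-process; the
    // function maps an Iterator over one partition to an Iterator of results.
    val mapped = nums.mapPartitions(iter => iter.filter(_.toString.contains("7")))

    piped.collect().foreach(println)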

Re: Largest input data set observed for Spark.

2014-03-20 Thread Soila Pertet Kavulya
Hi Reynold, Nice! What Spark configuration parameters did you use to get your job to run successfully on a large dataset? My job is failing on 1TB of input data (uncompressed) on a 4-node cluster (64GB memory per node). No OutOfMemory errors, just lost executors. Thanks, Soila On Mar 20, 2014
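
For reference, the knobs most often tuned for inputs at this scale look like the sketch below; the values are illustrative assumptions, not settings from this thread:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("LargeInputSketch")
      .set("spark.executor.memory", "48g")         // leave OS/daemon headroom per node
      .set("spark.storage.memoryFraction", "0.4")  // trade cache space for working memory
      .set("spark.default.parallelism", "512")     // smaller tasks spill less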

Re: Machine Learning on streaming data

2014-03-20 Thread Jeremy Freeman
Thanks TD, happy to share my experience with MLLib + Spark Streaming integration. Here's a gist with two examples I have working, one for StreamingLinearRegression and another for StreamingKMeans. https://gist.github.com/freeman-lab/9672685 The goal in each case was to implement a streaming
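
The gist is the authoritative code; very roughly, the pattern it implements looks like the hedged sketch below (labeledStream is a hypothetical DStream[LabeledPoint], and the real examples update the model incrementally rather than naively refitting):

    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
    import org.apache.spark.streaming.dstream.DStream

    def trainOnStream(labeledStream: DStream[LabeledPoint]): Unit = {
      var model: LinearRegressionModel = null
      labeledStream.foreachRDD { rdd =>
        // Refit (or update) the model as each batch arrives.
        if (rdd.count() > 0) {
          model = LinearRegressionWithSGD.train(rdd, 10 /* iterations */)
        }
      }
    }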

Re: Accessing the reduce key

2014-03-20 Thread Surendranauth Hiraman
Grouped by the group_id but not sorted. -Suren On Thu, Mar 20, 2014 at 5:52 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: You are using the data grouped (sorted?) to create the bloom filter? On Mar 20, 2014 4:35 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Mayur, To be a

Re: Pyspark worker memory

2014-03-20 Thread Andrew Ash
Jim, I'm starting to document the heap size settings all in one place, which has been a source of confusion for a lot of my peers. Maybe you can take a look at this ticket? https://spark-project.atlassian.net/browse/SPARK-1264 On Wed, Mar 19, 2014 at 12:53 AM, Jim Blomo jim.bl...@gmail.com wrote: To

Re: Pyspark worker memory

2014-03-20 Thread Matei Zaharia
Yeah, this is definitely confusing. The motivation for this was that different users of the same cluster may want to set different memory sizes for their apps, so we decided to put this setting in the driver. However, if you put SPARK_JAVA_OPTS in spark-env.sh, it also applies to executors,
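
A sketch of the per-application approach described here (the thread is about PySpark; shown in Scala, with an illustrative memory value):

    // Set executor memory on the driver, per app, instead of cluster-wide
    // via SPARK_JAVA_OPTS in spark-env.sh.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("PerAppMemory")
      .set("spark.executor.memory", "4g")  // applies only to this app's executors
    val sc = new org.apache.spark.SparkContext(conf)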

Re: DStream spark paper

2014-03-20 Thread Matei Zaharia
Hi Adrian, On every timestep of execution, we receive new data, then report updated word counts for that new data plus the past 30 seconds. The latency here is about how quickly you get these updated counts once the new batch of data comes in. It’s true that the count reflects some data from
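
A minimal sketch of the sliding count being described, assuming a 1-second batch interval and a 30-second window (source, host, and port are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(
      new SparkConf().setAppName("WindowSketch").setMaster("local[2]"), Seconds(1))
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Every batch, emit counts over the new data plus the previous 30 seconds.
    val counts = words.map((_, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(1))
    counts.print()
    ssc.start()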

Spark job stuck

2014-03-20 Thread mohit.goyal
Hi, I have run a Spark application to process input data of size ~14GB with executor memory 10GB. The job got stuck with the message below: 14/03/21 05:02:07 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, guavus-0102bf, 49347, 0) with no recent heart beats: 85563ms
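
The "no recent heart beats" removal is governed by the block manager timeout; long GC pauses on a loaded executor can starve heartbeats. A hedged sketch of a stopgap while the underlying memory pressure is tuned (value illustrative):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("HeartbeatSketch")
      .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")  // 5 minutes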