Need something like "def groupByKeyWithRDD(partitioner: Partitioner): RDD[(K, RDD[V])] = ???"

2015-11-14 Thread chao chu
Hi, In our use case of using groupByKey(...): RDD[(K, Iterable[V])], there might be a case where, even for a single key (an extreme case, though), the associated Iterable[V] could result in OOM. Is it possible to provide the above 'groupByKeyWithRDD'? And, ideally, it would be great if the
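[Editor's note] Spark cannot return a literal RDD[(K, RDD[V])], because RDDs cannot be nested inside other RDDs. As a rough, hedged sketch of the closest workaround (not taken from the thread, and only practical when the number of distinct keys is small), one can collect the distinct keys on the driver and build one filtered RDD of values per key; all names below are illustrative:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Illustrative workaround only: instead of RDD[(K, RDD[V])] this returns a
    // driver-side Map from each key to an RDD of its values. Every entry costs a
    // full filter pass over the input, so it only makes sense for a few keys.
    def groupByKeyAsRDDs[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
      val keys = rdd.keys.distinct().collect()
      keys.map(k => k -> rdd.filter(_._1 == k).values).toMap
    }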

Re: Spark ClosureCleaner or java serializer OOM when trying to grow

2015-11-14 Thread rohangpatil
Confirming that I am also hitting the same errors. Host: r3.8xlarge. Configuration: spark.serializer=org.apache.spark.serializer.KryoSerializer, spark.driver.memory=200g, spark.serializer.objectStreamReset=10
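[Editor's note] For readers following along, a minimal sketch of the same settings expressed programmatically (the values are simply the ones quoted above; spark.driver.memory in particular only takes effect if supplied before the driver JVM starts, e.g. via spark-submit or spark-defaults.conf):

    import org.apache.spark.SparkConf

    // The configuration quoted in the message, expressed as a SparkConf.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.driver.memory", "200g")               // only effective if set before the driver JVM launches
      .set("spark.serializer.objectStreamReset", "10")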

Re: Kafka Offsets after application is restarted using Spark Streaming Checkpointing

2015-11-14 Thread kundan kumar
Hi Cody, Thanks for the clarification. I will try to come up with some workaround. I have another doubt. When my job is restarted and recovers from the checkpoint, it does the re-partitioning step twice for each 15-minute job until the window of 2 hours is complete. Then the re-partitioning
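[Editor's note] For context, a hedged sketch of the checkpoint-recovery pattern being discussed: the streaming context is rebuilt from the checkpoint directory with StreamingContext.getOrCreate, and the Kafka direct stream is created inside the setup function. The path, broker, topic and batch interval below are placeholders.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///checkpoints/my-streaming-app"   // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-direct-with-checkpointing")
      val ssc = new StreamingContext(conf, Seconds(900))          // 15-minute batches, as in the thread
      ssc.checkpoint(checkpointDir)
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // hypothetical broker
      val topics = Set("events")                                        // hypothetical topic
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)
      // windowing / repartitioning transformations would go here
      ssc
    }

    // On restart, the context (and the Kafka offsets stored with it) is recovered
    // from the checkpoint instead of being rebuilt from scratch.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()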

Re: spark 1.4 GC issue

2015-11-14 Thread Renu Yadav
I have tried with G1 GC. Could anyone please share their GC settings? At the code level I am: 1. reading an ORC table using a DataFrame, 2. mapping the df to an RDD of my case class, 3. converting that RDD to a paired RDD, 4. applying combineByKey, 5. saving the result to an ORC file. Please suggest. Regards, Renu Yadav On
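[Editor's note] There is no one-size-fits-all GC setting, but as a hedged illustration of how G1 options are typically passed to the executors (the flag values below are placeholders to tune, not recommendations from the thread):

    import org.apache.spark.SparkConf

    // Illustrative only: executor JVM options must be in place before the executors
    // launch, e.g. set here, in spark-defaults.conf, or on the spark-submit command line.
    val conf = new SparkConf()
      .setAppName("orc-combineByKey-job")               // hypothetical app name
      .set("spark.executor.memory", "8g")               // placeholder size
      .set("spark.executor.extraJavaOptions",
           "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:+PrintGCDateStamps")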

Re: Very slow startup for jobs containing millions of tasks

2015-11-14 Thread Ted Yu
Which release are you using? If older than 1.5.0, you are missing some fixes such as SPARK-9952. Cheers On Sat, Nov 14, 2015 at 6:35 PM, Jerry Lam wrote: > Hi spark users and developers, > > Has anyone experienced the slow startup of a job when it contains a stage > with over 4

Re: Very slow startup for jobs containing millions of tasks

2015-11-14 Thread Jerry Lam
Hi Ted, That looks like exactly what is happening. It has been 5 hrs now. The code was built for 1.4. Thank you very much! Best Regards, Jerry Sent from my iPhone > On 14 Nov, 2015, at 11:21 pm, Ted Yu wrote: > > Which release are you using ? > If older than 1.5.0, you miss

Calculating Timeseries Aggregation

2015-11-14 Thread Sandip Mehta
Hi, I am working on a requirement of calculating real-time metrics and building a prototype on Spark Streaming. I need to build aggregates at the second, minute, hour and day level. I am not sure whether I should calculate all these aggregates as different windowed functions on the input DStream or
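[Editor's note] One common way to express the multi-granularity part of this on a DStream is reduceByKeyAndWindow, sketched below under the assumption of a (metricKey, value) pair stream named `metrics` and a 1-second batch interval; day-level rollups are usually computed from the stored minute/hour aggregates rather than with a 24-hour window.

    import org.apache.spark.streaming.Minutes
    import org.apache.spark.streaming.dstream.DStream

    // Assumed input: a DStream of (metricKey, value) pairs with a 1-second batch interval.
    val metrics: DStream[(String, Double)] = ???   // hypothetical source stream

    val perSecond = metrics.reduceByKey(_ + _)     // one batch = one second
    val perMinute = metrics.reduceByKeyAndWindow((a: Double, b: Double) => a + b, Minutes(1), Minutes(1))
    val perHour   = metrics.reduceByKeyAndWindow((a: Double, b: Double) => a + b, Minutes(60), Minutes(60))
    // Hour (and especially day) windows hold a lot of state; persisting the minute
    // aggregates to a store and rolling them up there is often the safer design.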

Spark SQL: filter if column substring does not contain a string

2015-11-14 Thread YaoPau
I'm using pyspark 1.3.0, and struggling with what should be simple. Basically, I'd like to run this: site_logs.filter(lambda r: 'page_row' in r.request[:20]) meaning that I want to keep rows that have 'page_row' in the first 20 characters of the request column. The following is the closest
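[Editor's note] For reference, a hedged sketch of one way to express this with the DataFrame Column API (shown in Scala, whereas the thread uses pyspark; `siteLogs` is assumed to be a DataFrame with a string column `request`):

    import org.apache.spark.sql.functions.col

    // Keep rows where "page_row" occurs in the first 20 characters of the request
    // column. substr takes a 1-based start position and a length.
    val kept = siteLogs.filter(col("request").substr(1, 20).contains("page_row"))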

Re: send transformed RDD to s3 from slaves

2015-11-14 Thread Ajay
Hi Walrus, Try caching the results just before calling rdd.count. Regards, Ajay > On Nov 13, 2015, at 7:56 PM, Walrus theCat wrote: > > Hi, > > I have an RDD which crashes the driver when being collected. I want to send > the data on its partitions out to S3
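[Editor's note] A minimal sketch of that suggestion (the storage level chosen here is just an example):

    import org.apache.spark.storage.StorageLevel

    // Persist the transformed RDD before materializing it with count, so later
    // actions reuse the cached partitions instead of recomputing the lineage.
    transformed.persist(StorageLevel.MEMORY_AND_DISK)
    transformed.count()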

Re: very slow parquet file write

2015-11-14 Thread Sabarish Sasidharan
How are you writing it out? Can you post some code? Regards Sab On 14-Nov-2015 5:21 am, "Rok Roskar" wrote: > I'm not sure what you mean? I didn't do anything specifically to partition > the columns > On Nov 14, 2015 00:38, "Davies Liu" wrote: > >>
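[Editor's note] As a point of reference for the "can you post some code" question, a minimal hedged sketch of a typical Parquet write in Spark 1.4+ (path and partition count are placeholders):

    // Repartitioning controls how many output files are produced; too many tiny
    // tasks or too few huge ones can both make the write look slow.
    df.repartition(64)                        // placeholder partition count
      .write
      .parquet("hdfs:///tmp/output/mytable")  // placeholder path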

Re: Spark and Spring Integrations

2015-11-14 Thread Sabarish Sasidharan
You are probably trying to access the Spring context from the executors after initializing it at the driver, and running into serialization issues. You could instead use mapPartitions() and initialize the Spring context from within that. That said, I don't think that will solve all of your issues
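[Editor's note] A hedged sketch of that suggestion: build the Spring context inside mapPartitions so it is instantiated on each executor rather than serialized from the driver. `AppConfig` and `MyService` are hypothetical application classes, and clean shutdown of the context with a lazy iterator is left out of the sketch.

    import org.springframework.context.annotation.AnnotationConfigApplicationContext

    val results = rdd.mapPartitions { records =>
      // Built here, so the (non-serializable) Spring context never crosses the wire.
      val ctx = new AnnotationConfigApplicationContext(classOf[AppConfig])
      val service = ctx.getBean(classOf[MyService])
      records.map(record => service.process(record))
    }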

Re: Spark and Spring Integrations

2015-11-14 Thread Muthu Jayakumar
You could try to use the Akka actor system with Apache Spark, if you are intending to use it in an online / interactive job execution scenario. On Sat, Nov 14, 2015, 08:19 Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > You are probably trying to access the spring context from the

Re: send transformed RDD to s3 from slaves

2015-11-14 Thread Andrew Ehrlich
Maybe you want to be using rdd.saveAsTextFile() ? > On Nov 13, 2015, at 4:56 PM, Walrus theCat wrote: > > Hi, > > I have an RDD which crashes the driver when being collected. I want to send > the data on its partitions out to S3 without bringing it back to the driver.
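[Editor's note] A minimal sketch of that suggestion (bucket and path are placeholders; the right S3 filesystem scheme and credential setup depend on the Hadoop/S3 configuration in use):

    // Each executor writes its own partitions directly to S3; nothing is
    // collected back to the driver.
    transformed.saveAsTextFile("s3n://my-bucket/output/run-1")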

Spark job stuck with 0 input records

2015-11-14 Thread pratik khadloya
Hello, We are running Spark 1.4.1 on YARN. java.vendor=Oracle Corporation, java.runtime.version=1.7.0_40-b43; jars: datanucleus-core-3.2.10.jar, datanucleus-api-jdo-3.2.6.jar, datanucleus-rdbms-3.2.9.jar. Task table columns: Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration ▾ | GC Time | Input Size /