Re: GroupBy Key and then sort values with the group

2014-10-09 Thread Chinchu Sup
wrote: There is a new API called repartitionAndSortWithinPartitions() in master; it may help in this case, but then you need to do the `groupBy()` yourself. On Wed, Oct 8, 2014 at 4:03 PM, chinchu chinchu@gmail.com wrote: Sean, I am having a similar issue, but I have a lot of data
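A rough sketch of the suggested approach (an illustration, not code from the thread): key the RDD by a (group, time) pair, but partition by group only, so `repartitionAndSortWithinPartitions()` places each group in one partition sorted by time without ever materializing a group in memory. The partitioner, key shape, and `rdd` name are all assumptions.

```scala
import org.apache.spark.Partitioner

// Partition by the group half of the key only, so every record of a group
// lands in the same partition; the implicit Ordering on (String, Long)
// then sorts by group first and time second within each partition.
class GroupPartitioner(parts: Int) extends Partitioner {
  def numPartitions: Int = parts
  def getPartition(key: Any): Int = key match {
    case (group: String, _) => math.abs(group.hashCode) % parts
  }
}

// rdd: RDD[((String, Long), V)] keyed by (group, time) -- illustrative name
val sorted = rdd.repartitionAndSortWithinPartitions(new GroupPartitioner(8))
// Within a partition, consecutive records with the same group form one
// time-sorted group that can be streamed with mapPartitions.
```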

Re: GroupBy Key and then sort values with the group

2014-10-08 Thread chinchu
Sean, I am having a similar issue, but I have a lot of data for a group and cannot materialize the iterable into a List or Seq in memory [I tried; it runs into an OOM]. Is there any other way to do this? I also tried a secondary sort, with the key being group::time, but the problem with that
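The composite group::time key mentioned above amounts to a secondary-sort key: order by group first, then time. A plain-Scala stand-in (the case class and field names are illustrative; in Spark this ordering would drive a sort-based shuffle):

```scala
// Secondary-sort key: compare by group first, then by time within a group.
case class SecondaryKey(group: String, time: Long)

object SecondaryKey {
  implicit val ordering: Ordering[SecondaryKey] =
    Ordering.by((k: SecondaryKey) => (k.group, k.time))
}

val keys = Seq(SecondaryKey("b", 2L), SecondaryKey("a", 9L), SecondaryKey("a", 3L))
val sorted = keys.sorted
// sorted: a@3, a@9, b@2 -- grouped by key, time-ordered within the group
```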

spark fold question

2014-10-07 Thread chinchu
Hi, I am using fold(zeroValue)(t1, t2) on the RDD. I noticed that it runs in parallel on all the partitions and then aggregates the results from the partitions. My data object is not aggregate-able, and I was wondering if there's any way to run the fold sequentially. [I am looking to do a foldLeft
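Spark's `fold` assumes an associative combine because each partition is folded in parallel and the partial results merged. For an order-sensitive foldLeft, one option (a sketch, not from the thread) is `rdd.toLocalIterator.foldLeft(zero)(op)`, which streams partitions to the driver one at a time at the cost of all parallelism. A plain-Scala stand-in showing why order matters:

```scala
// Plain-Scala stand-in for an RDD's partitions (illustrative data).
val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4), Seq(5))

// Subtraction is not associative, so a parallel fold could merge partition
// results in any order and give different answers. A sequential foldLeft
// over a single iterator (cf. rdd.toLocalIterator) is deterministic.
val sequential = partitions.iterator.flatten.foldLeft(100)(_ - _)
// 100 - 1 - 2 - 3 - 4 - 5 = 85
```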

Re: spark-submit command-line with --files

2014-09-20 Thread chinchu
Thanks Andrew, that helps. On Fri, Sep 19, 2014 at 5:47 PM, Andrew Or-2 [via Apache Spark User List] ml-node+s1001560n14708...@n3.nabble.com wrote: Hey, just a minor clarification: you _can_ use SparkFiles.get in your application, but only if it runs on the executors, e.g. in the following way:
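A sketch of that clarification (the file name and `rdd` are illustrative, not from the thread): `SparkFiles.get` resolves the local copy of a `--files` file on whichever JVM calls it, so it must be called from code that runs on the executors, such as inside a `map`.

```scala
import org.apache.spark.SparkFiles

// Runs on the executors: each task resolves the executor-local copy of the
// file shipped via spark-submit --files. "lookup.ser" is illustrative.
val results = rdd.map { record =>
  val localPath = SparkFiles.get("lookup.ser")
  // ... deserialize and use the file here ...
  (record, localPath)
}
```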

Re: spark-submit command-line with --files

2014-09-20 Thread chinchu
Thanks Andrew. I understand the problem a little better now. There was a typo in my earlier mail and a bug in the code (causing the NPE in SparkFiles). I am using --master yarn-cluster (not local), and in this mode com.test.batch.modeltrainer.ModelTrainerMain - my main class - will run on the

Re: spark-submit command-line with --files

2014-09-20 Thread chinchu
Thanks Marcelo. The code trying to read the file always runs in the driver. I understand the problem with other master deployments, but will it work in local, yarn-client and yarn-cluster deployments? That's all I care about for now :-) Also, what is the suggested way to do something like this? Put the

spark-submit command-line with --files

2014-09-18 Thread chinchu
Hi, I am running spark-1.1.0 and I want to pass a file (that contains Java-serialized objects used to initialize my program) to the app's main program. I am using the --files option but I am not able to retrieve the file in the main_class; it reports a null pointer exception. [I tried both local
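For reference, the shape of the invocation being discussed (the main class is the one named later in the thread; the file and jar paths are illustrative):

```shell
spark-submit \
  --master yarn-cluster \
  --files /local/path/init-objects.ser \
  --class com.test.batch.modeltrainer.ModelTrainerMain \
  model-trainer.jar
```

Note the distinction drawn in the replies above: files shipped with --files are distributed to the executors, so driver-side code cannot in general retrieve them with SparkFiles.get.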