Re: spark git commit: [HOTFIX] Fix the problem for real this time.

2016-04-25 Thread Jacek Laskowski
[INFO] BUILD SUCCESS
[INFO] Total time: 10:53 min
[INFO] Finished at: 2016-04-26T06:56:49+02:00
[INFO] Final Memory: 107M/890M

Re: spark git commit: [HOTFIX] Fix compilation

2016-04-25 Thread Jacek Laskowski
Thanks Reynold! I was going to ask about that one as it breaks the build for me. [info] Compiling 1 Scala source to /Users/jacek/dev/oss/spark/sql/hivecontext-compatibility/target/scala-2.11/classes... [error]

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-25 Thread Praveen Devarao
Cool!! Thanks for the clarification Mike. Thanking You - Praveen Devarao Spark Technology Centre IBM India Software Labs - "Courage

[build system] short downtime wednesday morning (4-27-16), 7-9am

2016-04-25 Thread shane knapp
another project hosted on our jenkins (e-mission) needs anaconda scipy upgraded from 0.15.1 to 0.17.0. this will also upgrade a few other libs, which i've included at the end of this email. i've spoken w/josh @ databricks and we don't believe that this will impact the spark builds at all. if

Re: net.razorvine.pickle.PickleException in Pyspark

2016-04-25 Thread Joseph Bradley
Thanks for your work on this. Can we continue discussing on the JIRA?
On Sun, Apr 24, 2016 at 9:39 AM, Caique Marques wrote:
> Hello, everyone!
>
> I'm trying to implement the association rules in Python. I got implement
> an association by a frequent element, works

Re: Cache Shuffle Based Operation Before Sort

2016-04-25 Thread Ted Yu
Interesting.
bq. details of execution for 10 and 100 scale factor input
Looks like some chart (or image) didn't go through. FYI
On Mon, Apr 25, 2016 at 12:50 PM, Ali Tootoonchian wrote:
> Caching shuffle RDD before the sort process improves system performance. SQL
> planner

Cache Shuffle Based Operation Before Sort

2016-04-25 Thread Ali Tootoonchian
Caching the shuffle RDD before the sort process improves system performance. The SQL planner could be intelligent enough to cache a join, aggregate, or sort DataFrame before executing the next sort process. For any sort process, two jobs are created by Spark; the first one is responsible for producing the range boundary for
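The two-pass shape described in this message can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the function names `sample_range_boundaries` and `range_partition_and_sort` are hypothetical, and the sampling strategy is a simplification rather than Spark's actual implementation. The point is that the input is read twice (once to pick boundaries, once to sort), which is why caching the upstream data may help.

```python
import bisect
import random


def sample_range_boundaries(keys, num_partitions, sample_size=100, seed=0):
    """Pass 1 (hypothetical helper): sample keys and pick split points."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if len(sample) < num_partitions:
        return []  # too few keys to split; everything goes to one range
    # Evenly spaced elements of the sorted sample become the upper bounds
    # of the first num_partitions - 1 ranges.
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def range_partition_and_sort(keys, boundaries):
    """Pass 2 (hypothetical helper): route keys to ranges, sort each range."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        # bisect_left finds the first boundary >= k, i.e. the range for k.
        partitions[bisect.bisect_left(boundaries, k)].append(k)
    return [sorted(p) for p in partitions]


rng = random.Random(42)
keys = [rng.randrange(1000) for _ in range(500)]

boundaries = sample_range_boundaries(keys, num_partitions=4)  # first pass
parts = range_partition_and_sort(keys, boundaries)            # second pass

# Concatenating the locally sorted ranges gives a globally sorted result.
flattened = [k for p in parts for k in p]
```

Both passes iterate over `keys`. If `keys` were the output of an expensive shuffle, the first pass would recompute it unless it had been cached, which is the cost the message is pointing at.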

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-25 Thread Michael Armbrust
Spark SQL's query planner has always delayed building the RDD, so it has never needed to eagerly calculate the range boundaries (since Spark 1.0).
On Mon, Apr 25, 2016 at 2:04 AM, Praveen Devarao wrote:
> Thanks Reynold for the reason as to why sortByKey invokes a Job
>

Re: Question about storage memory in unified memory manager

2016-04-25 Thread Patrick Woody
Hey all,
Just wondering if anyone has had issues with this or if it is expected that the semantics around the memory management are different here.
Thanks
-Pat
On Tue, Apr 19, 2016 at 9:32 AM, Patrick Woody wrote:
> Hey all,
>
> I had a question about the MemoryStore

Re: Spark streaming Kafka receiver WriteAheadLog question

2016-04-25 Thread Cody Koeninger
If you want to refer back to Kafka based on offset ranges, why not use createDirectStream?
On Fri, Apr 22, 2016 at 11:49 PM, Renyi Xiong wrote:
> Hi,
>
> Is it possible for Kafka receiver generated WriteAheadLogBackedBlockRDD to
> hold corresponded Kafka offset range so

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-25 Thread Praveen Devarao
Thanks Reynold for the reason as to why sortByKey invokes a Job. When you say "DataFrame/Dataset does not have this issue", is it right to assume you are referring to Spark 2.0, or does the Spark 1.6 DF already have this built in? Thanking You