Re: Should spark-ec2 get its own repo?

2015-07-31 Thread Patrick Wendell
Hey All, I've mostly kept quiet since I am no longer very active in maintaining this code. However, it is a bit odd that the project is split-brained, with a lot of the code on GitHub and some in the Spark repo. If the consensus is to migrate everything to GitHub, that seems okay to me.

Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-07-31 Thread Josh Rosen
It would also be great to test this with codegen and unsafe enabled while continuing to use the sort shuffle manager instead of the new tungsten-sort one. On Fri, Jul 31, 2015 at 1:39 AM, Reynold Xin r...@databricks.com wrote: Is this deterministically reproducible? Can you try this on the …
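For reference, the combination Josh suggests corresponds to keeping the latter two flags from the original report while switching the shuffle manager back. A sketch, assuming the same property names used in james's settings below:

```
# Keep codegen and unsafe on, but use the plain sort shuffle manager
# rather than the new tungsten-sort one.
spark.shuffle.manager=sort
spark.sql.codegen=true
spark.sql.unsafe.enabled=true
```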

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-31 Thread Justin Uang
Sweet! It's here: https://issues.apache.org/jira/browse/SPARK-9141?focusedCommentId=14649437&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14649437 On Tue, Jul 28, 2015 at 11:21 PM, Michael Armbrust mich...@databricks.com wrote: Can you add your description of the …

Re: FrequentItems in spark-sql-execution-stat

2015-07-31 Thread Koert Kuipers
This looks like a mistake in FrequentItems to me. If the map is full (map.size == size), then it should still add the new item (after removing items from the map and decrementing counts). If it's not a mistake, then at least it looks to me like the algorithm differs from the one described in the paper. Is …
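For context, here is a minimal pure-Scala sketch of the Misra-Gries-style update step under discussion, implementing the behavior Koert describes: when the map is full, every count is decremented, zero-count entries are evicted, and the new item is then inserted if room opened up. The class and method names are illustrative only, not Spark's actual FrequentItems internals:

```scala
import scala.collection.mutable

// Illustrative sketch of a Misra-Gries-style frequent-items summary.
class FreqSketch[T](size: Int) {
  val counts = mutable.Map.empty[T, Long]

  def add(item: T): Unit = {
    if (counts.contains(item)) {
      counts(item) += 1
    } else if (counts.size < size) {
      counts(item) = 1
    } else {
      // Map is full: decrement every counter, dropping those that hit zero.
      counts.keys.toSeq.foreach { k =>
        counts(k) -= 1
        if (counts(k) == 0) counts.remove(k)
      }
      // The step Koert points out: after decrementing, insert the new item
      // if the evictions opened up space for it.
      if (counts.size < size) counts(item) = 1
    }
  }
}
```

The classic Misra-Gries formulation absorbs the new item's weight in the decrement itself; whether the item is then re-inserted is exactly the variant difference being asked about here.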

Re: Spark CBO

2015-07-31 Thread Olivier Girardot
Hi, there is one cost-based optimization implemented in Spark SQL, if I'm not mistaken, regarding join operations: if the join is done with a small enough dataset, Spark SQL's strategy is to automatically broadcast the small dataset instead of shuffling. I guess you have something …
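The automatic-broadcast behavior Olivier describes is driven by a size threshold, `spark.sql.autoBroadcastJoinThreshold`. A sketch of the setting (the value shown is the usual 10 MB default, in bytes):

```
# Tables smaller than this many bytes are broadcast to all worker nodes
# for joins instead of being shuffled; -1 disables automatic broadcast.
spark.sql.autoBroadcastJoinThreshold=10485760
```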

Re: Should spark-ec2 get its own repo?

2015-07-31 Thread Shivaram Venkataraman
Yes - it is still in progress, but I have just not gotten time to get to it. I think getting the repo moved from mesos to amplab, and the references in the codebase updated, by 1.5 should be possible. Thanks Shivaram On Fri, Jul 31, 2015 at 3:08 AM, Sean Owen so...@cloudera.com wrote: PS is this still in progress? it …

Spark CBO

2015-07-31 Thread burakkk
Hi everyone, I'm wondering whether there is any plan to implement a cost-based optimizer for Spark SQL? Best regards... -- *BURAK ISIKLI* | http://burakisikli.wordpress.com

Came across Spark SQL hang issue with Spark 1.5 Tungsten feature

2015-07-31 Thread james
I tried to enable Tungsten with Spark SQL by setting the three parameters below, but I found that Spark SQL always hangs at the point below. Could you please point me to the potential cause? I'd appreciate any input.
spark.shuffle.manager=tungsten-sort
spark.sql.codegen=true
spark.sql.unsafe.enabled=true

New Feature Request

2015-07-31 Thread Sandeep Giri
Dear Spark Dev Community, I am wondering whether there is already a function that solves my problem; if not, should I work on this? Say you just want to check whether a word exists in a huge text file. I could not find better ways than those mentioned here …

Re: New Feature Request

2015-07-31 Thread Carsten Schnober
Hi, the RDD class does not have an exists() method (in the Scala API), but the functionality you need seems easy to assemble from the existing methods: val containsNMatchingElements = data.filter(qualifying_function).take(n).length >= n Note: I am not sure whether the intermediate take(n) really …
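A minimal local-collection sketch of the same pattern (the name existsAtLeast is mine, not part of any API); on an RDD the equivalent shape would be data.filter(f).take(n).length >= n:

```scala
// Sketch: does the data contain at least n elements satisfying f?
// Using an iterator means filtering stops once n matches are found,
// mirroring the early-exit that take(n) provides on an RDD.
def existsAtLeast[T](data: Seq[T])(f: T => Boolean, n: Int): Boolean =
  data.iterator.filter(f).take(n).length >= n
```

For the single-word-existence case in the original question, n would simply be 1.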

Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-07-31 Thread james
Another error:
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode1:40443
15/07/31 16:15:28 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 3 is 583 bytes
15/07/31 16:15:28 INFO …

Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-07-31 Thread Reynold Xin
Is this deterministically reproducible? Can you try this on the latest master branch? It would be great to turn on debug logging and dump the generated code. It would also be great to dump the array size at your line 314 in UnsafeRow (and whatever the appropriate line is on the master branch). On Fri, Jul 31, …

Re: New Feature Request

2015-07-31 Thread Jonathan Winandy
Hello! You could try something like this: def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean = { rdd.filter(f).countApprox(timeout = 1).getFinalValue().low >= n } It would work for large datasets and large values of n. Have a nice day, Jonathan On 31 July 2015 at 11:29, Carsten …

Re: Should spark-ec2 get its own repo?

2015-07-31 Thread Sean Owen
PS is this still in progress? It feels like something that would be good to do before 1.5.0, if it's going to happen soon. On Wed, Jul 22, 2015 at 6:59 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Yeah I'll send a note to the mesos dev list just to make sure they are informed.