Problems with spark.locality.wait

2014-11-13 Thread MaChong
Hi, We are running a time-sensitive application with 70 partitions of about 800MB each. The application first loads data from a database in a different cluster, then applies a filter, caches the filtered data, then applies a map and a reduce, and finally collects results. The application will be
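
A minimal sketch of the pipeline described above; the stand-in data source, the filter predicate, and the map/reduce functions are hypothetical placeholders, not from the original post:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("TimeSensitiveJob"))
    val rows: Seq[String] = Seq()                  // stand-in for the database load
    val raw = sc.parallelize(rows, 70)             // 70 partitions, ~800MB each
    val filtered = raw.filter(_.nonEmpty).cache()  // cache the filtered data
    val total = filtered.map(_.length)             // apply a map ...
                        .reduce(_ + _)             // ... then a reduce; the result returns to the driver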

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Sean Owen
LICENSE and NOTICE are fine. Signature and checksum are fine. I unzipped and built the plain source distribution, which built. However I am seeing a consistent test failure with mvn -DskipTests clean package; mvn test. In the Hive module: - SET commands semantics for a HiveContext *** FAILED ***

Join operator in PySpark

2014-11-13 Thread 夏俊鸾
Hi all, I have noticed that the “Join” operator is implemented with the union and groupByKey operators instead of the cogroup operator in PySpark; this change will probably generate more shuffle stages, for example rdd1 = sc.makeRDD(...).partitionBy(2) rdd2 = sc.makeRDD(...).partitionBy(2)
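
For reference, a sketch of the cogroup-based join in the Scala API (assuming an existing SparkContext `sc`; the data values are illustrative): two RDDs that already share a partitioner are joined with no extra shuffle.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._  // pair-RDD functions (pre-1.3 style)

    val part = new HashPartitioner(2)
    val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part)
    val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part)

    // join expressed through cogroup: both inputs already share a partitioner,
    // so cogroup reuses the partitioning instead of shuffling again.
    val joined = rdd1.cogroup(rdd2).flatMapValues { case (vs, ws) =>
      for (v <- vs; w <- ws) yield (v, w)
    }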

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Sadhan Sood
Thanks Cheng, Just one more question - does that mean that we still need enough memory in the cluster to uncompress the data before it can be compressed again, or does that just read the raw data as is? On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian lian.cs@gmail.com wrote: Currently there’s

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Marcelo Vanzin
Hello there, So I just took a quick look at the pom and I see two problems with it. - activeByDefault does not work like you think it does. It only activates the profile by default if you do not explicitly activate other profiles. So if you do mvn package, scala-2.10 will be activated; but if you do mvn
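
A minimal sketch of the two activation styles being contrasted (illustrative profile layout, not Spark's actual pom):

    <profiles>
      <profile>
        <id>scala-2.10</id>
        <activation>
          <!-- Only active when NO other profile is explicitly activated. -->
          <activeByDefault>true</activeByDefault>
        </activation>
      </profile>
      <profile>
        <id>scala-2.11</id>
        <activation>
          <!-- Active whenever -Dscala-2.11 is set on the command line. -->
          <property><name>scala-2.11</name></property>
        </activation>
      </profile>
    </profiles>

With this layout, plain mvn package activates scala-2.10, but activating any other profile on the command line silently drops it, which is the surprise being pointed out.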

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Ashutosh
Hi Anant, Please see the changes. https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala I have changed the input format to Vector of String. I think we can also make it generic. Lines 59-72: that counter will not affect parallelism, since it

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Marcelo Vanzin
Hey Patrick, On Thu, Nov 13, 2014 at 10:49 AM, Patrick Wendell pwend...@gmail.com wrote: I'm not sure chaining activation works like that. At least in my experience, activation based on properties only works for properties explicitly specified at the command line rather than declared elsewhere

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Marcelo Vanzin
On Thu, Nov 13, 2014 at 10:58 AM, Patrick Wendell pwend...@gmail.com wrote: That's true, but note the code I posted activates a profile based on the lack of a property being set, which is why it works. Granted, I did not test that if you activate the other profile, the one with the property

Re: Problems with spark.locality.wait

2014-11-13 Thread Kay Ousterhout
Hi, Shivaram and I stumbled across this problem a few weeks ago, and AFAIK there is no nice solution. We worked around it by avoiding jobs with tasks that have two locality levels. To fix this problem, we really need to fix the underlying problem in the scheduling code, which

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Michael Armbrust
Hey Sean, Thanks for pointing this out. Looks like a bad test where we should be doing Set comparison instead of Array. Michael On Thu, Nov 13, 2014 at 2:05 AM, Sean Owen so...@cloudera.com wrote: LICENSE and NOTICE are fine. Signature and checksum is fine. I unzipped and built the plain
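
A minimal illustration of the failure mode (not the actual Hive test): an order-sensitive comparison breaks when the JVM returns the same elements in a different order, while a Set comparison does not.

    val expected = Seq("a=1", "b=2")
    val actual   = Seq("b=2", "a=1")       // same contents, order differs under Java 8
    // assert(actual == expected)          // brittle: would fail on reordering
    assert(actual.toSet == expected.toSet) // order-insensitive: passes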

Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Reynold Xin
Do people usually import o.a.spark.rdd._? Also, in order to maintain source and binary compatibility, we would need to keep both, right? On Thu, Nov 6, 2014 at 3:12 AM, Shixiong Zhu zsxw...@gmail.com wrote: I saw many people ask how to convert an RDD to a PairRDDFunctions. I would like to

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Andrew Or
Yeah, this seems to be somewhat environment specific too. The same test has been passing here for a while: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.1-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/lastBuild/consoleFull 2014-11-13 11:26 GMT-08:00 Michael Armbrust

Re: Join operator in PySpark

2014-11-13 Thread Josh Rosen
We should implement this using cogroup(); it will just require some tracking to map Python partitioners into dummy Java ones so that Java Spark’s cogroup() operator respects Python’s partitioning.  I’m sure that there are some other subtleties, particularly if we mix datasets that use different

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Sean Owen
Ah right. This is because I'm running Java 8. This was fixed in SPARK-3329 (https://github.com/apache/spark/commit/2b7ab814f9bde65ebc57ebd04386e56c97f06f4a#diff-7bfd8d7c8cbb02aa0023e4c3497ee832). Consider back-porting it if other reasons arise, but this is specific to tests and to Java 8. On

Re: Problems with spark.locality.wait

2014-11-13 Thread Kay Ousterhout
Hi Mridul, In the case Shivaram and I saw, and based on my understanding of MaChong's description, I don't think that completely fixes the problem. To be very concrete, suppose your job has two tasks, t1 and t2, and they each have input data (in HDFS) on h1 and h2, respectively, and that h1 and

TimSort in 1.2

2014-11-13 Thread Debasish Das
Hi, I am noticing that the first step for Spark jobs does a TimSort in the 1.2 branch... and there is some time spent doing the TimSort... Is this assigning the RDD blocks to different nodes based on a sort order? Could someone please point to a JIRA about this change so that I can read more about it?
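
The thread doesn't confirm the cause, but one plausible suspect is the sort-based shuffle that became the default in 1.2. As a hedged experiment (this attribution is an assumption), the shuffle manager can be switched back to compare:

    import org.apache.spark.SparkConf

    // Assumption: the observed TimSort comes from the sort-based shuffle.
    // Reverting to the pre-1.2 hash-based shuffle isolates that variable.
    val conf = new SparkConf()
      .set("spark.shuffle.manager", "hash")  // the 1.2 default is "sort"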

Re: Problems with spark.locality.wait

2014-11-13 Thread Nathan Kronenfeld
This sounds like it may be exactly the problem we've been having (and about which I recently posted on the user list). Is there any way of monitoring its attempts to wait, giving up, and trying another level? In general, I'm trying to figure out why we can have repeated identical jobs, the
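
The thread doesn't name a monitoring hook beyond the per-task "Locality Level" column in the web UI, but the per-level delay-scheduling waits are the usual knobs to experiment with while investigating (values are in milliseconds in this era; the numbers below are illustrative, not recommendations):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.locality.wait", "3000")       // default wait per locality level
      .set("spark.locality.wait.process", "0")  // give up on PROCESS_LOCAL immediately
      .set("spark.locality.wait.node", "3000")  // wait this long for NODE_LOCAL
      .set("spark.locality.wait.rack", "1000")  // then this long for RACK_LOCAL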

Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Shixiong Zhu
If we put the `implicit` into `package object rdd` or `object RDD`, then when we write `rdd.groupByKey()`, because rdd is an instance of RDD, the Scala compiler will search `object RDD` (the companion object) and `package object rdd` (the package object) by default. We don't need to import them explicitly. Here is a post
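
A self-contained, REPL-style sketch of the mechanism with toy classes (not Spark's actual types): an implicit conversion placed in a type's companion object is found by implicit search without any import.

    class MyRDD[T](val data: Seq[T])

    object MyRDD {
      // Implicits in the companion object of a type involved in the lookup
      // are in scope automatically, so no `import MyRDD._` is needed.
      implicit def toPairFunctions[K, V](rdd: MyRDD[(K, V)]): PairFunctions[K, V] =
        new PairFunctions(rdd)
    }

    class PairFunctions[K, V](rdd: MyRDD[(K, V)]) {
      def groupByKey(): Map[K, Seq[V]] =
        rdd.data.groupBy(_._1).mapValues(_.map(_._2)).toMap
    }

    val pairs = new MyRDD(Seq(("a", 1), ("a", 2), ("b", 3)))
    pairs.groupByKey()  // compiles without importing the implicit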

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
No, the columnar buffer is built in a small batching manner; the batch size is controlled by the `spark.sql.inMemoryColumnarStorage.batchSize` property. The default value for this in master and branch-1.2 is 10,000 rows per batch. On 11/14/14 1:27 AM, Sadhan Sood wrote: Thanks Cheng, Just
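
In code, for the 1.1/1.2-era API, that looks roughly like this (assuming an existing SparkContext `sc`; the table name is illustrative):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Rows are compressed into columnar buffers this many rows at a time,
    // so only one batch needs to sit uncompressed in memory at once.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
    sqlContext.cacheTable("logs")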

Re: Re: Problems with spark.locality.wait

2014-11-13 Thread MaChong
In the specific example stated, the user had two tasksets, if I understood right ... the first taskset reads off the db (dfs in your example), does some filtering, etc., and caches the result. The second works off the cached data (which is now aware of the process-local locality level) to do map, group, etc. The

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Meethu Mathew
Hi Ashutosh, Please edit the README file. I think the following function call has changed now. model = OutlierWithAVFModel.outliers(master: String, input dir: String, percentage: Double) Regards, Meethu Mathew, Engineer, Flytxt On

Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Reynold Xin
That seems like a great idea. Can you submit a pull request? On Thu, Nov 13, 2014 at 7:13 PM, Shixiong Zhu zsxw...@gmail.com wrote: If we put the `implicit` into `package object rdd` or `object RDD`, then when we write `rdd.groupByKey()`, because rdd is an instance of RDD, the Scala compiler will search

Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Shixiong Zhu
OK. I'll take it. Best Regards, Shixiong Zhu 2014-11-14 12:34 GMT+08:00 Reynold Xin r...@databricks.com: That seems like a great idea. Can you submit a pull request? On Thu, Nov 13, 2014 at 7:13 PM, Shixiong Zhu zsxw...@gmail.com wrote: If we put the `implicit` into `package object rdd` or

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Meethu Mathew
Hi, I have a doubt regarding the input to your algorithm. val model = OutlierWithAVFModel.outliers(data: RDD[Vector[String]], percent: Double, sc: SparkContext) Here our input data is an RDD[Vector[String]]. How can we create this RDD

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Ashutosh
Please use the following snippet. I am still working on making it take a generic vector, so that the input does not have to be Vector[String] always. But String will work fine for now. def main(args: Array[String]) { val sc = new SparkContext("local", "OutlierDetection") val dir = "hdfs://localhost:54310/train3
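
A hedged completion of that snippet, using the signature quoted earlier in the thread; the comma delimiter and the 10 percent threshold are assumptions, not from the original message:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local", "OutlierDetection")
    val dir = "hdfs://localhost:54310/train3"
    // One Vector[String] per input line, split on commas (assumed format).
    val data: RDD[Vector[String]] = sc.textFile(dir).map(_.split(",").toVector)
    val model = OutlierWithAVFModel.outliers(data, 10.0, sc)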