Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-11 Thread Imran Rashid
That is not exactly correct -- that being said, I'm not 100% sure on these details either, so I'd appreciate you double-checking and/or another dev confirming my description. Spark actually has more threads going than the numCores you specify. numCores is really used for how many threads are
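For readers following along, a minimal sketch of where this "numCores" value typically comes from, assuming the usual spark.executor.cores setting (the value below is just a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: the per-executor core count caps how many task threads run
// concurrently; other internal threads (shuffle, RPC, etc.) exist on top of it.
val conf = new SparkConf()
  .setAppName("num-cores-example")
  .set("spark.executor.cores", "4") // placeholder: up to 4 concurrent tasks per executor
val sc = new SparkContext(conf)
```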

Re: [ml] Why all model classes are final?

2015-06-11 Thread Erik Erlandson
I was able to work around this problem in several cases by using the class 'enhancement' or 'extension' pattern to add some functionality to the decision tree model data structures. - Original Message - Hi, previously all the models in the ml package were private to the package, so if I need to
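A minimal sketch of the 'enhancement'/'extension' pattern Erik describes, assuming MLlib's DecisionTreeModel and its numNodes/depth members; the added method is purely illustrative:

```scala
import org.apache.spark.mllib.tree.model.DecisionTreeModel

object DecisionTreeModelOps {
  // An implicit wrapper adds behavior to the model class without subclassing it,
  // which also works when the model class is final.
  implicit class RichDecisionTreeModel(val model: DecisionTreeModel) extends AnyVal {
    def describe: String =
      s"DecisionTreeModel with ${model.numNodes} nodes, depth ${model.depth}"
  }
}

// Usage: import DecisionTreeModelOps._, then call model.describe on any DecisionTreeModel.
```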

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-11 Thread Kay Ousterhout
Here’s how the shuffle works. This explains what happens for a single task; this will happen in parallel for each task running on the machine, and as Imran said, Spark runs up to “numCores” tasks concurrently on each machine. There's also an answer to the original question about why CPU use is

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-11 Thread Gerard Maas
Kay, Excellent write-up. This should be preserved for reference somewhere searchable. -Gerard. On Fri, Jun 12, 2015 at 1:19 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote: Here’s how the shuffle works. This explains what happens for a single task; this will happen in parallel for each

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Amit Ramesh
Hi Jerry, Take a look at this example: https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2 The offsets are needed because as RDDs get generated within Spark, the offsets move further along. With direct Kafka mode the current offsets are no longer persisted in Zookeeper
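For reference, this is roughly what the linked Scala example does; the broker, topic, and batch interval below are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

val ssc = new StreamingContext(new SparkConf().setAppName("offsets-example"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic")) // placeholder topic

var offsetRanges = Array.empty[OffsetRange]
stream.transform { rdd =>
  // In direct mode the offsets travel with each KafkaRDD (no ZooKeeper involved),
  // so they are read back from the RDD itself.
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { _ =>
  offsetRanges.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
}

ssc.start()
ssc.awaitTermination()
```

The question in this thread is whether an equivalent of HasOffsetRanges is exposed through the Python API.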

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
OK, I get it. I think the Python-based Kafka direct API currently does not provide an equivalent to the Scala one; maybe we should figure out how to add this to the Python API as well. 2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com: Hi Jerry, Take a look at this example:

Re: Contributing to pyspark

2015-06-11 Thread Manoj Kumar
Hi, Thanks for your interest in PySpark. The first thing is to have a look at the how to contribute guide https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and filter the JIRAs using the PySpark label. If you have your own improvement in mind, you can file a JIRA,

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
Hi, What do you mean by getting the offsets from the RDD? From my understanding, the offsetRange is a parameter you provide to KafkaRDD, so why do you still want to get back the one you previously set? Thanks Jerry 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com: Congratulations on the

Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-11 Thread Cheng Lian
Oh sorry, I mistook --jars for --files. Yeah, for jars we need to add them to the classpath, which is different from regular files. Cheng On 6/11/15 2:18 PM, Dong Lei wrote: Thanks Cheng, If I do not use --jars how can I tell Spark to search for the jars (and files) on HDFS? Do you mean the

RE: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-11 Thread Dong Lei
I think in standalone cluster mode, Spark is supposed to: 1. download the jars and files to the driver, 2. set the driver's classpath, 3. have the driver set up an HTTP file server to distribute these files, and 4. have the workers download from the driver and set up their classpaths. Right? But somehow, the first

RE: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-11 Thread Dong Lei
Thanks Cheng, If I do not use --jars how can I tell Spark to search for the jars (and files) on HDFS? Do you mean the driver will not need to set up an HTTP file server for this scenario, and the workers will fetch the jars and files from HDFS? Thanks Dong Lei From: Cheng Lian
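One alternative that is sometimes suggested (and which sidesteps --jars) is to add the HDFS paths programmatically from the driver; a hedged sketch with placeholder paths, noting that whether standalone cluster mode resolves hdfs:// dependencies for the driver itself is exactly the open question in this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hdfs-deps-example"))

// SparkContext.addJar/addFile are documented to accept local, HDFS, and HTTP URIs;
// they make the artifacts available to executors (task classpath / working directory),
// but do not add anything to the driver's own classpath.
sc.addJar("hdfs:///libs/my-dependency.jar")       // placeholder path
sc.addFile("hdfs:///conf/my-config.properties")   // placeholder path
```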

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-11 Thread Kay Ousterhout
Good idea -- I've added this to the wiki: https://cwiki.apache.org/confluence/display/SPARK/Shuffle+Internals. Happy to stick it elsewhere if folks think there's a more convenient place. On Thu, Jun 11, 2015 at 4:46 PM, Gerard Maas gerard.m...@gmail.com wrote: Kay, Excellent write-up. This

Re: When to expect UTF8String?

2015-06-11 Thread Michael Armbrust
Through the DataFrame API, users should never see UTF8String. Expression (and any class in the catalyst package) is considered internal and so uses the internal representation of various types. Which type we use here is not stable across releases. Is there a reason you aren't defining a UDF
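A minimal sketch of the UDF route being suggested, assuming a DataFrame df with a string column named "name" (both names are placeholders):

```scala
import org.apache.spark.sql.functions.udf

// A UDF declared over plain Scala types lets Spark handle the internal
// UTF8String conversion, so user code only ever sees java.lang.String.
val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)

// Usage: df.withColumn("name_upper", toUpper(df("name")))
```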

When to expect UTF8String?

2015-06-11 Thread zsampson
I'm hoping for some clarity about when to expect String vs UTF8String when using the Java DataFrames API. In upgrading to Spark 1.4, I'm dealing with a lot of errors where what was once a String is now a UTF8String. The comments in the file and the related commit message indicate that maybe it

Contributing to pyspark

2015-06-11 Thread Usman Ehtesham
Hello, I am currently taking a course in Apache Spark via EdX ( https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x) and at the same time I am trying to look at the code for PySpark too. I wanted to ask: if I would ideally like to contribute to PySpark specifically, how

Re: [DISCUSS] Minimize use of MINOR, BUILD, and HOTFIX w/ no JIRA

2015-06-11 Thread shane knapp
+1, and I know I've been guilty of this in the past. :) On Wed, Jun 10, 2015 at 10:20 PM, Joseph Bradley jos...@databricks.com wrote: +1 On Sat, Jun 6, 2015 at 9:01 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Just a request here - it would be great if people could create

[ANNOUNCE] Announcing Spark 1.4

2015-06-11 Thread Patrick Wendell
Hi All, I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is the fifth release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 210 developers and more than 1,000 commits! A huge thanks goes to all of the individuals and organizations