Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
Scala KafkaRDD uses a trait to handle this problem, but it is not so easy and straightforward in Python, where we would need a specific API to handle this. I'm not sure whether there is any simple workaround; maybe we should think carefully about it.

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Amit Ramesh
Thanks, Jerry. That's what I suspected based on the code I looked at. Any pointers on what is needed to build in this support would be great. This is critical to the project we are currently working on. Thanks!

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
OK, I get it. I think the Python-based Kafka direct API does not currently provide an equivalent to Scala's; maybe we should figure out how to add this to the Python API as well.

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Amit Ramesh
Hi Jerry, Take a look at this example: https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2 The offsets are needed because, as RDDs get generated within Spark, the offsets move further along. With direct Kafka mode the current offsets are no longer persisted in ZooKeeper.
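The Scala example Amit links reads the consumed offsets back from each batch's RDD. The idea can be sketched in plain Python with a hypothetical `OffsetRange` record and a `track_offsets` helper (illustrative stand-ins, not the real pyspark classes, which did not expose this in 1.4):

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the fields of Scala's OffsetRange.
OffsetRange = namedtuple("OffsetRange", ["topic", "partition", "from_offset", "until_offset"])

def track_offsets(batches):
    """Record the offset range each batch consumed.

    Each batch is a (topic, partition, records) tuple. Offsets advance
    with every batch, which is why they must be read per batch rather
    than assumed from the value you started the stream with.
    """
    ranges = []
    next_offset = {}
    for topic, partition, records in batches:
        start = next_offset.get((topic, partition), 0)
        end = start + len(records)
        ranges.append(OffsetRange(topic, partition, start, end))
        next_offset[(topic, partition)] = end
    return ranges

ranges = track_offsets([
    ("events", 0, ["a", "b", "c"]),
    ("events", 0, ["d"]),
])
# The second batch starts exactly where the first ended.
```

This is the information the Scala API surfaces per batch and the Python API of 1.4 did not.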

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
Hi, What do you mean by getting the offsets from the RDD? From my understanding, the offsetRange is a parameter you supplied to KafkaRDD, so why do you still want to get back the one you previously set? Thanks, Jerry

Re: Contributing to pyspark

2015-06-11 Thread Manoj Kumar
Hi, Thanks for your interest in PySpark. The first thing is to have a look at the "how to contribute" guide https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and filter the JIRAs using the label PySpark. If you have your own improvement in mind, you can file your own JIRA.

Contributing to pyspark

2015-06-11 Thread Usman Ehtesham
Hello, I am currently taking a course in Apache Spark via EdX ( https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x) and at the same time I am trying to look at the code for PySpark too. I wanted to ask: if I would ideally like to contribute to PySpark specifically, how can I do so?

Re: When to expect UTF8String?

2015-06-11 Thread Michael Armbrust
Through the DataFrame API, users should never see UTF8String. Expression (and any class in the catalyst package) is considered internal and so uses the internal representation of various types. Which type we use here is not stable across releases. Is there a reason you aren't defining a UDF instead?
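The String vs UTF8String split Michael describes is roughly analogous to Python's `str` vs `bytes`: the internal representation holds UTF-8 encoded bytes, while the public API deals in decoded strings. A small plain-Python illustration of that distinction (an analogy, not Spark's actual classes):

```python
# A decoded string (what a public API hands back) vs its UTF-8
# encoded byte form (analogous to an internal UTF8String).
public_value = "café"
internal_value = public_value.encode("utf-8")

assert isinstance(internal_value, bytes)
assert internal_value.decode("utf-8") == public_value

# The encoded form's length is in bytes, not characters:
# 'é' takes two bytes in UTF-8.
assert len(public_value) == 4
assert len(internal_value) == 5
```

Code that assumed the decoded form would break when handed the encoded one, which is the kind of error the upgrade surfaced.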

When to expect UTF8String?

2015-06-11 Thread zsampson
I'm hoping for some clarity about when to expect String vs UTF8String when using the Java DataFrames API. In upgrading to Spark 1.4, I'm dealing with a lot of errors where what was once a String is now a UTF8String. The comments in the file and the related commit message indicate that maybe it sho

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-11 Thread Kay Ousterhout
Good idea -- I've added this to the wiki: https://cwiki.apache.org/confluence/display/SPARK/Shuffle+Internals. Happy to stick it elsewhere if folks think there's a more convenient place.

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-11 Thread Gerard Maas
Kay, Excellent write-up. This should be preserved for reference somewhere searchable. -Gerard.

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-11 Thread Kay Ousterhout
Here’s how the shuffle works. This explains what happens for a single task; this will happen in parallel for each task running on the machine, and as Imran said, Spark runs up to “numCores” tasks concurrently on each machine. There's also an answer to the original question about why CPU use is low.
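The map side buckets each record by the partitioner, and each reducer then collects its bucket from every map task's output. A toy hash-partitioned shuffle in plain Python shows the data-movement pattern (a sketch of the general technique, not Spark's implementation):

```python
def shuffle(map_outputs, num_reducers):
    """Toy hash shuffle: each map task buckets its (key, value) records
    by key hash (map side), then reducer r collects bucket r from every
    map task's output (reduce side)."""
    bucketed = []
    for records in map_outputs:
        buckets = [[] for _ in range(num_reducers)]
        for key, value in records:
            buckets[hash(key) % num_reducers].append((key, value))
        bucketed.append(buckets)
    return [[kv for buckets in bucketed for kv in buckets[r]]
            for r in range(num_reducers)]

reducers = shuffle([[("a", 1), ("b", 2)], [("a", 3)]], num_reducers=2)
# All values for key "a" land in the same reducer partition,
# regardless of which map task produced them.
```

Because bucketing is deterministic per key, co-locating all values for a key is what makes per-key aggregation possible on the reduce side.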

[ANNOUNCE] Announcing Spark 1.4

2015-06-11 Thread Patrick Wendell
Hi All, I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is the fifth release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 210 developers and more than 1,000 commits! A huge thanks goes to all of the individuals and organizations involved.

Re: [DISCUSS] Minimize use of MINOR, BUILD, and HOTFIX w/ no JIRA

2015-06-11 Thread shane knapp
+1, and I know I've been guilty of this in the past. :) On Wed, Jun 10, 2015 at 10:20 PM, Joseph Bradley wrote: > +1 > > On Sat, Jun 6, 2015 at 9:01 AM, Patrick Wendell > wrote: >> Hey All, >> >> Just a request here - it would be great if people could create JIRAs >> for any and all merged

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-11 Thread Imran Rashid
That is not exactly correct -- that being said, I'm not 100% on these details either, so I'd appreciate you double-checking and/or another dev confirming my description. Spark actually has more threads going than the "numCores" you specify. "numCores" is really used for how many threads are active
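Imran's point that "numCores" caps how many tasks run concurrently, while more threads may exist overall, can be sketched with a bounded worker pool. This uses Python's `concurrent.futures` as an illustration of the general pattern, not Spark's actual executor internals:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

NUM_CORES = 2  # illustrative stand-in for Spark's "numCores" setting

active = 0  # tasks currently running
peak = 0    # highest concurrency observed
lock = threading.Lock()

def task(_):
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    # ... the task body would do its work here ...
    with lock:
        active -= 1

# Ten tasks are queued, but the pool runs at most NUM_CORES at once;
# the rest wait, just as tasks wait for a free slot on an executor.
with ThreadPoolExecutor(max_workers=NUM_CORES) as pool:
    list(pool.map(task, range(10)))
```

After the pool drains, `peak` never exceeds `NUM_CORES`, even though far more tasks (and helper threads) existed.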

Re: [ml] Why all model classes are final?

2015-06-11 Thread Erik Erlandson
I was able to work around this problem in several cases using the class 'enhancement' or 'extension' pattern to add some functionality to the decision tree model data structures. - Original Message - > Hi, previously all the models in ml package were private to package, so > if i need t
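The 'enhancement'/'extension' pattern Erik mentions adds functionality to a class you cannot subclass or modify (in Scala, typically via an implicit class). A rough Python analogue uses a thin wrapper; `FinalModel` here is a hypothetical stand-in for a final model class, not an actual ml class:

```python
class FinalModel:
    """Hypothetical stand-in for a final model class we cannot
    subclass or modify."""
    def __init__(self, weights):
        self.weights = weights

class EnrichedModel:
    """Wrapper adding functionality without touching FinalModel,
    analogous to Scala's implicit-class enhancement pattern."""
    def __init__(self, model):
        self._model = model

    def num_weights(self):
        # New behavior built purely on the wrapped object's public API.
        return len(self._model.weights)

enriched = EnrichedModel(FinalModel([0.1, 0.2, 0.3]))
print(enriched.num_weights())  # -> 3
```

The wrapper depends only on the model's public surface, so it keeps working even though the model class itself is sealed against extension.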