What RDD transformations trigger computations?

2014-12-18 Thread Alessandro Baretta
All, I noticed that while some operations that return RDDs are very cheap, such as map and flatMap, some are quite expensive, such as union and groupByKey. I'm referring here to the cost of constructing the RDD Scala value, not the cost of collecting the values contained in the RDD. This does not

Re: one hot encoding

2014-12-18 Thread sm...@yahoo.com.INVALID
Sandy, will it be available for pyspark use too? On Dec 13, 2014, at 6:18 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Lochana, We haven't yet added this in 1.2. https://issues.apache.org/jira/browse/SPARK-4081 tracks adding categorical feature indexing, which one-hot encoding can be
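Since the archive truncates the thread, here is a minimal plain-Scala sketch of what one-hot encoding a categorical column means. It does not use the Spark API tracked in SPARK-4081 (which was not yet released at the time); the object and method names are illustrative only.

```scala
object OneHotSketch {
  // Encode a categorical column as one-hot vectors: one output dimension
  // per distinct category, with 1.0 in the position of the observed value.
  // Returns the sorted category list alongside the encoded vectors.
  def encode(values: Seq[String]): (Seq[String], Seq[Array[Double]]) = {
    val categories = values.distinct.sorted
    val index = categories.zipWithIndex.toMap
    val vectors = values.map { v =>
      val arr = Array.fill(categories.length)(0.0)
      arr(index(v)) = 1.0
      arr
    }
    (categories, vectors)
  }
}
```

For example, `encode(Seq("b", "a", "b"))` yields categories `Seq("a", "b")` and vectors `[0,1]`, `[1,0]`, `[0,1]`.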

Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
Now that 1.2 is finalized... who are the go-to people to get some long-standing Kafka related issues resolved? The existing api is not sufficiently safe nor flexible for our production use. I don't think we're alone in this viewpoint, because I've seen several different patches and libraries to

Spark Streaming Data flow graph

2014-12-18 Thread francois . garillot
I’ve been trying to produce an updated box diagram to refresh: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617/26 … after SPARK-3129 and other switches (a surprising number of comments still mention NetworkReceiver). Here’s what I have

Re: Which committers care about Kafka?

2014-12-18 Thread Patrick Wendell
Hey Cody, Thanks for reaching out with this. The lead on streaming is TD - he is traveling this week though so I can respond a bit. To the high level point of whether Kafka is important - it definitely is. Something like 80% of Spark Streaming deployments (anecdotally) ingest data from Kafka.

Re: What RDD transformations trigger computations?

2014-12-18 Thread Josh Rosen
Could you provide an example? These operations are lazy, in the sense that they don’t trigger Spark jobs: scala> val a = sc.parallelize(1 to 1, 1).mapPartitions { x => println("computed a!"); x } a: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at mapPartitions at <console>:18 scala>
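As a plain-Scala analogy to the REPL session above (no Spark required), an Iterator pipeline exhibits the same laziness: building the chain runs nothing until something consumes it, just as a lazy RDD transformation runs nothing until an action is called. This is a sketch of the concept, not Spark code.

```scala
object LazinessSketch {
  // Returns (count after building the pipeline, count after consuming it).
  // The side effect in map runs only when toList forces the iterator,
  // mirroring how a lazy RDD map runs only when an action triggers a job.
  def demo(): (Int, Int) = {
    var evaluated = 0
    val it = (1 to 5).iterator.map { x => evaluated += 1; x * 2 }
    val before = evaluated   // still zero: construction computed nothing
    it.toList                // the "action": forces all five elements
    (before, evaluated)
  }
}
```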

Re: What RDD transformations trigger computations?

2014-12-18 Thread Reynold Xin
Alessandro was probably referring to some transformations whose implementations depend on some actions. For example: sortByKey requires sampling the data to get the histogram. There is a ticket tracking this: https://issues.apache.org/jira/browse/SPARK-2992 On Thu, Dec 18, 2014 at 11:52 AM,
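To make the sortByKey point concrete: a range partitioner has to know approximate key boundaries before any sorting can be planned, and obtaining those boundaries from a sample is itself a job, which is why the "transformation" runs eagerly. The following is an illustrative plain-Scala sketch of deriving boundaries from a sample, not Spark's actual RangePartitioner implementation.

```scala
object RangeBoundsSketch {
  // Pick (numPartitions - 1) boundary keys from a sample so that the
  // sorted sample is split into roughly equal-sized ranges.
  def boundaries(sample: Seq[Int], numPartitions: Int): Seq[Int] = {
    val sorted = sample.sorted
    (1 until numPartitions).map { i =>
      sorted((i * sorted.length) / numPartitions)
    }
  }
}
```

For a sample of 0..9 and 4 partitions this yields boundaries 2, 5, 7, splitting the keys into four ranges.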

Re: Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
Thanks for the replies. Regarding skipping WAL, it's not just about optimization. If you actually want exactly-once semantics, you need control of kafka offsets as well, including the ability to not use zookeeper as the system of record for offsets. Kafka already is a reliable system that has

Re: What RDD transformations trigger computations?

2014-12-18 Thread Mark Hamstra
SPARK-2992 is a good start, but it's not exhaustive. For example, zipWithIndex is also an eager transformation, and we occasionally see PRs suggesting additional eager transformations. On Thu, Dec 18, 2014 at 12:14 PM, Reynold Xin r...@databricks.com wrote: Alessandro was probably referring to
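The reason zipWithIndex is eager: a global index for each element requires knowing every preceding partition's size, so a count job must run before the result RDD can even be defined. A plain-Scala sketch of that offset computation over simulated partition sizes (illustrative names, not Spark internals):

```scala
object ZipWithIndexSketch {
  // Given the element count of each partition, compute the global index
  // at which each partition starts. Spark must run a job to obtain these
  // sizes, which is what makes zipWithIndex eager.
  def startOffsets(partitionSizes: Seq[Long]): Seq[Long] =
    partitionSizes.scanLeft(0L)(_ + _).init
}
```

Partitions of sizes 3, 2, 4 start at global indices 0, 3, and 5 respectively.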

Re: Which committers care about Kafka?

2014-12-18 Thread Hari Shreedharan
I get what you are saying. But getting exactly-once right is an extremely hard problem, especially in the presence of failure. The issue is that failures can happen in a bunch of places. For example, before the notification of the downstream store being successful reaches the receiver that updates the

Re: Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
If the downstream store for the output data is idempotent or transactional, and that downstream store also is the system of record for kafka offsets, then you have exactly-once semantics. Commit offsets with / after the data is stored. On any failure, restart from the last committed offsets.
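The scheme described above can be sketched in plain Scala with a hypothetical in-memory store standing in for the transactional sink; the `Store` class and field names are illustrative, not a real Kafka or Spark API. Because rows are keyed by offset, replaying a batch after a failure overwrites identical rows rather than duplicating them, and the committed offset lives in the same store as the data.

```scala
import scala.collection.mutable

object OffsetCommitSketch {
  final case class Record(offset: Long, value: String)

  // Idempotent store that is also the system of record for offsets.
  class Store {
    val rows = mutable.Map[Long, String]()
    var committedOffset: Long = -1L

    def write(batch: Seq[Record]): Unit = {
      batch.foreach(r => rows(r.offset) = r.value)  // idempotent upsert by offset
      committedOffset = batch.map(_.offset).max     // commit offset with the data
    }
  }
}
```

Writing the same batch twice (a simulated restart from the last committed offset) leaves the store unchanged, which is the exactly-once property being claimed.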

Re: Which committers care about Kafka?

2014-12-18 Thread Luis Ángel Vicente Sánchez
But idempotency is not that easy to achieve sometimes. A strong exactly-once semantic through a proper API would be super useful; but I'm not implying this is easy to achieve. On 18 Dec 2014 21:52, Cody Koeninger c...@koeninger.org wrote: If the downstream store for the output data is idempotent or

Fwd: Spark JIRA Report

2014-12-18 Thread Sean Owen
In practice, most issues with no activity for, say, 6+ months are dead. There's a downside in believing they will eventually get done by somebody, since they almost always don't. Most of it is clutter, but if there are important bugs among them, then the fact that they're idling is a different problem: too

Re: What RDD transformations trigger computations?

2014-12-18 Thread Alessandro Baretta
Reynold, Yes, this is exactly what I was referring to. I specifically noted this unexpected behavior with sortByKey. I also noted that union is unexpectedly very slow, taking several minutes to define the RDD: although it does not seem to trigger a Spark computation per se, it seems to cause the

Re: Fwd: Spark JIRA Report

2014-12-18 Thread Josh Rosen
Slightly off-topic, but for helping to clear the PR review backlog, I have a proposal to add some “PR lifecycle” tools to spark-prs.appspot.com to make it easier to track which PRs are blocked on reviewers vs. authors: https://github.com/databricks/spark-pr-dashboard/pull/39 On December 18,

Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-18 Thread andy
I just changed the domain name in the mailing list archive settings to remove .incubator so maybe it'll work now. -- View this message in context:


RE: Which committers care about Kafka?

2014-12-18 Thread Shao, Saisai
Hi all, I agree with Hari that strong exactly-once semantics are very hard to guarantee, especially in failure situations. From my understanding, even the current implementation of ReliableKafkaReceiver cannot fully guarantee exactly-once semantics after a failure; first is the ordering of data

Re: [RESULT] [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-18 Thread Patrick Wendell
Update: An Apache infrastructure issue prevented me from pushing this last night. The issue was resolved today and I should be able to push the final release artifacts tonight. On Tue, Dec 16, 2014 at 9:20 PM, Patrick Wendell pwend...@gmail.com wrote: This vote has PASSED with 12 +1 votes (8