Re: [RESULT] [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-18 Thread Patrick Wendell
Update: An Apache infrastructure issue prevented me from pushing this last night. The issue was resolved today and I should be able to push the final release artifacts tonight. On Tue, Dec 16, 2014 at 9:20 PM, Patrick Wendell wrote: > This vote has PASSED with 12 +1 votes (8 binding) and no 0 or

RE: Which committers care about Kafka?

2014-12-18 Thread Shao, Saisai
Hi all, I agree with Hari that strong exactly-once semantics are very hard to guarantee, especially in failure situations. From my understanding, even the current implementation of ReliableKafkaReceiver cannot fully guarantee exactly-once semantics once it has failed; first is the ordering of data replay

Re: Nabble mailing list mirror errors: "This post has NOT been accepted by the mailing list yet"

2014-12-18 Thread andy
I just changed the domain name in the mailing list archive settings to remove ".incubator" so maybe it'll work now. Andy

Re: Fwd: Spark JIRA Report

2014-12-18 Thread Josh Rosen
Slightly off-topic, but for helping to clear the PR review backlog, I have a proposal to add some “PR lifecycle” tools to spark-prs.appspot.com to make it easier to track which PRs are blocked on reviewers vs. authors:  https://github.com/databricks/spark-pr-dashboard/pull/39 On December 18, 201

Re: What RDD transformations trigger computations?

2014-12-18 Thread Alessandro Baretta
Reynold, Yes, this is exactly what I was referring to. I specifically noted this unexpected behavior with sortByKey. I also noted that union is unexpectedly very slow, taking several minutes to define the RDD: although it does not seem to trigger a spark computation per se, it seems to cause the inpu

Fwd: Spark JIRA Report

2014-12-18 Thread Sean Owen
In practice, most issues with no activity for, say, 6+ months are dead. There's a downside in believing they will eventually get done by somebody, since they almost always don't. Most are clutter, but if there are important bugs among them, then the fact they're idling is a different problem: too mu

Re: Which committers care about Kafka?

2014-12-18 Thread Luis Ángel Vicente Sánchez
But idempotency is not that easy to achieve sometimes. A strong only-once semantic through a proper API would be super useful; but I'm not implying this is easy to achieve. On 18 Dec 2014 21:52, "Cody Koeninger" wrote: > If the downstream store for the output data is idempotent or transactional, >

Re: Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
If the downstream store for the output data is idempotent or transactional, and that downstream store also is the system of record for kafka offsets, then you have exactly-once semantics. Commit offsets with / after the data is stored. On any failure, restart from the last committed offsets. Yes
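
The approach described in this message (store the output and the offsets in the same transactional system, then restart from the last committed offsets) can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the thread or from Spark: an in-memory SQLite database stands in for the downstream store, a Python list stands in for a single Kafka partition, and every name (open_store, store_batch, last_committed, run, fail_at) is hypothetical.

```python
import sqlite3


def open_store():
    """Create the stand-in downstream store: output rows plus an offsets table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE output (part INTEGER, msg_offset INTEGER, value TEXT, "
        "PRIMARY KEY (part, msg_offset))"
    )
    conn.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_offset INTEGER)")
    conn.commit()
    return conn


def last_committed(conn, part):
    """Return the next offset to read; 0 if nothing has been committed yet."""
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE part = ?", (part,)
    ).fetchone()
    return row[0] if row else 0


def store_batch(conn, part, records, fail_before_commit=False):
    """Write the records and the new offset in ONE transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT INTO output VALUES (?, ?, ?)",
            [(part, off, val) for off, val in records],
        )
        if fail_before_commit:
            # Hypothetical crash after the data write but before the offset write.
            raise RuntimeError("simulated crash")
        conn.execute(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?)",
            (part, records[-1][0] + 1),
        )


def run(log, conn, part=0, batch_size=2, fail_at=None):
    """Consume the log from the last committed offset, batch by batch."""
    start = last_committed(conn, part)
    while start < len(log):
        end = min(start + batch_size, len(log))
        batch = [(i, log[i]) for i in range(start, end)]
        store_batch(conn, part, batch, fail_before_commit=(start == fail_at))
        start = last_committed(conn, part)
```

Because the data and the offset are committed together, a crash in the middle of a batch rolls both back, and a restart simply re-reads from the last committed offset without producing duplicate rows. Calling run(log, conn, fail_at=2) and then run(log, conn) on the same store exercises that path.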

Re: Which committers care about Kafka?

2014-12-18 Thread Hari Shreedharan
I get what you are saying. But getting exactly-once right is an extremely hard problem, especially in the presence of failure. The issue is failures can happen in a bunch of places. For example, before the notification of downstream store being successful reaches the receiver that updates the offse

Re: What RDD transformations trigger computations?

2014-12-18 Thread Mark Hamstra
SPARK-2992 is a good start, but it's not exhaustive. For example, zipWithIndex is also an eager transformation, and we occasionally see PRs suggesting additional eager transformations. On Thu, Dec 18, 2014 at 12:14 PM, Reynold Xin wrote: > > Alessandro was probably referring to some transformati

Re: Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
Thanks for the replies. Regarding skipping WAL, it's not just about optimization. If you actually want exactly-once semantics, you need control of kafka offsets as well, including the ability to not use zookeeper as the system of record for offsets. Kafka already is a reliable system that has st

Re: What RDD transformations trigger computations?

2014-12-18 Thread Reynold Xin
Alessandro was probably referring to some transformations whose implementations depend on some actions. For example: sortByKey requires sampling the data to get the histogram. There is a ticket tracking this: https://issues.apache.org/jira/browse/SPARK-2992 On Thu, Dec 18, 2014 at 11:52 AM,

Re: Spark JIRA Report

2014-12-18 Thread Josh Rosen
I don’t think that it makes sense to just close inactive JIRA issues without any human review.  There are many legitimate feature requests / bug reports that might be inactive for a long time because they’re low priorities to fix or because nobody has had time to deal with them yet. On December

Re: What RDD transformations trigger computations?

2014-12-18 Thread Josh Rosen
Could you provide an example?  These operations are lazy, in the sense that they don’t trigger Spark jobs: scala> val a = sc.parallelize(1 to 1, 1).mapPartitions{ x => println("computed a!"); x} a: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at mapPartitions at :18 scala> a.union

Re: Which committers care about Kafka?

2014-12-18 Thread Hari Shreedharan
Hi Cody, I am an absolute +1 on SPARK-3146. I think we can implement something pretty simple and lightweight for that one. For the Kafka DStream skipping the WAL implementation - this is something I discussed with TD a few weeks ago. Though it is a good idea to implement this to avoid un

Re: Which committers care about Kafka?

2014-12-18 Thread Patrick Wendell
Hey Cody, Thanks for reaching out with this. The lead on streaming is TD - he is traveling this week though so I can respond a bit. To the high level point of whether Kafka is important - it definitely is. Something like 80% of Spark Streaming deployments (anecdotally) ingest data from Kafka. Also

Spark Streaming Data flow graph

2014-12-18 Thread francois . garillot
I’ve been trying to produce an updated box diagram to refresh : http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617/26 … after the SPARK-3129, and other switches (a surprising number of comments still mention NetworkReceiver). Here’s what I have

Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
Now that 1.2 is finalized... who are the go-to people to get some long-standing Kafka related issues resolved? The existing api is not sufficiently safe nor flexible for our production use. I don't think we're alone in this viewpoint, because I've seen several different patches and libraries to

Re: one hot encoding

2014-12-18 Thread sm...@yahoo.com.INVALID
Sandy, will it be available for pyspark use too? > On Dec 13, 2014, at 6:18 PM, Sandy Ryza wrote: > > Hi Lochana, > > We haven't yet added this in 1.2. > https://issues.apache.org/jira/browse/SPARK-4081 tracks adding categorical > feature indexing, which one-hot encoding can be built on. > https

What RDD transformations trigger computations?

2014-12-18 Thread Alessandro Baretta
All, I noticed that while some operations that return RDDs are very cheap, such as map and flatMap, some are quite expensive, such as union and groupByKey. I'm referring here to the cost of constructing the RDD scala value, not the cost of collecting the values contained in the RDD. This does not