Update: An Apache infrastructure issue prevented me from pushing this
last night. The issue was resolved today and I should be able to push
the final release artifacts tonight.
On Tue, Dec 16, 2014 at 9:20 PM, Patrick Wendell wrote:
> This vote has PASSED with 12 +1 votes (8 binding) and no 0 or
Hi all,
I agree with Hari that strong exactly-once semantics are very hard to guarantee,
especially under failure. From my understanding, even the current
implementation of ReliableKafkaReceiver cannot fully guarantee exactly-once
semantics after a failure; the first issue is the ordering of data replay
I just changed the domain name in the mailing list archive settings to remove
".incubator" so maybe it'll work now.
Andy
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/Nabble-mailing-list-mirror-errors-This-post-has-NOT-been-accepted-by-the-mailing-l
Slightly off-topic, but for helping to clear the PR review backlog, I have a
proposal to add some “PR lifecycle” tools to spark-prs.appspot.com to make it
easier to track which PRs are blocked on reviewers vs. authors:
https://github.com/databricks/spark-pr-dashboard/pull/39
On December 18, 201
Reynold,
Yes, this is exactly what I was referring to. I specifically noted this
unexpected behavior with sortByKey. I also noted that union is unexpectedly
very slow, taking several minutes to define the RDD: although it does not
seem to trigger a Spark computation per se, it seems to cause the inpu
In practice, most issues with no activity for, say, 6+ months are
dead. There's a downside in believing they will eventually get done by
somebody, since they almost always don't.
Most is clutter, but if there are important bugs among them, then the
fact they're idling is a different problem: too mu
But idempotency is not that easy to achieve sometimes. A strong exactly-once
semantic through a proper API would be super useful; but I'm not implying
this is easy to achieve.
On 18 Dec 2014 21:52, "Cody Koeninger" wrote:
> If the downstream store for the output data is idempotent or transactional,
>
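The idempotency idea being discussed can be made concrete. Below is a minimal
plain-Scala sketch (not Spark code; OffsetKey and IdempotentStore are
hypothetical names invented for illustration) of a sink made idempotent by
keying writes on Kafka partition and offset, so that replaying a batch after a
failure leaves the store unchanged:

```scala
// Hypothetical idempotent sink: writes are keyed by (partition, offset),
// so a replayed record is recognized and does not create a duplicate.
case class OffsetKey(partition: Int, offset: Long)

class IdempotentStore {
  private val data = scala.collection.mutable.Map.empty[OffsetKey, String]

  // Writing the same key twice is a no-op, which is what makes
  // at-least-once replay safe for the stored results.
  def write(key: OffsetKey, value: String): Unit =
    if (!data.contains(key)) data.update(key, value)

  def size: Int = data.size
}

val store = new IdempotentStore
store.write(OffsetKey(0, 42L), "event-a")
store.write(OffsetKey(0, 42L), "event-a") // replay after failure: ignored
```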
If the downstream store for the output data is idempotent or transactional,
and that downstream store also is the system of record for kafka offsets,
then you have exactly-once semantics. Commit offsets with / after the data
is stored. On any failure, restart from the last committed offsets.
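A minimal sketch of this pattern in plain Scala, with a hypothetical in-memory
store standing in for the real downstream system (TxStore and its methods are
invented for illustration, not a real API): data and offset commit together in
one step, and a replayed batch is recognized by its offset and skipped:

```scala
// Hypothetical downstream store that is also the system of record for
// Kafka offsets: the batch and the new offset commit atomically, so a
// crash can never separate them. Recovery restarts from lastCommitted.
class TxStore {
  private var committedOffset: Long = 0L
  private val rows = scala.collection.mutable.ArrayBuffer.empty[String]

  // One "transaction": append the batch and advance the offset together.
  def commit(batch: Seq[String], upToOffset: Long): Unit = synchronized {
    if (upToOffset > committedOffset) { // a replay of an old batch is a no-op
      rows ++= batch
      committedOffset = upToOffset
    }
  }

  def lastCommitted: Long = committedOffset
  def count: Int = rows.size
}

val store = new TxStore
store.commit(Seq("a", "b"), upToOffset = 2L)
// Simulated failure and restart: re-read from store.lastCommitted and
// replay the same batch; the offset check makes the replay harmless.
store.commit(Seq("a", "b"), upToOffset = 2L)
```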
Yes
I get what you are saying. But getting exactly-once right is an extremely hard
problem, especially in the presence of failures. The issue is that failures can
happen in a bunch of places. For example, before the notification of the
downstream store being successful reaches the receiver that updates the offse
SPARK-2992 is a good start, but it's not exhaustive. For example,
zipWithIndex is also an eager transformation, and we occasionally see PRs
suggesting additional eager transformations.
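A rough illustration of why zipWithIndex cannot be lazy: a record's global
index depends on the sizes of all earlier partitions, so a count job must run
before the indices can be assigned at all. A plain-Scala sketch of the
start-offset computation (the partition sizes here are made-up numbers, and
this is an illustration, not Spark's actual implementation):

```scala
// Each partition's first global index is the exclusive prefix sum of the
// sizes of the partitions before it; computing those sizes requires a job.
def startOffsets(partitionSizes: Seq[Long]): Seq[Long] =
  partitionSizes.scanLeft(0L)(_ + _).init // exclusive prefix sums

val sizes = Seq(3L, 5L, 2L)        // hypothetical per-partition counts
val starts = startOffsets(sizes)   // partition i's first global index
```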
On Thu, Dec 18, 2014 at 12:14 PM, Reynold Xin wrote:
>
> Alessandro was probably referring to some transformati
Thanks for the replies.
Regarding skipping WAL, it's not just about optimization. If you actually
want exactly-once semantics, you need control of kafka offsets as well,
including the ability to not use zookeeper as the system of record for
offsets. Kafka already is a reliable system that has st
Alessandro was probably referring to some transformations whose
implementations depend on some actions. For example: sortByKey requires
sampling the data to get the histogram.
There is a ticket tracking this:
https://issues.apache.org/jira/browse/SPARK-2992
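For readers wondering why defining a sortByKey is not lazy: the range
partitioner needs boundary keys, which it gets by sampling, and that sampling
is itself a job. A simplified non-Spark sketch of the boundary computation
(an illustration of the idea, not Spark's actual RangePartitioner code):

```scala
// Pick numPartitions - 1 boundary keys at even quantiles of a key sample;
// Spark must run a sampling job to obtain such a sample up front.
def rangeBounds(sampleKeys: Seq[Int], numPartitions: Int): Seq[Int] = {
  val sorted = sampleKeys.sorted
  (1 until numPartitions).map { i =>
    sorted((i * sorted.length) / numPartitions)
  }
}

// Assign a key to the first range whose upper bound exceeds it.
def partitionFor(key: Int, bounds: Seq[Int]): Int = {
  val idx = bounds.indexWhere(key < _)
  if (idx == -1) bounds.length else idx
}

val sample = Seq(5, 3, 9, 1, 7, 2, 8, 4, 6)
val bounds = rangeBounds(sample, numPartitions = 3)
```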
On Thu, Dec 18, 2014 at 11:52 AM,
I don’t think that it makes sense to just close inactive JIRA issue without any
human review. There are many legitimate feature requests / bug reports that
might be inactive for a long time because they’re low priorities to fix or
because nobody has had time to deal with them yet.
On December
Could you provide an example? These operations are lazy, in the sense that
they don’t trigger Spark jobs:
scala> val a = sc.parallelize(1 to 1, 1).mapPartitions { x =>
  println("computed a!"); x }
a: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at mapPartitions at
<console>:18
scala> a.union
Hi Cody,
I am an absolute +1 on SPARK-3146. I think we can implement something pretty
simple and lightweight for that one.
For the Kafka DStream skipping the WAL implementation - this is something I
discussed with TD a few weeks ago. Though it is a good idea to implement this
to avoid un
Hey Cody,
Thanks for reaching out with this. The lead on streaming is TD - he is
traveling this week though so I can respond a bit. To the high level
point of whether Kafka is important - it definitely is. Something like
80% of Spark Streaming deployments (anecdotally) ingest data from
Kafka. Also
I’ve been trying to produce an updated box diagram to refresh:
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617/26
… after SPARK-3129 and other switches (a surprising number of comments
still mention NetworkReceiver).
Here’s what I have
Now that 1.2 is finalized... who are the go-to people to get some
long-standing Kafka related issues resolved?
The existing API is neither sufficiently safe nor flexible for our production
use. I don't think we're alone in this viewpoint, because I've seen
several different patches and libraries to
Sandy, will it be available to pyspark use too?
> On Dec 13, 2014, at 6:18 PM, Sandy Ryza wrote:
>
> Hi Lochana,
>
> We haven't yet added this in 1.2.
> https://issues.apache.org/jira/browse/SPARK-4081 tracks adding categorical
> feature indexing, which one-hot encoding can be built on.
> https
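For illustration only (this is not the SPARK-4081 API, which was still being
designed at the time, and oneHot is an invented name): once each categorical
value has an integer index, one-hot encoding on top of it can be sketched as:

```scala
// Hypothetical one-hot encoder over an existing categorical index:
// category `index` out of `numCategories` becomes a unit basis vector.
def oneHot(index: Int, numCategories: Int): Vector[Double] =
  Vector.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)

val encoded = oneHot(2, 4) // third of four categories
```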
All,
I noticed that while some operations that return RDDs are very cheap, such
as map and flatMap, some are quite expensive, such as union and groupByKey.
I'm referring here to the cost of constructing the RDD Scala value, not the
cost of collecting the values contained in the RDD. This does not
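One plausible explanation for the slow union, offered as an assumption rather
than a diagnosis: chaining two-way unions re-copies partition metadata at
every step, which is quadratic in the number of inputs, whereas a single n-way
union (SparkContext.union) does one pass. A plain-Scala model of the chained
shape that counts the copies (the Vectors merely stand in for per-RDD
partition lists; this is not Spark code):

```scala
// Folding a two-way union over n inputs: each step re-copies the whole
// accumulated partition list, so for n one-partition inputs the total
// metadata copied is 1 + 2 + ... + n = n(n+1)/2, i.e. quadratic.
var copies = 0L

def chainedUnion(parts: Seq[Vector[Int]]): Vector[Int] =
  parts.foldLeft(Vector.empty[Int]) { (acc, p) =>
    copies += acc.length + p.length // metadata re-copied at this step
    acc ++ p
  }

val result = chainedUnion(Seq.fill(4)(Vector(1)))
// copies is now 10 = 4 * 5 / 2; a one-pass flatten would touch each
// partition only once.
```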