What RDD transformations trigger computations?

2014-12-18 Thread Alessandro Baretta
All,

I noticed that while some operations that return RDDs are very cheap, such
as map and flatMap, some are quite expensive, such as union and groupByKey.
I'm referring here to the cost of constructing the RDD Scala value, not the
cost of collecting the values contained in the RDD. This does not match my
understanding that RDD transformations only set up a computation without
actually running it. Oh, Spark developers, can you please provide some
clarity?

Alex


Re: one hot encoding

2014-12-18 Thread sm...@yahoo.com.INVALID
Sandy, will it be available for pyspark use too?

> On Dec 13, 2014, at 6:18 PM, Sandy Ryza  wrote:
> 
> Hi Lochana,
> 
> We haven't yet added this in 1.2.
> https://issues.apache.org/jira/browse/SPARK-4081 tracks adding categorical
> feature indexing, which one-hot encoding can be built on.
> https://issues.apache.org/jira/browse/SPARK-1216 also tracks a version of
> this prior to the ML pipelines work.
> 
> -Sandy
> 
> On Fri, Dec 12, 2014 at 6:16 PM, Lochana Menikarachchi 
> wrote:
>> 
>> Do we have one-hot encoding in spark MLLib 1.1.1 or 1.2.0 ? It wasn't
>> available in 1.1.0.
>> Thanks.
>> 
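Since one-hot encoding was not yet available in MLlib at that point, here is a
minimal plain-Scala sketch of the transformation under discussion; the category
values are hypothetical:

// Map each category to a column index, then emit a vector with a single
// 1.0 in that column.
val categories = Seq("red", "green", "blue")
val index = categories.zipWithIndex.toMap   // category -> column

def oneHot(value: String): Array[Double] = {
  val vec = Array.fill(categories.size)(0.0)
  vec(index(value)) = 1.0
  vec
}

oneHot("green")   // Array(0.0, 1.0, 0.0)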





Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
Now that 1.2 is finalized...  who are the go-to people to get some
long-standing Kafka related issues resolved?

The existing API is neither sufficiently safe nor flexible for our production
use.  I don't think we're alone in this viewpoint, because I've seen
several different patches and libraries to fix the same things we've been
running into.

Regarding flexibility

https://issues.apache.org/jira/browse/SPARK-3146

has been outstanding since August, and IMHO an equivalent of this is
absolutely necessary.  We wrote a similar patch ourselves, then found that
PR and have been running it in production.  We wouldn't be able to get our
jobs done without it.  It also allows users to solve a whole class of
problems for themselves (e.g. SPARK-2388, arbitrary delay of messages, etc).

Regarding safety, I understand the motivation behind WriteAheadLog as a
general solution for streaming unreliable sources, but Kafka already is a
reliable source.  I think there's a need for an API that treats it as
such.  Even aside from the performance issues of duplicating the
write-ahead log in Kafka into another write-ahead log in HDFS, I need
exactly-once semantics in the face of failure (I've had failures that
prevented reloading a Spark Streaming checkpoint, for instance).

I've got an implementation I've been using

https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka
/src/main/scala/org/apache/spark/rdd/kafka

Tresata has something similar at https://github.com/tresata/spark-kafka,
and I know there were earlier attempts based on Storm code.

Trying to distribute these kinds of fixes as libraries rather than patches
to Spark is problematic, because large portions of the implementation are
private[spark].
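For readers unfamiliar with the modifier, a minimal sketch of what
private[spark] means; the class and member here are hypothetical, compiled as
if inside Spark's own package:

package org.apache.spark

// Visible to any code compiled under org.apache.spark (i.e. Spark itself),
// but not to an external library living in its own package - which is why
// out-of-tree Kafka integrations cannot reuse such internals.
class Internals {
  private[spark] def offsetRanges(): Seq[Long] = Seq(0L)
}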

I'd like to help, but I need to know whose attention to get.


Spark Streaming Data flow graph

2014-12-18 Thread francois . garillot
I’ve been trying to produce an updated box diagram to refresh:
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617/26


… after SPARK-3129 and other switches (a surprising number of comments
still mention NetworkReceiver).


Here’s what I have so far:
https://www.dropbox.com/s/q79taoce2ywdmf1/SparkStreaming.pdf?dl=0


This is not meant to follow any particular convention (ER, ORM, …). Data
flow, up to right before RDD creation, is drawn with bold arrows; metadata
flow with normal-width arrows.


This diagram is still very much a WIP (see the todo below), but I wanted to
share it to ask:
- what's wrong?
- what are the glaring omissions?
- how can I make this better (i.e. what should I add first to the todo list
below)?


I’ll be happy to share this (including sources) with whoever asks for it. 


Todo:
- mark private/public classes
- mark queues in Receiver, ReceivedBlockHandler, BlockManager
- mark type of info on transport: e.g. Actor message, ReceivedBlockInfo



—
François Garillot

Re: Which committers care about Kafka?

2014-12-18 Thread Patrick Wendell
Hey Cody,

Thanks for reaching out with this. The lead on streaming is TD - he is
traveling this week though so I can respond a bit. To the high level
point of whether Kafka is important - it definitely is. Something like
80% of Spark Streaming deployments (anecdotally) ingest data from
Kafka. Also, good support for Kafka is something we generally want in
Spark and not a library. In some cases IIRC there were user libraries
that used unstable Kafka APIs and we were somewhat waiting on Kafka
to stabilize them to merge things upstream. Otherwise users wouldn't
be able to use newer Kafka versions. This is a high level impression
only though, I haven't talked to TD about this recently so it's worth
revisiting given the developments in Kafka.

Please do bring things up like this on the dev list if there are
blockers for your usage - thanks for pinging it.

- Patrick

On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger  wrote:
> Now that 1.2 is finalized...  who are the go-to people to get some
> long-standing Kafka related issues resolved?
>
> The existing api is not sufficiently safe nor flexible for our production
> use.  I don't think we're alone in this viewpoint, because I've seen
> several different patches and libraries to fix the same things we've been
> running into.
>
> Regarding flexibility
>
> https://issues.apache.org/jira/browse/SPARK-3146
>
> has been outstanding since August, and IMHO an equivalent of this is
> absolutely necessary.  We wrote a similar patch ourselves, then found that
> PR and have been running it in production.  We wouldn't be able to get our
> jobs done without it.  It also allows users to solve a whole class of
> problems for themselves (e.g. SPARK-2388, arbitrary delay of messages, etc).
>
> Regarding safety, I understand the motivation behind WriteAheadLog as a
> general solution for streaming unreliable sources, but Kafka already is a
> reliable source.  I think there's a need for an api that treats it as
> such.  Even aside from the performance issues of duplicating the
> write-ahead log in kafka into another write-ahead log in hdfs, I need
> exactly-once semantics in the face of failure (I've had failures that
> prevented reloading a spark streaming checkpoint, for instance).
>
> I've got an implementation i've been using
>
> https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka
> /src/main/scala/org/apache/spark/rdd/kafka
>
> Tresata has something similar at https://github.com/tresata/spark-kafka,
> and I know there were earlier attempts based on Storm code.
>
> Trying to distribute these kinds of fixes as libraries rather than patches
> to Spark is problematic, because large portions of the implementation are
> private[spark].
>
>  I'd like to help, but i need to know whose attention to get.




Re: Which committers care about Kafka?

2014-12-18 Thread Hari Shreedharan
Hi Cody,




I am an absolute +1 on SPARK-3146. I think we can implement something pretty 
simple and lightweight for that one.




For the Kafka DStream skipping the WAL implementation - this is something I
discussed with TD a few weeks ago. Though it is a good idea to implement this
to avoid unnecessary HDFS writes, it is an optimization, and for that reason
we must be careful with the implementation. There are a couple of issues that
we need to make sure work properly - specifically ordering. To ensure we pull
messages from different topics and partitions in the same order after a
failure, we'd still have to persist the metadata to HDFS (or some other
system) - this metadata must contain the order in which messages were
consumed, so we know how to re-read them. I am planning to explore this once I
have some time (probably in January). In addition, we must also ensure that
bucketing functions work correctly. I will file a placeholder JIRA for this one.
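A minimal sketch of the kind of per-batch metadata described above; the class
and field names are hypothetical, not existing Spark types:

// The ordered ranges consumed in one batch. Persisting this (to HDFS or
// some other system) is what lets the same messages be re-read in the
// same order after a failure.
case class ConsumedRange(topic: String, partition: Int,
                         fromOffset: Long, untilOffset: Long)

case class BatchMetadata(batchTime: Long, ranges: Seq[ConsumedRange])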




I also wrote an API to write data back to Kafka a while back - 
https://github.com/apache/spark/pull/2994 . I am hoping that this will get 
pulled in soon, as this is something I know people want. I am open to feedback 
on that - anything that I can do to make it better.




Thanks, Hari

On Thu, Dec 18, 2014 at 11:14 AM, Patrick Wendell 
wrote:

> Hey Cody,
> Thanks for reaching out with this. The lead on streaming is TD - he is
> traveling this week though so I can respond a bit. To the high level
> point of whether Kafka is important - it definitely is. Something like
> 80% of Spark Streaming deployments (anecdotally) ingest data from
> Kafka. Also, good support for Kafka is something we generally want in
> Spark and not a library. In some cases IIRC there were user libraries
> that used unstable Kafka API's and we were somewhat waiting on Kafka
> to stabilize them to merge things upstream. Otherwise users wouldn't
> be able to use newer Kafka versions. This is a high level impression
> only though, I haven't talked to TD about this recently so it's worth
> revisiting given the developments in Kafka.
> Please do bring things up like this on the dev list if there are
> blockers for your usage - thanks for pinging it.
> - Patrick
> On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger  wrote:
>> Now that 1.2 is finalized...  who are the go-to people to get some
>> long-standing Kafka related issues resolved?
>>
>> The existing api is not sufficiently safe nor flexible for our production
>> use.  I don't think we're alone in this viewpoint, because I've seen
>> several different patches and libraries to fix the same things we've been
>> running into.
>>
>> Regarding flexibility
>>
>> https://issues.apache.org/jira/browse/SPARK-3146
>>
>> has been outstanding since August, and IMHO an equivalent of this is
>> absolutely necessary.  We wrote a similar patch ourselves, then found that
>> PR and have been running it in production.  We wouldn't be able to get our
>> jobs done without it.  It also allows users to solve a whole class of
>> problems for themselves (e.g. SPARK-2388, arbitrary delay of messages, etc).
>>
>> Regarding safety, I understand the motivation behind WriteAheadLog as a
>> general solution for streaming unreliable sources, but Kafka already is a
>> reliable source.  I think there's a need for an api that treats it as
>> such.  Even aside from the performance issues of duplicating the
>> write-ahead log in kafka into another write-ahead log in hdfs, I need
>> exactly-once semantics in the face of failure (I've had failures that
>> prevented reloading a spark streaming checkpoint, for instance).
>>
>> I've got an implementation i've been using
>>
>> https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka
>> /src/main/scala/org/apache/spark/rdd/kafka
>>
>> Tresata has something similar at https://github.com/tresata/spark-kafka,
>> and I know there were earlier attempts based on Storm code.
>>
>> Trying to distribute these kinds of fixes as libraries rather than patches
>> to Spark is problematic, because large portions of the implementation are
>> private[spark].
>>
>>  I'd like to help, but i need to know whose attention to get.

Re: What RDD transformations trigger computations?

2014-12-18 Thread Josh Rosen
Could you provide an example?  These operations are lazy, in the sense that 
they don’t trigger Spark jobs:


scala> val a = sc.parallelize(1 to 1, 1).mapPartitions{ x => println("computed a!"); x}
a: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at mapPartitions at <console>:18

scala> a.union(a)
res4: org.apache.spark.rdd.RDD[Int] = UnionRDD[15] at union at <console>:22

scala> a.map(x => (x, x)).groupByKey()
res5: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[17] at groupByKey at <console>:22

scala> a.map(x => (x, x)).groupByKey().count()
computed a!
res6: Long = 1


On December 18, 2014 at 1:04:54 AM, Alessandro Baretta (alexbare...@gmail.com) 
wrote:

All,  

I noticed that while some operations that return RDDs are very cheap, such  
as map and flatMap, some are quite expensive, such as union and groupByKey.  
I'm referring here to the cost of constructing the RDD scala value, not the  
cost of collecting the values contained in the RDD. This does not match my  
understanding that RDD transformations only set up a computation without  
actually running it. Oh, Spark developers, can you please provide some  
clarity?  

Alex  


Re: Spark JIRA Report

2014-12-18 Thread Josh Rosen
I don’t think that it makes sense to just close inactive JIRA issues without any
human review.  There are many legitimate feature requests / bug reports that 
might be inactive for a long time because they’re low priorities to fix or 
because nobody has had time to deal with them yet.

On December 15, 2014 at 2:37:30 PM, Nicholas Chammas 
(nicholas.cham...@gmail.com) wrote:

OK, that's good.  

Another approach we can take to controlling the number of stale JIRA issues  
is writing a bot that simply closes issues after N days of inactivity and  
prompts people to reopen the issue if it's still valid. I believe Sean Owen  
proposed that at one point (?).  

I wonder if that might be better since I feel that even a slimmed down  
email might not be enough to get already-busy people to spend time on JIRA  
management.  

Nick  

On Mon Dec 15 2014 at 12:55:06 PM Andrew Ash  wrote:  

> Nick,  
>  
> Putting the N most stale issues into a report like your latest one does  
> seem like a good way to tackle the wall of text effect that I'm worried  
> about.  
>  
> On Sun, Dec 14, 2014 at 12:28 PM, Nicholas Chammas <  
> nicholas.cham...@gmail.com> wrote:  
>  
>> Taking after Andrew’s suggestion, perhaps the report can just focus on  
>> Stale issues (no updates in > 90 days), since those are probably the  
>> easiest to act on.  
>>  
>> For example:  
>> Stale Issues
>>
>> - [Oct 22, 2012] SPARK-560: Specialize RDDs / iterators
>> - [Oct 22, 2012] SPARK-540: Add API to customize in-memory representation of RDDs
>> - [Oct 22, 2012] SPARK-573: Clarify semantics of the parallelized closures
>> - [Nov 06, 2012] SPARK-609: Add instructions for enabling Akka debug logging
>> - [Dec 17, 2012] SPARK-636: Add mechanism to run system management/configuration tasks on all workers
>>  
>> Andrew,  
>>  
>> Does that seem more useful?  
>>  
>> Nick  
>> ​  
>>  
>> On Sun Dec 14 2014 at 3:20:54 AM Nicholas Chammas <  
>> nicholas.cham...@gmail.com> wrote:  
>>  
>>> I formatted this report using Markdown; I'm open to changing the  
>>> structure or formatting or reducing the amount of information to make the  
>>> report more easily consumable.  
>>>  
>>> Regarding just sending links or whether this would just be mailing list  
>>> noise, those are a good questions.  
>>>  
>>> I've sent out links before, but I feel from a UX perspective having the  
>>> information right in the email itself makes it frictionless for people to  
>>> act on the information. For me, that difference is enough to hook me into  
>>> spending a few minutes on JIRA vs. just glossing over an email with a link. 
>>>  
>>>  
>>> I wonder if that's also the case for others on this list.  
>>>  
>>> If you already spend a good amount of time cleaning up on JIRA, then  
>>> this report won't be that relevant to you. But given the number and growth  
>>> of open issues on our tracker, I suspect we could do with quite a few more  
>>> people chipping in and cleaning up where they can.  
>>>  
>>> That's the real problem that this report is intended to help with.  
>>>  
>>> Nick  
>>>  
>>>  
>>>  
>>> On Sun Dec 14 2014 at 2:49:00 AM Andrew Ash   
>>> wrote:  
>>>  
 The goal of increasing visibility on open issues is a good one. How is  
 this different from just a link to Jira though? Some might say this adds  
 noise to the mailing list and doesn't contain any information not already  
 available in Jira.  
  
 The idea seems good but the formatting leaves a little to be desired.  
 If you aren't opposed to using HTML, I might suggest this more compact  
 format:  
  
 SPARK-2044  Pluggable interface for shuffles
 SPARK-2365  Add IndexedRDD, an efficient updatable key-value
 SPARK-3561  Allow for pluggable execution contexts in Spark
  
 Andrew  
  
 On Sat, Dec 13, 2014 at 11:31 PM, Nicholas Chammas <  
 nicholas.cham...@gmail.com> wrote:  
  
> What do y’all think of a report like this emailed out to the dev list  
> on a  
> monthly basis?  
>  
> The goal would be to increase visibility into our open issues and  
> encourage  
> developers to tend to our issue tracker more frequently.  
>  
> Nick  
>  
> There are 1,236 unresolved issues  
>  
 

Re: What RDD transformations trigger computations?

2014-12-18 Thread Reynold Xin
Alessandro was probably referring to some transformations whose
implementations depend on some actions. For example: sortByKey requires
sampling the data to get the histogram.

There is a ticket tracking this:
https://issues.apache.org/jira/browse/SPARK-2992
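A minimal spark-shell sketch of the behavior that ticket tracks (assuming a
1.x shell; the sampling job shows up in the Spark UI):

val pairs = sc.parallelize(1 to 1000, 4).map(x => (x % 10, x))

// Merely defining the sorted RDD launches a sampling job to compute the
// RangePartitioner's boundaries - no action has been called yet.
val sorted = pairs.sortByKey()

// The sort itself still runs lazily, when an action is invoked:
sorted.count()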






On Thu, Dec 18, 2014 at 11:52 AM, Josh Rosen  wrote:
>
> Could you provide an example?  These operations are lazy, in the sense
> that they don’t trigger Spark jobs:
>
>
> scala> val a = sc.parallelize(1 to 1, 1).mapPartitions{ x => println("computed a!"); x}
> a: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at mapPartitions at <console>:18
>
> scala> a.union(a)
> res4: org.apache.spark.rdd.RDD[Int] = UnionRDD[15] at union at <console>:22
>
> scala> a.map(x => (x, x)).groupByKey()
> res5: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[17] at groupByKey at <console>:22
>
> scala> a.map(x => (x, x)).groupByKey().count()
> computed a!
> res6: Long = 1
>
>
> On December 18, 2014 at 1:04:54 AM, Alessandro Baretta (
> alexbare...@gmail.com) wrote:
>
> All,
>
> I noticed that while some operations that return RDDs are very cheap, such
> as map and flatMap, some are quite expensive, such as union and groupByKey.
> I'm referring here to the cost of constructing the RDD scala value, not the
> cost of collecting the values contained in the RDD. This does not match my
> understanding that RDD transformations only set up a computation without
> actually running it. Oh, Spark developers, can you please provide some
> clarity?
>
> Alex
>


Re: Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
Thanks for the replies.

Regarding skipping WAL, it's not just about optimization.  If you actually
want exactly-once semantics, you need control of Kafka offsets as well,
including the ability to not use ZooKeeper as the system of record for
offsets.  Kafka already is a reliable system that has strong ordering
guarantees (within a partition) and does not mandate the use of ZooKeeper
to store offsets.  I think there should be a Spark API that acts as a very
simple intermediary between Kafka and the user's choice of downstream store.

Take a look at the links I posted - if there have already been two independent
implementations of the idea, chances are it's something people need.

On Thu, Dec 18, 2014 at 1:44 PM, Hari Shreedharan  wrote:
>
> Hi Cody,
>
> I am an absolute +1 on SPARK-3146. I think we can implement something
> pretty simple and lightweight for that one.
>
> For the Kafka DStream skipping the WAL implementation - this is something
> I discussed with TD a few weeks ago. Though it is a good idea to implement
> this to avoid unnecessary HDFS writes, it is an optimization. For that
> reason, we must be careful in implementation. There are a couple of issues
> that we need to ensure works properly - specifically ordering. To ensure we
> pull messages from different topics and partitions in the same order after
> failure, we’d still have to persist the metadata to HDFS (or some other
> system) - this metadata must contain the order of messages consumed, so we
> know how to re-read the messages. I am planning to explore this once I have
> some time (probably in Jan). In addition, we must also ensure bucketing
> functions work fine as well. I will file a placeholder jira for this one.
>
> I also wrote an API to write data back to Kafka a while back -
> https://github.com/apache/spark/pull/2994 . I am hoping that this will
> get pulled in soon, as this is something I know people want. I am open to
> feedback on that - anything that I can do to make it better.
>
> Thanks,
> Hari
>
>
> On Thu, Dec 18, 2014 at 11:14 AM, Patrick Wendell 
> wrote:
>
>> Hey Cody,
>>
>> Thanks for reaching out with this. The lead on streaming is TD - he is
>> traveling this week though so I can respond a bit. To the high level
>> point of whether Kafka is important - it definitely is. Something like
>> 80% of Spark Streaming deployments (anecdotally) ingest data from
>> Kafka. Also, good support for Kafka is something we generally want in
>> Spark and not a library. In some cases IIRC there were user libraries
>> that used unstable Kafka API's and we were somewhat waiting on Kafka
>> to stabilize them to merge things upstream. Otherwise users wouldn't
>> be able to use newer Kafka versions. This is a high level impression
>> only though, I haven't talked to TD about this recently so it's worth
>> revisiting given the developments in Kafka.
>>
>> Please do bring things up like this on the dev list if there are
>> blockers for your usage - thanks for pinging it.
>>
>> - Patrick
>>
>> On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger 
>> wrote:
>> > Now that 1.2 is finalized... who are the go-to people to get some
>> > long-standing Kafka related issues resolved?
>> >
>> > The existing api is not sufficiently safe nor flexible for our
>> production
>> > use. I don't think we're alone in this viewpoint, because I've seen
>> > several different patches and libraries to fix the same things we've
>> been
>> > running into.
>> >
>> > Regarding flexibility
>> >
>> > https://issues.apache.org/jira/browse/SPARK-3146
>> >
>> > has been outstanding since August, and IMHO an equivalent of this is
>> > absolutely necessary. We wrote a similar patch ourselves, then found
>> that
>> > PR and have been running it in production. We wouldn't be able to get
>> our
>> > jobs done without it. It also allows users to solve a whole class of
>> > problems for themselves (e.g. SPARK-2388, arbitrary delay of messages,
>> etc).
>> >
>> > Regarding safety, I understand the motivation behind WriteAheadLog as a
>> > general solution for streaming unreliable sources, but Kafka already is
>> a
>> > reliable source. I think there's a need for an api that treats it as
>> > such. Even aside from the performance issues of duplicating the
>> > write-ahead log in kafka into another write-ahead log in hdfs, I need
>> > exactly-once semantics in the face of failure (I've had failures that
>> > prevented reloading a spark streaming checkpoint, for instance).
>> >
>> > I've got an implementation i've been using
>> >
>> > https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka
>> > /src/main/scala/org/apache/spark/rdd/kafka
>> >
>> > Tresata has something similar at https://github.com/tresata/spark-kafka,
>>
>> > and I know there were earlier attempts based on Storm code.
>> >
>> > Trying to distribute these kinds of fixes as libraries rather than
>> patches
>> > to Spark is problematic, because large portions of the implementation
>> are
>> > private[spark].
>> >
>>

Re: What RDD transformations trigger computations?

2014-12-18 Thread Mark Hamstra
SPARK-2992 is a good start, but it's not exhaustive.  For example,
zipWithIndex is also an eager transformation, and we occasionally see PRs
suggesting additional eager transformations.
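For instance, a small sketch of zipWithIndex's eagerness (assuming a 1.x
spark-shell):

val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

// With more than one partition, defining the RDD already runs a job to
// count each partition's elements, so later partitions know their
// starting index.
val indexed = rdd.zipWithIndex()

indexed.collect()   // Array((a,0), (b,1), (c,2), (d,3))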

On Thu, Dec 18, 2014 at 12:14 PM, Reynold Xin  wrote:
>
> Alessandro was probably referring to some transformations whose
> implementations depend on some actions. For example: sortByKey requires
> sampling the data to get the histogram.
>
> There is a ticket tracking this:
> https://issues.apache.org/jira/browse/SPARK-2992
>
>
>
>
>
>
> On Thu, Dec 18, 2014 at 11:52 AM, Josh Rosen  wrote:
> >
> > Could you provide an example?  These operations are lazy, in the sense
> > that they don’t trigger Spark jobs:
> >
> >
> > scala> val a = sc.parallelize(1 to 1, 1).mapPartitions{ x => println("computed a!"); x}
> > a: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at mapPartitions at <console>:18
> >
> > scala> a.union(a)
> > res4: org.apache.spark.rdd.RDD[Int] = UnionRDD[15] at union at <console>:22
> >
> > scala> a.map(x => (x, x)).groupByKey()
> > res5: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[17] at groupByKey at <console>:22
> >
> > scala> a.map(x => (x, x)).groupByKey().count()
> > computed a!
> > res6: Long = 1
> >
> >
> > On December 18, 2014 at 1:04:54 AM, Alessandro Baretta (
> > alexbare...@gmail.com) wrote:
> >
> > All,
> >
> > I noticed that while some operations that return RDDs are very cheap,
> such
> > as map and flatMap, some are quite expensive, such as union and
> groupByKey.
> > I'm referring here to the cost of constructing the RDD scala value, not
> the
> > cost of collecting the values contained in the RDD. This does not match
> my
> > understanding that RDD transformations only set up a computation without
> > actually running it. Oh, Spark developers, can you please provide some
> > clarity?
> >
> > Alex
> >
>


Re: Which committers care about Kafka?

2014-12-18 Thread Hari Shreedharan
I get what you are saying. But getting exactly-once right is an extremely hard
problem - especially in the presence of failure. The issue is that failures can
happen in a bunch of places. For example, the node may fail before the
notification that the downstream store succeeded reaches the receiver that
updates the offsets. The store was successful, but duplicates come in either
way. This is something worth discussing by itself - but without UUIDs etc.
this might not really be solved even when you think it is.




Anyway, I will look at the links. I too am interested in all of the features
you mentioned - no HDFS WAL for Kafka and once-only delivery - but I doubt
the latter is really possible to guarantee, though I really would love to
have that!




Thanks, Hari

On Thu, Dec 18, 2014 at 12:26 PM, Cody Koeninger 
wrote:

> Thanks for the replies.
> Regarding skipping WAL, it's not just about optimization.  If you actually
> want exactly-once semantics, you need control of kafka offsets as well,
> including the ability to not use zookeeper as the system of record for
> offsets.  Kafka already is a reliable system that has strong ordering
> guarantees (within a partition) and does not mandate the use of zookeeper
> to store offsets.  I think there should be a spark api that acts as a very
> simple intermediary between Kafka and the user's choice of downstream store.
> Take a look at the links I posted - if there's already been 2 independent
> implementations of the idea, chances are it's something people need.
> On Thu, Dec 18, 2014 at 1:44 PM, Hari Shreedharan > wrote:
>>
>> Hi Cody,
>>
>> I am an absolute +1 on SPARK-3146. I think we can implement something
>> pretty simple and lightweight for that one.
>>
>> For the Kafka DStream skipping the WAL implementation - this is something
>> I discussed with TD a few weeks ago. Though it is a good idea to implement
>> this to avoid unnecessary HDFS writes, it is an optimization. For that
>> reason, we must be careful in implementation. There are a couple of issues
>> that we need to ensure works properly - specifically ordering. To ensure we
>> pull messages from different topics and partitions in the same order after
>> failure, we’d still have to persist the metadata to HDFS (or some other
>> system) - this metadata must contain the order of messages consumed, so we
>> know how to re-read the messages. I am planning to explore this once I have
>> some time (probably in Jan). In addition, we must also ensure bucketing
>> functions work fine as well. I will file a placeholder jira for this one.
>>
>> I also wrote an API to write data back to Kafka a while back -
>> https://github.com/apache/spark/pull/2994 . I am hoping that this will
>> get pulled in soon, as this is something I know people want. I am open to
>> feedback on that - anything that I can do to make it better.
>>
>> Thanks,
>> Hari
>>
>>
>> On Thu, Dec 18, 2014 at 11:14 AM, Patrick Wendell 
>> wrote:
>>
>>> Hey Cody,
>>>
>>> Thanks for reaching out with this. The lead on streaming is TD - he is
>>> traveling this week though so I can respond a bit. To the high level
>>> point of whether Kafka is important - it definitely is. Something like
>>> 80% of Spark Streaming deployments (anecdotally) ingest data from
>>> Kafka. Also, good support for Kafka is something we generally want in
>>> Spark and not a library. In some cases IIRC there were user libraries
>>> that used unstable Kafka API's and we were somewhat waiting on Kafka
>>> to stabilize them to merge things upstream. Otherwise users wouldn't
>>> be able to use newer Kafka versions. This is a high level impression
>>> only though, I haven't talked to TD about this recently so it's worth
>>> revisiting given the developments in Kafka.
>>>
>>> Please do bring things up like this on the dev list if there are
>>> blockers for your usage - thanks for pinging it.
>>>
>>> - Patrick
>>>
>>> On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger 
>>> wrote:
>>> > Now that 1.2 is finalized... who are the go-to people to get some
>>> > long-standing Kafka related issues resolved?
>>> >
>>> > The existing api is not sufficiently safe nor flexible for our
>>> production
>>> > use. I don't think we're alone in this viewpoint, because I've seen
>>> > several different patches and libraries to fix the same things we've
>>> been
>>> > running into.
>>> >
>>> > Regarding flexibility
>>> >
>>> > https://issues.apache.org/jira/browse/SPARK-3146
>>> >
>>> > has been outstanding since August, and IMHO an equivalent of this is
>>> > absolutely necessary. We wrote a similar patch ourselves, then found
>>> that
>>> > PR and have been running it in production. We wouldn't be able to get
>>> our
>>> > jobs done without it. It also allows users to solve a whole class of
>>> > problems for themselves (e.g. SPARK-2388, arbitrary delay of messages,
>>> etc).
>>> >
>>> > Regarding safety, I understand the motivation behind WriteAheadLog as a
>> > general solution for streaming unreliable

Re: Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
If the downstream store for the output data is idempotent or transactional,
and that downstream store also is the system of record for kafka offsets,
then you have exactly-once semantics.  Commit offsets with / after the data
is stored.  On any failure, restart from the last committed offsets.

Yes, this approach is biased towards the ETL-like use cases rather than
near-realtime-analytics use cases.
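A hedged Scala sketch of that recipe against a SQL store; the JDBC URL and
the events/offsets tables are hypothetical, and a suitable driver is assumed
to be on the classpath:

import java.sql.DriverManager

def storeBatchExactlyOnce(topic: String, partition: Int,
                          records: Iterator[String],
                          untilOffset: Long): Unit = {
  val conn = DriverManager.getConnection("jdbc:postgresql://localhost/demo")
  try {
    conn.setAutoCommit(false)
    val insert = conn.prepareStatement("INSERT INTO events(payload) VALUES (?)")
    records.foreach { r => insert.setString(1, r); insert.executeUpdate() }
    // Commit the offset in the same transaction as the data: on restart we
    // resume from the committed offset, so a replayed batch is either fully
    // applied once or rolled back together with its offset update.
    val upsert = conn.prepareStatement(
      "UPDATE offsets SET until_offset = ? WHERE topic = ? AND part = ?")
    upsert.setLong(1, untilOffset)
    upsert.setString(2, topic)
    upsert.setInt(3, partition)
    upsert.executeUpdate()
    conn.commit()
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}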

On Thu, Dec 18, 2014 at 3:27 PM, Hari Shreedharan  wrote:
>
> I get what you are saying. But getting exactly once right is an extremely
> hard problem - especially in presence of failure. The issue is failures can
> happen in a bunch of places. For example, before the notification of
> downstream store being successful reaches the receiver that updates the
> offsets, the node fails. The store was successful, but duplicates came in
> either way. This is something worth discussing by itself - but without
> uuids etc this might not really be solved even when you think it is.
>
> Anyway, I will look at the links. Even I am interested in all of the
> features you mentioned - no HDFS WAL for Kafka and once-only delivery, but
> I doubt the latter is really possible to guarantee - though I really would
> love to have that!
>
> Thanks,
> Hari
>
>
> On Thu, Dec 18, 2014 at 12:26 PM, Cody Koeninger 
> wrote:
>
>> Thanks for the replies.
>>
>> Regarding skipping WAL, it's not just about optimization.  If you
>> actually want exactly-once semantics, you need control of kafka offsets as
>> well, including the ability to not use zookeeper as the system of record
>> for offsets.  Kafka already is a reliable system that has strong ordering
>> guarantees (within a partition) and does not mandate the use of zookeeper
>> to store offsets.  I think there should be a spark api that acts as a very
>> simple intermediary between Kafka and the user's choice of downstream store.
>>
>> Take a look at the links I posted - if there's already been 2 independent
>> implementations of the idea, chances are it's something people need.
>>
>> On Thu, Dec 18, 2014 at 1:44 PM, Hari Shreedharan <
>> hshreedha...@cloudera.com> wrote:
>>>
>>> Hi Cody,
>>>
>>> I am an absolute +1 on SPARK-3146. I think we can implement something
>>> pretty simple and lightweight for that one.
>>>
>>> For the Kafka DStream skipping the WAL implementation - this is
>>> something I discussed with TD a few weeks ago. Though it is a good idea to
>>> implement this to avoid unnecessary HDFS writes, it is an optimization. For
>>> that reason, we must be careful in implementation. There are a couple of
>>> issues that we need to ensure works properly - specifically ordering. To
>>> ensure we pull messages from different topics and partitions in the same
>>> order after failure, we’d still have to persist the metadata to HDFS (or
>>> some other system) - this metadata must contain the order of messages
>>> consumed, so we know how to re-read the messages. I am planning to explore
>>> this once I have some time (probably in Jan). In addition, we must also
>>> ensure bucketing functions work fine as well. I will file a placeholder
>>> jira for this one.
>>>
>>> I also wrote an API to write data back to Kafka a while back -
>>> https://github.com/apache/spark/pull/2994 . I am hoping that this will
>>> get pulled in soon, as this is something I know people want. I am open to
>>> feedback on that - anything that I can do to make it better.
>>>
>>> Thanks,
>>> Hari
>>>
>>>
>>> On Thu, Dec 18, 2014 at 11:14 AM, Patrick Wendell 
>>> wrote:
>>>
  Hey Cody,

 Thanks for reaching out with this. The lead on streaming is TD - he is
 traveling this week though so I can respond a bit. To the high level
 point of whether Kafka is important - it definitely is. Something like
 80% of Spark Streaming deployments (anecdotally) ingest data from
 Kafka. Also, good support for Kafka is something we generally want in
 Spark and not a library. In some cases IIRC there were user libraries
 that used unstable Kafka API's and we were somewhat waiting on Kafka
 to stabilize them to merge things upstream. Otherwise users wouldn't
 be able to use newer Kafka versions. This is a high level impression
 only though, I haven't talked to TD about this recently so it's worth
 revisiting given the developments in Kafka.

 Please do bring things up like this on the dev list if there are
 blockers for your usage - thanks for pinging it.

 - Patrick

 On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger 
 wrote:
 > Now that 1.2 is finalized... who are the go-to people to get some
 > long-standing Kafka related issues resolved?
 >
 > The existing api is not sufficiently safe nor flexible for our
 production
 > use. I don't think we're alone in this viewpoint, because I've seen
 > several different patches and libraries to fix the same things we've
 been
 > running into.
 >
 > Regarding flexibility

Re: Which committers care about Kafka?

2014-12-18 Thread Luis Ángel Vicente Sánchez
But idempotency is not that easy to achieve sometimes. A strong exactly-once
semantic through a proper API would be super useful; but I'm not implying
this is easy to achieve.
On 18 Dec 2014 21:52, "Cody Koeninger"  wrote:

> If the downstream store for the output data is idempotent or transactional,
> and that downstream store also is the system of record for kafka offsets,
> then you have exactly-once semantics.  Commit offsets with / after the data
> is stored.  On any failure, restart from the last committed offsets.
>
> Yes, this approach is biased towards the etl-like use cases rather than
> near-realtime-analytics use cases.
>
> On Thu, Dec 18, 2014 at 3:27 PM, Hari Shreedharan <
> hshreedha...@cloudera.com
> > wrote:
> >
> > I get what you are saying. But getting exactly once right is an extremely
> > hard problem - especially in presence of failure. The issue is failures
> can
> > happen in a bunch of places. For example, before the notification of
> > downstream store being successful reaches the receiver that updates the
> > offsets, the node fails. The store was successful, but duplicates came in
> > either way. This is something worth discussing by itself - but without
> > uuids etc this might not really be solved even when you think it is.
> >
> > Anyway, I will look at the links. Even I am interested in all of the
> > features you mentioned - no HDFS WAL for Kafka and once-only delivery,
> but
> > I doubt the latter is really possible to guarantee - though I really
> would
> > love to have that!
> >
> > Thanks,
> > Hari
> >
> >
> > On Thu, Dec 18, 2014 at 12:26 PM, Cody Koeninger 
> > wrote:
> >
> >> Thanks for the replies.
> >>
> >> Regarding skipping WAL, it's not just about optimization.  If you
> >> actually want exactly-once semantics, you need control of kafka offsets
> as
> >> well, including the ability to not use zookeeper as the system of record
> >> for offsets.  Kafka already is a reliable system that has strong
> ordering
> >> guarantees (within a partition) and does not mandate the use of
> zookeeper
> >> to store offsets.  I think there should be a spark api that acts as a
> very
> >> simple intermediary between Kafka and the user's choice of downstream
> store.
> >>
> >> Take a look at the links I posted - if there's already been 2
> independent
> >> implementations of the idea, chances are it's something people need.
> >>
> >> On Thu, Dec 18, 2014 at 1:44 PM, Hari Shreedharan <
> >> hshreedha...@cloudera.com> wrote:
> >>>
> >>> Hi Cody,
> >>>
> >>> I am an absolute +1 on SPARK-3146. I think we can implement something
> >>> pretty simple and lightweight for that one.
> >>>
> >>> For the Kafka DStream skipping the WAL implementation - this is
> >>> something I discussed with TD a few weeks ago. Though it is a good
> idea to
> >>> implement this to avoid unnecessary HDFS writes, it is an
> optimization. For
> >>> that reason, we must be careful in implementation. There are a couple
> of
> >>> issues that we need to ensure works properly - specifically ordering.
> To
> >>> ensure we pull messages from different topics and partitions in the
> same
> >>> order after failure, we’d still have to persist the metadata to HDFS
> (or
> >>> some other system) - this metadata must contain the order of messages
> >>> consumed, so we know how to re-read the messages. I am planning to
> explore
> >>> this once I have some time (probably in Jan). In addition, we must also
> >>> ensure bucketing functions work fine as well. I will file a placeholder
> >>> jira for this one.
> >>>
> >>> I also wrote an API to write data back to Kafka a while back -
> >>> https://github.com/apache/spark/pull/2994 . I am hoping that this will
> >>> get pulled in soon, as this is something I know people want. I am open
> to
> >>> feedback on that - anything that I can do to make it better.
> >>>
> >>> Thanks,
> >>> Hari
> >>>
> >>>
> >>> On Thu, Dec 18, 2014 at 11:14 AM, Patrick Wendell 
> >>> wrote:
> >>>
>   Hey Cody,
> 
>  Thanks for reaching out with this. The lead on streaming is TD - he is
>  traveling this week though so I can respond a bit. To the high level
>  point of whether Kafka is important - it definitely is. Something like
>  80% of Spark Streaming deployments (anecdotally) ingest data from
>  Kafka. Also, good support for Kafka is something we generally want in
>  Spark and not a library. In some cases IIRC there were user libraries
>  that used unstable Kafka API's and we were somewhat waiting on Kafka
>  to stabilize them to merge things upstream. Otherwise users wouldn't
>  be able to use newer Kafka versions. This is a high level impression
>  only though, I haven't talked to TD about this recently so it's worth
>  revisiting given the developments in Kafka.
> 
>  Please do bring things up like this on the dev list if there are
>  blockers for your usage - thanks for pinging it.
> 
>  - Patrick
> 
>  On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger wrote:

Fwd: Spark JIRA Report

2014-12-18 Thread Sean Owen
In practice, most issues with no activity for, say, 6+ months are
dead. There's a downside in believing they will eventually get done by
somebody, since they almost always don't.

Most is clutter, but if there are important bugs among them, then the
fact they're idling is a different problem: too much demand / not
enough supply of attention, not saying 'no' to enough, fast enough,
and so on.

Sure you can prompt people to at least ping an issue they care about
once every 6 months to keep it alive. Which is essentially the same
as: Resolve and invite anyone who cares to Reopen. If nobody bothers,
can it be important? If the problem is, well, nobody would really be
paying attention to the prompts, that's this different problem again.

So: I think the auto-Resolve idea, or an email blast, is at best a
forcing mechanism to pay attention to a more fundamental issue. I
myself am less interested in that than working on the causes of
long-lived important stuff in a JIRA backlog.

You can see regular progress on process: auto-closing PRs,
spark-prs.appspot.com, some big passes at closing stale issues. It's
still my impression that the bulk of existing JIRAs do not get
reviewed, so there's more to do. For example, from a recent tour
through the JIRA list, there were ~50 that were already definitively
resolved but not marked as such. It's not for lack of excellent
effort. The pace of good change outstrips any other project I've seen
by a wide margin, dwarfed only by unprecedented inbound load.

I'd rather the conversation be about more attacks on the supply/demand
problem, like adding committers to offload resolution of the easy and
clear changes more rapidly, docs or tools to help contributors make
better PRs/JIRAs in the first place, stating what is in and out of
scope upfront to direct efforts, and so on. That's a different
discussion from this one though.


On Thu, Dec 18, 2014 at 8:07 PM, Josh Rosen  wrote:
> I don’t think that it makes sense to just close inactive JIRA issue without 
> any human review.  There are many legitimate feature requests / bug reports 
> that might be inactive for a long time because they’re low priorities to fix 
> or because nobody has had time to deal with them yet.
>
> On December 15, 2014 at 2:37:30 PM, Nicholas Chammas 
> (nicholas.cham...@gmail.com) wrote:
>
> OK, that's good.
>
> Another approach we can take to controlling the number of stale JIRA issues
> is writing a bot that simply closes issues after N days of inactivity and
> prompts people to reopen the issue if it's still valid. I believe Sean Owen
> proposed that at one point (?).
>
> I wonder if that might be better since I feel that even a slimmed down
> email might not be enough to get already-busy people to spend time on JIRA
> management.
>
> Nick
>




Re: What RDD transformations trigger computations?

2014-12-18 Thread Alessandro Baretta
Reynold,

Yes, this is exactly what I was referring to. I specifically noted this
unexpected behavior with sortByKey. I also noted that union is unexpectedly
very slow, taking several minutes to define the RDD: although it does not
seem to trigger a Spark computation per se, it seems to cause the input
files to be read by the Hadoop subsystem, which writes messages such as
these to the console:

14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process
: 9
14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process
: 759
14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process
: 228
14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process
: 3076
14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process
: 1013
14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process
: 156

More generally, it would be important for the documentation to clearly
point out which RDD transformations are eager; otherwise it is easy to
introduce horrible performance bugs by constructing unneeded RDDs on the
assumption that construction is lazy. I would venture to suggest introducing
one or more traits to collect all the eager RDD-to-RDD transformations, so
that the type system can be used to enforce that no eager transformation is
used where a lazy one is intended, as sketched below.
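A minimal sketch of that suggestion; the marker traits and RDD names are
hypothetical, not Spark types:

trait LazyTransform
trait EagerTransform

// Transformations known to run a job at definition time would return a
// type carrying the eager marker...
class SortedRDD extends EagerTransform
class MappedRDD extends LazyTransform

// ...so an API that must stay cheap to call can demand laziness statically:
def defineCheaply(rdd: LazyTransform): Unit = ()

defineCheaply(new MappedRDD)     // compiles
// defineCheaply(new SortedRDD)  // rejected at compile time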

Alex

On Thu, Dec 18, 2014 at 12:14 PM, Reynold Xin  wrote:
>
> Alessandro was probably referring to some transformations whose
> implementations depend on some actions. For example: sortByKey requires
> sampling the data to get the histogram.
>
> There is a ticket tracking this:
> https://issues.apache.org/jira/browse/SPARK-2992
>
>
>
>
>
>
> On Thu, Dec 18, 2014 at 11:52 AM, Josh Rosen  wrote:
>>
>> Could you provide an example?  These operations are lazy, in the sense
>> that they don’t trigger Spark jobs:
>>
>>
>> scala> val a = sc.parallelize(1 to 1, 1).mapPartitions{ x => println("computed a!"); x}
>> a: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at mapPartitions at <console>:18
>>
>> scala> a.union(a)
>> res4: org.apache.spark.rdd.RDD[Int] = UnionRDD[15] at union at <console>:22
>>
>> scala> a.map(x => (x, x)).groupByKey()
>> res5: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[17] at groupByKey at <console>:22
>>
>> scala> a.map(x => (x, x)).groupByKey().count()
>> computed a!
>> res6: Long = 1
>>
>>
>> On December 18, 2014 at 1:04:54 AM, Alessandro Baretta (
>> alexbare...@gmail.com) wrote:
>>
>> All,
>>
>> I noticed that while some operations that return RDDs are very cheap, such
>> as map and flatMap, some are quite expensive, such as union and
>> groupByKey.
>> I'm referring here to the cost of constructing the RDD scala value, not
>> the
>> cost of collecting the values contained in the RDD. This does not match my
>> understanding that RDD transformations only set up a computation without
>> actually running it. Oh, Spark developers, can you please provide some
>> clarity?
>>
>> Alex
>>
>


Re: Fwd: Spark JIRA Report

2014-12-18 Thread Josh Rosen
Slightly off-topic, but for helping to clear the PR review backlog, I have a 
proposal to add some “PR lifecycle” tools to spark-prs.appspot.com to make it 
easier to track which PRs are blocked on reviewers vs. authors: 
https://github.com/databricks/spark-pr-dashboard/pull/39


On December 18, 2014 at 2:01:31 PM, Sean Owen (so...@cloudera.com) wrote:

In practice, most issues with no activity for, say, 6+ months are  
dead. There's down-side in believing they will eventually get done by  
somebody, since they almost always don't.  

Most is clutter, but if there are important bugs among them, then the  
fact they're idling is a different problem: too much demand / not  
enough supply of attention, not saying 'no' to enough, fast enough,  
and so on.  

Sure you can prompt people to at least ping an issue they care about  
once every 6 months to keep it alive. Which is essentially the same  
as: Resolve and invite anyone who cares to Reopen. If nobody bothers,  
can it be important? If the problem is, well, nobody would really be  
paying attention to the prompts, that's this different problem again.  

So: I think the auto-Resolve idea, or an email blast, is at best a  
forcing mechanism to pay attention to a more fundamental issue. I  
myself am less interested in that than working on the causes of  
long-lived important stuff in a JIRA backlog.  

You can see regular process progress like auto-closing PRs,  
spark-prs.appspot.com, some big passes at closing stale issues. It's  
still my impression that the bulk of existing JIRA does not get  
reviewed, so there's more to do. For example, from a recent tour  
through the JIRA list, there were ~50 that were even definitively  
resolved, and not marked as such. It's not for lack of excellent  
effort. The pace of good change outstrips any other project I've seen  
by a wide margin, dwarfed only by unprecedented inbound load.  

I'd rather the conversation be about more attacks on the supply/demand  
problem, like adding committers to offload resolution of the easy and  
clear changes more rapidly, docs or tools to help contributors make  
better PRs/JIRAs in the first place, stating what is in and out of  
scope upfront to direct efforts, and so on. That's a different  
discussion from this one though.  


On Thu, Dec 18, 2014 at 8:07 PM, Josh Rosen  wrote:  
> I don’t think that it makes sense to just close inactive JIRA issue without 
> any human review. There are many legitimate feature requests / bug reports 
> that might be inactive for a long time because they’re low priorities to fix 
> or because nobody has had time to deal with them yet.  
>  
> On December 15, 2014 at 2:37:30 PM, Nicholas Chammas 
> (nicholas.cham...@gmail.com) wrote:  
>  
> OK, that's good.  
>  
> Another approach we can take to controlling the number of stale JIRA issues  
> is writing a bot that simply closes issues after N days of inactivity and  
> prompts people to reopen the issue if it's still valid. I believe Sean Owen  
> proposed that at one point (?).  
>  
> I wonder if that might be better since I feel that even a slimmed down  
> email might not be enough to get already-busy people to spend time on JIRA  
> management.  
>  
> Nick  
>  





Re: Nabble mailing list mirror errors: "This post has NOT been accepted by the mailing list yet"

2014-12-18 Thread andy
I just changed the domain name in the mailing list archive settings to remove
".incubator" so maybe it'll work now.

Andy



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Nabble-mailing-list-mirror-errors-This-post-has-NOT-been-accepted-by-the-mailing-list-yet-tp9772p9843.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




RE: Which committers care about Kafka?

2014-12-18 Thread Shao, Saisai
Hi all,

I agree with Hari that strong exactly-once semantics are very hard to
guarantee, especially in failure situations. From my understanding, even the
current implementation of ReliableKafkaReceiver cannot fully guarantee
exactly-once semantics after a failure. First, the ordering of data replayed
from the last checkpoint is hard to guarantee when multiple partitions are
involved. Second, there is the design complexity of achieving this: as the
Kafka spout in Trident shows, we would have to dig into the very details of
Kafka's metadata management to achieve it, to say nothing of rebalancing and
fault tolerance.

Thanks
Jerry

-Original Message-
From: Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com] 
Sent: Friday, December 19, 2014 5:57 AM
To: Cody Koeninger
Cc: Hari Shreedharan; Patrick Wendell; dev@spark.apache.org
Subject: Re: Which committers care about Kafka?

But idempotency is not that easy to achieve sometimes. A strong exactly-once
semantic through a proper API would be super useful; but I'm not implying this
is easy to achieve.
On 18 Dec 2014 21:52, "Cody Koeninger"  wrote:

> If the downstream store for the output data is idempotent or 
> transactional, and that downstream store also is the system of record 
> for kafka offsets, then you have exactly-once semantics.  Commit 
> offsets with / after the data is stored.  On any failure, restart from the 
> last committed offsets.
>
> Yes, this approach is biased towards the etl-like use cases rather 
> than near-realtime-analytics use cases.
>
> On Thu, Dec 18, 2014 at 3:27 PM, Hari Shreedharan < 
> hshreedha...@cloudera.com
> > wrote:
> >
> > I get what you are saying. But getting exactly once right is an 
> > extremely hard problem - especially in presence of failure. The 
> > issue is failures
> can
> > happen in a bunch of places. For example, before the notification of 
> > downstream store being successful reaches the receiver that updates 
> > the offsets, the node fails. The store was successful, but 
> > duplicates came in either way. This is something worth discussing by 
> > itself - but without uuids etc this might not really be solved even when 
> > you think it is.
> >
> > Anyway, I will look at the links. Even I am interested in all of the 
> > features you mentioned - no HDFS WAL for Kafka and once-only 
> > delivery,
> but
> > I doubt the latter is really possible to guarantee - though I really
> would
> > love to have that!
> >
> > Thanks,
> > Hari
> >
> >
> > On Thu, Dec 18, 2014 at 12:26 PM, Cody Koeninger 
> > 
> > wrote:
> >
> >> Thanks for the replies.
> >>
> >> Regarding skipping WAL, it's not just about optimization.  If you 
> >> actually want exactly-once semantics, you need control of kafka 
> >> offsets
> as
> >> well, including the ability to not use zookeeper as the system of 
> >> record for offsets.  Kafka already is a reliable system that has 
> >> strong
> ordering
> >> guarantees (within a partition) and does not mandate the use of
> zookeeper
> >> to store offsets.  I think there should be a spark api that acts as 
> >> a
> very
> >> simple intermediary between Kafka and the user's choice of 
> >> downstream
> store.
> >>
> >> Take a look at the links I posted - if there's already been 2
> independent
> >> implementations of the idea, chances are it's something people need.
> >>
> >> On Thu, Dec 18, 2014 at 1:44 PM, Hari Shreedharan < 
> >> hshreedha...@cloudera.com> wrote:
> >>>
> >>> Hi Cody,
> >>>
> >>> I am an absolute +1 on SPARK-3146. I think we can implement 
> >>> something pretty simple and lightweight for that one.
> >>>
> >>> For the Kafka DStream skipping the WAL implementation - this is 
> >>> something I discussed with TD a few weeks ago. Though it is a good
> idea to
> >>> implement this to avoid unnecessary HDFS writes, it is an
> optimization. For
> >>> that reason, we must be careful in implementation. There are a 
> >>> couple
> of
> >>> issues that we need to ensure works properly - specifically ordering.
> To
> >>> ensure we pull messages from different topics and partitions in 
> >>> the
> same
> >>> order after failure, we’d still have to persist the metadata to 
> >>> HDFS
> (or
> >>> some other system) - this metadata must contain the order of 
> >>> messages consumed, so we know how to re-read the messages. I am 
> >>> planning to
> explore
> >>> this once I have some time (probably in Jan). In addition, we must 
> >>> also ensure bucketing functions work fine as well. I will file a 
> >>> placeholder jira for this one.
> >>>
> >>> I also wrote an API to write data back to Kafka a while back -
> >>> https://github.com/apache/spark/pull/2994 . I am hoping that this 
> >>> will get pulled in soon, as this is something I know people want. 
> >>> I am open
> to
> >>> feedback on that - anything that I can do to make it better.
> >>>
> >>> Thanks,
> >>> Hari
> >>>
> >>>
> >>> On Thu, Dec 18, 2014 at 11:14 AM, Patrick Wendell 
> >>> 
> >>> wrote:
> >>>
>   Hey Cody,

Re: [RESULT] [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-18 Thread Patrick Wendell
Update: An Apache infrastructure issue prevented me from pushing this
last night. The issue was resolved today and I should be able to push
the final release artifacts tonight.

On Tue, Dec 16, 2014 at 9:20 PM, Patrick Wendell  wrote:
> This vote has PASSED with 12 +1 votes (8 binding) and no 0 or -1 votes:
>
> +1:
> Matei Zaharia*
> Madhu Siddalingaiah
> Reynold Xin*
> Sandy Ryza
> Josh Rosen*
> Mark Hamstra*
> Denny Lee
> Tom Graves*
> GuiQiang Li
> Nick Pentreath*
> Sean McNamara*
> Patrick Wendell*
>
> 0:
>
> -1:
>
> I'll finalize and package this release in the next 48 hours. Thanks to
> everyone who contributed.
