Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
the third release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!

This release brings operational and performance improvements in Spark
core, including a new network transport subsystem designed for very
large shuffles. Spark SQL introduces an API for external data sources
along with Hive 13 support, dynamic partitioning, and the
fixed-precision decimal type. MLlib adds a new pipeline-oriented
package (spark.ml) for composing multiple algorithms. Spark Streaming
adds a Python API and a write ahead log for fault tolerance. Finally,
GraphX has graduated from alpha and introduces a stable API along with
performance improvements.
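
As a small taste of the now-stable GraphX API, a minimal sketch (the
edge-list path below is only a placeholder, and local[*] is just for trying
it out):

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.GraphLoader

    // Build a graph from a "srcId dstId" edge list and run PageRank until
    // the given convergence tolerance is reached.
    val sc = new SparkContext("local[*]", "graphx-example")
    val graph = GraphLoader.edgeListFile(sc, "/path/to/edges.txt")
    val ranks = graph.pageRank(0.0001).vertices
    ranks.take(5).foreach(println)
    sc.stop()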

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone involved in creating, testing, and documenting this release!

[1] http://spark.apache.org/releases/spark-release-1-2-0.html
[2] http://spark.apache.org/downloads.html




Re: Which committers care about Kafka?

2014-12-19 Thread Dibyendu Bhattacharya
Hi,

Thanks to Jerry for mentioning the Kafka Spout for Trident. Storm Trident
achieves its exactly-once guarantee by processing tuples in batches and
assigning the same transaction-id to a given batch. A replay of a batch with
a given transaction-id contains exactly the same set of tuples, and batches
are replayed in exactly the same order as before the failure.

With this paradigm, if the downstream system processes the data for a batch
with a given transaction-id, and the same batch is emitted again after a
failure, you can check whether that transaction-id has already been
processed and hence guarantee exactly-once semantics.
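
A rough sketch of that downstream check (the store API below is hypothetical,
just to show the shape of the idea):

    // Hypothetical downstream store that can atomically write a batch of
    // results together with the transaction-id that produced them.
    trait TxnStore {
      def alreadyProcessed(txnId: Long): Boolean
      def writeAtomically(txnId: Long, batch: Seq[String]): Unit
    }

    // Apply a (possibly replayed) batch only if its transaction-id is new.
    def processBatch(txnId: Long, batch: Seq[String], store: TxnStore): Unit = {
      if (!store.alreadyProcessed(txnId)) {
        store.writeAtomically(txnId, batch)
      }
      // else: this transaction-id was already handled, so the replay is skipped
    }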

In Spark this can only be achieved if we use the low-level Kafka consumer
API to manage the offsets. This low-level Kafka consumer (
https://github.com/dibbhatt/kafka-spark-consumer) implements a Spark Kafka
consumer on top of the Kafka low-level APIs. All of the Kafka-related logic
is taken from the Storm-Kafka spout, which handles Kafka rebalancing, fault
tolerance, and Kafka metadata management.

Presently this consumer guarantees that, on receiver failure, it re-emits
the exact same block with the same set of messages. Every message carries
its partition, offset, and topic details, which can address SPARK-3146.

As this low-level consumer has complete control over the Kafka offsets, we
can implement a Trident-like feature on top of it, such as assigning a
transaction-id to a given block and re-emitting the same block with the same
set of messages on driver failure.

Regards,
Dibyendu


On Fri, Dec 19, 2014 at 7:33 AM, Shao, Saisai  wrote:
>
> Hi all,
>
> I agree with Hari that Strong exact-once semantics is very hard to
> guarantee, especially in the failure situation. From my understanding even
> current implementation of ReliableKafkaReceiver cannot fully guarantee the
> exact once semantics once failed, first is the ordering of data replaying
> from last checkpoint, this is hard to guarantee when multiple partitions
> are injected in; second is the design complexity of achieving this, you can
> refer to the Kafka Spout in Trident, we have to dig into the very details
> of Kafka metadata management system to achieve this, not to say rebalance
> and fault-tolerance.
>
> Thanks
> Jerry
>
> -Original Message-
> From: Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com]
> Sent: Friday, December 19, 2014 5:57 AM
> To: Cody Koeninger
> Cc: Hari Shreedharan; Patrick Wendell; dev@spark.apache.org
> Subject: Re: Which committers care about Kafka?
>
> But idempotency is not that easy t achieve sometimes. A strong only once
> semantic through a proper API would  be superuseful; but I'm not implying
> this is easy to achieve.
> On 18 Dec 2014 21:52, "Cody Koeninger"  wrote:
>
> > If the downstream store for the output data is idempotent or
> > transactional, and that downstream store also is the system of record
> > for kafka offsets, then you have exactly-once semantics.  Commit
> > offsets with / after the data is stored.  On any failure, restart from
> the last committed offsets.
> >
> > Yes, this approach is biased towards the etl-like use cases rather
> > than near-realtime-analytics use cases.
> >
> > On Thu, Dec 18, 2014 at 3:27 PM, Hari Shreedharan <
> > hshreedha...@cloudera.com
> > > wrote:
> > >
> > > I get what you are saying. But getting exactly once right is an
> > > extremely hard problem - especially in presence of failure. The
> > > issue is failures
> > can
> > > happen in a bunch of places. For example, before the notification of
> > > downstream store being successful reaches the receiver that updates
> > > the offsets, the node fails. The store was successful, but
> > > duplicates came in either way. This is something worth discussing by
> > > itself - but without uuids etc this might not really be solved even
> when you think it is.
> > >
> > > Anyway, I will look at the links. Even I am interested in all of the
> > > features you mentioned - no HDFS WAL for Kafka and once-only
> > > delivery,
> > but
> > > I doubt the latter is really possible to guarantee - though I really
> > would
> > > love to have that!
> > >
> > > Thanks,
> > > Hari
> > >
> > >
> > > On Thu, Dec 18, 2014 at 12:26 PM, Cody Koeninger
> > > 
> > > wrote:
> > >
> > >> Thanks for the replies.
> > >>
> > >> Regarding skipping WAL, it's not just about optimization.  If you
> > >> actually want exactly-once semantics, you need control of kafka
> > >> offsets
> > as
> > >> well, including the ability to not use zookeeper as the system of
> > >> record for offsets.  Kafka already is a reliable system that has
> > >> strong
> > ordering
> > >> guarantees (within a partition) and does not mandate the use of
> > zookeeper
> > >> to store offsets.  I think there should be a spark api that acts as
> > >> a
> > very
> > >> simple intermediary between Kafka and the user's choice of
> > >> downstream
> > store.
> > >>
> > >> Take a look at the l

Re: Announcing Spark 1.2!

2014-12-19 Thread Shixiong Zhu
Congrats!

A little question about this release: which commit is this release based
on? The v1.2.0 and v1.2.0-rc2 tags point to different commits in
https://github.com/apache/spark/releases

Best Regards,
Shixiong Zhu

2014-12-19 16:52 GMT+08:00 Patrick Wendell :
>
> I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
> the third release on the API-compatible 1.X line. It is Spark's
> largest release ever, with contributions from 172 developers and more
> than 1,000 commits!
>
> This release brings operational and performance improvements in Spark
> core including a new network transport subsytem designed for very
> large shuffles. Spark SQL introduces an API for external data sources
> along with Hive 13 support, dynamic partitioning, and the
> fixed-precision decimal type. MLlib adds a new pipeline-oriented
> package (spark.ml) for composing multiple algorithms. Spark Streaming
> adds a Python API and a write ahead log for fault tolerance. Finally,
> GraphX has graduated from alpha and introduces a stable API along with
> performance improvements.
>
> Visit the release notes [1] to read about the new features, or
> download [2] the release today.
>
> For errata in the contributions or release notes, please e-mail me
> *directly* (not on-list).
>
> Thanks to everyone involved in creating, testing, and documenting this
> release!
>
> [1] http://spark.apache.org/releases/spark-release-1-2-0.html
> [2] http://spark.apache.org/downloads.html
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Announcing Spark 1.2!

2014-12-19 Thread Sean Owen
Tag 1.2.0 is older than 1.2.0-rc2. I wonder if it just didn't get
updated. I assume it's going to be 1.2.0-rc2 plus a few commits
related to the release process.

On Fri, Dec 19, 2014 at 9:50 AM, Shixiong Zhu  wrote:
> Congrats!
>
> A little question about this release: Which commit is this release based on?
> v1.2.0 and v1.2.0-rc2 are pointed to different commits in
> https://github.com/apache/spark/releases
>
> Best Regards,
>
> Shixiong Zhu
>
> 2014-12-19 16:52 GMT+08:00 Patrick Wendell :
>>
>> I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
>> the third release on the API-compatible 1.X line. It is Spark's
>> largest release ever, with contributions from 172 developers and more
>> than 1,000 commits!
>>
>> This release brings operational and performance improvements in Spark
>> core including a new network transport subsytem designed for very
>> large shuffles. Spark SQL introduces an API for external data sources
>> along with Hive 13 support, dynamic partitioning, and the
>> fixed-precision decimal type. MLlib adds a new pipeline-oriented
>> package (spark.ml) for composing multiple algorithms. Spark Streaming
>> adds a Python API and a write ahead log for fault tolerance. Finally,
>> GraphX has graduated from alpha and introduces a stable API along with
>> performance improvements.
>>
>> Visit the release notes [1] to read about the new features, or
>> download [2] the release today.
>>
>> For errata in the contributions or release notes, please e-mail me
>> *directly* (not on-list).
>>
>> Thanks to everyone involved in creating, testing, and documenting this
>> release!
>>
>> [1] http://spark.apache.org/releases/spark-release-1-2-0.html
>> [2] http://spark.apache.org/downloads.html
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>




Re:Re: Announcing Spark 1.2!

2014-12-19 Thread wyphao.2007


On the http://spark.apache.org/downloads.html page, we can't download the
newest Spark release.






At 2014-12-19 17:55:29,"Sean Owen"  wrote:
>Tag 1.2.0 is older than 1.2.0-rc2. I wonder if it just didn't get
>updated. I assume it's going to be 1.2.0-rc2 plus a few commits
>related to the release process.
>
>On Fri, Dec 19, 2014 at 9:50 AM, Shixiong Zhu  wrote:
>> Congrats!
>>
>> A little question about this release: Which commit is this release based on?
>> v1.2.0 and v1.2.0-rc2 are pointed to different commits in
>> https://github.com/apache/spark/releases
>>
>> Best Regards,
>>
>> Shixiong Zhu
>>
>> 2014-12-19 16:52 GMT+08:00 Patrick Wendell :
>>>
>>> I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
>>> the third release on the API-compatible 1.X line. It is Spark's
>>> largest release ever, with contributions from 172 developers and more
>>> than 1,000 commits!
>>>
>>> This release brings operational and performance improvements in Spark
>>> core including a new network transport subsytem designed for very
>>> large shuffles. Spark SQL introduces an API for external data sources
>>> along with Hive 13 support, dynamic partitioning, and the
>>> fixed-precision decimal type. MLlib adds a new pipeline-oriented
>>> package (spark.ml) for composing multiple algorithms. Spark Streaming
>>> adds a Python API and a write ahead log for fault tolerance. Finally,
>>> GraphX has graduated from alpha and introduces a stable API along with
>>> performance improvements.
>>>
>>> Visit the release notes [1] to read about the new features, or
>>> download [2] the release today.
>>>
>>> For errata in the contributions or release notes, please e-mail me
>>> *directly* (not on-list).
>>>
>>> Thanks to everyone involved in creating, testing, and documenting this
>>> release!
>>>
>>> [1] http://spark.apache.org/releases/spark-release-1-2-0.html
>>> [2] http://spark.apache.org/downloads.html
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>
>-
>To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>
>


Re: Re: Announcing Spark 1.2!

2014-12-19 Thread Sean Owen
I can download it. Make sure you refresh the page, maybe, so that it
shows the 1.2.0 download as an option.

On Fri, Dec 19, 2014 at 11:16 AM, wyphao.2007  wrote:
>
>
> In the http://spark.apache.org/downloads.html page,We cann't download the 
> newest Spark release.
>
>
>
>
>
>
> At 2014-12-19 17:55:29,"Sean Owen"  wrote:
>>Tag 1.2.0 is older than 1.2.0-rc2. I wonder if it just didn't get
>>updated. I assume it's going to be 1.2.0-rc2 plus a few commits
>>related to the release process.




spark-yarn_2.10 1.2.0 artifacts

2014-12-19 Thread David McWhorter

Hi all,

Thanks for your work on spark!  I am trying to locate spark-yarn jars 
for the new 1.2.0 release.  The jars for spark-core, etc, are on maven 
central, but the spark-yarn jars are missing.


Confusingly and perhaps relatedly, I also can't seem to get the 
spark-yarn artifact to install on my local computer when I run 'mvn 
-Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean install'.  
At the install plugin stage, maven reports:


[INFO] --- maven-install-plugin:2.5.1:install (default-install) @ 
spark-yarn_2.10 ---

[INFO] Skipping artifact installation

Any help or insights into how to use spark-yarn_2.10 1.2.0 in a maven 
build would be appreciated.


David

--

David McWhorter
Software Engineer
Commonwealth Computer Research, Inc.
1422 Sachem Place, Unit #1
Charlottesville, VA 22901
mcwhor...@ccri.com | 434.299.0090x204





Re: spark-yarn_2.10 1.2.0 artifacts

2014-12-19 Thread Sean Owen
I believe spark-yarn no longer exists from 1.2 onwards. Have a look at
spark-network-yarn for where some of that code went.

On Fri, Dec 19, 2014 at 5:09 PM, David McWhorter  wrote:
> Hi all,
>
> Thanks for your work on spark!  I am trying to locate spark-yarn jars for
> the new 1.2.0 release.  The jars for spark-core, etc, are on maven central,
> but the spark-yarn jars are missing.
>
> Confusingly and perhaps relatedly, I also can't seem to get the spark-yarn
> artifact to install on my local computer when I run 'mvn -Pyarn -Phadoop-2.2
> -Dhadoop.version=2.2.0 -DskipTests clean install'.  At the install plugin
> stage, maven reports:
>
> [INFO] --- maven-install-plugin:2.5.1:install (default-install) @
> spark-yarn_2.10 ---
> [INFO] Skipping artifact installation
>
> Any help or insights into how to use spark-yarn_2.10 1.2.0 in a maven build
> would be appreciated.
>
> David
>
> --
>
> David McWhorter
> Software Engineer
> Commonwealth Computer Research, Inc.
> 1422 Sachem Place, Unit #1
> Charlottesville, VA 22901
> mcwhor...@ccri.com | 434.299.0090x204
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>




Re: Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
Thanks for pointing out the tag issue. I've updated all links to point
to the correct tag (from the vote thread):

a428c446e23e628b746e0626cc02b7b3cadf588e

On Fri, Dec 19, 2014 at 1:55 AM, Sean Owen  wrote:
> Tag 1.2.0 is older than 1.2.0-rc2. I wonder if it just didn't get
> updated. I assume it's going to be 1.2.0-rc2 plus a few commits
> related to the release process.
>
> On Fri, Dec 19, 2014 at 9:50 AM, Shixiong Zhu  wrote:
>> Congrats!
>>
>> A little question about this release: Which commit is this release based on?
>> v1.2.0 and v1.2.0-rc2 are pointed to different commits in
>> https://github.com/apache/spark/releases
>>
>> Best Regards,
>>
>> Shixiong Zhu
>>
>> 2014-12-19 16:52 GMT+08:00 Patrick Wendell :
>>>
>>> I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
>>> the third release on the API-compatible 1.X line. It is Spark's
>>> largest release ever, with contributions from 172 developers and more
>>> than 1,000 commits!
>>>
>>> This release brings operational and performance improvements in Spark
>>> core including a new network transport subsytem designed for very
>>> large shuffles. Spark SQL introduces an API for external data sources
>>> along with Hive 13 support, dynamic partitioning, and the
>>> fixed-precision decimal type. MLlib adds a new pipeline-oriented
>>> package (spark.ml) for composing multiple algorithms. Spark Streaming
>>> adds a Python API and a write ahead log for fault tolerance. Finally,
>>> GraphX has graduated from alpha and introduces a stable API along with
>>> performance improvements.
>>>
>>> Visit the release notes [1] to read about the new features, or
>>> download [2] the release today.
>>>
>>> For errata in the contributions or release notes, please e-mail me
>>> *directly* (not on-list).
>>>
>>> Thanks to everyone involved in creating, testing, and documenting this
>>> release!
>>>
>>> [1] http://spark.apache.org/releases/spark-release-1-2-0.html
>>> [2] http://spark.apache.org/downloads.html
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>




Re: Confirming race condition in DagScheduler (NoSuchElementException)

2014-12-19 Thread thlee
any comments?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Confirming-race-condition-in-DagScheduler-NoSuchElementException-tp9798p9855.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Spark master OOMs with exception stack trace stored in JobProgressListener (SPARK-4906)

2014-12-19 Thread Mingyu Kim
Hi,

I just filed a bug, SPARK-4906, regarding Spark master OOMs. If I understand
correctly, the UI state for all running applications is kept in memory,
retained by JobProgressListener, and when there are a lot of exception stack
traces this UI state can take up a significant amount of heap. This seems
very bad, especially for long-running applications.

Can you correct me if I’m misunderstanding anything? If my understanding is 
correct, is there any work being done to make sure the UI states don’t grow 
indefinitely over time? Would it make sense to spill some states to disk or 
work with what spark.eventLog is doing so Spark master doesn’t need to keep 
things in memory?
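
In the meantime, a mitigation sketch that may help (assuming the
spark.ui.retainedJobs / spark.ui.retainedStages settings apply to the version
in use; please check the configuration docs before relying on them):

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap how much per-job / per-stage UI state the driver retains.
    // Property names and defaults should be verified for your Spark version.
    val conf = new SparkConf()
      .setAppName("long-running-app")
      .set("spark.ui.retainedJobs", "50")
      .set("spark.ui.retainedStages", "200")
    val sc = new SparkContext(conf)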

Thanks,
Mingyu


Re: Which committers care about Kafka?

2014-12-19 Thread Hari Shreedharan
Hi Dibyendu,

Thanks for the details on the implementation. But I still do not believe
that it gives no duplicates - what they achieve is that the same batch is
processed in exactly the same way every time (though it may be processed
more than once) - so it depends on the operation being idempotent. I believe
Trident uses ZK to keep track of the transactions - a batch can be processed
multiple times in failure scenarios (for example, the transaction is
processed, but before ZK is updated the machine fails, causing a "new" node
to process it again).

I don't think it is impossible to do this in Spark Streaming as well and
I'd be really interested in working on it at some point in the near future.

On Fri, Dec 19, 2014 at 1:44 AM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:

> Hi,
>
> Thanks to Jerry for mentioning the Kafka Spout for Trident. The Storm
> Trident has done the exact-once guarantee by processing the tuple in a
> batch  and assigning same transaction-id for a given batch . The replay for
> a given batch with a transaction-id will have exact same set of tuples and
> replay of batches happen in exact same order before the failure.
>
> Having this paradigm, if downstream system process data for a given batch
> for having a given transaction-id , and if during failure if same batch is
> again emitted , you can check if same transaction-id is already processed
> or not and hence can guarantee exact once semantics.
>
> And this can only be achieved in Spark if we use Low Level Kafka consumer
> API to process the offsets. This low level Kafka Consumer (
> https://github.com/dibbhatt/kafka-spark-consumer) has implemented the
> Spark Kafka consumer which uses Kafka Low Level APIs . All of the Kafka
> related logic has been taken from Storm-Kafka spout and which manages all
> Kafka re-balance and fault tolerant aspects and Kafka metadata managements.
>
> Presently this Consumer maintains that during Receiver failure, it will
> re-emit the exact same Block with same set of messages . Every message have
> the details of its partition, offset and topic related details which can
> tackle the SPARK-3146.
>
> As this Low Level consumer has complete control over the Kafka Offsets ,
> we can implement Trident like feature on top of it like having implement a
> transaction-id for a given block , and re-emit the same block with same set
> of message during Driver failure.
>
> Regards,
> Dibyendu
>
>
> On Fri, Dec 19, 2014 at 7:33 AM, Shao, Saisai 
> wrote:
>>
>> Hi all,
>>
>> I agree with Hari that Strong exact-once semantics is very hard to
>> guarantee, especially in the failure situation. From my understanding even
>> current implementation of ReliableKafkaReceiver cannot fully guarantee the
>> exact once semantics once failed, first is the ordering of data replaying
>> from last checkpoint, this is hard to guarantee when multiple partitions
>> are injected in; second is the design complexity of achieving this, you can
>> refer to the Kafka Spout in Trident, we have to dig into the very details
>> of Kafka metadata management system to achieve this, not to say rebalance
>> and fault-tolerance.
>>
>> Thanks
>> Jerry
>>
>> -Original Message-
>> From: Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com]
>> Sent: Friday, December 19, 2014 5:57 AM
>> To: Cody Koeninger
>> Cc: Hari Shreedharan; Patrick Wendell; dev@spark.apache.org
>> Subject: Re: Which committers care about Kafka?
>>
>> But idempotency is not that easy t achieve sometimes. A strong only once
>> semantic through a proper API would  be superuseful; but I'm not implying
>> this is easy to achieve.
>> On 18 Dec 2014 21:52, "Cody Koeninger"  wrote:
>>
>> > If the downstream store for the output data is idempotent or
>> > transactional, and that downstream store also is the system of record
>> > for kafka offsets, then you have exactly-once semantics.  Commit
>> > offsets with / after the data is stored.  On any failure, restart from
>> the last committed offsets.
>> >
>> > Yes, this approach is biased towards the etl-like use cases rather
>> > than near-realtime-analytics use cases.
>> >
>> > On Thu, Dec 18, 2014 at 3:27 PM, Hari Shreedharan <
>> > hshreedha...@cloudera.com
>> > > wrote:
>> > >
>> > > I get what you are saying. But getting exactly once right is an
>> > > extremely hard problem - especially in presence of failure. The
>> > > issue is failures
>> > can
>> > > happen in a bunch of places. For example, before the notification of
>> > > downstream store being successful reaches the receiver that updates
>> > > the offsets, the node fails. The store was successful, but
>> > > duplicates came in either way. This is something worth discussing by
>> > > itself - but without uuids etc this might not really be solved even
>> when you think it is.
>> > >
>> > > Anyway, I will look at the links. Even I am interested in all of the
>> > > features you mentioned - no HDFS WAL for Kafka and once-only
>> > > delivery,
>> > but

Spark Dev

2014-12-19 Thread Harikrishna Kamepalli
I am interested in contributing to Spark.


Re: Spark Dev

2014-12-19 Thread Sandy Ryza
Hi Harikrishna,

A good place to start is taking a look at the wiki page on contributing:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

-Sandy

On Fri, Dec 19, 2014 at 2:43 PM, Harikrishna Kamepalli <
harikrishna.kamepa...@gmail.com> wrote:
>
> i am interested to contribute to spark
>


Re: Which committers care about Kafka?

2014-12-19 Thread Sean McNamara
Please feel free to correct me if I’m wrong, but I think the exactly-once
Spark Streaming semantics can easily be solved using updateStateByKey. Make
the key going into updateStateByKey be a hash of the event, or pluck off some
uuid from the message.  The updateFunc would only emit the message if the key
did not exist, and the user has complete control over the window of time /
state lifecycle for detecting duplicates.  It also makes it really easy to
detect and take action (alert?) when you DO see a duplicate, or to make
memory tradeoffs within an error bound using a sketch algorithm.  The Kafka
simple consumer is insanely complex; if possible, I think it would be better
(and vastly more flexible) to get reliability using the primitives that Spark
so elegantly provides.
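
A rough sketch of that idea (assuming messages already carry a uuid; the
state lifecycle / timeout handling that would bound memory is omitted here):

    import org.apache.spark.streaming.StreamingContext._ // pair-DStream ops (needed on some versions)
    import org.apache.spark.streaming.dstream.DStream

    case class Seen(message: String, newThisBatch: Boolean)

    // Emit each message only the first time its uuid is seen. Note that
    // updateStateByKey requires checkpointing (ssc.checkpoint(...)) and that
    // this sketch keeps every uuid in state forever.
    def dedup(events: DStream[(String, String)]): DStream[String] = { // (uuid, message)
      events.updateStateByKey[Seen] { (fresh: Seq[String], prior: Option[Seen]) =>
        prior match {
          case Some(s) => Some(s.copy(newThisBatch = false)) // seen before: drop duplicates
          case None    => fresh.headOption.map(m => Seen(m, newThisBatch = true))
        }
      }
      .filter { case (_, seen) => seen.newThisBatch }
      .map { case (_, seen) => seen.message }
    }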

Cheers,

Sean


> On Dec 19, 2014, at 12:06 PM, Hari Shreedharan  
> wrote:
> 
> Hi Dibyendu,
> 
> Thanks for the details on the implementation. But I still do not believe
> that it is no duplicates - what they achieve is that the same batch is
> processed exactly the same way every time (but see it may be processed more
> than once) - so it depends on the operation being idempotent. I believe
> Trident uses ZK to keep track of the transactions - a batch can be
> processed multiple times in failure scenarios (for example, the transaction
> is processed but before ZK is updated the machine fails, causing a "new"
> node to process it again).
> 
> I don't think it is impossible to do this in Spark Streaming as well and
> I'd be really interested in working on it at some point in the near future.
> 
> On Fri, Dec 19, 2014 at 1:44 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
> 
>> Hi,
>> 
>> Thanks to Jerry for mentioning the Kafka Spout for Trident. The Storm
>> Trident has done the exact-once guarantee by processing the tuple in a
>> batch  and assigning same transaction-id for a given batch . The replay for
>> a given batch with a transaction-id will have exact same set of tuples and
>> replay of batches happen in exact same order before the failure.
>> 
>> Having this paradigm, if downstream system process data for a given batch
>> for having a given transaction-id , and if during failure if same batch is
>> again emitted , you can check if same transaction-id is already processed
>> or not and hence can guarantee exact once semantics.
>> 
>> And this can only be achieved in Spark if we use Low Level Kafka consumer
>> API to process the offsets. This low level Kafka Consumer (
>> https://github.com/dibbhatt/kafka-spark-consumer) has implemented the
>> Spark Kafka consumer which uses Kafka Low Level APIs . All of the Kafka
>> related logic has been taken from Storm-Kafka spout and which manages all
>> Kafka re-balance and fault tolerant aspects and Kafka metadata managements.
>> 
>> Presently this Consumer maintains that during Receiver failure, it will
>> re-emit the exact same Block with same set of messages . Every message have
>> the details of its partition, offset and topic related details which can
>> tackle the SPARK-3146.
>> 
>> As this Low Level consumer has complete control over the Kafka Offsets ,
>> we can implement Trident like feature on top of it like having implement a
>> transaction-id for a given block , and re-emit the same block with same set
>> of message during Driver failure.
>> 
>> Regards,
>> Dibyendu
>> 
>> 
>> On Fri, Dec 19, 2014 at 7:33 AM, Shao, Saisai 
>> wrote:
>>> 
>>> Hi all,
>>> 
>>> I agree with Hari that Strong exact-once semantics is very hard to
>>> guarantee, especially in the failure situation. From my understanding even
>>> current implementation of ReliableKafkaReceiver cannot fully guarantee the
>>> exact once semantics once failed, first is the ordering of data replaying
>>> from last checkpoint, this is hard to guarantee when multiple partitions
>>> are injected in; second is the design complexity of achieving this, you can
>>> refer to the Kafka Spout in Trident, we have to dig into the very details
>>> of Kafka metadata management system to achieve this, not to say rebalance
>>> and fault-tolerance.
>>> 
>>> Thanks
>>> Jerry
>>> 
>>> -Original Message-
>>> From: Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com]
>>> Sent: Friday, December 19, 2014 5:57 AM
>>> To: Cody Koeninger
>>> Cc: Hari Shreedharan; Patrick Wendell; dev@spark.apache.org
>>> Subject: Re: Which committers care about Kafka?
>>> 
>>> But idempotency is not that easy t achieve sometimes. A strong only once
>>> semantic through a proper API would  be superuseful; but I'm not implying
>>> this is easy to achieve.
>>> On 18 Dec 2014 21:52, "Cody Koeninger"  wrote:
>>> 
 If the downstream store for the output data is idempotent or
 transactional, and that downstream store also is the system of record
 for kafka offsets, then you have exactly-once semantics.  Commit
 offsets with / after the data is stored.  On any failure, restart from
>>> the last committed offsets.
>

Re: Nabble mailing list mirror errors: "This post has NOT been accepted by the mailing list yet"

2014-12-19 Thread Andy Konwinski
Yesterday, I changed the domain name in the mailing list archive settings
to remove ".incubator" so maybe it'll work now.

However, I also sent two emails about this through the nabble interface (in
this same thread) yesterday and they don't appear to have made it through
so not sure if it actually worked after all.

Andy

On Wed, Dec 17, 2014 at 1:09 PM, Josh Rosen  wrote:
>
> Yeah, it looks like messages that are successfully posted via Nabble end
> up on the Apache mailing list, but messages posted directly to Apache
> aren't mirrored to Nabble anymore because it's based off the incubator
> mailing list.  We should fix this so that Nabble posts to / archives the
> non-incubator list.
>
> On Sat, Dec 13, 2014 at 6:27 PM, Yana Kadiyska 
> wrote:
>>
>> Since you mentioned this, I had a related quandry recently -- it also
>> says that the forum archives "*u...@spark.incubator.apache.org
>> "/* *d...@spark.incubator.apache.org
>>  *respectively, yet the "Community page"
>> clearly says to email the @spark.apache.org list (but the nabble archive
>> is linked right there too). IMO even putting a clear explanation at the top
>>
>> "Posting here requires that you create an account via the UI. Your
>> message will be sent to both spark.incubator.apache.org and
>> spark.apache.org (if that is the case, i'm not sure which alias nabble
>> posts get sent to)" would make things a lot more clear.
>>
>> On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen  wrote:
>>>
>>> I've noticed that several users are attempting to post messages to
>>> Spark's user / dev mailing lists using the Nabble web UI (
>>> http://apache-spark-user-list.1001560.n3.nabble.com/).  However, there
>>> are many posts in Nabble that are not posted to the Apache lists and are
>>> flagged with "This post has NOT been accepted by the mailing list yet."
>>> errors.
>>>
>>> I suspect that the issue is that users are not completing the sign-up
>>> confirmation process (
>>> http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1),
>>> which is preventing their emails from being accepted by the mailing list.
>>>
>>> I wanted to mention this issue to the Spark community to see whether
>>> there are any good solutions to address this.  I have spoken to users who
>>> think that our mailing list is unresponsive / inactive because their
>>> un-posted messages haven't received any replies.
>>>
>>> - Josh
>>>
>>


Re: Nabble mailing list mirror errors: "This post has NOT been accepted by the mailing list yet"

2014-12-19 Thread Ted Yu
Andy:
I saw two emails from you from yesterday.

See this thread: http://search-hadoop.com/m/JW1q5opRsY1

Cheers

On Fri, Dec 19, 2014 at 12:51 PM, Andy Konwinski 
wrote:

> Yesterday, I changed the domain name in the mailing list archive settings
> to remove ".incubator" so maybe it'll work now.
>
> However, I also sent two emails about this through the nabble interface
> (in this same thread) yesterday and they don't appear to have made it
> through so not sure if it actually worked after all.
>
> Andy
>
> On Wed, Dec 17, 2014 at 1:09 PM, Josh Rosen  wrote:
>>
>> Yeah, it looks like messages that are successfully posted via Nabble end
>> up on the Apache mailing list, but messages posted directly to Apache
>> aren't mirrored to Nabble anymore because it's based off the incubator
>> mailing list.  We should fix this so that Nabble posts to / archives the
>> non-incubator list.
>>
>> On Sat, Dec 13, 2014 at 6:27 PM, Yana Kadiyska 
>> wrote:
>>>
>>> Since you mentioned this, I had a related quandry recently -- it also
>>> says that the forum archives "*u...@spark.incubator.apache.org
>>> "/* *d...@spark.incubator.apache.org
>>>  *respectively, yet the "Community
>>> page" clearly says to email the @spark.apache.org list (but the nabble
>>> archive is linked right there too). IMO even putting a clear explanation at
>>> the top
>>>
>>> "Posting here requires that you create an account via the UI. Your
>>> message will be sent to both spark.incubator.apache.org and
>>> spark.apache.org (if that is the case, i'm not sure which alias nabble
>>> posts get sent to)" would make things a lot more clear.
>>>
>>> On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen 
>>> wrote:

 I've noticed that several users are attempting to post messages to
 Spark's user / dev mailing lists using the Nabble web UI (
 http://apache-spark-user-list.1001560.n3.nabble.com/).  However, there
 are many posts in Nabble that are not posted to the Apache lists and are
 flagged with "This post has NOT been accepted by the mailing list yet."
 errors.

 I suspect that the issue is that users are not completing the sign-up
 confirmation process (
 http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1),
 which is preventing their emails from being accepted by the mailing list.

 I wanted to mention this issue to the Spark community to see whether
 there are any good solutions to address this.  I have spoken to users who
 think that our mailing list is unresponsive / inactive because their
 un-posted messages haven't received any replies.

 - Josh

>>>


Re: Which committers care about Kafka?

2014-12-19 Thread Cody Koeninger
The problems you guys are discussing come from trying to store state in
spark, so don't do that.  Spark isn't a distributed database.

Just map Kafka partitions directly to RDDs, let user code specify the
range of offsets explicitly, and let it be in charge of committing
offsets.

Using the simple consumer isn't that bad, I'm already using this in
production with the code I linked to, and tresata apparently has been as
well.  Again, for everyone saying this is impossible, have you read either
of those implementations and looked at the approach?



On Fri, Dec 19, 2014 at 2:27 PM, Sean McNamara 
wrote:

> Please feel free to correct me if I’m wrong, but I think the exactly once
> spark streaming semantics can easily be solved using updateStateByKey. Make
> the key going into updateStateByKey be a hash of the event, or pluck off
> some uuid from the message.  The updateFunc would only emit the message if
> the key did not exist, and the user has complete control over the window of
> time / state lifecycle for detecting duplicates.  It also makes it really
> easy to detect and take action (alert?) when you DO see a duplicate, or
> make memory tradeoffs within an error bound using a sketch algorithm.  The
> kafka simple consumer is insanely complex, if possible I think it would be
> better (and vastly more flexible) to get reliability using the primitives
> that spark so elegantly provides.
>
> Cheers,
>
> Sean
>
>
> > On Dec 19, 2014, at 12:06 PM, Hari Shreedharan <
> hshreedha...@cloudera.com> wrote:
> >
> > Hi Dibyendu,
> >
> > Thanks for the details on the implementation. But I still do not believe
> > that it is no duplicates - what they achieve is that the same batch is
> > processed exactly the same way every time (but see it may be processed
> more
> > than once) - so it depends on the operation being idempotent. I believe
> > Trident uses ZK to keep track of the transactions - a batch can be
> > processed multiple times in failure scenarios (for example, the
> transaction
> > is processed but before ZK is updated the machine fails, causing a "new"
> > node to process it again).
> >
> > I don't think it is impossible to do this in Spark Streaming as well and
> > I'd be really interested in working on it at some point in the near
> future.
> >
> > On Fri, Dec 19, 2014 at 1:44 AM, Dibyendu Bhattacharya <
> > dibyendu.bhattach...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Thanks to Jerry for mentioning the Kafka Spout for Trident. The Storm
> >> Trident has done the exact-once guarantee by processing the tuple in a
> >> batch  and assigning same transaction-id for a given batch . The replay
> for
> >> a given batch with a transaction-id will have exact same set of tuples
> and
> >> replay of batches happen in exact same order before the failure.
> >>
> >> Having this paradigm, if downstream system process data for a given
> batch
> >> for having a given transaction-id , and if during failure if same batch
> is
> >> again emitted , you can check if same transaction-id is already
> processed
> >> or not and hence can guarantee exact once semantics.
> >>
> >> And this can only be achieved in Spark if we use Low Level Kafka
> consumer
> >> API to process the offsets. This low level Kafka Consumer (
> >> https://github.com/dibbhatt/kafka-spark-consumer) has implemented the
> >> Spark Kafka consumer which uses Kafka Low Level APIs . All of the Kafka
> >> related logic has been taken from Storm-Kafka spout and which manages
> all
> >> Kafka re-balance and fault tolerant aspects and Kafka metadata
> managements.
> >>
> >> Presently this Consumer maintains that during Receiver failure, it will
> >> re-emit the exact same Block with same set of messages . Every message
> have
> >> the details of its partition, offset and topic related details which can
> >> tackle the SPARK-3146.
> >>
> >> As this Low Level consumer has complete control over the Kafka Offsets ,
> >> we can implement Trident like feature on top of it like having
> implement a
> >> transaction-id for a given block , and re-emit the same block with same
> set
> >> of message during Driver failure.
> >>
> >> Regards,
> >> Dibyendu
> >>
> >>
> >> On Fri, Dec 19, 2014 at 7:33 AM, Shao, Saisai 
> >> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I agree with Hari that Strong exact-once semantics is very hard to
> >>> guarantee, especially in the failure situation. From my understanding
> even
> >>> current implementation of ReliableKafkaReceiver cannot fully guarantee
> the
> >>> exact once semantics once failed, first is the ordering of data
> replaying
> >>> from last checkpoint, this is hard to guarantee when multiple
> partitions
> >>> are injected in; second is the design complexity of achieving this,
> you can
> >>> refer to the Kafka Spout in Trident, we have to dig into the very
> details
> >>> of Kafka metadata management system to achieve this, not to say
> rebalance
> >>> and fault-tolerance.
> >>>
> >>> Thanks
> >>> Jerry
> >>>
> >>> -Original Me

Re: Which committers care about Kafka?

2014-12-19 Thread Hari Shreedharan
Can you explain your basic algorithm for the once-only delivery? It is quite a
bit of very Kafka-specific code that would take more time to read than I can
currently afford. If you can explain your algorithm a bit, it might help.




Thanks, Hari

On Fri, Dec 19, 2014 at 1:48 PM, Cody Koeninger 
wrote:

> The problems you guys are discussing come from trying to store state in
> spark, so don't do that.  Spark isn't a distributed database.
> Just map kafka partitions directly to rdds, llet user code specify the
> range of offsets explicitly, and let them be in charge of committing
> offsets.
> Using the simple consumer isn't that bad, I'm already using this in
> production with the code I linked to, and tresata apparently has been as
> well.  Again, for everyone saying this is impossible, have you read either
> of those implementations and looked at the approach?
> On Fri, Dec 19, 2014 at 2:27 PM, Sean McNamara 
> wrote:
>> Please feel free to correct me if I’m wrong, but I think the exactly once
>> spark streaming semantics can easily be solved using updateStateByKey. Make
>> the key going into updateStateByKey be a hash of the event, or pluck off
>> some uuid from the message.  The updateFunc would only emit the message if
>> the key did not exist, and the user has complete control over the window of
>> time / state lifecycle for detecting duplicates.  It also makes it really
>> easy to detect and take action (alert?) when you DO see a duplicate, or
>> make memory tradeoffs within an error bound using a sketch algorithm.  The
>> kafka simple consumer is insanely complex, if possible I think it would be
>> better (and vastly more flexible) to get reliability using the primitives
>> that spark so elegantly provides.
>>
>> Cheers,
>>
>> Sean
>>
>>
>> > On Dec 19, 2014, at 12:06 PM, Hari Shreedharan <
>> hshreedha...@cloudera.com> wrote:
>> >
>> > Hi Dibyendu,
>> >
>> > Thanks for the details on the implementation. But I still do not believe
>> > that it is no duplicates - what they achieve is that the same batch is
>> > processed exactly the same way every time (but see it may be processed
>> more
>> > than once) - so it depends on the operation being idempotent. I believe
>> > Trident uses ZK to keep track of the transactions - a batch can be
>> > processed multiple times in failure scenarios (for example, the
>> transaction
>> > is processed but before ZK is updated the machine fails, causing a "new"
>> > node to process it again).
>> >
>> > I don't think it is impossible to do this in Spark Streaming as well and
>> > I'd be really interested in working on it at some point in the near
>> future.
>> >
>> > On Fri, Dec 19, 2014 at 1:44 AM, Dibyendu Bhattacharya <
>> > dibyendu.bhattach...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> Thanks to Jerry for mentioning the Kafka Spout for Trident. The Storm
>> >> Trident has done the exact-once guarantee by processing the tuple in a
>> >> batch  and assigning same transaction-id for a given batch . The replay
>> for
>> >> a given batch with a transaction-id will have exact same set of tuples
>> and
>> >> replay of batches happen in exact same order before the failure.
>> >>
>> >> Having this paradigm, if downstream system process data for a given
>> batch
>> >> for having a given transaction-id , and if during failure if same batch
>> is
>> >> again emitted , you can check if same transaction-id is already
>> processed
>> >> or not and hence can guarantee exact once semantics.
>> >>
>> >> And this can only be achieved in Spark if we use Low Level Kafka
>> consumer
>> >> API to process the offsets. This low level Kafka Consumer (
>> >> https://github.com/dibbhatt/kafka-spark-consumer) has implemented the
>> >> Spark Kafka consumer which uses Kafka Low Level APIs . All of the Kafka
>> >> related logic has been taken from Storm-Kafka spout and which manages
>> all
>> >> Kafka re-balance and fault tolerant aspects and Kafka metadata
>> managements.
>> >>
>> >> Presently this Consumer maintains that during Receiver failure, it will
>> >> re-emit the exact same Block with same set of messages . Every message
>> have
>> >> the details of its partition, offset and topic related details which can
>> >> tackle the SPARK-3146.
>> >>
>> >> As this Low Level consumer has complete control over the Kafka Offsets ,
>> >> we can implement Trident like feature on top of it like having
>> implement a
>> >> transaction-id for a given block , and re-emit the same block with same
>> set
>> >> of message during Driver failure.
>> >>
>> >> Regards,
>> >> Dibyendu
>> >>
>> >>
>> >> On Fri, Dec 19, 2014 at 7:33 AM, Shao, Saisai 
>> >> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> I agree with Hari that Strong exact-once semantics is very hard to
>> >>> guarantee, especially in the failure situation. From my understanding
>> even
>> >>> current implementation of ReliableKafkaReceiver cannot fully guarantee
>> the
>> >>> exact once semantics once failed, first is the ordering of data
>> repl

Re: Which committers care about Kafka?

2014-12-19 Thread Cody Koeninger
That KafkaRDD code is dead simple.

Given a user specified map

(topic1, partition0) -> (startingOffset, endingOffset)
(topic1, partition1) -> (startingOffset, endingOffset)
...

turn each one of those entries into a partition of an RDD, using the simple
consumer.  That's it.  No recovery logic, no state, nothing - for any
failures, bail on the RDD and let it retry.  Spark stays out of the business
of being a distributed database.
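
Roughly, the shape of it (a conceptual sketch, not the actual code from the
repos linked earlier):

    import org.apache.spark.Partition

    case class OffsetRange(topic: String, partition: Int,
                           fromOffset: Long, untilOffset: Long)

    class KafkaBatchPartition(val index: Int, val range: OffsetRange)
      extends Partition

    // One user-supplied offset range == one RDD partition. compute() for a
    // partition would open a simple consumer and fetch exactly
    // [fromOffset, untilOffset); a failed task just refetches the same range.
    def toPartitions(ranges: Seq[OffsetRange]): Array[Partition] =
      ranges.zipWithIndex.map { case (r, i) =>
        new KafkaBatchPartition(i, r): Partition
      }.toArray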

The client code does any transformation it wants, then stores the data and
offsets.  There are two ways of doing this, either based on idempotence or
a transactional data store.

For idempotent stores:

1. manipulate data
2. save data to store
3. save ending offsets to the same store

If you fail between 2 and 3, the offsets haven't been stored, you start
again at the same beginning offsets, do the same calculations in the same
order, overwrite the same data, all is good.
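
In code the idempotent variant could look something like this (the store and
the transform step are stand-ins for whatever the user actually does;
OffsetRange is the little case class sketched above):

    // Hypothetical upsert-style store: writing the same keys twice is harmless.
    trait IdempotentStore {
      def upsert(rows: Seq[(String, String)]): Unit
      def saveOffsets(ranges: Seq[OffsetRange]): Unit
    }

    // Stand-in for the user's deterministic computation over the fetched messages.
    def transform(ranges: Seq[OffsetRange]): Seq[(String, String)] = ???

    def runBatch(ranges: Seq[OffsetRange], store: IdempotentStore): Unit = {
      val rows = transform(ranges)   // 1. manipulate data
      store.upsert(rows)             // 2. save data to store
      store.saveOffsets(ranges)      // 3. save ending offsets last
      // A crash between 2 and 3 leaves the old offsets in place, so the same
      // ranges are recomputed and the same rows are overwritten on restart.
    }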


For transactional stores:

1. manipulate data
2. begin transaction
3. save data to the store
4. save offsets
5. commit transaction

If you fail before 5, the transaction rolls back.  To make this less
heavyweight, you can write the data outside the transaction and then update
a pointer to the current data inside the transaction.
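
And the transactional variant as a plain-JDBC sketch (table and column names
are made up; OffsetRange is again the case class from the sketch above):

    import java.sql.DriverManager

    def saveTransactionally(jdbcUrl: String,
                            rows: Seq[(String, String)],
                            ranges: Seq[OffsetRange]): Unit = {
      val conn = DriverManager.getConnection(jdbcUrl)
      conn.setAutoCommit(false)                        // 2. begin transaction
      try {
        val data = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
        rows.foreach { case (k, v) =>                  // 3. save data to the store
          data.setString(1, k); data.setString(2, v); data.executeUpdate()
        }
        val off = conn.prepareStatement(
          "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND part = ?")
        ranges.foreach { r =>                          // 4. save offsets
          off.setLong(1, r.untilOffset); off.setString(2, r.topic)
          off.setInt(3, r.partition); off.executeUpdate()
        }
        conn.commit()                                  // 5. commit transaction
      } catch {
        case e: Exception => conn.rollback(); throw e  // failure before 5 rolls back
      } finally {
        conn.close()
      }
    }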


Again, spark has nothing much to do with guaranteeing exactly once.  In
fact, the current streaming api actively impedes my ability to do the
above.  I'm just suggesting providing an api that doesn't get in the way of
exactly-once.





On Fri, Dec 19, 2014 at 3:57 PM, Hari Shreedharan  wrote:

> Can you explain your basic algorithm for the once-only-delivery? It is
> quite a bit of very Kafka-specific code, that would take more time to read
> than I can currently afford? If you can explain your algorithm a bit, it
> might help.
>
> Thanks,
> Hari
>
>
> On Fri, Dec 19, 2014 at 1:48 PM, Cody Koeninger 
> wrote:
>
>>
>> The problems you guys are discussing come from trying to store state in
>> spark, so don't do that.  Spark isn't a distributed database.
>>
>> Just map kafka partitions directly to rdds, llet user code specify the
>> range of offsets explicitly, and let them be in charge of committing
>> offsets.
>>
>> Using the simple consumer isn't that bad, I'm already using this in
>> production with the code I linked to, and tresata apparently has been as
>> well.  Again, for everyone saying this is impossible, have you read either
>> of those implementations and looked at the approach?
>>
>>
>>
>> On Fri, Dec 19, 2014 at 2:27 PM, Sean McNamara <
>> sean.mcnam...@webtrends.com> wrote:
>>
>>> Please feel free to correct me if I’m wrong, but I think the exactly
>>> once spark streaming semantics can easily be solved using updateStateByKey.
>>> Make the key going into updateStateByKey be a hash of the event, or pluck
>>> off some uuid from the message.  The updateFunc would only emit the message
>>> if the key did not exist, and the user has complete control over the window
>>> of time / state lifecycle for detecting duplicates.  It also makes it
>>> really easy to detect and take action (alert?) when you DO see a duplicate,
>>> or make memory tradeoffs within an error bound using a sketch algorithm.
>>> The kafka simple consumer is insanely complex, if possible I think it would
>>> be better (and vastly more flexible) to get reliability using the
>>> primitives that spark so elegantly provides.
>>>
>>> Cheers,
>>>
>>> Sean
>>>
>>>
>>> > On Dec 19, 2014, at 12:06 PM, Hari Shreedharan <
>>> hshreedha...@cloudera.com> wrote:
>>> >
>>> > Hi Dibyendu,
>>> >
>>> > Thanks for the details on the implementation. But I still do not
>>> believe
>>> > that it is no duplicates - what they achieve is that the same batch is
>>> > processed exactly the same way every time (but see it may be processed
>>> more
>>> > than once) - so it depends on the operation being idempotent. I believe
>>> > Trident uses ZK to keep track of the transactions - a batch can be
>>> > processed multiple times in failure scenarios (for example, the
>>> transaction
>>> > is processed but before ZK is updated the machine fails, causing a
>>> "new"
>>> > node to process it again).
>>> >
>>> > I don't think it is impossible to do this in Spark Streaming as well
>>> and
>>> > I'd be really interested in working on it at some point in the near
>>> future.
>>> >
>>> > On Fri, Dec 19, 2014 at 1:44 AM, Dibyendu Bhattacharya <
>>> > dibyendu.bhattach...@gmail.com> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> Thanks to Jerry for mentioning the Kafka Spout for Trident. The Storm
>>> >> Trident has done the exact-once guarantee by processing the tuple in a
>>> >> batch  and assigning same transaction-id for a given batch . The
>>> replay for
>>> >> a given batch with a transaction-id will have exact same set of
>>> tuples and
>>> >> replay of batches happen in exact same order before the failure.
>>> >>
>>> >> Having this paradigm, if downstream system process data for a given
>>> batch
>>> >> for having a given transaction-id , and if during fai

Re: Which committers care about Kafka?

2014-12-19 Thread Koert Kuipers
yup, we at tresata do the idempotent store the same way. very simple
approach.

On Fri, Dec 19, 2014 at 5:32 PM, Cody Koeninger  wrote:
>
> That KafkaRDD code is dead simple.
>
> Given a user specified map
>
> (topic1, partition0) -> (startingOffset, endingOffset)
> (topic1, partition1) -> (startingOffset, endingOffset)
> ...
> turn each one of those entries into a partition of an rdd, using the simple
> consumer.
> That's it.  No recovery logic, no state, nothing - for any failures, bail
> on the rdd and let it retry.
> Spark stays out of the business of being a distributed database.
>
> The client code does any transformation it wants, then stores the data and
> offsets.  There are two ways of doing this, either based on idempotence or
> a transactional data store.
>
> For idempotent stores:
>
> 1.manipulate data
> 2.save data to store
> 3.save ending offsets to the same store
>
> If you fail between 2 and 3, the offsets haven't been stored, you start
> again at the same beginning offsets, do the same calculations in the same
> order, overwrite the same data, all is good.
>
>
> For transactional stores:
>
> 1. manipulate data
> 2. begin transaction
> 3. save data to the store
> 4. save offsets
> 5. commit transaction
>
> If you fail before 5, the transaction rolls back.  To make this less
> heavyweight, you can write the data outside the transaction and then update
> a pointer to the current data inside the transaction.
>
>
> Again, spark has nothing much to do with guaranteeing exactly once.  In
> fact, the current streaming api actively impedes my ability to do the
> above.  I'm just suggesting providing an api that doesn't get in the way of
> exactly-once.
>
>
>
>
>
> On Fri, Dec 19, 2014 at 3:57 PM, Hari Shreedharan <
> hshreedha...@cloudera.com
> > wrote:
>
> > Can you explain your basic algorithm for the once-only-delivery? It is
> > quite a bit of very Kafka-specific code, that would take more time to
> read
> > than I can currently afford? If you can explain your algorithm a bit, it
> > might help.
> >
> > Thanks,
> > Hari
> >
> >
> > On Fri, Dec 19, 2014 at 1:48 PM, Cody Koeninger 
> > wrote:
> >
> >>
> >> The problems you guys are discussing come from trying to store state in
> >> spark, so don't do that.  Spark isn't a distributed database.
> >>
> >> Just map kafka partitions directly to rdds, llet user code specify the
> >> range of offsets explicitly, and let them be in charge of committing
> >> offsets.
> >>
> >> Using the simple consumer isn't that bad, I'm already using this in
> >> production with the code I linked to, and tresata apparently has been as
> >> well.  Again, for everyone saying this is impossible, have you read
> either
> >> of those implementations and looked at the approach?
> >>
> >>
> >>
> >> On Fri, Dec 19, 2014 at 2:27 PM, Sean McNamara <
> >> sean.mcnam...@webtrends.com> wrote:
> >>
> >>> Please feel free to correct me if I’m wrong, but I think the exactly
> >>> once spark streaming semantics can easily be solved using
> updateStateByKey.
> >>> Make the key going into updateStateByKey be a hash of the event, or
> pluck
> >>> off some uuid from the message.  The updateFunc would only emit the
> message
> >>> if the key did not exist, and the user has complete control over the
> window
> >>> of time / state lifecycle for detecting duplicates.  It also makes it
> >>> really easy to detect and take action (alert?) when you DO see a
> duplicate,
> >>> or make memory tradeoffs within an error bound using a sketch
> algorithm.
> >>> The kafka simple consumer is insanely complex, if possible I think it
> would
> >>> be better (and vastly more flexible) to get reliability using the
> >>> primitives that spark so elegantly provides.
> >>>
> >>> Cheers,
> >>>
> >>> Sean
> >>>
> >>>
> >>> > On Dec 19, 2014, at 12:06 PM, Hari Shreedharan <
> >>> hshreedha...@cloudera.com> wrote:
> >>> >
> >>> > Hi Dibyendu,
> >>> >
> >>> > Thanks for the details on the implementation. But I still do not
> >>> believe
> >>> > that it is no duplicates - what they achieve is that the same batch
> is
> >>> > processed exactly the same way every time (but see it may be
> processed
> >>> more
> >>> > than once) - so it depends on the operation being idempotent. I
> believe
> >>> > Trident uses ZK to keep track of the transactions - a batch can be
> >>> > processed multiple times in failure scenarios (for example, the
> >>> transaction
> >>> > is processed but before ZK is updated the machine fails, causing a
> >>> "new"
> >>> > node to process it again).
> >>> >
> >>> > I don't think it is impossible to do this in Spark Streaming as well
> >>> and
> >>> > I'd be really interested in working on it at some point in the near
> >>> future.
> >>> >
> >>> > On Fri, Dec 19, 2014 at 1:44 AM, Dibyendu Bhattacharya <
> >>> > dibyendu.bhattach...@gmail.com> wrote:
> >>> >
> >>> >> Hi,
> >>> >>
> >>> >> Thanks to Jerry for mentioning the Kafka Spout for Trident. The
> Storm
> >>> >> Trident has done the exact-once

EndpointWriter : Dropping message failure ReliableDeliverySupervisor errors...

2014-12-19 Thread jay vyas
Hi Spark.  I'm trying to understand the Akka debug messages when networking
doesn't work properly.  Any hints would be great on this.

SIMPLE TESTS I RAN

- I tried a ping; it works.
- I tried a telnet from the slave to port 7077 of the master; that also works.

LOGS

1) On the master I see this WARN log buried:

ReliableDeliverySupervisor: Association with remote system
[akka.tcp://sparkWorker@s2.docker:45477] has failed, address is now gated
for [500] ms  Reason is: [Disassociated].

2) I also see a periodic, repeated ERROR message :

 ERROR EndpointWriter: dropping message [class
akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://
sparkMaster@172.17.0.12:7077


Any idea what these messages mean?  From what I can tell, I can telnet from
s2.docker to my master server.

Any thoughts on debugging this further would be appreciated! I'm out of
ideas for the time being.

-- 
jay vyas