[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113648731
  
Could you update the title to get the ordering right? [JIRA][Streaming][Kafka].
Otherwise LGTM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113648473
  
That is a good point. Then make two different examples, one for Java and one for
Scala.
BTW, I would like to have this for 1.4.1, so gotta merge soon. So I think I
will merge this, and you can make another PR with the same JIRA to add the
new examples.









[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread koeninger
Github user koeninger commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113632591
  
The word count examples don't have any need to access offsets.

Wouldn't it be better to have separate examples? I don't want someone
thinking they need to do all this typecast hoop-jumping just to get a word
count.







[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113630096
  
So the JIRA was about updating the examples, actually. It's great that you
have updated the docs AND the tests, but it would be ideal if the examples
DirectKafkaWordCount and JavaDirectKafkaWordCount were updated to show how the
offset ranges can be accessed. Since you have updated the tests, mind updating
the examples as well?





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113580623
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113580590
  
  [Test build #35280 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35280/console)
 for   PR 6863 at commit 
[`26a06bd`](https://github.com/apache/spark/commit/26a06bd0fed6952e1c507800ca9911349ba42653).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113580317
  
  [Test build #35279 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35279/console)
 for   PR 6863 at commit 
[`3744492`](https://github.com/apache/spark/commit/37444921ef4f26c6b1d2a39aaa65934f9240485c).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113580339
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113563803
  
  [Test build #35280 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35280/consoleFull)
 for   PR 6863 at commit 
[`26a06bd`](https://github.com/apache/spark/commit/26a06bd0fed6952e1c507800ca9911349ba42653).





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113563385
  
Merged build started.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113563346
  
 Merged build triggered.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113562822
  
  [Test build #35279 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35279/consoleFull)
 for   PR 6863 at commit 
[`3744492`](https://github.com/apache/spark/commit/37444921ef4f26c6b1d2a39aaa65934f9240485c).





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113562313
  
Merged build started.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113562279
  
 Merged build triggered.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-18 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113315111
  
Could you update the example to something that accesses the offsets in the 
foreachRDD? 
```
var offsetRanges = Array.empty[OffsetRange]
...
kafkaStream.transform { rdd =>
   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd
}.map { ... }.reduceByKey { ... }.foreachRDD { rdd =>
   // use offsetRanges to do something, maybe print the ranges
}
```

Do the same for both Java and Scala.
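
For reference, a fuller version of this pattern might look like the following
Scala sketch (hedged: the broker address, topic name, and object name are all
illustrative, and it assumes the Spark 1.3+ `spark-streaming-kafka` artifact is
on the classpath):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object DirectKafkaOffsetsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectKafkaOffsetsSketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical broker list and topic; replace with your own.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("test")

    val directKafkaStream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    // Grab the offset ranges in the FIRST method called on the stream...
    var offsetRanges = Array.empty[OffsetRange]

    directKafkaStream.transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd // pass the RDD through unchanged
    }.map(_._2)          // message values
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .foreachRDD { rdd =>
        // ...and use them later, e.g. print the ranges for this batch.
        offsetRanges.foreach { o =>
          println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that the `transform` closure runs on the driver at batch-generation time,
so `offsetRanges` is visible to the later `foreachRDD` body, which also runs on
the driver.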






[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/6863#discussion_r32788398
  
--- Diff: docs/streaming-kafka-integration.md ---
@@ -161,6 +161,8 @@ Next, we discuss how to use this approach in your 
streaming application.
 
You can use this to update Zookeeper yourself if you want 
Zookeeper-based Kafka monitoring tools to show progress of the streaming 
application.
 
-   Another thing to note is that since this approach does not use 
Receivers, the standard receiver-related (that is, 
[configurations](configuration.html) of the form `spark.streaming.receiver.*` ) 
will not apply to the input DStreams created by this approach (will apply to 
other input DStreams though). Instead, use the 
[configurations](configuration.html) `spark.streaming.kafka.*`. An important 
one is `spark.streaming.kafka.maxRatePerPartition` which is the maximum rate at 
which each Kafka partition will be read by this direct API. 
+   Note that the typecast to HasOffsetRanges will only succeed if it is 
done in the first method called on the directKafkaStream, not later down a 
chain of methods. You can use transform() instead of foreachRDD() as your first 
method call in order to access offsets, then call further Spark methods. 
However, be aware that the one-to-one mapping between RDD partition and Kafka 
partition does not remain after any methods that shuffle or repartition, e.g. 
reduceByKey() or window().
--- End diff --

Good addition. :)
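
To make the constraint in that paragraph concrete, here is a minimal Scala
sketch (assuming the Spark 1.3+ streaming-kafka API; `directKafkaStream` is an
illustrative name) of where the cast succeeds and where it fails:

```scala
import org.apache.spark.streaming.kafka.HasOffsetRanges

// Succeeds: the cast happens in the first operation on the direct stream,
// while each batch's RDD is still a KafkaRDD implementing HasOffsetRanges.
directKafkaStream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic} ${r.partition} ${r.fromOffset} ${r.untilOffset}"))
}

// Fails at runtime with a ClassCastException: after map/reduceByKey the RDD
// handed to foreachRDD is a shuffled RDD, not a KafkaRDD, and the one-to-one
// RDD-partition-to-Kafka-partition mapping is gone.
directKafkaStream.map(_._2).reduceByKey(_ + _).foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
}
```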





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/6863#discussion_r32788358
  
--- Diff: docs/streaming-kafka-integration.md ---
@@ -74,15 +74,15 @@ Next, we discuss how to use this approach in your 
streaming application.
[Maven 
repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kafka-assembly_2.10%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22)
 and add it to `spark-submit` with `--jars`.
 
 ## Approach 2: Direct Approach (No Receivers)
-This is a new receiver-less "direct" approach has been introduced in Spark 
1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to 
receive data, this approach periodically queries Kafka for the latest offsets 
in each topic+partition, and accordingly defines the offset ranges to process 
in each batch. When the jobs to process the data are launched, Kafka's simple 
consumer API is used to read the defined ranges of offsets from Kafka (similar 
to read files from a file system). Note that this is an experimental feature in 
Spark 1.3 and is only available in the Scala and Java API.
+This new receiver-less "direct" approach has been introduced in Spark 1.3 
to ensure stronger end-to-end guarantees. Instead of using receivers to receive 
data, this approach periodically queries Kafka for the latest offsets in each 
topic+partition, and accordingly defines the offset ranges to process in each 
batch. When the jobs to process the data are launched, Kafka's simple consumer 
API is used to read the defined ranges of offsets from Kafka (similar to read 
files from a file system). Note that this is an experimental feature introduced 
in Spark 1.3 for the Scala and Java API. Spark 1.4 added a Python API, but it 
is not yet at full feature parity.
 
-This approach has the following advantages over the received-based 
approach (i.e. Approach 1).
+This approach has the following advantages over the receiver-based 
approach (i.e. Approach 1).
 
-- *Simplified Parallelism:* No need to create multiple input Kafka streams 
and union-ing them. With `directStream`, Spark Streaming will create as many 
RDD partitions as there is Kafka partitions to consume, which will all read 
data from Kafka in parallel. So there is one-to-one mapping between Kafka and 
RDD partitions, which is easier to understand and tune.
+- *Simplified Parallelism:* No need to create multiple input Kafka streams 
and union them. With `directStream`, Spark Streaming will create as many RDD 
partitions as there are Kafka partitions to consume, which will all read data 
from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD 
partitions, which is easier to understand and tune.
 
-- *Efficiency:* Achieving zero-data loss in the first approach required 
the data to be stored in a Write Ahead Log, which further replicated the data. 
This is actually inefficient as the data effectively gets replicated twice - 
once by Kafka, and a second time by the Write Ahead Log. This second approach 
eliminate the problem as there is no receiver, and hence no need for Write 
Ahead Logs.
+- *Efficiency:* Achieving zero-data loss in the first approach required 
the data to be stored in a Write Ahead Log, which further replicated the data. 
This is actually inefficient as the data effectively gets replicated twice - 
once by Kafka, and a second time by the Write Ahead Log. This second approach 
eliminates the problem as there is no receiver, and hence no need for Write 
Ahead Logs. As long as you have sufficient Kafka retention, messages can be 
recovered from Kafka.
 
-- *Exactly-once semantics:* The first approach uses Kafka's high level API 
to store consumed offsets in Zookeeper. This is traditionally the way to 
consume data from Kafka. While this approach (in combination with write ahead 
logs) can ensure zero data loss (i.e. at-least once semantics), there is a 
small chance some records may get consumed twice under some failures. This 
occurs because of inconsistencies between data reliably received by Spark 
Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we 
use simple Kafka API that does not use Zookeeper and offsets tracked only by 
Spark Streaming within its checkpoints. This eliminates inconsistencies between 
Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark 
Streaming effectively exactly once despite failures.
+- *Exactly-once semantics:* The first approach uses Kafka's high level API 
to store consumed offsets in Zookeeper. This is traditionally the way to 
consume data from Kafka. While this approach (in combination with write ahead 
logs) can ensure zero data loss (i.e. at-least once semantics), there is a 
small chance some records may get consumed twice under some failures. This 
occurs because of inconsistencies between data reliably received by Spark 
Streaming and of

[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113028325
  
  [Test build #35087 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35087/console)
 for   PR 6863 at commit 
[`b108c9d`](https://github.com/apache/spark/commit/b108c9dc75d0c190af130b5c3a590e797589ff8f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113028336
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread koeninger
Github user koeninger commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113022865
  
Yeah, the spacing in that document in general is a mess (a mix of tabs and
spaces, some 2 spaces between sentences, etc). I cleaned it up somewhat. Also
further fixed the Scala/Java offset ranges examples; the Java one is pretty
much a copy-and-paste from the test at this point.







[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113022711
  
  [Test build #35087 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35087/consoleFull)
 for   PR 6863 at commit 
[`b108c9d`](https://github.com/apache/spark/commit/b108c9dc75d0c190af130b5c3a590e797589ff8f).





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113022638
  
 Merged build triggered.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113022645
  
Merged build started.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-113014760
  
If the Java example is updated to access the offsets, then the Scala
example should also be updated.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/6863#discussion_r32694426
  
--- Diff: docs/streaming-kafka-integration.md ---
@@ -74,15 +74,15 @@ Next, we discuss how to use this approach in your 
streaming application.
[Maven 
repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kafka-assembly_2.10%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22)
 and add it to `spark-submit` with `--jars`.
 
 ## Approach 2: Direct Approach (No Receivers)
-This is a new receiver-less "direct" approach has been introduced in Spark 
1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to 
receive data, this approach periodically queries Kafka for the latest offsets 
in each topic+partition, and accordingly defines the offset ranges to process 
in each batch. When the jobs to process the data are launched, Kafka's simple 
consumer API is used to read the defined ranges of offsets from Kafka (similar 
to read files from a file system). Note that this is an experimental feature in 
Spark 1.3 and is only available in the Scala and Java API.
+This new receiver-less "direct" approach has been introduced in Spark 1.3 
to ensure stronger end-to-end guarantees. Instead of using receivers to receive 
data, this approach periodically queries Kafka for the latest offsets in each 
topic+partition, and accordingly defines the offset ranges to process in each 
batch. When the jobs to process the data are launched, Kafka's simple consumer 
API is used to read the defined ranges of offsets from Kafka (similar to read 
files from a file system). Note that this is an experimental feature introduced 
in Spark 1.3 for the Scala and Java API.  Spark 1.4 added a Python API, but it 
is not yet at full feature parity.
 
-This approach has the following advantages over the received-based 
approach (i.e. Approach 1).
+This approach has the following advantages over the receiver-based 
approach (i.e. Approach 1).
 
-- *Simplified Parallelism:* No need to create multiple input Kafka streams 
and union-ing them. With `directStream`, Spark Streaming will create as many 
RDD partitions as there is Kafka partitions to consume, which will all read 
data from Kafka in parallel. So there is one-to-one mapping between Kafka and 
RDD partitions, which is easier to understand and tune.
+- *Simplified Parallelism:* No need to create multiple input Kafka streams 
and union them. With `directStream`, Spark Streaming will create as many RDD 
partitions as there are Kafka partitions to consume, which will all read data 
from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD 
partitions, which is easier to understand and tune.
 
-- *Efficiency:* Achieving zero-data loss in the first approach required 
the data to be stored in a Write Ahead Log, which further replicated the data. 
This is actually inefficient as the data effectively gets replicated twice - 
once by Kafka, and a second time by the Write Ahead Log. This second approach 
eliminate the problem as there is no receiver, and hence no need for Write 
Ahead Logs.
+- *Efficiency:* Achieving zero-data loss in the first approach required 
the data to be stored in a Write Ahead Log, which further replicated the data. 
This is actually inefficient as the data effectively gets replicated twice - 
once by Kafka, and a second time by the Write Ahead Log. This second approach 
eliminates the problem as there is no receiver, and hence no need for Write 
Ahead Logs.  As long as you have sufficient Kafka retention, messages can be 
recovered from Kafka.
 
-- *Exactly-once semantics:* The first approach uses Kafka's high level API 
to store consumed offsets in Zookeeper. This is traditionally the way to 
consume data from Kafka. While this approach (in combination with write ahead 
logs) can ensure zero data loss (i.e. at-least once semantics), there is a 
small chance some records may get consumed twice under some failures. This 
occurs because of inconsistencies between data reliably received by Spark 
Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we 
use simple Kafka API that does not use Zookeeper and offsets tracked only by 
Spark Streaming within its checkpoints. This eliminates inconsistencies between 
Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark 
Streaming effectively exactly once despite failures.
+- *Exactly-once semantics:* The first approach uses Kafka's high level API 
to store consumed offsets in Zookeeper. This is traditionally the way to 
consume data from Kafka. While this approach (in combination with write ahead 
logs) can ensure zero data loss (i.e. at-least once semantics), there is a 
small chance some records may get consumed twice under some failures. This 
occurs because of inconsistencies between data reliably received by Spark 
Streaming and 

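The per-batch bookkeeping described in the quoted doc text — query Kafka for the latest offsets, then define the exact offset ranges to process — can be sketched stand-alone. This is a minimal model, not Spark code: the `OffsetRange` shape mirrors Spark's `org.apache.spark.streaming.kafka.OffsetRange`, but the `"topic-partition"` string keys and the `nextBatchRanges` helper are hypothetical illustrations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of what the direct (receiver-less) approach does each
// batch: compare offsets already processed against the latest offsets Kafka
// reports, and define the exact range to read per topic+partition.
// No real Kafka or Spark calls are made here.
public class DirectApproachSketch {
    public record OffsetRange(String topic, int partition,
                              long fromOffset, long untilOffset) {
        public long count() { return untilOffset - fromOffset; }
    }

    // Keys are hypothetical "topic-partition" strings, e.g. "logs-0".
    public static List<OffsetRange> nextBatchRanges(Map<String, Long> processed,
                                                    Map<String, Long> latest) {
        List<OffsetRange> ranges = new ArrayList<>();
        for (Map.Entry<String, Long> e : latest.entrySet()) {
            String[] tp = e.getKey().split("-");
            long from = processed.getOrDefault(e.getKey(), 0L);
            if (e.getValue() > from) {  // only partitions with new data
                ranges.add(new OffsetRange(tp[0], Integer.parseInt(tp[1]),
                                           from, e.getValue()));
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        Map<String, Long> processed = new LinkedHashMap<>();
        processed.put("logs-0", 100L);
        processed.put("logs-1", 40L);
        Map<String, Long> latest = new LinkedHashMap<>();
        latest.put("logs-0", 150L);
        latest.put("logs-1", 40L);   // no new data on partition 1

        long total = nextBatchRanges(processed, latest).stream()
                .mapToLong(OffsetRange::count).sum();
        System.out.println(total);   // prints 50
    }
}
```

Because the ranges are deterministic per batch, a recovered driver can re-read exactly the same messages from Kafka (given sufficient retention), which is what makes the no-WAL efficiency argument above work.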
[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/6863#discussion_r32694391
  
--- Diff: docs/streaming-kafka-integration.md ---
@@ -74,15 +74,15 @@ Next, we discuss how to use this approach in your streaming application.
[Maven repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kafka-assembly_2.10%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22) and add it to `spark-submit` with `--jars`.
 
 ## Approach 2: Direct Approach (No Receivers)
-This is a new receiver-less "direct" approach has been introduced in Spark 
1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to 
receive data, this approach periodically queries Kafka for the latest offsets 
in each topic+partition, and accordingly defines the offset ranges to process 
in each batch. When the jobs to process the data are launched, Kafka's simple 
consumer API is used to read the defined ranges of offsets from Kafka (similar 
to read files from a file system). Note that this is an experimental feature in 
Spark 1.3 and is only available in the Scala and Java API.
+This new receiver-less "direct" approach has been introduced in Spark 1.3 
to ensure stronger end-to-end guarantees. Instead of using receivers to receive 
data, this approach periodically queries Kafka for the latest offsets in each 
topic+partition, and accordingly defines the offset ranges to process in each 
batch. When the jobs to process the data are launched, Kafka's simple consumer 
API is used to read the defined ranges of offsets from Kafka (similar to read 
files from a file system). Note that this is an experimental feature introduced 
in Spark 1.3 for the Scala and Java API.  Spark 1.4 added a Python API, but it 
is not yet at full feature parity.
 
-This approach has the following advantages over the received-based 
approach (i.e. Approach 1).
+This approach has the following advantages over the receiver-based 
approach (i.e. Approach 1).
 
-- *Simplified Parallelism:* No need to create multiple input Kafka streams 
and union-ing them. With `directStream`, Spark Streaming will create as many 
RDD partitions as there is Kafka partitions to consume, which will all read 
data from Kafka in parallel. So there is one-to-one mapping between Kafka and 
RDD partitions, which is easier to understand and tune.
+- *Simplified Parallelism:* No need to create multiple input Kafka streams 
and union them. With `directStream`, Spark Streaming will create as many RDD 
partitions as there are Kafka partitions to consume, which will all read data 
from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD 
partitions, which is easier to understand and tune.
 
-- *Efficiency:* Achieving zero-data loss in the first approach required 
the data to be stored in a Write Ahead Log, which further replicated the data. 
This is actually inefficient as the data effectively gets replicated twice - 
once by Kafka, and a second time by the Write Ahead Log. This second approach 
eliminate the problem as there is no receiver, and hence no need for Write 
Ahead Logs.
+- *Efficiency:* Achieving zero-data loss in the first approach required 
the data to be stored in a Write Ahead Log, which further replicated the data. 
This is actually inefficient as the data effectively gets replicated twice - 
once by Kafka, and a second time by the Write Ahead Log. This second approach 
eliminates the problem as there is no receiver, and hence no need for Write 
Ahead Logs.  As long as you have sufficient Kafka retention, messages can be 
recovered from Kafka.
--- End diff --

Are there two spaces before "As long as"
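The docs being fixed in this PR center on accessing offsets via a cast to `HasOffsetRanges`, which only the RDD created directly by the Kafka direct stream implements — hence the advice to do the cast in the first operation on the stream, before any transformation discards that type. A stand-alone sketch of why the ordering matters (the `HasOffsetRanges` and `OffsetRange` names mirror Spark's `org.apache.spark.streaming.kafka` API; `KafkaBatch` and the rest are simplified stand-ins, not Spark code):

```java
import java.util.List;

// Stand-ins for the shape of Spark's Kafka direct-stream types. Only the
// batch object produced directly by the stream carries offset information;
// any transformation returns plain data, so grab the offsets first.
public class OffsetRangeAccessDemo {
    interface HasOffsetRanges { OffsetRange[] offsetRanges(); }
    record OffsetRange(String topic, int partition,
                       long fromOffset, long untilOffset) {}

    // Plays the role of the KafkaRDD returned by the direct stream.
    static class KafkaBatch implements HasOffsetRanges {
        final List<String> lines;
        final OffsetRange[] ranges;
        KafkaBatch(List<String> lines, OffsetRange[] ranges) {
            this.lines = lines;
            this.ranges = ranges;
        }
        @Override public OffsetRange[] offsetRanges() { return ranges; }
        // After a transformation only plain values remain: offsets are gone.
        List<Integer> mapToWordCounts() {
            return lines.stream().map(l -> l.split(" ").length).toList();
        }
    }

    public static void main(String[] args) {
        Object batch = new KafkaBatch(List.of("hello world", "spark"),
                new OffsetRange[] { new OffsetRange("test", 0, 5L, 7L) });

        // Cast and capture the offsets BEFORE transforming anything.
        OffsetRange[] offsets = ((HasOffsetRanges) batch).offsetRanges();
        int words = ((KafkaBatch) batch).mapToWordCounts()
                .stream().mapToInt(Integer::intValue).sum();

        System.out.println(offsets[0].untilOffset() - offsets[0].fromOffset()); // prints 2
        System.out.println(words);                                              // prints 3
    }
}
```

In real Spark code the same grab-then-transform ordering applies: capture `rdd.asInstanceOf[HasOffsetRanges].offsetRanges` (Scala) in the first `transform` or `foreachRDD` on the direct stream, before a `map` or repartition loses the Kafka-specific RDD type.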


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread koeninger
Github user koeninger commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-112978856
  
I don't think a doc change caused a python test failure.
 On Jun 17, 2015 5:27 PM, "UCB AMPLab"  wrote:

> Merged build finished. Test FAILed.
>






[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-112969443
  
  [Test build #35065 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35065/console) for PR 6863 at commit [`bb4336b`](https://github.com/apache/spark/commit/bb4336b80c73d6e40dd491af6ef7150c1b3425a2).
 * This patch **fails some tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-112969458
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-112967694
  
  [Test build #35065 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35065/consoleFull) for PR 6863 at commit [`bb4336b`](https://github.com/apache/spark/commit/bb4336b80c73d6e40dd491af6ef7150c1b3425a2).





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-112966905
  
Merged build started.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6863#issuecomment-112966838
  
 Merged build triggered.





[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

2015-06-17 Thread koeninger
GitHub user koeninger opened a pull request:

https://github.com/apache/spark/pull/6863

[Streaming][Kafka][SPARK-8390] fix docs related to HasOffsetRanges



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/koeninger/spark-1 SPARK-8390

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/6863.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6863


commit 3f3c57af929adc50a4e56303dcd99daada5e10cf
Author: cody koeninger 
Date:   2015-06-16T17:18:19Z

[Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the 
existing java direct stream api

commit bb4336b80c73d6e40dd491af6ef7150c1b3425a2
Author: cody koeninger 
Date:   2015-06-17T22:08:54Z

[Streaming][Kafka][SPARK-8390] fix docs related to HasOffsetRanges, cleanup



