Semantics of Manual Offset Commit for Kafka Spark Streaming

2019-10-14 Thread Andre Piwoni
When using manual Kafka offset commits in a Spark Streaming job, suppose the
application fails to process the current batch and therefore never commits its
offsets in the executor. Is it expected behavior that the next batch is still
processed and the offset moves on to that batch, regardless of the earlier
failure to commit? It seems so from a glance at the code. If so, is it expected
that a job which terminates on failure to process a batch and commit its
offsets will resume from the last committed offset?

I’m asking because until now I haven’t had to deal with Spark Streaming from
Kafka under the assumption of “successfully processed at least once”. Stopping
Kafka processing, or the whole stream, on any application failure may seem
rather extreme, but it is what it is.

Thank you,
Andre
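
To make the two behaviours in the question concrete, here is a toy model in
plain Python. It does not use Spark or Kafka and makes no claim about what
Spark actually does; `run_batches`, `flaky`, and `stop_on_failure` are
hypothetical names chosen for illustration. It only shows how a committed
offset moves when a failed batch is skipped versus when the job stops.

```python
from typing import Callable, List, Tuple

Batch = Tuple[int, int]  # (from_offset, until_offset), until_offset exclusive


def run_batches(
    batches: List[Batch],
    process: Callable[[Batch], None],
    stop_on_failure: bool,
) -> Tuple[int, List[Batch]]:
    """Toy model: return (last committed offset, batches whose processing failed)."""
    committed = batches[0][0] if batches else 0
    failed: List[Batch] = []
    for batch in batches:
        try:
            process(batch)
        except Exception:
            failed.append(batch)
            if stop_on_failure:
                break  # terminate; a restart would resume from `committed`
            continue  # keep going; a later commit moves past this batch
        committed = batch[1]  # commit only after successful processing
    return committed, failed


def flaky(batch: Batch) -> None:
    # Hypothetical processing step that fails for exactly one batch.
    if batch == (10, 20):
        raise RuntimeError("processing failed")


batches = [(0, 10), (10, 20), (20, 30)]
print(run_batches(batches, flaky, stop_on_failure=False))
print(run_batches(batches, flaky, stop_on_failure=True))
```

In this model, continuing after the failure ends with offset 30 committed even
though records 10 to 19 were never processed, whereas stopping leaves offset 10
committed so that a restart would replay the failed batch.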


Re: Kafka Spark Streaming integration : Relationship between DStreams and Tasks

2019-05-12 Thread Sheel Pancholi
Hello
Can anyone help me understand this? We work with the Receiver-based approach
and are trying to move to the Direct approach. There is no problem as such in
moving from the former to the latter; I am just trying to understand the
inner details from the bottom up.

Please help.

Regards
Sheel


Kafka Spark Streaming integration : Relationship between DStreams and Tasks

2019-05-12 Thread Sheel Pancholi
Hello Everyone
I am trying to understand the internals of Spark Streaming (not Structured
Streaming), specifically the way tasks see the DStream. I am going over the
source code of Spark in scala, here <https://github.com/apache/spark>. I
understand the call stack:

CoarseGrainedExecutorBackend (main) -> Executor (launchTask) ->
TaskRunner (Runnable).run() -> task.run(...)

I understand the DStream really is a hashmap of RDDs but I am trying to
understand the way tasks see the DStream. I know that there are basically 2
approaches to Kafka Spark integration:

   - *Receiver based using High Level Kafka Consumer APIs*

   Here a new (micro-)batch is created at every batch interval (say 5 secs)
   with say 5 partitions (=> 1 sec block interval) by the *Receiver* task
   and handed downstream to *Regular* tasks.

   *Question:* Considering our example where every microbatch is created
   every 5 secs; has exactly 5 partitions and all these partitions of all the
   microbatches are supposed to be DAG-ged downstream the exact same way, is
   the same *regular* task re-used over and over again for the same
   partition id of every microbatch (RDD) as a long running task? e.g.

   If *ubatch1* of partitions *(P1,P2,P3,P4,P5)* at time *T0* is assigned
   to task ids *(T1, T2, T3, T4, T5)*, will *ubatch2* of partitions
   *(P1',P2',P3',P4',P5')* at time *T5* be also assigned to the same set of
   tasks *(T1, T2, T3, T4, T5)* or will new tasks *(T6, T7, T8, T9, T10)* be
   created for *ubatch2*?

   If the latter is the case then, wouldn't it be performance intensive
   having to send new tasks over the network to executors every 5 seconds when
   you already know that there are tasks doing the exact same thing and could
   be re-used as long running tasks?
   - *Direct using Low Level Kafka Consumer APIs*

   Here a Kafka Partition maps to a Spark Partition and therefore a Task.
   Again, considering 5 Kafka partitions for a topic *t*, we get 5 Spark
   partitions and their corresponding tasks.

   *Question:* Say, the *ubatch1* at *T0* has partitions *(P1,P2,P3,P4,P5)*
   assigned to tasks *(T1, T2, T3, T4, T5).* Will *ubatch2* of partitions
   *(P1',P2',P3',P4',P5')* at time *T5* be also assigned to the same set of
   tasks *(T1, T2, T3, T4, T5)* or will new tasks *(T6, T7, T8, T9, T10)* be
   created for *ubatch2*?


I have put up this question on SO @
https://stackoverflow.com/questions/56102094/kafka-spark-streaming-integration-relation-between-tasks-and-dstreams
.

Regards
Sheel


Re: spark 2.3.1 with kafka spark-streaming-kafka-0-10 (java.lang.AbstractMethodError)

2018-06-28 Thread Peter Liu
Hello there,

I just upgraded from Spark 2.2.1 to Spark 2.3.1, ran my streaming workload,
and got an error (java.lang.AbstractMethodError) I had never seen before; see
the stack trace in (a) below.

Does anyone know whether Spark 2.3.1 works well with Kafka
spark-streaming-kafka-0-10?

The Spark Kafka integration page doesn't mention any limitation:
https://spark.apache.org/docs/2.3.1/streaming-kafka-integration.html

But this discussion suggests there is indeed an issue when upgrading to
Spark 2.3.1:
https://stackoverflow.com/questions/49180931/abstractmethoderror-creating-kafka-stream

I also rebuilt the workload with some Spark 2.3.1 jars (see (b) below), but it
doesn't seem to help.

It would be great if anyone could share any insights here.

Thanks!

Peter

(a) the exception
Exception in thread "stream execution thread for [id = 5adae836-268a-4ebf-adc4-e3cc9fbe5acf, runId = 70e78d5c-665e-4c6f-a0cc-41a56e488e30]" java.lang.AbstractMethodError
    at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:99)
    at org.apache.spark.sql.kafka010.KafkaSourceProvider$.initializeLogIfNecessary(KafkaSourceProvider.scala:369)
    at org.apache.spark.internal.Logging$class.log(Logging.scala:46)
    at org.apache.spark.sql.kafka010.KafkaSourceProvider$.log(KafkaSourceProvider.scala:369)
    at org.apache.spark.internal.Logging$class.logDebug(Logging.scala:58)
    at org.apache.spark.sql.kafka010.KafkaSourceProvider$.logDebug(KafkaSourceProvider.scala:369)
    at org.apache.spark.sql.kafka010.KafkaSourceProvider$ConfigUpdater.set(KafkaSourceProvider.scala:439)
    at org.apache.spark.sql.kafka010.KafkaSourceProvider$.kafkaParamsForDriver(KafkaSourceProvider.scala:394)
    at org.apache.spark.sql.kafka010.KafkaSourceProvider.createSource(KafkaSourceProvider.scala:90)
    at org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:277)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1$$anonfun$applyOrElse$1.apply(MicroBatchExecution.scala:80)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1$$anonfun$applyOrElse$1.apply(MicroBatchExecution.scala:77)
    at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194)
    at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1.applyOrElse(MicroBatchExecution.scala:77)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1.applyOrElse(MicroBatchExecution.scala:75)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.logicalPlan$lzycompute(MicroBatchExecution.scala:75)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.logicalPlan(MicroBatchExecution.scala:61)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:265)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)

(b) the build script update:

[pgl@datanode20 SparkStreamingBenchmark-RemoteConsumer-Spk231]$ diff build.sbt spk211-build.sbt.original
10,11c10,11
< libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.3.1"
< libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.3.1"
---
> libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.1"
> libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.1"
[pgl@datanode20 SparkStreamingBenchmark-RemoteConsumer-Spk231]$


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-21 Thread Aakash Basu
Thanks Chris!
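
Point 2 of Chris's reply quoted below (only you can decide whether one failed
query should end the application) could be sketched as a fail-fast policy like
the following. This is only one possible policy, not a recommendation from the
thread; the helper name `stop_all_when_any_terminates` is hypothetical, and
`spark` is assumed to be an existing SparkSession with streaming queries
already started.

```python
def stop_all_when_any_terminates(spark) -> None:
    """Block until any streaming query ends, then stop the remaining ones.

    `spark.streams.awaitAnyTermination()` returns when any query stops and
    raises if that query failed, so the cleanup is placed in `finally`.
    """
    try:
        spark.streams.awaitAnyTermination()
    finally:
        # `spark.streams.active` lists the queries that are still running.
        for query in spark.streams.active:
            query.stop()
```

With this in place, the three separate `awaitTermination()` calls in the
script quoted further down would not be needed; whether fail-fast is the right
choice depends on the application.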

On Fri, Mar 16, 2018 at 10:13 PM, Bowden, Chris  wrote:

> 2. You must decide. If multiple streaming queries are launched in a single
> / simple application, only you can dictate if a single failure should cause
> the application to exit. If you use spark.streams.awaitAnyTermination be
> aware it returns / throws if _any_ streaming query terminates. A more
> complex application may keep track of many streaming queries and attempt to
> relaunch them with lower latency for certain types of failures.
>
>
> 3a. I'm not very familiar with py, but I doubt you need the sleep
>
> 3b. Kafka consumer tuning is simply a matter of passing appropriate config
> keys to the source's options if desired
>
> 3c. I would argue the most obvious improvement would be a more structured
> and compact data format if CSV isn't required.
>
> --
> *From:* Aakash Basu 
> *Sent:* Friday, March 16, 2018 9:12:39 AM
> *To:* sagar grover
> *Cc:* Bowden, Chris; Tathagata Das; Dylan Guedes; Georg Heiler; user;
> jagrati.go...@myntra.com
>
> *Subject:* Re: Multiple Kafka Spark Streaming Dataframe Join query
>
> Hi all,
>
> Of the queries at the bottom of my last mail, query 1 has been resolved.
> As I had guessed, I had resent the same data from the Kafka producer
> multiple times, hence the join gave duplicates.
>
> I retested with a fresh Kafka feed and the problem was solved.
>
> But the other queries still persist; would anyone like to reply? :)
>
> Thanks,
> Aakash.
>
> On 16-Mar-2018 3:57 PM, "Aakash Basu"  wrote:
>
> Hi all,
>
> The code was perfectly all right; the package I was submitting just had to
> be the updated one (see the spark-submit command below). The join happened,
> but the output has many duplicates (even though the *how* parameter defaults
> to *inner*):
>
> Spark Submit:
>
> /home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --packages 
> org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 
> /home/aakashbasu/PycharmProjects/AllMyRnD/Kafka_Spark/Stream_Stream_Join.py
>
>
>
> Code:
>
> from pyspark.sql import SparkSession
> import time
> from pyspark.sql.functions import split, col
>
> class test:
>
>     spark = SparkSession.builder \
>         .appName("DirectKafka_Spark_Stream_Stream_Join") \
>         .getOrCreate()
>
>     table1_stream = (spark.readStream.format("kafka")
>                      .option("startingOffsets", "earliest")
>                      .option("kafka.bootstrap.servers", "localhost:9092")
>                      .option("subscribe", "test1").load())
>
>     table2_stream = (spark.readStream.format("kafka")
>                      .option("startingOffsets", "earliest")
>                      .option("kafka.bootstrap.servers", "localhost:9092")
>                      .option("subscribe", "test2").load())
>
>     query1 = table1_stream.select('value') \
>         .withColumn('value', table1_stream.value.cast("string")) \
>         .withColumn("ID", split(col("value"), ",").getItem(0)) \
>         .withColumn("First_Name", split(col("value"), ",").getItem(1)) \
>         .withColumn("Last_Name", split(col("value"), ",").getItem(2)) \
>         .drop('value')
>
>     query2 = table2_stream.select('value') \
>         .withColumn('value', table2_stream.value.cast("string")) \
>         .withColumn("ID", split(col("value"), ",").getItem(0)) \
>         .withColumn("Department", split(col("value"), ",").getItem(1)) \
>         .withColumn("Date_joined", split(col("value"), ",").getItem(2)) \
>         .drop('value')
>
>     joined_Stream = query1.join(query2, "Id")
>
>     a = query1.writeStream.format("console").start()
>     b = query2.writeStream.format("console").start()
>     c = joined_Stream.writeStream.format("console").start()
>
>     time.sleep(10)
>
>     a.awaitTermination()
>     b.awaitTermination()
>     c.awaitTermination()
>
>
> Output -
>
> +---+----------+---------+-----------+-----------+
> | ID|First_Name|Last_Name| Department|Date_joined|
> +---+----------+---------+-----------+-----------+
> |  3|     Tobit|Robardley| Accounting|   8/3/2006|
> |  3|     Tobit|Robardley| Accounting|   8/3/2006|
> |  3|     Tobit|Robardley| Accounting|   8/3/2006|
> |  3|     Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardl

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread Aakash Basu
ate: RUNNABLE'*
>
> Final code (for clearer understanding of where it may go wrong in 2.3.0) -
>
> from pyspark.sql import SparkSession
> import time
> from pyspark.sql.functions import split, col
>
> class test:
>
>     spark = SparkSession.builder \
>         .appName("DirectKafka_Spark_Stream_Stream_Join") \
>         .getOrCreate()
>
>     table1_stream = (spark.readStream.format("kafka")
>                      .option("startingOffsets", "earliest")
>                      .option("kafka.bootstrap.servers", "localhost:9092")
>                      .option("subscribe", "test1").load())
>
>     query = table1_stream.select('value') \
>         .withColumn('value', table1_stream.value.cast("string")) \
>         .withColumn("ID", split(col("value"), ",").getItem(0)) \
>         .withColumn("First_Name", split(col("value"), ",").getItem(1)) \
>         .withColumn("Last_Name", split(col("value"), ",").getItem(2)) \
>         .drop('value').writeStream.format("console").start()
>
>     time.sleep(10)
>
>     query.awaitTermination()
>
> # Code working in Spark 2.2.1
> # /home/kafka/Downloads/spark-2.2.1-bin-hadoop2.7/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 /home/aakashbasu/PycharmProjects/AllMyRnD/Kafka_Spark/Stream_Stream_Join.py
>
> # Code not working in Spark 2.3.0
> # /home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 /home/aakashbasu/PycharmProjects/AllMyRnD/Kafka_Spark/Stream_Stream_Join.py
>
> 2) I'm getting the output below, as expected, from the above code in 2.2.1.
> My question: is there a way to pick up the header of the file being read, as
> with header=True? (Or is it that for Structured Streaming the user has to
> provide headers explicitly every time, since data from Kafka always arrives
> in this structure: topic, partition, offset, key, value, timestamp,
> timestampType? If so, how do I explicitly remove the column-header rows from
> the data, as seen in the table below?) I know it is a stream, and the data is
> fed in as messages, but I still wanted the experts to shed some more light
> on it.
>
> +---+----------+---------+
> | ID|First_Name|Last_Name|
> +---+----------+---------+
> | Hi|      null|     null|
> | id|first_name|last_name|
> |  1|  Kellyann|    Moyne|
> |  2|     Morty|  Blacker|
> |  3|     Tobit|Robardley|
> |  4|    Wilona|    Kells|
> |  5|     Reggy|Comizzoli|
> | id|first_name|last_name|
> |  1|  Kellyann|    Moyne|
> |  2|     Morty|  Blacker|
> |  3|     Tobit|Robardley|
> |  4|    Wilona|    Kells|
> |  5|     Reggy|Comizzoli|
> | id|first_name|last_name|
> |  1|  Kellyann|    Moyne|
> |  2|     Morty|  Blacker|
> |  3|     Tobit|Robardley|
> |  4|    Wilona|    Kells|
> |  5|     Reggy|Comizzoli|
> | id|first_name|last_name|
> +---+----------+---------+
> only showing top 20 rows
>
>
> Any help?
>
> Thanks,
> Aakash.

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread sagar grover
With regards,
Sagar Grover
Phone - 7022175584

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
Awesome, thanks for detailing!

Was thinking the same, we've to split by comma for csv while casting inside.

Cool! Shall try it and revert back tomm.

Thanks a ton!
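
As a rough illustration of the "split by comma while casting" idea, and of
Chris's remark below that the query plan could be made less repetitive, the
expressions can be generated from a list of column names. This is a minimal
sketch, not something proposed in the thread: `csv_select_exprs` and
`parse_csv_values` are hypothetical helper names, and `df` is assumed to be a
DataFrame read from the Kafka source, with its binary `value` column.

```python
from typing import List


def csv_select_exprs(columns: List[str], sep: str = ",") -> List[str]:
    """Build one Spark SQL expression per CSV field of the Kafka `value` column.

    Note: Spark's split() treats `sep` as a regular expression, so a separator
    such as "|" would need escaping by the caller.
    """
    return [
        f"split(cast(value as string), '{sep}')[{i}] as {name}"
        for i, name in enumerate(columns)
    ]


def parse_csv_values(df, columns: List[str]):
    """Project the generated expressions onto a DataFrame with a `value` column."""
    return df.selectExpr(*csv_select_exprs(columns))


# The expressions generated for the two-column example used in the thread.
for expr in csv_select_exprs(["id", "name"]):
    print(expr)
```

This only removes the repetition; it does not handle quoted fields or header
rows, which plain splitting on a comma cannot distinguish from data.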

On 15-Mar-2018 11:50 PM, "Bowden, Chris" 
wrote:

> To remain generic, the KafkaSource can only offer the lowest common
> denominator for a schema (topic, partition, offset, key, value, timestamp,
> timestampType). As such, you can't just feed it a StructType. When you are
> using a producer or consumer directly with Kafka, serialization and
> deserialization is often an orthogonal and implicit transform. However, in
> Spark, serialization and deserialization is an explicit transform (e.g.,
> you define it in your query plan).
>
>
> To make this more granular, if we imagine your source is registered as a
> temp view named "foo":
>
> SELECT
>
>   split(cast(value as string), ',')[0] as id,
>
>   split(cast(value as string), ',')[1] as name
>
> FROM foo;
>
>
> Assuming you were providing the following messages to Kafka:
>
> 1,aakash
>
> 2,tathagata
>
> 3,chris
>
>
> You could make the query plan less repetitive. I don't believe Spark
> offers from_csv out of the box as an expression (although CSV is well
> supported as a data source). You could implement an expression by reusing a
> lot of the supporting CSV classes which may result in a better user
> experience vs. explicitly using split and array indices, etc. In this
> simple example, casting the binary to a string just works because there is
> a common understanding of strings encoded as bytes between Spark and Kafka
> by default.
>
>
> -Chris
> ------
> *From:* Aakash Basu 
> *Sent:* Thursday, March 15, 2018 10:48:45 AM
> *To:* Bowden, Chris
> *Cc:* Tathagata Das; Dylan Guedes; Georg Heiler; user
> *Subject:* Re: Multiple Kafka Spark Streaming Dataframe Join query
>
> Hey Chris,
>
> You got it right. I'm reading a *csv *file from local as mentioned above,
> with a console producer on Kafka side.
>
> So, as it is a csv data with headers, shall I then use from_csv on the
> spark side and provide a StructType to shape it up with a schema and then
> cast it to string as TD suggested?
>
> I'm getting all of your points at a very high level. A little more
> granularity would help.
>
> *In the slide TD just shared*, PFA, I'm confused at the point where he is
> casting the value as string. Logically, the value shall consist of all the
> entire data set, so, suppose, I've a table with many columns, *how can I
> provide a single alias as he did in the groupBy. I missed it there itself.
> Another question is, do I have to cast in groupBy itself? Can't I do it
> directly in a select query? The last one, if the steps are followed, can I
> then run a SQL query on top of the columns separately?*
>
> Thanks,
> Aakash.
>
>
> On 15-Mar-2018 9:07 PM, "Bowden, Chris" 
> wrote:
>
> You need to tell Spark about the structure of the data, it doesn't know
> ahead of time if you put avro, json, protobuf, etc. in kafka for the
> message format. If the messages are in json, Spark provides from_json out
> of the box. For a very simple POC you can happily cast the value to a
> string, etc. if you are prototyping and pushing messages by hand with a
> console producer on the kafka side.
>
> 
> From: Aakash Basu 
> Sent: Thursday, March 15, 2018 7:52:28 AM
> To: Tathagata Das
> Cc: Dylan Guedes; Georg Heiler; user
> Subject: Re: Multiple Kafka Spark Streaming Dataframe Join query
>
> Hi,
>
> And if I run this below piece of code -
>
>
> from pyspark.sql import SparkSession
> import time
>
> class test:
>
>
> spark = SparkSession.builder \
> .appName("DirectKafka_Spark_Stream_Stream_Join") \
> .getOrCreate()
> # ssc = StreamingContext(spark, 20)
>
> table1_stream = 
> (spark.readStream.format("kafka").option("startingOffsets",
> "earliest").option("kafka.bootstrap.servers",
> "localhost:9092").option("subscribe", "test1").load())
>
> table2_stream = (
> spark.readStream.format("kafka").option("startingOffsets",
> "earliest").option("kafka.bootstrap.servers",
>
> "localhost:9092").option("subscribe",
>
>  "test2").load())
>
> joined_Stream = table1_stream.join(table2_stream, "Id")
> #
> # joined_Stream.show()
>
> # query =
> table1_st

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
Hey Chris,

You got it right. I'm reading a *csv* file from local disk, as mentioned
above, and feeding it through a console producer on the Kafka side.

Since it is CSV data with headers, should I use from_csv on the Spark side
and provide a StructType to give it a schema, and then cast it to string as
TD suggested?

I'm getting all of your points only at a very high level. A little more
granularity would help.

*In the slide TD just shared* (PFA), I'm confused at the point where he
casts the value as a string. Logically, the value will contain the entire
data set, so if I have a table with many columns, *how can I provide a
single alias as he did in the groupBy? I missed that part. Another question:
do I have to cast in the groupBy itself, or can I do it directly in a select
query? Lastly, if these steps are followed, can I then run a SQL query on
the individual columns?*

Thanks,
Aakash.


On 15-Mar-2018 9:07 PM, "Bowden, Chris"  wrote:

You need to tell Spark about the structure of the data, it doesn't know
ahead of time if you put avro, json, protobuf, etc. in kafka for the
message format. If the messages are in json, Spark provides from_json out
of the box. For a very simple POC you can happily cast the value to a
string, etc. if you are prototyping and pushing messages by hand with a
console producer on the kafka side.


From: Aakash Basu 
Sent: Thursday, March 15, 2018 7:52:28 AM
To: Tathagata Das
Cc: Dylan Guedes; Georg Heiler; user
Subject: Re: Multiple Kafka Spark Streaming Dataframe Join query

Hi,

And if I run this below piece of code -


from pyspark.sql import SparkSession
import time

class test:


spark = SparkSession.builder \
.appName("DirectKafka_Spark_Stream_Stream_Join") \
.getOrCreate()
# ssc = StreamingContext(spark, 20)

table1_stream = (spark.readStream.format("kafka").option("startingOffsets",
"earliest").option("kafka.bootstrap.servers",
"localhost:9092").option("subscribe",
"test1").load())

table2_stream = (
spark.readStream.format("kafka").option("startingOffsets",
"earliest").option("kafka.bootstrap.servers",

  "localhost:9092").option("subscribe",

   "test2").load())

joined_Stream = table1_stream.join(table2_stream, "Id")
#
# joined_Stream.show()

# query =
table1_stream.writeStream.format("console").start().awaitTermination()
# .queryName("table_A").format("memory")
# spark.sql("select * from table_A").show()
time.sleep(10)  # sleep 10 seconds
# query.stop()
# query


# /home/kafka/Downloads/spark-2.2.1-bin-hadoop2.7/bin/spark-submit
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
Stream_Stream_Join.py




I get the below error (in Spark 2.3.0) -

Traceback (most recent call last):
  File 
"/home/aakashbasu/PycharmProjects/AllMyRnD/Kafka_Spark/Stream_Stream_Join.py",
line 4, in 
class test:
  File 
"/home/aakashbasu/PycharmProjects/AllMyRnD/Kafka_Spark/Stream_Stream_Join.py",
line 19, in test
joined_Stream = table1_stream.join(table2_stream, "Id")
  File "/home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/python/
lib/pyspark.zip/pyspark/sql/dataframe.py", line 931, in join
  File "/home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/python/
lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/python/
lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'USING column `Id` cannot be resolved
on the left side of the join. The left-side columns: [key, value, topic,
partition, offset, timestamp, timestampType];'

It seems that, as per the documentation, the key and value are deserialized
as byte arrays.

I am badly stuck at this step, and there isn't much material online on how
to proceed.

Any help, guys?

Thanks,
Aakash.


On Thu, Mar 15, 2018 at 7:54 PM, Aakash Basu wrote:
Any help on the above?

On Thu, Mar 15, 2018 at 3:53 PM, Aakash Basu wrote:
Hi,

I progressed a bit in the above mentioned topic -

1) I am feeding a CSV file into the Kafka topic.
2) Feeding the Kafka topic as readStream as TD's article suggests.
3) Then, simply trying to do a show on the streaming dataframe, using
queryName('XYZ') in the writeStream and writing a sql query on top of it,
but that doesn't show anything.
4) Once all the above problems are resolved, I want to perform a
stream-stream join.

The CSV file I'm ingesting into Kafka has -

id,first_name,last_name
1,Kellyann,Moyne
2,Morty,Blacker
3,Tob

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Tathagata Das
Chris identified the problem correctly. You need to parse out the json text
from Kafka into separate columns before you can join them up.
I walk through an example of this in my slides -
https://www.slideshare.net/databricks/easy-scalable-fault-tolerant-stream-processing-with-structured-streaming-with-tathagata-das
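
For the CSV case discussed in this thread, a minimal sketch of that parsing
step might look like the following. It assumes comma-separated values with
no quoted commas inside fields. The helper name csv_select_exprs is
hypothetical (it is not part of PySpark); it only builds SQL expression
strings of the same shape as the split/cast query quoted elsewhere in this
thread.

```python
def csv_select_exprs(columns, sep=","):
    """Build one Spark SQL expression per CSV field of the Kafka `value` column.

    Hypothetical helper, not a PySpark API. Assumes `sep` is a plain
    character that needs no regex escaping and never appears inside a field.
    """
    exprs = []
    for index, name in enumerate(columns):
        # The Kafka source delivers `value` as binary, so cast before splitting.
        exprs.append(
            "split(cast(value as string), '{}')[{}] as {}".format(sep, index, name)
        )
    return exprs


exprs = csv_select_exprs(["id", "first_name", "last_name"])
for expr in exprs:
    print(expr)
```

Each string can be passed to DataFrame.selectExpr on a stream (for example
table1_stream.selectExpr(*exprs)), so that id exists as a real column before
the join is attempted.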


On Thu, Mar 15, 2018 at 8:37 AM, Bowden, Chris 
wrote:

> You need to tell Spark about the structure of the data, it doesn't know
> ahead of time if you put avro, json, protobuf, etc. in kafka for the
> message format. If the messages are in json, Spark provides from_json out
> of the box. For a very simple POC you can happily cast the value to a
> string, etc. if you are prototyping and pushing messages by hand with a
> console producer on the kafka side.
>
> 
> From: Aakash Basu 
> Sent: Thursday, March 15, 2018 7:52:28 AM
> To: Tathagata Das
> Cc: Dylan Guedes; Georg Heiler; user
> Subject: Re: Multiple Kafka Spark Streaming Dataframe Join query
>
> Hi,
>
> And if I run this below piece of code -
>
>
> from pyspark.sql import SparkSession
> import time
>
> class test:
>
>
> spark = SparkSession.builder \
> .appName("DirectKafka_Spark_Stream_Stream_Join") \
> .getOrCreate()
> # ssc = StreamingContext(spark, 20)
>
> table1_stream = 
> (spark.readStream.format("kafka").option("startingOffsets",
> "earliest").option("kafka.bootstrap.servers", 
> "localhost:9092").option("subscribe",
> "test1").load())
>
> table2_stream = (
> spark.readStream.format("kafka").option("startingOffsets",
> "earliest").option("kafka.bootstrap.servers",
>
> "localhost:9092").option("subscribe",
>
>  "test2").load())
>
> joined_Stream = table1_stream.join(table2_stream, "Id")
> #
> # joined_Stream.show()
>
> # query =
> table1_stream.writeStream.format("console").start().awaitTermination()
> # .queryName("table_A").format("memory")
> # spark.sql("select * from table_A").show()
> time.sleep(10)  # sleep 20 seconds
> # query.stop()
> # query
>
>
> # /home/kafka/Downloads/spark-2.2.1-bin-hadoop2.7/bin/spark-submit
> --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
> Stream_Stream_Join.py
>
>
>
>
> I get the below error (in Spark 2.3.0) -
>
> Traceback (most recent call last):
>   File "/home/aakashbasu/PycharmProjects/AllMyRnD/
> Kafka_Spark/Stream_Stream_Join.py", line 4, in 
> class test:
>   File "/home/aakashbasu/PycharmProjects/AllMyRnD/
> Kafka_Spark/Stream_Stream_Join.py", line 19, in test
> joined_Stream = table1_stream.join(table2_stream, "Id")
>   File "/home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/python/
> lib/pyspark.zip/pyspark/sql/dataframe.py", line 931, in join
>   File "/home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/python/
> lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
>   File "/home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/python/
> lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
> pyspark.sql.utils.AnalysisException: u'USING column `Id` cannot be
> resolved on the left side of the join. The left-side columns: [key, value,
> topic, partition, offset, timestamp, timestampType];'
>
> Seems, as per the documentation, they key and value are deserialized as
> byte arrays.
>
> I am badly stuck at this step, not many materials online, with steps to
> proceed on this, too.
>
> Any help, guys?
>
> Thanks,
> Aakash.
>
>
> On Thu, Mar 15, 2018 at 7:54 PM, Aakash Basu wrote:
> Any help on the above?
>
> On Thu, Mar 15, 2018 at 3:53 PM, Aakash Basu wrote:
> Hi,
>
> I progressed a bit in the above mentioned topic -
>
> 1) I am feeding a CSV file into the Kafka topic.
> 2) Feeding the Kafka topic as readStream as TD's article suggests.
> 3) Then, simply trying to do a show on the streaming dataframe, using
> queryName('XYZ') in the writeStream and writing a sql query on top of it,
> but that doesn't show anything.
> 4) Once all the above problems are resolved, I want to perform a
> stream-stream join.
>
> The CSV file I'm ingesting into Kafka has -
>
> id,first_name,last_name
> 1,Kellyann,Moyne
> 2,Morty,Blacker
> 3,Tobit,Robardley
> 4,Wilona,Kells
> 5,Reggy,Comizzoli
>
&

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
Hi,

And if I run this below piece of code -

from pyspark.sql import SparkSession
import time

class test:

    spark = SparkSession.builder \
        .appName("DirectKafka_Spark_Stream_Stream_Join") \
        .getOrCreate()
    # ssc = StreamingContext(spark, 20)

    table1_stream = (spark.readStream.format("kafka")
                     .option("startingOffsets", "earliest")
                     .option("kafka.bootstrap.servers", "localhost:9092")
                     .option("subscribe", "test1").load())

    table2_stream = (spark.readStream.format("kafka")
                     .option("startingOffsets", "earliest")
                     .option("kafka.bootstrap.servers", "localhost:9092")
                     .option("subscribe", "test2").load())

    joined_Stream = table1_stream.join(table2_stream, "Id")
    #
    # joined_Stream.show()

    # query = table1_stream.writeStream.format("console").start().awaitTermination()
    # .queryName("table_A").format("memory")
    # spark.sql("select * from table_A").show()
    time.sleep(10)  # sleep 10 seconds
    # query.stop()
    # query


# /home/kafka/Downloads/spark-2.2.1-bin-hadoop2.7/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 Stream_Stream_Join.py




I get the below error (in *Spark 2.3.0*) -

Traceback (most recent call last):
  File "/home/aakashbasu/PycharmProjects/AllMyRnD/Kafka_Spark/Stream_Stream_Join.py", line 4, in <module>
    class test:
  File "/home/aakashbasu/PycharmProjects/AllMyRnD/Kafka_Spark/Stream_Stream_Join.py", line 19, in test
    joined_Stream = table1_stream.join(table2_stream, "Id")
  File "/home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 931, in join
  File "/home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco


*pyspark.sql.utils.AnalysisException: u'USING column `Id` cannot be
resolved on the left side of the join. The left-side columns: [key, value,
topic, partition, offset, timestamp, timestampType];'*

It seems that, as per the documentation, the key and value are deserialized
as byte arrays.

I am badly stuck at this step, and there isn't much material online on how
to proceed.

Any help, guys?

Thanks,
Aakash.


On Thu, Mar 15, 2018 at 7:54 PM, Aakash Basu 
wrote:

> Any help on the above?
>
> On Thu, Mar 15, 2018 at 3:53 PM, Aakash Basu 
> wrote:
>
>> Hi,
>>
>> I progressed a bit in the above mentioned topic -
>>
>> 1) I am feeding a CSV file into the Kafka topic.
>> 2) Feeding the Kafka topic as readStream as TD's article suggests.
>> 3) Then, simply trying to do a show on the streaming dataframe, using
>> queryName('XYZ') in the writeStream and writing a sql query on top of it,
>> but that doesn't show anything.
>> 4) Once all the above problems are resolved, I want to perform a
>> stream-stream join.
>>
>> The CSV file I'm ingesting into Kafka has -
>>
>> id,first_name,last_name
>> 1,Kellyann,Moyne
>> 2,Morty,Blacker
>> 3,Tobit,Robardley
>> 4,Wilona,Kells
>> 5,Reggy,Comizzoli
>>
>>
>> My test code -
>>
>> from pyspark.sql import SparkSession
>> import time
>>
>> class test:
>>
>>
>> spark = SparkSession.builder \
>> .appName("DirectKafka_Spark_Stream_Stream_Join") \
>> .getOrCreate()
>> # ssc = StreamingContext(spark, 20)
>>
>> table1_stream = 
>> (spark.readStream.format("kafka").option("startingOffsets", 
>> "earliest").option("kafka.bootstrap.servers", 
>> "localhost:9092").option("subscribe", "test1").load())
>>
>> # table2_stream = 
>> (spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
>> "localhost:9092").option("subscribe", "test2").load())
>>
>> # joined_Stream = table1_stream.join(table2_stream, "Id")
>> #
>> # joined_Stream.show()
>>
>> query = 
>> table1_stream.writeStream.format("console").queryName("table_A").start()  # 
>> .format("memory")
>> # spark.sql("select * from table_A").show()
>> # time.sleep(10)  # sleep 20 seconds
>> # query.stop()
>> query.awaitTermination()
>>
>>
>> # /home/kafka/Downloads/spark-2.2.1-bin-hadoop2.7/bin/spark-submit 
>> --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 
>> Stream_Stream_Join.py
>>
>>
>> The output I'm getting (whereas I simply want to show() my dataframe) -
>>
>> +++-+-+--+--
>> --+-+
>> | key|   value|topic|partition|offset|
>> timestamp|timestampType|
>> +++-+-+--+--
>> --+-+
>> |null|[69 64 2C 66 69 7...|test1|0|  5226|2018-03-15
>> 15:48:...|0|
>> |null|[31 2C 4B 65 6C 6...|test1|0|  5227|2018-03-15
>> 15:48:...|0|
>> |null|[32 2C 4D 6F 72 7...|test1|0|  5228|2018-03-15
>> 15:48:...|0|
>> |null|[33 2C 54 6F 62 6...|test1|0|  5229|2018-03-15
>> 15:48:...|

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
Any help on the above?

On Thu, Mar 15, 2018 at 3:53 PM, Aakash Basu 
wrote:

> Hi,
>
> I progressed a bit in the above mentioned topic -
>
> 1) I am feeding a CSV file into the Kafka topic.
> 2) Feeding the Kafka topic as readStream as TD's article suggests.
> 3) Then, simply trying to do a show on the streaming dataframe, using
> queryName('XYZ') in the writeStream and writing a sql query on top of it,
> but that doesn't show anything.
> 4) Once all the above problems are resolved, I want to perform a
> stream-stream join.
>
> The CSV file I'm ingesting into Kafka has -
>
> id,first_name,last_name
> 1,Kellyann,Moyne
> 2,Morty,Blacker
> 3,Tobit,Robardley
> 4,Wilona,Kells
> 5,Reggy,Comizzoli
>
>
> My test code -
>
> from pyspark.sql import SparkSession
> import time
>
> class test:
>
>
> spark = SparkSession.builder \
> .appName("DirectKafka_Spark_Stream_Stream_Join") \
> .getOrCreate()
> # ssc = StreamingContext(spark, 20)
>
> table1_stream = 
> (spark.readStream.format("kafka").option("startingOffsets", 
> "earliest").option("kafka.bootstrap.servers", 
> "localhost:9092").option("subscribe", "test1").load())
>
> # table2_stream = 
> (spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:9092").option("subscribe", "test2").load())
>
> # joined_Stream = table1_stream.join(table2_stream, "Id")
> #
> # joined_Stream.show()
>
> query = 
> table1_stream.writeStream.format("console").queryName("table_A").start()  # 
> .format("memory")
> # spark.sql("select * from table_A").show()
> # time.sleep(10)  # sleep 20 seconds
> # query.stop()
> query.awaitTermination()
>
>
> # /home/kafka/Downloads/spark-2.2.1-bin-hadoop2.7/bin/spark-submit --packages 
> org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 Stream_Stream_Join.py
>
>
> The output I'm getting (whereas I simply want to show() my dataframe) -
>
> +----+--------------------+-----+---------+------+--------------------+-------------+
> | key|               value|topic|partition|offset|           timestamp|timestampType|
> +----+--------------------+-----+---------+------+--------------------+-------------+
> |null|[69 64 2C 66 69 7...|test1|        0|  5226|2018-03-15 15:48:...|            0|
> |null|[31 2C 4B 65 6C 6...|test1|        0|  5227|2018-03-15 15:48:...|            0|
> |null|[32 2C 4D 6F 72 7...|test1|        0|  5228|2018-03-15 15:48:...|            0|
> |null|[33 2C 54 6F 62 6...|test1|        0|  5229|2018-03-15 15:48:...|            0|
> |null|[34 2C 57 69 6C 6...|test1|        0|  5230|2018-03-15 15:48:...|            0|
> |null|[35 2C 52 65 67 6...|test1|        0|  5231|2018-03-15 15:48:...|            0|
> +----+--------------------+-----+---------+------+--------------------+-------------+
>
> 18/03/15 15:48:07 INFO StreamExecution: Streaming query made progress: {
>   "id" : "ca7e2862-73c6-41bf-9a6f-c79e533a2bf8",
>   "runId" : "0758ddbd-9b1c-428b-aa52-1dd40d477d21",
>   "name" : "table_A",
>   "timestamp" : "2018-03-15T10:18:07.218Z",
>   "numInputRows" : 6,
>   "inputRowsPerSecond" : 461.53846153846155,
>   "processedRowsPerSecond" : 14.634146341463415,
>   "durationMs" : {
> "addBatch" : 241,
> "getBatch" : 15,
> "getOffset" : 2,
> "queryPlanning" : 2,
> "triggerExecution" : 410,
> "walCommit" : 135
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : "KafkaSource[Subscribe[test1]]",
> "startOffset" : {
>   "test1" : {
> "0" : 5226
>   }
> },
> "endOffset" : {
>   "test1" : {
> "0" : 5232
>   }
> },
> "numInputRows" : 6,
> "inputRowsPerSecond" : 461.53846153846155,
> "processedRowsPerSecond" : 14.634146341463415
>   } ],
>   "sink" : {
> "description" : "org.apache.spark.sql.execution.streaming.
> ConsoleSink@3dfc7990"
>   }
> }
>
> P.S - If I add the below piece in the code, it doesn't print a DF of the
> actual table.
>
> spark.sql("select * from table_A").show()
>
>
> Any help?
>
>
> Thanks,
> Aakash.
>
> On Thu, Mar 15, 2018 at 10:52 AM, Aakash Basu 
> wrote:
>
>> Thanks to TD, the savior!
>>
>> Shall look into it.
>>
>> On Thu, Mar 15, 2018 at 1:04 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Relevant: https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html
>>>
>>> This is true stream-stream join which will automatically buffer delayed
>>> data and appropriately join stuff with SQL join semantics. Please check it
>>> out :)
>>>
>>> TD
>>>
>>>
>>>
>>> On Wed, Mar 14, 2018 at 12:07 PM, Dylan Guedes 
>>> wrote:
>>>
 I misread it, and thought that you question was if pyspark supports
 kafka lol. Sorry!

 On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu <
 aakash.spark@gmail.com> wrote:

> Hey Dylan,
>
> Great!
>
> Can you revert back to my initial and also the latest mail?
>
> Thanks,
> Aakash.
>
>

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
Hi,

I progressed a bit in the above mentioned topic -

1) I am feeding a CSV file into the Kafka topic.
2) Feeding the Kafka topic as readStream as TD's article suggests.
3) Then, simply trying to do a show on the streaming dataframe, using
queryName('XYZ') in the writeStream and writing a sql query on top of it,
but that doesn't show anything.
4) Once all the above problems are resolved, I want to perform a
stream-stream join.

The CSV file I'm ingesting into Kafka has -

id,first_name,last_name
1,Kellyann,Moyne
2,Morty,Blacker
3,Tobit,Robardley
4,Wilona,Kells
5,Reggy,Comizzoli


My test code -

from pyspark.sql import SparkSession
import time

class test:

    spark = SparkSession.builder \
        .appName("DirectKafka_Spark_Stream_Stream_Join") \
        .getOrCreate()
    # ssc = StreamingContext(spark, 20)

    table1_stream = (spark.readStream.format("kafka")
                     .option("startingOffsets", "earliest")
                     .option("kafka.bootstrap.servers", "localhost:9092")
                     .option("subscribe", "test1").load())

    # table2_stream = (spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "test2").load())

    # joined_Stream = table1_stream.join(table2_stream, "Id")
    #
    # joined_Stream.show()

    query = table1_stream.writeStream.format("console").queryName("table_A").start()  # .format("memory")
    # spark.sql("select * from table_A").show()
    # time.sleep(10)  # sleep 10 seconds
    # query.stop()
    query.awaitTermination()


# /home/kafka/Downloads/spark-2.2.1-bin-hadoop2.7/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 Stream_Stream_Join.py


The output I'm getting (whereas I simply want to show() my dataframe) -

+----+--------------------+-----+---------+------+--------------------+-------------+
| key|               value|topic|partition|offset|           timestamp|timestampType|
+----+--------------------+-----+---------+------+--------------------+-------------+
|null|[69 64 2C 66 69 7...|test1|        0|  5226|2018-03-15 15:48:...|            0|
|null|[31 2C 4B 65 6C 6...|test1|        0|  5227|2018-03-15 15:48:...|            0|
|null|[32 2C 4D 6F 72 7...|test1|        0|  5228|2018-03-15 15:48:...|            0|
|null|[33 2C 54 6F 62 6...|test1|        0|  5229|2018-03-15 15:48:...|            0|
|null|[34 2C 57 69 6C 6...|test1|        0|  5230|2018-03-15 15:48:...|            0|
|null|[35 2C 52 65 67 6...|test1|        0|  5231|2018-03-15 15:48:...|            0|
+----+--------------------+-----+---------+------+--------------------+-------------+
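
The value column in the output above is raw bytes printed in hex. As a quick
sanity check in plain Python (no Spark involved), decoding the visible
prefixes of the first four rows suggests they are simply the CSV lines,
including the header row, which reaches Kafka as an ordinary message. The
variable names are illustrative only.

```python
# Visible hex prefixes copied from the value column of the console output.
hex_prefixes = [
    "69 64 2C 66 69",
    "31 2C 4B 65 6C",
    "32 2C 4D 6F 72",
    "33 2C 54 6F 62",
]

# bytes.fromhex skips the spaces between byte pairs.
decoded = [bytes.fromhex(prefix).decode("utf-8") for prefix in hex_prefixes]
for text in decoded:
    print(text)
```

So the console sink is showing the data correctly; it just has not been cast
from binary to string or split into columns yet.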

18/03/15 15:48:07 INFO StreamExecution: Streaming query made progress: {
  "id" : "ca7e2862-73c6-41bf-9a6f-c79e533a2bf8",
  "runId" : "0758ddbd-9b1c-428b-aa52-1dd40d477d21",
  "name" : "table_A",
  "timestamp" : "2018-03-15T10:18:07.218Z",
  "numInputRows" : 6,
  "inputRowsPerSecond" : 461.53846153846155,
  "processedRowsPerSecond" : 14.634146341463415,
  "durationMs" : {
    "addBatch" : 241,
    "getBatch" : 15,
    "getOffset" : 2,
    "queryPlanning" : 2,
    "triggerExecution" : 410,
    "walCommit" : 135
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[test1]]",
    "startOffset" : {
      "test1" : {
        "0" : 5226
      }
    },
    "endOffset" : {
      "test1" : {
        "0" : 5232
      }
    },
    "numInputRows" : 6,
    "inputRowsPerSecond" : 461.53846153846155,
    "processedRowsPerSecond" : 14.634146341463415
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@3dfc7990"
  }
}

P.S - If I add the below piece in the code, it doesn't print a DF of the
actual table.

spark.sql("select * from table_A").show()


Any help?


Thanks,
Aakash.

On Thu, Mar 15, 2018 at 10:52 AM, Aakash Basu 
wrote:

> Thanks to TD, the savior!
>
> Shall look into it.
>
> On Thu, Mar 15, 2018 at 1:04 AM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Relevant: https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html
>>
>> This is true stream-stream join which will automatically buffer delayed
>> data and appropriately join stuff with SQL join semantics. Please check it
>> out :)
>>
>> TD
>>
>>
>>
>> On Wed, Mar 14, 2018 at 12:07 PM, Dylan Guedes 
>> wrote:
>>
>>> I misread it, and thought that you question was if pyspark supports
>>> kafka lol. Sorry!
>>>
>>> On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu wrote:
>>>
 Hey Dylan,

 Great!

 Can you revert back to my initial and also the latest mail?

 Thanks,
 Aakash.

 On 15-Mar-2018 12:27 AM, "Dylan Guedes"  wrote:

> Hi,
>
> I've been using the Kafka with pyspark since 2.1.
>
> On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu <
> aakash.spark@gmail.com> wrote:
>
>> Hi,
>>
>> I'm yet to.
>>
>> Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package
>> allows Python? I read somewhere, as of now S

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Thanks to TD, the savior!

Shall look into it.

On Thu, Mar 15, 2018 at 1:04 AM, Tathagata Das 
wrote:

> Relevant: https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html
>
> This is true stream-stream join which will automatically buffer delayed
> data and appropriately join stuff with SQL join semantics. Please check it
> out :)
>
> TD
>
>
>
> On Wed, Mar 14, 2018 at 12:07 PM, Dylan Guedes 
> wrote:
>
>> I misread it, and thought that you question was if pyspark supports kafka
>> lol. Sorry!
>>
>> On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu 
>> wrote:
>>
>>> Hey Dylan,
>>>
>>> Great!
>>>
>>> Can you revert back to my initial and also the latest mail?
>>>
>>> Thanks,
>>> Aakash.
>>>
>>> On 15-Mar-2018 12:27 AM, "Dylan Guedes"  wrote:
>>>
 Hi,

 I've been using the Kafka with pyspark since 2.1.

 On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu <
 aakash.spark@gmail.com> wrote:

> Hi,
>
> I'm yet to.
>
> Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package
> allows Python? I read somewhere, as of now Scala and Java are the 
> languages
> to be used.
>
> Please correct me if am wrong.
>
> Thanks,
> Aakash.
>
> On 14-Mar-2018 8:24 PM, "Georg Heiler" 
> wrote:
>
>> Did you try spark 2.3 with structured streaming? There watermarking
>> and plain sql might be really interesting for you.
>> Aakash Basu  schrieb am Mi. 14. März
>> 2018 um 14:57:
>>
>>> Hi,
>>>
>>>
>>>
>>> *Info (Using):Spark Streaming Kafka 0.8 package*
>>>
>>> *Spark 2.2.1*
>>> *Kafka 1.0.1*
>>>
>>> As of now, I am feeding paragraphs in Kafka console producer and my
>>> Spark, which is acting as a receiver is printing the flattened words, 
>>> which
>>> is a complete RDD operation.
>>>
>>> *My motive is to read two tables continuously (being updated) as two
>>> distinct Kafka topics being read as two Spark Dataframes and join them
>>> based on a key and produce the output. *(I am from Spark-SQL
>>> background, pardon my Spark-SQL-ish writing)
>>>
>>> *It may happen, the first topic is receiving new data 15 mins prior
>>> to the second topic, in that scenario, how to proceed? I should not lose
>>> any data.*
>>>
>>> As of now, I want to simply pass paragraphs, read them as RDD,
>>> convert to DF and then join to get the common keys as the output. (Just 
>>> for
>>> R&D).
>>>
>>> Started using Spark Streaming and Kafka today itself.
>>>
>>> Please help!
>>>
>>> Thanks,
>>> Aakash.
>>>
>>

>>
>


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Tathagata Das
Relevant:
https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html


This is a true stream-stream join, which will automatically buffer delayed
data and join it appropriately with SQL join semantics. Please check it
out :)
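
To make "buffer delayed data" a little more concrete, here is a toy
illustration in plain Python. It is not Spark's implementation, and the class
and method names are hypothetical. Each side keeps the rows it has seen,
keyed by the join key, and every arriving row is matched against the other
side's buffer, so a row whose partner arrives later still gets joined.
Bounding the buffered state (for example with watermarks) is left out of
this sketch.

```python
from collections import defaultdict


class ToyStreamJoin:
    """Inner join of two streams by buffering every row seen so far."""

    def __init__(self):
        self.left = defaultdict(list)   # join key -> rows seen on the left
        self.right = defaultdict(list)  # join key -> rows seen on the right

    def on_left(self, key, row):
        # Remember the row, then emit matches with right rows already buffered.
        self.left[key].append(row)
        return [(key, row, other) for other in self.right[key]]

    def on_right(self, key, row):
        # Remember the row, then emit matches with left rows already buffered.
        self.right[key].append(row)
        return [(key, other, row) for other in self.left[key]]


join = ToyStreamJoin()
early = join.on_left(1, "Kellyann")  # no partner yet, nothing is emitted
late = join.on_right(1, "Moyne")     # partner arrives later, match is emitted
print(early)
print(late)
```

The left row is not lost while waiting: it stays in the buffer until the
matching right row shows up, which is the behaviour asked about for a topic
that lags 15 minutes behind the other.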

TD



On Wed, Mar 14, 2018 at 12:07 PM, Dylan Guedes  wrote:

> I misread it, and thought that you question was if pyspark supports kafka
> lol. Sorry!
>
> On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu 
> wrote:
>
>> Hey Dylan,
>>
>> Great!
>>
>> Can you revert back to my initial and also the latest mail?
>>
>> Thanks,
>> Aakash.
>>
>> On 15-Mar-2018 12:27 AM, "Dylan Guedes"  wrote:
>>
>>> Hi,
>>>
>>> I've been using the Kafka with pyspark since 2.1.
>>>
>>> On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu wrote:
>>>
 Hi,

 I'm yet to.

 Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package
 allows Python? I read somewhere, as of now Scala and Java are the languages
 to be used.

 Please correct me if am wrong.

 Thanks,
 Aakash.

 On 14-Mar-2018 8:24 PM, "Georg Heiler" 
 wrote:

> Did you try spark 2.3 with structured streaming? There watermarking
> and plain sql might be really interesting for you.
> Aakash Basu  schrieb am Mi. 14. März 2018
> um 14:57:
>
>> Hi,
>>
>>
>>
>> *Info (Using):Spark Streaming Kafka 0.8 package*
>>
>> *Spark 2.2.1*
>> *Kafka 1.0.1*
>>
>> As of now, I am feeding paragraphs in Kafka console producer and my
>> Spark, which is acting as a receiver is printing the flattened words, 
>> which
>> is a complete RDD operation.
>>
>> *My motive is to read two tables continuously (being updated) as two
>> distinct Kafka topics being read as two Spark Dataframes and join them
>> based on a key and produce the output. *(I am from Spark-SQL
>> background, pardon my Spark-SQL-ish writing)
>>
>> *It may happen, the first topic is receiving new data 15 mins prior
>> to the second topic, in that scenario, how to proceed? I should not lose
>> any data.*
>>
>> As of now, I want to simply pass paragraphs, read them as RDD,
>> convert to DF and then join to get the common keys as the output. (Just 
>> for
>> R&D).
>>
>> Started using Spark Streaming and Kafka today itself.
>>
>> Please help!
>>
>> Thanks,
>> Aakash.
>>
>
>>>
>


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Dylan Guedes
I misread it and thought your question was whether PySpark supports Kafka,
lol. Sorry!

On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu 
wrote:

> Hey Dylan,
>
> Great!
>
> Can you revert back to my initial and also the latest mail?
>
> Thanks,
> Aakash.
>
> On 15-Mar-2018 12:27 AM, "Dylan Guedes"  wrote:
>
>> Hi,
>>
>> I've been using the Kafka with pyspark since 2.1.
>>
>> On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm yet to.
>>>
>>> Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package
>>> allows Python? I read somewhere, as of now Scala and Java are the languages
>>> to be used.
>>>
>>> Please correct me if am wrong.
>>>
>>> Thanks,
>>> Aakash.
>>>
>>> On 14-Mar-2018 8:24 PM, "Georg Heiler" 
>>> wrote:
>>>
 Did you try spark 2.3 with structured streaming? There watermarking and
 plain sql might be really interesting for you.
 Aakash Basu  schrieb am Mi. 14. März 2018
 um 14:57:

> Hi,
>
>
>
> *Info (Using):Spark Streaming Kafka 0.8 package*
>
> *Spark 2.2.1*
> *Kafka 1.0.1*
>
> As of now, I am feeding paragraphs in Kafka console producer and my
> Spark, which is acting as a receiver is printing the flattened words, 
> which
> is a complete RDD operation.
>
> *My motive is to read two tables continuously (being updated) as two
> distinct Kafka topics being read as two Spark Dataframes and join them
> based on a key and produce the output. *(I am from Spark-SQL
> background, pardon my Spark-SQL-ish writing)
>
> *It may happen, the first topic is receiving new data 15 mins prior to
> the second topic, in that scenario, how to proceed? I should not lose any
> data.*
>
> As of now, I want to simply pass paragraphs, read them as RDD, convert
> to DF and then join to get the common keys as the output. (Just for R&D).
>
> Started using Spark Streaming and Kafka today itself.
>
> Please help!
>
> Thanks,
> Aakash.
>

>>


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hey Dylan,

Great!

Can you reply to my initial mail and also the latest one?

Thanks,
Aakash.

On 15-Mar-2018 12:27 AM, "Dylan Guedes"  wrote:

> Hi,
>
> I've been using the Kafka with pyspark since 2.1.
>
> On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu 
> wrote:
>
>> Hi,
>>
>> I'm yet to.
>>
>> Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package
>> allows Python? I read somewhere, as of now Scala and Java are the languages
>> to be used.
>>
>> Please correct me if am wrong.
>>
>> Thanks,
>> Aakash.
>>
>> On 14-Mar-2018 8:24 PM, "Georg Heiler"  wrote:
>>
>>> Did you try spark 2.3 with structured streaming? There watermarking and
>>> plain sql might be really interesting for you.
>>> Aakash Basu wrote on Wed, 14 March 2018
>>> at 14:57:
>>>
 Hi,



 *Info (Using):Spark Streaming Kafka 0.8 package*

 *Spark 2.2.1*
 *Kafka 1.0.1*

 As of now, I am feeding paragraphs in Kafka console producer and my
 Spark, which is acting as a receiver is printing the flattened words, which
 is a complete RDD operation.

 *My motive is to read two tables continuously (being updated) as two
 distinct Kafka topics being read as two Spark Dataframes and join them
 based on a key and produce the output. *(I am from Spark-SQL
 background, pardon my Spark-SQL-ish writing)

 *It may happen, the first topic is receiving new data 15 mins prior to
 the second topic, in that scenario, how to proceed? I should not lose any
 data.*

 As of now, I want to simply pass paragraphs, read them as RDD, convert
 to DF and then join to get the common keys as the output. (Just for R&D).

 Started using Spark Streaming and Kafka today itself.

 Please help!

 Thanks,
 Aakash.

>>>
>


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Dylan Guedes
Hi,

I've been using Kafka with PySpark since Spark 2.1.

On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu 
wrote:

> Hi,
>
> I'm yet to.
>
> Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package
> allows Python? I read somewhere, as of now Scala and Java are the languages
> to be used.
>
> Please correct me if am wrong.
>
> Thanks,
> Aakash.
>
> On 14-Mar-2018 8:24 PM, "Georg Heiler"  wrote:
>
>> Did you try spark 2.3 with structured streaming? There watermarking and
>> plain sql might be really interesting for you.
>> Aakash Basu wrote on Wed, 14 March 2018 at
>> 14:57:
>>
>>> Hi,
>>>
>>>
>>>
>>> *Info (Using):Spark Streaming Kafka 0.8 package*
>>>
>>> *Spark 2.2.1*
>>> *Kafka 1.0.1*
>>>
>>> As of now, I am feeding paragraphs in Kafka console producer and my
>>> Spark, which is acting as a receiver is printing the flattened words, which
>>> is a complete RDD operation.
>>>
>>> *My motive is to read two tables continuously (being updated) as two
>>> distinct Kafka topics being read as two Spark Dataframes and join them
>>> based on a key and produce the output. *(I am from Spark-SQL
>>> background, pardon my Spark-SQL-ish writing)
>>>
>>> *It may happen, the first topic is receiving new data 15 mins prior to
>>> the second topic, in that scenario, how to proceed? I should not lose any
>>> data.*
>>>
>>> As of now, I want to simply pass paragraphs, read them as RDD, convert
>>> to DF and then join to get the common keys as the output. (Just for R&D).
>>>
>>> Started using Spark Streaming and Kafka today itself.
>>>
>>> Please help!
>>>
>>> Thanks,
>>> Aakash.
>>>
>>


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hi,

I have yet to try it.

I just want to know: when will Spark 2.3 with the Kafka 0.10 Spark package
allow Python? I read somewhere that, as of now, Scala and Java are the only
languages that can be used.

Please correct me if I am wrong.

Thanks,
Aakash.

On 14-Mar-2018 8:24 PM, "Georg Heiler"  wrote:

> Did you try spark 2.3 with structured streaming? There watermarking and
> plain sql might be really interesting for you.
> Aakash Basu wrote on Wed, 14 March 2018 at
> 14:57:
>
>> Hi,
>>
>>
>>
>> *Info (Using):Spark Streaming Kafka 0.8 package*
>>
>> *Spark 2.2.1*
>> *Kafka 1.0.1*
>>
>> As of now, I am feeding paragraphs in Kafka console producer and my
>> Spark, which is acting as a receiver is printing the flattened words, which
>> is a complete RDD operation.
>>
>> *My motive is to read two tables continuously (being updated) as two
>> distinct Kafka topics being read as two Spark Dataframes and join them
>> based on a key and produce the output. *(I am from Spark-SQL background,
>> pardon my Spark-SQL-ish writing)
>>
>> *It may happen, the first topic is receiving new data 15 mins prior to
>> the second topic, in that scenario, how to proceed? I should not lose any
>> data.*
>>
>> As of now, I want to simply pass paragraphs, read them as RDD, convert to
>> DF and then join to get the common keys as the output. (Just for R&D).
>>
>> Started using Spark Streaming and Kafka today itself.
>>
>> Please help!
>>
>> Thanks,
>> Aakash.
>>
>


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Georg Heiler
Did you try Spark 2.3 with Structured Streaming? Its watermarking and
plain SQL might be really interesting for you.
Aakash Basu wrote on Wed, 14 March 2018 at
14:57:

> Hi,
>
>
>
> *Info (Using):Spark Streaming Kafka 0.8 package*
>
> *Spark 2.2.1*
> *Kafka 1.0.1*
>
> As of now, I am feeding paragraphs in Kafka console producer and my Spark,
> which is acting as a receiver is printing the flattened words, which is a
> complete RDD operation.
>
> *My motive is to read two tables continuously (being updated) as two
> distinct Kafka topics being read as two Spark Dataframes and join them
> based on a key and produce the output. *(I am from Spark-SQL background,
> pardon my Spark-SQL-ish writing)
>
> *It may happen, the first topic is receiving new data 15 mins prior to the
> second topic, in that scenario, how to proceed? I should not lose any data.*
>
> As of now, I want to simply pass paragraphs, read them as RDD, convert to
> DF and then join to get the common keys as the output. (Just for R&D).
>
> Started using Spark Streaming and Kafka today itself.
>
> Please help!
>
> Thanks,
> Aakash.
>
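
As a rough illustration of the watermarking idea mentioned in this reply,
here is a plain-Java sketch. It is not Spark's API: the class
WatermarkJoinSketch and its method names are made up for this example, and
it assumes only one topic needs buffering (the one that arrives first). The
point is simply that rows from the early topic are held until the late
topic catches up, and are dropped once they are older than an allowed
delay.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a watermark-bounded join buffer; not Spark's API. */
final class WatermarkJoinSketch {
    /** A buffered row from the early topic plus its event time. */
    private static final class Buffered {
        final String value;
        final long eventTimeMillis;

        Buffered(String value, long eventTimeMillis) {
            this.value = value;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    private final Map<String, Buffered> earlyRows = new HashMap<>();
    private final long allowedDelayMillis;

    WatermarkJoinSketch(long allowedDelayMillis) {
        this.allowedDelayMillis = allowedDelayMillis;
    }

    /** Buffers a row from the topic that arrives first. */
    void onEarlyTopic(String key, String value, long eventTimeMillis) {
        earlyRows.put(key, new Buffered(value, eventTimeMillis));
    }

    /** Joins a late-topic row against the buffer; returns null if nothing is buffered for the key. */
    String onLateTopic(String key, String value) {
        Buffered match = earlyRows.get(key);
        return match == null ? null : match.value + "|" + value;
    }

    /** Evicts rows older than (latestEventTime - allowedDelay); returns how many were dropped. */
    int advanceWatermark(long latestEventTimeMillis) {
        long watermark = latestEventTimeMillis - allowedDelayMillis;
        int before = earlyRows.size();
        earlyRows.values().removeIf(row -> row.eventTimeMillis < watermark);
        return before - earlyRows.size();
    }
}
```

With a 15-minute gap between the topics, the allowed delay would have to be
larger than 15 minutes (say 20) so that no early row is evicted before its
partner arrives. In Structured Streaming the equivalent state management is
done by the engine once watermarks are declared on the event-time columns.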


Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hi,



*Info (Using):Spark Streaming Kafka 0.8 package*

*Spark 2.2.1*
*Kafka 1.0.1*

As of now, I am feeding paragraphs into the Kafka console producer, and my
Spark job, which is acting as a receiver, prints the flattened words; this
is purely an RDD operation.

*My goal is to continuously read two tables (which are being updated) as two
distinct Kafka topics, load them as two Spark DataFrames, join them on a
key, and produce the output.* (I come from a Spark SQL background, so pardon
my Spark-SQL-ish writing.)

*The first topic may receive new data 15 minutes before the second topic.
How should I proceed in that scenario? I must not lose any data.*

As of now, I simply want to pass paragraphs, read them as RDDs, convert them
to DataFrames, and then join them to get the common keys as the output (just
for R&D).

I started using Spark Streaming and Kafka only today.

Please help!

Thanks,
Aakash.


Kafka + Spark Streaming consumer API offsets

2017-06-05 Thread Nipun Arora
I need some clarification for Kafka consumers in Spark or otherwise. I have
the following Kafka Consumer. The consumer is reading from a topic, and I
have a mechanism which blocks the consumer from time to time.

The producer is a separate thread which is continuously sending data. I
want to ensure that the consumer does not drop/not read data sent during
the period when the consumer was "blocked".

*In case the "blocked" part is confusing - we have a modified Spark
scheduler where we take a lock on the scheduler.*

public static JavaDStream<String> getKafkaDStream(String inputTopics,
        String broker, int kafkaPort, JavaStreamingContext ssc) {
    // Topics arrive as a comma-separated list.
    HashSet<String> inputTopicsSet =
            new HashSet<String>(Arrays.asList(inputTopics.split(",")));
    HashMap<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", broker + ":" + kafkaPort);

    // Direct (receiver-less) stream of (key, value) pairs.
    JavaPairInputDStream<String, String> messages =
            KafkaUtils.createDirectStream(
                    ssc,
                    String.class,
                    String.class,
                    StringDecoder.class,
                    StringDecoder.class,
                    kafkaParams,
                    inputTopicsSet
            );

    // Keep only the message value.
    JavaDStream<String> lines = messages.map(
            new Function<Tuple2<String, String>, String>() {
                @Override
                public String call(Tuple2<String, String> tuple2) {
                    return tuple2._2();
                }
            });

    return lines;
}

Thanks
Nipun


Re: Message getting lost in Kafka + Spark Streaming

2017-05-31 Thread Vikash Pareek
Thanks Sidney for your response,

To check whether all the messages are processed, I used an accumulator and
also added a print statement for debugging.


val accum = ssc.sparkContext.accumulator(0, "Debug Accumulator")
...
...
...
val mappedDataStream = dataStream.map(_._2)
mappedDataStream.foreachRDD { rdd =>
  ...
  ...
  ...
    partition.foreach { row =>
      if (debug) println(row.mkString)
      val keyedMessage = new KeyedMessage[String, String](
        props.getProperty("outTopicUnharmonized"), null, row.toString())
      producer.send(keyedMessage)
      println("Messges sent to Kafka: " + keyedMessage.message)
      accum += 1
    }
    // hack, should be done with the flush
    Thread.sleep(1000)
    producer.close()
    print("Accumulator's value is: " + accum)

Every message received by the stream shows up in the output of
println("Messges sent to Kafka: " + keyedMessage.message), and the
accumulator value also matches the number of incoming messages.



Best Regards,


Vikash Pareek



On Thu, Jun 1, 2017 at 11:24 AM, Sidney Feiner 
wrote:

> Are you sure that every message gets processed? It could be that some
> messages failed passing the decoder.
> And during the processing, are you maybe putting the events into a map?
> That way, events with the same key could override each other and that way
> you'll have less final events.
>
> -Original Message-
> From: Vikash Pareek [mailto:vikash.par...@infoobjects.com]
> Sent: Tuesday, May 30, 2017 4:00 PM
> To: user@spark.apache.org
> Subject: Message getting lost in Kafka + Spark Streaming
>
> I am facing an issue related to spark streaming with kafka, my use case is
> as
> follow:
> 1. Spark streaming(DirectStream) application reading data/messages from
> kafka topic and process it 2. On the basis of proccessed message, app will
> write proccessed message to different kafka topics for e.g. if messgese is
> harmonized then write to harmonized topic else unharmonized topic
>
> the problem is that during the streaming somehow we are lossing some
> messaged i.e all the incoming messages are not written to harmonized or
> unharmonized topics.
> for e.g. if app received 30 messages in one batch then sometime it write
> all the messges to output topics(this is expected behaviour) but sometimes
> it writes only 27 (3 messages are lost, this number can change).
>
> Versions as follow:
> Spark 1.6.0
> Kafka 0.9
>
> Kafka topics confguration is as follow:
> # of brokers: 3
> # replicxation factor: 3
> # of paritions: 3
>
> Following are the properties we are using for kafka:
> *  val props = new Properties()
>   props.put("metadata.broker.list",
> properties.getProperty("metadataBrokerList"))
>   props.put("auto.offset.reset",
> properties.getProperty("autoOffsetReset"))
>   props.put("group.id", properties.getProperty("group.id"))
>   props.put("serializer.class", "kafka.serializer.StringEncoder")
>   props.put("outTopicHarmonized",
> properties.getProperty("outletKafkaTopicHarmonized"))
>   props.put("outTopicUnharmonized",
> properties.getProperty("outletKafkaTopicUnharmonized"))
>   props.put("acks", "all");
>   props.put("retries", "5");
>   props.put("request.required.acks", "-1")
> *
> Following is the piece of code where we are writing proccessed messges to
> kafka:
> *  val schemaRdd2 = finalHarmonizedDF.toJSON
>
>   schemaRdd2.foreachPartition { partition =>
> val producerConfig = new ProducerConfig(props)
> val producer = new Producer[String, String](producerConfig)
>
> partition.foreach { row =>
>   if (debug) println(row.mkString)
>   val keyedMessage = new KeyedMessage[String,
> String](props.getProperty("outTopicHarmonized"),
> null, row.toString())
>   producer.send(keyedMessage)
>
> }
> //hack, should be done with the flush
> Thread.sleep(1000)
> producer.close()
>   }
> *
> We explicitely added sleep(1000) for testing purpose.
> But this is also not solving the problem :(
>
> Any suggestion would be appreciated.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Message-getting-lost-in-Kafka-Spark-
> Streaming-tp28719.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


RE: Message getting lost in Kafka + Spark Streaming

2017-05-31 Thread Sidney Feiner
Are you sure that every message gets processed? It could be that some messages
failed to pass the decoder.
And during the processing, are you maybe putting the events into a map? That
way, events with the same key could overwrite each other, and you would
end up with fewer final events.

-Original Message-
From: Vikash Pareek [mailto:vikash.par...@infoobjects.com] 
Sent: Tuesday, May 30, 2017 4:00 PM
To: user@spark.apache.org
Subject: Message getting lost in Kafka + Spark Streaming

I am facing an issue related to spark streaming with kafka, my use case is as
follow:
1. Spark streaming(DirectStream) application reading data/messages from kafka 
topic and process it 2. On the basis of proccessed message, app will write 
proccessed message to different kafka topics for e.g. if messgese is harmonized 
then write to harmonized topic else unharmonized topic
 
the problem is that during the streaming somehow we are lossing some messaged 
i.e all the incoming messages are not written to harmonized or unharmonized 
topics.
for e.g. if app received 30 messages in one batch then sometime it write all 
the messges to output topics(this is expected behaviour) but sometimes it 
writes only 27 (3 messages are lost, this number can change).
 
Versions as follow:
Spark 1.6.0
Kafka 0.9
 
Kafka topics confguration is as follow:
# of brokers: 3
# replicxation factor: 3
# of paritions: 3
 
Following are the properties we are using for kafka:
*  val props = new Properties()
  props.put("metadata.broker.list",
properties.getProperty("metadataBrokerList"))
  props.put("auto.offset.reset",
properties.getProperty("autoOffsetReset"))
  props.put("group.id", properties.getProperty("group.id"))
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  props.put("outTopicHarmonized",
properties.getProperty("outletKafkaTopicHarmonized"))
  props.put("outTopicUnharmonized",
properties.getProperty("outletKafkaTopicUnharmonized"))
  props.put("acks", "all");
  props.put("retries", "5");
  props.put("request.required.acks", "-1")
*
Following is the piece of code where we are writing proccessed messges to
kafka:
*  val schemaRdd2 = finalHarmonizedDF.toJSON
 
  schemaRdd2.foreachPartition { partition =>
val producerConfig = new ProducerConfig(props)
val producer = new Producer[String, String](producerConfig)
 
partition.foreach { row =>
  if (debug) println(row.mkString)
  val keyedMessage = new KeyedMessage[String, 
String](props.getProperty("outTopicHarmonized"),
null, row.toString())
  producer.send(keyedMessage)
 
}
//hack, should be done with the flush
Thread.sleep(1000)
producer.close()
  }
*
We explicitely added sleep(1000) for testing purpose.
But this is also not solving the problem :(
 
Any suggestion would be appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Message-getting-lost-in-Kafka-Spark-Streaming-tp28719.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org





Re: Message getting lost in Kafka + Spark Streaming

2017-05-30 Thread Cody Koeninger
The first thing I noticed: you should be using a singleton Kafka producer,
not recreating one for every partition.  It's thread-safe.
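
To make the singleton suggestion concrete, here is a minimal plain-Java
sketch of the pattern. StubProducer is a made-up stand-in for a real,
thread-safe Kafka producer client, and ProducerHolder and writePartition
are hypothetical names used only for this illustration. The idea is that
each executor JVM creates the producer once, lazily, and every partition
reuses it instead of building and closing its own.

```java
import java.util.List;

final class ProducerHolder {
    /** Hypothetical stand-in for a thread-safe Kafka producer client. */
    static final class StubProducer {
        private int sent = 0;

        synchronized void send(String topic, String message) {
            sent++;
        }

        synchronized int sentCount() {
            return sent;
        }
    }

    private static StubProducer instance;
    private static int created = 0;

    /** Lazily creates one producer per JVM and reuses it afterwards. */
    static synchronized StubProducer get() {
        if (instance == null) {
            instance = new StubProducer();
            created++;
        }
        return instance;
    }

    static synchronized int createdCount() {
        return created;
    }

    /** What the body of a foreachPartition call might do: fetch the shared producer and send. */
    static void writePartition(List<String> rows, String topic) {
        StubProducer producer = get();
        for (String row : rows) {
            producer.send(topic, row);
        }
        // The shared producer is deliberately not closed here.
    }
}
```

Because the producer outlives any single partition, nothing is closed while
messages may still be in flight, which is the situation the Thread.sleep
hack in the original code was trying to work around.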

On Tue, May 30, 2017 at 7:59 AM, Vikash Pareek
 wrote:
> I am facing an issue related to spark streaming with kafka, my use case is as
> follow:
> 1. Spark streaming(DirectStream) application reading data/messages from
> kafka topic and process it
> 2. On the basis of proccessed message, app will write proccessed message to
> different kafka topics
> for e.g. if messgese is harmonized then write to harmonized topic else
> unharmonized topic
>
> the problem is that during the streaming somehow we are lossing some
> messaged i.e all the incoming messages are not written to harmonized or
> unharmonized topics.
> for e.g. if app received 30 messages in one batch then sometime it write all
> the messges to output topics(this is expected behaviour) but sometimes it
> writes only 27 (3 messages are lost, this number can change).
>
> Versions as follow:
> Spark 1.6.0
> Kafka 0.9
>
> Kafka topics confguration is as follow:
> # of brokers: 3
> # replicxation factor: 3
> # of paritions: 3
>
> Following are the properties we are using for kafka:
> *  val props = new Properties()
>   props.put("metadata.broker.list",
> properties.getProperty("metadataBrokerList"))
>   props.put("auto.offset.reset",
> properties.getProperty("autoOffsetReset"))
>   props.put("group.id", properties.getProperty("group.id"))
>   props.put("serializer.class", "kafka.serializer.StringEncoder")
>   props.put("outTopicHarmonized",
> properties.getProperty("outletKafkaTopicHarmonized"))
>   props.put("outTopicUnharmonized",
> properties.getProperty("outletKafkaTopicUnharmonized"))
>   props.put("acks", "all");
>   props.put("retries", "5");
>   props.put("request.required.acks", "-1")
> *
> Following is the piece of code where we are writing proccessed messges to
> kafka:
> *  val schemaRdd2 = finalHarmonizedDF.toJSON
>
>   schemaRdd2.foreachPartition { partition =>
> val producerConfig = new ProducerConfig(props)
> val producer = new Producer[String, String](producerConfig)
>
> partition.foreach { row =>
>   if (debug) println(row.mkString)
>   val keyedMessage = new KeyedMessage[String,
> String](props.getProperty("outTopicHarmonized"),
> null, row.toString())
>   producer.send(keyedMessage)
>
> }
> //hack, should be done with the flush
> Thread.sleep(1000)
> producer.close()
>   }
> *
> We explicitely added sleep(1000) for testing purpose.
> But this is also not solving the problem :(
>
> Any suggestion would be appreciated.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Message-getting-lost-in-Kafka-Spark-Streaming-tp28719.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Message getting lost in Kafka + Spark Streaming

2017-05-30 Thread Vikash Pareek
I am facing an issue with Spark Streaming and Kafka. My use case is as
follows:
1. A Spark Streaming (DirectStream) application reads messages from a
Kafka topic and processes them.
2. Based on the processed message, the app writes it to one of several
Kafka topics; e.g. if the message is harmonized it is written to the
harmonized topic, otherwise to the unharmonized topic.

The problem is that during streaming we are somehow losing some
messages, i.e. not all of the incoming messages are written to the
harmonized or unharmonized topics.
For example, if the app receives 30 messages in one batch, sometimes it
writes all of them to the output topics (the expected behaviour), but
sometimes it writes only 27 (3 messages are lost; this number can vary).
 
Versions are as follows:
Spark 1.6.0
Kafka 0.9

The Kafka topic configuration is as follows:
# of brokers: 3
replication factor: 3
# of partitions: 3
 
The following are the properties we are using for Kafka:

  val props = new Properties()
  props.put("metadata.broker.list", properties.getProperty("metadataBrokerList"))
  props.put("auto.offset.reset", properties.getProperty("autoOffsetReset"))
  props.put("group.id", properties.getProperty("group.id"))
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  props.put("outTopicHarmonized", properties.getProperty("outletKafkaTopicHarmonized"))
  props.put("outTopicUnharmonized", properties.getProperty("outletKafkaTopicUnharmonized"))
  props.put("acks", "all")
  props.put("retries", "5")
  props.put("request.required.acks", "-1")

The following is the piece of code where we write the processed messages to
Kafka:

  val schemaRdd2 = finalHarmonizedDF.toJSON

  schemaRdd2.foreachPartition { partition =>
    val producerConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](producerConfig)

    partition.foreach { row =>
      if (debug) println(row.mkString)
      val keyedMessage = new KeyedMessage[String, String](
        props.getProperty("outTopicHarmonized"), null, row.toString())
      producer.send(keyedMessage)
    }
    // hack, should be done with the flush
    Thread.sleep(1000)
    producer.close()
  }

We explicitly added sleep(1000) for testing purposes,
but this does not solve the problem either :(

Any suggestions would be appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Message-getting-lost-in-Kafka-Spark-Streaming-tp28719.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Can't access the data in Kafka Spark Streaming globally

2016-12-23 Thread Cody Koeninger
This doesn't sound like a question regarding Kafka streaming, it
sounds like confusion about the scope of variables in spark generally.
Is that right?  If so, I'd suggest reading the documentation, starting
with a simple rdd (e.g. using sparkContext.parallelize), and
experimenting to confirm your understanding.
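
One common form of this confusion can be sketched in plain Java, under the
assumption (not confirmed by the original question) that the code tried to
fill a driver-side collection from inside foreachPartition. Variables
captured by a closure are serialized and shipped to the executors, so the
executor works on a copy. The class ClosureCopySketch and the method
shipToExecutor are made up for this illustration and only imitate that
round trip with Java serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;

final class ClosureCopySketch {
    /**
     * Round-trips a list through Java serialization, approximating how a
     * closure's captured variables reach an executor as an independent copy.
     */
    @SuppressWarnings("unchecked")
    static ArrayList<String> shipToExecutor(ArrayList<String> captured) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(captured);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ArrayList<String>) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }
}
```

Adding an element to the list returned by shipToExecutor leaves the
original list untouched, which mirrors why values written inside
foreachPartition are not visible "globally" on the driver. Results have to
be returned explicitly, for example by collecting the RDD to the driver.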

On Thu, Dec 22, 2016 at 11:46 PM, Sree Eedupuganti  wrote:
> I am trying to stream the data from Kafka to Spark.
>
> JavaPairInputDStream directKafkaStream =
> KafkaUtils.createDirectStream(ssc,
> String.class,
> String.class,
> StringDecoder.class,
> StringDecoder.class,
> kafkaParams, topics);
>
> Here i am iterating over the JavaPairInputDStream to process the RDD's.
>
> directKafkaStream.foreachRDD(rdd ->{
> rdd.foreachPartition(items ->{
> while (items.hasNext()) {
> String[] State = items.next()._2.split("\\,");
>
> System.out.println(State[2]+","+State[3]+","+State[4]+"--");
> };
> });
> });
>
>
> In this i can able to access the String Array but when i am trying to access
> the String Array data globally i can't access the data. Here my requirement
> is if i had access these data globally i had another lookup table in Hive.
> So i am trying to perform an operation on these. Any suggestions please,
> Thanks.
>
>
> --
> Best Regards,
> Sreeharsha Eedupuganti

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Can't access the data in Kafka Spark Streaming globally

2016-12-22 Thread Sree Eedupuganti
I am trying to stream the data from Kafka to Spark.

JavaPairInputDStream<String, String> directKafkaStream =
        KafkaUtils.createDirectStream(ssc,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                kafkaParams, topics);

Here I am iterating over the JavaPairInputDStream to process the RDDs.

directKafkaStream.foreachRDD(rdd -> {
    rdd.foreachPartition(items -> {
        while (items.hasNext()) {
            String[] State = items.next()._2.split("\\,");
            System.out.println(State[2] + "," + State[3] + "," + State[4] + "--");
        }
    });
});


Inside this loop I am able to access the String array, but when I try to
access the array data globally (outside the loop) I cannot. I need global
access to this data because I have another lookup table in Hive and want
to perform an operation that combines the two. Any suggestions, please?
Thanks.

-- 
Best Regards,
Sreeharsha Eedupuganti


Re: Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread Chawla,Sumit
How are you producing data? I just tested your code and I can receive the
messages from Kafka.



Regards
Sumit Chawla


On Sun, Sep 18, 2016 at 7:56 PM, Sateesh Karuturi <
sateesh.karutu...@gmail.com> wrote:

> i am very new to *Spark streaming* and i am implementing small exercise
> like sending *XML* data from *kafka* and need to receive that *streaming* data
> through *spark streaming.* I tried in all possible ways.. but every time
> i am getting *empty values.*
>
>
> *There is no problem in Kafka side, only problem is receiving
> the Streaming data from Spark side.Here is the code how i am
> implementing:package com.package; import org.apache.spark.SparkConf; import
> org.apache.spark.api.java.JavaSparkContext; import
> org.apache.spark.streaming.Duration; import
> org.apache.spark.streaming.api.java.JavaStreamingContext; public class
> SparkStringConsumer { public static void main(String[] args) { SparkConf
> conf = new SparkConf() .setAppName("kafka-sandbox") .setMaster("local[*]");
> JavaSparkContext sc = new JavaSparkContext(conf); JavaStreamingContext ssc
> = new JavaStreamingContext(sc, new Duration(2000)); Map
> kafkaParams = new HashMap<>(); kafkaParams.put("metadata.broker.list",
> "localhost:9092"); Set topics = Collections.singleton("mytopic");
> JavaPairInputDStream directKafkaStream =
> KafkaUtils.createDirectStream(ssc, String.class, String.class,
> StringDecoder.class, StringDecoder.class, kafkaParams, topics);
> directKafkaStream.foreachRDD(rdd -> { System.out.println("--- New RDD with
> " + rdd.partitions().size() + " partitions and " + rdd.count() + "
> records"); rdd.foreach(record -> System.out.println(record._2)); });
> ssc.start(); ssc.awaitTermination(); } } And i am using following
> versions:**Zookeeper 3.4.6Scala 2.11Spark 2.0Kafka 0.8.2***
>


Re: Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread ayan guha
An empty RDD generally means Kafka is not producing messages in those
intervals. For example, if I have a batch duration of 10 seconds and no
messages arrive within a given 10 seconds, the RDD corresponding to that
interval will be empty.

On Mon, Sep 19, 2016 at 12:56 PM, Sateesh Karuturi <
sateesh.karutu...@gmail.com> wrote:

> i am very new to *Spark streaming* and i am implementing small exercise
> like sending *XML* data from *kafka* and need to receive that *streaming* data
> through *spark streaming.* I tried in all possible ways.. but every time
> i am getting *empty values.*
>
>
> *There is no problem in Kafka side, only problem is receiving
> the Streaming data from Spark side.Here is the code how i am
> implementing:package com.package; import org.apache.spark.SparkConf; import
> org.apache.spark.api.java.JavaSparkContext; import
> org.apache.spark.streaming.Duration; import
> org.apache.spark.streaming.api.java.JavaStreamingContext; public class
> SparkStringConsumer { public static void main(String[] args) { SparkConf
> conf = new SparkConf() .setAppName("kafka-sandbox") .setMaster("local[*]");
> JavaSparkContext sc = new JavaSparkContext(conf); JavaStreamingContext ssc
> = new JavaStreamingContext(sc, new Duration(2000)); Map
> kafkaParams = new HashMap<>(); kafkaParams.put("metadata.broker.list",
> "localhost:9092"); Set topics = Collections.singleton("mytopic");
> JavaPairInputDStream directKafkaStream =
> KafkaUtils.createDirectStream(ssc, String.class, String.class,
> StringDecoder.class, StringDecoder.class, kafkaParams, topics);
> directKafkaStream.foreachRDD(rdd -> { System.out.println("--- New RDD with
> " + rdd.partitions().size() + " partitions and " + rdd.count() + "
> records"); rdd.foreach(record -> System.out.println(record._2)); });
> ssc.start(); ssc.awaitTermination(); } } And i am using following
> versions:**Zookeeper 3.4.6Scala 2.11Spark 2.0Kafka 0.8.2***
>



-- 
Best Regards,
Ayan Guha


Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread Sateesh Karuturi
I am very new to *Spark Streaming* and am implementing a small exercise:
sending *XML* data from *Kafka* and receiving that *streaming* data
through *Spark Streaming*. I have tried every way I could think of, but
every time I get *empty values*.

There is no problem on the Kafka side; the only problem is receiving the
streaming data on the Spark side. Here is the code I am using:

    package com.package;

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class SparkStringConsumer {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("kafka-sandbox")
                    .setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // 2-second batch interval
            JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "localhost:9092");
            Set<String> topics = Collections.singleton("mytopic");

            JavaPairInputDStream<String, String> directKafkaStream =
                    KafkaUtils.createDirectStream(ssc, String.class, String.class,
                            StringDecoder.class, StringDecoder.class, kafkaParams, topics);

            directKafkaStream.foreachRDD(rdd -> {
                System.out.println("--- New RDD with " + rdd.partitions().size()
                        + " partitions and " + rdd.count() + " records");
                rdd.foreach(record -> System.out.println(record._2));
            });

            ssc.start();
            ssc.awaitTermination();
        }
    }

I am using the following versions:
Zookeeper 3.4.6
Scala 2.11
Spark 2.0
Kafka 0.8.2


Re: Improving performance of a kafka spark streaming app

2016-06-24 Thread Cody Koeninger
Unless I'm misreading the image you posted, it does show event counts
for the single batch that is still running, with 1.7 billion events in
it.  The recent batches do show 0 events, but I'm guessing that's
because they're actually empty.

When you said you had a kafka topic with 1.7 billion events in it, did
you mean it just statically contains that many events, and no new
events are coming in currently?  If that's the case, you may be better
off just generating RDDs of an appropriate range of offsets, one after
the other, rather than using streaming.

I'm also still not clear if you have tried benchmarking a job that
simply reads from your topic, without inserting into hbase.
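
For the static-topic case, the "one RDD per range of offsets" idea can be
sketched in plain Java as follows. OffsetChunks and split are hypothetical
names for this illustration; the sketch only computes consecutive half-open
offset ranges for a single partition. In a real job each range would
presumably be turned into an OffsetRange and passed to
KafkaUtils.createRDD, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetChunks {
    /**
     * Splits [fromOffset, untilOffset) into consecutive half-open ranges
     * of at most chunkSize offsets each, returned as {start, end} pairs.
     */
    static List<long[]> split(long fromOffset, long untilOffset, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long start = fromOffset; start < untilOffset; start += chunkSize) {
            long end = Math.min(start + chunkSize, untilOffset);
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }
}
```

Processing the ranges one after the other gives explicit control over how
much data each job handles, without the batch interval and rate-limit
tuning that streaming requires.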

On Thu, Jun 23, 2016 at 12:09 AM, Colin Kincaid Williams  wrote:
> Streaming UI tab showing empty events and very different metrics than on 1.5.2
>
> On Thu, Jun 23, 2016 at 5:06 AM, Colin Kincaid Williams  
> wrote:
>> After a bit of effort I moved from a Spark cluster running 1.5.2, to a
>> Yarn cluster running 1.6.1 jars. I'm still setting the maxRPP. The
>> completed batches are no longer showing the number of events processed
>> in the Streaming UI tab . I'm getting around 4k inserts per second in
>> hbase, but I haven't yet tried to remove or reset the mRPP.  I will
>> attach a screenshot of the UI tab. It shows significantly lower
>> figures for processing and delay times, than the previous posted shot.
>> It also shows the batches as empty, however I see the requests hitting
>> hbase.
>>
>> Then it's possible my issues were related to running on the Spark
>> 1.5.2 cluster. Also is the missing event count in the completed
>> batches a bug? Should I file an issue?
>>
>> On Tue, Jun 21, 2016 at 9:04 PM, Colin Kincaid Williams  
>> wrote:
>>> Thanks @Cody, I will try that out. In the interm, I tried to validate
>>> my Hbase cluster by running a random write test and see 30-40K writes
>>> per second. This suggests there is noticeable room for improvement.

Re: Improving performance of a kafka spark streaming app

2016-06-22 Thread Colin Kincaid Williams
Streaming UI tab showing empty events and very different metrics than on 1.5.2

On Thu, Jun 23, 2016 at 5:06 AM, Colin Kincaid Williams  wrote:
> After a bit of effort I moved from a Spark cluster running 1.5.2 to a
> YARN cluster running 1.6.1 jars. I'm still setting maxRatePerPartition.
> The completed batches no longer show the number of events processed in
> the Streaming UI tab. I'm getting around 4k inserts per second in
> HBase, but I haven't yet tried to remove or reset maxRatePerPartition.
> I will attach a screenshot of the UI tab. It shows significantly lower
> figures for processing and delay times than the previously posted
> screenshot. It also shows the batches as empty, although I see the
> requests hitting HBase.
>
> So it's possible my issues were related to running on the Spark 1.5.2
> cluster. Also, is the missing event count in the completed batches a
> bug? Should I file an issue?
>
> On Tue, Jun 21, 2016 at 9:04 PM, Colin Kincaid Williams  
> wrote:
>> Thanks @Cody, I will try that out. In the interim, I tried to validate
>> my HBase cluster by running a random write test and saw 30-40K writes
>> per second. This suggests there is noticeable room for improvement.
>>
>> On Tue, Jun 21, 2016 at 8:32 PM, Cody Koeninger  wrote:
>>> Take HBase out of the equation and just measure what your read
>>> performance is by doing something like
>>>
>>> createDirectStream(...).foreachRDD(rdd => rdd.foreach(println))
>>>
>>> not take() or print(), which only read the first few records of each
>>> batch.

Re: Improving performance of a kafka spark streaming app

2016-06-22 Thread Colin Kincaid Williams
After a bit of effort I moved from a Spark cluster running 1.5.2 to a
YARN cluster running 1.6.1 jars. I'm still setting maxRatePerPartition.
The completed batches no longer show the number of events processed in
the Streaming UI tab. I'm getting around 4k inserts per second in
HBase, but I haven't yet tried to remove or reset maxRatePerPartition.
I will attach a screenshot of the UI tab. It shows significantly lower
figures for processing and delay times than the previously posted
screenshot. It also shows the batches as empty, although I see the
requests hitting HBase.

So it's possible my issues were related to running on the Spark 1.5.2
cluster. Also, is the missing event count in the completed batches a
bug? Should I file an issue?

On Tue, Jun 21, 2016 at 9:04 PM, Colin Kincaid Williams  wrote:
> Thanks @Cody, I will try that out. In the interim, I tried to validate
> my HBase cluster by running a random write test and saw 30-40K writes
> per second. This suggests there is noticeable room for improvement.
>
> On Tue, Jun 21, 2016 at 8:32 PM, Cody Koeninger  wrote:
>> Take HBase out of the equation and just measure what your read
>> performance is by doing something like
>>
>> createDirectStream(...).foreachRDD(rdd => rdd.foreach(println))
>>
>> not take() or print(), which only read the first few records of each
>> batch.
>>
>> On Tue, Jun 21, 2016 at 3:19 PM, Colin Kincaid Williams  
>> wrote:
>>> @Cody I was able to bring my processing time down to a second by
>>> setting maxRatePerPartition as discussed. My bad that I didn't
>>> recognize it as the cause of my scheduling delay.
>>>
>>> Since then I've tried experimenting with a larger StreamingContext
>>> batch duration. I've been trying to get some noticeable improvement
>>> inserting messages from Kafka -> HBase using the above application.
>>> I'm currently getting around 3500 inserts / second on a 9 node HBase
>>> cluster. So far, I haven't been able to get much more throughput, so
>>> I'm looking for advice on how I should tune Kafka and Spark for this
>>> job.
>>>
>>> I can create a Kafka topic with as many partitions as I want. I can
>>> set the Duration and maxRatePerPartition. I have 1.7 billion messages
>>> that I can insert rather quickly into the Kafka queue, and I'd like to
>>> get them into Hbase as quickly as possible.
>>>
>>> I'm looking for advice regarding # Kafka Topic Partitions / Streaming
>>> Duration / maxRatePerPartition / any other spark settings or code
>>> changes that I should make to try to get a better consumption rate.
>>>
>>> Thanks for all the help so far, this is the first Spark application I
>>> have written.

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Colin Kincaid Williams
Thanks @Cody, I will try that out. In the interim, I tried to validate
my HBase cluster by running a random write test and saw 30-40K writes
per second. This suggests there is noticeable room for improvement.

On Tue, Jun 21, 2016 at 8:32 PM, Cody Koeninger  wrote:
> Take HBase out of the equation and just measure what your read
> performance is by doing something like
>
> createDirectStream(...).foreachRDD(rdd => rdd.foreach(println))
>
> not take() or print(), which only read the first few records of each
> batch.
>
> On Tue, Jun 21, 2016 at 3:19 PM, Colin Kincaid Williams  
> wrote:
>> @Cody I was able to bring my processing time down to a second by
>> setting maxRatePerPartition as discussed. My bad that I didn't
>> recognize it as the cause of my scheduling delay.
>>
>> Since then I've tried experimenting with a larger StreamingContext
>> batch duration. I've been trying to get some noticeable improvement
>> inserting messages from Kafka -> HBase using the above application.
>> I'm currently getting around 3500 inserts / second on a 9 node HBase
>> cluster. So far, I haven't been able to get much more throughput, so
>> I'm looking for advice on how I should tune Kafka and Spark for this
>> job.
>>
>> I can create a Kafka topic with as many partitions as I want. I can
>> set the Duration and maxRatePerPartition. I have 1.7 billion messages
>> that I can insert rather quickly into the Kafka queue, and I'd like to
>> get them into Hbase as quickly as possible.
>>
>> I'm looking for advice regarding # Kafka Topic Partitions / Streaming
>> Duration / maxRatePerPartition / any other spark settings or code
>> changes that I should make to try to get a better consumption rate.
>>
>> Thanks for all the help so far, this is the first Spark application I
>> have written.
>>
>> On Mon, Jun 20, 2016 at 12:32 PM, Colin Kincaid Williams  
>> wrote:
>>> I'll try dropping maxRatePerPartition to 400, or maybe even lower.
>>> However, even at application startup I have this large scheduling
>>> delay. I will report my progress later on.

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Cody Koeninger
Take HBase out of the equation and just measure what your read
performance is by doing something like

createDirectStream(...).foreachRDD(rdd => rdd.foreach(println))

not take() or print(), which only read the first few records of each
batch.

On Tue, Jun 21, 2016 at 3:19 PM, Colin Kincaid Williams  wrote:
> @Cody I was able to bring my processing time down to a second by
> setting maxRatePerPartition as discussed. My bad that I didn't
> recognize it as the cause of my scheduling delay.
>
> Since then I've tried experimenting with a larger StreamingContext
> batch duration. I've been trying to get some noticeable improvement
> inserting messages from Kafka -> HBase using the above application.
> I'm currently getting around 3500 inserts / second on a 9 node HBase
> cluster. So far, I haven't been able to get much more throughput, so
> I'm looking for advice on how I should tune Kafka and Spark for this
> job.
>
> I can create a Kafka topic with as many partitions as I want. I can
> set the Duration and maxRatePerPartition. I have 1.7 billion messages
> that I can insert rather quickly into the Kafka queue, and I'd like to
> get them into Hbase as quickly as possible.
>
> I'm looking for advice regarding # Kafka Topic Partitions / Streaming
> Duration / maxRatePerPartition / any other spark settings or code
> changes that I should make to try to get a better consumption rate.
>
> Thanks for all the help so far, this is the first Spark application I
> have written.
>
> On Mon, Jun 20, 2016 at 12:32 PM, Colin Kincaid Williams  
> wrote:
>> I'll try dropping maxRatePerPartition to 400, or maybe even lower.
>> However, even at application startup I have this large scheduling
>> delay. I will report my progress later on.
>>
>> On Mon, Jun 20, 2016 at 2:12 PM, Cody Koeninger  wrote:
>>> If your batch time is 1 second and your average processing time is
>>> 1.16 seconds, you're always going to be falling behind.  That would
>>> explain why you've built up an hour of scheduling delay after eight
>>> hours of running.

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Colin Kincaid Williams
@Cody I was able to bring my processing time down to a second by
setting maxRatePerPartition as discussed. My bad that I didn't
recognize it as the cause of my scheduling delay.

Since then I've tried experimenting with a larger StreamingContext
batch duration. I've been trying to get some noticeable improvement
inserting messages from Kafka -> HBase using the above application.
I'm currently getting around 3500 inserts / second on a 9 node HBase
cluster. So far, I haven't been able to get much more throughput, so
I'm looking for advice on how I should tune Kafka and Spark for this
job.

I can create a Kafka topic with as many partitions as I want. I can
set the Duration and maxRatePerPartition. I have 1.7 billion messages
that I can insert rather quickly into the Kafka queue, and I'd like to
get them into Hbase as quickly as possible.

I'm looking for advice regarding # Kafka Topic Partitions / Streaming
Duration / maxRatePerPartition / any other spark settings or code
changes that I should make to try to get a better consumption rate.

Thanks for all the help so far, this is the first Spark application I
have written.
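
For a sense of scale, here is a rough Java sketch of the arithmetic
behind those knobs. It assumes the direct stream caps each batch at
(Kafka partitions x maxRatePerPartition x batch seconds), and that the
insert rate stays constant while draining the backlog, which real jobs
will not do exactly. The class and method names are made up for
illustration; the 6 partitions, the 500 rate, the 3500 inserts / second
and the 30K writes / second figures are the ones mentioned in this
thread.

```java
final class DrainEstimate {
    /** Upper bound on records the direct stream pulls into one batch. */
    static long maxEventsPerBatch(int kafkaPartitions, int maxRatePerPartition,
                                  double batchSeconds) {
        return Math.round((double) kafkaPartitions * maxRatePerPartition * batchSeconds);
    }

    /** Days needed to drain a backlog at a constant insert rate. */
    static double drainDays(long backlogMessages, double insertsPerSecond) {
        return backlogMessages / insertsPerSecond / 86_400.0;
    }

    public static void main(String[] args) {
        // 6 partitions, 500 msgs/s per partition, 1 s batches
        System.out.println("max events per batch: " + maxEventsPerBatch(6, 500, 1.0));
        // 1.7 billion messages at the current rate and at the HBase test rate
        System.out.println("days at 3500/s:  " + drainDays(1_700_000_000L, 3_500.0));
        System.out.println("days at 30000/s: " + drainDays(1_700_000_000L, 30_000.0));
    }
}
```

At 3500 inserts / second the 1.7 billion messages take about 5.6 days;
at the 30K writes / second seen in the HBase write test they would take
well under a day.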

On Mon, Jun 20, 2016 at 12:32 PM, Colin Kincaid Williams  wrote:
> I'll try dropping maxRatePerPartition to 400, or maybe even lower.
> However, even at application startup I have this large scheduling
> delay. I will report my progress later on.
>
> On Mon, Jun 20, 2016 at 2:12 PM, Cody Koeninger  wrote:
>> If your batch time is 1 second and your average processing time is
>> 1.16 seconds, you're always going to be falling behind.  That would
>> explain why you've built up an hour of scheduling delay after eight
>> hours of running.
>>
>> On Sat, Jun 18, 2016 at 4:40 PM, Colin Kincaid Williams  
>> wrote:
>>> Hi Mich again,
>>>
>>> Regarding batch window, etc. I have provided the sources, but I'm not
>>> currently calling the window function. Did you see the program source?
>>> It's only 100 lines.
>>>
>>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>>
>>> Then I would expect I'm using defaults, other than what has been shown
>>> in the configuration.
>>>
>>> For example:
>>>
>>> In the launcher configuration I set --conf
>>> spark.streaming.kafka.maxRatePerPartition=500 \ and I believe there
>>> are 500 messages for the duration set in the application:
>>> JavaStreamingContext jssc = new JavaStreamingContext(jsc, new
>>> Duration(1000));
>>>
>>>
>>> Then with the --num-executors 6 \ submit flag, and the
>>> spark.streaming.kafka.maxRatePerPartition=500 I think that's how we
>>> arrive at the 3000 events per batch in the UI, pasted above.
>>>
>>> Feel free to correct me if I'm wrong.
>>>
>>> Then are you suggesting that I set the window?
>>>
>>> Maybe following this as reference:
>>>
>>> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html
Re: Improving performance of a kafka spark streaming app

2016-06-20 Thread Colin Kincaid Williams
I'll try dropping maxRatePerPartition to 400, or maybe even lower.
However, even at application startup I have this large scheduling
delay. I will report my progress later on.
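
As a sanity check on the 400 figure, a small Java sketch. It assumes
processing time scales linearly with the number of events in a batch,
which is only an approximation, and it uses the 3000 events per batch,
1.16 s average processing time, 1 s batch interval and 6 Kafka
partitions quoted elsewhere in this thread. The class and method names
are hypothetical.

```java
final class RateCap {
    /**
     * Largest maxRatePerPartition (msgs/s per partition) whose batches
     * should still finish within one batch interval, under the linear
     * scaling assumption described above.
     */
    static int maxRateThatFits(long eventsPerBatch, double processingSec,
                               double batchIntervalSec, int kafkaPartitions) {
        double secPerEvent = processingSec / eventsPerBatch;    // observed cost per event
        double eventsThatFit = batchIntervalSec / secPerEvent;  // events one interval can absorb
        double perPartitionPerSecond = eventsThatFit / kafkaPartitions / batchIntervalSec;
        return (int) Math.floor(perPartitionPerSecond);
    }

    public static void main(String[] args) {
        System.out.println("maxRatePerPartition that fits: "
                + maxRateThatFits(3000, 1.16, 1.0, 6));
    }
}
```

Under those assumptions the break-even rate is a little above 430 per
partition, so 400 leaves a small margin.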

On Mon, Jun 20, 2016 at 2:12 PM, Cody Koeninger  wrote:
> If your batch time is 1 second and your average processing time is
> 1.16 seconds, you're always going to be falling behind.  That would
> explain why you've built up an hour of scheduling delay after eight
> hours of running.
>
> On Sat, Jun 18, 2016 at 4:40 PM, Colin Kincaid Williams  
> wrote:
>> Hi Mich again,
>>
>> Regarding batch window, etc. I have provided the sources, but I'm not
>> currently calling the window function. Did you see the program source?
>> It's only 100 lines.
>>
>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>
>> Then I would expect I'm using defaults, other than what has been shown
>> in the configuration.
>>
>> For example:
>>
>> In the launcher configuration I set --conf
>> spark.streaming.kafka.maxRatePerPartition=500 \ and I believe there
>> are 500 messages for the duration set in the application:
>> JavaStreamingContext jssc = new JavaStreamingContext(jsc, new
>> Duration(1000));
>>
>>
>> Then with the --num-executors 6 \ submit flag, and the
>> spark.streaming.kafka.maxRatePerPartition=500 I think that's how we
>> arrive at the 3000 events per batch in the UI, pasted above.
>>
>> Feel free to correct me if I'm wrong.
>>
>> Then are you suggesting that I set the window?
>>
>> Maybe following this as reference:
>>
>> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html
>>
>> On Sat, Jun 18, 2016 at 8:08 PM, Mich Talebzadeh
>>  wrote:
>>> Ok
>>>
>>> What is the set up for these please?
>>>
>>> batch window
>>> window length
>>> sliding interval
>>>
>>> And also in each batch window how much data do you get in (no of messages in
>>> the topic whatever)?
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>
>>> On 18 June 2016 at 21:01, Mich Talebzadeh  wrote:

>>>> I believe you have an issue with performance?
>>>>
>>>> have you checked spark GUI (default 4040) for details including shuffles
>>>> etc?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>>
>>>> On 18 June 2016 at 20:59, Colin Kincaid Williams  wrote:
>>>>>
>>>>> There are 25 nodes in the spark cluster.
>>>>>
>>>>> On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
>>>>>  wrote:
>>>>> > how many nodes are in your cluster?
>>>>> >
>>>>> > --num-executors 6 \
>>>>> >  --driver-memory 4G \
>>>>> >  --executor-memory 2G \
>>>>> >  --total-executor-cores 12 \
>>>>> >
>>>>> >
>>>>> > Dr Mich Talebzadeh
>>>>> >
>>>>> >
>>>>> >
>>>>> > LinkedIn
>>>>> >
>>>>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> >
>>>>> >
>>>>> >
>>>>> > http://talebzadehmich.wordpress.com
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On 18 June 2016 at 20:40, Colin Kincaid Williams 
>>>>> > wrote:
>>>>> >>
>>>>> >> I updated my app to Spark 1.5.2 streaming so that it consumes from
>>>>> >> Kafka using the direct api and inserts content into an hbase cluster,
>>>>> >> as described in this thread. I was away from this project for awhile
>>>>> >> due to events in my family.
>>>>> >>
>>>>> >> Currently my scheduling delay is high, but the processing time is
>>>>> >> stable around a second. I changed my setup to use 6 kafka partitions
>>>>> >> on a set of smaller kafka brokers, with fewer disks. I've included
>>>>> >> some details below, including the script I use to launch the
>>>>> >> application. I'm using a Spark on Hbase library, whose version is
>>>>> >> relevant to my Hbase cluster. Is it apparent there is something wrong
>>>>> >> with my launch method that could be causing the delay, related to the
>>>>> >> included jars?
>>>>> >>
>>>>> >> Or is there something wrong with the very simple approach I'm taking
>>>>> >> for the application?
>>>>> >>
>>>>> >> Any advice is appreciated.
>>>>> >>
>>>>> >>
>>>>> >> The application:
>>>>> >>
>>>>> >> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>>>> >>
>>>>> >>
>>>>> >> From the streaming UI I get something like:
>>>>> >>
>>>>> >> table Completed Batches (last 1000 out of 27136)
>>>>> >>
>>>>> >>
>>>>> >> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
>>>>> >> Delay (?) Output Ops: Succeeded/Total
>>>>> >>
>>>>> >> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>>>>> >>
>>>>> >> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>>>>> >>
>>>>> >> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>>>>> >>
>>>>> >>
>>>>> >> Here's how I'm launching the spark appli

Re: Improving performance of a kafka spark streaming app

2016-06-20 Thread Cody Koeninger
If your batch time is 1 second and your average processing time is
1.16 seconds, you're always going to be falling behind.  That would
explain why you've built up an hour of scheduling delay after eight
hours of running.
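
To put rough numbers on that, a minimal Java sketch. It assumes a fixed
1 s batch interval and a constant 1.16 s processing time (real batches
vary), and the helper name backlogSeconds is made up for illustration.
Each batch adds 0.16 s of lag, so eight hours of batches accumulate a
bit over an hour of scheduling delay, in line with the 1.2 h shown in
the Streaming UI.

```java
final class SchedulingDelay {
    /**
     * Scheduling delay (seconds) built up after runSeconds of wall-clock
     * time, when every batch takes processingSec to process but a new
     * batch arrives every batchIntervalSec.
     */
    static double backlogSeconds(double batchIntervalSec, double processingSec,
                                 double runSeconds) {
        double batches = runSeconds / batchIntervalSec;
        // No backlog builds up if processing keeps pace with the interval.
        double lagPerBatch = Math.max(0.0, processingSec - batchIntervalSec);
        return batches * lagPerBatch;
    }

    public static void main(String[] args) {
        double backlog = backlogSeconds(1.0, 1.16, 8 * 3600);
        System.out.println("backlog after 8 h, seconds: " + backlog);
        System.out.println("backlog after 8 h, hours:   " + backlog / 3600.0);
    }
}
```

The same function returns zero once processing time drops to the batch
interval or below, which is the condition the job has to meet to stop
falling behind.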

On Sat, Jun 18, 2016 at 4:40 PM, Colin Kincaid Williams  wrote:
> Hi Mich again,
>
> Regarding batch window, etc. I have provided the sources, but I'm not
> currently calling the window function. Did you see the program source?
> It's only 100 lines.
>
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
> Then I would expect I'm using defaults, other than what has been shown
> in the configuration.
>
> For example:
>
> In the launcher configuration I set --conf
> spark.streaming.kafka.maxRatePerPartition=500 \ and I believe there
> are 500 messages for the duration set in the application:
> JavaStreamingContext jssc = new JavaStreamingContext(jsc, new
> Duration(1000));
>
>
> Then with the --num-executors 6 \ submit flag, and the
> spark.streaming.kafka.maxRatePerPartition=500 I think that's how we
> arrive at the 3000 events per batch in the UI, pasted above.
>
> Feel free to correct me if I'm wrong.
>
> Then are you suggesting that I set the window?
>
> Maybe following this as reference:
>
> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html
>
> On Sat, Jun 18, 2016 at 8:08 PM, Mich Talebzadeh
>  wrote:
>> Ok
>>
>> What is the set up for these please?
>>
>> batch window
>> window length
>> sliding interval
>>
>> And also in each batch window how much data do you get in (no of messages in
>> the topic whatever)?
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>> On 18 June 2016 at 21:01, Mich Talebzadeh  wrote:
>>>
>>> I believe you have an issue with performance?
>>>
>>> have you checked spark GUI (default 4040) for details including shuffles
>>> etc?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>
>>> On 18 June 2016 at 20:59, Colin Kincaid Williams  wrote:

>>>> There are 25 nodes in the spark cluster.
>>>>
>>>> On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
>>>>  wrote:
>>>> > how many nodes are in your cluster?
>>>> >
>>>> > --num-executors 6 \
>>>> >  --driver-memory 4G \
>>>> >  --executor-memory 2G \
>>>> >  --total-executor-cores 12 \
>>>> >
>>>> >
>>>> > Dr Mich Talebzadeh
>>>> >
>>>> >
>>>> >
>>>> > LinkedIn
>>>> >
>>>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> >
>>>> >
>>>> >
>>>> > http://talebzadehmich.wordpress.com
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On 18 June 2016 at 20:40, Colin Kincaid Williams 
>>>> > wrote:
>>>> >>
>>>> >> I updated my app to Spark 1.5.2 streaming so that it consumes from
>>>> >> Kafka using the direct api and inserts content into an hbase cluster,
>>>> >> as described in this thread. I was away from this project for awhile
>>>> >> due to events in my family.
>>>> >>
>>>> >> Currently my scheduling delay is high, but the processing time is
>>>> >> stable around a second. I changed my setup to use 6 kafka partitions
>>>> >> on a set of smaller kafka brokers, with fewer disks. I've included
>>>> >> some details below, including the script I use to launch the
>>>> >> application. I'm using a Spark on Hbase library, whose version is
>>>> >> relevant to my Hbase cluster. Is it apparent there is something wrong
>>>> >> with my launch method that could be causing the delay, related to the
>>>> >> included jars?
>>>> >>
>>>> >> Or is there something wrong with the very simple approach I'm taking
>>>> >> for the application?
>>>> >>
>>>> >> Any advice is appreciated.
>>>> >>
>>>> >>
>>>> >> The application:
>>>> >>
>>>> >> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>>> >>
>>>> >>
>>>> >> From the streaming UI I get something like:
>>>> >>
>>>> >> table Completed Batches (last 1000 out of 27136)
>>>> >>
>>>> >>
>>>> >> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
>>>> >> Delay (?) Output Ops: Succeeded/Total
>>>> >>
>>>> >> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>>>> >>
>>>> >> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>>>> >>
>>>> >> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>>>> >>
>>>> >>
>>>> >> Here's how I'm launching the spark application.
>>>> >>
>>>> >>
>>>> >> #!/usr/bin/env bash
>>>> >>
>>>> >> export SPARK_CONF_DIR=/home/colin.williams/spark
>>>> >>
>>>> >> export HADOOP_CONF_DIR=/etc/hadoop/conf
>>>> >>
>>>> >> export
>>>> >>
>>>> >> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
Hi Mich again,

Regarding batch window, etc. I have provided the sources, but I'm not
currently calling the window function. Did you see the program source?
It's only 100 lines.

https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877

Then I would expect I'm using defaults, other than what has been shown
in the configuration.

For example:

In the launcher configuration I set --conf
spark.streaming.kafka.maxRatePerPartition=500, and I believe that gives
500 messages for the duration set in the application:
JavaStreamingContext jssc = new JavaStreamingContext(jsc, new Duration(1000));


Then with the --num-executors 6 submit flag and
spark.streaming.kafka.maxRatePerPartition=500, I think that's how we
arrive at the 3000 events per batch in the UI, pasted above.

Feel free to correct me if I'm wrong.
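
For what it's worth, with the direct API spark.streaming.kafka.maxRatePerPartition is a per-second limit for each Kafka partition, so the factor of 6 most likely comes from the 6 Kafka partitions mentioned earlier in the thread rather than from --num-executors. A rough sketch of that arithmetic in Java (the class and method names are made up for illustration):

```java
public class BatchCap {
    /**
     * Upper bound on records per batch for a direct Kafka stream:
     * per-partition rate (records/second) x number of Kafka partitions
     * x batch duration in seconds.
     */
    static long maxEventsPerBatch(long maxRatePerPartition, int kafkaPartitions, long batchMillis) {
        return maxRatePerPartition * kafkaPartitions * batchMillis / 1000;
    }

    public static void main(String[] args) {
        // 500 records/s per partition, 6 partitions, Duration(1000) batches
        System.out.println(maxEventsPerBatch(500, 6, 1000)); // prints 3000
    }
}
```

If this reading is right, changing --num-executors alone would not change the 3000 events per batch; changing the partition count, the rate limit, or the batch duration would.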

Then are you suggesting that I set the window?

Maybe following this as reference:

https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html

On Sat, Jun 18, 2016 at 8:08 PM, Mich Talebzadeh
 wrote:
> Ok
>
> What is the set up for these please?
>
> batch window
> window length
> sliding interval
>
> And also in each batch window how much data do you get in (no of messages in
> the topic whatever)?
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 18 June 2016 at 21:01, Mich Talebzadeh  wrote:
>>
>> I believe you have an issue with performance?
>>
>> have you checked spark GUI (default 4040) for details including shuffles
>> etc?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>> On 18 June 2016 at 20:59, Colin Kincaid Williams  wrote:
>>>
>>> There are 25 nodes in the spark cluster.
>>>
>>> On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
>>>  wrote:
>>> > how many nodes are in your cluster?
>>> >
>>> > --num-executors 6 \
>>> >  --driver-memory 4G \
>>> >  --executor-memory 2G \
>>> >  --total-executor-cores 12 \
>>> >
>>> >
>>> > Dr Mich Talebzadeh
>>> >
>>> >
>>> >
>>> > LinkedIn
>>> >
>>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >
>>> >
>>> >
>>> > http://talebzadehmich.wordpress.com
>>> >
>>> >
>>> >
>>> >
>>> > On 18 June 2016 at 20:40, Colin Kincaid Williams 
>>> > wrote:
>>> >>
>>> >> I updated my app to Spark 1.5.2 streaming so that it consumes from
>>> >> Kafka using the direct api and inserts content into an hbase cluster,
>>> >> as described in this thread. I was away from this project for awhile
>>> >> due to events in my family.
>>> >>
>>> >> Currently my scheduling delay is high, but the processing time is
>>> >> stable around a second. I changed my setup to use 6 kafka partitions
>>> >> on a set of smaller kafka brokers, with fewer disks. I've included
>>> >> some details below, including the script I use to launch the
>>> >> application. I'm using a Spark on Hbase library, whose version is
>>> >> relevant to my Hbase cluster. Is it apparent there is something wrong
>>> >> with my launch method that could be causing the delay, related to the
>>> >> included jars?
>>> >>
>>> >> Or is there something wrong with the very simple approach I'm taking
>>> >> for the application?
>>> >>
>>> >> Any advice is appreciated.
>>> >>
>>> >>
>>> >> The application:
>>> >>
>>> >> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>> >>
>>> >>
>>> >> From the streaming UI I get something like:
>>> >>
>>> >> table Completed Batches (last 1000 out of 27136)
>>> >>
>>> >>
>>> >> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
>>> >> Delay (?) Output Ops: Succeeded/Total
>>> >>
>>> >> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>>> >>
>>> >> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>>> >>
>>> >> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>>> >>
>>> >>
>>> >> Here's how I'm launching the spark application.
>>> >>
>>> >>
>>> >> #!/usr/bin/env bash
>>> >>
>>> >> export SPARK_CONF_DIR=/home/colin.williams/spark
>>> >>
>>> >> export HADOOP_CONF_DIR=/etc/hadoop/conf
>>> >>
>>> >> export
>>> >>
>>> >> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
>>> >>
>>> >>
>>> >> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
>>> >>
>>> >> --class com.example.KafkaToHbase \
>>> >>
>>> >> --master spark://spark_master:7077 \
>>> >>
>>> >> --deploy-mode client \
>>> >>
>>> >> --num-executors 6 \
>>> >>
>>> >> --driver-memory 4G \
>>> >>
>>> >> --executor-memory 2G \
>>> >>
>>> >> --total-executor-cores 12 \
>>> >>
>>> >> --jars
>>> >>
>>> >> /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeepe

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
Ok

What is the setup for these, please?

batch window
window length
sliding interval

Also, in each batch window, how much data do you get (number of messages
in the topic, or similar)?
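
For reference, the three settings relate as follows in Spark Streaming: the batch interval is fixed when the streaming context is created, and a windowed operation takes a window length and a sliding interval that must both be multiples of the batch interval. A small Java sketch of that relationship (the class name and the example values are hypothetical, not taken from the application in this thread):

```java
public class WindowCheck {
    /**
     * Returns how many batches one window covers. Throws if the window length
     * or the sliding interval is not a multiple of the batch interval.
     */
    static long batchesPerWindow(long batchMillis, long windowMillis, long slideMillis) {
        if (windowMillis % batchMillis != 0 || slideMillis % batchMillis != 0) {
            throw new IllegalArgumentException(
                "window length and sliding interval must be multiples of the batch interval");
        }
        return windowMillis / batchMillis;
    }

    public static void main(String[] args) {
        // Hypothetical values: 1 s batches, 30 s window, sliding every 10 s.
        System.out.println(batchesPerWindow(1000, 30000, 10000)); // prints 30
    }
}
```

An application that never calls a window operation only has the batch interval to report.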




Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 June 2016 at 21:01, Mich Talebzadeh  wrote:

> I believe you have an issue with performance?
>
> have you checked spark GUI (default 4040) for details including shuffles
> etc?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 18 June 2016 at 20:59, Colin Kincaid Williams  wrote:
>
>> There are 25 nodes in the spark cluster.
>>
>> On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
>>  wrote:
>> > how many nodes are in your cluster?
>> >
>> > --num-executors 6 \
>> >  --driver-memory 4G \
>> >  --executor-memory 2G \
>> >  --total-executor-cores 12 \
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> >
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> >
>> >
>> > On 18 June 2016 at 20:40, Colin Kincaid Williams 
>> wrote:
>> >>
>> >> I updated my app to Spark 1.5.2 streaming so that it consumes from
>> >> Kafka using the direct api and inserts content into an hbase cluster,
>> >> as described in this thread. I was away from this project for awhile
>> >> due to events in my family.
>> >>
>> >> Currently my scheduling delay is high, but the processing time is
>> >> stable around a second. I changed my setup to use 6 kafka partitions
>> >> on a set of smaller kafka brokers, with fewer disks. I've included
>> >> some details below, including the script I use to launch the
>> >> application. I'm using a Spark on Hbase library, whose version is
>> >> relevant to my Hbase cluster. Is it apparent there is something wrong
>> >> with my launch method that could be causing the delay, related to the
>> >> included jars?
>> >>
>> >> Or is there something wrong with the very simple approach I'm taking
>> >> for the application?
>> >>
>> >> Any advice is appreciated.
>> >>
>> >>
>> >> The application:
>> >>
>> >> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>> >>
>> >>
>> >> From the streaming UI I get something like:
>> >>
>> >> table Completed Batches (last 1000 out of 27136)
>> >>
>> >>
>> >> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
>> >> Delay (?) Output Ops: Succeeded/Total
>> >>
>> >> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>> >>
>> >> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>> >>
>> >> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>> >>
>> >>
>> >> Here's how I'm launching the spark application.
>> >>
>> >>
>> >> #!/usr/bin/env bash
>> >>
>> >> export SPARK_CONF_DIR=/home/colin.williams/spark
>> >>
>> >> export HADOOP_CONF_DIR=/etc/hadoop/conf
>> >>
>> >> export
>> >>
>> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
>> >>
>> >>
>> >> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
>> >>
>> >> --class com.example.KafkaToHbase \
>> >>
>> >> --master spark://spark_master:7077 \
>> >>
>> >> --deploy-mode client \
>> >>
>> >> --num-executors 6 \
>> >>
>> >> --driver-memory 4G \
>> >>
>> >> --executor-memory 2G \
>> >>
>> >> --total-executor-cores 12 \
>> >>
>> >> --jars
>> >>
>> /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar
>> >> \
>> >>
>> >> --conf spark.app.name="Kafka To Hbase" \
>> >>
>> >> --conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \
>> >>
>> >> --conf spark.eventLog.enabled=false \
>> >>
>> >> --conf spark.eventLog.overwrite=true \
>> >>
>> >> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>> >>
>> >> --conf spark.streaming.ba

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
I believe you have an issue with performance?

Have you checked the Spark UI (default port 4040) for details, including
shuffles etc.?

HTH

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 June 2016 at 20:59, Colin Kincaid Williams  wrote:

> There are 25 nodes in the spark cluster.
>
> On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
>  wrote:
> > how many nodes are in your cluster?
> >
> > --num-executors 6 \
> >  --driver-memory 4G \
> >  --executor-memory 2G \
> >  --total-executor-cores 12 \
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> >
> >
> > On 18 June 2016 at 20:40, Colin Kincaid Williams  wrote:
> >>
> >> I updated my app to Spark 1.5.2 streaming so that it consumes from
> >> Kafka using the direct api and inserts content into an hbase cluster,
> >> as described in this thread. I was away from this project for awhile
> >> due to events in my family.
> >>
> >> Currently my scheduling delay is high, but the processing time is
> >> stable around a second. I changed my setup to use 6 kafka partitions
> >> on a set of smaller kafka brokers, with fewer disks. I've included
> >> some details below, including the script I use to launch the
> >> application. I'm using a Spark on Hbase library, whose version is
> >> relevant to my Hbase cluster. Is it apparent there is something wrong
> >> with my launch method that could be causing the delay, related to the
> >> included jars?
> >>
> >> Or is there something wrong with the very simple approach I'm taking
> >> for the application?
> >>
> >> Any advice is appreciated.
> >>
> >>
> >> The application:
> >>
> >> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
> >>
> >>
> >> From the streaming UI I get something like:
> >>
> >> table Completed Batches (last 1000 out of 27136)
> >>
> >>
> >> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
> >> Delay (?) Output Ops: Succeeded/Total
> >>
> >> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
> >>
> >> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
> >>
> >> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
> >>
> >>
> >> Here's how I'm launching the spark application.
> >>
> >>
> >> #!/usr/bin/env bash
> >>
> >> export SPARK_CONF_DIR=/home/colin.williams/spark
> >>
> >> export HADOOP_CONF_DIR=/etc/hadoop/conf
> >>
> >> export
> >>
> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
> >>
> >>
> >> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
> >>
> >> --class com.example.KafkaToHbase \
> >>
> >> --master spark://spark_master:7077 \
> >>
> >> --deploy-mode client \
> >>
> >> --num-executors 6 \
> >>
> >> --driver-memory 4G \
> >>
> >> --executor-memory 2G \
> >>
> >> --total-executor-cores 12 \
> >>
> >> --jars
> >>
> /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar
> >> \
> >>
> >> --conf spark.app.name="Kafka To Hbase" \
> >>
> >> --conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \
> >>
> >> --conf spark.eventLog.enabled=false \
> >>
> >> --conf spark.eventLog.overwrite=true \
> >>
> >> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> >>
> >> --conf spark.streaming.backpressure.enabled=false \
> >>
> >> --conf spark.streaming.kafka.maxRatePerPartition=500 \
> >>
> >> --driver-class-path /home/colin.williams/kafka-hbase.jar \
> >>
> >> --driver-java-options
> >>
> >>
> -Dspark.executor.extraClassPath=/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*
> >> \
> >>
> >> /home/colin.williams/kafka-hbase.jar "FromTable" "ToTable"
> >> "broker1:9092,broker2:9092"
> >>
> >> On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams 
> >> wrote:
> >> > Thanks Cody, I can see that the partitions are well distributed...
> >> > Then I'm in the process of using the direct api.
> >> >
> >> > 

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
I'm attaching a picture from the streaming UI.

On Sat, Jun 18, 2016 at 7:59 PM, Colin Kincaid Williams  wrote:
> There are 25 nodes in the spark cluster.
>
> On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
>  wrote:
>> how many nodes are in your cluster?
>>
>> --num-executors 6 \
>>  --driver-memory 4G \
>>  --executor-memory 2G \
>>  --total-executor-cores 12 \
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>> On 18 June 2016 at 20:40, Colin Kincaid Williams  wrote:
>>>
>>> I updated my app to Spark 1.5.2 streaming so that it consumes from
>>> Kafka using the direct api and inserts content into an hbase cluster,
>>> as described in this thread. I was away from this project for awhile
>>> due to events in my family.
>>>
>>> Currently my scheduling delay is high, but the processing time is
>>> stable around a second. I changed my setup to use 6 kafka partitions
>>> on a set of smaller kafka brokers, with fewer disks. I've included
>>> some details below, including the script I use to launch the
>>> application. I'm using a Spark on Hbase library, whose version is
>>> relevant to my Hbase cluster. Is it apparent there is something wrong
>>> with my launch method that could be causing the delay, related to the
>>> included jars?
>>>
>>> Or is there something wrong with the very simple approach I'm taking
>>> for the application?
>>>
>>> Any advice is appreciated.
>>>
>>>
>>> The application:
>>>
>>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>>
>>>
>>> From the streaming UI I get something like:
>>>
>>> table Completed Batches (last 1000 out of 27136)
>>>
>>>
>>> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
>>> Delay (?) Output Ops: Succeeded/Total
>>>
>>> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>>>
>>> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>>>
>>> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>>>
>>>
>>> Here's how I'm launching the spark application.
>>>
>>>
>>> #!/usr/bin/env bash
>>>
>>> export SPARK_CONF_DIR=/home/colin.williams/spark
>>>
>>> export HADOOP_CONF_DIR=/etc/hadoop/conf
>>>
>>> export
>>> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
>>>
>>>
>>> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
>>>
>>> --class com.example.KafkaToHbase \
>>>
>>> --master spark://spark_master:7077 \
>>>
>>> --deploy-mode client \
>>>
>>> --num-executors 6 \
>>>
>>> --driver-memory 4G \
>>>
>>> --executor-memory 2G \
>>>
>>> --total-executor-cores 12 \
>>>
>>> --jars
>>> /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar
>>> \
>>>
>>> --conf spark.app.name="Kafka To Hbase" \
>>>
>>> --conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \
>>>
>>> --conf spark.eventLog.enabled=false \
>>>
>>> --conf spark.eventLog.overwrite=true \
>>>
>>> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>>>
>>> --conf spark.streaming.backpressure.enabled=false \
>>>
>>> --conf spark.streaming.kafka.maxRatePerPartition=500 \
>>>
>>> --driver-class-path /home/colin.williams/kafka-hbase.jar \
>>>
>>> --driver-java-options
>>>
>>> -Dspark.executor.extraClassPath=/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*
>>> \
>>>
>>> /home/colin.williams/kafka-hbase.jar "FromTable" "ToTable"
>>> "broker1:9092,broker2:9092"
>>>
>>> On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams 
>>> wrote:
>>> > Thanks Cody, I can see that the partitions are well distributed...
>>> > Then I'm in the process of using the direct api.
>>> >
>>> > On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger 
>>> > wrote:
>>> >> 60 partitions in and of itself shouldn't be a big performance issue
>>> >> (as long as producers are distributing across partitions evenly).
>>> >>
>>> >> On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams 
>>> >> wrote:
>>> >>> Thanks again Cody. Regarding the details 66 kafka partitions on 3
>>> >>> kafka servers, likely 8 core systems with 10 disks each. Maybe the
>>> >>> iss

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
There are 25 nodes in the spark cluster.

On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
 wrote:
> how many nodes are in your cluster?
>
> --num-executors 6 \
>  --driver-memory 4G \
>  --executor-memory 2G \
>  --total-executor-cores 12 \
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 18 June 2016 at 20:40, Colin Kincaid Williams  wrote:
>>
>> I updated my app to Spark 1.5.2 streaming so that it consumes from
>> Kafka using the direct api and inserts content into an hbase cluster,
>> as described in this thread. I was away from this project for awhile
>> due to events in my family.
>>
>> Currently my scheduling delay is high, but the processing time is
>> stable around a second. I changed my setup to use 6 kafka partitions
>> on a set of smaller kafka brokers, with fewer disks. I've included
>> some details below, including the script I use to launch the
>> application. I'm using a Spark on Hbase library, whose version is
>> relevant to my Hbase cluster. Is it apparent there is something wrong
>> with my launch method that could be causing the delay, related to the
>> included jars?
>>
>> Or is there something wrong with the very simple approach I'm taking
>> for the application?
>>
>> Any advice is appreciated.
>>
>>
>> The application:
>>
>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>
>>
>> From the streaming UI I get something like:
>>
>> table Completed Batches (last 1000 out of 27136)
>>
>>
>> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
>> Delay (?) Output Ops: Succeeded/Total
>>
>> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>>
>> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>>
>> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>>
>>
>> Here's how I'm launching the spark application.
>>
>>
>> #!/usr/bin/env bash
>>
>> export SPARK_CONF_DIR=/home/colin.williams/spark
>>
>> export HADOOP_CONF_DIR=/etc/hadoop/conf
>>
>> export
>> HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
>>
>>
>> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
>>
>> --class com.example.KafkaToHbase \
>>
>> --master spark://spark_master:7077 \
>>
>> --deploy-mode client \
>>
>> --num-executors 6 \
>>
>> --driver-memory 4G \
>>
>> --executor-memory 2G \
>>
>> --total-executor-cores 12 \
>>
>> --jars
>> /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar
>> \
>>
>> --conf spark.app.name="Kafka To Hbase" \
>>
>> --conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \
>>
>> --conf spark.eventLog.enabled=false \
>>
>> --conf spark.eventLog.overwrite=true \
>>
>> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>>
>> --conf spark.streaming.backpressure.enabled=false \
>>
>> --conf spark.streaming.kafka.maxRatePerPartition=500 \
>>
>> --driver-class-path /home/colin.williams/kafka-hbase.jar \
>>
>> --driver-java-options
>>
>> -Dspark.executor.extraClassPath=/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*
>> \
>>
>> /home/colin.williams/kafka-hbase.jar "FromTable" "ToTable"
>> "broker1:9092,broker2:9092"
>>
>> On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams 
>> wrote:
>> > Thanks Cody, I can see that the partitions are well distributed...
>> > Then I'm in the process of using the direct api.
>> >
>> > On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger 
>> > wrote:
>> >> 60 partitions in and of itself shouldn't be a big performance issue
>> >> (as long as producers are distributing across partitions evenly).
>> >>
>> >> On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams 
>> >> wrote:
>> >>> Thanks again Cody. Regarding the details 66 kafka partitions on 3
>> >>> kafka servers, likely 8 core systems with 10 disks each. Maybe the
>> >>> issue with the receiver was the large number of partitions. I had
>> >>> miscounted the disks and so 11*3*2 is how I decided to partition my
>> >>> topic on insertion, ( by my own, unjustified reasoning, on a first
>> >>> attempt ) . This worked well enough 

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
How many nodes are in your cluster?

--num-executors 6 \
 --driver-memory 4G \
 --executor-memory 2G \
 --total-executor-cores 12 \


Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 June 2016 at 20:40, Colin Kincaid Williams  wrote:

> I updated my app to Spark 1.5.2 streaming so that it consumes from
> Kafka using the direct api and inserts content into an hbase cluster,
> as described in this thread. I was away from this project for awhile
> due to events in my family.
>
> Currently my scheduling delay is high, but the processing time is
> stable around a second. I changed my setup to use 6 kafka partitions
> on a set of smaller kafka brokers, with fewer disks. I've included
> some details below, including the script I use to launch the
> application. I'm using a Spark on Hbase library, whose version is
> relevant to my Hbase cluster. Is it apparent there is something wrong
> with my launch method that could be causing the delay, related to the
> included jars?
>
> Or is there something wrong with the very simple approach I'm taking
> for the application?
>
> Any advice is appreciated.
>
>
> The application:
>
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
>
> From the streaming UI I get something like:
>
> table Completed Batches (last 1000 out of 27136)
>
>
> Batch Time Input Size Scheduling Delay (?) Processing Time (?) Total
> Delay (?) Output Ops: Succeeded/Total
>
> 2016/06/18 11:21:32 3000 events 1.2 h 1 s 1.2 h 1/1
>
> 2016/06/18 11:21:31 3000 events 1.2 h 1 s 1.2 h 1/1
>
> 2016/06/18 11:21:30 3000 events 1.2 h 1 s 1.2 h 1/1
>
>
> Here's how I'm launching the spark application.
>
>
> #!/usr/bin/env bash
>
> export SPARK_CONF_DIR=/home/colin.williams/spark
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> export HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
>
> /opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
> --class com.example.KafkaToHbase \
> --master spark://spark_master:7077 \
> --deploy-mode client \
> --num-executors 6 \
> --driver-memory 4G \
> --executor-memory 2G \
> --total-executor-cores 12 \
> --jars /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar \
> --conf spark.app.name="Kafka To Hbase" \
> --conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \
> --conf spark.eventLog.enabled=false \
> --conf spark.eventLog.overwrite=true \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.streaming.backpressure.enabled=false \
> --conf spark.streaming.kafka.maxRatePerPartition=500 \
> --driver-class-path /home/colin.williams/kafka-hbase.jar \
> --driver-java-options -Dspark.executor.extraClassPath=/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/* \
> /home/colin.williams/kafka-hbase.jar "FromTable" "ToTable" "broker1:9092,broker2:9092"
>
> On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams 
> wrote:
> > Thanks Cody, I can see that the partitions are well distributed...
> > Then I'm in the process of using the direct api.
> >
> > On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger 
> wrote:
> >> 60 partitions in and of itself shouldn't be a big performance issue
> >> (as long as producers are distributing across partitions evenly).
> >>
> >> On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams 
> wrote:
> >>> Thanks again Cody. Regarding the details 66 kafka partitions on 3
> >>> kafka servers, likely 8 core systems with 10 disks each. Maybe the
> >>> issue with the receiver was the large number of partitions. I had
> >>> miscounted the disks and so 11*3*2 is how I decided to partition my
> >>> topic on insertion, ( by my own, unjustified reasoning, on a first
> >>> attempt ) . This worked well enough for me, I put 1.7 billion entries
> >>> into Kafka on a map reduce job in 5 and a half hours.
> >>>
> >>> I was concerned using spark 1.5.2 because I'm currently puttin

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
I updated my app to Spark 1.5.2 streaming so that it consumes from
Kafka using the direct api and inserts content into an hbase cluster,
as described in this thread. I was away from this project for a while
due to events in my family.

Currently my scheduling delay is high, but the processing time is
stable around a second. I changed my setup to use 6 kafka partitions
on a set of smaller kafka brokers, with fewer disks. I've included
some details below, including the script I use to launch the
application. I'm using a Spark on Hbase library, whose version is
relevant to my Hbase cluster. Is it apparent there is something wrong
with my launch method that could be causing the delay, related to the
included jars?

Or is there something wrong with the very simple approach I'm taking
for the application?

Any advice is appreciated.


The application:

https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877


From the streaming UI I get something like:

Completed Batches (last 1000 out of 27136)

Batch Time           Input Size   Scheduling Delay  Processing Time  Total Delay  Output Ops: Succeeded/Total
2016/06/18 11:21:32  3000 events  1.2 h             1 s              1.2 h        1/1
2016/06/18 11:21:31  3000 events  1.2 h             1 s              1.2 h        1/1
2016/06/18 11:21:30  3000 events  1.2 h             1 s              1.2 h        1/1
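
The numbers in this table can be cross-checked with a little arithmetic. The sketch below assumes a 1-second batch interval (inferred from the batch times, which are one second apart) and uses the 6 partitions and the maxRatePerPartition=500 setting given in this message. The class and method names are made up for illustration. It shows that 3000 events is exactly the per-batch cap, so every batch is full. It also estimates what average processing time would build up a 1.2 h delay over 27136 batches, under the further assumption that the delay accumulated evenly since the job started.

```java
public class BatchBudget {

    /** Upper bound on records per batch for a rate-limited stream:
     *  rate limit per partition x partitions x batch length in seconds. */
    static long maxRecordsPerBatch(int maxRatePerPartition, int partitions, double batchSeconds) {
        return Math.round(maxRatePerPartition * partitions * batchSeconds);
    }

    /** Average processing time per batch that would build up the given
     *  scheduling delay, assuming (hypothetically) it accumulated evenly. */
    static double impliedProcessingSeconds(double delaySeconds, long batches, double batchSeconds) {
        return batchSeconds + delaySeconds / batches;
    }

    public static void main(String[] args) {
        long cap = maxRecordsPerBatch(500, 6, 1.0);
        double delaySeconds = 1.2 * 3600;  // 1.2 h from the UI
        double avg = impliedProcessingSeconds(delaySeconds, 27136, 1.0);
        System.out.printf("per-batch cap: %d records%n", cap);
        System.out.printf("implied average processing time: %.2f s%n", avg);
    }
}
```

If that estimate is roughly right, each batch takes only slightly longer than the batch interval, which would be enough for the scheduling delay to keep growing. A longer batch interval or a lower maxRatePerPartition might stop the growth, although the unread backlog would then stay in Kafka.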


Here's how I'm launching the spark application.


#!/usr/bin/env bash

export SPARK_CONF_DIR=/home/colin.williams/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_CLASSPATH=/home/colin.williams/hbase/conf/:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*:/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar

/opt/spark-1.5.2-bin-hadoop2.4/bin/spark-submit \
  --class com.example.KafkaToHbase \
  --master spark://spark_master:7077 \
  --deploy-mode client \
  --num-executors 6 \
  --driver-memory 4G \
  --executor-memory 2G \
  --total-executor-cores 12 \
  --jars /home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/guava-12.0.1.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/protobuf-java-2.5.0.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-protocol.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-client.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-common.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop2-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-hadoop-compat.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/hbase-server.jar,/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/htrace-core.jar \
  --conf spark.app.name="Kafka To Hbase" \
  --conf spark.eventLog.dir="hdfs:///user/spark/applicationHistory" \
  --conf spark.eventLog.enabled=false \
  --conf spark.eventLog.overwrite=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.streaming.backpressure.enabled=false \
  --conf spark.streaming.kafka.maxRatePerPartition=500 \
  --driver-class-path /home/colin.williams/kafka-hbase.jar \
  --driver-java-options -Dspark.executor.extraClassPath=/home/colin.williams/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/* \
  /home/colin.williams/kafka-hbase.jar "FromTable" "ToTable" "broker1:9092,broker2:9092"

On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams  wrote:
> Thanks Cody, I can see that the partitions are well distributed...
> Then I'm in the process of using the direct api.
>
> On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger  wrote:
>> 60 partitions in and of itself shouldn't be a big performance issue
>> (as long as producers are distributing across partitions evenly).
>>
>> On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams  
>> wrote:
>>> Thanks again Cody. Regarding the details 66 kafka partitions on 3
>>> kafka servers, likely 8 core systems with 10 disks each. Maybe the
>>> issue with the receiver was the large number of partitions. I had
>>> miscounted the disks and so 11*3*2 is how I decided to partition my
>>> topic on insertion, ( by my own, unjustified reasoning, on a first
>>> attempt ) . This worked well enough for me, I put 1.7 billion entries
>>> into Kafka on a map reduce job in 5 and a half hours.
>>>
>>> I was concerned using spark 1.5.2 because I'm currently putting my
>>> data into a CDH 5.3 HDFS cluster, using hbase-spark .98 library jars
>>> built for spark 1.2 on CDH 5.3. But after debugging quite a bit
>>> yesterday, I tried building against 1.5.2. So far it's running without
>>> issue on a Spark 1.5.2 cluster. I'm not sure there was too much
>>> improvement using the same code, but I'll see how the direct api
>>> handles it. In the end I can reduce the number of partitions in Kafka
>>> if it causes big performance issues.
>>>
>>> On Tue, May 3, 2016 at 4:08 AM, Cody Koeninger  wrote:
>>>> print() isn't really the best way to benchmark things, since it calls
>>>> take

Re: spark.hadoop.dfs.replication parameter not working for kafka-spark streaming

2016-05-31 Thread Abhishek Anand
I also tried


jsc.sparkContext().sc().hadoopConfiguration().set("dfs.replication", "2")

But it's still not working.

Any ideas why it's not working?


Abhi

On Tue, May 31, 2016 at 4:03 PM, Abhishek Anand 
wrote:

> My spark streaming checkpoint directory is being written to HDFS with
> default replication factor of 3.
>
> In my streaming application where I am listening from kafka and setting
> the dfs.replication = 2 as below the files are still being written with
> replication factor=3
>
> SparkConf sparkConfig = new
> SparkConf().setMaster("mymaster").set("spark.hadoop.dfs.replication", "2");
>
> Is there anything else that I need to do ??
>
>
> Thanks !!!
> Abhi
>


spark.hadoop.dfs.replication parameter not working for kafka-spark streaming

2016-05-31 Thread Abhishek Anand
My Spark Streaming checkpoint directory is being written to HDFS with the
default replication factor of 3.

In my streaming application, which reads from Kafka, I set
dfs.replication = 2 as shown below, but the files are still being written
with a replication factor of 3.

SparkConf sparkConfig = new SparkConf()
    .setMaster("mymaster")
    .set("spark.hadoop.dfs.replication", "2");

Is there anything else that I need to do?


Thanks !!!
Abhi
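
As background for the setting used above: Spark's configuration documentation describes properties prefixed with spark.hadoop. as being passed to the Hadoop configuration with the prefix removed, so spark.hadoop.dfs.replication is meant to arrive as dfs.replication. The sketch below models only that prefix handling with plain java.util maps. It is not Spark's own code, the class and method names are hypothetical, and it does not explain why the checkpoint files still get a replication factor of 3.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SparkHadoopPrefix {

    static final String PREFIX = "spark.hadoop.";

    /** Copies every "spark.hadoop.X" entry into a new map under the key "X";
     *  entries without the prefix are ignored. */
    static Map<String, String> toHadoopProperties(Map<String, String> sparkConf) {
        Map<String, String> hadoop = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : sparkConf.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                hadoop.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return hadoop;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.master", "mymaster");
        conf.put("spark.hadoop.dfs.replication", "2");
        System.out.println(toHadoopProperties(conf));
    }
}
```

One way to narrow the problem down would be to print the value of dfs.replication from the Hadoop configuration the streaming context actually uses, and compare it with what the checkpoint writer ends up with.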


Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
Thanks Cody, I can see that the partitions are well distributed...
Then I'm in the process of using the direct api.

On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger  wrote:
> 60 partitions in and of itself shouldn't be a big performance issue
> (as long as producers are distributing across partitions evenly).
>
> On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams  wrote:
>> Thanks again Cody. Regarding the details 66 kafka partitions on 3
>> kafka servers, likely 8 core systems with 10 disks each. Maybe the
>> issue with the receiver was the large number of partitions. I had
>> miscounted the disks and so 11*3*2 is how I decided to partition my
>> topic on insertion, ( by my own, unjustified reasoning, on a first
>> attempt ) . This worked well enough for me, I put 1.7 billion entries
>> into Kafka on a map reduce job in 5 and a half hours.
>>
>> I was concerned using spark 1.5.2 because I'm currently putting my
>> data into a CDH 5.3 HDFS cluster, using hbase-spark .98 library jars
>> built for spark 1.2 on CDH 5.3. But after debugging quite a bit
>> yesterday, I tried building against 1.5.2. So far it's running without
>> issue on a Spark 1.5.2 cluster. I'm not sure there was too much
>> improvement using the same code, but I'll see how the direct api
>> handles it. In the end I can reduce the number of partitions in Kafka
>> if it causes big performance issues.
>>
>> On Tue, May 3, 2016 at 4:08 AM, Cody Koeninger  wrote:
>>> print() isn't really the best way to benchmark things, since it calls
>>> take(10) under the covers, but 380 records / second for a single
>>> receiver doesn't sound right in any case.
>>>
>>> Am I understanding correctly that you're trying to process a large
>>> number of already-existing kafka messages, not keep up with an
>>> incoming stream?  Can you give any details (e.g. hardware, number of
>>> topicpartitions, etc)?
>>>
>>> Really though, I'd try to start with spark 1.6 and direct streams, or
>>> even just kafkacat, as a baseline.
>>>
>>>
>>>
>>> On Mon, May 2, 2016 at 7:01 PM, Colin Kincaid Williams  
>>> wrote:
>>>> Hello again. I searched for "backport kafka" in the list archives but
>>>> couldn't find anything but a post from Spark 0.7.2 . I was going to
>>>> use accumulators to make a counter, but then saw on the Streaming tab
>>>> the Receiver Statistics. Then I removed all other "functionality"
>>>> except:
>>>>
>>>>
>>>> JavaPairReceiverInputDStream dstream = KafkaUtils
>>>>   //createStream(JavaStreamingContext jssc,Class
>>>> keyTypeClass,Class valueTypeClass, Class keyDecoderClass,
>>>> Class valueDecoderClass, java.util.Map kafkaParams,
>>>> java.util.Map topics, StorageLevel storageLevel)
>>>>   .createStream(jssc, byte[].class, byte[].class,
>>>> kafka.serializer.DefaultDecoder.class,
>>>> kafka.serializer.DefaultDecoder.class, kafkaParamsMap, topicMap,
>>>> StorageLevel.MEMORY_AND_DISK_SER());
>>>>
>>>>    dstream.print();
>>>>
>>>> Then in the Receiver Stats for the single receiver, I'm seeing around
>>>> 380 records / second. Then to get anywhere near my 10% mentioned
>>>> above, I'd need to run around 21 receivers, assuming 380 records /
>>>> second, just using the print output. This seems awfully high to me,
>>>> considering that I wrote 8+ records a second to Kafka from a
>>>> mapreduce job, and that my bottleneck was likely Hbase. Again using
>>>> the 380 estimate, I would need 200+ receivers to reach a similar
>>>> amount of reads.
>>>>
>>>> Even given the issues with the 1.2 receivers, is this the expected way
>>>> to use the Kafka streaming API, or am I doing something terribly
>>>> wrong?
>>>>
>>>> My application looks like
>>>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>>>
>>>> On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger  wrote:
>>>>> Have you tested for read throughput (without writing to hbase, just
>>>>> deserialize)?
>>>>>
>>>>> Are you limited to using spark 1.2, or is upgrading possible?  The
>>>>> kafka direct stream is available starting with 1.3.  If you're stuck
>>>>> on 1.2, I believe there have been some attempts to backport it, search
>>>>> the mailing list archives.
>>>>>
>>>>> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams
>>>>> wrote:
>>>>>> I've written an application to get content from a kafka topic with 1.7
>>>>>> billion entries,  get the protobuf serialized entries, and insert into
>>>>>> hbase. Currently the environment that I'm running in is Spark 1.2.
>>>>>>
>>>>>> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
>>>>>> 0-2500 writes / second. This will take much too long to consume the
>>>>>> entries.
>>>>>>
>>>>>> I currently believe that the spark kafka receiver is the bottleneck.
>>>>>> I've tried both 1.2 receivers, with the WAL and without, and didn't
>>>>>> notice any large performance difference. I've tried many different
>>>>>> spark configuration options, but can't seem to get better performance.
>>>>>>
>>>>>> I saw 8 requests / seco

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Cody Koeninger
60 partitions in and of itself shouldn't be a big performance issue
(as long as producers are distributing across partitions evenly).

On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams  wrote:
> Thanks again Cody. Regarding the details 66 kafka partitions on 3
> kafka servers, likely 8 core systems with 10 disks each. Maybe the
> issue with the receiver was the large number of partitions. I had
> miscounted the disks and so 11*3*2 is how I decided to partition my
> topic on insertion, ( by my own, unjustified reasoning, on a first
> attempt ) . This worked well enough for me, I put 1.7 billion entries
> into Kafka on a map reduce job in 5 and a half hours.
>
> I was concerned using spark 1.5.2 because I'm currently putting my
> data into a CDH 5.3 HDFS cluster, using hbase-spark .98 library jars
> built for spark 1.2 on CDH 5.3. But after debugging quite a bit
> yesterday, I tried building against 1.5.2. So far it's running without
> issue on a Spark 1.5.2 cluster. I'm not sure there was too much
> improvement using the same code, but I'll see how the direct api
> handles it. In the end I can reduce the number of partitions in Kafka
> if it causes big performance issues.
>
> On Tue, May 3, 2016 at 4:08 AM, Cody Koeninger  wrote:
>> print() isn't really the best way to benchmark things, since it calls
>> take(10) under the covers, but 380 records / second for a single
>> receiver doesn't sound right in any case.
>>
>> Am I understanding correctly that you're trying to process a large
>> number of already-existing kafka messages, not keep up with an
>> incoming stream?  Can you give any details (e.g. hardware, number of
>> topicpartitions, etc)?
>>
>> Really though, I'd try to start with spark 1.6 and direct streams, or
>> even just kafkacat, as a baseline.
>>
>>
>>
>> On Mon, May 2, 2016 at 7:01 PM, Colin Kincaid Williams  
>> wrote:
>>> Hello again. I searched for "backport kafka" in the list archives but
>>> couldn't find anything but a post from Spark 0.7.2 . I was going to
>>> use accumulators to make a counter, but then saw on the Streaming tab
>>> the Receiver Statistics. Then I removed all other "functionality"
>>> except:
>>>
>>>
>>> JavaPairReceiverInputDStream dstream = KafkaUtils
>>>   //createStream(JavaStreamingContext jssc,Class
>>> keyTypeClass,Class valueTypeClass, Class keyDecoderClass,
>>> Class valueDecoderClass, java.util.Map kafkaParams,
>>> java.util.Map topics, StorageLevel storageLevel)
>>>   .createStream(jssc, byte[].class, byte[].class,
>>> kafka.serializer.DefaultDecoder.class,
>>> kafka.serializer.DefaultDecoder.class, kafkaParamsMap, topicMap,
>>> StorageLevel.MEMORY_AND_DISK_SER());
>>>
>>>dstream.print();
>>>
>>> Then in the Receiver Stats for the single receiver, I'm seeing around
>>> 380 records / second. Then to get anywhere near my 10% mentioned
>>> above, I'd need to run around 21 receivers, assuming 380 records /
>>> second, just using the print output. This seems awfully high to me,
>>> considering that I wrote 8+ records a second to Kafka from a
>>> mapreduce job, and that my bottleneck was likely Hbase. Again using
>>> the 380 estimate, I would need 200+ receivers to reach a similar
>>> amount of reads.
>>>
>>> Even given the issues with the 1.2 receivers, is this the expected way
>>> to use the Kafka streaming API, or am I doing something terribly
>>> wrong?
>>>
>>> My application looks like
>>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>>
>>> On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger  wrote:
>>>> Have you tested for read throughput (without writing to hbase, just
>>>> deserialize)?
>>>>
>>>> Are you limited to using spark 1.2, or is upgrading possible?  The
>>>> kafka direct stream is available starting with 1.3.  If you're stuck
>>>> on 1.2, I believe there have been some attempts to backport it, search
>>>> the mailing list archives.
>>>>
>>>> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams
>>>> wrote:
>>>>> I've written an application to get content from a kafka topic with 1.7
>>>>> billion entries,  get the protobuf serialized entries, and insert into
>>>>> hbase. Currently the environment that I'm running in is Spark 1.2.
>>>>>
>>>>> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
>>>>> 0-2500 writes / second. This will take much too long to consume the
>>>>> entries.
>>>>>
>>>>> I currently believe that the spark kafka receiver is the bottleneck.
>>>>> I've tried both 1.2 receivers, with the WAL and without, and didn't
>>>>> notice any large performance difference. I've tried many different
>>>>> spark configuration options, but can't seem to get better performance.
>>>>>
>>>>> I saw 8 requests / second inserting these records into kafka using
>>>>> yarn / hbase / protobuf / kafka in a bulk fashion.
>>>>>
>>>>> While hbase inserts might not deliver the same throughput, I'd like to
>>>>> at least get 10%.
>>>>>
>>>>> My application looks like
>>>>> https://gist.github.c

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
Thanks again Cody. Regarding the details: 66 Kafka partitions on 3
Kafka servers, likely 8-core systems with 10 disks each. Maybe the
issue with the receiver was the large number of partitions. I had
miscounted the disks, and so 11*3*2 is how I decided to partition my
topic on insertion (by my own unjustified reasoning, on a first
attempt). This worked well enough for me: I put 1.7 billion entries
into Kafka with a MapReduce job in five and a half hours.
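
The figures in this paragraph translate into an average write rate with simple arithmetic. The sketch below only uses the numbers stated here (11*3*2 partitions, 1.7 billion entries, five and a half hours); the class and method names are invented for the example, and the result is an average that says nothing about peaks or about how evenly the producers spread the load.

```java
public class LoadRate {

    /** Average records per second for a bulk load of the given size and duration. */
    static double recordsPerSecond(long records, double hours) {
        return records / (hours * 3600.0);
    }

    /** Average rate per partition, assuming an even spread across partitions. */
    static double perPartition(double recordsPerSecond, int partitions) {
        return recordsPerSecond / partitions;
    }

    public static void main(String[] args) {
        int partitions = 11 * 3 * 2;  // the partition count chosen in the message
        double total = recordsPerSecond(1_700_000_000L, 5.5);
        System.out.printf("partitions: %d%n", partitions);
        System.out.printf("average write rate: %.0f records / second%n", total);
        System.out.printf("average per partition: %.0f records / second%n",
                perPartition(total, partitions));
    }
}
```

That puts the load at somewhat over 85,000 records / second on average, or about 1,300 per partition if the spread was even.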

I was concerned about using Spark 1.5.2 because I'm currently putting my
data into a CDH 5.3 HDFS cluster, using hbase-spark .98 library jars
built for spark 1.2 on CDH 5.3. But after debugging quite a bit
yesterday, I tried building against 1.5.2. So far it's running without
issue on a Spark 1.5.2 cluster. I'm not sure there was too much
improvement using the same code, but I'll see how the direct api
handles it. In the end I can reduce the number of partitions in Kafka
if it causes big performance issues.

On Tue, May 3, 2016 at 4:08 AM, Cody Koeninger  wrote:
> print() isn't really the best way to benchmark things, since it calls
> take(10) under the covers, but 380 records / second for a single
> receiver doesn't sound right in any case.
>
> Am I understanding correctly that you're trying to process a large
> number of already-existing kafka messages, not keep up with an
> incoming stream?  Can you give any details (e.g. hardware, number of
> topicpartitions, etc)?
>
> Really though, I'd try to start with spark 1.6 and direct streams, or
> even just kafkacat, as a baseline.
>
>
>
> On Mon, May 2, 2016 at 7:01 PM, Colin Kincaid Williams  wrote:
>> Hello again. I searched for "backport kafka" in the list archives but
>> couldn't find anything but a post from Spark 0.7.2 . I was going to
>> use accumulators to make a counter, but then saw on the Streaming tab
>> the Receiver Statistics. Then I removed all other "functionality"
>> except:
>>
>>
>> JavaPairReceiverInputDStream dstream = KafkaUtils
>>   //createStream(JavaStreamingContext jssc,Class
>> keyTypeClass,Class valueTypeClass, Class keyDecoderClass,
>> Class valueDecoderClass, java.util.Map kafkaParams,
>> java.util.Map topics, StorageLevel storageLevel)
>>   .createStream(jssc, byte[].class, byte[].class,
>> kafka.serializer.DefaultDecoder.class,
>> kafka.serializer.DefaultDecoder.class, kafkaParamsMap, topicMap,
>> StorageLevel.MEMORY_AND_DISK_SER());
>>
>>dstream.print();
>>
>> Then in the Receiver Stats for the single receiver, I'm seeing around
>> 380 records / second. Then to get anywhere near my 10% mentioned
>> above, I'd need to run around 21 receivers, assuming 380 records /
>> second, just using the print output. This seems awfully high to me,
>> considering that I wrote 8+ records a second to Kafka from a
>> mapreduce job, and that my bottleneck was likely Hbase. Again using
>> the 380 estimate, I would need 200+ receivers to reach a similar
>> amount of reads.
>>
>> Even given the issues with the 1.2 receivers, is this the expected way
>> to use the Kafka streaming API, or am I doing something terribly
>> wrong?
>>
>> My application looks like
>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>
>> On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger  wrote:
>>> Have you tested for read throughput (without writing to hbase, just
>>> deserialize)?
>>>
>>> Are you limited to using spark 1.2, or is upgrading possible?  The
>>> kafka direct stream is available starting with 1.3.  If you're stuck
>>> on 1.2, I believe there have been some attempts to backport it, search
>>> the mailing list archives.
>>>
>>> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams  
>>> wrote:
>>>> I've written an application to get content from a kafka topic with 1.7
>>>> billion entries,  get the protobuf serialized entries, and insert into
>>>> hbase. Currently the environment that I'm running in is Spark 1.2.
>>>>
>>>> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
>>>> 0-2500 writes / second. This will take much too long to consume the
>>>> entries.
>>>>
>>>> I currently believe that the spark kafka receiver is the bottleneck.
>>>> I've tried both 1.2 receivers, with the WAL and without, and didn't
>>>> notice any large performance difference. I've tried many different
>>>> spark configuration options, but can't seem to get better performance.
>>>>
>>>> I saw 8 requests / second inserting these records into kafka using
>>>> yarn / hbase / protobuf / kafka in a bulk fashion.
>>>>
>>>> While hbase inserts might not deliver the same throughput, I'd like to
>>>> at least get 10%.
>>>>
>>>> My application looks like
>>>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>>>
>>>> This is my first spark application. I'd appreciate any assistance.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
print() isn't really the best way to benchmark things, since it calls
take(10) under the covers, but 380 records / second for a single
receiver doesn't sound right in any case.
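
To illustrate why an operation that only takes a few records is a poor benchmark, here is a minimal sketch in plain Java, with no Spark involved. CountingSource is a hypothetical stand-in for a batch of records that counts how many were actually read; take and drain are illustrative helpers, not Spark APIs. It models only the idea that a take stops early. In Spark the real amount of work also depends on partitioning and on what runs upstream.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class TakeVersusFullScan {

    /** Hypothetical stand-in for a batch of records; counts how many were read. */
    static final class CountingSource implements Iterator<Integer> {
        private final int size;
        private int next = 0;
        long reads = 0;

        CountingSource(int size) { this.size = size; }

        @Override public boolean hasNext() { return next < size; }

        @Override public Integer next() {
            if (!hasNext()) throw new NoSuchElementException();
            reads++;
            return next++;
        }
    }

    /** Reads at most n records and then stops, the way a take(n) would. */
    static List<Integer> take(Iterator<Integer> source, int n) {
        List<Integer> out = new ArrayList<>();
        while (out.size() < n && source.hasNext()) out.add(source.next());
        return out;
    }

    /** Reads every record, the way a count or a write to a sink would. */
    static long drain(Iterator<Integer> source) {
        long seen = 0;
        while (source.hasNext()) { source.next(); seen++; }
        return seen;
    }

    /** Number of records touched when taking n out of a batch of the given size. */
    static long readsForTake(int size, int n) {
        CountingSource source = new CountingSource(size);
        take(source, n);
        return source.reads;
    }

    /** Number of records touched when consuming the whole batch. */
    static long readsForDrain(int size) {
        CountingSource source = new CountingSource(size);
        drain(source);
        return source.reads;
    }

    public static void main(String[] args) {
        System.out.println("take(10) touched " + readsForTake(3000, 10) + " records");
        System.out.println("full scan touched " + readsForDrain(3000) + " records");
    }
}
```

A benchmark built on the first pattern measures the cost of fetching a handful of records, so an output operation that consumes every record (a count, for example) is a fairer baseline.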

Am I understanding correctly that you're trying to process a large
number of already-existing kafka messages, not keep up with an
incoming stream?  Can you give any details (e.g. hardware, number of
topicpartitions, etc)?

Really though, I'd try to start with spark 1.6 and direct streams, or
even just kafkacat, as a baseline.



On Mon, May 2, 2016 at 7:01 PM, Colin Kincaid Williams  wrote:
> Hello again. I searched for "backport kafka" in the list archives but
> couldn't find anything but a post from Spark 0.7.2 . I was going to
> use accumulators to make a counter, but then saw on the Streaming tab
> the Receiver Statistics. Then I removed all other "functionality"
> except:
>
>
> JavaPairReceiverInputDStream dstream = KafkaUtils
>   //createStream(JavaStreamingContext jssc,Class
> keyTypeClass,Class valueTypeClass, Class keyDecoderClass,
> Class valueDecoderClass, java.util.Map kafkaParams,
> java.util.Map topics, StorageLevel storageLevel)
>   .createStream(jssc, byte[].class, byte[].class,
> kafka.serializer.DefaultDecoder.class,
> kafka.serializer.DefaultDecoder.class, kafkaParamsMap, topicMap,
> StorageLevel.MEMORY_AND_DISK_SER());
>
>dstream.print();
>
> Then in the Receiver Stats for the single receiver, I'm seeing around
> 380 records / second. Then to get anywhere near my 10% mentioned
> above, I'd need to run around 21 receivers, assuming 380 records /
> second, just using the print output. This seems awfully high to me,
> considering that I wrote 8+ records a second to Kafka from a
> mapreduce job, and that my bottleneck was likely Hbase. Again using
> the 380 estimate, I would need 200+ receivers to reach a similar
> amount of reads.
>
> Even given the issues with the 1.2 receivers, is this the expected way
> to use the Kafka streaming API, or am I doing something terribly
> wrong?
>
> My application looks like
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
> On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger  wrote:
>> Have you tested for read throughput (without writing to hbase, just
>> deserialize)?
>>
>> Are you limited to using spark 1.2, or is upgrading possible?  The
>> kafka direct stream is available starting with 1.3.  If you're stuck
>> on 1.2, I believe there have been some attempts to backport it, search
>> the mailing list archives.
>>
>> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams  
>> wrote:
>>> I've written an application to get content from a kafka topic with 1.7
>>> billion entries,  get the protobuf serialized entries, and insert into
>>> hbase. Currently the environment that I'm running in is Spark 1.2.
>>>
>>> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
>>> 0-2500 writes / second. This will take much too long to consume the
>>> entries.
>>>
>>> I currently believe that the spark kafka receiver is the bottleneck.
>>> I've tried both 1.2 receivers, with the WAL and without, and didn't
>>> notice any large performance difference. I've tried many different
>>> spark configuration options, but can't seem to get better performance.
>>>
>>> I saw 8 requests / second inserting these records into kafka using
>>> yarn / hbase / protobuf / kafka in a bulk fashion.
>>>
>>> While hbase inserts might not deliver the same throughput, I'd like to
>>> at least get 10%.
>>>
>>> My application looks like
>>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>>
>>> This is my first spark application. I'd appreciate any assistance.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hello again. I searched for "backport kafka" in the list archives but
couldn't find anything but a post from Spark 0.7.2. I was going to
use accumulators to make a counter, but then saw on the Streaming tab
the Receiver Statistics. Then I removed all other "functionality"
except:


JavaPairReceiverInputDStream<byte[], byte[]> dstream = KafkaUtils
  // createStream(JavaStreamingContext jssc, Class<K> keyTypeClass,
  //   Class<V> valueTypeClass, Class<U> keyDecoderClass,
  //   Class<T> valueDecoderClass, java.util.Map<String, String> kafkaParams,
  //   java.util.Map<String, Integer> topics, StorageLevel storageLevel)
  .createStream(jssc, byte[].class, byte[].class,
      kafka.serializer.DefaultDecoder.class,
      kafka.serializer.DefaultDecoder.class, kafkaParamsMap, topicMap,
      StorageLevel.MEMORY_AND_DISK_SER());

dstream.print();

Then in the Receiver Stats for the single receiver, I'm seeing around
380 records / second. Then to get anywhere near my 10% mentioned
above, I'd need to run around 21 receivers, assuming 380 records /
second, just using the print output. This seems awfully high to me,
considering that I wrote 8+ records a second to Kafka from a
mapreduce job, and that my bottleneck was likely Hbase. Again using
the 380 estimate, I would need 200+ receivers to reach a similar
amount of reads.
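
The receiver counts in the paragraph above come from dividing a target rate by the observed 380 records / second. In the sketch below, the two target rates (8,000 and 80,000 records / second) are assumptions: they are the round figures that reproduce "around 21" and "200+" receivers, and they are in line with the 1.7 billion entries in five and a half hours reported elsewhere in this thread. The class and method names are illustrative only.

```java
public class ReceiverEstimate {

    /** Receivers needed to sustain a target rate if each one delivers the
     *  given rate, assuming throughput scales linearly with receiver count. */
    static double receiversNeeded(double targetRecordsPerSecond, double perReceiverRecordsPerSecond) {
        return targetRecordsPerSecond / perReceiverRecordsPerSecond;
    }

    public static void main(String[] args) {
        double perReceiver = 380.0;          // observed in the Receiver Stats
        double tenPercentTarget = 8_000.0;   // assumed: 10% of the Kafka write rate
        double fullTarget = 80_000.0;        // assumed: the full Kafka write rate

        System.out.printf("receivers for the 10%% target: %.1f%n",
                receiversNeeded(tenPercentTarget, perReceiver));
        System.out.printf("receivers for the full rate: %.1f%n",
                receiversNeeded(fullTarget, perReceiver));
    }
}
```

Linear scaling is itself an optimistic assumption, so these counts are better read as lower bounds.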

Even given the issues with the 1.2 receivers, is this the expected way
to use the Kafka streaming API, or am I doing something terribly
wrong?

My application looks like
https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877

On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger  wrote:
> Have you tested for read throughput (without writing to hbase, just
> deserialize)?
>
> Are you limited to using spark 1.2, or is upgrading possible?  The
> kafka direct stream is available starting with 1.3.  If you're stuck
> on 1.2, I believe there have been some attempts to backport it, search
> the mailing list archives.
>
> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams  
> wrote:
>> I've written an application to get content from a kafka topic with 1.7
>> billion entries,  get the protobuf serialized entries, and insert into
>> hbase. Currently the environment that I'm running in is Spark 1.2.
>>
>> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
>> 0-2500 writes / second. This will take much too long to consume the
>> entries.
>>
>> I currently believe that the spark kafka receiver is the bottleneck.
>> I've tried both 1.2 receivers, with the WAL and without, and didn't
>> notice any large performance difference. I've tried many different
>> spark configuration options, but can't seem to get better performance.
>>
>> I saw 8 requests / second inserting these records into kafka using
>> yarn / hbase / protobuf / kafka in a bulk fashion.
>>
>> While hbase inserts might not deliver the same throughput, I'd like to
>> at least get 10%.
>>
>> My application looks like
>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>
>> This is my first spark application. I'd appreciate any assistance.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>




Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hi Cody,

I'm going to use an accumulator right now to get an idea of the
throughput. Thanks for mentioning the backported module. Also, it
looks like I missed this section of the docs:
https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#reducing-the-processing-time-of-each-batch
Maybe I should try creating multiple streams to get more throughput?

Thanks,

Colin Williams

On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger  wrote:
> Have you tested for read throughput (without writing to hbase, just
> deserialize)?
>
> Are you limited to using spark 1.2, or is upgrading possible?  The
> kafka direct stream is available starting with 1.3.  If you're stuck
> on 1.2, I believe there have been some attempts to backport it, search
> the mailing list archives.
>
> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams  
> wrote:
>> I've written an application to get content from a kafka topic with 1.7
>> billion entries,  get the protobuf serialized entries, and insert into
>> hbase. Currently the environment that I'm running in is Spark 1.2.
>>
>> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
>> 0-2500 writes / second. This will take much too long to consume the
>> entries.
>>
>> I currently believe that the spark kafka receiver is the bottleneck.
>> I've tried both 1.2 receivers, with the WAL and without, and didn't
>> notice any large performance difference. I've tried many different
>> spark configuration options, but can't seem to get better performance.
>>
>> I saw 8 requests / second inserting these records into kafka using
>> yarn / hbase / protobuf / kafka in a bulk fashion.
>>
>> While hbase inserts might not deliver the same throughput, I'd like to
>> at least get 10%.
>>
>> My application looks like
>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>
>> This is my first spark application. I'd appreciate any assistance.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>




Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hi David,

My current concern is that I'm using a Spark HBase bulk put driver
written for Spark 1.2 on the version of CDH that my Spark / YARN cluster
is running. Even if I were to run on another Spark cluster, I'm
concerned that I might have issues making the put requests into HBase.
However, I should give it a shot if I abandon Spark 1.2 and my current
environment.

Thanks,

Colin Williams

On Mon, May 2, 2016 at 6:06 PM, Krieg, David
 wrote:
> Spark 1.2 is a little old and busted. I think most of the advice you'll get is
> to try to use Spark 1.3 at least, which introduced a new Spark Streaming mode
> (the receiver-less "direct" stream), which is what we use. The 1.2
> receiver-based implementation had a number of shortcomings. You'll get more
> joy the more you upgrade Spark, at least to some extent.
>
> David Krieg | Enterprise Software Engineer
> Early Warning
>
>
> -Original Message-
> From: Colin Kincaid Williams [mailto:disc...@uw.edu]
> Sent: Monday, May 02, 2016 10:55 AM
> To: user@spark.apache.org
> Subject: Improving performance of a kafka spark streaming app
>
> I've written an application to get content from a kafka topic with 1.7 billion
> entries,  get the protobuf serialized entries, and insert into hbase.
> Currently the environment that I'm running in is Spark 1.2.
>
> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
> 0-2500 writes / second. This will take much too long to consume the entries.
>
> I currently believe that the spark kafka receiver is the bottleneck.
> I've tried both 1.2 receivers, with the WAL and without, and didn't notice any
> large performance difference. I've tried many different spark configuration
> options, but can't seem to get better performance.
>
> I saw 8 requests / second inserting these records into kafka using yarn /
> hbase / protobuf / kafka in a bulk fashion.
>
> While hbase inserts might not deliver the same throughput, I'd like to at
> least get 10%.
>
> My application looks like
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
> This is my first spark application. I'd appreciate any assistance.



Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
Have you tested for read throughput (without writing to hbase, just
deserialize)?

Are you limited to using spark 1.2, or is upgrading possible?  The
kafka direct stream is available starting with 1.3.  If you're stuck
on 1.2, I believe there have been some attempts to backport it, search
the mailing list archives.

On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams  wrote:
> I've written an application to get content from a kafka topic with 1.7
> billion entries,  get the protobuf serialized entries, and insert into
> hbase. Currently the environment that I'm running in is Spark 1.2.
>
> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
> 0-2500 writes / second. This will take much too long to consume the
> entries.
>
> I currently believe that the spark kafka receiver is the bottleneck.
> I've tried both 1.2 receivers, with the WAL and without, and didn't
> notice any large performance difference. I've tried many different
> spark configuration options, but can't seem to get better performance.
>
> I saw 8 requests / second inserting these records into kafka using
> yarn / hbase / protobuf / kafka in a bulk fashion.
>
> While hbase inserts might not deliver the same throughput, I'd like to
> at least get 10%.
>
> My application looks like
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
> This is my first spark application. I'd appreciate any assistance.



Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
I've written an application to get content from a kafka topic with 1.7
billion entries,  get the protobuf serialized entries, and insert into
hbase. Currently the environment that I'm running in is Spark 1.2.

With 8 executors and 2 cores, and 2 jobs, I'm only getting between
0-2500 writes / second. This will take much too long to consume the
entries.
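
For a rough sense of "much too long", the back-of-the-envelope arithmetic can be written as a small sketch. It assumes the best case of a sustained 2500 writes / second over all 1.7 billion entries; the class and method names are made up for illustration.

```java
public class BacklogEstimate {
    // Days needed to drain `entries` records at a constant `writesPerSecond`.
    static double daysToDrain(long entries, double writesPerSecond) {
        double seconds = entries / writesPerSecond;
        return seconds / 86400.0; // 86400 seconds per day
    }

    public static void main(String[] args) {
        // 1.7 billion entries at the top of the observed 0-2500 writes / second range
        System.out.printf("%.1f days%n", daysToDrain(1_700_000_000L, 2500.0));
    }
}
```

At that rate the topic needs about 680,000 seconds, or a bit under 8 days, and the real figure is worse because the observed rate ranges down to 0.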

I currently believe that the spark kafka receiver is the bottleneck.
I've tried both 1.2 receivers, with the WAL and without, and didn't
notice any large performance difference. I've tried many different
spark configuration options, but can't seem to get better performance.

I saw 8 requests / second inserting these records into kafka using
yarn / hbase / protobuf / kafka in a bulk fashion.

While hbase inserts might not deliver the same throughput, I'd like to
at least get 10%.

My application looks like
https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877

This is my first spark application. I'd appreciate any assistance.




Re: Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-19 Thread Cody Koeninger
Each partition belongs to exactly one topic, so you're probably better off
dealing with topics per partition rather than at the individual message level.

http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers

Look at the discussion of "HasOffsetRanges"

If you really want to attach a topic to each message, look at the
constructor that allows you to pass a messageHandler argument.  That
gives you per-item access to everything in message and metadata,
including the topic.
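
The per-partition approach can be illustrated without Spark. The sketch below is a plain-Java model, not Spark code: "TopicRange" is a hypothetical stand-in for the per-partition offset metadata that "HasOffsetRanges" exposes, and it assumes entry i of the range list describes partition i of the batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TopicTagging {
    // Hypothetical stand-in for per-partition offset metadata (topic and partition only).
    static final class TopicRange {
        final String topic;
        final int partition;

        TopicRange(String topic, int partition) {
            this.topic = topic;
            this.partition = partition;
        }
    }

    // Pair every message with the topic of the partition it came from.
    // ranges.get(i) is assumed to describe partitions.get(i).
    static List<String[]> tagWithTopic(List<TopicRange> ranges, List<List<String>> partitions) {
        List<String[]> tagged = new ArrayList<String[]>();
        for (int i = 0; i < partitions.size(); i++) {
            String topic = ranges.get(i).topic; // one lookup per partition, not per message
            for (String message : partitions.get(i)) {
                tagged.add(new String[] { topic, message });
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        List<TopicRange> ranges = Arrays.asList(
            new TopicRange("topic1", 0),
            new TopicRange("topic2", 0));
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("m1", "m2"),
            Arrays.asList("m3"));
        for (String[] pair : tagWithTopic(ranges, partitions)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

The topic is resolved once per partition, which is why this tends to be cheaper than carrying the topic on every message through a messageHandler.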

On Wed, Mar 16, 2016 at 3:37 AM, Imre Nagi  wrote:
> Hi,
>
> I'm just trying to process the data that come from the kafka source in my
> spark streaming application. What I want to do is get the pair of topic and
> message in a tuple from the message stream.
>
> Here is my streams:
>
>>  val streams = KafkaUtils.createDirectStream[String, Array[Byte],
>>    StringDecoder, DefaultDecoder](ssc, kafkaParameter,
>>    Set("topic1", "topic2"))
>
>
> I have done several things, but still failed when i did some transformations
> from the streams to the pair of topic and message. I hope somebody can help
> me here.
>
> Thanks,
> Imre




Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-16 Thread Imre Nagi
Hi,

I'm just trying to process the data that come from the kafka source in my
spark streaming application. What I want to do is get the pair of topic and
message in a tuple from the message stream.

Here is my stream:

 val streams = KafkaUtils.createDirectStream[String, Array[Byte],
   StringDecoder, DefaultDecoder](ssc, kafkaParameter,
   Set("topic1", "topic2"))


I have tried several things, but I still fail when I try to transform
the stream into pairs of topic and message. I hope somebody can help me
here.

Thanks,
Imre


Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-15 Thread Imre Nagi
Hi,

I'm just trying to process the data that come from the kafka source in my
spark streaming application. What I want to do is get the pair of topic and
message in a tuple from the message stream.

Here is my stream:

 val streams = KafkaUtils.createDirectStream[String, Array[Byte],
   StringDecoder, DefaultDecoder](ssc, kafkaParameter,
   Set("topic1", "topic2"))


I have tried several things, but I still fail when I try to transform
the stream into pairs of topic and message. I hope somebody can help me
here.

Thanks,
Imre


RE: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-14 Thread Mukul Gupta
Thanks, the behavior is now clear to me.

I tried "foreachRDD" and indeed all partitions are processed in parallel.
I also tried "saveAsTextFile" instead of print, and again all partitions
were processed in parallel.

-Original Message-
From: Cody Koeninger [mailto:c...@koeninger.org]
Sent: Monday, March 14, 2016 9:39 PM
To: Mukul Gupta 
Cc: user@spark.apache.org
Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel

So what's happening here is that print() uses take().  Take() will try to 
satisfy the request using only the first partition of the rdd, then use other 
partitions if necessary.

If you change to using something like foreach

processed.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> it) {
        it.foreach(new VoidFunction<String>() {
            @Override
            public void call(String s) {
                System.err.println(s);
            }
        });
    }
});

you'll see 3 or 5 or however many partitions being processed simultaneously.

As an aside, if you don't have much experience with or investment in java, I'd 
highly recommend you use scala for interacting with spark.
Most of the code is written in scala, it's more concise, and it's easier to use 
the spark shell to experiment when you're learning.


On Sun, Mar 13, 2016 at 7:03 AM, Mukul Gupta  wrote:
> Sorry for the late reply. I am new to Java and it took me a while to set 
> things up.
>
> Yes, you are correct that kafka client libs need not be specifically added. I 
> didn't realized that . I removed the same and code still compiled. However, 
> upon execution, I encountered the same issue as before.
>
> Following is the link to repository:
> https://github.com/guptamukul/sparktest.git
>
> 
> From: Cody Koeninger 
> Sent: 11 March 2016 23:04
> To: Mukul Gupta
> Cc: user@spark.apache.org
> Subject: Re: Kafka + Spark streaming, RDD partitions not processed in
> parallel
>
> Why are you including a specific dependency on Kafka?  Spark's
> external streaming kafka module already depends on kafka.
>
> Can you link to an actual repo with build file etc?
>
> On Fri, Mar 11, 2016 at 11:21 AM, Mukul Gupta  wrote:
>> Please note that while building jar of code below, i used spark 1.6.0
>> + kafka 0.9.0.0 libraries I also tried spark 1.5.0 + kafka 0.9.0.1 
>> combination, but encountered the same issue.
>>
>> I could not use the ideal combination spark 1.6.0 + kafka 0.9.0.1 (which 
>> matches with spark and kafka versions installed on my machine) because while 
>> doing so, i get the following error at run time:
>> Exception in thread "main" java.lang.ClassCastException:
>> kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
>>
>> package sparktest;
>>
>> import java.util.Arrays;
>> import java.util.HashMap;
>> import java.util.HashSet;
>>
>> import org.apache.spark.SparkConf;
>> import org.apache.spark.streaming.api.java.JavaDStream;
>> import org.apache.spark.api.java.function.Function;
>> import org.apache.spark.streaming.Durations;
>> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
>> import org.apache.spark.streaming.api.java.JavaStreamingContext;
>> import org.apache.spark.streaming.kafka.KafkaUtils;
>> import kafka.serializer.StringDecoder; import scala.Tuple2;
>>
>> public class SparkTest {
>>
>> public static void main(String[] args) {
>>
>> if (args.length < 5) {
>> System.err.println("Usage: SparkTest <kafkaBroker> <sparkMaster> <topics> <consumerGroupID> <durationSec>");
>> System.exit(1);
>> }
>>
>> String kafkaBroker = args[0];
>> String sparkMaster = args[1];
>> String topics = args[2];
>> String consumerGroupID = args[3];
>> String durationSec = args[4];
>>
>> int duration = 0;
>>
>> try {
>> duration = Integ

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-14 Thread Cody Koeninger
So what's happening here is that print() uses take().  Take() will try
to satisfy the request using only the first partition of the rdd, then
use other partitions if necessary.

If you change to using something like foreach

processed.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> it) {
        it.foreach(new VoidFunction<String>() {
            @Override
            public void call(String s) {
                System.err.println(s);
            }
        });
    }
});

you'll see 3 or 5 or however many partitions being processed simultaneously.
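
The take() behavior described above can be modeled in plain Java. This is a simplified sketch, not Spark's implementation: partitions are ordinary lists, each "wave" stands for one job, and the factor by which later waves grow is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TakeModel {
    // Number of partitions the most recent take() call had to scan.
    static int partitionsScanned = 0;

    // Simplified model of take(n): scan partitions in order, in growing waves,
    // and stop as soon as n elements have been collected.
    static List<String> take(List<List<String>> partitions, int n) {
        List<String> result = new ArrayList<String>();
        partitionsScanned = 0;
        int wave = 1; // the first wave looks at a single partition only
        while (result.size() < n && partitionsScanned < partitions.size()) {
            int end = Math.min(partitions.size(), partitionsScanned + wave);
            for (int p = partitionsScanned; p < end; p++) {
                for (String s : partitions.get(p)) {
                    if (result.size() < n) {
                        result.add(s);
                    }
                }
            }
            partitionsScanned = end;
            wave *= 4; // later waves cover more partitions at once (assumed growth factor)
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a1", "a2"),
            Arrays.asList("b1", "b2"),
            Arrays.asList("c1", "c2"));
        System.out.println(take(partitions, 2) + " scanned=" + partitionsScanned);
        System.out.println(take(partitions, 5) + " scanned=" + partitionsScanned);
    }
}
```

With a slow map function in front, the first wave keeps one executor busy on partition 1 while the others sit idle, and only the second wave touches the remaining partitions together, which matches the observation in the original post.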

As an aside, if you don't have much experience with or investment in
java, I'd highly recommend you use scala for interacting with spark.
Most of the code is written in scala, it's more concise, and it's
easier to use the spark shell to experiment when you're learning.


On Sun, Mar 13, 2016 at 7:03 AM, Mukul Gupta  wrote:
> Sorry for the late reply. I am new to Java and it took me a while to set 
> things up.
>
> Yes, you are correct that kafka client libs need not be specifically added. I 
> didn't realized that . I removed the same and code still compiled. However, 
> upon execution, I encountered the same issue as before.
>
> Following is the link to repository:
> https://github.com/guptamukul/sparktest.git
>
> 
> From: Cody Koeninger 
> Sent: 11 March 2016 23:04
> To: Mukul Gupta
> Cc: user@spark.apache.org
> Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel
>
> Why are you including a specific dependency on Kafka?  Spark's
> external streaming kafka module already depends on kafka.
>
> Can you link to an actual repo with build file etc?
>
> On Fri, Mar 11, 2016 at 11:21 AM, Mukul Gupta  wrote:
>> Please note that while building jar of code below, i used spark 1.6.0 + 
>> kafka 0.9.0.0 libraries
>> I also tried spark 1.5.0 + kafka 0.9.0.1 combination, but encountered the 
>> same issue.
>>
>> I could not use the ideal combination spark 1.6.0 + kafka 0.9.0.1 (which 
>> matches with spark and kafka versions installed on my machine) because while 
>> doing so, i get the following error at run time:
>> Exception in thread "main" java.lang.ClassCastException: 
>> kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
>>
>> package sparktest;
>>
>> import java.util.Arrays;
>> import java.util.HashMap;
>> import java.util.HashSet;
>>
>> import org.apache.spark.SparkConf;
>> import org.apache.spark.streaming.api.java.JavaDStream;
>> import org.apache.spark.api.java.function.Function;
>> import org.apache.spark.streaming.Durations;
>> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
>> import org.apache.spark.streaming.api.java.JavaStreamingContext;
>> import org.apache.spark.streaming.kafka.KafkaUtils;
>> import kafka.serializer.StringDecoder;
>> import scala.Tuple2;
>>
>> public class SparkTest {
>>
>> public static void main(String[] args) {
>>
>> if (args.length < 5) {
>> System.err.println("Usage: SparkTest <kafkaBroker> <sparkMaster> <topics> <consumerGroupID> <durationSec>");
>> System.exit(1);
>> }
>>
>> String kafkaBroker = args[0];
>> String sparkMaster = args[1];
>> String topics = args[2];
>> String consumerGroupID = args[3];
>> String durationSec = args[4];
>>
>> int duration = 0;
>>
>> try {
>> duration = Integer.parseInt(durationSec);
>> } catch (Exception e) {
>> System.err.println("Illegal duration");
>> System.exit(1);
>> }
>>
>> HashSet<String> topicsSet =
>>     new HashSet<String>(Arrays.asList(topics.split(",")));
>>
>> SparkConf  conf = new 
>> SparkConf().setMaster(sparkMaster).setAppName("DirectStreamDemo");
>>
>> JavaStreamingContext jssc = new JavaStr

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-13 Thread Mukul Gupta
Sorry for the late reply. I am new to Java and it took me a while to set things
up.

Yes, you are correct that the Kafka client libs need not be added explicitly. I
didn't realize that. I removed them and the code still compiled. However,
upon execution, I encountered the same issue as before.

Following is the link to repository:
https://github.com/guptamukul/sparktest.git


From: Cody Koeninger 
Sent: 11 March 2016 23:04
To: Mukul Gupta
Cc: user@spark.apache.org
Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel

Why are you including a specific dependency on Kafka?  Spark's
external streaming kafka module already depends on kafka.

Can you link to an actual repo with build file etc?

On Fri, Mar 11, 2016 at 11:21 AM, Mukul Gupta  wrote:
> Please note that while building jar of code below, i used spark 1.6.0 + kafka 
> 0.9.0.0 libraries
> I also tried spark 1.5.0 + kafka 0.9.0.1 combination, but encountered the 
> same issue.
>
> I could not use the ideal combination spark 1.6.0 + kafka 0.9.0.1 (which 
> matches with spark and kafka versions installed on my machine) because while 
> doing so, i get the following error at run time:
> Exception in thread "main" java.lang.ClassCastException: 
> kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
>
> package sparktest;
>
> import java.util.Arrays;
> import java.util.HashMap;
> import java.util.HashSet;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.streaming.api.java.JavaDStream;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.kafka.KafkaUtils;
> import kafka.serializer.StringDecoder;
> import scala.Tuple2;
>
> public class SparkTest {
>
> public static void main(String[] args) {
>
> if (args.length < 5) {
> System.err.println("Usage: SparkTest <kafkaBroker> <sparkMaster> <topics> <consumerGroupID> <durationSec>");
> System.exit(1);
> }
>
> String kafkaBroker = args[0];
> String sparkMaster = args[1];
> String topics = args[2];
> String consumerGroupID = args[3];
> String durationSec = args[4];
>
> int duration = 0;
>
> try {
> duration = Integer.parseInt(durationSec);
> } catch (Exception e) {
> System.err.println("Illegal duration");
> System.exit(1);
> }
>
> HashSet<String> topicsSet =
>     new HashSet<String>(Arrays.asList(topics.split(",")));
>
> SparkConf conf =
>     new SparkConf().setMaster(sparkMaster).setAppName("DirectStreamDemo");
>
> JavaStreamingContext jssc =
>     new JavaStreamingContext(conf, Durations.seconds(duration));
>
> HashMap<String, String> kafkaParams = new HashMap<String, String>();
> kafkaParams.put("metadata.broker.list", kafkaBroker);
> kafkaParams.put("group.id", consumerGroupID);
>
> JavaPairInputDStream<String, String> messages =
>     KafkaUtils.createDirectStream(jssc, String.class, String.class,
>         StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet);
>
> JavaDStream<String> processed = messages.map(new Function<Tuple2<String, String>, String>() {
>
> @Override
> public String call(Tuple2<String, String> arg0) throws Exception {
>
> Thread.sleep(7000);
> return arg0._2;
> }
> });
>
> processed.print(90);
>
> try {
> jssc.start();
> jssc.awaitTermination();
> } catch (Exception e) {
>
> } finally {
> jssc.close();
> }
> }
> }
>
>
> 
> From: Cody Koeninger 
> Sent: 11 March 2016 20:42
> To: Mukul Gupta
> Cc: user@spark.apache.org
> Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel
>
> Can you post your actual code?
>
> On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta  wrote:
>> Hi All, I was running the following test: Setup 9 VM runing spark workers
>> with 1 spark executor each. 1 VM running kafka and spark master. Spark
>> version is 1.6.0 Kafka version is 0.9.0.1 Spark is using its own resource
>> manager and is not running over YARN. Test I created a kafka topic with 3
>&

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
Why are you including a specific dependency on Kafka?  Spark's
external streaming kafka module already depends on kafka.

Can you link to an actual repo with build file etc?

On Fri, Mar 11, 2016 at 11:21 AM, Mukul Gupta  wrote:
> Please note that while building jar of code below, i used spark 1.6.0 + kafka 
> 0.9.0.0 libraries
> I also tried spark 1.5.0 + kafka 0.9.0.1 combination, but encountered the 
> same issue.
>
> I could not use the ideal combination spark 1.6.0 + kafka 0.9.0.1 (which 
> matches with spark and kafka versions installed on my machine) because while 
> doing so, i get the following error at run time:
> Exception in thread "main" java.lang.ClassCastException: 
> kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
>
> package sparktest;
>
> import java.util.Arrays;
> import java.util.HashMap;
> import java.util.HashSet;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.streaming.api.java.JavaDStream;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.kafka.KafkaUtils;
> import kafka.serializer.StringDecoder;
> import scala.Tuple2;
>
> public class SparkTest {
>
> public static void main(String[] args) {
>
> if (args.length < 5) {
> System.err.println("Usage: SparkTest <kafkaBroker> <sparkMaster> <topics> <consumerGroupID> <durationSec>");
> System.exit(1);
> }
>
> String kafkaBroker = args[0];
> String sparkMaster = args[1];
> String topics = args[2];
> String consumerGroupID = args[3];
> String durationSec = args[4];
>
> int duration = 0;
>
> try {
> duration = Integer.parseInt(durationSec);
> } catch (Exception e) {
> System.err.println("Illegal duration");
> System.exit(1);
> }
>
> HashSet<String> topicsSet =
>     new HashSet<String>(Arrays.asList(topics.split(",")));
>
> SparkConf conf =
>     new SparkConf().setMaster(sparkMaster).setAppName("DirectStreamDemo");
>
> JavaStreamingContext jssc =
>     new JavaStreamingContext(conf, Durations.seconds(duration));
>
> HashMap<String, String> kafkaParams = new HashMap<String, String>();
> kafkaParams.put("metadata.broker.list", kafkaBroker);
> kafkaParams.put("group.id", consumerGroupID);
>
> JavaPairInputDStream<String, String> messages =
>     KafkaUtils.createDirectStream(jssc, String.class, String.class,
>         StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet);
>
> JavaDStream<String> processed = messages.map(new Function<Tuple2<String, String>, String>() {
>
> @Override
> public String call(Tuple2<String, String> arg0) throws Exception {
>
> Thread.sleep(7000);
> return arg0._2;
> }
> });
>
> processed.print(90);
>
> try {
> jssc.start();
> jssc.awaitTermination();
> } catch (Exception e) {
>
> } finally {
> jssc.close();
> }
> }
> }
>
>
> 
> From: Cody Koeninger 
> Sent: 11 March 2016 20:42
> To: Mukul Gupta
> Cc: user@spark.apache.org
> Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel
>
> Can you post your actual code?
>
> On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta  wrote:
>> Hi All, I was running the following test: Setup 9 VM runing spark workers
>> with 1 spark executor each. 1 VM running kafka and spark master. Spark
>> version is 1.6.0 Kafka version is 0.9.0.1 Spark is using its own resource
>> manager and is not running over YARN. Test I created a kafka topic with 3
>> partition. next I used "KafkaUtils.createDirectStream" to get a DStream.
>> JavaPairInputDStream stream =
>> KafkaUtils.createDirectStream(…); JavaDStream stream1 = stream.map(func1);
>> stream1.print(); where func1 just contains a sleep followed by returning of
>> value. Observation First RDD partition corresponding to partition 1 of kafka
>> was processed on one of the spark executor. Once processing is finished,
>> then RDD partitions corresponding to remaining two kafka partitions were
>> processed in parallel on different spark 

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Mukul Gupta
Please note that while building the jar of the code below, I used the spark 1.6.0 +
kafka 0.9.0.0 libraries. I also tried the spark 1.5.0 + kafka 0.9.0.1 combination,
but encountered the same issue.

I could not use the ideal combination spark 1.6.0 + kafka 0.9.0.1 (which
matches the spark and kafka versions installed on my machine) because when
doing so, I get the following error at run time:
Exception in thread "main" java.lang.ClassCastException:
kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker

package sparktest;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import kafka.serializer.StringDecoder;
import scala.Tuple2;

public class SparkTest {

    public static void main(String[] args) {

        if (args.length < 5) {
            System.err.println("Usage: SparkTest <kafkaBroker> <sparkMaster> <topics> <consumerGroupID> <durationSec>");
            System.exit(1);
        }

        String kafkaBroker = args[0];
        String sparkMaster = args[1];
        String topics = args[2];
        String consumerGroupID = args[3];
        String durationSec = args[4];

        // Batch interval in seconds
        int duration = 0;

        try {
            duration = Integer.parseInt(durationSec);
        } catch (Exception e) {
            System.err.println("Illegal duration");
            System.exit(1);
        }

        // Comma-separated topic list
        HashSet<String> topicsSet =
            new HashSet<String>(Arrays.asList(topics.split(",")));

        SparkConf conf =
            new SparkConf().setMaster(sparkMaster).setAppName("DirectStreamDemo");

        JavaStreamingContext jssc =
            new JavaStreamingContext(conf, Durations.seconds(duration));

        HashMap<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", kafkaBroker);
        kafkaParams.put("group.id", consumerGroupID);

        JavaPairInputDStream<String, String> messages =
            KafkaUtils.createDirectStream(jssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet);

        JavaDStream<String> processed = messages.map(new Function<Tuple2<String, String>, String>() {

            @Override
            public String call(Tuple2<String, String> arg0) throws Exception {
                // Simulate slow per-record processing
                Thread.sleep(7000);
                return arg0._2;
            }
        });

        processed.print(90);

        try {
            jssc.start();
            jssc.awaitTermination();
        } catch (Exception e) {
            // Exceptions are ignored here; the context is closed below
        } finally {
            jssc.close();
        }
    }
}



From: Cody Koeninger 
Sent: 11 March 2016 20:42
To: Mukul Gupta
Cc: user@spark.apache.org
Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel

Can you post your actual code?

On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta  wrote:
> Hi All, I was running the following test: Setup 9 VM runing spark workers
> with 1 spark executor each. 1 VM running kafka and spark master. Spark
> version is 1.6.0 Kafka version is 0.9.0.1 Spark is using its own resource
> manager and is not running over YARN. Test I created a kafka topic with 3
> partition. next I used "KafkaUtils.createDirectStream" to get a DStream.
> JavaPairInputDStream stream =
> KafkaUtils.createDirectStream(…); JavaDStream stream1 = stream.map(func1);
> stream1.print(); where func1 just contains a sleep followed by returning of
> value. Observation First RDD partition corresponding to partition 1 of kafka
> was processed on one of the spark executor. Once processing is finished,
> then RDD partitions corresponding to remaining two kafka partitions were
> processed in parallel on different spark executors. I expected that all
> three RDD partitions should have been processed in parallel as there were
> spark executors available which were lying idle. I re-ran the test after
> increasing the partitions of kafka topic to 5. This time also RDD partition
> corresponding to partition 1 of kafka was processed on one of the spark
> executor. Once processing is finished for this RDD partition, then RDD
> partitions corresponding to remaining four kafka partitions were processed
> in parallel on different spark executors. I am not clear about why spark is
> waiting for operations on first RDD partition to finish, while it could
> process remaining partitions in parallel? Am I missing any configuration?
> Any help is appreciated. Thanks, Mukul

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
Can you post your actual code?

On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta  wrote:
> Hi All, I was running the following test: Setup 9 VM runing spark workers
> with 1 spark executor each. 1 VM running kafka and spark master. Spark
> version is 1.6.0 Kafka version is 0.9.0.1 Spark is using its own resource
> manager and is not running over YARN. Test I created a kafka topic with 3
> partition. next I used "KafkaUtils.createDirectStream" to get a DStream.
> JavaPairInputDStream stream =
> KafkaUtils.createDirectStream(…); JavaDStream stream1 = stream.map(func1);
> stream1.print(); where func1 just contains a sleep followed by returning of
> value. Observation First RDD partition corresponding to partition 1 of kafka
> was processed on one of the spark executor. Once processing is finished,
> then RDD partitions corresponding to remaining two kafka partitions were
> processed in parallel on different spark executors. I expected that all
> three RDD partitions should have been processed in parallel as there were
> spark executors available which were lying idle. I re-ran the test after
> increasing the partitions of kafka topic to 5. This time also RDD partition
> corresponding to partition 1 of kafka was processed on one of the spark
> executor. Once processing is finished for this RDD partition, then RDD
> partitions corresponding to remaining four kafka partitions were processed
> in parallel on different spark executors. I am not clear about why spark is
> waiting for operations on first RDD partition to finish, while it could
> process remaining partitions in parallel? Am I missing any configuration?
> Any help is appreciated. Thanks, Mukul
> ____
> View this message in context: Kafka + Spark streaming, RDD partitions not
> processed in parallel
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-10 Thread Mukul Gupta
Hi All,

I was running the following test:

*Setup*
9 VMs running Spark workers with 1 Spark executor each.
1 VM running Kafka and the Spark master.
Spark version is 1.6.0.
Kafka version is 0.9.0.1.
Spark is using its own resource manager and is not running over YARN.

*Test*
I created a Kafka topic with 3 partitions. Next I used
"KafkaUtils.createDirectStream" to get a DStream:

JavaPairInputDStream stream = KafkaUtils.createDirectStream(…);
JavaDStream stream1 = stream.map(func1);
stream1.print();

where func1 just contains a sleep followed by returning the value.

*Observation*
The first RDD partition, corresponding to partition 1 of Kafka, was processed
on one of the Spark executors. Once that processing finished, the RDD
partitions corresponding to the remaining two Kafka partitions were processed
in parallel on different Spark executors.

I expected all three RDD partitions to be processed in parallel, as there
were Spark executors available which were lying idle.

I re-ran the test after increasing the partitions of the Kafka topic to 5.
This time, too, the RDD partition corresponding to partition 1 of Kafka was
processed on one of the Spark executors. Once processing finished for this
RDD partition, the RDD partitions corresponding to the remaining four Kafka
partitions were processed in parallel on different Spark executors.

I am not clear about why Spark waits for operations on the first RDD
partition to finish when it could process the remaining partitions in
parallel. Am I missing any configuration? Any help is appreciated.

Thanks,
Mukul



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-RDD-partitions-not-processed-in-parallel-tp26457.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Thanks. I'll keep an eye on this. Our implementation of the DStream
basically accepts a function to compute the current offsets. The implementation
of that function fetches the list of topics from ZooKeeper once in a while. It
then adds consumer offsets for newly added topics to the currentOffsets
that are held in memory and deletes removed topics. The "once in a while" is
pluggable as well, and we are planning to use ZK watches instead of a
time-based refresh. This works for us because we use ZK extensively for a lot
of other bookkeeping.
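
The refresh step described above might look roughly like the following sketch. It is plain Scala with no Spark or ZooKeeper dependency; the names `OffsetRefresh`, `TopicPartition`, `refresh`, and the `discovered` map (a starting offset for every partition of the topics currently listed) are assumptions made for illustration, not part of any Spark API or of the implementation discussed in this thread.

```scala
object OffsetRefresh {
  // Hypothetical key type standing in for a Kafka topic+partition pair.
  final case class TopicPartition(topic: String, partition: Int)

  /** Reconcile the in-memory offsets with the partitions currently discovered.
    *
    *  - partitions already tracked keep their in-memory offset
    *  - newly discovered partitions start at the discovered offset
    *  - partitions that are no longer discovered (removed topics) are dropped
    */
  def refresh(
      current: Map[TopicPartition, Long],
      discovered: Map[TopicPartition, Long]
  ): Map[TopicPartition, Long] =
    discovered.map { case (tp, start) => tp -> current.getOrElse(tp, start) }
}
```

Because the function is pure, the trigger (a timer or a ZK watch) can stay pluggable: whatever fires simply calls `refresh` with the latest topic listing.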


On Fri, Sep 25, 2015 at 1:16 PM, Cody Koeninger  wrote:

> Yes, the partition IDs are the same.
>
> As far as the failure / subclassing goes, you may want to keep an eye on
> https://issues.apache.org/jira/browse/SPARK-10320 , not sure if the
> suggestions in there will end up going anywhere.
>
> On Fri, Sep 25, 2015 at 3:01 PM, Neelesh  wrote:
>
>> For the 1-1 mapping case, can I use TaskContext.get().partitionId as an
>> index in to the offset ranges?
>> For the failure case, yes, I'm subclassing of DirectKafkaInputDStream.
>> As for failures, different partitions in the same batch may be talking to
>> different RDBMS servers due to multitenancy - a spark streaming app is
>> consuming from several topics, each topic mapped to a customer for example.
>> It is quite possible that in a batch, several partitions belonging to the
>> same customer may fail, and others will go through. We don't want the whole
>> job to be killed because of one failing customer,and affect others in the
>> same job. Hope that makes sense.
>>
>> thnx
>>
>> On Fri, Sep 25, 2015 at 12:52 PM, Cody Koeninger 
>> wrote:
>>
>>> Your success case will work fine, it is a 1-1 mapping as you said.
>>>
>>> To handle failures in exactly the way you describe, you'd need to
>>> subclass or modify DirectKafkaInputDStream and change the way compute()
>>> works.
>>>
>>> Unless you really are going to have very fine-grained failures (why
>>> would only a given partition be failing while the rest are fine?) it's
>>> going to be easier to just fail the whole task and retry, or eventually
>>> kill the job.
>>>
>>
>


Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
Yes, the partition IDs are the same.

As far as the failure / subclassing goes, you may want to keep an eye on
https://issues.apache.org/jira/browse/SPARK-10320 , not sure if the
suggestions in there will end up going anywhere.

On Fri, Sep 25, 2015 at 3:01 PM, Neelesh  wrote:

> For the 1-1 mapping case, can I use TaskContext.get().partitionId as an
> index in to the offset ranges?
> For the failure case, yes, I'm subclassing of DirectKafkaInputDStream. As
> for failures, different partitions in the same batch may be talking to
> different RDBMS servers due to multitenancy - a spark streaming app is
> consuming from several topics, each topic mapped to a customer for example.
> It is quite possible that in a batch, several partitions belonging to the
> same customer may fail, and others will go through. We don't want the whole
> job to be killed because of one failing customer,and affect others in the
> same job. Hope that makes sense.
>
> thnx
>
> On Fri, Sep 25, 2015 at 12:52 PM, Cody Koeninger 
> wrote:
>
>> Your success case will work fine, it is a 1-1 mapping as you said.
>>
>> To handle failures in exactly the way you describe, you'd need to
>> subclass or modify DirectKafkaInputDStream and change the way compute()
>> works.
>>
>> Unless you really are going to have very fine-grained failures (why would
>> only a given partition be failing while the rest are fine?) it's going to
>> be easier to just fail the whole task and retry, or eventually kill the job.
>>
>> On Fri, Sep 25, 2015 at 1:55 PM, Neelesh  wrote:
>>
>>> Thanks Petr, Cody. This is a reasonable place to start for me. What I'm
>>> trying to achieve
>>>
>>> stream.foreachRDD {rdd=>
>>>rdd.foreachPartition { p=>
>>>
>>>Try(myFunc(...))  match {
>>>  case Sucess(s) => updatewatermark for this partition //of
>>> course, expectation is that it will work only if there is a 1-1 mapping at
>>> this point in time
>>>  case Failure()  => Tell the driver not to generate a partition
>>> for this kafka topic+partition for a while, by updating some shared state
>>> (zk)
>>>
>>>}
>>>
>>>  }
>>> }
>>>
>>> I was looking for that mapping b/w kafka partition thats bound to a task
>>> inside the task execution code, in cases where the intermediate operations
>>> do not change partitions, shuffle etc.
>>>
>>> -neelesh
>>>
>>
>


Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
For the 1-1 mapping case, can I use TaskContext.get().partitionId as an
index into the offset ranges?
For the failure case, yes, I'm subclassing DirectKafkaInputDStream. As
for failures, different partitions in the same batch may be talking to
different RDBMS servers due to multitenancy: a Spark Streaming app is
consuming from several topics, each topic mapped to a customer, for example.
It is quite possible that in a batch, several partitions belonging to the
same customer may fail while the others go through. We don't want the whole
job to be killed because of one failing customer, affecting others in the
same job. Hope that makes sense.

thnx

On Fri, Sep 25, 2015 at 12:52 PM, Cody Koeninger  wrote:

> Your success case will work fine, it is a 1-1 mapping as you said.
>
> To handle failures in exactly the way you describe, you'd need to subclass
> or modify DirectKafkaInputDStream and change the way compute() works.
>
> Unless you really are going to have very fine-grained failures (why would
> only a given partition be failing while the rest are fine?) it's going to
> be easier to just fail the whole task and retry, or eventually kill the job.
>
> On Fri, Sep 25, 2015 at 1:55 PM, Neelesh  wrote:
>
>> Thanks Petr, Cody. This is a reasonable place to start for me. What I'm
>> trying to achieve
>>
>> stream.foreachRDD {rdd=>
>>rdd.foreachPartition { p=>
>>
>>Try(myFunc(...))  match {
>>  case Sucess(s) => updatewatermark for this partition //of
>> course, expectation is that it will work only if there is a 1-1 mapping at
>> this point in time
>>  case Failure()  => Tell the driver not to generate a partition
>> for this kafka topic+partition for a while, by updating some shared state
>> (zk)
>>
>>}
>>
>>  }
>> }
>>
>> I was looking for that mapping b/w kafka partition thats bound to a task
>> inside the task execution code, in cases where the intermediate operations
>> do not change partitions, shuffle etc.
>>
>> -neelesh
>>
>> On Fri, Sep 25, 2015 at 11:14 AM, Cody Koeninger 
>> wrote:
>>
>>>
>>> http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
>>>
>>> also has an example of how to close over the offset ranges so they are
>>> available on executors.
>>>
>>>
>>
>


Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
Your success case will work fine, it is a 1-1 mapping as you said.

To handle failures in exactly the way you describe, you'd need to subclass
or modify DirectKafkaInputDStream and change the way compute() works.

Unless you really are going to have very fine-grained failures (why would
only a given partition be failing while the rest are fine?) it's going to
be easier to just fail the whole task and retry, or eventually kill the job.

On Fri, Sep 25, 2015 at 1:55 PM, Neelesh  wrote:

> Thanks Petr, Cody. This is a reasonable place to start for me. What I'm
> trying to achieve
>
> stream.foreachRDD {rdd=>
>rdd.foreachPartition { p=>
>
>Try(myFunc(...))  match {
>  case Sucess(s) => updatewatermark for this partition //of course,
> expectation is that it will work only if there is a 1-1 mapping at this
> point in time
>  case Failure()  => Tell the driver not to generate a partition
> for this kafka topic+partition for a while, by updating some shared state
> (zk)
>
>}
>
>  }
> }
>
> I was looking for that mapping b/w kafka partition thats bound to a task
> inside the task execution code, in cases where the intermediate operations
> do not change partitions, shuffle etc.
>
> -neelesh
>
> On Fri, Sep 25, 2015 at 11:14 AM, Cody Koeninger 
> wrote:
>
>>
>> http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
>>
>> also has an example of how to close over the offset ranges so they are
>> available on executors.
>>
>> On Fri, Sep 25, 2015 at 12:50 PM, Neelesh  wrote:
>>
>>> Hi,
>>>We are using DirectKafkaInputDStream and store completed consumer
>>> offsets in Kafka (0.8.2). However, some of our use case require that
>>> offsets be not written if processing of a partition fails with certain
>>> exceptions. This allows us to build various backoff strategies for that
>>> partition, instead of either blindly committing consumer offsets regardless
>>> of errors (because KafkaRDD as HasOffsetRanges is available only on the
>>> driver)  or relying on Spark's retry logic and continuing without remedial
>>> action.
>>>
>>> I was playing with SparkListener and found that while one can listen on
>>> taskCompletedEvent on the driver and even figure out that there was an
>>> error, there is no way of mapping this task back to the partition and
>>> retrieving offset range, topic & kafka partition # etc.
>>>
>>> Any pointers appreciated!
>>>
>>> Thanks!
>>> -neelesh
>>>
>>
>>
>


Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Thanks Petr, Cody. This is a reasonable place to start for me. What I'm
trying to achieve:

stream.foreachRDD { rdd =>
  rdd.foreachPartition { p =>
    Try(myFunc(...)) match {
      case Success(s) =>
        // update the watermark for this partition; of course, the
        // expectation is that this will work only if there is a 1-1
        // mapping at this point in time
      case Failure(e) =>
        // tell the driver not to generate a partition for this Kafka
        // topic+partition for a while, by updating some shared state (zk)
    }
  }
}

I was looking for the mapping between a task and the Kafka partition that is
bound to it, from inside the task execution code, in cases where the
intermediate operations do not change partitions, shuffle, etc.
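
As a rough illustration of the per-task decision described above, here is a plain-Scala sketch with no Spark dependency. `KafkaRange` is a local stand-in that mirrors the fields of Spark's `OffsetRange`; `Outcome`, `Advance`, `BackOff`, and `handle` are hypothetical names. The `partitionId` argument plays the role of `TaskContext.get.partitionId`, and the lookup is only meaningful while RDD partitions still map 1-1 onto Kafka partitions.

```scala
import scala.util.{Failure, Success, Try}

object PartitionOutcome {
  // Local stand-in mirroring the fields of Spark's OffsetRange (an assumption
  // for this sketch, so that it runs without Spark on the classpath).
  final case class KafkaRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

  sealed trait Outcome
  // Processing succeeded: the watermark may move to range.untilOffset.
  final case class Advance(range: KafkaRange) extends Outcome
  // Processing failed: leave the watermark alone and pause this topic+partition.
  final case class BackOff(range: KafkaRange, cause: Throwable) extends Outcome

  /** Run `work` for the Kafka range bound to this task and classify the result. */
  def handle(ranges: Array[KafkaRange], partitionId: Int)(work: KafkaRange => Unit): Outcome = {
    val range = ranges(partitionId) // valid only under the 1-1 mapping
    Try(work(range)) match {
      case Success(_) => Advance(range)
      case Failure(e) => BackOff(range, e)
    }
  }
}
```

How an `Outcome` reaches the driver (for example through shared state in ZooKeeper) is left open here.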

-neelesh

On Fri, Sep 25, 2015 at 11:14 AM, Cody Koeninger  wrote:

>
> http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
>
> also has an example of how to close over the offset ranges so they are
> available on executors.
>
> On Fri, Sep 25, 2015 at 12:50 PM, Neelesh  wrote:
>
>> Hi,
>>We are using DirectKafkaInputDStream and store completed consumer
>> offsets in Kafka (0.8.2). However, some of our use case require that
>> offsets be not written if processing of a partition fails with certain
>> exceptions. This allows us to build various backoff strategies for that
>> partition, instead of either blindly committing consumer offsets regardless
>> of errors (because KafkaRDD as HasOffsetRanges is available only on the
>> driver)  or relying on Spark's retry logic and continuing without remedial
>> action.
>>
>> I was playing with SparkListener and found that while one can listen on
>> taskCompletedEvent on the driver and even figure out that there was an
>> error, there is no way of mapping this task back to the partition and
>> retrieving offset range, topic & kafka partition # etc.
>>
>> Any pointers appreciated!
>>
>> Thanks!
>> -neelesh
>>
>
>


Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers

also has an example of how to close over the offset ranges so they are
available on executors.
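
The idea of "closing over" the offset ranges can be sketched in plain Scala as below. This is not the example from the linked guide: `KafkaRange` is a local stand-in for Spark's `OffsetRange`, and `describeTask` is a hypothetical helper. The comments indicate where the real Spark calls would sit.

```scala
object CloseOverOffsets {
  // Local stand-in mirroring the fields of Spark's OffsetRange (assumption).
  final case class KafkaRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

  // In a real job the array would be read on the driver inside foreachRDD:
  //   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // and the returned function would run inside rdd.foreachPartition, with
  // TaskContext.get.partitionId supplying `partitionId`.
  def describeTask(offsetRanges: Array[KafkaRange]): Int => String = { partitionId =>
    val o = offsetRanges(partitionId) // the captured array travels with the closure
    s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"
  }
}
```

The point of the pattern is that the array is an ordinary local value captured by the function, so each task can look up its own range without going back to the driver.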

On Fri, Sep 25, 2015 at 12:50 PM, Neelesh  wrote:

> Hi,
>We are using DirectKafkaInputDStream and store completed consumer
> offsets in Kafka (0.8.2). However, some of our use case require that
> offsets be not written if processing of a partition fails with certain
> exceptions. This allows us to build various backoff strategies for that
> partition, instead of either blindly committing consumer offsets regardless
> of errors (because KafkaRDD as HasOffsetRanges is available only on the
> driver)  or relying on Spark's retry logic and continuing without remedial
> action.
>
> I was playing with SparkListener and found that while one can listen on
> taskCompletedEvent on the driver and even figure out that there was an
> error, there is no way of mapping this task back to the partition and
> retrieving offset range, topic & kafka partition # etc.
>
> Any pointers appreciated!
>
> Thanks!
> -neelesh
>


Re: Kafka & Spark Streaming

2015-09-25 Thread Petr Novak
You can have offsetRanges on workers, e.g.:

object Something {
  var offsetRanges = Array[OffsetRange]()

  def create[F : ClassTag](stream: InputDStream[Array[Byte]])
      (implicit codec: Codec[F]): DStream[F] = {
    stream transform { rdd =>
      // capture the offset ranges of the current batch
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd flatMap { message =>
        // drop messages that fail to decode
        Try(codec.decode(message)) match {
          case Success(fact) => Some(fact)
          case Failure(e) => None
        }
      }
    }
  }
}

call create and use returned stream downstream.

or something like

// See https://issues.apache.org/jira/browse/SPARK-5569 for why I map
// OffsetRanges to a custom class

case class TopicMetadata(name: String, partition: Int, fromOffset: Long, untilOffset: Long)

object KafkaContext {
  private[this] var state = Array[TopicMetadata]()

  def captureTopicMetadata(offsetRanges: Array[OffsetRange]): Unit = {
    state = offsetRanges.map { o =>
      TopicMetadata(o.topic, o.partition, o.fromOffset, o.untilOffset)
    }
  }

  def topics: Array[TopicMetadata] = state
}

// then somewhere

def run(stream: InputDStream[Array[Byte]]) = {
  stream.transform { rdd =>
    KafkaContext.captureTopicMetadata(rdd.asInstanceOf[HasOffsetRanges].offsetRanges)
    rdd
  }
  .foreachRDD { rdd =>
    val s = KafkaContext.topics.map { x =>
      s"${x.name}_${x.partition}_${x.fromOffset}-${x.untilOffset}"
    }
    ...
  }
}


So they can be available on the driver. Sorry for the imprecise code, I'm in
a hurry. There are probably mistakes, but you can get the idea.
Petr

On Fri, Sep 25, 2015 at 7:50 PM, Neelesh  wrote:

> Hi,
>We are using DirectKafkaInputDStream and store completed consumer
> offsets in Kafka (0.8.2). However, some of our use case require that
> offsets be not written if processing of a partition fails with certain
> exceptions. This allows us to build various backoff strategies for that
> partition, instead of either blindly committing consumer offsets regardless
> of errors (because KafkaRDD as HasOffsetRanges is available only on the
> driver)  or relying on Spark's retry logic and continuing without remedial
> action.
>
> I was playing with SparkListener and found that while one can listen on
> taskCompletedEvent on the driver and even figure out that there was an
> error, there is no way of mapping this task back to the partition and
> retrieving offset range, topic & kafka partition # etc.
>
> Any pointers appreciated!
>
> Thanks!
> -neelesh
>


Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Hi,
   We are using DirectKafkaInputDStream and store completed consumer
offsets in Kafka (0.8.2). However, some of our use cases require that
offsets not be written if processing of a partition fails with certain
exceptions. This allows us to build various backoff strategies for that
partition, instead of either blindly committing consumer offsets regardless
of errors (because KafkaRDD as HasOffsetRanges is available only on the
driver) or relying on Spark's retry logic and continuing without remedial
action.

I was playing with SparkListener and found that while one can listen for
taskCompletedEvent on the driver and even figure out that there was an
error, there is no way of mapping this task back to the partition and
retrieving the offset range, topic, Kafka partition number, etc.
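
To make "backoff strategies for that partition" concrete, here is one possible shape, sketched in plain Scala. Everything in it is an assumption for illustration: the names `PartitionBackoff`, `TopicPartition`, `State`, the exponential policy, and the delay constants. It says nothing about where the state is stored or how Spark is told to skip a partition.

```scala
object PartitionBackoff {
  // Hypothetical key type for a Kafka topic+partition pair.
  final case class TopicPartition(topic: String, partition: Int)
  // Consecutive failures so far, and the earliest time (ms) a retry is allowed.
  final case class State(failures: Int, retryAtMs: Long)

  val baseDelayMs = 1000L  // assumed first delay
  val maxDelayMs = 60000L  // assumed upper bound on the delay

  /** Record a failure: the delay doubles per consecutive failure, up to the cap. */
  def onFailure(states: Map[TopicPartition, State], tp: TopicPartition, nowMs: Long): Map[TopicPartition, State] = {
    val failures = states.get(tp).map(_.failures).getOrElse(0) + 1
    val exponent = math.min(failures - 1, 16) // keep the shift far from overflow
    val delayMs = math.min(maxDelayMs, baseDelayMs << exponent)
    states.updated(tp, State(failures, nowMs + delayMs))
  }

  /** Record a success: the partition stops backing off. */
  def onSuccess(states: Map[TopicPartition, State], tp: TopicPartition): Map[TopicPartition, State] =
    states - tp

  /** A partition may be scheduled if it is not waiting out a delay. */
  def eligible(states: Map[TopicPartition, State], tp: TopicPartition, nowMs: Long): Boolean =
    states.get(tp).forall(_.retryAtMs <= nowMs)
}
```

With this shape, offsets for a partition would only be written when its task succeeds, and a failing partition is left out of subsequent batches until `eligible` returns true again.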

Any pointers appreciated!

Thanks!
-neelesh


Re: Unable to see my kafka spark streaming output

2015-09-19 Thread kali.tumm...@gmail.com
Hi All,

I figured it out: I forgot to specify the master as local[2]. At least two
threads are required (one for the receiver and one for processing).

package com.examples

/**
 * Created by kalit_000 on 19/09/2015.
 */

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.streaming.{Seconds,StreamingContext}
import org.apache.spark._
import  org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils


object SparkStreamingKafka {

  def main(args: Array[String]): Unit =
  {
Logger.getLogger("org").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)

val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaStreaming").set("spark.executor.memory", "1g")
val sc=new SparkContext(conf)
val ssc= new StreamingContext(sc,Seconds(2))

val zkQuorm="localhost:2181"
val group="test-group"
val topics="first"
val numThreads=1

val topicMap=topics.split(",").map((_,numThreads.toInt)).toMap

val lineMap=KafkaUtils.createStream(ssc,zkQuorm,group,topicMap)

val lines=lineMap.map(_._2)

lines.print

//lines.print()

//val words=lines.flatMap(_.split(" "))

   // val pair=words.map( x => (x,1))

//val wordcount=pair.reduceByKeyAndWindow(_+_,_-_,Minutes(1),Seconds(2),2)

//wordcount.print

//ssc.checkpoint("hdfs://localhost:9000/user/hduser/checkpoint")

ssc.checkpoint("C:\\scalatutorials\\sparkstreaming_checkpoint_folder")

//C:\scalatutorials\sparkstreaming_checkpoint_folder

ssc.start()

ssc.awaitTermination()





  }

}
Thanks
Sri 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-see-my-kafka-spark-streaming-output-tp24750p24751.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Unable to see my kafka spark streaming output

2015-09-19 Thread kali.tumm...@gmail.com
Hi All,
I am unable to see any output printed to the console. Can anyone help?

package com.examples

/**
 * Created by kalit_000 on 19/09/2015.
 */

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.streaming.{Seconds,StreamingContext}
import org.apache.spark._
import  org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils


object SparkStreamingKafka {

  def main(args: Array[String]): Unit =
  {
Logger.getLogger("org").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)

val conf = new SparkConf().setMaster("local").setAppName("KafkaStreaming").set("spark.executor.memory", "1g")
val sc=new SparkContext(conf)
val ssc= new StreamingContext(sc,Seconds(2))

val zkQuorm="localhost:2181"
val group="test-group"
val topics="first"
val numThreads=1

val topicMap=topics.split(",").map((_,numThreads.toInt)).toMap

val lineMap=KafkaUtils.createStream(ssc,zkQuorm,group,topicMap)

val lines=lineMap.map(_._2)

//lines.print()

val words=lines.flatMap(_.split(" "))

val pair=words.map( x => (x,1))

val wordcount=pair.reduceByKeyAndWindow(_+_,_-_,Minutes(1),Seconds(2),2)

wordcount.print

//ssc.checkpoint("hdfs://localhost:9000/user/hduser/checkpoint")

ssc.checkpoint("C:\\scalatutorials\\sparkstreaming_checkpoint_folder")

//C:\scalatutorials\sparkstreaming_checkpoint_folder

ssc.start()

ssc.awaitTermination()





  }

}


Thanks
Sri 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-see-my-kafka-spark-streaming-output-tp24750.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: kafka spark streaming with mesos

2015-06-24 Thread Akhil Das
A screenshot of your framework running would also be helpful. How many
cores does it have?
Did you try running it in coarse grained mode?

Try to add these to the conf:

sparkConf.set("spark.mesos.coarse", "true")
sparkConf.set("spark.cores.max", "2")


Thanks
Best Regards

On Wed, Jun 24, 2015 at 1:35 AM, Bartek Radziszewski 
wrote:

> Hey,
>
> I’m trying to run kafka spark streaming using mesos with following example:
>
> sc.stop
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext._
> import kafka.serializer.StringDecoder
> import org.apache.spark.streaming._
> import org.apache.spark.streaming.kafka._
> import org.apache.spark.storage.StorageLevel
> val sparkConf = new
> SparkConf().setAppName("Summarizer").setMaster("zk://127.0.0.1:2181/mesos")
> val ssc = new StreamingContext(sparkConf, Seconds(10))
> val kafkaParams = Map[String, String]("zookeeper.connect" ->
> "127.0.0.1:2181", "group.id" -> "test")
> val messages = KafkaUtils.createStream[String, String, StringDecoder,
> StringDecoder](ssc, kafkaParams, Map("test" -> 1),
> StorageLevel.MEMORY_ONLY_SER).map(_._2)
>
> messages.foreachRDD { pairRDD =>
> println(s"DataListener.listen() [pairRDD = ${pairRDD}]")
> println(s"DataListener.listen() [pairRDD.count = ${pairRDD.count()}]")
> pairRDD.foreach(row => println(s"DataListener.listen() [row = ${row}]"))
> val msgData = pairRDD.collect()
> }
>
> Unfortunately println(s"DataListener.listen() [pairRDD.count =
> ${pairRDD.count()}]") returning always 0
>
> I tested same example but using "local[2]" instead of
> "zk://127.0.0.1:2181/mesos" and all working perfect (count return correct
> produced message count, and pairRDD.foreach(row =>
> println(s"DataListener.listen() [row = ${row}]")) returning kafka msg.
>
> Could you help me to understand that issue? what i’m going wrong?
>
> attaching:
> spark shell output http://pastebin.com/zdYFBj4T
> executor output http://pastebin.com/LDMtCjq0
>
> thanks!
> bartek
>


kafka spark streaming with mesos

2015-06-23 Thread Bartek Radziszewski
Hey,

I’m trying to run kafka spark streaming using mesos with following example:

sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.storage.StorageLevel
val sparkConf = new 
SparkConf().setAppName("Summarizer").setMaster("zk://127.0.0.1:2181/mesos")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val kafkaParams = Map[String, String]("zookeeper.connect" -> "127.0.0.1:2181", 
"group.id" -> "test")
val messages = KafkaUtils.createStream[String, String, StringDecoder, 
StringDecoder](ssc, kafkaParams, Map("test" -> 1), 
StorageLevel.MEMORY_ONLY_SER).map(_._2)

messages.foreachRDD { pairRDD =>
println(s"DataListener.listen() [pairRDD = ${pairRDD}]")
println(s"DataListener.listen() [pairRDD.count = ${pairRDD.count()}]")
pairRDD.foreach(row => println(s"DataListener.listen() [row = ${row}]"))
val msgData = pairRDD.collect()
}

Unfortunately, println(s"DataListener.listen() [pairRDD.count =
${pairRDD.count()}]") always returns 0.

I tested the same example using "local[2]" instead of
"zk://127.0.0.1:2181/mesos" and everything works perfectly (count returns the
correct number of produced messages, and pairRDD.foreach(row =>
println(s"DataListener.listen() [row = ${row}]")) prints the Kafka messages).

Could you help me understand this issue? What am I doing wrong?

Attaching:
spark shell output http://pastebin.com/zdYFBj4T
executor output http://pastebin.com/LDMtCjq0

thanks!
bartek

Re: kafka spark streaming working example

2015-06-18 Thread Akhil Das
You have .setMaster("local"); set it to local[2] or local[*].
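
Some background on why this matters: with the receiver-based KafkaUtils.createStream, the receiver itself occupies one thread, so a plain "local" master (a single thread) leaves nothing free to process the received batches. A minimal sketch of the changed configuration line, reusing the app name from the quoted code:

```scala
import org.apache.spark.SparkConf

// local[2]: one thread for the Kafka receiver, one for processing batches.
// local[*] would instead use as many threads as there are cores.
val sparkConf = new SparkConf().setAppName("Summarizer").setMaster("local[2]")
```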

Thanks
Best Regards

On Thu, Jun 18, 2015 at 5:59 PM, Bartek Radziszewski 
wrote:

> hi,
> I'm trying to run simple kafka spark streaming example over spark-shell:
>
> sc.stop
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext._
> import kafka.serializer.DefaultDecoder
> import org.apache.spark.streaming._
> import org.apache.spark.streaming.kafka._
> import org.apache.spark.storage.StorageLevel
> val sparkConf = new SparkConf().setAppName("Summarizer").setMaster("local")
> val ssc = new StreamingContext(sparkConf, Seconds(10))
> val kafkaParams = Map[String, String]("zookeeper.connect" -> "
> 127.0.0.1:2181", "group.id" -> "test")
> val messages = KafkaUtils.createStream[Array[Byte], Array[Byte],
> DefaultDecoder, DefaultDecoder](ssc, kafkaParams, Map("test" -> 1),
> StorageLevel.MEMORY_ONLY_SER).map(_._2)
> messages.foreachRDD { pairRDD =>
> println(s"DataListener.listen() [pairRDD = ${pairRDD}]")
> println(s"DataListener.listen() [pairRDD.count = ${pairRDD.count()}]")
> pairRDD.foreach(row => println(s"DataListener.listen() [row = ${row}]"))
> }
> ssc.start()
> ssc.awaitTermination()
>
>
> in spark output i'm able to find only following println log:
> println(s"DataListener.listen() [pairRDD = ${pairRDD}]")
>
> but unfortunately can't find output of:
> println(s"DataListener.listen() [pairRDD.count = ${pairRDD.count()}]") and
> println(s"DataListener.listen() [row = ${row}]")
>
> it's my spark-shell full output - http://pastebin.com/sfxbYYga
>
> any ideas what i'm doing wrong? thanks!
>


kafka spark streaming working example

2015-06-18 Thread Bartek Radziszewski
hi, 
I'm trying to run simple kafka spark streaming example over spark-shell:

sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.storage.StorageLevel
val sparkConf = new SparkConf().setAppName("Summarizer").setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val kafkaParams = Map[String, String]("zookeeper.connect" -> "127.0.0.1:2181", 
"group.id" -> "test")
val messages = KafkaUtils.createStream[Array[Byte], Array[Byte], 
DefaultDecoder, DefaultDecoder](ssc, kafkaParams, Map("test" -> 1), 
StorageLevel.MEMORY_ONLY_SER).map(_._2)
messages.foreachRDD { pairRDD =>
println(s"DataListener.listen() [pairRDD = ${pairRDD}]")
println(s"DataListener.listen() [pairRDD.count = ${pairRDD.count()}]")
pairRDD.foreach(row => println(s"DataListener.listen() [row = ${row}]"))
}
ssc.start()
ssc.awaitTermination()


in spark output i'm able to find only following println log:
println(s"DataListener.listen() [pairRDD = ${pairRDD}]")

but unfortunately can't find output of:
println(s"DataListener.listen() [pairRDD.count = ${pairRDD.count()}]") and 
println(s"DataListener.listen() [row = ${row}]")

it's my spark-shell full output - http://pastebin.com/sfxbYYga

any ideas what i'm doing wrong? thanks!

Re: Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-09 Thread Dibyendu Bhattacharya
Hi,

Can you please share a more detailed stack trace from your receiver logs, and
also the consumer settings you used? I have never tested the consumer with
Kafka 0.7.3, so I am not sure whether the Kafka version is the issue. Have you
tried building the consumer against Kafka 0.7.3?

Regards,
Dibyendu

On Wed, Jun 10, 2015 at 11:52 AM, karma243  wrote:

> Thank you for responding @nsalian.
>
> 1. I am trying to replicate  this
> <https://github.com/dibbhatt/kafka-spark-consumer>   project on my local
> system.
>
> 2. Yes, kafka and brokers on the same host.
>
> 3. I am working with kafka 0.7.3 and spark 1.3.1. Kafka 0.7.3 does not has
> "--describe" command. Though I've worked on three cases (Kafka and
> Zookeeper
> were on my machine all the time):
>   (i) Producer-Consumer on my machine.
>   (ii) Producer on my machine and Consumer on different machine.
>   (iii) Consumer on my machine and producer on different machine.
>
> All the three cases were working properly.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-ERROR-EndpointWriter-dropping-message-tp23228p23240.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-09 Thread karma243
Thank you for responding @nsalian. 

1. I am trying to replicate this project on my local system:
<https://github.com/dibbhatt/kafka-spark-consumer>

2. Yes, Kafka and the brokers are on the same host.

3. I am working with kafka 0.7.3 and spark 1.3.1. Kafka 0.7.3 does not have a
"--describe" command. However, I have tested three cases (Kafka and Zookeeper
were on my machine the whole time):
  (i) Producer-Consumer on my machine.
  (ii) Producer on my machine and Consumer on different machine.
  (iii) Consumer on my machine and producer on different machine.

All the three cases were working properly.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-ERROR-EndpointWriter-dropping-message-tp23228p23240.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-09 Thread nsalian
1) Could you share your command?

2) Are the kafka brokers on the same host?

3) Could you run a --describe on the topic to see if the topic is setup
correctly (just to be sure)?






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-ERROR-EndpointWriter-dropping-message-tp23228p23235.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-09 Thread karma243
Hello,

While trying to link kafka to spark, I'm not able to get data from kafka.
This is the error that I'm getting from spark logs:

ERROR EndpointWriter: dropping message [class
akka.actor.ActorSelectionMessage] for non-local recipient
[Actor[akka.tcp://sparkMaster@localhost:7077/]] arriving at
[akka.tcp://sparkMaster@localhost:7077] inbound addresses are
[akka.tcp://sparkMaster@karma-HP-Pavilion-g6-Notebook-PC:7077]




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-ERROR-EndpointWriter-dropping-message-tp23228.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: kafka + Spark Streaming with checkPointing fails to start with

2015-05-15 Thread Alexander Krasheninnikov

I had the same problem.
The solution I found was to use:
JavaStreamingContext streamingContext = 
JavaStreamingContext.getOrCreate("checkpoint_dir", contextFactory);


ALL configuration should be performed inside contextFactory. If you try 
to configure the streaming context after getOrCreate, you receive the error 
"has not been initialized".


On 13.05.2015 00:51, Ankur Chauhan wrote:

Hi,

I have a simple application which fails with the following exception only when 
the application is restarted (i.e. the checkpointDir has entries from a 
previous execution):

Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.dstream.ShuffledDStream@2264e43c has not been initialized
at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:227)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:222)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:222)
at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:90)
at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
at com.brightcove.analytics.tacoma.RawLogProcessor$.start(RawLogProcessor.scala:115)
at com.brightcove.analytics.tacoma.Main$delayedInit$body.apply(Main.scala:15)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at com.brightcove.analytics.tacoma.Main$.main(Main.scala:5)
at com.brightcove.analytics.tacoma.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The relevant source is:

class RawLogProcessor(ssc: StreamingContext, topic: String, kafkaParams: 
Map[String, String]) {
   // create kafka stream
   val rawlogDStream = KafkaUtils.createDirectStream[String, Object, 
StringDecoder, KafkaAvroDecoder](ssc, kafkaParams, Set(topic))
   //KafkaUtils.createStream[String, Object, StringDecoder, KafkaAvroDecoder](ssc, 
kafkaParams, Map("qa-rawlogs" -> 10), StorageLevel.MEMORY_AND_DISK_2)

   val eventStream = rawlogDStream
 .map({
   case (key, rawlogVal) =>
 val record = rawlogVal.asInstanceOf[GenericData.Record]
 val rlog = RawLog.newBuilder()
   .setId(record.get("id").

Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-14 Thread Ankur Chauhan

Thanks everyone, that was the problem. The "create new streaming
context" function was supposed to set up the stream processing as well
as the checkpoint directory. I had missed the whole process of
checkpoint setup. With that done, everything works as expected.

For the benefit of others, my final version of the code, which works
correctly, looks like this:


object RawLogProcessor extends Logging {

  import TacomaHelper._

  val checkpointDir = "/tmp/checkpointDir_tacoma"
  var ssc: Option[StreamingContext] = None

  def createSparkConf(config: Config): SparkConf = {
    val sparkConf = new SparkConf()
    config.entrySet.asScala
      .map(kv => kv.getKey -> kv.getValue)
      .foreach { case (k, v) => sparkConf.set(s"spark.$k", unquote(v.render())) }

    sparkConf.registerKryoClasses(Array(classOf[VideoView], classOf[RawLog],
      classOf[VideoEngagement], classOf[VideoImpression]))
    sparkConf
  }

  // a function that returns a function of type: `() => StreamingContext`
  def createContext(sparkConfig: Config, kafkaConf: Config)(f: StreamingContext => StreamingContext) = () => {
    val batchDurationSecs = sparkConfig.getDuration("streaming.batch_duration", TimeUnit.SECONDS)
    val sparkConf = createSparkConf(sparkConfig)

    // calculate sparkContext and streamingContext
    val streamingContext = new StreamingContext(sparkConf, Durations.seconds(batchDurationSecs))
    streamingContext.checkpoint(checkpointDir)

    // apply the streaming context function to the function
    f(streamingContext)
  }

  def createNewContext(sparkConf: Config, kafkaConf: Config, f: StreamingContext => StreamingContext) = {
    logInfo("Create new Spark streamingContext with provided pipeline function")
    StreamingContext.getOrCreate(
      checkpointPath = checkpointDir,
      creatingFunc = createContext(sparkConf, kafkaConf)(f),
      createOnError = true)
  }

  def apply(sparkConfig: Config, kafkaConf: Config): StreamingContext = {
    rawlogTopic = kafkaConf.getString("rawlog.topic")
    kafkaParams = kafkaConf.entrySet.asScala
      .map(kv => kv.getKey -> unquote(kv.getValue.render()))
      .toMap

    if (ssc.isEmpty) {
      ssc = Some(createNewContext(sparkConfig, kafkaConf, setupPipeline))
    }
    ssc.get
  }

  var rawlogTopic: String = "qa-rawlog"
  var kafkaParams: Map[String, String] = Map()

  def setupPipeline(streamingContext: StreamingContext): StreamingContext = {

    logInfo("Creating new kafka rawlog stream")
    // TODO: extract this and pass it around somehow
    val rawlogDStream = KafkaUtils.createDirectStream[String, Object, StringDecoder, KafkaAvroDecoder](
      streamingContext, kafkaParams, Set(rawlogTopic))

    logInfo("adding step to parse kafka stream into RawLog types (Normalizer)")
    val eventStream = rawlogDStream
      .map({
        case (key, rawlogVal) =>
          val record = rawlogVal.asInstanceOf[GenericData.Record]
          val rlog = RawLog.newBuilder()
            .setId(record.get("id").asInstanceOf[String])
            .setAccount(record.get("account").asInstanceOf[String])
            .setEvent(record.get("event").asInstanceOf[String])
            .setTimestamp(record.get("timestamp").asInstanceOf[Long])
            .setUserAgent(record.get("user_agent").asInstanceOf[String])
            .setParams(record.get("params").asInstanceOf[java.util.Map[String, String]])
            .build()
          val norm = Normalizer(rlog)
          (key, rlog.getEvent, norm)
      })

    logInfo("Adding step to filter out VideoView only events and cache them")
    val videoViewStream = eventStream
      .filter(_._2 == "video_view")
      .filter(_._3.isDefined)
      .map((z) => (z._1, z._3.get))
      .map((z) => (z._1, z._2.asInstanceOf[VideoView]))
      .cache()

    // repartition by account
    logInfo("repartition videoView by account and calculate stats")
    videoViewStream.map((v) => (v._2.getAccount, 1))
      .filter(_._1 != null)
      .window(Durations.seconds(20))
      .reduceByKey(_ + _)
      .print()

    // repartition by (deviceType, DeviceOS)
    logInfo("repartition videoView by (DeviceType, DeviceOS) and calculate stats")
    videoViewStream.map((v) => ((v._2.getDeviceType, v._2.getDeviceOs), 1))
      .reduceByKeyAndWindow(_ + _, Durations.seconds(10))
      .print()

    streamingContext
  }

}

-- Ankur

On 13/05/2015 23:52, NB wrote:
> The data pipeline (DAG) should not be added to the StreamingContext
> in the case of a recovery scenario. The pipeline metadata is
> recovered from the checkpoint folder. That is one thing you will
> need to fix in your code. Also, I don't think the
> ssc.checkpoint(folder) call should be made in case of the
> recovery.
> 
> The idiom to follow is to set up the DAG in the creatingFunc and
> not outside of it. This will ensure that if a new context is being
> created i.e. checkpoint folder does not exist, the DAG will get
> added to it and then checkpointed. Once a recovery happens, this
> function

Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread NB
The data pipeline (DAG) should not be added to the StreamingContext in the
case of a recovery scenario. The pipeline metadata is recovered from the
checkpoint folder. That is one thing you will need to fix in your code.
Also, I don't think the ssc.checkpoint(folder) call should be made in case
of the recovery.

The idiom to follow is to set up the DAG in the creatingFunc and not outside
of it. This will ensure that if a new context is being created i.e.
checkpoint folder does not exist, the DAG will get added to it and then
checkpointed. Once a recovery happens, this function is not invoked but
everything is recreated from the checkpointed data.
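
A minimal sketch of the idiom described above, assuming Spark Streaming 1.x on the classpath. The object name, checkpoint path, application name, batch interval and the socket source on localhost:9999 are placeholders invented for this example (a Kafka stream would normally be created at that point). The Spark calls used are StreamingContext.getOrCreate, checkpoint, socketTextStream, count and print.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Hypothetical checkpoint location, for illustration only.
  val checkpointDir = "/tmp/example_checkpoint"

  // Everything that defines the job lives in this function:
  // the conf, the context, the checkpoint directory and the DAG.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedApp").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    // Placeholder pipeline: count the lines received in each batch.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // First run: createContext() is invoked and the DAG is checkpointed.
    // Restart: the context and DAG are rebuilt from the checkpoint data
    // and createContext() is not called.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)

    // No DStream operations and no ssc.checkpoint(...) call at this point.
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Keeping main free of any stream setup means the same code path works for both a fresh start and a recovery.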

Hope this helps,
NB



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/kafka-Spark-Streaming-with-checkPointing-fails-to-restart-tp22864p22878.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread Cody Koeninger
cala)
>
> The relavant source is:
>
> class RawLogProcessor(ssc: StreamingContext, topic: String,
> kafkaParams: Map[String, String]) {
>  // create kafka stream
>  val rawlogDStream = KafkaUtils.createDirectStream[String, Object,
> StringDecoder, KafkaAvroDecoder](ssc, kafkaParams, Set(topic))
>  //KafkaUtils.createStream[String, Object, StringDecoder,
> KafkaAvroDecoder](ssc, kafkaParams, Map("qa-rawlogs" -> 10),
> StorageLevel.MEMORY_AND_DISK_2)
>
>  val eventStream = rawlogDStream
>.map({
>  case (key, rawlogVal) =>
>val record = rawlogVal.asInstanceOf[GenericData.Record]
>val rlog = RawLog.newBuilder()
>  .setId(record.get("id").asInstanceOf[String])
>  .setAccount(record.get("account").asInstanceOf[String])
>  .setEvent(record.get("event").asInstanceOf[String])
>  .setTimestamp(record.get("timestamp").asInstanceOf[Long])
>  .setUserAgent(record.get("user_agent").asInstanceOf[String])
>
> .setParams(record.get("params").asInstanceOf[java.util.Map[String,
> String]])
>  .build()
>val norm = Normalizer(rlog)
>(key, rlog.getEvent, norm)
>})
>
>  val videoViewStream = eventStream
>.filter(_._2 == "video_view")
>.filter(_._3.isDefined)
>.map((z) => (z._1, z._3.get))
>.map((z) => (z._1, z._2.asInstanceOf[VideoView]))
>.cache()
>
>  // repartition by (deviceType, DeviceOS)
>  val deviceTypeVideoViews = videoViewStream.map((v) =>
> ((v._2.getDeviceType, v._2.getDeviceOs), 1))
>.reduceByKeyAndWindow(_ + _, Durations.seconds(10))
>.print()
> }
>
> object RawLogProcessor extends Logging {
>
>  /**
>   * If str is surrounded by quotes it return the content between the
> quotes
>   */
>  def unquote(str: String) = {
>if (str != null && str.length >= 2 && str.charAt(0) == '\"' &&
> str.charAt(str.length - 1) == '\"')
>  str.substring(1, str.length - 1)
>else
>  str
>  }
>
>  val checkpointDir = "/tmp/checkpointDir_tacoma"
>  var sparkConfig: Config = _
>  var ssc: StreamingContext = _
>  var processor: Option[RawLogProcessor] = None
>
>  val createContext: () => StreamingContext = () => {
>val batchDurationSecs =
> sparkConfig.getDuration("streaming.batch_duration", TimeUnit.SECONDS)
>val sparkConf = new SparkConf()
>sparkConf.registerKryoClasses(Array(classOf[VideoView],
> classOf[RawLog], classOf[VideoEngagement], classOf[VideoImpression]))
>sparkConfig.entrySet.asScala
>  .map(kv => kv.getKey -> kv.getValue)
>  .foreach {
>case (k, v) =>
>  val value = unquote(v.render())
>
>  logInfo(s"spark.$k = $value")
>
>  sparkConf.set(s"spark.$k", value)
>  }
>
>// calculate sparkContext and streamingContext
>new StreamingContext(sparkConf, Durations.seconds(batchDurationSecs))
>  }
>
>  def createProcessor(sparkConf: Config, kafkaConf: Config):
> RawLogProcessor = {
>sparkConfig = sparkConf
>ssc = StreamingContext.getOrCreate(checkpointPath = checkpointDir,
> creatingFunc = createContext, createOnError = true)
>ssc.checkpoint(checkpointDir)
>// kafkaProperties
>val kafkaParams = kafkaConf.entrySet.asScala
>  .map(kv => kv.getKey -> unquote(kv.getValue.render()))
>  .toMap
>
>logInfo(s"Initializing kafkaParams = $kafkaParams")
>// create processor
>new RawLogProcessor(ssc, kafkaConf.getString("rawlog.topic"),
> kafkaParams)
>  }
>
>  def apply(sparkConfig: Config, kafkaConf: Config) = {
>if (processor.isEmpty) {
>  processor = Some(createProcessor(sparkConfig, kafkaConf))
>}
>processor.get
>  }
>
>  def start() = {
>ssc.start()
>ssc.awaitTermination()
>  }
>
> }
>
> Extended logs:
> https://gist.githubusercontent.com/ankurcha/f35df63f0d8a99da0be4/raw/ec9
> 6b932540ac87577e4ce8385d26699c1a7d05e/spark-console.log
>
> Could someone tell me what it causes this problem? I tried looking at
> the stacktrace but I am not very familiar with the codebase to make
> solid assertions.
> Any ideas as to what may be happening here.
>
> --- Ankur Chauhan
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/kafka-Spark-Streaming-with-checkPointing-fails-to-restart-tp22864.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread ankurcha
tanceOf[java.util.Map[String,
String]])
 .build()
   val norm = Normalizer(rlog)
   (key, rlog.getEvent, norm)
   })

 val videoViewStream = eventStream
   .filter(_._2 == "video_view")
   .filter(_._3.isDefined)
   .map((z) => (z._1, z._3.get))
   .map((z) => (z._1, z._2.asInstanceOf[VideoView]))
   .cache()

 // repartition by (deviceType, DeviceOS)
 val deviceTypeVideoViews = videoViewStream.map((v) =>
((v._2.getDeviceType, v._2.getDeviceOs), 1))
   .reduceByKeyAndWindow(_ + _, Durations.seconds(10))
   .print()
}

object RawLogProcessor extends Logging {

 /**
  * If str is surrounded by quotes it return the content between the
quotes
  */
 def unquote(str: String) = {
   if (str != null && str.length >= 2 && str.charAt(0) == '\"' &&
str.charAt(str.length - 1) == '\"')
 str.substring(1, str.length - 1)
   else
 str
 }

 val checkpointDir = "/tmp/checkpointDir_tacoma"
 var sparkConfig: Config = _
 var ssc: StreamingContext = _
 var processor: Option[RawLogProcessor] = None

 val createContext: () => StreamingContext = () => {
   val batchDurationSecs =
sparkConfig.getDuration("streaming.batch_duration", TimeUnit.SECONDS)
   val sparkConf = new SparkConf()
   sparkConf.registerKryoClasses(Array(classOf[VideoView],
classOf[RawLog], classOf[VideoEngagement], classOf[VideoImpression]))
   sparkConfig.entrySet.asScala
 .map(kv => kv.getKey -> kv.getValue)
 .foreach {
   case (k, v) =>
 val value = unquote(v.render())

 logInfo(s"spark.$k = $value")

 sparkConf.set(s"spark.$k", value)
 }

   // calculate sparkContext and streamingContext
   new StreamingContext(sparkConf, Durations.seconds(batchDurationSecs))
 }

 def createProcessor(sparkConf: Config, kafkaConf: Config):
RawLogProcessor = {
   sparkConfig = sparkConf
   ssc = StreamingContext.getOrCreate(checkpointPath = checkpointDir,
creatingFunc = createContext, createOnError = true)
   ssc.checkpoint(checkpointDir)
   // kafkaProperties
   val kafkaParams = kafkaConf.entrySet.asScala
 .map(kv => kv.getKey -> unquote(kv.getValue.render()))
 .toMap

   logInfo(s"Initializing kafkaParams = $kafkaParams")
   // create processor
   new RawLogProcessor(ssc, kafkaConf.getString("rawlog.topic"),
kafkaParams)
 }

 def apply(sparkConfig: Config, kafkaConf: Config) = {
   if (processor.isEmpty) {
 processor = Some(createProcessor(sparkConfig, kafkaConf))
   }
   processor.get
 }

 def start() = {
   ssc.start()
   ssc.awaitTermination()
 }

}

Extended logs:
https://gist.githubusercontent.com/ankurcha/f35df63f0d8a99da0be4/raw/ec96b932540ac87577e4ce8385d26699c1a7d05e/spark-console.log

Could someone tell me what causes this problem? I tried looking at
the stacktrace but I am not familiar enough with the codebase to make
solid assertions.
Any ideas as to what may be happening here?

--- Ankur Chauhan




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/kafka-Spark-Streaming-with-checkPointing-fails-to-restart-tp22864.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread Ankur Chauhan

Hi,

I have a simple application which fails with the following exception
only when the application is restarted (i.e. the checkpointDir has
entries from a previous execution):

Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.dstream.ShuffledDStream@2264e43c has not been initialized
at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:227)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:222)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:222)
at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:90)
at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
at com.brightcove.analytics.tacoma.RawLogProcessor$.start(RawLogProcessor.scala:115)
at com.brightcove.analytics.tacoma.Main$delayedInit$body.apply(Main.scala:15)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at com.brightcove.analytics.tacoma.Main$.main(Main.scala:5)
at com.brightcove.analytics.tacoma.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The relevant source is:

class RawLogProcessor(ssc: StreamingContext, topic: String,
    kafkaParams: Map[String, String]) {
  // create kafka stream
  val rawlogDStream = KafkaUtils.createDirectStream[String, Object,
    StringDecoder, KafkaAvroDecoder](ssc, kafkaParams, Set(topic))
  // KafkaUtils.createStream[String, Object, StringDecoder,
  //   KafkaAvroDecoder](ssc, kafkaParams, Map("qa-rawlogs" -> 10),
  //   StorageLevel.MEMORY_AND_DISK_2)

  val eventStream = rawlogDStream
    .map({
      case (key, rawlogVal) =>
        val record = rawlogVal.asInstanceOf[GenericData.Record]
        val rlog = RawLog.newBuilder()
          .setId(record.get("id").asInstanceOf[String])
          .setAccount(record.get("account").asInstanceOf[String])
          .setEvent(record.get("event").asInstanceOf[String])
          .setTimestamp(record.get("timestamp").asInstanceOf[Long])
          .setUserAgent(record.get("user_agent").asInstanceOf[String])
          .setParams(record.get("params").asInstanceOf[java.util.Map[String, String]])

kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread Ankur Chauhan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

I have a simple application which fails with the following exception
only when the application is restarted (i.e. the checkpointDir has
entires from a previous execution):

Exception in thread "main" org.apache.spark.SparkException:
org.apache.spark.streaming.dstream.ShuffledDStream@2264e43c has not
been initialized
at
org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266
)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply
(DStream.scala:287)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply
(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:28
4)
at
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDSt
ream.scala:38)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.sc
ala:116)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.sc
ala:116)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLik
e.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLik
e.scala:251)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.sca
la:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251
)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:227)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:222)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:222)
at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:90)
at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
at com.brightcove.analytics.tacoma.RawLogProcessor$.start(RawLogProcessor.scala:115)
at com.brightcove.analytics.tacoma.Main$delayedInit$body.apply(Main.scala:15)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at com.brightcove.analytics.tacoma.Main$.main(Main.scala:5)
at com.brightcove.analytics.tacoma.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The relevant source is:

class RawLogProcessor(ssc: StreamingContext, topic: String, kafkaParams: Map[String, String]) {
  // create kafka stream
  val rawlogDStream = KafkaUtils.createDirectStream[String, Object, StringDecoder, KafkaAvroDecoder](
    ssc, kafkaParams, Set(topic))
  //KafkaUtils.createStream[String, Object, StringDecoder, KafkaAvroDecoder](
  //  ssc, kafkaParams, Map("qa-rawlogs" -> 10), StorageLevel.MEMORY_AND_DISK_2)

  val eventStream = rawlogDStream
    .map({
      case (key, rawlogVal) =>
        val record = rawlogVal.asInstanceOf[GenericData.Record]
        val rlog = RawLog.newBuilder()
          .setId(record.get("id").asInstanceOf[String])
          .setAccount(record.get("account").asInstanceOf[String])
          .setEvent(record.get("event").asInstanceOf[String])
          .setTimestamp(record.get("timestamp").asInstanceOf[Long])
          .setUserAgent(record.get("user_agent").asInstanceOf[String])
          .setParams(record.get("params").asInstanceOf[java.util.Map[String, String]])
 

kafka + Spark Streaming with checkPointing fails to start with

2015-05-12 Thread Ankur Chauhan
Hi,

I have a simple application which fails with the following exception only when 
the application is restarted (i.e. the checkpointDir has entries from a 
previous execution):

Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.dstream.ShuffledDStream@2264e43c has not been initialized
at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:227)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:222)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:222)
at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:90)
at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
at com.brightcove.analytics.tacoma.RawLogProcessor$.start(RawLogProcessor.scala:115)
at com.brightcove.analytics.tacoma.Main$delayedInit$body.apply(Main.scala:15)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at com.brightcove.analytics.tacoma.Main$.main(Main.scala:5)
at com.brightcove.analytics.tacoma.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The relevant source is:

class RawLogProcessor(ssc: StreamingContext, topic: String, kafkaParams: Map[String, String]) {
  // create kafka stream
  val rawlogDStream = KafkaUtils.createDirectStream[String, Object, StringDecoder, KafkaAvroDecoder](
    ssc, kafkaParams, Set(topic))
  //KafkaUtils.createStream[String, Object, StringDecoder, KafkaAvroDecoder](
  //  ssc, kafkaParams, Map("qa-rawlogs" -> 10), StorageLevel.MEMORY_AND_DISK_2)

  val eventStream = rawlogDStream
    .map({
      case (key, rawlogVal) =>
        val record = rawlogVal.asInstanceOf[GenericData.Record]
        val rlog = RawLog.newBuilder()
          .setId(record.get("id").asInstanceOf[String])
          .setAccount(record.get("account").asInstanceOf[String])
          .setEvent(record.get("event").asInstanceOf[String])
          .setTimestamp(record.get("timestamp").asInstanceOf[Long])
          .setUserAgent(record.get("user_agent").asInstanceOf[String])
          .setParams(record.get("params").asInstanceOf[java.util.Map[String, String]])
          .buil

Re: Kafka + Spark streaming

2014-12-31 Thread Samya Maiti
Thanks TD.

On Wed, Dec 31, 2014 at 7:19 AM, Tathagata Das 
wrote:

> 1. Of course, a single block / partition has many Kafka messages, and
> from different Kafka topics interleaved together. The message count is
> not related to the block count. Any message received within a
> particular block interval will go in the same block.
>
> 2. Yes, the receiver will be started on another worker.
>
> TD
>
>
> On Tue, Dec 30, 2014 at 2:19 PM, SamyaMaiti 
> wrote:
> > Hi Experts,
> >
> > A few general queries:
> >
> > 1. Can a single block/partition in an RDD have more than one Kafka message, or
> > will there be one and only one Kafka message per block? More broadly,
> > is the message count related to the block in any way, or is it just that any
> > message received within a particular block interval will go in the same
> > block?
> >
> > 2. If the worker that runs the Receiver for Kafka goes down, will the
> > receiver be restarted on some other worker?
> >
> > Regards,
> > Sam
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-tp20914.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Re: Kafka + Spark streaming

2014-12-30 Thread Tathagata Das
1. Of course, a single block / partition has many Kafka messages, and
from different Kafka topics interleaved together. The message count is
not related to the block count. Any message received within a
particular block interval will go in the same block.

2. Yes, the receiver will be started on another worker.
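
To illustrate point 1, here is a toy Java model of block formation. It is only a sketch, not Spark's implementation: BlockIntervalDemo, Message and toBlocks are hypothetical names, and the 200 ms interval is an assumed value (it matches the documented default of spark.streaming.blockInterval). The block a message lands in depends only on when it arrives, so messages from different topics share a block and the message count per block varies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of how a receiver cuts its input into blocks.
// Illustration only: this is not Spark's implementation.
final class BlockIntervalDemo {

    // One received message: arrival time in ms, source topic, payload.
    static final class Message {
        final long arrivalMs;
        final String topic;
        final String payload;

        Message(long arrivalMs, String topic, String payload) {
            this.arrivalMs = arrivalMs;
            this.topic = topic;
            this.payload = payload;
        }
    }

    // Assigns each message to a block by arrival time alone:
    // block index = arrivalMs / blockIntervalMs. Topic and count play no role.
    static Map<Long, List<Message>> toBlocks(List<Message> messages, long blockIntervalMs) {
        Map<Long, List<Message>> blocks = new TreeMap<>();
        for (Message m : messages) {
            long blockIndex = m.arrivalMs / blockIntervalMs;
            blocks.computeIfAbsent(blockIndex, k -> new ArrayList<>()).add(m);
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<Message> received = List.of(
            new Message(10, "topicA", "m1"),
            new Message(150, "topicB", "m2"),
            new Message(210, "topicA", "m3"),
            new Message(390, "topicB", "m4"),
            new Message(395, "topicA", "m5"));

        // 200 ms is an assumed block interval for this example.
        toBlocks(received, 200).forEach((index, msgs) ->
            System.out.println("block " + index + ": " + msgs.size() + " message(s)"));
    }
}
```

With these sample arrival times, the first block holds two messages (one per topic) and the second holds three.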

TD


On Tue, Dec 30, 2014 at 2:19 PM, SamyaMaiti  wrote:
> Hi Experts,
>
> A few general queries:
>
> 1. Can a single block/partition in an RDD have more than one Kafka message, or
> will there be one and only one Kafka message per block? More broadly,
> is the message count related to the block in any way, or is it just that any
> message received within a particular block interval will go in the same
> block?
>
> 2. If the worker that runs the Receiver for Kafka goes down, will the
> receiver be restarted on some other worker?
>
> Regards,
> Sam
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-tp20914.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>




Kafka + Spark streaming

2014-12-30 Thread SamyaMaiti
Hi Experts,

A few general queries:

1. Can a single block/partition in an RDD have more than one Kafka message, or
will there be one and only one Kafka message per block? More broadly,
is the message count related to the block in any way, or is it just that any
message received within a particular block interval will go in the same
block?

2. If the worker that runs the Receiver for Kafka goes down, will the
receiver be restarted on some other worker?

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-tp20914.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread Akhil Das
I see you have no worker machines to execute the job

[image: Inline image 1]

You haven't configured your spark cluster properly.

A quick fix to get it running would be to run it in local mode. For that,
change this line

JavaStreamingContext jssc = new JavaStreamingContext(
    "spark://192.168.88.130:7077", "SparkStream", new Duration(3000));

to this

JavaStreamingContext jssc = new JavaStreamingContext("local[4]",
    "SparkStream", new Duration(3000));
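
One caveat with local mode when a receiver-based stream such as KafkaUtils.createStream is used: each receiver occupies one thread, so the local master needs more threads than there are receivers, otherwise nothing is left to process the received data. A minimal sketch of that check follows. LocalMasterCheck and its methods are hypothetical helper names for illustration, not part of Spark, and only the "local", "local[N]" and "local[*]" forms are handled.

```java
// Hypothetical helper, not part of Spark: checks whether a local master URL
// leaves threads free for processing once each receiver has taken one.
final class LocalMasterCheck {

    // Returns the thread count implied by "local", "local[N]" or "local[*]".
    // Throws IllegalArgumentException for any other master string.
    static int localThreads(String master) {
        if (master.equals("local")) {
            return 1; // plain "local" runs with a single thread
        }
        if (master.startsWith("local[") && master.endsWith("]")) {
            String n = master.substring("local[".length(), master.length() - 1);
            if (n.equals("*")) {
                return Runtime.getRuntime().availableProcessors();
            }
            try {
                return Integer.parseInt(n);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Unsupported local master: " + master, e);
            }
        }
        throw new IllegalArgumentException("Not a local master: " + master);
    }

    // True when at least one thread remains after every receiver takes one.
    static boolean leavesThreadsForProcessing(String master, int receivers) {
        return localThreads(master) > receivers;
    }

    public static void main(String[] args) {
        // One Kafka receiver, as in the single createStream call discussed here.
        System.out.println("local[4]: " + leavesThreadsForProcessing("local[4]", 1));
        System.out.println("local:    " + leavesThreadsForProcessing("local", 1));
    }
}
```

Under this check, "local[4]" with a single receiver is fine, while plain "local" would leave no thread for processing.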


Thanks
Best Regards

On Mon, Dec 1, 2014 at 4:18 PM,  wrote:

>  Hi,
>
>
>
> The spark master is working, and I have given the same url in the code:
>
>
>
> The warning is gone, and the new log is:
>
> ---
>
> Time: 141742785 ms
>
> ---
>
>
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-2] scheduler.JobScheduler
> (Logging.scala:logInfo(59)) - Starting job streaming job 141742785 ms.0
> from job set of time 141742785 ms
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-2] scheduler.JobScheduler
> (Logging.scala:logInfo(59)) - Finished job streaming job 141742785 ms.0
> from job set of time 141742785 ms
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-2] scheduler.JobScheduler
> (Logging.scala:logInfo(59)) - Total delay: 0.028 s for time 141742785
> ms (execution: 0.001 s)
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-5] scheduler.JobScheduler
> (Logging.scala:logInfo(59)) - Added jobs for time 141742785 ms
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-5] rdd.MappedRDD
> (Logging.scala:logInfo(59)) - Removing RDD 25 from persistence list
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManager
> (Logging.scala:logInfo(59)) - Removing RDD 25
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-5] rdd.BlockRDD
> (Logging.scala:logInfo(59)) - Removing RDD 24 from persistence list
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-6] storage.BlockManager
> (Logging.scala:logInfo(59)) - Removing RDD 24
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-5]
> kafka.KafkaInputDStream (Logging.scala:logInfo(59)) - Removing blocks of
> RDD BlockRDD[24] at BlockRDD at ReceiverInputDStream.scala:69 of time
> 141742785 ms
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-4]
> scheduler.ReceiverTracker (Logging.scala:logInfo(59)) *- Stream 0
> received 0 blocks*
>
> ---
>
> Time: 1417427853000 ms
>
> ---
>
>
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-5] scheduler.JobScheduler
> (Logging.scala:logInfo(59)) - Starting job streaming job 1417427853000 ms.0
> from job set of time 1417427853000 ms
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-5] scheduler.JobScheduler
> (Logging.scala:logInfo(59)) - Finished job streaming job 1417427853000 ms.0
> from job set of time 1417427853000 ms
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-5] scheduler.JobScheduler
> (Logging.scala:logInfo(59)) - Total delay: 0.015 s for time 1417427853000
> ms (execution: 0.001 s)
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-4] scheduler.JobScheduler
> (Logging.scala:logInfo(59)) - Added jobs for time 1417427853000 ms
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-4] rdd.MappedRDD
> (Logging.scala:logInfo(59)) - Removing RDD 27 from persistence list
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-5] storage.BlockManager
> (Logging.scala:logInfo(59)) - Removing RDD 27
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-4] rdd.BlockRDD
> (Logging.scala:logInfo(59)) - Removing RDD 26 from persistence list
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-6] storage.BlockManager
> (Logging.scala:logInfo(59)) - Removing RDD 26
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-4]
> kafka.KafkaInputDStream (Logging.scala:logInfo(59)) - Removing blocks of
> RDD BlockRDD[26] at BlockRDD at ReceiverInputDStream.scala:69 of time
> 1417427853000 ms
>
> INFO  [sparkDriver-akka.actor.default-dispatcher-6]
> scheduler.ReceiverTracker (Logging.scala:logInfo(59)) - *Stream 0
> received 0 blocks*
>
>
>
> What should be my approach now ?
>
> Need urgent help.
>
>
>
> Regards,
>
> Aiman
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* Monday, December 01, 2014 3:56 PM
> *To:* Sarosh, M.
> *Cc:* user@spark.apache.org
> *Subject:* Re: Kafka+Spark-streaming issue: Stream 0 received 0 blocks
>
>
>
> It says:
>
>
>
>  14/11/27 11:56:05 WARN scheduler.TaskSchedulerImpl: Initial job has not
> 

RE: Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread m.sarosh
Hi,

The spark master is working, and I have given the same url in the code:

The warning is gone, and the new log is:
---
Time: 141742785 ms
---

INFO  [sparkDriver-akka.actor.default-dispatcher-2] scheduler.JobScheduler 
(Logging.scala:logInfo(59)) - Starting job streaming job 141742785 ms.0 
from job set of time 141742785 ms
INFO  [sparkDriver-akka.actor.default-dispatcher-2] scheduler.JobScheduler 
(Logging.scala:logInfo(59)) - Finished job streaming job 141742785 ms.0 
from job set of time 141742785 ms
INFO  [sparkDriver-akka.actor.default-dispatcher-2] scheduler.JobScheduler 
(Logging.scala:logInfo(59)) - Total delay: 0.028 s for time 141742785 ms 
(execution: 0.001 s)
INFO  [sparkDriver-akka.actor.default-dispatcher-5] scheduler.JobScheduler 
(Logging.scala:logInfo(59)) - Added jobs for time 141742785 ms
INFO  [sparkDriver-akka.actor.default-dispatcher-5] rdd.MappedRDD 
(Logging.scala:logInfo(59)) - Removing RDD 25 from persistence list
INFO  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManager 
(Logging.scala:logInfo(59)) - Removing RDD 25
INFO  [sparkDriver-akka.actor.default-dispatcher-5] rdd.BlockRDD 
(Logging.scala:logInfo(59)) - Removing RDD 24 from persistence list
INFO  [sparkDriver-akka.actor.default-dispatcher-6] storage.BlockManager 
(Logging.scala:logInfo(59)) - Removing RDD 24
INFO  [sparkDriver-akka.actor.default-dispatcher-5] kafka.KafkaInputDStream 
(Logging.scala:logInfo(59)) - Removing blocks of RDD BlockRDD[24] at BlockRDD 
at ReceiverInputDStream.scala:69 of time 141742785 ms
INFO  [sparkDriver-akka.actor.default-dispatcher-4] scheduler.ReceiverTracker 
(Logging.scala:logInfo(59)) - Stream 0 received 0 blocks
---
Time: 1417427853000 ms
---

INFO  [sparkDriver-akka.actor.default-dispatcher-5] scheduler.JobScheduler 
(Logging.scala:logInfo(59)) - Starting job streaming job 1417427853000 ms.0 
from job set of time 1417427853000 ms
INFO  [sparkDriver-akka.actor.default-dispatcher-5] scheduler.JobScheduler 
(Logging.scala:logInfo(59)) - Finished job streaming job 1417427853000 ms.0 
from job set of time 1417427853000 ms
INFO  [sparkDriver-akka.actor.default-dispatcher-5] scheduler.JobScheduler 
(Logging.scala:logInfo(59)) - Total delay: 0.015 s for time 1417427853000 ms 
(execution: 0.001 s)
INFO  [sparkDriver-akka.actor.default-dispatcher-4] scheduler.JobScheduler 
(Logging.scala:logInfo(59)) - Added jobs for time 1417427853000 ms
INFO  [sparkDriver-akka.actor.default-dispatcher-4] rdd.MappedRDD 
(Logging.scala:logInfo(59)) - Removing RDD 27 from persistence list
INFO  [sparkDriver-akka.actor.default-dispatcher-5] storage.BlockManager 
(Logging.scala:logInfo(59)) - Removing RDD 27
INFO  [sparkDriver-akka.actor.default-dispatcher-4] rdd.BlockRDD 
(Logging.scala:logInfo(59)) - Removing RDD 26 from persistence list
INFO  [sparkDriver-akka.actor.default-dispatcher-6] storage.BlockManager 
(Logging.scala:logInfo(59)) - Removing RDD 26
INFO  [sparkDriver-akka.actor.default-dispatcher-4] kafka.KafkaInputDStream 
(Logging.scala:logInfo(59)) - Removing blocks of RDD BlockRDD[26] at BlockRDD 
at ReceiverInputDStream.scala:69 of time 1417427853000 ms
INFO  [sparkDriver-akka.actor.default-dispatcher-6] scheduler.ReceiverTracker 
(Logging.scala:logInfo(59)) - Stream 0 received 0 blocks

What should be my approach now ?
Need urgent help.

Regards,
Aiman

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, December 01, 2014 3:56 PM
To: Sarosh, M.
Cc: user@spark.apache.org
Subject: Re: Kafka+Spark-streaming issue: Stream 0 received 0 blocks

It says:

 14/11/27 11:56:05 WARN scheduler.TaskSchedulerImpl: Initial job has not 
accepted any resources; check your cluster UI to ensure that workers are 
registered and have sufficient memory

A quick guess would be that you are giving the wrong master URL 
(spark://192.168.88.130:7077). Open the web UI running on port 8080 and use 
the master URL listed there in the top left corner of the page.

Thanks
Best Regards

On Mon, Dec 1, 2014 at 3:42 PM, <m.sar...@accenture.com> wrote:
Hi,

I am integrating Kafka and Spark, using spark-streaming. I have created a topic 
as a kafka producer:

bin/kafka-topics.sh --create --zookeeper localhost:2181 
--replication-factor 1 --partitions 1 --topic test


I am publishing messages to Kafka and trying to read them with Spark Streaming 
Java code and display them on screen.
The daemons are all up: Spark master and worker, ZooKeeper, and Kafka.
I am writing the Java code using KafkaUtils.createStream; the code is below:

package com.spark;

import scala.Tuple2;
import kafka.serializer.Decoder;
import kafka.serializer.Encoder;
import org.apache.spark.streaming.Duration;
import org.apache.spa
