Re: Spark Streaming - stream between 2 applications

2015-11-21 Thread Christian
Instead of sending the results of one Spark app directly to the other, you
could write the results to a Kafka topic that is consumed by your other Spark
application.
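
Something along these lines (an untested sketch; the broker address
"kafka-host:9092", the topic name "words", and the second app's StreamingContext
are placeholders, not anything from your code):

«
// First application: publish the words to Kafka from inside foreachRDD.
// The producer is created per partition on the executors, so nothing
// non-serializable has to be shipped from the driver.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

union.flatMap(line => line.split(' ')).foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-host:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partition.foreach(word => producer.send(new ProducerRecord[String, String]("words", word)))
    producer.close()
  }
}

// Second application: consume that topic as its input stream
// (assumes the spark-streaming-kafka artifact is on the classpath).
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "kafka-host:9092")
val words = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("words")).map(_._2) // ssc here is the second app's StreamingContext
»

That also decouples the two applications: the consumer can restart or fall
behind without blocking the producer.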


Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Cody Koeninger
You're confused about which parts of your code are running on the driver vs
the executor, which is why you're getting serialization errors.

Read

http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
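
The gist of that section, applied to your snippet (a rough sketch only;
"other-app-host" and port 9998 are placeholders, and this shows where the code
runs rather than endorsing the socket approach):

«
import java.io.PrintWriter
import java.net.Socket

union.flatMap(line => line.split(' ')).foreachRDD { rdd =>
  // everything directly inside foreachRDD runs on the driver
  rdd.foreachPartition { partition =>
    // the function given to foreachPartition runs on the executors,
    // so create the connection here instead of capturing a driver-side writer
    val socket = new Socket("other-app-host", 9998)
    val out = new PrintWriter(socket.getOutputStream, true)
    partition.foreach(word => out.println(word))
    out.close()
    socket.close()
  }
}
»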



Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Saiph Kappa
I think my problem persists whether I use Kafka or sockets. Or am I wrong?
How would you use Kafka here?


Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Christian
Have you considered using Kafka?


Spark Streaming - stream between 2 applications

2015-11-20 Thread Saiph Kappa
Hi,

I have a basic Spark Streaming application like this:

«
...

val ssc = new StreamingContext(sparkConf, Duration(batchMillis))
val rawStreams = (1 to numStreams).map(_ =>
  ssc.rawSocketStream[String](host, port, StorageLevel.MEMORY_ONLY_SER)).toArray
val union = ssc.union(rawStreams)

union.flatMap(line => line.split(' ')).foreachRDD(rdd => {

  // TODO

})
...
»


My question is: what is the best and fastest way to send the resulting RDDs
as input to be consumed by another Spark Streaming application?

I tried to add this code in place of the "TODO" comment:

«
val serverSocket = new ServerSocket(9998)
while (true) {
  val socket = serverSocket.accept()
  @transient val out = new PrintWriter(socket.getOutputStream)
  try {
    rdd.foreach(out.write)
  } catch {
    case e: IOException =>
      socket.close()
  }
}
»


I also tried to create a thread in the driver application code to launch the
socket server and then share state (the PrintWriter object) between the driver
program and the tasks (sketched below). But I got an exception saying that the
task is not serializable because PrintWriter is not serializable (despite the
@transient annotation). I know this is not a very elegant solution, but what
other directions should I explore?
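
Roughly, that second attempt looked like this (a reconstruction for
illustration, not the exact code):

«
import java.io.PrintWriter
import java.net.ServerSocket

// server thread started on the driver; tasks were supposed to reuse its writer
var out: PrintWriter = null
new Thread(new Runnable {
  def run(): Unit = {
    val serverSocket = new ServerSocket(9998)
    val socket = serverSocket.accept()
    out = new PrintWriter(socket.getOutputStream, true)
  }
}).start()

union.flatMap(line => line.split(' ')).foreachRDD { rdd =>
  rdd.foreach(word => out.write(word)) // closure captures out => "task not serializable"
}
»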

Thanks.