Re: Structured Streaming: multiple sinks

2017-08-25 Thread Tathagata Das
My apologies, Chris. Somehow I did not receive the OP's first email, so our
answers to the OP read to me as cryptic questions. :/
I found the full thread on Nabble. I agree with your analysis of the OP's
question 1.




Re: Structured Streaming: multiple sinks

2017-08-25 Thread Chris Bowden
Tathagata, thanks for filling in context for other readers on 2a and 2b; in
hindsight I summarized too much.

Regarding the OP's first question, I was hinting that it is quite natural to
chain processes via Kafka. If you are already interested in writing processed
data to Kafka, why add complexity to a job by having it commit processed data
to both Kafka and S3, rather than simply moving the processed data from Kafka
out to S3 as needed? Perhaps the OP's question got lost in how I responded. A
rough sketch of that chaining is below.
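
To make the chaining concrete, here is a minimal sketch (Scala) of the
downstream leg, assuming the upstream query already writes processed records
to a Kafka topic. The broker address, topic, bucket, and checkpoint paths are
placeholders, not anything from the OP's setup.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("processed-kafka-to-s3").getOrCreate()

// Read the already-processed records from the intermediate Kafka topic.
val processed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", "processed-topic")            // placeholder
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Land the records in S3 as Parquet; this job is independent of the upstream
// one and can be restarted, rewound, or rescaled on its own.
val toS3 = processed.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/processed/")                       // placeholder
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/to-s3/") // placeholder
  .start()

toS3.awaitTermination()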

1) We are consuming from Kafka using Structured Streaming and writing the
processed data set to S3.
We also want to write the processed data to Kafka moving forward; is it
possible to do it from the same streaming query? (Spark version 2.1.1)

Streaming queries are currently bound to a single sink, so multiplexing the
write across existing sinks within one streaming query isn't possible, AFAIK.
Arguably you can reuse the "processed data" DAG by starting multiple sinks
against it, though you will then process the data twice on different
"schedules", since each sink gets its own instance of StreamExecution,
TriggerExecutor, etc. If you *really* wanted to make one pass over the data
and process the exact same block of data per micro-batch, you could implement
it via foreach or a custom sink that writes to both Kafka and S3, but I
wouldn't recommend it. As stated above, it is quite natural to chain
processes via Kafka. A sketch of the multiple-sinks option follows.
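
For completeness, a minimal sketch (Scala) of the "reuse the processed-data
DAG, start multiple sinks" option described above. Note that the built-in
Kafka sink only arrived in Spark 2.2, so on 2.1.x the Kafka leg would have to
go through foreach with your own producer; the broker, topics, and paths
below are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-sinks-sketch").getOrCreate()

// The shared "processed data" DAG (real transformations elided).
val processed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", "input-topic")                // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// Sink 1: back to Kafka (Spark 2.2+ sink; expects a 'value' column).
val toKafka = processed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "processed-topic")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/to-kafka/")
  .start()

// Sink 2: to S3 as Parquet. This is a separate StreamExecution with its own
// trigger schedule, so the source is read and processed a second time.
val toS3 = processed.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/processed/")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/to-s3/")
  .start()

spark.streams.awaitAnyTermination()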



Re: Structured Streaming: multiple sinks

2017-08-25 Thread Tathagata Das
Responses inline.

On Thu, Aug 24, 2017 at 7:16 PM, cbowden <cbcweb...@gmail.com> wrote:

> 1. Would it not be more natural to write processed data to Kafka and sink
> processed data from Kafka to S3?
>

I am sorry, I don't fully understand this question. Could you please
elaborate further, as in, what is more natural than what?


> 2a. addBatch is the time Sink#addBatch took as measured by StreamExecution.
>

Yes. This essentially includes the time taken to compute the output and
finish writing the output to the sink.
(To give some context for other readers: this person is referring to the
different time durations reported through StreamingQuery.lastProgress.)


> 2b. getBatch is the time Source#getBatch took as measured by
> StreamExecution.
>
Yes, it is the time taken by the source to prepare the DataFrame that has the
new data to be processed in the trigger.
Usually this is low, but it's not guaranteed to be, as some sources may
require complicated tracking and bookkeeping to prepare the DataFrame.


> 3. triggerExecution is effectively end-to-end processing time for the
> micro-batch. Note that all other durations sum closely to triggerExecution;
> there is a little slippage due to bookkeeping activities in StreamExecution.
>

Yes. Precisely.
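
For anyone who wants these numbers programmatically instead of scraping the
logs, a small sketch (Scala); `query` is assumed to be an already started
StreamingQuery, i.e. the value returned by writeStream...start().

import org.apache.spark.sql.streaming.StreamingQuery

def logDurations(query: StreamingQuery): Unit = {
  // lastProgress is null until the first trigger has completed.
  val progress = query.lastProgress
  if (progress != null) {
    val d = progress.durationMs // java.util.Map[String, java.lang.Long]
    println(s"batch ${progress.batchId}: " +
      s"getOffset=${d.get("getOffset")} ms, " +
      s"getBatch=${d.get("getBatch")} ms, " +
      s"addBatch=${d.get("addBatch")} ms, " +
      s"triggerExecution=${d.get("triggerExecution")} ms")
  }
}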




Re: Structured Streaming: multiple sinks

2017-08-24 Thread cbowden
1. Would it not be more natural to write processed data to Kafka and sink
processed data from Kafka to S3?
2a. addBatch is the time Sink#addBatch took, as measured by StreamExecution.
2b. getBatch is the time Source#getBatch took, as measured by
StreamExecution.
3. triggerExecution is effectively end-to-end processing time for the
micro-batch. Note that all other durations sum closely to triggerExecution;
there is a little slippage due to bookkeeping activities in StreamExecution.
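
If it helps, here is a small sketch (Scala) of watching these same durations
for every trigger via a StreamingQueryListener, rather than polling
lastProgress; the printed format is just an example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{
  QueryStartedEvent, QueryProgressEvent, QueryTerminatedEvent}

val spark = SparkSession.builder().appName("duration-listener").getOrCreate()

// Print getBatch/addBatch/triggerExecution after every completed trigger,
// for every streaming query in this session.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val d = event.progress.durationMs
    println(s"batch=${event.progress.batchId} " +
      s"getBatch=${d.get("getBatch")}ms " +
      s"addBatch=${d.get("addBatch")}ms " +
      s"triggerExecution=${d.get("triggerExecution")}ms")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})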






Structured Streaming: multiple sinks

2017-08-11 Thread aravias
1) We are consuming from Kafka using Structured Streaming and writing the
processed data set to S3.
We also want to write the processed data to Kafka moving forward; is it
possible to do it from the same streaming query? (Spark version 2.1.1)


2) In the logs, I see the streaming query progress output, and I have a
sample durations JSON from the log. Can someone please provide more clarity
on what the difference is between *addBatch* and *getBatch*?

3) triggerExecution - is it the time taken to both process the fetched data
and write it to the sink?




"durationMs" : {
"addBatch" : 2263426,
"getBatch" : 12,
"getOffset" : 273,
"queryPlanning" : 13,
"triggerExecution" : 2264288,
"walCommit" : 552
  },

regards
aravias


