Hi Jean,

Thanks for your response. So when can we expect Spark 2.x support for the
Spark runner?

Thanks,
Nishu

On Mon, Nov 13, 2017 at 11:53 AM, Jean-Baptiste Onofré <[email protected]>
wrote:

> Hi,
>
> Regarding your question:
>
> 1. Not yet, but as you might have seen on the mailing list, we have a PR
> about Spark 2.x support.
>
> 2. Additional triggers are supported, with more in progress. GroupByKey
> and accumulating panes are also supported.
>
> 3. It's not fixed to memory: I made a change that allows you to define
> the default storage level (via the pipeline options); see the sketch
> below. The runner also automatically determines when to persist an RDD
> by analyzing the DAG.
>
> 4. Yes, it's supported (see the checkpointDir option in the sketch below).
>
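> To illustrate, here is a minimal sketch of setting both options
> programmatically (it assumes the storageLevel and checkpointDir options
> of SparkPipelineOptions; adjust names and values to your Beam version):
>
> import org.apache.beam.runners.spark.SparkPipelineOptions;
> import org.apache.beam.runners.spark.SparkRunner;
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>
> public class SparkRunnerOptionsExample {
>   public static void main(String[] args) {
>     SparkPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
>         .withValidation().as(SparkPipelineOptions.class);
>     options.setRunner(SparkRunner.class);
>     // Default RDD storage level for the pipeline, e.g. allow spilling to disk:
>     options.setStorageLevel("MEMORY_AND_DISK");
>     // Checkpoint directory so a restarted job can recover streaming state:
>     options.setCheckpointDir("hdfs:///beam/checkpoints");
>     Pipeline pipeline = Pipeline.create(options);
>     // ... build and run the pipeline here ...
>   }
> }
>
> The same values can also be passed on the command line, e.g.
> --storageLevel=MEMORY_AND_DISK --checkpointDir=hdfs:///beam/checkpoints.
>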
> Regards
> JB
>
> On 11/13/2017 10:50 AM, Nishu wrote:
>
>> Hi Team,
>>
>> I am writing a streaming pipeline in Apache Beam using the Spark runner.
>> Use case: joining multiple Kafka streams using windowed collections.
>> I use GroupByKey to group the events by a common business key, and that
>> output is used as input for the join operation. The pipeline runs on the
>> direct runner as expected, but on a Spark cluster (v2.1) it throws an
>> accumulator error:
>> *"Exception in thread "main" java.lang.AssertionError: assertion failed:
>> copyAndReset must return a zero value copy"*
>>
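>> Roughly, the relevant part of the pipeline looks like the sketch below
>> (simplified; the topic names, key type and the CoGroupByKey-based join
>> are placeholders for the real logic):
>>
>> PCollection<KV<String, String>> orders = pipeline
>>     .apply(KafkaIO.<String, String>read()
>>         .withBootstrapServers("kafka:9092")
>>         .withTopic("orders")
>>         .withKeyDeserializer(StringDeserializer.class)
>>         .withValueDeserializer(StringDeserializer.class)
>>         .withoutMetadata())
>>     .apply(Window.<KV<String, String>>into(
>>         FixedWindows.of(Duration.standardMinutes(1))));
>>
>> // The second stream is read and windowed the same way:
>> PCollection<KV<String, String>> payments = ...;
>>
>> // Join the two windowed streams on the common business key:
>> TupleTag<String> ordersTag = new TupleTag<>();
>> TupleTag<String> paymentsTag = new TupleTag<>();
>> PCollection<KV<String, CoGbkResult>> joined =
>>     KeyedPCollectionTuple.of(ordersTag, orders)
>>         .and(paymentsTag, payments)
>>         .apply(CoGroupByKey.create());
>>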
>> I tried the same pipeline on a Spark cluster (v1.6); there it runs
>> without any error but doesn't perform the join operations on the streams.
>>
>> I have a couple of questions.
>>
>> 1. Does the Spark runner support Spark 2.x?
>>
>> 2. Regarding triggers, currently only ProcessingTimeTrigger is listed as
>> supported in the Capability Matrix
>> <https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what>.
>> Can we expect support for more triggers in the near future? Also, are
>> GroupByKey and accumulating panes supported on Spark for streaming
>> pipelines? (A sketch of the kind of windowing I mean is at the end of
>> this mail.)
>>
>> 3. According to the documentation, the storage level
>> <https://beam.apache.org/documentation/runners/spark/#pipeline-options-for-the-spark-runner>
>> is set to IN_MEMORY for streaming pipelines. Can we configure it to use
>> disk as well?
>>
>> 4. Is checkpointing supported for the Spark runner? In case the Beam
>> pipeline fails unexpectedly, can we recover the state from the last run?
>>
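>> For context, this is roughly the kind of windowing, triggering and
>> accumulation I mean in question 2 (just a sketch of the Beam model API,
>> not something taken from the Spark runner documentation):
>>
>> PCollection<KV<String, String>> windowed = input.apply(
>>     Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(5)))
>>         .triggering(Repeatedly.forever(
>>             AfterProcessingTime.pastFirstElementInPane()
>>                 .plusDelayOf(Duration.standardSeconds(30))))
>>         .withAllowedLateness(Duration.standardMinutes(1))
>>         .accumulatingFiredPanes());
>>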
>> It would be great if someone could help with the above.
>>
>>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 
Thanks & Regards,
Nishu Tayal
