Re: Issue with GroupByKey in BeamSql using SparkRunner

Jean-Baptiste Onofré Wed, 10 Oct 2018 02:20:02 -0700

This may be related: I have a pipeline (streaming with sliding windows)
that works fine with the Direct and Flink runners, but I don't get any
results when using the Spark runner.


I'm going to investigate this using my beam-samples.

Regards
JB

On 10/10/2018 11:16, Ismaël Mejía wrote:
> Are you trying this on a particular Spark distribution or just locally?
> I ask because there was a data corruption issue with Spark 2.3.1
> (the previous version used by Beam):
> https://issues.apache.org/jira/browse/SPARK-23243
> 
> Current Beam master (and the next release) moves Spark to version 2.3.2,
> which should fix some of the data correctness issues (maybe yours
> too).
> Can you give it a try and report back whether this fixes your issue?
> 
> 
> On Tue, Oct 9, 2018 at 6:45 PM Vishwas Bm <bmvish...@gmail.com> wrote:
>>
>> Hi Kenn,
>>
>> We are using Beam 2.6 and spark-submit to submit jobs to a Spark 2.2
>> cluster on Kubernetes.
>>
>>
>> On Tue, Oct 9, 2018, 9:29 PM Kenneth Knowles <k...@apache.org> wrote:
>>>
>>> Thanks for the report! I filed 
>>> https://issues.apache.org/jira/browse/BEAM-5690 to track the issue.
>>>
>>> Can you share what version of Beam you are using?
>>>
>>> Kenn
>>>
>>> On Tue, Oct 9, 2018 at 3:18 AM Vishwas Bm <bmvish...@gmail.com> wrote:
>>>>
>>>> We are trying to set up a pipeline using BeamSql, and the trigger used
>>>> is the default (AfterWatermark, which fires once the watermark passes
>>>> the end of the window).
>>>> Below is the pipeline:
>>>>
>>>>    KafkaSource (KafkaIO) ---> Windowing (FixedWindow 1min) ---> BeamSql ---> KafkaSink (KafkaIO)
>>>>
>>>> We are using Spark Runner for this.
>>>> The BeamSql query is:
>>>>              select Col3, count(*) as count_col1 from PCOLLECTION GROUP BY Col3
>>>>
>>>> We are grouping by Col3, which is a string. It can hold the values string[0-9].
>>>>
>>>> The records are emitted to the Kafka sink every minute, but the output
>>>> records in Kafka are not as expected.
>>>> Below is the observed output (WST and WET indicate the window start
>>>> time and window end time):
>>>>
>>>> {"count_col1":1,"Col3":"string5","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":3,"Col3":"string7","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":2,"Col3":"string8","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":1,"Col3":"string2","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":1,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>>
>>>> We ran the same pipeline using the Direct and Flink runners, and we don't
>>>> see any entries with a count_col1 of 0.
>>>>
>>>> As per the Beam capability matrix page
>>>> (https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what),
>>>> GroupBy is not fully supported. Is this one of those cases?
>>>> Thanks & Regards,
>>>> Vishwas
>>>>
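
A note for anyone trying to reproduce this: below is a minimal Python sketch, not part of the original thread, of one way to check emitted records for the two symptoms in the quoted report, namely rows with a count_col1 of 0 and more than one row for the same (Col3, WST, WET) group. The function name find_anomalies and the shortened sample are hypothetical; only the field names come from the quoted output. It assumes one JSON record per line and, for the duplicate check, a single firing per window (the default trigger with no late data).

```python
import json
from collections import Counter


def find_anomalies(lines):
    """Return (zero_rows, duplicated_groups) for JSON records, one per line."""
    records = [json.loads(line) for line in lines if line.strip()]
    # COUNT(*) over a group that exists is at least 1, so a 0 is suspicious.
    zero_rows = [r for r in records if r["count_col1"] == 0]
    # With a single firing per window, each (key, window) should appear once.
    seen = Counter((r["Col3"], r["WST"], r["WET"]) for r in records)
    duplicated = {group: n for group, n in seen.items() if n > 1}
    return zero_rows, duplicated


# Shortened, hypothetical sample modeled on the records quoted above.
WST = "2018-10-09  09-55-00 0000  +0000"
WET = "2018-10-09  09-56-00 0000  +0000"
sample = [
    json.dumps({"count_col1": c, "Col3": k, "WST": WST, "WET": WET})
    for c, k in [(1, "string5"), (1, "string6"), (0, "string6"), (0, "string6")]
]

zero_rows, duplicated = find_anomalies(sample)
print(len(zero_rows), duplicated)
```

Running the same check against output captured from the Direct or Flink runner would be expected to report no zero rows and no duplicated groups, which gives a quick way to compare runners.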

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
