This may be related: I have a streaming pipeline with sliding windows that works fine with the Direct and Flink runners, but I don't get any results when using the Spark runner.
I'm going to investigate this using my beam-samples.

Regards
JB

On 10/10/2018 11:16, Ismaël Mejía wrote:
> Are you trying this in a particular Spark distribution or just locally?
> I ask this because there was a data corruption issue with Spark 2.3.1
> (previous version used by Beam)
> https://issues.apache.org/jira/browse/SPARK-23243
>
> Current Beam master (and next release) moves Spark to version 2.3.2
> and that should fix some of the data correctness issues (maybe yours
> too).
> Can you give it a try and report back if this fixes your issue?
>
> On Tue, Oct 9, 2018 at 6:45 PM Vishwas Bm <bmvish...@gmail.com> wrote:
>>
>> Hi Kenn,
>>
>> We are using Beam 2.6 and using spark-submit to submit jobs to a Spark 2.2
>> cluster on Kubernetes.
>>
>> On Tue, Oct 9, 2018, 9:29 PM Kenneth Knowles <k...@apache.org> wrote:
>>>
>>> Thanks for the report! I filed
>>> https://issues.apache.org/jira/browse/BEAM-5690 to track the issue.
>>>
>>> Can you share what version of Beam you are using?
>>>
>>> Kenn
>>>
>>> On Tue, Oct 9, 2018 at 3:18 AM Vishwas Bm <bmvish...@gmail.com> wrote:
>>>>
>>>> We are trying to set up a pipeline using BeamSql, and the trigger used
>>>> is the default (AfterWatermark crosses the window).
>>>> Below is the pipeline:
>>>>
>>>> KafkaSource (KafkaIO) ---> Windowing (FixedWindow 1min) ---> BeamSql
>>>> ---> KafkaSink (KafkaIO)
>>>>
>>>> We are using the Spark Runner for this.
>>>> The BeamSql query is:
>>>> select Col3, count(*) as count_col1 from PCOLLECTION GROUP BY Col3
>>>>
>>>> We are grouping by Col3, which is a string. It can hold values string[0-9].
>>>>
>>>> The records are getting emitted at 1 min to the Kafka sink, but the output
>>>> record in Kafka is not as expected.
>>>> Below is the output observed: (WST and WET are indicators for window start
>>>> time and window end time)
>>>>
>>>> {"count_col1":1,"Col3":"string5","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":3,"Col3":"string7","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":2,"Col3":"string8","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":1,"Col3":"string2","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":1,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>>>
>>>> We ran the same pipeline using direct and flink runner and we dont see 0
>>>> entries for count_col1.
>>>>
>>>> As per the Beam capability matrix page
>>>> (https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what),
>>>> GroupBy is not fully supported. Is this one of those cases?
>>>>
>>>> Thanks & Regards,
>>>> Vishwas
>>>>

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
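
A quick way to confirm the symptom described in the thread is to scan the records read back from the Kafka sink for zero counts and for repeated (window, key) pairs. With the default trigger, a GROUP BY count over a fixed window would normally be expected to yield one row per key per window, each with a count of at least 1. The following is a minimal sketch in Python and is not part of the original thread: the function name find_anomalies and the shortened sample are hypothetical, while the field names (count_col1, Col3, WST, WET) come from the output quoted above.

```python
import json
from collections import Counter


def find_anomalies(lines):
    """Return (zero_rows, duplicates) for JSON records, one record per line.

    zero_rows:  records whose count_col1 is 0
    duplicates: {(WST, WET, Col3): occurrences} for keys seen more than once
    """
    records = [json.loads(line) for line in lines if line.strip()]
    zero_rows = [r for r in records if r["count_col1"] == 0]
    seen = Counter((r["WST"], r["WET"], r["Col3"]) for r in records)
    duplicates = {key: n for key, n in seen.items() if n > 1}
    return zero_rows, duplicates


# Shortened excerpt of the output reported in the thread (assumption: one
# JSON record per line, as read back from the sink topic).
WST = "2018-10-09 09-55-00 0000 +0000"
WET = "2018-10-09 09-56-00 0000 +0000"
sample = [
    '{"count_col1":1,"Col3":"string5","WST":"%s","WET":"%s"}' % (WST, WET),
    '{"count_col1":1,"Col3":"string6","WST":"%s","WET":"%s"}' % (WST, WET),
    '{"count_col1":0,"Col3":"string6","WST":"%s","WET":"%s"}' % (WST, WET),
    '{"count_col1":0,"Col3":"string6","WST":"%s","WET":"%s"}' % (WST, WET),
]

zero_rows, duplicates = find_anomalies(sample)
print("zero-count rows:", len(zero_rows))
print("repeated (window, key) pairs:", duplicates)
```

Running the same check against the output of the Direct or Flink runner should report no zero-count rows and no repeated pairs, which makes it easy to compare runners on identical input.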