Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Nishu
Hi Eugene,

I ran it on both standalone Flink (non-YARN) and Flink on an HDInsight
cluster (YARN). Both ran successfully. :)

Regards,
Nishu


On Wed, Nov 22, 2017 at 9:40 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Thanks Nishu. So, if I understand correctly, your pipelines were running on
> non-YARN, but you're planning to run with YARN?
>
> I meanwhile was able to get Flink running on Dataproc (YARN), and validated
> quickstart and game examples.
> At this point we need validation for Spark and Flink non-YARN [I think if
> Nishu's runs were non-YARN, they'd give us enough confidence, combined with
> the success of other validations of Spark and Flink runners?], and Apex on
> YARN. However, it seems that in previous RCs we were not validating Apex on
> YARN, only local cluster. Is it needed this time?
>
> On Wed, Nov 22, 2017 at 12:28 PM Nishu  wrote:
>
> > Hi Eugene,
> >
> > No, I didn't try those; instead I have my custom pipeline where a Kafka
> > topic is the source. I have defined a global window and a processing-time
> > trigger to read the data. It then runs some transformations, i.e.
> > GroupByKey and CoGroupByKey, on the windowed collections.
> > I was running the same pipeline on the direct runner and the Spark runner
> > earlier. Today I gave it a try with Flink on YARN.
> >
> > Best Regards,
> > Nishu.
> >
> >
> > On Wed, Nov 22, 2017 at 8:07 PM, Eugene Kirpichov <
> > kirpic...@google.com.invalid> wrote:
> >
> > > Thanks Nishu! Can you clarify which pipeline you were running?
> > > The validation spreadsheet includes 1) the quickstart and 2) mobile
> game
> > > walkthroughs. Was it one of these, or your custom pipeline?
> > >
> > > On Wed, Nov 22, 2017 at 10:20 AM Nishu  wrote:
> > >
> > > > Hi,
> > > >
> > > > Typo in previous mail.  I meant Flink runner.
> > > >
> > > > Thanks,
> > > > Nishu
> > > > On Wed, 22 Nov 2017 at 19.17,
> > > >
> > > > > Hi,
> > > > >
> > > > > I build a pipeline using RC 2.2 today and ran with runner on yarn.
> > > > > It worked seamlessly for unbounded sources. Couldn’t see any issues
> > > with
> > > > > my pipeline so far :)
> > > > >
> > > > >
> > > > > Thanks,Nishu
> > > > >
> > > > > On Wed, 22 Nov 2017 at 18.57, Reuven Lax  >
> > > > wrote:
> > > > >
> > > > >> Who is validating Flink and Yarn?
> > > > >>
> > > > >> On Tue, Nov 21, 2017 at 9:26 AM, Kenneth Knowles
> > >  > > > >
> > > > >> wrote:
> > > > >>
> > > > >> > On Mon, Nov 20, 2017 at 5:01 PM, Eugene Kirpichov <
> > > > >> > kirpic...@google.com.invalid> wrote:
> > > > >> >
> > > > >> > > In the verification spreadsheet, I'm not sure I understand the
> > > > >> difference
> > > > >> > > between the "YARN" and "Standalone cluster/service". Which is
> > > > >> Dataproc?
> > > > >> > It
> > > > >> > > definitely uses YARN, but it is also a standalone
> > cluster/service.
> > > > >> Does
> > > > >> > it
> > > > >> > > count for both?
> > > > >> > >
> > > > >> >
> > > > >> > No, it doesn't. A number of runners have their own non-YARN
> > cluster
> > > > >> mode. I
> > > > >> > would expect that the launching experience might be different
> and
> > >

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Nishu
Hi Eugene,

No, I didn't try those; instead I have my custom pipeline where a Kafka
topic is the source. I have defined a global window and a processing-time
trigger to read the data. It then runs some transformations, i.e.
GroupByKey and CoGroupByKey, on the windowed collections.
I was running the same pipeline on the direct runner and the Spark runner
earlier. Today I gave it a try with Flink on YARN.
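For readers following along, the read-and-window stage of such a pipeline looks roughly like the following. This is an illustrative, uncompiled sketch against the Beam 2.2 Java API; the broker address, topic name, and 30-second trigger delay are made-up placeholders, and `pipeline` is assumed to be an already-created Pipeline object:

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

// Read from a Kafka topic, then place everything in the global window
// with a repeating processing-time trigger so that GroupByKey /
// CoGroupByKey downstream can fire panes on an unbounded stream.
PCollection<KV<String, String>> windowed = pipeline
    .apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092")   // placeholder
        .withTopic("events")                   // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata())
    .apply(Window.<KV<String, String>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30))))
        .discardingFiredPanes());
// GroupByKey / CoGroupByKey then operate on the triggered panes.
```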

Best Regards,
Nishu.


On Wed, Nov 22, 2017 at 8:07 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Thanks Nishu! Can you clarify which pipeline you were running?
> The validation spreadsheet includes 1) the quickstart and 2) mobile game
> walkthroughs. Was it one of these, or your custom pipeline?
>
> On Wed, Nov 22, 2017 at 10:20 AM Nishu  wrote:
>
> > Hi,
> >
> > Typo in previous mail.  I meant Flink runner.
> >
> > Thanks,
> > Nishu
> > On Wed, 22 Nov 2017 at 19.17,
> >
> > > Hi,
> > >
> > > I build a pipeline using RC 2.2 today and ran with runner on yarn.
> > > It worked seamlessly for unbounded sources. Couldn’t see any issues
> with
> > > my pipeline so far :)
> > >
> > >
> > > Thanks,Nishu
> > >
> > > On Wed, 22 Nov 2017 at 18.57, Reuven Lax 
> > wrote:
> > >
> > >> Who is validating Flink and Yarn?
> > >>
> > >> On Tue, Nov 21, 2017 at 9:26 AM, Kenneth Knowles
>  > >
> > >> wrote:
> > >>
> > >> > On Mon, Nov 20, 2017 at 5:01 PM, Eugene Kirpichov <
> > >> > kirpic...@google.com.invalid> wrote:
> > >> >
> > >> > > In the verification spreadsheet, I'm not sure I understand the
> > >> difference
> > >> > > between the "YARN" and "Standalone cluster/service". Which is
> > >> Dataproc?
> > >> > It
> > >> > > definitely uses YARN, but it is also a standalone cluster/service.
> > >> Does
> > >> > it
> > >> > > count for both?
> > >> > >
> > >> >
> > >> > No, it doesn't. A number of runners have their own non-YARN cluster
> > >> mode. I
> > >> > would expect that the launching experience might be different and
> the
> > >> > portable container management to differ. If they are identical,
> > experts
> > >> in
> > >> > those systems should feel free to coalesce the rows. Conversely, as
> > >> other
> > >> > platforms become supported, they could be added or not based on
> > whether
> > >> > they are substantively different from a user experience or QA point
> of
> > >> > view.
> > >> >
> > >> > Kenn
> > >> >
> > >> >
> > >> > > Seems now we're missing just Apex and Flink cluster verifications.
> > >> > >
> > >> > > *though Spark runner took 6x longer to run UserScore, partially I
> > >> guess
> > >> > > because it didn't do autoscaling (Dataflow runner ramped up to 5
> > >> workers
> > >> > > whereas Spark runner used 2 workers). For some reason Spark runner
> > >> chose
> > >> > > not to split the 10GB input files into chunks.
> > >> > >
> > >> > > On Mon, Nov 20, 2017 at 3:46 PM Reuven Lax
>  > >
> > >> > > wrote:
> > >> > >
> > >> > > > Done
> > >> > > >
> > >> > > > On Tue, Nov 21, 2017 at 3:08 AM, Robert Bradshaw <
> > >> > > > rober...@google.com.invalid> wrote:
> > >> > > >
> > >> > > > > Thanks. You need to re-sign as well.
> > >> > > > >
> > >> > > > > On Mon, Nov 20, 2017 at 12:14 AM, Reuven Lax
> > >> >  > >> > > >
> > >> > > > > wrote:
> > >> > > > > > FYI these generated files have been removed from the source
> > >> > > > distribution.
> > >> > > > > >
> > >> > 

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Nishu
Hi,

Typo in previous mail. I meant the Flink runner.

Thanks,
Nishu
On Wed, 22 Nov 2017 at 19.17,

> Hi,
>
> I built a pipeline using RC 2.2 today and ran it with the runner on YARN.
> It worked seamlessly for unbounded sources. I couldn't see any issues with
> my pipeline so far :)
>
>
> Thanks, Nishu
>
> On Wed, 22 Nov 2017 at 18.57, Reuven Lax  wrote:
>
>> Who is validating Flink and Yarn?
>>
>> On Tue, Nov 21, 2017 at 9:26 AM, Kenneth Knowles 
>> wrote:
>>
>> > On Mon, Nov 20, 2017 at 5:01 PM, Eugene Kirpichov <
>> > kirpic...@google.com.invalid> wrote:
>> >
>> > > In the verification spreadsheet, I'm not sure I understand the
>> difference
>> > > between the "YARN" and "Standalone cluster/service". Which is
>> Dataproc?
>> > It
>> > > definitely uses YARN, but it is also a standalone cluster/service.
>> Does
>> > it
>> > > count for both?
>> > >
>> >
>> > No, it doesn't. A number of runners have their own non-YARN cluster
>> mode. I
>> > would expect that the launching experience might be different and the
>> > portable container management to differ. If they are identical, experts
>> in
>> > those systems should feel free to coalesce the rows. Conversely, as
>> other
>> > platforms become supported, they could be added or not based on whether
>> > they are substantively different from a user experience or QA point of
>> > view.
>> >
>> > Kenn
>> >
>> >
>> > > Seems now we're missing just Apex and Flink cluster verifications.
>> > >
>> > > *though Spark runner took 6x longer to run UserScore, partially I
>> guess
>> > > because it didn't do autoscaling (Dataflow runner ramped up to 5
>> workers
>> > > whereas Spark runner used 2 workers). For some reason Spark runner
>> chose
>> > > not to split the 10GB input files into chunks.
>> > >
>> > > On Mon, Nov 20, 2017 at 3:46 PM Reuven Lax 
>> > > wrote:
>> > >
>> > > > Done
>> > > >
>> > > > On Tue, Nov 21, 2017 at 3:08 AM, Robert Bradshaw <
>> > > > rober...@google.com.invalid> wrote:
>> > > >
>> > > > > Thanks. You need to re-sign as well.
>> > > > >
>> > > > > On Mon, Nov 20, 2017 at 12:14 AM, Reuven Lax
>> > > > > >
>> > > > > wrote:
>> > > > > > FYI these generated files have been removed from the source
>> > > > distribution.
>> > > > > >
>> > > > > > On Sat, Nov 18, 2017 at 9:09 AM, Reuven Lax 
>> > > wrote:
>> > > > > >
>> > > > > >> hmmm, I thought I removed those generated files from the zip
>> file
>> > > > before
>> > > > > >> sending this email. Let me check again.
>> > > > > >>
>> > > > > >> Reuven
>> > > > > >>
>> > > > > >> On Sat, Nov 18, 2017 at 8:52 AM, Robert Bradshaw <
>> > > > > >> rober...@google.com.invalid> wrote:
>> > > > > >>
>> > > > > >>> The source distribution contains a couple of files not on
>> github
>> > > > (e.g.
>> > > > > >>> folders that were added on master, Python generated files).
>> The
>> > pom
>> > > > > >>> files differed only by missing -SNAPSHOT, other than that
>> > > presumably
>> > > > > >>> the source release should just be "wget
>> > > > > >>> https://github.com/apache/beam/archive/release-2.2.0.zip";?
>> > > > > >>>
>> > > > > >>> diff -rq apache-beam-2.2.0 beam/ | grep -v pom.xml
>> > > > > >>>
>> > > > > >>> # OK?
>> > > > > >>>
>> > > > > >>> Only in apache-beam-2.2.0: DEPENDENCIES
>> > > > > >>>
>> > > > > >>> # Expected.
>> > > > > >>>
>> > > > > >>> Only in beam/: .git
>> > > > > >>> Only in beam/: .gitattributes
>> > > > > >>> Only in beam/: .gitignore
>> > > > > >>>
> > >

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Nishu
Hi,

I built a pipeline using RC 2.2 today and ran it with the runner on YARN.
It worked seamlessly for unbounded sources. I couldn't see any issues with my
pipeline so far :)


Thanks, Nishu

On Wed, 22 Nov 2017 at 18.57, Reuven Lax  wrote:

> Who is validating Flink and Yarn?
>
> On Tue, Nov 21, 2017 at 9:26 AM, Kenneth Knowles 
> wrote:
>
> > On Mon, Nov 20, 2017 at 5:01 PM, Eugene Kirpichov <
> > kirpic...@google.com.invalid> wrote:
> >
> > > In the verification spreadsheet, I'm not sure I understand the
> difference
> > > between the "YARN" and "Standalone cluster/service". Which is Dataproc?
> > It
> > > definitely uses YARN, but it is also a standalone cluster/service. Does
> > it
> > > count for both?
> > >
> >
> > No, it doesn't. A number of runners have their own non-YARN cluster
> mode. I
> > would expect that the launching experience might be different and the
> > portable container management to differ. If they are identical, experts
> in
> > those systems should feel free to coalesce the rows. Conversely, as other
> > platforms become supported, they could be added or not based on whether
> > they are substantively different from a user experience or QA point of
> > view.
> >
> > Kenn
> >
> >
> > > Seems now we're missing just Apex and Flink cluster verifications.
> > >
> > > *though Spark runner took 6x longer to run UserScore, partially I guess
> > > because it didn't do autoscaling (Dataflow runner ramped up to 5
> workers
> > > whereas Spark runner used 2 workers). For some reason Spark runner
> chose
> > > not to split the 10GB input files into chunks.
> > >
> > > On Mon, Nov 20, 2017 at 3:46 PM Reuven Lax 
> > > wrote:
> > >
> > > > Done
> > > >
> > > > On Tue, Nov 21, 2017 at 3:08 AM, Robert Bradshaw <
> > > > rober...@google.com.invalid> wrote:
> > > >
> > > > > Thanks. You need to re-sign as well.
> > > > >
> > > > > On Mon, Nov 20, 2017 at 12:14 AM, Reuven Lax
> >  > > >
> > > > > wrote:
> > > > > > FYI these generated files have been removed from the source
> > > > distribution.
> > > > > >
> > > > > > On Sat, Nov 18, 2017 at 9:09 AM, Reuven Lax 
> > > wrote:
> > > > > >
> > > > > >> hmmm, I thought I removed those generated files from the zip
> file
> > > > before
> > > > > >> sending this email. Let me check again.
> > > > > >>
> > > > > >> Reuven
> > > > > >>
> > > > > >> On Sat, Nov 18, 2017 at 8:52 AM, Robert Bradshaw <
> > > > > >> rober...@google.com.invalid> wrote:
> > > > > >>
> > > > > >>> The source distribution contains a couple of files not on
> github
> > > > (e.g.
> > > > > >>> folders that were added on master, Python generated files). The
> > pom
> > > > > >>> files differed only by missing -SNAPSHOT, other than that
> > > presumably
> > > > > >>> the source release should just be "wget
> > > > > >>> https://github.com/apache/beam/archive/release-2.2.0.zip";?
> > > > > >>>
> > > > > >>> diff -rq apache-beam-2.2.0 beam/ | grep -v pom.xml
> > > > > >>>
> > > > > >>> # OK?
> > > > > >>>
> > > > > >>> Only in apache-beam-2.2.0: DEPENDENCIES
> > > > > >>>
> > > > > >>> # Expected.
> > > > > >>>
> > > > > >>> Only in beam/: .git
> > > > > >>> Only in beam/: .gitattributes
> > > > > >>> Only in beam/: .gitignore
> > > > > >>>
> > > > > >>> # These folders are probably from switching around between
> master
> > > and
> > > > > >>> git branches.
> > > > > >>>
> > > > > >>> Only in apache-beam-2.2.0: model
> > > > > >>> Only in apache-beam-2.2.0/runners/flink: examples
> > > > > >>> Only in apache-beam-2.2.0/runners/flink: runner
> > > > > >>> Only in apache-beam-2.2.0/runners/gearpump: jarstore
> > > > > 

Re: Apache Beam and Spark

2017-11-13 Thread Nishu
Hi Jean,

Thanks for your response. So when can we expect Spark 2.x support for the
Spark runner?

Thanks,
Nishu

On Mon, Nov 13, 2017 at 11:53 AM, Jean-Baptiste Onofré 
wrote:

> Hi,
>
> Regarding your question:
>
> 1. Not yet, but as you might have seen on the mailing list, we have a PR
> about Spark 2.x support.
>
> 2. We have additional triggers supported and in progress. GroupByKey and
> accumulators are also supported.
>
> 3. No, I made a change that lets you define the default storage
> level (via the pipeline options). The runner also automatically determines
> when to persist an RDD by analyzing the DAG.
>
> 4. Yes, it's supported.
>
> Regards
> JB
>
> On 11/13/2017 10:50 AM, Nishu wrote:
>
>> Hi Team,
>>
>> I am writing a streaming pipeline in Apache beam using spark runner.
>> Use case : To join the multiple kafka streams using windowed collections.
>> I use GroupByKey to group the events based on common business key and that
>> output is used as input for Join operation. Pipeline run on direct runner
>> as expected but on Spark cluster(v2.1), it throws the Accumulator error.
>> *"Exception in thread "main" java.lang.AssertionError: assertion failed:
>> copyAndReset must return a zero value copy"*
>>
>> I tried the same pipeline on Spark cluster(v1.6), there it runs without
>> any
>> error but doesn't perform the join operations on the streams .
>>
>> I got couple of questions.
>>
>> 1. Does spark runner support spark version 2.x?
>>
>> 2. Regarding the triggers, currently only ProcessingTimeTrigger is
>> supported in Capability Matrix
>> <https://beam.apache.org/documentation/runners/capability-
>> matrix/#cap-summary-what>
>> ,
>> can we expect to have support for more trigger in near future sometime
>> soon
>> ? Also, GroupByKey and Accumulating panes features, are those supported
>> for
>> spark for streaming pipeline?
>>
>> 3. According to the documentation, Storage level
>> <https://beam.apache.org/documentation/runners/spark/#pipeli
>> ne-options-for-the-spark-runner>
>> is set to IN_MEMORY for streaming pipelines. Can we configure it to disk
>> as
>> well?
>>
>> 4. Is there checkpointing feature supported for Spark runner? In case if
>> Beam pipeline fails unexpectedly, can we read the state from the last run.
>>
>> It will be great if someone could help to know above.
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 
Thanks & Regards,
Nishu Tayal


Apache Beam and Spark

2017-11-13 Thread Nishu
Hi Team,

I am writing a streaming pipeline in Apache Beam using the Spark runner.
Use case: joining multiple Kafka streams using windowed collections.
I use GroupByKey to group events based on a common business key, and that
output is used as input for the join operation. The pipeline runs on the
direct runner as expected, but on a Spark cluster (v2.1) it throws an
accumulator error:
*"Exception in thread "main" java.lang.AssertionError: assertion failed:
copyAndReset must return a zero value copy"*

I tried the same pipeline on a Spark cluster (v1.6); there it runs without
any error but doesn't perform the join operations on the streams.

I have a couple of questions.

1. Does the Spark runner support Spark 2.x?

2. Regarding triggers, currently only ProcessingTimeTrigger is
supported in the Capability Matrix
<https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what>.
Can we expect support for more triggers sometime soon? Also, are the
GroupByKey and accumulating-panes features supported for Spark streaming
pipelines?

3. According to the documentation, the storage level
<https://beam.apache.org/documentation/runners/spark/#pipeline-options-for-the-spark-runner>
is set to IN_MEMORY for streaming pipelines. Can we configure it to use disk
as well?

4. Is checkpointing supported for the Spark runner? In case a Beam pipeline
fails unexpectedly, can we read the state from the last run?

It would be great if someone could help with the above.
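On questions 3 and 4: the Spark runner documentation describes pipeline options for both the storage level and a checkpoint directory. A hedged sketch of how they might be passed on the command line; the main class and HDFS path are placeholders, and option names should be checked against the Spark runner docs for the Beam version in use:

```
# Placeholder main class and paths; --storageLevel and --checkpointDir
# are Spark-runner pipeline options per the runner documentation.
mvn exec:java -Dexec.mainClass=com.example.MyStreamingPipeline \
  -Dexec.args="--runner=SparkRunner \
               --sparkMaster=yarn \
               --storageLevel=MEMORY_AND_DISK \
               --checkpointDir=hdfs:///beam/checkpoints"
```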

-- 
Thanks & Regards,
Nishu Tayal