Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Thomas Weise
+1

Ran the quickstart with the Apex runner in embedded mode and on YARN.

It needed a couple of tweaks to get there, though.

1) Change the quickstart pom.xml apex-runner profile to include:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-yarn-client</artifactId>
  <version>${hadoop.version}</version>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
  <scope>runtime</scope>
</dependency>

2) After copying the fat jar to the cluster:

java -cp word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount \
  --inputFile=file:///tmp/input.txt --output=/tmp/counts \
  --embeddedExecution=false --configFile=beam-runners-apex.properties \
  --runner=ApexRunner

(This was on a single-node cluster, hence the local file path.)

The quickstart instructions suggest using *mvn exec:java* instead of *java*
- it generally isn't valid to assume that mvn and a build environment
exist on the edge node of a YARN cluster.
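
For reference, the quickstart's mvn-based invocation looks roughly like the
following (a sketch of the documented command, assuming the standard WordCount
example, the apex-runner profile, and placeholder input/output paths):

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--inputFile=/tmp/input.txt --output=/tmp/counts --runner=ApexRunner" \
  -Papex-runner

That works on a dev machine, but a plain java invocation like the one above is
what's actually available on a typical edge node.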



On Wed, Nov 22, 2017 at 2:12 PM, Nishu  wrote:

> Hi Eugene,
>
> I ran it on both  standalone flink(non Yarn) and  Flink on HDInsight
> Cluster(Yarn). Both ran successfully. :)
>
> Regards,
> Nishu
>
> On Wed, Nov 22, 2017 at 9:40 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > Thanks Nishu. So, if I understand correctly, your pipelines were running
> on
> > non-YARN, but you're planning to run with YARN?
> >
> > I meanwhile was able to get Flink running on Dataproc (YARN), and
> validated
> > quickstart and game examples.
> > At this point we need validation for Spark and Flink non-YARN [I think if
> > Nishu's runs were non-YARN, they'd give us enough confidence, combined
> with
> > the success of other validations of Spark and Flink runners?], and Apex
> on
> > YARN. However, it seems that in previous RCs we were not validating Apex
> on
> > YARN, only local cluster. Is it needed this time?
> >
> > On Wed, Nov 22, 2017 at 12:28 PM Nishu  wrote:
> >
> > > Hi Eugene,
> > >
> > > No, I didn't try with those instead I have my custom pipeline where
> Kafka
> > > topic is the source. I have defined a Global Window and processing time
> > > trigger to read the data. Further it runs some transformation i.e.
> > > GroupByKey and CoGroupByKey. on the windowed collections.
> > > I was running the same pipeline on direct runner and spark runner
> > earlier..
> > > Today gave it a try with Flink on Yarn.
> > >
> > > Best Regards,
> > > Nishu.
> > >
> > > On Wed, Nov 22, 2017 at 8:07 PM, Eugene Kirpichov <
> > > kirpic...@google.com.invalid> wrote:
> > >
> > > > Thanks Nishu! Can you clarify which pipeline you were running?
> > > > The validation spreadsheet includes 1) the quickstart and 2) mobile
> > game
> > > > walkthroughs. Was it one of these, or your custom pipeline?
> > > >
> > > > On Wed, Nov 22, 2017 at 10:20 AM Nishu  wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Typo in previous mail.  I meant Flink runner.
> > > > >
> > > > > Thanks,
> > > > > Nishu
> > > > > On Wed, 22 Nov 2017 at 19.17,
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I build a pipeline using RC 2.2 today and ran with runner on
> yarn.
> > > > > > It worked seamlessly for unbounded sources. Couldn’t see any
> issues
> > > > with
> > > > > > my pipeline so far :)
> > > > > >
> > > > > >
> > > > > > Thanks,Nishu
> > > > > >
> > > > > > On Wed, 22 Nov 2017 at 18.57, Reuven Lax
>  > >
> > > > > wrote:
> > > > > >
> > > > > >> Who is validating Flink and Yarn?
> > > > > >>
> > > > > >> On Tue, Nov 21, 2017 at 9:26 AM, Kenneth Knowles
> > > >  > > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >> > On Mon, Nov 20, 2017 at 5:01 PM, Eugene Kirpichov <
> > > > > >> > kirpic...@google.com.invalid> wrote:
> > > > > >> >
> > > > > >> > > In the verification spreadsheet, I'm not sure I understand
> the
> > > > > >> difference
> > > > > >> > > between the "YARN" and "Standalone cluster/service". Which
> is
> > > > > >> Dataproc?
> > > > > >> > It
> > > > > >> > > definitely uses YARN, but it is also a standalone
> > > cluster/service.
> > > > > >> Does
> > > > > >> > it
> > > > > >> > > count for both?
> > > > > >> > >
> > > > > >> >
> > > > > >> > No, it doesn't. A number of runners have their own non-YARN
> > > cluster
> > > > > >> mode. I
> > > > > >> > would expect that the launching experien

RE: Azure(ADLS) compatibility on Beam with Spark runner

2017-11-22 Thread Milan Chandna
I tried both ways.
I passed the ADL-specific configuration in --hdfsConfiguration, and have set up 
core-site.xml/hdfs-site.xml as well.
As I mentioned, it's an HDI + Spark cluster, so those things are already set up.
A Spark job (without Beam) is also able to read and write to ADLS on the same machine.

BTW, if authentication or recognizing ADL were the problem, it would have 
thrown an error like ADLFileSystem missing, or probably an access failure or 
something. Thoughts?

-Milan.

-Original Message-
From: Lukasz Cwik [mailto:lc...@google.com.INVALID] 
Sent: Thursday, November 23, 2017 5:05 AM
To: dev@beam.apache.org
Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner

In your example it seems as though your HDFS configuration doesn't contain any 
ADL specific configuration:  "--hdfsConfiguration='[{\"fs.defaultFS\":
\"hdfs://home/sample.txt\"]'"
Do you have a core-site.xml or hdfs-site.xml configured as per:
https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html?

From the documentation for --hdfsConfiguration:
A list of Hadoop configurations used to configure zero or more Hadoop 
filesystems. By default, Hadoop configuration is loaded from 'core-site.xml' 
and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and YARN_CONF_DIR environment 
variables. To specify configuration on the command-line, represent the value as 
a JSON list of JSON maps, where each map represents the entire configuration 
for a single Hadoop filesystem. For example 
--hdfsConfiguration='[{\"fs.default.name\":
\"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'
From:
https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45

On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré 
wrote:

> Hi,
>
> FYI, I'm in touch with Microsoft Azure team about that.
>
> We are testing the ADLS support via HDFS.
>
> I keep you posted.
>
> Regards
> JB
>
> On 11/22/2017 09:12 AM, Milan Chandna wrote:
>
>> Hi,
>>
>> Has anyone tried IO from (or to) an ADLS account on Beam with the Spark runner?
>> I was trying recently to do this but was unable to make it work.
>>
>> Steps that I tried:
>>
>>1.  Took HDI + Spark 1.6 cluster with default storage as ADLS account.
>>    2.  Built Apache Beam on that. Built to include Beam-2790
>> <https://issues.apache.org/jira/browse/BEAM-2790> fix which earlier I was
>> facing for ADL as well.
>>3.  Modified WordCount.java example to use HadoopFileSystemOptions
>>4.  Since HDI + Spark cluster has ADLS as defaultFS, tried 2 things
>>   *   Just gave the input path and output path as
>> adl://home/sample.txt and adl://home/output
>>   *   In addition to adl input and output path, also gave required
>> HDFS configuration with adl required configs as well.
>>
>> Both didn't work, btw.
>>    1.  Have checked ACLs and permissions. In fact, a similar job with 
>> the same paths works on Spark directly.
>>2.  Issues faced:
>>   *   For input, Beam is not able to find the path. Console log:
>> Filepattern adl://home/sample.txt matched 0 files with total size 0
>>   *   Output path always gets converted to relative path, something
>> like this: /home/user1/adl:/home/output/.tmp
>>
>>
>>
>>
>>
>> Debugging more into this but was checking if someone is already 
>> facing this and has some resolution.
>>
>>
>>
>> Here is a sample code and command I used.
>>
>>
>>
>>  HadoopFileSystemOptions options =
>>      PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
>>
>>  Pipeline p = Pipeline.create(options);
>>
>>  p.apply(TextIO.read().from(
>>          options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
>>   .apply(new CountWords())
>>   .apply(MapElements.via(new FormatAsTextFn()))
>>   .apply(TextIO.write().to("adl://home/output"));
>>
>>  p.run().waitUntilFinish();
>>
>>
>>
>>
>>
>> spark-submit --class org.apache.beam.examples.WordCount --master 
>> local beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner
>> --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'

Re: [VOTE] Fixing @yyy.com.INVALID mailing addresses

2017-11-22 Thread Robert Bradshaw
+1

On Wed, Nov 22, 2017, 10:10 PM Jean-Baptiste Onofré  wrote:

> +1
>
> Regards
> JB
>
> On 11/23/2017 12:25 AM, Lukasz Cwik wrote:
> > I have noticed that some e-mail addresses (notably @google.com) get
> > .INVALID suffixed onto it so per...@yyy.com become
> per...@yyy.com.INVALID
> > in the From: header.
> >
> > I have figured out that this is an issue with the way that our mail
> server
> > is configured and opened
> https://issues.apache.org/jira/browse/INFRA-15529.
> >
> > For those of us that are impacted, it makes it more difficult for users
> to
> > reply directly to the originator.
> >
> > Infra has asked to get consensus from PMC members before making the
> change
> > which I figured it would be easiest with a vote.
> >
> > Please vote:
> > +1 Update mail server to stop suffixing .INVALID
> > -1 Don't change mail server settings.
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [VOTE] Fixing @yyy.com.INVALID mailing addresses

2017-11-22 Thread Jean-Baptiste Onofré

+1

Regards
JB

On 11/23/2017 12:25 AM, Lukasz Cwik wrote:

I have noticed that some e-mail addresses (notably @google.com) get
.INVALID suffixed onto them, so per...@yyy.com becomes per...@yyy.com.INVALID
in the From: header.

I have figured out that this is an issue with the way that our mail server
is configured and opened https://issues.apache.org/jira/browse/INFRA-15529.

For those of us that are impacted, it makes it more difficult for users to
reply directly to the originator.

Infra has asked us to get consensus from PMC members before making the change,
which I figured would be easiest with a vote.

Please vote:
+1 Update mail server to stop suffixing .INVALID
-1 Don't change mail server settings.



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Fixing @yyy.com.INVALID mailing addresses

2017-11-22 Thread Pei HE
+1

On Thu, Nov 23, 2017 at 8:43 AM, Holden Karau  wrote:

> +1 (non-binding)
>
> On Wed, Nov 22, 2017 at 4:06 PM Kenneth Knowles 
> wrote:
>
> > +1
> >
> > On Wed, Nov 22, 2017 at 3:43 PM, Lukasz Cwik 
> > wrote:
> >
> > > +1
> > >
> > > On Wed, Nov 22, 2017 at 3:35 PM, Reuven Lax 
> > > wrote:
> > >
> > > > +1
> > > >
> > > > On Nov 22, 2017 3:29 PM, "Ben Sidhom" 
> > wrote:
> > > >
> > > > > I'm not a PMC member, but this would be especially valuable if it
> > > > > propagated DKIM signatures properly.
> > > > >
> > > > > On Wed, Nov 22, 2017 at 3:25 PM, Lukasz Cwik
> >  > > >
> > > > > wrote:
> > > > >
> > > > > > I have noticed that some e-mail addresses (notably @google.com)
> > get
> > > > > > .INVALID suffixed onto it so per...@yyy.com become
> > > > > per...@yyy.com.INVALID
> > > > > > in the From: header.
> > > > > >
> > > > > > I have figured out that this is an issue with the way that our
> mail
> > > > > server
> > > > > > is configured and opened https://issues.apache.org/
> > > > > jira/browse/INFRA-15529
> > > > > > .
> > > > > >
> > > > > > For those of us that are impacted, it makes it more difficult for
> > > users
> > > > > to
> > > > > > reply directly to the originator.
> > > > > >
> > > > > > Infra has asked to get consensus from PMC members before making
> the
> > > > > change
> > > > > > which I figured it would be easiest with a vote.
> > > > > >
> > > > > > Please vote:
> > > > > > +1 Update mail server to stop suffixing .INVALID
> > > > > > -1 Don't change mail server settings.
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -Ben
> > > > >
> > > >
> > >
> >
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: [VOTE] Fixing @yyy.com.INVALID mailing addresses

2017-11-22 Thread Holden Karau
+1 (non-binding)

On Wed, Nov 22, 2017 at 4:06 PM Kenneth Knowles 
wrote:

> +1
>
> On Wed, Nov 22, 2017 at 3:43 PM, Lukasz Cwik 
> wrote:
>
> > +1
> >
> > On Wed, Nov 22, 2017 at 3:35 PM, Reuven Lax 
> > wrote:
> >
> > > +1
> > >
> > > On Nov 22, 2017 3:29 PM, "Ben Sidhom" 
> wrote:
> > >
> > > > I'm not a PMC member, but this would be especially valuable if it
> > > > propagated DKIM signatures properly.
> > > >
> > > > On Wed, Nov 22, 2017 at 3:25 PM, Lukasz Cwik
>  > >
> > > > wrote:
> > > >
> > > > > I have noticed that some e-mail addresses (notably @google.com)
> get
> > > > > .INVALID suffixed onto it so per...@yyy.com become
> > > > per...@yyy.com.INVALID
> > > > > in the From: header.
> > > > >
> > > > > I have figured out that this is an issue with the way that our mail
> > > > server
> > > > > is configured and opened https://issues.apache.org/
> > > > jira/browse/INFRA-15529
> > > > > .
> > > > >
> > > > > For those of us that are impacted, it makes it more difficult for
> > users
> > > > to
> > > > > reply directly to the originator.
> > > > >
> > > > > Infra has asked to get consensus from PMC members before making the
> > > > change
> > > > > which I figured it would be easiest with a vote.
> > > > >
> > > > > Please vote:
> > > > > +1 Update mail server to stop suffixing .INVALID
> > > > > -1 Don't change mail server settings.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -Ben
> > > >
> > >
> >
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Fixing @yyy.com.INVALID mailing addresses

2017-11-22 Thread Kenneth Knowles
+1

On Wed, Nov 22, 2017 at 3:43 PM, Lukasz Cwik 
wrote:

> +1
>
> On Wed, Nov 22, 2017 at 3:35 PM, Reuven Lax 
> wrote:
>
> > +1
> >
> > On Nov 22, 2017 3:29 PM, "Ben Sidhom"  wrote:
> >
> > > I'm not a PMC member, but this would be especially valuable if it
> > > propagated DKIM signatures properly.
> > >
> > > On Wed, Nov 22, 2017 at 3:25 PM, Lukasz Cwik  >
> > > wrote:
> > >
> > > > I have noticed that some e-mail addresses (notably @google.com) get
> > > > .INVALID suffixed onto it so per...@yyy.com become
> > > per...@yyy.com.INVALID
> > > > in the From: header.
> > > >
> > > > I have figured out that this is an issue with the way that our mail
> > > server
> > > > is configured and opened https://issues.apache.org/
> > > jira/browse/INFRA-15529
> > > > .
> > > >
> > > > For those of us that are impacted, it makes it more difficult for
> users
> > > to
> > > > reply directly to the originator.
> > > >
> > > > Infra has asked to get consensus from PMC members before making the
> > > change
> > > > which I figured it would be easiest with a vote.
> > > >
> > > > Please vote:
> > > > +1 Update mail server to stop suffixing .INVALID
> > > > -1 Don't change mail server settings.
> > > >
> > >
> > >
> > >
> > > --
> > > -Ben
> > >
> >
>


Re: [VOTE] Fixing @yyy.com.INVALID mailing addresses

2017-11-22 Thread Lukasz Cwik
+1

On Wed, Nov 22, 2017 at 3:35 PM, Reuven Lax 
wrote:

> +1
>
> On Nov 22, 2017 3:29 PM, "Ben Sidhom"  wrote:
>
> > I'm not a PMC member, but this would be especially valuable if it
> > propagated DKIM signatures properly.
> >
> > On Wed, Nov 22, 2017 at 3:25 PM, Lukasz Cwik 
> > wrote:
> >
> > > I have noticed that some e-mail addresses (notably @google.com) get
> > > .INVALID suffixed onto it so per...@yyy.com become
> > per...@yyy.com.INVALID
> > > in the From: header.
> > >
> > > I have figured out that this is an issue with the way that our mail
> > server
> > > is configured and opened https://issues.apache.org/
> > jira/browse/INFRA-15529
> > > .
> > >
> > > For those of us that are impacted, it makes it more difficult for users
> > to
> > > reply directly to the originator.
> > >
> > > Infra has asked to get consensus from PMC members before making the
> > change
> > > which I figured it would be easiest with a vote.
> > >
> > > Please vote:
> > > +1 Update mail server to stop suffixing .INVALID
> > > -1 Don't change mail server settings.
> > >
> >
> >
> >
> > --
> > -Ben
> >
>


Re: Azure(ADLS) compatibility on Beam with Spark runner

2017-11-22 Thread Lukasz Cwik
In your example it seems as though your HDFS configuration doesn't contain
any ADL specific configuration:  "--hdfsConfiguration='[{\"fs.defaultFS\":
\"hdfs://home/sample.txt\"]'"
Do you have a core-site.xml or hdfs-site.xml configured as per:
https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html?

From the documentation for --hdfsConfiguration:
A list of Hadoop configurations used to configure zero or more Hadoop
filesystems. By default, Hadoop configuration is loaded from
'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and
YARN_CONF_DIR environment variables. To specify configuration on the
command-line, represent the value as a JSON list of JSON maps, where each
map represents the entire configuration for a single Hadoop filesystem. For
example --hdfsConfiguration='[{\"fs.default.name\":
\"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'
From:
https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
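
For ADLS specifically, the configuration map would need the adl filesystem
wired in. A rough, untested sketch (the exact key names vary by Hadoop version
- fs.adl.oauth2.* in newer releases, dfs.adls.oauth2.* in older ones - and the
account name, client id, credential, and token endpoint are placeholders):

--hdfsConfiguration='[{\"fs.defaultFS\": \"adl://<account>.azuredatalakestore.net\",
  \"fs.adl.impl\": \"org.apache.hadoop.fs.adl.AdlFileSystem\",
  \"fs.AbstractFileSystem.adl.impl\": \"org.apache.hadoop.fs.adl.Adl\",
  \"fs.adl.oauth2.access.token.provider.type\": \"ClientCredential\",
  \"fs.adl.oauth2.client.id\": \"<client id>\",
  \"fs.adl.oauth2.credential\": \"<credential>\",
  \"fs.adl.oauth2.refresh.url\": \"<token endpoint>\"}]'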

On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré 
wrote:

> Hi,
>
> FYI, I'm in touch with Microsoft Azure team about that.
>
> We are testing the ADLS support via HDFS.
>
> I keep you posted.
>
> Regards
> JB
>
> On 11/22/2017 09:12 AM, Milan Chandna wrote:
>
>> Hi,
>>
>> Has anyone tried IO from (or to) an ADLS account on Beam with the Spark runner?
>> I was trying recently to do this but was unable to make it work.
>>
>> Steps that I tried:
>>
>>1.  Took HDI + Spark 1.6 cluster with default storage as ADLS account.
>>2.  Built Apache Beam on that. Built to include Beam-2790<
>> https://issues.apache.org/jira/browse/BEAM-2790> fix which earlier I was
>> facing for ADL as well.
>>3.  Modified WordCount.java example to use HadoopFileSystemOptions
>>4.  Since HDI + Spark cluster has ADLS as defaultFS, tried 2 things
>>   *   Just gave the input path and output path as
>> adl://home/sample.txt and adl://home/output
>>   *   In addition to adl input and output path, also gave required
>> HDFS configuration with adl required configs as well.
>>
>> Both didn't work, btw.
>>    1.  Have checked ACLs and permissions. In fact, a similar job with the
>> same paths works on Spark directly.
>>2.  Issues faced:
>>   *   For input, Beam is not able to find the path. Console log:
>> Filepattern adl://home/sample.txt matched 0 files with total size 0
>>   *   Output path always gets converted to relative path, something
>> like this: /home/user1/adl:/home/output/.tmp
>>
>>
>>
>>
>>
>> Debugging more into this but was checking if someone is already facing
>> this and has some resolution.
>>
>>
>>
>> Here is a sample code and command I used.
>>
>>
>>
>>  HadoopFileSystemOptions options =
>>      PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
>>
>>  Pipeline p = Pipeline.create(options);
>>
>>  p.apply(TextIO.read().from(
>>          options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
>>   .apply(new CountWords())
>>   .apply(MapElements.via(new FormatAsTextFn()))
>>   .apply(TextIO.write().to("adl://home/output"));
>>
>>  p.run().waitUntilFinish();
>>
>>
>>
>>
>>
>> spark-submit --class org.apache.beam.examples.WordCount --master local
>> beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner
>> --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
>>
>>
>>
>>
>>
>> P.S: Created fat jar to use with spark just for testing. Is there any
>> other correct way of running it with Spark runner?
>>
>>
>>
>> -Milan.
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [VOTE] Fixing @yyy.com.INVALID mailing addresses

2017-11-22 Thread Reuven Lax
+1

On Nov 22, 2017 3:29 PM, "Ben Sidhom"  wrote:

> I'm not a PMC member, but this would be especially valuable if it
> propagated DKIM signatures properly.
>
> On Wed, Nov 22, 2017 at 3:25 PM, Lukasz Cwik 
> wrote:
>
> > I have noticed that some e-mail addresses (notably @google.com) get
> > .INVALID suffixed onto it so per...@yyy.com become
> per...@yyy.com.INVALID
> > in the From: header.
> >
> > I have figured out that this is an issue with the way that our mail
> server
> > is configured and opened https://issues.apache.org/
> jira/browse/INFRA-15529
> > .
> >
> > For those of us that are impacted, it makes it more difficult for users
> to
> > reply directly to the originator.
> >
> > Infra has asked to get consensus from PMC members before making the
> change
> > which I figured it would be easiest with a vote.
> >
> > Please vote:
> > +1 Update mail server to stop suffixing .INVALID
> > -1 Don't change mail server settings.
> >
>
>
>
> --
> -Ben
>


Re: [VOTE] Fixing @yyy.com.INVALID mailing addresses

2017-11-22 Thread Ben Sidhom
I'm not a PMC member, but this would be especially valuable if it
propagated DKIM signatures properly.

On Wed, Nov 22, 2017 at 3:25 PM, Lukasz Cwik 
wrote:

> I have noticed that some e-mail addresses (notably @google.com) get
> .INVALID suffixed onto it so per...@yyy.com become per...@yyy.com.INVALID
> in the From: header.
>
> I have figured out that this is an issue with the way that our mail server
> is configured and opened https://issues.apache.org/jira/browse/INFRA-15529
> .
>
> For those of us that are impacted, it makes it more difficult for users to
> reply directly to the originator.
>
> Infra has asked to get consensus from PMC members before making the change
> which I figured it would be easiest with a vote.
>
> Please vote:
> +1 Update mail server to stop suffixing .INVALID
> -1 Don't change mail server settings.
>



-- 
-Ben


[VOTE] Fixing @yyy.com.INVALID mailing addresses

2017-11-22 Thread Lukasz Cwik
I have noticed that some e-mail addresses (notably @google.com) get
.INVALID suffixed onto them, so per...@yyy.com becomes per...@yyy.com.INVALID
in the From: header.

I have figured out that this is an issue with the way that our mail server
is configured and opened https://issues.apache.org/jira/browse/INFRA-15529.

For those of us that are impacted, it makes it more difficult for users to
reply directly to the originator.

Infra has asked us to get consensus from PMC members before making the change,
which I figured would be easiest with a vote.

Please vote:
+1 Update mail server to stop suffixing .INVALID
-1 Don't change mail server settings.


Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Nishu
Hi Eugene,

I ran it on both standalone Flink (non-YARN) and Flink on an HDInsight
cluster (YARN). Both ran successfully. :)

Regards,
Nishu



On Wed, Nov 22, 2017 at 9:40 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Thanks Nishu. So, if I understand correctly, your pipelines were running on
> non-YARN, but you're planning to run with YARN?
>
> I meanwhile was able to get Flink running on Dataproc (YARN), and validated
> quickstart and game examples.
> At this point we need validation for Spark and Flink non-YARN [I think if
> Nishu's runs were non-YARN, they'd give us enough confidence, combined with
> the success of other validations of Spark and Flink runners?], and Apex on
> YARN. However, it seems that in previous RCs we were not validating Apex on
> YARN, only local cluster. Is it needed this time?
>
> On Wed, Nov 22, 2017 at 12:28 PM Nishu  wrote:
>
> > Hi Eugene,
> >
> > No, I didn't try with those instead I have my custom pipeline where Kafka
> > topic is the source. I have defined a Global Window and processing time
> > trigger to read the data. Further it runs some transformation i.e.
> > GroupByKey and CoGroupByKey. on the windowed collections.
> > I was running the same pipeline on direct runner and spark runner
> earlier..
> > Today gave it a try with Flink on Yarn.
> >
> > Best Regards,
> > Nishu.
> >
> > On Wed, Nov 22, 2017 at 8:07 PM, Eugene Kirpichov <
> > kirpic...@google.com.invalid> wrote:
> >
> > > Thanks Nishu! Can you clarify which pipeline you were running?
> > > The validation spreadsheet includes 1) the quickstart and 2) mobile
> game
> > > walkthroughs. Was it one of these, or your custom pipeline?
> > >
> > > On Wed, Nov 22, 2017 at 10:20 AM Nishu  wrote:
> > >
> > > > Hi,
> > > >
> > > > Typo in previous mail.  I meant Flink runner.
> > > >
> > > > Thanks,
> > > > Nishu
> > > > On Wed, 22 Nov 2017 at 19.17,
> > > >
> > > > > Hi,
> > > > >
> > > > > I build a pipeline using RC 2.2 today and ran with runner on yarn.
> > > > > It worked seamlessly for unbounded sources. Couldn’t see any issues
> > > with
> > > > > my pipeline so far :)
> > > > >
> > > > >
> > > > > Thanks,Nishu
> > > > >
> > > > > On Wed, 22 Nov 2017 at 18.57, Reuven Lax  >
> > > > wrote:
> > > > >
> > > > >> Who is validating Flink and Yarn?
> > > > >>
> > > > >> On Tue, Nov 21, 2017 at 9:26 AM, Kenneth Knowles
> > >  > > > >
> > > > >> wrote:
> > > > >>
> > > > >> > On Mon, Nov 20, 2017 at 5:01 PM, Eugene Kirpichov <
> > > > >> > kirpic...@google.com.invalid> wrote:
> > > > >> >
> > > > >> > > In the verification spreadsheet, I'm not sure I understand the
> > > > >> difference
> > > > >> > > between the "YARN" and "Standalone cluster/service". Which is
> > > > >> Dataproc?
> > > > >> > It
> > > > >> > > definitely uses YARN, but it is also a standalone
> > cluster/service.
> > > > >> Does
> > > > >> > it
> > > > >> > > count for both?
> > > > >> > >
> > > > >> >
> > > > >> > No, it doesn't. A number of runners have their own non-YARN
> > cluster
> > > > >> mode. I
> > > > >> > would expect that the launching experience might be different
> and
> > > the
> > > > >> > portable container management to differ. If they are identical,
> > > > experts
> > > > >> in
> > > > >> > those systems should feel free to coalesce the rows. Conversely,
> > as
> > > > >> other
> > > > >> > platforms become supported, they could be added or not based on
> > > > whether
> > > > >> > they are substantively different from a user experience or QA
> > point
> > > of
> > > > >> > view.
> > > > >> >
> > > > >> > Kenn
> > > > >> >
> > > > >> >
> > > > >> > > Seems now we're missing just Apex and Flink cluster
> > verifications.
> > > > >> > >
> > > > >> > > *though Spark runner took 6x longer to run UserScore,
> partially
> > I
> > > > >> guess
> > > > >> > > because it didn't do autoscaling (Dataflow runner ramped up
> to 5
> > > > >> workers
> > > > >> > > whereas Spark runner used 2 workers). For some reason Spark
> > runner
> > > > >> chose
> > > > >> > > not to split the 10GB input files into chunks.
> > > > >> > >
> > > > >> > > On Mon, Nov 20, 2017 at 3:46 PM Reuven Lax
> > >  > > > >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Done
> > > > >> > > >
> > > > >> > > > On Tue, Nov 21, 2017 at 3:08 AM, Robert Bradshaw <
> > > > >> > > > rober...

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Eugene Kirpichov
Thanks Nishu. So, if I understand correctly, your pipelines were running on
non-YARN, but you're planning to run with YARN?

I meanwhile was able to get Flink running on Dataproc (YARN), and validated
quickstart and game examples.
At this point we need validation for Spark and Flink non-YARN [I think if
Nishu's runs were non-YARN, they'd give us enough confidence, combined with
the success of other validations of Spark and Flink runners?], and Apex on
YARN. However, it seems that in previous RCs we were not validating Apex on
YARN, only local cluster. Is it needed this time?

On Wed, Nov 22, 2017 at 12:28 PM Nishu  wrote:

> Hi Eugene,
>
> No, I didn't try with those instead I have my custom pipeline where Kafka
> topic is the source. I have defined a Global Window and processing time
> trigger to read the data. Further it runs some transformation i.e.
> GroupByKey and CoGroupByKey. on the windowed collections.
> I was running the same pipeline on direct runner and spark runner earlier..
> Today gave it a try with Flink on Yarn.
>
> Best Regards,
> Nishu.
>
> On Wed, Nov 22, 2017 at 8:07 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > Thanks Nishu! Can you clarify which pipeline you were running?
> > The validation spreadsheet includes 1) the quickstart and 2) mobile game
> > walkthroughs. Was it one of these, or your custom pipeline?
> >
> > On Wed, Nov 22, 2017 at 10:20 AM Nishu  wrote:
> >
> > > Hi,
> > >
> > > Typo in previous mail.  I meant Flink runner.
> > >
> > > Thanks,
> > > Nishu
> > > On Wed, 22 Nov 2017 at 19.17,
> > >
> > > > Hi,
> > > >
> > > > I build a pipeline using RC 2.2 today and ran with runner on yarn.
> > > > It worked seamlessly for unbounded sources. Couldn’t see any issues
> > with
> > > > my pipeline so far :)
> > > >
> > > >
> > > > Thanks,Nishu
> > > >
> > > > On Wed, 22 Nov 2017 at 18.57, Reuven Lax 
> > > wrote:
> > > >
> > > >> Who is validating Flink and Yarn?
> > > >>
> > > >> On Tue, Nov 21, 2017 at 9:26 AM, Kenneth Knowles
> >  > > >
> > > >> wrote:
> > > >>
> > > >> > On Mon, Nov 20, 2017 at 5:01 PM, Eugene Kirpichov <
> > > >> > kirpic...@google.com.invalid> wrote:
> > > >> >
> > > >> > > In the verification spreadsheet, I'm not sure I understand the
> > > >> difference
> > > >> > > between the "YARN" and "Standalone cluster/service". Which is
> > > >> Dataproc?
> > > >> > It
> > > >> > > definitely uses YARN, but it is also a standalone
> cluster/service.
> > > >> Does
> > > >> > it
> > > >> > > count for both?
> > > >> > >
> > > >> >
> > > >> > No, it doesn't. A number of runners have their own non-YARN
> cluster
> > > >> mode. I
> > > >> > would expect that the launching experience might be different and
> > the
> > > >> > portable container management to differ. If they are identical,
> > > experts
> > > >> in
> > > >> > those systems should feel free to coalesce the rows. Conversely,
> as
> > > >> other
> > > >> > platforms become supported, they could be added or not based on
> > > whether
> > > >> > they are substantively different from a user experience or QA
> point
> > of
> > > >> > view.
> > > >> >
> > > >> > Kenn
> > > >> >
> > > >> >
> > > >> > > Seems now we're missing just Apex and Flink cluster
> verifications.
> > > >> > >
> > > >> > > *though Spark runner took 6x longer to run UserScore, partially
> I
> > > >> guess
> > > >> > > because it didn't do autoscaling (Dataflow runner ramped up to 5
> > > >> workers
> > > >> > > whereas Spark runner used 2 workers). For some reason Spark
> runner
> > > >> chose
> > > >> > > not to split the 10GB input files into chunks.
> > > >> > >
> > > >> > > On Mon, Nov 20, 2017 at 3:46 PM Reuven Lax
> >  > > >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Done
> > > >> > > >
> > > >> > > > On Tue, Nov 21, 2017 at 3:08 AM, Robert Bradshaw <
> > > >> > > > rober...@google.com.invalid> wrote:
> > > >> > > >
> > > >> > > > > Thanks. You need to re-sign as well.
> > > >> > > > >
> > > >> > > > > On Mon, Nov 20, 2017 at 12:14 AM, Reuven Lax
> > > >> >  > > >> > > >
> > > >> > > > > wrote:
> > > >> > > > > > FYI these generated files have been removed from the
> source
> > > >> > > > distribution.
> > > >> > > > > >
> > > >> > > > > > On Sat, Nov 18, 2017 at 9:09 AM, Reuven Lax <
> > re...@google.com
> > > >
> > > >> > > wrote:
> > > >> > > > > >
> > > >> > > > > >> hmmm, I thought I removed those generated files from the
> > zip
> > > >> file
> > > >> > > > before
> > > >> > > > > >> sending this email. Let me check again.
> > > >> > > > > >>
> > > >> > > > > >> Reuven
> > > >> > > > > >>
> > > >> > > > > >> On Sat, Nov 18, 2017 at 8:52 AM, Robert Bradshaw <
> > > >> > > 

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Nishu
Hi Eugene,

No, I didn't try those; instead I have my custom pipeline where a Kafka
topic is the source. I have defined a global window and a processing-time
trigger to read the data. Further, it runs some transformations, i.e.
GroupByKey and CoGroupByKey, on the windowed collections.
I was running the same pipeline on the direct runner and Spark runner earlier.
Today I gave it a try with Flink on YARN.

Best Regards,
Nishu.



On Wed, Nov 22, 2017 at 8:07 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Thanks Nishu! Can you clarify which pipeline you were running?
> The validation spreadsheet includes 1) the quickstart and 2) mobile game
> walkthroughs. Was it one of these, or your custom pipeline?
>
> On Wed, Nov 22, 2017 at 10:20 AM Nishu  wrote:
>
> > Hi,
> >
> > Typo in previous mail.  I meant Flink runner.
> >
> > Thanks,
> > Nishu
> > On Wed, 22 Nov 2017 at 19.17,
> >
> > > Hi,
> > >
> > > I build a pipeline using RC 2.2 today and ran with runner on yarn.
> > > It worked seamlessly for unbounded sources. Couldn’t see any issues
> with
> > > my pipeline so far :)
> > >
> > >
> > > Thanks,Nishu
> > >
> > > On Wed, 22 Nov 2017 at 18.57, Reuven Lax 
> > wrote:
> > >
> > >> Who is validating Flink and Yarn?
> > >>
> > >> On Tue, Nov 21, 2017 at 9:26 AM, Kenneth Knowles
>  > >
> > >> wrote:
> > >>
> > >> > On Mon, Nov 20, 2017 at 5:01 PM, Eugene Kirpichov <
> > >> > kirpic...@google.com.invalid> wrote:
> > >> >
> > >> > > In the verification spreadsheet, I'm not sure I understand the
> > >> difference
> > >> > > between the "YARN" and "Standalone cluster/service". Which is
> > >> Dataproc?
> > >> > It
> > >> > > definitely uses YARN, but it is also a standalone cluster/service.
> > >> Does
> > >> > it
> > >> > > count for both?
> > >> > >
> > >> >
> > >> > No, it doesn't. A number of runners have their own non-YARN cluster
> > >> mode. I
> > >> > would expect that the launching experience might be different and
> the
> > >> > portable container management to differ. If they are identical,
> > experts
> > >> in
> > >> > those systems should feel free to coalesce the rows. Conversely, as
> > >> other
> > >> > platforms become supported, they could be added or not based on
> > whether
> > >> > they are substantively different from a user experience or QA point
> of
> > >> > view.
> > >> >
> > >> > Kenn
> > >> >
> > >> >
> > >> > > Seems now we're missing just Apex and Flink cluster verifications.
> > >> > >
> > >> > > *though Spark runner took 6x longer to run UserScore, partially I
> > >> guess
> > >> > > because it didn't do autoscaling (Dataflow runner ramped up to 5
> > >> workers
> > >> > > whereas Spark runner used 2 workers). For some reason Spark runner
> > >> chose
> > >> > > not to split the 10GB input files into chunks.
> > >> > >
> > >> > > On Mon, Nov 20, 2017 at 3:46 PM Reuven Lax
>  > >
> > >> > > wrote:
> > >> > >
> > >> > > > Done
> > >> > > >
> > >> > > > On Tue, Nov 21, 2017 at 3:08 AM, Robert Bradshaw <
> > >> > > > rober...@google.com.invalid> wrote:
> > >> > > >
> > >> > > > > Thanks. You need to re-sign as well.
> > >> > > > >
> > >> > > > > On Mon, Nov 20, 2017 at 12:14 AM, Reuven Lax
> > >> >  > >> > > >
> > >> > > > > wrote:
> > >> > > > > > FYI these generated files have been removed from the source
> > >> > > > distribution.
> > >> > > > > >
> > >> > > > > > On Sat, Nov 18, 2017 at 9:09 AM, Reuven Lax <
> re...@google.com
> > >
> > >> > > wrote:
> > >> > > > > >
> > >> > > > > >> hmmm, I thought I removed those generated files from the
> zip
> > >> file
> > >> > > > before
> > >> > > > > >> sending this email. Let me check again.
> > >> > > > > >>
> > >> > > > > >> Reuven
> > >> > > > > >>
> > >> > > > > >> On Sat, Nov 18, 2017 at 8:52 AM, Robert Bradshaw <
> > >> > > > > >> rober...@google.com.invalid> wrote:
> > >> > > > > >>
> > >> > > > > >>> The source distribution contains a couple of files not on
> > >> github
> > >> > > > (e.g.
> > >> > > > > >>> folders that were added on master, Python generated
> files).
> > >> The
> > >> > pom
> > >> > > > > >>> files differed only by missing -SNAPSHOT, other than that
> > >> > > presumably
> > >> > > > > >>> the source release should just be "wget
> > >> > > > > >>> https://github.com/apache/beam/archive/release-2.2.0.zip
> "?
> > >> > > > > >>>
> > >> > > > > >>> diff -rq apache-beam-2.2.0 beam/ | grep -v pom.xml
> > >> > > > > >>>
> > >> > > > > >>> # OK?
> > >> > > > > >>>
> > >> > > > > >>> Only in apache-beam-2.2.0: DEPENDENCIES
> > >> > > > > >>>
> > >> > > > > >>> # Expected.
> > >> > > > > >>>
> > >> > > > > >>> Only in beam/: .git
> > >> > > > > >>> Only in beam/: .gitattributes
> > >> > > > > >>> Only in beam/:

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Kenneth Knowles
+1 (binding)

On Wed, Nov 22, 2017 at 11:43 AM, Max Barrios 
wrote:

> +1 (non-binding)
>
> Sent from my iPhone
>
> > On Nov 20, 2017, at 12:47 AM, Jean-Baptiste Onofré 
> wrote:
> >
> > Yeah, I have a Jira about that.
> >
> > You just have to update the existing symlink to point on the new release.
> >
> > I will update the release guide asap.
> >
> > Thanks !
> > Regards
> > JB
> >
> >> On 11/20/2017 09:39 AM, Reuven Lax wrote:
> >> Good point. How do I update that link? I think this step might be
> missing
> >> from our release guide (or I simply can't find it).
> >> Reuven
> >> On Mon, Nov 20, 2017 at 4:24 PM, Jean-Baptiste Onofré 
> >> wrote:
> >>> +1 (binding)
> >>>
> >>> Tested on Beam samples, it looks good.
> >>>
> >>> Checked the legal/license, content. OK for me.
> >>>
> >>> Just a side note: don't forget to update the latest link on
> >>> dist.apache.org.
> >>>
> >>> Thanks
> >>> Regards
> >>> JB
> >>>
> >>>
>  On 11/17/2017 07:08 AM, Reuven Lax wrote:
> 
>  Hi everyone,
> 
>  Please review and vote on the release candidate #4 for the version
> 2.2.0,
>  as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific
> comments)
> 
> 
>  The complete staging area is available for your review, which
> includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to
> dist.apache.org
>  [2],
>  which is signed with the key with fingerprint B98B7708 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.2.0-RC4" [5],
> * website pull request listing the release and publishing the API
>  reference manual [6].
> * Java artifacts were built with Maven 3.5.0 and OpenJDK/Oracle JDK
>  1.8.0_144.
> * Python artifacts are deployed along with the source release to
> the
>  dist.apache.org [2].
> 
>  The vote will be open for at least 72 hours. It is adopted by majority
>  approval, with at least 3 PMC affirmative votes.
> 
>  Thanks,
>  Reuven
> 
>  [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?p
>  rojectId=12319527&version=12341044
>  [2] https://dist.apache.org/repos/dist/dev/beam/2.2.0/
>  [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>  [4] https://repository.apache.org/content/repositories/orgapache
>  beam-1025/
>  [5] https://github.com/apache/beam/tree/v2.2.0-RC4
>  
>  [6] https://github.com/apache/beam-site/pull/337
> 
> 
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >>>
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>


Dataflow pipeline problem: Streaming data combined with large bounded data

2017-11-22 Thread Taylor Coleman
Hello,


I have a Dataflow streaming pipeline where I need to consume queue messages 
(PubsubIO) and send each one through a very large lookup table from a database 
to join relevant values. I have tried the following methods:


1) Side Input

In this approach I kept the large lookup table as a side input (static map 
object) and sent each message through the map on each machine. Obviously this 
is not ideal for a streaming process, but it did work with smaller test 
datasets. However, with larger, production-sized datasets, the side input 
continually failed with the error:


  exception:  "java.lang.RuntimeException: 
org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: 
ByteString would be too long: 590128395+1811498110
at 
com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
at 
com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:104)

... which implies a hard limit on the size of side inputs (the two byte counts 
in the message sum to roughly 2.4 GB, past the 2 GB protobuf ByteString 
maximum). So I've been working on the following strategy instead:


2) Key-value pair grouping

This is the main problem I've been dealing with. In this approach I attempted 
to convert both the large map table and the incoming queue messages to 
Key-Value pairs, and then combine the messages with the relevant map data using 
a Flatten merge transform and GroupByKey transform. Although this runs without 
error, it never outputs any data.


The core of the problem is here: streaming messages require a windowing 
strategy if they are to be merged with other data. However, the logic of 
windowing doesn't make sense for the large map dataset - it should all be one 
window. All of the data needs to be sent at once, and it needs to be sent with 
every small batch of Pubsub messages streaming in (over and over again). I 
attempted to window the map dataset with a static timestamp and then to create 
a window large enough to apply to all items - but of course, this just sends 
all the map data through the stream once in a single window, and then never 
again.


I am facing a situation where I need to repeatedly process a large amount of 
static, bounded data in a continuous stream - is the only solution to this to 
run many batch jobs?
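
One pattern that might fit this shape - sketched here as an illustration, not a 
definitive fix - is a "slowly updating" side input: regenerate the lookup table 
on a timer into a global-windowed side input with a repeating processing-time 
trigger, so every incoming message sees the latest snapshot. In the sketch 
below, readLookupTable(), keyOf(), and messages (the PCollection<String> read 
from PubsubIO) are assumed names, and the refresh period is arbitrary:

  PCollectionView<Map<String, String>> lookupView = p
      .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(15)))
      .apply(Window.<Long>into(new GlobalWindows())
          .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
          .withAllowedLateness(Duration.ZERO)
          .discardingFiredPanes())
      .apply(ParDo.of(new DoFn<Long, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          // Re-emit the whole table on every tick; with discarding panes the
          // side input then reflects only the freshest snapshot.
          for (Map.Entry<String, String> e : readLookupTable().entrySet()) {
            c.output(KV.of(e.getKey(), e.getValue()));
          }
        }
      }))
      .apply(View.asMap());

  // Join each incoming message against the latest snapshot of the table.
  messages.apply(ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
      Map<String, String> lookup = c.sideInput(lookupView);
      c.output(c.element() + "," + lookup.get(keyOf(c.element())));
    }
  }).withSideInputs(lookupView));

Note that the side-input size ceiling from approach (1) would still apply, so 
this only helps if each snapshot fits within it; otherwise sharding the table 
across several map-valued side inputs may be needed.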


Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Max Barrios
+1 (non-binding)

Sent from my iPhone

> On Nov 20, 2017, at 12:47 AM, Jean-Baptiste Onofré  wrote:
> 
> Yeah, I have a Jira about that.
> 
> You just have to update the existing symlink to point on the new release.
> 
> I will update the release guide asap.
> 
> Thanks !
> Regards
> JB
> 
>> On 11/20/2017 09:39 AM, Reuven Lax wrote:
>> Good point. How do I update that link? I think this step might be missing
>> from our release guide (or I simply can't find it).
>> Reuven
>> On Mon, Nov 20, 2017 at 4:24 PM, Jean-Baptiste Onofré 
>> wrote:
>>> +1 (binding)
>>> 
>>> Tested on Beam samples, it looks good.
>>> 
>>> Checked the legal/license, content. OK for me.
>>> 
>>> Just a side note: don't forget to update the latest link on
>>> dist.apache.org.
>>> 
>>> Thanks
>>> Regards
>>> JB
>>> 
>>> 
 On 11/17/2017 07:08 AM, Reuven Lax wrote:
 
 Hi everyone,
 
 Please review and vote on the release candidate #4 for the version 2.2.0,
 as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)
 
 
 The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org
 [2],
 which is signed with the key with fingerprint B98B7708 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.2.0-RC4" [5],
* website pull request listing the release and publishing the API
 reference manual [6].
* Java artifacts were built with Maven 3.5.0 and OpenJDK/Oracle JDK
 1.8.0_144.
* Python artifacts are deployed along with the source release to the
 dist.apache.org [2].
 
 The vote will be open for at least 72 hours. It is adopted by majority
 approval, with at least 3 PMC affirmative votes.
 
 Thanks,
 Reuven
 
 [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?p
 rojectId=12319527&version=12341044
 [2] https://dist.apache.org/repos/dist/dev/beam/2.2.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4] https://repository.apache.org/content/repositories/orgapache
 beam-1025/
 [5] https://github.com/apache/beam/tree/v2.2.0-RC4
 
 [6] https://github.com/apache/beam-site/pull/337
 
 
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>> 
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Version 2.2.0 release date

2017-11-22 Thread Ahmet Altay
Hi Stefania,

Release candidate for 2.2.0 is currently being voted [1]. The release will
happen after a successful vote.

Ahmet

[1] https://lists.apache.org/thread.html/da2acabdb15c9f8d11351f9167633a
4b089664fe3cce014ba619c937@%3Cdev.beam.apache.org%3E

On Mon, Nov 20, 2017 at 7:04 AM, Stefania Mantisi <
stefania.mant...@noovle.it> wrote:

> Hi everyone,
> I saw all the issues are currently marked as having been fixed.
> When will version 2.2 come out?
>
> Thank you!
>
> Stefania Mantisi
>
> --
> *Stefania Mantisi *
> Software Engineer - Cloud Development - Noovle S.r.l.
> mail: stefania.mant...@noovle.it
> 
>
> Noovle  | The Nexus of forces
>


Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Eugene Kirpichov
Thanks Nishu! Can you clarify which pipeline you were running?
The validation spreadsheet includes 1) the quickstart and 2) mobile game
walkthroughs. Was it one of these, or your custom pipeline?

On Wed, Nov 22, 2017 at 10:20 AM Nishu  wrote:

> Hi,
>
> Typo in previous mail.  I meant Flink runner.
>
> Thanks,
> Nishu
> On Wed, 22 Nov 2017 at 19.17,
>
> > Hi,
> >
> > I build a pipeline using RC 2.2 today and ran with runner on yarn.
> > It worked seamlessly for unbounded sources. Couldn’t see any issues with
> > my pipeline so far :)
> >
> >
> > Thanks,Nishu
> >
> > On Wed, 22 Nov 2017 at 18.57, Reuven Lax 
> wrote:
> >
> >> Who is validating Flink and Yarn?
> >>
> >> On Tue, Nov 21, 2017 at 9:26 AM, Kenneth Knowles  >
> >> wrote:
> >>
> >> > On Mon, Nov 20, 2017 at 5:01 PM, Eugene Kirpichov <
> >> > kirpic...@google.com.invalid> wrote:
> >> >
> >> > > In the verification spreadsheet, I'm not sure I understand the
> >> difference
> >> > > between the "YARN" and "Standalone cluster/service". Which is
> >> Dataproc?
> >> > It
> >> > > definitely uses YARN, but it is also a standalone cluster/service.
> >> Does
> >> > it
> >> > > count for both?
> >> > >
> >> >
> >> > No, it doesn't. A number of runners have their own non-YARN cluster
> >> mode. I
> >> > would expect that the launching experience might be different and the
> >> > portable container management to differ. If they are identical,
> experts
> >> in
> >> > those systems should feel free to coalesce the rows. Conversely, as
> >> other
> >> > platforms become supported, they could be added or not based on
> whether
> >> > they are substantively different from a user experience or QA point of
> >> > view.
> >> >
> >> > Kenn
> >> >
> >> >
> >> > > Seems now we're missing just Apex and Flink cluster verifications.
> >> > >
> >> > > *though Spark runner took 6x longer to run UserScore, partially I
> >> guess
> >> > > because it didn't do autoscaling (Dataflow runner ramped up to 5
> >> workers
> >> > > whereas Spark runner used 2 workers). For some reason Spark runner
> >> chose
> >> > > not to split the 10GB input files into chunks.
> >> > >
> >> > > On Mon, Nov 20, 2017 at 3:46 PM Reuven Lax  >
> >> > > wrote:
> >> > >
> >> > > > Done
> >> > > >
> >> > > > On Tue, Nov 21, 2017 at 3:08 AM, Robert Bradshaw <
> >> > > > rober...@google.com.invalid> wrote:
> >> > > >
> >> > > > > Thanks. You need to re-sign as well.
> >> > > > >
> >> > > > > On Mon, Nov 20, 2017 at 12:14 AM, Reuven Lax
> >> >  >> > > >
> >> > > > > wrote:
> >> > > > > > FYI these generated files have been removed from the source
> >> > > > distribution.
> >> > > > > >
> >> > > > > > On Sat, Nov 18, 2017 at 9:09 AM, Reuven Lax  >
> >> > > wrote:
> >> > > > > >
> >> > > > > >> hmmm, I thought I removed those generated files from the zip
> >> file
> >> > > > before
> >> > > > > >> sending this email. Let me check again.
> >> > > > > >>
> >> > > > > >> Reuven
> >> > > > > >>
> >> > > > > >> On Sat, Nov 18, 2017 at 8:52 AM, Robert Bradshaw <
> >> > > > > >> rober...@google.com.invalid> wrote:
> >> > > > > >>
> >> > > > > >>> The source distribution contains a couple of files not on
> >> github
> >> > > > (e.g.
> >> > > > > >>> folders that were added on master, Python generated files).
> >> The
> >> > pom
> >> > > > > >>> files differed only by missing -SNAPSHOT, other than that
> >> > > presumably
> >> > > > > >>> the source release should just be "wget
> >> > > > > >>> https://github.com/apache/beam/archive/release-2.2.0.zip";?
> >> > > > > >>>
> >> > > > > >>> diff -rq apache-beam-2.2.0 beam/ | grep -v pom.xml
> >> > > > > >>>
> >> > > > > >>> # OK?
> >> > > > > >>>
> >> > > > > >>> Only in apache-beam-2.2.0: DEPENDENCIES
> >> > > > > >>>
> >> > > > > >>> # Expected.
> >> > > > > >>>
> >> > > > > >>> Only in beam/: .git
> >> > > > > >>> Only in beam/: .gitattributes
> >> > > > > >>> Only in beam/: .gitignore
> >> > > > > >>>
> >> > > > > >>> # These folders are probably from switching around between
> >> master
> >> > > and
> >> > > > > >>> git branches.
> >> > > > > >>>
> >> > > > > >>> Only in apache-beam-2.2.0: model
> >> > > > > >>> Only in apache-beam-2.2.0/runners/flink: examples
> >> > > > > >>> Only in apache-beam-2.2.0/runners/flink: runner
> >> > > > > >>> Only in apache-beam-2.2.0/runners/gearpump: jarstore
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/extensions: gcp-core
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/extensions: sketching
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/io: file-based-io-tests
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/io: hdfs
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/
> >> > maven-archetypes/examples/src/
> >> > > ma
> >> > > > > >>> in/resources/archetype-resources:
> >> > > > > >>> src
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/
> >> maven-archetypes/examples-
> >> > > java8/
> >> > > > > >>> src/main/resources/archetype-resources:
> >> > > > > >>> src
> >> > > > > >>> Only in apache-beam-2.2.

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Reuven Lax
Please update the spreadsheet.

On Wed, Nov 22, 2017 at 10:19 AM, Nishu  wrote:

> Hi,
>
> Typo in previous mail.  I meant Flink runner.
>
> Thanks,
> Nishu
> On Wed, 22 Nov 2017 at 19.17,
>
> > Hi,
> >
> > I build a pipeline using RC 2.2 today and ran with runner on yarn.
> > It worked seamlessly for unbounded sources. Couldn’t see any issues with
> > my pipeline so far :)
> >
> >
> > Thanks,Nishu
> >
> > On Wed, 22 Nov 2017 at 18.57, Reuven Lax 
> wrote:
> >
> >> Who is validating Flink and Yarn?
> >>
> >> On Tue, Nov 21, 2017 at 9:26 AM, Kenneth Knowles  >
> >> wrote:
> >>
> >> > On Mon, Nov 20, 2017 at 5:01 PM, Eugene Kirpichov <
> >> > kirpic...@google.com.invalid> wrote:
> >> >
> >> > > In the verification spreadsheet, I'm not sure I understand the
> >> difference
> >> > > between the "YARN" and "Standalone cluster/service". Which is
> >> Dataproc?
> >> > It
> >> > > definitely uses YARN, but it is also a standalone cluster/service.
> >> Does
> >> > it
> >> > > count for both?
> >> > >
> >> >
> >> > No, it doesn't. A number of runners have their own non-YARN cluster
> >> mode. I
> >> > would expect that the launching experience might be different and the
> >> > portable container management to differ. If they are identical,
> experts
> >> in
> >> > those systems should feel free to coalesce the rows. Conversely, as
> >> other
> >> > platforms become supported, they could be added or not based on
> whether
> >> > they are substantively different from a user experience or QA point of
> >> > view.
> >> >
> >> > Kenn
> >> >
> >> >
> >> > > Seems now we're missing just Apex and Flink cluster verifications.
> >> > >
> >> > > *though Spark runner took 6x longer to run UserScore, partially I
> >> guess
> >> > > because it didn't do autoscaling (Dataflow runner ramped up to 5
> >> workers
> >> > > whereas Spark runner used 2 workers). For some reason Spark runner
> >> chose
> >> > > not to split the 10GB input files into chunks.
> >> > >
> >> > > On Mon, Nov 20, 2017 at 3:46 PM Reuven Lax  >
> >> > > wrote:
> >> > >
> >> > > > Done
> >> > > >
> >> > > > On Tue, Nov 21, 2017 at 3:08 AM, Robert Bradshaw <
> >> > > > rober...@google.com.invalid> wrote:
> >> > > >
> >> > > > > Thanks. You need to re-sign as well.
> >> > > > >
> >> > > > > On Mon, Nov 20, 2017 at 12:14 AM, Reuven Lax
> >> >  >> > > >
> >> > > > > wrote:
> >> > > > > > FYI these generated files have been removed from the source
> >> > > > distribution.
> >> > > > > >
> >> > > > > > On Sat, Nov 18, 2017 at 9:09 AM, Reuven Lax  >
> >> > > wrote:
> >> > > > > >
> >> > > > > >> hmmm, I thought I removed those generated files from the zip
> >> file
> >> > > > before
> >> > > > > >> sending this email. Let me check again.
> >> > > > > >>
> >> > > > > >> Reuven
> >> > > > > >>
> >> > > > > >> On Sat, Nov 18, 2017 at 8:52 AM, Robert Bradshaw <
> >> > > > > >> rober...@google.com.invalid> wrote:
> >> > > > > >>
> >> > > > > >>> The source distribution contains a couple of files not on
> >> github
> >> > > > (e.g.
> >> > > > > >>> folders that were added on master, Python generated files).
> >> The
> >> > pom
> >> > > > > >>> files differed only by missing -SNAPSHOT, other than that
> >> > > presumably
> >> > > > > >>> the source release should just be "wget
> >> > > > > >>> https://github.com/apache/beam/archive/release-2.2.0.zip";?
> >> > > > > >>>
> >> > > > > >>> diff -rq apache-beam-2.2.0 beam/ | grep -v pom.xml
> >> > > > > >>>
> >> > > > > >>> # OK?
> >> > > > > >>>
> >> > > > > >>> Only in apache-beam-2.2.0: DEPENDENCIES
> >> > > > > >>>
> >> > > > > >>> # Expected.
> >> > > > > >>>
> >> > > > > >>> Only in beam/: .git
> >> > > > > >>> Only in beam/: .gitattributes
> >> > > > > >>> Only in beam/: .gitignore
> >> > > > > >>>
> >> > > > > >>> # These folders are probably from switching around between
> >> master
> >> > > and
> >> > > > > >>> git branches.
> >> > > > > >>>
> >> > > > > >>> Only in apache-beam-2.2.0: model
> >> > > > > >>> Only in apache-beam-2.2.0/runners/flink: examples
> >> > > > > >>> Only in apache-beam-2.2.0/runners/flink: runner
> >> > > > > >>> Only in apache-beam-2.2.0/runners/gearpump: jarstore
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/extensions: gcp-core
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/extensions: sketching
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/io: file-based-io-tests
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/io: hdfs
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/maven-archetypes/examples/src/main/resources/archetype-resources: src
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java/maven-archetypes/examples-java8/src/main/resources/archetype-resources: src
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/java: microbenchmarks
> >> > > > > >>>
> >> > > > > >>> # Here's the generated protos.
> >> > > > > >>>
> >> > > > > >>> Only in apache-beam-2.2.0/sdks/pytho

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Konstantinos Katsiapis
+1 (non-binding)

Beam 2.2 is blocking the release of tensorflow.transform 0.4.


Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Nishu
Hi,

Typo in previous mail: I meant the Flink runner.

Thanks,
Nishu

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Nishu
Hi,

I built a pipeline using RC 2.2 today and ran it with the runner on YARN.
It worked seamlessly for unbounded sources. I couldn't see any issues with my
pipeline so far :)

Thanks,
Nishu


Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-22 Thread Reuven Lax
Who is validating Flink and YARN?


Re: Issues processing 150K files with DataflowRunner

2017-11-22 Thread Chamikara Jayalath
Thanks. Note that shards generated by a ReadAll transform will not support
dynamic work rebalancing, but this should not matter when the number of
shards is large. The long-term solution is Splittable DoFn, which is in the
works.

- Cham
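
Concretely, a ReadAll-style transform for VcfSource, modeled on
ReadAllFromText in textio.py, could look roughly like the sketch below.
ReadAllFiles is the helper that textio itself uses; ReadAllFromVcf, the
VcfSource import, and its constructor arguments are illustrative
assumptions, not the actual gcp-variant-transforms API.

# A minimal sketch, assuming the project exposes a VcfSource class that
# takes a single file name (hypothetical import below).
import apache_beam as beam
from apache_beam.io.filebasedsource import ReadAllFiles
from apache_beam.io.filesystem import CompressionTypes
# from gcp_variant_transforms.beam_io.vcfio import VcfSource  # hypothetical

class ReadAllFromVcf(beam.PTransform):
  """Reads a PCollection of VCF file patterns.

  Files are matched and split by workers at execution time, so the
  pipeline avoids the cap on the number of BoundedSource objects
  created during initial splitting at submission time.
  """

  DEFAULT_DESIRED_BUNDLE_SIZE = 64 * 1024 * 1024  # 64MB, as in textio

  def __init__(self, desired_bundle_size=DEFAULT_DESIRED_BUNDLE_SIZE):
    super(ReadAllFromVcf, self).__init__()
    # Build one splittable source per matched file.
    source_from_file = lambda file_name: VcfSource(file_name)
    self._read_all_files = ReadAllFiles(
        True,                   # splittable
        CompressionTypes.AUTO,  # compression_type
        desired_bundle_size,
        0,                      # min_bundle_size
        source_from_file)

  def expand(self, pvalue):
    return pvalue | 'ReadAllFiles' >> self._read_all_files

Usage would then be along the lines of
p | beam.Create(file_patterns) | ReadAllFromVcf(), trading dynamic work
rebalancing for file expansion that scales to ~150K files.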



Re: Issues processing 150K files with DataflowRunner

2017-11-22 Thread Asha Rostamianfar
Thanks a lot, Cham! Yes, it looks like we need a ReadAll transform similar
to the ones in TextIO and AvroIO :) We'll implement this.

On Tue, Nov 21, 2017 at 1:05 PM, Chamikara Jayalath 
wrote:

> I suspect that you might be hitting the Dataflow API limit for messages
> during initial splitting of the source. Some details are available under
> "Total number of BoundedSource objects" at the link below (you should see a
> similar message in the worker logs, but the exact error message might be
> out of date).
> https://cloud.google.com/dataflow/pipelines/troubleshooting-your-pipeline
>
> The exact number of files you can support depends on the size of generated
> splits (usually about 400k for TextIO).
>
> One solution for this is to develop a ReadAll() transform for VcfSource
> similar to the following available for TextIO.
> https://github.com/apache/beam/blob/master/sdks/python/
> apache_beam/io/textio.py#L409
>
> Thanks,
> Cham
>
>
> On Tue, Nov 21, 2017 at 8:04 AM Asha Rostamianfar
>  wrote:
>
> > Hi,
> >
> > I'm wondering whether anyone has tried processing a large number (~150K) of
> > files using DataflowRunner? We are seeing a behavior where the Dataflow job
> > starts but never attaches any workers. After 1h, it cancels the job due to
> > being "stuck". See logs here
> > <https://02931532374587840286.googlegroups.com/attach/3d44192c94959/log?part=0.1&view=1&vt=ANaJVrFF9hay-Htd06tIuxol3aQb6meA9h2pVoe4tjOwcG71IT9FCqTSWkGMUWnW_lxBuN6Daq8XzmnSUZaNHU-PLvSF3jHinYGwCE13Jg9o0W3AulQy7U4>.
> > It works fine for a smaller number of files (e.g. 1k).
> >
> > We have tried setting num_workers, max_num_workers, etc. Are there any
> > other settings that we can try?
> >
> > Context: the pipeline is using the Python Apache Beam SDK and running the
> > code at https://github.com/googlegenomics/gcp-variant-transforms. It's
> > using the VcfSource, which is based on TextSource. See this thread
> > <https://groups.google.com/d/msg/google-genomics-discuss/LUgqh1s56SY/WUnJkkHUAwAJ>
> > for more context.
> >
> > Thanks,
> > Asha
> >
>


Re: Azure(ADLS) compatibility on Beam with Spark runner

2017-11-22 Thread Jean-Baptiste Onofré

Hi,

FYI, I'm in touch with the Microsoft Azure team about that.

We are testing the ADLS support via HDFS.

I'll keep you posted.

Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Version 2.2.0 release date

2017-11-22 Thread Stefania Mantisi
Hi everyone,
I saw that all the issues are currently marked as fixed.
When will version 2.2 come out?

Thank you!

Stefania Mantisi

-- 
Stefania Mantisi
Software Engineer - Cloud Development - Noovle S.r.l.
mail: stefania.mant...@noovle.it


Azure(ADLS) compatibility on Beam with Spark runner

2017-11-22 Thread Milan Chandna
Hi,

Has anyone tried IO from (or to) an ADLS account on Beam with the Spark runner?
I tried this recently but was unable to make it work.

Steps that I tried:

  1.  Took an HDI + Spark 1.6 cluster with an ADLS account as the default storage.
  2.  Built Apache Beam on it, including the BEAM-2790 fix, an issue I had
also been hitting for ADL.
  3.  Modified the WordCount.java example to use HadoopFileSystemOptions.
  4.  Since the HDI + Spark cluster has ADLS as the defaultFS, tried 2 things:
 *   Just gave the input and output paths as adl://home/sample.txt and
adl://home/output.
 *   In addition to the adl input and output paths, also gave the required
HDFS configuration with the adl-specific configs.

Neither worked, by the way.

  1.  I have checked ACLs and permissions; in fact, a similar job with the
same paths works on Spark directly.
  2.  Issues faced:
 *   For input, Beam is not able to find the path. Console log: "Filepattern
adl://home/sample.txt matched 0 files with total size 0".
 *   The output path always gets converted to a relative path, something like
this: /home/user1/adl:/home/output/.tmp





I'm debugging this further, but wanted to check whether someone has already
faced this and found a resolution.



Here is the sample code and the command I used:



HadoopFileSystemOptions options =
    PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);

Pipeline p = Pipeline.create(options);

p.apply(TextIO.read().from(
        options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
 .apply(new CountWords())
 .apply(MapElements.via(new FormatAsTextFn()))
 .apply(TextIO.write().to("adl://home/output"));

p.run().waitUntilFinish();





spark-submit --class org.apache.beam.examples.WordCount --master local \
  beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner \
  --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"}]'





P.S.: I created a fat jar to use with Spark just for testing. Is there any
other, more correct way of running it with the Spark runner?



-Milan.