Out of curiosity, does using the DirectRunner with ADL work for you?
If not, then you'll be able to debug locally why it's failing.
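For reference, a local run might look something like the sketch below. This is untested and makes several assumptions: the DirectRunner is Beam's default local runner, the ADLS account name, tenant, and OAuth2 values are placeholders, the `fs.adl.*` key names are the ones documented for the hadoop-azure-datalake module, and the fat jar is assumed to bundle that module.

```shell
# Hypothetical local repro: run the same WordCount fat jar with the
# DirectRunner so ADLS access can be debugged on a single machine,
# passing the ADL filesystem credentials explicitly.
java -cp beam-examples-java-2.3.0-SNAPSHOT.jar \
  org.apache.beam.examples.WordCount \
  --runner=DirectRunner \
  --hdfsConfiguration='[{
      "fs.defaultFS": "adl://<account>.azuredatalakestore.net",
      "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
      "fs.adl.oauth2.client.id": "<client-id>",
      "fs.adl.oauth2.credential": "<client-secret>",
      "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<tenant>/oauth2/token"
    }]'
```

If this fails locally with the same symptoms, the problem is in the filesystem registration or configuration rather than in the Spark runner.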

On Fri, Nov 24, 2017 at 8:09 PM, Milan Chandna <
milan.chan...@microsoft.com.invalid> wrote:

> Hi JB,
>
> Thanks for the updates.
> BTW, I am myself at Microsoft, but I am trying this out of my own interest.
> And it's good to know that someone else is also working on this.
>
> -Milan.
>
> -----Original Message-----
> From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
> Sent: Thursday, November 23, 2017 1:47 PM
> To: dev@beam.apache.org
> Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
>
> The Azure guys tried to use ADLS via Beam HDFS filesystem, but it seems
> they didn't succeed.
> The new approach we plan is to directly use the ADLS API.
>
> I keep you posted.
>
> Regards
> JB
>
> On 11/23/2017 07:42 AM, Milan Chandna wrote:
> > I tried both ways.
> > I passed the ADL-specific configuration in --hdfsConfiguration, and I have
> > also set up core-site.xml/hdfs-site.xml.
> > As I mentioned, it's an HDI + Spark cluster, so those things are already
> > set up.
> > A Spark job (without Beam) is also able to read and write to ADLS on the
> > same machine.
> >
> > BTW, if authentication or recognizing the ADL scheme were the problem, it
> > would have thrown an error such as AdlFileSystem missing or access denied.
> > Thoughts?
> >
> > -Milan.
> >
> > -----Original Message-----
> > From: Lukasz Cwik [mailto:lc...@google.com.INVALID]
> > Sent: Thursday, November 23, 2017 5:05 AM
> > To: dev@beam.apache.org
> > Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> >
> > In your example it seems as though your HDFS configuration doesn't
> > contain any ADL specific configuration:
> > "--hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'"
> > Do you have a core-site.xml or hdfs-site.xml configured as per:
> > https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html ?
> >
> >  From the documentation for --hdfsConfiguration:
> > A list of Hadoop configurations used to configure zero or more Hadoop
> > filesystems. By default, Hadoop configuration is loaded from
> > 'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and
> > YARN_CONF_DIR environment variables. To specify configuration on the
> > command-line, represent the value as a JSON list of JSON maps, where each
> > map represents the entire configuration for a single Hadoop filesystem.
> > For example --hdfsConfiguration='[{\"fs.default.name\":
> > \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'
> > From:
> > https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
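Following that documentation, an ADL-specific configuration would need the adl filesystem and its credentials rather than an hdfs:// defaultFS. A possible sketch, untested: the account name, tenant, and OAuth2 values are placeholders, and the fs.adl.* key names and AdlFileSystem class come from the hadoop-azure-datalake module, which would have to be on the classpath.

```shell
# Hypothetical ADL-aware variant of the spark-submit command from the
# original report: the hdfsConfiguration map registers the adl:// scheme
# and supplies ClientCredential OAuth2 settings for the ADLS account.
spark-submit --class org.apache.beam.examples.WordCount --master local \
  beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner \
  --hdfsConfiguration='[{
      "fs.defaultFS": "adl://<account>.azuredatalakestore.net",
      "fs.adl.impl": "org.apache.hadoop.fs.adl.AdlFileSystem",
      "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
      "fs.adl.oauth2.client.id": "<client-id>",
      "fs.adl.oauth2.credential": "<client-secret>",
      "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<tenant>/oauth2/token"
    }]'
```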
> >
> > On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré
> > <j...@nanthrax.net>
> > wrote:
> >
> >> Hi,
> >>
> >> FYI, I'm in touch with Microsoft Azure team about that.
> >>
> >> We are testing the ADLS support via HDFS.
> >>
> >> I keep you posted.
> >>
> >> Regards
> >> JB
> >>
> >> On 11/22/2017 09:12 AM, Milan Chandna wrote:
> >>
> >>> Hi,
> >>>
> >>> Has anyone tried IO from(to) ADLS account on Beam with Spark runner?
> >>> I was trying recently to do this but was unable to make it work.
> >>>
> >>> Steps that I tried:
> >>>
> >>>     1.  Took an HDI + Spark 1.6 cluster with an ADLS account as the
> >>> default storage.
> >>>     2.  Built Apache Beam on it, including the BEAM-2790
> >>> <https://issues.apache.org/jira/browse/BEAM-2790> fix, which I had
> >>> earlier been hitting for ADL as well.
> >>>     3.  Modified the WordCount.java example to use
> >>> HadoopFileSystemOptions.
> >>>     4.  Since the HDI + Spark cluster has ADLS as the defaultFS, tried
> >>> 2 things:
> >>>        *   Just gave the input and output paths as
> >>> adl://home/sample.txt and adl://home/output.
> >>>        *   In addition to the adl input and output paths, also gave the
> >>> HDFS configuration with the required adl configs.
> >>>
> >>> Neither worked, btw.
> >>>     1.  Have checked ACLs and permissions. In fact, a similar job with
> >>> the same paths works on Spark directly.
> >>>     2.  Issues faced:
> >>>        *   For input, Beam is not able to find the path. Console log:
> >>> Filepattern adl://home/sample.txt matched 0 files with total size 0
> >>>        *   The output path always gets converted to a relative path,
> >>> something like this: /home/user1/adl:/home/output/.tmp....
> >>>
> >>> Still debugging this, but was checking whether someone has already
> >>> faced this and has a resolution.
> >>>
> >>> Here is a sample of the code and command I used:
> >>>
> >>>       HadoopFileSystemOptions options = PipelineOptionsFactory
> >>>           .fromArgs(args).as(HadoopFileSystemOptions.class);
> >>>
> >>>       Pipeline p = Pipeline.create(options);
> >>>
> >>>       p.apply(TextIO.read().from(
> >>>               options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
> >>>        .apply(new CountWords())
> >>>        .apply(MapElements.via(new FormatAsTextFn()))
> >>>        .apply(TextIO.write().to("adl://home/output"));
> >>>
> >>>       p.run().waitUntilFinish();
> >>>
> >>> spark-submit --class org.apache.beam.examples.WordCount --master local \
> >>>   beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner \
> >>>   --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"}]'
> >>>
> >>> P.S.: Created a fat jar to use with Spark just for testing. Is there any
> >>> other correct way of running it with the Spark runner?
> >>>
> >>> -Milan.
> >>>
> >>>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
