Out of curiosity, does using the DirectRunner with ADL work for you? If not, then you'll be able to debug locally why it's failing.
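For reference, a minimal sketch of what such a local DirectRunner run could look like, based on the fat jar and WordCount class mentioned later in this thread (the jar name and the adl:// paths are taken from the thread; running it from a plain `java` command line is an assumption for illustration):

```shell
# Hypothetical local invocation of the same modified WordCount pipeline with
# the DirectRunner instead of the SparkRunner, so that any ADL filesystem
# resolution failure reproduces locally with a full stack trace.
java -cp beam-examples-java-2.3.0-SNAPSHOT.jar \
  org.apache.beam.examples.WordCount \
  --runner=DirectRunner \
  --hdfsConfiguration='[{"fs.defaultFS": "adl://home/"}]'
```

If the same "matched 0 files" and relative-path symptoms appear here, the problem is in the filesystem registration/configuration rather than anything Spark-specific.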
On Fri, Nov 24, 2017 at 8:09 PM, Milan Chandna <milan.chan...@microsoft.com.invalid> wrote:

> Hi JB,
>
> Thanks for the updates.
> BTW, I am myself in Microsoft, but I am trying this out of my own interest.
> And it's good to know that someone else is also working on this.
>
> -Milan.
>
> -----Original Message-----
> From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
> Sent: Thursday, November 23, 2017 1:47 PM
> To: dev@beam.apache.org
> Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
>
> The Azure guys tried to use ADLS via the Beam HDFS filesystem, but it seems
> they didn't succeed.
> The new approach we plan is to use the ADLS API directly.
>
> I'll keep you posted.
>
> Regards
> JB
>
> On 11/23/2017 07:42 AM, Milan Chandna wrote:
> > I tried both ways.
> > I passed the ADL-specific configuration in --hdfsConfiguration as well, and
> > have set up core-site.xml/hdfs-site.xml too.
> > As I mentioned, it's an HDI + Spark cluster, so those things are already
> > set up.
> > A Spark job (without Beam) is also able to read and write to ADLS on the
> > same machine.
> >
> > BTW, if authentication or resolving ADL were the problem, it would have
> > thrown an error like ADLFileSystem missing, or probably access failed or
> > something. Thoughts?
> >
> > -Milan.
> >
> > -----Original Message-----
> > From: Lukasz Cwik [mailto:lc...@google.com.INVALID]
> > Sent: Thursday, November 23, 2017 5:05 AM
> > To: dev@beam.apache.org
> > Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> >
> > In your example it seems as though your HDFS configuration doesn't contain
> > any ADL-specific configuration:
> > "--hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'"
> > Do you have a core-site.xml or hdfs-site.xml configured as per:
> > https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html ?
> >
> > From the documentation for --hdfsConfiguration:
> > A list of Hadoop configurations used to configure zero or more Hadoop
> > filesystems. By default, Hadoop configuration is loaded from
> > 'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and
> > YARN_CONF_DIR environment variables. To specify configuration on the
> > command line, represent the value as a JSON list of JSON maps, where each
> > map represents the entire configuration for a single Hadoop filesystem.
> > For example --hdfsConfiguration='[{\"fs.default.name\":
> > \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'
> > From:
> > https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
> >
> > On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> >> Hi,
> >>
> >> FYI, I'm in touch with the Microsoft Azure team about that.
> >>
> >> We are testing the ADLS support via HDFS.
> >>
> >> I'll keep you posted.
> >>
> >> Regards
> >> JB
> >>
> >> On 11/22/2017 09:12 AM, Milan Chandna wrote:
> >>
> >>> Hi,
> >>>
> >>> Has anyone tried IO from (or to) an ADLS account on Beam with the Spark
> >>> runner? I tried this recently but was unable to make it work.
> >>>
> >>> Steps that I tried:
> >>>
> >>> 1. Took an HDI + Spark 1.6 cluster with an ADLS account as the default
> >>> storage.
> >>> 2. Built Apache Beam on that. Built to include the BEAM-2790
> >>> <https://issues.apache.org/jira/browse/BEAM-2790> fix, which I was
> >>> earlier facing for ADL as well.
> >>> 3. Modified the WordCount.java example to use HadoopFileSystemOptions.
> >>> 4. Since the HDI + Spark cluster has ADLS as the defaultFS, I tried two
> >>> things:
> >>>    * Just gave the input path and output path as adl://home/sample.txt
> >>>      and adl://home/output
> >>>    * In addition to the adl input and output paths, also gave the
> >>>      required HDFS configuration with the required adl configs.
> >>>
> >>> Both didn't work, btw.
> >>>
> >>> 1. I have checked ACLs and permissions. In fact, a similar job with the
> >>> same paths works on Spark directly.
> >>> 2. Issues faced:
> >>>    * For input, Beam is not able to find the path. Console log:
> >>>      Filepattern adl://home/sample.txt matched 0 files with total size 0
> >>>    * The output path always gets converted to a relative path, something
> >>>      like this: /home/user1/adl:/home/output/.tmp....
> >>>
> >>> I am debugging this further, but wanted to check whether someone is
> >>> already facing this and has a resolution.
> >>>
> >>> Here is a sample code and command I used.
> >>>
> >>> HadoopFileSystemOptions options =
> >>>     PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
> >>>
> >>> Pipeline p = Pipeline.create(options);
> >>>
> >>> p.apply(TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
> >>>  .apply(new CountWords())
> >>>  .apply(MapElements.via(new FormatAsTextFn()))
> >>>  .apply(TextIO.write().to("adl://home/output"));
> >>>
> >>> p.run().waitUntilFinish();
> >>>
> >>> spark-submit --class org.apache.beam.examples.WordCount --master local
> >>> beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner
> >>> --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
> >>>
> >>> P.S.: I created a fat jar to use with Spark just for testing. Is there
> >>> any other correct way of running it with the Spark runner?
> >>>
> >>> -Milan.
> >>>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
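To illustrate Lukasz's point about the missing ADL-specific configuration, here is a hedged sketch of what the spark-submit command from the thread could look like with the ADL filesystem properties inlined. The property names come from the hadoop-azure-datalake documentation; the account name and all credential values are placeholders, not values from this thread. Note that the original command's JSON was also malformed (it was missing the closing `}`), and its fs.defaultFS pointed at a file rather than a filesystem root:

```shell
# Sketch only: spark-submit with ADL-specific Hadoop configuration passed via
# --hdfsConfiguration. Property names per the Hadoop ADL connector docs;
# <client-id>, <client-secret>, <tenant-id>, and the account name are placeholders.
spark-submit --class org.apache.beam.examples.WordCount --master local \
  beam-examples-java-2.3.0-SNAPSHOT.jar \
  --runner=SparkRunner \
  --hdfsConfiguration='[{
    "fs.defaultFS": "adl://youraccount.azuredatalakestore.net/",
    "fs.adl.impl": "org.apache.hadoop.fs.adl.AdlFileSystem",
    "fs.AbstractFileSystem.adl.impl": "org.apache.hadoop.fs.adl.Adl",
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": "<client-id>",
    "dfs.adls.oauth2.credential": "<client-secret>",
    "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
  }]'
```

The same key/value pairs could equally live in core-site.xml on the cluster, which is the route the hadoop-azure-datalake documentation linked above describes.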