Hi JB,
I'm working on adding HDFS support to the Python runner.
We're planning on using libhdfs3, which doesn't seem to support anything
other than HDFS.


On Mon, Nov 27, 2017 at 12:44 PM Lukasz Cwik <lc...@google.com.invalid>
wrote:

> Out of curiosity, does using the DirectRunner with ADL work for you?
> If not, then you'll be able to debug locally why it's failing.
>
> On Fri, Nov 24, 2017 at 8:09 PM, Milan Chandna <
> milan.chan...@microsoft.com.invalid> wrote:
>
> > Hi JB,
> >
> > Thanks for the updates.
> > BTW, I myself am at Microsoft, but I am trying this out of my own
> > interest. And it's good to know that someone else is also working on
> > this.
> >
> > -Milan.
> >
> > -----Original Message-----
> > From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
> > Sent: Thursday, November 23, 2017 1:47 PM
> > To: dev@beam.apache.org
> > Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> >
> > The Azure guys tried to use ADLS via the Beam HDFS filesystem, but it
> > seems they didn't succeed.
> > The new approach we plan is to directly use the ADLS API.
> >
> > I keep you posted.
> >
> > Regards
> > JB
> >
> > On 11/23/2017 07:42 AM, Milan Chandna wrote:
> > > I tried both ways.
> > > I passed the ADL-specific configuration in --hdfsConfiguration and
> > > have also set up core-site.xml/hdfs-site.xml.
> > > As I mentioned, it's an HDI + Spark cluster, so those things are
> > > already set up.
> > > A Spark job (without Beam) is also able to read and write to ADLS on
> > > the same machine.
> > >
> > > BTW, if authentication or recognizing ADL were the problem, it would
> > > have thrown an error such as "ADLFileSystem missing" or "access
> > > failed". Thoughts?
> > >
> > > -Milan.
> > >
> > > -----Original Message-----
> > > From: Lukasz Cwik [mailto:lc...@google.com.INVALID]
> > > Sent: Thursday, November 23, 2017 5:05 AM
> > > To: dev@beam.apache.org
> > > Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> > >
> > > In your example it seems as though your HDFS configuration doesn't
> > > contain any ADL-specific configuration:
> > > "--hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'"
> > > Do you have a core-site.xml or hdfs-site.xml configured as per
> > > https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html ?
> > >
> > > From the documentation for --hdfsConfiguration:
> > > A list of Hadoop configurations used to configure zero or more Hadoop
> > > filesystems. By default, Hadoop configuration is loaded from
> > > 'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR
> > > and YARN_CONF_DIR environment variables. To specify configuration on
> > > the command-line, represent the value as a JSON list of JSON maps,
> > > where each map represents the entire configuration for a single
> > > Hadoop filesystem. For example --hdfsConfiguration='[{\"fs.default.name\":
> > > \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'
> > > From:
> > > https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
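For reference, a --hdfsConfiguration for an ADLS-backed filesystem would need the ADL properties from the hadoop-azure-datalake page linked above. Here is a sketch of assembling such a value, assuming the ClientCredential OAuth2 flow; the property names come from those docs, and every value is a placeholder, not a tested configuration:

```python
import json

# Sketch: assembling an ADLS-oriented --hdfsConfiguration value.
# Property names follow the hadoop-azure-datalake documentation;
# the account name and all credential values are placeholders.
adl_conf = [{
    "fs.defaultFS": "adl://<account>.azuredatalakestore.net",
    "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
    "fs.adl.oauth2.client.id": "<application-id>",
    "fs.adl.oauth2.credential": "<client-secret>",
    "fs.adl.oauth2.refresh.url": "<oauth2-token-endpoint>",
}]

# One JSON map per filesystem, wrapped in a JSON list, as the docs describe.
arg = "--hdfsConfiguration='%s'" % json.dumps(adl_conf)
print(arg)
```

The JSON list form matters: Beam parses the argument as a list of maps, one map per configured filesystem.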
> > >
> > > On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré
> > > <j...@nanthrax.net>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> FYI, I'm in touch with Microsoft Azure team about that.
> > >>
> > >> We are testing the ADLS support via HDFS.
> > >>
> > >> I keep you posted.
> > >>
> > >> Regards
> > >> JB
> > >>
> > >> On 11/22/2017 09:12 AM, Milan Chandna wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Has anyone tried IO from/to an ADLS account on Beam with the Spark
> > >>> runner?
> > >>> I tried this recently but was unable to make it work.
> > >>>
> > >>> Steps that I tried:
> > >>>
> > >>>     1.  Took an HDI + Spark 1.6 cluster with an ADLS account as the
> > >>> default storage.
> > >>>     2.  Built Apache Beam on that, including the BEAM-2790
> > >>> (https://issues.apache.org/jira/browse/BEAM-2790) fix for an issue
> > >>> I was earlier hitting with ADL as well.
> > >>>     3.  Modified the WordCount.java example to use
> > >>> HadoopFileSystemOptions.
> > >>>     4.  Since the HDI + Spark cluster has ADLS as defaultFS, tried 2
> > >>> things:
> > >>>        *   Just gave the input path and output path as
> > >>> adl://home/sample.txt and adl://home/output
> > >>>        *   In addition to the adl input and output paths, also gave
> > >>> the required HDFS configuration with the adl-specific configs as well.
> > >>>
> > >>> Neither worked, btw.
> > >>>     1.  I have checked ACLs and permissions. In fact, a similar job
> > >>> with the same paths works on Spark directly.
> > >>>     2.  Issues faced:
> > >>>        *   For input, Beam is not able to find the path. Console log:
> > >>> Filepattern adl://home/sample.txt matched 0 files with total size 0
> > >>>        *   The output path always gets converted to a relative path,
> > >>> something like this: /home/user1/adl:/home/output/.tmp....
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> I am debugging this further, but wanted to check whether someone is
> > >>> already facing this and has some resolution.
> > >>>
> > >>>
> > >>>
> > >>> Here is a sample code and command I used.
> > >>>
> > >>>
> > >>>
> > >>>       HadoopFileSystemOptions options =
> > >>>           PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
> > >>>
> > >>>       Pipeline p = Pipeline.create(options);
> > >>>
> > >>>       p.apply(TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
> > >>>        .apply(new CountWords())
> > >>>        .apply(MapElements.via(new FormatAsTextFn()))
> > >>>        .apply(TextIO.write().to("adl://home/output"));
> > >>>
> > >>>       p.run().waitUntilFinish();
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> spark-submit --class org.apache.beam.examples.WordCount --master
> > >>> local beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner
> > >>> --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
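Separately, the --hdfsConfiguration value in this command looks like malformed JSON: the closing } before ] is missing, so the map is never terminated and the list cannot parse. A quick check in plain Python (outside Beam) shows the difference; the corrected string is my assumption of the intended form:

```python
import json

# The configuration value from the spark-submit command above, with the
# shell backslash-escaping removed. Note the missing closing brace.
as_written = '[{"fs.defaultFS": "hdfs://home/sample.txt"]'
try:
    json.loads(as_written)
    parsed = True
except ValueError:  # json.JSONDecodeError is a ValueError subclass
    parsed = False
print("as written parses:", parsed)  # → as written parses: False

# With the brace restored (assumed intent), the same value parses fine:
corrected = '[{"fs.defaultFS": "hdfs://home/sample.txt"}]'
print(json.loads(corrected))
```

Even once the JSON parses, fs.defaultFS would still point at hdfs:// rather than an adl:// URI, so the ADL-specific configuration question above remains.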
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> P.S.: I created a fat jar to use with Spark just for testing. Is
> > >>> there a more appropriate way of running it with the Spark runner?
> > >>>
> > >>>
> > >>>
> > >>> -Milan.
> > >>>
> > >>>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbono...@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com
> > >>
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
