Hi JB,
I'm working on adding HDFS support to the Python runner.
We're planning on using libhdfs3, which doesn't seem to support anything
other than HDFS.


On Mon, Nov 27, 2017 at 12:44 PM Lukasz Cwik <lc...@google.com.invalid>
wrote:

> Out of curiosity, does using the DirectRunner with ADL work for you?
> If not, then you'll be able to debug locally why it's failing.
>
> On Fri, Nov 24, 2017 at 8:09 PM, Milan Chandna <
> milan.chan...@microsoft.com.invalid> wrote:
>
> > Hi JB,
> >
> > Thanks for the updates.
> > BTW, I myself am at Microsoft, but I am trying this out of my own
> > interest. And it's good to know that someone else is also working on
> > this.
> >
> > -Milan.
> >
> > -----Original Message-----
> > From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
> > Sent: Thursday, November 23, 2017 1:47 PM
> > To: dev@beam.apache.org
> > Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> >
> > The Azure guys tried to use ADLS via the Beam HDFS filesystem, but it
> > seems they didn't succeed.
> > The new approach we plan is to directly use the ADLS API.
> >
> > I keep you posted.
> >
> > Regards
> > JB
> >
> > On 11/23/2017 07:42 AM, Milan Chandna wrote:
> > > I tried both ways.
> > > I passed the ADL-specific configuration in --hdfsConfiguration and
> > > have also set up core-site.xml/hdfs-site.xml.
> > > As I mentioned, it's an HDI + Spark cluster, so those things are
> > > already set up.
> > > A Spark job (without Beam) is also able to read and write to ADLS on
> > > the same machine.
> > >
> > > BTW, if authentication or recognizing ADL were the problem, it would
> > > have thrown an error such as "ADLFileSystem missing" or "access
> > > failed". Thoughts?
> > >
> > > -Milan.
> > >
> > > -----Original Message-----
> > > From: Lukasz Cwik [mailto:lc...@google.com.INVALID]
> > > Sent: Thursday, November 23, 2017 5:05 AM
> > > To: dev@beam.apache.org
> > > Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> > >
> > > In your example it seems as though your HDFS configuration doesn't
> > > contain any ADL-specific configuration:
> > > "--hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'"
> > > Do you have a core-site.xml or hdfs-site.xml configured as per
> > > https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html ?
> > >
> > > From the documentation for --hdfsConfiguration:
> > > A list of Hadoop configurations used to configure zero or more Hadoop
> > > filesystems. By default, Hadoop configuration is loaded from
> > > 'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR
> > > and YARN_CONF_DIR environment variables. To specify configuration on
> > > the command-line, represent the value as a JSON list of JSON maps,
> > > where each map represents the entire configuration for a single
> > > Hadoop filesystem. For example --hdfsConfiguration='[{\"fs.default.name\":
> > > \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'
> > > From:
> > > https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
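For reference, a --hdfsConfiguration for an ADLS-backed filesystem would need the ADL properties from the hadoop-azure-datalake page linked above. Here is a sketch of assembling such a value, assuming the ClientCredential OAuth2 flow; the property names come from those docs, and every value is a placeholder, not a tested configuration:

```python
import json

# Sketch: assembling an ADLS-oriented --hdfsConfiguration value.
# Property names follow the hadoop-azure-datalake documentation;
# the account name and all credential values are placeholders.
adl_conf = [{
    "fs.defaultFS": "adl://<account>.azuredatalakestore.net",
    "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
    "fs.adl.oauth2.client.id": "<application-id>",
    "fs.adl.oauth2.credential": "<client-secret>",
    "fs.adl.oauth2.refresh.url": "<oauth2-token-endpoint>",
}]

# One JSON map per filesystem, wrapped in a JSON list, as the docs describe.
arg = "--hdfsConfiguration='%s'" % json.dumps(adl_conf)
print(arg)
```

The JSON list form matters: Beam parses the argument as a list of maps, one map per configured filesystem.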
> > >
> > > On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré
> > > <j...@nanthrax.net>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> FYI, I'm in touch with Microsoft Azure team about that.
> > >>
> > >> We are testing the ADLS support via HDFS.
> > >>
> > >> I keep you posted.
> > >>
> > >> Regards
> > >> JB
> > >>
> > >> On 11/22/2017 09:12 AM, Milan Chandna wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Has anyone tried IO from/to an ADLS account on Beam with the Spark
> > >>> runner?
> > >>> I tried this recently but was unable to make it work.
> > >>>
> > >>> Steps that I tried:
> > >>>
> > >>>     1.  Took an HDI + Spark 1.6 cluster with an ADLS account as the
> > >>> default storage.
> > >>>     2.  Built Apache Beam on that, including the BEAM-2790
> > >>> (https://issues.apache.org/jira/browse/BEAM-2790) fix for an issue
> > >>> I was earlier hitting with ADL as well.
> > >>>     3.  Modified the WordCount.java example to use
> > >>> HadoopFileSystemOptions.
> > >>>     4.  Since the HDI + Spark cluster has ADLS as defaultFS, tried 2
> > >>> things:
> > >>>        *   Just gave the input path and output path as
> > >>> adl://home/sample.txt and adl://home/output
> > >>>        *   In addition to the adl input and output paths, also gave
> > >>> the required HDFS configuration with the adl-specific configs as well.
> > >>>
> > >>> Neither worked, btw.
> > >>>     1.  I have checked ACLs and permissions. In fact, a similar job
> > >>> with the same paths works on Spark directly.
> > >>>     2.  Issues faced:
> > >>>        *   For input, Beam is not able to find the path. Console log:
> > >>> Filepattern adl://home/sample.txt matched 0 files with total size 0
> > >>>        *   The output path always gets converted to a relative path,
> > >>> something like this: /home/user1/adl:/home/output/.tmp....
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> I am debugging this further, but wanted to check whether someone is
> > >>> already facing this and has some resolution.
> > >>>
> > >>>
> > >>>
> > >>> Here is a sample code and command I used.
> > >>>
> > >>>
> > >>>
> > >>>       HadoopFileSystemOptions options =
> > >>>           PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
> > >>>
> > >>>       Pipeline p = Pipeline.create(options);
> > >>>
> > >>>       p.apply(TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
> > >>>        .apply(new CountWords())
> > >>>        .apply(MapElements.via(new FormatAsTextFn()))
> > >>>        .apply(TextIO.write().to("adl://home/output"));
> > >>>
> > >>>       p.run().waitUntilFinish();
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> spark-submit --class org.apache.beam.examples.WordCount --master
> > >>> local beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner
> > >>> --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
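Separately, the --hdfsConfiguration value in this command looks like malformed JSON: the closing } before ] is missing, so the map is never terminated and the list cannot parse. A quick check in plain Python (outside Beam) shows the difference; the corrected string is my assumption of the intended form:

```python
import json

# The configuration value from the spark-submit command above, with the
# shell backslash-escaping removed. Note the missing closing brace.
as_written = '[{"fs.defaultFS": "hdfs://home/sample.txt"]'
try:
    json.loads(as_written)
    parsed = True
except ValueError:  # json.JSONDecodeError is a ValueError subclass
    parsed = False
print("as written parses:", parsed)  # → as written parses: False

# With the brace restored (assumed intent), the same value parses fine:
corrected = '[{"fs.defaultFS": "hdfs://home/sample.txt"}]'
print(json.loads(corrected))
```

Even once the JSON parses, fs.defaultFS would still point at hdfs:// rather than an adl:// URI, so the ADL-specific configuration question above remains.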
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> P.S.: I created a fat jar to use with Spark just for testing. Is
> > >>> there a more appropriate way of running it with the Spark runner?
> > >>>
> > >>>
> > >>>
> > >>> -Milan.
> > >>>
> > >>>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbono...@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com
> > >>
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
