Hi JB, I'm working on adding HDFS support to the Python runner. We're planning on using libhdfs3, which doesn't seem to support anything other than HDFS.
On Mon, Nov 27, 2017 at 12:44 PM Lukasz Cwik <lc...@google.com.invalid> wrote:
> Out of curiosity, does using the DirectRunner with ADL work for you?
> If not, then you'll be able to debug locally why it's failing.
>
> On Fri, Nov 24, 2017 at 8:09 PM, Milan Chandna <milan.chan...@microsoft.com.invalid> wrote:
> >
> > Hi JB,
> >
> > Thanks for the updates.
> > BTW I am myself in Microsoft, but I am trying this out of my own interest.
> > And it's good to know that someone else is also working on this.
> >
> > -Milan.
> >
> > -----Original Message-----
> > From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
> > Sent: Thursday, November 23, 2017 1:47 PM
> > To: dev@beam.apache.org
> > Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> >
> > The Azure guys tried to use ADLS via the Beam HDFS filesystem, but it seems they didn't succeed.
> > The new approach we plan is to use the ADLS API directly.
> >
> > I keep you posted.
> >
> > Regards
> > JB
> >
> > On 11/23/2017 07:42 AM, Milan Chandna wrote:
> > > I tried it both ways.
> > > I passed the ADL-specific configuration in --hdfsConfiguration, and I have also set up core-site.xml/hdfs-site.xml.
> > > As I mentioned, it's an HDI + Spark cluster, so those things are already set up.
> > > A Spark job (without Beam) is also able to read from and write to ADLS on the same machine.
> > >
> > > BTW if authentication or understanding ADL were the problem, it would have thrown an error like ADLFileSystem missing, or access failed, or something similar. Thoughts?
> > >
> > > -Milan.
> > >
> > > -----Original Message-----
> > > From: Lukasz Cwik [mailto:lc...@google.com.INVALID]
> > > Sent: Thursday, November 23, 2017 5:05 AM
> > > To: dev@beam.apache.org
> > > Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> > >
> > > In your example it seems as though your HDFS configuration doesn't contain any ADL-specific configuration: "--hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'"
> > > Do you have a core-site.xml or hdfs-site.xml configured as per
> > > https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html ?
> > >
> > > From the documentation for --hdfsConfiguration:
> > > A list of Hadoop configurations used to configure zero or more Hadoop filesystems. By default, Hadoop configuration is loaded from 'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and YARN_CONF_DIR environment variables. To specify configuration on the command line, represent the value as a JSON list of JSON maps, where each map represents the entire configuration for a single Hadoop filesystem. For example --hdfsConfiguration='[{\"fs.default.name\": \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'
> > > From:
> > > https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
> > >
> > > On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >
> > > > Hi,
> > > >
> > > > FYI, I'm in touch with the Microsoft Azure team about that.
> > > >
> > > > We are testing the ADLS support via HDFS.
> > > >
> > > > I keep you posted.
> > > > Regards
> > > > JB
> > > >
> > > > On 11/22/2017 09:12 AM, Milan Chandna wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Has anyone tried IO from (or to) an ADLS account on Beam with the Spark runner?
> > > > > I tried recently but was unable to make it work.
> > > > >
> > > > > Steps that I tried:
> > > > >
> > > > > 1. Took an HDI + Spark 1.6 cluster with an ADLS account as the default storage.
> > > > > 2. Built Apache Beam on that, including the BEAM-2790 (https://issues.apache.org/jira/browse/BEAM-2790) fix, which I had earlier been hitting for ADL as well.
> > > > > 3. Modified the WordCount.java example to use HadoopFileSystemOptions.
> > > > > 4. Since the HDI + Spark cluster has ADLS as defaultFS, I tried 2 things:
> > > > >    * Just gave the input and output paths as adl://home/sample.txt and adl://home/output.
> > > > >    * In addition to the adl input and output paths, also gave the required HDFS configuration, including the required adl configs.
> > > > >
> > > > > Neither worked, btw.
> > > > >
> > > > > 1. I have checked ACLs and permissions. In fact, a similar job with the same paths works on Spark directly.
> > > > > 2. Issues faced:
> > > > >    * For input, Beam is not able to find the path. Console log: Filepattern adl://home/sample.txt matched 0 files with total size 0
> > > > >    * The output path always gets converted to a relative path, something like this: /home/user1/adl:/home/output/.tmp....
> > > > >
> > > > > I am debugging this further, but wanted to check whether someone has already faced this and has a resolution.
> > > > >
> > > > > Here is the sample code and command I used.
> > > > >
> > > > > HadoopFileSystemOptions options =
> > > > >     PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
> > > > >
> > > > > Pipeline p = Pipeline.create(options);
> > > > >
> > > > > p.apply(TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
> > > > >     .apply(new CountWords())
> > > > >     .apply(MapElements.via(new FormatAsTextFn()))
> > > > >     .apply(TextIO.write().to("adl://home/output"));
> > > > >
> > > > > p.run().waitUntilFinish();
> > > > >
> > > > > spark-submit --class org.apache.beam.examples.WordCount --master local beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
> > > > >
> > > > > P.S: I created a fat jar to use with Spark just for testing. Is there any other, more correct way of running it with the Spark runner?
> > > > >
> > > > > -Milan.
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
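As Lukasz notes above, --hdfsConfiguration expects a JSON list of JSON maps, one map per Hadoop filesystem. A minimal sketch of assembling such a value for ADLS follows; this is not from the thread: the fs.adl.impl and dfs.adls.oauth2.* property names are assumed from the hadoop-azure-datalake documentation, and the credential values are placeholders to be replaced with real ones.

```python
import json

# One map = the entire Hadoop configuration for one filesystem.
# Property names assumed from the hadoop-azure-datalake docs;
# <client-id>, <credential>, <token-endpoint> are placeholders.
adl_config = {
    "fs.defaultFS": "adl://home",
    "fs.adl.impl": "org.apache.hadoop.fs.adl.AdlFileSystem",
    "fs.AbstractFileSystem.adl.impl": "org.apache.hadoop.fs.adl.Adl",
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": "<client-id>",
    "dfs.adls.oauth2.credential": "<credential>",
    "dfs.adls.oauth2.refresh.url": "<token-endpoint>",
}

# The flag value is a JSON *list* of such maps.
flag = "--hdfsConfiguration=" + json.dumps([adl_config])
print(flag)
```

Note that the command quoted in the thread passes `[{...]` with an unclosed map, so quite apart from the missing ADL properties, the value is not valid JSON.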
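The mangled output path Milan reports (/home/user1/adl:/home/output/.tmp...) is the characteristic shape of a filesystem URI being resolved as a local relative path. A small sketch of that failure mode, assuming a hypothetical working directory of /home/user1:

```python
import os.path

# If "adl://home/output" is not matched to a registered filesystem
# scheme and falls through to local-path handling, joining it onto
# the working directory and normalizing collapses the "//" after the
# scheme, turning "adl://" into "adl:/" under the working directory.
working_dir = "/home/user1"  # hypothetical
resolved = os.path.normpath(os.path.join(working_dir, "adl://home/output"))
print(resolved)  # /home/user1/adl:/home/output
```

This suggests the adl:// scheme was never registered with Beam's FileSystems in that run, consistent with the input pattern matching 0 files.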