Out of curiosity, does using the DirectRunner with ADL work for you? If not, then you'll be able to debug locally why it's failing.
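For reference, a minimal sketch of what such a local DirectRunner run could look like, based on the fat jar and WordCount class mentioned later in this thread (the jar name and the adl:// paths are taken from the thread; running it from a plain `java` command line is an assumption for illustration):

```shell
# Hypothetical local invocation of the same modified WordCount pipeline with
# the DirectRunner instead of the SparkRunner, so that any ADL filesystem
# resolution failure reproduces locally with a full stack trace.
java -cp beam-examples-java-2.3.0-SNAPSHOT.jar \
  org.apache.beam.examples.WordCount \
  --runner=DirectRunner \
  --hdfsConfiguration='[{"fs.defaultFS": "adl://home/"}]'
```

If the same "matched 0 files" and relative-path symptoms appear here, the problem is in the filesystem registration/configuration rather than anything Spark-specific.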
On Fri, Nov 24, 2017 at 8:09 PM, Milan Chandna <milan.chan...@microsoft.com.invalid> wrote:

> Hi JB,
>
> Thanks for the updates.
> BTW, I am myself in Microsoft, but I am trying this out of my own interest.
> And it's good to know that someone else is also working on this.
>
> -Milan.
>
> -----Original Message-----
> From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
> Sent: Thursday, November 23, 2017 1:47 PM
> To: dev@beam.apache.org
> Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
>
> The Azure guys tried to use ADLS via the Beam HDFS filesystem, but it seems
> they didn't succeed.
> The new approach we plan is to use the ADLS API directly.
>
> I'll keep you posted.
>
> Regards
> JB
>
> On 11/23/2017 07:42 AM, Milan Chandna wrote:
> > I tried both ways.
> > I passed the ADL-specific configuration in --hdfsConfiguration as well, and
> > have set up core-site.xml/hdfs-site.xml too.
> > As I mentioned, it's an HDI + Spark cluster, so those things are already
> > set up.
> > A Spark job (without Beam) is also able to read and write to ADLS on the
> > same machine.
> >
> > BTW, if authentication or resolving ADL were the problem, it would have
> > thrown an error like ADLFileSystem missing, or probably access failed or
> > something. Thoughts?
> >
> > -Milan.
> >
> > -----Original Message-----
> > From: Lukasz Cwik [mailto:lc...@google.com.INVALID]
> > Sent: Thursday, November 23, 2017 5:05 AM
> > To: dev@beam.apache.org
> > Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> >
> > In your example it seems as though your HDFS configuration doesn't contain
> > any ADL-specific configuration:
> > "--hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'"
> > Do you have a core-site.xml or hdfs-site.xml configured as per:
> > https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html ?
> >
> > From the documentation for --hdfsConfiguration:
> > A list of Hadoop configurations used to configure zero or more Hadoop
> > filesystems. By default, Hadoop configuration is loaded from
> > 'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and
> > YARN_CONF_DIR environment variables. To specify configuration on the
> > command line, represent the value as a JSON list of JSON maps, where each
> > map represents the entire configuration for a single Hadoop filesystem.
> > For example --hdfsConfiguration='[{\"fs.default.name\":
> > \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'
> > From:
> > https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
> >
> > On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> >> Hi,
> >>
> >> FYI, I'm in touch with the Microsoft Azure team about that.
> >>
> >> We are testing the ADLS support via HDFS.
> >>
> >> I'll keep you posted.
> >>
> >> Regards
> >> JB
> >>
> >> On 11/22/2017 09:12 AM, Milan Chandna wrote:
> >>
> >>> Hi,
> >>>
> >>> Has anyone tried IO from (or to) an ADLS account on Beam with the Spark
> >>> runner? I tried this recently but was unable to make it work.
> >>>
> >>> Steps that I tried:
> >>>
> >>> 1. Took an HDI + Spark 1.6 cluster with an ADLS account as the default
> >>> storage.
> >>> 2. Built Apache Beam on that. Built to include the BEAM-2790
> >>> <https://issues.apache.org/jira/browse/BEAM-2790> fix, which I was
> >>> earlier facing for ADL as well.
> >>> 3. Modified the WordCount.java example to use HadoopFileSystemOptions.
> >>> 4. Since the HDI + Spark cluster has ADLS as the defaultFS, I tried two
> >>> things:
> >>>    * Just gave the input path and output path as adl://home/sample.txt
> >>>      and adl://home/output
> >>>    * In addition to the adl input and output paths, also gave the
> >>>      required HDFS configuration with the required adl configs.
> >>>
> >>> Both didn't work, btw.
> >>>
> >>> 1. I have checked ACLs and permissions. In fact, a similar job with the
> >>> same paths works on Spark directly.
> >>> 2. Issues faced:
> >>>    * For input, Beam is not able to find the path. Console log:
> >>>      Filepattern adl://home/sample.txt matched 0 files with total size 0
> >>>    * The output path always gets converted to a relative path, something
> >>>      like this: /home/user1/adl:/home/output/.tmp....
> >>>
> >>> I am debugging this further, but wanted to check whether someone is
> >>> already facing this and has a resolution.
> >>>
> >>> Here is a sample code and command I used.
> >>>
> >>> HadoopFileSystemOptions options =
> >>>     PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
> >>>
> >>> Pipeline p = Pipeline.create(options);
> >>>
> >>> p.apply(TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
> >>>  .apply(new CountWords())
> >>>  .apply(MapElements.via(new FormatAsTextFn()))
> >>>  .apply(TextIO.write().to("adl://home/output"));
> >>>
> >>> p.run().waitUntilFinish();
> >>>
> >>> spark-submit --class org.apache.beam.examples.WordCount --master local
> >>> beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner
> >>> --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
> >>>
> >>> P.S.: I created a fat jar to use with Spark just for testing. Is there
> >>> any other correct way of running it with the Spark runner?
> >>>
> >>> -Milan.
> >>>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
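To illustrate Lukasz's point about the missing ADL-specific configuration, here is a hedged sketch of what the spark-submit command from the thread could look like with the ADL filesystem properties inlined. The property names come from the hadoop-azure-datalake documentation; the account name and all credential values are placeholders, not values from this thread. Note that the original command's JSON was also malformed (it was missing the closing `}`), and its fs.defaultFS pointed at a file rather than a filesystem root:

```shell
# Sketch only: spark-submit with ADL-specific Hadoop configuration passed via
# --hdfsConfiguration. Property names per the Hadoop ADL connector docs;
# <client-id>, <client-secret>, <tenant-id>, and the account name are placeholders.
spark-submit --class org.apache.beam.examples.WordCount --master local \
  beam-examples-java-2.3.0-SNAPSHOT.jar \
  --runner=SparkRunner \
  --hdfsConfiguration='[{
    "fs.defaultFS": "adl://youraccount.azuredatalakestore.net/",
    "fs.adl.impl": "org.apache.hadoop.fs.adl.AdlFileSystem",
    "fs.AbstractFileSystem.adl.impl": "org.apache.hadoop.fs.adl.Adl",
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": "<client-id>",
    "dfs.adls.oauth2.credential": "<client-secret>",
    "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
  }]'
```

The same key/value pairs could equally live in core-site.xml on the cluster, which is the route the hadoop-azure-datalake documentation linked above describes.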