The Azure team tried to use ADLS via the Beam HDFS filesystem, but it seems they
didn't succeed.
The new approach we plan is to use the ADLS API directly.
I'll keep you posted.
Regards
JB
On 11/23/2017 07:42 AM, Milan Chandna wrote:
I tried both ways.
I passed the ADL-specific configuration in --hdfsConfiguration and have also set up
core-site.xml/hdfs-site.xml.
As I mentioned, it's an HDI + Spark cluster, so those things are already set up.
A Spark job (without Beam) is also able to read from and write to ADLS on the same machine.
BTW, if authentication or recognizing ADL were the problem, it would have
thrown an error like ADLFileSystem missing or an access failure.
Thoughts?
-Milan.
-----Original Message-----
From: Lukasz Cwik [mailto:lc...@google.com.INVALID]
Sent: Thursday, November 23, 2017 5:05 AM
To: dev@beam.apache.org
Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
In your example it seems as though your HDFS configuration doesn't contain any ADL-specific
configuration: "--hdfsConfiguration='[{\"fs.defaultFS\":
\"hdfs://home/sample.txt\"]'"
Do you have a core-site.xml or hdfs-site.xml configured as per:
https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html ?
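For reference, the ADL entries that core-site.xml would need per the hadoop-azure-datalake module are roughly the following sketch; the tenant ID, client ID, and secret below are placeholders, not values from this thread:

```xml
<configuration>
  <!-- Register the ADLS filesystem implementation for adl:// URIs -->
  <property>
    <name>fs.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.Adl</value>
  </property>
  <!-- OAuth2 service-principal credentials (placeholder values) -->
  <property>
    <name>dfs.adls.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.refresh.url</name>
    <value>https://login.microsoftonline.com/TENANT-ID/oauth2/token</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.client.id</name>
    <value>CLIENT-ID</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.credential</name>
    <value>CLIENT-SECRET</value>
  </property>
</configuration>
```

On an HDI cluster these are normally provisioned already, but they would need to be visible to the Beam pipeline's Hadoop configuration as well.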
From the documentation for --hdfsConfiguration:
A list of Hadoop configurations used to configure zero or more Hadoop filesystems. By
default, Hadoop configuration is loaded from 'core-site.xml' and 'hdfs-site.xml' based
upon the HADOOP_CONF_DIR and YARN_CONF_DIR environment variables. To specify
configuration on the command-line, represent the value as a JSON list of JSON maps, where
each map represents the entire configuration for a single Hadoop filesystem. For example
--hdfsConfiguration='[{\"fs.default.name\": \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'
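As a sketch only (the account name and credential values are placeholders, not from this thread), a well-formed --hdfsConfiguration value that carries the ADL filesystem and its credentials in a single JSON map might look like:

```
--hdfsConfiguration='[{"fs.defaultFS": "adl://ACCOUNT.azuredatalakestore.net/",
  "fs.adl.impl": "org.apache.hadoop.fs.adl.AdlFileSystem",
  "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
  "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/TENANT-ID/oauth2/token",
  "dfs.adls.oauth2.client.id": "CLIENT-ID",
  "dfs.adls.oauth2.credential": "CLIENT-SECRET"}]'
```

Note that each map must be valid JSON, including the closing brace, and that fs.defaultFS points at a filesystem root URI rather than a file.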
From:
https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:
Hi,
FYI, I'm in touch with Microsoft Azure team about that.
We are testing the ADLS support via HDFS.
I'll keep you posted.
Regards
JB
On 11/22/2017 09:12 AM, Milan Chandna wrote:
Hi,
Has anyone tried I/O from/to an ADLS account on Beam with the Spark runner?
I tried this recently but was unable to make it work.
Steps that I tried:
1. Took HDI + Spark 1.6 cluster with default storage as ADLS account.
2. Built Apache Beam on that, built to include the BEAM-2790
(https://issues.apache.org/jira/browse/BEAM-2790) fix, which I was earlier
hitting for ADL as well.
3. Modified WordCount.java example to use HadoopFileSystemOptions
4. Since the HDI + Spark cluster has ADLS as the defaultFS, I tried 2 things:
* Just gave the input path and output path as
adl://home/sample.txt and adl://home/output
* In addition to the adl input and output paths, also gave the required
HDFS configuration with the adl-required configs as well.
Neither worked, by the way.
1. Have checked ACLs and permissions. In fact, a similar job with the
same paths works on Spark directly.
2. Issues faced:
* For input, Beam is not able to find the path. Console log:
Filepattern adl://home/sample.txt matched 0 files with total size 0
* The output path always gets converted to a relative path, something
like this: /home/user1/adl:/home/output/.tmp....
I'm debugging this further, but wanted to check whether someone has already
faced this and found a resolution.
Here is a sample code and command I used.
HadoopFileSystemOptions options =
    PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
 .apply(new CountWords())
 .apply(MapElements.via(new FormatAsTextFn()))
 .apply(TextIO.write().to("adl://home/output"));
p.run().waitUntilFinish();

spark-submit --class org.apache.beam.examples.WordCount --master local \
  beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner \
  --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
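One observation on the command above: fs.defaultFS is set to an hdfs:// file path, but a defaultFS is normally a filesystem root URI, and the pipeline then reads that value as the input path. A hedged sketch of what the invocation might look like when pointed directly at ADLS (the account name is a placeholder; ADL credentials would still have to come from core-site.xml or additional keys in the same JSON map):

```
spark-submit --class org.apache.beam.examples.WordCount --master local \
  beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner \
  --hdfsConfiguration='[{"fs.defaultFS": "adl://ACCOUNT.azuredatalakestore.net/"}]'
```

With that shape, an explicit input such as adl://home/sample.txt stays a full adl:// URI rather than being resolved against an hdfs:// default.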
P.S.: I created a fat jar to use with Spark just for testing. Is there any
other, more correct way of running it with the Spark runner?
-Milan.
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com