The Azure team tried to use ADLS via the Beam HDFS filesystem, but it seems they
didn't succeed.
The new approach we plan is to use the ADLS API directly.
I'll keep you posted.
Regards
JB
On 11/23/2017 07:42 AM, Milan Chandna wrote:
I tried both ways.
I passed the ADL-specific configuration in --hdfsConfiguration and have also set up
core-site.xml/hdfs-site.xml.
As I mentioned, it's an HDI + Spark cluster, so those things are already set up.
A Spark job (without Beam) is also able to read from and write to ADLS on the same machine.
BTW, if authentication or recognizing ADL were the problem, it would have
thrown an error like ADLFileSystem missing or an access failure.
Thoughts?
-Milan.
-----Original Message-----
From: Lukasz Cwik [mailto:lc...@google.com.INVALID]
Sent: Thursday, November 23, 2017 5:05 AM
To: dev@beam.apache.org
Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
In your example it seems as though your HDFS configuration doesn't contain any ADL-specific
configuration: "--hdfsConfiguration='[{\"fs.defaultFS\":
\"hdfs://home/sample.txt\"]'"
Do you have a core-site.xml or hdfs-site.xml configured as per:
https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html ?
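For reference, the ADL entries that core-site.xml would need per the hadoop-azure-datalake module are roughly the following sketch; the tenant ID, client ID, and secret below are placeholders, not values from this thread:

```xml
<configuration>
  <!-- Register the ADLS filesystem implementation for adl:// URIs -->
  <property>
    <name>fs.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.Adl</value>
  </property>
  <!-- OAuth2 service-principal credentials (placeholder values) -->
  <property>
    <name>dfs.adls.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.refresh.url</name>
    <value>https://login.microsoftonline.com/TENANT-ID/oauth2/token</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.client.id</name>
    <value>CLIENT-ID</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.credential</name>
    <value>CLIENT-SECRET</value>
  </property>
</configuration>
```

On an HDI cluster these are normally provisioned already, but they would need to be visible to the Beam pipeline's Hadoop configuration as well.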
From the documentation for --hdfsConfiguration:
A list of Hadoop configurations used to configure zero or more Hadoop filesystems. By
default, Hadoop configuration is loaded from 'core-site.xml' and 'hdfs-site.xml' based
upon the HADOOP_CONF_DIR and YARN_CONF_DIR environment variables. To specify
configuration on the command-line, represent the value as a JSON list of JSON maps, where
each map represents the entire configuration for a single Hadoop filesystem. For example
--hdfsConfiguration='[{\"fs.default.name\": \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'
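As a sketch only (the account name and credential values are placeholders, not from this thread), a well-formed --hdfsConfiguration value that carries the ADL filesystem and its credentials in a single JSON map might look like:

```
--hdfsConfiguration='[{"fs.defaultFS": "adl://ACCOUNT.azuredatalakestore.net/",
  "fs.adl.impl": "org.apache.hadoop.fs.adl.AdlFileSystem",
  "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
  "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/TENANT-ID/oauth2/token",
  "dfs.adls.oauth2.client.id": "CLIENT-ID",
  "dfs.adls.oauth2.credential": "CLIENT-SECRET"}]'
```

Note that each map must be valid JSON, including the closing brace, and that fs.defaultFS points at a filesystem root URI rather than a file.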
From:
https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:
Hi,
FYI, I'm in touch with Microsoft Azure team about that.
We are testing the ADLS support via HDFS.
I'll keep you posted.
Regards
JB
On 11/22/2017 09:12 AM, Milan Chandna wrote:
Hi,
Has anyone tried I/O from/to an ADLS account on Beam with the Spark runner?
I tried this recently but was unable to make it work.
Steps that I tried:
1. Took HDI + Spark 1.6 cluster with default storage as ADLS account.
2. Built Apache Beam on that, built to include the BEAM-2790
(https://issues.apache.org/jira/browse/BEAM-2790) fix, which I was earlier
hitting for ADL as well.
3. Modified WordCount.java example to use HadoopFileSystemOptions
4. Since the HDI + Spark cluster has ADLS as the defaultFS, I tried 2 things:
* Just gave the input path and output path as
adl://home/sample.txt and adl://home/output
* In addition to the adl input and output paths, also gave the required
HDFS configuration with the adl-required configs as well.
Neither worked, by the way.
1. Have checked ACLs and permissions. In fact, a similar job with the
same paths works on Spark directly.
2. Issues faced:
* For input, Beam is not able to find the path. Console log:
Filepattern adl://home/sample.txt matched 0 files with total size 0
* The output path always gets converted to a relative path, something
like this: /home/user1/adl:/home/output/.tmp....
I'm debugging this further, but wanted to check whether someone has already
faced this and found a resolution.
Here is a sample code and command I used.
HadoopFileSystemOptions options =
    PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
 .apply(new CountWords())
 .apply(MapElements.via(new FormatAsTextFn()))
 .apply(TextIO.write().to("adl://home/output"));
p.run().waitUntilFinish();

spark-submit --class org.apache.beam.examples.WordCount --master local \
  beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner \
  --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
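One observation on the command above: fs.defaultFS is set to an hdfs:// file path, but a defaultFS is normally a filesystem root URI, and the pipeline then reads that value as the input path. A hedged sketch of what the invocation might look like when pointed directly at ADLS (the account name is a placeholder; ADL credentials would still have to come from core-site.xml or additional keys in the same JSON map):

```
spark-submit --class org.apache.beam.examples.WordCount --master local \
  beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner \
  --hdfsConfiguration='[{"fs.defaultFS": "adl://ACCOUNT.azuredatalakestore.net/"}]'
```

With that shape, an explicit input such as adl://home/sample.txt stays a full adl:// URI rather than being resolved against an hdfs:// default.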
P.S.: I created a fat jar to use with Spark just for testing. Is there any
other, more correct way of running it with the Spark runner?
-Milan.
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com