Hi
Do you have the Beam HDFS filesystem module (beam-sdks-java-io-hadoop-file-system)
in your dependencies? Did you define the HADOOP_CONF_DIR environment variable,
pointing to the directory that contains hdfs-site.xml?
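A quick way to check the second point on a worker node is sketched below. It
assumes the Hadoop client configuration lives in core-site.xml and hdfs-site.xml
under HADOOP_CONF_DIR; check_hadoop_conf is a hypothetical helper name used only
for illustration, not part of Hadoop or Beam.

```shell
# Sketch: verify that HADOOP_CONF_DIR (or a directory passed as $1) exists
# and holds the HDFS client configuration files.
# check_hadoop_conf is a hypothetical helper, not a Hadoop or Beam command.
check_hadoop_conf() {
  conf_dir="${1:-${HADOOP_CONF_DIR:-}}"
  if [ -z "$conf_dir" ]; then
    echo "HADOOP_CONF_DIR is not set"
    return 1
  fi
  if [ ! -d "$conf_dir" ]; then
    echo "not a directory: $conf_dir"
    return 1
  fi
  status=0
  # Report each expected file; fail if any of them is missing.
  for f in core-site.xml hdfs-site.xml; do
    if [ -f "$conf_dir/$f" ]; then
      echo "found $conf_dir/$f"
    else
      echo "missing $conf_dir/$f"
      status=1
    fi
  done
  return "$status"
}
```

Calling check_hadoop_conf with no argument in the same shell that launches the
pipeline shows whether the variable is visible to that process.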
Regards
JB
On 01/10/2018 08:55 AM, Shashank Prabhakara wrote:
Hello,
I'm testing some pipelines on a Dataproc cluster with Hadoop 2.8.2 and Beam
2.3.0-SNAPSHOT.
On our pipeline, as well as on the WordCount example that ships with Beam,
FileBasedSource does not match any files when the path uses the hdfs prefix. I
verified this with both the Apex runner and the direct runner. The local
filesystem and GoogleHadoopFileSystem work fine. I also verified from the CLI
that the HDFS files are accessible from all worker nodes and for all users.
In the logs (console for direct runner, apex.log from one of the containers for
apex runner):
INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern
hdfs:///tmp/input/
I tried several variants of the same URI. For example:
INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern
hdfs://cluster-m/tmp/input/twitter.avro
INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern
hdfs://mycluster-m/tmp/input/twitter.avro
It works for GCS files:
INFO org.apache.beam.sdk.io.FileBasedSource: Matched 1 files for pattern
gs://mybucket/input/twitter/twitter.avro
To reproduce, generate the Beam examples archetype, package it, and execute:
    mvn archetype:generate \
      -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=LATEST -DgroupId=org.example -DartifactId=word-count-beam \
      -Dversion="0.1" -Dpackage=org.apache.beam.examples -DinteractiveMode=false
    cd word-count-beam
    mvn clean package -Papex-runner -DskipTests
    yarn jar target/word-count-beam-bundled-0.1.jar \
      org.apache.beam.examples.WordCount --inputFile=hdfs:///tmp/input/pom.xml \
      --output=/tmp/output --runner=ApexRunner --embeddedExecution=false
Note: "mvn compile exec:java ..." did not work for me because of classpath and
version compatibility issues. I also had to exclude org.apache.hadoop:* and
com.google.cloud.bigdataoss:* from the shaded jar for version compatibility.
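Since the shaded jar is built with exclusions, it may be worth confirming that
it still registers an HDFS filesystem. The sketch below assumes that Beam
discovers filesystems through the Java ServiceLoader file
META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar and that unzip is
available; list_fs_registrars is a hypothetical helper name.

```shell
# Sketch: print the FileSystemRegistrar implementations listed in a bundled jar.
# list_fs_registrars is a hypothetical helper; it requires unzip on the PATH.
list_fs_registrars() {
  jar_path="$1"
  if [ ! -f "$jar_path" ]; then
    echo "no such jar: $jar_path" >&2
    return 1
  fi
  # unzip -p writes the named archive entry to stdout.
  unzip -p "$jar_path" META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar
}

# Assumed usage against the jar built above:
# list_fs_registrars target/word-count-beam-bundled-0.1.jar | grep HadoopFileSystemRegistrar
```

If no Hadoop registrar shows up in the output, the hdfs scheme would probably
not be handled by the HDFS filesystem at all, which would be consistent with
"Matched 0 files".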
Appreciate any help.
Regards,
Shashank
--
Jean-Baptiste Onofré
http://blog.nanthrax.net
Talend - http://www.talend.com