It was the missing hdfs filesystem extension dependency. Thanks Jean-Baptiste. Much appreciated.
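
For the archive, a minimal sketch of the dependency in question as it might appear in the project's pom.xml. The coordinates are those of Beam's Hadoop file system module; the `beam.version` property is an assumption (the examples archetype normally defines one), so substitute the Beam version you build against, 2.3.0-SNAPSHOT in this case:

```xml
<!-- Sketch only: registers the hdfs:// scheme with Beam's FileSystems.
     ${beam.version} is assumed to be defined in <properties>. -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-hadoop-file-system</artifactId>
  <version>${beam.version}</version>
</dependency>
```

As JB points out below, HADOOP_CONF_DIR may also need to point at the directory containing hdfs-site.xml so that the Hadoop configuration can be picked up.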
Regards,
Shashank

On Wed, Jan 10, 2018 at 2:09 PM, Jean-Baptiste Onofré <[email protected]> wrote:

> Hi
>
> Do you have the beam hdfs filesystem extension in the dependencies ? Did
> you define the HADOOP_CONF_DIR env variable containing path to
> hdfs-site.xml ?
>
> Regards
> JB
>
>
> On 01/10/2018 08:55 AM, Shashank Prabhakara wrote:
>
>> Hello,
>>
>> I'm testing some pipelines on a dataproc cluster with hadoop version
>> 2.8.2, beam 2.3.0-SNAPSHOT.
>> I have observed on our pipeline as well as the wordcount that ships with
>> beam, that FileBasedSource does not "match" any files when using hdfs
>> prefix - verified this with apex runner and direct runner. Local fs and
>> GoogleHadoopFileSystem work fine. HDFS files access is verified from all
>> worker nodes for all users from cli.
>>
>> In the logs (console for direct runner, apex.log from one of the
>> containers for apex runner):
>> INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern
>> hdfs:///tmp/input/
>>
>> Tried numerous versions of the same uri. For example:
>> INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern
>> hdfs://cluster-m/tmp/input/twitter.avro
>> INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern
>> hdfs://mycluster-m/tmp/input/twitter.avro
>>
>> Works for gcs files:
>> INFO org.apache.beam.sdk.io.FileBasedSource: Matched 1 files for pattern
>> gs://mybucket/input/twitter/twitter.avro
>>
>>
>>
>> To reproduce, use beam examples archetype, package and execute:
>>
>> mvn archetype:generate \
>>   -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots \
>>   -DarchetypeGroupId=org.apache.beam \
>>   -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
>>   -DarchetypeVersion=LATEST \
>>   -DgroupId=org.example -DartifactId=word-count-beam -Dversion="0.1" \
>>   -Dpackage=org.apache.beam.examples -DinteractiveMode=false
>>
>> cd word-count-beam
>> mvn clean package -Papex-runner -DskipTests
>>
>> yarn jar target/word-count-beam-bundled-0.1.jar \
>>   org.apache.beam.examples.WordCount --inputFile=hdfs:///tmp/input/pom.xml \
>>   --output=/tmp/output --runner=ApexRunner --embeddedExecution=false
>>
>>
>> Note: "mvn compile exec:java ..." would not work for me due to
>> classpath/version-compat issues. Also needed to exclude org.apache.hadoop:*
>> and com.google.cloud.bigdataoss:* from shaded jar for version compat.
>>
>> Appreciate any help.
>>
>> Regards,
>> Shashank
>>
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
