It was the missing hdfs filesystem extension dependency. Thanks
Jean-Baptiste. Much appreciated.

Regards,
Shashank

On Wed, Jan 10, 2018 at 2:09 PM, Jean-Baptiste Onofré <[email protected]>
wrote:

> Hi
>
> Do you have the beam hdfs filesystem extension in the dependencies ? Did
> you define the HADOOP_CONF_DIR env variable containing path to
> hdfs-site.xml ?
>
> Regards
> JB
>
>
> On 01/10/2018 08:55 AM, Shashank Prabhakara wrote:
>
>> Hello,
>>
>> I'm testing some pipelines on a dataproc cluster with hadoop version
>> 2.8.2, beam 2.3.0-SNAPSHOT.
>> I have observed on our pipeline as well as the wordcount that ships with
>> beam, that FileBasedSource does not "match" any files when using hdfs
>> prefix - verified this with apex runner and direct runner. Local fs and
>> GoogleHadoopFileSystem work fine. HDFS files access is verified from all
>> worker nodes for all users from cli.
>>
>> In the logs (console for direct runner, apex.log from one of the
>> containers for apex runner):
>> INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern
>> hdfs:///tmp/input/
>>
>> Tried numerous versions of the same uri. For example:
>> INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern
>> hdfs://cluster-m/tmp/input/twitter.avro
>> INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern
>> hdfs://mycluster-m/tmp/input/twitter.avro
>>
>> Works for gcs files:
>> INFO org.apache.beam.sdk.io.FileBasedSource: Matched 1 files for pattern
>> gs://mybucket/input/twitter/twitter.avro
>>
>>
>>
>> To reproduce, use beam examples archetype, package and execute:
>>
>> mvn archetype:generate -DarchetypeRepository=https://
>> repository.apache.org/content/groups/snapshots
>> -DarchetypeGroupId=org.apache.beam -DarchetypeArtifactId=beam-sdk
>> s-java-maven-archetypes-examples -DarchetypeVersion=LATEST
>> -DgroupId=org.example -DartifactId=word-count-beam -Dversion="0.1"
>> -Dpackage=org.apache.beam.examples -DinteractiveMode=false
>>
>> cd word-count-beam
>> mvn clean package -Papex-runner -DskipTests
>>
>> yarn jar target/word-count-beam-bundled-0.1.jar
>> org.apache.beam.examples.WordCount --inputFile=hdfs:///tmp/input/pom.xml
>> --output=/tmp/output --runner=ApexRunner --embeddedExecution=false
>>
>>
>> Note: "mvn compile exec:java ..." would not work for me due to
>> classpath/version-compat issues. Also needed to exclude org.apache.hadoop:*
>> and com.google.cloud.bigdataoss:* from shaded jar for version compat.
>>
>> Appreciate any help.
>>
>> Regards,
>> Shashank
>>
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Reply via email to