Thank you Davor for the reply! Your understanding of my problem is exactly right.
I thought about the issue you mentioned and then looked at the Beam source code. It looks to me like IO is done via the IOChannelFactory class, which has two subclasses, FileIOChannelFactory and GcsIOChannelFactory. I figured the wrong class probably got registered. This thread I found, http://markmail.org/message/mrv4cg4y6bjtdssy, points out the same registration problem. So I tried registering GcsIOChannelFactory, but got the following error:

Scheme: [file] is already registered with class org.apache.beam.sdk.util.FileIOChannelFactory

Now I'm not sure what to do. Thanks for the help!

Chaoran

> On Jan 30, 2017, at 12:05 PM, Davor Bonaci <da...@apache.org> wrote:
>
> Sorry for the delay here.
>
> Am I correct in summarizing that "gs://bucket/file" doesn't work on a Spark
> cluster, but does with the Spark runner locally? Beam file systems utilize
> AutoService functionality and, generally speaking, all filesystems should be
> available and automatically registered on all runners. This is probably just
> a simple matter of staging the right classes on the cluster.
>
> Pei, any additional thoughts here?
>
> On Mon, Jan 23, 2017 at 1:58 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
> Sorry for the spam. But to clarify, I didn't write the code; I'm using the
> code described here: https://beam.apache.org/get-started/wordcount-example/
> So the file already exists in GS.
>
>> On Jan 23, 2017, at 4:55 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
>>
>> I didn't upload the file. But since the identical Beam code, when running in
>> Spark local mode, was able to fetch the file and process it, the file does
>> exist. It's just that somehow Spark standalone mode can't find the file.
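For anyone finding this thread later: the two errors fit together once you picture the registry as one factory per scheme, registered exactly once per JVM. Below is a toy sketch of that pattern, illustrative only; it is not Beam's actual IOChannelUtils code, and the class and scheme names are just borrowed from the error messages in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a scheme -> IO-factory registry. NOT Beam's real code;
// it just illustrates why registering twice throws, and why a missing
// "gs" registration yields "Unable to find handler for gs://...".
public class SchemeRegistry {
  private final Map<String, String> factories = new HashMap<>();

  public void register(String scheme, String factoryClass) {
    String existing = factories.putIfAbsent(scheme, factoryClass);
    if (existing != null) {
      // Mirrors the "Scheme: [file] is already registered" error above.
      throw new IllegalStateException(
          "Scheme: [" + scheme + "] is already registered with class " + existing);
    }
  }

  public String getFactory(String spec) throws java.io.IOException {
    int i = spec.indexOf("://");
    String scheme = (i >= 0) ? spec.substring(0, i) : "file";
    String factory = factories.get(scheme);
    if (factory == null) {
      // Mirrors the IOException in the quoted stack trace.
      throw new java.io.IOException("Unable to find handler for " + spec);
    }
    return factory;
  }

  public static void main(String[] args) throws Exception {
    SchemeRegistry r = new SchemeRegistry();
    r.register("file", "org.apache.beam.sdk.util.FileIOChannelFactory");
    // Without this line, the lookup below throws "Unable to find handler".
    r.register("gs", "org.apache.beam.sdk.util.GcsIOChannelFactory");
    // prints org.apache.beam.sdk.util.GcsIOChannelFactory
    System.out.println(r.getFactory("gs://beam-samples/sample.txt"));
  }
}
```

In other words, the executors hitting "Unable to find handler" suggests the gs scheme was never registered in those JVMs (Davor's "staging the right classes" point), while the "already registered" error suggests the file scheme registration simply ran twice.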
>>
>>> On Jan 23, 2017, at 4:50 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>
>>> I think "external" is the key here; your cluster is running all its
>>> components on your local machine, so you're good.
>>>
>>> As for GS, it's like Amazon's S3, or sort of a cloud-service HDFS offered
>>> by Google. You need to upload your file to GS. Have you?
>>>
>>> On Mon, Jan 23, 2017 at 11:47 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
>>> Well, my file is not in my local filesystem. It's in GS.
>>> This is the line of code that reads the input file:
>>>
>>> p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
>>>
>>> And this page, https://beam.apache.org/get-started/quickstart/, says the
>>> following: "you can't access a local file if you are running the pipeline
>>> on an external cluster".
>>> I'm indeed trying to run a pipeline on a standalone Spark cluster running
>>> on my local machine. So local files are not an option.
>>>
>>>> On Jan 23, 2017, at 4:41 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>
>>>> Why not try file:// instead? It doesn't seem like you're using Google
>>>> Storage, right? I mean, the input file is on your local FS.
>>>>
>>>> On Mon, Jan 23, 2017 at 11:34 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>> No, I'm not using Dataproc.
>>>> I'm simply running on my local machine. I started a local Spark cluster
>>>> with sbin/start-master.sh and sbin/start-slave.sh, then submitted my
>>>> Beam job to that cluster.
>>>> The gs:// file is kinglear.txt from Beam's example code and it should be
>>>> public.
>>>>
>>>> My full stack trace is attached.
>>>>
>>>> Thanks,
>>>> Chaoran
>>>>
>>>>> On Jan 23, 2017, at 4:23 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>
>>>>> Maybe. Are you running on Dataproc? Are you using YARN/Mesos? Do the
>>>>> machines hosting the executor processes have access to GS? Could you
>>>>> paste the entire stack trace?
>>>>>
>>>>> On Mon, Jan 23, 2017 at 11:21 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>>> Thank you Amit for the reply.
>>>>>
>>>>> I just tried two more runners; below is a summary:
>>>>>
>>>>> DirectRunner: works.
>>>>> FlinkRunner: works in local mode. I got an error "Communication with
>>>>> JobManager failed: lost connection to the JobManager" when running in
>>>>> cluster mode.
>>>>> SparkRunner: works in local mode (mvn exec command) but fails in cluster
>>>>> mode (spark-submit) with the error I pasted in the previous email.
>>>>>
>>>>> In SparkRunner's case, could it be that the Spark executors can't access
>>>>> the gs:// file in Google Storage?
>>>>>
>>>>> Thank you,
>>>>>
>>>>>> On Jan 23, 2017, at 3:28 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>
>>>>>> Is this working for you with other runners? Judging by the stack trace,
>>>>>> it seems like IOChannelUtils fails to find a handler, so it doesn't seem
>>>>>> like a Spark-specific problem.
>>>>>>
>>>>>> On Mon, Jan 23, 2017 at 8:50 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>>>> Thank you Amit and JB!
>>>>>>
>>>>>> This is not related to DC/OS itself, but I ran into a problem when
>>>>>> launching a Spark job on a cluster with spark-submit. My Spark job
>>>>>> written in Beam can't read the specified gs:// file.
>>>>>> I got the following error:
>>>>>>
>>>>>> Caused by: java.io.IOException: Unable to find handler for gs://beam-samples/sample.txt
>>>>>>   at org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:307)
>>>>>>   at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:528)
>>>>>>   at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
>>>>>>   at org.apache.beam.runners.spark.io.SourceRDD$Bounded$1.hasNext(SourceRDD.java:125)
>>>>>>
>>>>>> Then I thought about switching to reading from another source, but I saw
>>>>>> in Beam's documentation that TextIO can only read from files in Google
>>>>>> Cloud Storage (prefixed with gs://) when running in cluster mode. How do
>>>>>> you do file IO in Beam when using the SparkRunner?
>>>>>>
>>>>>> Thank you,
>>>>>> Chaoran
>>>>>>
>>>>>>> On Jan 22, 2017, at 4:32 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>
>>>>>>> I'll join JB's comment on the Spark runner: submitting Beam pipelines
>>>>>>> using the Spark runner can be done with Spark's spark-submit script.
>>>>>>> Find out more in the Spark runner documentation:
>>>>>>> https://beam.apache.org/documentation/runners/spark/
>>>>>>>
>>>>>>> Amit.
>>>>>>>
>>>>>>> On Sun, Jan 22, 2017 at 8:03 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Not directly DC/OS (I think Stephen did some tests on it), but I have a
>>>>>>> platform running Spark and Flink with Beam on Mesos + Marathon.
>>>>>>>
>>>>>>> It basically doesn't need anything special, as running pipelines uses
>>>>>>> spark-submit (as in Spark "natively").
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On 01/22/2017 12:56 AM, Chaoran Yu wrote:
>>>>>>> > Hello all,
>>>>>>> >
>>>>>>> > Has anyone had experience using Beam on DC/OS? I want to run Beam code
>>>>>>> > executed with the Spark runner on DC/OS. As a next step, I would like
>>>>>>> > to run the Flink runner as well. There doesn't seem to be any
>>>>>>> > information about running Beam on DC/OS that I can find on the web, so
>>>>>>> > some pointers are greatly appreciated.
>>>>>>> >
>>>>>>> > Thank you,
>>>>>>> >
>>>>>>> > Chaoran Yu
>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> jbono...@apache.org
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com