Thanks Davor, what you said makes sense! Some guidance on registering the GCS file system on a remote cluster would be greatly appreciated!
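A minimal sketch of what explicit registration could look like, assuming the
0.x-era IOChannelUtils API discussed below; the bulk-registration method name
is an assumption carried over from that SDK lineage and may differ in your
Beam version:

    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.util.IOChannelUtils;

    public class RegisterGcsSketch {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        // ASSUMPTION: a 0.x-era bulk registration call that wires up the
        // standard schemes ("file", "gs") for this JVM; verify the method
        // exists in your Beam version. As the error quoted below shows,
        // registering a scheme twice in one JVM throws "already registered",
        // so this only helps in a JVM where the "gs" handler is actually
        // missing (i.e., on the executors, not the driver).
        IOChannelUtils.registerStandardIOFactories(options);
      }
    }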
> On Jan 31, 2017, at 2:47 PM, Davor Bonaci <da...@apache.org> wrote:
>
> The file systems are automatically registered locally, so that's why you'd be
> getting an error saying 'already registered'. We need to get it registered on
> your cluster, in the remote execution environment, where you are getting an
> error saying 'Unable to find handler for gs://'.
>
> Amit, any idea from the Spark runner whether IOChannelFactories get staged?
> Would you expect AutoService to register them automatically, or does
> something manual need to happen?
>
> On Mon, Jan 30, 2017 at 10:11 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
> Thank you Davor for the reply!
>
> Your understanding of my problem is exactly right.
>
> I thought about the issue you mentioned and looked at the Beam source code.
> It looks to me that IO is done via the IOChannelFactory class, which has two
> subclasses, FileIOChannelFactory and GcsIOChannelFactory. I figured probably
> the wrong class got registered. This link I found,
> http://markmail.org/message/mrv4cg4y6bjtdssy, points out the same
> registration problem. So I tried registering GcsIOChannelFactory, but got
> the following error:
>
> Scheme: [file] is already registered with class
> org.apache.beam.sdk.util.FileIOChannelFactory
>
> Now I'm not sure what to do.
>
> Thanks for the help!
>
> Chaoran
>
>> On Jan 30, 2017, at 12:05 PM, Davor Bonaci <da...@apache.org> wrote:
>>
>> Sorry for the delay here.
>>
>> Am I correct in summarizing that "gs://bucket/file" doesn't work on a Spark
>> cluster, but does with the Spark runner locally? Beam file systems utilize
>> AutoService functionality and, generally speaking, all file systems should
>> be available and automatically registered on all runners. This is probably
>> just a simple matter of staging the right classes on the cluster.
>>
>> Pei, any additional thoughts here?
>>
>> On Mon, Jan 23, 2017 at 1:58 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
>> Sorry for the spam. But to clarify, I didn't write the code; I'm using the
>> code described here: https://beam.apache.org/get-started/wordcount-example/
>> So the file already exists in GS.
>>
>>> On Jan 23, 2017, at 4:55 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>
>>> I didn't upload the file. But since the identical Beam code, when running
>>> in Spark local mode, was able to fetch the file and process it, the file
>>> does exist. It's just that somehow Spark standalone mode can't find it.
>>>
>>>> On Jan 23, 2017, at 4:50 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>
>>>> I think "external" is the key here: your cluster is running all its
>>>> components on your local machine, so you're good.
>>>>
>>>> As for GS, it's like Amazon's S3, sort of a cloud-service HDFS offered by
>>>> Google. You need to upload your file to GS. Have you?
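Regarding Davor's AutoService question at the top of the thread: AutoService
is a compile-time annotation processor that writes java.util.ServiceLoader
metadata (META-INF/services files) into the jar. Below is a self-contained
sketch of the mechanism; the Greeter interface is made up for illustration and
is not a Beam type:

    import com.google.auto.service.AutoService;
    import java.util.ServiceLoader;

    // Requires com.google.auto.service:auto-service on the compile classpath.
    public class AutoServiceDemo {

      // The extension point: implementations are discovered at runtime.
      public interface Greeter {
        String greet();
      }

      // At compile time, @AutoService writes the file
      // META-INF/services/AutoServiceDemo$Greeter into the jar,
      // listing HelloGreeter as a provider.
      @AutoService(Greeter.class)
      public static class HelloGreeter implements Greeter {
        @Override
        public String greet() {
          return "hello";
        }
      }

      public static void main(String[] args) {
        // ServiceLoader finds providers only if the META-INF/services entry
        // is present on the classpath of *this* JVM. If the entry never
        // reaches the Spark executors, nothing is found.
        for (Greeter greeter : ServiceLoader.load(Greeter.class)) {
          System.out.println(greeter.greet());
        }
      }
    }

If Beam discovers IOChannelFactory implementations this way, as Davor's
question suggests, a handler can silently go missing on the cluster when
fat-jar assembly drops or overwrites META-INF/services entries; the
maven-shade-plugin's ServicesResourceTransformer exists to merge them.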
>>>> On Mon, Jan 23, 2017 at 11:47 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>> Well, my file is not in my local filesystem. It's in GS.
>>>> This is the line of code that reads the input file:
>>>>
>>>> p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
>>>>
>>>> And this page, https://beam.apache.org/get-started/quickstart/, says the
>>>> following: "you can't access a local file if you are running the pipeline
>>>> on an external cluster". I'm indeed trying to run a pipeline on a
>>>> standalone Spark cluster running on my local machine, so local files are
>>>> not an option.
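For reference, a self-contained version of that read, written against the same
0.x-era TextIO API used in the thread; the output location is a placeholder
bucket, not one from the discussion:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class GcsReadSketch {
      public static void main(String[] args) {
        // e.g. pass --runner=SparkRunner when submitting to Spark.
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // The same read as above; the wildcard matches all input shards.
        p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
         // Writing to gs:// needs the same "gs" handler on the workers
         // that the read does.
         .apply(TextIO.Write.to("gs://your-bucket/output")); // placeholder

        p.run();
      }
    }

Both the read and the write resolve their scheme through the registered
IOChannelFactory instances, which is why the job fails on the executors if
the "gs" handler is not registered there.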
>>>>> On Jan 23, 2017, at 4:41 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>
>>>>> Why not try file:// instead? It doesn't seem like you're using Google
>>>>> Storage, right? I mean, the input file is on your local FS.
>>>>>
>>>>> On Mon, Jan 23, 2017 at 11:34 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>>> No, I'm not using Dataproc.
>>>>> I'm simply running on my local machine. I started a local Spark cluster
>>>>> with sbin/start-master.sh and sbin/start-slave.sh, then submitted my
>>>>> Beam job to that cluster.
>>>>> The gs file is the kinglear.txt from Beam's example code and it should
>>>>> be public.
>>>>>
>>>>> My full stack trace is attached.
>>>>>
>>>>> Thanks,
>>>>> Chaoran
>>>>>
>>>>>> On Jan 23, 2017, at 4:23 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>
>>>>>> Maybe. Are you running on Dataproc? Are you using YARN/Mesos? Do the
>>>>>> machines hosting the executor processes have access to GS? Could you
>>>>>> paste the entire stack trace?
>>>>>>
>>>>>> On Mon, Jan 23, 2017 at 11:21 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>>>> Thank you Amit for the reply,
>>>>>>
>>>>>> I just tried two more runners and below is a summary:
>>>>>>
>>>>>> DirectRunner: works.
>>>>>> FlinkRunner: works in local mode, but in cluster mode I got the error
>>>>>> "Communication with JobManager failed: lost connection to the
>>>>>> JobManager".
>>>>>> SparkRunner: works in local mode (mvn exec command) but fails in
>>>>>> cluster mode (spark-submit) with the error I pasted in the previous
>>>>>> email.
>>>>>>
>>>>>> In SparkRunner's case, can it be that the Spark executors can't access
>>>>>> the gs file in Google Storage?
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>>> On Jan 23, 2017, at 3:28 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>
>>>>>>> Is this working for you with other runners? Judging by the stack
>>>>>>> trace, it seems like IOChannelUtils fails to find a handler, so it
>>>>>>> doesn't look like a Spark-specific problem.
>>>>>>>
>>>>>>> On Mon, Jan 23, 2017 at 8:50 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>>>>> Thank you Amit and JB!
>>>>>>>
>>>>>>> This is not related to DC/OS itself, but I ran into a problem when
>>>>>>> launching a Spark job on a cluster with spark-submit. My Spark job
>>>>>>> written in Beam can't read the specified gs file. I got the following
>>>>>>> error:
>>>>>>>
>>>>>>> Caused by: java.io.IOException: Unable to find handler for gs://beam-samples/sample.txt
>>>>>>>   at org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:307)
>>>>>>>   at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:528)
>>>>>>>   at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
>>>>>>>   at org.apache.beam.runners.spark.io.SourceRDD$Bounded$1.hasNext(SourceRDD.java:125)
>>>>>>>
>>>>>>> Then I thought about switching to reading from another source, but I
>>>>>>> saw in Beam's documentation that TextIO can only read from files in
>>>>>>> Google Cloud Storage (prefixed with gs://) when running in cluster
>>>>>>> mode. How do you guys do file IO in Beam when using the SparkRunner?
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Chaoran
>>>>>>>
>>>>>>>> On Jan 22, 2017, at 4:32 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I'll join JB's comment on the Spark runner: Beam pipelines using the
>>>>>>>> Spark runner can be submitted with Spark's spark-submit script. Find
>>>>>>>> out more in the Spark runner documentation:
>>>>>>>> https://beam.apache.org/documentation/runners/spark/
>>>>>>>>
>>>>>>>> Amit.
>>>>>>>>
>>>>>>>> On Sun, Jan 22, 2017 at 8:03 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Not directly DC/OS (I think Stephen did some tests on it), but I have
>>>>>>>> a platform running Spark and Flink with Beam on Mesos + Marathon.
>>>>>>>>
>>>>>>>> It basically doesn't have anything special, as running pipelines uses
>>>>>>>> spark-submit (as in Spark "natively").
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>>
>>>>>>>> On 01/22/2017 12:56 AM, Chaoran Yu wrote:
>>>>>>>> > Hello all,
>>>>>>>> >
>>>>>>>> > Has anyone had experience using Beam on DC/OS? I want to run Beam
>>>>>>>> > code executed with the Spark runner on DC/OS. As a next step, I
>>>>>>>> > would like to run the Flink runner as well. There doesn't seem to
>>>>>>>> > be any information about running Beam on DC/OS on the web, so some
>>>>>>>> > pointers are greatly appreciated.
>>>>>>>> >
>>>>>>>> > Thank you,
>>>>>>>> >
>>>>>>>> > Chaoran Yu
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jean-Baptiste Onofré
>>>>>>>> jbono...@apache.org
>>>>>>>> http://blog.nanthrax.net
>>>>>>>> Talend - http://www.talend.com
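For anyone hitting the same trace: the failing call is
IOChannelUtils.getFactory, so a small probe run with the job's exact classpath
(for example, on a worker machine) shows whether the "gs" handler is visible
in a given JVM. A minimal sketch; the sample path is the one from the error
above:

    import java.io.IOException;
    import org.apache.beam.sdk.util.IOChannelFactory;
    import org.apache.beam.sdk.util.IOChannelUtils;

    public class GsHandlerProbe {
      public static void main(String[] args) throws IOException {
        // Succeeds only if a factory for the "gs" scheme is registered in
        // this JVM; otherwise it throws the same "Unable to find handler
        // for gs://..." IOException seen on the Spark executors.
        IOChannelFactory factory =
            IOChannelUtils.getFactory("gs://beam-samples/sample.txt");
        System.out.println("gs:// resolves to " + factory.getClass().getName());
      }
    }

If the probe fails on the executors' classpath but passes locally, the problem
is staging, which matches Davor's diagnosis earlier in the thread.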