Thanks Davor, what you said makes sense!

Some guidance on registering the GCS file system on a remote cluster would be
greatly appreciated!


> On Jan 31, 2017, at 2:47 PM, Davor Bonaci <da...@apache.org> wrote:
> 
> The file systems are automatically registered locally, so that's why you'd be 
> getting an error saying 'already registered'. We need to get it registered on 
> your cluster, in the remote execution environment, where you are getting an 
> error saying 'Unable to find handler for gs://'.
> 
> Amit, any idea from the Spark runner side whether IOChannelFactories get
> staged? Would you expect AutoService to register them automatically, or does
> something manual need to happen?
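> 
> For reference, the AutoService path boils down to a registrar class that
> ServiceLoader discovers on the classpath at runtime. A minimal sketch of the
> idea (assuming an IOChannelFactoryRegistrar interface; exact names and
> signatures may differ between Beam versions):
> 
> import com.google.auto.service.AutoService;
> import org.apache.beam.sdk.options.PipelineOptions;
> import org.apache.beam.sdk.util.GcsIOChannelFactory;
> import org.apache.beam.sdk.util.IOChannelFactory;
> import org.apache.beam.sdk.util.IOChannelFactoryRegistrar;
> 
> // Sketch only: @AutoService generates the META-INF/services entry that
> // ServiceLoader reads at runtime, so the registrar only takes effect on
> // the cluster if the jar containing this class is staged there.
> @AutoService(IOChannelFactoryRegistrar.class)
> public class GcsRegistrarSketch implements IOChannelFactoryRegistrar {
>   @Override
>   public IOChannelFactory fromOptions(PipelineOptions options) {
>     // Hypothetical factory method; construction details vary by version.
>     return GcsIOChannelFactory.fromOptions(options);
>   }
> 
>   @Override
>   public String getScheme() {
>     return "gs";
>   }
> }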
> 
> On Mon, Jan 30, 2017 at 10:11 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
> Thank you Davor for the reply!
> 
> Your understanding of my problem is exactly right. 
> 
> I thought about the issue you mentioned and looked at the Beam source code.
> It looks like IO is done via the IOChannelFactory class, which has two
> subclasses, FileIOChannelFactory and GcsIOChannelFactory. I figured the
> wrong class probably got registered. This link I found,
> http://markmail.org/message/mrv4cg4y6bjtdssy, points out the same
> registration problem. So I tried registering GcsIOChannelFactory, but got
> the following error:
> 
> Scheme: [file] is already registered with class 
> org.apache.beam.sdk.util.FileIOChannelFactory
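> 
> For completeness, the registration call I tried was along these lines (a
> sketch; I'm assuming registerStandardIOFactories is the relevant entry point
> in this Beam version):
> 
> import org.apache.beam.sdk.options.PipelineOptions;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.util.IOChannelUtils;
> 
> PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
> // Registers factories for both "file" and "gs". Since "file" is already
> // registered automatically in the local JVM, this second registration
> // trips the duplicate-scheme check and produces the error above.
> IOChannelUtils.registerStandardIOFactories(options);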
> 
> Now I’m not sure what to do.
> 
> Thanks for the help!
> 
> Chaoran
> 
> 
> 
>> On Jan 30, 2017, at 12:05 PM, Davor Bonaci <da...@apache.org> wrote:
>> 
>> Sorry for the delay here.
>> 
>> Am I correct in summarizing that "gs://bucket/file" doesn't work on a 
>> Spark cluster, but does with Spark runner locally? Beam file systems utilize 
>> AutoService functionality and, generally speaking, all filesystems should be 
>> available and automatically registered on all runners. This is probably just 
>> a simple matter of staging the right classes on the cluster.
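>> 
>> A sketch of what staging could look like from the pipeline side, assuming
>> SparkPipelineOptions exposes a filesToStage option (the jar path below is a
>> placeholder):
>> 
>> import java.util.Collections;
>> import org.apache.beam.runners.spark.SparkPipelineOptions;
>> import org.apache.beam.runners.spark.SparkRunner;
>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>> 
>> SparkPipelineOptions options =
>>     PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
>> options.setRunner(SparkRunner.class);
>> // Stage the bundled jar that contains the GCS IOChannelFactory and its
>> // META-INF/services registration so the executors can discover it too.
>> options.setFilesToStage(Collections.singletonList(
>>     "/path/to/pipeline-bundled.jar"));  // placeholder path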
>> 
>> Pei, any additional thoughts here?
>> 
>> On Mon, Jan 23, 2017 at 1:58 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
>> Sorry for the spam. But to clarify, I didn’t write the code. I’m using the 
>> code described here https://beam.apache.org/get-started/wordcount-example/
>> So the file already exists in GS.
>> 
>>> On Jan 23, 2017, at 4:55 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
>>> 
>>> I didn’t upload the file. But since the identical Beam code, when running 
>>> in Spark local mode, was able to fetch the file and process it, the file 
>>> does exist.
>>> It’s just that somehow Spark standalone mode can’t find the file.
>>> 
>>> 
>>>> On Jan 23, 2017, at 4:50 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>> 
>>>> I think "external" is the key here, you're cluster is running all it's 
>>>> components on your local machine so you're good.
>>>> 
>>>> As for GS, it's like Amazon's S3 or sort-of a cloud service HDFS offered 
>>>> by Google. You need to upload your file to GS. Have you ?  
>>>> 
>>>> On Mon, Jan 23, 2017 at 11:47 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>> Well, my file is not in my local filesystem. It’s in GS. 
>>>> This is the line of code that reads the input file: 
>>>> p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
>>>> 
>>>> And this page https://beam.apache.org/get-started/quickstart/ says the 
>>>> following: “you can’t access a local file if you are running the pipeline 
>>>> on an external cluster”.
>>>> I’m indeed trying to run a pipeline on a standalone Spark cluster running 
>>>> on my local machine. So local files are not an option.
>>>> 
>>>> 
>>>>> On Jan 23, 2017, at 4:41 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>> 
>>>>> Why not try file:// instead? It doesn't seem like you're using Google 
>>>>> Storage, right? I mean, the input file is on your local FS.
>>>>> 
>>>>> On Mon, Jan 23, 2017 at 11:34 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>>> No, I’m not using Dataproc.
>>>>> I’m simply running on my local machine. I started a local Spark cluster 
>>>>> with sbin/start-master.sh and sbin/start-slave.sh. Then I submitted my 
>>>>> Beam job to that cluster.
>>>>> The gs:// file is kinglear.txt from Beam’s example code, and it should 
>>>>> be public. 
>>>>> 
>>>>> My full stack trace is attached.
>>>>> 
>>>>> Thanks,
>>>>> Chaoran
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Jan 23, 2017, at 4:23 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>> 
>>>>>> Maybe. Are you running on Dataproc? Are you using YARN/Mesos? Do the 
>>>>>> machines hosting the executor processes have access to GS? Could you 
>>>>>> paste the entire stack trace?
>>>>>> 
>>>>>> On Mon, Jan 23, 2017 at 11:21 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>>>> Thank you Amit for the reply,
>>>>>> 
>>>>>> I just tried two more runners and below is a summary:
>>>>>> 
>>>>>> DirectRunner: works
>>>>>> FlinkRunner: works in local mode, but in cluster mode I got the error 
>>>>>> “Communication with JobManager failed: lost connection to the 
>>>>>> JobManager”.
>>>>>> SparkRunner: works in local mode (mvn exec command) but fails in cluster 
>>>>>> mode (spark-submit) with the error I pasted in the previous email.
>>>>>> 
>>>>>> In SparkRunner’s case, could it be that the Spark executors can’t 
>>>>>> access the gs:// file in Google Storage?
>>>>>> 
>>>>>> Thank you,
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Jan 23, 2017, at 3:28 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Is this working for you with other runners? Judging by the stack 
>>>>>>> trace, it seems like IOChannelUtils fails to find a handler, so it 
>>>>>>> doesn't seem to be a Spark-specific problem. 
>>>>>>> 
>>>>>>> On Mon, Jan 23, 2017 at 8:50 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
>>>>>>> Thank you Amit and JB! 
>>>>>>> 
>>>>>>> This is not related to DC/OS itself, but I ran into a problem when 
>>>>>>> launching a Spark job on a cluster with spark-submit. My Spark job 
>>>>>>> written in Beam can’t read the specified gs:// file. I got the following 
>>>>>>> error:
>>>>>>> 
>>>>>>> Caused by: java.io.IOException: Unable to find handler for gs://beam-samples/sample.txt
>>>>>>>         at org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:307)
>>>>>>>         at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:528)
>>>>>>>         at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
>>>>>>>         at org.apache.beam.runners.spark.io.SourceRDD$Bounded$1.hasNext(SourceRDD.java:125)
>>>>>>> 
>>>>>>> Then I thought about switching to reading from another source, but I 
>>>>>>> saw in Beam’s documentation that TextIO can only read from files in 
>>>>>>> Google Cloud Storage (prefixed with gs://) when running in cluster 
>>>>>>> mode. How do you guys do file IO in Beam when using the SparkRunner?
>>>>>>> 
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> Chaoran
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jan 22, 2017, at 4:32 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> I'll join JB's comment on the Spark runner: submitting Beam pipelines 
>>>>>>>> using the Spark runner can be done with Spark's spark-submit script. 
>>>>>>>> Find out more in the Spark runner documentation: 
>>>>>>>> https://beam.apache.org/documentation/runners/spark/
>>>>>>>> 
>>>>>>>> Amit.
>>>>>>>> 
>>>>>>>> On Sun, Jan 22, 2017 at 8:03 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Not directly DC/OS (I think Stephen did some tests on it), but I have 
>>>>>>>> a platform running Spark and Flink with Beam on Mesos + Marathon.
>>>>>>>> 
>>>>>>>> It basically doesn't have anything special, as running pipelines uses 
>>>>>>>> spark-submit (as in Spark "natively").
>>>>>>>> 
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>> 
>>>>>>>> On 01/22/2017 12:56 AM, Chaoran Yu wrote:
>>>>>>>> > Hello all,
>>>>>>>> >
>>>>>>>> > Has anyone had experience using Beam on DC/OS? I want to run Beam
>>>>>>>> > code executed with the Spark runner on DC/OS. As a next step, I
>>>>>>>> > would like to run the Flink runner as well. There doesn't seem to
>>>>>>>> > exist any information about running Beam on DC/OS that I can find
>>>>>>>> > on the web. So some pointers are greatly appreciated.
>>>>>>>> >
>>>>>>>> > Thank you,
>>>>>>>> >
>>>>>>>> > Chaoran Yu
>>>>>>>> >
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Jean-Baptiste Onofré
>>>>>>>> jbono...@apache.org
>>>>>>>> http://blog.nanthrax.net
>>>>>>>> Talend - http://www.talend.com
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
> 
> 
