Yes, I did try it out, and it chooses the local file system even though my checkpoint location starts with s3n://.
I am not sure how I can make it load the S3FileSystem.

On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:

> Right, I'd agree, it seems to be only with delete.
>
> Could you by chance run just the delete to see if it fails?
>
>     FileSystem.get(sc.hadoopConfiguration)
>       .delete(new Path(somepath), true)
>
> ------------------------------
> *From:* Ankur Srivastava <ankur.srivast...@gmail.com>
> *Sent:* Thursday, January 5, 2017 10:05:03 AM
> *To:* Felix Cheung
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark GraphFrame ConnectedComponents
>
> Yes, it works to read the vertices and edges data from the S3 location, and it
> is also able to write the checkpoint files to S3. It only fails when deleting
> the data, and that is because it tries to use the default file system. I
> tried looking up how to update the default file system but could not find
> anything in that regard.
>
> Thanks
> Ankur
>
> On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>
>> From the stack it looks to be an error from the explicit call to
>> hadoop.fs.FileSystem.
>>
>> Is the URL scheme for s3n registered?
>> Does it work when you try to read from s3 from Spark?
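[A minimal sketch of the delete test suggested above, assuming a live SparkContext `sc` and a hypothetical s3n:// path. The key point is that `FileSystem.get(conf)` resolves the filesystem for the *default* scheme (`fs.defaultFS`), not the scheme of the path being deleted, which is why it can hand back a LocalFileSystem for an s3n:// path:]

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical checkpoint location; replace with your real s3n:// path.
val somepath = "s3n://my-bucket/checkpoint-test"

// FileSystem.get(conf) returns the filesystem for fs.defaultFS.
// If the default is file:///, deleting an s3n:// path throws
// "Wrong FS: s3n://..., expected: file:///".
FileSystem.get(sc.hadoopConfiguration)
  .delete(new Path(somepath), true)

// Scheme-aware alternative: resolve the filesystem from the path itself,
// so an s3n:// path resolves to the S3 filesystem regardless of the default.
new Path(somepath).getFileSystem(sc.hadoopConfiguration)
  .delete(new Path(somepath), true)
```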
>> _____________________________
>> From: Ankur Srivastava <ankur.srivast...@gmail.com>
>> Sent: Wednesday, January 4, 2017 9:23 PM
>> Subject: Re: Spark GraphFrame ConnectedComponents
>> To: Felix Cheung <felixcheun...@hotmail.com>
>> Cc: <user@spark.apache.org>
>>
>> This is the exact trace from the driver logs:
>>
>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
>> s3n://<checkpoint-folder>/8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3, expected: file:///
>>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
>>     at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
>>     at org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
>>     at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
>>     at GraphTest.main(GraphTest.java:31)  ----------- Application Class
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>>     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> And I am running Spark v1.6.2 and GraphFrames v0.3.0-spark1.6-s_2.10.
>>
>> Thanks
>> Ankur
>>
>> On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava <ankur.srivast...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am rerunning the pipeline to generate the exact trace; I have this
>>> part of the trace from the last run:
>>>
>>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
>>> s3n://<folder-path>, expected: file:///
>>>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>>>     at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
>>>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
>>>     at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)
>>>
>>> I also think the error is happening in this part of the code, at
>>> ConnectedComponents.scala:339. I am referring to the code at
>>> https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala
>>>
>>>     if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
>>>       // TODO: remove this after DataFrame.checkpoint is implemented
>>>       val out = s"${checkpointDir.get}/$iteration"
>>>       ee.write.parquet(out)
>>>       // may hit S3 eventually consistent issue
>>>       ee = sqlContext.read.parquet(out)
>>>
>>>       // remove previous checkpoint
>>>       if (iteration > checkpointInterval) {
>>>         FileSystem.get(sc.hadoopConfiguration)
>>>           .delete(new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}"), true)
>>>       }
>>>
>>>       System.gc() // hint Spark to clean shuffle directories
>>>     }
>>>
>>> Thanks
>>> Ankur
>>>
>>> On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>
>>>> Do you have more of the exception stack?
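[One way the delete in the quoted GraphFrames snippet could be made scheme-aware is to resolve the FileSystem from the checkpoint path itself rather than from the default configuration. This is only a sketch against the quoted code, not the actual GraphFrames fix:]

```scala
import org.apache.hadoop.fs.Path

// Sketch: inside the checkpoint-cleanup branch, resolve the filesystem
// from the path being deleted, so an s3n:// checkpoint directory is
// deleted via the S3 filesystem instead of the default (local) one.
if (iteration > checkpointInterval) {
  val oldCkpt = new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}")
  oldCkpt.getFileSystem(sc.hadoopConfiguration)
    .delete(oldCkpt, true)
}
```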
>>>> ------------------------------
>>>> *From:* Ankur Srivastava <ankur.srivast...@gmail.com>
>>>> *Sent:* Wednesday, January 4, 2017 4:40:02 PM
>>>> *To:* user@spark.apache.org
>>>> *Subject:* Spark GraphFrame ConnectedComponents
>>>>
>>>> Hi,
>>>>
>>>> I am trying to use the ConnectedComponents algorithm of GraphFrames, but
>>>> by default it needs a checkpoint directory. As I am running my Spark
>>>> cluster with S3 as the DFS and do not have access to an HDFS file system, I
>>>> tried using an S3 directory as the checkpoint directory, but I run into the
>>>> exception below:
>>>>
>>>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
>>>> s3n://<folder-path>, expected: file:///
>>>>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>>>>     at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
>>>>
>>>> If I set the checkpoint interval to -1 to avoid checkpointing, the driver
>>>> just hangs after 3 or 4 iterations.
>>>>
>>>> Is there some way I can set the default FileSystem to S3 for Spark, or
>>>> any other option?
>>>>
>>>> Thanks
>>>> Ankur
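[On the question of setting the default FileSystem to S3: the default scheme comes from the Hadoop property `fs.defaultFS` (`fs.default.name` in older Hadoop releases), which can be set on the SparkContext's Hadoop configuration. A sketch, with a hypothetical bucket name and placeholder credentials; whether this cleanly works around the GraphFrames delete issue is untested here:]

```scala
// Sketch: make s3n:// the default filesystem, so lookups that go through
// FileSystem.get(conf) resolve to the S3 filesystem. Bucket name and
// credential values below are placeholders.
sc.hadoopConfiguration.set("fs.defaultFS", "s3n://my-bucket")
// Older (Hadoop 1.x) name for the same property:
sc.hadoopConfiguration.set("fs.default.name", "s3n://my-bucket")
// s3n also needs credentials unless they are supplied elsewhere:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access-key>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secret-key>")
```

The same properties can also be passed at submit time via `spark-submit --conf spark.hadoop.fs.defaultFS=s3n://my-bucket`, since Spark copies `spark.hadoop.*` settings into the Hadoop configuration.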