newAPIHadoopFile bad performance

2017-01-05 Thread Mudasar
Hi,

I am using newAPIHadoopFile to process a large number of S3 files (around 20
thousand) by passing their URLs as a comma-separated string. It takes around *7
minutes* to start the job. I am running the job on EMR 5.2.0 with Spark
2.0.2.

Here is the code

Configuration conf = new Configuration();

JavaPairRDD<Text, BytesWritable> file =
    jsc.newAPIHadoopFile(inputPath,
        FullFileInputFormat.class, Text.class,
        BytesWritable.class, conf);

I have experimented with different URL schemes (s3://, s3n:// and s3a://) but
none of them helped.

Could you please suggest something to reduce this job start time?
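
If much of that startup time goes into listing the ~20 thousand objects, one
knob that may be worth trying (assuming FullFileInputFormat lists its inputs
through Hadoop's FileInputFormat) is the listing thread count:

// sketch: raise FileInputFormat's listing parallelism (a Hadoop 2.x property); 32 is only illustrative
conf.setInt("mapreduce.input.fileinputformat.list-status.num-threads", 32);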



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/newAPIHadoopFile-bad-performance-tp28281.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org






Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Joseph Bradley
Would it be more robust to use the Path when creating the FileSystem?
https://github.com/graphframes/graphframes/issues/160
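
A sketch of what that could look like in the checkpoint cleanup (variable names
follow the ConnectedComponents snippet quoted further down; this is an
assumption about the fix, not the committed patch):

import org.apache.hadoop.fs.Path

val checkpointPath = new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}")
checkpointPath.getFileSystem(sc.hadoopConfiguration).delete(checkpointPath, true)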

On Thu, Jan 5, 2017 at 4:57 PM, Felix Cheung 
wrote:

> This is likely a factor of your hadoop config and Spark rather than
> anything specific with GraphFrames.
>
> You might have better luck getting assistance if you could isolate the
> code to a simple case that manifests the problem (without GraphFrames), and
> repost.
>
>
> --
> *From:* Ankur Srivastava 
> *Sent:* Thursday, January 5, 2017 3:45:59 PM
> *To:* Felix Cheung; d...@spark.apache.org
>
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark GraphFrame ConnectedComponents
>
> Adding DEV mailing list to see if this is a defect with ConnectedComponent
> or if they can recommend any solution.
>
> Thanks
> Ankur
>
> On Thu, Jan 5, 2017 at 1:10 PM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> Yes I did try it out and it chooses the local file system because my
>> checkpoint location starts with s3n://
>>
>> I am not sure how I can make it load the S3 file system.
>>
>> On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung 
>> wrote:
>>
>>> Right, I'd agree, it seems to be only with delete.
>>>
>>> Could you by chance run just the delete to see if it fails
>>>
>>> FileSystem.get(sc.hadoopConfiguration)
>>> .delete(new Path(somepath), true)
>>> --
>>> *From:* Ankur Srivastava 
>>> *Sent:* Thursday, January 5, 2017 10:05:03 AM
>>> *To:* Felix Cheung
>>> *Cc:* user@spark.apache.org
>>>
>>> *Subject:* Re: Spark GraphFrame ConnectedComponents
>>>
>>> Yes it works to read the vertices and edges data from S3 location and is
>>> also able to write the checkpoint files to S3. It only fails when deleting
>>> the data and that is because it tries to use the default file system. I
>>> tried looking up how to update the default file system but could not find
>>> anything in that regard.
>>>
>>> Thanks
>>> Ankur
>>>
>>> On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung >> > wrote:
>>>
 From the stack it looks to be an error from the explicit call to
 hadoop.fs.FileSystem.

 Is the URL scheme for s3n registered?
 Does it work when you try to read from s3 from Spark?

 _
 From: Ankur Srivastava 
 Sent: Wednesday, January 4, 2017 9:23 PM
 Subject: Re: Spark GraphFrame ConnectedComponents
 To: Felix Cheung 
 Cc: 



 This is the exact trace from the driver logs

 Exception in thread "main" java.lang.IllegalArgumentException: Wrong
 FS: s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7
 be/connected-components-c1dbc2b0/3, expected: file:///
 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
 at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF
 ileSystem.java:80)
 at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileSta
 tus(RawLocalFileSystem.java:529)
 at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInt
 ernal(RawLocalFileSystem.java:747)
 at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLoc
 alFileSystem.java:524)
 at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileS
 ystem.java:534)
 at org.graphframes.lib.ConnectedComponents$.org$graphframes$lib
 $ConnectedComponents$$run(ConnectedComponents.scala:340)
 at org.graphframes.lib.ConnectedComponents.run(ConnectedCompone
 nts.scala:139)
 at GraphTest.main(GraphTest.java:31) --- Application Class
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
 ssorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
 thodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
 $SparkSubmit$$runMain(SparkSubmit.scala:731)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
 .scala:181)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

 Thanks
 Ankur

 On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava <
 ankur.srivast...@gmail.com> wrote:

> Hi
>
> I am rerunning the pipeline to generate the exact trace, I have below
> part of trace from last run:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong
> FS: s3n://, expected: file:///
> at 

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
This is likely a factor of your hadoop config and Spark rather than anything
specific with GraphFrames.

You might have better luck getting assistance if you could isolate the code to 
a simple case that manifests the problem (without GraphFrames), and repost.



From: Ankur Srivastava 
Sent: Thursday, January 5, 2017 3:45:59 PM
To: Felix Cheung; d...@spark.apache.org
Cc: user@spark.apache.org
Subject: Re: Spark GraphFrame ConnectedComponents

Adding DEV mailing list to see if this is a defect with ConnectedComponent or 
if they can recommend any solution.

Thanks
Ankur

On Thu, Jan 5, 2017 at 1:10 PM, Ankur Srivastava 
> wrote:
Yes I did try it out and it chooses the local file system because my checkpoint
location starts with s3n://

I am not sure how I can make it load the S3 file system.

On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung 
> wrote:
Right, I'd agree, it seems to be only with delete.

Could you by chance run just the delete to see if it fails

FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(somepath), true)

From: Ankur Srivastava 
>
Sent: Thursday, January 5, 2017 10:05:03 AM
To: Felix Cheung
Cc: user@spark.apache.org

Subject: Re: Spark GraphFrame ConnectedComponents

Yes it works to read the vertices and edges data from S3 location and is also 
able to write the checkpoint files to S3. It only fails when deleting the data 
and that is because it tries to use the default file system. I tried looking up 
how to update the default file system but could not find anything in that 
regard.

Thanks
Ankur

On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
> wrote:
From the stack it looks to be an error from the explicit call to
hadoop.fs.FileSystem.

Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?

_
From: Ankur Srivastava 
>
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung >
Cc: >



This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at 
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava 
> wrote:
Hi

I am rerunning the pipeline to generate the exact trace, I have below part of 
trace from last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)

Also I think the error is happening in this part of the code 
"ConnectedComponents.scala:339" I am 

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Ankur Srivastava
Adding DEV mailing list to see if this is a defect with ConnectedComponent
or if they can recommend any solution.

Thanks
Ankur

On Thu, Jan 5, 2017 at 1:10 PM, Ankur Srivastava  wrote:

> Yes I did try it out and it chooses the local file system because my
> checkpoint location starts with s3n://
>
> I am not sure how I can make it load the S3 file system.
>
> On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung 
> wrote:
>
>> Right, I'd agree, it seems to be only with delete.
>>
>> Could you by chance run just the delete to see if it fails
>>
>> FileSystem.get(sc.hadoopConfiguration)
>> .delete(new Path(somepath), true)
>> --
>> *From:* Ankur Srivastava 
>> *Sent:* Thursday, January 5, 2017 10:05:03 AM
>> *To:* Felix Cheung
>> *Cc:* user@spark.apache.org
>>
>> *Subject:* Re: Spark GraphFrame ConnectedComponents
>>
>> Yes it works to read the vertices and edges data from S3 location and is
>> also able to write the checkpoint files to S3. It only fails when deleting
>> the data and that is because it tries to use the default file system. I
>> tried looking up how to update the default file system but could not find
>> anything in that regard.
>>
>> Thanks
>> Ankur
>>
>> On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
>> wrote:
>>
>>> From the stack it looks to be an error from the explicit call to
>>> hadoop.fs.FileSystem.
>>>
>>> Is the URL scheme for s3n registered?
>>> Does it work when you try to read from s3 from Spark?
>>>
>>> _
>>> From: Ankur Srivastava 
>>> Sent: Wednesday, January 4, 2017 9:23 PM
>>> Subject: Re: Spark GraphFrame ConnectedComponents
>>> To: Felix Cheung 
>>> Cc: 
>>>
>>>
>>>
>>> This is the exact trace from the driver logs
>>>
>>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong
>>> FS: s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7
>>> be/connected-components-c1dbc2b0/3, expected: file:///
>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF
>>> ileSystem.java:80)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileSta
>>> tus(RawLocalFileSystem.java:529)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInt
>>> ernal(RawLocalFileSystem.java:747)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLoc
>>> alFileSystem.java:524)
>>> at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileS
>>> ystem.java:534)
>>> at org.graphframes.lib.ConnectedComponents$.org$graphframes$lib
>>> $ConnectedComponents$$run(ConnectedComponents.scala:340)
>>> at org.graphframes.lib.ConnectedComponents.run(ConnectedCompone
>>> nts.scala:139)
>>> at GraphTest.main(GraphTest.java:31) --- Application Class
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>>> ssorImpl.java:57)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>>> thodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>>> $SparkSubmit$$runMain(SparkSubmit.scala:731)
>>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>>> .scala:181)
>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>
>>> And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10
>>>
>>> Thanks
>>> Ankur
>>>
>>> On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava <
>>> ankur.srivast...@gmail.com> wrote:
>>>
 Hi

 I am rerunning the pipeline to generate the exact trace, I have below
 part of trace from last run:

 Exception in thread "main" java.lang.IllegalArgumentException: Wrong
 FS: s3n://, expected: file:///
 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
 at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF
 ileSystem.java:69)
 at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLoc
 alFileSystem.java:516)
 at 
 org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)


 Also I think the error is happening in this part of the code
 "ConnectedComponents.scala:339" I am referring the code @
 https://github.com/graphframes/graphframes/blob/master/src/
 main/scala/org/graphframes/lib/ConnectedComponents.scala

   if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
 // TODO: remove this after DataFrame.checkpoint is implemented
 val out = s"${checkpointDir.get}/$iteration"
 ee.write.parquet(out)
 // may hit S3 

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Ankur Srivastava
Yes I did try it out and it chooses the local file system because my checkpoint
location starts with s3n://

I am not sure how I can make it load the S3 file system.

On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung 
wrote:

> Right, I'd agree, it seems to be only with delete.
>
> Could you by chance run just the delete to see if it fails
>
> FileSystem.get(sc.hadoopConfiguration)
> .delete(new Path(somepath), true)
> --
> *From:* Ankur Srivastava 
> *Sent:* Thursday, January 5, 2017 10:05:03 AM
> *To:* Felix Cheung
> *Cc:* user@spark.apache.org
>
> *Subject:* Re: Spark GraphFrame ConnectedComponents
>
> Yes it works to read the vertices and edges data from S3 location and is
> also able to write the checkpoint files to S3. It only fails when deleting
> the data and that is because it tries to use the default file system. I
> tried looking up how to update the default file system but could not find
> anything in that regard.
>
> Thanks
> Ankur
>
> On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
> wrote:
>
>> From the stack it looks to be an error from the explicit call to
>> hadoop.fs.FileSystem.
>>
>> Is the URL scheme for s3n registered?
>> Does it work when you try to read from s3 from Spark?
>>
>> _
>> From: Ankur Srivastava 
>> Sent: Wednesday, January 4, 2017 9:23 PM
>> Subject: Re: Spark GraphFrame ConnectedComponents
>> To: Felix Cheung 
>> Cc: 
>>
>>
>>
>> This is the exact trace from the driver logs
>>
>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
>> s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7
>> be/connected-components-c1dbc2b0/3, expected: file:///
>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF
>> ileSystem.java:80)
>> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileSta
>> tus(RawLocalFileSystem.java:529)
>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInt
>> ernal(RawLocalFileSystem.java:747)
>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLoc
>> alFileSystem.java:524)
>> at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileS
>> ystem.java:534)
>> at org.graphframes.lib.ConnectedComponents$.org$graphframes$
>> lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
>> at org.graphframes.lib.ConnectedComponents.run(ConnectedCompone
>> nts.scala:139)
>> at GraphTest.main(GraphTest.java:31) --- Application Class
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:57)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> thodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>> $SparkSubmit$$runMain(SparkSubmit.scala:731)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>> .scala:181)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10
>>
>> Thanks
>> Ankur
>>
>> On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava <
>> ankur.srivast...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I am rerunning the pipeline to generate the exact trace, I have below
>>> part of trace from last run:
>>>
>>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong
>>> FS: s3n://, expected: file:///
>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF
>>> ileSystem.java:69)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLoc
>>> alFileSystem.java:516)
>>> at 
>>> org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)
>>>
>>>
>>> Also I think the error is happening in this part of the code
>>> "ConnectedComponents.scala:339" I am referring the code @
>>> https://github.com/graphframes/graphframes/blob/master/src/
>>> main/scala/org/graphframes/lib/ConnectedComponents.scala
>>>
>>>   if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
>>> // TODO: remove this after DataFrame.checkpoint is implemented
>>> val out = s"${checkpointDir.get}/$iteration"
>>> ee.write.parquet(out)
>>> // may hit S3 eventually consistent issue
>>> ee = sqlContext.read.parquet(out)
>>>
>>> // remove previous checkpoint
>>> if (iteration > checkpointInterval) {
>>>   *FileSystem.get(sc.hadoopConfiguration)*
>>> *.delete(new Path(s"${checkpointDir.get}/${iteration -
>>> checkpointInterval}"), true)*
>>>  

Writing Parquet from Avro objects - cannot write null value for numeric fields

2017-01-05 Thread Sunita Arvind
Hello Experts,

I am trying to allow null values in numeric fields. Here are the details of
the issue I have:
http://stackoverflow.com/questions/41492344/spark-avro-to-parquet-writing-null-values-in-number-fields

I also tried making all columns nullable by using the below function (from
one of the suggestions on web)

def setNullableStateForAllColumns(df: DataFrame, nullable: Boolean): DataFrame = {
  df.sqlContext.createDataFrame(df.rdd,
    StructType(df.schema.map(_.copy(nullable = nullable))))
}
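
For reference, a minimal sketch of how the helper is applied before writing
(df and the output path are just placeholders):

val nullableDf = setNullableStateForAllColumns(df, nullable = true)
nullableDf.write.parquet("/tmp/output_parquet")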

The printSchema shows that the columns are now nullable, but I still cannot
persist the data as Parquet with nulls in the numeric fields.

Is there a workaround? I need to be able to allow null values for numeric fields.

Thanks in advance.

regards

Sunita


Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
Right, I'd agree, it seems to be only with delete.

Could you by chance run just the delete to see if it fails

FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(somepath), true)

From: Ankur Srivastava 
Sent: Thursday, January 5, 2017 10:05:03 AM
To: Felix Cheung
Cc: user@spark.apache.org
Subject: Re: Spark GraphFrame ConnectedComponents

Yes it works to read the vertices and edges data from S3 location and is also 
able to write the checkpoint files to S3. It only fails when deleting the data 
and that is because it tries to use the default file system. I tried looking up 
how to update the default file system but could not find anything in that 
regard.

Thanks
Ankur

On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
> wrote:
From the stack it looks to be an error from the explicit call to
hadoop.fs.FileSystem.

Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?

_
From: Ankur Srivastava 
>
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung >
Cc: >



This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at 
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava 
> wrote:
Hi

I am rerunning the pipeline to generate the exact trace, I have below part of 
trace from last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)

Also I think the error is happening in this part of the code 
"ConnectedComponents.scala:339" I am referring the code 
@https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala

  if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
// TODO: remove this after DataFrame.checkpoint is implemented
val out = s"${checkpointDir.get}/$iteration"
ee.write.parquet(out)
// may hit S3 eventually consistent issue
ee = sqlContext.read.parquet(out)

// remove previous checkpoint
if (iteration > checkpointInterval) {
  FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(s"${checkpointDir.get}/${iteration - 
checkpointInterval}"), true)
}

System.gc() // hint Spark to clean shuffle directories
  }


Thanks
Ankur

On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung 
> wrote:
Do you have more of the exception stack?



From: Ankur Srivastava 
>
Sent: Wednesday, January 4, 2017 4:40:02 PM
To: 

RE: Spark Read from Google store and save in AWS s3

2017-01-05 Thread Manohar Reddy
Hi Steve,
Thanks for the reply; below is a follow-up question.
Do you mean we can set up two native file systems on a single SparkContext, so
that based on the URL prefixes (gs://bucket/path for the source and
s3a://bucket-on-s3/path2 for the destination) it will identify the appropriate
cloud and read/write accordingly?

Is that my understanding right?

Manohar
From: Steve Loughran [mailto:ste...@hortonworks.com]
Sent: Thursday, January 5, 2017 11:05 PM
To: Manohar Reddy
Cc: user@spark.apache.org
Subject: Re: Spark Read from Google store and save in AWS s3


On 5 Jan 2017, at 09:58, Manohar753 
> wrote:

Hi All,

Using spark is  interoperability communication between two
clouds(Google,AWS) possible.
in my use case i need to take Google store as input to spark and do some
processing and finally needs to store in S3 and my spark engine runs on AWS
Cluster.

Please let me back is there any way for this kind of usecase bu using
directly spark without any middle components and share the info or link if
you have.

Thanks,

I've not played with GCS, and have some noted concerns about test coverage ( 
https://github.com/GoogleCloudPlatform/bigdata-interop/pull/40
 ) , but assuming you are not hitting any specific problems, it should be a 
matter of having the input as gs://bucket/path and dest s3a://bucket-on-s3/path2

You'll need the google storage JARs on your classpath, along with those needed 
for S3n/s3a.

1. little talk on the topic, though I only play with azure and s3
https://www.youtube.com/watch?v=ND4L_zSDqF0

2. some notes; bear in mind that the s3a performance tuning covered relates to 
things surfacing in Hadoop 2.8, which you probably wont have.


https://hortonworks.github.io/hdp-aws/s3-spark/

A one line test for s3 installed is can you read the landsat CSV file

sparkContext.textFile("s3a://landsat-pds/scene_list.gz").count()

this should work from wherever you are if your classpath and credentials are 
set up





Re: Spark SQL - Applying transformation on a struct inside an array

2017-01-05 Thread Olivier Girardot
So, it seems the only way I found for now is a recursive handling of the Row
instances directly, but to do that I have to go back to RDDs. I've put together
a simple test case demonstrating the problem:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.{FlatSpec, Matchers}

class DFInPlaceTransform extends FlatSpec with Matchers {
  val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate()

  it should "access and mutate deeply nested arrays/structs" in {

    val df = spark.read.json(spark.sparkContext.parallelize(List(
      """{"a":[{"b" : "toto" }]}""".stripMargin)))
    df.show()
    df.printSchema()

    val result = transformInPlace("a.b", df)

    result.printSchema()
    result.show()

    result.schema should be (df.schema)
    val res = result.toJSON.take(1)
    res should be("""{"a":[{"b" : "TOTO" }]}""")
  }

  def transformInPlace(path: String, df: DataFrame): DataFrame = {
    val udf = spark.udf.register("transform", (s: String) => s.toUpperCase)
    val paths = path.split('.')
    val root = paths.head
    import org.apache.spark.sql.functions._
    df.withColumn(root, udf(df(path))) // does not work of course
  }
}

The three other solutions I see are:
 * to create a dedicated Expression for in-place modifications of nested arrays
   and structs,
 * to use heavy explode/lateral views/group by computations, but that's bound
   to be inefficient,
 * or to generate bytecode using the schema to do the nested "getRow, getSeq…"
   and re-create the rows once the transformation is applied.

I'd like to open an issue regarding that use case because it's not the first or
last time it comes up and I still don't see any generic solution using DataFrames.

Thanks for your time,
Regards,
Olivier
 





On Fri, Sep 16, 2016 10:19 AM, Olivier Girardot o.girar...@lateral-thoughts.com
wrote:
Hi Michael,

Well, for nested structs, I saw in the tests the behaviour defined by
SPARK-12512 for the "a.b.c" handling in withColumn, and even if it's not ideal
for me, I managed to make it work anyway like that:

> df.withColumn("a", struct(struct(myUDF(df("a.b.c." // I didn't put back the aliases but you see what I mean.

What I'd like to make work in essence is something like that:

> val someFunc : String => String = ???
> val myUDF = udf(someFunc)
> df.withColumn("a.b[*].c", myUDF(df("a.b[*].c")))

The fact is that in order to be consistent with the previous API, maybe I'd
have to put something like a struct(array(struct(… which would be troublesome
because I'd have to parse the arbitrary input string and create something like
"a.b[*].c" => struct(array(struct(
I realise the ambiguity implied in the kind of column expression, but it doesn't
seem for now available to cleanly update data inplace at an arbitrary depth.
I'll try to work on a PR that would make this possible, but any pointers would
be appreciated.
Regards,
Olivier.
 





On Fri, Sep 16, 2016 12:42 AM, Michael Armbrust mich...@databricks.com
wrote:
Is what you are looking for a withColumn that support in place modification of
nested columns? or is it some other problem?
On Wed, Sep 14, 2016 at 11:07 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com>  wrote:
I tried to use the RowEncoder but got stuck along the way. The main issue really
is that even if it's possible (however tedious) to pattern match generically
over Row(s) and target the nested field that you need to modify, Rows are an
immutable data structure without a method like a case class's copy or any kind
of lens to create a brand new object, so I ended up stuck at the step "target
and extract the field to update" without any way to update the original Row
with the new value.

To sum up, I tried:
 * using only the DataFrame API itself + my udf - which works for nested
   structs as long as no arrays are along the way
 * trying to create a udf that can apply on Row and pattern match recursively
   the path I needed to explore/modify
 * trying to create a UDT - but we seem to be stuck in a strange middle-ground
   with 2.0 because some parts of the API ended up private while some stayed
   public, making it impossible to use it now (I'd be glad if I'm mistaken)

All of these failed for me, and I ended up converting the rows to JSON and
updating them using JSONPath, which is… something I'd like to avoid 'pretty please'





On Thu, Sep 15, 2016 5:20 AM, Michael Allman mich...@videoamp.com
wrote:
Hi Guys,
Have you tried org.apache.spark.sql.catalyst.encoders.RowEncoder? It's not a
public API, but it is publicly accessible. I used it recently to correct some
bad data in a few nested columns in a dataframe. It wasn't an easy job, but it
made it possible. In my particular case I was not working with arrays.
Olivier, I'm interested in seeing what you come up with.
Thanks,
Michael

On Sep 14, 2016, at 10:44 AM, Fred Reiss  wrote:
+1 to this request. I talked last week with a product group within IBM that is
struggling with the same issue. It's pretty common in data cleaning applications
for data in the early stages to 

Re: Spark Python in Jupyter Notebook

2017-01-05 Thread Marco Mistroni
Hi
   might be off topic, but databricks has a web application in which you
can use spark with jupyter. Have a look at
https://community.cloud.databricks.com

kr

On Thu, Jan 5, 2017 at 7:53 PM, Jon G  wrote:

> I don't use MapR but I use pyspark with jupyter, and this MapR blogpost
> looks similar to what I do to setup:
>
> https://community.mapr.com/docs/DOC-1874-how-to-use-
> jupyter-pyspark-on-mapr
>
>
> On Thu, Jan 5, 2017 at 3:05 AM, neil90  wrote:
>
>> Assuming you don't have your environment variables setup in your
>> .bash_profile you would do it like this -
>>
>> import os
>> import sys
>>
>> spark_home = '/usr/local/spark'
>> sys.path.insert(0, spark_home + "/python")
>> sys.path.insert(0, os.path.join(spark_home,
>> 'python/lib/py4j-0.10.1-src.zip'))
>> #os.environ['PYSPARK_SUBMIT_ARGS'] = """--master spark://
>> 54.68.147.137:7077
>> pyspark-shell""" < where you can pass commands you would pass in
>> launching pyspark directly from command line
>>
>> from pyspark import SparkContext, SparkConf
>> from pyspark.sql import SparkSession
>>
>> conf = SparkConf()\
>> .setMaster("local[8]")\
>> .setAppName("Test")
>>
>> sc = SparkContext(conf=conf)
>>
>> spark = SparkSession.builder\
>> .config(conf=sc.getConf())\
>> .enableHiveSupport()\
>> .getOrCreate()
>>
>> Mind you this is for spark 2.0 and above
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Spark-Python-in-Jupyter-Notebook-tp28268p28274.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Spark Python in Jupyter Notebook

2017-01-05 Thread Jon G
I don't use MapR but I use pyspark with jupyter, and this MapR blogpost
looks similar to what I do to setup:

https://community.mapr.com/docs/DOC-1874-how-to-use-jupyter-pyspark-on-mapr


On Thu, Jan 5, 2017 at 3:05 AM, neil90  wrote:

> Assuming you don't have your environment variables setup in your
> .bash_profile you would do it like this -
>
> import os
> import sys
>
> spark_home = '/usr/local/spark'
> sys.path.insert(0, spark_home + "/python")
> sys.path.insert(0, os.path.join(spark_home,
> 'python/lib/py4j-0.10.1-src.zip'))
> #os.environ['PYSPARK_SUBMIT_ARGS'] = """--master spark://
> 54.68.147.137:7077
> pyspark-shell""" < where you can pass commands you would pass in
> launching pyspark directly from command line
>
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
>
> conf = SparkConf()\
> .setMaster("local[8]")\
> .setAppName("Test")
>
> sc = SparkContext(conf=conf)
>
> spark = SparkSession.builder\
> .config(conf=sc.getConf())\
> .enableHiveSupport()\
> .getOrCreate()
>
> Mind you this is for spark 2.0 and above
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-Python-in-Jupyter-Notebook-tp28268p28274.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Help in generating unique Id in spark row

2017-01-05 Thread Olivier Girardot
There is a way: you can use
org.apache.spark.sql.functions.monotonicallyIncreasingId; it will give each row
of your dataframe a unique Id.
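
A minimal sketch of that (the column name is illustrative):

import org.apache.spark.sql.functions.monotonicallyIncreasingId

val withId = df.withColumn("rowId", monotonicallyIncreasingId())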
 





On Tue, Oct 18, 2016 10:36 AM, ayan guha guha.a...@gmail.com
wrote:
Do you have any primary key or unique identifier in your data? Even if multiple
columns can make a composite key? In other words, can your data have exactly
same 2 rows with different unique ID? Also, do you have to have numeric ID? 

You may want to pursue a hashing algorithm such as the SHA family to convert a
single or composite set of unique columns to an ID.
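
A sketch of that approach, assuming key1 and key2 form a unique composite key
(the column names are placeholders):

import org.apache.spark.sql.functions.{concat_ws, sha2}

val withId = df.withColumn("id", sha2(concat_ws("||", df("key1"), df("key2")), 256))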

On 18 Oct 2016 15:32, "Saurav Sinha"  wrote:
Can any one help me out
On Mon, Oct 17, 2016 at 7:27 PM, Saurav Sinha   wrote:
Hi,
I am in a situation where I want to generate a unique Id for each row.
I have used monotonicallyIncreasingId, but it gives increasing values and
starts generating from the beginning again if it fails.
I have two questions here:
Q1. Does this method give me a unique id even in a failure situation? I want
to use that ID as my solr id.
Q2. If the answer to the previous question is NO, then is there a way to
generate a UUID for each row which is unique and not updatable?
I have come up against a situation where the UUID is updated:

val idUDF = udf(() => UUID.randomUUID().toString)
val a = withColumn("alarmUUID", lit(idUDF()))
a.persist(StorageLevel.MEMORY_AND_DISK)
rawDataDf.registerTempTable("rawAlarms")

// I do some joins
but as I reach further below, I do something like this (b is a transformation of a):

sqlContext.sql("""Select a.alarmUUID, b.alarmUUID
                  from a right outer join b on
                  a.alarmUUID = b.alarmUUID""")

it gives output as

+--------------------+--------------------+
|           alarmUUID|           alarmUUID|
+--------------------+--------------------+
|7d33a516-5532-410...|                null|
|                null|2439d6db-16a2-44b...|
+--------------------+--------------------+


-- 
Thanks and Regards,
Saurav Sinha
Contact: 9742879062


-- 
Thanks and Regards,
Saurav Sinha
Contact: 9742879062

 

Olivier Girardot| Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94

Re: Setting Spark Properties on Dataframes

2017-01-05 Thread neil90
This blog post(Not mine) has some nice examples - 

https://hadoopist.wordpress.com/2016/08/19/how-to-create-compressed-output-files-in-spark-2-0/

From the blog -
df.write.mode("overwrite").format("parquet").option("compression",
"none").mode("overwrite").save("/tmp/file_no_compression_parq")
df.write.mode("overwrite").format("parquet").option("compression",
"gzip").mode("overwrite").save("/tmp/file_with_gzip_parq")
df.write.mode("overwrite").format("parquet").option("compression",
"snappy").mode("overwrite").save("/tmp/file_with_snappy_parq")
//lzo - requires a different method in terms of implementation.

df.write.mode("overwrite").format("orc").option("compression",
"none").mode("overwrite").save("/tmp/file_no_compression_orc")
df.write.mode("overwrite").format("orc").option("compression",
"snappy").mode("overwrite").save("/tmp/file_with_snappy_orc")
df.write.mode("overwrite").format("orc").option("compression",
"zlib").mode("overwrite").save("/tmp/file_with_zlib_orc")



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Setting-Spark-Properties-on-Dataframes-tp28266p28280.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Ankur Srivastava
Yes it works to read the vertices and edges data from S3 location and is
also able to write the checkpoint files to S3. It only fails when deleting
the data and that is because it tries to use the default file system. I
tried looking up how to update the default file system but could not find
anything in that regard.

Thanks
Ankur

On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
wrote:

> From the stack it looks to be an error from the explicit call to
> hadoop.fs.FileSystem.
>
> Is the URL scheme for s3n registered?
> Does it work when you try to read from s3 from Spark?
>
> _
> From: Ankur Srivastava 
> Sent: Wednesday, January 4, 2017 9:23 PM
> Subject: Re: Spark GraphFrame ConnectedComponents
> To: Felix Cheung 
> Cc: 
>
>
>
> This is the exact trace from the driver logs
>
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
> s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
> expected: file:///
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(
> RawLocalFileSystem.java:80)
> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(
> RawLocalFileSystem.java:529)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(
> RawLocalFileSystem.java:747)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(
> RawLocalFileSystem.java:524)
> at org.apache.hadoop.fs.ChecksumFileSystem.delete(
> ChecksumFileSystem.java:534)
> at org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$
> ConnectedComponents$$run(ConnectedComponents.scala:340)
> at org.graphframes.lib.ConnectedComponents.run(
> ConnectedComponents.scala:139)
> at GraphTest.main(GraphTest.java:31) --- Application Class
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$
> deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10
>
> Thanks
> Ankur
>
> On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> Hi
>>
>> I am rerunning the pipeline to generate the exact trace, I have below
>> part of trace from last run:
>>
>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
>> s3n://, expected: file:///
>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF
>> ileSystem.java:69)
>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLoc
>> alFileSystem.java:516)
>> at 
>> org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)
>>
>>
>> Also I think the error is happening in this part of the code
>> "ConnectedComponents.scala:339" I am referring the code @
>> https://github.com/graphframes/graphframes/blob/master/src/
>> main/scala/org/graphframes/lib/ConnectedComponents.scala
>>
>>   if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
>> // TODO: remove this after DataFrame.checkpoint is implemented
>> val out = s"${checkpointDir.get}/$iteration"
>> ee.write.parquet(out)
>> // may hit S3 eventually consistent issue
>> ee = sqlContext.read.parquet(out)
>>
>> // remove previous checkpoint
>> if (iteration > checkpointInterval) {
>>   *FileSystem.get(sc.hadoopConfiguration)*
>> *.delete(new Path(s"${checkpointDir.get}/${iteration -
>> checkpointInterval}"), true)*
>> }
>>
>> System.gc() // hint Spark to clean shuffle directories
>>   }
>>
>>
>> Thanks
>> Ankur
>>
>> On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung 
>> wrote:
>>
>>> Do you have more of the exception stack?
>>>
>>>
>>> --
>>> *From:* Ankur Srivastava 
>>> *Sent:* Wednesday, January 4, 2017 4:40:02 PM
>>> *To:* user@spark.apache.org
>>> *Subject:* Spark GraphFrame ConnectedComponents
>>>
>>> Hi,
>>>
>>> I am trying to use the ConnectedComponent algorithm of GraphFrames but
>>> by default it needs a checkpoint directory. As I am running my spark
>>> cluster with S3 as the DFS and do not have access to HDFS file system I
>>> tried using a s3 directory as checkpoint directory but I run into below

Re: Spark Read from Google store and save in AWS s3

2017-01-05 Thread Steve Loughran

On 5 Jan 2017, at 09:58, Manohar753 
> wrote:

Hi All,

Using spark is  interoperability communication between two
clouds(Google,AWS) possible.
in my use case i need to take Google store as input to spark and do some
processing and finally needs to store in S3 and my spark engine runs on AWS
Cluster.

Please let me back is there any way for this kind of usecase bu using
directly spark without any middle components and share the info or link if
you have.

Thanks,


I've not played with GCS, and have some noted concerns about test coverage ( 
https://github.com/GoogleCloudPlatform/bigdata-interop/pull/40 ) , but assuming 
you are not hitting any specific problems, it should be a matter of having the 
input as gs://bucket/path and dest s3a://bucket-on-s3/path2

You'll need the google storage JARs on your classpath, along with those needed 
for S3n/s3a.

1. little talk on the topic, though I only play with azure and s3
https://www.youtube.com/watch?v=ND4L_zSDqF0

2. some notes; bear in mind that the s3a performance tuning covered relates to 
things surfacing in Hadoop 2.8, which you probably wont have.


https://hortonworks.github.io/hdp-aws/s3-spark/

A one line test for s3 installed is can you read the landsat CSV file

sparkContext.textFile("s3a://landsat-pds/scene_list.gz").count()

this should work from wherever you are if your classpath and credentials are 
set up
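
A minimal end-to-end sketch under those assumptions (bucket names are
placeholders; the GCS connector, the s3a JARs and credentials for both stores
must already be configured):

val rdd = sparkContext.textFile("gs://bucket/path")
rdd.saveAsTextFile("s3a://bucket-on-s3/path2")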


[Spark 2.1.0] Resource Scheduling Challenge in pyspark sparkSession

2017-01-05 Thread Palash Gupta
Hi User Team,
I'm trying to schedule resources in Spark 2.1.0 using the code below, but all
the CPU cores are still captured by a single Spark application and hence no
other application can start. Could you please help me out?

sqlContext = SparkSession.builder \
    .master("spark://172.26.7.192:7077") \
    .config("spark.sql.warehouse.dir", "/tmp/PM/") \
    .config("spark.sql.shuffle.partitions", "6") \
    .config("spark.cores.max", "5") \
    .config("spark.executor.cores", "2") \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "4g") \
    .appName(APP_NAME) \
    .getOrCreate()


Thanks & Best Regards,
Engr. Palash Gupta
WhatsApp/Viber: +8801817181502
Skype: palash2494

 

Thanks & Best Regards,
Palash Gupta





Re: ToLocalIterator vs collect

2017-01-05 Thread Richard Startin
Why not do that with spark sql to utilise the executors properly, rather than a 
sequential filter on the driver.

Select * from A left join B on A.fk = B.fk where B.pk is NULL limit k

If you were sorting just so you could iterate in order, this might save you a 
couple of sorts too.
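
A sketch of the same idea with the DataFrame API's anti join (dfA, dfB and k
are placeholders; assumes Spark 2.0+):

val missing = dfA.join(dfB, dfA("fk") === dfB("fk"), "left_anti").limit(k)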

https://richardstartin.com

> On 5 Jan 2017, at 10:40, Rohit Verma  wrote:
> 
> Hi all,
> 
> I am aware that collect will return a list aggregated on the driver, and this
> will cause an OOM when the list is too big.
> Is toLocalIterator safe to use with a very big list? I want to access all
> values one by one.
> 
> Basically the goal is to compare two sorted RDDs (A and B) to find the top k
> entries that are missing in B but present in A.
> 
> Rohit
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



ToLocalIterator vs collect

2017-01-05 Thread Rohit Verma
Hi all,

I am aware that collect will return a list aggregated on the driver, and this
will cause an OOM when the list is too big.
Is toLocalIterator safe to use with a very big list? I want to access all values
one by one.

Basically the goal is to compare two sorted RDDs (A and B) to find the top k
entries that are missing in B but present in A.

Rohit
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2017-01-05 Thread Palash Gupta
Hi Marco,
Yes, it was on the same host when the problem was found.
Even when I tried to start on a different host, the problem was still there.
Any hints or suggestions will be appreciated.
Thanks & Best Regards,
Palash Gupta


  From: Marco Mistroni 
 To: Palash Gupta  
Cc: ayan guha ; User 
 Sent: Thursday, January 5, 2017 1:01 PM
 Subject: Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed 
to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"
   
Hi. If it only happens when you run 2 apps at the same time, could it be that
these 2 apps somehow run on the same host? Kr
On 5 Jan 2017 9:00 am, "Palash Gupta"  wrote:

Hi Marco and respected member,
I have done all the possible things suggested by the forum but I'm still having the
same issue:

1. I will migrate my applications to production environment where I will have more resources
Palash>> I migrated my application in production where I have more CPU cores, memory & a total of 7 hosts in the Spark cluster.
2. Use Spark 2.0.0 function to load CSV rather than using the databricks api
Palash>> Earlier I was using the databricks csv api with Spark 2.0.0. As suggested by one of the mates, now I'm using the Spark 2.0.0 built-in csv loader.
3. In production I will run multiple Spark applications at a time and try to reproduce this error for both file system and HDFS loading cases
Palash>> Yes, I reproduced it and it only happens when two Spark applications run at a time. Please see the logs:
17/01/05 01:50:15 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
(TID 0, 10.15.187.79): java.io.IOException: org.apache.spa
rk.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
    at org.apache.spark.util.Utils$. tryOrIOException(Utils.scala: 1260)
    at org.apache.spark.broadcast. TorrentBroadcast. readBroadcastBlock( 
TorrentBroadcast.scala:174)
    at org.apache.spark.broadcast. TorrentBroadcast._value$ 
lzycompute(TorrentBroadcast. scala:65)
    at org.apache.spark.broadcast. TorrentBroadcast._value( 
TorrentBroadcast.scala:65)
    at org.apache.spark.broadcast. TorrentBroadcast.getValue( 
TorrentBroadcast.scala:89)
    at org.apache.spark.broadcast. Broadcast.value(Broadcast. scala:70)
    at org.apache.spark.scheduler. ResultTask.runTask(ResultTask. scala:67)
    at org.apache.spark.scheduler. Task.run(Task.scala:85)
    at org.apache.spark.executor. Executor$TaskRunner.run( 
Executor.scala:274)
    at java.util.concurrent. ThreadPoolExecutor.runWorker( 
ThreadPoolExecutor.java:1145)
    at java.util.concurrent. ThreadPoolExecutor$Worker.run( 
ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread. java:745)
Caused by: org.apache.spark. SparkException: Failed to get broadcast_1_piece0 
of broadcast_1
    at org.apache.spark.broadcast. TorrentBroadcast$$anonfun$org$ 
apache$spark$broadcast$ TorrentBroadcast$$readBlocks$ 1.apply$mcVI$s
p(TorrentBroadcast.scala:146)
    at org.apache.spark.broadcast. TorrentBroadcast$$anonfun$org$ 
apache$spark$broadcast$ TorrentBroadcast$$readBlocks$ 1.apply(Torren
tBroadcast.scala:125)
    at org.apache.spark.broadcast. TorrentBroadcast$$anonfun$org$ 
apache$spark$broadcast$ TorrentBroadcast$$readBlocks$ 1.apply(Torren
tBroadcast.scala:125)
    at scala.collection.immutable. List.foreach(List.scala:381)
    at org.apache.spark.broadcast. TorrentBroadcast.org$apache$ 
spark$broadcast$ TorrentBroadcast$$readBlocks( TorrentBroadcast.scala:
125)
    at org.apache.spark.broadcast. TorrentBroadcast$$anonfun$ 
readBroadcastBlock$1.apply( TorrentBroadcast.scala:186)
    at org.apache.spark.util.Utils$. tryOrIOException(Utils.scala: 1253)
    ... 11 more

17/01/05 01:50:15 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 
(TID 1, 10.15.187.78, partition 0, ANY, 7305 bytes)
17/01/05 01:50:15 INFO cluster. CoarseGrainedSchedulerBackend$ DriverEndpoint: 
Launching task 1 on executor id: 1 hostname: 10.15.187.78
.
17/01/05 01:50:15 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 
(TID 1) on executor 10.15.187.78: java.io.IOException (org
.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1) 
[duplicate 1]
17/01/05 01:50:15 INFO scheduler.TaskSetManager: Starting task 0.2 in stage 0.0 
(TID 2, 10.15.187.78, partition 0, ANY, 7305 bytes)
17/01/05 01:50:15 INFO cluster. CoarseGrainedSchedulerBackend$ DriverEndpoint: 
Launching task 2 on executor id: 1 hostname: 10.15.187.78
.
17/01/05 01:50:15 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 0.0 
(TID 2) on executor 10.15.187.78: java.io.IOException (org
.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1) 
[duplicate 2]
17/01/05 01:50:15 INFO scheduler.TaskSetManager: Starting task 0.3 in stage 0.0 
(TID 3, 10.15.187.76, partition 0, ANY, 7305 bytes)
17/01/05 01:50:15 INFO cluster. CoarseGrainedSchedulerBackend$ 

Spark Read from Google store and save in AWS s3

2017-01-05 Thread Manohar753
Hi All,

Using Spark, is interoperability/communication between two
clouds (Google, AWS) possible?
In my use case I need to take Google storage as input to Spark, do some
processing, and finally store the result in S3; my Spark engine runs on an AWS
cluster.

Please let me know if there is any way to do this kind of use case by using
Spark directly, without any middle components, and share any info or links if
you have them.

Thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Read-from-Google-store-and-save-in-AWS-s3-tp28278.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2017-01-05 Thread Marco Mistroni
Hi
 If it only happens when you run 2 apps at the same time, could it be that these 2
apps somehow run on the same host?
Kr

On 5 Jan 2017 9:00 am, "Palash Gupta"  wrote:

> Hi Marco and respected member,
>
> I have done all the possible things suggested by Forum but still I'm
> having same issue:
>
> 1. I will migrate my applications to production environment where I will
> have more resources
> Palash>> I migrated my application in production where I have more CPU
> Cores, Memory & total 7 host in spark cluster.
> 2. Use Spark 2.0.0 function to load CSV rather using databrics api
> Palash>> Earlier I'm using databricks csv api with Spark 2.0.0. As
> suggested by one of the mate, Now I'm using spark 2.0.0 built in csv loader.
> 3. In production I will run multiple spark application at a time and try
> to reproduce this error for both file system and HDFS loading cas
> Palash>> yes I reproduced and it only happen when two spark application
> run at a time. Please see the logs:
>
> 17/01/05 01:50:15 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 0.0 (TID 0, 10.15.187.79): java.io.IOException: org.apache.spa
> rk.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260)
> at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(
> TorrentBroadcast.scala:174)
> at org.apache.spark.broadcast.TorrentBroadcast._value$
> lzycompute(TorrentBroadcast.scala:65)
> at org.apache.spark.broadcast.TorrentBroadcast._value(
> TorrentBroadcast.scala:65)
> at org.apache.spark.broadcast.TorrentBroadcast.getValue(
> TorrentBroadcast.scala:89)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
> scala:67)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to get
> broadcast_1_piece0 of broadcast_1
> at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$
> apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$s
> p(TorrentBroadcast.scala:146)
> at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$
> apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(Torren
> tBroadcast.scala:125)
> at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$
> apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(Torren
> tBroadcast.scala:125)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at org.apache.spark.broadcast.TorrentBroadcast.org$apache$
> spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:
> 125)
> at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$
> readBroadcastBlock$1.apply(TorrentBroadcast.scala:186)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1253)
> ... 11 more
>
> 17/01/05 01:50:15 INFO scheduler.TaskSetManager: Starting task 0.1 in
> stage 0.0 (TID 1, 10.15.187.78, partition 0, ANY, 7305 bytes)
> 17/01/05 01:50:15 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint:
> Launching task 1 on executor id: 1 hostname: 10.15.187.78
> .
> 17/01/05 01:50:15 INFO scheduler.TaskSetManager: Lost task 0.1 in stage
> 0.0 (TID 1) on executor 10.15.187.78: java.io.IOException (org
> .apache.spark.SparkException: Failed to get broadcast_1_piece0 of
> broadcast_1) [duplicate 1]
> 17/01/05 01:50:15 INFO scheduler.TaskSetManager: Starting task 0.2 in
> stage 0.0 (TID 2, 10.15.187.78, partition 0, ANY, 7305 bytes)
> 17/01/05 01:50:15 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint:
> Launching task 2 on executor id: 1 hostname: 10.15.187.78
> .
> 17/01/05 01:50:15 INFO scheduler.TaskSetManager: Lost task 0.2 in stage
> 0.0 (TID 2) on executor 10.15.187.78: java.io.IOException (org
> .apache.spark.SparkException: Failed to get broadcast_1_piece0 of
> broadcast_1) [duplicate 2]
> 17/01/05 01:50:15 INFO scheduler.TaskSetManager: Starting task 0.3 in
> stage 0.0 (TID 3, 10.15.187.76, partition 0, ANY, 7305 bytes)
> 17/01/05 01:50:15 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint:
> Launching task 3 on executor id: 6 hostname: 10.15.187.76
> .
> 17/01/05 01:50:16 INFO scheduler.TaskSetManager: Lost task 0.3 in stage
> 0.0 (TID 3) on executor 10.15.187.76: java.io.IOException (org
> .apache.spark.SparkException: Failed to get broadcast_1_piece0 of
> broadcast_1) [duplicate 3]
> 17/01/05 01:50:16 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0
> failed 4 times; aborting job
> 17/01/05 01:50:16 INFO scheduler.TaskSchedulerImpl: Removed 

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2017-01-05 Thread Palash Gupta
Hi Marco and respected members,
I have done all the possible things suggested by the forum, but I am still having the
same issue:

1. I will migrate my applications to a production environment where I will have
more resources.
Palash>> I migrated my application to production, where I have more CPU cores,
memory, and a total of 7 hosts in the Spark cluster.
2. Use the Spark 2.0.0 built-in function to load CSV rather than using the Databricks API.
Palash>> Earlier I was using the Databricks CSV API with Spark 2.0.0. As suggested by one of
the mates, I am now using the Spark 2.0.0 built-in CSV loader.
3. In production I will run multiple Spark applications at a time and try to
reproduce this error for both file-system and HDFS loading cases.
Palash>> Yes, I reproduced it, and it only happens when two Spark applications run at a time.
Please see the logs:
17/01/05 01:50:15 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
(TID 0, 10.15.187.79): java.io.IOException: org.apache.spa
rk.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260)
    at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
    at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
    at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
    at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:67)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of 
broadcast_1
    at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$s
p(TorrentBroadcast.scala:146)
    at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(Torren
tBroadcast.scala:125)
    at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(Torren
tBroadcast.scala:125)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at 
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:
125)
    at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:186)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1253)
    ... 11 more

17/01/05 01:50:15 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 
(TID 1, 10.15.187.78, partition 0, ANY, 7305 bytes)
17/01/05 01:50:15 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: 
Launching task 1 on executor id: 1 hostname: 10.15.187.78
.
17/01/05 01:50:15 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 
(TID 1) on executor 10.15.187.78: java.io.IOException (org
.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1) 
[duplicate 1]
17/01/05 01:50:15 INFO scheduler.TaskSetManager: Starting task 0.2 in stage 0.0 
(TID 2, 10.15.187.78, partition 0, ANY, 7305 bytes)
17/01/05 01:50:15 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: 
Launching task 2 on executor id: 1 hostname: 10.15.187.78
.
17/01/05 01:50:15 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 0.0 
(TID 2) on executor 10.15.187.78: java.io.IOException (org
.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1) 
[duplicate 2]
17/01/05 01:50:15 INFO scheduler.TaskSetManager: Starting task 0.3 in stage 0.0 
(TID 3, 10.15.187.76, partition 0, ANY, 7305 bytes)
17/01/05 01:50:15 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: 
Launching task 3 on executor id: 6 hostname: 10.15.187.76
.
17/01/05 01:50:16 INFO scheduler.TaskSetManager: Lost task 0.3 in stage 0.0 
(TID 3) on executor 10.15.187.76: java.io.IOException (org
.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1) 
[duplicate 3]
17/01/05 01:50:16 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 
times; aborting job
17/01/05 01:50:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose 
tasks have all completed, from pool
17/01/05 01:50:16 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
17/01/05 01:50:16 INFO scheduler.DAGScheduler: ResultStage 0 (load at 
NativeMethodAccessorImpl.java:-2) failed in 2.110 s
17/01/05 01:50:16 INFO scheduler.DAGScheduler: Job 0 failed: load at 

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
From the stack it looks to be an error from the explicit call to
hadoop.fs.FileSystem.

Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?

_
From: Ankur Srivastava
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung
Cc:


This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at 
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava wrote:
Hi

I am rerunning the pipeline to generate the exact trace; here is the part of the
trace I have from the last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)

Also, I think the error is happening in this part of the code
("ConnectedComponents.scala:339"). I am referring to the code at
https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala:

    if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
      // TODO: remove this after DataFrame.checkpoint is implemented
      val out = s"${checkpointDir.get}/$iteration"
      ee.write.parquet(out)
      // may hit S3 eventually consistent issue
      ee = sqlContext.read.parquet(out)

      // remove previous checkpoint
      if (iteration > checkpointInterval) {
        FileSystem.get(sc.hadoopConfiguration)
          .delete(new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}"), true)
      }

      System.gc() // hint Spark to clean shuffle directories
    }
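
A side note on the snippet above: the delete call resolves the FileSystem from the
default scheme (file:/// in this setup) rather than from the checkpoint path itself,
which matches the "Wrong FS" error in the trace. As a minimal sketch of the
alternative, scheme-aware lookup (the class and method names below are illustrative
only, not GraphFrames code):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckpointCleanup {
    // Delete an old checkpoint directory, resolving the FileSystem from the
    // path's own scheme (s3n, s3a, hdfs, file, ...) instead of fs.defaultFS.
    public static void deleteCheckpoint(String checkpointDir, Configuration hadoopConf)
            throws IOException {
        Path path = new Path(checkpointDir);             // e.g. "s3n://bucket/checkpoints/3"
        FileSystem fs = path.getFileSystem(hadoopConf);  // scheme-aware lookup
        fs.delete(path, true);                           // recursive delete
    }
}

Path.getFileSystem picks the implementation registered for the path's scheme, so the
delete goes to the same filesystem the checkpoint was written to.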


Thanks
Ankur

On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung wrote:
Do you have more of the exception stack?



From: Ankur Srivastava
Sent: Wednesday, January 4, 2017 4:40:02 PM
To: user@spark.apache.org
Subject: Spark GraphFrame ConnectedComponents

Hi,

I am trying to use the ConnectedComponents algorithm of GraphFrames, but by
default it needs a checkpoint directory. As I am running my Spark cluster with
S3 as the DFS and do not have access to an HDFS file system, I tried using an S3
directory as the checkpoint directory, but I run into the below exception:


Exception in thread "main"java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///

at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)

at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)

If I set the checkpoint interval to -1 to avoid checkpointing, the driver just hangs
after 3 or 4 iterations.

Is there some way I can set the default FileSystem to S3 for Spark or any other 
option?
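
One workaround to consider, hedged because it changes the default filesystem for the
whole job: set fs.defaultFS on the Hadoop configuration (or pass
--conf spark.hadoop.fs.defaultFS=... at submit time) so that code calling
FileSystem.get(hadoopConf) resolves to S3 instead of file:///. A minimal Java sketch,
with a placeholder bucket name and assuming the s3n credentials are configured
separately:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class S3DefaultFsExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("GraphTest");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Make the checkpoint bucket the default filesystem so that calls to
        // FileSystem.get(hadoopConf) resolve to s3n:// instead of file:///.
        // "my-bucket" is a placeholder.
        jsc.hadoopConfiguration().set("fs.defaultFS", "s3n://my-bucket");

        jsc.setCheckpointDir("s3n://my-bucket/connected-components-checkpoints");
        // ... build the GraphFrame and run connected components here ...

        jsc.stop();
    }
}

Whether the rest of the job still behaves correctly with a non-local default
filesystem is worth verifying before relying on this.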

Thanks

Spark java with Google Store

2017-01-05 Thread Manohar753
Hi Team,

Can someone please share any examples of using Spark with Java to read and write files
from Google Store?

Thank you in advance.
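
No example is cited in the thread, so here is a minimal, hedged Java sketch. It assumes
"Google Store" refers to Google Cloud Storage and that the GCS Hadoop connector
(gcs-connector) jar is on the classpath; the bucket, project id, key-file path, and
property values are placeholders and should be checked against the connector version
in use:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GcsReadWrite {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("GcsReadWrite");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Wire the gs:// scheme to the GCS connector and point it at a
        // service-account key (values below are placeholders).
        Configuration hc = jsc.hadoopConfiguration();
        hc.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        hc.set("fs.gs.project.id", "my-gcp-project");
        hc.set("google.cloud.auth.service.account.enable", "true");
        hc.set("google.cloud.auth.service.account.json.keyfile", "/path/to/key.json");

        // Read, transform, and write with gs:// URIs like any other Hadoop filesystem.
        JavaRDD<String> lines = jsc.textFile("gs://my-bucket/input/*.txt");
        JavaRDD<String> upper = lines.map(line -> line.toUpperCase());
        upper.saveAsTextFile("gs://my-bucket/output/upper");

        jsc.stop();
    }
}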




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-java-with-Google-Store-tp28276.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Setting Spark Properties on Dataframes

2017-01-05 Thread neil90
Can you be more specific on what you would want to change on the DF level?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Setting-Spark-Properties-on-Dataframes-tp28266p28275.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Python in Jupyter Notebook

2017-01-05 Thread neil90
Assuming you don't have your environment variables set up in your
.bash_profile, you would do it like this:

import os
import sys

spark_home = '/usr/local/spark'
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.1-src.zip'))

# Optionally pass here whatever arguments you would pass when launching
# pyspark directly from the command line, e.g.:
# os.environ['PYSPARK_SUBMIT_ARGS'] = """--master spark://54.68.147.137:7077 pyspark-shell"""

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()\
    .setMaster("local[8]")\
    .setAppName("Test")

sc = SparkContext(conf=conf)

spark = SparkSession.builder\
    .config(conf=sc.getConf())\
    .enableHiveSupport()\
    .getOrCreate()

Mind you, this is for Spark 2.0 and above.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Python-in-Jupyter-Notebook-tp28268p28274.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



query on Spark Log directory

2017-01-05 Thread Divya Gehlot
Hi,
I am using an EMR machine and I can see that the Spark log directory has grown
to 4 GB.

file name: spark-history-server.out

Need advice on how I can reduce the size of the above-mentioned file.

Is there a config property which can help me?
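
One possibility, offered as a hedged sketch rather than an EMR-specific answer: the
.out file is the daemon's stdout, and Spark's default log4j configuration writes log
output to the console, so pointing the history server's log4j.properties at a
size-capped rolling file appender (and restarting the daemon) keeps the growth
bounded. The file path and sizes below are placeholders, and EMR may already ship its
own logging configuration, so check before overwriting anything:

# conf/log4j.properties (sketch): route logs to a size-capped rolling file
log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/spark-history-server.log
log4j.appender.rolling.MaxFileSize=100MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n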



Thanks,

Divya