newAPIHadoopFile bad performance

2017-01-05 Thread Mudasar
Hi, I am using newAPIHadoopFile to process a large number of S3 files (around 20 thousand) by passing the URLs as a comma-separated String. It takes around *7 minutes* to start the job. I am running the job on EMR 5.2.0 with Spark 2.0.2. Here is the code: Configuration conf = new Configuration();
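
[Editor's note: a minimal spark-shell-style sketch of the pattern described above — passing many object URLs as one comma-separated string to newAPIHadoopFile. The bucket, paths, and input format are placeholders, not the poster's actual code.]

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("many-s3-files").getOrCreate()
val sc = spark.sparkContext

// Hypothetical list of object URLs; in the question there are ~20k of these,
// presumably read from a manifest rather than hard-coded.
val urls = Seq("s3a://my-bucket/data/part-0001.txt", "s3a://my-bucket/data/part-0002.txt")

val rdd = sc.newAPIHadoopFile(
  urls.mkString(","),          // comma-separated paths, as described above
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  new Configuration())

println(rdd.count())
```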

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Joseph Bradley
Would it be more robust to use the Path when creating the FileSystem? https://github.com/graphframes/graphframes/issues/160 On Thu, Jan 5, 2017 at 4:57 PM, Felix Cheung wrote: > This is likely a factor of your hadoop config and Spark rather than > anything specific
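
[Editor's note: a minimal sketch of what that suggestion could look like; the checkpoint path is a placeholder.]

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Resolve the FileSystem from the checkpoint Path itself, so the URI scheme
// (s3n://, s3a://, hdfs://, ...) decides which implementation is used,
// instead of FileSystem.get(conf), which returns the *default* file system.
def deleteCheckpointDir(sc: SparkContext, checkpointDir: String): Boolean = {
  val path = new Path(checkpointDir)                   // e.g. "s3n://my-bucket/checkpoints" (placeholder)
  val fs = path.getFileSystem(sc.hadoopConfiguration)  // FS matching the path's scheme
  fs.delete(path, true)                                // recursive delete
}
```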

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
This is likely a factor of your Hadoop config and Spark rather than anything specific to GraphFrames. You might have better luck getting assistance if you could isolate the code to a simple case that manifests the problem (without GraphFrames), and repost.

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Ankur Srivastava
Adding the DEV mailing list to see if this is a defect in ConnectedComponents or if they can recommend any solution. Thanks Ankur On Thu, Jan 5, 2017 at 1:10 PM, Ankur Srivastava wrote: > Yes I did try it out and it chooses the local file system as my checkpoint >

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Ankur Srivastava
Yes I did try it out, and it chooses the local file system even though my checkpoint location starts with s3n:// I am not sure how I can make it load the S3FileSystem. On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung wrote: > Right, I'd agree, it seems to be only with delete. > >
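
[Editor's note: a rough sketch, not from the thread, of asking Hadoop for the file system that serves a specific URI scheme rather than relying on the configured default; the bucket name is a placeholder.]

```scala
import java.net.URI
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.SparkContext

// Resolve the file system for an explicit URI instead of the default one.
def s3CheckpointFs(sc: SparkContext): FileSystem =
  FileSystem.get(new URI("s3n://my-bucket/"), sc.hadoopConfiguration)
```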

Writing Parquet from Avro objects - cannot write null value for numeric fields

2017-01-05 Thread Sunita Arvind
Hello Experts, I am trying to allow null values in numeric fields. Here are the details of the issue I have: http://stackoverflow.com/questions/41492344/spark-avro-to-parquet-writing-null-values-in-number-fields I also tried making all columns nullable by using the below function (from one of
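[Editor's note: a sketch of the kind of "make all columns nullable" helper referred to above (the referenced function itself is truncated). It rebuilds the DataFrame with every top-level field marked nullable; nested fields would need a recursive version.]

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

// Copy the schema with nullable = true on every top-level field,
// then re-wrap the existing rows with the relaxed schema.
def setNullable(df: DataFrame): DataFrame = {
  val newSchema = StructType(df.schema.map {
    case StructField(name, dataType, _, metadata) =>
      StructField(name, dataType, nullable = true, metadata)
  })
  df.sqlContext.createDataFrame(df.rdd, newSchema)
}
```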

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
Right, I'd agree, it seems to be only with delete. Could you by chance run just the delete to see if it fails? FileSystem.get(sc.hadoopConfiguration).delete(new Path(somepath), true) From: Ankur Srivastava Sent: Thursday, January 5,

RE: Spark Read from Google store and save in AWS s3

2017-01-05 Thread Manohar Reddy
Hi Steve, Thanks for the reply; below is the follow-up help needed from you. Do you mean we can set up two native file systems on a single SparkContext, so that based on the URL prefixes (gs://bucket/path and dest s3a://bucket-on-s3/path2) it will identify and read/write to the appropriate cloud? Is that my
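
[Editor's note: a sketch of what such a dual-cloud setup could look like on one SparkContext. It assumes the S3A (hadoop-aws) and GCS connector jars are on the classpath; the configuration keys follow those connectors' documentation, and all credentials and paths are placeholders.]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gs-to-s3").getOrCreate()
val hc = spark.sparkContext.hadoopConfiguration

// S3A credentials (placeholders)
hc.set("fs.s3a.access.key", "<aws-access-key>")
hc.set("fs.s3a.secret.key", "<aws-secret-key>")

// GCS connector: implementation class and service-account key file (placeholders)
hc.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hc.set("google.cloud.auth.service.account.json.keyfile", "/path/to/key.json")

// The scheme of each URI decides which file system client handles it.
val df = spark.read.text("gs://source-bucket/path")
df.write.text("s3a://dest-bucket/path2")
```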

Re: Spark SQL - Applying transformation on a struct inside an array

2017-01-05 Thread Olivier Girardot
So, it seems the only way I found for now is recursive handling of the Row instances directly, but to do that I have to go back to RDDs. I've put together a simple test case demonstrating the problem: import org.apache.spark.sql.{DataFrame, SparkSession} import org.scalatest.{FlatSpec,
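
[Editor's note: a self-contained sketch of the "recursive Row handling via RDDs" approach described above. The schema, field names, and transformation are hypothetical; only the general technique matches the message.]

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("nested-transform").getOrCreate()

// Hypothetical schema: an "items" column that is an array of structs,
// where we want to rewrite the "price" field of every element.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("items", ArrayType(StructType(Seq(
    StructField("name", StringType),
    StructField("price", DoubleType)))))))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(
    Row("order-1", Seq(Row("apple", 1.0), Row("pear", 2.0))))),
  schema)

// Fall back to the RDD API and rebuild each nested Row by hand.
val transformed = df.rdd.map { row =>
  val items = row.getAs[Seq[Row]]("items").map { item =>
    Row(item.getAs[String]("name"), item.getAs[Double]("price") * 1.1)
  }
  Row(row.getAs[String]("id"), items)
}

val result = spark.createDataFrame(transformed, schema)
result.show(false)
```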

Re: Spark Python in Jupyter Notebook

2017-01-05 Thread Marco Mistroni
Hi, might be off topic, but Databricks has a web application in which you can use Spark with Jupyter. Have a look at https://community.cloud.databricks.com kr On Thu, Jan 5, 2017 at 7:53 PM, Jon G wrote: > I don't use MapR but I use pyspark with jupyter, and this MapR

Re: Spark Python in Jupyter Notebook

2017-01-05 Thread Jon G
I don't use MapR but I use pyspark with jupyter, and this MapR blog post looks similar to what I do to set up: https://community.mapr.com/docs/DOC-1874-how-to-use-jupyter-pyspark-on-mapr On Thu, Jan 5, 2017 at 3:05 AM, neil90 wrote: > Assuming you don't have your

Re: Help in generating unique Id in spark row

2017-01-05 Thread Olivier Girardot
There is a way: you can use org.apache.spark.sql.functions.monotonicallyIncreasingId; it will give each row of your DataFrame a unique id. On Tue, Oct 18, 2016 10:36 AM, ayan guha guha.a...@gmail.com wrote: Do you have any primary key or unique identifier in your data? Even if multiple
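
[Editor's note: a minimal sketch of attaching such an id, using monotonically_increasing_id, the name this function carries in the Spark 2.x functions API (the camelCase form mentioned above is the older, since-deprecated alias). The toy data is illustrative; the ids are unique per row but not consecutive.]

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")                       // toy data
val withId = df.withColumn("id", monotonically_increasing_id()) // 64-bit unique id per row
withId.show()
```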

Re: Setting Spark Properties on Dataframes

2017-01-05 Thread neil90
This blog post (not mine) has some nice examples - https://hadoopist.wordpress.com/2016/08/19/how-to-create-compressed-output-files-in-spark-2-0/ From the blog - df.write.mode("overwrite").format("parquet").option("compression", "none").mode("overwrite").save("/tmp/file_no_compression_parq")
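
[Editor's note: a small sketch of the two places the Parquet codec can be set — per write, as in the blog snippet above, or session-wide via the SQL conf. Output paths are placeholders.]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.range(1000).toDF("id")

// Per-write option
df.write.mode("overwrite").option("compression", "gzip").parquet("/tmp/gzip_parquet")

// Session-wide default for subsequent Parquet writes
spark.conf.set("spark.sql.parquet.compression.codec", "uncompressed")
df.write.mode("overwrite").parquet("/tmp/uncompressed_parquet")
```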

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Ankur Srivastava
Yes, it works to read the vertices and edges data from the S3 location and is also able to write the checkpoint files to S3. It only fails when deleting the data, and that is because it tries to use the default file system. I tried looking up how to update the default file system but could not find

Re: Spark Read from Google store and save in AWS s3

2017-01-05 Thread Steve Loughran
On 5 Jan 2017, at 09:58, Manohar753 wrote: Hi All, Using Spark, is interoperability/communication between two clouds (Google, AWS) possible? In my use case I need to take Google store as input to Spark and do some

[Spark 2.1.0] Resource Scheduling Challenge in pyspark sparkSession

2017-01-05 Thread Palash Gupta
Hi User Team, I'm trying to schedule resources in Spark 2.1.0 using the below code, but all the CPU cores are still captured by a single Spark application, and hence no other application can start. Could you please help me out: sqlContext =
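
[Editor's note: a hedged sketch, assuming a standalone cluster manager, of capping how many cores one application may take so that other applications can also be scheduled. Shown in Scala; the same configuration keys can be passed to a pyspark SparkSession builder. The values are placeholders.]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("capped-app")
  .config("spark.cores.max", "4")        // total cores this application may hold
  .config("spark.executor.cores", "2")   // cores per executor
  .config("spark.executor.memory", "4g")
  .getOrCreate()
```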

Re: ToLocalIterator vs collect

2017-01-05 Thread Richard Startin
Why not do that with Spark SQL to utilise the executors properly, rather than a sequential filter on the driver? SELECT * FROM A LEFT JOIN B ON A.fk = B.fk WHERE B.pk IS NULL LIMIT k If you were sorting just so you could iterate in order, this might save you a couple of sorts too.
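
[Editor's note: the same idea expressed with the DataFrame API's left_anti join — a sketch; the column name "fk" and the limit k are taken from the SQL above and otherwise assumed.]

```scala
import org.apache.spark.sql.DataFrame

// Rows of A whose fk has no matching row in B, capped at k.
def topKMissing(a: DataFrame, b: DataFrame, k: Int): DataFrame =
  a.join(b, Seq("fk"), "left_anti").limit(k)
```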

ToLocalIterator vs collect

2017-01-05 Thread Rohit Verma
Hi all, I am aware that collect will return a list aggregated on the driver; this will cause an OOM when we have a too-big list. Is toLocalIterator safe to use with a very big list? I want to access all values one by one. Basically the goal is to compare two sorted rdds (A and B) to find top k

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2017-01-05 Thread Palash Gupta
Hi Marco, Yes, it was on the same host when the problem was found. Even when I tried to start on a different host, the problem was still there. Any hints or suggestions will be appreciated. Thanks & Best Regards, Palash Gupta From: Marco Mistroni To: Palash Gupta

Spark Read from Google store and save in AWS s3

2017-01-05 Thread Manohar753
Hi All, Using Spark, is interoperability/communication between two clouds (Google, AWS) possible? In my use case I need to take Google store as input to Spark, do some processing, and finally store the result in S3; my Spark engine runs on an AWS cluster. Please let me know, is there any way for

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2017-01-05 Thread Marco Mistroni
Hi, if it only happens when you run 2 apps at the same time, could it be that these 2 apps somehow run on the same host? Kr On 5 Jan 2017 9:00 am, "Palash Gupta" wrote: > Hi Marco and respected member, > > I have done all the possible things suggested by the forum but still I'm >

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2017-01-05 Thread Palash Gupta
Hi Marco and respected member, I have done all the possible things suggested by the forum but still I'm having the same issue: 1. I will migrate my applications to a production environment where I will have more resources. Palash>> I migrated my application to production where I have more CPU cores,

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
From the stack it looks to be an error from the explicit call to hadoop.fs.FileSystem. Is the URL scheme for s3n registered? Does it work when you try to read from s3 from Spark? _ From: Ankur Srivastava

Spark java with Google Store

2017-01-05 Thread Manohar753
Hi Team, Can someone please share any examples of Spark Java reading and writing files from Google Store? Thank you in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-java-with-Google-Store-tp28276.html Sent from the Apache Spark User List

Re: Setting Spark Properties on Dataframes

2017-01-05 Thread neil90
Can you be more specific on what you would want to change on the DF level? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-Spark-Properties-on-Dataframes-tp28266p28275.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Python in Jupyter Notebook

2017-01-05 Thread neil90
Assuming you don't have your environment variables set up in your .bash_profile, you would do it like this - import os import sys spark_home = '/usr/local/spark' sys.path.insert(0, spark_home + "/python") sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.1-src.zip'))

query on Spark Log directory

2017-01-05 Thread Divya Gehlot
Hi, I am using an EMR machine and I can see the Spark log directory has grown to 4 GB. File name: spark-history-server.out. Need advice on how I can reduce the size of the above-mentioned file. Is there a config property which can help me? Thanks, Divya