Re: Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Failed to connect to server/ip:39232

2017-07-28 Thread jeff saremi
Asking this on a tangent: is there any way for the shuffle data to be replicated to more than one server? Thanks. From: jeff saremi Sent: Friday, July 28, 2017 4:38:08 PM To: Juan Rodríguez Hortalá Cc: user@spark.apache.org Subject: Re:

Re: Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Failed to connect to server/ip:39232

2017-07-28 Thread jeff saremi
Thanks Juan for taking the time. Here's more info: - This is running on Yarn in master mode - See config params below - This is a corporate environment. In general, nodes should not be added to or removed from the cluster that often. Even if that is the case, I would expect that to be one or two

Re: Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Failed to connect to server/ip:39232

2017-07-28 Thread Juan Rodríguez Hortalá
Hi Jeff, Can you provide more information about how you are running your job? In particular: - Which cluster manager are you using? Is it YARN, Mesos, or Spark Standalone? - Which configuration options are you using to submit the job? In particular, are you using dynamic allocation or external
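For reference, a minimal sketch of how the settings Juan is asking about (dynamic allocation, the external shuffle service, and the YARN memory overhead) are typically supplied; the values and app name below are illustrative assumptions, not Jeff's actual configuration:

    from pyspark.sql import SparkSession

    # Illustrative values only; the external shuffle service also has to be
    # registered as a YARN NodeManager aux-service for this setting to take effect.
    spark = (SparkSession.builder
             .appName("shuffle-debug")  # hypothetical app name
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.shuffle.service.enabled", "true")
             .config("spark.yarn.executor.memoryOverhead", "4096")
             .getOrCreate())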

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-28 Thread Priyank Shrivastava
Also, in your example doesn't the temp view need to be accessed using the same SparkSession on the Scala side? Since I am not using a notebook, how can I get access to the same SparkSession in Scala? On Fri, Jul 28, 2017 at 3:17 PM, Priyank Shrivastava wrote: > Thanks

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-28 Thread Priyank Shrivastava
Thanks Burak. In a streaming context, would I need to do any state management for the temp views, for example across sliding windows? Priyank On Fri, Jul 28, 2017 at 3:13 PM, Burak Yavuz wrote: > Hi Priyank, > > You may register them as temporary tables to use across language

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-28 Thread Burak Yavuz
Hi Priyank, You may register them as temporary tables to use across language boundaries. Python: df = spark.readStream... # Python logic df.createOrReplaceTempView("tmp1") Scala: val df = spark.table("tmp1") df.writeStream .foreach(...) On Fri, Jul 28, 2017 at 3:06 PM, Priyank Shrivastava
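Expanded for readability, the Python half of Burak's suggestion looks roughly like this (the socket source is a stand-in for the "# Python logic" in his mail, with host and port as placeholder assumptions; the Scala side would then read the view back with spark.table("tmp1") and attach its writeStream.foreach as he shows):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tempview-handoff").getOrCreate()

    # Build the streaming DataFrame in Python and expose it under a name that
    # another language sharing the same underlying SparkSession can look up.
    df = (spark.readStream
          .format("socket")            # placeholder source; host/port are assumptions
          .option("host", "localhost")
          .option("port", 9999)
          .load())

    df.createOrReplaceTempView("tmp1")  # Scala side: val df = spark.table("tmp1")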

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-28 Thread Priyank Shrivastava
TD, For a hybrid Python-Scala approach, what's the recommended way of handing off a dataframe from Python to Scala? I would like to know especially in a streaming context. I am not using notebooks/Databricks. We are running it on our own Spark 2.1 cluster. Priyank On Wed, Jul 26, 2017 at

can I do spark-submit --jars [s3://bucket/folder/jar_file]? or --jars

2017-07-28 Thread Richard Xin
Can we add extra libraries (jars on S3) to spark-submit? If yes, how? Such as --jars, extraClassPath, extraLibPath? Thanks, Richard
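A hedged sketch of the kind of invocation being asked about; the bucket, paths, and class name are placeholders, and whether an s3a:// URL resolves depends on the cluster's Hadoop S3 filesystem configuration. Note that spark.executor.extraClassPath / extraLibraryPath are classpath and library-path entries expected to already exist on each node, not remote URLs to download:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.Main \
      --jars s3a://my-bucket/libs/extra-lib.jar \
      /path/to/app.jar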

Persisting RDD: Low Percentage with a lot of memory available

2017-07-28 Thread pedroT
Hi, This problem is very annoying for me and I'm tired of surfing the network without finding any good advice to follow. I have a complex job. It had been working fine until I needed to save partial results (RDDs) to files. So I tried to cache the RDDs and then call a saveAsTextFile method and follow the
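A minimal sketch of the pattern being described, using persist(MEMORY_AND_DISK) instead of plain cache() so partitions that do not fit in memory spill to disk rather than being dropped; the input path, transformation, and output path are placeholders, not the poster's actual job:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext.getOrCreate()

    # Placeholder pipeline standing in for the "complex job" in the mail.
    partial = sc.textFile("hdfs:///data/input").map(lambda line: line.upper())

    # Keep the partial result around for later stages, spilling to disk if needed,
    # and also write it out as text.
    partial.persist(StorageLevel.MEMORY_AND_DISK)
    partial.saveAsTextFile("hdfs:///data/partial_results")

    # ... reuse `partial` in the rest of the job, then release it:
    partial.unpersist()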

RE: changing directories in Spark Streming

2017-07-28 Thread Siddhartha Singh Sandhu
Hi, I am saving the output of my streaming process to S3. I want to be able to change the directory of the stream as each hour passes. Will this work: parsed_kf_frame.saveAsTextFiles((s3_location).format( datetime.datetime.today().strftime("%Y%m%d"),
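One thing to watch out for in the snippet above: the strftime call is evaluated once, when the DStream is defined, not once per batch. A hedged workaround is to compute the path inside foreachRDD so each batch picks up the current date and hour; parsed_kf_frame and s3_location are the names from the original snippet, and the assumption that s3_location has two {} placeholders (date, hour) is mine:

    import datetime

    def save_batch(rdd):
        if not rdd.isEmpty():
            now = datetime.datetime.today()
            # Hourly directory plus a per-batch subdirectory, so repeated batches
            # within the same hour do not collide on the same output path.
            hourly_dir = s3_location.format(now.strftime("%Y%m%d"), now.strftime("%H"))
            path = "{}/batch-{}".format(hourly_dir, now.strftime("%H%M%S"))
            rdd.saveAsTextFile(path)

    parsed_kf_frame.foreachRDD(save_batch)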

Re: How to configure spark on Yarn cluster

2017-07-28 Thread yohann jardin
For Yarn, I'm speaking about the file fairscheduler.xml (if you kept Yarn's default scheduling): https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Allocation_file_format Yohann Jardin On 7/28/2017 at 8:00 PM, jeff saremi wrote: The only relevant
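For concreteness, a minimal sketch of what such an allocation file can look like; the queue name and resource limits are placeholders, and the full format is described at the link above:

    <?xml version="1.0"?>
    <allocations>
      <queue name="spark_jobs">
        <minResources>10000 mb, 4 vcores</minResources>
        <maxResources>120000 mb, 40 vcores</maxResources>
        <weight>1.0</weight>
      </queue>
    </allocations>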

Re: How to configure spark on Yarn cluster

2017-07-28 Thread jeff saremi
The only relevant setting I see in Yarn is this: yarn.nodemanager.resource.memory-mb = 120726, which is about 120 GB, and we are well below that. I don't see a total limit. I haven't played with spark.memory.fraction. I'm not sure if it makes a difference. Note that there are no errors

Re: How to configure spark on Yarn cluster

2017-07-28 Thread yohann jardin
Not sure we agree on one thing: Yarn limitations are for the sum of all nodes, while you only specify the memory for a single node through Spark. By the way, the memory displayed in the UI is only a part of the total memory allocation:
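As a rough illustration of that last point, under the Spark 2.x unified memory model the storage figure shown on the Executors page is derived from the executor heap roughly as follows; the 48 GB heap is only an example, while 300 MB reserved and spark.memory.fraction = 0.6 are the Spark 2.x defaults:

    # Back-of-the-envelope only; the real accounting has more moving parts.
    executor_memory_mb = 48 * 1024     # example --executor-memory
    reserved_mb = 300                  # fixed reservation per executor
    memory_fraction = 0.6              # spark.memory.fraction default in Spark 2.x

    unified_mb = (executor_memory_mb - reserved_mb) * memory_fraction
    print("execution + storage pool ~= %.1f GB" % (unified_mb / 1024.0))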

Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Failed to connect to server/ip:39232

2017-07-28 Thread jeff saremi
We have a not-too-complex and not-too-large Spark job that keeps dying with this error. I have researched it and have not seen any convincing explanation of why. I am not using a shuffle service. Which server is the one that is refusing the connection? If I go to the server that is being

Re: How to configure spark on Yarn cluster

2017-07-28 Thread jeff saremi
Thanks so much, Yohann. I checked the Storage/Memory column on the Executors status page. It is well below where I wanted to be. I will try the suggestion on smaller data sets. I am also well within the Yarn limitations (128 GB). In my last try I asked for 48+32 (overhead). So somehow I am exceeding that, or

Re: Spark Streaming with long batch / window duration

2017-07-28 Thread emceemouli
Thanks. If I don't use a window and instead stream the data onto HDFS, could you suggest how to store only one week's worth of data? Should I create a cron job to delete HDFS files older than a week? Please let me know if you have any other suggestions
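If the cron-job route is taken, a minimal sketch could look like the following; it assumes the stream writes into date-named directories (yyyyMMdd) under a base path, and both the base path and that layout are assumptions rather than something from the thread:

    import datetime
    import subprocess

    BASE = "/data/stream_output"   # hypothetical HDFS base path
    cutoff = datetime.date.today() - datetime.timedelta(days=7)

    listing = subprocess.check_output(["hdfs", "dfs", "-ls", BASE]).decode()
    for line in listing.splitlines():
        parts = line.split()
        if not parts:
            continue
        path = parts[-1]
        try:
            day = datetime.datetime.strptime(path.rsplit("/", 1)[-1], "%Y%m%d").date()
        except ValueError:
            continue               # skips the "Found N items" header and odd names
        if day < cutoff:
            subprocess.check_call(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path])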

Spark Streaming as a Service

2017-07-28 Thread ajit roshen
We have a few Spark Streaming apps running on our AWS Spark 2.1 Yarn cluster. We currently log on to the master node of the cluster and start the app using "spark-submit", calling the jar. We would like to open this up to our users so that they can submit their own apps, but we would not be able

subscribe

2017-07-28 Thread ajit roshen

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
I think it will be the same, but let me try that. FYR: https://issues.apache.org/jira/browse/SPARK-19881 On Fri, Jul 28, 2017 at 4:44 PM, ayan guha wrote: > Try running spark.sql("set yourconf=val") > > On Fri, 28 Jul 2017 at 8:51 pm, Chetan Khatri

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread ayan guha
Try running spark.sql("set yourconf=val") On Fri, 28 Jul 2017 at 8:51 pm, Chetan Khatri wrote: > Jorn, Both are same. > > On Fri, Jul 28, 2017 at 4:18 PM, Jörn Franke wrote: > >> Try sparksession.conf().set >> >> On 28. Jul 2017, at 12:19,

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
Jörn, both are the same. On Fri, Jul 28, 2017 at 4:18 PM, Jörn Franke wrote: > Try sparksession.conf().set > > On 28. Jul 2017, at 12:19, Chetan Khatri > wrote: > > Hey Dev/ USer, > > I am working with Spark 2.0.1 and with dynamic partitioning

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Jörn Franke
Try sparksession.conf().set > On 28. Jul 2017, at 12:19, Chetan Khatri wrote: > > Hey Dev/ USer, > > I am working with Spark 2.0.1 and with dynamic partitioning with Hive facing > below issue: > > org.apache.hadoop.hive.ql.metadata.HiveException: > Number of

Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
Hey Dev/User, I am working with Spark 2.0.1 and with dynamic partitioning with Hive, and I am facing the issue below: org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 1344, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least
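Pulling the thread's suggestions together, a hedged sketch of attempting the setting from Spark; the value 2000 is illustrative, and whether Spark 2.0.1 actually forwards it to Hive's dynamic-partition check is exactly what SPARK-19881 is about:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .enableHiveSupport()
             .getOrCreate())

    # ayan's suggestion: issue it as a SQL SET command
    spark.sql("set hive.exec.max.dynamic.partitions=2000")

    # Jörn's suggestion: set it on the session's runtime conf
    spark.conf.set("hive.exec.max.dynamic.partitions", "2000")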

Re: SPARK Storagelevel issues

2017-07-28 Thread 周康
All right, I did not catch the point, sorry for that. But you can take a snapshot of the heap and then analyze the heap dump with MAT or other tools. From the code I cannot find any clue. 2017-07-28 17:09 GMT+08:00 Gourav Sengupta : > Hi, > > I have done all of that, but
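For reference, the heap-snapshot step being suggested usually amounts to something like this on the host of the problematic executor; the PID and output path are placeholders:

    # Find the executor JVM, dump its live heap, then open the file in Eclipse MAT.
    jps -lm | grep CoarseGrainedExecutorBackend
    jmap -dump:live,format=b,file=/tmp/executor-heap.hprof <executor_pid>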

Re: SPARK Storagelevel issues

2017-07-28 Thread Gourav Sengupta
Hi, I have done all of that, but my question is "why should 62 MB of data give a memory error when we have over 2 GB of memory available?" Therefore all that is mentioned by Zhoukang is not pertinent at all. Regards, Gourav Sengupta On Fri, Jul 28, 2017 at 4:43 AM, 周康

Re: How to configure spark on Yarn cluster

2017-07-28 Thread yohann jardin
Check the executor page of the Spark UI to see if your storage level is limiting. Also, instead of starting with 100 TB of data, sample it, make it work, and grow it little by little until you reach 100 TB. This will validate the workflow and let you see how much data is shuffled, etc.

How to configure spark on Yarn cluster

2017-07-28 Thread jeff saremi
I have the simplest job, which I'm running against 100 TB of data. The job keeps failing with ExecutorLostFailures on containers killed by Yarn for exceeding memory limits. I have varied the executor-memory from 32 GB to 96 GB, and spark.yarn.executor.memoryOverhead from 8192 to 36000, and similar
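For what it's worth, a quick sanity check with the numbers quoted in this thread (purely illustrative arithmetic): YARN sizes each executor container at roughly executor-memory plus spark.yarn.executor.memoryOverhead, and that sum has to fit under both yarn.scheduler.maximum-allocation-mb and the node's yarn.nodemanager.resource.memory-mb (120726 MB elsewhere in the thread). The pairs below are the endpoints mentioned here plus the 48+32 combination from later in the thread:

    node_limit_mb = 120726   # yarn.nodemanager.resource.memory-mb quoted in the thread

    # (executor-memory in GB, memoryOverhead in MB) combinations from the thread
    for exec_gb, overhead_mb in [(32, 8192), (48, 32 * 1024), (96, 36000)]:
        container_mb = exec_gb * 1024 + overhead_mb
        verdict = "fits" if container_mb <= node_limit_mb else "exceeds the node limit"
        print("%d GB + %d MB overhead = %d MB -> %s"
              % (exec_gb, overhead_mb, container_mb, verdict))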