the spark job is so slow during shuffle - almost frozen

2016-07-18 Thread Zhiliang Zhu
Hi All, While referring to the Spark UI, the stage is displayed as 198/200 - almost frozen... during the shuffle stage of one task, most of the executors are at 0 bytes, but just one executor is at 1 G. Moreover, in the several join operations, some case is like this, one table or

Unsubscribe

2016-07-18 Thread Jinan Alhajjaj

Re: Dynamically get value based on Map key in Spark Dataframe

2016-07-18 Thread Divya Gehlot
Hi Jacek, Can you please share an example of how I can access the broadcast map? val pltStnMapBrdcst = sc.broadcast(keyvalueMap ) val df_replacekeys = df_input.withColumn("map_values", pltStnMapBrdcst.value.get("key" Is the above the right way to access the broadcasted map ? Thanks, Divya On 18
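A minimal sketch of that lookup, assuming the names from the thread (keyvalueMap, df_input) and a string column named "key"; the broadcast value is read inside a UDF so the map lookup happens per row on the executors rather than once on the driver:

import org.apache.spark.sql.functions.{col, udf}

val pltStnMapBrdcst = sc.broadcast(keyvalueMap)
// Look each row's key up in the broadcast map; fall back to an empty string when missing.
val mapLookup = udf((k: String) => pltStnMapBrdcst.value.getOrElse(k, ""))
val df_replacekeys = df_input.withColumn("map_values", mapLookup(col("key")))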

Re: Execute function once on each node

2016-07-18 Thread Josh Asplund
The Spark workers are running side-by-side with scientific simulation code. The code writes output to local SSDs to keep latency low. Due to the volume of data being moved (tens of terabytes and up), it isn't really feasible to copy the data to a global filesystem. Executing a function on each node

Re: spark-submit local and Akka startup timeouts

2016-07-18 Thread Bryan Cutler
Hi Rory, for starters what version of Spark are you using? I believe that in a 1.5.? release (I don't know which one off the top of my head) there was an addition that would also display the config property when a timeout happened. That might help some if you are able to upgrade. On Jul 18,

ApacheCon: Getting the word out internally

2016-07-18 Thread Melissa Warnkin
ApacheCon: Getting the word out internally Dear Apache Enthusiast, As you are no doubt already aware, we will be holding ApacheCon in Seville, Spain, the week of November 14th, 2016. The call for papers (CFP) for this event is now open, and will remain open until September 9th. The event is

Re: Execute function once on each node

2016-07-18 Thread Aniket Bhatnagar
You can't assume that the number of nodes will be constant as some may fail, hence you can't guarantee that a function will execute at most once or at least once on a node. Can you explain your use case in a bit more detail? On Mon, Jul 18, 2016, 10:57 PM joshuata wrote: >

Re: transition SQLContext to SparkSession

2016-07-18 Thread Reynold Xin
Good idea. https://github.com/apache/spark/pull/14252 On Mon, Jul 18, 2016 at 12:16 PM, Michael Armbrust wrote: > + dev, reynold > > Yeah, that's a good point. I wonder if SparkSession.sqlContext should be > public/deprecated? > > On Mon, Jul 18, 2016 at 8:37 AM,

Re: transition SQLContext to SparkSession

2016-07-18 Thread Benjamin Kim
From what I read, there are no more Contexts. "SparkContext, SQLContext, HiveContext merged into SparkSession" I have not tested it, so I don't know if it's true. Cheers, Ben > On Jul 18, 2016, at 8:37 AM, Koert Kuipers wrote: > > in my codebase i would like to

Execute function once on each node

2016-07-18 Thread joshuata
I am working on a spark application that requires the ability to run a function on each node in the cluster. This is used to read data from a directory that is not globally accessible to the cluster. I have tried creating an RDD with n elements and n partitions so that it is evenly distributed
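A sketch of the RDD-per-partition approach described above; numNodes is an assumed value, and Spark gives no hard guarantee that exactly one partition lands on each node:

val numNodes = 4  // assumed cluster size
sc.parallelize(1 to numNodes, numNodes).foreachPartition { _ =>
  // node-local work goes here, e.g. reading from the non-global directory
}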

Re: Spark driver getting out of memory

2016-07-18 Thread Mich Talebzadeh
Can you please clarify: 1. In what mode are you running Spark: standalone, yarn-client, yarn-cluster, etc.? 2. You have 4 nodes with each executor having 10G. How many actual executors do you see in the UI (port 4040 by default)? 3. What is master memory? Are you referring to driver

Re: transition SQLContext to SparkSession

2016-07-18 Thread Michael Armbrust
+ dev, reynold Yeah, that's a good point. I wonder if SparkSession.sqlContext should be public/deprecated? On Mon, Jul 18, 2016 at 8:37 AM, Koert Kuipers wrote: > in my codebase i would like to gradually transition to SparkSession, so > while i start using SparkSession i

Re: Question About OFF_HEAP Caching

2016-07-18 Thread Bin Fan
Here is one blog illustrating how to use Spark on Alluxio for this purpose. Hope it will help: http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/ On Mon, Jul 18, 2016 at 6:36 AM, Gene Pang wrote: > Hi, > > If you want to use Alluxio with Spark 2.x, it

Re: pyspark 1.5.0 save model?

2016-07-18 Thread Holden Karau
If you used RandomForestClassifier from mllib you can use the save method described in http://spark.apache.org/docs/1.5.0/api/python/pyspark.mllib.html#module-pyspark.mllib.classification which will write out some JSON metadata as well as parquet for the actual model. For the newer ml pipeline one

spark-submit local and Akka startup timeouts

2016-07-18 Thread Rory Waite
Hi All, We have created a regression test for a Spark job that is executed during our automated build. It executes a spark-submit with a local master, processes some data, and then exits. We have an issue in that we get a non-deterministic timeout error. It seems to be when the spark context
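For Spark 1.x the Akka-era timeout could be raised at submit time; a hedged example (spark.akka.timeout is the 1.x property, in seconds; the value, class and jar names here are placeholders):

spark-submit --master "local[2]" \
  --conf spark.akka.timeout=300 \
  --class com.example.RegressionTest regression-test.jar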

Re: Trouble while running spark at ec2 cluster

2016-07-18 Thread Andy Davidson
Hi Hassan Typically I log on to my master to submit my app. [ec2-user@ip-172-31-11-222 bin]$ echo $SPARK_ROOT /root/spark [ec2-user@ip-172-31-11-222 bin]$echo $MASTER_URL spark://ec2-54-215-11-222.us-west-1.compute.amazonaws.com:7077 [ec2-user@ip-172-31-11-222 bin]$

Re: Increasing spark.yarn.executor.memoryOverhead degrades performance

2016-07-18 Thread Sean Owen
Possibilities: - You are using more memory now (and not getting killed), but now are exceeding OS memory and are swapping - Your heap sizes / config aren't quite right and now, instead of failing earlier because YARN killed the job, you're running normally but seeing a lot of time lost to GC

Increasing spark.yarn.executor.memoryOverhead degrades performance

2016-07-18 Thread Sunita Arvind
Hello Experts, For one of our streaming applications, we intermittently saw: WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 12.0 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. Based on what I found on the internet and the error
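The overhead is given in megabytes and can be set on the submit command line; a hedged example with illustrative sizes (the class and jar names are placeholders):

# 2048 MB of off-heap overhead per executor; adjust to what the YARN containers allow
spark-submit --master yarn --deploy-mode cluster \
  --executor-memory 10g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --class com.example.StreamingApp streaming-app.jar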

transition SQLContext to SparkSession

2016-07-18 Thread Koert Kuipers
In my codebase I would like to gradually transition to SparkSession, so while I start using SparkSession I also want a SQLContext to be available as before (but with a deprecation warning when I use it). This should be easy since SQLContext is now a wrapper for SparkSession. So basically: val
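A minimal sketch of that transition (the app name is illustrative): SparkSession still exposes a SQLContext, so old code can keep compiling while new code uses the session.

import org.apache.spark.sql.{SparkSession, SQLContext}

val spark = SparkSession.builder().appName("myApp").getOrCreate()
// existing code can keep taking a SQLContext while it is migrated
val sqlContext: SQLContext = spark.sqlContext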

Re: Dynamically get value based on Map key in Spark Dataframe

2016-07-18 Thread Jacek Laskowski
See broadcast variable. Or (just a thought) do join between DataFrames. Jacek On 18 Jul 2016 9:24 a.m., "Divya Gehlot" wrote: > Hi, > > I have created a map by reading a text file > val keyValueMap = file_read.map(t => t.getString(0) -> >

Re: Custom Spark Error on Hadoop Cluster

2016-07-18 Thread Xiangrui Meng
Glad to hear. Could you please share your solution on the user mailing list? -Xiangrui On Mon, Jul 18, 2016 at 2:26 AM Alger Remirata wrote: > Hi Xiangrui, > > We have now solved the problem. Thanks for all the tips you've given. > > Best Regards, > > Alger > > On Thu,

Re: Question About OFF_HEAP Caching

2016-07-18 Thread Gene Pang
Hi, If you want to use Alluxio with Spark 2.x, it is recommended to write to and read from Alluxio with files. You can save an RDD with saveAsObjectFile with an Alluxio path (alluxio://host:port/path/to/file), and you can read that file from any other Spark job. Here is additional information on
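A small sketch of that pattern (the host, port, path and record type are assumptions; 19998 is the usual Alluxio default port):

// write from one job ...
rdd.saveAsObjectFile("alluxio://alluxio-master:19998/path/to/file")
// ... and read the same file back from any other Spark job
val restored = sc.objectFile[MyRecord]("alluxio://alluxio-master:19998/path/to/file")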

Re: Concatenate the columns in dataframe to create new columns using Java

2016-07-18 Thread Abhishek Anand
Thanks Nihed. I was able to do this in exactly the same way. Cheers!! Abhi On Mon, Jul 18, 2016 at 5:56 PM, nihed mbarek wrote: > and if we have this static method > df.show(); > Column c = concatFunction(df, "l1", "firstname,lastname"); >

Re: Concatenate the columns in dataframe to create new columns using Java

2016-07-18 Thread nihed mbarek
and if we have this static method df.show(); Column c = concatFunction(df, "l1", "firstname,lastname"); df.select(c).show(); with this code : Column concatFunction(DataFrame df, String fieldName, String columns) { String[] array = columns.split(",");

Re: Concatenate the columns in dataframe to create new columns using Java

2016-07-18 Thread Abhishek Anand
Hi Nihed, Thanks for the reply. I am looking for something like this: DataFrame training = orgdf.withColumn("I1", functions.concat(orgdf.col("C0"),orgdf.col("C1"))); Here I have to give the C0 and C1 columns; I am looking to write a generic function that concatenates the columns depending on
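The thread asks for Java, but the same idea in a short Scala sketch (functions.concat accepts a varargs of Columns in both the Scala and Java APIs; the helper name and column lists are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat}

// Build one interaction column from an arbitrary list of column names.
def addInteraction(df: DataFrame, name: String, colNames: Seq[String]): DataFrame =
  df.withColumn(name, concat(colNames.map(col): _*))

val training = addInteraction(orgdf, "I1", Seq("C0", "C3", "C5"))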

Re: Inode for STS

2016-07-18 Thread ayan guha
Hi, Thanks for this. However, I am interested in regular deletion of temp files while the server is up. Additionally, the link says it is not of use for a multi-user environment. Any other idea? Is there any variation of cleaner.ttl? On Mon, Jul 18, 2016 at 8:00 PM, Chanh Le wrote: >

Re: Concatenate the columns in dataframe to create new columns using Java

2016-07-18 Thread nihed mbarek
Hi, I just wrote this code to help you. Is this what you need? SparkConf conf = new SparkConf().setAppName("hello").setMaster("local"); JavaSparkContext sc = new JavaSparkContext(conf); SQLContext sqlContext = new SQLContext(sc); List persons = new

Re: Spark driver getting out of memory

2016-07-18 Thread Saurav Sinha
I have set --driver-memory 5g. I need to understand whether driver memory needs to be increased as the number of partitions increases. What would be the best ratio of number of partitions to driver memory? On Mon, Jul 18, 2016 at 4:07 PM, Zhiliang Zhu wrote: > try to set --driver-memory xg , x would be

Concatenate the columns in dataframe to create new columns using Java

2016-07-18 Thread Abhishek Anand
Hi, I have a dataframe having, say, C0, C1, C2 and so on as columns. I need to create interaction variables to be taken as input for my program. For e.g., I need to create I1 as a concatenation of C0, C3, C5. Similarly, I2 = concat(C4, C5) and so on .. How can I achieve this in my Java code for

Re: Spark driver getting out of memory

2016-07-18 Thread Zhiliang Zhu
Try to set --driver-memory xg; x should be as large as can be set. On Monday, July 18, 2016 6:31 PM, Saurav Sinha wrote: Hi, I am running a spark job. Master memory - 5G, executor memory 10G (running on 4 nodes). My job is getting killed as the number of partitions increases
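The flag is --driver-memory; a hedged submit example with illustrative sizes (the class and jar names are placeholders):

spark-submit --master yarn \
  --driver-memory 8g \
  --executor-memory 10g \
  --class com.example.WriteToKafkaJob write-to-kafka.jar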

Re: the spark job is so slow - almost frozen

2016-07-18 Thread Zhiliang Zhu
Thanks a lot for your reply. In effect, here we tried to run the SQL on Kettle, Hive and Spark Hive (via HiveContext) respectively; the job seems frozen and never finishes running. From the 6 tables, we need to read the different columns in different tables for specific information, then do

Spark driver getting out of memory

2016-07-18 Thread Saurav Sinha
Hi, I am running a spark job. Master memory - 5G, executor memory 10G (running on 4 nodes). My job is getting killed as the number of partitions increases to 20K. 16/07/18 14:53:13 INFO DAGScheduler: Got job 17 (foreachPartition at WriteToKafka.java:45) with 13524 output partitions (allowLocal=false) 16/07/18

Re: the spark job is so slow - almost frozen

2016-07-18 Thread Chanh Le
Hi, What about the network (bandwidth) between Hive and Spark? Did it run in Hive before you moved to Spark? Because it's complex, you can use something like the EXPLAIN command to show what's going on. > On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu wrote: > > the
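A quick sketch of both ways to inspect the plan (the query text and DataFrame name are placeholders):

val query = "SELECT ..."  // the actual multi-table join SQL from the job
sqlContext.sql("EXPLAIN EXTENDED " + query).collect().foreach(println)
// or, for a DataFrame that is already built:
someDF.explain(true)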

Re: the spark job is so slow - almost frozen

2016-07-18 Thread Zhiliang Zhu
The SQL logic in the program is very complex, so the detailed code is not described here. On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu wrote: Hi All, Here we have one application, it needs to extract different columns from 6 hive tables, and

Re: Re: how to tune spark shuffle

2016-07-18 Thread lizhenm...@163.com

the spark job is so slow - almost frozen

2016-07-18 Thread Zhiliang Zhu
Hi All, Here we have one application; it needs to extract different columns from 6 hive tables and then does some easy calculation. There are around 100,000 rows in each table; finally it needs to output another table or file (with a format of consistent columns). However, after lots of

Re: Inode for STS

2016-07-18 Thread Chanh Le
Hi Ayan, It seems like you mean this: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.start.cleanup.scratchdir

pyspark 1.5.0 save model?

2016-07-18 Thread pseudo oduesp
Hi, how can I save a model under pyspark 1.5.0? I use RandomForestClassifier(). Thanks in advance.

Re: how to tune spark shuffle

2016-07-18 Thread Mich Talebzadeh
Hi, Can you print out the Environment tab of your UI? By default spark-sql runs in local mode, which means that you only have one driver and one executor in one JVM. Can you increase the executor memory through SET spark.executor.memory=xG, adjust it, and run the SQL again? HTH Dr Mich
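An alternative to setting it inside the session is to size the executors when launching spark-sql; a hedged example with illustrative values and a placeholder SQL file name:

spark-sql --master yarn --num-executors 4 --executor-memory 4g -f my_query.sql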

Dynamically get value based on Map key in Spark Dataframe

2016-07-18 Thread Divya Gehlot
Hi, I have created a map by reading a text file: val keyValueMap = file_read.map(t => t.getString(0) -> t.getString(4)).collect().toMap Now I have another dataframe where I need to dynamically replace all the keys of the Map with their values. val df_input = reading the file as dataframe val df_replacekeys
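A join-based variant of the same replacement, as Jacek suggests above (assuming a SQLContext named sqlContext is in scope; the "key" column name and the outer-join choice are assumptions about the data):

import sqlContext.implicits._

val df_lookup = keyValueMap.toSeq.toDF("key", "value")
// bring the mapped value alongside every input row that has a matching key
val df_replacekeys = df_input.join(df_lookup, Seq("key"), "left_outer")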

how to tune spark shuffle

2016-07-18 Thread leezy
Hi: I am running a join operation in spark-sql, but when I increase the executor memory, the run time becomes longer. In the Spark UI, I can see that the shuffle becomes slower when the memory becomes bigger. How can I tune it?

Re: Spark Job trigger in production

2016-07-18 Thread Jagat Singh
You can use the following options: * spark-submit from a shell * some kind of job server; see spark-jobserver for details * some notebook environment; see Zeppelin for example On 18 July 2016 at 17:13, manish jaiswal wrote: > Hi, > > What is the best approach to trigger
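A bare-bones example of the first option (all names and paths are placeholders):

spark-submit --master yarn --deploy-mode cluster \
  --class com.example.ProductionJob \
  hdfs:///apps/production-job.jar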

Spark Job trigger in production

2016-07-18 Thread manish jaiswal
Hi, What is the best approach to trigger a Spark job in a production cluster?

Question About OFF_HEAP Caching

2016-07-18 Thread condor join
Hi All, I have some questions about OFF_HEAP caching. In Spark 1.X, when we use rdd.persist(StorageLevel.OFF_HEAP), that means the RDD is cached in Tachyon (Alluxio). However, in Spark 2.X, we can directly use OFF_HEAP for caching
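A short sketch of the two styles (the 2.x settings shown are the spark.memory.offHeap.* options; the size is illustrative):

import org.apache.spark.storage.StorageLevel

// Spark 1.x: OFF_HEAP storage is backed by Tachyon/Alluxio
rdd.persist(StorageLevel.OFF_HEAP)

// Spark 2.x: off-heap memory is managed by Spark itself and must be enabled, e.g.
//   spark.memory.offHeap.enabled=true
//   spark.memory.offHeap.size=2g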