Shuffle write explosion

2018-11-04 Thread Yichen Zhou
Hi All, when running a Spark job I have 100MB+ of input but get more than 700GB of shuffle write, which is really weird, and the job finally ends up with an OOM error. Does anybody know why this happened? My code is like: > JavaPairRDD inputRDD =
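The poster's code is truncated, so the real cause can't be confirmed from this message alone; a common way for shuffle write to dwarf input size is a join (or cartesian product) on heavily duplicated keys, where the output row count scales with the product of the per-key counts on each side. A minimal pure-Python sketch of that arithmetic (hypothetical data, not the poster's job):

```python
from collections import defaultdict

def join_output_size(left_keys, right_keys):
    """Rows an inner join on these keys would produce:
    for each key, (matches on the left) * (matches on the right)."""
    left = defaultdict(int)
    right = defaultdict(int)
    for k in left_keys:
        left[k] += 1
    for k in right_keys:
        right[k] += 1
    return sum(left[k] * right[k] for k in left)

# 10,000 rows on each side, all sharing one hot key:
n = 10_000
print(join_output_size(["hot"] * n, ["hot"] * n))  # 100000000 rows
```

With a single hot key, 10k x 10k input rows join into 100 million output rows, so a 100MB input can easily produce hundreds of GB of shuffle data. Checking key cardinality on both sides of any join is usually the first diagnostic step.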

Re: [Spark SQL] INSERT OVERWRITE to a hive partitioned table (pointing to s3) from spark is too slow.

2018-11-04 Thread Bhaskar Ebbur
Here's some sample code. self.session = SparkSession \ .builder \ .appName(self.app_name) \ .config("spark.dynamicAllocation.enabled", "false") \ .config("hive.exec.dynamic.partition.mode", "nonstrict") \

Re: [Spark SQL] INSERT OVERWRITE to a hive partitioned table (pointing to s3) from spark is too slow.

2018-11-04 Thread Jörn Franke
Can you share some relevant source code? > On 05.11.2018 at 07:58, ehbhaskar wrote: > > I have a pyspark job that inserts data into a hive partitioned table using an > `Insert Overwrite` statement. > > The Spark job loads data quickly (in 15 mins) to a temp directory (~/.hive-***) in > S3. But, it's

[Spark SQL] INSERT OVERWRITE to a hive partitioned table (pointing to s3) from spark is too slow.

2018-11-04 Thread ehbhaskar
I have a pyspark job that inserts data into a hive partitioned table using an `Insert Overwrite` statement. The Spark job loads data quickly (in 15 mins) to a temp directory (~/.hive-***) in S3, but it's very slow in moving data from the temp directory to the target path; it takes more than 40 mins to move
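The slow step the poster describes is the final move from the Hive staging directory to the target path; on S3 a "rename" is really a copy-plus-delete, so this commit phase can dominate the job. Two settings that are commonly suggested for this situation (a hedged sketch, not a confirmed fix for this particular job), in `spark-defaults.conf` form:

```
# Use the v2 FileOutputCommitter algorithm: tasks move their output
# directly to the final location, skipping a second job-level rename pass
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2

# Spark 2.3+: overwrite only the partitions actually being written,
# instead of the whole table
spark.sql.sources.partitionOverwriteMode  dynamic
```

Note that the v2 committer trades some failure-atomicity for speed, and object stores like S3 generally benefit most from a purpose-built committer (e.g. the S3A committers in newer Hadoop releases) rather than rename-based commits.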

Re: how to use cluster sparkSession like localSession

2018-11-04 Thread Sumedh Wale
Hi, I think what you need is a long-running Spark cluster to which you can submit jobs dynamically. For SQL, you can start Spark's HiveServer2: https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine This will start a long
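For concreteness, the distributed SQL engine the reply links to is started and queried roughly like this (a command sketch based on the linked docs; host, port, and master are assumptions for illustration):

```
# From the Spark installation directory: start the long-running
# Thrift JDBC/ODBC server (Spark's HiveServer2-compatible endpoint)
./sbin/start-thriftserver.sh --master yarn

# Connect from any client with beeline (default port is 10000)
./bin/beeline -u jdbc:hive2://localhost:10000
```

Once the server is up, clients share its SparkSession-backed SQL engine, so each query reuses the running cluster instead of creating a new local session.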

RE: how to use cluster sparkSession like localSession

2018-11-04 Thread Sun, Keith
Hello, I think you can try the below; the reason is that only yarn-client mode is supported for your scenario. master("yarn-client") Thanks very much. Keith From: 张万新 Sent: Thursday, November 1, 2018 11:36 PM To: 崔苗 (Data & AI Product Development Dept.) <0049003...@znv.com> Cc: user Subject: Re: how to use cluster
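A side note on the suggestion above: in Spark 2.x the `"yarn-client"` master string is deprecated in favor of `"yarn"` plus an explicit deploy mode. The equivalent modern form (a sketch, in `spark-defaults.conf` style; the same can be passed as `--master yarn --deploy-mode client` to spark-submit):

```
spark.master             yarn
spark.submit.deployMode  client
```

Client deploy mode keeps the driver in the submitting process, which is what makes a locally usable SparkSession against a YARN cluster possible.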

Spark 2.4.0 artifact in Maven repository

2018-11-04 Thread Bartosz Konieczny
Hi, today I wanted to set up a development environment for GraphX, and when I visited the Maven central repository ( https://mvnrepository.com/artifact/org.apache.spark/spark-graphx ) I saw that it was already available in version 2.4.0. Does this mean that the new version of Apache Spark was released?
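For anyone wanting to try the artifact the poster found, the Maven coordinates look like this (a POM fragment; the `_2.11` suffix assumes the Scala 2.11 build, which was the default for Spark 2.4.0):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-graphx_2.11</artifactId>
  <version>2.4.0</version>
</dependency>
```

Artifacts can appear on Maven Central shortly before the official release announcement, so presence of the JAR alone does not guarantee the release vote has concluded.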