Hi All,
When running a Spark job, I have 100MB+ of input but get more than 700GB of
shuffle write, which is really weird. And the job finally ends up with an
OOM error. Does anybody know why this happened?
[image: Screen Shot 2018-11-05 at 15.20.35.png]
My code is like:
> JavaPairRDD inputRDD =
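For what it's worth, shuffle write that dwarfs the input usually points at a wide transformation that duplicates or regroups records, e.g. groupByKey on hot keys or a skewed join. Here is a toy illustration in plain Python (not Spark; the partition data and the record-counting model are made up for the example) of why a map-side combine, as in reduceByKey, shrinks shuffle write compared to shipping every record:

```python
from collections import defaultdict

# Toy model (plain Python, no Spark needed): each partition is a list of
# (key, value) records, and "shuffle write" is counted as the number of
# records a partition sends over the network.

partitions = [
    [("a", 1)] * 500 + [("b", 1)] * 500,   # partition 0: 1000 records
    [("a", 1)] * 900 + [("c", 1)] * 100,   # partition 1: 1000 records
]

def shuffle_write_group_by_key(parts):
    # groupByKey ships every record unchanged, so shuffle write
    # grows with the total record count.
    return sum(len(p) for p in parts)

def shuffle_write_reduce_by_key(parts):
    # reduceByKey combines values per key inside each partition first
    # (map-side combine), so each partition emits at most one record
    # per distinct key.
    total = 0
    for part in parts:
        combined = defaultdict(int)
        for key, value in part:
            combined[key] += value
        total += len(combined)
    return total

print(shuffle_write_group_by_key(partitions))   # 2000
print(shuffle_write_reduce_by_key(partitions))  # 4
```

In real Spark the same idea applies: if the job builds giant per-key groups or joins on a skewed key, each input record can be shipped (or duplicated) many times, which is one way 100MB of input becomes hundreds of GB of shuffle.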
Here's some sample code.
self.session = SparkSession \
.builder \
.appName(self.app_name) \
.config("spark.dynamicAllocation.enabled", "false") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.enableHiveSupport() \
.getOrCreate()
Can you share some relevant source code?
> On 05.11.2018 at 07:58, ehbhaskar wrote:
>
> I have a pyspark job that inserts data into a Hive partitioned table using
> an `Insert Overwrite` statement.
>
> Spark job loads data quickly (in 15 mins) to a temp directory (~/.hive-***) in
> S3. But it's very slow in moving data from the temp directory to the target
> path; it takes more than 40 mins to move
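That slow final step is typically the rename-based commit of temp files into the target path, which is expensive on S3. Assuming Spark 2.3+, the settings below are a sketch of configs that are often tried for this; whether they help depends on your committer and Hive setup, so treat them as a starting point, not a fix:

```python
from pyspark.sql import SparkSession

# Sketch only (assumes Spark 2.3+ writing to S3-backed Hive tables):
# - partitionOverwriteMode=dynamic overwrites only the partitions being
#   written instead of the whole table path.
# - fileoutputcommitter algorithm version 2 skips the second rename pass.
spark = SparkSession.builder \
    .appName("insert-overwrite-example") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2") \
    .enableHiveSupport() \
    .getOrCreate()
```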
Hi,
I think what you need is a long-running Spark cluster to which you can
submit jobs dynamically.
For SQL, you can start Spark's HiveServer2:
https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
This will start a long-running Thrift JDBC/ODBC server.
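Concretely, starting the Thrift server and submitting SQL to it looks roughly like this (paths assume a standard Spark distribution; see the linked guide for the exact options):

```shell
# From the Spark installation directory: start the long-running
# HiveServer2-compatible Thrift server, e.g. on YARN.
./sbin/start-thriftserver.sh --master yarn

# Then connect and run SQL against it dynamically over JDBC, e.g. with beeline:
./bin/beeline -u jdbc:hive2://localhost:10000
```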
Hello,
I think you can try the below; the reason is that only yarn-client mode is
supported for your scenario.
master("yarn-client")
Thanks very much.
Keith
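A side note on the master("yarn-client") suggestion above: the "yarn-client" master string was deprecated in Spark 2.0. Assuming Spark 2.x, the equivalent modern form separates the master from the deploy mode (sketch):

```python
from pyspark.sql import SparkSession

# Sketch: master "yarn" plus client deploy mode replaces "yarn-client".
spark = SparkSession.builder \
    .master("yarn") \
    .config("spark.submit.deployMode", "client") \
    .getOrCreate()
```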
From: 张万新
Sent: Thursday, November 1, 2018 11:36 PM
To: 崔苗 (Data & AI Product Development Department) <0049003...@znv.com>
Cc: user
Subject: Re: how to use cluster
Hi,
Today I wanted to set up a development environment for GraphX, and when I
visited the Maven Central repository (
https://mvnrepository.com/artifact/org.apache.spark/spark-graphx) I saw
that it is already available in version 2.4.0. Does this mean that a new
version of Apache Spark has been released?
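If the 2.4.0 artifacts on Maven Central are the real release, a GraphX development setup would declare the dependency roughly like this (sketch; the _2.11 Scala binary suffix is the default for Spark 2.4.0 — check the repository page for your Scala version):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-graphx_2.11</artifactId>
  <version>2.4.0</version>
</dependency>
```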