[K8S] Driver and Executor Logging

2018-09-07 Thread Rohit Menon
Hello All, We are trying to use a custom appender for the Spark driver and executor pods. However, changes to the log4j.properties file in the Spark container image are not taking effect. We even tried simpler changes, like setting the logging level to DEBUG. Has anyone run into similar issues? or
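One approach that is often suggested for this kind of problem (a sketch only; the paths, master URL, and jar are placeholders, not taken from the thread) is to ship the log4j.properties file explicitly with the job instead of relying on the copy baked into the image, and point both JVMs at it via the extra-Java-options settings:

```shell
# Sketch, assuming Spark on Kubernetes in cluster mode; all paths are
# illustrative placeholders. --files distributes log4j.properties into each
# container's working directory, and the -Dlog4j.configuration flag forces
# the driver and executor JVMs to load that copy rather than the one baked
# into the image.
spark-submit \
  --master k8s://https://my-api-server:6443 \
  --deploy-mode cluster \
  --files /opt/conf/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  local:///opt/app/my-job.jar
```

If changes baked into the image are silently ignored, it is worth checking whether another log4j.properties earlier on the classpath is shadowing it; passing the file explicitly sidesteps that question.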

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2018-09-07 Thread Harel Gliksman
I understand the error is because the number of partitions is very high, yet when processing 40 TB (and this number is expected to grow) this number seems reasonable: 40 TB / 300,000 results in a partition size of ~130 MB (data should be evenly distributed). On Fri, Sep 7, 2018 at 6:28 PM Vadim
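The per-partition figure quoted above can be checked with simple arithmetic (decimal units assumed, as in the thread):

```python
# Back-of-envelope check of the partition size quoted in the thread.
total_bytes = 40 * 10**12      # ~40 TB of shuffle data
num_partitions = 300_000

partition_mb = total_bytes / num_partitions / 10**6
print(round(partition_mb))     # ~133 MB per partition, assuming even distribution
```

So the partition size itself is indeed reasonable; as the reply below points out, the problem is on the driver side, not in the partition sizing.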

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2018-09-07 Thread Vadim Semenov
You have too many partitions, so when the driver tries to gather the status of all map outputs and send them back to the executors, it chokes on the size of the structure that needs to be GZipped, and since it's bigger than 2 GiB, it produces an OOM. On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman wrote: >
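A rough sketch of why the structure gets so large (the per-entry cost is an assumption here, not from the thread: Spark's HighlyCompressedMapStatus keeps on the order of one bit per reduce partition per map task, and the real overhead varies by Spark version):

```python
# Rough sketch of the driver-side map-status size for this job.
# Assumption: ~1 bit per (map task, reduce partition) pair, which is the
# approximate cost of the bitmap inside HighlyCompressedMapStatus.
num_map_tasks = 300_000          # assuming ~1 map task per partition
num_reduce_partitions = 300_000

approx_bytes = num_map_tasks * num_reduce_partitions // 8   # bits -> bytes
approx_gib = approx_bytes / 2**30
print(round(approx_gib, 1))      # ~10.5 GiB before compression
```

Since a single Java byte array cannot exceed 2 GiB, a structure this size fails during serialization regardless of how well it compresses, which matches the OOM described above.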

Re: Spark job's driver program consumes too much memory

2018-09-07 Thread James Starks
Yes, I think I was confused, because originally my thought was that if the executor only requires 10g, then the driver should ideally not need to consume more than 10g, or at least not more than 20g. But this is not the case. My configuration sets --driver-memory to 25g and --executor-memory to 10g. And my

Re: Spark job's driver program consumes too much memory

2018-09-07 Thread Apostolos N. Papadopoulos
You are putting it all together, and this does not make sense. Writing data to HDFS does not require that all data be transferred back to the driver and THEN saved to HDFS. That would be a disaster and would never scale. I suggest checking the documentation more carefully because I

Re: Spark job's driver program consumes too much memory

2018-09-07 Thread James Starks
Is df.write.mode(...).parquet("hdfs://..") also an action? Checking the docs shows that my Spark job doesn't use those action functions. But the save functions look similar to df.write.mode(overwrite).parquet("hdfs://path/to/parquet-file"), which my Spark job uses. Therefore I

Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2018-09-07 Thread Harel Gliksman
Hi, We are running a Spark (2.3.1) job on an EMR cluster with 500 r3.2xlarge nodes (60 GB RAM, 8 vcores, 160 GB SSD). Driver memory is set to 25 GB. It processes ~40 TB of data using aggregateByKey, in which we specify numPartitions = 300,000. Map-side tasks succeed, but reduce-side tasks all fail. We

Re: Spark job's driver program consumes too much memory

2018-09-07 Thread Apostolos N. Papadopoulos
Dear James, - check the Spark documentation to see the actions that return a lot of data back to the driver. One of these actions is collect(); take(x) and reduce() are also actions. Before executing collect(), find out the size of your RDD/DF. - I cannot

Spark job's driver program consumes too much memory

2018-09-07 Thread James Starks
I have a Spark job that reads data from a database. By increasing the submit parameter '--driver-memory 25g' the job works without a problem locally, but not in the prod env, because the prod master does not have enough capacity. So I have a few questions: - What functions, such as collect(), would cause the
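For reference, the memory settings discussed in this thread can be combined in one spark-submit invocation (a sketch; the master URL, class name, and jar are placeholders, not from the original post). The driver-memory figure only matters for data pulled back to the driver by actions such as collect() or take(n); distributed writes like df.write.parquet(...) do not go through the driver.

```shell
# Config sketch using the memory values mentioned in the thread.
# Master URL, class, and jar are illustrative placeholders.
spark-submit \
  --master spark://prod-master:7077 \
  --driver-memory 25g \
  --executor-memory 10g \
  --class com.example.MyJob \
  my-job.jar
```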

Re: [External Sender] How to debug Spark job

2018-09-07 Thread James Starks
Got the root cause eventually as it throws java.lang.OutOfMemoryError: Java heap space. Increasing --driver-memory temporarily fixes the problem. Thanks. ‐‐‐ Original Message ‐‐‐ On 7 September 2018 12:32 PM, Femi Anthony wrote: > One way I would go about this would be to try running

Re: [External Sender] How to debug Spark job

2018-09-07 Thread Femi Anthony
One way I would go about this would be to try running newdf.show(numRows, truncate=False) on a few rows before you try writing to parquet, to force computation of newdf and see whether the hanging occurs at that point or during the write. You may also try doing a newdf.count() as well.

Unsubscribe

2018-09-07 Thread Fiona Chow
Hey all, I would like to unsubscribe from this mailing list. Thank you.

Unsubscribe

2018-09-07 Thread Bibudh Lahiri
Unsubscribe

Re: [External Sender] Re: How to make pyspark use custom python?

2018-09-07 Thread mithril
I am sure; everything is exactly as written in my first post, so this is very confusing. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

How to debug Spark job

2018-09-07 Thread James Starks
I have a Spark job that reads from a postgresql (v9.5) table and writes the result to parquet. The code flow is not complicated; basically: case class MyCaseClass(field1: String, field2: String) val df = spark.read.format("jdbc")...load() df.createOrReplaceTempView(...) val newdf =

Re: Error in show()

2018-09-07 Thread Apostolos N. Papadopoulos
Can you isolate the row that is causing the problem? I mean, start using show(31) up to show(60). Perhaps this will help you understand the problem. Regards, Apostolos. On 07/09/2018 01:11 AM, dimitris plakas wrote: Hello everyone, I am new to Pyspark and I am facing an issue. Let me