OutOfMemoryError

2021-06-30 Thread javaguy Java
Hi, I'm getting Java OOM errors even though I'm setting my driver memory to 24g and I'm executing against local[*]. I was wondering if anyone can give me any insight. The server this job runs on has more than enough memory, as does the Spark driver. The final result does write 3 CSV files
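A likely first check: in local mode the driver JVM is the application JVM, so spark.driver.memory only takes effect if it is supplied at launch (for example via spark-submit); setting it in SparkConf after the JVM has started is ignored. A minimal sketch, where the class and jar names are placeholders:

    // Launch with the memory flag so the driver JVM actually gets 24g:
    //   spark-submit --master "local[*]" --driver-memory 24g \
    //     --class example.MyJob my-job.jar
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("MyJob")
      .getOrCreate()

    // Confirm what the driver actually received:
    println(spark.conf.get("spark.driver.memory", "<unset>"))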

Re: Structuring a PySpark Application

2021-06-30 Thread Gourav Sengupta
Hi, I think that reading Matei Zaharia's book "Spark: The Definitive Guide" will be a good starting point. Regards, Gourav Sengupta On Wed, Jun 30, 2021 at 3:47 PM Kartik Ohri wrote: > Hi all! > > I am working on a PySpark application and would like suggestions on how it > should be

Re: Spark Null Pointer Exception

2021-06-30 Thread Russell Spitzer
Could also be a transient object being referenced from within the custom code. When serialized, the reference shows up as null even though you had set it in the parent object. > On Jun 30, 2021, at 4:44 PM, Sean Owen wrote: > > The error is in your code, which you don't show. You are almost
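A minimal sketch of the failure mode Russell describes, with hypothetical class and field names:

    class Job(@transient val spark: org.apache.spark.sql.SparkSession)
        extends Serializable {

      // Marked transient, so it is NOT serialized with the closure;
      // the executor's deserialized copy of this object has it as null.
      @transient val prefix = new StringBuilder("id-")

      def run(): Unit = {
        spark.sparkContext.parallelize(1 to 10).foreach { i =>
          // Referencing `prefix` captures `this`; on the executor
          // prefix == null, hence the NullPointerException.
          println(prefix.toString + i)
        }
      }
    }

A common fix is @transient lazy val, which re-initializes the field on each executor instead of shipping it from the driver.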

Re: Spark Null Pointer Exception

2021-06-30 Thread Sean Owen
The error is in your code, which you don't show. You are almost certainly incorrectly referencing something like a SparkContext in a Spark task. On Wed, Jun 30, 2021 at 3:48 PM Amit Sharma wrote: > Hi , I am using spark 2.7 version with scala. I am calling a method as > below > > 1. val
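A minimal sketch of the anti-pattern Sean means, assuming the session is captured by the task closure (names are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]").appName("npe-demo").getOrCreate()

    spark.sparkContext.parallelize(1 to 10).foreach { i =>
      // Wrong: the session/context lives on the driver. SparkSession
      // serializes, but its internals are transient, so calls like this
      // hit null fields on the executor and throw NullPointerException.
      spark.range(i).count()   // don't do this inside a task
    }

The fix is to keep all SparkSession/SparkContext usage on the driver and pass only plain data into tasks.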

Spark Null Pointer Exception

2021-06-30 Thread Amit Sharma
Hi, I am using spark 2.7 version with Scala. I am calling a method as below:
1. val rddBacklog = spark.sparkContext.parallelize(MAs) // MAs is a list of, say, cities
2. rddBacklog.foreach(ma => doAlloc3Daily(ma, fteReview.forecastId, startYear, endYear))
3. doAlloc3Daily method just doing a database
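The message is cut off above, but with per-element database work the usual serialization-safe shape is foreachPartition, creating the connection on the executor; a sketch only, where jdbcUrl and the loop body are placeholders:

    rddBacklog.foreachPartition { cities =>
      // Open driver-unserializable resources on the executor, once per
      // partition, instead of capturing them from the driver.
      val conn = java.sql.DriverManager.getConnection(jdbcUrl)
      try {
        cities.foreach { ma =>
          // per-city database work for doAlloc3Daily would go here
        }
      } finally {
        conn.close()
      }
    }

Also note that capturing fteReview.forecastId in a closure drags the whole fteReview object into the task; copying it to a local val first (val forecastId = fteReview.forecastId) avoids serializing the parent object.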

Re: Structuring a PySpark Application

2021-06-30 Thread Kartik Ohri
Hi Mich! We use this in production, but indeed there is much scope for improvement, configuration being one area :). Yes, we have a private on-premises cluster. We run Spark on YARN (no Airflow etc.), which controls the scheduling, and use HDFS as a datastore. Regards On Wed, Jun 30, 2021 at

Re: Structuring a PySpark Application

2021-06-30 Thread Mich Talebzadeh
Thanks for the details, Kartik. Let me go through these. The code itself and indentation look good. One minor thing I noticed is that you are not using a YAML file (config.yml) for your variables; you seem to embed them in your config.py code. That is what I used to do before :) a friend

Re: Structuring a PySpark Application

2021-06-30 Thread Kartik Ohri
Hi Mich! Thanks for the reply. The zip file contains all of the Spark-related code, particularly the contents of this folder. The requirements_spark.txt

Re: Structuring a PySpark Application

2021-06-30 Thread Mich Talebzadeh
Hi Kartik, Can you explain how you create your zip file? Does that include everything in your top project directory as per PyCharm etc.? The rest looks OK as you are creating a Python virtual env:
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
How do you create that

RE: Inclusive terminology usage in Spark

2021-06-30 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Sean, Thanks for the quick response. We’ll look into this. Thanks and Regards, Abhishek From: Sean Owen Sent: Wednesday, June 30, 2021 6:30 PM To: Rao, Abhishek (Nokia - IN/Bangalore) Cc: User Subject: Re: Inclusive terminology usage in Spark This was covered and mostly done last year:

Structuring a PySpark Application

2021-06-30 Thread Kartik Ohri
Hi all! I am working on a PySpark application and would like suggestions on how it should be structured. We have a number of possible jobs, organized in modules. There is also a RequestConsumer

Re: Inclusive terminology usage in Spark

2021-06-30 Thread Sean Owen
This was covered and mostly done last year: https://issues.apache.org/jira/browse/SPARK-32004 In some instances it's hard to change the terminology, as it would break user APIs and the marginal benefit may not be worth it, but have a look at the remaining tasks under that umbrella. On Wed, Jun

Inclusive terminology usage in Spark

2021-06-30 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi, Terms such as Blacklist/Whitelist and master/slave are used in different places in the Spark code. Wanted to know if there are any plans to move to more inclusive terminology, e.g. Denylist/Allowlist and Leader/Follower? If so, what is the timeline? I've also created an improvement