Re: Spark Push-Based Shuffle causing multiple stage failures

2022-05-25 Thread Han Altae-Tran
Hi Ye, this is super helpful! It wasn't obvious to me from the documentation that this property needed to be set in the yarn-site.xml file, as all the other configurations on the main Spark configuration page are set through the Spark conf. It was particularly confusing because this property, like
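For anyone else who hits this: a minimal sketch of what the server-side entry looks like in yarn-site.xml, assuming the property and value from the Spark push-based shuffle documentation (check them against your Spark version):

```xml
<!-- yarn-site.xml on every NodeManager that runs the Spark external
     shuffle service; the client side still sets spark.shuffle.push.enabled
     through the usual Spark conf. -->
<property>
  <name>spark.shuffle.push.server.mergedShuffleFileManagerImpl</name>
  <value>org.apache.spark.network.shuffle.RemoteBlockPushResolver</value>
</property>
```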

Re: Complexity with the data

2022-05-25 Thread Bjørn Jørgensen
Sid, dump one of your files. https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/ On Wed, 25 May 2022, 23:04, Sid wrote: > I have 10 columns with me but in the dataset, I observed that some records > have 11 columns of data (for the additional column it is marked as null).
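For a quick dump inside Spark itself, a minimal sketch (the path is a placeholder): read the file as plain text so no CSV parsing is applied and any stray commas are visible as-is.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the file as unparsed lines so the raw records, including any
# embedded commas, can be inspected directly.
raw = spark.read.text("path/to/file.csv")  # placeholder path
raw.show(5, truncate=False)
```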

Re: Complexity with the data

2022-05-25 Thread Sid
I have 10 columns with me, but in the dataset I observed that some records have 11 columns of data (for the additional column it is marked as null). But how do I handle this? Thanks, Sid On Thu, May 26, 2022 at 2:22 AM Sid wrote: > How can I do that? Any examples or links, please. So, this
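One way to handle the 10-vs-11-column mix is to pad every record to the widest arity before building a DataFrame. A minimal sketch, assuming all-string columns with placeholder names (note the caveat in the comment):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

NUM_COLS = 11  # the 10 expected columns plus the occasional extra one
schema = StructType(
    [StructField(f"c{i}", StringType(), True) for i in range(NUM_COLS)]
)

# Split each line and pad short rows with None so 10- and 11-column
# records share one schema. Caveat: a plain split breaks if a field
# itself contains commas -- that is the quoting problem discussed
# elsewhere in this thread.
rows = (
    spark.sparkContext.textFile("path/to/file.csv")  # placeholder path
    .map(lambda line: line.split(","))
    .map(lambda f: f[:NUM_COLS] + [None] * (NUM_COLS - len(f)))
)
df = spark.createDataFrame(rows, schema)
```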

Re: Complexity with the data

2022-05-25 Thread Sid
How can I do that? Any examples or links, please. So, this works well with pandas, I suppose. It's just that I need to convert back to a Spark DataFrame by providing a schema, but since we are using a lower Spark version and pandas won't work in a distributed way in the lower versions, therefore,
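The pandas-to-Spark step itself is a one-liner once a schema is in hand. A minimal sketch with hypothetical column names; as Sid notes, the pandas parse runs only on the driver, so this is not a distributed solution:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical three-column schema; substitute the real layout.
schema = StructType([
    StructField("date", StringType(), True),
    StructField("code", StringType(), True),
    StructField("amount", StringType(), True),
])

# pandas parses the file on the driver only -- exactly the
# non-distributed step being discussed above.
pdf = pd.read_csv("path/to/file.csv", dtype=str)  # placeholder path
sdf = spark.createDataFrame(pdf, schema=schema)
```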

Re: Complexity with the data

2022-05-25 Thread Gavin Ray
Forgot to reply-all on the last message, whoops. Not very good at email. You need to normalize the CSV with a parser that can escape commas inside of strings. Not sure if Spark has an option for this? On Wed, May 25, 2022 at 4:37 PM Sid wrote: > Thank you so much for your time. > > I have data like
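As a pre-processing step outside Spark, Python's standard csv module already handles quoted commas; a minimal single-machine sketch (file names are placeholders), suitable only for files that fit on one node:

```python
import csv

# csv.reader honours quoting, so "a, b" stays one field; re-writing
# with QUOTE_ALL gives Spark unambiguous input afterwards.
with open("input.csv", newline="") as src, \
        open("normalized.csv", "w", newline="") as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    for row in csv.reader(src):
        writer.writerow(row)
```

As for whether Spark has an option for this: its CSV reader does have quote/escape handling (see the option sketch further down), but only for fields that are actually quoted in the input.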

Re: Complexity with the data

2022-05-25 Thread Sid
Thank you so much for your time. I have data like below, which I tried to load by setting multiple options while reading the file; however, I am not able to consolidate the 9th column data within itself. [image: image.png] I tried the below code: df = spark.read.option("header",
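Sid's snippet is cut off above; for reference, a sketch of the option stack usually tried for this kind of problem (not his exact code, and it only helps if the free-text column is actually quoted in the file):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", True)
    .option("quote", '"')       # character wrapping fields that contain commas
    .option("escape", '"')      # escape for quote characters inside a field
    .option("multiLine", True)  # allow quoted fields to span line breaks
    .csv("path/to/file.csv")    # placeholder path
)
```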

Re: Complexity with the data

2022-05-25 Thread Apostolos N. Papadopoulos
Dear Sid, can you please give us more info? Is it true that every line may have a different number of columns? Is there any rule followed by every line of the file? From the information you have sent I cannot fully understand the "schema" of your data. Regards, Apostolos On 25/5/22

Complexity with the data

2022-05-25 Thread Sid
Hi Experts, I have the below CSV data that is getting generated automatically. I can't change the data manually. The data looks like this:

2020-12-12,abc,2000,,INR,
2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
2020-12-09,fgh,,software_developer,I only manage the development part.

Since I
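A common first step for a file like this is to declare the widest layout as an explicit all-string schema: in the CSV reader's default PERMISSIVE mode, records with fewer tokens than the schema come back null-padded rather than failing the read. A minimal sketch with assumed column names; note it does nothing about unquoted commas inside the free-text column, which is what the rest of this thread turns on:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Assumed names for the six fields visible in the sample rows.
cols = ["date", "code", "amount", "description", "currency", "note"]
schema = StructType([StructField(c, StringType(), True) for c in cols])

# With an explicit schema, shorter records are null-padded
# instead of rejected.
df = spark.read.schema(schema).csv("path/to/file.csv")  # placeholder
df.show(truncate=False)
```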

[SPARK SQL] Spark Thrift server, It is not releasing memory.

2022-05-25 Thread Ramakrishna Chilaka
Hi, after some heavy usage the Spark Thrift Server is not releasing memory, so I enabled GC logging. Here is the output from the GC log:
```
2022-05-25T06:34:50.577+: 581403.163: [Full GC (System.gc()) 8322M->8224M(26312M), 10.8605086 secs]
[Eden: 104.0M(14784.0M)->0.0B(14784.0M) Survivors:
```
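The `Full GC (System.gc())` entries mean something is calling System.gc() explicitly (RMI and direct-buffer cleanup are common culprits), and each call forces a stop-the-world full collection. A sketch of one common mitigation for G1, set through the driver options since the Thrift server runs in the driver JVM (verify the flags against your JVM version):

```
# spark-defaults.conf
# -XX:+ExplicitGCInvokesConcurrent makes explicit System.gc() calls run
# as concurrent G1 cycles instead of stop-the-world full GCs.
spark.driver.extraJavaOptions -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent
```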

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-25 Thread Ranadip Chatterjee
To me, it seems like the data being processed on the two systems is not identical. I can't think of any other reason why the single-task stage would get a different number of input records in the two cases. 700 GB of input to a single task is not good, and seems to be the bottleneck. On Wed, 25 May 2022,
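If one task really is pulling 700 GB, the usual fixes are key salting or, on Spark 3.x, adaptive skew handling; a minimal sketch of the latter (config names per the Spark docs; whether it applies depends on the job being a SQL/DataFrame join):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive query execution can detect and split a skewed partition at
# join time (Spark 3.x); for RDD jobs, salt the key instead.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```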