Re: Complexity with the data

2022-05-26 Thread Sid
Hi Gourav, Please find the below link for a detailed understanding: https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark/72391090#72391090 @Bjørn Jørgensen: I was able to read this kind of data using the below code:
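
A minimal sketch of the kind of read that answer describes; the option values and the file name are assumptions, since the archive truncates the actual snippet:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # multiLine lets a quoted field span several lines; escape='"' treats a
    # doubled quote ("") inside a quoted field as a literal quote character
    df = (
        spark.read
        .option("header", True)
        .option("multiLine", True)
        .option("quote", '"')
        .option("escape", '"')
        .csv("complex_data.csv")  # hypothetical file name
    )
    df.show(truncate=False)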

Re: Complexity with the data

2022-05-26 Thread Gourav Sengupta
Hi, can you please give us a simple map of what the input is and what the output should look like? From your description it is a bit difficult to figure out exactly how you want the records parsed. Regards, Gourav Sengupta On Wed, May 25, 2022 at 9:08 PM Sid wrote: >

Re: Complexity with the data

2022-05-26 Thread Bjørn Jørgensen
Yes, but how do you read it with Spark? On Thu, 26 May 2022, 18:30, Sid wrote: > I am not reading it through pandas. I am using Spark because when I tried > to use the pandas API that comes under pyspark.pandas, it gave me an > error. > > On Thu, May 26, 2022 at 9:52 PM Bjørn Jørgensen > wrote:

Re: Complexity with the data

2022-05-26 Thread Sid
I am not reading it through pandas. I am using Spark because when I tried to use the pandas API that comes under pyspark.pandas, it gave me an error. On Thu, May 26, 2022 at 9:52 PM Bjørn Jørgensen wrote: > ok, but how do you read it now? > > >

Re: Complexity with the data

2022-05-26 Thread Bjørn Jørgensen
OK, but how do you read it now? https://github.com/apache/spark/blob/8f610d1b4ce532705c528f3c085b0289b2b17a94/python/pyspark/pandas/namespace.py#L216 probably has to be updated with the default options, so that the pandas API on Spark behaves like pandas. On Thu, 26 May 2022 at 17:38,
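
Until those defaults are aligned, the relevant reader options can be passed explicitly; a sketch, assuming a Spark build where pyspark.pandas is available (the file name is a placeholder):

    import pyspark.pandas as ps

    # quotechar/escapechar mirror the pandas parameters of the same name;
    # extra keyword options are forwarded to Spark's underlying CSV reader
    psdf = ps.read_csv(
        "complex_data.csv",
        quotechar='"',
        escapechar='"',
    )
    print(psdf.head())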

Re: Complexity with the data

2022-05-26 Thread Sid
I was passing the wrong escape characters, which is why I was facing the issue. I have updated the user's answer on my post. Now I am able to load the dataset. Thank you, everyone, for your time and help! Much appreciated. I have more datasets like this. I hope that would be resolved using this
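
The fix described here presumably comes down to one reader option; a hedged before/after contrast, with the exact characters being assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # wrong: a backslash escape while the file actually doubles its quotes
    df_bad = (spark.read.option("header", True)
              .option("quote", '"').option("escape", "\\")
              .csv("complex_data.csv"))

    # right: the escape character matches how quotes are embedded in the file
    df_ok = (spark.read.option("header", True)
             .option("quote", '"').option("escape", '"')
             .csv("complex_data.csv"))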

Re: Complexity with the data

2022-05-26 Thread Apostolos N. Papadopoulos
Since you cannot create the DF directly, you may try to first create an RDD of tuples from the file and then convert the RDD to a DF using the toDF() transformation. Perhaps you can bypass the issue with this. Another thing that I have seen in the example is that you are using "" as an
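
A minimal sketch of that RDD-then-toDF() route; the quote-aware split, the padding, and the column names are assumptions for illustration (header-row handling is omitted):

    import csv
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def parse_line(line):
        fields = next(csv.reader([line]))   # quote-aware split of one line
        fields = (fields + [""] * 10)[:10]  # pad/trim to the 10 expected columns
        return tuple(fields)

    rdd = spark.sparkContext.textFile("complex_data.csv").map(parse_line)
    df = rdd.toDF(["c%d" % i for i in range(10)])  # placeholder column names
    df.show()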

Re: Complexity with the data

2022-05-26 Thread Sid
Thanks for opening the issue, Bjørn. However, could you help me address the problem for now with some kind of alternative? I have actually been stuck on this since yesterday. Thanks, Sid On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, wrote: > Yes, it looks like a bug that we also have in pandas API

Re: Complexity with the data

2022-05-26 Thread Bjørn Jørgensen
Yes, it looks like a bug that we also have in the pandas API on Spark, so I have opened a JIRA for this. On Thu, 26 May 2022 at 11:09, Sid wrote: > Hello Everyone, > > I have finally posted a question with the dataset and the column names. > > PFB

Re: Complexity with the data

2022-05-26 Thread Sid
Hello Everyone, I have finally posted a question with the dataset and the column names. PFB link: https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark Thanks, Sid On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen wrote: > Sid, dump one of your files. > >

Re: Complexity with the data

2022-05-25 Thread Bjørn Jørgensen
Sid, dump one of your files. https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/ On Wed, 25 May 2022, 23:04, Sid wrote: > I have 10 columns, but in the dataset I observed that some records > have 11 columns of data (for the additional column it is marked as null).

Re: Complexity with the data

2022-05-25 Thread Sid
I have 10 columns, but in the dataset I observed that some records have 11 columns of data (for the additional column it is marked as null). But how do I handle this? Thanks, Sid On Thu, May 26, 2022 at 2:22 AM Sid wrote: > How can I do that? Any examples or links, please. So, this
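
One hedged way to absorb the occasional 11th column is to declare an explicit schema with one extra trailing field, so 10-column rows simply get a null there (the names are placeholders):

    from pyspark.sql import SparkSession, types as T

    spark = SparkSession.builder.getOrCreate()

    schema = T.StructType(
        [T.StructField("c%d" % i, T.StringType(), True) for i in range(10)]
        + [T.StructField("extra", T.StringType(), True)]
    )

    df = (
        spark.read
        .option("header", True)
        .option("mode", "PERMISSIVE")  # pad short rows with null, don't fail
        .schema(schema)
        .csv("complex_data.csv")
    )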

Re: Complexity with the data

2022-05-25 Thread Sid
How can I do that? Any examples or links, please. So, this works well with pandas, I suppose. It's just that I need to convert back to a Spark DataFrame by providing a schema, but we are using a lower Spark version, and pandas won't work in a distributed way in the lower versions; therefore,
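
The round trip being described might look like the sketch below; note that plain pandas runs entirely on the driver, which is exactly the distribution concern raised here (the schema and file name are placeholders):

    import pandas as pd
    from pyspark.sql import SparkSession, types as T

    spark = SparkSession.builder.getOrCreate()

    pdf = pd.read_csv("complex_data.csv", dtype=str)  # driver-side only
    pdf = pdf.where(pdf.notna(), None)                # NaN -> None for Spark

    schema = T.StructType(
        [T.StructField(c, T.StringType(), True) for c in pdf.columns]
    )
    sdf = spark.createDataFrame(pdf, schema=schema)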

Re: Complexity with the data

2022-05-25 Thread Gavin Ray
Forgot to reply-all on the last message, whoops. Not very good at email. You need to normalize the CSV with a parser that can escape commas inside of strings. Not sure if Spark has an option for this? On Wed, May 25, 2022 at 4:37 PM Sid wrote: > Thank you so much for your time. > > I have data like
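
Spark's CSV reader does have quote/escape options for this (as sketched above); alternatively, the normalization Gavin suggests can be done up front with Python's csv module, assuming the embedded commas are at least quoted in the source file (file names are hypothetical):

    import csv

    # rewrite the file so that every field comes out consistently quoted
    with open("raw_data.csv", newline="") as src, \
         open("normalized.csv", "w", newline="") as dst:
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for row in csv.reader(src):
            writer.writerow(row)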

Re: Complexity with the data

2022-05-25 Thread Sid
Thank you so much for your time. I have data like below, which I tried to load by setting multiple options while reading the file, but I am not able to consolidate the 9th column's data within itself. [inline image of the data] I tried the below code: df = spark.read.option("header",

Re: Complexity with the data

2022-05-25 Thread Apostolos N. Papadopoulos
Dear Sid, can you please give us more info? Is it true that every line may have a different number of columns? Is there any rule followed by every line of the file? From the information you have sent, I cannot fully understand the "schema" of your data. Regards, Apostolos On 25/5/22

Complexity with the data

2022-05-25 Thread Sid
Hi Experts, I have the below CSV data that is getting generated automatically; I can't change the data manually. The data looks like below:

2020-12-12,abc,2000,,INR,
2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
2020-12-09,fgh,,software_developer,I only manage the development part. Since I
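
A tiny reproduction of rows like the sample above (a sketch; it assumes an active SparkSession and uses PySpark's support for reading CSV from an RDD of strings):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    rows = [
        "2020-12-12,abc,2000,,INR,",
        "2020-12-09,cde,3000,he is a manager,DOLLARS,nothing",
        "2020-12-09,fgh,,software_developer,I only manage the development part.",
    ]
    df = spark.read.csv(spark.sparkContext.parallelize(rows))
    # the column count is taken from the first row; shorter rows are padded
    # with null under the default PERMISSIVE mode
    df.show(truncate=False)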