I was passing the wrong escape character, which was causing the issue. I
have updated my post with the user's answer. Now I am able to load the
dataset.

Thank you everyone for your time and help!

Much appreciated.

I have more datasets like this. I hope they can be resolved using the same
approach :) Fingers crossed.

Thanks,
Sid

On Thu, May 26, 2022 at 8:43 PM Apostolos N. Papadopoulos <
papad...@csd.auth.gr> wrote:

> Since you cannot create the DF directly, you may try to first create an
> RDD of tuples from the file
>
> and then convert the RDD to a DF by using the toDF() transformation.
>
> Perhaps you may bypass the issue with this.
>
> Another thing that I have seen in the example is that you are using "" as
> an escape character.
>
> Can you check if this may cause any issues?
>
> Regards,
>
> Apostolos
>
>
>
> On 26/5/22 16:31, Sid wrote:
>
> Thanks for opening the issue, Bjorn. However, could you help me address
> the problem for now with some kind of alternative?
>
> I have actually been stuck on this since yesterday.
>
> Thanks,
> Sid
>
> On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, <bjornjorgen...@gmail.com>
> wrote:
>
>> Yes, it looks like a bug that we also have in pandas API on spark.
>>
>> So I have opened a JIRA
>> <https://issues.apache.org/jira/browse/SPARK-39304> for this.
>>
>> tor. 26. mai 2022 kl. 11:09 skrev Sid <flinkbyhe...@gmail.com>:
>>
>>> Hello Everyone,
>>>
>>> I have posted a question finally with the dataset and the column names.
>>>
>>> PFB link:
>>>
>>>
>>> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <
>>> bjornjorgen...@gmail.com> wrote:
>>>
>>>> Sid, dump one of yours files.
>>>>
>>>>
>>>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>>>
>>>>
>>>>
>>>> ons. 25. mai 2022, 23:04 skrev Sid <flinkbyhe...@gmail.com>:
>>>>
>>>>> I have 10 columns with me but in the dataset, I observed that some
>>>>> records have 11 columns of data(for the additional column it is marked as
>>>>> null). But, how do I handle this?
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> How can I do that? Any examples or links, please. This works well
>>>>>> with pandas, I suppose, but I would need to convert back to a Spark
>>>>>> data frame by providing a schema. Since we are using a lower Spark
>>>>>> version, where pandas won't work in a distributed way, I was
>>>>>> wondering whether Spark could handle this in a better way.
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>>
>>>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Forgot to reply-all last message, whoops. Not very good at email.
>>>>>>>
>>>>>>> You need to normalize the CSV with a parser that can handle escaped
>>>>>>> commas inside of strings. I am not sure if Spark has an option for
>>>>>>> this?
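The normalization step suggested here can be done with Python's stdlib csv module, which respects quoted commas and embedded newlines; a minimal sketch (the sample data is a stand-in for the real file):

```python
import csv
import io

# Stand-in for the problematic file: a quoted field containing a comma
# would also be handled, and the quoted field below spans several lines.
raw = '''date,name,amount,comment
2020-12-09,fgh,,"I only manage the development part.

It is handled by the other people."
'''

# csv.reader returns the multi-line quoted field as a single value;
# rewrite each record with the embedded newlines replaced by spaces,
# producing a file Spark can read one record per line.
out = io.StringIO()
writer = csv.writer(out)
for row in csv.reader(io.StringIO(raw)):
    writer.writerow([field.replace("\n", " ") for field in row])

print(out.getvalue())
```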
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you so much for your time.
>>>>>>>>
>>>>>>>> I have data like below, which I tried to load by setting multiple
>>>>>>>> options while reading the file. However, I am not able to keep the
>>>>>>>> 9th column's data consolidated within a single field.
>>>>>>>>
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> I tried the below code:
>>>>>>>>
>>>>>>>> df = (spark.read
>>>>>>>>       .option("header", "true")
>>>>>>>>       .option("multiline", "true")
>>>>>>>>       .option("inferSchema", "true")
>>>>>>>>       .option("quote", '"')
>>>>>>>>       .option("delimiter", ",")
>>>>>>>>       .csv("path"))
>>>>>>>>
>>>>>>>> What else can I do?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sid
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>>>>>>> papad...@csd.auth.gr> wrote:
>>>>>>>>
>>>>>>>>> Dear Sid,
>>>>>>>>>
>>>>>>>>> can you please give us more info? Is it true that every line may
>>>>>>>>> have a different number of columns? Is there any rule followed by
>>>>>>>>> every line of the file? From the information you have sent I
>>>>>>>>> cannot fully understand the "schema" of your data.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Apostolos
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>>>> > Hi Experts,
>>>>>>>>> >
>>>>>>>>> > I have the below CSV data, which is generated automatically. I
>>>>>>>>> > can't change the data manually.
>>>>>>>>> >
>>>>>>>>> > The data looks like below:
>>>>>>>>> >
>>>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development
>>>>>>>>> part.
>>>>>>>>> >
>>>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>>>> >
>>>>>>>>> > It is handled by the other people.,INR
>>>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>>>> >
>>>>>>>>> > The third record is the problem, since the user inserted new
>>>>>>>>> > lines in the value while filling up the form. So, how do I
>>>>>>>>> > handle this?
>>>>>>>>> >
>>>>>>>>> > There are 6 columns and 4 records in total. These are the sample
>>>>>>>>> records.
>>>>>>>>> >
>>>>>>>>> > Should I load it as an RDD and then maybe use a regex to
>>>>>>>>> > eliminate the new lines? Or how should it be done? With ". /n"?
>>>>>>>>> >
>>>>>>>>> > Any suggestions?
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > Sid
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>>>> Department of Informatics
>>>>>>>>> Aristotle University of Thessaloniki
>>>>>>>>> Thessaloniki, GREECE
>>>>>>>>> tel: ++0030312310991918
>>>>>>>>> email: papad...@csd.auth.gr
>>>>>>>>> twitter: @papadopoulos_ap
>>>>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papad...@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol
>
>
