Yes, it looks like a bug that we also have in the pandas API on Spark, so I have opened a JIRA for it: <https://issues.apache.org/jira/browse/SPARK-39304>.
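For the multiline-CSV part of the thread below: any RFC 4180-style parser handles commas and newlines inside fields, as long as those fields are quoted in the source file. A minimal stdlib sketch (the rows are adapted from the sample in the thread; the quoting around the third record's long field is an assumption, since the real file apparently lacks it):

```python
import csv
import io

# Sample data mirroring the thread: the third record's multiline field
# is wrapped in quotes, so embedded commas and newlines stay inside it.
raw = (
    '2020-12-12,abc,2000,,INR,\n'
    '2020-12-09,cde,3000,he is a manager,DOLLARS,nothing\n'
    '2020-12-09,fgh,,software_developer,"I only manage the development part.\n'
    "Since I don't have much experience with the other domains,\n"
    'it is handled by the other people.",INR\n'
    '2020-12-12,abc,2000,,USD,\n'
)

rows = list(csv.reader(io.StringIO(raw)))
assert len(rows) == 4                   # four logical records, despite six physical lines
assert all(len(r) == 6 for r in rows)   # every record has exactly six columns
```

With Spark itself, the equivalent is the multiLine option together with quote/escape, as in the code Sid posted, but that only helps if the producer of the file actually quotes the multiline fields.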
On Thu, 26 May 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
> Hello Everyone,
>
> I have posted a question finally with the dataset and the column names.
>
> PFB link:
>
> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>
> Thanks,
> Sid
>
> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
>> Sid, dump one of your files.
>>
>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>
>> On Wed, 25 May 2022 at 23:04, Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> I have 10 columns, but in the dataset I observed that some records
>>> have 11 columns of data (for the additional column it is marked as
>>> null). How do I handle this?
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> How can I do that? Any examples or links, please. This works well
>>>> with pandas, I suppose. It's just that I need to convert back to a
>>>> Spark data frame by providing a schema, but since we are on a lower
>>>> Spark version, where pandas won't work in a distributed way, I was
>>>> wondering if Spark could handle this in a much better way.
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>>>>
>>>>> Forgot to reply-all on the last message, whoops. Not very good at email.
>>>>>
>>>>> You need to normalize the CSV with a parser that can escape commas
>>>>> inside of strings.
>>>>> Not sure if Spark has an option for this?
>>>>>
>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> Thank you so much for your time.
>>>>>>
>>>>>> I have data like below, which I tried to load by setting multiple
>>>>>> options while reading the file, but I am not able to consolidate
>>>>>> the 9th column data within itself.
>>>>>>
>>>>>> [image: image.png]
>>>>>>
>>>>>> I tried the below code:
>>>>>>
>>>>>> df = spark.read.option("header", "true") \
>>>>>>     .option("multiline", "true") \
>>>>>>     .option("inferSchema", "true") \
>>>>>>     .option("quote", '"') \
>>>>>>     .option("delimiter", ",") \
>>>>>>     .csv("path")
>>>>>>
>>>>>> What else can I do?
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>>
>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:
>>>>>>
>>>>>>> Dear Sid,
>>>>>>>
>>>>>>> can you please give us more info? Is it true that every line may
>>>>>>> have a different number of columns? Is there any rule followed by
>>>>>>> every line of the file? From the information you have sent I cannot
>>>>>>> fully understand the "schema" of your data.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Apostolos
>>>>>>>
>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>> > Hi Experts,
>>>>>>> >
>>>>>>> > I have the below CSV data that is getting generated automatically.
>>>>>>> > I can't change the data manually.
>>>>>>> >
>>>>>>> > The data looks like below:
>>>>>>> >
>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>>>> >
>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>> >
>>>>>>> > It is handled by the other people.,INR
>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>> >
>>>>>>> > The third record is the problem, since its value was split across
>>>>>>> > new lines by the user while filling up the form. So, how do I
>>>>>>> > handle this?
>>>>>>> >
>>>>>>> > There are 6 columns and 4 records in total. These are sample records.
>>>>>>> >
>>>>>>> > Should I load it as an RDD and then maybe use a regex to eliminate
>>>>>>> > the new lines? Or how should it be done? With ". \n"?
>>>>>>> >
>>>>>>> > Any suggestions?
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Sid
>>>>>>>
>>>>>>> --
>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>> Department of Informatics
>>>>>>> Aristotle University of Thessaloniki
>>>>>>> Thessaloniki, GREECE
>>>>>>> tel: ++0030312310991918
>>>>>>> email: papad...@csd.auth.gr
>>>>>>> twitter: @papadopoulos_ap
>>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297
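On the 10-vs-11-columns question earlier in the thread, one pre-processing approach is to normalize every record to the schema width before handing the file to Spark. A hedged stdlib sketch (six columns assumed for brevity; Sid's real file has ten, and the data here is made up):

```python
import csv
import io

EXPECTED_COLS = 6  # assumed schema width for this sketch

def normalize(row, width=EXPECTED_COLS):
    """Pad short rows with None and drop trailing extras so every
    record matches the expected schema width."""
    if len(row) < width:
        return row + [None] * (width - len(row))
    return row[:width]

raw = (
    'a,b,c,d,e,f\n'        # exactly six columns
    'a,b,c,d\n'            # short row: padded with None
    'a,b,c,d,e,f,extra\n'  # long row: extra column dropped
)

rows = [normalize(r) for r in csv.reader(io.StringIO(raw))]
assert all(len(r) == 6 for r in rows)
```

If the extra 11th column carries real data, pad to the widest observed row instead of truncating. On the Spark side, the CSV reader's mode option (PERMISSIVE with columnNameOfCorruptRecord, or DROPMALFORMED) is the built-in way to deal with malformed records.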