Re: Complexity with the data

Bjørn Jørgensen Wed, 25 May 2022 14:10:25 -0700

Sid, dump one of your files.

https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
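
In case it helps, here is a minimal plain-Python sketch for the sample quoted further down in this thread. It is not a Spark feature: the helper name merge_broken_records and the constant EXPECTED_COLUMNS are made up for illustration. It assumes every record has exactly 6 columns, that no field contains a comma, and that the stray line breaks never fall inside the last column. If any of those assumptions fails on the real files, this approach will not be reliable.

```python
from typing import Iterable, Iterator, List

# Assumption: the column count is fixed and known (6 in the sample from the thread).
EXPECTED_COLUMNS = 6


def merge_broken_records(lines: Iterable[str], n_cols: int = EXPECTED_COLUMNS) -> Iterator[str]:
    """Glue physical lines together until a record has n_cols comma-separated fields.

    Hypothetical helper: it only counts commas, so it is wrong for data
    where a field itself contains a comma.
    """
    buffer: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines typed inside a free-text field are dropped
        buffer.append(line)
        candidate = " ".join(buffer)  # replace each line break with a space
        if candidate.count(",") >= n_cols - 1:
            yield candidate
            buffer = []
    if buffer:
        # Incomplete trailing record; emitted as-is so it can be inspected.
        yield " ".join(buffer)


# The four sample records from the original question, including the broken third one.
sample = """2020-12-12,abc,2000,,INR,
2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
2020-12-09,fgh,,software_developer,I only manage the development part.

Since I don't have much experience with the other domains.

It is handled by the other people.,INR
2020-12-12,abc,2000,,USD,
"""

records = list(merge_broken_records(sample.splitlines()))
for record in records:
    print(record)
```

Each repaired record is a single line, so the result could be written back to a file, or, in PySpark, parallelized into an RDD of strings and passed to spark.read.csv together with an explicit schema. Fixing the export so that free-text fields are quoted would be the more robust solution, since the multiLine option can only help when the field is quoted.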




On Wed, 25 May 2022 at 23:04, Sid <flinkbyhe...@gmail.com> wrote:

> I have 10 columns, but in the dataset I observed that some records
> have 11 columns of data (the additional column is marked as null).
> How do I handle this?
>
> Thanks,
> Sid
>
> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>
>> How can I do that? Any examples or links, please? This works well
>> with pandas, I suppose. I would just need to convert back to a Spark
>> DataFrame by providing a schema. But since we are using an older Spark
>> version, and pandas won't work in a distributed way in older versions,
>> I was wondering whether Spark could handle this in a better way.
>>
>> Thanks,
>> Sid
>>
>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>>
>>> Forgot to reply-all last message, whoops. Not very good at email.
>>>
>>> You need to normalize the CSV with a parser that can escape commas
>>> inside of strings.
>>> Not sure if Spark has an option for this?
>>>
>>>
>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Thank you so much for your time.
>>>>
>>>> I have data like the sample below. I tried to load it by setting
>>>> multiple options while reading the file, but I am not able to
>>>> consolidate the 9th column data within itself.
>>>>
>>>> [image: image.png]
>>>>
>>>> I tried the below code:
>>>>
>>>> df = (spark.read.option("header", "true")
>>>>       .option("multiline", "true")
>>>>       .option("inferSchema", "true")
>>>>       .option("quote", '"')
>>>>       .option("delimiter", ",")
>>>>       .csv("path"))
>>>>
>>>> What else can I do?
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>>
>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>>> papad...@csd.auth.gr> wrote:
>>>>
>>>>> Dear Sid,
>>>>>
>>>>> can you please give us more info? Is it true that every line may have
>>>>> a different number of columns? Is there any rule followed by every
>>>>> line of the file? From the information you have sent I cannot
>>>>> fully understand the "schema" of your data.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Apostolos
>>>>>
>>>>>
>>>>> On 25/5/22 23:06, Sid wrote:
>>>>> > Hi Experts,
>>>>> >
>>>>> > I have the CSV data below, which is generated automatically. I can't
>>>>> > change the data manually.
>>>>> >
>>>>> > The data looks like below:
>>>>> >
>>>>> > 2020-12-12,abc,2000,,INR,
>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>> >
>>>>> > Since I don't have much experience with the other domains.
>>>>> >
>>>>> > It is handled by the other people.,INR
>>>>> > 2020-12-12,abc,2000,,USD,
>>>>> >
>>>>> > The third record is a problem, since the value was split across new
>>>>> > lines by the user while filling in the form. So, how do I handle this?
>>>>> >
>>>>> > There are 6 columns and 4 records in total. These are sample records.
>>>>> >
>>>>> > Should I load it as an RDD and then maybe use a regex to eliminate
>>>>> > the new lines? Or how should it be done? With ". /n" ?
>>>>> >
>>>>> > Any suggestions?
>>>>> >
>>>>> > Thanks,
>>>>> > Sid
>>>>>
>>>>> --
>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>> Department of Informatics
>>>>> Aristotle University of Thessaloniki
>>>>> Thessaloniki, GREECE
>>>>> tel: ++0030312310991918
>>>>> email: papad...@csd.auth.gr
>>>>> twitter: @papadopoulos_ap
>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>>
