Re: Complexity with the data

Sid Wed, 25 May 2022 14:04:09 -0700

I expect 10 columns, but I noticed that some records in the dataset
have 11 columns of data (the additional column is marked as null).
How do I handle this?


Thanks,
Sid
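
One way to deal with the occasional 11th field, sketched with Python's standard csv module rather than Spark: pad or trim every row to the expected width, and fail loudly if a surplus field carries real data. The width of 10 comes from the message above; the sample rows and the name normalize_row are hypothetical. In Spark, the usual counterpart would be to pass an explicit 10-field schema to the reader, but how a given Spark version treats the surplus token should be checked on a small sample first.

```python
import csv
import io

EXPECTED_COLUMNS = 10  # assumption: the intended schema width from the message above


def normalize_row(row, width=EXPECTED_COLUMNS):
    """Pad short rows with None and drop empty surplus fields.

    Raises ValueError if a surplus field holds real data, so nothing is
    discarded silently.
    """
    extra = row[width:]
    if any(value not in ("", "null", None) for value in extra):
        raise ValueError(f"unexpected data beyond column {width}: {extra}")
    return row[:width] + [None] * (width - len(row))


# Hypothetical sample: one 10-field row and one row with an 11th "null" field.
sample = "a,b,c,d,e,f,g,h,i,j\n1,2,3,4,5,6,7,8,9,10,null\n"
rows = [normalize_row(r) for r in csv.reader(io.StringIO(sample))]
```

Raising on non-empty surplus data is a deliberate choice here: an 11th field that is always null can be dropped safely, while anything else probably means a field was split by a stray delimiter.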

On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:

> How can I do that? Any examples or links, please? This works well with
> pandas, I suppose; I would just need to convert back to a Spark DataFrame
> by providing a schema. However, we are on an older Spark version, where
> pandas does not run in a distributed way, so I was wondering whether Spark
> itself could handle this better.
>
> Thanks,
> Sid
>
> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>
>> Forgot to reply-all last message, whoops. Not very good at email.
>>
>> You need to normalize the CSV with a parser that can escape commas inside
>> of strings.
>> Not sure if Spark has an option for this?
>>
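
To illustrate the point about a parser that understands quoting, here is a small sketch using Python's standard csv module. The header names and the quoted record are hypothetical: the sample posted in this thread has no quotes around the free-text field. When the producer does quote such a field, embedded commas and line breaks stay inside one column; when it does not, no quote-related reader option can recover the field boundaries.

```python
import csv
import io

# Hypothetical, properly quoted version of a record whose free-text field
# contains both a comma and a line break.
raw = (
    "date,name,amount,role,comment,currency\n"
    '2020-12-09,fgh,,software_developer,"I only manage the development part.\n'
    'Since I lack experience, others handle the rest.",INR\n'
)

# csv.reader keeps reading across physical lines until the quote is closed.
records = list(csv.reader(io.StringIO(raw)))
header, row = records
```

Here the two physical lines of the quoted record come back as one logical row with six fields.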
>>
>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Thank you so much for your time.
>>>
>>> I have data like below, which I tried to load by setting multiple options
>>> while reading the file, but I am not able to keep the 9th column's data
>>> together in a single column.
>>>
>>> [image: image.png]
>>>
>>> I tried the below code:
>>>
>>> df = (
>>>     spark.read.option("header", "true")
>>>     .option("multiline", "true")
>>>     .option("inferSchema", "true")
>>>     .option("quote", '"')
>>>     .option("delimiter", ",")
>>>     .csv("path")
>>> )
>>>
>>> What else can I do?
>>>
>>> Thanks,
>>> Sid
>>>
>>>
>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>> papad...@csd.auth.gr> wrote:
>>>
>>>> Dear Sid,
>>>>
>>>> Can you please give us more info? Is it true that every line may have a
>>>> different number of columns? Is there any rule followed by every line of
>>>> the file? From the information you have sent I cannot fully understand
>>>> the "schema" of your data.
>>>>
>>>> Regards,
>>>>
>>>> Apostolos
>>>>
>>>>
>>>> On 25/5/22 23:06, Sid wrote:
>>>> > Hi Experts,
>>>> >
>>>> > I have the CSV data below, which is generated automatically. I can't
>>>> > change the data manually.
>>>> >
>>>> > The data looks like below:
>>>> >
>>>> > 2020-12-12,abc,2000,,INR,
>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>> >
>>>> > Since I don't have much experience with the other domains.
>>>> >
>>>> > It is handled by the other people.,INR
>>>> > 2020-12-12,abc,2000,,USD,
>>>> >
>>>> > The third record is the problem: the user entered line breaks in the
>>>> > value while filling in the form. So, how do I handle this?
>>>> >
>>>> > There are 6 columns and 4 records in total. These are the sample
>>>> > records.
>>>> >
>>>> > Should I load it as an RDD and then use a regex to eliminate the
>>>> > newlines? Or how should it be done? By matching ". \n"?
>>>> >
>>>> > Any suggestions?
>>>> >
>>>> > Thanks,
>>>> > Sid
>>>>
>>>> --
>>>> Apostolos N. Papadopoulos, Associate Professor
>>>> Department of Informatics
>>>> Aristotle University of Thessaloniki
>>>> Thessaloniki, GREECE
>>>> tel: ++0030312310991918
>>>> email: papad...@csd.auth.gr
>>>> twitter: @papadopoulos_ap
>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>>
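
For the sample in the original question, where the free-text value is split across lines without any quoting, one possible repair is to treat a line as the start of a new record only when it begins with a date, and to fold every other non-blank line into the previous record. This is a minimal single-machine sketch in plain Python, not a Spark job. The date pattern, the joining with a single space, and the function names are assumptions. The column count of 6 is taken from the question.

```python
import re

# Assumption: every real record starts with an ISO date followed by a comma,
# and no continuation line of free text does.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2},")
N_COLUMNS = 6  # stated in the original question


def stitch_records(lines):
    """Merge continuation lines into the record they belong to."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # blank lines inside the free text are dropped
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line.strip()
    return records


def parse(lines):
    """Stitch, split on commas, and reject rows with the wrong field count."""
    rows = []
    for record in stitch_records(lines):
        fields = record.split(",")
        if len(fields) != N_COLUMNS:
            raise ValueError(f"expected {N_COLUMNS} fields, got {len(fields)}: {record!r}")
        rows.append(fields)
    return rows


# The sample rows from the original question.
sample_lines = [
    "2020-12-12,abc,2000,,INR,",
    "2020-12-09,cde,3000,he is a manager,DOLLARS,nothing",
    "2020-12-09,fgh,,software_developer,I only manage the development part.",
    "",
    "Since I don't have much experience with the other domains.",
    "",
    "It is handled by the other people.,INR",
    "2020-12-12,abc,2000,,USD,",
]
rows = parse(sample_lines)
```

The field-count check matters because a plain split on commas breaks as soon as the free text itself contains a comma; raising an error surfaces those rows instead of shifting their columns silently.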
