Re: Complexity with the data

Sid Wed, 25 May 2022 13:52:50 -0700

How can I do that? Any examples or links would be appreciated. I suppose
this works well with pandas, but I would then need to convert the result
back to a Spark DataFrame by providing a schema. Since we are on an older
Spark version, where pandas does not run in a distributed way, I was
wondering whether Spark itself could handle this better.


Thanks,
Sid
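
For reference, one possible way to do the normalization without pandas is
sketched below in plain Python. It is only a sketch under assumptions taken
from the sample data further down the thread: every record starts with a
yyyy-mm-dd date followed by a comma, there are 6 columns, and the free-text
value itself contains no commas. The names repair_records and to_clean_csv
are hypothetical helpers, not part of Spark or pandas.

```python
import csv
import io
import re

# Assumption: a new record always begins with an ISO date and a comma.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2},")
EXPECTED_COLUMNS = 6  # stated in the original question

SAMPLE = """\
2020-12-12,abc,2000,,INR,
2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
2020-12-09,fgh,,software_developer,I only manage the development part.

Since I don't have much experience with the other domains.

It is handled by the other people.,INR
2020-12-12,abc,2000,,USD,
"""


def repair_records(lines):
    """Glue continuation lines onto the record they belong to."""
    records = []
    for raw in lines:
        line = raw.strip("\r\n")
        if not line.strip():
            continue  # blank line inside a multi-line value
        if RECORD_START.match(line) or not records:
            records.append(line)  # start of a new record
        else:
            records[-1] += " " + line.strip()  # continuation of free text
    return records


def to_clean_csv(lines):
    """Return one properly delimited CSV line per logical record."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for record in repair_records(lines):
        fields = record.split(",")  # assumes no commas inside values
        if len(fields) != EXPECTED_COLUMNS:
            raise ValueError(f"unexpected column count in: {record!r}")
        writer.writerow(fields)
    return out.getvalue()


clean = to_clean_csv(SAMPLE.splitlines())
print(clean)
```

The embedded newlines are replaced with spaces here, so each record ends up
on a single line and the cleaned file can be read with the ordinary CSV
options. If the free text may contain commas, the naive split above is not
enough and the column position of the text would have to be used instead.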

On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:

> Forgot to reply-all last message, whoops. Not very good at email.
>
> You need to normalize the CSV with a parser that can escape commas inside
> of strings.
> Not sure if Spark has an option for this?
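
To illustrate the point about quoting, here is a small sketch using Python's
standard csv module rather than Spark. It shows that a value containing
commas or newlines parses as a single field only when it is wrapped in
quotes. The two rows are made-up variations of the third sample record, not
real data.

```python
import csv
import io

# Free-text value wrapped in quotes: commas and newlines stay inside it.
QUOTED = (
    '2020-12-09,fgh,,software_developer,"I only manage, for now,\n'
    'the development part.",INR\n'
)

# Same idea without quotes: the newline ends the record early.
UNQUOTED = (
    "2020-12-09,fgh,,software_developer,I only manage\n"
    "the development part.,INR\n"
)

quoted_rows = list(csv.reader(io.StringIO(QUOTED)))
unquoted_rows = list(csv.reader(io.StringIO(UNQUOTED)))

print(len(quoted_rows), len(quoted_rows[0]))
print(len(unquoted_rows), [len(r) for r in unquoted_rows])
```

The multiline and quote options used elsewhere in this thread rely on the
same convention, so they are unlikely to help while the generated file
leaves the free-text value unquoted.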
>
>
> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Thank you so much for your time.
>>
>> I have data like below, which I tried to load by setting multiple options
>> while reading the file. However, I am not able to keep the 9th column's
>> data together in a single field.
>>
>> [image: image.png]
>>
>> I tried the below code:
>>
>> df = (
>>     spark.read.option("header", "true")
>>     .option("multiline", "true")
>>     .option("inferSchema", "true")
>>     .option("quote", '"')
>>     .option("delimiter", ",")
>>     .csv("path")
>> )
>>
>> What else can I do?
>>
>> Thanks,
>> Sid
>>
>>
>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>> papad...@csd.auth.gr> wrote:
>>
>>> Dear Sid,
>>>
>>> can you please give us more info? Is it true that every line may have a
>>> different number of columns? Is there any rule followed by every line
>>> of the file? From the information you have sent I cannot fully
>>> understand the "schema" of your data.
>>>
>>> Regards,
>>>
>>> Apostolos
>>>
>>>
>>> On 25/5/22 23:06, Sid wrote:
>>> > Hi Experts,
>>> >
>>> > I have below CSV data that is getting generated automatically. I can't
>>> > change the data manually.
>>> >
>>> > The data looks like below:
>>> >
>>> > 2020-12-12,abc,2000,,INR,
>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>> >
>>> > Since I don't have much experience with the other domains.
>>> >
>>> > It is handled by the other people.,INR
>>> > 2020-12-12,abc,2000,,USD,
>>> >
>>> > The third record is the problem, since the user split the value across
>>> > several lines while filling out the form. So, how do I handle this?
>>> >
>>> > There are 6 columns and 4 records in total. These are the sample
>>> > records.
>>> >
>>> > Should I load it as an RDD and then use a regex to eliminate the
>>> > new lines? Or how should it be done? With ". /n"?
>>> >
>>> > Any suggestions?
>>> >
>>> > Thanks,
>>> > Sid
>>>
>>> --
>>> Apostolos N. Papadopoulos, Associate Professor
>>> Department of Informatics
>>> Aristotle University of Thessaloniki
>>> Thessaloniki, GREECE
>>> tel: ++0030312310991918
>>> email: papad...@csd.auth.gr
>>> twitter: @papadopoulos_ap
>>> web: http://datalab.csd.auth.gr/~apostol
>>>
>>>
