Thanks for opening the issue, Bjørn. However, could you help me address the problem for now with some kind of alternative? I have actually been stuck on this since yesterday.

Thanks,
Sid
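One possible interim workaround while the JIRA is open, as a rough sketch only (untested against the real files; it assumes each file fits in driver memory, that every genuine record starts with a yyyy-MM-dd date as in the sample further down the thread, and that there are six columns — the column names below are made up):

import re
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

new_record = re.compile(r"^\d{4}-\d{2}-\d{2},")   # real records start with a date

# Read the raw file on the driver (assumption: it is small enough for that)
with open("/path/to/file.csv") as f:              # placeholder path
    lines = [ln.rstrip("\n") for ln in f if ln.strip()]

records = []
for ln in lines:
    if new_record.match(ln) or not records:
        records.append(ln)
    else:
        # continuation line: the user pressed Enter inside a free-text field,
        # so glue it back onto the previous record
        records[-1] += " " + ln

def to_fields(rec):
    # split into at most 6 fields and pad short records with None;
    # note: commas inside the free-text field would still misalign the last columns
    parts = rec.split(",", 5)
    return parts + [None] * (6 - len(parts))

cols = ["date", "name", "amount", "comment", "currency", "extra"]   # placeholder names
schema = StructType([StructField(c, StringType(), True) for c in cols])

df = spark.createDataFrame([to_fields(r) for r in records], schema)
df.show(truncate=False)

If the files are too big to read on the driver, the same idea can be pushed into Spark itself with wholeTextFiles plus a regex; there is a sketch of that at the very bottom of this thread.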
On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, <bjornjorgen...@gmail.com> wrote:

> Yes, it looks like a bug that we also have in the pandas API on Spark.
>
> So I have opened a JIRA
> <https://issues.apache.org/jira/browse/SPARK-39304> for this.
>
> On Thu, 26 May 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> I have finally posted a question with the dataset and the column names.
>>
>> PFB link:
>>
>> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>>
>> Thanks,
>> Sid
>>
>> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com>
>> wrote:
>>
>>> Sid, dump one of your files.
>>>
>>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>>
>>> On Wed, 25 May 2022 at 23:04, Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> I have 10 columns, but in the dataset I observed that some records
>>>> have 11 columns of data (for the additional column, the value is marked
>>>> as null). How do I handle this?
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> How can I do that? Any examples or links, please. This works well
>>>>> with pandas, I suppose. It's just that I would need to convert back to a
>>>>> Spark data frame by providing a schema, but since we are on a lower Spark
>>>>> version and pandas won't work in a distributed way on the lower versions,
>>>>> I was wondering if Spark could handle this in a much better way.
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Forgot to reply-all on the last message, whoops. Not very good at email.
>>>>>>
>>>>>> You need to normalize the CSV with a parser that can escape commas
>>>>>> inside of strings.
>>>>>> Not sure if Spark has an option for this?
>>>>>>
>>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you so much for your time.
>>>>>>>
>>>>>>> I have data like below, which I tried to load by setting multiple
>>>>>>> options while reading the file, but I am still not able to consolidate
>>>>>>> the 9th column's data within itself.
>>>>>>>
>>>>>>> [image: image.png]
>>>>>>>
>>>>>>> I tried the below code:
>>>>>>>
>>>>>>> df = spark.read.option("header", "true") \
>>>>>>>     .option("multiLine", "true") \
>>>>>>>     .option("inferSchema", "true") \
>>>>>>>     .option("quote", '"') \
>>>>>>>     .option("delimiter", ",") \
>>>>>>>     .csv("path")
>>>>>>>
>>>>>>> What else can I do?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sid
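To make the "normalize the CSV" suggestion above concrete: the multiLine/quote options only help once multi-line and comma-containing values are actually wrapped in quotes, which this data is not. A rough sketch of re-writing the records with proper quoting and then reading them back with Spark; the rows would come from the stitching snippet near the top of the thread (two hand-made records are included here only to keep the example self-contained), and the paths are placeholders:

import csv
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "rows" would normally come from the stitching snippet near the top of the thread
rows = [
    ["2020-12-12", "abc", "2000", "", "INR", ""],
    ["2020-12-09", "fgh", "", "software_developer",
     "I only manage the development part. It is handled by the other people.", "INR"],
]

# csv.writer quotes any field that contains a comma, a quote, or a newline
with open("/tmp/file_fixed.csv", "w", newline="") as out:
    csv.writer(out, quoting=csv.QUOTE_MINIMAL).writerows(rows)

df = (spark.read
      .option("header", "false")
      .option("multiLine", "true")    # only matters once multi-line values are quoted
      .option("quote", '"')
      .option("escape", '"')
      .csv("/tmp/file_fixed.csv"))
df.show(truncate=False)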
>>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>>>>>> papad...@csd.auth.gr> wrote:
>>>>>>>
>>>>>>>> Dear Sid,
>>>>>>>>
>>>>>>>> can you please give us more info? Is it true that every line may
>>>>>>>> have a different number of columns? Is there any rule followed by
>>>>>>>> every line of the file? From the information you have sent I cannot
>>>>>>>> fully understand the "schema" of your data.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Apostolos
>>>>>>>>
>>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>>> > Hi Experts,
>>>>>>>> >
>>>>>>>> > I have the below CSV data that is getting generated automatically.
>>>>>>>> > I can't change the data manually.
>>>>>>>> >
>>>>>>>> > The data looks like below:
>>>>>>>> >
>>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>>>>> >
>>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>>> >
>>>>>>>> > It is handled by the other people.,INR
>>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>>> >
>>>>>>>> > The third record is the problem: its value was split across new
>>>>>>>> > lines by the user while filling up the form. So, how do I handle this?
>>>>>>>> >
>>>>>>>> > There are 6 columns and 4 records in total. These are sample records.
>>>>>>>> >
>>>>>>>> > Should I load it as an RDD and then maybe eliminate the new lines
>>>>>>>> > using a regex? Or how should it be done? With ". /n"?
>>>>>>>> >
>>>>>>>> > Any suggestions?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Sid
>>>>>>>>
>>>>>>>> --
>>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>>> Department of Informatics
>>>>>>>> Aristotle University of Thessaloniki
>>>>>>>> Thessaloniki, GREECE
>>>>>>>> tel: ++0030312310991918
>>>>>>>> email: papad...@csd.auth.gr
>>>>>>>> twitter: @papadopoulos_ap
>>>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>>>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
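On the RDD + regex question in the original mail: the same date-anchored repair can be done without collecting anything to the driver by reading whole files and dropping every newline that does not start a new record. Again only a sketch under the same assumptions (records start with a yyyy-MM-dd date, six columns, made-up column names, placeholder path); note that wholeTextFiles keeps each file as one string, so a single very large file still lands on one executor:

import re
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

def repair(text):
    # replace every newline NOT followed by a yyyy-MM-dd date with a space,
    # i.e. glue wrapped free-text back onto its record
    return re.sub(r"\n(?!\d{4}-\d{2}-\d{2},)", " ", text)

def to_fields(line):
    parts = line.split(",", 5)                 # at most 6 fields
    return parts + [None] * (6 - len(parts))   # pad short records

records = (spark.sparkContext
           .wholeTextFiles("/path/to/csv_dir")        # (path, whole-file content) pairs
           .flatMap(lambda kv: repair(kv[1]).split("\n"))
           .filter(lambda line: line.strip())
           .map(to_fields))

cols = ["date", "name", "amount", "comment", "currency", "extra"]   # placeholder names
schema = StructType([StructField(c, StringType(), True) for c in cols])

df = spark.createDataFrame(records, schema)
df.show(truncate=False)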