OK, but how do you read it now? https://github.com/apache/spark/blob/8f610d1b4ce532705c528f3c085b0289b2b17a94/python/pyspark/pandas/namespace.py#L216 will probably have to be updated with the default options, so that the pandas API on Spark behaves like pandas.
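In the meantime, here is a minimal, untested sketch of reading a file like yours with plain PySpark while matching the pandas quoting convention (the path is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # pandas treats "" inside a quoted field as a literal quote character
    # (doublequote=True), while Spark's CSV reader defaults to backslash
    # escaping, so quote and escape are both set explicitly here.
    df = (
        spark.read
        .option("header", "true")
        .option("multiLine", "true")   # quoted fields may span several lines
        .option("quote", '"')
        .option("escape", '"')         # pandas-style doubled-quote escaping
        .csv("path/to/file.csv")       # placeholder path
    )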
Thu, 26 May 2022 at 17:38, Sid <flinkbyhe...@gmail.com> wrote:

> I was passing the wrong escape characters, which is why I was facing the
> issue. I have updated the user's answer on my post, and now I am able to
> load the dataset.
>
> Thank you, everyone, for your time and help! Much appreciated.
>
> I have more datasets like this one, and I hope they can be resolved with
> the same approach :) Fingers crossed.
>
> Thanks,
> Sid
>
> On Thu, May 26, 2022 at 8:43 PM Apostolos N. Papadopoulos <
> papad...@csd.auth.gr> wrote:
>
>> Since you cannot create the DataFrame directly, you may try to first
>> create an RDD of tuples from the file and then convert the RDD to a
>> DataFrame by using the toDF() transformation. Perhaps you can bypass
>> the issue this way.
>>
>> Another thing I have seen in the example is that you are using "" as
>> an escape character. Can you check whether this may cause any issues?
>>
>> Regards,
>>
>> Apostolos
>>
>> On 26/5/22 16:31, Sid wrote:
>>
>> Thanks for opening the issue, Bjørn. However, could you help me
>> address the problem for now with some kind of alternative? I have been
>> stuck on this since yesterday.
>>
>> Thanks,
>> Sid
>>
>> On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, <bjornjorgen...@gmail.com>
>> wrote:
>>
>>> Yes, it looks like a bug that we also have in the pandas API on
>>> Spark, so I have opened a JIRA
>>> <https://issues.apache.org/jira/browse/SPARK-39304> for it.
>>>
>>> Thu, 26 May 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I have finally posted a question with the dataset and the column
>>>> names.
>>>>
>>>> PFB the link:
>>>> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <
>>>> bjornjorgen...@gmail.com> wrote:
>>>>
>>>>> Sid, dump one of your files.
>>>>>
>>>>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>>>>
>>>>> Wed, 25 May 2022, 23:04, Sid <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> I have 10 columns, but in the dataset I observed that some records
>>>>>> have 11 columns of data (for the additional column, the value is
>>>>>> marked as null). How do I handle this?
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>>
>>>>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>
>>>>>>> How can I do that? Any examples or links, please. This works well
>>>>>>> with pandas, I suppose; it's just that I would need to convert
>>>>>>> back to a Spark DataFrame by providing a schema. Since we are on a
>>>>>>> lower Spark version, where pandas won't work in a distributed way,
>>>>>>> I was wondering whether Spark could handle this in a much better
>>>>>>> way.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sid
>>>>>>>
>>>>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Forgot to reply-all on the last message, whoops. Not very good at
>>>>>>>> email.
>>>>>>>>
>>>>>>>> You need to normalize the CSV with a parser that can escape
>>>>>>>> commas inside of strings. I am not sure whether Spark has an
>>>>>>>> option for this; see the sketch below.
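>>>>>>>>
>>>>>>>> A minimal, untested sketch of that normalization step with
>>>>>>>> Python's csv module (the file names are placeholders): the stdlib
>>>>>>>> parser understands quoted fields that contain commas and
>>>>>>>> newlines, so the raw file can be rewritten with one record per
>>>>>>>> line before Spark reads it.
>>>>>>>>
>>>>>>>>     import csv
>>>>>>>>
>>>>>>>>     # Read with the default dialect ('"' quotes, "" as the
>>>>>>>>     # escaped quote), which tolerates commas and newlines
>>>>>>>>     # inside quoted fields.
>>>>>>>>     with open("raw.csv", newline="") as src, \
>>>>>>>>             open("normalized.csv", "w", newline="") as dst:
>>>>>>>>         writer = csv.writer(dst)
>>>>>>>>         for row in csv.reader(src):
>>>>>>>>             # Flatten embedded newlines so each record
>>>>>>>>             # sits on a single line.
>>>>>>>>             writer.writerow([f.replace("\n", " ") for f in row])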
>>>>>>>>
>>>>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thank you so much for your time.
>>>>>>>>>
>>>>>>>>> I have data like below, which I tried to load by setting multiple
>>>>>>>>> options while reading the file; however, I am not able to
>>>>>>>>> consolidate the 9th column's data within itself.
>>>>>>>>>
>>>>>>>>> [image: image.png]
>>>>>>>>>
>>>>>>>>> I tried the below code:
>>>>>>>>>
>>>>>>>>> df = spark.read.option("header", "true") \
>>>>>>>>>     .option("multiline", "true") \
>>>>>>>>>     .option("inferSchema", "true") \
>>>>>>>>>     .option("quote", '"') \
>>>>>>>>>     .option("delimiter", ",") \
>>>>>>>>>     .csv("path")
>>>>>>>>>
>>>>>>>>> What else can I do?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Sid
>>>>>>>>>
>>>>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>>>>>>>> papad...@csd.auth.gr> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Sid,
>>>>>>>>>>
>>>>>>>>>> Can you please give us more info? Is it true that every line
>>>>>>>>>> may have a different number of columns? Is there any rule
>>>>>>>>>> followed by every line of the file? From the information you
>>>>>>>>>> have sent, I cannot fully understand the "schema" of your data.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Apostolos
>>>>>>>>>>
>>>>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>>>>> > Hi Experts,
>>>>>>>>>> >
>>>>>>>>>> > I have the below CSV data, which is generated automatically;
>>>>>>>>>> > I can't change the data manually.
>>>>>>>>>> >
>>>>>>>>>> > The data looks like this:
>>>>>>>>>> >
>>>>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the
>>>>>>>>>> > development part.
>>>>>>>>>> >
>>>>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>>>>> >
>>>>>>>>>> > It is handled by the other people.,INR
>>>>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>>>>> >
>>>>>>>>>> > The third record is the problem: its value was split across
>>>>>>>>>> > new lines by the user while filling up the form. So, how do I
>>>>>>>>>> > handle this?
>>>>>>>>>> >
>>>>>>>>>> > There are 6 columns and 4 records in total. These are sample
>>>>>>>>>> > records.
>>>>>>>>>> >
>>>>>>>>>> > Should I load it as an RDD and then maybe eliminate the new
>>>>>>>>>> > lines using a regex? Or how should it be done? With ". /n"?
>>>>>>>>>> >
>>>>>>>>>> > Any suggestions?
>>>>>>>>>> >
>>>>>>>>>> > Thanks,
>>>>>>>>>> > Sid
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>>>>> Department of Informatics
>>>>>>>>>> Aristotle University of Thessaloniki
>>>>>>>>>> Thessaloniki, GREECE
>>>>>>>>>> tel: ++0030312310991918
>>>>>>>>>> email: papad...@csd.auth.gr
>>>>>>>>>> twitter: @papadopoulos_ap
>>>>>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
>>
>> --
>> Apostolos N. Papadopoulos, Associate Professor
>> Department of Informatics
>> Aristotle University of Thessaloniki
>> Thessaloniki, GREECE
>> tel: ++0030312310991918
>> email: papad...@csd.auth.gr
>> twitter: @papadopoulos_ap
>> web: http://datalab.csd.auth.gr/~apostol

--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297