I am not reading it through pandas. I am using plain Spark, because when I tried the pandas API that ships under import pyspark.pandas, it gave me an error.
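The fix that surfaces further down this thread is about the escape character. As a minimal sketch, a read along these lines handles quoted fields that contain commas and newlines, assuming the file escapes quotes by doubling them (the usual CSV convention); the path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # multiLine lets a quoted value span several physical lines; escape='"'
    # matches files that write a literal quote inside a field as "".
    df = (
        spark.read
        .option("header", "true")
        .option("multiLine", "true")
        .option("quote", '"')
        .option("escape", '"')
        .csv("path/to/file.csv")  # placeholder path
    )

Spark's default escape character is a backslash, so files that double their quotes are easily misparsed, which fits the symptom described below.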
On Thu, May 26, 2022 at 9:52 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:

> Ok, but how do you read it now?
>
> https://github.com/apache/spark/blob/8f610d1b4ce532705c528f3c085b0289b2b17a94/python/pyspark/pandas/namespace.py#L216
> probably has to be updated with the default options, so that the pandas API
> on Spark behaves like pandas.
>
> On Thu, May 26, 2022 at 17:38, Sid <flinkbyhe...@gmail.com> wrote:
>
>> I was passing the wrong escape characters, which is why I was facing the
>> issue. I have updated the user's answer on my post, and now I am able to
>> load the dataset.
>>
>> Thank you everyone for your time and help! Much appreciated.
>>
>> I have more datasets like this. I hope they can be resolved using the same
>> approach :) Fingers crossed.
>>
>> Thanks,
>> Sid
>>
>> On Thu, May 26, 2022 at 8:43 PM Apostolos N. Papadopoulos <
>> papad...@csd.auth.gr> wrote:
>>
>>> Since you cannot create the DF directly, you may try to first create an
>>> RDD of tuples from the file and then convert the RDD to a DF using the
>>> toDF() transformation. Perhaps you can bypass the issue this way.
>>>
>>> Another thing I noticed in the example is that you are using "" as an
>>> escape character. Can you check whether this causes any issues?
>>>
>>> Regards,
>>> Apostolos
>>>
>>> On 26/5/22 16:31, Sid wrote:
>>>
>>> Thanks for opening the issue, Bjørn. However, could you help me address
>>> the problem for now with some kind of alternative? I have actually been
>>> stuck on this since yesterday.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, <bjornjorgen...@gmail.com>
>>> wrote:
>>>
>>>> Yes, it looks like a bug that we also have in the pandas API on Spark,
>>>> so I have opened a JIRA
>>>> <https://issues.apache.org/jira/browse/SPARK-39304> for it.
>>>>
>>>> On Thu, May 26, 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I have finally posted a question with the dataset and the column names.
>>>>>
>>>>> PFB link:
>>>>> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <
>>>>> bjornjorgen...@gmail.com> wrote:
>>>>>
>>>>>> Sid, dump one of your files.
>>>>>>
>>>>>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>>>>>
>>>>>> On Wed, May 25, 2022 at 23:04, Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>
>>>>>>> I have 10 columns, but in the dataset I observed that some records
>>>>>>> have 11 columns of data (the additional column is marked as null).
>>>>>>> How do I handle this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sid
>>>>>>>
>>>>>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>
>>>>>>>> How can I do that? Any examples or links, please. This works well
>>>>>>>> with pandas, I suppose; it's just that I would need to convert back
>>>>>>>> to a Spark data frame by providing a schema. Since we are on a lower
>>>>>>>> Spark version, where pandas won't work in a distributed way, I was
>>>>>>>> wondering whether Spark could handle this in a much better way.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sid
>>>>>>>>
>>>>>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Forgot to reply-all last message, whoops. Not very good at email.
>>>>>>>>>
>>>>>>>>> You need to normalize the CSV with a parser that can escape commas
>>>>>>>>> inside of strings. Not sure if Spark has an option for this?
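A rough sketch of the normalization Gavin suggests, using Python's standard csv module, which understands commas and newlines inside quoted strings. The file names, the 10-column target from Sid's earlier message, and the choice to flatten embedded newlines into spaces are all assumptions for illustration:

    import csv

    EXPECTED_COLS = 10  # Sid mentions 10 columns, with an occasional 11th

    with open("raw.csv", newline="") as src, open("clean.csv", "w", newline="") as dst:
        reader = csv.reader(src)  # parses quoted commas and newlines correctly
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for row in reader:
            row = row[:EXPECTED_COLS]                 # drop a stray 11th column
            row += [""] * (EXPECTED_COLS - len(row))  # pad short rows with blanks
            # Flatten embedded newlines so each record becomes one physical line.
            row = [field.replace("\r", " ").replace("\n", " ") for field in row]
            writer.writerow(row)

After this pass every record is a single line with every field quoted, so a plain spark.read.csv with the header and quote options should be enough.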
>>>>>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you so much for your time.
>>>>>>>>>>
>>>>>>>>>> I have data like below, which I tried to load by setting multiple
>>>>>>>>>> options while reading the file; however, I am not able to
>>>>>>>>>> consolidate the 9th column's data within itself.
>>>>>>>>>>
>>>>>>>>>> [image: image.png]
>>>>>>>>>>
>>>>>>>>>> I tried the below code:
>>>>>>>>>>
>>>>>>>>>> df = (
>>>>>>>>>>     spark.read
>>>>>>>>>>     .option("header", "true")
>>>>>>>>>>     .option("multiLine", "true")
>>>>>>>>>>     .option("inferSchema", "true")
>>>>>>>>>>     .option("quote", '"')
>>>>>>>>>>     .option("delimiter", ",")
>>>>>>>>>>     .csv("path")
>>>>>>>>>> )
>>>>>>>>>>
>>>>>>>>>> What else can I do?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Sid
>>>>>>>>>>
>>>>>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>>>>>>>>> papad...@csd.auth.gr> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear Sid,
>>>>>>>>>>>
>>>>>>>>>>> Can you please give us more info? Is it true that every line may
>>>>>>>>>>> have a different number of columns? Is there any rule followed by
>>>>>>>>>>> every line of the file? From the information you have sent, I
>>>>>>>>>>> cannot fully understand the "schema" of your data.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Apostolos
>>>>>>>>>>>
>>>>>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>>>>>> > Hi Experts,
>>>>>>>>>>> >
>>>>>>>>>>> > I have the below CSV data that is getting generated
>>>>>>>>>>> > automatically. I can't change the data manually.
>>>>>>>>>>> >
>>>>>>>>>>> > The data looks like below:
>>>>>>>>>>> >
>>>>>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>>>>>> > It is handled by the other people.,INR
>>>>>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>>>>>> >
>>>>>>>>>>> > The third record is the problem: the user split its value across
>>>>>>>>>>> > new lines while filling up the form. So how do I handle this?
>>>>>>>>>>> >
>>>>>>>>>>> > There are 6 columns and 4 records in total. These are sample
>>>>>>>>>>> > records.
>>>>>>>>>>> >
>>>>>>>>>>> > Should I load it as an RDD and then maybe eliminate the new
>>>>>>>>>>> > lines using a regex? Or how should it be done, with ". /n"?
>>>>>>>>>>> >
>>>>>>>>>>> > Any suggestions?
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks,
>>>>>>>>>>> > Sid
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>>>>>> Department of Informatics
>>>>>>>>>>> Aristotle University of Thessaloniki
>>>>>>>>>>> Thessaloniki, GREECE
>>>>>>>>>>> tel: ++0030312310991918
>>>>>>>>>>> email: papad...@csd.auth.gr
>>>>>>>>>>> twitter: @papadopoulos_ap
>>>>>>>>>>> web: http://datalab.csd.auth.gr/~apostol
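Picking up the closing question above (load it as an RDD?) together with Apostolos's toDF() suggestion, a minimal sketch: parse each file as a whole with Python's csv module so that quoted newlines survive, then turn the RDD of tuples into a DataFrame. The path and column names are invented for illustration, and wholeTextFiles pulls each file fully into memory, so this suits modest file sizes:

    import csv
    import io

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Six columns, matching the sample records; the names are made up.
    cols = ["date", "code", "amount", "comment", "currency", "note"]

    rows = (
        spark.sparkContext
        .wholeTextFiles("path/to/file.csv")     # RDD of (filename, whole content)
        .flatMap(lambda kv: csv.reader(io.StringIO(kv[1])))
        .filter(lambda r: len(r) == len(cols))  # keep only complete records
        .map(tuple)
    )

    df = rows.toDF(cols)
    df.show(truncate=False)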
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297