Yes, but how do you read it with Spark?

On Thu, 26 May 2022 at 18:30, Sid <flinkbyhe...@gmail.com> wrote:
> I am not reading it through pandas. I am using Spark because when I tried
> to use pandas, which comes under import pyspark.pandas, it gave me an
> error.
>
> On Thu, May 26, 2022 at 9:52 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
>> OK, but how do you read it now?
>>
>> https://github.com/apache/spark/blob/8f610d1b4ce532705c528f3c085b0289b2b17a94/python/pyspark/pandas/namespace.py#L216
>> probably has to be updated with the default options, so that the pandas
>> API on Spark behaves like pandas.
>>
>> On Thu, 26 May 2022 at 17:38, Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> I was passing the wrong escape characters, which is what caused the
>>> issue. I have updated the user's answer on my post. Now I am able to
>>> load the dataset.
>>>
>>> Thank you everyone for your time and help! Much appreciated.
>>>
>>> I have more datasets like this. I hope they can be resolved with this
>>> approach :) Fingers crossed.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Thu, May 26, 2022 at 8:43 PM Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:
>>>
>>>> Since you cannot create the DF directly, you may try to first create
>>>> an RDD of tuples from the file and then convert the RDD to a DF using
>>>> the toDF() transformation. Perhaps you can bypass the issue this way.
>>>>
>>>> Another thing I have seen in the example is that you are using "" as
>>>> an escape character. Can you check whether this causes any issues?
>>>>
>>>> Regards,
>>>>
>>>> Apostolos
>>>>
>>>> On 26/5/22 16:31, Sid wrote:
>>>>
>>>> Thanks for opening the issue, Bjørn. However, could you help me
>>>> address the problem for now with some kind of alternative?
>>>>
>>>> I have actually been stuck on this since yesterday.
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Thu, 26 May 2022 at 18:48, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>
>>>>> Yes, it looks like a bug that we also have in the pandas API on Spark.
>>>>> So I have opened a JIRA
>>>>> <https://issues.apache.org/jira/browse/SPARK-39304> for this.
>>>>>
>>>>> On Thu, 26 May 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> Hello Everyone,
>>>>>>
>>>>>> I have finally posted a question with the dataset and the column
>>>>>> names.
>>>>>>
>>>>>> PFB link:
>>>>>>
>>>>>> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>>
>>>>>> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>>>
>>>>>>> Sid, dump one of your files.
>>>>>>>
>>>>>>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>>>>>>
>>>>>>> On Wed, 25 May 2022 at 23:04, Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have 10 columns, but in the dataset I observed that some records
>>>>>>>> have 11 columns of data (the additional column is marked as null).
>>>>>>>> How do I handle this?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sid
>>>>>>>>
>>>>>>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> How can I do that? Any examples or links, please. This works well
>>>>>>>>> with pandas, I suppose; it's just that I would need to convert
>>>>>>>>> back to a Spark DataFrame by providing a schema. But since we are
>>>>>>>>> on a lower Spark version, and pandas won't work in a distributed
>>>>>>>>> way on the lower versions, I was wondering whether Spark could
>>>>>>>>> handle this in a much better way.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Sid
>>>>>>>>>
>>>>>>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Forgot to reply-all on the last message, whoops. Not very good
>>>>>>>>>> at email.
>>>>>>>>>> You need to normalize the CSV with a parser that can escape
>>>>>>>>>> commas inside of strings. Not sure if Spark has an option for
>>>>>>>>>> this?
>>>>>>>>>>
>>>>>>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you so much for your time.
>>>>>>>>>>>
>>>>>>>>>>> I have data like below, which I tried to load by setting
>>>>>>>>>>> multiple options while reading the file, but I am not able to
>>>>>>>>>>> consolidate the 9th column's data within itself.
>>>>>>>>>>>
>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>
>>>>>>>>>>> I tried the below code:
>>>>>>>>>>>
>>>>>>>>>>> df = (spark.read.option("header", "true")
>>>>>>>>>>>       .option("multiline", "true")
>>>>>>>>>>>       .option("inferSchema", "true")
>>>>>>>>>>>       .option("quote", '"')
>>>>>>>>>>>       .option("delimiter", ",")
>>>>>>>>>>>       .csv("path"))
>>>>>>>>>>>
>>>>>>>>>>> What else can I do?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Sid
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear Sid,
>>>>>>>>>>>>
>>>>>>>>>>>> can you please give us more info? Is it true that every line
>>>>>>>>>>>> may have a different number of columns? Is there any rule
>>>>>>>>>>>> followed by every line of the file? From the information you
>>>>>>>>>>>> have sent I cannot fully understand the "schema" of your data.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Apostolos
>>>>>>>>>>>>
>>>>>>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>>>>>>> > Hi Experts,
>>>>>>>>>>>> >
>>>>>>>>>>>> > I have the below CSV data, which is generated automatically.
>>>>>>>>>>>> > I can't change the data manually.
>>>>>>>>>>>> >
>>>>>>>>>>>> > The data looks like below:
>>>>>>>>>>>> >
>>>>>>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>>>>>>>>> >
>>>>>>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>>>>>>> >
>>>>>>>>>>>> > It is handled by the other people.,INR
>>>>>>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>>>>>>> >
>>>>>>>>>>>> > The third record is the problem: the user inserted new lines
>>>>>>>>>>>> > in the value while filling out the form. So, how do I handle
>>>>>>>>>>>> > this?
>>>>>>>>>>>> >
>>>>>>>>>>>> > There are 6 columns and 4 records in total. These are sample
>>>>>>>>>>>> > records.
>>>>>>>>>>>> >
>>>>>>>>>>>> > Should I load it as an RDD and then maybe use a regex to
>>>>>>>>>>>> > eliminate the new lines? Or how should it be done? With
>>>>>>>>>>>> > ". \n"?
>>>>>>>>>>>> >
>>>>>>>>>>>> > Any suggestions?
>>>>>>>>>>>> >
>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>> > Sid
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>>>>>>> Department of Informatics
>>>>>>>>>>>> Aristotle University of Thessaloniki
>>>>>>>>>>>> Thessaloniki, GREECE
>>>>>>>>>>>> tel: ++0030312310991918
>>>>>>>>>>>> email: papad...@csd.auth.gr
>>>>>>>>>>>> twitter: @papadopoulos_ap
>>>>>>>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>>>>>>>>
>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>> --
>>>>> Bjørn Jørgensen
>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>> Norge
>>>>>
>>>>> +47 480 94 297
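[Editor's note] The core issue in this thread, a field whose value spans several physical lines, is exactly what a quote-aware CSV parser handles. A minimal sketch with Python's standard csv module; the rows are adapted from Sid's sample, with the long comment quoted the way a conforming CSV writer would emit it (an assumption, since the real file may lack the quotes). Spark's .option("multiline", "true") together with .option("quote", '"') relies on the same convention:

```python
import csv
import io

# Sample adapted from the thread: the third record's comment field is quoted,
# so its embedded newline stays inside one logical record.
raw = (
    '2020-12-12,abc,2000,,INR,\n'
    '2020-12-09,cde,3000,he is a manager,DOLLARS,nothing\n'
    '2020-12-09,fgh,,software_developer,"I only manage the development part.\n'
    'It is handled by the other people.",INR\n'
    '2020-12-12,abc,2000,,USD,\n'
)

rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))     # 4 logical records, despite 5 physical lines
print(len(rows[2]))  # the multi-line record still has 6 columns
```

If the source file does not quote the multi-line field at all, no CSV option can recover the record boundaries unambiguously, which is why fixing the writer (or the escape/quote characters, as Sid eventually did) is the real solution.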
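[Editor's note] Apostolos's suggestion upthread, building an RDD of tuples first and then calling toDF(), can be sketched as follows. Only the parse step is runnable here; the Spark calls are left as comments because they need a live SparkSession, and the column names are hypothetical:

```python
import csv
import io

def parse_record(line: str) -> tuple:
    """Parse one CSV line into a tuple, honouring commas inside quoted fields."""
    return tuple(next(csv.reader(io.StringIO(line))))

# With a SparkSession in hand this becomes (untested sketch, hypothetical names):
#   rdd = spark.sparkContext.textFile("path").map(parse_record)
#   df  = rdd.toDF(["date", "name", "amount", "role", "comment", "currency"])

print(parse_record('2020-12-09,cde,3000,"he, is a manager",DOLLARS,nothing'))
```

Note that textFile() splits on physical newlines, so this route still requires the multi-line records to be normalized first; it mainly helps with commas embedded in quoted fields.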
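[Editor's note] On the escape-character question Apostolos raised: in RFC 4180-style CSV, a literal double quote inside a quoted field is written as two double quotes rather than backslash-escaped, and in Spark this corresponds to .option("escape", '"'). Python's csv module uses the doubled-quote convention by default; the sample line below is hypothetical:

```python
import csv
import io

# A quoted field containing both a comma and doubled ("") literal quotes.
line = '2020-12-12,abc,"he said ""approved"", twice",INR'
row = next(csv.reader(io.StringIO(line)))
print(row[2])  # he said "approved", twice
```

Getting this setting wrong is consistent with Sid's report that passing the wrong escape characters was the root cause.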