Since you cannot create the DF directly, you could first create an RDD of tuples from the file

and then convert the RDD to a DF using the toDF() transformation.

Perhaps you can bypass the issue this way.
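A minimal sketch of that pre-processing step, in plain Python for illustration (the helper name merge_records and the date-prefix rule are assumptions, based on the sample rows later in the thread where every genuine record starts with a YYYY-MM-DD date): glue any line that does not start with a date onto the previous record, then split each record into a tuple:

```python
import re

# Assumption from the sample data: every real record starts with an ISO
# date; any other non-empty line is the tail of a multi-line text field.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2},")

def merge_records(lines):
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if DATE_RE.match(line) or not records:
            records.append(line)
        else:
            # continuation of a field the user split across lines
            records[-1] += " " + line
    # naive comma split -- assumes the free-text field itself has no commas
    return [tuple(r.split(",")) for r in records]
```

In Spark you would apply the same merge to the file's lines before calling toDF() on the resulting RDD of tuples. One caveat: sc.textFile() splits the file by line across partitions, so for the merge to be safe you may need to read each file whole (e.g. via wholeTextFiles) first.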

Another thing I noticed in the example is that you are using "" as the escape character.

Can you check whether this causes any issues?

Regards,

Apostolos



On 26/5/22 16:31, Sid wrote:
Thanks for opening the issue, Bjorn. However, could you help me address the problem in the meantime with some kind of alternative?

I have actually been stuck on this since yesterday.

Thanks,
Sid

On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, <bjornjorgen...@gmail.com> wrote:

    Yes, it looks like a bug that we also have in the pandas API on Spark.

    So I have opened a JIRA
    <https://issues.apache.org/jira/browse/SPARK-39304> for this.

    tor. 26. mai 2022 kl. 11:09 skrev Sid <flinkbyhe...@gmail.com>:

        Hello Everyone,

        I have finally posted a question with the dataset and the
        column names.

        PFB link:

        
https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark

        Thanks,
        Sid

        On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen
        <bjornjorgen...@gmail.com> wrote:

            Sid, dump one of your files.

            
https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/



            ons. 25. mai 2022, 23:04 skrev Sid <flinkbyhe...@gmail.com>:

                I have 10 columns, but in the dataset I observed that
                some records have 11 columns of data (the additional
                column is marked as null). How do I handle this?

                Thanks,
                Sid
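One way to absorb such a stray extra column before building the DataFrame — a hedged sketch in plain Python (normalize_width is a hypothetical helper, not a Spark API): drop trailing empty columns beyond the expected width, and pad short rows with nulls so every record ends up with exactly the expected number of fields:

```python
def normalize_width(row, ncols=10):
    """Force a parsed row to exactly `ncols` fields where possible."""
    row = list(row)
    # drop spurious trailing columns that are empty/null (the extra
    # 11th column observed in some records)
    while len(row) > ncols and row[-1] in ("", None):
        row.pop()
    # pad short rows with nulls up to the expected width
    row += [None] * (ncols - len(row))
    return row
```

Rows that are too long but carry real data in the extra column are returned unchanged (the padding term is empty in that case), so they can be inspected rather than silently truncated.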

                On Thu, May 26, 2022 at 2:22 AM Sid
                <flinkbyhe...@gmail.com> wrote:

                    How can I do that? Any examples or links, please.
                    This works well with pandas, I suppose; it's just
                    that I would need to convert back to a Spark data
                    frame by providing a schema. But since we are on a
                    lower Spark version, where pandas won't work in a
                    distributed way, I was wondering whether Spark
                    could handle this in a better way.

                    Thanks,
                    Sid

                    On Thu, May 26, 2022 at 2:19 AM Gavin Ray
                    <ray.gavi...@gmail.com> wrote:

                        Forgot to reply-all on the last message, whoops.
                        Not very good at email.

                        You need to normalize the CSV with a parser that
                        can escape commas inside of strings. Not sure if
                        Spark has an option for this?
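As a small illustration of what such normalization buys you — Python's standard csv module, outside Spark, already treats commas inside quoted fields as data rather than delimiters:

```python
import csv
import io

# A field containing a comma survives parsing because it is quoted.
line = 'a,"hello, world",c'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['a', 'hello, world', 'c']
```

Spark's CSV reader offers the same behavior via its quote (and escape) options, but only when the producer actually quotes such fields; embedded commas or newlines in unquoted fields, as in the sample data here, cannot be recovered by a quoting-aware parser alone.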


                        On Wed, May 25, 2022 at 4:37 PM Sid
                        <flinkbyhe...@gmail.com> wrote:

                            Thank you so much for your time.

                            I have data like below, which I tried to
                            load by setting multiple options while
                            reading the file; however, I am not able to
                            keep the 9th column's data consolidated
                            within a single field.

                            [inline image: sample data]

                            I tried the below code:

                            df = (spark.read
                                  .option("header", "true")
                                  .option("multiLine", "true")
                                  .option("inferSchema", "true")
                                  .option("quote", '"')
                                  .option("delimiter", ",")
                                  .csv("path"))

                            What else can I do?

                            Thanks,
                            Sid


                            On Thu, May 26, 2022 at 1:46 AM Apostolos
                            N. Papadopoulos <papad...@csd.auth.gr> wrote:

                                Dear Sid,

                                can you please give us more info? Is it
                                true that every line may have a different
                                number of columns? Is there any rule
                                followed by every line of the file? From
                                the information you have sent I cannot
                                fully understand the "schema" of your
                                data.

                                Regards,

                                Apostolos


                                On 25/5/22 23:06, Sid wrote:
                                > Hi Experts,
                                >
                                > I have below CSV data that is getting
                                > generated automatically. I can't
                                > change the data manually.
                                >
                                > The data looks like below:
                                >
                                > 2020-12-12,abc,2000,,INR,
                                > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
                                > 2020-12-09,fgh,,software_developer,I only manage the development part.
                                >
                                > Since I don't have much experience with the other domains.
                                >
                                > It is handled by the other people.,INR
                                > 2020-12-12,abc,2000,,USD,
                                >
                                > The third record is the problem, since
                                > the value was split across new lines
                                > by the user while filling up the form.
                                > So, how do I handle this?
                                >
                                > There are 6 columns and 4 records in
                                > total. These are sample records.
                                >
                                > Should I load it as an RDD and then
                                > maybe eliminate the new lines using a
                                > regex, replacing them with ". /n"? Or
                                > how should it be done?
                                >
                                > Any suggestions?
                                >
                                > Thanks,
                                > Sid

                                --
                                Apostolos N. Papadopoulos, Associate Professor
                                Department of Informatics
                                Aristotle University of Thessaloniki
                                Thessaloniki, GREECE
                                tel: ++0030312310991918
                                email: papad...@csd.auth.gr
                                twitter: @papadopoulos_ap
                                web: http://datalab.csd.auth.gr/~apostol


                                
                                ---------------------------------------------------------------------
                                To unsubscribe e-mail:
                                user-unsubscr...@spark.apache.org



    --
    Bjørn Jørgensen
    Vestre Aspehaug 4, 6010 Ålesund
    Norge

    +47 480 94 297

