Yes, it looks like a bug that we also have in the pandas API on Spark, so I have opened a JIRA for it: <https://issues.apache.org/jira/browse/SPARK-39304>.
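For the multiline-CSV part of the thread below: any RFC 4180-style parser handles commas and newlines inside fields, as long as those fields are quoted in the source file. A minimal stdlib sketch (the rows are adapted from the sample in the thread; the quoting around the third record's long field is an assumption, since the real file apparently lacks it):

```python
import csv
import io

# Sample data mirroring the thread: the third record's multiline field
# is wrapped in quotes, so embedded commas and newlines stay inside it.
raw = (
    '2020-12-12,abc,2000,,INR,\n'
    '2020-12-09,cde,3000,he is a manager,DOLLARS,nothing\n'
    '2020-12-09,fgh,,software_developer,"I only manage the development part.\n'
    "Since I don't have much experience with the other domains,\n"
    'it is handled by the other people.",INR\n'
    '2020-12-12,abc,2000,,USD,\n'
)

rows = list(csv.reader(io.StringIO(raw)))
assert len(rows) == 4                   # four logical records, despite six physical lines
assert all(len(r) == 6 for r in rows)   # every record has exactly six columns
```

With Spark itself, the equivalent is the multiLine option together with quote/escape, as in the code Sid posted, but that only helps if the producer of the file actually quotes the multiline fields.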
On Thu, 26 May 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
> Hello Everyone,
>
> I have posted a question finally with the dataset and the column names.
>
> PFB link:
>
> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>
> Thanks,
> Sid
>
> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
>> Sid, dump one of your files.
>>
>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>
>> On Wed, 25 May 2022 at 23:04, Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> I have 10 columns, but in the dataset I observed that some records
>>> have 11 columns of data (for the additional column it is marked as
>>> null). How do I handle this?
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> How can I do that? Any examples or links, please. This works well
>>>> with pandas, I suppose. It's just that I need to convert back to a
>>>> Spark data frame by providing a schema, but since we are on a lower
>>>> Spark version, where pandas won't work in a distributed way, I was
>>>> wondering if Spark could handle this in a much better way.
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>>>>
>>>>> Forgot to reply-all on the last message, whoops. Not very good at email.
>>>>>
>>>>> You need to normalize the CSV with a parser that can escape commas
>>>>> inside of strings.
>>>>> Not sure if Spark has an option for this?
>>>>>
>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> Thank you so much for your time.
>>>>>>
>>>>>> I have data like below, which I tried to load by setting multiple
>>>>>> options while reading the file, but I am not able to consolidate
>>>>>> the 9th column data within itself.
>>>>>>
>>>>>> [image: image.png]
>>>>>>
>>>>>> I tried the below code:
>>>>>>
>>>>>> df = spark.read.option("header", "true") \
>>>>>>     .option("multiline", "true") \
>>>>>>     .option("inferSchema", "true") \
>>>>>>     .option("quote", '"') \
>>>>>>     .option("delimiter", ",") \
>>>>>>     .csv("path")
>>>>>>
>>>>>> What else can I do?
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>>
>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:
>>>>>>
>>>>>>> Dear Sid,
>>>>>>>
>>>>>>> can you please give us more info? Is it true that every line may
>>>>>>> have a different number of columns? Is there any rule followed by
>>>>>>> every line of the file? From the information you have sent I cannot
>>>>>>> fully understand the "schema" of your data.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Apostolos
>>>>>>>
>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>> > Hi Experts,
>>>>>>> >
>>>>>>> > I have the below CSV data that is getting generated automatically.
>>>>>>> > I can't change the data manually.
>>>>>>> >
>>>>>>> > The data looks like below:
>>>>>>> >
>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>>>> >
>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>> >
>>>>>>> > It is handled by the other people.,INR
>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>> >
>>>>>>> > The third record is the problem, since its value was split across
>>>>>>> > new lines by the user while filling up the form. So, how do I
>>>>>>> > handle this?
>>>>>>> >
>>>>>>> > There are 6 columns and 4 records in total. These are sample records.
>>>>>>> >
>>>>>>> > Should I load it as an RDD and then maybe use a regex to eliminate
>>>>>>> > the new lines? Or how should it be done? With ". \n"?
>>>>>>> >
>>>>>>> > Any suggestions?
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Sid
>>>>>>>
>>>>>>> --
>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>> Department of Informatics
>>>>>>> Aristotle University of Thessaloniki
>>>>>>> Thessaloniki, GREECE
>>>>>>> tel: ++0030312310991918
>>>>>>> email: papad...@csd.auth.gr
>>>>>>> twitter: @papadopoulos_ap
>>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297
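On the 10-vs-11-columns question earlier in the thread, one pre-processing approach is to normalize every record to the schema width before handing the file to Spark. A hedged stdlib sketch (six columns assumed for brevity; Sid's real file has ten, and the data here is made up):

```python
import csv
import io

EXPECTED_COLS = 6  # assumed schema width for this sketch

def normalize(row, width=EXPECTED_COLS):
    """Pad short rows with None and drop trailing extras so every
    record matches the expected schema width."""
    if len(row) < width:
        return row + [None] * (width - len(row))
    return row[:width]

raw = (
    'a,b,c,d,e,f\n'        # exactly six columns
    'a,b,c,d\n'            # short row: padded with None
    'a,b,c,d,e,f,extra\n'  # long row: extra column dropped
)

rows = [normalize(r) for r in csv.reader(io.StringIO(raw))]
assert all(len(r) == 6 for r in rows)
```

If the extra 11th column carries real data, pad to the widest observed row instead of truncating. On the Spark side, the CSV reader's mode option (PERMISSIVE with columnNameOfCorruptRecord, or DROPMALFORMED) is the built-in way to deal with malformed records.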