Thanks for opening the issue, Bjørn. However, could you help me address the problem for now with some kind of alternative? I have actually been stuck on this since yesterday.

Thanks,
Sid
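One possible interim workaround while the JIRA is open, as a rough sketch only (untested against the real files; it assumes each file fits in driver memory, that every genuine record starts with a yyyy-MM-dd date as in the sample further down the thread, and that there are six columns — the column names below are made up):

import re
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

new_record = re.compile(r"^\d{4}-\d{2}-\d{2},")   # real records start with a date

# Read the raw file on the driver (assumption: it is small enough for that)
with open("/path/to/file.csv") as f:              # placeholder path
    lines = [ln.rstrip("\n") for ln in f if ln.strip()]

records = []
for ln in lines:
    if new_record.match(ln) or not records:
        records.append(ln)
    else:
        # continuation line: the user pressed Enter inside a free-text field,
        # so glue it back onto the previous record
        records[-1] += " " + ln

def to_fields(rec):
    # split into at most 6 fields and pad short records with None;
    # note: commas inside the free-text field would still misalign the last columns
    parts = rec.split(",", 5)
    return parts + [None] * (6 - len(parts))

cols = ["date", "name", "amount", "comment", "currency", "extra"]   # placeholder names
schema = StructType([StructField(c, StringType(), True) for c in cols])

df = spark.createDataFrame([to_fields(r) for r in records], schema)
df.show(truncate=False)

If the files are too big to read on the driver, the same idea can be pushed into Spark itself with wholeTextFiles plus a regex; there is a sketch of that at the very bottom of this thread.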
On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, <bjornjorgen...@gmail.com> wrote:

> Yes, it looks like a bug that we also have in the pandas API on Spark.
>
> So I have opened a JIRA
> <https://issues.apache.org/jira/browse/SPARK-39304> for this.
>
> On Thu, 26 May 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> I have finally posted a question with the dataset and the column names.
>>
>> PFB link:
>>
>> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>>
>> Thanks,
>> Sid
>>
>> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com>
>> wrote:
>>
>>> Sid, dump one of your files.
>>>
>>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>>
>>> On Wed, 25 May 2022 at 23:04, Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> I have 10 columns, but in the dataset I observed that some records
>>>> have 11 columns of data (for the additional column, the value is marked
>>>> as null). How do I handle this?
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> How can I do that? Any examples or links, please. This works well
>>>>> with pandas, I suppose. It's just that I would need to convert back to a
>>>>> Spark data frame by providing a schema, but since we are on a lower Spark
>>>>> version and pandas won't work in a distributed way on the lower versions,
>>>>> I was wondering if Spark could handle this in a much better way.
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Forgot to reply-all on the last message, whoops. Not very good at email.
>>>>>>
>>>>>> You need to normalize the CSV with a parser that can escape commas
>>>>>> inside of strings.
>>>>>> Not sure if Spark has an option for this?
>>>>>>
>>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you so much for your time.
>>>>>>>
>>>>>>> I have data like below, which I tried to load by setting multiple
>>>>>>> options while reading the file, but I am still not able to consolidate
>>>>>>> the 9th column's data within itself.
>>>>>>>
>>>>>>> [image: image.png]
>>>>>>>
>>>>>>> I tried the below code:
>>>>>>>
>>>>>>> df = spark.read.option("header", "true") \
>>>>>>>     .option("multiLine", "true") \
>>>>>>>     .option("inferSchema", "true") \
>>>>>>>     .option("quote", '"') \
>>>>>>>     .option("delimiter", ",") \
>>>>>>>     .csv("path")
>>>>>>>
>>>>>>> What else can I do?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sid
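To make the "normalize the CSV" suggestion above concrete: the multiLine/quote options only help once multi-line and comma-containing values are actually wrapped in quotes, which this data is not. A rough sketch of re-writing the records with proper quoting and then reading them back with Spark; the rows would come from the stitching snippet near the top of the thread (two hand-made records are included here only to keep the example self-contained), and the paths are placeholders:

import csv
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "rows" would normally come from the stitching snippet near the top of the thread
rows = [
    ["2020-12-12", "abc", "2000", "", "INR", ""],
    ["2020-12-09", "fgh", "", "software_developer",
     "I only manage the development part. It is handled by the other people.", "INR"],
]

# csv.writer quotes any field that contains a comma, a quote, or a newline
with open("/tmp/file_fixed.csv", "w", newline="") as out:
    csv.writer(out, quoting=csv.QUOTE_MINIMAL).writerows(rows)

df = (spark.read
      .option("header", "false")
      .option("multiLine", "true")    # only matters once multi-line values are quoted
      .option("quote", '"')
      .option("escape", '"')
      .csv("/tmp/file_fixed.csv"))
df.show(truncate=False)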
>>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>>>>>> papad...@csd.auth.gr> wrote:
>>>>>>>
>>>>>>>> Dear Sid,
>>>>>>>>
>>>>>>>> can you please give us more info? Is it true that every line may
>>>>>>>> have a different number of columns? Is there any rule followed by
>>>>>>>> every line of the file? From the information you have sent I cannot
>>>>>>>> fully understand the "schema" of your data.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Apostolos
>>>>>>>>
>>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>>> > Hi Experts,
>>>>>>>> >
>>>>>>>> > I have the below CSV data that is getting generated automatically.
>>>>>>>> > I can't change the data manually.
>>>>>>>> >
>>>>>>>> > The data looks like below:
>>>>>>>> >
>>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>>>>> >
>>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>>> >
>>>>>>>> > It is handled by the other people.,INR
>>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>>> >
>>>>>>>> > The third record is the problem: its value was split across new
>>>>>>>> > lines by the user while filling up the form. So, how do I handle this?
>>>>>>>> >
>>>>>>>> > There are 6 columns and 4 records in total. These are sample records.
>>>>>>>> >
>>>>>>>> > Should I load it as an RDD and then maybe eliminate the new lines
>>>>>>>> > using a regex? Or how should it be done? With ". /n"?
>>>>>>>> >
>>>>>>>> > Any suggestions?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Sid
>>>>>>>>
>>>>>>>> --
>>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>>> Department of Informatics
>>>>>>>> Aristotle University of Thessaloniki
>>>>>>>> Thessaloniki, GREECE
>>>>>>>> tel: ++0030312310991918
>>>>>>>> email: papad...@csd.auth.gr
>>>>>>>> twitter: @papadopoulos_ap
>>>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>>>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
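On the RDD + regex question in the original mail: the same date-anchored repair can be done without collecting anything to the driver by reading whole files and dropping every newline that does not start a new record. Again only a sketch under the same assumptions (records start with a yyyy-MM-dd date, six columns, made-up column names, placeholder path); note that wholeTextFiles keeps each file as one string, so a single very large file still lands on one executor:

import re
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

def repair(text):
    # replace every newline NOT followed by a yyyy-MM-dd date with a space,
    # i.e. glue wrapped free-text back onto its record
    return re.sub(r"\n(?!\d{4}-\d{2}-\d{2},)", " ", text)

def to_fields(line):
    parts = line.split(",", 5)                 # at most 6 fields
    return parts + [None] * (6 - len(parts))   # pad short records

records = (spark.sparkContext
           .wholeTextFiles("/path/to/csv_dir")        # (path, whole-file content) pairs
           .flatMap(lambda kv: repair(kv[1]).split("\n"))
           .filter(lambda line: line.strip())
           .map(to_fields))

cols = ["date", "name", "amount", "comment", "currency", "extra"]   # placeholder names
schema = StructType([StructField(c, StringType(), True) for c in cols])

df = spark.createDataFrame(records, schema)
df.show(truncate=False)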