I was passing the wrong escape character, which was causing the issue. I have updated my post with the user's answer, and I am now able to load the dataset.
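For anyone landing here later: the failure mode was the quote/escape configuration, not the data itself. A minimal stand-in using Python's stdlib csv module (not Spark) on rows shaped like the sample from this thread shows that, once quoting is interpreted correctly, the multi-line field stays a single value; the exact quote and escape characters of the real file are assumptions here.

```python
import csv
import io

# Rows shaped like the thread's sample: the third record's fifth field
# contains embedded newlines and commas, so it is wrapped in quotes.
raw = (
    "2020-12-12,abc,2000,,INR,\n"
    "2020-12-09,cde,3000,he is a manager,DOLLARS,nothing\n"
    '2020-12-09,fgh,,software_developer,"I only manage the development part.\n'
    "Since I don't have much experience with the other domains.\n"
    'It is handled by the other people.",INR\n'
    "2020-12-12,abc,2000,,USD,\n"
)

# With the correct quotechar, the parser keeps the quoted multi-line text
# as one field, so every record comes back with exactly 6 columns.
rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))
for row in rows:
    print(len(row))
```

Spark's `spark.read.option("multiLine", "true").option("quote", '"').option("escape", ...)` plays the same role as `quotechar` here: it tells the reader which character protects embedded delimiters and newlines.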
Thank you everyone for your time and help! Much appreciated. I have more datasets like this; I hope those can be resolved using the same approach :) Fingers crossed.

Thanks,
Sid

On Thu, May 26, 2022 at 8:43 PM Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:

> Since you cannot create the DF directly, you may try to first create an
> RDD of tuples from the file, and then convert the RDD to a DF by using
> the toDF() transformation. Perhaps you can bypass the issue with this.
>
> Another thing that I have seen in the example is that you are using ""
> as an escape character. Can you check whether this may cause any issues?
>
> Regards,
> Apostolos
>
> On 26/5/22 16:31, Sid wrote:
>
> Thanks for opening the issue, Bjorn. However, could you help me address
> the problem for now with some kind of alternative? I have actually been
> stuck on this since yesterday.
>
> Thanks,
> Sid
>
> On Thu, 26 May 2022, 18:48 Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
>> Yes, it looks like a bug that we also have in the pandas API on Spark,
>> so I have opened a JIRA
>> <https://issues.apache.org/jira/browse/SPARK-39304> for this.
>>
>> On Thu, 26 May 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hello everyone,
>>>
>>> I have finally posted a question with the dataset and the column names.
>>>
>>> PFB link:
>>> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>
>>>> Sid, dump one of your files.
>>>>
>>>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>>>
>>>> On Wed, 25 May 2022, 23:04 Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> I have 10 columns, but in the dataset I observed that some records
>>>>> have 11 columns of data (for the additional column, the value is
>>>>> marked as null). How do I handle this?
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> How can I do that? Any examples or links, please. This works well
>>>>>> with pandas, I suppose. It's just that I would need to convert back
>>>>>> to a Spark DataFrame by providing a schema, but since we are on an
>>>>>> older Spark version, where pandas won't work in a distributed way, I
>>>>>> was wondering whether Spark could handle this in a much better way.
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>>
>>>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>>>>>>
>>>>>>> Forgot to reply-all on the last message, whoops. Not very good at email.
>>>>>>>
>>>>>>> You need to normalize the CSV with a parser that can escape commas
>>>>>>> inside of strings. Not sure if Spark has an option for this?
>>>>>>>
>>>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you so much for your time.
>>>>>>>>
>>>>>>>> I have data like below, which I tried to load by setting multiple
>>>>>>>> options while reading the file; however, I am not able to
>>>>>>>> consolidate the 9th column's data within itself.
>>>>>>>>
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> I tried the below code:
>>>>>>>>
>>>>>>>> df = spark.read.option("header", "true") \
>>>>>>>>     .option("multiline", "true") \
>>>>>>>>     .option("inferSchema", "true") \
>>>>>>>>     .option("quote", '"') \
>>>>>>>>     .option("delimiter", ",") \
>>>>>>>>     .csv("path")
>>>>>>>>
>>>>>>>> What else can I do?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sid
>>>>>>>>
>>>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:
>>>>>>>>
>>>>>>>>> Dear Sid,
>>>>>>>>>
>>>>>>>>> can you please give us more info? Is it true that every line may
>>>>>>>>> have a different number of columns? Is there any rule followed by
>>>>>>>>> every line of the file? From the information you have sent I
>>>>>>>>> cannot fully understand the "schema" of your data.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Apostolos
>>>>>>>>>
>>>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>>>>
>>>>>>>>> > Hi Experts,
>>>>>>>>> >
>>>>>>>>> > I have the below CSV data that is getting generated
>>>>>>>>> > automatically. I can't change the data manually.
>>>>>>>>> >
>>>>>>>>> > The data looks like below:
>>>>>>>>> >
>>>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>>>>>> >
>>>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>>>> >
>>>>>>>>> > It is handled by the other people.,INR
>>>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>>>> >
>>>>>>>>> > The third record is a problem, since its value was split across
>>>>>>>>> > new lines by the user while filling up the form. So, how do I
>>>>>>>>> > handle this?
>>>>>>>>> >
>>>>>>>>> > There are 6 columns and 4 records in total. These are the
>>>>>>>>> > sample records.
>>>>>>>>> >
>>>>>>>>> > Should I load it as an RDD and then maybe eliminate the new
>>>>>>>>> > lines using a regex? Or how should it be done? With ". /n"?
>>>>>>>>> >
>>>>>>>>> > Any suggestions?
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > Sid
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>>>> Department of Informatics
>>>>>>>>> Aristotle University of Thessaloniki
>>>>>>>>> Thessaloniki, GREECE
>>>>>>>>> tel: ++0030312310991918
>>>>>>>>> email: papad...@csd.auth.gr
>>>>>>>>> twitter: @papadopoulos_ap
>>>>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>
> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papad...@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol
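The two suggestions made in the thread (Gavin's "normalize with a CSV-aware parser" and Apostolos's "build an RDD of tuples, then toDF()") can be sketched together without Spark, again using the stdlib csv module as a stand-in. Padding short records with None and dropping fields beyond the expected count is my own assumption about how the 10-vs-11-column records should be handled; the sample rows are hypothetical.

```python
import csv
import io

EXPECTED_COLS = 6  # the thread's sample has 6 columns; Sid's real data had 10

# Hypothetical dump: one clean record, one with an embedded newline,
# and one with an extra trailing column.
raw = (
    "2020-12-12,abc,2000,,INR,\n"
    '2020-12-09,fgh,,software_developer,"a multi\nline note",INR\n'
    "2020-12-09,cde,3000,he is a manager,DOLLARS,nothing,extra\n"
)

def normalize(record, n):
    """Pad short records with None and drop fields beyond the first n."""
    return tuple((record + [None] * n)[:n])

rows = [normalize(r, EXPECTED_COLS) for r in csv.reader(io.StringIO(raw))]

# Every tuple now has exactly EXPECTED_COLS fields; in Spark, these tuples
# could then become a DataFrame via spark.createDataFrame(rows, schema)
# or sc.parallelize(rows).toDF(column_names).
```

Whether the extra 11th field should be dropped (as here) or kept in a spill-over column depends on what the upstream form generator actually emits.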