OK, but how do you read it now? https://github.com/apache/spark/blob/8f610d1b4ce532705c528f3c085b0289b2b17a94/python/pyspark/pandas/namespace.py#L216 will probably have to be updated with the default options, so that the pandas API on Spark behaves like pandas.
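In the meantime, here is a minimal, untested sketch of reading a file like yours with plain PySpark while matching the pandas quoting convention (the path is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # pandas treats "" inside a quoted field as a literal quote character
    # (doublequote=True), while Spark's CSV reader defaults to backslash
    # escaping, so quote and escape are both set explicitly here.
    df = (
        spark.read
        .option("header", "true")
        .option("multiLine", "true")   # quoted fields may span several lines
        .option("quote", '"')
        .option("escape", '"')         # pandas-style doubled-quote escaping
        .csv("path/to/file.csv")       # placeholder path
    )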
Thu, 26 May 2022 at 17:38, Sid <flinkbyhe...@gmail.com> wrote:

> I was passing the wrong escape characters, which is why I was facing the
> issue. I have updated the user's answer on my post, and now I am able to
> load the dataset.
>
> Thank you, everyone, for your time and help! Much appreciated.
>
> I have more datasets like this one, and I hope they can be resolved with
> the same approach :) Fingers crossed.
>
> Thanks,
> Sid
>
> On Thu, May 26, 2022 at 8:43 PM Apostolos N. Papadopoulos <
> papad...@csd.auth.gr> wrote:
>
>> Since you cannot create the DataFrame directly, you may try to first
>> create an RDD of tuples from the file and then convert the RDD to a
>> DataFrame by using the toDF() transformation. Perhaps you can bypass
>> the issue this way.
>>
>> Another thing I have seen in the example is that you are using "" as
>> an escape character. Can you check whether this may cause any issues?
>>
>> Regards,
>>
>> Apostolos
>>
>> On 26/5/22 16:31, Sid wrote:
>>
>> Thanks for opening the issue, Bjørn. However, could you help me
>> address the problem for now with some kind of alternative? I have been
>> stuck on this since yesterday.
>>
>> Thanks,
>> Sid
>>
>> On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, <bjornjorgen...@gmail.com>
>> wrote:
>>
>>> Yes, it looks like a bug that we also have in the pandas API on
>>> Spark, so I have opened a JIRA
>>> <https://issues.apache.org/jira/browse/SPARK-39304> for it.
>>>
>>> Thu, 26 May 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I have finally posted a question with the dataset and the column
>>>> names.
>>>>
>>>> PFB the link:
>>>> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <
>>>> bjornjorgen...@gmail.com> wrote:
>>>>
>>>>> Sid, dump one of your files.
>>>>>
>>>>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>>>>
>>>>> Wed, 25 May 2022, 23:04, Sid <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> I have 10 columns, but in the dataset I observed that some records
>>>>>> have 11 columns of data (for the additional column, the value is
>>>>>> marked as null). How do I handle this?
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>>
>>>>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>
>>>>>>> How can I do that? Any examples or links, please. This works well
>>>>>>> with pandas, I suppose; it's just that I would need to convert
>>>>>>> back to a Spark DataFrame by providing a schema. Since we are on a
>>>>>>> lower Spark version, where pandas won't work in a distributed way,
>>>>>>> I was wondering whether Spark could handle this in a much better
>>>>>>> way.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sid
>>>>>>>
>>>>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Forgot to reply-all on the last message, whoops. Not very good at
>>>>>>>> email.
>>>>>>>>
>>>>>>>> You need to normalize the CSV with a parser that can escape
>>>>>>>> commas inside of strings. I am not sure whether Spark has an
>>>>>>>> option for this; see the sketch below.
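>>>>>>>>
>>>>>>>> A minimal, untested sketch of that normalization step with
>>>>>>>> Python's csv module (the file names are placeholders): the stdlib
>>>>>>>> parser understands quoted fields that contain commas and
>>>>>>>> newlines, so the raw file can be rewritten with one record per
>>>>>>>> line before Spark reads it.
>>>>>>>>
>>>>>>>>     import csv
>>>>>>>>
>>>>>>>>     # Read with the default dialect ('"' quotes, "" as the
>>>>>>>>     # escaped quote), which tolerates commas and newlines
>>>>>>>>     # inside quoted fields.
>>>>>>>>     with open("raw.csv", newline="") as src, \
>>>>>>>>             open("normalized.csv", "w", newline="") as dst:
>>>>>>>>         writer = csv.writer(dst)
>>>>>>>>         for row in csv.reader(src):
>>>>>>>>             # Flatten embedded newlines so each record
>>>>>>>>             # sits on a single line.
>>>>>>>>             writer.writerow([f.replace("\n", " ") for f in row])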
>>>>>>>>
>>>>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thank you so much for your time.
>>>>>>>>>
>>>>>>>>> I have data like below, which I tried to load by setting multiple
>>>>>>>>> options while reading the file; however, I am not able to
>>>>>>>>> consolidate the 9th column's data within itself.
>>>>>>>>>
>>>>>>>>> [image: image.png]
>>>>>>>>>
>>>>>>>>> I tried the below code:
>>>>>>>>>
>>>>>>>>> df = spark.read.option("header", "true") \
>>>>>>>>>     .option("multiline", "true") \
>>>>>>>>>     .option("inferSchema", "true") \
>>>>>>>>>     .option("quote", '"') \
>>>>>>>>>     .option("delimiter", ",") \
>>>>>>>>>     .csv("path")
>>>>>>>>>
>>>>>>>>> What else can I do?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Sid
>>>>>>>>>
>>>>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>>>>>>>> papad...@csd.auth.gr> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Sid,
>>>>>>>>>>
>>>>>>>>>> Can you please give us more info? Is it true that every line
>>>>>>>>>> may have a different number of columns? Is there any rule
>>>>>>>>>> followed by every line of the file? From the information you
>>>>>>>>>> have sent, I cannot fully understand the "schema" of your data.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Apostolos
>>>>>>>>>>
>>>>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>>>>> > Hi Experts,
>>>>>>>>>> >
>>>>>>>>>> > I have the below CSV data, which is generated automatically;
>>>>>>>>>> > I can't change the data manually.
>>>>>>>>>> >
>>>>>>>>>> > The data looks like this:
>>>>>>>>>> >
>>>>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the
>>>>>>>>>> > development part.
>>>>>>>>>> >
>>>>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>>>>> >
>>>>>>>>>> > It is handled by the other people.,INR
>>>>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>>>>> >
>>>>>>>>>> > The third record is the problem: its value was split across
>>>>>>>>>> > new lines by the user while filling up the form. So, how do I
>>>>>>>>>> > handle this?
>>>>>>>>>> >
>>>>>>>>>> > There are 6 columns and 4 records in total. These are sample
>>>>>>>>>> > records.
>>>>>>>>>> >
>>>>>>>>>> > Should I load it as an RDD and then maybe eliminate the new
>>>>>>>>>> > lines using a regex? Or how should it be done? With ". /n"?
>>>>>>>>>> >
>>>>>>>>>> > Any suggestions?
>>>>>>>>>> >
>>>>>>>>>> > Thanks,
>>>>>>>>>> > Sid
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>>>>> Department of Informatics
>>>>>>>>>> Aristotle University of Thessaloniki
>>>>>>>>>> Thessaloniki, GREECE
>>>>>>>>>> tel: ++0030312310991918
>>>>>>>>>> email: papad...@csd.auth.gr
>>>>>>>>>> twitter: @papadopoulos_ap
>>>>>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
>>
>> --
>> Apostolos N. Papadopoulos, Associate Professor
>> Department of Informatics
>> Aristotle University of Thessaloniki
>> Thessaloniki, GREECE
>> tel: ++0030312310991918
>> email: papad...@csd.auth.gr
>> twitter: @papadopoulos_ap
>> web: http://datalab.csd.auth.gr/~apostol

--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297