I am not reading it through pandas. I am using plain Spark, because when I tried the pandas API that ships under import pyspark.pandas, it gave me an error.
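The fix that surfaces further down this thread is about the escape character. As a minimal sketch, a read along these lines handles quoted fields that contain commas and newlines, assuming the file escapes quotes by doubling them (the usual CSV convention); the path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # multiLine lets a quoted value span several physical lines; escape='"'
    # matches files that write a literal quote inside a field as "".
    df = (
        spark.read
        .option("header", "true")
        .option("multiLine", "true")
        .option("quote", '"')
        .option("escape", '"')
        .csv("path/to/file.csv")  # placeholder path
    )

Spark's default escape character is a backslash, so files that double their quotes are easily misparsed, which fits the symptom described below.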
On Thu, May 26, 2022 at 9:52 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:

> Ok, but how do you read it now?
>
> https://github.com/apache/spark/blob/8f610d1b4ce532705c528f3c085b0289b2b17a94/python/pyspark/pandas/namespace.py#L216
> probably has to be updated with the default options, so that the pandas API
> on Spark behaves like pandas.
>
> On Thu, May 26, 2022 at 17:38, Sid <flinkbyhe...@gmail.com> wrote:
>
>> I was passing the wrong escape characters, which is why I was facing the
>> issue. I have updated the user's answer on my post, and now I am able to
>> load the dataset.
>>
>> Thank you everyone for your time and help! Much appreciated.
>>
>> I have more datasets like this. I hope they can be resolved using the same
>> approach :) Fingers crossed.
>>
>> Thanks,
>> Sid
>>
>> On Thu, May 26, 2022 at 8:43 PM Apostolos N. Papadopoulos <
>> papad...@csd.auth.gr> wrote:
>>
>>> Since you cannot create the DF directly, you may try to first create an
>>> RDD of tuples from the file and then convert the RDD to a DF using the
>>> toDF() transformation. Perhaps you can bypass the issue this way.
>>>
>>> Another thing I noticed in the example is that you are using "" as an
>>> escape character. Can you check whether this causes any issues?
>>>
>>> Regards,
>>> Apostolos
>>>
>>> On 26/5/22 16:31, Sid wrote:
>>>
>>> Thanks for opening the issue, Bjørn. However, could you help me address
>>> the problem for now with some kind of alternative? I have actually been
>>> stuck on this since yesterday.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, <bjornjorgen...@gmail.com>
>>> wrote:
>>>
>>>> Yes, it looks like a bug that we also have in the pandas API on Spark,
>>>> so I have opened a JIRA
>>>> <https://issues.apache.org/jira/browse/SPARK-39304> for it.
>>>>
>>>> On Thu, May 26, 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I have finally posted a question with the dataset and the column names.
>>>>>
>>>>> PFB link:
>>>>> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <
>>>>> bjornjorgen...@gmail.com> wrote:
>>>>>
>>>>>> Sid, dump one of your files.
>>>>>>
>>>>>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>>>>>
>>>>>> On Wed, May 25, 2022 at 23:04, Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>
>>>>>>> I have 10 columns, but in the dataset I observed that some records
>>>>>>> have 11 columns of data (the additional column is marked as null).
>>>>>>> How do I handle this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sid
>>>>>>>
>>>>>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>
>>>>>>>> How can I do that? Any examples or links, please. This works well
>>>>>>>> with pandas, I suppose; it's just that I would need to convert back
>>>>>>>> to a Spark data frame by providing a schema. Since we are on a lower
>>>>>>>> Spark version, where pandas won't work in a distributed way, I was
>>>>>>>> wondering whether Spark could handle this in a much better way.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sid
>>>>>>>>
>>>>>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Forgot to reply-all last message, whoops. Not very good at email.
>>>>>>>>>
>>>>>>>>> You need to normalize the CSV with a parser that can escape commas
>>>>>>>>> inside of strings. Not sure if Spark has an option for this?
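A rough sketch of the normalization Gavin suggests, using Python's standard csv module, which understands commas and newlines inside quoted strings. The file names, the 10-column target from Sid's earlier message, and the choice to flatten embedded newlines into spaces are all assumptions for illustration:

    import csv

    EXPECTED_COLS = 10  # Sid mentions 10 columns, with an occasional 11th

    with open("raw.csv", newline="") as src, open("clean.csv", "w", newline="") as dst:
        reader = csv.reader(src)  # parses quoted commas and newlines correctly
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for row in reader:
            row = row[:EXPECTED_COLS]                 # drop a stray 11th column
            row += [""] * (EXPECTED_COLS - len(row))  # pad short rows with blanks
            # Flatten embedded newlines so each record becomes one physical line.
            row = [field.replace("\r", " ").replace("\n", " ") for field in row]
            writer.writerow(row)

After this pass every record is a single line with every field quoted, so a plain spark.read.csv with the header and quote options should be enough.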
>>>>>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you so much for your time.
>>>>>>>>>>
>>>>>>>>>> I have data like below, which I tried to load by setting multiple
>>>>>>>>>> options while reading the file; however, I am not able to
>>>>>>>>>> consolidate the 9th column's data within itself.
>>>>>>>>>>
>>>>>>>>>> [image: image.png]
>>>>>>>>>>
>>>>>>>>>> I tried the below code:
>>>>>>>>>>
>>>>>>>>>> df = (
>>>>>>>>>>     spark.read
>>>>>>>>>>     .option("header", "true")
>>>>>>>>>>     .option("multiLine", "true")
>>>>>>>>>>     .option("inferSchema", "true")
>>>>>>>>>>     .option("quote", '"')
>>>>>>>>>>     .option("delimiter", ",")
>>>>>>>>>>     .csv("path")
>>>>>>>>>> )
>>>>>>>>>>
>>>>>>>>>> What else can I do?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Sid
>>>>>>>>>>
>>>>>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>>>>>>>>> papad...@csd.auth.gr> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear Sid,
>>>>>>>>>>>
>>>>>>>>>>> Can you please give us more info? Is it true that every line may
>>>>>>>>>>> have a different number of columns? Is there any rule followed by
>>>>>>>>>>> every line of the file? From the information you have sent, I
>>>>>>>>>>> cannot fully understand the "schema" of your data.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Apostolos
>>>>>>>>>>>
>>>>>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>>>>>> > Hi Experts,
>>>>>>>>>>> >
>>>>>>>>>>> > I have the below CSV data that is getting generated
>>>>>>>>>>> > automatically. I can't change the data manually.
>>>>>>>>>>> >
>>>>>>>>>>> > The data looks like below:
>>>>>>>>>>> >
>>>>>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>>>>>> > It is handled by the other people.,INR
>>>>>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>>>>>> >
>>>>>>>>>>> > The third record is the problem: the user split its value across
>>>>>>>>>>> > new lines while filling up the form. So how do I handle this?
>>>>>>>>>>> >
>>>>>>>>>>> > There are 6 columns and 4 records in total. These are sample
>>>>>>>>>>> > records.
>>>>>>>>>>> >
>>>>>>>>>>> > Should I load it as an RDD and then maybe eliminate the new
>>>>>>>>>>> > lines using a regex? Or how should it be done, with ". /n"?
>>>>>>>>>>> >
>>>>>>>>>>> > Any suggestions?
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks,
>>>>>>>>>>> > Sid
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>>>>>> Department of Informatics
>>>>>>>>>>> Aristotle University of Thessaloniki
>>>>>>>>>>> Thessaloniki, GREECE
>>>>>>>>>>> tel: ++0030312310991918
>>>>>>>>>>> email: papad...@csd.auth.gr
>>>>>>>>>>> twitter: @papadopoulos_ap
>>>>>>>>>>> web: http://datalab.csd.auth.gr/~apostol
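Picking up the closing question above (load it as an RDD?) together with Apostolos's toDF() suggestion, a minimal sketch: parse each file as a whole with Python's csv module so that quoted newlines survive, then turn the RDD of tuples into a DataFrame. The path and column names are invented for illustration, and wholeTextFiles pulls each file fully into memory, so this suits modest file sizes:

    import csv
    import io

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Six columns, matching the sample records; the names are made up.
    cols = ["date", "code", "amount", "comment", "currency", "note"]

    rows = (
        spark.sparkContext
        .wholeTextFiles("path/to/file.csv")     # RDD of (filename, whole content)
        .flatMap(lambda kv: csv.reader(io.StringIO(kv[1])))
        .filter(lambda r: len(r) == len(cols))  # keep only complete records
        .map(tuple)
    )

    df = rows.toDF(cols)
    df.show(truncate=False)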
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297