Yes, but how do you read it with Spark?

On Thu, 26 May 2022 at 18:30, Sid <flinkbyhe...@gmail.com> wrote:
> I am not reading it through pandas. I am using Spark because when I tried
> to use pandas, which comes under import pyspark.pandas, it gave me an
> error.
>
> On Thu, May 26, 2022 at 9:52 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
>> OK, but how do you read it now?
>>
>> https://github.com/apache/spark/blob/8f610d1b4ce532705c528f3c085b0289b2b17a94/python/pyspark/pandas/namespace.py#L216
>> probably has to be updated with the default options, so that the pandas
>> API on Spark behaves like pandas.
>>
>> On Thu, 26 May 2022 at 17:38, Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> I was passing the wrong escape characters, which is what caused the
>>> issue. I have updated the user's answer on my post. Now I am able to
>>> load the dataset.
>>>
>>> Thank you everyone for your time and help! Much appreciated.
>>>
>>> I have more datasets like this. I hope they can be resolved with this
>>> approach :) Fingers crossed.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Thu, May 26, 2022 at 8:43 PM Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:
>>>
>>>> Since you cannot create the DF directly, you may try to first create
>>>> an RDD of tuples from the file and then convert the RDD to a DF using
>>>> the toDF() transformation. Perhaps you can bypass the issue this way.
>>>>
>>>> Another thing I have seen in the example is that you are using "" as
>>>> an escape character. Can you check whether this causes any issues?
>>>>
>>>> Regards,
>>>>
>>>> Apostolos
>>>>
>>>> On 26/5/22 16:31, Sid wrote:
>>>>
>>>> Thanks for opening the issue, Bjørn. However, could you help me
>>>> address the problem for now with some kind of alternative?
>>>>
>>>> I have actually been stuck on this since yesterday.
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Thu, 26 May 2022 at 18:48, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>
>>>>> Yes, it looks like a bug that we also have in the pandas API on Spark.
>>>>> So I have opened a JIRA
>>>>> <https://issues.apache.org/jira/browse/SPARK-39304> for this.
>>>>>
>>>>> On Thu, 26 May 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> Hello Everyone,
>>>>>>
>>>>>> I have finally posted a question with the dataset and the column
>>>>>> names.
>>>>>>
>>>>>> PFB link:
>>>>>>
>>>>>> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>>
>>>>>> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>>>
>>>>>>> Sid, dump one of your files.
>>>>>>>
>>>>>>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>>>>>>
>>>>>>> On Wed, 25 May 2022 at 23:04, Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have 10 columns, but in the dataset I observed that some records
>>>>>>>> have 11 columns of data (the additional column is marked as null).
>>>>>>>> How do I handle this?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sid
>>>>>>>>
>>>>>>>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> How can I do that? Any examples or links, please. This works well
>>>>>>>>> with pandas, I suppose; it's just that I would need to convert
>>>>>>>>> back to a Spark DataFrame by providing a schema. But since we are
>>>>>>>>> on a lower Spark version, and pandas won't work in a distributed
>>>>>>>>> way on the lower versions, I was wondering whether Spark could
>>>>>>>>> handle this in a much better way.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Sid
>>>>>>>>>
>>>>>>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Forgot to reply-all on the last message, whoops. Not very good
>>>>>>>>>> at email.
>>>>>>>>>> You need to normalize the CSV with a parser that can escape
>>>>>>>>>> commas inside of strings. Not sure if Spark has an option for
>>>>>>>>>> this?
>>>>>>>>>>
>>>>>>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you so much for your time.
>>>>>>>>>>>
>>>>>>>>>>> I have data like below, which I tried to load by setting
>>>>>>>>>>> multiple options while reading the file, but I am not able to
>>>>>>>>>>> consolidate the 9th column's data within itself.
>>>>>>>>>>>
>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>
>>>>>>>>>>> I tried the below code:
>>>>>>>>>>>
>>>>>>>>>>> df = (spark.read.option("header", "true")
>>>>>>>>>>>       .option("multiline", "true")
>>>>>>>>>>>       .option("inferSchema", "true")
>>>>>>>>>>>       .option("quote", '"')
>>>>>>>>>>>       .option("delimiter", ",")
>>>>>>>>>>>       .csv("path"))
>>>>>>>>>>>
>>>>>>>>>>> What else can I do?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Sid
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear Sid,
>>>>>>>>>>>>
>>>>>>>>>>>> can you please give us more info? Is it true that every line
>>>>>>>>>>>> may have a different number of columns? Is there any rule
>>>>>>>>>>>> followed by every line of the file? From the information you
>>>>>>>>>>>> have sent I cannot fully understand the "schema" of your data.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Apostolos
>>>>>>>>>>>>
>>>>>>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>>>>>>> > Hi Experts,
>>>>>>>>>>>> >
>>>>>>>>>>>> > I have the below CSV data, which is generated automatically.
>>>>>>>>>>>> > I can't change the data manually.
>>>>>>>>>>>> >
>>>>>>>>>>>> > The data looks like below:
>>>>>>>>>>>> >
>>>>>>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>>>>>>>>> >
>>>>>>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>>>>>>> >
>>>>>>>>>>>> > It is handled by the other people.,INR
>>>>>>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>>>>>>> >
>>>>>>>>>>>> > The third record is the problem: the user inserted new lines
>>>>>>>>>>>> > in the value while filling out the form. So, how do I handle
>>>>>>>>>>>> > this?
>>>>>>>>>>>> >
>>>>>>>>>>>> > There are 6 columns and 4 records in total. These are sample
>>>>>>>>>>>> > records.
>>>>>>>>>>>> >
>>>>>>>>>>>> > Should I load it as an RDD and then maybe use a regex to
>>>>>>>>>>>> > eliminate the new lines? Or how should it be done? With
>>>>>>>>>>>> > ". \n"?
>>>>>>>>>>>> >
>>>>>>>>>>>> > Any suggestions?
>>>>>>>>>>>> >
>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>> > Sid
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>>>>>>> Department of Informatics
>>>>>>>>>>>> Aristotle University of Thessaloniki
>>>>>>>>>>>> Thessaloniki, GREECE
>>>>>>>>>>>> tel: ++0030312310991918
>>>>>>>>>>>> email: papad...@csd.auth.gr
>>>>>>>>>>>> twitter: @papadopoulos_ap
>>>>>>>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>>>>>>>>
>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>> --
>>>>> Bjørn Jørgensen
>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>> Norge
>>>>>
>>>>> +47 480 94 297
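[Editor's note] The core issue in this thread, a field whose value spans several physical lines, is exactly what a quote-aware CSV parser handles. A minimal sketch with Python's standard csv module; the rows are adapted from Sid's sample, with the long comment quoted the way a conforming CSV writer would emit it (an assumption, since the real file may lack the quotes). Spark's .option("multiline", "true") together with .option("quote", '"') relies on the same convention:

```python
import csv
import io

# Sample adapted from the thread: the third record's comment field is quoted,
# so its embedded newline stays inside one logical record.
raw = (
    '2020-12-12,abc,2000,,INR,\n'
    '2020-12-09,cde,3000,he is a manager,DOLLARS,nothing\n'
    '2020-12-09,fgh,,software_developer,"I only manage the development part.\n'
    'It is handled by the other people.",INR\n'
    '2020-12-12,abc,2000,,USD,\n'
)

rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))     # 4 logical records, despite 5 physical lines
print(len(rows[2]))  # the multi-line record still has 6 columns
```

If the source file does not quote the multi-line field at all, no CSV option can recover the record boundaries unambiguously, which is why fixing the writer (or the escape/quote characters, as Sid eventually did) is the real solution.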
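[Editor's note] Apostolos's suggestion upthread, building an RDD of tuples first and then calling toDF(), can be sketched as follows. Only the parse step is runnable here; the Spark calls are left as comments because they need a live SparkSession, and the column names are hypothetical:

```python
import csv
import io

def parse_record(line: str) -> tuple:
    """Parse one CSV line into a tuple, honouring commas inside quoted fields."""
    return tuple(next(csv.reader(io.StringIO(line))))

# With a SparkSession in hand this becomes (untested sketch, hypothetical names):
#   rdd = spark.sparkContext.textFile("path").map(parse_record)
#   df  = rdd.toDF(["date", "name", "amount", "role", "comment", "currency"])

print(parse_record('2020-12-09,cde,3000,"he, is a manager",DOLLARS,nothing'))
```

Note that textFile() splits on physical newlines, so this route still requires the multi-line records to be normalized first; it mainly helps with commas embedded in quoted fields.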
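[Editor's note] On the escape-character question Apostolos raised: in RFC 4180-style CSV, a literal double quote inside a quoted field is written as two double quotes rather than backslash-escaped, and in Spark this corresponds to .option("escape", '"'). Python's csv module uses the doubled-quote convention by default; the sample line below is hypothetical:

```python
import csv
import io

# A quoted field containing both a comma and doubled ("") literal quotes.
line = '2020-12-12,abc,"he said ""approved"", twice",INR'
row = next(csv.reader(io.StringIO(line)))
print(row[2])  # he said "approved", twice
```

Getting this setting wrong is consistent with Sid's report that passing the wrong escape characters was the root cause.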