Sid, dump one of your files.
https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
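
For the record that is broken across several lines, one possible pre-processing step is sketched below in plain Python. It is only an illustration, not something proposed in this thread as written. It assumes the file has 6 columns (as in the sample from the original message), that no field contains a comma, and that the broken field is not the last column. The names merge_broken_records and to_quoted_csv are made up for this example.

```python
import csv
import io

EXPECTED_COLUMNS = 6  # assumption: taken from the sample in the original message


def merge_broken_records(lines, expected_columns=EXPECTED_COLUMNS, delimiter=","):
    """Yield one list of fields per logical record.

    Physical lines are accumulated until the buffer contains enough
    delimiters for a full record. Assumes no field contains the delimiter.
    """
    needed = expected_columns - 1
    buffer, seen = [], 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not buffer and not line:
            continue  # ignore blank lines between records
        buffer.append(line)
        seen += line.count(delimiter)
        if seen >= needed:
            yield "\n".join(buffer).split(delimiter)
            buffer, seen = [], 0
    if buffer:
        # incomplete trailing record, passed through for inspection
        yield "\n".join(buffer).split(delimiter)


def to_quoted_csv(records):
    """Serialize records so that fields with embedded newlines are quoted."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(records)
    return out.getvalue()


# Sample data copied from the original message.
sample = """2020-12-12,abc,2000,,INR,
2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
2020-12-09,fgh,,software_developer,I only manage the development part.

Since I don't have much experience with the other domains.

It is handled by the other people.,INR
2020-12-12,abc,2000,,USD,
"""

records = list(merge_broken_records(sample.splitlines()))
normalized = to_quoted_csv(records)
```

Once the multi-line value is wrapped in quotes like this, the normalized text could be written back to a file and read with the same spark.read options already tried in this thread (quote plus multiline), since those options rely on the field being quoted. Records with an extra column, as in the 10 vs. 11 column case, would still need a separate check on the length of each row.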

On Wed, May 25, 2022 at 23:04, Sid <flinkbyhe...@gmail.com> wrote:

> I have 10 columns, but in the dataset I observed that some records
> have 11 columns of data (the additional column is marked as null).
> How do I handle this?
>
> Thanks,
> Sid
>
> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>
>> How can I do that? Any examples or links, please. This works well
>> with pandas, I suppose. It's just that I need to convert back to a
>> Spark DataFrame by providing a schema, but since we are using a lower
>> Spark version and pandas won't work in a distributed way in the lower
>> versions, I was wondering if Spark could handle this in a better way.
>>
>> Thanks,
>> Sid
>>
>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>>
>>> Forgot to reply-all last message, whoops. Not very good at email.
>>>
>>> You need to normalize the CSV with a parser that can escape commas
>>> inside of strings.
>>> Not sure if Spark has an option for this?
>>>
>>>
>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Thank you so much for your time.
>>>>
>>>> I have data like below, which I tried to load by setting multiple
>>>> options while reading the file, but I am not able to consolidate
>>>> the 9th column data within itself.
>>>>
>>>> [image: image.png]
>>>>
>>>> I tried the below code:
>>>>
>>>> df = (
>>>>     spark.read.option("header", "true")
>>>>     .option("multiline", "true")
>>>>     .option("inferSchema", "true")
>>>>     .option("quote", '"')
>>>>     .option("delimiter", ",")
>>>>     .csv("path")
>>>> )
>>>>
>>>> What else can I do?
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>>
>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>>> papad...@csd.auth.gr> wrote:
>>>>
>>>>> Dear Sid,
>>>>>
>>>>> can you please give us more info? Is it true that every line may
>>>>> have a different number of columns? Is there any rule followed by
>>>>> every line of the file? From the information you have sent I cannot
>>>>> fully understand the "schema" of your data.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Apostolos
>>>>>
>>>>>
>>>>> On 25/5/22 23:06, Sid wrote:
>>>>> > Hi Experts,
>>>>> >
>>>>> > I have the CSV data below, which is generated automatically. I
>>>>> > can't change the data manually.
>>>>> >
>>>>> > The data looks like this:
>>>>> >
>>>>> > 2020-12-12,abc,2000,,INR,
>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>> >
>>>>> > Since I don't have much experience with the other domains.
>>>>> >
>>>>> > It is handled by the other people.,INR
>>>>> > 2020-12-12,abc,2000,,USD,
>>>>> >
>>>>> > The third record is the problem: the user split the value across
>>>>> > new lines while filling in the form. So, how do I handle this?
>>>>> >
>>>>> > There are 6 columns and 4 records in total. These are sample
>>>>> > records.
>>>>> >
>>>>> > Should I load it as an RDD and then maybe use a regex to eliminate
>>>>> > the new lines? Or how should it be done? With ". /n"?
>>>>> >
>>>>> > Any suggestions?
>>>>> >
>>>>> > Thanks,
>>>>> > Sid
>>>>>
>>>>> --
>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>> Department of Informatics
>>>>> Aristotle University of Thessaloniki
>>>>> Thessaloniki, GREECE
>>>>> tel: ++0030312310991918
>>>>> email: papad...@csd.auth.gr
>>>>> twitter: @papadopoulos_ap
>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org