I expect 10 columns, but in the dataset I observed that some records have 11 columns of data (the additional column is marked as null). How do I handle this?
Thanks,
Sid

On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:

> How can I do that? Any examples or links, please. So, this works well with
> pandas I suppose. It's just that I need to convert back to the spark data
> frame by providing a schema but since we are using a lower spark version
> and pandas won't work in a distributed way in the lower versions,
> therefore, was wondering if spark could handle this in a much better way.
>
> Thanks,
> Sid
>
> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>
>> Forgot to reply-all last message, whoops. Not very good at email.
>>
>> You need to normalize the CSV with a parser that can escape commas inside
>> of strings
>> Not sure if Spark has an option for this?
>>
>>
>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Thank you so much for your time.
>>>
>>> I have data like below which I tried to load by setting multiple options
>>> while reading the file but however, but I am not able to consolidate the
>>> 9th column data within itself.
>>>
>>> [image: image.png]
>>>
>>> I tried the below code:
>>>
>>> df = (spark.read.option("header", "true")
>>>       .option("multiline", "true")
>>>       .option("inferSchema", "true")
>>>       .option("quote", '"')
>>>       .option("delimiter", ",")
>>>       .csv("path"))
>>>
>>> What else I can do?
>>>
>>> Thanks,
>>> Sid
>>>
>>>
>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:
>>>
>>>> Dear Sid,
>>>>
>>>> can you please give us more info? Is it true that every line may have a
>>>> different number of columns? Is there any rule followed by every line of
>>>> the file? From the information you have sent I cannot fully understand
>>>> the "schema" of your data.
>>>>
>>>> Regards,
>>>>
>>>> Apostolos
>>>>
>>>>
>>>> On 25/5/22 23:06, Sid wrote:
>>>> > Hi Experts,
>>>> >
>>>> > I have below CSV data that is getting generated automatically. I can't
>>>> > change the data manually.
>>>> >
>>>> > The data looks like below:
>>>> >
>>>> > 2020-12-12,abc,2000,,INR,
>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>> >
>>>> > Since I don't have much experience with the other domains.
>>>> >
>>>> > It is handled by the other people.,INR
>>>> > 2020-12-12,abc,2000,,USD,
>>>> >
>>>> > The third record is a problem. Since the value is separated by the new
>>>> > line by the user while filling up the form. So, how do I handle this?
>>>> >
>>>> > There are 6 columns and 4 records in total. These are the sample records.
>>>> >
>>>> > Should I load it as RDD and then may be using a regex should eliminate
>>>> > the new lines? Or how it should be? with ". /n" ?
>>>> >
>>>> > Any suggestions?
>>>> >
>>>> > Thanks,
>>>> > Sid
>>>>
>>>> --
>>>> Apostolos N. Papadopoulos, Associate Professor
>>>> Department of Informatics
>>>> Aristotle University of Thessaloniki
>>>> Thessaloniki, GREECE
>>>> tel: ++0030312310991918
>>>> email: papad...@csd.auth.gr
>>>> twitter: @papadopoulos_ap
>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
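
For reference, here is a minimal sketch of the "normalize the CSV before Spark sees it" idea from this thread, written against the 6-column sample above. It is only one possible approach and rests on assumptions that are not confirmed anywhere in the thread: that every real record starts with a date such as 2020-12-12 followed by a comma, and that a continuation line never does. The names merge_continuation_lines, normalize and to_quoted_csv are hypothetical helpers, not Spark or pandas APIs, and the code is plain single-machine Python rather than anything distributed. Rows whose field count differs from the expected one are set aside for inspection instead of being repaired by guesswork.

```python
import csv
import io
import re

# Assumption: every logical record starts with an ISO date and a comma,
# as in the sample rows; continuation lines of a broken field do not.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2},")
EXPECTED_COLUMNS = 6  # the sample in this thread has 6 columns

SAMPLE = """2020-12-12,abc,2000,,INR,
2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
2020-12-09,fgh,,software_developer,I only manage the development part.

Since I don't have much experience with the other domains.

It is handled by the other people.,INR
2020-12-12,abc,2000,,USD,
"""


def merge_continuation_lines(lines):
    """Glue physical lines that do not start a new record onto the previous one."""
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # blank lines inside a broken field carry no data
        if RECORD_START.match(line) or not records:
            # A new record, or the very first line (which may be a header).
            records.append(line)
        else:
            # Continuation of a free-text field: replace the newline with a space.
            records[-1] += " " + line.strip()
    return records


def normalize(lines, expected_columns=EXPECTED_COLUMNS):
    """Return (good_rows, suspect_rows); suspect rows have an unexpected field count."""
    good, suspect = [], []
    for record in merge_continuation_lines(lines):
        fields = next(csv.reader([record]))
        (good if len(fields) == expected_columns else suspect).append(fields)
    return good, suspect


def to_quoted_csv(rows):
    """Serialize rows with every field quoted, one record per line."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    good_rows, suspect_rows = normalize(SAMPLE.splitlines())
    print(to_quoted_csv(good_rows))
    print("rows needing manual inspection:", suspect_rows)
```

The rewritten output has one record per line with every field quoted, so a reader configured with the quote and delimiter options already tried above should no longer see the third record split across rows. Records with 11 fields where 10 are expected would land in the suspect list; how to fold the extra field back in depends on which column actually contains the stray commas, which the sample data does not show.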