How can I do that? Any examples or links would be appreciated. I suppose this works well with pandas, but I would then need to convert the result back to a Spark DataFrame by providing a schema. We are on an older Spark version, where pandas does not run in a distributed way, so I was wondering whether Spark itself could handle this better.

Thanks,
Sid

On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:

> Forgot to reply-all last message, whoops. Not very good at email.
>
> You need to normalize the CSV with a parser that can escape commas inside
> of strings
> Not sure if Spark has an option for this?
>
>
> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Thank you so much for your time.
>>
>> I have data like below which I tried to load by setting multiple options
>> while reading the file but however, but I am not able to consolidate the
>> 9th column data within itself.
>>
>> [image: image.png]
>>
>> I tried the below code:
>>
>> df = (spark.read.option("header", "true").option("multiline", "true")
>>       .option("inferSchema", "true").option("quote", '"')
>>       .option("delimiter", ",").csv("path"))
>>
>> What else I can do?
>>
>> Thanks,
>> Sid
>>
>>
>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>> papad...@csd.auth.gr> wrote:
>>
>>> Dear Sid,
>>>
>>> can you please give us more info? Is it true that every line may have a
>>> different number of columns? Is there any rule followed by
>>> every line of the file? From the information you have sent I cannot
>>> fully understand the "schema" of your data.
>>>
>>> Regards,
>>>
>>> Apostolos
>>>
>>>
>>> On 25/5/22 23:06, Sid wrote:
>>> > Hi Experts,
>>> >
>>> > I have below CSV data that is getting generated automatically. I can't
>>> > change the data manually.
>>> >
>>> > The data looks like below:
>>> >
>>> > 2020-12-12,abc,2000,,INR,
>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>> >
>>> > Since I don't have much experience with the other domains.
>>> >
>>> > It is handled by the other people.,INR
>>> > 2020-12-12,abc,2000,,USD,
>>> >
>>> > The third record is a problem. Since the value is separated by the new
>>> > line by the user while filling up the form. So, how do I handle this?
>>> >
>>> > There are 6 columns and 4 records in total. These are the sample records.
>>> >
>>> > Should I load it as RDD and then may be using a regex should eliminate
>>> > the new lines? Or how it should be? with ". /n" ?
>>> >
>>> > Any suggestions?
>>> >
>>> > Thanks,
>>> > Sid
>>>
>>> --
>>> Apostolos N. Papadopoulos, Associate Professor
>>> Department of Informatics
>>> Aristotle University of Thessaloniki
>>> Thessaloniki, GREECE
>>> tel: ++0030312310991918
>>> email: papad...@csd.auth.gr
>>> twitter: @papadopoulos_ap
>>> web: http://datalab.csd.auth.gr/~apostol
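
One way to approach the question in the original post ("load it as RDD and eliminate the new lines") is to merge physical lines until a record has the expected number of fields. The sketch below is only an illustration in plain Python, not something proposed in the thread. The helper name `merge_broken_records` is hypothetical, and it assumes that every record has exactly 6 columns, that no field contains a comma, and that the last column never contains a line break. If any of these assumptions fails, the records will be split incorrectly.

```python
def merge_broken_records(lines, expected_fields=6, delimiter=","):
    """Join physical lines into logical records of `expected_fields` columns.

    Assumptions (not guaranteed by the original data): fields contain no
    delimiter characters, and the final column never contains a newline.
    """
    records = []
    pending = []  # pieces of the record currently being assembled
    for raw in lines:
        piece = raw.strip()
        if not piece:
            continue  # blank physical lines carry no data
        pending.append(piece)
        joined = " ".join(pending)  # embedded newlines become single spaces
        if joined.count(delimiter) >= expected_fields - 1:
            records.append(joined.split(delimiter))
            pending = []
    if pending:
        # Incomplete trailing record: keep it so that nothing is silently lost.
        records.append(" ".join(pending).split(delimiter))
    return records


# Sample data copied from the original post.
sample_lines = [
    "2020-12-12,abc,2000,,INR,",
    "2020-12-09,cde,3000,he is a manager,DOLLARS,nothing",
    "2020-12-09,fgh,,software_developer,I only manage the development part.",
    "",
    "Since I don't have much experience with the other domains.",
    "",
    "It is handled by the other people.,INR",
    "2020-12-12,abc,2000,,USD,",
]

records = merge_broken_records(sample_lines)
for record in records:
    print(len(record), record)
```

In Spark, a function like this could probably be applied per file, for example by reading with `sparkContext.wholeTextFiles(...)`, calling the helper on each file's `splitlines()` inside a `flatMap`, and passing the result to `spark.createDataFrame(...)` with an explicit schema. Note that `wholeTextFiles` loads each file in full on a single executor, so this fits many moderately sized files better than one very large file. A header line, if present, would come through as the first record and would need to be filtered out.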