Re: Tab delimited csv import and empty columns

2020-08-05 Thread Stephen Coy
Hi Sean, German and others,
Setting the “nullValue” option (for parsing CSV at least) seems to be an exercise in futility. When parsing the file, com.univocity.parsers.common.input.AbstractCharInputReader#getString contains the following logic:
String out; if (len <= 0) { out =

Re: Tab delimited csv import and empty columns

2020-07-31 Thread Vladimir Ryzhov
Would *df.na.fill("")* do the trick?
On Fri, Jul 31, 2020 at 8:43 AM Sean Owen wrote:
> Try setting nullValue to anything besides the empty string. Because its default is the empty string, empty strings become null by default.
>
> On Fri, Jul 31, 2020 at 3:20 AM Stephen Coy wrote:
> >>
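In the Java API used in the original post, this suggestion would be spelled catalogData.na().fill(""), which replaces nulls in string columns with the given value after the file has been loaded. The sketch below shows only that effect on a single row, in plain Java with no Spark dependency; FillNullsSketch and fillNulls are hypothetical names introduced for illustration and are not part of Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the effect of filling nulls; this is NOT Spark code.
class FillNullsSketch {

    // Returns a copy of the row in which every null is replaced by the given value.
    static List<String> fillNulls(List<String> row, String replacement) {
        List<String> filled = new ArrayList<>(row.size());
        for (String value : row) {
            filled.add(value == null ? replacement : value);
        }
        return filled;
    }

    public static void main(String[] args) {
        // A row as it might look after import, with the empty column read as null.
        List<String> row = Arrays.asList("sku-1", null, "blue");
        // After filling, the middle column is an empty string instead of null.
        System.out.println(fillNulls(row, ""));
    }
}
```

Because this runs after parsing, it does not depend on how the CSV reader treats empty fields, but it also cannot tell a field that was empty in the file from one that was null for any other reason.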

Re: Tab delimited csv import and empty columns

2020-07-31 Thread Sean Owen
Try setting nullValue to anything besides the empty string. Because its default is the empty string, empty strings become null by default.
On Fri, Jul 31, 2020 at 3:20 AM Stephen Coy wrote:
> That does not work.
>
> This is Spark 3.0 by the way.
>
> I have been looking at the Spark unit tests
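The rule stated in this message (a field equal to nullValue is read as null, and nullValue defaults to the empty string) can be modelled in a few lines of plain Java. This is a simplified sketch of the rule as described, not the actual Spark or univocity parsing code, and the 2020-08-05 reply in this thread reports that changing the option did not help in practice. NullValueSketch and parseLine are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the rule described above; NOT Spark's or univocity's real parser.
class NullValueSketch {

    // Splits a tab-delimited line and maps every field equal to nullValue to null.
    static List<String> parseLine(String line, String nullValue) {
        List<String> fields = new ArrayList<>();
        // A limit of -1 keeps trailing empty fields, which split() would otherwise drop.
        for (String raw : line.split("\t", -1)) {
            fields.add(raw.equals(nullValue) ? null : raw);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "sku-1\t\tblue";
        // With the default-like setting (empty string), the middle field becomes null.
        System.out.println(parseLine(line, ""));
        // With any other marker, the middle field stays an empty string in this model.
        System.out.println(parseLine(line, "\\N"));
    }
}
```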

Re: Tab delimited csv import and empty columns

2020-07-31 Thread Stephen Coy
That does not work.
This is Spark 3.0, by the way.
I have been looking at the Spark unit tests, and none of them seems to load a CSV text file and verify that an empty string maps to an empty string, which I think is supposed to be the default behaviour because the “nullValue”

Re: Tab delimited csv import and empty columns

2020-07-30 Thread German Schiavon Matteo
Hey,
I understand that the empty values in your CSV are "". If so, try this option:
*.option("emptyValue", "\"\"")*
Hope it helps.
On Thu, 30 Jul 2020 at 08:49, Stephen Coy wrote:
> Hi there,
>
> I’m trying to import a tab delimited file with:
>
> Dataset<Row> catalogData = sparkSession
>

Tab delimited csv import and empty columns

2020-07-30 Thread Stephen Coy
Hi there,
I’m trying to import a tab delimited file with:
Dataset<Row> catalogData = sparkSession
    .read()
    .option("sep", "\t")
    .option("header", "true")
    .csv(args[0])
    .cache();
This works great, except that any column that is empty is given the value null, when I need these