Hi Sean, German and others,

Setting the “nullValue” option (for parsing CSV, at least) seems to be an exercise in futility.
When parsing the file, com.univocity.parsers.common.input.AbstractCharInputReader#getString contains the following logic:

    String out;
    if (len <= 0) {
        out = nullValue;
    } else {
        out = new String(buffer, pos, len);
    }

resulting in nullValue being assigned to the column value whenever the field has zero length, such as an empty String.

Later, org.apache.spark.sql.catalyst.csv.UnivocityParser#nullSafeDatum is called on the column value:

    if (datum == options.nullValue || datum == null) {
      if (!nullable) {
        throw new RuntimeException(s"null value found but field $name is not nullable.")
      }
      null
    } else {
      converter.apply(datum)
    }

Therefore, the empty String is first converted to the nullValue, then matched against the nullValue and, bingo, we get a literal null.

For now, the “.na.fill("")” addition to the code is doing the right thing for me.

Thanks for all the help.

Steve C

On 1 Aug 2020, at 1:40 am, Sean Owen <sro...@gmail.com> wrote:

Try setting nullValue to anything besides the empty string. Because its default is the empty string, empty strings become null by default.

On Fri, Jul 31, 2020 at 3:20 AM Stephen Coy <s...@infomedia.com.au.invalid> wrote:

That does not work. This is Spark 3.0, by the way.

I have been looking at the Spark unit tests and there do not seem to be any that load a CSV text file and verify that an empty string maps to an empty string, which I think is supposed to be the default behaviour because the “nullValue” option defaults to "".
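The two code paths described above can be reproduced outside Spark. The following standalone Java sketch is purely illustrative (the method names mirror the Spark/univocity ones, but this is not the real implementation); it shows why no choice of nullValue helps for empty fields: step one substitutes nullValue for any zero-length field, and step two maps anything equal to nullValue to a literal null.

```java
public class NullValueFutility {
    // Step 1: mimics AbstractCharInputReader#getString — a zero-length
    // field is replaced by the configured nullValue.
    static String getString(String raw, String nullValue) {
        return (raw == null || raw.isEmpty()) ? nullValue : raw;
    }

    // Step 2: mimics UnivocityParser#nullSafeDatum — any datum equal to
    // nullValue (or already null) becomes a literal null.
    static String nullSafeDatum(String datum, String nullValue) {
        return (datum == null || datum.equals(nullValue)) ? null : datum;
    }

    public static void main(String[] args) {
        for (String nv : new String[]{"", "NULL", "\\N"}) {
            // An empty field passes through both steps...
            String result = nullSafeDatum(getString("", nv), nv);
            // ...and ends up null no matter which nullValue we chose.
            System.out.println("nullValue=\"" + nv + "\" -> " + result);
        }
    }
}
```

A non-empty field such as "x" survives both steps unchanged, which is why only empty columns are affected.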
Thanks anyway,

Steve C

On 30 Jul 2020, at 10:01 pm, German Schiavon Matteo <gschiavonsp...@gmail.com> wrote:

Hey,

I understand that the empty values in your CSV are "". If so, try this option:

    .option("emptyValue", "\"\"")

Hope it helps.

On Thu, 30 Jul 2020 at 08:49, Stephen Coy <s...@infomedia.com.au.invalid> wrote:

Hi there,

I’m trying to import a tab delimited file with:

    Dataset<Row> catalogData = sparkSession
        .read()
        .option("sep", "\t")
        .option("header", "true")
        .csv(args[0])
        .cache();

This works great, except for the fact that any column that is empty is given the value null, when I need these values to be literal empty strings. Is there any option combination that will achieve this?

Thanks,

Steve C
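To make the behaviour in question concrete, here is a small plain-Java sketch (no Spark dependency; all names are illustrative) of what happens to an empty tab-delimited column under the default options, and what the .na.fill("") workaround from earlier in the thread effectively does afterwards: any column that parsed as null is restored to a literal empty string.

```java
import java.util.Arrays;

public class NaFillSketch {
    // Mimics Spark CSV parsing with default options: empty fields become null.
    static String[] parseTsvLine(String line) {
        String[] cols = line.split("\t", -1);      // -1 keeps trailing empty fields
        for (int i = 0; i < cols.length; i++) {
            if (cols[i].isEmpty()) cols[i] = null; // empty string -> null
        }
        return cols;
    }

    // Mimics the effect of .na.fill(""): null string columns become "".
    static String[] naFillEmpty(String[] cols) {
        String[] out = new String[cols.length];
        for (int i = 0; i < cols.length; i++) {
            out[i] = (cols[i] == null) ? "" : cols[i];
        }
        return out;
    }

    public static void main(String[] args) {
        String[] parsed = parseTsvLine("a\t\tc");
        System.out.println(Arrays.toString(parsed));              // [a, null, c]
        System.out.println(Arrays.toString(naFillEmpty(parsed))); // [a, , c]
    }
}
```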