Hi Sean, German and others,

Setting the “nullValue” option (for parsing CSV at least) seems to be an 
exercise in futility.

When parsing the file, 
com.univocity.parsers.common.input.AbstractCharInputReader#getString contains 
the following logic:


String out;
if (len <= 0) {
   out = nullValue;
} else {
   out = new String(buffer, pos, len);
}

resulting in the nullValue being assigned to the column value if it has zero 
length, such as with an empty String.
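That branch is easy to reproduce in a standalone sketch (plain Java; the class name and the way the buffer is set up here are illustrative, not univocity's actual API):

```java
// Simplified sketch of the AbstractCharInputReader#getString branch:
// a zero-length field is replaced by the configured nullValue.
public class GetStringSketch {

    static String getString(char[] buffer, int pos, int len, String nullValue) {
        String out;
        if (len <= 0) {
            out = nullValue;                 // empty field -> nullValue
        } else {
            out = new String(buffer, pos, len);
        }
        return out;
    }

    public static void main(String[] args) {
        char[] buf = "a,,c".toCharArray();
        // Spark's CSV reader defaults nullValue to the empty string,
        // so an empty field comes back as "" here ... for now.
        System.out.println("[" + getString(buf, 2, 0, "") + "]");
        System.out.println("[" + getString(buf, 0, 1, "") + "]");
    }
}
```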

Later, org.apache.spark.sql.catalyst.csv.UnivocityParser#nullSafeDatum is 
called on the column value:


if (datum == options.nullValue || datum == null) {
  if (!nullable) {
    throw new RuntimeException(s"null value found but field $name is not nullable.")
  }
  null
} else {
  converter.apply(datum)
}

Therefore, the empty String is first converted to the nullValue, and then 
matched against the nullValue and, bingo, we get the literal null.
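Chaining the two steps in a standalone sketch (again plain Java, with illustrative names rather than the real Spark/univocity signatures) makes the futility concrete: whatever sentinel the parser substitutes for an empty field is exactly the value the equality check then matches, so the outcome is null no matter what nullValue is set to:

```java
// Illustration of why changing nullValue doesn't help: the sentinel
// substituted in step 1 is the same sentinel matched in step 2.
public class NullSafeSketch {

    static String parseField(int len, String raw, String nullValue) {
        // step 1: univocity-style substitution for zero-length fields
        String datum = (len <= 0) ? nullValue : raw;
        // step 2: Spark-style nullSafeDatum comparison
        if (datum == null || datum.equals(nullValue)) {
            return null;
        }
        return datum;
    }

    public static void main(String[] args) {
        // default nullValue "" -> the empty field becomes null
        System.out.println(parseField(0, "", "") == null);
        // a non-empty sentinel fares no better: the empty field is
        // rewritten to the sentinel first, then matched against it
        System.out.println(parseField(0, "", "SENTINEL") == null);
        // a genuinely non-empty field survives
        System.out.println(parseField(1, "a", ""));
    }
}
```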

For now, the “.na.fill("")” addition to the code is doing the right thing for 
me.
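For string columns, that call simply replaces null with the empty string after parsing is done. A plain-Java sketch of the effect (not Spark's actual implementation of DataFrameNaFunctions):

```java
import java.util.Arrays;

// Sketch of what .na().fill("") does to a string column:
// every null produced by the CSV read is backfilled to "".
public class NaFillSketch {

    static String[] fillEmpty(String[] column) {
        return Arrays.stream(column)
                .map(v -> v == null ? "" : v)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] parsed = {"a", null, "c"};   // nulls from the CSV read
        System.out.println(Arrays.toString(fillEmpty(parsed)));
    }
}
```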

Thanks for all the help.


Steve C


On 1 Aug 2020, at 1:40 am, Sean Owen <sro...@gmail.com> wrote:

Try setting nullValue to anything besides the empty string. Because its default 
is the empty string, empty strings become null by default.

On Fri, Jul 31, 2020 at 3:20 AM Stephen Coy <s...@infomedia.com.au.invalid> wrote:
That does not work.

This is Spark 3.0 by the way.

I have been looking at the Spark unit tests, and there do not seem to be any 
that load a CSV text file and verify that an empty string maps to an empty 
string, which I think is supposed to be the default behaviour because the 
“nullValue” option defaults to "".

Thanks anyway

Steve C

On 30 Jul 2020, at 10:01 pm, German Schiavon Matteo <gschiavonsp...@gmail.com> wrote:

Hey,

I understand that the empty values in your CSV are ""; if so, try this option:

.option("emptyValue", "\"\"")

Hope it helps

On Thu, 30 Jul 2020 at 08:49, Stephen Coy <s...@infomedia.com.au.invalid> wrote:
Hi there,

I’m trying to import a tab delimited file with:

Dataset<Row> catalogData = sparkSession
  .read()
  .option("sep", "\t")
  .option("header", "true")
  .csv(args[0])
  .cache();

This works great, except for the fact that any column that is empty is given 
the value null, when I need these values to be literal empty strings.

Is there any option combination that will achieve this?

Thanks,

Steve C




