I see. We're having problems with code like this (forgive my noob scala):

val df = Seq(("moose","ice"), (null,"fire")).toDF("animals", "elements")
df
  .filter($"animals".rlike(".*"))
  .filter(callUDF({(value: String) => value.length > 2}, BooleanType, $"animals"))
  .collect()
This code throws an NPE because:
* Catalyst combines the filters with an AND
* the first filter returns null on the row where animals is null
* the second filter tries to read the length of that null

This feels weird. Reading that code, I wouldn't expect null to be passed to the second filter. Even weirder: if you call collect() after the first filter you won't see nulls, and if you write the data to disk and reread it, the NPE won't happen. It's bewildering! Is this the intended behavior?
________________________________
From: Reynold Xin [r...@databricks.com]
Sent: Monday, September 14, 2015 10:14 PM
To: Zack Sampson
Cc: dev@spark.apache.org
Subject: Re: And.eval short circuiting

rxin=# select null and true;
 ?column?
----------

(1 row)

rxin=# select null and false;
 ?column?
----------
 f
(1 row)

null and false should return false.

On Mon, Sep 14, 2015 at 9:12 PM, Zack Sampson <zsamp...@palantir.com> wrote:

It seems like And.eval can avoid calculating right.eval if left.eval returns null. Is there a reason it's written like it is?

override def eval(input: Row): Any = {
  val l = left.eval(input)
  if (l == false) {
    false
  } else {
    val r = right.eval(input)
    if (r == false) {
      false
    } else {
      if (l != null && r != null) {
        true
      } else {
        null
      }
    }
  }
}
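The psql output above reflects SQL's three-valued logic for AND. A minimal sketch in plain Scala (sqlAnd is my name, not Spark's; Option[Boolean] with None standing in for SQL NULL) shows why And.eval cannot short-circuit when the left side is null:

```scala
// SQL three-valued AND: FALSE dominates, NULL is "unknown".
// NULL AND FALSE = FALSE, so a null left side still requires
// evaluating the right side before the result is known.
def sqlAnd(l: Option[Boolean], r: Option[Boolean]): Option[Boolean] =
  (l, r) match {
    case (Some(false), _) | (_, Some(false)) => Some(false) // either side false => false
    case (Some(true), Some(true))            => Some(true)  // both known true => true
    case _                                   => None        // otherwise unknown (NULL)
  }
```

Because NULL AND FALSE is FALSE, And.eval must evaluate the right operand even when the left evaluates to null, which is exactly how the second filter's UDF can end up seeing a null input.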
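Given that Catalyst may combine the filters and feed null rows to the UDF, one defensive fix is to make the predicate itself null-tolerant. A sketch in plain Scala (longEnough is a name I made up, not from the thread):

```scala
// Null-tolerant version of the predicate from the original code:
// explicitly rejects null instead of dereferencing it, so it is safe
// no matter how the optimizer reorders or combines the filters.
val longEnough: String => Boolean = value => value != null && value.length > 2

// Hypothetical usage in the DataFrame API would look something like:
//   df.filter(udf(longEnough).apply($"animals"))
```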