PR is opened: https://github.com/apache/spark/pull/6237
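For context on what the thread below converges on: the idea is to derive an aggregate expression's nullability from its input instead of hard-coding it to true. Here is a hypothetical, simplified Scala sketch of that idea — the names are illustrative, not Catalyst's actual classes, and this is not necessarily what the PR implements:

    // Hypothetical, simplified sketch: derive an aggregate's nullability
    // from its child expression instead of hard-coding it. Illustrative
    // names only -- not Catalyst's real classes.
    trait Expression { def nullable: Boolean }

    case class AttributeRef(name: String, nullable: Boolean) extends Expression

    case class Sum(child: Expression) extends Expression {
      // SUM over a non-nullable column can itself be marked non-nullable;
      // a hard-coded `nullable = true` mainly matters for a global
      // aggregation over zero rows, where SUM yields null.
      override def nullable: Boolean = child.nullable
    }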
On Fri, May 15, 2015 at 17:55, Olivier Girardot <ssab...@gmail.com> wrote:
> yes, please do and send me the link.
> @rxin I have trouble building master, but the code is done...
>
> On Fri, May 15, 2015 at 01:27, Haopu Wang <hw...@qilinsoft.com> wrote:
>
>> Thank you, should I open a JIRA for this issue?
>>
>> ------------------------------
>> From: Olivier Girardot [mailto:ssab...@gmail.com]
>> Sent: Tuesday, May 12, 2015 5:12 AM
>> To: Reynold Xin
>> Cc: Haopu Wang; user
>> Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
>>
>> I'll look into it - not sure yet what I can get out of exprs :p
>>
>> On Mon, May 11, 2015 at 22:35, Reynold Xin <r...@databricks.com> wrote:
>>
>> Thanks for catching this. I didn't read carefully enough.
>>
>> It'd make sense to have the UDAF result be non-nullable, if the exprs
>> are indeed non-nullable.
>>
>> On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot <ssab...@gmail.com>
>> wrote:
>>
>> Hi Haopu,
>> actually here `key` is nullable because that is your input's schema:
>>
>> scala> result.printSchema
>> root
>>  |-- key: string (nullable = true)
>>  |-- SUM(value): long (nullable = true)
>>
>> scala> df.printSchema
>> root
>>  |-- key: string (nullable = true)
>>  |-- value: long (nullable = false)
>>
>> I tried it with a schema where the key is not flagged as nullable, and
>> the schema is actually respected. What you can argue, however, is that
>> SUM(value) should also be non-nullable, since value is not nullable.
>>
>> @rxin do you think it would be reasonable to flag the Sum aggregation
>> function as nullable (or not) depending on the input expression's schema?
>>
>> Regards,
>>
>> Olivier.
>>
>> On Mon, May 11, 2015 at 22:07, Reynold Xin <r...@databricks.com> wrote:
>>
>> Not by design. Would you be interested in submitting a pull request?
>>
>> On Mon, May 11, 2015 at 1:48 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>>
>> I am trying to get the result schema of aggregate functions using the
>> DataFrame API.
>>
>> However, I find that the result fields for groupBy columns are always
>> nullable, even when the source field is not nullable.
>>
>> I want to know if this is by design. Thank you! Below is the simple
>> code to show the issue.
>>
>> ======
>>
>> import sqlContext.implicits._
>> import org.apache.spark.sql.functions._
>> case class Test(key: String, value: Long)
>> val df = sc.makeRDD(Seq(Test("k1", 2), Test("k1", 1))).toDF
>>
>> val result = df.groupBy("key").agg($"key", sum("value"))
>>
>> // From the output, you can see the "key" column is nullable -- why?
>> result.printSchema
>> // root
>> //  |-- key: string (nullable = true)
>> //  |-- SUM(value): long (nullable = true)
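For reference, a minimal sketch of the experiment Olivier mentions mid-thread — rebuilding the same input with an explicitly non-nullable schema so the grouped output schema can be compared. It assumes a Spark 1.4-era spark-shell where `sc` and `sqlContext` are already in scope:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    // Same data as the original repro, but with both fields explicitly
    // flagged nullable = false in the schema.
    val schema = StructType(Seq(
      StructField("key", StringType, nullable = false),
      StructField("value", LongType, nullable = false)))
    val rows = sc.parallelize(Seq(Row("k1", 2L), Row("k1", 1L)))
    val df2 = sqlContext.createDataFrame(rows, schema)

    // The grouping column "key" now keeps nullable = false; whether
    // SUM(value) should also stay non-nullable is the open question
    // in this thread.
    df2.groupBy("key").agg(sum("value")).printSchema()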