This would also probably improve performance: https://github.com/apache/spark/pull/9565
On Fri, Apr 15, 2016 at 8:44 AM, Hamel Kothari <hamelkoth...@gmail.com> wrote: > Hi all, > > So we have these UDFs which take <1ms to operate and we're seeing pretty > poor performance around them in practice, the overhead being >10ms for the > projections (this data is deeply nested with ArrayTypes and MapTypes so > that could be the cause). Looking at the logs and code for ScalaUDF, I > noticed that there are a series of projections which take place before and > after in order to make the Rows safe and then unsafe again. Is there any > way to opt out of this and input/return InternalRows to skip the > performance hit of the type conversion? It doesn't immediately appear to be > possible but I'd like to make sure that I'm not missing anything. > > I suspect we could make this possible by checking if typetags in the > register function are all internal types, if they are, passing a false > value for "needs[Input|Output]Conversion" to ScalaUDF and then in ScalaUDF > checking for that flag to figure out if the conversion process needs to > take place. We're still left with the issue of missing a schema in the case > of outputting InternalRows, but we could expose the DataType parameter > rather than inferring it in the register function. Is there anything else > in the code that would prevent this from working? > > Regards, > Hamel >