map and flatMap are RDD operations; a UDF is a DataFrame operation. From a performance perspective, the big difference is in the query optimizer. A UDF declares the set of input fields it needs and the set of output fields it will produce, whereas map operates on the entire row at a time. Because of that, the optimizer can move operations around a UDF and potentially drop unused columns earlier to make the overall processing more efficient. A map operation requires the entire row as input, so the optimizer cannot do anything with it, and it does not know what the output is going to look like unless you explicitly tell it.

In reality, though, UDFs are compiled down to map operations on an RDD, with some glue code to get the columns in the correct place. So there should be little performance difference if you can manually build a query that is similar to what the Catalyst optimizer would have built.
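To make the column-pruning point concrete, here is a toy model in plain Python. It is not Spark code and does not use any Spark API: UdfLike, run_udf_like and run_row_map are hypothetical names invented for this sketch, and the "planner" is just a set computation. It only shows why declared inputs and outputs let a planner read fewer columns, while an opaque row-level function forces it to hand over everything.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Set, Tuple

Row = Dict[str, object]


@dataclass
class UdfLike:
    """Hypothetical stand-in for a UDF: its inputs and output are declared."""
    inputs: Sequence[str]       # columns the function reads
    output: str                 # column the function produces
    fn: Callable[..., object]   # called with only the input values


def run_udf_like(rows: List[Row], step: UdfLike,
                 keep: Sequence[str]) -> Tuple[List[Row], Set[str]]:
    # Inputs and output are known up front, so the toy planner can work out
    # which source columns matter before touching any data.
    needed = (set(keep) - {step.output}) | set(step.inputs)
    result = []
    for row in rows:
        pruned = {c: row[c] for c in needed}  # drop unused columns early
        pruned[step.output] = step.fn(*(pruned[c] for c in step.inputs))
        result.append({c: pruned[c] for c in keep})
    return result, needed


def run_row_map(rows: List[Row], fn: Callable[[Row], Row],
                keep: Sequence[str]) -> Tuple[List[Row], Set[str]]:
    # fn is opaque: the toy planner cannot tell which columns it reads,
    # so every column of every row has to be passed in.
    columns_read: Set[str] = set()
    result = []
    for row in rows:
        columns_read |= set(row)
        new_row = fn(dict(row))
        result.append({c: new_row[c] for c in keep})  # project only afterwards
    return result, columns_read


rows = [
    {"id": 1, "name": "a", "price": 2.0, "qty": 3, "notes": "x"},
    {"id": 2, "name": "b", "price": 1.5, "qty": 4, "notes": "y"},
]
keep = ("id", "total")

total = UdfLike(inputs=("price", "qty"), output="total", fn=lambda p, q: p * q)
udf_result, udf_cols = run_udf_like(rows, total, keep)

map_result, map_cols = run_row_map(
    rows, lambda r: {**r, "total": r["price"] * r["qty"]}, keep
)
```

Both paths produce the same rows, but the UDF-like path only ever reads id, price and qty, while the row-map path reads all five columns. If you pruned the columns yourself before calling the row-level function, the two would do the same work, which is the "manually build a similar query" case.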
On Sun, Mar 17, 2019 at 1:42 PM kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> I am wondering what is the difference between UDF execution and
> map(someLambda)? you can assume someLambda ~= UDF. Any performance
> difference?
>
> Thanks!