Map and flatmap are RDD operations, a UDF is a dataframe operation.  The
big difference from a performance perspective is in the query optimizer.  A
udf defines the set of input fields it needs and the set of output fields
it will produce, map operates on the entire row at a time.  This means the
optimizer can move operations around, and potentially drop columns earlier
to try and make the overall processing more efficient.  A map operation
requires the entire row as input so the optimizer cannot do anything to it,
and does not know what the output is going to look like unless you
explicitly tell it.  But in reality, udfs are compiled down to map
operations on an RDD with some glue code to get the columns in the correct
place, so there should be little performance difference if you can manually
build a query that is similar to what the catalyst optimizer would have
built.

On Sun, Mar 17, 2019 at 1:42 PM kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> I am wondering what is the difference between UDF execution and
> map(someLambda)? you can assume someLambda ~= UDF. Any performance
> difference?
>
> Thanks!
>
>
>

Reply via email to