`Dataset.mapPartitions` takes `func: Iterator[T] => Iterator[U]`, which means Spark needs to deserialize its internal binary row format into objects of type `T` before calling `func`, and this deserialization is costly even when `func` does nothing.
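As a minimal sketch of where that cost comes from (assuming a `SparkSession` named `spark`; the `Person` case class is purely illustrative):

```scala
// Illustrative only: even a pass-through func forces Spark to decode
// every element from its internal binary format into a Person object.
case class Person(name: String, age: Int)

import spark.implicits._
val ds = Seq(Person("a", 1), Person("b", 2)).toDS()

// The iterator hands func fully deserialized Person instances,
// so the decode work happens regardless of what func does with them.
val out = ds.mapPartitions { it: Iterator[Person] => it }
```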
If you really need a workaround, you can use the internal API `Dataset.queryExecution.toRdd.mapPartitions`, which has no compatibility guarantees and requires you to work with `InternalRow` directly.

> On 20 Jun 2017, at 8:10 PM, sunerhan1...@sina.com wrote:
>
> hello,
> I'm using dataframe.mappartitions(r=>myfunc(r)) and it runs very slowly,
> even if I do nothing in myfunc but return an empty iterator.
> Following are two DAGs, one without and one with mappartitions, and
> using mappartitions costs more time even though myfunc does nothing.
> Is there a problem with scheduling stages when mappartitions is involved?
> [attachments: Catch.jpg, Catch696F.jpg — DAG screenshots]
>
> sunerhan1...@sina.com
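For reference, the internal-API route mentioned above might look like the following sketch (assumes the same illustrative `ds` Dataset with an `Int` in column 1; `InternalRow` accessors are ordinal-based and unstable across Spark versions):

```scala
// Unstable internal API: queryExecution.toRdd exposes RDD[InternalRow]
// without paying the object-deserialization cost.
import org.apache.spark.sql.catalyst.InternalRow

val raw = ds.queryExecution.toRdd  // RDD[InternalRow]
val ages = raw.mapPartitions { it: Iterator[InternalRow] =>
  // Read fields by ordinal straight out of the binary row;
  // here, column 1 is assumed to hold an Int.
  it.map(row => row.getInt(1))
}
```

Because this bypasses the encoder, your code must know the physical schema (column ordinals and types) and may break on any Spark upgrade.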