`Dataset.mapPartitions` takes `func: Iterator[T] => Iterator[U]`, which means 
Spark has to deserialize its internal binary format into instances of `T` for 
every row, and this deserialization is costly.
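A minimal sketch of that typed path (the case class and dataset here are my own invention, just for illustration): even if the function ignores its input, Spark still materializes each row as a `Person` before the iterator is consumed.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative case class -- any product type with an encoder works.
case class Person(name: String, age: Int)

object TypedMapPartitionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-mapPartitions")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("a", 1), Person("b", 2)).toDS()

    // func: Iterator[Person] => Iterator[Int].
    // Deserialization from the internal binary format into Person
    // happens inside Spark before this function ever sees a row.
    val out = ds.mapPartitions(it => it.map(_.age + 1))
    out.show()

    spark.stop()
  }
}
```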

If you really need to hack around this, you can use the internal API 
`Dataset.queryExecution.toRdd.mapPartitions`. It has no compatibility 
guarantees, and you have to deal with `InternalRow` directly.
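A hedged sketch of that internal path, under the same illustrative data as above: `queryExecution.toRdd` exposes the underlying `RDD[InternalRow]` without the typed deserialization step. Field access on `InternalRow` is positional and type-specific (`getInt`, `getUTF8String`, ...), and none of this is covered by Spark's API stability guarantees.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow

object InternalRowDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("internal-row-mapPartitions")
      .getOrCreate()
    import spark.implicits._

    // Schema: (name: String, age: Int) -- ordinals 0 and 1.
    val df = Seq(("a", 1), ("b", 2)).toDF("name", "age")

    // Internal API: no deserialization to Row/case class, we get
    // InternalRow as-is. Strings would be UTF8String here.
    val counts = df.queryExecution.toRdd.mapPartitions { it =>
      var n = 0L
      it.foreach { row: InternalRow =>
        if (row.getInt(1) > 0) n += 1 // positional access by ordinal
      }
      Iterator.single(n)
    }
    println(counts.sum())

    spark.stop()
  }
}
```

One caveat worth knowing: the `InternalRow` objects handed to the iterator may be reused between rows, so if you buffer rows instead of consuming them one at a time, call `row.copy()` first.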

> On 20 Jun 2017, at 8:10 PM, sunerhan1...@sina.com wrote:
> 
> hello,
>         I'm using dataframe.mappartitions(r=>myfunc(r)) and it runs slowly 
> even if I do nothing in myfunc but return an empty iterator.
> Following are two DAGs, one without and one with mappartitions; the version 
> using mappartitions costs more time even though myfunc does nothing.
> Is there a problem with stage scheduling when mappartitions is involved?
> <Catch.jpg>
> <Catch696F.jpg>
> sunerhan1...@sina.com
