`Dataset.mapPartitions` takes `func: Iterator[T] => Iterator[U]`, which means Spark needs to deserialize its internal binary row format into objects of type `T` before calling `func`, and this deserialization is costly even when `func` does nothing.
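As a minimal sketch of where that cost comes from (assuming a `SparkSession` named `spark`; the `Person` case class is purely illustrative):

```scala
// Illustrative only: even a pass-through func forces Spark to decode
// every element from its internal binary format into a Person object.
case class Person(name: String, age: Int)

import spark.implicits._
val ds = Seq(Person("a", 1), Person("b", 2)).toDS()

// The iterator hands func fully deserialized Person instances,
// so the decode work happens regardless of what func does with them.
val out = ds.mapPartitions { it: Iterator[Person] => it }
```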
If you really need a workaround, you can use the internal API `Dataset.queryExecution.toRdd.mapPartitions`, which has no compatibility guarantees and requires you to work with `InternalRow` directly.

> On 20 Jun 2017, at 8:10 PM, sunerhan1...@sina.com wrote:
>
> hello,
> I'm using dataframe.mappartitions(r=>myfunc(r)) and it runs very slowly,
> even if I do nothing in myfunc but return an empty iterator.
> Following are two DAGs, one without and one with mappartitions, and
> using mappartitions costs more time even though myfunc does nothing.
> Is there a problem with scheduling stages when mappartitions is involved?
> [attachments: Catch.jpg, Catch696F.jpg — DAG screenshots]
>
> sunerhan1...@sina.com
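For reference, the internal-API route mentioned above might look like the following sketch (assumes the same illustrative `ds` Dataset with an `Int` in column 1; `InternalRow` accessors are ordinal-based and unstable across Spark versions):

```scala
// Unstable internal API: queryExecution.toRdd exposes RDD[InternalRow]
// without paying the object-deserialization cost.
import org.apache.spark.sql.catalyst.InternalRow

val raw = ds.queryExecution.toRdd  // RDD[InternalRow]
val ages = raw.mapPartitions { it: Iterator[InternalRow] =>
  // Read fields by ordinal straight out of the binary row;
  // here, column 1 is assumed to hold an Int.
  it.map(row => row.getInt(1))
}
```

Because this bypasses the encoder, your code must know the physical schema (column ordinals and types) and may break on any Spark upgrade.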