I am finding the Dataset API very cumbersome to use, which is
unfortunate, as I was looking forward to the type safety after coming from
a DataFrame codebase.

This link summarizes my troubles:
http://loicdescotte.github.io/posts/spark2-datasets-type-safety/

The problem is having to constantly switch back and forth between typed
and untyped semantics, which really kills productivity. In contrast, the
RDD API is consistently typed and the DataFrame API is consistently
untyped, so I never have to stop and think about which world I am in for
each operation.
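
To make the round trip concrete, here is a minimal sketch of the kind of
thing I mean (the Purchase/UserTotal case classes and the local[*] app are
just made-up examples, not from my actual codebase): a typed filter, then
a groupBy/agg that silently drops back to an untyped DataFrame, then an
explicit .as[...] to climb back into typed land.

  import org.apache.spark.sql.{Dataset, SparkSession}
  import org.apache.spark.sql.functions.sum

  object TypedUntypedRoundTrip {
    case class Purchase(user: String, amount: Double)
    case class UserTotal(user: String, total: Double)

    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("typed-untyped-demo")
        .getOrCreate()
      import spark.implicits._

      val purchases: Dataset[Purchase] =
        Seq(Purchase("alice", 10.0), Purchase("bob", 5.0), Purchase("alice", 7.5)).toDS()

      // Typed: the lambda is checked by the compiler against Purchase.
      val big: Dataset[Purchase] = purchases.filter(_.amount > 6.0)

      // Untyped: groupBy/agg returns a DataFrame (Dataset[Row]);
      // "user" and "amount" are plain strings, only checked at runtime.
      val totalsDf = big.groupBy("user").agg(sum("amount").as("total"))

      // Back to typed: an explicit .as[...] whose column names have to
      // line up with the case class fields.
      val totals: Dataset[UserTotal] = totalsDf.as[UserTotal]

      totals.show()
      spark.stop()
    }
  }

The .as[UserTotal] step is where it bites: get a column name or type
wrong and it only blows up at runtime, which is exactly the check I was
hoping the compiler would be doing for me.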

I gave the Frameless framework (mentioned in the link) a shot, but
eventually ran into oddities, and the documentation and community support
were thin enough that I did not want to sink too much time into it.

At this point I'm considering just sticking with DataFrames, as I don't
really consider Datasets to be usable. Has anyone had a similar experience,
or had better luck?

Alex.
