I'm finding the Dataset API very cumbersome to use, which is unfortunate, as I was looking forward to its type safety after coming from a DataFrame codebase.
This post summarizes my troubles: http://loicdescotte.github.io/posts/spark2-datasets-type-safety/

The core problem is having to continuously switch back and forth between typed and untyped semantics, which really kills productivity. In contrast, the RDD API is consistently typed and the DataFrame API is consistently untyped, so I never have to stop and think about which semantics apply to each operation.

I gave the Frameless framework (mentioned in the link) a shot, but eventually started running into oddities, and its documentation and community support were too thin for me to want to sink more time into it. At this point I'm considering just sticking with DataFrames, as I don't really consider Datasets to be usable.

Has anyone had a similar experience, or had better luck?

Alex
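To make the complaint concrete, here is a minimal sketch of the kind of typed/untyped round-tripping I mean (the case class and data are hypothetical, made up just for illustration):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Long)

object TypedUntypedDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("Alice", 30), Person("Bob", 15)).toDS()

    // Typed: the lambda is checked by the Scala compiler, so a typo in
    // the field name is a compile error.
    val adults: Dataset[Person] = ds.filter(_.age >= 18)

    // Untyped: select takes column names as strings and returns a
    // DataFrame (i.e. Dataset[Row]) -- the Person type is lost here,
    // and a misspelled column name only fails at runtime.
    val names: DataFrame = adults.select("name")

    // To get type safety back you have to cast manually with .as[...].
    val typedNames: Dataset[String] = names.as[String]

    typedNames.show()
    spark.stop()
  }
}
```

A single pipeline ends up crossing the typed/untyped boundary twice, which is exactly the mental overhead the RDD and DataFrame APIs each avoid by staying on one side.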