In Spark 2.0, Dataset and DataFrame are unified. Would this simplify your use case?
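A minimal sketch of what that unification buys you, assuming Spark 2.0 and a schema that is only known at runtime; the input source, column names, transformation, and the use of RowEncoder below are illustrative, not part of the original thread:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

// Assumed input source and column names, for illustration only.
val spark = SparkSession.builder().appName("dynamic-map").getOrCreate()
val df: DataFrame = spark.read.json("input.json")

// In Spark 2.0 a DataFrame is just Dataset[Row], so map returns another
// Dataset as long as an Encoder for the output rows is supplied. With a
// schema known only at runtime, RowEncoder (a catalyst package, but widely
// used) builds that encoder from a StructType instead of a bean class.
val outSchema = StructType(Seq(StructField("joined", StringType)))
val mapped = df.map(row => Row(row.mkString("|")))(RowEncoder(outSchema))

mapped.show()
```

So the chaining you describe works without dropping down to an RDD, at the cost of constructing the output encoder yourself from the dynamic schema.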
On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano <mar...@attivio.com> wrote:

> Hi,
>
> I'm exposing a custom source to the Spark environment, and I have a question
> about the best way to approach this problem.
>
> I created a custom relation for my source and it creates a DataFrame<Row>.
> My custom source knows the data types, which are *dynamic*, so this seemed
> to be the appropriate return type. This works fine.
>
> The next step I want to take is to expose some custom mapping functions
> (written in Java). But when I look at the APIs, the map method for
> DataFrame returns an RDD (not a DataFrame). Should I use
> SqlContext.createDataFrame on the result? Does this result in additional
> processing overhead? The Dataset type seems to be closer to what I'm
> looking for: its map method returns a Dataset, so chaining transformations
> together is a natural exercise.
>
> But to create a Dataset from a DataFrame, it appears that I have to
> provide the types of each field in the Row in the DataFrame.as[...]
> method. I would think the DataFrame would be able to do this
> automatically, since it already has all the types.
>
> This leads me to wonder how I should be approaching this effort. As all
> the fields and types are dynamic, I cannot use beans as my type when
> passing data around. Any advice would be appreciated.
>
> Thanks,
> Martin
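For the 1.x APIs the quoted message is working against, a hedged sketch of the createDataFrame round trip it asks about; the helper name and signature are hypothetical, and the mapping function is whatever your Java/Scala transformation happens to be:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Hypothetical helper for the Spark 1.x path described above: map over the
// rows (which leaves the DataFrame world and yields an RDD[Row]) and then
// rebuild a DataFrame by re-attaching the schema carried alongside, since
// the field types are only known at runtime.
def mapRows(sqlContext: SQLContext, df: DataFrame, outSchema: StructType)
           (fn: Row => Row): DataFrame = {
  val mapped: RDD[Row] = df.rdd.map(fn)
  // createDataFrame does not eagerly materialize the data, but the mapped
  // step is opaque to the optimizer, so push-downs stop at this boundary.
  sqlContext.createDataFrame(mapped, outSchema)
}
```

As far as I know, the round trip itself is cheap since nothing is computed eagerly; the main overhead to be aware of is that Catalyst cannot see inside the mapping function, so optimizations do not cross that step.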