Its not throwing away any information from the point of view of the SQL optimizer. The schema preserves all the type information that the catalyst uses. The type information T in Dataset[T] is only used at the API level to ensure compilation-time type checks of the user program.
On Thu, Jun 16, 2016 at 2:05 PM, Cody Koeninger <c...@koeninger.org> wrote: > I'm clear on what a type alias is. My question is more that moving > from e.g. Dataset[T] to Dataset[Row] involves throwing away > information. Reading through code that uses the Dataframe alias, it's > a little hard for me to know when that's intentional or not. > > > On Thu, Jun 16, 2016 at 2:50 PM, Tathagata Das > <tathagata.das1...@gmail.com> wrote: > > There are different ways to view this. If its confusing to think that > Source > > API returning DataFrames, its equivalent to thinking that you are > returning > > a Dataset[Row], and DataFrame is just a shorthand. And > > DataFrame/Datasetp[Row] is to Dataset[String] is what java Array[Object] > is > > to Array[String]. DataFrame is more general in a way, as every Dataset > can > > be boiled down to a DataFrame. So to keep the Source APIs general (and > also > > source-compatible with 1.x), they return DataFrame. > > > > On Thu, Jun 16, 2016 at 12:38 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >> > >> Is this really an internal / external distinction? > >> > >> For a concrete example, Source.getBatch seems to be a public > >> interface, but returns DataFrame. > >> > >> On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das > >> <tathagata.das1...@gmail.com> wrote: > >> > DataFrame is a type alias of Dataset[Row], so externally it seems like > >> > Dataset is the main type and DataFrame is a derivative type. > >> > However, internally, since everything is processed as Rows, everything > >> > uses > >> > DataFrames, Type classes used in a Dataset is internally converted to > >> > rows > >> > for processing. . Therefore internally DataFrame is like "main" type > >> > that is > >> > used. > >> > > >> > On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger <c...@koeninger.org> > >> > wrote: > >> >> > >> >> Sorry, meant DataFrame vs Dataset > >> >> > >> >> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <c...@koeninger.org > > > >> >> wrote: > >> >> > Is there a principled reason why sql.streaming.* and > >> >> > sql.execution.streaming.* are making extensive use of DataFrame > >> >> > instead of Datasource? > >> >> > > >> >> > Or is that just a holdover from code written before the move / type > >> >> > alias? > >> >> > >> >> --------------------------------------------------------------------- > >> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > >> >> For additional commands, e-mail: dev-h...@spark.apache.org > >> >> > >> > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >