Yes, as[T] is lazy, as any transformation is, but with respect to data
processing, not the schema. You seem to imply that as[T] is lazy in terms
of the schema, but I do not know of any other transformation that behaves
like this.
Your proposed solution works because the map transformation returns the
right schema, even though it is also a lazy transformation. The as[T]
should behave like this too.
The map transformation is a quick fix in terms of code length, but it
materializes the data as instances of T, which introduces a prohibitive
deserialization / serialization round trip for no good reason.
I think returning the right schema does not need to touch any data and
should be as lightweight as a projection.
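To illustrate the difference, here is a minimal sketch (the case class, column names, and the extra column are hypothetical) of how as[T] keeps the originating schema while map(identity) and an explicit projection do not:

```scala
import org.apache.spark.sql.SparkSession

case class Person(id: Long, name: String)

val spark = SparkSession.builder.master("local[*]").appName("as-schema").getOrCreate()
import spark.implicits._

// A DataFrame carrying an extra column that is not part of Person
val df = Seq((1L, "Alice", true)).toDF("id", "name", "extra")

// as[T] retains the originating schema, including the extra column
df.as[Person].printSchema()                 // id, name, extra

// map(identity) yields the schema of T, but (de)serializes every row
df.as[Person].map(identity).printSchema()   // id, name

// An explicit projection fixes the schema without touching any data
df.select($"id", $"name").as[Person].printSchema()  // id, name
```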
Enrico
On 07.01.20 at 10:13, Wenchen Fan wrote:
I think it's simply because as[T] is lazy. You will see the right
schema if you do `df.as[T].map(identity)`.
On Tue, Jan 7, 2020 at 4:42 PM Enrico Minack <m...@enrico.minack.dev> wrote:
Hi Devs,
I'd like to propose a stricter version of as[T]. Given the
interface def as[T](): Dataset[T], it is counter-intuitive that
the schema of the returned Dataset[T] is not agnostic to the
schema of the originating Dataset. The schema should always be
derived only from T.
I am proposing a stricter version so that user code does not need
to pair an .as[T] with a select(schemaOfT.fields.map(col(_.name)):
_*) whenever it expects the Dataset[T] to really contain only the
columns of T.
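The select pairing above can be wrapped in a small generic helper. This is a sketch under my own naming (asStrict is a hypothetical name, not part of the proposed PR), deriving the projection from T's encoder:

```scala
import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.functions.col

// Hypothetical helper: project a Dataset down to exactly the columns
// of T before converting, so the result carries only T's schema.
def asStrict[T: Encoder](ds: Dataset[_]): Dataset[T] = {
  val schemaOfT = implicitly[Encoder[T]].schema
  ds.select(schemaOfT.fields.map(f => col(f.name)): _*).as[T]
}
```

Note this is still only a projection, so it stays lazy and avoids the deserialization round trip of map(identity).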
https://github.com/apache/spark/pull/26969
Regards,
Enrico