Yes, as[T] is lazy like any transformation, but in terms of data processing, not schema. You seem to imply that as[T] is lazy with respect to the schema, and I do not know of any other transformation that behaves like that.

Your proposed solution works because the map transformation returns the right schema, even though it is also a lazy transformation. as[T] should behave like this too.

The map transformation is a quick fix in terms of code length, but it materializes the data as instances of T, which introduces a prohibitive deserialization / serialization round trip for no good reason.
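
For illustration, a minimal sketch of the difference (the case class Person, the extra column and the SparkSession setup are hypothetical, not taken from this thread):

    // assumes an active SparkSession `spark` and `import spark.implicits._`
    case class Person(name: String, age: Int)

    val df = Seq(("Alice", 29, "extra")).toDF("name", "age", "other")

    df.as[Person].schema                // still contains "name", "age" and "other"
    df.as[Person].map(identity).schema  // only "name" and "age", but every row is
                                        // deserialized into a Person and serialized back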

I think returning the right schema does not need to touch any data and should be as lightweight as a projection.
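
Concretely, something along these lines should give the same schema for just the cost of a projection (again only a sketch, reusing the hypothetical Person and df from above):

    import org.apache.spark.sql.Encoders
    import org.apache.spark.sql.functions.col

    // derive the target schema from T alone and project to exactly those columns
    val schemaOfT = Encoders.product[Person].schema
    val ds = df.select(schemaOfT.fields.map(f => col(f.name)): _*).as[Person]

    ds.schema  // only "name" and "age", without a serde round trip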

Enrico


On 07.01.20 at 10:13, Wenchen Fan wrote:
I think it's simply because as[T] is lazy. You will see the right schema if you do `df.as[T].map(identity)`.



On Tue, Jan 7, 2020 at 4:42 PM Enrico Minack <m...@enrico.minack.dev> wrote:

    Hi Devs,

    I'd like to propose a stricter version of as[T]. Given the
    interface def as[T](): Dataset[T], it is counter-intuitive that
    the schema of the returned Dataset[T] still depends on the schema
    of the originating Dataset. The schema should be derived from T
    alone.

    I am proposing a stricter version so that user code does not need to
    pair an .as[T] with a select(schemaOfT.fields.map(f => col(f.name)): _*)
    whenever it expects the Dataset[T] to really contain only the columns
    of T.

    https://github.com/apache/spark/pull/26969

    Regards,
    Enrico

