Yes, as[T] is lazy like any transformation, but in terms of data processing, not schema. You seem to imply that as[T] is lazy with respect to the schema, and I do not know of any other transformation that behaves like that.

Your proposed solution works because the map transformation returns the right schema, even though it is also a lazy transformation. as[T] should behave like this too.

The map transformation is a quick fix in terms of code length, but it materializes the data as instances of T, which introduces a prohibitive deserialization / serialization round trip for no good reason.
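
For illustration, a minimal sketch of the difference (the case class Person, the extra column and the SparkSession setup are hypothetical, not taken from this thread):

    // assumes an active SparkSession `spark` and `import spark.implicits._`
    case class Person(name: String, age: Int)

    val df = Seq(("Alice", 29, "extra")).toDF("name", "age", "other")

    df.as[Person].schema                // still contains "name", "age" and "other"
    df.as[Person].map(identity).schema  // only "name" and "age", but every row is
                                        // deserialized into a Person and serialized back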

I think returning the right schema does not need to touch any data and should be as lightweight as a projection.
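
Concretely, something along these lines should give the same schema for just the cost of a projection (again only a sketch, reusing the hypothetical Person and df from above):

    import org.apache.spark.sql.Encoders
    import org.apache.spark.sql.functions.col

    // derive the target schema from T alone and project to exactly those columns
    val schemaOfT = Encoders.product[Person].schema
    val ds = df.select(schemaOfT.fields.map(f => col(f.name)): _*).as[Person]

    ds.schema  // only "name" and "age", without a serde round trip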

Enrico


On 07.01.20 at 10:13, Wenchen Fan wrote:
I think it's simply because as[T] is lazy. You will see the right schema if you do `df.as[T].map(identity)`.



On Tue, Jan 7, 2020 at 4:42 PM Enrico Minack <m...@enrico.minack.dev> wrote:

    Hi Devs,

    I'd like to propose a stricter version of as[T]. Given the
    interface def as[T](): Dataset[T], it is counter-intuitive that
    the schema of the returned Dataset[T] still depends on the schema
    of the originating Dataset. The schema should be derived from T
    alone.

    I am proposing a stricter version so that user code does not need to
    pair an .as[T] with a select(schemaOfT.fields.map(f => col(f.name)): _*)
    whenever it expects the Dataset[T] to really contain only the columns
    of T.

    https://github.com/apache/spark/pull/26969

    Regards,
    Enrico

