Re: [DISCUSS] [Spark SQL, PySpark] Combining StructTypes into a new StructType

Maciej Sun, 14 Aug 2022 04:10:58 -0700

I have mixed feelings about this proposal. Merging or diffing schemas is a common operation, but specific requirements differ from case to case, especially when complex nested data is used.

Even if we put the ordering of fields aside, the equality semantics of data types (StructField in particular) are likely to result in an implementation that is either confusing or of limited applicability.

Additionally, the Scala StructType is already a Seq[StructField] and as such provides set-like operations (contains, diff, intersect, union) as well as implementations of ++ / :+ / +:, so we cannot do much here without breaking the existing API.
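
To make the equality concern concrete, here is a minimal plain-Python sketch. It does not use Spark: `Field` is a hypothetical stand-in for StructField, assumed to compare on all of its attributes (name, type, nullability), and the field names are made up for illustration. It shows that a "diff" of two schemas gives different answers depending on whether fields are matched by full equality or by name only.

```python
from typing import NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for StructField (assumption, not the Spark class)."""
    name: str
    dtype: str
    nullable: bool = True


a = [Field("id", "int", nullable=False), Field("name", "string")]
b = [Field("id", "int", nullable=True)]

# Diff by full equality: "id" survives, because the two "id" fields
# differ in nullability and therefore do not compare equal.
diff_by_equality = [f for f in a if f not in b]

# Diff by name only: "id" is removed, regardless of type or nullability.
names_in_b = {f.name for f in b}
diff_by_name = [f for f in a if f.name not in names_in_b]

print([f.name for f in diff_by_equality])
print([f.name for f in diff_by_name])
```

Neither result is obviously the right one, which is the difficulty with picking a single built-in semantics.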


On 8/14/22 11:30, Alexandros Biratsis wrote:

Hello Rui and Tim,

Indeed, this sounds like a good and quite useful idea. To make it more formal, the field list of a StructType could be treated as a Scala/Python set by providing (inheriting?) the common set functionality, i.e. add, remove, concat, intersect, diff, etc. The set-like functionality could be part of the StructType class for both languages.

The Scala set collection: https://www.scala-lang.org/api/2.13.x/scala/collection/immutable/Set.html


Best,
Alex

On Wed, Aug 10, 2022, 08:14 Rui Wang <amaliu...@apache.org> wrote:


    Thanks for the idea!

    I think the existing usage, "combined = StructType(a.fields +
    b.fields)", is still good because:
    1) it is not horrible to merge a and b in this way;
    2) it makes the intention clear, which is to merge two structs'
    fields to construct a new struct;
    3) it leaves room to apply more complicated operations when merging
    fields, for example removing duplicate fields with the same name,
    or using a.fields but removing some fields if they are in b.
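
The two operations mentioned in point 3 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `Field` is a hypothetical stand-in for StructField, the helper names `concat_unique` and `without` are invented here, and "keep the first field seen for a name" is one arbitrary choice among several.

```python
from typing import List, NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for StructField (assumption, not the Spark class)."""
    name: str
    dtype: str


def concat_unique(left: List[Field], right: List[Field]) -> List[Field]:
    """Concatenate two field lists, keeping the first field seen per name."""
    seen = set()
    out = []
    for f in [*left, *right]:
        if f.name not in seen:
            seen.add(f.name)
            out.append(f)
    return out


def without(left: List[Field], right: List[Field]) -> List[Field]:
    """Keep the fields of `left` whose names do not occur in `right`."""
    names = {f.name for f in right}
    return [f for f in left if f.name not in names]


a = [Field("field_a", "string"), Field("shared", "int")]
b = [Field("field_b", "int"), Field("shared", "long")]

merged = concat_unique(a, b)  # "shared" keeps the type from a
only_a = without(a, b)        # drops "shared"

print([f.name for f in merged])
print([f.name for f in only_a])
```

With PySpark, the same list operations would presumably be applied to a.fields and b.fields and the result passed to StructType(...), as in the usage quoted above.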

    Overloading "+" could be problematic:
    1) it is ambiguous what this plus is doing;
    2) if you define + as concatenation of the fields, then it is
    limited to concatenation only. What about other operations, such as
    extracting fields from a based on b? Maybe overload "-"? In that
    case the list of operators will grow.

    -Rui

    On Tue, Aug 9, 2022 at 1:10 PM Tim <bosse...@posteo.de> wrote:

        Hi all,

        this is my first message to the Spark mailing list, so please
        bear with me if I don't fully meet your communication standards.
        I just wanted to discuss one aspect that I've stumbled across
        several times over the past few weeks.
        When working with Spark, I often run into the problem of having
        to merge two (or more) existing StructTypes into a new one to
        define a schema.
        Usually this looks similar (in Python) to the following
        simplified example:

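                  from pyspark.sql.types import (
                      IntegerType, StringType, StructField, StructType)

                  a = StructType([StructField("field_a", StringType())])
                  b = StructType([StructField("field_b", IntegerType())])

                  # concatenate the two field lists into a new schema
                  combined = StructType(a.fields + b.fields)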

        My idea, which I would like to discuss, is to shorten the above
        example in Python as follows by supporting Python's add operator
        for StructTypes:

                  combined = a + b


        What do you think of this idea? Are there any reasons why this
        is not yet part of StructType's functionality?
        If you support this idea, I could create a first PR for further
        and deeper discussion.

        Best
        Tim



--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC
