Hi,

I'm wondering if there has been any past discussion on the subject of
supporting metadata attributes as a first class concept, both at the row
level, as well as the DataFrame level?  I did a Jira search, but most of
the items I found were unrelated to this concept, or pertained to column
level metadata, which is of course already supported.
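
By column-level metadata I mean what the API already exposes today; a
minimal sketch, with made-up values:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.MetadataBuilder

    val spark = SparkSession.builder().appName("column-metadata-example").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Attach metadata to a single column; it lives in the schema, not in the rows
    val meta = new MetadataBuilder().putString("source", "orders.id").build()
    val withColMeta = df.withColumn("id", $"id".as("id", meta))

    println(withColMeta.schema("id").metadata.getString("source"))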

Row-level metadata would be useful in scenarios like the following:

   - Lineage and provenance attributes, which need to eventually be
   propagated to some other system, but which shouldn't be written out with
   the "regular" DataFrame.
   - Other custom attributes, such as the input_file_name
   <https://spark.apache.org/docs/2.4.6/api/sql/index.html#input_file_name>
   for data read from HDFS, or message keys from Kafka (see the sketch right
   after this list)
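
To make that last bullet concrete, these attributes currently have to be
captured as ordinary columns; a rough sketch (paths, topic and broker names
are placeholders, assuming an existing SparkSession `spark`):

    import org.apache.spark.sql.functions.input_file_name

    // File source: capture the originating file as an extra, prefixed column
    val events = spark.read.parquet("hdfs:///data/events")       // placeholder path
      .withColumn("_meta_input_file", input_file_name())

    // Kafka source: key/topic/offset arrive as regular columns next to the value
    val fromKafka = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")          // placeholder broker
      .option("subscribe", "events")                             // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS _meta_kafka_key",
                  "CAST(value AS STRING) AS value")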

So why not just store the attributes as regular columns (possibly
with some special prefix to help us filter them out if needed)?

   - When passing the DataFrame to another piece of library code, we might
   need to remove those columns, depending on what it does (ex: if it operates
   on every column).  Or we might need to perform an extra join in order to
   "retain" the attributes from the rows processed by the library function.
   - If we need to union an existing DataFrame (with metadata) with another
   one read from a different source (which has different metadata, or none),
   and the metadata attributes are represented as normal columns, we have to
   do some finagling to get the union to work properly (see the sketch after
   this list).
   - If we want to simply write the DataFrame somewhere, we probably don't
   want to mix metadata attributes with the actual data.
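
To illustrate the union bullet: with the column-based workaround, the
"finagling" ends up looking something like this (sketch only; dfWithMeta,
dfWithoutMeta and the _meta_ prefix are made-up names):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.lit

    // Pad the metadata-less DataFrame with null "_meta_" columns so the union lines up
    def withNullMetadata(df: DataFrame, template: DataFrame): DataFrame = {
      val missing = template.columns
        .filter(_.startsWith("_meta_"))
        .filterNot(df.columns.contains)
      missing.foldLeft(df) { (acc, c) =>
        acc.withColumn(c, lit(null).cast(template.schema(c).dataType))
      }
    }

    val unioned = dfWithMeta.unionByName(withNullMetadata(dfWithoutMeta, dfWithMeta))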

For DataFrame-level metadata:

   - Attributes such as the table/schema/DB name, or primary key
   information, for DataFrames read from JDBC (ex: downstream processing might
   want to always partitionBy these key columns, whatever they happen to be;
   sketched right after this list)
   - Adding tracking information about what app-specific processing steps
   have been applied so far, their timings, etc.
   - For SQL sources, capturing the full query that produced the DataFrame
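
As an example of the first bullet, today that key information has to travel
out of band, e.g. in a plain Map kept next to the DataFrame (table, key and
path names below are made up, assuming an existing SparkSession `spark`):

    import java.util.Properties

    // Hypothetical side channel carrying the JDBC table's key information
    val ordersDf = spark.read.jdbc("jdbc:postgresql://db:5432/shop", "orders", new Properties())
    val tableMeta = Map("table" -> "orders", "primaryKey" -> "customer_id,order_id")

    // Downstream code picks the partition columns from the side channel
    val keyColumns = tableMeta("primaryKey").split(",").toSeq
    ordersDf.write.partitionBy(keyColumns: _*).parquet("hdfs:///out/orders")   // placeholder path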

Some of these scenarios can be made easier by custom code with implicit
conversions, as outlined here <https://stackoverflow.com/a/32646241/375670>.
But that approach has its own drawbacks and shortcomings (as discussed in the
comments there).
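
For anyone who hasn't looked at it, the general shape of that wrapper
approach is roughly the following (my own sketch of the idea, not the code
from the link):

    import scala.language.implicitConversions
    import org.apache.spark.sql.DataFrame

    // Carry an app-level metadata map next to the DataFrame
    case class DataFrameWithMeta(df: DataFrame, meta: Map[String, String])

    object DataFrameWithMeta {
      // Convenient, but any code that accepts or returns a plain DataFrame
      // silently drops the metadata, which is one of the drawbacks discussed
      // in the comments on that answer
      implicit def toDataFrame(d: DataFrameWithMeta): DataFrame = d.df
    }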

How are people currently managing this?  Does it make sense, conceptually,
as something that Spark should directly support?
