> Please elaborate.

I'm mainly aware of the situation in Scala, where the lack of named tuples is 
the reason why type-safe schema transformation is rather limited. When working 
with typed data, there are basically two options:

  * Use unnamed tuples, which are not really an option, because they either hard-code column positions (=> unreadable) or require tedious pattern matching over all fields.
  * Use (case) classes, which is the standard solution: you write out a case class, which involves typing out all the field names/types once. The problem is that transforming the data cannot be done automatically. Let's assume the input data has 30 columns, so we have to write a first class RawInput with 30 fields. In a later processing step we might want to remove a few columns, which again requires defining a new class ReducedInput with 20+ fields. Eventually we might want to add a bunch of derived columns, and again we have to introduce a new type. The problem can be mitigated by inheritance/traits, but it remains a workaround that is not very convenient to work with.



In Nim, the same can be solved very elegantly by just transforming/constructing 
named tuples everywhere. That's what the DSL looks like:
    
    
    # A const schema definition is required once. Ideally this is the
    # only point where we have to type out our 30 columns.
    const schema = ... # array with field information
    
    # From here on, it is just a bunch of macros performing named tuple
    # transformations
    let df = DF.fromText("test.csv")
               .map(schemaParser(schema, ";"))
    
    # Projection can use whichever is shorter to type
    df.map(t => t.projectAway(fields, to, remove))
    df.map(t => t.projectTo(fields, to, keep))
    
    # Adding new fields also does not require repeating existing fields
    df.map(t => t.addFields(length: sqrt(t.x^2 + t.y^2)))
    
    # Eventually even the schema of a join can be computed statically:
    let joined = dfA.join(dfB, on=[joinField])
    
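For illustration, here is a rough sketch of how a projectTo-style macro could be written on top of plain named tuples; the expansion into (a: t.a, b: t.b) is just my assumption of one possible implementation, not necessarily how the DSL actually does it:

    import std/macros
    
    # Hypothetical sketch: t.projectTo(a, b) is rewritten to (a: t.a, b: t.b),
    # i.e. a new named tuple containing only the listed fields of t.
    macro projectTo(t: untyped, fields: varargs[untyped]): untyped =
      result = nnkTupleConstr.newTree()
      for f in fields:
        result.add nnkExprColonExpr.newTree(f, newDotExpr(t, f))
    
    let row = (x: 1.0, y: 2.0, label: "p")
    echo row.projectTo(x, y)   # (x: 1.0, y: 2.0)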

This should also play nicely with structural typing in Nim, e.g., passing data 
frames to functions can be done generically and does not require writing out 
field names explicitly.
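
As a hedged illustration of that point (the row layouts and the length field below are made up), a generic proc instantiates for any tuple that happens to have the fields it uses:

    # Sketch: works for any row type that has a length field; no explicit
    # field list or type name is needed at the call site.
    proc meanLength[T](rows: openArray[T]): float =
      var total = 0.0
      for r in rows:
        total += r.length
      result = total / rows.len.float
    
    echo meanLength(@[(x: 3.0, y: 4.0, length: 5.0)])
    echo meanLength(@[(id: 1, length: 2.5), (id: 2, length: 3.5)])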

I'm not sure how this would work with objects. Since they are nominal, I guess 
they would have to be made explicitly available in the outer scope. Currently I 
leave it up to the user whether they want to define their types explicitly, for 
instance via this macro:
    
    
    type
      MyRowType = schemaType(schema)
    
    proc myExplicitlyTypedProc(df: DataFrame[MyRowType]) = ...
    
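To sketch what such a macro could do, assuming the schema is a compile-time seq of (field name, type name) pairs (the actual representation in the DSL may differ), it essentially builds a tuple[...] type from the schema:

    import std/macros
    
    # Hypothetical sketch: turn a const schema into a named tuple type.
    const schema = @[("x", "float"), ("y", "float"), ("label", "string")]
    
    macro schemaType(s: static seq[(string, string)]): untyped =
      result = nnkTupleTy.newTree()
      for (name, typ) in s:
        result.add newIdentDefs(ident(name), ident(typ))
    
    type MyRowType = schemaType(schema)   # tuple[x: float, y: float, label: string]
    let row: MyRowType = (x: 1.0, y: 2.0, label: "p")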

What I wanted to avoid is forcing users to explicitly name their types for each 
transformation.
