I finally found some time to work on something I wanted to start immediately 
when I discovered Nim: a data frame API allowing lazy out-of-core data 
analytics. You can find a first working prototype over at GitHub (under the 
uninspired name [NimData](https://github.com/bluenote10/NimData)). So far I'm 
pretty happy with what is already possible. In fact, I got carried away a 
little when creating the readme example based on this Bundesliga data set.

Quick motivation: the project is targeted between the worlds of Pandas/R and 
map-reduce successors like Spark/Flink. In the Pandas/R world, the general 
assumption is that your data fits into memory, and things get awkward once the 
data size exceeds this threshold. Spark/Flink offer a different API design, 
which makes distributed out-of-core processing easier, and many companies are 
moving in this direction now. However, this can be a heavy investment, because 
Spark/Flink basically require that you build and operate a cluster (=> 
costly/complicated), and the performance of running in local mode is often not 
so great (partly due to the required abstractions plus the JVM). NimData uses a 
very similar API design allowing out-of-core processing, but for the time being 
the focus is just on running locally. In other words, the target is "medium" 
sized data, i.e., data which may not fit into memory entirely but does not 
justify investing in a cluster. I haven't started benchmarking yet, but I could 
imagine that Nim's native speed might even keep up with a small cluster for 
certain problems.

In terms of the design, I'm at a point where I could use some help from the Nim 
veterans to verify whether this is a good path. I went for a design based on 
dynamic dispatch, and I was facing some problems related to both method 
dispatch and iterators. The biggest showstopper was that I could not mix 
generic and specific methods (which prevented actually reading from files), but 
then I found the work-around documented in 
[#5325](https://github.com/nim-lang/Nim/issues/5325). There are still some 
issues left, and I'm not sure whether they are a mistake on my part or a 
known/unknown Nim issue. Some of the issues are tricky, because they disappear 
in isolated examples. I have prepared the following minimal example, which 
illustrates both the general design and the outstanding issues:
    
    
    import times
    import typetraits
    import strutils
    import sequtils
    import future
    import macros
    
    type
      DataFrame[T] = ref object of RootObj
      
      CachedDataFrame[T] = ref object of DataFrame[T]
        data: seq[T]
      
      MappedDataFrame[U, T] = ref object of DataFrame[T]
        orig: DataFrame[U]
        f: proc(x: U): T
      
      FilteredDataFrame[T] = ref object of DataFrame[T]
        orig: DataFrame[T]
        f: proc(x: T): bool
    
    proc newCachedDataFrame[T](data: seq[T]): DataFrame[T] =
      result = CachedDataFrame[T](data: data)
    
    # -----------------------------------------------------------------------------
    # Transformations
    # -----------------------------------------------------------------------------
    
    # Issue 1: I'm getting a deprecation warning for this function:
    #   `Warning: generic method not attachable to object type is deprecated`
    # I don't understand why it is not attachable, T and U are both unambiguous
    # from the mapping proc. Is this a showstopper, because it will break
    # in a future version of Nim?
    method map[U, T](df: DataFrame[U], f: proc(x: U): T): DataFrame[T] {.base.} =
      result = MappedDataFrame[U, T](orig: df, f: f)
    
    method filter[T](df: DataFrame[T], f: proc(x: T): bool): DataFrame[T] {.base.} =
      result = FilteredDataFrame[T](orig: df, f: f)
    
    # -----------------------------------------------------------------------------
    # Iterators
    # -----------------------------------------------------------------------------
    
    # Issue 2: I don't understand why this dummy wrapper is required below.
    # In the `for x in ...` lines below I was trying two variants:
    #
    # - `for x in it:` This gives a compilation error:
    #   `Error: type mismatch: got (iterator (): int{.closure.})`
    #   which is surprising because that is exactly the type required, isn't it?
    # - `for x in it():` This compiles, but it leads to bugs!
    #   When chaining e.g. two `map` calls the resulting iterator will
    #   just return zero elements, irrespective of what the original
    #   iterator is.
    #
    # For some strange reason converting the closure iterator to an inline
    # iterator can serve as a work around.
    iterator toIterBugfix[T](closureIt: iterator(): T): T {.inline.} =
      for x in closureIt():
        yield x
    
    method iter[T](df: DataFrame[T]): (iterator(): T) {.base.} =
      quit "base method called (DataFrame.iter)"
    
    method iter[T](df: CachedDataFrame[T]): (iterator(): T) =
      result = iterator(): T =
        for x in df.data:
          yield x
    
    method iter[U, T](df: MappedDataFrame[U, T]): (iterator(): T) =
      result = iterator(): T =
        var it = df.orig.iter()
        for x in toIterBugfix(it): # why not just `it` or `it()`?
          yield df.f(x)
    
    method iter[T](df: FilteredDataFrame[T]): (iterator(): T) =
      result = iterator(): T =
        var it = df.orig.iter()
        for x in toIterBugfix(it): # why not just `it` or `it()`?
          if df.f(x):
            yield x
    
    # -----------------------------------------------------------------------------
    # Actions
    # -----------------------------------------------------------------------------
    
    # Issue 3: I'm getting multiple warnings for this line, like:
    # df_design_02.nim(87, 8) Warning: method has lock level <unknown>, but another method has 0 [LockLevel]
    # df_design_02.nim(81, 8) Warning: method has lock level 0, but another method has <unknown> [LockLevel]
    # df_design_02.nim(81, 8) Warning: method has lock level 0, but another method has <unknown> [LockLevel]
    # df_design_02.nim(81, 8) Warning: method has lock level 0, but another method has <unknown> [LockLevel]
    # df_design_02.nim(81, 8) Warning: method has lock level 0, but another method has <unknown> [LockLevel]
    # The first warning points to the third definition of collect (the one
    # for MappedDataFrame). I'm confused because none of them has an
    # explicit lock level.
    method collect[T](df: DataFrame[T]): seq[T] {.base.} =
      quit "base method called (DataFrame.collect)"
    
    method collect[T](df: CachedDataFrame[T]): seq[T] =
      result = df.data
    
    method collect[U, T](df: MappedDataFrame[U, T]): seq[T] =
      result = newSeq[T]()
      let it = df.orig.iter()
      for x in it():
        result.add(df.f(x))
    
    # This issue is triggered by client code looking like this (also
    # demonstrates the iterator bug):
    let data = newCachedDataFrame[int](@[1, 2, 3])
    let mapped = data.map(x => x*2)
    echo mapped.collect()
    echo data.map(x => x*2).collect()
    echo data.map(x => x*2).map(x => x*2).collect()
    echo data.filter(x => x mod 2 == 1).map(x => x * 100).collect()
    
    # Issue 4:
    # The following errors with
    #   `Error: method is not a base`
    # but only when there is client code calling it with different
    # types T, e.g. int and string. Not a big problem right now since
    # I can just make it a proc (no overloads required at the moment),
    # but still a bit worrying, because later overloading might be necessary.
    when false:
      method count*[T](df: DataFrame[T]): int {.base.} =
        result = 0
        let it = df.iter()
        for x in toIterBugfix(it):
          result += 1
      
      echo newCachedDataFrame[int](@[1, 2, 3]).count()
      echo newCachedDataFrame[string](@["1", "2", "3"]).count()
    
    # Issue 5: Trying to define generic numerical actions like mean/min/max fails
    # with a compilation error:
    #   `Error: type mismatch: got (string) but expected 'float'`
    # even though the method is never called with T being a string.
    # Work around is also to use a proc, but overloading may be
    # desirable.
    when false:
      method mean*[T](df: DataFrame[T]): float =
        result = 0
        var count = 0
        let it = df.iter()
        for x in it():
          count += 1
          result += x.float
        result /= count.float
      
      # To get the error it suffices (and is required) to just have a DataFrame[string]
      # in scope:
      let strData = newCachedDataFrame[string](@["A", "B"])
    

Any ideas what is happening here? I feel like Issues 3 + 4 might be related to 
#5325. And do you think the overall design makes sense and is in line with the 
Nim 1.0 roadmap (I'm a bit concerned about the deprecation warning of Issue 1)? 
I thought about the following alternatives, but I'm not sure whether they would 
work at all:

  * Using an object variant instead of a class hierarchy, but I don't know if 
having a variable number of generic parameters is possible (none for 
type-specific data frames, one for the regular generic in T, two for data 
frames with a U-to-T mapper).
  * Instead of making all the functions generic in the underlying type T, they 
could be generic in the DataFrame type itself (maybe with a DataFrame concept). 
But then we could not write a clear signature for mapping functions, and 
everything would have to become entirely generic.
  * Maybe the functions which require dynamic dispatch should rather be written 
as templates/macros, which would probably also allow bringing in type-specific 
implementations. Not sure how this would compare to the current design, though.



Regarding performance, I'm a bit concerned about the double indirection of 
having to wrap the closure iterators in an inline iterator. I want to start 
benchmarking soon, and eventually it would be nice if the design were smart 
enough to "collapse" certain operations. For instance, if the transformation 
chain contains map+map, filter+filter, map+filter, etc., they could be merged 
so that they compile into a single loop without another function call to a 
closure iterator. Any ideas on how to squeeze out maximum performance are 
highly appreciated. Or rather, any kind of contribution is welcome.
