I finally found some time to work on something I wanted to start right when I discovered Nim: a data frame API allowing lazy out-of-core data analytics. You can find a first working prototype over at GitHub (under the uninspired name [NimData](https://github.com/bluenote10/NimData)). So far I'm pretty happy with what is already possible. In fact, I got carried away a little bit when creating the readme example based on this Bundesliga data set.
Quick motivation: The project is targeted between the worlds of Pandas/R and map-reduce successors like Spark/Flink. In the Pandas/R world, the general assumption is that your data fits into memory, and things get awkward when the data size exceeds this threshold. Spark/Flink offer a different API design, which makes distributed out-of-core processing easier, and many companies are moving in this direction now. However, this can be a heavy investment, because Spark/Flink basically require that you build and operate a cluster (=> costly/complicated) -- and the performance of running in local mode is often not so great (partly due to the required abstractions + JVM). NimData uses a very similar API design allowing out-of-core processing, but for the time being the focus is just on running locally. In other words, the target is "medium" sized data, i.e., data which may not fit into memory entirely, but does not justify investing in a cluster. I haven't started benchmarking yet, but I could imagine that Nim's native speed might even keep up with a small cluster for certain problems.

In terms of the design, I'm at a point where I could use some help from the Nim veterans to verify whether this is a good path. I went for a design based on dynamic dispatching, and I was facing some problems related to both method dispatch and iterators. The biggest showstopper was that I could not mix generic and specific methods (which made it impossible to actually read from files), but then I found the work-around documented in [#5325](https://github.com/nim-lang/Nim/issues/5325). But there are still some issues left, and I'm not sure whether they are a mistake on my part or known/unknown Nim issues. Some of the issues are tricky, because they disappear in isolated examples. I have prepared the following minimal example, which illustrates both the general design and the outstanding issues:

```nim
import times
import typetraits
import strutils
import sequtils
import future
import macros

type
  DataFrame[T] = ref object of RootObj

  CachedDataFrame[T] = ref object of DataFrame[T]
    data: seq[T]

  MappedDataFrame[U, T] = ref object of DataFrame[T]
    orig: DataFrame[U]
    f: proc(x: U): T

  FilteredDataFrame[T] = ref object of DataFrame[T]
    orig: DataFrame[T]
    f: proc(x: T): bool

proc newCachedDataFrame[T](data: seq[T]): DataFrame[T] =
  result = CachedDataFrame[T](data: data)

# -----------------------------------------------------------------------------
# Transformations
# -----------------------------------------------------------------------------

# Issue 1: I'm getting a deprecation warning for this function:
# `Warning: generic method not attachable to object type is deprecated`
# I don't understand why it is not attachable; T and U are both unambiguous
# from the mapping proc. Is this a showstopper, because it will break
# in a future version of Nim?
method map[U, T](df: DataFrame[U], f: proc(x: U): T): DataFrame[T] {.base.} =
  result = MappedDataFrame[U, T](orig: df, f: f)

method filter[T](df: DataFrame[T], f: proc(x: T): bool): DataFrame[T] {.base.} =
  result = FilteredDataFrame[T](orig: df, f: f)

# -----------------------------------------------------------------------------
# Iterators
# -----------------------------------------------------------------------------

# Issue 2: I don't understand why this dummy wrapper is required below.
# In the `for x in ...` lines below I was trying two variants:
#
# - `for x in it:` This gives a compilation error:
#   `Error: type mismatch: got (iterator (): int{.closure.})`
#   which is surprising, because that is exactly the type required, isn't it?
# - `for x in it():` This compiles, but it leads to bugs!
#   When chaining e.g. two `map` calls, the resulting iterator will
#   just return zero elements, irrespective of what the original
#   iterator is.
#
# For some strange reason, converting the closure iterator to an inline
# iterator can serve as a work-around.
iterator toIterBugfix[T](closureIt: iterator(): T): T {.inline.} =
  for x in closureIt():
    yield x

method iter[T](df: DataFrame[T]): (iterator(): T) {.base.} =
  quit "base method called (DataFrame.iter)"

method iter[T](df: CachedDataFrame[T]): (iterator(): T) =
  result = iterator(): T =
    for x in df.data:
      yield x

method iter[U, T](df: MappedDataFrame[U, T]): (iterator(): T) =
  result = iterator(): T =
    var it = df.orig.iter()
    for x in toIterBugfix(it): # why not just `it` or `it()`?
      yield df.f(x)

method iter[T](df: FilteredDataFrame[T]): (iterator(): T) =
  result = iterator(): T =
    var it = df.orig.iter()
    for x in toIterBugfix(it): # why not just `it` or `it()`?
      if df.f(x):
        yield x

# -----------------------------------------------------------------------------
# Actions
# -----------------------------------------------------------------------------

# Issue 3: I'm getting multiple warnings for this line, like:
# df_design_02.nim(87, 8) Warning: method has lock level <unknown>, but another method has 0 [LockLevel]
# df_design_02.nim(81, 8) Warning: method has lock level 0, but another method has <unknown> [LockLevel]
# df_design_02.nim(81, 8) Warning: method has lock level 0, but another method has <unknown> [LockLevel]
# df_design_02.nim(81, 8) Warning: method has lock level 0, but another method has <unknown> [LockLevel]
# df_design_02.nim(81, 8) Warning: method has lock level 0, but another method has <unknown> [LockLevel]
# The first warning points to the third definition of collect (the one for
# MappedDataFrame). I'm confused, because none of them has an explicit lock level.
method collect[T](df: DataFrame[T]): seq[T] {.base.} =
  quit "base method called (DataFrame.collect)"

method collect[T](df: CachedDataFrame[T]): seq[T] =
  result = df.data

method collect[S, T](df: MappedDataFrame[S, T]): seq[T] =
  result = newSeq[T]()
  let it = df.orig.iter()
  for x in it():
    result.add(df.f(x))

# This issue is triggered by client code looking like this (which also
# demonstrates the iterator bug):
let data = newCachedDataFrame[int](@[1, 2, 3])
let mapped = data.map(x => x*2)
echo mapped.collect()
echo data.map(x => x*2).collect()
echo data.map(x => x*2).map(x => x*2).collect()
echo data.filter(x => x mod 2 == 1).map(x => x * 100).collect()

# Issue 4:
# The following errors with
# `Error: method is not a base`
# but only when there is client code calling it with different
# types T, e.g. int and string. Not a big problem right now, since
# I can just make it a proc (no overloads required at the moment),
# but still a bit worrying, because later overloading might be necessary.
when false:
  method count*[T](df: DataFrame[T]): int {.base.} =
    result = 0
    let it = df.iter()
    for x in toIterBugfix(it):
      result += 1

  echo newCachedDataFrame[int](@[1, 2, 3]).count()
  echo newCachedDataFrame[string](@["1", "2", "3"]).count()

# Issue 5: Trying to define generic numerical actions like mean/min/max fails
# with a compilation error:
# `Error: type mismatch: got (string) but expected 'float'`
# even though the method is never called with T being a string.
# The work-around is again to use a proc, but overloading may be
# desirable.
when false:
  method mean*[T](df: DataFrame[T]): float =
    result = 0
    var count = 0
    let it = df.iter()
    for x in it():
      count += 1
      result += x.float
    result /= count.float

# To get the error it suffices (and is required) to just have a DataFrame[string]
# in scope:
let strData = newCachedDataFrame[string](@["A", "B"])
```

Any ideas what is happening here? I feel like Issues 3 + 4 might be related to #5325. And do you think the overall design makes sense and is in line with the Nim 1.0 roadmap (I'm a bit concerned about the deprecation warning of Issue 1)? I thought about the following alternatives, but I'm not sure if they would work at all:

* Using an object variant instead of a class hierarchy, but I don't know if having a variable number of generic parameters is possible (none for type-specific data frames, 1 for the regular generic in T, 2 for data frames with a U to T mapper). See the sketch at the end of this post for what I mean.
* Instead of making all the functions generic in the underlying type T, they could be generic in the DataFrame type itself (maybe using a DataFrame concept). But then we could not write a clear signature for mapping functions, and everything would have to become entirely generic.
* Maybe the functions which require dynamic dispatch should rather be written as templates/macros, which would probably also allow bringing in type-specific implementations. Not sure how this would compare to the current design, though.

Regarding performance, I'm a bit concerned about the double indirection of having to wrap the closure iterators in an inline iterator. I want to start benchmarking soon, and eventually it would be nice if the design were smart enough to "collapse" certain operations. For instance, if the transformation chain has a map+map, filter+filter, map+filter, ..., the steps could be merged, so that they compile to a single loop without additional calls into a closure iterator (a rough sketch of this idea is at the end of this post as well). Any ideas on how to squeeze out the maximum performance are highly appreciated. Or rather, any kind of contribution.
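To make the object variant alternative a bit more concrete, here is a minimal sketch of what I mean (as a standalone file; all names made up). It works as long as `map` is restricted to `T -> T` mappers; the type-changing `U -> T` case is exactly where I get stuck, because as far as I can tell a variant branch cannot introduce its own generic parameter:

```nim
import sequtils

type
  DataFrameKind = enum
    dfCached, dfMapped, dfFiltered

  DataFrame[T] = ref object
    case kind: DataFrameKind
    of dfCached:
      data: seq[T]
    of dfMapped:
      mapOrig: DataFrame[T]  # would have to be DataFrame[U] for a U -> T mapper
      mapF: proc(x: T): T
    of dfFiltered:
      filterOrig: DataFrame[T]
      filterF: proc(x: T): bool

proc collect[T](df: DataFrame[T]): seq[T] =
  # each node recursively evaluates its upstream frame
  case df.kind
  of dfCached:
    result = df.data
  of dfMapped:
    result = df.mapOrig.collect().map(df.mapF)
  of dfFiltered:
    result = df.filterOrig.collect().filter(df.filterF)
```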
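One thing I like about such a variant design is that the operation collapsing mentioned above would fall out naturally: a smart constructor can inspect the kind of its input and compose functions directly. A rough sketch, building on the types from the previous snippet (again only for `T -> T` mappers):

```nim
proc map[T](df: DataFrame[T], f: proc(x: T): T): DataFrame[T] =
  if df.kind == dfMapped:
    # map+map: compose the two functions instead of stacking another node,
    # so evaluation remains a single loop with one closure call per element
    let g = df.mapF
    result = DataFrame[T](kind: dfMapped, mapOrig: df.mapOrig,
                          mapF: proc(x: T): T = f(g(x)))
  else:
    result = DataFrame[T](kind: dfMapped, mapOrig: df, mapF: f)

let d = DataFrame[int](kind: dfCached, data: @[1, 2, 3])
let m = d.map(proc(x: int): int = x * 2).map(proc(x: int): int = x + 1)
echo m.collect()  # @[3, 5, 7], internally still a single dfMapped node
```

Whether the same trick is feasible in the method-based design above is unclear to me, since it would require yet another generic `map` overload on `MappedDataFrame`.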