GitHub user etrain commented on the pull request:

    https://github.com/apache/spark/pull/2919#issuecomment-60969356
  
    I've looked at the code here, and it basically seems reasonable. One
high-level concern I have is around the programming pattern this
encourages: complex nesting of otherwise simple structures, which may make it
difficult to program against Datasets for sufficiently complicated
applications.
    
    A 'dataset' is now a collection of Row, where we have the guarantee that 
all rows in a Dataset conform to the same schema. A schema is a list of (name, 
type) pairs which describe the attributes available in the dataset. This seems 
like a good thing to me, and is pretty much what we described in MLI (and how 
conventional databases have been structured forever). So far, so good. 
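    Concretely, using Spark SQL's StructType as a stand-in for whatever
Schema class this ends up being (just a sketch, not what the PR defines):

    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

    // One (name, type) pair per attribute: a flat, fully enumerated schema.
    val flat = StructType(Seq(
      StructField("a", StringType),
      StructField("b", DoubleType),
      StructField("z", DoubleType)))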
    
    The concern that I have is that we are now encouraging these attributes to
be complex types. For example, where I might have had
    val x = Schema(("a", classOf[String]), ("b", classOf[Double]), ..., ("z", classOf[Double]))
    this would become
    val x = Schema(("a", classOf[String]), ("bGroup", classOf[Vector]), ..., ("zGroup", classOf[Vector]))
    
    So, great, my schema now has these vector things in it, which I can
create separately, pass around, etc.
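    To make that concrete, the grouped version might look something like
the following in Spark SQL's type system, with ArrayType standing in for
Vector (again a sketch, not the PR's actual API):

    import org.apache.spark.sql.types.{ArrayType, DoubleType, StringType, StructField, StructType}

    // Each feature group collapses into a single array-typed column.
    val grouped = StructType(Seq(
      StructField("a", StringType),
      StructField("bGroup", ArrayType(DoubleType, containsNull = false)),
      StructField("zGroup", ArrayType(DoubleType, containsNull = false))))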
    
    This clearly has its merits:
    1) Features are grouped together logically based on the process that
creates them.
    2) Managing one short schema where each record is comprised of a few large
objects (say, 4 vectors, each of length 1000) is probably easier than managing
a really big schema comprised of lots of small objects (say, 4000 doubles).
    
    But there are some major drawbacks:
    1) Why stop at only one level of nesting? Why not have Vector[Vector]?
    2) How do learning algorithms, like SVM or PCA, deal with these Datasets?
Is there an implicit conversion that flattens these things to
RDD[LabeledPoint] (a rough sketch of one possibility follows the examples
below)? Do we want to guarantee these semantics?
    3) Manipulating and subsetting nested schemas like this might be tricky.
Where before I might be able to write:

    val x: Dataset = input.select(Seq(0, 1, 2, 4, 180, 181, 1000, 1001, 1002))

    now I might have to write:

    val groupSelections = Seq(Seq(0, 1, 2, 4), Seq(0, 1), Seq(0, 1, 2))
    val x: Dataset = groupSelections.zip(input.columns).map { case (gs, col) => col(gs) }
    
    Ignoring the raw syntax and semantics of how you might actually map an
operation over the columns of a Dataset and get back a well-structured
dataset, I think this makes two conflicting points:
    1) In the first example - presumably all the work goes into figuring out
which subset of features you want in this really wide feature space.
    2) In the second example - there's a lot of gymnastics that goes into
subsetting feature groups. I think it's clear that working with lots of
feature groups might get unwieldy pretty quickly.
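    For reference, here is the rough flattening sketch I alluded to in
drawback 2. Everything in it is assumed for illustration - that column 0
holds a Double label, that every other column is a feature group stored as
Seq[Double], and the helper name itself - nothing like this exists in the
PR:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // Concatenate each Row's feature groups into one dense vector.
    def toLabeledPoints(rows: RDD[Row]): RDD[LabeledPoint] = rows.map { row =>
      val label = row.getDouble(0)
      val features = (1 until row.length).flatMap(i => row.getSeq[Double](i))
      LabeledPoint(label, Vectors.dense(features.toArray))
    }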
    
    If we look at R or pandas/scikit-learn as examples of projects that have 
(arguably quite successfully) dealt with these interface issues, there is one 
basic pattern: learning algorithms expect big tables of numbers as input. Even 
here, there are some important differences:
    
    For example, in scikit-learn, categorical features aren't supported
directly by most learning algorithms. Instead, users are responsible for
getting data from a "table with heterogeneously typed columns" to a "table
of numbers" with something like OneHotEncoder and other feature transformers.
In R, on the other hand, categorical features are (sometimes frustratingly)
first-class citizens by virtue of the "factor" data type - which is
essentially an enum. Most out-of-the-box learning algorithms (like glm())
accept data frames with categorical inputs and handle them sensibly -
implicitly one-hot encoding (or creating dummy variables, if you prefer) the
categorical features.
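    To show the kind of user-side encoding step scikit-learn expects, here
is a minimal hand-rolled one-hot encoder; the levels are invented for the
example:

    // Encode a categorical value against a fixed, known set of levels.
    val levels = Seq("red", "green", "blue")
    def oneHot(value: String): Array[Double] =
      levels.map(l => if (l == value) 1.0 else 0.0).toArray

    oneHot("green")  // Array(0.0, 1.0, 0.0)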
    
    While I have a slight preference for representing things as big flat 
tables, I would be fine coding either way - but I wanted to raise the issue for 
discussion here before the interfaces are set in stone.
    


