Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    @rxin @gatorsmile thanks for joining the discussion!
    
    On the configuration side, we already have a mechanism for this, used for the "timezone" config:
    
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala#L48
    
    I'd imagine we could extend that mechanism to support an arbitrary configuration map.
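    
    For instance, a user could then toggle the proposed behavior per session. A minimal sketch (the config key below is purely hypothetical, standing in for whatever key we'd route through such a map):
    ```
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    # Hypothetical key -- not an existing Spark option; it illustrates a
    # value that would be carried to the Python worker via the config map.
    spark.conf.set("spark.sql.execution.pandas.groupedMap.matchColumnsByName",
                   "true")
    ```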
    
    On the behavior side, I have thought about this some more, and I feel a desirable behavior is to support matching both by name and by index, i.e.:
    (1) If the output dataframe has the same column names as the schema, we match by column name. This is the desirable behavior when the user does:
    ```
    return pd.DataFrame({'a': ..., 'b': ...})
    ```
    (2) If the output dataframe has the default column names "0, 1, 2, ...", we match by index. This is because when the user doesn't specify column names while creating a pd.DataFrame, those are the defaults, e.g.:
    ```
    >>> pd.DataFrame([[1, 2.0, "hello"], [4, 5.0, "xxx"]])
       0    1      2
    0  1  2.0  hello
    1  4  5.0    xxx
    ``` 
    (3) Otherwise, throw an exception (see the sketch below).
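    
    To make these rules concrete, here is a minimal Python sketch (the function name and error message are mine, not anything in Spark):
    ```
    import pandas as pd
    
    def match_output_columns(pdf, schema_names):
        # (1) Match by name when the output columns are exactly the
        #     schema's column names (in any order).
        if set(pdf.columns) == set(schema_names):
            return pdf[schema_names]
        # (2) Match by index when the columns are pandas' defaults
        #     0, 1, 2, ..., i.e. the user didn't name them.
        if list(pdf.columns) == list(range(len(schema_names))):
            out = pdf.copy()
            out.columns = schema_names
            return out
        # (3) Anything else is ambiguous, so fail loudly.
        raise ValueError("Output columns %s do not match schema columns %s"
                         % (list(pdf.columns), list(schema_names)))
    ```
    With this, both pd.DataFrame({'a': ..., 'b': ...}) and pd.DataFrame([[...]]) behave as the user expects, and a misspelled column name fails fast instead of being silently matched by position.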
    
    What do you think of having the new configuration support this behavior?


