Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/21427

@rxin @gatorsmile thanks for joining the discussion!

On the configuration side, we already have a mechanism for this with the "timezone" config: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala#L48

I'd imagine we could extend that mechanism to support an arbitrary configuration map.

On the behavior side, I've thought more about this and I feel a desirable behavior is to support matching both by name and by index:

(1) If the output dataframe has the same column names as the schema, we match by column name. This is the desirable behavior when the user does:
```
return pd.DataFrame({'a': ..., 'b': ...})
```

(2) If the output dataframe has column names "0, 1, 2, ...", we match by index. These are the default column names when the user doesn't specify any while creating a pd.DataFrame, e.g.:
```
>>> pd.DataFrame([[1, 2.0, "hello"], [4, 5.0, "xxx"]])
   0    1      2
0  1  2.0  hello
1  4  5.0    xxx
```

(3) Otherwise, throw an exception.

What do you think of having the new configuration support this behavior?
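For illustration, the three rules above could be sketched roughly as follows. This is a hypothetical `resolve_columns` helper, not Spark's actual implementation; the schema is represented simply as a list of field names:

```python
import pandas as pd

def resolve_columns(pdf, schema_names):
    """Align a user-returned pandas DataFrame with the expected schema
    using the proposed name-or-index matching rules (sketch only)."""
    cols = list(pdf.columns)
    if set(cols) == set(schema_names):
        # (1) Names match the schema: reorder columns by name.
        return pdf[schema_names]
    if cols == list(range(len(schema_names))):
        # (2) Default integer names (0, 1, 2, ...): match by position.
        out = pdf.copy()
        out.columns = schema_names
        return out
    # (3) Anything else is ambiguous: raise an error.
    raise ValueError(
        "Column names %s do not match schema %s" % (cols, schema_names))

# Case (1): the user supplies matching names (in any order).
df1 = resolve_columns(pd.DataFrame({'b': [2.0], 'a': [1]}), ['a', 'b'])

# Case (2): default column names from a list-of-rows constructor.
df2 = resolve_columns(
    pd.DataFrame([[1, 2.0, "hello"], [4, 5.0, "xxx"]]), ['x', 'y', 'z'])
```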