[ https://issues.apache.org/jira/browse/SPARK-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christian Zommerfelds updated SPARK-15642:
------------------------------------------
    Affects Version/s: 2.2.0

> Metadata gets lost when selecting a field of a StructType
> ---------------------------------------------------------
>
>                 Key: SPARK-15642
>                 URL: https://issues.apache.org/jira/browse/SPARK-15642
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0, 1.6.1, 2.2.0
>            Reporter: Christian Zommerfelds
>
> Hi,
> When working with DataFrames, I sometimes need to write a function that
> creates multiple columns. Since that is not directly possible, I create a
> function that returns a StructType, and then call select() to assign the
> fields to different columns. However, I noticed that the metadata gets lost
> when I do that.
> Example: (Python)
> {code}
> In: from pyspark.sql import Row
> In: from pyspark.sql.types import (ArrayType, DoubleType, IntegerType,
>                                    StructField, StructType)
> In: schema = StructType([StructField('foo', StructType([
>         StructField('features', ArrayType(IntegerType())),
>         StructField('label', DoubleType(), False,
>                     {'ml_attr': {'type': 'nominal', 'vals': ['0.0', '1.0']}})
>     ]))])
> In: df = sqlContext.createDataFrame([Row(foo=Row(features=[1, 2], label=0.0)),
>                                      Row(foo=Row(features=[3, 4], label=1.0))], schema)
> In: df.schema.fields[0].dataType.fields[1].metadata
> Out: {'ml_attr': {'type': 'nominal', 'vals': ['0.0', '1.0']}}
> In: df2 = df.select(df.foo['label'])
> In: df2.schema.fields[0].metadata
> Out: {}
> {code}
> Expected: same metadata (ml_attrib...)
> My workaround is to create a new DataFrame from the RDD, because as far as I
> know PySpark doesn't support adding metadata once the DataFrame is created
> (should I create another issue for that?). Workaround example:
> {code}
> In: df3 = sqlContext.createDataFrame(df2.rdd,
>                                      StructType([schema.fields[0].dataType.fields[1]]))
> In: df3.schema.fields[0].metadata
> Out: {'ml_attr': {'type': 'nominal', 'vals': ['0.0', '1.0']}}
> {code}
> I am not sure if this affects the Scala API. (EDIT: yes it does.
> See test case at https://github.com/apache/spark/pull/13467/files)
> Let me know if I can provide any other information.
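The workaround in the report depends on pulling the nested field's metadata out of the original schema before re-attaching it. As a rough sketch only, the lookup can be written against the plain-dict form of a schema (the shape that StructType.jsonValue() produces), so it runs without a Spark session. The helper name nested_field_metadata and the SCHEMA_JSON literal are hypothetical and only mirror the example schema from the report.

```python
def nested_field_metadata(schema_json, path):
    """Return the metadata dict of the field reached by following `path`
    (a list of field names) through nested struct types.

    `schema_json` is assumed to have the dict layout of a Spark schema
    serialized to JSON: {"type": "struct", "fields": [...]}.
    """
    if not path:
        raise ValueError("path must contain at least one field name")
    node = schema_json
    field = None
    for name in path:
        # Only struct types have named child fields to descend into.
        if not isinstance(node, dict) or node.get("type") != "struct":
            raise KeyError("%r is not inside a struct" % name)
        matches = [f for f in node["fields"] if f["name"] == name]
        if not matches:
            raise KeyError(name)
        field = matches[0]
        node = field["type"]
    return field.get("metadata", {})


# Hypothetical dict form of the schema used in the report.
SCHEMA_JSON = {
    "type": "struct",
    "fields": [{
        "name": "foo",
        "nullable": True,
        "metadata": {},
        "type": {
            "type": "struct",
            "fields": [
                {"name": "features", "nullable": True, "metadata": {},
                 "type": {"type": "array", "elementType": "integer",
                          "containsNull": True}},
                {"name": "label", "nullable": False, "type": "double",
                 "metadata": {"ml_attr": {"type": "nominal",
                                          "vals": ["0.0", "1.0"]}}},
            ],
        },
    }],
}

label_meta = nested_field_metadata(SCHEMA_JSON, ["foo", "label"])
print(label_meta)
```

With a live session, the dict would presumably come from df.schema.jsonValue(). On PySpark 2.2 or later, Column.alias accepts a metadata keyword, so something like df.select(df.foo['label'].alias('label', metadata=label_meta)) may avoid the round trip through the RDD; that variant has not been tested against the versions listed in this issue.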