Take a look:

>>> df = sqlContext.jsonRDD(sc.parallelize(['{"settings": {"os": "OS X", "version": "10.10"}}']))
>>> df.printSchema()
root
 |-- settings: struct (nullable = true)
 |    |-- os: string (nullable = true)
 |    |-- version: string (nullable = true)
>>> # Now I want to "drop" the version column by selecting everything else.
>>> # I want to preserve the schema otherwise. That means `os` should stay
>>> # nested under `settings`.
>>> df.select('settings.os').printSchema()
root
 |-- os: string (nullable = true)
>>> df.select('settings', 'settings.os').printSchema()
root
 |-- settings: struct (nullable = true)
 |    |-- os: string (nullable = true)
 |    |-- version: string (nullable = true)
 |-- os: string (nullable = true)
>>> df.select(df['settings.os'].alias('settings.os')).printSchema()
root
 |-- settings.os: string (nullable = true)
In all cases, selecting a nested field loses the original nesting of that field. What I want is to select settings.os and get back a DataFrame with the following schema:

root
 |-- settings: struct (nullable = true)
 |    |-- os: string (nullable = true)

In other words, I want to preserve the fact that os is nested under settings.

I'm doing this as a workaround for the fact that PySpark does not currently support dropping columns. Until direct support for that lands as part of SPARK-7509 <https://issues.apache.org/jira/browse/SPARK-7509>, selecting all the columns except the ones you want to drop seems much better than rolling your own "drop" logic by manipulating the schema directly, which is the hackier and far more complex alternative. And you want that select to preserve the schema as much as possible, which I assume is how a native "drop column" method would behave.

Is it possible, though? Or do we have to manipulate the schema directly?

Nick
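
P.S. The closest I've gotten is rebuilding the struct by hand with pyspark.sql.functions.struct (available as of 1.4, if I'm reading the docs right). This is just a sketch against the toy example above: it means enumerating every field you want to keep, and the rebuilt struct comes back as nullable = false, so the schema isn't preserved exactly:

>>> from pyspark.sql.functions import col, struct
>>> # Rebuild `settings` containing only the fields we want to keep,
>>> # so `os` stays nested under `settings`.
>>> df.select(struct(col('settings.os').alias('os')).alias('settings')).printSchema()
root
 |-- settings: struct (nullable = false)
 |    |-- os: string (nullable = true)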