Take a look:

>>> df = sqlContext.jsonRDD(sc.parallelize(
...     ['{"settings": {"os": "OS X", "version": "10.10"}}']))
>>> df.printSchema()
root
 |-- settings: struct (nullable = true)
 |    |-- os: string (nullable = true)
 |    |-- version: string (nullable = true)
>>> # Now I want to "drop" the version column by
>>> # selecting everything else.
>>> # I want to preserve the schema otherwise.
>>> # That means `os` should stay nested under
>>> # `settings`.
>>> df.select('settings.os').printSchema()
root
 |-- os: string (nullable = true)
>>> df.select('settings', 'settings.os').printSchema()
root
 |-- settings: struct (nullable = true)
 |    |-- os: string (nullable = true)
 |    |-- version: string (nullable = true)
 |-- os: string (nullable = true)
>>> df.select(df['settings.os'].alias('settings.os')).printSchema()
root
 |-- settings.os: string (nullable = true)

In all cases, selecting a nested field loses the original nesting of that
field.

What I want is to select settings.os and get back a DataFrame with the
following schema:

root
 |-- settings: struct (nullable = true)
 |    |-- os: string (nullable = true)

In other words, I want to preserve the fact that os is nested under settings.
I’m doing this as a work-around for the fact that PySpark does not
currently support dropping columns.
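
The closest I've come is rebuilding the struct by hand (a rough sketch,
assuming Spark 1.4+, where pyspark.sql.functions.struct is available):

>>> from pyspark.sql.functions import struct
>>> # Rebuild `settings` from just the fields we want to keep,
>>> # then alias the rebuilt struct back to its original name.
>>> df.select(
...     struct(df['settings.os'].alias('os')).alias('settings')
... ).printSchema()

That does restore the nesting, though the rebuilt struct may come back with
different nullability than the original, and spelling out every kept field
by hand clearly won't scale to wider structs.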

Until direct support for such a feature lands as part of SPARK-7509
<https://issues.apache.org/jira/browse/SPARK-7509>, selecting all columns
except the ones you want to drop seems much better than directly
manipulating the schema, which is the hackier and far more complex way to
roll your own “drop” logic.

And you'd want that process to preserve the schema as much as possible,
which I assume is how a native “drop column” method would work.

Is it possible though? Or do we have to do direct schema manipulation?

Nick