Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r180291486

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/TypePlaceholder.scala ---
@@ -0,0 +1,23 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types
+
+/**
+ * An internal type that is a not yet available and will be replaced by an actual type later.
+ */
+case object TypePlaceholder extends StringType
--- End diff --

Is it necessary to introduce a new DataType? Would it be the same if we use `NullType`? With the flag on, at the end of schema inference, `NullType`, `ArrayType(NullType)`, etc. should be dropped instead of using StringType as a fallback. Basically, during schema inference, we keep the type that reveals more detail, for example:

```
(NullType, ArrayType(NullType)) => ArrayType(NullType)
(ArrayType(NullType), ArrayType(StructType(Field("a", NullType)))) => ArrayType(StructType(Field("a", NullType)))
```

At the end, we implement a util method that determines whether a field is all null and drops it if so. It should be done recursively.

I have an internal implementation of similar logic, but on the JSON record itself.
You might want to apply it to data types.

```scala
// Assumes json4s with the Jackson backend for JValue, parse, render and compact.
import org.json4s._
import org.json4s.jackson.JsonMethods._

/**
 * Removes null fields recursively from the input JSON record.
 * An array is null if all its elements are null.
 * An object is null if all its values are null.
 */
def removeNullRecursively(jsonStr: String): String = {
  val json = parse(jsonStr)
  val cleaned = doRemoveNullRecursively(json)
  compact(render(cleaned)) // should handle null correctly
}

private def doRemoveNullRecursively(value: JValue): JValue = {
  value match {
    case null => null
    case JNull => null
    case JArray(values) =>
      val cleaned = values.map(doRemoveNullRecursively)
      if (cleaned.exists(_ != null)) {
        JArray(cleaned)
      } else {
        null
      }
    case JObject(pairs) =>
      val cleaned = pairs.flatMap { case (k, v) =>
        val cv = doRemoveNullRecursively(v)
        if (cv != null) {
          Some((k, cv))
        } else {
          None
        }
      }
      if (cleaned.nonEmpty) {
        JObject(cleaned)
      } else {
        null
      }
    // all other types are non-null
    case _ => value
  }
}
```
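For illustration, here is a rough sketch of how the same recursion might look on data types rather than on JSON records. The object name `NullTypePruning` and the method `dropNullTypes` are hypothetical and not existing Spark APIs; the sketch assumes `spark-sql` is on the classpath and only handles `NullType`, `ArrayType` and `StructType`, treating every other type as non-null.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not part of Spark: prunes all-null parts of an inferred type.
object NullTypePruning {

  /**
   * Returns the type with all-null parts removed,
   * or None if the whole type carries no information.
   */
  def dropNullTypes(dt: DataType): Option[DataType] = dt match {
    case NullType => None
    case ArrayType(elementType, containsNull) =>
      // An array type is all-null if its element type is all-null.
      dropNullTypes(elementType).map(t => ArrayType(t, containsNull))
    case StructType(fields) =>
      // Keep only the fields whose type survives pruning.
      val kept = fields.toSeq.flatMap { f =>
        dropNullTypes(f.dataType).map(t => f.copy(dataType = t))
      }
      if (kept.nonEmpty) Some(StructType(kept)) else None
    // All other types are treated as non-null.
    case other => Some(other)
  }
}
```

Using `Option` instead of `null` as the "nothing left" marker keeps the recursion free of null checks; a top-level `None` would mean that the whole schema was null.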