[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46578605 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46578586 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46584481 Merged build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46584464 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46587915 Merged build finished.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46587917 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15911/
Github user AndreSchumacher commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46589367 @rxin any idea why this one test fails?
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46593227 That test has been flaky. We are fixing it.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46593240 Jenkins, retest this please.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46593586 Merged build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46593569 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46593975 Merged build finished. All automated tests passed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46593978 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15912/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46602864 Merged build finished. All automated tests passed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46602867 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15914/
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46612027 @AndreSchumacher do u mind removing the [WIP] tag from the pull request? Unfortunately due to the avro version bump, we can't include this in 1.0.1.
Github user AndreSchumacher commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46644749 @rxin the avro dependency is for the tests only (to make sure we can read parquet files with avro objects in them). I can remove the one test if that blocks it from being included. When the rest of the build has caught up with the version we can add it again. What do you think?
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46644798 That sounds good. If you can just comment that test out for now, that'd be great.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46645804 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46645809 Merged build started.
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13926401

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
@@ -0,0 +1,667 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap}
+
+import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, Converter}
+import parquet.schema.MessageType
+
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{GenericRow, Row, Attribute}
+import org.apache.spark.sql.parquet.CatalystConverter.FieldType
+
+/**
+ * Collection of converters of Parquet types (group and primitive types) that
+ * model arrays and maps. The conversions are partly based on the AvroParquet
+ * converters that are part of Parquet in order to be able to process these
+ * types.
+ *
+ * There are several types of converters:
+ * <ul>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystPrimitiveConverter]] for primitive
+ *   (numeric, boolean and String) types</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystNativeArrayConverter]] for arrays
+ *   of native JVM element types; note: currently null values are not supported!</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystArrayConverter]] for arrays of
+ *   arbitrary element types (including nested element types); note: currently
+ *   null values are not supported!</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystStructConverter]] for structs</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystMapConverter]] for maps; note:
+ *   currently null values are not supported!</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter]] for rows
+ *   of only primitive element types</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystGroupConverter]] for other nested
+ *   records, including the top-level row record</li>
+ * </ul>
+ */
+
+private[sql] object CatalystConverter {
+  // The type internally used for fields
+  type FieldType = StructField
+
+  // This is mostly Parquet convention (see, e.g., `ConversionPatterns`).
+  // Note that "array" for the array elements is chosen by ParquetAvro.
+  // Using a different value will result in Parquet silently dropping columns.
+  val ARRAY_ELEMENTS_SCHEMA_NAME = "array"
+  val MAP_KEY_SCHEMA_NAME = "key"
+  val MAP_VALUE_SCHEMA_NAME = "value"
+  val MAP_SCHEMA_NAME = "map"
+
+  // TODO: consider using Array[T] for arrays to avoid boxing of primitive types
+  type ArrayScalaType[T] = Seq[T]
+  type StructScalaType[T] = Seq[T]
+  type MapScalaType[K, V] = Map[K, V]
+
+  protected[parquet] def createConverter(
+      field: FieldType,
+      fieldIndex: Int,
+      parent: CatalystConverter): Converter = {
+    val fieldType: DataType = field.dataType
+    fieldType match {
+      // For native JVM types we use a converter with native arrays
+      case ArrayType(elementType: NativeType) => {
+        new CatalystNativeArrayConverter(elementType, fieldIndex, parent)
+      }
+      // This is for other types of arrays, including those with nested fields
+      case ArrayType(elementType: DataType) => {
+        new CatalystArrayConverter(elementType, fieldIndex, parent)
+      }
+      case StructType(fields: Seq[StructField]) => {
+        new CatalystStructConverter(fields, fieldIndex, parent)
+      }
+      case MapType(keyType: DataType, valueType: DataType) => {
+        new CatalystMapConverter(
+          Seq(
+            new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false),
+            new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)),
+          fieldIndex,
+          parent)
+      }
+      // Strings, Shorts and Bytes do not have a corresponding type in Parquet
+      // so we need
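The heart of the quoted hunk is the type-directed dispatch in `createConverter`: native-element arrays get an unboxed converter, other arrays a generic one, then structs and maps. A minimal Python mirror of that dispatch can sketch the idea; all class and constant names below are illustrative stand-ins for the Scala ones in the diff, not Spark's actual API.

```python
# Schema names follow the parquet-avro convention referenced in the diff;
# the comment there warns that other names make Parquet silently drop columns.
ARRAY_ELEMENTS_SCHEMA_NAME = "array"
MAP_KEY_SCHEMA_NAME = "key"
MAP_VALUE_SCHEMA_NAME = "value"

# Toy stand-ins for Catalyst's DataType hierarchy.
class DataType: pass
class NativeType(DataType): pass          # e.g. Int, Long, Double
class IntType(NativeType): pass
class ArrayType(DataType):
    def __init__(self, element_type): self.element_type = element_type
class StructType(DataType):
    def __init__(self, fields): self.fields = fields
class MapType(DataType):
    def __init__(self, key_type, value_type):
        self.key_type, self.value_type = key_type, value_type

def create_converter(field_type):
    """Pick a converter the way the Scala match expression does: the most
    specific array case (native element types) is tried first, then the
    generic array case, then structs and maps."""
    if isinstance(field_type, ArrayType):
        if isinstance(field_type.element_type, NativeType):
            return "CatalystNativeArrayConverter"
        return "CatalystArrayConverter"
    if isinstance(field_type, StructType):
        return "CatalystStructConverter"
    if isinstance(field_type, MapType):
        return "CatalystMapConverter"
    return "CatalystPrimitiveConverter"

print(create_converter(ArrayType(IntType())))           # CatalystNativeArrayConverter
print(create_converter(ArrayType(StructType([]))))      # CatalystArrayConverter
print(create_converter(MapType(IntType(), IntType())))  # CatalystMapConverter
```

The ordering matters for the same reason it does in the Scala `match`: an array of native elements satisfies both array cases, so the specialized (unboxed) case must be checked before the general one.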
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13926479 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala (quoting the same hunk as above) ---
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13876519 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala (quoting the same hunk as above) ---
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13892309 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala (quoting the same hunk as above) ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13892425 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala --- @@ -0,0 +1,667 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap} + +import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, Converter} +import parquet.schema.MessageType + +import org.apache.spark.sql.catalyst.types._ +import org.apache.spark.sql.catalyst.expressions.{GenericRow, Row, Attribute} +import org.apache.spark.sql.parquet.CatalystConverter.FieldType + +/** + * Collection of converters of Parquet types (group and primitive types) that + * model arrays and maps. The conversions are partly based on the AvroParquet + * converters that are part of Parquet in order to be able to process these + * types. 
+ * + * There are several types of converters: + * ul + * li[[org.apache.spark.sql.parquet.CatalystPrimitiveConverter]] for primitive + * (numeric, boolean and String) types/li + * li[[org.apache.spark.sql.parquet.CatalystNativeArrayConverter]] for arrays + * of native JVM element types; note: currently null values are not supported!/li + * li[[org.apache.spark.sql.parquet.CatalystArrayConverter]] for arrays of + * arbitrary element types (including nested element types); note: currently + * null values are not supported!/li + * li[[org.apache.spark.sql.parquet.CatalystStructConverter]] for structs/li + * li[[org.apache.spark.sql.parquet.CatalystMapConverter]] for maps; note: + * currently null values are not supported!/li + * li[[org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter]] for rows + * of only primitive element types/li + * li[[org.apache.spark.sql.parquet.CatalystGroupConverter]] for other nested + * records, including the top-level row record/li + * /ul + */ + +private[sql] object CatalystConverter { + // The type internally used for fields + type FieldType = StructField + + // This is mostly Parquet convention (see, e.g., `ConversionPatterns`). + // Note that array for the array elements is chosen by ParquetAvro. + // Using a different value will result in Parquet silently dropping columns. 
+ val ARRAY_ELEMENTS_SCHEMA_NAME = "array" + val MAP_KEY_SCHEMA_NAME = "key" + val MAP_VALUE_SCHEMA_NAME = "value" + val MAP_SCHEMA_NAME = "map" + + // TODO: consider using Array[T] for arrays to avoid boxing of primitive types + type ArrayScalaType[T] = Seq[T] + type StructScalaType[T] = Seq[T] + type MapScalaType[K, V] = Map[K, V] + + protected[parquet] def createConverter( + field: FieldType, + fieldIndex: Int, + parent: CatalystConverter): Converter = { + val fieldType: DataType = field.dataType + fieldType match { + // For native JVM types we use a converter with native arrays + case ArrayType(elementType: NativeType) => { + new CatalystNativeArrayConverter(elementType, fieldIndex, parent) + } + // This is for other types of arrays, including those with nested fields + case ArrayType(elementType: DataType) => { + new CatalystArrayConverter(elementType, fieldIndex, parent) + } + case StructType(fields: Seq[StructField]) => { + new CatalystStructConverter(fields, fieldIndex, parent) + } + case MapType(keyType: DataType, valueType: DataType) => { + new CatalystMapConverter( + Seq( + new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false), + new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)), + fieldIndex, + parent) + } + // Strings, Shorts and Bytes do not have a corresponding type in Parquet + // so we need to treat
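The converter dispatch in `createConverter` above is an ordered pattern match over the Catalyst type of a field. A minimal, self-contained Scala sketch of that dispatch order follows; the ADT and names here are illustrative stand-ins, not Spark's actual classes, and each converter is represented by a label rather than a real converter instance:

```scala
// Simplified stand-in for Catalyst's DataType hierarchy (illustrative only).
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType
final case class ArrayType(elementType: DataType) extends DataType
final case class MapType(keyType: DataType, valueType: DataType) extends DataType
final case class StructType(fields: Seq[(String, DataType)]) extends DataType

object ConverterDispatch {
  // Mirrors the match order in the quoted code: arrays of "native"
  // (primitive) element types are caught before the generic array case,
  // then structs and maps; anything else falls through to a primitive
  // converter.
  def choose(dt: DataType): String = dt match {
    case ArrayType(IntType) | ArrayType(StringType) => "native-array"
    case ArrayType(_)                               => "generic-array"
    case StructType(_)                              => "struct"
    case MapType(_, _)                              => "map"
    case _                                          => "primitive"
  }
}
```

Because Scala tries the cases top to bottom, putting the native-element case first is what lets the unboxed-array converter win over the generic one.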
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13892462 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala --- (same diff context as quoted above)
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13892570 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala --- (same diff context as quoted above)
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13711334 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala --- @@ -0,0 +1,409 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +package org.apache.spark.sql.parquet + +import java.io.IOException + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileSystem, Path} +import org.apache.hadoop.mapreduce.Job + +import parquet.hadoop.{ParquetFileReader, Footer, ParquetFileWriter} +import parquet.hadoop.metadata.{ParquetMetadata, FileMetaData} +import parquet.hadoop.util.ContextUtil +import parquet.schema.{Type => ParquetType, PrimitiveType => ParquetPrimitiveType, MessageType} +import parquet.schema.{GroupType => ParquetGroupType, OriginalType => ParquetOriginalType, ConversionPatterns} +import parquet.schema.PrimitiveType.{PrimitiveTypeName => ParquetPrimitiveTypeName} +import parquet.schema.Type.Repetition + +import org.apache.spark.Logging +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Attribute} +import org.apache.spark.sql.catalyst.types._ + +// Implicits +import scala.collection.JavaConversions._ + +private[parquet] object ParquetTypesConverter extends Logging { + def isPrimitiveType(ctype: DataType): Boolean = + classOf[PrimitiveType] isAssignableFrom ctype.getClass + + def toPrimitiveDataType(parquetType: ParquetPrimitiveTypeName): DataType = parquetType match { + case ParquetPrimitiveTypeName.BINARY => StringType + case ParquetPrimitiveTypeName.BOOLEAN => BooleanType + case ParquetPrimitiveTypeName.DOUBLE => DoubleType + case ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY => ArrayType(ByteType) + case ParquetPrimitiveTypeName.FLOAT => FloatType + case ParquetPrimitiveTypeName.INT32 => IntegerType + case ParquetPrimitiveTypeName.INT64 => LongType + case ParquetPrimitiveTypeName.INT96 => + // TODO: add BigInteger type? TODO(andre) use DecimalType instead + sys.error("Potential loss of precision: cannot convert INT96") + case _ => sys.error( + s"Unsupported parquet datatype $parquetType") + } + + /** + * Converts a given Parquet `Type` into the corresponding + * [[org.apache.spark.sql.catalyst.types.DataType]].
+ * + * We apply the following conversion rules: + * <ul> + * <li> Primitive types are converted to the corresponding primitive type.</li> + * <li> Group types that have a single field that is itself a group, which has repetition + * level `REPEATED`, are treated as follows:<ul> + * <li> If the nested group has name `values`, the surrounding group is converted + * into an [[ArrayType]] with the corresponding field type (primitive or + * complex) as element type.</li> + * <li> If the nested group has name `map` and two fields (named `key` and `value`), + * the surrounding group is converted into a [[MapType]] + * with the corresponding key and value (value possibly complex) types. + * Note that we currently assume map values are not nullable.</li> + * <li> Other group types are converted into a [[StructType]] with the corresponding + * field types.</li></ul></li> + * </ul> + * Note that fields are determined to be `nullable` if and only if their Parquet repetition + * level is not `REQUIRED`. + * + * @param parquetType The type to convert. + * @return The corresponding Catalyst type. + */ + def toDataType(parquetType: ParquetType): DataType = { + def correspondsToMap(groupType: ParquetGroupType): Boolean = { + if (groupType.getFieldCount != 1 || groupType.getFields.apply(0).isPrimitive) { + false + } else { + // This mostly follows the convention in
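The structural convention described in the scaladoc above (a group with a single repeated nested group named `values` encodes an array; one named `map` with `key`/`value` fields encodes a map) can be sketched as a standalone Scala snippet. The types here are illustrative stand-ins, not Parquet's actual schema API:

```scala
// Minimal stand-ins for Parquet schema nodes (illustrative only).
sealed trait PType { def name: String }
final case class Primitive(name: String) extends PType
final case class Group(name: String, repeated: Boolean, fields: Seq[PType]) extends PType

object SchemaShape {
  // An array is a group whose sole child is a repeated group named "values".
  def looksLikeArray(g: Group): Boolean = g.fields match {
    case Seq(inner: Group) => inner.repeated && inner.name == "values"
    case _                 => false
  }

  // A map is a group whose sole child is a repeated group named "map"
  // carrying exactly the two fields "key" and "value".
  def looksLikeMap(g: Group): Boolean = g.fields match {
    case Seq(inner: Group) =>
      inner.repeated && inner.name == "map" &&
        inner.fields.map(_.name) == Seq("key", "value")
    case _                 => false
  }
}
```

The guard on the field count and on the child being a group (not primitive) corresponds to the early `false` branch in `correspondsToMap` in the quoted code.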
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13661910 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala --- (same diff context as quoted above)
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13662320 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala --- (same diff context as quoted above)
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45437844 Merged build started. ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45437841 Merged build triggered. ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45439536 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15543/ ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45439533 Merged build finished. ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45444581 Merged build triggered. ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45444586 Merged build started. ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45446575 Merged build finished. All automated tests passed. ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45446576 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15546/ ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45416599 test this please ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45416693 Build started. ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45416688 Build triggered. ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45416740 Build finished. ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45416741 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15529/ ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45416832 Hey @AndreSchumacher looks like style issues are failing Jenkins. ---
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13373005 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala --- @@ -29,25 +31,36 @@ abstract class DataType { case e: Expression if e.dataType == this => true case _ => false } + + def isPrimitive(): Boolean = false --- End diff -- No `()`. ---
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45130821 Build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45130853 Build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45131159 Build finished.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45131161 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15452/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45140163 Build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45140288 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15454/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-45140285 Build finished.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44777055 Merged build triggered.
Github user AndreSchumacher commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44777089 The changes to SqlParser should not have any functional effect. I just had to reshuffle a few things to make it easier to extend it into a parser that supports the nested array/map-field expressions; see `NestedSqlParser` in `ParquetQuerySuite`. Currently that parser is incompatible with Hive expressions such as `insert into database_name.table_name`, so we need to revisit it once the syntax is fixed. It's currently there for the tests only.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44778633 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15328/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44778632 Merged build finished. All automated tests passed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44081955 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44081960 Merged build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44081984 Merged build finished.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44081985 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15176/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44082041 Merged build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44082034 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44083405 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15177/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-44083404 Merged build finished. All automated tests passed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-42779124 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-42779130 Merged build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-42781102 Merged build finished.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-42781104 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14889/
Github user AndreSchumacher commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-41491324 I added a new issue about the nullability question here: https://issues.apache.org/jira/browse/SPARK-1649
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11985586 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
@@ -54,9 +54,57 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
   /**
    * Optionally resolves the given string to a
    * [[catalyst.expressions.NamedExpression NamedExpression]]. The attribute is expressed as
-   * as string in the following form: `[scope].AttributeName.[nested].[fields]...`.
+   * as string in the following form: `[scope].AttributeName.[nested].[fields]...`. Fields
+   * can contain ordinal expressions, such as `field[i][j][k]...`.
    */
   def resolve(name: String): Option[NamedExpression] = {
+    def expandFunc(expType: (Expression, DataType), field: String): (Expression, DataType) = {
--- End diff --
Thanks, I will have a look. One question: does this also handle maps and nested fields inside arrays, like `struct.array1[1].field1.map1[key1].array2[0]`? I don't know Optiq, but I will still check out how Hive does this. Since there was (is?) no support for nested Parquet types in Hive, that may be a dead end though.
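The kind of nested-access path discussed here, such as `struct.array1[1].field1.map1[key1].array2[0]`, splits into alternating field names and ordinals. The following is a hypothetical sketch of that split (the `Access`/`Field`/`Ordinal` names and `parsePath` are invented for illustration; the real resolver builds `GetField`/`GetItem` expressions instead of tokens):

```scala
// A path component is either a named field or an ordinal/key lookup.
sealed trait Access
case class Field(name: String) extends Access
case class Ordinal(key: String) extends Access

// Split "a.b[1][k].c" into Field(a), Field(b), Ordinal(1), Ordinal(k), Field(c)
def parsePath(path: String): Seq[Access] =
  path.split('.').toSeq.flatMap { part =>
    val open = part.indexOf('[')
    if (open < 0) Seq(Field(part))
    else {
      // "b[1][k]" -> ordinals "1", "k"
      val ordinals = part.substring(open)
        .stripPrefix("[").stripSuffix("]")
        .split("\\]\\[").toSeq
        .map(Ordinal)
      Field(part.substring(0, open)) +: ordinals
    }
  }
```

This only tokenizes the string; resolving each token against array, map, and struct types (and recovering the key type, as discussed below in this thread) is the part the actual resolver has to do.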
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11985972 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala ---
@@ -206,6 +206,67 @@ class GenericMutableRow(size: Int) extends GenericRow(size) with MutableRow {
   override def copy() = new GenericRow(values.clone())
 }
+
+// TODO: this is an awful lot of code duplication. If values would be covariant we could reuse
+// much of GenericRow
+class NativeRow[T](protected[catalyst] val values: Array[T]) extends Row {
--- End diff --
@marmbrus Good question. I think I added that because `GetField` wants to get a Row when it calls `eval` on its children. I will have another look at that.
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11986096 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala ---
@@ -52,6 +52,7 @@ case class GetItem(child: Expression, ordinal: Expression) extends Expression {
     }
   } else {
     val baseValue = child.eval(input).asInstanceOf[Map[Any, _]]
+    // TODO: recover key type!!
--- End diff --
`MapType` records the type of the key. I was wondering whether one should use that instead, if possible, and not just `Any`. The comment relates to the other comments inside the resolver.
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11986278 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
@@ -29,25 +31,36 @@ abstract class DataType {
     case e: Expression if e.dataType == this => true
     case _ => false
   }
+
+  def isPrimitive(): Boolean = false
 }
 
 case object NullType extends DataType
+
+trait PrimitiveType extends DataType {
--- End diff --
@marmbrus `PrimitiveType` is maybe a misnomer. It's the same term that Parquet uses. Basically a `PrimitiveType` is a type that does not contain another type (so non-nested). You can argue that a String is a Char array and therefore not primitive, but in terms of constructing nested rows it means that a primitive type is a leaf inside the tree that produces a record. It would help to somehow distinguish between nested and non-nested types. `NativeType` comes close, but for example there is `BinaryType`, which is primitive but not native.
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11986397 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -0,0 +1,369 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.io.IOException
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import parquet.hadoop.{ParquetFileReader, Footer, ParquetFileWriter}
+import parquet.hadoop.metadata.{ParquetMetadata, FileMetaData}
+import parquet.hadoop.util.ContextUtil
+import parquet.schema.{Type => ParquetType, PrimitiveType => ParquetPrimitiveType, MessageType, MessageTypeParser}
+import parquet.schema.{GroupType => ParquetGroupType, OriginalType => ParquetOriginalType, ConversionPatterns}
+import parquet.schema.PrimitiveType.{PrimitiveTypeName => ParquetPrimitiveTypeName}
+import parquet.schema.Type.Repetition
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Attribute}
+import org.apache.spark.sql.catalyst.types._
+
+// Implicits
+import scala.collection.JavaConversions._
+
+private[parquet] object ParquetTypesConverter {
+  def isPrimitiveType(ctype: DataType): Boolean =
+    classOf[PrimitiveType] isAssignableFrom ctype.getClass
+
+  def toPrimitiveDataType(parquetType: ParquetPrimitiveTypeName): DataType = parquetType match {
+    case ParquetPrimitiveTypeName.BINARY => StringType
+    case ParquetPrimitiveTypeName.BOOLEAN => BooleanType
+    case ParquetPrimitiveTypeName.DOUBLE => DoubleType
+    case ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY => ArrayType(ByteType)
+    case ParquetPrimitiveTypeName.FLOAT => FloatType
+    case ParquetPrimitiveTypeName.INT32 => IntegerType
+    case ParquetPrimitiveTypeName.INT64 => LongType
+    case ParquetPrimitiveTypeName.INT96 =>
+      // TODO: add BigInteger type? TODO(andre) use DecimalType instead
+      sys.error("Potential loss of precision: cannot convert INT96")
+    case _ => sys.error(
+      s"Unsupported parquet datatype $parquetType")
+  }
+
+  /**
+   * Converts a given Parquet `Type` into the corresponding
+   * [[org.apache.spark.sql.catalyst.types.DataType]].
+   *
+   * Note that we apply the following conversion rules:
+   * <ul>
+   *   <li>Primitive types are converted to the corresponding primitive type.</li>
+   *   <li>Group types that have a single field that is itself a group, which has repetition
+   *       level `REPEATED`, are treated as follows:<ul>
+   *     <li>If the nested group has name `values`, the surrounding group is converted
+   *         into an [[ArrayType]] with the corresponding field type (primitive or
+   *         complex) as element type.</li>
+   *     <li>If the nested group has name `map` and two fields (named `key` and `value`),
+   *         the surrounding group is converted into a [[MapType]]
+   *         with the corresponding key and value (value possibly complex) types.
+   *         Note that we currently assume map values are not nullable.</li>
+   *     <li>Other group types are converted into a [[StructType]] with the corresponding
+   *         field types.</li></ul></li>
+   * </ul>
+   * Note that fields are determined to be `nullable` if and only if their Parquet repetition
+   * level is not `REQUIRED`.
+   *
+   * @param parquetType The type to convert.
+   * @return The corresponding Catalyst type.
+   */
+  def toDataType(parquetType: ParquetType): DataType = {
+    def correspondsToMap(groupType: ParquetGroupType): Boolean = {
+      if (groupType.getFieldCount != 1 || groupType.getFields.apply(0).isPrimitive) {
+        false
+      } else {
+        // This mostly follows the convention in ``parquet.schema.ConversionPatterns``
---
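The conversion rules documented in this scaladoc can be condensed into a small decision function. The following is a hedged sketch over a toy model of Parquet schemas; `PType`/`PGroup`/`CType` and friends are invented for illustration, whereas the real code works on `parquet.schema.Type` and Catalyst `DataType`:

```scala
// Toy model of a Parquet schema tree (hypothetical, simplified)
sealed trait PType { def name: String }
case class PPrimitive(name: String) extends PType
case class PGroup(name: String, repeated: Boolean, fields: Seq[PType]) extends PType

// Toy model of the resulting Catalyst-like types
sealed trait CType
case class CPrim(parquetName: String) extends CType
case class CArray(elementType: CType) extends CType
case class CMap(keyType: CType, valueType: CType) extends CType
case class CStruct(fields: Seq[(String, CType)]) extends CType

def toDataType(t: PType): CType = t match {
  case PPrimitive(n) => CPrim(n)
  // single repeated child group named "values" => array of the element type
  case PGroup(_, _, Seq(PGroup("values", true, Seq(elem)))) =>
    CArray(toDataType(elem))
  // single repeated child group named "map" with key/value fields => map
  case PGroup(_, _, Seq(PGroup("map", true, Seq(k, v)))) =>
    CMap(toDataType(k), toDataType(v))
  // every other group => struct with the converted field types
  case PGroup(_, _, fields) =>
    CStruct(fields.map(f => f.name -> toDataType(f)))
}
```

The sketch omits nullability (derived from the `REQUIRED` repetition level in the quoted code) and collapses the `correspondsToMap`-style checks into pattern matches, but the branch structure is the same as the rules listed in the scaladoc.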
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11986431 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
@@ -0,0 +1,582 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap}
+
+import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, Converter}
+import parquet.schema.MessageType
+
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{NativeRow, GenericRow, Row, Attribute}
+import org.apache.spark.sql.parquet.CatalystConverter.FieldType
+
+private[parquet] object CatalystConverter {
+  // The type internally used for fields
+  type FieldType = StructField
+
+  // This is mostly Parquet convention (see, e.g., `ConversionPatterns`).
+  // Note that "array" for the array elements is chosen by ParquetAvro.
+  // Using a different value will result in Parquet silently dropping columns.
+  val ARRAY_ELEMENTS_SCHEMA_NAME = "array"
+  val MAP_KEY_SCHEMA_NAME = "key"
+  val MAP_VALUE_SCHEMA_NAME = "value"
+  val MAP_SCHEMA_NAME = "map"
+
+  protected[parquet] def createConverter(
+      field: FieldType,
+      fieldIndex: Int,
+      parent: CatalystConverter): Converter = {
+    val fieldType: DataType = field.dataType
+    fieldType match {
+      // For native JVM types we use a converter with native arrays
+      case ArrayType(elementType: NativeType) => {
+        new CatalystNativeArrayConverter(elementType, fieldIndex, parent)
+      }
+      // This is for other types of arrays, including those with nested fields
+      case ArrayType(elementType: DataType) => {
+        new CatalystArrayConverter(elementType, fieldIndex, parent)
+      }
+      case StructType(fields: Seq[StructField]) => {
+        new CatalystStructConverter(fields, fieldIndex, parent)
+      }
+      case MapType(keyType: DataType, valueType: DataType) => {
+        new CatalystMapConverter(
+          Seq(
+            new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false),
+            new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)),
+          fieldIndex,
+          parent)
+      }
+      case ctype: NativeType => {
+        // note: for some reason matching for StringType fails so use this ugly if instead
+        if (ctype == StringType) new CatalystPrimitiveStringConverter(parent, fieldIndex)
+        else new CatalystPrimitiveConverter(parent, fieldIndex)
+      }
+      case _ => throw new RuntimeException(
+        s"unable to convert datatype ${field.dataType.toString} in CatalystGroupConverter")
+    }
+  }
+
+  protected[parquet] def createRootConverter(parquetSchema: MessageType): CatalystConverter = {
+    val attributes = ParquetTypesConverter.convertToAttributes(parquetSchema)
+    // For non-nested types we use the optimized Row converter
+    if (attributes.forall(a => ParquetTypesConverter.isPrimitiveType(a.dataType))) {
+      new PrimitiveRowGroupConverter(attributes)
+    } else {
+      new CatalystGroupConverter(attributes)
+    }
+  }
+}
+
+private[parquet] trait CatalystConverter {
+  // the number of fields this group has
+  protected[parquet] val size: Int
+
+  // the index of this converter in the parent
+  protected[parquet] val index: Int
+
+  // the parent converter
+  protected[parquet] val parent: CatalystConverter
+
+  // for child converters to update upstream values
+  protected[parquet] def updateField(fieldIndex: Int, value: Any): Unit
+
+  protected[parquet] def updateBoolean(fieldIndex: Int, value: Boolean): Unit =
+    updateField(fieldIndex, value)
+
+  protected[parquet] def updateInt(fieldIndex: Int, value: Int): Unit =
+    updateField(fieldIndex,
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11987399 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
@@ -0,0 +1,582 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap}
+
+import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, Converter}
+import parquet.schema.MessageType
+
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{NativeRow, GenericRow, Row, Attribute}
+import org.apache.spark.sql.parquet.CatalystConverter.FieldType
+
+private[parquet] object CatalystConverter {
+  // The type internally used for fields
+  type FieldType = StructField
+
+  // This is mostly Parquet convention (see, e.g., `ConversionPatterns`).
+  // Note that "array" for the array elements is chosen by ParquetAvro.
+  // Using a different value will result in Parquet silently dropping columns.
+  val ARRAY_ELEMENTS_SCHEMA_NAME = "array"
+  val MAP_KEY_SCHEMA_NAME = "key"
+  val MAP_VALUE_SCHEMA_NAME = "value"
+  val MAP_SCHEMA_NAME = "map"
+
+  protected[parquet] def createConverter(
+      field: FieldType,
+      fieldIndex: Int,
+      parent: CatalystConverter): Converter = {
+    val fieldType: DataType = field.dataType
+    fieldType match {
+      // For native JVM types we use a converter with native arrays
+      case ArrayType(elementType: NativeType) => {
+        new CatalystNativeArrayConverter(elementType, fieldIndex, parent)
+      }
+      // This is for other types of arrays, including those with nested fields
+      case ArrayType(elementType: DataType) => {
+        new CatalystArrayConverter(elementType, fieldIndex, parent)
+      }
+      case StructType(fields: Seq[StructField]) => {
+        new CatalystStructConverter(fields, fieldIndex, parent)
+      }
+      case MapType(keyType: DataType, valueType: DataType) => {
+        new CatalystMapConverter(
+          Seq(
+            new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false),
+            new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)),
+          fieldIndex,
+          parent)
+      }
+      case ctype: NativeType => {
+        // note: for some reason matching for StringType fails so use this ugly if instead
+        if (ctype == StringType) new CatalystPrimitiveStringConverter(parent, fieldIndex)
+        else new CatalystPrimitiveConverter(parent, fieldIndex)
+      }
+      case _ => throw new RuntimeException(
+        s"unable to convert datatype ${field.dataType.toString} in CatalystGroupConverter")
+    }
+  }
+
+  protected[parquet] def createRootConverter(parquetSchema: MessageType): CatalystConverter = {
+    val attributes = ParquetTypesConverter.convertToAttributes(parquetSchema)
+    // For non-nested types we use the optimized Row converter
+    if (attributes.forall(a => ParquetTypesConverter.isPrimitiveType(a.dataType))) {
+      new PrimitiveRowGroupConverter(attributes)
+    } else {
+      new CatalystGroupConverter(attributes)
+    }
+  }
+}
+
+private[parquet] trait CatalystConverter {
+  // the number of fields this group has
+  protected[parquet] val size: Int
+
+  // the index of this converter in the parent
+  protected[parquet] val index: Int
+
+  // the parent converter
+  protected[parquet] val parent: CatalystConverter
+
+  // for child converters to update upstream values
+  protected[parquet] def updateField(fieldIndex: Int, value: Any): Unit
+
+  protected[parquet] def updateBoolean(fieldIndex: Int, value: Boolean): Unit =
+    updateField(fieldIndex, value)
+
+  protected[parquet] def updateInt(fieldIndex: Int, value: Int): Unit =
+    updateField(fieldIndex,
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11989530

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
@@ -153,9 +153,15 @@ case class InsertIntoParquetTable(
     val job = new Job(sc.hadoopConfiguration)
-    ParquetOutputFormat.setWriteSupportClass(
-      job,
-      classOf[org.apache.spark.sql.parquet.RowWriteSupport])
+    val writeSupport =
+      if (child.output.map(_.dataType).forall(_.isPrimitive())) {
+        logger.info("Initializing MutableRowWriteSupport")
--- End diff --

@marmbrus Good question. I'm not yet totally sure myself. But consider the following example: you have an array of structs, which have another array as a field. So something like `ArrayType(StructType(Seq(ArrayType(IntegerType))))`. Let's call the inner array `inner` and the outer array `outer`. Note that `outer` could itself be just a field in a higher-level record. Now whenever Parquet is done passing the data for the current `inner`, it will let you know by calling `end` on the converter for that field, in this case an array converter. Now the current struct has been processed completely, so its converter's `end` will be called, too. The current `outer` record, however, may or may not be completed. If it's not completed, then the current `inner` needs to be stored somewhere, and you cannot use a mutable row because it is not yet safe to reuse that chunk of memory when the next `inner` comes along. Does this make any sense at all? I'm happy to discuss other solutions, too.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
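The buffering problem described above boils down to a plain aliasing hazard. The sketch below is illustrative only (it is not Spark's converter API): it shows what goes wrong if a converter hands out references to one reused mutable buffer instead of snapshotting it before the enclosing record is complete.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative only -- not part of Spark. Shows the aliasing hazard that
// motivates copying nested values out of a reused mutable row.
object MutableReuseDemo {
  // Collect inner arrays into an outer record, either by storing a reference
  // to one reused buffer (unsafe) or by snapshotting it (safe, but copies).
  def collect(inners: Seq[Seq[Int]], copy: Boolean): Seq[Seq[Int]] = {
    val reused = ArrayBuffer[Int]()      // the "mutable row" a converter might reuse
    val outer = ArrayBuffer[Seq[Int]]()
    for (inner <- inners) {
      reused.clear()                     // reuse the same chunk of memory...
      reused ++= inner
      outer += (if (copy) reused.toList else reused)  // ...and either snapshot or alias it
    }
    outer.toList
  }
}
```

Without the copy, every slot of `outer` ends up pointing at the same buffer, so all earlier `inner` arrays are silently overwritten by the last one.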
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AndreSchumacher commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-41376843

Yeah, this is probably way more complicated than it needs to be. But part of the code consists of (I admit, premature) optimizations (NativeArrayConverter and MutableRowConverter), so those are kind of optional. They are based on the discussions we had earlier.
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11976274

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala --- [...]
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11932023

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
@@ -54,9 +54,57 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
   /**
    * Optionally resolves the given string to a
    * [[catalyst.expressions.NamedExpression NamedExpression]]. The attribute is expressed as
-   * a string in the following form: `[scope].AttributeName.[nested].[fields]...`.
+   * a string in the following form: `[scope].AttributeName.[nested].[fields]...`. Fields
+   * can contain ordinal expressions, such as `field[i][j][k]...`.
    */
   def resolve(name: String): Option[NamedExpression] = {
+    def expandFunc(expType: (Expression, DataType), field: String): (Expression, DataType) = {
--- End diff --

What do you think of #518 instead of changing the resolver? I am not a parser expert, but I think this is closer to the way Hive (and probably Optiq, which we hope to use eventually) works.
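To make the `field[i][j][k]` form under discussion concrete, here is a hypothetical sketch (not the PR's actual `expandFunc`, and all names are made up) of splitting such a reference into field names and ordinals, which a resolver could then wrap in nested field/item accessors:

```scala
// Hypothetical helper: split an attribute reference like `a.b[i][j]` into
// (fieldName, ordinals) pairs. Not Spark's resolver -- just the parsing step.
object OrdinalPath {
  private val Ordinal = """\[(\d+)\]""".r

  def parse(name: String): Seq[(String, Seq[Int])] =
    name.split('.').toSeq.map { piece =>
      val bracket = piece.indexOf('[')
      if (bracket < 0) (piece, Nil)               // plain field, no ordinals
      else (piece.take(bracket),                  // field name before the first '['
            Ordinal.findAllMatchIn(piece).map(_.group(1).toInt).toList)
    }
}
```

For example, `"contacts.phones[0][1]"` yields one plain part (`contacts`) and one part with two ordinals (`phones`, 0, 1).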
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11932114

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala ---
@@ -206,6 +206,67 @@ class GenericMutableRow(size: Int) extends GenericRow(size) with MutableRow {
   override def copy() = new GenericRow(values.clone())
 }
+
+// TODO: this is an awful lot of code duplication. If values were covariant we could reuse
+// much of GenericRow
+class NativeRow[T](protected[catalyst] val values: Array[T]) extends Row {
--- End diff --

Do we need this class? Arrays don't need to be `Row`s inside of the execution engine, they only need to be of type `Seq`, and even that requirement should probably be removed. Instead of NativeRow, can we just call `toSeq` on the Array?
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11932124

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala ---
@@ -52,6 +52,7 @@ case class GetItem(child: Expression, ordinal: Expression) extends Expression {
     }
   } else {
     val baseValue = child.eval(input).asInstanceOf[Map[Any, _]]
+    // TODO: recover key type!!
--- End diff --

What do you mean?
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11932307

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
@@ -29,25 +31,36 @@ abstract class DataType {
     case e: Expression if e.dataType == this => true
     case _ => false
   }
+
+  def isPrimitive(): Boolean = false
 }

 case object NullType extends DataType
+
+trait PrimitiveType extends DataType {
--- End diff --

What are the semantics of `PrimitiveType`? Specifically, I'm surprised that `StringType` and `DecimalType` are considered `PrimitiveType`s. Also, I wonder if we can unify this with `NativeType` somehow. I'm not really sure, but I'd like to avoid too much explosion here.
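The reading under which `StringType` counts as primitive is that in this PR "primitive" appears to mean "not nested" (no array/map/struct children), which is distinct from `NativeType`'s "backed by a JVM-native value". A simplified sketch of that marker-trait pattern (not the actual Catalyst hierarchy):

```scala
// Simplified model, not the real dataTypes.scala: a marker trait flips the
// isPrimitive default. "Primitive" here means non-nested, so StringType
// qualifies even though it is not a fixed-width JVM type.
object TypeMarkers {
  abstract class DataType { def isPrimitive: Boolean = false }
  trait PrimitiveType extends DataType { override def isPrimitive: Boolean = true }

  case object IntegerType extends PrimitiveType
  case object StringType extends PrimitiveType                  // non-nested, hence "primitive"
  case class ArrayType(elementType: DataType) extends DataType  // nested: stays false
}
```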
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11932422

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -0,0 +1,369 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.io.IOException
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import parquet.hadoop.{ParquetFileReader, Footer, ParquetFileWriter}
+import parquet.hadoop.metadata.{ParquetMetadata, FileMetaData}
+import parquet.hadoop.util.ContextUtil
+import parquet.schema.{Type => ParquetType, PrimitiveType => ParquetPrimitiveType, MessageType, MessageTypeParser}
+import parquet.schema.{GroupType => ParquetGroupType, OriginalType => ParquetOriginalType, ConversionPatterns}
+import parquet.schema.PrimitiveType.{PrimitiveTypeName => ParquetPrimitiveTypeName}
+import parquet.schema.Type.Repetition
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Attribute}
+import org.apache.spark.sql.catalyst.types._
+
+// Implicits
+import scala.collection.JavaConversions._
+
+private[parquet] object ParquetTypesConverter {
+  def isPrimitiveType(ctype: DataType): Boolean =
+    classOf[PrimitiveType] isAssignableFrom ctype.getClass
+
+  def toPrimitiveDataType(parquetType: ParquetPrimitiveTypeName): DataType = parquetType match {
+    case ParquetPrimitiveTypeName.BINARY => StringType
+    case ParquetPrimitiveTypeName.BOOLEAN => BooleanType
+    case ParquetPrimitiveTypeName.DOUBLE => DoubleType
+    case ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY => ArrayType(ByteType)
+    case ParquetPrimitiveTypeName.FLOAT => FloatType
+    case ParquetPrimitiveTypeName.INT32 => IntegerType
+    case ParquetPrimitiveTypeName.INT64 => LongType
+    case ParquetPrimitiveTypeName.INT96 =>
+      // TODO: add BigInteger type? TODO(andre) use DecimalType instead
+      sys.error("Potential loss of precision: cannot convert INT96")
+    case _ => sys.error(
+      s"Unsupported parquet datatype $parquetType")
+  }
+
+  /**
+   * Converts a given Parquet `Type` into the corresponding
+   * [[org.apache.spark.sql.catalyst.types.DataType]].
+   *
+   * Note that we apply the following conversion rules:
+   * <ul>
+   *   <li> Primitive types are converted to the corresponding primitive type.</li>
+   *   <li> Group types that have a single field that is itself a group, which has repetition
+   *        level `REPEATED`, are treated as follows:<ul>
+   *          <li> If the nested group has name `values`, the surrounding group is converted
+   *               into an [[ArrayType]] with the corresponding field type (primitive or
+   *               complex) as element type.</li>
+   *          <li> If the nested group has name `map` and two fields (named `key` and `value`),
+   *               the surrounding group is converted into a [[MapType]]
+   *               with the corresponding key and value (value possibly complex) types.
+   *               Note that we currently assume map values are not nullable.</li>
+   *          <li> Other group types are converted into a [[StructType]] with the corresponding
+   *               field types.</li></ul></li>
+   * </ul>
+   * Note that fields are determined to be `nullable` if and only if their Parquet repetition
+   * level is not `REQUIRED`.
+   *
+   * @param parquetType The type to convert.
+   * @return The corresponding Catalyst type.
+   */
+  def toDataType(parquetType: ParquetType): DataType = {
+    def correspondsToMap(groupType: ParquetGroupType): Boolean = {
+      if (groupType.getFieldCount != 1 || groupType.getFields.apply(0).isPrimitive) {
+        false
+      } else {
+        // This mostly follows the convention in ``parquet.schema.ConversionPatterns``
+        val
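The conversion rules spelled out in the scaladoc above can be modeled with a toy type hierarchy. This is a simplified stand-in (made-up case classes, not the real `ParquetTypesConverter` or Catalyst types) just to show the shape of the dispatch: a single repeated nested group named `values` becomes an array, a repeated `map` group with `key`/`value` fields becomes a map, and everything else becomes a struct.

```scala
// Toy model of the group-type conversion rules; names and classes are
// illustrative only, not Spark's or Parquet's actual APIs.
object ConversionRules {
  sealed trait PType
  case class Primitive(name: String) extends PType
  case class Group(name: String, repeated: Boolean, fields: List[PType]) extends PType

  sealed trait CType
  case class CPrim(name: String) extends CType
  case class CArray(element: CType) extends CType
  case class CMap(key: CType, value: CType) extends CType
  case class CStruct(fields: List[CType]) extends CType

  def toDataType(t: PType): CType = t match {
    case Primitive(n) => CPrim(n)
    // single repeated nested group named "values" -> array of the field type
    case Group(_, _, List(Group("values", true, List(element)))) =>
      CArray(toDataType(element))
    // single repeated nested group named "map" with key/value fields -> map
    case Group(_, _, List(Group("map", true, List(key, value)))) =>
      CMap(toDataType(key), toDataType(value))
    // anything else -> struct of the field types
    case Group(_, _, fields) => CStruct(fields.map(toDataType))
  }
}
```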
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11932461

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala --- [...]

+        // This mostly follows the convention in ``parquet.schema.ConversionPatterns``
--- End diff --
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11932494

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala ---
@@ -68,14 +93,119 @@ object ParquetTestData {
   lazy val testData = new ParquetRelation(testDir.toURI.toString)
+
+  val testNestedSchema1 =
+    // based on blogpost example, source:
+    // https://blog.twitter.com/2013/dremel-made-simple-with-parquet
+    // note: instead of string we have to use binary (?) otherwise
+    // Parquet gives us:
+    // IllegalArgumentException: expected one of [INT64, INT32, BOOLEAN,
+    //   BINARY, FLOAT, DOUBLE, INT96, FIXED_LEN_BYTE_ARRAY]
+    // Also repeated primitives seem tricky to convert (AvroParquet
+    // only uses them in arrays?) so only use at most one in each group
+    // and nothing else in that group (-> is mapped to array)!
+    // The values inside ownerPhoneNumbers is a keyword currently
+    // so that array types can be translated correctly.
+    """
      |message AddressBook {
        |required binary owner;
        |optional group ownerPhoneNumbers {
          |repeated binary array;
        |}
        |optional group contacts {
          |repeated group array {
            |required binary name;
            |optional binary phoneNumber;
          |}
        |}
      |}
    """.stripMargin
+
+  val testNestedSchema2 =
+    """
      |message TestNested2 {
        |required int32 firstInt;
        |optional int32 secondInt;
        |optional group longs {
          |repeated int64 array;
        |}
        |required group entries {
          |repeated group array {
            |required double value;
            |optional boolean truth;
          |}
        |}
        |optional group outerouter {
          |repeated group array {
            |repeated group array {
              |repeated int32 array;
            |}
          |}
        |}
      |}
    """.stripMargin
+
+  val testNestedSchema3 =
+    """
      |message TestNested3 {
        |required int32 x;
        |optional group booleanNumberPairs {
          |repeated group array {
            |required int32 key;
            |optional group value {
              |repeated group array {
                |required double nestedValue;
                |optional boolean truth;
              |}
            |}
          |}
        |}
      |}
    """.stripMargin
+
+  val testNestedSchema4 =
+    """
      |message TestNested4 {
--- End diff --

Nit: why are the margins aligned with the indentation? I think that results in an unindented string.
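The nit is easy to verify: `stripMargin` discards everything up to and including the `|` margin character on each line, so however the margins are aligned on the code side, the result is the same unindented string. A standalone check (not from the PR):

```scala
// Standalone check of stripMargin behavior: per line, leading whitespace up to
// and including the margin character `|` is stripped, so the code-side
// indentation of the margins never reaches the resulting string.
object StripMarginDemo {
  val aligned =                 // margins aligned with code indentation
    """|message Demo {
       |  required int32 x;
       |}""".stripMargin

  val flush =                   // margins flush left inside the literal
    """|message Demo {
    |  required int32 x;
    |}""".stripMargin
}
```

Both values are identical; any indentation that should survive has to be written *after* the `|`.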
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11932640

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala --- [...]
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11932653

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala --- [...]
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11932731

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
@@ -160,7 +151,7 @@ private[sql] object ParquetRelation {
 }
 if (fs.exists(path)
-!fs.getFileStatus(path)
+  !fs.getFileStatus(path)
--- End diff --

I think the indenting was right before.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r11932798

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
@@ -153,9 +153,15 @@ case class InsertIntoParquetTable(
 val job = new Job(sc.hadoopConfiguration)
-ParquetOutputFormat.setWriteSupportClass(
-  job,
-  classOf[org.apache.spark.sql.parquet.RowWriteSupport])
+val writeSupport =
+  if (child.output.map(_.dataType).forall(_.isPrimitive())) {
+    logger.info("Initializing MutableRowWriteSupport")
--- End diff --

Probably should not be info. Also, why do all the data types have to be primitive for us to use mutable rows?

---
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-41234534 Wow, this is a pretty intense PR! :) Still trying to wrap my head around it, but overall I think the approach seems reasonable. Will take another look after a few questions are answered.

---
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40890889 Merged build triggered.

---

Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40890891 Merged build started.

---

Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40892195 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14268/

---

Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40892194 Merged build finished.

---
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40872477 Merged build triggered.

---

Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40872482 Merged build started.

---

Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40874845 Merged build finished.

---

Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40874847 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14256/

---

Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40362749 Merged build triggered.

---