[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46578605
  
Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46578586
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46584481
  
Merged build started. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46584464
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46587915
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46587917
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15911/




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AndreSchumacher
Github user AndreSchumacher commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46589367
  
@rxin any idea why this one test fails?




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46593227
  
That test has been flaky. We are fixing it. 






[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46593240
  
Jenkins, retest this please.




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46593586
  
Merged build started. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46593569
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46593975
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46593978
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15912/




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46602864
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46602867
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15914/




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46612027
  
@AndreSchumacher do you mind removing the [WIP] tag from the pull request?

Unfortunately, due to the Avro version bump, we can't include this in 1.0.1. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AndreSchumacher
Github user AndreSchumacher commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46644749
  
@rxin the Avro dependency is for the tests only (to make sure we can read 
Parquet files with Avro objects in them). I can remove that one test if it 
blocks this from being included. Once the rest of the build has caught up with 
the version, we can add it again. What do you think?




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46644798
  
That sounds good. If you can just comment that test out for now, that'd be 
great.
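For context, ScalaTest's `FunSuite` (which Spark's SQL suites extend) lets you disable a test without deleting it, by swapping `test` for `ignore`. A minimal sketch, assuming a hypothetical suite and test name — the actual suite touched by this PR may differ:

```scala
import org.scalatest.FunSuite

class ParquetAvroInteropSuite extends FunSuite {
  // Disabled until the rest of the build catches up with the Avro version
  // bump; `ignore` keeps the body compiling and reports the test as ignored.
  ignore("read a Parquet file containing Avro objects") {
    // ... original test body, unchanged ...
  }
}
```

Unlike commenting the test out, `ignore` keeps the body type-checked, so it can't silently rot before being re-enabled.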




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46645804
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46645809
  
Merged build started. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-18 Thread AndreSchumacher
Github user AndreSchumacher commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r13926401
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
@@ -0,0 +1,667 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap}
+
+import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, Converter}
+import parquet.schema.MessageType
+
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{GenericRow, Row, Attribute}
+import org.apache.spark.sql.parquet.CatalystConverter.FieldType
+
+/**
+ * Collection of converters of Parquet types (group and primitive types) that
+ * model arrays and maps. The conversions are partly based on the AvroParquet
+ * converters that are part of Parquet in order to be able to process these
+ * types.
+ *
+ * There are several types of converters:
+ * <ul>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystPrimitiveConverter]] for primitive
+ *   (numeric, boolean and String) types</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystNativeArrayConverter]] for arrays
+ *   of native JVM element types; note: currently null values are not supported!</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystArrayConverter]] for arrays of
+ *   arbitrary element types (including nested element types); note: currently
+ *   null values are not supported!</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystStructConverter]] for structs</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystMapConverter]] for maps; note:
+ *   currently null values are not supported!</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter]] for rows
+ *   of only primitive element types</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystGroupConverter]] for other nested
+ *   records, including the top-level row record</li>
+ * </ul>
+ */
+
+private[sql] object CatalystConverter {
+  // The type internally used for fields
+  type FieldType = StructField
+
+  // This is mostly Parquet convention (see, e.g., `ConversionPatterns`).
+  // Note that "array" for the array elements is chosen by ParquetAvro.
+  // Using a different value will result in Parquet silently dropping columns.
+  val ARRAY_ELEMENTS_SCHEMA_NAME = "array"
+  val MAP_KEY_SCHEMA_NAME = "key"
+  val MAP_VALUE_SCHEMA_NAME = "value"
+  val MAP_SCHEMA_NAME = "map"
+
+  // TODO: consider using Array[T] for arrays to avoid boxing of primitive types
+  type ArrayScalaType[T] = Seq[T]
+  type StructScalaType[T] = Seq[T]
+  type MapScalaType[K, V] = Map[K, V]
+
+  protected[parquet] def createConverter(
+      field: FieldType,
+      fieldIndex: Int,
+      parent: CatalystConverter): Converter = {
+    val fieldType: DataType = field.dataType
+    fieldType match {
+      // For native JVM types we use a converter with native arrays
+      case ArrayType(elementType: NativeType) => {
+        new CatalystNativeArrayConverter(elementType, fieldIndex, parent)
+      }
+      // This is for other types of arrays, including those with nested fields
+      case ArrayType(elementType: DataType) => {
+        new CatalystArrayConverter(elementType, fieldIndex, parent)
+      }
+      case StructType(fields: Seq[StructField]) => {
+        new CatalystStructConverter(fields, fieldIndex, parent)
+      }
+      case MapType(keyType: DataType, valueType: DataType) => {
+        new CatalystMapConverter(
+          Seq(
+            new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false),
+            new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)),
+          fieldIndex,
+          parent)
+      }
+      // Strings, Shorts and Bytes do not have a corresponding type in Parquet
+      // so we need 
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-18 Thread AndreSchumacher
Github user AndreSchumacher commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r13926479
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
(same diff excerpt as quoted above)

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r13876519
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
(same diff excerpt as quoted above)

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-17 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r13892309
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
(same diff excerpt as quoted above)

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-17 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r13892425
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
@@ -0,0 +1,667 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap}
+
+import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, Converter}
+import parquet.schema.MessageType
+
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{GenericRow, Row, Attribute}
+import org.apache.spark.sql.parquet.CatalystConverter.FieldType
+
+/**
+ * Collection of converters of Parquet types (group and primitive types) that
+ * model arrays and maps. The conversions are partly based on the AvroParquet
+ * converters that are part of Parquet in order to be able to process these
+ * types.
+ *
+ * There are several types of converters:
+ * <ul>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystPrimitiveConverter]] for primitive
+ *   (numeric, boolean and String) types</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystNativeArrayConverter]] for arrays
+ *   of native JVM element types; note: currently null values are not supported!</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystArrayConverter]] for arrays of
+ *   arbitrary element types (including nested element types); note: currently
+ *   null values are not supported!</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystStructConverter]] for structs</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystMapConverter]] for maps; note:
+ *   currently null values are not supported!</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter]] for rows
+ *   of only primitive element types</li>
+ *   <li>[[org.apache.spark.sql.parquet.CatalystGroupConverter]] for other nested
+ *   records, including the top-level row record</li>
+ * </ul>
+ */
+
+private[sql] object CatalystConverter {
+  // The type internally used for fields
+  type FieldType = StructField
+
+  // This is mostly Parquet convention (see, e.g., `ConversionPatterns`).
+  // Note that "array" for the array elements is chosen by ParquetAvro.
+  // Using a different value will result in Parquet silently dropping columns.
+  val ARRAY_ELEMENTS_SCHEMA_NAME = "array"
+  val MAP_KEY_SCHEMA_NAME = "key"
+  val MAP_VALUE_SCHEMA_NAME = "value"
+  val MAP_SCHEMA_NAME = "map"
+
+  // TODO: consider using Array[T] for arrays to avoid boxing of primitive types
+  type ArrayScalaType[T] = Seq[T]
+  type StructScalaType[T] = Seq[T]
+  type MapScalaType[K, V] = Map[K, V]
+
+  protected[parquet] def createConverter(
+  field: FieldType,
+  fieldIndex: Int,
+  parent: CatalystConverter): Converter = {
+val fieldType: DataType = field.dataType
+fieldType match {
+  // For native JVM types we use a converter with native arrays
+  case ArrayType(elementType: NativeType) => {
+new CatalystNativeArrayConverter(elementType, fieldIndex, parent)
+  }
+  // This is for other types of arrays, including those with nested fields
+  case ArrayType(elementType: DataType) => {
+new CatalystArrayConverter(elementType, fieldIndex, parent)
+  }
+  case StructType(fields: Seq[StructField]) => {
+new CatalystStructConverter(fields, fieldIndex, parent)
+  }
+  case MapType(keyType: DataType, valueType: DataType) => {
+new CatalystMapConverter(
+  Seq(
+new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false),
+new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)),
+fieldIndex,
+parent)
+  }
+  // Strings, Shorts and Bytes do not have a corresponding type in Parquet
+  // so we need to treat 
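The dispatch in `createConverter` above boils down to a single pattern match over the field's Catalyst type, and the specialized native-array case must be tried before the generic array case. A minimal self-contained sketch of that idea; the ADT and names below are illustrative stand-ins, not Spark's actual classes:

```scala
object ConverterDispatch {
  // Toy stand-ins for Catalyst's DataType hierarchy (illustrative only).
  sealed trait SimpleType
  case object IntType extends SimpleType
  final case class ArrayOf(element: SimpleType) extends SimpleType
  final case class MapOf(key: SimpleType, value: SimpleType) extends SimpleType

  // As in createConverter: cases are tried in order, so the specialized
  // native-element array case shadows the generic array case below it.
  def converterNameFor(t: SimpleType): String = t match {
    case IntType          => "PrimitiveConverter"
    case ArrayOf(IntType) => "NativeArrayConverter" // native element type, avoids boxing
    case ArrayOf(_)       => "ArrayConverter"       // nested/arbitrary element types
    case MapOf(_, _)      => "MapConverter"         // key required, value nullable
  }
}
```

With this ordering, `ArrayOf(IntType)` selects the specialized converter while `ArrayOf(ArrayOf(IntType))` falls through to the generic one.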

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-17 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r13892462
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r13892570
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-12 Thread AndreSchumacher
Github user AndreSchumacher commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r13711334
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -0,0 +1,409 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.io.IOException
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import parquet.hadoop.{ParquetFileReader, Footer, ParquetFileWriter}
+import parquet.hadoop.metadata.{ParquetMetadata, FileMetaData}
+import parquet.hadoop.util.ContextUtil
+import parquet.schema.{Type => ParquetType, PrimitiveType => ParquetPrimitiveType, MessageType}
+import parquet.schema.{GroupType => ParquetGroupType, OriginalType => ParquetOriginalType, ConversionPatterns}
+import parquet.schema.PrimitiveType.{PrimitiveTypeName => ParquetPrimitiveTypeName}
+import parquet.schema.Type.Repetition
+
+import org.apache.spark.Logging
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Attribute}
+import org.apache.spark.sql.catalyst.types._
+
+// Implicits
+import scala.collection.JavaConversions._
+
+private[parquet] object ParquetTypesConverter extends Logging {
+  def isPrimitiveType(ctype: DataType): Boolean =
+classOf[PrimitiveType] isAssignableFrom ctype.getClass
+
+  def toPrimitiveDataType(parquetType: ParquetPrimitiveTypeName): DataType = parquetType match {
+case ParquetPrimitiveTypeName.BINARY => StringType
+case ParquetPrimitiveTypeName.BOOLEAN => BooleanType
+case ParquetPrimitiveTypeName.DOUBLE => DoubleType
+case ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY => ArrayType(ByteType)
+case ParquetPrimitiveTypeName.FLOAT => FloatType
+case ParquetPrimitiveTypeName.INT32 => IntegerType
+case ParquetPrimitiveTypeName.INT64 => LongType
+case ParquetPrimitiveTypeName.INT96 =>
+  // TODO: add BigInteger type? TODO(andre) use DecimalType instead
+  sys.error("Potential loss of precision: cannot convert INT96")
+case _ => sys.error(
+  s"Unsupported parquet datatype $parquetType")
+  }
+
+  /**
+   * Converts a given Parquet `Type` into the corresponding
+   * [[org.apache.spark.sql.catalyst.types.DataType]].
+   *
+   * We apply the following conversion rules:
+   * <ul>
+   *   <li> Primitive types are converted to the corresponding primitive type.</li>
+   *   <li> Group types that have a single field that is itself a group, which has repetition
+   *        level `REPEATED`, are treated as follows:<ul>
+   *      <li> If the nested group has name `values`, the surrounding group is converted
+   *           into an [[ArrayType]] with the corresponding field type (primitive or
+   *           complex) as element type.</li>
+   *      <li> If the nested group has name `map` and two fields (named `key` and `value`),
+   *           the surrounding group is converted into a [[MapType]]
+   *           with the corresponding key and value (value possibly complex) types.
+   *           Note that we currently assume map values are not nullable.</li>
+   *   <li> Other group types are converted into a [[StructType]] with the corresponding
+   *        field types.</li></ul></li>
+   * </ul>
+   * Note that fields are determined to be `nullable` if and only if their Parquet repetition
+   * level is not `REQUIRED`.
+   *
+   * @param parquetType The type to convert.
+   * @return The corresponding Catalyst type.
+   */
+  def toDataType(parquetType: ParquetType): DataType = {
+def correspondsToMap(groupType: ParquetGroupType): Boolean = {
+  if (groupType.getFieldCount != 1 || groupType.getFields.apply(0).isPrimitive) {
+false
+  } else {
+// This mostly follows the convention in 
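The conversion rules documented above can be illustrated with a toy model: a group whose single REPEATED child group is named `values` maps to an array, one named `map` with `key`/`value` fields maps to a map, and any other group to a struct. The ADT and function names below are illustrative stand-ins, not Parquet's or Spark's actual API:

```scala
object ToySchemaRules {
  // Toy stand-ins for Parquet's type model (illustrative only).
  sealed trait ToyType
  final case class Prim(typeName: String) extends ToyType
  final case class Group(name: String, repeated: Boolean, fields: List[ToyType]) extends ToyType

  // Sketch of the documented group-conversion rules.
  def describe(t: ToyType): String = t match {
    case Prim(n) => n
    // Single REPEATED child group named "values" -> array of its element type.
    case Group(_, _, List(Group("values", true, List(element)))) =>
      s"ArrayType(${describe(element)})"
    // Single REPEATED child group named "map" with key/value fields -> map.
    case Group(_, _, List(Group("map", true, List(key, value)))) =>
      s"MapType(${describe(key)}, ${describe(value)})"
    // Any other group -> struct of its field types.
    case Group(_, _, fields) =>
      s"StructType(${fields.map(describe).mkString(", ")})"
  }
}
```

For example, a group wrapping a `repeated group values` of a single `INT32` field describes as `ArrayType(INT32)` under this sketch.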

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-11 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r13661910
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-11 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r13662320
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45437844
  
Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45437841
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45439536
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15543/




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45439533
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45444581
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45444586
  
Merged build started. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45446575
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45446576
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15546/




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-07 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45416599
  
test this please




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45416693
  
Build started. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45416688
  
 Build triggered. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45416740
  
Build finished. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45416741
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15529/


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-07 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45416832
  
Hey @AndreSchumacher, looks like style issues are failing Jenkins.


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-04 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r13373005
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala 
---
@@ -29,25 +31,36 @@ abstract class DataType {
     case e: Expression if e.dataType == this => true
     case _ => false
   }
+
+  def isPrimitive(): Boolean = false
--- End diff --

No `()`.


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45130821
  
 Build triggered. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45130853
  
Build started. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45131159
  
Build finished. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45131161
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15452/


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45140163
  
 Build triggered. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45140288
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15454/


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-45140285
  
Build finished. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44777055
  
 Merged build triggered. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-01 Thread AndreSchumacher
Github user AndreSchumacher commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44777089
  
The changes to SqlParser should not have any functional effect. I just had 
to reshuffle a few things to make it easier to extend it into a parser that 
supports the nested array/map-field expressions; see `NestedSqlParser` in 
`ParquetQuerySuite`. Currently that parser is incompatible with Hive 
expressions such as `insert into database_name.table_name`, so we need to 
revisit it once the syntax is fixed. It's currently there for the tests only.


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44778633
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15328/


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-06-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44778632
  
Merged build finished. All automated tests passed.


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44081955
  
 Merged build triggered. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44081960
  
Merged build started. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44081984
  
Merged build finished. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44081985
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15176/


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44082041
  
Merged build started. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44082034
  
 Merged build triggered. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44083405
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15177/


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-44083404
  
Merged build finished. All automated tests passed.


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-42779124
  
 Merged build triggered. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-42779130
  
Merged build started. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-42781102
  
Merged build finished. 


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-42781104
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14889/


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-27 Thread AndreSchumacher
Github user AndreSchumacher commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-41491324
  
I added a new issue about the nullability question here: 
https://issues.apache.org/jira/browse/SPARK-1649


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-25 Thread AndreSchumacher
Github user AndreSchumacher commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11985586
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -54,9 +54,57 @@ abstract class LogicalPlan extends 
QueryPlan[LogicalPlan] {
   /**
* Optionally resolves the given string to a
* [[catalyst.expressions.NamedExpression NamedExpression]]. The 
attribute is expressed as
-   * as string in the following form: 
`[scope].AttributeName.[nested].[fields]...`.
+   * as string in the following form: 
`[scope].AttributeName.[nested].[fields]...`. Fields
+   * can contain ordinal expressions, such as `field[i][j][k]...`.
*/
   def resolve(name: String): Option[NamedExpression] = {
+def expandFunc(expType: (Expression, DataType), field: String): 
(Expression, DataType) = {
--- End diff --

Thanks, I will have a look. One question: does this also handle maps and 
nested fields inside arrays, like 
`struct.array1[1].field1.map1[key1].array2[0]`? I don't know Optiq, but I 
will still check how Hive does this. Since there was (is?) no support for 
nested Parquet types in Hive, that may be a dead end though.


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-25 Thread AndreSchumacher
Github user AndreSchumacher commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11985972
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala 
---
@@ -206,6 +206,67 @@ class GenericMutableRow(size: Int) extends 
GenericRow(size) with MutableRow {
   override def copy() = new GenericRow(values.clone())
 }
 
+// TODO: this is an awful lot of code duplication. If values would be 
covariant we could reuse
+// much of GenericRow
+class NativeRow[T](protected[catalyst] val values: Array[T]) extends Row {
--- End diff --

@marmbrus Good question. I think I added that because `GetField` expects a 
Row when it calls `eval` on its children. I will have another look at that.
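The code-duplication complaint in the TODO comes down to array invariance. 
A minimal hypothetical sketch (simplified names, not the PR's classes) of 
why an `Array[Any]`-backed row cannot simply be reused for native arrays:

```scala
// Sketch: Scala arrays are invariant, so Array[Int] is not an Array[Any].
// A row class backed by Array[Any] therefore cannot wrap a native array
// without copying, motivating a separate NativeRow[T]-style wrapper.
object RowVariance {
  class GenericRow(val values: Array[Any]) {
    def apply(i: Int): Any = values(i)
  }
  class NativeRow[T](val values: Array[T]) {
    def apply(i: Int): Any = values(i)
  }

  def main(args: Array[String]): Unit = {
    val ints = Array(1, 2, 3)
    // new GenericRow(ints)        // does not compile: Array[Int] != Array[Any]
    val row = new NativeRow(ints)  // generic wrapper keeps the native array
    assert(row(1) == 2)
  }
}
```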


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-25 Thread AndreSchumacher
Github user AndreSchumacher commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11986096
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala
 ---
@@ -52,6 +52,7 @@ case class GetItem(child: Expression, ordinal: 
Expression) extends Expression {
   }
 } else {
   val baseValue = child.eval(input).asInstanceOf[Map[Any, _]]
+  // TODO: recover key type!!
--- End diff --

`MapType` records the type of the key. I was wondering whether one should 
use that, if possible, instead of just `Any`. The comment relates to the 
other comments inside the resolver.


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-25 Thread AndreSchumacher
Github user AndreSchumacher commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11986278
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala 
---
@@ -29,25 +31,36 @@ abstract class DataType {
     case e: Expression if e.dataType == this => true
     case _ => false
   }
+
+  def isPrimitive(): Boolean = false
 }
 
 case object NullType extends DataType
 
+trait PrimitiveType extends DataType {
--- End diff --

@marmbrus `PrimitiveType` is maybe a misnomer. It's the same term that 
Parquet uses. Basically a `PrimitiveType` is a type that is not contained 
inside another type (so non-nested). You can argue that a String is a Char 
array and therefore not primitive, but in terms of constructing nested rows 
a primitive type is a leaf inside the tree that produces a record.

It would help to somehow distinguish between nested and non-nested types. 
`NativeType` comes close, but there is for example `BinaryType`, which is 
primitive but not native.
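That leaf-vs-nested distinction can be sketched in a few lines. This is a 
hypothetical, simplified hierarchy (not Catalyst's actual one) where 
"primitive" only means "leaf of the schema tree", independent of whether 
the type is native to the JVM:

```scala
// Sketch: mark leaf (non-nested) types with a PrimitiveType mixin.
// Names mirror the discussion but are simplified for the example.
object SchemaLeaves {
  sealed trait DataType { def isPrimitive: Boolean = false }
  trait PrimitiveType extends DataType { override def isPrimitive = true }

  case object IntegerType extends PrimitiveType
  case object StringType extends PrimitiveType
  case object BinaryType extends PrimitiveType // primitive but not "native"
  case class ArrayType(element: DataType) extends DataType // nested

  def main(args: Array[String]): Unit = {
    assert(IntegerType.isPrimitive && BinaryType.isPrimitive)
    assert(!ArrayType(IntegerType).isPrimitive) // a nested type is not a leaf
  }
}
```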


---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-25 Thread AndreSchumacher
Github user AndreSchumacher commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11986397
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -0,0 +1,369 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.io.IOException
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import parquet.hadoop.{ParquetFileReader, Footer, ParquetFileWriter}
+import parquet.hadoop.metadata.{ParquetMetadata, FileMetaData}
+import parquet.hadoop.util.ContextUtil
+import parquet.schema.{Type = ParquetType, PrimitiveType = 
ParquetPrimitiveType, MessageType, MessageTypeParser}
+import parquet.schema.{GroupType = ParquetGroupType, OriginalType = 
ParquetOriginalType, ConversionPatterns}
+import parquet.schema.PrimitiveType.{PrimitiveTypeName = 
ParquetPrimitiveTypeName}
+import parquet.schema.Type.Repetition
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
Attribute}
+import org.apache.spark.sql.catalyst.types._
+
+// Implicits
+import scala.collection.JavaConversions._
+
+private[parquet] object ParquetTypesConverter {
+  def isPrimitiveType(ctype: DataType): Boolean =
+classOf[PrimitiveType] isAssignableFrom ctype.getClass
+
+  def toPrimitiveDataType(parquetType : ParquetPrimitiveTypeName): 
DataType = parquetType match {
+    case ParquetPrimitiveTypeName.BINARY => StringType
+    case ParquetPrimitiveTypeName.BOOLEAN => BooleanType
+    case ParquetPrimitiveTypeName.DOUBLE => DoubleType
+    case ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY => ArrayType(ByteType)
+    case ParquetPrimitiveTypeName.FLOAT => FloatType
+    case ParquetPrimitiveTypeName.INT32 => IntegerType
+    case ParquetPrimitiveTypeName.INT64 => LongType
+    case ParquetPrimitiveTypeName.INT96 =>
+      // TODO: add BigInteger type? TODO(andre) use DecimalType instead
+      sys.error("Potential loss of precision: cannot convert INT96")
+    case _ => sys.error(
+      s"Unsupported parquet datatype $parquetType")
+  }
+
+  /**
+   * Converts a given Parquet `Type` into the corresponding
+   * [[org.apache.spark.sql.catalyst.types.DataType]].
+   *
+   * Note that we apply the following conversion rules:
+   * <ul>
+   *   <li>Primitive types are converted to the corresponding primitive type.</li>
+   *   <li>Group types that have a single field that is itself a group, which has repetition
+   *       level `REPEATED`, are treated as follows:<ul>
+   *     <li>If the nested group has name `values`, the surrounding group is converted
+   *         into an [[ArrayType]] with the corresponding field type (primitive or
+   *         complex) as element type.</li>
+   *     <li>If the nested group has name `map` and two fields (named `key` and `value`),
+   *         the surrounding group is converted into a [[MapType]]
+   *         with the corresponding key and value (value possibly complex) types.
+   *         Note that we currently assume map values are not nullable.</li>
+   *     <li>Other group types are converted into a [[StructType]] with the corresponding
+   *         field types.</li></ul></li>
+   * </ul>
+   * Note that fields are determined to be `nullable` if and only if their 
Parquet repetition
+   * level is not `REQUIRED`.
+   *
+   * @param parquetType The type to convert.
+   * @return The corresponding Catalyst type.
+   */
+  def toDataType(parquetType: ParquetType): DataType = {
+def correspondsToMap(groupType: ParquetGroupType): Boolean = {
+  if (groupType.getFieldCount != 1 || 
groupType.getFields.apply(0).isPrimitive) {
+false
+  } else {
+// This mostly follows the convention in 
``parquet.schema.ConversionPatterns``
--- 

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-25 Thread AndreSchumacher
Github user AndreSchumacher commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11986431
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
@@ -0,0 +1,582 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap}
+
+import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, 
Converter}
+import parquet.schema.MessageType
+
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{NativeRow, GenericRow, 
Row, Attribute}
+import org.apache.spark.sql.parquet.CatalystConverter.FieldType
+
+private[parquet] object CatalystConverter {
+  // The type internally used for fields
+  type FieldType = StructField
+
+  // This is mostly Parquet convention (see, e.g., `ConversionPatterns`).
+  // Note that "array" for the array elements is chosen by ParquetAvro.
+  // Using a different value will result in Parquet silently dropping columns.
+  val ARRAY_ELEMENTS_SCHEMA_NAME = "array"
+  val MAP_KEY_SCHEMA_NAME = "key"
+  val MAP_VALUE_SCHEMA_NAME = "value"
+  val MAP_SCHEMA_NAME = "map"
+
+  protected[parquet] def createConverter(
+  field: FieldType,
+  fieldIndex: Int,
+  parent: CatalystConverter): Converter = {
+val fieldType: DataType = field.dataType
+    fieldType match {
+      // For native JVM types we use a converter with native arrays
+      case ArrayType(elementType: NativeType) => {
+        new CatalystNativeArrayConverter(elementType, fieldIndex, parent)
+      }
+      // This is for other types of arrays, including those with nested fields
+      case ArrayType(elementType: DataType) => {
+        new CatalystArrayConverter(elementType, fieldIndex, parent)
+      }
+      case StructType(fields: Seq[StructField]) => {
+        new CatalystStructConverter(fields, fieldIndex, parent)
+      }
+      case MapType(keyType: DataType, valueType: DataType) => {
+        new CatalystMapConverter(
+          Seq(
+            new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false),
+            new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)),
+          fieldIndex,
+          parent)
+      }
+      case ctype: NativeType => {
+        // note: for some reason matching for StringType fails so use this ugly if instead
+        if (ctype == StringType) new CatalystPrimitiveStringConverter(parent, fieldIndex)
+        else new CatalystPrimitiveConverter(parent, fieldIndex)
+      }
+      case _ => throw new RuntimeException(
+        s"unable to convert datatype ${field.dataType.toString} in CatalystGroupConverter")
+    }
+  }
+
+  protected[parquet] def createRootConverter(parquetSchema: MessageType): 
CatalystConverter = {
+val attributes = 
ParquetTypesConverter.convertToAttributes(parquetSchema)
+// For non-nested types we use the optimized Row converter
+if (attributes.forall(a => ParquetTypesConverter.isPrimitiveType(a.dataType))) {
+  new PrimitiveRowGroupConverter(attributes)
+} else {
+  new CatalystGroupConverter(attributes)
+}
+  }
+}
+
+private[parquet] trait CatalystConverter {
+  // the number of fields this group has
+  protected[parquet] val size: Int
+
+  // the index of this converter in the parent
+  protected[parquet] val index: Int
+
+  // the parent converter
+  protected[parquet] val parent: CatalystConverter
+
+  // for child converters to update upstream values
+  protected[parquet] def updateField(fieldIndex: Int, value: Any): Unit
+
+  protected[parquet] def updateBoolean(fieldIndex: Int, value: Boolean): 
Unit =
+updateField(fieldIndex, value)
+
+  protected[parquet] def updateInt(fieldIndex: Int, value: Int): Unit =
+updateField(fieldIndex, 

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-25 Thread AndreSchumacher
Github user AndreSchumacher commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11987399
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
@@ -0,0 +1,582 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap}
+
+import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, 
Converter}
+import parquet.schema.MessageType
+
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{NativeRow, GenericRow, 
Row, Attribute}
+import org.apache.spark.sql.parquet.CatalystConverter.FieldType
+
+private[parquet] object CatalystConverter {
+  // The type internally used for fields
+  type FieldType = StructField
+
+  // This is mostly Parquet convention (see, e.g., `ConversionPatterns`).
+  // Note that "array" for the array elements is chosen by ParquetAvro.
+  // Using a different value will result in Parquet silently dropping columns.
+  val ARRAY_ELEMENTS_SCHEMA_NAME = "array"
+  val MAP_KEY_SCHEMA_NAME = "key"
+  val MAP_VALUE_SCHEMA_NAME = "value"
+  val MAP_SCHEMA_NAME = "map"
+
+  protected[parquet] def createConverter(
+  field: FieldType,
+  fieldIndex: Int,
+  parent: CatalystConverter): Converter = {
+val fieldType: DataType = field.dataType
+    fieldType match {
+      // For native JVM types we use a converter with native arrays
+      case ArrayType(elementType: NativeType) => {
+        new CatalystNativeArrayConverter(elementType, fieldIndex, parent)
+      }
+      // This is for other types of arrays, including those with nested fields
+      case ArrayType(elementType: DataType) => {
+        new CatalystArrayConverter(elementType, fieldIndex, parent)
+      }
+      case StructType(fields: Seq[StructField]) => {
+        new CatalystStructConverter(fields, fieldIndex, parent)
+      }
+      case MapType(keyType: DataType, valueType: DataType) => {
+        new CatalystMapConverter(
+          Seq(
+            new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false),
+            new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)),
+          fieldIndex,
+          parent)
+      }
+      case ctype: NativeType => {
+        // note: for some reason matching for StringType fails so use this ugly if instead
+        if (ctype == StringType) new CatalystPrimitiveStringConverter(parent, fieldIndex)
+        else new CatalystPrimitiveConverter(parent, fieldIndex)
+      }
+      case _ => throw new RuntimeException(
+        s"unable to convert datatype ${field.dataType.toString} in CatalystGroupConverter")
+    }
+  }
+
+  protected[parquet] def createRootConverter(parquetSchema: MessageType): CatalystConverter = {
+    val attributes = ParquetTypesConverter.convertToAttributes(parquetSchema)
+    // For non-nested types we use the optimized Row converter
+    if (attributes.forall(a => ParquetTypesConverter.isPrimitiveType(a.dataType))) {
+      new PrimitiveRowGroupConverter(attributes)
+    } else {
+      new CatalystGroupConverter(attributes)
+    }
+  }
+}
+
+private[parquet] trait CatalystConverter {
+  // the number of fields this group has
+  protected[parquet] val size: Int
+
+  // the index of this converter in the parent
+  protected[parquet] val index: Int
+
+  // the parent converter
+  protected[parquet] val parent: CatalystConverter
+
+  // for child converters to update upstream values
+  protected[parquet] def updateField(fieldIndex: Int, value: Any): Unit
+
+  protected[parquet] def updateBoolean(fieldIndex: Int, value: Boolean): 
Unit =
+updateField(fieldIndex, value)
+
+  protected[parquet] def updateInt(fieldIndex: Int, value: Int): Unit =
+updateField(fieldIndex, 

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-25 Thread AndreSchumacher
Github user AndreSchumacher commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11989530
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 ---
@@ -153,9 +153,15 @@ case class InsertIntoParquetTable(
 
 val job = new Job(sc.hadoopConfiguration)
 
-ParquetOutputFormat.setWriteSupportClass(
-  job,
-  classOf[org.apache.spark.sql.parquet.RowWriteSupport])
+val writeSupport =
+  if (child.output.map(_.dataType).forall(_.isPrimitive())) {
+        logger.info("Initializing MutableRowWriteSupport")
--- End diff --

@marmbrus Good question. I'm not yet totally sure myself. But consider the 
following example: you have an array of structs, which have another array as a 
field. So something like:

`ArrayType(StructType(Seq(ArrayType(IntegerType))))`

Let's call the inner array `inner` and the outer array `outer`. Note that 
`outer` could itself be just a field in a higher-level record.

Now whenever Parquet is done passing the data for the current `inner` it 
will let you know by calling `end` on the converter for that field, in this 
case an array converter. Now the current struct has been processed completely, 
so its converter's `end` will be called, too. The current `outer` record, 
however, may or may not be completed. If it's not completed, then the current 
`inner` needs to be stored somewhere, and you cannot use a mutable row because 
it is not yet safe to reuse that chunk of memory when the next `inner` 
comes along.

Does this make any sense at all? I'm happy to discuss other solutions, too.
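
To make the buffering argument concrete, here is a standalone sketch (plain 
Scala, no Spark or Parquet dependencies; `ConverterReuseDemo` and everything 
in it is hypothetical illustration, not PR code) of why a reused mutable 
buffer corrupts earlier `inner` entries while `outer` is still open:

```scala
import scala.collection.mutable.ArrayBuffer

object ConverterReuseDemo {
  // Simulate two `inner` arrays arriving while `outer` is still open.
  def run(): (Seq[collection.Seq[Int]], Seq[collection.Seq[Int]]) = {
    val reused = ArrayBuffer[Int]()                      // the "mutable row" scratch space
    val byReference = ArrayBuffer[collection.Seq[Int]]() // stores the buffer itself
    val byCopy = ArrayBuffer[collection.Seq[Int]]()      // snapshots at `end()`
    for (inner <- Seq(Seq(1, 2), Seq(3, 4))) {
      reused.clear()          // reuse the same chunk of memory
      reused ++= inner        // Parquet delivers the inner array's elements
      byReference += reused   // unsafe: later mutation is visible here
      byCopy += reused.toList // safe: immutable copy made when `end` fires
    }
    (byReference.toSeq, byCopy.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val (ref, copied) = run()
    // Both reference-stored entries now alias the clobbered buffer,
    // while the copies preserved each `inner` as it arrived.
    assert(ref == Seq(Seq(3, 4), Seq(3, 4)))
    assert(copied == Seq(Seq(1, 2), Seq(3, 4)))
    println(s"by reference: $ref")
    println(s"by copy:      $copied")
  }
}
```

This is why the copying (non-mutable-row) converter is kept for nested 
schemas in the sketch above.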


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-25 Thread AndreSchumacher
Github user AndreSchumacher commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-41376843
  
Yeah, this is probably way more complicated than it needs to be. But part 
of the code is (I admit premature) optimizations (NativeArrayConverter and 
MutableRowConverter), so those parts are kind of optional. They are based on 
the discussions we had earlier.




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-24 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11976274
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
@@ -0,0 +1,582 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap}
+
+import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, 
Converter}
+import parquet.schema.MessageType
+
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{NativeRow, GenericRow, 
Row, Attribute}
+import org.apache.spark.sql.parquet.CatalystConverter.FieldType
+
+private[parquet] object CatalystConverter {
+  // The type internally used for fields
+  type FieldType = StructField
+
+  // This is mostly Parquet convention (see, e.g., `ConversionPatterns`).
+  // Note that "array" for the array elements is chosen by ParquetAvro.
+  // Using a different value will result in Parquet silently dropping columns.
+  val ARRAY_ELEMENTS_SCHEMA_NAME = "array"
+  val MAP_KEY_SCHEMA_NAME = "key"
+  val MAP_VALUE_SCHEMA_NAME = "value"
+  val MAP_SCHEMA_NAME = "map"
+
+  protected[parquet] def createConverter(
+  field: FieldType,
+  fieldIndex: Int,
+  parent: CatalystConverter): Converter = {
+val fieldType: DataType = field.dataType
+    fieldType match {
+      // For native JVM types we use a converter with native arrays
+      case ArrayType(elementType: NativeType) => {
+        new CatalystNativeArrayConverter(elementType, fieldIndex, parent)
+      }
+      // This is for other types of arrays, including those with nested fields
+      case ArrayType(elementType: DataType) => {
+        new CatalystArrayConverter(elementType, fieldIndex, parent)
+      }
+      case StructType(fields: Seq[StructField]) => {
+        new CatalystStructConverter(fields, fieldIndex, parent)
+      }
+      case MapType(keyType: DataType, valueType: DataType) => {
+        new CatalystMapConverter(
+          Seq(
+            new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false),
+            new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)),
+          fieldIndex,
+          parent)
+      }
+      case ctype: NativeType => {
+        // note: for some reason matching for StringType fails so use this ugly if instead
+        if (ctype == StringType) new CatalystPrimitiveStringConverter(parent, fieldIndex)
+        else new CatalystPrimitiveConverter(parent, fieldIndex)
+      }
+      case _ => throw new RuntimeException(
+        s"unable to convert datatype ${field.dataType.toString} in CatalystGroupConverter")
+    }
+  }
+
+  protected[parquet] def createRootConverter(parquetSchema: MessageType): CatalystConverter = {
+    val attributes = ParquetTypesConverter.convertToAttributes(parquetSchema)
+    // For non-nested types we use the optimized Row converter
+    if (attributes.forall(a => ParquetTypesConverter.isPrimitiveType(a.dataType))) {
+      new PrimitiveRowGroupConverter(attributes)
+    } else {
+      new CatalystGroupConverter(attributes)
+    }
+  }
+}
+
+private[parquet] trait CatalystConverter {
+  // the number of fields this group has
+  protected[parquet] val size: Int
+
+  // the index of this converter in the parent
+  protected[parquet] val index: Int
+
+  // the parent converter
+  protected[parquet] val parent: CatalystConverter
+
+  // for child converters to update upstream values
+  protected[parquet] def updateField(fieldIndex: Int, value: Any): Unit
+
+  protected[parquet] def updateBoolean(fieldIndex: Int, value: Boolean): 
Unit =
+updateField(fieldIndex, value)
+
+  protected[parquet] def updateInt(fieldIndex: Int, value: Int): Unit =
+updateField(fieldIndex, value)
 

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11932023
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -54,9 +54,57 @@ abstract class LogicalPlan extends 
QueryPlan[LogicalPlan] {
   /**
* Optionally resolves the given string to a
* [[catalyst.expressions.NamedExpression NamedExpression]]. The 
attribute is expressed as
-   * as string in the following form: 
`[scope].AttributeName.[nested].[fields]...`.
+   * as string in the following form: 
`[scope].AttributeName.[nested].[fields]...`. Fields
+   * can contain ordinal expressions, such as `field[i][j][k]...`.
*/
   def resolve(name: String): Option[NamedExpression] = {
+def expandFunc(expType: (Expression, DataType), field: String): 
(Expression, DataType) = {
--- End diff --

What do you think of #518 instead of changing the resolver?  I am not a 
parser expert, but I think this is closer to the way hive (and probably optiq 
which we hope to use eventually) work.




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11932114
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala 
---
@@ -206,6 +206,67 @@ class GenericMutableRow(size: Int) extends 
GenericRow(size) with MutableRow {
   override def copy() = new GenericRow(values.clone())
 }
 
+// TODO: this is an awful lot of code duplication. If values would be 
covariant we could reuse
+// much of GenericRow
+class NativeRow[T](protected[catalyst] val values: Array[T]) extends Row {
--- End diff --

Do we need this class?  Arrays don't need to be `Row`s inside of the 
execution engine, they only need to be of type `Seq`, and even that 
requirement should probably be removed.  Instead of NativeRow can we just call 
`toSeq` on the Array?
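
As a minimal illustration of the `toSeq` alternative (standalone Scala; 
`ArrayAsSeqDemo` is a hypothetical name, not Spark code):

```scala
object ArrayAsSeqDemo {
  def main(args: Array[String]): Unit = {
    // Instead of a dedicated NativeRow wrapper, a converter could expose its
    // element Array as a Seq. (In Scala 2.12 `toSeq` wraps the array in a
    // WrappedArray; in 2.13 it copies into an immutable ArraySeq.)
    val elements: Array[Int] = Array(1, 2, 3)
    val asSeq: Seq[Int] = elements.toSeq
    assert(asSeq == Seq(1, 2, 3)) // Seq equality is element-wise
    println(asSeq.mkString(","))
  }
}
```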




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11932124
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala
 ---
@@ -52,6 +52,7 @@ case class GetItem(child: Expression, ordinal: 
Expression) extends Expression {
   }
 } else {
   val baseValue = child.eval(input).asInstanceOf[Map[Any, _]]
+  // TODO: recover key type!!
--- End diff --

What do you mean?




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11932307
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala 
---
@@ -29,25 +31,36 @@ abstract class DataType {
case e: Expression if e.dataType == this => true
case _ => false
   }
+
+  def isPrimitive(): Boolean = false
 }
 
 case object NullType extends DataType
 
+trait PrimitiveType extends DataType {
--- End diff --

What are the semantics of `PrimitiveType`? Specifically, I'm surprised that 
`StringType` and `DecimalType` are considered `PrimitiveTypes`.  Also I wonder 
if we can unify this with `NativeType` somehow.  I'm not really sure, but I'd 
like to avoid too much explosion here.
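
For the sake of discussion, a simplified standalone mock of the two traits 
(names mirror the PR, but nothing here is the actual Catalyst code) showing 
how `StringType` can plausibly be both native and primitive:

```scala
object TypeHierarchySketch {
  sealed trait DataType { def isPrimitive: Boolean = false }
  // "Primitive" here would mean: maps to a single Parquet primitive column.
  trait PrimitiveType extends DataType { override def isPrimitive: Boolean = true }
  // "Native" here would mean: backed by a natural JVM type (Int, String, ...).
  trait NativeType extends DataType

  case object IntegerType extends NativeType with PrimitiveType
  // Primitive for Parquet (stored as BINARY), native for the JVM:
  case object StringType extends NativeType with PrimitiveType
  // A nested type is neither:
  case object StructTypeLike extends DataType

  def main(args: Array[String]): Unit = {
    assert(StringType.isPrimitive)
    assert(!StructTypeLike.isPrimitive)
    println("primitive: " + Seq(IntegerType, StringType, StructTypeLike).filter(_.isPrimitive))
  }
}
```

Under this reading the two traits classify along different axes, which may or 
may not justify keeping them separate.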




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11932422
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -0,0 +1,369 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.io.IOException
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import parquet.hadoop.{ParquetFileReader, Footer, ParquetFileWriter}
+import parquet.hadoop.metadata.{ParquetMetadata, FileMetaData}
+import parquet.hadoop.util.ContextUtil
+import parquet.schema.{Type = ParquetType, PrimitiveType = 
ParquetPrimitiveType, MessageType, MessageTypeParser}
+import parquet.schema.{GroupType = ParquetGroupType, OriginalType = 
ParquetOriginalType, ConversionPatterns}
+import parquet.schema.PrimitiveType.{PrimitiveTypeName = 
ParquetPrimitiveTypeName}
+import parquet.schema.Type.Repetition
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
Attribute}
+import org.apache.spark.sql.catalyst.types._
+
+// Implicits
+import scala.collection.JavaConversions._
+
+private[parquet] object ParquetTypesConverter {
+  def isPrimitiveType(ctype: DataType): Boolean =
+classOf[PrimitiveType] isAssignableFrom ctype.getClass
+
+  def toPrimitiveDataType(parquetType: ParquetPrimitiveTypeName): DataType = parquetType match {
+    case ParquetPrimitiveTypeName.BINARY => StringType
+    case ParquetPrimitiveTypeName.BOOLEAN => BooleanType
+    case ParquetPrimitiveTypeName.DOUBLE => DoubleType
+    case ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY => ArrayType(ByteType)
+    case ParquetPrimitiveTypeName.FLOAT => FloatType
+    case ParquetPrimitiveTypeName.INT32 => IntegerType
+    case ParquetPrimitiveTypeName.INT64 => LongType
+    case ParquetPrimitiveTypeName.INT96 =>
+      // TODO: add BigInteger type? TODO(andre) use DecimalType instead
+      sys.error("Potential loss of precision: cannot convert INT96")
+    case _ => sys.error(
+      s"Unsupported parquet datatype $parquetType")
+  }
+
+  /**
+   * Converts a given Parquet `Type` into the corresponding
+   * [[org.apache.spark.sql.catalyst.types.DataType]].
+   *
+   * Note that we apply the following conversion rules:
+   * <ul>
+   *   <li> Primitive types are converted to the corresponding primitive type.</li>
+   *   <li> Group types that have a single field that is itself a group, which has repetition
+   *        level `REPEATED`, are treated as follows:<ul>
+   *          <li> If the nested group has name `values`, the surrounding group is converted
+   *               into an [[ArrayType]] with the corresponding field type (primitive or
+   *               complex) as element type.</li>
+   *          <li> If the nested group has name `map` and two fields (named `key` and `value`),
+   *               the surrounding group is converted into a [[MapType]]
+   *               with the corresponding key and value (value possibly complex) types.
+   *               Note that we currently assume map values are not nullable.</li>
+   *   <li> Other group types are converted into a [[StructType]] with the corresponding
+   *        field types.</li></ul></li>
+   * </ul>
+   * Note that fields are determined to be `nullable` if and only if their 
Parquet repetition
+   * level is not `REQUIRED`.
+   *
+   * @param parquetType The type to convert.
+   * @return The corresponding Catalyst type.
+   */
+  def toDataType(parquetType: ParquetType): DataType = {
+def correspondsToMap(groupType: ParquetGroupType): Boolean = {
+  if (groupType.getFieldCount != 1 || 
groupType.getFields.apply(0).isPrimitive) {
+false
+  } else {
+// This mostly follows the convention in 
``parquet.schema.ConversionPatterns``
+val 

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11932461
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -0,0 +1,369 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.io.IOException
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import parquet.hadoop.{ParquetFileReader, Footer, ParquetFileWriter}
+import parquet.hadoop.metadata.{ParquetMetadata, FileMetaData}
+import parquet.hadoop.util.ContextUtil
+import parquet.schema.{Type = ParquetType, PrimitiveType = 
ParquetPrimitiveType, MessageType, MessageTypeParser}
+import parquet.schema.{GroupType = ParquetGroupType, OriginalType = 
ParquetOriginalType, ConversionPatterns}
+import parquet.schema.PrimitiveType.{PrimitiveTypeName = 
ParquetPrimitiveTypeName}
+import parquet.schema.Type.Repetition
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
Attribute}
+import org.apache.spark.sql.catalyst.types._
+
+// Implicits
+import scala.collection.JavaConversions._
+
+private[parquet] object ParquetTypesConverter {
+  def isPrimitiveType(ctype: DataType): Boolean =
+classOf[PrimitiveType] isAssignableFrom ctype.getClass
+
+  def toPrimitiveDataType(parquetType: ParquetPrimitiveTypeName): DataType = parquetType match {
+    case ParquetPrimitiveTypeName.BINARY => StringType
+    case ParquetPrimitiveTypeName.BOOLEAN => BooleanType
+    case ParquetPrimitiveTypeName.DOUBLE => DoubleType
+    case ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY => ArrayType(ByteType)
+    case ParquetPrimitiveTypeName.FLOAT => FloatType
+    case ParquetPrimitiveTypeName.INT32 => IntegerType
+    case ParquetPrimitiveTypeName.INT64 => LongType
+    case ParquetPrimitiveTypeName.INT96 =>
+      // TODO: add BigInteger type? TODO(andre) use DecimalType instead
+      sys.error("Potential loss of precision: cannot convert INT96")
+    case _ => sys.error(
+      s"Unsupported parquet datatype $parquetType")
+  }
+
+  /**
+   * Converts a given Parquet `Type` into the corresponding
+   * [[org.apache.spark.sql.catalyst.types.DataType]].
+   *
+   * Note that we apply the following conversion rules:
+   * <ul>
+   *   <li> Primitive types are converted to the corresponding primitive type.</li>
+   *   <li> Group types that have a single field that is itself a group, which has repetition
+   *        level `REPEATED`, are treated as follows:<ul>
+   *          <li> If the nested group has name `values`, the surrounding group is converted
+   *               into an [[ArrayType]] with the corresponding field type (primitive or
+   *               complex) as element type.</li>
+   *          <li> If the nested group has name `map` and two fields (named `key` and `value`),
+   *               the surrounding group is converted into a [[MapType]]
+   *               with the corresponding key and value (value possibly complex) types.
+   *               Note that we currently assume map values are not nullable.</li>
+   *   <li> Other group types are converted into a [[StructType]] with the corresponding
+   *        field types.</li></ul></li>
+   * </ul>
+   * Note that fields are determined to be `nullable` if and only if their 
Parquet repetition
+   * level is not `REQUIRED`.
+   *
+   * @param parquetType The type to convert.
+   * @return The corresponding Catalyst type.
+   */
+  def toDataType(parquetType: ParquetType): DataType = {
+def correspondsToMap(groupType: ParquetGroupType): Boolean = {
+  if (groupType.getFieldCount != 1 || 
groupType.getFields.apply(0).isPrimitive) {
+false
+  } else {
+// This mostly follows the convention in 
``parquet.schema.ConversionPatterns``
--- End diff 

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11932494
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala ---
@@ -68,14 +93,119 @@ object ParquetTestData {
 
   lazy val testData = new ParquetRelation(testDir.toURI.toString)
 
+  val testNestedSchema1 =
+// based on blogpost example, source:
+// https://blog.twitter.com/2013/dremel-made-simple-with-parquet
+// note: instead of string we have to use binary (?) otherwise
+// Parquet gives us:
+// IllegalArgumentException: expected one of [INT64, INT32, BOOLEAN,
+//   BINARY, FLOAT, DOUBLE, INT96, FIXED_LEN_BYTE_ARRAY]
+// Also repeated primitives seem tricky to convert (AvroParquet
+// only uses them in arrays?) so only use at most one in each group
+// and nothing else in that group (- is mapped to array)!
+// The name "array" inside ownerPhoneNumbers is a keyword currently
+// so that array types can be translated correctly.
+  """
+  |message AddressBook {
+|required binary owner;
+|optional group ownerPhoneNumbers {
+  |repeated binary array;
+|}
+|optional group contacts {
+  |repeated group array {
+|required binary name;
+|optional binary phoneNumber;
+  |}
+|}
+  |}
+  """.stripMargin
+
+
+  val testNestedSchema2 =
+  """
+  |message TestNested2 {
+|required int32 firstInt;
+|optional int32 secondInt;
+|optional group longs {
+  |repeated int64 array;
+|}
+|required group entries {
+  |repeated group array {
+|required double value;
+|optional boolean truth;
+  |}
+|}
+|optional group outerouter {
+  |repeated group array {
+|repeated group array {
+  |repeated int32 array;
+|}
+  |}
+|}
+  |}
+  """.stripMargin
+
+  val testNestedSchema3 =
+  """
+  |message TestNested3 {
+|required int32 x;
+|optional group booleanNumberPairs {
+  |repeated group array {
+|required int32 key;
+|optional group value {
+  |repeated group array {
+|required double nestedValue;
+|optional boolean truth;
+  |}
+|}
+  |}
+|}
+  |}
+  """.stripMargin
+
+  val testNestedSchema4 =
+  """
+  |message TestNested4 {
--- End diff --

Nit: why are the margins aligned with the indentation?  I think that 
results in an unindented string.
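
As a quick standalone check of what `stripMargin` produces: it strips 
everything up to and including the leading `|` on each line, so the margin's 
alignment with surrounding code affects only source readability, not the 
resulting string (`StripMarginDemo` is illustrative, not PR code):

```scala
object StripMarginDemo {
  def main(args: Array[String]): Unit = {
    val schema =
      """|message M {
         |  required int32 x;
         |}""".stripMargin
    // Everything up to and including each '|' is removed, so the
    // result is unindented regardless of where the margin sits.
    assert(schema == "message M {\n  required int32 x;\n}")
    println(schema)
  }
}
```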




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11932640
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
@@ -0,0 +1,582 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap}
+
+import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, 
Converter}
+import parquet.schema.MessageType
+
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{NativeRow, GenericRow, 
Row, Attribute}
+import org.apache.spark.sql.parquet.CatalystConverter.FieldType
+
+private[parquet] object CatalystConverter {
+  // The type internally used for fields
+  type FieldType = StructField
+
+  // This is mostly Parquet convention (see, e.g., `ConversionPatterns`).
+  // Note that "array" for the array elements is chosen by ParquetAvro.
+  // Using a different value will result in Parquet silently dropping columns.
+  val ARRAY_ELEMENTS_SCHEMA_NAME = "array"
+  val MAP_KEY_SCHEMA_NAME = "key"
+  val MAP_VALUE_SCHEMA_NAME = "value"
+  val MAP_SCHEMA_NAME = "map"
+
+  protected[parquet] def createConverter(
+  field: FieldType,
+  fieldIndex: Int,
+  parent: CatalystConverter): Converter = {
+val fieldType: DataType = field.dataType
+    fieldType match {
+      // For native JVM types we use a converter with native arrays
+      case ArrayType(elementType: NativeType) => {
+        new CatalystNativeArrayConverter(elementType, fieldIndex, parent)
+      }
+      // This is for other types of arrays, including those with nested fields
+      case ArrayType(elementType: DataType) => {
+        new CatalystArrayConverter(elementType, fieldIndex, parent)
+      }
+      case StructType(fields: Seq[StructField]) => {
+        new CatalystStructConverter(fields, fieldIndex, parent)
+      }
+      case MapType(keyType: DataType, valueType: DataType) => {
+        new CatalystMapConverter(
+          Seq(
+            new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false),
+            new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)),
+          fieldIndex,
+          parent)
+      }
+      case ctype: NativeType => {
+        // note: for some reason matching for StringType fails so use this ugly if instead
+        if (ctype == StringType) new CatalystPrimitiveStringConverter(parent, fieldIndex)
+        else new CatalystPrimitiveConverter(parent, fieldIndex)
+      }
+      case _ => throw new RuntimeException(
+        s"unable to convert datatype ${field.dataType.toString} in CatalystGroupConverter")
+    }
+  }
+
+  protected[parquet] def createRootConverter(parquetSchema: MessageType): 
CatalystConverter = {
+val attributes = 
ParquetTypesConverter.convertToAttributes(parquetSchema)
+// For non-nested types we use the optimized Row converter
+if (attributes.forall(a = 
ParquetTypesConverter.isPrimitiveType(a.dataType))) {
+  new PrimitiveRowGroupConverter(attributes)
+} else {
+  new CatalystGroupConverter(attributes)
+}
+  }
+}
+
+private[parquet] trait CatalystConverter {
+  // the number of fields this group has
+  protected[parquet] val size: Int
+
+  // the index of this converter in the parent
+  protected[parquet] val index: Int
+
+  // the parent converter
+  protected[parquet] val parent: CatalystConverter
+
+  // for child converters to update upstream values
+  protected[parquet] def updateField(fieldIndex: Int, value: Any): Unit
+
+  protected[parquet] def updateBoolean(fieldIndex: Int, value: Boolean): 
Unit =
+updateField(fieldIndex, value)
+
+  protected[parquet] def updateInt(fieldIndex: Int, value: Int): Unit =
+updateField(fieldIndex, value)
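
The type-directed dispatch in `createConverter` above can be sketched standalone. The following is a hedged illustration only: `DataType`, `IntType`, `StringType`, `ArrayType`, `StructType`, `isNative`, and `converterFor` are simplified stand-ins invented for this sketch, not the real Catalyst classes, but the match order mirrors the one quoted in the diff (native-element arrays first, then complex arrays, structs, strings, and finally other primitives):

```scala
// Simplified stand-ins for the Catalyst DataType hierarchy (sketch only).
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructType(fields: Seq[(String, DataType)]) extends DataType

// Assumption for this sketch: these two types have native JVM representations.
def isNative(t: DataType): Boolean = t == IntType || t == StringType

// Return the (hypothetical) converter name each type would dispatch to,
// in the same case order as CatalystConverter.createConverter.
def converterFor(t: DataType): String = t match {
  case ArrayType(e) if isNative(e) => "CatalystNativeArrayConverter"
  case ArrayType(_)                => "CatalystArrayConverter"
  case StructType(_)               => "CatalystStructConverter"
  case StringType                  => "CatalystPrimitiveStringConverter"
  case _                           => "CatalystPrimitiveConverter"
}
```

For example, `converterFor(ArrayType(StructType(Seq(("a", IntType)))))` selects the generic array converter, because struct elements have no native representation.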
 

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11932653
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---
@@ -0,0 +1,582 @@
[diff content identical to the excerpt quoted above]
 

[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11932731
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
@@ -160,7 +151,7 @@ private[sql] object ParquetRelation {
 }
 
 if (fs.exists(path) &&
-!fs.getFileStatus(path)
+  !fs.getFileStatus(path)
--- End diff --

I think the indenting was right before.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/360#discussion_r11932798
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 ---
@@ -153,9 +153,15 @@ case class InsertIntoParquetTable(
 
 val job = new Job(sc.hadoopConfiguration)
 
-ParquetOutputFormat.setWriteSupportClass(
-  job,
-  classOf[org.apache.spark.sql.parquet.RowWriteSupport])
+val writeSupport =
+  if (child.output.map(_.dataType).forall(_.isPrimitive())) {
+    logger.info("Initializing MutableRowWriteSupport")
--- End diff --

Probably should not be info.  Also, why do all the data types have to be 
primitive for us to use mutable rows?
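
One plausible rationale behind the primitive-only condition (a hypothetical sketch, not Spark's actual `MutableRowWriteSupport`): when every column is a fixed-width primitive, a single preallocated row can be overwritten in place for each record, so the write loop performs no per-record allocation. `MutableIntRow` and `sumColumns` below are invented names for illustration:

```scala
// Sketch: a fixed-width mutable row whose slots are overwritten in place.
final class MutableIntRow(width: Int) {
  private val values = new Array[Int](width)
  def setInt(i: Int, v: Int): Unit = values(i) = v
  def getInt(i: Int): Int = values(i)
}

// Simulated write path: the same row instance is recycled for every record.
def sumColumns(records: Seq[(Int, Int)]): Seq[Int] = {
  val row = new MutableIntRow(2) // allocated once, outside the loop
  records.map { case (a, b) =>
    row.setInt(0, a) // overwrite slot 0 in place
    row.setInt(1, b) // overwrite slot 1 in place
    row.getInt(0) + row.getInt(1)
  }
}
```

With nested or variable-size values this simple reuse scheme breaks down, since a slot can no longer be a fixed-width cell that is safely overwritten.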




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-23 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-41234534
  
Wow, this is a pretty intense PR! :)

Still trying to wrap my head around it, but overall I think the approach 
seems reasonable.  Will take another look after a few questions are answered.




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-40890889
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-40890891
  
Merged build started. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-40892195
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14268/




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-40892194
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-40872477
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-40872482
  
Merged build started. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-40874845
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-40874847
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14256/




[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...

2014-04-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-40362749
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

