[jira] [Assigned] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs

2017-09-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned PARQUET-1102:
---

Assignee: Cheng Lian

> Travis CI builds are failing for parquet-format PRs
> ---
>
> Key: PARQUET-1102
> URL: https://issues.apache.org/jira/browse/PARQUET-1102
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: format-2.3.2
>
>
> Travis CI builds are failing for parquet-format PRs, probably due to the 
> migration from Ubuntu precise to trusty on Sep 1 according to [this Travis 
> official blog 
> post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (PARQUET-1091) Wrong and broken links in README

2017-09-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved PARQUET-1091.
-
   Resolution: Fixed
Fix Version/s: format-2.3.2

Issue resolved by pull request 65
[https://github.com/apache/parquet-format/pull/65]

> Wrong and broken links in README
> 
>
> Key: PARQUET-1091
> URL: https://issues.apache.org/jira/browse/PARQUET-1091
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: format-2.3.2
>
>
> Multiple links in README.md still point to the old {{Parquet/parquet-format}} 
> repository, which is now removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs

2017-09-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved PARQUET-1102.
-
   Resolution: Fixed
Fix Version/s: format-2.3.2

Issue resolved by pull request 66
[https://github.com/apache/parquet-format/pull/66]

> Travis CI builds are failing for parquet-format PRs
> ---
>
> Key: PARQUET-1102
> URL: https://issues.apache.org/jira/browse/PARQUET-1102
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Priority: Blocker
> Fix For: format-2.3.2
>
>
> Travis CI builds are failing for parquet-format PRs, probably due to the 
> migration from Ubuntu precise to trusty on Sep 1 according to [this Travis 
> official blog 
> post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs

2017-09-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-1102:

Priority: Blocker  (was: Major)

> Travis CI builds are failing for parquet-format PRs
> ---
>
> Key: PARQUET-1102
> URL: https://issues.apache.org/jira/browse/PARQUET-1102
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Priority: Blocker
>
> Travis CI builds are failing for parquet-format PRs, probably due to the 
> migration from Ubuntu precise to trusty on Sep 1 according to [this Travis 
> official blog 
> post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs

2017-09-12 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-1102:
---

 Summary: Travis CI builds are failing for parquet-format PRs
 Key: PARQUET-1102
 URL: https://issues.apache.org/jira/browse/PARQUET-1102
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Cheng Lian


Travis CI builds are failing for parquet-format PRs, probably due to the 
migration from Ubuntu precise to trusty on Sep 1 according to [this Travis 
official blog 
post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1091) Wrong and broken links in README

2017-09-07 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-1091:
---

 Summary: Wrong and broken links in README
 Key: PARQUET-1091
 URL: https://issues.apache.org/jira/browse/PARQUET-1091
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


Multiple links in README.md still point to the old {{Parquet/parquet-format}} 
repository, which is now removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (PARQUET-980) Cannot read row group larger than 2GB

2017-05-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007326#comment-16007326
 ] 

Cheng Lian edited comment on PARQUET-980 at 5/11/17 10:46 PM:
--

The current write path ensures that it never writes a page that is larger than 
2GB, but the read path may read one or more column chunks, consisting of multiple 
pages, into a single byte array (or {{ByteBuffer}}), which cannot be larger than 2GB.

We hit this issue in production because the data distribution happened to be 
similar to the situation mentioned in the JIRA description and produced a 
skewed row group containing a column chunk larger than 2GB.

I think there are two separate issues to fix:

# On the write path, the strategy that dynamically adjusts memory check 
intervals needs some tweaking. The assumption that sizes of adjacent records 
are similar can be easily broken.
# On the read path, the {{ConsecutiveChunkList.readAll()}} method should 
support reading data larger than 2GB, probably by using multiple buffers.

Another option is to ensure that no row groups larger than 2GB can ever be 
written. Thoughts?
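
For the second point, a rough sketch of what reading into multiple buffers could 
look like ({{readAllBuffers}} below only illustrates the idea and is not the 
actual {{ConsecutiveChunkList.readAll()}} change):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.util.ArrayList;
import java.util.List;

class MultiBufferRead {
  // Reads `length` bytes (possibly more than 2GB) into a list of ByteBuffers,
  // each of which stays under the 2GB limit of a single Java array.
  static List<ByteBuffer> readAllBuffers(ReadableByteChannel in, long length) throws IOException {
    final long maxChunk = Integer.MAX_VALUE - 8;
    List<ByteBuffer> buffers = new ArrayList<>();
    long remaining = length;
    while (remaining > 0) {
      ByteBuffer buf = ByteBuffer.allocate((int) Math.min(remaining, maxChunk));
      while (buf.hasRemaining()) {
        if (in.read(buf) < 0) {
          throw new IOException("Unexpected end of stream");
        }
      }
      buf.flip();
      buffers.add(buf);
      remaining -= buf.limit();
    }
    return buffers;
  }
}
{code}

Downstream code would then have to iterate over the returned buffers instead of 
assuming one contiguous array, which is the invasive part of such a change.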

BTW, the [parquet-python|https://github.com/jcrobak/parquet-python/] library 
can read this kind of malformed Parquet file successfully with [this 
patch|https://github.com/jcrobak/parquet-python/pull/56]. We used it to recover 
our data from the malformed Parquet file.


was (Author: lian cheng):
The current write path ensures that it never writes a page that is larger than 
2GB, but the read path may read 1 or more column chunks consisting of multiple 
pages into a single byte array (or {{ByteBuffer}}) no larger than 2GB.

We hit this issue in production because the data distribution happened to be 
similar to the situation mentioned in the JIRA description and produced a 
skewed row group containing a column chunk larger than 2GB.

I think there are two separate issues to fix:

# On the write path, the strategy that dynamically adjusts memory check 
intervals needs some tweaking. The assumption that sizes of adjacent records 
are similar can be easily broken.
# On the read path, the {{ConsecutiveChunkList.readAll()}} method should 
support reading data larger than 2GB, probably by using multiple buffers.

Another option is to ensure that no row groups larger than 2GB can be ever 
written. Thoughts?

BTW, the [parquet-python|https://github.com/jcrobak/parquet-python/] library 
can read this kind of malformed Parquet file successfully with [this 
patch|https://github.com/jcrobak/parquet-python/pull/56]. We used it to recover 
our data from the malformed Parquet file.

> Cannot read row group larger than 2GB
> -
>
> Key: PARQUET-980
> URL: https://issues.apache.org/jira/browse/PARQUET-980
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1, 1.8.2
>Reporter: Herman van Hovell
>
> Parquet MR 1.8.2 does not support reading row groups which are larger than 2 
> GB. 
> See: https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064
> We are seeing this when writing skewed records. This throws off the 
> estimation of the memory check interval in the InternalParquetRecordWriter. 
> The following spark code illustrates this:
> {noformat}
> /**
>  * Create a data frame that will make parquet write a file with a row group 
> larger than 2 GB. Parquet
>  * only checks the size of the row group after writing a number of records. 
> This number is based on
>  * average row size of the already written records. This is problematic in 
> the following scenario:
>  * - The initial (100) records in the record group are relatively small.
>  * - The InternalParquetRecordWriter checks if it needs to write to disk (it 
> should not), it assumes
>  *   that the remaining records have a similar size, and (greatly) increases 
> the check interval (usually
>  *   to 1).
> - The remaining records are much larger than expected, making the row 
> group larger than 2 GB (which
>  *   makes reading the row group impossible).
>  *
>  * The data frame below illustrates such a scenario. This creates a row group 
> of approximately 4GB.
>  */
> val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
>   var i = 0
>   val random = new scala.util.Random(42)
> val buffer = new Array[Char](750000)
>   iterator.map { id =>
> // the first 200 records have a length of 1K and the remaining 2000 have 
> a length of 750K.
> val numChars = if (i < 200) 1000 else 750000
> i += 1
> // create a random array
> var j = 0
> while (j < numChars) {
>   // Generate a char (borrowed from scala.util.Random)
>   buffer(j) = (random.nextInt(0xD800 - 1) + 1).toChar
>   j += 1
> }
> 

[jira] [Commented] (PARQUET-980) Cannot read row group larger than 2GB

2017-05-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007326#comment-16007326
 ] 

Cheng Lian commented on PARQUET-980:


The current write path ensures that it never writes a page that is larger than 
2GB, but the read path may read one or more column chunks, consisting of multiple 
pages, into a single byte array (or {{ByteBuffer}}), which cannot be larger than 2GB.

We hit this issue in production because the data distribution happened to be 
similar to the situation mentioned in the JIRA description and produced a 
skewed row group containing a column chunk larger than 2GB.

I think there are two separate issues to fix:

# On the write path, the strategy that dynamically adjusts memory check 
intervals needs some tweaking. The assumption that sizes of adjacent records 
are similar can be easily broken.
# On the read path, the {{ConsecutiveChunkList.readAll()}} method should 
support reading data larger than 2GB, probably by using multiple buffers.

Another option is to ensure that no row groups larger than 2GB can ever be 
written. Thoughts?

BTW, the [parquet-python|https://github.com/jcrobak/parquet-python/] library 
can read this kind of malformed Parquet file successfully with [this 
patch|https://github.com/jcrobak/parquet-python/pull/56]. We used it to recover 
our data from the malformed Parquet file.

> Cannot read row group larger than 2GB
> -
>
> Key: PARQUET-980
> URL: https://issues.apache.org/jira/browse/PARQUET-980
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1, 1.8.2
>Reporter: Herman van Hovell
>
> Parquet MR 1.8.2 does not support reading row groups which are larger than 2 
> GB. 
> See: https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064
> We are seeing this when writing skewed records. This throws off the 
> estimation of the memory check interval in the InternalParquetRecordWriter. 
> The following spark code illustrates this:
> {noformat}
> /**
>  * Create a data frame that will make parquet write a file with a row group 
> larger than 2 GB. Parquet
>  * only checks the size of the row group after writing a number of records. 
> This number is based on
>  * average row size of the already written records. This is problematic in 
> the following scenario:
>  * - The initial (100) records in the record group are relatively small.
>  * - The InternalParquetRecordWriter checks if it needs to write to disk (it 
> should not), it assumes
>  *   that the remaining records have a similar size, and (greatly) increases 
> the check interval (usually
>  *   to 1).
> - The remaining records are much larger than expected, making the row 
> group larger than 2 GB (which
>  *   makes reading the row group impossible).
>  *
>  * The data frame below illustrates such a scenario. This creates a row group 
> of approximately 4GB.
>  */
> val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
>   var i = 0
>   val random = new scala.util.Random(42)
> val buffer = new Array[Char](750000)
>   iterator.map { id =>
> // the first 200 records have a length of 1K and the remaining 2000 have 
> a length of 750K.
> val numChars = if (i < 200) 1000 else 750000
> i += 1
> // create a random array
> var j = 0
> while (j < numChars) {
>   // Generate a char (borrowed from scala.util.Random)
>   buffer(j) = (random.nextInt(0xD800 - 1) + 1).toChar
>   j += 1
> }
> // create a string: the string constructor will copy the buffer.
> new String(buffer, 0, numChars)
>   }
> }
> badDf.write.parquet("somefile")
> val corruptedDf = spark.read.parquet("somefile")
> corruptedDf.select(count(lit(1)), max(length($"value"))).show()
> {noformat}
> The latter fails with the following exception:
> {noformat}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1064)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:698)
> ...
> {noformat}
> This seems to be fixed by commit 
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
>  in parquet 1.9.x. Is there any chance that we can fix this in 1.8.x?
>  This can happen when 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PARQUET-980) Cannot read row group larger than 2GB

2017-05-11 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-980:
---
Affects Version/s: 1.8.1
   1.8.2

> Cannot read row group larger than 2GB
> -
>
> Key: PARQUET-980
> URL: https://issues.apache.org/jira/browse/PARQUET-980
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1, 1.8.2
>Reporter: Herman van Hovell
>
> Parquet MR 1.8.2 does not support reading row groups which are larger than 2 
> GB. 
> See: https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064
> We are seeing this when writing skewed records. This throws off the 
> estimation of the memory check interval in the InternalParquetRecordWriter. 
> The following spark code illustrates this:
> {noformat}
> /**
>  * Create a data frame that will make parquet write a file with a row group 
> larger than 2 GB. Parquet
>  * only checks the size of the row group after writing a number of records. 
> This number is based on
>  * average row size of the already written records. This is problematic in 
> the following scenario:
>  * - The initial (100) records in the record group are relatively small.
>  * - The InternalParquetRecordWriter checks if it needs to write to disk (it 
> should not), it assumes
>  *   that the remaining records have a similar size, and (greatly) increases 
> the check interval (usually
>  *   to 1).
> - The remaining records are much larger than expected, making the row 
> group larger than 2 GB (which
>  *   makes reading the row group impossible).
>  *
>  * The data frame below illustrates such a scenario. This creates a row group 
> of approximately 4GB.
>  */
> val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
>   var i = 0
>   val random = new scala.util.Random(42)
> val buffer = new Array[Char](750000)
>   iterator.map { id =>
> // the first 200 records have a length of 1K and the remaining 2000 have 
> a length of 750K.
> val numChars = if (i < 200) 1000 else 750000
> i += 1
> // create a random array
> var j = 0
> while (j < numChars) {
>   // Generate a char (borrowed from scala.util.Random)
>   buffer(j) = (random.nextInt(0xD800 - 1) + 1).toChar
>   j += 1
> }
> // create a string: the string constructor will copy the buffer.
> new String(buffer, 0, numChars)
>   }
> }
> badDf.write.parquet("somefile")
> val corruptedDf = spark.read.parquet("somefile")
> corruptedDf.select(count(lit(1)), max(length($"value"))).show()
> {noformat}
> The latter fails with the following exception:
> {noformat}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1064)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:698)
> ...
> {noformat}
> This seems to be fixed by commit 
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
>  in parquet 1.9.x. Is there any chance that we can fix this in 1.8.x?
>  This can happen when 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PARQUET-893) GroupColumnIO.getFirst() doesn't check for empty groups

2017-02-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-893:
---
Description: 
The following Spark snippet reproduces this issue with Spark 2.1 (with 
parquet-mr 1.8.1) and Spark 2.2-SNAPSHOT (with parquet-mr 1.8.2):

{code}
import org.apache.spark.sql.types._

val path = "/tmp/parquet-test"

case class Inner(f00: Int)
case class Outer(f0: Inner, f1: Int)

val df = Seq(Outer(Inner(1), 1)).toDF()

df.printSchema()
// root
//  |-- f0: struct (nullable = true)
//  ||-- f00: integer (nullable = false)
//  |-- f1: integer (nullable = false)

df.write.mode("overwrite").parquet(path)

val requestedSchema =
  new StructType().
add("f0", new StructType().
  // This nested field name differs from the original one
  add("f01", IntegerType)).
add("f1", IntegerType)

println(requestedSchema.treeString)
// root
//  |-- f0: struct (nullable = true)
//  ||-- f01: integer (nullable = true)
//  |-- f1: integer (nullable = true)

spark.read.schema(requestedSchema).parquet(path).show()
{code}

In the above snippet, {{requestedSchema}} is compatible with the schema of the 
written Parquet file, but the following exception is thrown:

{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
block -1 in file 
file:/tmp/parquet-test/part-7-d2b0bec1-7be5-4b51-8d53-3642680bc9c2.snappy.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at 
org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
at 
org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
at 
org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:277)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
... 21 more
{noformat}

According to this stack trace, it seems that {{GroupColumnIO.getFirst()}} 
[doesn't check for empty 
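
A purely hypothetical sketch of the missing guard (the helper name and the 
exception choice are illustrative only, not the actual parquet-mr fix):

{code:java}
import java.util.List;

class EmptyGroupGuard {
  // The existing getFirst() effectively does children.get(0), which throws the raw
  // IndexOutOfBoundsException above when the requested group matched no column in
  // the file. Checking the empty case first would allow a clearer error (or a
  // null/absent result) instead.
  static <T> T firstChildOrFail(List<T> children, String groupName) {
    if (children.isEmpty()) {
      throw new IllegalStateException(
          "Group '" + groupName + "' has no matching leaf columns in the file schema");
    }
    return children.get(0);
  }
}
{code}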

[jira] [Created] (PARQUET-893) GroupColumnIO.getFirst() doesn't check for empty groups

2017-02-22 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-893:
--

 Summary: GroupColumnIO.getFirst() doesn't check for empty groups
 Key: PARQUET-893
 URL: https://issues.apache.org/jira/browse/PARQUET-893
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.1
Reporter: Cheng Lian


The following Spark 2.1 snippet reproduces this issue:

{code}
import org.apache.spark.sql.types._

val path = "/tmp/parquet-test"

case class Inner(f00: Int)
case class Outer(f0: Inner, f1: Int)

val df = Seq(Outer(Inner(1), 1)).toDF()

df.printSchema()
// root
//  |-- f0: struct (nullable = true)
//  ||-- f00: integer (nullable = false)
//  |-- f1: integer (nullable = false)

df.write.mode("overwrite").parquet(path)

val requestedSchema =
  new StructType().
add("f0", new StructType().
  // This nested field name differs from the original one
  add("f01", IntegerType)).
add("f1", IntegerType)

println(requestedSchema.treeString)
// root
//  |-- f0: struct (nullable = true)
//  ||-- f01: integer (nullable = true)
//  |-- f1: integer (nullable = true)

spark.read.schema(requestedSchema).parquet(path).show()
{code}

In the above snippet, {{requestedSchema}} is compatible with the schema of the 
written Parquet file, but the following exception is thrown:

{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
block -1 in file 
file:/tmp/parquet-test/part-7-d2b0bec1-7be5-4b51-8d53-3642680bc9c2.snappy.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at 
org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
at 
org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
at 
org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:277)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
... 21 more
{noformat}

According to this stack trace, it seems that {{GroupColumnIO.getFirst()}} 
[doesn't check for 

[jira] [Created] (PARQUET-754) Deprecate the "strict" argument in MessageType.union()

2016-10-17 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-754:
--

 Summary: Deprecate the "strict" argument in MessageType.union()
 Key: PARQUET-754
 URL: https://issues.apache.org/jira/browse/PARQUET-754
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.1
Reporter: Cheng Lian
Priority: Minor


As discussed in PARQUET-379, non-strict schema merging doesn't really make any 
sense, and we always set the argument to true throughout the code base. We should 
probably deprecate it and make sure no internal code ever uses non-strict schema 
merging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-753) GroupType.union() doesn't merge the original type

2016-10-17 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583942#comment-15583942
 ] 

Cheng Lian commented on PARQUET-753:


PARQUET-379 resolves the {{union}} issue related to primitive types, but 
doesn't handle group types.

> GroupType.union() doesn't merge the original type
> -
>
> Key: PARQUET-753
> URL: https://issues.apache.org/jira/browse/PARQUET-753
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.1
>Reporter: Deneche A. Hakim
>
> When merging two GroupType, the union() method doesn't merge their original 
> type which will be lost after the union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-655) The LogicalTypes.md link in README.md points to the old Parquet GitHub repository

2016-07-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-655:
---
Component/s: parquet-format

> The LogicalTypes.md link in README.md points to the old Parquet GitHub 
> repository
> -
>
> Key: PARQUET-655
> URL: https://issues.apache.org/jira/browse/PARQUET-655
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-655) The LogicalTypes.md link in README.md points to the old Parquet GitHub repository

2016-07-08 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-655:
--

 Summary: The LogicalTypes.md link in README.md points to the old 
Parquet GitHub repository
 Key: PARQUET-655
 URL: https://issues.apache.org/jira/browse/PARQUET-655
 Project: Parquet
  Issue Type: Bug
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-654) Make record-level filtering optional

2016-07-08 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-654:
--

 Summary: Make record-level filtering optional
 Key: PARQUET-654
 URL: https://issues.apache.org/jira/browse/PARQUET-654
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Cheng Lian


For some engines, especially those with vectorized Parquet readers, filter 
predicates can often be evaluated more efficiently by the engine itself. In these 
cases, Parquet record-level filtering may even slow down query execution when 
filter push-down is enabled. On the other hand, when the data is well prepared, 
filter push-down can be very valuable due to row group level filtering.

One possible improvement here is to add a configuration option that makes 
record-level filtering optional. In this way, the upper-level engine may 
leverage both Parquet row group level filtering and faster native record-level 
filtering.
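
For illustration only, such an option might be toggled from the engine side 
roughly like this (the configuration key name below is hypothetical; no such 
option exists at the time of writing):

{code:java}
import org.apache.hadoop.conf.Configuration;

class RecordLevelFilteringConfig {
  static void disableRecordLevelFiltering(Configuration conf) {
    // Hypothetical key: keep row group level filtering (statistics/dictionary based),
    // but let the engine apply the predicate to individual records itself.
    conf.setBoolean("parquet.filter.record-level.enabled", false);
  }
}
{code}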



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-651) Parquet-avro fails to decode array of record with a single field name "element" correctly

2016-07-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-651:
---
Affects Version/s: 1.9.0

> Parquet-avro fails to decode array of record with a single field name 
> "element" correctly
> -
>
> Key: PARQUET-651
> URL: https://issues.apache.org/jira/browse/PARQUET-651
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.7.0, 1.8.0, 1.8.1, 1.9.0
>Reporter: Cheng Lian
>
> Found this issue while investigating SPARK-16344.
> For the following Parquet schema
> {noformat}
> message root {
>   optional group f (LIST) {
> repeated group list {
>   optional group element {
> optional int64 element;
>   }
> }
>   }
> }
> {noformat}
> parquet-avro decodes it as something like this:
> {noformat}
> record SingleElement {
>   int element;
> }
> record NestedSingleElement {
>   SingleElement element;
> }
> record Spark16344Wrong {
>   array f;
> }
> {noformat}
> while correct interpretation should be:
> {noformat}
> record SingleElement {
>   int element;
> }
> record Spark16344 {
>   array f;
> }
> {noformat}
> The reason is that the {{element}} syntactic group for LIST in
> {noformat}
>  group  (LIST) {
>   repeated group list {
>   element;
>   }
> }
> {noformat}
> is recognized as a record field named {{element}}. The problematic code lies 
> in 
> [{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858].
>  We should probably check the standard 3-level layout first before falling 
> back to the legacy 2-level layout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-651) Parquet-avro fails to decode array of record with a single field name "element" correctly

2016-07-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-651:
---
Description: 
Found this issue while investigating SPARK-16344.

For the following Parquet schema

{noformat}
message root {
  optional group f (LIST) {
repeated group list {
  optional group element {
optional int64 element;
  }
}
  }
}
{noformat}

parquet-avro decodes it as something like this:

{noformat}
record SingleElement {
  int element;
}

record NestedSingleElement {
  SingleElement element;
}

record Spark16344Wrong {
  array f;
}
{noformat}

while correct interpretation should be:

{noformat}
record SingleElement {
  int element;
}

record Spark16344 {
  array f;
}
{noformat}

The reason is that the {{element}} syntactic group for LIST in

{noformat}
 group  (LIST) {
  repeated group list {
  element;
  }
}
{noformat}

is recognized as a record field named {{element}}. The problematic code lies in 
[{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858].
 We should probably check the standard 3-level layout first before falling back 
to the legacy 2-level layout.
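
For reference, a rough sketch of what checking the standard 3-level shape first 
could look like (a hypothetical helper, not the actual {{isElementType()}} logic; 
real code would also have to handle the naming variants produced by legacy writers):

{code:java}
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.Type;

class ListShapeCheck {
  // Returns true when `repeatedType` looks like the standard 3-level LIST layout
  // "repeated group list { <element>; }"; in that case its single child is the list
  // element and must not be treated as a record field named "element".
  static boolean isStandardThreeLevelList(Type repeatedType) {
    if (repeatedType.isPrimitive()) {
      return false; // legacy 2-level layout: the repeated primitive is the element
    }
    GroupType group = repeatedType.asGroupType();
    return group.getFieldCount() == 1 && "list".equals(group.getName());
  }
}
{code}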


  was:
Found this issue while investigating SPARK-16344.

For the following Parquet schema

{noformat}
message root {
  optional group f (LIST) {
repeated group list {
  optional group element {
optional int64 element;
  }
}
  }
}
{noformat}

parquet-avro decodes it as something like this:

{noformat}
record SingleElement {
  int element;
}

record NestedSingleElement {
  SingleElement element;
}

record Spark16344Wrong {
  array f;
}
{noformat}

while correct interpretation should be:

{noformat}
record SingleElement {
  int element;
}

record Spark16344 {
  array f;
}
{noformat}

The reason is that the {{element}} syntactic group for LIST in

{noformat}
 group  (LIST) {
  repeated group list {
  element;
  }
}
{noformat}

is recognized as record field {{SingleElement.element}}. The problematic code 
lies in 
[{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858].
 We should probably check the standard 3-level layout first before falling back 
to the legacy 2-level layout.



> Parquet-avro fails to decode array of record with a single field name 
> "element" correctly
> -
>
> Key: PARQUET-651
> URL: https://issues.apache.org/jira/browse/PARQUET-651
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.7.0, 1.8.0, 1.8.1
>Reporter: Cheng Lian
>
> Found this issue while investigating SPARK-16344.
> For the following Parquet schema
> {noformat}
> message root {
>   optional group f (LIST) {
> repeated group list {
>   optional group element {
> optional int64 element;
>   }
> }
>   }
> }
> {noformat}
> parquet-avro decodes it as something like this:
> {noformat}
> record SingleElement {
>   int element;
> }
> record NestedSingleElement {
>   SingleElement element;
> }
> record Spark16344Wrong {
>   array f;
> }
> {noformat}
> while correct interpretation should be:
> {noformat}
> record SingleElement {
>   int element;
> }
> record Spark16344 {
>   array f;
> }
> {noformat}
> The reason is that the {{element}} syntactic group for LIST in
> {noformat}
>  group  (LIST) {
>   repeated group list {
>   element;
>   }
> }
> {noformat}
> is recognized as a record field named {{element}}. The problematic code lies 
> in 
> [{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858].
>  We should probably check the standard 3-level layout first before falling 
> back to the legacy 2-level layout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-651) Parquet-avro fails to decode array of record with a single field name "element" correctly

2016-07-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-651:
---
Description: 
Found this issue while investigating SPARK-16344.

For the following Parquet schema

{noformat}
message root {
  optional group f (LIST) {
repeated group list {
  optional group element {
optional int64 element;
  }
}
  }
}
{noformat}

parquet-avro decodes it as something like this:

{noformat}
record SingleElement {
  int element;
}

record NestedSingleElement {
  SingleElement element;
}

record Spark16344Wrong {
  array f;
}
{noformat}

while correct interpretation should be:

{noformat}
record SingleElement {
  int element;
}

record Spark16344 {
  array f;
}
{noformat}

The reason is that the {{element}} syntactic group for LIST in

{noformat}
 group  (LIST) {
  repeated group list {
  element;
  }
}
{noformat}

is recognized as record field {{SingleElement.element}}. The problematic code 
lies in 
[{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858].
 We should probably check the standard 3-level layout first before falling back 
to the legacy 2-level layout.


  was:
Found this issue while investigating SPARK-16344.

For the following Parquet schema

{noformat}
message root {
  optional group f (LIST) {
repeated group list {
  optional group element {
optional int64 element;
  }
}
  }
}
{noformat}

parquet-avro decodes it as something like this:

{noformat}
record SingleElement {
  int element;
}

record NestedSingleElement {
  SingleElement element;
}

record Spark16344Wrong {
  array f;
}
{noformat}

while correct interpretation should be:

{noformat}
record SingleElement {
  int element;
}

record Spark16344 {
  array f;
}
{noformat}

Adding the following test case to {{TestArrayCompatibility}} may reproduce this 
issue:

{code:java}
@Test
public void testSpark16344() throws Exception {
  Path test = writeDirect(
  "message root {" +
  "  optional group f (LIST) {" +
  "repeated group list {" +
  "  optional group element {" +
  "optional int32 element;" +
  "  }" +
  "}" +
  "  }" +
  "}",
  new DirectWriter() {
@Override
public void write(RecordConsumer rc) {
  rc.startMessage();
  rc.startField("f", 0);

  rc.startGroup();
  rc.startField("list", 0);

  rc.startGroup();
  rc.startField("element", 0);

  rc.startGroup();
  rc.startField("element", 0);

  rc.addInteger(42);

  rc.endField("element", 0);
  rc.endGroup();

  rc.endField("element", 0);
  rc.endGroup();

  rc.endField("list", 0);
  rc.endGroup();

  rc.endField("f", 0);
  rc.endMessage();
}

  });

  Schema element = record("rec", field("element", primitive(Schema.Type.INT)));
  Schema expectedSchema = record("root", field("f", array(element)));

  GenericRecord expectedRecord =
  instance(expectedSchema, "f", Collections.singletonList(instance(element, 
42)));

  assertReaderContains(newBehaviorReader(test), expectedSchema, expectedRecord);
}
{code}

The reason is that the {{element}} syntactic group for LIST in

{noformat}
 group  (LIST) {
  repeated group list {
  element;
  }
}
{noformat}

is recognized as record field {{SingleElement.element}}. The problematic code 
lies in 
[{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858].
 We should probably check the standard 3-level layout first before falling back 
to the legacy 2-level layout.



> Parquet-avro fails to decode array of record with a single field name 
> "element" correctly
> -
>
> Key: PARQUET-651
> URL: https://issues.apache.org/jira/browse/PARQUET-651
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.7.0, 1.8.0, 1.8.1
>Reporter: Cheng Lian
>
> Found this issue while investigating SPARK-16344.
> For the following Parquet schema
> {noformat}
> message root {
>   optional group f (LIST) {
> repeated group list {
>   optional group element {
> optional int64 element;
>   }
> }
>   }
> }
> {noformat}
> parquet-avro decodes it as something like this:
> {noformat}
> record SingleElement {
>   int element;
> }
> record NestedSingleElement {
>   SingleElement element;
> }
> record Spark16344Wrong {
>   array f;
> }
> {noformat}
> 

[jira] [Resolved] (PARQUET-528) Fix flush() for RecordConsumer and implementations

2016-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved PARQUET-528.

Resolution: Fixed

Issue resolved by pull request 325
[https://github.com/apache/parquet-mr/pull/325]

> Fix flush() for RecordConsumer and implementations
> --
>
> Key: PARQUET-528
> URL: https://issues.apache.org/jira/browse/PARQUET-528
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1
>Reporter: Liwei Lin
>Assignee: Liwei Lin
> Fix For: 1.9.0
>
>
> _+flush()+_ was added in _+RecordConsumer+_ and _+MessageColumnIO+_ to help 
> implement nulls caching.
> However, other _+RecordConsumer+_ implementations should also implement 
> _+flush()+_ properly. For instance, _+RecordConsumerLoggingWrapper+_ and 
> _+ValidatingRecordConsumer+_ should call _+delegate.flush()+_ in their 
> _+flush()+_ methods, otherwise data might be mistakenly truncated.
> This ticket:
> - makes _+flush()+_ abstract in _+RecordConsumer+_
> - implements _+flush()+_ properly for all _+RecordConsumer+_ subclasses, 
> specifically:
> -- _+RecordConsumerLoggingWrapper+_
> -- _+ValidatingRecordConsumer+_
> -- _+ConverterConsumer+_
> -- _+ExpectationValidatingRecordConsumer+_



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-401) Deprecate Log and move to SLF4J Logger

2016-02-01 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127668#comment-15127668
 ] 

Cheng Lian commented on PARQUET-401:


Fixing this issue is nice to have, but it probably shouldn't block 1.9.0.

> Deprecate Log and move to SLF4J Logger
> --
>
> Key: PARQUET-401
> URL: https://issues.apache.org/jira/browse/PARQUET-401
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.1
>Reporter: Ryan Blue
>
> The current Log class is intended to allow swapping out logger back-ends, but 
> SLF4J already does this. It also doesn't expose as nice of an API as SLF4J, 
> which can handle formatting to avoid the cost of building log messages that 
> won't be used. I think we should deprecate the org.apache.parquet.Log class 
> and move to using SLF4J directly, instead of wrapping SLF4J (PARQUET-305).
> This will require deprecating the current Log class and replacing the current 
> uses of it with SLF4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-495) Fix mismatches in Types class comments

2016-02-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved PARQUET-495.

Resolution: Fixed

Issue resolved by pull request 317
[https://github.com/apache/parquet-mr/pull/317]

> Fix mismatches in Types class comments
> --
>
> Key: PARQUET-495
> URL: https://issues.apache.org/jira/browse/PARQUET-495
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1
>Reporter: Liwei Lin
>Assignee: Liwei Lin
>Priority: Trivial
> Fix For: 1.9.0
>
>
> To produce:
> required group User \{
> required int64 id;
> *optional* binary email (UTF8);
> \}
> we should do:
> Types.requiredGroup()
>   .required(INT64).named("id")
>   .-*required* (BINARY).as(UTF8).named("email")-
>   .*optional* (BINARY).as(UTF8).named("email")
>   .named("User")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-432) Complete a todo for method ColumnDescriptor.compareTo()

2016-01-29 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved PARQUET-432.

Resolution: Fixed

Issue resolved by pull request 314
[https://github.com/apache/parquet-mr/pull/314]

> Complete a todo for method ColumnDescriptor.compareTo()
> ---
>
> Key: PARQUET-432
> URL: https://issues.apache.org/jira/browse/PARQUET-432
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1
>Reporter: Liwei Lin
>Assignee: Liwei Lin
>Priority: Minor
> Fix For: 1.9.0
>
>
> The ticket proposes to consider the case *path.length < o.path.length* in the 
> method ColumnDescriptor.compareTo().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-398) Testing JIRA ticket for testing committership

2015-12-02 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-398:
--

 Summary: Testing JIRA ticket for testing committership
 Key: PARQUET-398
 URL: https://issues.apache.org/jira/browse/PARQUET-398
 Project: Parquet
  Issue Type: Test
Reporter: Cheng Lian
Priority: Minor


This ticket is only used for testing committership. Please keep it open.

New committers can submit a PR to add their names to {{dev/COMMITTERS.md}}, and 
attach the ID of this JIRA ticket to the PR title (this convention is required by 
the {{dev/merge_parquet_pr.py}} script).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-389) Filter predicates should work with missing columns

2015-10-28 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-389:
--

 Summary: Filter predicates should work with missing columns
 Key: PARQUET-389
 URL: https://issues.apache.org/jira/browse/PARQUET-389
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.0, 1.7.0, 1.6.0
Reporter: Cheng Lian


This issue originates from SPARK-11103, which contains detailed information 
about how to reproduce it.

The major problem here is that pushed-down filter predicates assert that the 
columns they touch must exist in the target physical files. But this isn't true 
in the case of schema merging.

Actually, this assertion is unnecessary: if a column referenced by a filter is 
missing from the physical file, the column is considered to be filled with nulls, 
and all the filters should be able to act accordingly. For example, if we push 
down {{a = 1}} but {{a}} is missing in the underlying physical file, all records 
in this file should be dropped since {{a}} is always null. On the other hand, if 
we push down {{a IS NULL}}, all records should be preserved.
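
To make this concrete, a small sketch using the {{filter2}} predicate API (the 
column name {{a}} is just the example above):

{code:java}
import static org.apache.parquet.filter2.predicate.FilterApi.eq;
import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;

import org.apache.parquet.filter2.predicate.FilterPredicate;

class MissingColumnFilters {
  // "a = 1": if column "a" is missing from a file, every value of "a" is null,
  // so nothing can match and the whole file can safely be skipped.
  static final FilterPredicate dropsEverything = eq(intColumn("a"), 1);

  // "a IS NULL" (eq against null): with a missing column every record matches,
  // so all records should be preserved rather than rejected up front.
  static final FilterPredicate keepsEverything = eq(intColumn("a"), (Integer) null);
}
{code}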



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (PARQUET-379) PrimitiveType.union erases original type

2015-09-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933630#comment-14933630
 ] 

Cheng Lian edited comment on PARQUET-379 at 9/28/15 5:34 PM:
-

While trying to fix this issue, I ran into a question regarding the {{strict}} 
argument of {{PrimitiveType.union}} and, correspondingly, 
{{MessageType.union}}. It seems that throughout the whole parquet-mr code base 
(including tests), we always call these methods with {{strict}} set to {{true}}, 
which means that the primitive types should match.

Maybe I missed something here, but I don't see a sound use case for non-strict 
schema merging. In particular, the field types of {{t1.union(t2, false)}} are 
completely determined by {{t1}}, rather than by the "wider" ones:
{noformat}
message t1 { required int32 f; }
message t2 { required int64 f; }

t1.union(t2, false) =>
  message t3 { required int32 f; }
{noformat}
Basically we can't use such a schema to read actual Parquet files even if we 
add some sort of automatic "type widening" logic inside Parquet readers since 
the merged one above loses precision.

So my questions are:
# Is there a practical scenario where non-strict schema merging makes sense?
# If not, should we deprecate it? (We can't remove it since 
{{MessageType.union(MessageType, boolean)}} is part of the public API.)
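
For illustration, a minimal sketch of the {{t1}}/{{t2}} example above using the 
{{Types}} builder API (not a proposal for actual code):

{code:java}
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Types;

class NonStrictUnionExample {
  public static void main(String[] args) {
    MessageType t1 = Types.buildMessage().required(INT32).named("f").named("t1");
    MessageType t2 = Types.buildMessage().required(INT64).named("f").named("t2");

    // Non-strict merging keeps t1's int32 field and silently drops the wider int64,
    // so the merged schema loses precision and cannot read files written with t2.
    System.out.println(t1.union(t2, false));

    // Strict merging, i.e. t1.union(t2, true), rejects the int32/int64 mismatch instead.
  }
}
{code}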



was (Author: lian cheng):
While trying to fix this issue, I got a problem regarding to the {{strict}} 
argument of {{PrimitiveType.union}}, and correspondingly, 
{{MessageType.union}}.  Seems that throughout the whole parquet-mr code base 
(including tests), we always call these methods with {{strict}} being {{true}}, 
which means schema primitive types should match.

Maybe I missed something here, but I don't see a sound use case of non-strict 
schema merging.  Especially, the field types of {{t1.union(t2, false)}} is 
completely determined by {{t1}}, rather than the "wider" types of the two:
{noformat}
message t1 { required int32 f; }
message t2 { required int64 f; }

t1.union(t2, false) =>
  message t3 { required int32 f; }
{noformat}
Basically we can't use such a schema to read actual Parquet files even if we 
add some sort of automatic "type widening" logic inside Parquet readers since 
the merged one above loses precision.

So my questions are:
# Is there a practical scenario where non-strict schema merging makes sense?
# If not, should we deprecate it? (We can't remove it since 
{{MessageType.union(MessageType, boolean)}} is part of the public API.


> PrimitiveType.union erases original type
> 
>
> Key: PARQUET-379
> URL: https://issues.apache.org/jira/browse/PARQUET-379
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Cheng Lian
>
> The following ScalaTest test case
> {code}
>   test("merge primitive types") {
> val expected =
>   Types.buildMessage()
> .addField(
>   Types
> .required(INT32)
> .as(DECIMAL)
> .precision(7)
> .scale(2)
> .named("f"))
> .named("root")
> assert(expected.union(expected) === expected)
>   }
> {code}
> produces the following assertion error
> {noformat}
> message root {
>   required int32 f;
> }
>  did not equal message root {
>   required int32 f (DECIMAL(9,0));
> }
> {noformat}
> This is because {{PrimitiveType.union}} doesn't handle original type 
> properly. An open question is that, can two primitive types with the same 
> primitive type name but different original types be unioned?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-385) PrimitiveType.union accepts fixed_len_byte_array fields with different length when strict mode is on

2015-09-28 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-385:
--

 Summary: PrimitiveType.union accepts fixed_len_byte_array fields 
with different length when strict mode is on
 Key: PARQUET-385
 URL: https://issues.apache.org/jira/browse/PARQUET-385
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.0, 1.7.0, 1.6.0, 1.5.0
Reporter: Cheng Lian


The following two schemas probably shouldn't be allowed to be union-ed when 
strict schema-merging mode is on:
{noformat}
message t1 {
  required fixed_len_byte_array(10) f;
}

message t2 {
  required fixed_len_byte_array(5) f;
}
{noformat}
But currently {{t1.union(t2, true)}} yields {{t1}}.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-379) PrimitiveType.union erases original type

2015-09-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933630#comment-14933630
 ] 

Cheng Lian commented on PARQUET-379:


While trying to fix this issue, I ran into a question regarding the {{strict}} 
argument of {{PrimitiveType.union}} and, correspondingly, 
{{MessageType.union}}. It seems that throughout the whole parquet-mr code base 
(including tests), we always call these methods with {{strict}} set to {{true}}, 
which means that the primitive types should match.

Maybe I missed something here, but I don't see a sound use case for non-strict 
schema merging. In particular, the field types of {{t1.union(t2, false)}} are 
completely determined by {{t1}}, rather than by the "wider" types of the two:
{noformat}
message t1 { required int32 f; }
message t2 { required int64 f; }

t1.union(t2, false) =>
  message t3 { required int32 f; }
{noformat}
Basically we can't use such a schema to read actual Parquet files even if we 
add some sort of automatic "type widening" logic inside Parquet readers since 
the merged one above loses precision.

So my questions are:
# Is there a practical scenario where non-strict schema merging makes sense?
# If not, should we deprecate it? (We can't remove it since 
{{MessageType.union(MessageType, boolean)}} is part of the public API.)


> PrimitiveType.union erases original type
> 
>
> Key: PARQUET-379
> URL: https://issues.apache.org/jira/browse/PARQUET-379
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Cheng Lian
>
> The following ScalaTest test case
> {code}
>   test("merge primitive types") {
> val expected =
>   Types.buildMessage()
> .addField(
>   Types
> .required(INT32)
> .as(DECIMAL)
> .precision(7)
> .scale(2)
> .named("f"))
> .named("root")
> assert(expected.union(expected) === expected)
>   }
> {code}
> produces the following assertion error
> {noformat}
> message root {
>   required int32 f;
> }
>  did not equal message root {
>   required int32 f (DECIMAL(9,0));
> }
> {noformat}
> This is because {{PrimitiveType.union}} doesn't handle original type 
> properly. An open question is that, can two primitive types with the same 
> primitive type name but different original types be unioned?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-379) PrimitiveType.union erases original type

2015-09-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934084#comment-14934084
 ] 

Cheng Lian commented on PARQUET-379:


So deprecating non-strict schema merging seems to be reasonable? Namely, 
deprecate {{MessageType.union(MessageType toMerge, boolean strict)}}, and 
always set {{strict}} to {{true}} when we call this method internally.

> PrimitiveType.union erases original type
> 
>
> Key: PARQUET-379
> URL: https://issues.apache.org/jira/browse/PARQUET-379
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Cheng Lian
>
> The following ScalaTest test case
> {code}
>   test("merge primitive types") {
> val expected =
>   Types.buildMessage()
> .addField(
>   Types
> .required(INT32)
> .as(DECIMAL)
> .precision(7)
> .scale(2)
> .named("f"))
> .named("root")
> assert(expected.union(expected) === expected)
>   }
> {code}
> produces the following assertion error
> {noformat}
> message root {
>   required int32 f;
> }
>  did not equal message root {
>   required int32 f (DECIMAL(9,0));
> }
> {noformat}
> This is because {{PrimitiveType.union}} doesn't handle original type 
> properly. An open question is that, can two primitive types with the same 
> primitive type name but different original types be unioned?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-385) PrimitiveType.union accepts fixed_len_byte_array fields with different lengths when strict mode is on

2015-09-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-385:
---
Summary: PrimitiveType.union accepts fixed_len_byte_array fields with 
different lengths when strict mode is on  (was: PrimitiveType.union accepts 
fixed_len_byte_array fields with different length when strict mode is on)

> PrimitiveType.union accepts fixed_len_byte_array fields with different 
> lengths when strict mode is on
> -
>
> Key: PARQUET-385
> URL: https://issues.apache.org/jira/browse/PARQUET-385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Cheng Lian
>
> The following two schemas probably shouldn't be allowed to be union-ed when 
> strict schema-merging mode is on:
> {noformat}
> message t1 {
>   required fixed_len_byte_array(10) f;
> }
> message t2 {
>   required fixed_len_byte_array(5) f;
> }
> {noformat}
> But currently {{t1.union(t2, true)}} yields {{t1}}.
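
For illustration, a rough sketch of the scenario above using parquet-mr's 
{{Types}} builder (the {{length(...)}} call and the observed outcome are taken 
from this description, not independently verified):
{code}
// Illustrative sketch only; assumes parquet-mr's Types builder API.
import org.apache.parquet.schema.Types
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY

val t1 = Types.buildMessage()
  .addField(Types.required(FIXED_LEN_BYTE_ARRAY).length(10).named("f"))
  .named("t1")
val t2 = Types.buildMessage()
  .addField(Types.required(FIXED_LEN_BYTE_ARRAY).length(5).named("f"))
  .named("t2")

// Per this report, strict union currently yields t1 instead of failing:
val merged = t1.union(t2, true)
{code}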



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-379) PrimitiveType.union erases original type

2015-09-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-379:
---
Description: 
The following ScalaTest test case
{code}
  test("merge primitive types") {
val expected =
  Types.buildMessage()
.addField(
  Types
.required(INT32)
.as(DECIMAL)
.precision(7)
.scale(2)
.named("f"))
.named("root")

assert(expected.union(expected) === expected)
  }
{code}
produces the following assertion error
{noformat}
message root {
  required int32 f;
}
 did not equal message root {
  required int32 f (DECIMAL(9,0));
}
{noformat}
This is because {{PrimitiveType.union}} doesn't handle original type properly. 
An open question is that, can two primitive types with the same primitive type 
name but different original types be unioned?

  was:
The following ScalaTest test case
{code}
  test("merge primitive types") {
val expected =
  Types.buildMessage()
.addField(
  Types
.required(INT32)
.as(DECIMAL)
.precision(9)
.scale(0)
.named("f"))
.named("root")

assert(expected.union(expected) === expected)
  }
{code}
produces the following assertion error
{noformat}
message root {
  required int32 f;
}
 did not equal message root {
  required int32 f (DECIMAL(9,0));
}
{noformat}
This is because {{PrimitiveType.union}} doesn't handle original type properly. 
An open question is that, can two primitive types with the same primitive type 
name but different original types be unioned?


> PrimitiveType.union erases original type
> 
>
> Key: PARQUET-379
> URL: https://issues.apache.org/jira/browse/PARQUET-379
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Cheng Lian
>
> The following ScalaTest test case
> {code}
>   test("merge primitive types") {
> val expected =
>   Types.buildMessage()
> .addField(
>   Types
> .required(INT32)
> .as(DECIMAL)
> .precision(7)
> .scale(2)
> .named("f"))
> .named("root")
> assert(expected.union(expected) === expected)
>   }
> {code}
> produces the following assertion error
> {noformat}
> message root {
>   required int32 f;
> }
>  did not equal message root {
>   required int32 f (DECIMAL(9,0));
> }
> {noformat}
> This is because {{PrimitiveType.union}} doesn't handle original type 
> properly. An open question is that, can two primitive types with the same 
> primitive type name but different original types be unioned?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-379) PrimitiveType.union erases original type

2015-09-23 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-379:
--

 Summary: PrimitiveType.union erases original type
 Key: PARQUET-379
 URL: https://issues.apache.org/jira/browse/PARQUET-379
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.0, 1.7.0, 1.6.0, 1.5.0
Reporter: Cheng Lian


The following test case
{code}
  test("merge primitive types") {
val expected =
  Types.buildMessage()
.addField(
  Types
.required(INT32)
.as(DECIMAL)
.precision(9)
.scale(0)
.named("f"))
.named("root")

assert(expected.union(expected) === expected)
  }
{code}
produces the following assertion error
{noformat}
message root {
  required int32 f;
}
 did not equal message root {
  required int32 f (DECIMAL(9,0));
}
{noformat}
This is because {{PrimitiveType.union}} doesn't handle original type properly. 
An open question is that, can two primitive types with the same primitive type 
name but different original types be unioned?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-379) PrimitiveType.union erases original type

2015-09-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-379:
---
Description: 
The following ScalaTest test case
{code}
  test("merge primitive types") {
val expected =
  Types.buildMessage()
.addField(
  Types
.required(INT32)
.as(DECIMAL)
.precision(9)
.scale(0)
.named("f"))
.named("root")

assert(expected.union(expected) === expected)
  }
{code}
produces the following assertion error
{noformat}
message root {
  required int32 f;
}
 did not equal message root {
  required int32 f (DECIMAL(9,0));
}
{noformat}
This is because {{PrimitiveType.union}} doesn't handle original type properly. 
An open question is that, can two primitive types with the same primitive type 
name but different original types be unioned?

  was:
The following test case
{code}
  test("merge primitive types") {
val expected =
  Types.buildMessage()
.addField(
  Types
.required(INT32)
.as(DECIMAL)
.precision(9)
.scale(0)
.named("f"))
.named("root")

assert(expected.union(expected) === expected)
  }
{code}
produces the following assertion error
{noformat}
message root {
  required int32 f;
}
 did not equal message root {
  required int32 f (DECIMAL(9,0));
}
{noformat}
This is because {{PrimitiveType.union}} doesn't handle original type properly. 
An open question is that, can two primitive types with the same primitive type 
name but different original types be unioned?


> PrimitiveType.union erases original type
> 
>
> Key: PARQUET-379
> URL: https://issues.apache.org/jira/browse/PARQUET-379
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Cheng Lian
>
> The following ScalaTest test case
> {code}
>   test("merge primitive types") {
> val expected =
>   Types.buildMessage()
> .addField(
>   Types
> .required(INT32)
> .as(DECIMAL)
> .precision(9)
> .scale(0)
> .named("f"))
> .named("root")
> assert(expected.union(expected) === expected)
>   }
> {code}
> produces the following assertion error
> {noformat}
> message root {
>   required int32 f;
> }
>  did not equal message root {
>   required int32 f (DECIMAL(9,0));
> }
> {noformat}
> This is because {{PrimitiveType.union}} doesn't handle original type 
> properly. An open question is that, can two primitive types with the same 
> primitive type name but different original types be unioned?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-371) Bumps Thrift version to 0.9.0

2015-09-11 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-371:
---
Summary: Bumps Thrift version to 0.9.0  (was: Add thrift9 Maven profile for 
parquet-format)

> Bumps Thrift version to 0.9.0
> -
>
> Key: PARQUET-371
> URL: https://issues.apache.org/jira/browse/PARQUET-371
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Cheng Lian
>
> Thrift 0.7.0 is too old a version, and it doesn't compile on Mac. Would be 
> nice to have a {{thrift9}} Maven profile similar to what we did for 
> parquet-mr to bump Thrift to 0.9.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-371) Bumps Thrift version to 0.9.0

2015-09-11 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-371:
---
Description: Thrift 0.7.0 is too old a version, and it doesn't compile on 
Mac. Would be nice to bump Thrift version.  (was: Thrift 0.7.0 is too old a 
version, and it doesn't compile on Mac. Would be nice to have a {{thrift9}} 
Maven profile similar to what we did for parquet-mr to bump Thrift to 0.9.)

> Bumps Thrift version to 0.9.0
> -
>
> Key: PARQUET-371
> URL: https://issues.apache.org/jira/browse/PARQUET-371
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Cheng Lian
>
> Thrift 0.7.0 is too old a version, and it doesn't compile on Mac. Would be 
> nice to bump Thrift version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (PARQUET-370) Nested records are not properly read if none of their fields are requested

2015-09-10 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734568#comment-14734568
 ] 

Cheng Lian edited comment on PARQUET-370 at 9/10/15 11:43 AM:
--

A complete sample code for reproducing this issue against parquet-mr 1.7.0 can 
be found in 
[lianch...@github.com/parquet-compat|https://github.com/liancheng/parquet-compat/blob/cbd9dd89b015049c43054c5db81737405f6618e2/src/test/scala/com/databricks/parquet/schema/SchemaEvolutionSuite.scala#L9-L61].
  This sample writes a Parquet file with schema {{S1}} and reads it back with 
{{S2}} as requested schema using parquet-avro.  Related Avro IDL definition can 
be found 
[here|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.7.0/src/main/avro/parquet-avro-compat.avdl].

BTW, this repository is a playground of mine for investigating various Parquet 
compatibility and interoperability issues.  The Scala DSL illustrated in the 
sample code is inspired by the {{writeDirect}} method in parquet-avro testing 
code.  It is defined 
[here|https://github.com/liancheng/parquet-compat/blob/db39ec3437abd3c254457c39193685e9f9dee1ed/src/main/scala/com/databricks/parquet/dsl/package.scala].
  I found it pretty neat and intuitive for building test cases, and we are 
using a similar testing API in Spark.


was (Author: lian cheng):
A complete sample code for reproducing this issue against parquet-mr 1.7.0 can 
be found in 
[lianch...@github.com/parquet-compat|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.7.0/src/main/scala/com/databricks/parquet/schema/PARQUET_370.scala].
  This sample writes a Parquet file with schema {{S1}} and reads it back with 
{{S2}} as requested schema using parquet-avro.  Related Avro IDL definition can 
be found 
[here|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.7.0/src/main/avro/parquet-avro-compat.avdl].
  The version against parquet-mr 1.8.1 is 
[here|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.8.1/src/main/scala/com/databricks/parquet/schema/PARQUET_370.scala].

BTW, this repository is a playground of mine for investigating various Parquet 
compatibility and interoperability issues.  The Scala DSL illustrated in the 
sample code is inspired by the {{writeDirect}} method in parquet-avro testing 
code.  It is defined 
[here|https://github.com/liancheng/parquet-compat/blob/db39ec3437abd3c254457c39193685e9f9dee1ed/src/main/scala/com/databricks/parquet/dsl/package.scala].
  I found it pretty neat and intuitive for building test cases, and we are 
using a similar testing API in Spark.

> Nested records are not properly read if none of their fields are requested
> --
>
> Key: PARQUET-370
> URL: https://issues.apache.org/jira/browse/PARQUET-370
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.1
>Reporter: Cheng Lian
>
> Say we have a Parquet file {{F}} with the following schema {{S1}}:
> {noformat}
> message root {
>   required group n {
> optional int32 a;
> optional int32 b;
>   }
> }
> {noformat}
> Later on, as the schema evolves, fields {{a}} and {{b}} are removed, while 
> {{c}} and {{d}} are added. Now we have schema {{S2}}:
> {noformat}
> message root {
>   required group n {
> optional int32 c;
> optional int32 d;
>   }
> }
> {noformat}
> {{S1}} and {{S2}} are compatible, so it should be OK to read {{F}} with 
> {{S2}} as requested schema.
> Say {{F}} contains a single record:
> {noformat}
> {"n": {"a": 1, "b": 2}}
> {noformat}
> When reading {{F}} with {{S2}}, expected output should be:
> {noformat}
> {"n": {"c": null, "d": null}}
> {noformat}
> But currently parquet-mr gives
> {noformat}
> {"n": null}
> {noformat}
> This is because {{MessageColumnIO}} finds that the physical Parquet file 
> contains no leaf columns defined in the requested schema, and shortcuts 
> record reading with an {{EmptyRecordReader}} for column {{n}}. See 
> [here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99].
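
A minimal sketch (assuming parquet-mr's {{MessageTypeParser}} and 
{{MessageType.getPaths}} APIs) of why the requested schema {{S2}} shares no leaf 
columns with the file schema {{S1}}, which is what triggers the 
{{EmptyRecordReader}} shortcut:
{code}
// Minimal sketch; assumes MessageTypeParser.parseMessageType and MessageType.getPaths.
import org.apache.parquet.schema.MessageTypeParser.parseMessageType
import scala.collection.JavaConverters._

val s1 = parseMessageType(
  "message root { required group n { optional int32 a; optional int32 b; } }")
val s2 = parseMessageType(
  "message root { required group n { optional int32 c; optional int32 d; } }")

// Leaf column paths are n.a/n.b vs. n.c/n.d -- no overlap, so none of the
// requested leaves exist in the file and the whole group n is short-circuited.
val s1Leaves = s1.getPaths.asScala.map(_.mkString("."))
val s2Leaves = s2.getPaths.asScala.map(_.mkString("."))
assert(s1Leaves.intersect(s2Leaves).isEmpty)
{code}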



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-371) Add thrift9 Maven profile for parquet-format

2015-09-10 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740068#comment-14740068
 ] 

Cheng Lian commented on PARQUET-371:


That would be even nicer. I'll update my PR.

> Add thrift9 Maven profile for parquet-format
> 
>
> Key: PARQUET-371
> URL: https://issues.apache.org/jira/browse/PARQUET-371
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Cheng Lian
>
> Thrift 0.7.0 is too old a version, and it doesn't compile on Mac. Would be 
> nice to have a {{thrift9}} Maven profile similar to what we did for 
> parquet-mr to bump Thrift to 0.9.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-371) Add thrift9 Maven profile for parquet-format

2015-09-09 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-371:
--

 Summary: Add thrift9 Maven profile for parquet-format
 Key: PARQUET-371
 URL: https://issues.apache.org/jira/browse/PARQUET-371
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Cheng Lian


Thrift 0.7.0 is too old a version, and it doesn't compile on Mac. Would be nice 
to have a {{thrift9}} Maven profile similar to what we did for parquet-mr to 
bump Thrift to 0.9.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-370) Nested records are not properly read if none of their fields are requested

2015-09-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734568#comment-14734568
 ] 

Cheng Lian commented on PARQUET-370:


A complete sample code for reproducing this issue against parquet-mr 1.7.0 can 
be found in 
[lianch...@github.com/parquet-compat|https://github.com/liancheng/parquet-compat/blob/db39ec3437abd3c254457c39193685e9f9dee1ed/src/main/scala/com/databricks/parquet/schema/PARQUET_370.scala].
  This sample writes a Parquet file with schema {{S1}} and reads it back with 
{{S2}} as requested schema using parquet-avro.  The version against parquet-mr 
1.8.1 is 
[here|https://github.com/liancheng/parquet-compat/tree/with-parquet-mr-1.8.1].

BTW, this repository is a playground of mine for investigating various Parquet 
compatibility and interoperability issues.  The Scala DSL illustrated in the 
sample code is inspired by the {{writeDirect}} method in parquet-avro testing 
code.  It is defined 
[here|https://github.com/liancheng/parquet-compat/blob/db39ec3437abd3c254457c39193685e9f9dee1ed/src/main/scala/com/databricks/parquet/dsl/package.scala].
  I found it pretty neat and intuitive for building test cases, and we are 
using a similar testing API in Spark.


> Nested records are not properly read if none of their fields are requested
> --
>
> Key: PARQUET-370
> URL: https://issues.apache.org/jira/browse/PARQUET-370
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.1
>Reporter: Cheng Lian
>
> Say we have a Parquet file {{F}} with the following schema {{S1}}:
> {noformat}
> message root {
>   required group n {
> optional int32 a;
> optional int32 b;
>   }
> }
> {noformat}
> Later on, as the schema evolves, fields {{a}} and {{b}} are removed, while 
> {{c}} and {{d}} are added. Now we have schema {{S2}}:
> {noformat}
> message root {
>   required group n {
> optional int32 c;
> optional int32 d;
>   }
> }
> {noformat}
> {{S1}} and {{S2}} are compatible, so it should be OK to read {{F}} with 
> {{S2}} as requested schema.
> Say {{F}} contains a single record:
> {noformat}
> {"n": {"a": 1, "b": 2}}
> {noformat}
> When reading {{F}} with {{S2}}, expected output should be:
> {noformat}
> {"n": {"c": null, "d": null}}
> {noformat}
> But currently parquet-mr gives
> {noformat}
> {"n": null}
> {noformat}
> This is because {{MessageColumnIO}} finds that the physical Parquet file 
> contains no leaf columns defined in the requested schema, and shortcuts 
> record reading with an {{EmptyRecordReader}} for column {{n}}. See 
> [here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder

2015-09-07 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14733506#comment-14733506
 ] 

Cheng Lian commented on PARQUET-369:


Here is a more concrete version in another thread 
https://groups.google.com/d/msg/parquet-dev/UjpbHbzoQj0/R6LG2gECQuIJ

[~julienledem] According to the link above, is testing the only reason why we 
have the static JUL initialization block in parquet-mr? If that is true, I'm 
happy to take a stab at removing it. We've been using a pretty hacky way in 
Spark to redirect Parquet JUL logs.

> Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
> ---
>
> Key: PARQUET-369
> URL: https://issues.apache.org/jira/browse/PARQUET-369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>
> Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see 
> [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]).
>  This also accidentally shades [this 
> line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207]
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> to
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "parquet/org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} 
> implementation even if we provide dependencies like {{slf4j-log4j12}} on the 
> classpath.
> This happens in Spark. Whenever we write a Parquet file, we see the following 
> famous message and can never get rid of it:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder

2015-09-07 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-369:
---
Description: 
Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see 
[here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]).
 This also accidentally shades [this 
line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207]
{code}
private static String STATIC_LOGGER_BINDER_PATH = 
"org/slf4j/impl/StaticLoggerBinder.class";
{code}
to
{code}
private static String STATIC_LOGGER_BINDER_PATH = 
"parquet/org/slf4j/impl/StaticLoggerBinder.class";
{code}
and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} 
implementation even if we provide dependencies like {{slf4j-log4j12}} on the 
classpath.

This happens in Spark. Whenever we write a Parquet file, we see the following 
famous message and can never get rid of it:
{noformat}
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
{noformat}

  was:
Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see 
[here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]}.
 This also accidentally shades [this 
line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207]
{code}
private static String STATIC_LOGGER_BINDER_PATH = 
"org/slf4j/impl/StaticLoggerBinder.class";
{code}
to
{code}
private static String STATIC_LOGGER_BINDER_PATH = 
"parquet/org/slf4j/impl/StaticLoggerBinder.class";
{code}
and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} 
implementation even if we provide dependencies like {{slf4j-log4j12}} on the 
classpath.

This happens in Spark. Whenever we write a Parquet file, we see the following 
famous message and can never get rid of it:
{noformat}
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
{noformat}


> Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
> ---
>
> Key: PARQUET-369
> URL: https://issues.apache.org/jira/browse/PARQUET-369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>
> Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see 
> [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]).
>  This also accidentally shades [this 
> line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207]
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> to
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "parquet/org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} 
> implementation even if we provide dependencies like {{slf4j-log4j12}} on the 
> classpath.
> This happens in Spark. Whenever we write a Parquet file, we see the following 
> famous message and can never get rid of it:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder

2015-09-05 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-369:
--

 Summary: Shading SLF4J prevents SLF4J locating 
org.slf4j.impl.StaticLoggerBinder
 Key: PARQUET-369
 URL: https://issues.apache.org/jira/browse/PARQUET-369
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Cheng Lian


Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see 
[here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]}.
 This also accidentally shades [this 
line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207]
{code}
private static String STATIC_LOGGER_BINDER_PATH = 
"org/slf4j/impl/StaticLoggerBinder.class";
{code}
to
{code}
private static String STATIC_LOGGER_BINDER_PATH = 
"parquet/org/slf4j/impl/StaticLoggerBinder.class";
{code}
and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} 
implementation even if we provide dependencies like {{slf4j-log4j12}} on the 
classpath.

This happens in Spark. Whenever we write a Parquet file, we see the following 
famous message and can never get rid of it:
{noformat}
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
{noformat}
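
To make the failure mode concrete, here is a small hedged sketch that mimics the 
resource lookup SLF4J performs (it is not the actual {{LoggerFactory}} code; the 
path strings are the ones quoted above):
{code}
// Hedged sketch: mimics SLF4J's binder lookup, not the actual LoggerFactory source.
val relocatedPath = "parquet/org/slf4j/impl/StaticLoggerBinder.class" // after shading
val originalPath  = "org/slf4j/impl/StaticLoggerBinder.class"

val cl = Thread.currentThread().getContextClassLoader
// Bindings such as slf4j-log4j12 ship the class under the original path, so the
// relocated lookup finds nothing and SLF4J falls back to the NOP logger:
println(cl.getResources(relocatedPath).hasMoreElements) // false
println(cl.getResources(originalPath).hasMoreElements)  // true if a binding is on the classpath
{code}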



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-364) Parquet-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)

2015-09-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-364:
---
Summary: Parquet-avro cannot decode Avro/Thrift array of primitive array 
(e.g. array<array<int>>)  (was: Parque-avro cannot decode Avro/Thrift array of 
primitive array (e.g. array<array<int>>))

> Parquet-avro cannot decode Avro/Thrift array of primitive array (e.g. 
> array<array<int>>)
> 
>
> Key: PARQUET-364
> URL: https://issues.apache.org/jira/browse/PARQUET-364
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Cheng Lian
> Attachments: bad-avro.parquet, bad-thrift.parquet
>
>
> The problematic Avro and Thrift schemas are:
> {noformat}
> record AvroArrayOfArray {
>   array<array<int>> int_arrays_column;
> }
> {noformat}
> and
> {noformat}
> struct ThriftListOfList {
>   1: list<list<i32>> intArraysColumn;
> }
> {noformat}
> They are converted to the following structurally equivalent Parquet schemas 
> by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively:
> {noformat}
> message AvroArrayOfArray {
>   required group int_arrays_column (LIST) {
> repeated group array (LIST) {
>   repeated int32 array;
> }
>   }
> }
> {noformat}
> and
> {noformat}
> message ParquetSchema {
>   required group intListsColumn (LIST) {
> repeated group intListsColumn_tuple (LIST) {
>   repeated int32 intListsColumn_tuple_tuple;
> }
>   }
> }
> {noformat}
> {{AvroIndexedRecordConverter}} cannot decode such records correctly. The 
> reason is that the 2nd level repeated group {{array}} doesn't pass 
> {{AvroIndexedRecordConverter.isElementType()}} check. We should check for 
> field name "array" and field name suffix "_thrift" in {{isElementType()}} to 
> fix this issue.
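
A very rough sketch of the heuristic described above, for illustration only; the 
method name is hypothetical and this does not mirror the actual 
{{AvroIndexedRecordConverter.isElementType()}} implementation:
{code}
// Illustration only -- not the real parquet-avro code. The "_tuple" suffix is taken
// from the parquet-thrift generated names shown in the schema above.
import org.apache.parquet.schema.Type

def looksLikeSyntheticListLayer(repeatedType: Type): Boolean =
  repeatedType.getName == "array" ||          // parquet-avro style two-level lists
  repeatedType.getName.endsWith("_tuple")     // parquet-thrift style two-level lists

// isElementType() would return false for such repeated groups, so the converter
// recurses into the nested LIST instead of treating the group as the element type.
{code}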



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-367) parquet-cat -j doesn't show all records

2015-08-27 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-367:
--

 Summary: parquet-cat -j doesn't show all records
 Key: PARQUET-367
 URL: https://issues.apache.org/jira/browse/PARQUET-367
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.0, 1.8.1, 1.9.0
Reporter: Cheng Lian


{noformat}
$ parquet-cat old-repeated-int.parquet
repeatedInt = 1
repeatedInt = 2
repeatedInt = 3

$ parquet-cat -j old-repeated-int.parquet
{repeatedInt:3}
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)

2015-08-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14708387#comment-14708387
 ] 

Cheng Lian commented on PARQUET-364:


Sent out a PR https://github.com/apache/parquet-mr/pull/264

 Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. 
 array<array<int>>)
 ---

 Key: PARQUET-364
 URL: https://issues.apache.org/jira/browse/PARQUET-364
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
Reporter: Cheng Lian
 Attachments: bad-avro.parquet, bad-thrift.parquet


 The problematic Avro and Thrift schemas are:
 {noformat}
 record AvroArrayOfArray {
   array<array<int>> int_arrays_column;
 }
 {noformat}
 and
 {noformat}
 struct ThriftListOfList {
   1: list<list<i32>> intArraysColumn;
 }
 {noformat}
 They are converted to the following structurally equivalent Parquet schemas 
 by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively:
 {noformat}
 message AvroArrayOfArray {
   required group int_arrays_column (LIST) {
 repeated group array (LIST) {
   repeated int32 array;
 }
   }
 }
 {noformat}
 and
 {noformat}
 message ParquetSchema {
   required group intListsColumn (LIST) {
 repeated group intListsColumn_tuple (LIST) {
   repeated int32 intListsColumn_tuple_tuple;
 }
   }
 }
 {noformat}
 {{AvroIndexedRecordConverter}} cannot decode such records correctly. The 
 reason is that the 2nd level repeated group {{array}} doesn't pass 
 {{AvroIndexedRecordConverter.isElementType()}} check. We should check for 
 field name "array" and field name suffix "_thrift" in {{isElementType()}} to 
 fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro array of primitive array (e.g. array<array<int>>)

2015-08-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706766#comment-14706766
 ] 

Cheng Lian commented on PARQUET-364:


Although I haven't verified it yet, I suspect parquet-thrift suffers from a 
similar issue, e.g. it cannot decode Parquet records translated from a Thrift 
structure like {{list<list<i32>>}}.

 Parque-avro cannot decode Avro array of primitive array (e.g. 
 array<array<int>>)
 

 Key: PARQUET-364
 URL: https://issues.apache.org/jira/browse/PARQUET-364
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
Reporter: Cheng Lian

 The following Avro schema
 {noformat}
 record AvroNonNullableArrays {
   array<array<int>> int_arrays_column;
 }
 {noformat}
 is translated into the following Parquet schema by parquet-avro 1.7.0:
 {noformat}
 message root {
   required group int_arrays_column (LIST) {
 repeated group array (LIST) {
   repeated int32 array;
 }
   }
 }
 {noformat}
 {{AvroIndexedRecordConverter}} cannot decode such records correctly. The 
 reason is that the 2nd level repeated group {{array}} doesn't pass 
 {{AvroIndexedRecordConverter.isElementType()}} check. We probably should 
 check for field name "array" in {{isElementType()}} to fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro array of primitive array (e.g. array<array<int>>)

2015-08-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706760#comment-14706760
 ] 

Cheng Lian commented on PARQUET-364:


Tried to write a test case in parquet-mr, but failed to build parquet-mr locally 
on OS X 10.10 because of some environment issue. Verified this bug while fixing 
SPARK-10136, which is the Spark counterpart of this issue. Here is a Spark SQL 
{{ParquetAvroCompatibilitySuite}} test case for reproducing it:
{code}
  test("PARQUET-364 avro array of primitive array") {
    withTempPath { dir =>
      val path = dir.getCanonicalPath

      val records = (0 until 3).map { i =>
        AvroArrayOfArray.newBuilder()
          .setIntArraysColumn(
            Seq.tabulate(3, 3)((j, k) => i + j * 3 + k: Integer).map(_.asJava).asJava)
          .build()
      }

      val writer = new AvroParquetWriter[AvroArrayOfArray](
        new Path(path), AvroArrayOfArray.getClassSchema)
      records.foreach(writer.write)
      writer.close()

      val reader = AvroParquetReader.builder[AvroArrayOfArray](new Path(path)).build()
      assert((0 until 10).map(_ => reader.read()) === records)
    }
  }
{code}
Exception:
{noformat}
[info] - PARQUET-364 avro array of primitive array *** FAILED *** (428 
milliseconds)
[info]   java.lang.ClassCastException: repeated int32 array is not a group
[info]   at org.apache.parquet.schema.Type.asGroupType(Type.java:202)
[info]   at 
org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:144)
[info]   at 
org.apache.parquet.avro.AvroIndexedRecordConverter.access$200(AvroIndexedRecordConverter.java:42)
[info]   at 
org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter$ElementConverter.<init>(AvroIndexedRecordConverter.java:548)
[info]   at 
org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.<init>(AvroIndexedRecordConverter.java:480)
[info]   at 
org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:144)
[info]   at 
org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:89)
[info]   at 
org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:60)
[info]   at 
org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:34)
[info]   at 
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:111)
[info]   at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:174)
[info]   at 
org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:151)
[info]   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:127)
[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$13.apply(ParquetAvroCompatibilitySuite.scala:186)
[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$13.apply(ParquetAvroCompatibilitySuite.scala:186)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.Range.foreach(Range.scala:141)
[info]   at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5$$anonfun$apply$mcV$sp$4.apply(ParquetAvroCompatibilitySuite.scala:186)
[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5$$anonfun$apply$mcV$sp$4.apply(ParquetAvroCompatibilitySuite.scala:170)
[info]   at 
org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:117)
[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetCompatibilityTest.withTempPath(ParquetCompatibilityTest.scala:31)
[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5.apply$mcV$sp(ParquetAvroCompatibilitySuite.scala:170)
[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5.apply(ParquetAvroCompatibilitySuite.scala:170)
[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5.apply(ParquetAvroCompatibilitySuite.scala:170)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)

[jira] [Updated] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)

2015-08-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-364:
---
Description: 
The problematic Avro and Thrift schemas are:
{noformat}
record AvroArrayOfArray {
  array<array<int>> int_arrays_column;
}
{noformat}
and
{noformat}
struct ThriftListOfList {
  1: list<list<i32>> intArraysColumn;
}
{noformat}
They are converted to the following structurally equivalent Parquet schemas by 
parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively:
{noformat}
message AvroArrayOfArray {
  required group int_arrays_column (LIST) {
repeated group array (LIST) {
  repeated int32 array;
}
  }
}
{noformat}
and
{noformat}
message ParquetSchema {
  required group intListsColumn (LIST) {
repeated group intListsColumn_tuple (LIST) {
  repeated int32 intListsColumn_tuple_tuple;
}
  }
}
{noformat}
{{AvroIndexedRecordConverter}} cannot decode such records correctly. The reason 
is that the 2nd level repeated group {{array}} doesn't pass 
{{AvroIndexedRecordConverter.isElementType()}} check. We should check for field 
name "array" and field name suffix "_thrift" in {{isElementType()}} to fix this 
issue.

  was:
The problematic Avro and Thrift schemas are:
{noformat}
record AvroArrayOfArray {
  array<array<int>> int_arrays_column;
}
{noformat}
and
{noformat}
struct ThriftListOfList {
  1: list<list<i32>> intArraysColumn;
}
{noformat}
They are converted to the following Parquet schemas by parquet-avro 1.7.0 and 
parquet-thrift 1.7.0 respectively:
{noformat}
message AvroArrayOfArray {
  required group int_arrays_column (LIST) {
repeated group array (LIST) {
  repeated int32 array;
}
  }
}
{noformat}
and
{noformat}
message ParquetSchema {
  required group intListsColumn (LIST) {
repeated group intListsColumn_tuple (LIST) {
  repeated int32 intListsColumn_tuple_tuple;
}
  }
}
{noformat}
{{AvroIndexedRecordConverter}} cannot decode such records correctly. The reason 
is that the 2nd level repeated group {{array}} doesn't pass 
{{AvroIndexedRecordConverter.isElementType()}} check. We should check for field 
name "array" and field name suffix "_thrift" in {{isElementType()}} to fix this 
issue.


 Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. 
 array<array<int>>)
 ---

 Key: PARQUET-364
 URL: https://issues.apache.org/jira/browse/PARQUET-364
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
Reporter: Cheng Lian

 The problematic Avro and Thrift schemas are:
 {noformat}
 record AvroArrayOfArray {
   array<array<int>> int_arrays_column;
 }
 {noformat}
 and
 {noformat}
 struct ThriftListOfList {
   1: list<list<i32>> intArraysColumn;
 }
 {noformat}
 They are converted to the following structurally equivalent Parquet schemas 
 by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively:
 {noformat}
 message AvroArrayOfArray {
   required group int_arrays_column (LIST) {
 repeated group array (LIST) {
   repeated int32 array;
 }
   }
 }
 {noformat}
 and
 {noformat}
 message ParquetSchema {
   required group intListsColumn (LIST) {
 repeated group intListsColumn_tuple (LIST) {
   repeated int32 intListsColumn_tuple_tuple;
 }
   }
 }
 {noformat}
 {{AvroIndexedRecordConverter}} cannot decode such records correctly. The 
 reason is that the 2nd level repeated group {{array}} doesn't pass 
 {{AvroIndexedRecordConverter.isElementType()}} check. We should check for 
 field name "array" and field name suffix "_thrift" in {{isElementType()}} to 
 fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)

2015-08-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706959#comment-14706959
 ] 

Cheng Lian commented on PARQUET-364:


I tested the Thrift case with Thrift 0.9.2, since I couldn't get Thrift 0.7.0 to 
compile on Mac OS X 10.10 due to missing C++ header files. I assume this doesn't 
change the essence of this issue.

(BTW, any plan to upgrade to Thrift 0.9.2?)

 Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. 
 array<array<int>>)
 ---

 Key: PARQUET-364
 URL: https://issues.apache.org/jira/browse/PARQUET-364
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
Reporter: Cheng Lian

 The problematic Avro and Thrift schemas are:
 {noformat}
 record AvroArrayOfArray {
   array<array<int>> int_arrays_column;
 }
 {noformat}
 and
 {noformat}
 struct ThriftListOfList {
   1: list<list<i32>> intArraysColumn;
 }
 {noformat}
 They are converted to the following Parquet schemas by parquet-avro 1.7.0 and 
 parquet-thrift 1.7.0 respectively:
 {noformat}
 message AvroArrayOfArray {
   required group int_arrays_column (LIST) {
 repeated group array (LIST) {
   repeated int32 array;
 }
   }
 }
 {noformat}
 and
 {noformat}
 message ParquetSchema {
   required group intListsColumn (LIST) {
 repeated group intListsColumn_tuple (LIST) {
   repeated int32 intListsColumn_tuple_tuple;
 }
   }
 }
 {noformat}
 {{AvroIndexedRecordConverter}} cannot decode such records correctly. The 
 reason is that the 2nd level repeated group {{array}} doesn't pass 
 {{AvroIndexedRecordConverter.isElementType()}} check. We should check for 
 field name "array" and field name suffix "_thrift" in {{isElementType()}} to 
 fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)

2015-08-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-364:
---
Description: 
The problematic Avro and Thrift schemas are:
{noformat}
record AvroArrayOfArray {
  array<array<int>> int_arrays_column;
}
{noformat}
and
{noformat}
struct ThriftListOfList {
  1: list<list<i32>> intArraysColumn;
}
{noformat}
They are converted to the following Parquet schemas by parquet-avro 1.7.0 and 
parquet-thrift 1.7.0 respectively:
{noformat}
message AvroArrayOfArray {
  required group int_arrays_column (LIST) {
repeated group array (LIST) {
  repeated int32 array;
}
  }
}
{noformat}
and
{noformat}
message ParquetSchema {
  required group intListsColumn (LIST) {
repeated group intListsColumn_tuple (LIST) {
  repeated int32 intListsColumn_tuple_tuple;
}
  }
}
{noformat}
{{AvroIndexedRecordConverter}} cannot decode such records correctly. The reason 
is that the 2nd level repeated group {{array}} doesn't pass 
{{AvroIndexedRecordConverter.isElementType()}} check. We should check for field 
name "array" and field name suffix "_thrift" in {{isElementType()}} to fix this 
issue.

  was:
The following Avro schema
{noformat}
record AvroNonNullableArrays {
  array<array<int>> int_arrays_column;
}
{noformat}
is translated into the following Parquet schema by parquet-avro 1.7.0:
{noformat}
message root {
  required group int_arrays_column (LIST) {
repeated group array (LIST) {
  repeated int32 array;
}
  }
}
{noformat}
{{AvroIndexedRecordConverter}} cannot decode such records correctly. The reason 
is that the 2nd level repeated group {{array}} doesn't pass 
{{AvroIndexedRecordConverter.isElementType()}} check. We probably should check 
for field name "array" in {{isElementType()}} to fix this issue.


 Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. 
 array<array<int>>)
 ---

 Key: PARQUET-364
 URL: https://issues.apache.org/jira/browse/PARQUET-364
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
Reporter: Cheng Lian

 The problematic Avro and Thrift schemas are:
 {noformat}
 record AvroArrayOfArray {
   array<array<int>> int_arrays_column;
 }
 {noformat}
 and
 {noformat}
 struct ThriftListOfList {
   1: list<list<i32>> intArraysColumn;
 }
 {noformat}
 They are converted to the following Parquet schemas by parquet-avro 1.7.0 and 
 parquet-thrift 1.7.0 respectively:
 {noformat}
 message AvroArrayOfArray {
   required group int_arrays_column (LIST) {
 repeated group array (LIST) {
   repeated int32 array;
 }
   }
 }
 {noformat}
 and
 {noformat}
 message ParquetSchema {
   required group intListsColumn (LIST) {
 repeated group intListsColumn_tuple (LIST) {
   repeated int32 intListsColumn_tuple_tuple;
 }
   }
 }
 {noformat}
 {{AvroIndexedRecordConverter}} cannot decode such records correctly. The 
 reason is that the 2nd level repeated group {{array}} doesn't pass 
 {{AvroIndexedRecordConverter.isElementType()}} check. We should check for 
 field name "array" and field name suffix "_thrift" in {{isElementType()}} to 
 fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)

2015-08-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-364:
---
Summary: Parque-avro cannot decode Avro/Thrift array of primitive array 
(e.g. array<array<int>>)  (was: Parque-avro cannot decode Avro array of 
primitive array (e.g. array<array<int>>))

 Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. 
 array<array<int>>)
 ---

 Key: PARQUET-364
 URL: https://issues.apache.org/jira/browse/PARQUET-364
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
Reporter: Cheng Lian

 The following Avro schema
 {noformat}
 record AvroNonNullableArrays {
   array<array<int>> int_arrays_column;
 }
 {noformat}
 is translated into the following Parquet schema by parquet-avro 1.7.0:
 {noformat}
 message root {
   required group int_arrays_column (LIST) {
 repeated group array (LIST) {
   repeated int32 array;
 }
   }
 }
 {noformat}
 {{AvroIndexedRecordConverter}} cannot decode such records correctly. The 
 reason is that the 2nd level repeated group {{array}} doesn't pass 
 {{AvroIndexedRecordConverter.isElementType()}} check. We probably should 
 check for field name "array" in {{isElementType()}} to fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro array of primitive array (e.g. array<array<int>>)

2015-08-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706938#comment-14706938
 ] 

Cheng Lian commented on PARQUET-364:


Verified that parquet-avro doesn't correctly decode Parquet records generated 
by parquet-thrift with Thrift type {{list<list<i32>>}} either.

 Parque-avro cannot decode Avro array of primitive array (e.g. 
 array<array<int>>)
 

 Key: PARQUET-364
 URL: https://issues.apache.org/jira/browse/PARQUET-364
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
Reporter: Cheng Lian

 The following Avro schema
 {noformat}
 record AvroNonNullableArrays {
   array<array<int>> int_arrays_column;
 }
 {noformat}
 is translated into the following Parquet schema by parquet-avro 1.7.0:
 {noformat}
 message root {
   required group int_arrays_column (LIST) {
 repeated group array (LIST) {
   repeated int32 array;
 }
   }
 }
 {noformat}
 {{AvroIndexedRecordConverter}} cannot decode such records correctly. The 
 reason is that the 2nd level repeated group {{array}} doesn't pass 
 {{AvroIndexedRecordConverter.isElementType()}} check. We probably should 
 check for field name "array" in {{isElementType()}} to fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)

2015-08-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14707150#comment-14707150
 ] 

Cheng Lian commented on PARQUET-364:


[~rdblue] The suggested fix has been verified by [Spark PR 
#8361|https://github.com/apache/spark/pull/8361/files]. I'd like to deliver a 
PR for parquet-mr, but hit some local build issues. Please feel free to assign 
this issue to others.

 Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. 
 array<array<int>>)
 ---

 Key: PARQUET-364
 URL: https://issues.apache.org/jira/browse/PARQUET-364
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
Reporter: Cheng Lian
 Attachments: bad-avro.parquet, bad-thrift.parquet


 The problematic Avro and Thrift schemas are:
 {noformat}
 record AvroArrayOfArray {
   array<array<int>> int_arrays_column;
 }
 {noformat}
 and
 {noformat}
 struct ThriftListOfList {
   1: list<list<i32>> intArraysColumn;
 }
 {noformat}
 They are converted to the following structurally equivalent Parquet schemas 
 by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively:
 {noformat}
 message AvroArrayOfArray {
   required group int_arrays_column (LIST) {
 repeated group array (LIST) {
   repeated int32 array;
 }
   }
 }
 {noformat}
 and
 {noformat}
 message ParquetSchema {
   required group intListsColumn (LIST) {
 repeated group intListsColumn_tuple (LIST) {
   repeated int32 intListsColumn_tuple_tuple;
 }
   }
 }
 {noformat}
 {{AvroIndexedRecordConverter}} cannot decode such records correctly. The 
 reason is that the 2nd level repeated group {{array}} doesn't pass 
 {{AvroIndexedRecordConverter.isElementType()}} check. We should check for 
 field name "array" and field name suffix "_thrift" in {{isElementType()}} to 
 fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-363) Cannot construct empty MessageType for ReadContext.requestedSchema

2015-08-21 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-363:
--

 Summary: Cannot construct empty MessageType for 
ReadContext.requestedSchema
 Key: PARQUET-363
 URL: https://issues.apache.org/jira/browse/PARQUET-363
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.0, 1.8.1
Reporter: Cheng Lian


In parquet-mr 1.8.1, constructing empty {{GroupType}} (and thus 
{{MessageType}}) is not allowed anymore (see PARQUET-278). This change makes 
sense in most cases since Parquet doesn't support empty groups. However, there 
is one use case where an empty {{MessageType}} is valid, namely passing an 
empty {{MessageType}} as the {{requestedSchema}} constructor argument of 
{{ReadContext}} when counting rows in a Parquet file. The reason why it works 
is that Parquet can retrieve the row count from block metadata without 
materializing any columns. Take the following PySpark shell snippet 
([1.5-SNAPSHOT|https://github.com/apache/spark/commit/010b03ed52f35fd4d426d522f8a9927ddc579209],
 which uses parquet-mr 1.7.0) as an example:
{noformat}
>>> path = 'file:///tmp/foo'
>>> # Writes 10 integers into a Parquet file
>>> sqlContext.range(10).coalesce(1).write.mode('overwrite').parquet(path)
>>> sqlContext.read.parquet(path).count()

10
{noformat}
Parquet related log lines:
{noformat}
15/08/21 12:32:04 INFO CatalystReadSupport: Going to read the following fields 
from the Parquet file:

Parquet form:
message root {
}


Catalyst form:
StructType()

15/08/21 12:32:04 INFO InternalParquetRecordReader: RecordReader initialized 
will read a total of 10 records.
15/08/21 12:32:04 INFO InternalParquetRecordReader: at row 0. reading next block
15/08/21 12:32:04 INFO InternalParquetRecordReader: block read in memory in 0 
ms. row count = 10
{noformat}
We can see that Spark SQL passes no requested columns to the underlying Parquet 
reader. What happens here is that:

# Spark SQL creates a {{CatalystRowConverter}} with zero converters (and thus 
only generates empty rows).
# {{InternalParquetRecordReader}} first obtains the row count from block 
metadata 
([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L184-L186]).
# {{MessageColumnIO}} returns an {{EmptyRecordReader}} for reading the 
Parquet file 
([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99]).
# {{InternalParquetRecordReader.nextKeyValue()}} is invoked _n_ times, where 
_n_ equals the row count. Each time, it invokes the converter created by 
Spark SQL and produces an empty Spark SQL row object.

This issue is also the cause of HIVE-11611: when upgrading to Parquet 
1.8.1, Hive worked around this issue by using {{tableSchema}} as 
{{requestedSchema}} when no columns are requested 
([here|https://github.com/apache/hive/commit/3e68cdc9962cacab59ee891fcca6a736ad10d37d#diff-cc764a8828c4acc2a27ba717610c3f0bR233]).
 IMO this introduces a performance regression in cases like counting, because 
now we need to materialize all columns just to count rows.
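
For reference, a hedged sketch of how the row count can be obtained from footer 
metadata alone, without materializing any column (assuming parquet-mr's 
{{ParquetFileReader.readFooter}} API; the file path is illustrative):
{code}
// Hedged sketch; assumes ParquetFileReader.readFooter(Configuration, Path).
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// The path below is illustrative (some part file under /tmp/foo).
val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("file:///tmp/foo/part-r-00000.gz.parquet"))
val rowCount = footer.getBlocks.asScala.map(_.getRowCount).sum
// rowCount == 10 for the file written above, yet no column data is read at all.
{code}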



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-173) StatisticsFilter doesn't handle And properly

2015-08-13 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-173:
---
Description: 
I guess it's [a pretty straightforward 
mistake|https://github.com/apache/parquet-mr/blob/4bf9be34a87b51d07e0b0c9e74831bbcdbce0f74/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L225-L237]
 :)
{code}
  @Override
  public Boolean visit(And and) {
    return and.getLeft().accept(this) && and.getRight().accept(this);
  }

  @Override
  public Boolean visit(Or or) {
    // seems unintuitive to put an && not an || here
// but we can only drop a chunk of records if we know that
// both the left and right predicates agree that no matter what
// we don't need this chunk.
    return or.getLeft().accept(this) && or.getRight().accept(this);
  }
{code}
The consequence is that filter predicates like {{a > 10 && a < 20}} can never 
drop any row groups.

  was:
I guess it's [a pretty straightforward 
mistake|https://github.com/apache/incubator-parquet-mr/blob/4bf9be34a87b51d07e0b0c9e74831bbcdbce0f74/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L225-L237]
 :)
{code}
  @Override
  public Boolean visit(And and) {
    return and.getLeft().accept(this) && and.getRight().accept(this);
  }

  @Override
  public Boolean visit(Or or) {
    // seems unintuitive to put an && not an || here
// but we can only drop a chunk of records if we know that
// both the left and right predicates agree that no matter what
// we don't need this chunk.
    return or.getLeft().accept(this) && or.getRight().accept(this);
  }
{code}
The consequence is that filter predicates like {{a > 10 && a < 20}} can never 
drop any row groups.


 StatisticsFilter doesn't handle And properly
 

 Key: PARQUET-173
 URL: https://issues.apache.org/jira/browse/PARQUET-173
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.6.0


 I guess it's [a pretty straightforward 
 mistake|https://github.com/apache/parquet-mr/blob/4bf9be34a87b51d07e0b0c9e74831bbcdbce0f74/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L225-L237]
  :)
 {code}
   @Override
   public Boolean visit(And and) {
  return and.getLeft().accept(this) && and.getRight().accept(this);
   }
   @Override
   public Boolean visit(Or or) {
  // seems unintuitive to put an && not an || here
 // but we can only drop a chunk of records if we know that
 // both the left and right predicates agree that no matter what
 // we don't need this chunk.
  return or.getLeft().accept(this) && or.getRight().accept(this);
   }
 {code}
 The consequence is that filter predicates like {{a > 10 && a < 20}} can never 
 drop any row groups.
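
For clarity, a small hedged sketch of the corrected drop semantics implied by 
the fix (boolean-level illustration only, not the actual {{StatisticsFilter}} 
source):
{code}
// Illustration of canDrop semantics; not the real StatisticsFilter code.
// A visit(...) result of true means statistics prove the row group cannot match.
def canDropAnd(dropLeft: Boolean, dropRight: Boolean): Boolean =
  dropLeft || dropRight  // an AND cannot match if either side cannot match
def canDropOr(dropLeft: Boolean, dropRight: Boolean): Boolean =
  dropLeft && dropRight  // an OR needs both sides to be impossible before dropping
{code}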



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-136) NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null

2015-08-13 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-136:
---
Description: 
For a string or a binary column, if all values in a single column chunk are 
null, so are the min & max values in the column chunk statistics. However, while 
checking the statistics for column chunk pruning, a null check is missing, which 
causes an NPE. The corresponding code can be found 
[here|https://github.com/apache/parquet-mr/blob/251a495d2a72de7e892ade7f64980f51f2fcc0dd/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L97-L100].

This issue can be steadily reproduced with the following Spark shell snippet 
against Spark 1.2.0-SNAPSHOT 
([013089794d|https://github.com/apache/spark/tree/013089794ddfffbae8b913b72c1fa6375774207a]):
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class StringCol(value: String)

sc.parallelize(StringCol(null) :: Nil, 1).saveAsParquetFile("/tmp/empty.parquet")
parquetFile("/tmp/empty.parquet").registerTempTable("null_table")

sql("SET spark.sql.parquet.filterPushdown=true")
sql("SELECT * FROM null_table WHERE value = 'foo'").collect()
{code}
Exception thrown:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.NullPointerException
at 
parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206)
at 
parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162)
at 
parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100)
at 
parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
at 
parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
at 
parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
at 
parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
at 
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
at 
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:135)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}
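
A minimal sketch of the kind of guard that avoids the NPE (the accessor names 
are illustrative, not a quote of the actual parquet-mr fix): skip the min/max 
comparison when the column chunk statistics carry no non-null values.
{code}
// Illustrative only: bail out before comparing when min/max were never set
// because every value in the column chunk is null.
if (stats.genericGetMin() == null || stats.genericGetMax() == null) {
  // Conservatively keep the chunk instead of dereferencing null statistics.
  return false;
}
// ... existing min/max comparison follows ...
{code}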

  was:
For a string or a binary column, if all values in a single column trunk are 
null, so do the min & max values in the column trunk statistics. However, while 
checking the statistics for column trunk pruning, a null check is missing, and 
causes NPE. Corresponding code can be found 
[here|https://github.com/apache/incubator-parquet-mr/blob/251a495d2a72de7e892ade7f64980f51f2fcc0dd/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L97-L100].

This issue can be steadily reproduced with the following Spark shell snippet 
against Spark 1.2.0-SNAPSHOT 
([013089794d|https://github.com/apache/spark/tree/013089794ddfffbae8b913b72c1fa6375774207a]):
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class StringCol(value: String)

sc.parallelize(StringCol(null) :: Nil, 1).saveAsParquetFile("/tmp/empty.parquet")
parquetFile("/tmp/empty.parquet").registerTempTable("null_table")

sql("SET spark.sql.parquet.filterPushdown=true")
sql("SELECT * FROM null_table WHERE value = 'foo'").collect()
{code}
Exception thrown:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 

[jira] [Commented] (PARQUET-70) PARQUET #36: Pig Schema Storage to UDFContext

2015-07-10 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-70?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622600#comment-14622600
 ] 

Cheng Lian commented on PARQUET-70:
---

Just remove the incubator- part of the URL: 
https://github.com/apache/parquet-mr/issues/36

 PARQUET #36: Pig Schema Storage to UDFContext
 -

 Key: PARQUET-70
 URL: https://issues.apache.org/jira/browse/PARQUET-70
 Project: Parquet
  Issue Type: Bug
Reporter: Daniel Weeks
Priority: Critical
 Fix For: 1.6.0


 https://github.com/apache/incubator-parquet-mr/pull/36
 The ParquetLoader was not storing the pig schema into the udfcontext for the 
 full load case which causes a schema reload on the task side, erases the 
 requested schema, and causes problems with column index access.
 This fix stores the pig schema to both the udfcontext (for task side init) 
 and jobcontext (for TupleReadSupport) along with other properties that should 
 be set in the loader context (required field list and column index access 
 toggle).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL

2015-06-10 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580765#comment-14580765
 ] 

Cheng Lian edited comment on PARQUET-222 at 6/10/15 4:37 PM:
-

Hey [~rdblue], it seems that you are referring to use cases like writing to 
Hive dynamic partitions (where a single task may need to write multiple Parquet 
files according to partition column values)?  I believe the use case described 
in the JIRA description is different.  Unlike Hive or Pig, which use process 
level parallelism, Spark uses thread level parallelism (tasks are executed in a 
thread pool), and currently there's no way to pin a Spark task to a specific 
process.  So even if each task is guaranteed to write at most one Parquet file, 
it's still possible for a single executor process to write multiple Parquet 
files at some point.  So in the scope of Spark, there currently isn't a very 
good mechanism to fix this problem.  What I suggested was essentially to shrink 
the partition number so that on average an executor writes only a single file 
(I made a mistake in my previous comment and said "at most one Parquet file"; 
it should be "on average").

In case of dynamic partitioning, we do plan to re-partition the data according 
to partition column values before writing it, to reduce the number of parallel 
writers.  Another possible approach is to sort the data within each task before 
writing, so that only a single writer is active for a task at any point in 
time.
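
A rough illustration of the sorted-write idea, using the Dataset API of later 
Spark releases rather than anything available at the time of this comment; the 
partition column name {{p}} and the paths are made up.
{code}
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SortedDynamicPartitionWrite {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("sorted-write").getOrCreate();
    Dataset<Row> df = spark.read().parquet("/tmp/input");

    df.repartition(col("p"))            // group each partition value into one task
      .sortWithinPartitions(col("p"))   // so a task finishes one value before starting the next
      .write()
      .partitionBy("p")                 // dynamic partitioning on column p
      .parquet("/tmp/output");
  }
}
{code}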


was (Author: lian cheng):
Hey [~rdblue], it seems that you are referring to use cases like writing to 
Hive dynamic partitions (where a single task may need to write to write 
multiple Parquet files according to Partition column values)?  I believe the 
use case described in the JIRA description is different.  Unlike Hive or Pig, 
which use process level parallelism, Spark uses thread level parallelism (tasks 
are executed in thread pool).  And currently, there's no way to pin a Spark 
task to a specific process.  So even each task is guaranteed to write at most 
one Parquet file, it's still possible for a single executor process to write 
multiple Parquet files at some point.  So in the scope of Spark, currently 
there isn't a very good mechanism to fix this problem.  What I suggested was 
essentially to shrink partition number so that on average an executor writes 
only a single file (I made a mistake in my previous comment and said at most 
one Parquet file, it should be on average).

In case of dynamic partitioning, we do plan to re-partition the data according 
to partition column values before writing the data to reduce number of parallel 
writers.  Another possible approach was to sort the data within each task 
before writing, so that only a single writer is active for a task at any point 
of time.

 parquet writer runs into OOM during writing when calling 
 DataFrame.saveAsParquetFile in Spark SQL
 -

 Key: PARQUET-222
 URL: https://issues.apache.org/jira/browse/PARQUET-222
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Chaozhong Yang
   Original Estimate: 336h
  Remaining Estimate: 336h

 In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or 
 {{SchemaRDD}}. That function calls method in parquet-mr, and sometimes it 
 will fail due to the OOM error thrown by parquet-mr. We can see the exception 
 stack trace  as follows:
 {noformat}
 WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 
 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: 
 Java heap space
 at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
 at parquet.column.values.dictionary.IntList.init(IntList.java:83)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter.init(DictionaryValuesWriter.java:85)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.init(DictionaryValuesWriter.java:549)
 at 
 parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
 at 
 parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:74)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
 at 
 parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:178)
 at 
 parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
 at 
 parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
 at 
 

[jira] [Commented] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL

2015-06-10 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580765#comment-14580765
 ] 

Cheng Lian commented on PARQUET-222:


Hey [~rdblue], it seems that you are referring to use cases like writing to 
Hive dynamic partitions (where a single task may need to write multiple Parquet 
files according to partition column values)?  I believe the use case described 
in the JIRA description is different.  Unlike Hive or Pig, which use process 
level parallelism, Spark uses thread level parallelism (tasks are executed in a 
thread pool), and currently there's no way to pin a Spark task to a specific 
process.  So even if each task is guaranteed to write at most one Parquet file, 
it's still possible for a single executor process to write multiple Parquet 
files at some point.  So in the scope of Spark, there currently isn't a very 
good mechanism to fix this problem.  What I suggested was essentially to shrink 
the partition number so that on average an executor writes only a single file 
(I made a mistake in my previous comment and said "at most one Parquet file"; 
it should be "on average").

In case of dynamic partitioning, we do plan to re-partition the data according 
to partition column values before writing it, to reduce the number of parallel 
writers.  Another possible approach is to sort the data within each task before 
writing, so that only a single writer is active for a task at any point in 
time.

 parquet writer runs into OOM during writing when calling 
 DataFrame.saveAsParquetFile in Spark SQL
 -

 Key: PARQUET-222
 URL: https://issues.apache.org/jira/browse/PARQUET-222
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Chaozhong Yang
   Original Estimate: 336h
  Remaining Estimate: 336h

 In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or 
 {{SchemaRDD}}. That function calls method in parquet-mr, and sometimes it 
 will fail due to the OOM error thrown by parquet-mr. We can see the exception 
 stack trace  as follows:
 {noformat}
 WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 
 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: 
 Java heap space
 at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
 at parquet.column.values.dictionary.IntList.init(IntList.java:83)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter.init(DictionaryValuesWriter.java:85)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.init(DictionaryValuesWriter.java:549)
 at 
 parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
 at 
 parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:74)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
 at 
 parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:178)
 at 
 parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
 at 
 parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
 at 
 parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
 at 
 parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
 at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
 at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
 at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
 at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
 at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 {noformat}
 By the way, there is another similar issue 
 https://issues.apache.org/jira/browse/PARQUET-99. But the reporter has closed 
 it 

[jira] [Comment Edited] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL

2015-06-10 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580339#comment-14580339
 ] 

Cheng Lian edited comment on PARQUET-222 at 6/10/15 4:39 PM:
-

Hey [~phatak.dev], finally got some time to try 1.3.1 and reproduced this OOM.

While trying this case with 1.4, it got stuck in the query planner, so I was 
adjusting {{\-\-driver-memory}}.  In the case of 1.3.1, by tuning 
{{\-\-executor-memory}}, I can see two kinds of exceptions.  The first one is 
exactly the same as what you saw.  In my test code, I create 26k INT columns, 
so Parquet tries to initialize 26k column writers, each allocates a default 
slab (an {{int[]}}) with 64k elements.  This takes at least {{26k * 64k * 4b = 
6.34gb}} memory.

After increasing executor memory to 10g, I saw similar exception thrown from 
{{RunLengthBitPackingHybridEncoder}}.  I guess Parquet is trying to allocate an 
RLE encoder for each column here to perform compression (not 100% sure about 
this for now).  Similarly, each encoder initializes a default slab (a 
{{byte[]}}) with at least 64k elements, and that's another {{26k * 64k * 1b = 
1.6gb}} memory.

Only have a laptop for now, so... not sure how much memory it needs to write 
such a wide table.  But essentially Parquet needs to pre-allocate some memory 
for each column to compress and buffer data, and 26k columns altogether just 
eat too much memory here.  That's why even though your table has only a single 
row, it still causes OOM.
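
A back-of-the-envelope restatement of the figures above (the 64k initial slab 
per column is the assumption stated in the comment, not a value read from the 
code):
{code}
// Rough arithmetic only.
long columns = 26000L;
long slabElements = 64L * 1024;
long intSlabBytes  = columns * slabElements * 4;  // int[] slabs:  ~6.8e9 bytes, ~6.3 GiB
long byteSlabBytes = columns * slabElements * 1;  // byte[] slabs: ~1.7e9 bytes, ~1.6 GiB
{code}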


was (Author: lian cheng):
Hey [~phatak.dev], finally got some time to try 1.3.1 and reproduced this OOM.

While trying this case with 1.4, it got stuck in the query planner, so I was 
adjusting {{--driver-memory}}.  In the case of 1.3.1, by tuning 
{{--executor-memory}}, I can see two kinds of exceptions.  The first one is 
exactly the same as what you saw.  In my test code, I create 26k INT columns, 
so Parquet tries to initialize 26k column writers, each allocates a default 
slab (an {{int[]}}) with 64k elements.  This takes at least {{26k * 64k * 4b = 
6.34gb}} memory.

After increasing executor memory to 10g, I saw similar exception thrown from 
{{RunLengthBitPackingHybridEncoder}}.  I guess Parquet is trying to allocate an 
RLE encoder for each column here to perform compression (not 100% sure about 
this for now).  Similarly, each encoder initializes a default slab (a 
{{byte[]}}) with at least 64k elements, and that's another {{26k * 64k * 1b = 
1.6gb}} memory.

Only have a laptop for now, so... not sure how much memory it needs to write 
such wide a table.  But essentially Parquet needs to pre-allocate some memory 
for each column to compress and buffer data.  And 26k columns altogether just 
eats too much memory here.  That's why even your table has only a single row, 
it still causes OOM.

 parquet writer runs into OOM during writing when calling 
 DataFrame.saveAsParquetFile in Spark SQL
 -

 Key: PARQUET-222
 URL: https://issues.apache.org/jira/browse/PARQUET-222
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Chaozhong Yang
   Original Estimate: 336h
  Remaining Estimate: 336h

 In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or 
 {{SchemaRDD}}. That function calls method in parquet-mr, and sometimes it 
 will fail due to the OOM error thrown by parquet-mr. We can see the exception 
 stack trace  as follows:
 {noformat}
 WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 
 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: 
 Java heap space
 at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
 at parquet.column.values.dictionary.IntList.init(IntList.java:83)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter.init(DictionaryValuesWriter.java:85)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.init(DictionaryValuesWriter.java:549)
 at 
 parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
 at 
 parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:74)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
 at 
 parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:178)
 at 
 parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
 at 
 parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
 at 
 

[jira] [Created] (PARQUET-305) Logger instantiated for package org.apache.parquet may be GC-ed

2015-06-09 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-305:
--

 Summary: Logger instantiated for package org.apache.parquet may be 
GC-ed
 Key: PARQUET-305
 URL: https://issues.apache.org/jira/browse/PARQUET-305
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.7.0
Reporter: Cheng Lian
Priority: Minor


This ticket is derived from SPARK-8122.

According to the Javadoc of 
[{{java.util.logging.Logger}}|https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html]:
{quote}
It is important to note that the Logger returned by one of the getLogger 
factory methods may be garbage collected at any time if a strong reference to 
the Logger is not kept.
{quote}
However, the only reference to [the {{Logger}} created for package 
{{org.apache.parquet}}|https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-common/src/main/java/org/apache/parquet/Log.java#L58]
 goes out of scope once the static initialization block finishes, and thus the 
Logger may be garbage collected at any time.

More details can be found in [this 
comment|https://issues.apache.org/jira/browse/SPARK-8122?focusedCommentId=14574419&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14574419].
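
A minimal sketch of the usual remedy (class and field names illustrative): keep 
a strong reference to the {{Logger}} in a static field so it lives at least as 
long as the class that uses it.
{code}
import java.util.logging.Logger;

public final class Log {
  // Holding the Logger in a static final field pins it for the lifetime of the
  // class, so the weak reference kept by LogManager is never the only one.
  private static final Logger LOGGER = Logger.getLogger("org.apache.parquet");
}
{code}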



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL

2015-06-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577257#comment-14577257
 ] 

Cheng Lian edited comment on PARQUET-222 at 6/8/15 2:33 PM:


Hey [~phatak.dev], thanks for the information.  I tried to reproduce this issue 
with the following Spark shell snippet:
{code}
import sqlContext._
import sqlContext.implicits._

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val n = 26000
val schema = StructType((1 to n).map(i => StructField(s"f$i", IntegerType, nullable = false)))
val bigRow = Row((1 to n): _*)
val df = createDataFrame(sc.parallelize(bigRow :: Nil), schema)
df.coalesce(1).write.mode("overwrite").format("orc").save("file:///tmp/foo")
{code}
I was using Spark 1.4.0-SNAPSHOT.  Command line used to start the shell is:
{noformat}
./bin/spark-shell --driver-memory 4g
{noformat}
I didn't get an OOM, but it does hang, seemingly forever.  After profiling it 
with YJP, it turns out that this super wide table is somehow stressing out the 
query planner by making Spark SQL allocate a large number of small objects.  
Haven't tried 1.3.1 yet.  Will do when I get time.

I found that you had once posted this issue to the Spark user mailing list.  
Would you mind providing a full stack trace of the OOM error?  Maybe it's more 
of a Spark SQL issue than a Parquet issue.


was (Author: lian cheng):
Hey [~phatak.dev], thanks for the information.  I tried to reproduce this issue 
with the following Spark shell snippet:
{code}
import sqlContext._
import sqlContext.implicits._

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val n = 26000
val schema = StructType((1 to n).map(i => StructField(s"f$i", IntegerType, nullable = false)))
val bigRow = Row((1 to n): _*)
val df = createDataFrame(sc.parallelize(bigRow :: Nil), schema)
df.coalesce(1).write.mode("overwrite").format("orc").save("file:///tmp/foo")
{code}
I was using Spark 1.4.0-SNAPSHOT.  Command line used to start the shell is:
{noformat}
./bin/spark-shell --driver-memory 4g
{noformat}
I didn't get an OOM, but it does hang like forever.  After profiling it with 
YJP, it turns out that this super wide table is somehow stressing out query 
planner by making Spark SQL allocates a large number of small objects.  Haven't 
tied 1.3.1 yet. 

I found that you've posted this issue to Spark user mailing list.  Would you 
mind to provide a full stack trace of the OOM error?  Maybe it's more like a 
Spark SQL issue rather than a Parquet issue.

 parquet writer runs into OOM during writing when calling 
 DataFrame.saveAsParquetFile in Spark SQL
 -

 Key: PARQUET-222
 URL: https://issues.apache.org/jira/browse/PARQUET-222
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Chaozhong Yang
   Original Estimate: 336h
  Remaining Estimate: 336h

 In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or 
 {{SchemaRDD}}. That function calls method in parquet-mr, and sometimes it 
 will fail due to the OOM error thrown by parquet-mr. We can see the exception 
 stack trace  as follows:
 {noformat}
 WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 
 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: 
 Java heap space
 at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
 at parquet.column.values.dictionary.IntList.init(IntList.java:83)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter.init(DictionaryValuesWriter.java:85)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.init(DictionaryValuesWriter.java:549)
 at 
 parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
 at 
 parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:74)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
 at 
 parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:178)
 at 
 parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
 at 
 parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
 at 
 parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
 at 
 parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
 at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
 at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
  

[jira] [Commented] (PARQUET-294) NPE in ParquetInputFormat.getSplits when no .parquet files exist

2015-06-07 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576258#comment-14576258
 ] 

Cheng Lian commented on PARQUET-294:


Is this related to PARQUET-151?

 NPE in ParquetInputFormat.getSplits when no .parquet files exist
 

 Key: PARQUET-294
 URL: https://issues.apache.org/jira/browse/PARQUET-294
 Project: Parquet
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Paul Nepywoda

 {code}
 JavaSparkContext context = ...
 JavaRDD<Row> rdd1 = context.parallelize(ImmutableList.<Row>of());
 SQLContext sqlContext = new SQLContext(context);
 StructType schema = 
 DataTypes.createStructType(ImmutableList.of(DataTypes.createStructField("col1", DataTypes.StringType, true)));
 DataFrame df = sqlContext.createDataFrame(rdd1, schema);
 String url = "file:///tmp/emptyRDD";
 df.saveAsParquetFile(url);
 Configuration configuration = SparkHadoopUtil.get().newConfiguration(context.getConf());
 JobConf jobConf = new JobConf(configuration);
 ParquetInputFormat.setReadSupportClass(jobConf, RowReadSupport.class);
 FileInputFormat.setInputPaths(jobConf, url);
 JavaRDD<Row> rdd2 = context.newAPIHadoopRDD(
 jobConf, ParquetInputFormat.class, Void.class, Row.class).values();
 rdd2.count();
 df = sqlContext.createDataFrame(rdd2, schema);
 url = "file:///tmp/emptyRDD2";
 df.saveAsParquetFile(url);
 FileInputFormat.setInputPaths(jobConf, url);
 JavaRDD<Row> rdd3 = context.newAPIHadoopRDD(
 jobConf, ParquetInputFormat.class, Void.class, Row.class).values();
 rdd3.count();
 {code}
 The NPE happens here:
 {code}
 java.lang.NullPointerException
   at 
 parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:263)
   at 
 parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245)
   at 
 org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
 {code}
 This stems from ParquetFileWriter.getGlobalMetaData returning null when there 
 are no footers to read.
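
A minimal sketch of the kind of guard that would avoid this NPE (shapes are 
illustrative, not a quote of the actual fix): treat a null global metadata 
object as "no input files, hence no splits".
{code}
// Illustrative only: when no footers were found there is nothing to split.
GlobalMetaData globalMetaData = ParquetFileWriter.getGlobalMetaData(footers);
if (globalMetaData == null) {
  return new ArrayList<ParquetInputSplit>();  // empty input, return no splits
}
{code}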



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL

2015-06-06 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575783#comment-14575783
 ] 

Cheng Lian commented on PARQUET-222:


There are several ways to alleviate this.

Firstly, for those DataFrames whose data sizes are small (e.g., the single-row 
case [~phatak.dev] mentioned), you may try {{df.coalesce(1)}} to reduce the 
partition number to 1.  In this way, only a single file will be written.  In 
most cases, the default parallelism equals the number of cores.  For example, 
if you are running a Spark application with a single executor on a single 
8-core node, that executor process needs to write 8 Parquet files even if 
there's only a single row.

Secondly, when you are writing DataFrames with a large volume of data, you may 
try to adjust the DataFrame partition number (via {{df.repartition(n)}} and/or 
{{df.coalesce(n)}}) and the executor number (via the {{--num-executors}} flag 
of {{spark-submit}}) to ensure the former is less than or equal to the latter, 
so that each executor process only opens and writes at most one Parquet file.

And of course, the heap size of a single executor should be large enough to 
allow Parquet to write at least a single file.
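
A minimal sketch of the first two suggestions with the Spark 1.3-era API (the 
paths and the executor count are made up):
{code}
// Assumes an existing JavaSparkContext named context and 8 executors
// (matching --num-executors 8).
SQLContext sqlContext = new SQLContext(context);
DataFrame df = sqlContext.parquetFile("file:///tmp/input.parquet");

// Small data: a single output file, hence a single active writer.
df.coalesce(1).saveAsParquetFile("file:///tmp/small-output.parquet");

// Larger data: keep the partition count at or below the executor count so each
// executor opens at most one Parquet writer at a time.
df.repartition(8).saveAsParquetFile("file:///tmp/large-output.parquet");
{code}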

 parquet writer runs into OOM during writing when calling 
 DataFrame.saveAsParquetFile in Spark SQL
 -

 Key: PARQUET-222
 URL: https://issues.apache.org/jira/browse/PARQUET-222
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Chaozhong Yang
   Original Estimate: 336h
  Remaining Estimate: 336h

 In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or 
 {{SchemaRDD}}. That function calls method in parquet-mr, and sometimes it 
 will fail due to the OOM error thrown by parquet-mr. We can see the exception 
 stack trace  as follows:
 {noformat}
 WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 
 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: 
 Java heap space
 at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
 at parquet.column.values.dictionary.IntList.init(IntList.java:83)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter.init(DictionaryValuesWriter.java:85)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.init(DictionaryValuesWriter.java:549)
 at 
 parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
 at 
 parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:74)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
 at 
 parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:178)
 at 
 parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
 at 
 parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
 at 
 parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
 at 
 parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
 at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
 at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
 at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
 at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
 at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 {noformat}
 By the way, there is another similar issue 
 https://issues.apache.org/jira/browse/PARQUET-99. But the reporter has closed 
 it and mark it as resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL

2015-06-06 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-222:
---
Description: 
In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or 
{{SchemaRDD}}. That function calls methods in parquet-mr, and sometimes it will 
fail due to an OOM error thrown by parquet-mr. We can see the exception stack 
trace as follows:

{noformat}
WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 
0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: 
Java heap space
at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
at parquet.column.values.dictionary.IntList.init(IntList.java:83)
at 
parquet.column.values.dictionary.DictionaryValuesWriter.init(DictionaryValuesWriter.java:85)
at 
parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.init(DictionaryValuesWriter.java:549)
at 
parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
at parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:74)
at 
parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
at 
parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
at 
parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:178)
at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
at 
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
at 
parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
at 
parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
at 
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at 
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
{noformat}

By the way, there is another similar issue, 
https://issues.apache.org/jira/browse/PARQUET-99, but the reporter has closed 
it and marked it as resolved.

  was:
In Spark SQL, there is a function `saveAsParquetFile` in DataFrame or 
SchemaRDD. That function calls method in parquet-mr, and sometimes it will fail 
due to the OOM error thrown by parquet-mr. We can see the exception stack trace 
 as follows:

WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 
0.2 in stag
e 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap 
space
at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
at parquet.column.values.dictionary.IntList.init(IntList.java:83)
at 
parquet.column.values.dictionary.DictionaryValuesWriter.init(DictionaryValue
sWriter.java:85)
at 
parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionary
ValuesWriter.init(DictionaryValuesWriter.java:549)
at 
parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
at parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:74)
at 
parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.jav
a:68)
at 
parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.
java:56)
at 
parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnI
O.java:178)
at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
at 
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWrit
er.java:108)
at 
parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.
java:94)
at 
parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
at 
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:28
2)
at 
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:25
2)
at 

[jira] [Updated] (PARQUET-293) ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame

2015-05-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-293:
---
Description: 
I get "scala.ScalaReflectionException: <none> is not a term" when I try to 
convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF.

Has anyone else encountered this problem? 


I'm using Spark 1.3.1, Scala 2.10.4 and scrooge-sbt-plugin 3.16.3

Here is my thrift IDL:
{code}
namespace scala com.junk
namespace java com.junk

struct Junk {
10: i64 junkID,
20: string junkString
}
{code}
from a spark-shell: 
{code}
val junks = List(Junk(123L, "junk1"), Junk(567L, "junk2"), Junk(789L, "junk3"))
val junksRDD = sc.parallelize(junks)
junksRDD.toDF
{code}
Exception thrown:
{noformat}
scala.ScalaReflectionException: <none> is not a term
at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:259)
at 
scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:73)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:148)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:316)
at 
org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:254)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:27)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:32)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:34)
at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:36)
at $iwC$$iwC$$iwC$$iwC.init(console:38)
at $iwC$$iwC$$iwC.init(console:40)
at $iwC$$iwC.init(console:42)
at $iwC.init(console:44)
at init(console:46)
at .init(console:50)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}

  was:
I get scala.ScalaReflectionException: none is 

[jira] [Commented] (PARQUET-293) ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame

2015-05-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564332#comment-14564332
 ] 

Cheng Lian commented on PARQUET-293:


Hm, it's possible. But the context is a little too vague to diagnose.

[~zzztimbo] Could you please provide more details? For example:

- Spark version
- Full exception stack trace
- How does your Spark program interact with Parquet? (I guess you were trying 
to save the Scrooge RDD as a Parquet file?)
- It would be great if you could provide a snippet that reproduces this issue.

 ScalaReflectionException when trying to convert an RDD of Scrooge to a 
 DataFrame
 

 Key: PARQUET-293
 URL: https://issues.apache.org/jira/browse/PARQUET-293
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Affects Versions: 1.6.0
Reporter: Tim Chan

 I get "scala.ScalaReflectionException: <none> is not a term" when I try to 
 convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF
 Has anyone else encountered this problem? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)