[jira] [Assigned] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs
[ https://issues.apache.org/jira/browse/PARQUET-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian reassigned PARQUET-1102:
-----------------------------------

    Assignee: Cheng Lian

> Travis CI builds are failing for parquet-format PRs
> ---------------------------------------------------
>
>                 Key: PARQUET-1102
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1102
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Blocker
>             Fix For: format-2.3.2
>
> Travis CI builds are failing for parquet-format PRs, probably due to the
> migration from Ubuntu precise to trusty on Sep 1, according to [this official
> Travis blog post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Resolved] (PARQUET-1091) Wrong and broken links in README
[ https://issues.apache.org/jira/browse/PARQUET-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian resolved PARQUET-1091.
---------------------------------
       Resolution: Fixed
    Fix Version/s: format-2.3.2

Issue resolved by pull request 65
[https://github.com/apache/parquet-format/pull/65]

> Wrong and broken links in README
> --------------------------------
>
>                 Key: PARQUET-1091
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1091
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Minor
>             Fix For: format-2.3.2
>
> Multiple links in README.md still point to the old {{Parquet/parquet-format}}
> repository, which is now removed.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Resolved] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs
[ https://issues.apache.org/jira/browse/PARQUET-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian resolved PARQUET-1102.
---------------------------------
       Resolution: Fixed
    Fix Version/s: format-2.3.2

Issue resolved by pull request 66
[https://github.com/apache/parquet-format/pull/66]

> Travis CI builds are failing for parquet-format PRs
> ---------------------------------------------------
>
>                 Key: PARQUET-1102
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1102
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Cheng Lian
>            Priority: Blocker
>             Fix For: format-2.3.2
>
> Travis CI builds are failing for parquet-format PRs, probably due to the
> migration from Ubuntu precise to trusty on Sep 1, according to [this official
> Travis blog post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Updated] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs
[ https://issues.apache.org/jira/browse/PARQUET-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated PARQUET-1102:
--------------------------------
    Priority: Blocker  (was: Major)

> Travis CI builds are failing for parquet-format PRs
> ---------------------------------------------------
>
>                 Key: PARQUET-1102
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1102
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Cheng Lian
>            Priority: Blocker
>
> Travis CI builds are failing for parquet-format PRs, probably due to the
> migration from Ubuntu precise to trusty on Sep 1, according to [this official
> Travis blog post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Created] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs
Cheng Lian created PARQUET-1102:
-----------------------------------

             Summary: Travis CI builds are failing for parquet-format PRs
                 Key: PARQUET-1102
                 URL: https://issues.apache.org/jira/browse/PARQUET-1102
             Project: Parquet
          Issue Type: Bug
          Components: parquet-format
            Reporter: Cheng Lian


Travis CI builds are failing for parquet-format PRs, probably due to the migration from Ubuntu precise to trusty on Sep 1, according to [this official Travis blog post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Created] (PARQUET-1091) Wrong and broken links in README
Cheng Lian created PARQUET-1091:
-----------------------------------

             Summary: Wrong and broken links in README
                 Key: PARQUET-1091
                 URL: https://issues.apache.org/jira/browse/PARQUET-1091
             Project: Parquet
          Issue Type: Bug
          Components: parquet-format
            Reporter: Cheng Lian
            Assignee: Cheng Lian
            Priority: Minor


Multiple links in README.md still point to the old {{Parquet/parquet-format}} repository, which is now removed.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Comment Edited] (PARQUET-980) Cannot read row group larger than 2GB
[ https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007326#comment-16007326 ]

Cheng Lian edited comment on PARQUET-980 at 5/11/17 10:46 PM:
--------------------------------------------------------------

The current write path ensures that it never writes a page larger than 2GB, but the read path may read one or more column chunks, consisting of multiple pages, into a single byte array (or {{ByteBuffer}}) that can be no larger than 2GB. We hit this issue in production because the data distribution happened to be similar to the situation mentioned in the JIRA description and produced a skewed row group containing a column chunk larger than 2GB.

I think there are two separate issues to fix:
# On the write path, the strategy that dynamically adjusts memory check intervals needs some tweaking. The assumption that sizes of adjacent records are similar can easily be broken.
# On the read path, the {{ConsecutiveChunkList.readAll()}} method should support reading data larger than 2GB, probably by using multiple buffers (a sketch follows this message). Another option is to ensure that no row group larger than 2GB can ever be written.

Thoughts?

BTW, the [parquet-python|https://github.com/jcrobak/parquet-python/] library can read this kind of malformed Parquet file successfully with [this patch|https://github.com/jcrobak/parquet-python/pull/56]. We used it to recover our data from the malformed Parquet file.


was (Author: lian cheng):
The current write path ensures that it never writes a page larger than 2GB, but the read path may read one or more column chunks, consisting of multiple pages, into a single byte array (or {{ByteBuffer}}) that can be no larger than 2GB. We hit this issue in production because the data distribution happened to be similar to the situation mentioned in the JIRA description and produced a skewed row group containing a column chunk larger than 2GB.

I think there are two separate issues to fix:
# On the write path, the strategy that dynamically adjusts memory check intervals needs some tweaking. The assumption that sizes of adjacent records are similar can easily be broken.
# On the read path, the {{ConsecutiveChunkList.readAll()}} method should support reading data larger than 2GB, probably by using multiple buffers. Another option is to ensure that no row group larger than 2GB can ever be written.

Thoughts?

BTW, the [parquet-python|https://github.com/jcrobak/parquet-python/] library can read this kind of malformed Parquet file successfully with [this patch|https://github.com/jcrobak/parquet-python/pull/56]. We used it to recover our data from the malformed Parquet file.

> Cannot read row group larger than 2GB
> -------------------------------------
>
>                 Key: PARQUET-980
>                 URL: https://issues.apache.org/jira/browse/PARQUET-980
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.8.0, 1.8.1, 1.8.2
>            Reporter: Herman van Hovell
>
> Parquet MR 1.8.2 does not support reading row groups which are larger than 2 GB.
> See: https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064
> We are seeing this when writing skewed records. This throws off the estimation
> of the memory check interval in the InternalParquetRecordWriter. The following
> Spark code illustrates this:
> {noformat}
> /**
>  * Create a data frame that will make parquet write a file with a row group larger than 2 GB.
>  * Parquet only checks the size of the row group after writing a number of records. This number
>  * is based on the average row size of the already written records. This is problematic in the
>  * following scenario:
>  * - The initial (100) records in the record group are relatively small.
>  * - The InternalParquetRecordWriter checks if it needs to write to disk (it should not), it
>  *   assumes that the remaining records have a similar size, and (greatly) increases the check
>  *   interval (usually to 1).
>  * - The remaining records are much larger than expected, making the row group larger than 2 GB
>  *   (which makes reading the row group impossible).
>  *
>  * The data frame below illustrates such a scenario. This creates a row group of approximately 4GB.
>  */
> val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
>   var i = 0
>   val random = new scala.util.Random(42)
>   val buffer = new Array[Char](75)
>   iterator.map { id =>
>     // the first 200 records have a length of 1K and the remaining 2000 have a length of 750K.
>     val numChars = if (i < 200) 1000 else 75
>     i += 1
>     // create a random array
>     var j = 0
>     while (j < numChars) {
>       // Generate a char (borrowed from scala.util.Random)
>       buffer(j) = (random.nextInt(0xD800 - 1) + 1).toChar
>       j += 1
>     }
>
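For illustration, here is the multiple-buffer idea from point 2 as a minimal, self-contained Java sketch. The class and method names are made up for this example and are not parquet-mr internals; this is one shape the fix could take, not the committed change.

{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class ChunkedReadSketch {
  // Reads `totalLength` bytes (possibly more than 2GB) into a list of bounded
  // buffers. A single byte[] cannot hold this much: a long length above
  // Integer.MAX_VALUE goes negative when narrowed to int, which matches the
  // NegativeArraySizeException reported against ConsecutiveChunkList.readAll().
  public static List<ByteBuffer> readAllChunked(InputStream in, long totalLength)
      throws IOException {
    final long maxBufferSize = 1L << 30; // 1GB per buffer, safely below Integer.MAX_VALUE
    List<ByteBuffer> buffers = new ArrayList<>();
    long remaining = totalLength;
    while (remaining > 0) {
      int size = (int) Math.min(remaining, maxBufferSize);
      byte[] chunk = new byte[size];
      int offset = 0;
      while (offset < size) {
        int read = in.read(chunk, offset, size - offset);
        if (read < 0) {
          throw new EOFException("stream ended with " + (remaining - offset) + " bytes unread");
        }
        offset += read;
      }
      buffers.add(ByteBuffer.wrap(chunk));
      remaining -= size;
    }
    return buffers;
  }
}
{code}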
[jira] [Commented] (PARQUET-980) Cannot read row group larger than 2GB
[ https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007326#comment-16007326 ]

Cheng Lian commented on PARQUET-980:
------------------------------------

The current write path ensures that it never writes a page larger than 2GB, but the read path may read one or more column chunks, consisting of multiple pages, into a single byte array (or {{ByteBuffer}}) that can be no larger than 2GB. We hit this issue in production because the data distribution happened to be similar to the situation mentioned in the JIRA description and produced a skewed row group containing a column chunk larger than 2GB.

I think there are two separate issues to fix:
# On the write path, the strategy that dynamically adjusts memory check intervals needs some tweaking. The assumption that sizes of adjacent records are similar can easily be broken.
# On the read path, the {{ConsecutiveChunkList.readAll()}} method should support reading data larger than 2GB, probably by using multiple buffers. Another option is to ensure that no row group larger than 2GB can ever be written.

Thoughts?

BTW, the [parquet-python|https://github.com/jcrobak/parquet-python/] library can read this kind of malformed Parquet file successfully with [this patch|https://github.com/jcrobak/parquet-python/pull/56]. We used it to recover our data from the malformed Parquet file.

> Cannot read row group larger than 2GB
> -------------------------------------
>
>                 Key: PARQUET-980
>                 URL: https://issues.apache.org/jira/browse/PARQUET-980
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.8.0, 1.8.1, 1.8.2
>            Reporter: Herman van Hovell
>
> Parquet MR 1.8.2 does not support reading row groups which are larger than 2 GB.
> See: https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064
> We are seeing this when writing skewed records. This throws off the estimation
> of the memory check interval in the InternalParquetRecordWriter. The following
> Spark code illustrates this:
> {noformat}
> /**
>  * Create a data frame that will make parquet write a file with a row group larger than 2 GB.
>  * Parquet only checks the size of the row group after writing a number of records. This number
>  * is based on the average row size of the already written records. This is problematic in the
>  * following scenario:
>  * - The initial (100) records in the record group are relatively small.
>  * - The InternalParquetRecordWriter checks if it needs to write to disk (it should not), it
>  *   assumes that the remaining records have a similar size, and (greatly) increases the check
>  *   interval (usually to 1).
>  * - The remaining records are much larger than expected, making the row group larger than 2 GB
>  *   (which makes reading the row group impossible).
>  *
>  * The data frame below illustrates such a scenario. This creates a row group of approximately 4GB.
>  */
> val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
>   var i = 0
>   val random = new scala.util.Random(42)
>   val buffer = new Array[Char](75)
>   iterator.map { id =>
>     // the first 200 records have a length of 1K and the remaining 2000 have a length of 750K.
>     val numChars = if (i < 200) 1000 else 75
>     i += 1
>     // create a random array
>     var j = 0
>     while (j < numChars) {
>       // Generate a char (borrowed from scala.util.Random)
>       buffer(j) = (random.nextInt(0xD800 - 1) + 1).toChar
>       j += 1
>     }
>     // create a string: the string constructor will copy the buffer.
>     new String(buffer, 0, numChars)
>   }
> }
> badDf.write.parquet("somefile")
> val corruptedDf = spark.read.parquet("somefile")
> corruptedDf.select(count(lit(1)), max(length($"value"))).show()
> {noformat}
> The latter fails with the following exception:
> {noformat}
> java.lang.NegativeArraySizeException
>   at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1064)
>   at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:698)
>   ...
> {noformat}
> This seems to be fixed by commit
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
> in parquet 1.9.x. Is there any chance that we can fix this in 1.8.x?
> This can happen when

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Updated] (PARQUET-980) Cannot read row group larger than 2GB
[ https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated PARQUET-980:
-------------------------------
    Affects Version/s: 1.8.1
                       1.8.2

> Cannot read row group larger than 2GB
> -------------------------------------
>
>                 Key: PARQUET-980
>                 URL: https://issues.apache.org/jira/browse/PARQUET-980
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.8.0, 1.8.1, 1.8.2
>            Reporter: Herman van Hovell
>
> Parquet MR 1.8.2 does not support reading row groups which are larger than 2 GB.
> See: https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064
> We are seeing this when writing skewed records. This throws off the estimation
> of the memory check interval in the InternalParquetRecordWriter. The following
> Spark code illustrates this:
> {noformat}
> /**
>  * Create a data frame that will make parquet write a file with a row group larger than 2 GB.
>  * Parquet only checks the size of the row group after writing a number of records. This number
>  * is based on the average row size of the already written records. This is problematic in the
>  * following scenario:
>  * - The initial (100) records in the record group are relatively small.
>  * - The InternalParquetRecordWriter checks if it needs to write to disk (it should not), it
>  *   assumes that the remaining records have a similar size, and (greatly) increases the check
>  *   interval (usually to 1).
>  * - The remaining records are much larger than expected, making the row group larger than 2 GB
>  *   (which makes reading the row group impossible).
>  *
>  * The data frame below illustrates such a scenario. This creates a row group of approximately 4GB.
>  */
> val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
>   var i = 0
>   val random = new scala.util.Random(42)
>   val buffer = new Array[Char](75)
>   iterator.map { id =>
>     // the first 200 records have a length of 1K and the remaining 2000 have a length of 750K.
>     val numChars = if (i < 200) 1000 else 75
>     i += 1
>     // create a random array
>     var j = 0
>     while (j < numChars) {
>       // Generate a char (borrowed from scala.util.Random)
>       buffer(j) = (random.nextInt(0xD800 - 1) + 1).toChar
>       j += 1
>     }
>     // create a string: the string constructor will copy the buffer.
>     new String(buffer, 0, numChars)
>   }
> }
> badDf.write.parquet("somefile")
> val corruptedDf = spark.read.parquet("somefile")
> corruptedDf.select(count(lit(1)), max(length($"value"))).show()
> {noformat}
> The latter fails with the following exception:
> {noformat}
> java.lang.NegativeArraySizeException
>   at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1064)
>   at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:698)
>   ...
> {noformat}
> This seems to be fixed by commit
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
> in parquet 1.9.x. Is there any chance that we can fix this in 1.8.x?
> This can happen when

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Updated] (PARQUET-893) GroupColumnIO.getFirst() doesn't check for empty groups
[ https://issues.apache.org/jira/browse/PARQUET-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated PARQUET-893:
-------------------------------
    Description: 
The following Spark snippet reproduces this issue with Spark 2.1 (with parquet-mr 1.8.1) and Spark 2.2-SNAPSHOT (with parquet-mr 1.8.2):

{code}
import org.apache.spark.sql.types._

val path = "/tmp/parquet-test"

case class Inner(f00: Int)
case class Outer(f0: Inner, f1: Int)

val df = Seq(Outer(Inner(1), 1)).toDF()
df.printSchema()
// root
//  |-- f0: struct (nullable = true)
//  |    |-- f00: integer (nullable = false)
//  |-- f1: integer (nullable = false)

df.write.mode("overwrite").parquet(path)

val requestedSchema = new StructType().
  add("f0", new StructType().
    // This nested field name differs from the original one
    add("f01", IntegerType)).
  add("f1", IntegerType)

println(requestedSchema.treeString)
// root
//  |-- f0: struct (nullable = true)
//  |    |-- f01: integer (nullable = true)
//  |-- f1: integer (nullable = true)

spark.read.schema(requestedSchema).parquet(path).show()
{code}

In the above snippet, {{requestedSchema}} is compatible with the schema of the written Parquet file, but the following exception is thrown:

{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/parquet-test/part-7-d2b0bec1-7be5-4b51-8d53-3642680bc9c2.snappy.parquet
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
  at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
  at java.util.ArrayList.rangeCheck(ArrayList.java:653)
  at java.util.ArrayList.get(ArrayList.java:429)
  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
  at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
  at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
  at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
  at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
  at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
  ... 21 more
{noformat}

According to this stack trace, it seems that {{GroupColumnIO.getFirst()}} [doesn't check for empty
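For illustration, here is a simplified model of the traversal implicated in the trace above. The class names mirror the real ones, but this is a sketch of the described behavior, not the parquet-mr source:

{code:java}
import java.util.ArrayList;
import java.util.List;

abstract class ColumnIO {
  abstract ColumnIO getFirst();
}

class PrimitiveColumnIO extends ColumnIO {
  @Override
  ColumnIO getFirst() {
    return this; // a leaf is its own first column
  }
}

class GroupColumnIO extends ColumnIO {
  final String name;
  final List<ColumnIO> children = new ArrayList<>();

  GroupColumnIO(String name) {
    this.name = name;
  }

  @Override
  ColumnIO getFirst() {
    // When the requested group matched no columns in the file (for example
    // the renamed nested field "f01" above), children is empty and this line
    // throws IndexOutOfBoundsException: Index: 0, Size: 0, as in the trace.
    return children.get(0).getFirst();
  }
}
{code}

A fix would check {{children.isEmpty()}} here and either raise a descriptive error or treat the unmatched group as all-null.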
[jira] [Created] (PARQUET-893) GroupColumnIO.getFirst() doesn't check for empty groups
Cheng Lian created PARQUET-893:
-----------------------------------

             Summary: GroupColumnIO.getFirst() doesn't check for empty groups
                 Key: PARQUET-893
                 URL: https://issues.apache.org/jira/browse/PARQUET-893
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.8.1
            Reporter: Cheng Lian


The following Spark 2.1 snippet reproduces this issue:

{code}
import org.apache.spark.sql.types._

val path = "/tmp/parquet-test"

case class Inner(f00: Int)
case class Outer(f0: Inner, f1: Int)

val df = Seq(Outer(Inner(1), 1)).toDF()
df.printSchema()
// root
//  |-- f0: struct (nullable = true)
//  |    |-- f00: integer (nullable = false)
//  |-- f1: integer (nullable = false)

df.write.mode("overwrite").parquet(path)

val requestedSchema = new StructType().
  add("f0", new StructType().
    // This nested field name differs from the original one
    add("f01", IntegerType)).
  add("f1", IntegerType)

println(requestedSchema.treeString)
// root
//  |-- f0: struct (nullable = true)
//  |    |-- f01: integer (nullable = true)
//  |-- f1: integer (nullable = true)

spark.read.schema(requestedSchema).parquet(path).show()
{code}

In the above snippet, {{requestedSchema}} is compatible with the schema of the written Parquet file, but the following exception is thrown:

{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/parquet-test/part-7-d2b0bec1-7be5-4b51-8d53-3642680bc9c2.snappy.parquet
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
  at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
  at java.util.ArrayList.rangeCheck(ArrayList.java:653)
  at java.util.ArrayList.get(ArrayList.java:429)
  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
  at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
  at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
  at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
  at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
  at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
  ... 21 more
{noformat}

According to this stack trace, it seems that {{GroupColumnIO.getFirst()}} [doesn't check for
[jira] [Created] (PARQUET-754) Deprecate the "strict" argument in MessageType.union()
Cheng Lian created PARQUET-754:
-----------------------------------

             Summary: Deprecate the "strict" argument in MessageType.union()
                 Key: PARQUET-754
                 URL: https://issues.apache.org/jira/browse/PARQUET-754
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.8.1
            Reporter: Cheng Lian
            Priority: Minor


As discussed in PARQUET-379, non-strict schema merging doesn't really make any sense, and we always set the argument to {{true}} throughout the code base. We should probably deprecate it and make sure no internal code ever uses non-strict schema merging.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
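For illustration, the proposed deprecation could look like the following sketch. This is a hypothetical wrapper, not the actual patch; it only assumes the existing public {{MessageType.union(MessageType, boolean)}} method:

{code:java}
import org.apache.parquet.schema.MessageType;

final class SchemaMerger {
  /** @deprecated non-strict merging loses type information; merge strictly instead. */
  @Deprecated
  static MessageType union(MessageType left, MessageType right, boolean strict) {
    return left.union(right, strict);
  }

  // The only variant internal code should use: strict merging, where
  // primitive types must match.
  static MessageType union(MessageType left, MessageType right) {
    return left.union(right, true);
  }
}
{code}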
[jira] [Commented] (PARQUET-753) GroupType.union() doesn't merge the original type
[ https://issues.apache.org/jira/browse/PARQUET-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583942#comment-15583942 ]

Cheng Lian commented on PARQUET-753:
------------------------------------

PARQUET-379 resolves the {{union}} issue related to primitive types, but doesn't handle group types.

> GroupType.union() doesn't merge the original type
> --------------------------------------------------
>
>                 Key: PARQUET-753
>                 URL: https://issues.apache.org/jira/browse/PARQUET-753
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.8.1
>            Reporter: Deneche A. Hakim
>
> When merging two GroupTypes, the union() method doesn't merge their original
> type, which will be lost after the union.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
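For illustration, a small reproduction sketch of the described behavior, assuming the {{MessageTypeParser}} API; the comment reflects what this ticket reports, not verified output:

{code:java}
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class GroupUnionRepro {
  public static void main(String[] args) {
    MessageType t1 = MessageTypeParser.parseMessageType(
        "message t1 { optional group f (LIST) { repeated int32 array; } }");
    MessageType t2 = MessageTypeParser.parseMessageType(
        "message t2 { optional group f (LIST) { repeated int32 array; } }");
    // Per this ticket, the merged group "f" loses its LIST annotation even
    // though both inputs carry it.
    System.out.println(t1.union(t2));
  }
}
{code}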
[jira] [Updated] (PARQUET-655) The LogicalTypes.md link in README.md points to the old Parquet GitHub repository
[ https://issues.apache.org/jira/browse/PARQUET-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated PARQUET-655:
-------------------------------
    Component/s: parquet-format

> The LogicalTypes.md link in README.md points to the old Parquet GitHub
> repository
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-655
>                 URL: https://issues.apache.org/jira/browse/PARQUET-655
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Cheng Lian
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (PARQUET-655) The LogicalTypes.md link in README.md points to the old Parquet GitHub repository
Cheng Lian created PARQUET-655:
-----------------------------------

             Summary: The LogicalTypes.md link in README.md points to the old Parquet GitHub repository
                 Key: PARQUET-655
                 URL: https://issues.apache.org/jira/browse/PARQUET-655
             Project: Parquet
          Issue Type: Bug
            Reporter: Cheng Lian


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (PARQUET-654) Make record-level filtering optional
Cheng Lian created PARQUET-654:
-----------------------------------

             Summary: Make record-level filtering optional
                 Key: PARQUET-654
                 URL: https://issues.apache.org/jira/browse/PARQUET-654
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
            Reporter: Cheng Lian


For some engines, especially those with vectorized Parquet readers, filter predicates can often be evaluated more efficiently by the engine itself. In these cases, Parquet record-level filtering may even slow down query execution when filter push-down is enabled. On the other hand, when the data is well prepared, filter push-down can be very valuable due to row group level filtering. One possible improvement here is to add a configuration option that makes record-level filtering optional. In this way, the upper-level engine may leverage both Parquet row group level filtering and faster native record-level filtering.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
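For illustration, such an option could be exposed through the Hadoop configuration roughly as follows; the property name here is hypothetical:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class FilterConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical knob: keep row-group (statistics-based) pruning, but skip
    // Parquet's record-by-record filter evaluation so the engine can apply
    // its own, faster native filtering.
    conf.setBoolean("parquet.filter.record-level.enabled", false);
  }
}
{code}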
[jira] [Updated] (PARQUET-651) Parquet-avro fails to decode array of record with a single field name "element" correctly
[ https://issues.apache.org/jira/browse/PARQUET-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated PARQUET-651:
-------------------------------
    Affects Version/s: 1.9.0

> Parquet-avro fails to decode array of record with a single field name
> "element" correctly
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-651
>                 URL: https://issues.apache.org/jira/browse/PARQUET-651
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.7.0, 1.8.0, 1.8.1, 1.9.0
>            Reporter: Cheng Lian
>
> Found this issue while investigating SPARK-16344.
> For the following Parquet schema
> {noformat}
> message root {
>   optional group f (LIST) {
>     repeated group list {
>       optional group element {
>         optional int64 element;
>       }
>     }
>   }
> }
> {noformat}
> parquet-avro decodes it as something like this:
> {noformat}
> record SingleElement {
>   int element;
> }
> record NestedSingleElement {
>   SingleElement element;
> }
> record Spark16344Wrong {
>   array<NestedSingleElement> f;
> }
> {noformat}
> while the correct interpretation should be:
> {noformat}
> record SingleElement {
>   int element;
> }
> record Spark16344 {
>   array<SingleElement> f;
> }
> {noformat}
> The reason is that the {{element}} syntactic group for LIST in
> {noformat}
> group (LIST) {
>   repeated group list {
>     element;
>   }
> }
> {noformat}
> is recognized as a record field named {{element}}. The problematic code lies in
> [{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858].
> We should probably check the standard 3-level layout first before falling back
> to the legacy 2-level layout.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (PARQUET-651) Parquet-avro fails to decode array of record with a single field name "element" correctly
[ https://issues.apache.org/jira/browse/PARQUET-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated PARQUET-651:
-------------------------------
    Description: 
Found this issue while investigating SPARK-16344.

For the following Parquet schema
{noformat}
message root {
  optional group f (LIST) {
    repeated group list {
      optional group element {
        optional int64 element;
      }
    }
  }
}
{noformat}
parquet-avro decodes it as something like this:
{noformat}
record SingleElement {
  int element;
}
record NestedSingleElement {
  SingleElement element;
}
record Spark16344Wrong {
  array<NestedSingleElement> f;
}
{noformat}
while the correct interpretation should be:
{noformat}
record SingleElement {
  int element;
}
record Spark16344 {
  array<SingleElement> f;
}
{noformat}
The reason is that the {{element}} syntactic group for LIST in
{noformat}
group (LIST) {
  repeated group list {
    element;
  }
}
{noformat}
is recognized as a record field named {{element}}. The problematic code lies in [{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858]. We should probably check the standard 3-level layout first before falling back to the legacy 2-level layout.

  was:
Found this issue while investigating SPARK-16344.

For the following Parquet schema
{noformat}
message root {
  optional group f (LIST) {
    repeated group list {
      optional group element {
        optional int64 element;
      }
    }
  }
}
{noformat}
parquet-avro decodes it as something like this:
{noformat}
record SingleElement {
  int element;
}
record NestedSingleElement {
  SingleElement element;
}
record Spark16344Wrong {
  array<NestedSingleElement> f;
}
{noformat}
while the correct interpretation should be:
{noformat}
record SingleElement {
  int element;
}
record Spark16344 {
  array<SingleElement> f;
}
{noformat}
The reason is that the {{element}} syntactic group for LIST in
{noformat}
group (LIST) {
  repeated group list {
    element;
  }
}
{noformat}
is recognized as record field {{SingleElement.element}}. The problematic code lies in [{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858]. We should probably check the standard 3-level layout first before falling back to the legacy 2-level layout.

> Parquet-avro fails to decode array of record with a single field name
> "element" correctly
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-651
>                 URL: https://issues.apache.org/jira/browse/PARQUET-651
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.7.0, 1.8.0, 1.8.1
>            Reporter: Cheng Lian
>
> Found this issue while investigating SPARK-16344.
> For the following Parquet schema
> {noformat}
> message root {
>   optional group f (LIST) {
>     repeated group list {
>       optional group element {
>         optional int64 element;
>       }
>     }
>   }
> }
> {noformat}
> parquet-avro decodes it as something like this:
> {noformat}
> record SingleElement {
>   int element;
> }
> record NestedSingleElement {
>   SingleElement element;
> }
> record Spark16344Wrong {
>   array<NestedSingleElement> f;
> }
> {noformat}
> while the correct interpretation should be:
> {noformat}
> record SingleElement {
>   int element;
> }
> record Spark16344 {
>   array<SingleElement> f;
> }
> {noformat}
> The reason is that the {{element}} syntactic group for LIST in
> {noformat}
> group (LIST) {
>   repeated group list {
>     element;
>   }
> }
> {noformat}
> is recognized as a record field named {{element}}.
> The problematic code lies in
> [{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858].
> We should probably check the standard 3-level layout first before falling back
> to the legacy 2-level layout.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
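For illustration, the "check the standard 3-level layout first" idea could start from a shape test like the following sketch (a hypothetical helper, not the actual {{AvroRecordConverter}} code):

{code:java}
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.Type;

final class ListShapeSketch {
  // In the standard 3-level layout, the repeated level inside a LIST group
  // is a synthetic wrapper named "list" with a single field named "element".
  // Preferring this interpretation avoids mistaking the wrapper for a record
  // type that happens to have a single field called "element".
  static boolean isStandardThreeLevelList(Type repeatedType) {
    if (repeatedType.isPrimitive()) {
      return false;
    }
    GroupType group = repeatedType.asGroupType();
    return group.getName().equals("list")
        && group.getFieldCount() == 1
        && group.getFieldName(0).equals("element");
  }
}
{code}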
[jira] [Updated] (PARQUET-651) Parquet-avro fails to decode array of record with a single field name "element" correctly
[ https://issues.apache.org/jira/browse/PARQUET-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated PARQUET-651:
-------------------------------
    Description: 
Found this issue while investigating SPARK-16344.

For the following Parquet schema
{noformat}
message root {
  optional group f (LIST) {
    repeated group list {
      optional group element {
        optional int64 element;
      }
    }
  }
}
{noformat}
parquet-avro decodes it as something like this:
{noformat}
record SingleElement {
  int element;
}
record NestedSingleElement {
  SingleElement element;
}
record Spark16344Wrong {
  array<NestedSingleElement> f;
}
{noformat}
while the correct interpretation should be:
{noformat}
record SingleElement {
  int element;
}
record Spark16344 {
  array<SingleElement> f;
}
{noformat}
The reason is that the {{element}} syntactic group for LIST in
{noformat}
group (LIST) {
  repeated group list {
    element;
  }
}
{noformat}
is recognized as record field {{SingleElement.element}}. The problematic code lies in [{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858]. We should probably check the standard 3-level layout first before falling back to the legacy 2-level layout.

  was:
Found this issue while investigating SPARK-16344.

For the following Parquet schema
{noformat}
message root {
  optional group f (LIST) {
    repeated group list {
      optional group element {
        optional int64 element;
      }
    }
  }
}
{noformat}
parquet-avro decodes it as something like this:
{noformat}
record SingleElement {
  int element;
}
record NestedSingleElement {
  SingleElement element;
}
record Spark16344Wrong {
  array<NestedSingleElement> f;
}
{noformat}
while the correct interpretation should be:
{noformat}
record SingleElement {
  int element;
}
record Spark16344 {
  array<SingleElement> f;
}
{noformat}
Adding the following test case to {{TestArrayCompatibility}} may reproduce this issue:
{code:java}
  @Test
  public void testSpark16344() throws Exception {
    Path test = writeDirect(
        "message root {" +
        "  optional group f (LIST) {" +
        "    repeated group list {" +
        "      optional group element {" +
        "        optional int32 element;" +
        "      }" +
        "    }" +
        "  }" +
        "}",
        new DirectWriter() {
          @Override
          public void write(RecordConsumer rc) {
            rc.startMessage();
            rc.startField("f", 0);
            rc.startGroup();
            rc.startField("list", 0);
            rc.startGroup();
            rc.startField("element", 0);
            rc.startGroup();
            rc.startField("element", 0);
            rc.addInteger(42);
            rc.endField("element", 0);
            rc.endGroup();
            rc.endField("element", 0);
            rc.endGroup();
            rc.endField("list", 0);
            rc.endGroup();
            rc.endField("f", 0);
            rc.endMessage();
          }
        });

    Schema element = record("rec", field("element", primitive(Schema.Type.INT)));
    Schema expectedSchema = record("root", field("f", array(element)));
    GenericRecord expectedRecord = instance(expectedSchema,
        "f", Collections.singletonList(instance(element, 42)));

    assertReaderContains(newBehaviorReader(test), expectedSchema, expectedRecord);
  }
{code}
The reason is that the {{element}} syntactic group for LIST in
{noformat}
group (LIST) {
  repeated group list {
    element;
  }
}
{noformat}
is recognized as record field {{SingleElement.element}}. The problematic code lies in [{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858]. We should probably check the standard 3-level layout first before falling back to the legacy 2-level layout.

> Parquet-avro fails to decode array of record with a single field name
> "element" correctly
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-651
>                 URL: https://issues.apache.org/jira/browse/PARQUET-651
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.7.0, 1.8.0, 1.8.1
>            Reporter: Cheng Lian
>
> Found this issue while investigating SPARK-16344.
> For the following Parquet schema
> {noformat}
> message root {
>   optional group f (LIST) {
>     repeated group list {
>       optional group element {
>         optional int64 element;
>       }
>     }
>   }
> }
> {noformat}
> parquet-avro decodes it as something like this:
> {noformat}
> record SingleElement {
>   int element;
> }
> record NestedSingleElement {
>   SingleElement element;
> }
> record Spark16344Wrong {
>   array<NestedSingleElement> f;
> }
> {noformat}
>
[jira] [Resolved] (PARQUET-528) Fix flush() for RecordConsumer and implementations
[ https://issues.apache.org/jira/browse/PARQUET-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian resolved PARQUET-528.
--------------------------------
    Resolution: Fixed

Issue resolved by pull request 325
[https://github.com/apache/parquet-mr/pull/325]

> Fix flush() for RecordConsumer and implementations
> --------------------------------------------------
>
>                 Key: PARQUET-528
>                 URL: https://issues.apache.org/jira/browse/PARQUET-528
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.8.0, 1.8.1
>            Reporter: Liwei Lin
>            Assignee: Liwei Lin
>             Fix For: 1.9.0
>
> _+flush()+_ was added in _+RecordConsumer+_ and _+MessageColumnIO+_ to help
> implement nulls caching.
> However, other _+RecordConsumer+_ implementations should also implement
> _+flush()+_ properly. For instance, _+RecordConsumerLoggingWrapper+_ and
> _+ValidatingRecordConsumer+_ should call _+delegate.flush()+_ in their
> _+flush()+_ methods, otherwise data might be mistakenly truncated.
> This ticket:
> - makes _+flush()+_ abstract in _+RecordConsumer+_
> - implements _+flush()+_ properly for all _+RecordConsumer+_ subclasses,
>   specifically:
> -- _+RecordConsumerLoggingWrapper+_
> -- _+ValidatingRecordConsumer+_
> -- _+ConverterConsumer+_
> -- _+ExpectationValidatingRecordConsumer+_

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
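For illustration, the delegation the ticket asks for is a one-liner in each wrapper; a sketch with a hypothetical wrapper class:

{code:java}
import org.apache.parquet.io.api.RecordConsumer;

// Sketch of the wrapper pattern described above: flush() must be forwarded
// to the delegate, otherwise nulls cached by the delegate can be dropped.
abstract class ForwardingRecordConsumer extends RecordConsumer {
  private final RecordConsumer delegate;

  ForwardingRecordConsumer(RecordConsumer delegate) {
    this.delegate = delegate;
  }

  @Override
  public void flush() {
    delegate.flush();
  }
}
{code}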
[jira] [Commented] (PARQUET-401) Deprecate Log and move to SLF4J Logger
[ https://issues.apache.org/jira/browse/PARQUET-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127668#comment-15127668 ]

Cheng Lian commented on PARQUET-401:
------------------------------------

A fix for this issue would be nice to have, but it probably shouldn't block 1.9.0.

> Deprecate Log and move to SLF4J Logger
> --------------------------------------
>
>                 Key: PARQUET-401
>                 URL: https://issues.apache.org/jira/browse/PARQUET-401
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.8.1
>            Reporter: Ryan Blue
>
> The current Log class is intended to allow swapping out logger back-ends, but
> SLF4J already does this. It also doesn't expose as nice of an API as SLF4J,
> which can handle formatting to avoid the cost of building log messages that
> won't be used. I think we should deprecate the org.apache.parquet.Log class
> and move to using SLF4J directly, instead of wrapping SLF4J (PARQUET-305).
> This will require deprecating the current Log class and replacing the current
> uses of it with SLF4J.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
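For reference, the SLF4J pattern alluded to above: parameterized messages defer string construction until the log level is actually enabled, so disabled log statements cost almost nothing:

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(LoggingSketch.class);

  void readFooter(String path, long length) {
    // The message is only formatted when DEBUG is enabled; no string
    // concatenation happens on the hot path otherwise.
    LOG.debug("reading footer of {} ({} bytes)", path, length);
  }
}
{code}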
[jira] [Resolved] (PARQUET-495) Fix mismatches in Types class comments
[ https://issues.apache.org/jira/browse/PARQUET-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian resolved PARQUET-495.
--------------------------------
    Resolution: Fixed

Issue resolved by pull request 317
[https://github.com/apache/parquet-mr/pull/317]

> Fix mismatches in Types class comments
> --------------------------------------
>
>                 Key: PARQUET-495
>                 URL: https://issues.apache.org/jira/browse/PARQUET-495
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.8.0, 1.8.1
>            Reporter: Liwei Lin
>            Assignee: Liwei Lin
>            Priority: Trivial
>             Fix For: 1.9.0
>
> To produce:
> required group User \{
>   required int64 id;
>   *optional* binary email (UTF8);
> \}
> we should do:
> Types.requiredGroup()
>   .required(INT64).named("id")
>   .-*required* (BINARY).as(UTF8).named("email")-
>   .*optional* (BINARY).as(UTF8).named("email")
>   .named("User")

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
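For reference, the corrected snippet from the ticket as a compilable fragment (assuming the usual imports from parquet-mr's schema package):

{code:java}
import static org.apache.parquet.schema.OriginalType.UTF8;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;

import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.Types;

public class TypesCommentFix {
  public static void main(String[] args) {
    // Produces: required group User { required int64 id;
    //                                 optional binary email (UTF8); }
    GroupType user = Types.requiredGroup()
        .required(INT64).named("id")
        .optional(BINARY).as(UTF8).named("email")
        .named("User");
    System.out.println(user);
  }
}
{code}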
[jira] [Resolved] (PARQUET-432) Complete a todo for method ColumnDescriptor.compareTo()
[ https://issues.apache.org/jira/browse/PARQUET-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian resolved PARQUET-432.
--------------------------------
    Resolution: Fixed

Issue resolved by pull request 314
[https://github.com/apache/parquet-mr/pull/314]

> Complete a todo for method ColumnDescriptor.compareTo()
> --------------------------------------------------------
>
>                 Key: PARQUET-432
>                 URL: https://issues.apache.org/jira/browse/PARQUET-432
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.8.0, 1.8.1
>            Reporter: Liwei Lin
>            Assignee: Liwei Lin
>            Priority: Minor
>             Fix For: 1.9.0
>
> This ticket proposes to handle the case *path.length < o.path.length* in the
> method ColumnDescriptor.compareTo().

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
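For illustration, completing that todo could look like the following sketch: compare path elements first, and only fall back to length when one path is a strict prefix of the other. This is illustrative, not the committed patch:

{code:java}
final class PathComparatorSketch {
  static int compare(String[] path, String[] otherPath) {
    int common = Math.min(path.length, otherPath.length);
    for (int i = 0; i < common; i++) {
      int c = path[i].compareTo(otherPath[i]);
      if (c != 0) {
        return c;
      }
    }
    // Previously unhandled case: when path.length < otherPath.length the
    // shorter path sorts first.
    return Integer.compare(path.length, otherPath.length);
  }

  public static void main(String[] args) {
    System.out.println(compare(new String[] {"a"}, new String[] {"a", "b"})); // -1
  }
}
{code}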
[jira] [Created] (PARQUET-398) Testing JIRA ticket for testing committership
Cheng Lian created PARQUET-398:
-----------------------------------

             Summary: Testing JIRA ticket for testing committership
                 Key: PARQUET-398
                 URL: https://issues.apache.org/jira/browse/PARQUET-398
             Project: Parquet
          Issue Type: Test
            Reporter: Cheng Lian
            Priority: Minor


This ticket is only used for testing committership. Please keep it open. New committers can submit a PR to add their names to {{dev/COMMITTERS.md}}, and attach the ID of this JIRA ticket to the PR title (this convention is required by the {{dev/merge_parquet_pr.py}} script).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (PARQUET-389) Filter predicates should work with missing columns
Cheng Lian created PARQUET-389:
-----------------------------------

             Summary: Filter predicates should work with missing columns
                 Key: PARQUET-389
                 URL: https://issues.apache.org/jira/browse/PARQUET-389
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.8.0, 1.7.0, 1.6.0
            Reporter: Cheng Lian


This issue originates from SPARK-11103, which contains detailed information about how to reproduce it. The major problem here is that pushed-down filter predicates assert that the columns they touch must exist in the target physical files. But this isn't true in the case of schema merging. Actually, this assertion is unnecessary, because if a column is missing from the filter schema, the column is considered to be filled with nulls, and all the filters should be able to act accordingly. For example, if we push down {{a = 1}} but {{a}} is missing in the underlying physical file, all records in this file should be dropped since {{a}} is always null. On the other hand, if we push down {{a IS NULL}}, all records should be preserved.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
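For illustration, the two cases from the description expressed with parquet-mr's filter2 API (a sketch; passing {{null}} as the comparison value is how {{eq}} expresses an "is null" test):

{code:java}
import static org.apache.parquet.filter2.predicate.FilterApi.eq;
import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;

import org.apache.parquet.filter2.predicate.FilterPredicate;

public class MissingColumnFilters {
  public static void main(String[] args) {
    // If column "a" is missing from a physical file, it is effectively
    // all-null there, so:
    FilterPredicate dropsAll = eq(intColumn("a"), 1);    // a = 1: no record can match
    FilterPredicate keepsAll = eq(intColumn("a"), null); // a IS NULL: every record matches
    System.out.println(dropsAll + " / " + keepsAll);
  }
}
{code}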
[jira] [Comment Edited] (PARQUET-379) PrimitiveType.union erases original type
[ https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933630#comment-14933630 ]

Cheng Lian edited comment on PARQUET-379 at 9/28/15 5:34 PM:
-------------------------------------------------------------

While trying to fix this issue, I ran into a question regarding the {{strict}} argument of {{PrimitiveType.union}} and, correspondingly, {{MessageType.union}}. It seems that throughout the whole parquet-mr code base (including tests), we always call these methods with {{strict}} being {{true}}, which means schema primitive types should match.

Maybe I missed something here, but I don't see a sound use case for non-strict schema merging. In particular, the field types of {{t1.union(t2, false)}} are completely determined by {{t1}}, rather than the "wider" ones:
{noformat}
message t1 { required int32 f; }
message t2 { required int64 f; }

t1.union(t2, false) =>
message t3 { required int32 f; }
{noformat}
Basically, we can't use such a schema to read actual Parquet files, even if we add some sort of automatic "type widening" logic inside Parquet readers, since the merged one above loses precision.

So my questions are:
# Is there a practical scenario where non-strict schema merging makes sense?
# If not, should we deprecate it? (We can't remove it since {{MessageType.union(MessageType, boolean)}} is part of the public API.)


was (Author: lian cheng):
While trying to fix this issue, I ran into a question regarding the {{strict}} argument of {{PrimitiveType.union}} and, correspondingly, {{MessageType.union}}. It seems that throughout the whole parquet-mr code base (including tests), we always call these methods with {{strict}} being {{true}}, which means schema primitive types should match.

Maybe I missed something here, but I don't see a sound use case for non-strict schema merging. In particular, the field types of {{t1.union(t2, false)}} are completely determined by {{t1}}, rather than the "wider" types of the two:
{noformat}
message t1 { required int32 f; }
message t2 { required int64 f; }

t1.union(t2, false) =>
message t3 { required int32 f; }
{noformat}
Basically, we can't use such a schema to read actual Parquet files, even if we add some sort of automatic "type widening" logic inside Parquet readers, since the merged one above loses precision.

So my questions are:
# Is there a practical scenario where non-strict schema merging makes sense?
# If not, should we deprecate it? (We can't remove it since {{MessageType.union(MessageType, boolean)}} is part of the public API.)

> PrimitiveType.union erases original type
> -----------------------------------------
>
>                 Key: PARQUET-379
>                 URL: https://issues.apache.org/jira/browse/PARQUET-379
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>            Reporter: Cheng Lian
>
> The following ScalaTest test case
> {code}
> test("merge primitive types") {
>   val expected =
>     Types.buildMessage()
>       .addField(
>         Types
>           .required(INT32)
>           .as(DECIMAL)
>           .precision(7)
>           .scale(2)
>           .named("f"))
>       .named("root")
>
>   assert(expected.union(expected) === expected)
> }
> {code}
> produces the following assertion error
> {noformat}
> message root {
>   required int32 f;
> }
>
> did not equal
>
> message root {
>   required int32 f (DECIMAL(9,0));
> }
> {noformat}
> This is because {{PrimitiveType.union}} doesn't handle the original type
> properly. An open question is: can two primitive types with the same
> primitive type name but different original types be unioned?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
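To make the concern concrete, a small sketch using {{MessageTypeParser}} (the printed schema follows from the behavior described in the comment above):

{code:java}
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class NonStrictUnionSketch {
  public static void main(String[] args) {
    MessageType t1 = MessageTypeParser.parseMessageType("message t1 { required int32 f; }");
    MessageType t2 = MessageTypeParser.parseMessageType("message t2 { required int64 f; }");
    // Non-strict merging keeps t1's int32 and silently drops t2's wider
    // int64, so the merged schema cannot read files written with t2.
    System.out.println(t1.union(t2, false));
  }
}
{code}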
[jira] [Created] (PARQUET-385) PrimitiveType.union accepts fixed_len_byte_array fields with different length when strict mode is on
Cheng Lian created PARQUET-385:
-----------------------------------

             Summary: PrimitiveType.union accepts fixed_len_byte_array fields with different length when strict mode is on
                 Key: PARQUET-385
                 URL: https://issues.apache.org/jira/browse/PARQUET-385
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.8.0, 1.7.0, 1.6.0, 1.5.0
            Reporter: Cheng Lian


The following two schemas probably shouldn't be allowed to be union-ed when strict schema-merging mode is on:
{noformat}
message t1 {
  required fixed_len_byte_array(10) f;
}

message t2 {
  required fixed_len_byte_array(5) f;
}
{noformat}
But currently {{t1.union(t2, true)}} yields {{t1}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PARQUET-379) PrimitiveType.union erases original type
[ https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933630#comment-14933630 ]

Cheng Lian commented on PARQUET-379:
------------------------------------

While trying to fix this issue, I ran into a question regarding the {{strict}} argument of {{PrimitiveType.union}} and, correspondingly, {{MessageType.union}}. It seems that throughout the whole parquet-mr code base (including tests), we always call these methods with {{strict}} being {{true}}, which means schema primitive types should match.

Maybe I missed something here, but I don't see a sound use case for non-strict schema merging. In particular, the field types of {{t1.union(t2, false)}} are completely determined by {{t1}}, rather than the "wider" types of the two:
{noformat}
message t1 { required int32 f; }
message t2 { required int64 f; }

t1.union(t2, false) =>
message t3 { required int32 f; }
{noformat}
Basically, we can't use such a schema to read actual Parquet files, even if we add some sort of automatic "type widening" logic inside Parquet readers, since the merged one above loses precision.

So my questions are:
# Is there a practical scenario where non-strict schema merging makes sense?
# If not, should we deprecate it? (We can't remove it since {{MessageType.union(MessageType, boolean)}} is part of the public API.)

> PrimitiveType.union erases original type
> -----------------------------------------
>
>                 Key: PARQUET-379
>                 URL: https://issues.apache.org/jira/browse/PARQUET-379
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>            Reporter: Cheng Lian
>
> The following ScalaTest test case
> {code}
> test("merge primitive types") {
>   val expected =
>     Types.buildMessage()
>       .addField(
>         Types
>           .required(INT32)
>           .as(DECIMAL)
>           .precision(7)
>           .scale(2)
>           .named("f"))
>       .named("root")
>
>   assert(expected.union(expected) === expected)
> }
> {code}
> produces the following assertion error
> {noformat}
> message root {
>   required int32 f;
> }
>
> did not equal
>
> message root {
>   required int32 f (DECIMAL(9,0));
> }
> {noformat}
> This is because {{PrimitiveType.union}} doesn't handle the original type
> properly. An open question is: can two primitive types with the same
> primitive type name but different original types be unioned?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PARQUET-379) PrimitiveType.union erases original type
[ https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934084#comment-14934084 ]

Cheng Lian commented on PARQUET-379:
------------------------------------

So deprecating non-strict schema merging seems reasonable? Namely, deprecate {{MessageType.union(MessageType toMerge, boolean strict)}}, and always set {{strict}} to {{true}} when we call this method internally.

> PrimitiveType.union erases original type
> -----------------------------------------
>
>                 Key: PARQUET-379
>                 URL: https://issues.apache.org/jira/browse/PARQUET-379
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>            Reporter: Cheng Lian
>
> The following ScalaTest test case
> {code}
> test("merge primitive types") {
>   val expected =
>     Types.buildMessage()
>       .addField(
>         Types
>           .required(INT32)
>           .as(DECIMAL)
>           .precision(7)
>           .scale(2)
>           .named("f"))
>       .named("root")
>
>   assert(expected.union(expected) === expected)
> }
> {code}
> produces the following assertion error
> {noformat}
> message root {
>   required int32 f;
> }
>
> did not equal
>
> message root {
>   required int32 f (DECIMAL(9,0));
> }
> {noformat}
> This is because {{PrimitiveType.union}} doesn't handle the original type
> properly. An open question is: can two primitive types with the same
> primitive type name but different original types be unioned?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (PARQUET-385) PrimitiveType.union accepts fixed_len_byte_array fields with different lengths when strict mode is on
[ https://issues.apache.org/jira/browse/PARQUET-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated PARQUET-385:
-------------------------------
    Summary: PrimitiveType.union accepts fixed_len_byte_array fields with different lengths when strict mode is on  (was: PrimitiveType.union accepts fixed_len_byte_array fields with different length when strict mode is on)

> PrimitiveType.union accepts fixed_len_byte_array fields with different
> lengths when strict mode is on
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-385
>                 URL: https://issues.apache.org/jira/browse/PARQUET-385
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>            Reporter: Cheng Lian
>
> The following two schemas probably shouldn't be allowed to be union-ed when
> strict schema-merging mode is on:
> {noformat}
> message t1 {
>   required fixed_len_byte_array(10) f;
> }
>
> message t2 {
>   required fixed_len_byte_array(5) f;
> }
> {noformat}
> But currently {{t1.union(t2, true)}} yields {{t1}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (PARQUET-379) PrimitiveType.union erases original type
[ https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated PARQUET-379:
-------------------------------
    Description: 
The following ScalaTest test case
{code}
test("merge primitive types") {
  val expected =
    Types.buildMessage()
      .addField(
        Types
          .required(INT32)
          .as(DECIMAL)
          .precision(7)
          .scale(2)
          .named("f"))
      .named("root")

  assert(expected.union(expected) === expected)
}
{code}
produces the following assertion error
{noformat}
message root {
  required int32 f;
}

did not equal

message root {
  required int32 f (DECIMAL(9,0));
}
{noformat}
This is because {{PrimitiveType.union}} doesn't handle the original type properly. An open question is: can two primitive types with the same primitive type name but different original types be unioned?

  was:
The following ScalaTest test case
{code}
test("merge primitive types") {
  val expected =
    Types.buildMessage()
      .addField(
        Types
          .required(INT32)
          .as(DECIMAL)
          .precision(9)
          .scale(0)
          .named("f"))
      .named("root")

  assert(expected.union(expected) === expected)
}
{code}
produces the following assertion error
{noformat}
message root {
  required int32 f;
}

did not equal

message root {
  required int32 f (DECIMAL(9,0));
}
{noformat}
This is because {{PrimitiveType.union}} doesn't handle the original type properly. An open question is: can two primitive types with the same primitive type name but different original types be unioned?

> PrimitiveType.union erases original type
> -----------------------------------------
>
>                 Key: PARQUET-379
>                 URL: https://issues.apache.org/jira/browse/PARQUET-379
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>            Reporter: Cheng Lian
>
> The following ScalaTest test case
> {code}
> test("merge primitive types") {
>   val expected =
>     Types.buildMessage()
>       .addField(
>         Types
>           .required(INT32)
>           .as(DECIMAL)
>           .precision(7)
>           .scale(2)
>           .named("f"))
>       .named("root")
>
>   assert(expected.union(expected) === expected)
> }
> {code}
> produces the following assertion error
> {noformat}
> message root {
>   required int32 f;
> }
>
> did not equal
>
> message root {
>   required int32 f (DECIMAL(9,0));
> }
> {noformat}
> This is because {{PrimitiveType.union}} doesn't handle the original type
> properly. An open question is: can two primitive types with the same
> primitive type name but different original types be unioned?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (PARQUET-379) PrimitiveType.union erases original type
Cheng Lian created PARQUET-379: -- Summary: PrimitiveType.union erases original type Key: PARQUET-379 URL: https://issues.apache.org/jira/browse/PARQUET-379 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.8.0, 1.7.0, 1.6.0, 1.5.0 Reporter: Cheng Lian The following test case {code} test("merge primitive types") { val expected = Types.buildMessage() .addField( Types .required(INT32) .as(DECIMAL) .precision(9) .scale(0) .named("f")) .named("root") assert(expected.union(expected) === expected) } {code} produces the following assertion error {noformat} message root { required int32 f; } did not equal message root { required int32 f (DECIMAL(9,0)); } {noformat} This is because {{PrimitiveType.union}} doesn't handle original type properly. An open question is that, can two primitive types with the same primitive type name but different original types be unioned? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-379) PrimitiveType.union erases original type
[ https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-379: --- Description: The following ScalaTest test case {code} test("merge primitive types") { val expected = Types.buildMessage() .addField( Types .required(INT32) .as(DECIMAL) .precision(9) .scale(0) .named("f")) .named("root") assert(expected.union(expected) === expected) } {code} produces the following assertion error {noformat} message root { required int32 f; } did not equal message root { required int32 f (DECIMAL(9,0)); } {noformat} This is because {{PrimitiveType.union}} doesn't handle original type properly. An open question is that, can two primitive types with the same primitive type name but different original types be unioned? was: The following test case {code} test("merge primitive types") { val expected = Types.buildMessage() .addField( Types .required(INT32) .as(DECIMAL) .precision(9) .scale(0) .named("f")) .named("root") assert(expected.union(expected) === expected) } {code} produces the following assertion error {noformat} message root { required int32 f; } did not equal message root { required int32 f (DECIMAL(9,0)); } {noformat} This is because {{PrimitiveType.union}} doesn't handle original type properly. An open question is that, can two primitive types with the same primitive type name but different original types be unioned? > PrimitiveType.union erases original type > > > Key: PARQUET-379 > URL: https://issues.apache.org/jira/browse/PARQUET-379 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > > The following ScalaTest test case > {code} > test("merge primitive types") { > val expected = > Types.buildMessage() > .addField( > Types > .required(INT32) > .as(DECIMAL) > .precision(9) > .scale(0) > .named("f")) > .named("root") > assert(expected.union(expected) === expected) > } > {code} > produces the following assertion error > {noformat} > message root { > required int32 f; > } > did not equal message root { > required int32 f (DECIMAL(9,0)); > } > {noformat} > This is because {{PrimitiveType.union}} doesn't handle original type > properly. An open question is that, can two primitive types with the same > primitive type name but different original types be unioned? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
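The same erasure can be observed from plain Java; a minimal sketch mirroring the ScalaTest case above (class name is hypothetical, packages as of parquet-mr 1.8.x):
{code}
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.Types;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32;

public class Parquet379Repro {
  public static void main(String[] args) {
    MessageType expected = Types.buildMessage()
        .addField(Types.required(INT32)
            .as(OriginalType.DECIMAL).precision(9).scale(0)
            .named("f"))
        .named("root");
    // Prints "message root { required int32 f; }": the DECIMAL(9,0)
    // annotation is dropped by union(), even when merging a schema with itself.
    System.out.println(expected.union(expected));
  }
}
{code}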
[jira] [Updated] (PARQUET-371) Bumps Thrift version to 0.9.0
[ https://issues.apache.org/jira/browse/PARQUET-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-371: --- Summary: Bumps Thrift version to 0.9.0 (was: Add thrift9 Maven profile for parquet-format) > Bumps Thrift version to 0.9.0 > - > > Key: PARQUET-371 > URL: https://issues.apache.org/jira/browse/PARQUET-371 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Cheng Lian > > Thrift 0.7.0 is too old a version, and it doesn't compile on Mac. Would be > nice to have a {{thrift9}} Maven profile similar to what we did for > parquet-mr to bump Thrift to 0.9. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-371) Bumps Thrift version to 0.9.0
[ https://issues.apache.org/jira/browse/PARQUET-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-371: --- Description: Thrift 0.7.0 is too old a version, and it doesn't compile on Mac. Would be nice to bump Thrift version. (was: Thrift 0.7.0 is too old a version, and it doesn't compile on Mac. Would be nice to have a {{thrift9}} Maven profile similar to what we did for parquet-mr to bump Thrift to 0.9.) > Bumps Thrift version to 0.9.0 > - > > Key: PARQUET-371 > URL: https://issues.apache.org/jira/browse/PARQUET-371 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Cheng Lian > > Thrift 0.7.0 is too old a version, and it doesn't compile on Mac. Would be > nice to bump Thrift version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (PARQUET-370) Nested records are not properly read if none of their fields are requested
[ https://issues.apache.org/jira/browse/PARQUET-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734568#comment-14734568 ] Cheng Lian edited comment on PARQUET-370 at 9/10/15 11:43 AM: -- A complete sample code for reproducing this issue against parquet-mr 1.7.0 can be found in [lianch...@github.com/parquet-compat|https://github.com/liancheng/parquet-compat/blob/cbd9dd89b015049c43054c5db81737405f6618e2/src/test/scala/com/databricks/parquet/schema/SchemaEvolutionSuite.scala#L9-L61]. This sample writes a Parquet file with schema {{S1}} and reads it back with {{S2}} as requested schema using parquet-avro. Related Avro IDL definition can be found [here|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.7.0/src/main/avro/parquet-avro-compat.avdl]. BTW, this repository is a playground of mine for investigating various Parquet compatibility and interoperability issues. The Scala DSL illustrated in the sample code is inspired by the {{writeDirect}} method in parquet-avro testing code. It is defined [here|https://github.com/liancheng/parquet-compat/blob/db39ec3437abd3c254457c39193685e9f9dee1ed/src/main/scala/com/databricks/parquet/dsl/package.scala]. I found it pretty neat and intuitive for building test cases, and we are using a similar testing API in Spark. was (Author: lian cheng): A complete sample code for reproducing this issue against parquet-mr 1.7.0 can be found in [lianch...@github.com/parquet-compat|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.7.0/src/main/scala/com/databricks/parquet/schema/PARQUET_370.scala]. This sample writes a Parquet file with schema {{S1}} and reads it back with {{S2}} as requested schema using parquet-avro. Related Avro IDL definition can be found [here|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.7.0/src/main/avro/parquet-avro-compat.avdl]. The version against parquet-mr 1.8.1 is [here|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.8.1/src/main/scala/com/databricks/parquet/schema/PARQUET_370.scala]. BTW, this repository is a playground of mine for investigating various Parquet compatibility and interoperability issues. The Scala DSL illustrated in the sample code is inspired by the {{writeDirect}} method in parquet-avro testing code. It is defined [here|https://github.com/liancheng/parquet-compat/blob/db39ec3437abd3c254457c39193685e9f9dee1ed/src/main/scala/com/databricks/parquet/dsl/package.scala]. I found it pretty neat and intuitive for building test cases, and we are using a similar testing API in Spark. > Nested records are not properly read if none of their fields are requested > -- > > Key: PARQUET-370 > URL: https://issues.apache.org/jira/browse/PARQUET-370 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.1 >Reporter: Cheng Lian > > Say we have a Parquet file {{F}} with the following schema {{S1}}: > {noformat} > message root { > required group n { > optional int32 a; > optional int32 b; > } > } > {noformat} > Later on, as the schema evolves, fields {{a}} and {{b}} are removed, while > {{c}} and {{d}} are added. Now we have schema {{S2}}: > {noformat} > message root { > required group n { > optional int32 c; > optional int32 d; > } > } > {noformat} > {{S1}} and {{S2}} are compatible, so it should be OK to read {{F}} with > {{S2}} as requested schema. 
> Say {{F}} contains a single record: > {noformat} > {"n": {"a": 1, "b": 2}} > {noformat} > When reading {{F}} with {{S2}}, expected output should be: > {noformat} > {"n": {"c": null, "d": null}} > {noformat} > But currently parquet-mr gives > {noformat} > {"n": null} > {noformat} > This is because {{MessageColumnIO}} finds that the physical Parquet file > contains no leaf columns defined in the requested schema, and shortcuts > record reading with an {{EmptyRecordReader}} for column {{n}}. See > [here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
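To make the failure mode concrete, here is a hedged parquet-avro sketch of the read path described above (the Avro schema string, record names, and file path are illustrative assumptions; {{AvroReadSupport.setRequestedProjection}} is the standard way to pass the evolved schema {{S2}} down as the requested schema):
{code}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class Parquet370Repro {
  public static void main(String[] args) throws Exception {
    // Evolved schema S2: the nested record keeps its name, but none of the
    // leaf fields present in the file survive (hypothetical Avro schema).
    Schema s2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"root\",\"fields\":[" +
        "{\"name\":\"n\",\"type\":{\"type\":\"record\",\"name\":\"n_t\",\"fields\":[" +
        "{\"name\":\"c\",\"type\":[\"null\",\"int\"],\"default\":null}," +
        "{\"name\":\"d\",\"type\":[\"null\",\"int\"],\"default\":null}]}}]}");

    Configuration conf = new Configuration();
    AvroReadSupport.setRequestedProjection(conf, s2);

    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(new Path("/tmp/parquet-370.parquet"))
                 .withConf(conf)
                 .build()) {
      // Expected {"n": {"c": null, "d": null}}, but parquet-mr yields {"n": null}
      // because no leaf column of the requested schema exists in the file.
      System.out.println(reader.read());
    }
  }
}
{code}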
[jira] [Commented] (PARQUET-371) Add thrift9 Maven profile for parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740068#comment-14740068 ] Cheng Lian commented on PARQUET-371: That would be even nicer. I'll update my PR. > Add thrift9 Maven profile for parquet-format > > > Key: PARQUET-371 > URL: https://issues.apache.org/jira/browse/PARQUET-371 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Cheng Lian > > Thrift 0.7.0 is too old a version, and it doesn't compile on Mac. Would be > nice to have a {{thrift9}} Maven profile similar to what we did for > parquet-mr to bump Thrift to 0.9. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-371) Add thrift9 Maven profile for parquet-format
Cheng Lian created PARQUET-371: -- Summary: Add thrift9 Maven profile for parquet-format Key: PARQUET-371 URL: https://issues.apache.org/jira/browse/PARQUET-371 Project: Parquet Issue Type: Improvement Components: parquet-format Reporter: Cheng Lian Thrift 0.7.0 is too old a version, and it doesn't compile on Mac. Would be nice to have a {{thrift9}} Maven profile similar to what we did for parquet-mr to bump Thrift to 0.9. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-370) Nested records are not properly read if none of their fields are requested
[ https://issues.apache.org/jira/browse/PARQUET-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734568#comment-14734568 ] Cheng Lian commented on PARQUET-370: A complete sample code for reproducing this issue against parquet-mr 1.7.0 can be found in [lianch...@github.com/parquet-compat|https://github.com/liancheng/parquet-compat/blob/db39ec3437abd3c254457c39193685e9f9dee1ed/src/main/scala/com/databricks/parquet/schema/PARQUET_370.scala]. This sample writes a Parquet file with schema {{S1}} and reads it back with {{S2}} as requested schema using parquet-avro. The version against parquet-mr 1.8.1 is [here|https://github.com/liancheng/parquet-compat/tree/with-parquet-mr-1.8.1]. BTW, this repository is a playground of mine for investigating various Parquet compatibility and interoperability issues. The Scala DSL illustrated in the sample code is inspired by the {{writeDirect}} method in parquet-avro testing code. It is defined [here|https://github.com/liancheng/parquet-compat/blob/db39ec3437abd3c254457c39193685e9f9dee1ed/src/main/scala/com/databricks/parquet/dsl/package.scala]. I found it pretty neat and intuitive for building test cases, and we are using a similar testing API in Spark. > Nested records are not properly read if none of their fields are requested > -- > > Key: PARQUET-370 > URL: https://issues.apache.org/jira/browse/PARQUET-370 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.1 >Reporter: Cheng Lian > > Say we have a Parquet file {{F}} with the following schema {{S1}}: > {noformat} > message root { > required group n { > optional int32 a; > optional int32 b; > } > } > {noformat} > Later on, as the schema evolves, fields {{a}} and {{b}} are removed, while > {{c}} and {{d}} are added. Now we have schema {{S2}}: > {noformat} > message root { > required group n { > optional int32 c; > optional int32 d; > } > } > {noformat} > {{S1}} and {{S2}} are compatible, so it should be OK to read {{F}} with > {{S2}} as requested schema. > Say {{F}} contains a single record: > {noformat} > {"n": {"a": 1, "b": 2}} > {noformat} > When reading {{F}} with {{S2}}, expected output should be: > {noformat} > {"n": {"c": null, "d": null}} > {noformat} > But currently parquet-mr gives > {noformat} > {"n": null} > {noformat} > This is because {{MessageColumnIO}} finds that the physical Parquet file > contains no leaf columns defined in the requested schema, and shortcuts > record reading with an {{EmptyRecordReader}} for column {{n}}. See > [here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
[ https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14733506#comment-14733506 ] Cheng Lian commented on PARQUET-369: Here is a more concrete version in another thread https://groups.google.com/d/msg/parquet-dev/UjpbHbzoQj0/R6LG2gECQuIJ [~julienledem] According to the link above, is testing the only reason why we have the static JUL initialization block in parquet-mr? If that is true, I'm happy to have a try to remove it. We've been using pretty hacky way in Spark to redirect Parquet JUL logs. > Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder > --- > > Key: PARQUET-369 > URL: https://issues.apache.org/jira/browse/PARQUET-369 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Cheng Lian > > Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see > [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]). > This also accidentally shades [this > line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207] > {code} > private static String STATIC_LOGGER_BINDER_PATH = > "org/slf4j/impl/StaticLoggerBinder.class"; > {code} > to > {code} > private static String STATIC_LOGGER_BINDER_PATH = > "parquet/org/slf4j/impl/StaticLoggerBinder.class"; > {code} > and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} > implementation even if we provide dependencies like {{slf4j-log4j12}} on the > classpath. > This happens in Spark. Whenever we write a Parquet file, we see the following > famous message and can never get rid of it: > {noformat} > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
[ https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-369: --- Description: Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]). This also accidentally shades [this line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207] {code} private static String STATIC_LOGGER_BINDER_PATH = "org/slf4j/impl/StaticLoggerBinder.class"; {code} to {code} private static String STATIC_LOGGER_BINDER_PATH = "parquet/org/slf4j/impl/StaticLoggerBinder.class"; {code} and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} implementation even if we provide dependencies like {{slf4j-log4j12}} on the classpath. This happens in Spark. Whenever we write a Parquet file, we see the following famous message and can never get rid of it: {noformat} SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. {noformat} was: Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]}. This also accidentally shades [this line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207] {code} private static String STATIC_LOGGER_BINDER_PATH = "org/slf4j/impl/StaticLoggerBinder.class"; {code} to {code} private static String STATIC_LOGGER_BINDER_PATH = "parquet/org/slf4j/impl/StaticLoggerBinder.class"; {code} and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} implementation even if we provide dependencies like {{slf4j-log4j12}} on the classpath. This happens in Spark. Whenever we write a Parquet file, we see the following famous message and can never get rid of it: {noformat} SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. {noformat} > Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder > --- > > Key: PARQUET-369 > URL: https://issues.apache.org/jira/browse/PARQUET-369 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Cheng Lian > > Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see > [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]). > This also accidentally shades [this > line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207] > {code} > private static String STATIC_LOGGER_BINDER_PATH = > "org/slf4j/impl/StaticLoggerBinder.class"; > {code} > to > {code} > private static String STATIC_LOGGER_BINDER_PATH = > "parquet/org/slf4j/impl/StaticLoggerBinder.class"; > {code} > and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} > implementation even if we provide dependencies like {{slf4j-log4j12}} on the > classpath. > This happens in Spark. Whenever we write a Parquet file, we see the following > famous message and can never get rid of it: > {noformat} > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 
> SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
Cheng Lian created PARQUET-369: -- Summary: Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder Key: PARQUET-369 URL: https://issues.apache.org/jira/browse/PARQUET-369 Project: Parquet Issue Type: Bug Components: parquet-format Reporter: Cheng Lian Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]}. This also accidentally shades [this line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207] {code} private static String STATIC_LOGGER_BINDER_PATH = "org/slf4j/impl/StaticLoggerBinder.class"; {code} to {code} private static String STATIC_LOGGER_BINDER_PATH = "parquet/org/slf4j/impl/StaticLoggerBinder.class"; {code} and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} implementation even if we provide dependencies like {{slf4j-log4j12}} on the classpath. This happens in Spark. Whenever we write a Parquet file, we see the following famous message and can never get rid of it: {noformat} SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
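For context on why the shaded constant is fatal: SLF4J 1.7.x discovers its binding by scanning the classpath for that exact resource path, roughly as in the sketch below, so once the string constant is relocated the scan can never find the real {{slf4j-log4j12}} binding:
{code}
import java.net.URL;
import java.util.Enumeration;

public class Slf4jBinderLookup {
  public static void main(String[] args) throws Exception {
    // What LoggerFactory effectively does in SLF4J 1.7.x. After shading, the
    // constant becomes "parquet/org/slf4j/impl/StaticLoggerBinder.class" and
    // this enumeration is always empty, hence the NOP fallback.
    String path = "org/slf4j/impl/StaticLoggerBinder.class";
    Enumeration<URL> bindings =
        Slf4jBinderLookup.class.getClassLoader().getResources(path);
    while (bindings.hasMoreElements()) {
      System.out.println("Found binding: " + bindings.nextElement());
    }
  }
}
{code}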
[jira] [Updated] (PARQUET-364) Parquet-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)
[ https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-364: --- Summary: Parquet-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>) (was: Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)) > Parquet-avro cannot decode Avro/Thrift array of primitive array (e.g. > array<array<int>>) > > > Key: PARQUET-364 > URL: https://issues.apache.org/jira/browse/PARQUET-364 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > Attachments: bad-avro.parquet, bad-thrift.parquet > > > The problematic Avro and Thrift schemas are: > {noformat} > record AvroArrayOfArray { > array<array<int>> int_arrays_column; > } > {noformat} > and > {noformat} > struct ThriftListOfList { > 1: list<list<i32>> intArraysColumn; > } > {noformat} > They are converted to the following structurally equivalent Parquet schemas > by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively: > {noformat} > message AvroArrayOfArray { > required group int_arrays_column (LIST) { > repeated group array (LIST) { > repeated int32 array; > } > } > } > {noformat} > and > {noformat} > message ParquetSchema { > required group intListsColumn (LIST) { > repeated group intListsColumn_tuple (LIST) { > repeated int32 intListsColumn_tuple_tuple; > } > } > } > {noformat} > {{AvroIndexedRecordConverter}} cannot decode such records correctly. The > reason is that the 2nd level repeated group {{array}} doesn't pass > {{AvroIndexedRecordConverter.isElementType()}} check. We should check for > field name "array" and field name suffix "_tuple" in {{isElementType()}} to > fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-367) parquet-cat -j doesn't show all records
Cheng Lian created PARQUET-367: -- Summary: parquet-cat -j doesn't show all records Key: PARQUET-367 URL: https://issues.apache.org/jira/browse/PARQUET-367 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.8.0, 1.8.1, 1.9.0 Reporter: Cheng Lian {noformat} $ parquet-cat old-repeated-int.parquet repeatedInt = 1 repeatedInt = 2 repeatedInt = 3 $ parquet-cat -j old-repeated-int.parquet {"repeatedInt":3} {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)
[ https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708387#comment-14708387 ] Cheng Lian commented on PARQUET-364: Sent out a PR: https://github.com/apache/parquet-mr/pull/264 > Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. > array<array<int>>) > --- > > Key: PARQUET-364 > URL: https://issues.apache.org/jira/browse/PARQUET-364 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > Attachments: bad-avro.parquet, bad-thrift.parquet > > > The problematic Avro and Thrift schemas are: > {noformat} > record AvroArrayOfArray { > array<array<int>> int_arrays_column; > } > {noformat} > and > {noformat} > struct ThriftListOfList { > 1: list<list<i32>> intArraysColumn; > } > {noformat} > They are converted to the following structurally equivalent Parquet schemas > by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively: > {noformat} > message AvroArrayOfArray { > required group int_arrays_column (LIST) { > repeated group array (LIST) { > repeated int32 array; > } > } > } > {noformat} > and > {noformat} > message ParquetSchema { > required group intListsColumn (LIST) { > repeated group intListsColumn_tuple (LIST) { > repeated int32 intListsColumn_tuple_tuple; > } > } > } > {noformat} > {{AvroIndexedRecordConverter}} cannot decode such records correctly. The > reason is that the 2nd level repeated group {{array}} doesn't pass > {{AvroIndexedRecordConverter.isElementType()}} check. We should check for > field name "array" and field name suffix "_tuple" in {{isElementType()}} to > fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro array of primitive array (e.g. array<array<int>>)
[ https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706766#comment-14706766 ] Cheng Lian commented on PARQUET-364: Although I haven't verified it yet, I suspect parquet-thrift suffers from a similar issue, e.g. cannot decode Parquet records translated from Thrift structure like {{list<list<i32>>}}. > Parque-avro cannot decode Avro array of primitive array (e.g. > array<array<int>>) > > > Key: PARQUET-364 > URL: https://issues.apache.org/jira/browse/PARQUET-364 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > > The following Avro schema > {noformat} > record AvroNonNullableArrays { > array<array<int>> int_arrays_column; > } > {noformat} > is translated into the following Parquet schema by parquet-avro 1.7.0: > {noformat} > message root { > required group int_arrays_column (LIST) { > repeated group array (LIST) { > repeated int32 array; > } > } > } > {noformat} > {{AvroIndexedRecordConverter}} cannot decode such records correctly. The > reason is that the 2nd level repeated group {{array}} doesn't pass > {{AvroIndexedRecordConverter.isElementType()}} check. We probably should check > for field name "array" in {{isElementType()}} to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro array of primitive array (e.g. array<array<int>>)
[ https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706760#comment-14706760 ] Cheng Lian commented on PARQUET-364: Tried to write a test case in parquet-mr, but failed to build parquet-mr locally on OSX 10.10 because of some environment issue. Verified this bug while fixing SPARK-10136, which is the Spark version of this bug. And here is a Spark SQL {{ParquetAvroCompatibilitySuite}} test case for reproducing this issue: {code} test("PARQUET-364 avro array of primitive array") { withTempPath { dir => val path = dir.getCanonicalPath val records = (0 until 3).map { i => AvroArrayOfArray.newBuilder() .setIntArraysColumn( Seq.tabulate(3, 3)((j, k) => i + j * 3 + k: Integer).map(_.asJava).asJava) .build() } val writer = new AvroParquetWriter[AvroArrayOfArray]( new Path(path), AvroArrayOfArray.getClassSchema) records.foreach(writer.write) writer.close() val reader = AvroParquetReader.builder[AvroArrayOfArray](new Path(path)).build() assert((0 until 3).map(_ => reader.read()) === records) } } {code} Exception: {noformat} [info] - PARQUET-364 avro array of primitive array *** FAILED *** (428 milliseconds) [info] java.lang.ClassCastException: repeated int32 array is not a group [info] at org.apache.parquet.schema.Type.asGroupType(Type.java:202) [info] at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:144) [info] at org.apache.parquet.avro.AvroIndexedRecordConverter.access$200(AvroIndexedRecordConverter.java:42) [info] at org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter$ElementConverter.<init>(AvroIndexedRecordConverter.java:548) [info] at org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.<init>(AvroIndexedRecordConverter.java:480) [info] at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:144) [info] at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:89) [info] at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:60) [info] at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:34) [info] at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:111) [info] at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:174) [info] at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:151) [info] at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:127) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$13.apply(ParquetAvroCompatibilitySuite.scala:186) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$13.apply(ParquetAvroCompatibilitySuite.scala:186) [info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info] at scala.collection.immutable.Range.foreach(Range.scala:141) [info] at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) [info] at scala.collection.AbstractTraversable.map(Traversable.scala:105) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5$$anonfun$apply$mcV$sp$4.apply(ParquetAvroCompatibilitySuite.scala:186) [info] at
org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5$$anonfun$apply$mcV$sp$4.apply(ParquetAvroCompatibilitySuite.scala:170) [info] at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:117) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetCompatibilityTest.withTempPath(ParquetCompatibilityTest.scala:31) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5.apply$mcV$sp(ParquetAvroCompatibilitySuite.scala:170) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5.apply(ParquetAvroCompatibilitySuite.scala:170) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetAvroCompatibilitySuite$$anonfun$5.apply(ParquetAvroCompatibilitySuite.scala:170) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[jira] [Updated] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)
[ https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-364: --- Description: The problematic Avro and Thrift schemas are: {noformat} record AvroArrayOfArray { array<array<int>> int_arrays_column; } {noformat} and {noformat} struct ThriftListOfList { 1: list<list<i32>> intArraysColumn; } {noformat} They are converted to the following structurally equivalent Parquet schemas by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively: {noformat} message AvroArrayOfArray { required group int_arrays_column (LIST) { repeated group array (LIST) { repeated int32 array; } } } {noformat} and {noformat} message ParquetSchema { required group intListsColumn (LIST) { repeated group intListsColumn_tuple (LIST) { repeated int32 intListsColumn_tuple_tuple; } } } {noformat} {{AvroIndexedRecordConverter}} cannot decode such records correctly. The reason is that the 2nd level repeated group {{array}} doesn't pass {{AvroIndexedRecordConverter.isElementType()}} check. We should check for field name "array" and field name suffix "_tuple" in {{isElementType()}} to fix this issue. was: The problematic Avro and Thrift schemas are: {noformat} record AvroArrayOfArray { array<array<int>> int_arrays_column; } {noformat} and {noformat} struct ThriftListOfList { 1: list<list<i32>> intArraysColumn; } {noformat} They are converted to the following Parquet schemas by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively: {noformat} message AvroArrayOfArray { required group int_arrays_column (LIST) { repeated group array (LIST) { repeated int32 array; } } } {noformat} and {noformat} message ParquetSchema { required group intListsColumn (LIST) { repeated group intListsColumn_tuple (LIST) { repeated int32 intListsColumn_tuple_tuple; } } } {noformat} {{AvroIndexedRecordConverter}} cannot decode such records correctly. The reason is that the 2nd level repeated group {{array}} doesn't pass {{AvroIndexedRecordConverter.isElementType()}} check. We should check for field name "array" and field name suffix "_tuple" in {{isElementType()}} to fix this issue. > Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. > array<array<int>>) > --- > > Key: PARQUET-364 > URL: https://issues.apache.org/jira/browse/PARQUET-364 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > > The problematic Avro and Thrift schemas are: > {noformat} > record AvroArrayOfArray { > array<array<int>> int_arrays_column; > } > {noformat} > and > {noformat} > struct ThriftListOfList { > 1: list<list<i32>> intArraysColumn; > } > {noformat} > They are converted to the following structurally equivalent Parquet schemas > by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively: > {noformat} > message AvroArrayOfArray { > required group int_arrays_column (LIST) { > repeated group array (LIST) { > repeated int32 array; > } > } > } > {noformat} > and > {noformat} > message ParquetSchema { > required group intListsColumn (LIST) { > repeated group intListsColumn_tuple (LIST) { > repeated int32 intListsColumn_tuple_tuple; > } > } > } > {noformat} > {{AvroIndexedRecordConverter}} cannot decode such records correctly. The > reason is that the 2nd level repeated group {{array}} doesn't pass > {{AvroIndexedRecordConverter.isElementType()}} check. We should check for > field name "array" and field name suffix "_tuple" in {{isElementType()}} to > fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)
[ https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706959#comment-14706959 ] Cheng Lian commented on PARQUET-364: I tested the Thrift case with Thrift 0.9.2, because I couldn't get Thrift 0.7.0 to compile on Mac OS X 10.10 due to missing C++ header files. I assume that this doesn't change the essence of this issue. (BTW, any plan to upgrade to Thrift 0.9.2?) > Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. > array<array<int>>) > --- > > Key: PARQUET-364 > URL: https://issues.apache.org/jira/browse/PARQUET-364 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > > The problematic Avro and Thrift schemas are: > {noformat} > record AvroArrayOfArray { > array<array<int>> int_arrays_column; > } > {noformat} > and > {noformat} > struct ThriftListOfList { > 1: list<list<i32>> intArraysColumn; > } > {noformat} > They are converted to the following Parquet schemas by parquet-avro 1.7.0 > and parquet-thrift 1.7.0 respectively: > {noformat} > message AvroArrayOfArray { > required group int_arrays_column (LIST) { > repeated group array (LIST) { > repeated int32 array; > } > } > } > {noformat} > and > {noformat} > message ParquetSchema { > required group intListsColumn (LIST) { > repeated group intListsColumn_tuple (LIST) { > repeated int32 intListsColumn_tuple_tuple; > } > } > } > {noformat} > {{AvroIndexedRecordConverter}} cannot decode such records correctly. The > reason is that the 2nd level repeated group {{array}} doesn't pass > {{AvroIndexedRecordConverter.isElementType()}} check. We should check for > field name "array" and field name suffix "_tuple" in {{isElementType()}} to > fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)
[ https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-364: --- Description: The problematic Avro and Thrift schemas are: {noformat} record AvroArrayOfArray { array<array<int>> int_arrays_column; } {noformat} and {noformat} struct ThriftListOfList { 1: list<list<i32>> intArraysColumn; } {noformat} They are converted to the following Parquet schemas by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively: {noformat} message AvroArrayOfArray { required group int_arrays_column (LIST) { repeated group array (LIST) { repeated int32 array; } } } {noformat} and {noformat} message ParquetSchema { required group intListsColumn (LIST) { repeated group intListsColumn_tuple (LIST) { repeated int32 intListsColumn_tuple_tuple; } } } {noformat} {{AvroIndexedRecordConverter}} cannot decode such records correctly. The reason is that the 2nd level repeated group {{array}} doesn't pass {{AvroIndexedRecordConverter.isElementType()}} check. We should check for field name "array" and field name suffix "_tuple" in {{isElementType()}} to fix this issue. was: The following Avro schema {noformat} record AvroNonNullableArrays { array<array<int>> int_arrays_column; } {noformat} is translated into the following Parquet schema by parquet-avro 1.7.0: {noformat} message root { required group int_arrays_column (LIST) { repeated group array (LIST) { repeated int32 array; } } } {noformat} {{AvroIndexedRecordConverter}} cannot decode such records correctly. The reason is that the 2nd level repeated group {{array}} doesn't pass {{AvroIndexedRecordConverter.isElementType()}} check. We probably should check for field name "array" in {{isElementType()}} to fix this issue. > Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. > array<array<int>>) > --- > > Key: PARQUET-364 > URL: https://issues.apache.org/jira/browse/PARQUET-364 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > > The problematic Avro and Thrift schemas are: > {noformat} > record AvroArrayOfArray { > array<array<int>> int_arrays_column; > } > {noformat} > and > {noformat} > struct ThriftListOfList { > 1: list<list<i32>> intArraysColumn; > } > {noformat} > They are converted to the following Parquet schemas by parquet-avro 1.7.0 > and parquet-thrift 1.7.0 respectively: > {noformat} > message AvroArrayOfArray { > required group int_arrays_column (LIST) { > repeated group array (LIST) { > repeated int32 array; > } > } > } > {noformat} > and > {noformat} > message ParquetSchema { > required group intListsColumn (LIST) { > repeated group intListsColumn_tuple (LIST) { > repeated int32 intListsColumn_tuple_tuple; > } > } > } > {noformat} > {{AvroIndexedRecordConverter}} cannot decode such records correctly. The > reason is that the 2nd level repeated group {{array}} doesn't pass > {{AvroIndexedRecordConverter.isElementType()}} check. We should check for > field name "array" and field name suffix "_tuple" in {{isElementType()}} to > fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)
[ https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-364: --- Summary: Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>) (was: Parque-avro cannot decode Avro array of primitive array (e.g. array<array<int>>)) > Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. > array<array<int>>) > --- > > Key: PARQUET-364 > URL: https://issues.apache.org/jira/browse/PARQUET-364 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > > The following Avro schema > {noformat} > record AvroNonNullableArrays { > array<array<int>> int_arrays_column; > } > {noformat} > is translated into the following Parquet schema by parquet-avro 1.7.0: > {noformat} > message root { > required group int_arrays_column (LIST) { > repeated group array (LIST) { > repeated int32 array; > } > } > } > {noformat} > {{AvroIndexedRecordConverter}} cannot decode such records correctly. The > reason is that the 2nd level repeated group {{array}} doesn't pass > {{AvroIndexedRecordConverter.isElementType()}} check. We probably should check > for field name "array" in {{isElementType()}} to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro array of primitive array (e.g. array<array<int>>)
[ https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706938#comment-14706938 ] Cheng Lian commented on PARQUET-364: Verified that parquet-avro doesn't correctly decode Parquet records generated by parquet-thrift with Thrift type {{list<list<i32>>}} either. > Parque-avro cannot decode Avro array of primitive array (e.g. > array<array<int>>) > > > Key: PARQUET-364 > URL: https://issues.apache.org/jira/browse/PARQUET-364 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > > The following Avro schema > {noformat} > record AvroNonNullableArrays { > array<array<int>> int_arrays_column; > } > {noformat} > is translated into the following Parquet schema by parquet-avro 1.7.0: > {noformat} > message root { > required group int_arrays_column (LIST) { > repeated group array (LIST) { > repeated int32 array; > } > } > } > {noformat} > {{AvroIndexedRecordConverter}} cannot decode such records correctly. The > reason is that the 2nd level repeated group {{array}} doesn't pass > {{AvroIndexedRecordConverter.isElementType()}} check. We probably should check > for field name "array" in {{isElementType()}} to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-364) Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)
[ https://issues.apache.org/jira/browse/PARQUET-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707150#comment-14707150 ] Cheng Lian commented on PARQUET-364: [~rdblue] The suggested fix has been verified by [Spark PR #8361|https://github.com/apache/spark/pull/8361/files]. I'd like to deliver a PR for parquet-mr, but hit some local build issues. Please feel free to assign this issue to others. > Parque-avro cannot decode Avro/Thrift array of primitive array (e.g. > array<array<int>>) > --- > > Key: PARQUET-364 > URL: https://issues.apache.org/jira/browse/PARQUET-364 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > Attachments: bad-avro.parquet, bad-thrift.parquet > > > The problematic Avro and Thrift schemas are: > {noformat} > record AvroArrayOfArray { > array<array<int>> int_arrays_column; > } > {noformat} > and > {noformat} > struct ThriftListOfList { > 1: list<list<i32>> intArraysColumn; > } > {noformat} > They are converted to the following structurally equivalent Parquet schemas > by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively: > {noformat} > message AvroArrayOfArray { > required group int_arrays_column (LIST) { > repeated group array (LIST) { > repeated int32 array; > } > } > } > {noformat} > and > {noformat} > message ParquetSchema { > required group intListsColumn (LIST) { > repeated group intListsColumn_tuple (LIST) { > repeated int32 intListsColumn_tuple_tuple; > } > } > } > {noformat} > {{AvroIndexedRecordConverter}} cannot decode such records correctly. The > reason is that the 2nd level repeated group {{array}} doesn't pass > {{AvroIndexedRecordConverter.isElementType()}} check. We should check for > field name "array" and field name suffix "_tuple" in {{isElementType()}} to > fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
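A sketch of the name-based check proposed in the description (the helper name is hypothetical; the real logic lives inside {{AvroIndexedRecordConverter.isElementType()}}):
{code}
import org.apache.parquet.schema.Type;

final class ElementTypeCheck {
  // Sketch only: treat the repeated group as a synthetic element wrapper when
  // its name follows the legacy parquet-avro / parquet-thrift conventions.
  static boolean isSyntheticElementWrapper(Type repeatedType) {
    String name = repeatedType.getName();
    return name.equals("array")      // legacy parquet-avro list element
        || name.endsWith("_tuple");  // legacy parquet-thrift list element
  }
}
{code}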
[jira] [Created] (PARQUET-363) Cannot construct empty MessageType for ReadContext.requestedSchema
Cheng Lian created PARQUET-363: -- Summary: Cannot construct empty MessageType for ReadContext.requestedSchema Key: PARQUET-363 URL: https://issues.apache.org/jira/browse/PARQUET-363 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.8.0, 1.8.1 Reporter: Cheng Lian In parquet-mr 1.8.1, constructing empty {{GroupType}} (and thus {{MessageType}}) is not allowed anymore (see PARQUET-278). This change makes sense in most cases since Parquet doesn't support empty groups. However, there is one use case where an empty {{MessageType}} is valid, namely passing an empty {{MessageType}} as the {{requestedSchema}} constructor argument of {{ReadContext}} when counting rows in a Parquet file. The reason why it works is that, Parquet can retrieve row count from block metadata without materializing any columns. Take the following PySpark shell snippet ([1.5-SNAPSHOT|https://github.com/apache/spark/commit/010b03ed52f35fd4d426d522f8a9927ddc579209], which uses parquet-mr 1.7.0) as an example: {noformat} path = 'file:///tmp/foo' # Writes 10 integers into a Parquet file sqlContext.range(10).coalesce(1).write.mode('overwrite').parquet(path) sqlContext.read.parquet(path).count() 10 {noformat} Parquet related log lines: {noformat} 15/08/21 12:32:04 INFO CatalystReadSupport: Going to read the following fields from the Parquet file: Parquet form: message root { } Catalyst form: StructType() 15/08/21 12:32:04 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 10 records. 15/08/21 12:32:04 INFO InternalParquetRecordReader: at row 0. reading next block 15/08/21 12:32:04 INFO InternalParquetRecordReader: block read in memory in 0 ms. row count = 10 {noformat} We can see that Spark SQL passes no requested columns to the underlying Parquet reader. What happens here is that: # Spark SQL creates a {{CatalystRowConverter}} with zero converters (and thus only generates empty rows). # {{InternalParquetRecordReader}} first obtain the row count from block metadata ([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L184-L186]). # {{MessageColumnIO}} returns an {{EmptyRecordRecorder}} for reading the Parquet file ([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99]). # {{InternalParquetRecordReader.nextKeyValue()}} is invoked _n_ times, where _n_ equals to the row count. Each time, it invokes the converter created by Spark SQL and produces an empty Spark SQL row object. This issue is also the cause of HIVE-11611. Because when upgrading to Parquet 1.8.1, Hive worked around this issue by using {{tableSchema}} as {{requestedSchema}} when no columns are requested ([here|https://github.com/apache/hive/commit/3e68cdc9962cacab59ee891fcca6a736ad10d37d#diff-cc764a8828c4acc2a27ba717610c3f0bR233]). IMO this introduces a performance regression in cases like counting, because now we need to materialize all columns just for counting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
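A minimal sketch of the now-forbidden construction (the class name is hypothetical, and the exact exception type is an assumption based on the PARQUET-278 behavior in parquet-mr 1.8.x):
{code}
import java.util.Collections;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public class EmptyRequestedSchema {
  public static void main(String[] args) {
    // Works in 1.7.0; since 1.8.0 this throws (InvalidSchemaException: a group
    // type can not be empty), so an empty requested schema can no longer be
    // used for the count-only read path described above.
    MessageType empty = new MessageType("root", Collections.<Type>emptyList());
    System.out.println(empty);
  }
}
{code}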
[jira] [Updated] (PARQUET-173) StatisticsFilter doesn't handle And properly
[ https://issues.apache.org/jira/browse/PARQUET-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-173: --- Description: I guess it's [a pretty straightforward mistake|https://github.com/apache/parquet-mr/blob/4bf9be34a87b51d07e0b0c9e74831bbcdbce0f74/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L225-L237] :) {code} @Override public Boolean visit(And and) { return and.getLeft().accept(this) && and.getRight().accept(this); } @Override public Boolean visit(Or or) { // seems unintuitive to put an && not an || here // but we can only drop a chunk of records if we know that // both the left and right predicates agree that no matter what // we don't need this chunk. return or.getLeft().accept(this) && or.getRight().accept(this); } {code} The consequence is that filter predicates like {{a > 10 && a < 20}} can never drop any row groups. was: I guess it's [a pretty straightforward mistake|https://github.com/apache/incubator-parquet-mr/blob/4bf9be34a87b51d07e0b0c9e74831bbcdbce0f74/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L225-L237] :) {code} @Override public Boolean visit(And and) { return and.getLeft().accept(this) && and.getRight().accept(this); } @Override public Boolean visit(Or or) { // seems unintuitive to put an && not an || here // but we can only drop a chunk of records if we know that // both the left and right predicates agree that no matter what // we don't need this chunk. return or.getLeft().accept(this) && or.getRight().accept(this); } {code} The consequence is that filter predicates like {{a > 10 && a < 20}} can never drop any row groups. > StatisticsFilter doesn't handle And properly > > > Key: PARQUET-173 > URL: https://issues.apache.org/jira/browse/PARQUET-173 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.6.0 > > I guess it's [a pretty straightforward mistake|https://github.com/apache/parquet-mr/blob/4bf9be34a87b51d07e0b0c9e74831bbcdbce0f74/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L225-L237] :) > {code} > @Override > public Boolean visit(And and) { > return and.getLeft().accept(this) && and.getRight().accept(this); > } > @Override > public Boolean visit(Or or) { > // seems unintuitive to put an && not an || here > // but we can only drop a chunk of records if we know that > // both the left and right predicates agree that no matter what > // we don't need this chunk. > return or.getLeft().accept(this) && or.getRight().accept(this); > } > {code} > The consequence is that filter predicates like {{a > 10 && a < 20}} can never > drop any row groups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
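For reference, a sketch of the corrected {{canDrop}} semantics for {{And}}: since the visitor answers "can this chunk be dropped?", a conjunction can be dropped as soon as either side alone rules the chunk out:
{code}
@Override
public Boolean visit(And and) {
  // (left AND right) can be dropped if either side alone proves that no
  // record in the chunk can match; requiring both sides to agree (&&) is
  // only correct for Or.
  return and.getLeft().accept(this) || and.getRight().accept(this);
}
{code}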
[jira] [Updated] (PARQUET-136) NPE thrown in StatisticsFilter when all values in a string/binary column chunk are null
[ https://issues.apache.org/jira/browse/PARQUET-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-136: --- Description: For a string or a binary column, if all values in a single column chunk are null, so are the min/max values in the column chunk statistics. However, while checking the statistics for column chunk pruning, a null check is missing, and causes NPE. Corresponding code can be found [here|https://github.com/apache/parquet-mr/blob/251a495d2a72de7e892ade7f64980f51f2fcc0dd/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L97-L100]. This issue can be reliably reproduced with the following Spark shell snippet against Spark 1.2.0-SNAPSHOT ([013089794d|https://github.com/apache/spark/tree/013089794ddfffbae8b913b72c1fa6375774207a]): {code} import org.apache.spark.sql.SQLContext val sqlContext = new SQLContext(sc) import sqlContext._ case class StringCol(value: String) sc.parallelize(StringCol(null) :: Nil, 1).saveAsParquetFile("/tmp/empty.parquet") parquetFile("/tmp/empty.parquet").registerTempTable("null_table") sql("SET spark.sql.parquet.filterPushdown=true") sql("SELECT * FROM null_table WHERE value = 'foo'").collect() {code} Exception thrown: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException at parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206) at parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162) at parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100) at parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) at parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} was: For a string or a binary column, if
all values in a single column chunk are null, so are the min/max values in the column chunk statistics. However, while checking the statistics for column chunk pruning, a null check is missing, and causes NPE. Corresponding code can be found [here|https://github.com/apache/incubator-parquet-mr/blob/251a495d2a72de7e892ade7f64980f51f2fcc0dd/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L97-L100]. This issue can be reliably reproduced with the following Spark shell snippet against Spark 1.2.0-SNAPSHOT ([013089794d|https://github.com/apache/spark/tree/013089794ddfffbae8b913b72c1fa6375774207a]): {code} import org.apache.spark.sql.SQLContext val sqlContext = new SQLContext(sc) import sqlContext._ case class StringCol(value: String) sc.parallelize(StringCol(null) :: Nil, 1).saveAsParquetFile("/tmp/empty.parquet") parquetFile("/tmp/empty.parquet").registerTempTable("null_table") sql("SET spark.sql.parquet.filterPushdown=true") sql("SELECT * FROM null_table WHERE value = 'foo'").collect() {code} Exception thrown: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
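A hedged sketch of the missing guard in the {{Eq}} case of {{StatisticsFilter}} ({{hasNonNullValue()}} and the surrounding names are assumptions, not the committed patch):
{code}
// Sketch only: guard the min/max comparison so an all-null column chunk is
// kept (block might match) instead of dereferencing absent statistics.
if (!stats.hasNonNullValue()) {
  return false; // no min/max recorded: cannot prove the chunk can be dropped
}
// Safe to compare now: drop the chunk only if value lies outside [min, max].
return value.compareTo(stats.genericGetMin()) < 0
    || value.compareTo(stats.genericGetMax()) > 0;
{code}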
[jira] [Commented] (PARQUET-70) PARQUET #36: Pig Schema Storage to UDFContext
[ https://issues.apache.org/jira/browse/PARQUET-70?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622600#comment-14622600 ] Cheng Lian commented on PARQUET-70: --- Just remove the incubator- part of the URL: https://github.com/apache/parquet-mr/issues/36 > PARQUET #36: Pig Schema Storage to UDFContext > - > > Key: PARQUET-70 > URL: https://issues.apache.org/jira/browse/PARQUET-70 > Project: Parquet > Issue Type: Bug >Reporter: Daniel Weeks >Priority: Critical > Fix For: 1.6.0 > > https://github.com/apache/incubator-parquet-mr/pull/36 > The ParquetLoader was not storing the pig schema into the udfcontext for the full load case which causes a schema reload on the task side, erases the requested schema, and causes problems with column index access. > This fix stores the pig schema to both the udfcontext (for task side init) and jobcontext (for TupleReadSupport) along with other properties that should be set in the loader context (required field list and column index access toggle). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
[ https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580765#comment-14580765 ] Cheng Lian edited comment on PARQUET-222 at 6/10/15 4:37 PM: - Hey [~rdblue], it seems that you are referring to use cases like writing to Hive dynamic partitions (where a single task may need to write multiple Parquet files according to partition column values)? I believe the use case described in the JIRA description is different. Unlike Hive or Pig, which use process level parallelism, Spark uses thread level parallelism (tasks are executed in a thread pool). And currently, there's no way to pin a Spark task to a specific process. So even if each task is guaranteed to write at most one Parquet file, it's still possible for a single executor process to write multiple Parquet files at some point. So in the scope of Spark, currently there isn't a very good mechanism to fix this problem. What I suggested was essentially to shrink the partition number so that on average an executor writes only a single file (I made a mistake in my previous comment and said "at most one Parquet file"; it should be "on average"). In case of dynamic partitioning, we do plan to re-partition the data according to partition column values before writing the data to reduce the number of parallel writers. Another possible approach was to sort the data within each task before writing, so that only a single writer is active for a task at any point of time. was (Author: lian cheng): Hey [~rdblue], it seems that you are referring to use cases like writing to Hive dynamic partitions (where a single task may need to write multiple Parquet files according to Partition column values)? I believe the use case described in the JIRA description is different. Unlike Hive or Pig, which use process level parallelism, Spark uses thread level parallelism (tasks are executed in a thread pool). And currently, there's no way to pin a Spark task to a specific process. So even if each task is guaranteed to write at most one Parquet file, it's still possible for a single executor process to write multiple Parquet files at some point. So in the scope of Spark, currently there isn't a very good mechanism to fix this problem. What I suggested was essentially to shrink the partition number so that on average an executor writes only a single file (I made a mistake in my previous comment and said "at most one Parquet file"; it should be "on average"). In case of dynamic partitioning, we do plan to re-partition the data according to partition column values before writing the data to reduce the number of parallel writers. Another possible approach was to sort the data within each task before writing, so that only a single writer is active for a task at any point of time. > parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL > - > > Key: PARQUET-222 > URL: https://issues.apache.org/jira/browse/PARQUET-222 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.6.0 >Reporter: Chaozhong Yang > Original Estimate: 336h > Remaining Estimate: 336h > > In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or {{SchemaRDD}}. That function calls methods in parquet-mr, and sometimes it will fail due to the OOM error thrown by parquet-mr.
We can see the exception stack trace as follows: {noformat} WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87) at parquet.column.values.dictionary.IntList.<init>(IntList.java:83) at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85) at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549) at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88) at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74) at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68) at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56) at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178) at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369) at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108) at
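The two mitigations described in the comment above can be sketched as follows. This is illustrative only: the column-based {{repartition}} and {{sortWithinPartitions}} APIs come from later Spark releases, and the DataFrame {{df}}, the {{"date"}} partition column, and the output path are made-up examples:

{code}
// 1. Cluster rows by partition value so each task holds few open writers.
df.repartition(df("date"))
  .write.partitionBy("date").parquet("/tmp/out")

// 2. Or sort within each task so only one writer is active at any moment.
df.sortWithinPartitions("date")
  .write.partitionBy("date").parquet("/tmp/out")
{code}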
[jira] [Commented] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
[ https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580765#comment-14580765 ] Cheng Lian commented on PARQUET-222: Hey [~rdblue], it seems that you are referring to use cases like writing to Hive dynamic partitions (where a single task may need to write multiple Parquet files according to partition column values)? I believe the use case described in the JIRA description is different. Unlike Hive or Pig, which use process-level parallelism, Spark uses thread-level parallelism (tasks are executed in a thread pool). And currently there's no way to pin a Spark task to a specific process. So even if each task is guaranteed to write at most one Parquet file, it's still possible for a single executor process to write multiple Parquet files at some point. So within the scope of Spark, there currently isn't a very good mechanism to fix this problem. What I suggested was essentially to shrink the partition number so that on average an executor writes only a single file (I made a mistake in my previous comment and said at most one Parquet file; it should be on average). In the case of dynamic partitioning, we do plan to re-partition the data according to partition column values before writing it, to reduce the number of parallel writers. Another possible approach is to sort the data within each task before writing, so that only a single writer is active for a task at any point in time. parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL - Key: PARQUET-222 URL: https://issues.apache.org/jira/browse/PARQUET-222 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.6.0 Reporter: Chaozhong Yang Original Estimate: 336h Remaining Estimate: 336h In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or {{SchemaRDD}}. That function calls methods in parquet-mr, and sometimes it fails due to an OOM error thrown by parquet-mr. 
We can see the exception stack trace as follows: {noformat} WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87) at parquet.column.values.dictionary.IntList.<init>(IntList.java:83) at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85) at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549) at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88) at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74) at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68) at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56) at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178) at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369) at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108) at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94) at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} By the way, there is another similar issue https://issues.apache.org/jira/browse/PARQUET-99. But the reporter has closed it
[jira] [Comment Edited] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
[ https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580339#comment-14580339 ] Cheng Lian edited comment on PARQUET-222 at 6/10/15 4:39 PM: - Hey [~phatak.dev], finally got some time to try 1.3.1, and reproduced this OOM. While trying this case with 1.4, it got stuck in the query planner, so I was adjusting {{--driver-memory}}. In the case of 1.3.1, by tuning {{--executor-memory}}, I can see two kinds of exceptions. The first one is exactly the same as what you saw. In my test code, I create 26k INT columns, so Parquet tries to initialize 26k column writers, and each allocates a default slab (an {{int[]}}) with 64k elements. This takes at least {{26k * 64k * 4b = 6.34gb}} of memory. After increasing executor memory to 10g, I saw a similar exception thrown from {{RunLengthBitPackingHybridEncoder}}. I guess Parquet is trying to allocate an RLE encoder for each column here to perform compression (not 100% sure about this for now). Similarly, each encoder initializes a default slab (a {{byte[]}}) with at least 64k elements, and that's another {{26k * 64k * 1b = 1.6gb}} of memory. Only have a laptop for now, so... not sure how much memory it takes to write such a wide table. But essentially Parquet needs to pre-allocate some memory for each column to compress and buffer data, and 26k columns altogether just eat too much memory here. That's why even though your table has only a single row, it still causes an OOM. parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL - Key: PARQUET-222 URL: https://issues.apache.org/jira/browse/PARQUET-222 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.6.0 Reporter: Chaozhong Yang Original Estimate: 336h Remaining Estimate: 336h In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or {{SchemaRDD}}. That function calls methods in parquet-mr, and sometimes it fails due to an OOM error thrown by parquet-mr. 
We can see the exception stack trace as follows: {noformat} WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87) at parquet.column.values.dictionary.IntList.<init>(IntList.java:83) at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85) at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549) at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88) at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74) at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68) at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56) at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178) at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369) at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108) at
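A back-of-the-envelope check of the slab arithmetic quoted in the comment above, assuming the default initial slab of 64k elements per column writer and per encoder:

{code}
// Sanity check of the memory estimates quoted above.
val columns   = 26000L
val slabElems = 64L * 1024  // assumed default initial slab: 64K elements

val intSlabGb  = columns * slabElems * 4 / math.pow(2, 30)  // int[] dictionary slabs
val byteSlabGb = columns * slabElems * 1 / math.pow(2, 30)  // byte[] RLE encoder slabs

println(f"dictionary IntList slabs: $intSlabGb%.2f GB")  // ~6.35 GB
println(f"RLE encoder byte slabs:   $byteSlabGb%.2f GB") // ~1.59 GB
{code}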
[jira] [Created] (PARQUET-305) Logger instantiated for package org.apache.parquet may be GC-ed
Cheng Lian created PARQUET-305: -- Summary: Logger instantiated for package org.apache.parquet may be GC-ed Key: PARQUET-305 URL: https://issues.apache.org/jira/browse/PARQUET-305 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.7.0 Reporter: Cheng Lian Priority: Minor This ticket is derived from SPARK-8122. According to the Javadoc of [{{java.util.logging.Logger}}|https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html]: {quote} It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept. {quote} However, the only reference to [the {{Logger}} created for package {{org.apache.parquet}}|https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-common/src/main/java/org/apache/parquet/Log.java#L58] goes out of scope once the static initialization block finishes, so the Logger may be garbage collected at any time. More details can be found in [this comment|https://issues.apache.org/jira/browse/SPARK-8122?focusedCommentId=14574419&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14574419]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
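The usual remedy is to hold the package-level Logger in a field rather than a local, so the strong reference lives as long as the class. A minimal sketch (written in Scala for consistency with the other snippets here; this is illustrative, not the code from the eventual patch):

{code}
import java.util.logging.Logger

object Log {
  // Holding the Logger in a val, rather than in a local variable inside a
  // static initializer, keeps a strong reference for the lifetime of the
  // class, so the JVM cannot GC it and silently drop its level/handlers.
  private val parquetLogger: Logger = Logger.getLogger("org.apache.parquet")
}
{code}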
[jira] [Comment Edited] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
[ https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577257#comment-14577257 ] Cheng Lian edited comment on PARQUET-222 at 6/8/15 2:33 PM: Hey [~phatak.dev], thanks for the information. I tried to reproduce this issue with the following Spark shell snippet: {code} import sqlContext._ import sqlContext.implicits._ import org.apache.spark.sql.types._ import org.apache.spark.sql.Row val n = 26000 val schema = StructType((1 to n).map(i => StructField(s"f$i", IntegerType, nullable = false))) val bigRow = Row((1 to n): _*) val df = createDataFrame(sc.parallelize(bigRow :: Nil), schema) df.coalesce(1).write.mode("overwrite").format("orc").save("file:///tmp/foo") {code} I was using Spark 1.4.0-SNAPSHOT. The command line used to start the shell was: {noformat} ./bin/spark-shell --driver-memory 4g {noformat} I didn't get an OOM, but it hangs seemingly forever. After profiling it with YJP, it turns out that this super-wide table is somehow stressing out the query planner by making Spark SQL allocate a large number of small objects. Haven't tried 1.3.1 yet. Will do when I get time. I found that you had once posted this issue to the Spark user mailing list. Would you mind providing a full stack trace of the OOM error? Maybe it's more of a Spark SQL issue than a Parquet issue. parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL - Key: PARQUET-222 URL: https://issues.apache.org/jira/browse/PARQUET-222 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.6.0 Reporter: Chaozhong Yang Original Estimate: 336h Remaining Estimate: 336h In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or {{SchemaRDD}}. That function calls methods in parquet-mr, and sometimes it fails due to an OOM error thrown by parquet-mr. 
We can see the exception stack trace as follows: {noformat} WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87) at parquet.column.values.dictionary.IntList.<init>(IntList.java:83) at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85) at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549) at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88) at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74) at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68) at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56) at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178) at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369) at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108) at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94) at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
[jira] [Commented] (PARQUET-294) NPE in ParquetInputFormat.getSplits when no .parquet files exist
[ https://issues.apache.org/jira/browse/PARQUET-294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576258#comment-14576258 ] Cheng Lian commented on PARQUET-294: Is this related to PARQUET-151? NPE in ParquetInputFormat.getSplits when no .parquet files exist Key: PARQUET-294 URL: https://issues.apache.org/jira/browse/PARQUET-294 Project: Parquet Issue Type: Bug Affects Versions: 1.6.0 Reporter: Paul Nepywoda {code} JavaSparkContext context = ... JavaRDD<Row> rdd1 = context.parallelize(ImmutableList.<Row>of()); SQLContext sqlContext = new SQLContext(context); StructType schema = DataTypes.createStructType(ImmutableList.of(DataTypes.createStructField("col1", DataTypes.StringType, true))); DataFrame df = sqlContext.createDataFrame(rdd1, schema); String url = "file:///tmp/emptyRDD"; df.saveAsParquetFile(url); Configuration configuration = SparkHadoopUtil.get().newConfiguration(context.getConf()); JobConf jobConf = new JobConf(configuration); ParquetInputFormat.setReadSupportClass(jobConf, RowReadSupport.class); FileInputFormat.setInputPaths(jobConf, url); JavaRDD<Row> rdd2 = context.newAPIHadoopRDD( jobConf, ParquetInputFormat.class, Void.class, Row.class).values(); rdd2.count(); df = sqlContext.createDataFrame(rdd2, schema); url = "file:///tmp/emptyRDD2"; df.saveAsParquetFile(url); FileInputFormat.setInputPaths(jobConf, url); JavaRDD<Row> rdd3 = context.newAPIHadoopRDD( jobConf, ParquetInputFormat.class, Void.class, Row.class).values(); rdd3.count(); {code} The NPE happens here: {code} java.lang.NullPointerException at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:263) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) {code} This stems from ParquetFileWriter.getGlobalMetaData returning null when there are no footers to read. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
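Until the NPE is fixed, a hypothetical workaround is to verify that the directory actually contains part files before wiring it into a Hadoop RDD. A sketch in Scala, assuming a {{url}} and a Hadoop configuration analogous to the snippet above; the helper name is made up:

{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical guard, not part of the eventual fix: getSplits NPEs when the
// input yields no footers, so skip the read when no part files exist.
def hasParquetParts(url: String, conf: Configuration): Boolean = {
  val fs = FileSystem.get(new URI(url), conf)
  fs.listStatus(new Path(url)).exists(_.getPath.getName.endsWith(".parquet"))
}
{code}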
[jira] [Commented] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
[ https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575783#comment-14575783 ] Cheng Lian commented on PARQUET-222: There are several ways to alleviate this. Firstly, for DataFrames whose data size is small (e.g., the single-row case [~phatak.dev] mentioned), you may try {{df.coalesce(1)}} to reduce the partition number to 1. In this way, only a single file will be written. In most cases, the default parallelism equals the number of cores. For example, if you are running a Spark application with a single executor on a single 8-core node, that executor process needs to write 8 Parquet files even when there's only a single row. Secondly, when you are writing DataFrames with large data volumes, you may try to adjust the DataFrame partition number (via {{df.repartition(n)}} and/or {{df.coalesce(n)}}) and the executor number (via the {{--num-executors}} flag of {{spark-submit}}) so that the former is less than or equal to the latter; that way each executor process opens and writes at most one Parquet file at a time. And of course, the heap size of a single executor should be large enough for Parquet to write at least a single file. parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL - Key: PARQUET-222 URL: https://issues.apache.org/jira/browse/PARQUET-222 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.6.0 Reporter: Chaozhong Yang Original Estimate: 336h Remaining Estimate: 336h In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or {{SchemaRDD}}. That function calls methods in parquet-mr, and sometimes it fails due to an OOM error thrown by parquet-mr. We can see the exception stack trace as follows: {noformat} WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87) at parquet.column.values.dictionary.IntList.<init>(IntList.java:83) at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85) at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549) at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88) at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74) at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68) at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56) at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178) at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369) at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108) at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94) at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} By the way, there is another similar issue https://issues.apache.org/jira/browse/PARQUET-99. But the reporter has closed it and marked it as resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
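A sketch of the second suggestion in the comment above. The executor count, the DataFrame {{df}}, and the output path are illustrative; in a real job the value should mirror whatever {{--num-executors}} is set to:

{code}
// Illustrative only: cap the number of output partitions at the executor
// count so each executor process has at most one open Parquet writer.
val numExecutors = 8  // assumed to match the --num-executors flag
df.coalesce(numExecutors).saveAsParquetFile("hdfs:///tmp/wide_table")
{code}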
[jira] [Updated] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
[ https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-222: --- Description: In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or {{SchemaRDD}}. That function calls methods in parquet-mr, and sometimes it fails due to an OOM error thrown by parquet-mr. We can see the exception stack trace as follows: {noformat} WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87) at parquet.column.values.dictionary.IntList.<init>(IntList.java:83) at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85) at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549) at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88) at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74) at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68) at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56) at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178) at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369) at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108) at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94) at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} By the way, there is another similar issue https://issues.apache.org/jira/browse/PARQUET-99. But the reporter has closed it and marked it as resolved.
[jira] [Updated] (PARQUET-293) ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame
[ https://issues.apache.org/jira/browse/PARQUET-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated PARQUET-293: --- Description: I get scala.ScalaReflectionException: <none> is not a term when I try to convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF Has anyone else encountered this problem? I'm using Spark 1.3.1, Scala 2.10.4 and scrooge-sbt-plugin 3.16.3 Here is my thrift IDL: {code} namespace scala com.junk namespace java com.junk struct Junk { 10: i64 junkID, 20: string junkString } {code} from a spark-shell: {code} val junks = List( Junk(123L, "junk1"), Junk(567L, "junk2"), Junk(789L, "junk3") ) val junksRDD = sc.parallelize(junks) junksRDD.toDF {code} Exception thrown: {noformat} scala.ScalaReflectionException: <none> is not a term at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:259) at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:73) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:148) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:316) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:254) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34) at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:38) at $iwC$$iwC$$iwC.<init>(<console>:40) at $iwC$$iwC.<init>(<console>:42) at $iwC.<init>(<console>:44) at <init>(<console>:46) at .<init>(<console>:50) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {noformat}
[jira] [Commented] (PARQUET-293) ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame
[ https://issues.apache.org/jira/browse/PARQUET-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564332#comment-14564332 ] Cheng Lian commented on PARQUET-293: Hm, it's possible. But the context is a little too vague to diagnose. [~zzztimbo] Could you please provide more details? For example:
- Spark version
- Full exception stack trace
- How does your Spark program interact with Parquet? (I guess you were trying to save the Scrooge RDD as a Parquet file?)
- It would be great if you could provide a snippet that reproduces this issue.
ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame Key: PARQUET-293 URL: https://issues.apache.org/jira/browse/PARQUET-293 Project: Parquet Issue Type: Bug Components: parquet-format Affects Versions: 1.6.0 Reporter: Tim Chan I get scala.ScalaReflectionException: <none> is not a term when I try to convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF Has anyone else encountered this problem? -- This message was sent by Atlassian JIRA (v6.3.4#6332)