GitHub user AndreSchumacher commented on a diff in the pull request:

    https://github.com/apache/spark/pull/360#discussion_r11989530
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -153,9 +153,15 @@ case class InsertIntoParquetTable(
     
         val job = new Job(sc.hadoopConfiguration)
     
    -    ParquetOutputFormat.setWriteSupportClass(
    -      job,
    -      classOf[org.apache.spark.sql.parquet.RowWriteSupport])
    +    val writeSupport =
    +      if (child.output.map(_.dataType).forall(_.isPrimitive())) {
    +        logger.info("Initializing MutableRowWriteSupport")
    --- End diff ---
    
    @marmbrus Good question. I'm not yet totally sure myself. But consider the
following example: you have an array of structs, each of which has another
array as a field. So something like:
    
    `ArrayType(StructType(Seq(StructField("inner", ArrayType(IntegerType), nullable = false))))`
    
    Let's call the inner array `inner` and the outer array `outer`. Note that
`outer` could itself be just a field in a higher-level record.
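    
    For concreteness, a value of that type might look like this (plain Scala
stand-ins just to fix the names, not the actual Spark row classes):
    
    ```scala
    // `outer` is an array of structs; each struct carries its own `inner` array.
    case class Elem(inner: Seq[Int])
    val outer = Seq(Elem(Seq(1, 2, 3)), Elem(Seq(4, 5)))
    ```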
    
    Now whenever Parquet is done passing the data for the current `inner`, it
will let you know by calling `end` on the converter for that field, in this
case an array converter. At that point the current struct has been processed
completely, so its converter's `end` will be called, too. The current `outer`
record, however, may or may not be complete yet. If it's not complete, then
the current `inner` needs to be stored somewhere, and you cannot use a
mutable row for that, because it is not yet safe to reuse that chunk of
memory: it will be overwritten as soon as the next `inner` comes along.
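    
    To make the reuse hazard concrete, here is a minimal sketch of that
interaction. This is not the actual Parquet or Spark API; the tiny
`Converter` trait, `InnerArrayConverter`, and `onComplete` are all made up
for illustration:
    
    ```scala
    import scala.collection.mutable.ArrayBuffer
    
    // Simplified stand-in for Parquet's converter callbacks.
    trait Converter {
      def start(): Unit
      def end(): Unit
    }
    
    // Converter for the repeated `inner` field. One converter instance (and
    // hence one buffer) is reused for every `inner` array encountered.
    class InnerArrayConverter(onComplete: Seq[Int] => Unit) extends Converter {
      private val buffer = new ArrayBuffer[Int]
    
      def addInt(v: Int): Unit = buffer += v
    
      override def start(): Unit = buffer.clear()
    
      // Called when the current `inner` array ends. The enclosing `outer`
      // record may still be open, so hand an immutable *copy* upward:
      // `buffer` is about to be cleared and refilled for the next `inner`.
      override def end(): Unit = onComplete(buffer.toList)
    }
    
    object ReuseDemo extends App {
      val completed = ArrayBuffer.empty[Seq[Int]]
      val inner = new InnerArrayConverter(completed += _)
    
      // Two `inner` arrays arrive while the same `outer` record is still open:
      inner.start(); inner.addInt(1); inner.addInt(2); inner.end()
      inner.start(); inner.addInt(3); inner.end()
    
      // Because `end` copied, the first array survived the buffer reuse.
      assert(completed == Seq(Seq(1, 2), Seq(3)))
      // Had `end` handed up `buffer` itself (the mutable-row approach), both
      // entries would now read as Seq(3).
    }
    ```
    
    If the schema is flat and primitive-only, nothing nested can outlive the
next `start`, which is the situation where reusing a single mutable row is
safe.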
    
    Does this make any sense at all? I'm happy to discuss other solutions, too.

