[ https://issues.apache.org/jira/browse/PARQUET-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370458#comment-16370458 ]

Ryan Blue commented on PARQUET-860:
-----------------------------------

The S3 file system implementation should retry and recover if it is a transient 
error. In general, I'm skeptical that Parquet can reliably provide what you want.

Parquet makes no guarantee of durability until the close operation returns. As 
a consequence, you should not discard incoming records until then. This is why 
Parquet works better with systems like Kafka that have a long window of time 
where records can be replayed. I would not recommend Parquet as an output 
format for other streaming systems like Flume. Parquet works fine with MR or 
Spark, where the framework itself will retry a write task when an output 
writer throws an exception.
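
As a rough sketch of that contract (the {{writeBatch}} helper and 
{{commitOffsets}} callback below are hypothetical, not part of the Parquet 
API): keep the batch replayable on the source side and acknowledge it only 
after {{close()}} has returned.

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.parquet.hadoop.ParquetWriter;

class DurableBatchWrite {
  // Writes a replayable batch and signals success only once close() has
  // returned, since Parquet guarantees durability only at that point.
  static <T> void writeBatch(ParquetWriter<T> writer, List<T> batch, Runnable commitOffsets)
      throws IOException {
    try (ParquetWriter<T> w = writer) {
      for (T record : batch) {
        w.write(record); // buffered in row groups; not yet durable
      }
    } // close() happens here; durability is only guaranteed once it returns

    // Only now is it safe to discard the batch (e.g. commit Kafka offsets).
    // If write() or close() threw, we never reach this line and the batch
    // must be replayed from the source.
    commitOffsets.run();
  }
}
{code}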

> ParquetWriter.getDataSize NullPointerException after closed
> -----------------------------------------------------------
>
>                 Key: PARQUET-860
>                 URL: https://issues.apache.org/jira/browse/PARQUET-860
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.9.0
>         Environment: Linux prim 4.8.13-1-ARCH #1 SMP PREEMPT Fri Dec 9 
> 07:24:34 CET 2016 x86_64 GNU/Linux
> openjdk version "1.8.0_112"
> OpenJDK Runtime Environment (build 1.8.0_112-b15)
> OpenJDK 64-Bit Server VM (build 25.112-b15, mixed mode)
>            Reporter: Mike Mintz
>            Priority: Major
>
> When I run {{ParquetWriter.getDataSize()}}, it works normally. But after I 
> call {{ParquetWriter.close()}}, subsequent calls to {{ParquetWriter.getDataSize()}} 
> result in a NullPointerException.
> {noformat}
> java.lang.NullPointerException
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.getDataSize(InternalParquetRecordWriter.java:132)
>       at 
> org.apache.parquet.hadoop.ParquetWriter.getDataSize(ParquetWriter.java:314)
>       at FileBufferState.getFileSizeInBytes(FileBufferState.scala:83)
> {noformat}
> The reason for the NPE appears to be in 
> {{InternalParquetRecordWriter.getDataSize}}, where it assumes that 
> {{columnStore}} is not null.
> But the {{close()}} method calls {{flushRowGroupToStore()}} which sets 
> {{columnStore = null}}.
> I'm guessing that once the file is closed, we can just return 
> {{lastRowGroupEndPos}} since there should be no more buffered data, but I 
> don't fully understand how this class works.
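
A minimal sketch of the guard the description above suggests (field and method 
names are taken from the quoted stack trace and from 
{{InternalParquetRecordWriter}} in 1.9.0; this is a sketch, not the committed 
fix):

{code:java}
// org.apache.parquet.hadoop.InternalParquetRecordWriter (sketch only)
public long getDataSize() {
  if (columnStore == null) {
    // close() calls flushRowGroupToStore(), which sets columnStore = null.
    // At that point nothing is buffered, so the end position of the last
    // row group is the full data size.
    return lastRowGroupEndPos;
  }
  // While the writer is open, include the data still buffered in memory.
  return lastRowGroupEndPos + columnStore.getBufferedSize();
}
{code}

Callers that only need the final size could also record the value of 
{{getDataSize()}} immediately before calling {{close()}} as a workaround.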



