GitHub user rxin commented on the issue:

    https://github.com/apache/spark/pull/16281
  
    @nsync you raised an excellent question on test coverage. The kind of bugs 
we have seen in the past weren't really integration bugs, but bugs in 
parquet-mr. Technically it should be the job of parquet-mr to verify 
correctness and catch performance regressions. If we were to introduce a much 
broader set of regression tests in Spark, then to me it makes even more sense 
to just move the Parquet code into Spark and fix the issues found there. 
    
    Also, I have spent some time understanding the Parquet codec, and I have to 
say it is pretty powerful and complicated, and as a result fairly difficult to 
implement correctly. The Dremel format optimizes for sparse nested data, but is 
much more difficult to get right than a simpler dense format. 
    
    FWIW, the ideal scenario I can think of is to have parquet-mr publish 
bug-fix versions that don't include new features. That would make auditing 
updates easier and the updates themselves lower risk. 
    
    E.g. parquet-mr 2 adds new features, and 1.x releases contain only bug fixes. 


