Github user rxin commented on the issue: https://github.com/apache/spark/pull/16281

@nsync you raised an excellent question on test coverage. The kinds of bugs we have seen in the past weren't really integration bugs, but bugs in parquet-mr itself. Technically it should be the job of parquet-mr to verify correctness and guard against performance regressions. If we were to introduce a much broader set of regression tests in Spark, then to me it makes even more sense to just move the Parquet code into Spark and fix the issues found there.

I have also spent some time understanding the Parquet codec, and I have to say it is powerful and complicated, and as a result fairly difficult to implement correctly. The Dremel format optimizes for sparse nested data, but it is much harder to get right than a simpler dense format.

FWIW, the ideal scenario I can think of is for parquet-mr to publish bug fix versions that don't include new features. That would make update auditing easier and upgrades lower risk. E.g., parquet-mr 2.x adds new features, and 1.x releases are just bug fixes.
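To illustrate why the Dremel encoding is tricky to get right, here is a toy sketch (not parquet-mr's actual implementation) of computing repetition and definition levels for the simplest possible nested schema: a single column that is a repeated, optional int per record. Even in this degenerate case, empty lists and null elements need distinct level encodings, which is exactly the kind of edge case that has produced bugs.

```python
def dremel_levels(records):
    """Flatten records (each a list of ints/None) into
    (value, repetition_level, definition_level) triples.

    Toy schema: one repeated optional int32 column, so the
    max definition level is 2 (1 for the repeated group being
    non-empty, +1 for the element being non-null).
    """
    out = []
    for rec in records:
        if not rec:
            # Empty list: emit one placeholder so the record
            # boundary is still recoverable. d=0 means "nothing
            # past the record level is defined".
            out.append((None, 0, 0))
        else:
            for i, v in enumerate(rec):
                r = 0 if i == 0 else 1  # r=0 marks a new record
                d = 1 if v is None else 2  # d=1: list exists, element null
                out.append((v, r, d))
    return out

print(dremel_levels([[1, 2], [], [3, None]]))
# → [(1, 0, 2), (2, 1, 2), (None, 0, 0), (3, 0, 2), (None, 1, 1)]
```

With a dense format you would just write the values and a validity bitmap; here every value carries two levels whose legal combinations depend on the schema, and a decoder must reconstruct record boundaries purely from r=0 markers.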