[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2017-01-31 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 Thank you for sharing that information, @mallman. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2017-01-31 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16281 FYI, we've been using 1.9.0 patched with a fix for https://issues.apache.org/jira/browse/PARQUET-783 without problem.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2017-01-31 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 Hi, all. Now, I'm trying to upgrade Apache Spark to Parquet 1.8.2.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2017-01-27 Thread julienledem
Github user julienledem commented on the issue: https://github.com/apache/spark/pull/16281 FYI: Parquet 1.8.2 vote thread passed: https://mail-archives.apache.org/mod_mbox/parquet-dev/201701.mbox/%3CCAO4re1mHLT%2BLYn8s1RTEDZK8-9WSVugY8-HQqAN%2BtU%3DBOi1L9w%40mail.gmail.com%3E

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16281 The improvement is in how row groups are garbage collected. G1GC puts humongous allocations directly into the old generation, so you end up needing a full GC to reclaim the space. That just …
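The remedy Parquet adopted for the humongous-allocation problem is to buffer row-group data in many small chunks rather than one contiguous array, so no single allocation crosses G1's humongous threshold (half a heap region). A minimal sketch of the idea, in Python for illustration only — the chunk size and class are hypothetical, not parquet-mr code:

```python
# Illustrative sketch (not parquet-mr code): buffer a row group as a list of
# fixed-size chunks instead of one large contiguous array. On the JVM, G1GC
# treats any allocation larger than half a heap region as "humongous" and
# places it directly in the old generation; many small chunks avoid that path.
CHUNK_SIZE = 1 << 20  # 1 MiB, a hypothetical chunk size


class ChunkedBuffer:
    """Accumulates bytes in CHUNK_SIZE pieces rather than one big array."""

    def __init__(self):
        self.chunks = [bytearray()]

    def write(self, data: bytes) -> None:
        # Split incoming data so no single chunk ever exceeds CHUNK_SIZE.
        for start in range(0, len(data), CHUNK_SIZE):
            piece = data[start:start + CHUNK_SIZE]
            last = self.chunks[-1]
            room = CHUNK_SIZE - len(last)
            last += piece[:room]
            if len(piece) > room:
                self.chunks.append(bytearray(piece[room:]))

    def size(self) -> int:
        return sum(len(c) for c in self.chunks)
```

Each chunk stays small enough to be reclaimed by a young-generation collection instead of requiring a full GC.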

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread kiszk
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/16281 @rdblue Interesting, do you have any estimated or actual data for the performance improvement? I am interested in how you can achieve a performance improvement over `byte[]`. - Usage of …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 Thank you for confirming, @rdblue.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16281 I think we should move to a 1.8.2 patch release. The reason is that 1.9.0 moved to ByteBuffer-based reads and we've found at least one problem with it. ByteBuffer-based reads also change an …
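The byte[]-versus-ByteBuffer distinction being discussed can be sketched with a rough Python analogy (this is illustrative only, not parquet-mr code): a `bytes` slice copies data, while a `memoryview` slice is a zero-copy window onto the same underlying buffer — the property ByteBuffer-based reads exploit to avoid per-page copies, and also the reason they change aliasing behavior for callers:

```python
# Rough analogy for byte[] copies vs. ByteBuffer views.
page = bytearray(b"header--payload!")

copied = bytes(page)[8:]       # independent copy of the payload (byte[] style)
window = memoryview(page)[8:]  # zero-copy view of the same storage (ByteBuffer style)

page[8:9] = b"P"               # mutate the underlying buffer

# The copy is stale; the view observes the change.
assert copied == b"payload!"
assert bytes(window) == b"Payload!"
```

The view avoids a copy but now shares mutable state with the underlying buffer, which is the kind of semantic shift that can surface as subtle bugs downstream.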

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16281 Thanks @dongjoon-hyun! Let's get a Parquet 1.8.2 out in January.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 Sure. We are going to wait for Parquet 1.8.2. Maybe in a month, for Spark 2.1.1 in January? Anyway, thank you all! I'm closing this PR happily.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16281 As we understand the Parquet community is willing to put out patch releases, I don't see any major reason now to motivate the forking. That is great news for us. So do we postpone the upgrading to …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 @rdblue Thanks again! : )

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16281 Great! I'm glad it was just confusion. I completely agree with @srowen that forking should be a last resort. In the future, please reach out to the community, whether it's Parquet or …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 @julienledem @rdblue wow, that is so great!!! It will be much easier for us! I never expected the Parquet community would be willing to do regular patch releases. Sorry, that statement is …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16281 I'd love to see frequent, conservative patch releases. From my experience, Parquet bugs cause significant trouble for downstream consumers. For example, we encountered a data corruption bug writing …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread julienledem
Github user julienledem commented on the issue: https://github.com/apache/spark/pull/16281 Like @rdblue said, I don't recall people asking for a 1.8.2 on the parquet dev list. We are happy to help, and if there is a 1.8.x patch release branch it's better to maintain it in the …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16281 I don't think a fork is a good idea, nor do I think there is a reasonable need for one. @gatorsmile brought up that the Parquet community refused to build a patch release: "The problem is …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 +1

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16281 I don't think we are making a decision at this point to fork ... if we can really push parquet to make another maintenance bug fix release, that'd be great.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 I see the point. Then, the forked repository is going for 2.1.1 or 2.1-rc4 ASAP?

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 That is just an example. We definitely need the fix ASAP, right? See [the past release dates](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.parquet%22%20AND%20a%3A%22parquet%22)

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 Is the open issue the main purpose of forking? To resolve it faster than the Parquet community?

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 There is still an open JIRA: [PARQUET-686 Allow for Unsigned Statistics in Binary Type](https://issues.apache.org/jira/browse/PARQUET-686) Because of that bug, we disable the filter push-down …
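To make the PARQUET-686 point concrete: binary min/max statistics computed with a *signed* byte comparison order non-ASCII UTF-8 strings incorrectly, so a reader trusting those statistics for push-down could skip row groups that actually contain matches. A small illustrative sketch (not parquet-mr code; `signed_less` is a hypothetical helper mimicking signed-byte comparison):

```python
# Why signed byte comparison breaks binary statistics: UTF-8 continuation and
# lead bytes are >= 0x80, which a signed compare treats as negative.
def signed_less(a: bytes, b: bytes) -> bool:
    """Lexicographic compare treating each byte as a signed 8-bit value."""
    to_signed = lambda x: x - 256 if x >= 128 else x
    return [to_signed(c) for c in a] < [to_signed(c) for c in b]

ascii_val = "z".encode("utf-8")   # b'\x7a'
accented = "é".encode("utf-8")    # b'\xc3\xa9' -> first byte negative when signed

unsigned_min = min(ascii_val, accented)  # correct unsigned byte order
signed_min = accented if signed_less(accented, ascii_val) else ascii_val

assert unsigned_min == ascii_val  # 'z' (0x7a) sorts before 'é' (0xc3...) unsigned
assert signed_min == accented     # signed compare wrongly ranks 'é' first
```

With the signed minimum recorded in the footer, a predicate like `col >= 'z'` could incorrectly conclude a row group contains no matches — hence disabling string push-down until the statistics are fixed.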

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 @dongjoon-hyun I am not sure what your search conditions are. In a recent JIRA, we hit this issue: - PARQUET-389: Filter predicates should work with missing columns

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 For this issue, it was initially PARQUET-363, but all maintenance fixes (with new features) are welcome. For me, the following: - PARQUET-99: Large rows cause unnecessary OOM …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread ash211
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/16281 What are the specific patches to parquet that folks are proposing should be included in a parquet 1.8.1-spark1? Or what would be desired in a parquet-released 1.8.2?

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/16281 I agree with @srowen, forking should be the last resort.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16281 @nsyca you raised an excellent question on test coverage. The kind of bugs we have seen in the past weren't really integration bugs, but bugs in parquet-mr. Technically it should be the job of …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 @dongjoon-hyun I just checked the code changes in `1.2.1.spark2` compared with the official Hive 1.2.1: https://github.com/JoshRosen/hive/commits/release-1.2.1-spark2 Very …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16281 Btw issues are not just performance, but often correctness as well. As the default format, a bug in Parquet is much worse than a bug in, say, ORC.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16281 We haven't really added much to Hive though, and as a matter of fact the dependency on Hive is decreasing. Parquet is a much more manageable piece of code to fork. In the past we have seen fairly …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 Yep. Spark Thrift Server is different, but it's not actively maintained. For example, the default database feature was only recently added. I mean this one by `Spark Hive`: …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16281 Has anyone even asked for a new 1.8.x build from Parquet and been told it won't happen? You don't stop consuming non-fix changes by forking. You do that by staying on a maintenance branch.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 We are adding major code changes in Spark Thrift Server? What is the Spark Hive?

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 I think we are not adding new features into Parquet. The fixes must be small. To avoid the cost and risk, we need to reject all the major fixes in our special build. At the same time, we also …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 Yep. At the beginning, it starts like that. But, please look at Spark Hive or Spark Thrift Server. I don't think we are maintaining that well or visibly.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 I agree, but, from a long-term perspective, the risk and cost of forking could be the worst.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 @dongjoon-hyun What kind of questions/requests should we ask in the dev mailing list? IMO, the risk and cost are small if we make a special build by ourselves. We can get the bug fixes …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 Actually, this PR is about Apache Spark 2.2 in late March in terms of RC1. We have a lot of time to discuss. Why don't we discuss that on the dev mailing list?

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 Basically, the idea is to make a special build of Parquet 1.8.1 with the needed fixes by ourselves. Upgrading to a newer version like Parquet 1.9.0 is risky. Parquet 1.9.0 was just …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16281 I'd much rather lobby to release 1.8.2 and help with the legwork than do all that legwork and more to maintain a fork. It's still not clear to me that upgrading to 1.9.0 is not a solution?

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 The problem is the Parquet community will not create a branch 1.8.2+ for us. Upgrading to a newer version, 1.9 or 2.0, is always risky. Based on the history, we hit the bugs and performance …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16281 @gatorsmile @rdblue also works directly on Parquet. I am not seeing "unfixable" Parquet problems here. You're just pointing at problems that can and should be fixed, preferably in one place. Forking …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 @srowen Even if we fork our own version, it does not mean we will give up upgrading to the newer version. We just added a few fixes. This is very normal in the mission-critical …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread nsyca
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/16281 My two cents: - Do we have a Parquet specific test suite **with sufficient coverage** to run and back us up that this upgrade won't cause any regressions? I think simply moving up the version of …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16281 Yes, but we face bugs in third-party components all the time and work around them or get them fixed. There is an unmentioned downside here too: not getting bug fixes and improvements that don't …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16281 Those issues with Parquet are specific to certain Parquet versions. If upgrading Parquet can solve them, that can't justify the decision to fork Parquet. To fork such a project we need more …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 One more example: https://github.com/apache/spark/pull/16106 This issue degrades the performance.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 We keep hitting the issues in Parquet. Below is another example: https://issues.apache.org/jira/browse/SPARK-18539

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16281 Forking is a very bad thing, and only a last resort. We haven't properly managed the fork of Hive yet, even. I don't hear specific bugs to fork around here either. As such I can't see why this would …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-14 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16281 Parquet is the default format of Spark. It is pretty significant to Spark. Now, Parquet is becoming stable and it might be the right time to fork it. We are just fixing the bugs. @liancheng and …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-14 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 Thank you for the review, @rxin. Forking may give more controllability, but it goes invisible (in terms of documents) soon. According to the recent mail on the Spark dev list, only committers …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-14 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16281 I'm actually wondering if we should just fork Parquet and maintain a version of it, and add fixes ourselves to the fork. In the past Parquet updates often bring more regressions ...

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-14 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16281 Thank you for the review, @srowen. I see. First of all, I'll check the case of the mixed parquet jar files in the class path. Second, among [83 …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16281 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70140/

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16281 Merged build finished. Test PASSed.

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-14 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16281 **[Test build #70140 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70140/consoleFull)** for PR 16281 at commit …

[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-14 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16281 The tricky part about upgrading Parquet is whether it will continue to work with other dependencies that work with Parquet. It's also worth considering that a different Parquet may be on the …