Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
Thank you for sharing that information, @mallman.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16281
FYI, we've been using 1.9.0 patched with a fix for
https://issues.apache.org/jira/browse/PARQUET-783 without problem.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
Hi, all.
Now, I'm trying to upgrade Apache Spark to Parquet 1.8.2.
---
Github user julienledem commented on the issue:
https://github.com/apache/spark/pull/16281
FYI: Parquet 1.8.2 vote thread passed:
https://mail-archives.apache.org/mod_mbox/parquet-dev/201701.mbox/%3CCAO4re1mHLT%2BLYn8s1RTEDZK8-9WSVugY8-HQqAN%2BtU%3DBOi1L9w%40mail.gmail.com%3E
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/16281
The improvement is in how row groups are garbage collected. G1GC puts
humongous allocations directly into the old generation, so you end up needing a
full GC to reclaim the space. That just
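The G1 behavior described here can be sketched as a minimal model. G1 treats any allocation of at least half a heap region as "humongous" and places it directly in the old generation, so a full GC is needed to reclaim it. The region and buffer sizes below are illustrative, not taken from Parquet or HotSpot:

```java
public class HumongousCheck {
    // G1 marks an allocation "humongous" when it is at least half a region;
    // such objects bypass the young generation entirely.
    static boolean isHumongous(long allocationBytes, long regionBytes) {
        return allocationBytes >= regionBytes / 2;
    }

    public static void main(String[] args) {
        long region = 4L * 1024 * 1024; // e.g. -XX:G1HeapRegionSize=4m
        // A large row-group buffer goes straight to the old generation,
        // so only a full GC can reclaim its space:
        System.out.println(isHumongous(8L * 1024 * 1024, region)); // true
        // A small column-chunk buffer is a normal young-gen allocation:
        System.out.println(isHumongous(512L * 1024, region));      // false
    }
}
```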
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/16281
@rdblue Interesting, do you have any estimated or actual data for
performance improvement?
I am interested in how you can achieve performance improvement over
`byte[]`.
- Usage of
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
Thank you for confirming, @rdblue.
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/16281
I think we should move to a 1.8.2 patch release. The reason is that 1.9.0
moved to ByteBuffer-based reads and we've found at least one problem with them.
ByteBuffer-based reads also change an
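A reader's note on the byte[]-vs-ByteBuffer distinction raised in this thread: a byte[]-style read hands the caller an independent copy, while a ByteBuffer can be a zero-copy view over existing heap or off-heap memory. A minimal illustration, not Parquet's actual reader code:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ByteBufferView {
    public static void main(String[] args) {
        byte[] page = {1, 2, 3, 4};
        // byte[]-style read: the caller gets an independent copy.
        byte[] copy = Arrays.copyOf(page, page.length);
        // ByteBuffer-style read: a zero-copy view over the same memory.
        ByteBuffer view = ByteBuffer.wrap(page);

        page[0] = 9;
        System.out.println(copy[0]);     // 1: the copy is unaffected
        System.out.println(view.get(0)); // 9: the view sees the change
    }
}
```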
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/16281
Thanks @dongjoon-hyun! Let's get a Parquet 1.8.2 out in January.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
Sure. We are going to wait for Parquet 1.8.2.
Maybe in a month, for Spark 2.1.1 in January?
Anyway, thank you all! I'm closing this PR happily.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16281
As we understand it, the Parquet community is willing to put out patch
releases, so I don't see any major reason now to motivate a fork. That is
great news for us. So do we postpone the upgrading to
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
@rdblue Thanks again! : )
---
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/16281
Great! I'm glad it was just confusion. I completely agree with @srowen that
forking should be a last resort.
In the future, please reach out to the community, whether it's Parquet or
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
@julienledem @rdblue wow, that is so great!!! It will be much easier for us!
I never expected the Parquet community would be willing to do regular patch
releases. Sorry, that statement is
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16281
I'd love to see frequent, conservative patch releases. From my experience,
Parquet bugs cause significant trouble for downstream consumers. For example,
we encountered a data corruption bug writing
Github user julienledem commented on the issue:
https://github.com/apache/spark/pull/16281
Like @rdblue said, I don't recall people asking for a 1.8.2 on the parquet
dev list.
We are happy to help and if there is a 1.8.x patch release branch it's
better to maintain it in the
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/16281
I don't think a fork is a good idea, nor do I think there is a reasonable
need for one.
@gatorsmile brought up that the Parquet community refused to build a patch
release: "The problem is
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
+1
---
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/16281
I don't think we are making a decision at this point to fork ... if we can
really push Parquet to make another maintenance bug-fix release, that'd be
great.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
I see the point. Then, the forked repository is going for 2.1.1 or 2.1-rc4
ASAP?
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
That is just an example. We definitely need the fix ASAP, right? See [the
past release
dates](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.parquet%22%20AND%20a%3A%22parquet%22)
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
Is the open issue the main purpose of forking?
To resolve it faster than Parquet community?
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
There is still an open JIRA: [PARQUET-686 Allow for Unsigned Statistics in
Binary Type](https://issues.apache.org/jira/browse/PARQUET-686)
Because of that bug, we disable the filter push
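The mismatch behind PARQUET-686 is that binary min/max statistics were computed with signed byte comparison, while UTF-8 string ordering needs unsigned comparison, so predicate pushdown on binary columns could skip row groups it shouldn't. A small illustration; the comparators below are written for this example and are not taken from Parquet:

```java
public class SignedVsUnsigned {
    // Lexicographic comparison treating bytes as signed (the buggy order).
    static int compareSigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            if (a[i] != b[i]) return Byte.compare(a[i], b[i]);
        }
        return Integer.compare(a.length, b.length);
    }

    // Lexicographic comparison treating bytes as unsigned (correct for UTF-8).
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] ascii = {0x41};                        // "A"
        byte[] accented = {(byte) 0xC3, (byte) 0xA9}; // "é" in UTF-8
        // Signed order claims "é" < "A", so a max statistic of "A" would
        // wrongly let a filter skip row groups that actually contain "é":
        System.out.println(compareSigned(accented, ascii) < 0);   // true
        // Unsigned (correct) order puts "é" after "A":
        System.out.println(compareUnsigned(accented, ascii) > 0); // true
    }
}
```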
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
@dongjoon-hyun I am not sure what your search conditions are. In a recent
JIRA, we hit this issue:
- PARQUET-389: Filter predicates should work with missing columns
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
For this issue, it was initially PARQUET-363, but all maintenance fixes (with
new features) are welcome.
For me, the following:
- PARQUET-99: Large rows cause unnecessary OOM
Github user ash211 commented on the issue:
https://github.com/apache/spark/pull/16281
What are the specific patches to parquet that folks are proposing should be
included in a parquet 1.8.1-spark1 ? Or what would be desired in a
parquet-released 1.8.2 ?
---
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/16281
I agree with @srowen, forking should be the last resort.
---
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/16281
@nsyca you raised an excellent question on test coverage. The kind of bugs
we have seen in the past weren't really integration bugs, but bugs in
parquet-mr. Technically it should be the job of
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
@dongjoon-hyun
I just checked the code changes in `1.2.1.spark2` compared with the
official Hive 1.2.1:
https://github.com/JoshRosen/hive/commits/release-1.2.1-spark2
Very
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/16281
Btw issues are not just performance, but often correctness as well. As the
default format, a bug in Parquet is much worse than a bug in say ORC.
---
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/16281
We haven't really added much to Hive though, and as a matter of fact the
dependency on Hive is decreasing. Parquet is a much more manageable piece of
code to fork. In the past we have seen fairly
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
Yep. Spark Thrift Server is different, but it's not actively maintained.
For example, the default database feature was only added recently.
I mean this one by `Spark Hive`.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16281
Has anyone even asked for a new 1.8.x build from Parquet and been told it
won't happen?
You don't stop consuming non-fix changes by forking. You do that by staying
on a maintenance branch.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
We are adding major code changes in Spark Thrift Server?
What is the Spark Hive?
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
I think we are not adding new features into Parquet. The fixes must be
small. To avoid the cost and risk, we need to reject all the major fixes in our
special build. At the same time, we also
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
Yep. At the beginning, it starts like that. But please look at Spark Hive
or Spark Thrift Server. I don't think we are maintaining those well or visibly.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
I agree, but from a long-term perspective, the risk and cost of forking
could be the worst.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
@dongjoon-hyun What kind of questions/requests should we ask on the dev mailing
list?
IMO, the risk and cost are small if we make a special build by ourselves.
We can get the bug fixes
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
Actually, this PR is about Apache Spark 2.2, with RC1 in late March.
We have a lot of time to discuss. Why don't we discuss that on the dev mailing
list?
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
Basically, the idea is to make a special build of Parquet 1.8.1 with the
needed fixes ourselves.
Upgrading to a newer version like Parquet 1.9.0 is risky. Parquet 1.9.0 was
just
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16281
I'd much rather lobby to release 1.8.2 and help with the legwork than do
all that legwork and more to maintain a fork. It's still not clear to me that
upgrading to 1.9.0 is not a solution?
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
The problem is that the Parquet community will not create a branch 1.8.2+ for
us. Upgrading to a newer version, 1.9 or 2.0, is always risky. Based on the
history, we hit the bugs and performance
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16281
@gatorsmile @rdblue also works directly on Parquet. I am not seeing
"unfixable" Parquet problems here. You're just pointing at problems that can
and should be fixed, preferably in one place. Forking
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
@srowen Even if we fork our own version, it does not mean we will give up
upgrading to newer versions. We would just add a few fixes.
This is very normal in the mission-critical
Github user nsyca commented on the issue:
https://github.com/apache/spark/pull/16281
My two cents:
- Do we have a Parquet-specific test suite **with sufficient coverage** to
run and back us up that this upgrade won't cause any regressions? I think
simply moving up the version of
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16281
Yes, but we face bugs in third-party components all the time and work
around them or get them fixed. There is an unmentioned downside here too: not
getting bug fixes and improvements that don't
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16281
Those issues with Parquet are specific to certain Parquet versions. If
upgrading Parquet can solve them, they can't justify the decision to fork
Parquet. To fork such a project we need more
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
One more example: https://github.com/apache/spark/pull/16106. This issue
degrades performance.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
We keep hitting issues in Parquet. Below is another example:
https://issues.apache.org/jira/browse/SPARK-18539
---
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16281
Forking is a very bad thing, and only a last resort. We haven't even properly
managed the fork of Hive yet. I don't hear specific bugs to fork around
here either. As such I can't see why this would
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16281
Parquet is the default format of Spark, so it is pretty significant to Spark.
Now that Parquet is becoming stable, it might be the right time to fork it.
We are just fixing the bugs. @liancheng and
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
Thank you for the review, @rxin.
Forking may give us more control, but it quickly becomes invisible (in terms
of documentation).
According to a recent mail on the Spark dev list, only committers
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/16281
I'm actually wondering if we should just fork Parquet and maintain a
version of it, and add fixes ourselves to the fork. In the past, Parquet
updates have often brought more regressions ...
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/16281
Thank you for the review, @srowen. I see.
First of all, I'll check the case of mixed Parquet jar files in the
classpath.
Second, among [83
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16281
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70140/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16281
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16281
**[Test build #70140 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70140/consoleFull)**
for PR 16281 at commit
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16281
The tricky part about upgrading Parquet is whether it will continue to work
with other dependencies that work with Parquet. It's also worth considering
that a different Parquet may be on the