Github user wzhfy commented on the issue:
https://github.com/apache/spark/pull/18105
> I also think we should respect Spark-generated statistics over Hive's
> when they are available.
@gatorsmile OK. Then it's consistent with the current implementation. I'll
change the
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18105
Now, we have two sources of statistics. We need a mechanism to decide which
one should be chosen. We might need to update the code comments at least to
document the behaviors we choose.
---
Github user wzhfy commented on the issue:
https://github.com/apache/spark/pull/18105
I don't think the analyze table command is bound to CBO, either. I just
want to change how we read stats from the metastore. That is, which side's
stats (Spark's or Hive's) we respect, based on CBO
Github user wzhfy commented on the issue:
https://github.com/apache/spark/pull/18105
@cloud-fan I mean the behavior when CBO is disabled should be the same as
the previous behavior without CBO.
Previously, size was read from "totalSize", and it changes after update.
Now, when
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18105
If users have not analyzed the table in Spark yet, we should respect the
stats from the Hive metastore. But if users have already run the analyze table
command in Spark, I think it's fair to ask them
Github user wzhfy commented on the issue:
https://github.com/apache/spark/pull/18105
I think we'd better respect the "totalSize" stats when CBO is disabled,
otherwise users have no way to get back to the default behavior unless they
re-run the analyze command. I personally think that's unfriendly
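
For illustration only, a minimal Python sketch of the CBO-dependent rule proposed in this comment. It is not Spark's code: the function name `size_for_planning` and the Spark-side key `spark.sql.statistics.totalSize` are assumptions, and the behavior when CBO is enabled (prefer Spark's value, fall back to Hive's) is a guess at the intent. Only the Hive key "totalSize" comes from this thread.

```python
from typing import Mapping, Optional

HIVE_TOTAL_SIZE = "totalSize"  # Hive's stats key, as named in the thread
SPARK_TOTAL_SIZE = "spark.sql.statistics.totalSize"  # assumed Spark-side key


def size_for_planning(props: Mapping[str, str], cbo_enabled: bool) -> Optional[int]:
    """Pick a table size from metastore properties, depending on CBO."""
    if cbo_enabled:
        # Assumption: with CBO on, Spark's own stats win, then Hive's.
        keys = (SPARK_TOTAL_SIZE, HIVE_TOTAL_SIZE)
    else:
        # With CBO off, keep the old behavior: read Hive's "totalSize" only.
        keys = (HIVE_TOTAL_SIZE,)
    for key in keys:
        raw = props.get(key)
        if raw is not None:
            return int(raw)
    return None  # no size recorded under the consulted keys
```

Under this rule, a table whose Hive-side "totalSize" was refreshed after the last Spark analyze run would use the refreshed value whenever CBO is off.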
Github user wzhfy commented on the issue:
https://github.com/apache/spark/pull/18105
@cloud-fan > What was the behavior before?
Previously, the analyze table command only updated the size of the table; it
used the same Hive stats name "totalSize" and stored it in the metastore
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18105
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18105
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77365/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18105
**[Test build #77365 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77365/testReport)**
for PR 18105 at commit
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18105
I think we should always trust Spark's table stats over Hive's, regardless
of whether CBO is on. If users update the stats on the Hive side, it's their
own responsibility to update them on the Spark side.
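
For illustration only, a minimal Python sketch of the precedence described in this comment: use Spark's stats when present, otherwise fall back to Hive's, with no dependence on CBO. It is not Spark's code: the function name `choose_table_size` and the Spark-side key `spark.sql.statistics.totalSize` are assumptions. Only the Hive key "totalSize" comes from this thread.

```python
from typing import Mapping, Optional

HIVE_TOTAL_SIZE = "totalSize"  # Hive's stats key, as named in the thread
SPARK_TOTAL_SIZE = "spark.sql.statistics.totalSize"  # assumed Spark-side key


def choose_table_size(props: Mapping[str, str]) -> Optional[int]:
    """Return the table size, trusting Spark's stats over Hive's."""
    # Keys are checked in priority order; the first one present wins.
    for key in (SPARK_TOTAL_SIZE, HIVE_TOTAL_SIZE):
        raw = props.get(key)
        if raw is not None:
            return int(raw)
    return None  # neither side has recorded a size
```

With this rule, a later Hive-side update to "totalSize" is ignored once Spark has written its own value, which is the trade-off debated in the thread.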
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18105
**[Test build #77365 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77365/testReport)**
for PR 18105 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18105
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77359/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18105
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18105
**[Test build #77359 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77359/testReport)**
for PR 18105 at commit
Github user wzhfy commented on the issue:
https://github.com/apache/spark/pull/18105
cc @cloud-fan @gatorsmile
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18105
**[Test build #77359 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77359/testReport)**
for PR 18105 at commit