Github user jmchung commented on the issue:
https://github.com/apache/spark/pull/18865
@gatorsmile Sure, I'll make a follow-up PR for CSV.
Great thanks for everyone's feedback in this patch, I really learned lot
for this.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
Thanks @HyukjinKwon @gatorsmile @cloud-fan @dm-tran
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18865
Thanks! Merged to master.
@jmchung Could you submit a follow-up PR for CSV? Thanks!
---
-
To unsubscribe, e-mail:
Github user jmchung commented on the issue:
https://github.com/apache/spark/pull/18865
cc @gatorsmile Please take another look when you have time. I've already
updated. Thanks!
---
-
To unsubscribe, e-mail:
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81603/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81603 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81603/testReport)**
for PR 18865 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81603 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81603/testReport)**
for PR 18865 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81591/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81591 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81591/testReport)**
for PR 18865 at commit
Github user jmchung commented on the issue:
https://github.com/apache/spark/pull/18865
@gatorsmile those negative cases are already added in `JsonSuite`.
---
-
To unsubscribe, e-mail:
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81591 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81591/testReport)**
for PR 18865 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81561/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81561 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81561/testReport)**
for PR 18865 at commit
Github user jmchung commented on the issue:
https://github.com/apache/spark/pull/18865
Thank you for review, @HyukjinKwon.
cc @gatorsmile
Could you review this again when you have sometime? thanks!
---
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/18865
Seems fine to me too.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81561 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81561/testReport)**
for PR 18865 at commit
Github user jmchung commented on the issue:
https://github.com/apache/spark/pull/18865
@viirya, thank you so much for taking a look and your time.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
Leave more few comments on the message. Otherwise LGTM.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81451/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81451 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81451/testReport)**
for PR 18865 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81451 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81451/testReport)**
for PR 18865 at commit
Github user jmchung commented on the issue:
https://github.com/apache/spark/pull/18865
Could @gatorsmile and @HyukjinKwon please share some instructions for
revised details on exception message? The current message indicates the reason
of disallowance when users just select the
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/18865
@jmchung, just to be clear, sure, let's go in this way and I guess we have
the only comment left to address now:
> Please update the error message and also add it to the migration
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/18865
@gatorsmile, Thanks for elaborating this. Looks a fair point.
---
-
To unsubscribe, e-mail:
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18865
Parsing all the columns is another option when users just select the
`_corrupt_record`. However, users might not need all the columns, especially
for the semi-structured formats (CSV and JSON).
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/18865
In that case, the current fix does not cover when selecting few column
together - https://github.com/apache/spark/pull/18865#issuecomment-326865161.
Why wouldn't we block all cases, or document
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18865
Based on the original PR https://github.com/apache/spark/pull/2680 and the
current behavior, we do not think the current behavior is by design. The usage
cases you mentioned above are rare. When
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/18865
Could you share the offline discussion and why - @gatorsmile and
@sameeragarwal?
---
-
To unsubscribe, e-mail:
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18865
After an offline discussion with @sameeragarwal , we think we should block
it with a reasonable error message.
@jmchung Please update the error message and also add it to the migration
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/18865
Just to summarise my opinion:
I think I am unclear if we should call, selecting `_corrupt_record` alone,
a bug to disallow. If I understood correctly, I think we are basically saying:
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/18865
Yea, it is and that's what we discussed -
https://github.com/apache/spark/pull/18865#discussion_r131887972. In that way,
I was thinking
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
But doesn't this behavior that `_corrupt_record` content depends on the
selected json fields is designed at the first?
---
If your project is set up for it, you can reply to this email and have
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/18865
> the usage can cause weird results too
> `_corrupt_record` should return all the records that Spark SQL fail to
parse
I think another point should be, this issue still exists
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18865
BTW, this change should be put into the migration guide of Spark SQL.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18865
From the viewpoints of the end users of Spark,
`dfFromFile.select($"_corrupt_record").show()` might not return all the
expected records. ``_corrupt_record`` should return all the records that
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
@HyukjinKwon 's provided use case looks pretty fair. The corrupt record is
the whole line which doesn't follow the json format. It is kind of different to
the corrupt record case that some json
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/18865
A use case might be:
```
echo '{"field": 1
{"field" 2}
{"field": 3}' >/tmp/sample.json
```
```scala
val file = "/tmp/sample.json"
val dfFromFile =
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81360/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81360 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81360/testReport)**
for PR 18865 at commit
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18865
`requiredSchema.length == 1 && requiredSchema.head.name ==
parsedOptions.columnNameOfCorruptRecord`
What is the usage scenario if we do not block the above scenario?
---
If your
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81360 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81360/testReport)**
for PR 18865 at commit
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/18865
@gatorsmile and @cloud-fan, do you guys prefer throwing an exception or
printing a log? I think I like logging one more conservatively.
BTW, I am sorry for raising this issue late now.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81357/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81357 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81357/testReport)**
for PR 18865 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
@HyukjinKwon Thanks for the comment.
I think the current behavior confuses users in some ways, as it can have
weird query results shown in previous discussion.
The previous fix that
Github user jmchung commented on the issue:
https://github.com/apache/spark/pull/18865
To @viirya and @gatorsmile, I made some modifications as follows:
1. move the check of `_corrupt_record` out of the function block to get
fast fail in driver instead of executor side.
2.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81357 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81357/testReport)**
for PR 18865 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
After looking the change, found another two points we should address.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81353 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81353/testReport)**
for PR 18865 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81353/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81353 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81353/testReport)**
for PR 18865 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
LGTM
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18865
LGTM
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the
Github user jmchung commented on the issue:
https://github.com/apache/spark/pull/18865
@gatorsmile Thanks for your feedback.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user jmchung commented on the issue:
https://github.com/apache/spark/pull/18865
@viirya Thanks, the description of PR has been updated.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
@jmchung Please also update the description too.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81340/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81340 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81340/testReport)**
for PR 18865 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81340 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81340/testReport)**
for PR 18865 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
retest this please.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so,
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81339/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81339 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81339/testReport)**
for PR 18865 at commit
Github user jmchung commented on the issue:
https://github.com/apache/spark/pull/18865
Thanks @viirya's suggestion, the redundant comment is removed and
`withTempPath` is applied in the test case.
---
If your project is set up for it, you can reply to this email and have your
reply
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81339 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81339/testReport)**
for PR 18865 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81334/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81334 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81334/testReport)**
for PR 18865 at commit
Github user jmchung commented on the issue:
https://github.com/apache/spark/pull/18865
gentle ping @viirya, @gatorsmile, made a minor change to throw reasonable
workaround message.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #81334 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81334/testReport)**
for PR 18865 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
I think it makes sense to issue an error with good helpful message when
users only select `_corrupt_record` without other columns.
---
If your project is set up for it, you can reply to this email
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18865
Could we issue a better error message in such a scenario?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80375/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #80375 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80375/testReport)**
for PR 18865 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
Oh. sorry. Looks like the Jenkins run tests now.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
@HyukjinKwon @cloud-fan May you help trigger Jenkins? Thanks.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #80375 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80375/testReport)**
for PR 18865 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
retest this please.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so,
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80350/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #80350 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80350/testReport)**
for PR 18865 at commit
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18865
cc @HyukjinKwon
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so,
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18865
**[Test build #80350 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80350/testReport)**
for PR 18865 at commit
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18865
ok to test
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
cc @gatorsmile @cloud-fan Can you help trigger Jenkins for this? Thanks.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18865
Can one of the admins verify this patch?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this
95 matches
Mail list logo