Kim Jin Chul has posted comments on this change. ( http://gerrit.cloudera.org:8080/8747 )
Change subject: IMPALA-5993: Fix the file offset in value parsing error ...................................................................... Patch Set 2: Code-Review-1 (6 comments) Preparing a testcase is still ongoing locally. http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.h@390 PS1, Line 390: /// - error_in_row is an out bool. It is set to true if any field had parse errors > Can you add comments for the new parameters here? Done http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.h@405 PS1, Line 405: /// TODO: revisit this > nit: the line wrapping looks odd, can you wrap at 90 chars? Yes, the line 406 cannot be appended to the line 405 due to the limitation. Let me change it like below: uint8_t* error_in_row, int64_t* error_offset, int64_t* curr_offset)<LF> WARN_UNUSED_RESULT; http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@251 PS1, Line 251: error_offsets[i] = *curr_offset; > isn't fields[i].start the same as curr_offset? fields[i].start might be used to calculate offset, but fields[i].start and curr_offset are not completely same. I think we don't have a start pointer of a file. We may need a ptr variable to keep the start of a file. Do you think the following is better than this? int64_t* start_of_file) { ... error_offsets[i] = fields[i].start - start_of_file; This is can be applied on this file, but it cannot be applied to sequence scanner impl. because sequence scanner reads a record(see line 313) and the calculation of pointer seems to be invalid: https://gerrit.cloudera.org/#/c/8747/1/be/src/exec/hdfs-sequence-scanner.cc http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@668 PS1, Line 668: seError() { > I am confused because this function still exists unchanged. If your code do My change uses LogColumnParseError instead of LogRowParseError, but LogRowParseError is still necessary. I think it cannot be replaced by LogColumnParseError. Please see https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-text-scanner.cc#L806 http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@674 PS1, Line 674: Error(const int line, > I agree that it is much easier to understand. One nit: we should print out Done. Yes, I agree "file offset" is more clear. http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@683 PS1, Line 683: char* data, int len) { : if (state_->LogHasSpace() || state_->abort_on_error()) { : stringstream ss; > I don't think this change adds a big improvement, but I'm also OK with keep Done -- To view, visit http://gerrit.cloudera.org:8080/8747 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f Gerrit-Change-Number: 8747 Gerrit-PatchSet: 2 Gerrit-Owner: Kim Jin Chul <jinc...@gmail.com> Gerrit-Reviewer: Kim Jin Chul <jinc...@gmail.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Comment-Date: Tue, 05 Dec 2017 00:18:01 +0000 Gerrit-HasComments: Yes