Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/8747 )
Change subject: IMPALA-5993: Fix the file offset in value parsing error ...................................................................... Patch Set 6: (25 comments) http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@368 PS6, Line 368: /// This is called from WriteAlignedTuples. Explain what error_file_offsets does in comment http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@382 PS6, Line 382: column Wouldn't this report an invalid value (column inside row)? http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@382 PS6, Line 382: /// Utility function to append an error message for an invalid column. It seems confusing to me that there is now a LogColumnParseError and ReportColumnParseError. Do we need both? http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@391 PS6, Line 391: // - error_offsets is an out array. error_offsets[i] will be set to file offset Start comments with three /// like the other lines. http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@393 PS6, Line 393: is an out 8 byte integer No need to mention its type since it's the same as in the declaration itself. http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@408 PS6, Line 408: uint8_t* error_in_row, int64_t* error_file_offsets, int64_t* curr_file_offset) This looks like it could have some performance impact. Did you do any perf experiments? Have you considered not setting error_file_offsets inside WriteCompleteTuple and computing it from the field lengths during error reporting? http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@233 PS6, Line 233: uint8_t* error_fields, uint8_t* error_in_row, nit: wrap at 90 characters. http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@254 PS6, Line 254: *curr_offset += len + 1; Is +1 for the delimiter? If so, can you add a brief comment? http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@578 PS6, Line 578: builder.CreateAdd(builder.CreateZExt(len, codegen->GetType(TYPE_BIGINT)), Please indent function calls 4 characters, here and elsewhere. http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@759 PS6, Line 759: "Error parsing row: file: $0, (file offset: $1)", nit: line wrapping. http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@761 PS6, Line 761: state_->LogError(ErrorMsg(TErrorCode::GENERAL, s)); Why does this not check state_->HasLogSpace() but the method below does? http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@768 PS6, Line 768: at I think the message was clearer before the change. http://gerrit.cloudera.org:8080/#/c/8747/6/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/8747/6/testdata/data/README@136 PS6, Line 136: problematic_rows_impala_5993 Can you think of a name that captures the nature of the problems? Most of the files in here are somewhat problematic. It could be "invalid_int.csv" (if it is an invalid int column). http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@30 PS6, Line 30: from os.path import join as join_path I think elsewhere we just use "os.path.join()" when we need it, so for consistency it would be good to do the same here. http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@841 PS6, Line 841: Tests that wrong file offset in value parsing error messages when : scanning text files. nit: missing word? http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@842 PS6, Line 842: scanning text files. When the text scanner hits an error, it always prints the end of : the file as the offset, even if the error occurs in the middle of the file. Rephrase to say what it should do, not what the problem was. http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@852 PS6, Line 852: if pytest.config.option.namenode_http_address is None: If you change this to "if p.c.o...: " it will also cover the empty string. We often return at the end of the if-branch and put the on the top level. You can move the initialization of hdfs_loc up to achieve that. http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@860 PS6, Line 860: def prepare(self): I think it might be cleaner if every test prepared their own test data separately. http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@862 PS6, Line 862: self.hdfs_test_dir = "/tmp/impala5993" This will break with parallel test runs. Use a tmpdir fixture for pytest instead. http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@876 PS6, Line 876: put nit: putting http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@888 PS6, Line 888: join_path(self.test_data_dir, table_info["file"]), nit: Will this fit within 90 characters of the previous line? http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@891 PS6, Line 891: def execute(self, query): Might be better to inline this to reduce complexity? http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@911 PS6, Line 911: run Can you think of a better name? http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@917 PS6, Line 917: "Error parsing row:.*\(file offset: 317\).*\n"\ Is each of these errors required? Can the order of the messages ever change, e.g. when multiple threads scan the file? http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@930 PS6, Line 930: self.run("sample_text") These should be separate test_..() methods. -- To view, visit http://gerrit.cloudera.org:8080/8747 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f Gerrit-Change-Number: 8747 Gerrit-PatchSet: 6 Gerrit-Owner: Kim Jin Chul <jinc...@gmail.com> Gerrit-Reviewer: Kim Jin Chul <jinc...@gmail.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Comment-Date: Mon, 08 Jan 2018 15:49:36 +0000 Gerrit-HasComments: Yes