Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
......................................................................


Patch Set 6:

(25 comments)

http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@368
PS6, Line 368:   /// This is called from WriteAlignedTuples.
Explain what error_file_offsets does in comment


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@382
PS6, Line 382: column
Wouldn't this report an invalid value (column inside row)?


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@382
PS6, Line 382:   /// Utility function to append an error message for an invalid 
column.
It seems confusing to me that there is now a LogColumnParseError and 
ReportColumnParseError. Do we need both?


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@391
PS6, Line 391:   //   - error_offsets is an out array. error_offsets[i] will be 
set to file offset
Start comments with three /// like the other lines.


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@393
PS6, Line 393:  is an out 8 byte integer
No need to mention its type since it's the same as in the declaration itself.


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@408
PS6, Line 408:       uint8_t* error_in_row, int64_t* error_file_offsets, 
int64_t* curr_file_offset)
This looks like it could have some performance impact. Did you do any perf 
experiments?

Have you considered not setting error_file_offsets inside WriteCompleteTuple 
and computing it from the field lengths during error reporting?


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@233
PS6, Line 233:     uint8_t* error_fields, uint8_t* error_in_row,
nit: wrap at 90 characters.


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@254
PS6, Line 254:     *curr_offset += len + 1;
Is +1 for the delimiter? If so, can you add a brief comment?


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@578
PS6, Line 578:         builder.CreateAdd(builder.CreateZExt(len, 
codegen->GetType(TYPE_BIGINT)),
Please indent function calls 4 characters, here and elsewhere.


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@759
PS6, Line 759:       "Error parsing row: file: $0, (file offset: $1)",
nit: line wrapping.


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@761
PS6, Line 761:   state_->LogError(ErrorMsg(TErrorCode::GENERAL, s));
Why does this not check state_->HasLogSpace() but the method below does?


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@768
PS6, Line 768: at
I think the message was clearer before the change.


http://gerrit.cloudera.org:8080/#/c/8747/6/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/8747/6/testdata/data/README@136
PS6, Line 136: problematic_rows_impala_5993
Can you think of a name that captures the nature of the problems? Most of the 
files in here are somewhat problematic. It could be "invalid_int.csv" (if it is 
an invalid int column).


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@30
PS6, Line 30: from os.path import join as join_path
I think elsewhere we just use "os.path.join()" when we need it, so for 
consistency it would be good to do the same here.


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@841
PS6, Line 841:  Tests that wrong file offset in value parsing error messages 
when
             :   scanning text files.
nit: missing word?


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@842
PS6, Line 842:   scanning text files. When the text scanner hits an error, it 
always prints the end of
             :   the file as the offset, even if the error occurs in the middle 
of the file.
Rephrase to say what it should do, not what the problem was.


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@852
PS6, Line 852:     if pytest.config.option.namenode_http_address is None:
If you change this to "if p.c.o...: " it will also cover the empty string. We 
often return at the end of the if-branch and put the on the top level. You can 
move the initialization of hdfs_loc up to achieve that.


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@860
PS6, Line 860:   def prepare(self):
I think it might be cleaner if every test prepared their own test data 
separately.


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@862
PS6, Line 862:     self.hdfs_test_dir = "/tmp/impala5993"
This will break with parallel test runs. Use a tmpdir fixture for pytest 
instead.


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@876
PS6, Line 876: put
nit: putting


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@888
PS6, Line 888:       join_path(self.test_data_dir, table_info["file"]),
nit: Will this fit within 90 characters of the previous line?


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@891
PS6, Line 891:   def execute(self, query):
Might be better to inline this to reduce complexity?


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@911
PS6, Line 911: run
Can you think of a better name?


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@917
PS6, Line 917:     "Error parsing row:.*\(file offset: 317\).*\n"\
Is each of these errors required? Can the order of the messages ever change, 
e.g. when multiple threads scan the file?


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@930
PS6, Line 930:     self.run("sample_text")
These should be separate test_..() methods.



--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 6
Gerrit-Owner: Kim Jin Chul <jinc...@gmail.com>
Gerrit-Reviewer: Kim Jin Chul <jinc...@gmail.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Jan 2018 15:49:36 +0000
Gerrit-HasComments: Yes

Reply via email to