[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2020-06-23 Thread Tim Armstrong (Code Review)
Tim Armstrong has abandoned this change. ( http://gerrit.cloudera.org:8080/8747 
)

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Abandoned

Let's abandon for now to clean up old reviews
--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 8
Gerrit-Owner: Jinchul Kim 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Jinchul Kim 
Gerrit-Reviewer: Lars Volker 


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2018-04-10 Thread Kim Jin Chul (Code Review)
Kim Jin Chul has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 8:

@Csaba, thanks for your review. I have been busy recently an now I have room 
again. Let me provide a new patch set soon. Have a good day!


--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 8
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Tue, 10 Apr 2018 10:49:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2018-04-03 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 1:

(12 comments)

I looked through the change. I can't say that I completely understand the 
design of the C++ part, so my comments for that part are simple local ones.

http://gerrit.cloudera.org:8080/#/c/8747/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8747/7//COMMIT_MSG@11
PS7, Line 11: hit an error, it always prints the end of the file as the
nit: hits


http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.h@365
PS7, Line 365:
typo: calculated


http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.h@373
PS7, Line 373: ers to ex
typo: difference


http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.h@373
PS7, Line 373: aths in Parquet scanner. Returns the status
Do you mean "taking read_bytes as argument instead of calculating it"?


http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.h@375
PS7, Line 375:   Status ScannerDebugAction() WARN_UNUSED_RESULT {
nit: remove extra space


http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.cc@497
PS7, Line 497:   // In this branch, we've just materialized slots not 
referenced by any conjunct.
Checking LogHasSpace() is not necessary here - LogError() checks if there is 
space for the log anyway, so the check is only useful to avoid the runtime cost 
of creating the error message if it would be dropped anyway. This is important 
if the logic is called a lot of time, for example for millions of rows. This 
part will be called only once (per fragment), as the function returns on the 
first error.


http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.cc@675
PS7, Line 675:   stream_->filename(), line, pos, offset);
 :   state_->LogError(ErrorMsg(TErrorCode::GENERAL, s));
 : }
The Substitute should be moved inside the if block - as I mentioned in line 
497, the reason to call LogHasSpace() is to avoid the runtime cost of creating 
the error message.


http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py@847
PS7, Line 847:
 :
 :
 : 
 :
 :
 :
 :
 :
Can you explain why this function is needed? Other scanner test classes do not 
seem to do similar initialization.


http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py@875
PS7, Line 875:
This step could be skipped, if the the table was created first (which creates 
an HDFS directory) and the data was copied afterwards.


http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py@890
PS7, Line 890:
Why not use the default location ( /test-warehouse/$db_name.db/$table_name )? 
This line could be removed in that case.

Note that assuming that the databases are in /test-warehouse/ makes the tests 
less flexible, but other tests already rely on that, and the tests are expected 
to run on the Impala minicluster - many tests would fail on a different 
cluster, for example planner tests are sensitive to many cluster parameters. So 
I think that it is not a goal at the moment to make the tests runnable on other 
clusters. If it will be a goal, then a general solution should be created 
(probably in ImpalaTestSuite), not custom solutions for different tests.


http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py@893
PS7, Line 893:
 :
nit: indentation and spaces around '='

In Impala, new blocks are indented with 2 spaces, while continuations of broken 
lines are intended with 4 spaces (most of the times...). This is a difference 
compared to PEP8, but makes the python code more consistent with C++ and Java 
code in Impala.

See the python part of 
https://cwiki.apache.org/confluence/display/IMPALA/Impala+Style+Guide for the 
list differences from PEP8.


http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py@899
PS7, Line 899:
I would prefer to use self.execute_query(). The result it returns has a .log 
member, which can be split by "\n" to get the list of warnings.

Another way to do this is to create a .test file in testdata/workloads, list 
the warnings in the " ERRORS" section and run it with self.run_test_case(). 
See 
https://github.com/apache/impala/blob/89b29d3a89cca041c541476e25435dc0d34344c1/tests/common/test_result_verifier.py#L314
 for the way it processes the logs.



--
To view, visit 

[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2018-02-21 Thread Kim Jin Chul (Code Review)
Hello Lars Volker,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8747

to look at the new patch set (#8).

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..

IMPALA-5993: Fix the file offset in value parsing error

This is to fix the file offset in value parsing error
messages when scanning text files. When the text scanner
hit an error, it always prints the end of the file as the
offset, even if the error occurs in the middle of the file.

This change also contains printing errors column-wise instead
of row-wise.

Testing:
Add two test cases:
- TestWrongFileOffset.test_invalid_int
- TestWrongFileOffset.test_invalid_int_gzip

Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
---
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/hdfs-sequence-scanner.cc
M testdata/data/README
A testdata/data/invalid_int.csv
A testdata/data/invalid_int.csv.gz
M tests/query_test/test_scanners.py
8 files changed, 177 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8747/8
--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 8
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2018-02-08 Thread Kim Jin Chul (Code Review)
Kim Jin Chul has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 7:

Hi Lars,

You may be busy. It would be great if you review this change again. Thanks!

Regards,
Jinchul


--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 7
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Fri, 09 Feb 2018 00:06:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2018-01-16 Thread Kim Jin Chul (Code Review)
Kim Jin Chul has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 7:

(25 comments)

Hi Lars,

Sorry I thought I already replied your comments, but I didn't submit a botton 
when I answered them. Now the latest patch set is ready for your review. Thanks 
for your notification.

http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@368
PS6, Line 368:   /// to the error.
> Explain what error_file_offsets does in comment
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@382
PS6, Line 382: r. Ret
> Wouldn't this report an invalid value (column inside row)?
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@382
PS6, Line 382:   /// readers to exercise various failure paths in Parquet 
scanner. Returns the status
> It seems confusing to me that there is now a LogColumnParseError and Report
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@391
PS6, Line 391:
> Start comments with three /// like the other lines.
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@393
PS6, Line 393:  tuple (e.g. fields[0] ma
> No need to mention its type since it's the same as in the declaration itsel
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@408
PS6, Line 408:   /// the reason that the out error parameters are typed uint8_t 
instead of bool. We need
> This looks like it could have some performance impact. Did you do any perf
Done, the computation is moved to the error reporting function. By the way, 
HdfsSequenceScanner::ProcessRange needs the computation because I could not 
find a way to apply the computation logic. See line#337 
https://gerrit.cloudera.org/#/c/8747/7/be/src/exec/hdfs-sequence-scanner.cc


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@233
PS6, Line 233: uint8_t* error_fields, uint8_t* error_in_row) {
> nit: wrap at 90 characters.
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@254
PS6, Line 254:   return EvalConjuncts(tuple_row);
> Is +1 for the delimiter? If so, can you add a brief comment?
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@578
PS6, Line 578:   DCHECK_EQ(replaced, 1);
> Please indent function calls 4 characters, here and elsewhere.
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@759
PS6, Line 759:
> nit: line wrapping.
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@761
PS6, Line 761:
> Why does this not check state_->HasLogSpace() but the method below does?
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@768
PS6, Line 768:
> I think the message was clearer before the change.
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/8747/6/testdata/data/README@136
PS6, Line 136: signed_integer_logical_types
> Can you think of a name that captures the nature of the problems? Most of t
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@30
PS6, Line 30: from parquet.ttypes import ConvertedType
> I think elsewhere we just use "os.path.join()" when we need it, so for cons
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@841
PS6, Line 841:
 :   @classmethod
> nit: missing word?
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@842
PS6, Line 842:   @classmethod
 :   def get_workload(cls):
> Rephrase to say what it should do, not what the problem was.
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@852
PS6, Line 852:   self.hdfs_client = get_hdfs_client_from_conf(hdfs_conf)
> If you change this to "if p.c.o...: " it will also cover the empty string.
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@860
PS6, Line 860:   "invalid_int":
> I think it might be cleaner if every test prepared their own test data sepa
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@862
PS6, Line 862: "dir_path": os.path.join(tmpdir.strpath, "invalid_int"),
> This will break with parallel test runs. Use a tmpdir fixture for pytest in
Done


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@876
PS6, Line 876: put
> nit: putting
Done


http://gerrit.cloudera.o

[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2018-01-16 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 7:

Hi Jinchul. Thanks for uploading a new Patch Set. Is this ready for another 
round of reviews? If so, could you please mark the comments you addressed as 
done?

Thanks, Lars


--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 7
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Tue, 16 Jan 2018 17:18:23 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2018-01-15 Thread Kim Jin Chul (Code Review)
Hello Lars Volker,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8747

to look at the new patch set (#7).

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..

IMPALA-5993: Fix the file offset in value parsing error

This is to fix the file offset in value parsing error
messages when scanning text files. When the text scanner
hit an error, it always prints the end of the file as the
offset, even if the error occurs in the middle of the file.

This change also contains printing errors column-wise instead
of row-wise.

Testing:
Add two test cases:
- TestWrongFileOffset.test_invalid_int
- TestWrongFileOffset.test_invalid_int_gzip

Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
---
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/hdfs-sequence-scanner.cc
M testdata/data/README
A testdata/data/invalid_int.csv
A testdata/data/invalid_int.csv.gz
M tests/query_test/test_scanners.py
8 files changed, 177 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8747/7
--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 7
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2018-01-08 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 6:

(25 comments)

http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@368
PS6, Line 368:   /// This is called from WriteAlignedTuples.
Explain what error_file_offsets does in comment


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@382
PS6, Line 382: column
Wouldn't this report an invalid value (column inside row)?


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@382
PS6, Line 382:   /// Utility function to append an error message for an invalid 
column.
It seems confusing to me that there is now a LogColumnParseError and 
ReportColumnParseError. Do we need both?


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@391
PS6, Line 391:   //   - error_offsets is an out array. error_offsets[i] will be 
set to file offset
Start comments with three /// like the other lines.


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@393
PS6, Line 393:  is an out 8 byte integer
No need to mention its type since it's the same as in the declaration itself.


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.h@408
PS6, Line 408:   uint8_t* error_in_row, int64_t* error_file_offsets, 
int64_t* curr_file_offset)
This looks like it could have some performance impact. Did you do any perf 
experiments?

Have you considered not setting error_file_offsets inside WriteCompleteTuple 
and computing it from the field lengths during error reporting?


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@233
PS6, Line 233: uint8_t* error_fields, uint8_t* error_in_row,
nit: wrap at 90 characters.


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@254
PS6, Line 254: *curr_offset += len + 1;
Is +1 for the delimiter? If so, can you add a brief comment?


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@578
PS6, Line 578: builder.CreateAdd(builder.CreateZExt(len, 
codegen->GetType(TYPE_BIGINT)),
Please indent function calls 4 characters, here and elsewhere.


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@759
PS6, Line 759:   "Error parsing row: file: $0, (file offset: $1)",
nit: line wrapping.


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@761
PS6, Line 761:   state_->LogError(ErrorMsg(TErrorCode::GENERAL, s));
Why does this not check state_->HasLogSpace() but the method below does?


http://gerrit.cloudera.org:8080/#/c/8747/6/be/src/exec/hdfs-scanner.cc@768
PS6, Line 768: at
I think the message was clearer before the change.


http://gerrit.cloudera.org:8080/#/c/8747/6/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/8747/6/testdata/data/README@136
PS6, Line 136: problematic_rows_impala_5993
Can you think of a name that captures the nature of the problems? Most of the 
files in here are somewhat problematic. It could be "invalid_int.csv" (if it is 
an invalid int column).


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@30
PS6, Line 30: from os.path import join as join_path
I think elsewhere we just use "os.path.join()" when we need it, so for 
consistency it would be good to do the same here.


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@841
PS6, Line 841:  Tests that wrong file offset in value parsing error messages 
when
 :   scanning text files.
nit: missing word?


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@842
PS6, Line 842:   scanning text files. When the text scanner hits an error, it 
always prints the end of
 :   the file as the offset, even if the error occurs in the middle 
of the file.
Rephrase to say what it should do, not what the problem was.


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@852
PS6, Line 852: if pytest.config.option.namenode_http_address is None:
If you change this to "if p.c.o...: " it will also cover the empty string. We 
often return at the end of the if-branch and put the on the top level. You can 
move the initialization of hdfs_loc up to achieve that.


http://gerrit.cloudera.org:8080/#/c/8747/6/tests/query_test/test_scanners.py@860
PS6, Line 860:   def prepare(self):
I think it might be cleaner if every test prepared their own test data 
separately.


http://gerrit.cloudera.org:8080/#/c/8747/6/test

[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-11 Thread Kim Jin Chul (Code Review)
Hello Lars Volker,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8747

to look at the new patch set (#5).

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..

IMPALA-5993: Fix the file offset in value parsing error

This is to fix the file offset in value parsing error
messages when scanning text files. When the text scanner
hit an error, it always prints the end of the file as the
offset, even if the error occurs in the middle of the file.

This change also contains:
- Print errors column-wise instead of row-wise
- Adopt more user friendly message format

Testing:
Add two test cases:
- TestWrongFileOffset.test_parsing_wrong_text
- TestWrongFileOffset.test_parsing_wrong_gzip_text

Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
---
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/hdfs-sequence-scanner.cc
M testdata/data/README
A testdata/data/problematic_rows_impala_5993.csv
A testdata/data/problematic_rows_impala_5993.csv.gz
M tests/query_test/test_scanners.py
8 files changed, 300 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8747/5
--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 5
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-10 Thread Kim Jin Chul (Code Review)
Kim Jin Chul has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 4: Code-Review-1

Lars, as I mentioned in the JIRA, let me provide a new patch set.


--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 4
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Sun, 10 Dec 2017 12:07:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-08 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 4:

Can you adjust the logging details according to the discussion in the Jira?


--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 4
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Fri, 08 Dec 2017 19:29:03 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-05 Thread Kim Jin Chul (Code Review)
Hello Lars Volker,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8747

to look at the new patch set (#4).

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..

IMPALA-5993: Fix the file offset in value parsing error

This is to fix the file offset in value parsing error
messages when scanning text files. When the text scanner
hit an error, it always prints the end of the file as the
offset, even if the error occurs in the middle of the file.

This change also contains:
- Print errors column-wise instead of row-wise
- Adopt more user friendly message format

Testing:
Add two test cases:
- TestWrongFileOffset.test_parsing_wrong_text
- TestWrongFileOffset.test_parsing_wrong_gzip_text

Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
---
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/hdfs-sequence-scanner.cc
M testdata/data/README
A testdata/data/problematic_rows_impala_5993.csv
A testdata/data/problematic_rows_impala_5993.csv.gz
M tests/query_test/test_scanners.py
8 files changed, 303 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8747/4
--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 4
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-04 Thread Kim Jin Chul (Code Review)
Kim Jin Chul has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@264
PS1, Line 264: // define i1 @WriteCompleteTuple(%"class.impala::HdfsScanner"* 
%this,
 : //   %"class.impala::MemPool"* %pool,
 : //   
%"struct.impala::FieldLocation"* %fields,
 : //   %"class.impala::Tuple"* %tuple,
 : //   %"class.impala::TupleRow"* 
%tuple_row,
 : //   %"class.impala::Tuple"* 
%template,
 : //   i8* %error_fields,
 : //   i8* %error_in_row,
 : //   i64* %error_file_offsets,
 : //   i64* %curr_file_offset) #35 {
 : // entry:
 : //   %tuple_ptr = bitcast %"class.impala::Tuple"* %tuple to
 : // <{ %"struct.impala::StringValue", 
%"struct.impala::StringValue", i32, i32, i8 }>*
 : //   call void 
@_ZN6impala11HdfsScanner9InitTupleEPNS_5TupleES2_.4(
 : // %"class.impala::HdfsScanner"* %this, 
%"class.impala::Tuple"* %template,
 : // %"class.impala::Tuple"* %tuple)
 : //   %0 = bitcast %"class.impala::TupleRow"* %tuple_row to
 : // <{ %"struct.impala::StringValue", 
%"struct.impala::StringValue", i32, i32, i8 }>**
 : //   %1 = getelementptr inbounds <{ 
%"struct.impala::StringValue",
 : // %"struct.impala::StringValue", i32, i32, i8 }>*, <{ 
%"struct.impala::StringValue",
 : // %"struct.impala::StringValue", i32, i32, i8 }>** %0, i32 0
 : //   store <{ %"struct.impala::StringValue", 
%"struct.impala::StringValue", i32, i32,
 : // i8 }>* %tuple_ptr, <{ %"struct.impala::StringValue",
 : // %"struct.impala::StringValue", i32, i32, i8 }>** %1
 : //   br label %parse
 : //
 : // parse:; preds = 
%entry
 : //   %data_ptr = getelementptr inbounds 
%"struct.impala::FieldLocation",
 : // %"struct.impala::FieldLocation"* %fields, i32 0, i32 0
 : //   %len_ptr = getelementptr inbounds 
%"struct.impala::FieldLocation",
 : // %"struct.impala::FieldLocation"* %fields, i32 0, i32 1
 : //   %slot_error_ptr = getelementptr inbounds i8, i8* 
%error_fields, i32 0
 : //   %slot_error_file_offset_ptr = getelementptr inbounds i64, 
i64* %error_file_offsets,
 : // i32 0
 : //   %data = load i8*, i8** %data_ptr
 : //   %len = load i32, i32* %len_ptr
 : //   %len_lt_zero = icmp slt i32 %len, 0
 : //   %ones_compliment_len = xor i32 %len, -1
 : //   %positive_len = add i32 %ones_compliment_len, 1
 : //   %select_positive_len = select i1 %len_lt_zero, i32 
%positive_len, i32 %len
 : //   %2 = call i1 @WriteSlot(<{ %"struct.impala::StringValue",
 : // %"struct.impala::StringValue", i32, i32, i8 }>* 
%tuple_ptr, i8* %data,
 : // i32 %select_positive_len)
 : //   %slot_parse_error = xor i1 %2, true
 : //   %error_in_row1 = or i1 false, %slot_parse_error
 : //   %3 = zext i1 %slot_parse_error to i8
 : //   store i8 %3, i8* %slot_error_ptr
 : //   %curr_file_offset2 = load i64, i64* %curr_file_offset
 : //   store i64 %curr_file_offset2, i64* 
%slot_error_file_offset_ptr
 : //   %4 = zext i32 %select_positive_len to i64
 : //   %5 = add i64 %4, 1
 : //   %6 = add i64 %curr_file_offset2, %5
 : //   store i64 %6, i64* %curr_file_offset
 : //
> Yes, please update it. We generally try to maintain these as we come across
Done



--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 3
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Tue, 05 Dec 2017 00:27:57 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-04 Thread Kim Jin Chul (Code Review)
Hello Lars Volker,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8747

to look at the new patch set (#3).

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..

IMPALA-5993: Fix the file offset in value parsing error

This is to fix the file offset in value parsing error
messages when scanning text files. When the text scanner
hit an error, it always prints the end of the file as the
offset, even if the error occurs in the middle of the file.

Testing:

Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
---
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/hdfs-sequence-scanner.cc
M testdata/data/README
5 files changed, 186 insertions(+), 52 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8747/3
--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 3
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-04 Thread Kim Jin Chul (Code Review)
Kim Jin Chul has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 2: Code-Review-1

(6 comments)

Preparing a testcase is still ongoing locally.

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.h@390
PS1, Line 390:   ///  - error_in_row is an out bool.  It is set to true if any 
field had parse errors
> Can you add comments for the new parameters here?
Done


http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.h@405
PS1, Line 405:   /// TODO: revisit this
> nit: the line wrapping looks odd, can you wrap at 90 chars?
Yes, the line 406 cannot be appended to the line 405 due to the limitation. Let 
me change it like below:

uint8_t* error_in_row, int64_t* error_offset, int64_t* curr_offset)
WARN_UNUSED_RESULT;


http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@251
PS1, Line 251: error_offsets[i] = *curr_offset;
> isn't fields[i].start the same as curr_offset?
fields[i].start might be used to calculate offset, but fields[i].start and 
curr_offset are not completely same. I think we don't have a start pointer of a 
file. We may need a ptr variable to keep the start of a file.
Do you think the following is better than this?

int64_t* start_of_file) {
...
error_offsets[i] = fields[i].start - start_of_file;

This is can be applied on this file, but it cannot be applied to sequence 
scanner impl. because sequence scanner reads a record(see line 313) and the 
calculation of pointer seems to be invalid:
https://gerrit.cloudera.org/#/c/8747/1/be/src/exec/hdfs-sequence-scanner.cc


http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@668
PS1, Line 668: seError() {
> I am confused because this function still exists unchanged. If your code do
My change uses LogColumnParseError instead of LogRowParseError, but 
LogRowParseError is still necessary. I think it cannot be replaced by 
LogColumnParseError. Please see
https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-text-scanner.cc#L806


http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@674
PS1, Line 674: Error(const int line,
> I agree that it is much easier to understand. One nit: we should print out
Done. Yes, I agree "file offset" is more clear.


http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@683
PS1, Line 683: char* data, int len) {
 :   if (state_->LogHasSpace() || state_->abort_on_error()) {
 : stringstream ss;
> I don't think this change adds a big improvement, but I'm also OK with keep
Done



--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 2
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Tue, 05 Dec 2017 00:18:01 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-04 Thread Kim Jin Chul (Code Review)
Hello Lars Volker,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8747

to look at the new patch set (#2).

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..

IMPALA-5993: Fix the file offset in value parsing error

This is to fix the file offset in value parsing error
messages when scanning text files. When the text scanner
hit an error, it always prints the end of the file as the
offset, even if the error occurs in the middle of the file.

Testing:

Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
---
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/hdfs-sequence-scanner.cc
M testdata/data/README
5 files changed, 62 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8747/2
--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 2
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-04 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@668
PS1, Line 668: stream_->file_offset()
> Sorry there was my mistake in the last sentence.I meant that please let me
I am confused because this function still exists unchanged. If your code does 
not rely on it, can you remove it?



--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 1
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Mon, 04 Dec 2017 23:32:24 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-04 Thread Kim Jin Chul (Code Review)
Kim Jin Chul has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@668
PS1, Line 668: stream_->file_offset()
> I think there are two approaches to fix:
Sorry there was my mistake in the last sentence.I meant that please let me know 
some reasons if you would prefer the first one than the second one. The fix for 
the first one will read stream line by line.



--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 1
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Mon, 04 Dec 2017 21:20:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-04 Thread Kim Jin Chul (Code Review)
Kim Jin Chul has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@668
PS1, Line 668: stream_->file_offset()
> Yes, that was the main reason for opening the Jira. Are you going to fix th
I think there are two approaches to fix:
1. Still rely on the function by reading stream incrementally
2. Track offset instead of using the function
My proposal adopts the second one because I thought ahead of the full scanning 
was intended such as better performace. I implemented the second one on this 
patch set. If you think there is no reason reading the stream incremenally, let 
me try to fix the problem with the first one. Is there your preference? Please 
let me know some reasons if you prefer the second one than the first one.



--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 1
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Mon, 04 Dec 2017 21:12:16 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-04 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 1:

(7 comments)

Thank you for working on this. Please see my inline comments, especially 
regarding the additional variables to track the file offset.

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.h@390
PS1, Line 390:   ///  - error_in_row is an out bool.  It is set to true if any 
field had parse errors
Can you add comments for the new parameters here?


http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.h@405
PS1, Line 405:   uint8_t* error_in_row, int64_t* error_offset,
nit: the line wrapping looks odd, can you wrap at 90 chars?


http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@251
PS1, Line 251: error_offsets[i] = *curr_offset;
isn't fields[i].start the same as curr_offset?


http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@264
PS1, Line 264: // define i1 @WriteCompleteTuple(%"class.impala::HdfsScanner"* 
%this,
 : //   %"class.impala::MemPool"* %pool,
 : //   
%"struct.impala::FieldLocation"* %fields,
 : //   %"class.impala::Tuple"* %tuple,
 : //   %"class.impala::TupleRow"* 
%tuple_row,
 : //   %"class.impala::Tuple"* 
%template,
 : //   i8* %error_fields, i8* 
%error_in_row) {
 : // entry:
 : //   %tuple_ptr = bitcast %"class.impala::Tuple"* %tuple
 : //to <{ %"struct.impala::StringValue", i8 }>*
 : //   %tuple_ptr1 = bitcast %"class.impala::Tuple"* %template
 : // to <{ %"struct.impala::StringValue", i8 }>*
 : //   %int8_ptr = bitcast <{ %"struct.impala::StringValue", i8 
}>* %tuple_ptr to i8*
 : //   %null_bytes_ptr = getelementptr i8, i8* %int8_ptr, i32 16
 : //   call void @llvm.memset.p0i8.i64(i8* %null_bytes_ptr, i8 0, 
i64 1, i32 0, i1 false)
 : //   %0 = bitcast %"class.impala::TupleRow"* %tuple_row
 : //to <{ %"struct.impala::StringValue", i8 }>**
 : //   %1 = getelementptr <{ %"struct.impala::StringValue", i8 }>*,
 : //  <{ %"struct.impala::StringValue", i8 
}>** %0, i32 0
 : //   store <{ %"struct.impala::StringValue", i8 }>* %tuple_ptr,
 : // <{ %"struct.impala::StringValue", i8 }>** %1
 : //   br label %parse
 : //
 : // parse:; preds = 
%entry
 : //  %data_ptr = getelementptr %"struct.impala::FieldLocation",
 : //%"struct.impala::FieldLocation"* 
%fields, i32 0, i32 0
 : //  %len_ptr = getelementptr %"struct.impala::FieldLocation",
 : //   %"struct.impala::FieldLocation"* 
%fields, i32 0, i32 1
 : //  %slot_error_ptr = getelementptr i8, i8* %error_fields, i32 0
 : //  %data = load i8*, i8** %data_ptr
 : //  %len = load i32, i32* %len_ptr
 : //  %2 = call i1 @WriteSlot(<{ %"struct.impala::StringValue", i8 
}>* %tuple_ptr,
 : //  i8* %data, i32 %len)
 : //  %slot_parse_error = xor i1 %2, true
 : //  %error_in_row2 = or i1 false, %slot_parse_error
 : //  %3 = zext i1 %slot_parse_error to i8
 : //  store i8 %3, i8* %slot_error_ptr
 : //  %4 = call %"class.impala::ScalarExprEvaluator"* 
@GetConjunctCtx(
 : //%"class.impala::HdfsScanner"* %this, i32 0)
 : //  %conjunct_eval = call i16 
@"impala::Operators::Eq_StringVal_StringValWrapper"(
 : //%"class.impala::ScalarExprEvaluator"* %4, 
%"class.impala::TupleRow"* %tuple_row)
 : //  %5 = ashr i16 %conjunct_eval, 8
 : //  %6 = trunc i16 %5 to i8
 : //  %val = trunc i8 %6 to i1
 : //  br i1 %val, label %parse3, label %eval_fail
 : //
 : // parse3:   ; preds = 
%parse
 : //   %7 = zext i1 %error_in_row2 to i8
 : //   store i8 %7, i8* %error_in_row
 : //   ret i1 true
 : //
 : // eval_fail:; preds = 
%parse
 : //   ret i1 false
 : // }
> Should I ma

[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-04 Thread Kim Jin Chul (Code Review)
Kim Jin Chul has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@264
PS1, Line 264: // define i1 @WriteCompleteTuple(%"class.impala::HdfsScanner"* 
%this,
 : //   %"class.impala::MemPool"* %pool,
 : //   
%"struct.impala::FieldLocation"* %fields,
 : //   %"class.impala::Tuple"* %tuple,
 : //   %"class.impala::TupleRow"* 
%tuple_row,
 : //   %"class.impala::Tuple"* 
%template,
 : //   i8* %error_fields, i8* 
%error_in_row) {
 : // entry:
 : //   %tuple_ptr = bitcast %"class.impala::Tuple"* %tuple
 : //to <{ %"struct.impala::StringValue", i8 }>*
 : //   %tuple_ptr1 = bitcast %"class.impala::Tuple"* %template
 : // to <{ %"struct.impala::StringValue", i8 }>*
 : //   %int8_ptr = bitcast <{ %"struct.impala::StringValue", i8 
}>* %tuple_ptr to i8*
 : //   %null_bytes_ptr = getelementptr i8, i8* %int8_ptr, i32 16
 : //   call void @llvm.memset.p0i8.i64(i8* %null_bytes_ptr, i8 0, 
i64 1, i32 0, i1 false)
 : //   %0 = bitcast %"class.impala::TupleRow"* %tuple_row
 : //to <{ %"struct.impala::StringValue", i8 }>**
 : //   %1 = getelementptr <{ %"struct.impala::StringValue", i8 }>*,
 : //  <{ %"struct.impala::StringValue", i8 
}>** %0, i32 0
 : //   store <{ %"struct.impala::StringValue", i8 }>* %tuple_ptr,
 : // <{ %"struct.impala::StringValue", i8 }>** %1
 : //   br label %parse
 : //
 : // parse:; preds = 
%entry
 : //  %data_ptr = getelementptr %"struct.impala::FieldLocation",
 : //%"struct.impala::FieldLocation"* 
%fields, i32 0, i32 0
 : //  %len_ptr = getelementptr %"struct.impala::FieldLocation",
 : //   %"struct.impala::FieldLocation"* 
%fields, i32 0, i32 1
 : //  %slot_error_ptr = getelementptr i8, i8* %error_fields, i32 0
 : //  %data = load i8*, i8** %data_ptr
 : //  %len = load i32, i32* %len_ptr
 : //  %2 = call i1 @WriteSlot(<{ %"struct.impala::StringValue", i8 
}>* %tuple_ptr,
 : //  i8* %data, i32 %len)
 : //  %slot_parse_error = xor i1 %2, true
 : //  %error_in_row2 = or i1 false, %slot_parse_error
 : //  %3 = zext i1 %slot_parse_error to i8
 : //  store i8 %3, i8* %slot_error_ptr
 : //  %4 = call %"class.impala::ScalarExprEvaluator"* 
@GetConjunctCtx(
 : //%"class.impala::HdfsScanner"* %this, i32 0)
 : //  %conjunct_eval = call i16 
@"impala::Operators::Eq_StringVal_StringValWrapper"(
 : //%"class.impala::ScalarExprEvaluator"* %4, 
%"class.impala::TupleRow"* %tuple_row)
 : //  %5 = ashr i16 %conjunct_eval, 8
 : //  %6 = trunc i16 %5 to i8
 : //  %val = trunc i8 %6 to i1
 : //  br i1 %val, label %parse3, label %eval_fail
 : //
 : // parse3:   ; preds = 
%parse
 : //   %7 = zext i1 %error_in_row2 to i8
 : //   store i8 %7, i8* %error_in_row
 : //   ret i1 true
 : //
 : // eval_fail:; preds = 
%parse
 : //   ret i1 false
 : // }
Should I maintain this? When I dump IR, this comment has not been updated for a 
while. I would like to remove the old IR completely because it is meaningless 
if we spend our precious time.



--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 1
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Mon, 04 Dec 2017 15:16:16 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-04 Thread Kim Jin Chul (Code Review)
Kim Jin Chul has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8747 )

Change subject: IMPALA-5993: Fix the file offset in value parsing error
..


Patch Set 1: Code-Review-1

(4 comments)

As I commented, this change does not include a test which is ongoing locally. 
By the way, I would like to get your early review for the kernel code change. 
Please check my comments. Thanks.

http://gerrit.cloudera.org:8080/#/c/8747/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8747/1//COMMIT_MSG@14
PS1, Line 14: Testing:
A custom test will be introduced at tests/custom_cluster/test_impala_5993.py. 
The test contains some hdfs commands which come from tests/util/hdfs_util.py 
and table will be created by 'create table stored as textfile location' query.


http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@668
PS1, Line 668: stream_->file_offset()
The offset can point to EOF because the data is read ahead of time via 
ScannerContext :: Stream :: GetBuffer in 
impala::HdfsTextScanner::FillByteBuffer. Therefore, file_offset cannot be used 
to print the problematic offset.


http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@674
PS1, Line 674: Error parsing row: file: $0, (line: $1, pos: $2, offset: $3)
I think the refined message format is more user friendly. Let me provide 
another change if you think this improvement cannot be in a bug fix.


http://gerrit.cloudera.org:8080/#/c/8747/1/be/src/exec/hdfs-scanner.cc@683
PS1, Line 683: "Error converting column at the index: "
 :<< desc->col_pos() - scan_node_->num_partition_keys()
 :<< " (type: " << desc->type() << ")";
I think the refined message format is more user friendly. Let me provide 
another change if you think this improvement cannot be in a bug fix.



--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 1
Gerrit-Owner: Kim Jin Chul 
Gerrit-Reviewer: Kim Jin Chul 
Gerrit-Reviewer: Lars Volker 
Gerrit-Comment-Date: Mon, 04 Dec 2017 15:09:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5993: Fix the file offset in value parsing error

2017-12-04 Thread Kim Jin Chul (Code Review)
Kim Jin Chul has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/8747


Change subject: IMPALA-5993: Fix the file offset in value parsing error
..

IMPALA-5993: Fix the file offset in value parsing error

This is to fix the file offset in value parsing error
messages when scanning text files. When the text scanner
hit an error, it always prints the end of the file as the
offset, even if the error occurs in the middle of the file.

Testing:

Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
---
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/hdfs-sequence-scanner.cc
4 files changed, 49 insertions(+), 12 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8747/1
--
To view, visit http://gerrit.cloudera.org:8080/8747
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f
Gerrit-Change-Number: 8747
Gerrit-PatchSet: 1
Gerrit-Owner: Kim Jin Chul