[ 
https://issues.apache.org/jira/browse/IMPALA-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811317#comment-16811317
 ] 

Michael Ho edited comment on IMPALA-8394 at 4/5/19 10:30 PM:
-------------------------------------------------------------

Thanks [~stakiar]. I just noticed that we could be issuing multiple calls to 
{{ReadFromPos}} in some cases, so yes, most likely we need to force a seek to 
move the offset to the right position, since we are using an exclusive file 
handle for S3. Let me give that a try.

One piece I had missed is that we seek only when using a "borrowed" file 
handle in {{ReadFromPosInternal}}. I misread that part and assumed that we 
would always seek before reading, which is apparently not the case when using 
an exclusive file handle:

{noformat}
    if (is_borrowed_fh) {
      if (hdfsSeek(hdfs_fs_, hdfs_file, position_in_file) != 0) {
        return Status(TErrorCode::DISK_IO_ERROR, GetBackendString(),
            Substitute("Error seeking to $0 in file: $1: $2",
                position_in_file, *scan_range_->file_string(), GetHdfsErrorMsg("")));
      }
    }
{noformat}



> Inconsistent data read from S3a connector
> -----------------------------------------
>
>                 Key: IMPALA-8394
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8394
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.2.0, Impala 3.3.0
>            Reporter: Michael Ho
>            Assignee: Michael Ho
>            Priority: Critical
>
> While testing a build with remote data cache 
> (https://github.com/michaelhkw/impala/commits/remote-cache-debug) with S3, it 
> was noticed that data read back from S3 through the HDFS S3a connector was 
> inconsistent. This was confirmed by computing the checksum of the buffer 
> right after a successful read. The log excerpts below show the activity of 
> two threads.
> Threads 18922 and 18924 both tried to look up 
> s3a://impala-remote-reads/tpcds_3000_decimal_parquet.db/store_sales/ss_sold_date_sk=2451097/46490ec1e79b939b-b427e7c50000000b_986349703_data.0.parq
>  at offset 89814317. Both hit a cache miss, and both read the content from 
> S3. Thread 18924 won the race to insert into the cache. When thread 18922 
> came around later to insert the same entry, it noticed that the checksum of 
> the content inserted by thread 18924 differed from that of its own content.
> Please note that the checksum of the bytes read from S3 was computed and 
> logged in {{hdfs-file-reader.cc}} before the insertion into the cache (which 
> recomputed the checksum again), so the inconsistency was already visible in 
> {{hdfs-file-reader.cc}}: thread 18924 computed 
> {{8299739883147237483}} while thread 18922 computed {{9118051972380785265}}.
> We re-ran the same experiment with {{--use_hdfs_pread=true}} and the problem 
> went away. While I don't rule out bugs in the cache prototype at this point, 
> the debugging so far suggests that the content read back from S3 via the 
> HDFS S3a connector is inconsistent when pread is disabled. It could be that 
> we inadvertently shared the file handle somehow, or that there are race 
> conditions in the S3a connector which got exposed by the timing change with 
> the cache enabled.
> FWIW, we also ran the same experiment in an HDFS remote-read configuration, 
> and the problem was not reproducible there either.
> Thread 18924
> {noformat}
> I0405 12:02:15.316999 18924 data-cache.cc:344] 
> ed4c2ab7791b5883:9f1507450000005f] Looking up 
> s3a://impala-remote-reads/tpcds_3000_decimal_parquet.db/store_sales/ss_sold_date_sk=2451097/46490ec1e79b939b-b427e7c50000000b_986349703_data.0.parq
>  mtime: 1549425284000 offset: 89814317 bytes_to_read: 8332914 bytes_read: 0 
> buffer: 4d600000
> I0405 12:02:15.593314 18924 hdfs-file-reader.cc:185] 
> ed4c2ab7791b5883:9f1507450000005f] Caching file 
> s3a://impala-remote-reads/tpcds_3000_decimal_parquet.db/store_sales/ss_sold_date_sk=2451097/46490ec1e79b939b-b427e7c50000000b_986349703_data.0.parq
>  mtime: 1549425284000 offset: 89814317 bytes_read 8332914 checksum 
> 8299739883147237483
> I0405 12:02:15.596087 18924 data-cache.cc:233] 
> ed4c2ab7791b5883:9f1507450000005f] Storing file 
> /data0/1/impala/datacache/cf4b57f89e5985f2:487084b5c69b208b offset 1669431296 
> len 8332914 checksum 8299739883147237483
> I0405 12:02:15.602699 18924 data-cache.cc:361] 
> ed4c2ab7791b5883:9f1507450000005f] Storing 
> s3a://impala-remote-reads/tpcds_3000_decimal_parquet.db/store_sales/ss_sold_date_sk=2451097/46490ec1e79b939b-b427e7c50000000b_986349703_data.0.parq
>  mtime: 1549425284000 offset: 89814317 bytes_to_read: 8332914 buffer: 
> 4d600000 stored: true
> {noformat}
> Thread 18922:
> {noformat}
> I0405 12:02:15.011065 18922 data-cache.cc:344] 
> ed4c2ab7791b5883:9f150745000000da] Looking up 
> s3a://impala-remote-reads/tpcds_3000_decimal_parquet.db/store_sales/ss_sold_date_sk=2451097/46490ec1e79b939b-b427e7c50000000b_986349703_data.0.parq
>  mtime: 1549425284000 offset: 89814317 bytes_to_read: 8332914 bytes_read: 0 
> buffer: 59200000
> I0405 12:02:16.281126 18922 hdfs-file-reader.cc:185] 
> ed4c2ab7791b5883:9f150745000000da] Caching file 
> s3a://impala-remote-reads/tpcds_3000_decimal_parquet.db/store_sales/ss_sold_date_sk=2451097/46490ec1e79b939b-b427e7c50000000b_986349703_data.0.parq
>  mtime: 1549425284000 offset: 89814317 bytes_read 8332914 checksum 
> 9118051972380785265
> I0405 12:02:16.282948 18922 data-cache.cc:166] 
> ed4c2ab7791b5883:9f150745000000da] Storing duplicated file 
> /data0/1/impala/datacache/cf4b57f89e5985f2:487084b5c69b208b offset 1669431296 
> len 8332914 checksum 8299739883147237483 buffer checksum: 9118051972380785265
> E0405 12:02:16.282974 18922 data-cache.cc:171] 
> ed4c2ab7791b5883:9f150745000000da] Write checksum mismatch for file 
> /data0/1/impala/datacache/cf4b57f89e5985f2:487084b5c69b208b offset 1669431296 
> entry len: 8332914 store_len: 8332914 Expected 8299739883147237483, Got 
> 9118051972380785265.
> I0405 12:02:16.283023 18922 data-cache.cc:361] 
> ed4c2ab7791b5883:9f150745000000da] Storing 
> s3a://impala-remote-reads/tpcds_3000_decimal_parquet.db/store_sales/ss_sold_date_sk=2451097/46490ec1e79b939b-b427e7c50000000b_986349703_data.0.parq
>  mtime: 1549425284000 offset: 89814317 bytes_to_read: 8332914 buffer: 
> 59200000 stored: false
> {noformat}
> The problem is quite reproducible with TPCDS Q28 at TPCDS 3000 with parquet 
> format.
> {noformat}
> select  *
> from (select avg(ss_list_price) B1_LP
>             ,count(ss_list_price) B1_CNT
>             ,count(distinct ss_list_price) B1_CNTD
>       from store_sales
>       where ss_quantity between 0 and 5
>         and (ss_list_price between 185 and 185+10 
>              or ss_coupon_amt between 10548 and 10548+1000
>              or ss_wholesale_cost between 6 and 6+20)) B1,
>      (select avg(ss_list_price) B2_LP
>             ,count(ss_list_price) B2_CNT
>             ,count(distinct ss_list_price) B2_CNTD
>       from store_sales
>       where ss_quantity between 6 and 10
>         and (ss_list_price between 28 and 28+10
>           or ss_coupon_amt between 6100 and 6100+1000
>           or ss_wholesale_cost between 27 and 27+20)) B2,
>      (select avg(ss_list_price) B3_LP
>             ,count(ss_list_price) B3_CNT
>             ,count(distinct ss_list_price) B3_CNTD
>       from store_sales
>       where ss_quantity between 11 and 15
>         and (ss_list_price between 173 and 173+10
>           or ss_coupon_amt between 6371 and 6371+1000
>           or ss_wholesale_cost between 32 and 32+20)) B3,
>      (select avg(ss_list_price) B4_LP
>             ,count(ss_list_price) B4_CNT
>             ,count(distinct ss_list_price) B4_CNTD
>       from store_sales
>       where ss_quantity between 16 and 20
>         and (ss_list_price between 101 and 101+10
>           or ss_coupon_amt between 2938 and 2938+1000
>           or ss_wholesale_cost between 21 and 21+20)) B4,
>      (select avg(ss_list_price) B5_LP
>             ,count(ss_list_price) B5_CNT
>             ,count(distinct ss_list_price) B5_CNTD
>       from store_sales
>       where ss_quantity between 21 and 25
>         and (ss_list_price between 8 and 8+10
>           or ss_coupon_amt between 5093 and 5093+1000
>           or ss_wholesale_cost between 50 and 50+20)) B5,
>      (select avg(ss_list_price) B6_LP
>             ,count(ss_list_price) B6_CNT
>             ,count(distinct ss_list_price) B6_CNTD
>       from store_sales
>       where ss_quantity between 26 and 30
>         and (ss_list_price between 110 and 110+10
>           or ss_coupon_amt between 2276 and 2276+1000
>           or ss_wholesale_cost between 36 and 36+20)) B6
> limit 100;
> {noformat}
> cc'ing [~stakiar], [~joemcdonnell] [~lv] [~tlipcon] [~drorke]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
