[jira] [Resolved] (IMPALA-955) Implement the BYTES built-in

2022-02-10 Thread Pranav Yogi Lodha (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pranav Yogi Lodha resolved IMPALA-955.
--
Fix Version/s: Impala 4.1.0
   Resolution: Fixed

Resolved

> Implement the BYTES built-in
> 
>
> Key: IMPALA-955
> URL: https://issues.apache.org/jira/browse/IMPALA-955
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend
>Affects Versions: Impala 1.3
>Reporter: David Z. Chen
>Assignee: Pranav Yogi Lodha
>Priority: Minor
>  Labels: built-in-function, newbie, ramp-up
> Fix For: Impala 4.1.0
>
>
> Implement the BYTES built-in: 
> http://www.info.teradata.com/HTMLPubs/DB_TTU_14_00/index.html#page/SQL_Reference/B035_1145_111A/Attribute_Functions.089.02.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-955) Implement the BYTES built-in

2022-02-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490706#comment-17490706
 ] 

ASF subversion and git services commented on IMPALA-955:


Commit bde995483a1b6e91dc5d089dfc07225a93d7c8ca in impala's branch 
refs/heads/master from pranav.lodha
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=bde9954 ]

IMPALA-955: BYTES built-in function

The BYTES function returns the number of bytes contained
in the specified byte string. There are changes in
4 files. A few test cases are also added in
be/src/exprs/expr-test.cc, along with an end-to-end test in
testdata/workloads/functional-query/queries/QueryTest/exprs.test.

Change-Id: I0bd06c3d6dba354d71f63c649eaa8f9f74d266ee
Reviewed-on: http://gerrit.cloudera.org:8080/18210
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
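
For illustration only, the byte-count semantics described in this commit can be sketched in Python. This is not the Impala implementation: the helper name byte_length and the choice of UTF-8 as the encoding are assumptions made for the example.

```python
def byte_length(value: str, encoding: str = "utf-8") -> int:
    """Return the number of bytes `value` occupies once encoded.

    Hypothetical helper: it only mirrors the idea of counting bytes
    rather than characters.
    """
    return len(value.encode(encoding))


# ASCII text: one byte per character.
print(byte_length("impala"))
# A non-ASCII character can take more than one byte in UTF-8.
print(byte_length("é"))
```

The point of the sketch is that a byte count and a character count differ as soon as the string contains multi-byte characters.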








[jira] [Commented] (IMPALA-11097) Execute sometimes fails in call to Hive in test framework

2022-02-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490705#comment-17490705
 ] 

ASF subversion and git services commented on IMPALA-11097:
--

Commit 677d4f91a30e6f12d99b2422514c50d0bb7c799f in impala's branch 
refs/heads/master from Steve Carlin
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=677d4f9 ]

IMPALA-11097: In test framework, call HS2 execute synchronously

Changed the HS2 call to be synchronous. The previous code had a
race condition because wait_to_finish needs to be called before
checking the result set for Hive. Calling execute synchronously
for HS2 ensures that the result set is ready.

Change-Id: I5ab4b90ba2e1a439119d37fe9fb9c55eeeb53ba0
Reviewed-on: http://gerrit.cloudera.org:8080/18133
Reviewed-by: Csaba Ringhofer 
Tested-by: Csaba Ringhofer 
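
The race described in this commit can be sketched with a small simulation. Everything here is hypothetical (FakeOperation, execute_async, execute_sync and the timer-based delay are made up for the example); it is not the test framework's code, only an illustration of why fetching before the operation finishes fails and why a synchronous execute avoids it.

```python
import threading


class FakeOperation:
    """Hypothetical stand-in for a server-side operation handle."""

    def __init__(self, rows, delay_s):
        self._rows = rows
        self._done = threading.Event()
        # Simulate the server finishing the statement after `delay_s` seconds.
        threading.Timer(delay_s, self._done.set).start()

    def wait_to_finish(self):
        # Block until the simulated execution has completed.
        self._done.wait()

    def fetch(self):
        # Fetching before completion is the failure mode from the issue.
        if not self._done.is_set():
            raise RuntimeError("fetch called before execute finished")
        return list(self._rows)


def execute_async(rows, delay_s=0.3):
    """Return immediately; the caller must call wait_to_finish() itself."""
    return FakeOperation(rows, delay_s)


def execute_sync(rows, delay_s=0.3):
    """Return only once the result set is ready."""
    op = FakeOperation(rows, delay_s)
    op.wait_to_finish()
    return op
```

With execute_async, a fetch issued right away raises in this sketch unless wait_to_finish is called first; with execute_sync the handle is always safe to fetch from.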


> Execute sometimes fails in call to Hive in test framework
> -
>
> Key: IMPALA-11097
> URL: https://issues.apache.org/jira/browse/IMPALA-11097
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Reporter: Steve Carlin
>Priority: Major
>
> Hive can fail if you call fetch before the execute succeeds. We should call 
> wait_to_finish before fetching any results.






[jira] [Commented] (IMPALA-11072) TestSpillingDebugActionDimensions.test_spilling is flaky

2022-02-10 Thread Riza Suminto (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-11072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490588#comment-17490588
 ] 

Riza Suminto commented on IMPALA-11072:
---

Hi [~stigahuang],
I've seen some flakiness in downstream builds for this exact test case.
The number of fragments assigned to each impalad seems to be inconsistent 
because a different Parquet file count/size is created on each run.
I think it is better to investigate this in a separate JIRA.

> TestSpillingDebugActionDimensions.test_spilling is flaky
> 
>
> Key: IMPALA-11072
> URL: https://issues.apache.org/jira/browse/IMPALA-11072
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 4.0.0
>Reporter: Riza Suminto
>Assignee: Riza Suminto
>Priority: Major
> Fix For: Impala 4.1.0
>
>
> We have seen some failures of TestSpillingDebugActionDimensions.test_spilling 
> in the GVO Jenkins job and downstream nightly tests. The latest one happened in 
> [https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/15503/]
>  
>  
> {code:java}
> query_test/test_spilling.py:75: in test_spilling
> self.run_test_case('QueryTest/spilling', vector)
> common/impala_test_suite.py:743: in run_test_case
> update_section=pytest.config.option.update_results)
> common/test_result_verifier.py:636: in verify_runtime_profile
> actual))
> E   AssertionError: Did not find matches for lines in runtime profile:
> E   EXPECTED LINES:
> E   row_regex: .*SpilledPartitions: .* \([1-9][0-9]*\)
> E   
> E   ACTUAL PROFILE:
> E   Query (id=8b433ac02c4d3fd2:3c50b7c4):
> E  - InactiveTotalTime: 0.000ns
> E  - TotalTime: 0.000ns
> E Summary:
> E   Session ID: 9448bded8acf05c6:428a7a797f6b9483
> E   Session Type: BEESWAX
> E   Start Time: 2022-01-08 09:37:07.647285000
> E   End Time: 2022-01-08 09:37:15.514936000
> E   Query Type: QUERY
> E   Query State: FINISHED
> E   Impala Query State: FINISHED
> E   Query Status: OK
> E   Impala Version: impalad version 4.1.0-SNAPSHOT RELEASE (build 
> 560ff976d3a08920a08b4ce3325a1dd9dbe81765)
> E   User: ubuntu
> E   Connected User: ubuntu
> E   Delegated User: 
> E   Network Address: :::127.0.0.1:44648
> E   Default Db: tpch_parquet
> E   Sql Statement: select count(l1.l_tax)
> E   from
> E   lineitem l1,
> E   lineitem l2,
> E   lineitem l3
> E   where
> E   l1.l_tax < 0.01 and
> E   l2.l_tax < 0.04 and
> E   l1.l_orderkey = l2.l_orderkey and
> E   l1.l_orderkey = l3.l_orderkey and
> E   l1.l_comment = l3.l_comment and
> E   l1.l_shipdate = l3.l_shipdate
> E   Coordinator: ip-172-31-21-231:27000
> E   Query Options (set by configuration): 
> BUFFER_POOL_LIMIT=225443840,MT_DOP=0,DEFAULT_SPILLABLE_BUFFER_SIZE=262144,TIMEZONE=Universal,CLIENT_IDENTIFIER=query_test/test_spilling.py::TestSpillingDebugActionDimensions::()::test_spilling[protocol:beeswax|exec_option:{'mt_dop':0;'debug_action':None;'default_spillable_buffer_size':'256k'}|table_format:parquet/none]
> E   Query Options (set by configuration and planner): 
> BUFFER_POOL_LIMIT=225443840,MT_DOP=0,DEFAULT_SPILLABLE_BUFFER_SIZE=262144,TIMEZONE=Universal,CLIENT_IDENTIFIER=query_test/test_spilling.py::TestSpillingDebugActionDimensions::()::test_spilling[protocol:beeswax|exec_option:{'mt_dop':0;'debug_action':None;'default_spillable_buffer_size':'256k'}|table_format:parquet/none],MINMAX_FILTER_THRESHOLD=0.5,MINMAX_FILTERING_LEVEL=PAGE
>  
> ...{code}
>  
> We should lower the configured BUFFER_POOL_LIMIT for this test to less than 
> 215MB so that it spills more consistently.
>  
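
The BUFFER_POOL_LIMIT in the quoted profile is given in bytes. A quick conversion (a sketch; the helper name bytes_to_mib is made up for this example, and "MB" is read as 1024 * 1024 bytes) shows that the configured value is exactly the 215MB threshold mentioned in the description:

```python
MIB = 1024 * 1024  # bytes per MiB


def bytes_to_mib(num_bytes: int) -> float:
    """Convert a byte count to MiB (hypothetical helper for this example)."""
    return num_bytes / MIB


# BUFFER_POOL_LIMIT as it appears in the query options of the failing run.
current_limit = 225443840
print(bytes_to_mib(current_limit))  # 215.0
```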






[jira] [Commented] (IMPALA-11072) TestSpillingDebugActionDimensions.test_spilling is flaky

2022-02-10 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-11072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490585#comment-17490585
 ] 

Quanlong Huang commented on IMPALA-11072:
-

 Saw this again in an unrelated change: 
[https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/5248/]
{code:java}
query_test/test_spilling.py:75: in test_spilling
self.run_test_case('QueryTest/spilling', vector)
common/impala_test_suite.py:743: in run_test_case
update_section=pytest.config.option.update_results)
common/test_result_verifier.py:636: in verify_runtime_profile
actual))
E   AssertionError: Did not find matches for lines in runtime profile:
E   EXPECTED LINES:
E   row_regex: .*SpilledPartitions: .* \([1-9][0-9]*\)
E   
E   ACTUAL PROFILE:
E   Query (id=6d47a6323a1d674b:133714c1):
E DEBUG MODE WARNING: Query profile created while running a DEBUG build of 
Impala. Use RELEASE builds to measure query performance.
E  - InactiveTotalTime: 0.000ns
E  - TotalTime: 0.000ns
E Summary:
E   Session ID: d446ae1fd4c1316d:f637b568190aa0ba
E   Session Type: BEESWAX
E   Start Time: 2022-02-10 14:08:19.382722000
E   End Time: 2022-02-10 14:08:39.128335000
E   Query Type: QUERY
E   Query State: FINISHED
E   Impala Query State: FINISHED
E   Query Status: OK
E   Impala Version: impalad version 4.1.0-SNAPSHOT DEBUG (build 
4e3271faf44433c5d3f847a0f965ab4ef1b1a48d)
E   User: ubuntu
E   Connected User: ubuntu
E   Delegated User: 
E   Network Address: 172.18.0.1:41204
E   Default Db: tpch_parquet
E   Sql Statement: SELECT straight_join o_orderkey
E   FROM (
E SELECT *
E FROM orders
E   JOIN customer ON o_custkey = c_custkey
E   JOIN nation ON c_nationkey = n_nationkey
E   JOIN region ON n_regionkey = r_regionkey
E WHERE  o_orderkey < 50) o1
E LEFT ANTI JOIN /*+broadcast*/ (
E SELECT *
E FROM orders
E   JOIN customer ON o_custkey = c_custkey
E   JOIN nation ON c_nationkey = n_nationkey
E   JOIN region ON n_regionkey = r_regionkey
E WHERE  o_orderkey < 50) o2 ON o1.o_orderkey = o2.o_orderkey
E AND o1.o_custkey = o2.o_custkey
E AND o1.o_orderstatus = o2.o_orderstatus
E AND o1.o_totalprice = o2.o_totalprice
E AND o1.o_orderdate = o2.o_orderdate
E AND o1.o_orderpriority = o2.o_orderpriority
E AND o1.o_clerk = o2.o_clerk
E AND o1.o_shippriority = o2.o_shippriority
E AND o1.o_comment = o2.o_comment
E AND o1.c_custkey = o2.c_custkey
E AND o1.c_name = o2.c_name
E AND o1.c_address = o2.c_address
E AND o1.c_nationkey = o2.c_nationkey
E AND o1.c_phone = o2.c_phone
E AND o1.c_acctbal = o2.c_acctbal
E AND o1.c_mktsegment = o2.c_mktsegment
E AND o1.n_nationkey = o2.n_nationkey
E AND o1.n_name = o2.n_name
E AND o1.n_regionkey = o2.n_regionkey
E AND o1.n_comment = o2.n_comment
E AND o1.r_name = o2.r_name
E AND o1.r_comment = o2.r_comment
E AND fnv_hash(o1.n_name) = fnv_hash(o2.n_name)
E AND fnv_hash(o1.r_name) = fnv_hash(o2.r_name)
E AND fnv_hash(o1.o_orderstatus) = fnv_hash(o2.o_orderstatus)
E AND fnv_hash(o1.o_shippriority) = fnv_hash(o2.o_shippriority)
E AND fnv_hash(o1.o_orderdate) = fnv_hash(o2.o_orderdate)
E AND fnv_hash(o1.o_orderpriority) = fnv_hash(o2.o_orderpriority)
E AND fnv_hash(o1.o_clerk) = fnv_hash(o2.o_clerk)
E   ORDER BY o_orderkey
E   Coordinator: 172.18.0.4:27000
E   Query Options (set by configuration): 
BUFFER_POOL_LIMIT=115343360,RUNTIME_FILTER_MODE=OFF,MT_DOP=0,DEFAULT_SPILLABLE_BUFFER_SIZE=262144,TIMEZONE=UTC,CLIENT_IDENTIFIER=query_test/test_spilling.py::TestSpillingDebugActionDimensions::()::test_spilling[protocol:beeswax|exec_option:{'mt_dop':0;'debug_action':None;'default_spillable_buffer_size':'256k'}|table_format:parquet/none]
E   Query Options (set by configuration and planner): 
BUFFER_POOL_LIMIT=115343360,RUNTIME_FILTER_MODE=OFF,MT_DOP=0,DEFAULT_SPILLABLE_BUFFER_SIZE=262144,TIMEZONE=UTC,CLIENT_IDENTIFIER=query_test/test_spilling.py::TestSpillingDebugActionDimensions::()::test_spilling[protocol:beeswax|exec_option:{'mt_dop':0;'debug_action':None;'default_spillable_buffer_size':'256k'}|table_format:parquet/none]
E   Plan:
...{code}
This is a different query. We probably need to set a different BUFFER_POOL_LIMIT for 
it. Should we reopen this Jira or create another one?
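
To make the failing assertion concrete, here is a small sketch of how the row_regex from the trace behaves. The two sample profile lines are hypothetical, and matching each line with re.match is an assumption about how the verifier applies the pattern:

```python
import re

# Pattern copied from the EXPECTED LINES section of the trace.
ROW_REGEX = re.compile(r".*SpilledPartitions: .* \([1-9][0-9]*\)")


def profile_has_spill(profile_text: str) -> bool:
    """Return True if any profile line reports a non-zero SpilledPartitions."""
    return any(ROW_REGEX.match(line) for line in profile_text.splitlines())


# Hypothetical counter lines, one without and one with spilling.
no_spill = "   - SpilledPartitions: 0 (0)"
with_spill = "   - SpilledPartitions: 4 (4)"
print(profile_has_spill(no_spill), profile_has_spill(with_spill))
```

The character class [1-9] at the start of the parenthesised value is what rejects a profile in which nothing spilled, which is why the test fails when the buffer pool limit is large enough for the query to run in memory.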

> TestSpillingDebugActionDimensions.test_spilling is flaky
> 
>
> Key: IMPALA-11072
> URL: https://issues.apache.org/jira/browse/IMPALA-11072
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 4.0.0
>Reporter: Riza Suminto
>

[jira] [Assigned] (IMPALA-10948) Impala shouldn't require DECIMAL scale for Parquet files

2022-02-10 Thread Gergely Fürnstáhl (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-10948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergely Fürnstáhl reassigned IMPALA-10948:
--

Assignee: Gergely Fürnstáhl

> Impala shouldn't require DECIMAL scale for Parquet files
> 
>
> Key: IMPALA-10948
> URL: https://issues.apache.org/jira/browse/IMPALA-10948
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gergely Fürnstáhl
>Priority: Major
>  Labels: ramp-up
>
> Impala requires the 'scale' to be set for decimal columns: 
> https://github.com/apache/impala/blob/1a61a8025c87c37921a1bba4c49f754d8bd10bcc/be/src/exec/parquet/parquet-metadata-utils.cc#L332
> But it is only an optional field in Parquet's 
> [SchemaElement|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L392]
>  and the 
> [docs|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal]
>  say that if the scale is unspecified, it should be treated as 0.
> Then there's the new logical type annotation 
> [DecimalType|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L253],
>  but Impala doesn't use it during scans.
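
The behaviour requested here can be sketched as follows. This is a minimal illustration, not Impala's C++ code: the SchemaElementSketch class and the effective_scale helper are hypothetical names, and only the "missing scale means 0" rule comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchemaElementSketch:
    """Hypothetical, trimmed-down view of a Parquet schema element."""
    name: str
    precision: Optional[int] = None
    scale: Optional[int] = None  # optional field: may be absent in the file


def effective_scale(element: SchemaElementSketch) -> int:
    """Treat an unspecified scale as 0 instead of rejecting the column."""
    return 0 if element.scale is None else element.scale


print(effective_scale(SchemaElementSketch("price", precision=10)))
print(effective_scale(SchemaElementSketch("price", precision=10, scale=2)))
```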






[jira] [Assigned] (IMPALA-10946) RECOVER PARTITIONS might create non-existing partitions

2022-02-10 Thread Gergely Fürnstáhl (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-10946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergely Fürnstáhl reassigned IMPALA-10946:
--

Assignee: Gergely Fürnstáhl

> RECOVER PARTITIONS might create non-existing partitions
> ---
>
> Key: IMPALA-10946
> URL: https://issues.apache.org/jira/browse/IMPALA-10946
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog, Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gergely Fürnstáhl
>Priority: Major
>  Labels: ramp-up
>
> The following commands reproduce the bug:
> {noformat}
> create table test_table (id int)
> partitioned by (part_field string)
> stored as parquet
> LOCATION '/test-warehouse/abc/test';
> insert into test_table (id, part_field) select 1, 'abc+';
> show partitions test_table; -- shows one partition, "abc+"
> alter table test_table recover partitions;
> show partitions test_table; -- shows two partitions, "abc" and "abc+"
> {noformat}
> The + character can occur anywhere in the string; RECOVER PARTITIONS will 
> create a partition where the + is replaced by a space.
> Other characters don't seem to cause this bug.
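
The issue does not state the cause. As an assumption only, one mechanism that produces exactly this symptom is form-style URL decoding, where "+" is decoded as a space. The sketch below uses Python's urllib.parse to show the difference between the two decoding modes; it says nothing about which call Impala actually makes.

```python
from urllib.parse import unquote, unquote_plus

partition_value = "abc+"

# Form-style decoding turns "+" into a space, changing the value.
form_decoded = unquote_plus(partition_value)

# Plain percent-decoding leaves "+" untouched.
plain_decoded = unquote(partition_value)

print(repr(form_decoded), repr(plain_decoded))
```

If a partition directory name were decoded in the first way, the recovered value would differ from the stored one and would look like a new partition.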






[jira] [Commented] (IMPALA-6636) Use async IO in ORC scanner

2022-02-10 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490114#comment-17490114
 ] 

Quanlong Huang commented on IMPALA-6636:


Thanks to [~rizaon] and [~csringhofer] for getting this done! Great work!

> Use async IO in ORC scanner
> ---
>
> Key: IMPALA-6636
> URL: https://issues.apache.org/jira/browse/IMPALA-6636
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Quanlong Huang
>Assignee: Riza Suminto
>Priority: Critical
>
> Though ORC-262 has made no progress, we can still prefetch data and let the ORC lib 
> read from an in-memory InputStream.






[jira] [Commented] (IMPALA-6636) Use async IO in ORC scanner

2022-02-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490113#comment-17490113
 ] 

ASF subversion and git services commented on IMPALA-6636:
-

Commit 97dda2b27da99367f4d07699aa046b16cda16dd4 in impala's branch 
refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=97dda2b ]

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin by iterating over the column streams. If a column stream is
eligible for async IO, we create a ColumnRange, register a
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
read continues through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, the 4MB memory estimate of a string column is split
into four 1MB reservations because the column might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH).
This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.
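
The splitting rule from the paragraph above can be sketched in a few lines. This is an illustration, not the planner's code: split_reservation is a hypothetical name, and raising a too-small share up to the minimum buffer size is an assumption about how the 8KB floor would be honoured.

```python
MIN_BUFFER_SIZE = 8 * 1024  # 8KB initial reservation per stream, per the text


def split_reservation(column_bytes: int, num_streams: int,
                      min_buffer_size: int = MIN_BUFFER_SIZE) -> list:
    """Split one column's memory estimate evenly across its streams."""
    if num_streams <= 0:
        raise ValueError("num_streams must be positive")
    share = column_bytes // num_streams
    # Assumption: every stream gets at least the minimum buffer size.
    return [max(share, min_buffer_size)] * num_streams


# A string column estimated at 4MB with four streams
# (PRESENT, DATA, DICTIONARY_DATA, LENGTH).
print(split_reservation(4 * 1024 * 1024, 4))
```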

To show the improvement from ORC async IO, we compare the total time
and geomean (in milliseconds) of running the full TPC-DS 10 TB workload
on 19 executors with varying ORC_ASYNC_READ and DISABLE_DATA_CACHE
options, as follows:

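+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+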
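
As a reading aid, the relative improvement implied by these numbers can be computed with a short sketch. The figures are copied from the tables; the helper name improvement_pct and the dictionary layout are made up for the example.

```python
def improvement_pct(baseline_ms: float, async_ms: float) -> float:
    """Percentage reduction in time when going from baseline to async IO."""
    return (baseline_ms - async_ms) / baseline_ms * 100.0


# (ORC_ASYNC_READ=0, ORC_ASYNC_READ=1) keyed by the DISABLE_DATA_CACHE value.
total_time_ms = {0: (3511075, 3484736), 1: (5243337, 4370095)}
geomean_ms = {0: (12786.58042, 12454.80365), 1: (23081.10888, 16692.31512)}

for label, table in (("total", total_time_ms), ("geomean", geomean_ms)):
    for cache_disabled, (sync_ms, async_ms) in table.items():
        pct = improvement_pct(sync_ms, async_ms)
        print(f"{label} DISABLE_DATA_CACHE={cache_disabled}: {pct:.2f}% faster")
```

The gain is small when the data cache is enabled and much larger when it is disabled, which matches what one would expect if async reads mainly hide remote IO latency.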

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Reviewed-on: http://gerrit.cloudera.org:8080/15370
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 




