[jira] [Commented] (IMPALA-8830) Coordinator-only queries get queued when there are no executor groups

2020-06-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140838#comment-17140838
 ] 

ASF subversion and git services commented on IMPALA-8830:
-

Commit 004e3c897e4890adb7d751b881b31ca71f5a533d in impala's branch 
refs/heads/master from Bikramjeet Vig
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=004e3c8 ]

IMPALA-8830: Fix executor group assignment of coordinator only queries

With this fix, coordinator only queries are submitted to a pseudo
executor group named "empty group (using coordinator only)" which
is empty. This allows running coordinator only queries regardless
of the presence of any healthy executor groups.
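
The behaviour described in this commit message can be sketched roughly as follows. This is a minimal Python illustration and not the actual C++ admission controller code: the function name `pick_executor_group` and its parameters are hypothetical, and only the pseudo group's name is taken from the commit message.

```python
from typing import List, Optional

# Name of the pseudo executor group, as given in the commit message.
EMPTY_GROUP_NAME = "empty group (using coordinator only)"


def pick_executor_group(coordinator_only: bool,
                        healthy_groups: List[str]) -> Optional[str]:
    """Return the group a query should run on, or None if it must be queued.

    Hypothetical helper for illustration; the real logic lives in
    Impala's C++ backend and is more involved.
    """
    if coordinator_only:
        # Coordinator-only queries need no executors, so they never have
        # to wait for a healthy executor group to appear.
        return EMPTY_GROUP_NAME
    if healthy_groups:
        # Simplifying assumption: take the first healthy group.
        return healthy_groups[0]
    # No healthy executor group: the query stays queued.
    return None
```

Under these assumptions, a query such as `select 1` from the reproduction would be admitted even when `healthy_groups` is empty, while the `tpch.lineitem` query would still be queued.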

Testing:
Added a custom cluster test and modified tests that relied on
coordinator only queries to be queued in absence of executor groups.

Change-Id: I8fe098032744aa20bbbe4faddfc67e7a46ce03d5
Reviewed-on: http://gerrit.cloudera.org:8080/14183
Reviewed-by: Bikramjeet Vig 
Tested-by: Impala Public Jenkins 


> Coordinator-only queries get queued when there are no executor groups
> -
>
> Key: IMPALA-8830
> URL: https://issues.apache.org/jira/browse/IMPALA-8830
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 3.3.0
>Reporter: Tim Armstrong
>Assignee: Bikramjeet Vig
>Priority: Blocker
>  Labels: admission-control, resource-management
>
> Reproduction:
> {noformat}
> tarmstrong@tarmstrong-box:~/Impala/incubator-impala$ start-impala-cluster.py 
> -s1 --use_exclusive_coordinators;
> [localhost:21000] default> select * from tpch.lineitem order by l_orderkey 
> limit 5;
> ERROR: Admission for query exceeded timeout 6ms in pool default-pool. 
> Queued reason: No healthy executor groups found for pool default-pool.
> [localhost:21000] default> select 1;
> ERROR: Admission for query exceeded timeout 6ms in pool default-pool. 
> Queued reason: No healthy executor groups found for pool default-pool.
> {noformat}
> I expected that the second query should run immediately since it doesn't 
> actually need to be scheduled on any executors. I suspect this may be a 
> regression from the executor group changes, but didn't confirm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-5904) Enable ThreadSanitizer for Impala

2020-06-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-5904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140839#comment-17140839
 ] 

ASF subversion and git services commented on IMPALA-5904:
-

Commit 17fd15c6e4981499932c02d541c76757a5fdf87d in impala's branch 
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=17fd15c ]

IMPALA-5904: (part 4) Fix more TSAN bugs

Fixes the following TSAN data races that come up when running custom
cluster tests. The immediate goal is to fix all remaining data races in
custom cluster tests and then enable custom cluster tests in the TSAN
builds. This patch fixes about half of the remaining data races reported
during a TSAN build of custom cluster tests.

SUMMARY: ThreadSanitizer: data race util/stopwatch.h:186:9 in impala::MonotonicStopWatch::RunningTime() const
  Read of size 8 at 0x7b58dba8 by thread T342:
    #0 impala::MonotonicStopWatch::RunningTime() const util/stopwatch.h:186:9
    #1 impala::MonotonicStopWatch::Reset() util/stopwatch.h:136:20
    #2 impala::StatestoreSubscriber::Heartbeat(impala::TUniqueId const&) statestore/statestore-subscriber.cc:358:35
  Previous write of size 8 at 0x7b58dba8 by thread T341:
    #0 impala::MonotonicStopWatch::Reset() util/stopwatch.h:139:21 (impalad+0x1f744ab)
    #1 impala::StatestoreSubscriber::Heartbeat(impala::TUniqueId const&) statestore/statestore-subscriber.cc:358:35

SUMMARY: ThreadSanitizer: data race status.h:220:10 in impala::Status::operator=(impala::Status&&)
  Write of size 8 at 0x7b50002e01e0 by thread T341 (mutexes: write M17919):
    #0 impala::Status::operator=(impala::Status&&) common/status.h:220:10
    #1 impala::RuntimeState::SetQueryStatus(std::string const&) runtime/runtime-state.h:250
    #2 impala_udf::FunctionContext::SetError(char const*) udf/udf.cc:423:47
  Previous read of size 8 at 0x7b50002e01e0 by thread T342:
    #0 impala::Status::ok() const common/status.h:236:42
    #1 impala::RuntimeState::GetQueryStatus() runtime/runtime-state.h:15
    #2 impala::HdfsScanner::CommitRows(int, impala::RowBatch*) exec/hdfs-scanner.cc:218:3

SUMMARY: ThreadSanitizer: data race hashtable.h:370:58
  Read of size 8 at 0x7b2400091df8 by thread T338 (mutexes: write M106814410723061456):
    ...
    #3 impala::MetricGroup::CMCompatibleCallback() util/metrics.cc:185:40
    ...
    #9 impala::Webserver::RenderUrlWithTemplate() util/webserver.cc:801:3
    #10 impala::Webserver::BeginRequestCallback(sq_connection*, sq_request_info*) util/webserver.cc:696:5
  Previous write of size 8 at 0x7b2400091df8 by thread T364 (mutexes: write M600803201008047112, write M1046659357959855584):
    ...
    #4 impala::AtomicMetric<(impala::TMetricKind::type)0>* impala::MetricGroup::RegisterMetric<> >() util/metrics.h:366:5
    #5 impala::MetricGroup::AddGauge(std::string const&, long, std::string const&) util/metrics.h:384:12
    #6 impala::AdmissionController::PoolStats::InitMetrics() scheduling/admission-controller.cc:1714:55

Testing:
* Ran core tests
* Re-ran TSAN tests and made sure issues were resolved
* Ran single_node_perf_run for workload TPC-H scale factor 30;
  no regressions detected

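+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / none / none | 7.36    | -1.77%     | 5.01       | -1.61%         |
+----------+-----------------------+---------+------------+------------+----------------+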

Change-Id: Id4244c9a7f971c96b8b8dc7d5262904a0a4b77c1
Reviewed-on: http://gerrit.cloudera.org:8080/16079
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Enable ThreadSanitizer for Impala
> -
>
> Key: IMPALA-5904
> URL: https://issues.apache.org/jira/browse/IMPALA-5904
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Reporter: Tim Armstrong
>Assignee: Sahil Takiar
>Priority: Major
>  Labels: ramp-up
> Attachments: impalad.ERROR
>
>
> It would be great to be able to automatically detect data races in Impala 
> using ThreadSanitizer to avoid tricky-to-reproduce bugs. This issue tracks 
> enabling ThreadSanitizer, fixing bugs and adding suppressions to get to the 
> point where Impala runs cleanly with the sanitizer.






[jira] [Commented] (IMPALA-9688) Support create iceberg table by impala

2020-06-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140835#comment-17140835
 ] 

ASF subversion and git services commented on IMPALA-9688:
-

Commit 8fcad905a12d018eb0a354f7e4793e5b0d5efd3b in impala's branch 
refs/heads/master from skyyws
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8fcad90 ]

IMPALA-9688: Support create iceberg table by impala

This patch implements the creation of Iceberg tables through Impala. A new
Iceberg table can be created with SQL such as the following:
create table iceberg_test(
    level string,
    event_time timestamp,
    message string,
    register_time date,
    telephone array
)
partition by spec(
    level identity,
    event_time identity,
    event_time hour,
    register_time day
)
stored as iceberg;
'identity' is one of Iceberg's partition transforms. 'identity' means that
the source data values are used directly to create partitions. Other partition
transforms, such as BUCKET/TRUNCATE, will be supported in the future. We
can also use 'show create table iceberg_test' to display the table schema, and
use 'show partitions iceberg_test' to display partition column info. Note
that a partition column must be a source column of the table.
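
To make the transforms in the partition spec above more concrete, here is a small Python sketch of what 'identity', 'hour' and 'day' compute for a single row. It assumes the definitions in the Iceberg table spec (hours and days counted from the Unix epoch) and is not code from this patch; the helper names and the sample row are hypothetical.

```python
from datetime import date, datetime, timedelta

EPOCH_DAY = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1)


def identity(value):
    """Identity transform: the source value is the partition value."""
    return value


def hour(ts: datetime) -> int:
    """Hour transform: whole hours elapsed since the Unix epoch."""
    return (ts - EPOCH_TS) // timedelta(hours=1)


def day(d: date) -> int:
    """Day transform: whole days elapsed since the Unix epoch."""
    return (d - EPOCH_DAY).days


def partition_values(row: dict) -> tuple:
    """Apply the spec from the example: identity, identity, hour, day."""
    return (
        identity(row["level"]),
        identity(row["event_time"]),
        hour(row["event_time"]),
        day(row["register_time"]),
    )


# Hypothetical sample row matching the columns of iceberg_test.
sample = {
    "level": "INFO",
    "event_time": datetime(2020, 6, 19, 13, 30),
    "register_time": date(2020, 6, 19),
}
print(partition_values(sample))
```

Rows that produce the same tuple of partition values would land in the same partition under this simplified model.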

Testing:
- Add test cases in metadata/test_show_create_table.py.
- Add custom cluster test test_iceberg.py.

Change-Id: I8d85db4c904a8c758c4cfb4f19cfbdab7e6ea284
Reviewed-on: http://gerrit.cloudera.org:8080/15797
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Support create iceberg table by impala
> --
>
> Key: IMPALA-9688
> URL: https://issues.apache.org/jira/browse/IMPALA-9688
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: WangSheng
>Assignee: WangSheng
>Priority: Major
>
> This sub-task mainly realizes the creation of iceberg table through impala






[jira] [Commented] (IMPALA-9739) TSAN data races during impalad shutdown

2020-06-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140837#comment-17140837
 ] 

ASF subversion and git services commented on IMPALA-9739:
-

Commit 950e51f9a8531b9388d8e427e5e76dcf13048362 in impala's branch 
refs/heads/master from Bikramjeet Vig
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=950e51f ]

IMPALA-9739: Fix data race during impala graceful shutdown

When impala does a graceful shutdown, it calls exit() at the end. exit()
performs cleanup that interferes with the shutdown signal handling thread
spawned during init(), which triggers a data race that is caught by the
thread sanitizer build. This patch fixes that by calling _exit() instead,
which skips that cleanup.
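
The difference between the two calls can be illustrated with a small Python analogue. Python's `sys.exit()` runs registered `atexit` handlers, much like C's exit(), while `os._exit()` terminates the process immediately without running them, like _exit(). This is only a sketch of the general mechanism, not Impala code; the child script and the `run_child` helper are made up for the example.

```python
import subprocess
import sys

# Child program: registers a cleanup handler, then exits via the given call.
CHILD_TEMPLATE = """
import atexit, os, sys
atexit.register(lambda: print("cleanup ran"))
sys.stdout.write("shutting down\\n")
sys.stdout.flush()
{exit_call}
"""


def run_child(exit_call: str) -> str:
    """Run the child with the given exit call and return its stdout."""
    code = CHILD_TEMPLATE.format(exit_call=exit_call)
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # With sys.exit() the atexit handler runs; with os._exit() it does not.
    print(repr(run_child("sys.exit(0)")))
    print(repr(run_child("os._exit(0)")))
```

Skipping exit-time cleanup avoids racing with threads that are still running, at the cost of not flushing buffers or running destructors, so anything that must be persisted has to be written out explicitly beforehand.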

Testing:
Ran the offending test TestGracefulShutdown on a thread sanitizer
build and made sure no data race was flagged.

Change-Id: I59bb5326791cd877df4711e23979f9bd88e4659a
Reviewed-on: http://gerrit.cloudera.org:8080/16074
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> TSAN data races during impalad shutdown
> ---
>
> Key: IMPALA-9739
> URL: https://issues.apache.org/jira/browse/IMPALA-9739
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Sahil Takiar
>Assignee: Bikramjeet Vig
>Priority: Major
>
> A TSAN run of the custom cluster tests shows several instances of the 
> following data race during impalad shutdown:
> {code:java}
> WARNING: ThreadSanitizer: data race (pid=12660)
>   Read of size 8 at 0x07786f60 by thread T338:
> #0 std::unique_ptr 
> >::~unique_ptr() 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/bits/unique_ptr.h:235:6
>  (impalad+0x19bd895)
> #1 at_exit_wrapper(void*) 
> /mnt/source/llvm/llvm-5.0.1.src-p2/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:361
>  (impalad+0x191cf13)
> #2 impala::ImpalaServer::StartShutdown(long, 
> impala::ShutdownStatusPB*)::$_2::operator()() const 
> /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/service/impala-server.cc:2622:57
>  (impalad+0x21bd871)
> #3 
> boost::detail::function::void_function_obj_invoker0  impala::ShutdownStatusPB*)::$_2, 
> void>::invoke(boost::detail::function::function_buffer&) 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:159:11
>  (impalad+0x21bd6d9)
> #4 boost::function0::operator()() const 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:770:14
>  (impalad+0x1e192b1)
> #5 impala::Thread::SuperviseThread(std::string const&, std::string 
> const&, boost::function, impala::ThreadDebugInfo const*, 
> impala::Promise*) 
> /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/thread.cc:360:3
>  (impalad+0x23df196)
> #6 void boost::_bi::list5, 
> boost::_bi::value, boost::_bi::value >, 
> boost::_bi::value, 
> boost::_bi::value*> 
> >::operator() boost::function, impala::ThreadDebugInfo const*, 
> impala::Promise*), 
> boost::_bi::list0>(boost::_bi::type, void (*&)(std::string const&, 
> std::string const&, boost::function, impala::ThreadDebugInfo const*, 
> impala::Promise*), boost::_bi::list0&, int) 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:531:9
>  (impalad+0x23e735c)
> #7 boost::_bi::bind_t const&, boost::function, impala::ThreadDebugInfo const*, 
> impala::Promise*), 
> boost::_bi::list5, 
> boost::_bi::value, boost::_bi::value >, 
> boost::_bi::value, 
> boost::_bi::value*> > 
> >::operator()() 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16
>  (impalad+0x23e7273)
> #8 boost::detail::thread_data (*)(std::string const&, std::string const&, boost::function, 
> impala::ThreadDebugInfo const*, impala::Promise (impala::PromiseMode)0>*), boost::_bi::list5, 
> boost::_bi::value, boost::_bi::value >, 
> boost::_bi::value, 
> boost::_bi::value*> > > 
> >::run() 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/thread/detail/thread.hpp:116:17
>  (impalad+0x23e6f60)
> #9 thread_proxy  (impalad+0x30e44f9)
>   Previous write of size 8 at 0x07786f60 by main thread:
> #0 void std::swap(impala::Thread*&, impala::Thread*&) 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/bits/move.h:176:11
>  (impalad+0x221f370)
> #1 std::unique_ptr 
> >::reset(impala::Thread*) 
>

[jira] [Commented] (IMPALA-7538) Support HDFS caching with LocalCatalog

2020-06-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140836#comment-17140836
 ] 

ASF subversion and git services commented on IMPALA-7538:
-

Commit b02fad2db48b5725483fc52098a0c6c04806394b in impala's branch 
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b02fad2 ]

IMPALA-7538: Support HDFS caching with LocalCatalog

This patch adds support for HDFS caching in LocalCatalog coordinators.
HdfsCachePools are propagated the same way catalogd propagates them in
catalog-v1. They are cached in LocalCatalog coordinators as in v1 and are not
“fetch-on-demand”, since only cache pool names are cached.

The isMarkedCached markers of HdfsTable and HdfsPartition are also
propagated to the LocalCatalog coordinators for correctly handling
ShowTableStats and ShowPartitions statements with caching information.

Tests:
 - Revive hdfs caching related tests in metadata/test_ddl.py and
   query_test/test_hdfs_caching.py for LocalCatalog.

Change-Id: I661f7b76a9575f6f5b3fa2c6feebda1a5d7c3712
Reviewed-on: http://gerrit.cloudera.org:8080/16058
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Support HDFS caching with LocalCatalog
> --
>
> Key: IMPALA-7538
> URL: https://issues.apache.org/jira/browse/IMPALA-7538
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Todd Lipcon
>Assignee: Quanlong Huang
>Priority: Major
>  Labels: catalog-v2
>







[jira] [Commented] (IMPALA-9824) MetastoreClientPool should be singleton

2020-06-19 Thread Sahil Takiar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140829#comment-17140829
 ] 

Sahil Takiar commented on IMPALA-9824:
--

[~vihangk1] sorry for the delayed response. I'm not super familiar with these 
tests. AFAICT there are 4 places where we create a {{MetaStoreClientPool}}:
{code:java}
stakiar @ stakiar-desktop -bash ~/Impala 2020-06-19 13:30:21 master
 [51] → grep -iIR 'new MetaStoreClientPool' fe/src/main/
fe/src/main/java/org/apache/impala/service/Frontend.java:  
metaStoreClientPool_ = new MetaStoreClientPool(1, 0);
fe/src/main/java/org/apache/impala/catalog/Catalog.java:this(new 
MetaStoreClientPool(0, 0));
fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java:
new MetaStoreClientPool(INITIAL_META_STORE_CLIENT_POOL_SIZE,
fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java:  
msClientPool_ = new MetaStoreClientPool(cfg.num_metadata_loading_threads, {code}
It seems each one can grow up to a max size of 32 connections. I think the 
usage in Frontend.java is only initialized for coordinators and the rest are 
all initialized in the catalog?
 Right, so after IMPALA-9375 the usage in DirectMetaProvider will go away, and 
we will have the usage in CatalogMetaProvider and Catalog remaining? Is there a 
bound on how many connections the Catalog and CatalogMetaProvider end up using, 
or can they both potentially create 32 connections depending on the load?

Linking [https://gerrit.cloudera.org/#/c/16030/] for reference

> MetastoreClientPool should be singleton
> ---
>
> Key: IMPALA-9824
> URL: https://issues.apache.org/jira/browse/IMPALA-9824
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>Priority: Minor
> Fix For: Not Applicable
>
>
> Currently,  the MetastoreClientPool is instantiated at multiple places in the 
> code and it would be good to refactor the code to make it a singleton. Each 
> MetastoreClientPool creates multiple clients to HMS and unnecessary creation 
> of multiple pools could cause problems on HMS side. 
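
As a rough illustration of the refactoring proposed in the description, the following Python sketch shows one common way to share a single pool per process. It is not Impala's Java code: `ClientPool` and `get_shared_pool` are hypothetical names, and the default maximum of 32 simply mirrors the figure mentioned in the comment above.

```python
import threading
from typing import Optional


class ClientPool:
    """Stand-in for a pool of metastore clients (hypothetical)."""

    def __init__(self, initial_size: int, max_size: int = 32) -> None:
        self.initial_size = initial_size
        self.max_size = max_size


_pool: Optional[ClientPool] = None
_pool_lock = threading.Lock()


def get_shared_pool(initial_size: int = 1) -> ClientPool:
    """Return the process-wide pool, creating it on first use.

    The lock ensures that concurrent first callers do not create two
    pools. Arguments passed after the first call are ignored.
    """
    global _pool
    with _pool_lock:
        if _pool is None:
            _pool = ClientPool(initial_size)
        return _pool
```

With a single shared instance, the total number of HMS connections per process is bounded by one pool's maximum size rather than by the number of places that construct a pool.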






[jira] [Commented] (IMPALA-9872) Profile log does not include profiles of failed queries

2020-06-19 Thread Sahil Takiar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140809#comment-17140809
 ] 

Sahil Takiar commented on IMPALA-9872:
--

Do you mean that the profile of a query should include the profile of all 
previous query attempts? We considered doing that, but we decided against it 
because it would require some potentially incompatible / messy changes to the 
profile format, and it would make the profiles even longer. Instead, we decided 
to keep them separate and just include links to the previous / retried query 
ids.

Some potential improvements to this are IMPALA-9229 and IMPALA-9230

> Profile log does not include profiles of failed queries
> ---
>
> Key: IMPALA-9872
> URL: https://issues.apache.org/jira/browse/IMPALA-9872
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Priority: Major
>  Labels: supportability
>
> As far as I can tell, ImpalaServer::ArchiveQuery() only logs the profile for 
> the last run of the query. We should also include the profiles of failed 
> attempts at the query in the profile log.
> [~stigahuang]






[jira] [Resolved] (IMPALA-5768) Better documentation for developers

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-5768.
---
Resolution: Later

Too open-ended. Can reopen if specific things are requested.

> Better documentation for developers
> ---
>
> Key: IMPALA-5768
> URL: https://issues.apache.org/jira/browse/IMPALA-5768
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zach Amsden
>Assignee: Zach Amsden
>Priority: Minor
>
> Impala uses a ton of ad-hoc environment variables, paths and scripts, none of 
> which are well documented.  This JIRA is to track improvements on such 
> documentation.






[jira] [Resolved] (IMPALA-4487) stress test occasionally leaves runners lingering

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-4487.
---
Resolution: Cannot Reproduce

We fixed a lot of issues in the stress test last year.

> stress test occasionally leaves runners lingering
> -
>
> Key: IMPALA-4487
> URL: https://issues.apache.org/jira/browse/IMPALA-4487
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 2.8.0
>Reporter: Michael Brown
>Priority: Major
>  Labels: stress
> Attachments: stacks.txt
>
>
> The stress test runner {{concurrent_select.py}} sometimes continues to run 
> after all the requested queries have been executed. This will be evident from 
> the console report: the number of Done queries will be the number of queries 
> set by {{concurrent_select.py --max-queries}} but {{concurrent_select.py}} 
> will just continue to run indefinitely until terminated.
> I've looked at a cluster in this state and can't find evidence of any queries in 
> flight or hung. This leads me to believe that the bug lies in the test 
> infrastructure, not Impala.
> The debug logs show this over and over:
> {noformat}
> 08:37:53 2845 140608723109632 DEBUG:concurrent_select[398]:Producer is alive: 
> False
> 08:37:53 2845 140608723109632 DEBUG:concurrent_select[399]:Consumer is alive: 
> False
> 08:37:53 2845 140608723109632 DEBUG:concurrent_select[400]:Queue size: 0
> 08:37:53 2845 140608723109632 DEBUG:concurrent_select[401]:Runners: 1
> {noformat}
> *Workaround:*
> Send {{SIGTERM}} to the hung child process.






[jira] [Resolved] (IMPALA-4419) bin/bootstrap_development.py requires logging out and back in

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-4419.
---
Resolution: Won't Fix

> bin/bootstrap_development.py requires logging out and back in
> -
>
> Key: IMPALA-4419
> URL: https://issues.apache.org/jira/browse/IMPALA-4419
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 2.8.0
> Environment: Ubuntu 14.04
>Reporter: Jim Apple
>Priority: Major
>
> bin/bootstrap_development.py uses 
> https://github.com/awleblang/impala-setup/commit/56fa829c99e997585eb63fcd49cb65eb8357e679
> to change /etc/security/limits.conf. After the change, the user must log out 
> and log back in to make the change take effect, but 
> bin/bootstrap_development.py just charges ahead and runs the tests, which 
> takes many hours and may fail because the changes have not yet taken effect.
> I don't see a great fix here - we may need to have two scripts: one to run 
> before logout/login, and one to run after.






[jira] [Assigned] (IMPALA-5159) Inequality constraints can be further optimized

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-5159:
-

Assignee: (was: Zach Amsden)

> Inequality constraints can be further optimized
> ---
>
> Key: IMPALA-5159
> URL: https://issues.apache.org/jira/browse/IMPALA-5159
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zach Amsden
>Priority: Minor
>
> As a follow up to IMPALA-5003, we can optimize constraints that include 
> multiple minima or maxima to their greatest lower, or least upper bound.
> Example:
> WHERE t.a > 10 AND t.a >= 5 AND t.b < 50 and t.b <= 50 =>
> t.a > 10 AND t.b < 50
> In cases where it is impossible to satisfy the constraints, we can convert 
> conditions to FALSE.
> Further, for inclusive inequality chains, one can deduce from A <= B, B <= C, 
> C <= A =>
> A = B, A = C
> And from exclusive inequality chains, one can infer a contradiction: 
> A <= B, B < C, C <= A => FALSE
> The last two steps require quite a bit more work.
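
The first step in the description (reducing several constant bounds on a column to the tightest ones, and detecting unsatisfiable ranges) can be sketched in Python as below. This is only an illustration of the idea, not planner code: the `tighten` function and the tuple representation of predicates are assumptions made for the example, and the inequality-chain inference from the last two steps is not covered.

```python
from typing import Dict, List, Optional, Tuple

# (column, operator, constant), with operator one of >, >=, <, <=
Constraint = Tuple[str, str, float]


def tighten(constraints: List[Constraint]) -> Optional[List[Constraint]]:
    """Keep only the tightest bounds per column; None means FALSE."""
    lower: Dict[str, Tuple[float, bool]] = {}  # column -> (value, strict)
    upper: Dict[str, Tuple[float, bool]] = {}
    for col, op, val in constraints:
        strict = op in (">", "<")
        if op in (">", ">="):
            cur = lower.get(col)
            # Larger value wins; on a tie the strict bound is tighter.
            if cur is None or (val, strict) > cur:
                lower[col] = (val, strict)
        elif op in ("<", "<="):
            cur = upper.get(col)
            # Smaller value wins; on a tie the strict bound is tighter.
            if (cur is None or val < cur[0]
                    or (val == cur[0] and strict and not cur[1])):
                upper[col] = (val, strict)
        else:
            raise ValueError("unsupported operator: " + op)

    result: List[Constraint] = []
    for col in sorted(set(lower) | set(upper)):
        lo = lower.get(col)
        hi = upper.get(col)
        if lo is not None and hi is not None:
            # Empty range: lower bound above upper bound, or equal
            # bounds where at least one side is strict.
            if lo[0] > hi[0] or (lo[0] == hi[0] and (lo[1] or hi[1])):
                return None
        if lo is not None:
            result.append((col, ">" if lo[1] else ">=", lo[0]))
        if hi is not None:
            result.append((col, "<" if hi[1] else "<=", hi[0]))
    return result


example = [("t.a", ">", 10), ("t.a", ">=", 5),
           ("t.b", "<", 50), ("t.b", "<=", 50)]
print(tighten(example))
```

For the example from the description, this reduces the four predicates to `t.a > 10 AND t.b < 50`, and a contradictory pair such as `x > 5 AND x < 3` yields `None`, standing in for FALSE.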






[jira] [Resolved] (IMPALA-2863) impala-shell can't work with setuptools version 0.7.2

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2863.
---
Resolution: Information Provided

> impala-shell can't work with setuptools version 0.7.2
> -
>
> Key: IMPALA-2863
> URL: https://issues.apache.org/jira/browse/IMPALA-2863
> Project: IMPALA
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: Impala 2.3.0
> Environment: OS: RHEL 6.4
> CDH: 5.5.1
>Reporter: hsuyijun
>Priority: Minor
>  Labels: impala, python, setuptools, shell
>
> [root@datanode1 ~]# impala-shell
> Traceback (most recent call last):
>   File "/usr/lib/impala-shell/impala_shell.py", line 33, in 
> from impala_client import (ImpalaClient, DisconnectedException, 
> QueryStateException,
>   File "/usr/lib/impala-shell/lib/impala_client.py", line 16, in 
> import sasl
>   File 
> "/usr/lib/impala-shell/ext-py/sasl-0.1.1-py2.6-linux-x86_64.egg/sasl/__init__.py",
>  line 1, in 
>   File 
> "/usr/lib/impala-shell/ext-py/sasl-0.1.1-py2.6-linux-x86_64.egg/sasl/saslwrapper.py",
>  line 7, in 
>   File 
> "/usr/lib/impala-shell/ext-py/sasl-0.1.1-py2.6-linux-x86_64.egg/_saslwrapper.py",
>  line 7, in 
>   File 
> "/usr/lib/impala-shell/ext-py/sasl-0.1.1-py2.6-linux-x86_64.egg/_saslwrapper.py",
>  line 3, in __bootstrap__
>   File "/usr/lib/impala-shell/lib/pkg_resources.py", line 2696, in 
> add_activation_listener(lambda dist: dist.activate())
>   File "/usr/lib/impala-shell/lib/pkg_resources.py", line 673, in subscribe
> callback(dist)
>   File "/usr/lib/impala-shell/lib/pkg_resources.py", line 2696, in 
> add_activation_listener(lambda dist: dist.activate())
>   File "/usr/lib/impala-shell/lib/pkg_resources.py", line 2197, in activate
> self.insert_on(path)
>   File "/usr/lib/impala-shell/lib/pkg_resources.py", line 2298, in insert_on
> "with distribute. Found one at %s" % str(self.location))
> ValueError: A 0.7-series setuptools cannot be installed with distribute. 
> Found one at /usr/lib/python2.6/site-packages






[jira] [Resolved] (IMPALA-4919) Jenkins jobs don't bubble up artifacts to parents

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-4919.
---
Resolution: Later

> Jenkins jobs don't bubble up artifacts to parents
> -
>
> Key: IMPALA-4919
> URL: https://issues.apache.org/jira/browse/IMPALA-4919
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 2.8.0
>Reporter: Sailesh Mukil
>Priority: Major
>  Labels: infra
>
> When the 'gerrit-verify-dryrun' job runs, it kicks off a chain of jobs 
> under the covers. The final artifacts for the entire run (impalad logs, 
> etc.) do not show up in the 'gerrit-verify-dryrun' job and only show up in 
> the final job in the chain of jobs that get kicked off.
> This makes errors hard to track down when we need to diagnose them. Ideally, 
> the job that the user interacts with should have links to all the necessary 
> artifacts.






[jira] [Commented] (IMPALA-4317) Single Overloaded Impalad Causes Entire Cluster to Hang

2020-06-19 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140802#comment-17140802
 ] 

Tim Armstrong commented on IMPALA-4317:
---

This is quite old and it sounds like the problem was likely connected to the 
old RPC stack and thread counts. There is some additional work to blacklist 
unhealthy nodes that is relevant.

We can't detect *in general* slowness of a single impalad and blacklist based 
on that, but I think we would have fixed this particular scenario. So I'll 
close out this JIRA.

> Single Overloaded Impalad Causes Entire Cluster to Hang
> ---
>
> Key: IMPALA-4317
> URL: https://issues.apache.org/jira/browse/IMPALA-4317
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 2.5.0
> Environment: Enterprise CDH 5.7.0, Parcels
> impalad version 2.5.0-cdh5.7.0 RELEASE (build 
> ad3f5adabedf56fe6bd9eea39147c067cc552703)
>Reporter: Scott Wallace
>Priority: Major
> Attachments: cached_clients.png, health.png, load.png, queries.png, 
> threads.png, worker23.png
>
>
> Occasionally we experience heavy load on a single impalad host. This causes 
> the entire cluster to hang and prevents any impala queries from being 
> executed.
> Here's what we observe:
> - load increases on a single impalad
> - query throughput across the entire impala cluster drops and we cannot get 
> any queries to execute
> - the number of running threads continues to increase until we restart the 
> impala service
> - in the impalad logs we see errors connecting to the unhealthy host. Example: 
> Couldn't open transport for 
> ux-reporting-engine-worker-23-prod-us-east-1a:22000 (connect() failed: 
> Connection timed out)
> Questions:
> Why does the entire Impala service become unstable due to the health of a 
> single impalad?
> Theoretically, shouldn't the impala statestore prevent the single impalad 
> host from being used and allow queries to be processed by healthy nodes?






[jira] [Resolved] (IMPALA-4317) Single Overloaded Impalad Causes Entire Cluster to Hang

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-4317.
---
Resolution: Cannot Reproduce

> Single Overloaded Impalad Causes Entire Cluster to Hang
> ---
>
> Key: IMPALA-4317
> URL: https://issues.apache.org/jira/browse/IMPALA-4317
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 2.5.0
> Environment: Enterprise CDH 5.7.0, Parcels
> impalad version 2.5.0-cdh5.7.0 RELEASE (build 
> ad3f5adabedf56fe6bd9eea39147c067cc552703)
>Reporter: Scott Wallace
>Priority: Major
> Attachments: cached_clients.png, health.png, load.png, queries.png, 
> threads.png, worker23.png
>
>
> Occasionally we experience heavy load on a single impalad host. This causes 
> the entire cluster to hang and prevents any Impala queries from 
> executing.
> Here's what we observe:
> -load increases on a single impalad
> -query throughput across the entire impala cluster drops and we cannot get 
> any queries to execute
> -the number of running threads continues to increase until we restart the impala service
> -in the impalad logs we see errors connecting to the unhealthy host. Example: 
> Couldn't open transport for 
> ux-reporting-engine-worker-23-prod-us-east-1a:22000 (connect() failed: 
> Connection timed out)
> Questions:
> Why does the entire Impala service become unstable due to the health of a 
> single impalad?
> Theoretically, shouldn't the impala statestore prevent the single impalad 
> host from being used and allow queries to be processed by healthy nodes?






[jira] [Resolved] (IMPALA-3991) Trying to get runtime profile for invalid query id spews noisy, unhelpful stack trace to impala log

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3991.
---
Resolution: Fixed

We converted this to use Status::Expected() so that the stack trace is avoided 
afaict.

> Trying to get runtime profile for invalid query id spews noisy, unhelpful 
> stack trace to impala log
> ---
>
> Key: IMPALA-3991
> URL: https://issues.apache.org/jira/browse/IMPALA-3991
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 2.8.0
> Environment: CDH 5.9.0 cluster running on CentOS6
>Reporter: David Knupp
>Priority: Minor
> Attachments: impalad-logs.zip, 
> impalad.dknupp-centos66-2.vpc.cloudera.com.impala.log.INFO.20160817-104915.8894
>
>
> NOTE: This bug was initially filed with regard to a seeming error with Impala 
> on CDH 5.9.0. It turns out that the issue was a bad query being issued by the 
> CM agent, which is being tracked by 
> [OPSAPS-35830|https://jira.cloudera.com/browse/OPSAPS-35830]. This bug is 
> being left open to address the fact that the stack trace produced in response 
> to that is not helpful, and fills up the logs at an alarming rate.
> The original description follows.
> ---
> Impalad fails on a newly deployed CDH 5.9.0 cluster. Things look normal at first 
> in the impalad log, but then...
> {code}
> [...]
> I0817 10:49:20.871973  8894 thrift-server.cc:434] ThriftServer 'backend' 
> started on port: 22000s
> I0817 10:49:20.876886  8894 thrift-server.cc:434] ThriftServer 
> 'beeswax-frontend' started on port: 21000s
> I0817 10:49:20.883510  8894 thrift-server.cc:434] ThriftServer 
> 'hiveserver2-frontend' started on port: 21050s
> I0817 10:49:20.883572  8894 exec-env.cc:335] Starting global services
> W0817 10:49:20.898262  8894 exec-env.cc:416] Memory limit 12.46 GB exceeds 
> physical memory of 7.68 GB
> I0817 10:49:20.898339  8894 exec-env.cc:421] Using global memory limit: 12.46 
> GB
> I0817 10:49:20.901062  8894 webserver.cc:227] Starting webserver on 
> 0.0.0.0:25000
> I0817 10:49:20.901114  8894 webserver.cc:233] Webserver: Enabling HTTPS 
> support
> I0817 10:49:20.901154  8894 webserver.cc:241] Document root: 
> /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.171/lib/impala
> I0817 10:49:21.235894  8894 webserver.cc:325] Webserver started
> I0817 10:49:21.235994  8894 simple-scheduler.cc:149] Starting simple scheduler
> I0817 10:49:21.236117  8894 simple-scheduler.cc:163] Simple-scheduler using 
> 172.26.12.162 as IP address
> I0817 10:49:21.242650  8894 statestore-subscriber.cc:181] Starting statestore 
> subscriber
> I0817 10:49:21.244170  8894 thrift-server.cc:434] ThriftServer 
> 'StatestoreSubscriber' started on port: 23000
> I0817 10:49:21.244215  8894 statestore-subscriber.cc:193] Registering with 
> statestore
> I0817 10:49:21.284173  8894 statestore-subscriber.cc:165] Subscriber 
> registration ID: 264ce339eb0242c2:c90af7928c10e782
> I0817 10:49:21.284267  8894 statestore-subscriber.cc:197] statestore 
> registration successful
> I0817 10:49:21.284453  8894 impalad-main.cc:93] Impala has started.
> I0817 10:49:21.296749  9250 authentication.cc:425] Successfully authenticated 
> principal "impala/dknupp-centos66-1.vpc.cloudera@vpc.cloudera.com" on an 
> internal connection
> I0817 10:49:21.301574  9250 authentication.cc:425] Successfully authenticated 
> principal "impala/dknupp-centos66-1.vpc.cloudera@vpc.cloudera.com" on an 
> internal connection
> I0817 10:49:21.308790  9253 simple-scheduler.cc:287] Registering local 
> backend with statestore
> I0817 10:49:38.976960  9554 status.cc:114] Query id 0:0 not found.
> @   0x84f189  (unknown)
> @   0xacd424  (unknown)
> @   0xafa3ec  (unknown)
> @   0xbfc475  (unknown)
> @   0xbfd895  (unknown)
> @   0xc0a8c0  (unknown)
> @   0xc0d03d  (unknown)
> @   0xc0d6cd  (unknown)
> @ 0x7f7244ef5aa1  start_thread
> @ 0x7f7243e46aad  clone
> I0817 10:49:39.965342  9554 status.cc:114] Query id 0:0 not found.
> @   0x84f189  (unknown)
> @   0xacd424  (unknown)
> @   0xafa3ec  (unknown)
> @   0xbfc475  (unknown)
> @   0xbfd895  (unknown)
> @   0xc0a8c0  (unknown)
> @   0xc0d03d  (unknown)
> @   0xc0d6cd  (unknown)
> @ 0x7f7244ef5aa1  start_thread
> @ 0x7f7243e46aad  clone
> I0817 10:49:40.966459  9554 status.cc:114] Query id 0:0 not found.
> @   0x84f189  (unknown)
> @   0xacd424  (unknown)
> @   0xafa3ec  (unknown)
> @   0xbfc475  (unknown)
> @   0xbfd895  (unknown)
>   

[jira] [Resolved] (IMPALA-4292) Build MiniKDC from source

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-4292.
---
Resolution: Won't Do

IMPALA-9361 removed the minikdc download.

> Build MiniKDC from source
> -
>
> Key: IMPALA-4292
> URL: https://issues.apache.org/jira/browse/IMPALA-4292
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 2.8.0
>Reporter: Tim Armstrong
>Priority: Major
>  Labels: asf
>
> The LLAMA minikdc component is used for our test cluster. It's required even 
> though we don't need LLAMA - Impala basically just borrowed a bit of LLAMA's 
> test infrastructure.
> The README explains where it came from:
> {code}
> The contents of this directory came from the llama project on 07/18/2014.
> They were created by
> 1) Download from repo
>- Remote is g...@github.mtv.cloudera.com:CDH/llama.  This is the Cloudera-
>  internal llama repo; there was an external one, but at the time of this
>  work it was out of sync with the internal repo, and broken.
>- On branch "cdh5-1.0.0"
>- The git hash at the time of this work:
>  d9066d398cc76b6ebb60f77ccd657d1eb46a667b
> 2) mvn package -Pdist
> 3) tar xvfz llama-minikdc-1.0.0-cdh5.2.0-SNAPSHOT.tar.gz in this directory
> This project is used for the minikdc it provides.
> {code}
> We should consider adopting the code from LLAMA and building it ourselves. 
> It's just a small Java project.






[jira] [Commented] (IMPALA-2206) report_workload.py does not store ExecSummary in database

2020-06-19 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140793#comment-17140793
 ] 

Tim Armstrong commented on IMPALA-2206:
---

I suspect this was fixed by some of the other changes to include exec summary 
in the profile.

> report_workload.py does not store ExecSummary in database
> -
>
> Key: IMPALA-2206
> URL: https://issues.apache.org/jira/browse/IMPALA-2206
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 2.2
>Reporter: Martin Grund
>Priority: Minor
>  Labels: test-infra
>
> When we store the runtime profiles for each run, the profiles do not contain 
> the exec summary as it seems the profiles are fetched when the query is still 
> running (or not all rows fetched). However, the results contain the JSON 
> encoded representation of the TExecSummary and we should persist this in the 
> database as well.






[jira] [Resolved] (IMPALA-2206) report_workload.py does not store ExecSummary in database

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2206.
---
Resolution: Cannot Reproduce

> report_workload.py does not store ExecSummary in database
> -
>
> Key: IMPALA-2206
> URL: https://issues.apache.org/jira/browse/IMPALA-2206
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 2.2
>Reporter: Martin Grund
>Priority: Minor
>  Labels: test-infra
>
> When we store the runtime profiles for each run, the profiles do not contain 
> the exec summary as it seems the profiles are fetched when the query is still 
> running (or not all rows fetched). However, the results contain the JSON 
> encoded representation of the TExecSummary and we should persist this in the 
> database as well.






[jira] [Resolved] (IMPALA-4235) Consider making data load independent of hadoop-Lzo

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-4235.
---
Resolution: Duplicate

> Consider making data load independent of hadoop-Lzo
> ---
>
> Key: IMPALA-4235
> URL: https://issues.apache.org/jira/browse/IMPALA-4235
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 2.6.0
>Reporter: Sailesh Mukil
>Priority: Minor
>  Labels: asf, infra
>
> Currently parts of our data loading use Hive, which depends on hadoop-lzo for 
> all the lzo file formats we load, and there also seem to be some other 
> subtle dependencies on hadoop-lzo during data load (like loading complex 
> types with Hive).
> If we do not have hadoop-lzo, our data loading errors out halfway through. It 
> would be good to refactor our data-load scripts so that we can still go ahead 
> and load other file formats.






[jira] [Resolved] (IMPALA-3712) Not informative syntax error in UPDATE statement

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3712.
---
Resolution: Won't Fix

I think it's generally hard to change errors that are coming from the parser...

> Not informative syntax error in UPDATE statement
> 
>
> Key: IMPALA-3712
> URL: https://issues.apache.org/jira/browse/IMPALA-3712
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.6.0
>Reporter: Dimitris Tsirogiannis
>Priority: Minor
>  Labels: kudu, usability
>
> A weird error is thrown when anything other than '=' is used in the SET 
> portion of an UPDATE statement. Example:
> {code}
> impala-shell> update t1 set b != 1 where a = 1;
> Query: update t1 set b != 1 where a = 1;
> ERROR: AnalysisException: Syntax error in line 1:
> update t1 set b != 1 where a = 1
>  ^
> Encountered: Unknown last token with id: 200
> Expected
> CAUSED BY: Exception: Syntax error
> {code}






[jira] [Resolved] (IMPALA-4131) Make RAT testing part of run-all-tests.sh

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-4131.
---
Resolution: Won't Fix

This is already run in precommit testing automatically.

> Make RAT testing part of run-all-tests.sh
> -
>
> Key: IMPALA-4131
> URL: https://issues.apache.org/jira/browse/IMPALA-4131
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 2.7.0
>Reporter: Jim Apple
>Priority: Major
>  Labels: asf
>
> Apache RAT is a tool for checking license compliance. We should test license 
> compliance at every commit in the pre-commit CI job, and to do this we should 
> add a test, perhaps to run-all-tests.sh.
> Things to be careful of
> 1. Not every computer will have git, and not every Impala repo will be a git 
> repo. For instance, release tarballs are not git repos.
> 2. Apache RAT is not a pre-req and not in the chef install script or the 
> toolchain.






[jira] [Resolved] (IMPALA-3393) Report thread wake up interval is not well distributed

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3393.
---
Resolution: Later

Closing unless we see that it is a problem in practice.

> Report thread wake up interval is not well distributed
> --
>
> Key: IMPALA-3393
> URL: https://issues.apache.org/jira/browse/IMPALA-3393
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.5.0
>Reporter: Huaisi Xu
>Priority: Minor
>
> https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/runtime/plan-fragment-executor.cc#L398
> We take an integer from 0 to 4, so if there are 200 nodes then ~40 of them 
> will report at the same time every 5 seconds. This does not take into account 
> that fragment start time is basically random...






[jira] [Commented] (IMPALA-3393) Report thread wake up interval is not well distributed

2020-06-19 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-3393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140787#comment-17140787
 ] 

Tim Armstrong commented on IMPALA-3393:
---

I think there might be value in adding more jitter to status reporting for 
longer-running queries. Regardless, I think we can't avoid some 
synchronisation of status reports since backends will all tend to finish at the 
same time. But IMPALA-7213 should fix many scalability problems.

> Report thread wake up interval is not well distributed
> --
>
> Key: IMPALA-3393
> URL: https://issues.apache.org/jira/browse/IMPALA-3393
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.5.0
>Reporter: Huaisi Xu
>Priority: Minor
>
> https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/runtime/plan-fragment-executor.cc#L398
> We take an integer from 0 to 4, so if there are 200 nodes then ~40 of them 
> will report at the same time every 5 seconds. This does not take into account 
> that fragment start time is basically random...
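
The arithmetic in the description can be illustrated with a small simulation. This is only a sketch in Python, not Impala code: the function names, the 5-second interval parameter, and the 1 ms bucket used to decide what counts as "the same time" are all assumptions made for illustration. It compares whole-second offsets (an integer from 0 to 4) with fractional offsets drawn from the same interval.

```python
import random
from collections import Counter


def integer_offsets(num_nodes, interval_s=5, seed=0):
    """Each node picks a whole-second offset from 0 to interval_s - 1."""
    rng = random.Random(seed)
    return [rng.randint(0, interval_s - 1) for _ in range(num_nodes)]


def continuous_offsets(num_nodes, interval_s=5, seed=0):
    """Each node picks a fractional offset, roughly uniform over the interval."""
    rng = random.Random(seed)
    return [rng.uniform(0, interval_s) for _ in range(num_nodes)]


def max_simultaneous(offsets, resolution_s=0.001):
    """Largest number of nodes whose offsets land in the same time bucket."""
    buckets = Counter(int(offset / resolution_s) for offset in offsets)
    return max(buckets.values())


if __name__ == "__main__":
    nodes = 200
    print("integer offsets:", max_simultaneous(integer_offsets(nodes)))
    print("continuous offsets:", max_simultaneous(continuous_offsets(nodes)))
```

With only five possible integer offsets, at least 40 of 200 nodes must share one of them, which matches the ~40 estimate above. The simulation ignores differences in fragment start time, so it describes the worst case rather than what a real cluster would necessarily see.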






[jira] [Resolved] (IMPALA-2230) Create test parquet files with bad metadata, invalid data, etc.

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2230.
---
Resolution: Fixed

We have added a bunch of invalid parquet files in testdata/data over time.

> Create test parquet files with bad metadata, invalid data, etc.
> ---
>
> Key: IMPALA-2230
> URL: https://issues.apache.org/jira/browse/IMPALA-2230
> Project: IMPALA
>  Issue Type: Task
>  Components: Backend
>Affects Versions: Impala 2.2.4
>Reporter: Skye Wanderman-Milne
>Priority: Minor
>
> We have a lot of untested error handling in the Parquet scanner. It would be 
> good to hand-craft test files that exercise some of these cases, including: 
> (feel free to edit to add more cases)
> * Columns with different numbers of values
> * Columns with mismatched # values vs # values reported in metadata






[jira] [Resolved] (IMPALA-3769) Add regression tests for sasl transport overhead

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3769.
---
Resolution: Won't Do

> Add regression tests for sasl transport overhead
> 
>
> Key: IMPALA-3769
> URL: https://issues.apache.org/jira/browse/IMPALA-3769
> Project: IMPALA
>  Issue Type: Task
>  Components: Infrastructure
>Affects Versions: Impala 2.6.0
>Reporter: Matthew Jacobs
>Priority: Minor
>  Labels: test, test-infra
>
> Regression tests needed for https://issues.cloudera.org/browse/IMPALA-1928






[jira] [Resolved] (IMPALA-2485) Use runtime row count information from blocking operators

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2485.
---
Resolution: Later

This was just an idea, not concrete enough

> Use runtime row count information from blocking operators
> -
>
> Key: IMPALA-2485
> URL: https://issues.apache.org/jira/browse/IMPALA-2485
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.2
>Reporter: Tim Armstrong
>Priority: Minor
>
> At runtime we can obtain accurate row counts from blocking operators before 
> dependent operators start execution. We could use this information to make 
> smarter decisions about hash table sizing, etc.






[jira] [Resolved] (IMPALA-3793) Trigger alert on specific query errors

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3793.
---
Resolution: Won't Do

I don't think this kind of alerting is within the scope of the Impala project.

> Trigger alert on specific query errors
> --
>
> Key: IMPALA-3793
> URL: https://issues.apache.org/jira/browse/IMPALA-3793
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Perf Investigation
>Affects Versions: Impala 2.2
>Reporter: Matyas Orhidi
>Priority: Minor
>
> It would be a really useful feature to have an option to configure alerting 
> on specific Impala query failures. The following error is a serious 
> query failure due to a disk error, but no alert is triggered:
> Error from query 16432c59e68cb814:b237840be7c83fa2: Create file 
> /mnt/sde/impala/impalad/impala-scratch/16432c59e68cb814:b237840be7c83fa2_d8958230-55c3-4e24-9cb4-95b4e9dcc7e8
>  failed with errno=2 description=Error(2): No such file or directory






[jira] [Updated] (IMPALA-2306) Investigate hiding internal agg functions

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-2306:
--
Priority: Trivial  (was: Minor)

> Investigate hiding internal agg functions
> -
>
> Key: IMPALA-2306
> URL: https://issues.apache.org/jira/browse/IMPALA-2306
> Project: IMPALA
>  Issue Type: Task
>  Components: Frontend
>Affects Versions: Impala 2.3.0
>Reporter: casey
>Priority: Trivial
>  Labels: usability
>
> DISTINCTPC, DISTINCTPCSA, and NDV_NO_FINALIZE were most likely added as a 
> means to implement other functionality (compute stats?) and probably not 
> intended for general use. If possible those functions should be hidden so 
> users don't rely on them (and Impala is able to redefine them as needed).






[jira] [Updated] (IMPALA-1732) Impala should have more generic functions DATEDIFF(datepart, startdate, enddate), DATEADD(datepart, number, date)

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-1732:
--
Labels: newbie sql-language  (was: sql-language)

> Impala should have more generic functions DATEDIFF(datepart, startdate, 
> enddate), DATEADD(datepart, number, date)
> -
>
> Key: IMPALA-1732
> URL: https://issues.apache.org/jira/browse/IMPALA-1732
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.1
>Reporter: Denys Lamanov
>Priority: Minor
>  Labels: newbie, sql-language
>
> I guess it should be like implementation in MS SQL:
> DATEDIFF(datepart, startdate, enddate)
> DATEADD(datepart, number, date)
> https://msdn.microsoft.com/en-us/library/ms189794.aspx
> https://msdn.microsoft.com/en-us/library/ms186819.aspx
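
To make the requested signatures concrete, here is a rough Python sketch of what datepart-generic functions could look like. It is not Impala code and not a proposal for the actual implementation. The set of supported dateparts is an arbitrary subset chosen for illustration, and the boundary-counting behaviour of datediff is an assumption based on how the SQL Server functions linked above are generally described.

```python
from datetime import datetime, timedelta

# Dateparts with a fixed duration; month and year are left out of dateadd
# here because they do not map to a constant timedelta.
_FIXED_PARTS = {
    "week": timedelta(weeks=1),
    "day": timedelta(days=1),
    "hour": timedelta(hours=1),
    "minute": timedelta(minutes=1),
    "second": timedelta(seconds=1),
}


def dateadd(datepart, number, date):
    """Return date shifted by number units of datepart."""
    part = datepart.lower()
    if part not in _FIXED_PARTS:
        raise ValueError("unsupported datepart: %s" % datepart)
    return date + number * _FIXED_PARTS[part]


def datediff(datepart, startdate, enddate):
    """Count datepart boundaries crossed between startdate and enddate."""
    part = datepart.lower()
    if part == "year":
        return enddate.year - startdate.year
    if part == "month":
        return (enddate.year - startdate.year) * 12 + (enddate.month - startdate.month)
    if part == "day":
        return (enddate.date() - startdate.date()).days
    raise ValueError("unsupported datepart: %s" % datepart)


if __name__ == "__main__":
    print(dateadd("day", 3, datetime(2020, 6, 19)))
    print(datediff("year", datetime(2019, 12, 31), datetime(2020, 1, 1)))
```

Under the boundary-counting assumption, the difference in years between 2019-12-31 and 2020-01-01 is 1 even though only one day has elapsed; a real implementation would need to decide whether to follow that convention.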






[jira] [Resolved] (IMPALA-2748) Show compression codec in 'SHOW FILES' statement

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2748.
---
Resolution: Later

This isn't possible for Parquet and ORC for the reason Juan mentioned. It is 
possible for text files, but then that can be determined based on the file 
extension anyway.

> Show compression codec in 'SHOW FILES' statement
> 
>
> Key: IMPALA-2748
> URL: https://issues.apache.org/jira/browse/IMPALA-2748
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Affects Versions: Impala 2.2.4
>Reporter: Dimitris Tsirogiannis
>Priority: Minor
>  Labels: catalog-server, usability
>
> Currently, the only way for a user to learn the compression codec of table 
> files is to look at the profile of a query that accesses that table. It would 
> be nice to show the compression codec of each file in the 'show files' 
> statement.






[jira] [Resolved] (IMPALA-2557) Kerberized hs2/impyla randomly fails: "OverflowError: signed integer is greater than maximum"

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2557.
---
Resolution: Cannot Reproduce

> Kerberized hs2/impyla randomly fails: "OverflowError: signed integer is 
> greater than maximum"
> -
>
> Key: IMPALA-2557
> URL: https://issues.apache.org/jira/browse/IMPALA-2557
> Project: IMPALA
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: Impala 2.3.0
>Reporter: casey
>Priority: Minor
>  Labels: stress
>
> The error below comes up rarely, maybe 1/5000 queries. I haven't investigated 
> this. The error could be either in the server or the client.
> {noformat}
> 03:57:11 16583 140231084865280 ERROR:concurrent_select[705]:Non-mem limit 
> error for query with id e44a41527fce61e8:bea7014c6146dfa2: CancelOperation 
> failed: unknown result
> Traceback (most recent call last):
>   File "tests/stress/concurrent_select.py", line 684, in 
> _check_for_mem_limit_exceeded
> report.profile = cursor.get_profile()
>   File 
> "/var/lib/jenkins/workspace/Impala-Stress-Test-Kerberized/Impala/infra/python/env/local/lib/python2.7/site-packages/impala/hiveserver2.py",
>  line 373, in get_profile
> self.service, self._last_operation_handle, self.session_handle)
>   File 
> "/var/lib/jenkins/workspace/Impala-Stress-Test-Kerberized/Impala/infra/python/env/local/lib/python2.7/site-packages/impala/hiveserver2.py",
>  line 916, in get_profile
> resp = service.GetRuntimeProfile(req)
>   File 
> "/var/lib/jenkins/workspace/Impala-Stress-Test-Kerberized/Impala/infra/python/env/local/lib/python2.7/site-packages/impala/_thrift_gen/ImpalaService/ImpalaHiveServer2Service.py",
>  line 77, in GetRuntimeProfile
> return self.recv_GetRuntimeProfile()
>   File 
> "/var/lib/jenkins/workspace/Impala-Stress-Test-Kerberized/Impala/infra/python/env/local/lib/python2.7/site-packages/impala/_thrift_gen/ImpalaService/ImpalaHiveServer2Service.py",
>  line 88, in recv_GetRuntimeProfile
> (fname, mtype, rseqid) = self._iprot.readMessageBegin()
>   File 
> "/var/lib/jenkins/workspace/Impala-Stress-Test-Kerberized/Impala/thirdparty/hive-1.1.0-cdh5.7.0-SNAPSHOT/lib/py/thrift/protocol/TBinaryProtocol.py",
>  line 137, in readMessageBegin
> name = self.trans.readAll(sz)
>   File 
> "/var/lib/jenkins/workspace/Impala-Stress-Test-Kerberized/Impala/thirdparty/hive-1.1.0-cdh5.7.0-SNAPSHOT/lib/py/thrift/transport/TTransport.py",
>  line 58, in readAll
> chunk = self.read(sz-have)
>   File 
> "/var/lib/jenkins/workspace/Impala-Stress-Test-Kerberized/Impala/infra/python/env/local/lib/python2.7/site-packages/thrift_sasl/__init__.py",
>  line 159, in read
> self._read_frame()
>   File 
> "/var/lib/jenkins/workspace/Impala-Stress-Test-Kerberized/Impala/infra/python/env/local/lib/python2.7/site-packages/thrift_sasl/__init__.py",
>  line 176, in _read_frame
> decoded = read_all_compat(self._trans, length)
>   File 
> "/var/lib/jenkins/workspace/Impala-Stress-Test-Kerberized/Impala/infra/python/env/local/lib/python2.7/site-packages/thrift_sasl/six.py",
>  line 31, in 
> read_all_compat = lambda trans, sz: trans.readAll(sz)
>   File 
> "/var/lib/jenkins/workspace/Impala-Stress-Test-Kerberized/Impala/thirdparty/hive-1.1.0-cdh5.7.0-SNAPSHOT/lib/py/thrift/transport/TTransport.py",
>  line 58, in readAll
> chunk = self.read(sz-have)
>   File 
> "/var/lib/jenkins/workspace/Impala-Stress-Test-Kerberized/Impala/thirdparty/hive-1.1.0-cdh5.7.0-SNAPSHOT/lib/py/thrift/transport/TSocket.py",
>  line 92, in read
> buff = self.handle.recv(sz)
> OverflowError: signed integer is greater than maximum
> {noformat}
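
The traceback ends in a socket recv() call whose size argument was derived from a frame length read off the wire, so one possible but unconfirmed explanation is a corrupt or misread length header. The following Python sketch shows a defensive check under that assumption. It assumes a 4-byte big-endian length prefix; parse_frame_length and MAX_FRAME_BYTES are hypothetical names and are not part of thrift_sasl, thrift, or impyla.

```python
import struct

# Hypothetical cap: the largest value that fits in a signed 32-bit integer.
MAX_FRAME_BYTES = 2**31 - 1


def parse_frame_length(header, max_bytes=MAX_FRAME_BYTES):
    """Decode a 4-byte big-endian frame length and reject implausible sizes."""
    if len(header) != 4:
        raise ValueError("expected a 4-byte header, got %d bytes" % len(header))
    (length,) = struct.unpack(">I", header)
    if length > max_bytes:
        raise ValueError("frame length %d exceeds limit %d" % (length, max_bytes))
    return length


if __name__ == "__main__":
    print(parse_frame_length(b"\x00\x00\x00\x10"))
```

A check like this would turn an oversized length into an explicit error before it reaches the socket layer, which might make the underlying cause easier to diagnose. It does not explain why the length would be wrong in the first place.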






[jira] [Resolved] (IMPALA-3900) Add per-split runtime filtering to HdfsParquetScanner::ProcessSplit()

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3900.
---
Resolution: Duplicate

> Add per-split runtime filtering to HdfsParquetScanner::ProcessSplit()
> -
>
> Key: IMPALA-3900
> URL: https://issues.apache.org/jira/browse/IMPALA-3900
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.6.0
>Reporter: Henry Robinson
>Assignee: Henry Robinson
>Priority: Minor
>  Labels: runtime-filters
>
> If a partition filter arrives after a footer scan range for a Parquet file has 
> been issued, but before {{HdfsParquetScanner::ProcessSplit()}}, there's an 
> opportunity to filter out all the scan ranges that would otherwise be issued 
> when reading that footer, by adding a call to {{ScanRangeIsFilteredOut()}} at 
> the top of that method.
> Care must be taken to ensure that all scan ranges are marked as done, since 
> they won't be processed by their own scanner instances. This will avoid a 
> recurrence of IMPALA-3804.






[jira] [Resolved] (IMPALA-1469) Impala 2.0 rand() seed is now required to be a constant.

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-1469.
---
Resolution: Won't Fix

> Impala 2.0  rand() seed is now required to be a constant.
> -
>
> Key: IMPALA-1469
> URL: https://issues.apache.org/jira/browse/IMPALA-1469
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 2.0
>Reporter: Chip Sands
>Priority: Minor
>
> In 2.0 the rand() function now returns an error if you pass it a column name as the 
> seed.
> ERROR:   Seed argument to rand() must be constant.
> It is documented as an INT, not as a constant, and did not work this way 
> before.






[jira] [Assigned] (IMPALA-3881) Make /queries tables searchable / sortable etc.

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-3881:
-

Assignee: (was: Henry Robinson)

> Make /queries tables searchable / sortable etc.
> ---
>
> Key: IMPALA-3881
> URL: https://issues.apache.org/jira/browse/IMPALA-3881
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.6.0
>Reporter: Henry Robinson
>Priority: Minor
>  Labels: observability
>
> The tables in {{/queries}} would be more useful if they were sortable and 
> searchable, and if the backlog of completed queries were larger so that 
> queries didn't disappear so quickly.






[jira] [Updated] (IMPALA-3561) Planner should request statistics for relevant columns not entire table

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-3561:
--

Moving with the rest of fine-grained metadata loading. I'm not sure how 
impactful this is for table-level stats, so maybe we just want to close it.

> Planner should request statistics for relevant columns not entire table
> ---
>
> Key: IMPALA-3561
> URL: https://issues.apache.org/jira/browse/IMPALA-3561
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.6.0
>Reporter: Mostafa Mokhtar
>Priority: Minor
>  Labels: catalog-server, performance, planner, scalability
>
> When generating a plan, the planner requests statistics for all columns of 
> all referenced tables, which is not necessary. This creates load on the 
> catalog and bloats the metadata cache memory. 
> Stats are only required for columns involved in:
> * Joins
> * Aggregations
> * Filters
> The query below will fetch statistics for all 22 columns/partitions from 
> store_sales, which is unnecessary as only ss_item_sk and ss_promo_sk are 
> needed. 
> {code}
> select 
> count(*)
> from
> store_sales,
> item
> where
> ss_item_sk = i_item_sk
> group by ss_promo_sk;
> {code}
> If this issue is addressed, it should reduce the metadata cache memory by an 
> order of magnitude.
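
A minimal sketch of the idea in the description above, assuming a hypothetical, already-analyzed view of the query. The names QueryColumns and columns_needing_stats are illustrative only and are not Impala planner APIs. Only columns that take part in joins, aggregations, or filters are collected, so a stats request could be limited to them.

```python
from dataclasses import dataclass, field


@dataclass
class QueryColumns:
    """Hypothetical summary of the columns a query references, grouped by role."""
    join: set = field(default_factory=set)
    aggregation: set = field(default_factory=set)
    filter: set = field(default_factory=set)
    projection: set = field(default_factory=set)  # select-list only; no stats needed


def columns_needing_stats(q: QueryColumns) -> set:
    # Stats matter only for join, aggregation, and filter columns.
    return q.join | q.aggregation | q.filter


# The example query from the issue: join on ss_item_sk = i_item_sk,
# group by ss_promo_sk.
example = QueryColumns(
    join={"store_sales.ss_item_sk", "item.i_item_sk"},
    aggregation={"store_sales.ss_promo_sk"},
)
needed = columns_needing_stats(example)
```

For the example query this yields two store_sales columns and one item column, rather than every column of both tables.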






[jira] [Updated] (IMPALA-3561) Planner should request statistics for relevant columns not entire table

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-3561:
--
External issue ID:   (was: IMPALA-3561)
   Parent: (was: IMPALA-2649)
   Issue Type: Improvement  (was: Sub-task)

> Planner should request statistics for relevant columns not entire table
> ---
>
> Key: IMPALA-3561
> URL: https://issues.apache.org/jira/browse/IMPALA-3561
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.6.0
>Reporter: Mostafa Mokhtar
>Priority: Minor
>  Labels: catalog-server, performance, planner, scalability
>
> When generating a plan, the planner requests statistics for all columns of 
> all referenced tables, which is not necessary. This creates load on the 
> catalog and bloats the metadata cache memory. 
> Stats are only required for columns involved in:
> * Joins
> * Aggregations
> * Filters
> The query below will fetch statistics for all 22 columns/partitions from 
> store_sales, which is unnecessary as only ss_item_sk and ss_promo_sk are 
> needed. 
> {code}
> select 
> count(*)
> from
> store_sales,
> item
> where
> ss_item_sk = i_item_sk
> group by ss_promo_sk;
> {code}
> If this issue is addressed, it should reduce the metadata cache memory by an 
> order of magnitude.






[jira] [Updated] (IMPALA-3563) Evaluate using global opposed to per partition statistics

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-3563:
--
Issue Type: Improvement  (was: Bug)

> Evaluate using global opposed to per partition statistics
> -
>
> Key: IMPALA-3563
> URL: https://issues.apache.org/jira/browse/IMPALA-3563
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Affects Versions: Impala 2.5.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: catalog-server, planner, scalability
>
> Impala and other SQL-on-Hadoop solutions use per-partition statistics, which 
> creates a metadata scalability problem that I reckon outweighs the benefits 
> of having more accurate statistics. 
> This is the proposal for a partitioned table:
> * "Compute statistics" computes and stores per-partition HLL same as before
> * Catalog merges the HLL(s) for all partitions and stores/persists global 
> statistics 
> * Impalad(s) never request per-partition statistics, only global stats 
> * The only time the catalog needs to read the per-partition HLL is when 
> regenerating the global stats as part of adding/removing partitions
> In other words, during planning the partitioned table looks very similar to a 
> non-partitioned table.
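
The merge step in the proposal above can be sketched as follows. This is an illustration only: it assumes each per-partition HLL sketch is available as a plain list of register values of equal length, and the function names are hypothetical rather than catalog code. It relies on the standard HyperLogLog property that the sketch of a union is the element-wise maximum of the registers.

```python
from functools import reduce


def merge_hll(a, b):
    """Merge two HyperLogLog register arrays by taking the element-wise max."""
    if len(a) != len(b):
        raise ValueError("sketches must use the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]


def global_sketch(per_partition):
    """Fold all per-partition sketches into one table-level sketch."""
    return reduce(merge_hll, per_partition.values())


# Toy 4-register sketches for three partitions (real sketches are much larger).
partitions = {
    "p=1": [1, 0, 3, 2],
    "p=2": [0, 4, 1, 2],
    "p=3": [2, 2, 2, 2],
}
merged = global_sketch(partitions)
```

Because the merge is a max, adding a partition only needs the existing global sketch plus the new partition's sketch. Removing a partition requires re-reading the remaining per-partition sketches, which matches the last bullet of the proposal.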






[jira] [Updated] (IMPALA-3563) Evaluate using global opposed to per partition statistics

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-3563:
--
External issue ID:   (was: IMPALA-3563)
   Parent: (was: IMPALA-2649)
   Issue Type: Bug  (was: Sub-task)

> Evaluate using global opposed to per partition statistics
> -
>
> Key: IMPALA-3563
> URL: https://issues.apache.org/jira/browse/IMPALA-3563
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: Impala 2.5.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: catalog-server, planner, scalability
>
> Impala and other SQL-on-Hadoop solutions use per-partition statistics, which 
> creates a metadata scalability problem that I reckon outweighs the benefits 
> of having more accurate statistics. 
> This is the proposal for a partitioned table:
> * "Compute statistics" computes and stores per-partition HLL same as before
> * Catalog merges the HLL(s) for all partitions and stores/persists global 
> statistics 
> * Impalad(s) never request per-partition statistics, only global stats 
> * The only time the catalog needs to read the per-partition HLL is when 
> regenerating the global stats as part of adding/removing partitions
> In other words, during planning the partitioned table looks very similar to a 
> non-partitioned table.






[jira] [Resolved] (IMPALA-2145) Local times should not be used to calculate durations

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2145.
---
Resolution: Cannot Reproduce

Unclear at this point what specific problem this is referring to.

> Local times should not be used to calculate durations
> -
>
> Key: IMPALA-2145
> URL: https://issues.apache.org/jira/browse/IMPALA-2145
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 2.0
>Reporter: casey
>Priority: Minor
>
> Local times are subject to changes in time zone, such as daylight saving time. 
> Either UTC or the monotonic clock should be used instead. Once IMPALA-1773 is 
> done, a timestamp with time zone could be used too.
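
As a generic illustration of the point above (not Impala code), a duration can be measured with a monotonic clock, which is unaffected by time zone or daylight saving changes. The helper name timed is hypothetical.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.monotonic()  # never jumps backwards, unlike wall-clock local time
    result = fn(*args, **kwargs)
    elapsed = time.monotonic() - start
    return result, elapsed


result, elapsed = timed(sum, [1, 2, 3])
```

Subtracting two local wall-clock timestamps taken across a daylight saving transition could instead be off by an hour, or even negative.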






[jira] [Updated] (IMPALA-1527) JIT broken timestamp functions

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-1527:
--
Labels: codegen ramp-up  (was: codegen)

> JIT broken timestamp functions
> --
>
> Key: IMPALA-1527
> URL: https://issues.apache.org/jira/browse/IMPALA-1527
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.0
>Reporter: Skye Wanderman-Milne
>Priority: Minor
>  Labels: codegen, ramp-up
>
> We always use the non-codegen'd versions of from_utc_timestamp, 
> to_utc_timestamp, and DateAddSub() because there are problems JITing these 
> functions. It might be worth trying again at some point, e.g. if/when we 
> switch to MCJIT.






[jira] [Assigned] (IMPALA-3351) Improve error reporting for JDBC clients

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-3351:
-

Assignee: (was: Syed A. Hashmi)

> Improve error reporting for JDBC clients
> 
>
> Key: IMPALA-3351
> URL: https://issues.apache.org/jira/browse/IMPALA-3351
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Clients
>Affects Versions: Impala 2.0
>Reporter: Xiaomin Zhang
>Priority: Minor
>  Labels: jdbc
>
> For a JDBC client such as beeline:
> beeline> !connect 
> jdbc:hive2://quickstart.cloudera:21050/;principal=impala/quickstart.cloudera@CLOUDERA
> For queries such as "SELECT * FROM CUSTOMER", when CUSTOMER does not exist, I 
> get the error below:
> Error: AnalysisException: Could not resolve table reference: 'customer' 
> (state=HY000,code=0)
> However, for queries such as "SELECT * FROM CUSTOMERS", when this table 
> uses an unsupported definition, I get the error message below:
> Error: Invalid query handle (state=HY000,code=0)
> No detailed or useful information is returned. The only way to 
> troubleshoot this query is to look into the impalad log, where I can see 
> errors like:
> Error from query ca4ae093852af48b:a7d9cec9fae3ab4: File 
> hdfs://quickstart.cloudera:8020/user/hive/warehouse/customers/customers has 
> invalid file metadata at file offset 2183. Error = couldn't deserialize 
> thrift msg:
> TProtocolException: Invalid data
> So it looks like the JDBC client cannot get a detailed SQL exception when the 
> underlying error happens on fragments. 
> Is it possible to also bring this runtime error information to JDBC clients?






[jira] [Resolved] (IMPALA-3814) Clean up and refactor large try blocks in impala-shell

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3814.
---
Resolution: Later

> Clean up and refactor large try blocks in impala-shell
> --
>
> Key: IMPALA-3814
> URL: https://issues.apache.org/jira/browse/IMPALA-3814
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Clients
>Affects Versions: Impala 2.6.0
>Reporter: Sailesh Mukil
>Priority: Minor
>  Labels: shell
>
> One particular try block in impala_shell.py (in _execute_stmt()) is 
> particularly large with a large chain of except statements. Not all of the 
> except statements apply to all the code in the try block. This would be a lot 
> cleaner if it's split up into multiple try blocks which each have only the 
> required except statements.
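
A rough sketch of the refactoring described above. It does not reproduce the real _execute_stmt() in impala_shell.py; the exception classes, the callables, and the returned strings are hypothetical stand-ins. Each try block wraps one step and lists only the exceptions that step can raise.

```python
class ConnectionLost(Exception):
    """Hypothetical error raised when the RPC connection drops."""


class QueryError(Exception):
    """Hypothetical error raised when the server rejects or fails a query."""


def execute_stmt(submit, fetch):
    # Step 1: submitting can only fail with a connection problem.
    try:
        handle = submit()
    except ConnectionLost:
        return "reconnect needed"

    # Step 2: fetching can only fail with a query error.
    try:
        rows = fetch(handle)
    except QueryError as e:
        return f"query failed: {e}"

    return rows
```

With one large try block, a reader has to work out which of the many except clauses can actually be triggered by which statement. Here the pairing is explicit.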






[jira] [Updated] (IMPALA-1192) Allow setting the impala-shell prompt

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-1192:
--
Labels: newbie shell  (was: shell)

> Allow setting the impala-shell prompt
> -
>
> Key: IMPALA-1192
> URL: https://issues.apache.org/jira/browse/IMPALA-1192
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Clients
>Affects Versions: Impala 1.4.1
>Reporter: John Russell
>Priority: Minor
>  Labels: newbie, shell
>
> It would be very convenient (from a documentation perspective) to have a way 
> to set the prompt within an impala-shell session.
> The long impala-shell prompt showing the host name and port number makes it 
> difficult to format impala-shell output for use in documentation. Cloudera's 
> PDF and HTML outputs both have problems displaying wide lines in code 
> examples. Continuation lines are indented the same amount as the original >, 
> so even when statements are split across lines there is not much room on each 
> line.
> The shortest prompt is [localhost:21000] so that's what gets used 
> consistently in the documentation, even if that's not best practice.
> If we wanted to show examples other than localhost, we don't have control 
> over what's displayed, so we would have to edit the output to mask real 
> Cloudera host names.






[jira] [Updated] (IMPALA-7598) Planner timeline not set in the runtime profile incase of errors

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-7598:
--
Labels: newbie ramp-up  (was: ramp-up)

> Planner timeline not set in the runtime profile incase of errors
> 
>
> Key: IMPALA-7598
> URL: https://issues.apache.org/jira/browse/IMPALA-7598
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 3.1.0
>Reporter: Bharath Vissapragada
>Priority: Major
>  Labels: newbie, ramp-up
>
> Currently, the frontend sets the "Query compilation" timeline only if 
> doCreateExecRequest() finishes successfully.
> {code}
> private TExecRequest doCreateExecRequest(TQueryCtx queryCtx...) {
>   .. ..
>   timeline.markEvent("Planning finished");
>   result.setTimeline(timeline.toThrift());
>   return result;
> }
> {code}
> There can be cases where the timeline could be useful for debugging when the 
> planning fails. 
> One recent case I came across is the exhaustion of retries on 
> InconsistentMetadataFetchException. For every retry of query due to this 
> exception, we add a timeline event.
> {code}
>  while (true) {
>    try {
>      return doCreateExecRequest(queryCtx, timeline, explainString);
>    } catch (InconsistentMetadataFetchException e) {
>      if (attempt++ == INCONSISTENT_METADATA_NUM_RETRIES) {
>        throw e;
>      }
>      if (attempt > 1) {
>        // Back-off a bit on later retries.
>        Uninterruptibles.sleepUninterruptibly(200 * attempt, TimeUnit.MILLISECONDS);
>      }
>      timeline.markEvent(
>          String.format("Retrying query planning due to inconsistent metadata "
>              + "fetch, attempt %s of %s: ",
>              attempt, INCONSISTENT_METADATA_NUM_RETRIES)
>          + e.getMessage());
>    }
>  }
> {code}
> However, if planning eventually fails, the timeline of retries is lost. It 
> would have confirmed that the query was retried 10 times.
> It would be really useful if we could populate the partial timeline even if 
> planning failed.
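
One possible shape for this, sketched in Python rather than in the frontend's Java and using hypothetical names throughout (InconsistentMetadata, plan_with_timeline, and the event strings are not Impala identifiers): the timeline is owned by the retry loop and handed back on both the success and the failure path, so the retry events survive a failed planning attempt.

```python
class InconsistentMetadata(Exception):
    """Hypothetical stand-in for InconsistentMetadataFetchException."""


def plan_with_timeline(plan_once, max_retries):
    """Return (result, timeline, error); the timeline is filled even on failure."""
    timeline = []
    attempt = 0
    while True:
        try:
            result = plan_once()
            timeline.append("Planning finished")
            return result, timeline, None
        except InconsistentMetadata as e:
            attempt += 1
            if attempt > max_retries:
                # Give the caller the partial timeline instead of dropping it.
                timeline.append("Planning failed")
                return None, timeline, e
            timeline.append(
                f"Retrying query planning, attempt {attempt} of {max_retries}")
```

The caller can then attach the returned timeline to the runtime profile before reporting the error. The retry accounting here is simplified and does not mirror the exact attempt++ semantics of the Java loop.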






[jira] [Updated] (IMPALA-3722) Avro codegen can be unnecessarily disabled

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-3722:
--
Priority: Minor  (was: Major)

> Avro codegen can be unnecessarily disabled
> --
>
> Key: IMPALA-3722
> URL: https://issues.apache.org/jira/browse/IMPALA-3722
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.6.0
>Reporter: Skye Wanderman-Milne
>Priority: Minor
>  Labels: avro, codegen, ramp-up
>
> We use avro_schema_equal() from the Avro C library to determine if a file's 
> schema matches the table schema, and if they don't match we disable codegen 
> for that file 
> (https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-avro-scanner.cc#L153).
>  However, avro_schema_equal() is unnecessarily restrictive, because it 
> compares the records' names and namespaces, which don't have to be the same 
> to enable codegen. There are probably other checks we don't need as well, 
> e.g. default values. We should write our own schema comparison function that 
> is tailored to what must match for codegen specifically.
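
A rough sketch of what such a tailored comparison could look like, written in Python over Avro schemas parsed from JSON rather than against the Avro C library. Which attributes are safe to ignore for codegen is an assumption here: the sketch drops record names, namespaces, and field defaults (the items named in the description) and keeps field names, types, and order. The function names are hypothetical.

```python
IGNORED_RECORD_KEYS = {"name", "namespace"}
IGNORED_FIELD_KEYS = {"default"}


def normalize(schema):
    """Strip the attributes assumed to be irrelevant for codegen."""
    if isinstance(schema, list):  # Avro union
        return [normalize(s) for s in schema]
    if isinstance(schema, dict):
        if schema.get("type") == "record":
            out = {k: v for k, v in schema.items()
                   if k not in IGNORED_RECORD_KEYS and k != "fields"}
            out["fields"] = [
                {k: normalize(v) if k == "type" else v
                 for k, v in f.items() if k not in IGNORED_FIELD_KEYS}
                for f in schema.get("fields", [])
            ]
            return out
        return {k: normalize(v) if k in ("type", "items", "values") else v
                for k, v in schema.items()}
    return schema  # primitive type name such as "int"


def schemas_match_for_codegen(file_schema, table_schema):
    return normalize(file_schema) == normalize(table_schema)
```

Two records that differ only in name, namespace, or a default value compare equal under this check, while a change in a field's type does not.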






[jira] [Updated] (IMPALA-3478) Support for UTF-8 BOM on text backed tables.

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-3478:
--
Labels: newbie ramp-up  (was: ramp-up)

> Support for UTF-8 BOM on text backed tables.
> 
>
> Key: IMPALA-3478
> URL: https://issues.apache.org/jira/browse/IMPALA-3478
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Clients
>Affects Versions: Impala 2.3.0
>Reporter: Thomas Scott
>Priority: Minor
>  Labels: newbie, ramp-up
>
> Data stored in Unicode UTF-8 can contain the Byte Order Mark (BOM) (Hex 
> values "ef bb bf") at the beginning of the file. This is ignored in Hive but 
> in Impala can cause the first field to be misrepresented. A good example of 
> this is if the first column is of type timestamp. Impala will show this as 
> null even though the data is valid in Hive.
> Steps to reproduce:
> In Hive:
> CREATE EXTERNAL TABLE IF NOT EXISTS test_table (col1 timestamp) LOCATION 
> '/tmp/test_table'
> Then into the /tmp/test_table directory write a file with a BOM. I use vim 
> for this as below:
> echo '2010-01-01 00:00:00.000' > foo
> vim -e -s -c ':set bomb' -c ':wq' foo
> SELECT * FROM test_table
> Will display the timestamp in Hive and NULL in Impala.
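
For illustration only, skipping the BOM amounts to dropping the three bytes "ef bb bf" when they appear at the very start of a file. The sketch below shows this in Python with a hypothetical helper name; it is not the Impala text scanner.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"2010-01-01 00:00:00.000\n"
cleaned = strip_utf8_bom(with_bom)
```

Only the start of the file is checked, so the same byte sequence appearing later in the data is left untouched.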






[jira] [Resolved] (IMPALA-2267) "invalidate metadata" should only clear metadata for those tables the current user has access to

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2267.
---
Resolution: Won't Do

> "invalidate metadata" should  only clear metadata for those tables the 
> current user has access to
> -
>
> Key: IMPALA-2267
> URL: https://issues.apache.org/jira/browse/IMPALA-2267
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Affects Versions: Impala 2.2.4
>Reporter: Yibing Shi
>Priority: Minor
>  Labels: ramp-up, usability
>
> In Impala, users often execute {{invalidate metadata}} (without any table) to 
> clear the metadata cache. Since by default this affects all the 
> databases/tables on the cluster, Sentry requires users to have privileges on 
> the whole cluster (or "server" in Sentry's terminology) to execute this. 
> However, this is a huge inconvenience to customers, especially when they have 
> multiple tables to refresh: they must do this one table at a time.
> From my point of view, when Sentry is enabled, we should allow users to clear 
> the metadata cache for all the tables they have access to, instead of asking 
> for superuser privileges and clearing the metadata of all tables.






[jira] [Assigned] (IMPALA-2211) Impala reports missing WRITE access on dir in creating an external table

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-2211:
-

Assignee: (was: Chris Channing)

> Impala reports missing WRITE access on dir in creating an external table
> 
>
> Key: IMPALA-2211
> URL: https://issues.apache.org/jira/browse/IMPALA-2211
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: Impala 2.2
>Reporter: Vamsee K. Yarlagadda
>Priority: Minor
>  Labels: ramp-up
>
> It looks like Impala complains about missing write permissions on a HDFS 
> directory when trying to create an external table through Impyla.
> {code}
>   File 
> "/Users/vamsee/Cloudera/OtherProjects/cmf/systest/target/env/src/impyla/impala/dbapi/hiveserver2.py",
>  line 152, in execute
> configuration=configuration)
>   File 
> "/Users/vamsee/Cloudera/OtherProjects/cmf/systest/target/env/src/impyla/impala/dbapi/hiveserver2.py",
>  line 166, in execute_async
> self._execute_async(op)
>   File 
> "/Users/vamsee/Cloudera/OtherProjects/cmf/systest/target/env/src/impyla/impala/dbapi/hiveserver2.py",
>  line 172, in _execute_async
> operation_fn()
>   File 
> "/Users/vamsee/Cloudera/OtherProjects/cmf/systest/target/env/src/impyla/impala/dbapi/hiveserver2.py",
>  line 164, in op
> configuration)
>   File 
> "/Users/vamsee/Cloudera/OtherProjects/cmf/systest/target/env/src/impyla/impala/_rpc/hiveserver2.py",
>  line 134, in wrapper
> return func(*args, **kwargs)
>   File 
> "/Users/vamsee/Cloudera/OtherProjects/cmf/systest/target/env/src/impyla/impala/_rpc/hiveserver2.py",
>  line 240, in execute_statement
> err_if_rpc_not_ok(resp)
>   File 
> "/Users/vamsee/Cloudera/OtherProjects/cmf/systest/target/env/src/impyla/impala/_rpc/hiveserver2.py",
>  line 81, in err_if_rpc_not_ok
> raise HiveServer2Error(resp.status.errorMessage)
> HiveServer2Error: 
> ImpalaRuntimeException: Error making 'createTable' RPC to Hive Metastore: 
> CAUSED BY: MetaException: Got exception: 
> org.apache.hadoop.security.AccessControlException Permission denied: 
> user=impala, access=WRITE, 
> inode="/user/systest/tpcds_10_text":systest:supergroup:drwxr-xr-x
>   at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:238)
>   at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:216)
>   at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:145)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6599)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6581)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6533)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4337)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4307)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4280)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:853)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.mkdirs(AuthorizationProviderProxyClientProtocol.java:321)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:601)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)
> {code} 
> My main worry is if i am creating an external table (pointing to an HDFS 
> location), why does impal

[jira] [Resolved] (IMPALA-2238) Impala doesn't set HUE end user as the owner when creating tables

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2238.
---
Resolution: Duplicate

> Impala doesn't set HUE end user as the owner when creating tables
> -
>
> Key: IMPALA-2238
> URL: https://issues.apache.org/jira/browse/IMPALA-2238
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: Impala 2.2.5
>Reporter: Yibing Shi
>Priority: Minor
>  Labels: ramp-up, security
>
> In a Kerberized cluster (Sentry not enabled), I can reproduce this problem 
> with the steps below:
> # Login to HUE as user 'yshi'
> # In HUE Hive editor, run query {{create table hue_test_hive (id int)}}
> # In HUE Impala editor, run query  {{create table hue_test_impala (id int)}}
> # Login to the Hive Metastore database (take PostgreSQL as example) and 
> execute query {{select * from "TBLS" where "TBLS"."TBL_NAME" like 
> 'hue_test%'}}, and then we get below result:
> ||TBL_ID||CREATE_TIME||DB_ID||LAST_ACCESS_TIME||OWNER||RETENTION||SD_ID||TBL_NAME||TBL_TYPE||VIEW_EXPANDED_TEXT||VIEW_ORIGINAL_TEXT||
> |29606|1440420180|1|0|yshi|0|88435|hue_test_hive|MANAGED_TABLE|||
> |29607|1440420210|1|0|hue/host-10-17-81-247.coe.cloudera@yshi.com|0|88436|hue_test_impala|MANAGED_TABLE|||
> As we can see, the table created in the Impala editor is owned by 
> {{hue/@REALM}}, while the one created in the Hive editor is owned by the HUE 
> end user.
> I haven't tried an insecure cluster yet, but it should behave the same.






[jira] [Updated] (IMPALA-9829) Add write metrics for Spilling

2020-06-19 Thread Abhishek Rawat (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Rawat updated IMPALA-9829:
---
Summary: Add write metrics for Spilling  (was: Add metrics for Spilling to 
S3)

> Add write metrics for Spilling
> --
>
> Key: IMPALA-9829
> URL: https://issues.apache.org/jira/browse/IMPALA-9829
> Project: IMPALA
>  Issue Type: Task
>  Components: Backend
>Reporter: Abhishek Rawat
>Assignee: Yida Wu
>Priority: Major
> Fix For: Impala 4.0
>
>
> Extend io-mgr metrics to include additional metrics as a requirement from 
> Spilling to S3. 
>  
> {code:java}
> impala-server.io-mgr.queue-$0.write-latency // Histogram of write operation 
> times on specific disk queue.
> impala-server.io-mgr.queue-$0s3.write-size // Histogram of write operation 
> sizes on specific disk queue.
> impala-server.io-mgr.queue-$0.write-io-error // I/O errors encountered during 
> writing.
> {code}
>  






[jira] [Updated] (IMPALA-9829) Add write metrics for Spilling

2020-06-19 Thread Abhishek Rawat (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Rawat updated IMPALA-9829:
---
Description: 
Extend io-mgr metrics to include additional metrics as a requirement from 
Spilling to S3. 

 
{code:java}
impala-server.io-mgr.queue-$0.write-latency // Histogram of write operation 
times on specific disk queue.
impala-server.io-mgr.queue-$0.write-size // Histogram of write operation sizes 
on specific disk queue.
impala-server.io-mgr.queue-$0.write-io-error // I/O errors encountered during 
writing.
{code}
 

  was:
Extend io-mgr metrics to include additional metrics as a requirement from 
Spilling to S3. 

 
{code:java}
impala-server.io-mgr.queue-$0.write-latency // Histogram of write operation 
times on specific disk queue.
impala-server.io-mgr.queue-$0s3.write-size // Histogram of write operation 
sizes on specific disk queue.
impala-server.io-mgr.queue-$0.write-io-error // I/O errors encountered during 
writing.
{code}
 


> Add write metrics for Spilling
> --
>
> Key: IMPALA-9829
> URL: https://issues.apache.org/jira/browse/IMPALA-9829
> Project: IMPALA
>  Issue Type: Task
>  Components: Backend
>Reporter: Abhishek Rawat
>Assignee: Yida Wu
>Priority: Major
> Fix For: Impala 4.0
>
>
> Extend io-mgr metrics to include additional metrics as a requirement from 
> Spilling to S3. 
>  
> {code:java}
> impala-server.io-mgr.queue-$0.write-latency // Histogram of write operation 
> times on specific disk queue.
> impala-server.io-mgr.queue-$0.write-size // Histogram of write operation 
> sizes on specific disk queue.
> impala-server.io-mgr.queue-$0.write-io-error // I/O errors encountered during 
> writing.
> {code}
>  






[jira] [Updated] (IMPALA-3059) Debug web UI query formatting

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-3059:
--
Labels: newbie ramp-up usability  (was: nwb ramp-up usability)

> Debug web UI query formatting
> -
>
> Key: IMPALA-3059
> URL: https://issues.apache.org/jira/browse/IMPALA-3059
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Clients
>Affects Versions: Impala 2.2.4
>Reporter: Maarten Wullink
>Priority: Minor
>  Labels: newbie, ramp-up, usability
>
> The Impala debug UI (http://impala-server-hostname:25000/) should format SQL 
> queries better. Currently, long queries are often formatted in a way that 
> causes a lot of either horizontal or vertical scrolling. For the /queries 
> page, it would be nice if long queries could be wrapped and only displayed 
> completely on a mouseover event.






[jira] [Resolved] (IMPALA-2822) Test runner should try to log runtime profile for failed tests

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2822.
---
Resolution: Won't Do

I think this proved to be somewhat tricky. It's probably better to just get the 
query from the profile logs if needed.

> Test runner should try to log runtime profile for failed tests
> --
>
> Key: IMPALA-2822
> URL: https://issues.apache.org/jira/browse/IMPALA-2822
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 2.5.0
>Reporter: Tim Armstrong
>Assignee: David Knupp
>Priority: Minor
>  Labels: ramp-up, test-infra
> Attachments: runtime_profile.output
>
>
> In many cases, especially failures like mem_limit_exceeded, it would be 
> useful to have the runtime profile to debug failed tests. The test runner 
> gets the runtime profile for successful queries but throws an exception for 
> failed queries.
> The runtime profile will not always be available for query failures.






[jira] [Updated] (IMPALA-3059) Debug web UI query formatting

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-3059:
--
Labels: nwb ramp-up usability  (was: ramp-up usability)

> Debug web UI query formatting
> -
>
> Key: IMPALA-3059
> URL: https://issues.apache.org/jira/browse/IMPALA-3059
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Clients
>Affects Versions: Impala 2.2.4
>Reporter: Maarten Wullink
>Priority: Minor
>  Labels: nwb, ramp-up, usability
>
> The Impala debug UI (http://impala-server-hostname:25000/) should format SQL
> queries better. Currently long queries are often formatted in a way that
> causes a lot of horizontal or vertical scrolling. For the /queries page, it
> would be nice if long queries could be wrapped and only displayed completely
> on a mouseover event.






[jira] [Resolved] (IMPALA-9341) A grantee gains the delegation privilege after a revoke statement

2020-06-19 Thread Fang-Yu Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang-Yu Rao resolved IMPALA-9341.
-
Fix Version/s: Impala 4.0
   Resolution: Fixed

> A grantee gains the delegation privilege after a revoke statement
> -
>
> Key: IMPALA-9341
> URL: https://issues.apache.org/jira/browse/IMPALA-9341
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 3.4.0
>Reporter: Fang-Yu Rao
>Assignee: Fang-Yu Rao
>Priority: Major
> Fix For: Impala 4.0
>
>
> When Ranger is used as the authorization provider, a grantee that was granted 
> the {{insert}} privilege on a table without the delegation privilege gains 
> the delegation privilege after executing a statement to {{revoke}} the 
> grantee's {{select}} privilege on the same table. In what follows, we provide 
> the steps to reproduce the issue.
> # Start a Ranger-enabled Impala minicluster.
> # Log into Impala shell as the user {{admin}} using "{{./bin/impala-shell.sh 
> -u admin}}".
> # Execute "{{grant insert on table  to user 
> ;}}".
> # Execute "{{show grant user  on table ;}}".
> # Execute "{{revoke select on table  from user 
> ;}}".
> # Execute "{{show grant user  on table ;}}".
> When {{}} equals "{{functional.alltypes}}" and 
> {{}} equals "{{non_owner}}" which was not granted any 
> privilege at the very beginning, after the 4th step, we will see the 
> following.
> {code:java}
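> +----------------+----------------+------------+----------+--------+-----+-----+-----------+--------------+---------------+
> | principal_type | principal_name | database   | table    | column | uri | udf | privilege | grant_option | create_time   |
> +----------------+----------------+------------+----------+--------+-----+-----+-----------+--------------+---------------+
> | USER           | non_owner      | functional | alltypes | *      |     |     | insert    | false        | 1580345254347 |
> +----------------+----------------+------------+----------+--------+-----+-----+-----------+--------------+---------------+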
> {code}
> However, we will see the following after the 6th step above. We can see the 
> field of {{grant_option}} is changed from {{false}} to {{true}}, which should 
> not happen.
> {code:java}
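> +----------------+----------------+------------+----------+--------+-----+-----+-----------+--------------+---------------+
> | principal_type | principal_name | database   | table    | column | uri | udf | privilege | grant_option | create_time   |
> +----------------+----------------+------------+----------+--------+-----+-----+-----------+--------------+---------------+
> | USER           | non_owner      | functional | alltypes | *      |     |     | insert    | true         | 1580345254347 |
> +----------------+----------------+------------+----------+--------+-----+-----+-----------+--------------+---------------+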
> {code}






[jira] [Updated] (IMPALA-3744) Pattern making in "show ... like ..." behaves differently from Hive

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-3744:
--
Summary: Pattern making in "show ... like ..." behaves differently from 
Hive  (was: "show ... like ..." does not work reasonably)

> Pattern making in "show ... like ..." behaves differently from Hive
> ---
>
> Key: IMPALA-3744
> URL: https://issues.apache.org/jira/browse/IMPALA-3744
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.6.0
>Reporter: Huaisi Xu
>Priority: Minor
>  Labels: incompatibility, newbie
>
> In impala:
> {code:java}
> [localhost:21000] > show tables;
> Query: show tables
> +--+
> | name |
> +--+
> | a|
> | aa   |
> | aaa  |
> +--+
> Fetched 3 row(s) in 0.01s
> [localhost:21000] > show tables like "a";
> Query: show tables like "a"
> +--+
> | name |
> +--+
> | a|
> +--+
> Fetched 1 row(s) in 0.01s
> [localhost:21000] > show tables like "a_";
> Query: show tables like "a_"
> Fetched 0 row(s) in 0.01s
> [localhost:21000] > show tables like "a%";
> Query: show tables like "a%"
> Fetched 0 row(s) in 0.00s
> [localhost:21000] > show tables like "a.";
> Query: show tables like "a."
> Fetched 0 row(s) in 0.00s
> [localhost:21000] > show tables like "a*";
> Query: show tables like "a*"
> +--+
> | name |
> +--+
> | a|
> | aa   |
> | aaa  |
> +--+
> Fetched 3 row(s) in 0.01s
> {code}
> In Hive:
> {code:java}
> hive> show tables;
> OK
> a
> aa
> aaa
> Time taken: 0.013 seconds, Fetched: 3 row(s)
> hive> show tables like "a";
> OK
> a
> Time taken: 0.012 seconds, Fetched: 1 row(s)
> hive> show tables like "a_";
> OK
> aa
> Time taken: 0.014 seconds, Fetched: 1 row(s)
> hive> show tables like "a%";
> OK
> a
> aa
> aaa
> Time taken: 0.014 seconds, Fetched: 3 row(s)
> hive> show tables like "a.";
> OK
> aa
> Time taken: 0.011 seconds, Fetched: 1 row(s)
> hive> show tables like "a*";
> OK
> a
> aa
> aaa
> Time taken: 0.015 seconds, Fetched: 3 row(s)
> {code}
> The problem seems to be in
> http://github.mtv.cloudera.com/CDH/Impala/blob/cdh5-trunk/fe/src/main/java/com/cloudera/impala/util/PatternMatcher.java
> It is very confusing to have two pattern matchers. We use the Hive pattern
> matcher in this case, but it escapes ".", which is a bit surprising.
> One solution is to use only JdbcPatternMatcher, but this may break our API.
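To make the difference between the two matching conventions concrete, here is a minimal Python sketch. It is not Impala's PatternMatcher code: the function names hive_style_to_regex and jdbc_style_to_regex are hypothetical, and the translation rules are assumptions. The first set of rules is chosen to reproduce the Impala output quoted above (only "*" and "|" are special, everything else, including ".", is literal); the second follows standard SQL LIKE semantics ("%" and "_"). Neither models the way Hive treats "." in the output above.

```python
import re


def hive_style_to_regex(pattern):
    """Assumed rules: '*' matches any run of characters, '|' separates
    alternatives, and every other character (including '.', '%', '_') is literal."""
    alternatives = []
    for alt in pattern.split("|"):
        alternatives.append(".*".join(re.escape(part) for part in alt.split("*")))
    return re.compile("|".join("(?:%s)" % a for a in alternatives), re.IGNORECASE)


def jdbc_style_to_regex(pattern):
    """SQL LIKE-style rules: '%' matches any run of characters, '_' matches
    exactly one character, and every other character is literal."""
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return re.compile("".join(out), re.IGNORECASE)


def matching(regex, names):
    """Return the names that the compiled pattern matches in full."""
    return [n for n in names if regex.fullmatch(n)]


tables = ["a", "aa", "aaa"]
print(matching(hive_style_to_regex("a*"), tables))
print(matching(hive_style_to_regex("a_"), tables))
print(matching(jdbc_style_to_regex("a_"), tables))
print(matching(jdbc_style_to_regex("a%"), tables))
```

Under these assumed rules, "a_" and "a%" match nothing with the first translation but behave as wildcards with the second, which is the kind of mismatch the report describes.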






[jira] [Resolved] (IMPALA-2839) add stress options to exercise infrequently executed paths

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2839.
---
Resolution: Fixed

We've added a lot of debug and stress options over time, so this probably isn't 
useful to keep open.

> add stress options to exercise infrequently executed paths
> --
>
> Key: IMPALA-2839
> URL: https://issues.apache.org/jira/browse/IMPALA-2839
> Project: IMPALA
>  Issue Type: Task
>  Components: Backend
>Affects Versions: Impala 2.3.0
>Reporter: Daniel Hecht
>Priority: Minor
>  Labels: ramp-up
>
> We should add debug-only stress options that help exercise more paths in our 
> testing.  We have one example already: 
> {code:title=global-flags.cc}
> DEFINE_int32(stress_free_pool_alloc, 0, "A stress option which causes memory allocations "
>     "to fail once every n allocations where n is the value of this flag. Effective in "
>     "debug builds only.");
> {code}
> Some examples that we should add:
> * We could exercise this path by overriding the 
> curr_tuple_pool_->total_allocated_bytes() > MAX_TUPLE_POOL_SIZE condition, 
> which might help test for IMPALA-2829:
> {code:title=AnalyticEvalNode::ProcessChildBatch()}
>   // Transfer resources to prev_tuple_pool_ when enough resources have accumulated
>   // and the prev_tuple_pool_ has already been transferred to an output batch.
>   if (curr_tuple_pool_->total_allocated_bytes() > MAX_TUPLE_POOL_SIZE &&
>       (prev_pool_last_result_idx_ == -1 || prev_pool_last_window_idx_ == -1)) {
>     prev_tuple_pool_->AcquireData(curr_tuple_pool_.get(), false);
>     prev_pool_last_result_idx_ = last_result_idx_;
>     if (window_tuples_.size() > 0) {
>       prev_pool_last_window_idx_ = window_tuples_.back().first;
>     } else {
>       prev_pool_last_window_idx_ = -1;
>     }
>     VLOG_FILE << id() << " Transfer resources from curr to prev pool at idx: "
>               << stream_idx << ", stores tuples with last result idx: "
>               << prev_pool_last_result_idx_ << " last window idx: "
>               << prev_pool_last_window_idx_;
>   }
> {code}
> * We could cause RowBatch::AtCapacity() to become true more often than normal 
> (will be easier to do after http://gerrit.cloudera.org:8080/#/c/1399/ is 
> committed).
> Then we should enable these flags for some of our testing.
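The "fail once every n" behaviour of a stress flag such as stress_free_pool_alloc can be illustrated with a small counter-based fault injector. The Python sketch below only illustrates the idea and is not Impala code; the class name FailEveryN and the wrapper function allocate are hypothetical.

```python
class FailEveryN:
    """Fault injector: reports a failure on every n-th call; n <= 0 disables it."""

    def __init__(self, n):
        self.n = n
        self.calls = 0

    def should_fail(self):
        if self.n <= 0:
            return False
        self.calls += 1
        return self.calls % self.n == 0


def allocate(size, injector):
    """Hypothetical allocation wrapper: returns None to simulate a failed allocation."""
    if injector.should_fail():
        return None
    return bytearray(size)


injector = FailEveryN(3)
results = [allocate(8, injector) is not None for _ in range(6)]
print(results)
```

Keeping the counter in one place means a test can force a rarely taken error path deterministically instead of waiting for real memory pressure.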






[jira] [Resolved] (IMPALA-5241) Make it possible to run the same query as part of planner and end to end tests

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-5241.
---
Resolution: Won't Fix

> Make it possible to run the same query as part of planner and end to end tests
> --
>
> Key: IMPALA-5241
> URL: https://issues.apache.org/jira/browse/IMPALA-5241
> Project: IMPALA
>  Issue Type: Test
>  Components: Infrastructure
>Reporter: Taras Bobrovytsky
>Priority: Major
>  Labels: newbie
>
> Today, if we want to add a query to both the planner and the end-to-end 
> tests, we have to add the same query in two places.
> This should be fixed as follows:
> It should be possible to have the "PLAN" and "RESULTS" section in a .test 
> file. The EE framework should ignore the plan section. If a test case has the 
> plan section and no results section, the EE framework should ignore this 
> case. The Java Planner test framework should ignore all sections, except the 
> plan section. If a test case does not have a plan section, the planner test 
> framework should ignore it.
> This should make it possible to use the same .test file by both the EE and 
> Planner test frameworks.
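The proposed filtering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file layout (cases separated by lines starting with "====", sections introduced by "---- NAME") is a simplified stand-in for the real .test format, the function names parse_test_file and cases_for are hypothetical, and the plan text in the sample is a placeholder.

```python
def parse_test_file(text):
    """Split a .test file into cases; each case maps section name -> section body.

    Assumed layout: cases are separated by lines starting with '====' and each
    section starts with a line of the form '---- NAME'.
    """
    cases, current, section = [], {}, None
    for line in text.splitlines():
        if line.startswith("===="):
            if current:
                cases.append(current)
            current, section = {}, None
        elif line.startswith("---- "):
            section = line[5:].strip()
            current[section] = []
        elif section is not None:
            current[section].append(line)
    if current:
        cases.append(current)
    return [{name: "\n".join(body) for name, body in case.items()} for case in cases]


def cases_for(framework, cases):
    """Keep only the cases and sections that the given framework cares about."""
    wanted = "PLAN" if framework == "planner" else "RESULTS"
    return [
        {"QUERY": c["QUERY"], wanted: c[wanted]}
        for c in cases
        if "QUERY" in c and wanted in c
    ]


sample = """====
---- QUERY
select 1
---- PLAN
placeholder plan text
---- RESULTS
1
====
---- QUERY
select 2
---- RESULTS
2
====
"""
cases = parse_test_file(sample)
print(cases_for("planner", cases))
print(cases_for("ee", cases))
```

With this sample, the planner view sees only the first case and the end-to-end view sees both, each without the section it is meant to ignore.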






[jira] [Updated] (IMPALA-5929) Remove useless explicit casts to string

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-5929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-5929:
--
Labels:   (was: newbie)

> Remove useless explicit casts to string
> ---
>
> Key: IMPALA-5929
> URL: https://issues.apache.org/jira/browse/IMPALA-5929
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, 
> Impala 2.9.0, Impala 2.10.0
>Reporter: Alexander Behm
>Assignee: Bikramjeet Vig
>Priority: Major
>
> Some BI tools like to generate expressions that look like this:
> {code}
> cast(numeric_col as string) = '123456'
> {code}
> Casting and comparing as string is expensive and we should convert such 
> expressions to:
> {code}
> numeric_col = 123456
> {code}
> Such transformations may be applicable in other situations as well. We should 
> be careful about transforming inequality predicates because the string and 
> numeric comparison might not always be equivalent. For example, the following 
> transformation would be wrong:
> {code}
> cast(numeric_col as string) = '0123456'
> {code}
> is *not* equivalent to 
> {code}
> numeric_col = 123456
> {code}
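One way to decide whether the rewrite is safe for an equality predicate is to check that the string literal is exactly the canonical string form of a number. The Python sketch below covers integers only and is an illustration, not the planner's logic: the function name rewrite_string_equality is hypothetical, and it assumes that casting an integer column to string yields plain decimal digits with no leading zeros, no "+" sign and no whitespace.

```python
def rewrite_string_equality(literal):
    """Return the integer to compare against if cast(col as string) = literal can
    be rewritten to col = <int>, or None if the rewrite would change the result.

    Assumption: casting an integer column to string yields its canonical decimal
    form, so only literals already in that form can ever equal the cast result.
    """
    try:
        value = int(literal)
    except ValueError:
        return None
    # Round-trip check: rejects '0123456', ' 12', '+5', '-0' and similar forms.
    return value if str(value) == literal else None


print(rewrite_string_equality("123456"))
print(rewrite_string_equality("0123456"))
```

Under the same assumption, a literal that fails the round-trip check can never equal the cast result, so such a predicate should not be turned into a numeric comparison.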






[jira] [Resolved] (IMPALA-6042) Allow Impala shell to also use a global impalarc configuration

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-6042.
---
Fix Version/s: Impala 3.4.0
   Resolution: Fixed

> Allow Impala shell to also use a global impalarc configuration
> --
>
> Key: IMPALA-6042
> URL: https://issues.apache.org/jira/browse/IMPALA-6042
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Clients
>Reporter: Balazs Jeszenszky
>Assignee: Ethan
>Priority: Minor
>  Labels: newbie, shell, usability
> Fix For: Impala 3.4.0
>
>
> Currently, impalarc files can be specified on a per-user basis (stored in 
> ~/.impalarc), and they aren't created by default. 
> The Impala shell should pick up /etc/impalarc as well, in addition to the 
> user-specific configurations.
> The intent here is to allow a "global" configuration of the shell by a system 
> administrator with common options like:
> {code}
> --ssl
> -l
> -k
> -u 
> -i 
> {code}
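Layering a global file under the per-user file can be sketched with Python's configparser, where sources read later override earlier ones. This is a minimal sketch rather than the shell's actual implementation: the function name load_shell_config, the [impala] section name and the option names and host names shown are assumptions for illustration, and read_string stands in for reading the real files so that the example is self-contained.

```python
import configparser


def load_shell_config(global_text, user_text):
    """Merge a global config with a per-user config; user settings win."""
    parser = configparser.ConfigParser()
    parser.read_string(global_text)  # stands in for /etc/impalarc
    parser.read_string(user_text)    # stands in for ~/.impalarc
    return dict(parser["impala"]) if parser.has_section("impala") else {}


global_rc = """
[impala]
ssl = true
impalad = gateway.example.com:21000
"""
user_rc = """
[impala]
impalad = localhost:21000
"""
merged = load_shell_config(global_rc, user_rc)
print(merged)
```

With real files, ConfigParser.read() accepts a list of paths and silently skips files that do not exist, so a missing global or per-user file needs no special handling.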






[jira] [Updated] (IMPALA-4140) Increase default threadpool size for statestore heartbeat and topic update

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-4140:
--
Labels:   (was: newbie)

> Increase default threadpool size for statestore heartbeat and topic update
> --
>
> Key: IMPALA-4140
> URL: https://issues.apache.org/jira/browse/IMPALA-4140
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Distributed Exec
>Affects Versions: Impala 2.2
>Reporter: Juan Yu
>Priority: Major
>
> The default threadpool size for statestore heartbeats and topic updates is 10. 
> For a large cluster this could be too small, causing membership requests and 
> topic update tasks not to be processed in a timely manner.
> We should increase the default size, probably to 2-3 times the number of cores.






[jira] [Updated] (IMPALA-9258) impala and hive query result are different

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-9258:
--
Labels:   (was: newbie)

> impala and hive query result are different 
> ---
>
> Key: IMPALA-9258
> URL: https://issues.apache.org/jira/browse/IMPALA-9258
> Project: IMPALA
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: Impala 3.2.0
> Environment: CDH6.2.1
>Reporter: authur wang
>Priority: Major
> Attachments: user_inf.zip
>
>
> After using MapReduce to generate RCFiles, we find that Hive and Impala 
> return different results for the same data. The Hive query returns the 
> correct result, while Impala returns a wrong result.
>  
> The data files are attached.
>  
> The DDL of the table is:
> CREATE EXTERNAL TABLE user_inf (
>  id BIGINT,
>  user_id STRING,
>  cert_id STRING,
>  name STRING,
>  mobile STRING,
>  access_id STRING,
>  status STRING,
>  channel STRING,
>  rec_crt_ts STRING,
>  rec_upd_ts STRING,
>  ver INT
>  )
>  STORED AS RCFILE
>  LOCATION '/user_inf'






[jira] [Resolved] (IMPALA-1593) Pickup configurable default cache replication factor for HDFS

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-1593.
---
Resolution: Later

> Pickup configurable default cache replication factor for HDFS
> -
>
> Key: IMPALA-1593
> URL: https://issues.apache.org/jira/browse/IMPALA-1593
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Frontend
>Affects Versions: Impala 2.0
>Reporter: Martin Grund
>Priority: Minor
>  Labels: newbie
>
> Once HDFS supports a configurable default cache replication factor, we 
> should pick it up.






[jira] [Resolved] (IMPALA-5490) Upgrade to GCC 7.x or greater

2020-06-19 Thread Joe McDonnell (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell resolved IMPALA-5490.
---
Fix Version/s: Impala 4.0
   Resolution: Fixed

> Upgrade to GCC 7.x or greater
> -
>
> Key: IMPALA-5490
> URL: https://issues.apache.org/jira/browse/IMPALA-5490
> Project: IMPALA
>  Issue Type: Epic
>  Components: Infrastructure
>Reporter: Tim Armstrong
>Assignee: Joe McDonnell
>Priority: Minor
> Fix For: Impala 4.0
>
> Attachments: 0001-Switch-to-gcc7.patch
>
>
> This has some nice features like the [[nodiscard]] annotation.






[jira] [Updated] (IMPALA-2210) Make Parquet the default file format

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-2210:
--
Summary: Make Parquet the default file format  (was: Consider making 
Parquet the default file format for CREATE TABLE LIKE PARQUET)

> Make Parquet the default file format
> 
>
> Key: IMPALA-2210
> URL: https://issues.apache.org/jira/browse/IMPALA-2210
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.2.4
>Reporter: John Russell
>Priority: Blocker
>  Labels: incompatibility, newbie, usability
>
> I expect that by far the most common use case for CREATE TABLE LIKE PARQUET 
> is to make a table where the specified Parquet file will be queried.  That 
> is, either:
> CREATE TABLE foo LIKE PARQUET '/blah/blah/file.parq' STORED AS PARQUET;
> LOAD DATA INFILE '/blah/blah/file.parq' INTO TABLE foo;
> or
> CREATE EXTERNAL TABLE foo LIKE PARQUET '/blah/blah/file.parq' STORED AS 
> PARQUET LOCATION '/blah/blah';
> I have difficulty imagining a case where someone would do CREATE TABLE LIKE 
> PARQUET and want the result to be a text table. Even if someone planned to 
> convert Parquet -> text, they would need to have a Parquet table to begin 
> with, in which case they would do CREATE TABLE text_table LIKE parquet_table, 
> not CREATE TABLE LIKE PARQUET.
> It is easy to leave off the STORED AS PARQUET clause by mistake from a CTLP 
> statement, because PARQUET already occurs earlier in the statement, resulting 
> in a text table that throws conversion errors when queried. How about making 
> Parquet the default format in this case, and requiring the STORED AS clause 
> only to use a different file format? (Then if Impala implemented  a CREATE 
> TABLE LIKE AVRO syntax, the default in that case would be Avro.)
> Since I guess this would qualify as an incompatible change, we would need to 
> think through the appropriate release vehicle.






[jira] [Commented] (IMPALA-2210) Consider making Parquet the default file format for CREATE TABLE LIKE PARQUET

2020-06-19 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140745#comment-17140745
 ] 

Tim Armstrong commented on IMPALA-2210:
---

Per discussion on the mailing list, we wanted to make parquet the default file 
format.

> Consider making Parquet the default file format for CREATE TABLE LIKE PARQUET
> -
>
> Key: IMPALA-2210
> URL: https://issues.apache.org/jira/browse/IMPALA-2210
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.2.4
>Reporter: John Russell
>Priority: Blocker
>  Labels: incompatibility, newbie, usability
>
> I expect that by far the most common use case for CREATE TABLE LIKE PARQUET 
> is to make a table where the specified Parquet file will be queried.  That 
> is, either:
> CREATE TABLE foo LIKE PARQUET '/blah/blah/file.parq' STORED AS PARQUET;
> LOAD DATA INFILE '/blah/blah/file.parq' INTO TABLE foo;
> or
> CREATE EXTERNAL TABLE foo LIKE PARQUET '/blah/blah/file.parq' STORED AS 
> PARQUET LOCATION '/blah/blah';
> I have difficulty imagining a case where someone would do CREATE TABLE LIKE 
> PARQUET and want the result to be a text table. Even if someone planned to 
> convert Parquet -> text, they would need to have a Parquet table to begin 
> with, in which case they would do CREATE TABLE text_table LIKE parquet_table, 
> not CREATE TABLE LIKE PARQUET.
> It is easy to leave off the STORED AS PARQUET clause by mistake from a CTLP 
> statement, because PARQUET already occurs earlier in the statement, resulting 
> in a text table that throws conversion errors when queried. How about making 
> Parquet the default format in this case, and requiring the STORED AS clause 
> only to use a different file format? (Then if Impala implemented  a CREATE 
> TABLE LIKE AVRO syntax, the default in that case would be Avro.)
> Since I guess this would qualify as an incompatible change, we would need to 
> think through the appropriate release vehicle.






[jira] [Updated] (IMPALA-2210) Consider making Parquet the default file format for CREATE TABLE LIKE PARQUET

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-2210:
--
Target Version: Impala 4.0  (was: Product Backlog)

> Consider making Parquet the default file format for CREATE TABLE LIKE PARQUET
> -
>
> Key: IMPALA-2210
> URL: https://issues.apache.org/jira/browse/IMPALA-2210
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.2.4
>Reporter: John Russell
>Priority: Minor
>  Labels: incompatibility, newbie, usability
>
> I expect that by far the most common use case for CREATE TABLE LIKE PARQUET 
> is to make a table where the specified Parquet file will be queried.  That 
> is, either:
> CREATE TABLE foo LIKE PARQUET '/blah/blah/file.parq' STORED AS PARQUET;
> LOAD DATA INFILE '/blah/blah/file.parq' INTO TABLE foo;
> or
> CREATE EXTERNAL TABLE foo LIKE PARQUET '/blah/blah/file.parq' STORED AS 
> PARQUET LOCATION '/blah/blah';
> I have difficulty imagining a case where someone would do CREATE TABLE LIKE 
> PARQUET and want the result to be a text table. Even if someone planned to 
> convert Parquet -> text, they would need to have a Parquet table to begin 
> with, in which case they would do CREATE TABLE text_table LIKE parquet_table, 
> not CREATE TABLE LIKE PARQUET.
> It is easy to leave off the STORED AS PARQUET clause by mistake from a CTLP 
> statement, because PARQUET already occurs earlier in the statement, resulting 
> in a text table that throws conversion errors when queried. How about making 
> Parquet the default format in this case, and requiring the STORED AS clause 
> only to use a different file format? (Then if Impala implemented  a CREATE 
> TABLE LIKE AVRO syntax, the default in that case would be Avro.)
> Since I guess this would qualify as an incompatible change, we would need to 
> think through the appropriate release vehicle.






[jira] [Updated] (IMPALA-2210) Consider making Parquet the default file format for CREATE TABLE LIKE PARQUET

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-2210:
--
Priority: Blocker  (was: Minor)

> Consider making Parquet the default file format for CREATE TABLE LIKE PARQUET
> -
>
> Key: IMPALA-2210
> URL: https://issues.apache.org/jira/browse/IMPALA-2210
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.2.4
>Reporter: John Russell
>Priority: Blocker
>  Labels: incompatibility, newbie, usability
>
> I expect that by far the most common use case for CREATE TABLE LIKE PARQUET 
> is to make a table where the specified Parquet file will be queried.  That 
> is, either:
> CREATE TABLE foo LIKE PARQUET '/blah/blah/file.parq' STORED AS PARQUET;
> LOAD DATA INFILE '/blah/blah/file.parq' INTO TABLE foo;
> or
> CREATE EXTERNAL TABLE foo LIKE PARQUET '/blah/blah/file.parq' STORED AS 
> PARQUET LOCATION '/blah/blah';
> I have difficulty imagining a case where someone would do CREATE TABLE LIKE 
> PARQUET and want the result to be a text table. Even if someone planned to 
> convert Parquet -> text, they would need to have a Parquet table to begin 
> with, in which case they would do CREATE TABLE text_table LIKE parquet_table, 
> not CREATE TABLE LIKE PARQUET.
> It is easy to leave off the STORED AS PARQUET clause by mistake from a CTLP 
> statement, because PARQUET already occurs earlier in the statement, resulting 
> in a text table that throws conversion errors when queried. How about making 
> Parquet the default format in this case, and requiring the STORED AS clause 
> only to use a different file format? (Then if Impala implemented  a CREATE 
> TABLE LIKE AVRO syntax, the default in that case would be Avro.)
> Since I guess this would qualify as an incompatible change, we would need to 
> think through the appropriate release vehicle.






[jira] [Assigned] (IMPALA-8013) Switch from boost:: to std:: locks

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-8013:
-

Assignee: Tim Armstrong

> Switch from boost:: to std:: locks
> --
>
> Key: IMPALA-8013
> URL: https://issues.apache.org/jira/browse/IMPALA-8013
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Trivial
>  Labels: newbie
>
> We use boost::unique_lock, boost::lock_guard, boost::mutex, etc. throughout 
> the backend. There are now standard library equivalents. It would be good to 
> switch to them and remove the dependency on that part of boost.
> We need to wait for C++17 support for shared_mutex. All others are available 
> in gcc 4.9.2.






[jira] [Resolved] (IMPALA-8013) Switch from boost:: to std:: locks

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-8013.
---
Fix Version/s: Impala 4.0
   Resolution: Fixed

> Switch from boost:: to std:: locks
> --
>
> Key: IMPALA-8013
> URL: https://issues.apache.org/jira/browse/IMPALA-8013
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Trivial
>  Labels: newbie
> Fix For: Impala 4.0
>
>
> We use boost::unique_lock, boost::lock_guard, boost::mutex, etc. throughout 
> the backend. There are now standard library equivalents. It would be good to 
> switch to them and remove the dependency on that part of boost.
> We need to wait for C++17 support for shared_mutex. All others are available 
> in gcc 4.9.2.






[jira] [Updated] (IMPALA-9875) Deduplicate build in joins with distinct semantics

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-9875:
--
Description: 
For left semi and anti joins with only equi-join predicates, we don't need to 
store duplicates in the hash table, because a probe row will always match the 
first build row. We could rework the build process in PhjBuilder so that it 
builds the hash table on the fly and avoids insertion into the 
BufferedTupleStream if there is a match in the hash table. I.e. the build 
process would be closer to GroupingAggregator.

An alternative approach to building the hash tables on the fly would be to use 
a bloom filter to track which rows are already present in the hash table. This 
would mean some duplicates might be kept. 

Some other joins like that in IMPALA-1706 also have distinct semantics, so 
maybe this could be applied there too to avoid exploding joins.

  was:
For left semi and anti joins with only equi-join predicates, we don't need to 
store duplicates in the hash table, because a probe row will always match the 
first build row. We could rework the build process in PhjBuilder so that it 
builds the hash table on the fly and avoids insertion into the 
BufferedTupleStream if there is a match in the hash table. I.e. the build 
process would be closer to GroupingAggregator.

An alternative approach to building the hash tables on the fly would be to use 
a bloom filter to track which rows are already present in the hash table.

Some other joins like that in IMPALA-1706 also have distinct semantics, so 
maybe this could be applied there too to avoid exploding joins.


> Deduplicate build in joins with distinct semantics
> --
>
> Key: IMPALA-9875
> URL: https://issues.apache.org/jira/browse/IMPALA-9875
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Tim Armstrong
>Priority: Major
>
> For left semi and anti joins with only equi-join predicates, we don't need to 
> store duplicates in the hash table, because a probe row will always match the 
> first build row. We could rework the build process in PhjBuilder so that it 
> builds the hash table on the fly and avoids insertion into the 
> BufferedTupleStream if there is a match in the hash table. I.e. the build 
> process would be closer to GroupingAggregator.
> An alternative approach to building the hash tables on the fly would be to 
> use a bloom filter to track which rows are already present in the hash table. 
> This would mean some duplicates might be kept. 
> Some other joins like that in IMPALA-1706 also have distinct semantics, so 
> maybe this could be applied there too to avoid exploding joins.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-9875) Deduplicate build in joins with distinct semantics

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-9875:
--
Description: 
For left semi and anti joins with only equi-join predicates, we don't need to 
store duplicates in the hash table, because a probe row will always match the 
first build row. We could rework the build process in PhjBuilder so that it 
builds the hash table on the fly and avoids insertion into the 
BufferedTupleStream if there is a match in the hash table. I.e. the build 
process would be closer to GroupingAggregator.

An alternative approach to building the hash tables on the fly would be to use 
a bloom filter to track which rows are already present in the hash table.

Some other joins like that in IMPALA-1706 also have distinct semantics, so 
maybe this could be applied there too to avoid exploding joins.

  was:
For left semi and anti joins with only equi-join predicates, we don't need to 
store duplicates in the hash table, because a probe row will always match the 
first build row. We could rework the build process in PhjBuilder so that it 
builds the hash table on the fly and avoids insertion into the 
BufferedTupleStream if there is a match in the hash table. I.e. the build 
process would be closer to GroupingAggregator.

Some other joins like that in IMPALA-1706 also have distinct semantics, so 
maybe this could be applied there too to avoid exploding joins.


> Deduplicate build in joins with distinct semantics
> --
>
> Key: IMPALA-9875
> URL: https://issues.apache.org/jira/browse/IMPALA-9875
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Tim Armstrong
>Priority: Major
>
> For left semi and anti joins with only equi-join predicates, we don't need to 
> store duplicates in the hash table, because a probe row will always match the 
> first build row. We could rework the build process in PhjBuilder so that it 
> builds the hash table on the fly and avoids insertion into the 
> BufferedTupleStream if there is a match in the hash table. I.e. the build 
> process would be closer to GroupingAggregator.
> An alternative approach to building the hash tables on the fly would be to 
> use a bloom filter to track which rows are already present in the hash table.
> Some other joins like that in IMPALA-1706 also have distinct semantics, so 
> maybe this could be applied there too to avoid exploding joins.






[jira] [Created] (IMPALA-9875) Deduplicate build in joins with distinct semantics

2020-06-19 Thread Tim Armstrong (Jira)
Tim Armstrong created IMPALA-9875:
-

 Summary: Deduplicate build in joins with distinct semantics
 Key: IMPALA-9875
 URL: https://issues.apache.org/jira/browse/IMPALA-9875
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Reporter: Tim Armstrong


For left semi and anti joins with only equi-join predicates, we don't need to 
store duplicates in the hash table, because a probe row will always match the 
first build row. We could rework the build process in PhjBuilder so that it 
builds the hash table on the fly and avoids insertion into the 
BufferedTupleStream if there is a match in the hash table. I.e. the build 
process would be closer to GroupingAggregator.

Some other joins like that in IMPALA-1706 also have distinct semantics, so 
maybe this could be applied there too to avoid exploding joins.
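
As a rough illustration of the idea above (not Impala's PhjBuilder code), the following Python sketch builds the hash table on the fly and only keeps a build row when its key has not been seen yet. The function names and the tuple row layout are hypothetical, and NULL handling for anti joins is deliberately not modelled.

```python
def build_dedup(build_rows, key_of):
    """Build a key -> first build row table, never storing duplicate keys."""
    table = {}
    skipped = 0
    for row in build_rows:
        key = key_of(row)
        if key in table:
            skipped += 1          # duplicate key: not buffered at all
        else:
            table[key] = row      # first row for this key is enough
    return table, skipped


def left_semi_join(probe_rows, table, key_of):
    """Emit each probe row at most once if its key exists on the build side."""
    return [row for row in probe_rows if key_of(row) in table]


def left_anti_join(probe_rows, table, key_of):
    """Emit each probe row whose key is absent from the build side."""
    return [row for row in probe_rows if key_of(row) not in table]


def first_col(row):
    return row[0]


build = [(1, "a"), (1, "b"), (2, "c"), (1, "d")]
probe = [(1, "x"), (3, "y"), (2, "z")]

table, skipped = build_dedup(build, first_col)
semi = left_semi_join(probe, table, first_col)
anti = left_anti_join(probe, table, first_col)
```

Because both join types only ask whether a matching key exists, the two skipped build rows do not change either result.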






[jira] [Created] (IMPALA-9874) Reduce or avoid I/O for pruned columns

2020-06-19 Thread Tim Armstrong (Jira)
Tim Armstrong created IMPALA-9874:
-

 Summary: Reduce or avoid I/O for pruned columns
 Key: IMPALA-9874
 URL: https://issues.apache.org/jira/browse/IMPALA-9874
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Tim Armstrong


Skipping decoding of values may not be effective at reducing I/O in some cases, 
because we start the I/O in StartScans(). We don't wait for the I/O until we 
actually read the first data page from the column reader. So there is a race to 
determine whether the I/O happens in some cases.

There are a couple of things we can do here.
* The basic thing is to issue reads for the column readers in the order in 
which they are needed. We may be able to get this for free by ordering the 
column readers based on materialisation order.
* We also want to avoid issuing I/O for columns that are not needed, if 
predicates are highly selective. This is maybe a bit harder and involves more 
trade-offs, since delaying issuing of the reads may impact scan latency.
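
A minimal sketch of the two points above, assuming each column reader carries a materialisation order and that we know how many rows survive after each predicate column. `ColumnReader`, `plan_reads` and `surviving_rows_after` are hypothetical names for illustration, not Impala APIs.

```python
from dataclasses import dataclass


@dataclass
class ColumnReader:
    """Hypothetical stand-in for a Parquet column reader."""
    name: str
    materialise_order: int  # 0 = needed first, e.g. a predicate column


def plan_reads(readers, surviving_rows_after):
    """Return the column names whose I/O would be issued, in issue order.

    Reads are issued in materialisation order and stop as soon as the
    predicates evaluated so far have filtered out every row.
    surviving_rows_after maps a column name to the number of rows left
    after that column's predicates; columns without an entry are assumed
    not to filter anything.
    """
    issued = []
    for reader in sorted(readers, key=lambda r: r.materialise_order):
        issued.append(reader.name)
        if surviving_rows_after.get(reader.name, 1) == 0:
            break  # nothing left to materialise, skip the remaining I/O
    return issued


readers = [
    ColumnReader("comment", 2),
    ColumnReader("id", 0),
    ColumnReader("status", 1),
]

selective = plan_reads(readers, {"id": 100, "status": 0})
unselective = plan_reads(readers, {})
```

In the selective case the read for `comment` is never issued; the cost is that its I/O cannot overlap with the earlier columns, which is the latency trade-off mentioned above.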






[jira] [Created] (IMPALA-9873) Skip decoding of non-materialised columns in Parquet

2020-06-19 Thread Tim Armstrong (Jira)
Tim Armstrong created IMPALA-9873:
-

 Summary: Skip decoding of non-materialised columns in Parquet
 Key: IMPALA-9873
 URL: https://issues.apache.org/jira/browse/IMPALA-9873
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Tim Armstrong


This is a first milestone for lazy materialization in Parquet, focusing on 
avoiding decompression and decoding of columns.

* Identify columns referenced by predicates and runtime row filters and 
determine what order the columns need to be materialised in. Probably we want 
to evaluate static predicates before runtime filters to match current behaviour.
* Rework this loop so that it alternates between materialising columns and 
evaluating predicates: 
https://github.com/apache/impala/blob/052129c/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1110
* We probably need to keep track of filtered rows using a new data structure, 
e.g. a bitmap.
* We then need to check that bitmap at each step to see if we can skip 
materialising part or all of the following columns. E.g. if the first N rows 
were pruned, we can skip the remaining readers forward by N rows.
* This part may be a little tricky - there is the risk of adding overhead 
compared to the current code.
* It is probably OK to just materialise the partition columns to start off with 
- avoiding materialising those is not going to buy that much.
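
The steps above could look roughly like the following Python sketch, which alternates between materialising a column and evaluating its predicate while tracking surviving rows in a bitmap. This is an illustration under simplifying assumptions (in-memory lists instead of encoded pages, one predicate per column); `scan_batch` and its parameters are hypothetical and do not correspond to code in hdfs-parquet-scanner.cc.

```python
def scan_batch(columns, order, predicates):
    """Materialise columns in `order`, skipping rows already filtered out.

    columns:    column name -> list of values (stand-in for encoded data)
    order:      column names in materialisation order
    predicates: column name -> callable returning True to keep the row
    Returns the surviving rows and the number of values actually decoded.
    """
    num_rows = len(columns[order[0]])
    selected = [True] * num_rows              # bitmap of rows still alive
    decoded = {name: [None] * num_rows for name in order}
    decode_calls = 0

    for name in order:
        pred = predicates.get(name)
        for i in range(num_rows):
            if not selected[i]:
                continue                      # pruned earlier: skip decoding
            value = columns[name][i]          # stands in for decompress + decode
            decode_calls += 1
            decoded[name][i] = value
            if pred is not None and not pred(value):
                selected[i] = False

    rows = [
        tuple(decoded[name][i] for name in order)
        for i in range(num_rows)
        if selected[i]
    ]
    return rows, decode_calls


columns = {"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]}
rows, decode_calls = scan_batch(
    columns,
    order=["id", "name"],
    predicates={"id": lambda v: v % 2 == 0},
)
```

Here the `name` column is decoded for only two of the four rows. The per-row bitmap check is the kind of overhead the last points warn about, so a real implementation would likely skip in runs rather than row by row.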






[jira] [Resolved] (IMPALA-3052) Reorder Parquet Column readers such that slots with probe filters are read first

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3052.
---
Resolution: Duplicate

> Reorder Parquet Column readers such that slots with probe filters are read 
> first
> 
>
> Key: IMPALA-3052
> URL: https://issues.apache.org/jira/browse/IMPALA-3052
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Affects Versions: Impala 2.5.0
>Reporter: Mostafa Mokhtar
>Priority: Minor
>  Labels: performance, runtime-filters
>
> When applying selective probe filters, 
> {code}HdfsParquetScanner::CreateColumnReaders{code} should order 
> column_readers such that slots with selective probe filters come first. This 
> should reduce the number of calls to ReadSlot.






[jira] [Updated] (IMPALA-2017) Lazy materialization of Parquet columns during query

2020-06-19 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-2017:
--
Summary: Lazy materialization of Parquet columns during query  (was: Lazy 
materialization of columns during query)

> Lazy materialization of Parquet columns during query
> 
>
> Key: IMPALA-2017
> URL: https://issues.apache.org/jira/browse/IMPALA-2017
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 1.4, Impala 2.0, Impala 2.1, Impala 2.2
>Reporter: Lou Bershad
>Priority: Minor
>  Labels: parquet, performance
>
> When I run a query over a 4 billion row table that returns a single row, it 
> takes ~30 seconds if I do 'select * ...'.  It takes only 3 seconds if I do a 
> 'select field1, field2 ...'.  This is repeatable.  
> Given these times, it would seem that the 'select *' query is materializing 
> all the fields for rows whether they match or not.  
> Lazy materialization of columns when they are needed could improve 
> performance.
>  
> These four queries were run back to back.  The actual returned data is elided 
> (sorry).  The table has 35 fields.
> {noformat}
> 0: jdbc:hive2://atl1c1r2data09.vldb-bo.secure> select * from events where 
> event_id=1416403791; 
> 
> 1 row selected (33.777 seconds)
> 0: jdbc:hive2://atl1c1r2data09.vldb-bo.secure> select event_id, client_id 
> from events where event_id=1416403791;
> +------------+-----------+
> | event_id   | client_id |
> +------------+-----------+
> | 1416403791 |           |
> +------------+-----------+
> 1 row selected (3.363 seconds)
> 0: jdbc:hive2://atl1c1r2data09.vldb-bo.secure> select * from events where 
> event_id=1416403791; 
> 
> 1 row selected (33.138 seconds)
> 0: jdbc:hive2://atl1c1r2data09.vldb-bo.secure> select event_id, client_id 
> from events where event_id=1416403791;
> +------------+-----------+
> | event_id   | client_id |
> +------------+-----------+
> | 1416403791 |           |
> +------------+-----------+
> 1 row selected (3.074 seconds)
> 0: jdbc:hive2://atl1c1r2data09.vldb-bo.secure>
> {noformat}






[jira] [Commented] (IMPALA-9229) Link failed and retried runtime profiles

2020-06-19 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140627#comment-17140627
 ] 

Tim Armstrong commented on IMPALA-9229:
---

[~stigahuang] I remembered that I wasn't sure that the profiles of the failed 
queries are included in the profile log - see IMPALA-9872

> Link failed and retried runtime profiles
> 
>
> Key: IMPALA-9229
> URL: https://issues.apache.org/jira/browse/IMPALA-9229
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Priority: Major
>
> There should be a way for clients to link the runtime profiles from failed 
> queries to all retry attempts (whether successful or not), and vice versa.
> There are a few ways to do this:
>  * The simplest way would be to include the query id of the retried query in 
> the runtime profile of the failed query, and vice versa; users could then 
> manually create a chain of runtime profiles in order to fetch all failed / 
> successful attempts
>  * Extend TGetRuntimeProfileReq to include an option to fetch all runtime 
> profiles for the given query id + all retry attempts (or add a new Thrift 
> call TGetRetryQueryIds(TQueryId) which returns a list of retried ids for a 
> given query id)
>  * The Impala debug UI should include a simple way to view all the runtime 
> profiles of a query (the failed attempts + all retry attempts) side by side 
> (perhaps the query_profile?query_id profile should include tabs to easily 
> switch between the runtime profiles of each attempt)
> These are not mutually exclusive, and it might be good to stage these changes.
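
The first option above (cross-referencing query ids and letting users build the chain themselves) could be sketched as follows. The profile representation and the field names `original` and `retried` are assumptions made for illustration; they are not existing Impala profile fields.

```python
def attempt_chain(profiles, query_id):
    """Return all attempts of a query, oldest first, starting from any attempt.

    profiles maps a query id to a dict with two optional links:
      "original": id of the attempt this one retried, or None
      "retried":  id of the attempt that retried this one, or None
    """
    # Walk backwards to the first attempt.
    seen = set()
    current = query_id
    while profiles[current].get("original") is not None and current not in seen:
        seen.add(current)
        current = profiles[current]["original"]

    # Walk forwards, collecting every retry.
    chain = []
    seen = set()
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = profiles[current].get("retried")
    return chain


profiles = {
    "q1": {"original": None, "retried": "q2"},
    "q2": {"original": "q1", "retried": "q3"},
    "q3": {"original": "q2", "retried": None},
}

chain = attempt_chain(profiles, "q2")
```

Storing links in both directions lets a client start from whichever attempt it happens to know about, which is the "and vice versa" part of the proposal.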






[jira] [Created] (IMPALA-9872) Profile log does not include profiles of failed queries

2020-06-19 Thread Tim Armstrong (Jira)
Tim Armstrong created IMPALA-9872:
-

 Summary: Profile log does not include profiles of failed queries
 Key: IMPALA-9872
 URL: https://issues.apache.org/jira/browse/IMPALA-9872
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Tim Armstrong


As far as I can tell, ImpalaServer::ArchiveQuery() only logs the profile for 
the last run of the query. We should also include the profiles of failed 
attempts at the query in the profile log.

[~stigahuang]






[jira] [Created] (IMPALA-9871) Toolchain bootstrap download fails on SLES12 sp5

2020-06-19 Thread Laszlo Gaal (Jira)
Laszlo Gaal created IMPALA-9871:
---

 Summary: Toolchain bootstrap download fails on SLES12 sp5 
 Key: IMPALA-9871
 URL: https://issues.apache.org/jira/browse/IMPALA-9871
 Project: IMPALA
  Issue Type: Bug
  Components: Infrastructure
Affects Versions: Impala 4.0
 Environment: SLES12 sp5
Reporter: Laszlo Gaal


bootstrap_toolchain.py fails to resolve the {{lsb_release -sir}} signature for 
the recently released SP5:
{code}
22:13:31 2020/06/18 16:43:31 INFO: Traceback (most recent call last):
22:13:31 2020/06/18 16:43:31 INFO:   File "./bin/bootstrap_toolchain.py", line 638, in <module>
22:13:31 2020/06/18 16:43:31 INFO:     if __name__ == "__main__": main()
22:13:31 2020/06/18 16:43:31 INFO:   File "./bin/bootstrap_toolchain.py", line 617, in main
22:13:31 2020/06/18 16:43:31 INFO:     downloads += get_toolchain_downloads()
22:13:31 2020/06/18 16:43:31 INFO:   File "./bin/bootstrap_toolchain.py", line 523, in get_toolchain_downloads
22:13:31 2020/06/18 16:43:31 INFO:     llvm_package = ToolchainPackage("llvm")
22:13:31 2020/06/18 16:43:31 INFO:   File "./bin/bootstrap_toolchain.py", line 243, in __init__
22:13:31 2020/06/18 16:43:31 INFO:     label = get_platform_release_label(release=platform_release).toolchain
22:13:31 2020/06/18 16:43:31 INFO:   File "./bin/bootstrap_toolchain.py", line 357, in get_platform_release_label
22:13:31 2020/06/18 16:43:31 INFO:     raise Exception("Could not find package label for OS version: {0}.".format(release))
22:13:31 2020/06/18 16:43:31 INFO: Exception: Could not find package label for OS version: suse12.5.
{code}
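
To illustrate the failure mode, here is a minimal sketch of a release-to-label lookup that accepts any SLES 12 service pack. The table contents and label strings are invented for illustration and are not the actual mapping in bootstrap_toolchain.py; only the exception message is taken from the traceback above.

```python
import re

# Illustrative table only: (release pattern, package label).
# The real mapping in bootstrap_toolchain.py may look quite different.
OS_LABELS = [
    (r"suse12\.[0-9]+", "sles-12"),
    (r"ubuntu16\.04", "ubuntu-16-04"),
]


def get_label(release):
    """Return the package label for a release string such as 'suse12.5'."""
    for pattern, label in OS_LABELS:
        if re.fullmatch(pattern, release):
            return label
    raise Exception(
        "Could not find package label for OS version: {0}.".format(release))


sp5_label = get_label("suse12.5")
```

Matching the service pack with a pattern rather than an exact string is one possible way to avoid a new failure on every minor release, assuming packages built for one service pack also work on later ones.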






[jira] [Assigned] (IMPALA-9871) Toolchain bootstrap download fails on SLES12 sp5

2020-06-19 Thread Laszlo Gaal (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Laszlo Gaal reassigned IMPALA-9871:
---

Assignee: Laszlo Gaal

> Toolchain bootstrap download fails on SLES12 sp5 
> -
>
> Key: IMPALA-9871
> URL: https://issues.apache.org/jira/browse/IMPALA-9871
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 4.0
> Environment: SLES12 sp5
>Reporter: Laszlo Gaal
>Assignee: Laszlo Gaal
>Priority: Blocker
>
> bootstrap_toolchain.py fails to resolve the {{lsb_release -sir}} signature 
> for the recently released SP5:
> {code}
> 22:13:31 2020/06/18 16:43:31 INFO: Traceback (most recent call last):
> 22:13:31 2020/06/18 16:43:31 INFO:   File "./bin/bootstrap_toolchain.py", line 638, in <module>
> 22:13:31 2020/06/18 16:43:31 INFO:     if __name__ == "__main__": main()
> 22:13:31 2020/06/18 16:43:31 INFO:   File "./bin/bootstrap_toolchain.py", line 617, in main
> 22:13:31 2020/06/18 16:43:31 INFO:     downloads += get_toolchain_downloads()
> 22:13:31 2020/06/18 16:43:31 INFO:   File "./bin/bootstrap_toolchain.py", line 523, in get_toolchain_downloads
> 22:13:31 2020/06/18 16:43:31 INFO:     llvm_package = ToolchainPackage("llvm")
> 22:13:31 2020/06/18 16:43:31 INFO:   File "./bin/bootstrap_toolchain.py", line 243, in __init__
> 22:13:31 2020/06/18 16:43:31 INFO:     label = get_platform_release_label(release=platform_release).toolchain
> 22:13:31 2020/06/18 16:43:31 INFO:   File "./bin/bootstrap_toolchain.py", line 357, in get_platform_release_label
> 22:13:31 2020/06/18 16:43:31 INFO:     raise Exception("Could not find package label for OS version: {0}.".format(release))
> 22:13:31 2020/06/18 16:43:31 INFO: Exception: Could not find package label for OS version: suse12.5.
> {code}






[jira] [Reopened] (IMPALA-6506) Codegen in ORC scanner

2020-06-19 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang reopened IMPALA-6506:


Reopening to edit the fix version to include 3.4.

> Codegen in ORC scanner
> --
>
> Key: IMPALA-6506
> URL: https://issues.apache.org/jira/browse/IMPALA-6506
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Quanlong Huang
>Assignee: Gabor Kaszab
>Priority: Major
> Fix For: Impala 4.0
>
>
> Currently, the orc-scanner materializes tuples from the orc-reader (ORC lib). 
> We need a Codegen implementation in TransferScratchTuples for the runtime 
> filter + conjunct evaluation loop.






[jira] [Resolved] (IMPALA-6506) Codegen in ORC scanner

2020-06-19 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang resolved IMPALA-6506.

Fix Version/s: Impala 3.4.0
   Resolution: Fixed

> Codegen in ORC scanner
> --
>
> Key: IMPALA-6506
> URL: https://issues.apache.org/jira/browse/IMPALA-6506
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Quanlong Huang
>Assignee: Gabor Kaszab
>Priority: Major
> Fix For: Impala 4.0, Impala 3.4.0
>
>
> Currently, the orc-scanner materializes tuples from the orc-reader (ORC lib). 
> We need a Codegen implementation in TransferScratchTuples for the runtime 
> filter + conjunct evaluation loop.






[jira] [Updated] (IMPALA-9228) ORC scanner could be vectorized

2020-06-19 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-9228:
---
Fix Version/s: Impala 3.4.0

> ORC scanner could be vectorized
> ---
>
> Key: IMPALA-9228
> URL: https://issues.apache.org/jira/browse/IMPALA-9228
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zoltán Borók-Nagy
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: orc
> Fix For: Impala 4.0, Impala 3.4.0
>
> Attachments: 1-4_col_measurement_int_only.png
>
>
> The ORC scanner uses an external library to read ORC files. The library 
> reads the file contents into its own memory representation. It is a 
> vectorized representation similar to the Arrow format.
> Impala needs to convert the ORC row batch to an Impala row batch. Currently 
> the conversion happens row-wise via virtual function calls:
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/hdfs-orc-scanner.cc#L671]
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L352]
> Instead of this approach, it could work similarly to the Parquet scanner, which 
> fills the columns one-by-one into a scratch batch, then evaluates the 
> conjuncts on the scratch batch. For more details see 
> HdfsParquetScanner::AssembleRows():
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1077-L1088]
> This way we'll need far fewer virtual function calls, and the memory 
> reads/writes will be much more localized and predictable.
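
The column-wise approach described above can be sketched in Python as follows: each column is converted with a single call into a scratch batch, and conjuncts are evaluated afterwards. This is only an illustration of the control flow; `fill_scratch_batch`, `evaluate_conjuncts` and the dict-of-lists batch layout are hypothetical and do not mirror the ORC library or Impala's row batch format.

```python
def fill_scratch_batch(orc_batch, converters):
    """Convert an ORC-style columnar batch one column at a time.

    orc_batch:  column name -> list of raw values
    converters: column name -> function converting a whole column
    One converter call is made per column instead of one per value.
    """
    return {name: convert(orc_batch[name]) for name, convert in converters.items()}


def evaluate_conjuncts(scratch, conjuncts):
    """Keep the rows of the scratch batch that satisfy every conjunct."""
    names = list(scratch)
    num_rows = len(scratch[names[0]])
    survivors = []
    for i in range(num_rows):
        row = {name: scratch[name][i] for name in names}
        if all(conjunct(row) for conjunct in conjuncts):
            survivors.append(row)
    return survivors


orc_batch = {"id": [1, 2, 3], "name": [b"a", b"b", b"c"]}
converters = {
    "id": list,                                        # plain copy
    "name": lambda col: [v.decode("utf-8") for v in col],
}

scratch = fill_scratch_batch(orc_batch, converters)
result = evaluate_conjuncts(scratch, [lambda row: row["id"] > 1])
```

The dispatch on column type happens once per column per batch here, which is where the saving in virtual function calls would come from.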






[jira] [Updated] (IMPALA-9226) Improve string allocations of the ORC scanner

2020-06-19 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-9226:
---
Fix Version/s: Impala 3.4.0

> Improve string allocations of the ORC scanner
> -
>
> Key: IMPALA-9226
> URL: https://issues.apache.org/jira/browse/IMPALA-9226
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zoltán Borók-Nagy
>Assignee: Norbert Luksa
>Priority: Major
>  Labels: orc
> Fix For: Impala 4.0, Impala 3.4.0
>
>
> Currently the ORC scanner allocates new memory for each string value (except 
> for fixed size strings):
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172]
> Besides causing too many allocations and copies, this is also bad for memory 
> locality.
> Since ORC-501 StringVectorBatch has a member named 'blob' that contains the 
> strings in the batch: 
> [https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126]
> 'blob' has type DataBuffer which is movable, so Impala might be able to get 
> ownership of it. Or, at least we could copy the whole blob array instead of 
> copying the strings one-by-one.
> ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC 
> 1.5.5.
> ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:
> [https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153]
> It uses dictionary encoding for storing the values. Impala could copy/move 
> the dictionary as well.
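
A small Python sketch of the difference between copying strings one by one and copying the whole blob once. It assumes the batch exposes one contiguous buffer plus per-string offsets and lengths; the function names are hypothetical, and Python's `bytes`/`memoryview` merely stand in for the buffer ownership discussed above.

```python
def copy_strings_one_by_one(blob, offsets, lengths):
    """One allocation and one copy per string value."""
    return [bytes(blob[o:o + n]) for o, n in zip(offsets, lengths)]


def copy_blob_once(blob, offsets, lengths):
    """One allocation and one copy for the whole batch.

    Each string is then a zero-copy view into the single owned buffer,
    so values of one batch stay adjacent in memory.
    """
    owned = bytes(blob)                    # the only copy
    view = memoryview(owned)
    return owned, [view[o:o + n] for o, n in zip(offsets, lengths)]


# Stand-in for a string batch: three values packed into one buffer.
blob = bytearray(b"foobarbaz")
offsets = [0, 3, 6]
lengths = [3, 3, 3]

separate = copy_strings_one_by_one(blob, offsets, lengths)
owned, views = copy_blob_once(blob, offsets, lengths)
as_bytes = [v.tobytes() for v in views]
```

The views remain valid only while `owned` is alive, which corresponds to the requirement that Impala would need to take ownership of the blob (or of its copy) for the lifetime of the row batch.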


