[jira] [Created] (IMPALA-13424) Improve diagnostics for OOM in JVM
Csaba Ringhofer created IMPALA-13424: Summary: Improve diagnostics for OOM in JVM Key: IMPALA-13424 URL: https://issues.apache.org/jira/browse/IMPALA-13424 Project: IMPALA Issue Type: Improvement Components: Catalog, Frontend Reporter: Csaba Ringhofer

By default impalad/catalogd can survive a failed allocation in Java: the JVM throws an exception that may abort the given thread instead of crashing the whole process. There is already a ticket (IMPALA-1956) about changing the default behavior.

Tested locally by allocating until OOM during HdfsTable.load(), which led to an OutOfMemoryError being thrown in catalogd. This led to failing to load the table, while catalogd + coordinator could still function properly afterwards. Error reporting for the client / in the coordinator log seemed OK, but catalogd has no log lines or metrics that clearly identify the issue:

client: {code} AnalysisException: Failed to load metadata for table: 'functional.alltypestiny' CAUSED BY: TableLoadingException: Failed to load metadata for table: functional.alltypestiny. Running 'invalidate metadata functional.alltypestiny' may resolve this problem. 
CAUSED BY: OutOfMemoryError: Java heap space {code} coordinator log: {code} I1007 16:48:41.760563 126482 jni-util.cc:321] 604b67244b0793c6:32c416c0] org.apache.impala.common.AnalysisException: Failed to load metadata for table: 'functional.alltypestiny' at org.apache.impala.analysis.Analyzer.resolveTableRef(Analyzer.java:990) at org.apache.impala.analysis.FromClause.analyze(FromClause.java:87) at org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyze(SelectStmt.java:330) at org.apache.impala.analysis.SelectStmt$SelectAnalyzer.access$100(SelectStmt.java:282) at org.apache.impala.analysis.SelectStmt.analyze(SelectStmt.java:274) at org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:558) at org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:505) at org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2586) at org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2268) at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:2029) at org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:175) Caused by: org.apache.impala.catalog.TableLoadingException: Failed to load metadata for table: functional.alltypestiny. Running 'invalidate metadata functional.alltypestiny' may resolve this problem. 
CAUSED BY: OutOfMemoryError: Java heap space at org.apache.impala.catalog.IncompleteTable.loadFromThrift(IncompleteTable.java:181) at org.apache.impala.catalog.Table.fromThrift(Table.java:591) at org.apache.impala.catalog.ImpaladCatalog.addTable(ImpaladCatalog.java:489) at org.apache.impala.catalog.ImpaladCatalog.addCatalogObject(ImpaladCatalog.java:344) at org.apache.impala.catalog.ImpaladCatalog.updateCatalog(ImpaladCatalog.java:262) at org.apache.impala.service.FeCatalogManager$CatalogdImpl.updateCatalogCache(FeCatalogManager.java:114) at org.apache.impala.service.Frontend.updateCatalogCache(Frontend.java:590) at org.apache.impala.service.JniFrontend.updateCatalogCache(JniFrontend.java:196) at .: org.apache.impala.catalog.TableLoadingException: Failed to load metadata for table: functional.alltypestiny. Running 'invalidate metadata functional.alltypestiny' may resolve this problem. at org.apache.impala.catalog.TableLoader.load(TableLoader.java:168) at org.apache.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:251) at org.apache.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:247) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.OutOfMemoryError: Java heap space () {code} catalogd log: {code} I1007 16:48:33.565677 126490 TableLoader.java:79] Loading metadata for: functional.alltypestiny (background load) ... I1007 16:48:38.755443 126490 TableLoader.java:177] Loaded metadata for: functional.alltypestiny (5189ms) I1007 16:48:38.755504 114205 JvmPauseMonitor.java:209] Detected pause in JVM or host machine (eg GC): pause of approximately 1357ms GC pool 'PS MarkSweep' had collection(s): count=2 time=1702ms {code} The oom doesn't lead to any warning/error level logs. 
Ideally there should be an error-level log with the name of the table and the callstack of the failed allocation. -- This message was sent by Atlassian Jira (v8.20.10#820010) ---
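The desired behavior could be sketched as follows (a Python analogy only: MemoryError stands in for Java's OutOfMemoryError, and load_table/failing_loader are hypothetical names, not Impala's API):

```python
import logging
import traceback

LOG = logging.getLogger("catalogd")

def load_table(table_name, loader):
    # Wrap a table load so that a failed allocation produces an error-level
    # log with the table name and the callstack of the failed allocation,
    # instead of only surfacing as a client-side exception.
    try:
        return loader()
    except MemoryError:
        LOG.error("OOM while loading metadata for table %s:\n%s",
                  table_name, traceback.format_exc())
        raise

def failing_loader():
    # Stands in for the JVM's "OutOfMemoryError: Java heap space".
    raise MemoryError("Java heap space")

try:
    load_table("functional.alltypestiny", failing_loader)
except MemoryError:
    pass  # the load still fails; the difference is the error-level log line
```

The caller still sees the failure; the change is only that the catalogd-side log now identifies both the table and the allocation site.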
[jira] [Updated] (IMPALA-12594) KrpcDataStreamSender's mem estimate is different than real usage
[ https://issues.apache.org/jira/browse/IMPALA-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12594: - Description: IMPALA-6684 added memory estimates for KrpcDataStreamSender, but there are a few gaps between how the frontend estimates memory and how the backend actually allocates it:

The frontend uses the following formula: buffer_size = num_channels * 2 * (tuple_buffer_length + compressed_buffer_length) This accounts for the serialization and compression buffers of each OutboundRowBatch. This can both under- and overestimate:
1. it doesn't account for the RowBatch used by channels during partitioned exchange to collect rows belonging to a single channel https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L232
2. it ignores the adjustment to the RowBatch capacity above based on the flag data_stream_sender_buffer_size https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L379 This adjustment can either increase or decrease the capacity to reach the desired total size (16KB by default).

Note that the adjustment above ignores var-len data, so it can massively underestimate in some cases. Meanwhile the frontend logic calculates string sizes if stats are present. Ideally both pieces of logic would be improved and synced to use both data_stream_sender_buffer_size and the string sizes for the estimate (I am not sure about collection types).

> KrpcDataStreamSender's mem estimate is different than real usage > > > Key: IMPALA-12594 > URL: https://issues.apache.org/jira/browse/IMPALA-12594 > Project: IMPALA > Issue Type: Bug > Components: Backend, Frontend > Reporter: Csaba Ringhofer > Priority: Major
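For reference, the frontend formula above can be sketched as follows (the numbers are illustrative only, not values taken from Impala's planner):

```python
def frontend_estimate(num_channels, tuple_buffer_length, compressed_buffer_length):
    # Two OutboundRowBatch buffers (serialization + compression) per channel.
    return num_channels * 2 * (tuple_buffer_length + compressed_buffer_length)

# Example: 8 channels with a 16KB tuple buffer and a 16KB compression buffer each.
estimate = frontend_estimate(8, 16 * 1024, 16 * 1024)  # 524288 bytes

# What the formula misses, per the gaps described above:
# 1. the per-channel RowBatch that collects rows in a partitioned exchange;
# 2. the backend's capacity adjustment toward data_stream_sender_buffer_size
#    (16KB by default), which can push real usage above or below the estimate.
```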
[jira] [Resolved] (IMPALA-13371) Avoid throwing exceptions in FileSystemUtil::FindFileInPath()
[ https://issues.apache.org/jira/browse/IMPALA-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-13371. -- Fix Version/s: Impala 4.5.0 Resolution: Fixed

> Avoid throwing exceptions in FileSystemUtil::FindFileInPath() > - > > Key: IMPALA-13371 > URL: https://issues.apache.org/jira/browse/IMPALA-13371 > Project: IMPALA > Issue Type: Improvement > Components: Backend > Reporter: Csaba Ringhofer > Assignee: Csaba Ringhofer > Priority: Critical > Fix For: Impala 4.5.0 > > > Some functions in std::filesystem can throw exceptions in some scenarios. We > should catch the exception in all cases or use an overload with noexcept. > Even if the error is fatal, throwing an exception can lead to not logging it > properly. > An example is std::filesystem::exists(): > https://github.com/apache/impala/blob/22723d0f276468a25553f007dc65b21d79bd821d/be/src/util/filesystem-util.cc#L271 > Other functions use an overload with noexcept: > https://github.com/apache/impala/blob/22723d0f276468a25553f007dc65b21d79bd821d/be/src/util/filesystem-util.cc#L75 > https://en.cppreference.com/w/cpp/filesystem/exists
[jira] [Updated] (IMPALA-13374) impala-shell can hit errors when downloading runtime profile
[ https://issues.apache.org/jira/browse/IMPALA-13374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13374: - Description: There are several issues with the current way runtime profiles are downloaded in impala-shell: https://github.com/apache/impala/blob/874e4fa117bdccfb8784c1987e5e3bf1ef4fbc1d/shell/impala_shell.py#L1463

1. The profile is fetched AFTER the queries are closed, which means that Impala may have already discarded it from memory, in which case the RPC will return an error. (Closing the query happens at a different point depending on is_dml, but both paths close before fetching the profile.)
2. If show_profiles=true, then failing to fetch the profile is treated as an error. This leads to just an error message in interactive sessions, but with the -q or -f parameter it stops executing the queries and returns a non-zero exit status.
3. The profile is fetched from Impala even if it is not used at all (show_profiles=false, which is the default). This is not a functional bug but can impact performance.
4. The downloaded profile is not cached, so a subsequent PROFILE; command will download it again. This is not just an optimization issue: it may lead to script failures if the profile has already been discarded when PROFILE; is called. Note that the "already discarded" case has special handling in the SUMMARY (but not the PROFILE) command: if the query id is not found, it is not treated as an error. https://github.com/apache/impala/blob/874e4fa117bdccfb8784c1987e5e3bf1ef4fbc1d/shell/impala_shell.py#L820

The main problem is the combination of 1 and 2, as it can lead to failures when show_profiles=true even when everything works as expected and the coordinator discards the profile between close and get_runtime_profile.

> impala-shell can hit errors when downloading runtime profile > > > Key: IMPALA-13374 > URL: https://issues.apache.org/jira/browse/IMPALA-13374 > Project: IMPALA > Issue Type: Bug > Components: Clients > Reporter: Csaba Ringhofer > Priority: Critical
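A possible shape for addressing the four issues could be sketched as follows (a hedged sketch: ProfileCache, fetch_profile, and close_query are hypothetical names for illustration, not impala-shell's actual API). The idea is to fetch the profile before closing the query, only when it will be used, cache it, and treat a missing profile as non-fatal:

```python
class ProfileCache:
    def __init__(self):
        self._profiles = {}

    def fetch_before_close(self, query_id, fetch_profile, close_query,
                           want_profile):
        # Fetch while the coordinator still holds the profile (issue 1),
        # and skip the RPC entirely when show_profiles=false (issue 3).
        if want_profile:
            try:
                self._profiles[query_id] = fetch_profile(query_id)
            except Exception:
                # A missing profile should not abort -q/-f runs (issue 2):
                # swallow the error instead of returning a non-zero exit.
                pass
        close_query(query_id)

    def get(self, query_id):
        # Serve later PROFILE; commands from the cache instead of
        # re-downloading a possibly discarded profile (issue 4).
        return self._profiles.get(query_id)
```

With this ordering, a coordinator discarding the profile between close and get_runtime_profile can no longer fail an otherwise successful script run.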
[jira] [Created] (IMPALA-13374) impala-shell can hit errors when downloading runtime profile
Csaba Ringhofer created IMPALA-13374: Summary: impala-shell can hit errors when downloading runtime profile Key: IMPALA-13374 URL: https://issues.apache.org/jira/browse/IMPALA-13374 Project: IMPALA Issue Type: Bug Components: Clients Reporter: Csaba Ringhofer

There are several issues with the current way runtime profiles are downloaded in impala-shell: https://github.infra.cloudera.com/CDH/Impala/blob/2010c93bd364795d4ee7d17ea8805450658fc485/shell/impala_shell.py#L1196

1. The profile is fetched AFTER the queries are closed, which means that Impala may have already discarded it from memory, in which case the RPC will return an error. (Closing the query happens at a different point depending on is_dml, but both paths close before fetching the profile.)
2. If show_profiles=true, then failing to fetch the profile is treated as an error. This leads to just an error message in interactive sessions, but with the -q or -f parameter it stops executing the queries and returns a non-zero exit status.
3. The profile is fetched from Impala even if it is not used at all (show_profiles=false, which is the default). This is not a functional bug but can impact performance.
4. The downloaded profile is not cached, so a subsequent PROFILE; command will download it again. This is not just an optimization issue: it may lead to script failures if the profile has already been discarded when PROFILE; is called. Note that the "already discarded" case has special handling in the SUMMARY (but not the PROFILE) command: if the query id is not found, it is not treated as an error. https://github.infra.cloudera.com/CDH/Impala/blob/2010c93bd364795d4ee7d17ea8805450658fc485/shell/impala_shell.py#L684

The main problem is the combination of 1 and 2, as it can lead to failures when show_profiles=true even when everything works as expected and the coordinator discards the profile between close and get_runtime_profile.
[jira] [Updated] (IMPALA-13371) Avoid throwing exceptions in FileSystemUtil::FindFileInPath()
[ https://issues.apache.org/jira/browse/IMPALA-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13371: - Summary: Avoid throwing exceptions in FileSystemUtil::FindFileInPath() (was: Avoid throwing exceptions in FileSystemUtil) > Avoid throwing exceptions in FileSystemUtil::FindFileInPath() > - > > Key: IMPALA-13371 > URL: https://issues.apache.org/jira/browse/IMPALA-13371 > Project: IMPALA > Issue Type: Improvement > Components: Backend > Reporter: Csaba Ringhofer > Assignee: Csaba Ringhofer > Priority: Critical
[jira] [Assigned] (IMPALA-13371) Avoid throwing exceptions in FileSystemUtil
[ https://issues.apache.org/jira/browse/IMPALA-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer reassigned IMPALA-13371: Assignee: Csaba Ringhofer > Avoid throwing exceptions in FileSystemUtil > --- > > Key: IMPALA-13371 > URL: https://issues.apache.org/jira/browse/IMPALA-13371 > Project: IMPALA > Issue Type: Improvement > Components: Backend > Reporter: Csaba Ringhofer > Assignee: Csaba Ringhofer > Priority: Critical
[jira] [Resolved] (IMPALA-11431) TestComputeStatsWithNestedTypes.test_compute_stats_with_structs fails in an exhaustive build
[ https://issues.apache.org/jira/browse/IMPALA-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-11431. -- Fix Version/s: Impala 4.5.0 Resolution: Fixed > TestComputeStatsWithNestedTypes.test_compute_stats_with_structs fails in an > exhaustive build > > > Key: IMPALA-11431 > URL: https://issues.apache.org/jira/browse/IMPALA-11431 > Project: IMPALA > Issue Type: Bug >Reporter: Daniel Becker >Assignee: Csaba Ringhofer >Priority: Blocker > Labels: broken-build > Fix For: Impala 4.5.0 > > > In one of the exhaustive builds, > query_test.test_nested_types.TestComputeStatsWithNestedTypes.test_compute_stats_with_structs > fails: > {code:java} > query_test/test_nested_types.py:252: in test_compute_stats_with_structs > self.run_test_case('QueryTest/compute-stats-with-structs', vector) > common/impala_test_suite.py:778: in run_test_case > self.__verify_results_and_errors(vector, test_section, result, use_db) > common/impala_test_suite.py:588: in __verify_results_and_errors > replace_filenames_with_placeholder) > common/test_result_verifier.py:469: in verify_raw_results > VERIFIER_MAP[verifier](expected, actual) > common/test_result_verifier.py:278: in verify_query_result_is_equal > assert expected_results == actual_results > E assert Comparing QueryTestResults (expected vs actual): > E > 'alltypes','STRUCT',-1,-1,-1,-1.0,-1,-1 > == > 'alltypes','STRUCT',-1,-1,-1,-1,-1,-1 > E 'id','INT',6,0,4,4.0,-1,-1 != 'id','INT',-1,-1,4,4,-1,-1 > E 'small_struct','STRUCT',-1,-1,-1,-1.0,-1,-1 == > 'small_struct','STRUCT',-1,-1,-1,-1,-1,-1 > E 'str','STRING',6,0,11,10.330154,-1,-1 != > 'str','STRING',-1,-1,-1,-1,-1,-1 > E 'tiny_struct','STRUCT',-1,-1,-1,-1.0,-1,-1 == > 'tiny_struct','STRUCT',-1,-1,-1,-1,-1,-1 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13356) Avoid unnecessarily updating column stats
Csaba Ringhofer created IMPALA-13356: Summary: Avoid unnecessarily updating column stats Key: IMPALA-13356 URL: https://issues.apache.org/jira/browse/IMPALA-13356 Project: IMPALA Issue Type: Improvement Components: Catalog Reporter: Csaba Ringhofer

Currently column stats are reloaded every time the schema is reloaded: https://github.com/apache/impala/blob/48ee4276be1eb278fb628a4813728134a4910b1f/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L1304 This includes the common scenario of processing alter table events.

Since HIVE-22046 introduced the engine field for column stats, it is unlikely that Impala's version of the column stats is modified by any other component. If there is another Impala catalogd connecting to the same cluster, then it should also update the table property impala.lastComputeStatsTime, so it is enough to update column stats when a non-self event is seen that modifies this property. Another case where it can be useful to reload stats is when the schema actually changes, for example when columns are added/removed/renamed.
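The proposed reload condition could be sketched as follows (a hypothetical helper for illustration, not catalogd code):

```python
def should_reload_column_stats(is_self_event, props_before, props_after,
                               cols_before, cols_after):
    # Reload when a non-self event changed impala.lastComputeStatsTime,
    # i.e. another catalogd ran COMPUTE STATS on the same table...
    key = "impala.lastComputeStatsTime"
    if not is_self_event and props_before.get(key) != props_after.get(key):
        return True
    # ...or when the column list itself changed (columns added/removed/renamed).
    return cols_before != cols_after
```

A plain schema reload triggered by an alter table event (same columns, same compute-stats timestamp, or a self-event) would then skip the stats fetch entirely.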
[jira] [Updated] (IMPALA-13346) query_test/test_iceberg.py / test_read_position_deletes is flaky
[ https://issues.apache.org/jira/browse/IMPALA-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13346: - Priority: Critical (was: Major) > query_test/test_iceberg.py / test_read_position_deletes is flaky > > > Key: IMPALA-13346 > URL: https://issues.apache.org/jira/browse/IMPALA-13346 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > > {code} > query_test/test_iceberg.py:1466: in test_read_position_deletes > self.run_test_case('QueryTest/iceberg-v2-read-position-deletes', vector) > common/impala_test_suite.py:772: in run_test_case > self.__verify_results_and_errors(vector, test_section, result, use_db) > common/impala_test_suite.py:606: in __verify_results_and_errors > replace_filenames_with_placeholder) common/test_result_verifier.py:503: in > verify_raw_results VERIFIER_MAP[verifier](expected, actual) > common/test_result_verifier.py:296: in verify_query_result_is_equal assert > expected_results == actual_results E assert Comparing QueryTestResults > (expected vs actual): E 3,2,'3.21KB','NOT CACHED','NOT > CACHED','PARQUET','false','hdfs://localhost:20500/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_v2_positional_delete_all_rows','NONE' > != 0,2,'3.21KB','NOT CACHED','NOT > CACHED','PARQUET','false','hdfs://localhost:20500/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_v2_positional_delete_all_rows','NONE' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13346) query_test/test_iceberg.py / test_read_position_deletes is flaky
Csaba Ringhofer created IMPALA-13346: Summary: query_test/test_iceberg.py / test_read_position_deletes is flaky Key: IMPALA-13346 URL: https://issues.apache.org/jira/browse/IMPALA-13346 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer {code} query_test/test_iceberg.py:1466: in test_read_position_deletes self.run_test_case('QueryTest/iceberg-v2-read-position-deletes', vector) common/impala_test_suite.py:772: in run_test_case self.__verify_results_and_errors(vector, test_section, result, use_db) common/impala_test_suite.py:606: in __verify_results_and_errors replace_filenames_with_placeholder) common/test_result_verifier.py:503: in verify_raw_results VERIFIER_MAP[verifier](expected, actual) common/test_result_verifier.py:296: in verify_query_result_is_equal assert expected_results == actual_results E assert Comparing QueryTestResults (expected vs actual): E 3,2,'3.21KB','NOT CACHED','NOT CACHED','PARQUET','false','hdfs://localhost:20500/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_v2_positional_delete_all_rows','NONE' != 0,2,'3.21KB','NOT CACHED','NOT CACHED','PARQUET','false','hdfs://localhost:20500/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_v2_positional_delete_all_rows','NONE' {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Work started] (IMPALA-11431) TestComputeStatsWithNestedTypes.test_compute_stats_with_structs fails in an exhaustive build
[ https://issues.apache.org/jira/browse/IMPALA-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-11431 started by Csaba Ringhofer. > TestComputeStatsWithNestedTypes.test_compute_stats_with_structs fails in an > exhaustive build > > > Key: IMPALA-11431 > URL: https://issues.apache.org/jira/browse/IMPALA-11431 > Project: IMPALA > Issue Type: Bug > Reporter: Daniel Becker > Assignee: Csaba Ringhofer > Priority: Blocker > Labels: broken-build
[jira] [Assigned] (IMPALA-11431) TestComputeStatsWithNestedTypes.test_compute_stats_with_structs fails in an exhaustive build
[ https://issues.apache.org/jira/browse/IMPALA-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer reassigned IMPALA-11431: Assignee: Csaba Ringhofer (was: Daniel Becker) > TestComputeStatsWithNestedTypes.test_compute_stats_with_structs fails in an > exhaustive build > > > Key: IMPALA-11431 > URL: https://issues.apache.org/jira/browse/IMPALA-11431 > Project: IMPALA > Issue Type: Bug > Reporter: Daniel Becker > Assignee: Csaba Ringhofer > Priority: Blocker > Labels: broken-build
[jira] [Commented] (IMPALA-11431) TestComputeStatsWithNestedTypes.test_compute_stats_with_structs fails in an exhaustive build
[ https://issues.apache.org/jira/browse/IMPALA-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878309#comment-17878309 ] Csaba Ringhofer commented on IMPALA-11431: -- It seems that besides row__id, another problematic column is auto_incrementing_id in Kudu tables with a non-unique primary key. I still don't understand why the error is sporadic in HMS. So far I have only seen the errors in exhaustive tests and not in core tests. > TestComputeStatsWithNestedTypes.test_compute_stats_with_structs fails in an > exhaustive build > > > Key: IMPALA-11431 > URL: https://issues.apache.org/jira/browse/IMPALA-11431 > Project: IMPALA > Issue Type: Bug > Reporter: Daniel Becker > Assignee: Daniel Becker > Priority: Blocker > Labels: broken-build
[jira] [Comment Edited] (IMPALA-11431) TestComputeStatsWithNestedTypes.test_compute_stats_with_structs fails in an exhaustive build
[ https://issues.apache.org/jira/browse/IMPALA-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878057#comment-17878057 ] Csaba Ringhofer edited comment on IMPALA-11431 at 8/31/24 6:45 AM: --- I see 2 issues here: 1. test_compute_stats_with_structs itself is problematic, as it has side effects. It calls COMPUTE STATS for functional_*.complextypes_structs and complextypes_nested_structs, which modifies the state of tables read by many tests. While this is not the cause of the issue, it is something to clean up. 2. HMS throws an exception in some cases with "Column row__id doesn't exist in table". row__id is related to full ACID tables (when the test runs for ORC, the table is full ACID). This leads to keeping the old empty stats and failing the test. I saw 140 errors like this during the exhaustive test run, so this also affects some other tables. It is not clear why this is sporadic, though. Impala adds a "synthetic" row__id column to full ACID tables, so these columns don't come from HMS and we should not try to read their statistics. 
The full exception in HMS: {code:java} 2024-08-29T02:35:24,264 ERROR [TThreadPoolServer WorkerProcess-142] metastore.ObjectStore: Error retrieving statistics via jdo org.apache.hadoop.hive.metastore.api.MetaException: Column row__id doesn't exist in table complextypes_structs in database functional_orc_def at org.apache.hadoop.hive.metastore.ObjectStore.validateTableCols(ObjectStore.java:10342) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at org.apache.hadoop.hive.metastore.ObjectStore.getMTableColumnStatistics(ObjectStore.java:10277) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at org.apache.hadoop.hive.metastore.ObjectStore.access$3100(ObjectStore.java:295) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at org.apache.hadoop.hive.metastore.ObjectStore$20.getJdoResult(ObjectStore.java:10434) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at org.apache.hadoop.hive.metastore.ObjectStore$20.getJdoResult(ObjectStore.java:10426) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:4345) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at org.apache.hadoop.hive.metastore.ObjectStore.getTableColumnStatisticsInternal(ObjectStore.java:10454) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at org.apache.hadoop.hive.metastore.ObjectStore.getTableColumnStatistics(ObjectStore.java:10412) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source) ~[?:?] 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_422] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_422] at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at com.sun.proxy.$Proxy33.getTableColumnStatistics(Unknown Source) ~[?:?] at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table_statistics_req(HiveMetaStore.java:7186) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) ~[?:?] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_422] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_422] at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:160) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:121) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at com.sun.proxy.$Proxy34.get_table_statistics_req(Unknown Source) ~[?:?] at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_table_statistics_req.getResult(ThriftHiveMetastore.java:22613) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_table_statistics_req.getResult(ThriftHiveMetastore.java:22592) ~[hive-standalone-metastore-3.1.3000.2024.0.19.0-170.jar:3.1.3000.2024.0.19.0-170] at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.16.0.jar:0.16.0] at org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.j
[jira] [Created] (IMPALA-13331) Add metrics to CatalogdTableInvalidator
Csaba Ringhofer created IMPALA-13331: Summary: Add metrics to CatalogdTableInvalidator Key: IMPALA-13331 URL: https://issues.apache.org/jira/browse/IMPALA-13331 Project: IMPALA Issue Type: Improvement Components: Catalog Reporter: Csaba Ringhofer CatalogdTableInvalidator only logs when it invalidates a table, but there are no metrics to track the number of invalidations. This makes it hard to get a picture of the effects of invalidate_tables_timeout_s / invalidate_tables_gc_old_gen_full_threshold. A few examples of useful metrics: - number of times invalidateSome() was called - number of table invalidations due to mem pressure - number of table invalidations due to ttl - time spent in invalidateSome() - time spent in invalidateOlderThan() For the number-of-invalidations and time-spent metrics, having separate sum/max/avg metrics would probably be nice.
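The kind of counters the ticket proposes can be sketched as follows (a minimal illustration in Python; the class and field names are hypothetical, not Impala's actual Java metric names):

```python
class InvalidatorMetrics:
    """Hypothetical counters mirroring the metrics proposed in IMPALA-13331."""

    def __init__(self):
        self.invalidate_some_calls = 0        # times invalidateSome() ran
        self.invalidated_mem_pressure = 0     # tables dropped due to GC pressure
        self.invalidated_ttl = 0              # tables dropped due to ttl expiry
        self.time_in_invalidate_some_ns = 0   # cumulative time in invalidateSome()

    def record_invalidate_some(self, mem_pressure_count, ttl_count, elapsed_ns):
        # One call to this method corresponds to one invalidateSome() pass.
        self.invalidate_some_calls += 1
        self.invalidated_mem_pressure += mem_pressure_count
        self.invalidated_ttl += ttl_count
        self.time_in_invalidate_some_ns += elapsed_ns


m = InvalidatorMetrics()
# Simulate one pass that invalidated 2 tables for memory pressure and 5 for ttl.
m.record_invalidate_some(mem_pressure_count=2, ttl_count=5, elapsed_ns=1_200_000)
```

From counters like these, the sum/max/avg variants mentioned above could be derived by keeping a max and a count alongside each cumulative sum.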
[jira] [Resolved] (IMPALA-13246) Smallify strings during broadcast exchange
[ https://issues.apache.org/jira/browse/IMPALA-13246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-13246. -- Fix Version/s: Impala 4.5.0 Resolution: Done > Smallify strings during broadcast exchange > -- > > Key: IMPALA-13246 > URL: https://issues.apache.org/jira/browse/IMPALA-13246 > Project: IMPALA > Issue Type: Sub-task > Components: Backend >Reporter: Csaba Ringhofer >Assignee: Csaba Ringhofer >Priority: Major > Fix For: Impala 4.5.0
[jira] [Resolved] (IMPALA-13293) LocalCatalog's waitForCatalogUpdate() sleeps too much
[ https://issues.apache.org/jira/browse/IMPALA-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-13293. -- Fix Version/s: Impala 4.5.0 Resolution: Fixed > LocalCatalog's waitForCatalogUpdate() sleeps too much > - > > Key: IMPALA-13293 > URL: https://issues.apache.org/jira/browse/IMPALA-13293 > Project: IMPALA > Issue Type: Improvement > Components: Catalog, Frontend >Reporter: Csaba Ringhofer >Priority: Major > Fix For: Impala 4.5.0 > > > Unlike ImpaladCatalog's waitForCatalogUpdate(), the LocalCatalog version > doesn't use a condition variable and simply waits for timeoutMs. The > timeout comes from MAX_CATALOG_UPDATE_WAIT_TIME_MS, which is 2 seconds. This > means the function will wait 2 seconds even if the catalog update arrived > in the meantime. These 2 seconds are often nearly completely added to the > Impala cluster startup time. > The sleep was added in > https://gerrit.cloudera.org/#/c/11472/3/fe/src/main/java/org/apache/impala/catalog/local/LocalCatalog.java > Update: realized that this also doesn't work well for ImpaladCatalog - the > FeCatalogManager creates a new ImpaladCatalog when a full topic update > arrives, so the Object that waitForCatalogUpdate() waits for is never > notified. My impression is that it was broken a long time ago, even before > LocalCatalog was added.
[jira] [Created] (IMPALA-13306) Store resources attached to row batches per-tuple descriptor
Csaba Ringhofer created IMPALA-13306: Summary: Store resources attached to row batches per-tuple descriptor Key: IMPALA-13306 URL: https://issues.apache.org/jira/browse/IMPALA-13306 Project: IMPALA Issue Type: Improvement Components: Backend Reporter: Csaba Ringhofer Currently RowBatch handles resource-related info (e.g. FlushMode) globally, while it may be different for each tuple descriptor. An example is a row that comes from a join that didn't spill. In this case the memory of the build side tuple remains valid until the join node is closed, while the probe side can change more often, e.g. when the scratch batch in the Parquet scanner gets full and is attached to the row batch. Some operators could benefit from knowing that some tuple pointers remain valid for longer. An example is tuple deduplication in KrpcDataStreamSender - if more than one row batch could be sent in a single OutboundRowBatch, it would be important to know whether the same tuple pointer really means the same tuple in the new RowBatch.
[jira] [Updated] (IMPALA-13293) LocalCatalog's waitForCatalogUpdate() sleeps too much
[ https://issues.apache.org/jira/browse/IMPALA-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13293: - Description: Unlike ImpaladCatalog's waitForCatalogUpdate(), the LocalCatalog version doesn't use a condition variable and simply waits for timeoutMs. The timeout comes from MAX_CATALOG_UPDATE_WAIT_TIME_MS, which is 2 seconds. This means the function will wait 2 seconds even if the catalog update arrived in the meantime. These 2 seconds are often nearly completely added to the Impala cluster startup time. The sleep was added in https://gerrit.cloudera.org/#/c/11472/3/fe/src/main/java/org/apache/impala/catalog/local/LocalCatalog.java Update: realized that this also doesn't work well for ImpaladCatalog - the FeCatalogManager creates a new ImpaladCatalog when a full topic update arrives, so the Object that waitForCatalogUpdate() waits for is never notified. My impression is that it was broken a long time ago, even before LocalCatalog was added. was: Unlike ImpaladCatalog's waitForCatalogUpdate(), the LocalCatalog version doesn't use a condition variable and simply waits for timeoutMs. The timeout comes from MAX_CATALOG_UPDATE_WAIT_TIME_MS, which is 2 seconds. This means the function will wait 2 seconds even if the catalog update arrived in the meantime. These 2 seconds are often nearly completely added to the Impala cluster startup time. The sleep was added in https://gerrit.cloudera.org/#/c/11472/3/fe/src/main/java/org/apache/impala/catalog/local/LocalCatalog.java > LocalCatalog's waitForCatalogUpdate() sleeps too much > - > > Key: IMPALA-13293 > URL: https://issues.apache.org/jira/browse/IMPALA-13293 > Project: IMPALA > Issue Type: Improvement > Components: Catalog, Frontend >Reporter: Csaba Ringhofer >Priority: Major > > Unlike ImpaladCatalog's waitForCatalogUpdate(), the LocalCatalog version > doesn't use a condition variable and simply waits for timeoutMs. 
The > timeout comes from MAX_CATALOG_UPDATE_WAIT_TIME_MS, which is 2 seconds. This > means the function will wait 2 seconds even if the catalog update arrived > in the meantime. These 2 seconds are often nearly completely added to the > Impala cluster startup time. > The sleep was added in > https://gerrit.cloudera.org/#/c/11472/3/fe/src/main/java/org/apache/impala/catalog/local/LocalCatalog.java > Update: realized that this also doesn't work well for ImpaladCatalog - the > FeCatalogManager creates a new ImpaladCatalog when a full topic update > arrives, so the Object that waitForCatalogUpdate() waits for is never > notified. My impression is that it was broken a long time ago, even before > LocalCatalog was added.
[jira] [Created] (IMPALA-13293) LocalCatalog's waitForCatalogUpdate() sleeps too much
Csaba Ringhofer created IMPALA-13293: Summary: LocalCatalog's waitForCatalogUpdate() sleeps too much Key: IMPALA-13293 URL: https://issues.apache.org/jira/browse/IMPALA-13293 Project: IMPALA Issue Type: Improvement Components: Catalog, Frontend Reporter: Csaba Ringhofer Unlike ImpaladCatalog's waitForCatalogUpdate(), the LocalCatalog version doesn't use a condition variable and simply waits for timeoutMs. The timeout comes from MAX_CATALOG_UPDATE_WAIT_TIME_MS, which is 2 seconds. This means the function will wait 2 seconds even if the catalog update arrived in the meantime. These 2 seconds are often nearly completely added to the Impala cluster startup time. The sleep was added in https://gerrit.cloudera.org/#/c/11472/3/fe/src/main/java/org/apache/impala/catalog/local/LocalCatalog.java
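The difference between a fixed sleep and a condition-variable wait can be sketched as below (Python, hypothetical names; Impala's frontend is Java, where the analogous primitive is Object.wait/notify or a Condition). A waiter wakes as soon as the version it needs arrives, instead of always paying the full timeout:

```python
import threading
import time

class CatalogVersionWaiter:
    """Wake waiters as soon as the catalog version advances, rather than
    sleeping the full timeout as described in the ticket."""

    def __init__(self):
        self._cond = threading.Condition()
        self._version = 0

    def update_version(self, version):
        # Called when a catalog update arrives; wakes all waiters.
        with self._cond:
            self._version = version
            self._cond.notify_all()

    def wait_for_version(self, min_version, timeout_s):
        # Blocks at most timeout_s, but returns immediately once the
        # version reaches min_version.
        with self._cond:
            self._cond.wait_for(lambda: self._version >= min_version,
                                timeout=timeout_s)
            return self._version


waiter = CatalogVersionWaiter()
# Simulate a catalog update arriving after 50 ms - far sooner than the timeout.
threading.Timer(0.05, waiter.update_version, args=[7]).start()
start = time.monotonic()
seen = waiter.wait_for_version(7, timeout_s=2.0)
elapsed = time.monotonic() - start  # well under the 2 s timeout
```

With a plain sleep, `elapsed` would always be the full 2 seconds; with the condition variable it is roughly the 50 ms it took for the update to arrive.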
[jira] [Created] (IMPALA-13292) Decrease statestore_update_frequency_ms in development environment
Csaba Ringhofer created IMPALA-13292: Summary: Decrease statestore_update_frequency_ms in development environment Key: IMPALA-13292 URL: https://issues.apache.org/jira/browse/IMPALA-13292 Project: IMPALA Issue Type: Improvement Components: Infrastructure Reporter: Csaba Ringhofer The current default of 2s for statestore_update_frequency_ms adds significant delay to a lot of operations. While decreasing it in production environments sounds risky, doing so in the development environment could make it faster. Decreasing statestore_update_frequency_ms from 2s to 0.5s reduced cluster startup time by 1-2 seconds, which could make custom cluster tests faster. It also significantly speeds up tests that create / drop metastore objects: impala-py.test -x tests/metadata/test_ddl.py -k test_create_table_as_select with default settings: first run: 32s -> 16s other runs: 16s -> 8s The effect is less drastic on the first run with catalog_topic_mode=minimal: first run: 18s -> 12s other runs: 16s -> 8s
[jira] [Created] (IMPALA-13272) Analytic function of collections can lead to crash
Csaba Ringhofer created IMPALA-13272: Summary: Analytic function of collections can lead to crash Key: IMPALA-13272 URL: https://issues.apache.org/jira/browse/IMPALA-13272 Project: IMPALA Issue Type: Improvement Reporter: Csaba Ringhofer Using Impala's test data the following query leads to a DCHECK in debug builds and may cause more subtle issues in RELEASE builds: {code} select row_no from ( select arr.small, row_number() over ( order by arr.inner_struct1.str) as row_no from functional_parquet.collection_struct_mix t, t.arr_contains_nested_struct arr ) res {code} The following DCHECK is hit: {code} tuple.h:296 Check failed: offset != -1 {code} The problem seems to be with arr.small, which is referenced in the inline view but not used in the outer query - removing it from the inline view or adding it to the outer select avoids the bug. The problem seems related to materialization - offset==-1 means that the slot is not materialized, but the Parquet scanner still tries to materialize it. It is not clear yet which commit introduced the bug or whether this is a bug in the planner or the backend.
[jira] [Updated] (IMPALA-13272) Analytic function of collections can lead to crash
[ https://issues.apache.org/jira/browse/IMPALA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13272: - Priority: Blocker (was: Major) > Analytic function of collections can lead to crash > --- > > Key: IMPALA-13272 > URL: https://issues.apache.org/jira/browse/IMPALA-13272 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Blocker > > Using Impala's test data the following query leads to a DCHECK in debug builds > and may cause more subtle issues in RELEASE builds: > {code} > select > row_no > from ( > select >arr.small, >row_number() over ( > order by arr.inner_struct1.str) as row_no > from functional_parquet.collection_struct_mix t, > t.arr_contains_nested_struct arr >) res > {code} > The following DCHECK is hit: > {code} > tuple.h:296 Check failed: offset != -1 > {code} > The problem seems to be with arr.small, which is referenced in the inline > view but not used in the outer query - removing it from the inline view or > adding it to the outer select avoids the bug. The problem seems > related to materialization - offset==-1 means that the slot is not > materialized, but the Parquet scanner still tries to materialize it. > It is not clear yet which commit introduced the bug or whether this is a bug > in the planner or the backend.
[jira] [Created] (IMPALA-13269) Limit Kudu scan length if not all filters arrived
Csaba Ringhofer created IMPALA-13269: Summary: Limit Kudu scan length if not all filters arrived Key: IMPALA-13269 URL: https://issues.apache.org/jira/browse/IMPALA-13269 Project: IMPALA Issue Type: Improvement Components: Backend Reporter: Csaba Ringhofer TARGETED_KUDU_SCAN_RANGE_LENGTH can be used as a hint for Kudu to limit the size of scan ranges. As Impala can pick up late runtime filters when a new scan range starts, it can be useful to start with smaller scan ranges while some runtime filters have not arrived yet, as doing the whole scan without runtime filters can make it much less efficient.
[jira] [Assigned] (IMPALA-7086) Cache timezone name look ups
[ https://issues.apache.org/jira/browse/IMPALA-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer reassigned IMPALA-7086: --- Assignee: Mihaly Szjatinya > Cache timezone name look ups > > > Key: IMPALA-7086 > URL: https://issues.apache.org/jira/browse/IMPALA-7086 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Csaba Ringhofer >Assignee: Mihaly Szjatinya >Priority: Major > Labels: performance, ramp-up, timestamp > > to/from_utc_timestamp looks up time zones by name during every invocation, > even if the timezone parameter is constant. Avoiding this lookup if the time zone > name is the same as during the last call (in the fragment) could speed up > time zone conversions.
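The caching idea in the issue description can be sketched like this (Python for brevity; the names and the stand-in lookup function are hypothetical, not Impala's backend API). The wrapper remembers the most recent name/timezone pair, so repeated calls with a constant timezone name skip the lookup entirely:

```python
class CachedTzLookup:
    """Remembers the most recent timezone lookup, mimicking the per-fragment
    cache proposed in IMPALA-7086 for to/from_utc_timestamp."""

    def __init__(self, lookup_fn):
        self._lookup = lookup_fn   # the expensive name -> timezone lookup
        self._last_name = None
        self._last_tz = None
        self.real_lookups = 0      # instrumentation for the demo

    def find(self, name):
        # Only hit the real lookup when the name changes between calls.
        if name != self._last_name:
            self._last_tz = self._lookup(name)
            self._last_name = name
            self.real_lookups += 1
        return self._last_tz


# Stand-in for the real timezone database lookup.
cache = CachedTzLookup(lambda name: "tzinfo:" + name)
for _ in range(1000):
    tz = cache.find("America/Los_Angeles")  # constant-parameter case
```

For 1000 invocations with a constant timezone name, only the first call performs a real lookup; a single cached entry is enough because the common case is a constant timezone expression within one fragment.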
[jira] [Resolved] (IMPALA-10536) saml2_callback_token_ttl's unit is seconds instead of milliseconds
[ https://issues.apache.org/jira/browse/IMPALA-10536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-10536. -- Resolution: Fixed > saml2_callback_token_ttl's unit is seconds instead of milliseconds > -- > > Key: IMPALA-10536 > URL: https://issues.apache.org/jira/browse/IMPALA-10536 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > > The description of saml2_callback_token_ttl writes "seconds", while its value > is interpreted as milliseconds, so the default of 30 is way too short outside > automated tests. > I think that keeping the semantics and just rewriting the desc to > milliseconds is better than fixing the semantics, because the very low ttl is > actually useful for automated tests that test expiration logic.
[jira] [Updated] (IMPALA-7086) Cache timezone name look ups
[ https://issues.apache.org/jira/browse/IMPALA-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-7086: Labels: performance ramp-up timestamp (was: performance timestamp) > Cache timezone name look ups > > > Key: IMPALA-7086 > URL: https://issues.apache.org/jira/browse/IMPALA-7086 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Csaba Ringhofer >Priority: Major > Labels: performance, ramp-up, timestamp > > to/from_utc_timestamp looks up time zones by name during every invocation, > even if the timezone parameter is constant. Avoiding this lookup if the time zone > name is the same as during the last call (in the fragment) could speed up > time zone conversions.
[jira] [Created] (IMPALA-13261) Consider the effect of NULL keys when choosing BROADCAST vs SHUFFLE join
Csaba Ringhofer created IMPALA-13261: Summary: Consider the effect of NULL keys when choosing BROADCAST vs SHUFFLE join Key: IMPALA-13261 URL: https://issues.apache.org/jira/browse/IMPALA-13261 Project: IMPALA Issue Type: Improvement Components: Frontend Reporter: Csaba Ringhofer Currently NULL keys are hashed to a single value and sent to a single fragment instance in partitioned joins. This can cause data skew if the number of NULL keys is large. The planner could give preference to BROADCAST in LEFT OUTER JOIN when the number of NULLs is large on the probe side. Another potential solution for the same problem is IMPALA-13260 - it is about sending rows with NULL keys to local fragment instances in this situation.
[jira] [Updated] (IMPALA-13260) Exchange on the probe side of outer joins could send NULL keys to local target
[ https://issues.apache.org/jira/browse/IMPALA-13260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13260: - Description: Currently NULL keys are hashed to a single value and sent to a single fragment instance in partitioned joins. This can cause data skew if the number of NULL keys is large. If a NULL key guarantees that no row is matched on the build side, then columns from the build side will be all NULL and it doesn't matter which fragment instance processes the row. Always sending rows with a NULL key to a local fragment instance would both reduce data skew and make the shuffle cheaper (no compression/network). If mt_dop>0, then to completely avoid data skew these rows would need to be spread evenly among the local fragment instances. One caveat is that sending NULL keys locally would "weaken" the partitioning of the fragment, so it is no longer "partitioned by col", but "partitioned by col (with the exception of NULLs)". For example, if the outer join is followed by a grouping aggregation that uses the same key, then a shuffle is still needed as the aggregation needs all NULL keys in the same fragment instance. was: Currently NULL keys are hashed to a single value and sent to a single fragment instance in partitioned joins. This can cause data skew if the number of NULL keys is large. If a NULL key guarantees that no row is matched on the build side, then columns from the build side will be all NULL and it doesn't matter which fragment instance processes the row. Always sending rows with a NULL key to a local fragment instance would both reduce data skew and make the shuffle cheaper (no compression/network). If mt_dop>0, then to completely avoid data skew these rows would need to be spread evenly among the local fragment instances. 
> Exchange on the probe side of outer joins could send NULL keys to local target > -- > > Key: IMPALA-13260 > URL: https://issues.apache.org/jira/browse/IMPALA-13260 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Csaba Ringhofer >Priority: Major > > Currently NULL keys are hashed to a single value and sent to a single > fragment instance in partitioned joins. This can cause data skew if the > number of NULL keys is large. > If a NULL key guarantees that no row is matched on the build side, then > columns from the build side will be all NULL and it doesn't matter which fragment > instance processes the row. > Always sending rows with a NULL key to a local fragment instance would both > reduce data skew and make the shuffle cheaper (no compression/network). If > mt_dop>0, then to completely avoid data skew these rows would need to be spread > evenly among the local fragment instances. > One caveat is that sending NULL keys locally would "weaken" the partitioning > of the fragment, so it is no longer "partitioned by col", but "partitioned by > col (with the exception of NULLs)". For example, if the outer join is followed > by a grouping aggregation that uses the same key, then a shuffle is still > needed as the aggregation needs all NULL keys in the same fragment instance.
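Why hashing NULLs to a single value skews partitioned joins can be shown with a toy hash-partitioning loop (an illustration only, not Impala code; the sentinel and partition count are made up). With a NULL-heavy probe side, one fragment instance receives almost all of the rows:

```python
# Toy model of hash-partitioning rows of a partitioned join by their key.
NUM_PARTITIONS = 4
NULL_SENTINEL = "<null>"  # stand-in: all NULL keys collapse to one hash value

def partition_of(key):
    if key is None:
        key = NULL_SENTINEL  # every NULL key maps to the same partition
    return hash(key) % NUM_PARTITIONS

# A probe side where 96 of 100 rows have a NULL join key.
keys = [1, 2, 3, 4] + [None] * 96
counts = [0] * NUM_PARTITIONS
for k in keys:
    counts[partition_of(k)] += 1

null_partition = partition_of(None)
# counts[null_partition] now holds at least the 96 NULL rows - heavy skew,
# while the other partitions get at most a handful of rows each.
```

Sending the NULL-key rows to a local fragment instance instead, as the ticket proposes, sidesteps both the skew and the network cost, at the price of the weaker partitioning guarantee described in the caveat above.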
[jira] [Created] (IMPALA-13260) Exchange on the probe side of outer joins could send NULL keys to local target
Csaba Ringhofer created IMPALA-13260: Summary: Exchange on the probe side of outer joins could send NULL keys to local target Key: IMPALA-13260 URL: https://issues.apache.org/jira/browse/IMPALA-13260 Project: IMPALA Issue Type: Improvement Components: Backend Reporter: Csaba Ringhofer Currently NULL keys are hashed to a single value and sent to a single fragment instance in partitioned joins. This can cause data skew if the number of NULL keys is large. If a NULL key guarantees that no row is matched on the build side, then columns from the build side will be all NULL and it doesn't matter which fragment instance processes the row. Always sending rows with a NULL key to a local fragment instance would both reduce data skew and make the shuffle cheaper (no compression/network). If mt_dop>0, then to completely avoid data skew these rows would need to be spread evenly among the local fragment instances.
[jira] [Work started] (IMPALA-13246) Smallify strings during broadcast exchange
[ https://issues.apache.org/jira/browse/IMPALA-13246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-13246 started by Csaba Ringhofer. > Smallify strings during broadcast exchange > -- > > Key: IMPALA-13246 > URL: https://issues.apache.org/jira/browse/IMPALA-13246 > Project: IMPALA > Issue Type: Sub-task > Components: Backend >Reporter: Csaba Ringhofer >Assignee: Csaba Ringhofer >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13246) Smallify strings during broadcast exchange
Csaba Ringhofer created IMPALA-13246: Summary: Smallify strings during broadcast exchange Key: IMPALA-13246 URL: https://issues.apache.org/jira/browse/IMPALA-13246 Project: IMPALA Issue Type: Sub-task Components: Backend Reporter: Csaba Ringhofer -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-13209) ExchangeNode's ConvertRowBatchTime can be high
[ https://issues.apache.org/jira/browse/IMPALA-13209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-13209. -- Fix Version/s: Impala 4.5.0 Resolution: Fixed > ExchangeNode's ConvertRowBatchTime can be high > -- > > Key: IMPALA-13209 > URL: https://issues.apache.org/jira/browse/IMPALA-13209 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Csaba Ringhofer >Assignee: Csaba Ringhofer >Priority: Major > Labels: performance > Fix For: Impala 4.5.0 > > > ConvertRowBatchTime can be surprisingly high - the only thing done during > this timer is copying tuple pointers from one RowBatch to another. > https://github.com/apache/impala/blob/c53987480726b114e0c3537c71297df2834a4962/be/src/exec/exchange-node.cc#L217 > {code} > set mt_dop=8; > select straight_join count(*) from tpcds_parquet.store_sales s1 join > /*+broadcast*/ tpcds_parquet.store_sales16 s2 on s1.ss_customer_sk = > s2.ss_customer_sk; > ConvertRowBatchTime dominates the busy exchange node's exec time in the > profile: >- ConvertRowBatchTime: 640.072ms >- InactiveTotalTime: 243.783ms >- PeakMemoryUsage: 12.53 MB (13142368) >- RowsReturned: 46.09M (46086464) >- RowsReturnedRate: 46.93 M/sec >- TotalTime: 981.968ms > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-13225) Tuple deduplication does not work in partitioned exchanges
[ https://issues.apache.org/jira/browse/IMPALA-13225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13225: - Labels: performance (was: ) > Tuple deduplication does not work in partitioned exchanges > -- > > Key: IMPALA-13225 > URL: https://issues.apache.org/jira/browse/IMPALA-13225 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Csaba Ringhofer >Priority: Major > Labels: performance > > RowBatch::Serialize() has a deduplication logic that detects duplicate tuples > (usually the result of joins) based on tuple pointers. This doesn't work in > partitioned exchanges because all rows are deep copied one-by-one when > collecting rows for a given channel, so all tuple pointers will be distinct: > https://github.com/apache/impala/blob/d83b48cf72fa94ec7f6e55da409b4dff3350543b/be/src/runtime/krpc-data-stream-sender.cc#L645 > The deduplication was added a long time ago (doesn't have a Jira): > https://gerrit.cloudera.org/#/c/573/ > I am not sure if it ever worked in the partitioned case (it should work > though in broadcast exchanges). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13225) Tuple deduplication does not work in partitioned exchanges
Csaba Ringhofer created IMPALA-13225: Summary: Tuple deduplication does not work in partitioned exchanges Key: IMPALA-13225 URL: https://issues.apache.org/jira/browse/IMPALA-13225 Project: IMPALA Issue Type: Improvement Components: Backend Reporter: Csaba Ringhofer RowBatch::Serialize() has deduplication logic that detects duplicate tuples (usually the result of joins) based on tuple pointers. This doesn't work in partitioned exchanges because all rows are deep-copied one by one when collecting rows for a given channel, so all tuple pointers will be distinct: https://github.com/apache/impala/blob/d83b48cf72fa94ec7f6e55da409b4dff3350543b/be/src/runtime/krpc-data-stream-sender.cc#L645 The deduplication was added a long time ago (doesn't have a Jira): https://gerrit.cloudera.org/#/c/573/ I am not sure if it ever worked in the partitioned case (it should work in broadcast exchanges, though). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
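Why the per-row deep copy defeats the deduplication can be sketched in a few lines (illustrative Python; the real logic is C++ in RowBatch::Serialize() and compares tuple pointers, modeled here with object identity):

```python
# Sketch of pointer-identity deduplication; rows are lists standing in
# for tuples so that copies are guaranteed to be distinct objects.
def serialize_tuples(tuples):
    """Deduplicate by object identity, as the C++ code does by pointer."""
    seen, distinct, indexes = {}, [], []
    for t in tuples:
        if id(t) not in seen:
            seen[id(t)] = len(distinct)
            distinct.append(t)
        indexes.append(seen[id(t)])
    return distinct, indexes

row = ["a" * 100]                        # one wide tuple
shared = [row, row, row]                 # broadcast-style: same object thrice
deep_copied = [list(t) for t in shared]  # partitioned-style: per-row copies
distinct_shared, _ = serialize_tuples(shared)
distinct_copied, _ = serialize_tuples(deep_copied)
# distinct_shared holds 1 tuple; distinct_copied holds 3, so after the
# deep copy the dedup pass saves nothing even though values are equal.
```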
[jira] [Created] (IMPALA-13217) Create allocator in Impala with mem tracker in TLS
Csaba Ringhofer created IMPALA-13217: Summary: Create allocator in Impala with mem tracker in TLS Key: IMPALA-13217 URL: https://issues.apache.org/jira/browse/IMPALA-13217 Project: IMPALA Issue Type: Improvement Components: Backend Reporter: Csaba Ringhofer Some libraries allow setting an allocator only globally. In most cases in Impala a thread is only used in a very specific context - this allows saving the context to TLS and using it in a stateless allocator. For example: setMemTrackerForThread(mem_tracker_); vector<int> v; v.push_back(1); // Allocator can get mem_tracker_ from TLS -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
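The TLS idea can be sketched in Python (Impala would implement this in C++ with a custom std::allocator; `MemTracker`, `set_mem_tracker_for_thread` and `tracked_alloc` here are hypothetical illustrations, not Impala APIs):

```python
# Sketch: a per-thread memory tracker stashed in thread-local storage so
# that a "stateless" allocation hook can find it without any argument.
import threading

_tls = threading.local()

class MemTracker:
    def __init__(self):
        self.consumed = 0

def set_mem_tracker_for_thread(tracker):
    _tls.tracker = tracker               # saved once per thread

def tracked_alloc(nbytes):
    # Finds the tracker via TLS instead of taking it as a parameter,
    # which is all a globally-installed allocator could do.
    tracker = getattr(_tls, "tracker", None)
    if tracker is not None:
        tracker.consumed += nbytes
    return bytearray(nbytes)

tracker = MemTracker()
set_mem_tracker_for_thread(tracker)
buf = tracked_alloc(1024)                # charged to tracker via TLS
```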
[jira] [Created] (IMPALA-13216) Switch run_workload.py to use HS2 instead of Beeswax
Csaba Ringhofer created IMPALA-13216: Summary: Switch run_workload.py to use HS2 instead of Beeswax Key: IMPALA-13216 URL: https://issues.apache.org/jira/browse/IMPALA-13216 Project: IMPALA Issue Type: Improvement Components: Clients, Infrastructure Reporter: Csaba Ringhofer Currently the default is Beeswax, which leads to using Beeswax in perf tests. https://github.com/apache/impala/blob/c53987480726b114e0c3537c71297df2834a4962/bin/run-workload.py#L98 This could affect perf results/variance, because different clients use different sleep intervals when waiting for the query status to become finished: Beeswax uses 50ms: https://github.com/apache/impala/blob/c53987480726b114e0c3537c71297df2834a4962/tests/beeswax/impala_beeswax.py#L408 while HS2 would use a more complicated formula from Impyla, ranging from 10ms to 1s: https://github.com/apache/impala/blob/c53987480726b114e0c3537c71297df2834a4962/tests/performance/query_exec_functions.py#L122 https://github.com/cloudera/impyla/blob/acbd481dde28d85976dfc777f888b32ad6c8d721/impala/hiveserver2.py#L513 Making sleep times configurable in Impyla could help with this - it would make sense to use smaller sleeps than in real workloads to reduce variability. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
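The wait-loop shape described above can be approximated as bounded exponential backoff (a sketch only — Impyla's actual formula differs; `backoff_sleeps` and its defaults are illustrative):

```python
# Sketch: bounded exponential backoff between a 10ms floor and a 1s cap,
# the range mentioned for Impyla's query-status polling.
def backoff_sleeps(min_s=0.01, max_s=1.0, factor=2.0, n=8):
    """Return the first n sleep intervals, doubling up to the cap."""
    sleeps, s = [], min_s
    for _ in range(n):
        sleeps.append(s)
        s = min(s * factor, max_s)
    return sleeps
```

Exposing knobs like `min_s`/`max_s` is the kind of configurability the ticket suggests: perf tests could use a small cap to reduce run-to-run variance, while real workloads keep longer sleeps.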
[jira] [Work started] (IMPALA-13209) ExchangeNode's ConvertRowBatchTime can be high
[ https://issues.apache.org/jira/browse/IMPALA-13209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-13209 started by Csaba Ringhofer. > ExchangeNode's ConvertRowBatchTime can be high > -- > > Key: IMPALA-13209 > URL: https://issues.apache.org/jira/browse/IMPALA-13209 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Csaba Ringhofer >Assignee: Csaba Ringhofer >Priority: Major > Labels: performance > > ConvertRowBatchTime can be surprisingly high - the only thing done during > this timer is copying tuple pointers from one RowBatch to another. > https://github.com/apache/impala/blob/c53987480726b114e0c3537c71297df2834a4962/be/src/exec/exchange-node.cc#L217 > {code} > set mt_dop=8; > select straight_join count(*) from tpcds_parquet.store_sales s1 join > /*+broadcast*/ tpcds_parquet.store_sales16 s2 on s1.ss_customer_sk = > s2.ss_customer_sk; > ConvertRowBatchTime dominates the busy exchange node's exec time in the > profile: >- ConvertRowBatchTime: 640.072ms >- InactiveTotalTime: 243.783ms >- PeakMemoryUsage: 12.53 MB (13142368) >- RowsReturned: 46.09M (46086464) >- RowsReturnedRate: 46.93 M/sec >- TotalTime: 981.968ms > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Assigned] (IMPALA-13209) ExchangeNode's ConvertRowBatchTime can be high
[ https://issues.apache.org/jira/browse/IMPALA-13209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer reassigned IMPALA-13209: Assignee: Csaba Ringhofer > ExchangeNode's ConvertRowBatchTime can be high > -- > > Key: IMPALA-13209 > URL: https://issues.apache.org/jira/browse/IMPALA-13209 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Csaba Ringhofer >Assignee: Csaba Ringhofer >Priority: Major > Labels: performance > > ConvertRowBatchTime can be surprisingly high - the only thing done during > this timer is copying tuple pointers from one RowBatch to another. > https://github.com/apache/impala/blob/c53987480726b114e0c3537c71297df2834a4962/be/src/exec/exchange-node.cc#L217 > {code} > set mt_dop=8; > select straight_join count(*) from tpcds_parquet.store_sales s1 join > /*+broadcast*/ tpcds_parquet.store_sales16 s2 on s1.ss_customer_sk = > s2.ss_customer_sk; > ConvertRowBatchTime dominates the busy exchange node's exec time in the > profile: >- ConvertRowBatchTime: 640.072ms >- InactiveTotalTime: 243.783ms >- PeakMemoryUsage: 12.53 MB (13142368) >- RowsReturned: 46.09M (46086464) >- RowsReturnedRate: 46.93 M/sec >- TotalTime: 981.968ms > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13209) ExchangeNode's ConvertRowBatchTime can be high
Csaba Ringhofer created IMPALA-13209: Summary: ExchangeNode's ConvertRowBatchTime can be high Key: IMPALA-13209 URL: https://issues.apache.org/jira/browse/IMPALA-13209 Project: IMPALA Issue Type: Improvement Components: Backend Reporter: Csaba Ringhofer ConvertRowBatchTime can be surprisingly high - the only thing done during this timer is copying tuple pointers from one RowBatch to another. https://github.com/apache/impala/blob/c53987480726b114e0c3537c71297df2834a4962/be/src/exec/exchange-node.cc#L217 {code} set mt_dop=8; select straight_join count(*) from tpcds_parquet.store_sales s1 join /*+broadcast*/ tpcds_parquet.store_sales16 s2 on s1.ss_customer_sk = s2.ss_customer_sk; ConvertRowBatchTime dominates the busy exchange node's exec time in the profile: - ConvertRowBatchTime: 640.072ms - InactiveTotalTime: 243.783ms - PeakMemoryUsage: 12.53 MB (13142368) - RowsReturned: 46.09M (46086464) - RowsReturnedRate: 46.93 M/sec - TotalTime: 981.968ms {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
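What ConvertRowBatchTime covers can be sketched as follows (illustrative Python; the real code copies Tuple* pointers between RowBatches in C++). Only references are copied, never tuple data, yet at tens of millions of rows even this can dominate the node's busy time:

```python
# Sketch: "converting" a row batch copies each row's tuple-pointer array;
# the tuples themselves are shared, not duplicated.
def convert_row_batch(src_rows):
    return [list(row) for row in src_rows]  # shallow per-row copies

tuple_obj = ["ss_customer_sk", 42]          # stands in for one materialized tuple
src_batch = [[tuple_obj] for _ in range(4)]
out_batch = convert_row_batch(src_batch)
# out_batch has fresh pointer arrays, but every entry still points at
# the very same tuple object as the source batch.
```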
[jira] [Updated] (IMPALA-13193) RuntimeFilter on parquet dictionary should evaluate null values
[ https://issues.apache.org/jira/browse/IMPALA-13193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13193: - Labels: correctness (was: ) > RuntimeFilter on parquet dictionary should evaluate null values > --- > > Key: IMPALA-13193 > URL: https://issues.apache.org/jira/browse/IMPALA-13193 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 4.1.0, Impala 4.2.0, Impala 4.1.1, Impala 4.1.2, > Impala 4.3.0, Impala 4.4.0 >Reporter: Quanlong Huang >Assignee: Zhi Tang >Priority: Critical > Labels: correctness > > IMPALA-10910, IMPALA-5509 introduced an optimization to evaluate runtime > filters on parquet dictionary values. If none of the values can pass the check, > the whole row group will be skipped. However, NULL values are not included in > the parquet dictionary. Runtime filters that accept NULL values might > incorrectly reject the row group if none of the dictionary values can pass > the check. > Here are steps to reproduce the bug: > {code:sql} > create table parq_tbl (id bigint, name string) stored as parquet; > insert into parq_tbl values (0, "abc"), (1, NULL), (2, NULL), (3, "abc"); > create table dim_tbl (name string); > insert into dim_tbl values (NULL); > select * from parq_tbl p join dim_tbl d > on COALESCE(p.name, '') = COALESCE(d.name, '');{code} > The SELECT query should return 2 rows but now it returns 0 rows. > A workaround is to disable this optimization: > {code:sql} > set PARQUET_DICTIONARY_RUNTIME_FILTER_ENTRY_LIMIT=0;{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
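The missing NULL check can be sketched like this (illustrative Python, not Impala's implementation; `can_skip_row_group` is a hypothetical helper showing the fixed decision logic):

```python
# Sketch of fixed row-group pruning. Parquet dictionaries never contain
# NULL, so a filter that accepts NULL must not reject a row group based
# on the dictionary alone.
def can_skip_row_group(dictionary, has_nulls, filter_accepts):
    if any(filter_accepts(v) for v in dictionary):
        return False                     # some stored value passes the filter
    if has_nulls and filter_accepts(None):
        return False                     # the check missing in the bug
    return True

# A filter built from the join on COALESCE(name, ''): it accepts NULL rows.
accepts_null = lambda v: v is None or v == ""
```

With the buggy logic (only the dictionary scan), the `["abc"]` dictionary with NULL rows would be skipped and the join would return 0 rows, matching the reproduction above.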
[jira] [Updated] (IMPALA-10985) always_true hint is not needed if all predicates are on partitioning columns
[ https://issues.apache.org/jira/browse/IMPALA-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-10985: - Description: IMPALA-10314 added always_true hint that leads to assuming that a file in a table will return at least one row even if there is a WHERE clause. Currently we need to add it even if all columns used in the WHERE are partitioning columns. This is not needed, as these predicates can't drop any more rows after partition pruning. (was: IMPALA-10314 added always_true hint that leads to assuming that a file in a table will return at least on row even if there is a WHERE clause. Currently we need to add it even if all columns used in the WHERE are partitioning columns. This is not needed, as these predicates can't drop any more rows after partition pruning.) > always_true hint is not needed if all predicates are on partitioning columns > > > Key: IMPALA-10985 > URL: https://issues.apache.org/jira/browse/IMPALA-10985 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Minor > > IMPALA-10314 added always_true hint that leads to assuming that a file in a > table will return at least one row even if there is a WHERE clause. Currently > we need to add it even if all columns used in the WHERE are partitioning > columns. This is not needed, as these predicates can't drop any more rows > after partition pruning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-5078) Break up expr-test.cc
[ https://issues.apache.org/jira/browse/IMPALA-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860452#comment-17860452 ] Csaba Ringhofer commented on IMPALA-5078: - [~sy117] Were you able to make some progress with this / do you plan to? This ticket is not urgent, but in the long run it would be really nice to break up expr-test.cc > Break up expr-test.cc > - > > Key: IMPALA-5078 > URL: https://issues.apache.org/jira/browse/IMPALA-5078 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Henry Robinson >Assignee: Csaba Ringhofer >Priority: Minor > Labels: newbie, ramp-up > Attachments: Screen Shot 2020-06-30 at 12.19.16 PM.png, Screen Shot > 2020-07-10 at 1.01.43 PM.png, Screen Shot 2020-07-10 at 11.16.36 AM.png, > Screen Shot 2020-07-10 at 11.27.57 AM.png, image-2020-07-10-13-22-48-230.png > > > {{expr-test.cc}} clocks in at 7129 lines, which is about enough for my emacs > to start slowing down a bit. Let's see if we can refactor it enough to have a > couple of test files. Maybe moving all the string instructions into a > separate {{expr-string-test.cc}}, and having a common header will be enough > to make it a bit more manageable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13183) Add default timeout for hs2/beeswax server sockets
Csaba Ringhofer created IMPALA-13183: Summary: Add default timeout for hs2/beeswax server sockets Key: IMPALA-13183 URL: https://issues.apache.org/jira/browse/IMPALA-13183 Project: IMPALA Issue Type: Improvement Components: Backend Reporter: Csaba Ringhofer Currently Impala only sets a timeout for specific operations, for example during the SASL handshake and when checking whether a connection can be closed due to an idle session. https://github.com/apache/impala/blob/d39596f6fb7da54c24d02523c4691e6b1973857b/be/src/rpc/TAcceptQueueServer.cpp#L153 https://github.com/apache/impala/blob/d39596f6fb7da54c24d02523c4691e6b1973857b/be/src/transport/TSaslServerTransport.cpp#L145 There are several cases where an inactive client could keep the connection open indefinitely, for example if it hasn't opened a session yet. I think that there should be a general, longer timeout set for both send/recv, e.g. a flag client_default_timeout_s=3600. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
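A blanket send/recv timeout of the kind proposed can be sketched with plain sockets (illustrative Python; `client_default_timeout_s` is the ticket's proposed flag name, not an existing Impala flag):

```python
# Sketch: apply a default send/recv timeout to an accepted client socket
# so an inactive client cannot hold the connection open indefinitely.
import socket

def apply_default_timeout(conn, client_default_timeout_s=3600):
    # Bounds every subsequent blocking send()/recv() on this connection;
    # a timed-out operation raises socket.timeout and the server can
    # close the connection.
    conn.settimeout(client_default_timeout_s)
    return conn
```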
[jira] [Commented] (IMPALA-12322) return wrong timestamp when scan kudu timestamp with timezone
[ https://issues.apache.org/jira/browse/IMPALA-12322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853203#comment-17853203 ] Csaba Ringhofer commented on IMPALA-12322: -- Thanks for the feedback[~eyizoha]. I have uploaded a patch that adds a new query option: https://gerrit.cloudera.org/#/c/21492/ > return wrong timestamp when scan kudu timestamp with timezone > - > > Key: IMPALA-12322 > URL: https://issues.apache.org/jira/browse/IMPALA-12322 > Project: IMPALA > Issue Type: Bug >Affects Versions: Impala 4.1.1 > Environment: impala 4.1.1 >Reporter: daicheng >Assignee: Zihao Ye >Priority: Major > Attachments: image-2022-04-24-00-01-05-746-1.png, > image-2022-04-24-00-01-05-746.png, image-2022-04-24-00-01-37-520.png, > image-2022-04-24-00-03-14-467-1.png, image-2022-04-24-00-03-14-467.png, > image-2022-04-24-00-04-16-240-1.png, image-2022-04-24-00-04-16-240.png, > image-2022-04-24-00-04-52-860-1.png, image-2022-04-24-00-04-52-860.png, > image-2022-04-24-00-05-52-086-1.png, image-2022-04-24-00-05-52-086.png, > image-2022-04-24-00-07-09-776-1.png, image-2022-04-24-00-07-09-776.png, > image-2023-07-28-20-31-09-457.png, image-2023-07-28-22-27-38-521.png, > image-2023-07-28-22-29-40-083.png, image-2023-07-28-22-36-17-460.png, > image-2023-07-28-22-36-37-884.png, image-2023-07-28-22-38-19-728.png > > > impala version is 3.1.0-cdh6.1 > i have set system timezone=Asia/Shanghai: > !image-2022-04-24-00-01-37-520.png! > !image-2022-04-24-00-01-05-746.png! > here is the bug: > *step 1* > i have parquet file with two columns like below,and read it with impala-shell > and spark (timezone=shanghai) > !image-2022-04-24-00-03-14-467.png|width=1016,height=154! > !image-2022-04-24-00-04-16-240.png|width=944,height=367! 
> the result both exactly right。 > *step two* > create kudu table with impala-shell: > CREATE TABLE default.test_{_}test{_}_test_time2 (id BIGINT,t > TIMESTAMP,PRIMARY KEY (id) ) STORED AS KUDU; > note: kudu version:1.8 > and insert 2 row into the table with spark : > !image-2022-04-24-00-04-52-860.png|width=914,height=279! > *stop 3* > read it with spark (timezone=shanghai),spark read kudu table with kudu-client > api,here is the result: > !image-2022-04-24-00-05-52-086.png|width=914,height=301! > the result is still exactly right。 > but read it with impala-shell: > !image-2022-04-24-00-07-09-776.png|width=915,height=154! > the result show late 8hour > *conclusion* > it seems like impala timezone didn't work when kudu column type is > timestamp, but it work fine in parquet file,I don't know why? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12322) return wrong timestamp when scan kudu timestamp with timezone
[ https://issues.apache.org/jira/browse/IMPALA-12322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851147#comment-17851147 ] Csaba Ringhofer commented on IMPALA-12322: -- [~eyizoha] convert_kudu_utc_timestamps only affects reading, so if Impala writes a Kudu table, it will read back a different timestamp than what it written In IMPALA-12370 there is some discussion about how to configure writing behavior. Do you think that convert_kudu_utc_timestamps should also govern writing, or that should get a separate query option? > return wrong timestamp when scan kudu timestamp with timezone > - > > Key: IMPALA-12322 > URL: https://issues.apache.org/jira/browse/IMPALA-12322 > Project: IMPALA > Issue Type: Bug >Affects Versions: Impala 4.1.1 > Environment: impala 4.1.1 >Reporter: daicheng >Assignee: Zihao Ye >Priority: Major > Attachments: image-2022-04-24-00-01-05-746-1.png, > image-2022-04-24-00-01-05-746.png, image-2022-04-24-00-01-37-520.png, > image-2022-04-24-00-03-14-467-1.png, image-2022-04-24-00-03-14-467.png, > image-2022-04-24-00-04-16-240-1.png, image-2022-04-24-00-04-16-240.png, > image-2022-04-24-00-04-52-860-1.png, image-2022-04-24-00-04-52-860.png, > image-2022-04-24-00-05-52-086-1.png, image-2022-04-24-00-05-52-086.png, > image-2022-04-24-00-07-09-776-1.png, image-2022-04-24-00-07-09-776.png, > image-2023-07-28-20-31-09-457.png, image-2023-07-28-22-27-38-521.png, > image-2023-07-28-22-29-40-083.png, image-2023-07-28-22-36-17-460.png, > image-2023-07-28-22-36-37-884.png, image-2023-07-28-22-38-19-728.png > > > impala version is 3.1.0-cdh6.1 > i have set system timezone=Asia/Shanghai: > !image-2022-04-24-00-01-37-520.png! > !image-2022-04-24-00-01-05-746.png! > here is the bug: > *step 1* > i have parquet file with two columns like below,and read it with impala-shell > and spark (timezone=shanghai) > !image-2022-04-24-00-03-14-467.png|width=1016,height=154! > !image-2022-04-24-00-04-16-240.png|width=944,height=367! 
> the result both exactly right。 > *step two* > create kudu table with impala-shell: > CREATE TABLE default.test_{_}test{_}_test_time2 (id BIGINT,t > TIMESTAMP,PRIMARY KEY (id) ) STORED AS KUDU; > note: kudu version:1.8 > and insert 2 row into the table with spark : > !image-2022-04-24-00-04-52-860.png|width=914,height=279! > *stop 3* > read it with spark (timezone=shanghai),spark read kudu table with kudu-client > api,here is the result: > !image-2022-04-24-00-05-52-086.png|width=914,height=301! > the result is still exactly right。 > but read it with impala-shell: > !image-2022-04-24-00-07-09-776.png|width=915,height=154! > the result show late 8hour > *conclusion* > it seems like impala timezone didn't work when kudu column type is > timestamp, but it work fine in parquet file,I don't know why? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12370) Add an option to customize timezone when working with UNIXTIME_MICROS columns of Kudu tables
[ https://issues.apache.org/jira/browse/IMPALA-12370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851132#comment-17851132 ] Csaba Ringhofer commented on IMPALA-12370: -- >That will free the users from the inconvenience of running their clusters in >the UTC timezone The timezone doesn't need to be set at the server level in Impala; it can be set per query using the query option "timezone", e.g. set timezone=CET; > Ideally, the setting should be per Kudu table, but a system-wide flag is also > an option. The query option convert_kudu_utc_timestamps only affects reading, so there could be a writing-related one too, e.g. write_kudu_utc_timestamps (or convert_kudu_utc_timestamps could be changed to also affect writing). I agree that the ideal would be to be able to override this per table, for example with a table property like "impala.use_kudu_utc_timestamps" which would override both convert_kudu_utc_timestamps / write_kudu_utc_timestamps. It would be even better if other components would also respect this property, so if it is false, then they would write in the timezone-agnostic "Impala" way. > Add an option to customize timezone when working with UNIXTIME_MICROS columns > of Kudu tables > > > Key: IMPALA-12370 > URL: https://issues.apache.org/jira/browse/IMPALA-12370 > Project: IMPALA > Issue Type: Improvement >Reporter: Alexey Serbin >Priority: Major > Labels: timezone > > Impala uses the timezone of its server when converting Unix epoch time stored > in a Kudu table in a column of UNIXTIME_MICROS type (legacy type name > TIMESTAMP) into a timestamp. As one can see, the former (a value stored in > a column of the UNIXTIME_MICROS type) does not contain information about > timezone, but the latter (the result timestamp returned by Impala) does, and > Impala's convention does make sense and works totally fine if the data is > being written and read by Impala or by another application that uses the same > convention. 
> However, Spark uses a different convention. Spark applications convert > timestamps to the UTC timezone before representing the result as Unix epoch > time. So, when a Spark application stores timestamp data in a Kudu table, > there is a difference in the result timestamps upon reading the stored data > via Impala if Impala servers are running in other than the UTC timezone. > As of now, the workaround is to run Impala servers in the UTC timezone, so > the convention used by Spark produces the same result as the convention used > by Impala when converting between timestamps and Unix epoch times. > In this context, it would be great to make it possible customizing the > timezone that's used by Impala when working with UNIXTIME_MICROS/TIMESTAMP > values stored in Kudu tables. That will free the users from the > inconvenience of running their clusters in the UTC timezone if they use a mix > of Spark/Impala applications to work with the same data stored in Kudu > tables. Ideally, the setting should be per Kudu table, but a system-wide > flag is also an option. > This is similar to IMPALA-1658. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
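The two reading conventions discussed here can be sketched as follows (illustrative Python; `read_unixtime_micros` is a hypothetical helper, and `convert_utc` stands in for the convert_kudu_utc_timestamps query option). With a UTC-based writer like Spark and a +08:00 reader, the two conventions differ by exactly the 8 hours seen in reports like IMPALA-12322:

```python
# Sketch of the two conventions for reading Kudu UNIXTIME_MICROS values.
from datetime import datetime, timedelta, timezone

def read_unixtime_micros(micros, convert_utc, local_tz):
    utc_ts = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
    if convert_utc:
        # Spark's convention: the stored value is a UTC instant;
        # convert it to the local (query) timezone for display.
        return utc_ts.astimezone(local_tz)
    # Impala's legacy convention: treat the stored wall-clock fields as
    # already being in the local timezone (no conversion).
    return utc_ts.replace(tzinfo=local_tz)

shanghai = timezone(timedelta(hours=8))  # fixed offset; ignores DST/history
converted = read_unixtime_micros(0, True, shanghai)   # wall clock 08:00
raw = read_unixtime_micros(0, False, shanghai)        # wall clock 00:00
```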
[jira] [Commented] (IMPALA-12656) impala-shell cannot be installed on Python 3.11
[ https://issues.apache.org/jira/browse/IMPALA-12656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850457#comment-17850457 ] Csaba Ringhofer commented on IMPALA-12656: -- I also bumped into this and tried building the python-sasl PRs: https://github.com/cloudera/python-sasl/pull/32 worked with 3.11 and 3.12 but broke 2.7 (at least in my environment). The other PR only fixes 3.11, and had other build failures with 3.12. I think that this is a good reason to drop Python 2.7 support. > impala-shell cannot be installed on Python 3.11 > --- > > Key: IMPALA-12656 > URL: https://issues.apache.org/jira/browse/IMPALA-12656 > Project: IMPALA > Issue Type: Bug >Affects Versions: Impala 4.3.0 >Reporter: Michael Smith >Priority: Major > Labels: python3 > > Trying to {{pip install impala-shell}} fails with > {code:java} > clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG > -g -fwrapv -O3 -Wall -isysroot > /Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk -Isasl > -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11 > -c sasl/saslwrapper.cpp -o > build/temp.macosx-14-arm64-cpython-311/sasl/saslwrapper.o > sasl/saslwrapper.cpp:196:12: fatal error: 'longintrepr.h' file not found > #include "longintrepr.h" > ^~~ > 1 error generated. {code} > Python 3.11 moved this file to a subdirectory in > [https://github.com/python/cpython/commit/8e5de40f90476249e9a2e5ef135143b5c6a0b512.] > Adopting [https://github.com/cloudera/python-sasl/pull/31] or > [https://github.com/cloudera/python-sasl/pull/32] might fix it. But they need > to be included in a new release of sasl on pypi.org. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12656) impala-shell cannot be installed on Python 3.11
[ https://issues.apache.org/jira/browse/IMPALA-12656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12656: - Priority: Critical (was: Major) > impala-shell cannot be installed on Python 3.11 > --- > > Key: IMPALA-12656 > URL: https://issues.apache.org/jira/browse/IMPALA-12656 > Project: IMPALA > Issue Type: Bug >Affects Versions: Impala 4.3.0 >Reporter: Michael Smith >Priority: Critical > Labels: python3 > > Trying to {{pip install impala-shell}} fails with > {code:java} > clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG > -g -fwrapv -O3 -Wall -isysroot > /Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk -Isasl > -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11 > -c sasl/saslwrapper.cpp -o > build/temp.macosx-14-arm64-cpython-311/sasl/saslwrapper.o > sasl/saslwrapper.cpp:196:12: fatal error: 'longintrepr.h' file not found > #include "longintrepr.h" > ^~~ > 1 error generated. {code} > Python 3.11 moved this file to a subdirectory in > [https://github.com/python/cpython/commit/8e5de40f90476249e9a2e5ef135143b5c6a0b512.] > Adopting [https://github.com/cloudera/python-sasl/pull/31] or > [https://github.com/cloudera/python-sasl/pull/32] might fix it. But they need > to be included in a new release of sasl on pypi.org. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-11512) BINARY support in Iceberg
[ https://issues.apache.org/jira/browse/IMPALA-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848937#comment-17848937 ] Csaba Ringhofer commented on IMPALA-11512: -- BINARY columns seem to be working with Iceberg, but testing seems very limited. I didn't find any test with partition spec on BINARY columns. > BINARY support in Iceberg > - > > Key: IMPALA-11512 > URL: https://issues.apache.org/jira/browse/IMPALA-11512 > Project: IMPALA > Issue Type: Sub-task > Components: Frontend >Reporter: Csaba Ringhofer >Priority: Major > Labels: impala-iceberg > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows
[ https://issues.apache.org/jira/browse/IMPALA-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-12990. -- Fix Version/s: Impala 4.4.0 Resolution: Fixed > impala-shell broken if Iceberg delete deletes 0 rows > > > Key: IMPALA-12990 > URL: https://issues.apache.org/jira/browse/IMPALA-12990 > Project: IMPALA > Issue Type: Bug > Components: Clients >Reporter: Csaba Ringhofer >Assignee: Csaba Ringhofer >Priority: Major > Labels: iceberg > Fix For: Impala 4.4.0 > > > Happens only with Python 3 > {code} > impala-python3 shell/impala_shell.py > create table icebergupdatet (i int, s string) stored as iceberg; > alter table icebergupdatet set tblproperties("format-version"="2"); > delete from icebergupdatet where i=0; > Unknown Exception : '>' not supported between instances of 'NoneType' and > 'int' > Traceback (most recent call last): > File "shell/impala_shell.py", line 1428, in _execute_stmt > if is_dml and num_rows == 0 and num_deleted_rows > 0: > TypeError: '>' not supported between instances of 'NoneType' and 'int' > {code} > The same error should also happen when the delete removes > 0 rows, but the > impala server has an older version that doesn't set TDmlResult.rows_deleted
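The Python 3 failure above is a plain None-vs-int comparison. A minimal sketch of the guard (function and parameter names are hypothetical, mirroring the traceback) that tolerates an older server leaving TDmlResult.rows_deleted unset:

```python
def dml_deleted_rows(is_dml, num_rows, num_deleted_rows):
    # Coalesce None to 0 before comparing: Python 3 raises TypeError on
    # "None > 0", while Python 2 silently ordered None below any int.
    deleted = num_deleted_rows if num_deleted_rows is not None else 0
    return is_dml and num_rows == 0 and deleted > 0
```

The same coalescing applies wherever the shell reads an optional counter off the DML result.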
[jira] [Created] (IMPALA-13056) HBaseTableScanner's timeout handling looks broken
Csaba Ringhofer created IMPALA-13056: Summary: HBaseTableScanner's timeout handling looks broken Key: IMPALA-13056 URL: https://issues.apache.org/jira/browse/IMPALA-13056 Project: IMPALA Issue Type: Bug Components: Backend Reporter: Csaba Ringhofer https://gerrit.cloudera.org/#/c/12660/ rewrote some JNI exception handling code and accidentally eliminated the timeout handling in https://github.com/apache/impala/blob/7ad94006563b88d9221b4ac978dbf5b4fc0a3ca1/be/src/exec/hbase/hbase-table-scanner.cc#L518
[jira] [Updated] (IMPALA-13052) Sampling aggregate result sizes are underestimated
[ https://issues.apache.org/jira/browse/IMPALA-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13052: - Description: Sampling aggregates (sample, appx_median, histogram) return a string that can be quite large, but the planner assumes it to have a fixed small size. Examples: select sample(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) select appx_median(l_orderkey) from tpch.lineitem; according to plan: row-size= 8B in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) select histogram(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) This may be also relevant for datasketches functions, haven't checked those yet. This can lead to highly underestimating the memory needs of grouping aggregators: select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 limit 1 04:AGGREGATE FINALIZE Peak Mem: 2.19 GB Est. Peak Mem: 18.00 MB 01:AGGREGATE STREAMING Peak Mem: 2.37 GB Est. Peak Mem: 45.79 MB Enforcing PREAGG_BYTES_LIMIT also doesn't seem to work well -setting a 40MB limit decreased peak mem to 1.5 GB. My guess is that the pre-aggregation logic is not prepared for aggregation states that grow during the execution, so it can decide to not add another group to the hash table, but can't deny increasing an existing one's state. was: Sampling aggregates (sample, appx_median, histogram) return a string that can be quite large, but the planner assumes it to have a fixed small size. 
Examples: select sample(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) select appx_median(l_orderkey) from tpch.lineitem; according to plan: row-size= 8B in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) select histogram(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) This may be also relevant for datasketches functions, haven't checked those yet. This can lead to highly underestimating the memory needs of grouping aggregators: select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 limit 1 04:AGGREGATE FINALIZE Peak Mem: 2.19 GB Est. Peak Mem: 18.00 MB 01:AGGREGATE STREAMING Peak Mem: 2.37 GB Est. Peak Mem: 45.79 MB > Sampling aggregate result sizes are underestimated > -- > > Key: IMPALA-13052 > URL: https://issues.apache.org/jira/browse/IMPALA-13052 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > > Sampling aggregates (sample, appx_median, histogram) return a string that can > be quite large, but the planner assumes it to have a fixed small size. > Examples: > select sample(l_orderkey) from tpch.lineitem; > according to plan: row-size=12B > in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) > select appx_median(l_orderkey) from tpch.lineitem; > according to plan: row-size= 8B > in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) > select histogram(l_orderkey) from tpch.lineitem; > according to plan: row-size=12B > in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) > This may be also relevant for datasketches functions, haven't checked those > yet.
> This can lead to highly underestimating the memory needs of grouping > aggregators: > select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 > limit 1 > 04:AGGREGATE FINALIZE Peak Mem: 2.19 GB Est. Peak Mem: 18.00 MB > 01:AGGREGATE STREAMING Peak Mem: 2.37 GB Est. Peak Mem: 45.79 MB > Enforcing PREAGG_BYTES_LIMIT also doesn't seem to work well -setting a 40MB > limit decreased peak mem to 1.5 GB. My guess is that the pre-aggregation > logic is not prepared for aggregation states that grow during the execution, > so it can decide to not add another group to the hash table, but can't deny > increasing an existing one's state.
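The gap between the planned and observed sizes above is easy to quantify; a back-of-the-envelope check using the appx_median figures from the description (the 254.68 KB value is the observed TotalBytesSent for a single result row):

```python
PLAN_ROW_SIZE_BYTES = 8                  # planner's row-size for appx_median
ACTUAL_BYTES_SENT = int(254.68 * 1024)   # observed bytes for that single row

# The planner underestimates the result/intermediate size by a factor of
# roughly 32,000x, which is why grouping aggregations can blow far past
# their memory estimates once every group carries such a state.
underestimation_factor = ACTUAL_BYTES_SENT / PLAN_ROW_SIZE_BYTES
```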
[jira] [Updated] (IMPALA-13052) Sampling aggregate result sizes are underestimated
[ https://issues.apache.org/jira/browse/IMPALA-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13052: - Description: Sampling aggregates (sample, appx_median, histogram) return a string that can be quite large, but the planner assumes it to have a fixed small size. Examples: select sample(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) select appx_median(l_orderkey) from tpch.lineitem; according to plan: row-size= 8B in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) select histogram(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) This may be also relevant for datasketches functions, haven't checked those yet. This can lead to highly underestimating the memory needs of grouping aggregators: select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 limit 1 04:AGGREGATE FINALIZE Peak Mem: 2.19 GB Est. Peak Mem: 18.00 MB 01:AGGREGATE STREAMING Peak Mem: 2.37 GB Est. Peak Mem: 45.79 MB was: Sampling aggregates (sample, appx_median, histogram) return a string that can be quite large, but the planner assumes it to have a fixed small size. Examples: select sample(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) select appx_median(l_orderkey) from tpch.lineitem; according to plan: row-size= 8B in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) select histogram(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) This may be also relevant for datasketches functions.
> Sampling aggregate result sizes are underestimated > -- > > Key: IMPALA-13052 > URL: https://issues.apache.org/jira/browse/IMPALA-13052 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > > Sampling aggregates (sample, appx_median, histogram) return a string that can > be quite large, but the planner assumes it to have a fixed small size. > Examples: > select sample(l_orderkey) from tpch.lineitem; > according to plan: row-size=12B > in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) > select appx_median(l_orderkey) from tpch.lineitem; > according to plan: row-size= 8B > in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) > select histogram(l_orderkey) from tpch.lineitem; > according to plan: row-size=12B > in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) > This may be also relevant for datasketches functions, haven't checked those > yet. > This can lead to highly underestimating the memory needs of grouping > aggregators: > select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 > limit 1 > 04:AGGREGATE FINALIZE Peak Mem: 2.19 GB Est. Peak Mem: 18.00 MB > 01:AGGREGATE STREAMING Peak Mem: 2.37 GB Est. Peak Mem: 45.79 MB
[jira] [Created] (IMPALA-13052) Sampling aggregate result sizes are underestimated
Csaba Ringhofer created IMPALA-13052: Summary: Sampling aggregate result sizes are underestimated Key: IMPALA-13052 URL: https://issues.apache.org/jira/browse/IMPALA-13052 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer Sampling aggregates (sample, appx_median, histogram) return a string that can be quite large, but the planner assumes it to have a fixed small size. Examples: select sample(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) select appx_median(l_orderkey) from tpch.lineitem; according to plan: row-size= 8B in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) select histogram(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) This may be also relevant for datasketches functions.
[jira] [Updated] (IMPALA-13048) Shuffle hint on joins is ignored in some cases
[ https://issues.apache.org/jira/browse/IMPALA-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13048: - Description: I noticed that shuffle hint is ignored without any warning in some cases shuffle hint is not applied in this query: {code} explain select * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code} result plan {code} PLAN-ROOT SINK | 07:EXCHANGE [UNPARTITIONED] | 04:HASH JOIN [INNER JOIN, BROADCAST] | hash predicates: a3.tinyint_col = a2.tinyint_col | runtime filters: RF000 <- a2.tinyint_col | row-size=267B cardinality=80 | |--06:EXCHANGE [BROADCAST] | | | 03:HASH JOIN [INNER JOIN, BROADCAST] | | hash predicates: a1.id = a2.id | | runtime filters: RF002 <- a2.id | | row-size=178B cardinality=8 | | | |--05:EXCHANGE [BROADCAST] | | | | | 00:SCAN HDFS [functional.alltypestiny a2] | | HDFS partitions=4/4 files=4 size=460B | | row-size=89B cardinality=8 | | | 01:SCAN HDFS [functional.alltypes a1] | HDFS partitions=24/24 files=24 size=478.45KB | runtime filters: RF002 -> a1.id | row-size=89B cardinality=7.30K | 02:SCAN HDFS [functional.alltypessmall a3] HDFS partitions=4/4 files=4 size=6.32KB runtime filters: RF000 -> a3.tinyint_col row-size=89B cardinality=100 {code} if the first two tables' position is swapped, then it is applied: {code} explain select * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code} was: I noticed that shuffle hint is ignore without any warning in some cases shuffle hint is not applied in this query: {code} explain select * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code} result plan {code} PLAN-ROOT SINK | 07:EXCHANGE [UNPARTITIONED] | 04:HASH JOIN [INNER JOIN, BROADCAST] | hash predicates: a3.tinyint_col = a2.tinyint_col | runtime filters: 
RF000 <- a2.tinyint_col | row-size=267B cardinality=80 | |--06:EXCHANGE [BROADCAST] | | | 03:HASH JOIN [INNER JOIN, BROADCAST] | | hash predicates: a1.id = a2.id | | runtime filters: RF002 <- a2.id | | row-size=178B cardinality=8 | | | |--05:EXCHANGE [BROADCAST] | | | | | 00:SCAN HDFS [functional.alltypestiny a2] | | HDFS partitions=4/4 files=4 size=460B | | row-size=89B cardinality=8 | | | 01:SCAN HDFS [functional.alltypes a1] | HDFS partitions=24/24 files=24 size=478.45KB | runtime filters: RF002 -> a1.id | row-size=89B cardinality=7.30K | 02:SCAN HDFS [functional.alltypessmall a3] HDFS partitions=4/4 files=4 size=6.32KB runtime filters: RF000 -> a3.tinyint_col row-size=89B cardinality=100 {code} if the first two tables' position is swapped, then it is applied: {code} explain select * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code} > Shuffle hint on joins is ignored in some cases > -- > > Key: IMPALA-13048 > URL: https://issues.apache.org/jira/browse/IMPALA-13048 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > > I noticed that shuffle hint is ignored without any warning in some cases > shuffle hint is not applied in this query: > {code} > explain select * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on > a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; > {code} > result plan > {code} > PLAN-ROOT SINK > | > 07:EXCHANGE [UNPARTITIONED] > | > 04:HASH JOIN [INNER JOIN, BROADCAST] > | hash predicates: a3.tinyint_col = a2.tinyint_col > | runtime filters: RF000 <- a2.tinyint_col > | row-size=267B cardinality=80 > | > |--06:EXCHANGE [BROADCAST] > | | > | 03:HASH JOIN [INNER JOIN, BROADCAST] > | | hash predicates: a1.id = a2.id > | | runtime filters: RF002 <- a2.id > | | row-size=178B cardinality=8 > | | > | |--05:EXCHANGE [BROADCAST] > | | | > | | 00:SCAN HDFS [functional.alltypestiny a2] > | | HDFS partitions=4/4 files=4 
size=460B > | | row-size=89B cardinality=8 > | | > | 01:SCAN HDFS [functional.alltypes a1] > | HDFS partitions=24/24 files=24 size=478.45KB > | runtime filters: RF002 -> a1.id > | row-size=89B cardinality=7.30K > | > 02:SCAN HDFS [functional.alltypessmall a3] >HDFS partitions=4/4 files=4 size=6.32KB >runtime filters: RF000 -> a3.tinyint_col >row-size=89B cardinality=100 > {code} > if the first two tables' position is swapped, then it is applied: > {code} > explain select * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on > a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; > {code}
[jira] [Created] (IMPALA-13048) Shuffle hint on joins is ignored in some cases
Csaba Ringhofer created IMPALA-13048: Summary: Shuffle hint on joins is ignored in some cases Key: IMPALA-13048 URL: https://issues.apache.org/jira/browse/IMPALA-13048 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer I noticed that shuffle hint is ignored without any warning in some cases. Shuffle hint is not applied in this query: {code} explain select * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code} result plan {code} PLAN-ROOT SINK | 07:EXCHANGE [UNPARTITIONED] | 04:HASH JOIN [INNER JOIN, BROADCAST] | hash predicates: a3.tinyint_col = a2.tinyint_col | runtime filters: RF000 <- a2.tinyint_col | row-size=267B cardinality=80 | |--06:EXCHANGE [BROADCAST] | | | 03:HASH JOIN [INNER JOIN, BROADCAST] | | hash predicates: a1.id = a2.id | | runtime filters: RF002 <- a2.id | | row-size=178B cardinality=8 | | | |--05:EXCHANGE [BROADCAST] | | | | | 00:SCAN HDFS [functional.alltypestiny a2] | | HDFS partitions=4/4 files=4 size=460B | | row-size=89B cardinality=8 | | | 01:SCAN HDFS [functional.alltypes a1] | HDFS partitions=24/24 files=24 size=478.45KB | runtime filters: RF002 -> a1.id | row-size=89B cardinality=7.30K | 02:SCAN HDFS [functional.alltypessmall a3] HDFS partitions=4/4 files=4 size=6.32KB runtime filters: RF000 -> a3.tinyint_col row-size=89B cardinality=100 {code} if the first two tables' position is swapped, then it is applied: {code} explain select * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code}
[jira] [Created] (IMPALA-13040) SIGSEGV in QueryState::UpdateFilterFromRemote
Csaba Ringhofer created IMPALA-13040: Summary: SIGSEGV in QueryState::UpdateFilterFromRemote Key: IMPALA-13040 URL: https://issues.apache.org/jira/browse/IMPALA-13040 Project: IMPALA Issue Type: Bug Components: Backend Reporter: Csaba Ringhofer {code} Crash reason: SIGSEGV /SEGV_MAPERR Crash address: 0x48 Process uptime: not available Thread 114 (crashed) 0 libpthread.so.0 + 0x9d00 rax = 0x00019e57ad00 rdx = 0x2a656720 rcx = 0x059a9860 rbx = 0x rsi = 0x00019e57ad00 rdi = 0x0038 rbp = 0x7f6233d544e0 rsp = 0x7f6233d544a8 r8 = 0x06a53540r9 = 0x0039 r10 = 0x r11 = 0x000a r12 = 0x00019e57ad00 r13 = 0x7f62a2f997d0 r14 = 0x7f6233d544f8 r15 = 0x1632c0f0 rip = 0x7f62a2f96d00 Found by: given as instruction pointer in context 1 impalad!impala::QueryState::UpdateFilterFromRemote(impala::UpdateFilterParamsPB const&, kudu::rpc::RpcContext*) [query-state.cc : 1033 + 0x5] rbp = 0x7f6233d54520 rsp = 0x7f6233d544f0 rip = 0x015c0837 Found by: previous frame's frame pointer 2 impalad!impala::DataStreamService::UpdateFilterFromRemote(impala::UpdateFilterParamsPB const*, impala::UpdateFilterResultPB*, kudu::rpc::RpcContext*) [data-stream-service.cc : 134 + 0xb] rbp = 0x7f6233d54640 rsp = 0x7f6233d54530 rip = 0x017c05de Found by: previous frame's frame pointer {code} The line that crashes is https://github.com/apache/impala/blob/b39cd79ae84c415e0aebec2c2b4d7690d2a0cc7a/be/src/runtime/query-state.cc#L1033 My guess is that the actual segfault is within WaitForPrepare() but it was inlined. Not sure if a remote filter can arrive even before QueryState::Init is finished - that would explain the issue, as instances_prepared_barrier_ is not yet created at that point.
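A minimal Python model of the suspected ordering problem (names are hypothetical; the real code is C++): if the RPC handler waits on an "initialized" flag instead of touching the not-yet-created barrier, an early filter update can be rejected safely rather than crashing:

```python
import threading

class QueryStateModel:
    """Toy stand-in for QueryState: init() creates the barrier that
    update_filter_from_remote() dereferences."""
    def __init__(self):
        self.init_done = threading.Event()
        self.instances_prepared_barrier = None

    def init(self):
        # The real code would construct the barrier here.
        self.instances_prepared_barrier = object()
        self.init_done.set()

    def update_filter_from_remote(self, timeout_s=5.0):
        # Guard against the race: an RPC arriving before init() finished
        # would otherwise hit a null barrier (the crash at
        # query-state.cc:1033 in the stack above).
        if not self.init_done.wait(timeout_s):
            return False  # reject the early filter update instead of crashing
        assert self.instances_prepared_barrier is not None
        return True

qs = QueryStateModel()
threading.Thread(target=qs.init).start()
accepted = qs.update_filter_from_remote()
```

This only sketches the hypothesis in the report; the actual fix in Impala may order things differently.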
[jira] [Updated] (IMPALA-12320) test_topic_updates_unblock fails in ASAN build
[ https://issues.apache.org/jira/browse/IMPALA-12320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12320: - Priority: Critical (was: Major) > test_topic_updates_unblock fails in ASAN build > -- > > Key: IMPALA-12320 > URL: https://issues.apache.org/jira/browse/IMPALA-12320 > Project: IMPALA > Issue Type: Bug >Reporter: Zoltán Borók-Nagy >Assignee: Joe McDonnell >Priority: Critical > Labels: broken-build > > h3. Error Message > AssertionError: alter table tpcds.store_sales recover partitions query took > less time than 1 msec assert 9622 > 1 + where 9622 = ApplyResult.get of 0x7f1ab45b6d10>>() + where > = > .get > h3. Stacktrace > {noformat} > custom_cluster/test_topic_update_frequency.py:82: in > test_topic_updates_unblock > non_blocking_query_options=non_blocking_query_options) > custom_cluster/test_topic_update_frequency.py:132: in __run_topic_update_test > assert slow_query_future.get() > blocking_query_min_time, \ > E AssertionError: alter table tpcds.store_sales recover partitions query > took less time than 1 msec > E assert 9622 > 1 > E+ where 9622 = >() > E+where > = > .get > {noformat} >
[jira] [Commented] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg
[ https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840486#comment-17840486 ] Csaba Ringhofer commented on IMPALA-12266: -- Saw this test failing again. select * from special_chars; Could not resolve table reference: 'special_chars' Looked into coordinator log: {code} I0422 03:48:38.383420 19888 Frontend.java:2127] 1f4e0654b999662f:b6f1b015] Analyzing query: select * from special_chars db: test_convert_table_cdba7383 ... I0422 03:48:42.862898 1012 ImpaladCatalog.java:232] Deleting: TABLE:test_convert_table_cdba7383.special_chars version: 7785 size: 77 I0422 03:48:42.862920 1012 ImpaladCatalog.java:232] Deleting: TABLE:test_convert_table_cdba7383.special_chars_tmp_5eb06c80 version: 7786 size: 714 I0422 03:48:42.862967 1012 ImpaladCatalog.java:232] Adding: CATALOG_SERVICE_ID version: 7786 size: 60 ... I0422 03:48:42.863464 19888 jni-util.cc:302] 1f4e0654b999662f:b6f1b015] org.apache.impala.common.AnalysisException: Could not resolve table reference: 'special_chars' at org.apache.impala.analysis.Analyzer.resolvePath(Analyzer.java:1458) ... I0422 03:48:46.893426 1012 ImpaladCatalog.java:232] Adding: TABLE:test_convert_table_cdba7383.special_chars version: 7794 size: 84 {code} I am not familiar with how convert to Iceberg works, but based on the logs 1. special_chars_tmp_5eb06c80 is created, 2. special_chars is deleted 3. special_chars recreated If the table is queried between 2 and 3 then the coordinator will think that it doesn't exist.
> Sporadic failure after migrating a table to Iceberg > --- > > Key: IMPALA-12266 > URL: https://issues.apache.org/jira/browse/IMPALA-12266 > Project: IMPALA > Issue Type: Bug > Components: fe >Affects Versions: Impala 4.2.0 >Reporter: Tamas Mate >Assignee: Quanlong Huang >Priority: Major > Labels: impala-iceberg > Attachments: > catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, > impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1 > > > TestIcebergTable.test_convert_table test failed in a recent verify job's > dockerised tests: > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629 > {code:none} > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EINNER EXCEPTION: > EMESSAGE: AnalysisException: Failed to load metadata for table: > 'parquet_nopartitioned' > E CAUSED BY: TableLoadingException: Could not load table > test_convert_table_cdba7383.parquet_nopartitioned from catalog > E CAUSED BY: TException: > TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, > error_msgs:[NullPointerException: null]), lookup_status:OK) > {code} > {code:none} > E0704 19:09:22.980131 833 JniUtil.java:183] > 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of > TABLE:test_convert_table_cdba7383.parquet_nopartitioned. 
Time spent: 49ms > I0704 19:09:22.980309 833 jni-util.cc:288] > 7145c21173f2c47b:2579db55] java.lang.NullPointerException > at > org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357) > at > org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300) > at > org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480) > at > org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109) > at > org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238) > at > org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396) > I0704 19:09:22.980324 833 status.cc:129] 7145c21173f2c47b:2579db55] > NullPointerException: null > @ 0x1012f9f impala::Status::Status() > @ 0x187f964 impala::JniUtil::GetJniExceptionMsg() > @ 0xfee920 impala::JniCall::Call<>() > @ 0xfccd0f impala::Catalog::GetPartialCatalogObject() > @ 0xfb55a5 > impala::CatalogServiceThriftIf::GetPartialCatalogObject() > @
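The delete-then-recreate window described in the comment above can be modeled with a toy catalog map (names hypothetical): replacing the entry in a single step leaves no moment at which the table name fails to resolve, while delete-then-add does:

```python
def migrate_non_atomic(catalog, name, new_table):
    # Mirrors steps 2-3 from the log: the name is unresolvable in between.
    del catalog[name]
    # ... a concurrent "select * from special_chars" here raises KeyError ...
    catalog[name] = new_table

def migrate_atomic(catalog, name, new_table):
    # One assignment: readers see either the old or the new table,
    # never a missing entry.
    catalog[name] = new_table

cat = {"special_chars": "hms_table"}
migrate_atomic(cat, "special_chars", "iceberg_table")
```

This is only an illustration of the race window, not how the Impala catalog actually stores or versions tables.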
[jira] [Created] (IMPALA-13037) EventsProcessorStressTest can hang
Csaba Ringhofer created IMPALA-13037: Summary: EventsProcessorStressTest can hang Key: IMPALA-13037 URL: https://issues.apache.org/jira/browse/IMPALA-13037 Project: IMPALA Issue Type: Bug Components: Catalog, Infrastructure Reporter: Csaba Ringhofer The test failed with timeout. From mvn.log the last line is: 20:17:53 [INFO] Running org.apache.impala.catalog.events.EventsProcessorStressTest Things seem to be hanging from 2024.04.22 20:17:53 to 2024.04.23 The test seems to wait for a Hive query. From FeSupport.INFO: {code} I0422 20:17:55.478875 7949 RandomHiveQueryRunner.java:1102] Client 0 running hive query set 2: insert into table events_stress_db_0.stress_test_tbl_0_alltypes_part partition (year,month) select * from functional.alltypes limit 100 create database if not exists events_stress_db_0 drop table if exists events_stress_db_0.stress_test_tbl_0_alltypes_part create table if not exists events_stress_db_0.stress_test_tbl_0_alltypes_part like functional.alltypes set hive.exec.dynamic.partition.mode = nonstrict set hive.exec.max.dynamic.partitions = 1 set hive.exec.max.dynamic.partitions.pernode = 1 set tez.session.am.dag.submit.timeout.secs = 2 I0422 20:17:55.478940 7949 HiveJdbcClientPool.java:102] Executing sql : create database if not exists events_stress_db_0 I0422 20:17:55.493497 7768 MetastoreShim.java:843] EventId: 33414 EventType: COMMIT_TXN transaction id: 2075 I0422 20:17:55.493682 7768 MetastoreEvents.java:302] Total number of events received: 6 Total number of events filtered out: 0 I0422 20:17:55.494762 7768 MetastoreEvents.java:825] EventId: 33407 EventType: CREATE_DATABASE Successfully added database events_stress_db_0 I0422 20:17:55.508478 7949 HiveJdbcClientPool.java:102] Executing sql : drop table if exists events_stress_db_0.stress_test_tbl_0_alltypes_part I0422 20:17:55.516858 7768 MetastoreEvents.java:825] EventId: 33410 EventType: CREATE_TABLE Successfully added table events_stress_db_0.stress_test_tbl_0_part I0422 20:17:55.518288
7768 CatalogOpExecutor.java:4713] EventId: 33413 Table events_stress_db_0.stress_test_tbl_0_part is not loaded. Skipping add partitions I0422 20:17:55.519479 7768 MetastoreEventsProcessor.java:1340] Time elapsed in processing event batch: 178.895ms I0422 20:17:55.521183 7768 MetastoreEventsProcessor.java:1120] Latest event in HMS: id=33420, time=1713842275. Last synced event: id=33414, time=1713842275. I0422 20:17:55.533375 7949 HiveJdbcClientPool.java:102] Executing sql : create table if not exists events_stress_db_0.stress_test_tbl_0_alltypes_part like functional.alltypes I0422 20:17:55.611153 7949 HiveJdbcClientPool.java:102] Executing sql : set hive.exec.dynamic.partition.mode = nonstrict I0422 20:17:55.616571 7949 HiveJdbcClientPool.java:102] Executing sql : set hive.exec.max.dynamic.partitions = 1 I0422 20:17:55.619197 7949 HiveJdbcClientPool.java:102] Executing sql : set hive.exec.max.dynamic.partitions.pernode = 1 I0422 20:17:55.621069 7949 HiveJdbcClientPool.java:102] Executing sql : set tez.session.am.dag.submit.timeout.secs = 2 I0422 20:17:55.622972 7949 HiveJdbcClientPool.java:102] Executing sql : insert into table events_stress_db_0.stress_test_tbl_0_alltypes_part partition (year,month) select * from functional.alltypes limit 100 I0422 20:17:57.163591 7950 CatalogServiceCatalog.java:2747] Refreshing table metadata: events_stress_db_0.stress_test_tbl_0_part I0422 20:17:57.829802 7768 MetastoreEventsProcessor.java:982] Received 6 events. First event id: 33416. 
I0422 20:17:57.833026 7768 MetastoreShim.java:843] EventId: 33417 EventType: COMMIT_TXN transaction id: 2076 I0422 20:17:57.833222 7768 MetastoreShim.java:843] EventId: 33419 EventType: COMMIT_TXN transaction id: 2077 I0422 20:17:57.84 7768 MetastoreShim.java:843] EventId: 33421 EventType: COMMIT_TXN transaction id: 2078 I0422 20:17:57.834242 7768 MetastoreShim.java:843] EventId: 33424 EventType: COMMIT_TXN transaction id: 2079 I0422 20:17:57.834323 7768 MetastoreEvents.java:302] Total number of events received: 6 Total number of events filtered out: 0 I0422 20:17:57.834570 7768 CatalogOpExecutor.java:4862] EventId: 33416 Table events_stress_db_0.stress_test_tbl_0_part is not loaded. Not processing the event. I0422 20:17:57.837756 7768 MetastoreEvents.java:825] EventId: 33423 EventType: CREATE_TABLE Successfully added table events_stress_db_0.stress_test_tbl_0_alltypes_part I0422 20:17:57.838668 7768 MetastoreEventsProcessor.java:1340] Time elapsed in processing event batch: 8.625ms I0422 20:17:57.840027 7768 MetastoreEventsProcessor.java:1120] Latest event in HMS: id=33425, time=1713842275. Last synced event: id=33424, time=1713842275. I0422 20:18:03.143219 7768 MetastoreEventsProcessor.java:982] Received 0 events. First event id: non
[jira] [Created] (IMPALA-13026) Creating openai-api-key-secret fails sporadically
Csaba Ringhofer created IMPALA-13026: Summary: Creating openai-api-key-secret fails sporadically Key: IMPALA-13026 URL: https://issues.apache.org/jira/browse/IMPALA-13026 Project: IMPALA Issue Type: Bug Components: Infrastructure Reporter: Csaba Ringhofer Data load fails from time to time with the following error: {code} 00:27:17.680 Error loading data. The end of the log file is: 00:27:17.680 04:15:15 /data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/bin/load-data.py --workloads functional-query -e core --table_formats kudu/none/none --force --impalad localhost --hive_hs2_hostport localhost:11050 --hdfs_namenode localhost:20500 00:27:17.680 04:15:15 Executing Hadoop command: ... hadoop credential create openai-api-key-secret -value secret -provider localjceks://file/data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/testdata/jceks/test.jceks ... 00:27:17.680 java.io.IOException: Credential openai-api-key-secret already exists in localjceks://file/data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/testdata/jceks/test.jceks 00:27:17.680 at org.apache.hadoop.security.alias.AbstractJavaKeyStoreProvider.createCredentialEntry(AbstractJavaKeyStoreProvider.java:234) 00:27:17.680 at org.apache.hadoop.security.alias.CredentialShell$CreateCommand.execute(CredentialShell.java:354) 00:27:17.680 at org.apache.hadoop.tools.CommandShell.run(CommandShell.java:72) 00:27:17.680 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81) 00:27:17.680 at org.apache.hadoop.security.alias.CredentialShell.main(CredentialShell.java:437) 00:27:17.680 04:15:15 Error executing Hadoop command, exiting {code} My guess is that this happens when calling "hadoop credential create" concurrently with different data loader processes.
https://github.com/apache/impala/blob/9b05a205fec397fa1e19ae467b1cc406ca43d948/bin/load-data.py#L323 Ideally this would be called in the serial phase of dataload -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
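Until the call is moved to the serial phase, the race could also be tolerated by making creation idempotent: treat "already exists" as success, since another loader process having created the secret is exactly the desired end state. A minimal sketch of that pattern; `create_credential`, `ensure_credential`, and the in-memory keystore are hypothetical stand-ins for the real `hadoop credential create` invocation, not Impala code:

```python
def create_credential(keystore, name, value):
    # Stand-in for "hadoop credential create": fails if the entry exists,
    # mirroring the IOException thrown by the Hadoop CLI.
    if name in keystore:
        raise RuntimeError("Credential %s already exists" % name)
    keystore[name] = value

def ensure_credential(keystore, name, value):
    # Tolerate a concurrent creator: "already exists" counts as success,
    # any other failure is still propagated.
    try:
        create_credential(keystore, name, value)
    except RuntimeError as e:
        if "already exists" not in str(e):
            raise

keystore = {}
ensure_credential(keystore, "openai-api-key-secret", "secret")
# A second, concurrent loader doing the same thing no longer fails:
ensure_credential(keystore, "openai-api-key-secret", "secret")
```

Note this still leaves a small window where two processes both pass the existence check; for the real CLI, matching on the "already exists" error message after the fact (as above) is the safer side of the check.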
[jira] [Comment Edited] (IMPALA-13024) Several tests timeout waiting for admission
[ https://issues.apache.org/jira/browse/IMPALA-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839337#comment-17839337 ] Csaba Ringhofer edited comment on IMPALA-13024 at 4/21/24 8:15 AM: --- >Slot based admission is not enabled when using default groups This was also my assumption, but it seems that it is enforced by default. Reproduced slot starvation locally: Run one query with more fragment instances than the core count in one impala-shell: set mt_dop=32; select sleep(1000*60) from tpcds.store_sales limit 200; -- Run a query in another impala-shell: select * from functional.alltypestiny; ERROR: Admission for query exceeded timeout 6ms in pool default-pool. Queued reason: Not enough admission control slots available on host csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 are already in use. Additional Details: Not Applicable UPDATE: I understand now what is happening: the limit is only enforced on coordinator-only queries. While "select * from alltypestiny" failed, the much larger "select * from alltypes" could be run without issues. The reason is that the former query runs on a single node. 
From impalad.INFO: "0421 10:10:57.505287 1586078 admission-controller.cc:1962] Trying to admit id=91442a9fa1d2512d:db5337c2 in pool_name=default-pool executor_group_name=empty group (using coordinator only) per_host_mem_estimate=20.00 MB dedicated_coord_mem_estimate=120.00 MB max_requests=-1 max_queued=200 max_mem=-1.00 B is_trivial_query=false I0421 10:10:57.505345 1586078 admission-controller.cc:1971] Stats: agg_num_running=1, agg_num_queued=1, agg_mem_reserved=4.02 MB, local_host(local_mem_admitted=516.57 MB, local_trivial_running=0, num_admitted_running=1, num_queued=1, backend_mem_reserved=4.02 MB, topN_query_stats: queries=[d84f2a7efee0998a:45ac1206], total_mem_consumed=4.02 MB, fraction_of_pool_total_mem=1; pool_level_stats: num_running=1, min=4.02 MB, max=4.02 MB, pool_total_mem=4.02 MB, average_per_query=4.02 MB) I0421 10:10:57.505407 1586078 admission-controller.cc:2227] Could not dequeue query id=91442a9fa1d2512d:db5337c2 reason: Not enough admission control slots available on host csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 are already in use." was (Author: csringhofer): >Slot based admission is not enabled when using default groups This was also my assumption, but it seems that it is enforced by default. Reproduced slot starvation locally: Run one query with more fragment instance than core count in one impala-shell: set mt_dop=32; select sleep(1000*60) from tpcds.store_sales limit 200; -- Run a query in another impala-shell: select * from functional.alltypestiny; ERROR: Admission for query exceeded timeout 6ms in pool default-pool. Queued reason: Not enough admission control slots available on host csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 are already in use. 
Additional Details: Not Applicable > Several tests timeout waiting for admission > --- > > Key: IMPALA-13024 > URL: https://issues.apache.org/jira/browse/IMPALA-13024 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > > A bunch of seemingly unrelated tests failed with the following message: > Example: > query_test.test_spilling.TestSpillingDebugActionDimensions.test_spilling_aggs[protocol: > beeswax | exec_option: {'mt_dop': 1, 'debug_action': None, > 'default_spillable_buffer_size': '256k'} | table_format: parquet/none] > {code} > ImpalaBeeswaxException: EQuery aborted:Admission for query exceeded > timeout 6ms in pool default-pool. Queued reason: Not enough admission > control slots available on host ... . Needed 1 slots but 18/16 are already in > use. Additional Details: Not Applicable > {code} > This happened in an ASAN build. Another test also failed which may be related > to the cause: > custom_cluster.test_admission_controller.TestAdmissionController.test_queue_reasons_slots > > {code} > Timeout: query 'e1410add778cd7b0:c40812b9' did not reach one of the > expected states [4], last known state 5 > {code} > test_queue_reasons_slots seems to be a known flaky test: IMPALA-10338
[jira] [Commented] (IMPALA-13024) Several tests timeout waiting for admission
[ https://issues.apache.org/jira/browse/IMPALA-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839337#comment-17839337 ] Csaba Ringhofer commented on IMPALA-13024: -- >Slot based admission is not enabled when using default groups This was also my assumption, but it seems that it is enforced by default. Reproduced slot starvation locally: Run one query with more fragment instances than the core count in one impala-shell: set mt_dop=32; select sleep(1000*60) from tpcds.store_sales limit 200; -- Run a query in another impala-shell: select * from functional.alltypestiny; ERROR: Admission for query exceeded timeout 6ms in pool default-pool. Queued reason: Not enough admission control slots available on host csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 are already in use. Additional Details: Not Applicable > Several tests timeout waiting for admission > --- > > Key: IMPALA-13024 > URL: https://issues.apache.org/jira/browse/IMPALA-13024 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > > A bunch of seemingly unrelated tests failed with the following message: > Example: > query_test.test_spilling.TestSpillingDebugActionDimensions.test_spilling_aggs[protocol: > beeswax | exec_option: {'mt_dop': 1, 'debug_action': None, > 'default_spillable_buffer_size': '256k'} | table_format: parquet/none] > {code} > ImpalaBeeswaxException: EQuery aborted:Admission for query exceeded > timeout 6ms in pool default-pool. Queued reason: Not enough admission > control slots available on host ... . Needed 1 slots but 18/16 are already in > use. Additional Details: Not Applicable > {code} > This happened in an ASAN build. 
Another test also failed which may be related > to the cause: > custom_cluster.test_admission_controller.TestAdmissionController.test_queue_reasons_slots > > {code} > Timeout: query 'e1410add778cd7b0:c40812b9' did not reach one of the > expected states [4], last known state 5 > {code} > test_queue_reasons_slots seems to be a known flaky test: IMPALA-10338
[jira] [Created] (IMPALA-13024) Several tests timeout waiting for admission
Csaba Ringhofer created IMPALA-13024: Summary: Several tests timeout waiting for admission Key: IMPALA-13024 URL: https://issues.apache.org/jira/browse/IMPALA-13024 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer A bunch of seemingly unrelated tests failed with the following message: Example: query_test.test_spilling.TestSpillingDebugActionDimensions.test_spilling_aggs[protocol: beeswax | exec_option: {'mt_dop': 1, 'debug_action': None, 'default_spillable_buffer_size': '256k'} | table_format: parquet/none] {code} ImpalaBeeswaxException: EQuery aborted:Admission for query exceeded timeout 6ms in pool default-pool. Queued reason: Not enough admission control slots available on host ... . Needed 1 slots but 18/16 are already in use. Additional Details: Not Applicable {code} This happened in an ASAN build. Another test also failed which may be related to the cause: custom_cluster.test_admission_controller.TestAdmissionController.test_queue_reasons_slots {code} Timeout: query 'e1410add778cd7b0:c40812b9' did not reach one of the expected states [4], last known state 5 {code} test_queue_reasons_slots seems to be a known flaky test: IMPALA-10338
[jira] [Created] (IMPALA-13021) Failed test: test_iceberg_deletes_and_updates_and_optimize
Csaba Ringhofer created IMPALA-13021: Summary: Failed test: test_iceberg_deletes_and_updates_and_optimize Key: IMPALA-13021 URL: https://issues.apache.org/jira/browse/IMPALA-13021 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer {code} test_iceberg_deletes_and_updates_and_optimize run_tasks([deleter, updater, optimizer, checker]) stress/stress_util.py:46: in run_tasks pool.map_async(Task.run, tasks).get(timeout_seconds) Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/multiprocessing/pool.py:568: in get raise TimeoutError E TimeoutError {code} This happened in an exhaustive test run with data cache.
[jira] [Updated] (IMPALA-5323) Support Kudu BINARY
[ https://issues.apache.org/jira/browse/IMPALA-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-5323: Fix Version/s: Impala 4.4.0 > Support Kudu BINARY > --- > > Key: IMPALA-5323 > URL: https://issues.apache.org/jira/browse/IMPALA-5323 > Project: IMPALA > Issue Type: New Feature > Components: Backend >Reporter: Pavel Martynov >Assignee: Csaba Ringhofer >Priority: Major > Labels: kudu > Fix For: Impala 4.4.0 > > > I trying to 'CREATE EXTERNAL TABLE STORED AS KUDU' on the table with BINARY > Kudu column data type and got an error: Kudu type 'binary' is not supported > in Impala. > This limitation is not documented, checked: > https://impala.incubator.apache.org/docs/build/html/topics/impala_kudu.html > https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations > There are some thoughts that Kudu BINARY data type may be supported by > Impala's STRING data type: > https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Does-impala-support-binary-data-type/td-p/24366 > https://groups.google.com/a/cloudera.org/forum/#!msg/impala-user/muguKJU3c3I/_oArmoxSlDMJ
[jira] [Resolved] (IMPALA-5323) Support Kudu BINARY
[ https://issues.apache.org/jira/browse/IMPALA-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-5323. - Resolution: Fixed > Support Kudu BINARY > --- > > Key: IMPALA-5323 > URL: https://issues.apache.org/jira/browse/IMPALA-5323 > Project: IMPALA > Issue Type: New Feature > Components: Backend >Reporter: Pavel Martynov >Assignee: Csaba Ringhofer >Priority: Major > Labels: kudu > Fix For: Impala 4.4.0 > > > I trying to 'CREATE EXTERNAL TABLE STORED AS KUDU' on the table with BINARY > Kudu column data type and got an error: Kudu type 'binary' is not supported > in Impala. > This limitation is not documented, checked: > https://impala.incubator.apache.org/docs/build/html/topics/impala_kudu.html > https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations > There are some thoughts that Kudu BINARY data type may be supported by > Impala's STRING data type: > https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Does-impala-support-binary-data-type/td-p/24366 > https://groups.google.com/a/cloudera.org/forum/#!msg/impala-user/muguKJU3c3I/_oArmoxSlDMJ
[jira] [Work started] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows
[ https://issues.apache.org/jira/browse/IMPALA-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-12990 started by Csaba Ringhofer. > impala-shell broken if Iceberg delete deletes 0 rows > > > Key: IMPALA-12990 > URL: https://issues.apache.org/jira/browse/IMPALA-12990 > Project: IMPALA > Issue Type: Bug > Components: Clients >Reporter: Csaba Ringhofer >Assignee: Csaba Ringhofer >Priority: Major > Labels: iceberg > > Happens only with Python 3 > {code} > impala-python3 shell/impala_shell.py > create table icebergupdatet (i int, s string) stored as iceberg; > alter table icebergupdatet set tblproperties("format-version"="2"); > delete from icebergupdatet where i=0; > Unknown Exception : '>' not supported between instances of 'NoneType' and > 'int' > Traceback (most recent call last): > File "shell/impala_shell.py", line 1428, in _execute_stmt > if is_dml and num_rows == 0 and num_deleted_rows > 0: > TypeError: '>' not supported between instances of 'NoneType' and 'int' > {code} > The same error should also happen when the delete removes > 0 rows, but the > impala server has an older version that doesn't set TDmlResult.rows_deleted
[jira] [Commented] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows
[ https://issues.apache.org/jira/browse/IMPALA-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835793#comment-17835793 ] Csaba Ringhofer commented on IMPALA-12990: -- https://gerrit.cloudera.org/#/c/21284 > impala-shell broken if Iceberg delete deletes 0 rows > > > Key: IMPALA-12990 > URL: https://issues.apache.org/jira/browse/IMPALA-12990 > Project: IMPALA > Issue Type: Bug > Components: Clients >Reporter: Csaba Ringhofer >Priority: Major > Labels: iceberg > > Happens only with Python 3 > {code} > impala-python3 shell/impala_shell.py > create table icebergupdatet (i int, s string) stored as iceberg; > alter table icebergupdatet set tblproperties("format-version"="2"); > delete from icebergupdatet where i=0; > Unknown Exception : '>' not supported between instances of 'NoneType' and > 'int' > Traceback (most recent call last): > File "shell/impala_shell.py", line 1428, in _execute_stmt > if is_dml and num_rows == 0 and num_deleted_rows > 0: > TypeError: '>' not supported between instances of 'NoneType' and 'int' > {code} > The same error should also happen when the delete removes > 0 rows, but the > impala server has an older version that doesn't set TDmlResult.rows_deleted
[jira] [Created] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows
Csaba Ringhofer created IMPALA-12990: Summary: impala-shell broken if Iceberg delete deletes 0 rows Key: IMPALA-12990 URL: https://issues.apache.org/jira/browse/IMPALA-12990 Project: IMPALA Issue Type: Bug Components: Clients Reporter: Csaba Ringhofer Happens only with Python 3 {code} impala-python3 shell/impala_shell.py create table icebergupdatet (i int, s string) stored as iceberg; alter table icebergupdatet set tblproperties("format-version"="2"); delete from icebergupdatet where i=0; Unknown Exception : '>' not supported between instances of 'NoneType' and 'int' Traceback (most recent call last): File "shell/impala_shell.py", line 1428, in _execute_stmt if is_dml and num_rows == 0 and num_deleted_rows > 0: TypeError: '>' not supported between instances of 'NoneType' and 'int' {code} The same error should also happen when the delete removes > 0 rows, but the impala server has an older version that doesn't set TDmlResult.rows_deleted
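The fix is to guard the comparison before it reaches Python 3's strict ordering rules: when TDmlResult.rows_deleted is unset (older servers, or the field simply not populated), num_deleted_rows arrives as None, and None > 0 raises TypeError. A sketch of the guarded check, extracted into a helper for clarity; the function name is hypothetical, the condition mirrors impala_shell.py line 1428:

```python
def deleted_rows_but_empty_result(is_dml, num_rows, num_deleted_rows):
    # num_deleted_rows is None when the server doesn't set
    # TDmlResult.rows_deleted; coerce to 0 so Python 3 never compares
    # None with an int (which raises TypeError, unlike Python 2).
    return bool(is_dml and num_rows == 0 and (num_deleted_rows or 0) > 0)

# A delete that matched 0 rows (num_deleted_rows=None) no longer raises:
assert deleted_rows_but_empty_result(True, 0, None) is False
# A delete that removed rows is still reported:
assert deleted_rows_but_empty_result(True, 0, 3) is True
```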
[jira] [Updated] (IMPALA-12987) Errors with \0 character in partition values
[ https://issues.apache.org/jira/browse/IMPALA-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12987: - Description: Inserting strings with "\0" values to partition columns leads errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. {code} The partition directory created above seems truncated: hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a In partition Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's Note that Java handles \0 characters in unicode in a special way, which may be related: https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542 was: Inserting strings with "\0" values to partition columns leads errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. 
Check the catalog server log for more details. {code} The partition directory created above seems truncated: hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a In partition Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's Note Java handles \0 characters in unicode in a special way, which may be related: https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542 > Errors with \0 character in partition values > > > Key: IMPALA-12987 > URL: https://issues.apache.org/jira/browse/IMPALA-12987 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > Labels: iceberg > > Inserting strings with "\0" values to partition columns leads errors both in > Iceberg and Hive tables. > The issue is more severe in Iceberg tables as from this point the table can't > be read in Impala or Hive: > {code} > create table iceberg_unicode (s string, p string) partitioned by spec > (identity(p)) stored as iceberg; > insert into iceberg_unicode select "a", "a\0a"; > ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table > hdfs://localhost:20500/test-warehouse/iceberg_unicode > CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 > paths for table default.iceberg_unicode: failed to load 1 paths. Check the > catalog server log for more details. > {code} > The partition directory created above seems truncated: > hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a > In partition Hive tables the insert also returns an error, but the new > partition is not created and the table remains usable. 
The error is similar > to IMPALA-11499's > Note that Java handles \0 characters in unicode in a special way, which may > be related: > https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542
[jira] [Updated] (IMPALA-12987) Errors with \0 character in partition values
[ https://issues.apache.org/jira/browse/IMPALA-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12987: - Description: Inserting strings with "\0" values to partition columns leads errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. {code} The partition directory created above seems truncated: hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a In partition Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's Note Java handles \0 characters in unicode in a special way, which may be related: https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542 was: Inserting strings with "\0" values to partition columns leads errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. 
{code} In partition Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's > Errors with \0 character in partition values > > > Key: IMPALA-12987 > URL: https://issues.apache.org/jira/browse/IMPALA-12987 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > Labels: iceberg > > Inserting strings with "\0" values to partition columns leads errors both in > Iceberg and Hive tables. > The issue is more severe in Iceberg tables as from this point the table can't > be read in Impala or Hive: > {code} > create table iceberg_unicode (s string, p string) partitioned by spec > (identity(p)) stored as iceberg; > insert into iceberg_unicode select "a", "a\0a"; > ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table > hdfs://localhost:20500/test-warehouse/iceberg_unicode > CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 > paths for table default.iceberg_unicode: failed to load 1 paths. Check the > catalog server log for more details. > {code} > The partition directory created above seems truncated: > hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a > In partition Hive tables the insert also returns an error, but the new > partition is not created and the table remains usable. The error is similar > to IMPALA-11499's > Note Java handles \0 characters in unicode in a special way, which may be > related: > https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12987) Errors with \0 character in partition values
Csaba Ringhofer created IMPALA-12987: Summary: Errors with \0 character in partition values Key: IMPALA-12987 URL: https://issues.apache.org/jira/browse/IMPALA-12987 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer Inserting strings with "\0" values to partition columns leads to errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. {code} In partitioned Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's
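Until the underlying handling is fixed, a client-side guard could reject partition values containing NUL before the write starts, since the later updates to this issue report the partition directory truncated at the NUL (p=a for the value "a\0a"). A hypothetical sketch of such a validation step, not actual Impala code:

```python
def validate_partition_value(value):
    # Reject NUL bytes up front: downstream C-string / JNI modified-UTF-8
    # handling appears to truncate the partition directory name at \0
    # (observed: inserting "a\0a" created .../data/p=a).
    if "\0" in value:
        raise ValueError("partition value contains NUL byte: %r" % value)
    return value

assert validate_partition_value("a") == "a"
```

Failing fast here keeps the Iceberg table readable instead of leaving it with a metadata path that no longer matches the files on disk.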
[jira] [Updated] (IMPALA-12987) Errors with \0 character in partition values
[ https://issues.apache.org/jira/browse/IMPALA-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12987: - Description: Inserting strings with "\0" values to partition columns leads errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. {code} In partition Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's was: Inserting strings with "\0" values to partition columns leads errors both in Iceberg and Hive tables. The issue issue more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. {code} In partition Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. 
The error is similare to IMPALA-11499's > Errors with \0 character in partition values > > > Key: IMPALA-12987 > URL: https://issues.apache.org/jira/browse/IMPALA-12987 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > Labels: iceberg > > Inserting strings with "\0" values to partition columns leads errors both in > Iceberg and Hive tables. > The issue is more severe in Iceberg tables as from this point the table can't > be read in Impala or Hive: > {code} > create table iceberg_unicode (s string, p string) partitioned by spec > (identity(p)) stored as iceberg; > insert into iceberg_unicode select "a", "a\0a"; > ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table > hdfs://localhost:20500/test-warehouse/iceberg_unicode > CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 > paths for table default.iceberg_unicode: failed to load 1 paths. Check the > catalog server log for more details. > {code} > In partition Hive tables the insert also returns an error, but the new > partition is not created and the table remains usable. The error is similar > to IMPALA-11499's -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12969) DeserializeThriftMsg may leak JNI resources
[ https://issues.apache.org/jira/browse/IMPALA-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12969: - Priority: Critical (was: Major) > DeserializeThriftMsg may leak JNI resources > --- > > Key: IMPALA-12969 > URL: https://issues.apache.org/jira/browse/IMPALA-12969 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > Fix For: Impala 4.4.0 > > > JNI's GetByteArrayElements should be followed by a ReleaseByteArrayElements > call, but this is not done in case there is an error during deserialization: > [https://github.com/apache/impala/blob/f05eac647647b5e03c3aafc35f785c73d07e2658/be/src/rpc/jni-thrift-util.h#L66]
[jira] [Resolved] (IMPALA-12969) DeserializeThriftMsg may leak JNI resources
[ https://issues.apache.org/jira/browse/IMPALA-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-12969. -- Fix Version/s: Impala 4.4.0 Resolution: Fixed > DeserializeThriftMsg may leak JNI resources > --- > > Key: IMPALA-12969 > URL: https://issues.apache.org/jira/browse/IMPALA-12969 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > Fix For: Impala 4.4.0 > > > JNI's GetByteArrayElements should be followed by a ReleaseByteArrayElements > call, but this is not done in case there is an error during deserialization: > [https://github.com/apache/impala/blob/f05eac647647b5e03c3aafc35f785c73d07e2658/be/src/rpc/jni-thrift-util.h#L66]
[jira] [Created] (IMPALA-12978) IMPALA-12544 made impala-shell incompatible with old impala servers
Csaba Ringhofer created IMPALA-12978: Summary: IMPALA-12544 made impala-shell incompatible with old impala servers Key: IMPALA-12978 URL: https://issues.apache.org/jira/browse/IMPALA-12978 Project: IMPALA Issue Type: Bug Components: Clients Reporter: Csaba Ringhofer IMPALA-12544 uses "progress.total_fragment_instances > 0:", but total_fragment_instances is None if the server is older and does not know this Thrift member yet (added in IMPALA-12048). [https://github.com/apache/impala/blob/fb3c379f395635f9f6927b40694bc3dd95a2866f/shell/impala_shell.py#L1320] This leads to error messages in interactive shell sessions when progress reporting is enabled.
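Like the IMPALA-12990 shell bug, the compatible form of the check coerces the missing Thrift field to 0 before comparing. A minimal sketch; the helper name is hypothetical, the condition mirrors the check at impala_shell.py#L1320:

```python
def should_show_progress(total_fragment_instances):
    # total_fragment_instances is None when the coordinator predates
    # IMPALA-12048 and never populates the Thrift member; treating None
    # as 0 avoids the Python 3 TypeError from "None > 0".
    return (total_fragment_instances or 0) > 0

assert should_show_progress(None) is False  # old server: no progress bar
assert should_show_progress(8) is True      # new server reporting 8 instances
```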
[jira] [Created] (IMPALA-12969) DeserializeThriftMsg may leak JNI resources
Csaba Ringhofer created IMPALA-12969: Summary: DeserializeThriftMsg may leak JNI resources Key: IMPALA-12969 URL: https://issues.apache.org/jira/browse/IMPALA-12969 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer JNI's GetByteArrayElements should be followed by a ReleaseByteArrayElements call, but this is not done when there is an error during deserialization: [https://github.com/apache/impala/blob/f05eac647647b5e03c3aafc35f785c73d07e2658/be/src/rpc/jni-thrift-util.h#L66]
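The leak pattern and its fix can be illustrated with a small language-neutral sketch (Python try/finally standing in for the C++ JNI acquire/release calls; FakeJniEnv is a toy model, not a real JNI binding):

```python
class FakeJniEnv:
    """Toy stand-in for a JNIEnv: counts acquire/release pairing so a leak
    (an acquire without a matching release) is observable."""
    def __init__(self):
        self.outstanding = 0
    def get_byte_array_elements(self, arr):
        self.outstanding += 1
        return bytes(arr)
    def release_byte_array_elements(self, arr):
        self.outstanding -= 1

def deserialize(env, arr, fail=False):
    buf = env.get_byte_array_elements(arr)
    try:
        if fail:
            raise ValueError("deserialization error")
        return len(buf)
    finally:
        # The fix for the leak: release on *every* path, including the
        # error path that the original code skipped.
        env.release_byte_array_elements(arr)

env = FakeJniEnv()
deserialize(env, [1, 2, 3])
try:
    deserialize(env, [1, 2, 3], fail=True)
except ValueError:
    pass
print(env.outstanding)  # 0: nothing leaked even on the error path
```

In the C++ code the same effect is typically achieved with a scope guard so ReleaseByteArrayElements runs on early returns as well.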
[jira] [Updated] (IMPALA-12968) Early EndDataStream RPC could be responded earlier
[ https://issues.apache.org/jira/browse/IMPALA-12968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12968: - Description: When a producer fragment sends no rows and finishes before the receiver is initialized the EndDataStream rpc is stored as early sender and is responded when the receiver is registered. [https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150] While it is important to store the information that the EOS has happened to unregister the sender from the receiver, the RPC itself could be responded right after it was stored in the early sender map. was: When a producer fragment sends no rows and finishes before the receiver is initialized te e EndDataStream rpc is stored as early sender and is responded when the receiver is registered. [https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150] While it is important to store the information that the EOS has happened to unregister the sender from the receiver, the RPC itself could be responded right after it was stored in the early sender map. > Early EndDataStream RPC could be responded earlier > -- > > Key: IMPALA-12968 > URL: https://issues.apache.org/jira/browse/IMPALA-12968 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Csaba Ringhofer >Priority: Minor > Labels: krpc > > When a producer fragment sends no rows and finishes before the receiver is > initialized the EndDataStream rpc is stored as early sender and is responded > when the receiver is registered. > [https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150] > While it is important to store the information that the EOS has happened to > unregister the sender from the receiver, the RPC itself could be responded > right after it was stored in the early sender map. 
[jira] [Created] (IMPALA-12968) Early EndDataStream RPC could be responded earlier
Csaba Ringhofer created IMPALA-12968: Summary: Early EndDataStream RPC could be responded earlier Key: IMPALA-12968 URL: https://issues.apache.org/jira/browse/IMPALA-12968 Project: IMPALA Issue Type: Improvement Components: Backend Reporter: Csaba Ringhofer When a producer fragment sends no rows and finishes before the receiver is initialized, the EndDataStream RPC is stored as an early sender and is responded to when the receiver is registered. [https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150] While it is important to store the information that the EOS has happened in order to unregister the sender from the receiver, the RPC itself could be responded to right after it was stored in the early sender map.
[jira] [Comment Edited] (IMPALA-10349) Revisit constant folding on non-ASCII strings
[ https://issues.apache.org/jira/browse/IMPALA-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830545#comment-17830545 ] Csaba Ringhofer edited comment on IMPALA-10349 at 3/25/24 3:55 PM: --- Also bumped into this related to pushing down to Kudu: {code:java} explain select count(*) from functional_kudu.alltypes where string_col = "á"; -- kudu predicates: string_col = 'á' explain select count(*) from functional_kudu.alltypes where string_col = concat("a", "") -- kudu predicates: string_col = 'a' explain select count(*) from functional_kudu.alltypes where string_col = concat("á", "") -- not pushed down to Kudu: -- predicates: string_col = concat('á', '') {code} >I think we should allow folding non-ASCII strings if they are legal UTF-8 >strings. [~stigahuang] Do you know why it is not possible to fold strings that are not valid UTF-8? Currently BINARY columns also use StringLiterals, e.g. cast("a" as binary) will be folded to a StringLiteral. It would be useful to also fold expressions like cast(unhex("aa") as binary) to be able to push them down to Kudu. was (Author: csringhofer): Also bumped into this related to pushing down to Kudu: {code:java} explain select count(*) from functional_kudu.alltypes where string_col = "á"; -- kudu predicates: string_col = 'á' explain select count(*) from functional_kudu.alltypes where string_col = concat("a", "") -- kudu predicates: string_col = 'a' explain select count(*) from functional_kudu.alltypes where string_col = concat("á", "") -- not pushed down to Kudu: -- predicates: string_col = concat('á', '') {code} >I think we should allow folding non-ASCII strings if they are legal UTF-8 >strings. [~stigahuang] Do you why is it not possible to fold strings that are not valid UTF-8? Currently BINARY columns also use StringLiterals, a.g cast("a" as binary) will be folded to a StringLiteral. 
It would be useful to also fold expressions like cast(unhex("aa") as binary) to be able to push them down to Kudu. > Revisit constant folding on non-ASCII strings > - > > Key: IMPALA-10349 > URL: https://issues.apache.org/jira/browse/IMPALA-10349 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Reporter: Quanlong Huang >Priority: Critical > > Constant folding may produce non-ASCII strings. In such cases, we currently > abandon folding the constant. See commit message of IMPALA-1788 or codes > here: > [https://github.com/apache/impala/blob/9672d945963e1ca3c8699340f92d7d6ce1d91c9f/fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java#L274-L282] > I think we should allow folding non-ASCII strings if they are legal UTF-8 > strings. > Example of constant folding work: > {code:java} > Query: explain select * from functional.alltypes where string_col = > substr('123', 1, 1) > +-+ > | Explain String | > +-+ > | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 | > | Per-Host Resource Estimates: Memory=160MB | > | Codegen disabled by planner | > | | > | PLAN-ROOT SINK | > | | | > | 01:EXCHANGE [UNPARTITIONED] | > | | | > | 00:SCAN HDFS [functional.alltypes] | > |HDFS partitions=24/24 files=24 size=478.45KB | > |predicates: string_col = '1' | > |row-size=89B cardinality=730 | > +-+ > {code} > Example of constant folding doesn't work: > {code:java} > Query: explain select * from functional.alltypes where string_col = > substr('引擎', 1, 3) > +-+ > | Explain String | > +-+ > | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 | > | Per-Host Resource Estimates: Memory=160MB | > | Codegen disabled by planner | > | | > | PLAN-ROOT SINK | > | | | > | 01:EXCHANGE [UNPARTITIONED] | > | | | > | 0
[jira] [Commented] (IMPALA-10349) Revisit constant folding on non-ASCII strings
[ https://issues.apache.org/jira/browse/IMPALA-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830545#comment-17830545 ] Csaba Ringhofer commented on IMPALA-10349: -- Also bumped into this related to pushing down to Kudu: {code} explain select count(*) from functional_kudu.alltypes where string_col = "á"; -- kudu predicates: string_col = 'á' explain select count(*) from functional_kudu.alltypes where string_col = concat("a", "") -- kudu predicates: string_col = 'a' explain select count(*) from functional_kudu.alltypes where string_col = concat("á", "") -- not pushed down to Kudu: -- predicates: string_col = concat('á', '') {code} >I think we should allow folding non-ASCII strings if they are legal UTF-8 >strings. [~stigahuang] Do you know why it is not possible to fold strings that are not valid UTF-8? Currently BINARY columns also use StringLiterals, e.g. cast("a" as binary) will be folded to a StringLiteral. It would be useful to also fold expressions like cast(unhex("aa") as binary) to be able to push them down to Kudu. > Revisit constant folding on non-ASCII strings > - > > Key: IMPALA-10349 > URL: https://issues.apache.org/jira/browse/IMPALA-10349 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Reporter: Quanlong Huang >Priority: Critical > > Constant folding may produce non-ASCII strings. In such cases, we currently > abandon folding the constant. See commit message of IMPALA-1788 or codes > here: > [https://github.com/apache/impala/blob/9672d945963e1ca3c8699340f92d7d6ce1d91c9f/fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java#L274-L282] > I think we should allow folding non-ASCII strings if they are legal UTF-8 > strings. 
> An example where constant folding works: > {code:java} > Query: explain select * from functional.alltypes where string_col = > substr('123', 1, 1) > +-+ > | Explain String | > +-+ > | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 | > | Per-Host Resource Estimates: Memory=160MB | > | Codegen disabled by planner | > | | > | PLAN-ROOT SINK | > | | | > | 01:EXCHANGE [UNPARTITIONED] | > | | | > | 00:SCAN HDFS [functional.alltypes] | > |HDFS partitions=24/24 files=24 size=478.45KB | > |predicates: string_col = '1' | > |row-size=89B cardinality=730 | > +-+ > {code} > An example where constant folding doesn't work: > {code:java} > Query: explain select * from functional.alltypes where string_col = > substr('引擎', 1, 3) > +-+ > | Explain String | > +-+ > | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 | > | Per-Host Resource Estimates: Memory=160MB | > | Codegen disabled by planner | > | | > | PLAN-ROOT SINK | > | | | > | 01:EXCHANGE [UNPARTITIONED] | > | | | > | 00:SCAN HDFS [functional.alltypes] | > |HDFS partitions=24/24 files=24 size=478.45KB | > |predicates: string_col = substr('引擎', 1, 3)| > |row-size=89B cardinality=730 | > +-+ > {code}
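The proposed rule — allow folding when the result is legal UTF-8 — boils down to a validity check on the folded bytes. A sketch of that check (illustrative Python, not the actual Java frontend code):

```python
def is_valid_utf8(data: bytes) -> bool:
    # A folded constant that decodes cleanly as UTF-8 can be rendered back
    # as a string literal; arbitrary binary (e.g. the result of unhex())
    # may not round-trip through a text literal.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("á".encode("utf-8")))  # True: non-ASCII but legal UTF-8
print(is_valid_utf8(bytes.fromhex("aa")))  # False: a lone 0xAA continuation byte
```

Under this rule `concat("á", "")` would fold (valid UTF-8), while folding `unhex("aa")` for BINARY push-down would need a separate representation that does not require UTF-8 validity.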
[jira] [Commented] (IMPALA-12927) Support reading BINARY columns in JSON tables
[ https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829953#comment-17829953 ] Csaba Ringhofer commented on IMPALA-12927: -- I think the best approach would be to check the tbl property "json.binary.format": * if not set, give a clear error message * if base64, do base64 decoding * if rawstring, handle it the way Hive does: [https://github.com/apache/hive/blame/f216bbb632752f467321869cee03adf9477409cf/serde/src/java/org/apache/hadoop/hive/serde2/json/HiveJsonReader.java#L455] Note that I don't know exactly how special characters are handled in the rawstring case. > Support reading BINARY columns in JSON tables > - > > Key: IMPALA-12927 > URL: https://issues.apache.org/jira/browse/IMPALA-12927 > Project: IMPALA > Issue Type: Sub-task > Components: Backend >Reporter: Csaba Ringhofer >Assignee: Zihao Ye >Priority: Major > > Currently Impala cannot read BINARY columns in JSON files written by Hive > correctly and returns runtime errors: > {code} > select * from functional_json.binary_tbl; > ++--++ > | id | string_col | binary_col | > ++--++ > | 1 | ascii | NULL | > | 2 | ascii | NULL | > | 3 | null | NULL | > | 4 | empty | | > | 5 | valid utf8 | NULL | > | 6 | valid utf8 | NULL | > | 7 | invalid utf8 | NULL | > | 8 | invalid utf8 | NULL | > ++--++ > WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, > type: STRING, data: 'binary1' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: 'binary2' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: 'árvíztűrőtükörfúró' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > 
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: '你好hello' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: '��' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: '�D3"' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > {code} > The single file in the table looks like this: > {code} > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0 > {"id":1,"string_col":"ascii","binary_col":"binary1"} > {"id":2,"string_col":"ascii","binary_col":"binary2"} > {"id":3,"string_col":"null","binary_col":null} > {"id":4,"string_col":"empty","binary_col":""} > {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"} > {"id":6,"string_col":"valid utf8","binary_col":"你好hello"} > {"id":7,"string_col":"invalid utf8","binary_col":"\u�\u�"} > {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u"} > {code} >
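The three-way proposal above (no property: clear error; base64: decode; rawstring: Hive-style) can be sketched as a small dispatcher. This is illustrative Python, not Impala code; the property name "json.binary.format" and its base64/rawstring values come from Hive, everything else here is hypothetical:

```python
import base64

def decode_json_binary(value: str, tbl_props: dict) -> bytes:
    # Dispatch on the Hive table property "json.binary.format", as proposed:
    # missing -> clear error, "base64" -> decode, "rawstring" -> take the
    # JSON string's bytes as-is (mirroring Hive's rawstring handling).
    fmt = tbl_props.get("json.binary.format")
    if fmt is None:
        raise ValueError(
            "BINARY column in JSON table but 'json.binary.format' is not set")
    if fmt == "base64":
        return base64.b64decode(value)
    if fmt == "rawstring":
        return value.encode("utf-8")
    raise ValueError("unsupported json.binary.format: " + fmt)

print(decode_json_binary("YWJjZA==", {"json.binary.format": "base64"}))  # b'abcd'
```

Failing loudly on a missing property avoids the silent base64-misinterpretation the surrounding comments warn about.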
[jira] [Comment Edited] (IMPALA-12927) Support reading BINARY columns in JSON tables
[ https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829614#comment-17829614 ] Csaba Ringhofer edited comment on IMPALA-12927 at 3/21/24 3:47 PM: --- [~Eyizoha] About AuxColumnType: fyi there is an ongoing refactor to remove that class and make it easier to decide whether a column is STRING or BINARY: [https://gerrit.cloudera.org/#/c/21157/] About encoding of BINARY columns: I looked at the Hive code, but it doesn't match with the encoding I see in the files. [https://github.com/apache/hive/blob/9a0ce4e15890aa91f05322e845438e1e8830b1c3/serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java#L135] Current Apache Hive seems to default to using base64 encoding, while it can be altered with tbl property "json.binary.format". In the JSON tables in Impala's dataload the files are certainly not base64 encoded and "json.binary.format" is also not set, so it doesn't seem to work like the current Hive codebase. It is possible that this is related to differences between Apache Impala's Hive dependency and current Apache Hive. Currently Impala base64 decodes the BINARY columns: {code:java} Hive: create table tjsonbinary (s string, b binary) stored as JSONFILE; insert into tjsonbinary values ("abcd", base64(cast("abcd" as binary))); Impala: select * from tjsonbinary; +--+--+ | s | b | +--+--+ | abcd | abcd | +--+--+ {code} What do you think about disabling BINARY column reading in JSON until Hive compatibility is clarified? My concern is that besides error messages and nulled values this may actually lead to correctness issues as many strings are both valid utf8 strings and base64 strings, so Impala may return unintended results. 
was (Author: csringhofer): [~Eyizoha] About AuxColumnType: fyi is there is an ongoing refactor to remove that class and make it easier to decided whether a column is STRING or BINARY: [https://gerrit.cloudera.org/#/c/21157/] About encoding of BINARY columns: I looked at the Hive code, but it doesn't match with the encoding I see in the files. [https://github.com/apache/hive/blob/9a0ce4e15890aa91f05322e845438e1e8830b1c3/serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java#L135] Current Apache Hive seems to default to using base64 encoding, while it can be altered with tbl property "json.binary.format". In the JSON tables in Impala's dataload the files are certainly not base64 encoded and "json.binary.format" is also not set, so it doesn't seem to work like the current Hive codebase. It is possible that this is related to differences between Apache Impala's Hive dependency and current Apache Hive. Currently Impala base64 decodes the BINARY columns: {code} Hive: create table tjsonbinary (string s, binary b) stored as JSONFILE; insert into tjsonbinary values ("abcd", base64(cast("abcd" as binary))); Impala: select * from tjsonbinary; +--+--+ | s | b | +--+--+ | abcd | abcd | +--+--+ {code} What do you think about disabling BINARY column reading in JSON until Hive compatibility is clarified? My concern is that besides error messages and nulled values this may actually lead to correctness issues as many strings are both valid utf8 strings and base64 strings, so Impala may return unintended results. 
> Support reading BINARY columns in JSON tables > - > > Key: IMPALA-12927 > URL: https://issues.apache.org/jira/browse/IMPALA-12927 > Project: IMPALA > Issue Type: Sub-task > Components: Backend >Reporter: Csaba Ringhofer >Assignee: Zihao Ye >Priority: Major > > Currently Impala cannot read BINARY columns in JSON files written by Hive > correctly and returns runtime errors: > {code} > select * from functional_json.binary_tbl; > ++--++ > | id | string_col | binary_col | > ++--++ > | 1 | ascii | NULL | > | 2 | ascii | NULL | > | 3 | null | NULL | > | 4 | empty | | > | 5 | valid utf8 | NULL | > | 6 | valid utf8 | NULL | > | 7 | invalid utf8 | NULL | > | 8 | invalid utf8 | NULL | > ++--++ > WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, > type: STRING, data: 'binary1' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: 'binary2' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: 'á
[jira] [Commented] (IMPALA-12927) Support reading BINARY columns in JSON tables
[ https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829614#comment-17829614 ] Csaba Ringhofer commented on IMPALA-12927: -- [~Eyizoha] About AuxColumnType: fyi there is an ongoing refactor to remove that class and make it easier to decide whether a column is STRING or BINARY: [https://gerrit.cloudera.org/#/c/21157/] About encoding of BINARY columns: I looked at the Hive code, but it doesn't match with the encoding I see in the files. [https://github.com/apache/hive/blob/9a0ce4e15890aa91f05322e845438e1e8830b1c3/serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java#L135] Current Apache Hive seems to default to using base64 encoding, while it can be altered with tbl property "json.binary.format". In the JSON tables in Impala's dataload the files are certainly not base64 encoded and "json.binary.format" is also not set, so it doesn't seem to work like the current Hive codebase. It is possible that this is related to differences between Apache Impala's Hive dependency and current Apache Hive. Currently Impala base64 decodes the BINARY columns: {code} Hive: create table tjsonbinary (s string, b binary) stored as JSONFILE; insert into tjsonbinary values ("abcd", base64(cast("abcd" as binary))); Impala: select * from tjsonbinary; +--+--+ | s | b | +--+--+ | abcd | abcd | +--+--+ {code} What do you think about disabling BINARY column reading in JSON until Hive compatibility is clarified? My concern is that besides error messages and nulled values this may actually lead to correctness issues as many strings are both valid utf8 strings and base64 strings, so Impala may return unintended results. 
> Support reading BINARY columns in JSON tables > - > > Key: IMPALA-12927 > URL: https://issues.apache.org/jira/browse/IMPALA-12927 > Project: IMPALA > Issue Type: Sub-task > Components: Backend >Reporter: Csaba Ringhofer >Assignee: Zihao Ye >Priority: Major > > Currently Impala cannot read BINARY columns in JSON files written by Hive > correctly and returns runtime errors: > {code} > select * from functional_json.binary_tbl; > ++--++ > | id | string_col | binary_col | > ++--++ > | 1 | ascii | NULL | > | 2 | ascii | NULL | > | 3 | null | NULL | > | 4 | empty | | > | 5 | valid utf8 | NULL | > | 6 | valid utf8 | NULL | > | 7 | invalid utf8 | NULL | > | 8 | invalid utf8 | NULL | > ++--++ > WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, > type: STRING, data: 'binary1' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: 'binary2' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: 'árvíztűrőtükörfúró' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: '你好hello' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: '��' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: '�D3"' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > {code} > The single file in the table 
looks like this: > {code} > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0 > {"id":1,"string_col":"ascii","binary_col":"binary1"} > {"id":2,"string_col":"ascii","binary_col":"binary2"} > {"id":3,"string_col":"null","binary_col":null} > {"id":4,"string_col":"empty","binary_col":""} > {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"} > {"id":6,"string_col":"valid utf8","binary_col":"你好hello"} > {"id":7,"string_col":"invalid utf8","binary_col":"\u�\u�"} > {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u"} > {code} >
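The correctness concern above — many raw strings are simultaneously valid UTF-8 and valid base64, so unconditional base64-decoding corrupts them silently — can be demonstrated directly (plain Python, independent of Impala):

```python
import base64

# "abcd" is an ordinary raw string, but it is *also* valid base64, so a
# reader that unconditionally base64-decodes it returns different bytes
# than the raw-string interpretation -- with no error raised.
raw = "abcd"
decoded = base64.b64decode(raw)
print(decoded)                  # b'i\xb7\x1d' -- not b'abcd'
print(decoded != raw.encode())  # True: silent corruption, not a failure
```

This is why relying on decode failures to detect the wrong encoding is unsafe: the ambiguous cases succeed and return unintended results.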
[jira] [Commented] (IMPALA-12927) Support reading BINARY columns in JSON tables
[ https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829192#comment-17829192 ] Csaba Ringhofer commented on IMPALA-12927: -- [~Eyizoha] I see that BINARY tests are explicitly skipped for JSON, but I couldn't find any discussion about this in the commit that added the JSON scanner: [https://gerrit.cloudera.org/#/c/19699/33/tests/query_test/test_scanners.py] Do you have an idea on what to do with BINARY columns? I am not familiar with Hive's JSON files, so I don't know what the intended encoding for BINARY columns is. I know that the JSON format doesn't support binary values, so generally some encoding (e.g. base64) is used to convert byte arrays to some ASCII representation. > Support reading BINARY columns in JSON tables > - > > Key: IMPALA-12927 > URL: https://issues.apache.org/jira/browse/IMPALA-12927 > Project: IMPALA > Issue Type: Sub-task > Components: Backend >Reporter: Csaba Ringhofer >Priority: Major > > Currently Impala cannot read BINARY columns in JSON files written by Hive > correctly and returns runtime errors: > {code} > select * from functional_json.binary_tbl; > ++--++ > | id | string_col | binary_col | > ++--++ > | 1 | ascii | NULL | > | 2 | ascii | NULL | > | 3 | null | NULL | > | 4 | empty | | > | 5 | valid utf8 | NULL | > | 6 | valid utf8 | NULL | > | 7 | invalid utf8 | NULL | > | 8 | invalid utf8 | NULL | > ++--++ > WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, > type: STRING, data: 'binary1' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: 'binary2' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: 'árvíztűrőtükörfúró' > Error parsing row: 
file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: '你好hello' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: '��' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: '�D3"' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > {code} > The single file in the table looks like this: > {code} > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0 > {"id":1,"string_col":"ascii","binary_col":"binary1"} > {"id":2,"string_col":"ascii","binary_col":"binary2"} > {"id":3,"string_col":"null","binary_col":null} > {"id":4,"string_col":"empty","binary_col":""} > {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"} > {"id":6,"string_col":"valid utf8","binary_col":"你好hello"} > {"id":7,"string_col":"invalid utf8","binary_col":"\u�\u�"} > {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u"} > {code} >
[jira] [Created] (IMPALA-12927) Support reading BINARY columns in JSON tables
Csaba Ringhofer created IMPALA-12927: Summary: Support reading BINARY columns in JSON tables Key: IMPALA-12927 URL: https://issues.apache.org/jira/browse/IMPALA-12927 Project: IMPALA Issue Type: Sub-task Components: Backend Reporter: Csaba Ringhofer
Currently Impala cannot read BINARY columns in JSON files written by Hive correctly and returns runtime errors:
{code}
select * from functional_json.binary_tbl;
+----+--------------+------------+
| id | string_col   | binary_col |
+----+--------------+------------+
| 1  | ascii        | NULL       |
| 2  | ascii        | NULL       |
| 3  | null         | NULL       |
| 4  | empty        |            |
| 5  | valid utf8   | NULL       |
| 6  | valid utf8   | NULL       |
| 7  | invalid utf8 | NULL       |
| 8  | invalid utf8 | NULL       |
+----+--------------+------------+
WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: 'binary1'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: 'binary2'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: 'árvíztűrőtükörfúró'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: '你好hello'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: '��'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: '�D3"'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
{code}
The single file in the table looks like this:
{code}
hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0
{"id":1,"string_col":"ascii","binary_col":"binary1"}
{"id":2,"string_col":"ascii","binary_col":"binary2"}
{"id":3,"string_col":"null","binary_col":null}
{"id":4,"string_col":"empty","binary_col":""}
{"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
{"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
{"id":7,"string_col":"invalid utf8","binary_col":"\u�\u�"}
{"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u"}
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
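A minimal Python sketch (not Impala's scanner; the sample rows are taken from the listing above) of why the last rows fail: the file is newline-delimited JSON, but the BINARY values in rows 7-8 are raw non-UTF-8 bytes, so a strict UTF-8 decode rejects those lines before JSON parsing even starts.

```python
import json

# A row whose binary_col happens to be valid UTF-8 parses fine.
good = b'{"id":1,"string_col":"ascii","binary_col":"binary1"}'
row = json.loads(good.decode("utf-8"))

# A row carrying a raw 0xFF byte in binary_col fails the strict decode.
bad = b'{"id":8,"string_col":"invalid utf8","binary_col":"\xffD3"}'
try:
    json.loads(bad.decode("utf-8"))
    err = None
except UnicodeDecodeError as e:
    err = e.reason  # e.g. "invalid start byte"
```

This is why raw bytes can't simply be embedded in a JSON text field; the encoding question discussed in the follow-up comments (e.g. base64) is about avoiding exactly this.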
[jira] [Commented] (IMPALA-12899) Temporary workaround for BINARY in complex types
[ https://issues.apache.org/jira/browse/IMPALA-12899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828387#comment-17828387 ] Csaba Ringhofer commented on IMPALA-12899: --
base64 encoding seems like a sane and widely used approach to me. I would suggest the following:
# implement it first with base64 encoding
# if there is demand to handle this differently, add a query option like binary_column_encoding_in_json=base64 / skip / hive_style_unquoted_string
I would avoid a "lossy" solution as the default, i.e. one where the original binary value can't be decoded from the output.
> Temporary workaround for BINARY in complex types
>
> Key: IMPALA-12899
> URL: https://issues.apache.org/jira/browse/IMPALA-12899
> Project: IMPALA
> Issue Type: Sub-task
> Reporter: Daniel Becker
> Assignee: Daniel Becker
> Priority: Major
>
> The BINARY type is currently not supported inside complex types and a cross-component decision is probably needed to support it (see IMPALA-11491). We would like to enable EXPAND_COMPLEX_TYPES for Iceberg metadata tables (IMPALA-12612), which requires that queries with BINARY inside complex types don't fail. Enabling EXPAND_COMPLEX_TYPES is a more prioritised issue than IMPALA-11491, so we should come up with a temporary solution, e.g. NULLing BINARY values in complex types and logging a warning, or setting these BINARY values to a warning string.
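The base64 option suggested in the comment above can be sketched in a few lines (a generic illustration, not Impala or Hive code): arbitrary bytes, including invalid UTF-8, survive a lossless round-trip through the text-only JSON format.

```python
import base64
import json

# A BINARY value containing bytes that are not valid UTF-8.
original = b"\xffD3\x11 arbitrary bytes"

# Write side: base64-encode the bytes into a plain ASCII JSON string.
written = json.dumps({"binary_col": base64.b64encode(original).decode("ascii")})

# Read side: decode the field back to the exact original bytes.
restored = base64.b64decode(json.loads(written)["binary_col"])
```

Unlike writing raw bytes (which fails) or replacing undecodable bytes (which is lossy), this keeps the value recoverable, at the cost of ~33% size overhead for binary-heavy data.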
[jira] [Created] (IMPALA-12902) Event replication can be broken if hms_event_incremental_refresh_transactional_table=false
Csaba Ringhofer created IMPALA-12902: Summary: Event replication can be broken if hms_event_incremental_refresh_transactional_table=false Key: IMPALA-12902 URL: https://issues.apache.org/jira/browse/IMPALA-12902 Project: IMPALA Issue Type: Bug Components: Catalog Reporter: Csaba Ringhofer
When setting hms_event_incremental_refresh_transactional_table=false, metadata.test_event_processing.TestEventProcessing.test_event_based_replication fails at the following assert:
[https://github.com/apache/impala/blob/6c0c26146d956ad771cee27283c1371b9c23adce/tests/metadata/test_event_processing_base.py#L234]
Based on the logs, catalogd only sees alter_database and transaction events in this case, so if the transaction events (COMMIT_TXN) are ignored, it doesn't detect the change in the table. This seems strange, as the commit that added the test is older than the one that added hms_event_incremental_refresh_transactional_table:
[https://github.com/apache/impala/commit/e53d649f8a88f42a70237fe7c2663baa126fed1a] vs [https://github.com/apache/impala/commit/097b10104f23e0927d5b21b43a79f6cc10425f59]
So it is not clear to me how the test could pass originally. One possibility is that different events were generated in HMS at that time.
[jira] [Updated] (IMPALA-12902) Event replication can be broken if hms_event_incremental_refresh_transactional_table=false
[ https://issues.apache.org/jira/browse/IMPALA-12902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12902: - Summary: Event replication can be broken if hms_event_incremental_refresh_transactional_table=false (was: Event replication is can be broken if hms_event_incremental_refresh_transactional_table=false)
> Event replication can be broken if hms_event_incremental_refresh_transactional_table=false
>
> Key: IMPALA-12902
> URL: https://issues.apache.org/jira/browse/IMPALA-12902
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Reporter: Csaba Ringhofer
> Priority: Major
>
> When setting hms_event_incremental_refresh_transactional_table=false, metadata.test_event_processing.TestEventProcessing.test_event_based_replication fails at the following assert:
> [https://github.com/apache/impala/blob/6c0c26146d956ad771cee27283c1371b9c23adce/tests/metadata/test_event_processing_base.py#L234]
>
> Based on the logs, catalogd only sees alter_database and transaction events in this case, so if the transaction events (COMMIT_TXN) are ignored, it doesn't detect the change in the table.
> This seems strange, as the commit that added the test is older than the one that added hms_event_incremental_refresh_transactional_table:
> [https://github.com/apache/impala/commit/e53d649f8a88f42a70237fe7c2663baa126fed1a]
> vs
> [https://github.com/apache/impala/commit/097b10104f23e0927d5b21b43a79f6cc10425f59]
>
> So it is not clear to me how the test could pass originally. One possibility is that different events were generated in HMS at that time.
[jira] [Created] (IMPALA-12895) REFRESH doesn't detect changes in partition locations in ACID tables
Csaba Ringhofer created IMPALA-12895: Summary: REFRESH doesn't detect changes in partition locations in ACID tables Key: IMPALA-12895 URL: https://issues.apache.org/jira/browse/IMPALA-12895 Project: IMPALA Issue Type: Bug Components: Catalog Reporter: Csaba Ringhofer
This was discovered by running the test metadata.test_event_processing.TestEventProcessing.test_transact_partition_location_change_from_hive when the flag hms_event_incremental_refresh_transactional_table is set to false:
[https://github.com/apache/impala/blob/ab6c9467f6347671b971dbce4c640bea032b6ed9/tests/metadata/test_event_processing.py#L164]
When hms_event_incremental_refresh_transactional_table is true (the default), the alter partition event is processed correctly and the location change is detected. But if it is false, or event processing is turned off, the change is not detected, and running REFRESH on the table also doesn't update the location. The different handling based on the flag seems intentional:
https://github.com/apache/impala/blob/ab6c9467f6347671b971dbce4c640bea032b6ed9/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L2606
This seems to be an old issue, while the test was added in a recent commit:
[https://github.com/apache/impala/commit/32b29ff36fb3e05fd620a6714de88805052d0117]
[jira] [Work started] (IMPALA-12835) Transactional tables are unsynced when hms_event_incremental_refresh_transactional_table is disabled
[ https://issues.apache.org/jira/browse/IMPALA-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-12835 started by Csaba Ringhofer. > Transactional tables are unsynced when > hms_event_incremental_refresh_transactional_table is disabled > > > Key: IMPALA-12835 > URL: https://issues.apache.org/jira/browse/IMPALA-12835 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Reporter: Quanlong Huang >Assignee: Csaba Ringhofer >Priority: Critical > > There are some test failures when > hms_event_incremental_refresh_transactional_table is disabled: > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_transactional_insert_events > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_event_based_replication > I can reproduce the issue locally: > {noformat} > $ bin/start-impala-cluster.py > --catalogd_args=--hms_event_incremental_refresh_transactional_table=false > impala-shell> create table txn_tbl (id int, val int) stored as parquet > tblproperties > ('transactional'='true','transactional_properties'='insert_only'); > impala-shell> describe txn_tbl; -- make the table loaded in Impala > hive> insert into txn_tbl values(101, 200); > impala-shell> select * from txn_tbl; {noformat} > Impala shows no results until a REFRESH runs on this table.
[jira] [Commented] (IMPALA-12835) Transactional tables are unsynced when hms_event_incremental_refresh_transactional_table is disabled
[ https://issues.apache.org/jira/browse/IMPALA-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824490#comment-17824490 ] Csaba Ringhofer commented on IMPALA-12835: -- https://gerrit.cloudera.org/#/c/21116/ > Transactional tables are unsynced when > hms_event_incremental_refresh_transactional_table is disabled > > > Key: IMPALA-12835 > URL: https://issues.apache.org/jira/browse/IMPALA-12835 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Reporter: Quanlong Huang >Assignee: Csaba Ringhofer >Priority: Critical > > There are some test failures when > hms_event_incremental_refresh_transactional_table is disabled: > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_transactional_insert_events > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_event_based_replication > I can reproduce the issue locally: > {noformat} > $ bin/start-impala-cluster.py > --catalogd_args=--hms_event_incremental_refresh_transactional_table=false > impala-shell> create table txn_tbl (id int, val int) stored as parquet > tblproperties > ('transactional'='true','transactional_properties'='insert_only'); > impala-shell> describe txn_tbl; -- make the table loaded in Impala > hive> insert into txn_tbl values(101, 200); > impala-shell> select * from txn_tbl; {noformat} > Impala shows no results until a REFRESH runs on this table.
[jira] [Closed] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer closed IMPALA-12812. Resolution: Invalid > Send reload event after ALTER TABLE RECOVER PARTITIONS > -- > > Key: IMPALA-12812 > URL: https://issues.apache.org/jira/browse/IMPALA-12812 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Major > > IMPALA-11808 added support for sending reload events after REFRESH to allow > other Impala cluster connecting to the same HMS to also reload their tables. > REFRESH is often used when in external tables the files are written directly > to filesystem without notifying HMS, so Impala needs to update its cache and > can't rely on HMS notifications. > The same could be useful for ALTER TABLE RECOVER PARTITIONS. -It detects > partition directories that were only created in the FS but not in HMS and > creates them in HMS too.- - UPDATE: the previous sentence was not true with > current Impala. It also reloads the table (similarly to other DDLs) and > detects new files in existing partitions. > An HMS event is created for the new partitions but there is no event that > would indicate that there are new files in existing partitions. As ALTER > TABLE RECOVER PARTITIONS is called when the user expects changes in the > filesystem (similarly to REFRESH), it could be useful to send a reload event > after it is finished.
[jira] [Updated] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12812: - Description: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. -It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too.- - UPDATE: the previous sentence was not true with current Impala. It also reloads the table (similarly to other DDLs) and detects new files in existing partitions. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. was: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too. I{-}t also reloads the table (similarly to other DDLs) and detects new files in existing partitions. - UPDATE: the previous sentence was not true with current Impala. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. 
As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. > Send reload event after ALTER TABLE RECOVER PARTITIONS > -- > > Key: IMPALA-12812 > URL: https://issues.apache.org/jira/browse/IMPALA-12812 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Major > > IMPALA-11808 added support for sending reload events after REFRESH to allow > other Impala cluster connecting to the same HMS to also reload their tables. > REFRESH is often used when in external tables the files are written directly > to filesystem without notifying HMS, so Impala needs to update its cache and > can't rely on HMS notifications. > The same could be useful for ALTER TABLE RECOVER PARTITIONS. -It detects > partition directories that were only created in the FS but not in HMS and > creates them in HMS too.- - UPDATE: the previous sentence was not true with > current Impala. It also reloads the table (similarly to other DDLs) and > detects new files in existing partitions. > An HMS event is created for the new partitions but there is no event that > would indicate that there are new files in existing partitions. As ALTER > TABLE RECOVER PARTITIONS is called when the user expects changes in the > filesystem (similarly to REFRESH), it could be useful to send a reload event > after it is finished.
[jira] [Updated] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12812: - Description: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too. I{-}t also reloads the table (similarly to other DDLs) and detects new files in existing partitions. - UPDATE: the previous sentence was not true with current Impala. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. was: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}- It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too.-{-}It also reloads the table (similarly to other DDLs) and detects new files in existing partitions. - UPDATE: the previous sentence was not true with current Impala. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. 
As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. > Send reload event after ALTER TABLE RECOVER PARTITIONS > -- > > Key: IMPALA-12812 > URL: https://issues.apache.org/jira/browse/IMPALA-12812 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Major > > IMPALA-11808 added support for sending reload events after REFRESH to allow > other Impala cluster connecting to the same HMS to also reload their tables. > REFRESH is often used when in external tables the files are written directly > to filesystem without notifying HMS, so Impala needs to update its cache and > can't rely on HMS notifications. > The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}It detects > partition directories that were only created in the FS but not in HMS and > creates them in HMS too. I{-}t also reloads the table (similarly to other > DDLs) and detects new files in existing partitions. - UPDATE: the previous > sentence was not true with current Impala. > An HMS event is created for the new partitions but there is no event that > would indicate that there are new files in existing partitions. As ALTER > TABLE RECOVER PARTITIONS is called when the user expects changes in the > filesystem (similarly to REFRESH), it could be useful to send a reload event > after it is finished.