[jira] [Commented] (IMPALA-3471) TopN should be able to spill
[ https://issues.apache.org/jira/browse/IMPALA-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134613#comment-17134613 ]

Tim Armstrong commented on IMPALA-3471:
---------------------------------------

I think we still want to use the regular external sort implementation for large limits, but we could make some further optimisations to avoid spilling as much data. Specifically, in SortCurrentInputRun() we could truncate the in-memory sorted run, and then when merging sorted runs we can apply the limit there too. There are additional tricks that we could add to optimise this for spilling sorts further, mostly various ways to keep track of the upper bound on the row that would be past the threshold.

> TopN should be able to spill
> ----------------------------
>
>                 Key: IMPALA-3471
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3471
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.6.0
>            Reporter: Jim Apple
>            Priority: Minor
>
> TopN nodes store OFFSET + LIMIT tuples in memory. (In fact, in a vector
> which will throw an exception if allocation fails.) It would be nice to check
> allocations before they fail and spill when there isn't enough memory.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
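The optimisation sketched in the comment above can be illustrated as follows. This is a minimal Python sketch, not Impala's C++ sorter; the function names `sort_run_with_limit` and `merge_runs_with_limit` are hypothetical, standing in for SortCurrentInputRun() and the merge phase. The key observation: once a run is sorted, rows past the limit can never appear in the final top-N, so they need not be spilled, and the merge can stop early.

```python
import heapq

def sort_run_with_limit(run, limit):
    """Sort one in-memory run and truncate it to the first `limit` rows.

    Mirrors the idea for SortCurrentInputRun(): rows past the limit can
    never be in the final top-N, so there is no point spilling them.
    """
    run.sort()
    del run[limit:]          # truncate before the run would be spilled
    return run

def merge_runs_with_limit(runs, limit):
    """Merge already-sorted runs, stopping once `limit` rows are produced."""
    out = []
    for row in heapq.merge(*runs):
        out.append(row)
        if len(out) == limit:
            break
    return out

runs = [sort_run_with_limit(r, 3) for r in ([5, 1, 9, 7], [2, 8, 4], [6, 3])]
top3 = merge_runs_with_limit(runs, 3)   # [1, 2, 3]
```

With a limit of N and R spilled runs, at most N*R rows hit disk instead of the whole input; the "further tricks" in the comment (tracking an upper bound on the threshold row) would cut that down more.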
[jira] [Assigned] (IMPALA-9853) Push rank() predicates into sort
[ https://issues.apache.org/jira/browse/IMPALA-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong reassigned IMPALA-9853:
-------------------------------------

    Assignee:     (was: Tim Armstrong)

> Push rank() predicates into sort
> --------------------------------
>
>                 Key: IMPALA-9853
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9853
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Tim Armstrong
>            Priority: Major
>              Labels: performance, tpcds
>
> TPC-DS Q67 would benefit significantly if we could push the rank() predicate
> into the sort to do some reduction of unneeded data. The sorter could
> evaluate this predicate if it had the partition expressions available - as a
> post-processing step to the in-memory sort for the analytic sort group, it
> could do a pass over the sorted run, resetting a counter at the start of each
> partition boundary.
> It might be best to start with tackling IMPALA-3471 by applying the limit
> within sorted runs, since that doesn't require any planner work.
> {noformat}
> with results as
>  (select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy,
>     s_store_id,
>     sum(coalesce(ss_sales_price*ss_quantity,0)) sumsales
>   from store_sales, date_dim, store, item
>   where ss_sold_date_sk=d_date_sk
>     and ss_item_sk=i_item_sk
>     and ss_store_sk = s_store_sk
>     and d_month_seq between 1212 and 1212 + 11
>   group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy,
>     d_moy, s_store_id)
> ,
> results_rollup as
> (select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy,
>    s_store_id, sumsales
>  from results
>  union all
>  select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy,
>    null s_store_id, sum(sumsales) sumsales
>  from results
>  group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy
>  union all
>  select i_category, i_class, i_brand, i_product_name, d_year, d_qoy,
>    null d_moy, null s_store_id, sum(sumsales) sumsales
>  from results
>  group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy
>  union all
>  select i_category, i_class, i_brand, i_product_name, d_year, null d_qoy,
>    null d_moy, null s_store_id, sum(sumsales) sumsales
>  from results
>  group by i_category, i_class, i_brand, i_product_name, d_year
>  union all
>  select i_category, i_class, i_brand, i_product_name, null d_year,
>    null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>  from results
>  group by i_category, i_class, i_brand, i_product_name
>  union all
>  select i_category, i_class, i_brand, null i_product_name, null d_year,
>    null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>  from results
>  group by i_category, i_class, i_brand
>  union all
>  select i_category, i_class, null i_brand, null i_product_name, null d_year,
>    null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>  from results
>  group by i_category, i_class
>  union all
>  select i_category, null i_class, null i_brand, null i_product_name,
>    null d_year, null d_qoy, null d_moy, null s_store_id,
>    sum(sumsales) sumsales
>  from results
>  group by i_category
>  union all
>  select null i_category, null i_class, null i_brand, null i_product_name,
>    null d_year, null d_qoy, null d_moy, null s_store_id,
>    sum(sumsales) sumsales
>  from results)
> select *
> from (select i_category
>             ,i_class
>             ,i_brand
>             ,i_product_name
>             ,d_year
>             ,d_qoy
>             ,d_moy
>             ,s_store_id
>             ,sumsales
>             ,rank() over (partition by i_category order by sumsales desc) rk
>       from results_rollup) dw2
> where rk <= 100
> order by i_category
>         ,i_class
>         ,i_brand
>         ,i_product_name
>         ,d_year
>         ,d_qoy
>         ,d_moy
>         ,s_store_id
>         ,sumsales
>         ,rk
> limit 100
> {noformat}
> Assigning to myself to fill in more details.
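The post-processing pass the description proposes can be sketched as follows. This is a Python illustration of the idea, not Impala's sorter; `apply_rank_limit` is a hypothetical name. Given a run sorted by (partition key, order key), one pass recomputes rank() incrementally, resets at each partition boundary, and drops every row whose rank already exceeds the predicate's bound (e.g. `rk <= 100` in Q67). Ties must share a rank, which is why the counter and the emitted rank are tracked separately.

```python
def apply_rank_limit(sorted_rows, part_key, order_key, limit):
    """One pass over a run sorted by (part_key, order_key): recompute rank()
    incrementally, resetting the counter at each partition boundary, and
    drop rows whose rank exceeds `limit`."""
    out = []
    cur_part = object()            # sentinel: matches no real partition key
    n = rank = 0
    prev_val = object()
    for row in sorted_rows:
        if part_key(row) != cur_part:        # partition boundary: reset
            cur_part, n, rank, prev_val = part_key(row), 0, 0, object()
        n += 1
        if order_key(row) != prev_val:       # ties keep the earlier rank
            rank, prev_val = n, order_key(row)
        if rank <= limit:
            out.append(row)
    return out

# Sorted by partition, then sumsales descending; keep rank() <= 2.
rows = [("A", 9), ("A", 9), ("A", 7), ("B", 5), ("B", 4), ("B", 4), ("B", 3)]
kept = apply_rank_limit(rows, lambda r: r[0], lambda r: r[1], limit=2)
```

Note the tie handling matters for correctness: both ("A", 9) rows have rank 1, so ("A", 7) has rank 3 and is dropped, while both ("B", 4) rows survive at rank 2.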
[jira] [Commented] (IMPALA-5296) Codegen for aggregation with 1K grouping columns takes several seconds
[ https://issues.apache.org/jira/browse/IMPALA-5296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134607#comment-17134607 ]

Tim Armstrong commented on IMPALA-5296:
---------------------------------------

This has gotten a little faster, but still took over a minute on my desktop:
{noformat}
Fragment F01:
  CodeGen:(Total: 1m9s, non-child: 1m9s, % non-child: 100.00%)
     - CodegenInvoluntaryContextSwitches: 124 (124)
     - CodegenTotalWallClockTime: 1m9s
       - CodegenSysTime: 696.010ms
       - CodegenUserTime: 1m9s
     - CodegenVoluntaryContextSwitches: 10 (10)
     - CompileTime: 16s816ms
     - IrGenerationTime: 113.400ms
     - LoadTime: 0.000ns
     - ModuleBitcodeSize: 2.50 MB (2624788)
     - NumFunctions: 2.07K (2066)
     - NumInstructions: 106.15K (106147)
     - OptimizationTime: 52s908ms
     - PeakMemoryUsage: 51.83 MB (54347264)
     - PrepareTime: 19.514ms
Fragment F00:
  CodeGen:(Total: 1m12s, non-child: 1m12s, % non-child: 100.00%)
     - CodegenInvoluntaryContextSwitches: 104 (104)
     - CodegenTotalWallClockTime: 1m12s
       - CodegenSysTime: 779.937ms
       - CodegenUserTime: 1m12s
     - CodegenVoluntaryContextSwitches: 12 (12)
     - CompileTime: 17s127ms
     - IrGenerationTime: 142.890ms
     - LoadTime: 0.000ns
     - ModuleBitcodeSize: 2.50 MB (2624788)
     - NumFunctions: 3.11K (3106)
     - NumInstructions: 155.46K (155461)
     - OptimizationTime: 55s547ms
     - PeakMemoryUsage: 75.91 MB (79596032)
     - PrepareTime: 18.261ms
{noformat}

> Codegen for aggregation with 1K grouping columns takes several seconds
> ----------------------------------------------------------------------
>
>                 Key: IMPALA-5296
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5296
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.9.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>              Labels: codegen
>         Attachments: codegen-widetable-profile.txt,
>                      wide_parquet_1000_small_data.0.parq
>
> Parquet file for sample data attached.
> From the query profile
> {code}
> CodeGen:(Total: 3m6s, non-child: 3m6s, % non-child: 100.00%)
>    - CodegenTime: 73.408ms
>    - CompileTime: 48s511ms
>    - LoadTime: 0.000ns
>    - ModuleBitcodeSize: 1.98 MB (2074600)
>    - NumFunctions: 2.07K (2066)
>    - NumInstructions: 104.95K (104948)
>    - OptimizationTime: 2m17s
>    - PeakMemoryUsage: 51.24 MB (53733376)
>    - PrepareTime: 18.414ms
> AGGREGATION_NODE (id=2):(Total: 3s413ms, non-child: 408.959us, % non-child: 0.01%)
>    - BuildTime: 775.000ns
>    - GetResultsTime: 0.000ns
>    - HTResizeTime: 0.000ns
>    - HashBuckets: 0 (0)
>    - LargestPartitionPercent: 0 (0)
>    - MaxPartitionLevel: 0 (0)
>    - NumRepartitions: 0 (0)
>    - PartitionsCreated: 0 (0)
>    - PeakMemoryUsage: 16.00 KB (16384)
>    - RowsRepartitioned: 0 (0)
>    - RowsReturned: 1 (1)
>    - RowsReturnedRate: 0
> {code}
> Query used to repro
> {code}
> select count(*) from (select distinct * from wide_parquet_1000_small where
> l_orderkey = 10) a
> {code}
> Hot functions from Perf
> {code}
> Samples: 2M of event 'cycles', Event count (approx.): 1538632554190
> +6.09%  6.07%  impalad  impalad            [.] llvm::ScalarEvolution::has
> +5.75%  0.00%  impalad  [unknown]          [.]
> +5.46%  0.00%  impalad  [unknown]          [.] 0x7f8dfe242120
> +5.21%  0.00%  impalad  [unknown]          [.] 0x7f8e5e13ea20
> +4.31%  0.00%  impalad  [unknown]          [.] 0x7f8e5e13e8e8
> +4.31%  0.00%  init     [kernel.kallsyms]  [k] start_secondary
> +4.30%  0.02%  init     [kernel.kallsyms]  [k] cpu_idle
> +4.03%  0.00%  impalad  [unknown]          [.] 0x7f8dfe241fe8
> +3.79%  3.77%  impalad  impalad            [.] llvm::Use::getImpliedUser(
> +3.65%  0.02%  init     [kernel.kallsyms]  [k] cpuidle_idle_call
> +3.26%  2.78%  impalad  impalad            [.] llvm::MemoryDependenceAnal
> +3.17%  3.15%  impalad  impalad            [.] llvm::SmallPtrSetImplBase:
> +2.56%  2.52%  impalad  impalad            [.] llvm::ScalarEvolution::get
[jira] [Assigned] (IMPALA-6746) Reduce the number of comparisons for analytical functions with partitioning when incoming data is clustered
[ https://issues.apache.org/jira/browse/IMPALA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong reassigned IMPALA-6746:
-------------------------------------

    Assignee:     (was: Adrian Ng)

> Reduce the number of comparisons for analytical functions with partitioning
> when incoming data is clustered
> ---------------------------------------------------------------------------
>
>                 Key: IMPALA-6746
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6746
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.13.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>              Labels: performance
>         Attachments: percentile query profile 2.txt
>
> Checking whether the current row belongs to the same partition in ANALYTIC is
> very expensive, as it does N comparisons, where N is the number of rows. When
> the cardinality of the partition column(s) is relatively small, the values
> will be clustered.
> One optimization, as proposed by [~alex.behm], is to check the first and last
> tuples in the batch and, if they match, avoid calling
> AnalyticEvalNode::PrevRowCompare for the entire batch.
> For the attached query, which is a common pattern, the expected speedup is
> 20-30%.
> Query
> {code}
> select l_commitdate
>   ,avg(l_extendedprice) as avg_perc
>   ,percentile_cont (.25) within group (order by l_extendedprice asc) as perc_25
>   ,percentile_cont (.5) within group (order by l_extendedprice asc) as perc_50
>   ,percentile_cont (.75) within group (order by l_extendedprice asc) as perc_75
>   ,percentile_cont (.90) within group (order by l_extendedprice asc) as perc_90
> from lineitem
> group by l_commitdate
> order by l_commitdate
> {code}
> Plan
> {code}
> F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |  Per-Host Resources: mem-estimate=0B mem-reservation=0B
> PLAN-ROOT SINK
> |  mem-estimate=0B mem-reservation=0B
> |
> 09:MERGING-EXCHANGE [UNPARTITIONED]
> |  order by: l_commitdate ASC
> |  mem-estimate=0B mem-reservation=0B
> |  tuple-ids=5 row-size=66B cardinality=2559
> |
> F02:PLAN FRAGMENT [HASH(l_commitdate)] hosts=1 instances=1
> Per-Host Resources: mem-estimate=22.00MB mem-reservation=13.94MB
> 05:SORT
> |  order by: l_commitdate ASC
> |  mem-estimate=12.00MB mem-reservation=12.00MB spill-buffer=2.00MB
> |  tuple-ids=5 row-size=66B cardinality=2559
> |
> 08:AGGREGATE [FINALIZE]
> |  output: avg:merge(l_extendedprice),
> |    _percentile_cont_interpolation:merge(l_extendedprice, `_percentile_row_number_diff_0`),
> |    _percentile_cont_interpolation:merge(l_extendedprice, `_percentile_row_number_diff_1`),
> |    _percentile_cont_interpolation:merge(l_extendedprice, `_percentile_row_number_diff_2`),
> |    _percentile_cont_interpolation:merge(l_extendedprice, `_percentile_row_number_diff_3`)
> |  group by: l_commitdate
> |  mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB
> |  tuple-ids=4 row-size=66B cardinality=2559
> |
> 07:EXCHANGE [HASH(l_commitdate)]
> |  mem-estimate=0B mem-reservation=0B
> |  tuple-ids=3 row-size=66B cardinality=2559
> |
> F01:PLAN FRAGMENT [HASH(l_commitdate)] hosts=1 instances=1
> Per-Host Resources: mem-estimate=64.00MB mem-reservation=22.00MB
> 04:AGGREGATE [STREAMING]
> |  output: avg(l_extendedprice),
> |    _percentile_cont_interpolation(l_extendedprice, row_number() - 1 - count(l_extendedprice) - 1 * 0.25),
> |    _percentile_cont_interpolation(l_extendedprice, row_number() - 1 - count(l_extendedprice) - 1 * 0.5),
> |    _percentile_cont_interpolation(l_extendedprice, row_number() - 1 - count(l_extendedprice) - 1 * 0.75),
> |    _percentile_cont_interpolation(l_extendedprice, row_number() - 1 - count(l_extendedprice) - 1 * 0.90)
> |  group by: l_commitdate
> |  mem-estimate=10.00MB mem-reservation=2.00MB spill-buffer=64.00KB
> |  tuple-ids=3 row-size=66B cardinality=2559
> |
> 03:ANALYTIC
> |  functions: count(l_extendedprice)
> |  partition by: l_commitdate
> |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB
> |  tuple-ids=9,7,8 row-size=50B cardinality=59986052
> |
> 02:ANALYTIC
> |  functions: row_number()
> |  partition by: l_commitdate
> |  order by: l_extendedprice ASC
> |  window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB
> |  tuple-ids=9,7 row-size=42B cardinality=59986052
> |
> 01:SORT
> |  order by: l_commitdate ASC NULLS FIRST, l_extendedprice ASC NULLS LAST
> |  mem-estimate=46.00MB mem-reservation=12.00MB spill-buffer=2.00MB
> |  tuple-ids=9 row-size=34B cardinality=59986052
> |
> 06:EXCHANGE [HASH(l_commitdate)]
> |  mem-estimate=0B mem-reservation=0B
> |  tuple-ids=0 row-size=34B cardinality=59986052
> |
> F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> Per-Host Resources: mem-estimate=88.00MB
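The batch-level shortcut proposed in the issue above can be sketched like this. It is a Python illustration, not Impala's AnalyticEvalNode; `process_batch` is a hypothetical name, and the sketch ignores the boundary between consecutive batches for simplicity. Because the input is sorted on the partition key, equal first and last rows imply every row in between belongs to the same partition, so one comparison replaces a whole batch of them.

```python
def process_batch(batch, part_key, on_boundary, stats):
    """Handle one row batch for an analytic fn with PARTITION BY.

    Fast path: if the first and last rows share a partition key, the whole
    batch is one partition (input is sorted on the key), so the per-row
    PrevRowCompare-style check is skipped for the entire batch.
    """
    if batch and part_key(batch[0]) == part_key(batch[-1]):
        stats["compares"] += 1          # one comparison for the whole batch
        return
    # Slow path: clustered but mixed batch -- compare row by row.
    prev = None
    for row in batch:
        stats["compares"] += 1
        if prev is not None and part_key(row) != part_key(prev):
            on_boundary(row)            # new partition starts here
        prev = row

stats, boundaries = {"compares": 0}, []
process_batch([("a", 1), ("a", 2), ("a", 3)], lambda r: r[0],
              boundaries.append, stats)   # fast path: 1 comparison
process_batch([("a", 1), ("b", 2), ("b", 3)], lambda r: r[0],
              boundaries.append, stats)   # slow path: 3 comparisons
```

With low-cardinality partition keys, most batches take the fast path, which is where the quoted 20-30% estimate comes from.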
[jira] [Assigned] (IMPALA-6354) Consider using Guava LoadingCache to cache metadata objects as opposed to a ConcurrentHashMap
[ https://issues.apache.org/jira/browse/IMPALA-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong reassigned IMPALA-6354:
-------------------------------------

    Assignee:     (was: Tianyi Wang)

> Consider using Guava LoadingCache to cache metadata objects as opposed to a
> ConcurrentHashMap
> ---------------------------------------------------------------------------
>
>                 Key: IMPALA-6354
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6354
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>              Labels: catalog, catalog-server, scalability
>
> Look into replacing the ConcurrentHashMap(s) used in the Catalog service to
> cache metadata with
> [LoadingCache|https://google.github.io/guava/releases/19.0/api/docs/com/google/common/cache/LoadingCache.html].
> Caching metadata using a LoadingCache will allow:
> * Better coherency by adding expireAfter and refreshAfter clauses to the cache
> * Weak-reference eviction to respond to garbage collection
> * Putting a cap on the maximum number of cache entries to avoid OOMs and
>   excessive GC
> * Assigning different weights to each entry (for more efficient eviction)
> https://github.com/google/guava/wiki/CachesExplained
> https://google.github.io/guava/releases/19.0/api/docs/com/google/common/cache/CacheBuilder.html
> https://google.github.io/guava/releases/19.0/api/docs/com/google/common/cache/LoadingCache.html
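The semantics the issue above wants from Guava's LoadingCache can be illustrated with a toy Python equivalent. This is an illustration of the cache contract only (loader on miss, `maximumSize`-style LRU eviction, `expireAfterWrite`-style refresh), not Guava's Java API or Impala's catalog code; the `LoadingCache` class here is a stand-in written for this sketch.

```python
import time
from collections import OrderedDict

class LoadingCache:
    """Toy cache with Guava-LoadingCache-like semantics: values are computed
    by a loader on miss, evicted LRU beyond `max_size`, and reloaded after
    `ttl` seconds (roughly maximumSize + expireAfterWrite)."""

    def __init__(self, loader, max_size=1024, ttl=60.0, clock=time.monotonic):
        self._loader, self._max, self._ttl = loader, max_size, ttl
        self._clock = clock
        self._data = OrderedDict()           # key -> (value, write_time)

    def get(self, key):
        now = self._clock()
        hit = self._data.get(key)
        if hit is not None and now - hit[1] < self._ttl:
            self._data.move_to_end(key)      # LRU touch on hit
            return hit[0]
        value = self._loader(key)            # load on miss / after expiry
        self._data[key] = (value, now)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)   # evict least recently used
        return value

loads = []
cache = LoadingCache(lambda k: (loads.append(k), k.upper())[1],
                     max_size=2, ttl=100.0)
cache.get("a"); cache.get("a")   # second get is a hit: no reload
cache.get("b"); cache.get("c")   # "a" is evicted (max_size=2, LRU)
cache.get("a")                   # miss again: reloaded
```

The point of the Jira is exactly the bounded-size and expiry behaviour above: a plain ConcurrentHashMap keeps entries forever unless evicted by hand.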
[jira] [Assigned] (IMPALA-6404) Evenly distribute local and remote scan ranges across Impalad(s) when 100% locality is not achievable
[ https://issues.apache.org/jira/browse/IMPALA-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong reassigned IMPALA-6404:
-------------------------------------

    Assignee:     (was: Lars Volker)

> Evenly distribute local and remote scan ranges across Impalad(s) when 100%
> locality is not achievable
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-6404
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6404
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Distributed Exec
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>              Labels: scheduler
>
> The current scheduler tries to assign as many local reads as possible. This
> works well when 100% locality is achievable, but when some nodes have
> locality and others don't, an uneven number of scan ranges is assigned to the
> backends, which results in execution skew.
> Ideally the scheduler should create an even distribution of local and remote
> scan ranges to avoid skew.
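One way to read the request above is a greedy, load-aware assignment. The following Python sketch is a simplified illustration written for this note, not Impala's scheduler; `assign_scan_ranges` and the `{"len", "replicas"}` range shape are made up for the example. Each range goes to its least-loaded local replica when one exists, otherwise to the least-loaded backend overall, so total assigned bytes stay roughly even instead of piling remote ranges onto a few nodes.

```python
def assign_scan_ranges(ranges, backends):
    """Greedy sketch: prefer a local backend for each scan range, but always
    pick the least-loaded candidate so assigned bytes stay balanced."""
    load = {b: 0 for b in backends}
    assignment = {b: [] for b in backends}
    # Place the largest ranges first; greedy balancing works better that way.
    for rng in sorted(ranges, key=lambda r: -r["len"]):
        local = [b for b in rng["replicas"] if b in load]
        candidates = local or backends        # fall back to a remote read
        target = min(candidates, key=lambda b: load[b])
        load[target] += rng["len"]
        assignment[target].append(rng)
    return assignment

backends = ["n1", "n2"]
ranges = [{"len": 10, "replicas": ["n1"]},   # local only on n1
          {"len": 10, "replicas": []},       # remote everywhere
          {"len": 4,  "replicas": ["n1"]}]
assignment = assign_scan_ranges(ranges, backends)
```

A real scheduler would also weigh the higher cost of remote reads rather than treating local and remote bytes as equal; the sketch only shows the balancing half of the problem.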
[jira] [Assigned] (IMPALA-6006) Incorrect cardinality estimation when dimension table has inequality predicate
[ https://issues.apache.org/jira/browse/IMPALA-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong reassigned IMPALA-6006:
-------------------------------------

    Assignee:     (was: Philip Martin)

> Incorrect cardinality estimation when dimension table has inequality predicate
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-6006
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6006
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.11.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>
> Query
> {code}
> select count(*)
>   from catalog_sales
>   JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>   where d_month_seq between 1193 and 1193+11;
> {code}
> Plan
> {code}
> +---+
> | Explain String |
> +---+
> | Max Per-Host Resource Reservation: Memory=1.94MB |
> | Per-Host Resource Estimates: Memory=54.94MB |
> | |
> | F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 |
> | | Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B |
> | PLAN-ROOT SINK |
> | | mem-estimate=0B mem-reservation=0B |
> | | |
> | 06:AGGREGATE [FINALIZE] |
> | | output: count:merge(*) |
> | | mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB |
> | | tuple-ids=2 row-size=8B cardinality=1 |
> | | |
> | 05:EXCHANGE [UNPARTITIONED] |
> | | mem-estimate=0B mem-reservation=0B |
> | | tuple-ids=2 row-size=8B cardinality=1 |
> | | |
> | F00:PLAN FRAGMENT [RANDOM] hosts=7 instances=7 |
> | Per-Host Resources: mem-estimate=12.94MB mem-reservation=1.94MB |
> | 03:AGGREGATE |
> | | output: count(*) |
> | | mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB |
> | | tuple-ids=2 row-size=8B cardinality=1 |
> | | |
> | 02:HASH JOIN [INNER JOIN, BROADCAST] |
> | | hash predicates: catalog_sales.cs_sold_date_sk = date_dim.d_date_sk |
> | | fk/pk conjuncts: catalog_sales.cs_sold_date_sk = date_dim.d_date_sk |
> | | runtime filters: RF000 <- date_dim.d_date_sk |
> | | mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB |
> | | tuple-ids=0,1 row-size=16B cardinality=14399964710 |
> | | |
> | |--04:EXCHANGE [BROADCAST] |
> | | | mem-estimate=0B mem-reservation=0B |
> | | | tuple-ids=1 row-size=8B cardinality=7305 |
> | | | |
> | | F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1 |
> | | Per-Host Resources: mem-estimate=32.00MB mem-reservation=0B |
> | | 01:SCAN HDFS [tpcds_1_parquet.date_dim, RANDOM] |
> | | partitions=1/1 files=1 size=2.15MB |
> | | predicates: d_month_seq <= 1204, d_month_seq >= 1193 |
> | | stats-rows=73049 extrapolated-rows=disabled |
> | | table stats: rows=73049 size=unavailable |
> | | column stats: all |
> | | parquet statistics predicates: d_month_seq <= 1204, d_month_seq >= 1193 |
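For context on why the 02:HASH JOIN cardinality of 14,399,964,710 looks wrong: for an FK/PK join, each fact row matches at most one dimension row, so the filtered dimension's selectivity should scale the fact side down. The date_dim figures (73,049 rows, 7,305 after the d_month_seq filter) are taken from the plan above; the fact-table row count below is hypothetical, chosen only to make the arithmetic concrete.

```python
def fk_pk_join_cardinality(fact_rows, dim_rows, dim_rows_after_filter):
    """Textbook FK/PK join estimate: the dimension filter's selectivity
    (filtered rows / total rows) scales down the fact-table cardinality."""
    selectivity = dim_rows_after_filter / dim_rows
    return int(fact_rows * selectivity)

# date_dim: 73049 rows, 7305 survive the inequality predicate (~10%).
# The fact row count 1.44e9 is a made-up stand-in for catalog_sales.
est = fk_pk_join_cardinality(1_440_000_000, 73_049, 7_305)
```

Under this formula the join estimate is about a tenth of the fact table, far below the plan's 14.4 billion, which suggests the inequality predicate's selectivity is not being applied to the join estimate.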
[jira] [Resolved] (IMPALA-4062) Create thread pool for HdfsScanNode::ScannerThread to limit Kernel contention
[ https://issues.apache.org/jira/browse/IMPALA-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-4062.
-----------------------------------
    Resolution: Won't Fix

Not relevant with the IMPALA-3902 changes.

> Create thread pool for HdfsScanNode::ScannerThread to limit Kernel contention
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-4062
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4062
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.6.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>              Labels: performance, scalability
>         Attachments: TPC-DS Q27.txt, q27_perf_kernel.txt, q_27_spinning_1.zip
>
> Servers with modern processors like the E5-2698 can have up to 80 logical
> processors per server; as a result, queries occasionally end up running with
> a significantly larger number of threads.
> Creating and destroying threads is expensive and wastes lots of resources,
> hence consider creating a thread pool for scanner threads to avoid resource
> contention during thread creation.
> For TPC-DS Q27, >40% of CPU cycles are spent in pthread_mutex_unlock and
> pthread_mutex_lock.
> Call stacks
> {code}
> CPU Time
> 1 of 5: 71.4% (31.928s of 44.725s)
> impalad ! pthread_mutex_unlock - mutex.hpp
> impalad ! boost::mutex::unlock + 0x10 - mutex.hpp:125
> impalad ! ~unique_lock + 0x16 - lock_types.hpp:331
> impalad ! impala::HdfsScanNode::ScannerThread + 0x2aa - hdfs-scan-node.cc:1044
> impalad ! boost::function0::operator() + 0x1a - function_template.hpp:767
> impalad ! impala::Thread::SuperviseThread + 0x20e - thread.cc:318
> impalad ! operator()&, const std::basic_string&, boost::function, impala::Promise*), boost::_bi::list0> + 0x5a - bind.hpp:457
> impalad ! boost::_bi::bind_t const&, boost::function, impala::Promise*), boost::_bi::list4, boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*>>>::operator() - bind_template.hpp:20
> impalad ! boost::detail::thread_data (*)(std::string const&, std::string const&, boost::function, impala::Promise*), boost::_bi::list4, boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*::run + 0x19 - thread.hpp:116
> impalad ! thread_proxy + 0xd9 - [unknown source file]
> libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
> libc.so.6 ! clone + 0x6c - [unknown source file]
> {code}
> {code}
> CPU Time
> 2 of 5: 26.4% (11.787s of 44.725s)
> impalad ! pthread_mutex_unlock - mutex.hpp
> impalad ! boost::mutex::unlock + 0x10 - mutex.hpp:125
> impalad ! ~unique_lock + 0x16 - lock_types.hpp:331
> impalad ! impala::Promise::Get + 0x82d - promise.h:94
> impalad ! impala::CountingBarrier::Wait - counting-barrier.h:42
> impalad ! impala::HdfsScanNode::ScannerThread + 0x2aa - hdfs-scan-node.cc:1044
> impalad ! boost::function0::operator() + 0x1a - function_template.hpp:767
> impalad ! impala::Thread::SuperviseThread + 0x20e - thread.cc:318
> impalad ! operator()&, const std::basic_string&, boost::function, impala::Promise*), boost::_bi::list0> + 0x5a - bind.hpp:457
> impalad ! boost::_bi::bind_t const&, boost::function, impala::Promise*), boost::_bi::list4, boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*>>>::operator() - bind_template.hpp:20
> impalad ! boost::detail::thread_data (*)(std::string const&, std::string const&, boost::function, impala::Promise*), boost::_bi::list4, boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*::run + 0x19 - thread.hpp:116
> impalad ! thread_proxy + 0xd9 - [unknown source file]
> libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
> libc.so.6 ! clone + 0x6c - [unknown source file]
> {code}
> {code}
> CPU Time
> 1 of 11: 98.4% (35.681s of 36.271s)
> impalad ! pthread_mutex_lock - mutex.hpp
> impalad ! boost::mutex::lock + 0x10 - mutex.hpp:116
> impalad ! [impalad] + 0x25f9daf - [unknown source file]
> impalad ! boost::function0::operator() + 0x1a - function_template.hpp:767
> impalad ! impala::Thread::SuperviseThread + 0x20e - thread.cc:318
> impalad ! operator()&, const std::basic_string&, boost::function, impala::Promise*), boost::_bi::list0> + 0x5a - bind.hpp:457
> impalad ! boost::_bi::bind_t const&, boost::function, impala::Promise*), boost::_bi::list4, boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*>>>::operator() - bind_template.hpp:20
> impalad ! boost::detail::thread_data (*)(std::string const&, std::string const&, boost::function, impala::Promise*), boost::_bi::list4, boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*::run + 0x19 - thread.hpp:116
> impalad ! thread_proxy + 0xd9 - [unknown source file]
> libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
> libc.so.6 ! clone
[jira] [Assigned] (IMPALA-3731) Runtime filters from the same source arrive at different times
[ https://issues.apache.org/jira/browse/IMPALA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong reassigned IMPALA-3731:
-------------------------------------

    Assignee:     (was: Henry Robinson)

> Runtime filters from the same source arrive at different times
> --------------------------------------------------------------
>
>                 Key: IMPALA-3731
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3731
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend
>    Affects Versions: Impala 2.5.0
>            Reporter: Mostafa Mokhtar
>            Priority: Minor
>              Labels: runtime-filters
>
> Runtime filters from the same source are arriving ~5 seconds apart; it seems
> that the coordinator is either serializing the filters or was network bound.
> Query
> {code}
> select count(*) rowcount
> from store_sales a
>     ,store_returns b
> where a.ss_item_sk = b.sr_item_sk
>   and a.ss_ticket_number = b.sr_ticket_number
>   and ss_sold_date_sk between 2450816 and 2451500
>   and sr_returned_date_sk between 2450816 and 2451500
> group by ss_cdemo_sk, ss_store_sk, ss_item_sk, ss_ticket_number
> having count(*) > 1
> {code}
> Subplan
> {code}
> 00:SCAN HDFS [tpcds_3000_parquet.store_sales a, RANDOM]
>    partitions=683/1824 files=944 size=126.77GB
>    runtime filters: RF000 -> a.ss_item_sk, RF001 -> a.ss_ticket_number
>    table stats: 8639936081 rows total
>    column stats: all
>    hosts=61 per-host-mem=352.00MB
>    tuple-ids=0 row-size=24B cardinality=2886246552
> {code}
> Filter table
> {code}
> ID  Src. Node  Tgt. Node(s)  Targets  Target type  Partition filter  Pending (Expected)  First arrived  Completed
> -----------------------------------------------------------------------------------------------------------------
>  1  2          0             61       REMOTE       false             0 (61)              2s881ms        10s265ms
>  0  2          0             61       REMOTE       false             0 (61)              3s698ms        10s350ms
> {code}
> Filters arriving at different times
> {code}
> Instance 614bea9715cbde44:b0134609741aea61
>   (host=impala-compete-64-5.vpc.cloudera.com:22000):(Total: 30s446ms, non-child: 10s882ms, % non-child: 35.74%)
>   Hdfs split stats (:<# splits>/): 0:16/2.33 GB
>   Filter 1 arrival: 11s854ms
>   Filter 0 arrival: 16s047ms
> {code}
[jira] [Resolved] (IMPALA-3701) Evaluate compressing Runtime filters to save coordinator network bandwidth
[ https://issues.apache.org/jira/browse/IMPALA-3701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-3701.
-----------------------------------
    Resolution: Won't Fix

We expect these to be generally incompressible, so not worth pursuing.

> Evaluate compressing Runtime filters to save coordinator network bandwidth
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-3701
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3701
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Distributed Exec
>    Affects Versions: Impala 2.5.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Henry Robinson
>            Priority: Major
>              Labels: runtime-filters, scalability
>         Attachments: image-2016-06-08-22-55-36-966.png, query17.sql.2.out
>
> When running complex queries on large clusters with lots of runtime filters,
> the coordinator quickly becomes network bound due to the extra incoming and
> outgoing traffic for runtime filters. Once the coordinator becomes network
> bound, all other fragments in the cluster are negatively affected as they get
> blocked on shuffling/broadcasting data to the coordinator node.
> This bottleneck was identified when running large-scale tests on EC2 nodes
> with less than ideal network throughput.
> The attached png shows aggregate network throughput across the 32 nodes in
> the cluster, with the coordinator in red.
> !image-2016-06-08-22-55-36-966.png|thumbnail!
> Compression should alleviate this bottleneck, but we should consider other
> solutions.
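The "generally incompressible" conclusion in the resolution above is easy to sanity-check: a well-sized Bloom filter has roughly half its bits set pseudo-randomly, which is near-maximum entropy, so general-purpose compression buys almost nothing. Only underfull filters (small build sides) would compress well. A quick Python check of that intuition, using random bytes as a stand-in for a ~50%-full filter:

```python
import os
import zlib

def compression_ratio(payload: bytes) -> float:
    """Compressed size / original size for zlib at default-ish level 6."""
    return len(zlib.compress(payload, 6)) / len(payload)

# Stand-ins for 1 MB runtime-filter payloads:
dense = os.urandom(1 << 20)    # ~50% random bits: near-maximum entropy
sparse = bytes(1 << 20)        # mostly-empty filter: all zero bits

assert compression_ratio(dense) > 0.99   # effectively incompressible
assert compression_ratio(sparse) < 0.01  # would compress very well
```

So compression would only have helped the sparse case, where the filters are small and cheap to ship anyway, which matches the Won't Fix reasoning.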
[jira] [Resolved] (IMPALA-3617) Incorrect reporting of runtime filters in scan node
[ https://issues.apache.org/jira/browse/IMPALA-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-3617.
-----------------------------------
    Resolution: Cannot Reproduce

> Incorrect reporting of runtime filters in scan node
> ---------------------------------------------------
>
>                 Key: IMPALA-3617
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3617
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.6.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Henry Robinson
>            Priority: Minor
>              Labels: runtime-filters
>         Attachments: query31.sql.3.out
>
> Summary line reports one filter arriving while both filters were received.
> {code}
> HDFS_SCAN_NODE (id=6):(Total: 9s443ms, non-child: 9s443ms, % non-child: 100.00%)
>   ExecOption: Expr Evaluation Codegen Disabled, Codegen enabled: 0 out of 14
>   Hdfs split stats (:<# splits>/):
>     2:39/8.86 GB 5:42/9.29 GB 4:28/6.50 GB 1:29/6.45 GB 6:32/7.38 GB 3:27/6.07 GB
>     10:34/7.63 GB 7:26/5.75 GB 0:33/7.24 GB 9:37/8.06 GB 8:22/4.86 GB
>   Runtime filters: Only following filters arrived: 8, waited 8s991ms
>   Hdfs Read Thread Concurrency Bucket: 0:100% 1:0% 2:0% 3:0% 4:0%
>     5:0% 6:0% 7:0% 8:0% 9:0% 10:0% 11:0% 12:0% 13:0% 14:0% 15:0% 16:0% 17:0%
>     18:0% 19:0% 20:0% 21:0% 22:0% 23:0% 24:0% 25:0% 26:0% 27:0%
>   File Formats: PARQUET/NONE:335 PARQUET/SNAPPY:28
>   BytesRead(500.000ms): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>     0, 0, 0, 0, 0, 73.18 MB, 73.18 MB, 73.18 MB, 73.18 MB, 73.18 MB, 73.18 MB,
>     73.18 MB, 93.31 MB, 134.68 MB, 161.83 MB, 161.83 MB, 161.83 MB, 161.83 MB,
>     184.49 MB, 206.65 MB, 233.75 MB, 233.75 MB, 233.75 MB, 261.02 MB, 261.02 MB,
>     288.24 MB, 309.87 MB, 309.87 MB, 309.87 MB, 337.08 MB, 337.08 MB, 337.08 MB,
>     337.08 MB, 337.08 MB
>    - AverageHdfsReadThreadConcurrency: 0.00
>    - AverageScannerThreadConcurrency: 3.26
>    - BytesRead: 337.08 MB (353454758)
>    - BytesReadDataNodeCache: 0
>    - BytesReadLocal: 337.08 MB (353454758)
>    - BytesReadRemoteUnexpected: 0
>    - BytesReadShortCircuit: 337.08 MB (353454758)
>    - DecompressionTime: 784.603ms
>    - MaxCompressedTextFileLength: 0
>    - NumColumns: 2 (2)
>    - NumDisksAccessed: 6 (6)
>    - NumRowGroups: 14 (14)
>    - NumScannerThreadsStarted: 4 (4)
>    - PeakMemoryUsage: 134.29 MB (140810128)
>    - PerReadThreadRawHdfsThroughput: 2.00 GB/sec
>    - RemoteScanRanges: 0 (0)
>    - RowsRead: 90.33M (90325159)
>    - RowsReturned: 90.32M (90321194)
>    - RowsReturnedRate: 9.56 M/sec
>    - ScanRangesComplete: 349 (349)
>    - ScannerThreadsInvoluntaryContextSwitches: 2.85K (2851)
>    - ScannerThreadsTotalWallClockTime: 1m16s
>      - MaterializeTupleTime(*): 50s026ms
>      - ScannerThreadsSysTime: 887.864ms
>      - ScannerThreadsUserTime: 6s995ms
>    - ScannerThreadsVoluntaryContextSwitches: 176.78K (176783)
>    - TotalRawHdfsReadTime(*): 164.437ms
>    - TotalReadThroughput: 13.76 MB/sec
>   Filter 7 (1.00 MB):
>      - Rows processed: 229.36K (229362)
>      - Rows rejected: 3.96K (3965)
>      - Rows total: 229.38K (229376)
>   Filter 8 (1.00 MB):
>      - Files processed: 349 (349)
>      - Files rejected: 335 (335)
>      - Files total: 349 (349)
>      - RowGroups processed: 88.21K (88214)
>      - RowGroups rejected: 0 (0)
>      - RowGroups total: 88.21K (88214)
>      - Rows processed: 229.36K (229362)
>      - Rows rejected: 0 (0)
>      - Rows total: 229.38K (229376)
>      - Splits processed: 14 (14)
>      - Splits rejected: 0 (0)
>      - Splits total: 14 (14)
> {code}
> Ditto in final routing table
> {code}
> ID  Src. Node  Tgt. Node(s)  Targets  Target type  Partition filter  Pending (Expected)  First arrived  Completed
> -----------------------------------------------------------------------------------------------------------------
>  6  3          0             20       LOCAL        true              0 (20)              N/A            N/A
>  5  4          0             20       LOCAL        false             0 (20)              N/A            N/A
>  8  9          6             20       LOCAL
[jira] [Resolved] (IMPALA-3636) Regression in DecimalOperators::EQ with codegen disabled
[ https://issues.apache.org/jira/browse/IMPALA-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Armstrong resolved IMPALA-3636. --- Resolution: Won't Fix > Regression in DecimalOperators::EQ with codegen disabled > > > Key: IMPALA-3636 > URL: https://issues.apache.org/jira/browse/IMPALA-3636 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Affects Versions: Impala 2.6.0 >Reporter: Mostafa Mokhtar >Priority: Minor > Labels: performance, regression > > Some of the decimal improvements that came in Impala 2.6 introduced a > regression in the non-codegened path. > This regression was caused by > https://github.com/cloudera/Impala/blob/cdh5-trunk/testdata/workloads/targeted-perf/queries/primitive_orderby_all.test. > > After > ||Function Stack||CPU Time: Total|| > |impala::DecimalOperators::Eq_DecimalVal_DecimalVal|62.207s| > | --impala::Expr::GetConstantInt|55.458s| > | --impala::DecimalValue::Eq|1.480s| > | --impala::GetDecimal8Value|0.290s| > | --impala::DecimalValue<__int128>::Eq|0.190s| > > Before > ||Function Stack||CPU Time: Total|| > |impala::DecimalOperators::Eq_DecimalVal_DecimalVal|9.809s| > | --impala::DecimalValue::Compare|2.300s| > | --impala_udf::FunctionContext::GetArgType|2.130s| > | --func@0x812950|0.390s| > This is a simplified version of the query which can be used as a repro > {code} > select * > FROM ( > SELECT Rank() OVER ( > ORDER BY l_extendedprice > ,l_quantity > ,l_discount > ,l_tax > ) AS rank > FROM lineitem > WHERE l_shipdate < '1992-05-09' > ) a > WHERE rank < 10 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-3101) AnalyticEvalNode should use codegened TupleRowComparator instead of PrevRowCompare
[ https://issues.apache.org/jira/browse/IMPALA-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134600#comment-17134600 ] Tim Armstrong commented on IMPALA-3101: --- IMPALA-4356 should guarantee that the full expr tree was codegen'd > AnalyticEvalNode should use codegened TupleRowComparator instead of > PrevRowCompare > -- > > Key: IMPALA-3101 > URL: https://issues.apache.org/jira/browse/IMPALA-3101 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Affects Versions: Impala 2.6.0 >Reporter: Mostafa Mokhtar >Priority: Minor > Labels: codegen, performance > Attachments: primitive_orderby_bigint_VtuneTopDown.csv > > > AnalyticEvalNode uses PrevRowCompare to compare rows, which is very > inefficient compared to the codegend version of TupleRowComparator::Compare > |Function Stack||CPU Time: Total||CPU Time: Self||Module||Function > (Full)||Source File||Start Address| > |impala::AnalyticEvalNode::ProcessChildBatch|47.9%|0.810s|impalad|impala::AnalyticEvalNode::ProcessChildBatch(impala::RuntimeState*)|analytic-eval-node.cc|0xc0a870| > | > impala::AnalyticEvalNode::TryAddResultTupleForPrevRow|35.0%|0.570s|impalad|impala::AnalyticEvalNode::TryAddResultTupleForPrevRow(bool, > long, impala::TupleRow*)|analytic-eval-node.cc|0xc0aa85| > | > impala::AnalyticEvalNode::PrevRowCompare|30.3%|0.040s|impalad|impala::AnalyticEvalNode::PrevRowCompare(impala::ExprContext*)|analytic-eval-node.cc|0xc0ae1d| > | > impala::ExprContext::GetBooleanVal|30.2%|0.330s|impalad|impala::ExprContext::GetBooleanVal(impala::TupleRow*)|expr-context.cc|0x7f0790| > | > impala::AndPredicate::GetBooleanVal|29.8%|1.220s|impalad|impala::AndPredicate::GetBooleanVal(impala::ExprContext*, > impala::TupleRow*)|compound-predicates.cc|0x8575c0| > | > impala::OrPredicate::GetBooleanVal|28.5%|2.840s|impalad|impala::OrPredicate::GetBooleanVal(impala::ExprContext*, > impala::TupleRow*)|compound-predicates.cc|0x857650| > These queries can be used for repro > 
https://github.com/cloudera/Impala/blob/cdh5-trunk/testdata/workloads/targeted-perf/queries/primitive_orderby_all.test > https://github.com/cloudera/Impala/blob/cdh5-trunk/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test
[jira] [Commented] (IMPALA-9847) JSON profiles are mostly space characters
[ https://issues.apache.org/jira/browse/IMPALA-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134597#comment-17134597 ] ASF subversion and git services commented on IMPALA-9847: - Commit 6ca6e403580dc592c026b4f684d31f8a4dcfae11 in impala's branch refs/heads/master from Tim Armstrong [ https://gitbox.apache.org/repos/asf?p=impala.git;h=6ca6e40 ] IMPALA-9847: reduce web UI serialized JSON size Switch to using the plain writer in some places, and tweak PrettyWriter to produce denser output for the debug UI JSON (so that it's still human readable but denser). Testing: Manually tested. The profile for the below query went from 338kB to 134kB. select min(l_orderkey) from tpch_parquet.lineitem; Change-Id: I66af9d00f0f0fc70e324033b6464b75a6adadd6f Reviewed-on: http://gerrit.cloudera.org:8080/16068 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins > JSON profiles are mostly space characters > - > > Key: IMPALA-9847 > URL: https://issues.apache.org/jira/browse/IMPALA-9847 > Project: IMPALA > Issue Type: Sub-task > Components: Backend >Reporter: Tim Armstrong >Assignee: Tim Armstrong >Priority: Major > Fix For: Impala 4.0 > > > JSON profiles are pretty-printed with 4 space characters per indent. This > means that most of the profile data is actually just space characters, and > this can add up for large profiles.
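The whitespace overhead described in the issue is easy to demonstrate. The sketch below is illustrative only (the data is made up and the real profiles are serialized by rapidjson in C++, not Python's json module): it compares 4-space-indented, 1-space-indented, and fully compact serialization of the same nested structure.

```python
import json

# Hypothetical stand-in for a runtime profile tree; the real Impala profiles
# are much larger and more deeply nested, which amplifies the indent overhead.
profile = {
    "query": {
        "summary": {"rows": 10, "time_ms": 42},
        "fragments": [
            {"id": i, "counters": {"BytesRead": i * 1024, "RowsReturned": i * 10}}
            for i in range(20)
        ],
    }
}

pretty4 = json.dumps(profile, indent=4)               # 4-space indent, as before the fix
pretty1 = json.dumps(profile, indent=1)               # denser but still human readable
compact = json.dumps(profile, separators=(",", ":"))  # no whitespace at all

# Fraction of the pretty-printed output that is just space characters.
spaces = sum(c == " " for c in pretty4)
print(len(pretty4), len(pretty1), len(compact))
print(f"{spaces / len(pretty4):.0%} of the 4-space output is spaces")
```

Reducing the indent (or dropping it entirely for machine-consumed output) shrinks the serialized size substantially without changing the content, which is the same trade-off the commit makes with PrettyWriter.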
[jira] [Commented] (IMPALA-9824) MetastoreClientPool should be singleton
[ https://issues.apache.org/jira/browse/IMPALA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134596#comment-17134596 ] ASF subversion and git services commented on IMPALA-9824: - Commit 0cb44242d20532945e5fb09f5bbef6c65415a753 in impala's branch refs/heads/master from Vihang Karajgaonkar [ https://gitbox.apache.org/repos/asf?p=impala.git;h=0cb4424 ] IMPALA-9791: Support validWriteIdList in getPartialCatalogObject API This change enhances the Catalog-v2 API getPartialCatalogObject to support ValidWriteIdList as an optional field in the TableInfoSelector. When such a field is provided by the clients, catalog compares the provided ValidWriteIdList with the cached ValidWriteIdList of the table. The catalog reloads the table if it determines that the cached table is stale with respect to the ValidWriteIdList provided. In case the table is already at or above the requested ValidWriteIdList catalog uses the cached table metadata information to filter out filedescriptors pertaining to the provided ValidWriteIdList. Note that in case compactions it is possible that the requested ValidWriteIdList cannot be satisfied using the cached file-metadata for some partitions. For such partitions, catalog re-fetches the file-metadata from the FileSystem. In order to implement the fall-back to getting the file-metadata from filesystem, the patch refactor some of file-metadata loading logic into ParallelFileMetadataLoader which also helps simplify some methods in HdfsTable.java. Additionally, it modifies the WriteIdBasedPredicate to optionally do a strict check which throws an exception on some scenarios. This is helpful to provide a snapshot view of the table metadata during query compilation with respect to other changes happening to the table concurrently. Note that this change does not implement the coordinator side changes needed for catalog clients to use such a field. That would be taken up in a separate change to keep this patch smaller. 
Testing: 1. Ran existing filemetadata loader tests. 2. Added a new test which exercises the various cases for ValidWriteIdList comparison. 3. Ran core tests along with the dependent MetastoreClientPool patch (IMPALA-9824). Change-Id: Ied2c7c3cb2009c407e8fbc3af4722b0d34f57c4a Reviewed-on: http://gerrit.cloudera.org:8080/16008 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins > MetastoreClientPool should be singleton > --- > > Key: IMPALA-9824 > URL: https://issues.apache.org/jira/browse/IMPALA-9824 > Project: IMPALA > Issue Type: Improvement > Components: Catalog >Reporter: Vihang Karajgaonkar >Assignee: Vihang Karajgaonkar >Priority: Minor > > Currently, the MetastoreClientPool is instantiated at multiple places in the > code and it would be good to refactor the code to make it a singleton. Each > MetastoreClientPool creates multiple clients to HMS and unnecessary creation > of multiple pools could cause problems on HMS side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-9843) Add ability to run schematool against HMS in minicluster
[ https://issues.apache.org/jira/browse/IMPALA-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134598#comment-17134598 ] ASF subversion and git services commented on IMPALA-9843: - Commit f8c28f8adfd781727c311b15546a532ce65881e0 in impala's branch refs/heads/master from Vihang Karajgaonkar [ https://gitbox.apache.org/repos/asf?p=impala.git;h=f8c28f8 ] IMPALA-9843: Add support for metastore db schema upgrade This change adds support to upgrade the HMS database schema using the hive schema tool. It adds a new option to the buildall.sh script which can be provided to upgrade the HMS db schema. Alternatively, users can directly upgrade the schema using the create-test-configuration.sh script. The logs for the schema upgrade are available in logs/cluster/schematool.log. Following invocations will upgrade the HMS database schema. 1. buildall.sh -upgrade_metastore_db 2. bin/create-test-configuration.sh -upgrade_metastore_db This upgrade option is idempotent. It is a no-op if the metastore schema is already at its latest version. In case of any errors, the only fallback currently is to format the metastore schema and load the test data again. Testing: Upgraded the HMS schema on my local dev environment and made sure that the HMS service starts without any errors. Change-Id: I85af8d57e110ff284832056a1661f94b85ed3b09 Reviewed-on: http://gerrit.cloudera.org:8080/16054 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins > Add ability to run schematool against HMS in minicluster > > > Key: IMPALA-9843 > URL: https://issues.apache.org/jira/browse/IMPALA-9843 > Project: IMPALA > Issue Type: Improvement >Reporter: Sahil Takiar >Assignee: Vihang Karajgaonkar >Priority: Major > Fix For: Impala 4.0 > > > When the CDP version is bumped, we often need to re-format the HMS postgres > database because the HMS schema needs updating. 
Hive provides a standalone > tool for performing schema updates: > [https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool] > Impala should be able to integrate with this tool, so that developers don't > have to blow away their HMS database every time the CDP version is bumped up. > Even worse, blowing away the HMS data requires performing a full data load. > It would be great to have a wrapper around the schematool that can easily be > invoked by developers.
[jira] [Commented] (IMPALA-9791) Support validWriteIdList in getPartialCatalogObject
[ https://issues.apache.org/jira/browse/IMPALA-9791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134595#comment-17134595 ] ASF subversion and git services commented on IMPALA-9791: - Commit 0cb44242d20532945e5fb09f5bbef6c65415a753 in impala's branch refs/heads/master from Vihang Karajgaonkar [ https://gitbox.apache.org/repos/asf?p=impala.git;h=0cb4424 ] IMPALA-9791: Support validWriteIdList in getPartialCatalogObject API This change enhances the Catalog-v2 API getPartialCatalogObject to support ValidWriteIdList as an optional field in the TableInfoSelector. When such a field is provided by the clients, catalog compares the provided ValidWriteIdList with the cached ValidWriteIdList of the table. The catalog reloads the table if it determines that the cached table is stale with respect to the ValidWriteIdList provided. In case the table is already at or above the requested ValidWriteIdList catalog uses the cached table metadata information to filter out filedescriptors pertaining to the provided ValidWriteIdList. Note that in case compactions it is possible that the requested ValidWriteIdList cannot be satisfied using the cached file-metadata for some partitions. For such partitions, catalog re-fetches the file-metadata from the FileSystem. In order to implement the fall-back to getting the file-metadata from filesystem, the patch refactor some of file-metadata loading logic into ParallelFileMetadataLoader which also helps simplify some methods in HdfsTable.java. Additionally, it modifies the WriteIdBasedPredicate to optionally do a strict check which throws an exception on some scenarios. This is helpful to provide a snapshot view of the table metadata during query compilation with respect to other changes happening to the table concurrently. Note that this change does not implement the coordinator side changes needed for catalog clients to use such a field. That would be taken up in a separate change to keep this patch smaller. 
Testing: 1. Ran existing filemetadata loader tests. 2. Added a new test which exercises the various cases for ValidWriteIdList comparison. 3. Ran core tests along with the dependent MetastoreClientPool patch (IMPALA-9824). Change-Id: Ied2c7c3cb2009c407e8fbc3af4722b0d34f57c4a Reviewed-on: http://gerrit.cloudera.org:8080/16008 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins > Support validWriteIdList in getPartialCatalogObject > --- > > Key: IMPALA-9791 > URL: https://issues.apache.org/jira/browse/IMPALA-9791 > Project: IMPALA > Issue Type: Improvement >Reporter: Vihang Karajgaonkar >Assignee: Vihang Karajgaonkar >Priority: Major > Labels: impala-acid > > When transactional tables are being queried, the coordinator (or any other > Catalog client) can optionally provide a ValidWriteIdList of the table. In > such case, catalog can return the metadata which is consistent with the given > ValidWriteIdList. There are the following 3 possibilities: > 1. Client provided ValidWriteIdList is more recent. > In this case, catalog should reload the table then send the metadata > consistent with the provided writeIdList. > 2. Client ValidWriteIdList is same. > Catalog can return the cached metadata directly. > 3. ClientValidWriteIdList is stale with respect to the one in catalog. > In this case, catalog can attempt to return metadata which is consistent with > respect to client's view of the writeIdList and return accordingly. Note that > in case 1, it is possible that after reload, catalog moves ahead of the > client's writeIdList and hence this becomes a sub-case of 1. > Having such an enhancement to the API can help support consistent read > support for ACID tables (see IMPALA-8788) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
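The three cases enumerated in the issue description can be sketched as a small decision function. This is an illustrative simplification, not Impala's catalog code: a real ValidWriteIdList also tracks open and aborted write ids, so the real comparison is richer than a single high-water-mark integer.

```python
from enum import Enum

class Action(Enum):
    RELOAD = "reload table, then serve"               # case 1: client is ahead of cache
    SERVE_CACHED = "serve cached metadata"            # case 2: same snapshot
    FILTER_CACHED = "filter cached file descriptors"  # case 3: client is behind cache

def plan_for_request(client_hwm: int, cached_hwm: int) -> Action:
    """Hypothetical simplification: each ValidWriteIdList is modeled as just
    its high-water-mark write id."""
    if client_hwm > cached_hwm:
        # Cached table is stale relative to the client's snapshot: reload.
        # (After reload the cache may be ahead, reducing to case 3.)
        return Action.RELOAD
    if client_hwm == cached_hwm:
        return Action.SERVE_CACHED
    # Cache is ahead: serve a filtered view consistent with the client's
    # snapshot, falling back to the filesystem if compactions make that
    # impossible from cached file-metadata alone.
    return Action.FILTER_CACHED
```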
[jira] [Assigned] (IMPALA-3101) AnalyticEvalNode should use codegened TupleRowComparator instead of PrevRowCompare
[ https://issues.apache.org/jira/browse/IMPALA-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Armstrong reassigned IMPALA-3101: - Assignee: (was: Michael Ho) > AnalyticEvalNode should use codegened TupleRowComparator instead of > PrevRowCompare > -- > > Key: IMPALA-3101 > URL: https://issues.apache.org/jira/browse/IMPALA-3101 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Affects Versions: Impala 2.6.0 >Reporter: Mostafa Mokhtar >Priority: Minor > Labels: codegen, performance > Attachments: primitive_orderby_bigint_VtuneTopDown.csv > > > AnalyticEvalNode uses PrevRowCompare to compare rows, which is very > inefficient compared to the codegend version of TupleRowComparator::Compare > |Function Stack||CPU Time: Total||CPU Time: Self||Module||Function > (Full)||Source File||Start Address| > |impala::AnalyticEvalNode::ProcessChildBatch|47.9%|0.810s|impalad|impala::AnalyticEvalNode::ProcessChildBatch(impala::RuntimeState*)|analytic-eval-node.cc|0xc0a870| > | > impala::AnalyticEvalNode::TryAddResultTupleForPrevRow|35.0%|0.570s|impalad|impala::AnalyticEvalNode::TryAddResultTupleForPrevRow(bool, > long, impala::TupleRow*)|analytic-eval-node.cc|0xc0aa85| > | > impala::AnalyticEvalNode::PrevRowCompare|30.3%|0.040s|impalad|impala::AnalyticEvalNode::PrevRowCompare(impala::ExprContext*)|analytic-eval-node.cc|0xc0ae1d| > | > impala::ExprContext::GetBooleanVal|30.2%|0.330s|impalad|impala::ExprContext::GetBooleanVal(impala::TupleRow*)|expr-context.cc|0x7f0790| > | > impala::AndPredicate::GetBooleanVal|29.8%|1.220s|impalad|impala::AndPredicate::GetBooleanVal(impala::ExprContext*, > impala::TupleRow*)|compound-predicates.cc|0x8575c0| > | > impala::OrPredicate::GetBooleanVal|28.5%|2.840s|impalad|impala::OrPredicate::GetBooleanVal(impala::ExprContext*, > impala::TupleRow*)|compound-predicates.cc|0x857650| > These queries can be used for repro > 
https://github.com/cloudera/Impala/blob/cdh5-trunk/testdata/workloads/targeted-perf/queries/primitive_orderby_all.test > https://github.com/cloudera/Impala/blob/cdh5-trunk/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test
[jira] [Resolved] (IMPALA-2400) Unpredictable locality behavior for reading Parquet files
[ https://issues.apache.org/jira/browse/IMPALA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Armstrong resolved IMPALA-2400. --- Resolution: Cannot Reproduce > Unpredictable locality behavior for reading Parquet files > - > > Key: IMPALA-2400 > URL: https://issues.apache.org/jira/browse/IMPALA-2400 > Project: IMPALA > Issue Type: Bug > Components: Perf Investigation >Affects Versions: Impala 2.3.0 >Reporter: Mostafa Mokhtar >Priority: Minor > Labels: ramp-up > Attachments: LocalRead.txt, RemoteRead.txt > > > When running the query below I noticed exceptionally high variance even after > running "invalidate metadata". > select * from tpch_bin_flat_parquet_30.lineitem limit 10; > * Fetched 10 row(s) in 1.08s > WARNINGS: Read 139.48 MB of data across network that was expected to be > local. Block locality metadata for table 'tpch_bin_flat_parquet_30.lineitem' > may be stale. Consider running "INVALIDATE METADATA > `tpch_bin_flat_parquet_30`.`lineitem`". > * Fetched 10 row(s) in 1.32s > * Fetched 10 row(s) in 0.09s > * Fetched 10 row(s) in 1.08s > * "invalidate metadata" > * Fetched 10 row(s) in 0.89s > * Fetched 10 row(s) in 0.07s > WARNINGS: Read 76.15 MB of data across network that was expected to be local. > Block locality metadata for table 'tpch_bin_flat_parquet_30.lineitem' may be > stale. Consider running "INVALIDATE METADATA > `tpch_bin_flat_parquet_30`.`lineitem`". > * Fetched 10 row(s) in 1.11s > * Fetched 10 row(s) in 0.73s > * Fetched 10 row(s) in 0.09s > The behavior above is tied to Parquet tables and doesn't repro against text > data. > Profile files attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-2522) Improve the reliability and effectiveness of ETL
[ https://issues.apache.org/jira/browse/IMPALA-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Armstrong resolved IMPALA-2522. --- Resolution: Fixed Will mark as fixed for now, since the vast majority of subtasks are completed and there hasn't been movement for a while. > Improve the reliability and effectiveness of ETL > > > Key: IMPALA-2522 > URL: https://issues.apache.org/jira/browse/IMPALA-2522 > Project: IMPALA > Issue Type: Epic > Components: Backend >Affects Versions: Impala 2.2, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, > Impala 2.6.0, Impala 2.7.0 >Reporter: Mostafa Mokhtar >Assignee: Lars Volker >Priority: Major > Labels: ETL, performance > > h4. Reduce the memory requirements of INSERTs into partitioned tables. > Impala inserts into partitioned Parquet tables suffer from high memory > requirements because each Impala Daemon will keep ~256MB of buffer space per > open partition in the table sink. This often leads to large insert jobs > hitting "Memory limit exceeded" errors. The behavior can be improved by > pre-clustering the data such that only one partition needs to be buffered at > a time in the table sink. > Add a new "clustered" plan hint for insert statements. Example: > {code} > CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT); > INSERT INTO dst PARTITION(year,month) /*+ clustered */ SELECT * FROM src; > {code} > The hint specifies that the data fed into the table sink should be clustered > based on the partition columns. For now, we'll use a sort to achieve > clustering, and the plan should look like this: > SCAN -> SORT (year,month) -> TABLE SINK > h4. Give users additional control over the insertion order. > In order to improve compression and/or the effectiveness of min/max pruning, > it is desirable to control the order in which rows are inserted into table > (mostly for Parquet). > Introduce a "sortby" plan hint for insert statements: Example > {code} > CREATE TABLE dst (...) 
PARTITIONED BY (year INT, month INT); > INSERT INTO dst PARTITION(year,month) /*+ clustered sortby(day,hour) */ > SELECT * FROM src > {code} > This would produce the following plan: > SCAN -> SORT(year,month,day,hour) -> TABLE SINK > h4. Improve the sort efficiency > The additional sorting step introduced by both solutions above should be as > efficient as possible. > Codegen TupleRowComparator and Tuple::MaterializeExprs. > h4. Summary > With more predictable and resource-efficient ETL users will extract more > value out of Impala and will need to rely less on slow legacy ETL tools like > Hive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Work started] (IMPALA-7020) Order by expressions in Analytical functions are not materialized causing slowdown
[ https://issues.apache.org/jira/browse/IMPALA-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-7020 started by Tim Armstrong. - > Order by expressions in Analytical functions are not materialized causing > slowdown > -- > > Key: IMPALA-7020 > URL: https://issues.apache.org/jira/browse/IMPALA-7020 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Affects Versions: Impala 2.12.0 >Reporter: Mostafa Mokhtar >Assignee: Tim Armstrong >Priority: Major > Labels: performance > Attachments: Slow case profile.txt, Workaround profile.txt > > > Order by expressions in Analytical functions are not materialized and cause > queries to run much slower. > The rewrite for the query below is 20x faster, profiles attached. > Repro > {code} > select * > FROM > ( > SELECT > o.*, > ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn > FROM > ( > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem > WHERE > l_shipdate BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00' > ) o > ) r > WHERE > rn BETWEEN 1 AND 101 > ORDER BY rn; > {code} > Workaround > {code} > select * > FROM > ( > SELECT > o.*, > ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn > FROM > ( > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem > WHERE > l_shipdate BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00' > union all > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem limit 0 > > ) o > ) r > WHERE > rn BETWEEN 1 AND 101 > ORDER BY rn; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-9740) TSAN data race in hdfs-bulk-ops
[ https://issues.apache.org/jira/browse/IMPALA-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134569#comment-17134569 ] Sahil Takiar commented on IMPALA-9740: -- custom_cluster.test_insert_behaviour.TestInsertBehaviourCustomCluster.test_insert_inherit_permission reproduces this. > TSAN data race in hdfs-bulk-ops > --- > > Key: IMPALA-9740 > URL: https://issues.apache.org/jira/browse/IMPALA-9740 > Project: IMPALA > Issue Type: Sub-task > Components: Backend >Reporter: Sahil Takiar >Priority: Major > > hdfs-bulk-ops usage of a local connection cache (HdfsFsCache::HdfsFsMap) has > a data race: > {code:java} > WARNING: ThreadSanitizer: data race (pid=23205) > Write of size 8 at 0x7b24005642d8 by thread T47: > #0 > boost::unordered::detail::table_impl const, hdfs_internal*> >, std::string, hdfs_internal*, > boost::hash, std::equal_to > > >::add_node(boost::unordered::detail::node_constructor const, hdfs_internal*> > > >&, unsigned long) > /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/unordered/detail/unique.hpp:329:26 > (impalad+0x1f93832) > #1 > std::pair const, hdfs_internal*> > >, bool> > boost::unordered::detail::table_impl const, hdfs_internal*> >, std::string, hdfs_internal*, > boost::hash, std::equal_to > > >::emplace_impl >(std::string > const&, std::pair&&) > /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/unordered/detail/unique.hpp:420:41 > (impalad+0x1f933ed) > #2 > std::pair const, hdfs_internal*> > >, bool> > boost::unordered::detail::table_impl const, hdfs_internal*> >, std::string, hdfs_internal*, > boost::hash, std::equal_to > > >::emplace > >(std::pair&&) > /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/unordered/detail/unique.hpp:384:20 > (impalad+0x1f932d1) > #3 > std::pair const, hdfs_internal*> > >, bool> > boost::unordered::unordered_map boost::hash, 
std::equal_to, > std::allocator > > >::emplace > >(std::pair&&) > /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/unordered/unordered_map.hpp:241:27 > (impalad+0x1f93238) > #4 boost::unordered::unordered_map boost::hash, std::equal_to, > std::allocator > > >::insert(std::pair&&) > /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/unordered/unordered_map.hpp:390:26 > (impalad+0x1f92038) > #5 impala::HdfsFsCache::GetConnection(std::string const&, > hdfs_internal**, boost::unordered::unordered_map boost::hash, std::equal_to, > std::allocator > >*) > /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/hdfs-fs-cache.cc:115:18 > (impalad+0x1f916b3) > #6 impala::HdfsOp::Execute() const > /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/hdfs-bulk-ops.cc:84:55 > (impalad+0x23444d5) > #7 HdfsThreadPoolHelper(int, impala::HdfsOp const&) > /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/hdfs-bulk-ops.cc:137:6 > (impalad+0x2344ea9) > #8 boost::detail::function::void_function_invoker2 impala::HdfsOp const&), void, int, impala::HdfsOp > const&>::invoke(boost::detail::function::function_buffer&, int, > impala::HdfsOp const&) > /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:118:11 > (impalad+0x2345e80) > #9 boost::function2::operator()(int, > impala::HdfsOp const&) const > /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:770:14 > (impalad+0x1f883be) > #10 impala::ThreadPool::WorkerThread(int) > /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/thread-pool.h:166:9 > (impalad+0x1f874e5) > #11 boost::_mfi::mf1, > int>::operator()(impala::ThreadPool*, int) const > 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/mem_fn_template.hpp:165:29 > (impalad+0x1f87b7d) > #12 void > boost::_bi::list2*>, > boost::_bi::value >::operator() impala::ThreadPool, int>, > boost::_bi::list0>(boost::_bi::type, boost::_mfi::mf1 impala::ThreadPool, int>&, boost::_bi::list0&, int) > /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:319:9 > (impalad+0x1f87abc) > #13 boost::_bi::bind_t impala::ThreadPool, int>, > boost::_bi::list2*>, > boost::_bi::value > >::operator()() > /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16 > (impalad+0x1f87a23) >
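The race reported here is two threads inserting into a shared connection map without synchronization. The usual shape of a fix is to hold a lock across the lookup-and-insert; the sketch below shows that pattern in Python (the actual Impala code is C++ and caches hdfsFS handles keyed by filesystem name, so names here are stand-ins).

```python
import threading

class ConnectionCache:
    """Sketch of a mutex-guarded connection cache; 'connect' is a stand-in
    factory for whatever expensive handle is being cached."""

    def __init__(self, connect):
        self._connect = connect        # factory: name -> connection object
        self._lock = threading.Lock()  # serializes every read/write of the map
        self._conns = {}

    def get(self, name):
        # Holding the lock across both the lookup and the insert removes the
        # kind of data race TSAN reports when two threads emplace into the
        # map concurrently, and also prevents duplicate connections.
        with self._lock:
            if name not in self._conns:
                self._conns[name] = self._connect(name)
            return self._conns[name]

calls = []
cache = ConnectionCache(lambda name: calls.append(name) or object())
threads = [threading.Thread(target=cache.get, args=("hdfs://nn1",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock, the factory runs exactly once despite 8 concurrent callers.
```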
[jira] [Commented] (IMPALA-7020) Order by expressions in Analytical functions are not materialized causing slowdown
[ https://issues.apache.org/jira/browse/IMPALA-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134542#comment-17134542 ] Tim Armstrong commented on IMPALA-7020: --- One proposal: * Casts between integral and floating point types should have ARITHMETIC_OP_COST, because they are simple arithmetic conversion (casts involving decimal are often non-trivial). * Casts between STRING and VARCHAR should have ARITHMETIC_OP_COST, because they are only modifying the length field, at worst. * All other casts should have FUNCTION_CALL_COST, because they require some non-trivial conversion. > Order by expressions in Analytical functions are not materialized causing > slowdown > -- > > Key: IMPALA-7020 > URL: https://issues.apache.org/jira/browse/IMPALA-7020 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Affects Versions: Impala 2.12.0 >Reporter: Mostafa Mokhtar >Assignee: Tim Armstrong >Priority: Major > Labels: performance > Attachments: Slow case profile.txt, Workaround profile.txt > > > Order by expressions in Analytical functions are not materialized and cause > queries to run much slower. > The rewrite for the query below is 20x faster, profiles attached. 
> Repro > {code} > select * > FROM > ( > SELECT > o.*, > ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn > FROM > ( > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem > WHERE > l_shipdate BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00' > ) o > ) r > WHERE > rn BETWEEN 1 AND 101 > ORDER BY rn; > {code} > Workaround > {code} > select * > FROM > ( > SELECT > o.*, > ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn > FROM > ( > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem > WHERE > l_shipdate BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00' > union all > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem limit 0 > > ) o > ) r > WHERE > rn BETWEEN 1 AND 101 > ORDER BY rn; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
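The cost rules proposed in the comment above can be sketched as a small function. The constant names mirror those in Expr.java, but the type handling here is purely illustrative, not Impala planner code:

```python
ARITHMETIC_OP_COST = 1   # constant names follow Expr.java; the dispatch
FUNCTION_CALL_COST = 10  # logic below is an illustrative sketch

INTEGRAL = {"TINYINT", "SMALLINT", "INT", "BIGINT"}
FLOATING = {"FLOAT", "DOUBLE"}
STRINGY = {"STRING", "VARCHAR"}

def cast_cost(src: str, dst: str) -> float:
    numeric = INTEGRAL | FLOATING
    if src in numeric and dst in numeric:
        return ARITHMETIC_OP_COST   # simple arithmetic conversion
    if src in STRINGY and dst in STRINGY:
        return ARITHMETIC_OP_COST   # at worst adjusts the length field
    return FUNCTION_CALL_COST       # non-trivial (e.g. anything involving decimal)
```

Under this scheme the cast in the repro query (timestamp-like string from `l_shipdate`) would be costed high enough for the planner to materialize it before the sort, which is the effect the CAST_COST bump in the later comment achieves with a blanket constant.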
[jira] [Assigned] (IMPALA-9213) Client logs should indicate if a query has been retried
[ https://issues.apache.org/jira/browse/IMPALA-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sahil Takiar reassigned IMPALA-9213: Assignee: Quanlong Huang > Client logs should indicate if a query has been retried > --- > > Key: IMPALA-9213 > URL: https://issues.apache.org/jira/browse/IMPALA-9213 > Project: IMPALA > Issue Type: Sub-task >Reporter: Sahil Takiar >Assignee: Quanlong Huang >Priority: Major > > The client logs should give some indication that a query has been retried and > should print out information such as the new query id and the link to the > retried query on the debug web UI.
[jira] [Commented] (IMPALA-7020) Order by expressions in Analytical functions are not materialized causing slowdown
[ https://issues.apache.org/jira/browse/IMPALA-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134523#comment-17134523 ] Tim Armstrong commented on IMPALA-7020: --- This is sufficient to force it to be materialised: {noformat} tarmstrong@tarmstrong-box2:~/Impala/impala$ git diff diff --git a/fe/src/main/java/org/apache/impala/analysis/Expr.java b/fe/src/main/java/org/apache/impala/analysis/Expr.java index 6ef5715a2..c636b4971 100644 --- a/fe/src/main/java/org/apache/impala/analysis/Expr.java +++ b/fe/src/main/java/org/apache/impala/analysis/Expr.java @@ -83,7 +83,7 @@ abstract public class Expr extends TreeNode implements ParseNode, Cloneabl public static final float ARITHMETIC_OP_COST = 1; public static final float BINARY_PREDICATE_COST = 1; public static final float VAR_LEN_BINARY_PREDICATE_COST = 5; - public static final float CAST_COST = 1; + public static final float CAST_COST = 20; public static final float COMPOUND_PREDICATE_COST = 1; public static final float FUNCTION_CALL_COST = 10; public static final float IS_NOT_EMPTY_COST = 1; {noformat} > Order by expressions in Analytical functions are not materialized causing > slowdown > -- > > Key: IMPALA-7020 > URL: https://issues.apache.org/jira/browse/IMPALA-7020 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Affects Versions: Impala 2.12.0 >Reporter: Mostafa Mokhtar >Assignee: Tim Armstrong >Priority: Major > Labels: performance > Attachments: Slow case profile.txt, Workaround profile.txt > > > Order by expressions in Analytical functions are not materialized and cause > queries to run much slower. > The rewrite for the query below is 20x faster, profiles attached. 
> Repro > {code} > select * > FROM > ( > SELECT > o.*, > ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn > FROM > ( > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem > WHERE > l_shipdate BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00' > ) o > ) r > WHERE > rn BETWEEN 1 AND 101 > ORDER BY rn; > {code} > Workaround > {code} > select * > FROM > ( > SELECT > o.*, > ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn > FROM > ( > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem > WHERE > l_shipdate BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00' > union all > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem limit 0 > > ) o > ) r > WHERE > rn BETWEEN 1 AND 101 > ORDER BY rn; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-7020) Order by expressions in Analytical functions are not materialized causing slowdown
[ https://issues.apache.org/jira/browse/IMPALA-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134508#comment-17134508 ] Tim Armstrong commented on IMPALA-7020: --- I think we're costing the cast expression wrong in this case - the cost of the cast expression is below the threshold to materialise. > Order by expressions in Analytical functions are not materialized causing > slowdown > -- > > Key: IMPALA-7020 > URL: https://issues.apache.org/jira/browse/IMPALA-7020 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Affects Versions: Impala 2.12.0 >Reporter: Mostafa Mokhtar >Assignee: Tim Armstrong >Priority: Major > Labels: performance > Attachments: Slow case profile.txt, Workaround profile.txt > > > Order by expressions in Analytical functions are not materialized and cause > queries to run much slower. > The rewrite for the query below is 20x faster, profiles attached. > Repro > {code} > select * > FROM > ( > SELECT > o.*, > ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn > FROM > ( > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem > WHERE > l_shipdate BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00' > ) o > ) r > WHERE > rn BETWEEN 1 AND 101 > ORDER BY rn; > {code} > Workaround > {code} > select * > FROM > ( > SELECT > o.*, > ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn > FROM > ( > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem > WHERE > l_shipdate BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00' > union all > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem limit 0 > > ) o > ) r > WHERE > rn BETWEEN 1 AND 101 > ORDER BY rn; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: 
issues-all-h...@impala.apache.org
[jira] [Assigned] (IMPALA-7020) Order by expressions in Analytical functions are not materialized causing slowdown
[ https://issues.apache.org/jira/browse/IMPALA-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Armstrong reassigned IMPALA-7020: - Assignee: Tim Armstrong > Order by expressions in Analytical functions are not materialized causing > slowdown > -- > > Key: IMPALA-7020 > URL: https://issues.apache.org/jira/browse/IMPALA-7020 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Affects Versions: Impala 2.12.0 >Reporter: Mostafa Mokhtar >Assignee: Tim Armstrong >Priority: Major > Labels: performance > Attachments: Slow case profile.txt, Workaround profile.txt > > > Order by expressions in Analytical functions are not materialized and cause > queries to run much slower. > The rewrite for the query below is 20x faster, profiles attached. > Repro > {code} > select * > FROM > ( > SELECT > o.*, > ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn > FROM > ( > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem > WHERE > l_shipdate BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00' > ) o > ) r > WHERE > rn BETWEEN 1 AND 101 > ORDER BY rn; > {code} > Workaround > {code} > select * > FROM > ( > SELECT > o.*, > ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn > FROM > ( > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem > WHERE > l_shipdate BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00' > union all > SELECT > l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as > string) evt_ts > FROM > lineitem limit 0 > > ) o > ) r > WHERE > rn BETWEEN 1 AND 101 > ORDER BY rn; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-9824) MetastoreClientPool should be singleton
[ https://issues.apache.org/jira/browse/IMPALA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134469#comment-17134469 ] Vihang Karajgaonkar commented on IMPALA-9824: - I spent some time on this and almost got it working. We could still make it work, but it made me think that we are trying to satisfy two conflicting requirements. Our FE unit tests spin up their own CatalogServiceCatalog instances (see CatalogServiceTestCatalog, for example). Testing can become flaky if we make MetastoreClientPool a singleton, since all the FE tests run within a single process and would therefore share the MetastoreClientPool. We currently rely on the Catalog#close() call in the tests to shut down the pool. This works OK for most of the tests, except the ones which rely on {{createTransientTestCatalog}}, which uses an EmbeddedHMS service. Currently, MetastoreClientPool has a one-to-one mapping with Catalog instances. The MetastoreClientPool in {{DirectMetaProvider}} should ideally never get instantiated after we fix IMPALA-9375. We should only have either CatalogMetaProvider or DirectMetaProvider running, but not both. I am now inclined to abandon this patch and close this JIRA as "won't fix". [~stakiar] [~stigahuang] any thoughts? > MetastoreClientPool should be singleton > --- > > Key: IMPALA-9824 > URL: https://issues.apache.org/jira/browse/IMPALA-9824 > Project: IMPALA > Issue Type: Improvement > Components: Catalog >Reporter: Vihang Karajgaonkar >Assignee: Vihang Karajgaonkar >Priority: Minor > > Currently, the MetastoreClientPool is instantiated at multiple places in the > code and it would be good to refactor the code to make it a singleton. Each > MetastoreClientPool creates multiple clients to HMS and unnecessary creation > of multiple pools could cause problems on the HMS side. 
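The testing hazard described in the comment, a process-wide singleton pool shared by every Catalog created in the same process, can be illustrated with a small sketch. This is hypothetical Python, not Impala's code; the class and method names only loosely mirror MetastoreClientPool and Catalog#close():

```python
# Hypothetical sketch of the singleton conflict described above.
class MetastoreClientPool:
    _instance = None

    @classmethod
    def instance(cls):
        # Lazily create one pool per process (the proposed singleton).
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Catalog:
    def __init__(self):
        # Every catalog in the process ends up holding the same pool.
        self.pool = MetastoreClientPool.instance()

    def close(self):
        # Tests rely on Catalog close() to shut down the pool, but with a
        # singleton this also shuts it down for every other catalog.
        self.pool.close()
```

Because both catalogs share the singleton, closing one catalog's pool closes it underneath the other, which is exactly the flakiness risk noted for FE unit tests that all run within a single process.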
[jira] [Closed] (IMPALA-8720) Impala frontend jar should not depend on Sentry jars when building against hive-3 profile
[ https://issues.apache.org/jira/browse/IMPALA-8720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vihang Karajgaonkar closed IMPALA-8720. --- Fix Version/s: Not Applicable Resolution: Not A Problem > Impala frontend jar should not depend on Sentry jars when building against > hive-3 profile > - > > Key: IMPALA-8720 > URL: https://issues.apache.org/jira/browse/IMPALA-8720 > Project: IMPALA > Issue Type: Improvement >Reporter: Vihang Karajgaonkar >Assignee: Vihang Karajgaonkar >Priority: Major > Fix For: Not Applicable > > > It looks like for {{hive-3}}-based setups, the frontend jar still depends on > Sentry jars. However, Sentry does not work with HMS-3 as of today. This > unnecessarily pulls in Sentry jars from Maven repositories when building > against CDP. We should pull in Sentry jars only when needed.
[jira] [Resolved] (IMPALA-9843) Add ability to run schematool against HMS in minicluster
[ https://issues.apache.org/jira/browse/IMPALA-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vihang Karajgaonkar resolved IMPALA-9843. - Fix Version/s: Impala 4.0 Resolution: Fixed The patch was submitted on Gerrit today. Users who wish to upgrade the HMS db schema of the minicluster can use either of the following commands:
1. bin/create-test-configuration.sh -upgrade_hms_db
2. ./buildall.sh -upgrade_hms_db (if you also want to build the source along with upgrading the HMS schema)
> Add ability to run schematool against HMS in minicluster > > > Key: IMPALA-9843 > URL: https://issues.apache.org/jira/browse/IMPALA-9843 > Project: IMPALA > Issue Type: Improvement >Reporter: Sahil Takiar >Assignee: Vihang Karajgaonkar >Priority: Major > Fix For: Impala 4.0 > > > When the CDP version is bumped, we often need to re-format the HMS postgres > database because the HMS schema needs updating. Hive provides a standalone > tool for performing schema updates: > [https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool] > Impala should be able to integrate with this tool, so that developers don't > have to blow away their HMS database every time the CDP version is bumped up. > Even worse, blowing away the HMS data requires performing a full data load. > It would be great to have a wrapper around the schematool that can easily be > invoked by developers.
[jira] [Created] (IMPALA-9855) TSAN lock-order-inversion warning in QueryDriver::RetryQueryFromThread
Sahil Takiar created IMPALA-9855: Summary: TSAN lock-order-inversion warning in QueryDriver::RetryQueryFromThread Key: IMPALA-9855 URL: https://issues.apache.org/jira/browse/IMPALA-9855 Project: IMPALA Issue Type: Sub-task Components: Backend Reporter: Sahil Takiar Assignee: Sahil Takiar TSAN reports the following error in {{test_query_retries.py}}. {code:java} WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=3786) Cycle in lock order graph: M17348 (0x7b140035d2d8) => M804309746609755832 (0x) => M17348 Mutex M804309746609755832 acquired here while holding mutex M17348 in thread T370: #0 AnnotateRWLockAcquired /mnt/source/llvm/llvm-5.0.1.src-p2/projects/compiler-rt/lib/tsan/rtl/tsan_interface_ann.cc:271 (impalad+0x19bafcc) #1 base::SpinLock::Lock() /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/gutil/spinlock.h:77:5 (impalad+0x1a11585) #2 impala::SpinLock::lock() /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/spinlock.h:34:8 (impalad+0x1a11519) #3 impala::ScopedShardedMapRef >::ScopedShardedMapRef(impala::TUniqueId const&, impala::ShardedQueryMap >*) /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/sharded-query-map-util.h:98:23 (impalad+0x2220661) #4 impala::ImpalaServer::GetQueryDriver(impala::TUniqueId const&, bool) /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/service/impala-server.cc:1296:53 (impalad+0x22124ba) #5 impala::QueryDriver::RetryQueryFromThread(impala::Status const&, std::shared_ptr) /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-driver.cc:279:25 (impalad+0x29dd92c) #6 boost::_mfi::mf2 >::operator()(impala::QueryDriver*, impala::Status const&, std::shared_ptr) const /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/mem_fn_template.hpp:280:29 (impalad+0x29e1669) #7 void boost::_bi::list3, boost::_bi::value, 
boost::_bi::value > >::operator() >, boost::_bi::list0>(boost::_bi::type, boost::_mfi::mf2 >&, boost::_bi::list0&, int) /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:398:9 (impalad+0x29e1578) #8 boost::_bi::bind_t >, boost::_bi::list3, boost::_bi::value, boost::_bi::value > > >::operator()() /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16 (impalad+0x29e14c3) #9 boost::detail::function::void_function_obj_invoker0 >, boost::_bi::list3, boost::_bi::value, boost::_bi::value > > >, void>::invoke(boost::detail::function::function_buffer&) /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:159:11 (impalad+0x29e1221) #10 boost::function0::operator()() const /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:770:14 (impalad+0x1e5ba81) #11 impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function, impala::ThreadDebugInfo const*, impala::Promise*) /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/thread.cc:360:3 (impalad+0x2453776) #12 void boost::_bi::list5, boost::_bi::value, boost::_bi::value >, boost::_bi::value, boost::_bi::value*> >::operator(), impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0>(boost::_bi::type, void (*&)(std::string const&, std::string const&, boost::function, impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0&, int) /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:531:9 (impalad+0x245b93c) #13 boost::_bi::bind_t, impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list5, boost::_bi::value, boost::_bi::value >, boost::_bi::value, boost::_bi::value*> > >::operator()() 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16 (impalad+0x245b853) #14 boost::detail::thread_data, impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list5, boost::_bi::value, boost::_bi::value >, boost::_bi::value, boost::_bi::value*> > > >::run() /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/thread/detail/thread.hpp:116:17 (impalad+0x245b540) #15 thread_proxy (impalad+0x3171659)Hint: use TSAN_OPTIONS=second_deadlock_stack=1 to get more informative warning message Mutex M17348 acquired here while holding mutex M804309746609755832 in thread T392: #0 AnnotateRWLockAcquired
[jira] [Created] (IMPALA-9854) TSAN data race in QueryDriver::CreateRetriedClientRequestState
Sahil Takiar created IMPALA-9854: Summary: TSAN data race in QueryDriver::CreateRetriedClientRequestState Key: IMPALA-9854 URL: https://issues.apache.org/jira/browse/IMPALA-9854 Project: IMPALA Issue Type: Sub-task Components: Backend Reporter: Sahil Takiar Assignee: Sahil Takiar Seeing the following data race in {{test_query_retries.py}} {code:java} WARNING: ThreadSanitizer: data race (pid=5460) Write of size 8 at 0x7b8c00261510 by thread T38: #0 impala::TUniqueId::operator=(impala::TUniqueId&&) /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/generated-sources/gen-cpp/Types_types.cpp:967:6 (impalad+0x1de1968) #1 impala::ImpalaServer::PrepareQueryContext(impala::TNetworkAddress const&, impala::TNetworkAddress const&, impala::TQueryCtx*) /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/service/impala-server.cc:1069:23 (impalad+0x2210dbf) #2 impala::ImpalaServer::PrepareQueryContext(impala::TQueryCtx*) /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/service/impala-server.cc:1024:3 (impalad+0x220f3c1) #3 impala::QueryDriver::CreateRetriedClientRequestState(impala::ClientRequestState*, std::unique_ptr >*, std::shared_ptr*) /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-driver.cc:302:19 (impalad+0x29de3ec) #4 impala::QueryDriver::RetryQueryFromThread(impala::Status const&, std::shared_ptr) /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-driver.cc:203:3 (impalad+0x29dd01f) #5 boost::_mfi::mf2 >::operator()(impala::QueryDriver*, impala::Status const&, std::shared_ptr) const /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/mem_fn_template.hpp:280:29 (impalad+0x29e1669) #6 void boost::_bi::list3, boost::_bi::value, boost::_bi::value > >::operator() >, boost::_bi::list0>(boost::_bi::type, boost::_mfi::mf2 >&, boost::_bi::list0&, int) 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:398:9 (impalad+0x29e1578) #7 boost::_bi::bind_t >, boost::_bi::list3, boost::_bi::value, boost::_bi::value > > >::operator()() /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16 (impalad+0x29e14c3) #8 boost::detail::function::void_function_obj_invoker0 >, boost::_bi::list3, boost::_bi::value, boost::_bi::value > > >, void>::invoke(boost::detail::function::function_buffer&) /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:159:11 (impalad+0x29e1221) #9 boost::function0::operator()() const /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:770:14 (impalad+0x1e5ba81) #10 impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function, impala::ThreadDebugInfo const*, impala::Promise*) /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/thread.cc:360:3 (impalad+0x2453776) #11 void boost::_bi::list5, boost::_bi::value, boost::_bi::value >, boost::_bi::value, boost::_bi::value*> >::operator(), impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0>(boost::_bi::type, void (*&)(std::string const&, std::string const&, boost::function, impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0&, int) /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:531:9 (impalad+0x245b93c) #12 boost::_bi::bind_t, impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list5, boost::_bi::value, boost::_bi::value >, boost::_bi::value, boost::_bi::value*> > >::operator()() /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16 (impalad+0x245b853) 
#13 boost::detail::thread_data, impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list5, boost::_bi::value, boost::_bi::value >, boost::_bi::value, boost::_bi::value*> > > >::run() /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/thread/detail/thread.hpp:116:17 (impalad+0x245b540) #14 thread_proxy (impalad+0x3171659) Previous read of size 8 at 0x7b8c00261510 by thread T100: #0 impala::PrintId(impala::TUniqueId const&, std::string const&) /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/debug-util.cc:108:48 (impalad+0x237557f) #1 impala::Coordinator::ReleaseQueryAdmissionControlResources()
[jira] [Commented] (IMPALA-9739) TSAN data races during impalad shutdown
[ https://issues.apache.org/jira/browse/IMPALA-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134456#comment-17134456 ] Sahil Takiar commented on IMPALA-9739: -- [~bikramjeet.vig] the {{TestGracefulShutdown}} tests in {{test_restart_services.py}} can reproduce this. Just build Impala locally with the '-tsan' flag and run the test. You should see a TSAN error in the logs under {{/tmp/impalad.*ERROR}}. When I ran it locally the error was in the {{/tmp/impalad_node1.ERROR}} file. Here is the output, I just ran this on master: {code} WARNING: ThreadSanitizer: data race (pid=19807) Read of size 8 at 0x078569a0 by thread T337: #0 std::unique_ptr >::~unique_ptr() /home/stakiar/Impala/toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/bits/unique_ptr.h:235:6 (impalad+0x1a10495) #1 at_exit_wrapper(void*) /mnt/source/llvm/llvm-5.0.1.src-p2/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:361 (impalad+0x196fb33) #2 impala::ImpalaServer::StartShutdown(long, impala::ShutdownStatusPB*)::$_2::operator()() const /home/stakiar/Impala/be/src/service/impala-server.cc:2774:57 (impalad+0x2236ba1) #3 boost::detail::function::void_function_obj_invoker0::invoke(boost::detail::function::function_buffer&) /home/stakiar/Impala/toolchain/boost-1.61.0-p2/include/boost/function/function_te mplate.hpp:159:11 (impalad+0x2236a09) #4 boost::function0::operator()() const /home/stakiar/Impala/toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:770:14 (impalad+0x1e5ee31) #5 impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function, impala::ThreadDebugInfo const*, impala::Promise*) /home/stakiar/Impala/be/src/util/thread.cc:360:3 (impalad+0x246bfd6) #6 void boost::_bi::list5, boost::_bi::value, boost::_bi::value >, boost::_bi::value, boost::_bi::value*> >::operator(), impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0>(boost::_bi::type, void (*&)(std::string 
const&, std::string const&, boost::function, impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0&, int) /home/stakiar/Impala/toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:531:9 (impalad+0x247419c) #7 boost::_bi::bind_t, impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list5, boost::_bi::value, bo ost::_bi::value >, boost::_bi::value, boost::_bi::value*> > >::operator()() /home/stakiar/Impala/toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16 (impalad+0x24740b3) #8 boost::detail::thread_data, impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list5, boost:: _bi::value, boost::_bi::value >, boost::_bi::value, boost::_bi::value*> > > >::run() /home/stakiar/Impala/toolchain/boost-1.61.0-p2/include/boost/thread/detail/thread.hpp:116:17 (impalad+0x2473da0) #9 thread_proxy (impalad+0x3177c59) Previous write of size 8 at 0x078569a0 by main thread: #0 void std::swap(impala::Thread*&, impala::Thread*&) /home/stakiar/Impala/toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/bits/move.h:176:11 (impalad+0x22a4f20) #1 std::unique_ptr >::reset(impala::Thread*) /home/stakiar/Impala/toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/bits/unique_ptr.h:342:2 (impalad+0x229fa9b) #2 std::unique_ptr >::operator=(std::unique_ptr >&&) /home/stakiar/Impala/toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/b its/unique_ptr.h:251:2 (impalad+0x246e918) #3 impala::Thread::StartThread(std::string const&, std::string const&, boost::function const&, std::unique_ptr >*, bool) /home/stakiar/Impala/be/src/util/thread.cc:329:11 (impalad+0x246baec) #4 impala::Status impala::Thread::Create(std::string const&, std::string const&, void (* const&)(), std::unique_ptr >*, bool) /home/stakiar/Impala/be/src/util/thread.h:74:12 (impalad+0x1a6eb2c) #5 impala::StartImpalaShutdownSignalHandlerThread() 
/home/stakiar/Impala/be/src/common/init.cc:401:10 (impalad+0x1a6df98) #6 ImpaladMain(int, char**) /home/stakiar/Impala/be/src/service/impalad-main.cc:96:43 (impalad+0x221d3ca) #7 main /home/stakiar/Impala/be/src/service/daemon-main.cc:37:12 (impalad+0x1a0b27a) As if synchronized via sleep: #0 nanosleep /mnt/source/llvm/llvm-5.0.1.src-p2/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:343 (impalad+0x19a500a) #1 void std::this_thread::sleep_for >(std::chrono::duration > const&) /home/stakiar/Impala/toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/thread:279:2 (impalad+0x2475f42) #2 impala::SleepForMs(long) /home/stakiar/Impala/be/src/util/time.cc:31:3 (impalad+0x247537d) #3 impala::ImpalaServer::ShutdownThread() /home/stakiar/Impala/be/src/service/impala-server.cc:2796:5 (impalad+0x2235a19)
[jira] [Updated] (IMPALA-9853) Push rank() predicates into sort
[ https://issues.apache.org/jira/browse/IMPALA-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Armstrong updated IMPALA-9853: -- Labels: performance tpcds (was: ) > Push rank() predicates into sort > > > Key: IMPALA-9853 > URL: https://issues.apache.org/jira/browse/IMPALA-9853 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Reporter: Tim Armstrong >Assignee: Tim Armstrong >Priority: Major > Labels: performance, tpcds > > TPC-DS Q67 would benefit significantly if we could push the rank() predicate > into the sort to do some reduction of unneeded data. The sorter could > evaluate this predicate if it had the partition expressions available - as a > post-processing step to the in-memory sort for the analytic sort group, it > could do a pass over the sorted run, resetting a counter at the start of each > partition boundary. > It might be best to start with tackling IMPALA-3471 by applying the limit > within sorted runs, since that doesn't require any planner work. 
> {noformat} > with results as > ( select i_category ,i_class ,i_brand ,i_product_name ,d_year ,d_qoy > ,d_moy ,s_store_id > ,sum(coalesce(ss_sales_price*ss_quantity,0)) sumsales > from store_sales ,date_dim ,store ,item >where ss_sold_date_sk=d_date_sk > and ss_item_sk=i_item_sk > and ss_store_sk = s_store_sk > and d_month_seq between 1212 and 1212 + 11 >group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, > d_moy,s_store_id) > , > results_rollup as > (select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, > s_store_id, sumsales > from results > union all > select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, > null s_store_id, sum(sumsales) sumsales > from results > group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy > union all > select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, null > d_moy, null s_store_id, sum(sumsales) sumsales > from results > group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy > union all > select i_category, i_class, i_brand, i_product_name, d_year, null d_qoy, > null d_moy, null s_store_id, sum(sumsales) sumsales > from results > group by i_category, i_class, i_brand, i_product_name, d_year > union all > select i_category, i_class, i_brand, i_product_name, null d_year, null > d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales > from results > group by i_category, i_class, i_brand, i_product_name > union all > select i_category, i_class, i_brand, null i_product_name, null d_year, null > d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales > from results > group by i_category, i_class, i_brand > union all > select i_category, i_class, null i_brand, null i_product_name, null d_year, > null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales > from results > group by i_category, i_class > union all > select i_category, null i_class, null i_brand, null i_product_name, null > d_year, null d_qoy, null d_moy, 
null s_store_id, sum(sumsales) sumsales > from results > group by i_category > union all > select null i_category, null i_class, null i_brand, null i_product_name, > null d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales > from results) > select * > from (select i_category > ,i_class > ,i_brand > ,i_product_name > ,d_year > ,d_qoy > ,d_moy > ,s_store_id > ,sumsales > ,rank() over (partition by i_category order by sumsales desc) rk > from results_rollup) dw2 > where rk <= 100 > order by i_category > ,i_class > ,i_brand > ,i_product_name > ,d_year > ,d_qoy > ,d_moy > ,s_store_id > ,sumsales > ,rk > limit 100 > {noformat} > Assigning to myself to fill in more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
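The counter-reset pass described in the issue, a single walk over a run already sorted by partition and order keys, could look roughly like this. This is an illustrative sketch, not Impala's sorter code; the row layout of (partition_key, sort_key) tuples is an assumption for the example:

```python
def prune_by_rank(sorted_run, k):
    """Keep only rows whose rank() within their partition is <= k.

    sorted_run must already be ordered by (partition_key, sort_key), as the
    analytic sort group guarantees. Uses rank() semantics: tied rows share a
    rank, and the rank after a tie group skips ahead by the group size.
    """
    out = []
    prev_part = object()  # sentinel that compares unequal to any real key
    prev_order = None
    rank = count = 0
    for part, order in sorted_run:
        if part != prev_part:
            # Partition boundary: reset the counter.
            prev_part, prev_order = part, order
            rank = count = 1
        else:
            count += 1
            if order != prev_order:
                rank = count  # rank() skips ahead after a tie group
                prev_order = order
        if rank <= k:
            out.append((part, order))
    return out
```

With rank() semantics, a tie group straddling the threshold is kept or dropped as a unit, so pruning here is safe: no row that could satisfy rk <= k in the final analytic evaluation is discarded.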
[jira] [Created] (IMPALA-9853) Push rank() predicates into sort
Tim Armstrong created IMPALA-9853:
-------------------------------------

             Summary: Push rank() predicates into sort
                 Key: IMPALA-9853
                 URL: https://issues.apache.org/jira/browse/IMPALA-9853
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
            Reporter: Tim Armstrong
            Assignee: Tim Armstrong


TPC-DS Q67 would benefit significantly if we could push the rank() predicate into the sort to do some reduction of unneeded data. The sorter could evaluate this predicate if it had the partition expressions available - as a post-processing step to the in-memory sort for the analytic sort group, it could do a pass over the sorted run, resetting a counter at the start of each partition boundary.

It might be best to start with tackling IMPALA-3471 by applying the limit within sorted runs, since that doesn't require any planner work.

{noformat}
with results as (
  select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, s_store_id,
         sum(coalesce(ss_sales_price * ss_quantity, 0)) sumsales
  from store_sales, date_dim, store, item
  where ss_sold_date_sk = d_date_sk
    and ss_item_sk = i_item_sk
    and ss_store_sk = s_store_sk
    and d_month_seq between 1212 and 1212 + 11
  group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, s_store_id),
results_rollup as (
  select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, s_store_id, sumsales
  from results
  union all
  select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy
  union all
  select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy
  union all
  select i_category, i_class, i_brand, i_product_name, d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class, i_brand, i_product_name, d_year
  union all
  select i_category, i_class, i_brand, i_product_name, null d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class, i_brand, i_product_name
  union all
  select i_category, i_class, i_brand, null i_product_name, null d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class, i_brand
  union all
  select i_category, i_class, null i_brand, null i_product_name, null d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class
  union all
  select i_category, null i_class, null i_brand, null i_product_name, null d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category
  union all
  select null i_category, null i_class, null i_brand, null i_product_name, null d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
  from results)
select *
from (select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, s_store_id, sumsales,
             rank() over (partition by i_category order by sumsales desc) rk
      from results_rollup) dw2
where rk <= 100
order by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, s_store_id, sumsales, rk
limit 100
{noformat}

Assigning to myself to fill in more details.
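The counter-reset pass described above can be sketched as follows. The types and names here are illustrative, not Impala's actual sorter classes, and for simplicity the counter uses row_number()-style semantics (ties are not assigned equal rank, which a real rank() implementation would have to handle):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a row of the analytic sort group; a real
// implementation would work on tuple rows with evaluated partition exprs.
struct Row {
  std::string partition_key;  // e.g. i_category
  double sort_key;            // e.g. sumsales
};

// One pass over a run already sorted by (partition_key, sort_key desc):
// reset a counter at each partition boundary and keep only rows whose
// counter is <= limit, discarding data that rank() <= limit would reject.
std::vector<Row> ApplyRankLimit(const std::vector<Row>& sorted_run, int limit) {
  std::vector<Row> out;
  const std::string* current_partition = nullptr;
  int rank = 0;
  for (const Row& row : sorted_run) {
    if (current_partition == nullptr ||
        row.partition_key != *current_partition) {
      current_partition = &row.partition_key;  // partition boundary: reset
      rank = 0;
    }
    ++rank;
    if (rank <= limit) out.push_back(row);  // rows past the limit are dropped
  }
  return out;
}
```

Applied before spilling, a pass like this would bound each sorted run at `limit` rows per partition, which is where most of the Q67 reduction would come from.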
[jira] [Resolved] (IMPALA-9847) JSON profiles are mostly space characters
[ https://issues.apache.org/jira/browse/IMPALA-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-9847.
-----------------------------------
    Fix Version/s: Impala 4.0
       Resolution: Fixed

> JSON profiles are mostly space characters
> -----------------------------------------
>
>                 Key: IMPALA-9847
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9847
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>             Fix For: Impala 4.0
>
>
> JSON profiles are pretty-printed with 4 space characters per indent. This
> means that most of the profile data is actually just space characters, and
> this can add up for large profiles.
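A toy illustration of the effect described in the issue (this is not Impala's actual profile encoding, just the same nested shape serialized two ways): with 4-space-per-level pretty printing, indentation quickly dominates the byte count of a deeply nested document.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The same tiny nested object, pretty-printed with 4 spaces per indent level
// versus serialized compactly with no whitespace.
const std::string kPretty =
    "{\n"
    "    \"counters\": {\n"
    "        \"TotalTime\": {\n"
    "            \"value\": 12345\n"
    "        }\n"
    "    }\n"
    "}";
const std::string kCompact = "{\"counters\":{\"TotalTime\":{\"value\":12345}}}";

// Count the space characters, i.e. the bytes that carry no profile data.
std::size_t CountSpaces(const std::string& s) {
  std::size_t n = 0;
  for (char c : s) {
    if (c == ' ') ++n;
  }
  return n;
}
```

Even at three levels of nesting the pretty form is roughly twice the size of the compact form, and real runtime profiles nest much deeper than this.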
[jira] [Commented] (IMPALA-9852) UDA Check that function name, arguments, and return type are correct.
[ https://issues.apache.org/jira/browse/IMPALA-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134317#comment-17134317 ]

Tim Armstrong commented on IMPALA-9852:
---------------------------------------

You can list out the symbols in the .so with "nm -g aggreg.so". I would guess that the symbols are mangled with C++ name mangling. One way to avoid the name mangling is to wrap the function definitions and declarations in extern "C" { }.

> UDA Check that function name, arguments, and return type are correct.
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-9852
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9852
>             Project: IMPALA
>          Issue Type: Question
>          Components: Backend
>    Affects Versions: Impala 2.10.0
>            Reporter: Volnei
>            Priority: Major
>
> Hi,
> I'm trying to register a UDA in the database, but the error below always happens:
> {code:java}
> create aggregate function avgtest(double) returns double
> location '/user/cloudera/impala_udf/aggreg.so'
> init_fn='Avg_Init'
> update_fn='Avg_Update'
> merge_fn='Avg_Merge'
> finalize_fn='Avg_Finalize';
>
> AnalysisException: Could not find function Avg_Update(DOUBLE) returns DOUBLE
> in: hdfs://quickstart.cloudera:8020/user/cloudera/impala_udf/aggreg.so Check
> that function name, arguments, and return type are correct
> {code}
> If I create a UDA without using BufferVal as one of the arguments, the error doesn't happen.
> The UDA in question is the one available in impala-master/be/src/udf_samples/
> {code:java}
> void Avg_Init(FunctionContext* context, BufferVal* val);
> void Avg_Update(FunctionContext* context, const DoubleVal& input, BufferVal* val);
> void Avg_Merge(FunctionContext* context, const BufferVal& src, BufferVal* dst);
> DoubleVal Avg_Finalize(FunctionContext* context, const BufferVal& val);
> {code}
> Could anybody give me any suggestions on this problem?
> Thank you.
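The suggested fix can be sketched like this. The struct definitions below are simplified stand-ins so the example is self-contained; the real types come from Impala's udf.h (where BufferVal is a fixed-size byte buffer, not this toy struct). The key point is only the extern "C" wrapper, which keeps the symbol names unmangled so that "nm -g aggreg.so" shows plain names like Avg_Update and Impala's lookup by function name succeeds:

```cpp
#include <cassert>

// Simplified stand-ins for the Impala UDF types, for illustration only.
struct FunctionContext {};
struct DoubleVal { double val; };
struct BufferVal { double sum; long count; };

// Wrapping both declarations and definitions in extern "C" suppresses C++
// name mangling, so the .so exports Avg_Init/Avg_Update/... verbatim and
// CREATE AGGREGATE FUNCTION can resolve them by name.
extern "C" {

void Avg_Init(FunctionContext* ctx, BufferVal* val) {
  val->sum = 0.0;
  val->count = 0;
}

void Avg_Update(FunctionContext* ctx, const DoubleVal& input, BufferVal* val) {
  val->sum += input.val;
  ++val->count;
}

void Avg_Merge(FunctionContext* ctx, const BufferVal& src, BufferVal* dst) {
  dst->sum += src.sum;
  dst->count += src.count;
}

DoubleVal Avg_Finalize(FunctionContext* ctx, const BufferVal& val) {
  return DoubleVal{val.count == 0 ? 0.0 : val.sum / val.count};
}

}  // extern "C"
```

Without the wrapper, a C++ compiler emits a mangled symbol (something like _Z10Avg_UpdateP15FunctionContextRK9DoubleValP9BufferVal, exact form varies by compiler), which is why the by-name lookup in the AnalysisException above fails.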
[jira] [Created] (IMPALA-9852) UDA Check that function name, arguments, and return type are correct.
Volnei created IMPALA-9852:
-------------------------------

             Summary: UDA Check that function name, arguments, and return type are correct.
                 Key: IMPALA-9852
                 URL: https://issues.apache.org/jira/browse/IMPALA-9852
             Project: IMPALA
          Issue Type: Question
          Components: Backend
    Affects Versions: Impala 2.10.0
            Reporter: Volnei


Hi,

I'm trying to register a UDA in the database, but the error below always happens:

{code:java}
create aggregate function avgtest(double) returns double
location '/user/cloudera/impala_udf/aggreg.so'
init_fn='Avg_Init'
update_fn='Avg_Update'
merge_fn='Avg_Merge'
finalize_fn='Avg_Finalize';

AnalysisException: Could not find function Avg_Update(DOUBLE) returns DOUBLE in: hdfs://quickstart.cloudera:8020/user/cloudera/impala_udf/aggreg.so Check that function name, arguments, and return type are correct
{code}

If I create a UDA without using BufferVal as one of the arguments, the error doesn't happen.

The UDA in question is the one available in impala-master/be/src/udf_samples/

{code:java}
void Avg_Init(FunctionContext* context, BufferVal* val);
void Avg_Update(FunctionContext* context, const DoubleVal& input, BufferVal* val);
void Avg_Merge(FunctionContext* context, const BufferVal& src, BufferVal* dst);
DoubleVal Avg_Finalize(FunctionContext* context, const BufferVal& val);
{code}

Could anybody give me any suggestions on this problem?

Thank you.
[jira] [Commented] (IMPALA-9747) More fine-grained codegen for text file scanners
[ https://issues.apache.org/jira/browse/IMPALA-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134199#comment-17134199 ]

Daniel Becker commented on IMPALA-9747:
---------------------------------------

[https://gerrit.cloudera.org/#/c/16059/]

> More fine-grained codegen for text file scanners
> ------------------------------------------------
>
>                 Key: IMPALA-9747
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9747
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Daniel Becker
>            Priority: Major
>
> Currently, if the materialization of any column cannot be codegen'd for some
> reason (e.g. it is CHAR(N)), then codegen is cancelled for the whole text
> scanner, see:
> https://github.com/apache/impala/blob/b5805de3e65fd1c7154e4169b323bb38ddc54f4f/be/src/exec/text-converter.cc#L112
> https://github.com/apache/impala/blob/58273fff601dcc763ac43f7cc275a174a2e18b6b/be/src/exec/hdfs-scanner.cc#L342
> It would be much better to use the non-codegen'd path only for the
> problematic columns, use the codegen'd materialization for the rest, and
> always do conjunct evaluation with codegen.
> The codegen'd path orders slots based on the conjuncts that use them and
> evaluates each conjunct as soon as the slots it needs become available, so if
> the row is dropped then the rest of the slots do not need to be materialized.
> A simple solution would be to always do non-codegen'd slot materialization
> first, so that the values are ready if a conjunct needs them. Moving the
> columns that are not used by conjuncts to the end could be a further
> optimization.
> This came up during the materialization of BINARY columns, which needs
> base64 decoding during materialization.
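The "simple solution" ordering in the issue description can be sketched as a stable partition of the slots. The names below are illustrative, not Impala's actual descriptors: interpreted (non-codegen'd) slots go first so their values are available to any conjunct, and each group keeps its original relative order.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical slot descriptor: codegen_supported is false for columns whose
// materialization cannot be codegen'd (e.g. CHAR(N), or BINARY needing
// base64 decoding).
struct Slot {
  std::string name;
  bool codegen_supported;
};

// Return the order in which slots would be materialized: interpreted slots
// first, codegen'd slots after, preserving relative order within each group.
std::vector<std::string> MaterializationOrder(std::vector<Slot> slots) {
  std::stable_partition(slots.begin(), slots.end(),
                        [](const Slot& s) { return !s.codegen_supported; });
  std::vector<std::string> order;
  order.reserve(slots.size());
  for (const Slot& s : slots) order.push_back(s.name);
  return order;
}
```

The further optimization mentioned above would refine this into three groups: interpreted slots used by conjuncts, codegen'd slots used by conjuncts, then all remaining slots, so a row rejected by a conjunct skips as much materialization as possible.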