[jira] [Commented] (IMPALA-3471) TopN should be able to spill

2020-06-12 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134613#comment-17134613
 ] 

Tim Armstrong commented on IMPALA-3471:
---

I think we still want to use the regular external sort implementation for large 
limits, but we could make some further optimisations to avoid spilling as much 
data. Specifically, in SortCurrentInputRun() we could truncate the in-memory 
sorted run to the limit, and then apply the limit again when merging sorted runs.

There are additional tricks that could further optimise spilling sorts, mostly 
various ways of tracking an upper bound on the rows that would fall past the 
limit threshold.
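A minimal sketch of the two-step limiting described above (illustrative Python, not Impala's C++ sorter; the run representation is an assumption):

```python
import heapq

def sort_and_truncate_run(run, limit):
    """Sort one in-memory run, then truncate it: rows past the limit can never
    appear in the final top-`limit` result, so spilling them is wasted I/O."""
    run.sort()
    del run[limit:]  # truncate before the run would be spilled
    return run

def merge_runs_with_limit(runs, limit):
    """Merge sorted (possibly spilled) runs, applying the limit again."""
    out = []
    for row in heapq.merge(*runs):
        out.append(row)
        if len(out) == limit:
            break
    return out

runs = [sort_and_truncate_run(r, 3) for r in ([5, 1, 9, 7], [2, 8, 0, 6])]
print(merge_runs_with_limit(runs, 3))  # → [0, 1, 2]
```

Each run holds at most `limit` rows after truncation, so both spill volume and merge work are bounded by the limit rather than the input size.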

> TopN should be able to spill
> 
>
> Key: IMPALA-3471
> URL: https://issues.apache.org/jira/browse/IMPALA-3471
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.6.0
>Reporter: Jim Apple
>Priority: Minor
>
> TopN nodes store OFFSET + LIMIT tuples in memory (in fact, in a vector whose 
> allocation throws an exception if it fails). It would be nice to check 
> allocations before they fail and spill when there isn't enough memory.
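For reference, a top-N that only ever keeps OFFSET + LIMIT rows resident can be expressed with a bounded heap (an illustrative Python sketch, not Impala's vector-based implementation):

```python
import heapq

def top_n(rows, offset, limit, key=None):
    """Keep at most offset + limit rows in memory, then drop the offset prefix."""
    smallest = heapq.nsmallest(offset + limit, rows, key=key)
    return smallest[offset:]

print(top_n(iter([9, 3, 7, 1, 5, 8]), offset=1, limit=2))  # → [3, 5]
```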



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Assigned] (IMPALA-9853) Push rank() predicates into sort

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-9853:
-

Assignee: (was: Tim Armstrong)

> Push rank() predicates into sort
> 
>
> Key: IMPALA-9853
> URL: https://issues.apache.org/jira/browse/IMPALA-9853
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Reporter: Tim Armstrong
>Priority: Major
>  Labels: performance, tpcds
>
> TPC-DS Q67 would benefit significantly if we could push the rank() predicate 
> into the sort to filter out unneeded data early. The sorter could evaluate 
> this predicate if it had the partition expressions available: as a 
> post-processing step after the in-memory sort for the analytic sort group, it 
> could do a pass over the sorted run, resetting a counter at each partition 
> boundary.
> It might be best to start by tackling IMPALA-3471, applying the limit within 
> sorted runs, since that doesn't require any planner work.
> {noformat}
> with results as
> ( select i_category ,i_class ,i_brand ,i_product_name ,d_year ,d_qoy 
> ,d_moy ,s_store_id
>   ,sum(coalesce(ss_sales_price*ss_quantity,0)) sumsales
> from store_sales ,date_dim ,store ,item
>where  ss_sold_date_sk=d_date_sk
>   and ss_item_sk=i_item_sk
>   and ss_store_sk = s_store_sk
>   and d_month_seq between 1212 and 1212 + 11
>group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, 
> d_moy,s_store_id)
>  ,
>  results_rollup as
>  (select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, 
> s_store_id, sumsales
>   from results
>   union all
>   select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, 
> null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy
>   union all
>   select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, null 
> d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy
>   union all
>   select i_category, i_class, i_brand, i_product_name, d_year, null d_qoy, 
> null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class, i_brand, i_product_name, d_year
>   union all
>   select i_category, i_class, i_brand, i_product_name, null d_year, null 
> d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class, i_brand, i_product_name
>   union all
>   select i_category, i_class, i_brand, null i_product_name, null d_year, null 
> d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class, i_brand
>   union all
>   select i_category, i_class, null i_brand, null i_product_name, null d_year, 
> null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class
>   union all
>   select i_category, null i_class, null i_brand, null i_product_name, null 
> d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category
>   union all
>   select null i_category, null i_class, null i_brand, null i_product_name, 
> null d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results)
>  select  *
> from (select i_category
> ,i_class
> ,i_brand
> ,i_product_name
> ,d_year
> ,d_qoy
> ,d_moy
> ,s_store_id
> ,sumsales
> ,rank() over (partition by i_category order by sumsales desc) rk
>   from results_rollup) dw2
> where rk <= 100
> order by i_category
> ,i_class
> ,i_brand
> ,i_product_name
> ,d_year
> ,d_qoy
> ,d_moy
> ,s_store_id
> ,sumsales
> ,rk
> limit 100
> {noformat}
> Assigning to myself to fill in more details.
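The counter-reset pass described in the issue can be sketched as follows (illustrative Python; the sorter's actual interfaces are assumptions, and real rank() semantics would also need tie handling):

```python
def apply_rank_limit(sorted_run, partition_key, rank_limit):
    """One pass over a run sorted by (partition, order): reset a counter at
    each partition boundary, dropping rows whose position exceeds the limit."""
    out, prev_part, count = [], object(), 0
    for row in sorted_run:
        part = partition_key(row)
        if part != prev_part:
            prev_part, count = part, 0  # new partition: reset the counter
        count += 1
        if count <= rank_limit:
            out.append(row)
    return out

rows = [("a", 10), ("a", 8), ("a", 5), ("b", 9), ("b", 7)]
print(apply_rank_limit(rows, lambda r: r[0], 2))
# → [('a', 10), ('a', 8), ('b', 9), ('b', 7)]
```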






[jira] [Commented] (IMPALA-5296) Codegen for aggregation with 1K grouping columns takes several seconds

2020-06-12 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-5296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134607#comment-17134607
 ] 

Tim Armstrong commented on IMPALA-5296:
---

This has gotten a little faster, but still took over a minute on my desktop:
{noformat}
Fragment F01:
  CodeGen:(Total: 1m9s, non-child: 1m9s, % non-child: 100.00%)
 - CodegenInvoluntaryContextSwitches: 124 (124)
 - CodegenTotalWallClockTime: 1m9s
   - CodegenSysTime: 696.010ms
   - CodegenUserTime: 1m9s
 - CodegenVoluntaryContextSwitches: 10 (10)
 - CompileTime: 16s816ms
 - IrGenerationTime: 113.400ms
 - LoadTime: 0.000ns
 - ModuleBitcodeSize: 2.50 MB (2624788)
 - NumFunctions: 2.07K (2066)
 - NumInstructions: 106.15K (106147)
 - OptimizationTime: 52s908ms
 - PeakMemoryUsage: 51.83 MB (54347264)
 - PrepareTime: 19.514ms
Fragment F00:
  CodeGen:(Total: 1m12s, non-child: 1m12s, % non-child: 100.00%)
 - CodegenInvoluntaryContextSwitches: 104 (104)
 - CodegenTotalWallClockTime: 1m12s
   - CodegenSysTime: 779.937ms
   - CodegenUserTime: 1m12s
 - CodegenVoluntaryContextSwitches: 12 (12)
 - CompileTime: 17s127ms
 - IrGenerationTime: 142.890ms
 - LoadTime: 0.000ns
 - ModuleBitcodeSize: 2.50 MB (2624788)
 - NumFunctions: 3.11K (3106)
 - NumInstructions: 155.46K (155461)
 - OptimizationTime: 55s547ms
 - PeakMemoryUsage: 75.91 MB (79596032)
 - PrepareTime: 18.261ms
{noformat}

> Codegen for aggregation with 1K grouping columns takes several seconds
> --
>
> Key: IMPALA-5296
> URL: https://issues.apache.org/jira/browse/IMPALA-5296
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.9.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: codegen
> Attachments: codegen-widetable-profile.txt, 
> wide_parquet_1000_small_data.0.parq
>
>
> Parquet file for sample data attached. 
> From the query profile
> {code}
> CodeGen:(Total: 3m6s, non-child: 3m6s, % non-child: 100.00%)
>  - CodegenTime: 73.408ms
>  - CompileTime: 48s511ms
>  - LoadTime: 0.000ns
>  - ModuleBitcodeSize: 1.98 MB (2074600)
>  - NumFunctions: 2.07K (2066)
>  - NumInstructions: 104.95K (104948)
>  - OptimizationTime: 2m17s
>  - PeakMemoryUsage: 51.24 MB (53733376)
>  - PrepareTime: 18.414ms
>   AGGREGATION_NODE (id=2):(Total: 3s413ms, non-child: 408.959us, % 
> non-child: 0.01%)
>  - BuildTime: 775.000ns
>  - GetResultsTime: 0.000ns
>  - HTResizeTime: 0.000ns
>  - HashBuckets: 0 (0)
>  - LargestPartitionPercent: 0 (0)
>  - MaxPartitionLevel: 0 (0)
>  - NumRepartitions: 0 (0)
>  - PartitionsCreated: 0 (0)
>  - PeakMemoryUsage: 16.00 KB (16384)
>  - RowsRepartitioned: 0 (0)
>  - RowsReturned: 1 (1)
>  - RowsReturnedRate: 0
> {code}
> Query used to repro 
> {code}
> select count(*) from (select distinct * from wide_parquet_1000_small where 
> l_orderkey = 10) a
> {code}
> Hot functions from Perf 
> {code}
> Samples: 2M of event 'cycles', Event count (approx.): 1538632554190
> +6.09% 6.07%  impalad  impalad[.] 
> llvm::ScalarEvolution::has◆
> +5.75% 0.00%  impalad  [unknown]  [.] 
>   
> +5.46% 0.00%  impalad  [unknown]  [.] 
> 0x7f8dfe242120
> +5.21% 0.00%  impalad  [unknown]  [.] 
> 0x7f8e5e13ea20
> +4.31% 0.00%  impalad  [unknown]  [.] 
> 0x7f8e5e13e8e8
> +4.31% 0.00% init  [kernel.kallsyms]  [k] 
> start_secondary   
> +4.30% 0.02% init  [kernel.kallsyms]  [k] 
> cpu_idle  
> +4.03% 0.00%  impalad  [unknown]  [.] 
> 0x7f8dfe241fe8
> +3.79% 3.77%  impalad  impalad[.] 
> llvm::Use::getImpliedUser(
> +3.65% 0.02% init  [kernel.kallsyms]  [k] 
> cpuidle_idle_call 
> +3.26% 2.78%  impalad  impalad[.] 
> llvm::MemoryDependenceAnal
> +3.17% 3.15%  impalad  impalad[.] 
> llvm::SmallPtrSetImplBase:
> +2.56% 2.52%  impalad  impalad[.] 
> llvm::ScalarEvolution::get
> +

[jira] [Assigned] (IMPALA-6746) Reduce the number of comparison for analytical functions with partitioning when incoming data is clustered

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-6746:
-

Assignee: (was: Adrian Ng)

> Reduce the number of comparison for analytical functions with partitioning 
> when incoming data is clustered
> --
>
> Key: IMPALA-6746
> URL: https://issues.apache.org/jira/browse/IMPALA-6746
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.13.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: performance
> Attachments: percentile query profile 2.txt
>
>
> Checking whether the current row belongs to the same partition in the 
> ANALYTIC node is very expensive: it does N comparisons, where N is the number 
> of rows. When the cardinality of the partition column(s) is relatively small, 
> the values will be clustered.
> One optimization, proposed by [~alex.behm], is to check the first and last 
> tuples in the batch; if they match, avoid calling 
> AnalyticEvalNode::PrevRowCompare for the entire batch.
> For the attached query, which is a common pattern, the expected speedup is 
> 20-30%.
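The proposed batch-level shortcut can be sketched like this (illustrative Python; AnalyticEvalNode's real structure is not reproduced):

```python
def comparisons_needed(batch, same_partition):
    """If the first and last rows of a batch fall in the same partition, the
    whole batch does too (the input is clustered by the partition columns),
    so one comparison replaces a per-row PrevRowCompare over the batch."""
    if batch and same_partition(batch[0], batch[-1]):
        return 1           # one comparison covers the entire batch
    return len(batch)      # fall back to comparing every row

same_part = lambda a, b: a[0] == b[0]
print(comparisons_needed([("x", 1), ("x", 2), ("x", 3)], same_part))  # → 1
print(comparisons_needed([("x", 1), ("y", 2)], same_part))            # → 2
```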
> Query
> {code}
> select l_commitdate
> ,avg(l_extendedprice) as avg_perc
> ,percentile_cont (.25) within group (order by l_extendedprice asc) as 
> perc_25
> ,percentile_cont (.5) within group (order by l_extendedprice asc) as 
> perc_50
> ,percentile_cont (.75) within group (order by l_extendedprice asc) as 
> perc_75
> ,percentile_cont (.90) within group (order by l_extendedprice asc) as 
> perc_90
> from lineitem
> group by l_commitdate
> order by l_commitdate
> {code}
> Plan
> {code}
> F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |  Per-Host Resources: mem-estimate=0B mem-reservation=0B
> PLAN-ROOT SINK
> |  mem-estimate=0B mem-reservation=0B
> |
> 09:MERGING-EXCHANGE [UNPARTITIONED]
> |  order by: l_commitdate ASC
> |  mem-estimate=0B mem-reservation=0B
> |  tuple-ids=5 row-size=66B cardinality=2559
> |
> F02:PLAN FRAGMENT [HASH(l_commitdate)] hosts=1 instances=1
> Per-Host Resources: mem-estimate=22.00MB mem-reservation=13.94MB
> 05:SORT
> |  order by: l_commitdate ASC
> |  mem-estimate=12.00MB mem-reservation=12.00MB spill-buffer=2.00MB
> |  tuple-ids=5 row-size=66B cardinality=2559
> |
> 08:AGGREGATE [FINALIZE]
> |  output: avg:merge(l_extendedprice), 
> _percentile_cont_interpolation:merge(l_extendedprice, 
> `_percentile_row_number_diff_0`), 
> _percentile_cont_interpolation:merge(l_extendedprice, 
> `_percentile_row_number_diff_1`), 
> _percentile_cont_interpolation:merge(l_extendedprice, 
> `_percentile_row_number_diff_2`), 
> _percentile_cont_interpolation:merge(l_extendedprice, 
> `_percentile_row_number_diff_3`)
> |  group by: l_commitdate
> |  mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB
> |  tuple-ids=4 row-size=66B cardinality=2559
> |
> 07:EXCHANGE [HASH(l_commitdate)]
> |  mem-estimate=0B mem-reservation=0B
> |  tuple-ids=3 row-size=66B cardinality=2559
> |
> F01:PLAN FRAGMENT [HASH(l_commitdate)] hosts=1 instances=1
> Per-Host Resources: mem-estimate=64.00MB mem-reservation=22.00MB
> 04:AGGREGATE [STREAMING]
> |  output: avg(l_extendedprice), 
> _percentile_cont_interpolation(l_extendedprice, row_number() - 1 - 
> count(l_extendedprice) - 1 * 0.25), 
> _percentile_cont_interpolation(l_extendedprice, row_number() - 1 - 
> count(l_extendedprice) - 1 * 0.5), 
> _percentile_cont_interpolation(l_extendedprice, row_number() - 1 - 
> count(l_extendedprice) - 1 * 0.75), 
> _percentile_cont_interpolation(l_extendedprice, row_number() - 1 - 
> count(l_extendedprice) - 1 * 0.90)
> |  group by: l_commitdate
> |  mem-estimate=10.00MB mem-reservation=2.00MB spill-buffer=64.00KB
> |  tuple-ids=3 row-size=66B cardinality=2559
> |
> 03:ANALYTIC
> |  functions: count(l_extendedprice)
> |  partition by: l_commitdate
> |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB
> |  tuple-ids=9,7,8 row-size=50B cardinality=59986052
> |
> 02:ANALYTIC
> |  functions: row_number()
> |  partition by: l_commitdate
> |  order by: l_extendedprice ASC
> |  window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB
> |  tuple-ids=9,7 row-size=42B cardinality=59986052
> |
> 01:SORT
> |  order by: l_commitdate ASC NULLS FIRST, l_extendedprice ASC NULLS LAST
> |  mem-estimate=46.00MB mem-reservation=12.00MB spill-buffer=2.00MB
> |  tuple-ids=9 row-size=34B cardinality=59986052
> |
> 06:EXCHANGE [HASH(l_commitdate)]
> |  mem-estimate=0B mem-reservation=0B
> |  tuple-ids=0 row-size=34B cardinality=59986052
> |
> F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> Per-Host Resources: mem-estimate=88.00MB 

[jira] [Assigned] (IMPALA-6354) Consider using Guava LoadingCache to cache metadata objects opposed to a ConcurrentHashMap

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-6354:
-

Assignee: (was: Tianyi Wang)

> Consider using Guava LoadingCache to cache metadata objects opposed to a 
> ConcurrentHashMap
> --
>
> Key: IMPALA-6354
> URL: https://issues.apache.org/jira/browse/IMPALA-6354
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: catalog, catalog-server, scalability
>
> Look into replacing the ConcurrentHashMap(s) used in the Catalog service to 
> cache metadata with 
> [LoadingCache|https://google.github.io/guava/releases/19.0/api/docs/com/google/common/cache/LoadingCache.html].
> Caching metadata using a LoadingCache would allow:
> * Better coherency, by adding expireAfter and refreshAfter clauses to the cache
> * Weak-reference eviction to respond to garbage collection
> * Capping the maximum number of cache entries to avoid OOMs and excessive GC
> * Assigning different weights to each entry (for more efficient eviction)
> https://github.com/google/guava/wiki/CachesExplained
> https://google.github.io/guava/releases/19.0/api/docs/com/google/common/cache/CacheBuilder.html
> https://google.github.io/guava/releases/19.0/api/docs/com/google/common/cache/LoadingCache.html
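Guava's LoadingCache bundles load-on-miss, expiry, and size-bounded eviction; as a toy illustration of just the load-on-miss plus LRU size cap (Python, purely illustrative, not a Guava substitute):

```python
from collections import OrderedDict

class LoadingCache:
    """Minimal load-on-miss cache with LRU eviction at a maximum size."""
    def __init__(self, loader, max_entries):
        self.loader, self.max_entries = loader, max_entries
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)         # mark as recently used
        else:
            self.data[key] = self.loader(key)  # load on miss
            if len(self.data) > self.max_entries:
                self.data.popitem(last=False)  # evict least recently used
        return self.data[key]

cache = LoadingCache(loader=str.upper, max_entries=2)
cache.get("a"); cache.get("b"); cache.get("c")  # "a" is evicted
print(list(cache.data))  # → ['b', 'c']
```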






[jira] [Assigned] (IMPALA-6404) Evenly distribute local and remote scan ranges across Impalad(s) when 100% locality is not achievable

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-6404:
-

Assignee: (was: Lars Volker)

> Evenly distribute local and remote scan ranges across Impalad(s) when 100% 
> locality is not achievable
> -
>
> Key: IMPALA-6404
> URL: https://issues.apache.org/jira/browse/IMPALA-6404
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Distributed Exec
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: scheduler
>
> The current scheduler tries to assign as many local reads as possible. This 
> works well if 100% locality is achievable, but when some nodes have locality 
> and others don't, an uneven number of scan ranges is assigned to the 
> backends, which results in execution skew.
> Ideally the scheduler should create an even distribution of local and remote 
> scan ranges to avoid skew.
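The desired behaviour might look like this in miniature (a toy sketch with assumed data structures, not the scheduler's real code): assign local ranges first, then place remote ranges on the least-loaded backends.

```python
def assign_scan_ranges(ranges, backends):
    """ranges: list of (range_id, local_backend_or_None). Local ranges go to
    their backend; remote ranges are balanced onto the least-loaded backends
    so the total per-backend load stays even."""
    load = {b: [] for b in backends}
    remote = []
    for rid, local in ranges:
        if local in load:
            load[local].append(rid)
        else:
            remote.append(rid)
    for rid in remote:
        target = min(load, key=lambda b: len(load[b]))  # least-loaded backend
        load[target].append(rid)
    return load

ranges = [(0, "n1"), (1, "n1"), (2, "n1"), (3, None), (4, None), (5, None)]
print(assign_scan_ranges(ranges, ["n1", "n2"]))
# → {'n1': [0, 1, 2], 'n2': [3, 4, 5]}
```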






[jira] [Assigned] (IMPALA-6006) Incorrect cardinality estimation when dimension table has inequality predicate

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-6006:
-

Assignee: (was: Philip Martin)

> Incorrect cardinality estimation when dimension table has inequality predicate
> --
>
> Key: IMPALA-6006
> URL: https://issues.apache.org/jira/browse/IMPALA-6006
> Project: IMPALA
>  Issue Type: Bug
>Affects Versions: Impala 2.11.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>
> Query 
> {code}
> select count(*)
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>where
>  d_month_seq between 1193 and 1193+11;
> {code}
> Plan
> {code}
> +---+
> | Explain String  
>   |
> +---+
> | Max Per-Host Resource Reservation: Memory=1.94MB
>   |
> | Per-Host Resource Estimates: Memory=54.94MB 
>   |
> | 
>   |
> | F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1   
>   |
> | |  Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B  
>   |
> | PLAN-ROOT SINK  
>   |
> | |  mem-estimate=0B mem-reservation=0B   
>   |
> | |   
>   |
> | 06:AGGREGATE [FINALIZE] 
>   |
> | |  output: count:merge(*)   
>   |
> | |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB  
>   |
> | |  tuple-ids=2 row-size=8B cardinality=1
>   |
> | |   
>   |
> | 05:EXCHANGE [UNPARTITIONED] 
>   |
> | |  mem-estimate=0B mem-reservation=0B   
>   |
> | |  tuple-ids=2 row-size=8B cardinality=1
>   |
> | |   
>   |
> | F00:PLAN FRAGMENT [RANDOM] hosts=7 instances=7  
>   |
> | Per-Host Resources: mem-estimate=12.94MB mem-reservation=1.94MB 
>   |
> | 03:AGGREGATE
>   |
> | |  output: count(*) 
>   |
> | |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB  
>   |
> | |  tuple-ids=2 row-size=8B cardinality=1
>   |
> | |   
>   |
> | 02:HASH JOIN [INNER JOIN, BROADCAST]
>   |
> | |  hash predicates: catalog_sales.cs_sold_date_sk = date_dim.d_date_sk  
>   |
> | |  fk/pk conjuncts: catalog_sales.cs_sold_date_sk = date_dim.d_date_sk  
>   |
> | |  runtime filters: RF000 <- date_dim.d_date_sk 
>   |
> | |  mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB  
>   |
> | |  tuple-ids=0,1 row-size=16B cardinality=14399964710   
>   |
> | |   
>   |
> | |--04:EXCHANGE [BROADCAST]  
>   |
> | |  |  mem-estimate=0B mem-reservation=0B
>   |
> | |  |  tuple-ids=1 row-size=8B cardinality=7305  
>   |
> | |  |
>   |
> | |  F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1   
>   |
> | |  Per-Host Resources: mem-estimate=32.00MB mem-reservation=0B  
>   |
> | |  01:SCAN HDFS [tpcds_1_parquet.date_dim, RANDOM]  
>   |
> | | partitions=1/1 files=1 size=2.15MB
>   |
> | | predicates: d_month_seq <= 1204, d_month_seq >= 1193  
>   |
> | | stats-rows=73049 extrapolated-rows=disabled   
>   |
> | | table stats: rows=73049 size=unavailable  
>   |
> | | column stats: all 
>   |
> | | parquet statistics predicates: d_month_seq <= 1204, d_month_seq >= 
> 1193 |
> 

[jira] [Resolved] (IMPALA-4062) Create thread pool for HdfsScanNode::ScannerThread to limit Kernel contention

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-4062.
---
Resolution: Won't Fix

No longer relevant after the IMPALA-3902 changes.

> Create thread pool for HdfsScanNode::ScannerThread to limit Kernel contention
> -
>
> Key: IMPALA-4062
> URL: https://issues.apache.org/jira/browse/IMPALA-4062
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.6.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: performance, scalability
> Attachments: TPC-DS Q27.txt, q27_perf_kernel.txt, q_27_spinning_1.zip
>
>
> Servers with modern processors like the E5-2698 can have up to 80 logical 
> processors, so queries occasionally end up running with a significantly 
> larger number of threads.
> Creating and destroying threads is expensive and wastes resources, so 
> consider creating a thread pool for scanner threads to avoid resource 
> contention during thread creation.
> For TPC-DS Q27, >40% of CPU cycles are spent in pthread_mutex_unlock and 
> pthread_mutex_lock.
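The thread-pool idea in miniature (Python's concurrent.futures for illustration; the scanner threads themselves are C++):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_range(rid):
    """Stand-in for the per-scan-range work done by a scanner thread."""
    return rid * 2

# Reuse a fixed pool of workers instead of creating and destroying a thread
# per scan range, avoiding creation/teardown cost and kernel contention.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(scan_range, range(8)))
print(results)  # → [0, 2, 4, 6, 8, 10, 12, 14]
```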
> Call stacks
> {code}
> CPU Time
> 1 of 5: 71.4% (31.928s of 44.725s)
> impalad ! pthread_mutex_unlock - mutex.hpp
> impalad ! boost::mutex::unlock + 0x10 - mutex.hpp:125
> impalad ! ~unique_lock + 0x16 - lock_types.hpp:331
> impalad ! impala::HdfsScanNode::ScannerThread + 0x2aa - hdfs-scan-node.cc:1044
> impalad ! boost::function0::operator() + 0x1a - 
> function_template.hpp:767
> impalad ! impala::Thread::SuperviseThread + 0x20e - thread.cc:318
> impalad ! operator()&, const 
> std::basic_string&, boost::function, impala::Promise int>*), boost::_bi::list0> + 0x5a - bind.hpp:457
> impalad ! boost::_bi::bind_t const&, boost::function, impala::Promise*), 
> boost::_bi::list4, 
> boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*>>>::operator() - 
> bind_template.hpp:20
> impalad ! boost::detail::thread_data (*)(std::string const&, std::string const&, boost::function, 
> impala::Promise*), boost::_bi::list4, 
> boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*::run + 0x19 - 
> thread.hpp:116
> impalad ! thread_proxy + 0xd9 - [unknown source file]
> libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
> libc.so.6 ! clone + 0x6c - [unknown source file]
> {code}
> {code}
> CPU Time
> 2 of 5: 26.4% (11.787s of 44.725s)
> impalad ! pthread_mutex_unlock - mutex.hpp
> impalad ! boost::mutex::unlock + 0x10 - mutex.hpp:125
> impalad ! ~unique_lock + 0x16 - lock_types.hpp:331
> impalad ! impala::Promise::Get + 0x82d - promise.h:94
> impalad ! impala::CountingBarrier::Wait - counting-barrier.h:42
> impalad ! impala::HdfsScanNode::ScannerThread + 0x2aa - hdfs-scan-node.cc:1044
> impalad ! boost::function0::operator() + 0x1a - 
> function_template.hpp:767
> impalad ! impala::Thread::SuperviseThread + 0x20e - thread.cc:318
> impalad ! operator()&, const 
> std::basic_string&, boost::function, impala::Promise int>*), boost::_bi::list0> + 0x5a - bind.hpp:457
> impalad ! boost::_bi::bind_t const&, boost::function, impala::Promise*), 
> boost::_bi::list4, 
> boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*>>>::operator() - 
> bind_template.hpp:20
> impalad ! boost::detail::thread_data (*)(std::string const&, std::string const&, boost::function, 
> impala::Promise*), boost::_bi::list4, 
> boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*::run + 0x19 - 
> thread.hpp:116
> impalad ! thread_proxy + 0xd9 - [unknown source file]
> libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
> libc.so.6 ! clone + 0x6c - [unknown source file]
> {code}
> {code}
> CPU Time
> 1 of 11: 98.4% (35.681s of 36.271s)
> impalad ! pthread_mutex_lock - mutex.hpp
> impalad ! boost::mutex::lock + 0x10 - mutex.hpp:116
> impalad ! [impalad] + 0x25f9daf - [unknown source file]
> impalad ! boost::function0::operator() + 0x1a - 
> function_template.hpp:767
> impalad ! impala::Thread::SuperviseThread + 0x20e - thread.cc:318
> impalad ! operator()&, const 
> std::basic_string&, boost::function, impala::Promise int>*), boost::_bi::list0> + 0x5a - bind.hpp:457
> impalad ! boost::_bi::bind_t const&, boost::function, impala::Promise*), 
> boost::_bi::list4, 
> boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*>>>::operator() - 
> bind_template.hpp:20
> impalad ! boost::detail::thread_data (*)(std::string const&, std::string const&, boost::function, 
> impala::Promise*), boost::_bi::list4, 
> boost::_bi::value, boost::_bi::value (void)>>, boost::_bi::value*::run + 0x19 - 
> thread.hpp:116
> impalad ! thread_proxy + 0xd9 - [unknown source file]
> libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
> libc.so.6 ! clone 

[jira] [Assigned] (IMPALA-3731) Runtime filters from the same source arrive at different times

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-3731:
-

Assignee: (was: Henry Robinson)

> Runtime filters from the same source arrive at different times
> --
>
> Key: IMPALA-3731
> URL: https://issues.apache.org/jira/browse/IMPALA-3731
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend
>Affects Versions: Impala 2.5.0
>Reporter: Mostafa Mokhtar
>Priority: Minor
>  Labels: runtime-filters
>
> Runtime filters from the same source are arriving ~5 seconds apart; it seems 
> that the coordinator is either serializing the filters or was network bound.
> Query
> {code}
> select count(*) rowcount
> from store_sales a
>  ,store_returns b
> where  a.ss_item_sk = b.sr_item_sk
>and a.ss_ticket_number = b.sr_ticket_number
>and ss_sold_date_sk between 2450816 and 2451500
>and sr_returned_date_sk between 2450816 and 2451500
> group by ss_cdemo_sk,ss_store_sk,ss_item_sk , ss_ticket_number having 
> count(*) > 1
> {code}
> Subplan
> {code}
> |
> 00:SCAN HDFS [tpcds_3000_parquet.store_sales a, RANDOM]
>partitions=683/1824 files=944 size=126.77GB
>runtime filters: RF000 -> a.ss_item_sk, RF001 -> a.ss_ticket_number
>table stats: 8639936081 rows total
>column stats: all
>hosts=61 per-host-mem=352.00MB
>tuple-ids=0 row-size=24B cardinality=2886246552
> {code}
> Filter table
> {code}
>  ID  Src. Node  Tgt. Node(s)  Targets  Target type  Partition filter  Pending 
> (Expected)  First arrived   Completed
> ---
>   1  2 0   61   REMOTE false  
> 0 (61)2s881ms10s265ms
>   0  2 0   61   REMOTE false  
> 0 (61)3s698ms10s350ms
> {code}
> Filters arriving at different times
> {code}
>   Instance 614bea9715cbde44:b0134609741aea61 
> (host=impala-compete-64-5.vpc.cloudera.com:22000):(Total: 30s446ms, 
> non-child: 10s882ms, % non-child: 35.74%)
> Hdfs split stats (:<# splits>/): 0:16/2.33 
> GB 
> Filter 1 arrival: 11s854ms
> Filter 0 arrival: 16s047ms
> {code}






[jira] [Resolved] (IMPALA-3701) Evaluate compressing Runtime filters to save coordinator network bandwidth

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3701.
---
Resolution: Won't Fix

We expect these to be generally incompressible, so not worth pursuing.
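The incompressibility is easy to sanity-check: a well-populated Bloom filter is close to uniformly random bits, which generic compressors cannot shrink (a quick illustration, not a measurement of Impala's actual filters):

```python
import os, zlib

random_bits = os.urandom(1 << 20)  # stand-in for a well-filled Bloom filter
sparse_bits = bytes(1 << 20)       # a nearly empty filter, for contrast

print(len(zlib.compress(random_bits)) / len(random_bits))  # ~1.0: no savings
print(len(zlib.compress(sparse_bits)) / len(sparse_bits))  # tiny: compresses well
```

A sparsely populated filter would compress, but a filter sized correctly for its input is mostly dense.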

> Evaluate compressing Runtime filters to save coordinator network bandwidth
> --
>
> Key: IMPALA-3701
> URL: https://issues.apache.org/jira/browse/IMPALA-3701
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Distributed Exec
>Affects Versions: Impala 2.5.0
>Reporter: Mostafa Mokhtar
>Assignee: Henry Robinson
>Priority: Major
>  Labels: runtime-filters, scalability
> Attachments: image-2016-06-08-22-55-36-966.png, query17.sql.2.out
>
>
> When running complex queries on large clusters with many runtime filters, 
> the coordinator quickly becomes network bound due to the extra incoming and 
> outgoing filter traffic. Once the coordinator becomes network bound, all 
> other fragments in the cluster are negatively affected, as they get blocked 
> shuffling/broadcasting data to the coordinator node.
> This bottleneck was identified when running large-scale tests on EC2 nodes 
> with less-than-ideal network throughput.
> The attached png shows aggregate network throughput across the 32 nodes in 
> the cluster, with the coordinator in red.
>  !image-2016-06-08-22-55-36-966.png|thumbnail! 
> Compression should alleviate this bottleneck, but we should consider other 
> solutions too.






[jira] [Resolved] (IMPALA-3617) Incorrect reporting of runtime filters in scan node

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3617.
---
Resolution: Cannot Reproduce

> Incorrect reporting of runtime filters in scan node
> ---
>
> Key: IMPALA-3617
> URL: https://issues.apache.org/jira/browse/IMPALA-3617
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 2.6.0
>Reporter: Mostafa Mokhtar
>Assignee: Henry Robinson
>Priority: Minor
>  Labels: runtime-filters
> Attachments: query31.sql.3.out
>
>
> The summary line reports that only one filter arrived, while both filters were received. 
> {code}   
> HDFS_SCAN_NODE (id=6):(Total: 9s443ms, non-child: 9s443ms, % non-child: 
> 100.00%)
>   ExecOption: Expr Evaluation Codegen Disabled, Codegen enabled: 0 
> out of 14
>   Hdfs split stats (:<# splits>/): 
> 2:39/8.86 GB 5:42/9.29 GB 4:28/6.50 GB 1:29/6.45 GB 6:32/7.38 GB 3:27/6.07 GB 
> 10:34/7.63 GB 7:26/5.75 GB 0:33/7.24 GB 9:37/8.06 GB 8:22/4.86 GB 
>   Runtime filters: Only following filters arrived: 8, waited 8s991ms
>   Hdfs Read Thread Concurrency Bucket: 0:100% 1:0% 2:0% 3:0% 4:0% 
> 5:0% 6:0% 7:0% 8:0% 9:0% 10:0% 11:0% 12:0% 13:0% 14:0% 15:0% 16:0% 17:0% 
> 18:0% 19:0% 20:0% 21:0% 22:0% 23:0% 24:0% 25:0% 26:0% 27:0% 
>   File Formats: PARQUET/NONE:335 PARQUET/SNAPPY:28 
>   BytesRead(500.000ms): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 73.18 MB, 73.18 MB, 73.18 MB, 73.18 MB, 73.18 MB, 73.18 MB, 
> 73.18 MB, 93.31 MB, 134.68 MB, 161.83 MB, 161.83 MB, 161.83 MB, 161.83 MB, 
> 184.49 MB, 206.65 MB, 233.75 MB, 233.75 MB, 233.75 MB, 261.02 MB, 261.02 MB, 
> 288.24 MB, 309.87 MB, 309.87 MB, 309.87 MB, 337.08 MB, 337.08 MB, 337.08 MB, 
> 337.08 MB, 337.08 MB
>- AverageHdfsReadThreadConcurrency: 0.00 
>- AverageScannerThreadConcurrency: 3.26 
>- BytesRead: 337.08 MB (353454758)
>- BytesReadDataNodeCache: 0
>- BytesReadLocal: 337.08 MB (353454758)
>- BytesReadRemoteUnexpected: 0
>- BytesReadShortCircuit: 337.08 MB (353454758)
>- DecompressionTime: 784.603ms
>- MaxCompressedTextFileLength: 0
>- NumColumns: 2 (2)
>- NumDisksAccessed: 6 (6)
>- NumRowGroups: 14 (14)
>- NumScannerThreadsStarted: 4 (4)
>- PeakMemoryUsage: 134.29 MB (140810128)
>- PerReadThreadRawHdfsThroughput: 2.00 GB/sec
>- RemoteScanRanges: 0 (0)
>- RowsRead: 90.33M (90325159)
>- RowsReturned: 90.32M (90321194)
>- RowsReturnedRate: 9.56 M/sec
>- ScanRangesComplete: 349 (349)
>- ScannerThreadsInvoluntaryContextSwitches: 2.85K (2851)
>- ScannerThreadsTotalWallClockTime: 1m16s
>  - MaterializeTupleTime(*): 50s026ms
>  - ScannerThreadsSysTime: 887.864ms
>  - ScannerThreadsUserTime: 6s995ms
>- ScannerThreadsVoluntaryContextSwitches: 176.78K (176783)
>- TotalRawHdfsReadTime(*): 164.437ms
>- TotalReadThroughput: 13.76 MB/sec
>   Filter 7 (1.00 MB):
>  - Rows processed: 229.36K (229362)
>  - Rows rejected: 3.96K (3965)
>  - Rows total: 229.38K (229376)
>   Filter 8 (1.00 MB):
>  - Files processed: 349 (349)
>  - Files rejected: 335 (335)
>  - Files total: 349 (349)
>  - RowGroups processed: 88.21K (88214)
>  - RowGroups rejected: 0 (0)
>  - RowGroups total: 88.21K (88214)
>  - Rows processed: 229.36K (229362)
>  - Rows rejected: 0 (0)
>  - Rows total: 229.38K (229376)
>  - Splits processed: 14 (14)
>  - Splits rejected: 0 (0)
>  - Splits total: 14 (14)
> {code}
> Ditto in final routing table 
> {code}
> ID  Src. Node  Tgt. Node(s) Targets   
>   Target type   Partition filter  Pending (Expected)  First 
> arrived   Completed
> --
>   6  3 0  20  
>  LOCAL   true  0 (20) 
>N/A N/A
>   5  4 0  20  
>  LOCAL  false  0 (20) 
>N/A N/A
>   8  9 6  20  
>  LOCAL  

[jira] [Resolved] (IMPALA-3636) Regression in DecimalOperators::EQ with codegen disabled

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3636.
---
Resolution: Won't Fix

> Regression in DecimalOperators::EQ with codegen disabled
> 
>
> Key: IMPALA-3636
> URL: https://issues.apache.org/jira/browse/IMPALA-3636
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.6.0
>Reporter: Mostafa Mokhtar
>Priority: Minor
>  Labels: performance, regression
>
> Some of the decimal improvements that came in Impala 2.6 introduced a 
> regression in the non-codegened path.
> This regression was caused by 
> https://github.com/cloudera/Impala/blob/cdh5-trunk/testdata/workloads/targeted-perf/queries/primitive_orderby_all.test.
>  
> After
> ||Function Stack||CPU Time: Total||
> |impala::DecimalOperators::Eq_DecimalVal_DecimalVal|62.207s|
> |  --impala::Expr::GetConstantInt|55.458s|
> |  --impala::DecimalValue::Eq|1.480s|
> |  --impala::GetDecimal8Value|0.290s|
> |  --impala::DecimalValue<__int128>::Eq|0.190s|
>   
>  Before 
> ||Function Stack||CPU Time: Total||
> |impala::DecimalOperators::Eq_DecimalVal_DecimalVal|9.809s|
> |  --impala::DecimalValue::Compare|2.300s|
> |  --impala_udf::FunctionContext::GetArgType|2.130s|
> |  --func@0x812950|0.390s|
> This is a simplified version of the query which can be used as a repro
> {code}
> select *
> FROM (
>   SELECT Rank() OVER (
>   ORDER BY l_extendedprice
> ,l_quantity
> ,l_discount
> ,l_tax
>   ) AS rank
>   FROM lineitem
>   WHERE l_shipdate < '1992-05-09'
>   ) a
> WHERE rank < 10
> {code}






[jira] [Commented] (IMPALA-3101) AnalyticEvalNode should use codegened TupleRowComparator instead of PrevRowCompare

2020-06-12 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134600#comment-17134600
 ] 

Tim Armstrong commented on IMPALA-3101:
---

IMPALA-4356 should guarantee that the full expr tree was codegen'd


> AnalyticEvalNode should use codegened TupleRowComparator instead of 
> PrevRowCompare
> --
>
> Key: IMPALA-3101
> URL: https://issues.apache.org/jira/browse/IMPALA-3101
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.6.0
>Reporter: Mostafa Mokhtar
>Priority: Minor
>  Labels: codegen, performance
> Attachments: primitive_orderby_bigint_VtuneTopDown.csv
>
>
> AnalyticEvalNode uses PrevRowCompare to compare rows, which is very 
> inefficient compared to the codegen'd version of TupleRowComparator::Compare
> |Function Stack||CPU Time: Total||CPU Time: Self||Module||Function 
> (Full)||Source File||Start Address|
> |impala::AnalyticEvalNode::ProcessChildBatch|47.9%|0.810s|impalad|impala::AnalyticEvalNode::ProcessChildBatch(impala::RuntimeState*)|analytic-eval-node.cc|0xc0a870|
> |  
> impala::AnalyticEvalNode::TryAddResultTupleForPrevRow|35.0%|0.570s|impalad|impala::AnalyticEvalNode::TryAddResultTupleForPrevRow(bool,
>  long, impala::TupleRow*)|analytic-eval-node.cc|0xc0aa85|
> |
> impala::AnalyticEvalNode::PrevRowCompare|30.3%|0.040s|impalad|impala::AnalyticEvalNode::PrevRowCompare(impala::ExprContext*)|analytic-eval-node.cc|0xc0ae1d|
> |  
> impala::ExprContext::GetBooleanVal|30.2%|0.330s|impalad|impala::ExprContext::GetBooleanVal(impala::TupleRow*)|expr-context.cc|0x7f0790|
> |
> impala::AndPredicate::GetBooleanVal|29.8%|1.220s|impalad|impala::AndPredicate::GetBooleanVal(impala::ExprContext*,
>  impala::TupleRow*)|compound-predicates.cc|0x8575c0|
> |  
> impala::OrPredicate::GetBooleanVal|28.5%|2.840s|impalad|impala::OrPredicate::GetBooleanVal(impala::ExprContext*,
>  impala::TupleRow*)|compound-predicates.cc|0x857650|
> These queries can be used for repro 
> https://github.com/cloudera/Impala/blob/cdh5-trunk/testdata/workloads/targeted-perf/queries/primitive_orderby_all.test
> https://github.com/cloudera/Impala/blob/cdh5-trunk/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test
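Conceptually, the node decides partition boundaries by comparing each row's partition expressions to the previous row's; a minimal sketch of that idea (hypothetical names and Python for illustration, not Impala's C++ code):

```python
def partition_boundaries(rows, partition_key):
    """Yield (row, starts_new_partition) pairs in stream order.

    An interpreted comparator runs this check once per row, which is
    why a codegen'd comparator pays off on large inputs.
    """
    prev = None
    first = True
    for row in rows:
        key = partition_key(row)
        yield row, first or key != prev
        prev, first = key, False

rows = [{"k": 1}, {"k": 1}, {"k": 2}]
flags = [is_new for _, is_new in partition_boundaries(rows, lambda r: r["k"])]
```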






[jira] [Commented] (IMPALA-9847) JSON profiles are mostly space characters

2020-06-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134597#comment-17134597
 ] 

ASF subversion and git services commented on IMPALA-9847:
-

Commit 6ca6e403580dc592c026b4f684d31f8a4dcfae11 in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=6ca6e40 ]

IMPALA-9847: reduce web UI serialized JSON size

Switch to using the plain writer in some places, and
tweak PrettyWriter to produce denser output for the
debug UI JSON (so that it is denser but still human
readable).

Testing:
Manually tested. The profile for the below query went
from 338kB to 134kB.

  select min(l_orderkey) from tpch_parquet.lineitem;

Change-Id: I66af9d00f0f0fc70e324033b6464b75a6adadd6f
Reviewed-on: http://gerrit.cloudera.org:8080/16068
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> JSON profiles are mostly space characters
> -
>
> Key: IMPALA-9847
> URL: https://issues.apache.org/jira/browse/IMPALA-9847
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Fix For: Impala 4.0
>
>
> JSON profiles are pretty-printed with 4 space characters per indent. This 
> means that most of the profile data is actually just space characters, and 
> this can add up for large profiles.
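The effect is easy to demonstrate with any JSON library; a rough sketch using Python's json module as a stand-in for RapidJSON's PrettyWriter (the sizes are illustrative, not the profile numbers quoted above):

```python
import json

# A nested dict shaped roughly like a runtime profile tree.
profile = {"node%d" % i: {"counters": {"c%d" % j: j for j in range(5)}}
           for i in range(100)}

compact = json.dumps(profile, separators=(",", ":"))
pretty = json.dumps(profile, indent=4)  # 4-space indent, like the old output

# Most of the pretty-printed bytes are indentation whitespace.
overhead = len(pretty) - len(compact)
```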






[jira] [Commented] (IMPALA-9824) MetastoreClientPool should be singleton

2020-06-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134596#comment-17134596
 ] 

ASF subversion and git services commented on IMPALA-9824:
-

Commit 0cb44242d20532945e5fb09f5bbef6c65415a753 in impala's branch 
refs/heads/master from Vihang Karajgaonkar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=0cb4424 ]

IMPALA-9791: Support validWriteIdList in getPartialCatalogObject API

This change enhances the Catalog-v2 API getPartialCatalogObject to
support ValidWriteIdList as an optional field in the TableInfoSelector.
When such a field is provided by the clients, catalog compares the
provided ValidWriteIdList with the cached ValidWriteIdList of the
table. The catalog reloads the table if it determines that the cached
table is stale with respect to the ValidWriteIdList provided.
In case the table is already at or above the requested ValidWriteIdList
catalog uses the cached table metadata information to filter out
file descriptors pertaining to the provided ValidWriteIdList.
Note that in case of compactions it is possible that the requested
ValidWriteIdList cannot be satisfied using the cached file-metadata
for some partitions. For such partitions, catalog re-fetches the
file-metadata from the FileSystem.

In order to implement the fall-back to getting the file-metadata from
the filesystem, the patch refactors some of the file-metadata loading logic into
ParallelFileMetadataLoader which also helps simplify some methods
in HdfsTable.java. Additionally, it modifies the WriteIdBasedPredicate
to optionally do a strict check which throws an exception on some
scenarios.

This is helpful to provide a snapshot view of the table metadata during
query compilation with respect to other changes happening to the table
concurrently. Note that this change does not implement the coordinator
side changes needed for catalog clients to use such a field. That would
be taken up in a separate change to keep this patch smaller.

Testing:
1. Ran existing filemetadata loader tests.
2. Added a new test which exercises the various cases for
ValidWriteIdList comparison.
3. Ran core tests along with the dependent MetastoreClientPool
patch (IMPALA-9824).

Change-Id: Ied2c7c3cb2009c407e8fbc3af4722b0d34f57c4a
Reviewed-on: http://gerrit.cloudera.org:8080/16008
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> MetastoreClientPool should be singleton
> ---
>
> Key: IMPALA-9824
> URL: https://issues.apache.org/jira/browse/IMPALA-9824
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>Priority: Minor
>
> Currently, the MetastoreClientPool is instantiated in multiple places in the 
> code; it would be good to refactor it into a singleton. Each 
> MetastoreClientPool creates multiple clients to HMS, and unnecessary creation 
> of multiple pools could cause problems on the HMS side. 
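The refactor amounts to the classic lazily initialized singleton; a minimal sketch under assumed names (Impala's catalog code is Java — Python here is only illustrative):

```python
import threading

class MetastoreClientPool:
    """Hypothetical stand-in for a pool of HMS client connections."""
    _instance = None
    _lock = threading.Lock()

    def __init__(self, num_clients=5):
        self.num_clients = num_clients  # clients created per pool

    @classmethod
    def instance(cls):
        # Double-checked locking: only the first caller builds the pool;
        # every later caller reuses it, so HMS sees one pool's clients.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

a = MetastoreClientPool.instance()
b = MetastoreClientPool.instance()
```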






[jira] [Commented] (IMPALA-9843) Add ability to run schematool against HMS in minicluster

2020-06-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134598#comment-17134598
 ] 

ASF subversion and git services commented on IMPALA-9843:
-

Commit f8c28f8adfd781727c311b15546a532ce65881e0 in impala's branch 
refs/heads/master from Vihang Karajgaonkar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f8c28f8 ]

IMPALA-9843: Add support for metastore db schema upgrade

This change adds support to upgrade the HMS database schema using the
hive schema tool. It adds a new option to the buildall.sh script
which can be provided to upgrade the HMS db schema. Alternatively,
users can directly upgrade the schema using the
create-test-configuration.sh script. The logs for the schema upgrade
are available in logs/cluster/schematool.log.

Following invocations will upgrade the HMS database schema.

1. buildall.sh -upgrade_metastore_db
2. bin/create-test-configuration.sh -upgrade_metastore_db

This upgrade option is idempotent. It is a no-op if the metastore
schema is already at its latest version. In case of any errors, the
only fallback currently is to format the metastore schema and load
the test data again.

Testing:
Upgraded the HMS schema on my local dev environment and made
sure that the HMS service starts without any errors.

Change-Id: I85af8d57e110ff284832056a1661f94b85ed3b09
Reviewed-on: http://gerrit.cloudera.org:8080/16054
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Add ability to run schematool against HMS in minicluster
> 
>
> Key: IMPALA-9843
> URL: https://issues.apache.org/jira/browse/IMPALA-9843
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Sahil Takiar
>Assignee: Vihang Karajgaonkar
>Priority: Major
> Fix For: Impala 4.0
>
>
> When the CDP version is bumped, we often need to re-format the HMS postgres 
> database because the HMS schema needs updating. Hive provides a standalone 
> tool for performing schema updates: 
> [https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool]
> Impala should be able to integrate with this tool, so that developers don't 
> have to blow away their HMS database every time the CDP version is bumped up. 
> Even worse, blowing away the HMS data requires performing a full data load.
> It would be great to have a wrapper around the schematool that can easily be 
> invoked by developers.






[jira] [Commented] (IMPALA-9791) Support validWriteIdList in getPartialCatalogObject

2020-06-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134595#comment-17134595
 ] 

ASF subversion and git services commented on IMPALA-9791:
-

Commit 0cb44242d20532945e5fb09f5bbef6c65415a753 in impala's branch 
refs/heads/master from Vihang Karajgaonkar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=0cb4424 ]

IMPALA-9791: Support validWriteIdList in getPartialCatalogObject API

This change enhances the Catalog-v2 API getPartialCatalogObject to
support ValidWriteIdList as an optional field in the TableInfoSelector.
When such a field is provided by the clients, catalog compares the
provided ValidWriteIdList with the cached ValidWriteIdList of the
table. The catalog reloads the table if it determines that the cached
table is stale with respect to the ValidWriteIdList provided.
In case the table is already at or above the requested ValidWriteIdList
catalog uses the cached table metadata information to filter out
file descriptors pertaining to the provided ValidWriteIdList.
Note that in case of compactions it is possible that the requested
ValidWriteIdList cannot be satisfied using the cached file-metadata
for some partitions. For such partitions, catalog re-fetches the
file-metadata from the FileSystem.

In order to implement the fall-back to getting the file-metadata from
the filesystem, the patch refactors some of the file-metadata loading logic into
ParallelFileMetadataLoader which also helps simplify some methods
in HdfsTable.java. Additionally, it modifies the WriteIdBasedPredicate
to optionally do a strict check which throws an exception on some
scenarios.

This is helpful to provide a snapshot view of the table metadata during
query compilation with respect to other changes happening to the table
concurrently. Note that this change does not implement the coordinator
side changes needed for catalog clients to use such a field. That would
be taken up in a separate change to keep this patch smaller.

Testing:
1. Ran existing filemetadata loader tests.
2. Added a new test which exercises the various cases for
ValidWriteIdList comparison.
3. Ran core tests along with the dependent MetastoreClientPool
patch (IMPALA-9824).

Change-Id: Ied2c7c3cb2009c407e8fbc3af4722b0d34f57c4a
Reviewed-on: http://gerrit.cloudera.org:8080/16008
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Support validWriteIdList in getPartialCatalogObject
> ---
>
> Key: IMPALA-9791
> URL: https://issues.apache.org/jira/browse/IMPALA-9791
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>Priority: Major
>  Labels: impala-acid
>
> When transactional tables are being queried, the coordinator (or any other 
> catalog client) can optionally provide a ValidWriteIdList for the table. In 
> that case, the catalog can return metadata that is consistent with the given 
> ValidWriteIdList. There are three possibilities:
> 1. Client provided ValidWriteIdList is more recent.
> In this case, catalog should reload the table then send the metadata 
> consistent with the provided writeIdList.
> 2. Client ValidWriteIdList is same.
> Catalog can return the cached metadata directly.
> 3. ClientValidWriteIdList is stale with respect to the one in catalog.
> In this case, catalog can attempt to return metadata which is consistent with 
> respect to client's view of the writeIdList and return accordingly. Note that 
> in case 1, it is possible that after reload, catalog moves ahead of the 
> client's writeIdList and hence this becomes a sub-case of 1.
> Having such an enhancement to the API can help provide consistent reads for 
> ACID tables (see IMPALA-8788).
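The three cases can be sketched as a comparison of write-id high-water marks — a deliberate simplification, since a real ValidWriteIdList also tracks open and aborted write ids:

```python
def writeid_action(client_hwm, cached_hwm):
    """Catalog's action given the client's vs. cached high-water mark
    (simplified sketch; names are hypothetical)."""
    if client_hwm > cached_hwm:
        return "RELOAD"         # case 1: client is ahead; reload the table
    if client_hwm == cached_hwm:
        return "USE_CACHED"     # case 2: same snapshot; serve from cache
    return "FILTER_CACHED"      # case 3: client is behind; filter cached metadata

actions = [writeid_action(10, 5), writeid_action(5, 5), writeid_action(3, 5)]
```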






[jira] [Assigned] (IMPALA-3101) AnalyticEvalNode should use codegened TupleRowComparator instead of PrevRowCompare

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-3101:
-

Assignee: (was: Michael Ho)

> AnalyticEvalNode should use codegened TupleRowComparator instead of 
> PrevRowCompare
> --
>
> Key: IMPALA-3101
> URL: https://issues.apache.org/jira/browse/IMPALA-3101
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.6.0
>Reporter: Mostafa Mokhtar
>Priority: Minor
>  Labels: codegen, performance
> Attachments: primitive_orderby_bigint_VtuneTopDown.csv
>
>
> AnalyticEvalNode uses PrevRowCompare to compare rows, which is very 
> inefficient compared to the codegen'd version of TupleRowComparator::Compare
> |Function Stack||CPU Time: Total||CPU Time: Self||Module||Function 
> (Full)||Source File||Start Address|
> |impala::AnalyticEvalNode::ProcessChildBatch|47.9%|0.810s|impalad|impala::AnalyticEvalNode::ProcessChildBatch(impala::RuntimeState*)|analytic-eval-node.cc|0xc0a870|
> |  
> impala::AnalyticEvalNode::TryAddResultTupleForPrevRow|35.0%|0.570s|impalad|impala::AnalyticEvalNode::TryAddResultTupleForPrevRow(bool,
>  long, impala::TupleRow*)|analytic-eval-node.cc|0xc0aa85|
> |
> impala::AnalyticEvalNode::PrevRowCompare|30.3%|0.040s|impalad|impala::AnalyticEvalNode::PrevRowCompare(impala::ExprContext*)|analytic-eval-node.cc|0xc0ae1d|
> |  
> impala::ExprContext::GetBooleanVal|30.2%|0.330s|impalad|impala::ExprContext::GetBooleanVal(impala::TupleRow*)|expr-context.cc|0x7f0790|
> |
> impala::AndPredicate::GetBooleanVal|29.8%|1.220s|impalad|impala::AndPredicate::GetBooleanVal(impala::ExprContext*,
>  impala::TupleRow*)|compound-predicates.cc|0x8575c0|
> |  
> impala::OrPredicate::GetBooleanVal|28.5%|2.840s|impalad|impala::OrPredicate::GetBooleanVal(impala::ExprContext*,
>  impala::TupleRow*)|compound-predicates.cc|0x857650|
> These queries can be used for repro 
> https://github.com/cloudera/Impala/blob/cdh5-trunk/testdata/workloads/targeted-perf/queries/primitive_orderby_all.test
> https://github.com/cloudera/Impala/blob/cdh5-trunk/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test






[jira] [Resolved] (IMPALA-2400) Unpredictable locality behavior for reading Parquet files

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2400.
---
Resolution: Cannot Reproduce

> Unpredictable locality behavior for reading Parquet files
> -
>
> Key: IMPALA-2400
> URL: https://issues.apache.org/jira/browse/IMPALA-2400
> Project: IMPALA
>  Issue Type: Bug
>  Components: Perf Investigation
>Affects Versions: Impala 2.3.0
>Reporter: Mostafa Mokhtar
>Priority: Minor
>  Labels: ramp-up
> Attachments: LocalRead.txt, RemoteRead.txt
>
>
> When running the query below I noticed exceptionally high variance even after 
> running "invalidate metadata". 
> select * from tpch_bin_flat_parquet_30.lineitem limit 10;
> * Fetched 10 row(s) in 1.08s
> WARNINGS: Read 139.48 MB of data across network that was expected to be 
> local. Block locality metadata for table 'tpch_bin_flat_parquet_30.lineitem' 
> may be stale. Consider running "INVALIDATE METADATA 
> `tpch_bin_flat_parquet_30`.`lineitem`".
> * Fetched 10 row(s) in 1.32s
> * Fetched 10 row(s) in 0.09s
> * Fetched 10 row(s) in 1.08s
> * "invalidate metadata"
> * Fetched 10 row(s) in 0.89s
> * Fetched 10 row(s) in 0.07s
> WARNINGS: Read 76.15 MB of data across network that was expected to be local. 
> Block locality metadata for table 'tpch_bin_flat_parquet_30.lineitem' may be 
> stale. Consider running "INVALIDATE METADATA 
> `tpch_bin_flat_parquet_30`.`lineitem`".
> * Fetched 10 row(s) in 1.11s
> * Fetched 10 row(s) in 0.73s
> * Fetched 10 row(s) in 0.09s
> The behavior above is tied to Parquet tables and doesn't repro against text 
> data.
> Profile files attached.






[jira] [Resolved] (IMPALA-2522) Improve the reliability and effectiveness of ETL

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2522.
---
Resolution: Fixed

Will mark as fixed for now, since the vast majority of subtasks are completed 
and there hasn't been movement for a while.

> Improve the reliability and effectiveness of ETL
> 
>
> Key: IMPALA-2522
> URL: https://issues.apache.org/jira/browse/IMPALA-2522
> Project: IMPALA
>  Issue Type: Epic
>  Components: Backend
>Affects Versions: Impala 2.2, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, 
> Impala 2.6.0, Impala 2.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Lars Volker
>Priority: Major
>  Labels: ETL, performance
>
> h4. Reduce the memory requirements of INSERTs into partitioned tables.
> Impala inserts into partitioned Parquet tables suffer from high memory 
> requirements because each Impala Daemon will keep ~256MB of buffer space per 
> open partition in the table sink. This often leads to large insert jobs 
> hitting "Memory limit exceeded" errors. The behavior can be improved by 
> pre-clustering the data such that only one partition needs to be buffered at 
> a time in the table sink.
> Add a new "clustered" plan hint for insert statements. Example:
> {code}
> CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT);
> INSERT INTO dst PARTITION(year,month) /*+ clustered */ SELECT * FROM src;
> {code}
> The hint specifies that the data fed into the table sink should be clustered 
> based on the partition columns. For now, we'll use a sort to achieve 
> clustering, and the plan should look like this:
> SCAN -> SORT (year,month) -> TABLE SINK
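The memory benefit of clustering can be modeled by counting how many ~256MB partition buffers a sink must hold open at once (a simplified model for illustration, not Impala's actual sink):

```python
def max_open_buffers(partition_stream):
    """Model a sink that opens a buffer at a partition's first row and
    can close it only after the partition's last row has arrived."""
    first, last = {}, {}
    for i, p in enumerate(partition_stream):
        first.setdefault(p, i)
        last[p] = i
    # For each stream position, count partitions whose span covers it.
    return max(sum(1 for p in first if first[p] <= i <= last[p])
               for i in range(len(partition_stream)))

unclustered = ["a", "b", "c", "a", "b", "c"]
clustered = sorted(unclustered)          # effect of the added SORT node
worst = max_open_buffers(unclustered)    # interleaved: all 3 open at once
best = max_open_buffers(clustered)       # clustered: 1 buffer at a time
```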
> h4. Give users additional control over the insertion order.
> In order to improve compression and/or the effectiveness of min/max pruning, 
> it is desirable to control the order in which rows are inserted into table 
> (mostly for Parquet).
> Introduce a "sortby" plan hint for insert statements: Example
> {code}
> CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT);
> INSERT INTO dst PARTITION(year,month) /*+ clustered sortby(day,hour) */ 
> SELECT * FROM src
> {code}
> This would produce the following plan:
> SCAN -> SORT(year,month,day,hour) -> TABLE SINK
> h4. Improve the sort efficiency
> The additional sorting step introduced by both solutions above should be as 
> efficient as possible.
> Codegen TupleRowComparator and Tuple::MaterializeExprs.
> h4. Summary
> With more predictable and resource-efficient ETL, users will extract more 
> value from Impala and will need to rely less on slow legacy ETL tools like 
> Hive.






[jira] [Work started] (IMPALA-7020) Order by expressions in Analytical functions are not materialized causing slowdown

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-7020 started by Tim Armstrong.
-
> Order by expressions in Analytical functions are not materialized causing 
> slowdown
> --
>
> Key: IMPALA-7020
> URL: https://issues.apache.org/jira/browse/IMPALA-7020
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Mostafa Mokhtar
>Assignee: Tim Armstrong
>Priority: Major
>  Labels: performance
> Attachments: Slow case profile.txt, Workaround profile.txt
>
>
> Order by expressions in Analytical functions are not materialized and cause 
> queries to run much slower.
> The rewrite for the query below is 20x faster, profiles attached.
> Repro 
> {code}
> select *
> FROM
>   (
> SELECT
>   o.*,
>   ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn
> FROM
>   (
> SELECT
>   l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as 
> string) evt_ts
> FROM
>   lineitem
> WHERE
>   l_shipdate  BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00'
>   ) o
>   ) r
> WHERE
>   rn BETWEEN 1 AND 101
> ORDER BY rn;
> {code}
> Workaround 
> {code}
> select *
> FROM
>   (
> SELECT
>   o.*,
>   ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn
> FROM
>   (
> SELECT
>   l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as 
> string) evt_ts
> FROM
>   lineitem
> WHERE
>   l_shipdate  BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00'
>   union all 
>   SELECT
>   l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as 
> string) evt_ts
> FROM
>   lineitem limit 0
> 
>   ) o
>   ) r
> WHERE
>   rn BETWEEN 1 AND 101
> ORDER BY rn;
> {code}






[jira] [Commented] (IMPALA-9740) TSAN data race in hdfs-bulk-ops

2020-06-12 Thread Sahil Takiar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134569#comment-17134569
 ] 

Sahil Takiar commented on IMPALA-9740:
--

 
custom_cluster.test_insert_behaviour.TestInsertBehaviourCustomCluster.test_insert_inherit_permission
 reproduces this.
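The usual fix for this kind of shared-cache race is to serialize access with a lock; a minimal sketch of the idea (Python standing in for the C++ code, all names hypothetical — not Impala's actual fix):

```python
import threading

class FsConnectionCache:
    """Hypothetical stand-in for a filesystem-prefix -> connection cache."""
    def __init__(self):
        self._lock = threading.Lock()
        self._conns = {}

    def get_connection(self, prefix):
        # All reads and inserts happen under the lock, so concurrent
        # callers cannot race on the underlying map.
        with self._lock:
            if prefix not in self._conns:
                self._conns[prefix] = object()  # stand-in for hdfsConnect()
            return self._conns[prefix]

cache = FsConnectionCache()
results = []
threads = [threading.Thread(
               target=lambda: results.append(cache.get_connection("hdfs://nn")))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```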

> TSAN data race in hdfs-bulk-ops
> ---
>
> Key: IMPALA-9740
> URL: https://issues.apache.org/jira/browse/IMPALA-9740
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Sahil Takiar
>Priority: Major
>
> hdfs-bulk-ops usage of a local connection cache (HdfsFsCache::HdfsFsMap) has 
> a data race:
> {code:java}
>  WARNING: ThreadSanitizer: data race (pid=23205)
>   Write of size 8 at 0x7b24005642d8 by thread T47:
> #0 
> boost::unordered::detail::table_impl  const, hdfs_internal*> >, std::string, hdfs_internal*, 
> boost::hash, std::equal_to > 
> >::add_node(boost::unordered::detail::node_constructor  const, hdfs_internal*> > > >&, unsigned long) 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/unordered/detail/unique.hpp:329:26
>  (impalad+0x1f93832)
> #1 
> std::pair  const, hdfs_internal*> > >, bool> 
> boost::unordered::detail::table_impl  const, hdfs_internal*> >, std::string, hdfs_internal*, 
> boost::hash, std::equal_to > 
> >::emplace_impl >(std::string 
> const&, std::pair&&) 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/unordered/detail/unique.hpp:420:41
>  (impalad+0x1f933ed)
> #2 
> std::pair  const, hdfs_internal*> > >, bool> 
> boost::unordered::detail::table_impl  const, hdfs_internal*> >, std::string, hdfs_internal*, 
> boost::hash, std::equal_to > 
> >::emplace 
> >(std::pair&&) 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/unordered/detail/unique.hpp:384:20
>  (impalad+0x1f932d1)
> #3 
> std::pair  const, hdfs_internal*> > >, bool> 
> boost::unordered::unordered_map boost::hash, std::equal_to, 
> std::allocator > 
> >::emplace 
> >(std::pair&&) 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/unordered/unordered_map.hpp:241:27
>  (impalad+0x1f93238)
> #4 boost::unordered::unordered_map boost::hash, std::equal_to, 
> std::allocator > 
> >::insert(std::pair&&) 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/unordered/unordered_map.hpp:390:26
>  (impalad+0x1f92038)
> #5 impala::HdfsFsCache::GetConnection(std::string const&, 
> hdfs_internal**, boost::unordered::unordered_map boost::hash, std::equal_to, 
> std::allocator > >*) 
> /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/hdfs-fs-cache.cc:115:18
>  (impalad+0x1f916b3)
> #6 impala::HdfsOp::Execute() const 
> /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/hdfs-bulk-ops.cc:84:55
>  (impalad+0x23444d5)
> #7 HdfsThreadPoolHelper(int, impala::HdfsOp const&) 
> /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/hdfs-bulk-ops.cc:137:6
>  (impalad+0x2344ea9)
> #8 boost::detail::function::void_function_invoker2 impala::HdfsOp const&), void, int, impala::HdfsOp 
> const&>::invoke(boost::detail::function::function_buffer&, int, 
> impala::HdfsOp const&) 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:118:11
>  (impalad+0x2345e80)
> #9 boost::function2::operator()(int, 
> impala::HdfsOp const&) const 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:770:14
>  (impalad+0x1f883be)
> #10 impala::ThreadPool::WorkerThread(int) 
> /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/thread-pool.h:166:9
>  (impalad+0x1f874e5)
> #11 boost::_mfi::mf1, 
> int>::operator()(impala::ThreadPool*, int) const 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/mem_fn_template.hpp:165:29
>  (impalad+0x1f87b7d)
> #12 void 
> boost::_bi::list2*>, 
> boost::_bi::value >::operator() impala::ThreadPool, int>, 
> boost::_bi::list0>(boost::_bi::type, boost::_mfi::mf1 impala::ThreadPool, int>&, boost::_bi::list0&, int) 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:319:9
>  (impalad+0x1f87abc)
> #13 boost::_bi::bind_t impala::ThreadPool, int>, 
> boost::_bi::list2*>, 
> boost::_bi::value > >::operator()() 
> /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16
>  (impalad+0x1f87a23)
> 

[jira] [Commented] (IMPALA-7020) Order by expressions in Analytical functions are not materialized causing slowdown

2020-06-12 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134542#comment-17134542
 ] 

Tim Armstrong commented on IMPALA-7020:
---

One proposal:

* Casts between integral and floating point types should have 
ARITHMETIC_OP_COST, because they are simple arithmetic conversions (casts 
involving decimal are often non-trivial).
* Casts between STRING and VARCHAR should have ARITHMETIC_OP_COST, because they 
are only modifying the length field, at worst.
* All other casts should have FUNCTION_CALL_COST, because they require some 
non-trivial conversion.

> Order by expressions in Analytical functions are not materialized causing 
> slowdown
> --
>
> Key: IMPALA-7020
> URL: https://issues.apache.org/jira/browse/IMPALA-7020
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Mostafa Mokhtar
>Assignee: Tim Armstrong
>Priority: Major
>  Labels: performance
> Attachments: Slow case profile.txt, Workaround profile.txt
>
>
> Order by expressions in Analytical functions are not materialized and cause 
> queries to run much slower.
> The rewrite for the query below is 20x faster, profiles attached.
> Repro 
> {code}
> select *
> FROM
>   (
> SELECT
>   o.*,
>   ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn
> FROM
>   (
> SELECT
>   l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as 
> string) evt_ts
> FROM
>   lineitem
> WHERE
>   l_shipdate  BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00'
>   ) o
>   ) r
> WHERE
>   rn BETWEEN 1 AND 101
> ORDER BY rn;
> {code}
> Workaround 
> {code}
> select *
> FROM
>   (
> SELECT
>   o.*,
>   ROW_NUMBER() OVER(ORDER BY evt_ts DESC) AS rn
> FROM
>   (
> SELECT
>   l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as 
> string) evt_ts
> FROM
>   lineitem
> WHERE
>   l_shipdate  BETWEEN '1992-01-01 00:00:00' AND '1992-01-15 00:00:00'
>   union all 
>   SELECT
>   l_orderkey,l_partkey,l_linenumber,l_quantity, cast (l_shipdate as 
> string) evt_ts
> FROM
>   lineitem limit 0
> 
>   ) o
>   ) r
> WHERE
>   rn BETWEEN 1 AND 101
> ORDER BY rn;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Assigned] (IMPALA-9213) Client logs should indicate if a query has been retried

2020-06-12 Thread Sahil Takiar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar reassigned IMPALA-9213:


Assignee: Quanlong Huang

> Client logs should indicate if a query has been retried
> ---
>
> Key: IMPALA-9213
> URL: https://issues.apache.org/jira/browse/IMPALA-9213
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Assignee: Quanlong Huang
>Priority: Major
>
> The client logs should give some indication that a query has been retried and 
> should print out information such as the new query id and the link to the 
> retried query on the debug web UI.






[jira] [Commented] (IMPALA-7020) Order by expressions in Analytical functions are not materialized causing slowdown

2020-06-12 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134523#comment-17134523
 ] 

Tim Armstrong commented on IMPALA-7020:
---

This is sufficient to force it to be materialised:
{noformat}
tarmstrong@tarmstrong-box2:~/Impala/impala$ git diff
diff --git a/fe/src/main/java/org/apache/impala/analysis/Expr.java 
b/fe/src/main/java/org/apache/impala/analysis/Expr.java
index 6ef5715a2..c636b4971 100644
--- a/fe/src/main/java/org/apache/impala/analysis/Expr.java
+++ b/fe/src/main/java/org/apache/impala/analysis/Expr.java
@@ -83,7 +83,7 @@ abstract public class Expr extends TreeNode implements 
ParseNode, Cloneabl
   public static final float ARITHMETIC_OP_COST = 1;
   public static final float BINARY_PREDICATE_COST = 1;
   public static final float VAR_LEN_BINARY_PREDICATE_COST = 5;
-  public static final float CAST_COST = 1;
+  public static final float CAST_COST = 20;
   public static final float COMPOUND_PREDICATE_COST = 1;
   public static final float FUNCTION_CALL_COST = 10;
   public static final float IS_NOT_EMPTY_COST = 1;
{noformat}

> Order by expressions in Analytical functions are not materialized causing 
> slowdown
> --
>
> Key: IMPALA-7020
> URL: https://issues.apache.org/jira/browse/IMPALA-7020
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Mostafa Mokhtar
>Assignee: Tim Armstrong
>Priority: Major
>  Labels: performance
> Attachments: Slow case profile.txt, Workaround profile.txt
>
>





[jira] [Commented] (IMPALA-7020) Order by expressions in Analytical functions are not materialized causing slowdown

2020-06-12 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134508#comment-17134508
 ] 

Tim Armstrong commented on IMPALA-7020:
---

I think we're costing the cast expression incorrectly in this case: the cost of 
the cast expression falls below the threshold required to materialise it.

> Order by expressions in Analytical functions are not materialized causing 
> slowdown
> --
>
> Key: IMPALA-7020
> URL: https://issues.apache.org/jira/browse/IMPALA-7020
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Mostafa Mokhtar
>Assignee: Tim Armstrong
>Priority: Major
>  Labels: performance
> Attachments: Slow case profile.txt, Workaround profile.txt
>
>






[jira] [Assigned] (IMPALA-7020) Order by expressions in Analytical functions are not materialized causing slowdown

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-7020:
-

Assignee: Tim Armstrong

> Order by expressions in Analytical functions are not materialized causing 
> slowdown
> --
>
> Key: IMPALA-7020
> URL: https://issues.apache.org/jira/browse/IMPALA-7020
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Mostafa Mokhtar
>Assignee: Tim Armstrong
>Priority: Major
>  Labels: performance
> Attachments: Slow case profile.txt, Workaround profile.txt
>
>






[jira] [Commented] (IMPALA-9824) MetastoreClientPool should be singleton

2020-06-12 Thread Vihang Karajgaonkar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134469#comment-17134469
 ] 

Vihang Karajgaonkar commented on IMPALA-9824:
-

I spent some time on this and almost made it work. We could still make it work, 
but it made me realize that we are trying to satisfy two conflicting 
requirements. Our FE unit tests spin up their own CatalogServiceCatalog 
instances (see CatalogServiceTestCatalog, for example). Testing can become 
flaky if we make MetastoreClientPool a singleton, since all the FE tests run 
within a single process and would therefore share the MetastoreClientPool. We 
currently rely on the Catalog#close() call in the tests to shut down the pool. 
This works for most of the tests except the ones that rely on 
{{createTransientTestCatalog}}, which uses an embedded HMS service. Currently 
the MetastoreClientPool has a one-to-one mapping with the Catalog instances. 
The MetastoreClientPool in {{DirectMetaProvider}} should ideally never get 
instantiated after we fix IMPALA-9375; we should only ever have either 
CatalogMetaProvider or DirectMetaProvider running, but not both.

I am now inclined to abandon this patch and close this JIRA as "won't fix". 
[~stakiar] [~stigahuang] any thoughts?
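The per-catalog ownership described above can be sketched as follows. This is an illustrative Python sketch with hypothetical class names, not Impala's Java code: each Catalog owns its own MetastoreClientPool, and Catalog close() shuts that pool down, so closing one test's catalog cannot tear down a pool another test catalog in the same process is still using — which is exactly what a process-wide singleton would break.

```python
# Illustrative sketch (hypothetical names, not Impala's Java code) of the
# current one-to-one Catalog <-> MetastoreClientPool ownership.
class MetastoreClientPool:
    def __init__(self):
        self.closed = False

    def close(self):
        # Shuts down this pool's HMS client connections.
        self.closed = True

class Catalog:
    def __init__(self):
        # One pool per catalog instance (the one-to-one mapping).
        self.pool = MetastoreClientPool()

    def close(self):
        # Tests rely on this call to shut down the pool.
        self.pool.close()
```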

> MetastoreClientPool should be singleton
> ---
>
> Key: IMPALA-9824
> URL: https://issues.apache.org/jira/browse/IMPALA-9824
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>Priority: Minor
>
> Currently,  the MetastoreClientPool is instantiated at multiple places in the 
> code and it would be good to refactor the code to make it a singleton. Each 
> MetastoreClientPool creates multiple clients to HMS and unnecessary creation 
> of multiple pools could cause problems on HMS side. 






[jira] [Closed] (IMPALA-8720) Impala frontend jar should not depend on Sentry jars when building against hive-3 profile

2020-06-12 Thread Vihang Karajgaonkar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vihang Karajgaonkar closed IMPALA-8720.
---
Fix Version/s: Not Applicable
   Resolution: Not A Problem

> Impala frontend jar should not depend on Sentry jars when building against 
> hive-3 profile
> -
>
> Key: IMPALA-8720
> URL: https://issues.apache.org/jira/browse/IMPALA-8720
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>Priority: Major
> Fix For: Not Applicable
>
>
> It looks like for {{hive-3}} based setups, the frontend jar still depends on 
> sentry jars. However, sentry does not work with HMS-3 as of today. This 
> unnecessarily pulls in sentry jars from maven repositories when building 
> against CDP. We should pull in sentry jars only when they are needed.






[jira] [Resolved] (IMPALA-9843) Add ability to run schematool against HMS in minicluster

2020-06-12 Thread Vihang Karajgaonkar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vihang Karajgaonkar resolved IMPALA-9843.
-
Fix Version/s: Impala 4.0
   Resolution: Fixed

The patch was submitted on gerrit today. Users who wish to upgrade the HMS db 
schema of the minicluster can use one of the following commands to do so:

1. bin/create-test-configuration.sh -upgrade_hms_db
2. ./buildall.sh -upgrade_hms_db (if you also want to build the source along 
with upgrading the HMS schema)

> Add ability to run schematool against HMS in minicluster
> 
>
> Key: IMPALA-9843
> URL: https://issues.apache.org/jira/browse/IMPALA-9843
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Sahil Takiar
>Assignee: Vihang Karajgaonkar
>Priority: Major
> Fix For: Impala 4.0
>
>
> When the CDP version is bumped, we often need to re-format the HMS postgres 
> database because the HMS schema needs updating. Hive provides a standalone 
> tool for performing schema updates: 
> [https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool]
> Impala should be able to integrate with this tool, so that developers don't 
> have to blow away their HMS database every time the CDP version is bumped up. 
> Even worse, blowing away the HMS data requires performing a full data load.
> It would be great to have a wrapper around the schematool that can easily be 
> invoked by developers.






[jira] [Created] (IMPALA-9855) TSAN lock-order-inversion warning in QueryDriver::RetryQueryFromThread

2020-06-12 Thread Sahil Takiar (Jira)
Sahil Takiar created IMPALA-9855:


 Summary: TSAN lock-order-inversion warning in 
QueryDriver::RetryQueryFromThread
 Key: IMPALA-9855
 URL: https://issues.apache.org/jira/browse/IMPALA-9855
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Sahil Takiar
Assignee: Sahil Takiar


TSAN reports the following error in {{test_query_retries.py}}.
{code:java}
WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=3786)
  Cycle in lock order graph: M17348 (0x7b140035d2d8) => M804309746609755832 
(0x) => M17348  Mutex M804309746609755832 acquired here while 
holding mutex M17348 in thread T370:
#0 AnnotateRWLockAcquired 
/mnt/source/llvm/llvm-5.0.1.src-p2/projects/compiler-rt/lib/tsan/rtl/tsan_interface_ann.cc:271
 (impalad+0x19bafcc)
#1 base::SpinLock::Lock() 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/gutil/spinlock.h:77:5
 (impalad+0x1a11585)
#2 impala::SpinLock::lock() 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/spinlock.h:34:8
 (impalad+0x1a11519)
#3 impala::ScopedShardedMapRef 
>::ScopedShardedMapRef(impala::TUniqueId const&, 
impala::ShardedQueryMap >*) 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/sharded-query-map-util.h:98:23
 (impalad+0x2220661)
#4 impala::ImpalaServer::GetQueryDriver(impala::TUniqueId const&, bool) 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/service/impala-server.cc:1296:53
 (impalad+0x22124ba)
#5 impala::QueryDriver::RetryQueryFromThread(impala::Status const&, 
std::shared_ptr) 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-driver.cc:279:25
 (impalad+0x29dd92c)
#6 boost::_mfi::mf2 >::operator()(impala::QueryDriver*, 
impala::Status const&, std::shared_ptr) const 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/mem_fn_template.hpp:280:29
 (impalad+0x29e1669)
#7 void boost::_bi::list3, 
boost::_bi::value, 
boost::_bi::value > 
>::operator() >, 
boost::_bi::list0>(boost::_bi::type, boost::_mfi::mf2 >&, boost::_bi::list0&, int) 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:398:9
 (impalad+0x29e1578)
#8 boost::_bi::bind_t >, 
boost::_bi::list3, 
boost::_bi::value, 
boost::_bi::value > > >::operator()() 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16
 (impalad+0x29e14c3)
#9 
boost::detail::function::void_function_obj_invoker0 >, 
boost::_bi::list3, 
boost::_bi::value, 
boost::_bi::value > > >, 
void>::invoke(boost::detail::function::function_buffer&) 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:159:11
 (impalad+0x29e1221)
#10 boost::function0::operator()() const 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:770:14
 (impalad+0x1e5ba81)
#11 impala::Thread::SuperviseThread(std::string const&, std::string const&, 
boost::function, impala::ThreadDebugInfo const*, impala::Promise*) 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/thread.cc:360:3
 (impalad+0x2453776)
#12 void boost::_bi::list5, 
boost::_bi::value, boost::_bi::value >, 
boost::_bi::value, 
boost::_bi::value*> 
>::operator(), impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0>(boost::_bi::type, void 
(*&)(std::string const&, std::string const&, boost::function, 
impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0&, int) 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:531:9
 (impalad+0x245b93c)
#13 boost::_bi::bind_t, impala::ThreadDebugInfo const*, 
impala::Promise*), 
boost::_bi::list5, 
boost::_bi::value, boost::_bi::value >, 
boost::_bi::value, 
boost::_bi::value*> > 
>::operator()() 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16
 (impalad+0x245b853)
#14 boost::detail::thread_data, 
impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list5, 
boost::_bi::value, boost::_bi::value >, 
boost::_bi::value, 
boost::_bi::value*> > > >::run() 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/thread/detail/thread.hpp:116:17
 (impalad+0x245b540)
#15 thread_proxy  (impalad+0x3171659)

Hint: use TSAN_OPTIONS=second_deadlock_stack=1 to get more informative warning 
message


  Mutex M17348 acquired here while holding mutex M804309746609755832 in thread 
T392:
#0 AnnotateRWLockAcquired 
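The warning above is the classic two-mutex inversion: one thread acquires M17348 and then the sharded-map lock, while another acquires them in the opposite order. As an illustrative sketch of the concept (Python, unrelated to TSAN's actual algorithm), the inversion can be found by building a "held-before" graph from per-thread acquisition orders and checking for a two-lock cycle:

```python
# Illustrative sketch of lock-order-inversion detection: add an edge
# A -> B whenever some thread acquires lock B while already holding
# lock A, then look for a two-lock cycle (the pattern TSAN reports).
# TSAN's real implementation differs; this only demonstrates the idea.
def has_inversion(acquisition_orders):
    """acquisition_orders: list of per-thread lock sequences, e.g.
    [["M1", "M2"], ["M2", "M1"]]. Returns True if the held-before
    graph contains a two-lock cycle (a potential deadlock)."""
    edges = set()
    for order in acquisition_orders:
        for i, held in enumerate(order):
            for acquired in order[i + 1:]:
                edges.add((held, acquired))
    return any((b, a) in edges for (a, b) in edges)
```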

[jira] [Created] (IMPALA-9854) TSAN data race in QueryDriver::CreateRetriedClientRequestState

2020-06-12 Thread Sahil Takiar (Jira)
Sahil Takiar created IMPALA-9854:


 Summary: TSAN data race in 
QueryDriver::CreateRetriedClientRequestState
 Key: IMPALA-9854
 URL: https://issues.apache.org/jira/browse/IMPALA-9854
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Sahil Takiar
Assignee: Sahil Takiar


Seeing the following data race in {{test_query_retries.py}}
{code:java}
WARNING: ThreadSanitizer: data race (pid=5460)
  Write of size 8 at 0x7b8c00261510 by thread T38:
#0 impala::TUniqueId::operator=(impala::TUniqueId&&) 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/generated-sources/gen-cpp/Types_types.cpp:967:6
 (impalad+0x1de1968)
#1 impala::ImpalaServer::PrepareQueryContext(impala::TNetworkAddress 
const&, impala::TNetworkAddress const&, impala::TQueryCtx*) 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/service/impala-server.cc:1069:23
 (impalad+0x2210dbf)
#2 impala::ImpalaServer::PrepareQueryContext(impala::TQueryCtx*) 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/service/impala-server.cc:1024:3
 (impalad+0x220f3c1)
#3 
impala::QueryDriver::CreateRetriedClientRequestState(impala::ClientRequestState*,
 std::unique_ptr >*, 
std::shared_ptr*) 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-driver.cc:302:19
 (impalad+0x29de3ec)
#4 impala::QueryDriver::RetryQueryFromThread(impala::Status const&, 
std::shared_ptr) 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-driver.cc:203:3
 (impalad+0x29dd01f)
#5 boost::_mfi::mf2 >::operator()(impala::QueryDriver*, 
impala::Status const&, std::shared_ptr) const 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/mem_fn_template.hpp:280:29
 (impalad+0x29e1669)
#6 void boost::_bi::list3, 
boost::_bi::value, 
boost::_bi::value > 
>::operator() >, 
boost::_bi::list0>(boost::_bi::type, boost::_mfi::mf2 >&, boost::_bi::list0&, int) 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:398:9
 (impalad+0x29e1578)
#7 boost::_bi::bind_t >, 
boost::_bi::list3, 
boost::_bi::value, 
boost::_bi::value > > >::operator()() 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16
 (impalad+0x29e14c3)
#8 
boost::detail::function::void_function_obj_invoker0 >, 
boost::_bi::list3, 
boost::_bi::value, 
boost::_bi::value > > >, 
void>::invoke(boost::detail::function::function_buffer&) 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:159:11
 (impalad+0x29e1221)
#9 boost::function0::operator()() const 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:770:14
 (impalad+0x1e5ba81)
#10 impala::Thread::SuperviseThread(std::string const&, std::string const&, 
boost::function, impala::ThreadDebugInfo const*, impala::Promise*) 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/thread.cc:360:3
 (impalad+0x2453776)
#11 void boost::_bi::list5, 
boost::_bi::value, boost::_bi::value >, 
boost::_bi::value, 
boost::_bi::value*> 
>::operator(), impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0>(boost::_bi::type, void 
(*&)(std::string const&, std::string const&, boost::function, 
impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0&, int) 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:531:9
 (impalad+0x245b93c)
#12 boost::_bi::bind_t, impala::ThreadDebugInfo const*, 
impala::Promise*), 
boost::_bi::list5, 
boost::_bi::value, boost::_bi::value >, 
boost::_bi::value, 
boost::_bi::value*> > 
>::operator()() 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16
 (impalad+0x245b853)
#13 boost::detail::thread_data, 
impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list5, 
boost::_bi::value, boost::_bi::value >, 
boost::_bi::value, 
boost::_bi::value*> > > >::run() 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.61.0-p2/include/boost/thread/detail/thread.hpp:116:17
 (impalad+0x245b540)
#14 thread_proxy  (impalad+0x3171659)

  Previous read of size 8 at 0x7b8c00261510 by thread T100:
#0 impala::PrintId(impala::TUniqueId const&, std::string const&) 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/debug-util.cc:108:48
 (impalad+0x237557f)
#1 impala::Coordinator::ReleaseQueryAdmissionControlResources() 

[jira] [Commented] (IMPALA-9739) TSAN data races during impalad shutdown

2020-06-12 Thread Sahil Takiar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134456#comment-17134456
 ] 

Sahil Takiar commented on IMPALA-9739:
--

[~bikramjeet.vig] the {{TestGracefulShutdown}} tests in 
{{test_restart_services.py}} can reproduce this. Just build Impala locally with 
the '-tsan' flag and run the test. You should see a TSAN error in the logs 
under {{/tmp/impalad.*ERROR}}. When I ran it locally the error was in the 
{{/tmp/impalad_node1.ERROR}} file. Here is the output; I just ran this on 
master:

{code}
WARNING: ThreadSanitizer: data race (pid=19807)
  Read of size 8 at 0x078569a0 by thread T337:
   #0 std::unique_ptr 
>::~unique_ptr() 
/home/stakiar/Impala/toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/bits/unique_ptr.h:235:6
 (impalad+0x1a10495)
   #1 at_exit_wrapper(void*) 
/mnt/source/llvm/llvm-5.0.1.src-p2/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:361
 (impalad+0x196fb33)
   #2 impala::ImpalaServer::StartShutdown(long, 
impala::ShutdownStatusPB*)::$_2::operator()() const 
/home/stakiar/Impala/be/src/service/impala-server.cc:2774:57 (impalad+0x2236ba1)
   #3 
boost::detail::function::void_function_obj_invoker0::invoke(boost::detail::function::function_buffer&) 
/home/stakiar/Impala/toolchain/boost-1.61.0-p2/include/boost/function/function_te
 mplate.hpp:159:11 (impalad+0x2236a09)
   #4 boost::function0::operator()() const 
/home/stakiar/Impala/toolchain/boost-1.61.0-p2/include/boost/function/function_template.hpp:770:14
 (impalad+0x1e5ee31)
   #5 impala::Thread::SuperviseThread(std::string const&, std::string const&, 
boost::function, impala::ThreadDebugInfo const*, impala::Promise*) /home/stakiar/Impala/be/src/util/thread.cc:360:3 
(impalad+0x246bfd6)
   #6 void boost::_bi::list5, 
boost::_bi::value, boost::_bi::value >, 
boost::_bi::value, 
boost::_bi::value*> 
>::operator(), impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0>(boost::_bi::type, void 
(*&)(std::string const&, std::string const&, boost::function, 
impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list0&, int) 
/home/stakiar/Impala/toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:531:9
 (impalad+0x247419c)
   #7 boost::_bi::bind_t, impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list5, 
boost::_bi::value, bo ost::_bi::value >, 
boost::_bi::value, 
boost::_bi::value*> > 
>::operator()() 
/home/stakiar/Impala/toolchain/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222:16
 (impalad+0x24740b3)
   #8 boost::detail::thread_data, impala::ThreadDebugInfo 
const*, impala::Promise*), 
boost::_bi::list5, boost:: 
_bi::value, boost::_bi::value >, 
boost::_bi::value, 
boost::_bi::value*> > > >::run() 
/home/stakiar/Impala/toolchain/boost-1.61.0-p2/include/boost/thread/detail/thread.hpp:116:17
 (impalad+0x2473da0)
   #9 thread_proxy  (impalad+0x3177c59)

  Previous write of size 8 at 0x078569a0 by main thread:
   #0 void std::swap(impala::Thread*&, impala::Thread*&) 
/home/stakiar/Impala/toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/bits/move.h:176:11
 (impalad+0x22a4f20)
   #1 std::unique_ptr 
>::reset(impala::Thread*) 
/home/stakiar/Impala/toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/bits/unique_ptr.h:342:2
 (impalad+0x229fa9b)
   #2 std::unique_ptr 
>::operator=(std::unique_ptr >&&) 
/home/stakiar/Impala/toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/b
 its/unique_ptr.h:251:2 (impalad+0x246e918)
   #3 impala::Thread::StartThread(std::string const&, std::string const&, 
boost::function const&, std::unique_ptr >*, bool) 
/home/stakiar/Impala/be/src/util/thread.cc:329:11 (impalad+0x246baec)
   #4 impala::Status impala::Thread::Create(std::string const&, 
std::string const&, void (* const&)(), std::unique_ptr >*, bool) 
/home/stakiar/Impala/be/src/util/thread.h:74:12 (impalad+0x1a6eb2c)
   #5 impala::StartImpalaShutdownSignalHandlerThread() 
/home/stakiar/Impala/be/src/common/init.cc:401:10 (impalad+0x1a6df98)
   #6 ImpaladMain(int, char**) 
/home/stakiar/Impala/be/src/service/impalad-main.cc:96:43 (impalad+0x221d3ca)
   #7 main /home/stakiar/Impala/be/src/service/daemon-main.cc:37:12 
(impalad+0x1a0b27a)

  As if synchronized via sleep:
   #0 nanosleep 
/mnt/source/llvm/llvm-5.0.1.src-p2/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:343
 (impalad+0x19a500a)
   #1 void std::this_thread::sleep_for 
>(std::chrono::duration > const&) 
/home/stakiar/Impala/toolchain/gcc-4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/../../../../include/c++/4.9.2/thread:279:2
 (impalad+0x2475f42)
   #2 impala::SleepForMs(long) /home/stakiar/Impala/be/src/util/time.cc:31:3 
(impalad+0x247537d)
   #3 impala::ImpalaServer::ShutdownThread() 
/home/stakiar/Impala/be/src/service/impala-server.cc:2796:5 (impalad+0x2235a19)
   

[jira] [Updated] (IMPALA-9853) Push rank() predicates into sort

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-9853:
--
Labels: performance tpcds  (was: )

> Push rank() predicates into sort
> 
>
> Key: IMPALA-9853
> URL: https://issues.apache.org/jira/browse/IMPALA-9853
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
>  Labels: performance, tpcds
>
> TPC-DS Q67 would benefit significantly if we could push the rank() predicate 
> into the sort to reduce the amount of unneeded data. The sorter could 
> evaluate this predicate if it had the partition expressions available: as a 
> post-processing step to the in-memory sort for the analytic sort group, it 
> could do a pass over the sorted run, resetting a counter at each partition 
> boundary.
> It might be best to start with tackling IMPALA-3471 by applying the limit 
> within sorted runs, since that doesn't require any planner work.
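The counter-reset pass described above can be sketched as follows. This is an illustrative Python sketch, not the backend C++ sorter: rows are assumed to arrive sorted by (partition key, order key), and the pass keeps at most the first k rows of each partition. Note it ignores rank() ties; a real implementation would keep rows tied with the k-th order-key value at the boundary.

```python
# Illustrative sketch (not Impala's C++ sorter) of pushing a
# "rank() <= k" predicate into a sorted run. Rows are assumed sorted by
# (partition key, order key); the counter resets at partition
# boundaries. Ties at the k-th row are ignored for simplicity.
def truncate_run(rows, k, part_key, order_key):
    """rows: list of dicts sorted by (part_key, order_key).
    Keeps at most the first k rows of each partition; later rows in the
    partition can only have rank() > k, so they cannot affect the
    result."""
    out, current, count = [], object(), 0
    for row in rows:
        if row[part_key] != current:
            # Partition boundary: reset the rank counter.
            current, count = row[part_key], 0
        if count < k:
            out.append(row)
            count += 1
    return out
```

The same pass with the partition expressions removed degenerates to applying a plain limit within each sorted run, which is the IMPALA-3471 starting point mentioned above.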
> {noformat}
> with results as
> ( select i_category ,i_class ,i_brand ,i_product_name ,d_year ,d_qoy 
> ,d_moy ,s_store_id
>   ,sum(coalesce(ss_sales_price*ss_quantity,0)) sumsales
> from store_sales ,date_dim ,store ,item
>where  ss_sold_date_sk=d_date_sk
>   and ss_item_sk=i_item_sk
>   and ss_store_sk = s_store_sk
>   and d_month_seq between 1212 and 1212 + 11
>group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, 
> d_moy,s_store_id)
>  ,
>  results_rollup as
>  (select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, 
> s_store_id, sumsales
>   from results
>   union all
>   select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, 
> null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy
>   union all
>   select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, null 
> d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy
>   union all
>   select i_category, i_class, i_brand, i_product_name, d_year, null d_qoy, 
> null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class, i_brand, i_product_name, d_year
>   union all
>   select i_category, i_class, i_brand, i_product_name, null d_year, null 
> d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class, i_brand, i_product_name
>   union all
>   select i_category, i_class, i_brand, null i_product_name, null d_year, null 
> d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class, i_brand
>   union all
>   select i_category, i_class, null i_brand, null i_product_name, null d_year, 
> null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category, i_class
>   union all
>   select i_category, null i_class, null i_brand, null i_product_name, null 
> d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results
>   group by i_category
>   union all
>   select null i_category, null i_class, null i_brand, null i_product_name, 
> null d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
>   from results)
>  select  *
> from (select i_category
> ,i_class
> ,i_brand
> ,i_product_name
> ,d_year
> ,d_qoy
> ,d_moy
> ,s_store_id
> ,sumsales
> ,rank() over (partition by i_category order by sumsales desc) rk
>   from results_rollup) dw2
> where rk <= 100
> order by i_category
> ,i_class
> ,i_brand
> ,i_product_name
> ,d_year
> ,d_qoy
> ,d_moy
> ,s_store_id
> ,sumsales
> ,rk
> limit 100
> {noformat}
> Assigning to myself to fill in more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-9853) Push rank() predicates into sort

2020-06-12 Thread Tim Armstrong (Jira)
Tim Armstrong created IMPALA-9853:
-

 Summary: Push rank() predicates into sort
 Key: IMPALA-9853
 URL: https://issues.apache.org/jira/browse/IMPALA-9853
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Reporter: Tim Armstrong
Assignee: Tim Armstrong


TPC-DS Q67 would benefit significantly if we could push the rank() predicate 
into the sort to do some reduction of unneeded data. The sorter could evaluate 
this predicate if it had the partition expressions available - as a 
post-processing step to the in-memory sort for the analytic sort group, it 
could do a pass over the sorted run, resetting a counter at the start of each 
partition boundary.
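That counter-reset pass could look roughly like the sketch below. This is an illustrative C++ sketch, not Impala's sorter code: the Row type, ApplyRankLimit name, and in-memory representation are all hypothetical. It assumes the run is already sorted by partition key first, then by the analytic order-by key, and it keeps rank() semantics for ties (equal keys share a rank).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical row for the analytic sort group: partition key + order-by key.
struct Row {
  std::string partition;
  double sort_key;
};

// One pass over a sorted run: reset the counter at each partition boundary
// and keep only rows whose rank() would satisfy rank() <= limit.
std::vector<Row> ApplyRankLimit(const std::vector<Row>& sorted_run, int limit) {
  std::vector<Row> out;
  int rows_in_partition = 0;  // rows seen so far in the current partition
  int rank = 0;               // rank() of the current row
  const Row* prev = nullptr;
  for (const Row& r : sorted_run) {
    bool new_partition = (prev == nullptr || r.partition != prev->partition);
    if (new_partition) {
      rows_in_partition = 0;  // counter resets at the partition boundary
      rank = 0;
    }
    ++rows_in_partition;
    if (new_partition || r.sort_key != prev->sort_key) {
      rank = rows_in_partition;  // rank() jumps past ties
    }
    if (rank <= limit) out.push_back(r);
    prev = &r;
  }
  return out;
}
```

For Q67 the same pass would also let the sorter discard most of results_rollup before the analytic node ever sees it.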

It might be best to start with tackling IMPALA-3471 by applying the limit 
within sorted runs, since that doesn't require any planner work.
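The IMPALA-3471 idea of applying the limit within sorted runs can be sketched as follows. This is a simplified model, not Impala's run/merge code: TruncateRun and MergeWithLimit are hypothetical names, rows are modeled as ints, and an ascending top-N is assumed. Each run is truncated to N rows before it would be spilled, and the k-way merge of runs stops after emitting N rows.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Sort one in-memory run (cf. SortCurrentInputRun()) and truncate it to the
// limit: a row past position N in any single run can never be in the top N.
std::vector<int> TruncateRun(std::vector<int> run, size_t limit) {
  std::sort(run.begin(), run.end());
  if (run.size() > limit) run.resize(limit);
  return run;
}

// k-way merge of the (already truncated) runs; the limit applies here too.
std::vector<int> MergeWithLimit(const std::vector<std::vector<int>>& runs,
                                size_t limit) {
  using Entry = std::tuple<int, size_t, size_t>;  // (value, run idx, offset)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (size_t i = 0; i < runs.size(); ++i) {
    if (!runs[i].empty()) heap.emplace(runs[i][0], i, 0);
  }
  std::vector<int> out;
  while (!heap.empty() && out.size() < limit) {
    auto [v, run, off] = heap.top();
    heap.pop();
    out.push_back(v);
    if (off + 1 < runs[run].size()) {
      heap.emplace(runs[run][off + 1], run, off + 1);
    }
  }
  return out;
}
```

Truncating before the spill bounds each spilled run at N rows, so the total spilled data is at most N times the number of runs rather than the full input.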

{noformat}
with results as
( select i_category ,i_class ,i_brand ,i_product_name ,d_year ,d_qoy ,d_moy 
,s_store_id
  ,sum(coalesce(ss_sales_price*ss_quantity,0)) sumsales
from store_sales ,date_dim ,store ,item
   where  ss_sold_date_sk=d_date_sk
  and ss_item_sk=i_item_sk
  and ss_store_sk = s_store_sk
  and d_month_seq between 1212 and 1212 + 11
   group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, 
d_moy,s_store_id)
 ,
 results_rollup as
 (select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, 
s_store_id, sumsales
  from results
  union all
  select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy, 
null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy
  union all
  select i_category, i_class, i_brand, i_product_name, d_year, d_qoy, null 
d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class, i_brand, i_product_name, d_year, d_qoy
  union all
  select i_category, i_class, i_brand, i_product_name, d_year, null d_qoy, null 
d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class, i_brand, i_product_name, d_year
  union all
  select i_category, i_class, i_brand, i_product_name, null d_year, null d_qoy, 
null d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class, i_brand, i_product_name
  union all
  select i_category, i_class, i_brand, null i_product_name, null d_year, null 
d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class, i_brand
  union all
  select i_category, i_class, null i_brand, null i_product_name, null d_year, 
null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category, i_class
  union all
  select i_category, null i_class, null i_brand, null i_product_name, null 
d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
  from results
  group by i_category
  union all
  select null i_category, null i_class, null i_brand, null i_product_name, null 
d_year, null d_qoy, null d_moy, null s_store_id, sum(sumsales) sumsales
  from results)

 select  *
from (select i_category
,i_class
,i_brand
,i_product_name
,d_year
,d_qoy
,d_moy
,s_store_id
,sumsales
,rank() over (partition by i_category order by sumsales desc) rk
  from results_rollup) dw2
where rk <= 100
order by i_category
,i_class
,i_brand
,i_product_name
,d_year
,d_qoy
,d_moy
,s_store_id
,sumsales
,rk
limit 100
{noformat}

Assigning to myself to fill in more details.






[jira] [Resolved] (IMPALA-9847) JSON profiles are mostly space characters

2020-06-12 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-9847.
---
Fix Version/s: Impala 4.0
   Resolution: Fixed

> JSON profiles are mostly space characters
> -
>
> Key: IMPALA-9847
> URL: https://issues.apache.org/jira/browse/IMPALA-9847
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Fix For: Impala 4.0
>
>
> JSON profiles are pretty-printed with 4 space characters per indent. This 
> means that most of the profile data is actually just space characters, and 
> this can add up for large profiles.






[jira] [Commented] (IMPALA-9852) UDA Check that function name, arguments, and return type are correct.

2020-06-12 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134317#comment-17134317
 ] 

Tim Armstrong commented on IMPALA-9852:
---

You can list the symbols in the .so with "nm -g aggreg.so". I would guess 
that the symbols are mangled with C++ name mangling.

One way to avoid the name mangling is to wrap the function definitions and 
declarations in extern "C" { }.
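For illustration, a minimal self-contained sketch of that pattern. The types below are stand-ins, not the real impala_udf headers (the real UDA uses FunctionContext and BufferVal from udf.h, and the real DoubleVal layout differs): the point is only that extern "C" suppresses C++ name mangling, so "nm -g" shows the exact names Avg_Init etc. that CREATE AGGREGATE FUNCTION looks up.

```cpp
#include <cassert>
#include <cstring>

struct FunctionContext {};  // stand-in for impala_udf::FunctionContext
struct DoubleVal {          // stand-in for impala_udf::DoubleVal
  bool is_null;
  double val;
};
struct AvgState {  // intermediate state, lives in the fixed-size buffer
  double sum;
  long count;
};

extern "C" {  // C linkage: symbols keep their unmangled names in the .so
void Avg_Init(FunctionContext*, unsigned char* buf) {
  std::memset(buf, 0, sizeof(AvgState));
}
void Avg_Update(FunctionContext*, const DoubleVal& input, unsigned char* buf) {
  if (input.is_null) return;
  AvgState* s = reinterpret_cast<AvgState*>(buf);
  s->sum += input.val;
  ++s->count;
}
DoubleVal Avg_Finalize(FunctionContext*, const unsigned char* buf) {
  const AvgState* s = reinterpret_cast<const AvgState*>(buf);
  if (s->count == 0) return DoubleVal{true, 0.0};
  return DoubleVal{false, s->sum / s->count};
}
}  // extern "C"
```

Note that extern "C" functions cannot be overloaded, so each entry point needs a distinct name, which matches how the CREATE FUNCTION statement names them individually.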

> UDA Check that function name, arguments, and return type are correct. 
> --
>
> Key: IMPALA-9852
> URL: https://issues.apache.org/jira/browse/IMPALA-9852
> Project: IMPALA
>  Issue Type: Question
>  Components: Backend
>Affects Versions: Impala 2.10.0
>Reporter: Volnei
>Priority: Major
>
> Hi,
> I'm trying to register a UDA in the database, but the error below always 
> happens:
> {code:java}
> create aggregate function avgtest(double) returns double
> location '/user/cloudera/impala_udf/aggreg.so'
> init_fn='Avg_Init'
> update_fn='Avg_Update'
> merge_fn='Avg_Merge'
> finalize_fn='Avg_Finalize';
> AnalysisException: Could not find function Avg_Update(DOUBLE) returns DOUBLE 
> in: hdfs://quickstart.cloudera:8020/user/cloudera/impala_udf/aggreg.so Check 
> that function name, arguments, and return type are correct
> {code}
> If I create a UDA without using BufferVal as one of the arguments, the error 
> doesn't happen.
> The UDA in question is the one available in impala-master/be/src/udf_samples/.
>  {code:java}
> void Avg_Init(FunctionContext* context, BufferVal* val);
> void Avg_Update(FunctionContext* context, const DoubleVal& input, 
> BufferVal* val);
> void Avg_Merge(FunctionContext* context, const BufferVal& src, BufferVal* 
> dst);
> DoubleVal Avg_Finalize(FunctionContext* context, const BufferVal& val);
> {code}
> Could anybody give me any suggestions on this problem?
> Thank you.






[jira] [Created] (IMPALA-9852) UDA Check that function name, arguments, and return type are correct.

2020-06-12 Thread Volnei (Jira)
Volnei created IMPALA-9852:
--

 Summary: UDA Check that function name, arguments, and return type 
are correct. 
 Key: IMPALA-9852
 URL: https://issues.apache.org/jira/browse/IMPALA-9852
 Project: IMPALA
  Issue Type: Question
  Components: Backend
Affects Versions: Impala 2.10.0
Reporter: Volnei


Hi,

I'm trying to register a UDA in the database, but the error below always 
happens:
{code:java}
create aggregate function avgtest(double) returns double
location '/user/cloudera/impala_udf/aggreg.so'
init_fn='Avg_Init'
update_fn='Avg_Update'
merge_fn='Avg_Merge'
finalize_fn='Avg_Finalize';

AnalysisException: Could not find function Avg_Update(DOUBLE) returns DOUBLE 
in: hdfs://quickstart.cloudera:8020/user/cloudera/impala_udf/aggreg.so Check 
that function name, arguments, and return type are correct
{code}
If I create a UDA without using BufferVal as one of the arguments, the error 
doesn't happen.

The UDA in question is the one available in impala-master/be/src/udf_samples/.

 {code:java}
void Avg_Init(FunctionContext* context, BufferVal* val);
void Avg_Update(FunctionContext* context, const DoubleVal& input, BufferVal* 
val);
void Avg_Merge(FunctionContext* context, const BufferVal& src, BufferVal* dst);
DoubleVal Avg_Finalize(FunctionContext* context, const BufferVal& val);
{code}


Could anybody give me any suggestions on this problem?

Thank you.






[jira] [Commented] (IMPALA-9747) More fine-grained codegen for text file scanners

2020-06-12 Thread Daniel Becker (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134199#comment-17134199
 ] 

Daniel Becker commented on IMPALA-9747:
---

[https://gerrit.cloudera.org/#/c/16059/]

> More fine-grained codegen for text file scanners
> 
>
> Key: IMPALA-9747
> URL: https://issues.apache.org/jira/browse/IMPALA-9747
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Csaba Ringhofer
>Assignee: Daniel Becker
>Priority: Major
>
> Currently if  the materialization of any column cannot be codegend for some 
> reason (e.g. it is CHAR(N)), then the whole codegen is cancelled for the text 
> scanner, see:
> https://github.com/apache/impala/blob/b5805de3e65fd1c7154e4169b323bb38ddc54f4f/be/src/exec/text-converter.cc#L112
> https://github.com/apache/impala/blob/58273fff601dcc763ac43f7cc275a174a2e18b6b/be/src/exec/hdfs-scanner.cc#L342
> It would be much better to use the non-codegend path only for the problematic 
> columns, use the codegend materialization for the rest, and always do 
> conjunct evaluation with codegen.
> The codegend path orders slots based on the conjuncts that use them and 
> evaluates each conjunct as soon as the slots it needs become available, so if 
> the row is dropped then the rest of the slots do not need to be materialized. 
> A simple solution would be to always do non-codegend slot materialization 
> first so that the slots are ready if a conjunct needs them. Moving the 
> columns that are not used by conjuncts to the end could be a further 
> optimization.
> This came up during the materialization of BINARY columns, which needs 
> base64 decoding during materialization.
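As a rough illustration of the per-column fallback described above. SlotWriter and MaterializeRow are hypothetical names, and real codegen would cross-compile the writer to LLVM IR rather than dispatch through std::function; the sketch only shows the control flow where each slot carries its own writer, so one non-codegen-able column (e.g. CHAR(N)) no longer disables codegen for the whole row.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical per-slot writer: either the codegen'd materialization or the
// interpreted fallback, chosen column by column instead of row-wide.
struct SlotWriter {
  bool codegen_ok;  // false for problematic types such as CHAR(N)
  std::function<int(const std::string&)> write;  // parses one text field
};

// Materialize one row: every slot uses its own path; only slots whose
// codegen failed take the interpreted fallback. Returns the fallback count
// so the effect is observable in this sketch.
int MaterializeRow(const std::vector<SlotWriter>& writers,
                   const std::vector<std::string>& fields) {
  int interpreted = 0;
  for (size_t i = 0; i < writers.size(); ++i) {
    writers[i].write(fields[i]);
    if (!writers[i].codegen_ok) ++interpreted;
  }
  return interpreted;
}
```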


