[ 
https://issues.apache.org/jira/browse/HIVE-29516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18072671#comment-18072671
 ] 

Stamatis Zampetakis commented on HIVE-29516:
--------------------------------------------

The test that was added in the PR produces the following stack trace on current 
master (d4d166d51d03ecbb1411f4eb701bb0310786a3f9) without the fix:

{noformat}
java.lang.NullPointerException: Cannot invoke "java.util.List.iterator()" because "colStats" is null
        at org.apache.hadoop.hive.ql.stats.StatsUtils.updateStats(StatsUtils.java:2036)
        at org.apache.hadoop.hive.ql.parse.TezCompiler.removeSemijoinOptimizationByBenefit(TezCompiler.java:1980)
        at org.apache.hadoop.hive.ql.parse.TezCompiler.semijoinRemovalBasedTransformations(TezCompiler.java:566)
        at org.apache.hadoop.hive.ql.parse.TezCompiler.optimizeOperatorPlan(TezCompiler.java:247)
        at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:182)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.compilePlan(SemanticAnalyzer.java:13159)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:13384)
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:481)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:358)
        at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:187)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:358)
        at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:224)
        at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:109)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:499)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:451)
        at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:415)
        at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:409)
{noformat}

Putting it here for future reference.
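
For readers unfamiliar with this failure mode, here is a minimal, self-contained sketch of how a null column-statistics list produces an NPE of this shape. It does not use Hive code: ColStatsGuardSketch, ColStat, scaleUnguarded and scaleGuarded are hypothetical names invented for illustration, and the null check shown is only one possible guard, not necessarily what the PR does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; this is not Hive's StatsUtils.
public class ColStatsGuardSketch {

    // Hypothetical, simplified column-statistics holder (not Hive's ColStatistics).
    static final class ColStat {
        final String columnName;
        long countDistinct;

        ColStat(String columnName, long countDistinct) {
            this.columnName = columnName;
            this.countDistinct = countDistinct;
        }
    }

    // Unguarded variant: the enhanced for loop calls colStats.iterator(),
    // so a null list throws NullPointerException.
    static void scaleUnguarded(List<ColStat> colStats, double ratio) {
        for (ColStat cs : colStats) {
            cs.countDistinct = Math.max(1L, Math.round(cs.countDistinct * ratio));
        }
    }

    // Guarded variant: leaves everything untouched and returns false
    // when column statistics are missing.
    static boolean scaleGuarded(List<ColStat> colStats, double ratio) {
        if (colStats == null) {
            return false;
        }
        scaleUnguarded(colStats, ratio);
        return true;
    }

    public static void main(String[] args) {
        try {
            scaleUnguarded(null, 0.5);
        } catch (NullPointerException e) {
            // The message text depends on the JDK version and on whether
            // the class was compiled with debug information.
            System.out.println("unguarded: " + e.getMessage());
        }

        System.out.println("guarded, null stats: " + scaleGuarded(null, 0.5));

        List<ColStat> stats = new ArrayList<>();
        stats.add(new ColStat("c1", 100L));
        System.out.println("guarded, stats present: " + scaleGuarded(stats, 0.5));
    }
}
```

Whether skipping the update is the right behaviour for the real optimizer is a design question for the PR; the sketch only shows where the exception comes from.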


> NPE in StatsUtils.updateStats when removing semijoin by benefit and column 
> statistics are missing
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29516
>                 URL: https://issues.apache.org/jira/browse/HIVE-29516
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor, Statistics
>    Affects Versions: 4.2.0
>            Reporter: Shubham Sharma
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.3.0
>
>
> h3. Problem
> Query compilation fails with {{NullPointerException}} in 
> {{StatsUtils.updateStats()}} when column statistics are not available for 
> certain operators. This occurs during the semijoin optimization phase in 
> {{TezCompiler.removeSemijoinOptimizationByBenefit()}}.
> The issue is reproducible with TPC-DS queries at scale factors of 100GB or 
> higher, where column-level statistics may be incomplete or unavailable for 
> some tables.
>  
>  
> {code:java}
> java.lang.NullPointerException
>     at org.apache.hadoop.hive.ql.stats.StatsUtils.updateStats(StatsUtils.java:2067)
>     at org.apache.hadoop.hive.ql.parse.TezCompiler.removeSemijoinOptimizationByBenefit(TezCompiler.java:1982)
>     at org.apache.hadoop.hive.ql.parse.TezCompiler.semijoinRemovalBasedTransformations(TezCompiler.java:539)
>     at org.apache.hadoop.hive.ql.parse.TezCompiler.optimizeOperatorPlan(TezCompiler.java:238)
>     at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:174)
>     at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.compilePlan(SemanticAnalyzer.java:12521)
>     at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12739)
>     at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:460)
>     ...
> {code}
> h3. How to Reproduce
>  # Generate TPC-DS dataset at 100GB or larger scale
>  # Run TPC-DS queries that involve semijoin optimizations (queries with 
> subqueries or complex joins, e.g., 10, 17, 19, 23, 24, 25, 29, 32)
>  # Ensure column statistics are not fully computed for all tables
>  # Observe NPE during query compilation
>  



