[jira] [Commented] (HIVE-16257) Intermittent issue with incorrect resultset with Spark

Naveen Gangam (JIRA) Mon, 20 Mar 2017 10:58:02 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-16257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933187#comment-15933187
 ]


Naveen Gangam commented on HIVE-16257:
--------------------------------------

[~xuefuz] [~szehon] Any clues on where this could be originating? When the 
problem does occur, the incorrect column value always seems to match a value 
from another row, as shown in the issue description.
I have ruled out a Beeline display issue because the problem is also 
reproducible from the Hive CLI.
Although this is not reproducible with spark-shell, I have not ruled out a 
Spark issue, because the set of transformations used by spark-shell could be 
different from the transformations used by Hive.
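
Since the bad value always seems to come from another row, a small checker 
could flag a bad run automatically and report which row the stray value 
matches. The following is only a minimal sketch, assuming the query output has 
been captured as plain whitespace-separated text; the names EXPECTED, 
parse_rows and find_corrupt_rows are hypothetical and not part of Hive or 
Spark.

```python
from decimal import Decimal

# Distinct rows taken from the INSERT statement in the test case.
EXPECTED = {
    ("test1", Decimal("101.12"), Decimal("102.123")),
    ("test2", Decimal("102.12"), Decimal("103.234")),
    ("test3", Decimal("103.52"), Decimal("102.345")),
    ("test4", Decimal("104.52"), Decimal("104.456")),
    ("test5", Decimal("105.52"), Decimal("105.567")),
}


def parse_rows(text):
    """Parse 'name val1 val2' lines (assumed layout), skipping the header."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] == "name":
            continue
        rows.append((parts[0], Decimal(parts[1]), Decimal(parts[2])))
    return rows


def find_corrupt_rows(rows, expected=EXPECTED):
    """Return (row, column, donors) for every value that differs from the
    expected row with the same name. 'donors' lists the names of expected
    rows that hold the stray value in the same column."""
    by_name = {r[0]: r for r in expected}
    problems = []
    for row in rows:
        good = by_name.get(row[0])
        if good is None:
            # Unknown name: nothing to compare against.
            problems.append((row, "name", []))
            continue
        for idx, col in ((1, "val1"), (2, "val2")):
            if row[idx] != good[idx]:
                donors = sorted(r[0] for r in expected if r[idx] == row[idx])
                problems.append((row, col, donors))
    return problems


if __name__ == "__main__":
    # The incorrect result set from the issue description.
    bad_output = """
    name    val1    val2
    test5   105.52  105.567
    test3   103.52  102.345
    test1   104.52  102.123
    test4   104.52  104.456
    test2   102.12  103.234
    """
    for row, col, donors in find_corrupt_rows(parse_rows(bad_output)):
        print(row[0], col, "matches value from", donors)
```

Run against the incorrect result set from the description, this reports that 
val1 of test1 matches the value held by test4. Running the query in a loop and 
passing each output through the checker would make the rare bad run easier to 
catch.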

What code should we instrument to confirm or eliminate Hive as a source of the 
problem? Any help is appreciated. Thank you.

> Intermittent issue with incorrect resultset with Spark
> ------------------------------------------------------
>
>                 Key: HIVE-16257
>                 URL: https://issues.apache.org/jira/browse/HIVE-16257
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 1.1.0
>            Reporter: Naveen Gangam
>
> This issue is highly intermittent and only seems to occur with the Spark 
> engine when the query has a GROUP BY clause. The following is the test case.
> {code}
> drop table if exists test_hos_sample;
> create table test_hos_sample (name string, val1 decimal(18,2), val2 decimal(20,3));
> insert into test_hos_sample values 
> ('test1',101.12,102.123),('test1',101.12,102.123),('test2',102.12,103.234),('test1',101.12,102.123),('test3',103.52,102.345),('test3',103.52,102.345),('test3',103.52,102.345),('test3',103.52,102.345),('test3',103.52,102.345),('test4',104.52,104.456),('test4',104.52,104.456),('test5',105.52,105.567),('test3',103.52,102.345),('test5',105.52,105.567);
> set hive.execution.engine=spark;
> select  name, val1,val2 from test_hos_sample group by name, val1, val2;
> {code}
> Expected Results:
> {code}
> name    val1    val2
> test5   105.52  105.567
> test3   103.52  102.345
> test1   101.12  102.123
> test4   104.52  104.456
> test2   102.12  103.234
> {code}
> Incorrect results once in a while:
> {code}
> name    val1    val2
> test5   105.52  105.567
> test3   103.52  102.345
> test1   104.52  102.123
> test4   104.52  104.456
> test2   102.12  103.234
> {code}
> 1) Not reproducible with HoMR.
> 2) Not an issue when running from spark-shell.
> 3) Not reproducible when the column data type is string or double; only 
> reproducible with decimal data types. It also works fine for decimal columns 
> if you cast the decimal to string on read and cast it back to decimal on 
> select.
> 4) Occurs with both Parquet and text file formats (other formats have not 
> been tried).
> 5) Occurs both when the table data is inside an encryption zone and when it 
> is outside.
> 6) Even in clusters where this is reproducible, it occurs only about once in 
> every 20 or more runs.
> 7) Occurs with both Beeline and Hive CLI.
> 8) Reproducible only when there is a GROUP BY clause.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
