[ https://issues.apache.org/jira/browse/HIVE-16257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933187#comment-15933187 ]
Naveen Gangam commented on HIVE-16257:
--------------------------------------

[~xuefuz] [~szehon] Any clues on where this could be originating? When the problem does occur, the incorrect column value always seems to match a value from another row, as shown in the issue description. I have ruled out a beeline display issue because the problem is reproducible from the Hive CLI too. Although this is not reproducible with spark-shell, I have not ruled out a Spark issue, because the set of transformations used by spark-shell could be different from the transformations used by Hive.
What code should we instrument to confirm or eliminate Hive as the source of the problem? Any help is appreciated. Thank you.

> Intermittent issue with incorrect resultset with Spark
> ------------------------------------------------------
>
>                 Key: HIVE-16257
>                 URL: https://issues.apache.org/jira/browse/HIVE-16257
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 1.1.0
>            Reporter: Naveen Gangam
>
> This issue is highly intermittent and only seems to occur with the Spark engine
> when the query has a GROUP BY clause. The following is the test case.
> {code}
> drop table if exists test_hos_sample;
> create table test_hos_sample (name string, val1 decimal(18,2), val2 decimal(20,3));
> insert into test_hos_sample values
> ('test1',101.12,102.123),('test1',101.12,102.123),('test2',102.12,103.234),('test1',101.12,102.123),('test3',103.52,102.345),('test3',103.52,102.345),('test3',103.52,102.345),('test3',103.52,102.345),('test3',103.52,102.345),('test4',104.52,104.456),('test4',104.52,104.456),('test5',105.52,105.567),('test3',103.52,102.345),('test5',105.52,105.567);
> set hive.execution.engine=spark;
> select name, val1,val2 from test_hos_sample group by name, val1, val2;
> {code}
> Expected Results:
> {code}
> name val1 val2
> test5 105.52 105.567
> test3 103.52 102.345
> test1 101.12 102.123
> test4 104.52 104.456
> test2 102.12 103.234
> {code}
> Incorrect results once in a while (note val1 for test1, which matches val1 of test4):
> {code}
> name val1 val2
> test5 105.52 105.567
> test3 103.52 102.345
> test1 104.52 102.123
> test4 104.52 104.456
> test2 102.12 103.234
> {code}
> 1) Not reproducible with HoMR.
> 2) Not an issue when running from spark-shell.
> 3) Not reproducible when the column data type is string or double. Only
> reproducible with decimal data types. It also works fine for the decimal data type if
> you cast the decimal to string on read and cast it back to decimal on select.
> 4) Occurs with both parquet and text file formats (haven't tried other
> formats).
> 5) Occurs both when the table data is within an encryption zone and when it is
> outside one.
> 6) Even in clusters where this is reproducible, it occurs only about once in 20
> or more runs.
> 7) Occurs with both Beeline and Hive CLI.
> 8) Reproducible only when there is a GROUP BY clause.
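
Since the failure shows up only about once in 20 or more runs, one way to catch it without eyeballing output is to run the test query in a loop and diff every result set against the expected rows. The following is a minimal sketch in Python, not part of the original report. It assumes a hypothetical `run_query` callable, supplied by the caller, that executes the GROUP BY query through whatever client is available (Beeline, Hive CLI, a JDBC wrapper) and returns rows as `(name, val1, val2)` tuples with the decimals parsed as `Decimal`. Rows are compared as sets because the output order of the query is not guaranteed.

```python
from decimal import Decimal
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, Decimal, Decimal]

# Expected rows, copied from the "Expected Results" block of the issue.
EXPECTED = {
    ("test1", Decimal("101.12"), Decimal("102.123")),
    ("test2", Decimal("102.12"), Decimal("103.234")),
    ("test3", Decimal("103.52"), Decimal("102.345")),
    ("test4", Decimal("104.52"), Decimal("104.456")),
    ("test5", Decimal("105.52"), Decimal("105.567")),
}


def find_bad_runs(
    run_query: Callable[[], Iterable[Row]],  # hypothetical, caller-supplied
    attempts: int = 100,
) -> List[Tuple[int, List[Row], List[Row]]]:
    """Run the query `attempts` times and report every run that deviates.

    Each entry is (attempt_number, unexpected_rows, missing_rows).
    """
    bad_runs = []
    for attempt in range(1, attempts + 1):
        rows = set(run_query())
        if rows != EXPECTED:
            unexpected = sorted(rows - EXPECTED)  # rows that should not exist
            missing = sorted(EXPECTED - rows)     # rows that were replaced or lost
            bad_runs.append((attempt, unexpected, missing))
    return bad_runs
```

Pairing each unexpected row with the missing row of the same name shows which column was corrupted and which other row its value matches, which may help narrow down where to instrument.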