Daniel Harper created SPARK-13000:
-------------------------------------

             Summary: Corrupted results when using LIMIT clause via JDBC connections to ThriftServer
                 Key: SPARK-13000
                 URL: https://issues.apache.org/jira/browse/SPARK-13000
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.2
         Environment: Amazon EMR AMI 4.2.0
Spark 1.5.2
            Reporter: Daniel Harper


h2. Steps to reproduce

#. Create the table in Hive (see the definition under Resources)
#. Insert some data (at least 2 rows)
#. Start the Thrift server
#. Connect to the Thrift server via {{beeline}} or a custom application using JDBC
#. Run query {{select * from logs_table limit 1}}
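
The last two steps can also be scripted so that the query runs non-interactively. The following is a minimal sketch, assuming {{beeline}} is on the PATH and the Thrift server listens on port 10001 as in this report. The helper names {{beeline_command}} and {{run_query}} are made up for illustration; {{-u}} (JDBC URL) and {{-e}} (query string) are standard beeline options.

```python
import subprocess

# Assumption: the same JDBC URL as used elsewhere in this report.
JDBC_URL = "jdbc:hive2://localhost:10001/default"


def beeline_command(query, url=JDBC_URL):
    """Build a non-interactive beeline invocation for a single query."""
    # -u passes the JDBC URL, -e passes the query to execute.
    return ["beeline", "-u", url, "-e", query]


def run_query(query, url=JDBC_URL):
    """Run the query through beeline and return its standard output as text.

    Needs a running Thrift server; raises CalledProcessError if beeline
    exits with a non-zero status.
    """
    result = subprocess.run(
        beeline_command(query, url),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

Calling {{run_query("select * from logs_table limit 1;")}} and {{run_query("select * from logs_table;")}} against a running server should give the two outputs to compare.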

h2. Detailed description

The following query returns a row with blank columns when executed over a 
JDBC connection to the Thrift server:

{code}
select * from logs_table limit 1;
{code}

We've tried this using {{beeline}}. As the output shows, {{service}} and the 
partition columns ({{yyyy}}, {{mm}}, {{dd}}) come back blank:

{code}
[hadoop@ip-x ~]$ beeline
Beeline version 1.0.0-amzn-1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10001/default
scan complete in 5ms
Connecting to jdbc:hive2://localhost:10001/default
Enter username for jdbc:hive2://localhost:10001/default:
Enter password for jdbc:hive2://localhost:10001/default:
Connected to: Spark SQL (version 1.5.2)
Driver: Hive JDBC (version 1.0.0-amzn-1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10001/default> select * from logs_table limit 1;
+------------------------+----------+-------+-----+-----+--+
|           ts           | service  | yyyy  | mm  | dd  |
+------------------------+----------+-------+-----+-----+--+
| 2016-01-24 23:23:24.0  |   |   |   |   |
+------------------------+----------+-------+-----+-----+--+
1 row selected (9.182 seconds)
{code}

Without the {{LIMIT 1}} clause, we get the full dataset with all columns 
populated:

{code}
0: jdbc:hive2://localhost:10001/default> select * from logs_table;
+------------------------+----------+-------+-----+-----+--+
|           ts           | service  | yyyy  | mm  | dd  |
+------------------------+----------+-------+-----+-----+--+
| 2016-01-24 23:23:24.0  |service_1  | 2016  | 01  | 24  |
| 2016-01-24 23:29:24.0  |service_4  | 2016  | 01  | 24  |
+------------------------+----------+-------+-----+-----+--+
2 rows selected (10.956 seconds)
{code}
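
The difference between the two outputs above can be checked mechanically. The following is a rough sketch, assuming beeline's default table output format as shown in this report. The function names {{parse_beeline_table}} and {{blank_columns}} are hypothetical helpers and not part of any Spark or Hive API.

```python
# Output of the LIMIT 1 query, copied from this report.
LIMIT_1_OUTPUT = """\
+------------------------+----------+-------+-----+-----+--+
|           ts           | service  | yyyy  | mm  | dd  |
+------------------------+----------+-------+-----+-----+--+
| 2016-01-24 23:23:24.0  |   |   |   |   |
+------------------------+----------+-------+-----+-----+--+
1 row selected (9.182 seconds)
"""


def parse_beeline_table(text):
    """Parse beeline table output into a list of {column: value} dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders (+---+), prompts, and the row-count line
        # Drop the outer pipes, then split the remaining text into cells.
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        if header is None:
            header = cells  # the first pipe-delimited line holds column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def blank_columns(rows):
    """Map row index -> names of empty columns, for rows that have any."""
    report = {}
    for index, row in enumerate(rows):
        empty = [name for name, value in row.items() if value == ""]
        if empty:
            report[index] = empty
    return report
```

Applied to {{LIMIT_1_OUTPUT}}, {{blank_columns(parse_beeline_table(LIMIT_1_OUTPUT))}} reports {{service}}, {{yyyy}}, {{mm}} and {{dd}} as blank for row 0. The same check on the output without {{LIMIT 1}} returns an empty dict.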

I ran the query {{select * from logs_table limit 1}} via 

* {{spark-sql}}
* {{spark-shell}}

...and both returned the expected result: a single row with all columns 
populated.

This leads me to believe the issue lies in the Thrift server or the Hive JDBC 
driver.

We are starting the thrift server as follows:

{code}
sudo /usr/lib/spark/sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=10001 \
  --num-executors 1 --executor-cores 5 --executor-memory 38G \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.sql.thriftserver.scheduler.pool=default \
  --driver-memory 10G
{code}

h2. Resources 

The Hive table is defined as follows:

{code}
CREATE EXTERNAL TABLE IF NOT EXISTS logs_table (
        ts STRING,
        service STRING
)
COMMENT 'logs table'
PARTITIONED BY (yyyy STRING, mm STRING, dd STRING)
STORED AS TEXTFILE
LOCATION 's3://data-lake/structured/';
{code}


