Daniel Harper created SPARK-13000: ------------------------------------- Summary: Corrupted results when using LIMIT clause via JDBC connections to ThriftServer Key: SPARK-13000 URL: https://issues.apache.org/jira/browse/SPARK-13000 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2 Environment: Amazon EMR AMI 4.2.0 Spark 1.5.2 Reporter: Daniel Harper
h2. Steps to reproduce #. Create table in HIVE (see below for definition) #. Insert some data (at least 2 rows) #. Start thrift service #. Connect to thrift service via {{beeline}} or custom application via JDBC #. Run query {{select * from logs_table limit 1}} h2. Detailed description We're seeing strange results for the following query when executed via JDBC connections to the thrift server {code} select * from logs_table limit 1; {code} We've tried this using {{beeline}} and as you can see, the {{service}} and other columns are blank {code} [hadoop@ip-x ~]$ beeline Beeline version 1.0.0-amzn-1 by Apache Hive beeline> !connect jdbc:hive2://localhost:10001/default scan complete in 5ms Connecting to jdbc:hive2://localhost:10001/default Enter username for jdbc:hive2://localhost:10001/default: Enter password for jdbc:hive2://localhost:10001/default: Connected to: Spark SQL (version 1.5.2) Driver: Hive JDBC (version 1.0.0-amzn-1) Transaction isolation: TRANSACTION_REPEATABLE_READ 0: jdbc:hive2://localhost:10001/default> select * from logs_table limit 1; +------------------------+----------+-------+-----+-----+--+ | ts | service | yyyy | mm | dd | +------------------------+----------+-------+-----+-----+--+ | 2016-01-24 23:23:24.0 | | | | | +------------------------+----------+-------+-----+-----+--+ 1 row selected (9.182 seconds) {code} Removing the {{LIMIT 1}} clause, we get the full dataset and all columns are present. {code} 0: jdbc:hive2://localhost:10001/default> select * from logs_table; +------------------------+----------+-------+-----+-----+--+ | ts | service | yyyy | mm | dd | +------------------------+----------+-------+-----+-----+--+ | 2016-01-24 23:23:24.0 |service_1 | 2016 | 01 | 24 | | 2016-01-24 23:29:24.0 |service_4 | 2016 | 01 | 24 | +------------------------+----------+-------+-----+-----+--+ 2 rows selected (10.956 seconds) {code} I ran the query {{select * from logs_table limit 1}} via * {{spark-sql}} * {{spark-shell}} ...and both returned the expected results, limiting the resultset to 1 row and with all the columns populated. This leads me to believe this is an issue with the Thrift Server or Hive JDBC driver. We are starting the thrift server as follows: {code} sudo /usr/lib/spark/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --num-executors 1 --executor-cores 5 --executor-memory 38G --conf spark.scheduler.mode=FAIR --conf spark.sql.thriftserver.scheduler.pool=default --driver-memory 10G {code} h2. Resources The HIVE table is defined as follows: {code} CREATE EXTERNAL TABLE IF NOT EXISTS logs_table ( ts STRING, service STRING ) COMMENT 'logs table' PARTITIONED BY (yyyy STRING, mm STRING, dd STRING) STORED AS TEXTFILE LOCATION 's3://data-lake/structured/'; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org