I am working with Spark SQL and the Thrift server. I ran into an interesting bug, and I am curious about what information or testing I can provide to help narrow things down.
My setup is as follows:

Hive 0.12 with a table that has lots of columns (50+), stored as RCFile.
Spark-1.1.0-SNAPSHOT with Hive built in (and the Thrift server).

My query selects only one STRING column from the data, but filters on other columns.

Types:
    col1 = STRING
    col2 = STRING
    col3 = STRING
    col4 = Partition Field (TYPE STRING)

Queries:
    cache table table1;
    --Run some other queries on other data
    select col1 from table1 where col2 = 'foo' and col3 = 'bar' and col4 = 'foobar' and col1 is not null limit 100

Fairly simple query.

When I run this in SQuirreL SQL I get no results. When I remove the "and col1 is not null" I get 100 rows of <null>.

When I run this in beeline (the one that is in the spark-1.1.0-SNAPSHOT) I get no results, and when I remove "and col1 is not null" I get 100 rows of <null>.

Note: in both cases I had already run some other queries (i.e. on other columns), and I ran CACHE TABLE TABLE1 first, before any queries. That seemed interesting to me...

So I went to the spark-shell to determine whether it was a Spark issue or a Thrift issue.

I ran:
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext._
    cacheTable("table1")

Then I ran the same "other" queries and got results, and then I ran the query above and got results as expected.

Interestingly enough, if I don't cache the table through "cache table table1" in Thrift, I get results for all queries. If I uncache the table, I start getting results again.

I hope I was clear enough here. I am happy to help however I can.

John