I am working with Spark SQL and the Thrift server. I ran into an
interesting bug, and I am curious about what information/testing I can
provide to help narrow things down.

My setup is as follows:

Hive 0.12 with a table that has lots of columns (50+) stored as RCFile.
Spark-1.1.0-SNAPSHOT with Hive built in (and the Thrift server).

My query selects only one STRING column from the data, but filters rows
based on other columns.

Types:
col1 = STRING
col2 = STRING
col3 = STRING
col4 = Partition Field (TYPE STRING)

Queries:
cache table table1;
-- run some other queries on other data
select col1 from table1
where col2 = 'foo' and col3 = 'bar' and col4 = 'foobar'
  and col1 is not null
limit 100;

Fairly simple query.

When I run this in SQuirreL SQL I get no results. When I remove "and
col1 is not null" I get 100 rows of <null>.

When I run this in beeline (the one bundled with spark-1.1.0-SNAPSHOT)
I get no results, and when I remove "and col1 is not null" I get 100
rows of <null>.

Note: both of these results came after I first ran CACHE TABLE table1
and then some other queries (i.e. on other columns) before the query
above. That seemed interesting to me...

So I went to the spark-shell to determine if it was a spark issue, or a
thrift issue.

I ran:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
cacheTable("table1")

Then I ran the same "other" queries and got results, and then I ran the
query above and got results as expected.
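For reference, the follow-up query in that same spark-shell session looked
roughly like this (same hiveContext as above; the literal predicate values
are just the placeholders from the query earlier in this message):

```scala
// run against the already-cached table1, using the hiveContext defined above
val rows = hiveContext.sql(
  "select col1 from table1 " +
  "where col2 = 'foo' and col3 = 'bar' and col4 = 'foobar' " +
  "and col1 is not null limit 100")

// here this prints the expected non-null col1 values,
// unlike the <null> rows seen through the Thrift server
rows.collect().foreach(println)
```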

Interestingly enough, if I don't cache the table via CACHE TABLE table1
in the Thrift server, I get results for all queries. And if I uncache it,
I start getting results again.
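To summarize, a minimal repro through beeline against the Thrift server
would look something like the following (table name, column names, and
literal values are the placeholders used above, not my real schema):

```sql
-- in beeline, connected to the Spark Thrift server
CACHE TABLE table1;

-- run a few unrelated queries touching other columns first, then:
SELECT col1 FROM table1
WHERE col2 = 'foo' AND col3 = 'bar' AND col4 = 'foobar'
  AND col1 IS NOT NULL
LIMIT 100;
-- returns no rows (and 100 rows of <null> without the IS NOT NULL clause)

UNCACHE TABLE table1;
-- the same SELECT now returns the expected rows
```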

I hope I was clear enough here, I am happy to help however I can.

John
