I have a Hive table created on top of S3 data in Parquet format,
partitioned by a single column named eventdate.

1) When querying through Hive, data is returned for a column named
"headertime", which is present in the schema of both the table and the file:

select headertime from dbName.test_bug where eventdate=20180510 limit 10

2) From a Scala notebook, directly loading the files from a particular
partition also works:

val session = org.apache.spark.sql.SparkSession.builder
  .appName("searchRequests")
  .enableHiveSupport()
  .getOrCreate()

// Read the partition directory directly, bypassing the Hive metastore
val searchRequest = session.read.parquet("s3n://bucketName/module/search_request/eventDate=20180510")

searchRequest.createOrReplaceTempView("SearchRequest")

val exploreDF = session.sql("select headertime from SearchRequest where SearchRequestHeaderDate='2018-05-10' limit 100")

exploreDF.show(20)

This also displays the values for the "headertime" column.
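
To double-check, printing the schema of this directly loaded DataFrame shows
the column as Spark infers it from the Parquet footers (a minimal sketch,
reusing searchRequest from above):

// Schema inferred straight from the Parquet files
searchRequest.printSchema()
// "headertime" shows up here with its Parquet-derived type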

3) But when using Spark SQL to query the Hive table directly, as below:

val exploreDF = session.sql("select headertime from dbName.test_bug where eventdate=20180510 limit 100")

exploreDF.show(20)

it always returns null for every row.
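
To see where the two code paths diverge, I compared the schema Spark
resolves from the Hive metastore with the schema it reads from the files
themselves (a diagnostic sketch, assuming the same session as above):

// Schema as Spark resolves it via the Hive metastore
session.table("dbName.test_bug").printSchema()

// Schema as read directly from the Parquet files, bypassing the metastore
session.read
  .parquet("s3n://bucketName/module/search_request/eventDate=20180510")
  .printSchema()

// Note: Parquet column names are case-sensitive, while Hive lowercases them;
// a case-only mismatch between the two schemas can surface as all-null columns.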

I opened the Parquet file and can see that the headertime column is present
with values, but I'm not sure why Spark SQL is not able to read the values
for that column.
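
For example, reading a single file from the partition and selecting the
column shows non-null values (sketch; the part-file name is illustrative,
not the actual one):

// Read one data file and project the problem column
session.read
  .parquet("s3n://bucketName/module/search_request/eventDate=20180510/part-00000.parquet")
  .select("headertime")
  .show(10)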

It would be helpful if someone could point out where Spark SQL gets the
schema from in case 3. I was expecting it to behave the same as the Hive query.
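
In case it's relevant, this is how I checked which reader path and schema
the metastore-backed query uses (a sketch; spark.sql.hive.convertMetastoreParquet
defaults to true, meaning Spark uses its native Parquet reader with the
metastore schema):

// Is Spark converting the Hive Parquet table to its native reader?
// (the second argument is just a fallback default for display)
println(session.conf.get("spark.sql.hive.convertMetastoreParquet", "true"))

// The metastore's own view of the table, including the partition column
session.sql("DESCRIBE FORMATTED dbName.test_bug").show(100, false)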
