Hi!

I've noticed that Hive has problems registering new data records when the
same table is written to using both the Hive CLI and Spark SQL. The
problem is demonstrated by the commands listed below

====================================================================
hive> use default;
hive> create table test_orc (id int, name string, dept string) stored as ORC;
hive> insert into table test_orc values (1, 'abc', 'xyz');
hive> insert into table test_orc values (2, 'def', 'xyz');
hive> select count(*) from test_orc;
OK
2
hive> select distinct(name) from test_orc;
OK
abc
def

*** files in hdfs path in warehouse for the created table ***

>>> data_points = [(3, 'pqr', 'xyz'), (4, 'ghi', 'xyz')]
>>> column_names = ['identity_id', 'emp_name', 'dept_name']
>>> data_df = sqlContext.createDataFrame(data_points, column_names)
>>> data_df.show()

+-----------+--------+---------+
|identity_id|emp_name|dept_name|
+-----------+--------+---------+
|          3|     pqr|      xyz|
|          4|     ghi|      xyz|
+-----------+--------+---------+

>>> data_df.registerTempTable('temp_table')
>>> sqlContext.sql('insert into table default.test_orc select * from temp_table')

*** files in hdfs path in warehouse for the created table ***

hive> select count(*) from test_orc; (Does not launch map-reduce job)
OK
2
hive> select distinct(name) from test_orc; (Launches map-reduce job)
abc
def
ghi
pqr

hive> create table test_orc_new like test_orc stored as ORC;
hive> insert into table test_orc_new select * from test_orc;
hive> select count(*) from test_orc_new;
OK
4
==================================================================
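
One thing I notice in the demo above: the count query returns instantly,
without launching a map-reduce job, which makes me suspect Hive is answering
it from stored table statistics rather than from the data files. The
statistics Hive keeps for the table can be inspected as follows (the numRows
entry under Table Parameters is the value I suspect is being returned):

hive> describe formatted test_orc;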

Even if I restart the Hive services, I cannot get the correct count output
from Hive. This problem only occurs when the table is written to using both
Hive and Spark; if only Spark is used to insert records into the table
multiple times, the count query in the Hive CLI works perfectly fine.
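
If the stale-statistics suspicion above is correct, forcing Hive to scan the
files instead of trusting the statistics should return the correct count. A
minimal check, assuming hive.compute.query.using.stats is enabled by default
in my setup:

hive> set hive.compute.query.using.stats=false;
hive> select count(*) from test_orc;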

This problem occurs for tables stored in other storage formats as well
(textfile, etc.)
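
Recomputing the statistics after the Spark insert should also bring them
back in line with the data on disk, again assuming the statistics
explanation holds:

hive> analyze table test_orc compute statistics;
hive> select count(*) from test_orc;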

Is this because of the different file naming conventions used by Hive and
Spark when writing records to HDFS? Or is it simply not a recommended
practice to write to the same table from different services?
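
For what it's worth, the files in the warehouse directory can be listed
straight from the Hive prompt to compare the file names the two engines
produce (the path below is the default warehouse location and may differ in
other setups):

hive> dfs -ls /user/hive/warehouse/test_orc;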

Your thoughts and comments on this matter would be highly appreciated!

Thanks!
Nitin
