Hi Furcy,

If I execute the command "ANALYZE TABLE TEST_ORC COMPUTE STATISTICS" before checking the count from Hive, Hive returns the correct count, although it still does not spawn a map-reduce job to compute it.
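For reference, the workaround sequence looks like the following (a sketch assuming the `test_orc` table from the thread below; exact behavior may vary by Hive version):

```sql
-- Recompute table-level statistics so the row count stored in the
-- metastore matches the files actually present on HDFS
ANALYZE TABLE test_orc COMPUTE STATISTICS;

-- This now returns the correct count, answered from the refreshed
-- statistics rather than by scanning the data (hence no MR job)
SELECT COUNT(*) FROM test_orc;
```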
I'm running an HDP 2.4 cluster with Hive 1.2.1.2.4 and Spark 1.6.1. If others can concur, we can go ahead and report it as a bug.

Regards,
Nitin

On Mon, Aug 22, 2016 at 4:15 PM, Furcy Pin <furcy....@flaminem.com> wrote:

> Hi Nitin,
>
> I confirm that there is something odd here.
>
> I did the following test:
>
> create table test_orc (id int, name string, dept string) stored as ORC;
> insert into table test_orc values (1, 'abc', 'xyz');
> insert into table test_orc values (2, 'def', 'xyz');
> insert into table test_orc values (3, 'pqr', 'xyz');
> insert into table test_orc values (4, 'ghi', 'xyz');
>
> I ended up with 4 files on hdfs:
>
> 000000_0
> 000000_0_copy_1
> 000000_0_copy_2
> 000000_0_copy_3
>
> Then I renamed 000000_0_copy_2 to part-00000, and I still got COUNT(*) = 4
> with Hive. So this is not a file name issue.
>
> I then removed one of the files, and I got this:
>
> SELECT COUNT(1) FROM test_orc;
> +------+--+
> | _c0  |
> +------+--+
> | 4    |
> +------+--+
>
> SELECT * FROM test_orc;
> +--------------+----------------+----------------+--+
> | test_orc.id  | test_orc.name  | test_orc.dept  |
> +--------------+----------------+----------------+--+
> | 1            | abc            | xyz            |
> | 2            | def            | xyz            |
> | 4            | ghi            | xyz            |
> +--------------+----------------+----------------+--+
> 3 rows selected (0.162 seconds)
>
> So, my guess is that when Hive inserts data, it must keep the number of
> rows in the table somewhere in the metastore. However, if the files are
> modified by something other than Hive itself (either manually or with
> Spark), you end up with an inconsistency.
>
> So I guess we can call it a bug:
>
> Hive should detect that the files changed and invalidate its
> pre-calculated count. Optionally, Spark should be nice with Hive and
> update the count when inserting.
>
> I don't know if this bug has already been reported, and I tested on Hive
> 1.1.0, so perhaps it is already solved in later releases.
>
> Regards,
>
> Furcy
>
>
> On Mon, Aug 22, 2016 at 9:34 AM, Nitin Kumar <nk94.nitinku...@gmail.com>
> wrote:
>
>> Hi!
>>
>> I've noticed that Hive has problems registering new data records if the
>> same table is written to using both the Hive terminal and Spark SQL.
>> The problem is demonstrated through the commands listed below.
>>
>> ====================================================================
>> hive> use default;
>> hive> create table test_orc (id int, name string, dept string) stored as
>> ORC;
>> hive> insert into table test_orc values (1, 'abc', 'xyz');
>> hive> insert into table test_orc values (2, 'def', 'xyz');
>> hive> select count(*) from test_orc;
>> OK
>> 2
>> hive> select distinct(name) from test_orc;
>> OK
>> abc
>> def
>>
>> *** files in hdfs path in warehouse for the created table ***
>>
>> >>> data_points = [(3, 'pqr', 'xyz'), (4, 'ghi', 'xyz')]
>> >>> column_names = ['identity_id', 'emp_name', 'dept_name']
>> >>> data_df = sqlContext.createDataFrame(data_points, column_names)
>> >>> data_df.show()
>>
>> +-----------+--------+---------+
>> |identity_id|emp_name|dept_name|
>> +-----------+--------+---------+
>> |          3|     pqr|      xyz|
>> |          4|     ghi|      xyz|
>> +-----------+--------+---------+
>>
>> >>> data_df.registerTempTable('temp_table')
>> >>> sqlContext.sql('insert into table default.test_orc select * from
>> temp_table')
>>
>> *** files in hdfs path in warehouse for the created table ***
>>
>> hive> select count(*) from test_orc;        (Does not launch map-reduce job)
>> OK
>> 2
>> hive> select distinct(name) from test_orc;  (Launches map-reduce job)
>> abc
>> def
>> ghi
>> pqr
>>
>> hive> create table test_orc_new like test_orc stored as ORC;
>> hive> insert into table test_orc_new select * from test_orc;
>> hive> select count(*) from test_orc_new;
>> OK
>> 4
>> ==================================================================
>>
>> Even if I restart the Hive services, I cannot get the proper count output
>> from Hive.
>> This problem only occurs if the table is written to using both
>> Hive and Spark. If only Spark is used to insert records into the table
>> multiple times, the count query in the Hive terminal works perfectly fine.
>>
>> This problem occurs for tables stored with other storage formats as well
>> (textFile etc.).
>>
>> Is this because of the different naming conventions used by Hive and
>> Spark to write records to HDFS? Or maybe it is not a recommended practice
>> to write to tables using different services?
>>
>> Your thoughts and comments on this matter would be highly appreciated!
>>
>> Thanks!
>> Nitin
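[The metastore-statistics hypothesis discussed above can be checked directly. A sketch, assuming the `test_orc` table from the thread; the `hive.compute.query.using.stats` setting and `DESCRIBE FORMATTED` output exist in Hive 1.x, though the exact layout varies by version:]

```sql
-- Inspect the table-level statistics Hive keeps in the metastore;
-- look for numRows and totalSize under "Table Parameters"
DESCRIBE FORMATTED test_orc;

-- Force Hive to answer COUNT(*) by scanning the data instead of
-- returning the (possibly stale) pre-computed statistic
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM test_orc;
```

If the second count matches the actual files on HDFS while the default count does not, the stale `numRows` statistic is the culprit.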