Hi Furcy,

If I execute the command "ANALYZE TABLE TEST_ORC COMPUTE STATISTICS" before checking the count from Hive, Hive returns the correct count, although it still does not spawn a map-reduce job to compute it.
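For reference, the workaround sequence looks like the following (a sketch assuming the `test_orc` table from the thread below; exact behavior may vary by Hive version):

```sql
-- Recompute table-level statistics so the row count stored in the
-- metastore matches the files actually present on HDFS
ANALYZE TABLE test_orc COMPUTE STATISTICS;

-- This now returns the correct count, answered from the refreshed
-- statistics rather than by scanning the data (hence no MR job)
SELECT COUNT(*) FROM test_orc;
```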
I'm running an HDP 2.4 cluster with Hive 1.2.1.2.4 and Spark 1.6.1. If others can concur, we can go ahead and report it as a bug.

Regards,
Nitin

On Mon, Aug 22, 2016 at 4:15 PM, Furcy Pin <furcy....@flaminem.com> wrote:

> Hi Nitin,
>
> I confirm that there is something odd here.
>
> I did the following test:
>
> create table test_orc (id int, name string, dept string) stored as ORC;
> insert into table test_orc values (1, 'abc', 'xyz');
> insert into table test_orc values (2, 'def', 'xyz');
> insert into table test_orc values (3, 'pqr', 'xyz');
> insert into table test_orc values (4, 'ghi', 'xyz');
>
> I ended up with 4 files on hdfs:
>
> 000000_0
> 000000_0_copy_1
> 000000_0_copy_2
> 000000_0_copy_3
>
> Then I renamed 000000_0_copy_2 to part-00000, and I still got COUNT(*) = 4
> with Hive. So this is not a file name issue.
>
> I then removed one of the files, and I got this:
>
> SELECT COUNT(1) FROM test_orc;
> +------+--+
> | _c0  |
> +------+--+
> | 4    |
> +------+--+
>
> SELECT * FROM test_orc;
> +--------------+----------------+----------------+--+
> | test_orc.id  | test_orc.name  | test_orc.dept  |
> +--------------+----------------+----------------+--+
> | 1            | abc            | xyz            |
> | 2            | def            | xyz            |
> | 4            | ghi            | xyz            |
> +--------------+----------------+----------------+--+
> 3 rows selected (0.162 seconds)
>
> So, my guess is that when Hive inserts data, it must keep the number of
> rows in the table somewhere in the metastore. However, if the files are
> modified by something other than Hive itself (either manually or with
> Spark), you end up with an inconsistency.
>
> So I guess we can call it a bug:
>
> Hive should detect that the files changed and invalidate its
> pre-calculated count. Optionally, Spark should be nice with Hive and
> update the count when inserting.
>
> I don't know if this bug has already been reported, and I tested on Hive
> 1.1.0, so perhaps it is already solved in later releases.
>
> Regards,
>
> Furcy
>
>
> On Mon, Aug 22, 2016 at 9:34 AM, Nitin Kumar <nk94.nitinku...@gmail.com>
> wrote:
>
>> Hi!
>>
>> I've noticed that Hive has problems registering new data records if the
>> same table is written to using both the Hive terminal and Spark SQL.
>> The problem is demonstrated through the commands listed below.
>>
>> ====================================================================
>> hive> use default;
>> hive> create table test_orc (id int, name string, dept string) stored as
>> ORC;
>> hive> insert into table test_orc values (1, 'abc', 'xyz');
>> hive> insert into table test_orc values (2, 'def', 'xyz');
>> hive> select count(*) from test_orc;
>> OK
>> 2
>> hive> select distinct(name) from test_orc;
>> OK
>> abc
>> def
>>
>> *** files in hdfs path in warehouse for the created table ***
>>
>> >>> data_points = [(3, 'pqr', 'xyz'), (4, 'ghi', 'xyz')]
>> >>> column_names = ['identity_id', 'emp_name', 'dept_name']
>> >>> data_df = sqlContext.createDataFrame(data_points, column_names)
>> >>> data_df.show()
>>
>> +-----------+--------+---------+
>> |identity_id|emp_name|dept_name|
>> +-----------+--------+---------+
>> |          3|     pqr|      xyz|
>> |          4|     ghi|      xyz|
>> +-----------+--------+---------+
>>
>> >>> data_df.registerTempTable('temp_table')
>> >>> sqlContext.sql('insert into table default.test_orc select * from
>> temp_table')
>>
>> *** files in hdfs path in warehouse for the created table ***
>>
>> hive> select count(*) from test_orc;        (Does not launch map-reduce job)
>> OK
>> 2
>> hive> select distinct(name) from test_orc;  (Launches map-reduce job)
>> abc
>> def
>> ghi
>> pqr
>>
>> hive> create table test_orc_new like test_orc stored as ORC;
>> hive> insert into table test_orc_new select * from test_orc;
>> hive> select count(*) from test_orc_new;
>> OK
>> 4
>> ==================================================================
>>
>> Even if I restart the Hive services, I cannot get the proper count output
>> from Hive.
>> This problem only occurs if the table is written to using both
>> Hive and Spark. If only Spark is used to insert records into the table
>> multiple times, the count query in the Hive terminal works perfectly fine.
>>
>> This problem occurs for tables stored with other storage formats as well
>> (textFile etc.).
>>
>> Is this because of the different naming conventions used by Hive and
>> Spark to write records to HDFS? Or maybe it is not a recommended practice
>> to write to tables using different services?
>>
>> Your thoughts and comments on this matter would be highly appreciated!
>>
>> Thanks!
>> Nitin
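[The metastore-statistics hypothesis discussed above can be checked directly. A sketch, assuming the `test_orc` table from the thread; the `hive.compute.query.using.stats` setting and `DESCRIBE FORMATTED` output exist in Hive 1.x, though the exact layout varies by version:]

```sql
-- Inspect the table-level statistics Hive keeps in the metastore;
-- look for numRows and totalSize under "Table Parameters"
DESCRIBE FORMATTED test_orc;

-- Force Hive to answer COUNT(*) by scanning the data instead of
-- returning the (possibly stale) pre-computed statistic
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM test_orc;
```

If the second count matches the actual files on HDFS while the default count does not, the stale `numRows` statistic is the culprit.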