Hello! I was just wondering: do you (both of you) use the same user for Hive and Spark, or different ones? Do you use Kerberized Hadoop?
On Mon, Aug 22, 2016 at 2:20 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> OK, this is my test.
>
> 1) Create a table in Hive and populate it with two rows:
>
> hive> create table testme (col1 int, col2 string);
> OK
> hive> insert into testme values (1,'London');
> Query ID = hduser_20160821212812_2a8384af-23f1-4f28-9395-a99a5f4c1a4a
> OK
> hive> insert into testme values (2,'NY');
> Query ID = hduser_20160821212812_2a8384af-23f1-4f28-9395-a99a5f4c1a4a
> OK
> hive> select * from testme;
> OK
> 1 London
> 2 NY
>
> So the rows are there.
>
> Now use Spark to create two more rows:
>
> scala> case class columns (col1: Int, col2: String)
> defined class columns
> scala> val df = sc.parallelize(Array((3,"California"),(4,"Dehli"))).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF()
> df: org.apache.spark.sql.DataFrame = [col1: int, col2: string]
> scala> df.show
> +----+----------+
> |col1|      col2|
> +----+----------+
> |   3|California|
> |   4|     Dehli|
> +----+----------+
>
> // register it as a temp table
> scala> df.registerTempTable("tmp")
> scala> sql("insert into test.testme select * from tmp")
> res9: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from testme").show
> +----+----------+
> |col1|      col2|
> +----+----------+
> |   1|    London|
> |   2|        NY|
> |   3|California|
> |   4|     Dehli|
> +----+----------+
>
> So the rows are there.
>
> Let me go to Hive again now:
>
> hive> select * from testme;
> OK
> 1 London
> 2 NY
> 3 California
> 4 Dehli
>
> hive> analyze table testme compute statistics for columns;
>
> So is there any issue here?
>
> Thanks
>
> Dr Mich Talebzadeh
> http://talebzadehmich.wordpress.com
>
> On 22 August 2016 at 11:51, Nitin Kumar <nk94.nitinku...@gmail.com> wrote:
>
>> Hi Furcy,
>>
>> If I execute the command "ANALYZE TABLE TEST_ORC COMPUTE STATISTICS" before checking the count from Hive, Hive returns the correct count, albeit without spawning a map-reduce job to compute it.
>>
>> I'm running an HDP 2.4 cluster with Hive 1.2.1.2.4 and Spark 1.6.1.
>>
>> If others can concur, we can go ahead and report it as a bug.
>>
>> Regards,
>> Nitin
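The fact that no map-reduce job runs here is consistent with the count being answered from metastore statistics rather than from the data itself. A minimal way to check this from the Hive shell, assuming the behaviour is governed by the hive.compute.query.using.stats setting (an assumption; the thread does not name it, but it is commonly enabled on HDP):

    -- Force Hive to scan the files instead of answering COUNT(*) from
    -- metastore statistics (assumption: the stale count comes from
    -- hive.compute.query.using.stats=true).
    SET hive.compute.query.using.stats=false;
    SELECT COUNT(*) FROM test_orc;   -- launches a job, returns the true count

    -- Re-enable the shortcut once the statistics have been refreshed:
    SET hive.compute.query.using.stats=true;
    ANALYZE TABLE test_orc COMPUTE STATISTICS;
    SELECT COUNT(*) FROM test_orc;   -- answered from refreshed stats, no job

If the scanning SELECT returns the right count while the stats-based one does not, that points at stale statistics rather than at the files themselves.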
>> On Mon, Aug 22, 2016 at 4:15 PM, Furcy Pin <furcy....@flaminem.com> wrote:
>>
>>> Hi Nitin,
>>>
>>> I confirm that there is something odd here.
>>>
>>> I did the following test:
>>>
>>> create table test_orc (id int, name string, dept string) stored as ORC;
>>> insert into table test_orc values (1, 'abc', 'xyz');
>>> insert into table test_orc values (2, 'def', 'xyz');
>>> insert into table test_orc values (3, 'pqr', 'xyz');
>>> insert into table test_orc values (4, 'ghi', 'xyz');
>>>
>>> I ended up with 4 files on HDFS:
>>>
>>> 000000_0
>>> 000000_0_copy_1
>>> 000000_0_copy_2
>>> 000000_0_copy_3
>>>
>>> Then I renamed 000000_0_copy_2 to part-00000, and I still got COUNT(*) = 4 with Hive.
>>> So this is not a file name issue.
>>>
>>> I then removed one of the files, and I got this:
>>>
>>> > SELECT COUNT(1) FROM test_orc;
>>> +------+--+
>>> | _c0  |
>>> +------+--+
>>> | 4    |
>>> +------+--+
>>>
>>> > SELECT * FROM test_orc;
>>> +--------------+----------------+----------------+--+
>>> | test_orc.id  | test_orc.name  | test_orc.dept  |
>>> +--------------+----------------+----------------+--+
>>> | 1            | abc            | xyz            |
>>> | 2            | def            | xyz            |
>>> | 4            | ghi            | xyz            |
>>> +--------------+----------------+----------------+--+
>>> 3 rows selected (0.162 seconds)
>>>
>>> So my guess is that when Hive inserts data, it keeps the number of rows in the table somewhere in the metastore.
>>> However, if the files are modified by someone other than Hive itself (either manually or with Spark), you end up with an inconsistency.
>>>
>>> So I guess we can call it a bug:
>>>
>>> Hive should detect that the files changed and invalidate its pre-calculated count.
>>> Optionally, Spark should be nice to Hive and update the count when inserting.
>>>
>>> I don't know if this bug has already been reported, and I tested on Hive 1.1.0, so perhaps it is already solved in later releases.
>>>
>>> Regards,
>>>
>>> Furcy
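Furcy's guess about a pre-calculated count in the metastore can be checked directly, since the statistics Hive stores for a table are visible in its extended description. A quick sketch (the exact parameter names can vary across Hive versions):

    -- Statistics appear under "Table Parameters"; after a write made
    -- outside Hive, numRows (and the other stored statistics) will
    -- still reflect the state before the Spark insert.
    DESCRIBE FORMATTED test_orc;
    -- Look for: numRows, rawDataSize, totalSize, transient_lastDdlTime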
>>> On Mon, Aug 22, 2016 at 9:34 AM, Nitin Kumar <nk94.nitinku...@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> I've noticed that Hive has problems registering new data records if the same table is written to using both the Hive terminal and Spark SQL. The problem is demonstrated by the commands listed below.
>>>>
>>>> ====================================================================
>>>> hive> use default;
>>>> hive> create table test_orc (id int, name string, dept string) stored as ORC;
>>>> hive> insert into table test_orc values (1, 'abc', 'xyz');
>>>> hive> insert into table test_orc values (2, 'def', 'xyz');
>>>> hive> select count(*) from test_orc;
>>>> OK
>>>> 2
>>>> hive> select distinct(name) from test_orc;
>>>> OK
>>>> abc
>>>> def
>>>>
>>>> *** files in hdfs path in warehouse for the created table ***
>>>>
>>>> >>> data_points = [(3, 'pqr', 'xyz'), (4, 'ghi', 'xyz')]
>>>> >>> column_names = ['identity_id', 'emp_name', 'dept_name']
>>>> >>> data_df = sqlContext.createDataFrame(data_points, column_names)
>>>> >>> data_df.show()
>>>>
>>>> +-----------+--------+---------+
>>>> |identity_id|emp_name|dept_name|
>>>> +-----------+--------+---------+
>>>> |          3|     pqr|      xyz|
>>>> |          4|     ghi|      xyz|
>>>> +-----------+--------+---------+
>>>>
>>>> >>> data_df.registerTempTable('temp_table')
>>>> >>> sqlContext.sql('insert into table default.test_orc select * from temp_table')
>>>>
>>>> *** files in hdfs path in warehouse for the created table ***
>>>>
>>>> hive> select count(*) from test_orc; (does not launch a map-reduce job)
>>>> OK
>>>> 2
>>>> hive> select distinct(name) from test_orc; (launches a map-reduce job)
>>>> abc
>>>> def
>>>> ghi
>>>> pqr
>>>>
>>>> hive> create table test_orc_new like test_orc stored as ORC;
>>>> hive> insert into table test_orc_new select * from test_orc;
>>>> hive> select count(*) from test_orc_new;
>>>> OK
>>>> 4
>>>> ==================================================================
>>>>
>>>> Even if I restart the Hive services I cannot get the proper count output from Hive. This problem only occurs if the table is written to using both Hive and Spark; if only Spark is used to insert records into the table multiple times, the count query in the Hive terminal works perfectly fine.
>>>>
>>>> This problem occurs for tables stored with other storage formats as well (textFile etc.).
>>>>
>>>> Is this because of the different naming conventions used by Hive and Spark to write records to HDFS? Or is it simply not a recommended practice to write to the same table from different services?
>>>>
>>>> Your thoughts and comments on this matter would be highly appreciated!
>>>>
>>>> Thanks!
>>>> Nitin
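Until this is fixed on one side or the other, a defensive convention is to treat the statistics as stale after every write made outside Hive and refresh them explicitly. A sketch, not a verified fix for HDP 2.4 (and whether a Spark session can pass the statement through to Hive via sqlContext.sql(...) may depend on the Spark version):

    -- Refresh the table statistics immediately after any Spark insert,
    -- so that stats-based answers such as COUNT(*) are correct again.
    -- (Partitioned tables need an explicit PARTITION(...) spec.)
    ANALYZE TABLE default.test_orc COMPUTE STATISTICS;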