Hello! I was just wondering: do you (both of you) use the same user for Hive and Spark, or different ones? Do you use Kerberized Hadoop?
On Mon, Aug 22, 2016 at 2:20 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> OK, this is my test.
>
> 1) Create a table in Hive and populate it with two rows:
>
> hive> create table testme (col1 int, col2 string);
> OK
> hive> insert into testme values (1,'London');
> Query ID = hduser_20160821212812_2a8384af-23f1-4f28-9395-a99a5f4c1a4a
> OK
> hive> insert into testme values (2,'NY');
> Query ID = hduser_20160821212812_2a8384af-23f1-4f28-9395-a99a5f4c1a4a
> OK
> hive> select * from testme;
> OK
> 1 London
> 2 NY
>
> So the rows are there.
>
> Now use Spark to create two more rows:
>
> scala> case class columns (col1: Int, col2: String)
> defined class columns
> scala> val df = sc.parallelize(Array((3,"California"),(4,"Dehli"))).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF()
> df: org.apache.spark.sql.DataFrame = [col1: int, col2: string]
> scala> df.show
> +----+----------+
> |col1|      col2|
> +----+----------+
> |   3|California|
> |   4|     Dehli|
> +----+----------+
>
> // register it as a temp table
> scala> df.registerTempTable("tmp")
> scala> sql("insert into test.testme select * from tmp")
> res9: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from testme").show
> +----+----------+
> |col1|      col2|
> +----+----------+
> |   1|    London|
> |   2|        NY|
> |   3|California|
> |   4|     Dehli|
> +----+----------+
>
> So the rows are there.
>
> Let me go to Hive again now:
>
> hive> select * from testme;
> OK
> 1 London
> 2 NY
> 3 California
> 4 Dehli
>
> hive> analyze table testme compute statistics for columns;
>
> So is there any issue here?
>
> Thanks
>
> Dr Mich Talebzadeh
> http://talebzadehmich.wordpress.com
>
> On 22 August 2016 at 11:51, Nitin Kumar <nk94.nitinku...@gmail.com> wrote:
>
>> Hi Furcy,
>>
>> If I execute the command "ANALYZE TABLE TEST_ORC COMPUTE STATISTICS" before checking the count from Hive, Hive returns the correct count, albeit without spawning a map-reduce job to compute it.
>>
>> I'm running an HDP 2.4 cluster with Hive 1.2.1.2.4 and Spark 1.6.1.
>>
>> If others can concur, we can go ahead and report it as a bug.
>>
>> Regards,
>> Nitin
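The fact that no map-reduce job runs here is consistent with the count being answered from metastore statistics rather than from the data itself. A minimal way to check this from the Hive shell, assuming the behaviour is governed by the hive.compute.query.using.stats setting (an assumption; the thread does not name it, but it is commonly enabled on HDP):

    -- Force Hive to scan the files instead of answering COUNT(*) from
    -- metastore statistics (assumption: the stale count comes from
    -- hive.compute.query.using.stats=true).
    SET hive.compute.query.using.stats=false;
    SELECT COUNT(*) FROM test_orc;   -- launches a job, returns the true count

    -- Re-enable the shortcut once the statistics have been refreshed:
    SET hive.compute.query.using.stats=true;
    ANALYZE TABLE test_orc COMPUTE STATISTICS;
    SELECT COUNT(*) FROM test_orc;   -- answered from refreshed stats, no job

If the scanning SELECT returns the right count while the stats-based one does not, that points at stale statistics rather than at the files themselves.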
>> On Mon, Aug 22, 2016 at 4:15 PM, Furcy Pin <furcy....@flaminem.com> wrote:
>>
>>> Hi Nitin,
>>>
>>> I confirm that there is something odd here.
>>>
>>> I did the following test:
>>>
>>> create table test_orc (id int, name string, dept string) stored as ORC;
>>> insert into table test_orc values (1, 'abc', 'xyz');
>>> insert into table test_orc values (2, 'def', 'xyz');
>>> insert into table test_orc values (3, 'pqr', 'xyz');
>>> insert into table test_orc values (4, 'ghi', 'xyz');
>>>
>>> I ended up with 4 files on HDFS:
>>>
>>> 000000_0
>>> 000000_0_copy_1
>>> 000000_0_copy_2
>>> 000000_0_copy_3
>>>
>>> Then I renamed 000000_0_copy_2 to part-00000, and I still got COUNT(*) = 4 with Hive.
>>> So this is not a file name issue.
>>>
>>> I then removed one of the files, and I got this:
>>>
>>> > SELECT COUNT(1) FROM test_orc;
>>> +------+--+
>>> | _c0  |
>>> +------+--+
>>> | 4    |
>>> +------+--+
>>>
>>> > SELECT * FROM test_orc;
>>> +--------------+----------------+----------------+--+
>>> | test_orc.id  | test_orc.name  | test_orc.dept  |
>>> +--------------+----------------+----------------+--+
>>> | 1            | abc            | xyz            |
>>> | 2            | def            | xyz            |
>>> | 4            | ghi            | xyz            |
>>> +--------------+----------------+----------------+--+
>>> 3 rows selected (0.162 seconds)
>>>
>>> So my guess is that when Hive inserts data, it keeps the number of rows in the table somewhere in the metastore.
>>> However, if the files are modified by someone other than Hive itself (either manually or with Spark), you end up with an inconsistency.
>>>
>>> So I guess we can call it a bug:
>>>
>>> Hive should detect that the files changed and invalidate its pre-calculated count.
>>> Optionally, Spark should be nice to Hive and update the count when inserting.
>>>
>>> I don't know if this bug has already been reported, and I tested on Hive 1.1.0, so perhaps it is already solved in later releases.
>>>
>>> Regards,
>>>
>>> Furcy
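Furcy's guess about a pre-calculated count in the metastore can be checked directly, since the statistics Hive stores for a table are visible in its extended description. A quick sketch (the exact parameter names can vary across Hive versions):

    -- Statistics appear under "Table Parameters"; after a write made
    -- outside Hive, numRows (and the other stored statistics) will
    -- still reflect the state before the Spark insert.
    DESCRIBE FORMATTED test_orc;
    -- Look for: numRows, rawDataSize, totalSize, transient_lastDdlTime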
>>> On Mon, Aug 22, 2016 at 9:34 AM, Nitin Kumar <nk94.nitinku...@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> I've noticed that Hive has problems registering new data records if the same table is written to using both the Hive terminal and Spark SQL. The problem is demonstrated by the commands listed below.
>>>>
>>>> ====================================================================
>>>> hive> use default;
>>>> hive> create table test_orc (id int, name string, dept string) stored as ORC;
>>>> hive> insert into table test_orc values (1, 'abc', 'xyz');
>>>> hive> insert into table test_orc values (2, 'def', 'xyz');
>>>> hive> select count(*) from test_orc;
>>>> OK
>>>> 2
>>>> hive> select distinct(name) from test_orc;
>>>> OK
>>>> abc
>>>> def
>>>>
>>>> *** files in hdfs path in warehouse for the created table ***
>>>>
>>>> >>> data_points = [(3, 'pqr', 'xyz'), (4, 'ghi', 'xyz')]
>>>> >>> column_names = ['identity_id', 'emp_name', 'dept_name']
>>>> >>> data_df = sqlContext.createDataFrame(data_points, column_names)
>>>> >>> data_df.show()
>>>>
>>>> +-----------+--------+---------+
>>>> |identity_id|emp_name|dept_name|
>>>> +-----------+--------+---------+
>>>> |          3|     pqr|      xyz|
>>>> |          4|     ghi|      xyz|
>>>> +-----------+--------+---------+
>>>>
>>>> >>> data_df.registerTempTable('temp_table')
>>>> >>> sqlContext.sql('insert into table default.test_orc select * from temp_table')
>>>>
>>>> *** files in hdfs path in warehouse for the created table ***
>>>>
>>>> hive> select count(*) from test_orc; (does not launch a map-reduce job)
>>>> OK
>>>> 2
>>>> hive> select distinct(name) from test_orc; (launches a map-reduce job)
>>>> abc
>>>> def
>>>> ghi
>>>> pqr
>>>>
>>>> hive> create table test_orc_new like test_orc stored as ORC;
>>>> hive> insert into table test_orc_new select * from test_orc;
>>>> hive> select count(*) from test_orc_new;
>>>> OK
>>>> 4
>>>> ==================================================================
>>>>
>>>> Even if I restart the Hive services I cannot get the proper count output from Hive. This problem only occurs if the table is written to using both Hive and Spark; if only Spark is used to insert records into the table multiple times, the count query in the Hive terminal works perfectly fine.
>>>>
>>>> This problem occurs for tables stored with other storage formats as well (textFile etc.).
>>>>
>>>> Is this because of the different naming conventions used by Hive and Spark to write records to HDFS? Or is it simply not a recommended practice to write to the same table from different services?
>>>>
>>>> Your thoughts and comments on this matter would be highly appreciated!
>>>>
>>>> Thanks!
>>>> Nitin
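Until this is fixed on one side or the other, a defensive convention is to treat the statistics as stale after every write made outside Hive and refresh them explicitly. A sketch, not a verified fix for HDP 2.4 (and whether a Spark session can pass the statement through to Hive via sqlContext.sql(...) may depend on the Spark version):

    -- Refresh the table statistics immediately after any Spark insert,
    -- so that stats-based answers such as COUNT(*) are correct again.
    -- (Partitioned tables need an explicit PARTITION(...) spec.)
    ANALYZE TABLE default.test_orc COMPUTE STATISTICS;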