Based on a cursory inspection of the code, the non-final row count is recorded when ExecDriver.progress calls ss.getHiveHistory().setCounters(...) inside its while loop. To record the final row count, we need to add the same call after the while loop (after the last updateCounters call at the end).
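The pattern can be sketched with a minimal, self-contained simulation (FakeJob and HistoryRecorder are hypothetical stand-ins, not Hive's or Hadoop's real classes): each counter snapshot taken inside the monitoring loop lags the running job, so the last value written to history is stale unless the counters are read once more after the loop exits.

```java
// Hypothetical simulation of the ExecDriver.progress pattern described
// above; FakeJob and HistoryRecorder are NOT Hadoop/Hive classes.
public class CounterLagDemo {
    static class FakeJob {
        long rows = 0;
        int polls = 0;
        boolean isComplete() { return polls >= 3; }
        // Each poll happens while the job is still producing rows, so the
        // snapshot is always behind the eventual total.
        long pollCounter() { polls++; rows += 10; return rows; }
        long finalCounter() { return 40; } // rows written after the last poll
    }

    static class HistoryRecorder {
        long recorded = -1;
        void setCounters(long value) { recorded = value; }
    }

    // Returns {value recorded by the loop alone, value after the extra call}.
    static long[] run() {
        FakeJob job = new FakeJob();
        HistoryRecorder history = new HistoryRecorder();
        // The while loop in ExecDriver.progress: one snapshot per poll.
        while (!job.isComplete()) {
            history.setCounters(job.pollCounter());
        }
        long nonFinal = history.recorded;
        // The proposed fix: one more setCounters call after the loop,
        // once the job's counters are final.
        history.setCounters(job.finalCounter());
        return new long[] { nonFinal, history.recorded };
    }

    public static void main(String[] args) {
        long[] r = run();
        System.out.println("after loop (non-final): " + r[0]); // 30
        System.out.println("after final update:     " + r[1]); // 40
    }
}
```

Under these assumptions the loop alone records 30 of 40 rows, which mirrors the 26,002,996-versus-31,208,099 discrepancy reported below.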
From: gaurav jain [mailto:jainy_gau...@yahoo.com]
Sent: Friday, October 01, 2010 1:17 PM
To: hive-user@hadoop.apache.org
Cc: hive-...@hadoop.apache.org
Subject: Re: wrong number of records loaded to a table is returned by Hive

One more data point.

In the Hive history:
org.apache.hadoop.hive.ql.exec.FileSinkOperator$TableIdEnum.TABLE_ID_1_ROWCOUNT: 26002996

In the JobTracker:
org.apache.hadoop.hive.ql.exec.FileSinkOperator$TableIdEnum TABLE_ID_1_ROWCOUNT 0 31,208,099 31,208,099

________________________________
From: gaurav jain <jainy_gau...@yahoo.com>
To: hive-user@hadoop.apache.org
Cc: hive-...@hadoop.apache.org
Sent: Fri, October 1, 2010 12:07:14 PM
Subject: Re: wrong number of records loaded to a table is returned by Hive

Hi Ning,

I also see the same behavior; below is some data for your reference. The behavior is observed for large values. I believe Hive is recording non-final values at the end of the insert query: since Hive reads the counters from the Hive history file, it may be printing non-final values.

Relevant functions I looked at:

org.apache.hadoop.hive.ql.Driver.execute():
  SessionState.get().getHiveHistory().printRowCount(queryId);

org.apache.hadoop.hive.ql.history.HiveHistory.printRowCount(String):
  reads ROWS_INSERTED="xxxx~26002996" from the Hive history file.
Regards,
Gaurav Jain

------------------------------------------------------------------------------------------------------
Hive query output: 26002996 Rows loaded to xxxx
Hive select output after insert: 31,208,099

From the JobTracker UI:

                         MAP          REDUCE       TOTAL
Map input records        31,208,099   0            31,208,099
Map output records       31,208,099   0            31,208,099
Reduce input records     0            31,208,099   31,208,099

From the Hive history file:

TaskEnd ROWS_INSERTED="xxxx~26002996" TASK_RET_CODE="0"
TASK_HADOOP_PROGRESS="2010-10-01 18:37:39,548 Stage-1 map = 100%, reduce = 100%"
TASK_NAME="org.apache.hadoop.hive.ql.exec.ExecDriver"
TASK_COUNTERS=
Job Counters.Launched reduce tasks:36
Job Counters.Rack-local map tasks:50
Job Counters.Launched map tasks:97
Job Counters.Data-local map tasks:47
...
Map-Reduce Framework.Map input records:31208099
Map-Reduce Framework.Reduce output records:0
Map-Reduce Framework.Spilled Records:88206972
Map-Reduce Framework.Map output records:31208099
Map-Reduce Framework.Reduce input records:28636162
TASK_ID="Stage-1" QUERY_ID="hadoop_20101001183131" TASK_HADOOP_ID="job_201008201925_149454" TIME="1285958308044"
---------------------------------------------------
________________________________
From: Ning Zhang <nzh...@facebook.com>
To: hive-user@hadoop.apache.org
Sent: Fri, October 1, 2010 10:45:53 AM
Subject: Re: wrong number of records loaded to a table is returned by Hive

Ping, this is a known issue. The number reported at the end of INSERT OVERWRITE is obtained from Hadoop counters, which are not very reliable and are subject to inaccuracy from failed tasks and speculative execution. If you are using the latest trunk, you may want to try the feature that automatically gathers statistics during INSERT OVERWRITE TABLE. You need to set up MySQL/HBase for partial stats publishing/aggregation. You can find the design doc at http://wiki.apache.org/hadoop/Hive/StatsDev. Note that stats gathering is still experimental.
So please feel free to report bugs/suggestions here or to hive-...@hadoop.apache.org.

On Oct 1, 2010, at 10:30 AM, Ping Zhu wrote:

I have had this issue on different versions of Hadoop/Hive: I am currently using Hadoop 0.20.2 / Hive 0.7, and I previously used Hadoop 0.20.0 / Hive 0.5.

Ping

On Fri, Oct 1, 2010 at 10:23 AM, Ping Zhu <p...@sharethis.com> wrote:

Hi,

I ran a simple Hive query inserting data into a target table from a source table. The number of records loaded to the target table (say, number A), which is returned by running this query, differs from the number (say, number B) returned by the query "select count(1) from target". I checked the number of rows in the target table's HDFS files by running "hadoop fs -cat /root/hive/metastore_db/ptarget/* | wc -l"; it also returned number B. I believe number B is the actual number of rows in the target table. I see this issue intermittently. Any comments? Thank you very much.

Ping
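As a footnote to Ning's point that Hadoop counters are skewed by failed tasks and speculative execution, here is a hedged, self-contained simulation of one such mechanism (Attempt and the aggregation methods are hypothetical, not Hadoop's API): if a row counter is summed over every task attempt, a speculative duplicate contributes the rows it wrote before being killed, even though only one attempt's output is committed to disk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation (NOT Hadoop's real API) of how summing a row
// counter over every task attempt, rather than over committed outputs,
// can inflate the total when a task runs speculatively.
public class SpeculationDemo {
    static class Attempt {
        final int taskId;
        final long rowsWritten;
        final boolean committed; // only one attempt per task commits output
        Attempt(int taskId, long rowsWritten, boolean committed) {
            this.taskId = taskId;
            this.rowsWritten = rowsWritten;
            this.committed = committed;
        }
    }

    static long counterTotal(List<Attempt> attempts) {
        // Naive aggregation: every attempt's counter is summed.
        long sum = 0;
        for (Attempt a : attempts) sum += a.rowsWritten;
        return sum;
    }

    static long committedRows(List<Attempt> attempts) {
        // Ground truth: only rows from committed attempts exist on disk,
        // which is what "select count(1)" and "wc -l" observe.
        long sum = 0;
        for (Attempt a : attempts) if (a.committed) sum += a.rowsWritten;
        return sum;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        attempts.add(new Attempt(0, 100, true));
        attempts.add(new Attempt(1, 100, true));
        // A speculative duplicate of task 1: it bumped the counter before
        // being killed, but its output was never committed.
        attempts.add(new Attempt(1, 60, false));
        System.out.println("counter total: " + counterTotal(attempts));  // 260
        System.out.println("rows on disk:  " + committedRows(attempts)); // 200
    }
}
```

This is one reason the per-table row-count statistics Ning mentions are gathered separately rather than read back from job counters.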