Hi Zack,

No, you don't need to worry about the name of the primary key getting in the way of the rows being added; the constraint name is just a label that is local to each table, so reusing it in the new table can't cause rows to be thrown away.
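As a quick illustration (purely a sketch; the table, column, and split-point names below are made up and not meant to match your schema), both tables here reuse the constraint name "pk", and an upsert into one table only replaces rows whose key values already exist in that same table:

# Write a small Phoenix SQL file and run it with sqlline.py.
cat > /tmp/pk_name_example.sql <<'SQL'
-- Two tables, both naming their primary key constraint "pk".
CREATE TABLE SENSOR_READINGS_OLD (
    READING_KEY VARCHAR NOT NULL,
    VAL DOUBLE,
    CONSTRAINT pk PRIMARY KEY (READING_KEY)
);

CREATE TABLE SENSOR_READINGS_NEW (
    READING_KEY VARCHAR NOT NULL,
    VAL DOUBLE,
    CONSTRAINT pk PRIMARY KEY (READING_KEY)
) SPLIT ON ('AAAAAA', 'BBBBBB');  -- placeholder pre-split points

-- This upsert only replaces a row in SENSOR_READINGS_NEW whose
-- READING_KEY value is already there; SENSOR_READINGS_OLD is untouched.
UPSERT INTO SENSOR_READINGS_NEW VALUES ('AAAAAA-0001', 1.0);
SQL

# Run against your cluster's ZooKeeper quorum (placeholder host below).
sqlline.py zk-host:2181 /tmp/pk_name_example.sql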
Like Anil pointed out, the best thing to look at first is the job counters. The relevant ones for debugging this situation are the map input/output record counts, the reduce input/output record counts, the reduce input groups, and the PhoenixJobCounters (INPUT_RECORDS, FAILED_RECORDS, and OUTPUT_RECORDS). INPUT_RECORDS and OUTPUT_RECORDS should both be around the number of rows that you expected (i.e. 1.7 billion), along with map input records. If I remember correctly, the reduce input groups should be around the same value as well.

Could you post the values that you've got for those counters? (There's a rough command-line sketch for pulling them below, after the quoted thread.)

- Gabriel

On Thu, Jun 25, 2015 at 4:41 PM Riesland, Zack <[email protected]> wrote:

> I started writing a long response, and then noticed something:
>
> When I created my new table, I copied/pasted the script and made some
> changes, but didn't change the name of the primary key.
>
> Is it possible that any row being inserted into the new table with a key
> that matches a row in the OTHER table is being thrown away?
>
> *From:* anil gupta [mailto:[email protected]]
> *Sent:* Thursday, June 25, 2015 10:20 AM
> *To:* [email protected]
> *Subject:* Re: Bug in CsvBulkLoad tool?
>
> Hi Zack,
>
> Can you share the counters of the csvbulkload job? Also, did you run one
> csvbulkload job or 35 bulkload jobs? What's the schema of the Phoenix
> table? How are you making sure that you have no duplicate rowkeys in your
> dataset?
>
> If you have duplicate rowkeys, then the cells in those rows in HBase will
> have more than one version. That is something I would check on the HBase
> side to investigate this problem.
>
> Thanks,
> Anil Gupta
>
> On Thu, Jun 25, 2015 at 3:11 AM, Riesland, Zack <[email protected]>
> wrote:
>
> Earlier this week I was surprised to find that, after dumping tons of data
> from a Hive table to an HBase table, about half of the data didn't end up
> in HBase.
>
> So, yesterday, I created a new Phoenix table.
>
> This time, I'm splitting on the first 6 characters of the key, which gives
> me about 1700 regions (across 6 fairly beefy region servers).
>
> My 7 billion Hive rows live in 125 5-GB CSV files on HDFS.
>
> I copied 35 of them to a separate folder and ran the CsvBulkLoad tool
> against that folder.
>
> The application manager tells me that the job ran to completion: 1042/1042
> successful maps and 1792/1792 successful reduces.
>
> However, when I run mapreduce.RowCounter against the new table, it only
> shows about 300 million rows.
>
> I should see 35/125 * 7 billion = ~1.7 billion rows.
>
> These are not primary key collisions.
>
> Can someone please help me understand what is going on?
>
> --
> Thanks & Regards,
> Anil Gupta
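Here is the command-line sketch mentioned above. It only assumes the standard Hadoop and HBase CLIs; the job id, table name, row key, and column names are placeholders, and the Phoenix counter group is my best guess at org.apache.phoenix.mapreduce.PhoenixJobCounters, so adjust to whatever the ResourceManager UI actually shows for your job.

# Placeholder job id; take the real one from the ResourceManager UI.
JOB_ID=job_1435000000000_0001

# Dump the full counter set for the bulk-load job (framework counters
# plus, if present, the Phoenix counter group).
mapred job -status "$JOB_ID"

# Or fetch individual counters. The Phoenix group name is an assumption;
# the TaskCounter group and counter names are the standard MapReduce ones.
mapred job -counter "$JOB_ID" org.apache.phoenix.mapreduce.PhoenixJobCounters INPUT_RECORDS
mapred job -counter "$JOB_ID" org.apache.phoenix.mapreduce.PhoenixJobCounters FAILED_RECORDS
mapred job -counter "$JOB_ID" org.apache.phoenix.mapreduce.PhoenixJobCounters OUTPUT_RECORDS
mapred job -counter "$JOB_ID" org.apache.hadoop.mapreduce.TaskCounter MAP_INPUT_RECORDS
mapred job -counter "$JOB_ID" org.apache.hadoop.mapreduce.TaskCounter REDUCE_INPUT_GROUPS

# Re-count the rows actually visible in the target table.
hbase org.apache.hadoop.hbase.mapreduce.RowCounter MY_NEW_TABLE

# Per Anil's suggestion: if duplicate row keys were loaded, the extra copies
# show up as older cell versions rather than extra rows. Spot-check one key
# ('0' is the default Phoenix column family; the qualifier is a placeholder).
echo "get 'MY_NEW_TABLE', 'SOMEROWKEY', {COLUMN => '0:SOME_COL', VERSIONS => 5}" | hbase shell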
