Re: Difference in count(*) result for KUDU and parquet

Geetika Gupta Thu, 10 May 2018 02:55:28 -0700

Thanks, William

The problem was caused by duplicated primary keys, so changing the schema
of the table resolved our issue.
However, according to the documentation, inserting a row with the same
primary key values as an existing row should result in a duplicate key
error.
In our case, no error related to primary key duplication was thrown and
the query executed successfully.
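
For anyone hitting the same symptom, here is a rough sketch of how the
duplicates could be confirmed on the Parquet side before loading. The key
columns (l_orderkey, l_linenumber) are an assumption based on the usual
TPC-H LINEITEM key; the actual Kudu schema is not shown in this thread, so
substitute whatever columns the Kudu table declares as its primary key.

```sql
-- Assumption: the Kudu primary key is (l_orderkey, l_linenumber).
-- Replace these with the real key columns of the Kudu table.

-- Count the distinct key values in the Parquet source. If the chosen key
-- is truly unique, this matches count(*) on the same table.
select count(*) as distinct_keys
from (
  select l_orderkey, l_linenumber
  from parquetimpala500.lineitem
  group by l_orderkey, l_linenumber
) k;

-- Show a sample of key values that occur more than once.
select l_orderkey, l_linenumber, count(*) as copies
from parquetimpala500.lineitem
group by l_orderkey, l_linenumber
having count(*) > 1
limit 20;
```

If distinct_keys comes out lower than count(*) on the Parquet table, the
chosen key is not unique in the source data. As far as I understand,
Impala reports rows rejected for a duplicate key as a warning on the
INSERT rather than failing the statement, which would explain why the
query still succeeded; I have not verified this against our versions.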




On Thu, May 10, 2018 at 11:29 AM, William Berkeley <wdberke...@cloudera.com>
wrote:

> Hi Geetika. While I don't know anything about TPCH data, when people load
> data and see fewer rows, it's usually because of duplicated primary keys.
> Kudu, unlike Parquet, has a unique key constraint. What's the schema for
> the Kudu table?
>
> Also, it might be useful to know which Kudu version and Impala version
> you are using.
>
> -Will
>
> On Wed, May 9, 2018 at 10:03 PM, Geetika Gupta <geetika.gu...@knoldus.in>
> wrote:
>
>> Hi community,
>>
>> We ran the command below to load data into Kudu, but the target table
>> ended up with fewer rows than the source:
>>
>> insert into LINEITEM select * from PARQUETIMPALA500.LINEITEM
>>
>> This query was successful, but when we ran count(*) on both tables,
>> the row counts were different:
>>
>> 0: jdbc:hive2://slave2:21050/default> select count(*) from lineitem
>> . . . . . . . . . . . . . . . . . . > ;
>> 536870912
>>
>> 0: jdbc:hive2://slave2:21050/default> select count(*) from
>> parquetimpala500.lineitem;
>> 3000028242
>>
>> We are loading 500 GB of TPCH data into Kudu from a Parquet table.
>>
>> --
>> Regards,
>> Geetika Gupta
>>
>
>


-- 
Regards,
Geetika Gupta
