[POSSIBLE BUG] Carbondata 1.1.1 inaccurate results

Swapnil Shinde Wed, 23 Aug 2017 07:12:00 -0700

Hello All,
We are observing incorrect query results with carbondata 1.1.1. Please
find the details below.

*Datasets used -*
TPC-H star schema based datasets (
http://www.cs.umb.edu/~poneil/StarSchemaB.PDF)
*Query - *
* select cCustKey,loCustKey from customer, lineorder where loCustkey =
cCustKey*
*How we load data -*
We tried loading the data both through a DataFrame and through "INSERT"
statements, and both ways produce incorrect results. One way of loading
is shown here:

*-- CREATE CUSTOMER TABLE*

*carbon.sql("CREATE TABLE IF NOT EXISTS customer(cCustKey Int, cName
string, cAddress string, cCity string, cNation string, cRegion string,
cPhone string, cMktSegment string, dummy string) STORED BY 'carbondata'")*

*carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/customer' INTO TABLE
customer
OPTIONS('DELIMITER'='\t','FILEHEADER'='cCustKey,cName,cAddress,cCity,cNation,cRegion,cPhone,cMktsegment,dummy')")*

*-- CREATE LINEORDER TABLE*

*carbon.sql("CREATE TABLE IF NOT EXISTS lineorder(loOrderkey
bigint,loLinenumber Int,loCustkey Int,loPartkey Int,loSuppkey
Int,loOrderdate Int,loOrderpriority String,loShippriority Int,loQuantity
Int,loExtendedprice Int,loOrdtotalprice Int,loDiscount Int,loRevenue
Int,loSupplycost Int,loTax Int,loCommitdate Int,loShipmode String,dummy
String) STORED BY 'carbondata'")*

*carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/lineorder' INTO TABLE
lineorder
OPTIONS('DELIMITER'='\t','FILEHEADER'='loOrderkey,loLinenumber,loCustkey,loPartkey,loSuppkey,loOrderdate,loOrderpriority,loShippriority,loQuantity,loExtendedprice,loOrdtotalprice,loDiscount,loRevenue,loSupplycost,loTax,loCommitdate,loShipmode,dummy')")*

*Results with different versions - *

* 1.1.0 - *Provides correct results for above query. Validated with
results from parquet.

* 1.1.1 - *Built from this
<https://github.com/apache/carbondata/tree/apache-carbondata-1.1.1-rc1>.
Join is missing lots of rows compared to parquet.

* 1.1.1 - *Built from source code available for download
<https://dist.apache.org/repos/dist/release/carbondata/1.1.1/apache-carbondata-1.1.1-source-release.zip>.
Join is missing lots of rows compared to parquet.

* 1.2 - *Built from master branch. Generated correct results similar
to parquet.

*Debugging further - *

1. Row counts for both the lineorder and customer tables are the same in
carbondata and parquet.

2. Comparing the key columns in carbondata vs parquet, they match as
well -

val cd = carbon.sql("select cCustKey from customer")
// cd.distinct.count => 30,000,000

val sp = spark.sql("select cCustKey from pcustomer")
// sp.distinct.count => 30,000,000

cd.intersect(sp)
// 30,000,000 rows (carbondata has the same keys as parquet)

val cd = carbon.sql("select loCustKey from lineorder")
// cd.distinct.count => 13,365,986

val sp = spark.sql("select loCustKey from plineorder")
// sp.distinct.count => 13,365,986

cd.intersect(sp)
// 13,365,986 rows (carbondata has the same keys as parquet)

The queries above show that the carbondata customer and lineorder tables
have the same key values as parquet.

However, when we run the join query above, carbondata returns only a very
small subset of the expected rows. A filter query for any specific key
also returns no results.
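To pin down which keys lose rows in the join, one option might be to compare per-key row counts of the carbondata join output against the parquet one. This is only a sketch: `keysWithDifferentCounts` is a hypothetical helper (not part of carbondata or Spark), and the usage in the comments assumes the parquet tables `pcustomer` and `plineorder` have the same column names as the carbondata tables.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Hypothetical helper: returns one row per key whose row count differs
// between two result sets. Output columns: key, expectedCount, actualCount.
def keysWithDifferentCounts(actual: DataFrame,
                            expected: DataFrame,
                            key: String): DataFrame = {
  val a = actual.groupBy(key).count().withColumnRenamed("count", "actualCount")
  val e = expected.groupBy(key).count().withColumnRenamed("count", "expectedCount")

  // Full outer join so keys missing on either side are reported with count 0.
  e.join(a, Seq(key), "outer")
    .withColumn("expectedCount", coalesce(col("expectedCount"), lit(0L)))
    .withColumn("actualCount", coalesce(col("actualCount"), lit(0L)))
    .where(col("actualCount") =!= col("expectedCount"))
}

// Usage in the same shell session as above (assumes identical column names
// in the parquet tables):
//   val q = "select cCustKey, loCustKey from %s, %s where loCustkey = cCustKey"
//   val carbonJoin  = carbon.sql(q.format("customer", "lineorder"))
//   val parquetJoin = spark.sql(q.format("pcustomer", "plineorder"))
//   keysWithDifferentCounts(carbonJoin, parquetJoin, "cCustKey").show(20)
```

A handful of the reported keys could then be used as test values for single-key filter queries.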

Not sure why v1.1.1 is producing incorrect results. My guess is that
carbondata is skipping rows that it shouldn't in v1.1.1.
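A rough way to test this guess could be to count the rows for one key in two ways: once with a SQL-level filter, which the optimizer may push down to the data source, and once with an RDD-level filter, which Spark applies itself after the rows have been read. `compareFilterPaths` below is a hypothetical helper written for illustration, and it assumes the key column is an Int, as in the table definitions above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: counts rows matching an Int key two ways and returns
// (countWithSqlFilter, countWithRddFilter).
def compareFilterPaths(table: DataFrame, keyColumn: String, key: Int): (Long, Long) = {
  // SQL-level predicate: eligible for pushdown into the data source.
  val sqlFiltered = table.where(col(keyColumn) === key).count()

  // RDD-level predicate: the lambda is opaque to the optimizer, so the
  // comparison runs in Spark on rows that were already read.
  val rddFiltered = table.select(keyColumn).rdd
    .filter(row => !row.isNullAt(0) && row.getInt(0) == key)
    .count()

  (sqlFiltered, rddFiltered)
}

// Usage in the same shell session (someKey is a placeholder for a customer
// key that is known to be present in lineorder):
//   compareFilterPaths(carbon.sql("select loCustKey from lineorder"), "loCustKey", someKey)
```

If the two counts differ, that would point towards the filter path dropping rows; if they agree, the problem is more likely elsewhere. Note that the RDD path reads every row of the table, so it can be slow on a table of this size.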

Any help or suggestions are very much appreciated. Thanks in advance.

Thanks

Swapnil Shinde
