[ https://issues.apache.org/jira/browse/TAJO-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jihoon Son resolved TAJO-1315.
------------------------------
    Resolution: Not a Problem

This problem was caused by broken input data, not by the storage layer.
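
For anyone hitting similar symptoms, the following is a minimal sketch of how broken rows in pipe-delimited input could be detected before loading. It is not part of Tajo and not the check that was actually used for this issue. The function name `find_malformed_rows` is hypothetical, the expected field count of 8 is taken from the table schema in the report, and tolerating one trailing delimiter is an assumption about how the files were generated.

```python
from typing import Iterable, Iterator, Tuple

EXPECTED_FIELDS = 8   # column count from the CREATE EXTERNAL TABLE statement
DELIMITER = "|"       # value of the delimiter option in the report


def find_malformed_rows(
    lines: Iterable[str],
    expected_fields: int = EXPECTED_FIELDS,
    delimiter: str = DELIMITER,
) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, reason, line) for every row that looks broken."""
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        fields = line.split(delimiter)

        # Assumption: a single trailing delimiter at end of line is allowed,
        # so drop the empty field it produces.
        if len(fields) == expected_fields + 1 and fields[-1] == "":
            fields = fields[:-1]

        if len(fields) != expected_fields:
            reason = f"expected {expected_fields} fields, found {len(fields)}"
            yield line_no, reason, line
            continue

        # The first column (c_custkey) is declared as bigint in the report.
        try:
            int(fields[0])
        except ValueError:
            yield line_no, "c_custkey is not an integer", line
```

Passing an open file object for each part file (for example a local copy of one of the files under /customer.tbl) and printing the yielded tuples would list the offending line numbers. Rows that are truncated or split across lines show up as field-count mismatches, which matches the kind of partial tuples shown in the query output below.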

> Invalid results are returned when a source table consists of multiple CSV files
> -------------------------------------------------------------------------------
>
>                 Key: TAJO-1315
>                 URL: https://issues.apache.org/jira/browse/TAJO-1315
>             Project: Tajo
>          Issue Type: Bug
>          Components: storage
>            Reporter: Jihoon Son
>            Priority: Critical
>             Fix For: 0.10
>
>
> See the title.
> Here are some examples related to this bug.
> {noformat}
> default> \dfs -ls /customer.tbl
> Found 19 items
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000001
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000002
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000003
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000004
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000005
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000006
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000007
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000008
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000009
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000010
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000011
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000012
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000013
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000014
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000015
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000016
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:26 /customer.tbl/000017
> -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:26 /customer.tbl/000018
> -rw-r--r--   3 hadoop supergroup   47571167 2015-01-26 20:26 /customer.tbl/000019
> default> create external table test (C_CUSTKEY bigint, C_NAME text, C_ADDRESS 
> text, C_NATIONKEY bigint, C_PHONE text, C_ACCTBAL double, C_MKTSEGMENT text, 
> C_COMMENT text) using csv with ('csvfile.delimiter'='|') location 
> 'hdfs://192.168.0.1:7020/customer.tbl';
> OK
> default> \d test
> table name: tpch_swift.test
> table path: hdfs://192.168.0.1:7020/customer.tbl
> store type: CSV
> number of rows: unknown
> volume: 2.5 GB
> Options: 
>       'text.delimiter'='|'
> schema: 
> c_custkey     INT8
> c_name        TEXT
> c_address     TEXT
> c_nationkey   INT8
> c_phone       TEXT
> c_acctbal     FLOAT8
> c_mktsegment  TEXT
> c_comment     TEXT
> default> select count(*) from test;
> ?count
> -------------------------------
> 15000017
> (1 rows, 3.2 sec, 9 B selected)
> {noformat}
> As you can see, the expected row count is 15000000, but the query returned 
> 15000017.
> So I investigated the erroneous tuples as follows.
> {noformat}
> default> select c_custkey, count(*) as cnt from customer2 group by c_custkey 
> having cnt > 1;
> c_custkey,  cnt
> -------------------------------
> ,  14
> 114575,  2
> 14711665,  2
> 34,  2
> (4 rows, 16.681 sec, 29 B selected)
> default> select * from customer2 where c_custkey is null or c_custkey = 
> 114575 or c_custkey = 14711665 or c_custkey = 34;
> c_custkey,  c_name,  c_address,  c_nationkey,  c_phone,  c_acctbal,  
> c_mktsegment,  c_comment
> -------------------------------
> 34,  Customer#000000034,  Q6G9wZ6dnczmtOx509xgE,M2KV,  15,  25-344-968-5422,  
> 8589.7,  HOUSEHOLD,  nder against the even, pending accounts. even
> 114575,  Customer#000114575,  xqLzTzY0,QvqwlSPI8OLxjRQ4s2W7pkSWwK,  16,  
> 26-303-921-2836,  6663.68,  AUTOMOBILE,  le fluffily final deposits. 
> furiously regu
> ,  21,  31-264-911-5053,  ,  HOUSEHOLD,  0.0,  ,  
> ,  IexCQQNp7tsMK63QKrGw37H3JJXGPaXBk,  18,  ,  4313.01,  0.0,   the never 
> pending accounts. slyly fluffy pinto beans run fluffily. furiously ,  
> ,  ,  ,  ,  ,  ,  ,  
> ,  152.95,  MACHINERY,  ,  ,  ,  ,  
> ,  t the ironic, close accounts are careful,  ,  ,  ,  ,  ,  
> ,  20,  30-481-475-8163,  ,  AUTOMOBILE,  0.0,  ,  
> ,  ,  ,  ,  ,  ,  ,  
> ,  MACHINERY,  ts use slyly even dependencie,  ,  ,  ,  ,  
> ,  ,  ,  ,  ,  ,  ,  
> ,  24,  34-639-456-9692,  ,  FURNITURE,  0.0,  ,  
> ,  ,  ,  ,  ,  ,  ,  
> 114575,  ,  ,  ,  ,  ,  ,  
> 34,  Customer#011457534,  wFUkCU67OxuxvfQeSdvSMDtMB7DWt7jiw,  2,  
> 12-145-168-8442,  145.78,  MACHINERY,  ic accounts. ironic, final ideas sleep 
> qu
> ,  XPP8pRDTDs4MFMP7SSlv,  17,  ,  5437.09,  0.0,  egular requests cajole 
> slyly after the ,  
> ,  blithely along the regular, daring deposits. ironic acco,  ,  ,  ,  ,  ,  
> ,  12,  22-656-233-3821,  ,  HOUSEHOLD,  0.0,  ,  
> 14711665,  Customer#0,  ,  ,  ,  ,  ,  
> 14711665,  QKTarsTkX7,  19,  ,  7017.62,  0.0,  ly after the carefully ironic 
> theodolites. pending requests are slyly across the deposits. even accounts 
> boost. fina,  
> (20 rows, 8.964 sec, 1.2 KiB selected)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
