Hi, I have generated tpcds data using https://github.com/brownsys/tpcds . And I have loaded the data using with kettle flow and with new flow, both gives same number of rows after using select count(*) query on the table. Even I have counted rows in excel file , it is matching with count query
Number of rows loaded store_returns : 288279 web_sales : 718931 Scripts : spark.sql(""" CREATE TABLE IF NOT EXISTS STORE_RETURNS (sr_returned_date_sk bigint, sr_return_time_sk bigint, sr_item_sk bigint, sr_customer_sk bigint, sr_cdemo_sk bigint, sr_hdemo_sk bigint, sr_addr_sk bigint, sr_store_sk bigint, sr_reason_sk bigint, sr_ticket_number bigint, sr_return_quantity bigint, sr_return_amt decimal(7,2), sr_return_tax decimal(7,2), sr_return_amt_inc_tax decimal(7,2), sr_fee decimal(7,2), sr_return_ship_cost decimal(7,2), sr_refunded_cash decimal(7,2), sr_reversed_charge decimal(7,2), sr_store_credit decimal(7,2), sr_net_loss decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES('dictionary_include'='sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk,sr_cdemo_sk,sr_hdemo_sk,sr_addr_sk,sr_store_sk,sr_reason_sk,sr_ticket_number') """) spark.sql("create table if not exists web_sales(ws_sold_date_sk integer, ws_sold_time_sk integer, ws_ship_date_sk integer, ws_item_sk integer, ws_bill_customer_sk integer, ws_bill_cdemo_sk integer, ws_bill_hdemo_sk integer, ws_bill_addr_sk integer, ws_ship_customer_sk integer, ws_ship_cdemo_sk integer, ws_ship_hdemo_sk integer, ws_ship_addr_sk integer, ws_web_page_sk integer, ws_web_site_sk integer, ws_ship_mode_sk integer, ws_warehouse_sk integer, ws_promo_sk integer, ws_order_number integer, ws_quantity integer, ws_wholesale_cost decimal(7,2), ws_list_price decimal(7,2), ws_sales_price decimal(7,2), ws_ext_discount_amt decimal(7,2), ws_ext_sales_price decimal(7,2), ws_ext_wholesale_cost decimal(7,2), ws_ext_list_price decimal(7,2), ws_ext_tax decimal(7,2), ws_coupon_amt decimal(7,2), ws_ext_ship_cost decimal(7,2), ws_net_paid decimal(7,2), ws_net_paid_inc_tax decimal(7,2), ws_net_paid_inc_ship decimal(7,2), ws_net_paid_inc_ship_tax decimal(7,2), ws_net_profit decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES ('DICTIONARY_INCLUDE'='ws_sold_date_sk, ws_sold_time_sk, ws_ship_date_sk, ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk, ws_bill_hdemo_sk, ws_bill_addr_sk, ws_ship_customer_sk, ws_ship_cdemo_sk, ws_ship_hdemo_sk, ws_ship_addr_sk, ws_web_page_sk, ws_web_site_sk, ws_ship_mode_sk, ws_warehouse_sk, ws_promo_sk, ws_order_number')") spark.sql(s""" LOAD DATA LOCAL INPATH '/home/root1/Downloads/store_returns.csv' into table STORE_RETURNS options('DELIMITER'='|', 'FILEHEADER'='sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk,sr_cdemo_sk,sr_hdemo_sk,sr_addr_sk,sr_store_sk,sr_reason_sk,sr_ticket_number,sr_return_quantity,sr_return_amt,sr_return_tax,sr_return_amt_inc_tax,sr_fee,sr_return_ship_cost,sr_refunded_cash,sr_reversed_charge,sr_store_credit,sr_net_loss', 'use_kettle'='false') """) spark.sql(s""" LOAD DATA LOCAL INPATH 'hdfs://localhost:9000/tpcds/web_sales/part-r-00000-dca70590-4d9d-4cc9-aff4-e20b85970d2b' into table web_sales options('DELIMITER'='|', 'FILEHEADER'='ws_sold_date_sk, ws_sold_time_sk, ws_ship_date_sk, ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk, ws_bill_hdemo_sk, ws_bill_addr_sk, ws_ship_customer_sk, ws_ship_cdemo_sk, ws_ship_hdemo_sk, ws_ship_addr_sk, ws_web_page_sk, ws_web_site_sk, ws_ship_mode_sk, ws_warehouse_sk, ws_promo_sk, ws_order_number, ws_quantity, ws_wholesale_cost, ws_list_price, ws_sales_price, ws_ext_discount_amt, ws_ext_sales_price, ws_ext_wholesale_cost, ws_ext_list_price, ws_ext_tax, ws_coupon_amt, ws_ext_ship_cost, ws_net_paid, ws_net_paid_inc_tax, ws_net_paid_inc_ship, ws_net_paid_inc_ship_tax, ws_net_profit', 'use_kettle'='false') """) Did I miss something here? Regards, Ravindra. On 16 February 2017 at 12:24, Yinwei Li <251469...@qq.com> wrote: > Hi Ravindra: > > > I add DICTIONARY_INCLUDE for each of them: > > > carbon.sql("create table if not exists _1g.store_returns(sr_returned_date_sk > integer, sr_return_time_sk integer, sr_item_sk integer, sr_customer_sk > integer, sr_cdemo_sk integer, sr_hdemo_sk integer, sr_addr_sk integer, > sr_store_sk integer, sr_reason_sk integer, sr_ticket_number integer, > sr_return_quantity integer, sr_return_amt decimal(7,2), sr_return_tax > decimal(7,2), sr_return_amt_inc_tax decimal(7,2), sr_fee decimal(7,2), > sr_return_ship_cost decimal(7,2), sr_refunded_cash decimal(7,2), > sr_reversed_charge decimal(7,2), sr_store_credit decimal(7,2), sr_net_loss > decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES > ('DICTIONARY_INCLUDE'='sr_returned_date_sk, sr_return_time_sk, > sr_item_sk, sr_customer_sk, sr_cdemo_sk, sr_hdemo_sk, sr_addr_sk, > sr_store_sk, sr_reason_sk, sr_ticket_number')"); > > > > > carbon.sql("create table if not exists _1g.web_sales(ws_sold_date_sk > integer, ws_sold_time_sk integer, ws_ship_date_sk integer, ws_item_sk > integer, ws_bill_customer_sk integer, ws_bill_cdemo_sk integer, > ws_bill_hdemo_sk integer, ws_bill_addr_sk integer, ws_ship_customer_sk > integer, ws_ship_cdemo_sk integer, ws_ship_hdemo_sk integer, > ws_ship_addr_sk integer, ws_web_page_sk integer, ws_web_site_sk integer, > ws_ship_mode_sk integer, ws_warehouse_sk integer, ws_promo_sk integer, > ws_order_number integer, ws_quantity integer, ws_wholesale_cost > decimal(7,2), ws_list_price decimal(7,2), ws_sales_price decimal(7,2), > ws_ext_discount_amt decimal(7,2), ws_ext_sales_price decimal(7,2), > ws_ext_wholesale_cost decimal(7,2), ws_ext_list_price decimal(7,2), > ws_ext_tax decimal(7,2), ws_coupon_amt decimal(7,2), ws_ext_ship_cost > decimal(7,2), ws_net_paid decimal(7,2), ws_net_paid_inc_tax decimal(7,2), > ws_net_paid_inc_ship decimal(7,2), ws_net_paid_inc_ship_tax decimal(7,2), > ws_net_profit decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES > ('DICTIONARY_INCLUDE'='ws_sold_date_sk, ws_sold_time_sk, ws_ship_date_sk, > ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk, ws_bill_hdemo_sk, > ws_bill_addr_sk, ws_ship_customer_sk, ws_ship_cdemo_sk, ws_ship_hdemo_sk, > ws_ship_addr_sk, ws_web_page_sk, ws_web_site_sk, ws_ship_mode_sk, > ws_warehouse_sk, ws_promo_sk, ws_order_number')"); > > > > and here is my script for generate tpc-ds data: > [hadoop@master tools]$ ./dsdgen -scale 1 -suffix '.csv' -dir > /data/tpc-ds/data/ > > > > > > > > > ------------------ 原始邮件 ------------------ > 发件人: "Ravindra Pesala";<ravi.pes...@gmail.com>; > 发送时间: 2017年2月16日(星期四) 下午3:15 > 收件人: "dev"<dev@carbondata.incubator.apache.org>; > > 主题: Re: 回复: data lost when loading data from csv file to carbon table > > > > Hi Yinwei, > > Can you provide create table scripts for both the tables store_returns and > web_sales. > > Regards, > Ravindra. > > On 16 February 2017 at 10:07, Ravindra Pesala <ravi.pes...@gmail.com> > wrote: > > > Hi Yinwei, > > > > Thank you for pointing out the issue, I will check with TPC-DS data and > > verify the data load with new flow. > > > > Regards, > > Ravindra. > > > > On 16 February 2017 at 09:35, QiangCai <qiang...@qq.com> wrote: > > > >> Maybe you can check PR594, it will fix a bug which will impact the > result > >> of > >> loading. > >> > >> > >> > >> -- > >> View this message in context: http://apache-carbondata-maili > >> ng-list-archive.1130556.n5.nabble.com/data-lost-when-load > >> ing-data-from-csv-file-to-carbon-table-tp7554p7639.html > >> Sent from the Apache CarbonData Mailing List archive mailing list > archive > >> at Nabble.com. > >> > > > > > > > > -- > > Thanks & Regards, > > Ravi > > > > > > -- > Thanks & Regards, > Ravi > -- Thanks & Regards, Ravi