[jira] [Created] (CARBONDATA-933) performance benchmarking of carbon data on hive and orc on hive using compare test

2017-04-16 Thread anubhav tarar (JIRA)
anubhav tarar created CARBONDATA-933:


 Summary: performance benchmarking of carbon data on hive and orc 
on hive using compare test
 Key: CARBONDATA-933
 URL: https://issues.apache.org/jira/browse/CARBONDATA-933
 Project: CarbonData
  Issue Type: Task
  Components: hive-integration
Affects Versions: 1.1.0-incubating
 Environment: hive,spark2.1
Reporter: anubhav tarar
Assignee: anubhav tarar
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-826) Create carbondata-connector of presto for supporting presto query carbon data

2017-03-27 Thread Liang Chen (JIRA)
Liang Chen created CARBONDATA-826:
-

 Summary: Create carbondata-connector of presto for supporting 
presto query carbon data
 Key: CARBONDATA-826
 URL: https://issues.apache.org/jira/browse/CARBONDATA-826
 Project: CarbonData
  Issue Type: Sub-task
  Components: presto-integration
Reporter: Liang Chen
Assignee: Liang Chen
Priority: Minor


1. In the CarbonData project, generate the carbondata-connector for Presto.
2. Copy the carbondata-connector to presto/plugin/.
3. Run queries in Presto to read carbon data (see the sketch below).
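As a hedged sketch of step 3, a query through Presto's JDBC driver (the host, catalog, and table names are illustrative assumptions, not taken from this issue):

import java.sql.DriverManager

// Hypothetical: query a carbon table through the new connector, assuming
// it is registered as a "carbondata" catalog in Presto.
object PrestoCarbonQuery extends App {
  val conn = DriverManager.getConnection(
    "jdbc:presto://localhost:8080/carbondata/default", "test", null)
  val rs = conn.createStatement().executeQuery(
    "SELECT * FROM carbon_table LIMIT 10")
  while (rs.next()) println(rs.getString(1))
  conn.close()
}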



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-789) Result does not display while using join in Carbon data.

2017-03-17 Thread Vinod Rohilla (JIRA)
Vinod Rohilla created CARBONDATA-789:


 Summary: Result does not display while using join in Carbon data. 
 Key: CARBONDATA-789
 URL: https://issues.apache.org/jira/browse/CARBONDATA-789
 Project: CarbonData
  Issue Type: Bug
  Components: data-query
Affects Versions: 1.1.0-incubating
 Environment: Spark 2.1
Reporter: Vinod Rohilla
Priority: Trivial
 Attachments: 2000_UniqData.csv

The result does not display for the Carbon data query.

Steps to Reproduce:
A) Create Table in Hive:

First table:
CREATE TABLE uniqdata_nobucket11_Hive (CUST_ID int,CUST_NAME 
String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ timestamp, BIGINT_COLUMN1 
bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2 
decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1 
int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

First table Load:
LOAD DATA LOCAL INPATH '/home/vinod/Desktop/AllCSV/2000_UniqData.csv' OVERWRITE 
INTO TABLE uniqdata_nobucket11_Hive;

Second table:
CREATE TABLE uniqdata_nobucket22_Hive (CUST_ID int,CUST_NAME 
String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ timestamp, BIGINT_COLUMN1 
bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2 
decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1 
int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

Second table Load:
LOAD DATA LOCAL INPATH '/home/vinod/Desktop/AllCSV/2000_UniqData.csv' OVERWRITE 
INTO TABLE uniqdata_nobucket22_Hive;


Results in Hive (the join returns 24 columns, 12 from each table; one sample
row shown, with both sides identical since both tables were loaded from the
same csv):

CUST_ID: 10999
CUST_NAME: CUST_NAME_01999
ACTIVE_EMUI_VERSION: ACTIVE_EMUI_VERSION_01999
DOB: 1975-06-23 01:00:03.0
DOJ: 1975-06-23 02:00:03.0
BIGINT_COLUMN1: 123372038853
BIGINT_COLUMN2: -223372034855
DECIMAL_COLUMN1: 12345680900.123400
DECIMAL_COLUMN2: 22345680900.123400
Double_COLUMN1: 1.12345674897976E10
Double_COLUMN2: -1.12345674897976E10
INTEGER_COLUMN1: 2000

2,001 rows selected (3.369 seconds)


B) Create table in Carbon data 
First Table:
CREATE TABLE uniqdata_nobucket11 (CUST_ID int,CUST_NAME 
String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ timestamp, BIGINT_COLUMN1 
bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2 
decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1 
int) STORED BY 'org.apache.carbondata.format' ;

Load Data in table:

LOAD DATA INPATH 'hdfs://localhost:54310/2000_UniqData.csv' into table 
uniqdata_nobucket11 OPTIONS('DELIMITER'=',' , 
'QUOTECHAR'='"','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION,DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2,Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1')


Create Second table:
CREATE TABLE uniqdata_nobucket22 (CUST_ID int,CUST_NAME 
String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ timestamp, BIGINT_COLUMN1 
bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2 
decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1 
int) STORED BY 'org.apache.carbondata.format';
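The message is truncated here in the archive. As a hedged sketch only, the analogous join on the Carbon tables would presumably mirror the Hive steps above (the exact query from the report is not preserved, and the join condition is an assumption):

// Hypothetical reproduction; "spark" is a CarbonData-enabled SparkSession
// and the CUST_ID join condition is assumed by analogy with the Hive steps.
spark.sql(
  """SELECT * FROM uniqdata_nobucket11 t1
    |JOIN uniqdata_nobucket22 t2 ON t1.CUST_ID = t2.CUST_ID""".stripMargin
).show()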

[jira] [Created] (CARBONDATA-695) Create DataFrame example in example/spark2, read carbon data to dataframe

2017-02-04 Thread Liang Chen (JIRA)
Liang Chen created CARBONDATA-695:
-

 Summary: Create DataFrame example in example/spark2,  read carbon 
data to dataframe
 Key: CARBONDATA-695
 URL: https://issues.apache.org/jira/browse/CARBONDATA-695
 Project: CarbonData
  Issue Type: Improvement
  Components: examples
Affects Versions: 1.0.0-incubating
Reporter: Liang Chen
Priority: Minor
 Fix For: 1.1.0-incubating


Create a DataFrame example in example/spark2 that reads carbon data into a dataframe.
For spark2, a schema needs to be defined to read carbon data; this is different 
from spark1.
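A minimal sketch of what such an example could look like (hedged: the "carbondata" datasource name is the one used elsewhere on this list, and the table name is illustrative):

import org.apache.spark.sql.SparkSession

// Build a SparkSession; in a real example this would be the
// CarbonData-enabled session from the examples module.
val spark = SparkSession.builder()
  .appName("CarbonDataFrameExample")
  .master("local")
  .getOrCreate()

// Read a carbon table into a DataFrame. With spark2 the schema is resolved
// from the table metadata, which is the difference from spark1 noted above.
val df = spark.read
  .format("carbondata")
  .option("tableName", "carbon_table")
  .load()

df.printSchema()
df.show()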



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-690) Carbon data load fails with default option for USE_KETTLE(False)

2017-01-31 Thread Ramakrishna (JIRA)
Ramakrishna created CARBONDATA-690:
--

 Summary: Carbon data load fails with default option for 
USE_KETTLE(False)
 Key: CARBONDATA-690
 URL: https://issues.apache.org/jira/browse/CARBONDATA-690
 Project: CarbonData
  Issue Type: Bug
 Environment: Spark 2.1
Reporter: Ramakrishna
Priority: Minor


When the load query is run with the default option for USE_KETTLE, it fails at 
mdkey generation.
Sample query and error:
LOAD DATA  inpath 
'hdfs://hacluster/user/OSCON/sparkhive/warehouse/communication.db/flow_text_1/20140113_0_120.csv'
 into table flow_carbon options('USE_KETTLE'='FALSE', 'DELIMITER'=',', 
'QUOTECHAR'='"','FILEHEADER'='aco_ac,ac_dte,txn_cnt,jrn_par,mfm_jrn_no,cbn_jrn_no,ibs_jrn_no,vch_no,vch_seq,srv_cde,cus_no,bus_cd_no,id_flg,cus_ac,bv_cde,bv_no,txn_dte,txn_time,txn_tlr,txn_bk,txn_br,ety_tlr,ety_bk,ety_br,bus_pss_no,chk_flg,chk_tlr,chk_jrn_no,bus_sys_no,bus_opr_cde,txn_sub_cde,fin_bus_cde,fin_bus_sub_cde,opt_prd_cde,chl,tml_id,sus_no,sus_seq,cho_seq,itm_itm,itm_sub,itm_sss,dc_flg,amt,bal,ccy,spv_flg,vch_vld_dte,pst_bk,pst_br,ec_flg,aco_tlr,opp_ac,opp_ac_nme,opp_bk,gen_flg,his_rec_sum_flg,his_flg,vch_typ,val_dte,opp_ac_flg,cmb_flg,ass_vch_flg,cus_pps_flg,bus_rmk_cde,vch_bus_rmk,tec_rmk_cde,vch_tec_rmk,rsv_ara,own_br,own_bk,gems_last_upd_d,gems_last_upd_d_bat,maps_date,maps_job,dt');
Error: java.lang.Exception: DataLoad failure: There is an unexpected error: 
unable to generate the mdkey (state=,code=0)




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Support of Float Data Type in Carbon Data

2016-12-15 Thread Anurag Srivastava
Hi,

Carbon Data does not support the Float data type.
Do we need to fix this Jira issue [CARBONDATA-390]
<https://issues.apache.org/jira/browse/CARBONDATA-390>?

I think the float data type should have its own range.
So do we need to support a range for the Float data type?

Proposed Solution:

We have to make changes in the following files:

We have to add parsing support in the CarbonSqlParser class.
We have to add the Float data type to DataTypeConverterUtil (a hypothetical sketch follows).
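A hypothetical sketch of the DataTypeConverterUtil change (the names and shape are assumed for illustration, not copied from the CarbonData source):

// Toy converter: map SQL type names to carbon types; FLOAT is the new case.
sealed trait CarbonType
case object IntType extends CarbonType
case object StringType extends CarbonType
case object DoubleType extends CarbonType
case object FloatType extends CarbonType // newly supported type

def convertToCarbonType(typeName: String): CarbonType =
  typeName.toLowerCase match {
    case "int" | "integer" => IntType
    case "string"          => StringType
    case "double"          => DoubleType
    case "float"           => FloatType // added alongside the parser change
    case other             =>
      throw new IllegalArgumentException(s"Unsupported type: $other")
  }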

-- 
*Thanks & Regards*


*Anurag Srivastava* | *Software Consultant*
*Knoldus Software LLP*

*India - US - Canada*
* Twitter <http://www.twitter.com/anuragknoldus> | FB
<http://www.facebook.com/anuragsrivastava.06> | LinkedIn
<https://in.linkedin.com/pub/anurag-srivastava/5a/b6/441>*


Re: [Discussion] Please vote and comment for carbon data file format change

2016-12-10 Thread Jean-Baptiste Onofré
+1

Regards
JB

On Dec 10, 2016, at 09:33, "bill.zhou" wrote:
>+1  this modification will help all the scenario
>
>Kumar Vishal wrote
>> ​Hello All,
>> 
>> Improving carbon first time query performance
>> 
>> Reason:
>> 1. As file system cache is cleared file reading will make it slower
>to
>> read
>> and cache
>> 2. In first time query carbon will have to read the footer from file
>data
>> file to form the btree
>> 3. Carbon reading more footer data than its required(data chunk)
>> 4. There are lots of random seek is happening in carbon as column
>> data(data
>> page, rle, inverted index) are not stored together.
>> 
>> Solution:
>> 1. Improve block loading time. This can be done by removing data
>chunk
>> from
>> blockletInfo and storing only offset and length of data chunk
>> 2. compress presence meta bitset stored for null values for measure
>column
>> using snappy
>> 3. Store the metadata and data of a column together and read together
>this
>> reduces random seek and improve IO
>> 
>> For this I am planing to change the carbondata thrift format
>> 
>> *Old format*
>> 
>> 
>> 
>> *New format*
>> 
>> 
>> 
>> *​*
>> 
>> Please vote and comment for this new format change
>> 
>> -Regards
>> Kumar Vishal
>
>
>
>
>
>--
>View this message in context:
>http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-Please-vote-and-comment-for-carbon-data-file-format-change-tp2491p4049.html
>Sent from the Apache CarbonData Mailing List archive mailing list
>archive at Nabble.com.


Re: [Discussion] Please vote and comment for carbon data file format change

2016-12-10 Thread bill.zhou
+1, this modification will help all the scenarios.

Kumar Vishal wrote
> ​Hello All,
> 
> Improving carbon first time query performance
> 
> Reason:
> 1. As file system cache is cleared file reading will make it slower to
> read
> and cache
> 2. In first time query carbon will have to read the footer from file data
> file to form the btree
> 3. Carbon reading more footer data than its required(data chunk)
> 4. There are lots of random seek is happening in carbon as column
> data(data
> page, rle, inverted index) are not stored together.
> 
> Solution:
> 1. Improve block loading time. This can be done by removing data chunk
> from
> blockletInfo and storing only offset and length of data chunk
> 2. compress presence meta bitset stored for null values for measure column
> using snappy
> 3. Store the metadata and data of a column together and read together this
> reduces random seek and improve IO
> 
> For this I am planing to change the carbondata thrift format
> 
> *Old format*
> 
> 
> 
> *New format*
> 
> 
> 
> *​*
> 
> Please vote and comment for this new format change
> 
> -Regards
> Kumar Vishal





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-Please-vote-and-comment-for-carbon-data-file-format-change-tp2491p4049.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.


Re: [Discussion] Please vote and comment for carbon data file format change

2016-12-09 Thread jarray888
+1, the current data format has a first-time-query slowness issue; it should be fixed.



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-Please-vote-and-comment-for-carbon-data-file-format-change-tp2491p4018.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.

Re: [Discussion] Please vote and comment for carbon data file format change

2016-11-29 Thread Kumar Vishal
Hi All,
Please find the JIRA issue which I have raised for the above discussion.

https://issues.apache.org/jira/browse/CARBONDATA-458

-Regards
Kumar Vishal

On Tue, Nov 29, 2016 at 7:14 PM, Kumar Vishal 
wrote:

> Hi Jihong Ma,
> Please find the attachment.
>
> -Regards
> Kumar Vishal
>
> On Fri, Nov 4, 2016 at 12:16 AM, Jihong Ma  wrote:
>
>> Hi Kumar,
>>
>> Please place the proposed format changes in attachment or attach to the
>> associated JIRA, I would like to take a look.
>>
>> Thanks!
>>
>> Jihong
>>
>> -Original Message-
>> From: Jacky Li [mailto:jacky.li...@qq.com]
>> Sent: Thursday, November 03, 2016 7:54 AM
>> To: dev@carbondata.incubator.apache.org
>> Subject: Re: [Discussion] Please vote and comment for carbon data file
>> format change
>>
>> The proposed change is reasonable, +1.
>> But is there a plan to make the reader backward compatible with the old
>> format? So the impact to the current deployment is minimum.
>>
>> Regards,
>> Jacky
>>
>> > On November 2, 2016, at 12:38 AM, Kumar Vishal wrote:
>> >
>> >  Hi Xiaoqiao He,
>> >
>> > Please find the attachment.
>> >
>> > -Regards
>> > Kumar Vishal
>> >
>> > On Tue, Nov 1, 2016 at 9:27 PM, Xiaoqiao He > <mailto:xq.he2...@gmail.com>> wrote:
>> > Hi Kumar Vishal,
>> >
>> > I couldn't get Fig. of the file format, could you re-upload them?
>> > Thanks.
>> >
>> > Best Regards
>> >
>> > On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal > <mailto:kumarvishal1...@gmail.com>>
>> > wrote:
>> >
>> > >
>> > > ​Hello All,
>> > >
>> > > Improving carbon first time query performance
>> > >
>> > > Reason:
>> > > 1. As file system cache is cleared file reading will make it slower to
>> > > read and cache
>> > > 2. In first time query carbon will have to read the footer from file
>> data
>> > > file to form the btree
>> > > 3. Carbon reading more footer data than its required(data chunk)
>> > > 4. There are lots of random seek is happening in carbon as column
>> > > data(data page, rle, inverted index) are not stored together.
>> > >
>> > > Solution:
>> > > 1. Improve block loading time. This can be done by removing data chunk
>> > > from blockletInfo and storing only offset and length of data chunk
>> > > 2. compress presence meta bitset stored for null values for measure
>> column
>> > > using snappy
>> > > 3. Store the metadata and data of a column together and read together
>> this
>> > > reduces random seek and improve IO
>> > >
>> > > For this I am planing to change the carbondata thrift format
>> > >
>> > > *Old format*
>> > >
>> > >
>> > >
>> > > *New format*
>> > >
>> > >
>> > >
>> > > *​*
>> > >
>> > > Please vote and comment for this new format change
>> > >
>> > > -Regards
>> > > Kumar Vishal
>> > >
>> > >
>> > >
>> > >
>> >
>>
>>
>


Re: [Discussion] Please vote and comment for carbon data file format change

2016-11-29 Thread Kumar Vishal
Hi Jihong Ma,
Please find the attachment.

-Regards
Kumar Vishal

On Fri, Nov 4, 2016 at 12:16 AM, Jihong Ma  wrote:

> Hi Kumar,
>
> Please place the proposed format changes in attachment or attach to the
> associated JIRA, I would like to take a look.
>
> Thanks!
>
> Jihong
>
> -Original Message-
> From: Jacky Li [mailto:jacky.li...@qq.com]
> Sent: Thursday, November 03, 2016 7:54 AM
> To: dev@carbondata.incubator.apache.org
> Subject: Re: [Discussion] Please vote and comment for carbon data file
> format change
>
> The proposed change is reasonable, +1.
> But is there a plan to make the reader backward compatible with the old
> format? So the impact to the current deployment is minimum.
>
> Regards,
> Jacky
>
> > On November 2, 2016, at 12:38 AM, Kumar Vishal wrote:
> >
> >  Hi Xiaoqiao He,
> >
> > Please find the attachment.
> >
> > -Regards
> > Kumar Vishal
> >
> > On Tue, Nov 1, 2016 at 9:27 PM, Xiaoqiao He  <mailto:xq.he2...@gmail.com>> wrote:
> > Hi Kumar Vishal,
> >
> > I couldn't get Fig. of the file format, could you re-upload them?
> > Thanks.
> >
> > Best Regards
> >
> > On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal  <mailto:kumarvishal1...@gmail.com>>
> > wrote:
> >
> > >
> > > ​Hello All,
> > >
> > > Improving carbon first time query performance
> > >
> > > Reason:
> > > 1. As file system cache is cleared file reading will make it slower to
> > > read and cache
> > > 2. In first time query carbon will have to read the footer from file
> data
> > > file to form the btree
> > > 3. Carbon reading more footer data than its required(data chunk)
> > > 4. There are lots of random seek is happening in carbon as column
> > > data(data page, rle, inverted index) are not stored together.
> > >
> > > Solution:
> > > 1. Improve block loading time. This can be done by removing data chunk
> > > from blockletInfo and storing only offset and length of data chunk
> > > 2. compress presence meta bitset stored for null values for measure
> column
> > > using snappy
> > > 3. Store the metadata and data of a column together and read together
> this
> > > reduces random seek and improve IO
> > >
> > > For this I am planing to change the carbondata thrift format
> > >
> > > *Old format*
> > >
> > >
> > >
> > > *New format*
> > >
> > >
> > >
> > > *​*
> > >
> > > Please vote and comment for this new format change
> > >
> > > -Regards
> > > Kumar Vishal
> > >
> > >
> > >
> > >
> >
>
>


Re: carbon data

2016-11-29 Thread william
Try Code like this:
```
if (cc.tableNames().filter(f => f == _cfg.get("tableName").get).size == 0) {
  df.sqlContext.sql(s"DROP TABLE IF EXISTS ${_cfg.get("tableName").get}")
  writer.options(_cfg).mode(SaveMode.Overwrite).format(_format).save()
} else {
  writer.options(_cfg).mode(SaveMode.valueOf(_mode)).format(_format).save()
}
```
Only when the table is already created can you use SaveMode.Append; otherwise
you should use SaveMode.Overwrite to make CarbonData create the table for you.

On Tue, Nov 29, 2016 at 5:56 PM, Lu Cao  wrote:

> Thank you for the response Liang. I think I have followed the example but
> it still returns error:
>Data loading failed. table not found: default.carbontest
> attached my code below: I read data from a hive table with HiveContext and
> convert it to CarbonContext then generate the df and save to hdfs. I'm not
> sure whether it's correct or not when I generate the dataframe in
> sc.parallelize(sc.Files,
> 25) Do you have any other mothod we can use to generate DF?
>
> object SparkConvert {
>
>   def main(args: Array[String]): Unit = {
>
> val conf = new SparkConf().setAppName("CarbonTest")
>
> val sc = new SparkContext(conf)
>
> val path = "hdfs:///user/appuser/lucao/CarbonTest_001.carbon"
>
> val hqlContext = new HiveContext(sc)
>
> val df = hqlContext.sql("select * from default.test_data_all")
>
> println("the count is:" + df.count())
>
> val cc = createCarbonContext(df.sqlContext.sparkContext, path)
>
> writeDataFrame(cc, "CarbonTest", SaveMode.Append)
>
>
>
>   }
>
>
>
>   def createCarbonContext(sc : SparkContext, storePath : String):
> CarbonContext = {
>
> val cc = new CarbonContext(sc, storePath)
>
> cc
>
>   }
>
>
>
>   def writeDataFrame(cc : CarbonContext, tableName : String, mode :
> SaveMode) : Unit = {
>
> import cc.implicits._
>
> val sc = cc.sparkContext
>
> val df = sc.parallelize(sc.files,
> 25).toDF(“col1”,”col2”,”col3”..."coln")
>
> df.write
>
>   .format("carbondata")
>
>   .option("tableName", tableName)
>
>   .option("compress", "true")
>
>   .mode(mode)
>
>   .save()
>
>   }
>
>
>
> }
>



-- 
Best Regards
___
Broad vision, focused development
WilliamZhu   祝海林  zh...@csdn.net
Product Division - Infrastructure Platform - Search & Data Mining
Mobile: 18601315052
MSN: zhuhailin...@hotmail.com
Weibo: @PrinceCharmingJ  http://weibo.com/PrinceCharmingJ
Address: 12/F, Tower B, Fuma Building, Yard 33, Guangshun North Street, Chaoyang District, Beijing
___
http://www.csdn.net  You're the One
The world's largest Chinese IT developer community; it all starts with you

http://www.iteye.net
A community for in-depth developer exchange


Re: carbon data

2016-11-29 Thread Lu Cao
Thank you for the response Liang. I think I have followed the example but
it still returns an error:
   Data loading failed. table not found: default.carbontest
My code is attached below: I read data from a hive table with HiveContext,
convert it to a CarbonContext, then generate the df and save it to hdfs. I'm not
sure whether it's correct to generate the dataframe with sc.parallelize(sc.Files,
25). Do you have any other method we can use to generate the DF?

object SparkConvert {

  def main(args: Array[String]): Unit = {

val conf = new SparkConf().setAppName("CarbonTest")

val sc = new SparkContext(conf)

val path = "hdfs:///user/appuser/lucao/CarbonTest_001.carbon"

val hqlContext = new HiveContext(sc)

val df = hqlContext.sql("select * from default.test_data_all")

println("the count is:" + df.count())

val cc = createCarbonContext(df.sqlContext.sparkContext, path)

writeDataFrame(cc, "CarbonTest", SaveMode.Append)



  }



  def createCarbonContext(sc : SparkContext, storePath : String):
CarbonContext = {

val cc = new CarbonContext(sc, storePath)

cc

  }



  def writeDataFrame(cc : CarbonContext, tableName : String, mode :
SaveMode) : Unit = {

import cc.implicits._

val sc = cc.sparkContext

val df = sc.parallelize(sc.files,
25).toDF("col1", "col2", "col3"..."coln")

df.write

  .format("carbondata")

  .option("tableName", tableName)

  .option("compress", "true")

  .mode(mode)

  .save()

  }



}


Re: carbon data

2016-11-28 Thread Liang Chen
Hi Lionel

You don't need to create the table first; please find the example code in
ExampleUtils.scala:

df.write
  .format("carbondata")
  .option("tableName", tableName)
  .option("compress", "true")
  .option("useKettle", "false")
  .mode(mode)
  .save()

The API docs are being prepared.

Regards
Liang
2016-11-28 20:24 GMT+08:00 Lu Cao :

> Hi team,
> I'm trying to save spark dataframe to carbondata file. I see the example in
> your wiki
> option("tableName", "carbontable"). Does that mean I have to create a
> carbondata table first and then save data into the table? Can I save it
> directly without creating the carbondata table?
>
> the code is
> df.write.format("carbondata").mode(SaveMode.Append).save("
> hdfs:///user//data.carbon")
>
> BTW, do you have the formal api doc?
>
> Thanks,
> Lionel
>



-- 
Regards
Liang


carbon data

2016-11-28 Thread Lu Cao
Hi team,
I'm trying to save spark dataframe to carbondata file. I see the example in
your wiki
option("tableName", "carbontable"). Does that mean I have to create a
carbondata table first and then save data into the table? Can I save it
directly without creating the carbondata table?

the code is
df.write.format("carbondata").mode(SaveMode.Append).save("hdfs:///user//data.carbon")

BTW, do you have the formal api doc?

Thanks,
Lionel


[jira] [Created] (CARBONDATA-430) Carbon data tpch benchmark

2016-11-21 Thread suo tong (JIRA)
suo tong created CARBONDATA-430:
---

 Summary: Carbon data tpch benchmark
 Key: CARBONDATA-430
 URL: https://issues.apache.org/jira/browse/CARBONDATA-430
 Project: CarbonData
  Issue Type: Task
Reporter: suo tong
Assignee: suo tong






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [Feature] proposal for update and delete support in Carbon data

2016-11-15 Thread Xiaoqiao He
Hi Vinod,

It is an expected feature for many people, as Jacky mentioned. I think
Update/Delete should be a basic module for CarbonData; meanwhile, it is a
complex problem for a distributed storage system. The solution you proposed
is based on the traditional 'Base + Delta' approach, which has been applied
successfully in bigtable/hbase/kudu/etc. Following your proposed solution for
CarbonData, I have some confusion, including the doubts Jacky mentioned about
transactions and indexes:

1. How to trade off the IO overhead of adding delta files? I think there may be
two query approaches for delta files: (1) load the whole delta data and replace
the base query result if a row also exists in a delta file; in this case it may
increase the IO overhead that CarbonData tries to reduce as much as possible. (2)
Build a separate index for all delta files, or label delta records and upgrade
the file format. Right?
2. When and how to do minor/major compaction on (base + delta) or (delta +
delta)?
3. Are there any issues with updating or deleting a Directory item?

I look forward to the detailed design of your solution.

Please correct me if I am wrong.

Best Regards,
He Xiaoqiao


On Tue, Nov 15, 2016 at 5:39 PM, Jacky Li  wrote:

> Hi Vinod,
>
> It is great to have this feature, as there were many people asking for
> data update during the CarbonData meetup earlier. I believe it will be
> useful for many big data applications.
>
> For the solution you proposed, I have following doubts:
> 1. Data update is complex as if transaction is involved, so what kind of
> ACID level support are you thinking about?
> 2. If I understand correctly, you are proposing to do data update via base
> + delta file approach, right? So in this case, new file format needs to be
> added in CarbonData project.
> 3. As CarbonData has builtin support for index, any idea what is the
> impaction to the B tree index already in driver and executor memory?
>
> Regards,
> Jacky
>
> > On November 15, 2016, at 12:25 PM, Vinod KC wrote:
> >
> > Hi All
> > I would like to propose following new features in Carbon data
> > 1) Update statement to support modifying existing records in carbon data
> > table
> > 2) Delete statement to remove records from carbon data table
> >
> > A) Update operation: 'Update' features can be added to CarbonData using
> > intermediate Delta files [delete/update delta files] support with lesser
> > impact on existing code.
> > Update can be considered as a ‘delete’ followed by an‘insert’ operation.
> > Once an update is done on carbon data file, on select query operation,
> > Carbondata store reader can make use of delete delta data cache to
> exclude
> > deleted records in that segment and then include records from newly added
> > update delta files.
> >
> > B) Delete operation: In the case of delete operation, a delete delta file
> > will be added to each segment matching the records. During select query
> > operation Carbon data reader will exclude those deleted records from the
> > result set.
> >
> > Please share your suggestions and thoughts about design and functional
> > aspects on this feature. I’ll share a detailed design document about
> above
> > thoughts later.
> >
> > Regards
> > Vinod
>
>
>
>


Re: [Feature] proposal for update and delete support in Carbon data

2016-11-15 Thread Jacky Li
Hi Vinod,

It is great to have this feature, as there were many people asking for data 
update during the CarbonData meetup earlier. I believe it will be useful for 
many big data applications.

For the solution you proposed, I have the following doubts: 
1. Data update is complex when transactions are involved, so what kind of ACID-level 
support are you thinking about?
2. If I understand correctly, you are proposing to do data updates via a base + 
delta file approach, right? In this case, a new file format needs to be added 
to the CarbonData project. 
3. As CarbonData has built-in support for indexes, any idea what the impact is 
on the B tree index already in driver and executor memory?

Regards,
Jacky

> On November 15, 2016, at 12:25 PM, Vinod KC wrote:
> 
> Hi All
> I would like to propose following new features in Carbon data
> 1) Update statement to support modifying existing records in carbon data
> table
> 2) Delete statement to remove records from carbon data table
> 
> A) Update operation: 'Update' features can be added to CarbonData using
> intermediate Delta files [delete/update delta files] support with lesser
> impact on existing code.
> Update can be considered as a ‘delete’ followed by an‘insert’ operation.
> Once an update is done on carbon data file, on select query operation,
> Carbondata store reader can make use of delete delta data cache to exclude
> deleted records in that segment and then include records from newly added
> update delta files.
> 
> B) Delete operation: In the case of delete operation, a delete delta file
> will be added to each segment matching the records. During select query
> operation Carbon data reader will exclude those deleted records from the
> result set.
> 
> Please share your suggestions and thoughts about design and functional
> aspects on this feature. I’ll share a detailed design document about above
> thoughts later.
> 
> Regards
> Vinod





[Feature] proposal for update and delete support in Carbon data

2016-11-14 Thread Vinod KC
Hi All
I would like to propose the following new features in Carbon data:
1) An Update statement to support modifying existing records in a carbon data
table
2) A Delete statement to remove records from a carbon data table

A) Update operation: 'Update' features can be added to CarbonData using
intermediate delta files [delete/update delta files], with less impact on
existing code.
An update can be considered a 'delete' followed by an 'insert' operation.
Once an update is done on a carbon data file, on a select query operation the
Carbondata store reader can use the delete delta data cache to exclude
deleted records in that segment and then include records from the newly added
update delta files.

B) Delete operation: In the case of a delete operation, a delete delta file
will be added to each segment with matching records. During a select query
operation the Carbon data reader will exclude those deleted records from the
result set.
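As a toy illustration of the read path described above (all names are hypothetical; this sketches the idea, not the eventual design):

// Toy model: base rows plus a delete delta (row ids) and an update delta
// (replacement rows). An update is a delete followed by an insert.
case class Record(rowId: Long, values: Seq[Any])

def scanSegment(baseRows: Seq[Record],
                deletedRowIds: Set[Long],
                updateDeltaRows: Seq[Record]): Seq[Record] = {
  // Exclude records marked deleted, then include the update delta records.
  baseRows.filterNot(r => deletedRowIds.contains(r.rowId)) ++ updateDeltaRows
}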

Please share your suggestions and thoughts about the design and functional
aspects of this feature. I'll share a detailed design document about the above
thoughts later.

Regards
Vinod


RE: [Discussion] Please vote and comment for carbon data file format change

2016-11-03 Thread Jihong Ma
Hi Kumar, 

Please place the proposed format changes in an attachment or attach them to the 
associated JIRA; I would like to take a look. 

Thanks!

Jihong

-Original Message-
From: Jacky Li [mailto:jacky.li...@qq.com] 
Sent: Thursday, November 03, 2016 7:54 AM
To: dev@carbondata.incubator.apache.org
Subject: Re: [Discussion] Please vote and comment for carbon data file format 
change

The proposed change is reasonable, +1.
But is there a plan to make the reader backward compatible with the old format? 
So the impact to the current deployment is minimum.

Regards,
Jacky

> > On November 2, 2016, at 12:38 AM, Kumar Vishal wrote:
> 
>  Hi Xiaoqiao He,
>   
> Please find the attachment.
> 
> -Regards
> Kumar Vishal
> 
> On Tue, Nov 1, 2016 at 9:27 PM, Xiaoqiao He  <mailto:xq.he2...@gmail.com>> wrote:
> Hi Kumar Vishal,
> 
> I couldn't get Fig. of the file format, could you re-upload them?
> Thanks.
> 
> Best Regards
> 
> On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal  <mailto:kumarvishal1...@gmail.com>>
> wrote:
> 
> >
> > ​Hello All,
> >
> > Improving carbon first time query performance
> >
> > Reason:
> > 1. As file system cache is cleared file reading will make it slower to
> > read and cache
> > 2. In first time query carbon will have to read the footer from file data
> > file to form the btree
> > 3. Carbon reading more footer data than its required(data chunk)
> > 4. There are lots of random seek is happening in carbon as column
> > data(data page, rle, inverted index) are not stored together.
> >
> > Solution:
> > 1. Improve block loading time. This can be done by removing data chunk
> > from blockletInfo and storing only offset and length of data chunk
> > 2. compress presence meta bitset stored for null values for measure column
> > using snappy
> > 3. Store the metadata and data of a column together and read together this
> > reduces random seek and improve IO
> >
> > For this I am planing to change the carbondata thrift format
> >
> > *Old format*
> >
> >
> >
> > *New format*
> >
> >
> >
> > *​*
> >
> > Please vote and comment for this new format change
> >
> > -Regards
> > Kumar Vishal
> >
> >
> >
> >
> 



Re: [Discussion] Please vote and comment for carbon data file format change

2016-11-03 Thread Kumar Vishal
Dear Jacky,
   Yes, I am planning to support both data format readers (new and
old) and writers (new and old). By default the new writer will be enabled, but
if a user wants to write in the older format I will expose a configuration for
that. Please let me know if you have any other suggestions.

-Regards
Kumar Vishal

On Thu, Nov 3, 2016 at 8:24 PM, Jacky Li  wrote:

> The proposed change is reasonable, +1.
> But is there a plan to make the reader backward compatible with the old
> format? So the impact to the current deployment is minimum.
>
> Regards,
> Jacky
>
> > On November 2, 2016, at 12:38 AM, Kumar Vishal wrote:
> >
> >  Hi Xiaoqiao He,
> >
> > Please find the attachment.
> >
> > -Regards
> > Kumar Vishal
> >
> > On Tue, Nov 1, 2016 at 9:27 PM, Xiaoqiao He  > wrote:
> > Hi Kumar Vishal,
> >
> > I couldn't get Fig. of the file format, could you re-upload them?
> > Thanks.
> >
> > Best Regards
> >
> > On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal  >
> > wrote:
> >
> > >
> > > ​Hello All,
> > >
> > > Improving carbon first time query performance
> > >
> > > Reason:
> > > 1. As file system cache is cleared file reading will make it slower to
> > > read and cache
> > > 2. In first time query carbon will have to read the footer from file
> data
> > > file to form the btree
> > > 3. Carbon reading more footer data than its required(data chunk)
> > > 4. There are lots of random seek is happening in carbon as column
> > > data(data page, rle, inverted index) are not stored together.
> > >
> > > Solution:
> > > 1. Improve block loading time. This can be done by removing data chunk
> > > from blockletInfo and storing only offset and length of data chunk
> > > 2. compress presence meta bitset stored for null values for measure
> column
> > > using snappy
> > > 3. Store the metadata and data of a column together and read together
> this
> > > reduces random seek and improve IO
> > >
> > > For this I am planing to change the carbondata thrift format
> > >
> > > *Old format*
> > >
> > >
> > >
> > > *New format*
> > >
> > >
> > >
> > > *​*
> > >
> > > Please vote and comment for this new format change
> > >
> > > -Regards
> > > Kumar Vishal
> > >
> > >
> > >
> > >
> >
>
>


Re: [Discussion] Please vote and comment for carbon data file format change

2016-11-03 Thread Jacky Li
The proposed change is reasonable, +1.
But is there a plan to make the reader backward compatible with the old format, 
so that the impact to current deployments is minimal?

Regards,
Jacky

> On November 2, 2016, at 12:38 AM, Kumar Vishal wrote:
> 
>  Hi Xiaoqiao He,
>   
> Please find the attachment.
> 
> -Regards
> Kumar Vishal
> 
> On Tue, Nov 1, 2016 at 9:27 PM, Xiaoqiao He  > wrote:
> Hi Kumar Vishal,
> 
> I couldn't get Fig. of the file format, could you re-upload them?
> Thanks.
> 
> Best Regards
> 
> On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal  >
> wrote:
> 
> >
> > ​Hello All,
> >
> > Improving carbon first time query performance
> >
> > Reason:
> > 1. As file system cache is cleared file reading will make it slower to
> > read and cache
> > 2. In first time query carbon will have to read the footer from file data
> > file to form the btree
> > 3. Carbon reading more footer data than its required(data chunk)
> > 4. There are lots of random seek is happening in carbon as column
> > data(data page, rle, inverted index) are not stored together.
> >
> > Solution:
> > 1. Improve block loading time. This can be done by removing data chunk
> > from blockletInfo and storing only offset and length of data chunk
> > 2. compress presence meta bitset stored for null values for measure column
> > using snappy
> > 3. Store the metadata and data of a column together and read together this
> > reduces random seek and improve IO
> >
> > For this I am planing to change the carbondata thrift format
> >
> > *Old format*
> >
> >
> >
> > *New format*
> >
> >
> >
> > *​*
> >
> > Please vote and comment for this new format change
> >
> > -Regards
> > Kumar Vishal
> >
> >
> >
> >
> 



Re: [Discussion] Please vote and comment for carbon data file format change

2016-11-01 Thread Kumar Vishal
Hi Xiaoqiao He,

Please find the attachment.

-Regards
Kumar Vishal

On Tue, Nov 1, 2016 at 9:27 PM, Xiaoqiao He  wrote:

> Hi Kumar Vishal,
>
> I couldn't get Fig. of the file format, could you re-upload them?
> Thanks.
>
> Best Regards
>
> On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal 
> wrote:
>
> >
> > ​Hello All,
> >
> > Improving carbon first time query performance
> >
> > Reason:
> > 1. As file system cache is cleared file reading will make it slower to
> > read and cache
> > 2. In first time query carbon will have to read the footer from file data
> > file to form the btree
> > 3. Carbon reading more footer data than its required(data chunk)
> > 4. There are lots of random seek is happening in carbon as column
> > data(data page, rle, inverted index) are not stored together.
> >
> > Solution:
> > 1. Improve block loading time. This can be done by removing data chunk
> > from blockletInfo and storing only offset and length of data chunk
> > 2. compress presence meta bitset stored for null values for measure
> column
> > using snappy
> > 3. Store the metadata and data of a column together and read together
> this
> > reduces random seek and improve IO
> >
> > For this I am planing to change the carbondata thrift format
> >
> > *Old format*
> >
> >
> >
> > *New format*
> >
> >
> >
> > *​*
> >
> > Please vote and comment for this new format change
> >
> > -Regards
> > Kumar Vishal
> >
> >
> >
> >
>


Re: [Discussion] Please vote and comment for carbon data file format change

2016-11-01 Thread Xiaoqiao He
Hi Kumar Vishal,

I couldn't see the figures of the file format; could you re-upload them?
Thanks.

Best Regards

On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal 
wrote:

>
> ​Hello All,
>
> Improving carbon first time query performance
>
> Reason:
> 1. As file system cache is cleared file reading will make it slower to
> read and cache
> 2. In first time query carbon will have to read the footer from file data
> file to form the btree
> 3. Carbon reading more footer data than its required(data chunk)
> 4. There are lots of random seek is happening in carbon as column
> data(data page, rle, inverted index) are not stored together.
>
> Solution:
> 1. Improve block loading time. This can be done by removing data chunk
> from blockletInfo and storing only offset and length of data chunk
> 2. compress presence meta bitset stored for null values for measure column
> using snappy
> 3. Store the metadata and data of a column together and read together this
> reduces random seek and improve IO
>
> For this I am planing to change the carbondata thrift format
>
> *Old format*
>
>
>
> *New format*
>
>
>
> *​*
>
> Please vote and comment for this new format change
>
> -Regards
> Kumar Vishal
>
>
>
>


[Discussion] Please vote and comment for carbon data file format change

2016-11-01 Thread Kumar Vishal
Hello All,

Improving carbon first-time query performance

Reason:
1. As the file system cache is cleared, file reading becomes slower to read
and cache
2. On the first query, carbon has to read the footer from the data file to
form the btree
3. Carbon reads more footer data than it requires (data chunk)
4. There are lots of random seeks in carbon, as column data (data page, rle,
inverted index) are not stored together.

Solution:
1. Improve block loading time. This can be done by removing the data chunk from
blockletInfo and storing only the offset and length of the data chunk
2. Compress the presence meta bitset stored for null values of measure columns
using snappy
3. Store the metadata and data of a column together and read them together; this
reduces random seeks and improves IO

For this I am planning to change the carbondata thrift format

*Old format* and *New format* (figures were attached inline; they are not
preserved in this archive)
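As a rough Scala illustration of solution point 1 (field names are assumed for illustration; the actual change is to the thrift definitions):

// Old: blocklet metadata embeds full data chunk metadata, so forming the
// btree reads more footer data than needed.
case class DataChunkMeta(encodings: Seq[String], pageOffsets: Seq[Long])
case class BlockletInfoOld(numRows: Int, chunks: Seq[DataChunkMeta])

// New: blocklet metadata keeps only the offset and length of each chunk;
// the chunk metadata sits next to the column data and is read with it.
case class ChunkRef(offset: Long, length: Int)
case class BlockletInfoNew(numRows: Int, chunkRefs: Seq[ChunkRef])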

Please vote and comment on this new format change

-Regards
Kumar Vishal


Re: Discussion(New feature) Support Complex Data Type: Map in Carbon Data

2016-10-22 Thread cenyuhai
I think the default map delimiters should be the same as Hive's.



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-New-feature-Support-Complex-Data-Type-Map-in-Carbon-Data-tp1969p2239.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.


Re: Discussion(New feature) Support Complex Data Type: Map in Carbon Data

2016-10-17 Thread Vimal Das Kammath
The key in a map can only be a primitive data type. At present, Carbon
Data supports the following primitive data types: Integer, String, Timestamp,
Double and Decimal.
If in the future CarbonData adds support for more primitive data types, the same
can be used as keys in a Map.

The reason for restricting keys to primitive data types is that, if keys
were complex data types, then lookup by key in a query would not be
possible in the SQL statement.

On Mon, Oct 17, 2016 at 7:43 AM, Liang Chen  wrote:

> Hi Vimal
>
> Thank you started the discussion.
> For keys of Map data only can be primitive, can you list these type which
> will be supported? (Int,String,Double..
>
> For discussing more conveniently, you can go ahead to use google docs.
> After the design document finalized , please archive and upload it to
> cwiki:https://cwiki.apache.org/confluence/display/
> CARBONDATA/CarbonData+Home
>
> Regards
> Liang
>
>
> Vimal Das Kammath wrote
> > Hi All,
> >
> > This discussion is regarding support for Map Data type in Carbon Data.
> >
> > Carbon Data supports complex and nested data types such as Arrays and
> > Struts. However, Carbon Data does not support other complex data types
> > such
> > as Maps and Union which are generally supported by popular opensource
> file
> > formats.
> >
> >
> > Supporting Map data type will require changes/additions to the DDL, Query
> > Syntax, Data Loading and Storage.
> >
> >
> > I have hosted the design on google docs for review and discussion.
> >
> > https://docs.google.com/document/d/1U6wPohvdDHk0B7bONnVHWa6PKG8R9
> q5-oKMqzMMQHYY/edit?usp=sharing
> >
> >
> > Below is the same inline.
> >
> >
> > 1.  DDL Changes
> >
> > Maps are key->value data types and where the value can be fetched by
> > providing the key. Hence we need to restrict keys to primitive data types
> > whereas values can be of any data type supported in Carbon(primitive and
> > complex).
> >
> > Map data types can be defined in the create table DDL as :-
> >
> > “MAP<primitive_data_type, data_type>”
> >
> > For Example:-
> >
> > create table example_table (id Int, name String, salary Int,
> > salary_breakup
> > map<String, Int>, city String)
> >
> >
> > 2.  Data Loading Changes
> >
> > Carbon should be able to support loading data into tables with Map type
> > columns from csv files. It should be possible to represent maps in a
> > single
> > row of csv. This will need carbon to support specifying the delimiters
> for
> > :-
> >
> > 1. Between two Key-Value pairs
> >
> > 2. Between each Key and Value in a pair
> >
> > As Carbon already supports Strut and Array Complex types, the data
> loading
> > process already provides support for defining delimiters for complex data
> > types. Carbon provides two Optional parameters for data loading
> >
> > 1. COMPLEX_DELIMITER_LEVEL_1: will define the delimiter between two
> > Key-Value pairs
> >
> > OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='$')
> >
> > 2. COMPLEX_DELIMITER_LEVEL_2: will define the delimiter between each
> > Key and Value in a pair
> >
> > OPTIONS('COMPLEX_DELIMITER_LEVEL_2'=':')
> >
> > With these delimiter options, the below map can be represented in csv:-
> >
> > Fixed->100,000
> >
> > Bonus->30,000
> >
> > Stock->40,000
> >
> > As
> >
> > Fixed:100,000$Bonus:30,000$Stock:40,000 in the csv file.
> >
> >
> >
> > 3.  Query Capabilities
> >
> > A complex datatype like Map will require additional operators to be
> > supported in the query language to fully utilize the strength of the data
> > type.
> >
> > Maps are sequence of key-value pairs, hence should support looking up
> > value
> > for a given key. Users could use the ColumnName[“key”] syntax to lookup
> > values in a map column. For example: salary_breakup[“Fixed”] could be
> used
> > to fetch only the Fixed component in the salary breakup.
> >
> > In Addition, we also need to define how maps can be used in existing
> > constructs such as select, where(filter), group by etc..
> > 1. Select:- Map data type can be directly selected or only the value
> > for a given key can be selected as per the requirement. For
> > example:-“Select
> > name, salary, salary_breakup” will return the content of map long with
> > each
> > row.“Select name, salary, salary_breakup[“Fix

Re: Discussion(New feature) Support Complex Data Type: Map in Carbon Data

2016-10-16 Thread Ravindra Pesala
Hi Vimal,

The design doc looks clear. Can you also add the file format storage design for
the map datatype?

Regards,
Ravi.

On 17 October 2016 at 07:43, Liang Chen  wrote:

> Hi Vimal
>
> Thank you started the discussion.
> For keys of Map data only can be primitive, can you list these type which
> will be supported? (Int,String,Double..
>
> For discussing more conveniently, you can go ahead to use google docs.
> After the design document finalized , please archive and upload it to
> cwiki:https://cwiki.apache.org/confluence/display/
> CARBONDATA/CarbonData+Home
>
> Regards
> Liang
>
>
> Vimal Das Kammath wrote
> > Hi All,
> >
> > This discussion is regarding support for Map Data type in Carbon Data.
> >
> > Carbon Data supports complex and nested data types such as Arrays and
> > Struts. However, Carbon Data does not support other complex data types
> > such
> > as Maps and Union which are generally supported by popular opensource
> file
> > formats.
> >
> >
> > Supporting Map data type will require changes/additions to the DDL, Query
> > Syntax, Data Loading and Storage.
> >
> >
> > I have hosted the design on google docs for review and discussion.
> >
> > https://docs.google.com/document/d/1U6wPohvdDHk0B7bONnVHWa6PKG8R9
> q5-oKMqzMMQHYY/edit?usp=sharing
> >
> >
> > Below is the same inline.
> >
> >
> > 1.  DDL Changes
> >
> > Maps are key->value data types and where the value can be fetched by
> > providing the key. Hence we need to restrict keys to primitive data types
> > whereas values can be of any data type supported in Carbon(primitive and
> > complex).
> >
> > Map data types can be defined in the create table DDL as :-
> >
> > “MAP<primitive_data_type, data_type>”
> >
> > For Example:-
> >
> > create table example_table (id Int, name String, salary Int,
> > salary_breakup
> > map<String, Int>, city String)
> >
> >
> > 2.  Data Loading Changes
> >
> > Carbon should be able to support loading data into tables with Map type
> > columns from csv files. It should be possible to represent maps in a
> > single
> > row of csv. This will need carbon to support specifying the delimiters
> for
> > :-
> >
> > 1. Between two Key-Value pairs
> >
> > 2. Between each Key and Value in a pair
> >
> > As Carbon already supports Strut and Array Complex types, the data
> loading
> > process already provides support for defining delimiters for complex data
> > types. Carbon provides two Optional parameters for data loading
> >
> > 1. COMPLEX_DELIMITER_LEVEL_1: will define the delimiter between two
> > Key-Value pairs
> >
> > OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='$')
> >
> > 2. COMPLEX_DELIMITER_LEVEL_2: will define the delimiter between each
> > Key and Value in a pair
> >
> > OPTIONS('COMPLEX_DELIMITER_LEVEL_2'=':')
> >
> > With these delimiter options, the below map can be represented in csv:-
> >
> > Fixed->100,000
> >
> > Bonus->30,000
> >
> > Stock->40,000
> >
> > As
> >
> > Fixed:100,000$Bonus:30,000$Stock:40,000 in the csv file.
> >
> >
> >
> > 3.  Query Capabilities
> >
> > A complex datatype like Map will require additional operators to be
> > supported in the query language to fully utilize the strength of the data
> > type.
> >
> > Maps are sequence of key-value pairs, hence should support looking up
> > value
> > for a given key. Users could use the ColumnName[“key”] syntax to lookup
> > values in a map column. For example: salary_breakup[“Fixed”] could be
> used
> > to fetch only the Fixed component in the salary breakup.
> >
> > In Addition, we also need to define how maps can be used in existing
> > constructs such as select, where(filter), group by etc..
> > 1. Select:- Map data type can be directly selected or only the value
> > for a given key can be selected as per the requirement. For
> > example:-“Select
> > name, salary, salary_breakup” will return the content of map long with
> > each
> > row.“Select name, salary, salary_breakup[“Fixed”]” will return only one
> > value from the map whose key is “Fixed”2. Filter:-Map data type
> cannot
> > be directly used in a where clause as where clause can operate only on
> > primitive data types. However the map lookup operator can be used in
> where
> > clauses. For example:-“Select name, salar

Re: Discussion(New feature) Support Complex Data Type: Map in Carbon Data

2016-10-16 Thread Liang Chen
Hi Vimal

Thank you for starting the discussion.
Since keys of Map data can only be primitive, can you list the types that
will be supported? (Int, String, Double, ...)

For more convenient discussion, you can go ahead and use Google Docs. 
After the design document is finalized, please archive and upload it to the
cwiki: https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home

Regards
Liang


Vimal Das Kammath wrote
> Hi All,
> 
> This discussion is regarding support for Map Data type in Carbon Data.
> 
> Carbon Data supports complex and nested data types such as Arrays and
> Struts. However, Carbon Data does not support other complex data types
> such
> as Maps and Union which are generally supported by popular opensource file
> formats.
> 
> 
> Supporting Map data type will require changes/additions to the DDL, Query
> Syntax, Data Loading and Storage.
> 
> 
> I have hosted the design on google docs for review and discussion.
> 
> https://docs.google.com/document/d/1U6wPohvdDHk0B7bONnVHWa6PKG8R9q5-oKMqzMMQHYY/edit?usp=sharing
> 
> 
> Below is the same inline.
> 
> 
> 1.  DDL Changes
> 
> Maps are key->value data types and where the value can be fetched by
> providing the key. Hence we need to restrict keys to primitive data types
> whereas values can be of any data type supported in Carbon(primitive and
> complex).
> 
> Map data types can be defined in the create table DDL as :-
> 
> “MAP<primitive_data_type, data_type>”
> 
> For Example:-
> 
> create table example_table (id Int, name String, salary Int,
> salary_breakup
> map<String, Int>, city String)
> 
> 
> 2.  Data Loading Changes
> 
> Carbon should be able to support loading data into tables with Map type
> columns from csv files. It should be possible to represent maps in a
> single
> row of csv. This will need carbon to support specifying the delimiters for
> :-
> 
> 1. Between two Key-Value pairs
> 
> 2. Between each Key and Value in a pair
> 
> As Carbon already supports Strut and Array Complex types, the data loading
> process already provides support for defining delimiters for complex data
> types. Carbon provides two Optional parameters for data loading
> 
> 1. COMPLEX_DELIMITER_LEVEL_1: will define the delimiter between two
> Key-Value pairs
> 
> OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='$')
> 
> 2. COMPLEX_DELIMITER_LEVEL_2: will define the delimiter between each
> Key and Value in a pair
> 
> OPTIONS('COMPLEX_DELIMITER_LEVEL_2'=':')
> 
> With these delimiter options, the below map can be represented in csv:-
> 
> Fixed->100,000
> 
> Bonus->30,000
> 
> Stock->40,000
> 
> As
> 
> Fixed:100,000$Bonus:30,000$Stock:40,000 in the csv file.
> 
> 
> 
> 3.  Query Capabilities
> 
> A complex datatype like Map will require additional operators to be
> supported in the query language to fully utilize the strength of the data
> type.
> 
> Maps are sequence of key-value pairs, hence should support looking up
> value
> for a given key. Users could use the ColumnName[“key”] syntax to lookup
> values in a map column. For example: salary_breakup[“Fixed”] could be used
> to fetch only the Fixed component in the salary breakup.
> 
> In Addition, we also need to define how maps can be used in existing
> constructs such as select, where(filter), group by etc..
> 1. Select:- Map data type can be directly selected or only the value
> for a given key can be selected as per the requirement. For
> example:-“Select
> name, salary, salary_breakup” will return the content of map long with
> each
> row.“Select name, salary, salary_breakup[“Fixed”]” will return only one
> value from the map whose key is “Fixed”2. Filter:-Map data type cannot
> be directly used in a where clause as where clause can operate only on
> primitive data types. However the map lookup operator can be used in where
> clauses. For example:-“Select name, salary where
> salary_breakup[“Bonus”]>10,000”*Note: if the value is not of primitive
> type, further assessor operators need to be used depending on the type of
> value to arrive at a primitive type for the filter expression to be
> valid.*
> 3. Group By:- Just like with filters, maps cannot be directly used in
> a
> group by clause, however the lookup operator can be used.
> 
> 4. Functions:- A size() function can be provided for map types to
> determine the number of key-value pairs in a map.
> 4.  Storage changes
> 
> As Carbon is a columnar data store, Map values will be stored using 3
> physical columns
> 
> 1. One Column for representing the Map Data type. Will sto

Discussion(New feature) Support Complex Data Type: Map in Carbon Data

2016-10-15 Thread Vimal Das Kammath
Hi All,

This discussion is regarding support for Map Data type in Carbon Data.

Carbon Data supports complex and nested data types such as Arrays and
Struts. However, Carbon Data does not support other complex data types such
as Maps and Union which are generally supported by popular opensource file
formats.


Supporting Map data type will require changes/additions to the DDL, Query
Syntax, Data Loading and Storage.


I have hosted the design on google docs for review and discussion.

https://docs.google.com/document/d/1U6wPohvdDHk0B7bONnVHWa6PKG8R9q5-oKMqzMMQHYY/edit?usp=sharing


Below is the same inline.


1.  DDL Changes

Maps are key->value data types where the value can be fetched by
providing the key. Hence we need to restrict keys to primitive data types,
whereas values can be of any data type supported in Carbon (primitive and
complex).

Map data types can be defined in the create table DDL as:

“MAP<primitive_data_type, data_type>”

For Example:

create table example_table (id Int, name String, salary Int, salary_breakup
map<String, Int>, city String)


2.  Data Loading Changes

Carbon should be able to support loading data into tables with Map type
columns from csv files. It should be possible to represent maps in a single
row of csv. This will need carbon to support specifying the delimiters for
:-

1. Between two Key-Value pairs

2. Between each Key and Value in a pair

As Carbon already supports Strut and Array Complex types, the data loading
process already provides support for defining delimiters for complex data
types. Carbon provides two Optional parameters for data loading

1. COMPLEX_DELIMITER_LEVEL_1: will define the delimiter between two
Key-Value pairs

OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='$')

2. COMPLEX_DELIMITER_LEVEL_2: will define the delimiter between each
Key and Value in a pair

OPTIONS('COMPLEX_DELIMITER_LEVEL_2'=':')

With these delimiter options, the below map can be represented in csv:-

Fixed->100,000

Bonus->30,000

Stock->40,000

As

Fixed:100,000$Bonus:30,000$Stock:40,000 in the csv file.
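A hedged sketch of the corresponding load statement under this proposal (hypothetical, since map support is only being proposed here; cc is a CarbonContext as used elsewhere on this list, and the path is illustrative):

// Hypothetical load of example_table with the proposed map delimiters:
// '$' separates key-value pairs, ':' separates a key from its value.
cc.sql(
  """LOAD DATA INPATH 'hdfs://localhost:54310/example.csv'
    |INTO TABLE example_table
    |OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='$',
    |        'COMPLEX_DELIMITER_LEVEL_2'=':')""".stripMargin)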



3.  Query Capabilities

A complex datatype like Map will require additional operators to be
supported in the query language to fully utilize the strength of the data
type.

Maps are sequence of key-value pairs, hence should support looking up value
for a given key. Users could use the ColumnName[“key”] syntax to lookup
values in a map column. For example: salary_breakup[“Fixed”] could be used
to fetch only the Fixed component in the salary breakup.

In addition, we also need to define how maps can be used in existing
constructs such as select, where (filter), group by etc. (example queries
follow this list):

1. Select:- A map column can be selected directly, or only the value for a
given key can be selected, as required. For example, "Select name, salary,
salary_breakup" will return the content of the map along with each row,
whereas "Select name, salary, salary_breakup["Fixed"]" will return only the
value from the map whose key is "Fixed".

2. Filter:- A map column cannot be used directly in a where clause, as where
clauses can operate only on primitive data types. However, the map lookup
operator can be used in where clauses. For example: "Select name, salary
where salary_breakup["Bonus"] > 10,000". Note: if the value is not of a
primitive type, further accessor operators need to be used, depending on the
type of the value, to arrive at a primitive type for the filter expression
to be valid.

3. Group By:- Just like with filters, maps cannot be used directly in a
group by clause; however, the lookup operator can be used.

4. Functions:- A size() function can be provided for map types to determine
the number of key-value pairs in a map.
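
A short sketch of these constructs in query form (assuming the example_table
defined above; the exact syntax would be finalized during implementation):

-- lookup in a select
SELECT name, salary, salary_breakup['Fixed'] FROM example_table;

-- lookup in a where clause
SELECT name, salary FROM example_table
WHERE salary_breakup['Bonus'] > 10000;

-- lookup in a group by
SELECT salary_breakup['Fixed'], count(*) FROM example_table
GROUP BY salary_breakup['Fixed'];

-- proposed size() function
SELECT name, size(salary_breakup) FROM example_table;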
4.  Storage Changes

As Carbon is a columnar data store, Map values will be stored using 3
physical columns :-

1. One column representing the Map data type. It will store the number of
fields and the start index, in the same way as is done for Structs and
Arrays.

2. One column for the key.

3. One column for the value if the value is of a primitive data type; else
the value itself will span multiple physical columns, depending on the data
type of the value.

For example, the salary_breakup map would be stored as below (Column_1 holds
the pair count and start index for each row's map):

Column_1           | Column_2               | Column_3
Map_Salary_Breakup | Map_Salary_Breakup.key | Map_Salary_Breakup.value
-------------------|------------------------|-------------------------
3,1                | Fixed                  | 1,00,000
                   | Bonus                  | 30,000
                   | Stock                  | 40,000
2,4                | Fixed                  | 1,40,000
                   | Bonus                  | 30,000
3,6                | Fixed                  | 1,20,000
                   | Bonus                  | 20,000
                   | Stock                  | 30,000

Regards
Vimal


[jira] [Resolved] (CARBONDATA-12) Carbon data load bad record log file not renamed form inprogress to normal .log

2016-08-17 Thread Ravindra Pesala (JIRA)

 [ 
https://issues.apache.org/jira/browse/CARBONDATA-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravindra Pesala resolved CARBONDATA-12.
---
   Resolution: Fixed
Fix Version/s: 0.1.0-incubating

> Carbon data load bad record log file not renamed form inprogress to normal 
> .log
> ---
>
> Key: CARBONDATA-12
> URL: https://issues.apache.org/jira/browse/CARBONDATA-12
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Mohammad Shahid Khan
>Assignee: Mohammad Shahid Khan
> Fix For: 0.1.0-incubating
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (CARBONDATA-12) Carbon data load bad record log file not renamed form inprogress to normal .log

2016-08-17 Thread Ravindra Pesala (JIRA)

 [ 
https://issues.apache.org/jira/browse/CARBONDATA-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravindra Pesala reopened CARBONDATA-12:
---

> Carbon data load bad record log file not renamed form inprogress to normal 
> .log
> ---
>
> Key: CARBONDATA-12
> URL: https://issues.apache.org/jira/browse/CARBONDATA-12
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Mohammad Shahid Khan
>Assignee: Mohammad Shahid Khan
> Fix For: 0.1.0-incubating
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (CARBONDATA-9) Carbon data load bad record is not written into the bad record log file

2016-08-17 Thread Ravindra Pesala (JIRA)

 [ 
https://issues.apache.org/jira/browse/CARBONDATA-9?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravindra Pesala reopened CARBONDATA-9:
--

> Carbon data load bad record is not written into the bad record log file
> ---
>
> Key: CARBONDATA-9
> URL: https://issues.apache.org/jira/browse/CARBONDATA-9
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Mohammad Shahid Khan
>Assignee: Mohammad Shahid Khan
> Fix For: 0.1.0-incubating
>
>
> Load csv having bad records, the row having bad columns should be logged into 
> bad record file. The writing is failing due to FileNotFoundException: No 
> lease on file.
> Enviroment:
> 3 node cluster
> having three executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CARBONDATA-9) Carbon data load bad record is not written into the bad record log file

2016-08-17 Thread Ravindra Pesala (JIRA)

 [ 
https://issues.apache.org/jira/browse/CARBONDATA-9?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravindra Pesala resolved CARBONDATA-9.
--
   Resolution: Fixed
Fix Version/s: 0.1.0-incubating

> Carbon data load bad record is not written into the bad record log file
> ---
>
> Key: CARBONDATA-9
> URL: https://issues.apache.org/jira/browse/CARBONDATA-9
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Mohammad Shahid Khan
>Assignee: Mohammad Shahid Khan
> Fix For: 0.1.0-incubating
>
>
> Load csv having bad records, the row having bad columns should be logged into 
> bad record file. The writing is failing due to FileNotFoundException: No 
> lease on file.
> Enviroment:
> 3 node cluster
> having three executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (CARBONDATA-12) Carbon data load bad record log file not renamed form inprogress to normal .log

2016-07-04 Thread Mohammad Shahid Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/CARBONDATA-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammad Shahid Khan closed CARBONDATA-12.
--
Resolution: Fixed

Fixed as part of PR
https://github.com/HuaweiBigData/carbondata/pull/753

> Carbon data load bad record log file not renamed form inprogress to normal 
> .log
> ---
>
> Key: CARBONDATA-12
> URL: https://issues.apache.org/jira/browse/CARBONDATA-12
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Mohammad Shahid Khan
>Assignee: Mohammad Shahid Khan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (CARBONDATA-9) Carbon data load bad record is not written into the bad record log file

2016-07-04 Thread Mohammad Shahid Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/CARBONDATA-9?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammad Shahid Khan closed CARBONDATA-9.
-
Resolution: Fixed

Fixed as part of PR
https://github.com/HuaweiBigData/carbondata/pull/742

> Carbon data load bad record is not written into the bad record log file
> ---
>
> Key: CARBONDATA-9
> URL: https://issues.apache.org/jira/browse/CARBONDATA-9
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Mohammad Shahid Khan
>Assignee: Mohammad Shahid Khan
>
> Load csv having bad records, the row having bad columns should be logged into 
> bad record file. The writing is failing due to FileNotFoundException: No 
> lease on file.
> Enviroment:
> 3 node cluster
> having three executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (CARBONDATA-9) Carbon data load bad record is not written into the bad record log file

2016-06-25 Thread Mohammad Shahid Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/CARBONDATA-9?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammad Shahid Khan reassigned CARBONDATA-9:
-

Assignee: Mohammad Shahid Khan

> Carbon data load bad record is not written into the bad record log file
> ---
>
> Key: CARBONDATA-9
> URL: https://issues.apache.org/jira/browse/CARBONDATA-9
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Mohammad Shahid Khan
>Assignee: Mohammad Shahid Khan
>
> Load csv having bad records, the row having bad columns should be logged into 
> bad record file. The writing is failing due to FileNotFoundException: No 
> lease on file.
> Enviroment:
> 3 node cluster
> having three executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-12) Carbon data load bad record log file not renamed form inprogress to normal .log

2016-06-25 Thread Mohammad Shahid Khan (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15349753#comment-15349753
 ] 

Mohammad Shahid Khan commented on CARBONDATA-12:


Fixed 
pull request: https://github.com/HuaweiBigData/carbondata/pull/753

> Carbon data load bad record log file not renamed form inprogress to normal 
> .log
> ---
>
> Key: CARBONDATA-12
> URL: https://issues.apache.org/jira/browse/CARBONDATA-12
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Mohammad Shahid Khan
>Assignee: Mohammad Shahid Khan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (CARBONDATA-12) Carbon data load bad record log file not renamed form inprogress to normal .log

2016-06-25 Thread Mohammad Shahid Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/CARBONDATA-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammad Shahid Khan reassigned CARBONDATA-12:
--

Assignee: Mohammad Shahid Khan

> Carbon data load bad record log file not renamed form inprogress to normal 
> .log
> ---
>
> Key: CARBONDATA-12
> URL: https://issues.apache.org/jira/browse/CARBONDATA-12
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Mohammad Shahid Khan
>Assignee: Mohammad Shahid Khan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CARBONDATA-12) Carbon data load bad record log file not renamed form inprogress to normal .log

2016-06-25 Thread Mohammad Shahid Khan (JIRA)
Mohammad Shahid Khan created CARBONDATA-12:
--

 Summary: Carbon data load bad record log file not renamed form 
inprogress to normal .log
 Key: CARBONDATA-12
 URL: https://issues.apache.org/jira/browse/CARBONDATA-12
 Project: CarbonData
  Issue Type: Bug
Reporter: Mohammad Shahid Khan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CARBONDATA-9) Carbon data load bad record is not written into the bad record log file

2016-06-24 Thread Mohammad Shahid Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/CARBONDATA-9?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammad Shahid Khan updated CARBONDATA-9:
--
Description: 
Load csv having bad records, the row having bad columns should be logged into 
bad record file. The writing is failing due to FileNotFoundException: No lease 
on file.
Enviroment:
3 node cluster
having three executors.


  was:Load csv having bad records, the row having bad columns should be logged 
into bad record file. The writing is failing due to FileNotFoundException: No 
lease on file.


> Carbon data load bad record is not written into the bad record log file
> ---
>
> Key: CARBONDATA-9
> URL: https://issues.apache.org/jira/browse/CARBONDATA-9
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Mohammad Shahid Khan
>
> Load csv having bad records, the row having bad columns should be logged into 
> bad record file. The writing is failing due to FileNotFoundException: No 
> lease on file.
> Enviroment:
> 3 node cluster
> having three executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CARBONDATA-9) Carbon data load bad record is not written into the bad record log file

2016-06-24 Thread Mohammad Shahid Khan (JIRA)
Mohammad Shahid Khan created CARBONDATA-9:
-

 Summary: Carbon data load bad record is not written into the bad 
record log file
 Key: CARBONDATA-9
 URL: https://issues.apache.org/jira/browse/CARBONDATA-9
 Project: CarbonData
  Issue Type: Bug
Reporter: Mohammad Shahid Khan


Load csv having bad records, the row having bad columns should be logged into 
bad record file. The writing is failing due to FileNotFoundException: No lease 
on file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)