master timeSeries DATAMAP does not work well as 1.4.1

2018-09-04 Thread aaron
Hi All, 

It seems that the timeSeries DATAMAP on master does not work as well as in
1.4.1. Could you please have a look?


Demo data:

+-----------+-----------+-------------------+------------+-----------+----------+-------+
|market_code|device_code|               date|country_code|category_id|product_id|revenue|
+-----------+-----------+-------------------+------------+-----------+----------+-------+
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        10|  73481|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        11| 713316|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        12| 657503|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        13| 764930|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        14| 835665|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        15| 599234|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        16|  22451|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        17|  17284|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        18| 118846|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        19| 735783|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100010| 698596|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100011| 788919|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100012| 817443|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100013| 839801|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100014| 880020|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100015| 808019|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100016| 740226|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100017| 473469|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100018| 322765|
+-----------+-----------+-------------------+------------+-----------+----------+-------+

SQL:

carbon.sql("DROP TABLE IF EXISTS test_store_int")
val createMainTableSql = s"""
  | CREATE TABLE test_store_int(
  | market_code VARCHAR(50),
  | device_code VARCHAR(50),
  | date TIMESTAMP,
  | country_code CHAR(2),
  | category_id INTEGER,
  | product_id LONG,
  | revenue INTEGER
  | )
  | STORED BY 'carbondata'
  | TBLPROPERTIES(
  | 'SORT_COLUMNS'='market_code, device_code, country_code, category_id, date',
  | 'DICTIONARY_INCLUDE'='market_code, device_code, country_code, category_id, date, product_id',
  | 'NO_INVERTED_INDEX'='revenue',
  | 'SORT_SCOPE'='GLOBAL_SORT'
  | )
""".stripMargin
print(createMainTableSql)
carbon.sql(createMainTableSql)

carbon.sql("DROP DATAMAP test_store_int_agg_by_month ON TABLE
test_store_int")
val createTimeSeriesTableSql = s"""
  | CREATE DATAMAP test_store_int_agg_by_month ON TABLE test_store_int
  | USING 'timeSeries'
  | DMPROPERTIES (
  | 'EVENT_TIME'='date',
  | 'MONTH_GRANULARITY'='1')
  | AS SELECT date, market_code, device_code, country_code, category_id,
product_id, sum(revenue), count(revenue), min(revenue), max(revenue) FROM
test_store_int
  | GROUP BY date, market_code, device_code, country_code, category_id,
product_id
""".stripMargin
print(createTimeSeriesTableSql)
carbon.sql(createTimeSeriesTableSql)
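
(The data load step is not shown in this mail; for completeness, a minimal sketch of how the demo rows could be loaded is below. The CSV path, delimiter and header options are assumptions for illustration only, not taken from the actual setup.)

// Hypothetical load of the demo data above; the path and OPTIONS values are assumptions.
val loadSql = s"""
  | LOAD DATA INPATH 'hdfs://localhost:9000/demo/test_store_int.csv'
  | INTO TABLE test_store_int
  | OPTIONS('DELIMITER'='|', 'HEADER'='true')
""".stripMargin
print(loadSql)
carbon.sql(loadSql)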

Query plan:

1. By month: works (the datamap is used)
carbon.sql(s"""explain select market_code, device_code, country_code,
category_id, product_id, sum(revenue), timeseries(date, 'month') from
test_store_int group by timeseries(date, 'month'), market_code, device_code,
country_code, category_id, product_id""".stripMargin).show(200,
truncate=false)

|== CarbonData Profiler ==
Query rewrite based on DataMap:
 - test_store_int_agg_by_month (timeseries)
Table Scan on test_store_int_test_store_int_agg_by_month
 - total blocklets: 4
 - filter: none
 - pruned by Main DataMap
- skipped blocklets: 0

2. By year: does not work (the datamap is not used)
carbon.sql(s"""explain select market_code, device_code, country_code,
category_id, product_id, sum(revenue), timeseries(date, 'year') from
test_store_int group by timeseries(date, 'year'), market_code, device_code,
country_code, category_id, product_id""".stripMargin).show(200,
truncate=false)

|== CarbonData Profiler ==
Table Scan on test_store_int
 - total blocklets: 16
 - filter: none
 - pruned by Main DataMap
- skipped blocklets: 0 
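
A possible workaround sketch (not verified; it assumes 'YEAR_GRANULARITY' is accepted the same way as 'MONTH_GRANULARITY' above) is to create a separate year-granularity timeseries datamap so that the second query can also be rewritten:

carbon.sql("DROP DATAMAP IF EXISTS test_store_int_agg_by_year ON TABLE test_store_int")
val createYearDataMapSql = s"""
  | CREATE DATAMAP test_store_int_agg_by_year ON TABLE test_store_int
  | USING 'timeSeries'
  | DMPROPERTIES (
  | 'EVENT_TIME'='date',
  | 'YEAR_GRANULARITY'='1')
  | AS SELECT date, market_code, device_code, country_code, category_id,
  |    product_id, sum(revenue), count(revenue), min(revenue), max(revenue)
  | FROM test_store_int
  | GROUP BY date, market_code, device_code, country_code, category_id, product_id
""".stripMargin
print(createYearDataMapSql)
carbon.sql(createYearDataMapSql)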


Thanks
Aaron   






Re: Feature Proposal: CarbonCli tool

2018-09-04 Thread xuchuanyin
In the above example, you specify one directory and get two segments, but only
one schema is shown. I thought the number of schemas would be the same as the
number of data directories. Since you mentioned that nested folders are
supported, what if the schemas in these files are not the same?

Another problem:

SegmentID  Status             Load Start  Load End    Merged To  Format       Data Size  Index Size
0          Marked for Delete  2018-08-31  2018-08-31  NA         COLUMNAR_V3  NA         NA

Why is the data size for segment#0 NA? Will it affect the total data size of
the carbon table?



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] Updates to CarbonData documentation and structure

2018-09-04 Thread xuchuanyin
I think even if we split the carbondata commands into DDL and DML, it is still
too large for one document.

For example, there are many TBLProperties for creating a table in DDL. Some of
the TBLProperties descriptions are long, and we currently have no TOC for them,
so it is difficult to locate a particular property in the doc.

Besides, some parameters can be specified at the system configuration,
TBLProperties, and LoadOptions levels at the same time. Where should we
describe such a parameter?



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] Remove BTree related code

2018-09-04 Thread xuchuanyin
I found the PR on GitHub and left a comment. Here I copy the comment:

I have a doubt about the scenario below:
For sort_columns, the min/max values are ordered across all the blocks/blocklets
in one segment.

Suppose we are filtering on a sort_column and the filter is Col1='bb', and the
min/max ranges for blocklet#1, blocklet#2, blocklet#3 are [a,c), [c,d), [d,e).
After CarbonData finds that the max value of blocklet#1 already covers the
filter value 'bb', will it still compare 'bb' with the min/max of the remaining
blocklets #2/#3? I thought the BTree could be used to avoid these comparisons.

Am I wrong?
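
For what it's worth, below is a small self-contained sketch in plain Scala (illustrative only, not CarbonData's actual pruning code) of the two strategies being contrasted: comparing the filter value against every blocklet's min/max versus binary-searching the sorted ranges. BlockletRange and both methods are made-up names.

// Sketch of min/max pruning for an equality filter on a sort_column.
case class BlockletRange(id: Int, min: String, max: String)

object MinMaxPruning {
  // Linear pruning: looks at the min/max of every blocklet.
  def pruneLinear(blocklets: Seq[BlockletRange], value: String): Seq[Int] =
    blocklets.filter(b => b.min <= value && value <= b.max).map(_.id)

  // Binary-search pruning: valid when the ranges are sorted, as they are for
  // sort_columns within one segment. (Treats max as inclusive; a value equal to
  // a shared boundary would additionally need the neighbouring blocklet checked.)
  def pruneSorted(blocklets: IndexedSeq[BlockletRange], value: String): Seq[Int] = {
    var lo = 0
    var hi = blocklets.length - 1
    while (lo <= hi) {
      val mid = (lo + hi) / 2
      val b = blocklets(mid)
      if (value < b.min) hi = mid - 1
      else if (value > b.max) lo = mid + 1
      else return Seq(b.id)              // only this blocklet can contain the value
    }
    Seq.empty
  }

  def main(args: Array[String]): Unit = {
    val blocklets = IndexedSeq(
      BlockletRange(1, "a", "c"),
      BlockletRange(2, "c", "d"),
      BlockletRange(3, "d", "e"))
    println(pruneLinear(blocklets, "bb"))  // Vector(1): every blocklet examined
    println(pruneSorted(blocklets, "bb"))  // List(1): only O(log n) blocklets examined
  }
}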



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Feature Proposal: CarbonCli tool

2018-09-04 Thread Jacky Li
Hi All,

When I am tuning carbon performance, I very often want to check the metadata in 
carbon files without launching a spark shell or SQL. To do that, I am writing a 
tool that prints the metadata of a given data folder. 
Currently, I am planning it like this:

usage: CarbonCli
 -a,--all              print all information
 -b,--tblProperties    print table properties
 -c,--column           column to print statistics
 -cmd                  command to execute, supported commands are: summary
 -d,--detailSize       print each blocklet size
 -h,--help             print this message
 -m,--showSegment      print segment information
 -p,--path             the path which contains carbondata files, nested folder is supported
 -s,--schema           print the schema

In the first phase, I think the "summary" command is the highest priority, and 
developers can add more commands in the future.

A summary command example is below. One nice feature is that it can print the 
column min/max values visually as a percentage bar ("———"), so that users get a 
better sense of how effective the sort_columns chosen in create table are.
Please suggest if you have any good ideas for this tool.
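
To make the "———" idea concrete, here is a rough illustrative sketch (not the proposed CarbonCli code; the names, signature and bar width are assumptions) of how a blocklet's min/max for a column could be drawn as a bar positioned inside the column's global min/max:

// Sketch only: render a blocklet's [min, max] as a bar within the global [min, max].
object MinMaxBar {
  def render(min: Long, max: Long, globalMin: Long, globalMax: Long, width: Int = 40): String = {
    val span  = (globalMax - globalMin).max(1L)
    val start = ((min - globalMin) * width / span).toInt
    val end   = (((max - globalMin) * width / span).toInt).max(start + 1).min(width)
    " " * start + "-" * (end - start) + " " * (width - end)
  }

  def main(args: Array[String]): Unit = {
    // e.g. two blocklet ranges of l_orderkey inside a global range of [1, 6000000]
    println("|" + render(1L, 1500000L, 1L, 6000000L) + "|")        // bar near the left edge
    println("|" + render(4500000L, 6000000L, 1L, 6000000L) + "|")  // bar near the right edge
  }
}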

➜  target git:(summary) java -jar carbondata-sdk.jar 
org.apache.carbondata.CarbonCli -cmd summary -p 
/opt/carbonstore/tpchcarbon_default/lineitem -a -c l_orderkey
Data Folder: 
/Users/jacky/code/spark-2.2.1-bin-hadoop2.7/carbonstore/tpchcarbon_default/lineitem
## Summary
10 blocks, 1 shards, 10 blocklets, 375 pages, 11,997,996 rows, 514.64MB

## Schema
schema in part-0-0_batchno0-0-1-1535726689954.carbondata
version: V3
timestamp: 2018-08-31 22:08:48.268
Column Name      Data Type  Column Type  Property             Encoding                                         Schema Ordinal  Id
l_orderkey       INT        dimension    {sort_columns=true}  [INVERTED_INDEX]                                 0               *0587
l_linenumber     INT        dimension    {sort_columns=true}  [INVERTED_INDEX]                                 3               *c981
l_suppkey        STRING     dimension                         [INVERTED_INDEX]                                 2               *75ae
l_returnflag     STRING     dimension                         [INVERTED_INDEX]                                 8               *4ae9
l_linestatus     STRING     dimension                         [INVERTED_INDEX]                                 9               *d358
l_shipdate       DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  10              *7cd0
l_commitdate     DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  11              *b192
l_receiptdate    DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  12              *b0dd
l_shipinstruct   STRING     dimension                         [INVERTED_INDEX]                                 13              *5db3
l_shipmode       STRING     dimension                         [INVERTED_INDEX]                                 14              *2308
l_comment        STRING     dimension                         [INVERTED_INDEX]                                 15              *4cef
l_partkey        INT        measure                           []                                               1               *9bc7
l_quantity       DOUBLE     measure                           []                                               4               *418c
l_extendedprice  DOUBLE     measure                           []                                               5               *bf2c
l_discount       DOUBLE     measure                           []                                               6               *2085
l_tax            DOUBLE     measure                           []                                               7               *ad33

## Segment
SegmentID  Status             Load Start  Load End    Merged To  Format       Data Size  Index Size
0          Marked for Delete  2018-08-31  2018-08-31  NA         COLUMNAR_V3  NA         NA
1          Success            2018-08-31  2018-08-31  NA         COLUMNAR_V3  514.64MB   6.40KB

## Table Properties
Property Name  Property Value 
'sort_columns' 'l_orderkey,l_linenumber'  
'table_blocksize'  '64'   
'comment'  '' 
'bad_records_path' '' 
'local_dictionary_enable'  'false'

## Block Detail
Shard #1 (0_batchno0-0-1-1535726689954)
Block (PartNo)  Blocklet  #Pages  #Rows  Size 
0   0 40  128  54.90MB  
1   0 40  128  54.89MB  
2   0 40  128  

Re: [DISCUSSION] Updates to CarbonData documentation and structure

2018-09-04 Thread Liang Chen
Hi Raghu

+1, all these optimizations are very good.

Regards
Liang


sraghunandan wrote
> Dear All,
> 
>  I wanted to propose some updates and changes to our current
> documentation. Please let me know your inputs and comments.
> 
> 
> 1. Split our carbondata commands into DDL and DML
> 
> 2. Add Presto and Hive integration, along with Spark, to the quick start
> 
> 3. Add a master reference manual which lists all the commands supported in
> carbondata. This manual shall have links to the supported DDL and DML
> 
> 4. Add an introduction to carbondata covering architecture, design and
> supported features
> 
> 5. Merge the FAQ and troubleshooting documents into a single document
> 
> 6. Add a separate md file explaining to users how to navigate our
> documentation
> 
> 7. Add a TOC (Table of Contents) to all the md files which have multiple
> sections
> 
> 8. Add a list of supported properties at the beginning of each DDL or DML so
> that users know all the properties that are supported
> 
> 9. Rewrite the configuration property descriptions to explain each property
> in a bit more detail and also highlight when to use it and any caveats
> 
> 10. Reorder our configuration properties table to group properties by feature
> 
> 11. Update our webpage (carbondata.apache.org) to have better navigation
> for the documentation section
> 
> 12. Add use cases about carbondata usage and performance tuning tips
> 
> 
> Regards
> 
> Raghu





--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


[DISCUSSION] Updates to CarbonData documentation and structure

2018-09-04 Thread Raghunandan S
Dear All,

 I wanted to propose some updates and changes to our current
documentation. Please let me know your inputs and comments.


1. Split our carbondata commands into DDL and DML

2. Add Presto and Hive integration, along with Spark, to the quick start

3. Add a master reference manual which lists all the commands supported in
carbondata. This manual shall have links to the supported DDL and DML

4. Add an introduction to carbondata covering architecture, design and
supported features

5. Merge the FAQ and troubleshooting documents into a single document

6. Add a separate md file explaining to users how to navigate our
documentation

7. Add a TOC (Table of Contents) to all the md files which have multiple
sections

8. Add a list of supported properties at the beginning of each DDL or DML so
that users know all the properties that are supported

9. Rewrite the configuration property descriptions to explain each property
in a bit more detail and also highlight when to use it and any caveats

10. Reorder our configuration properties table to group properties by feature

11. Update our webpage (carbondata.apache.org) to have better navigation
for the documentation section

12. Add use cases about carbondata usage and performance tuning tips


Regards

Raghu


Re: error occur when I load data to s3

2018-09-04 Thread aaron
Hi kunalkapoor, I'd like give you more debug log as below.


application/x-www-form-urlencoded; charset=utf-8
Tue, 04 Sep 2018 06:45:10 GMT
/aa-sdk-test2/carbon-data/example/LockFiles/concurrentload.lock"
18/09/04 14:45:10 DEBUG request: Sending Request: GET
https://aa-sdk-test2.s3.us-east-1.amazonaws.com
/carbon-data/example/LockFiles/concurrentload.lock Headers: (Authorization:
AWS AKIAIAQX5F5B2MLQPRGQ:Ap8rHsiPQPYUdcBb2Ojb/MA9q+I=, User-Agent:
aws-sdk-java/1.7.4 Mac_OS_X/10.13.6
Java_HotSpot(TM)_64-Bit_Server_VM/25.144-b01/1.8.0_144, Range: bytes=0--1,
Date: Tue, 04 Sep 2018 06:45:10 GMT, Content-Type:
application/x-www-form-urlencoded; charset=utf-8, ) 
18/09/04 14:45:10 DEBUG PoolingClientConnectionManager: Connection request:
[route: {s}->https://aa-sdk-test2.s3.us-east-1.amazonaws.com:443][total kept
alive: 1; route allocated: 1 of 15; total allocated: 1 of 15]
18/09/04 14:45:10 DEBUG PoolingClientConnectionManager: Connection leased:
[id: 1][route:
{s}->https://aa-sdk-test2.s3.us-east-1.amazonaws.com:443][total kept alive:
0; route allocated: 1 of 15; total allocated: 1 of 15]
18/09/04 14:45:10 DEBUG SdkHttpClient: Stale connection check
18/09/04 14:45:10 DEBUG RequestAddCookies: CookieSpec selected: default
18/09/04 14:45:10 DEBUG RequestAuthCache: Auth cache not set in the context
18/09/04 14:45:10 DEBUG RequestProxyAuthentication: Proxy auth state:
UNCHALLENGED
18/09/04 14:45:10 DEBUG SdkHttpClient: Attempt 1 to execute request
18/09/04 14:45:10 DEBUG DefaultClientConnection: Sending request: GET
/carbon-data/example/LockFiles/concurrentload.lock HTTP/1.1
18/09/04 14:45:10 DEBUG wire:  >> "GET
/carbon-data/example/LockFiles/concurrentload.lock HTTP/1.1[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  >> "Host:
aa-sdk-test2.s3.us-east-1.amazonaws.com[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  >> "Authorization: AWS
AKIAIAQX5F5B2MLQPRGQ:Ap8rHsiPQPYUdcBb2Ojb/MA9q+I=[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  >> "User-Agent: aws-sdk-java/1.7.4
Mac_OS_X/10.13.6
Java_HotSpot(TM)_64-Bit_Server_VM/25.144-b01/1.8.0_144[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  >> "Range: bytes=0--1[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  >> "Date: Tue, 04 Sep 2018 06:45:10
GMT[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  >> "Content-Type:
application/x-www-form-urlencoded; charset=utf-8[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  >> "Connection: Keep-Alive[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  >> "[\r][\n]"
18/09/04 14:45:10 DEBUG headers: >> GET
/carbon-data/example/LockFiles/concurrentload.lock HTTP/1.1
18/09/04 14:45:10 DEBUG headers: >> Host:
aa-sdk-test2.s3.us-east-1.amazonaws.com
18/09/04 14:45:10 DEBUG headers: >> Authorization: AWS
AKIAIAQX5F5B2MLQPRGQ:Ap8rHsiPQPYUdcBb2Ojb/MA9q+I=
18/09/04 14:45:10 DEBUG headers: >> User-Agent: aws-sdk-java/1.7.4
Mac_OS_X/10.13.6 Java_HotSpot(TM)_64-Bit_Server_VM/25.144-b01/1.8.0_144
18/09/04 14:45:10 DEBUG headers: >> Range: bytes=0--1
18/09/04 14:45:10 DEBUG headers: >> Date: Tue, 04 Sep 2018 06:45:10 GMT
18/09/04 14:45:10 DEBUG headers: >> Content-Type:
application/x-www-form-urlencoded; charset=utf-8
18/09/04 14:45:10 DEBUG headers: >> Connection: Keep-Alive
18/09/04 14:45:10 DEBUG wire:  << "HTTP/1.1 200 OK[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  << "x-amz-id-2:
ooaOvIUsvupOOYOCVRY7y4TUanV9xJbcAqfd+w31xAkGRptm1blE5E5yMobmKsmRyGj9crhGCao=[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  << "x-amz-request-id:
A1AD0240EBDD2234[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  << "Date: Tue, 04 Sep 2018 06:45:11
GMT[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  << "Last-Modified: Tue, 04 Sep 2018 06:45:05
GMT[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  << "ETag:
"d41d8cd98f00b204e9800998ecf8427e"[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  << "Accept-Ranges: bytes[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  << "Content-Type:
application/octet-stream[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  << "Content-Length: 0[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  << "Server: AmazonS3[\r][\n]"
18/09/04 14:45:10 DEBUG wire:  << "[\r][\n]"
18/09/04 14:45:10 DEBUG DefaultClientConnection: Receiving response:
HTTP/1.1 200 OK
18/09/04 14:45:10 DEBUG headers: << HTTP/1.1 200 OK
18/09/04 14:45:10 DEBUG headers: << x-amz-id-2:
ooaOvIUsvupOOYOCVRY7y4TUanV9xJbcAqfd+w31xAkGRptm1blE5E5yMobmKsmRyGj9crhGCao=
18/09/04 14:45:10 DEBUG headers: << x-amz-request-id: A1AD0240EBDD2234
18/09/04 14:45:10 DEBUG headers: << Date: Tue, 04 Sep 2018 06:45:11 GMT
18/09/04 14:45:10 DEBUG headers: << Last-Modified: Tue, 04 Sep 2018 06:45:05
GMT
18/09/04 14:45:10 DEBUG headers: << ETag: "d41d8cd98f00b204e9800998ecf8427e"
18/09/04 14:45:10 DEBUG headers: << Accept-Ranges: bytes
18/09/04 14:45:10 DEBUG headers: << Content-Type: application/octet-stream
18/09/04 14:45:10 DEBUG headers: << Content-Length: 0
18/09/04 14:45:10 DEBUG headers: << Server: AmazonS3
18/09/04 14:45:10 DEBUG SdkHttpClient: Connection can be kept alive
indefinitely
18/09/04 14:45:10 DEBUG request: Received successful response: 200, AWS
Request ID: A1AD0240EBDD2234
18/09/04 14:45:10 DEBUG