master timeSeries DATAMAP does not work as well as 1.4.1
Hi All,

It seems that the master timeSeries DATAMAP does not work as well as it did in 1.4.1. Could you please have a look?

Demo data:

+-----------+-----------+-------------------+------------+-----------+----------+-------+
|market_code|device_code|               date|country_code|category_id|product_id|revenue|
+-----------+-----------+-------------------+------------+-----------+----------+-------+
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        10|  73481|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        11| 713316|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        12| 657503|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        13| 764930|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        14| 835665|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        15| 599234|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        16|  22451|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        17|  17284|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        18| 118846|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|        19| 735783|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100010| 698596|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100011| 788919|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100012| 817443|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100013| 839801|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100014| 880020|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100015| 808019|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100016| 740226|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100017| 473469|
|apple-store|  ios-phone|2018-02-01 00:00:00|          CA|          1|    100018| 322765|
+-----------+-----------+-------------------+------------+-----------+----------+-------+

SQL:

carbon.sql("DROP TABLE IF EXISTS test_store_int")

val createMainTableSql =
  s"""
     | CREATE TABLE test_store_int(
     |   market_code VARCHAR(50),
     |   device_code VARCHAR(50),
     |   date TIMESTAMP,
     |   country_code CHAR(2),
     |   category_id INTEGER,
     |   product_id LONG,
     |   revenue INTEGER
     | )
     | STORED BY 'carbondata'
     | TBLPROPERTIES(
     |   'SORT_COLUMNS'='market_code, device_code, country_code, category_id, date',
     |   'DICTIONARY_INCLUDE'='market_code, device_code, country_code, category_id, date, product_id',
     |   'NO_INVERTED_INDEX'='revenue',
     |   'SORT_SCOPE'='GLOBAL_SORT'
     | )
   """.stripMargin
print(createMainTableSql)
carbon.sql(createMainTableSql)

carbon.sql("DROP DATAMAP test_store_int_agg_by_month ON TABLE test_store_int")

val createTimeSeriesTableSql =
  s"""
     | CREATE DATAMAP test_store_int_agg_by_month ON TABLE test_store_int
     | USING 'timeSeries'
     | DMPROPERTIES (
     |   'EVENT_TIME'='date',
     |   'MONTH_GRANULARITY'='1')
     | AS SELECT date, market_code, device_code, country_code, category_id, product_id,
     |    sum(revenue), count(revenue), min(revenue), max(revenue)
     | FROM test_store_int
     | GROUP BY date, market_code, device_code, country_code, category_id, product_id
   """.stripMargin
print(createTimeSeriesTableSql)
carbon.sql(createTimeSeriesTableSql)

Query plan:

1. By month, works:

carbon.sql(s"""explain select market_code, device_code, country_code, category_id,
  product_id, sum(revenue), timeseries(date, 'month') from test_store_int
  group by timeseries(date, 'month'), market_code, device_code, country_code,
  category_id, product_id""".stripMargin).show(200, truncate = false)

== CarbonData Profiler ==
Query rewrite based on DataMap:
 - test_store_int_agg_by_month (timeseries)
Table Scan on test_store_int_test_store_int_agg_by_month
 - total blocklets: 4
 - filter: none
 - pruned by Main DataMap
    - skipped blocklets: 0

2. By year, does not work:

carbon.sql(s"""explain select market_code, device_code, country_code, category_id,
  product_id, sum(revenue), timeseries(date, 'year') from test_store_int
  group by timeseries(date, 'year'), market_code, device_code, country_code,
  category_id, product_id""".stripMargin).show(200, truncate = false)

== CarbonData Profiler ==
Table Scan on test_store_int
 - total blocklets: 16
 - filter: none
 - pruned by Main DataMap
    - skipped blocklets: 0

Thanks
Aaron
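
P.S. A hedged note in case it helps narrow this down: since each granularity of a timeseries datamap is defined as its own datamap, a possible workaround for the year query is to create a year-granularity datamap explicitly rather than relying on roll-up from the month one. A minimal sketch (the datamap name is illustrative):

carbon.sql(
  s"""
     | CREATE DATAMAP test_store_int_agg_by_year ON TABLE test_store_int
     | USING 'timeSeries'
     | DMPROPERTIES (
     |   'EVENT_TIME'='date',
     |   'YEAR_GRANULARITY'='1')
     | AS SELECT date, market_code, device_code, country_code, category_id, product_id,
     |    sum(revenue), count(revenue), min(revenue), max(revenue)
     | FROM test_store_int
     | GROUP BY date, market_code, device_code, country_code, category_id, product_id
   """.stripMargin)

If the year query is rewritten once this datamap exists, the regression is specifically in rolling year queries up from the month-granularity datamap.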
Re: Feature Proposal: CarbonCli tool
In the above example, you specify one directory and get two segments, but the tool only shows one schema. I thought the number of schemas would match the number of data directories. Since you mentioned that nested folders can be supported, what happens if the schemas in these files are not the same?

Another problem:

SegmentID  Status             Load Start  Load End    Merged To  Format       Data Size  Index Size
0          Marked for Delete  2018-08-31  2018-08-31  NA         COLUMNAR_V3  NA         NA

Why is the data size for segment #0 NA? Will it affect the total data size of the carbon table?
Re: [DISCUSSION] Updates to CarbonData documentation and structure
I think even if we split the carbondata commands into DDL and DML, each is still too large for one document. For example, there are many TBLPROPERTIES for creating a table in the DDL document. Some descriptions of the TBLPROPERTIES are long, and we currently have no TOC for them, so it is difficult to locate a particular property in the doc. Besides, some parameters can be specified at the system configuration, TBLPROPERTIES, and LoadOptions levels at the same time. Where should we describe such a parameter?
Re: [DISCUSSION] Remove BTree related code
I found the PR on GitHub and left a comment. I copy the comment here:

I have a doubt about the scenario below. For sort_columns, the min/max values are ordered across all the blocks/blocklets in one segment. Suppose we are filtering on a sort column and the filter looks like col1 = 'bb', and the min/max ranges for blocklet#1, blocklet#2, blocklet#3 are [a,c), [c,d), [d,e). After carbondata finds that the max value of blocklet#1 already covers the filter value 'bb', will it still compare the filter value 'bb' with the min/max of the remaining blocklets #2/#3? I thought the BTree could be used to avoid these comparisons. Am I wrong?
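
To make the question concrete, here is a small illustrative sketch in Scala (my own pseudocode, not CarbonData's actual pruning code) of the comparison I have in mind: because the min/max ranges of a sort column are ordered within a segment, a binary search can locate the matching blocklets without comparing the filter value against every blocklet's min/max.

// Illustrative only: prune blocklets for an equality filter on a sort column.
// Because the ranges are sorted, a binary search finds the first candidate in
// O(log n) comparisons; a linear scan would touch every blocklet's min/max.
case class BlockletRange(id: Int, min: String, max: String)

def pruneSorted(ranges: IndexedSeq[BlockletRange], value: String): Seq[BlockletRange] = {
  // Find the first range whose max is >= the filter value.
  var lo = 0
  var hi = ranges.length
  while (lo < hi) {
    val mid = (lo + hi) / 2
    if (ranges(mid).max < value) lo = mid + 1 else hi = mid
  }
  // Keep the (usually few) consecutive ranges whose min still covers the value.
  ranges.drop(lo).takeWhile(_.min <= value)
}

// The scenario from the mail: 'bb' falls only inside blocklet#1's range [a, c).
val ranges = IndexedSeq(
  BlockletRange(1, "a", "c"),
  BlockletRange(2, "c", "d"),
  BlockletRange(3, "d", "e"))
println(pruneSorted(ranges, "bb"))  // List(BlockletRange(1,a,c))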
Feature Proposal: CarbonCli tool
Hi All,

When I am tuning carbon performance, very often I want to check the metadata in carbon files without launching a spark shell or SQL. In order to do that, I am writing a tool to print the metadata information of a given data folder. Currently, I am planning to do it like this:

usage: CarbonCli
 -a,--all             print all information
 -b,--tblProperties   print table properties
 -c,--column          column to print statistics
 -cmd                 command to execute, supported commands are: summary
 -d,--detailSize      print each blocklet size
 -h,--help            print this message
 -m,--showSegment     print segment information
 -p,--path            the path which contains carbondata files, nested folder is supported
 -s,--schema          print the schema

In the first phase, I think the "summary" command is high priority, and developers can add more commands in the future. A summary command example is shown below. One good thing is that it can print out the column min/max values as percentages visually by using "———", so that users have a better understanding of the effectiveness of the sort_columns set in create table. Please suggest if you have any good ideas for this tool.

➜ target git:(summary) java -jar carbondata-sdk.jar org.apache.carbondata.CarbonCli -cmd summary -p /opt/carbonstore/tpchcarbon_default/lineitem -a -c l_orderkey

Data Folder: /Users/jacky/code/spark-2.2.1-bin-hadoop2.7/carbonstore/tpchcarbon_default/lineitem

## Summary
10 blocks, 1 shards, 10 blocklets, 375 pages, 11,997,996 rows, 514.64MB

## Schema
schema in part-0-0_batchno0-0-1-1535726689954.carbondata
version: V3
timestamp: 2018-08-31 22:08:48.268

Column Name      Data Type  Column Type  Property             Encoding                                         Schema Ordinal  Id
l_orderkey       INT        dimension    {sort_columns=true}  [INVERTED_INDEX]                                 0               *0587
l_linenumber     INT        dimension    {sort_columns=true}  [INVERTED_INDEX]                                 3               *c981
l_suppkey        STRING     dimension                         [INVERTED_INDEX]                                 2               *75ae
l_returnflag     STRING     dimension                         [INVERTED_INDEX]                                 8               *4ae9
l_linestatus     STRING     dimension                         [INVERTED_INDEX]                                 9               *d358
l_shipdate       DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  10              *7cd0
l_commitdate     DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  11              *b192
l_receiptdate    DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  12              *b0dd
l_shipinstruct   STRING     dimension                         [INVERTED_INDEX]                                 13              *5db3
l_shipmode       STRING     dimension                         [INVERTED_INDEX]                                 14              *2308
l_comment        STRING     dimension                         [INVERTED_INDEX]                                 15              *4cef
l_partkey        INT        measure                           []                                               1               *9bc7
l_quantity       DOUBLE     measure                           []                                               4               *418c
l_extendedprice  DOUBLE     measure                           []                                               5               *bf2c
l_discount       DOUBLE     measure                           []                                               6               *2085
l_tax            DOUBLE     measure                           []                                               7               *ad33

## Segment
SegmentID  Status             Load Start  Load End    Merged To  Format       Data Size  Index Size
0          Marked for Delete  2018-08-31  2018-08-31  NA         COLUMNAR_V3  NA         NA
1          Success            2018-08-31  2018-08-31  NA         COLUMNAR_V3  514.64MB   6.40KB

## Table Properties
Property Name              Property Value
'sort_columns'             'l_orderkey,l_linenumber'
'table_blocksize'          '64'
'comment'                  ''
'bad_records_path'         ''
'local_dictionary_enable'  'false'

## Block Detail
Shard #1 (0_batchno0-0-1-1535726689954)
Block (PartNo)  Blocklet  #Pages  #Rows  Size
0               0         40      128    54.90MB
1               0         40      128    54.89MB
2               0         40      128
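
As a rough illustration of how the "———" min/max visualization could be rendered (a hypothetical helper, not the actual CarbonCli implementation; all names below are made up):

// Illustrative sketch: draw a blocklet's [min, max] range as a bar positioned
// inside the column's global [min, max], so users can eyeball how well
// sort_columns cluster the data. A well-sorted column yields short,
// non-overlapping bars; an unsorted one yields long, overlapping bars.
object MinMaxBar {
  def bar(blockletMin: Long, blockletMax: Long,
          globalMin: Long, globalMax: Long, width: Int = 40): String = {
    val span  = (globalMax - globalMin).max(1L).toDouble
    val start = (((blockletMin - globalMin) / span) * width).toInt
    val end   = (((blockletMax - globalMin) / span) * width).toInt.max(start + 1)
    (" " * start) + ("—" * (end - start)) + (" " * (width - end))
  }

  def main(args: Array[String]): Unit = {
    // Two consecutive blocklets of a sorted column: adjacent, narrow bars.
    println("blocklet 0 |" + bar(0L, 1200000L, 0L, 12000000L) + "|")
    println("blocklet 1 |" + bar(1200000L, 2400000L, 0L, 12000000L) + "|")
  }
}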
Re: [DISCUSSION] Updates to CarbonData documentation and structure
Hi Raghu,

+1, all these optimizations are very good.

Regards
Liang

sraghunandan wrote:
> Dear All,
>
> I wanted to propose some updates and changes to our current
> documentation. Please let me know your inputs and comments.
>
> 1. Split our carbondata commands into DDL and DML
> 2. Add Presto and Hive integration along with Spark into the quick start
> 3. Add a master reference manual which lists all the commands supported in
> carbondata. This manual shall have links to the supported DDL and DML
> 4. Add an introduction to carbondata covering architecture, design and
> features supported
> 5. Merge FAQ and troubleshooting documents into a single document
> 6. Add a separate md file to explain to users how to navigate across our
> documentation
> 7. Add a TOC (Table of Contents) to all the md files which have multiple
> sections
> 8. Add the list of supported properties at the beginning of each DDL or DML
> so that users know all the properties that are supported
> 9. Rewrite the configuration property descriptions to explain each property
> in a bit more detail, and also highlight when to use the command and any caveats
> 10. Reorder our configuration properties table to group properties feature-wise
> 11. Update our webpage (carbondata.apache.org) to have better navigation
> for the documentation section
> 12. Add use cases about carbondata usage and performance tuning tips
>
> Regards
> Raghu
[DISCUSSION] Updates to CarbonData documentation and structure
Dear All,

I wanted to propose some updates and changes to our current documentation. Please let me know your inputs and comments.

1. Split our carbondata commands into DDL and DML
2. Add Presto and Hive integration along with Spark into the quick start
3. Add a master reference manual which lists all the commands supported in carbondata. This manual shall have links to the supported DDL and DML
4. Add an introduction to carbondata covering architecture, design and features supported
5. Merge FAQ and troubleshooting documents into a single document
6. Add a separate md file to explain to users how to navigate across our documentation
7. Add a TOC (Table of Contents) to all the md files which have multiple sections
8. Add the list of supported properties at the beginning of each DDL or DML so that users know all the properties that are supported
9. Rewrite the configuration property descriptions to explain each property in a bit more detail, and also highlight when to use the command and any caveats
10. Reorder our configuration properties table to group properties feature-wise
11. Update our webpage (carbondata.apache.org) to have better navigation for the documentation section
12. Add use cases about carbondata usage and performance tuning tips

Regards
Raghu
Re: error occur when I load data to s3
Hi kunalkapoor,

I'd like to give you more of the debug log, as below.

application/x-www-form-urlencoded; charset=utf-8
Tue, 04 Sep 2018 06:45:10 GMT
/aa-sdk-test2/carbon-data/example/LockFiles/concurrentload.lock"
18/09/04 14:45:10 DEBUG request: Sending Request: GET https://aa-sdk-test2.s3.us-east-1.amazonaws.com /carbon-data/example/LockFiles/concurrentload.lock Headers: (Authorization: AWS AKIAIAQX5F5B2MLQPRGQ:Ap8rHsiPQPYUdcBb2Ojb/MA9q+I=, User-Agent: aws-sdk-java/1.7.4 Mac_OS_X/10.13.6 Java_HotSpot(TM)_64-Bit_Server_VM/25.144-b01/1.8.0_144, Range: bytes=0--1, Date: Tue, 04 Sep 2018 06:45:10 GMT, Content-Type: application/x-www-form-urlencoded; charset=utf-8, )
18/09/04 14:45:10 DEBUG PoolingClientConnectionManager: Connection request: [route: {s}->https://aa-sdk-test2.s3.us-east-1.amazonaws.com:443][total kept alive: 1; route allocated: 1 of 15; total allocated: 1 of 15]
18/09/04 14:45:10 DEBUG PoolingClientConnectionManager: Connection leased: [id: 1][route: {s}->https://aa-sdk-test2.s3.us-east-1.amazonaws.com:443][total kept alive: 0; route allocated: 1 of 15; total allocated: 1 of 15]
18/09/04 14:45:10 DEBUG SdkHttpClient: Stale connection check
18/09/04 14:45:10 DEBUG RequestAddCookies: CookieSpec selected: default
18/09/04 14:45:10 DEBUG RequestAuthCache: Auth cache not set in the context
18/09/04 14:45:10 DEBUG RequestProxyAuthentication: Proxy auth state: UNCHALLENGED
18/09/04 14:45:10 DEBUG SdkHttpClient: Attempt 1 to execute request
18/09/04 14:45:10 DEBUG DefaultClientConnection: Sending request: GET /carbon-data/example/LockFiles/concurrentload.lock HTTP/1.1
18/09/04 14:45:10 DEBUG wire: >> "GET /carbon-data/example/LockFiles/concurrentload.lock HTTP/1.1[\r][\n]"
18/09/04 14:45:10 DEBUG wire: >> "Host: aa-sdk-test2.s3.us-east-1.amazonaws.com[\r][\n]"
18/09/04 14:45:10 DEBUG wire: >> "Authorization: AWS AKIAIAQX5F5B2MLQPRGQ:Ap8rHsiPQPYUdcBb2Ojb/MA9q+I=[\r][\n]"
18/09/04 14:45:10 DEBUG wire: >> "User-Agent: aws-sdk-java/1.7.4 Mac_OS_X/10.13.6 Java_HotSpot(TM)_64-Bit_Server_VM/25.144-b01/1.8.0_144[\r][\n]"
18/09/04 14:45:10 DEBUG wire: >> "Range: bytes=0--1[\r][\n]"
18/09/04 14:45:10 DEBUG wire: >> "Date: Tue, 04 Sep 2018 06:45:10 GMT[\r][\n]"
18/09/04 14:45:10 DEBUG wire: >> "Content-Type: application/x-www-form-urlencoded; charset=utf-8[\r][\n]"
18/09/04 14:45:10 DEBUG wire: >> "Connection: Keep-Alive[\r][\n]"
18/09/04 14:45:10 DEBUG wire: >> "[\r][\n]"
18/09/04 14:45:10 DEBUG headers: >> GET /carbon-data/example/LockFiles/concurrentload.lock HTTP/1.1
18/09/04 14:45:10 DEBUG headers: >> Host: aa-sdk-test2.s3.us-east-1.amazonaws.com
18/09/04 14:45:10 DEBUG headers: >> Authorization: AWS AKIAIAQX5F5B2MLQPRGQ:Ap8rHsiPQPYUdcBb2Ojb/MA9q+I=
18/09/04 14:45:10 DEBUG headers: >> User-Agent: aws-sdk-java/1.7.4 Mac_OS_X/10.13.6 Java_HotSpot(TM)_64-Bit_Server_VM/25.144-b01/1.8.0_144
18/09/04 14:45:10 DEBUG headers: >> Range: bytes=0--1
18/09/04 14:45:10 DEBUG headers: >> Date: Tue, 04 Sep 2018 06:45:10 GMT
18/09/04 14:45:10 DEBUG headers: >> Content-Type: application/x-www-form-urlencoded; charset=utf-8
18/09/04 14:45:10 DEBUG headers: >> Connection: Keep-Alive
18/09/04 14:45:10 DEBUG wire: << "HTTP/1.1 200 OK[\r][\n]"
18/09/04 14:45:10 DEBUG wire: << "x-amz-id-2: ooaOvIUsvupOOYOCVRY7y4TUanV9xJbcAqfd+w31xAkGRptm1blE5E5yMobmKsmRyGj9crhGCao=[\r][\n]"
18/09/04 14:45:10 DEBUG wire: << "x-amz-request-id: A1AD0240EBDD2234[\r][\n]"
18/09/04 14:45:10 DEBUG wire: << "Date: Tue, 04 Sep 2018 06:45:11 GMT[\r][\n]"
18/09/04 14:45:10 DEBUG wire: << "Last-Modified: Tue, 04 Sep 2018 06:45:05 GMT[\r][\n]"
18/09/04 14:45:10 DEBUG wire: << "ETag: "d41d8cd98f00b204e9800998ecf8427e"[\r][\n]"
18/09/04 14:45:10 DEBUG wire: << "Accept-Ranges: bytes[\r][\n]"
18/09/04 14:45:10 DEBUG wire: << "Content-Type: application/octet-stream[\r][\n]"
18/09/04 14:45:10 DEBUG wire: << "Content-Length: 0[\r][\n]"
18/09/04 14:45:10 DEBUG wire: << "Server: AmazonS3[\r][\n]"
18/09/04 14:45:10 DEBUG wire: << "[\r][\n]"
18/09/04 14:45:10 DEBUG DefaultClientConnection: Receiving response: HTTP/1.1 200 OK
18/09/04 14:45:10 DEBUG headers: << HTTP/1.1 200 OK
18/09/04 14:45:10 DEBUG headers: << x-amz-id-2: ooaOvIUsvupOOYOCVRY7y4TUanV9xJbcAqfd+w31xAkGRptm1blE5E5yMobmKsmRyGj9crhGCao=
18/09/04 14:45:10 DEBUG headers: << x-amz-request-id: A1AD0240EBDD2234
18/09/04 14:45:10 DEBUG headers: << Date: Tue, 04 Sep 2018 06:45:11 GMT
18/09/04 14:45:10 DEBUG headers: << Last-Modified: Tue, 04 Sep 2018 06:45:05 GMT
18/09/04 14:45:10 DEBUG headers: << ETag: "d41d8cd98f00b204e9800998ecf8427e"
18/09/04 14:45:10 DEBUG headers: << Accept-Ranges: bytes
18/09/04 14:45:10 DEBUG headers: << Content-Type: application/octet-stream
18/09/04 14:45:10 DEBUG headers: << Content-Length: 0
18/09/04 14:45:10 DEBUG headers: << Server: AmazonS3
18/09/04 14:45:10 DEBUG SdkHttpClient: Connection can be kept alive indefinitely
18/09/04 14:45:10 DEBUG request: Received successful response: 200, AWS Request ID: A1AD0240EBDD2234
18/09/04 14:45:10 DEBUG
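
For reference, a hedged sketch of the kind of session setup this log corresponds to, assuming the s3a connector and the CarbonData 1.x CarbonSession API (the bucket, keys, table, and paths below are placeholders, not the actual ones from this run). Setting carbon.lock.type to HDFSLOCK is what makes Carbon keep lock files such as concurrentload.lock on the table's file system:

// Sketch of a Spark + CarbonData session loading data on S3 (placeholders only).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._
import org.apache.carbondata.core.util.CarbonProperties

// Keep Carbon's lock files (e.g. LockFiles/concurrentload.lock) on the
// table's file system so concurrent loads coordinate through S3.
CarbonProperties.getInstance()
  .addProperty("carbon.lock.type", "HDFSLOCK")

val carbon = SparkSession.builder()
  .master("local")
  .appName("S3LoadExample")
  .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
  .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
  .getOrCreateCarbonSession("s3a://aa-sdk-test2/carbon-data")

carbon.sql("LOAD DATA INPATH '/path/to/data.csv' INTO TABLE example")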