[ https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xubo245 updated CARBONDATA-3336:
--------------------------------
    Description:
CarbonData supports binary data type

Version  Changes                                    Owner  Date
0.1      Init doc for supporting binary data type   Xubo   2019-4-10

Background:
Binary is a basic data type and is widely used in various scenarios, so CarbonData should support it. Downloading data from S3 is slow when a dataset contains many small binary objects. Most application scenarios involve storing small binary data in CarbonData, which avoids the small-binary-files problem, speeds up S3 access, and reduces the cost of accessing OBS by reducing the number of S3 API calls. Storing binary data in CarbonData also makes it easier to manage structured data and unstructured (binary) data together.

Goals:
1. Supporting write binary data type by Carbon Java SDK.
2. Supporting read binary data type by Spark Carbon file format (carbon datasource) and CarbonSession.
3. Supporting read binary data type by Carbon SDK.
4. Supporting write binary by Spark.

Approach and Detail:
1. Supporting write binary data type by Carbon Java SDK [Formal]:
    1.1 The Java SDK needs to support writing data with specific data types, such as int, double and byte[], without converting every data type to a string array. The user reads a binary file as byte[], then the SDK writes the byte[] into a binary column. => Done
    1.2 CarbonData compresses the binary column because the compressor is currently table level. => Done
        => TODO: support configuration for compress and no compress, with no compress as the default, because binary data is usually already compressed (for example a jpg image), so there is no need to uncompress the binary column. Version 1.5.4 will support column level compression; after that, we can implement no compress for binary. We can talk with the community.
    1.3 CarbonData stores binary as a dimension. => Done
    1.4 Support configuring the page size for the binary data type, because binary data is usually big, such as 200k. Otherwise one blocklet (32000 rows) will be very big. => Done
    1.5 Avro and JSON conversion need to be considered:
        • AVRO fixed and variable length binary can be supported => Avro doesn't support binary data type => No need
        • Support reading binary from JSON => Done
    1.6 Binary data type as a child column in Struct, Map
        => support it in the future, but priority is not very high, not in 1.5.4
    1.7 Verify the maximum size of binary value that is supported
        => snappy only supports about 1.71 G; the max data size should be 2 GB, but this needs to be confirmed

2. Supporting read and manage binary data type by Spark Carbon file format (carbon DataSource) and CarbonSession. [Formal]
    2.1 Support reading the binary data type from a non-transaction table; read the binary column and return it as byte[] => Done
    2.2 Support create table with a binary column; the table properties sort_columns, dictionary, COLUMN_META_CACHE and RANGE_COLUMN are not supported for a binary column => Done
        => Carbon Datasource doesn't support dictionary include column
        => support carbon.column.compressor = snappy, zstd, gzip for binary; compression applies to all columns (table level)
    2.3 Support CTAS for binary => transaction/non-transaction, Carbon/Hive/Parquet => Done
    2.4 Support external table for binary => Done
    2.5 Support projection for binary column => Done
    2.6 Support desc formatted => Done
        => Carbon Datasource doesn't support ALTER TABLE add columns sql
        Support ALTER TABLE (add column, rename, drop column) for binary data type in carbon session => Done
        Don't support changing the data type of a binary column by alter table => Done
    2.7 Don’t support PARTITION, BUCKETCOLUMNS for binary => Done
    2.8 Support compaction for binary => Done
    2.9 datamap
        Support bloomfilter, mv and pre-aggregate
        Don’t support lucene, timeseries datamap; no need for min max datamap for binary
        => Done
    2.10 CSDK / python SDK support binary in the future. => TODO, python sdk already merged to pycarbon
    2.11 Support S3 => Done
    2.12 Support UDF, hex, base64, cast => TODO
        select hex(bin) from carbon_table => TODO
    2.13 Support configurable decode for query, support base64 and Hex decode. => Done
    2.15 How big a data size can the binary data type support for writing and reading? => TODO
    2.16 Support filter for binary => Done
    2.17 select CAST(s AS BINARY) from carbon_table. => Done
    2.18 Verify the query flow for row filter push down true and false configurations => TODO: should support it

3. Supporting read binary data type by Carbon SDK
    3.1 Support reading the binary data type from a non-transaction table; read the binary column and return it as byte[] => Done
    3.2 Support projection for binary column => Done
    3.3 Support S3 => Done
    3.4 No need to support filter. => ??

4. Supporting write binary by Spark (carbon file format / carbonsession, POC??)
    4.1 Convert binary to String and store it in CSV => Done
    4.2 Spark loads the CSV, converts the string to byte[], and stores it in CarbonData. Read the binary column and return it as byte[] => Done
    4.3 Support insert into/update/delete for binary data type => Done
    4.4 Streaming table supports binary => Done
    4.5 Verify whether the given value for a binary column is Base64 encoded, plain String or byte[] for SDK, fileformat, carbonsession.
        => xubo: I think we should support configurable decode for binary, like base64 and Hex, is it ok? Hive also adds a TODO for configurable decode. Hive doesn’t support Hex and normal string now. => TODO
    4.6 Local dictionary can be excluded => Done
    4.7 Verify binary type behavior with data having null values. Verify the bad records logger behavior with a binary column; it is better to keep the badrecords file readable, so should we encode to base64?
        => support it in the future. Carbon doesn’t encode/decode base64 now; carbon keeps the output the same as the input. Carbon can support configurable encode/decode for binary. CarbonSession only supports loading data from files (csv), so the bad record is already readable. How do we confirm which record is a bad record for binary? For CarbonSDK, we can encode to base64 by default and add a configuration parameter to convert to another format, like Hex. Is it ok?
    4.8 Verify with both unsafe true and false configurations for load and query => Done

5. CLI tool supports binary data type column => Done

6. Verify presto query for binary data type
    => support it in the future, but priority is not very high, not in 1.5.4

mail list: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html

  was: the same description, except that item 2.10 ended at "=> TODO", without "python sdk already merge to pycarbon".

> Support Binary Data Type
> ------------------------
>
>                 Key: CARBONDATA-3336
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3336
>             Project: CarbonData
>          Issue Type: New Feature
>            Reporter: xubo245
>            Assignee: xubo245
>            Priority: Major
>         Attachments: CarbonData support binary data type V0.1.pdf
>
>          Time Spent: 8.5h
>  Remaining Estimate: 0h

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
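
Items 2.13 and 4.5 in the description discuss a configurable decode for binary values (base64 and Hex), and 4.7 notes that Carbon currently keeps output the same as input. A minimal Java sketch of that idea, using only the JDK, might look like the following. The class name BinaryDecodeSketch, the decoder names "base64", "hex" and "none", and the use of UTF-8 for undecoded strings are all assumptions made for illustration; this is not CarbonData's implementation or API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of CarbonData. */
final class BinaryDecodeSketch {

    private BinaryDecodeSketch() {
    }

    /**
     * Converts the text of a CSV field into the byte[] to store in a binary column.
     * The decoder names are assumptions of this sketch.
     */
    static byte[] decode(String text, String decoder) {
        if (text == null) {
            return null; // a null value stays null
        }
        switch (decoder.toLowerCase(Locale.ROOT)) {
            case "base64":
                // Throws IllegalArgumentException on malformed input.
                return Base64.getDecoder().decode(text);
            case "hex":
                return decodeHex(text);
            case "none":
                // No decoding: store the string's bytes unchanged (UTF-8 assumed).
                return text.getBytes(StandardCharsets.UTF_8);
            default:
                throw new IllegalArgumentException("Unknown decoder: " + decoder);
        }
    }

    /** Decodes a string of hex digit pairs, e.g. "68656c6c6f", into bytes. */
    private static byte[] decodeHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("Hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Invalid hex digit at position " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        // Both lines print the bytes of "hello": [104, 101, 108, 108, 111]
        System.out.println(Arrays.toString(decode("aGVsbG8=", "base64")));
        System.out.println(Arrays.toString(decode("68656c6c6f", "hex")));
    }
}
```

In this sketch a value that cannot be decoded raises IllegalArgumentException, which is one possible way to decide that a row is a bad record for a binary column (the open question in 4.7).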