[ https://issues.apache.org/jira/browse/CARBONDATA-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xubo245 resolved CARBONDATA-3351.
---------------------------------
Resolution: Fixed

Support Binary Data Type
------------------------

Key: CARBONDATA-3351
URL: https://issues.apache.org/jira/browse/CARBONDATA-3351
Project: CarbonData
Issue Type: Sub-task
Reporter: xubo245
Assignee: xubo245
Priority: Major
Time Spent: 35h
Remaining Estimate: 0h

Background:
Binary is a basic data type and is widely used in many scenarios, so it is worth supporting in CarbonData. Downloading data from S3 is slow when a dataset contains many small binary objects. Most application scenarios store small binary values in CarbonData, which avoids the small-files problem, speeds up S3 access, and reduces the cost of accessing OBS by lowering the number of S3 API calls. Storing structured and unstructured (binary) data together in CarbonData also makes both easier to manage.

Goals:
1. Support writing the binary data type through the Carbon Java SDK.
2. Support reading the binary data type through the Spark Carbon file format (carbon datasource) and CarbonSession.
3. Support reading the binary data type through the Carbon SDK.
4. Support writing binary through Spark.

Approach and Detail:
1. Supporting write binary data type by Carbon Java SDK [Formal]:
   1.1 The Java SDK needs to support writing data with specific data types, such as int, double, and byte[], without converting every type to a string array. The user reads a binary file as byte[], and the SDK writes the byte[] into the binary column. => Done
   1.2 CarbonData compresses the binary column, because the compressor is currently table level. => Done
   1.3 CarbonData stores binary as a dimension. => Done
   1.4 Support a configurable page size for the binary data type, because binary values are usually large (e.g. 200 KB); otherwise one blocklet (32,000 rows) becomes very big. => Done
   1.5 Avro and JSON conversion need consideration:
       - AVRO fixed- and variable-length binary could be supported => Avro doesn't support a binary data type => no need
       - Support reading binary from JSON => Done
   1.6 Binary data type as a child column in Struct and Map => support in the future, but the priority is not very high; not in 1.5.4
   1.7 Verify the maximum size of binary value that is supported => Snappy only supports about 1.71 GB; the maximum data size should be 2 GB, but this needs confirmation.

2. Supporting read and manage binary data type by Spark Carbon file format (carbon DataSource) and CarbonSession [Formal]:
   2.1 Support reading the binary data type from a non-transactional table: read the binary column and return it as byte[]. => Done
   2.2 Support creating a table with a binary column; table properties don't support sort_columns, dictionary, COLUMN_META_CACHE, or RANGE_COLUMN for a binary column. => Done
       - The Carbon datasource doesn't support dictionary-include columns.
       - Support carbon.column.compressor = snappy, zstd, gzip for binary; compression applies to all columns (table level).
   2.3 Support CTAS for binary => transactional/non-transactional, Carbon/Hive/Parquet => Done
   2.4 Support external tables for binary => Done
   2.5 Support projection for a binary column => Done
   2.6 Support DESC FORMATTED => Done
       - The Carbon datasource doesn't support the ALTER TABLE ADD COLUMNS SQL.
       - Support ALTER TABLE (add column, rename, drop column) for the binary data type in CarbonSession => Done
       - Don't support changing the data type of a binary column via ALTER TABLE => Done
   2.7 Don't support PARTITION or BUCKETCOLUMNS for binary => Done
   2.8 Support compaction for binary => Done
   2.9 Datamaps? Don't support bloomfilter, lucene, or timeseries datamaps; no need for a min/max datamap for binary; support MV and pre-aggregate in the future. => TODO
   2.10 CSDK / Python SDK support for binary in the future => TODO
   2.11 Support S3 => Done
   2.12 Support UDFs: hex, base64, cast => TODO
        select hex(bin) from carbon_table => TODO
   2.15 Support filters for binary => Done
   2.16 select CAST(s AS BINARY) from carbon_table => Done

3. Supporting read binary data type by Carbon SDK:
   3.1 Support reading the binary data type from a non-transactional table: read the binary column and return it as byte[]. => Done
   3.2 Support projection for a binary column => Done
   3.3 Support S3 => Done
   3.4 No need to support filters => to be discussed; not in this PR

4. Supporting write binary by Spark (carbon file format / CarbonSession, POC?):
   4.1 Convert binary to a string and store it in CSV => Done
   4.2 Spark loads the CSV, converts the string to byte[], and stores it in CarbonData; reading the binary column returns byte[]. => Done
   4.3 Support insert into / update / delete for the binary data type => Done
   4.4 Don't support streaming tables => TODO
   4.5 Verify whether the given value for a binary column is Base64-encoded, a plain string, or byte[] for the SDK, file format, and CarbonSession.
       => xubo: I think we should support a configurable decode for binary, e.g. Base64 and hex; is that OK? Hive has also added a TODO for making this configurable. Hive doesn't support hex or plain strings now. => TODO, not in this PR
   4.6 Local dictionary can be excluded => Done
   4.7 Verify binary-type behavior with data containing null values. Verify the bad-records logger behavior with a binary column; it is better to keep the bad-records file readable, so should we encode it to Base64?
       => Support this in the future. Carbon doesn't encode/decode Base64 now; Carbon keeps output identical to input. Carbon can support configurable encode/decode for binary. CarbonSession only supports loading data from files (CSV), so bad records are already readable. How do we identify a bad record for binary? For the Carbon SDK, we can encode to Base64 by default and add a configuration parameter to convert to another format, such as hex. Is that OK? => TODO, not in this PR
   4.8 Verify with both unsafe=true and unsafe=false configurations for load and query => Done

5. CLI tool supports the binary data type column => Done

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
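The page-size concern in item 1.4 comes down to simple arithmetic. A quick sketch using the figures stated in the issue (200 KB per value, 32,000 rows per blocklet) shows why a fixed row count per blocklet is a poor fit for binary columns:

```python
# Rough blocklet-size estimate for a binary column (figures from item 1.4).
rows_per_blocklet = 32_000        # default rows per blocklet
avg_binary_size = 200 * 1024      # ~200 KB per binary value

blocklet_bytes = rows_per_blocklet * avg_binary_size
blocklet_gib = blocklet_bytes / 1024 ** 3
print(f"{blocklet_gib:.1f} GiB")  # ~6.1 GiB for this one column alone
```

At roughly 6 GiB for a single column, one blocklet would also exceed the ~2 GB maximum value/size limits discussed in item 1.7, which is why the page size needs to be configurable for binary.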
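Items 4.1 and 4.2 describe round-tripping binary through CSV by converting it to a string and back. A minimal illustration of that encode/decode step (in Python for brevity; the helper names are mine for illustration, not CarbonData API):

```python
import base64

def binary_to_csv_field(data: bytes) -> str:
    """Encode raw bytes as a Base64 string so they survive a text CSV file."""
    return base64.b64encode(data).decode("ascii")

def csv_field_to_binary(field: str) -> bytes:
    """Decode the Base64 CSV field back into the original bytes."""
    return base64.b64decode(field)

# Arbitrary binary payload, including bytes that CSV cannot hold raw.
payload = b"\x00\x01binary\xff"
field = binary_to_csv_field(payload)
assert csv_field_to_binary(field) == payload
```

The same round-trip property is what item 4.2 relies on: whatever bytes went into the CSV string must come back out as an identical byte[] when loaded into the binary column.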
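Item 4.5 proposes a configurable decode for binary input (Base64 or hex), and item 2.12 mentions hex/base64 UDFs. A small sketch of what such a configurable codec could look like; this is a hypothetical illustration, not CarbonData's implementation:

```python
import base64
import binascii

# Candidate decoders for a binary column, selected by a configuration key.
DECODERS = {
    "base64": base64.b64decode,
    "hex": binascii.unhexlify,
    # "none" keeps input and output the same, matching Carbon's current behavior.
    "none": lambda s: s.encode() if isinstance(s, str) else s,
}

def decode_binary(value: str, codec: str = "base64") -> bytes:
    """Decode an incoming binary field using the configured codec."""
    try:
        decoder = DECODERS[codec]
    except KeyError:
        raise ValueError(f"unsupported binary decoder: {codec}")
    return decoder(value)

assert decode_binary("aGk=", "base64") == b"hi"
assert decode_binary("6869", "hex") == b"hi"
```

Defaulting to Base64 with hex as an option matches the suggestion in item 4.7 for keeping bad-records output readable while still allowing other formats.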