Re: [DISCUSSION] Support JOIN query with spatial index

2021-05-17 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [ANNOUNCE] Akash R Nilugal as new PMC for Apache CarbonData

2021-04-13 Thread David CaiQiang
Congratulations Akash



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [VOTE] Apache CarbonData 2.1.1(RC2) release

2021-03-26 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [VOTE] Apache CarbonData 2.1.1(RC1) release

2021-03-18 Thread David CaiQiang
-1. Please fix the pending defect and merge the completed PR first.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Taking the inputs for Segment Interface Refactoring

2021-02-18 Thread David CaiQiang
+1 



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Improve carbondata CDC performance

2021-02-18 Thread David CaiQiang
+1, you can go ahead with the implementation.

How about using the following SQL instead of the cartesian join?

SELECT df.filePath
FROM targetTableBlocks df
WHERE EXISTS (SELECT 1 FROM srcTable
              WHERE srcTable.value BETWEEN df.min AND df.max)



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Improve carbondata CDC performance

2021-02-18 Thread David CaiQiang
I mean you can push your logic into CarbonDataSourceScan as a dynamic runtime
filter.

Actually, CarbonDataSourceScan already uses min/max zone maps as an index
filter to prune the block list (in the CarbonScanRDD.getPartition method).

We can do more on the join query. Here I assume the source table is much
smaller than the target table.

1. when the join broadcasts the source table
    1.1 when the join columns contain the partition keys of the target
table, it can reuse the broadcast result to prune the partitions of the
target table.
    1.2 when the join query has some filters on the target table, use
min/max zone maps to prune the block list of the target table.
    1.3 when the join query has some filters on the source table, it can use
the min/max zone maps of the join columns to match the broadcast result.

2. when the join doesn't broadcast the source table
    2.1 when the join query has some filters on the target table, use
min/max zone maps to prune the block list of the target table.
    2.2 join the source table with the min/max zone maps of the target table
to get the new block list (see the sketch below).

In the future, it would be better to move all driver-side pruning logic
(min/max index, SI, partition pruning, and dynamic filters) into one place
and invoke it in CarbonDataSourceScan to get the input partitions for the
scan RDD.
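
To make case 2.2 concrete, here is a minimal sketch; the types and the
in-memory model are illustrative only, not the CarbonScanRDD API:

  // Keep a target-table block only when at least one collected source-table
  // join value falls inside the block's [min, max] zone map.
  case class BlockMeta(filePath: String, min: Long, max: Long)

  def pruneBlocks(blocks: Seq[BlockMeta], srcValues: Seq[Long]): Seq[String] =
    blocks.collect {
      case b if srcValues.exists(v => v >= b.min && v <= b.max) => b.filePath
    }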



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Improve carbondata CDC performance

2021-02-17 Thread David CaiQiang
Hi Akash,
    You can enhance the runtime filter to improve the join performance.

    It has a rule to dynamically check whether the join can add the
runtime filter or not.

    Better to push down the runtime filter into CarbonDataSourceScan, and
to avoid adding a UDF function to rewrite the plan.





-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion]Presto Queries leveraging Secondary Index

2021-01-17 Thread David CaiQiang
Hi Venu and Ajantha,

I also have some suggestions for the new SI solution:
1. agree to avoid the query plan rewrite
2. push down the SI filter to the pruning step of the main table directly on
the driver side, but we need a distributed job to improve performance
3. segment-level usability
   for example, when only one segment doesn't have indexes but the other 99
segments do, SI should still be used to improve the filter query on the
index column.
4. consider the filter column's selectivity; it should impact the priority
of the indexes (including the main index).
phase 1: based on rules (filter order or hint)
phase 2: based on cost (statistics)





-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] Display the segment ID when carbondata load is successful

2021-01-17 Thread David CaiQiang
Hi Nihal, my suggestions are as follows:
1. include the normal output of the show segments command
2. add more information for loading, like numFiles, numRows, rawDataSize
(maybe show segments needs this too; take care of CDC, which needs to update
this information)



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] Geo spatial index algorithm improvement and UDFs enhancement

2020-12-25 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Upgrade presto-sql to 333 version

2020-12-25 Thread David CaiQiang
+1

Are there other impacts?



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION]Improve Simple updates and delete performance in carbondata

2020-12-08 Thread David CaiQiang
Hi Akash,

  for the simple updates and delete scenario, you can try to do it.

  During update/delete:
  1) for the updated/deleted segment, there is no need to update
segmentMetadataInfo.
  2) for the newly inserted segment, you can summarize the blocklet-level
index into a segment-level index by reading the carbonindex/carbonindexmerge
file and calculating it.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION]Improve Simple updates and delete performance in carbondata

2020-11-27 Thread David CaiQiang
Hi Akash, for the simple update case, can you make the quick change and run a
test to confirm your inference?



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Size control of minor compaction

2020-11-23 Thread David CaiQiang
+1

It will take many resources and a long time to compact a large segment, and
it may not get a good result.

Auto compaction is disabled by default; if we give a large default value
(maybe 1024GB), it will not impact the default behavior.

And a table-level threshold is needed as well.

If the user wants to skip some segments, the user can adjust the value to
do so.
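
As a sketch of what such a threshold could look like (the property name is
hypothetical, not a released option, and spark is assumed to be a
SparkSession):

  // Hypothetical table-level threshold in GB; segments above it would be
  // skipped by minor compaction.
  spark.sql("ALTER TABLE t SET TBLPROPERTIES('minor_compaction_size'='1024')")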



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [ANNOUNCE] Ajantha as new PMC for Apache CarbonData

2020-11-23 Thread David CaiQiang
Congratulations to Ajantha.




-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION]Merge index property and operations improvement.

2020-11-23 Thread David CaiQiang
   a) remove the mergeIndex property and event listener, and make mergeIndex
a part of the loading/compaction transaction.
   b) if merging the index fails, loading/compaction should fail directly.
   c) keep the merge_index command and mark it deprecated.
  for a new table, maybe it will do nothing.
  for an old table, maybe we need to tolerate the probable query issue
(index files not found).
  It could be removed entirely in the future.
   d) at the end of loading, retrying to finish the index merge or the
tablestatus update is a good suggestion.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] About carbon.si.segment.merge feature

2020-11-08 Thread David CaiQiang
Hi Ajantha,
  Agree to remove "carbon.si.segment.merge".

  1. dynamically decide the number of loading tasks
  Before loading the SI segment, it is easy to estimate the total size of
this SI segment.
  So it is better to dynamically decide the number of loading tasks to avoid
small carbon files in the SI segment (see the sketch below).

  2. can we use global_sort for SI by default?
  SI is used to speed up filter queries, and global_sort can do this better.
  We need global_sort for SI.

  3. use reindex instead of refresh index
  If refresh index is only used to merge small files, reindex will be
better (it should implement point 1).
  So, can we remove refresh index too?
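
A minimal sketch of the task-count heuristic in point 1 (the names and the
default target file size are illustrative assumptions):

  // Derive the SI loading task count from the estimated SI segment size so
  // that each task writes roughly one target-sized carbon file.
  def siLoadingTaskCount(estimatedSegmentBytes: Long,
                         targetFileBytes: Long = 1024L * 1024 * 1024): Int =
    math.max(1, math.ceil(estimatedSegmentBytes.toDouble / targetFileBytes).toInt)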
  
  



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Partition Optimization

2020-11-04 Thread David CaiQiang
Agree with Vishal, better to test and confirm the difference.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Update feature enhancement

2020-11-04 Thread David CaiQiang
PR #3999 already implements this enhancement, FYI.

PR URL: https://github.com/apache/carbondata/pull/3999



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Taking the inputs for Segment Interface Refactoring

2020-10-18 Thread David CaiQiang
Before starting to refactor the segment interface, I list the segment-related
features as follows.

[table related]
1. get locks for the table
   lock for tablestatus
   lock for updatedTablestatus
2. get lastModifiedTime of the table

[segment related]
1. segment datasource
   datasource: file format, other datasources
   file format: carbon, parquet, orc, csv, ...
   catalog type: segment, external segment
2. data load ETL (load/insert/add_external_segment/insert_stage)
   write segment for batch loading
   add external segment by using an external folder path for a mixed
file-format table
   append streaming segment for Spark Structured Streaming
   insert_stage for the Flink writer
3. data query
   segment properties and schema
   segment-level index cache and pruning
   cache/refresh block/blocklet index cache if needed by the segment
   read segments into a dataframe/RDD
4. segment management
   new segment id for loading/insert/add_external_segment/insert_stage
   create a global segment identifier
   show [history]/delete segment
5. stats
   collect dataSize and indexSize of the segment
   lastModifiedTime, start/end time, update start/end time
   fileFormat
   status
6. segment-level lock for supporting concurrent operations
7. get the tablestatus storage factory
   storage solution 1): use the file system by default
   storage solution 2): use the hive metastore or a db

[table status related]
1. record a new LoadMetadataDetails
   loading/insert/compaction start/end
   add external segment start/end
   insert stage

2. update LoadMetadataDetails
   compaction
   update/delete
   drop partition
   delete segment

3. read LoadMetadataDetails
   list all/valid/invalid segments

4. backup and history

[segment file related]
1. write a new segment file
   generate the segment file name
   better to use a new timestamp to generate a new segment file name for
each write, to avoid overwriting a segment file with the same name.
   write the segment file
   merge temp segment files
2. read the segment file
   readIndexFiles
   readIndexMergeFiles
   getPartitionSpec
3. update the segment file
   update
   merge index
   drop partition

[clean files related]
1. clean stale files after a successful segment operation
   data deletion should be delayed for a period of time (maybe the query
timeout interval) to avoid deleting files immediately (except for drop
table/partition and force clean files)
   includes data files, index files, segment files, tablestatus files
   impacted operation: mergeIndex
2. clean stale files of a failed segment operation immediately
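
To seed the discussion, a rough sketch of the interface shape implied by the
list above; every name here is illustrative, not a final API:

  // Illustrative only: a possible surface for the refactored segment
  // interface, covering identity, stats and segment-level read access.
  trait SegmentReader extends AutoCloseable {
    def indexFilePaths(): Seq[String]   // from the segment file / merge index
  }

  trait Segment {
    def id: String                      // global segment identifier
    def format: String                  // carbon, parquet, orc, csv, ...
    def lastModifiedTime: Long
    def dataSize: Long
    def indexSize: Long
    def openReader(): SegmentReader
  }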





-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Segment management enhance

2020-10-09 Thread David CaiQiang
Hi Ramana,
   I agree with you.
   When writing the segment file, the system uses listFiles to collect all
index files.
   In some cases, it will add stale index files into the segment file.
   We will try to fix that first.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [discuss]CarbonData update operation enhance

2020-09-22 Thread David CaiQiang
Hi Linwood,
  1. better to implement "Update feature enhancement" first; it will
create a new segment to store the new files.
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Update-feature-enhancement-td99769.html
  2. clean deletedelta files
  Now Carbon needs to clean invalid .deletedelta files before update/delete.
If we don't clean them, after the next update/delete these files will become
valid .deletedelta files.

  How can we avoid cleaning invalid .deletedelta files while making sure
they don't impact data after the next update/delete operation?



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Clean files enhancement

2020-09-18 Thread David CaiQiang
Agree with Ravindra.

1. stop all automatic data cleaning in load/insert/compact/update/delete...

2. when the clean files command cleans in-progress or uncertain data, we can
move it to the data trash.
    This can prevent deleting useful data by mistake; we have already seen
this issue in some scenarios.
    Other cases (for example, cleaning mark_for_delete/compacted segments)
should not use the data trash folder; clean the data directly.

3. no need for data trash management; I suggest keeping it simple.
    The clean files command should support emptying the trash immediately,
which will be enough.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Clean files enhancement

2020-09-15 Thread David CaiQiang
1. cleaning an in-progress segment is very dangerous; please remove this
part from the code. Only after the user explicitly uses the clean files
command with the option "clean_in_progressing"="true" should we check the
segment lock to clean the segment.

2. if the status of a segment is mark_for_delete/compacted, we can delete
the segment directly without backup.

3. remove the code which cleans stale data and partial data from the
loading/compaction/update/delete features and so on. It is better to use a
UUID as the segment folder name and let cleaning stale data be an optional
operation. Even if we don't clean stale data, the table can still work fine.

4. the trash folder can be under the table path, so each table has a
separate trash folder. If we clean uncertain data, we can use the trash
folder to store it, with a separate subfolder for each transaction.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Carbon merge should support update random columns each row

2020-09-10 Thread David CaiQiang
+1

Maybe we need to use a delta file to store the updated values instead of the
deletedelta file.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Update feature enhancement

2020-09-04 Thread David CaiQiang
Hi Akash,

    3. The update operation contains an insert operation. The update
operation will handle this issue the same way the insert operation does.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Segment management enhance

2020-09-04 Thread David CaiQiang
Hi Kunal,

   1. The user uses the SQL API or other interfaces. This UUID is a
transaction id, and we already store the timestamp and other information in
the segment metadata.
   This transaction id can be used in the loading/compaction/update
operations. We can append this id to the log if needed.
   A Git commit id is also an opaque identifier, so we can consider using a
UUID similarly. What information do you want to get from the folder name?

   2. It is easy to fix the show segments command's issue. Maybe we can sort
segments by timestamp and UUID to generate the index id. The user can
continue to use it in other commands.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Update feature enhancement

2020-09-04 Thread David CaiQiang
Hi Akash,

1. the update operation still produces "deletedelta" files, the same as
before; horizontal compaction is still needed.

2. loading one carbonindexmerge file will be fast and will not impact query
performance (a customer has faced this issue).

3. for insert/loading, it can trigger compaction to avoid small segments.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


[Discussion] Segment management enhance

2020-09-03 Thread David CaiQiang
[Background]
1. In some scenarios, two loading/compaction jobs may write data to the same
segment; this results in data confusion and breaks some features.
2. Loading/compaction/update/delete operations need to clean stale data
before execution. Cleaning stale data is a high-risk operation: if it hits
an exception, it may delete valid data. And if the system doesn't clean
stale data, in some scenarios the stale data will be added into a new merged
index file and become queryable.
3. Loading/compaction takes a long time, and in some scenarios the lock is
held for a long time as well.

[Motivation & Goal]
We should avoid data confusion and the risk of cleaning stale data. Maybe we
can use a UUID as the segment id to avoid these troubles. We may even be
able to do loading/compaction without the segment/compaction lock.

[Modification]
1. segment id
  Use a UUID as the segment id instead of the unique numeric value.

2. segment layout
 a) move the segment data folder into the table folder
 b) move the carbonindexmerge file into the Metadata/segments folder

 tableFolder
    UUID1
     |_xxx.carbondata
     |_xxx.carbonindex
    UUID2
    Metadata
     |_segments
        |_UUID1_timestamp1.segment (segment index summary)
        |_UUID1_timestamp1.carbonindexmerge (segment index detail)
     |_schema
     |_tablestatus
    LockFiles

  partitionTableFolder
    partkey=value1
     |_xxx.carbondata
     |_xxx.carbonindex
    partkey=value2
    Metadata
     |_segments
        |_UUID1_timestamp1.segment (segment index summary)
        |_partkey=value1
          |_UUID1_timestamp1.carbonindexmerge (segment index detail)
        |_partkey=value2
     |_schema
     |_tablestatus
    LockFiles

3. segment management
Extract a segment interface; it should support open/close, read/write, and
segment-level index pruning APIs.
The segment should support multiple data source types: file formats (carbon,
parquet, orc, ...), HBase, ...

4. clean stale data
It will become an optional operation.


-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Improve the reading/writing performance on the big tablestatus file

2020-09-03 Thread David CaiQiang
Hi Akash

2. the new tablestatus only stores the latest status file name, not all
status files.
   The status file will store all segment metadata (just like the old
tablestatus).

3. if we have a delta file, there is no need to read the status file for
each query; reading the delta file is enough if the status file has not
changed.




-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


[Discussion] Update feature enhancement

2020-09-02 Thread David CaiQiang
[Background]
Now the update feature inserts the updated rows into the old segments where
the data are updated.
In the end, it needs to reload the indexes of the related segments.

[Motivation]
If there are many updated segments, it will take a long time to reload the
indexes.
So I suggest writing the updated rows into a new segment.
It will not impact the indexes of the old segments and doesn't need to
reload them.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Improve the reading/writing performance on the big tablestatus file

2020-09-01 Thread David CaiQiang
Add solution 4 to separate the status file by segment status.

*solution 4:*   Based on solution 2, support status.inprogress

  1) new tablestatus file format
{
 "statusFileName":"status-uuid1",
 "inProgressStatusFileName": "status-uuid2.inprogress",
 "updateStatusFileName":"updatestatus-timestamp1",
 "historyStatusFileName":"status.history",
 "segmentMaxId":"1000"
}

  2) the status.inprogress file stores the in-progress segment metadata

    Write: at the beginning of loading/compaction, add the in-progress
segment metadata into status-uuid2.inprogress; at the end, move it to
status-uuid1.

    Read: queries read status-uuid1 only; other cases read
status-uuid2.inprogress if needed.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


[Discussion] Improve the reading/writing performance on the big tablestatus file

2020-09-01 Thread David CaiQiang
[Background]
Now the size of one segment metadata entry is about 200 bytes in the
tablestatus file. If the table has 1 million segments and the mean segment
size is 1GB (meaning the table size is 1PB), the size of the tablestatus
file will reach 200MB.

Any reading/writing operation on this tablestatus file will be costly and
perform badly.

In a concurrent scenario, it is easy to hit read failures on a tablestatus
file that is being modified, and write-lock waiting timeouts.

[Motivation & Goal]
Carbon supports big tables larger than 1PB; we should reduce the tablestatus
size to improve the performance of reading/writing operations.
It would also be better to separate reads and writes into different
tablestatus files, to avoid reading a tablestatus file that is being
modified.

[Modification]
There are three solutions as following.

solution 1: compress tablestatus file
  1) use gzip to compress tablestatus file (200MB -> 20 MB)
  2) keep all previous lock mechanism
  3) support backward compatibility
Read: if magic number (0x1F8B) exists, it will uncompress the
tablestatus file at first
Write:, compress tablestatus directly.
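
A minimal sketch of the backward-compatible read in solution 1, using plain
JVM I/O (real code would go through Carbon's file abstraction; the method
name is illustrative):

  import java.io.{BufferedInputStream, FileInputStream, InputStream}
  import java.util.zip.GZIPInputStream

  // Peek at the first two bytes and wrap the stream in GZIPInputStream only
  // when the gzip magic number (0x1F, 0x8B) is present, so old uncompressed
  // tablestatus files keep working.
  def openTableStatus(path: String): InputStream = {
    val in = new BufferedInputStream(new FileInputStream(path))
    in.mark(2)
    val isGzip = in.read() == 0x1F && in.read() == 0x8B
    in.reset()
    if (isGzip) new GZIPInputStream(in) else in
  }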

solution 2: based on solution 1, separate reads and writes into different
tablestatus files.
  1) new tablestatus format
{
 "statusFileName":"status-uuid1",
 "updateStatusFileName":"updatestatus-timestamp1",
 "historyStatusFileName":"status.history",
 "segmentMaxId":"1000"
}
    keep it small always, and reload this file for each operation

  2) add a Metadata/tracelog folder
    it stores the files: status-uuid1, updatestatus-timestamp1,
status.history

  3) use gzip to compress the status-uuid1 file

  4) support backward compatibility
    Read: if the file starts with "[{", go to the old reading flow; if it
starts with "{", go to the new flow.
    Write: generate a new status-uuid1 file and updatestatus file, and store
the names in the tablestatus file

  5) clean stale files
    if the stale files were created more than 1 hour ago (query timeout), we
can remove them. loading/compaction/clean files can trigger this action.

solution 3: based on solution 2, support a tablestatus delta
  1) new tablestatus file format
{
 "statusFileName":"status-uuid1",
 "deltaStatusFileName": "status-uuid2.delta",
 "updateStatusFileName":"updatestatus-timestamp1",
 "historyStatusFileName":"status.history",
 "segmentMaxId":"1000"
}
  2) the tablestatus delta stores the recent modifications

    Write: if the status file reaches 10MB, it starts to write a delta file;
if the delta file reaches 1MB, merge the delta into the status file and set
deltaStatusFileName to null.

    Read: if deltaStatusFileName is not null in the new tablestatus file,
read the delta status and combine the status file with it.

Please vote on these solutions.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion]Query Regarding Task launch mechanism for data load operations

2020-08-14 Thread David CaiQiang
This mechanism will work fine for LOCAL_SORT loading of big data on a small
cluster with big executors.

If those conditions are not met, it is better to consider a new solution
that adapts to the generic scenario.

I suggest refactoring NO_SORT; maybe we can check and improve the
global_sort solution.

The solution should support both NO_SORT and GLOBAL_SORT, and automatically
determine the number of partitions to avoid the small-file issue.




-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Disscuss] The precise of timestamp is limited to millisecond in carbondata, which is incompatiable with DB

2020-07-30 Thread David CaiQiang
agree with Ravi



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Implement delete and update feature in carbondata SDK.

2020-07-30 Thread David CaiQiang
+1

Can we add a commit method to support multiple operations at once?

CarbonSDKUID
  .delete(...)
  .delete(...)
  .update(...)
  .commit



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Support the LIMIT operator for show segments command

2020-07-30 Thread David CaiQiang
+1 for solution 1

But will the limit statement get the head or the tail of the segment list?
Or does it need to be ordered by some columns? Please describe the details.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION]Remove the call to update the serde properties in case of alter scenarios

2020-07-30 Thread David CaiQiang
+1 for removing it.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] SI support Complex Array Type

2020-07-30 Thread David CaiQiang
+1 for solution 2

Can we support more than one array_contains by using an SI join (like SI on
primitive data types)?



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

2020-07-09 Thread David CaiQiang
update reply:

Merging the index should be a part of loading. It is not good to extract the
index merge into an independent process; that brought the query issue (the
system can't find the index files when/after merging).

In my opinion, during loading, the new .carbonindex files should be
temporary; we should merge them into a .carbonindexmerge file in the segment
before updating the segment status to success in the tablestatus file.
When the index merge fails, loading should fail.

for query:
1. support reading both .carbonindex files and .carbonindexmerge files

for loading (this also includes the loading part of compaction/create
index/create mv/merge operations), it is better to do it like this:
step 1. update the tablestatus file to add an in-progress segment
step 2. generate carbondata files and temporary .carbonindex files.
step 3. merge the .carbonindex files into a .carbonindexmerge file.
step 4. write a segment file.
step 5. update the tablestatus file with the final status, the segment file
name and some statistics.

So in total:
 update the tablestatus file twice,
 write the segment file once,
 write .carbonindexmerge files once,
 write and delete .carbonindex files once.

for updating:
1. Now only the updating operation keeps .carbonindex files.
In the future, maybe we can change updating operations to be the same as
merge operations, generating new files into a new segment.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

2020-07-09 Thread David CaiQiang
Merging the index should be a part of loading. It is not good to extract the
index merge into an independent process; that brought the query issue (the
system can't find the index files when/after merging).

In my opinion, during loading, the new .carbonindex files should be
temporary; we should merge them into a .carbonindexmerge file in the segment
before updating the segment status to success in the tablestatus file.
When the index merge fails, loading should fail.

for query:
1. support reading both .carbonindex files and .carbonindexmerge files

for loading (this also includes the loading part of compaction/create
index/create mv/merge operations), it is better to do it like this:
step 1. update the tablestatus file to add an in-progress segment
step 2. generate carbondata files and temporary .carbonindex files; for a
partitioned table, also generate a temporary segment file for each related
partition.
step 3. merge the .carbonindex files into a .carbonindexmerge file.
step 4. write a segment file; for a partitioned table, merge all temporary
segment files into one segment file.
step 5. update the tablestatus file with the final status, the segment file
name and some statistics.

So in total:
 update the tablestatus file twice,
 write the segment file once,
 write .carbonindex files once,
 write and delete .carbonindex files once.

for updating:
1. Now only the updating operation keeps .carbonindex files.
In the future, maybe we can change updating operations to be the same as
merge operations, generating new files into a new segment.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

2020-07-09 Thread David CaiQiang
Better to always merge the index.

-1 for 1.

+1 for 2.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [VOTE] Apache CarbonData 2.0.1(RC1) release

2020-06-01 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] About global sort in 2.0.0

2020-05-31 Thread David CaiQiang
+1 and agree with Kunal



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [VOTE] Apache CarbonData 2.0.0(RC3) release

2020-05-18 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [VOTE] Apache CarbonData 2.0.0(RC3) release

2020-05-18 Thread David CaiQiang
Hi Kunal,

  another question, about point 4:
*4. A staged Maven repository is available for review at:*
https://repository.apache.org/content/repositories/orgapachecarbondata-1062/

Please check this URL; it doesn't contain the javadoc jar:
https://repository.apache.org/content/repositories/orgapachecarbondata-1062/org/apache/carbondata/carbondata-mv-plan_2.3/2.0.0/



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [VOTE] Apache CarbonData 2.0.0(RC3) release

2020-05-17 Thread David CaiQiang
I have a doubt about point 3:
*3. The artifacts to be voted on are located here:*
https://dist.apache.org/repos/dist/dev/carbondata/2.0.0-rc3/

Why do we need to release two identical source packages?

apache-carbondata-2.0.0-spark2.3-source-release.zip
apache-carbondata-2.0.0-spark2.4-source-release.zip

They are the same, apart from the zip file name.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Disscussion] Support GloabalSort in the CDC Flow

2020-05-12 Thread David CaiQiang
In my opinion, this is an issue if it can't work.

Better to change the topic title to use 'question'/'issue' instead of
'discussion'.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Disscussion] Remove 'Create Stream'

2020-05-12 Thread David CaiQiang
How about marking the stream SQL as experimental?

For now, in some cases it is an easy way for the user to understand the
streaming table.

We can improve it in the future.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Dissussion] Support FLOAT datatype in the CDC Flow

2020-05-10 Thread David CaiQiang
Please check another topic:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Float-and-Double-compatibility-issue-with-external-segments-to-Carbon-td93870.html

If this is an issue, you can create an issue in the CarbonData JIRA.




-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion]Float and Double compatibility issue with external segments to Carbon

2020-05-07 Thread David CaiQiang
It is a historical legacy issue: it was easy to reuse the solution of the
double data type.

I suggest implementing the float data type independently.




-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Disable Adaptive encoding for Double and Float by default

2020-05-07 Thread David CaiQiang
I agree with Ravindra, and I can try to fix it (I have mentioned this in a
PR review comment).



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [ANNOUNCE] Kunal Kapoor as new PMC for Apache CarbonData

2020-05-07 Thread David CaiQiang
Congratulations Kunal



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Carbon over-use cluster resources

2020-05-07 Thread David CaiQiang
Hi Manhua,
  Now no_sort reuses the loading flow of local_sort. It is not a good
solution and leads to the situation you mentioned. In my opinion, we need to
adjust the loading flow of no_sort, maybe to end up like global_sort.

  In addition, the producer-consumer pattern in data encoding and
compression can also be optimized for no_sort and global_sort; maybe just
prefetch one page and process it instead of using a thread pool.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [VOTE] Apache CarbonData 2.0.0(RC2) release

2020-05-04 Thread David CaiQiang
-1 for me, based on the points below.

1. We need to update quick-start-guide.md for Carbon 2.0. For example,
Carbon 2.0 supports the multi-tenant scenario, so the carbon property
"carbon.storelocation" should be deprecated; only users who used
"carbon.storelocation" in a previous version can still use it in Carbon 2.0.

2. "- Adapt to SparkSessionExtensions":
CarbonSession should be deprecated in Carbon 2.0; only users who used
CarbonSession in a previous version can still use it in Carbon 2.0.

3. "- Support heterogeneous format segments in carbondata":
I can't find the doc for this feature.

4. "- Support write Flink streaming data to Carbon":
Better to provide an end-to-end guide.









-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage

2020-02-12 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] Multi-tenant support by refactoring datamaps

2020-02-12 Thread David CaiQiang
+1

please take care of the performance changes during refactoring datamaps



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Support Secondary Index on Carbon Table

2020-02-06 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Propose feature change in CarbonData 2.0

2019-11-28 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: How to define constraints and indexes in carbondata while creating a table.

2019-06-27 Thread David CaiQiang
doc:
https://github.com/apache/carbondata/blob/master/docs/ddl-of-carbondata.md#sort-columns-configuration

testcase: 
https://github.com/apache/carbondata/blob/a6ab97ca40427af5225f12a063a0e44221a503e1/integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/sortcolumns/TestSortColumns.scala



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: How to define constraints and indexes in carbondata while creating a table.

2019-06-25 Thread David CaiQiang
So far, CarbonData doesn't support primary keys, foreign keys, NOT NULL,
etc.
Table creation can use SORT_COLUMNS to create the main index, but secondary
indexes are not supported.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Why metadata path didn't show up on my local disk

2019-04-24 Thread David CaiQiang
Maybe it used the javax.jdo.option.ConnectionURL configuration.
When Hive, Hadoop and Spark don't set this configuration, it will use the
parameter of getOrCreateCarbonSession.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] is it necessary to support SORT_COLUMNS modification

2019-04-09 Thread David CaiQiang
Please check JIRA for the design doc:
https://issues.apache.org/jira/browse/CARBONDATA-3347



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] Support Compaction for Range Sort

2019-04-08 Thread David CaiQiang
How will the new compaction compact Seg_0 and Seg_1?

For example: Seg_0 has 3 ranges, (0-100), (100-200), (200-300), and Seg_1
has 2 ranges, (50-150) and (250-300).



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [VOTE] Apache CarbonData 1.5.3(RC1) release

2019-04-07 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: 答复: spark streaming insert data error

2019-03-20 Thread David CaiQiang
All documents (including the streaming table) are under the following link:
https://github.com/apache/carbondata/tree/master/docs

You can find the examples in the examples/spark2 module:
example 1 (supports update/delete):
https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/StreamingUsingBatchLoadExample.scala

example 2 (does not support update/delete):
https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/StructuredStreamingExample.scala
or:
https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/StreamingWithRowParserExample.scala



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: spark streaming insert data error

2019-03-20 Thread David CaiQiang
You can get the table schema via the CarbonTable.getCreateOrderColumn
method; it will return the correct table schema.

"name,city,id,salary" is the column storage order, not the table schema.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


[Discussion] is it necessary to support SORT_COLUMNS modification

2019-03-13 Thread David CaiQiang
Hi all,
    Let's discuss whether it is necessary to support SORT_COLUMNS
modification.
*Background*
    "SORT_COLUMNS" is a table-level property, and we can't change it after
creating a table.
*Motivation*
    When we want to optimize query performance and find that this requires
changing SORT_COLUMNS, Carbon should support the change. SORT_COLUMNS acts
like the main data index and impacts the data layout. At the same time, we
can re-sort old segment data by the new SORT_COLUMNS.
*Modification*
    1. data loading uses the table-level "SORT_COLUMNS" and stores it as a
segment-level property
    2. queries should use the segment-level property to read data files
    3. only compact segments with the same "SORT_COLUMNS"
    4. convert old segments one by one to the new SORT_COLUMNS and refresh
the DataMap if needed
    5. the show segments command outputs the SORT_COLUMNS of each segment

Any suggestions or questions?
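
If this lands, the user-facing change might look like the following; the
command shape is only a sketch of this proposal (spark is assumed to be a
SparkSession):

  // Hypothetical command shape: change SORT_COLUMNS after creation; old
  // segments keep their segment-level SORT_COLUMNS until re-sorted.
  spark.sql("ALTER TABLE sales SET TBLPROPERTIES('SORT_COLUMNS'='country,id')")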



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Injection of custom rules in session

2019-03-07 Thread David CaiQiang
From Spark 2.2, Spark can inject extensions.
For example:
val spark = SparkSession
  .builder()
  ...
  .withExtensions(...)
  ...

CarbonSession.CarbonBuilder uses the default extensions to create a
CarbonSession; it doesn't inject a parser, analyzer and so on.

And the extensions variable is private in the org.apache.spark.sql package,
so maybe we can't get the extensions directly to inject some rules
ourselves.

So in my opinion, we should modify CarbonBuilder to support it.
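
For reference, a self-contained version of that hook using only the public
Spark 2.2+ API; the no-op rule is just for illustration:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  // A rule that does nothing, to show where a custom rule would plug in.
  case class NoopRule(session: SparkSession) extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan
  }

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .withExtensions(_.injectResolutionRule(NoopRule))
    .getOrCreate()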




-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


[Discussion] How to pass some options into Insert Into command

2019-02-19 Thread David CaiQiang
Hi all,
    For data loading, we can pass some options into the load data command by
using the OPTIONS clause, but the insert into command can't.

    How can we pass options into the insert into command? Some candidates
are as follows.
1. implement an OPTIONS clause for the insert into command
2. use a hint
3. set key=value
4. other methods that achieve the same result, for example, use
"clustered by random(2)" to implement "GLOBAL_SORT_PARTITIONS"="2"
5. ??

   Any suggestions?
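
For option 3, the usage could look like this; the key name is hypothetical,
purely to illustrate the proposal (spark is assumed to be a SparkSession):

  // Hypothetical session-level option consumed by the next insert into.
  spark.sql("SET carbon.options.global_sort_partitions=2")
  spark.sql("INSERT INTO targetTable SELECT * FROM sourceTable")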



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] DDLs to operate on CarbonLRUCache

2019-02-19 Thread David CaiQiang
+1 for 5 and 6: after point 5 estimates the cache size, point 6 can modify
the configuration dynamically.

+1 for 3 and 4: maybe we need to add a lock to sync the concurrent
operations. If the cache can be released this way, there will be no need to
restart the driver.

Maybe we also need to check how these operations would be used in
"[DISCUSSION] Distributed Index Cache Server".



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [VOTE] Apache CarbonData 1.5.2(RC2) release

2019-02-01 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: 【Discuss】load data cause GC overhead limit exceeded

2019-01-28 Thread David CaiQiang
Maybe we can validate this property and limit it to less than the total
memory size (or 60%...) of the driver side.




-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSS] Move to gitbox as per ASF infra team mail

2019-01-06 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [discussion] Open check code style of example module

2018-12-24 Thread David CaiQiang
In some examples, we need to print some info to the console, so we need to
skip some code style checks there.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [carbondata-presto enhancements] support reading stream segment in presto

2018-12-18 Thread David CaiQiang
I will try to implement it.
PR link:
https://github.com/apache/carbondata/pull/3001



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

2018-12-16 Thread David CaiQiang
Better to also support altering 'sort_columns' and 'sort_scope'.

After table creation and data loading, the user could then adjust
'sort_columns' and 'sort_scope'.






-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Proposal] Thoughts on general guidelines to follow in Apache CarbonData community

2018-11-18 Thread David CaiQiang
+1 for 1,2,3,4,5,8,9,10

+0 for 6,7



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Throw NullPointerException occasionally when query from stream table

2018-11-06 Thread David CaiQiang
Where do we call SegmentPropertiesAndSchemaHolder.invalidate in the handoff
thread?



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] Remove BTree related code

2018-08-23 Thread David CaiQiang
+0 for 1 (delete 11 files).

Better to also add start/end keys to DataMapRow.
In my opinion, the union of min/max values and start/end keys can work
better.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


[DISCUSSION] Implement file-level Min/Max index for streaming segment

2018-08-23 Thread David CaiQiang
Hi All,
    Currently, filter queries on the streaming table always scan all
streaming files, even when no data in the streaming files meets the filter
conditions.
    So I am trying to support a file-level min/max index on the streaming
segment. It helps to reduce the task number and improve the performance of
filter scans in some cases.
    Please check the document in JIRA:
https://issues.apache.org/jira/browse/CARBONDATA-2853

    Any questions or suggestions?



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: CarbonStore Java & REST API proposal

2018-07-04 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: S3 support

2018-06-22 Thread David CaiQiang
Hi Kunal,
 I have some questions.

*Problem (Locking):*
  Does the memory lock support multiple drivers concurrently loading data
into the same table? Maybe this limitation should be noted.

*Problem (Write with append mode):*
  1. atomicity
  After the overwrite operation fails, maybe the old file is destroyed. It
should be possible to recover the old file.

*Problem (Alter rename):*
    If the table folder differs from the table name, maybe the "refresh
table" command should be enhanced.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Use RowStreamParserImp as default value of config 'carbon.stream.parser'

2018-06-08 Thread David CaiQiang
+1, I agree with using RowStreamParserImpl by default.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Getting Different Encoding in timestamp and date datatype.

2018-03-22 Thread David CaiQiang
The direct dictionary ignores the millisecond of the timestamp data.
If the millisecond is needless, the direct dictionary uses the integer to
improve compression.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Getting Different Encoding in timestamp and date datatype.

2018-03-21 Thread David CaiQiang
Hi Jatin, the Timestamp column is non-dictionary by default. After adding
the Timestamp column to the table property 'dictionary_include', it will
have the same encoding list.





-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [VOTE] Apache CarbonData 1.3.1(RC1) release

2018-03-05 Thread David CaiQiang
+1




-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Implement Lucene DataMap to support full text search

2018-02-09 Thread David CaiQiang
It will be an independent module.
The layout may look like this:

carbondata
   |_ datamap
   |__ lucene



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Should CarbonData need to integrate with Spark Streaming too?

2018-01-17 Thread David CaiQiang
+1 for 2), the same as the integration with Structured Streaming.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Initiating Apache CarbonData-1.3.0 Release

2017-12-25 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Problem with with writing the loadStartTime in "dd-MM-yyyy HH:mm:ss:SSS" format

2017-12-19 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] Refactory on spark related modules

2017-12-05 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [VOTE] Apache CarbonData 1.2.0(RC3) release

2017-09-24 Thread David CaiQiang

+1. Release this package as Apache CarbonData 1.2.0.

1. Release
  There are important new features and the integration of a new platform.

2. The tag
  "mvn clean -DskipTests -Pspark-2.1 -Pbuild-with-format package" passed
  "mvn clean -DskipTests -Pspark-2.1 -Pbuild-with-format install" passed

3. The artifacts
  Both md5sum and sha512 are correct.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] Update the function of show segments

2017-09-20 Thread David CaiQiang
I agree with Jacky.

I think enhanced segment metadata will help us to understand the table.

I suggest the following properties for segment metadata:
1. total data file size
2. total index file size
3. data file count
4. index file count
5. last modified time (last update time)

With this information, we can answer the following questions:
1. Is there a small-file issue? Does the table require compaction or not,
and which type should be used?
2. Are there too many index files? We can estimate whether the total size of
the index in memory is too big for the driver memory configuration.
3. Does some segment have too many files? This may be useful for locating
performance issues.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] About data backward compatibility

2017-08-14 Thread David CaiQiang
I agree with Ravindra; now is the time to implement the migration tool.



-
Best Regards
David Cai
--
View this message in context: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-About-data-backward-compatibility-tp20183p20219.html
Sent from the Apache CarbonData Dev Mailing List archive mailing list archive 
at Nabble.com.


Re: [DISCUSSION] Interfaces for index frame work

2017-08-14 Thread David CaiQiang
+1



-
Best Regards
David Cai
--
View this message in context: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-Interfaces-for-index-frame-work-tp13274p20218.html
Sent from the Apache CarbonData Dev Mailing List archive mailing list archive 
at Nabble.com.


Re: problem with branch-1.1

2017-06-26 Thread David CaiQiang
The Spark core version of hdp2.6.0-spark2.1.0 is Spark 2.1.1.
In Spark 2.1.1, CatalystConf was already removed.

We raised PRs to support it and will merge them later:
https://github.com/apache/carbondata/pull/1096
https://github.com/apache/carbondata/pull/1017

And the command will be "mvn package -DskipTests -Pspark-2.1
-Dspark.version=2.1.1 -Phadoop-2.7.2"



-
Best Regards
David Cai
--
View this message in context: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/problem-with-branch-1-1-tp16004p16261.html
Sent from the Apache CarbonData Dev Mailing List archive mailing list archive 
at Nabble.com.


Re: [VOTE] Presto integration version :Re: [DISCUSSION] Whether Carbondata should work with Presto in the next release version(1.2.0)

2017-06-13 Thread David CaiQiang
+1 for supporting presto integration.



-
Best Regards
David Cai
--
View this message in context: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/VOTE-Presto-integration-version-Re-DISCUSSION-Whether-Carbondata-should-work-with-Presto-in-the-next-tp14906p14907.html
Sent from the Apache CarbonData Dev Mailing List archive mailing list archive 
at Nabble.com.


Re: About ColumnGroup feature

2017-06-12 Thread David CaiQiang
+1 for A

As far as I know, the ColumnGroup feature can't improve performance very
well; it has become a nearly useless feature. If it is necessary, we need to
redesign this feature to keep the code clean and tune it well to improve
performance.



-
Best Regards
David Cai
--
View this message in context: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/About-ColumnGroup-feature-tp14436p14729.html
Sent from the Apache CarbonData Dev Mailing List archive mailing list archive 
at Nabble.com.

