Re: Improve carbondata CDC performance

2021-03-31 Thread akashrn5
Hi Ajantha, thanks for your points. Now we actually cache the splits, so the actual join will be faster, and even if the pruning doesn't happen it won't affect performance much. This is what we learned from the test we did during the POC; it doesn't make much difference in performance, basically no

Re: Improve carbondata CDC performance

2021-03-31 Thread akashrn5
Hi, in the new design we cache the splits and the actual join operation makes use of them, so it will be faster. From the test results, even though the dataset didn't prune anything, it won't make any difference in performance; basically it doesn't degrade. As far as the actual use case goes, the changing of the whole table

Re: Improve carbondata CDC performance

2021-03-31 Thread akashrn5
Hi Ravi, thanks for your inputs. Actually, the test with binary search and broadcasting didn't give much benefit, and from a code perspective we would also need to sort the data ourselves for the min-max search logic on the array, and also consider the scenarios of multiple blocks with the same min

Re: [Discussion]Presto Queries leveraging Secondary Index

2021-03-29 Thread akashrn5
Hi, +1 for the feature and the design. I have given some comments on the design doc for handling some missing scenarios and small changes; can you please update the design doc? Since the comments are not major except one or two, you can go ahead with the feature and update the doc for the comments in parallel. Thanks

Re: [VOTE] Apache CarbonData 2.1.1(RC2) release

2021-03-26 Thread akashrn5
Hi, +1. Regards, Akash R -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Support SI at Segment level

2021-03-22 Thread akashrn5
Hi, +1 for the feature. This is very important for improving query performance, instead of waiting for the SI and main table to always be in sync. I have reviewed the doc and given comments; please handle them, and please discuss with @venu the SI-as-datamap feature so this stays in line with it, as informed earlier. P.S: This design

Re: [DISCUSSION] Support alter schema for complex types

2021-03-22 Thread akashrn5
Hi, +1 for the feature. Thanks for proposing it, as most of the use cases from the user perspective now involve complex columns. I have reviewed the doc and given comments; please work on them, then it can be reviewed again. Regards, Akash R -- Sent from:

Re: [DISCUSSION] Describe complex columns

2021-03-22 Thread akashrn5
Hi, +1 for the new functionality. My suggestion is to modify the DDL to something like the below: DESCRIBE column fieldname ON [db_name.]table_name; DESCRIBE table short/transient [db_name.]table_name; Others can give their suggestions. Thanks, Regards, Akash R -- Sent from:
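To make the suggestion concrete, a rough sketch of how the two forms could be invoked, assuming an active SparkSession named spark (the table and field names are made up, and the syntax itself is only a proposal to be finalized):

  // purely illustrative spark.sql calls for the proposed DDL (not existing syntax)
  spark.sql("DESCRIBE COLUMN address.city ON db1.sales")   // describe one (possibly nested) field
  spark.sql("DESCRIBE TABLE SHORT db1.sales")              // compact description, top-level columns only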

Re: create SI can succeed partitially

2021-03-02 Thread akashrn5
Hi, yes, as you mentioned this is a major drawback in the current SI flow. This problem exists because, when we get the set of segments to load, we start an executor service and hand it the whole segment list, and only after the .get do we mark the status of all of them as success at once. So we need to rewrite this code to make it
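A minimal sketch of the kind of rewrite implied here, with hypothetical helper names (loadSegmentToSI and markSegmentStatus are placeholders, not the real SI APIs): mark each segment's status as soon as its own task finishes, instead of marking all of them after a single .get.

  import java.util.concurrent.Executors
  import scala.concurrent.{ExecutionContext, Future}
  import scala.util.{Failure, Success}

  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

  def loadSegmentToSI(segmentId: String): Unit = ???                     // placeholder for the real SI load
  def markSegmentStatus(segmentId: String, status: String): Unit = ???   // placeholder for the status update

  val segments = Seq("0", "1", "2")
  segments.foreach { segmentId =>
    Future(loadSegmentToSI(segmentId)).onComplete {
      case Success(_) => markSegmentStatus(segmentId, "SUCCESS")            // per-segment success
      case Failure(_) => markSegmentStatus(segmentId, "MARKED_FOR_DELETE")  // only the failed segment is reverted
    }
  }
  // a real rewrite would still collect and await the futures before finishing the command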

Re: [DISCUSSION] Display the segment ID when carbondata load is successful

2021-02-26 Thread akashrn5
Hi, +1. Considering others' opinions, just the segment ID can be enough, and users should take care to check its status after the load to decide whether to query or go ahead with any other operation on that segment. This also keeps the code simple and does not introduce any bugs, and the test scope will also be

Re: Support SI at Segment level

2021-02-23 Thread akashrn5
Hi Nihal, thanks for bringing this up. It's an important feature to leverage SI at the small segment level as well. Work is already being done on making SI prune at the datamap interface, so your design should be aligned with that. So it's better to check the SI-as-datamap design first and then

Re: Improve carbondata CDC performance

2021-02-23 Thread akashrn5
Hi Venu, thanks for your review. I have replied the same in the document; you are right. 1. It is taken care of: we group the extended blocklets by split path and get the min-max at block level. 2. We need to group by the file path to avoid duplicates from the dataframe output. I have updated the

Re: Improve carbondata CDC performance

2021-02-18 Thread akashrn5
Hi David, thanks for your suggestion. I checked the query you suggested locally, and it runs as a *BroadcastNestedLoopJoin*. Since the dataset is small locally it goes for that, but in a cluster, when the data size grows, it falls back to a cartesian product again. How about our own search logic in a
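For anyone who wants to reproduce the observation, a small hedged check of which physical join Spark picks (table and column names are made up; the non-equi min/max condition mimics the CDC pruning join discussed here, and an active SparkSession named spark is assumed):

  // When the right side is small, Spark tends to choose BroadcastNestedLoopJoin for this
  // non-equi join; once the data no longer fits the broadcast threshold it can fall back
  // to a CartesianProduct, which is what was observed in the cluster.
  val joined = spark.sql(
    """SELECT t.*
      |FROM target_blocks t JOIN change_keys c
      |  ON c.join_value BETWEEN t.min_value AND t.max_value""".stripMargin)
  joined.explain(true)   // inspect the physical plan for the join operator actually chosen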

Re: Improve carbondata CDC performance

2021-02-17 Thread akashrn5
Hi all, The design doc is updated, please go through and give your inputs/suggestions. Thanks, Regards, Akash R -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Display the segment ID when carbondata load is successful

2021-02-07 Thread akashrn5
Hi, I think the auto compaction after load is still not async; the plan is there to make it async. But in my opinion, we should give back the current segment ID, and if it has been merged into some segment we should say that "X" is the segment ID loaded and it has been merged into segment "Y", so that the user can take

Re: [Bug] SI Compatibility Issue

2021-01-21 Thread akashrn5
Hi, I think we can't block any operations on the table just for this reason. Since we have given two commands for it, we can't block the user. 1. Either we need to handle all of this during refresh table only, instead of having one more register index command, which will solve the issue. 2. Or we need

Re: [DISCUSSION] Display the segment ID when carbondata load is successful

2021-01-17 Thread akashrn5
Hi Nihal, the problem statement is not so clear: basically, what is the use case, or in which scenario is the problem faced? Because we need to get the result from the successful segments themselves. So please elaborate a little on the problem. Also, if you want to include more details, do not

Re: [Discussion]Presto Queries leveraging Secondary Index

2021-01-17 Thread akashrn5
Hi Venu, thanks for the suggestions. 1. Option 1 is not a good idea; I think performance will be bad. 2. For option 2: we have other indexes like Lucene and Bloom where distributed pruning happens. Lucene is also an index stored along with the table, but not as another table like SI, so we scan Lucene in a

Re: [Discussion] Upgrade presto-sql to 333 version

2020-12-26 Thread akashrn5
+1 Regards, Akash R Nilugal -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION]Merge index property and operations improvement.

2020-11-27 Thread akashrn5
Hi, the final points to be considered are: 1. Make merge index enabled by default and fail the compaction/load if the merge index operation fails. 2. The merge index property can be removed completely; if any developer wants to check something, it will be simpler in the code to add some check to skip the merge

Re: Size control of minor compaction

2020-11-23 Thread akashrn5
Hi Sunday, this looks like a valid scenario because some user applications might be doing minor compaction by default and some may have enabled auto compaction, which is basically minor, and if the size is more we blindly go ahead and compact. So I think instead of supporting auto

Re: [DISCUSSION]Merge index property and operations improvement.

2020-11-23 Thread akashrn5
Hi David, thanks for the reply. a) Remove the mergeIndex property and event listener, and add mergeIndex as part of the loading/compaction transaction. ==> Yes, this can be done, as already discussed. b) If the merge index fails, loading/compaction should fail directly. ==> Agreed, same as replied

Re: [DISCUSSION]Merge index property and operations improvement.

2020-11-23 Thread akashrn5
Hi Ajantha, thanks for the reply; please find my comments. *a) and b)* I agree with the point that there is no need to mark the load as success if the merge index fails; we can fail the load and update the status and segment file only after the merge index, to avoid many reliability, concurrency and cache issues.

Re: [DISCUSSION]Join optimization with Carbondata's metadata

2020-11-10 Thread akashrn5
Please note the below points in addition to the above. 1. There is a JIRA in Spark similar to what I have raised, https://issues.apache.org/jira/browse/SPARK-27227; they are also aiming at the same thing, but it is still in progress and targeted for Spark 3.1.0. There they plan to first execute a query on the right table to get

Re: [Discussion] About carbon.si.segment.merge feature

2020-11-08 Thread akashrn5
Hi, it's better to remove it, I feel, as a lot of code will be avoided and we can do it right the first time. But please consider the below points. 1. Maybe we can first test the time difference between global sort and the existing local sort load time, perhaps on a per-segment basis, so that we can have a

Re: [DISCUSSION] Support MERGE INTO SQL API

2020-11-08 Thread akashrn5
Hi, actually these are all things I suggested to mention and update in the design document. All of these are Q&As. For your answer: A3 -> The question is, when there is no condition present for the when matched clause, how do we update all the data? Please mention a SQL example in the design document. The same applies for insert,

Re: [DISCUSSION] Support MERGE INTO SQL API

2020-11-05 Thread akashrn5
Hi, +1, thanks for proposing the idea. Please consider the below points in design and coding, and please update the design with them (see the sketch below for point 1). 1. When there are multiple whenMatched conditions, what happens? They should be applied in order. 2. Validations such as: when matched can have either update or delete,
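To make point 1 concrete, a hedged sketch of what ordered whenMatched clauses could look like in the proposed SQL API (illustrative only; table/column names are assumptions and the exact grammar is to be fixed in the design doc):

  spark.sql(
    """MERGE INTO target t
      |USING source s ON t.id = s.id
      |WHEN MATCHED AND s.op = 'D' THEN DELETE
      |WHEN MATCHED THEN UPDATE SET t.value = s.value
      |WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)""".stripMargin)
  // the two WHEN MATCHED clauses are expected to be evaluated in the order they are written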

Re: [VOTE] Apache CarbonData 2.1.0(RC2) release

2020-11-04 Thread akashrn5
+1 for release. Thanks. Regards, Akash R Nilugal -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion] Partition Optimization

2020-10-29 Thread akashrn5
Hi, +1. It's long-pending work; good to complete it now. As Ajantha said, you can have a look at Iceberg's hidden partitioning, but this one is just about not storing partition data in files, faster queries and lower storage. You can analyze and suggest that improvement in another discussion like

Re: Re: How to update based on the value of the source table

2020-10-10 Thread akashrn5
Hi, we already support more than 100 columns; what exactly is your question? Also, it would be helpful if you could ask and discuss issues in the Slack channel rather than the mailing list; it would be easier to follow. Thanks, Akash R -- Sent from:

Re: Re: How to update based on the value of the source table

2020-10-09 Thread akashrn5
Hi, I got your question; we do not yet support partial column update in Carbon. When you say set col2, col3 by select col2, col3 from B where a.id = b.id, then whenever the where condition is met we select the whole column from B and set it into A. So you can have a query like *update iud.a d
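For reference, a hedged sketch of the full shape of such an update-with-subquery (column names are assumptions, roughly following the form started in the snippet above; an active SparkSession named spark is assumed):

  spark.sql(
    """UPDATE iud.a d
      |SET (d.c2, d.c3) = (SELECT b.c2, b.c3 FROM iud.b WHERE d.id = b.id)""".stripMargin)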

Re: How to update based on the value of the source table

2020-10-08 Thread akashrn5
Hi, I checked our test cases; we have a similar test case and it works fine. You can refer to "update carbon table[select from source table with where and exist]" in UpdateCarbonTableTestCase.scala. In that test case, you can have a query like the below: *sql("""update iud.dest11 d set (d.c3, d.c5 ) =

Re: Clean files enhancement

2020-09-15 Thread akashrn5
Hi David, 1. We cannot remove the clean-up code from all commands, because in case of any failure, if we do not clean the stale files there can be issues of wrong or extra data. What I think is that we are calling APIs which do, say, X amount of work, but we may need only some Y

Re: [Discussion] Update feature enhancement

2020-09-04 Thread akashrn5
Hi David, 1. Yes, as I already said, it will come into the picture in the delete case, as update is (delete + insert). 2. Yes, we will be loading the single merged file into the cache, which can be a little better compared to the existing one. 3. I didn't get the complete answer, actually: when exactly do you plan

Re: [Discussion] Update feature enhancement

2020-09-04 Thread akashrn5
Hi David, please check the below points. One advantage we get here is that when we insert as a new segment, it will take the new insert flow without the converter step and that will be faster. But here are some points. 1. When you write new segments for each update, the horizontal compaction in

Re: [Discussion] Improve the reading/writing performance on the big tablestatus file

2020-09-03 Thread akashrn5
Hi David, after discussing with you it is a little clearer; let me just summarize in a few lines. *Goals* 1. Reduce the size of the status file (which reduces the overall size by some MBs). 2. Make the table status file less prone to failures, and fast to read. *For the above goals with your

Re: [Discussion] Improve the reading/writing performance on the big tablestatus file

2020-09-03 Thread akashrn5
Hi David, thanks for starting this discussion; I have some questions and inputs. 1. Solution 1 is just plain compression, where we get the benefit of size, but we will still face reliability issues in case of concurrency. So it can be -1. 2. Solution 2, writing and reading to separate

Re: [DISCUSSION] Presto+Carbon transactional and Non-transactional Write Support

2020-07-27 Thread akashrn5
Hi Ajantha, thanks for the inputs; please check the comments. a) You mentioned that currently creating a table from Presto and inserting data gives a non-transactional table. So, to create a transactional table, do we still depend on Spark? > Currently it is dependent on Spark, but I'm planning to

Re: Improving show segment info

2020-02-16 Thread akashrn5
Hi Ajantha, I think event time comes into the picture when the user has a timestamp column, as in timeseries. So only in that case does this column make sense; otherwise it won't be there. @Likun, correct me if my understanding is wrong. Regards, Akash R Nilugal -- Sent from:

Re: Improving show segment info

2020-02-16 Thread akashrn5
Hi, >> *1. How about creating a "tableName.segmentInfo" child table for each main table? The user can query this table and it is easy to support filter and group by; we just have to finalize the schema of this table.* We already have many things like index tables and datamap tables just to store this

Re: Improving show segment info

2020-02-16 Thread akashrn5
Hi, > I got your point, but the partition column given by the user does not help reduce the information. If we want to reduce the amount of information, we should ask the user to give a filter on the partition column, like example 3 in my original mail. 1. My concern was that if there are more partition

Re: Improving show segment info

2020-02-16 Thread akashrn5
Hi Likun, thanks for proposing this, +1. It is a good approach, and it is better to provide the user with more info about segments. I have the following doubts and suggestions. 1. You have mentioned the DDL as SHOW SEGMENTS ON TABLE, but currently it is SHOW SEGMENTS FOR TABLE; I suggest not changing the current one, we
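For clarity, the two DDL forms in question, written as they would be run (db1.sales is a made-up table; the FOR form is what exists today and the ON form is the newly proposed one):

  spark.sql("SHOW SEGMENTS FOR TABLE db1.sales")   // existing syntax, should keep working
  spark.sql("SHOW SEGMENTS ON TABLE db1.sales")    // proposed new syntax from the design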

Re: [Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage

2020-02-16 Thread akashrn5
Hi Indhumathi, +1. It solves many memory problems and improves the first-time filter query. I have some doubts. 1. Can you tell me how you are going to read the min-max? I mean, are you going to store the segment-level min-max for all the columns, or, since you said block level, does it mean that for every

Re: Regarding presto carbondata integration

2020-02-16 Thread akashrn5
Hi Ajantha, whatever you mentioned is a big pain point now. Even when we try for write support, the Hadoop and Hive versions supported by the Carbon version are different from what Presto supports, so we might have to keep duplicate code for this case as well. Either we have to put the carbon code in

Re: Re: [DISCUSSION] Multi-tenant support by refactoring datamaps

2020-02-16 Thread akashrn5
Hi, +1. I agree with Jacky; we can store the info in the table metadata. But here is one problem we can face: the metastore connection issue. If there are a lot of tables and datamaps, making many connections to the metastore reduces performance. In that case reading from one schema file will be better. So

Re: [DISCUSSION] Add new compaction type for compacting delta data file

2019-04-02 Thread akashrn5
Hi, thanks for the reply. Once you create the JIRA and the design document is ready, we can further decide the impact and anything else to handle. Thank you. Regards, Akash R -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Add new compaction type for compacting delta data file

2019-04-01 Thread akashrn5
Hi, thanks for clearing the doubt. So, as per my understanding, you basically want to merge all the delete delta files and base carbondata files and write a new segment, which basically helps reduce IO, right? I have some questions regarding that. 1. Are you planning a new

Re: [Discuss]Adapt MV datamap to spark 2.1 version

2019-03-28 Thread akashrn5
Hi, are the changes for supporting 2.1 intrusive, or are you going to use the decoupling strategy? I think decoupling will be better, as once we decide to remove 2.1 from the carbondata code it will be easy to remove. Thanks -- Sent from:

Re: Injection of custom rules in session

2019-03-06 Thread akashrn5
Hi, by extensions do you mean rules? I did not get the question clearly. -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: 【Discussion】master compile failed

2019-03-06 Thread akashrn5
Hi Litao, is this failure happening when you are not connected to the internet? I have sometimes faced this issue. If we add the dependency as you suggested, will it be able to find that artifact? Regards, Akash -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion] DDLs to operate on CarbonLRUCache

2019-02-19 Thread akashrn5
Hi Naman, thanks for proposing the feature. It looks really helpful from both the user and developer perspective. Basically, the design document is needed so that all the doubts can be cleared. 1. Basically, how are you going to handle sync issues, e.g. multiple queries with drop and show cache? Are you

RE: Re:[DISCUSSION] Support Incremental load in datamap and other MV datamap enhancement

2019-02-19 Thread akashrn5
Hi Ravindra, got your point. As I had replied to xuchuanyin, we can take these index datamap enhancements up separately. Thank you -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

RE: Re:[DISCUSSION] Support Incremental load in datamap and other MV datamap enhancement

2019-02-18 Thread akashrn5
I got your point. If each segment has a status file, as I said, we can do pruning without a rebuild as well. But we need to get others' suggestions on this point. So maybe we can take this up in another JIRA and track it there. In this JIRA we can just support incremental data load. -- Sent from:

RE: Re:[DISCUSSION] Support Incremental load in datamap and other MV datamap enhancement

2019-02-18 Thread akashrn5
I agree with you that the index created for old segments will be of no use if a rebuild has not happened, and these are not considered for pruning in queries. But we go for datamap (index) pruning based on the datamap status, which is just enabled or disabled. You cannot maintain a status for each

Re: Re:[DISCUSSION] Support Incremental load in datamap and other MV datamap enhancement

2019-02-18 Thread akashrn5
Hi xuchuanyin, for the index datamap we can have the same behavior as the MV datamap, but it might behave differently in the case of Lucene. We can decide whether to enable it, make it lazy load, or not. The current MV behavior is as below: it supports only lazy load. So when the main table data and the datamap

Re: [DISCUSSION] Support Incremental load in datamap and other MV datamap enhancement

2019-02-18 Thread akashrn5
Hi Dhatchayani, please find the comments below. 1. Yes, you are right, the design document contains this: in the datamap status file we will add the mapping for synchronization of the main table and datamap, and based on that the incremental load is done. 2. I will explain in general: if the main table has 10 segments, and

Re: 【Questions about code】

2019-02-15 Thread akashrn5
Hi Litao, the sparkSql function calls the withProfiler method, and whenever the queryExecution object and SQLStart are made, it calls the generateDF function, which creates the new Dataset object. So once the queryExecution object is made from the logical plan, we call assertAnalyzed(), which

Re: [Discussion]read latest schema in case of external table and file format

2019-02-04 Thread akashrn5
Hi Rahul, actually we are not skipping the old file. Currently we just list the carbondata files in the location and take the first one to infer the schema, but now I just take the latest carbondata file to infer the schema, and while returning the data, if the column is not present in

Re: [Discussion]read latest schema in case of external table and file format

2019-02-04 Thread akashrn5
Hi Liang, when we create a table using a location in the file format case, or when I create an external table from a location, the user can place multiple carbondata files with different schemas in that location and want to read the data at once; in that scenario we can expect the above condition. So

Re: [Discussion]Alter table column rename feature

2018-12-06 Thread akashrn5
We can use the same existing command for both datatype change and rename. -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
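A hedged illustration of reusing the one command for both operations (hypothetical table and column names; the rename support is what this thread proposes to add):

  spark.sql("ALTER TABLE db1.sales CHANGE amount amount BIGINT")        // datatype change only
  spark.sql("ALTER TABLE db1.sales CHANGE amount total_amount BIGINT")  // rename, optionally with a datatype change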

RE: [SUGGESTION]Support compaction no_sort

2018-12-05 Thread akashrn5
Currently, what I have thought is: only if all the loads involved in the compaction are no_sort will we sort during compaction. Currently we have it at table level, and that is fine. So if the table is no_sort, the data will be sorted during compaction; if it is local_sort, it will go through the current compaction flow.
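A tiny sketch of that decision, with assumed names (not the actual CarbonData compaction API):

  // sort during compaction only when the table-level sort scope is NO_SORT;
  // otherwise keep the existing compaction flow for LOCAL_SORT/GLOBAL_SORT tables
  def compactionSortScope(tableSortScope: String): String =
    if (tableSortScope.equalsIgnoreCase("NO_SORT")) "LOCAL_SORT" else tableSortScope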

Re: [Discuss] Removing search mode

2018-11-06 Thread akashrn5
+1. Yes, after the search mode implementation we didn't get as much advantage as expected, and it simply makes the code complex; I agree with Likun. -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [SUGGESTION]Support Decoder based fallback mechanism in local dictionary

2018-08-27 Thread akashrn5
As of now I will code it as a user property, and we can take a decision once we get the performance report with this. -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion] Carbon Local Dictionary Support

2018-06-08 Thread akashrn5
1. If the user gives any invalid value, the default threshold (1000 unique values) will be considered. What is the consideration behind the default value of 1000? *1000 is a value we have mentioned in the design doc. CARBON_LOCALDICT_THRESHOLD is exposed to the user for setting the threshold

Re: [Discussion] Carbon Local Dictionary Support

2018-06-08 Thread akashrn5
Hi Bhavya, local dictionary generation is at task level. If, in an ongoing load, the threshold is breached, then for that load the local dictionary will not be generated for the corresponding column, and there is no dependency on previous loads. For each load a new local dictionary will be

Re: [Discussion] Carbon Local Dictionary Support

2018-06-08 Thread akashrn5
Hi xuchuanyin, please find my comments inline. About query filtering: 1. “during filter, actual filter values will be generated using the column's local dictionary values... then the filter will be applied on the dictionary-encoded data” --- If the filter is not 'equal' but 'like' or 'greater than', can it

Re: after load data using SaveMode.Overwrite, query through beeline return all null field

2018-05-31 Thread akashrn5
Hi, I have checked with the current version and the issue does not reproduce. When I checked the code, there were code changes for the save mode between the 1.3 and 1.4 versions. You can check PR #2186 for the changes done for that part and check your issue again with that PR.
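For anyone retrying this, a hedged sketch of the overwrite load in question (assumes the "carbondata" datasource short name, a made-up table name, and an active SparkSession named spark; verify against a build that contains PR #2186):

  import org.apache.spark.sql.SaveMode

  val df = spark.range(10).selectExpr("id", "cast(id as string) as name")
  df.write
    .format("carbondata")
    .option("tableName", "overwrite_test")
    .mode(SaveMode.Overwrite)
    .save()
  // then query overwrite_test through beeline and confirm the fields are no longer null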

Re: loading data from parquet table always

2018-05-29 Thread akashrn5
Hi, the exception says there is a problem while copying from local to the carbon store (HDFS). It means the writing has already finished in the temp folder; after writing, the files are copied to HDFS, and it is failing at that point. So with this exception trace alone it will be difficult to know the
