Hi Ajantha,
Thanks for your points.
In the new design we cache the splits, so the actual join will be faster,
and even when the pruning doesn't happen it won't affect the performance
much. We learned this from the test we did during the POC: it doesn't make
much difference in performance, basically no
Hi,
In the new design we cache the splits, and the actual join operation makes
use of them and will be faster. The test results show that even when the
dataset didn't prune anything, it won't make any difference in performance.
Basically it doesn't degrade.
As far as the actual use case, the changing of the whole table
Hi Ravi,
Thanks for your inputs.
Actually, the test with binary search and broadcasting didn't give much
benefit, and from the code perspective also we need to sort the data
ourselves based on the min-max search logic for the array, and also
consider the scenarios of multiple blocks with the same min
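For illustration, the kind of min-max search logic being discussed could be sketched as below. This is only a sketch with hypothetical names, not the actual CarbonData code; sorting by min lets a binary search cut off every block whose min exceeds the probe value, and blocks sharing the same min are handled naturally.

```python
from bisect import bisect_right

def prune_blocks(blocks, value):
    """Return blocks whose [min, max] range may contain `value`.

    `blocks` is a list of (block_id, min_val, max_val) tuples
    (illustrative structure). The stable sort keeps blocks with the
    same min together, so none of them are skipped.
    """
    ordered = sorted(blocks, key=lambda b: b[1])
    mins = [b[1] for b in ordered]
    # All blocks at index < cut have min <= value; the rest cannot match.
    cut = bisect_right(mins, value)
    return [b for b in ordered[:cut] if b[2] >= value]

blocks = [("b1", 1, 10), ("b2", 5, 20), ("b3", 5, 7), ("b4", 30, 40)]
print(prune_blocks(blocks, 8))  # b1 and b2 match; b3's max (7) is below 8
```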
Hi,
+1 for the feature and the design.
I have given some comments on the design doc for handling some missing
scenarios and small changes.
Can you please update the design doc? As the comments are not major except
one or two, we can go ahead with the feature and update for the comments in
parallel.
Thanks
hi,
+1
Regards,
Akash R
--
Sent from:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Hi,
+1 for the feature. This is very important to improve query performance
instead of waiting for the SI and main table to always be in sync.
I have reviewed the doc and given comments; please handle them, and please
discuss with @venu, since the SI-as-datamap feature should be in line with
this, as informed earlier.
P.S: This design
Hi,
+1 for the feature. Thanks for proposing it, as now most of the use cases
from the user perspective involve complex columns.
I have reviewed the doc and given comments; please work on them, then it
can be reviewed again.
Regards,
Akash R
Hi,
+1 for the new functionality.
My suggestion is to modify the DDL to something like below:
DESCRIBE column fieldname ON [db_name.]table_name;
DESCRIBE table short/transient [db_name.]table_name;
Others can give their suggestions
Thanks,
Regards,
Akash R
Hi,
Yes, as you mentioned, this is a major drawback in the current SI flow.
This problem exists because, when we get the set of segments to load, we
start an executor service and pass the whole segment list; after the .get
we mark the status as success for all of them at once.
So we need to rewrite this code to make it
Hi,
+1,
Considering others' opinions, just the segment ID can be enough, and users
should take care to check its status after the load to decide whether to
query or go ahead with any other operation on that segment.
This also keeps the code simple, does not induce any bugs, and the test
scope will also be
Hi Nihal,
Thanks for bringing this up. It's an important feature to leverage SI at
the small-segment level also.
Work is already being done on making SI prune at the datamap interface, so
your design should be aligned with that.
So it's better to check the SI-as-datamap design first and then
Hi Venu,
Thanks for your review.
I have replied to the same in the document.
You are right:
1. It is taken care of: we group the extended blocklets by split path and
get the min-max at the block level.
2. We need to do a group-by on the file path to avoid duplicates from the
dataframe output. I have updated the
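As a rough sketch of point 2 above (hypothetical names and structures, not the actual CarbonData code), grouping on the file path both removes duplicate rows coming out of the dataframe and rolls blocklet-level ranges up to a block-level min-max:

```python
from collections import defaultdict

def block_level_min_max(blocklets):
    """Collapse blocklet-level entries into one (min, max) per block file.

    `blocklets` is a list of (file_path, min_val, max_val) tuples, possibly
    containing duplicates from the dataframe output; grouping on the file
    path deduplicates and widens the range to the block level.
    """
    grouped = defaultdict(lambda: (float("inf"), float("-inf")))
    for path, lo, hi in blocklets:
        cur_lo, cur_hi = grouped[path]
        grouped[path] = (min(cur_lo, lo), max(cur_hi, hi))
    return dict(grouped)

rows = [("part-0", 1, 5), ("part-0", 4, 9), ("part-0", 4, 9), ("part-1", 7, 8)]
print(block_level_min_max(rows))  # {'part-0': (1, 9), 'part-1': (7, 8)}
```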
Hi david,
Thanks for your suggestion.
I checked locally with the query you suggested; it goes as a
*BroadcastNestedLoopJoin*.
Locally, since the dataset is small, it goes for that, but in the cluster,
when the data size grows, it goes back to a cartesian product again.
How about our own search logic in a
Hi all,
The design doc is updated, please go through and give your
inputs/suggestions.
Thanks,
Regards,
Akash R
hi,
I think the auto compaction after load is still not async; the plan is
there to make it async.
But in my opinion, we should give back the current segment ID, and if it
is merged to some segment,
we should say that "X" is the segment ID loaded and it has been merged to
the "Y" segment, so that the user can take
Hi,
I think we can't block any operations on the table just for this reason.
Since we have given two commands for it, we can't block the user.
1. Either we need to handle all of this during refresh table only, instead
of having one more register index command, which would solve the issue.
2. Or we need
Hi Nihal,
The problem statement is not so clear: basically, what is the use case, or
in which scenario is the problem faced? We need to get the result from the
successful segments themselves, so please elaborate a little on the
problem.
Also, if you want to include more details, do not
Hi venu,
Thanks for the suggestions.
1. Option 1 is not a good idea; I think performance will be bad.
2. For option 2: we have other indexes, Lucene and bloom, where distributed
pruning happens. Lucene is also an index stored along with the table, but
not as another table like SI, so we scan Lucene in a
+1
Regards,
Akash R Nilugal
Hi,
The final points to be considered are:
1. Make merge index enabled by default, and fail the compaction or load if
the merge index operation fails.
2. The merge index property can be removed completely; if any developer
wants to check something, it will be simpler to add a check in code to skip
the merge
Hi Sunday,
This looks like a valid scenario because some user applications might be
doing minor compaction by default and some may have enabled auto
compaction, which is basically minor, and if the size is more we blindly go
to compact.
So I think instead of supporting auto
Hi david,
Thanks for the reply.
a) Remove the mergeIndex property and event listener; add mergeIndex as a
part of the loading/compaction transaction.
==> Yes, this can be done, as already discussed.
b) If the index merging fails, the loading/compaction should fail directly.
==> Agree to this, same as replied
Hi Ajantha,
Thanks for the reply; please find my comments.
*a) and b)* Agree to the point that there is no need to make the load a
success if the merge index fails; we can fail the load
and update the status and segment file only after the merge index, to avoid
many reliability, concurrency, and cache issues.
Please note the below points in addition to the above.
1. There is a JIRA in Spark similar to what I have raised,
https://issues.apache.org/jira/browse/SPARK-27227
They are also aiming at the same, but it's still in progress and targeted
for Spark 3.1.0.
There they plan to first execute a query on the right table to get
Hi,
I feel it's better to remove it, as a lot of code will be avoided and we
can do it right the first time.
But please consider the below points.
1. Maybe we can first test the time difference between global sort and the
existing local sort load time, maybe on a per-segment basis, so that we can
have a
Hi,
Actually, all these things I suggested to mention in the design document
and update; all of these are Q&As.
For your answer:
A3 ->
The question is: when there is no condition present for the when-matched
clause, then how do we update all the data? Please mention an SQL example
in the design document.
Same for insert also,
Hi,
+1
Thanks for proposing the idea.
Please consider the below points in the design and coding, and try to
update the design with them:
1. When there are multiple whenMatched conditions, what happens? They
should be in order.
2. Validations, like when matched can have either update or delete,
+1 for release.
Thanks.
Regards,
Akash R Nilugal
Hi,
+1.
It's long-pending work; good to complete it now.
As Ajantha said, you can have a look at Iceberg's hidden partitioning, but
that is just about not storing partition data in files, faster queries, and
lower storage.
You can analyze and suggest the improvement in another discussion like
Hi,
We already support more than 100 columns; what exactly is your question?
Also, it would be helpful if you could ask and discuss issues in the Slack
channel rather than the mailing list; it would be easier to follow.
Thanks,
Akash R
Hi,
I got your question: we do not yet support partial column update in
carbon.
When you say set column2, col3 by select col2, col3 from B where a.id =
b.id, then whenever the where condition is met, we select the whole column
from B and set it into A.
So you can have a query such as
*update iud.a d
Hi,
I checked our test cases; we have a similar test case and it works fine.
You can refer to "update carbon table[select from source table with where
and exist]" in
UpdateCarbonTableTestCase.scala.
In that test case, you can have a query like below:
*sql("""update iud.dest11 d set (d.c3, d.c5 ) =
Hi David,
1. We cannot remove the cleanup code from all commands because, in case of
any failures, if we do not clean the stale files there can be issues of
wrong or extra data.
What I think is: we are calling APIs which do, say, X amount of work, but
we may just need some Y
Hi David,
1. Yeah, as I already said, it will come into the picture in the delete
case, as update is (delete + insert).
2. Yes, we will be loading the single merge file into the cache, which can
be a little better compared to the existing one.
3. I didn't get the complete answer, actually: when exactly do you plan
Hi david,
Please check the below points.
One advantage we get here is that when we insert as a new segment, it will
take the new insert flow without the converter step, and that will be
faster.
But here are some points.
1. When you write new segments for each update, the horizontal
compaction in
Hi David,
After discussing with you it's a little clearer; let me just summarize in a
few lines.
*Goals*
1. Reduce the size of the status file (which reduces the overall size by
some MBs).
2. Make the table status file less prone to failures, and faster to read.
*For the above goals with your
Hi david,
Thanks for starting this discussion; I have some questions and inputs.
1. Solution 1 is just plain compression, where we get the benefit of size,
but we will still face reliability issues in case of concurrency. So it can
be -1.
2. Solution 2:
writing and reading to separate
Hi Ajantha,
Thanks for the inputs; please check the comments.
a) You mentioned that currently, creating a table from Presto and inserting
data will give a non-transactional table.
So, to create a transactional table, we still depend on Spark?
> Currently it's dependent on Spark, but I'm planning to
Hi Ajantha,
I think event time comes into the picture when the user has a timestamp
column, as in timeseries. Only in that case does this column make sense;
otherwise it won't be there.
@Likun, correct me if my understanding is wrong.
Regards,
Akash R Nilugal
Hi,
>>*1. How about creating a "tableName.segmentInfo" child table for each main
>>table?* The user can query this table, and it is easy to support filter
>>and group by; we just have to finalize the schema of this table.
We already have many things like index tables and datamap tables just to
store this
Hi,
>I got your point, but the partition column given by the user does not help
reduce the information. If we want to reduce the >amount of information, we
should ask the user to give a filter on the partition column, like example
3 in my original mail.
1. My concern was, if there are more partition
Hi likun,
Thanks for proposing this.
+1, it's a good approach, and it's better to provide the user more info
about segments.
I have the following doubts and suggestions.
1. You have mentioned the DDL as show segments on table, but currently it
is show segments for table; I suggest not changing the current one. We
Hi Indhumathi,
+1. It solves many memory problems and improves the first-time filter
query.
I have some doubts.
1. Can you tell me how you are going to read the min-max? I mean, are you
going to store the segment-level min-max for all the columns, or, since you
said block level, does it mean for every
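To make the question concrete, a two-level variant (segment-level min-max consulted first, then block-level inside the surviving segments) could look like the sketch below; the data structures and names are purely illustrative assumptions, not the proposed design.

```python
def prune(segments, column, value):
    """Two-level min-max pruning for an equality filter on `column`.

    `segments` maps segment id -> {"minmax": {col: (lo, hi)},
    "blocks": {block id: {col: (lo, hi)}}} (assumed shape). Segment-level
    ranges let a whole segment be skipped before any block metadata is read.
    """
    hits = []
    for seg_id, seg in segments.items():
        lo, hi = seg["minmax"][column]
        if not (lo <= value <= hi):
            continue  # the whole segment is outside the filter range
        for blk_id, cols in seg["blocks"].items():
            blo, bhi = cols[column]
            if blo <= value <= bhi:
                hits.append((seg_id, blk_id))
    return hits

segments = {
    "0": {"minmax": {"c1": (1, 10)},
          "blocks": {"blk-0": {"c1": (1, 5)}, "blk-1": {"c1": (6, 10)}}},
    "1": {"minmax": {"c1": (20, 30)},
          "blocks": {"blk-0": {"c1": (20, 30)}}},
}
print(prune(segments, "c1", 7))  # segment 1 is skipped; only ("0", "blk-1") matches
```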
Hi Ajantha,
What you mentioned is a big pain point now. Even when we try for
write support, the Hadoop and Hive versions supported
in the carbon version differ from what Presto supports, so we might have
to have duplicate code for this case also. Either we have to
put carbon code in
Hi,
+1
I agree with jacky; we can store the info in the table metadata. But one
problem we can face here is metastore connection issues: if there are a lot
of tables and datamaps, making many connections to the metastore reduces
performance. In that case reading from one schema file will be better.
So
hi,
Thanks for the reply. Once you create the JIRA and the design document is
ready, we can further decide the impact and anything else to handle.
Thank you
Regards,
Akash R
hi,
Thanks for clearing up the doubt.
So, according to my understanding, you basically want to merge all the
delete delta files and base carbondata files and write a new segment;
basically this helps to reduce IO, right?
I have some questions regarding that:
1. Are you planning for a new
hi,
Are the changes intrusive for supporting 2.1, or are you going to use the
decoupling strategy?
I hope decoupling will be better, as once we decide to remove 2.1 from the
carbondata code, it will be easy to remove.
Thanks
hi,
By extensions, do you mean rules? I did not get the question clearly.
hi litao,
Is this failure happening when you are not connected to the internet? I
have sometimes faced this issue.
If we add the dependency like you suggested, will it be able to find that
artifact?
regards
akash
Hi naman,
Thanks for proposing the feature. It looks really helpful from the user and
developer perspective.
Basically, the design document is needed so that all the doubts can be
cleared.
1. Basically, how are you going to handle the sync issues, like multiple
queries with drop and show cache? Are you
hi ravindra,
Got your point. As I replied to xuchuanyin, we can take these index
datamap enhancements separately.
Thank you
I got your point. If each segment has a status file, as I said, we can do
pruning without a rebuild also. But we need to get others' suggestions on
this point, so maybe we can take this up in another JIRA and track it. In
this JIRA we can just support incremental data load.
I agree with you that the index created for old segments will be of no use
if the rebuild has not happened and these are not considered for pruning in
queries. But we go for datamap (index) pruning based on the datamap status,
which will be just enabled or disabled. You cannot maintain a status for
each
Hi xuchuanyin,
For an index datamap we can have the same behavior as the mv datamap, but
it might behave differently in the case of Lucene. We can decide whether to
enable lazy load or not.
Currently the mv behavior is as below:
it supports only lazy load. So when the main table data and the datamap
Hi dhatchayani,
Please find the comments below.
1. Yes, you are right; the design document contains this. In the datamap
status file we will add the mapping for synchronization of the main table
and datamap, and based on that the incremental load is done.
2. I will explain in general: if the main table has 10 segments, and
Hi litao,
The sparkSql function calls the withprofiler method,
and whenever the queryExecution object and SQLStart are made, this calls
the generateDF function, which creates the new DataSet object.
So once the queryExecution object is made from the logical plan, we call
assertAnalyzed(), which
Hi rahul,
Actually, we are not skipping the old file. Currently we just list the
carbondata files in the location and take the first one to infer the
schema, but now I take the latest carbondata file to infer the schema, and
while giving the data, if the column is not present in
Hi Liang,
When we create a table using a location in the file format case, or when I
create an external table from a location, the user can place multiple
carbondata files with different schemas in that location and want to read
the data at once; in that scenario we can expect the above condition.
So
We can use the same existing command for both datatype change and rename.
Currently, what I have thought is: only if all the loads involved in the
compaction are no-sort will we sort during compaction. Currently we have
this at the table level, which is fine. So if the table has no_sort, during
compaction the data will be sorted; if it is local sort, it will go through
the current compaction flow.
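The rule above fits in a couple of lines; the helper below is only a sketch (the sort-scope strings are assumptions, not the exact carbon constants):

```python
def compaction_should_sort(segment_sort_scopes):
    """Decide whether the compaction itself should sort the data.

    Per the rule above: only when every load (segment) chosen for
    compaction was written with NO_SORT do we sort during compaction;
    if any segment used LOCAL_SORT, the existing compaction flow
    (merging already-sorted runs) is used instead.
    """
    return all(scope == "NO_SORT" for scope in segment_sort_scopes)

print(compaction_should_sort(["NO_SORT", "NO_SORT"]))    # True  -> sort now
print(compaction_should_sort(["NO_SORT", "LOCAL_SORT"])) # False -> current flow
```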
+1
Yes, after the search mode implementation we didn't get as much advantage
as expected, and the code simply becomes complex; I agree with likun.
As of now I will code it as a user property, and we can take a decision
once we get the performance report with this.
1. If the user gives any invalid value, the default threshold (1000 unique
values) will be considered. What is the consideration behind the default
value 1000?
*1000 is a random value we have mentioned in the design doc.
CARBON_LOCALDICT_THRESHOLD is exposed to the user for setting the
threshold.
Hi bhavya,
Local dictionary generation is at the task level. If, in an ongoing load,
the threshold is breached, then for that load the local dictionary will not
be generated for the corresponding column, and there is no dependency on
the previous loads. For each load a new local dictionary will be
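A minimal sketch of that per-load behaviour (illustrative only; the real threshold handling lives in the carbon write path, and the function name here is hypothetical):

```python
def build_local_dictionary(values, threshold=1000):
    """Attempt to build a local dictionary for one column of one load.

    Mirrors the behaviour described above: the dictionary is built per
    load (task); if the number of unique values crosses the threshold,
    generation is abandoned for this column in this load only, with no
    effect on dictionaries built for previous or future loads.
    """
    dictionary = {}
    for v in values:
        if v not in dictionary:
            if len(dictionary) >= threshold:
                return None  # threshold breached: fall back to plain encoding
            dictionary[v] = len(dictionary)
    return dictionary

print(build_local_dictionary(["a", "b", "a"], threshold=2))  # {'a': 0, 'b': 1}
print(build_local_dictionary(["a", "b", "c"], threshold=2))  # None
```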
Hi xuchuanyin,
Please find my comments inline.
About query filtering:
1. "during filter, actual filter values will be generated using column
local dictionary values... then the filter will be applied on the
dictionary-encoded data"
---
If the filter is not 'equal' but 'like' or 'greater than', can it
Hi,
I have checked with the current version and the issue does not reproduce.
When I checked the code, there were code changes for the savemode between
versions 1.3 and 1.4.
You can check PR #2186 for the changes done for that part, and you can
check your issue again with that PR.
Hi,
The exception says there is a problem while copying from local to the
carbon store (HDFS). This means the writing has already finished in the
temp folder; after writing, the files are copied to HDFS, and it is failing
during that step.
So with this exception trace alone, it will be difficult to know the