+1
-
Best Regards
David Cai
Congratulations Akash
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
-1, please fix the pending defect and merge the completed PR first.
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
+1, you can finish the implementation.
How about using the following SQL instead of the cartesian join?

SELECT df.filePath
FROM targetTableBlocks df
WHERE EXISTS (SELECT 1 FROM srcTable
              WHERE srcTable.value BETWEEN df.min AND df.max)
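A minimal sketch of issuing that query from Spark (assuming targetTableBlocks and srcTable are already registered as views; names are taken from the SQL above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("minmax-prune").getOrCreate()
// targetTableBlocks(filePath, min, max) and srcTable(value) are pre-registered views.
val matchedFiles = spark.sql(
  """SELECT df.filePath
    |FROM targetTableBlocks df
    |WHERE EXISTS (SELECT 1 FROM srcTable
    |              WHERE srcTable.value BETWEEN df.min AND df.max)""".stripMargin)
matchedFiles.show(false)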
-
Best Regards
David Cai
I mean you can push your logic into CarbonDataSourceScan as a dynamic runtime
filter.
Actually, CarbonDataSourceScan already uses min/max zone maps as an index
filter to prune the block list (in the CarbonScanRDD.getPartitions method).
We can do more things on the join query. Here I assume the source
Hi Akash,
You can enhance the runtime filter to improve the join performance.
It has a rule to dynamically check whether the join can add the
runtime filter or not.
Better to push down the runtime filter into CarbonDataSourceScan, and
better to avoid adding a UDF function to
Hi Venu and Ajantha,
For the new SI solution, I have some suggestions also.
1. agree to avoid the query plan rewrite
2. push down the SI filter to the pruning step of the main table directly on
the driver side, but we need a distributed job to improve performance
3. segment-level usability
for
Hi Nihal, my suggestions are as follows:
1. keep the normal output of the show segment command
2. add more information for loading, like numFiles, numRows, rawDataSize
(maybe show segment needs it also; take care of CDC, which needs to update this
information)
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
+1
Are there other impacts?
-
Best Regards
David Cai
Hi Akash,
for the simple update and delete scenario, you can try to do it.
During update/delete:
1) for the updated/deleted segment, no need to update segmentMetadataInfo.
2) for the newly inserted segment, you can summarize the blocklet-level index
into a segment-level index, reading
Hi Akash, for the simple update case, can you do a test to confirm your
inference after a quick change?
-
Best Regards
David Cai
+1
It will take many resources and a long time to compact a large segment, and
may not get a good result.
Since auto compaction is disabled, we could give a large default value (maybe
1024 GB); it will not impact the behavior by default.
And the table-level threshold is needed also.
If the user wants
Congratulations to Ajantha.
-
Best Regards
David Cai
a) remove the mergeIndex property and event listener; add mergeIndex as a part
of the loading/compaction transaction.
b) if merging the index fails, loading/compaction should fail directly.
c) keep the merge_index command and mark it deprecated.
For a new table, maybe it will do nothing.
Hi Ajantha,
Agree to remove "carbon.si.segment.merge".
1. dynamically decide the number of loading tasks
   Before loading the SI segment, it is easy to estimate the total size of
this SI segment.
   So better to dynamically decide the number of loading tasks to avoid
small carbon files in
Agree with Vishal, better to test and confirm the difference.
-
Best Regards
David Cai
PR #3999 already implemented this enhancement, please take note.
PR URL: https://github.com/apache/carbondata/pull/3999
-
Best Regards
David Cai
I list the segment-related features as follows before starting to refactor the
segment interface.
[table related]
1. get lock for table
   lock for tablestatus
   lock for updatedTablestatus
2. get lastModifiedTime of table
[segment related]
1. segment datasource
   datasource: file format, other
Hi Ramana,
I agree with you.
When writing the segment file, the system uses listFiles to collect all index
files.
In some cases, it will add stale index files into the segment file.
We will try to fix it first.
-
Best Regards
David Cai
Hi Linwood,
1. better to implement "Update feature enhancement" first; it will
create a new segment to store the new files.
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Update-feature-enhancement-td99769.html
2. clean deletedelta files
now carbon needs
Agree with Ravindra.
1. stop all automatic data cleaning in load/insert/compact/update/delete...
2. when the clean files command cleans in-progress or uncertain data, we can
move it to the data trash.
It can prevent deleting useful data by mistake; we have already found this
issue in some scenarios.
other
1. cleaning the in-progress segment is very dangerous, please remove this
part from the code. Only after the user explicitly uses the clean files command
with the option "clean_in_progressing"="true" should we check the segment lock
to clean the segment.
2. if the status of a segment is mark_for_delete/compacted, we can
+1
Maybe we need to use a delta file to store updated values instead of the
deletedelta file.
-
Best Regards
David Cai
Hi Akash,
3. The update operation contains an insert operation. The update operation
will process this issue the same way the insert operation does.
-
Best Regards
David Cai
Hi Kunal,
1. The user uses the SQL API or other interfaces. This UUID is a transaction
id, and we already store the timestamp and other information in the
segment metadata.
This transaction id can be used in the loading/compaction/update
operation. We can append this id into the log if
Hi Akash,
1. the update operation still has "deletedelta" files; it keeps the same
behavior as before. Horizontal compaction is still needed.
2. loading one carbonindexmerge file will be fast and will not impact the
query performance. (customers have faced this issue)
3. for insert/loading, it can trigger
[Background]
1. In some scenarios, two loading/compaction jobs may write data to the same
segment; it will result in data confusion and impact some features,
which will not work fine again.
2. Loading/compaction/update/delete operations need to clean stale data
before execution. Cleaning
Hi Akash
2. the new tablestatus only stores the latest status file name, not all
status files.
The status file will store all segment metadata (just like the old tablestatus).
3. if we have a delta file, no need to read the status file for each query. Only
reading the delta file is enough if the status file is not
[Background]
Now the update feature inserts the updated rows into the old segments where the
data are updated.
In the end, it needs to reload the indexes of the related segments.
[Motivation]
If there are many updated segments, it will take a long time to reload the
indexes again.
So I suggest writing
add solution 4 to separate the status file by segment status
*solution 4:* Based on solution 2, support status.inprogress
1) new tablestatus file format
{
  "statusFileName": "status-uuid1",
  "inProgressStatusFileName": "status-uuid2.inprogress",
[Background]
Now the size of one segment metadata entry is about 200 bytes in the
tablestatus file. If the table has 1 million segments and the mean segment size
is 1 GB (meaning the table size is 1 PB), the size of the tablestatus
file will reach 200 MB.
Any reading/writing operation on this
This mechanism will work fine for LOCAL_SORT loading of big data on a
small cluster with big executors.
If it doesn't match these conditions, better to consider a new solution to
adapt to the generic scenario.
I suggest refactoring NO_SORT; maybe we can check and improve the
global_sort solution.
Agree with Ravi.
-
Best Regards
David Cai
+1
Can we add a commit method to support multiple operations at once?

CarbonSDKUID
  .delete(...)
  .delete(...)
  .update(...)
  .commit
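A rough Scala sketch of the shape such an API could take (all names here are hypothetical):

// Hypothetical builder: buffer delete/update operations and apply them
// together in one commit, instead of one transaction per call.
trait CarbonMutationBuilder {
  def delete(filter: String): CarbonMutationBuilder
  def update(filter: String, assignments: Map[String, String]): CarbonMutationBuilder
  def commit(): Unit // applies all buffered operations in a single transaction
}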
-
Best Regards
David Cai
+1 for solution 1
but will the limit statement get the head or the tail of the segment list? Or
does it need to order by some columns? Please describe the details.
-
Best Regards
David Cai
+1 for removing it.
-
Best Regards
David Cai
+1 for solution 2
Can we support more than one array_contains by using an SI join (like SI on
primitive data types)?
-
Best Regards
David Cai
update reply:
The merging index should be a part of loading. It is not good to extract the
merging index into an independent process; it brought the query issue (the
system can't find the index files when/after merging).
In my opinion, during loading, new .carbonindex files should be temporary, we
should
Better to always merge the index.
-1 for 1,
+1 for 2,
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
+1 and agree with Kunal
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
Hi, Kunal
another question about point 4:
*4. A staged Maven repository is available for review at:*
https://repository.apache.org/content/repositories/orgapachecarbondata-1062/
please check this URL; it doesn't contain the javadoc jar.
I have a doubt about point 3:
*3. The artifacts to be voted on are located here:*
https://dist.apache.org/repos/dist/dev/carbondata/2.0.0-rc3/
why do we need to release two identical source packages?
apache-carbondata-2.0.0-spark2.3-source-release.zip
apache-carbondata-2.0.0-spark2.4-source-release.zip
In my opinion, this is an issue if it can't work.
Better to change the topic title to use 'question'/'issue' instead of
'discussion'.
-
Best Regards
David Cai
How about marking the stream SQL as experimental?
For now, in some cases, it is an easy way for the user to understand the
streaming table.
We can improve it in the future.
-
Best Regards
David Cai
please check another topic:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Float-and-Double-compatibility-issue-with-external-segments-to-Carbon-td93870.html.
If this is an issue, you can create an issue in the CarbonData JIRA.
-
Best Regards
David Cai
It is a historical legacy issue; it was easy to reuse the solution of the double
data type.
I suggest implementing the float data type independently.
-
Best Regards
David Cai
I agree with Ravindra and I can try to fix it (I have mentioned it in a PR
review comment).
-
Best Regards
David Cai
Congratulations Kunal
-
Best Regards
David Cai
Hi, Manhua
Now no_sort reuses the loading flow of local_sort. It is not a good
solution and leads to the situation you have mentioned. In my opinion,
we need to adjust the loading flow of no_sort, maybe to be like global_sort
finally.
In addition, the producer-consumer pattern in data encoding
-1 for me, based on the below points.
1. We need to update quick-start-guide.md for Carbon 2.0. For example,
Carbon 2.0 supports the multi-tenant scenario, so the carbon property
"carbon.storelocation" should be deprecated. Only the user who used
"carbon.storelocation" in the previous version can
+1
-
Best Regards
David Cai
+1
Please take care of the performance changes while refactoring datamaps.
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
doc:
https://github.com/apache/carbondata/blob/master/docs/ddl-of-carbondata.md#sort-columns-configuration
testcase:
So far, CarbonData doesn't support primary keys, foreign keys, NOT NULL,
etc.
Table creation can use SORT_COLUMNS to create the main index, but secondary
indexes are not supported.
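For illustration, a minimal table definition following the SORT_COLUMNS documentation above (table and column names are made up; depending on the version the syntax is STORED AS carbondata or STORED BY 'carbondata'):

spark.sql(
  """CREATE TABLE sales (
    |  id INT,
    |  city STRING,
    |  amount DOUBLE
    |)
    |STORED AS carbondata
    |TBLPROPERTIES ('SORT_COLUMNS' = 'id, city')""".stripMargin)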
-
Best Regards
David Cai
Maybe it used the javax.jdo.option.ConnectionURL configuration.
When Hive, Hadoop, and Spark don't set this configuration, it will use the
parameter of getOrCreateCarbonSession.
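A hedged sketch of pinning that configuration explicitly when building the session (the Derby URL below is only an example value):

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("metastore-url")
  // explicit metastore URL, so no ambient hive-site/hadoop config wins
  .config("javax.jdo.option.ConnectionURL",
    "jdbc:derby:;databaseName=/tmp/metastore_db;create=true")
  .getOrCreate()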
-
Best Regards
David Cai
please check JIRA and find the design doc:
https://issues.apache.org/jira/browse/CARBONDATA-3347
-
Best Regards
David Cai
How will it compact Seg_0 and Seg_1 in the new compaction?
For example: Seg_0 has 3 ranges (0-100), (100-200), (200-300) and Seg_1 has
2 ranges (50-150) and (250-300).
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
All documents (including the streaming table) are under the following link:
https://github.com/apache/carbondata/tree/master/docs
You can find all examples in the examples/spark2 module:
example 1 (supports Update/Delete)
You can get the table schema by the CarbonTable.getCreateOrderColumn method.
It will return the correct table schema.
"name,city,id,salary" is the column storage order; it is not the table
schema.
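A small sketch of reading the create-order column names (the exact getCreateOrderColumn signature varies across versions; the no-argument form is assumed here):

import org.apache.carbondata.core.metadata.schema.table.CarbonTable
import scala.collection.JavaConverters._

// Returns column names in create order, i.e. the real table schema,
// not the internal column storage order.
def createOrderColumnNames(table: CarbonTable): Seq[String] =
  table.getCreateOrderColumn().asScala.map(_.getColName)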
-
Best Regards
David Cai
Hi all,
Let's discuss whether it is necessary to support SORT_COLUMNS
modification.
*Background*
"SORT_COLUMNS" is a table level property, and we can't change it after
creating a table.
*Motivation*
When we want to optimize the query performance and found that it needs
to
From Spark 2.2, Spark can inject extensions.
For example:

val spark = SparkSession
  .builder()
  ...
  .withExtensions(...)
  ...
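A more complete, self-contained sketch of the same API (MyRule is a placeholder rule that leaves the plan unchanged):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

case class MyRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan // no-op placeholder
}

val spark = SparkSession
  .builder()
  .appName("with-extensions")
  .withExtensions { extensions =>
    extensions.injectResolutionRule(session => MyRule(session))
  }
  .getOrCreate()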
CarbonSession.CarbonBuilder uses the default extensions to create CarbonSession.
It doesn't inject the parser, analyzer, and so on.
And the extensions variable is private
Hi all,
For data loading, we can pass some options into the load data command by
using the options clause, but the insert into command can't.
How do we pass options into the insert into command? Some options are as
follows (see the sketch after this list).
1. implement an options clause for the insert into command
2. use a hint
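For reference, a sketch of the asymmetry (the OPTIONS clause on load data is documented CarbonData DML; the insert statement has no equivalent):

// LOAD DATA accepts per-statement options today...
spark.sql(
  """LOAD DATA INPATH '/tmp/sample.csv' INTO TABLE t1
    |OPTIONS ('DELIMITER' = ',', 'HEADER' = 'true')""".stripMargin)

// ...but INSERT INTO has no such clause.
spark.sql("INSERT INTO TABLE t1 SELECT * FROM source_table")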
+1 for 5,6: after point 5 estimates the cache size, point 6 can modify
the configuration dynamically.
+1 for 3,4: maybe we need to add a lock to sync the concurrent operations. If
it wants to release the cache, it will not need to restart the driver.
Maybe we also need to check how to use these
+1
-
Best Regards
David Cai
Maybe we can validate this property and limit it to less than the total
memory size (or 60%...) of the driver side.
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
In some examples, we need to print some info to the console.
So we need to skip some code style checks.
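For example, the usual pattern in the examples module (a minimal sketch):

// scalastyle:off println
println("loaded 1000 rows into the carbon table")
// scalastyle:on println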
-
Best Regards
David Cai
I will try to implement it.
PR link:
https://github.com/apache/carbondata/pull/3001
-
Best Regards
David Cai
Better to support altering 'sort_columns' and 'sort_scope' also.
After table creation and data loading, the user can adjust
'sort_columns' and 'sort_scope'.
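A hedged sketch of what that could look like, reusing the existing SET TBLPROPERTIES convention (this syntax is a proposal here, not a shipped feature at the time of this mail):

spark.sql("ALTER TABLE t1 SET TBLPROPERTIES ('SORT_COLUMNS' = 'city, id')")
spark.sql("ALTER TABLE t1 SET TBLPROPERTIES ('SORT_SCOPE' = 'LOCAL_SORT')")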
-
Best Regards
David Cai
+1 for 1,2,3,4,5,8,9,10
+0 for 6,7
-
Best Regards
David Cai
Where do we call SegmentPropertiesAndSchemaHolder.invalidate in the handoff
thread?
-
Best Regards
David Cai
+0 for 1 (delete 11 files).
Better to add Start/End keys to DataMapRow also.
In my opinion, the union of Min/Max values and Start/End keys can work
better.
-
Best Regards
David Cai
Hi All,
Currently, filter queries on the streaming table always scan all
streaming files, even when no data in the streaming files meets
the filter conditions.
So I am trying to support a file-level min/max index on the streaming segment.
It helps to reduce the task number and improve
+1
-
Best Regards
David Cai
Hi Kunal,
I have some questions.
*Problem (Locking):*
Does the memory lock support multiple drivers concurrently loading
data into the same table? Maybe it should note this limitation.
*Problem (Write with append mode):*
1. atomicity
After the overwrite operation fails, maybe the
+1, I agree with using RowStreamParserImpl by default.
-
Best Regards
David Cai
The direct dictionary ignores the milliseconds of the timestamp data.
If the milliseconds are not needed, the direct dictionary uses an integer to
improve compression.
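A tiny illustration of the trade-off (values made up):

val millis: Long = 1546300800123L    // 2019-01-01 00:00:00.123
val seconds: Long = millis / 1000    // 1546300800: a smaller integer, compresses better
val restored: Long = seconds * 1000  // 1546300800000: the .123 ms component is lost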
-
Best Regards
David Cai
Hi Jatin, the timestamp column is non-dictionary by default. After adding the
timestamp column to the table property 'dictionary_include', it will have
the same encoding list.
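For example (a sketch with illustrative names; 'dictionary_include' was a table property of that era, which used the STORED BY syntax):

spark.sql(
  """CREATE TABLE events (
    |  ts TIMESTAMP,
    |  name STRING
    |)
    |STORED BY 'carbondata'
    |TBLPROPERTIES ('DICTIONARY_INCLUDE' = 'ts')""".stripMargin)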
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
It will be an independent module.
The layout may be like this:
carbondata
|_ datamap
   |__ lucene
-
Best Regards
David Cai
+1 for 2). The same as integration with Structured Streaming
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
+1 Release this package as Apache CarbonData 1.2.0
1. Release
There are important new features and the integration of a new platform.
2. The tag
"mvn clean -DskipTests -Pspark-2.1 -Pbuild-with-format package" passed
"mvn clean -DskipTests -Pspark-2.1 -Pbuild-with-format install" passed
3.
I agree with Jacky.
I think enhanced segment metadata will help us to understand the table.
I suggest the following properties for segment metadata:
1. total data file size
2. total index file size
3. data file count
4. index file count
5. last modified time (last update time)
Through these
I agree with Ravindra; now is the time to implement the migration tool.
-
Best Regards
David Cai
+1
-
Best Regards
David Cai
The Spark core version of hdp2.6.0-spark2.1.0 is Spark 2.1.1.
In Spark 2.1.1, CatalystConf was already removed.
We raised PRs to support it and will merge them later.
https://github.com/apache/carbondata/pull/1096
https://github.com/apache/carbondata/pull/1017
And the command will be "mvn
+1 for supporting presto integration.
-
Best Regards
David Cai
+1 for A
As far as I know, so far the ColumnGroup feature can't improve performance very
well; it has become a nearly useless feature. If necessary, we need to redesign
this feature to keep the code clean and tune it well to improve performance.
-
Best Regards
David Cai