Hi,
Thank you for the proposal. Please check my comments below.
1. Global dictionary: It was one of the prime features when CarbonData was initially released to Apache. Even though Spark has introduced Tungsten, it still has benefits like compression and faster filtering and aggregation queries. But after the
Hi Manhua,
Even at the page level, the row count will probably not be available from the next version onwards. It will be decided by size, not by count. The code is already merged, and we are keeping the count-based page configuration temporarily for backward compatibility.
So at any place, we will not
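For anyone who wants to experiment once that version lands, a purely hypothetical sketch; the 'table_page_size_inmb' property name is my assumption, please verify it against the merged code:

  // Hypothetical DDL: page cut-off by size instead of row count.
  // The property name is an assumption, not confirmed syntax.
  spark.sql(
    """CREATE TABLE sensor_data (id INT, value DOUBLE)
      |STORED AS carbondata
      |TBLPROPERTIES('table_page_size_inmb'='1')""".stripMargin)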
Hi Manhua,
Main problem with this approach is we cannot save any IO as our IO unit is
blocklet not page. Once it is already to memory I really don’t think we can
get performance with bloom at page level. I feel the solution would be
efficient only the IO is saved somewhere.
Our min/max index is
Hi All,
Apache CarbonData community is pleased to announce the release of
Version 1.5.4 under The Apache Software Foundation (ASF).
CarbonData is a high-performance
Hi,
It is better to move to PrestoSQL, as its community is more active compared to PrestoDB's. But we should consider the current users of PrestoDB as well. Maintaining two modules in CarbonData is not a viable solution, as the maintenance becomes difficult.
I feel we can wait for one more version
Hi Akash,
There is a difference between index datamaps (like bloom) and OLAP datamaps (like MV). Index datamaps are used only for pruning the data, while OLAP datamaps are used as pre-computed data which can be fetched directly as per the query.
In the OLAP datamap case, lazy build or deferred build makes
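To make the distinction concrete, a rough sketch of the two DDL flavours (table and column names are made up, assuming an active SparkSession named spark):

  // Index datamap: only prunes blocklets at scan time; data is still read
  // from the main table.
  spark.sql("CREATE DATAMAP dm_bloom ON TABLE sales USING 'bloomfilter' " +
    "DMPROPERTIES('INDEX_COLUMNS'='city')")

  // OLAP datamap: a pre-computed result fetched directly per query; a
  // deferred build delays its population until an explicit REBUILD.
  spark.sql("CREATE DATAMAP dm_agg USING 'mv' WITH DEFERRED REBUILD AS " +
    "SELECT city, sum(amount) FROM sales GROUP BY city")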
Hi,
Please check my views on it.
The basic design should be a clear separation between modules. For example, Spark-based configurations mean nothing to Presto, so every module can have its own conf and constant classes (see the sketch below).
1. No need for CarbonProperties and CarbonCommonConstants.
2. Should have
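A hypothetical sketch of the separation (all names invented for illustration):

  // Each module owns its configuration keys; Presto never has to load
  // Spark-only constants, and vice versa.
  object CarbonSparkConstants {
    val SHUFFLE_PARTITIONS = "carbon.spark.shuffle.partitions"
  }

  object CarbonPrestoConstants {
    val SPLIT_SIZE = "carbon.presto.split.size"
  }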
Hi,
The current Carbon Presto integration added a new Presto connector that takes the carbon store folder and lists the databases and tables from the folders. This implementation has many issues, like:
1. DB and table folders always need to be in a specific order, and the names of the folders should always
Hi,
+1 for making 'no_sort' the default sort_scope.
1. Regarding removing the empty SORT_COLUMNS option, I don't think we should change the current behaviour, as some users might already be using it in their scripts; if we remove the empty SORT_COLUMNS option, their scripts will start failing after an upgrade (see the sketch below). It
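A quick sketch of both cases:

  // With 'no_sort' as the default, a plain CREATE TABLE needs nothing extra.
  spark.sql("CREATE TABLE t1 (a INT, b STRING) STORED AS carbondata")

  // Existing scripts that pass the empty option must keep working after upgrade.
  spark.sql("CREATE TABLE t2 (a INT, b STRING) STORED AS carbondata " +
    "TBLPROPERTIES('SORT_COLUMNS'='')")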
Hi,
Yes, we have a plan. But it will take some time to bring this to the Presto integration, as we need to first bring up and stabilize the MV module, and also need to analyze how to update the query plan to use pre-agg tables in Presto.
Regards,
Ravindra.
--
Sent from:
Hi Jacky,
In the Spark integration we have two approaches: one with very deep integration, and one with shallow integration using Spark's FileFormat. With the deep integration we use the datasource name 'carbondata'; this name is also registered via Java services, so anything which comes with this
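Roughly, from a user's point of view (the shallow format's short name 'carbon' is as I recall it, please correct me if it differs):

  // Deep integration: datasource name "carbondata", discovered via Java services.
  spark.sql("CREATE TABLE t (a INT) STORED AS carbondata")

  // Shallow integration through Spark's FileFormat API.
  val df = spark.range(10).toDF("a")
  df.write.format("carbon").save("/tmp/carbon_out")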
+1
Yes Jacky, he is not going to add any new plugin. Depending on the folder structure and table status, it determines whether a table is transactional or non-transactional inside the same plugin. PR
https://github.com/apache/carbondata/pull/2982/ has already been raised for it.
Regards,
Ravindra.
--
Sent
Hi Jacky,
It's a good idea to support writing transactional tables from the SDK. But we need to add the following limitations as well:
1. It can work only on file systems which can take an append lock, like HDFS.
2. Compaction and delete segment cannot be done on online segments till they are converted to the
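For reference, a sketch of how a non-transactional SDK write looks today (builder methods may differ slightly across SDK versions); the transactional variant would additionally need the append lock and segment handling above:

  import org.apache.carbondata.core.metadata.datatype.DataTypes
  import org.apache.carbondata.sdk.file.{CarbonWriter, Field, Schema}

  val fields = Array(new Field("id", DataTypes.INT), new Field("name", DataTypes.STRING))
  val writer = CarbonWriter.builder()
    .outputPath("/tmp/sdk_out")       // for a transactional table this would be a segment path
    .withCsvInput(new Schema(fields))
    .writtenBy("sdk-example")
    .build()
  writer.write(Array("1", "abc"))
  writer.close()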
I agree with @kumarvishal, better not to add more options as it confuses the user. We had better fall back automatically depending on the size of the dictionary.
--
Sent from:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
+1
Please make sure the DDL is consistent with Hive; no need to add any new DDL.
Regards,
Ravindra
--
Sent from:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
+1 for JSON loading from the CarbonSession LOAD command.
@xuchuanyin There is a reason why we are not completely depending on the Spark datasource for loading data. We have a specific feature called bad-record handling; if we load data directly through Spark, I don't think we can get the bad records present in
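For example, bad-record handling is driven by options on the LOAD command (paths are placeholders):

  spark.sql(
    """LOAD DATA INPATH '/tmp/input.csv' INTO TABLE sales
      |OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='true',
      |        'BAD_RECORDS_ACTION'='REDIRECT',
      |        'BAD_RECORD_PATH'='/tmp/badrecords')""".stripMargin)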
Hi,
Since CarbonData is columnar, we can filter on individual columns and use the output of one column's filter to filter the remaining columns. This is what we call bitset pipelining. So in your case, if column A got row 1 after its filter, then only that row 1 is used to filter the remaining columns.
If my
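A toy sketch of the idea (not the actual scan code):

  import java.util.BitSet

  // Filter column A over the whole page first...
  def filterA(pageSize: Int, matches: Int => Boolean): BitSet = {
    val bits = new BitSet(pageSize)
    (0 until pageSize).foreach(i => if (matches(i)) bits.set(i))
    bits
  }

  // ...then evaluate column B only at the row positions that survived A.
  def filterB(prev: BitSet, matches: Int => Boolean): BitSet = {
    val out = new BitSet(prev.size())
    var i = prev.nextSetBit(0)
    while (i >= 0) {
      if (matches(i)) out.set(i)
      i = prev.nextSetBit(i + 1)
    }
    out
  }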
+1
I feel it is better to remove the transactional flag from the SDK API as it is currently redundant. We can support it in a better way in the future.
--
Sent from:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Hi,
Thanks for testing the performance. We have also observed this performance difference and are working on improving it. Please check my latest discussion
(http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/CarbonData-Performance-Optimization-td62950.html)
to improve
Hi,
I have the following doubts and suggestions for this tool.
1. In which module are you planning to keep this tool? Ideally, it should be under the tools folder, and going forward we can add more tools like this under it.
2. Which file's schema are you printing? Are you randomly choosing the file to
Hi all,
The PMC vote has passed for the Apache CarbonData 1.4.1 release; the result is as below:
+1 (binding): 3 (Liang Chen, Kumar Vishal, Ravindra)
+1 (non-binding): 5
Thanks all for your votes.
Regards,
Ravindra
--
Sent from:
Hi,
I have fixed the review comments and updated the design document. Please check the V2 version of the document in the JIRA:
https://issues.apache.org/jira/browse/CARBONDATA-2827
Regards,
Ravi
--
Sent from:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Hi,
In the current pre-aggregate design we cannot create datamaps on non-carbon-format tables. But we have another module called MV, where we will add the functionality to allow other format tables as Materialized Views (MV) as well. But it may take some more time to stabilize this feature.
Ravindra
Hi,
REBUILD DATAMAP is implemented only for a full refresh, not for incremental data loading; that's why it tries to refresh all the segments irrespective of whether they are already built or not (see the sketch below). We are planning incremental rebuilding of datamaps in the next version.
I feel we can block
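For reference, the current behaviour in DDL terms (table and datamap names are made up):

  // The datamap is created with a deferred build...
  spark.sql("CREATE DATAMAP dm USING 'mv' WITH DEFERRED REBUILD AS " +
    "SELECT city, sum(amount) FROM sales GROUP BY city")

  // ...and today REBUILD refreshes every segment, already built or not.
  spark.sql("REBUILD DATAMAP dm")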
Yes Shahid, you are right. In the update scenario there is a chance of creating new data within the same segment, and it will lead to wrong data if the schema is different. It is not the same case for delete, as we don't have any schema for it.
I feel it is always better to create a new segment even for
In the case of a dataframe, we can take varchar(max) as the default.
--
Sent from:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Hi,
I agree with option 2, but instead of a new datatype, use varchar(size).
There are more optimizations we can do with the varchar(size) datatype, like (see the sketch below):
1. If the size is small (less than 8 bytes), then we can write with a fixed-length encoder instead of LV (length-value) encoding; it can save a lot of space and memory.
2. If the
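A back-of-the-envelope illustration of point 1 (the 4-byte length prefix is an assumption for illustration):

  val rows = 1000000L
  val lvBytes = rows * (4 + 8)   // length prefix + up to 8 data bytes per value
  val fixedBytes = rows * 8      // varchar(8): exactly 8 data bytes, no prefix
  println(s"LV: $lvBytes bytes, fixed-length: $fixedBytes bytes")  // ~12 MB vs ~8 MB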
Hello,
I don't get much from the logs, but the error seems related to a memory issue in Spark. From your old emails I gather that you are using a 3-node cluster. Do all 3 nodes have NodeManagers and DataNodes?
So it is better to give fewer executors and provide more memory to each, like below.
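Something along these lines; the numbers are illustrative only, size them to your nodes' actual memory:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .config("spark.executor.instances", "3")  // one fat executor per node
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "20g")   // more memory per executor
    .getOrCreate()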