Re: [DISCUSSION] Page Level Bloom Filter

2019-11-28 Thread Jacky Li
Hi,

Since the new bloom filter integration is designed to work on the executor
side, to keep it simple I would prefer to keep it inside the data file
itself instead of in a separate index file. Then the executor only needs to
read one file. It is also always better if we can follow the datamap
interface.

Regards,
Jacky



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [DISCUSSION] Page Level Bloom Filter

2019-11-28 Thread Kumar Vishal
Hi Manhua,

The page size will be smaller than or equal to 32k, but the point here is
how much IO benefit we will get during query. Since you are adding the bloom
information inside the page metadata, it might benefit processing time at
the cost of a little IO overhead. So it's better to keep the bloom
information at blocklet level in the footer.

-Regards
Kumar Vishal

On Thu, Nov 28, 2019 at 8:21 PM Jacky Li wrote:

> Hi,
>
> Since the new bloom filter integration is designed to work on the executor
> side, to keep it simple I would prefer to keep it inside the data file
> itself instead of in a separate index file. Then the executor only needs to
> read one file. It is also always better if we can follow the datamap
> interface.
>
> Regards,
> Jacky
>
>
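
To make the trade-off discussed above concrete, here is a minimal Python sketch comparing a page-level bloom filter with a blocklet-level one. It is not CarbonData code: the `BloomFilter` class, `build_filter`, `pages_to_scan`, the sample page contents, and the filter sizes are all hypothetical and chosen only for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_filter(values):
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    return bf


# Hypothetical blocklet made of three pages of column values.
pages = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
page_filters = [build_filter(p) for p in pages]               # page metadata
blocklet_filter = build_filter(v for p in pages for v in p)   # footer


def pages_to_scan(value):
    """Return the indexes of pages that may contain the value."""
    if not blocklet_filter.might_contain(value):
        return []  # a single footer lookup skips the whole blocklet
    return [i for i, f in enumerate(page_filters) if f.might_contain(value)]
```

In this sketch the blocklet-level filter needs only one lookup in the footer to skip a blocklet, while the page-level filters cost extra metadata per page but can narrow the scan to individual pages. Which one pays off in practice depends on the IO cost of reading that metadata, which is the question raised in this thread.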


Propose feature change in CarbonData 2.0

2019-11-28 Thread Jacky Li


Hi Community,

As we are moving to CarbonData 2.0, in order to keep the project moving
forward fast and stable, it is necessary to do some refactoring and clean up
obsolete features before introducing new features.

To do that, I propose making the following features obsolete and unsupported
from 2.0 onward. In my opinion, these features are seldom used.

1. Global dictionary
Since Spark 2.x, aggregation is much faster thanks to Project Tungsten, so
Global Dictionary is not very useful, yet it makes data loading slow and
needs very complex SQL plan transformation.

2. Bucket
The bucket feature of carbon is intended to improve join performance, but
the actual improvement is very limited.

3. Carbon custom partition
Since we now have Hive standard partition, the old custom partition is not
very useful.

4. BATCH_SORT
I have not seen anyone use this feature

5. Page level inverse index
This is arguable. I understand that in a very specific scenario (when there
are many columns in an IN filter) it has a benefit, but it slows down data
loading and makes the encoding code very complex.

6. Old preaggregate and time series datamap implementation
As we have introduced MV, these two features can be dropped. And we can
follow standard SQL to have a new syntax to create an MV: CREATE
MATERIALIZED VIEW

7. Lucene datamap
This feature is not well implemented, as it reads too much index data into
memory, thus creating memory problems in most cases.

8. STORED BY
We should follow either Hive syntax (STORED AS) or SparkSQL syntax (USING).


There is also some internal refactoring we can do:
1. Unify dimension and measure

2. Keep the column order the same as schema order

3. Spark integration refactoring based on Spark extension interface

4. Store optimization PR2729


The aim of this proposal is to make the CarbonData code cleaner and reduce
the community's maintenance effort.
What do you think of it?


Regards,
Jacky







Re: Propose feature change in CarbonData 2.0

2019-11-28 Thread David CaiQiang
+1



-
Best Regards
David Cai