Re: Abstracting CarbonData's Index Interface

2016-10-02 Thread Aniket Adnaik
I would agree with having a simple segment definition. A segment can carry
metadata that describes it - for example: segment type, index availability,
index type, index storage type (attached or detached/secondary), etc. A
streaming ingest segment may also contain min-max information for each
blocklet that can be used for indexing.
This way, the implementation details of different segment types can be
hidden from the user.
We may also have to think about partitioning support along with load
segments in the future.
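For illustration, the per-segment metadata described above could be sketched roughly as follows. This is only a sketch: all class, enum, and field names here are hypothetical, not existing CarbonData types.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of per-segment metadata as proposed above;
// none of these names exist in CarbonData.
public class SegmentMetadata {

    public enum SegmentType { BATCH_LOAD, STREAMING_INGEST }
    public enum IndexStorageType { ATTACHED, DETACHED_SECONDARY }

    private final SegmentType segmentType;
    private final boolean indexAvailable;
    private final String indexType;               // e.g. "MDK"
    private final IndexStorageType indexStorageType;
    // Optional per-blocklet min/max values, keyed by blocklet id,
    // mainly useful for streaming ingest segments.
    private final Map<String, byte[][]> blockletMinMax;

    public SegmentMetadata(SegmentType segmentType, boolean indexAvailable,
                           String indexType, IndexStorageType indexStorageType,
                           Map<String, byte[][]> blockletMinMax) {
        this.segmentType = segmentType;
        this.indexAvailable = indexAvailable;
        this.indexType = indexType;
        this.indexStorageType = indexStorageType;
        this.blockletMinMax = blockletMinMax;
    }

    public SegmentType getSegmentType() { return segmentType; }
    public boolean isIndexAvailable() { return indexAvailable; }
    public String getIndexType() { return indexType; }
    public IndexStorageType getIndexStorageType() { return indexStorageType; }
    public Optional<byte[][]> minMaxFor(String blockletId) {
        return Optional.ofNullable(blockletMinMax.get(blockletId));
    }
}
```

Readers of the metadata (e.g. a query planner) only see this descriptor, which is how the segment-type implementation details stay hidden.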

Best Regards,
Aniket



On Sun, Oct 2, 2016 at 10:25 PM, Jacky Li  wrote:

> After a second thought regarding the index part, another option is to have
> a very simple Segment definition which can only list all the files it has
> (listFile taking the QueryModel as input); implementations of Segment can
> be IndexSegment, MultiIndexSegment or StreamingSegment (no index). In the
> future, developers are free to create a MultiIndexSegment that selects an
> index internally. Is this option better?
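The "very simple Segment" option could look roughly like this. This is a sketch with placeholder types: QueryModel is stubbed out, and none of these are real CarbonData APIs.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the "very simple Segment" option discussed above.
interface QueryModel { }

interface Segment {
    // A segment only knows how to list its files, optionally pruned by the query.
    List<String> listFiles(QueryModel query);
}

// A segment backed by an index: it could prune files using the query.
class IndexSegment implements Segment {
    private final List<String> allFiles;
    IndexSegment(List<String> allFiles) { this.allFiles = allFiles; }
    @Override
    public List<String> listFiles(QueryModel query) {
        // A real implementation would consult its index; the sketch returns everything.
        return Collections.unmodifiableList(allFiles);
    }
}

// A streaming segment has no index yet, so it always returns all files.
class StreamingSegment implements Segment {
    private final List<String> allFiles;
    StreamingSegment(List<String> allFiles) { this.allFiles = allFiles; }
    @Override
    public List<String> listFiles(QueryModel query) { return allFiles; }
}
```

The point of the design is that callers only depend on `Segment.listFiles`, so whether pruning happens via one index, several, or none is an implementation detail.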
>
> Regards,
> Jacky
>
> > On Oct 3, 2016, at 11:00 AM, Jacky Li wrote:
> >
> > I am currently thinking of these abstractions:
> >
> > - A SegmentManager is the global manager of all segments for one table.
> > It can be used to get all segments and to manage segments during loading
> > and compaction.
> > - A CarbonInputFormat takes a table path as input, meaning it represents
> > the whole table containing all segments. When getSplit is called, it
> > gets all segments through the SegmentManager interface.
> > - Each Segment contains a list of Indexes and an IndexSelector. While
> > CarbonData currently only has the MDK index, developers can create
> > multiple indexes for each segment in the future.
> > - An Index is an interface for filtering on blocks/blocklets, and
> > provides this functionality only. Implementations should hide all
> > complexity, like deciding where to store the index.
> > - An IndexSelector is an interface for choosing which index to use based
> > on the query predicates. The default implementation chooses the first
> > index; an implementation can also decide not to use any index at all.
> > - A Distributor maps the filtered blocks/blocklets to InputSplits.
> > Implementations can take the number of nodes and the parallelism into
> > consideration, and can also decide to distribute tasks based on blocks
> > or blocklets.
> >
> > So the main concepts are SegmentManager, Segment, Index, IndexSelector,
> InputFormat/OutputFormat, Distributor.
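As a rough sketch, these concepts might be expressed as interfaces like the following. The signatures are illustrative only, not CarbonData's actual code; the included FirstIndexSelector just mirrors the default behavior described above.

```java
import java.util.List;

// Illustrative sketch of the concepts listed above, with placeholder types.
interface Block { }                       // stands in for both block and blocklet

interface Index {
    // Filter candidate blocks/blocklets by a predicate; the only job of an Index.
    List<Block> filter(Object predicate, List<Block> candidates);
}

interface IndexSelector {
    // Choose an index for the query; null means "no index, full scan".
    Index select(List<Index> available, Object predicate);
}

interface Segment {
    List<Index> getIndexes();
    IndexSelector getIndexSelector();
    List<Block> allBlocks();
}

interface SegmentManager {
    List<Segment> getAllSegments();       // also used during load and compaction
}

interface Distributor {
    // Map filtered blocks/blocklets to splits, considering nodes and parallelism.
    List<List<Block>> toSplits(List<Block> filtered);
}

// Default selector per the description above: pick the first index if any.
class FirstIndexSelector implements IndexSelector {
    @Override
    public Index select(List<Index> available, Object predicate) {
        return available.isEmpty() ? null : available.get(0);
    }
}
```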
> >
> > There will be a default implementation of CarbonInputFormat whose
> > getSplit will do the following:
> > 1. get all segments by calling the SegmentManager
> > 2. for each segment, choose the index to use via its IndexSelector
> > 3. invoke the selected Index to filter out blocks/blocklets (since these
> > are two concepts, a parent class may need to be created to encapsulate
> > them)
> > 4. distribute the filtered blocks/blocklets to InputSplits via the
> > Distributor.
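The four steps above can be sketched procedurally. Plain strings stand in for blocks and splits, the index filter is passed as a function, and the round-robin distribution is just one possible Distributor policy, not the actual one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Procedural sketch of the four getSplit steps above (illustrative only).
public class GetSplitSketch {

    public static List<List<String>> getSplits(
            List<List<String>> segments,                       // 1. from SegmentManager
            Function<List<String>, List<String>> indexFilter,  // 2.+3. selected Index
            int splitsWanted) {                                // 4. Distributor policy
        // Steps 2-3: filter each segment's blocks/blocklets through the index.
        List<String> filtered = new ArrayList<>();
        for (List<String> segmentBlocks : segments) {
            filtered.addAll(indexFilter.apply(segmentBlocks));
        }
        // Step 4: round-robin the filtered blocks into the requested splits.
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < splitsWanted; i++) splits.add(new ArrayList<>());
        for (int i = 0; i < filtered.size(); i++) {
            splits.get(i % splitsWanted).add(filtered.get(i));
        }
        return splits;
    }
}
```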
> >
> > Regarding the input to the Index.filter interface, I have not decided
> > whether to use the existing QueryModel or to create a new, cleaner
> > QueryModel interface. If a new QueryModel is desired, it should only
> > contain the filter predicate and the projected columns, making it much
> > simpler than the current QueryModel. But since the current QueryModel is
> > also used in Compaction, I think it is better to do this cleanup later?
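A minimal version of the simpler QueryModel suggested here might contain just these two members. This is a hypothetical sketch; Expression is a placeholder, not CarbonData's filter expression type.

```java
import java.util.List;

// Sketch of a "much simpler" QueryModel: only a filter predicate
// and the projected columns (illustrative names only).
public class SimpleQueryModel {
    public interface Expression { boolean evaluate(Object row); }

    private final Expression filterPredicate;   // may be null: no filter
    private final List<String> projectColumns;

    public SimpleQueryModel(Expression filterPredicate, List<String> projectColumns) {
        this.filterPredicate = filterPredicate;
        this.projectColumns = List.copyOf(projectColumns);
    }

    public Expression getFilterPredicate() { return filterPredicate; }
    public List<String> getProjectColumns() { return projectColumns; }
}
```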
> >
> >
> > Does this look fine to you? Any suggestion is welcome.
> >
> > Regards,
> > Jacky
> >
> >
> >> On Oct 3, 2016, at 2:18 AM, Venkata Gollamudi wrote:
> >>
> >> Yes Jacky, the interfaces need to be revisited.
> >> For Goal 1 and Goal 2: abstraction is required for both the Index and
> >> the Index store.
> >> Multi-column (composite) indexes also need to be considered.
> >>
> >> Regards,
> >> Ramana
> >>
> >> On Sat, Oct 1, 2016 at 11:01 AM, Jacky Li  wrote:
> >>
> >>> Hi community,
> >>>
> >>>   Currently CarbonData has built-in index support, which is one of its
> >>> key strengths. Using the index, CarbonData can run very fast filter
> >>> queries by filtering at the block and blocklet level. However, this
> >>> also introduces memory consumption for the index tree and impacts
> >>> first-query time, because the index has to be loaded from the file
> >>> footers into memory. In addition, in a multi-tenant environment,
> >>> multiple applications may access data files simultaneously, which
> >>> further exacerbates this resource consumption issue.
> >>>   So, I want to propose and discuss a solution with you to solve this
> >>> problem and create an interface abstraction for CarbonData's future
> >>> evolution.
> >>>   I think the final result of this work should achieve at least two
> >>> goals:
> >>>
> >>> Goal 1: Users can choose where to store the index data; it can be
> >>> stored in the processing framework's memory space (e.g., Spark driver
> >>> memory) or in a service outside of the processing framework (e.g., an
> >>> independent database service)
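Goal 1 amounts to hiding the index's location behind a store interface. A minimal sketch under that assumption (all names here are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of Goal 1: where an index lives is hidden behind a store interface,
// so it can sit in driver memory or in an external service.
interface IndexStore {
    void put(String segmentId, byte[] serializedIndex);
    Optional<byte[]> get(String segmentId);
}

// In-process store, e.g. held in the Spark driver's memory.
class InMemoryIndexStore implements IndexStore {
    private final Map<String, byte[]> cache = new HashMap<>();
    @Override public void put(String segmentId, byte[] idx) { cache.put(segmentId, idx); }
    @Override public Optional<byte[]> get(String segmentId) {
        return Optional.ofNullable(cache.get(segmentId));
    }
}
// An external implementation (e.g. backed by an independent database service)
// would implement the same interface, leaving callers unchanged.
```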
> >>>
> >>> Goal 2: Developers can add more indexes of their choice to CarbonData
> >>> files. Besides the B+ tree on the multi-dimensional key which
> >>> CarbonData currently supports, developers are free to add other
> >>> indexing technologies to make certain workloads faster. These new
> >>> indexes should be added in a pluggable way.
> >>>
> >>>   In order to achieve these goals, an abstraction needs to be created
> >>> for the CarbonData project, including:
> >>>
> >>> - Segment: each segment represents one load of data, and is tied to
> >>> the indexes created with that load.
> >>>
> >>> - Index: an index is created when its segment is created, and is
> >>> leveraged when CarbonInputFormat's getSplit is called, to filter out
> >>> the required blocks or even blocklets.
> >>>
> >>> - CarbonInputFormat: there may be n indexes created for a data file;
> >>> when querying these data files, the InputFormat should know how to
> >>> access these indexes, and initialize or load them if required.
> >>>
> >>>   Obviously, this work should be separated into different tasks and
> >>> implemented gradually. But first of all, let's discuss the goals and
> >>> the proposed approach. What is your idea?
> >>>
> >>> Regards,
> >>> Jacky

[GitHub] incubator-carbondata pull request #189: [CARBONDATA-267] Set block_size for ...

2016-10-02 Thread gvramana
Github user gvramana commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/189#discussion_r81474992
  
--- Diff: format/src/main/thrift/schema.thrift ---
@@ -124,6 +124,7 @@ struct TableSchema{
1: required string table_id;  // ID used to
2: required list<ColumnSchema> table_columns; // Columns in the table
3: required SchemaEvolution schema_evolution; // History of schema 
evolution of this table
+   4: optional i32 block_size
--- End diff --

@Zhangshunyu Actually we need not change the Schema for such extra properties
being added. We can directly store TABLE_PROPERTIES as string key-value
pairs in thrift, so that all such property configuration parameters can be
handled without modifying the thrift definition in the future. @jackylk
please comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #197: [CARBONDATA-272]Fixed Test case fail...

2016-10-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-carbondata/pull/197
