Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

2019-09-30 Thread Akash Nilugal
Hi Ajantha,

Thanks for the queries and suggestions

1. Yes, this is a good suggestion; I'll include the change. Both date and
timestamp columns are supported, and the document will be updated to say so.
2. Yes, you are right.
3. You are right: if the day-level datamap is not available, we will try to
aggregate the whole day's data from the hour-level datamap; if that is not
available either, then, as explained in the design document, we will serve
the query from the datamap UNIONed with data from the main table, based on
the user query.
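
For illustration only, points 1 and 3 could look roughly like the sketch
below. This is not the final syntax: the property names ('timeseries_column',
'granularity'), the datamap/table/column names, and the timeseries() UDF
shape are all assumptions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ts-datamap-sketch").getOrCreate()

    // Hour-granularity timeseries MV datamap on a hypothetical 'sales' table.
    spark.sql(
      """CREATE DATAMAP sales_agg_hour
        |USING 'mv'
        |DMPROPERTIES ('timeseries_column'='event_time', 'granularity'='hour')
        |AS SELECT timeseries(event_time, 'hour'), sum(amount)
        |   FROM sales GROUP BY timeseries(event_time, 'hour')""".stripMargin)

    // A day-granularity query on the main table. With no day-level datamap,
    // the plan would be rewritten to roll up the hour-level datamap, and any
    // time range not yet loaded into the datamap would be UNIONed in from
    // the main table.
    spark.sql(
      """SELECT timeseries(event_time, 'day'), sum(amount)
        |FROM sales GROUP BY timeseries(event_time, 'day')""".stripMargin).show()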

Regards,
Akash R Nilugal


On 2019/09/30 06:56:45, Ajantha Bhat  wrote: 
> +1,
> 
> I have some suggestions and questions.
> 
> 1. In DMPROPERTIES, I suggest using 'timeseries_column' instead of
> 'timestamp_column', so that it doesn't give the impression that only the
> timestamp datatype is supported; also, please update the document with all
> the supported datatypes.
>
> 2. Querying this datamap table directly is also supported, right? Is
> rewriting the main-table plan to refer to the datamap table only meant to
> save the user from changing his query, or is there another reason?
>
> 3. If the user has not created a day-granularity datamap but only an
> hour-granularity one, and a query is at day granularity, will the data be
> fetched from the hour-granularity datamap and aggregated, or fetched from
> the main table?
> 
> Thanks,
> Ajantha
> 
> On Mon, Sep 30, 2019 at 11:46 AM Akash Nilugal  wrote:
> 
> > Hi xuchuanyin,
> >
> > Thanks for the comments/Suggestions
> >
> > 1. Preaggregate is productized, but timeseries with preaggregate is not;
> > I think that is where the confusion is, if I'm right.
> > 2. Limitations such as auto sampling and rollup, which we will now
> > support, retention policies, etc.
> > 3. segmentTimestampMin: I will consider this in the design.
> > 4. RP is added as a separate task; I thought that, instead of maintaining
> > two properties, it is better to maintain one and parse it. But I will
> > consider your point based on feasibility during implementation.
> > 5. We use an accumulator that takes a list: before writing the index files
> > we take the min/max of the timestamp column and fill the accumulator, and
> > then we can read accumulator.value in the driver after the load finishes
> > (a rough sketch follows at the end of this message).
> >
> > Regards,
> > Akash R Nilugal
> >
> > On 2019/09/28 10:46:31, xuchuanyin  wrote:
> > > Hi Akash, glad to see this feature proposed; I have some questions about
> > > it. Note that the quoted lines below are from the design document attached
> > > in the corresponding JIRA, and my comments follow after '==='.
> > >
> > > 1.
> > > "Currently carbondata supports timeseries on preaggregate datamap, but its
> > > an alpha feature"
> > > ===
> > > It has been some time since the preaggregate datamap was introduced, and
> > > it is still **alpha**; why is it not product-ready yet? Will the new
> > > feature run into a similar situation?
> > >
> > > 2.
> > > "there are so many limitations when we compare and analyze the existing
> > > timeseries database or projects which supports time series like apache
> > druid
> > > or influxdb"
> > > ===
> > > What are the actual limitations? Besides, please give an example of this.
> > >
> > > 3.
> > > "Segment_Timestamp_Min"
> > > ===
> > > Suggest using camel-case style like 'segmentTimestampMin'
> > >
> > > 4.
> > > "RP is way of telling the system, for how long the data should be kept"
> > > ===
> > > Since the function is simple, I'd suggest using 'retentionTime'=15 and
> > > 'timeUnit'='day' instead of 'RP'='15_days'
> > >
> > > 5.
> > > "When the data load is called for main table, use an spark accumulator to
> > > get the maximum value of timestamp in that load and return to the load."
> > > ===
> > > How can you get the Spark accumulator? The load is launched using
> > > loading-by-dataframe, not global-sort-by-spark.
> > >
> > > 6.
> > > For the rest of the content, still reading.
> > >
> > >
> > > --
> > > Sent from:
> > > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
> >
> 
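
A minimal sketch of the accumulator idea from point 5 above, in plain Spark
and assuming nothing about CarbonData's actual load path: the accumulator is
registered on the driver, each task adds its partition's min/max timestamp,
and the driver reads the collected values after the job finishes. All names
and the toy data are hypothetical.

    import org.apache.spark.sql.SparkSession
    import scala.collection.JavaConverters._

    val spark = SparkSession.builder().appName("accumulator-sketch").getOrCreate()

    // Driver-registered list accumulator, visible to executor tasks.
    val tsAcc = spark.sparkContext.collectionAccumulator[Long]("loadTimestamps")

    // Stand-in for the data load: each task contributes its partition's
    // min and max timestamp just before "writing its index files".
    spark.sparkContext.parallelize(1L to 1000L, 4).foreachPartition { rows =>
      val ts = rows.map(i => 1569800000000L + i * 1000).toSeq // toy timestamps
      if (ts.nonEmpty) { tsAcc.add(ts.min); tsAcc.add(ts.max) }
    }

    // Back on the driver, after the load has finished.
    val seen = tsAcc.value.asScala
    println(s"segmentTimestampMin=${seen.min}, segmentTimestampMax=${seen.max}")

And on point 4, parsing the single 'RP'='15_days' property is simple,
assuming the value always has the '<number>_<unit>' shape:

    // Split 'RP'='15_days' into (retentionTime, timeUnit).
    def parseRp(rp: String): (Int, String) = {
      val Array(n, unit) = rp.split("_", 2)
      (n.toInt, unit)
    }
    val (retentionTime, timeUnit) = parseRp("15_days") // (15, "days")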


Re: [DISCUSSION] Support heterogeneous format segments in carbondata

2019-09-30 Thread Akash Nilugal
Hi

+1

One question: are both adding a segment and loading data into the main table
supported together? If yes, how is segment locking handled, given that we add
an entry, with a segment id, to the table status for each added segment?
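
For concreteness, one possible shape of the add-segment DDL is sketched
below; the statement and option names are illustrative of the proposal, not
a confirmed syntax. The added segment would only register the existing
files' path and format as an entry in the table status, which is why the
locking question above matters.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("add-segment-sketch").getOrCreate()

    // Register an existing parquet directory as a segment of a carbon table.
    spark.sql(
      """ALTER TABLE sales ADD SEGMENT
        |OPTIONS ('path'='hdfs://nn/warehouse/old_sales_parquet',
        |         'format'='parquet')""".stripMargin)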

Regards,
Akash

On 2019/09/10 14:41:22, Ravindra Pesala  wrote: 
> Hi All,
> 
>  This discussion is regarding support for other formats in carbon. Existing
> customers use other formats like parquet, orc, etc., but if they want to
> migrate to carbon there is no proper solution at hand. This feature allows
> all the old data to be added as segments to a carbondata table. During
> query, old data is read in its respective format, while all new segments
> are read in carbon format.
> 
> I have created the design document and attached it to the JIRA; please
> review it.
> https://issues.apache.org/jira/browse/CARBONDATA-3516
> 
> 
> -- 
> Thanks & Regards,
> Ravindra
> 


Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

2019-09-30 Thread Kumar Vishal
Hi Akash,

In the design document you haven't mentioned how to handle data loading into
the timeseries datamap for older segments [existing tables].
If the customer's main table data is also stored based on time [increasing
over time] across different segments, he can use this feature as well.

We can discuss and finalize the solution.

-Regards
Kumar Vishal



Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

2019-09-30 Thread Akash Nilugal
Hi Vishal,

In the design document, under the impact analysis section, there is a topic
on compatibility/legacy stores: for old tables, when the datamap is created,
we load all the timeseries datamaps at their different granularities from
the existing data (a sketch follows below). I think this should do fine;
please let me know if you have further suggestions/comments.
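
As a sketch of that backfill, assuming one hypothetical datamap table per
granularity (the real trigger and table names would come from the datamap
metadata, and the timeseries() UDF shape is an assumption):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("backfill-sketch").getOrCreate()

    // On datamap creation over an existing table, load each granularity
    // once from the already-present main-table data.
    Seq("hour", "day").foreach { g =>
      spark.sql(
        s"""INSERT INTO sales_agg_$g
           |SELECT timeseries(event_time, '$g'), sum(amount)
           |FROM sales GROUP BY timeseries(event_time, '$g')""".stripMargin)
    }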

Regards,
Akash R Nilugal
