Question about CarbonDataFrameWriter

2017-10-17 Thread 徐传印
Hi community,

When I was going through the DataFrame.write related code in CarbonData, I found
an option that controls whether the DataFrame's data is first saved to a
temporary directory on disk as CSV.

My question is: why do we need this step, which consumes extra disk IO, and why
does the option (tempCSV) default to true?

The related code can be found here:

https://github.com/apache/carbondata/blob/master/integration/spark2/src/main/scala/org/apache/spark/sql/CarbonDataFrameWriter.scala#L45

https://github.com/apache/carbondata/blob/master/integration/spark-common/src/main/scala/org/apache/carbondata/spark/CarbonOption.scala#L43
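For reference, a minimal sketch of the write path in question, assuming a
working CarbonData + Spark 2 setup; the option names follow the
CarbonOption.scala linked above, while the table name and sample data are
invented:

import org.apache.spark.sql.{SaveMode, SparkSession}

object TempCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tempCSV-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("name", "value")

    // With tempCSV at its default (true), CarbonDataFrameWriter first dumps
    // the DataFrame to a temporary CSV directory on disk and then loads that
    // CSV into the carbon table; with false it writes the rows directly.
    df.write
      .format("carbondata")
      .option("tableName", "sample_table") // invented table name
      .option("tempCSV", "false")
      .mode(SaveMode.Overwrite)
      .save()
  }
}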

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

2017-10-17 Thread Ravindra Pesala
Hi Bhavya,

For pre-aggregate table loads, we will not delete the old data and recalculate
the aggregation every time. Aggregation tables are loaded incrementally along
with the main table. For example, when we create an aggregation table on a
main table, the aggregation table is calculated and loaded from the existing
data of the main table. For subsequent loads on the main table, the
aggregation table is also calculated incrementally, only for the new data, and
loaded as a new segment.
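
A rough sketch of that flow, using the CTAS-style syntax proposed in this
thread (table and column names are invented, and the final syntax is still
under discussion):

import org.apache.spark.sql.SparkSession

object PreAggLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("preagg-sketch").getOrCreate()

    // First load into the main table creates segment 0.
    spark.sql("LOAD DATA INPATH '/data/sales_0.csv' INTO TABLE sales")

    // Creating the pre-aggregate table computes the aggregation once over
    // all existing segments of the main table.
    spark.sql(
      """CREATE TABLE sales_agg STORED BY 'carbondata' AS
        |SELECT country, SUM(amount) AS total FROM sales GROUP BY country
        |""".stripMargin)

    // A subsequent load adds a new segment to the main table; the
    // aggregation is computed only over this new data and appended to
    // sales_agg as a new segment, instead of being rebuilt from scratch.
    spark.sql("LOAD DATA INPATH '/data/sales_1.csv' INTO TABLE sales")
  }
}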

Regards,
Ravindra.

On 17 October 2017 at 13:34, Bhavya Aggarwal  wrote:

> Hi Dev,
>
> For the pre-aggregate tables, how will we handle subsequent loads? Will we
> be running the query on the whole table, calculating the aggregations
> again, and then deleting the existing segment and creating new segments
> for the whole data? With that approach, as the data in the main table
> grows, the loading time will also increase substantially. The other way is
> to intelligently determine the new values by querying the latest segment
> and using them in combination with the existing pre-aggregated tables.
> Please share your thoughts in this discussion.
>
> Regards
> Bhavya

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

2017-10-17 Thread Bhavya Aggarwal
Hi Dev,

For the pre-aggregate tables, how will we handle subsequent loads? Will we
be running the query on the whole table, calculating the aggregations
again, and then deleting the existing segment and creating new segments for
the whole data? With that approach, as the data in the main table grows,
the loading time will also increase substantially. The other way is to
intelligently determine the new values by querying the latest segment and
using them in combination with the existing pre-aggregated tables. Please
share your thoughts in this discussion.
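
To make the incremental alternative concrete, here is a hand-rolled sketch of
how it could be done today (table names are invented); it relies on SUM being
decomposable, so per-batch partial sums can be rolled up again at query time:

import org.apache.spark.sql.SparkSession

object ManualIncrementalAgg {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("manual-incr-agg").getOrCreate()

    // Aggregate only the newly arrived batch and append the partial sums
    // to the aggregate table, rather than recomputing the whole table.
    spark.sql(
      """INSERT INTO sales_agg
        |SELECT country, SUM(amount) AS total
        |FROM sales_new_batch GROUP BY country
        |""".stripMargin)

    // Because SUM is decomposable, the per-batch partial sums can be
    // rolled up again at query time to produce the global aggregate.
    spark.sql("SELECT country, SUM(total) FROM sales_agg GROUP BY country").show()
  }
}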

Regards
Bhavya

On Mon, Oct 16, 2017 at 4:53 PM, Liang Chen  wrote:

> +1, I agree with Jacky's points.
> As we know, CarbonData is already able to achieve very good performance
> for filter query scenarios through the MDK index. Supporting pre-aggregate
> tables in 1.3.0 would improve aggregate query scenarios, so users can use
> one CarbonData store to serve all query cases (both filter and
> aggregation).
>
> To Lu Cao: the solution you mentioned to build a cube schema is too
> complex and has many limitations; for example, the cube data can't
> support querying detail data, etc.
>
> Regards
> Liang
>
>
> Jacky Li wrote
> > Hi Lu Cao,
> >
> > In my previous experience with "cube" engines, whether ROLAP or MOLAP,
> > the cube sits above the SQL layer: not only does the user need to
> > establish a cube schema by transforming metadata from the data warehouse
> > star schema, but the engine also defines its own query language, such as
> > MDX. Many times these languages are not standardized, so different
> > vendors need to provide different BI tools or adaptors for them.
> > So although some vendors provide easy-to-use cube management tools, this
> > approach has at least two problems: vendor lock-in and the rigidity of
> > the cube model once it is defined. I think these problems are similar to
> > those of other vendor-specific solutions.
> >
> > Currently, one of the strengths the carbon store provides is that it
> > complies with standard SQL by integrating with SparkSQL, Hive, etc. The
> > intention of providing pre-aggregate table support is to let carbon
> > improve OLAP query performance while still sticking with standard SQL,
> > which means all users can keep using the same BI/JDBC applications and
> > tools that connect to SparkSQL, Hive, etc.
> >
> > If carbon were to support "cube", it would not only need to define the
> > cube configuration, which may be very complex and non-standard, but
> > would also force users to use vendor-specific tools for management and
> > visualization. So I think that before taking on this complexity, it is
> > better to provide pre-aggregate tables as the first step.
> >
> > Although we do not want the full complexity of "cube" on an arbitrary
> > data schema, one special case is timeseries data. Because the time
> > dimension hierarchy (year/month/day/hour/minute/second) is naturally
> > understandable and consistent across all scenarios, we can provide
> > native support for pre-aggregate tables on the time dimension. It is
> > effectively a cube on time, and we can do automatic rollup for all
> > levels of time.
> >
> > Finally, please note that by using CTAS syntax we are not restricting
> > carbon to pre-aggregate tables only; it can also support arbitrary
> > materialized views if we want in the future.
> >
> > Hope this makes things clearer.
> >
> > Regards,
> > Jacky
> >
> >
> > Actually, as you can see in the document, I am avoiding calling this
> > "cube".
> >
> >
> >> On 15 October 2017 at 21:18, Lu Cao (whucaolu@) wrote:
> >>
> >> Hi Jacky,
> >> If a user wants to create a cube on the main table, does he/she have to
> >> create multiple pre-aggregate tables? It would be a heavy workload to
> >> write so many CTAS commands. If the user only needs to create a few
> >> pre-agg tables, the current carbon can already support this
> >> requirement: the user can create the table first and then use an
> >> INSERT INTO ... SELECT statement. The only difference is that the user
> >> needs to query the pre-agg table instead of the main table.
> >>
> >> So maybe we can enable the user to create a cube model (in the schema
> >> or a metafile?) which contains multiple pre-aggregation definitions,
> >> and carbon can create those pre-agg tables automatically according to
> >> the model. That would be easier to use and maintain.
> >>
> >> Regards,
> >> Lionel
> >>
> >> On Sun, Oct 15, 2017 at 3:56 PM, Jacky Li (jacky.likun@) wrote:
> >>
> >>> Hi Liang,
> >>>
> >>> Alter table, data update/delete, and delete segment are all handled
> >>> the same way, so I wrote in the document: "User can manually perform
> >>> this operation and rebuild the pre-aggregate table, as in the update
> >>> scenario."
> >>> The user needs to drop the associated aggregate table, perform the
> >>> alter table, data update/delete, or delete segment operation, and
> >>> then recreate the pre-agg table using the CTAS command; the
> >>> pre-aggregate table will be