Re: Improve carbondata CDC performance

2021-03-30 Thread Ajantha Bhat
+1 for this improvement.

But this optimization depends on the data. There may be scenarios where,
even after pruning with min/max, the dataset size remains almost the same
as the original. In that case the newly added operations are pure overhead.
Do you plan to add some intelligence, a threshold, or a fallback mechanism
for that case?
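
To make the fallback idea concrete, here is a minimal sketch of a threshold check. It is only an illustration: the `Block` type, the function names, and the `max_kept_ratio` parameter are hypothetical and are not taken from the CarbonData code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical unit of data carrying min/max statistics for the key."""
    name: str
    min_key: int
    max_key: int


def prune_by_min_max(blocks, lo, hi):
    """Keep only blocks whose [min_key, max_key] range overlaps [lo, hi]."""
    return [b for b in blocks if b.max_key >= lo and b.min_key <= hi]


def choose_blocks(blocks, lo, hi, max_kept_ratio=0.8):
    """Return (blocks_to_scan, pruning_was_useful).

    If min/max pruning keeps more than max_kept_ratio of the blocks,
    fall back to the original flow (all blocks, no extra operations).
    """
    if not blocks:
        return [], False
    kept = prune_by_min_max(blocks, lo, hi)
    if len(kept) / len(blocks) > max_kept_ratio:
        return list(blocks), False
    return kept, True
```

With four blocks covering 0-10, 11-20, 21-30 and 31-40, a change set with keys between 12 and 18 keeps one block, so pruning is used; a change set spanning 0 to 40 keeps all four, so the sketch falls back.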

Thanks,
Ajantha

On Mon, Mar 29, 2021 at 5:59 PM Indhumathi  wrote:

> +1
>
> Regards,
> Indhumathi M
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Support SI at Segment level

2021-03-30 Thread Ajantha Bhat
+1 for this proposal.

But the other ongoing requirement (
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Presto-Queries-leveraging-Secondary-Index-td105291.html)
depends on *isSITableEnabled*,
so it would be better to wait for that work to finish and then redesign on
top of it.

Thanks,
Ajantha

On Tue, Mar 23, 2021 at 1:03 PM Mahesh Raju Somalaraju <
maheshraju.o...@gmail.com> wrote:

> Hi,
>
> +1 for the feature.
> It will make the query faster.
>
> 1) The design discussed for the feature (SI to prune as a data frame)
> introduces one property.
>   If the data engine wants to use SI as a datamap, the property must be
> set; if it is not set, the plan re-write flow is used.
>
>   So this feature has to be handled in both cases. Can you please check
> and update the design accordingly?
>
> References:
> SI to prune as a data frame
>
> https://docs.google.com/document/d/1VZlRYqydjzBXmZcFLQ4Ty-lK8RQlYVDoEfIId7vOaxk/edit?usp=sharing
>
> Thanks & Regards
> Mahesh Raju Somalaraju
>
> On Wed, Feb 17, 2021 at 4:05 PM Nihal  wrote:
>
> > Hi all,
> >
> > Currently, if the parent (main) table and the SI table don’t have the
> > same valid segments, we disable the SI table. From the next query
> > onwards, we scan and prune only the parent table until the next load or
> > REINDEX command is triggered (these commands bring the parent and SI
> > table segments back in sync). Because of this, queries take more time to
> > return results while SI is disabled.
> >
> > To solve this problem, we are planning to support SI at the segment
> > level. This means we will not disable SI when the parent and SI tables
> > don’t have the same segments. Instead, we will prune on SI for all
> > segments that are valid in the SI table, and prune on the main/parent
> > table for the rest of the segments.
> >
> >
> > At the time of pruning with the main table in TableIndex.prune, if SI
> > exists for the corresponding filter, then all segments that are not
> > present in the SI table will be pruned on the corresponding parent table
> > segment.
> >
> > Please let me know your thoughts and inputs on this.
> >
> > Regards
> > Nihal kumar ojha
>
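
The segment-level split described in the quoted proposal can be sketched roughly as follows. This is an illustration only: `split_segments`, `prune`, and the two pruning callbacks are hypothetical names, not the actual `TableIndex.prune` implementation.

```python
def split_segments(parent_segments, si_segments):
    """Split the parent's valid segments into those covered by SI and the rest."""
    parent = set(parent_segments)
    si = set(si_segments)
    use_si = sorted(parent & si)      # prune these with the SI table
    use_parent = sorted(parent - si)  # SI is missing here: prune on the parent
    return use_si, use_parent


def prune(parent_segments, si_segments, prune_with_si, prune_with_parent):
    """Prune every valid parent segment, using SI wherever it is in sync."""
    use_si, use_parent = split_segments(parent_segments, si_segments)
    result = {}
    for segment in use_si:
        result[segment] = prune_with_si(segment)
    for segment in use_parent:
        result[segment] = prune_with_parent(segment)
    return result
```

For a parent table with segments 0 to 3 and an SI table that only has segments 0 and 1, segments 0 and 1 go through SI and segments 2 and 3 go through the parent table, so SI never has to be disabled as a whole.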


Re: [DISCUSSION] Describe complex columns

2021-03-30 Thread Ajantha Bhat
Hi,

+1 for this improvement.

a) You can also print one line of short information about the parent column
when describe column is executed, so that the user does not have to run
another command to find out the parent column type.
Example:
 Describe column decimalcolumn on complexcarbontable;
*You can mention that decimalcolumn is a MAP<> type and that its children
are as follows.*

b) Are you blocking describe column on primitive types, or just printing
short information about the primitive data type?
I think the latter is fine.
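
As a rough illustration of both points, the sketch below prints a one-line summary of the parent type followed by its children, and only the summary line for a primitive column. The schema representation (nested dicts with `name`, `type`, `children`) and the output wording are assumptions made for this example, not the proposed DDL output.

```python
def render(node):
    """Render a schema node as a type string, e.g. map<string,decimal(10,2)>."""
    children = node.get("children", [])
    if not children:
        return node["type"]
    if node["type"] == "struct":
        # Struct children are shown with their field names.
        inner = ",".join(f"{c['name']}:{render(c)}" for c in children)
    else:
        inner = ",".join(render(c) for c in children)
    return f"{node['type']}<{inner}>"


def describe_column(node):
    """Return the output lines: parent summary first, then one line per child."""
    lines = [f"{node['name']} is of type {render(node)}"]
    for child in node.get("children", []):
        lines.append(f"  {child['name']}: {render(child)}")
    return lines
```

A primitive column has no children, so it produces just the single summary line instead of being blocked.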

Thanks,
Ajantha


On Mon, Mar 22, 2021 at 9:37 PM akashrn5  wrote:

> Hi,
>
> +1 for the new functionality.
>
> My suggestion is to modify the DDL to something like the following:
>
> DESCRIBE column fieldname ON [db_name.]table_name;
> DESCRIBE table short/transient [db_name.]table_name;
>
> Others can give their suggestions
>
> Thanks,
>
> Regards,
> Akash R
>


Re: [DISCUSSION] Support JOIN query with spatial index

2021-03-30 Thread Indhumathi
Hello all,

The current design is based on a union of the polygons identified from the
polygon table.

Based on discussion with the customer, the design needs to change to
support IN_POLYGON_JOIN in the following way:

Apply the IN_POLYGON UDF on each polygon identified from the polygon table
and apply aggregation/group by on each polygon's result.

For example:
Select sum(t1.col1),t2.polygon,t2.type
from table1 t1
inner join
(select polygon,type from table2 where type='x') t2
on in_polygon_join(t1, t2.polygon)
group by t2.polygon, t2.type

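table1:
+------+-----------+
| Col1 | mygeohash |
+------+-----------+
| 1    | 01        |
| 2    | 02        |
| 3    | 03        |
| 4    | 04        |
| 5    | 05        |
| 6    | 06        |
+------+-----------+

table2:
+-------------+------+
| polygon     | type |
+-------------+------+
| 1_Polygon() | x    |
| 2_Polygon() | y    |
| 3_Polygon() | x    |
| 4_Polygon() | r    |
+-------------+------+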

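If 1_polygon lies in range (0 & 1) and 3_polygon in (5 & 6),
the result could look like:
+--------------+-------------+------+
| sum(t1.col1) | polygon     | type |
+--------------+-------------+------+
| 3            | 1_Polygon() | x    |
| 11           | 3_Polygon() | x    |
+--------------+-------------+------+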

To achieve this, one solution could be to run a query with the IN_POLYGON
UDF for each polygon and finally take a union of the query results.
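
The per-polygon approach can be sketched in a few lines. This is a toy model, not the planned implementation: the `coverage` mapping (which geohash values fall inside each polygon) stands in for the real IN_POLYGON evaluation, and its contents are assumptions chosen so that the totals match the 3 and 11 in the example above.

```python
# Sample rows from the example: (col1, mygeohash) and (polygon, type).
TABLE1 = [(1, "01"), (2, "02"), (3, "03"), (4, "04"), (5, "05"), (6, "06")]
TABLE2 = [("1_Polygon()", "x"), ("2_Polygon()", "y"),
          ("3_Polygon()", "x"), ("4_Polygon()", "r")]

# Assumption: geohash values covered by each polygon (stand-in for IN_POLYGON).
COVERAGE = {"1_Polygon()": {"01", "02"}, "3_Polygon()": {"05", "06"}}


def in_polygon(geohash, polygon, coverage):
    """Toy replacement for the IN_POLYGON UDF: a set membership check."""
    return geohash in coverage.get(polygon, set())


def in_polygon_join(rows, polygons, coverage, wanted_type):
    """Run one filtered aggregation per polygon and union the results."""
    result = []
    for polygon, polygon_type in polygons:
        if polygon_type != wanted_type:
            continue  # corresponds to: where type = 'x'
        matched = [col1 for col1, geohash in rows
                   if in_polygon(geohash, polygon, coverage)]
        if matched:  # inner join: polygons with no matching rows are dropped
            result.append((sum(matched), polygon, polygon_type))
    return result
```

Calling `in_polygon_join(TABLE1, TABLE2, COVERAGE, "x")` yields one aggregated row per matching polygon, which is the shape of the result table in the example.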

I will update the design after further analysis.
Any opinions or suggestions are welcome.

Thanks,
Indhumathi M


Re: [DISCUSSION] Support JOIN query with spatial index

2021-03-30 Thread Ajantha Bhat
Hi, I have some doubts and suggestions on this.

Currently, we support these UDFs: IN_POLYGON, IN_POLYGON_LIST,
IN_POLYLINE_LIST, IN_POLYGON_RANGE_LIST.
However, the user needs to provide the polygon input manually, and since a
polygon can have many points, that is hard to do.
So, your requirement is to add a new UDF, IN_POLYGON_JOIN, where the
polygon inputs are present in another table that you want to join with the
main table.

*Please find my doubts below:*
a. Why do a join, when you can form a subquery that reads the polygons from
table2 and passes them as input to the IN_POLYGON UDF and the other
existing UDFs?
b. Is there no need to support the same for the IN_POLYLINE_LIST
and IN_POLYGON_RANGE_LIST UDFs as well?

*Suggestions:*
a. Table names and queries do not match; please update.
b. The query doesn't look like the union query explained in the diagram;
please update and explain.
c. Please include some sample data with examples for t1 and t2, and also
provide the expected query result.
d. Also mention how to select data for a single polygon and for multiple
polygons from the tables.

Thanks,
Ajantha

On Tue, Mar 30, 2021 at 11:25 AM Kunal Kapoor 
wrote:

> +1
>
> On Mon, Mar 22, 2021 at 4:07 PM Indhumathi 
> wrote:
>
> > Hi community,
> >
> > Currently, carbon supports the IN_POLYGON and IN_POLYGON_LIST UDFs,
> > where the user has to manually provide the polygon points (a series of
> > latitude and longitude pairs) to query a carbon table based on the
> > spatial index.
> >
> > This feature will support joining tables based on an IN_POLYGON UDF
> > filter, where the polygon data exists in a table.
> >
> > Please find the link to the design doc below. Please check and give
> > your inputs/suggestions.
> >
> >
> >
> https://docs.google.com/document/d/11PnotaAiEJQK_QvKsHznDy1I9tO4idflW32LstwcLhc/edit#heading=h.yh6qp815dh3p
> >
> >
> > Thanks & Regards,
> > Indhumathi M
> >
>


Re: [DISCUSSION] Support alter schema for complex types

2021-03-30 Thread Ajantha Bhat
Hi Akshay,
The mail description and the document content do not match. Even for a
single-level struct, the document says it cannot be supported.
So, please list all the work that needs to be done as points, and then
clearly divide it into what is supported in phase 1 and what is supported
in phase 2 in the summary section of the document.

Also, in the query flow, after adding the column, what will the output be
for previously loaded segments: NULL or an empty complex type?
You can refer to Hive's behavior for this. I hope schema evolution (column
drift) also remains intact with complex column support.

Thanks,
Ajantha

On Tue, Mar 30, 2021 at 11:18 AM Kunal Kapoor 
wrote:

> +1
>
> On Fri, Mar 26, 2021 at 6:19 PM akshay_nuthala 
> wrote:
>
> > No, these and other nested-level operations will be taken care of in the
> > next phase.
> >
>