Re: Improve carbondata CDC performance
+1 for this improvement. However, this optimization depends on the data. There may be scenarios where, even after pruning with min/max, the dataset size remains almost the same as the original, so the newly added operations only bring extra overhead. Do you plan to add some intelligence, a threshold, or a fallback mechanism for that case?

Thanks,
Ajantha

On Mon, Mar 29, 2021 at 5:59 PM Indhumathi wrote:
> +1
>
> Regards,
> Indhumathi M
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
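To make the fallback question in the reply above concrete, here is a minimal sketch of one possible threshold check. It is not taken from the CarbonData code base: `FileStats`, `prune_by_min_max`, `select_target_files` and the `max_remaining_ratio` default of 0.8 are all hypothetical names and values chosen for illustration. The idea is simply to compare the rows that survive min/max pruning against the total, and to keep the original (unpruned) flow when pruning removed too little.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file min/max statistics for the join key column."""
    path: str
    min_key: int
    max_key: int
    rows: int


def prune_by_min_max(files: Sequence[FileStats],
                     source_min: int,
                     source_max: int) -> List[FileStats]:
    """Keep only files whose [min_key, max_key] range overlaps the source key range."""
    return [f for f in files
            if f.max_key >= source_min and f.min_key <= source_max]


def select_target_files(files: Sequence[FileStats],
                        source_keys: Sequence[int],
                        max_remaining_ratio: float = 0.8
                        ) -> Tuple[List[FileStats], bool]:
    """Return (files to use, whether the pruned path is used).

    ASSUMPTION: the 0.8 default is an arbitrary illustrative threshold.
    If more than that fraction of rows survives pruning, fall back to the
    original flow on all files.
    """
    if not source_keys:
        raise ValueError("source_keys must not be empty")
    pruned = prune_by_min_max(files, min(source_keys), max(source_keys))
    total_rows = sum(f.rows for f in files)
    remaining_rows = sum(f.rows for f in pruned)
    if total_rows == 0 or remaining_rows / total_rows > max_remaining_ratio:
        return list(files), False   # fallback: pruning did not help enough
    return pruned, True             # pruning was effective
```

With three equally sized files covering keys 0-10, 20-30 and 40-50, source keys `[5, 8]` keep only the first file, while source keys `[5, 45]` overlap every file and trigger the fallback.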
Re: Support SI at Segment level
+1 for this proposal. However, the other ongoing requirement (http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Presto-Queries-leveraging-Secondary-Index-td105291.html) depends on *isSITableEnabled*, so it is better to wait for it to finish and redesign on top of it.

Thanks,
Ajantha

On Tue, Mar 23, 2021 at 1:03 PM Mahesh Raju Somalaraju <maheshraju.o...@gmail.com> wrote:
> Hi,
>
> +1 for the feature.
> It will make the query faster.
>
> 1) The design discussion about the feature (SI to prune as a data frame) has one property to set.
> If the data engine wants to use SI as a datamap, the property needs to be set. If it is not set, the plan re-write flow is used.
>
> So we have to handle this feature in two cases. Can you please check and update the design accordingly?
>
> References:
> SI to prune as a data frame
> https://docs.google.com/document/d/1VZlRYqydjzBXmZcFLQ4Ty-lK8RQlYVDoEfIId7vOaxk/edit?usp=sharing
>
> Thanks & Regards
> Mahesh Raju Somalaraju
>
> On Wed, Feb 17, 2021 at 4:05 PM Nihal wrote:
> > Hi all,
> >
> > Currently, if the parent (main) table and the SI table don't have the same valid segments, we disable the SI table. From the next query onwards, we scan and prune only the parent table until the next load or REINDEX command is triggered (as these commands bring the parent and SI table segments back in sync). Because of this, queries take more time to return results when SI is disabled.
> >
> > To solve this problem, we are planning to support SI at the segment level. This means we will not disable SI if the parent and SI tables don't have the same segments. Instead, we will do the pruning on SI for all valid segments, and for the rest of the segments we will do the pruning on the main/parent table.
> >
> > At the time of pruning with the main table in TableIndex.prune, if SI exists for the corresponding filter, then all segments which are not present in the SI table will be pruned on the corresponding parent table segment.
> >
> > Please let me know your thoughts and input on this.
> >
> > Regards
> > Nihal kumar ojha
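The segment-level behaviour described in the quoted proposal can be pictured with a small sketch. This is only an illustration of the idea, not the actual TableIndex.prune logic: the function name `split_segments_for_pruning` and the use of plain string segment ids are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def split_segments_for_pruning(parent_valid_segments: Sequence[str],
                               si_valid_segments: Sequence[str]
                               ) -> Tuple[List[str], List[str]]:
    """Split the parent table's valid segments into two groups.

    Segments that also exist in the SI table are pruned through the SI;
    segments missing from the SI table are pruned on the parent table.
    """
    si_segments = set(si_valid_segments)
    prune_with_si = [s for s in parent_valid_segments if s in si_segments]
    prune_with_parent = [s for s in parent_valid_segments if s not in si_segments]
    return prune_with_si, prune_with_parent


# Parent has segments 0..3, but the SI table has only caught up to segment 1.
with_si, with_parent = split_segments_for_pruning(["0", "1", "2", "3"], ["0", "1"])
print(with_si)      # segments pruned through the SI
print(with_parent)  # segments pruned on the parent table
```

Under this reading the SI is never disabled as a whole; only the segments it has not yet covered take the slower parent-table path.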
Re: [DISCUSSION] Describe complex columns
Hi,

+1 for this improvement.

a) You can also print one line of short information about the parent column when DESCRIBE COLUMN is executed, so that the user does not need to run another command to find out the parent column type.
Example:
Describe column decimalcolumn on complexcarbontable;
*You can mention that decimalcolumn is a MAP<> type and children are as follows.*

b) Are you blocking DESCRIBE COLUMN on primitive types, or just printing short information about the primitive data type? I think the latter is fine.

Thanks,
Ajantha

On Mon, Mar 22, 2021 at 9:37 PM akashrn5 wrote:
> Hi,
>
> +1 for the new functionality.
>
> My suggestion is to modify the DDL to something like the below:
>
> DESCRIBE column fieldname ON [db_name.]table_name;
> DESCRIBE table short/transient [db_name.]table_name;
>
> Others can give their suggestions.
>
> Thanks,
>
> Regards,
> Akash R
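Suggestions (a) and (b) in the reply above could look roughly like the sketch below. The output wording, the function name `describe_column`, and the sample column `address` with its children are all hypothetical; the thread does not define the actual output format.

```python
from typing import List, Sequence, Tuple


def describe_column(name: str,
                    col_type: str,
                    children: Sequence[Tuple[str, str]] = ()) -> List[str]:
    """Build the lines a DESCRIBE COLUMN style command might print.

    For a complex column, the first line summarises the parent type so the
    user does not need a second command; for a primitive column only a
    short one-line summary is produced instead of blocking the command.
    """
    if not children:
        return [f"{name} is a primitive column of type {col_type}"]
    lines = [f"{name} is a {col_type} type and its children are as follows:"]
    lines.extend(f"  {child_name}: {child_type}"
                 for child_name, child_type in children)
    return lines


# Hypothetical struct column with two children.
for line in describe_column("address", "STRUCT", [("city", "STRING"), ("pin", "INT")]):
    print(line)
```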
Re: [DISCUSSION] Support JOIN query with spatial index
Hello all,

The current design is based on a union of the polygons identified from the polygon table. Based on discussion with the customer, the design needs to change to support IN_POLYGON_JOIN in the following way:

Apply the IN_POLYGON UDF on each polygon identified from the polygon table, and apply aggregation/group by on each polygon's result.

For example:

Select sum(t1.col1), t2.polygon, t2.type
from table1 t1
inner join (select polygon, type from table2 where type='x') t2
on in_polygon_join(t1, t2.polygon)
group by t2.polygon, t2.type

table1:
+------+-----------+
| Col1 | mygeohash |
+------+-----------+
| 1    | 01        |
| 2    | 02        |
| 3    | 03        |
| 4    | 04        |
| 5    | 05        |
| 6    | 06        |
+------+-----------+

table2:
+-------------+------+
| polygon     | type |
+-------------+------+
| 1_Polygon() | x    |
| 2_Polygon() | y    |
| 3_Polygon() | x    |
| 4_Polygon() | r    |
+-------------+------+

If 1_polygon lies in range (0 & 1) and 3_polygon in (5 & 6), the result could be like:

+--------------+-------------+------+
| sum(t1.col1) | polygon     | type |
+--------------+-------------+------+
| 3            | 1_Polygon() | x    |
| 11           | 3_Polygon() | x    |
+--------------+-------------+------+

To achieve this, one solution could be to run the query with the IN_POLYGON UDF for each polygon and finally take a union of the query results.

I will update the design after further analysis. Any opinions or suggestions are welcome.

Thanks,
Indhumathi M
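The "one query per polygon, then union" idea from the message above can be emulated in a few lines. This is a toy model, not CarbonData code: the `in_polygon` function is a stand-in for the real UDF, and `POLYGON_RANGES` is an assumption that maps 1_Polygon() to geohash values 1-2 and 3_Polygon() to 5-6, chosen only so that the sums match the example result in the message (3 and 11).

```python
from typing import Dict, List, Sequence, Tuple

# (col1, mygeohash) rows of table1 from the example.
TABLE1: List[Tuple[int, int]] = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]

# (polygon, type) rows of table2 from the example.
TABLE2: List[Tuple[str, str]] = [
    ("1_Polygon()", "x"), ("2_Polygon()", "y"),
    ("3_Polygon()", "x"), ("4_Polygon()", "r"),
]

# ASSUMPTION: inclusive geohash ranges covered by each polygon, picked to
# reproduce the sums shown in the example; real polygons are point lists.
POLYGON_RANGES: Dict[str, Tuple[int, int]] = {
    "1_Polygon()": (1, 2),
    "3_Polygon()": (5, 6),
}


def in_polygon(geohash: int, polygon: str) -> bool:
    """Stand-in for the IN_POLYGON UDF: range check on the toy geohash."""
    if polygon not in POLYGON_RANGES:
        return False
    low, high = POLYGON_RANGES[polygon]
    return low <= geohash <= high


def sum_per_polygon(rows: Sequence[Tuple[int, int]],
                    polygons: Sequence[Tuple[str, str]],
                    wanted_type: str) -> List[Tuple[int, str, str]]:
    """Run one filtered aggregation per polygon and union the results."""
    result = []
    for polygon, polygon_type in polygons:
        if polygon_type != wanted_type:
            continue
        matched = [col1 for col1, geohash in rows if in_polygon(geohash, polygon)]
        if matched:  # like an inner join, polygons without matches are dropped
            result.append((sum(matched), polygon, polygon_type))
    return result


print(sum_per_polygon(TABLE1, TABLE2, "x"))
```

Each loop iteration plays the role of one IN_POLYGON query, and appending to `result` plays the role of the final union, so every output row stays tied to the polygon that produced it.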
Re: [DISCUSSION] Support JOIN query with spatial index
Hi,

I have some doubts and suggestions regarding this.

Currently, we support these UDFs: IN_POLYGON, IN_POLYGON_LIST, IN_POLYLINE_LIST and IN_POLYGON_RANGE_LIST. However, the user needs to provide the polygon input manually, and since a polygon can have many points, this is hard to do by hand. So your requirement is to add a new UDF, IN_POLYGON_JOIN, where the polygon inputs are present in another table and you want to join it with the main table.

*Please find my doubts below:*
a. Why do a join, when you can form a subquery to query the polygons from table2 and give the result as input to the IN_POLYGON UDF and the other existing UDFs?
b. Is there no need to support the same for the IN_POLYLINE_LIST and IN_POLYGON_RANGE_LIST UDFs as well?

*Suggestions:*
a. The table names and queries do not match; please update.
b. The query doesn't look like the union query explained in the diagram; please update and explain.
c. Please include some sample data with examples for t1 and t2, and also provide the expected query result.
d. Also mention how to select data for a single polygon and for multiple polygons from the tables.

Thanks,
Ajantha

On Tue, Mar 30, 2021 at 11:25 AM Kunal Kapoor wrote:
> +1
>
> On Mon, Mar 22, 2021 at 4:07 PM Indhumathi wrote:
> > Hi community,
> >
> > Currently, carbon supports the IN_POLYGON and IN_POLYGON_LIST UDFs, where the user has to manually provide the polygon points (a series of latitude and longitude pairs) to query a carbon table based on the spatial index.
> >
> > This feature will support JOIN on tables based on the IN_POLYGON UDF filter, where the polygon data exists in a table.
> >
> > Please find the link to the design doc below. Please check and give your inputs/suggestions.
> >
> > https://docs.google.com/document/d/11PnotaAiEJQK_QvKsHznDy1I9tO4idflW32LstwcLhc/edit#heading=h.yh6qp815dh3p
> >
> > Thanks & Regards,
> > Indhumathi M
Re: [DISCUSSION] Support alter schema for complex types
Hi Akshay,

The mail description and the document content do not match. Even for a single-level struct, the document says it cannot be supported. So please list all the work that needs to be done as points, and then clearly divide, in the summary section of the document, what is supported in phase 1 and what is supported in phase 2.

Also, in the query flow, after adding the column, what will be the output for previously loaded segments: NULL or an empty complex type? You can refer to the Hive behavior for this.

I hope schema evolution (column drift) also remains intact with complex column support.

Thanks,
Ajantha

On Tue, Mar 30, 2021 at 11:18 AM Kunal Kapoor wrote:
> +1
>
> On Fri, Mar 26, 2021 at 6:19 PM akshay_nuthala wrote:
> > No, these and other nested level operations will be taken care of in the next phase.