I think for the tracing use case we need to publish events one by one from each mediator (we can't aggregate all such events, as they also contain the message payload).
---------- Forwarded message ----------
From: Supun Sethunga <sup...@wso2.com>
Date: Mon, Feb 8, 2016 at 2:54 PM
Subject: Re: ESB Analytics Mediation Event Publishing Mechanism
To: Anjana Fernando <anj...@wso2.com>
Cc: "engineering-gr...@wso2.com" <engineering-gr...@wso2.com>, Srinath Perera <srin...@wso2.com>, Sanjiva Weerawarana <sanj...@wso2.com>, Kasun Indrasiri <ka...@wso2.com>, Isuru Udana <isu...@wso2.com>

Hi all,

I ran some simple performance tests against the new relation provider, in comparison with the existing one. Following are the results:

*Records in Backend DB Table*: *1,054,057*

*Conversion:*

Backend DB Table
id | data
1  | [{'a':'aaa','b':'bbb','c':'ccc'},{'a':'xxx','b':'yyy','c':'zzz'},{'a':'ppp','b':'qqq','c':'rrr'}]
2  | [{'a':'aaa','b':'bbb','c':'ccc'},{'a':'xxx','b':'yyy','c':'zzz'},{'a':'ppp','b':'qqq','c':'rrr'}]

-- To -->

Spark Table
id | a   | b   | c
1  | aaa | bbb | ccc
1  | xxx | yyy | zzz
1  | ppp | qqq | rrr
2  | aaa | bbb | ccc
2  | xxx | yyy | zzz
2  | ppp | qqq | rrr

*Avg Time for Query Execution:*

Query                                                                       | Existing Analytics Relation Provider (~ sec) | New (ESB) Analytics Relation Provider* (~ sec)
SELECT COUNT(*) FROM <Table>;                                               | 13 | 16
SELECT * FROM <Table> ORDER BY id ASC;                                      | 13 | 16
SELECT * FROM <Table> WHERE id=98435;                                       | 13 | 16
SELECT id,a,first(b),first(c) FROM <Table> GROUP BY id,a ORDER BY id ASC;   | 18 | 26

* The new relation provider splits a single row into multiple rows, so the number of rows in the table is roughly 3 times that of the original table (each row is split into 3 rows).

Regards,
Supun

On Wed, Feb 3, 2016 at 3:36 PM, Supun Sethunga <sup...@wso2.com> wrote:

> Hi all,
>
> I have started working on implementing a new "relation" / "relation provider" to serve the above requirement. This is basically a modified version of the existing "Carbon Analytics" relation provider.
>
> Here I have assumed that the encapsulated data for a single execution flow is stored in a single row, and that the data about the mediators invoked during the flow is stored in a known column of each row (say "data") as an array (say a JSON array). When each row is read into Spark, this relation provider creates a separate row for each element of the array stored in the "data" column (a rough sketch of this splitting step is given at the end of this thread). I have tested this with some mocked data, and it works as expected.
>
> I still need to test with the real data/data formats and modify the mapping accordingly. I will update the thread with the details.
>
> Regards,
> Supun
>
> On Tue, Feb 2, 2016 at 2:36 AM, Anjana Fernando <anj...@wso2.com> wrote:
>
>> Hi,
>>
>> In a meeting I had with Kasun and the ESB team, I got to know that, for their tracing mechanism, they were instructed to publish one event for each mediator invocation, whereas earlier they had an approach where they published one event that encapsulated the data of a whole execution flow. I would actually like to support the latter approach, mainly due to performance / resource requirements, and also considering that this is a feature that could be enabled in production. Simply put, one event per mediator does not scale that well. For example, if the ESB is doing 1k TPS, then for a sequence that has 20 mediators, that is 20k TPS of analytics traffic. Combine that with a possible ESB cluster hitting a DAS cluster with a single backend database, and that may be too many rows per second written to the database.
>> The main problem here is that one event corresponds to a single row/record in the backend database in DAS, so we may come to a state where the frequency of row creation by events coming from the ESBs cannot be sustained.
>>
>> If we create a single event from the 20 mediators, then it is just 1k TPS for the DAS event receivers and for the database too, even though the message size is bigger. Publishing lots of small events does not necessarily give the same performance as publishing fewer, bigger events. Throughput wise, comparatively bigger events will win (even if we consider that small operations will be batched at the transport level etc., still one event = one database row). So I would suggest we try out a "single sequence flow = single event" approach, and on the Spark processing side, we treat one of these big rows as multiple rows in Spark. I was first thinking whether UDFs could help in splitting a single column into multiple rows, but that is not possible, and it would also be a bit troublesome, considering we would have to delete the original data table after converting it using a script, not forgetting that we would actually have to schedule and run a separate script to do this post-processing. A much cleaner way to do this would be to create a new "relation provider" in Spark (which is like a data adapter for their DataFrames), and in our relation provider, when we are reading rows, we convert a single row's column to multiple rows and return them for processing. Spark will then not know that it was physically a single row in the data layer, and it can summarize the data as usual and write to the target summary tables. [1] is our existing implementation of a Spark relation provider, which directly maps to our DAS analytics tables; we can create the new one extending / based on it. So I suggest we try out this approach and see if everyone is okay with it.
>>
>> [1] https://github.com/wso2/carbon-analytics/blob/master/components/analytics-processors/org.wso2.carbon.analytics.spark.core/src/main/java/org/wso2/carbon/analytics/spark/core/sources/AnalyticsRelationProvider.java
>>
>> Cheers,
>> Anjana.
>>
>> --
>> *Anjana Fernando*
>> Senior Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>
> --
> *Supun Sethunga*
> Software Engineer
> WSO2, Inc.
> http://wso2.com/
> lean | enterprise | middleware
> Mobile : +94 716546324

--
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324

--
Kasun Indrasiri
Software Architect
WSO2, Inc.; http://wso2.com
lean.enterprise.middleware
cell: +94 77 556 5206
Blog : http://kasunpanorama.blogspot.com/
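
To make the row-splitting step described by Supun and Anjana above concrete, the following is a minimal sketch of the transformation such a relation provider would perform while scanning a table. The class and method names here are illustrative assumptions and not the actual carbon-analytics code; it only shows how one backend record (an id plus a JSON-array "data" column) could be expanded into one Spark Row per mediator entry, using Gson for parsing.

import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

import java.util.ArrayList;
import java.util.List;

// Illustrative only: expands one stored record into multiple Spark rows,
// the way the proposed relation provider would while reading rows, so that
// Spark only ever sees the flattened (id, a, b, c) layout.
public class MediatorRowSplitter {

    public static List<Row> split(int id, String dataJson) {
        List<Row> rows = new ArrayList<>();
        // "data" holds a JSON array with one object per mediator invocation
        JsonArray entries = new JsonParser().parse(dataJson).getAsJsonArray();
        for (int i = 0; i < entries.size(); i++) {
            JsonObject entry = entries.get(i).getAsJsonObject();
            // carry the parent row's id onto every generated row
            rows.add(RowFactory.create(
                    id,
                    entry.get("a").getAsString(),
                    entry.get("b").getAsString(),
                    entry.get("c").getAsString()));
        }
        return rows;
    }

    public static void main(String[] args) {
        String data = "[{\"a\":\"aaa\",\"b\":\"bbb\",\"c\":\"ccc\"},"
                + "{\"a\":\"xxx\",\"b\":\"yyy\",\"c\":\"zzz\"},"
                + "{\"a\":\"ppp\",\"b\":\"qqq\",\"c\":\"rrr\"}]";
        for (Row row : split(1, data)) {
            System.out.println(row); // [1,aaa,bbb,ccc], [1,xxx,yyy,zzz], [1,ppp,qqq,rrr]
        }
    }
}

In the actual provider, this split would presumably happen when Spark asks the relation for its rows (e.g. in a buildScan() implementation), with the declared schema exposing only the flattened columns, so existing Spark SQL scripts can group and summarize per mediator without any separate post-processing script.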