Hi, Andriy. Thanks for responding. I don't think we can assume there will
always be a choice for streaming or online aggregation.
The two easiest ways out would be a Spark guru (ideally gurus) stepping
forward, or an easier-to-support alternative for after-the-fact aggregation
over large datasets that minimally works with MySQL, Elasticsearch and
Cassandra. (I've pasted a rough sketch of the core computation at the bottom
of this mail.)

-A

On Tue, Mar 19, 2019, 7:20 PM Andriy Redko <[email protected]> wrote:

> Hi Adrian,
>
> First of all, I want to confirm from personal experience that the
> dependencies are often built after the fact, so there is a real need for
> this kind of job/component. There are many choices: either use the data
> processing engines you mentioned, or onboard a data store with aggregation
> capabilities (maybe ClickHouse, for example). What do you think would be
> the best route for Zipkin? Keep Spark but look for maintenance help? Or
> (re)write it altogether, ideally with no data engines needed? Just trying
> to understand how you envision it.
>
> Best Regards,
> Andriy Redko
>
> AC> Hi, team.
>
> AC> A long time ago, we arbitrarily used Spark for dependency link
> AC> aggregation (porting the work from Eirik's Hadoop job). The initial
> AC> Spark job was created incomplete, then abandoned by the author. I've
> AC> tried hard to support it, but it has been perpetual maintenance and
> AC> most of us have no idea how to support it. Yet we get a lot of user
> AC> questions about it, and the support load is higher than for most of
> AC> our projects.
>
> AC> The Elasticsearch part is a minefield, from the "wan only" stuff to
> AC> the narrow supported range of versions. It is rev-locked to a JRE
> AC> (even if that will change later). We've had users complain about CVE
> AC> maintenance and actively ask for a non-Spark option. General support
> AC> comes in the form of questions about cluster distribution that no one
> AC> knows the answer to. I've recently, in desperation, added a change to
> AC> help show where Spark support is.
>
> AC> https://github.com/openzipkin/zipkin-dependencies/pull/133
>
> AC> All this said, despite the problems running distributed or with
> AC> Elasticsearch, most can start the zipkin-dependencies job as a
> AC> one-shot cron job without much help.
>
> AC> I think we have to be honest about the fact that since this project
> AC> started, we've rarely had anyone able to support it. I hope we can
> AC> get out of the mutually disappointing support swamp. Does anyone have
> AC> any ideas?
>
> AC> I would like to think someone could come in and save us, but it seems
> AC> we should also consider other tools, as that usually doesn't happen,
> AC> and one person saving us isn't sustainable (usually we need a few
> AC> people to know a tool in order to realistically support it). It is
> AC> possible to recruit for this, but imho we need significant, close
> AC> buy-in from people who know Spark, like actually helping with
> AC> support, if we want to continue down this path.
>
> AC> I know there's a Kafka streaming option [1]. I also know some have
> AC> used Flink, and some have had interest in Pulsar. I think we should
> AC> have streaming options, but the fact is many don't use any buffer
> AC> like Kafka (direct HTTP), which leads me to think we still need an
> AC> after-the-fact option (pull from storage). Moreover, Spark's embedded
> AC> mode is nice, as it can be treated as a dumb cron job.
>
> AC> Looking for ideas,
> AC> -A
>
> AC> [1] https://github.com/sysco-middleware/zipkin-dependencies-streaming
>
> AC> ---------------------------------------------------------------------
> AC> To unsubscribe, e-mail: [email protected]
> AC> For additional commands, e-mail: [email protected]
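
P.S. For anyone skimming the thread: the aggregation itself is conceptually
small, even if running it at scale against three storage backends is not.
Below is a rough sketch of the core computation in plain Java. The Span,
Link and Counts types are made up for illustration, not Zipkin's real model
classes, and it assumes complete traces and one service per span; the real
job also has to handle shared spans, missing parents, and paging data out of
storage.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, trimmed-down types for illustration only.
record Span(String traceId, String id, String parentId, String service, boolean error) {}
record Link(String parent, String child) {}
record Counts(long calls, long errors) {}

public class DependencyLinkSketch {

  /** Groups spans by trace, then counts parent-service -> child-service calls. */
  static Map<Link, Counts> link(List<Span> spans) {
    // Bucket spans per trace so parent lookups stay within one trace.
    Map<String, List<Span>> byTrace = new HashMap<>();
    for (Span s : spans) {
      byTrace.computeIfAbsent(s.traceId(), k -> new ArrayList<>()).add(s);
    }

    Map<Link, Counts> links = new HashMap<>();
    for (List<Span> trace : byTrace.values()) {
      // Index span id -> owning service within this trace.
      Map<String, String> serviceBySpanId = new HashMap<>();
      for (Span s : trace) serviceBySpanId.put(s.id(), s.service());

      // Each child span whose parent lives in a different service counts as
      // one call on the parent -> child dependency link.
      for (Span s : trace) {
        String parent = s.parentId() == null ? null : serviceBySpanId.get(s.parentId());
        if (parent == null || parent.equals(s.service())) continue;
        Link key = new Link(parent, s.service());
        Counts c = links.getOrDefault(key, new Counts(0, 0));
        links.put(key, new Counts(c.calls() + 1, c.errors() + (s.error() ? 1 : 0)));
      }
    }
    return links;
  }
}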
