Hi, Andriy. Thanks for responding. I don't think we can assume there will
always be a choice for streaming or online aggregation.
The two easiest ways out would be a Spark guru (ideally gurus) stepping
forward, or an easier-to-support alternative for after-the-fact aggregation
over large datasets that minimally works with MySQL, Elasticsearch and
Cassandra. (I've pasted a rough sketch of the core computation at the bottom
of this mail.)

-A

On Tue, Mar 19, 2019, 7:20 PM Andriy Redko <[email protected]> wrote:

> Hi Adrian,
>
> First of all, I want to confirm from personal experience that the
> dependencies are often built after the fact, so there is a real need for
> this kind of job/component. There are many choices: either use the data
> processing engines you mentioned, or onboard a data store with aggregation
> capabilities (maybe ClickHouse, for example). What do you think would be
> the best route for Zipkin? Keep Spark but look for maintenance help? Or
> (re)write it altogether, ideally with no data engines needed? Just trying
> to understand how you envision it.
>
> Best Regards,
> Andriy Redko
>
> AC> Hi, team.
>
> AC> A long time ago, we arbitrarily used Spark for dependency link
> AC> aggregation (porting the work from Eirik's Hadoop job). The initial
> AC> Spark job was created incomplete, then abandoned by the author. I've
> AC> tried hard to support it, but it has been perpetual maintenance and
> AC> most of us have no idea how to support it. Yet we get a lot of user
> AC> questions about it, and the support load is higher than for most of
> AC> our projects.
>
> AC> The Elasticsearch part is a minefield, from the "wan only" stuff to
> AC> the narrow supported range of versions. It is rev-locked to a JRE
> AC> (even if that will change later). We've had users complain about CVE
> AC> maintenance and actively ask for a non-Spark option. General support
> AC> comes in the form of questions about cluster distribution that no one
> AC> knows the answer to. I've recently, in desperation, added a change to
> AC> help show where Spark support is.
>
> AC> https://github.com/openzipkin/zipkin-dependencies/pull/133
>
> AC> All this said, despite the problems running distributed or with
> AC> Elasticsearch, most can start the zipkin-dependencies job as a
> AC> one-shot cron job without much help.
>
> AC> I think we have to be honest about the fact that since this project
> AC> started, we've rarely had anyone able to support it. I hope we can
> AC> get out of the mutually disappointing support swamp. Does anyone have
> AC> any ideas?
>
> AC> I would like to think someone could come in and save us, but it seems
> AC> we should also consider other tools, as that usually doesn't happen,
> AC> and one person saving us isn't sustainable (usually we need a few
> AC> people to know a tool in order to realistically support it). It is
> AC> possible to recruit for this, but imho we need significant, close
> AC> buy-in from people who know Spark, like actually helping with
> AC> support, if we want to continue down this path.
>
> AC> I know there's a Kafka streaming option [1]. I also know some have
> AC> used Flink, and some have had interest in Pulsar. I think we should
> AC> have streaming options, but the fact is many don't use any buffer
> AC> like Kafka (direct HTTP), which leads me to think we still need an
> AC> after-the-fact option (pull from storage). Moreover, Spark's embedded
> AC> mode is nice, as it can be treated as a dumb cron job.
>
> AC> Looking for ideas,
> AC> -A
>
> AC> [1] https://github.com/sysco-middleware/zipkin-dependencies-streaming
>
> AC> ---------------------------------------------------------------------
> AC> To unsubscribe, e-mail: [email protected]
> AC> For additional commands, e-mail: [email protected]
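
P.S. For anyone skimming the thread: the aggregation itself is conceptually
small, even if running it at scale against three storage backends is not.
Below is a rough sketch of the core computation in plain Java. The Span,
Link and Counts types are made up for illustration, not Zipkin's real model
classes, and it assumes complete traces and one service per span; the real
job also has to handle shared spans, missing parents, and paging data out of
storage.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, trimmed-down types for illustration only.
record Span(String traceId, String id, String parentId, String service, boolean error) {}
record Link(String parent, String child) {}
record Counts(long calls, long errors) {}

public class DependencyLinkSketch {

  /** Groups spans by trace, then counts parent-service -> child-service calls. */
  static Map<Link, Counts> link(List<Span> spans) {
    // Bucket spans per trace so parent lookups stay within one trace.
    Map<String, List<Span>> byTrace = new HashMap<>();
    for (Span s : spans) {
      byTrace.computeIfAbsent(s.traceId(), k -> new ArrayList<>()).add(s);
    }

    Map<Link, Counts> links = new HashMap<>();
    for (List<Span> trace : byTrace.values()) {
      // Index span id -> owning service within this trace.
      Map<String, String> serviceBySpanId = new HashMap<>();
      for (Span s : trace) serviceBySpanId.put(s.id(), s.service());

      // Each child span whose parent lives in a different service counts as
      // one call on the parent -> child dependency link.
      for (Span s : trace) {
        String parent = s.parentId() == null ? null : serviceBySpanId.get(s.parentId());
        if (parent == null || parent.equals(s.service())) continue;
        Link key = new Link(parent, s.service());
        Counts c = links.getOrDefault(key, new Counts(0, 0));
        links.put(key, new Counts(c.calls() + 1, c.errors() + (s.error() ? 1 : 0)));
      }
    }
    return links;
  }
}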
