Perpetual support problems using Spark for dependency link aggregation

Adrian Cole Mon, 18 Mar 2019 17:52:47 -0700

Hi, team.

A long time ago, we arbitrarily chose Spark for dependency link
aggregation (porting the work from Eirik's Hadoop job). The initial
Spark job was left incomplete and then abandoned by its author. I've
tried hard to support it, but it has needed perpetual maintenance and
most of us have no idea how to support it. Yet we get a lot of user
questions about it, and the support load is higher than for most of
our projects.


The Elasticsearch part is full of landmines, from the "wan only" stuff
to the narrow range of versions they support. It is rev-locked to a
JRE (even if that will change later). We've had users complain about
CVE maintenance and actively ask for a non-Spark option. General
support comes in the form of questions about cluster distribution,
which no one knows the answer to. Out of desperation, I recently added
a change to help show where Spark support is.

https://github.com/openzipkin/zipkin-dependencies/pull/133

All this said, despite the problems running distributed or with
Elasticsearch, most people can start the zipkin-dependencies job as a
one-shot cron job without much help.

I think we have to be honest about the fact that since this project
started, we've rarely had anyone able to support it. I hope we can get
out of the mutually disappointing support swamp. Does anyone have any
ideas?

I would like to think someone could come in and save us, but it seems
we should also consider other tools, as that usually doesn't happen,
and one person saving us isn't sustainable (usually we need a few
people to know a tool in order to realistically support it). It is
possible to recruit for this, but we need significant, close buy-in
from people who know Spark imho, like actually helping with support,
if we want to continue down this path.

I know there's a Kafka streaming option [1]. I also know some have
used Flink, and some have shown interest in Pulsar. I think we should
have streaming options, but the fact is that many sites don't use any
buffer like Kafka (they report directly over HTTP), which leads me to
think we still need an after-the-fact option (pull from storage).
Moreover, Spark's embedded mode is nice as it can be treated as a dumb
cron job.

Looking for ideas,
-A

[1] https://github.com/sysco-middleware/zipkin-dependencies-streaming
