Hi Stephen,

Thanks for your valuable comments, which certainly changed my view on this matter.

First, I only intended to "recipe test in the doc environment" against the vanilla Apache Hadoop; the commercial providers can cater for themselves. I was not clear on that one.

Secondly, I was not aware of the manifest entry you pointed to. Since all dependency convergence conflicts of the spark-gremlin module are managed away manually in the pom dependency section, I had not expected the spark-gremlin plugin to have a backdoor that reintroduces some of these excluded dependencies. Does this mean that spark-gremlin as a plugin from the Gremlin Console is not really tested (but only as a module)? And is the manifest entry necessary at all, given that spark-gremlin depends on hadoop-gremlin, which depends on hadoop-client? OK, sorry, too many questions; it works as it is and the Hadoop deps are a jungle in general, as you note. Let's just keep this in the back of our minds.

Apart from my recipe question, it would still be nice to be able to define a Java project with just the spark-gremlin and hadoop-gremlin dependencies and be able to connect to a YARN cluster. Implicitly, YARN support is already in the spark-gremlin API, because spark-gremlin accepts the spark.master=yarn-client property from the HadoopGraph configuration. That would speak in favor of including spark-yarn (with the proper excludes) as a spark-gremlin dependency. It would also be consistent with hadoop-yarn.jar hanging around already :-)
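To make that concrete: the only YARN-specific bit in a HadoopGraph configuration would be the spark.master property, something along these lines (a sketch based on the TinkerPop reference docs; the file name, input/output locations and memory setting are illustrative):

```properties
# hadoop-yarn.properties -- hypothetical file name; locations/settings illustrative
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.inputLocation=tinkerpop-modern.kryo
gremlin.hadoop.outputLocation=output
# the property that implies YARN support in the spark-gremlin API
spark.master=yarn-client
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
```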

For now, your concerns are clear to me. If I want to proceed on this, I would first try the spark-yarn recipe from the documentation environment to see how this works out. Then I can come back with more specific questions.

Cheers,   Marc


I did see that - I was wondering if anyone would try to convert that into
TinkerPop documentation of some sort. I'll save my less positive comments
for the end and first just say what you could do if everyone is into this
idea. You could add it to the "Implementation Recipes" subsection of the
"Recipes" document.

>  - include the spark-yarn dependency to spark-gremlin

I could be wrong, but I don't think you need to add that as a direct
dependency. If we don't need it for compilation it probably shouldn't be in
the pom.xml. If you just need extra jars to come with the plugin to the
console when you do:

:install org.apache.tinkerpop spark-gremlin 3.2.5

you can just add a manifest entry to spark-gremlin to suck in additional
jars as part of that.  Note that we already do this with spark-gremlin -
see:

https://github.com/apache/tinkerpop/blob/0d532aa91e0c9bc775c36d9572f5f816d323abb6/spark-gremlin/pom.xml#L406
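in case the link goes stale: the mechanism is a manifest attribute stamped on the spark-gremlin jar via the jar plugin configuration, roughly like this (a sketch - the exact coordinates and version are illustrative, check the pom at the link for the real values):

```xml
<!-- in spark-gremlin/pom.xml, jar plugin configuration (sketch; version illustrative) -->
<archive>
  <manifestEntries>
    <Gremlin-Plugin-Dependencies>org.apache.hadoop:hadoop-client:${hadoop.version}</Gremlin-Plugin-Dependencies>
  </manifestEntries>
</archive>
```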

dependencies are semi-colon separated, so you can just add more after that
entry. As for:

> do you see potential obstacles in accepting a PR along these lines?

Are there any other dependencies to add? Like, the blog post says you
tested on Hortonworks Data Platform sandbox - do we need that in the mix
too?

....and here's where i get sorta cringy as I alluded to at the start of
this......the only problem i'm concerned about is the one you posted:

> the recipe would be maintained and still work after version upgrades

that terrifies me. personally speaking, i'm terribly uninterested in
hunting down spark to the yarn to hadoop to the hortonworks to the cloudera
to the map-red-env.sh to the yarn-site.xml type of errors. it's not a nice
place at all. If that integration starts to fail for some reason our docs
will effectively be broken and someone is going to have to go down into
that ungodly hole of demons to unblock us and i'm scared of the dark.

on the flip side, i'm sensitive to users struggling with yarn stuff and
every time i see you solve a problem like that on the mailing list related
to that, i'm like "All hail the Tamer of Hadoop! Long live HadoopMarc!"
- so it seems like this is a need to some degree so it would be nice if we
could make it work somehow. Anyway - those are my thoughts on the matter.
Let's see what other people have to say.



On 06-07-17 at 11:02, Marc de Lignie wrote:
Hi Stephen,

I recently posted recipes on the gremlin and janusgraph user lists on configuring the binary distributions to work with a Spark-on-YARN cluster. I think it would be useful to have the TinkerPop recipe included in the Apache TinkerPop repo itself in the following way:

- include the spark-yarn dependency to spark-gremlin

- add the recipe to the docs so that it is actually run in the existing documentation environment at build time

In this way:

- the recipe would be less clumsy for users to follow (no external deps)

- the recipe would be maintained and still work after version upgrades

I do not have to remind you that many users have had problems with Spark on YARN and that the ability to run OLAP queries on an existing cluster is one of the attractive features of TinkerPop.

This brings me to the question: do you see potential obstacles in accepting a PR along these lines? I will probably wait for some time before actually doing this, though, to have more opportunity to "eat my own dogfood" and see if changes are still required.

Cheers,   HadoopMarc




--
Marc de Lignie
