Stephan,
Thanks for the response.
The one thing I don't appreciate from those who promote and DOCUMENT Hive on
Spark is that there seems to be absolutely no evidence that Hive on Spark
WORKS. As a matter of fact, after a lot of pain, I noticed that it is supported
by almost nobody.
If someone dares to document Hive on Spark (see link
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
why can't they have the decency to mention which specific combo of
Hadoop/Spark/Hive versions they used that works? Include a git repo in the doc
with all the right versions and libraries. Why not? We can start from there and
progressively move to newer libraries in case the doc becomes stale. I am not
really asking for much. I just want to know what the documenter used to claim
that Hive on Spark works, that's it.
Clearly, for most cases, this setup is broken, and the doc misleads people into
wasting time on a broken setup.
I love this tech. But I do notice some mean-spirited or very negligent actions
by the Apache development community. Documenting Hive on Spark while knowing it
won't work for most cases means Apache developers don't give a crap about the
time wasted by people like us.
On Friday, March 17, 2017 1:14 PM, Edward Capriolo <[email protected]>
wrote:
On Fri, Mar 17, 2017 at 2:56 PM, hernan saab <[email protected]>
wrote:
I have been in a similar world of pain. Basically, I tried to use an external
Hive to have user access controls with a Spark engine. In the end, I realized
that it was a better idea to use Apache Tez instead of a Spark engine for my
particular case.
But the journey is what I want to share with you. The big data Apache tools and
libraries such as Hive, Tez, Spark, Hadoop, Parquet, etc. are not as
interchangeable as we would like to think. Only a very limited set of
combinations of very specific versions works. This is why tools like Ambari can
be useful. Ambari lays out a path of version combos known to work, and the
dirty work is done under the UI.
More often than not, when you try a version that few people have tried, you
will get error messages that will derail you and cause you to waste a lot of
time.
In addition, this group, as well as many other apache big data user groups,
provides extremely poor support for users. The answers you usually get are not
even hints to a solution. Their answers usually translate to "there is nothing
I am willing to do about your problem. If I did, I should get paid" in many
cryptic ways.
If you ask your question to the Spark group they will send you to the Hive
group and vice versa (I can almost guarantee it based on previous experience).
But in hindsight, people who work on these kinds of things typically make more
money than the average developer. If you make more $$s, it makes sense that
learning this stuff is supposed to be harder.
Conclusion: don't try it. Or try using Tez/Hive instead of Spark/Hive if you
are querying large files.
On Friday, March 17, 2017 11:33 AM, Stephen Sprague <[email protected]>
wrote:
:( gettin' no love on this one. any SME's know if Spark 2.1.0 will work
with Hive 2.1.0 ? That JavaSparkListener class looks like a deal breaker to
me, alas.
thanks in advance.
Cheers,
Stephen.
On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague <[email protected]> wrote:
hi guys,
wondering where we stand with Hive On Spark these days?
i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental versions)
and running up against this class not found:
java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
searching the Cyber i find this:
1. http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive
which pretty much describes my situation too and it references this:
2. https://issues.apache.org/jira/browse/SPARK-17563
which indicates a "won't fix" - but does reference this:
3. https://issues.apache.org/jira/browse/HIVE-14029
which looks to be fixed in hive 2.2 - which is not released yet.
so if i want to use spark 2.1.0 with hive am i out of luck - until hive 2.2?
thanks,
Stephen.
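A quick way to see whether any jar in a given Spark install actually ships the
missing class is to scan the jars for its entry. The following is only a
minimal sketch in Python: the helper name find_class_in_jars and the directory
passed on the command line are assumptions for illustration, not part of Hive
or Spark.

```python
import sys
import zipfile
from pathlib import Path


def find_class_in_jars(jar_dir, class_name):
    """Return the jar files under jar_dir that contain class_name.

    find_class_in_jars is a hypothetical helper written for this example.
    """
    # A class a.b.C is stored inside a jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(jar_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar)
        except zipfile.BadZipFile:
            # Skip files that are named *.jar but are not valid archives
            continue
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_class.py <jar_dir> [class_name]
    # <jar_dir> is whatever directory holds the Spark jars (an assumption here)
    jar_dir = sys.argv[1]
    class_name = (
        sys.argv[2] if len(sys.argv) > 2 else "org.apache.spark.JavaSparkListener"
    )
    found = find_class_in_jars(jar_dir, class_name)
    for jar in found:
        print(jar)
    if not found:
        print("not found: " + class_name)
```

If this prints "not found" for the jars of the Spark version in use, that would
be consistent with the NoClassDefFoundError reported above.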
Stephan,
I understand some of your frustration. Remember that many in open source are
volunteering their time. This is why if you pay a vendor for support of some
software you might pay 50K a year or $200.00 an hour. If I was your
vendor/consultant I would have started the clock 10 minutes ago just to answer
this email :). The only "pay" I ever got from Hive is that I can use it as a
resume bullet point, and I wrote a book which pays me royalties.
As it relates specifically to your problem, when you see the trends you are
seeing it probably means you are in a minority of the user base. Either you're
doing something no one else is doing, you are too cutting edge, or no one has
an easy solution. Hive is moving away from classic MapReduce, and two other
execution engines have been built: Tez and HiveOnSpark. Because we are open
source we allow people to "scratch an itch"; that is the Apache way. From time
to time it means something that was added stops being viable because of lack of
support.
I agree with your final assessment, which is that Tez is the most viable engine
for Hive. This is by no means a put-down of the HiveOnSpark work, and it does
not mean it will never be the most viable. By the same token, if the versions
fall out of sync and all that exists is complaints, the viability speaks for
itself.
Remember that keeping two fast-moving things together is no easy chore. I used
to run Hive + Cassandra. Seems easy. Crap, two versions of Commons CLI: shade
one version, everything works. Crap, the new Hive release has a different
version of Thrift: shade + patch. Crap, now one of the other dependencies is
incompatible: fork + shade + patch. At some point you have to say to yourself:
if this solution cannot reach critical mass, such that I am the only one
doing/patching it, then I give up and find some other way to do it.