Re: a hive thrift alternative

Edward Capriolo Wed, 02 May 2012 14:46:23 -0700

The JDBC client in embedded thick-client mode (never used it) sounds
like the CLI on steroids :) Even more on the client path now.


As I stated originally:

"I would not describe hive-thrift as horrible but there is some unpleasantness."

Two people can declare an X and step on each other. OMG no way around
that right?

application1-x=5
application2-x=6

Wait! What if I need to run two copies of application1 at once? Deal
breaker right? Nope.

application1-$pid-x=5
application1-$pid-x=7
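
A minimal sketch of that naming scheme, in Python for brevity. The
helper name namespaced_key and the pids 101 and 102 are made up for
illustration; nothing here is a Hive API.

```python
import os


def namespaced_key(app, name, pid=None):
    """Build a key like 'application1-12345-x' so concurrent runs do not collide.

    Hypothetical helper: defaults to the current process id when no pid is given.
    """
    pid = os.getpid() if pid is None else pid
    return f"{app}-{pid}-{name}"


# Two copies of application1 running at once each get their own "x".
conf = {}
conf[namespaced_key("application1", "x", pid=101)] = 5
conf[namespaced_key("application1", "x", pid=102)] = 7
```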

Furthermore, most people do not need to touch the conf to run queries
and those using hive-thrift are more likely to just do any variable
replacement in the java code on the client side.
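
Client-side replacement could look roughly like this. It is sketched
in Python rather than Java to keep it short, using the standard
library's string.Template. The table name web_logs, the dt column,
and the render_query helper are assumptions for illustration only.

```python
from string import Template

# Hypothetical query template; the placeholders are filled in on the client
# before the statement is ever sent to the server.
QUERY = Template("SELECT x, y FROM ${table} WHERE dt = '${dt}'")


def render_query(table, dt):
    """Substitute variables client-side; raises KeyError if one is missing."""
    return QUERY.substitute(table=table, dt=dt)


sql = render_query("web_logs", "2012-05-02")
```

Since the substitution happens in the client, no shared server-side
conf is touched. Note that this does plain text replacement with no
escaping, so it only suits trusted inputs.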

Again, forking hive takes about 3-5 seconds per invocation.

$ time hive -S -e  "show tables"
real    0m3.547s
user    0m5.339s
sys     0m0.351s

We have many processes that have hundreds or thousands of steps. Each
fork really adds to the runtime. With hive-thrift, which has a pooled DB
connection and a pooled FS connection, we get tasks done in
milliseconds, not seconds. I'm not here to try to convince anyone to
switch to my action or anything, but it works fine for me and there is
a big upside.
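
To put rough numbers on the fork overhead, here is a back-of-the-envelope
sketch in Python. The fork cost is the "real" time from the run
above; the 5 ms pooled cost is an assumed stand-in for "milliseconds",
not a measurement.

```python
FORK_SECONDS = 3.547    # 'real' time from the `time hive -S -e` run above
POOLED_SECONDS = 0.005  # ASSUMPTION: illustrative per-call cost, not measured


def overhead_seconds(steps, per_step_seconds):
    """Total startup overhead for a job with the given number of steps."""
    return steps * per_step_seconds


for steps in (100, 1000):
    forked = overhead_seconds(steps, FORK_SECONDS)
    pooled = overhead_seconds(steps, POOLED_SECONDS)
    print(f"{steps} steps: forked ~{forked / 60:.1f} min, pooled ~{pooled:.1f} s")
```

At a thousand steps the forks alone cost close to an hour before any
query does real work.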


On Wed, May 2, 2012 at 5:01 PM, Alejandro Abdelnur <[email protected]> wrote:
> Hi Ed,
>
> I've checked this with Carl and got the following:
>
> ----
> HIVE-2503 doesn't really fix the underlying problems. The example that
> I gave in that earlier email of HiveServer reusing the same HiveConf
> between disconnects is still valid on trunk (i.e. even with
> HIVE-2503). If Ed wants to access Hive from Oozie via an API instead
> of through the CLI, then I think his best bet is to run the JDBC
> driver in embedded (thick-client) mode.
> ----
>
> Hope this clarifies the current state of things regarding the Thrift server.
>
> Thx
>
>
> On Wed, May 2, 2012 at 6:12 AM, Edward Capriolo <[email protected]> wrote:
>> https://issues.apache.org/jira/browse/HIVE-2503
>>
>> I believe what you are describing is fixed in trunk.
>>
>> On Tuesday, May 1, 2012, Alejandro Abdelnur <[email protected]> wrote:
>>> Edward,
>>>
>>> I agree that hive thrift server would be the ideal approach. However
>>> the thrift server is not multi-user/multi-job friendly:
>>>
>>>
>>> http://mail-archives.apache.org/mod_mbox/hive-dev/201204.mbox/%3CCAJqeMKTDOmDZfNUUW8kSgkivZPkC%2BkH9H5D_RL2YhJGhh4rqNQ%40mail.gmail.com%3E
>>>
>>> Until Hive addresses this I think we are better off with the CLI approach.
>>>
>>> Thx
>>>
>>> On Mon, Apr 30, 2012 at 10:03 AM, Edward Capriolo <[email protected]> wrote:
>>>> HaHa. I never rejoined the list after it moved from Yahoo.
>>>>
>>>> I would not describe hive-thrift as horrible but there is some
>>>> unpleasantness.
>>>>
>>>> Near future:
>>>> https://issues.apache.org/jira/browse/HIVE-2935
>>>>
>>>> In any case I am willing to accept the issues. I run multiple
>>>> hive-thrift servers behind ha-proxy
>>>>
>>>>
>>>> http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/running_a_hive_thrift_cluster
>>>>
>>>> This cuts down on concurrency-type problems. It's hive so not sure how
>>>> much concurrency is needed there.
>>>>
>>>> Our group just decided to part ways with programming over the CLI. Too
>>>> much stuff like this:
>>>>
>>>> hive -S -e "select x,y from $TABLE WHERE $STUFF" | awk whatever
>>>> or:
>>>> mylist=`hadoop dfs -ls /bla`
>>>>
>>>> That was not unit testable and just really ugly. Even if it fails
>>>> 1/1000 times we have try/catch, and we have done stuff that can bring
>>>> up the entire stack end to end in an IDE now.
>>>>
>>>> Layering on top of the CLI is a bad idea in the long run; it's like
>>>> expect-scripting an ssh session. Not that it was a bad design choice
>>>> for oozie at the time but it is certainly not the ideal way to handle
>>>> it.
>>>
>>>
>>>
>>> --
>>> Alejandro
>>>
>
>
>
> --
> Alejandro
