Re: Hive performance vs. SQL?

2012-03-19 Thread Keith Wiley
Thanks for the response.

Cheers!

On Mar 19, 2012, at 16:42 , Maxime Brugidou wrote:

> From my experience, if your data fits in a SQL database without sharding or
> anything, don't ever think twice. Hive is not even comparable.



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
   --  Keith Wiley




Re: Hive performance vs. SQL?

2012-03-19 Thread Maxime Brugidou
From my experience, if your data fits in a SQL database without sharding or
anything, don't ever think twice. Hive is not even comparable.

I would say that Hive is a nice SQL interface over Hadoop M/R rather than
a SQL replacement. If you are running a DWH in SQL and you don't need to
grow your data to at least a couple of TB, then keep SQL. The very nice
feature of Hadoop/Hive is that your DWH can grow (almost) horizontally
without much trouble by buying new servers, and most of your queries scale
with the number of servers too.
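
To get a rough feel for where that scaling starts to pay off, here is a toy
cost model in Python. It is only a sketch under stated assumptions: every
cluster job pays a fixed startup cost, work divides evenly across servers,
and all function names and throughput numbers are made up for illustration,
not measured on any real Hive or SQL system.

```python
def single_machine_seconds(data_gb, gb_per_sec):
    """Time for one SQL server to scan the data (hypothetical throughput)."""
    return data_gb / gb_per_sec


def cluster_seconds(data_gb, servers, gb_per_sec_per_server, job_overhead_sec):
    """Fixed job startup cost plus scan time split evenly across servers."""
    return job_overhead_sec + data_gb / (servers * gb_per_sec_per_server)


def min_servers_to_win(data_gb, sql_rate, hive_rate, overhead, max_servers=1000):
    """Smallest cluster size that beats the single machine, or None."""
    target = single_machine_seconds(data_gb, sql_rate)
    for n in range(1, max_servers + 1):
        if cluster_seconds(data_gb, n, hive_rate, overhead) < target:
            return n
    return None


if __name__ == "__main__":
    # Assumed numbers: SQL scans 1 GB/s, each cluster node 0.1 GB/s,
    # and every cluster job costs 30 s before doing any useful work.
    for size_gb in (1, 1000):
        n = min_servers_to_win(size_gb, sql_rate=1.0, hive_rate=0.1, overhead=30)
        print(f"{size_gb} GB: servers needed to beat one machine = {n}")
```

With these invented inputs, the 1 GB case never wins (the startup cost alone
exceeds the single-machine scan time), while the 1000 GB case needs 11
servers. The real crossover depends entirely on the actual rates and
overhead, which you would have to measure.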

You have to know that for a SELECT count(1) FROM t where t is ~1 GB, starting
and stopping the M/R job, which has huge overhead, can take more time than
the actual counting. A simple wc -l takes about a second on any normal PC.
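
For comparison, counting lines locally needs very little code. The sketch
below is a rough Python equivalent of wc -l, timed on a throwaway file. The
helper names and the sample file are hypothetical, and the timing you see
will depend on your disk and machine.

```python
import os
import tempfile
import time


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, reading it in 1 MiB chunks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def make_sample_csv(rows):
    """Write a throwaway CSV with `rows` lines and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w") as f:
        for i in range(rows):
            f.write(f"{i},value_{i}\n")
    return path


if __name__ == "__main__":
    path = make_sample_csv(1_000_000)
    start = time.perf_counter()
    n = count_lines(path)
    elapsed = time.perf_counter() - start
    print(f"{n} lines in {elapsed:.3f} s")
    os.remove(path)
```

There is no job to schedule and no cluster to start here, which is why a
single process can finish before an M/R job has even launched.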

On Mon, Mar 19, 2012 at 11:51 PM, Keith Wiley  wrote:

> I haven't had an opportunity to set up a huge Hive database yet because
> exporting csv files from our SQL database is, in itself, a rather laborious
> task.  I was just curious how I might expect Hive to perform vs. SQL on
> large databases and large queries?  I realize Hive is pretty "latent" since
> it builds and runs MapReduce jobs for even the simplest queries, but that
> is precisely why I think it might perform better on long queries against
> large (external CSV) databases.
>
> Would you expect Hive to ever outperform SQL on a single machine
> (standalone or pseudo-distributed mode)?  I am entirely open to the
> possibility that the answer is no, that Hive could never compete with SQL
> on a single machine.  Is this true?
>
> If so, how large (how parallel) do you think the underlying Hadoop cluster
> needs to be before Hive overtakes SQL?  2X?  10X?  Where is the crossover
> point where Hive actually outperforms SQL?
>
> Along similar lines, might Hive never outperform SQL on a database small
> enough for SQL to run on a single machine, say 10s to 100s of GB?  Must the
> database itself be so large that SQL is effectively crippled and the data
> must be distributed before Hive offers significant gains?
>
> I am really just trying to get a basic feel for how I might anticipate
> Hive's behavior vs. SQL once I get a large system up and running.
>
> Thanks.


Hive performance vs. SQL?

2012-03-19 Thread Keith Wiley
I haven't had an opportunity to set up a huge Hive database yet because 
exporting csv files from our SQL database is, in itself, a rather laborious 
task.  I was just curious how I might expect Hive to perform vs. SQL on large 
databases and large queries?  I realize Hive is pretty "latent" since it builds 
and runs MapReduce jobs for even the simplest queries, but that is precisely 
why I think it might perform better on long queries against large (external 
CSV) databases.

Would you expect Hive to ever outperform SQL on a single machine (standalone or 
pseudo-distributed mode)?  I am entirely open to the possibility that the 
answer is no, that Hive could never compete with SQL on a single machine.  Is 
this true?

If so, how large (how parallel) do you think the underlying Hadoop cluster 
needs to be before Hive overtakes SQL?  2X?  10X?  Where is the crossover point 
where Hive actually outperforms SQL?

Along similar lines, might Hive never outperform SQL on a database small enough 
for SQL to run on a single machine, say 10s to 100s of GB?  Must the database 
itself be so large that SQL is effectively crippled and the data must be 
distributed before Hive offers significant gains?

I am really just trying to get a basic feel for how I might anticipate Hive's 
behavior vs. SQL once I get a large system up and running.

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
   --  Abe (Grandpa) Simpson