Re: Hive performance vs. SQL?

2012-03-19 Thread Maxime Brugidou
From my experience, if you can fit data in a SQL without sharding or
anything, don't ever think twice. Hive is not even comparable.

I would rather say that Hive is a nice SQL interface over Hadoop M/R rather
than any SQL replacement. If you are running a DWH in SQL and you don't
need to grow your data to at least a couple of Tb, then keep SQL. The very
nice feature of Hadoop/Hive is that your DWH can grow (almost) horizontally
without much trouble by buying new servers, and most of your queries scale
with the number of servers too.

You have to know that doing a SELECT count(1) FROM t where t is ~1Gb can
take more time to start/stop the M/R job which has huge overhead than to
actually count. A simple wc -l takes about a second on any normal PC.

On Mon, Mar 19, 2012 at 11:51 PM, Keith Wiley kwi...@keithwiley.com wrote:

 I haven't had an opportunity to set up a huge Hive database yet because
 exporting csv files from our SQL database is, in itself, a rather laborious
 task.  I was just curious how I might expect Hive to perform vs. SQL on
 large databases and large queries?  I realize Hive is pretty latent since
 it builds and runs MapReduce jobs for even the simplest queries, but that
 is precisely why I think it might perform better on long queries against
 large (external CSV) databases).

 Would you expect Hive to ever outperform SQL on a single machine
 (standalone or pseudo-distributed mode)?  I am entirely open to the
 possibility that the answer is no, that Hive could never compete with SQL
 in a single machine.  Is this true?

 If so, how large (how parallel) do you think the underlying Hadoop cluster
 needs to be before Hive overtakes SQL?  2X?  10X?  Where is the crossover
 point where Hive actually outperforms SQL?

 Along similar lines, might Hive never outperform SQL on a database small
 enough for SQL to run on a single machine, a 10s to 100s of GBs?  Must the
 database itself be so large that SQL is effectively crippled and the data
 must be distributed before Hive offer significant gains?

 I am really just trying to get a basic feel for how I might anticipate's
 Hive's behavior vs. SQL once I get a large system up and running.

 Thanks.


 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com

 I used to be with it, but then they changed what it was.  Now, what I'm
 with
 isn't it, and what's it seems weird and scary to me.
   --  Abe (Grandpa) Simpson

 




Change in serdeproperties does not update existing partitions

2011-09-13 Thread Maxime Brugidou
Hello,

I am using Hive 0.7 from cloudera cdh3u0 and I encounter a strange behavior
when I update the serdeproperties of a table (for example for the
RegexSerDe).

If you have a simple partitioned table like

create external table test_table (
id int)
partitioned by (day string)
row format serde 'org.apache.hadoop.contrib.serde2.RegexSerDe'
with serdeproperties (
'input.regex' = '.* ([^ ]*)'
);

alter table test_table add partition (day='2011-09-01');

alter table test_table set serdeproperties  (
'input.regex' = '(.*)'
);

alter table test_table add partition (day='2011-09-02');


The first partition will still use the older regex and the new one will use
the new regex. Is this intended behavior? Why?

Thanks for your help,
Maxime


Re: Change in serdeproperties does not update existing partitions

2011-09-13 Thread Maxime Brugidou
Thanks Ashutosh for your answer. I actually use external tables so that i
don't drop my partitions data.

This is still an odd behavior to me and I don't get why someone would expect
it. Whenever I need to add a column to a table (my table here represent a
log, and it is common to add fields to logs), I need to drop all partitions
and recreate them. How do people do in general?

Do you have a use case where people want to alter a table and not update
existing partitions? Is it so that if your file format evolves you don't
have to convert the whole history?

Best,
Maxime

On Tue, Sep 13, 2011 at 7:03 PM, Ashutosh Chauhan hashut...@apache.orgwrote:

 Hey Maxime,

 Yeah, thats intended behavior. After you do alter on table, all subsequent
 actions on table and partitions will inherit from it. If you want to modify
 properties of already existing partitions, you should be able to do
 something like 'alter table test_table partition (day='2011-09-02') set
 serdeproperties ('input.regex' = '(.*)')' Unfortunately this is not
 supported currently. Feel free to file a bug for that.

 A workaround (applicable only because you are using external table) is to
 drop partition and then add them again. When you drop a partition from
 external table, only metadata gets wiped out, data is not deleted, so when
 you will add partition again, it will inherit from table serde properties
 and you will get what you are looking for. Use this workaround with care,
 you don't want to loose your data in recreating partitions.

 Hope it helps,
 Ashutosh

 On Tue, Sep 13, 2011 at 06:03, Maxime Brugidou 
 maxime.brugi...@gmail.comwrote:

 Hello,

 I am using Hive 0.7 from cloudera cdh3u0 and I encounter a strange
 behavior when I update the serdeproperties of a table (for example for the
 RegexSerDe).

 If you have a simple partitioned table like

 create external table test_table (
 id int)
 partitioned by (day string)
 row format serde 'org.apache.hadoop.contrib.serde2.RegexSerDe'
 with serdeproperties (
 'input.regex' = '.* ([^ ]*)'
 );

 alter table test_table add partition (day='2011-09-01');

 alter table test_table set serdeproperties  (
 'input.regex' = '(.*)'
 );

 alter table test_table add partition (day='2011-09-02');


 The first partition will still use the older regex and the new one will
 use the new regex. Is this intended behavior? Why?

 Thanks for your help,
 Maxime