Trying jdbc:hive client

2012-03-19 Thread shashwat shriparv
I am trying Hive using the Java JDBC client. I can execute simple queries like
SELECT * FROM table and SELECT * FROM table WHERE something=something, but
join queries throw the error below.
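
For reference, this is roughly what my client does (a sketch, not my actual
HiveJdbcJava code; it assumes the Hive 0.7-era driver class and a HiveServer
listening on localhost:10000):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcJoin {
    public static void main(String[] args) throws Exception {
        // Hive 0.7 JDBC driver (pre-HiveServer2)
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        String sql = "SELECT * FROM sampletab1 sp1 JOIN sampletab12 sp2"
                   + " ON (sp1.id = sp2.id) LIMIT 10";
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql); // joins fail here with SQLException
        while (res.next()) {
            System.out.println(res.getString(1));
        }
        con.close();
    }
}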

*In my NetBeans IDE, this is the exception:*

Running: SELECT * FROM sampletab1 sp1  JOIN sampletab12 sp2 ON (sp1.id =
sp2.id) limit 10
Exception in thread "main" java.sql.SQLException: Query returned non-zero
code: 9, cause: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MapRedTask
at
org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:192)
at witsmlstore.HiveJdbcJava.main(HiveJdbcJava.java:77)
Java Result: 1


*and at the server console this is the output:*


WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
files.
Hive history
file=/tmp/shashwat/hive_job_log_shashwat_201203191222_772980239.txt
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=number
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=number
In order to set a constant number of reducers:
  set mapred.reduce.tasks=number
Starting Job = job_201203191156_0003, Tracking URL =
http://localhost:50030/jobdetails.jsp?jobid=job_201203191156_0003
Kill Command = /home/shashwat/Hadoop/hadoop-0.20.205/libexec/../bin/hadoop
job  -Dmapred.job.tracker=localhost:9001 -kill job_201203191156_0003
2012-03-19 12:23:07,239 Stage-1 map = 0%,  reduce = 0%
2012-03-19 12:24:01,453 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201203191156_0003 with errors
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.MapRedTask


*When I checked the task tracker, the error was:*

*@ http://localhost:50030/taskdetails.jsp?tipid=task_201203191156_0003_m_00*

java.io.IOException: Cannot create an instance of InputSplit class =
org.apache.hadoop.hive.hbase.HBaseSplit:org.apache.hadoop.hive.hbase.HBaseSplit
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit.readFields(HiveInputFormat.java:145)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:396)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:249)



What may be the probable cause?

When I try embedded mode, it throws the following error:



12/03/19 12:27:39 INFO metastore.HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
12/03/19 12:27:39 INFO metastore.ObjectStore: ObjectStore, initialize called
12/03/19 12:27:39 INFO DataNucleus.Persistence: Property
datanucleus.cache.level2 unknown - will be ignored
12/03/19 12:27:39 INFO DataNucleus.Persistence: Property
javax.jdo.option.NonTransactionalRead unknown - will be ignored
12/03/19 12:27:39 INFO DataNucleus.Persistence: === Persistence Configuration ===
12/03/19 12:27:39 INFO DataNucleus.Persistence: DataNucleus Persistence
Factory - Vendor: DataNucleus  Version: 2.0.3
12/03/19 12:27:39 INFO DataNucleus.Persistence: DataNucleus Persistence
Factory initialised for datastore
URL=jdbc:derby:;databaseName=metastore_db;create=true
driver=org.apache.derby.jdbc.EmbeddedDriver userName=APP
12/03/19 12:27:39 INFO DataNucleus.Persistence:
===
12/03/19 12:27:42 INFO Datastore.Schema: Initialising Catalog , Schema
APP using None auto-start option
12/03/19 12:27:42 INFO Datastore.Schema: Catalog , Schema APP
initialised - managing 0 classes
12/03/19 12:27:42 INFO metastore.ObjectStore: Setting MetaStore object pin
classes with
hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
12/03/19 12:27:42 INFO DataNucleus.MetaData: Registering listener for
metadata initialisation
12/03/19 12:27:42 INFO metastore.ObjectStore: Initialized ObjectStore
12/03/19 12:27:42 WARN DataNucleus.MetaData: MetaData Parser encountered an
error in file
jar:file:/home/shashwat/Hadoop/hive-0.7.1/lib/hive-metastore-0.7.1.jar!/package.jdo
at line 11, column 6 : cvc-elt.1: Cannot 

how is number of mappers determined in mapside join?

2012-03-19 Thread Bruce Bian
Hi there,
When I'm executing the following queries in Hive:

set hive.auto.convert.join = true;
CREATE TABLE IDAP_ROOT as
SELECT a.*,b.acnt_no
FROM idap_pi_root a LEFT OUTER JOIN idap_pi_root_acnt b ON
a.acnt_id=b.acnt_id

the number of mappers run in the map-side join is 3. How is it determined?
When launching a job in Hadoop MapReduce, I know it's determined by the
function max(min split size, min(max split size, HDFS block size)), which in
my configuration is max(1B, min(256MB, 32MB)) = 32MB, and the two tables are
460MB and 1.5MB respectively.
Thus I expected around 15 mappers to be launched, which is not the case.

Thanks
Bruce


Re: how is number of mappers determined in mapside join?

2012-03-19 Thread Bejoy Ks
Hi Bruce
      In a map-side join the smaller table is loaded into memory, and hence
the number of mappers depends only on the data in the larger table. Say
CombineHiveInputFormat is used and we have our HDFS block size as 32 MB, min
split size as 1B and max split size as 256 MB. That means one mapper would be
processing data chunks not less than 1B and not more than 256 MB. Mappers are
triggered based on that, so a possibility in your case is:
mapper 1 - 200 MB
mapper 2 - 120 MB
mapper 3 - 140 MB
Every mapper processes data whose size is between 1B and 256 MB, for a total
of 460 MB, your table size.

I'm not sure of the formula you posted here. Can you point me to the document
you got it from?

Regards
Bejoy




Re: how is number of mappers determined in mapside join?

2012-03-19 Thread Bruce Bian
Hi Bejoy,
Thanks for your reply.
The function is from the book Hadoop: The Definitive Guide, 2nd edition. On
page 203 there is:
The split size is calculated by the formula (see the computeSplitSize()
method in FileInputFormat): max(minimumSize, min(maximumSize, blockSize)).
By default minimumSize < blockSize < maximumSize, so the split size is
blockSize.
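
For reference, the logic the book describes amounts to this (a sketch, not
the exact Hadoop source):

public class SplitSizeDemo {
    // FileInputFormat's computeSplitSize(), as the book describes it
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // max(1B, min(256MB, 32MB)) = 32MB in my configuration
        System.out.println(computeSplitSize(32 * mb, 1L, 256 * mb) / mb + " MB");
    }
}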

And I've actually used the HDFS block size to control the number of mappers
launched before.
So as to your response, do you mean that any data size between 1B and 256MB
is OK for a mapper to process?
Then the only way I can think of to increase the number of mappers is to
reduce the max split size.
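
Something like this before the query, I suppose (a sketch; it assumes
CombineHiveInputFormat honors the mapred split-size settings, in bytes):

set hive.auto.convert.join = true;
-- 32 MB max split: roughly 460MB / 32MB, i.e. ~15 mappers (sketch)
set mapred.max.split.size=33554432;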

Regards,
Bruce



Hive CLI and Standalone Server : Need Suggestion

2012-03-19 Thread LakshmiKanth P
Hi


I need to schedule my Hive scripts, which need to process incoming weblogs
on an hourly basis.

Currently, I can process my weblog files by executing my scripts from the
Hive command line interface.  Now I want to keep my scripts in a file and
invoke them at regular intervals.  I came to know that the Hive command line
options provide a facility to pass a .sql file as input for execution.  Is
this the right approach for a production environment?

OR

Should I run my Hive server in standalone mode and invoke my Hive scripts
using the JDBC API?

Request you to suggest the best approach.


Regards,
LK


Re: Hive CLI and Standalone Server : Need Suggestion

2012-03-19 Thread Bejoy Ks
Hi LakshmiKanth
        In production systems, if you have a sequence of commands to be
executed, pack them in order in a file. Then execute it as
hive -f filename;

For simplicity, you can use a cron job to run it on a schedule. Just put this
command in a .sh file and call that file from cron. In fact, you can use any
scheduler that can trigger a .sh file.
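
For example (a sketch; paths and file names here are just placeholders):

#!/bin/sh
# hourly_hive.sh - wrapper script the scheduler calls
hive -f /home/hadoop/scripts/weblog_hourly.sql

# crontab entry: run at minute 0 of every hour, logging each run
0 * * * * /home/hadoop/scripts/hourly_hive.sh >> /tmp/hourly_hive.log 2>&1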

But for Hadoop-based workflows the preferred workflow manager is Oozie, and
I recommend Oozie for Hadoop jobs.

Regards
Bejoy KS




RE: Hive CLI and Standalone Server : Need Suggestion

2012-03-19 Thread carla.staeben
Great topic, as I was wondering a similar thing this morning... I want to use 
Oozie to execute my Hive job, but I have to pass it job parameters that I 
generate with a shell script.  Some of the literature that I've seen says that 
Oozie may or may not allow calling shell scripts.  Is that true?

Thanks
Carla


Re: Hive CLI and Standalone Server : Need Suggestion

2012-03-19 Thread Edward Capriolo
This is a bit of a problem. Oozie is great for workflow scheduling, but
Oozie does not have actions for everything, and adding actions is
non-trivial in current versions.

I have created some bootleg/generic Oozie actions that make it easy
to exec pretty much anything and treat it as an action.

https://github.com/edwardcapriolo/m6d_oozie



Re: Hive CLI and Standalone Server : Need Suggestion

2012-03-19 Thread Alejandro Abdelnur
Edward,

Besides the mapreduce/streaming/hive/pig/sqoop/distcp actions, Oozie has a
JAVA action (to execute a Java main class in the cluster), an SSH action (to
execute a script via SSH on a remote host), and a SHELL action (to execute
a script in the cluster).

Would you mind explaining what your m6d extension does that the JAVA, SSH or
SHELL actions cannot do in a similar way?

Thanks.

Alejandro



Re: Hive CLI and Standalone Server : Need Suggestion

2012-03-19 Thread Edward Capriolo
I am not trying to knock Oozie, but:
MapReduce action: would be great, but the Hadoop docs taught me that the
proper way to write Hadoop programs was Tool and Configured. 90% of our
legacy jobs are Tools. The MapReduce action cannot launch Tools. So
JavaMain...

The SSH action is something I would never allow on our network. Super
bootleg and insecure.

The Hive action requires the entire Hive fat client, which is not easy since
our RDBMS needs to be configured to allow every possible tasktracker
to access its metastore. It would be better if HiveAction were a
HiveThriftAction; then it would only need minimal jars and a host/port
pair. Again, back to JavaMain...

Not sure about the shell action.  It may not have been around when I put
this framework together.

My main point is that Oozie in its current form is not very flexible.
What if I want to add an RDBMS action? Beg developers to patch it in?
Just having to patch in actions is detracting. (I know there is a jira
open on this.)

The reason I wrote the library was:
https://github.com/edwardcapriolo/m6d_oozie/blob/master/src/main/java/com/m6d/oozie/RunShellProps.java

The problem I was facing with the Shell and Java Main actions is that
if you want to extract any output to be used in the next phase of the
job, it is not easy to get at. I wrote a JavaMain that is
capture-output friendly.
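
For context, the convention such a wrapper builds on (a sketch of the
mechanism, not the actual RunShellProps code) is Oozie's capture-output: a
java action's main writes a Properties file to the path Oozie hands it, and
the next action can reference those values:

import java.io.File;
import java.io.FileOutputStream;
import java.util.Properties;

public class CaptureOutputMain {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // values the next workflow action can reference (illustrative key)
        props.setProperty("generated.param", "some-value");
        // Oozie sets this system property when <capture-output/> is used
        File out = new File(System.getProperty("oozie.action.output.properties"));
        FileOutputStream os = new FileOutputStream(out);
        props.store(os, "");
        os.close();
    }
}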






Hive performance vs. SQL?

2012-03-19 Thread Keith Wiley
I haven't had an opportunity to set up a huge Hive database yet because 
exporting CSV files from our SQL database is, in itself, a rather laborious 
task.  I was just curious how I might expect Hive to perform vs. SQL on large 
databases and large queries.  I realize Hive has high latency, since it builds 
and runs MapReduce jobs for even the simplest queries, but that is precisely 
why I think it might perform better on long queries against large (external 
CSV) databases.

Would you expect Hive to ever outperform SQL on a single machine (standalone or 
pseudo-distributed mode)?  I am entirely open to the possibility that the 
answer is no, that Hive could never compete with SQL on a single machine.  Is 
this true?

If so, how large (how parallel) do you think the underlying Hadoop cluster 
needs to be before Hive overtakes SQL?  2X?  10X?  Where is the crossover point 
where Hive actually outperforms SQL?

Along similar lines, might Hive never outperform SQL on a database small enough 
for SQL to run on a single machine, say, 10s to 100s of GBs?  Must the database 
itself be so large that SQL is effectively crippled and the data must be 
distributed before Hive offers significant gains?

I am really just trying to get a basic feel for how I might anticipate Hive's 
behavior vs. SQL once I get a large system up and running.

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me.
   --  Abe (Grandpa) Simpson




Re: Hive performance vs. SQL?

2012-03-19 Thread Maxime Brugidou
From my experience, if you can fit your data in a SQL database without
sharding or anything, don't ever think twice. Hive is not even comparable.

I would rather say that Hive is a nice SQL interface over Hadoop M/R rather
than a SQL replacement. If you are running a DWH in SQL and you don't
need to grow your data to at least a couple of TB, then keep SQL. The very
nice feature of Hadoop/Hive is that your DWH can grow (almost) horizontally
without much trouble by buying new servers, and most of your queries scale
with the number of servers too.

You have to know that doing a SELECT count(1) FROM t where t is ~1 GB can
take more time to start/stop the M/R job, which has huge overhead, than to
actually count. A simple wc -l takes about a second on any normal PC.
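
A rough way to see this overhead for yourself (a sketch; it assumes a ~1 GB
local file that also backs the external table t, and the path is illustrative):

# compare raw counting with the M/R job overhead
time wc -l /data/t.csv                   # about a second on an ordinary PC
time hive -e "SELECT count(1) FROM t"    # tens of seconds, mostly job start/stop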





LOAD DATA problem

2012-03-19 Thread Sean McNamara
Is there a way to prevent LOAD DATA LOCAL INPATH from appending _copy_1 to 
logs that already exist in a partition?  If the log is already in HDFS/Hive 
I'd rather it fail and give me a return code or output saying that the log 
already exists.

For example, if I run these queries:
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"

I end up with:
test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2

However, if I use OVERWRITE it will nuke all the data in the partition 
(including test_a.bz2) and I end up with just:
test_b.bz2

I recall that older versions of Hive would not do this.  How do I handle this 
case?  Is there a safe, atomic way to do it?
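
The only workaround I can think of is to test for the file first and fail the
load myself; a sketch, assuming the default warehouse layout for the logs
table (and note it is not atomic, since there is a race between the test and
the load):

# fail before loading if the log is already in the partition directory
P=/user/hive/warehouse/logs/ds=2012-03-19/hr=23
if hadoop fs -test -e $P/test_b.bz2; then
  echo "test_b.bz2 already loaded" >&2
  exit 1
fi
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE
logs PARTITION(ds='2012-03-19', hr='23')"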

Sean