Re: Amazon EMR Best Practices for Hive metastore

2012-03-06 Thread Sam Wilson
We also do #4. Initially we had lots of conversations about all the other 
options and whether we should do this or that... Ultimately we focused on just 
going live as quickly as possible and getting more involved in the setup later. 

Since then the only thing we've needed to do is hack a few of the baseline 
scripts used by EMR to launch Hive so that it uses more heap. We definitely 
have a few pain points around partition recovery, but those are things inherent 
to Hive and not EMR. 
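
In case it helps anyone, here is roughly the kind of tweak we mean; the path 
and heap size below are illustrative and vary by EMR and Hive version:

    # Run at cluster bring-up (e.g. as a bootstrap action). hive-env.sh is
    # sourced by the hive launcher script, and HADOOP_HEAPSIZE (in MB)
    # controls the heap of the client JVM that hive starts.
    echo 'export HADOOP_HEAPSIZE=4096' >> /home/hadoop/hive/conf/hive-env.sh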

I should note that we don't trust our EMR cluster to stick around, so we design 
for it to just die. You can't treat it like a regular Hadoop cluster. We made 
launching a new one an easy process and have decoupled Hive from the UX so that 
it's fully asynchronous. 

So far, big wins and no complaints. 

Sent from my iPhone

On Mar 6, 2012, at 10:02 PM, Jeff Sternberg  wrote:

> Mark,
> 
> We do 4), basically. We have a simple hive script that does all the "create 
> external table" statements, and we run that script as step 1 of the EMR jobs 
> we spin up. Then our "real" processing takes over in step 2 and beyond. We're 
> only working with about 50 tables, so it's pretty manageable. A side benefit 
> is that we can put this create-table script under source control to track our 
> schema changes over time.
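> 
> For illustration, a stripped-down sketch of such a step-1 script (the table 
> and bucket names here are made up):
> 
>     -- create_tables.q: run as step 1 of every EMR job flow
>     CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
>       view_time STRING,
>       user_id   STRING,
>       url       STRING
>     )
>     PARTITIONED BY (dt STRING)
>     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>     LOCATION 's3://example-bucket/warehouse/page_views/';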
> 
> Jeff Sternberg
> S&P Capital IQ
> www.spcapitaliq.com
> 
> -Original Message-
> From: Mark Grover [mailto:mgro...@oanda.com] 
> Sent: Tuesday, March 06, 2012 9:54 PM
> To: user@hive.apache.org
> Cc: Baiju Devani; Denys Berestyuk
> Subject: Amazon EMR Best Practices for Hive metastore
> 
> Hi all,
> I am trying to get an idea of what people do for setting up Hive metastore 
> when using Amazon EMR.
> 
> For those of you using Amazon EMR:
> 
> 1) Do you have a dedicated RDS instance external to your EMR Hive+Hadoop 
> cluster that you use as a persistent metastore for all your cluster 
> instantiations?
> 
> 2) Do you use the MySQL DB that comes pre-installed on the master node and 
> export its data (on cluster tear down) to something like S3 and import it 
> from S3 during cluster bring up?
> 
> 3) Do you use a local installation of Hive (instead of that on EMR) so that 
> you could make use of an in-house dedicated metastore while utilizing Hadoop 
> cluster on EMR? (i.e. local Hive + EMR Hadoop)
> 
> 4) Do you do something really simple and naive like scripting up all your 
> "create external table" commands and running them every time you bring up a 
> cluster?
> 
> Or, do you do something else not mentioned above? :-)
> 
> Thank you in advance for sharing!
> 
> Mark
> 
> Mark Grover, Business Intelligence Analyst OANDA Corporation 
> 
> www: oanda.com www: fxtrade.com 
> 
> "Best Trading Platform" - World Finance's Forex Awards 2009. 
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009. 
> 
> 


Re: rainstor

2012-01-25 Thread Sam Wilson
Google?

Sent from my iPhone

On Jan 25, 2012, at 7:34 PM, Dalia Sobhy  wrote:

> Does anyone have any idea about RainStor?
> 
> Open source? How to download? How to use? Performance?


Re: drop table -> java.lang.OutOfMemoryError: Java heap space

2012-01-05 Thread Sam Wilson
I recommend trying a daily partitioning scheme over an hourly one. We had a 
similar setup and ran into the same problem and ultimately found that daily 
works fine for us, even with larger file sizes.
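
To make the difference concrete, here is a hypothetical pair of layouts (table 
and bucket names made up):

    -- hourly: two partition columns, ~15,000 partitions over 20 months
    CREATE EXTERNAL TABLE events_hourly (msg STRING)
    PARTITIONED BY (dt STRING, hr STRING)
    LOCATION 's3://example-bucket/events_hourly/';

    -- daily: one partition per day, roughly 600 partitions over the same
    -- window, at the cost of files about 24x larger per partition
    CREATE EXTERNAL TABLE events_daily (msg STRING)
    PARTITIONED BY (dt STRING)
    LOCATION 's3://example-bucket/events_daily/';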

At the very least it is worth evaluating. 

Sent from my iPhone

On Jan 5, 2012, at 2:23 PM, Matt Vonkip  wrote:

> Shoot, I meant to reply to the group, not respond to Mark directly.  (Mark 
> replied offline to me; not sure the etiquette in pasting that response in 
> here as well!)
> 
> Hi Mark, thanks for the response!  I tried using the memory-intensive 
> bootstrap action and got a different error; however, I'm not sure whether it 
> represents progress in the right direction or a regression.  (I thought the 
> memory-intensive script was for memory-intensive map-reduce jobs, not table 
> DDL, so I am wondering if it made things even worse.)
> 
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> 
> As for the other suggestion, I agree that 15k partitions (and growing) is 
> unruly; but the files are not small!  Each is over one gigabyte and 
> represents one hour from the past twenty months.  I would imagine others must 
> have similar setups and have some way around my issue.  Also, since it worked 
> in the older hadoop/hive stack, I'm suspicious that there is some 
> configuration item I should be able to tweak.
> 
> In the meantime, I am tempted to drop the entire database and recreate from 
> scratch (since all tables are external anyway).  If no solution is found, we 
> will probably look into some kind of hybrid system where older data is 
> archived in other tables and a union is used in queries.
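> 
> Roughly what I have in mind for that hybrid, as a sketch only (table names 
> invented):
> 
>     CREATE VIEW all_events AS
>     SELECT * FROM (
>       SELECT * FROM recent_events
>       UNION ALL
>       SELECT * FROM archived_events
>     ) u;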
> 
> Sincerely,
> Matt
> 
> 


Re: Hive Metadata URI error

2011-12-11 Thread Sam Wilson
Try file:// in front of the property value...
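
Something like the below, reusing the value from your mail; this is only a 
guess at what the URI parser wants, not a claim that this is the right 
metastore setup:

    <property>
      <name>hive.metastore.uris</name>
      <value>file:///home/users/jtv/CDH3/hive/conf/metastore_db</value>
    </property>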

Sent from my iPhone

On Dec 12, 2011, at 12:07 AM, "Periya.Data"  wrote:

> Hi,
>    I am trying to create Hive tables on an EC2 instance. I get this strange 
> error about a URI scheme and log4j properties not found. I do not know how 
> to fix this. 
> 
> On EC2 instance : Ubuntu 10.04, Hive-0.7.1-cdh3u2. 
> 
> Initially I did not have an entry for the hive.metastore.uris property in my 
> hive-default.xml file, so I created one.  Still, I get the errors pasted 
> below. I was under the assumption that if we leave the uris value blank, it 
> will assume the local metastore. 
> 
> 
> <property>
>   <name>hive.metastore.local</name>
>   <value>true</value>
>   <description>controls whether to connect to a remote metastore server or 
>     open a new metastore server in the Hive Client JVM</description>
> </property>
> 
> <property>
>   <name>hive.metastore.uris</name>
>   <value>/home/users/jtv/CDH3/hive/conf/metastore_db</value>
> </property>
> 
> root@ip-10-114-18-63:/home/users/jtv# hive -f ./scripts/log25.q 
> hive-log4j.properties not found
> Hive history file=/tmp/root/hive_job_log_root_201112120332_1795396613.txt
> 11/12/12 03:32:03 INFO exec.HiveHistory: Hive history 
> file=/tmp/root/hive_job_log_root_201112120332_1795396613.txt
> 11/12/12 03:32:03 INFO parse.ParseDriver: Parsing command: CREATE TABLE 
> log25_tbl (OperationEvent STRING, HostIP STRING, StartTime STRING, SourceRepo 
> STRING, SourceFolder STRING, DestRepo STRING, DestFolder STRING, 
> EntityOrObject STRING, BytesSent STRING, TotalTimeInSecs STRING) COMMENT 
> 'This is the Log_25 Table'
> 11/12/12 03:32:04 INFO parse.ParseDriver: Parse Completed
> 11/12/12 03:32:04 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
> 11/12/12 03:32:04 INFO parse.SemanticAnalyzer: Creating table log25_tbl 
> position=13
> 11/12/12 03:32:04 INFO ql.Driver: Semantic Analysis Completed
> 11/12/12 03:32:04 INFO ql.Driver: Returning Hive schema: 
> Schema(fieldSchemas:null, properties:null)
> 11/12/12 03:32:04 INFO ql.Driver: Starting command: CREATE TABLE log25_tbl 
> (OperationEvent STRING, HostIP STRING, StartTime STRING, SourceRepo STRING, 
> SourceFolder STRING, DestRepo STRING, DestFolder STRING, EntityOrObject 
> STRING, BytesSent STRING, TotalTimeInSecs STRING) COMMENT 'This is the Log_25 
> Table'
> 11/12/12 03:32:04 INFO exec.DDLTask: Default to LazySimpleSerDe for table 
> log25_tbl
> 11/12/12 03:32:04 INFO hive.log: DDL: struct log25_tbl { string 
> operationevent, string hostip, string starttime, string sourcerepo, string 
> sourcefolder, string destrepo, string destfolder, string entityorobject, 
> string  bytessent, string totaltimeinsecs}
> FAILED: Error in metadata: java.lang.IllegalArgumentException: URI:  does not 
> have a scheme
> 11/12/12 03:32:04 ERROR exec.DDLTask: FAILED: Error in metadata: 
> java.lang.IllegalArgumentException: URI:  does not have a scheme
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException: URI:  does not have a scheme
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:476)
> at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3176)
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:213)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
> at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
> at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:310)
> at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:317)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:490)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Caused by: java.lang.IllegalArgumentException: URI:  does not have a scheme
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:127)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:1868)
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:1878)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:470)
> ... 17 more
> 
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask
> 11/12/12 03:32:04 ERROR ql.Driver: FAILED: Execution Error, return code 1 
> from org.apache.hadoop.hive.ql.exec.DDLTask
> root@ip-10-114-18-63:/home/users/jtv# 

Re: Building out Hive in EC2/S3 versus dedicated servers

2011-11-22 Thread Sam Wilson
We recently adopted Hadoop and Hive for doing some significant data processing. 
We went the Amazon route.

My own $.02 is as follows:

If you are already incredibly experienced with Hadoop and Hive and have someone 
on staff who has previously built a cluster at least as big as the one you are 
projecting to require, then simply do some back of the envelope calculations 
and decide if it is cost effective to run on your own system given all your 
other business constraints. If you don't know how to do this, then you aren't 
sufficiently experienced to go this route.

If you are new to Hadoop and Hive, then your best bet is to build your 
application first, using EMR as a prototype cluster. If your data is already 
loaded into S3 or you are already using Amazon, then this is also a no brainer 
way to get started. Hadoop and Hive are not what I would call user friendly. 
Frankly, they are full of bugs and gotchas, and are poorly documented. The 
learning curve is a bit steep. The most important thing is to prove out your 
functionality and build a system that delivers value quickly. You don't want 
your deadline to pass with only a pretty rack of servers to show for it. You 
need functionality.

EMR lets you focus on your application, your code, and your requirements 
without having to deal with the details of the infrastructure. I simply cannot 
stress enough how nice it has been for us to be able to spin up new clusters 
on the fly while developing our application. Our ability to rapidly prototype 
has simply blown me away.

Once you've got yourself up and running, your application is doing what it's 
supposed to, and you've built some familiarity with Hadoop and Hive, my 
suggestion is to then build a prototype cluster either hosted or in your 
office. Familiarize yourself with all the network, OS and other low-level 
details. Do some analysis on cost/performance, then decide whether or not to 
move your production system from Amazon to somewhere else.

Everyone's application is unique, so looking at someone else's calculations is 
largely pointless.

How did this pan out in our experience? We rebuilt a major system component in 
3 months, reducing query times for certain jobs from 16+ days to 4 minutes. We 
did not purchase a single piece of hardware or install a single piece of 
software we did not write ourselves. We have the ability to rapidly redeploy 
our system in any of 5 different data centers around the world at the flip of a 
few switches. If we wanted to deploy on our own hardware or in a colo at this 
point, we would only have to focus on building the cluster.

Our app is already built, serving our customers and making us money.

YMMV.


On Nov 22, 2011, at 3:15 PM, Loren Siebert wrote:

> My colleague has a Heroku-based startup and they are just getting started 
> with Hadoop and Hive. They’re evaluating running Hive in EC2/S3 versus buying 
> a handful of boxes and installing CDH.
> 
> One nice (albeit dated) analysis on this question is here, but I’m curious if 
> anyone here has a different take on it:
>   
> http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/
> 
> What is the sweet spot for when a Hive warehouse in EC2 makes the most sense?
> 
> I’m asking on this Hive list versus the more general Hadoop lists because I 
> think a solution for a Hive cluster could differ quite a bit from a solution 
> for a HBase cluster.
> 
> - Loren



Re: Asynchronous query execution

2011-11-15 Thread Sam Wilson
If you go this route, you may want to use nohup. This way your processes will 
continue running even if you lose connection to your terminal session.

Other options:

1) You can write your queries to a DB/Queue and have a process running on the 
Hive server that reads from the DB/queue and runs them locally against Hive.

2) You could use SSH and nohup to run your queries.
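
For option 2, a minimal sketch (host and file names hypothetical):

    # keeps running after the terminal session drops; output lands in query1.out
    ssh hiveserver 'nohup hive -f query1.q > query1.out 2>&1 &'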

On Nov 15, 2011, at 2:15 PM, Mapred Learn wrote:

> You could write your query to a file and do something like:
>  
> hive -f <query_file_1> &
> hive -f <query_file_2> &
>  
> etc. to invoke many instances in parallel.
> 
> On Tue, Nov 15, 2011 at 3:24 AM, Chinna Rao Lalam  
> wrote:
> Hi,
> 
> 
> Hive calls are blocking calls: once the query is executed, it returns a 
> ResultSet, and from that ResultSet you get the results.
> 
> The "hive.exec.parallel" property helps speed up query execution when a 
> query generates more than one independent task. If this property is true, 
> independent tasks are executed in parallel; otherwise they run sequentially.
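> 
> For example, to turn it on for a session:
> 
>     SET hive.exec.parallel=true;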
> 
>  
> Thanks&Regards,
> 
> Chinna Rao Lalam 
> 
>  
> From: Ghousia [ghousia.ath...@gmail.com]
> Sent: Tuesday, November 15, 2011 6:12 PM
> To: user@hive.apache.org
> Subject: Asynchronous query execution
> 
> Hi,
>  
> Hive queries take a long time to execute, and by default execution is a 
> blocking call. Is there any way provided by the Hive client to support 
> non-blocking execution?
>  
> Also, to execute jobs in parallel, I tried setting "hive.exec.parallel" to 
> true in hive-site.xml. But this did not work. Looking at the code, it looks 
> like the same flow is followed for both serial and parallel execution.
>  
> Any inputs would be of great help.
>  
> Thanks,
> Ghousia.