Re: hiveserver usage

2011-12-11 Thread Aaron Sun
What does the data look like, and what's the size of the cluster?

2011/12/11 王锋 

> Hi,
>
> I'm an engineer at sina.com. We have used Hive and HiveServer for
> several months. We have our own task scheduling system, which can
> schedule tasks to run against HiveServer via JDBC.
>
> But HiveServer uses a very large amount of memory, usually more than
> 10 GB. We have 5-minute tasks that run every 5 minutes, as well as hourly
> tasks; the total number of tasks is 40. We start 3 HiveServer instances
> on one Linux server and connect to them in a round-robin cycle.
>
> So why is HiveServer's memory usage so large, and what can we do about
> it? Do you have any suggestions?
>
> Thanks and Best Regards!
>
> Royce Wang
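
A minimal sketch of one common mitigation, capping each HiveServer JVM's
heap, assuming the launcher scripts honor HADOOP_HEAPSIZE as in the stock
hive-env.sh template (the value is illustrative):

    # hive-env.sh: heap size, in MB, inherited by each HiveServer JVM
    export HADOOP_HEAPSIZE=4096

With three instances on one box this bounds the total at roughly 12 GB. A
periodic rolling restart of the instances is a common complementary
workaround, since memory retained across long-lived JDBC sessions is only
returned to the OS when the JVM exits.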


Re: Hive Reducers hanging - interesting problem - skew ?

2011-12-06 Thread Aaron Sun
Can you try "from B join A"?

One simple rule for joins in Hive is "largest table last": the smaller
tables listed first are buffered in memory (or shipped via the distributed
cache for a map join) for fast retrieval and comparison, while the last
table is streamed.
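
A minimal sketch of the reordering, using the query and table sizes from
the thread below (identifiers are the original poster's):

    -- B (~1 million rows) comes first so it is buffered; A (~15 million
    -- rows) comes last so it is streamed through the reducers
    select * from B join A on (B.b = A.a);

    -- alternatively, keep the original order and name the streamed table
    -- explicitly with a hint
    select /*+ STREAMTABLE(A) */ * from A join B on (A.a = B.b);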

Thanks
Aaron

On Tue, Dec 6, 2011 at 4:01 AM, john smith  wrote:

> Hi Mark,
>
> Thanks for your response. I tried the skew optimization and I also watched
> the video by Lin and Namit. From what I understand about skew join, instead
> of a single pass, they divide it into 2 stages.
>
> Stage 1
> Join the non-skewed keys, and write the rows for skewed keys into
> temporary files on HDFS.
>
> Stage 2
> Do a map join of those files by copying the smaller file into the mappers
> of the larger file.
>
> I have a doubt here: how can they be so sure that the map join works in
> stage 2? The files can be so large that they do not fit into memory,
> making the join impossible. Am I wrong?
>
> I also ran the query with skew optimization enabled and, as expected, none
> of the pairs got joined in stage 1 and all of them got written to HDFS.
> (They are huge.)
>
> Now in stage 2, Hive is trying to perform a map join on these large
> tables; my map phase in stage 2 was stuck at 0.13% after 6 hours, and 2 of
> my machines went down. I finally had to kill the job.
>
> The size of each table is just 2 GB, which is far smaller than what the
> Hadoop ecosystem can handle.
>
> So is there any way I can join these tables in Hive? Any thoughts?
>
>
> Thanks,
> jS
>
>
>
> On Tue, Dec 6, 2011 at 3:39 AM, Mark Grover  wrote:
>
>> jS,
>> Check out if this helps:
>>
>> http://search-hadoop.com/m/l1usr1MAHX32&subj=Re+Severely+hit+by+curse+of+last+reducer+
>>
>>
>>
>> Mark Grover, Business Intelligence Analyst
>> OANDA Corporation
>>
>> www: oanda.com www: fxtrade.com
>> e: mgro...@oanda.com
>>
>> "Best Trading Platform" - World Finance's Forex Awards 2009.
>> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>>
>>
>> - Original Message -
>> From: "john smith" 
>> To: user@hive.apache.org
>> Sent: Monday, December 5, 2011 4:38:14 PM
>> Subject: Hive Reducers hanging - interesting problem - skew ?
>>
>> Hi list,
>>
>> I am trying to run a Join query on my 10 node cluster. My query looks as
>> follows
>>
>> select * from A JOIN B on (A.a = B.b)
>>
>> size of A = 15 million rows
>> size of B = 1 million rows
>>
>> The problem is that A.a and B.b have only around 25-30 distinct values
>> per column, which means each join key matches a huge number of rows and
>> the reducers are bulky.
>>
>> However, the performance hit is so horrible that ALL my reducers hang at
>> 75% for 6 hours and don't move any further.
>>
>> The only thing the log shows during all this time is "Join operator -
>> forwarding rows ---" kind of messages. What does this mean?
>> There is no swapping happening, and the CPU usage is constantly around
>> 40% the whole time (observed through Ganglia).
>>
>> Any way I can solve this problem? Can anyone help me with this?
>>
>> Thanks,
>> jS
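
For reference, the skew-join behavior discussed above is governed by a pair
of Hive settings; a minimal sketch (the threshold shown is the documented
default, included here for illustration):

    -- defer any key with more than ~100,000 rows to the stage-2 map join
    set hive.optimize.skewjoin=true;
    set hive.skewjoin.key=100000;

This also explains jS's observation: with only 25-30 distinct keys across
the two tables, nearly every key exceeds the threshold, so stage 2
degenerates into map joins over files far too large to buffer.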


Re: Scheduling Hive Jobs (Oozie vs. Pentaho vs. something else)

2011-11-29 Thread Aaron Sun
Azkaban is worth looking at.

On Tue, Nov 29, 2011 at 4:27 PM, William Kornfeld wrote:

>  We are building an application that involves chains of M/R jobs, most
> likely all written in Hive.  We need to start a Hive job either when one
> or more prerequisite data sets appear (defined in the Hive sense as a new
> partition having been populated with data) OR when a particular time has
> been reached.
>
> We know of two scheduling packages that appear to solve this problem:
> Oozie & Pentaho (to which my company has a license).
>
> Does anyone have actual experience using either of these (or something
> else) to schedule Hive jobs?
>
> William Kornfeld
> Baynote
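
For the trigger described above (a dataset appearing or a particular time
arriving), an Oozie coordinator can express both in one app; a minimal
sketch, with every name, path, and date an illustrative assumption:

    <coordinator-app name="hive-chain" frequency="${coord:hours(1)}"
                     start="2011-12-01T00:00Z" end="2012-12-01T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
      <datasets>
        <!-- a new partition directory landing in HDFS marks the data ready -->
        <dataset name="events" frequency="${coord:hours(1)}"
                 initial-instance="2011-12-01T00:00Z" timezone="UTC">
          <uri-template>hdfs:///warehouse/events/dt=${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
        </dataset>
      </datasets>
      <input-events>
        <data-in name="input" dataset="events">
          <instance>${coord:current(0)}</instance>
        </data-in>
      </input-events>
      <action>
        <workflow>
          <!-- the workflow.xml at this path runs the Hive script -->
          <app-path>hdfs:///apps/hive-chain-wf</app-path>
        </workflow>
      </action>
    </coordinator-app>

Each materialized action fires once its input partition exists; dropping
the input-events block gives a purely time-driven trigger.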


Re: Lzo compression on Hive table

2011-07-07 Thread Aaron Sun
You can use this one:

STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
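
Putting that together with the table from the question below, a sketch of
the full DDL plus the write-side settings (LzopCodec rather than LzoCodec
is the usual choice when the .lzo files need to be indexable and splittable):

    CREATE EXTERNAL TABLE foo (
      columnA string,
      columnB string )
      PARTITIONED BY (date string)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
      STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
      OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
      LOCATION '/path/to/hive/tables/foo';

    -- for Hive queries that write into the table
    set hive.exec.compress.output=true;
    set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;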


On Thu, Jul 7, 2011 at 6:06 PM,  wrote:

> Hi there,
> I've got my Hadoop cluster all set up, writing out sequence files with
> LZO compression, using the following:
> mapred.output.compress=true
> mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
>
> How do I define my table so it will write out compressed data and be able
> to read in compressed data during my Hive queries?
>
> CREATE EXTERNAL TABLE foo (
>   columnA string,
>   columnB string )
>   PARTITIONED BY (date string)
>   ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
>   LOCATION '/path/to/hive/tables/foo';
>
> Thanks,
> Jon
>