hadoop fs -text cannot get .deflate file decompressed

2013-01-30 Thread Richard
I got some Hive-generated files with a .deflate extension. I know this is a
compressed format. It is not my data, so I cannot change the option to store it
uncompressed. I just want to view the file content. But when I use hadoop fs
-text, I cannot get plaintext output; the output is still binary. How can I fix
this problem? Thanks.
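
One workaround sketch, assuming the files were written by Hadoop's DefaultCodec
(a plain zlib stream) and using a made-up path, with Python 2 serving only as a
convenient zlib decompressor, is to stream the raw bytes out of HDFS and
inflate them on the client:

# pull the compressed bytes and inflate them locally (path is illustrative)
hadoop fs -cat /user/hive/warehouse/mytable/000000_0.deflate \
  | python -c 'import sys, zlib; sys.stdout.write(zlib.decompress(sys.stdin.read()))' \
  | head

If the files were written with a different codec, the decompression step would
have to change accordingly.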

Re: delay before query starts processing

2013-01-30 Thread Ariel Marcus
From the archives:
http://mail-archives.apache.org/mod_mbox/hive-user/201110.mbox/%3CCAC9SPjuQtxOK1KtEmReD6OanNTgNM_uLkGQD+=n7krcjcal...@mail.gmail.com%3E

TL;DR set hive.optimize.s3.query=true;
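
If it helps, that setting can also be passed straight on the command line
(the query file name here is illustrative):

hive -hiveconf hive.optimize.s3.query=true -f myquery.sql

or issued as set hive.optimize.s3.query=true; at the top of the script.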

-
Ariel Marcus, Consultant
www.openbi.com | ariel.mar...@openbi.com
150 N Michigan Avenue, Suite 2800, Chicago, IL 60601
Cell: 314-827-4356


On Wed, Jan 30, 2013 at 6:16 PM, Marc Limotte  wrote:

> Hi,
>
> I'm running in Amazon on an EMR cluster with hive 0.8.1.  We have a lot of
> other Hadoop jobs, but only started experimenting with Hive recently.
>
> I've been seeing a long pause after submitting a hive query and the
> actual start of the hadoop job... 10 minutes or more in some cases.  I'm
> wondering what's happening during this time.  Either a high level answer,
> or maybe there is some logging I can turn on?
>
> Here's some more detail.  I submit the query on the master using the hive
> cli, and start to see some output right away...
>
> Total MapReduce jobs = 2
> Launching Job 1 out of 2
> Number of reduce tasks not specified. Estimated from input data size: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
>
>
> *[then a long delay here: 10 minutes or more... no activity in the hadoop
> job tracker ui] *
>
>
> … and then it continues normally ...
> Starting Job = job_201301160029_0082, Tracking URL =
> http://ip-.ec2.internal:9100/jobdetails.jsp?jobid=job_201301160029_0082
> Kill Command = /home/hadoop/bin/hadoop job
>  -Dmapred.job.tracker=xx:9001 -kill job_201301160029_0082
> Hadoop job information for Stage-1: number of mappers: 2; number of
> reducers: 1
> 2013-01-30 20:45:30,526 Stage-1 map = 0%,  reduce = 0%
> …
>
> This query is processing in the neighborhood of 500GB of data from S3.  A
> couple of possibilities I thought of… perhaps someone can confirm or deny:
> a) Is the data copied from S3 to HDFS during this time?
> b) I have a fairly large set of libs in HIVE_AUX_JAR_PATH (around ~175
> MB)-- does it have to copy these around to the tasks at this time?
>
> Any insights appreciated.
>
> Marc
>
>
>
>


Re: delay before query starts processing

2013-01-30 Thread Abdelrahman Shettia
Hi Marc,

You can try running the hive client with debug mode on and see what it is
trying to do at the JT level:
hive -hiveconf hive.root.logger=ALL,console -e "<DDL>;"
hive -hiveconf hive.root.logger=ALL,console -f ddl.sql

Hope this helps.

Thanks
-Abdelrahman


On Wed, Jan 30, 2013 at 3:16 PM, Marc Limotte  wrote:

> Hi,
>
> I'm running in Amazon on an EMR cluster with hive 0.8.1.  We have a lot of
> other Hadoop jobs, but only started experimenting with Hive recently.
>
> I've been seeing a long pause after submitting a hive query and the
> actual start of the hadoop job... 10 minutes or more in some cases.  I'm
> wondering what's happening during this time.  Either a high level answer,
> or maybe there is some logging I can turn on?
>
> Here's some more detail.  I submit the query on the master using the hive
> cli, and start to see some output right away...
>
> Total MapReduce jobs = 2
> Launching Job 1 out of 2
> Number of reduce tasks not specified. Estimated from input data size: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
>
>
> *[then a long delay here: 10 minutes or more... no activity in the hadoop
> job tracker ui] *
>
>
> … and then it continues normally ...
> Starting Job = job_201301160029_0082, Tracking URL =
> http://ip-.ec2.internal:9100/jobdetails.jsp?jobid=job_201301160029_0082
> Kill Command = /home/hadoop/bin/hadoop job
>  -Dmapred.job.tracker=xx:9001 -kill job_201301160029_0082
> Hadoop job information for Stage-1: number of mappers: 2; number of
> reducers: 1
> 2013-01-30 20:45:30,526 Stage-1 map = 0%,  reduce = 0%
> …
>
> This query is processing in the neighborhood of 500GB of data from S3.  A
> couple of possibilities I thought of… perhaps someone can confirm or deny:
> a) Is the data copied from S3 to HDFS during this time?
> b) I have a fairly large set of libs in HIVE_AUX_JAR_PATH (around ~175
> MB)-- does it have to copy these around to the tasks at this time?
>
> Any insights appreciated.
>
> Marc
>
>
>
>


Re: ALTER TABLE CHANGE COLUMN issue

2013-01-30 Thread Nitin Pawar
It will not work for the old partitions because the old partitions' metadata
does not have this new column.

Your new metadata applies only to new partitions.

Always remember there is no such thing as update or alter row in Hive;
ALTER only changes the table metadata from that point onwards.

If you really want to check whether your old data has the new column, you can
do a select * from the table with a where condition that hits old data, limit 1,
and then compare the old and new table definitions.

Also, how do you recreate the partitions? Are you reading from the old table and
writing into a new table with the same data? Or do you have an external table,
so you are just registering the partitions with a metadata store like HCatalog?
In that case it is easy to recreate the table and register the partitions again
so that your new metadata applies to the old partitions.
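
As a rough sketch of that last approach, assuming a hypothetical external table
mytable partitioned by dt with data under /data/mytable (names, types and any
ROW FORMAT/SerDe clauses would have to match your real table):

hive -e "
DROP TABLE mytable;
CREATE EXTERNAL TABLE mytable (col1 array<struct<b:int, c:string>>)
  PARTITIONED BY (dt string)
  LOCATION '/data/mytable';
ALTER TABLE mytable ADD PARTITION (dt='2013-01-29')
  LOCATION '/data/mytable/dt=2013-01-29';
"

Dropping an external table only removes the metadata, so the files stay in
place and the re-added partitions pick up the new column definition.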



On Thu, Jan 31, 2013 at 1:02 AM, Mark Grover wrote:

> Hardik,
> The schema is associated per partition. It sounds to me that the structure
> of your data remains the same, you are just expressing it differently in
> your Hive table.
>
> If your table is external this is no biggie, just drop the external
> table, re-create it and re-add the partitions. If not, you'll have to look
> into the documentation to see if there is an alter table partition
> (partition spec)... command that will let you alter metadata about the
> partition.
>
> If I am not mistaken, alter table doesn't touch your existing columns,
> just modifies the partitions going forward.
>
> Mark
>
>
> On Wed, Jan 30, 2013 at 11:12 AM, hardik doshi wrote:
>
>> Thanks, Nitin & Dean.
>>
>> My hive table is backed by data files in hdfs and they do contain the
>> additional field that I am adding in my hive table schema.
>>
>> I noticed that if I remove partitions and recreate them after changing
>> the column type, it works. But it does not work
>> on old partition for some weird reasons.
>>
>> Any ideas?
>>
>> -Hardik.
>>
>>
>>--
>> *From:* Dean Wampler 
>> *To:* user@hive.apache.org
>> *Cc:* hardik doshi 
>> *Sent:* Wednesday, January 30, 2013 5:51 AM
>> *Subject:* Re: ALTER TABLE CHANGE COLUMN issue
>>
>> Right, the very important thing to remember about ALTER TABLE is that it
>> only changes metadata about your table. It doesn't modify the data in any
>> way. You have to do that yourself.
>>
>> On Wed, Jan 30, 2013 at 2:17 AM, Nitin Pawar wrote:
>>
>> after u did alter table, did you add any new data to table with new
>> schema?
>>
>> for the old data already present in data, if you add anything new in
>> columns it will be null value
>>
>>
>> On Wed, Jan 30, 2013 at 1:44 PM, hardik doshi wrote:
>>
>> Hi,
>>
>> I am running into an issue where ALTER TABLE CHANGE COLUMN does not seem
>> to be working.
>>
>> I have a table with a column data type looking like array<struct<..., b:int>> and I am trying to change it to array<struct<..., b:int, c:string>> based
>> on the underlying data schema change.
>>
>> The alter command succeeds and subsequent describe call shows me the
>> updated table structure. But when tried querying the table,
>> it returns null for the newly added field.
>>
>> This does not happen when a new table with updated column data type is
>> created.
>>
>> Is this a known bug?
>>
>> Thanks,
>> Hardik.
>>
>> PS:- My alter command: ALTER TABLE hardiktest CHANGE COLUMN col1 col2
>> array<struct<..., b:int, c:string>>.
>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>>
>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>>
>>
>


-- 
Nitin Pawar


Re: ALTER TABLE CHANGE COLUMN issue

2013-01-30 Thread Mark Grover
Hardik,
The schema is associated per partition. It sounds to me that the structure
of your data remains the same, you are just expressing it differently in
your Hive table.

If your table is external this is no biggie, just drop the external
table, re-create it and re-add the partitions. If not, you'll have to look
into the documentation to see if there is an alter table partition
(partition spec)... command that will let you alter metadata about the
partition.

If I am not mistaken, alter table doesn't touch your existing columns, just
modifies the partitions going forward.

Mark

On Wed, Jan 30, 2013 at 11:12 AM, hardik doshi wrote:

> Thanks, Nitin & Dean.
>
> My hive table is backed by data files in hdfs and they do contain the
> additional field that I am adding in my hive table schema.
>
> I noticed that if I remove partitions and recreate them after changing the
> column type, it works. But it does not work
> on old partition for some weird reasons.
>
> Any ideas?
>
> -Hardik.
>
>
>--
> *From:* Dean Wampler 
> *To:* user@hive.apache.org
> *Cc:* hardik doshi 
> *Sent:* Wednesday, January 30, 2013 5:51 AM
> *Subject:* Re: ALTER TABLE CHANGE COLUMN issue
>
> Right, the very important thing to remember about ALTER TABLE is that it
> only changes metadata about your table. It doesn't modify the data in any
> way. You have to do that yourself.
>
> On Wed, Jan 30, 2013 at 2:17 AM, Nitin Pawar wrote:
>
> after u did alter table, did you add any new data to table with new
> schema?
>
> for the old data already present in data, if you add anything new in
> columns it will be null value
>
>
> On Wed, Jan 30, 2013 at 1:44 PM, hardik doshi wrote:
>
> Hi,
>
> I am running into an issue where ALTER TABLE CHANGE COLUMN does not seem
> to be working.
>
> I have a table with a column data type looking like array<struct<..., b:int>> and I am trying to change it to array<struct<..., b:int, c:string>> based
> on the underlying data schema change.
>
> The alter command succeeds and subsequent describe call shows me the
> updated table structure. But when tried querying the table,
> it returns null for the newly added field.
>
> This does not happen when a new table with updated column data type is
> created.
>
> Is this a known bug?
>
> Thanks,
> Hardik.
>
> PS:- My alter command: ALTER TABLE hardiktest CHANGE COLUMN col1 col2
> array<struct<..., b:int, c:string>>.
>
>
>
>
> --
> Nitin Pawar
>
>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>
>
>


Re: ALTER TABLE CHANGE COLUMN issue

2013-01-30 Thread hardik doshi
Thanks, Nitin & Dean.

My hive table is backed by data files in hdfs and they do contain the 
additional field that I am adding in my hive table schema.

I noticed that if I remove partitions and recreate them after changing the
column type, it works. But it does not work
on the old partitions for some weird reason.

Any ideas?

-Hardik.





 From: Dean Wampler 
To: user@hive.apache.org 
Cc: hardik doshi  
Sent: Wednesday, January 30, 2013 5:51 AM
Subject: Re: ALTER TABLE CHANGE COLUMN issue
 

Right, the very important thing to remember about ALTER TABLE is that it only 
changes metadata about your table. It doesn't modify the data in any way. You 
have to do that yourself.


On Wed, Jan 30, 2013 at 2:17 AM, Nitin Pawar  wrote:

after u did alter table, did you add any new data to table with new schema? 
>
>
>for the old data already present in data, if you add anything new in columns 
>it will be null value 
>
>
>
>On Wed, Jan 30, 2013 at 1:44 PM, hardik doshi  wrote:
>
>Hi,
>>
>>I am running into an issue where ALTER TABLE CHANGE COLUMN does not seem to 
>>be working.
>>
>>I have a table with a column data type looking like array<struct<..., b:int>> and I am trying to change it to array<struct<..., b:int, c:string>> based
>>on the underlying data schema change.
>>
>>
>>
>>The alter command succeeds and subsequent describe call shows me the updated 
>>table structure. But when tried querying the table,
>>it returns null for the newly added field.
>>
>>
>>This does not happen when a new table with updated column data type is 
>>created.
>>
>>
>>Is this a known bug?
>>
>>
>>Thanks,
>>Hardik.
>>
>>
>>PS:- My alter command: ALTER TABLE hardiktest CHANGE COLUMN col1 col2 
>>array<struct<..., b:int, c:string>>.
>>
>
>
>
>-- 
>Nitin Pawar
>


-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330

Re: The dreaded Heap Space Issue on a Transform

2013-01-30 Thread John Omernik
I am realizing one of my challenges is that I have quite a few cores and
map tasks per node, but (I didn't set it up) I am only running 4 GB per
physical core (12) with 18 map slots.  I am guessing right now that at any
given time, with 18 map slots, the 1.8 GB total of RAM I am assigning to
the sort stuff is undersized, yet I am constrained on memory, so I can't
just up it. Working on getting things upgraded. Thanks for all; I appreciate
the thoughts.
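
(For context, that 1.8 GB figure is presumably 18 map slots x the default
io.sort.mb of 100 MB; each task's sort buffer lives inside that task's own
child JVM heap, so the 18 task heaps plus their sort buffers all have to fit
in the node's RAM at once.)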



On Wed, Jan 30, 2013 at 10:40 AM, Dean Wampler <
dean.wamp...@thinkbiganalytics.com> wrote:

> We didn't ask yet, but to be sure, are all the slave nodes configured the
> same, both in terms of hardware and other apps running, if any, running on
> them?
>
>
> On Wed, Jan 30, 2013 at 10:14 AM, Richard Nadeau wrote:
>
>> What do you have set in core-site.xml for io.sort.mb, io.sort.factor, and
>> io.file.buffer.size? You should be able to adjust these and get past the
>> heap issue. Be careful about how much RAM you have though, and don't set them
>> too high.
>>
>> Rick
>> On Jan 30, 2013 8:55 AM, "John Omernik"  wrote:
>>
>>> So it's filling up on the emitting stage, so I need to look at the task
>>> logs and or my script that's printing to stdout as the likely culprits I am
>>> guessing.
>>>
>>>
>>>
>>> On Wed, Jan 30, 2013 at 9:11 AM, Philip Tromans <
>>> philip.j.trom...@gmail.com> wrote:
>>>
 That particular OutOfMemoryError is happening on one of your hadoop
 nodes. It's the heap within the process forked by the hadoop tasktracker, I
 think.

 Phil.


 On 30 January 2013 14:28, John Omernik  wrote:

> So just a follow-up. I am less looking for specific troubleshooting on
> how to fix my problem, and more looking for a general understanding of 
> heap
> space usage with Hive.  When I get an error like this, is it heap space on
> a node, or heap space on my hive server?  Is it the heap space of the
> tasktracker? Heap of the job kicked off on the node?  Which heap is being
> affected? If it's not clear in my output, where can I better understand
> this? I am sorely out of my league here when it comes to understanding the
> JVM interactions of Hive and Hadoop, i.e. where hive is run, vs where task
> trackers are run etc.
>
> Thanks in advance!
>
>
>
> On Tue, Jan 29, 2013 at 7:43 AM, John Omernik wrote:
>
>> I am running a transform script that parses through a bunch of binary
>> data. In 99% of the cases it runs, it runs fine, but on certain files I 
>> get
>> a failure (as seen below).  Funny thing is, I can run a job with "only" 
>> the
>> problem source file, and it will work fine, but when as a group of 
>> files, I
>> get these warnings.  I guess what I am asking here is this: Where is the
>> heap error? Is this occurring on the nodes themselves or, since this is
>> where the script is emitting records (and potentially large ones at that)
>> and in this case my hive server running the job may be memory light, 
>> could
>> the issue actually be due to heap on the hive server itself?   My setup 
>> is
>> 1 Hive node (that is woefully underpowered, under memoried, and under 
>> disk
>> I/Oed) and 4 beefy hadoop nodes.  I guess, my question is the heap issue 
>> on
>> the sender or the receiver :)
>>
>>
>>
>>
>> 13-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.io.CodecPool:
>> Got brand-new compressor
>> 2013-01-29 08:20:24,107 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
>> 2013-01-29 08:20:24,410 INFO
>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 10 rows
>> 2013-01-29 08:20:24,410 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 10 rows
>> 2013-01-29 08:20:24,411 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 10 rows
>> 2013-01-29 08:20:24,411 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 10 rows
>> 2013-01-29 08:20:24,411 INFO
>> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 10 rows
>> 2013-01-29 08:20:24,411 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 10 rows
>> 2013-01-29 08:20:24,411 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 10 rows
>> 2013-01-29 08:20:24,412 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 10 rows
>> 2013-01-29 08:20:27,170 INFO
>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 100 rows
>> 2013-01-29 08:20:27,170 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 100 rows
>> 2013-01-29 08:20:27,170 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100 rows
>> 2013-01-29 08:20:27,171 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 100 rows
>> 2013-01-29 08:20:27,

Re: The dreaded Heap Space Issue on a Transform

2013-01-30 Thread Dean Wampler
We didn't ask yet, but to be sure, are all the slave nodes configured the
same, both in terms of hardware and other apps running, if any, running on
them?

On Wed, Jan 30, 2013 at 10:14 AM, Richard Nadeau wrote:

> What do you have set in core-site.xml for io.sort.mb, io.sort.factor, and
> io.file.buffer.size? You should be able to adjust these and get past the
> heap issue. Be careful about how much RAM you have though, and don't set them
> too high.
>
> Rick
> On Jan 30, 2013 8:55 AM, "John Omernik"  wrote:
>
>> So it's filling up on the emitting stage, so I need to look at the task
>> logs and or my script that's printing to stdout as the likely culprits I am
>> guessing.
>>
>>
>>
>> On Wed, Jan 30, 2013 at 9:11 AM, Philip Tromans <
>> philip.j.trom...@gmail.com> wrote:
>>
>>> That particular OutOfMemoryError is happening on one of your hadoop
>>> nodes. It's the heap within the process forked by the hadoop tasktracker, I
>>> think.
>>>
>>> Phil.
>>>
>>>
>>> On 30 January 2013 14:28, John Omernik  wrote:
>>>
 So just a follow-up. I am less looking for specific troubleshooting on
 how to fix my problem, and more looking for a general understanding of heap
 space usage with Hive.  When I get an error like this, is it heap space on
 a node, or heap space on my hive server?  Is it the heap space of the
 tasktracker? Heap of the job kicked off on the node?  Which heap is being
 affected? If it's not clear in my output, where can I better understand
 this? I am sorely out of my league here when it comes to understanding the
 JVM interactions of Hive and Hadoop, i.e. where hive is run, vs where task
 trackers are run etc.

 Thanks in advance!



 On Tue, Jan 29, 2013 at 7:43 AM, John Omernik  wrote:

> I am running a transform script that parses through a bunch of binary
> data. In 99% of the cases it runs, it runs fine, but on certain files I 
> get
> a failure (as seen below).  Funny thing is, I can run a job with "only" 
> the
> problem source file, and it will work fine, but when as a group of files, 
> I
> get these warnings.  I guess what I am asking here is this: Where is the
> heap error? Is this occurring on the nodes themselves or, since this is
> where the script is emitting records (and potentially large ones at that)
> and in this case my hive server running the job may be memory light, could
> the issue actually be due to heap on the hive server itself?   My setup is
> 1 Hive node (that is woefully underpowered, under memoried, and under disk
> I/Oed) and 4 beefy hadoop nodes.  I guess, my question is the heap issue 
> on
> the sender or the receiver :)
>
>
>
>
> 13-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got
> brand-new compressor
> 2013-01-29 08:20:24,107 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
> 2013-01-29 08:20:24,410 INFO
> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 10 rows
> 2013-01-29 08:20:24,410 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 10 rows
> 2013-01-29 08:20:24,411 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 10 rows
> 2013-01-29 08:20:24,411 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 10 rows
> 2013-01-29 08:20:24,411 INFO
> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 10 rows
> 2013-01-29 08:20:24,411 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 10 rows
> 2013-01-29 08:20:24,411 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 10 rows
> 2013-01-29 08:20:24,412 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 10 rows
> 2013-01-29 08:20:27,170 INFO
> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 100 rows
> 2013-01-29 08:20:27,170 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 100 rows
> 2013-01-29 08:20:27,170 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100 rows
> 2013-01-29 08:20:27,171 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 100 rows
> 2013-01-29 08:20:27,171 INFO
> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 100 rows
> 2013-01-29 08:20:27,171 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 100 rows
> 2013-01-29 08:20:27,171 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 100 rows
> 2013-01-29 08:20:27,171 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 100 rows
> 2013-01-29 08:21:16,247 INFO
> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1000 rows
> 2013-01-29 08:21:16,247 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1000 rows
> 2013-01-29 08:21:16,247 INFO
> org.apa

Re: The dreaded Heap Space Issue on a Transform

2013-01-30 Thread Richard Nadeau
What do you have set in core-site.xml for io.sort.mb, io.sort.factor, and
io.file.buffer.size? You should be able to adjust these and get past the
heap issue. Be careful about how much RAM you have though, and don't set them
too high.
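
If you just want to experiment per query rather than editing the cluster XML,
these are ordinary job properties, so a sketch like the following (values and
the script name are purely illustrative) passes them in for a single run:

hive -hiveconf io.sort.mb=100 \
     -hiveconf io.sort.factor=10 \
     -hiveconf io.file.buffer.size=65536 \
     -f transform_query.sql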

Rick
On Jan 30, 2013 8:55 AM, "John Omernik"  wrote:

> So it's filling up on the emitting stage, so I need to look at the task
> logs and or my script that's printing to stdout as the likely culprits I am
> guessing.
>
>
>
> On Wed, Jan 30, 2013 at 9:11 AM, Philip Tromans <
> philip.j.trom...@gmail.com> wrote:
>
>> That particular OutOfMemoryError is happening on one of your hadoop
>> nodes. It's the heap within the process forked by the hadoop tasktracker, I
>> think.
>>
>> Phil.
>>
>>
>> On 30 January 2013 14:28, John Omernik  wrote:
>>
>>> So just a follow-up. I am less looking for specific troubleshooting on
>>> how to fix my problem, and more looking for a general understanding of heap
>>> space usage with Hive.  When I get an error like this, is it heap space on
>>> a node, or heap space on my hive server?  Is it the heap space of the
>>> tasktracker? Heap of the job kicked off on the node?  Which heap is being
>>> affected? If it's not clear in my output, where can I better understand
>>> this? I am sorely out of my league here when it comes to understanding the
>>> JVM interactions of Hive and Hadoop, i.e. where hive is run, vs where task
>>> trackers are run etc.
>>>
>>> Thanks in advance!
>>>
>>>
>>>
>>> On Tue, Jan 29, 2013 at 7:43 AM, John Omernik  wrote:
>>>
 I am running a transform script that parses through a bunch of binary
 data. In 99% of the cases it runs, it runs fine, but on certain files I get
 a failure (as seen below).  Funny thing is, I can run a job with "only" the
 problem source file, and it will work fine, but when as a group of files, I
 get these warnings.  I guess what I am asking here is this: Where is the
 heap error? Is this occurring on the nodes themselves or, since this is
 where the script is emitting records (and potentially large ones at that)
 and in this case my hive server running the job may be memory light, could
 the issue actually be due to heap on the hive server itself?   My setup is
 1 Hive node (that is woefully underpowered, under memoried, and under disk
 I/Oed) and 4 beefy hadoop nodes.  I guess, my question is the heap issue on
 the sender or the receiver :)




 13-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got
 brand-new compressor
 2013-01-29 08:20:24,107 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
 2013-01-29 08:20:24,410 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 10 rows
 2013-01-29 08:20:24,410 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 10 rows
 2013-01-29 08:20:24,412 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 10 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 100 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 100 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 100 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1000 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1000 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1000 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1000 rows
 2013-01-29 08:21:16,248 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 1000 rows
 2013-01-29 08:21:16,248 INFO
 org.apache.hadoop.hive.ql.e

Re: The dreaded Heap Space Issue on a Transform

2013-01-30 Thread John Omernik
So it's filling up at the emitting stage, so I need to look at the task
logs and/or my script that's printing to stdout as the likely culprits, I am
guessing.



On Wed, Jan 30, 2013 at 9:11 AM, Philip Tromans
wrote:

> That particular OutOfMemoryError is happening on one of your hadoop nodes.
> It's the heap within the process forked by the hadoop tasktracker, I think.
>
> Phil.
>
>
> On 30 January 2013 14:28, John Omernik  wrote:
>
>> So just a follow-up. I am less looking for specific troubleshooting on
>> how to fix my problem, and more looking for a general understanding of heap
>> space usage with Hive.  When I get an error like this, is it heap space on
>> a node, or heap space on my hive server?  Is it the heap space of the
>> tasktracker? Heap of the job kicked off on the node?  Which heap is being
>> affected? If it's not clear in my output, where can I better understand
>> this? I am sorely out of my league here when it comes to understanding the
>> JVM interactions of Hive and Hadoop, i.e. where hive is run, vs where task
>> trackers are run etc.
>>
>> Thanks in advance!
>>
>>
>>
>> On Tue, Jan 29, 2013 at 7:43 AM, John Omernik  wrote:
>>
>>> I am running a transform script that parses through a bunch of binary
>>> data. In 99% of the cases it runs, it runs fine, but on certain files I get
>>> a failure (as seen below).  Funny thing is, I can run a job with "only" the
>>> problem source file, and it will work fine, but when as a group of files, I
>>> get these warnings.  I guess what I am asking here is this: Where is the
>>> heap error? Is this occurring on the nodes themselves or, since this is
>>> where the script is emitting records (and potentially large ones at that)
>>> and in this case my hive server running the job may be memory light, could
>>> the issue actually be due to heap on the hive server itself?   My setup is
>>> 1 Hive node (that is woefully underpowered, under memoried, and under disk
>>> I/Oed) and 4 beefy hadoop nodes.  I guess, my question is the heap issue on
>>> the sender or the receiver :)
>>>
>>>
>>>
>>>
>>> 13-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got
>>> brand-new compressor
>>> 2013-01-29 08:20:24,107 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
>>> 2013-01-29 08:20:24,410 INFO
>>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 10 rows
>>> 2013-01-29 08:20:24,410 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 10 rows
>>> 2013-01-29 08:20:24,411 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 10 rows
>>> 2013-01-29 08:20:24,411 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 10 rows
>>> 2013-01-29 08:20:24,411 INFO
>>> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 10 rows
>>> 2013-01-29 08:20:24,411 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 10 rows
>>> 2013-01-29 08:20:24,411 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 10 rows
>>> 2013-01-29 08:20:24,412 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 10 rows
>>> 2013-01-29 08:20:27,170 INFO
>>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 100 rows
>>> 2013-01-29 08:20:27,170 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 100 rows
>>> 2013-01-29 08:20:27,170 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100 rows
>>> 2013-01-29 08:20:27,171 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 100 rows
>>> 2013-01-29 08:20:27,171 INFO
>>> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 100 rows
>>> 2013-01-29 08:20:27,171 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 100 rows
>>> 2013-01-29 08:20:27,171 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 100 rows
>>> 2013-01-29 08:20:27,171 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 100 rows
>>> 2013-01-29 08:21:16,247 INFO
>>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1000 rows
>>> 2013-01-29 08:21:16,247 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1000 rows
>>> 2013-01-29 08:21:16,247 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1000 rows
>>> 2013-01-29 08:21:16,247 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1000 rows
>>> 2013-01-29 08:21:16,248 INFO
>>> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 1000 rows
>>> 2013-01-29 08:21:16,248 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 1000 rows
>>> 2013-01-29 08:21:16,248 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 1000 rows
>>> 2013-01-29 08:21:16,248 INFO
>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1000 rows
>>> 2013-01-29 08:25:47,532 INFO
>>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1 rows
>>> 2013-01-29 08:25:47,532 INFO
>>> org.apache.hadoop.h

Re: The dreaded Heap Space Issue on a Transform

2013-01-30 Thread Philip Tromans
That particular OutOfMemoryError is happening on one of your hadoop nodes.
It's the heap within the process forked by the hadoop tasktracker, I think.
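
If that is the case, the knob to look at is the heap given to each forked task
rather than anything on the Hive server. As a sketch (the value and script name
are illustrative, and slots x heap still has to fit in each node's RAM), it can
be raised for a single query:

hive -hiveconf mapred.child.java.opts=-Xmx1024m -f transform_query.sql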

Phil.


On 30 January 2013 14:28, John Omernik  wrote:

> So just a follow-up. I am less looking for specific troubleshooting on how
> to fix my problem, and more looking for a general understanding of heap
> space usage with Hive.  When I get an error like this, is it heap space on
> a node, or heap space on my hive server?  Is it the heap space of the
> tasktracker? Heap of the job kicked off on the node?  Which heap is being
> affected? If it's not clear in my output, where can I better understand
> this? I am sorely out of my league here when it comes to understanding the
> JVM interactions of Hive and Hadoop, i.e. where hive is run, vs where task
> trackers are run etc.
>
> Thanks in advance!
>
>
>
> On Tue, Jan 29, 2013 at 7:43 AM, John Omernik  wrote:
>
>> I am running a transform script that parses through a bunch of binary
>> data. In 99% of the cases it runs, it runs fine, but on certain files I get
>> a failure (as seen below).  Funny thing is, I can run a job with "only" the
>> problem source file, and it will work fine, but when as a group of files, I
>> get these warnings.  I guess what I am asking here is this: Where is the
>> heap error? Is this occurring on the nodes themselves or, since this is
>> where the script is emitting records (and potentially large ones at that)
>> and in this case my hive server running the job may be memory light, could
>> the issue actually be due to heap on the hive server itself?   My setup is
>> 1 Hive node (that is woefully underpowered, under memoried, and under disk
>> I/Oed) and 4 beefy hadoop nodes.  I guess, my question is the heap issue on
>> the sender or the receiver :)
>>
>>
>>
>>
>> 13-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got
>> brand-new compressor
>> 2013-01-29 08:20:24,107 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
>> 2013-01-29 08:20:24,410 INFO
>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 10 rows
>> 2013-01-29 08:20:24,410 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 10 rows
>> 2013-01-29 08:20:24,411 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 10 rows
>> 2013-01-29 08:20:24,411 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 10 rows
>> 2013-01-29 08:20:24,411 INFO
>> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 10 rows
>> 2013-01-29 08:20:24,411 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 10 rows
>> 2013-01-29 08:20:24,411 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 10 rows
>> 2013-01-29 08:20:24,412 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 10 rows
>> 2013-01-29 08:20:27,170 INFO
>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 100 rows
>> 2013-01-29 08:20:27,170 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 100 rows
>> 2013-01-29 08:20:27,170 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100 rows
>> 2013-01-29 08:20:27,171 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 100 rows
>> 2013-01-29 08:20:27,171 INFO
>> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 100 rows
>> 2013-01-29 08:20:27,171 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 100 rows
>> 2013-01-29 08:20:27,171 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 100 rows
>> 2013-01-29 08:20:27,171 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 100 rows
>> 2013-01-29 08:21:16,247 INFO
>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1000 rows
>> 2013-01-29 08:21:16,247 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1000 rows
>> 2013-01-29 08:21:16,247 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1000 rows
>> 2013-01-29 08:21:16,247 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1000 rows
>> 2013-01-29 08:21:16,248 INFO
>> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 1000 rows
>> 2013-01-29 08:21:16,248 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 1000 rows
>> 2013-01-29 08:21:16,248 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 1000 rows
>> 2013-01-29 08:21:16,248 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1000 rows
>> 2013-01-29 08:25:47,532 INFO
>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1 rows
>> 2013-01-29 08:25:47,532 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1 rows
>> 2013-01-29 08:25:47,532 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1 rows
>> 2013-01-29 08:25:47,532 INFO
>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1 rows
>>  2013-01-29 08:25:47,532 INFO
>> org.apache.hadoop

Re: The dreaded Heap Space Issue on a Transform

2013-01-30 Thread John Omernik
So just a follow-up. I am less looking for specific troubleshooting on how
to fix my problem, and more looking for a general understanding of heap
space usage with Hive.  When I get an error like this, is it heap space on
a node, or heap space on my hive server?  Is it the heap space of the
tasktracker? Heap of the job kicked off on the node?  Which heap is being
affected? If it's not clear in my output, where can I better understand
this? I am sorely out of my league here when it comes to understanding the
JVM interactions of Hive and Hadoop, i.e. where hive is run, vs where task
trackers are run etc.

Thanks in advance!



On Tue, Jan 29, 2013 at 7:43 AM, John Omernik  wrote:

> I am running a transform script that parses through a bunch of binary
> data. In 99% of the cases it runs, it runs fine, but on certain files I get
> a failure (as seen below).  Funny thing is, I can run a job with "only" the
> problem source file, and it will work fine, but when as a group of files, I
> get these warnings.  I guess what I am asking here is this: Where is the
> heap error? Is this occurring on the nodes themselves or, since this is
> where the script is emitting records (and potentially large ones at that)
> and in this case my hive server running the job may be memory light, could
> the issue actually be due to heap on the hive server itself?   My setup is
> 1 Hive node (that is woefully underpowered, under memoried, and under disk
> I/Oed) and 4 beefy hadoop nodes.  I guess, my question is the heap issue on
> the sender or the receiver :)
>
>
>
>
> 13-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got
> brand-new compressor
> 2013-01-29 08:20:24,107 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
> 2013-01-29 08:20:24,410 INFO
> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 10 rows
> 2013-01-29 08:20:24,410 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 10 rows
> 2013-01-29 08:20:24,411 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 10 rows
> 2013-01-29 08:20:24,411 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 10 rows
> 2013-01-29 08:20:24,411 INFO
> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 10 rows
> 2013-01-29 08:20:24,411 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 10 rows
> 2013-01-29 08:20:24,411 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 10 rows
> 2013-01-29 08:20:24,412 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 10 rows
> 2013-01-29 08:20:27,170 INFO
> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 100 rows
> 2013-01-29 08:20:27,170 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 100 rows
> 2013-01-29 08:20:27,170 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100 rows
> 2013-01-29 08:20:27,171 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 100 rows
> 2013-01-29 08:20:27,171 INFO
> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 100 rows
> 2013-01-29 08:20:27,171 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 100 rows
> 2013-01-29 08:20:27,171 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 100 rows
> 2013-01-29 08:20:27,171 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 100 rows
> 2013-01-29 08:21:16,247 INFO
> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1000 rows
> 2013-01-29 08:21:16,247 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1000 rows
> 2013-01-29 08:21:16,247 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1000 rows
> 2013-01-29 08:21:16,247 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1000 rows
> 2013-01-29 08:21:16,248 INFO
> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 1000 rows
> 2013-01-29 08:21:16,248 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 1000 rows
> 2013-01-29 08:21:16,248 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 1000 rows
> 2013-01-29 08:21:16,248 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1000 rows
> 2013-01-29 08:25:47,532 INFO
> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1 rows
> 2013-01-29 08:25:47,532 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1 rows
> 2013-01-29 08:25:47,532 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1 rows
> 2013-01-29 08:25:47,532 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1 rows
>  2013-01-29 08:25:47,532 INFO
> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 1 rows
> 2013-01-29 08:25:47,532 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 1 rows
> 2013-01-29 08:25:47,532 INFO
> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 1 rows
> 2013-01-29 08:25:47,532 INFO
> org.apache.hadoop

Re: Run hive queries, and collect job information

2013-01-30 Thread Mathieu Despriee
Fantastic.
Thanks !


2013/1/30 Qiang Wang 

> Every hive query has a history file, and you can get these info from hive
> history file
>
> Following java code can be an example:
>
> https://github.com/anjuke/hwi/blob/master/src/main/java/org/apache/hadoop/hive/hwi/util/QueryUtil.java
>
> Regard,
> Qiang
>
>
> 2013/1/30 Mathieu Despriee 
>
>> Hi folks,
>>
>> I would like to run a list of generated HIVE queries. For each, I would
>> like to retrieve the MR job_id (or ids, in case of multiple stages). And
>> then, with this job_id, collect statistics from job tracker (cumulative
>> CPU, read bytes...)
>>
>> How can I send HIVE queries from a bash or python script, and retrieve
>> the job_id(s) ?
>>
>> For the 2nd part (collecting stats for the job), we're using a MRv1
>> Hadoop cluster, so I don't have the AppMaster REST 
>> API.
>> I'm about to collect data from the jobtracker web UI. Any better idea ?
>>
>> Mathieu
>>
>>
>>
>


Re: ALTER TABLE CHANGE COLUMN issue

2013-01-30 Thread Dean Wampler
Right, the very important thing to remember about ALTER TABLE is that it
only changes metadata about your table. It doesn't modify the data in any
way. You have to do that yourself.

On Wed, Jan 30, 2013 at 2:17 AM, Nitin Pawar wrote:

> after u did alter table, did you add any new data to table with new
> schema?
>
> for the old data already present in data, if you add anything new in
> columns it will be null value
>
>
> On Wed, Jan 30, 2013 at 1:44 PM, hardik doshi wrote:
>
>> Hi,
>>
>> I am running into an issue where ALTER TABLE CHANGE COLUMN does not seem
>> to be working.
>>
>> I have a table with a column data type looking like array<struct<..., b:int>> and I am trying to change it to array<struct<..., b:int, c:string>> based
>> on the underlying data schema change.
>>
>> The alter command succeeds and subsequent describe call shows me the
>> updated table structure. But when tried querying the table,
>> it returns null for the newly added field.
>>
>> This does not happen when a new table with updated column data type is
>> created.
>>
>> Is this a known bug?
>>
>> Thanks,
>> Hardik.
>>
>> PS:- My alter command: ALTER TABLE hardiktest CHANGE COLUMN col1 col2
>> array<struct<..., b:int, c:string>>.
>>
>
>
>
> --
> Nitin Pawar
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Run hive queries, and collect job information

2013-01-30 Thread Nitin Pawar
For all the queries you run as a given user, hive stores the hive cli history
in that user's .hive_history file (please check the limits on how many queries
it stores).

For all the jobs the hive cli runs, it keeps the details in /tmp/user.name/

All these values are configurable in hive-site.xml.
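
A rough bash sketch of the whole flow (file names and paths are illustrative;
the job ids are scraped from the "Starting Job = job_..." lines the CLI prints
to its console output, as seen elsewhere in this digest):

# run one generated query, keeping the CLI's console chatter
hive -f generated_query.sql 2> hive_stderr.log

# pull out the MapReduce job id(s) it reported
grep -o 'job_[0-9]*_[0-9]*' hive_stderr.log | sort -u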


On Wed, Jan 30, 2013 at 3:55 PM, Qiang Wang  wrote:

> Every hive query has a history file, and you can get these info from hive
> history file
>
> Following java code can be an example:
>
> https://github.com/anjuke/hwi/blob/master/src/main/java/org/apache/hadoop/hive/hwi/util/QueryUtil.java
>
> Regard,
> Qiang
>
>
> 2013/1/30 Mathieu Despriee 
>
>> Hi folks,
>>
>> I would like to run a list of generated HIVE queries. For each, I would
>> like to retrieve the MR job_id (or ids, in case of multiple stages). And
>> then, with this job_id, collect statistics from job tracker (cumulative
>> CPU, read bytes...)
>>
>> How can I send HIVE queries from a bash or python script, and retrieve
>> the job_id(s) ?
>>
>> For the 2nd part (collecting stats for the job), we're using a MRv1
>> Hadoop cluster, so I don't have the AppMaster REST 
>> API.
>> I'm about to collect data from the jobtracker web UI. Any better idea ?
>>
>> Mathieu
>>
>>
>>
>


-- 
Nitin Pawar


Re: Run hive queries, and collect job information

2013-01-30 Thread Qiang Wang
Every hive query has a history file, and you can get these info from hive
history file

Following java code can be an example:
https://github.com/anjuke/hwi/blob/master/src/main/java/org/apache/hadoop/hive/hwi/util/QueryUtil.java

Regard,
Qiang


2013/1/30 Mathieu Despriee 

> Hi folks,
>
> I would like to run a list of generated HIVE queries. For each, I would
> like to retrieve the MR job_id (or ids, in case of multiple stages). And
> then, with this job_id, collect statistics from job tracker (cumulative
> CPU, read bytes...)
>
> How can I send HIVE queries from a bash or python script, and retrieve the
> job_id(s) ?
>
> For the 2nd part (collecting stats for the job), we're using a MRv1 Hadoop
> cluster, so I don't have the AppMaster REST 
> API.
> I'm about to collect data from the jobtracker web UI. Any better idea ?
>
> Mathieu
>
>
>


Run hive queries, and collect job information

2013-01-30 Thread Mathieu Despriee
Hi folks,

I would like to run a list of generated HIVE queries. For each, I would
like to retrieve the MR job_id (or ids, in case of multiple stages). And
then, with this job_id, collect statistics from job tracker (cumulative
CPU, read bytes...)

How can I send HIVE queries from a bash or python script, and retrieve the
job_id(s) ?

For the 2nd part (collecting stats for the job), we're using a MRv1 Hadoop
cluster, so I don't have the AppMaster REST
API.
I'm about to collect data from the jobtracker web UI. Any better idea ?

Mathieu


Re: ALTER TABLE CHANGE COLUMN issue

2013-01-30 Thread Nitin Pawar
After you did the alter table, did you add any new data to the table with the
new schema?

For the old data already present, any newly added columns will read back as
null.


On Wed, Jan 30, 2013 at 1:44 PM, hardik doshi  wrote:

> Hi,
>
> I am running into an issue where ALTER TABLE CHANGE COLUMN does not seem
> to be working.
>
> I have a table with a column data type looking like array<struct<..., b:int>> and I am trying to change it to array<struct<..., b:int, c:string>> based
> on the underlying data schema change.
>
> The alter command succeeds and subsequent describe call shows me the
> updated table structure. But when tried querying the table,
> it returns null for the newly added field.
>
> This does not happen when a new table with updated column data type is
> created.
>
> Is this a known bug?
>
> Thanks,
> Hardik.
>
> PS:- My alter command: ALTER TABLE hardiktest CHANGE COLUMN col1 col2
> array<struct<..., b:int, c:string>>.
>



-- 
Nitin Pawar


ALTER TABLE CHANGE COLUMN issue

2013-01-30 Thread hardik doshi
Hi,

I am running into an issue where ALTER TABLE CHANGE COLUMN does not seem to be 
working.

I have a table with a column data type looking like array<struct<..., b:int>>
and I am trying to change it to array<struct<..., b:int, c:string>> based
on the underlying data schema change.


The alter command succeeds and a subsequent describe call shows me the updated
table structure. But when I try querying the table,
it returns null for the newly added field.

This does not happen when a new table with updated column data type is created.

Is this a known bug?

Thanks,
Hardik.

PS:- My alter command: ALTER TABLE hardiktest CHANGE COLUMN col1 col2
array<struct<..., b:int, c:string>>.