Re: HDFS Block location verification

2013-02-05 Thread Samir Ahmic
Hi,
You may try with:
hadoop fsck -locations -blocks -files [hdfs_path]. It will print detailed
info about blocks and their locations.
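
On newer releases, where the hadoop launcher prints a deprecation warning for
fsck, the same check can be run through the hdfs launcher; a minimal sketch,
using the example path from later in this thread:

hdfs fsck /user/tech/pkg.tar.gz -files -blocks -locations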

On Tue, Feb 5, 2013 at 4:00 PM, Dhanasekaran Anbalagan
bugcy...@gmail.com wrote:

 Hi Guys,

 I have configured HDFS with a replication factor of 3. We have 1 TB of data.
 How can I verify that a particular block is available on 3 machines?

 How can I find out whether the same block of data is available on 3 machines?

 Please guide me on how to check that my data is available on three different
 nodes.

 -Dhanasekaran.
 Did I learn something today? If not, I wasted it.



HyperThreading in TaskTracker nodes?

2013-02-05 Thread Terry Healy
I would like to get some opinions / recommendations about the pros and
cons of enabling HyperThreading on TaskTracker nodes. Presumably memory
could be an issue, but is there anything to be gained, perhaps because
of I/O wait? My small cluster is made of relatively slow and old
systems, which mostly are quite slow to/from disk, if that matters.

Thanks,

Terry


Re: HDFS Block location verification

2013-02-05 Thread Dhanasekaran Anbalagan
Hi Samir,

Thanks so much.

This is exactly what I wanted.

tech@dvcliftonhera150:~$ hadoop fsck -locations -blocks -files
/user/tech/pkg.tar.gz
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Connecting to namenode via http://dvcliftonhera122:50070
FSCK started by tech (auth:SIMPLE) from /172.16.30.150 for path
/user/tech/pkg.tar.gz at Tue Feb 05 10:33:23 EST 2013
/user/tech/pkg.tar.gz 165 bytes, 1 block(s):  OK
0.
BP-1936777173-172.16.30.122-1343141974879:blk_8828079455224016541_10294868
len=165 repl=3 [172.16.30.144:50010, 172.16.30.135:50010,
172.16.30.134:50010]

Status: HEALTHY
 Total size:165 B
 Total dirs:0
 Total files:1
 Total blocks (validated):1 (avg. block size 165 B)
 Minimally replicated blocks:1 (100.0 %)
 Over-replicated blocks:0 (0.0 %)
 Under-replicated blocks:0 (0.0 %)
 Mis-replicated blocks:0 (0.0 %)
 Default replication factor:3
 Average block replication:3.0
 Corrupt blocks:0
 Missing replicas:0 (0.0 %)
 Number of data-nodes:47
 Number of racks:1
FSCK ended at Tue Feb 05 10:33:23 EST 2013 in 3 milliseconds


The filesystem under path '/user/tech/pkg.tar.gz' is HEALTHY





Did I learn something today? If not, I wasted it.


On Tue, Feb 5, 2013 at 10:18 AM, Samir Ahmic ahmic.sa...@gmail.com wrote:

 Hi,
 You may try with:
 hadoop fsck -locations -blocks -files [hdfs_path]. It will print detailed
 info about blocks and their locations.


 On Tue, Feb 5, 2013 at 4:00 PM, Dhanasekaran Anbalagan bugcy...@gmail.com
  wrote:

 Hi Guys,

 I have configured HDFS with a replication factor of 3. We have 1 TB of data.
 How can I verify that a particular block is available on 3 machines?

 How can I find out whether the same block of data is available on 3 machines?

 Please guide me on how to check that my data is available on three different
 nodes.

 -Dhanasekaran.
 Did I learn something today? If not, I wasted it.





Job History files in Hadoop 2.0

2013-02-05 Thread sangroya
Hi,

I recently migrated to Hadoop 2.0 from Hadoop 1.0 (0.20.2 before).

I am able to successfully launch example applications.

Could anyone please suggest where the MapReduce job history files are
available after running jobs in Hadoop 2.0?

I need the statistics after running the jobs. Of course, the web UI gives me
the information, but I need the history files (job_ID etc.) that were
available in previous versions of Hadoop.

I can see a directory with application and container entries, but it does not
have specific information such as job submit time, map start time, finish
time, reduce start time, finish time, job finish time, etc.

In previous versions of Hadoop (1.0 or 0.20), this was stored under
logs/history.

Can anyone tell me whether the way job history files are stored has also
changed in the new architecture?
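
A hedged sketch of where to look, assuming a default YARN/MRv2 setup with the
JobHistory Server running (the property names come from mapred-default; the
actual paths depend on mapreduce.jobhistory.done-dir and your staging
directory):

# Completed-job history (.jhist) files land in HDFS under the JobHistory
# Server's "done" directory, organised by date.
hdfs dfs -ls -R /tmp/hadoop-yarn/staging/history/done | grep jhist

# Print the per-job statistics (submit/launch/finish times, task timings)
# from one of those files; on some versions this option takes the job
# output directory instead of the .jhist file. The path is a placeholder.
mapred job -history /path/to/job_XXXX.jhist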


Thanks in advance,
Amit



-
Sangroya


Re: Application of Cloudera Hadoop for Dataset analysis

2013-02-05 Thread Suresh Srinivas
Please take this thread to CDH mailing list.


On Tue, Feb 5, 2013 at 2:43 AM, Sharath Chandra Guntuku 
sharathchandr...@gmail.com wrote:

 Hi,

 I am Sharath Chandra, an undergraduate student at BITS-Pilani, India. I
 would like to get the following clarifications regarding the Cloudera Hadoop
 distribution. I am using a CDH4 Demo VM for now.

 1. After I upload the files into the file browser, if I have to link
 two-three datasets using a key in those files, what should I do? Do I have
 to run a query over them?

 2. My objective is that I have some data collected over a few years and
 now, I would like to link all of them, as in a database using keys and then
 run queries over them to find out particular patterns. Later I would like
 to implement some Machine learning algorithms on them for predictive
 analysis. Will this be possible on the demo VM?

 I am totally new to this. Can I get some help on this? I would be very
 grateful for the same.


 --
 Thanks and Regards,
 *Sharath Chandra Guntuku*
 Undergraduate Student (Final Year)
 *Computer Science Department*
 *Email*: f2009...@hyderabad.bits-pilani.ac.in

 *BITS-Pilani*, Hyderabad Campus
 Jawahar Nagar, Shameerpet, RR Dist,
 Hyderabad - 500078, Andhra Pradesh




-- 
http://hortonworks.com/download/


Re: replication factor

2013-02-05 Thread Nicolas Liochon
I would recommend this: http://www.aosabook.org/en/hdfs.html

Nicolas
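
As a concrete, hedged illustration of the per-file knob touched on in the
question below (the path and the factor of 10 are only examples): a small
file that many tasks read, such as a shared model or config file, can be
given a higher replication factor so more reads hit a local or nearby
replica.

# Raise replication for one widely-read file (-w waits for the change to
# finish), then confirm the replica locations with fsck.
hadoop fs -setrep -w 10 /shared/models/model.bin
hadoop fsck /shared/models/model.bin -files -blocks -locations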


On Tue, Feb 5, 2013 at 6:28 PM, Lin Ma lin...@gmail.com wrote:

 Hello guys,

 I want to learn a bit more about when we need to change (increase/decrease)
 the replication factor for better performance, and also a bit more about the
 internals of how replication works and the pros/cons of larger/smaller
 replication factors. For example, when deploying a static model/config file
 for Hadoop jobs, is a larger replication factor better? Unfortunately, I
 cannot find related material by searching. I would appreciate it if anyone
 could point me to some good documents.

 thanks in advance,
 Lin



Re: Application of Cloudera Hadoop for Dataset analysis

2013-02-05 Thread Richard Pickens
You can use the Hortonworks Data Platform, which already integrates HDFS,
MapReduce and Hive well.
http://hortonworks.com/products/hortonworksdataplatform/

I came across this new solution recently; they claim to be a Hadoop-based
standard SQL solution for data analytics.
http://queryio.com/hadoop-big-data-product/hadoop-hive.html

I have not given it a try yet, but you can explore it.

-Richard

 On Tue, Feb 5, 2013 at 10:07 AM, Preethi Vinayak Ponangi
 vinayakpona...@gmail.com wrote:

 From: Preethi Vinayak Ponangi vinayakpona...@gmail.com
 Subject: Re: Application of Cloudera Hadoop for Dataset analysis
 Date: February 5, 2013 8:07:47 AM PST
 To: user@hadoop.apache.org
 Reply-To: user@hadoop.apache.org

 It depends on which part of the Hadoop ecosystem you would like
 to use.

 You can do it in several ways:

 1) You could write a basic map reduce job to do joins.
 This link could help, or a basic search on Google will give you
 several more links.

 http://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/

 2) You could use a higher-level language like Pig to do these joins with
 simple Pig scripts.
 http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html

 3) The simplest of all: you could write SQL-like queries to do this join
 using Hive.
 http://hive.apache.org/
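
 A hedged sketch of option 3 from the shell (the table and column names are
 made up, and assume the datasets were already loaded as Hive tables that
 share a key column):

 # Join two hypothetical tables on a shared key and count matches per key.
 hive -e "
 SELECT a.user_id, COUNT(*) AS n
 FROM clicks a JOIN purchases b ON (a.user_id = b.user_id)
 GROUP BY a.user_id;
 "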

 Hope this helps.

 Regards,
 Vinayak.


 On Tue, Feb 5, 2013 at 10:00 AM, Suresh Srinivas 
  sur...@hortonworks.com wrote:

 Please take this thread to CDH mailing list.


 On Tue, Feb 5, 2013 at 2:43 AM, Sharath Chandra Guntuku 
 sharathchandr...@gmail.com wrote:

 Hi,

 I am Sharath Chandra, an undergraduate student at BITS-Pilani, India. I
 would like to get the following clarifications regarding cloudera hadoop
 distribution. I am using a CDH4 Demo VM for now.

 1. After I upload the files into the file browser, if I have to link
 two-three datasets using a key in those files, what should I do? Do I have
 to run a query over them?

 2. My objective is that I have some data collected over a few years and
 now, I would like to link all of them, as in a database using keys and then
 run queries over them to find out particular patterns. Later I would like
 to implement some Machine learning algorithms on them for predictive
 analysis. Will this be possible on the demo VM?

 I am totally new to this. Can I get some help on this? I would be very
 grateful for the same.


 --
 Thanks and Regards,
 *Sharath Chandra Guntuku*
 Undergraduate Student (Final Year)
 *Computer Science Department*
 *Email*: f2009...@hyderabad.bits-pilani.ac.in

 *BITS-Pilani*, Hyderabad Campus
 Jawahar Nagar, Shameerpet, RR Dist,
 Hyderabad - 500078, Andhra Pradesh




 --
 http://hortonworks.com/download/






RE: HyperThreading in TaskTracker nodes?

2013-02-05 Thread Brad Sarsfield
Hate to say it, but HyperThreading can have either positive or negative
performance characteristics.  It all depends on your workload.  You have to
measure very carefully; it may not even be a bottleneck(!) :)

I hit a pretty significant power issue when I enabled HyperThreading at
multi-thousand node scale.  We hit a ~8-10% power utilization increase, which,
if rolled out to the entire cluster, would put me a few percent over our max
spec power. In this case, for our workload, we actually saw a 15% increase in
processing throughput / job latency.   We ended up literally turning off
machines and enabling HyperThreading on the remaining ones, and saw an overall
~10% efficiency gain in the cluster, with a few fewer machines, but running
hot on power.

~Brad

-Original Message-
From: Terry Healy [mailto:the...@bnl.gov] 
Sent: Tuesday, February 5, 2013 7:20 AM
To: user@hadoop.apache.org
Subject: HyperThreading in TaskTracker nodes?

I would like to get some opinions / recommendations about the pros and cons of 
enabling HyperThreading on TaskTracker nodes. Presumably memory could be an 
issue, but is there anything to be gained, perhaps because of I/O wait? My 
small cluster is made of relatively slow and old systems, which mostly are 
quite slow to/from disk, if that matters.

Thanks,

Terry





Re: HyperThreading in TaskTracker nodes?

2013-02-05 Thread Todd Lipcon
Power issues aside, I've seen similar sorts of performance gains for MR
workloads - around 15-20%.

I think a fair bit of it is due to poor CPU cache utilization in various
parts of Hadoop - hyperthreading gets some extra parallelism there while
the core is waiting on round trips to DRAM.

-Todd

On Tue, Feb 5, 2013 at 10:03 AM, Brad Sarsfield b...@bing.com wrote:

 Hate to say it, but HyperThreading can have either positive or negative
 performance characteristics.  It all depends on your workload.  You have to
 measure very carefully; it may not even be a bottleneck(!) :)

 I hit a pretty significant power issue when I enabled HyperThreading at
 multi-thousand node scale.  We hit a ~8-10% power utilization increase,
 which, if rolled out to the entire cluster, would put me a few percent over
 our max spec power. In this case, for our workload, we actually saw a 15%
 increase in processing throughput / job latency.   We ended up literally
 turning off machines and enabling HyperThreading on the remaining ones, and
 saw an overall ~10% efficiency gain in the cluster, with a few fewer
 machines, but running hot on power.

 ~Brad

 -Original Message-
 From: Terry Healy [mailto:the...@bnl.gov]
 Sent: Tuesday, February 5, 2013 7:20 AM
 To: user@hadoop.apache.org
 Subject: HyperThreading in TaskTracker nodes?

 I would like to get some opinions / recommendations about the pros and
 cons of enabling HyperThreading on TaskTracker nodes. Presumably memory
 could be an issue, but is there anything to be gained, perhaps because of
 I/O wait? My small cluster is made of relatively slow and old systems,
 which mostly are quite slow to/from disk, if that matters.

 Thanks,

 Terry






-- 
Todd Lipcon
Software Engineer, Cloudera


[Hadoop-Help]About Map-Reduce implementation

2013-02-05 Thread Mayur Patil
Hello,

I am new to Hadoop. I am doing a project in the cloud in which I have to use
Hadoop for MapReduce. I am going to collect logs from 2-3 machines in
different locations. The logs are also in different formats, such as .rtf,
.log and .txt. Later, I have to convert them to one format and collect them
in one location.

So I am asking: which module of Hadoop do I need to study for this
implementation? Or do I need to study the whole framework?

Seeking guidance,

Thank you !!
-- 
*Cheers,*
*Mayur.*


Re: [Hadoop-Help]About Map-Reduce implementation

2013-02-05 Thread Nitin Pawar
Hey Mayur,

If you are collecting logs from multiple servers, then you can use Flume for
that.

If the logs are in different formats, you can just use the text file input
format to read them and write them into whatever format you want for
processing in later parts of your project.

The first thing you need to learn is how to set up Hadoop. Then you can try
writing sample Hadoop MapReduce jobs that read from a text file, process it,
and write the results to another file. Then you can integrate Flume as your
log collection mechanism. Once you get hold of the system, you can decide
which paths to follow based on your requirements for storage, compute time,
compute capacity, compression, etc.
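
A minimal sketch of the Flume piece, assuming Flume NG on an edge node that
watches a spool directory of finished log files and ships them to HDFS (the
agent name, directories and NameNode address are hypothetical):

# flume.conf -- one spooling-directory source, one in-memory channel, one HDFS sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type     = spooldir
a1.sources.r1.spoolDir = /var/log/incoming
a1.sources.r1.channels = c1

a1.channels.c1.type     = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type          = hdfs
a1.sinks.k1.channel       = c1
a1.sinks.k1.hdfs.path     = hdfs://namenode:8020/logs/raw
a1.sinks.k1.hdfs.fileType = DataStream

# start the agent
flume-ng agent --conf /etc/flume-ng/conf --conf-file flume.conf --name a1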


On Wed, Feb 6, 2013 at 3:01 AM, Mayur Patil ram.nath241...@gmail.com wrote:

 Hello,

 I am new to Hadoop. I am doing a project in the cloud in which I have to use
 Hadoop for MapReduce. I am going to collect logs from 2-3 machines in
 different locations. The logs are also in different formats, such as .rtf,
 .log and .txt. Later, I have to convert them to one format and collect them
 in one location.

 So I am asking: which module of Hadoop do I need to study for this
 implementation? Or do I need to study the whole framework?

 Seeking guidance,

 Thank you !!
 --
 *Cheers,*
 *Mayur.*




-- 
Nitin Pawar


Re: [Hadoop-Help]About Map-Reduce implementation

2013-02-05 Thread Jagat Singh
Hi,

Please read the basics of how Hadoop works.

Then get hands-on with MapReduce coding.

The tool that was made for your use case is Flume, but don't look at the tool
until you complete the above two steps.

Good luck, and keep us posted.

Regards,

Jagat Singh

---
Sent from Mobile , short and crisp.
On 06-Feb-2013 8:32 AM, Mayur Patil ram.nath241...@gmail.com wrote:

 Hello,

 I am new to Hadoop. I am doing a project in the cloud in which I have to use
 Hadoop for MapReduce. I am going to collect logs from 2-3 machines in
 different locations. The logs are also in different formats, such as .rtf,
 .log and .txt. Later, I have to convert them to one format and collect them
 in one location.

 So I am asking: which module of Hadoop do I need to study for this
 implementation? Or do I need to study the whole framework?

 Seeking guidance,

 Thank you !!
 --
 *Cheers,*
 *Mayur.*



Re: Advice on post mortem of data loss (v 1.0.3)

2013-02-05 Thread Suresh Srinivas
Sorry to hear you are having issues. A few questions and comments are inline.


On Fri, Feb 1, 2013 at 8:40 AM, Peter Sheridan 
psheri...@millennialmedia.com wrote:

  Yesterday, I bounced my DFS cluster.  We realized that ulimit -u was,
 in extreme cases, preventing the name node from creating threads.  This had
 only started occurring within the last day or so.  When I brought the name
 node back up, it had essentially been rolled back by one week, and I lost
 all changes which had been made since then.

  There are a few other factors to consider.

1. I had 3 locations for dfs.name.dir: one local and two NFS.  (I
originally thought this was 2 local and one NFS when I set it up.)  On
1/24, the day we effectively rolled back to, the second NFS mount
started showing as FAILED on dfshealth.jsp.  We had seen this before
without issue, so I didn't consider it critical.

 What do you mean by rolled back to?
I understand this so far as: you have three dirs, l1, nfs1 and nfs2 (l for
local disk and nfs for NFS). nfs2 was shown as failed.


2. When I brought the name node back up, because I had discovered the
above, I had changed dfs.name.dir to 2 local drives and one NFS, excluding
the one which had failed.

 When you brought the namenode back up, with the changed configuration you
have l1, l2 and nfs1. Given you have not seen any failures, l1 and nfs1
have the latest edits so far. Correct? How did you add l2? Can you describe
this procedure in detail?


 Reviewing the name node log from the day with the NFS outage, I see:


When you say NFS outage here, this is the failure corresponding to nfs2
from above. Is that correct?



  2013-01-24 16:33:11,794 ERROR
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit
 log.
 java.io.IOException: Input/output error
 at sun.nio.ch.FileChannelImpl.force0(Native Method)
 at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:348)
 at
 org.apache.hadoop.hdfs.server.namenode.FSEditLog$EditLogFileOutputStream.flushAndSync(FSEditLog.java:215)
 at
 org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:89)
 at
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:1015)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1666)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:718)
 at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
 2013-01-24 16:33:11,794 WARN org.apache.hadoop.hdfs.server.common.Storage:
 Removing storage dir /rdisks/xx


  Unfortunately, since I wasn't expecting anything terrible to happen, I
 didn't look too closely at the file system while the name node was down.
  When I brought it up, the time stamp on the previous checkpoint directory
 in the dfs.name.dir was right around the above error message.  The current
 directory basically had an fsimage and an empty edits log with the current
 time stamps.


Which storage directory are you talking about here?


  So: what happened?  Should this failure have led to my data loss?  I
 would have thought the local directory would be fine in this scenario.  Did
 I have any other options for data recovery?


I am not sure how you concluded that you lost a week's data and that the
namenode rolled back by one week. Please share the namenode logs
corresponding to the restart.

This is how it should have worked.
- When nfs2 was removed, on both l1 and nfs1 a timestamp is recorded,
corresponding to removal of a storage directory.
- If there is any checkpointing that happened, it would have also
incremented the timestamp.
- When the namenode starts up, it chooses l1 and nfs1 because the recorded
timestamp is the latest on these directories and loads fsimage and edits
from those directories. Namenode also performs checkpoint and writes new
consolidated image on l1, l2 and nfs1 and creates empty editlog on l1, l2
and nfs1.
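
A hedged way to compare what each configured dfs.name.dir actually holds in
the Hadoop 1.x layout (the directory paths below are examples, not the ones
from this cluster):

# The storage directory whose current/fsimage, edits and fstime are newest is
# the one the namenode should have loaded from; previous.checkpoint holds the
# prior image.
for d in /data/1/dfs/name /mnt/nfs1/dfs/name; do
  echo "== $d =="
  ls -l "$d/current" "$d/previous.checkpoint" 2>/dev/null
done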

If you provide more details on how l2 was added, we may be able to
understand what happened.

Regards,
Suresh


-- 
http://hortonworks.com/download/


Re: Specific HDFS tasks where is passwordless SSH is necessary

2013-02-05 Thread Robert Dyer
The JobTracker will also SSH in to start TaskTrackers.

So basically, the masters need SSH access to any slave(s) you define.  The slave
nodes (DN, TT) do not need SSH to each other.


On Tue, Feb 5, 2013 at 5:06 PM, Jay Vyas jayunit...@gmail.com wrote:

 When setting up passwordless ssh on a cluster, its clear that the namenode
 needs to be able to ssh into task trackers to start/stop nodes and restart
 the cluster.

 What else is passwordless SSH used for?  Do TaskTrackers/DataNodes ever
 SSH into each other horizontally ? Or is SSH only used for one-way nn to tt
 operations?

 --
 Jay Vyas
 http://jayunit100.blogspot.com

 --

 Robert Dyer
 rd...@iastate.edu



Re: [HOD] Cannot use env variables in hodrc

2013-02-05 Thread Mehmet Belgin
On a related note, env-vars is also being ignored:

env-vars= 
HOD_PYTHON_HOME=/usr/local/packages/python/2.5.1/bin/python2.5

And hod picks the system-default python and terminates with errors unless I 
manually export HOD_PYTHON_HOME.

export HOD_PYTHON_HOME=`which python2.5`

I am also having problems getting hod to use the cluster I created, but I
assume those issues are also related.

How can I make sure that hodrc contents are passed correctly into hod? 

Thanks a lot in advance!
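
A hedged workaround sketch, given that hod does not seem to expand ${VAR}
references in hodrc: expand them yourself when generating the file (the
template name and the paths are hypothetical):

export JAVA_HOME=/usr/local/packages/java
export RM_HOME=/usr/local/packages/torque
# envsubst (from gettext) replaces only the listed variables on stdin
envsubst '${JAVA_HOME} ${RM_HOME}' < hodrc.template > ~/hod/conf/hodrc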


On Feb 5, 2013, at 4:41 PM, Mehmet Belgin wrote:

 Hello everyone,
 
 I am setting up Hadoop for the first time, so please bear with me while I ask 
 all these beginner questions :)
 
 I followed the instructions to create a hodrc, but it looks like I cannot use
 env variables in this file:
 
 error: bin/hod failed to start.
 error: invalid 'java-home' specified in section hod (--hod.java-home): 
 ${JAVA_HOME}
 error: invalid 'batch-home' specified in section resource_manager 
 (--resource_manager.batch-home): ${RM_HOME}
 
 ... despite the fact that I have  ${JAVA_HOME} and ${RM_HOME} correctly 
 defined in my environment. When I replace these variables with full explicit 
 paths, it works. I checked the permissions, and everything else looks fine.
 
 What am I missing here?
 
 Thanks!
 
 



Re: Specific HDFS tasks where is passwordless SSH is necessary

2013-02-05 Thread Harsh J
It isn't the NN that does the SSH, technically; it's the scripts we ship
for an easier start/stop:
http://wiki.apache.org/hadoop/FAQ#Does_Hadoop_require_SSH.3F

So wherever you launch the script, the SSH may happen from that point.
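
A hedged sketch of the usual setup, run from whichever node you launch the
start/stop scripts on (the user name and slaves file location follow the
common Hadoop 1.x layout and are examples):

# Generate a key pair once, then push the public key to every host listed in
# the slaves file so the scripts can log in without a password.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
for host in $(cat $HADOOP_HOME/conf/slaves); do
  ssh-copy-id hadoop@"$host"
done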

On Wed, Feb 6, 2013 at 4:36 AM, Jay Vyas jayunit...@gmail.com wrote:
 When setting up passwordless ssh on a cluster, its clear that the namenode
 needs to be able to ssh into task trackers to start/stop nodes and restart
 the cluster.

 What else is passwordless SSH used for?  Do TaskTrackers/DataNodes ever SSH
 into each other horizontally ? Or is SSH only used for one-way nn to tt
 operations?

 --
 Jay Vyas
 http://jayunit100.blogspot.com



-- 
Harsh J


Re: Use vaidya but error in parsing conf file

2013-02-05 Thread jun zhang
You can find it by searching Google for "vaidya github hadoop".

The link is https://github.com/facebook/hadoop-20/tree/master/src/contrib/vaidya

But only 5 rules are checked, so it was not as useful as I had hoped.

And my problem was fixed by changing file://home to file:/home
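
For reference, a sketch of the corrected invocation (file names are the ones
from the report quoted below; file://home makes the URI parser treat "home"
as a host name, while file:/home is a plain local path):

./vaidya_new.sh \
  -jobconf file:/home/jt1_1359122958375_job_201301252209_1384_conf.xml \
  -joblog file:/home/job_201301252209_1384_1359959201318_b \
  -testconf /opt/hadoop/contrib/vaidya/conf/postex_diagnosis_tests.xml \
  -report ./report.xml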



2013/2/5 Dhanasekaran Anbalagan bugcy...@gmail.com

 Hi jun,

 I am very much interested in the vaidya project, to analyze MapReduce job
 output. I read some web links. We are already using CDH4, where you can get
 the vaidya source. Please guide me on how to test my MR job with vaidya.

 -Dhanasekaran

 Did I learn something today? If not, I wasted it.


 On Mon, Feb 4, 2013 at 2:15 AM, jun zhang zhangjun.jul...@gmail.com wrote:

 I'm trying to use vaidya to check my MR job, but I always get error
 info like the one below.

 What is the "home" here? Do I need to set anything?

  ./vaidya_new.sh -jobconf
 file://home/jt1_1359122958375_job_201301252209_1384_conf.xml -joblog
 file://home/job_201301252209_1384_1359959201318_b  -testconf
 /opt/hadoop/contrib/vaidya/conf/postex_diagnosis_tests.xml -report
 ./report.xml

 13/02/04 15:06:04 FATAL conf.Configuration: error parsing conf file:
 java.net.UnknownHostException: home
 Exception:java.lang.RuntimeException: java.net.UnknownHostException:
 homejava.lang.RuntimeException: java.net.UnknownHostException: home
 at
 org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1395)
 at
 org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1269)
 at
 org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1200)
 at
 org.apache.hadoop.conf.Configuration.get(Configuration.java:501)
 at
 org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:131)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:242)
 at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:225)
 at
 org.apache.hadoop.vaidya.postexdiagnosis.PostExPerformanceDiagnoser.readJobInformation(PostExPerformanceDiagnoser.java:138)
 at
 org.apache.hadoop.vaidya.postexdiagnosis.PostExPerformanceDiagnoser.init(PostExPerformanceDiagnoser.java:112)
 at
 org.apache.hadoop.vaidya.postexdiagnosis.PostExPerformanceDiagnoser.main(PostExPerformanceDiagnoser.java:220)
 Caused by: java.net.UnknownHostException: home
 at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:177)
 at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
 at java.net.Socket.connect(Socket.java:529)
 at java.net.Socket.connect(Socket.java:478)
 at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
 at sun.net.NetworkClient.openServer(NetworkClient.java:118)
 at sun.net.ftp.FtpClient.openServer(FtpClient.java:488)
 at sun.net.ftp.FtpClient.openServer(FtpClient.java:475)
 at
 sun.net.www.protocol.ftp.FtpURLConnection.connect(FtpURLConnection.java:270)
 at
 sun.net.www.protocol.ftp.FtpURLConnection.getInputStream(FtpURLConnection.java:352)
 at
 com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:653)
 at
 com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
 at
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:772)
 at
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
 at
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
 at
 com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:235)
 at
 com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
 at
 javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
 at
 org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1300)
 ... 9 more





Re: TaskStatus Exception using HFileOutputFormat

2013-02-05 Thread Ted Yu
Using the below construct, do you still get the exception?

Please consider upgrading to hadoop 1.0.4

Thanks

On Tue, Feb 5, 2013 at 4:55 PM, Sean McNamara
sean.mcnam...@webtrends.com wrote:

   Can you tell us the HBase and hadoop versions you were using?

  Ahh yes, sorry I left that out:

  Hadoop: 1.0.3
 HBase: 0.92.0


   I guess you have used the above construct


  Our code is as follows:
  HTable table = new HTable(conf, configHBaseTable);
 FileOutputFormat.setOutputPath(job, outputDir);
 HFileOutputFormat.configureIncrementalLoad(job, table);


  Thanks!

   From: Ted Yu yuzhih...@gmail.com
 Reply-To: user@hadoop.apache.org user@hadoop.apache.org
 Date: Tuesday, February 5, 2013 5:46 PM
 To: user@hadoop.apache.org user@hadoop.apache.org
 Subject: Re: TaskStatus Exception using HFileOutputFormat

   Can you tell us the HBase and hadoop versions you were using ?
 From TestHFileOutputFormat:

 HFileOutputFormat.configureIncrementalLoad(job, table);

 FileOutputFormat.setOutputPath(job, outDir);
 I guess you have used the above construct ?

  Cheers

 On Tue, Feb 5, 2013 at 4:31 PM, Sean McNamara sean.mcnam...@webtrends.com
  wrote:


  We're trying to use HFileOutputFormat for bulk hbase loading.   When
 using HFileOutputFormat's setOutputPath or configureIncrementalLoad, the
 job is unable to run.  The error I see in the jobtracker logs is: Trying to
 set finish time for task attempt_201301030046_123198_m_02_0 when no
 start time is set, stackTrace is : java.lang.Exception

  If I remove any references to HFileOutputFormat, and
 use FileOutputFormat.setOutputPath, things seem to run great.  Does anyone
 know what could be causing the TaskStatus error when
 using HFileOutputFormat?

  Thanks,

  Sean


  What I see on the Job Tracker:

  2013-02-06 00:17:33,685 ERROR org.apache.hadoop.mapred.TaskStatus:
 Trying to set finish time for task attempt_201301030046_123198_m_02_0
 when no start time is set, stackTrace is : java.lang.Exception
 at
 org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:145)
 at
 org.apache.hadoop.mapred.TaskInProgress.incompleteSubTask(TaskInProgress.java:670)
 at
 org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:2945)
 at
 org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1162)
 at
 org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:4739)
 at
 org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3683)
 at
 org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3378)
 at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)


  What I see from the console:

  391  [main] INFO  org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
  - Looking up current regions for table
 org.apache.hadoop.hbase.client.HTable@3a083b1b
 1284 [main] INFO  org.apache.hadoop.hbase.mapreduce.HFileOutputFormat  -
 Configuring 41 reduce partitions to match current region count
 1285 [main] INFO  org.apache.hadoop.hbase.mapreduce.HFileOutputFormat  -
 Writing partition information to
 file:/opt/webtrends/oozie/jobs/Lab/O/VisitorAnalytics.MapReduce/bin/partitions_1360109875112
 1319 [main] INFO  org.apache.hadoop.util.NativeCodeLoader  - Loaded the
 native-hadoop library
 1328 [main] INFO  org.apache.hadoop.io.compress.zlib.ZlibFactory  -
Successfully loaded & initialized native-zlib library
 1329 [main] INFO  org.apache.hadoop.io.compress.CodecPool  - Got
 brand-new compressor
 1588 [main] INFO  org.apache.hadoop.hbase.mapreduce.HFileOutputFormat  -
 Incremental table output configured.
 2896 [main] INFO  org.apache.hadoop.hbase.mapreduce.TableOutputFormat  -
 Created table instance for Lab_O_VisitorHistory
 2910 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat
  - Total input paths to process : 1
 Job Name:   job_201301030046_123199
 Job Id:
 http://strack01.staging.dmz:50030/jobdetails.jsp?jobid=job_201301030046_123199
 Job URL:VisitorHistory MapReduce (soozie01.Lab.O)
 3141 [main] INFO  org.apache.hadoop.mapred.JobClient  - Running job:
 job_201301030046_123199
 4145 [main] INFO  org.apache.hadoop.mapred.JobClient  -  map 0% reduce 0%
 10162 [main] INFO  org.apache.hadoop.mapred.JobClient  - Task Id :