Re: HDFS Block location verification
Hi,

You may try:

  hadoop fsck -locations -blocks -files <hdfs_path>

It will print detailed info about the blocks and their locations.

On Tue, Feb 5, 2013 at 4:00 PM, Dhanasekaran Anbalagan <bugcy...@gmail.com> wrote:

> Hi Guys,
>
> I have configured HDFS with a replication factor of 3, and we have about 1 TB of data. How do I find the three machines on which a particular block is stored? In other words, how can I verify that each block of my data is available on three different nodes? Please guide me.
>
> -Dhanasekaran.
> Did I learn something today? If not, I wasted it.
HyperThreading in TaskTracker nodes?
I would like to get some opinions / recommendations about the pros and cons of enabling HyperThreading on TaskTracker nodes. Presumably memory could be an issue, but is there anything to be gained, perhaps because of I/O wait?

My small cluster is made up of relatively old and slow systems, most of which have quite slow disk I/O, if that matters.

Thanks,
Terry
Re: HDFS Block location verification
Hi Samir,

Thanks so much. This is exactly what I wanted.

tech@dvcliftonhera150:~$ hadoop fsck -locations -blocks -files /user/tech/pkg.tar.gz
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Connecting to namenode via http://dvcliftonhera122:50070
FSCK started by tech (auth:SIMPLE) from /172.16.30.150 for path /user/tech/pkg.tar.gz at Tue Feb 05 10:33:23 EST 2013
/user/tech/pkg.tar.gz 165 bytes, 1 block(s): OK
0. BP-1936777173-172.16.30.122-1343141974879:blk_8828079455224016541_10294868 len=165 repl=3 [172.16.30.144:50010, 172.16.30.135:50010, 172.16.30.134:50010]

Status: HEALTHY
 Total size: 165 B
 Total dirs: 0
 Total files: 1
 Total blocks (validated): 1 (avg. block size 165 B)
 Minimally replicated blocks: 1 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.0
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)
 Number of data-nodes: 47
 Number of racks: 1
FSCK ended at Tue Feb 05 10:33:23 EST 2013 in 3 milliseconds

The filesystem under path '/user/tech/pkg.tar.gz' is HEALTHY

Did I learn something today? If not, I wasted it.

On Tue, Feb 5, 2013 at 10:18 AM, Samir Ahmic <ahmic.sa...@gmail.com> wrote:

> Hi,
>
> You may try:
>
>   hadoop fsck -locations -blocks -files <hdfs_path>
>
> It will print detailed info about the blocks and their locations.
>
> On Tue, Feb 5, 2013 at 4:00 PM, Dhanasekaran Anbalagan <bugcy...@gmail.com> wrote:
>
>> Hi Guys,
>>
>> I have configured HDFS with a replication factor of 3, and we have about 1 TB of data. How do I find the three machines on which a particular block is stored? In other words, how can I verify that each block of my data is available on three different nodes? Please guide me.
>>
>> -Dhanasekaran.
>> Did I learn something today? If not, I wasted it.
Job History files in Hadoop 2.0
Hi,

I recently migrated to Hadoop 2.0 from Hadoop 1.0 (0.20.2 before). I am able to successfully launch the example applications. Could anyone please suggest where the MapReduce job history files are available after running jobs in Hadoop 2.0?

I need the statistics after running the jobs. Of course, the web UI gives me the information, but I need the per-job history files (job_ID etc.) that were available in the previous versions of Hadoop. I can see a directory with application and container patterns, but it does not contain specific information such as job submit time, map start/finish time, reduce start/finish time, and job finish time. In the previous versions of Hadoop (1.0 or 0.20), these were stored under logs/history. Can anyone confirm whether the way job history files are stored has also changed in the new architecture?

Thanks in advance,
Amit Sangroya

--
View this message in context: http://lucene.472066.n3.nabble.com/Job-History-files-in-Hadoop-2-0-tp4038599.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
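Assuming a fairly default MRv2 / JobHistoryServer setup, the history file locations are controlled by the mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir properties, so the paths below are only the common defaults and the job ID is a placeholder; depending on the exact 2.x version, -history accepts either the job output directory or the .jhist file:

# List completed-job history files (one .jhist per job) under the default done-dir.
hdfs dfs -ls -R /tmp/hadoop-yarn/staging/history/done

# Dump submit/launch/finish times and per-task statistics for one job.
mapred job -history /tmp/hadoop-yarn/staging/history/done/2013/02/05/000000/job_1359717954354_0001-*.jhist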
Re: Application of Cloudera Hadoop for Dataset analysis
Please take this thread to the CDH mailing list.

On Tue, Feb 5, 2013 at 2:43 AM, Sharath Chandra Guntuku <sharathchandr...@gmail.com> wrote:

> Hi,
>
> I am Sharath Chandra, an undergraduate student at BITS-Pilani, India. I would like to get the following clarifications regarding the Cloudera Hadoop distribution. I am using a CDH4 Demo VM for now.
>
> 1. After I upload the files into the file browser, if I have to link two or three datasets using a key in those files, what should I do? Do I have to run a query over them?
>
> 2. My objective is that I have some data collected over a few years, and I would now like to link all of it, as in a database, using keys and then run queries over it to find particular patterns. Later I would like to apply some machine learning algorithms for predictive analysis. Will this be possible on the demo VM?
>
> I am totally new to this. Can I get some help on this? I would be very grateful for the same.
>
> --
> Thanks and Regards,
> Sharath Chandra Guntuku
> Undergraduate Student (Final Year)
> Computer Science Department
> Email: f2009...@hyderabad.bits-pilani.ac.in
> BITS-Pilani, Hyderabad Campus
> Jawahar Nagar, Shameerpet, RR Dist, Hyderabad - 500078, Andhra Pradesh

--
http://hortonworks.com/download/
Re: replication factor
I would recommend this: http://www.aosabook.org/en/hdfs.html

Nicolas

On Tue, Feb 5, 2013 at 6:28 PM, Lin Ma <lin...@gmail.com> wrote:

> Hello guys,
>
> I want to learn a bit more about when we should change (increase or decrease) the replication factor for better performance, a bit about the internals of how replication works, and the pros and cons of larger vs. smaller replication factors. For example, when deploying a static model/config file to Hadoop jobs, is a larger replication factor better? Unfortunately, I cannot find related material by searching. I would appreciate it if anyone could point me to some good documents.
>
> thanks in advance,
> Lin
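As a practical complement to that reading, here is a minimal sketch of how the replication factor is usually adjusted and verified from the command line; the path and the factor of 2 are purely illustrative, and the cluster-wide default for new files is the dfs.replication property:

# Change the replication factor of an existing file; -w waits for re-replication to finish.
hadoop fs -setrep -w 2 /user/lin/static-model

# Check how many replicas each block actually has, and where they are.
hadoop fsck /user/lin/static-model -files -blocks -locations

Roughly speaking, a higher factor on a small, widely read side file buys more data-local reads at the cost of storage and write time, while large datasets usually stay at the default of 3.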
Re: Application of Cloudera Hadoop for Dataset analysis
You can use the Hortonworks Data Platform, which already integrates HDFS, MapReduce and Hive well.
http://hortonworks.com/products/hortonworksdataplatform/

I came across this new solution recently; they claim to be a Hadoop-based standard-SQL solution for data analytics. I have not given it a try yet, but you can explore it:
http://queryio.com/hadoop-big-data-product/hadoop-hive.html

-Richard

On Tue, Feb 5, 2013 at 10:07 AM, Preethi Vinayak Ponangi <vinayakpona...@gmail.com> wrote:

> It depends on which Hadoop ecosystem component you would like to use. You can do it in several ways:
>
> 1) You could write a basic map reduce job to do joins. This link could help, or a basic Google search will give you several links:
> http://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/
>
> 2) You could use an abstraction language like Pig to do these joins using simple Pig scripts:
> http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html
>
> 3) The simplest of all: you could write SQL-like queries to do this join using Hive.
> http://hive.apache.org/
>
> Hope this helps.
>
> Regards,
> Vinayak.
>
> On Tue, Feb 5, 2013 at 10:00 AM, Suresh Srinivas <sur...@hortonworks.com> wrote:
>
>> Please take this thread to the CDH mailing list.
>>
>> On Tue, Feb 5, 2013 at 2:43 AM, Sharath Chandra Guntuku <sharathchandr...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am Sharath Chandra, an undergraduate student at BITS-Pilani, India. I would like to get the following clarifications regarding the Cloudera Hadoop distribution. I am using a CDH4 Demo VM for now.
>>>
>>> 1. After I upload the files into the file browser, if I have to link two or three datasets using a key in those files, what should I do? Do I have to run a query over them?
>>>
>>> 2. My objective is that I have some data collected over a few years, and I would now like to link all of it, as in a database, using keys and then run queries over it to find particular patterns. Later I would like to apply some machine learning algorithms for predictive analysis. Will this be possible on the demo VM?
>>>
>>> I am totally new to this. Can I get some help on this? I would be very grateful for the same.
>>>
>>> --
>>> Thanks and Regards,
>>> Sharath Chandra Guntuku
>>> Undergraduate Student (Final Year)
>>> Computer Science Department
>>> Email: f2009...@hyderabad.bits-pilani.ac.in
>>> BITS-Pilani, Hyderabad Campus
>>> Jawahar Nagar, Shameerpet, RR Dist, Hyderabad - 500078, Andhra Pradesh
>>
>> --
>> http://hortonworks.com/download/
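To make option 3 (Hive) above concrete, here is a minimal sketch of a key-based join run from the shell; the customers/visits tables and the customer_id key are hypothetical and assume the two datasets have already been loaded as Hive tables:

# Join two datasets on a shared key and count joined rows per key.
hive -e "
SELECT c.customer_id, COUNT(v.visit_id) AS visit_count
FROM customers c
JOIN visits v ON (c.customer_id = v.customer_id)
GROUP BY c.customer_id;
"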
RE: HyperThreading in TaskTracker nodes?
Hate to say it, but HyperThreading can have either positive or negative performance characteristics. It all depends on your workload. You have to measure very carefully; it may not even be a bottleneck(!) :)

I hit a pretty significant power issue when I enabled HyperThreading at multi-thousand-node scale. We hit an ~8-10% power utilization increase, which, if rolled out to the entire cluster, would put me a few percentage points over our max spec power. In this case, for our workload, we actually saw a 15% improvement in processing throughput / job latency. We ended up literally turning off machines and enabling HyperThreading on the remaining ones, and saw an overall ~10% efficiency gain in the cluster with a few fewer machines, but running hot on power.

~Brad

-----Original Message-----
From: Terry Healy [mailto:the...@bnl.gov]
Sent: Tuesday, February 5, 2013 7:20 AM
To: user@hadoop.apache.org
Subject: HyperThreading in TaskTracker nodes?

I would like to get some opinions / recommendations about the pros and cons of enabling HyperThreading on TaskTracker nodes. Presumably memory could be an issue, but is there anything to be gained, perhaps because of I/O wait?

My small cluster is made up of relatively old and slow systems, most of which have quite slow disk I/O, if that matters.

Thanks,
Terry
Re: HyperThreading in TaskTracker nodes?
Power issues aside, I've seen similar sorts of performance gains for MR workloads - around 15-20%. I think a fair bit of it is due to poor CPU cache utilization in various parts of Hadoop - hyperthreading gets some extra parallelism there while the core is waiting on round trips to DRAM.

-Todd

On Tue, Feb 5, 2013 at 10:03 AM, Brad Sarsfield <b...@bing.com> wrote:

> Hate to say it, but HyperThreading can have either positive or negative performance characteristics. It all depends on your workload. You have to measure very carefully; it may not even be a bottleneck(!) :)
>
> I hit a pretty significant power issue when I enabled HyperThreading at multi-thousand-node scale. We hit an ~8-10% power utilization increase, which, if rolled out to the entire cluster, would put me a few percentage points over our max spec power. In this case, for our workload, we actually saw a 15% improvement in processing throughput / job latency. We ended up literally turning off machines and enabling HyperThreading on the remaining ones, and saw an overall ~10% efficiency gain in the cluster with a few fewer machines, but running hot on power.
>
> ~Brad
>
> -----Original Message-----
> From: Terry Healy [mailto:the...@bnl.gov]
> Sent: Tuesday, February 5, 2013 7:20 AM
> To: user@hadoop.apache.org
> Subject: HyperThreading in TaskTracker nodes?
>
> I would like to get some opinions / recommendations about the pros and cons of enabling HyperThreading on TaskTracker nodes. Presumably memory could be an issue, but is there anything to be gained, perhaps because of I/O wait?
>
> My small cluster is made up of relatively old and slow systems, most of which have quite slow disk I/O, if that matters.
>
> Thanks,
> Terry

--
Todd Lipcon
Software Engineer, Cloudera
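As a side note, if you want to confirm whether HyperThreading is actually enabled on a given Linux node before and after flipping the BIOS setting, a quick approximate check is to compare logical threads with physical cores:

# "Thread(s) per core: 2" (or siblings greater than cpu cores) means HT/SMT is on.
lscpu | grep -i 'thread(s) per core'
egrep 'siblings|cpu cores' /proc/cpuinfo | sort -u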
[Hadoop-Help]About Map-Reduce implementation
Hello,

I am new to Hadoop. I am doing a project in the cloud in which I have to use Hadoop for map-reduce. I am going to collect logs from 2-3 machines in different locations. The logs are also in different formats, such as .rtf, .log and .txt. Later, I have to convert them to one format and collect them in one location.

So I am asking: which module of Hadoop do I need to study for this implementation, or should I study the whole framework?

Seeking guidance,
Thank you!!

--
Cheers,
Mayur.
Re: [Hadoop-Help]About Map-Reduce implementation
Hey Mayur,

If you are collecting logs from multiple servers, then you can use Flume for that. If the logs differ in format, you can just use TextInputFormat to read them and then write them out in whatever single format you want for the later parts of your project.

The first thing you need to learn is how to set up Hadoop. Then you can try writing sample Hadoop MapReduce jobs that read from a text file, process it, and write the results into another file. After that you can integrate Flume as your log collection mechanism. Once you have a handle on the system, you can decide which paths to follow based on your requirements for storage, compute time, compute capacity, compression, etc.

On Wed, Feb 6, 2013 at 3:01 AM, Mayur Patil <ram.nath241...@gmail.com> wrote:

> Hello,
>
> I am new to Hadoop. I am doing a project in the cloud in which I have to use Hadoop for map-reduce. I am going to collect logs from 2-3 machines in different locations. The logs are also in different formats, such as .rtf, .log and .txt. Later, I have to convert them to one format and collect them in one location.
>
> So I am asking: which module of Hadoop do I need to study for this implementation, or should I study the whole framework?
>
> Seeking guidance,
> Thank you!!
>
> --
> Cheers,
> Mayur.

--
Nitin Pawar
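For the "sample MapReduce job" step suggested above, one possible starting point is a Hadoop Streaming job that reads raw log lines and re-emits them in one normalized format. This is only a sketch: the jar path matches a typical Hadoop 1.x layout, and /logs/raw, /logs/normalized and normalize.py are placeholders for your own paths and normalization script.

# Run a streaming job: each input line is fed to the mapper on stdin,
# which writes the normalized line to stdout; no reducer is needed here.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input  /logs/raw \
  -output /logs/normalized \
  -mapper normalize.py \
  -file   normalize.py \
  -numReduceTasks 0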
Re: [Hadoop-Help]About Map-Reduce implementation
Hi,

Please read the basics of how Hadoop works, then start hands-on with MapReduce coding. The tool that has been made for your use case is Flume, but don't look at tools until you complete the above two steps.

Good luck; keep us posted.

Regards,
Jagat Singh

---
Sent from mobile; short and crisp.

On 06-Feb-2013 8:32 AM, Mayur Patil <ram.nath241...@gmail.com> wrote:

> Hello,
>
> I am new to Hadoop. I am doing a project in the cloud in which I have to use Hadoop for map-reduce. I am going to collect logs from 2-3 machines in different locations. The logs are also in different formats, such as .rtf, .log and .txt. Later, I have to convert them to one format and collect them in one location.
>
> So I am asking: which module of Hadoop do I need to study for this implementation, or should I study the whole framework?
>
> Seeking guidance,
> Thank you!!
>
> --
> Cheers,
> Mayur.
Re: Advice on post mortem of data loss (v 1.0.3)
Sorry to hear you are having issues. A few questions and comments inline.

On Fri, Feb 1, 2013 at 8:40 AM, Peter Sheridan <psheri...@millennialmedia.com> wrote:

> Yesterday, I bounced my DFS cluster. We realized that ulimit -u was, in extreme cases, preventing the name node from creating threads. This had only started occurring within the last day or so. When I brought the name node back up, it had essentially been rolled back by one week, and I lost all changes which had been made since then. There are a few other factors to consider.
>
> 1. I had 3 locations for dfs.name.dir — one local and two NFS. (I originally thought this was 2 local and one NFS when I set it up.) On 1/24, the day which we effectively rolled back to, the second NFS mount started showing as FAILED on dfshealth.jsp. We had seen this before without issue, so I didn't consider it critical.

What do you mean by "rolled back to"? What I understand so far is that you had three dirs: l1, nfs1 and nfs2 (l for local disk, nfs for NFS), and nfs2 was shown as failed.

> 2. When I brought the name node back up, because of discovering the above, I had changed dfs.name.dir to 2 local drives and one NFS, excluding the one which had failed.

When you brought the namenode back up with the changed configuration, you had l1, l2 and nfs1. Given you have not seen any failures, l1 and nfs1 have the latest edits so far. Correct? How did you add l2? Can you describe this procedure in detail?

> Reviewing the name node log from the day with the NFS outage, I see:

When you say NFS outage here, is this the failure corresponding to nfs2 from above?

> 2013-01-24 16:33:11,794 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit log.
> java.io.IOException: Input/output error
>         at sun.nio.ch.FileChannelImpl.force0(Native Method)
>         at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:348)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog$EditLogFileOutputStream.flushAndSync(FSEditLog.java:215)
>         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:89)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:1015)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1666)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:718)
>         at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
> 2013-01-24 16:33:11,794 WARN org.apache.hadoop.hdfs.server.common.Storage: Removing storage dir /rdisks/xx
>
> Unfortunately, since I wasn't expecting anything terrible to happen, I didn't look too closely at the file system while the name node was down. When I brought it up, the time stamp on the previous checkpoint directory in the dfs.name.dir was right around the above error message. The current directory basically had an fsimage and an empty edits log with the current time stamps.

Which storage directory are you talking about here?

> So: what happened? Should this failure have led to my data loss? I would have thought the local directory would be fine in this scenario. Did I have any other options for data recovery?

I am not sure how you concluded that you lost a week's data and that the namenode rolled back by one week. Please share the namenode logs corresponding to the restart. This is how it should have worked:

- When nfs2 was removed, a timestamp is recorded on both l1 and nfs1, corresponding to the removal of a storage directory.
- If any checkpointing happened, it would also have incremented the timestamp.
- When the namenode starts up, it chooses l1 and nfs1 because the recorded timestamp is the latest on those directories, and loads the fsimage and edits from them. The namenode also performs a checkpoint, writes a new consolidated image to l1, l2 and nfs1, and creates an empty editlog on l1, l2 and nfs1.

If you provide more details on how l2 was added, we may be able to understand what happened.

Regards,
Suresh

--
http://hortonworks.com/download/
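For anyone doing a similar post mortem on a 1.x namenode, a rough sketch of how to compare the state of each configured dfs.name.dir before restarting; the directory paths are placeholders, and the current/ and previous.checkpoint layout is the usual 1.x on-disk format:

# Compare the image/edits timestamps recorded in each storage directory
# (paths below stand in for the entries listed in dfs.name.dir).
for d in /data/1/dfs/name /mnt/nfs1/dfs/name; do
  echo "== $d"
  ls -l "$d/current" "$d/previous.checkpoint" 2>/dev/null
  cat "$d/current/VERSION"
done

If the directories disagree, the namenode prefers the ones with the most recent recorded checkpoint time at startup, as Suresh describes above, so this is worth confirming before pointing dfs.name.dir at a new, empty directory.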
Re: Specific HDFS tasks where is passwordless SSH is necessary
The JobTracker will also SSH in to start TaskTrackers. So basically, the masters need SSH to any slave(s) you define. The slave nodes (DN, TT) do not need SSH to each other.

On Tue, Feb 5, 2013 at 5:06 PM, Jay Vyas <jayunit...@gmail.com> wrote:

> When setting up passwordless ssh on a cluster, it's clear that the namenode needs to be able to ssh into task trackers to start/stop nodes and restart the cluster.
>
> What else is passwordless SSH used for? Do TaskTrackers/DataNodes ever SSH into each other horizontally? Or is SSH only used for one-way NN-to-TT operations?
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com

--
Robert Dyer
rd...@iastate.edu
Re: [HOD] Cannot use env variables in hodrc
On a related note, env-vars is also being ignored:

env-vars = HOD_PYTHON_HOME=/usr/local/packages/python/2.5.1/bin/python2.5

hod picks the system-default python and terminates with errors unless I manually export HOD_PYTHON_HOME:

export HOD_PYTHON_HOME=`which python2.5`

I am also having problems getting hod to use the cluster I created, but I assume those issues are related. How can I make sure that the hodrc contents are passed correctly into hod?

Thanks a lot in advance!

On Feb 5, 2013, at 4:41 PM, Mehmet Belgin wrote:

> Hello everyone,
>
> I am setting up Hadoop for the first time, so please bear with me while I ask all these beginner questions :)
>
> I followed the instructions to create a hodrc, but it looks like I cannot use env variables in this file:
>
> error: bin/hod failed to start.
> error: invalid 'java-home' specified in section hod (--hod.java-home): ${JAVA_HOME}
> error: invalid 'batch-home' specified in section resource_manager (--resource_manager.batch-home): ${RM_HOME}
> ...
>
> despite the fact that I have ${JAVA_HOME} and ${RM_HOME} correctly defined in my environment. When I replace these variables with full explicit paths, it works. I checked the permissions, and everything else looks fine. What am I missing here?
>
> Thanks!
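If HOD really does not expand ${VAR} references in hodrc, one possible workaround (not HOD functionality, just a shell-level sketch; hodrc.in is a hypothetical template copy of hodrc) is to substitute the environment when generating the file:

# Expand ${JAVA_HOME} and ${RM_HOME} from the current environment into hodrc.
# Requires gettext's envsubst.
envsubst '${JAVA_HOME} ${RM_HOME}' < hodrc.in > hodrc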
Re: Specific HDFS tasks where is passwordless SSH is necessary
It isn't the NN that does the SSH, technically; it's the scripts we ship for an easier start/stop: http://wiki.apache.org/hadoop/FAQ#Does_Hadoop_require_SSH.3F

So wherever you launch the script, the SSH may happen from that point.

On Wed, Feb 6, 2013 at 4:36 AM, Jay Vyas <jayunit...@gmail.com> wrote:

> When setting up passwordless ssh on a cluster, it's clear that the namenode needs to be able to ssh into task trackers to start/stop nodes and restart the cluster.
>
> What else is passwordless SSH used for? Do TaskTrackers/DataNodes ever SSH into each other horizontally? Or is SSH only used for one-way NN-to-TT operations?
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com

--
Harsh J
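For completeness, the usual way to set up the one-way passwordless SSH those start/stop scripts rely on, run from whichever node you launch them on (the slave hostnames below are placeholders):

# Generate a key pair once on the launching node (empty passphrase),
# then push the public key to each host listed in conf/slaves.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
for host in slave1 slave2 slave3; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"
done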
Re: Use vaidya but error in parsing conf file
You can find it by googling "vaidya github hadoop". The link is:
https://github.com/facebook/hadoop-20/tree/master/src/contrib/vaidya

But only 5 rules are checked, so it was not as useful as I had wished. And my problem was fixed by changing file://home to file:/home.

2013/2/5 Dhanasekaran Anbalagan <bugcy...@gmail.com>

> Hi jun,
>
> I am very much interested in the vaidya project for analyzing MapReduce job output. I read some weblinks; we are already using CDH4. Where can I get the vaidya source? Please guide me on how to test my MR job with vaidya.
>
> -Dhanasekaran
> Did I learn something today? If not, I wasted it.
>
> On Mon, Feb 4, 2013 at 2:15 AM, jun zhang <zhangjun.jul...@gmail.com> wrote:
>
>> I'm trying to use vaidya to check my MR job, but I always get the error below. What is the "home" here? Do I need to set anything?
>>
>> ./vaidya_new.sh -jobconf file://home/jt1_1359122958375_job_201301252209_1384_conf.xml -joblog file://home/job_201301252209_1384_1359959201318_b -testconf /opt/hadoop/contrib/vaidya/conf/postex_diagnosis_tests.xml -report ./report.xml
>>
>> 13/02/04 15:06:04 FATAL conf.Configuration: error parsing conf file: java.net.UnknownHostException: home
>> Exception:java.lang.RuntimeException: java.net.UnknownHostException: home
>> java.lang.RuntimeException: java.net.UnknownHostException: home
>>         at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1395)
>>         at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1269)
>>         at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1200)
>>         at org.apache.hadoop.conf.Configuration.get(Configuration.java:501)
>>         at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:131)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:242)
>>         at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:225)
>>         at org.apache.hadoop.vaidya.postexdiagnosis.PostExPerformanceDiagnoser.readJobInformation(PostExPerformanceDiagnoser.java:138)
>>         at org.apache.hadoop.vaidya.postexdiagnosis.PostExPerformanceDiagnoser.init(PostExPerformanceDiagnoser.java:112)
>>         at org.apache.hadoop.vaidya.postexdiagnosis.PostExPerformanceDiagnoser.main(PostExPerformanceDiagnoser.java:220)
>> Caused by: java.net.UnknownHostException: home
>>         at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:177)
>>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>>         at java.net.Socket.connect(Socket.java:529)
>>         at java.net.Socket.connect(Socket.java:478)
>>         at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
>>         at sun.net.NetworkClient.openServer(NetworkClient.java:118)
>>         at sun.net.ftp.FtpClient.openServer(FtpClient.java:488)
>>         at sun.net.ftp.FtpClient.openServer(FtpClient.java:475)
>>         at sun.net.www.protocol.ftp.FtpURLConnection.connect(FtpURLConnection.java:270)
>>         at sun.net.www.protocol.ftp.FtpURLConnection.getInputStream(FtpURLConnection.java:352)
>>         at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:653)
>>         at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
>>         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:772)
>>         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>>         at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>>         at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:235)
>>         at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
>>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
>>         at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1300)
>>         ... 9 more
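Based on the fix mentioned above (file://home makes the URI parser treat "home" as a hostname, whereas file:/home is a plain local path), the corrected invocation would presumably be, keeping the file names from the original message:

# Local-filesystem URIs need file:/<path> (or file:///<path>), not file://home/...
./vaidya_new.sh \
  -jobconf  file:/home/jt1_1359122958375_job_201301252209_1384_conf.xml \
  -joblog   file:/home/job_201301252209_1384_1359959201318_b \
  -testconf /opt/hadoop/contrib/vaidya/conf/postex_diagnosis_tests.xml \
  -report   ./report.xml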
Re: TaskStatus Exception using HFileOutputFormat
Using the below construct, do you still get the exception?

Please consider upgrading to hadoop 1.0.4.

Thanks

On Tue, Feb 5, 2013 at 4:55 PM, Sean McNamara <sean.mcnam...@webtrends.com> wrote:

>> Can you tell us the HBase and hadoop versions you were using ?
>
> Ahh yes, sorry I left that out:
> Hadoop: 1.0.3
> HBase: 0.92.0
>
>> I guess you have used the above construct
>
> Our code is as follows:
>
> HTable table = new HTable(conf, configHBaseTable);
> FileOutputFormat.setOutputPath(job, outputDir);
> HFileOutputFormat.configureIncrementalLoad(job, table);
>
> Thanks!
>
> From: Ted Yu <yuzhih...@gmail.com>
> Date: Tuesday, February 5, 2013 5:46 PM
> To: user@hadoop.apache.org
> Subject: Re: TaskStatus Exception using HFileOutputFormat
>
>> Can you tell us the HBase and hadoop versions you were using ?
>>
>> From TestHFileOutputFormat:
>>
>> HFileOutputFormat.configureIncrementalLoad(job, table);
>> FileOutputFormat.setOutputPath(job, outDir);
>>
>> I guess you have used the above construct ?
>>
>> Cheers
>>
>> On Tue, Feb 5, 2013 at 4:31 PM, Sean McNamara <sean.mcnam...@webtrends.com> wrote:
>>
>>> We're trying to use HFileOutputFormat for bulk HBase loading. When using HFileOutputFormat's setOutputPath or configureIncrementalLoad, the job is unable to run. The error I see in the jobtracker logs is: "Trying to set finish time for task attempt_201301030046_123198_m_02_0 when no start time is set, stackTrace is : java.lang.Exception". If I remove any references to HFileOutputFormat and use FileOutputFormat.setOutputPath, things seem to run great. Does anyone know what could be causing the TaskStatus error when using HFileOutputFormat?
>>>
>>> Thanks,
>>> Sean
>>>
>>> What I see on the Job Tracker:
>>>
>>> 2013-02-06 00:17:33,685 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to set finish time for task attempt_201301030046_123198_m_02_0 when no start time is set, stackTrace is : java.lang.Exception
>>>         at org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:145)
>>>         at org.apache.hadoop.mapred.TaskInProgress.incompleteSubTask(TaskInProgress.java:670)
>>>         at org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:2945)
>>>         at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1162)
>>>         at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:4739)
>>>         at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3683)
>>>         at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3378)
>>>         at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
>>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
>>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
>>>
>>> What I see from the console:
>>>
>>> 391 [main] INFO org.apache.hadoop.hbase.mapreduce.HFileOutputFormat - Looking up current regions for table org.apache.hadoop.hbase.client.HTable@3a083b1b
>>> 1284 [main] INFO org.apache.hadoop.hbase.mapreduce.HFileOutputFormat - Configuring 41 reduce partitions to match current region count
>>> 1285 [main] INFO org.apache.hadoop.hbase.mapreduce.HFileOutputFormat - Writing partition information to file:/opt/webtrends/oozie/jobs/Lab/O/VisitorAnalytics.MapReduce/bin/partitions_1360109875112
>>> 1319 [main] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
>>> 1328 [main] INFO org.apache.hadoop.io.compress.zlib.ZlibFactory - Successfully loaded & initialized native-zlib library
>>> 1329 [main] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new compressor
>>> 1588 [main] INFO org.apache.hadoop.hbase.mapreduce.HFileOutputFormat - Incremental table output configured.
>>> 2896 [main] INFO org.apache.hadoop.hbase.mapreduce.TableOutputFormat - Created table instance for Lab_O_VisitorHistory
>>> 2910 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>>> Job Name: VisitorHistory MapReduce (soozie01.Lab.O)
>>> Job Id: job_201301030046_123199
>>> Job URL: http://strack01.staging.dmz:50030/jobdetails.jsp?jobid=job_201301030046_123199
>>> 3141 [main] INFO org.apache.hadoop.mapred.JobClient - Running job: job_201301030046_123199
>>> 4145 [main] INFO org.apache.hadoop.mapred.JobClient - map 0% reduce 0%
>>> 10162 [main] INFO org.apache.hadoop.mapred.JobClient - Task Id :
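As a side note on the overall bulk-load flow (not the TaskStatus error itself): once HFileOutputFormat has written the HFiles, they still need to be handed to HBase, which in the 0.92 era was typically done with the completebulkload tool. A sketch, where the jar path and HFile output directory are placeholders and the table name is taken from the console output above:

# Move the generated HFiles into the target table's regions.
hadoop jar $HBASE_HOME/hbase-0.92.0.jar completebulkload /user/sean/hfile-output Lab_O_VisitorHistory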