Re: No space left on device
OK, I found it: the JobTracker server's disk is full.

2012-05-28
yingnan.ma

From: yingnan.ma
Sent: 2012-05-28 13:01:56
To: common-user
Cc:
Subject: No space left on device

Hi,

I am encountering the following problem:

Error - Job initialization failed:
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:201)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:348)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1344)
        ..

So I thought that HDFS was full or something similar, but I could not find a way to address the problem. If you have any suggestions, please share them. Thank you.

Best regards
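The stack trace bottoms out in RawLocalFileSystem during JobHistory.logSubmitted, which points at the JobTracker's local disk (job history and temporary directories) rather than HDFS, consistent with the follow-up above. As a minimal diagnostic sketch, assuming 0.20/1.x-era property names (hadoop.job.history.location, hadoop.tmp.dir, mapred.local.dir — these are not quoted from the thread and should be checked against your own mapred-site.xml), one could print how much space is left under each directory the JobTracker writes to:

import java.io.File;
import org.apache.hadoop.conf.Configuration;

// Hedged diagnostic sketch: report free space under the local directories the
// JobTracker writes job history and temporary data to. Property names are
// 0.20/1.x-era assumptions, not taken from the thread.
public class CheckJobTrackerDisk {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        String[] keys = {"hadoop.job.history.location", "hadoop.tmp.dir", "mapred.local.dir"};
        for (String key : keys) {
            for (String dir : conf.getStrings(key, new String[0])) {
                File f = new File(dir.replaceFirst("^file:(//)?", ""));
                System.out.printf("%-30s %s  free=%d MB%n",
                        key, dir, f.getUsableSpace() / (1024L * 1024L));
            }
        }
    }
}

A plain df -h on the JobTracker host tells the same story more directly; the sketch only helps when you want to see exactly which configured directories map onto the full disk.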
RE: Splunk + Hadoop
Hi Abhishek,

I am looking at a scenario where a customer representative needs to respond to customers while they are on a call. They need to search over a huge data set and respond within a few seconds.

Thanks and regards,
Shreya Pal
Architect Technology, Cognizant Technology Pvt Ltd
Vnet - 205594
Mobile - +91-9766310680

-----Original Message-----
From: Abhishek Pratap Singh [mailto:manu.i...@gmail.com]
Sent: Tuesday, May 22, 2012 2:44 AM
To: common-user@hadoop.apache.org
Subject: Re: Splunk + Hadoop

I have used both Hadoop and Splunk. Can you please let me know what your requirement is? Real-time processing with Hadoop depends on what defines "real time" in your particular scenario. Depending on the requirement, real time (or near real time) can be achieved.

~Abhishek

On Fri, May 18, 2012 at 3:58 PM, Russell Jurney russell.jur...@gmail.com wrote:

Because that isn't Cube.

Russell Jurney  twitter.com/rjurney  russell.jur...@gmail.com  datasyndrome.com

On May 18, 2012, at 2:01 PM, Ravi Shankar Nair ravishankar.n...@gmail.com wrote:

Why not HBase with Hadoop? It's the best bet.

Rgds,
Ravi
Sent from my Beethoven

On May 18, 2012, at 3:29 PM, Russell Jurney russell.jur...@gmail.com wrote:

I'm playing with using Hadoop and Pig to load MongoDB with data for Cube to consume. Cube (https://github.com/square/cube/wiki) is a realtime tool... but we'll be replaying events from the past. Does that count? It is nice to batch-backfill metrics into 'real-time' systems in bulk.

On Fri, May 18, 2012 at 12:11 PM, shreya@cognizant.com wrote:

Hi,

Has anyone used Hadoop and Splunk together, or any other real-time processing tool over Hadoop?

Regards,
Shreya

Russell Jurney  twitter.com/rjurney  russell.jur...@gmail.com  datasyndrome.com
Re: Splunk + Hadoop
Hi Shreya,

If you are looking at data locality, then you may or may not be able to use Hadoop out of the box. It will all depend on how you design the data layout on top of HDFS and how you implement search over it for the customer queries.

A good idea might be to put a hop-in queryable database like MySQL in between, where you store the results of the data processed on Hadoop, and then use Solr for fast access and search.

Thanks,
Nitin

On Mon, May 28, 2012 at 12:41 PM, shreya@cognizant.com wrote:

Hi Abhishek,

I am looking at a scenario where a customer representative needs to respond to customers while they are on a call. They need to search over a huge data set and respond within a few seconds.

Thanks and regards,
Shreya Pal
Architect Technology, Cognizant Technology Pvt Ltd
Vnet - 205594
Mobile - +91-9766310680

--
Nitin Pawar
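To make the pattern Nitin describes (batch-compute on Hadoop, then serve lookups from a fast index) concrete, here is a minimal SolrJ sketch of the serving side. The core URL, the field names customer_id and summary, and the assumption that a Hadoop job has already indexed the results are all illustrative, and the concrete client class varies between SolrJ versions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Hedged sketch: a customer-support lookup against a Solr core that holds
// results pre-computed by Hadoop jobs. URL, core name, and field names
// ("customer_id", "summary") are illustrative assumptions.
public class CustomerLookup {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/customers");
        SolrQuery query = new SolrQuery("customer_id:12345");
        query.setRows(10);
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("summary"));
        }
    }
}

The point of the design is that the latency-sensitive lookup never touches Hadoop at all; MapReduce only rebuilds or updates the index offline.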
Help with DFSClient Exception.
Hi,

We are frequently observing the exception "java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2. Giving up." on our cluster. The exception occurs while writing a file. We are using Hadoop 0.20.2. It's a ~250-node cluster, and on average one box goes down every 3 days.

Detailed stack trace:

12/05/27 23:26:54 INFO mapred.JobClient: Task Id : attempt_201205232329_28133_r_02_0, Status : FAILED
java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2. Giving up.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3331)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3240)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Our investigation: We have the minimum replication factor set to 2. As mentioned here (http://kazman.shidler.hawaii.edu/ArchDocDecomposition.html), "A call to complete() will not return true until all the file's blocks have been replicated the minimum number of times. Thus, DataNode failures may cause a client to call complete() several times before succeeding", so complete() should be retried several times. org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() does call complete() and retries it 20 times, but in spite of that the file's blocks are not replicated the minimum number of times. The retry count is not configurable. Changing the minimum replication factor to 1 is also not a good idea, since jobs are running continuously on our cluster.

Do we have any solution or workaround for this problem? What minimum replication factor is generally used in industry?

Let me know if any further inputs are required.

Thanks,
-Akshay

--
View this message in context: http://old.nabble.com/Help-with-DFSClient-Exception.-tp33918949p33918949.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
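Since the retry count inside DFSOutputStream.closeInternal() is not configurable in 0.20.2, the usual workaround is to retry at the application level; for MapReduce output the framework already does this by re-running the failed task attempt. For code that writes to HDFS directly, a hedged sketch of such a retry loop (path, payload, and attempt count are illustrative assumptions, not from the thread) might look like this:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: application-level retry around an HDFS write whose close()
// may fail with "could not complete file". All names and counts are assumptions.
public class RetryingWrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/example-output.txt");
        int maxAttempts = 3;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                FSDataOutputStream stream = fs.create(out, true); // overwrite any partial file
                try {
                    stream.writeBytes("record\n");
                } finally {
                    stream.close(); // "could not complete file" surfaces here
                }
                break; // write and close succeeded
            } catch (IOException e) {
                fs.delete(out, false);   // drop the partial file
                if (attempt == maxAttempts) {
                    throw e;             // give up after the last attempt
                }
            }
        }
    }
}

On the replication question: the stock defaults are dfs.replication = 3 with dfs.replication.min = 1; raising the minimum to 2 makes close() noticeably more sensitive to a single DataNode dying mid-write.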
Re: Help with DFSClient Exception.
What's the block size? Also, are you experiencing any slowness in the network? I am guessing you are using EC2; these issues normally come with network problems.

On Mon, May 28, 2012 at 3:57 PM, akshaymb akshaybhara...@gmail.com wrote:

Hi,

We are frequently observing the exception "java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2. Giving up." on our cluster. The exception occurs while writing a file. We are using Hadoop 0.20.2. It's a ~250-node cluster, and on average one box goes down every 3 days.

--
Nitin Pawar
Re: No space left on device
Do you have the JT and NN on the same node?

Look here at Lars Francke's post: http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html

It is a very good walkthrough of how to install a Hadoop cluster; look at the configuration he used for the name and data directories. If those directories are on the same disk and you don't have enough space left, you can hit that exception. My recommendation is to split these directories across separate disks, with a layout very similar to Lars's configuration.

Another recommendation is to check Hadoop's logs. Read about this here: http://www.cloudera.com/blog/2010/11/hadoop-log-location-and-retention/

Regards

On 05/28/2012 02:20 AM, yingnan.ma wrote:

OK, I found it: the JobTracker server's disk is full.

--
Marcos Luis Ortíz Valmaseda
Data Engineer
Sr. System Administrator at UCI
http://marcosluis2186.posterous.com
http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186
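Complementing the JobTracker-side check earlier in this digest, a hedged sketch for the NameNode/DataNode side of this advice: list the configured dfs.name.dir and dfs.data.dir entries (the pre-2.x property names, assumed here rather than quoted from the thread) together with the space on the disk behind each, which quickly shows whether they are competing for one nearly-full disk:

import java.io.File;
import org.apache.hadoop.conf.Configuration;

// Hedged sketch: show the configured NameNode and DataNode directories and the
// total/free space behind each. Property names are 0.20/1.x-era assumptions.
public class CheckDfsDirs {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        for (String key : new String[] {"dfs.name.dir", "dfs.data.dir"}) {
            for (String dir : conf.getStrings(key, new String[0])) {
                File f = new File(dir);
                System.out.printf("%s = %s  total=%d GB  free=%d GB%n",
                        key, dir, f.getTotalSpace() >> 30, f.getUsableSpace() >> 30);
            }
        }
    }
}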
HBase (BigTable) many to many with students and courses
Hello list,

I have some time now to try out HBase and want to use it for a private project. Questions like "How do I transfer one-to-many or many-to-many relations from my RDBMS schema to HBase?" seem to be common. I hope we can collect all the best practices that are out there in this thread.

As the wiki states: one should create two tables, one for students and another for courses. In the students table, one adds one column per selected course, keyed by course_id, besides some columns for the student itself (name, birthday, sex, etc.). Conversely, one fills the courses table with one column per student_id, besides some columns that describe the course itself (name, teacher, begin, end, year, location, etc.).

So far, so good. How do I access these tables efficiently?

A common case would be to show all courses for a student. To do so, one accesses the students table and gets all of the student's course columns. Let's say their names are prefixed IDs. One removes the prefix and then accesses the courses table to get all the courses and their metadata (name, teacher, location, etc.). How do I do this kind of operation efficiently? The naive, brute-force approach seems to be using one Get object per course to fetch the necessary data. Another approach seems to be using the HTable class and unleashing the power of multigets via the batch() method.

All of the information above is theoretical, since I have not used it in code yet (I am currently learning more about the fundamentals of HBase). That is why I put the question to you: how do you do this kind of operation with HBase?

Kind regards,
Em
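For what it's worth, a hedged sketch of the multiget route with the 0.92-era client API. The table names, the info:name column, and the convention that course qualifiers in the students row look like course_<id> are illustrative assumptions, not a prescribed schema:

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Hedged sketch: read a student's course ids from the "students" table, then
// fetch the course rows in one multiget from the "courses" table. Table and
// column names are illustrative assumptions.
public class CoursesForStudent {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable students = new HTable(conf, "students");
        HTable courses = new HTable(conf, "courses");

        // 1. One Get for the student row; course ids live as qualifiers in the
        //    "courses" column family, e.g. courses:course_<id>.
        Result studentRow = students.get(new Get(Bytes.toBytes("student_42")));
        NavigableMap<byte[], byte[]> courseCols =
                studentRow.getFamilyMap(Bytes.toBytes("courses"));

        // 2. Strip the "course_" prefix from each qualifier and build one Get per course.
        List<Get> gets = new ArrayList<Get>();
        if (courseCols != null) {
            for (byte[] qualifier : courseCols.keySet()) {
                String courseId = Bytes.toString(qualifier).replaceFirst("^course_", "");
                gets.add(new Get(Bytes.toBytes(courseId)));
            }
        }

        // 3. One multiget instead of N sequential Gets.
        Result[] courseRows = courses.get(gets);
        for (Result r : courseRows) {
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
        students.close();
        courses.close();
    }
}

HTable.batch() works the same way for a mixed list of operations; for a pure read, get(List<Get>) should already issue the Gets grouped per region server rather than one round trip per course.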
Eclipse Plugin removed from contrib folder, where to find it?
Hi All,

I just downloaded hadoop-1.0.3 from Apache's download page, but to my surprise could not find the Eclipse plugin that normally comes with Hadoop in the contrib folder. I could find the source for building the Hadoop Eclipse plugin in the src/contrib/eclipse-plugin folder, but building it with ant did not produce a jar I could work with either. I have plugins from the 0.20.203 releases, but in my opinion they won't work with the new API. Also, the plugin found at http://code.google.com/edu/parallel/tools/hadoopvm/hadoop-eclipse-plugin.jar supports a very old version of Hadoop.

How would a newbie get started with the Eclipse plugin in the 1.0.x era? Please let me know if I am doing the steps right or missing something.

Thanks in advance,
Varad