Hi. I also noticed this exception. It is strange that it happens every time on the same regionserver. I tried to find the directory hdfs://X:9000/hbase/BizDB/735893330, but it does not exist. Very strange, and the history folder in Hadoop is empty.

Would reformatting HDFS help? One more thing, noticed at the last minute: one node in the cluster has a totally different time. Could that cause such problems?

P.S. About the logs: is it possible to send them to some email address directly? Each compressed log file is about 1 MB, and I found exceptions in only 3 of the files.
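P.P.S. In case it is useful, below is the small check/cleanup I plan to run with the cluster shut down, following the suggestion to remove those directories. It is only a rough, untested sketch using the plain Hadoop FileSystem API (./bin/hadoop fs -rmr on the same paths would do exactly the same); hdfs://X:9000 is just the placeholder namenode address from the logs, and the paths are the ones from the FileNotFoundException lines quoted below.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Checks whether the two orphaned store-file directories still exist and,
    // if so, removes their mapfiles/ and info/ entries.  Run only while HBase
    // is shut down.
    public class RemoveOrphanedStoreFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://X:9000");   // placeholder namenode from the logs
        FileSystem fs = FileSystem.get(conf);

        String region = "/hbase/BizDB/735893330/BusinessObject";
        String[] ids = { "647541142630058906", "2243545870343537637" };
        for (String id : ids) {
          for (String sub : new String[] { "mapfiles", "info" }) {
            Path dir = new Path(region + "/" + sub + "/" + id);
            if (fs.exists(dir)) {
              System.out.println("Removing " + dir);
              fs.delete(dir, true);                     // recursive delete of the directory
            } else {
              System.out.println("Already missing: " + dir);
            }
          }
        }
      }
    }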
On Thu, Oct 30, 2008 at 10:25 PM, stack <[EMAIL PROTECTED]> wrote:

> Can you put them someplace that I can pull them?
>
> I took another look at your logs. I see that a region is missing files.
> That means it will never open and just keep trying. Grep your logs for
> FileNotFound. You'll see this:
>
> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException:
> File does not exist:
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906/data
> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException:
> File does not exist:
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637/data
>
> Try shutting down, and removing these files. Remove the following
> directories:
>
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637
>
> Then retry restarting.
>
> You can try and figure how these files got lost by going back in your
> history.
>
> St.Ack
>
> Slava Gorelik wrote:
>
>> Michael, still have the problem, but the logs files are very big (50MB
>> each) even compressed they are bigger than limit for this mailing list.
>> Most of the problems are happened during compaction (i see in the log),
>> may be i can send some parts from logs ?
>>
>> Best Regards.
>>
>> On Thu, Oct 30, 2008 at 8:49 PM, Slava Gorelik <[EMAIL PROTECTED]> wrote:
>>
>>> Sorry, my mistake, i did it for wrong user name. Thanks, updating now,
>>> soon will try again.
>>>
>>> On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi. Very strange, i see in limits.conf that it's upped.
>>>> I attached the limits.conf, please have a look, may be i did it wrong.
>>>>
>>>> Best Regards.
>>>>
>>>> On Thu, Oct 30, 2008 at 7:52 PM, stack <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Thanks for the logs Slava. I notice that you have not upped the ulimit
>>>>> on your cluster. See the head of your logs where we print out the
>>>>> ulimit. Its 1024. This could be one cause of your grief especially when
>>>>> you seemingly have many regions (>1000). Please try upping it.
>>>>> St.Ack
>>>>>
>>>>> Slava Gorelik wrote:
>>>>>
>>>>>> Hi.
>>>>>> I enabled DEBUG log level and now I'm sending all logs (archived)
>>>>>> including fsck run result.
>>>>>> Today my program starting to fail couple of minutes from the begin,
>>>>>> it's very easy to reproduce the problem, cluster became very unstable.
>>>>>>
>>>>>> Best Regards.
>>>>>>
>>>>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>> Slava Gorelik wrote:
>>>>>>
>>>>>> Hi. First of all i want to say thank you for you assistance !!!
>>>>>>
>>>>>> DEBUG on hadoop or hbase ? And how can i enable ?
>>>>>> fsck said that HDFS is healthy.
>>>>>>
>>>>>> Best Regards and Thank You
>>>>>>
>>>>>> On Tue, Oct 28, 2008 at 8:45 PM, stack <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Slava Gorelik wrote:
>>>>>>
>>>>>> Hi. HDFS capacity is about 800gb (8 datanodes) and the current usage
>>>>>> is about 30GB. This is after total re-format of the HDFS that was made
>>>>>> a hour before.
>>>>>>
>>>>>> BTW, the logs i sent are from the first exception that i found in them.
>>>>>> Best Regards.
>>>>>>
>>>>>> Please enable DEBUG and retry. Send me all logs. What does the fsck on
>>>>>> HDFS say? There is something seriously wrong with your cluster that you
>>>>>> are having so much trouble getting it running. Lets try and figure it.
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>> On Tue, Oct 28, 2008 at 7:12 PM, stack <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> I took a quick look Slava (Thanks for sending the files). Here's a few
>>>>>> notes:
>>>>>>
>>>>>> + The logs are from after the damage is done; the transition from good
>>>>>> to bad is missing. If I could see that, that would help
>>>>>> + But what seems to be plain is that that your HDFS is very sick. See
>>>>>> this from head of one of the regionserver logs:
>>>>>>
>>>>>> 2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer
>>>>>> Exception: java.io.IOException: Unable to create new block.
>>>>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>>>>
>>>>>> 2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error
>>>>>> Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
>>>>>> 2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>>>> Compaction/Split failed for region
>>>>>> BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>>>>> java.io.IOException: Could not get block locations. Aborting...
>>>>>>
>>>>>> If HDFS is ailing, hbase is too. In fact, the regionservers will shut
>>>>>> themselves to protect themselves against damaging or losing data:
>>>>>>
>>>>>> 2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher:
>>>>>> Replay of hlog required. Forcing server restart
>>>>>>
>>>>>> So, whats up with your HDFS? Not enough space alloted? What happens if
>>>>>> you run "./bin/hadoop fsck /"? Does that give you a clue as to what
>>>>>> happened? Dig in the datanode and namenode logs. Look for where the
>>>>>> exceptions start. It might give you a clue.
>>>>>>
>>>>>> + The suse regionserver log had garbage in it.
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>> Slava Gorelik wrote:
>>>>>>
>>>>>> Hi.
>>>>>> My happiness was very short :-( After i successfully added 1M rows (50k
>>>>>> each row) i tried to add 10M rows.
>>>>>> And after 3-4 working hours it started to dying. First one region server
>>>>>> is died, after another one and eventually all cluster is dead.
>>>>>>
>>>>>> I attached log files (relevant part, archived) from region servers and
>>>>>> from the master.
>>>>>>
>>>>>> Best Regards.
>>>>>>
>>>>>> On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Hi.
>>>>>> So far so good, after changing the file descriptors
>>>>>> and dfs.datanode.socket.write.timeout, dfs.datanode.max.xcievers
>>>>>> my cluster works stable.
>>>>>> Thank You and Best Regards.
>>>>>>
>>>>>> P.S. Regarding deleting multiple columns missing functionality i
>>>>>> filled jira: https://issues.apache.org/jira/browse/HBASE-961
>>>>>>
>>>>>> On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Slava Gorelik wrote:
>>>>>>
>>>>>> Hi. Haven't tried yet them, i'll try tomorrow morning. In general
>>>>>> cluster is working well, the problems begins if i'm trying to add 10M
>>>>>> rows, after 1.2M if happened.
>>>>>>
>>>>>> Anything else running beside the regionserver or datanodes that would
>>>>>> suck resources? When datanodes begin to slow, we begin to see the issue
>>>>>> Jean-Adrien's configurations address. Are you uploading using MapReduce?
>>>>>> Are TTs running on same nodes as the datanode and regionserver? How are
>>>>>> you doing the upload? Describe what your uploader looks like (Sorry if
>>>>>> you've already done this).
>>>>>>
>>>>>> I already changed the limit of files descriptors,
>>>>>>
>>>>>> Good.
>>>>>>
>>>>>> I'll try to change the properties:
>>>>>> <property>
>>>>>>   <name>dfs.datanode.socket.write.timeout</name>
>>>>>>   <value>0</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>dfs.datanode.max.xcievers</name>
>>>>>>   <value>1023</value>
>>>>>> </property>
>>>>>>
>>>>>> Yeah, try it.
>>>>>>
>>>>>> And let you know, is any other prescriptions ? Did i miss something ?
>>>>>>
>>>>>> BTW, off topic, but i sent e-mail recently to the list and i can't see
>>>>>> it: Is it possible to delete multiple columns in any way by regex : for
>>>>>> example colum_name_* ?
>>>>>>
>>>>>> Not that I know of. If its not in the API, it should be. Mind filing a
>>>>>> JIRA?
>>>>>>
>>>>>> Thanks Slava.
>>>>>> St.Ack
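One more note on the delete-columns-by-regex question in the older part of the thread quoted above (the reason I filed HBASE-961): since there is no regex or prefix delete in the API today, the workaround I am using is to fetch the row, pick out the matching column names on the client side, and delete them in one BatchUpdate. The sketch below is written from memory against the 0.18-era client API (getRow/RowResult/BatchUpdate), so class and method names may need adjusting for your version; it is an illustration with made-up row and column names, not tested code.

    import java.util.Map;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.BatchUpdate;
    import org.apache.hadoop.hbase.io.Cell;
    import org.apache.hadoop.hbase.io.RowResult;

    public class DeleteColumnsByPrefix {

      // Deletes every column of one row whose name starts with the given
      // prefix, e.g. "BusinessObject:colum_name_".
      public static void deleteByPrefix(HTable table, String row, String prefix)
          throws Exception {
        RowResult result = table.getRow(row);      // fetch all columns of the row
        if (result == null) {
          return;                                  // nothing stored for this row
        }
        BatchUpdate update = new BatchUpdate(row);
        for (Map.Entry<byte[], Cell> entry : result.entrySet()) {
          String column = new String(entry.getKey());
          if (column.startsWith(prefix)) {
            update.delete(column);                 // queue a delete for this column
          }
        }
        table.commit(update);                      // apply all deletes in one batch
      }

      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "BizDB");
        deleteByPrefix(table, "some-row-key", "BusinessObject:colum_name_");
      }
    }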
