Hi Michael. After reformatting HDFS, HBase started working like a Swiss clock. It ran under intensive load from 8 clients for about 30 hours.

Just a small question: after about 28 hours (when I came back to work) I found that one of the 7 datanodes in Hadoop is at about 98% usage, while all the others are at about 30%. Is this normal?

Best Regards.
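
(As a quick check on that imbalance -- a sketch using the stock Hadoop command line, run from the Hadoop home directory; the -threshold value below is only an example:

  # Show per-datanode capacity and usage, to confirm which node is at 98%:
  ./bin/hadoop dfsadmin -report

  # Ask HDFS to even out block placement; -threshold is the allowed deviation,
  # in percent, from the cluster's average utilization:
  ./bin/hadoop balancer -threshold 10

The balancer moves block replicas between datanodes until each node is within the threshold of the cluster average.)
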
On Fri, Oct 31, 2008 at 10:16 PM, Slava Gorelik <[EMAIL PROTECTED]> wrote:

> Hi. No problem with the silly question :-) Yes, sure I replaced it. Here is
> the list of folders that begin with 73*:
>
> drwxr-xr-x - XXXXXXXXX supergroup 0 2008-10-29 11:13 /hbase/BizDB/732078971/BusinessObject
> drwxr-xr-x - XXXXXXXX supergroup 0 2008-10-29 11:13 /hbase/BizDB/732215319/BusinessObject
> drwxr-xr-x - XXXXXXXX supergroup 0 2008-10-29 11:13 /hbase/BizDB/733411255/BusinessObject
> drwxr-xr-x - XXXXXXXX supergroup 0 2008-10-29 11:14 /hbase/BizDB/733598097/BusinessObject
> drwxr-xr-x - XXXXXXXX supergroup 0 2008-10-29 10:50 /hbase/BizDB/734145833/BusinessObject
> drwxr-xr-x - XXXXXXXX supergroup 0 2008-10-29 11:09 /hbase/BizDB/735612900/BusinessObject
> drwxr-xr-x - XXXXXXXX supergroup 0 2008-10-29 11:15 /hbase/BizDB/738009120/BusinessObject
>
> There is no 735893330 folder.
>
> Scanning .META. in the shell is not easy at all. .META. is huge, and a
> simple scan without specifying a column would take about 10 minutes just to
> list the .META. content, so I failed to find 735893330. Maybe you can give
> me the name of the column where this info is kept?
>
> I think I'll reformat HDFS and start from a clean environment, and then
> we'll see. I'll do it this Sunday and let you know.
>
> Best Regards, and a big thank you for your patience and assistance.
>
> On Fri, Oct 31, 2008 at 4:47 AM, Michael Stack <[EMAIL PROTECTED]> wrote:
>
>> Slava Gorelik wrote:
>>
>>> Hi. I also noticed this exception. Strangely, this exception happens
>>> every time on the same regionserver. I tried to find the directory
>>> hdfs://X:9000/hbase/BizDB/735893330 -- it does not exist. Very strange,
>>> but the history folder in Hadoop is empty.
>>
>> It is odd indeed that the system keeps trying to load a region that does
>> not exist.
>>
>> I don't think it is necessarily the same regionserver that is
>> responsible. I'd think it an attribute of the region that we're trying to
>> deploy on that server.
>>
>> Silly question: you did replace 'X' with your machine name in the above?
>>
>> If you restart, does it still try to load this nonexistent region?
>>
>> If so, the .META. table is not consistent with what's on the filesystem;
>> they've gotten out of sync. Describing how to repair that is involved.
>>
>>> Would reformatting HDFS help?
>>
>> Do a "scan '.META.'" in the shell. Do you see your region listed? (Look
>> at the encoded name attribute to find 735893330.)
>>
>> If your table is damaged -- I'd guess it is because ulimit was bad up to
>> this point -- the best thing might be to start over.
>>
>>> One more thing at the last minute: I found that one node in the cluster
>>> has a totally different time. Could this be the cause of such problems?
>>
>> We thought we'd fixed all problems that could arise from time skew, but
>> you never know. In our requirements, clocks must be synced. Fix this too
>> if you can before reloading.
>>
>>> P.S. About the logs: is it possible to send them to some email address?
>>> Each log file compressed is about 1MB, and only in 3 files did I find
>>> exceptions.
>>
>> There probably is such functionality, but I'm not familiar with it. Can
>> you put them under a webserver at your place so I can grab them? You can
>> send me the URL off-list if you like.
>>
>> Thanks for your patience Slava. We'll figure it out.
>>
>> St.Ack
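
(For the scan suggested above, restricting it to a single column keeps the output manageable. A rough sketch -- the column-list syntax is from memory for the 0.18-era shell, and it assumes the shell accepts piped input, so adjust to your version:

  # Scan only info:regioninfo, where the region's attributes live, and grep
  # for the missing region's encoded name instead of paging through .META.:
  echo "scan '.META.', ['info:regioninfo']" | ./bin/hbase shell | grep 735893330

If your version does not print the encoded name in the regioninfo output, you will have to eyeball the region names instead.)
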
>> On Thu, Oct 30, 2008 at 10:25 PM, stack <[EMAIL PROTECTED]> wrote:
>>
>>>> Can you put them someplace that I can pull them?
>>>>
>>>> I took another look at your logs. I see that a region is missing files.
>>>> That means it will never open; it will just keep trying. Grep your logs
>>>> for FileNotFound. You'll see this:
>>>>
>>>> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException: File does not exist:
>>>>   hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906/data
>>>> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException: File does not exist:
>>>>   hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637/data
>>>>
>>>> Try shutting down and removing these files. Remove the following
>>>> directories:
>>>>
>>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
>>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
>>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
>>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637
>>>>
>>>> Then retry restarting.
>>>>
>>>> You can try to figure out how these files got lost by going back through
>>>> your history.
>>>>
>>>> St.Ack
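
(The removal step above maps onto the Hadoop fs shell like so, once HBase is shut down -- a sketch, keeping the same X placeholder for the namenode host:

  ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
  ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
  ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
  ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637

-rmr deletes each directory recursively; double-check the paths before running it.)
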
>>>>
>>>> Slava Gorelik wrote:
>>>>
>>>>> Michael, I still have the problem, but the log files are very big (50MB
>>>>> each); even compressed they are bigger than the limit for this mailing
>>>>> list. Most of the problems happened during compaction (I can see that
>>>>> in the log). Maybe I can send some parts of the logs?
>>>>>
>>>>> Best Regards.
>>>>>
>>>>> On Thu, Oct 30, 2008 at 8:49 PM, Slava Gorelik <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Sorry, my mistake, I did it for the wrong user name. Thanks, updating
>>>>>> now; I will soon try again.
>>>>>>
>>>>>> On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Hi. Very strange: I see in limits.conf that it has been upped. I
>>>>>>> attached the limits.conf; please have a look, maybe I did it wrong.
>>>>>>>
>>>>>>> Best Regards.
>>>>>>>
>>>>>>> On Thu, Oct 30, 2008 at 7:52 PM, stack <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Thanks for the logs Slava. I notice that you have not upped the
>>>>>>>> ulimit on your cluster. See the head of your logs, where we print
>>>>>>>> out the ulimit: it is 1024. This could be one cause of your grief,
>>>>>>>> especially since you seemingly have many regions (>1000). Please try
>>>>>>>> upping it.
>>>>>>>>
>>>>>>>> St.Ack
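
(The usual way to up it is an entry in /etc/security/limits.conf for whichever user runs the Hadoop and HBase daemons, followed by a fresh login so the new limit takes effect. The user name and value below are only examples:

  # /etc/security/limits.conf
  hadoop  soft  nofile  32768
  hadoop  hard  nofile  32768

  # verify after logging in again as that user:
  ulimit -n

The daemons pick up the new limit only when restarted from a session that already has it.)
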
>>>>>>>>
>>>>>>>> Slava Gorelik wrote:
>>>>>>>>
>>>>>>>>> Hi. I enabled the DEBUG log level and now I'm sending all the logs
>>>>>>>>> (archived), including the fsck run result. Today my program started
>>>>>>>>> failing a couple of minutes in; the problem is very easy to
>>>>>>>>> reproduce, and the cluster has become very unstable.
>>>>>>>>>
>>>>>>>>> Best Regards.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <[EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>   See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>>>>>>>>
>>>>>>>>>   St.Ack
>>>>>>>>>
>>>>>>>>>   Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>     Hi. First of all I want to say thank you for your assistance!!!
>>>>>>>>>
>>>>>>>>>     DEBUG on Hadoop or HBase? And how can I enable it?
>>>>>>>>>     fsck said that HDFS is healthy.
>>>>>>>>>
>>>>>>>>>     Best Regards and Thank You
>>>>>>>>>
>>>>>>>>>     On Tue, Oct 28, 2008 at 8:45 PM, stack <[EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>       Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>         Hi. HDFS capacity is about 800GB (8 datanodes) and the
>>>>>>>>>         current usage is about 30GB. This is after the total
>>>>>>>>>         reformat of HDFS that was done an hour before.
>>>>>>>>>
>>>>>>>>>         BTW, the logs I sent are from the first exception that I
>>>>>>>>>         found in them. Best Regards.
>>>>>>>>>
>>>>>>>>>       Please enable DEBUG and retry. Send me all the logs. What
>>>>>>>>>       does the fsck on HDFS say? There is something seriously wrong
>>>>>>>>>       with your cluster for you to be having so much trouble
>>>>>>>>>       getting it running. Let's try and figure it out.
>>>>>>>>>
>>>>>>>>>       St.Ack
>>>>>>>>>
>>>>>>>>>       On Tue, Oct 28, 2008 at 7:12 PM, stack <[EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>         I took a quick look, Slava (thanks for sending the files).
>>>>>>>>>         Here are a few notes:
>>>>>>>>>
>>>>>>>>>         + The logs are from after the damage is done; the
>>>>>>>>>           transition from good to bad is missing. If I could see
>>>>>>>>>           that, it would help.
>>>>>>>>>
>>>>>>>>>         + What seems plain is that your HDFS is very sick. See this
>>>>>>>>>           from the head of one of the regionserver logs:
>>>>>>>>>
>>>>>>>>>           2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>>>>>>>>>               at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>>>>>>>               at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>>>>>>>               at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>>>>>>>           2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
>>>>>>>>>           2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>>>>>>>>           java.io.IOException: Could not get block locations. Aborting...
>>>>>>>>>
>>>>>>>>>         If HDFS is ailing, HBase is too. In fact, the regionservers
>>>>>>>>>         will shut themselves down to protect themselves against
>>>>>>>>>         damaging or losing data:
>>>>>>>>>
>>>>>>>>>           2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server restart
>>>>>>>>>
>>>>>>>>>         So, what's up with your HDFS? Not enough space allotted?
>>>>>>>>>         What happens if you run "./bin/hadoop fsck /"? Does that
>>>>>>>>>         give you a clue as to what happened? Dig into the datanode
>>>>>>>>>         and namenode logs and look for where the exceptions start;
>>>>>>>>>         that might give you a clue.
>>>>>>>>>
>>>>>>>>>         + The suse regionserver log had garbage in it.
>>>>>>>>>
>>>>>>>>>         St.Ack
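
(A sketch of that digging -- log locations and names follow the stock layout under the Hadoop home directory and will differ per install:

  # Verbose fsck: list files, their blocks, and the datanodes holding each
  # block, to see whether blocks are missing or under-replicated:
  ./bin/hadoop fsck / -files -blocks -locations

  # Find where the exceptions start in the datanode and namenode logs:
  grep -n Exception logs/hadoop-*-datanode-*.log | head
  grep -n Exception logs/hadoop-*-namenode-*.log | head
)
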
>>>>>>>>>         Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>           Hi. My happiness was very short :-( After I successfully
>>>>>>>>>           added 1M rows (50KB each) I tried to add 10M rows, and
>>>>>>>>>           after 3-4 working hours it started dying: first one
>>>>>>>>>           regionserver died, then another, and eventually the whole
>>>>>>>>>           cluster was dead.
>>>>>>>>>
>>>>>>>>>           I attached log files (the relevant parts, archived) from
>>>>>>>>>           the regionservers and from the master.
>>>>>>>>>
>>>>>>>>>           Best Regards.
>>>>>>>>>
>>>>>>>>>           On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <[EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>             Hi. So far so good: after changing the file descriptors
>>>>>>>>>             and dfs.datanode.socket.write.timeout,
>>>>>>>>>             dfs.datanode.max.xcievers, my cluster works stably.
>>>>>>>>>
>>>>>>>>>             Thank You and Best Regards.
>>>>>>>>>
>>>>>>>>>             P.S. Regarding the missing functionality for deleting
>>>>>>>>>             multiple columns, I filed a JIRA:
>>>>>>>>>             https://issues.apache.org/jira/browse/HBASE-961
>>>>>>>>>
>>>>>>>>>             On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack <[EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>               Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>                 Hi. Haven't tried them yet; I'll try tomorrow
>>>>>>>>>                 morning. In general the cluster is working well;
>>>>>>>>>                 the problems begin when I try to add 10M rows --
>>>>>>>>>                 it happened after 1.2M.
>>>>>>>>>
>>>>>>>>>               Is anything else running besides the regionserver or
>>>>>>>>>               datanodes that would suck up resources? When
>>>>>>>>>               datanodes begin to slow, we begin to see the issue
>>>>>>>>>               that Jean-Adrien's configurations address. Are you
>>>>>>>>>               uploading using MapReduce? Are TTs running on the
>>>>>>>>>               same nodes as the datanode and regionserver? How are
>>>>>>>>>               you doing the upload? Describe what your uploader
>>>>>>>>>               looks like (sorry if you've already done this).
>>>>>>>>>
>>>>>>>>>                 I already changed the limit of file descriptors,
>>>>>>>>>
>>>>>>>>>               Good.
>>>>>>>>>
>>>>>>>>>                 I'll try to change the properties:
>>>>>>>>>
>>>>>>>>>                 <property>
>>>>>>>>>                   <name>dfs.datanode.socket.write.timeout</name>
>>>>>>>>>                   <value>0</value>
>>>>>>>>>                 </property>
>>>>>>>>>
>>>>>>>>>                 <property>
>>>>>>>>>                   <name>dfs.datanode.max.xcievers</name>
>>>>>>>>>                   <value>1023</value>
>>>>>>>>>                 </property>
>>>>>>>>>
>>>>>>>>>               Yeah, try it.
>>>>>>>>>
>>>>>>>>>                 And I'll let you know. Are there any other
>>>>>>>>>                 prescriptions? Did I miss something?
>>>>>>>>>
>>>>>>>>>                 BTW, off topic, but I sent an e-mail to the list
>>>>>>>>>                 recently and I can't see it: is it possible to
>>>>>>>>>                 delete multiple columns in any way by regex, for
>>>>>>>>>                 example colum_name_*?
>>>>>>>>>
>>>>>>>>>               Not that I know of. If it's not in the API, it should
>>>>>>>>>               be. Mind filing a JIRA?
>>>>>>>>>
>>>>>>>>>               Thanks Slava.
>>>>>>>>>               St.Ack
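
(Until HBASE-961 or something like it lands, one workaround is to expand the column names yourself and delete them one at a time. A sketch with made-up row key and column names, assuming the old shell's delete 'table', 'row', 'column' form -- each pipe starts a fresh shell, so this is slow but simple:

  for c in colum_name_1 colum_name_2 colum_name_3; do
    echo "delete 'BizDB', 'some-row-key', 'BusinessObject:$c'" | ./bin/hbase shell
  done
)
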
