Hi Robin,

Thanks for the response. To answer your questions:

0. The setup is Mahout 0.1 and Hadoop 0.19.2 - I think I am using a branch version. I am currently trying to install the trunk version.
1. The data I am trying to classify is from scientific papers - essentially the abstract title, text and keywords of each paper (example below).
2. No data source is under 300 characters.
3. I am training using the Mahout naive Bayes and am getting a low rate of incorrectly classified documents, something like 1.67% - I'm quite happy with that…
4. After I have trained the model, I use the Mahout naive Bayes classify() method to classify new (unseen) data (with the classification already known). This is where I start to get problems: I get very poor classification rates for new data, something like 82% classified incorrectly.

To summarise: I get very good results in training and very poor results with new data.

I have posted on this before and it was suggested that I use the trunk version. I am still working on that and will let you know if it is successful and clears up this problem - it's tricky, as there are many jars missing after I downloaded it. Could be a bit smoother IMHO. Will persevere. Any hints/comments here to help?

In the meantime (as I work on that), I was wondering whether it could be something to do with the data itself. Perhaps I should use more papers per file, or increase the data per paper in the files? Any comments on this…

PS: Thank you for the fix on the "priority queue implementation of hadoop" problem (which was addressed in another post), Robin. Perhaps this fix will address the high error rates for the new data? Or perhaps the trunk version will fix it - I am not sure.
Still working on the installation… I would appreciate your comments, though, on any/all of the above…

Example of data below (the class in this case is War):

War [Characteristics of war wound infection] War wounds are the most complex type of non-targeted injuries due to uncontrolled tissue damage of varied and multifold localizations, exposing sterile body areas to contamination with a huge amount of bacteria. Wound contamination is caused by both the host microflora and exogenous agents from the environment (bullets, cloth fragments, dust, dirt, water) due to destruction of the host protective barriers. War wounds are the consequence of destructive effects of various types of projectiles, which result in massive tissue devitalization, hematomas, and compromised circulation with tissue ischemia or anoxia. This environment is highly favorable for proliferation of bacteria and their invasion in the surrounding tissue over a relatively short period of time. War wounds are associated with a high risk of local and systemic infection. The infection will develop unless a timely combined treatment is undertaken, including surgical intervention within 6 hours of wounding and antibiotic therapy administered immediately or at latest in 3 hours of wound infliction. Time is a crucial factor in this type of targeted combined treatment consisting of surgical debridement, appropriate empirical antimicrobial therapy, and specific antitetanic prophylaxis. Apart from exposure factors, there are a number of predisposing factors that favor the development of polymicrobial aerobic-anaerobic infection. These are shock, pain, blood loss, hypoxia, hematomas, type and amount of traumatized tissue, age, and comorbidity factors in the wounded.
The determinants that define the spectrum of etiologic agents in contaminated war wounds are: wound type, body region involved, time interval between wounding and primary surgical treatment, climate factors, season, geographical area, hygienic conditions, and patient habits. The etiologic agents of infection include gram-positive aerobic cocci, i. e. Staphylococcus spp, Streptococcus spp and Enterococcus spp, which belong to the physiological flora of the human skin and mucosa; gram-negative facultative aerobic rods; members of the family Enterobacteriacea (Escherichia coil, Proteus mirabilis, Klebsiella pneumoniae, Enterobacter cloacae), which predominate in the physiological flora of the intestines, transitory flora of the skin and environment; gram-negative bacteria, i. e. Pseudomonas aeruginosa, Serratia marcescens, Acinetobacter calcoaceticus - A. baumanii complex; environmental bacteria associated with humid environment and dust; anaerobic gram-positive sporogeneous rods Clostridium spp, gram-negative asporogeneous rods Bacteroides spp and gram-positive anaerobic cocci; Peptostreptococcus spp and Peptococcus spp. The latter usually colonize the intestine, primarily the colon, and the skin, while clostridium spores are also found in the environment. Early empirical antibiotic therapy is used instead of standard antibiotic prophylaxis. Empirical antimicrobial therapy is administered to prevent the development of systemic infection, gas gangrene, necrotizing infection of soft tissue, intoxication and death. The choice of antibiotics is determined by the presumed infective agents and localization of the wound. It is used in all types of war wounds over 5-7-10 days. The characteristics of antibiotics used in war wounds are the following: broad spectrum of activity, ability to penetrate deep into the tissue, low toxicity, long half-life, easy storage and application, and cost effectiveness. The use of antibiotics is not a substitution for surgical treatment. 
The expected incidence of infection, according to literature data, is 35%-40%. If the time elapsed until surgical debridement exceeds 12 hours, or the administration of antibiotics exceeds 6 hours of wound infliction, primary infection of the war wound occurs (early infection) in more than 50% of cases. The keys for the prevention of infection are prompt and thorough surgical exploration of the wound, administration of antibiotics and antitetanic prophylaxis, awareness of the probable pathogens with respect to localization of the wound, and optimal choice of antibiotics and length of their administration.

----- Original Message -----
From: "Robin Anil"
To: [email protected]
Subject: Re: Document size rules of thumb
Date: Wed, 7 Oct 2009 18:00:58 +0530

Hi Sandra,

Could you explain your setup and what kind of dataset it is? The Mahout Naive Bayes/CBayes (not Bayesian network) classifier is built with text articles or documents in mind. The characteristics might change if the documents you wish to classify are 140-character SMS or Twitter messages (it won't affect much, though). Could you tell me what kind of results you are getting? Then, by looking at the data and the scores generated, we can see what to tune.

Robin

On Wed, Oct 7, 2009 at 4:58 PM, Sandra Clover wrote:
> Hi,
> Just wondering, do you have any nice rules-of-thumb or any other
> guides (characteristics) as to the minimum size of the documents used in
> training the complementary Bayesian network? I would appreciate any
> comments/views/opinions/rules-of-thumb/experiences that you may be able
> to offer on good characteristics of the documents that go into training
> (particularly when you have a large number of categories to
> classify)...
> Thanking you,
> Sandra.
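
On the gap described above between the 1.67% training error and the roughly 82% error on new data: a minimal sketch, in Python rather than Mahout's Java, of how the two error rates could be compared against a trivial baseline. Nothing here is Mahout API; the function names and the labels in the usage lines are hypothetical and only illustrate the arithmetic. If the error on new data is close to the majority-class baseline, the model may not be generalising at all, which would point at the data or setup rather than at tuning.

```python
from collections import Counter


def error_rate(true_labels, predicted_labels):
    """Fraction of documents whose predicted label differs from the known label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one document")
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)


def majority_baseline_error(train_labels, test_labels):
    """Error rate obtained by always predicting the most frequent training class."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return error_rate(test_labels, [most_common_label] * len(test_labels))


def documents_per_class(labels):
    """Number of documents for each class, to spot classes with very little data."""
    return Counter(labels)


# Hypothetical labels, for illustration only (not taken from the thread).
train = ["War", "War", "Peace"]
held_out_truth = ["War", "Peace", "Peace", "Peace"]
held_out_predicted = ["War", "Peace", "War", "Peace"]

print(error_rate(held_out_truth, held_out_predicted))
print(majority_baseline_error(train, held_out_truth))
print(documents_per_class(train))
```

The per-class counts are included because, with a large number of categories, a few classes with very few training documents can dominate the error on unseen data.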
