Hi Aditya,

You can use any HTML parser if you are fetching or crawling a page from Wikipedia, and skip the sections that repeat on every page. If you are using the Jericho parser, here is what you can do.

    // Package name as of Jericho 3.x; older 2.x releases used au.id.jericho.lib.html.
    import java.net.URL;
    import net.htmlparser.jericho.*;

    URL u = new URL("any english wikipedia page");
    Source src = new Source(u.openConnection().getInputStream());

    // Skip elements that belong to the page template rather than the article.
    TextExtractor textExtractor = new TextExtractor(src) {
        public boolean excludeElement(StartTag startTag) {
            String cls = startTag.getAttributeValue("class");
            String id = startTag.getAttributeValue("id");
            return HTMLElementName.HEAD.equals(startTag.getName())
                || "printfooter".equalsIgnoreCase(cls)
                || "footer".equalsIgnoreCase(id)
                || "references".equalsIgnoreCase(cls)
                || "infobox sisterproject".equalsIgnoreCase(cls)
                || "siteSub".equalsIgnoreCase(id)
                || "dablink".equalsIgnoreCase(cls)
                || "portlet".equalsIgnoreCase(cls)
                || "jump-to-nav".equalsIgnoreCase(id)
                || "mw-hidden-cats-hidden".equalsIgnoreCase(cls)
                || "generated-sidebar portlet".equalsIgnoreCase(cls);
        }
    };

    // Extract the visible text only, without attribute values.
    String parsedText = textExtractor.setIncludeAttributes(false).toString();

The code above does not remove every repetitive element, so you will need to dig a little further into the page markup to find the rest.

If you are working from the XML dump rather than crawling the wiki pages, take any MediaWiki parser that produces HTML and apply the code above to its output, though that does duplicate some effort.

--
Thanks and Regards
Vaijanath N. Rao

----- Original Message -----
From: "Aditya" <aditya.kulka...@gmail.com>
To: java-user@lucene.apache.org
Sent: Saturday, May 2, 2009 4:19:33 PM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi
Subject: REPOST from another list: Question related to improving search results

Hi,

New to this group.

Question: Sites like Wikipedia generally have a template that every page follows. These templates contain words that occur on every page. For example, the Wikipedia template has the list of languages in the left panel. These words get indexed every time, since they are not (and cannot be) stop words.
If a user searches for "Galego", for example, every Wikipedia page will appear in the search results, which is wrong because not every Wikipedia page is about "Galego". Any suggestions on how to solve this problem?

Best Regards,
Aditya

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
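
A follow-up note on the class checks in the Jericho snippet in this thread: it compares the whole class attribute against a fixed string, which is why "portlet" and "generated-sidebar portlet" both had to be listed. A possible alternative is to split the attribute on whitespace and test each class name on its own. The sketch below is only an illustration under that assumption: ClassFilter and hasExcludedClass are hypothetical names, not part of Jericho, and the set of names is adapted from the snippet (for example, "sisterproject" stands in for "infobox sisterproject"). Matching per token is broader than the exact comparison, so it may exclude more elements than the original code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a Jericho class): decides whether a class
// attribute contains any of the template-related class names.
class ClassFilter {
    // Class names adapted from the snippet in this thread; adjust as needed.
    private static final Set<String> EXCLUDED = new HashSet<>(Arrays.asList(
        "printfooter", "references", "sisterproject",
        "dablink", "portlet", "mw-hidden-cats-hidden"));

    // Returns true if any whitespace-separated token of classAttr
    // is in the excluded set (case-insensitive).
    static boolean hasExcludedClass(String classAttr) {
        if (classAttr == null) {
            return false; // element has no class attribute
        }
        for (String token : classAttr.trim().split("\\s+")) {
            if (EXCLUDED.contains(token.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

Inside excludeElement, this could be called as ClassFilter.hasExcludedClass(startTag.getAttributeValue("class")), keeping the separate id checks as they are.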