Hi Aditya,

You can use any HTML parser if you are fetching/crawling pages from Wikipedia,
and ignore the sections that are repetitive.
If you are using the Jericho parser, here is what you can do.

    URL u = new URL("any english wikipedia page");
    Source src = new Source(u.openConnection().getInputStream());
    TextExtractor textExtractor = new TextExtractor(src) {
        // Return true for every element whose text should be left out.
        // Each value is compared against the whole class/id attribute.
        public boolean excludeElement(StartTag startTag) {
            return startTag.getName() == HTMLElementName.HEAD
                || "printfooter".equalsIgnoreCase(startTag.getAttributeValue("class"))
                || "footer".equalsIgnoreCase(startTag.getAttributeValue("id"))
                || "references".equalsIgnoreCase(startTag.getAttributeValue("class"))
                || "infobox sisterproject".equalsIgnoreCase(startTag.getAttributeValue("class"))
                || "siteSub".equalsIgnoreCase(startTag.getAttributeValue("id"))
                || "dablink".equalsIgnoreCase(startTag.getAttributeValue("class"))
                || "portlet".equalsIgnoreCase(startTag.getAttributeValue("class"))
                || "jump-to-nav".equalsIgnoreCase(startTag.getAttributeValue("id"))
                || "mw-hidden-cats-hidden".equalsIgnoreCase(startTag.getAttributeValue("class"))
                || "generated-sidebar portlet".equalsIgnoreCase(startTag.getAttributeValue("class"));
        }
    };
    // Extract the remaining text without including attribute values.
    String parsedText = textExtractor.setIncludeAttributes(false).toString();

The code above does not remove all of the repetitive content, though, so you
will need to dig a little further into the page markup to find the rest. If you
are not crawling the wiki pages and are using the XML dump instead, take any
MediaWiki parser that will give you the HTML and then apply the code above,
but yes, that will be some duplicated effort.
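
Since the list of excluded elements will probably grow as you dig into the
page, it may be easier to keep the class and id values in sets instead of one
long chain of || conditions. Below is a minimal sketch of that idea. The
BoilerplateFilter class and its isBoilerplate method are hypothetical names of
mine, not part of Jericho. Note one assumption that differs from the code
above: the class attribute is split on whitespace and each token is checked on
its own, so "sisterproject" and "portlet" match wherever they appear, which is
somewhat broader than the exact whole-attribute comparison.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Jericho class): decides whether an element is page boilerplate. */
final class BoilerplateFilter {
    // Lower-cased class tokens taken from the excludeElement example; extend as needed.
    private static final Set<String> EXCLUDED_CLASSES = new HashSet<>(Arrays.asList(
            "printfooter", "references", "sisterproject", "dablink",
            "portlet", "mw-hidden-cats-hidden"));

    // Lower-cased id values taken from the same example.
    private static final Set<String> EXCLUDED_IDS = new HashSet<>(Arrays.asList(
            "footer", "sitesub", "jump-to-nav"));

    /** classAttr and idAttr may be null when the element has no such attribute. */
    static boolean isBoilerplate(String tagName, String classAttr, String idAttr) {
        if ("head".equalsIgnoreCase(tagName)) {
            return true;
        }
        if (idAttr != null && EXCLUDED_IDS.contains(idAttr.trim().toLowerCase(Locale.ROOT))) {
            return true;
        }
        if (classAttr == null) {
            return false;
        }
        // Assumption: treat the class attribute as a whitespace-separated token list.
        for (String token : classAttr.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (EXCLUDED_CLASSES.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

Inside excludeElement you would then return
BoilerplateFilter.isBoilerplate(startTag.getName(),
startTag.getAttributeValue("class"), startTag.getAttributeValue("id")),
and adding a newly discovered class or id becomes a one-word change.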

--Thanks and Regards
Vaijanath N. Rao

----- Original Message -----
From: "Aditya" <aditya.kulka...@gmail.com>
To: java-user@lucene.apache.org
Sent: Saturday, May 2, 2009 4:19:33 PM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi
Subject: REPOST from another list: Question related to improving search results

Hi,

New to this group.

Question:

Generally, sites like Wikipedia have a template that every page follows.
These templates contain words that occur on every page.

For example, the Wikipedia template has the list of languages in the left
panel. These words get indexed every time, since they are not (and cannot be)
stop words.

If a user searches for "Galego", for example, every Wikipedia page will be in
the search results, which is wrong because not every Wikipedia page talks
about "Galego".

Any takes on how to solve this problem?

Best Regards,

Aditya
