[Nutch-dev] Re: index segmentation

Jack Tang Wed, 08 Jun 2005 03:48:42 -0700

Hi guys.

I used Nutch's Query class to translate my query string, and here is the result:
----------------------------------------------------------------------------------------------------------------------
Query: textonly:nutch
Parsed: nutctextonly nutch
Translated: +(url:nutctextonly^4.0 anchor:nutctextonly^2.0
content:nutctextonly) +(url:nutch^4.0 anchor:nutch^2.0 content:nutch)
url:"nutctextonly nutch"~2147483647^4.0 anchor:"nutctextonly
nutch"~4^2.0 content:"nutctextonly nutch"~2147483647
---------------------------------------------------------------------------------------------------------------------


It seems the "textonly" field is not searched at all, right?

Regards
/Jack
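
Editor's sketch: the symptom above (every clause landing in the default field, so a field-specific filter never fires) can be reproduced with a toy model. This is not Nutch's parser or its FieldQueryFilter. The class FieldClauseSketch, its parse and filter methods, and the rule that a "name:term" token only becomes a field clause when "name" is a known field are all assumptions made for illustration; the thread does not confirm that this is what happens inside Nutch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Toy model only: this is NOT Nutch's parser or its FieldQueryFilter.
class FieldClauseSketch {
    static final String DEFAULT_FIELD = "DEFAULT";

    // A parsed clause: the field it was assigned to, plus its term.
    static final class Clause {
        final String field;
        final String term;

        Clause(String field, String term) {
            this.field = field;
            this.term = term;
        }
    }

    // ASSUMPTION (hypothetical): "name:term" becomes a field clause only when
    // "name" is a field the parser knows about; otherwise the whole token
    // stays in the default field.
    static List<Clause> parse(String query, Set<String> knownFields) {
        List<Clause> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            int colon = token.indexOf(':');
            if (colon > 0 && knownFields.contains(token.substring(0, colon))) {
                clauses.add(new Clause(token.substring(0, colon), token.substring(colon + 1)));
            } else {
                clauses.add(new Clause(DEFAULT_FIELD, token));
            }
        }
        return clauses;
    }

    // Same shape as the check quoted in this thread: clauses that belong to
    // another field are skipped.
    static List<String> filter(List<Clause> input, String field) {
        List<String> output = new ArrayList<>();
        for (Clause c : input) {
            if (!c.field.equals(field)) continue; // skip non-matching clauses
            output.add(field + ":" + c.term);
        }
        return output;
    }

    public static void main(String[] args) {
        Set<String> registered = Collections.singleton("textonly");
        Set<String> none = Collections.emptySet();
        System.out.println(filter(parse("textonly:nutch", registered), "textonly"));
        System.out.println(filter(parse("textonly:nutch", none), "textonly"));
    }
}
```

In this model the first call yields one "textonly" clause and the second yields an empty list, because the only clause carries the field "DEFAULT" and is skipped. If the real cause is similar, the thing to check would be whether the "textonly" field is actually registered at the time the query is parsed.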


On 6/8/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi Doug
> 
> I'm not familiar with query "fields" in Nutch. Are they the same as in
> Lucene? I suppose they are.
> My question came up while debugging. In the method
> filter(Query input, BooleanQuery output) of the FieldQueryFilter class,
> one statement looks like this:
> 
>      // skip non-matching clauses (line 54 here!!)
>      if (!c.getField().equals(field))
>        continue;
> 
> I enabled my plugin and fed it the query string "textonly:nutch". "field" is
> "textonly", which is right. However, why is c.getField() always
> "DEFAULT"? Is there something I should take care of in the plugin config file?
> 
> My plugin.xml is:
> <?xml version="1.0" encoding="UTF-8"?>
> <plugin
>   ... ...
> 
>   <extension id="com.ccs.nutch.searcher.TextOnlyVersionQueryFilter"
>              name="TextOnlyVersion Query Filter"
>              point="org.apache.nutch.searcher.QueryFilter">
>      <implementation id="TextOnlyVersionQueryFilter"
>                      class="com.ccs.nutch.searcher.TextOnlyVersionQueryFilter"
>                      fields="textonly"/>
>   </extension>
> 
>   <extension id="com.ccs.nutch.searcher.GraphicVersionQueryFilter"
>              name="GraphicVersion Query Filter"
>              point="org.apache.nutch.searcher.QueryFilter">
>      <implementation id="GraphicVersionQueryFilter"
>                      class="com.ccs.nutch.searcher.GraphicVersionQueryFilter"
>                      fields="graphic"/>
>   </extension>
> </plugin>
> 
> 
> Regards
> /Jack
> 
> On 6/8/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> > Hi Doug
> >
> > Thank you for your suggestion.
> > I modified my indexing filter. If the URL of a page contains the
> > text-only or graphic flag, the corresponding content is indexed under the
> > "textonly" or "graphic" field.
> >
> > Here is the code.
> > public class MyIndexingFilter {
> >
> >  public Document filter(Document doc, Parse parse, FetcherOutput fo)
> >    throws IndexingException {
> >        ... ...
> >
> >    if (TEXTONLY_WEBSITE_TAG != null) {
> >      pattern = compiler.compile(TEXTONLY_WEBSITE_TAG);
> >      if (matcher.contains(url, pattern)) {
> >        doc.add(Field.UnStored("textonly", parse.getText()));
> >      }
> >    }
> >
> >    if (GRAPHIC_WEBSITE_TAG != null) {
> >      pattern = compiler.compile(GRAPHIC_WEBSITE_TAG);
> >      if (matcher.contains(url, pattern)) {
> >        doc.add(Field.UnStored("graphic", parse.getText()));
> >      }
> >    }
> >
> >    // content is indexed, so that it's searchable, but not stored in index
> >    doc.add(Field.UnStored("content", parse.getText()));
> >
> >        ... ...
> >    }
> >
> > }
> >
> > And the query filter is simple.
> > public class TextOnlyVersionQueryFilter extends FieldQueryFilter {
> >
> >   public TextOnlyVersionQueryFilter() {
> >     super("textonly");
> >   }
> > }
> >
> > After Nutch crawled the whole website, I tested the index with
> > Luke (query string "textonly:nutch") and everything was OK. However,
> > when I feed the same query string into NutchBean, the result is quite
> > different and obviously wrong. It seems NutchBean only shows
> > the pages whose depth is 1.
> >
> > BTW: the "plugin.includes" property is
> > <property>
> >  <name>plugin.includes</name>
> >  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|meta)|query-(basic|site|url)|myplugin</value>
> >  <description>Regular expression naming plugin directory names to
> >  include.  Any plugin not matching this expression is excluded.  By
> >  default Nutch includes crawling just HTML and plain text via HTTP,
> >  and basic indexing and search plugins.
> >  </description>
> > </property>
> >
> >
> > Any suggestions?
> >
> > Regards
> > /Jack
> >
> > On 6/8/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> > > Jack Tang wrote:
> > > > The problem is that when I search "scope:textonly" (I expect it to
> > > > list all pages in the textonly part), the result is blank. So what should
> > > > I do to get the right result?
> > >
> > > If you use RawFieldQueryFilter, then these are non-scoring, filtering
> > > query clauses (triggered by boost=0).  They do not affect ranking.  They
> > > must be accompanied by a scoring clause (boost != 0).  This is akin to
> > > the following at Google:
> > >
> > > http://www.google.com/search?q=filetype%3Apdf
> > >
> > > They could be made into scoring clauses, but that would make searches
> > > slower.  We could automatically turn one into a scoring clause when
> > > there are no scoring clauses in a query, if needed.  Is it important
> > > that you be able to, e.g., find all of the scope:textonly documents,
> > > with no other qualifications?
> > >
> > > Doug
> > >
> >
>
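
Editor's sketch: Doug's point about non-scoring filter clauses can be illustrated with a small toy model. It is not Lucene's or Nutch's query evaluation. The FilterClauseSketch class, the Doc type and the search method are hypothetical names. The only behaviour modelled is what the thread describes: a filter clause narrows the results of a scoring clause, and a query made of a filter clause alone comes back blank.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: this is NOT how Lucene or Nutch evaluate queries.
class FilterClauseSketch {
    // Hypothetical document: a URL, a scope tag and its text.
    static final class Doc {
        final String url;
        final String scope;
        final String text;

        Doc(String url, String scope, String text) {
            this.url = url;
            this.scope = scope;
            this.text = text;
        }
    }

    // scoringTerm: the term of a scoring clause, or null if the query has none.
    // scopeFilter: the value of a non-scoring "scope:" clause, or null if absent.
    static List<String> search(List<Doc> docs, String scoringTerm, String scopeFilter) {
        List<String> hits = new ArrayList<>();
        if (scoringTerm == null) {
            // ASSUMPTION taken from the thread: with no scoring clause there is
            // nothing to rank, so a filter-only query returns nothing.
            return hits;
        }
        for (Doc d : docs) {
            boolean matched = d.text.contains(scoringTerm);                       // scoring clause
            boolean allowed = scopeFilter == null || d.scope.equals(scopeFilter); // filter clause
            if (matched && allowed) {
                hits.add(d.url);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("http://example.com/text/a", "textonly", "about nutch"),
            new Doc("http://example.com/gfx/b", "graphic", "about nutch"));
        System.out.println(search(docs, null, "textonly"));    // filter clause only
        System.out.println(search(docs, "nutch", "textonly")); // scoring + filter clause
    }
}
```

Here the filter-only query returns an empty list, while adding the scoring term "nutch" returns only the document tagged "textonly". Doug's suggested alternative, promoting a filter clause to a scoring clause when no scoring clause is present, would change the first case.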


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
