Re: Search a URL

2010-09-24 Thread Markus Jelsma
WDF is the WordDelimiterFilter (solr.WordDelimiterFilterFactory in your schema).

On Friday 24 September 2010 02:42:52 Dennis Gearon wrote:
> WDF is not WTF (what I think when I see WDF), right? ;-)
> 
> What is WDF?
> 
> Dennis Gearon
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



RE: Search a URL

2010-09-23 Thread Dennis Gearon
WDF is not WTF (what I think when I see WDF), right? ;-)

What is WDF?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php




RE: Search a URL

2010-09-23 Thread Markus Jelsma
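Try setting generateWordParts="1" in your WDF. With generateWordParts="0" and
catenateWords="1", the filter drops the individual word parts and keeps only
their concatenation, so "google" never becomes a token of its own. Also, a
WhitespaceTokenizer makes little sense for URLs: a URL should contain no
whitespace, so the tokenizer passes it through as a single token. The
StandardTokenizer can tokenize a URL. Either way, the problem is your WDF.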
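
As a rough sketch of that change, the filter chain from the original post might look like this with generateWordParts switched on. The fieldType name url_text is a placeholder (the original name does not appear in the thread), and a single analyzer is used for both indexing and querying to keep the fragment short; everything else is copied from the posted configuration.

```xml
<!-- Sketch only: "url_text" is a hypothetical name, not taken from the original post. -->
<fieldType name="url_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- generateWordParts="1" lets the filter emit each word part
         (http, mail, google, com, dlkjadf) as its own token. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
```

Tokens are produced at index time, so existing documents would need to be reindexed before a search for "google" can match them.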
 


Re: Search a URL

2010-09-23 Thread dl
LetterTokenizerFactory emits each contiguous sequence of letters as a token and
discards everything else. Tokens such as http, https, and com would then need to
be stopwords.
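
A minimal sketch of that setup, assuming a stopword file maintained by hand. The names url_letters and url_stopwords.txt are hypothetical and not from this thread.

```xml
<!-- Sketch only: "url_letters" and "url_stopwords.txt" are hypothetical names. -->
<fieldType name="url_letters" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Emits runs of letters: http, mail, google, com, dlkjadf. -->
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- url_stopwords.txt would list one word per line, e.g. http, https, com. -->
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Because only letters are kept, digits are dropped as well: a host such as web2.example.com would yield web, example, and com.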

Alternatively, you can try PatternTokenizerFactory with a regular expression if
you are looking for a specific part of the URL.
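
One possible shape for the pattern approach, assuming the host name is the part of interest. The fieldType name url_host and the regular expression are illustrative choices, not something given in this thread.

```xml
<!-- Sketch only: "url_host" is a hypothetical name; the regex is one possible choice. -->
<fieldType name="url_host" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- group="1" emits capture group 1 of each match as a token;
         for http://mail.google.com/dlkjadf that is mail.google.com. -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="^https?://([^/:]+)" group="1"/>
    <!-- Split the host on the dots so that "google" becomes its own token. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Queries such as "google" are not URLs, so they bypass the pattern. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The query analyzer is kept separate because the index-time pattern only matches text that starts with http:// or https://; applied to a plain search term it would produce no tokens at all.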

On Sep 23, 2010, at 10:59 PM, Max Lynch wrote:

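> Is there a tokenizer that will allow me to search for parts of a URL?  For
> example, the search "google" would match on the data
> "http://mail.google.com/dlkjadf".
> 
> This tokenizer factory doesn't seem to be sufficient:
> 
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory"
>             language="English" protected="protwords.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory"
>             language="English" protected="protwords.txt"/>
>   </analyzer>
> </fieldType>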
> 
> Thanks.



Search a URL

2010-09-23 Thread Max Lynch
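Is there a tokenizer that will allow me to search for parts of a URL?  For
example, the search "google" would match on the data
"http://mail.google.com/dlkjadf".

This tokenizer factory doesn't seem to be sufficient:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>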

Thanks.