Re: String field
First, make sure your request handler is set to spit out everything. I take it you did, but I hate to assume. Second, I suggest indexing your data twice: once as tokenized text, and once as a string. It'll save you from howling at the moon in anguish... unless you really do care only about pure, exact matching, i.e., down to the character case. Scott On Tue, Mar 29, 2011 at 8:46 AM, Brian Lamb wrote: > Hi all, > > I'm a little confused about the string field. I read somewhere that if I > want to do an exact match, I should use a string field. So I made a few > modifications to my schema file: > > required="false" > /> > stored="true" required="false" /> > required="false" /> > required="false" /> > > And did a full import, but when I do a search and return all fields, only id > is showing up. The only difference is that id is my primary key field, so > that could be why it is showing up, but why aren't the others showing up? > > Thanks, > > Brian Lamb >
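For what it's worth, a minimal sketch of that double-indexing setup in schema.xml; the field names here (body, body_exact) are made up, and copyField does the duplication at index time:

  <field name="body" type="text" indexed="true" stored="true"/>
  <field name="body_exact" type="string" indexed="true" stored="false"/>
  <copyField source="body" dest="body_exact"/>

Search body for normal tokenized matching, and body_exact when you need the pure, exact match.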
Advice on Exact Matching?
Hi, I am trying to make sure that when I search for text, regardless of what that text is, I get an exact match. I'm *still* getting some issues, and this last mile is becoming very painful. The Solr field I'm setting this up on is pasted below my explanation. I appreciate any help. Explanation: I'm crawling websites with Nutch. I'm performing some mechanical-turk-like filtering and term matching. The problem is, there's some very gnarly behavior in Solr due to any number of gotchas. If I want to find *all* Solr documents that match "[id]somejunk\hi[/id]", then life is instantly hell. Likewise, lots of whitespace in between words throws it off: " john says hello, how are you?" I would love to be able to search for these exact phrases. If that's just not practical (I'm more than willing to live with a bloated search index), what would some other strategies be? There's no MapReduce in Solr; I could attempt Hadoop streaming, but that's not ideal for a variety of reasons. Solr schema.xml, fieldType "text" (no, this is not used everywhere; only on 2 fields): Thank you, Scott Gonyea
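(One hedged strategy for the exact-phrase case above: keep the gnarly strings in a solr.StrField, which applies no analysis at all, and query it with a parser that does no analysis either. The field name below is invented; the {!raw} query parser ships with recent Solr releases:)

  <field name="raw_content" type="string" indexed="true" stored="true"/>

  q={!raw f=raw_content}[id]somejunk\hi[/id]

That matches byte-for-byte, whitespace and case included, at the cost of the index bloat mentioned above.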
Solr highlighting is double-quotes-aware?
Not sure how to write that subject line. I'm getting some weird behavior out of the highlighter in Solr. It seems like an edge case, but I'm curious to hear whether this is known about, or whether it's something worth looking into further. Background: I'm using Solr's highlighting facility to tag words found in content crawled via Nutch. I split up the content based on those tags, which is later fed into a moderation process. Sample Data (snippet from larger content): [url=\"http://www.sampleurl.com/baffle_prices.html\"]baffle[/url] (My "hl.simple.pre" is set to "TEST_KEYWORD_START" and my "hl.simple.post" is set to "TEST_KEYWORD_END") Query for "baffle", and Solr highlights it thus: TEST_KEYWORD_STARTbaffle_prices.html\"]baffleTEST_KEYWORD_END What should be happening is this: TEST_KEYWORD_STARTbaffleTEST_KEYWORD_END_prices.html\"]TEST_KEYWORD_STARTbaffleTEST_KEYWORD_END Is there something about this data that makes the highlighter not want to split it up? Do I have to have Solr tokenize the words by some character that I somehow excluded? Thank you, Scott Gonyea
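(For context, the request producing that output would have looked roughly like this; the host and field name are placeholders, while the hl.* parameters are standard Solr highlighting options:)

  http://localhost:8983/solr/select?q=content:baffle&hl=true&hl.fl=content&hl.simple.pre=TEST_KEYWORD_START&hl.simple.post=TEST_KEYWORD_END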
Re: Dismax Filtering Hyphens? Why is this not working? How do I debug Dismax?
Wow, that's pretty infuriating. Thank you for the suggestion. I added it to the Wiki, with the hope that if it contains misinformation then someone will correct it and, consequently, save me from another one of these experiences :) (...and to also document that, hey, there is a tokenizer which treats the entire field as an exact value.) Will go this route and re-index everything back into Solr... again... sigh. Scott On Mon, Oct 4, 2010 at 10:07 AM, Ahmet Arslan wrote: >> >> > name="idstr" class="solr.StrField"> >> >> > class="solr.PatternTokenizerFactory" pattern="(.*)" >> group="1"/> >> > class="solr.LowerCaseFilterFactory"/> >> > > This definition is invalid. You cannot use charfilter/tokenizer/tokenfilter > with solr.StrField. > > But it is interesting that (I just tested) analysis.jsp (1.4.1) displays as > if it's working. But if you look at schema.jsp you will see that the real > indexed values are not lowercased. > > You can use this definition instead: > > > > > > > > > > > >
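(Ahmet's replacement definition was eaten by the archive; given Scott's remark about "a tokenizer which treats the entire field as an exact value," it was almost certainly the usual KeywordTokenizer recipe. The type name below is a guess:)

  <fieldType name="string_lc" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

KeywordTokenizerFactory emits the whole field value as a single token, so the LowerCaseFilterFactory then gives you a lower-cased but otherwise exact string.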
Dismax Filtering Hyphens? Why is this not working? How do I debug Dismax?
Wow, this is probably the most annoying Solr issue I've *ever* dealt with. First question: How do I debug Dismax, and its query handling? Issue: When I query against this StrField, I am attempting to do an *exact* match... albeit one that is case-insensitive :). So, 90% exact. It works in a majority of cases. Indeed, I am telling Solr that this field is my uniqueKey, and it enforces uniqueness perfectly. The issue comes about when I try to query a document, based on a key in this field, and the key I'm using has hyphens (dashes) in it. Then I get zero results. Very frustrating. The keys will always be a URL. IE, "http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry" Here's my configuration info... schema.xml (the URL exists twice; once in 'idstr' format, for uniqueness, and once in the 'url' form below. I am querying against the 'idstr' field): id content Yes, the PatternTokenizerFactory is inefficient for doing what I wanted above. It was a quick hack, while I sought something to do exactly what I'm doing above. IE, exact / WHOLE string... but lower case. Here's my solrconfig.xml: dismax explicit 0.01 content^1.5 anchor^0.3 title^1.2 mcode^1.0 site_id^1.0 priority^1.0 * true *:* content title 0 title regex3 And, finally, when I run that sample URL through the query analyzer, here's the output (copied from the HTML). I appreciate any/all help anyone can provide. Seriously. I'll love you forever :( :

Index Analyzer
  org.apache.solr.analysis.PatternTokenizerFactory
    term position: 1 | term text: http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry | term type: word | start,end: 0,58
  org.apache.solr.analysis.LowerCaseFilterFactory
    term position: 1 | term text: http://helloworld.abc/i-ruin-your-queries-aghghaahahaagcry | term type: word | start,end: 0,58

Query Analyzer
  org.apache.solr.analysis.PatternTokenizerFactory
    term position: 1 | term text: http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry | term type: word | start,end: 0,58
  org.apache.solr.analysis.LowerCaseFilterFactory
    term position: 1 | term text: http://helloworld.abc/i-ruin-your-queries-aghghaahahaagcry | term type: word | start,end: 0,58
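(For anyone landing on this thread later: the stock way to see what dismax actually built is to add debugQuery=on to the request; the host and query below are just placeholders:)

  http://localhost:8983/solr/select?defType=dismax&q=some-hyphenated-key&debugQuery=on

The debug section of the response shows parsedquery, which makes it obvious when an analyzer has split your key on the hyphens.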
Re: Highlighting match term in bold rather than italic
Your solrconfig has a highlighting section. You can make that CDATA thing whatever you want. I changed it to <b>. On Thu, Sep 30, 2010 at 2:54 PM, efr...@gmail.com wrote: > Hi all - > > Does anyone know how to produce solr results where the match term is > highlighted in bold rather than italic? > > thanks in advance, > > Brad >
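(For anyone hunting for it, the block in question in solrconfig.xml looks roughly like this; the stock example wraps matches in <em>, and swapping in <b>/</b> gives bold. You can also override per request with hl.simple.pre/hl.simple.post:)

  <highlighting>
    <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
      <lst name="defaults">
        <str name="hl.simple.pre"><![CDATA[<b>]]></str>
        <str name="hl.simple.post"><![CDATA[</b>]]></str>
      </lst>
    </formatter>
  </highlighting>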
Re: How to Index Pure Text into Seperate Fields?
Break your HTML pages into the desired fields and format them as follows: http://wiki.apache.org/solr/UpdateXmlMessages And away you go. You may want to search / review the Wiki. Also, if you're indexing websites and want to place them in Solr, you should look at Nutch. It can do all that work for you, and more. Scott On Wed, Sep 29, 2010 at 12:56 PM, Savannah Beckett wrote: > Hi, > I am using xpath to index different parts of the html pages into different > fields. Now, I have some pure text documents that have no html. So I can't > use > xpath. How do I index this pure text into different fields of the index? > How > do I make nutch/solr understand that these different parts belong to different > fields? Maybe I can use existing content in the fields in my index? > Thanks. > > >
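(A minimal example of that update-XML format, with made-up field names; POST it to /update and follow with a <commit/>:)

  <add>
    <doc>
      <field name="id">doc-1</field>
      <field name="title">Some plain-text document</field>
      <field name="content">The body of the text goes here...</field>
    </doc>
  </add>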
Re: Get all results from a solr query
lol, note to self: scratch out IPs. Good thing firewalls exist to keep my stupidity at bay. Scott On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea wrote: > If you want to do it in Ruby, you can use this script as scaffolding: > require 'rsolr' # run `gem install rsolr` to get this > solr = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr') > total = solr.select({:rows => 0})["response"]["numFound"] > rows = 10 > query = { > :rows => rows, > :start => 0 > } > pages = (total.to_f / rows.to_f).ceil # round up > (1..pages).each do |page| > query[:start] = (page-1) * rows > results = solr.select(query) > docs = results[:response][:docs] > # Do stuff here > # > docs.each do |doc| > doc[:content] = "IN UR SOLR MESSIN UP UR CONTENT!#{doc[:content]}" > end > # Add it back in to Solr > solr.add(docs) > solr.commit > end > > Scott > > On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant wrote: >> >> Start with a *:*, then the “numFound” attribute of the >> element should give you the rows to fetch by a 2nd request. >> >> >> On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross wrote: >> > That will stil just return 10 rows for me. Is there something else in >> > the configuration of solr to have it return all the rows in the >> > results? >> > >> > -- Chris >> > >> > >> > >> > On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant wrote: >> >> q=*:* >> >> >> >> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross >> >> wrote: >> >>> I have some queries that I'm running against a solr instance (older, >> >>> 1.2 I believe), and I would like to get *all* the results back (and >> >>> not have to put an absurdly large number as a part of the rows >> >>> parameter). >> >>> >> >>> Is there a way that I can do that? Any help would be appreciated. >> >>> >> >>> -- Chris >> >>> >> >> >> > >
Re: Get all results from a solr query
If you want to do it in Ruby, you can use this script as scaffolding:

  require 'rsolr' # run `gem install rsolr` to get this

  solr  = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
  total = solr.select({:rows => 0})["response"]["numFound"]
  rows  = 10
  query = {
    :rows  => rows,
    :start => 0
  }
  pages = (total.to_f / rows.to_f).ceil # round up

  (1..pages).each do |page|
    query[:start] = (page - 1) * rows
    results = solr.select(query)
    docs = results["response"]["docs"] # rsolr responses use string keys

    # Do stuff here
    docs.each do |doc|
      doc["content"] = "IN UR SOLR MESSIN UP UR CONTENT!#{doc['content']}"
    end

    # Add it back in to Solr
    solr.add(docs)
    solr.commit
  end

Scott On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant wrote: > > Start with a *:*, then the “numFound” attribute of the <result> > element should give you the rows to fetch by a 2nd request. > > > On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross wrote: > > That will still just return 10 rows for me. Is there something else in > > the configuration of solr to have it return all the rows in the > > results? > > > > -- Chris > > > > > > > > On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant wrote: > >> q=*:* > >> > >> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross > >> wrote: > >>> I have some queries that I'm running against a solr instance (older, > >>> 1.2 I believe), and I would like to get *all* the results back (and > >>> not have to put an absurdly large number as a part of the rows > >>> parameter). > >>> > >>> Is there a way that I can do that? Any help would be appreciated. > >>> > >>> -- Chris > >>> > >> > >
Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?
There are a lot of reasons, with the performance hit being notable--but also because I feel that using a regex on something this basic amounts to a lazy hack. I'm typically against regular expressions in XML. I'm vehemently opposed to them in cases where not using them should otherwise be quite trivial. Regarding LowerCaseFilter, etc: My question is: Why should LowerCaseFilter be the means by which that work is done? I fully agree with keeping things DRY, but I'm not quite sure I agree with how that mantra is being employed. For instance, the two tokenizer statements (reconstructed just below this message) can be written to utilize the same codebase, which makes things DRY and *may* even be a bit more performant for less trivial transformations. If nothing else, I think a "CharacterTokenizer" would be a good way to go. All that said :) I don't promote myself as an expert and I'm happy to be shown the light / slapped across the head. Scott On Tue, Sep 14, 2010 at 3:10 PM, Jonathan Rochkind wrote: > How about patching the LetterTokenizer to be capable of tokenizing how you > want, which can then be combined with a LowerCaseFilter (or not) as desired. > Or indeed creating a new tokenizer to do exactly what you want, possibly > (but one that doesn't combine an embedded lowercasefilter in there too!). > Instead of patching the LowerCaseTokenizer, which is of dubious value. Just > brainstorming. > > Another way to tokenize based on "Non-Whitespace/Alpha/Numeric > character-content" might be using the existing PatternTokenizerFactory with > a suitable regexp, as you mention. Which of course could do what the > LetterTokenizer does too, but presumably not as efficiently. Is that what > gives you an uncomfortable feeling? If it performs worse enough to matter, > then that's why you'd need a custom tokenizer, other than that I'm not sure > anything's undesirable about the PatternTokenizer. > > > Jonathan > > Scott Gonyea wrote: > >> I'd agree with your point entirely. My attacking LowerCaseTokenizer was a >> result of not wanting to create yet more Classes. >> >> That said, rightfully dumping LowerCaseTokenizer would probably have me >> creating my own Tokenizer. >> >> I could very well be thinking about this wrong... But what if I wanted to >> create tokens based on Non-Whitespace/Alpha/Numeric character-content? >> >> It looks like I could perhaps use the PatternTokenizer, but that didn't >> leave me with a comfortable feeling when I had first looked into it. >> >> Scott >> >> On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir wrote: >> >> >> >>> Jonathan, you bring up an excellent point. >>> >>> I think its worth our time to actually benchmark this LowerCaseTokenizer >>> versus LetterTokenizer + LowerCaseFilter >>> >>> This tokenizer is quite old, and although I can understand there is no >>> doubt >>> its technically faster than LetterTokenizer + LowerCaseFilter even today >>> (as >>> it can just go through the char[] only a single time), I have my doubts >>> that >>> this brings any value these days... >>> >>> >>> On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind >>> wrote: >>> >>> >>> >>>> Why would you want to do that, instead of just using another tokenizer >>>> >>>> >>> and >>> >>> >>>> a lowercasefilter? It's more confusing less DRY code to leave them >>>> >>>> >>> separate >>> >>> >>>> -- the LowerCaseTokenizerFactory combines anyway because someone >>>> decided >>>> >>>> >>> it >>> >>> >>>> was such a common use case that it was worth it for the demonstrated >>>> performance advantage. 
(At least I hope that's what happened, otherwise >>>> there's no excuse for it!). >>>> >>>> Do you know you get a worthwhile performance benefit for what you're >>>> >>>> >>> doing? >>> >>> >>>> If not, why do it? >>>> >>>> Jonathan >>>> >>>> >>>> Scott Gonyea wrote: >>>> >>>> >>>> >>>>> I went for a different route: >>>>> >>>>> https://issues.apache.org/jira/browse/LUCENE-2644 >>>>> >>>>> Scott >>>>> >>>>> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir >>>>> wrote: >>&
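(The two tokenizer statements referenced above were stripped by the archive; presumably they were the one-step and two-step spellings of the same thing:)

  <tokenizer class="solr.LowerCaseTokenizerFactory"/>

versus:

  <tokenizer class="solr.LetterTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>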
Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?
I'd agree with your point entirely. My attacking LowerCaseTokenizer was a result of not wanting to create yet more Classes. That said, rightfully dumping LowerCaseTokenizer would probably have me creating my own Tokenizer. I could very well be thinking about this wrong... But what if I wanted to create tokens based on Non-Whitespace/Alpha/Numeric character-content? It looks like I could perhaps use the PatternTokenizer, but that didn't leave me with a comfortable feeling when I had first looked into it. Scott On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir wrote: > Jonathan, you bring up an excellent point. > > I think its worth our time to actually benchmark this LowerCaseTokenizer > versus LetterTokenizer + LowerCaseFilter > > This tokenizer is quite old, and although I can understand there is no > doubt > its technically faster than LetterTokenizer + LowerCaseFilter even today > (as > it can just go through the char[] only a single time), I have my doubts > that > this brings any value these days... > > > On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind > wrote: > > > Why would you want to do that, instead of just using another tokenizer > and > > a lowercasefilter? It's more confusing less DRY code to leave them > separate > > -- the LowerCaseTokenizerFactory combines anyway because someone decided > it > > was such a common use case that it was worth it for the demonstrated > > performance advantage. (At least I hope that's what happened, otherwise > > there's no excuse for it!). > > > > Do you know you get a worthwhile performance benefit for what you're > doing? > > If not, why do it? > > > > Jonathan > > > > > > Scott Gonyea wrote: > > > >> I went for a different route: > >> > >> https://issues.apache.org/jira/browse/LUCENE-2644 > >> > >> Scott > >> > >> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir wrote: > >> > >> > >> > >>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea > wrote: > >>> > >>> > >>> > >>>> Hi, > >>>> > >>>> I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't > create > >>>> tokens, based solely on lower-casing characters. Is there a way to > tell > >>>> > >>>> > >>> it > >>> > >>> > >>>> NOT to drop non-characters? It's amazingly frustrating that the > >>>> TokenizerFactory and the FilterFactory have two entirely different > modes > >>>> > >>>> > >>> of > >>> > >>> > >>>> behavior. If I wanted it to tokenize based on non-lower case > >>>> characters > >>>> wouldn't I use, say, LetterTokenizerFactory and tack on the > >>>> LowerCaseFilterFactory? Or any number of combinations that would > >>>> > >>>> > >>> otherwise > >>> > >>> > >>>> achieve that specific end-result? > >>>> > >>>> > >>>> > >>> I don't think you should use LowerCaseTokenizerFactory if you dont want > >>> to > >>> divide text on non-letters, its intended to do just that. > >>> > >>> from the javadocs: > >>> LowerCaseTokenizer performs the function of LetterTokenizer and > >>> LowerCaseFilter together. It divides text at non-letters and converts > >>> them > >>> to lower case. While it is functionally equivalent to the combination > of > >>> LetterTokenizer and LowerCaseFilter, there is a performance advantage > to > >>> doing the two tasks at once, hence this (redundant) implementation. > >>> > >>> > >>> > >>> So... Is there a way for me to tell it to NOT split based on > >>> non-characters? 
> >>>Use a different tokenizer that doesn't split on non-characters, > >>> followed by > >>> a LowerCaseFilter > >>> > >>> -- > >>> Robert Muir > >>> rcm...@gmail.com > >>> > >>> > >>> > >> > >> > >> > > > > > -- > Robert Muir > rcm...@gmail.com >
Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?
There doesn't seem to have been anything readily available. All of the tokenizers make their own assumptions about how I want to treat the data. The end result is that this felt like the most direct approach. The default behavior of "LowerCaseTokenizer"(+Factory) was retained, while allowing it to be extended in very small ways--at the user's discretion. The comments noted that it was done for performance reasons, but I honestly cannot believe the performance gain is altogether worthwhile. Whether or not that's the case, I strongly believe that "LowerCaseTokenizer" should have (more correctly) been called "LowerCaseLetterTokenizer". There's arguably zero negative impact from my change. Where the (inherited) isTokenChar(int) method from LetterTokenizer was simply:

  protected boolean isTokenChar(int c) {
    return Character.isLetter(c);
  }

I've (likewise) kept the most common use case as the first check in the method:

  protected boolean isTokenChar(int c) {
    if (Character.isLetter(c)) { return true; }
    ...

Scott On Tue, Sep 14, 2010 at 2:23 PM, Jonathan Rochkind wrote: > Why would you want to do that, instead of just using another tokenizer and > a lowercasefilter? It's more confusing less DRY code to leave them separate > -- the LowerCaseTokenizerFactory combines anyway because someone decided it > was such a common use case that it was worth it for the demonstrated > performance advantage. (At least I hope that's what happened, otherwise > there's no excuse for it!). > > Do you know you get a worthwhile performance benefit for what you're doing? > If not, why do it? > > Jonathan > > > Scott Gonyea wrote: > >> I went for a different route: >> >> https://issues.apache.org/jira/browse/LUCENE-2644 >> >> Scott >> >> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir wrote: >> >> >> >>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea wrote: >>> >>> >>> >>>> Hi, >>>> >>>> I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create >>>> tokens, based solely on lower-casing characters. Is there a way to tell >>>> >>>> >>> it >>> >>> >>>> NOT to drop non-characters? It's amazingly frustrating that the >>>> TokenizerFactory and the FilterFactory have two entirely different modes >>>> >>>> >>> of >>> >>> >>>> behavior. If I wanted it to tokenize based on non-lower case >>>> characters >>>> wouldn't I use, say, LetterTokenizerFactory and tack on the >>>> LowerCaseFilterFactory? Or any number of combinations that would >>>> >>>> >>> otherwise >>> >>> >>>> achieve that specific end-result? >>>> >>>> >>>> >>> I don't think you should use LowerCaseTokenizerFactory if you dont want >>> to >>> divide text on non-letters, its intended to do just that. >>> >>> from the javadocs: >>> LowerCaseTokenizer performs the function of LetterTokenizer and >>> LowerCaseFilter together. It divides text at non-letters and converts >>> them >>> to lower case. While it is functionally equivalent to the combination of >>> LetterTokenizer and LowerCaseFilter, there is a performance advantage to >>> doing the two tasks at once, hence this (redundant) implementation. >>> >>> >>> >>> So... Is there a way for me to tell it to NOT split based on >>> non-characters? >>>Use a different tokenizer that doesn't split on non-characters, >>> followed by >>> a LowerCaseFilter >>> >>> -- >>> Robert Muir >>> rcm...@gmail.com >>> >>> >>> >> >> >> >
Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?
I went for a different route: https://issues.apache.org/jira/browse/LUCENE-2644 Scott On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir wrote: > On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea wrote: > > > Hi, > > > > I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create > > tokens, based solely on lower-casing characters. Is there a way to tell > it > > NOT to drop non-characters? It's amazingly frustrating that the > > TokenizerFactory and the FilterFactory have two entirely different modes > of > > behavior. If I wanted it to tokenize based on non-lower case > > characters > > wouldn't I use, say, LetterTokenizerFactory and tack on the > > LowerCaseFilterFactory? Or any number of combinations that would > otherwise > > achieve that specific end-result? > > > > I don't think you should use LowerCaseTokenizerFactory if you dont want to > divide text on non-letters, its intended to do just that. > > from the javadocs: > LowerCaseTokenizer performs the function of LetterTokenizer and > LowerCaseFilter together. It divides text at non-letters and converts them > to lower case. While it is functionally equivalent to the combination of > LetterTokenizer and LowerCaseFilter, there is a performance advantage to > doing the two tasks at once, hence this (redundant) implementation. > > > > So... Is there a way for me to tell it to NOT split based on > non-characters? > > > > Use a different tokenizer that doesn't split on non-characters, followed by > a LowerCaseFilter > > -- > Robert Muir > rcm...@gmail.com >
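(Spelled out as schema XML, Robert's suggestion amounts to something like the following; a sketch, assuming whitespace is the only boundary you actually want:)

  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>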
LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?
Hi, I'm tweaking my schema, and the LowerCaseTokenizerFactory doesn't create tokens based solely on lower-casing characters. Is there a way to tell it NOT to drop non-characters? It's amazingly frustrating that the TokenizerFactory and the FilterFactory have two entirely different modes of behavior. If I wanted it to tokenize based on non-lower-case characters, wouldn't I use, say, LetterTokenizerFactory and tack on the LowerCaseFilterFactory? Or any number of combinations that would otherwise achieve that specific end result? So... is there a way for me to tell it to NOT split based on non-characters? If not, I'd really like to submit a patch to make it behave as advertised--which is the next best thing to yelling incoherently at the poor guy who wrote it :). Scott
Re: In Need of Direction; Phrase-Context Tracking / Injection (Child Indexes) / Dismissal
I've been considering the use of Hadoop, since that's what Nutch uses. Unless I piggy-back onto Nutch's MR job, when creating a Solr index, I'm wondering if it's overkill. I can see ways of working it into a MapReduce workflow, but it would involve dumping the database onto HDFS beforehand. I'm still debating that one, with myself. One other thing that I want to take advantage of is Lucene/Solr's filter factories (?). I'm not sure if I have the terminology right, but there are a lot of advanced text-parsing features. IE, a search for "reality" would also turn up "reale." It seems that I would want to perform my "find words, filter out any white-listed context, and re-inject" after Nutch stuffs Solr with all of its crawl data. So, perhaps I can get help starting at #1 of your suggestion: How would I best extract a phrase from Solr? IE, can I tell Solr "give me each occurrence of X in document Y" or (and I'm guessing this is it) where would I look to perform that kind of a search, myself? (A sketch of one approach follows this message.) Thinking about it, I imagine that Solr might tend to "flatten" words in its index. Ie, the string "reality" only really occurs once in a given page's index, and (maybe?) it'll have some boost reflecting the number of times it appeared. Please excuse my obscene generalizations :(. I'm going to do some more digging through the Solr codebase. I appreciate your help. I am a bit of a beggar when it comes to seeking out help on where to start. But, as I mentioned on the Nutch list, I will contribute all of my changes back to Solr. I'll also look to improve documentation, which I still owe Nutch, but that's queueing up for when there's a lull. Thank you, - Scott On Fri, Sep 3, 2010 at 1:19 AM, Jan Høydahl / Cominvent < jan@cominvent.com> wrote: > Hi, > > This smells like a job for Hadoop and perhaps Mahout, unless your use cases > are totally ad-hoc research. > After Nutch has fetched the sites, kick off some MapReduce jobs for each > case you wish to study: > 1. Extract phrases/contexts > 2. For each context, perform detection and whitelisting > 3. In the reduce step, sum it all up, and write the results to some store > 4. Now you may index a "report" per site into Solr, with links to the > original pages for each context > > You may be able to represent your grammar as textual rules instead of code. > Your latency may be minutes instead of milliseconds though... > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > Training in Europe - www.solrtraining.com > > On 3. sep. 2010, at 01.03, Scott Gonyea wrote: > > > Hi Grant, > > > > Thanks for replying--sorry for sticking this on dev; I had imagined that > > development against the Solr codebase would be inevitable. > > > > The application has to do with regulatory and legal compliance work by a > > non-profit, and is "socially good," but I need to 'abstract' the > > problem/goals--as it's not mine to disclose. > > > > Crawl several websites, ie: slashdot, engadget, etc., inject them into > Solr, > > and search for a given word. > > > > Issue 1: How many times did that word appear, on the URL returned by > Solr? > > > > Suppose that word is "Linux" and you want to make sure that each > occurrence > > of "Linux" also acknowledges that "Linux" is "GNU/Linux" (pedanticism > gone > > wild). Now, suppose that "GNU Linux" is ok. And even "GNU Projects such > as > > Linux" is OK too. So, now: > > > > Issue 2: Suppose that your goal is now to separate the noise from the > > signal. 
You therefore "white list" occurrences in which "Linux" appears > > without a "GNU/" prefix, yet which you've deemed acceptable within the > given > > context. "GNU/Linux" would be a starting point for any of your > > white-listing tasks. > > > > Simply iterating over what is--and is not--a "white list" just doesn't > scale > > on a lot of levels. So my approach is to maintain a separate datastore, > > which contains a list of phrases that are worthy of whomever's attention, > as > > well as a whole lot of "phrase-contexts"... Or the context in which the > > phrase appeared. > > > > Suppose that one website lists "Linux" 20 times; the goal is to > white-list > > all 20 of those occurrences. Or perhaps "Linux" appears 20 times, within > > the same context, then you might only need 1 "white list" to knock out > all > > 20. F
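(A hedged partial answer to the "give me each occurrence of X in document Y" question above: stock highlighting gets close, since hl.snippets caps how many fragments come back per document. The parameters below are standard; the host, values, and names are placeholders:)

  http://localhost:8983/solr/select?q=content:linux&fq=id:Y&hl=true&hl.fl=content&hl.snippets=50&hl.fragsize=120

Counting occurrences is easier still: the TermVectorComponent that arrived in Solr 1.4 can return per-document term frequencies (tv=true&tv.tf=true), assuming the field is indexed with termVectors="true".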
Re: In Need of Direction; Phrase-Context Tracking / Injection (Child Indexes) / Dismissal
Hi Grant, Thanks for replying--sorry for sticking this on dev; I had imagined that development against the Solr codebase would be inevitable. The application has to do with regulatory and legal compliance work by a non-profit, and is "socially good," but I need to 'abstract' the problem/goals--as it's not mine to disclose. Crawl several websites, ie: slashdot, engadget, etc., inject them into Solr, and search for a given word. Issue 1: How many times did that word appear, on the URL returned by Solr? Suppose that word is "Linux" and you want to make sure that each occurrence of "Linux" also acknowledges that "Linux" is "GNU/Linux" (pedanticism gone wild). Now, suppose that "GNU Linux" is ok. And even "GNU Projects such as Linux" is OK too. So, now: Issue 2: Suppose that your goal is now to separate the noise from the signal. You therefore "white list" occurrences in which "Linux" appears without a "GNU/" prefix, yet which you've deemed acceptable within the given context. "GNU/Linux" would be a starting point for any of your white-listing tasks. Simply iterating over what is--and is not--a "white list" just doesn't scale on a lot of levels. So my approach is to maintain a separate datastore, which contains a list of phrases that are worthy of whomever's attention, as well as a whole lot of "phrase-contexts"... Or the context in which the phrase appeared. Suppose that one website lists "Linux" 20 times; the goal is to white-list all 20 of those occurrences. Or, if "Linux" appears 20 times within the same context, then you might only need 1 "white list" to knock out all 20. Further, the white-listing can generally be applied to other sites in which they appear. I'd love to get some thoughts on how to tackle this problem, but I think that kicking off separate documents, within Solr, for each specific occurrence... would be the simplest path. But again, I'd love for some thoughts on how else I might do this, or where I should start my coding :) Thank you very much, Scott Gonyea On Thu, Sep 2, 2010 at 2:12 PM, Grant Ingersoll wrote: > Dropping d...@lucene.a.o. > > How about we step back and please explain the problem you are trying to > solve, as opposed to the proposed solution to the problem below. You can > likely do what you want below in Solr/Lucene (modulo replacing the index > with a new document), but the bigger question is "is that the best way to do > it?" I think if you give us that context, then perhaps we can brainstorm on > solutions. > > Thanks, > Grant > > > On Sep 1, 2010, at 8:29 PM, Scott Gonyea wrote: > > > Hi, > > > > I'm looking to get some direction on where I should focus my attention, > with regards to the Solr codebase and documentation. Rather than write a > ton of stuff no one wants to read, I'll just start with a use-case. For > context, the data originates from Nutch crawls and is indexed into Solr. > > > > Imagine a web page has the following content (4 occurrences of "Johnson" > are bolded): > > > > --content_-- > > Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean > id urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla > magna, nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. > Mauris a arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget > ligula nisi. Ut fringilla ullamcorper sem. > > --_content-- > > > > First; I would like to have the entire "content" block be indexed within > Solr. This is done and definitely not an issue. 
> > > > Second (+); during the injection of crawl data into Solr, I would like to > grab every occurrence of a specific word, or phrase, with "Johnson" being my > example for the above. I want to take every such phrase (without > collision), as well as its unique-context, and inject that into its own, > separate Solr index. For example, the above "content" example, having been > indexed in its entirety, would also be the source of 4 additional indexes. > In each index, "Johnson" would only appear once. All of the text before > and after "Johnson" would be BOUND BY any other occurrence of "Johnson." > eg: > > --index1_-- > > Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean > id urna et justo fringilla dictum > > --_index1-- --index2_-- > > sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla > dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed > > --_index2-- --index3_-- > > in at t
In Need of Direction; Phrase-Context Tracking / Injection (Child Indexes) / Dismissal
Hi, I'm looking to get some direction on where I should focus my attention, with regards to the Solr codebase and documentation. Rather than write a ton of stuff no one wants to read, I'll just start with a use-case. For context, the data originates from Nutch crawls and is indexed into Solr. Imagine a web page has the following content (4 occurrences of "Johnson" are bolded): --content_-- Lorem ipsum dolor *Johnson* sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum *johnson* in at tortor. Nulla eu nulla magna, nec sodales est. Sed *johnSon* sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada *Johnsons* mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem. --_content-- *First*; I would like to have the entire "content" block be indexed within Solr. This is done and definitely not an issue. *Second* (+); during the injection of crawl data into Solr, I would like to grab every occurrence of a specific word, or phrase, with "Johnson" being my example for the above. I want to take every such phrase (without collision), as well as its unique-context, and inject that into its own, separate Solr index. For example, the above "content" example, having been indexed in its entirety, would also be the source of 4 additional indexes. In each index, "Johnson" would only appear once. All of the text before and after "Johnson" would be BOUND BY any other occurrence of "Johnson." eg: --index1_-- Lorem ipsum dolor *Johnson* sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum --_index1-- --index2_-- sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum *johnson* in at tortor. Nulla eu nulla magna, nec sodales est. Sed --_index2-- --index3_-- in at tortor. Nulla eu nulla magna, nec sodales est. Sed *johnSon* sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada --_index3-- --index4_-- sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada *Johnsons* mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem. --_index4-- Q: How much of this is feasible in "present-day Solr" and how much of it do I need to produce in a patch of my own? Can anyone give me some direction on where I should look, in approaching this problem (ie, libs / classes / confs)? I sincerely appreciate it. *Third*; I would later like to go through the above child indexes and dismiss any that appear within a given context. For example, I may deem "ipsum dolor *Johnson* sit amet" as not being useful and I'd want to delete any indexes matching that particular phrase-context. The deletion is trivial and, with the 2nd item resolved--this becomes a fairly non-issue. Q: The question, more or less, comes from the fact that my source data is from a web crawler. When recrawled, I need to repeat the process of dismissing phrase-contexts that are not relevant to me. Where is the best place to perform this work? I could easily perform queries, after indexing my crawl, but that seems needlessly intensive. I think the answer to that will be "wherever I implement #2", but assumptions can be painfully expensive. Thank you for reading my bloated e-mail. Again, I'm mostly just looking to be pointed to various pieces of the Lucene / Solr code-base, and am trolling for any insight that people might share. Scott Gonyea
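(Since the *Second* item is the crux, here is a rough client-side sketch of it, not a Solr patch, using the rsolr gem that appears earlier in this archive. The field names, id scheme, and query are all invented for illustration:)

  require 'rsolr' # gem install rsolr

  solr    = RSolr.connect(:url => 'http://localhost:8983/solr')
  pattern = /johnson/i # the word whose occurrences we want

  doc     = solr.select(:q => 'id:some-page')["response"]["docs"].first
  content = doc["content"]

  # Collect the [start, end) offsets of every occurrence.
  spans = []
  content.scan(pattern) { spans << [Regexp.last_match.begin(0), Regexp.last_match.end(0)] }

  # Each child document runs from just past the previous occurrence
  # up to the start of the next one (or the content boundaries).
  children = spans.each_with_index.map do |_span, i|
    left  = i.zero?             ? 0            : spans[i - 1][1]
    right = i == spans.size - 1 ? content.size : spans[i + 1][0]
    { "id"      => "#{doc['id']}-occ#{i + 1}", # invented id scheme
      "parent"  => doc["id"],                  # invented field
      "content" => content[left...right].strip }
  end

  solr.add(children)
  solr.commit

Each fragment contains its occurrence plus the text bounded by the neighboring occurrences, and goes back into Solr as its own document, which also makes the later *Third* "dismissal" step a plain delete-by-query.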