Re: String field
First, make sure your request handler is set to spit out everything. I take it you did, but I hate to assume. Second, I suggest indexing your data twice: once as tokenized text, and once as a string. It'll save you from howling at the moon in anguish... unless you really do care only about pure, exact matching, i.e., down to the character case. Scott On Tue, Mar 29, 2011 at 8:46 AM, Brian Lamb wrote: > Hi all, > > I'm a little confused about the string field. I read somewhere that if I > want to do an exact match, I should use a string field. So I made a few > modifications to my schema file: > > required="false" > /> > stored="true" required="false" /> > required="false" /> > required="false" /> > > And did a full import, but when I do a search and return all fields, only id > is showing up. The only difference is that id is my primary key field, so > that could be why it is showing up, but why aren't the others showing up? > > Thanks, > > Brian Lamb >
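For what it's worth, a minimal sketch of that double-indexing setup in schema.xml; the field names here (body, body_exact) are made up, and copyField does the duplication at index time:

  <field name="body" type="text" indexed="true" stored="true"/>
  <field name="body_exact" type="string" indexed="true" stored="false"/>
  <copyField source="body" dest="body_exact"/>

Search body for normal tokenized matching, and body_exact when you need the pure, exact match.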
Advice on Exact Matching?
Hi, I am trying to make sure that when I search for text, regardless of what that text is, I get an exact match. I'm *still* getting some issues, and this last mile is becoming very painful. The Solr field I'm setting this up on is pasted below my explanation. I appreciate any help. Explanation: I'm crawling websites with Nutch. I'm performing some mechanical-turk-like filtering and term matching. The problem is, there's some very gnarly behavior in Solr due to any number of gotchas. If I want to find *all* Solr documents that match "[id]somejunk\hi[/id]", then life is instantly hell. Likewise, lots of whitespace in between words throws it off: " john says hello, how are you?" I would love to be able to search for these exact phrases. If that's just not practical (I'm more than willing to live with a bloated search index), what would some other strategies be? There's no MapReduce in Solr; I could attempt Hadoop streaming, but that's not ideal for a variety of reasons. Solr schema.xml, fieldType "text" (no, this is not used everywhere; only on 2 fields): Thank you, Scott Gonyea
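(One hedged strategy for the exact-phrase case above: keep the gnarly strings in a solr.StrField, which applies no analysis at all, and query it with a parser that does no analysis either. The field name below is invented; the {!raw} query parser ships with recent Solr releases:)

  <field name="raw_content" type="string" indexed="true" stored="true"/>

  q={!raw f=raw_content}[id]somejunk\hi[/id]

That matches byte-for-byte, whitespace and case included, at the cost of the index bloat mentioned above.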
Solr highlighting is double-quotes-aware?
Not sure how to write that subject line. I'm getting some weird behavior out of the highlighter in Solr. It seems like an edge case, but I'm curious to hear whether this is known about, or whether it's something worth looking into further. Background: I'm using Solr's highlighting facility to tag words found in content crawled via Nutch. I split up the content based on those tags, which is later fed into a moderation process. Sample Data (snippet from larger content): [url=\"http://www.sampleurl.com/baffle_prices.html\"]baffle[/url] (My "hl.simple.pre" is set to "TEST_KEYWORD_START" and my "hl.simple.post" is set to "TEST_KEYWORD_END") Query for "baffle", and Solr highlights it thus: TEST_KEYWORD_STARTbaffle_prices.html\"]baffleTEST_KEYWORD_END What should be happening is this: TEST_KEYWORD_STARTbaffleTEST_KEYWORD_END_prices.html\"]TEST_KEYWORD_STARTbaffleTEST_KEYWORD_END Is there something about this data that makes the highlighter not want to split it up? Do I have to have Solr tokenize the words by some character that I somehow excluded? Thank you, Scott Gonyea
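(For context, the request producing that output would have looked roughly like this; the host and field name are placeholders, while the hl.* parameters are standard Solr highlighting options:)

  http://localhost:8983/solr/select?q=content:baffle&hl=true&hl.fl=content&hl.simple.pre=TEST_KEYWORD_START&hl.simple.post=TEST_KEYWORD_END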
Re: Dismax Filtering Hyphens? Why is this not working? How do I debug Dismax?
Wow, that's pretty infuriating. Thank you for the suggestion. I added it to the Wiki, with the hope that if it contains misinformation then someone will correct it and, consequently, save me from another one of these experiences :) (...and to also document that, hey, there is a tokenizer which treats the entire field as an exact value.) Will go this route and re-index everything back into Solr... again... sigh. Scott On Mon, Oct 4, 2010 at 10:07 AM, Ahmet Arslan wrote: >> >> > name="idstr" class="solr.StrField"> >> >> > class="solr.PatternTokenizerFactory" pattern="(.*)" >> group="1"/> >> > class="solr.LowerCaseFilterFactory"/> >> > > This definition is invalid. You cannot use charfilter/tokenizer/tokenfilter > with solr.StrField. > > But it is interesting that (I just tested) analysis.jsp (1.4.1) displays as > if it's working. But if you look at schema.jsp you will see that the real > indexed values are not lowercased. > > You can use this definition instead: > > > > > > > > > > > >
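(Ahmet's replacement definition was eaten by the archive; given Scott's remark about "a tokenizer which treats the entire field as an exact value," it was almost certainly the usual KeywordTokenizer recipe. The type name below is a guess:)

  <fieldType name="string_lc" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

KeywordTokenizerFactory emits the whole field value as a single token, so the LowerCaseFilterFactory then gives you a lower-cased but otherwise exact string.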
Dismax Filtering Hyphens? Why is this not working? How do I debug Dismax?
Wow, this is probably the most annoying Solr issue I've *ever* dealt with. First question: How do I debug Dismax, and its query handling? Issue: When I query against this StrField, I am attempting to do an *exact* match... albeit one that is case-insensitive :). So, 90% exact. It works in a majority of cases. Indeed, I am telling Solr that this field is my uniqueKey, and it enforces uniqueness perfectly. The issue comes about when I try to query a document, based on a key in this field, and the key I'm using has hyphens (dashes) in it. Then I get zero results. Very frustrating. The keys will always be a URL. IE, "http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry" Here's my configuration info... schema.xml (the URL exists twice; once in 'idstr' format, for uniqueness, and once in the 'url' form below. I am querying against the 'idstr' field): id content Yes, the PatternTokenizerFactory is inefficient for doing what I wanted above. It was a quick hack, while I sought something to do exactly what I'm doing above. IE, exact / WHOLE string... but lower case. Here's my solrconfig.xml: dismax explicit 0.01 content^1.5 anchor^0.3 title^1.2 mcode^1.0 site_id^1.0 priority^1.0 * true *:* content title 0 title regex3 And, finally, when I run that sample URL through the query analyzer, here's the output (copied from the HTML). I appreciate any/all help anyone can provide. Seriously. I'll love you forever :( :

Index Analyzer
  org.apache.solr.analysis.PatternTokenizerFactory
    term position: 1 | term text: http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry | term type: word | start,end: 0,58
  org.apache.solr.analysis.LowerCaseFilterFactory
    term position: 1 | term text: http://helloworld.abc/i-ruin-your-queries-aghghaahahaagcry | term type: word | start,end: 0,58

Query Analyzer
  org.apache.solr.analysis.PatternTokenizerFactory
    term position: 1 | term text: http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry | term type: word | start,end: 0,58
  org.apache.solr.analysis.LowerCaseFilterFactory
    term position: 1 | term text: http://helloworld.abc/i-ruin-your-queries-aghghaahahaagcry | term type: word | start,end: 0,58
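(For anyone landing on this thread later: the stock way to see what dismax actually built is to add debugQuery=on to the request; the host and query below are just placeholders:)

  http://localhost:8983/solr/select?defType=dismax&q=some-hyphenated-key&debugQuery=on

The debug section of the response shows parsedquery, which makes it obvious when an analyzer has split your key on the hyphens.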
Re: Highlighting match term in bold rather than italic
Your solrconfig has a highlighting section. You can make that CDATA thing whatever you want. I changed it to <b>. On Thu, Sep 30, 2010 at 2:54 PM, efr...@gmail.com wrote: > Hi all - > > Does anyone know how to produce solr results where the match term is > highlighted in bold rather than italic? > > thanks in advance, > > Brad >
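(For anyone hunting for it, the block in question in solrconfig.xml looks roughly like this; the stock example wraps matches in <em>, and swapping in <b>/</b> gives bold. You can also override per request with hl.simple.pre/hl.simple.post:)

  <highlighting>
    <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
      <lst name="defaults">
        <str name="hl.simple.pre"><![CDATA[<b>]]></str>
        <str name="hl.simple.post"><![CDATA[</b>]]></str>
      </lst>
    </formatter>
  </highlighting>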
Re: How to Index Pure Text into Seperate Fields?
Break your HTML pages into the desired fields and format them as follows: http://wiki.apache.org/solr/UpdateXmlMessages And away you go. You may want to search / review the Wiki. Also, if you're indexing websites and want to place them in Solr, you should look at Nutch. It can do all that work for you, and more. Scott On Wed, Sep 29, 2010 at 12:56 PM, Savannah Beckett wrote: > Hi, > I am using xpath to index different parts of the html pages into different > fields. Now, I have some pure text documents that have no html. So I can't > use > xpath. How do I index this pure text into different fields of the index? > How > do I make nutch/solr understand that these different parts belong to different > fields? Maybe I can use existing content in the fields in my index? > Thanks. > > >
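(A minimal example of that update-XML format, with made-up field names; POST it to /update and follow with a <commit/>:)

  <add>
    <doc>
      <field name="id">doc-1</field>
      <field name="title">Some plain-text document</field>
      <field name="content">The body of the text goes here...</field>
    </doc>
  </add>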
Re: Get all results from a solr query
lol, note to self: scratch out IPs. Good thing firewalls exist to keep my stupidity at bay. Scott On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea wrote: > If you want to do it in Ruby, you can use this script as scaffolding: > require 'rsolr' # run `gem install rsolr` to get this > solr = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr') > total = solr.select({:rows => 0})["response"]["numFound"] > rows = 10 > query = { > :rows => rows, > :start => 0 > } > pages = (total.to_f / rows.to_f).ceil # round up > (1..pages).each do |page| > query[:start] = (page-1) * rows > results = solr.select(query) > docs = results[:response][:docs] > # Do stuff here > # > docs.each do |doc| > doc[:content] = "IN UR SOLR MESSIN UP UR CONTENT!#{doc[:content]}" > end > # Add it back in to Solr > solr.add(docs) > solr.commit > end > > Scott > > On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant wrote: >> >> Start with a *:*, then the “numFound” attribute of the >> element should give you the rows to fetch by a 2nd request. >> >> >> On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross wrote: >> > That will stil just return 10 rows for me. Is there something else in >> > the configuration of solr to have it return all the rows in the >> > results? >> > >> > -- Chris >> > >> > >> > >> > On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant wrote: >> >> q=*:* >> >> >> >> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross >> >> wrote: >> >>> I have some queries that I'm running against a solr instance (older, >> >>> 1.2 I believe), and I would like to get *all* the results back (and >> >>> not have to put an absurdly large number as a part of the rows >> >>> parameter). >> >>> >> >>> Is there a way that I can do that? Any help would be appreciated. >> >>> >> >>> -- Chris >> >>> >> >> >> > >
Re: Get all results from a solr query
If you want to do it in Ruby, you can use this script as scaffolding:

  require 'rsolr' # run `gem install rsolr` to get this

  solr  = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
  total = solr.select({:rows => 0})["response"]["numFound"]
  rows  = 10
  query = {
    :rows  => rows,
    :start => 0
  }
  pages = (total.to_f / rows.to_f).ceil # round up

  (1..pages).each do |page|
    query[:start] = (page - 1) * rows
    results = solr.select(query)
    docs = results["response"]["docs"] # rsolr responses use string keys

    # Do stuff here
    docs.each do |doc|
      doc["content"] = "IN UR SOLR MESSIN UP UR CONTENT!#{doc['content']}"
    end

    # Add it back in to Solr
    solr.add(docs)
    solr.commit
  end

Scott On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant wrote: > > Start with a *:*, then the “numFound” attribute of the <result> > element should give you the rows to fetch by a 2nd request. > > > On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross wrote: > > That will still just return 10 rows for me. Is there something else in > > the configuration of solr to have it return all the rows in the > > results? > > > > -- Chris > > > > > > > > On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant wrote: > >> q=*:* > >> > >> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross > >> wrote: > >>> I have some queries that I'm running against a solr instance (older, > >>> 1.2 I believe), and I would like to get *all* the results back (and > >>> not have to put an absurdly large number as a part of the rows > >>> parameter). > >>> > >>> Is there a way that I can do that? Any help would be appreciated. > >>> > >>> -- Chris > >>> > >> > >
Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?
There are a lot of reasons, with the performance hit being notable--but also because I feel that using a regex on something this basic amounts to a lazy hack. I'm typically against regular expressions in XML. I'm vehemently opposed to them in cases where not using them should otherwise be quite trivial. Regarding LowerCaseFilter, etc: My question is: Why should LowerCaseFilter be the means by which that work is done? I fully agree with keeping things DRY, but I'm not quite sure I agree with how that mantra is being employed. For instance, the two tokenizer statements (reconstructed just below this message) can be written to utilize the same codebase, which makes things DRY and *may* even be a bit more performant for less trivial transformations. If nothing else, I think a "CharacterTokenizer" would be a good way to go. All that said :) I don't promote myself as an expert and I'm happy to be shown the light / slapped across the head. Scott On Tue, Sep 14, 2010 at 3:10 PM, Jonathan Rochkind wrote: > How about patching the LetterTokenizer to be capable of tokenizing how you > want, which can then be combined with a LowerCaseFilter (or not) as desired. > Or indeed creating a new tokenizer to do exactly what you want, possibly > (but one that doesn't combine an embedded lowercasefilter in there too!). > Instead of patching the LowerCaseTokenizer, which is of dubious value. Just > brainstorming. > > Another way to tokenize based on "Non-Whitespace/Alpha/Numeric > character-content" might be using the existing PatternTokenizerFactory with > a suitable regexp, as you mention. Which of course could do what the > LetterTokenizer does too, but presumably not as efficiently. Is that what > gives you an uncomfortable feeling? If it performs worse enough to matter, > then that's why you'd need a custom tokenizer, other than that I'm not sure > anything's undesirable about the PatternTokenizer. > > > Jonathan > > Scott Gonyea wrote: > >> I'd agree with your point entirely. My attacking LowerCaseTokenizer was a >> result of not wanting to create yet more Classes. >> >> That said, rightfully dumping LowerCaseTokenizer would probably have me >> creating my own Tokenizer. >> >> I could very well be thinking about this wrong... But what if I wanted to >> create tokens based on Non-Whitespace/Alpha/Numeric character-content? >> >> It looks like I could perhaps use the PatternTokenizer, but that didn't >> leave me with a comfortable feeling when I had first looked into it. >> >> Scott >> >> On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir wrote: >> >> >> >>> Jonathan, you bring up an excellent point. >>> >>> I think its worth our time to actually benchmark this LowerCaseTokenizer >>> versus LetterTokenizer + LowerCaseFilter >>> >>> This tokenizer is quite old, and although I can understand there is no >>> doubt >>> its technically faster than LetterTokenizer + LowerCaseFilter even today >>> (as >>> it can just go through the char[] only a single time), I have my doubts >>> that >>> this brings any value these days... >>> >>> >>> On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind >>> wrote: >>> >>> >>> >>>> Why would you want to do that, instead of just using another tokenizer >>>> >>>> >>> and >>> >>> >>>> a lowercasefilter? It's more confusing less DRY code to leave them >>>> >>>> >>> separate >>> >>> >>>> -- the LowerCaseTokenizerFactory combines anyway because someone >>>> decided >>>> >>>> >>> it >>> >>> >>>> was such a common use case that it was worth it for the demonstrated >>>> performance advantage. 
(At least I hope that's what happened, otherwise >>>> there's no excuse for it!). >>>> >>>> Do you know you get a worthwhile performance benefit for what you're >>>> >>>> >>> doing? >>> >>> >>>> If not, why do it? >>>> >>>> Jonathan >>>> >>>> >>>> Scott Gonyea wrote: >>>> >>>> >>>> >>>>> I went for a different route: >>>>> >>>>> https://issues.apache.org/jira/browse/LUCENE-2644 >>>>> >>>>> Scott >>>>> >>>>> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir >>>>> wrote: >>&
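(The two tokenizer statements referenced above were stripped by the archive; presumably they were the one-step and two-step spellings of the same thing:)

  <tokenizer class="solr.LowerCaseTokenizerFactory"/>

versus:

  <tokenizer class="solr.LetterTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>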
Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?
I'd agree with your point entirely. My attacking LowerCaseTokenizer was a result of not wanting to create yet more Classes. That said, rightfully dumping LowerCaseTokenizer would probably have me creating my own Tokenizer. I could very well be thinking about this wrong... But what if I wanted to create tokens based on Non-Whitespace/Alpha/Numeric character-content? It looks like I could perhaps use the PatternTokenizer, but that didn't leave me with a comfortable feeling when I had first looked into it. Scott On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir wrote: > Jonathan, you bring up an excellent point. > > I think its worth our time to actually benchmark this LowerCaseTokenizer > versus LetterTokenizer + LowerCaseFilter > > This tokenizer is quite old, and although I can understand there is no > doubt > its technically faster than LetterTokenizer + LowerCaseFilter even today > (as > it can just go through the char[] only a single time), I have my doubts > that > this brings any value these days... > > > On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind > wrote: > > > Why would you want to do that, instead of just using another tokenizer > and > > a lowercasefilter? It's more confusing less DRY code to leave them > separate > > -- the LowerCaseTokenizerFactory combines anyway because someone decided > it > > was such a common use case that it was worth it for the demonstrated > > performance advantage. (At least I hope that's what happened, otherwise > > there's no excuse for it!). > > > > Do you know you get a worthwhile performance benefit for what you're > doing? > > If not, why do it? > > > > Jonathan > > > > > > Scott Gonyea wrote: > > > >> I went for a different route: > >> > >> https://issues.apache.org/jira/browse/LUCENE-2644 > >> > >> Scott > >> > >> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir wrote: > >> > >> > >> > >>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea > wrote: > >>> > >>> > >>> > >>>> Hi, > >>>> > >>>> I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't > create > >>>> tokens, based solely on lower-casing characters. Is there a way to > tell > >>>> > >>>> > >>> it > >>> > >>> > >>>> NOT to drop non-characters? It's amazingly frustrating that the > >>>> TokenizerFactory and the FilterFactory have two entirely different > modes > >>>> > >>>> > >>> of > >>> > >>> > >>>> behavior. If I wanted it to tokenize based on non-lower case > >>>> characters > >>>> wouldn't I use, say, LetterTokenizerFactory and tack on the > >>>> LowerCaseFilterFactory? Or any number of combinations that would > >>>> > >>>> > >>> otherwise > >>> > >>> > >>>> achieve that specific end-result? > >>>> > >>>> > >>>> > >>> I don't think you should use LowerCaseTokenizerFactory if you dont want > >>> to > >>> divide text on non-letters, its intended to do just that. > >>> > >>> from the javadocs: > >>> LowerCaseTokenizer performs the function of LetterTokenizer and > >>> LowerCaseFilter together. It divides text at non-letters and converts > >>> them > >>> to lower case. While it is functionally equivalent to the combination > of > >>> LetterTokenizer and LowerCaseFilter, there is a performance advantage > to > >>> doing the two tasks at once, hence this (redundant) implementation. > >>> > >>> > >>> > >>> So... Is there a way for me to tell it to NOT split based on > >>> non-characters? 
> >>>Use a different tokenizer that doesn't split on non-characters, > >>> followed by > >>> a LowerCaseFilter > >>> > >>> -- > >>> Robert Muir > >>> rcm...@gmail.com > >>> > >>> > >>> > >> > >> > >> > > > > > -- > Robert Muir > rcm...@gmail.com >
Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?
There doesn't seem to have been anything readily available. All of the tokenizers make their own assumptions about how I want to treat the data. The end result is that this felt like the most direct approach. The default behavior of "LowerCaseTokenizer"(+Factory) was retained, while allowing it to be extended in very small ways--at the user's discretion. The comments noted that it was done for performance reasons, but I honestly cannot believe the performance gain is altogether worthwhile. Whether or not that's the case, I strongly believe that "LowerCaseTokenizer" should have (more correctly) been called "LowerCaseLetterTokenizer". There's arguably zero negative impact from my change. Where the (inherited) isTokenChar(int) method from LetterTokenizer was simply:

  protected boolean isTokenChar(int c) {
    return Character.isLetter(c);
  }

I've (likewise) kept the most common use case as the first check in the method:

  protected boolean isTokenChar(int c) {
    if (Character.isLetter(c)) { return true; }
    ...

Scott On Tue, Sep 14, 2010 at 2:23 PM, Jonathan Rochkind wrote: > Why would you want to do that, instead of just using another tokenizer and > a lowercasefilter? It's more confusing less DRY code to leave them separate > -- the LowerCaseTokenizerFactory combines anyway because someone decided it > was such a common use case that it was worth it for the demonstrated > performance advantage. (At least I hope that's what happened, otherwise > there's no excuse for it!). > > Do you know you get a worthwhile performance benefit for what you're doing? > If not, why do it? > > Jonathan > > > Scott Gonyea wrote: > >> I went for a different route: >> >> https://issues.apache.org/jira/browse/LUCENE-2644 >> >> Scott >> >> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir wrote: >> >> >> >>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea wrote: >>> >>> >>> >>>> Hi, >>>> >>>> I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create >>>> tokens, based solely on lower-casing characters. Is there a way to tell >>>> >>>> >>> it >>> >>> >>>> NOT to drop non-characters? It's amazingly frustrating that the >>>> TokenizerFactory and the FilterFactory have two entirely different modes >>>> >>>> >>> of >>> >>> >>>> behavior. If I wanted it to tokenize based on non-lower case >>>> characters >>>> wouldn't I use, say, LetterTokenizerFactory and tack on the >>>> LowerCaseFilterFactory? Or any number of combinations that would >>>> >>>> >>> otherwise >>> >>> >>>> achieve that specific end-result? >>>> >>>> >>>> >>> I don't think you should use LowerCaseTokenizerFactory if you dont want >>> to >>> divide text on non-letters, its intended to do just that. >>> >>> from the javadocs: >>> LowerCaseTokenizer performs the function of LetterTokenizer and >>> LowerCaseFilter together. It divides text at non-letters and converts >>> them >>> to lower case. While it is functionally equivalent to the combination of >>> LetterTokenizer and LowerCaseFilter, there is a performance advantage to >>> doing the two tasks at once, hence this (redundant) implementation. >>> >>> >>> >>> So... Is there a way for me to tell it to NOT split based on >>> non-characters? >>>Use a different tokenizer that doesn't split on non-characters, >>> followed by >>> a LowerCaseFilter >>> >>> -- >>> Robert Muir >>> rcm...@gmail.com >>> >>> >>> >> >> >> >
Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?
I went for a different route: https://issues.apache.org/jira/browse/LUCENE-2644 Scott On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir wrote: > On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea wrote: > > > Hi, > > > > I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create > > tokens, based solely on lower-casing characters. Is there a way to tell > it > > NOT to drop non-characters? It's amazingly frustrating that the > > TokenizerFactory and the FilterFactory have two entirely different modes > of > > behavior. If I wanted it to tokenize based on non-lower case > > characters > > wouldn't I use, say, LetterTokenizerFactory and tack on the > > LowerCaseFilterFactory? Or any number of combinations that would > otherwise > > achieve that specific end-result? > > > > I don't think you should use LowerCaseTokenizerFactory if you dont want to > divide text on non-letters, its intended to do just that. > > from the javadocs: > LowerCaseTokenizer performs the function of LetterTokenizer and > LowerCaseFilter together. It divides text at non-letters and converts them > to lower case. While it is functionally equivalent to the combination of > LetterTokenizer and LowerCaseFilter, there is a performance advantage to > doing the two tasks at once, hence this (redundant) implementation. > > > > So... Is there a way for me to tell it to NOT split based on > non-characters? > > > > Use a different tokenizer that doesn't split on non-characters, followed by > a LowerCaseFilter > > -- > Robert Muir > rcm...@gmail.com >
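(Spelled out as schema XML, Robert's suggestion amounts to something like the following; a sketch, assuming whitespace is the only boundary you actually want:)

  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>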
LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?
Hi, I'm tweaking my schema, and the LowerCaseTokenizerFactory doesn't create tokens based solely on lower-casing characters. Is there a way to tell it NOT to drop non-characters? It's amazingly frustrating that the TokenizerFactory and the FilterFactory have two entirely different modes of behavior. If I wanted it to tokenize based on non-lower-case characters, wouldn't I use, say, LetterTokenizerFactory and tack on the LowerCaseFilterFactory? Or any number of combinations that would otherwise achieve that specific end result? So... is there a way for me to tell it to NOT split based on non-characters? If not, I'd really like to submit a patch to make it behave as advertised--which is the next best thing to yelling incoherently at the poor guy who wrote it :). Scott
Re: In Need of Direction; Phrase-Context Tracking / Injection (Child Indexes) / Dismissal
I've been considering the use of Hadoop, since that's what Nutch uses. Unless I piggy-back onto Nutch's MR job, when creating a Solr index, I'm wondering if it's overkill. I can see ways of working it into a MapReduce workflow, but it would involve dumping the database onto HDFS beforehand. I'm still debating that one, with myself. One other thing that I want to take advantage of is Lucene/Solr's filter factories (?). I'm not sure if I have the terminology right, but there are a lot of advanced text-parsing features. IE, a search for "reality" would also turn up "reale." It seems that I would want to perform my "find words, filter out any white-listed context, and re-inject" after Nutch stuffs Solr with all of its crawl data. So, perhaps I can get help starting at #1 of your suggestion: How would I best extract a phrase from Solr? IE, can I tell Solr "give me each occurrence of X in document Y" or (and I'm guessing this is it) where would I look to perform that kind of a search, myself? (A sketch of one approach follows this message.) Thinking about it, I imagine that Solr might tend to "flatten" words in its index. Ie, the string "reality" only really occurs once in a given page's index, and (maybe?) it'll have some boost reflecting the number of times it appeared. Please excuse my obscene generalizations :(. I'm going to do some more digging through the Solr codebase. I appreciate your help. I am a bit of a beggar when it comes to seeking out help on where to start. But, as I mentioned on the Nutch list, I will contribute all of my changes back to Solr. I'll also look to improve documentation, which I still owe Nutch, but that's queueing up for when there's a lull. Thank you, - Scott On Fri, Sep 3, 2010 at 1:19 AM, Jan Høydahl / Cominvent < jan@cominvent.com> wrote: > Hi, > > This smells like a job for Hadoop and perhaps Mahout, unless your use cases > are totally ad-hoc research. > After Nutch has fetched the sites, kick off some MapReduce jobs for each > case you wish to study: > 1. Extract phrases/contexts > 2. For each context, perform detection and whitelisting > 3. In the reduce step, sum it all up, and write the results to some store > 4. Now you may index a "report" per site into Solr, with links to the > original pages for each context > > You may be able to represent your grammar as textual rules instead of code. > Your latency may be minutes instead of milliseconds though... > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > Training in Europe - www.solrtraining.com > > On 3. sep. 2010, at 01.03, Scott Gonyea wrote: > > > Hi Grant, > > > > Thanks for replying--sorry for sticking this on dev; I had imagined that > > development against the Solr codebase would be inevitable. > > > > The application has to do with regulatory and legal compliance work by a > > non-profit, and is "socially good," but I need to 'abstract' the > > problem/goals--as it's not mine to disclose. > > > > Crawl several websites, ie: slashdot, engadget, etc., inject them into > Solr, > > and search for a given word. > > > > Issue 1: How many times did that word appear, on the URL returned by > Solr? > > > > Suppose that word is "Linux" and you want to make sure that each > occurrence > > of "Linux" also acknowledges that "Linux" is "GNU/Linux" (pedanticism > gone > > wild). Now, suppose that "GNU Linux" is ok. And even "GNU Projects such > as > > Linux" is OK too. So, now: > > > > Issue 2: Suppose that your goal is now to separate the noise from the > > signal. 
You therefore "white list" occurrences in which "Linux" appears > > without a "GNU/" prefix, yet which you've deemed acceptable within the > given > > context. "GNU/Linux" would be a starting point for any of your > > white-listing tasks. > > > > Simply iterating over what is--and is not--a "white list" just doesn't > scale > > on a lot of levels. So my approach is to maintain a separate datastore, > > which contains a list of phrases that are worthy of whomever's attention, > as > > well as a whole lot of "phrase-contexts"... Or the context in which the > > phrase appeared. > > > > Suppose that one website lists "Linux" 20 times; the goal is to > white-list > > all 20 of those occurrences. Or perhaps "Linux" appears 20 times, within > > the same context, then you might only need 1 "white list" to knock out > all > > 20. F
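(A hedged partial answer to the "give me each occurrence of X in document Y" question above: stock highlighting gets close, since hl.snippets caps how many fragments come back per document. The parameters below are standard; the host, values, and names are placeholders:)

  http://localhost:8983/solr/select?q=content:linux&fq=id:Y&hl=true&hl.fl=content&hl.snippets=50&hl.fragsize=120

Counting occurrences is easier still: the TermVectorComponent that arrived in Solr 1.4 can return per-document term frequencies (tv=true&tv.tf=true), assuming the field is indexed with termVectors="true".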
Re: In Need of Direction; Phrase-Context Tracking / Injection (Child Indexes) / Dismissal
Hi Grant, Thanks for replying--sorry for sticking this on dev; I had imagined that development against the Solr codebase would be inevitable. The application has to do with regulatory and legal compliance work by a non-profit, and is "socially good," but I need to 'abstract' the problem/goals--as it's not mine to disclose. Crawl several websites, ie: slashdot, engadget, etc., inject them into Solr, and search for a given word. Issue 1: How many times did that word appear, on the URL returned by Solr? Suppose that word is "Linux" and you want to make sure that each occurrence of "Linux" also acknowledges that "Linux" is "GNU/Linux" (pedanticism gone wild). Now, suppose that "GNU Linux" is ok. And even "GNU Projects such as Linux" is OK too. So, now: Issue 2: Suppose that your goal is now to separate the noise from the signal. You therefore "white list" occurrences in which "Linux" appears without a "GNU/" prefix, yet which you've deemed acceptable within the given context. "GNU/Linux" would be a starting point for any of your white-listing tasks. Simply iterating over what is--and is not--a "white list" just doesn't scale on a lot of levels. So my approach is to maintain a separate datastore, which contains a list of phrases that are worthy of whomever's attention, as well as a whole lot of "phrase-contexts"... Or the context in which the phrase appeared. Suppose that one website lists "Linux" 20 times; the goal is to white-list all 20 of those occurrences. Or, if "Linux" appears 20 times within the same context, then you might only need 1 "white list" to knock out all 20. Further, the white-listing can generally be applied to other sites in which they appear. I'd love to get some thoughts on how to tackle this problem, but I think that kicking off separate documents, within Solr, for each specific occurrence... would be the simplest path. But again, I'd love for some thoughts on how else I might do this, or where I should start my coding :) Thank you very much, Scott Gonyea On Thu, Sep 2, 2010 at 2:12 PM, Grant Ingersoll wrote: > Dropping d...@lucene.a.o. > > How about we step back and please explain the problem you are trying to > solve, as opposed to the proposed solution to the problem below. You can > likely do what you want below in Solr/Lucene (modulo replacing the index > with a new document), but the bigger question is "is that the best way to do > it?" I think if you give us that context, then perhaps we can brainstorm on > solutions. > > Thanks, > Grant > > > On Sep 1, 2010, at 8:29 PM, Scott Gonyea wrote: > > > Hi, > > > > I'm looking to get some direction on where I should focus my attention, > with regards to the Solr codebase and documentation. Rather than write a > ton of stuff no one wants to read, I'll just start with a use-case. For > context, the data originates from Nutch crawls and is indexed into Solr. > > > > Imagine a web page has the following content (4 occurrences of "Johnson" > are bolded): > > > > --content_-- > > Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean > id urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla > magna, nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. > Mauris a arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget > ligula nisi. Ut fringilla ullamcorper sem. > > --_content-- > > > > First; I would like to have the entire "content" block be indexed within > Solr. This is done and definitely not an issue. 
> > > > Second (+); during the injection of crawl data into Solr, I would like to > grab every occurrence of a specific word, or phrase, with "Johnson" being my > example for the above. I want to take every such phrase (without > collision), as well as its unique-context, and inject that into its own, > separate Solr index. For example, the above "content" example, having been > indexed in its entirety, would also be the source of 4 additional indexes. > In each index, "Johnson" would only appear once. All of the text before > and after "Johnson" would be BOUND BY any other occurrence of "Johnson." > eg: > > --index1_-- > > Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean > id urna et justo fringilla dictum > > --_index1-- --index2_-- > > sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla > dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed > > --_index2-- --index3_-- > > in at t
In Need of Direction; Phrase-Context Tracking / Injection (Child Indexes) / Dismissal
Hi, I'm looking to get some direction on where I should focus my attention, with regards to the Solr codebase and documentation. Rather than write a ton of stuff no one wants to read, I'll just start with a use-case. For context, the data originates from Nutch crawls and is indexed into Solr. Imagine a web page has the following content (4 occurrences of "Johnson" are bolded): --content_-- Lorem ipsum dolor *Johnson* sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum *johnson* in at tortor. Nulla eu nulla magna, nec sodales est. Sed *johnSon* sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada *Johnsons* mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem. --_content-- *First*; I would like to have the entire "content" block be indexed within Solr. This is done and definitely not an issue. *Second* (+); during the injection of crawl data into Solr, I would like to grab every occurrence of a specific word, or phrase, with "Johnson" being my example for the above. I want to take every such phrase (without collision), as well as its unique-context, and inject that into its own, separate Solr index. For example, the above "content" example, having been indexed in its entirety, would also be the source of 4 additional indexes. In each index, "Johnson" would only appear once. All of the text before and after "Johnson" would be BOUND BY any other occurrence of "Johnson." eg: --index1_-- Lorem ipsum dolor *Johnson* sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum --_index1-- --index2_-- sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum *johnson* in at tortor. Nulla eu nulla magna, nec sodales est. Sed --_index2-- --index3_-- in at tortor. Nulla eu nulla magna, nec sodales est. Sed *johnSon* sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada --_index3-- --index4_-- sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada *Johnsons* mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem. --_index4-- Q: How much of this is feasible in "present-day Solr" and how much of it do I need to produce in a patch of my own? Can anyone give me some direction on where I should look, in approaching this problem (ie, libs / classes / confs)? I sincerely appreciate it. *Third*; I would later like to go through the above child indexes and dismiss any that appear within a given context. For example, I may deem "ipsum dolor *Johnson* sit amet" as not being useful and I'd want to delete any indexes matching that particular phrase-context. The deletion is trivial and, with the 2nd item resolved--this becomes a fairly non-issue. Q: The question, more or less, comes from the fact that my source data is from a web crawler. When recrawled, I need to repeat the process of dismissing phrase-contexts that are not relevant to me. Where is the best place to perform this work? I could easily perform queries, after indexing my crawl, but that seems needlessly intensive. I think the answer to that will be "wherever I implement #2", but assumptions can be painfully expensive. Thank you for reading my bloated e-mail. Again, I'm mostly just looking to be pointed to various pieces of the Lucene / Solr code-base, and am trolling for any insight that people might share. Scott Gonyea
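(Since the *Second* item is the crux, here is a rough client-side sketch of it, not a Solr patch, using the rsolr gem that appears earlier in this archive. The field names, id scheme, and query are all invented for illustration:)

  require 'rsolr' # gem install rsolr

  solr    = RSolr.connect(:url => 'http://localhost:8983/solr')
  pattern = /johnson/i # the word whose occurrences we want

  doc     = solr.select(:q => 'id:some-page')["response"]["docs"].first
  content = doc["content"]

  # Collect the [start, end) offsets of every occurrence.
  spans = []
  content.scan(pattern) { spans << [Regexp.last_match.begin(0), Regexp.last_match.end(0)] }

  # Each child document runs from just past the previous occurrence
  # up to the start of the next one (or the content boundaries).
  children = spans.each_with_index.map do |_span, i|
    left  = i.zero?             ? 0            : spans[i - 1][1]
    right = i == spans.size - 1 ? content.size : spans[i + 1][0]
    { "id"      => "#{doc['id']}-occ#{i + 1}", # invented id scheme
      "parent"  => doc["id"],                  # invented field
      "content" => content[left...right].strip }
  end

  solr.add(children)
  solr.commit

Each fragment contains its occurrence plus the text bounded by the neighboring occurrences, and goes back into Solr as its own document, which also makes the later *Third* "dismissal" step a plain delete-by-query.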