Re: dismax request handler without q
try something like this: q.alt=*:*&fq=keyphrase:hotel though if you don't need to query across multiple fields, dismax is probably not the best choice

On Tue, Jul 20, 2010 at 4:57 AM, olivier sallou olivier.sal...@gmail.com wrote: q will search in defaultSearchField if no field name is set, but you can specify in your q param the fields you want to search. Dismax is a handler where you can specify a number of fields to look in for the input query. In this case, you do not specify the fields and dismax will look in the fields specified in its configuration. However, by default, dismax is not used; it needs to be called with the query type parameter (qt=dismax). In the default solr config, you can call ...solr/select?q=keyphrase:hotel if keyphrase is a declared field in your schema

2010/7/20 Chamnap Chhorn chamnapchh...@gmail.com: I can't put q=keyphrase:hotel in my request using the dismax handler. It returns no result.

On Tue, Jul 20, 2010 at 1:19 PM, Chamnap Chhorn chamnapchh...@gmail.com wrote: There is some default configuration in my solrconfig.xml that I didn't show you. I'm a little confused when reading http://wiki.apache.org/solr/DisMaxRequestHandler#q. I think q is for plain user input query.

On Tue, Jul 20, 2010 at 12:08 PM, olivier sallou olivier.sal...@gmail.com wrote: Hi, this is not very clear; if you need to query only keyphrase, why don't you query it directly? e.g. q=keyphrase:hotel ? Furthermore, why dismax if only the keyphrase field is of interest? dismax is used to query multiple fields automatically. At least, dismax does not appear in your query (via the query type). Is it set in your config as your default request handler?

2010/7/20 Chamnap Chhorn chamnapchh...@gmail.com: I wonder how I could make a query to return only *all books* that have the keyphrase "web development" using the dismax handler? A book has multiple keyphrases (keyphrase is a multivalued field). Do I have to pass the q parameter? Is it the correct one?
http://localhost:8081/solr/select?q=hotel&fq=keyphrase:%20hotel -- Chhorn Chamnap http://chamnapchhorn.blogspot.com/
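The parameter separators in the URLs above are easy to lose when queries are pasted into mail. A small sketch of assembling the suggested dismax request programmatically; the host, port, and field name are the ones from the thread, and `urlencode` takes care of escaping:

```python
from urllib.parse import urlencode

# Sketch of the dismax request suggested in the first reply.
params = {
    "qt": "dismax",          # route the request to the dismax handler
    "q.alt": "*:*",          # fallback query used when q is absent
    "fq": "keyphrase:hotel", # filter to books with this keyphrase
}
url = "http://localhost:8081/solr/select?" + urlencode(params)
print(url)
```

Note that `urlencode` percent-escapes the `*:*` and the colon in the filter query, which is what you want when building URLs in client code.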
Re: preside != president
the general consensus among people who run into the problem you have is to use a plurals-only stemmer, a synonyms file, or a combination of both (for irregular nouns etc). if you search the archives you can find info on a plurals stemmer

On Mon, Jun 28, 2010 at 6:49 AM, dar...@ontrenet.com wrote: Thanks for the tip. Yeah, I think the stemming confounds search results as it stands (porter stemmer). I was also thinking of using my dictionary of 500,000 words with their complete morphologies and conjugations and creating a synonyms.txt to provide accurate english morphology. Is this a good idea? Darren

Hi Darren, You might want to look at the KStemmer (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem) instead of the standard PorterStemmer. It essentially has a 'dictionary' of exception words where stemming stops if found, so in your case president won't be stemmed any further than president (but presidents will be stemmed to president). You will have to integrate it into solr yourself, but that's straightforward. HTH Brendan

On Jun 28, 2010, at 8:04 AM, Darren Govoni wrote: Hi, It seems to me that because the stemming does not produce grammatically correct stems in many cases, search anomalies can occur like the one I am seeing, where I have a document with president in it and it is returned when I search for preside, a different word entirely. Is this correct or acceptable behavior? In previous discussions here on stemming, I was told it's OK as long as all the words reduce to the same stem, but when different words reduce to the same stem it seems to affect search results in a bad way. Darren
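The thread never spells out what a "plurals only" stemmer does, so here is a toy sketch. The suffix rules and exception list are illustrative only; real plural stemmers (and KStem) are dictionary-driven and far more careful. The point is that it only strips plural endings, so "preside" and "president" are never conflated the way an aggressive Porter stem conflates them:

```python
def plural_stem(word, exceptions=frozenset({"series", "species", "news"})):
    """Toy plurals-only stemmer: strips common plural endings and
    nothing else, so 'president' is never reduced toward 'preside'."""
    w = word.lower()
    if w in exceptions:
        return w
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # parties -> party
    if w.endswith("es") and w[-3] in "sxz":
        return w[:-2]                # boxes -> box
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                # presidents -> president
    return w

print(plural_stem("presidents"))  # president
print(plural_stem("preside"))     # preside (untouched: no plural ending)
```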
Re: Strange query behavior
splitOnCaseChange is creating multiple tokens from 3dsMax; disable it or enable catenateAll. use the analysis page in the admin tool to see exactly how your text will be indexed by the analyzers without having to reindex your documents; once you have it right you can do a full reindex.

On Mon, Jun 28, 2010 at 5:48 AM, Marc Ghorayeb dekay...@hotmail.com wrote: Hello, I have a title that says "3DVIA Studio Virtools Maya and 3dsMax Exporters". The analysis tool for this field gives me these tokens: 3dvia, dvia, studio, virtool, maya, 3dsmax, ds, systèm, max, export. However, when i search for 3dsmax, i get no results :( Furthermore, if i search for dsmax i get the spellchecker suggesting 3dsmax even though it doesn't find any results. If i search for any other token (3dvia, or max for example), the document is found. 3dsmax is the only token that doesn't seem to work!! :( Here is my schema for this field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
    <filter class="solr.LengthFilterFactory" min="2" max="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="${Language}" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
    <filter class="solr.LengthFilterFactory" min="2" max="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="${Language}" protected="protwords.txt"/>
  </analyzer>
</fieldType>

Can anyone help me out please? :( PS: the ${Language} is set to en (for english) in this case...
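To make the splitOnCaseChange/catenateAll interaction concrete, here is a rough simulation of the sub-word splitting. This is not the real WordDelimiterFilter (which also tracks positions, offsets, and more boundary kinds); it just shows why "3dsMax" never yields a "3dsmax" token unless you either stop splitting on case changes or turn on catenation:

```python
def subwords(token, split_on_case_change=True):
    """Split on letter/digit boundaries and, optionally, lower->upper
    case changes, roughly as WordDelimiterFilter does."""
    parts, cur = [], token[0]
    for prev, ch in zip(token, token[1:]):
        boundary = prev.isdigit() != ch.isdigit()
        if split_on_case_change and prev.islower() and ch.isupper():
            boundary = True
        if boundary:
            parts.append(cur)
            cur = ch
        else:
            cur += ch
    parts.append(cur)
    return parts

def word_delimiter(token, split_on_case_change=True, catenate_all=False):
    parts = subwords(token, split_on_case_change)
    out = [p.lower() for p in parts]
    if catenate_all and len(parts) > 1:
        out.append("".join(parts).lower())  # the whole word as one token
    return out

print(word_delimiter("3dsMax"))                     # ['3', 'ds', 'max']
print(word_delimiter("3dsMax", catenate_all=True))  # adds '3dsmax'
print(word_delimiter("3dsMax", split_on_case_change=False))
```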
Re: questions about Solr shards
there is a first pass query to retrieve all matching document ids from every shard along with relevant sorting information, the document ids are then sorted and limited to the amount needed, then a second query is sent for the rest of the documents metadata. On Sun, Jun 27, 2010 at 7:32 PM, Babak Farhang farh...@gmail.com wrote: Otis, Belated thanks for your reply. 2. The index could change between stages, e.g. a document that matched a query and was subsequently changed may no longer match but will still be retrieved. 2. This describes the situation where, for instance, a document with ID=10 is updated between the 2 calls to the Solr instance/shard where that doc ID=10 lives. Can you explain why this happens? (I.e. does each query to the sharded index somehow involve 2 calls to each shard instance from the base instance?) -Babak On Thu, Jun 24, 2010 at 10:14 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Babak, 1. Yes, you are reading that correctly. 2. This describes the situation where, for instance, a document with ID=10 is updated between the 2 calls to the Solr instance/shard where that doc ID=10 lives. 3. Yup, orthogonal. You can have a master with multiple cores for sharded and non-sharded indices and you can have a slave with cores that hold complete indices or just their shards. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Babak Farhang farh...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, June 24, 2010 6:32:54 PM Subject: questions about Solr shards Hi everyone, There are a couple of notes on the limitations of this approach at target=_blank http://wiki.apache.org/solr/DistributedSearch which I'm having trouble understanding. 1. When duplicate doc IDs are received, Solr chooses the first doc and discards subsequent ones Received here is from the perspective of the base Solr instance at query time, right? I.e. 
if you inadvertently indexed 2 versions of the document with the same unique ID but different contents to 2 shards, then at query time, the first document (putting aside for the moment what exactly first means) would win. Am I reading this right? 2. The index could change between stages, e.g. a document that matched a query and was subsequently changed may no longer match but will still be retrieved. I have no idea what this second statement means. And one other question about shards: 3. The examples I've seen documented do not illustrate sharded, multicore setups; only sharded monolithic cores. I assume sharding works with multicore as well (i.e. the two issues are orthogonal). Is this right? Any help on interpreting the above would be much appreciated. Thank you, -Babak
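The two-pass flow described at the top of this thread can be sketched as follows. The shard layout and scores are made up, and real Solr merges by the requested sort fields (not only score) and fetches pass two per shard, but the shape of the protocol is the same: ids plus sort info first, stored fields only for the winners:

```python
def distributed_search(shards, rows):
    """Sketch of the two-pass shard protocol: pass one collects only
    (id, score) pairs from every shard, pass two fetches stored fields
    for the winning ids. 'shards' maps shard -> {doc_id: (score, doc)}."""
    # Pass 1: document ids + relevant sorting information only
    candidates = []
    for shard_name, docs in shards.items():
        for doc_id, (score, _doc) in docs.items():
            candidates.append((score, doc_id, shard_name))
    candidates.sort(key=lambda t: -t[0])   # merge-sort by score
    winners = candidates[:rows]            # limit to the amount needed
    # Pass 2: retrieve the metadata for just the winning ids
    return [shards[s][doc_id][1] for _score, doc_id, s in winners]

shards = {
    "shard1": {"a": (0.9, {"id": "a"}), "b": (0.3, {"id": "b"})},
    "shard2": {"c": (0.7, {"id": "c"})},
}
print(distributed_search(shards, 2))  # doc a, then doc c
```

The gap between the two passes is exactly where the "index could change between stages" caveat bites: a document updated after pass one may no longer match by the time pass two retrieves it.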
Re: SOLR partial string matching question
you want a combination of WhitespaceTokenizer and EdgeNGramFilter http://lucene.apache.org/solr/api/org/apache/solr/analysis/WhitespaceTokenizerFactory.html http://lucene.apache.org/solr/api/org/apache/solr/analysis/EdgeNGramFilterFactory.html the first will create tokens for each word; the second will create multiple tokens from each word prefix. use the analysis link from the admin page to test your filter chain and make sure it's doing what you want. On Tue, Jun 22, 2010 at 4:06 PM, Vladimir Sutskever vladimir.sutske...@jpmorgan.com wrote: Hi, Can you guys make a recommendation for which types/filters to use to accomplish the following partial keyword match: A. Actual Indexed Term: "bank of america" B. User Enters Search Term: "of ameri" I would like SOLR to match document "bank of america" with the partial string "of ameri" Any suggestions? Kind regards, Vladimir Sutskever Investment Bank - Technology JPMorgan Chase, Inc. This email is confidential and subject to important disclaimers and conditions including on offers for the purchase or sale of securities, accuracy and completeness of information, viruses, confidentiality, legal privilege, and legal entity disclaimers, available at http://www.jpmorgan.com/pages/disclosures/email.
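A toy model of why that combination matches "of ameri" against "bank of america": each whitespace token is expanded into its prefixes at index time, so every query word only has to be a prefix of some indexed word. (The real filters also handle positions and gram-size limits; this just shows the matching logic.)

```python
def edge_ngrams(word, min_gram=2, max_gram=15):
    """Prefixes of a word between min_gram and max_gram, as
    EdgeNGramFilter emits for each whitespace-separated token."""
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]

def partial_match(indexed_text, user_query):
    """Index side: whitespace tokenize + edge ngrams. Query side: each
    typed word must equal some indexed gram, i.e. prefix some word."""
    grams = set()
    for w in indexed_text.lower().split():
        grams.update(edge_ngrams(w))
    return all(q in grams for q in user_query.lower().split())

print(partial_match("bank of america", "of ameri"))  # True
print(partial_match("bank of america", "meri"))      # False: not a prefix
```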
Re: DismaxRequestHandler
the qs parameter affects matching, but you have to wrap your query in double quotes, ex q="oil spill"&qf=title description&qs=4&defType=dismax. im not sure how to formulate such a query to apply that rule just to description, maybe with nested queries ... On Thu, Jun 17, 2010 at 12:01 PM, Blargy zman...@hotmail.com wrote: I have a title field and a description field. I am searching across both fields but I don't want description matches unless they are within some slop of each other. How can I query for this? It seems that im getting back crazy results when there are matches that are nowhere near each other -- View this message in context: http://lucene.472066.n3.nabble.com/DismaxRequestHandler-tp903641p903641.html Sent from the Solr - User mailing list archive at Nabble.com.
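Roughly speaking, phrase slop (qs) bounds how many position moves are needed to make the phrase terms adjacent and in order. A simplified two-term sketch (Lucene's actual sloppy-phrase matching generalizes this to n terms and factors the distance into scoring):

```python
def phrase_slop_cost(tokens, a, b):
    """Minimum number of position moves needed to make terms a and b
    adjacent, in order; a match requires cost <= slop. Swapped terms
    cost one extra move, mirroring Lucene's sloppy phrase behaviour."""
    best = None
    for i, tok in enumerate(tokens):
        if tok != a:
            continue
        for j, tok2 in enumerate(tokens):
            if tok2 != b or i == j:
                continue
            cost = j - i - 1 if j > i else i - j + 1
            if best is None or cost < best:
                best = cost
    return best  # None if either term is missing

doc = "the oil from the spill washed ashore".split()
print(phrase_slop_cost(doc, "oil", "spill"))  # 2, so qs=4 matches
```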
Re: Exact match on a filter
use a copyField and index the copy as type string; exact matches on that field should then work as the text won't be tokenized On Thu, Jun 17, 2010 at 3:13 PM, Pete Chudykowski pchudykow...@shopzilla.com wrote: Hi, I'm trying with no luck to filter on the exact-match value of a field. Specifically: fq=brand:apple returns documents whose 'brand' field contains values like "apple bottoms". Is there a way to formulate the fq expression to match precisely and only "apple"? Thanks in advance for your help. Pete.
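The difference the string-typed copyField makes, in miniature. This toy ignores the real analysis chain (lowercasing, filters); it only contrasts "query term matches some token" against "query term equals the whole untokenized value":

```python
def tokenized_match(field_value, query_term):
    """Text-field behaviour: the term only has to match one token,
    so brand:apple also hits 'apple bottoms'."""
    return query_term.lower() in field_value.lower().split()

def string_match(field_value, query_term):
    """string-type copyField behaviour: the whole untokenized value
    must equal the query term."""
    return field_value == query_term

print(tokenized_match("apple bottoms", "apple"))  # True: token matches
print(string_match("apple bottoms", "apple"))     # False: values differ
print(string_match("apple", "apple"))             # True: exact match
```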
Re: DismaxRequestHandler
see yonik's post on nested queries http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/ so for example i thought you could possibly do a dismax query across the main fields (in this case just title) and OR that with _query_:{!description:'oil spill'~4} On Thu, Jun 17, 2010 at 3:01 PM, MitchK mitc...@web.de wrote: Joe, please, can you provide an example of what you are thinking of? Subqueries with Solr... I've never seen something like that before. Thank you! Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/DismaxRequestHandler-tp903641p904142.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: federated / meta search
yes, you can use distributed search across shards with different schemas as long as the query only references overlapping fields, i usually test adding new fields or tokenizers on one shard and deploy only after i verified its working properly On Thu, Jun 17, 2010 at 1:10 PM, Markus Jelsma markus.jel...@buyways.nl wrote: Hi, Check out Solr sharding [1] capabilities. I never tested it with different schema's but if each node is queried with fields that it supports, it should return useful results. [1]: http://wiki.apache.org/solr/DistributedSearch Cheers. -Original message- From: Sascha Szott sz...@zib.de Sent: Thu 17-06-2010 19:44 To: solr-user@lucene.apache.org; Subject: federated / meta search Hi folks, if I'm seeing it right Solr currently does not provide any support for federated / meta searching. Therefore, I'd like to know if anyone has already put efforts into this direction? Moreover, is federated / meta search considered a scenario Solr should be able to deal with at all or is it (far) beyond the scope of Solr? To be more precise, I'll give you a short explanation of my requirements. Assume, there are a couple of Solr instances running at different places. The documents stored within those instances are all from the same domain (bibliographic records), but it can not be ensured that the schema definitions conform to 100%. But lets say, there are at least some index fields that are present in all instances (fields with the same name and type definition). Now, I'd like to perform a search on all instances at the same time (with the restriction that the query contains only those fields that overlap among the different schemas) and combine the results in a reasonable way by utilizing the score information associated with each hit. Please note, that due to legal issues it is not feasible to build a single index that integrates the documents of all Solr instances under consideration. Thanks in advance, Sascha
Re: how to have shards parameter by default
youve created an infinite loop: the shard you query calls all other shards and itself, and so on. create a separate requestHandler and query that, ex:

<requestHandler name="/distributed_select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">localhost:7500/solr,localhost:7501/solr,localhost:7502/solr,localhost:7503/solr,localhost:7504/solr,localhost:7505/solr,localhost:7506/solr</str>
  </lst>
  <arr name="components">
    <str>facet</str>
    <str>debug</str>
  </arr>
</requestHandler>

On Wed, Jun 9, 2010 at 9:10 PM, Scott Zhang macromars...@gmail.com wrote: I tried putting shards into the default request handler. But now each time I search, solr hangs forever. So what's the correct solution? Thanks.

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
    <str name="shards">localhost:7500/solr,localhost:7501/solr,localhost:7502/solr,localhost:7503/solr,localhost:7504/solr,localhost:7505/solr,localhost:7506/solr</str>
  </lst>
</requestHandler>

On Thu, Jun 10, 2010 at 11:48 AM, Scott Zhang macromars...@gmail.com wrote: Hi. I am running distributed search on solr. I have 70 solr instances. So each time I want to search I need to use ?shards=localhost:7500/solr,localhost..7620/solr It is a very long url, so how can I put shards into the config file so I don't need to type it each time. thanks. Scott
Re: Field Collapsing: How to estimate total number of hits
dont know if its the best solution but i have a field i facet on called type, its either 0 or 1. combined with collapse.facet=before i just sum all the values of the facet field to get the total number found. if you dont have such a field u can always add a field with a single value --joe On Wed, May 12, 2010 at 10:41 AM, Sergey Shinderuk sshinde...@gmail.com wrote: Hi, fellows! I use field collapsing to collapse near-duplicate documents based on a document fuzzy signature calculated at index time. The problem is that, when field collapsing is enabled, in the query response numFound is equal to the number of rows requested. For instance, with the solr example schema i can issue the following query http://localhost:8983/solr/select?q=*:*&rows=3&collapse.field=manu_exact In the response i get collapse_counts together with the ordinary result list, but numFound equals 3. As far as I understand, this is due to the way field collapsing works. I want to show the total number of hits to the user and provide pagination through the results. Any ideas? Regards, Sergey Shinderuk
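The suggested workaround in one line of arithmetic: with collapse.facet=before the facet counts are computed on the pre-collapse result set, so summing the counts of a single-valued facet field recovers the true total. The field name and counts below are made up:

```python
def total_from_facets(facet_counts):
    """Sum per-value counts of a single-valued facet field computed
    before collapsing; this equals the total number of matches."""
    return sum(facet_counts.values())

facets = {"0": 120, "1": 35}      # hypothetical counts for a 'type' field
print(total_from_facets(facets))  # 155 total hits, regardless of rows
```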
synonym filter and offsets
hello *, im having issues with the synonym filter altering token offsets. my input text is "saturday night live". it is tokenized by the whitespace tokenizer yielding 3 tokens [saturday, 0, 8], [night, 9, 14], [live, 15, 19]. on indexing these are passed through a synonym filter that has this line: saturday night live => snl, saturday night live. i now end up with four tokens [saturday, 0, 19], [snl, 0, 19], [night, 0, 19], [live, 0, 19]. what i want is [saturday, 0, 8], [snl, 0, 19], [night, 9, 14], [live, 15, 19]. when using the highlighter i want to make it so only the relevant part of the text is highlighted. how can i fix my filter chain? thx much --joe
highlighter issue
hello *, i have a field that is indexing the string "the ex-girlfriend" as these tokens: [the, exgirlfriend, ex, girlfriend]. then they are passed to the edgengram filter; this allows me to match different user spellings and allows for partial highlighting. however a token like 'ex' would get generated twice, which should be fine, except the highlighter seems to highlight that token twice even though it has the same offsets (4,6). is there a way to make the highlighter not highlight the same token twice, or do i have to create a token filter that would dump tokens with equal text and offsets? basically whats happening now is if i search 'the e', i get: '<em>Seinfeld</em> The <em>E</em><em>E</em>x-Girlfriend' for 'the ex', i get: '<em>Seinfeld</em> The <em>Ex</em><em>Ex</em>-Girlfriend' and so on thx much --joe
Re: highlighter issue
i had tried it earlier with no effect; when i looked at the source, it doesnt look at offsets at all, just position increments, so short of somebody finding a better way im going to create a similar filter that compares offsets... On Fri, Apr 2, 2010 at 2:07 PM, Erik Hatcher erik.hatc...@gmail.com wrote: Will adding the RemoveDuplicatesTokenFilter(Factory) do the trick here? Erik On Apr 2, 2010, at 4:13 PM, Joe Calderon wrote: hello *, i have a field that is indexing the string "the ex-girlfriend" as these tokens: [the, exgirlfriend, ex, girlfriend]. then they are passed to the edgengram filter; this allows me to match different user spellings and allows for partial highlighting. however a token like 'ex' would get generated twice, which should be fine, except the highlighter seems to highlight that token twice even though it has the same offsets (4,6). is there a way to make the highlighter not highlight the same token twice, or do i have to create a token filter that would dump tokens with equal text and offsets? basically whats happening now is if i search 'the e', i get: '<em>Seinfeld</em> The <em>E</em><em>E</em>x-Girlfriend' for 'the ex', i get: '<em>Seinfeld</em> The <em>Ex</em><em>Ex</em>-Girlfriend' and so on thx much --joe
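The filter proposed above (deduplicate on text plus offsets, since RemoveDuplicatesTokenFilter only looks at position increments) is simple enough to sketch. Tokens are modeled as (text, startOffset, endOffset) tuples; a real Solr TokenFilter would do the same check inside incrementToken():

```python
def drop_duplicate_tokens(tokens):
    """Drop any token whose text AND offsets match an already-seen
    token, so the highlighter cannot wrap the same span twice."""
    seen = set()
    out = []
    for text, start, end in tokens:
        key = (text, start, end)
        if key not in seen:
            seen.add(key)
            out.append((text, start, end))
    return out

# 'ex' generated twice with identical offsets, as in the thread
tokens = [("the", 0, 3), ("ex", 4, 6), ("ex", 4, 6), ("girlfriend", 7, 17)]
print(drop_duplicate_tokens(tokens))
```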
how to create this highlighter behaviour
hello *, ive been using the highlighter and been pretty happy with its results, however theres an edge case im not sure how to fix. for the query: amazing grace, the record matched and highlighted is '<em>amazing</em> rendition of <em>amazing grace</em>'. is there any way to only highlight "amazing grace" without using phrase queries? can i modify the highlighter components to only use terms once and to favor contiguous sections? i dont want to enforce phrase queries as sometimes i do want terms highlighted out of order, but i only want each term matched highlighted once. does this make sense?
Re: Need help in deploying the modified SOLR source code
do `ant clean dist` within the solr source and use the resulting war file, though in the future you might think about extending the built in parser and creating a parser plugin rather than modifying the actual sources, see http://wiki.apache.org/solr/SolrPlugins#QParserPlugin for more info --joe On 03/12/2010 07:34 PM, JavaGuy84 wrote: Hi, I made some changes to solrqueryparser.java using Eclipse and I am able to do a leading wildcard search using the Jetty plugin (downloaded this plugin for eclipse).. Now I am not sure how I can package this code and redeploy it. Can someone help me out please? Thanks, B
Re: Highlighting
just to make sure we're on the same page, you're saying that the highlight section of the response is empty, right? the results section is never highlighted, but a separate section contains the highlighted fields specified in hl.fl= On Wed, Mar 10, 2010 at 5:23 AM, Ahmet Arslan iori...@yahoo.com wrote: Yes Content is stored and I get same results adding that parameter. Still not highlighting the content :-( Any other ideas? What is the field type of attr_content? And what is your query? Are you running your query on another field and then requesting snippets from attr_content? q=attr_content:somequery&hl=true&hl.fl=attr_content&hl.maxAnalyzedChars=-1 should return highlighting.
Re: Highlighting
no, thats not the case, see this example response in json format:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "indent":"on",
      "q":"title_edge:fami",
      "hl.fl":"title_edge",
      "wt":"json",
      "hl":"on",
      "rows":"1"}},
  "response":{"numFound":18,"start":0,"docs":[
    {
      "title_id":1581,
      "title_edge":"Family",
      "num":4}]
  },
  "highlighting":{
    "1581":{
      "title_edge":["<em>Fami</em>ly"]}}}

see how the highlight info is separate from the results?

On Wed, Mar 10, 2010 at 7:44 AM, Lee Smith l...@weblee.co.uk wrote: I am getting results no problem with the query. But from what I believe it should wrap <em/> around the text in the result. So if I search e.g. Andrew, within the returned content I would have the contents with the word <em>Andrew</em> and hl.fl=attr_content Thank you for your help

Begin forwarded message: From: Joe Calderon calderon@gmail.com Date: 10 March 2010 15:37:35 GMT To: solr-user@lucene.apache.org Subject: Re: Highlighting Reply-To: solr-user@lucene.apache.org just to make sure we're on the same page, you're saying that the highlight section of the response is empty, right? the results section is never highlighted but a separate section contains the highlighted fields specified in hl.fl= On Wed, Mar 10, 2010 at 5:23 AM, Ahmet Arslan iori...@yahoo.com wrote: Yes Content is stored and I get same results adding that parameter. Still not highlighting the content :-( Any other ideas? What is the field type of attr_content? And what is your query? Are you running your query on another field and then requesting snippets from attr_content? q=attr_content:somequery&hl=true&hl.fl=attr_content&hl.maxAnalyzedChars=-1 should return highlighting.
Re: Highlighting
did u enable the highlighting component in solrconfig.xml? try setting debugQuery=true to see if the highlighting component is even being called... On Tue, Mar 9, 2010 at 12:23 PM, Lee Smith l...@weblee.co.uk wrote: Hey All I have indexed a whole bunch of documents and now I want to search against them. My search is going great all but highlighting. I have these items set hl=true hl.snippets=2 hl.fl = attr_content hl.fragsize=100 Everything works apart from the highlighted text found not being surrounded with a em Am I missing a setting ? Lee
Re: indexing a huge data
ive found the csv update to be exceptionally fast, though others enjoy the flexibility of the data import handler On Fri, Mar 5, 2010 at 10:21 AM, Mark N nipen.m...@gmail.com wrote: what should be the fastest way to index documents? I am indexing a huge collection of data after extracting certain meta-data information, for example the author and filename of each file. i am extracting this information and storing it in XML format, for example:

<fileid>1</fileid><author>abc</author> <filename>abc.doc</filename>
<fileid>2</fileid><author>abc</author> <filename>abc1.doc</filename>

I can not index these documents directly to solr as they are not in the format required by solr (i can not change the format as it's used in other modules). would converting these files to CSV be a better and faster approach compared to XML? please suggest -- Nipen Mark
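A sketch of the suggested conversion: flatten the metadata records into the header-plus-rows CSV shape that Solr's CSV update accepts. The `<docs>`/`<doc>` wrapper elements are hypothetical (the snippets in the mail are not well-formed XML on their own, so some root element is assumed):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical well-formed wrapper around the records shown above.
xml_data = """<docs>
  <doc><fileid>1</fileid><author>abc</author><filename>abc.doc</filename></doc>
  <doc><fileid>2</fileid><author>abc</author><filename>abc1.doc</filename></doc>
</docs>"""

def to_csv(xml_text, fields=("fileid", "author", "filename")):
    """Emit a header row, then one CSV row per <doc> record."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(fields)
    for doc in ET.fromstring(xml_text).findall("doc"):
        writer.writerow(doc.findtext(f, "") for f in fields)
    return buf.getvalue()

print(to_csv(xml_data))
```

The resulting text could then be posted to the CSV update handler in one request, which is where the speed advantage over per-document XML updates comes from.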
Re: Issue on stopword list
or you can try the commongrams filter that combines tokens next to a stopword On Tue, Mar 2, 2010 at 6:56 AM, Walter Underwood wun...@wunderwood.org wrote: Don't remove stopwords if you want to search on them. --wunder On Mar 2, 2010, at 5:43 AM, Erick Erickson wrote: This is a classic problem with stopword removal. Have you tried just removing stopwords from the indexing definition and the query definition and reindexing? You can't search on them no matter what you do if they've been removed, they just aren't there. HTH Erick On Tue, Mar 2, 2010 at 5:47 AM, Suram reactive...@yahoo.com wrote: Hi, How can i search using stopwords? my query is like this: This - 0 results because it is a stopword is - 0 results because it is a stopword that - 0 results because it is a stopword if i search like "This is that" it must give the result. what do i need to change in my schema file to get results for "This is that"? -- View this message in context: http://old.nabble.com/Issue-on-stopword-list-tp27754434p27754434.html Sent from the Solr - User mailing list archive at Nabble.com.
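A toy model of the commongrams idea: keep every original token, and additionally emit a combined token for each adjacent pair that involves a common word. (The real CommonGramsFilter interleaves the grams at the right positions and reads its word list from a file; the stopword set below is just for illustration.)

```python
def common_grams(tokens, common_words=frozenset({"the", "this", "is", "that"})):
    """Emit each token plus a combined token for every adjacent pair
    involving a common word, so stopword phrases stay searchable."""
    out = list(tokens)
    for a, b in zip(tokens, tokens[1:]):
        if a in common_words or b in common_words:
            out.append(f"{a}_{b}")
    return out

print(common_grams(["this", "is", "that"]))
```

A query analyzed the same way produces the same grams, so "this is that" matches on the combined tokens even though each bare stopword alone would normally be dropped.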
Re: Search Result differences Standard vs DisMax
what are you using for the mm parameter? if you set it to 1 only one word has to match. On 03/01/2010 05:07 PM, Steve Reichgut wrote: ***Sorry if this was sent twice. I had connection problems here and it didn't look like it went out the first time. I have been testing out results for some basic queries using both the Standard and DisMax query parsers. The results though aren't what I expected and I am wondering if I am misunderstanding how the DisMax query parser works. For example, let's say I am doing a basic search for "Apache Solr" across a single field - Field 1 - using the Standard parser. My results are exactly what I expected. Any document that includes either "Apache" or "Solr" or "Apache Solr" in Field 1 is listed, with priority given to those that include both words. Now, if I do the same search for "Apache Solr" across multiple fields - Field 1, Field 2 - using DisMax, I would expect basically the same results. The results should include any document that has one or both words in Field 1 or Field 2. When I run that query in DisMax though, it only returns the documents that have BOTH words included, which in my sample set only includes 1 or 2 documents. I thought that, by default, DisMax should make both words optional so I am confused as to why I am only getting such a small subset. Can anyone shed some light on what I am doing wrong or if I am misunderstanding how DisMax works. Thanks, Steve
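A sketch of how a simple mm (minimum-should-match) value is interpreted: an integer is an absolute minimum, a percentage is a fraction of the optional clauses rounded down, and the result is capped at the clause count. (Real dismax mm also supports conditional expressions like "2<-25%", which this toy ignores.)

```python
def clauses_required(num_optional, mm):
    """How many optional query clauses must match for a document to
    qualify, for a simple integer or percentage mm value."""
    if isinstance(mm, str) and mm.endswith("%"):
        required = int(num_optional * int(mm[:-1]) / 100)  # round down
    else:
        required = int(mm)
    return max(0, min(num_optional, required))

print(clauses_required(2, "100%"))  # both words must match (AND-like)
print(clauses_required(2, 1))       # one word is enough (OR-like)
```

So with the two-word query "Apache Solr", mm=100% only returns documents containing both words, while mm=1 reproduces the OR-style behaviour the standard parser showed.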
Re: Changing term frequency according to value of one of the fields
extend the similarity class, compile it against the jars in lib, put in a path solr can find and set your schema to use it http://wiki.apache.org/solr/SolrPlugins#Similarity On 02/25/2010 10:09 PM, Pooja Verlani wrote: Hi, I want to modify Similarity class for my app like the following- Right now tf is Math.sqrt(termFrequency) I would like to modify it to Math.sqrt(termFrequncy/solrDoc.getFieldValue(count)) where count is one of the fields in the particular solr document. Is it possible to do so? Can I import solrDocument class and take the particular solrDoc for calculating tf in the similarity class? Please suggest. regards, Pooja
Re: Solr 1.4 distributed search configuration
you can set a default shards parameter on the request handler doing distributed search; you can set up two different request handlers, one with a shards default and one without On Thu, Feb 25, 2010 at 1:35 PM, Jeffrey Zhao jeffrey.z...@metalogic-inc.com wrote: Now I got it, just forgot to put qt=search in the query. By the way, in solr 1.3, I used shards.txt under the conf directory and distributed=true in the query for distributed search. That way, in my java application, I could hard code the solr query with distributed=true and control the use of distributed search by defining shards.txt or not. In solr 1.4, it is more difficult to use distributed search dynamically. Is there a way I can just change configuration without changing the query to make DS work? Thanks, From: Mark Miller markrmil...@gmail.com To: solr-user@lucene.apache.org Date: 25/02/2010 04:13 PM Subject: Re: Solr 1.4 distributed search configuration Can you elaborate on "doesn't work" when you put it in the /search handler? You get an error in the logs? Nothing happens? On 02/25/2010 03:47 PM, Jeffrey Zhao wrote: Hi Mark, Thanks for your reply. I did make a new handler as follows, but it does not work; anything wrong with my configuration? Thanks,

<requestHandler name="search" class="solr.SearchHandler">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="shards">202.161.196.189:8080/solr,localhost:8080/solr</str>
  </lst>
  <arr name="components">
    <str>query</str>
    <str>facet</str>
    <str>spellcheck</str>
    <str>debug</str>
  </arr>
</requestHandler>

From: Mark Miller markrmil...@gmail.com To: solr-user@lucene.apache.org Date: 25/02/2010 03:41 PM Subject: Re: Solr 1.4 distributed search configuration On 02/25/2010 03:32 PM, Jeffrey Zhao wrote: How do I define a new search handler with a shards parameter? I defined it the following way but it doesn't work. If I put the shards parameter in the default handler, it seems I get an infinite loop.

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>

<requestHandler name="search" class="solr.SearchHandler">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="shards">202.161.196.189:8080/solr,localhost:8080/solr</str>
  </lst>
  <arr name="components">
    <str>query</str>
    <str>facet</str>
    <str>spellcheck</str>
    <str>debug</str>
  </arr>
</requestHandler>

Thanks, Not seeing this on the wiki (it should be there), but you can't put the shards param on the default search handler without causing an infinite loop - you have to make a new request handler and put it on that. -- - Mark http://www.lucidimagination.com
Re: Autosuggest/Autocomplete with solr 1.4 and EdgeNGrams
i had to create an autosuggest implementation not too long ago. originally i was using faceting, where i would match wildcards on a tokenized field and facet on an unaltered field; this had the advantage that i could do everything from one index, though it was also limited by the fact that suggestions came through facets, and scoring and highlighting went out the window. what i settled on was to create a separate core for suggest to use; i analyze the fields i want to match against with the whitespace tokenizer and edgengram filter. this has multiple advantages: the query is run through text analysis, whereas wildcarded terms are not; the highlighter will highlight only the text matched, not the expanded word; scoring and boosts can be used to rank suggest results; and i tokenize on whitespace so i can match out of order tokens, ex q=family guy stewie and q=stewie family guy, etc, which is something that prefix based solutions won't be able to do. one small gotcha is that i recently submitted a patch to the edgengram filter to fix highlighting behaviour; it has been committed to lucene's trunk but it's only available in versions 2.9.2 and up unless you patch it yourself. On Wed, Feb 24, 2010 at 7:35 AM, Grant Ingersoll gsing...@apache.org wrote: You might also look at http://issues.apache.org/jira/browse/SOLR-1316 On Feb 24, 2010, at 1:17 AM, Sachin wrote: Hi All, I am trying to set up autosuggest using solr 1.4 for my site and needed some pointers on that. Basically, we provide autosuggest for user typed in characters in the searchbox. The autosuggest index is created with older user typed in search queries which returned 0 results. We do some lazy writing to store this information into the db and then export it to solr on a nightly basis. As far as I know, there are 3 ways (apart from wild card search) of achieving autosuggest using solr 1.4: 1. Use EdgeNGrams 2. Use shingles and prefix query. 3. Use the new Terms component.
I am for now more inclined towards using the EdgeNGrams (no method to the madness) and just wanted to know if there is any recommended approach out of the 3 in terms of performance, since the user expects the suggestions to be almost instantaneous. We do some heavy caching at our end to avoid hitting Solr every time, but is any of these 3 approaches faster than the others? Also, I would like to return a suggestion even if the user-typed query matches in between: for instance, if I have the query chicken pasta in my index and the user types in pasta, I would also like this query to be returned as part of the suggestion (a la Yahoo!). Below is my field definition:

<fieldType name="suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I tried replacing the KeywordTokenizerFactory with LetterTokenizerFactory, and though it works great for the above scenario (does an in-between match), it has the side effect of removing everything that is not a letter, so if the user types in 123 he gets absolutely no suggestions. Is there anything I'm missing in my configuration? Is this even achievable using EdgeNGrams, or shall I look at using perhaps the TermsComponent after applying the regex patch from 1.5 and maybe do something like .*user-typed-in-chars.*? Thanks!
Re: including 'the' dismax query kills results
use the common grams filter; it'll create tokens for stop words and their adjacent terms On Thu, Feb 18, 2010 at 7:16 AM, Nagelberg, Kallin knagelb...@globeandmail.com wrote: I've noticed some peculiar behavior with the dismax search handler. In my case I'm making the search The British Open, and am getting 0 results. When I change it to British Open I get many hits. I looked at the query analyzer and it should be broken down to british and open tokens ('the' is a stopword). I imagine it is doing an 'and' type search, and by setting the 'mm' parameter to 1 I once again get results for 'the british open'. I would like mm to be 100%, however, and just not care about stopwords. Is there a way to do this? Thanks, -Kal
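What the common-grams suggestion amounts to, as an analyzer sketch (the stopwords file name is an assumption; the query side would typically use CommonGramsQueryFilterFactory):

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- emits combined tokens like "the_british" alongside "british",
       instead of simply dropping the stopword -->
  <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
```

That way "the british open" still matches with mm=100%, because the stopword survives as part of a gram rather than counting as a missing clause.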
Re: Reindex after changing defaultSearchField?
No, you're just changing how you're querying the index, not the actual index. You will need to restart the servlet container or reload the core for the config changes to take effect though. On 02/17/2010 10:04 AM, Frederico Azeiteiro wrote: Hi, If I change the defaultSearchField in the core schema, do I need to recreate the index? Thanks, Frederico
Re: defaultSearchField and DisMaxRequestHandler
No, but you can set a default for the qf parameter with the same value. On 02/15/2010 01:50 AM, Steve Radhouani wrote: Hi there, Can the defaultSearchField option be used by the DisMaxRequestHandler? Thanks, -Steve
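A sketch of the suggested setup: dismax ignores defaultSearchField, but a qf default in the handler plays the same role (the field name "text" is a placeholder):

```xml
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- plays the role defaultSearchField plays for the standard parser -->
    <str name="qf">text</str>
  </lst>
</requestHandler>
```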
Re: problem with edgengramtokenfilter and highlighter
LUCENE-2266 filed and patch posted. On 02/13/2010 09:14 PM, Robert Muir wrote: Joe, can you open a Lucene JIRA issue for this? I just glanced at the code and it looks like a bug to me. On Sun, Feb 14, 2010 at 12:07 AM, Joe Calderon calderon@gmail.com wrote: I ran into a problem while using the EdgeNGramTokenFilter: it seems to report incorrect offsets when generating tokens. More specifically, all the tokens have offset 0 and the term length as start and end, which leads to goofy highlighting behavior when creating edge grams for tokens beyond the first one. I created a small patch that takes into account the start of the original token and adds that to the reported start/end offsets.
problem with edgengramtokenfilter and highlighter
I ran into a problem while using the EdgeNGramTokenFilter: it seems to report incorrect offsets when generating tokens. More specifically, all the tokens have offset 0 and the term length as start and end, which leads to goofy highlighting behavior when creating edge grams for tokens beyond the first one. I created a small patch that takes into account the start of the original token and adds that to the reported start/end offsets.
reloading sharedlib folder
When using solr.xml, you can specify a sharedLib directory to share among cores. Is it possible to reload the classes in this dir without having to restart the servlet container? It would be useful to be able to make changes to those classes on the fly, or to drop in new plugins.
Re: How to reindex data without restarting server
If you use the multicore model via solr.xml you can reload a core without having to restart the servlet container: http://wiki.apache.org/solr/CoreAdmin On 02/11/2010 02:40 PM, Emad Mushtaq wrote: Hi, I would like to know if there is a way of reindexing data without restarting the server. Let's say I make a change in the schema file; that would require me to reindex the data. Is there a solution to this?
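The reload the reply refers to is a single CoreAdmin request; host, port, and core name below are placeholders:

```text
http://localhost:8080/solr/admin/cores?action=RELOAD&core=core0
```

Note that reloading picks up config/schema changes for new documents and queries, but documents already indexed under the old schema still need to be re-added.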
Re: analysing wild carded terms
Sorry, what I meant to say is: apply text analysis to the part of the query that is wildcarded. For example, if a term with Latin-1 diacritics is wildcarded, I'd still like to run it through ISOLatin1AccentFilter. On Wed, Feb 10, 2010 at 4:59 AM, Fuad Efendi f...@efendi.ca wrote: hello *, quick question, what would I have to change in the query parser to allow wildcarded terms to go through text analysis? I believe it is illogical. Wildcarded terms will go through the terms enumerator.
analysing wild carded terms
hello *, quick question: what would I have to change in the query parser to allow wildcarded terms to go through text analysis?
Re: old wildcard highlighting behaviour
When I set hl.highlightMultiTerm=false, the term that matches the wildcard is not highlighted at all. Ideally I'd like a partial highlight (the characters before the wildcard), but if not I can live without it. thx much for the help --joe On Fri, Feb 5, 2010 at 10:44 PM, Mark Miller markrmil...@gmail.com wrote: On iPhone so don't remember the exact param I named it, but check the wiki; something like hl.highlightMultiTerm - set it to false. - Mark http://www.lucidimagination.com (mobile) On Feb 6, 2010, at 12:00 AM, Joe Calderon calderon@gmail.com wrote: hello *, currently with hl.usePhraseHighlighter=true, a query for (joe jack*) will highlight <em>joe jackson</em>; however, after reading the archives, what I'm looking for is the old 1.1 behaviour, so that only <em>joe jack</em> is highlighted. Is this possible in Solr 1.5? thx much --joe
old wildcard highlighting behaviour
hello *, currently with hl.usePhraseHighlighter=true, a query for (joe jack*) will highlight <em>joe jackson</em>; however, after reading the archives, what I'm looking for is the old 1.1 behaviour, so that only <em>joe jack</em> is highlighted. Is this possible in Solr 1.5? thx much --joe
fuzzy matching / configurable distance function?
Is it possible to configure the distance formula used by fuzzy matching? I see there are others on the function query page under strdist, but I'm wondering if they are applicable to fuzzy matching. thx much --joe
source tree for lucene
I want to recompile Lucene with http://issues.apache.org/jira/browse/LUCENE-2230, but I'm not sure which source tree to use. I tried using the implied trunk revision from the admin/system page, but Solr fails to build with the generated jars, even if I exclude the patches from 2230... I'm wondering if there is another Lucene tree I should grab to build Solr? --joe
Re: distributed search and failed core
thx guys, I ended up using a mix of code from the SOLR-1143 and SOLR-1537 patches. Now whenever there is an exception, there is a section in the results indicating the result is partial and also listing the failed core(s); we've added some monitoring to check for that output as well, to alert us when a shard has failed. On Wed, Feb 3, 2010 at 10:55 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Fri, Jan 29, 2010 at 3:31 PM, Joe Calderon calderon@gmail.com wrote: hello *, in distributed search when a shard goes down, an error is returned and the search fails; is there a way to avoid the error and return the results from the shards that are still up? The SolrCloud branch has load-balancing capabilities for distributed search amongst shard replicas. http://wiki.apache.org/solr/SolrCloud -Yonik http://www.lucidimagination.com
Re: Basic indexing question
by default solr will only search the default field; you have to either query all fields, field1:(ore) OR field2:(ore) OR field3:(ore), or use a different query parser like dismax On Tue, Feb 2, 2010 at 3:31 PM, Stefan Maric sma...@ntlworld.com wrote: I have got a basic configuration of Solr up and running and have loaded some data to experiment with. When I run a query for 'ore' I get 3 results when I'm expecting 4. Dataimport is pulling the expected number of rows in from my DB view. In my schema.xml I have

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="atomId" type="string" indexed="true" stored="true" required="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="true"/>

and the defaults

<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="name" dest="text"/>

From an SQL point of view, I am expecting a search for 'ore' to retrieve 4 results (which the following does):

select * from v_sm_search_sectors where description like '% ore%' or name like '% ore%';

121 B0.010.010 Mining and quarrying | Mining of metal ore, stone, sand, clay, coal and other solid minerals
1000144 E0.030 Metal and metal ores wholesale | (null)
1000145 E0.030.010 Metal and metal ores wholesale | (null)
1000146 E0.030.020 Metal and metal ores wholesale agents | (null)

From a Solr query for 'ore' I get the following response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="indent">on</str>
      <str name="q">ore</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="3" start="0">
    <doc>
      <str name="atomId">E0.030</str>
      <str name="id">1000144</str>
      <str name="name">Metal and metal ores wholesale</str>
    </doc>
    <doc>
      <str name="atomId">E0.030.010</str>
      <str name="id">1000145</str>
      <str name="name">Metal and metal ores wholesale</str>
    </doc>
    <doc>
      <str name="atomId">E0.030.020</str>
      <str name="id">1000146</str>
      <str name="name">Metal and metal ores wholesale agents</str>
    </doc>
  </result>
</response>

So I don't retrieve the document where 'ore' is in the description field (and NOT the name field). It would seem that Solr is ONLY returning results based on what has been put into the text field by <copyField source="name" dest="text"/>. Any hints as to what I've missed? Regards, Stefan Maric
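The missing fourth result follows directly from the quoted schema: only name is copied into the default text field, so a description-only match never enters the default search. A sketch of the likely fix, assuming description should be searchable by default:

```xml
<copyField source="name" dest="text"/>
<!-- also fold the description into the default search field -->
<copyField source="description" dest="text"/>
```

After adding the copyField, the documents need to be reindexed for the change to take effect.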
Re: Basic indexing question
see http://wiki.apache.org/solr/SchemaXml#The_Default_Search_Field for details on the default field. Most people use the dismax handler when handling queries from users; see http://wiki.apache.org/solr/DisMaxRequestHandler for more details. If you don't have many fields you can write your own query using the Lucene query parser as I mentioned before; the syntax can be found at http://lucene.apache.org/java/2_9_1/queryparsersyntax.html hope this helps --joe On Tue, Feb 2, 2010 at 3:59 PM, Stefan Maric sma...@ntlworld.com wrote: Thanks for the quick reply. I will have to see if the default query mechanism will suffice for most of my needs; I have skimmed through most of the Solr documentation and didn't see anything describing this. I can easily change my DB view so that I only source Solr with a single string plus my id field (as my application making the search will have to collate associated information into a presentable screen anyhow, so I'm not too worried about info being returned by Solr as such). Would that be a reasonable way of using Solr? -----Original Message----- From: Joe Calderon [mailto:calderon@gmail.com] Sent: 02 February 2010 23:42 To: solr-user@lucene.apache.org Subject: Re: Basic indexing question by default solr will only search the default field; you have to either query all fields or use a different query parser like dismax
distributed search and failed core
hello *, in distributed search when a shard goes down, an error is returned and the search fails, is there a way to avoid the error and return the results from the shards that are still up? thx much --joe
Re: index of facet fields are not same as original string
facets are based off the indexed version of your string, not the stored version; you probably have an analyzer that's removing punctuation. Most people index the same field multiple ways for different purposes (matching, sorting, faceting, etc.): index a copy of your field as string type and facet on that. On Thu, Jan 28, 2010 at 3:12 AM, Sergey Pavlikovskiy pavlikovs...@gmail.com wrote: Hi, probably it's because of stemming; if you need unstemmed text you can use the 'textgen' data type for the field Sergey On Thu, Jan 28, 2010 at 12:25 PM, Solr user uma.ravind...@yahoo.co.in wrote: Hi, I am new to Solr. I found facet fields do not reflect the original string in the record. For example, the returned xml is:

<doc>
  <str name="g_number">G-EUPE</str>
</doc>
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="g_number">
      <int name="gupe">1</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
</lst>

Here, G-EUPE is displayed under the facet field as 'gupe', where it is not capitalized and is missing the '-' from the original string. Is there any way we could fix this to match the original text in the record? Thanks in advance. Regards, uma -- View this message in context: http://old.nabble.com/index-of-facet-fields-are-not-same-as-original-string-tp27353838p27353838.html Sent from the Solr - User mailing list archive at Nabble.com.
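The copy-field approach Joe describes could be sketched like this, using the g_number field from the question (the *_facet field name is an assumption):

```xml
<!-- analyzed field used for matching -->
<field name="g_number" type="text" indexed="true" stored="true"/>
<!-- verbatim string copy used only for faceting, so values keep case and '-' -->
<field name="g_number_facet" type="string" indexed="true" stored="false"/>
<copyField source="g_number" dest="g_number_facet"/>
```

Queries would then facet with facet.field=g_number_facet while still searching g_number.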
create requesthandler with default shard parameter for different query parser
hello *, what is the best way to create a request handler for distributed search with a default shards parameter, but that can use different query parsers? Thus far I have

<requestHandler name="/ds" class="solr.SearchHandler">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="fl">*,score</str>
    <str name="wt">json</str>
    <str name="shards">host0:8080/solr/core0,host1:8080/solr/core1,host2:8080/solr/core2,localhost:8080/solr/core3</str>
  </lst>
  <arr name="components">
    <str>query</str>
    <str>facet</str>
    <str>spellcheck</str>
    <str>debug</str>
  </arr>
</requestHandler>

which works as long as qt=standard; if I change it to dismax it doesn't use the shards parameter anymore... thx much --joe
Re: create requesthandler with default shard parameter for different query parser
thx much, I see now; having request handlers with the same name as the query parsers was confusing me. I do however have an additional problem: if I use defType it does indeed use the right query parser, but is there a way to not send all the query parameters in the URL (qf, pf, bf, etc.)? It's the main reason I'm creating the new request handler. Or do I put them all as defaults under my new request handler and let the query parser use whichever ones it supports? On Thu, Jan 21, 2010 at 11:45 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Jan 21, 2010 at 2:39 PM, Joe Calderon calderon@gmail.com wrote: hello *, what is the best way to create a request handler for distributed search with a default shards parameter, but that can use different query parsers? Thus far I have

<requestHandler name="/ds" class="solr.SearchHandler">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="fl">*,score</str>
    <str name="wt">json</str>
    <str name="shards">host0:8080/solr/core0,host1:8080/solr/core1,host2:8080/solr/core2,localhost:8080/solr/core3</str>
  </lst>
  <arr name="components">
    <str>query</str>
    <str>facet</str>
    <str>spellcheck</str>
    <str>debug</str>
  </arr>
</requestHandler>

which works as long as qt=standard; if I change it to dismax it doesn't use the shards parameter anymore... Legacy terminology causing some confusion I think... qt does stand for query type, but it actually picks the request handler. defType defines the default query parser to use, so you probably don't want to be using qt at all. So try something like: http://localhost:8983/solr/ds?defType=dismax&qf=text&q=foo -Yonik http://www.lucidimagination.com
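The follow-up question (can qf, pf, etc. live in the config instead of the URL) is answered by the same defaults mechanism; parameters a parser doesn't understand are simply ignored. A sketch, with field names, boosts, and hosts as placeholders:

```xml
<requestHandler name="/ds" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="shards">host0:8080/solr/core0,host1:8080/solr/core1</str>
    <!-- dismax-specific params kept out of the request URL -->
    <str name="qf">title^2 description</str>
    <str name="pf">title</str>
  </lst>
</requestHandler>
```

A request then only needs q: http://localhost:8080/solr/ds?q=foo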
Re: Field collapsing patch error
this has come up before; my suggestion would be to use the 12/24 patch with trunk revision 892336 http://www.lucidimagination.com/search/document/797549d29e1810d9/solr_1_4_field_collapsing_what_are_the_steps_for_applying_the_solr_236_patch 2010/1/19 Licinio Fernández Maurelo licinio.fernan...@gmail.com: Hi folks, I've downloaded Solr release 1.4 and tried to apply the latest field collapsing patch I've found (https://issues.apache.org/jira/secure/attachment/12428902/SOLR-236.patch). Errors found: d...@backend05:~/workspace/solr-release-1.4.0$ patch -p0 -i SOLR-236.patch patching file src/test/test-files/solr/conf/solrconfig-fieldcollapse.xml patching file src/test/test-files/solr/conf/schema-fieldcollapse.xml patching file src/test/test-files/solr/conf/solrconfig.xml patching file src/test/test-files/fieldcollapse/testResponse.xml patching file src/test/org/apache/solr/search/fieldcollapse/FieldCollapsingIntegrationTest.java patching file src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java patching file src/test/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapserTest.java patching file src/test/org/apache/solr/search/fieldcollapse/AdjacentCollapserTest.java patching file src/test/org/apache/solr/handler/component/CollapseComponentTest.java patching file src/test/org/apache/solr/client/solrj/response/FieldCollapseResponseTest.java patching file src/java/org/apache/solr/search/DocSetAwareCollector.java patching file src/java/org/apache/solr/search/fieldcollapse/CollapseGroup.java patching file src/java/org/apache/solr/search/fieldcollapse/DocumentCollapseResult.java patching file src/java/org/apache/solr/search/fieldcollapse/DocumentCollapser.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollectorFactory.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/DocumentGroupCountCollapseCollectorFactory.java patching file
src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AverageFunction.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MinFunction.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/SumFunction.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MaxFunction.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AggregateFunction.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/CollapseContext.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/DocumentFieldsCollapseCollectorFactory.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/AggregateCollapseCollectorFactory.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollector.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/FieldValueCountCollapseCollectorFactory.java patching file src/java/org/apache/solr/search/fieldcollapse/collector/AbstractCollapseCollector.java patching file src/java/org/apache/solr/search/fieldcollapse/AbstractDocumentCollapser.java patching file src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java patching file src/java/org/apache/solr/search/fieldcollapse/AdjacentDocumentCollapser.java patching file src/java/org/apache/solr/search/fieldcollapse/util/Counter.java patching file src/java/org/apache/solr/search/SolrIndexSearcher.java patching file src/java/org/apache/solr/search/DocSetHitCollector.java patching file src/java/org/apache/solr/handler/component/CollapseComponent.java patching file src/java/org/apache/solr/handler/component/QueryComponent.java Hunk #1 FAILED at 522. 
1 out of 1 hunk FAILED -- saving rejects to file src/java/org/apache/solr/handler/component/QueryComponent.java.rej patching file src/java/org/apache/solr/util/DocSetScoreCollector.java patching file src/common/org/apache/solr/common/params/CollapseParams.java patching file src/solrj/org/apache/solr/client/solrj/SolrQuery.java Hunk #1 FAILED at 17. Hunk #2 FAILED at 50. Hunk #3 FAILED at 76. Hunk #4 FAILED at 148. Hunk #5 FAILED at 197. Hunk #6 succeeded at 510 (offset -155 lines). Hunk #7 succeeded at 566 (offset -155 lines). 5 out of 7 hunks FAILED -- saving rejects to file src/solrj/org/apache/solr/client/solrj/SolrQuery.java.rej patching file src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java Hunk #1 succeeded at 17 with fuzz 1. Hunk #2 FAILED at 42. Hunk #3 FAILED at 58. Hunk #4 succeeded at 117 with fuzz 2 (offset -8 lines). Hunk #5 succeeded at 315 with fuzz 2 (offset 17 lines). 2 out of 5 hunks FAILED -- saving rejects to file src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java.rej patching file src/solrj/org/apache/solr/client/solrj/response/FieldCollapseResponse.java
Re: question about date boosting
I think you need to use the new TrieDateField On 01/12/2010 07:06 PM, Daniel Higginbotham wrote: Hello, I'm trying to boost results based on date using the first example here: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents However, I'm getting an error that reads, Can't use ms() function on non-numeric legacy date field. The date field uses solr.DateField. What am I doing wrong? Thank you! Daniel Higginbotham
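A sketch of the change the reply suggests: declare a trie-based date type and point the field at it, so ms() has a numeric field to work with (the field name here is a placeholder; existing documents must be reindexed after the type change):

```xml
<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
<field name="pubdate" type="tdate" indexed="true" stored="true"/>
```

After reindexing, the boost from the SolrRelevancyFAQ page the question links to, e.g. recip(ms(NOW,pubdate),3.16e-11,1,1), should stop raising the legacy-date error.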
help implementing a couple of business rules
hello *, I'm looking for help on writing queries to implement a few business rules. 1. Given a set of fields, how do I return matches that match across them, but not against just one specific one? E.g. I'm using a dismax parser currently, but I want to exclude any results that only match against a field called 'description2'. 2. Given a set of fields, how do I return matches that match across them, but on one specific field match as a phrase only? E.g. I'm using a dismax parser currently, but I want matches against a field called 'people' to only match as a phrase. thx much, --joe
Re: help implementing a couple of business rules
thx, but I'm not sure that covers all edge cases. To clarify: 1. matching description2 is okay if other fields are matched too, but results matching only description2 should be omitted; 2. it's okay to not match against the people field, but matches against the people field should only be phrase matches. Sorry if I was unclear --joe On Mon, Jan 11, 2010 at 10:13 AM, Erik Hatcher erik.hatc...@gmail.com wrote: On Jan 11, 2010, at 12:56 PM, Joe Calderon wrote: 1. given a set of fields how to return matches that match across them but not just one specific one, ex im using a dismax parser currently but i want to exclude any results that only match against a field called 'description2' One way could be to add an fq parameter to the request: fq=-description2:(query) 2. given a set of fields how to return matches that match across them but on one specific field match as a phrase only, ex im using a dismax parser currently but i want matches against a field called 'people' to only match as a phrase Doesn't setting pf=people accomplish this? Erik
Re: Solr 1.4 Field collapsing - What are the steps for applying the SOLR-236 patch?
It seems to be in flux right now as the Solr developers slowly make improvements and ingest the various pieces into the Solr trunk. I think your best bet might be to use the 12/24 patch and fix any errors where it doesn't apply cleanly; I'm using Solr trunk r892336 with the 12/24 patch. --joe On 01/11/2010 08:48 PM, Kelly Taylor wrote: Hi, Is there a step-by-step guide for applying the patch for SOLR-236 to enable field collapsing in Solr 1.4? Thanks, Kelly
custom wildcarding in qparser
hello *, what do I need to do to make a query parser that works just like the standard query parser but also runs analyzers/tokenizers on a wildcarded term? Specifically, I'm looking to wildcard only the last token. I've tried the edismax qparser and the prefix qparser, and neither is exactly what I'm looking for. The problem I'm trying to solve is matching wildcards on terms that can be entered multiple ways; I have a set of analyzers that generate the various terms, e.g. wildcarding on stemmed fields, etc. thx much --joe
analyzer type=query with NGramTokenFilterFactory forces phrase query
Hello *, I'm trying to make an index to support spelling errors / fuzzy matching. I've indexed my document titles with NGramFilterFactory minGramSize=2 maxGramSize=3; using the analysis page I can see the common grams match between the indexed value and the query value. However, when I try to do a query, e.g. title_ngram:(family), the debug output says the query is converted to the phrase query "f a m i l y fa am mi il ly fam ami mil ily". If this is the expected behavior, is there a way to override it? Or should I scrap this approach and use title:(family) and boost on strdist(family, title, ngram, 3)?
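For reference, the grams the filter produces from a single query token can be sketched in a few lines of plain Python (an illustration, not Solr code; the 2-3 gram sizes match the configuration quoted above, and the exact set a given filter version emits may differ):

```python
def char_ngrams(term: str, min_n: int = 2, max_n: int = 3) -> list[str]:
    """All character n-grams of length min_n..max_n, shortest size first."""
    return [term[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(term) - n + 1)]

# "family" -> ['fa', 'am', 'mi', 'il', 'ly', 'fam', 'ami', 'mil', 'ily']
print(char_ngrams("family"))
```

Because one token expands into many, the query parser sees multiple tokens at the same position range and builds a phrase query over all of them, which is the behavior the thread discusses.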
Re: analyzer type=query with NGramTokenFilterFactory forces phrase query
if this is the expected behaviour is there a way to override it?[1] [1] me On Thu, Dec 31, 2009 at 10:13 AM, AHMET ARSLAN iori...@yahoo.com wrote: Hello *, im trying to make an index to support spelling errors/fuzzy matching, ive indexed my document titles with NGramFilterFactory minGramSize=2 maxGramSize=3, using the analysis page i can see the common grams match between the indexed value and the query value, however when i try to do a query for it ex. title_ngram:(family) the debug output says the query is converted to a phrase query f a m i l y fa am mi il ly fam ami mil ily, if this is the expected behavior is there a way to override it? If a single token is split into more tokens during the analysis phase, solr will do a phrase query instead of a term query. [1] [1]http://www.mail-archive.com/solr-user@lucene.apache.org/msg30055.html
score = result of function query
How can I make the score be solely the output of a function query? The function query wiki page details something like q=boxname:findbox _val_:"product(product(x,y),z)"&fl=*,score but that doesn't seem to work --joe
boosting on string distance
hello *, I want to boost documents that match the query better. Currently I also index my field as a string and boost if I match the string field, but I'm wondering if it's possible to boost with the bf parameter using a formula based on the function strdist(). I know one of the arguments would be the field name, but how do I specify the user query as the other parameter? http://wiki.apache.org/solr/FunctionQuery#strdist best, --joe
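A sketch of the kind of request being asked about: bf has no built-in reference to q, so the client has to repeat the query string as a literal argument to strdist (field name and distance algorithm below are placeholders; strdist's documented algorithms are jw, edit, and ngram):

```text
q=family guy&defType=dismax&qf=title&bf=strdist("family guy",title_s,edit)
```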
Re: SOLR Performance Tuning: Pagination
FWIW, when implementing distributed search I ran into a similar problem, but then I noticed even Google doesn't let you go past page 1000; it's easier to just set a limit on start. On Thu, Dec 24, 2009 at 8:36 AM, Walter Underwood wun...@wunderwood.org wrote: When do users do a query like that? --wunder On Dec 24, 2009, at 8:09 AM, Fuad Efendi wrote: I used pagination for a while till I found this... I have a filtered query ID:[* TO *] returning 20 million results (no faceting), and pagination always seemed to be fast. However, it is fast only with low values like start=12345. Queries like start=28838540 take 40-60 seconds, and even cause OutOfMemoryException. I use highlighting, faceting on a nontokenized Country field, and the standard handler. It even seems to be a bug... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search
wildcard oddity
I'm trying to do a wildcard search:

q=item_title:(gets*)   returns no results
q=item_title:(gets)    returns results
q=item_title:(get*)    returns results

It seems like * at the end of a token is requiring a character; instead of being 0 or more it's acting like 1 or more. The text I'm trying to match is The Gang Gets Extreme: Home Makeover Edition. The field uses the following analyzers:

<fieldType name="text_token" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateAll="1" splitOnNumerics="0" splitOnCaseChange="0" stemEnglishPossessive="0"/>
  </analyzer>
</fieldType>

Is anybody else having similar problems? best, --joe
Re: apply a patch on solr
patch -p0 < /path/to/field-collapse-5.patch On Tue, Nov 3, 2009 at 7:48 PM, michael8 mich...@saracatech.com wrote: Hmmm, perhaps I jumped the gun. I just looked over the field collapse patch for SOLR-236 and each file listed in the patch has its own revision #. E.g. from field-collapse-5.patch: --- src/java/org/apache/solr/core/SolrConfig.java (revision 824364) --- src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java (revision 816372) --- src/solrj/org/apache/solr/client/solrj/SolrQuery.java (revision 823653) --- src/java/org/apache/solr/search/SolrIndexSearcher.java (revision 794328) --- src/java/org/apache/solr/search/DocSetHitCollector.java (revision 794328) Unless there is a better way, it seems like I would need to do svn up --revision ... for each of the files to be patched and then apply the patch? This seems error-prone and tedious. Am I missing something simpler here? Michael michael8 wrote: Perfect. This is what I need to know instead of patching 'in the dark'. Good thing an SVN revision cuts across all files like a tag. Thanks Mike! Michael cambridgemike wrote: You can see what revision the patch was written for at the top of the patch; it will look like this: Index: org/apache/solr/handler/MoreLikeThisHandler.java === --- org/apache/solr/handler/MoreLikeThisHandler.java (revision 772437) +++ org/apache/solr/handler/MoreLikeThisHandler.java (working copy) Now check out revision 772437 using the --revision switch in svn, patch away, and then svn up to make sure everything merges cleanly. This is a good guide to follow as well: http://www.mail-archive.com/solr-user@lucene.apache.org/msg10189.html cheers, -mike On Mon, Nov 2, 2009 at 3:55 PM, michael8 mich...@saracatech.com wrote: Hi, First I'd like to pardon my novice question on patching Solr (1.4). What I'd like to know is: given a patch, like the one for field collapsing, how would one go about knowing what Solr source that patch is meant for, since this is a source-level patch?
Wouldn't the exact versions of the set of Java files to be patched be critical for the patch to work properly? So far what I have done is to pull the latest collapse field patch down from http://issues.apache.org/jira/browse/SOLR-236 (field-collapse-5.patch), then svn up the latest trunk from http://svn.apache.org/repos/asf/lucene/solr/trunk/, then patch and build. Intuitively I was thinking I should be doing svn up to a specific revision/tag instead of just the latest. So far everything seems fine, but I just want to make sure I'm doing the right thing and not just getting lucky. Thanks, Michael

-- View this message in context: http://old.nabble.com/apply-a-patch-on-solr-tp26157827p26157827.html Sent from the Solr - User mailing list archive at Nabble.com.

-- View this message in context: http://old.nabble.com/apply-a-patch-on-solr-tp26157827p26190563.html Sent from the Solr - User mailing list archive at Nabble.com.
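The workflow mike describes (pin the working copy to the revision the patch was generated against, apply, then merge forward) can be sketched as a shell session. The svn revision and patch path below are illustrative only, taken from the examples in this thread; the second half is a self-contained demo of the diff/patch mechanics on a scratch file.

```shell
# Illustrative only: pin the checkout to the revision the patch was made
# against, apply it, then merge forward (commented out; needs network/svn):
#   svn checkout -r 772437 http://svn.apache.org/repos/asf/lucene/solr/trunk/ solr-trunk
#   cd solr-trunk
#   patch -p0 --dry-run < /path/to/field-collapse-5.patch  # check it applies cleanly
#   patch -p0 < /path/to/field-collapse-5.patch
#   svn up                                                 # merge patched tree to latest trunk

# Runnable demo of the same diff/patch mechanics on a scratch file:
rm -rf /tmp/patchdemo && mkdir -p /tmp/patchdemo && cd /tmp/patchdemo
printf 'hello\n' > a.txt                  # the "checked out" file
printf 'hello\npatched\n' > b.txt         # a modified copy
diff -u a.txt b.txt > demo.patch || true  # diff exits 1 when files differ
patch a.txt < demo.patch                  # apply the unified diff
cat a.txt
```

The `--dry-run` step is the cheap way to catch the "wrong revision" problem before any files are modified.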
tokenize after filters
Is it possible to tokenize a field on whitespace after some filters have been applied? For example: "A + W Root Beer". The field uses a keyword tokenizer to keep the string together; then it gets converted to "aw root beer" by a custom filter I've made. I now want to split that up into 3 tokens (aw, root, beer), but it seems like you can't use a tokenizer after a filter... so what's the best way of accomplishing this? thx much --joe
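The usual route here is a custom TokenFilter that re-splits each incoming token and emits the pieces one by one. Below is a plain-Java sketch of just the re-splitting logic; the Lucene TokenFilter plumbing (buffering inside incrementToken()) is deliberately omitted, and the class/method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ResplitDemo {
    // Hypothetical filter logic: take each incoming token and re-split it
    // on whitespace, emitting the non-empty pieces as separate tokens.
    static List<String> resplit(List<String> incoming) {
        List<String> out = new ArrayList<>();
        for (String tok : incoming) {
            for (String piece : tok.split("\\s+")) {
                if (!piece.isEmpty()) out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // KeywordTokenizer + the custom filter produced one token: "aw root beer"
        System.out.println(resplit(List.of("aw root beer")));
    }
}
```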
profiling solr
As a curiosity, I'd like to use a profiler to see where within Solr queries spend most of their time. I'm curious what tools, if any, others use for this type of task. I'm using Jetty as my servlet container, so ideally I'd like a profiler that's compatible with it. --joe
field collapsing exception
Found another exception. I can't find specific steps to reproduce, besides starting with an unfiltered result; given an int field with values (1,2,3), filtering by 3 triggers it sometimes. This is in an index with very frequent updates and deletes. --joe

java.lang.NullPointerException
        at org.apache.solr.search.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory$FieldValueCountCollapseCollector.getResult(FieldValueCountCollapseCollectorFactory.java:84)
        at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.getCollapseInfo(AbstractDocumentCollapser.java:191)
        at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:179)
        at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:121)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)
field collapsing bug (java.lang.ArrayIndexOutOfBoundsException)
Seems to happen when sorting on anything besides strictly score; even "score desc, num desc" triggers it. Using the latest nightly and the 10/14 patch.

Problem accessing /solr/core1/select. Reason: 4731592

java.lang.ArrayIndexOutOfBoundsException: 4731592
        at org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:660)
        at org.apache.solr.search.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:235)
        at org.apache.solr.search.NonAdjacentDocumentCollapser$DocumentPriorityQueue.lessThan(NonAdjacentDocumentCollapser.java:173)
        at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:158)
        at org.apache.solr.search.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:95)
        at org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:208)
        at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
        at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)
boostQParser and dismax
hello *, I was just reading over the wiki function query page and found this little gem for boosting recent docs, which is much better than what I was doing before: recip(ms(NOW,mydatefield),3.16e-11,1,1). My question is, at the bottom it says "The most effective way to use such a boost is to multiply it with the relevancy score, rather than add it in. One way to do this is with the boost query parser." How exactly do I use the boost query parser along with the dismax parser? Can someone post an example solrconfig snippet? thx much --joe
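For the archive, one pattern for combining the two (a hedged sketch: it assumes the boost query parser available in 1.4-era builds, and mydatefield, field1, field2, and qq are placeholder names) is to wrap the dismax query in the boost parser via local params, e.g. q={!boost b=recip(ms(NOW,mydatefield),3.16e-11,1,1) defType=dismax v=$qq}&qq=belgian beer. Baked into solrconfig.xml that would look roughly like:

```xml
<!-- Sketch, not a drop-in config: clients send only qq=...; the wrapped
     dismax query's score is multiplied by the recency function. -->
<requestHandler name="/recent" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">{!boost b=recip(ms(NOW,mydatefield),3.16e-11,1,1) defType=dismax v=$qq}</str>
    <str name="qf">field1^2 field2</str>
  </lst>
</requestHandler>
```

The v=$qq indirection keeps the user's raw query out of the local-params syntax, which is why it is usually preferred over string-concatenating the query into the {!boost} prefix.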
max words/tokens
I have a pretty basic question: is there an existing analyzer that limits the number of words/tokens indexed from a field? Let's say I only wanted to index the top 25 words... thx much --joe
Re: max words/tokens
cool np, i just didnt want to duplicate code if that already existed. On Tue, Oct 20, 2009 at 12:49 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Oct 20, 2009 at 1:53 PM, Joe Calderon calderon@gmail.com wrote: i have a pretty basic question, is there an existing analyzer that limits the number of words/tokens indexed from a field? let say i only wanted to index the top 25 words... It would be really easy to write one, but no there is not currently. -Yonik http://www.lucidimagination.com
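Following up on Yonik's "it would be really easy to write one": the core of such a filter is just truncation of the token stream. A plain-Java sketch of the logic (the Lucene TokenFilter wrapper around it is omitted, and the names here are made up; later Lucene versions ship something along these lines as a limit-token-count filter):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LimitDemo {
    // Pass through the first maxTokens tokens and drop the rest; inside a
    // real TokenFilter this would be a counter checked in incrementToken().
    static List<String> limit(List<String> tokens, int maxTokens) {
        return new ArrayList<>(tokens.subList(0, Math.min(maxTokens, tokens.size())));
    }

    public static void main(String[] args) {
        System.out.println(limit(Arrays.asList("a", "b", "c", "d"), 2)); // [a, b]
    }
}
```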
lucene 2.9 bug
hello *, I've read in other threads that Lucene 2.9 had a serious bug in it, hence trunk moved to 2.9.1-dev. I'm wondering what the bug is, as I've been using the 2.9.0 version for the past few weeks with no problems. Is it critical to upgrade? --joe
Re: Solr 1.4 release candidate
maybe im just not familiar with the way the version numbers works in trunk but when i build the latest nightly the jars have names like *-1.5-dev.jar, is that normal? On Wed, Oct 14, 2009 at 7:01 AM, Yonik Seeley yo...@lucidimagination.com wrote: Folks, we've been in code freeze since Monday and a test release candidate was created yesterday, however it already had to be updated last night due to a serious bug found in Lucene. For now you can use the latest nightly build to get any recent changes like this: http://people.apache.org/builds/lucene/solr/nightly/ We'll probably release the final bits next week, so in the meantime, download the latest nightly build and give it a spin! -Yonik http://www.lucidimagination.com
how to get field contents out of Document object
hello *, sorry if this seems like a dumb question; I'm still fairly new to working with Lucene/Solr internals. Given a Document object, what is the proper way to fetch an integer value for a field called num_in_stock? It is both indexed and stored. thx much --joe
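A hedged sketch of the usual pattern: with the 2.9-era Lucene API a stored field comes back as a String, so the integer parsing is plain Java. The Lucene calls are shown only in comments (they need a live searcher); the parsing helper below is the self-contained part, and its name is made up.

```java
public class StoredFieldDemo {
    // With a live index you'd first fetch the stored value as a string:
    //   Document doc = searcher.doc(docId);
    //   String raw = doc.get("num_in_stock");   // null if the field is absent
    // Then parse it defensively:
    static int parseStored(String raw, int fallback) {
        if (raw == null) return fallback;
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback; // stored value wasn't a clean integer
        }
    }

    public static void main(String[] args) {
        System.out.println(parseStored("42", 0)); // 42
    }
}
```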
concatenating tokens
hello *, I'm using a combination of tokenizers and filters that give me the desired tokens; however, for a particular field I want to concatenate these tokens back into a single string. Is there a filter to do that? If not, what are the steps needed to make my own filter to concatenate tokens? For example, I start with "Sprocket (widget) - Blue"; the analyzers churn out the tokens [sprocket, widget, blue], and I want to end up with the string "sprocket widget blue". This is a simple example; in the general case lowercasing and punctuation removal does not work, hence why I'm looking to concatenate tokens. --joe
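A custom filter for this would buffer every token from the upstream chain and then emit a single joined token. The joining step itself is trivial; here is a plain-Java sketch of it (the Lucene TokenFilter plumbing is omitted, and the names are made up):

```java
import java.util.Arrays;
import java.util.List;

public class ConcatDemo {
    // What a concatenating TokenFilter would emit: all upstream tokens
    // joined back into one space-separated token.
    static String concatTokens(List<String> tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // the analyzers churned "Sprocket (widget) - Blue" into these tokens:
        System.out.println(concatTokens(Arrays.asList("sprocket", "widget", "blue")));
        // prints "sprocket widget blue"
    }
}
```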
Re: stats page slow in latest nightly
thx much guys, no biggie for me, i just wanted to get to the bottom of it in case i had screwed something else up.. --joe On Tue, Oct 6, 2009 at 1:19 PM, Mark Miller markrmil...@gmail.com wrote: I was worried about that actually. I havn't tested how fast the RAM estimator is on huge String FieldCaches - it will be fast on everything else, but it checks the size of each String in the array. When I was working on it, I was actually going to default to not show the size, and make you click a link that added a param to get the sizes in the display too. But I foolishly didn't bring it up when Hoss made my life easier with his simpler patch. Yonik Seeley wrote: Might be the new Lucene fieldCache stats stuff that was recently added? -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon calderon@gmail.com wrote: hello *, ive been noticing that /admin/stats.jsp is really slow in the recent builds, has anyone else encountered this? --joe -- - Mark http://www.lucidimagination.com
Re: JVM OOM when using field collapse component
Heap space is 4 GB, set to grow up to 8 GB; usage is normally ~1-2 GB. It seems to happen within a few searches. If it's just me, I'll try to isolate it; it could be some other part of my implementation. thx much

On Fri, Oct 2, 2009 at 1:18 AM, Martijn v Groningen martijn.is.h...@gmail.com wrote:

No, I have not encountered OOM exceptions yet with the current field collapse patch. How large is your configured JVM heap space (-Xmx)? Field collapsing requires more memory than regular searches. Does Solr run out of memory during the first search(es), or does it run out of memory after a while, when it has performed quite a few field collapse searches? I see that you are also using the collapse.includeCollapsedDocs.fl parameter for your search. This feature will require more memory than a normal field collapse search. I normally give the Solr instance a heap space of 1024M when having an index of a few million. Martijn

2009/10/2 Joe Calderon calderon@gmail.com: i gotten two different out of memory errors while using the field collapsing component, using the latest patch (2009-09-26) and the latest nightly, has anyone else encountered similar problems?
my collection is 5 million results but ive gotten the error collapsing as little as a few thousand
Re: field collapsing sums
hello martijn, thx for the tip. I tried that approach but ran into two snags: 1. returning the fields makes collapsing a lot slower, but that might just be the nature of iterating large results; 2. it seems like only dupes of records on the first page are returned; is there a setting I'm missing? Currently I'm only sending collapse.field=brand and collapse.includeCollapseDocs.fl=num_in_stock --joe

On Thu, Oct 1, 2009 at 1:14 AM, Martijn v Groningen martijn.is.h...@gmail.com wrote:

Hi Joe, Currently the patch does not do that, but you can do something else that might help you in getting your summed stock. In the latest patch you can include fields of collapsed documents in the result per distinct field value. If you specify collapse.includeCollapseDocs.fl=num_in_stock in the request and, let's say, you collapse on brand, then in the response you will receive the following xml:

<lst name="collapsedDocs">
  <result name="brand1" numFound="48" start="0">
    <doc><str name="num_in_stock">2</str></doc>
    <doc><str name="num_in_stock">3</str></doc>
    ...
  </result>
  <result name="brand2" numFound="9" start="0">
    ...
  </result>
</lst>

On the client side you can do whatever you want with this data and, for example, sum it together. Although the patch does not sum for you, I think it will allow you to implement your requirement without too much hassle. Cheers, Martijn

2009/10/1 Matt Weber m...@mattweber.org: You might want to see how the stats component works with field collapsing. Thanks, Matt Weber

On Sep 30, 2009, at 5:16 PM, Uri Boness wrote: Hi, At the moment I think the most appropriate place to put it is in the AbstractDocumentCollapser (in the getCollapseInfo method). Though, it might not be the most efficient.
Cheers, Uri Joe Calderon wrote: hello all, i have a question on the field collapsing patch, say i have an integer field called num_in_stock and i collapse by some other column, is it possible to sum up that integer field and return the total in the output, if not how would i go about extending the collapsing component to support that? thx much --joe
Re: field collapsing sums
thx for the reply. I just want the number of dupes in the query result, but it seems I don't get the correct totals. For example, a non-collapsed dismax query for "belgian beer" returns X results, but when I collapse and sum the number of docs under collapse_counts, it's much less than X. It does seem to work when the collapsed results fit on one page (10 rows in my case). --joe

2) It seems that you are using the parameters as intended. The collapsed documents will contain all documents (from the whole query result) that have been collapsed on a certain field value that occurs in the result set being displayed. That is how it should work. But if I'm understanding you correctly, you want to display all dupes from the whole query result set (also those whose collapse field value does not occur in the displayed result set)?
JVM OOM when using field collapse component
I've gotten two different out-of-memory errors while using the field collapsing component, using the latest patch (2009-09-26) and the latest nightly. Has anyone else encountered similar problems? My collection is 5 million documents, but I've gotten the error collapsing as few as a few thousand.

SEVERE: java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:173)
        at org.apache.lucene.util.OpenBitSet.ensureCapacityWords(OpenBitSet.java:749)
        at org.apache.lucene.util.OpenBitSet.ensureCapacity(OpenBitSet.java:757)
        at org.apache.lucene.util.OpenBitSet.expandingWordNum(OpenBitSet.java:292)
        at org.apache.lucene.util.OpenBitSet.set(OpenBitSet.java:233)
        at org.apache.solr.search.AbstractDocumentCollapser.addCollapsedDoc(AbstractDocumentCollapser.java:402)
        at org.apache.solr.search.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:115)
        at org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:208)
        at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
        at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)

SEVERE: java.lang.OutOfMemoryError: Java heap space
        at org.apache.solr.util.DocSetScoreCollector.init(DocSetScoreCollector.java:44)
        at org.apache.solr.search.NonAdjacentDocumentCollapser.doQuery(NonAdjacentDocumentCollapser.java:68)
        at org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:205)
        at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
        at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
field collapsing sums
hello all, I have a question on the field collapsing patch. Say I have an integer field called num_in_stock and I collapse by some other column; is it possible to sum up that integer field and return the total in the output? If not, how would I go about extending the collapsing component to support that? thx much --joe
changing dismax parser to not treat symbols differently
How would I go about modifying the dismax parser to treat +/- as regular text?
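The thread never got an answer; short of patching the parser itself, one workaround is to escape the operator characters on the client side before the query string reaches dismax. A hedged sketch (it assumes backslash-escaping makes dismax treat the characters literally, and only handles + and -; the class name is made up):

```java
public class EscapeDemo {
    // Prefix each dismax operator character (+ or -) with a backslash so it
    // is treated as literal text rather than a required/prohibited marker.
    static String escapeOps(String q) {
        return q.replaceAll("([+\\-])", "\\\\$1");
    }

    public static void main(String[] args) {
        System.out.println(escapeOps("c++ -fun"));
    }
}
```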
Re: KStem download
is the source for the lucid kstemmer available ? from the lucid solr package i only found the compiled jars On Mon, Sep 14, 2009 at 11:04 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, Sep 14, 2009 at 1:56 PM, darniz rnizamud...@edmunds.com wrote: Pascal Dimassimo wrote: Hi, I want to try KStem. I'm following the instructions on this page: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem ... but the download link doesn't work. Is anyone know the new location to download KStem? I am stuck with the same issue its link is not working for a long time is there any alternate link Please let us know *shrug* - looks like they changed their download structure (or just took it down). I searched around their site a bit but couldn't find another one (and google wasn't able to find it either). The one from Lucid is functionally identical, free, and much, much faster though - I'd just use that. -Yonik http://www.lucidimagination.com
query parser question
I have a field called text_stem that has a KStemmer on it, and I'm having trouble matching wildcard searches on a word that got stemmed. For example, I index the word america's, which according to analysis.jsp gets indexed after stemming as america. When matching, I do a query like myfield:(ame*), which matches the indexed term. This all works fine until the query becomes myfield:(america's*), at which point it doesn't match; however, if I remove the wildcard, like myfield:(america's), then it works again. It's almost like the term doesn't get stemmed when using a wildcard. I'm using a 1.4 nightly; is this the correct behaviour, or is there something I should do differently? In the meantime I've added americas as a protected word in the KStemmer, but I'm afraid of more edge cases that will come up. --joe
help with solr.PatternTokenizerFactory
hello *, I'm not sure what I'm doing wrong. I have this field defined in schema.xml, and using admin/analysis.jsp it's working as expected:

<fieldType name="text_spell" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(\p{Punct}+)" replacement="" replace="all"/>
  </analyzer>
</fieldType>

But when I try to update via the CSV handler I get:

Error 500 org.apache.solr.analysis.PatternTokenizerFactory$1 cannot be cast to org.apache.lucene.analysis.Tokenizer
java.lang.ClassCastException: org.apache.solr.analysis.PatternTokenizerFactory$1 cannot be cast to org.apache.lucene.analysis.Tokenizer
        at org.apache.solr.analysis.TokenizerChain.getStream(TokenizerChain.java:69)
        at org.apache.solr.analysis.SolrAnalyzer.reusableTokenStream(SolrAnalyzer.java:74)
        ...

I'm using a nightly of Solr 1.4. thx much, --joe
Re: Geographic clustering
there are clustering libraries like http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/, that have bindings to perl/python, you can preprocess your results and create clusters for each zoom level On Tue, Sep 8, 2009 at 8:08 AM, gwkg...@eyefi.nl wrote: Hi, I just completed a simple proof-of-concept clusterer component which naively clusters with a specified bounding box around each position, similar to what the javascript MarkerClusterer does. It's currently very slow as I loop over the entire docset and request the longitude and latitude of each document (Not to mention that my unfamiliarity with Lucene/Solr isn't helping the implementations performance any, most code is copied from grep-ing the solr source). Clustering a set of about 80.000 documents takes about 5-6 seconds. I'm currently looking into storing the hilber curve mapping in Solr and clustering using facet counts on numerical ranges of that mapping but I'm not sure it will pan out. Regards, gwk Grant Ingersoll wrote: Not directly related to geo clustering, but http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable interface to clustering implementations. It currently has Carrot2 implemented, but the APIs are marked as experimental. I would definitely be interested in hearing your experience with implementing your clustering algorithm in it. -Grant On Sep 8, 2009, at 4:00 AM, gwk wrote: Hi, I'm working on a search-on-map interface for our website. I've created a little proof of concept which uses the MarkerClusterer (http://code.google.com/p/gmaps-utility-library-dev/) which clusters the markers nicely. But because sending tens of thousands of markers over Ajax is not quite as fast as I would like it to be, I'd prefer to do the clustering on the server side. I've considered a few options like storing the morton-order and throwing away precision to cluster, assigning all locations to a grid position. 
Or simply cluster based on country/region/city depending on zoom level by adding latitude on longitude fields for each zoom level (so that for smaller countries you have to be zoomed in further to get the next level of clustering). I was wondering if anybody else has worked on something similar and if so what their solutions are. Regards, gwk -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
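The grid idea mentioned above (throw away precision and assign every location to a grid position, with cell size depending on zoom level) can be sketched server-side in a few lines. This is an illustrative stand-alone sketch, not tied to the Solr clusterer component from the thread; the cell size parameter and class name are made up.

```java
import java.util.Map;
import java.util.TreeMap;

public class GridClusterDemo {
    // Snap each (lat, lng) point to a grid cell of cellDeg degrees and
    // count points per cell; the client then renders one marker per cell
    // instead of tens of thousands of individual markers.
    static Map<String, Integer> cluster(double[][] points, double cellDeg) {
        Map<String, Integer> counts = new TreeMap<>();
        for (double[] p : points) {
            long row = (long) Math.floor(p[0] / cellDeg);
            long col = (long) Math.floor(p[1] / cellDeg);
            counts.merge(row + "," + col, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // two points near Amsterdam, one near Paris, 1-degree cells
        double[][] pts = {{52.37, 4.89}, {52.36, 4.90}, {48.85, 2.35}};
        System.out.println(cluster(pts, 1.0));
    }
}
```

Zoom levels map naturally onto cellDeg (coarser cells when zoomed out), which is the same effect as truncating a morton/hilbert code to fewer bits.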
stemming plurals
I saw some posts regarding stemming plurals in the archives from 2008, and I was wondering if this was ever integrated or if custom hackery is still needed. Is there something like a stem-plurals analyzer, or is the KStemmer the closest thing? thx much --joe
score = sum of boosts
hello *, what would be the best approach to return the sum of boosts as the score? For example: a dismax handler boosts matches to field1^100 and field2^50; a query matches both fields, hence the score for that row would be 150. Is this something I could do with a function query, or do I need to hack up DisjunctionMaxScorer? --joe
Re: Responses getting truncated
i had a similar issue with text from past requests showing up, this was on 1.3 nightly, i switched to using the lucid build of 1.3 and the problem went away, im using a nightly of 1.4 right now also without probs, then again your mileage may vary as i also made a bunch of schema changes that might have had some effect, it wouldnt hurt to try though On 08/28/2009 02:04 PM, Rupert Fiasco wrote: Firstly, to everyone who has been helping me, thank you very much. All this feedback is helping me narrow down these issues. I deleted the index and re-indexed all the data from scratch and for a couple of days we were OK, but now it seems to be erring again. It happens on different input documents so what was broken before now works (documents that were having issues before are OK now, after a fresh re-index). An issue we are seeing now is that an XML response from Solr will contain the tail of an earlier response, for an example: http://brockwine.com/solr2.txt That is a response we are getting from Solr - using the web interface for Solr in Firefox, Firefox freaks out because it tries to parse that, and of course, its invalid XML, but I can retrieve that via curl. Anyone seeing this before? In regards to earlier questions: i assume you are correct, but you listed several steps of transformation above, are you certian they all work correctly and produce valid UTF-8? Yes, I have looked at the source and contacted the author of the conversion library we are using and have verified that if UTF8 goes in then UTF8 will come out and UTF8 is definitely going in. I dont think sending over an actual input document would help because it seems to change. Plus, this latest issue appears to be more an issue of the last response buffer not clearing or something. Whats strange is that if I wait a few minutes and reload, then the buffer is cleared and I get back a valid response, its intermittent, but appears to be happening frequently. 
If it matters, we started using LucidGaze for Solr about 10 days ago, approximately when these issues started happening (but its hard to say if thats an issue because at this same time we switched from a PHP to Java indexing client). Thanks for your patience -Rupert On Tue, Aug 25, 2009 at 8:33 PM, Chris Hostetterhossman_luc...@fucit.org wrote: : We are running an instance of MediaWiki so the text goes through a : couple of transformations: wiki markup - html - plain text. : Its at this last step that I take a snippet and insert that into Solr. ... : doc.addField(text_snippet_t, article.getSnippet(1000)); ok, well first off: that's the not the field we're you are having problems is it? if i remember correctly from your previous posts, wasn't the response getting aborted in the middle of the Contents field? : and a maximum of 1K chars if its bigger. I initialized this String : from the DB by using the String constructor where I pass in the : charset/collation : : text = new String(textFromDB, UTF-8); : : So to the best of my knowledge, accessing a substring of a UTF-8 : encoded string should not break up the UTF-8 code point. Is that an i assume you are correct, but you listed several steps of transformation above, are you certian they all work correctly and produce valid UTF-8? this leads back to my suggestion before : Can you put the orriginal (pre solr, pre solrj, raw untouched, etc...) : file that this solr doc came from online somewhere? : : What does your *indexing* code look like? ... Can you add some debuging to : the SolrJ client when you *add* this doc to print out exactly what those : 1000 characters are? -Hoss
Re: Responses getting truncated
Yonik has a point. When I ran into this I also upgraded to the latest stable Jetty; I'm using Jetty 6.1.18.

On 08/28/2009 04:07 PM, Rupert Fiasco wrote:

I deployed LucidWorks with my existing solrconfig / schema, re-indexed my data into it, and pushed it out to production; we'll see how it stacks up over the weekend. Already, queries that were breaking on the prior Jetty/stock Solr setup are now working. But I have seen it before where, upon an initial re-index, things work OK and then a couple of days later they break. I'll keep y'all posted.

Thanks
-Rupert

On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco rufia...@gmail.com wrote:

Yes, I am hitting the Solr server directly (medsolr1.colo:9007).

Versions / architectures:

Jetty(6.1.3)

o...@medsolr1 ~ $ uname -a
Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009 x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux

o...@medsolr1 ~ $ java -version
java version 1.6.0_11
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)

I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.

-Rupert

On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley ysee...@gmail.com wrote:

On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco rufia...@gmail.com wrote:

If I run these through curl on the command line the response is truncated, and if I run the search through the web-based admin panel then I get an XML parse error.

Are you running curl directly against the Solr server, or going through a load balancer? Cutting out the middle-men using curl was a great idea - just make sure to go all the way.

At first I thought it could possibly be a FastWriter bug (internal Solr class), but that's only used by the TextWriter-based formats (JSON, Python, Ruby), not the original XML format. It really looks like you're hitting a lower-level IO buffering bug (especially when you see a response starting off with the tail of another response). That doesn't look like it could be a Solr bug, but rather smells like a thread safety bug in the servlet container.

What type of machine are you running on? What JVM? You could try upgrading your version of Jetty, upgrading the JVM, or switching to Tomcat.

-Yonik
http://www.lucidimagination.com

This appears to have just started recently, and the only thing we have done is change our indexer from a PHP one to a Java one, but functionally they are identical. Any thoughts? Thanks in advance.

- Rupert
extended documentation on analyzers
Is there an online resource or a book that contains a thorough list of the available tokenizers and filters and their functionality? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters is very helpful, but I would like to go through the additional filters to make sure I'm not reinventing the wheel by adding my own. --joe
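Beyond the wiki page, the stock factories live in the org.apache.solr.analysis package (plus the Lucene analyzers contrib jar), so browsing that package's javadoc is the closest thing to a complete list. For reference, a typical analyzer chain wired up in schema.xml looks like this (the field type name and the particular filter choices are just an illustration, not a recommendation):

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace, then break on case/punctuation transitions -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

The analysis page on the Solr admin UI (analysis.jsp) is handy for seeing exactly what each stage of a chain like this emits for a given input.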
shingle filter
Hello *, I'm currently faceting on a shingled field to obtain popular phrases and it's working well. However, I'd like to limit the number of shingles that get created. solr.ShingleFilterFactory supports maxShingleSize; can it be made to support a minimum as well? Can someone point me in the right direction? Thanks much. --joe
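For reference, a shingled field type in schema.xml looks something like the following. In the Solr 1.3 line the factory exposes maxShingleSize and outputUnigrams; a minimum-size parameter was added to the underlying Lucene ShingleFilter in later releases, so check the factory source for your version before relying on it (this fragment is a sketch, not tested against a specific release):

```xml
<fieldType name="shingled" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit word n-grams up to 3 tokens long; outputUnigrams="false"
         drops single tokens so only multi-word phrases remain for faceting -->
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="3" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

Until a minimum is supported in your version, setting outputUnigrams="false" at least removes the single-token shingles from the facet counts.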
where to get solr 1.4 nightly
I want to try out the improvements in 1.4, but the nightly site is down: http://people.apache.org/builds/lucene/solr/nightly/ Is there a mirror for nightlies? --joe
Re: dealing with duplicates
So, in case someone can help me with the query syntax, the relational query I would use for this would be something like:

SELECT * FROM videos
WHERE title LIKE 'family guy'
  AND desc LIKE 'stewie%'
  AND ( is_dup = 0
        OR ( is_dup = 1
             AND id NOT IN ( SELECT id FROM videos
                             WHERE title LIKE 'family guy'
                               AND desc LIKE 'stewie%'
                               AND is_dup = 0 ) ) )
ORDER BY views LIMIT 10

Can a similar query be written in Lucene, or do I need to structure my index differently to be able to do such a query? Thanks much. --joe

On Sat, Aug 1, 2009 at 9:15 AM, Joe Calderon calderon@gmail.com wrote:

Hello, thanks for the response. I did take a look at that document, but in my application I actually want the duplicates; as I mentioned, the matching text could be very different among cluster members - what joins them together is a similar set of numeric features.

Currently I do a query with fq=duplicate:0 and show a link to optionally show the dupes by querying for all dupes of the master id. However, I'm currently missing any documents that matched the query but are duplicates of other masters not included in that result set. In a relational database (fulltext indexing aside) I would use a subquery; I imagine a similar approach could be used with Lucene, I just don't know the syntax.

Best, --joe

On Fri, Jul 31, 2009 at 11:32 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Joe,

Maybe we can take a step back first. Would it be better if your index was cleaner and didn't have flagged duplicates in the first place? If so, have you tried using http://wiki.apache.org/solr/Deduplication ?

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message
From: Joe Calderon calderon@gmail.com
To: solr-user@lucene.apache.org
Sent: Friday, July 31, 2009 5:06:48 PM
Subject: dealing with duplicates

Hello all, I have a collection of a few million documents, with many duplicates in this collection. They have been clustered with a simple algorithm. I have a field called 'duplicate' which is 0 or 1, and fields called 'description', 'tags', and 'meta'. Documents are clustered on different criteria, and the text I search against could be very different among members of a cluster.

I'm currently using a dismax handler to search across the text fields with different boosts, and a filter query to restrict to masters (duplicate:0).

My question is then: how do I best query for documents which are masters OR match the text but are not included in the matched set of masters? Does this make sense?
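Lucene's query syntax has no subqueries, so one workaround is a two-pass approach: run the masters-only query first, collect the master ids from that page of results, then issue a second query that admits duplicates whose master is not already represented. A sketch of building the second-pass filter (the field name master_id and the overall flow are assumptions about this schema, not something from the thread):

```java
import java.util.Arrays;
import java.util.List;

public class DupQueryBuilder {
    // Build the second-pass filter: masters, plus duplicates whose
    // master did not appear in the first-pass (fq=duplicate:0) results.
    static String secondPassFilter(List<String> masterIds) {
        if (masterIds.isEmpty()) {
            return "duplicate:0 OR duplicate:1";
        }
        StringBuilder ids = new StringBuilder();
        for (String id : masterIds) {
            if (ids.length() > 0) ids.append(' ');
            ids.append(id);
        }
        return "duplicate:0 OR (duplicate:1 AND -master_id:(" + ids + "))";
    }

    public static void main(String[] args) {
        // These ids would come from the first-pass query's results.
        List<String> masters = Arrays.asList("17", "42");
        System.out.println(secondPassFilter(masters));
        // duplicate:0 OR (duplicate:1 AND -master_id:(17 42))
    }
}
```

This only excludes duplicates of masters on the page you fetched, so it mirrors the NOT IN subquery approximately rather than exactly; restructuring the index (e.g. field collapsing on the cluster id, once available) would be the cleaner fix.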
concurrent csv loading
For first-time loads I currently post to

/update/csv?commit=false&separator=%09&escape=\&stream.file=workfile.txt&map=NULL:&keepEmpty=false

This works well and finishes in about 20 minutes for my workload. It is mostly CPU bound; I have an 8-core box and it seems one core takes the brunt of the work. If I wanted to optimize, would I see any benefit to splitting workfile.txt in two and doing two posts? I'm running Lucid's build of Solr 1.3.0 on Jetty 6. IO is not a bottleneck, as the data folder is on tmpfs. Thanks much. --joe
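If you do try parallel loads, the usual approach is to split the file on line boundaries (never mid-record) and POST each piece in its own request; with commit=false the uploads can proceed concurrently and you commit once at the end. A small sketch of a line-boundary split (plain Java, nothing Solr-specific; wiring the chunks to the actual HTTP posts is left out):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvSplitter {
    // Split a list of CSV records into `parts` roughly equal chunks,
    // each chunk containing whole lines only.
    static List<List<String>> split(List<String> lines, int parts) {
        List<List<String>> chunks = new ArrayList<List<String>>();
        int n = lines.size();
        for (int i = 0; i < parts; i++) {
            int from = i * n / parts;      // integer arithmetic spreads the
            int to = (i + 1) * n / parts;  // remainder evenly across chunks
            chunks.add(new ArrayList<String>(lines.subList(from, to)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a\t1", "b\t2", "c\t3", "d\t4", "e\t5");
        List<List<String>> chunks = split(lines, 2);
        System.out.println(chunks.get(0).size() + " + " + chunks.get(1).size());
        // 2 + 3
    }
}
```

Whether two concurrent posts actually help depends on where the CPU time goes: if it is analysis in the update handler, a second request on another core should; if it is segment merging, it likely won't.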
Re: dealing with duplicates
Hello, thanks for the response. I did take a look at that document, but in my application I actually want the duplicates; as I mentioned, the matching text could be very different among cluster members - what joins them together is a similar set of numeric features.

Currently I do a query with fq=duplicate:0 and show a link to optionally show the dupes by querying for all dupes of the master id. However, I'm currently missing any documents that matched the query but are duplicates of other masters not included in that result set. In a relational database (fulltext indexing aside) I would use a subquery; I imagine a similar approach could be used with Lucene, I just don't know the syntax.

Best, --joe

On Fri, Jul 31, 2009 at 11:32 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Joe,

Maybe we can take a step back first. Would it be better if your index was cleaner and didn't have flagged duplicates in the first place? If so, have you tried using http://wiki.apache.org/solr/Deduplication ?

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message
From: Joe Calderon calderon@gmail.com
To: solr-user@lucene.apache.org
Sent: Friday, July 31, 2009 5:06:48 PM
Subject: dealing with duplicates

Hello all, I have a collection of a few million documents, with many duplicates in this collection. They have been clustered with a simple algorithm. I have a field called 'duplicate' which is 0 or 1, and fields called 'description', 'tags', and 'meta'. Documents are clustered on different criteria, and the text I search against could be very different among members of a cluster.

I'm currently using a dismax handler to search across the text fields with different boosts, and a filter query to restrict to masters (duplicate:0).

My question is then: how do I best query for documents which are masters OR match the text but are not included in the matched set of masters? Does this make sense?