Re: How do I make sure the resulting documents contain the query terms?
> k0 --> A | C
> k1 --> A | B
> k2 --> A | B | C
> k3 --> B | C
> Now let q=k1, how do I make sure C doesn't appear as a result since it doesn't contain any occurrence of k1? Do we bother to do that.

Now that's what Lucene does :)

--
View this message in context: http://lucene.472066.n3.nabble.com/How-do-I-make-sure-the-resulting-documents-contain-the-query-terms-tp3031637p3033451.html
Sent from the Solr - User mailing list archive at Nabble.com.
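The reply's point can be illustrated with a toy inverted index — a plain-Python sketch of the idea, not Lucene's actual implementation: only the documents on the posting list for the query term are ever candidates, so C can never match q=k1.

```python
# Toy inverted index over the documents from this thread.
from collections import defaultdict

def build_index(docs):
    # Map each term to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

docs = {
    "A": {"k0", "k1", "k2"},
    "B": {"k1", "k2", "k3"},
    "C": {"k0", "k2", "k3"},
}
index = build_index(docs)

def search(index, term):
    # Only the posting list for the term is consulted; documents
    # without the term (like C for k1) are never even looked at.
    return sorted(index.get(term, set()))

print(search(index, "k1"))  # ['A', 'B'] -- C is absent
```

This is why, for a plain term query, no extra step is needed to exclude non-matching documents: they are never candidates in the first place.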
Re: How do I make sure the resulting documents contain the query terms?
Sorry for being unclear, and thank you for answering. Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3), where A, B, C are document identifiers and the ks in brackets are the terms each contains. So the Solr inverted index should be something like:

k0 --> A | C
k1 --> A | B
k2 --> A | B | C
k3 --> B | C

Now let q=k1, how do I make sure C doesn't appear as a result since it doesn't contain any occurrence of k1?

On Tue, Jun 7, 2011 at 12:21 AM, Erick Erickson wrote:
> I'm having a hard time understanding what you're driving at, can
> you provide some examples? This *looks* like filter queries,
> but I think you already know about those...
>
> Best
> Erick
>
> On Mon, Jun 6, 2011 at 4:00 PM, Gabriele Kahlout wrote:
> > Hello,
> >
> > I've seen that through boosting it's possible to influence the scoring
> > function, but what I would like is sort of a boolean property. In some way
> > it's to search only the documents indexed with that keyword (or the
> > intersection/union) rather than the whole set.
> > Is this supported in any way?

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
Re: problem: zooKeeper Integration with solr
Instead of integrating ZooKeeper, you could create shards over multiple machines and specify the shards while you are querying Solr. E.g. (host names, ports, and core paths are placeholders):

http://localhost:8983/solr/select?shards=<host1>:<port>/solr,<host2>:<port>/solr&indent=true&q=<query>

On Mon, Jun 6, 2011 at 5:59 PM, Mohammad Shariq wrote:
> Hi folks,
> I am using Solr to index around 100mn docs.
> Now I am planning to move to cluster-based Solr, so that I can scale the
> indexing and searching process.
> Since SolrCloud is in the development stage, I am trying to index in a
> shard-based environment using ZooKeeper.
>
> I followed the steps from http://wiki.apache.org/solr/ZooKeeperIntegration
> but I am still not able to do distributed search.
> Once I index the docs in one shard, I am not able to query them from the
> other shard and vice-versa (using the query
> http://localhost:8180/solr/select/?q=itunes&version=2.2&start=0&rows=10&indent=on
> ).
>
> I am running Solr 3.1 on Ubuntu 10.10.
>
> Please help me.
>
> --
> Thanks and Regards
> Mohammad Shariq

--
Thanks and Regards,
DakshinaMurthy BM
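For illustration, a small sketch (Python; the host names, ports, and core paths are placeholder assumptions, not values from this thread) of how such a shards query URL can be composed:

```python
# Compose a distributed-search URL using Solr's "shards" parameter.
from urllib.parse import urlencode

def sharded_query_url(base, shards, query):
    # Each shard is given as host:port/core-path; Solr fans the query
    # out to every shard listed and merges the results.
    params = {"shards": ",".join(shards), "indent": "true", "q": query}
    return base + "?" + urlencode(params)

url = sharded_query_url(
    "http://localhost:8983/solr/select",
    ["localhost:8983/solr", "localhost:7574/solr"],
    "itunes",
)
print(url)
```

Note that each shard entry omits the `http://` prefix, as the shards parameter expects bare host:port/path values.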
Re: Master Slave help
Do you mean the replication happens every time you restart the server? If so, you would need to modify the events on which you want the replication to happen. Check for the replicateAfter tag and remove the startup option if you don't need it. The master section of the replication handler config looks something like:

  <lst name="master">
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
    <str name="commitReserveDuration">00:00:10</str>
  </lst>

Regards,
Jayendra

On Mon, Jun 6, 2011 at 11:24 AM, Rohit Gupta wrote:
> Hi,
>
> I have configured my master/slave servers and everything seems to be running
> fine; the replication completed the first time it ran. But every time I go to
> the replication link in the admin panel after restarting the server or on
> server startup, I notice the replication starting from scratch, or at least
> the stats show that.
>
> What could be wrong?
>
> Thanks,
> Rohit
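To confirm what the slave is actually doing, Solr's ReplicationHandler also answers a `details` command reporting index version and replication stats. A minimal sketch (Python; the core URL is a placeholder) of building that request:

```python
# Build the URL for the ReplicationHandler's "details" command,
# which reports index version and replication statistics.
from urllib.parse import urlencode

def replication_details_url(solr_core_url):
    params = {"command": "details", "wt": "json"}
    return solr_core_url + "/replication?" + urlencode(params)

url = replication_details_url("http://localhost:8983/solr")
print(url)  # http://localhost:8983/solr/replication?command=details&wt=json
```

Comparing the reported index version on master and slave before and after a restart shows whether a full copy really happened or only the stats were reset.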
Re: synonyms problem
Please take a look at the analysis page for the field in question. I don't even know what happens if you define ONLY a query analyzer (or you left things out as an efficiency). Substituting synonyms into a string field is suspicious; I assume you're only indexing single tokens in that field. You have to re-index after an index-time change to see the effects.

Best
Erick

On Mon, Jun 6, 2011 at 8:33 PM, deniz wrote:
> well i was trying to say that; i have changed the config files for synonyms
> and so on but nothing happens so i thought i needed to do something in java
> code too... i was trying to ask about that...
>
> -
> Zeki ama calismiyor... Calissa yapar...
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/synonyms-problem-tp3014006p3032666.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: synonyms problem
well i was trying to say that; i have changed the config files for synonyms and so on but nothing happens so i thought i needed to do something in java code too... i was trying to ask about that...

-
Zeki ama calismiyor... Calissa yapar...

--
View this message in context: http://lucene.472066.n3.nabble.com/synonyms-problem-tp3014006p3032666.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: SpellCheckComponent performance
Hmmm, how are you configuring your spell checker? The first-time slowdown is probably due to cache warming, but subsequent 500 ms slowdowns seem odd. How many unique terms are there in your spellcheck index? It'd probably be best if you showed us your fieldType and field definition...

Best
Erick

On Mon, Jun 6, 2011 at 4:04 PM, Demian Katz wrote:
> I'm continuing to work on tuning my Solr server, and now I'm noticing that my
> biggest bottleneck is the SpellCheckComponent. This is eating multiple
> seconds on most first-time searches, and still taking around 500ms even on
> cached searches. Here is my configuration:
>
> <searchComponent name="spellcheck"
>                  class="org.apache.solr.handler.component.SpellCheckComponent">
>   <lst name="spellchecker">
>     <str name="name">basicSpell</str>
>     <str name="field">spelling</str>
>     <str name="accuracy">0.75</str>
>     <str name="spellcheckIndexDir">./spellchecker</str>
>   </lst>
>   <str name="queryAnalyzerFieldType">textSpell</str>
>   <str name="buildOnCommit">true</str>
> </searchComponent>
>
> I've done a bit of searching, but the best advice I could find for making the
> search component go faster involved reducing spellcheck.maxCollationTries,
> which doesn't even seem to apply to my settings.
>
> Does anyone have any advice on tuning this aspect of my configuration? Are
> there any extra debug settings that might give deeper insight into how the
> component is spending its time?
>
> thanks,
> Demian
Re: How do I make sure the resulting documents contain the query terms?
I'm having a hard time understanding what you're driving at, can you provide some examples? This *looks* like filter queries, but I think you already know about those...

Best
Erick

On Mon, Jun 6, 2011 at 4:00 PM, Gabriele Kahlout wrote:
> Hello,
>
> I've seen that through boosting it's possible to influence the scoring
> function, but what I would like is sort of a boolean property. In some way
> it's to search only the indexed documents by that keyword (or the
> intersection/union) rather than the whole set.
> Is this supported in any way?
>
> --
> Regards,
> K. Gabriele
Re: Minimum Should Match + External Field + Function Query with boost
I seem to have a solution, but I am still trying to figure out how/why it works. Adding "defType=edismax" in the boost query seems to honor "mm" and correct the boosting based on the external file source. The new query syntax:

q={!boost b=dishRating v=$qq defType=edismax}&qq=hot chicken wings

--
View this message in context: http://lucene.472066.n3.nabble.com/Minimum-Should-Match-not-enforced-with-External-Field-Function-Query-with-boost-tp2985564p3032143.html
Sent from the Solr - User mailing list archive at Nabble.com.
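For reference, a sketch (Python) of URL-encoding that query string; the dishRating field and the qq text come from the message above, while the mm value is just an assumed example:

```python
# URL-encode the local-params boost query from this thread.
from urllib.parse import urlencode

params = {
    "q": "{!boost b=dishRating v=$qq defType=edismax}",
    "qq": "hot chicken wings",
    "mm": "2<75%",  # assumption: whatever minimum-should-match you actually use
}
query_string = urlencode(params)
print(query_string)
```

Encoding the local-params braces and `$qq` reference properly matters here: sending them raw can make the parser see a different query than intended, which is one common reason mm appears to be ignored.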
SpellCheckComponent performance
I'm continuing to work on tuning my Solr server, and now I'm noticing that my biggest bottleneck is the SpellCheckComponent. This is eating multiple seconds on most first-time searches, and still taking around 500ms even on cached searches. Here is my configuration:

<searchComponent name="spellcheck"
                 class="org.apache.solr.handler.component.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">basicSpell</str>
    <str name="field">spelling</str>
    <str name="accuracy">0.75</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
  <str name="queryAnalyzerFieldType">textSpell</str>
  <str name="buildOnCommit">true</str>
</searchComponent>

I've done a bit of searching, but the best advice I could find for making the search component go faster involved reducing spellcheck.maxCollationTries, which doesn't even seem to apply to my settings.

Does anyone have any advice on tuning this aspect of my configuration? Are there any extra debug settings that might give deeper insight into how the component is spending its time?

thanks,
Demian
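On the "extra debug settings" question: Solr's debug output (debugQuery=true) includes a timing section with per-component prepare/process times, which shows how long SpellCheckComponent spends on a given request. A sketch (Python; the host URL is a placeholder) of building such a request:

```python
# Build a query URL whose JSON response includes per-component timing
# in the "debug" section, including SpellCheckComponent.
from urllib.parse import urlencode

def debug_query_url(base, q):
    params = {"q": q, "debugQuery": "true", "wt": "json"}
    return base + "/select?" + urlencode(params)

url = debug_query_url("http://localhost:8983/solr", "test")
print(url)  # http://localhost:8983/solr/select?q=test&debugQuery=true&wt=json
```

Comparing the spellcheck component's process time against the total QTime makes it easy to confirm whether it really is the bottleneck.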
How do I make sure the resulting documents contain the query terms?
Hello,

I've seen that through boosting it's possible to influence the scoring function, but what I would like is sort of a boolean property. In some way it's to search only the documents indexed with that keyword (or the intersection/union) rather than the whole set. Is this supported in any way?

--
Regards,
K. Gabriele
Re: Solr Indexing Patterns
This is a start, for many common best practices: http://wiki.apache.org/solr/SolrRelevancyFAQ

Many of the questions in there have an answer that involves de-normalizing, as an example. Even if your specific problem isn't in there, reading through it gave me a general sense of common patterns in Solr.

It's certainly true that some things are hard to do in Solr. It turns out that an RDBMS is a remarkably flexible thing -- but when it doesn't do something you need well, and you turn to a specialized tool like Solr instead, you certainly give up some things. One of the biggest areas of limitation definitely involves hierarchical or relational data. There are a variety of features, some more fully baked than others, some not yet in a Solr release, meant to get at different aspects of this, including pivot faceting, join (https://issues.apache.org/jira/browse/SOLR-2272), and field collapsing. Each, IMO, deals with a different aspect of handling hierarchical or multi-class data, or data that consists of entities with relationships.

On 6/6/2011 3:43 PM, Judioo wrote:
> I do think that Solr would be better served if there was a *best practice
> section* of the site. Looking at the majority of emails to this list, they
> revolve around "how do I do X?". It seems like tutorials with real-world
> examples would serve Solr no end of good.
>
> I still do not have an example of the best method to approach my problem,
> although Erick has helped me understand the limitations of Solr.
>
> Just thought I'd say.
>
> On 6 June 2011 20:26, Judioo wrote:
>> Thanks
>>
>> On 6 June 2011 19:32, Erick Erickson wrote:
>>> #Everybody# (including me) who has any RDBMS background
>>> doesn't want to flatten data, but that's usually the way to go in Solr.
>>>
>>> Part of whether it's a good idea or not depends on how big the index
>>> gets, and unfortunately the only way to figure that out is to test.
>>> But that's the first approach I'd try.
>>>
>>> Good luck!
>>> Erick
>>>
>>> On Mon, Jun 6, 2011 at 11:42 AM, Judioo wrote:
>>>> On 5 June 2011 14:42, Erick Erickson wrote:
>>>>> See: http://wiki.apache.org/solr/SchemaXml
>>>>>
>>>>> By adding ' multiValued="true" ' to the field, you can add
>>>>> the same field multiple times in a doc, something like
>>>>>
>>>>> <field name="myfield">value1</field>
>>>>> <field name="myfield">value2</field>
>>>>
>>>> I can't see how that would work, as one would need to associate the
>>>> right start/end dates and price. As I understand it, using multiValued
>>>> and thus flattening the discounts would result in:
>>>>
>>>> {
>>>>   "name":"The Book",
>>>>   "price":"$9.99",
>>>>   "price":"$3.00",
>>>>   "price":"$4.00",
>>>>   "synopsis":"thanksgiving special",
>>>>   "starts":"11-24-2011",
>>>>   "starts":"10-10-2011",
>>>>   "ends":"11-25-2011",
>>>>   "ends":"10-11-2011",
>>>>   "synopsis":"Canadian thanksgiving special",
>>>> },
>>>>
>>>> How does one differentiate the different offers?
>>>>
>>>>> But there's no real ability in Solr to store "sub documents",
>>>>> so you'd have to get creative in how you encoded the discounts...
>>>>
>>>> This is what I'm asking :)
>>>> What is the best / recommended / known pattern for doing this?
>>>>
>>>>> But I suspect a better approach would be to store each discount as
>>>>> a separate document. If you're on the trunk version, you could then
>>>>> group results by, say, ISBN and get responses grouped together...
>>>>
>>>> This is an option but seems sub-optimal. So say I store the discounts
>>>> in multiple documents with ISBN as an attribute, and also store the
>>>> title again with ISBN as an attribute.
>>>>
>>>> To get "all books currently discounted" requires 2 requests:
>>>>
>>>> * get all discounts currently active
>>>> * get all books using the ISBNs retrieved from the above search
>>>>
>>>> Not that bad. However, what happens when I want "all books that are
>>>> currently on discount in the 'horror' genre containing the word 'elm'
>>>> in the title"?
>>>>
>>>> The only way I can see of catering for the above search is to duplicate
>>>> all searchable fields in my "book" document in my "discount" document.
>>>> Coming from an RDBMS background this seems wrong.
>>>>
>>>> Is this the correct approach to take?
>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>> On Sat, Jun 4, 2011 at 1:42 AM, Judioo wrote:
>>>>>> Hi,
>>>>>> Discounts can change daily. Also there can be a lot of them (over
>>>>>> time and in a given time period).
>>>>>>
>>>>>> Could you give an example of what you mean by multi-valuing the
>>>>>> field.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On 3 June 2011 14:29, Erick Erickson wrote:
>>>>>>> How often are the discounts changed? Because you can simply
>>>>>>> re-index the book information with a multiValued "discounts" field
>>>>>>> and get something similar to your example (&wt=json)
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
>>>>>>>
>>>>>>> On Fri, Jun 3, 2011 at 8:38 AM, Judioo wrote:
>>>>>>>> What is the "best practice" method to index the following in Solr:
>>>>>>>>
>>>>>>>> I'm attempting to use Solr for a book store site.
>>>>>>>>
>>>>>>>> Each book will have a price, but on occasion this will be
>>>>>>>> discounted. The discounted price exists for a defined time period,
>>>>>>>> but there may be many discount periods. Each discount will have a
>>>>>>>> brief synopsis, start and end time.
>>>>>>>>
>>>>>>>> A subset of the desired output would be as follows:
>>>>>>>>
>>>>>>>> ...
>>>>>>>> "response":{"numFound":1,"start":0,"docs":[
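The two-request pattern discussed in this thread can be sketched as follows (Python; the doc_type/isbn/starts/ends field names and the stub search function are made-up stand-ins for a real schema and Solr client):

```python
# Sketch of the "discounts as separate documents" pattern:
# request 1 fetches active discount docs, request 2 fetches the books
# those discounts point at via their ISBNs.
def get_discounted_books(search):
    discounts = search("doc_type:discount AND starts:[* TO NOW] AND ends:[NOW TO *]")
    isbns = {d["isbn"] for d in discounts}
    if not isbns:
        return []
    isbn_clause = " OR ".join(sorted(isbns))
    return search("doc_type:book AND isbn:(%s)" % isbn_clause)

# A stub standing in for a real Solr client, so the flow is runnable:
def fake_search(q):
    data = [
        {"doc_type": "discount", "isbn": "111"},
        {"doc_type": "book", "isbn": "111", "name": "The Book"},
        {"doc_type": "book", "isbn": "222", "name": "Other"},
    ]
    if q.startswith("doc_type:discount"):
        return [d for d in data if d["doc_type"] == "discount"]
    return [d for d in data if d["doc_type"] == "book" and d["isbn"] in q]

print(get_discounted_books(fake_search))
```

As the thread notes, combined filters like "discounted horror books with 'elm' in the title" push toward either duplicating the searchable book fields onto discount documents or a single round trip over denormalized docs.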
Re: Solr Indexing Patterns
I do think that Solr would be better served if there was a *best practice section* of the site. Looking at the majority of emails to this list, they revolve around "how do I do X?". It seems like tutorials with real-world examples would serve Solr no end of good.

I still do not have an example of the best method to approach my problem, although Erick has helped me understand the limitations of Solr.

Just thought I'd say.

On 6 June 2011 20:26, Judioo wrote:
> Thanks
>
> On 6 June 2011 19:32, Erick Erickson wrote:
>> #Everybody# (including me) who has any RDBMS background
>> doesn't want to flatten data, but that's usually the way to go in Solr.
>>
>> Part of whether it's a good idea or not depends on how big the index
>> gets, and unfortunately the only way to figure that out is to test.
>>
>> But that's the first approach I'd try.
>>
>> Good luck!
>> Erick
Re: Solr Indexing Patterns
Thanks

On 6 June 2011 19:32, Erick Erickson wrote:
> #Everybody# (including me) who has any RDBMS background
> doesn't want to flatten data, but that's usually the way to go in Solr.
>
> Part of whether it's a good idea or not depends on how big the index
> gets, and unfortunately the only way to figure that out is to test.
>
> But that's the first approach I'd try.
>
> Good luck!
> Erick
Re: Solr performance tuning - disk i/o?
If you're seeing results, things must be OK. It's a little strange, though; I'm seeing warmup times of 1 on the trivial reload of the example documents. But I wouldn't worry too much here. Those are pretty high autowarm counts; you might have room to reduce them, but absent long autowarm times there's not much reason to mess with them...

Best
Erick

On Mon, Jun 6, 2011 at 1:38 PM, Demian Katz wrote:
> All of my cache autowarmCount settings are either 1 or 5.
> maxWarmingSearchers is set to 2. I previously shared the contents of my
> firstSearcher and newSearcher events -- just a "queries" array surrounded by
> a standard-looking tag. The events are definitely firing -- in addition to
> the measurable performance improvement they give me, I can actually see them
> happening in the console output during startup. That seems to cover every
> configuration option in my file that references warming in any way, and it
> all looks reasonable to me. warmupTime remains consistently 0 in the
> statistics display. Is there anything else I should be looking at? In any
> case, I'm not too alarmed by this... it just seems a little strange.
>
> thanks,
> Demian
>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Monday, June 06, 2011 11:59 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr performance tuning - disk i/o?
>>
>> Polling interval was in reference to slaves in a multi-machine
>> master/slave setup, so probably not a concern just at present.
>>
>> Warmup time of 0 is not particularly normal; I'm not quite sure what's
>> going on there, but you may want to look at firstSearcher, newSearcher
>> and autowarm parameters in solrconfig.xml.
>>
>> Best
>> Erick
>>
>> On Mon, Jun 6, 2011 at 9:08 AM, Demian Katz wrote:
>> > Thanks once again for the helpful suggestions!
>> >
>> > Regarding the selection of facet fields, I think publishDate (which
>> > is actually just a year) and callnumber-first (which is actually a
>> > very broad, high-level category) are okay. authorStr is an interesting
>> > problem: it's definitely a useful facet (when a user searches for an
>> > author, odds are good that they want the one who published the most
>> > books... i.e. a search for dickens will probably show Charles Dickens
>> > at the top of the facet list), but it has a long tail since there are
>> > many minor authors who have only published one or two books... Is
>> > there a possibility that the facet.mincount parameter could be helpful
>> > here, or does that have no impact on performance/memory footprint?
>> >
>> > Regarding polling interval for slaves, are you referring to a
>> > distributed Solr environment, or is this something to do with Solr's
>> > internals? We're currently a single-server environment, so I don't
>> > think I have to worry if it's related to a multi-server setup... but
>> > if it's something internal, could you point me to the right area of
>> > the admin panel to check my stats? I'm not seeing anything about
>> > polling on the statistics page. It's also a little strange that all of
>> > my warmupTime stats on searchers and caches are showing as 0 -- is
>> > that normal?
>> >
>> > thanks,
>> > Demian
>> >
>> >> -Original Message-
>> >> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >> Sent: Friday, June 03, 2011 4:45 PM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: Solr performance tuning - disk i/o?
>> >>
>> >> Quick impressions:
>> >>
>> >> Faceting is usually best done on fields that don't have lots of
>> >> unique values, for three reasons:
>> >> 1> It's questionable how much use a gazillion facets are to the user.
>> >>    In the case of a unique field per document, in fact, it's useless.
>> >> 2> Resource requirements go up as a function of the number of unique
>> >>    terms. This is true for faceting and sorting.
>> >> 3> Warmup times grow the more terms have to be read into memory.
>> >>
>> >> Glancing at your warmup stuff, things like publishDate, authorStr and
>> >> maybe callnumber-first are questionable. publishDate depends on how
>> >> coarse the resolution is. If it's by day, that's not really much use.
>> >> authorStr... How many authors have more than one publication? Would
>> >> this be better served by some kind of autosuggest rather than facets?
>> >> callnumber-first... I don't really know, but if it's unique per
>> >> document it's probably not something the user would find useful as a
>> >> facet.
>> >>
>> >> The admin page will help you determine the number of unique terms per
>> >> field, which may guide you whether or not to continue to facet on
>> >> these fields.
>> >>
>> >> As Otis said, doing a sort on the fields during warmup will also help.
>> >>
>> >> Watch your polling interval for any slaves in relation to the warmup
>> >> times. If your polling interval is shorter than the warmup times, you ru
Re: Solr Indexing Patterns
#Everybody# (including me) who has any RDBMS background doesn't want to flatten data, but that's usually the way to go in Solr. Part of whether it's a good idea or not depends on how big the index gets, and unfortunately the only way to figure that out is to test. But that's the first approach I'd try. Good luck! Erick On Mon, Jun 6, 2011 at 11:42 AM, Judioo wrote: > On 5 June 2011 14:42, Erick Erickson wrote: > >> See: http://wiki.apache.org/solr/SchemaXml >> >> By adding ' "multiValued="true" ' to the field, you can add >> the same field multiple times in a doc, something like >> >> >> >> value1 >> value2 >> >> >> >> I can't see how that would work as one would need to associate the right > start / end dates and price. > As I understand using multivalued and thus flattening the discounts would > result in: > > { > "name":"The Book", > "price":"$9.99", > "price":"$3.00", > "price":"$4.00", "synopsis":"thanksgiving special", > "starts":"11-24-2011", > "starts":"10-10-2011", > "ends":"11-25-2011", > "ends":"10-11-2011", > "synopsis":"Canadian thanksgiving special", > }, > > How does one differentiate the different offers? > > > >> But there's no real ability in Solr to store "sub documents", >> so you'd have to get creative in how you encoded the discounts... >> > > This is what I'm asking :) > What is the best / recommended / known patterns for doing this? > > > >> >> But I suspect a better approach would be to store each discount as >> a separate document. If you're in the trunk version, you could then >> group results by, say, ISBN and get responses grouped together... >> > > This is an option but seems sub optimal. So say I store the discounts in > multiple documents with ISDN as an attribute and also store the title again > with ISDN as an attribute. > > To get > "all books currently discounted" > > requires 2 request > > * get all discounts currently active > * get all books using ISDN retrieved from above search > > Not that bad. 
However what happens when I want > "all books that are currently on discount in the "horror" genre containing > the word 'elm' in the title." > > The only way I can see in catering for the above search is to duplicate all > searchable fields in my "book" document in my "discount" document. Coming > from a RDBM background this seems wrong. > > Is this the correct approach to take? > > > >> >> Best >> Erick >> >> On Sat, Jun 4, 2011 at 1:42 AM, Judioo wrote: >> > Hi, >> > Discounts can change daily. Also there can be a lot of them (over time >> and >> > in a given time period ). >> > >> > Could you give an example of what you mean buy multi-valuing the field. >> > >> > Thanks >> > >> > On 3 June 2011 14:29, Erick Erickson wrote: >> > >> >> How often are the discounts changed? Because you can simply >> >> re-index the book information with a multiValued "discounts" field >> >> and get something similar to your example (&wt=json) >> >> >> >> >> >> Best >> >> Erick >> >> >> >> On Fri, Jun 3, 2011 at 8:38 AM, Judioo wrote: >> >> > What is the "best practice" method to index the following in Solr: >> >> > >> >> > I'm attempting to use solr for a book store site. >> >> > >> >> > Each book will have a price but on occasions this will be discounted. >> The >> >> > discounted price exists for a defined time period but there may be >> many >> >> > discount periods. Each discount will have a brief synopsis, start and >> end >> >> > time. >> >> > >> >> > A subset of the desired output would be as follows: >> >> > >> >> > ... 
>> >> > "response":{"numFound":1,"start":0,"docs":[ >> >> > { >> >> > "name":"The Book", >> >> > "price":"$9.99", >> >> > "discounts":[ >> >> > { >> >> > "price":"$3.00", >> >> > "synopsis":"thanksgiving special", >> >> > "starts":"11-24-2011", >> >> > "ends":"11-25-2011", >> >> > }, >> >> > { >> >> > "price":"$4.00", >> >> > "synopsis":"Canadian thanksgiving special", >> >> > "starts":"10-10-2011", >> >> > "ends":"10-11-2011", >> >> > }, >> >> > ] >> >> > }, >> >> > . >> >> > >> >> > A requirement is to be able to search for just discounted >> publications. I >> >> > think I could use date faceting for this ( return publications that >> are >> >> > within a discount window ). When a discount search is performed no >> >> > publications that are not currently discounted will be returned. >> >> > >> >> > My question are: >> >> > >> >> > - Does solr support this type of sub documents >> >> > >> >> > In the above example the discounts are the sub documents. I know solr >> is >> >> not >> >> > a relational DB but I would like to store and index the above >> >> representation >> >> > in a single document if possible. >> >> > >> >> > - what is the best method to approach the above >> >> > >> >> > I can see in many examples the authors tend to denormalize to solve >> >> similar >> >> > problems. This s
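A minimal sketch of Erick's separate-document suggestion, in Python for illustration; the field names (type, isbn, starts, ends) and the grouped query are assumptions, not taken from any schema in this thread:

```python
import json

# Sketch of the "one document per discount" approach suggested above.
# Field names (type, isbn, starts, ends) are illustrative assumptions.
book = {"id": "book-1", "type": "book", "isbn": "978-0000000000",
        "name": "The Book", "price": 9.99}
discounts = [
    {"id": "disc-1", "type": "discount", "isbn": "978-0000000000",
     "price": 3.00, "synopsis": "thanksgiving special",
     "starts": "2011-11-24T00:00:00Z", "ends": "2011-11-25T00:00:00Z"},
    {"id": "disc-2", "type": "discount", "isbn": "978-0000000000",
     "price": 4.00, "synopsis": "Canadian thanksgiving special",
     "starts": "2011-10-10T00:00:00Z", "ends": "2011-10-11T00:00:00Z"},
]
# One update payload carries the book and its discounts as sibling docs.
payload = json.dumps([book] + discounts)

# "All books currently discounted" then becomes a single grouped query
# (trunk field collapsing) instead of two round trips.
params = {
    "q": "type:discount AND starts:[* TO NOW] AND ends:[NOW TO *]",
    "group": "true",
    "group.field": "isbn",
}
```

With grouping, the two-request flow collapses into one, and plain `type:book` queries still serve ordinary title searches.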
Re: TIKA INTEGRATION PERFORMANCE
On Mon, Jun 6, 2011 at 1:47 PM, Naveen Gupta wrote: > Hi Tomas, > > 1. Regarding SolrInputDocument, > > We are not using java client, rather we are using php solr, wrapping > content > in SolrInputDocument, i am not sure how to do in PHP client? In this case, > we need tika related jars to avail the metadata such as content .. we > certainly don't want to handle all these things in PHP client. > I don't understand; Tika IS integrated in Solr, it doesn't matter which client or client language you are using. To add a static value, all you have to do is add it as a request parameter with the prefix "literal". Something like "literal.somefield=thevalue". Content and other file metadata such as author etc. (see http://wiki.apache.org/solr/ExtractingRequestHandler#Metadata) will be added to the document inside Solr and indexed. You don't need to handle this on the client application. > > Secondly, what i was asking about commit strategy -- > > what about suppose you have 100 docs > > iterate over 99 docs and fire curl without commit in url > > and for 100th doc, we will use commit > > so doing so, will it also update the indexes for last 99 docs > > while(upto 99){ > curl_command = url without commit; > } > > when i = 100, url would be commit > You can certainly do this. The 100 documents will be available for search after the commit. None of the documents will be available for search before the commit. > > i wanted to achieve something similar to optimize kind of thing > The optimize command should be issued when not many queries or updates are sent to the index. It uses lots of resources and will slow down queries. > > why these kind of use cases which are general purpose not included in > example (especially in other language ...java guys can easily do using API) > They are: you can use the auto-commit feature, configured in the solrconfig.xml file. You can either tell Solr to commit on a time interval or when a number of documents have been updated and not committed. 
On the example file, the autocommit is commented, but you can uncomment it. > I am basically a Java Guy, so i can feel the problem > > Thanks > Naveen > 2011/6/6 Tomás Fernández Löbbe > > > 1. About the commit strategy, all the ExtractingRequestHandler (request > > handler that uses Tika to extract content from the input file) will do is > > extract the content of your file and add it to a SolrInputDocument. The > > commit strategy should not change because of this, compared to other > > documents you might be indexing. It is usually not recommended to commit > on > > every new / updated document. > > > > 2. Don't know if I understand the question. you can add all the static > > fields you want to the document by adding the "literal." prefix to the > name > > of the fields when using ExtractingRequestHandler (as you are doing with > " > > literal.id"). You can also leave empty fields if they are not marked as > > "required" at the schema.xml file. See: > > http://wiki.apache.org/solr/ExtractingRequestHandler#Literals > > > > 3. Solr cores can work almost as completely different Solr instances. You > > could tell one core to replicate from another core. I don't think this > > would > > be of any help here. If you want to separate the indexing operations from > > the query operations, you could probably use different machines, that's > > usually a better option. Configure the indexing box as master and the > query > > box as slave. Here you have some more information about it: > > http://wiki.apache.org/solr/SolrReplication > > > > Were this the answers you were looking for or did I misunderstand your > > questions? 
> > > > Tomás > > > > On Mon, Jun 6, 2011 at 2:54 AM, Naveen Gupta > wrote: > > > > > Hi > > > > > > Since it is php, we are using solphp for calling curl based call, > > > > > > what my concern here is that for each user, we might be having 20-40 > > > attachments needed to be indexed each day, and there are various users > > > ..daily we are targeting around 500-1000 users .. > > > > > > right now if you see, we > > > > > > > > $ch = curl_init(' > > > http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true' > ); > > > curl_setopt ($ch, CURLOPT_POST, 1); > > > curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>"@paper.pdf")); > > > $result= curl_exec ($ch); > > > ?> > > > > > > also we are planning to use other fields which are to be indexed and > > stored > > > ... > > > > > > > > > There are couple of questions here > > > > > > 1. what would be the best strategies for commit. if we take all the > > > documents in an array and iterating one by one and fire the curl and > for > > > the > > > last doc, if we commit, will it work or for each doc, we need to > commit? > > > > > > 2. we are having several fields which are already defined in schema and > > few > > > of the them are required earlier, but for this purpose, we don't want, > > how > > > to have two requirement together in the same schema? > > > >
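Naveen's batching idea above (commit only with the last document) can be sketched as URL construction; the base URL matches the thread's example, everything else is illustrative:

```python
# Sketch of "commit only with the last document": build one extract URL per
# file and request a commit only on the final one. Base URL follows the
# thread's example; doc ids are illustrative.
base = "http://localhost:8010/solr/update/extract"

def extract_urls(doc_ids):
    urls = []
    for i, doc_id in enumerate(doc_ids):
        # Only the last document in the batch carries commit=true.
        commit = "true" if i == len(doc_ids) - 1 else "false"
        urls.append(f"{base}?literal.id={doc_id}&commit={commit}")
    return urls

urls = extract_urls([f"doc{n}" for n in range(1, 101)])
# The first 99 posts skip the commit; the 100th makes all 100 searchable.
```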
Re: Need query help
See "Tagging and excluding Filters" section * http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters 2011/6/6 Denis Kuzmenok : > For now i have a collection with: > id (int) > price (double) multivalue > brand_id (int) > filters (string) multivalue > > I need to get available brand_id, filters, price values and list of > id's for current query. For example now i'm doing queries with > facet.field=brand_id/filters/price: > 1) to get current id's list: (brand_id:100 OR brand_id:150) AND > (filters:p1s100 OR filters:p4s20) > 2) to get available filters on selected properties (same properties but > another values): (brand_id:100 OR brand_id:150) AND (filters:p1s* OR > filters:p4s*) > 3) to get available brand_id (if any are selected, if none - take from > 1st query results): (filters:p1s100 OR filters:p4s20) > 4) another request to get available prices if any are selected > > Is there any way to simplify this task? > Data needed: > 1) Id's for selected filters, price, brand_id > 2) Available filters, price, brand_id from selected values > 3) Another values for selected properties (is any chosen) > 4) Another brand_id for selected brand_id > 5) Another price for selected price > > Will appreciate any help or thoughts! > > Cheers, > Denis Kuzmenok > >
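A sketch of what that wiki section describes, using the field names from Denis's schema: tag each filter, then exclude the tag when faceting on the same field, so a single request returns both the narrowed result set and the still-available alternative values. The tag names here are arbitrary:

```python
# Sketch of tagged/excluded filters (SimpleFacetParameters wiki section).
# The brand filter is tagged "br" and the property filter "ft"; each facet
# excludes its own tag, so the facet counts show the other available values
# for that field while the result list stays filtered.
params = {
    "q": "*:*",
    "fq": ["{!tag=br}brand_id:(100 OR 150)",
           "{!tag=ft}filters:(p1s100 OR p4s20)"],
    "facet": "true",
    "facet.field": ["{!ex=br}brand_id",
                    "{!ex=ft}filters",
                    "price"],
}
```

This collapses queries 1-4 above into one request: the doc ids come from the filtered results, and the excluded-tag facets supply the "other brand_id / other filter values" lists.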
Re: Auto-scaling solr setup
Yes sadly .. I too have not much clue about AWS. The SolrReplication API doesn't give me what I want exactly. For the time being I have hacked my way into the Amazon image, bootstrapping the replication check in a shell script ((curl & awk) very dirty way). Once the check succeeds I enable the server using the Solr healthcheck for load-balancers. I was wondering if anyone has moved to the cloud, especially Amazon auto-scaling where they don't have control over when a new node is fired. All scenarios I encountered were people creating a node, warming up the cache and then adding it under the HAProxy LB. I guess warmup is not that big an issue as compared to an empty response. Thanks for your response :) Regards, Akshay On Mon, Jun 6, 2011 at 6:33 PM, Erick Erickson wrote: > The HTTP interface (http://wiki.apache.org/solr/SolrReplication#HTTP_API) > can be used to control lots of parts of replication. > > As to warmups, I don't know of a good way to test that. I don't know > whether > getting the current status on the slave includes whether warmup is > completed > or not. At worst, after replication is complete you could wait an interval > (see > the warmup times on your running servers) before routing requests to the > slave. > > I haven't any clue at all about AWS... > > Best > Erick > > On Mon, Jun 6, 2011 at 9:18 AM, Akshay wrote: > > So i am trying to setup an auto-scaling search system of ec2 solr-slaves > > which scale up as number of requests increase and vice versa > > Here is what I have > > 1. A solr master and underlying slaves(scalable). And an elastic load > > balancer to distribute the load. > > 2. The ec2-auto-scaling setup fires nodes when traffic increases. However > > the replication times(replication speed) for the index from the master > > varies for these newly fired nodes. > > 3. I want to avoid addition of these nodes to the load balancer till it has > > completed initial replication and has a warmed up cache. 
> >For this I need to know a way I can check if the initial replication > has > > completed. and also a way of warming up the cache post this. > > > > I can think of doing this via .. a shellscript/awk(checking times > > replicated/index size) ... is there a cleaner way ? > > > > Also on the side note .. any suggestions or pointers to how one set up > their > > scalable solr setup on cloud(AWS mainly) would be helpful. > > > > Regards, > > Akshay > > >
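As a possibly cleaner alternative to the curl/awk hack: the replication handler's HTTP API exposes `command=indexversion`, so a bootstrap script can compare master and slave before enabling the node. A sketch with canned responses; verify the exact JSON keys against your Solr version:

```python
# Sketch: a slave is "caught up" once its index version matches the
# master's. command=indexversion is part of the replication HTTP API; the
# JSON key below is an assumption to verify against your Solr version.
def caught_up(master_response, slave_response):
    return (master_response.get("indexversion") ==
            slave_response.get("indexversion"))

# Canned responses standing in for
#   curl 'http://host:8983/solr/replication?command=indexversion&wt=json'
master = {"indexversion": 1307374479923, "generation": 12}
slave_behind = {"indexversion": 1307374111111, "generation": 11}
slave_ready = {"indexversion": 1307374479923, "generation": 12}
```

Once `caught_up` returns true, the bootstrap can fire a few warmup queries and only then flip the healthcheck that the load balancer watches.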
RE: Solr performance tuning - disk i/o?
All of my cache autowarmCount settings are either 1 or 5. maxWarmingSearchers is set to 2. I previously shared the contents of my firstSearcher and newSearcher events -- just a "queries" array surrounded by a standard-looking tag. The events are definitely firing -- in addition to the measurable performance improvement they give me, I can actually see them happening in the console output during startup. That seems to cover every configuration option in my file that references warming in any way, and it all looks reasonable to me. warmupTime remains consistently 0 in the statistics display. Is there anything else I should be looking at? In any case, I'm not too alarmed by this... it just seems a little strange. thanks, Demian > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Monday, June 06, 2011 11:59 AM > To: solr-user@lucene.apache.org > Subject: Re: Solr performance tuning - disk i/o? > > Polling interval was in reference to slaves in a multi-machine > master/slave setup. so probably not > a concern just at present. > > Warmup time of 0 is not particularly normal, I'm not quite sure what's > going on there but you may > want to look at firstsearcher, newsearcher and autowarm parameters in > config.xml.. > > Best > Erick > > On Mon, Jun 6, 2011 at 9:08 AM, Demian Katz > wrote: > > Thanks once again for the helpful suggestions! > > > > Regarding the selection of facet fields, I think publishDate (which > is actually just a year) and callnumber-first (which is actually a very > broad, high-level category) are okay. authorStr is an interesting > problem: it's definitely a useful facet (when a user searches for an > author, odds are good that they want the one who published the most > books... i.e. a search for dickens will probably show Charles Dickens > at the top of the facet list), but it has a long tail since there are > many minor authors who have only published one or two books... 
Is > there a possibility that the facet.mincount parameter could be helpful > here, or does that have no impact on performance/memory footprint? > > > > Regarding polling interval for slaves, are you referring to a > distributed Solr environment, or is this something to do with Solr's > internals? We're currently a single-server environment, so I don't > think I have to worry if it's related to a multi-server setup... but > if it's something internal, could you point me to the right area of the > admin panel to check my stats? I'm not seeing anything about polling > on the statistics page. It's also a little strange that all of my > warmupTime stats on searchers and caches are showing as 0 -- is that > normal? > > > > thanks, > > Demian > > > >> -Original Message- > >> From: Erick Erickson [mailto:erickerick...@gmail.com] > >> Sent: Friday, June 03, 2011 4:45 PM > >> To: solr-user@lucene.apache.org > >> Subject: Re: Solr performance tuning - disk i/o? > >> > >> Quick impressions: > >> > >> The faceting is usually best done on fields that don't have lots of > >> unique > >> values for three reasons: > >> 1> It's questionable how much use to the user to have a gazillion > >> facets. > >> In the case of a unique field per document, in fact, it's > useless. > >> 2> resource requirements go up as a function of the number of unique > >> terms. This is true for faceting and sorting. > >> 3> warmup times grow the more terms have to be read into memory. > >> > >> > >> Glancing at your warmup stuff, things like publishDate, authorStr > and > >> maybe > >> callnumber-first are questionable. publishDate depends on how coarse > >> the > >> resolution is. If it's by day, that's not really much use. > authorStr.. > >> How many > >> authors have more than one publication? Would this be better served > by > >> some > >> kind of autosuggest rather than facets? callnumber-first... 
I don't > >> really know, but > >> if it's unique per document it's probably not something the user > would > >> find useful > >> as a facet. > >> > >> The admin page will help you determine the number of unique terms > per > >> field, > >> which may guide you whether or not to continue to facet on these > >> fields. > >> > >> As Otis said, doing a sort on the fields during warmup will also > help. > >> > >> Watch your polling interval for any slaves in relation to the warmup > >> times. > >> If your polling interval is shorter than the warmup times, you run a > >> risk of > >> "runaway warmups". > >> > >> As you've figured out, measuring responses to the first few queries > >> doesn't > >> always measure what you really need .. > >> > >> I don't have the pages handy, but autowarming is a good topic to > >> understand, > >> so you might spend some time tracking it down. > >> > >> Best > >> Erick > >> > >> On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz > >> wrote: > >> > Thanks to you and Otis for the suggestions! Some more > information: > >> > > >> > - Based on the Sol
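For reference, the firstSearcher/newSearcher events discussed in this thread are QuerySenderListener entries in solrconfig.xml; a minimal sketch, with illustrative queries built from the facet fields mentioned above:

```xml
<!-- Minimal sketch of warming listeners in solrconfig.xml; the queries and
     facet fields are illustrative, echoing the fields in this thread. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">publishDate desc</str>
      <str name="facet">true</str>
      <str name="facet.field">authorStr</str>
    </lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">callnumber-first</str>
    </lst>
  </arr>
</listener>
```

Note these listener queries warm the field and filter caches directly; `autowarmCount` on each cache then controls how much is carried over on subsequent commits.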
Re: Auto-scaling solr setup
The HTTP interface (http://wiki.apache.org/solr/SolrReplication#HTTP_API) can be used to control lots of parts of replication. As to warmups, I don't know of a good way to test that. I don't know whether getting the current status on the slave includes whether warmup is completed or not. At worst, after replication is complete you could wait an interval (see the warmup times on your running servers) before routing requests to the slave. I haven't any clue at all about AWS... Best Erick On Mon, Jun 6, 2011 at 9:18 AM, Akshay wrote: > So i am trying to setup an auto-scaling search system of ec2 solr-slaves > which scale up as number of requests increase and vice versa > Here is what I have > 1. A solr master and underlying slaves(scalable). And an elastic load > balancer to distribute the load. > 2. The ec2-auto-scaling setup fires nodes when traffic increases. However > the replication times(replication speed) for the index from the master > varies for these newly fired nodes. > 3. I want to avoid addition of these nodes to the load balancer till it has > completed initial replication and has a warmed up cache. > For this I need to know a way I can check if the initial replication has > completed. and also a way of warming up the cache post this. > > I can think of doing this via .. a shellscript/awk(checking times > replicated/index size) ... is there a cleaner way ? > > Also on the side note .. any suggestions or pointers to how one set up their > scalable solr setup on cloud(AWS mainly) would be helpful. > > Regards, > Akshay >
Re: SolrJ and Range Faceting
Small error: shouldn't be using this.start but should instead be using Double.parseDouble(this.getValue()); and sdf.parse(count.getValue()); respectively. On Mon, Jun 6, 2011 at 1:16 PM, Jamie Johnson wrote: > Thanks Martijn. I pulled your patch and it looks like what I was looking > for. The original FacetField class has a getAsFilterQuery method which > returns the criteria to use as an fq parameter, I have logic which does this > in my class which works, any chance of getting something like this added to > the patch as well? > > > public static class Numeric extends RangeFacet { > > public Numeric(String name, Number start, Number end, Number gap) { > super(name, start, end, gap); > } > > public String getAsFilterQuery(){ > Double end = this.start.doubleValue() + this.gap.doubleValue() - > 1; > return this.name + ":[" + this.start + " TO " + end + "]"; > } > > > } > > > and for dates (there's a parse exception below which I am not doing > anything with currently) > > public String getAsFilterQuery(){ > RangeFacet.Date dateCount = > (RangeFacet.Date)count.getRangeFacet(); > > DateMathParser parser = new DateMathParser(TimeZone.getDefault(), > Locale.getDefault()); > SimpleDateFormat sdf = new > SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss"); > > parser.setNow(dateCount.getStart()); > Date end = parser.parseMath(dateCount.getGap()); > String startStr = sdf.format(dateCount.getStart()) + "Z"; > String endStr = sdf.format(end) + "Z"; > String label = startStr + " TO " + endStr; > return facetField.getName() + ":[" + label + "]"; > > } > > > On Fri, Jun 3, 2011 at 7:05 AM, Martijn v Groningen < > martijn.is.h...@gmail.com> wrote: >> Hi Jamie, >> >> I don't know why range facets didn't make it into SolrJ. But I've recently >> opened an issue for this: >> https://issues.apache.org/jira/browse/SOLR-2523 >> >> I hope this will be committed soon. Check the patch out and see if you >> like >> it. 
>> >> Martijn >> >> On 2 June 2011 18:22, Jamie Johnson wrote: >> >> > Currently the range and date faceting in SolrJ acts a bit differently >> than >> > I >> > would expect. Specifically, range facets aren't parsed at all and date >> > facets end up generating filterQueries which don't have the range, just >> the >> > lower bound. Is there a reason why SolrJ doesn't support these? I have >> > written some things on my end to handle these and generate filterQueries >> > for >> > date ranges of the form dateTime:[start TO end] and I have a function >> > (which >> > I copied from the date faceting) which parses the range facets, but >> would >> > prefer not to have to maintain these myself. Is there a plan to >> implement >> > these? Also is there a plan to update FacetField to not have end be a >> > date, >> > perhaps making it a String like start so we can support date and range >> > queries? >> > >> >> >> >> -- >> Met vriendelijke groet, >> >> Martijn van Groningen >> > >
Re: SolrJ and Range Faceting
Thanks Martijn. I pulled your patch and it looks like what I was looking for. The original FacetField class has a getAsFilterQuery method which returns the criteria to use as an fq parameter, I have logic which does this in my class which works, any chance of getting something like this added to the patch as well? public static class Numeric extends RangeFacet { public Numeric(String name, Number start, Number end, Number gap) { super(name, start, end, gap); } public String getAsFilterQuery(){ Double end = this.start.doubleValue() + this.gap.doubleValue() - 1; return this.name + ":[" + this.start + " TO " + end + "]"; } } and for dates (there's a parse exception below which I am not doing anything with currently) public String getAsFilterQuery(){ RangeFacet.Date dateCount = (RangeFacet.Date)count.getRangeFacet(); DateMathParser parser = new DateMathParser(TimeZone.getDefault(), Locale.getDefault()); SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss"); parser.setNow(dateCount.getStart()); Date end = parser.parseMath(dateCount.getGap()); String startStr = sdf.format(dateCount.getStart()) + "Z"; String endStr = sdf.format(end) + "Z"; String label = startStr + " TO " + endStr; return facetField.getName() + ":[" + label + "]"; } On Fri, Jun 3, 2011 at 7:05 AM, Martijn v Groningen < martijn.is.h...@gmail.com> wrote: > Hi Jamie, > > I don't know why range facets didn't make it into SolrJ. But I've recently > opened an issue for this: > https://issues.apache.org/jira/browse/SOLR-2523 > > I hope this will be committed soon. Check the patch out and see if you like > it. > > Martijn > > On 2 June 2011 18:22, Jamie Johnson wrote: > > > Currently the range and date faceting in SolrJ acts a bit differently > than > > I > > would expect. Specifically, range facets aren't parsed at all and date > > facets end up generating filterQueries which don't have the range, just > the > > lower bound. Is there a reason why SolrJ doesn't support these? 
I have > > written some things on my end to handle these and generate filterQueries > > for > > date ranges of the form dateTime:[start TO end] and I have a function > > (which > > I copied from the date faceting) which parses the range facets, but would > > prefer not to have to maintain these myself. Is there a plan to > implement > > these? Also is there a plan to update FacetField to not have end be a > > date, > > perhaps making it a String like start so we can support date and range > > queries? > > > > > > -- > Met vriendelijke groet, > > Martijn van Groningen >
Re: TIKA INTEGRATION PERFORMANCE
Hi Tomas, 1. Regarding SolrInputDocument, We are not using java client, rather we are using php solr, wrapping content in SolrInputDocument, i am not sure how to do in PHP client? In this case, we need tika related jars to avail the metadata such as content .. we certainly don't want to handle all these things in PHP client. Secondly, what i was asking about commit strategy -- what about suppose you have 100 docs iterate over 99 docs and fire curl without commit in url and for 100th doc, we will use commit so doing so, will it also update the indexes for last 99 docs while(upto 99){ curl_command = url without commit; } when i = 100, url would be commit i wanted to achieve something similar to optimize kind of thing why these kind of use cases which are general purpose not included in example (especially in other language ...java guys can easily do using API) I am basically a Java Guy, so i can feel the problem Thanks Naveen 2011/6/6 Tomás Fernández Löbbe > 1. About the commit strategy, all the ExtractingRequestHandler (request > handler that uses Tika to extract content from the input file) will do is > extract the content of your file and add it to a SolrInputDocument. The > commit strategy should not change because of this, compared to other > documents you might be indexing. It is usually not recommended to commit on > every new / updated document. > > 2. Don't know if I understand the question. you can add all the static > fields you want to the document by adding the "literal." prefix to the name > of the fields when using ExtractingRequestHandler (as you are doing with " > literal.id"). You can also leave empty fields if they are not marked as > "required" at the schema.xml file. See: > http://wiki.apache.org/solr/ExtractingRequestHandler#Literals > > 3. Solr cores can work almost as completely different Solr instances. You > could tell one core to replicate from another core. I don't think this > would > be of any help here. 
If you want to separate the indexing operations from > the query operations, you could probably use different machines, that's > usually a better option. Configure the indexing box as master and the query > box as slave. Here you have some more information about it: > http://wiki.apache.org/solr/SolrReplication > > Were this the answers you were looking for or did I misunderstand your > questions? > > Tomás > > On Mon, Jun 6, 2011 at 2:54 AM, Naveen Gupta wrote: > > > Hi > > > > Since it is php, we are using solphp for calling curl based call, > > > > what my concern here is that for each user, we might be having 20-40 > > attachments needed to be indexed each day, and there are various users > > ..daily we are targeting around 500-1000 users .. > > > > right now if you see, we > > > > > $ch = curl_init(' > > http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true'); > > curl_setopt ($ch, CURLOPT_POST, 1); > > curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>"@paper.pdf")); > > $result= curl_exec ($ch); > > ?> > > > > also we are planning to use other fields which are to be indexed and > stored > > ... > > > > > > There are couple of questions here > > > > 1. what would be the best strategies for commit. if we take all the > > documents in an array and iterating one by one and fire the curl and for > > the > > last doc, if we commit, will it work or for each doc, we need to commit? > > > > 2. we are having several fields which are already defined in schema and > few > > of the them are required earlier, but for this purpose, we don't want, > how > > to have two requirement together in the same schema? > > > > 3. since it is frequent commit, how to use solr multicore for write and > > read > > operations separately ? > > > > Thanks > > Naveen > > >
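To make the `literal.*` convention from Tomás's answer concrete, here is a hypothetical sketch of building the extract URL with extra static fields; only `literal.id` comes from the thread, the other field names are invented and would have to exist in schema.xml:

```python
from urllib.parse import urlencode

# Sketch: static values are passed to /update/extract as "literal.*"
# request parameters alongside the posted file. literal.id matches the
# thread's example; literal.user and literal.received are invented fields.
params = {
    "literal.id": "doc2",
    "literal.user": "naveen",                     # assumed schema field
    "literal.received": "2011-06-06T00:00:00Z",   # assumed schema field
    "commit": "true",
}
url = "http://localhost:8010/solr/update/extract?" + urlencode(params)
# The PHP client would pass this url to curl_init() and POST the file.
```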
Re: How to get default result?
Hi Richard, are you setting the value to 0 at index time when the housenumber is not present? If you are, this would be as simple as modify the query at the application layer to city = a, street= b, housenumber=(14 OR 0). If you are not doing anything at index time with the not present housenumbers, you could do something like city:a AND street:b AND (housenumber:14 OR NOT housenumber:[* TO *]). First option is better if you ask me. You can set the default value on your schema. See http://wiki.apache.org/solr/SchemaXml#Fields On Mon, Jun 6, 2011 at 1:14 PM, richardr wrote: > Dear list, > > i got a question regarding my address search: > I am searching for address data. If there is one address field not definied > (in this case the housenumber) for the specific query (e.g. city = a, > street > = b, housenumber=14), I am getting no result. For every street there exists > at least one housenumber (=0). > > Is it possible to get this default value (housenumber 0) as a result, if > the > user is searching for the housenumber 14, which does not exist in our > model? > > Thanks in advance, > Richard > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-to-get-default-result-tp3030665p3030665.html > Sent from the Solr - User mailing list archive at Nabble.com. >
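A sketch of the first (index housenumber=0 as the default) option at the application layer; the function name is invented and the field names follow the thread's example:

```python
# Sketch: with housenumber=0 indexed as the default for every street,
# the application widens the query so the default row matches when the
# requested number is absent.
def address_query(city, street, housenumber, default=0):
    return (f"city:{city} AND street:{street} AND "
            f"housenumber:({housenumber} OR {default})")

q = address_query("a", "b", 14)
# q == "city:a AND street:b AND housenumber:(14 OR 0)"
```

A caveat if the missing-value variant is used instead: a purely negative clause inside parentheses matches nothing in Lucene syntax, so the "no housenumber at all" branch would need an explicit `*:*`, e.g. `(housenumber:14 OR (*:* NOT housenumber:[* TO *]))`.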
Default query parser operator
Hi all, Is it possible to change the query parser operator for a specific field without having to explicitly type it in the search field? For example, I'd like to use: http://localhost:8983/solr/search/?q=field1:word token field2:parser syntax instead of http://localhost:8983/solr/search/?q=field1:word AND token field2:parser syntax But, I only want it to be applied to field1, not field2 and I want the operator to always be AND unless the user explicitly types in OR. Thanks, Brian Lamb
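For context while this waits for answers: as far as I know, the classic query parser's default operator (the q.op parameter or defaultOperator in schema.xml) applies to the whole query rather than a single field, so a per-field AND usually means grouping that field's terms explicitly at the application layer. A hypothetical sketch:

```python
# Sketch: q.op / defaultOperator are query-wide, not per-field, so to AND
# the terms of field1 only, group and join them explicitly before sending.
def field_and(field, text):
    terms = text.split()
    return f"{field}:({' AND '.join(terms)})"

q = field_and("field1", "word token") + " " + "field2:(parser syntax)"
# q == "field1:(word AND token) field2:(parser syntax)"
```

An explicit `OR` typed by the user can be passed through unchanged, since the join only inserts AND between bare terms the application chooses to group.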
How to get default result?
Dear list, I got a question regarding my address search: I am searching for address data. If there is one address field not defined (in this case the housenumber) for the specific query (e.g. city = a, street = b, housenumber=14), I am getting no result. For every street there exists at least one housenumber (=0). Is it possible to get this default value (housenumber 0) as a result, if the user is searching for the housenumber 14, which does not exist in our model? Thanks in advance, Richard -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-default-result-tp3030665p3030665.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr performance tuning - disk i/o?
Polling interval was in reference to slaves in a multi-machine master/slave setup. so probably not a concern just at present. Warmup time of 0 is not particularly normal, I'm not quite sure what's going on there but you may want to look at firstsearcher, newsearcher and autowarm parameters in config.xml.. Best Erick On Mon, Jun 6, 2011 at 9:08 AM, Demian Katz wrote: > Thanks once again for the helpful suggestions! > > Regarding the selection of facet fields, I think publishDate (which is > actually just a year) and callnumber-first (which is actually a very broad, > high-level category) are okay. authorStr is an interesting problem: it's > definitely a useful facet (when a user searches for an author, odds are good > that they want the one who published the most books... i.e. a search for > dickens will probably show Charles Dickens at the top of the facet list), but > it has a long tail since there are many minor authors who have only published > one or two books... Is there a possibility that the facet.mincount parameter > could be helpful here, or does that have no impact on performance/memory > footprint? > > Regarding polling interval for slaves, are you referring to a distributed > Solr environment, or is this something to do with Solr's internals? We're > currently a single-server environment, so I don't think I have to worry if > it's related to a multi-server setup... but if it's something internal, > could you point me to the right area of the admin panel to check my stats? > I'm not seeing anything about polling on the statistics page. It's also a > little strange that all of my warmupTime stats on searchers and caches are > showing as 0 -- is that normal? > > thanks, > Demian > >> -Original Message- >> From: Erick Erickson [mailto:erickerick...@gmail.com] >> Sent: Friday, June 03, 2011 4:45 PM >> To: solr-user@lucene.apache.org >> Subject: Re: Solr performance tuning - disk i/o? 
>> >> Quick impressions: >> >> The faceting is usually best done on fields that don't have lots of >> unique >> values for three reasons: >> 1> It's questionable how much use to the user to have a gazillion >> facets. >> In the case of a unique field per document, in fact, it's useless. >> 2> resource requirements go up as a function of the number of unique >> terms. This is true for faceting and sorting. >> 3> warmup times grow the more terms have to be read into memory. >> >> >> Glancing at your warmup stuff, things like publishDate, authorStr and >> maybe >> callnumber-first are questionable. publishDate depends on how coarse >> the >> resolution is. If it's by day, that's not really much use. authorStr.. >> How many >> authors have more than one publication? Would this be better served by >> some >> kind of autosuggest rather than facets? callnumber-first... I don't >> really know, but >> if it's unique per document it's probably not something the user would >> find useful >> as a facet. >> >> The admin page will help you determine the number of unique terms per >> field, >> which may guide you whether or not to continue to facet on these >> fields. >> >> As Otis said, doing a sort on the fields during warmup will also help. >> >> Watch your polling interval for any slaves in relation to the warmup >> times. >> If your polling interval is shorter than the warmup times, you run a >> risk of >> "runaway warmups". >> >> As you've figured out, measuring responses to the first few queries >> doesn't >> always measure what you really need .. >> >> I don't have the pages handy, but autowarming is a good topic to >> understand, >> so you might spend some time tracking it down. >> >> Best >> Erick >> >> On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz >> wrote: >> > Thanks to you and Otis for the suggestions! Some more information: >> > >> > - Based on the Solr stats page, my caches seem to be working pretty >> well (few or no evictions, hit rates in the 75-80% range). 
>> > - VuFind is actually doing two Solr queries per search (one initial >> search followed by a supplemental spell check search -- I believe this >> is necessary because VuFind has two separate spelling indexes, one for >> shingled terms and one for single words). That is probably >> exaggerating the problem, though based on searches with debugQuery on, >> it looks like it's always the initial search (rather than the >> supplemental spelling search) that's consuming the bulk of the time. >> > - enableLazyFieldLoading is set to true. >> > - I'm retrieving 20 documents per page. >> > - My JVM settings: -server - >> Xloggc:/usr/local/vufind/solr/jetty/logs/gc.log -Xms4096m -Xmx4096m - >> XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=5 >> > >> > It appears that a large portion of my problem had to do with >> autowarming, a topic that I've never had a strong grasp on, though >> perhaps I'm finally learning (any recommended primer links would be >> welcome!). I did have some autowarming settings in solrconfig.xml (an arbitrary search for a bunch of random keywords in the newSearcher and firstSearcher events, plus autowarmCount settings on all of my caches).
Re: Solr Indexing Patterns
On 5 June 2011 14:42, Erick Erickson wrote: > See: http://wiki.apache.org/solr/SchemaXml > > By adding ' "multiValued="true" ' to the field, you can add > the same field multiple times in a doc, something like > > > > value1 > value2 > > > > I can't see how that would work, as one would need to associate the right start/end dates and price. As I understand it, using multiValued fields and thus flattening the discounts would result in: { "name":"The Book", "price":"$9.99", "price":"$3.00", "price":"$4.00","synopsis":"thanksgiving special", "starts":"11-24-2011", "starts":"10-10-2011", "ends":"11-25-2011", "ends":"10-11-2011", "synopsis":"Canadian thanksgiving special", }, How does one differentiate the different offers? > But there's no real ability in Solr to store "sub documents", > so you'd have to get creative in how you encoded the discounts... > This is what I'm asking :) What are the best / recommended / known patterns for doing this? > > But I suspect a better approach would be to store each discount as > a separate document. If you're on the trunk version, you could then > group results by, say, ISBN and get responses grouped together... > This is an option but seems suboptimal. So say I store the discounts in multiple documents with ISBN as an attribute and also store the title again with ISBN as an attribute. To get "all books currently discounted" requires 2 requests: * get all discounts currently active * get all books using the ISBNs retrieved from the above search Not that bad. However, what happens when I want "all books that are currently on discount in the "horror" genre containing the word 'elm' in the title"? The only way I can see of catering for the above search is to duplicate all searchable fields of my "book" document in my "discount" document. Coming from an RDBMS background this seems wrong. Is this the correct approach to take? > > Best > Erick > > On Sat, Jun 4, 2011 at 1:42 AM, Judioo wrote: > > Hi, > > Discounts can change daily. 
Also there can be a lot of them (over time > and > > in a given time period ). > > > > Could you give an example of what you mean buy multi-valuing the field. > > > > Thanks > > > > On 3 June 2011 14:29, Erick Erickson wrote: > > > >> How often are the discounts changed? Because you can simply > >> re-index the book information with a multiValued "discounts" field > >> and get something similar to your example (&wt=json) > >> > >> > >> Best > >> Erick > >> > >> On Fri, Jun 3, 2011 at 8:38 AM, Judioo wrote: > >> > What is the "best practice" method to index the following in Solr: > >> > > >> > I'm attempting to use solr for a book store site. > >> > > >> > Each book will have a price but on occasions this will be discounted. > The > >> > discounted price exists for a defined time period but there may be > many > >> > discount periods. Each discount will have a brief synopsis, start and > end > >> > time. > >> > > >> > A subset of the desired output would be as follows: > >> > > >> > ... > >> > "response":{"numFound":1,"start":0,"docs":[ > >> > { > >> >"name":"The Book", > >> >"price":"$9.99", > >> >"discounts":[ > >> >{ > >> > "price":"$3.00", > >> > "synopsis":"thanksgiving special", > >> > "starts":"11-24-2011", > >> > "ends":"11-25-2011", > >> >}, > >> >{ > >> > "price":"$4.00", > >> > "synopsis":"Canadian thanksgiving special", > >> > "starts":"10-10-2011", > >> > "ends":"10-11-2011", > >> >}, > >> > ] > >> > }, > >> > . > >> > > >> > A requirement is to be able to search for just discounted > publications. I > >> > think I could use date faceting for this ( return publications that > are > >> > within a discount window ). When a discount search is performed no > >> > publications that are not currently discounted will be returned. > >> > > >> > My question are: > >> > > >> > - Does solr support this type of sub documents > >> > > >> > In the above example the discounts are the sub documents. 
I know solr > is > >> not > >> > a relational DB but I would like to store and index the above > >> representation > >> > in a single document if possible. > >> > > >> > - what is the best method to approach the above > >> > > >> > I can see in many examples the authors tend to denormalize to solve > >> similar > >> > problems. This suggest that for each discount I am required to > duplicate > >> the > >> > book data or form a document > >> > association< > http://stackoverflow.com/questions/2689399/solr-associations > >> >. > >> > Which method would you advise? > >> > > >> > It would be nice if solr could return a response structured as above. > >> > > >> > Much Thanks > >> > > >> > > >
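A sketch of the "discount as its own document" approach discussed in this thread. All field names (doc_type, isbn, etc.) and values are illustrative, and the searchable book fields (title, genre) are duplicated onto each discount document — exactly the denormalization the poster is asking about:

```xml
<add>
  <doc>
    <field name="doc_type">discount</field>
    <field name="isbn">0123456789</field>
    <!-- duplicated searchable book fields -->
    <field name="title">The Book</field>
    <field name="genre">horror</field>
    <field name="discount_price">3.00</field>
    <field name="synopsis">thanksgiving special</field>
    <field name="starts">2011-11-24T00:00:00Z</field>
    <field name="ends">2011-11-25T00:00:00Z</field>
  </doc>
</add>
```

"All horror books with 'elm' in the title that are currently on discount" then becomes a single query along the lines of q=doc_type:discount AND genre:horror AND title:elm with a date-window filter fq=starts:[* TO NOW] AND ends:[NOW TO *], at the cost of re-indexing book fields on every discount document.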
Re: Search with Synonyms in two fields
On 6/5/2011 3:36 AM, occurred wrote: Ok, thanks for the answer. My idea now is to store both field values in one field and prefix and suffix the values from field2 with something very special. Also, the synonyms would then have to carry the special prefixes and suffixes. What are you actually trying to do? Usually, what people do is store both the original values and the synonym expansion in one field; no need for custom suffixes. Then you could have a _second_ field with only the original values, without synonym expansion, if you sometimes need to search without it. If you want to search over both original values and expanded synonyms, search over the field that has both. If, in another search, you want only the original values without synonym expansion, search over the field without synonyms expanded in it. That's usually the sort of thing people do. "De-normalization" in Solr is not something to be avoided; it's instead a general pattern.
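A sketch of the two-field setup described above (all field and type names here are made up for illustration): one field type applies SynonymFilterFactory at index time, the other doesn't, and copyField feeds both from the same source field:

```xml
<!-- schema.xml: one analyzed type with synonym expansion, one without -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_plain" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="body_syn"   type="text_syn"   indexed="true" stored="false"/>
<field name="body_plain" type="text_plain" indexed="true" stored="false"/>
<copyField source="body" dest="body_syn"/>
<copyField source="body" dest="body_plain"/>
```

Queries that should match synonyms go against body_syn; queries that should match only the literal text go against body_plain — no custom prefixes or suffixes needed.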
Master Slave help
Hi, I have configured my master and slave servers and everything seems to be running fine; the replication completed the first time it ran. But every time I go to the replication link in the admin panel after restarting the server, or on server startup, I notice the replication starting from scratch, or at least the stats show that. What could be wrong? Thanks, Rohit
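One thing worth checking here: the counters on the admin replication page are kept in memory, so they can reset after a restart and look like replication "starting from scratch" even when the index on disk is intact. The ReplicationHandler's details command shows what the slave actually holds (host/port are illustrative):

```text
http://slave-host:8983/solr/replication?command=details
```

Compare the indexVersion reported there with the master's; if they match, the slave is only pulling incremental changes, not a fresh full copy.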
Need query help
For now I have a collection with: id (int), price (double) multiValued, brand_id (int), filters (string) multiValued. I need to get the available brand_id, filters and price values, plus the list of ids, for the current query. For example, now I'm doing queries with facet.field=brand_id/filters/price: 1) to get the current id list: (brand_id:100 OR brand_id:150) AND (filters:p1s100 OR filters:p4s20) 2) to get the available filters for the selected properties (same properties but other values): (brand_id:100 OR brand_id:150) AND (filters:p1s* OR filters:p4s*) 3) to get the available brand_id values (if any are selected; if none, take them from the 1st query's results): (filters:p1s100 OR filters:p4s20) 4) another request to get the available prices, if any are selected. Is there any way to simplify this task? Data needed: 1) ids for the selected filters, price, brand_id 2) available filters, price, brand_id given the selected values 3) other values for the selected properties (if any are chosen) 4) other brand_id values besides the selected brand_id 5) other prices besides the selected price. Will appreciate any help or thoughts! Cheers, Denis Kuzmenok
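The separate queries above can usually be collapsed into one request with filter queries plus tagged/excluded facets (multi-select faceting, available since Solr 1.4). A sketch using the field names from the question:

```text
q=*:*
&fq={!tag=br}brand_id:(100 OR 150)
&fq={!tag=fl}filters:(p1s100 OR p4s20)
&facet=true
&facet.field={!ex=br}brand_id
&facet.field={!ex=fl}filters
&facet.field=price
```

One response then gives the matching ids (the result list), the brand_id facet computed as if the brand filter were absent (so "other brands" still appear), and the filters facet computed as if the filters clause were absent — which covers points 1–4 in a single round trip.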
Auto-scaling solr setup
So I am trying to set up an auto-scaling search system of EC2 Solr slaves which scales up as the number of requests increases, and vice versa. Here is what I have: 1. A Solr master and underlying slaves (scalable), and an elastic load balancer to distribute the load. 2. The EC2 auto-scaling setup fires up nodes when traffic increases. However, the replication times (replication speed) for the index from the master vary for these newly fired nodes. 3. I want to avoid adding these nodes to the load balancer until they have completed the initial replication and have a warmed-up cache. For this I need a way to check whether the initial replication has completed, and also a way of warming up the cache after that. I can think of doing this via a shell script/awk (checking times replicated / index size)... is there a cleaner way? Also, on a side note, any suggestions or pointers on how others have set up a scalable Solr system in the cloud (AWS mainly) would be helpful. Regards, Akshay
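For step 3, the slave's ReplicationHandler (/solr/replication?command=details) reports the index version and whether a replication is currently in progress. A minimal sketch of the decision logic, assuming the master's and slave's details responses have already been fetched and parsed into dicts — the key names below follow the handler's output but should be verified against your Solr version:

```python
def initial_replication_done(master_details, slave_details):
    """Return True once the slave has caught up with the master's index
    version and no replication is currently running."""
    master_version = slave_version = None
    master_version = master_details["indexVersion"]
    slave_version = slave_details["indexVersion"]
    # the slave section reports replication progress as a string flag
    replicating = slave_details.get("slave", {}).get("isReplicating", "false")
    return slave_version >= master_version and replicating == "false"
```

Once this returns True, a warm-up script can fire a handful of representative production queries at the node (the same idea as firstSearcher warming) before registering it with the load balancer.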
RE: Solr performance tuning - disk i/o?
Thanks once again for the helpful suggestions! Regarding the selection of facet fields, I think publishDate (which is actually just a year) and callnumber-first (which is actually a very broad, high-level category) are okay. authorStr is an interesting problem: it's definitely a useful facet (when a user searches for an author, odds are good that they want the one who published the most books... i.e. a search for dickens will probably show Charles Dickens at the top of the facet list), but it has a long tail since there are many minor authors who have only published one or two books... Is there a possibility that the facet.mincount parameter could be helpful here, or does that have no impact on performance/memory footprint? Regarding polling interval for slaves, are you referring to a distributed Solr environment, or is this something to do with Solr's internals? We're currently a single-server environment, so I don't think I have to worry if it's related to a multi-server setup... but if it's something internal, could you point me to the right area of the admin panel to check my stats? I'm not seeing anything about polling on the statistics page. It's also a little strange that all of my warmupTime stats on searchers and caches are showing as 0 -- is that normal? thanks, Demian > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Friday, June 03, 2011 4:45 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr performance tuning - disk i/o? > > Quick impressions: > > The faceting is usually best done on fields that don't have lots of > unique > values for three reasons: > 1> It's questionable how much use to the user to have a gazillion > facets. > In the case of a unique field per document, in fact, it's useless. > 2> resource requirements go up as a function of the number of unique > terms. This is true for faceting and sorting. > 3> warmup times grow the more terms have to be read into memory. 
> > > Glancing at your warmup stuff, things like publishDate, authorStr and > maybe > callnumber-first are questionable. publishDate depends on how coarse > the > resolution is. If it's by day, that's not really much use. authorStr.. > How many > authors have more than one publication? Would this be better served by > some > kind of autosuggest rather than facets? callnumber-first... I don't > really know, but > if it's unique per document it's probably not something the user would > find useful > as a facet. > > The admin page will help you determine the number of unique terms per > field, > which may guide you whether or not to continue to facet on these > fields. > > As Otis said, doing a sort on the fields during warmup will also help. > > Watch your polling interval for any slaves in relation to the warmup > times. > If your polling interval is shorter than the warmup times, you run a > risk of > "runaway warmups". > > As you've figured out, measuring responses to the first few queries > doesn't > always measure what you really need .. > > I don't have the pages handy, but autowarming is a good topic to > understand, > so you might spend some time tracking it down. > > Best > Erick > > On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz > wrote: > > Thanks to you and Otis for the suggestions! Some more information: > > > > - Based on the Solr stats page, my caches seem to be working pretty > well (few or no evictions, hit rates in the 75-80% range). > > - VuFind is actually doing two Solr queries per search (one initial > search followed by a supplemental spell check search -- I believe this > is necessary because VuFind has two separate spelling indexes, one for > shingled terms and one for single words). That is probably > exaggerating the problem, though based on searches with debugQuery on, > it looks like it's always the initial search (rather than the > supplemental spelling search) that's consuming the bulk of the time. 
> > - enableLazyFieldLoading is set to true. > > - I'm retrieving 20 documents per page. > > - My JVM settings: -server - > Xloggc:/usr/local/vufind/solr/jetty/logs/gc.log -Xms4096m -Xmx4096m - > XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=5 > > > > It appears that a large portion of my problem had to do with > autowarming, a topic that I've never had a strong grasp on, though > perhaps I'm finally learning (any recommended primer links would be > welcome!). I did have some autowarming settings in solrconfig.xml (an > arbitrary search for a bunch of random keywords in the newSearcher and > firstSearcher events, plus autowarmCount settings on all of my caches). > However, when I looked at the debugQuery output, I noticed that a huge > amount of time was being wasted loading facets on the first search > after restarting Solr, so I changed my newSearcher and firstSearcher > events to this: > > > > > > > > *:* > > 0 > > 10 > > true > > 1
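The XML of the event listeners quoted above was stripped by the archive; the surviving values (*:*, 0, 10, true, 1) suggest a QuerySenderListener configuration roughly like the following — the exact parameter names are a guess reconstructed from those values:

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="start">0</str>
      <str name="rows">10</str>
      <str name="facet">true</str>
      <str name="facet.mincount">1</str>
      <!-- plus one facet.field entry per facet used in production -->
    </lst>
  </arr>
</listener>
```

The same block, registered for the newSearcher event as well, pre-loads the facet data so the first user query after a restart or commit doesn't pay that cost.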
problem: zooKeeper Integration with solr
Hi folks, I am using Solr to index around 100mn docs. Now I am planning to move to a cluster-based Solr setup so that I can scale the indexing and searching process. Since SolrCloud is in the development stage, I am trying to index in a sharded environment using ZooKeeper. I followed the steps from http://wiki.apache.org/solr/ZooKeeperIntegration but I am still not able to do distributed search. Once I index the docs in one shard, I am not able to query them from the other shard, and vice versa (using the query http://localhost:8180/solr/select/?q=itunes&version=2.2&start=0&rows=10&indent=on ). I am running Solr 3.1 on Ubuntu 10.10. Please help me. -- Thanks and Regards Mohammad Shariq
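One likely cause: in released Solr 3.1, a query sent to one shard is not automatically fanned out to the others (the ZooKeeper/SolrCloud work was still on trunk at the time). Plain distributed search requires an explicit shards parameter listing every shard on each request, e.g. (host names illustrative):

```text
http://localhost:8180/solr/select?q=itunes&shards=localhost:8180/solr,otherhost:8180/solr
```

The node receiving the request queries each shard in the list and merges the results, so a document indexed on either shard is returned.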
Re: Applying synonyms increase the data size from MB to GBs
Have you considered query-time expansion rather than index-time expansion? In general this will lead to more complex queries, but smaller indexes. Take a look at the analysis page available from the admin page to see exactly what happens. What is the high-level problem you're trying to solve? Having this huge an expansion in index size is pretty unusual, and I'm wondering if there might be another approach to the problem... Best Erick On Mon, Jun 6, 2011 at 6:19 AM, Ahmet Arslan wrote: >> Is there a way where in I can apply all those file to same >> tag with some >> delimiter separated? >> >> like this: >> > class="solr.SynonymFilterFactory" >> synonyms="BODYTaxonomy.txt >> , ClinicalObs.txt, MicTaxo.txt, SPTaxo.txt" >> ignoreCase="true" >> expand="true"/> > > > Yes, you can perfectly feed multiple text files separated by comma to > synonyms parameter. > > synonyms="BODYTaxonomy.txt,ClinicalObs.txt,MicTaxo.txt,SPTaxo.txt" >
Re: java.io.IOException: The specified network name is no longer available
Yep, but note the discussion. It's not at all clear that Solr is the place to deal with an unreliable network, and it sounds like that's the root of your issue. It doesn't look like anyone's hot to change Solr's behavior here, and it's arguable that Solr isn't the place to compensate for an unreliable share, but that's debatable. Do you have the energy to propose a patch? Best Erick On Mon, Jun 6, 2011 at 1:02 AM, Gaurav Shingala wrote: > > Hi, > > Yes, you are right I have a remote file system also I have checked and > confirmed that there was no issue in network. > One more thing i need to include here is i had found same bug with ID > SOLR-2235 on ASF JIRA. > > > Thanks, > Gaurav > >> Date: Fri, 3 Jun 2011 09:13:00 -0400 >> Subject: Re: java.io.IOException: The specified network name is no longer >> available >> From: erickerick...@gmail.com >> To: solr-user@lucene.apache.org >> >> You'v got to tell us more about your setup. We can only guess that you're >> on a remote file system and there's a problem there, which would be a >> network problem outside of Solr's purview >> >> You might want to review: >> http://wiki.apache.org/solr/UsingMailingLists >> >> Best >> Erick >> >> On Fri, Jun 3, 2011 at 1:52 AM, Gaurav Shingala >> wrote: >> > >> > Hi, >> > >> > I am using solr 1.4.1 and at the time of updating index getting following >> > error: >> > >> > 2011-06-03 05:54:06,943 ERROR [org.apache.solr.core.SolrCore] >> > (http-10.38.33.146-8080-4) java.io.IOException: The specified network name >> > is no longer available >> > at java.io.RandomAccessFile.readBytes(Native Method) >> > at java.io.RandomAccessFile.read(RandomAccessFile.java:322) >> > at >> > org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(SimpleFSDirectory.java:132) >> > at >> > org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157) >> > at >> > org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38) >> > at 
org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78) >> > at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:64) >> > at >> > org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:129) >> > at >> > org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:160) >> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232) >> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179) >> > at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57) >> > at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:1103) >> > at >> > org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:981) >> > at >> > org.apache.solr.search.SolrIndexReader.termDocs(SolrIndexReader.java:320) >> > at >> > org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:640) >> > at >> > org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545) >> > at >> > org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:581) >> > at >> > org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:903) >> > at >> > org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884) >> > at >> > org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341) >> > at >> > org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182) >> > at >> > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) >> > at >> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) >> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) >> > at >> > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) >> > at >> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) >> > at >> > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:274) 
>> > at >> > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:242) >> > at >> > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275) >> > at >> > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) >> > at >> > org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:181) >> > at >> > org.jboss.modcluster.catalina.CatalinaContext$RequestListenerValve.event(CatalinaContext.java:285) >> > at >> > org.jboss.modcluster.catalina.CatalinaContext$RequestListenerValve.invoke(CatalinaContext.java:261) >> > at >> > org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:88) >> > at >> > org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:100) >> > at >> > or
Re: TIKA INTEGRATION PERFORMANCE
1. About the commit strategy: all the ExtractingRequestHandler (the request handler that uses Tika to extract content from the input file) will do is extract the content of your file and add it to a SolrInputDocument. The commit strategy should not change because of this, compared to other documents you might be indexing. It is usually not recommended to commit on every new/updated document. 2. I don't know if I understand the question. You can add all the static fields you want to the document by adding the "literal." prefix to the name of the fields when using ExtractingRequestHandler (as you are doing with "literal.id"). You can also leave fields empty if they are not marked as "required" in the schema.xml file. See: http://wiki.apache.org/solr/ExtractingRequestHandler#Literals 3. Solr cores can work almost as completely different Solr instances. You could tell one core to replicate from another core, but I don't think this would be of any help here. If you want to separate the indexing operations from the query operations, you could probably use different machines; that's usually a better option. Configure the indexing box as master and the query box as slave. Here you have some more information about it: http://wiki.apache.org/solr/SolrReplication Were these the answers you were looking for, or did I misunderstand your questions? Tomás On Mon, Jun 6, 2011 at 2:54 AM, Naveen Gupta wrote: > Hi > > Since it is php, we are using solphp for calling curl based call, > > what my concern here is that for each user, we might be having 20-40 > attachments needed to be indexed each day, and there are various users > ..daily we are targeting around 500-1000 users .. 
> > right now if you see, we > > $ch = curl_init(' > http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true'); > curl_setopt ($ch, CURLOPT_POST, 1); > curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>"@paper.pdf")); > $result= curl_exec ($ch); > ?> > > also we are planning to use other fields which are to be indexed and stored > ... > > > There are a couple of questions here > > 1. What would be the best strategy for commits? If we take all the > documents in an array, iterate over them one by one firing the curl call, and > commit only for the last doc, will it work, or do we need to commit for each doc? > > 2. We have several fields which are already defined in the schema, and a few > of them were required earlier, but for this purpose we don't want them. How > can we have the two requirements together in the same schema? > > 3. Since commits are frequent, how can we use Solr multicore for write and > read > operations separately? > > Thanks > Naveen >
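For question 1, the usual pattern is to drop commit=true from the per-document extract calls and issue a single commit after the whole batch, matching the advice above. A sketch with curl (ids, filenames, and the port are illustrative, and a running Solr instance is assumed):

```shell
# index each attachment without committing
curl 'http://localhost:8010/solr/update/extract?literal.id=doc1' -F 'myfile=@paper1.pdf'
curl 'http://localhost:8010/solr/update/extract?literal.id=doc2' -F 'myfile=@paper2.pdf'

# one commit for the whole batch
curl 'http://localhost:8010/solr/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'
```

Documents added before the commit are simply not visible to searches until the commit runs; nothing is lost by deferring it.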
Re: synonyms problem
What does "call synonym methods in Java" mean? That is, what are you trying to accomplish and from where? Best Erick On Sun, Jun 5, 2011 at 9:48 PM, deniz wrote: > well i have changed it into text... but still confused about how to use > synonyms... > > and also I want to know how to call synonym methods in java... i have tried > to use synonymmap and some other similar things but nothing happens... > anyone can give me a sample or a website that i can find examples about solr > in java? > > - > Zeki ama calismiyor... Calissa yapar... > -- > View this message in context: > http://lucene.472066.n3.nabble.com/synonyms-problem-tp3014006p3028353.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Expunging deletes from a very large index
You can drop your mergeFactor to 2 and then run expungeDeletes? This will make the operation take longer but (assuming you have > 3 segments in your index) should use less transient disk space. You could also make a custom merge policy, that expunges one segment at a time (even slower but even less transient disk space required). optimize(maxNumSegments) may also help, though it's not guaranteed to reclaim disk space due to deleted docs. Mike McCandless http://blog.mikemccandless.com On Mon, Jun 6, 2011 at 2:16 AM, Simon Wistow wrote: > Due to some emergency maintenance I needed to run delete on a large > number of documents in a 200Gb index. > > The problem is that it's taking an inordinately long amount of time (2+ > hours so far and counting) and is steadily eating up disk space - > presumably up to 2x index size which is getting awfully close to the > wire on this machine. > > Is that inevitable? Is there any way to speed up the process or use less > space? Maybe do an optimize with a different number of maxSegments? > > I suspect not but I thought it was worth asking. > > > > >
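For reference, the two knobs mentioned above live in different places: mergeFactor is set in solrconfig.xml, while expungeDeletes is a flag on the commit command posted to /update. A sketch of both pieces:

```xml
<!-- solrconfig.xml (index settings): merge fewer segments at a time -->
<mergeFactor>2</mergeFactor>

<!-- update message posted to /update: commit and merge away deleted docs -->
<commit expungeDeletes="true"/>
```

With a low mergeFactor the expunge proceeds in smaller merges, which is slower but needs less transient disk space than a full optimize of a 200 GB index.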
Re: Applying synonyms increase the data size from MB to GBs
> Is there a way where in I can apply all those file to same > tag with some > delimiter separated? > > like this: > class="solr.SynonymFilterFactory" > synonyms="BODYTaxonomy.txt > , ClinicalObs.txt, MicTaxo.txt, SPTaxo.txt" > ignoreCase="true" > expand="true"/> Yes, you can perfectly feed multiple text files separated by comma to synonyms parameter. synonyms="BODYTaxonomy.txt,ClinicalObs.txt,MicTaxo.txt,SPTaxo.txt"
Travel Assistance applications now open for ApacheCon NA 2011
The Apache Software Foundation (ASF)'s Travel Assistance Committee (TAC) is now accepting applications for ApacheCon North America 2011, 7-11 November in Vancouver BC, Canada. The TAC is seeking individuals from the Apache community at-large --users, developers, educators, students, Committers, and Members-- who would like to attend ApacheCon, but need some financial support in order to be able to get there. There are limited places available, and all applicants will be scored on their individual merit. Financial assistance is available to cover flights/trains, accommodation and entrance fees either in part or in full, depending on circumstances. However, the support available for those attending only the BarCamp (7-8 November) is less than that for those attending the entire event (Conference + BarCamp 7-11 November). The Travel Assistance Committee aims to support all official ASF events, including cross-project activities; as such, it may be prudent for those in Asia and Europe to wait for an event geographically closer to them. More information can be found at http://www.apache.org/travel/index.html including a link to the online application and detailed instructions for submitting. Applications will close on 8 July 2011 at 22:00 BST (UTC/GMT +1). We wish good luck to all those who will apply, and thank you in advance for tweeting, blogging, and otherwise spreading the word. Regards, The Travel Assistance Committee
Re: Solr Field name restrictions
Hi, Using Solr 3.1, I'm getting errors when trying to sort on fields containing dashes in the name... So that's true: stay away from dashes if you can. Marc. On Sun, Jun 5, 2011 at 3:46 PM, Erick Erickson wrote: > I'd stay away from dashes too. It's too easy for the query parsers > to mistake them for the NOT operator in a URL. > > You've really got two issues here: > 1> what is allowable in the field name > 2> what causes grief with some query parser. > > To avoid <2>, I'd really just stick with characters and underscores. > > Best > Erick > > 2011/6/4 François Schiettecatte : > > Underscores and dashes are fine, but I would think that colons (:) are > verboten. > > > > François > > > > On Jun 4, 2011, at 9:49 PM, Jamie Johnson wrote: > > > >> Is there a list anywhere detailing field name restrictions? I imagine > >> fields containing periods (.) are problematic if you try to use that > field > >> when doing faceted queries, but are there any others? Are underscores > (_) > >> or dashes (-) ok? > > > > >
Re: Feature: skipping caches and info about cache use
Solr 1.3+ logs only fresh queries. If you re-run the same query, it is served from the cache and not printed in the logs (unless the cache(s) are not warmed or the searcher is reopened). So Otis's proposal would definitely help in doing some benchmarks and baselining the search :) -- View this message in context: http://lucene.472066.n3.nabble.com/Feature-skipping-caches-and-info-about-cache-use-tp3020325p3028894.html Sent from the Solr - User mailing list archive at Nabble.com.