Sorting results by last update date
Hi All,

I am trying to sort the results by last updated date. My URL looks as below:

*fq=last_updated_date:[NOW-60DAY TO NOW]&fq=experience:[0 TO 588]&fq=salary:[0 TO 500] OR salary:0&fq=-bundle:job&fq=-bundle:panel&fq=-bundle:page&fq=-bundle:article&spellcheck=true&q=+java +sip&fl=id,entity_id,entity_type,bundle,bundle_name,label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,tos_name,zm_parent_entity,ss_filemime,ss_file_entity_title,ss_file_entity_url,ss_field_uid&spellcheck.q=+java +sip&qf=content^40&qf=label^5.0&qf=tos_content_extra^0.1&qf=tos_name^3.0&hl.fl=content&mm=1&q.op=AND&wt=json&json.nl=map&sort=last_updated_date asc*

With this I get the data in ascending order of last updated date. If I try to sort the data in descending order, I use the URL below:

*fq=last_updated_date:[NOW-60DAY TO NOW]&fq=experience:[0 TO 588]&fq=salary:[0 TO 500] OR salary:0&fq=-bundle:job&fq=-bundle:panel&fq=-bundle:page&fq=-bundle:article&spellcheck=true&q=+java +sip&fl=id,entity_id,entity_type,bundle,bundle_name,label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,tos_name,zm_parent_entity,ss_filemime,ss_file_entity_title,ss_file_entity_url,ss_field_uid&spellcheck.q=+java +sip&qf=content^40&qf=label^5.0&qf=tos_content_extra^0.1&qf=tos_name^3.0&hl.fl=content&mm=1&q.op=AND&wt=json&json.nl=map&sort=last_updated_date desc*

Here the data set is not ordered properly; it mostly looks to me as if the data is ordered on the basis of score, not last updated date. Can somebody tell me what I am missing here, and why *desc* is not working properly for me?

Thanks
kamal
Re: What exactly happens to extant documents when the schema changes?
On Tue, May 28, 2013 at 2:20 PM, Upayavira u...@odoko.co.uk wrote:

> The schema provides Solr with a description of what it will find in the Lucene indexes. If you, for example, changed a string field to an integer in your schema, that'd mess things up bigtime.
>
> I recently had to upgrade a date field from the 1.4.1 date field format to the newer TrieDateField. Given I had to do it on a live index, I had to add a new field (just using copyField) and re-index over the top, as the old field was still in use. I guess, given my app now uses the new date field only, I could presumably reindex the old date field with the new TrieDateField format, but I'd want to try that before I do it for real.

Thank you for the insight. Unfortunately, with 20 million records and growing by hundreds each minute (social media posts), I don't see that I could ever reindex the data in a timely way.

> However, if you changed a single valued field to a multi-valued one, that's not an issue, as a field with a single value is still valid for a multi-valued field. Also, if you add a new field, existing documents will be considered to have no value in that field. If that is acceptable, then you're fine. I guess if you remove a field, then those fields will be ignored by Solr, and thus not impact anything. But I have to say, I've never tried that.
>
> Thus - changing the schema will only impact on future indexing. Whether your existing index will still be valid depends upon the changes you are making.
>
> Upayavira

Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
Re: What exactly happens to extant documents when the schema changes?
On Tue, May 28, 2013 at 3:58 PM, Jack Krupansky j...@basetechnology.com wrote:

> The technical answer: Undefined and not guaranteed.

I was afraid of that!

> Sure, you can experiment and see what the effects happen to be in any given release, and maybe they don't tend to change (too much) between most releases, but there is no guarantee that any given "change schema but keep existing data without a delete of directory contents and full reindex" will actually be benign or what you expect. As a general proposition, when it comes to changing the schema and not deleting the directory and doing a full reindex, don't do it! Of course, we all know not to try to walk on thin ice, but a lot of people will try to do it anyway - and maybe it happens that most of the time the results are benign.

In the case of this particular application, reindexing really is overly burdensome, as the application is performing hundreds of writes to the index per minute. How might I gauge how much spare I/O Solr could commit to a reindex? All the data that I need is in fact in stored fields. Note that because the social media application that feeds our Solr index is global, there are no 'off hours'.

> OTOH, you could file a Jira to propose that the effects of changing the schema but keeping the existing data should be precisely defined and documented, but, that could still change from release to release.

Seems like a lot of effort to document, for little benefit. I'm not going to file it. I would like to know, though: is the schema consulted at index time, query time, or both?

> From a practical perspective for your original question: If you suddenly add a field, there is no guarantee what will happen when you try to access that field for existing documents, or what will happen if you update existing documents. Sure, people can talk about what happens to be true today, but there is no guarantee for the future. Similarly for deleting a field from the schema, there is no guarantee about the status of existing data, even though people can chatter about what it seems to do today. Generally, you should design your application around contracts and what is guaranteed to be true, not what happens to be true from experiments or even experience. Granted, that is the theory, and sometimes you do need to rely on experimentation and folklore and spotty or ambiguous documentation, but to the extent possible, it is best to avoid explicitly trying to rely on undocumented, uncontracted behavior.

Thanks. The application does change (added features) and we do not want to lose old data.

> One question I asked long ago and never received an answer: what is the best practice for doing a full reindex - is it sufficient to first do a delete of *:*, or does the Solr index directory contents or even the directory itself need to be explicitly deleted first? I believe it is the latter, but the former seems to work, most of the time. Deleting the directory itself seems to be the best answer, to date - but no guarantees!

I don't have an answer for that, sorry!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
Re: Choosing specific fields for suggestions in SpellCheckerComponent
Hi Wilson,

I don't think SpellCheckComponent supports multiple fields in the same dictionary. Am I missing something?

On Wed, May 29, 2013 at 10:24 AM, Wilson Passos wrpas...@gmail.com wrote:

> Hi everyone,
>
> I've been searching about how to configure the SpellCheckerComponent in Solr 4.0 to support suggestion queries based on a subset of the configured fields in schema.xml. Let's say the spell checking is configured to use these 4 fields:
>
>   <field name="field1" type="text_general"/>
>   <field name="field2" type="text_general"/>
>   <field name="field3" type="text_general"/>
>   <field name="field4" type="text_general"/>
>
> I'd like to know if there's any possibility to dynamically set the SpellCheckerComponent to suggest terms using just fields field2 and field3, instead of the default behavior, which always includes suggestions across the 4 defined fields.
>
> Thanks in advance for any help!

--
Regards,
Shalin Shekhar Mangar.
Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery
I have opened https://issues.apache.org/jira/browse/SOLR-4870

On Tue, May 28, 2013 at 5:53 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

> This sounds like a bug. I'll open an issue. Thanks!
>
> On Tue, May 28, 2013 at 2:29 PM, AlexeyK lex.kudi...@gmail.com wrote:
>
>> The cluster state problem reported above is not an issue - it was caused by our own code. Speaking about the update log, I have noticed a strange behavior concerning the replay. The replay is *supposed* to be done for a predefined number of log entries, but actually it is always done for the whole last 2 tlogs. RecentUpdates.update() reads the log within while (numUpdates < numRecordsToKeep), but numUpdates is never incremented, so it exits only when the reader reaches EOF.

--
Regards,
Shalin Shekhar Mangar.
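A simplified sketch of the loop shape being described may make the reported bug easier to see. All names here are hypothetical stand-ins, not the actual Solr source:

    // Intent: stop after numRecordsToKeep updates.
    // Reality: the counter is never advanced, so only EOF ends the loop.
    int numUpdates = 0;
    while (numUpdates < numRecordsToKeep) {
        Object entry = reader.next();   // hypothetical tlog reader call
        if (entry == null) break;       // EOF: the only exit actually taken
        collect(entry);
        // missing: numUpdates++;
    }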
Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery
On Thu, May 23, 2013 at 7:00 PM, AlexeyK lex.kudi...@gmail.com wrote:

> <snip/>
> From what I understood from the code, for each 'add' command there is a test for a 'delete by query'. If there is an older DBQ, it's run after the 'add' operation if its version > the 'add' version. In my case, there are a lot of documents to be inserted, and a single large DBQ. My question is: shouldn't this be done in bulks? Why is it necessary to run the DBQ after each insertion? Suppose there are 1000 insertions: it's run 1000 times.

As I understand it, this is done to handle out-of-order updates. Suppose a client makes a few add requests and then invokes a DBQ, but the DBQ reaches the replicas before the last add request. In such a case, the DBQ is executed after the add request to preserve consistency. We don't do that in bulk because we don't know how long to wait for all add requests to arrive. Also, the individual add requests may arrive via different threads (think connection reset from leader to replica).

That being said, the scenario you describe, of 1000 insertions causing DBQs to be run a large number of times (on recovery after restarting), could be optimized.

Note that the bug you discovered (SOLR-4870) does not affect log replay, because log replay on startup will replay all of the last two transaction logs (unless they end with a commit). Only PeerSync is affected by SOLR-4870.

You say that both nodes are leaders, but the comment inside DirectUpdateHandler2.addDoc() says that deletesAfter (i.e. reordered DBQs) should always be null on leaders. So there's definitely something fishy here. A quick review of the code leads me to believe that reordered DBQs can happen on a leader as well. I'll investigate further.

--
Regards,
Shalin Shekhar Mangar.
Re: Sorting results by last update date
On Wed, May 29, 2013 at 12:10 PM, Kamal Palei palei.ka...@gmail.com wrote:

> Hi All, I am trying to sort the results by last updated date.
> <snip/>
> Here the data set is not ordered properly; it mostly looks to me as if the data is ordered on the basis of score, not last updated date. Can somebody tell me what I am missing here, and why *desc* is not working properly for me?

What is the field type of last_update_date? Which version of Solr?

A side note: Using NOW in a filter query is inefficient because it doesn't use your filter cache effectively. Round it to the nearest time interval instead. See http://java.dzone.com/articles/solr-date-math-now-and-filter

--
Regards,
Shalin Shekhar Mangar.
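To make the rounding suggestion concrete, a hedged example of the same filter rewritten with Solr date math (shown URL-encoded, since a raw + in a URL would be read as a space):

    fq=last_updated_date:[NOW/DAY-60DAYS TO NOW/DAY%2B1DAY]

Because NOW/DAY rounds to midnight, the filter string is identical for every request made on the same day, so the filter cache entry can be reused; the unrounded NOW version above produces a new, uncacheable filter on every request.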
Re: delta-import tweaking?
Hi Shawn, and first off, thanks bunches for your pointers.

On Tue, 28 May 2013 09:31:54 -0600, Shawn Heisey s...@elyograg.org wrote:

> My workaround was to store the highest indexed autoincrement value in a location outside Solr. In my original Perl code, I dropped it into a file on NFS. The latest iteration of my indexing code (Java, using SolrJ) no longer uses DIH for regular indexing, but it still uses that stored autoincrement value, this time in another database table. I do still use full-import for complete index rebuilds.

Well, overall, after playing with it a bit last night, I decided to also go down the SolrJ way; we'll likely use this in the future anyway, as the rest of our environment is Java too, so going for it right now seems just the logical thing to do.

Thanks and all the best!
Kristian
Reindexing strategy
I see that I do need to reindex my Solr index. The index consists of 20 million documents, with a few hundred new documents added per minute (social media data). The documents are mostly smaller than 1 KiB of data, but some may go as large as 10 KiB. All the data is text, and all indexed fields are stored.

To reindex, I am considering adding a 'last_indexed' field, and having a Python or Java application pull out N results every T seconds when sorting on last_indexed asc. How might I determine good values for N and T? I would like to know when the Solr index is 'overloaded', or whatever happens to Solr when it is being pushed beyond the limits of its hardware. What should I be looking at to know if Solr is overstressed? Is looking at CPU and memory good enough? Is there a way to measure I/O to the disk on which the Solr index is stored?

Bear in mind that while the reindex is happening, clients will be performing searches and a few hundred documents will be written per minute. Note that the machine running Solr is an EC2 instance running on Amazon Web Services, and that the 'disk' on which the Solr index is stored is an EBS volume.

Thank you.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
Re: Strange behavior on text field with number-text content
Hmmm, there are two things you _must_ get familiar with when diagnosing these <G>...

1> admin/analysis. That'll show you exactly what the analysis chain does, and it's not always obvious.
2> add debug=query to your input and look at the parsed query results. For instance, this: name:4nSolution Inc. parses as name:4nSolution defaultfield:inc.

That doesn't explain why name:4nSolution misses, except... your index chain has splitOnCaseChange=1 and your query chain has splitOnCaseChange=0, which doesn't seem right.

Best
Erick

On Tue, May 28, 2013 at 10:31 AM, Алексей Цой alexey...@gmail.com wrote:

2013/5/28 Michał Matulka michal.matu...@gowork.pl

Thanks for your responses, I must admit that after hours of trying I made some mistakes. So the most problematic phrase will now be: 4nSolution Inc., which cannot be found using the query name:4nSolution or even name:"4nSolution Inc.", but can be found using the following queries: name:nSolution, name:4, name:inc. Sorry for the mess; it turned out I didn't reindex fields after modifying the schema, so I thought that the problem also applied to "300letters".

The cause of all of this is the WordDelimiter filter, defined as follows:

  <fieldType name="text" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- in this example, we will only use synonyms at query time
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      -->
      <!-- Case insensitive stop word removal. Add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="1" splitOnCaseChange="0" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
  </fieldType>

and I still don't know why it behaves like that - after all, there is the preserveOriginal attribute set to 1...

On 28.05.2013 14:21, Erick Erickson wrote:

Hmmm, with 4.x I get much different behavior than you're describing; what version of Solr are you using? Besides Alex's comments, try adding debug=query to the url and see what comes out from the query parser. A quick glance at the code shows that DefaultAnalyzer is used, which doesn't do any analysis. Here's the javadoc...

/**
 * Default analyzer for types that only produces 1 verbatim token...
 * A maximum size of chars to be read must be specified
 */

so it's much like the string type. Which means I'm totally perplexed by your statement that "300" and "letters" return a hit. Have you perhaps changed the field definition and not re-indexed?
The behavior you're seeing really looks like somehow WordDelimiterFilterFactory is getting into your analysis chain with settings that don't mash the parts back together, i.e. you can set up WDDF to split on letter/number transitions, index each and NOT index the original, but I have no explanation for how that could happen with the field definition you indicated.

FWIW,
Erick

On Tue, May 28, 2013 at 7:47 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

What does the analyzer screen say in the Web AdminUI when you try to do that? Also, what are the tokens stored in the field (also in the Web AdminUI)? I think it is very strange to have a TextField without a tokenizer chain. Maybe you get a standard one assigned by default, but I don't know what the standard chain would be.

Regards,
Alex.

On 28 May 2013 04:44, Michał Matulka michal.matu...@gowork.pl wrote:

Hello, I've got following problem. I have a text type in my schema and a field name of that type. That field contains a data, there is, for example, record
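Following Erick's debug=query suggestion, a hedged example of inspecting the parsed query against a local Solr (URL, core, and field name are placeholders):

    curl 'http://localhost:8983/solr/select?q=name:4nSolution&debug=query&wt=xml'

The parsedquery section of the response shows what the query-time analysis chain actually produced, which makes index/query mismatches, like the differing splitOnCaseChange settings above, straightforward to spot.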
Re: Keeping a rolling window of indexes around solr
I suspect you're worrying about something you don't need to. At 1 insert every 30 seconds, and assuming 30,000,000 records will fit on a machine (I've seen this), you're talking 900,000,000 seconds' worth of data on a single box, or roughly 10,000 days' worth. Test, of course; YMMV. Or I'm misunderstanding what "1 log insert" means; I guess it could be a full log file...

But do the simple thing first: just let Solr do what it does by default, and periodically do a delete-by-query on documents you want to roll off the end. Especially since you say that queries happen every few days. The tricks for utilizing hot shards are probably not very useful for you with that low a query rate.

Test, of course...

Best
Erick

On Tue, May 28, 2013 at 8:42 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:

Volume of data: 1 log insert every 30 seconds; queries done sporadically, asynchronously, every so often at a much lower frequency, every few days. Also, the majority of the requests are indeed going to be within a splice of time (typically hours, or at most a few days).

Type of queries:
Keyword or term search
Search by guid (or id as known in the solr world)
Reserved or percolation queries to be executed when new data becomes available
Search by dates as mentioned above

Regards

Sent from my iPhone

On May 28, 2013, at 4:25 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: This is kind of the approach used by elastic search , if I'm not using
: solrcloud will I be able to use shard aliasing, also with this approach
: how would replication work, is it even needed?

You haven't said much about the volume of data you expect to deal with, nor have you really explained what types of queries you intend to do -- ie: you said you were interested in a "rolling window of indexes" around n days of data, but you never clarified why you think a rolling window of indexes would be useful to you or how exactly you would use it.

The primary advantage of sharding by date is if you know that a large percentage of your queries are only going to be within a small range of time, and therefore you can optimize those requests to only hit the shards necessary to satisfy that small window of time. If the majority of requests are going to be across your entire n days of data, then date-based sharding doesn't really help you -- you can just use arbitrary (randomized) sharding, using periodic deleteByQuery commands to purge anything older than N days. Query the whole collection by default, and add a filter query if/when you want to restrict your search to only a narrow date range of documents.

This is the same general approach you would use on a non-distributed / non-SolrCloud setup, if you just had a single collection on a single master replicated to some number of slaves for horizontal scaling.

-Hoss
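A hedged sketch of the periodic roll-off Erick describes, using the stock XML update syntax (endpoint, field name, and the 30-day window are placeholders):

    curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type: text/xml' \
      --data-binary '<delete><query>log_date:[* TO NOW/DAY-30DAYS]</query></delete>'

Run from cron (or similar), this keeps the collection to a rolling window without any date-based sharding.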
Re: split document or not
But in this case, phrase frequency over the whole document will not be taken into account, because the document is split into subdocuments. Or is that not true?
Re: Note on The Book
IMHO I prefer narrative; as Erick says, explaining all use-cases is impossible, but covering the base cases is a good start. Either way, I miss a book about Solr different from a cookbook or a guide.

Regards.

--
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Wednesday, May 29, 2013 at 12:19 PM, Erick Erickson wrote:

FWIW, picking up on Alexandre's point. One of my continual frustrations with virtually _all_ technical books is they become endless pages of details without ever mentioning why the hell I should care. Unfortunately, explaining use-cases for everything would only make the book about 10,000 pages long. Siiigh. I guess you can take this as a vote for narrative...

Erick

On Tue, May 28, 2013 at 4:53 PM, Jack Krupansky j...@basetechnology.com wrote:

We'll have a blog for the book. We hope to have a first raw/rough/partial/draft published as an e-book in maybe 10 days to 2 weeks. As soon as we get that process under control, we'll start the blog. I'll keep your email on file and keep you posted.

-- Jack Krupansky

-----Original Message-----
From: Swati Swoboda
Sent: Tuesday, May 28, 2013 1:36 PM
To: solr-user@lucene.apache.org
Subject: RE: Note on The Book

I'd definitely prefer the spiral bound as well. E-books are great, and your draft version seems very reasonably priced (aka I would definitely get it). Really looking forward to this. Is there a separate mailing list, etc., for the book, for those who would like to receive updates on the status of the book?

Thanks

Swati Swoboda
Software Developer - Igloo Software
+1.519.489.4120 sswob...@igloosoftware.com
Bring back Cake Fridays – watch a video you’ll actually like http://vimeo.com/64886237

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday, May 23, 2013 7:15 PM
To: solr-user@lucene.apache.org
Subject: Note on The Book

To those of you who may have heard about the Lucene/Solr book that I and two others are writing on Lucene and Solr, some bad and good news.

The bad news: The book contract with O’Reilly has been canceled.

The good news: I’m going to proceed with self-publishing (possibly on Lulu or even Amazon) a somewhat reduced scope Solr-only Reference Guide (with hints of Lucene). The scope of the previous effort was too great, even for O’Reilly – a book larger than 800 pages (or even 600) that was heavy on reference and lighter on “guide” just wasn’t fitting in with their traditional “guide” model. In truth, Solr is just too complex for a simple guide that covers it all, let alone Lucene as well.

I’ll announce more details in the coming weeks, but I expect to publish an e-book-only version of the book, focused on Solr reference (and plenty of guide as well), possibly on Lulu, plus eventually publish 4-8 individual print volumes for people who really want the paper.

One model I may pursue is to offer the current, incomplete, raw, rough draft as a $7.99 e-book, with the promise of updates every two weeks or a month as new and revised content and new releases of Solr become available. Maybe the individual e-book volumes would be $2 or $3. These are just preliminary ideas. Feel free to let me know what seems reasonable or excessive.

For paper: Do people really want perfect bound, or would you prefer spiral bound that lies flat and folds back easily? I suppose we could offer both – which should be considered “premium”?

I’ll announce more details next week.
The immediate goal will be to get the “raw rough draft” available to everyone ASAP.

For those of you who have been early reviewers – your effort will not have been in vain. I have all your comments and will address them over the next month or two or three.

Just for some clarity, the existing Solr Wiki and even the recent contribution of the LucidWorks Solr Reference to Apache really are still great contributions to general knowledge about Solr, but the book is intended to go much deeper into detail, especially with loads of examples and a lot more narrative guide. For example, the book has a complete list of the analyzer filters, each with a clean one-liner description. Ditto for every parameter (although I would note that the LucidWorks Solr Reference does a decent job of that as well.) Maybe, eventually, everything in the book COULD (and will) be integrated into the standard Solr doc, but until then, a single, integrated reference really is sorely needed. And, the book has a lot of narrative guide and walking through examples as well. Over time, I’m sure both will evolve.

And just to be clear, the book is not a simple repurposing of the Solr wiki content – EVERY
How can a Tokenizer be CoreAware?
I am currently testing some things with Solr 4.0.0. I tried to make a tokenizer CoreAware, and was rewarded with:

Caused by: org.apache.solr.common.SolrException: Invalid 'Aware' object: com.basistech.rlp.solr.RLPTokenizerFactory@19336006 -- org.apache.solr.util.plugin.SolrCoreAware must be an instance of:
  [org.apache.solr.request.SolrRequestHandler]
  [org.apache.solr.response.QueryResponseWriter]
  [org.apache.solr.handler.component.SearchComponent]
  [org.apache.solr.update.processor.UpdateRequestProcessorFactory]
  [org.apache.solr.handler.component.ShardHandlerFactory]

I need this to allow cleanup of some cached items in the tokenizer. Questions:

1: Will a newer version allow me to do this directly?
2: Is there some other approach that anyone would recommend? I could, for example, make a fake object in the list above to act as a singleton with a static accessor, but that seems pretty ugly.
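A minimal sketch of the "fake object" workaround described above: a no-op SearchComponent (one of the types allowed to be SolrCoreAware) acting as a singleton bridge that runs cleanup callbacks, registered through a static accessor, when the core closes. All names are hypothetical, and this is one possible shape under Solr 4.x APIs rather than an endorsed pattern:

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import org.apache.solr.core.CloseHook;
    import org.apache.solr.core.SolrCore;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;
    import org.apache.solr.util.plugin.SolrCoreAware;

    public class CoreLifecycleBridge extends SearchComponent implements SolrCoreAware {
      // Static accessor: a tokenizer factory can register its cleanup work here.
      private static final List<Runnable> CLEANUP_HOOKS = new CopyOnWriteArrayList<Runnable>();

      public static void registerCleanup(Runnable r) { CLEANUP_HOOKS.add(r); }

      @Override
      public void inform(SolrCore core) {
        core.addCloseHook(new CloseHook() {
          @Override public void preClose(SolrCore c) {
            for (Runnable r : CLEANUP_HOOKS) r.run();  // e.g. clear cached items
          }
          @Override public void postClose(SolrCore c) {}
        });
      }

      @Override public void prepare(ResponseBuilder rb) {}
      @Override public void process(ResponseBuilder rb) {}
      @Override public String getDescription() { return "Core lifecycle bridge"; }
      @Override public String getSource() { return ""; }
    }

The component still has to be declared as a searchComponent in solrconfig.xml for Solr to instantiate it and call inform().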
Re: Reindexing strategy
I presume you are running Solr on a multi-core/CPU server. If you kept a single process hitting Solr to re-index, you'd be using just one of those cores. It would take as long as it takes; I can't see how you would 'overload' it that way.

I guess you could have a strategy that pulls 100 documents with an old last_indexed, and pushes them for re-indexing. If you get the full 100 docs, you make a subsequent request immediately. If you get less than 100 back, you know you're up-to-date and can wait, say, 30s before making another request.

Upayavira

On Wed, May 29, 2013, at 12:00 PM, Dotan Cohen wrote:
<snip/>
Re: Reindexing strategy
On Wed, May 29, 2013 at 2:41 PM, Upayavira u...@odoko.co.uk wrote:

> I presume you are running Solr on a multi-core/CPU server. If you kept a single process hitting Solr to re-index, you'd be using just one of those cores. It would take as long as it takes; I can't see how you would 'overload' it that way.

I mean 'overload' Solr in the sense that it cannot read, process, and write data fast enough because too much data is being handled. I remind you that this system is writing hundreds of documents per minute. Certainly there is a limit to what Solr can handle. I ask how to know how close I am to this limit.

> I guess you could have a strategy that pulls 100 documents with an old last_indexed, and pushes them for re-indexing. If you get the full 100 docs, you make a subsequent request immediately. If you get less than 100 back, you know you're up-to-date and can wait, say, 30s before making another request.

Actually, I would add a filter query for documents whose last_indexed value is before the last schema change, and stop when fewer documents were returned than were requested. Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
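A rough SolrJ sketch of that strategy (hedged: the field names and the schema-change cutoff are hypothetical, and it assumes all fields are stored, as stated earlier in the thread, so documents can be read back and re-added):

    import java.util.Date;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;
    import org.apache.solr.common.SolrInputDocument;

    public class Reindexer {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        final int n = 100; // batch size: tune while watching QTime and IOwait
        while (true) {
          SolrQuery q = new SolrQuery("*:*");
          // Only documents not yet reindexed since the schema change (hypothetical cutoff).
          q.addFilterQuery("last_indexed:[* TO 2013-05-29T00:00:00Z]");
          q.set("sort", "last_indexed asc");
          q.setRows(n);
          SolrDocumentList docs = server.query(q).getResults();
          for (SolrDocument d : docs) {
            SolrInputDocument in = new SolrInputDocument();
            for (String f : d.getFieldNames()) {
              if ("_version_".equals(f)) continue; // avoid optimistic-concurrency clashes
              in.addField(f, d.getFieldValue(f));
            }
            in.setField("last_indexed", new Date()); // moves it out of the filter
            server.add(in);
          }
          server.commit();
          if (docs.size() < n) break; // caught up: every document has been reindexed
        }
      }
    }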
Re: Note on The Book
Perhaps you will enjoy mine, then: http://www.packtpub.com/apache-solr-for-indexing-data/book . I will send a formal announcement to the list a little later, but basically this is a book for advanced beginners and early intermediates, and it takes them from a basic index to multilingual indexing with bells and whistles. It covers a small part of Solr (Solr is big!), but shows how different parts work together. It's structured as a cookbook, but the narrative is a journey.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Wed, May 29, 2013 at 7:33 AM, Yago Riveiro yago.rive...@gmail.com wrote:
<snip/>
Problem with xpath expression in data-config.xml
Replacing the contents of solr-4.3.0\example\example-DIH\solr\rss\conf\rss-data-config.xml with

  <dataConfig>
    <dataSource type="URLDataSource" />
    <document>
      <entity name="beautybooks88"
              pk="link"
              url="http://beautybooks88.blogspot.com/feeds/posts/default"
              processor="XPathEntityProcessor"
              forEach="/feed/entry"
              transformer="DateFormatTransformer">
        <field column="source" xpath="/feed/title" commonField="true" />
        <field column="source-link" xpath="/feed/link[@rel='self']/@href" commonField="true" />
        <field column="title" xpath="/feed/entry/title" />
        <field column="link" xpath="/feed/entry/link[@rel='self']/@href" />
        <field column="description" xpath="/feed/entry/content" stripHTML="true" />
        <field column="creator" xpath="/feed/entry/author" />
        <field column="item-subject" xpath="/feed/entry/category/@term" />
        <field column="date" xpath="/feed/entry/updated" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
      </entity>
    </document>
  </dataConfig>

and running the full dataimport from http://localhost:8983/solr/#/rss/dataimport//dataimport results in an error.

1) How could I have found the reason faster than I did? By looking into which log files?
2) If you remove the first occurrence of /@href above, the import succeeds. (Note that the same pattern works for the column "link".) What's the reason for that?

Best regards and thanks in advance,
Hans-Peter
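On question 1, a hedged pointer: DIH stack traces normally end up in Solr's main log (with the stock Solr 4.3 example that is the console output, and the log4j setup also writes under example/logs), and the Logging tab of the admin UI shows recent errors. The import's own status, including error and document counts, can also be fetched directly (core name taken from the example above):

    curl 'http://localhost:8983/solr/rss/dataimport?command=status'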
Advice : High-traffic web site
Hi Team,

Please, I need your advice. I have a high-traffic web site (100 million page views/month) across 22 countries, and I want to build a fast and powerful search engine. So I use Solr 4.3 and separate every country into its own collection, but I want to build the right structure to accommodate high traffic. So, what do you advise me to use: SolrCloud, master-slave, or multi-cores?

Thanks in advance.

Ramzi,
Re: split document or not
Do I need to first search for the whole document's id, and then search among its paragraphs stored in separate docs?
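If the paragraphs are indexed as separate documents carrying their parent's id, one hedged option is to let field collapsing do both steps in a single query (field names are hypothetical):

    curl 'http://localhost:8983/solr/select?q=some+phrase&group=true&group.field=parent_doc_id&group.limit=3'

Each parent id then comes back once with its top matching paragraphs. Scoring still happens per paragraph document, though, so this does not recover whole-document phrase frequency, which is exactly the trade-off raised earlier in the thread.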
Re: Note on The Book
Erick, your point is well taken. Although my primary interest/skill is to produce a solid foundation reference (including tons of examples), the real goal is to then build on top of that foundation. While I focus on the hard-core material - which really does include some narrative and lots of examples, in addition to tons of mere reference - my co-author, Ryan Tabora, will focus almost exclusively on... narrative and diagrams.

And when I say reference, I also mean lots of examples. Even as the hard-core reference stabilizes, the examples will continue to grow (like weeds!).

Once we get the current, existing, under-review chapters packaged into the new book and available for purchase and download (maybe Lulu, not decided) in a couple of weeks, it will be updated approximately every other week, both with additional reference material and with additional narrative and diagrams.

One of our priorities (after we get through "Stage 0" of the next few weeks) is to in fact start giving each of the long Deep Dive chapters enough narrative lead to basically say exactly that - why you should care. A longer-term priority is to improve the balance of narrative and hard-core reference.

Yeah, that will be a lot of pages. It already is. We were at 907 pages, and I was about to drop in another 166 pages on update handlers when O'Reilly threw up their hands and pulled the plug. I was estimating 1200 pages at that stage. And I'll probably have another 60-80 pages on update request processors within a week or so. With more to come. That did include a lot of hard-core material and example code for Lucene, which won't be in the new Solr-only book. By focusing on an e-book, the raw page count alone becomes moot.

We haven't given up on print - the intent is eventually to have multiple volumes (4-8 or so, maybe more), both as cheaper e-books ($3 to $5 each) and slimmer print volumes for people who don't need everything in print. In fact, we will likely offer the revamped initial chapters of the book as a standalone introduction to Solr - narrative introduction (why should you care about Solr), basic concepts of Lucene and Solr (and why you should care!), brief tutorial walkthrough of the major feature areas of Solr, and a case study. The intent would be both an e-book and a slim print volume (75 pages?).

Another priority (beyond "Stage 0") is to develop a detailed roadmap diagram of Solr and how applications can use Solr, and then use that to show how each of the Deep Dive sections fits in (heavy reference, but gradually adding more narrative over time).

We will probably be very open to requests - what people really wish a book would actually do for them. The only request we won't be open to is to do it all in only 300 pages.

-- Jack Krupansky

-----Original Message-----
From: Erick Erickson
Sent: Wednesday, May 29, 2013 7:19 AM
To: solr-user@lucene.apache.org
Subject: Re: Note on The Book

<snip/>
Re: What exactly happens to extant documents when the schema changes?
On 5/29/2013 1:07 AM, Dotan Cohen wrote:

> In the case of this particular application, reindexing really is overly burdensome as the application is performing hundreds of writes to the index per minute. How might I gauge how much spare I/O Solr could commit to a reindex? All the data that I need is in fact in stored fields. Note that because the social media application that feeds our Solr index is global, there are no 'off hours'.

I handle this in a very specific way with my sharded index. This won't work for all designs, and the precise procedure won't work for SolrCloud.

There is a 'live' and a 'build' core for each of my shards. When I want to reindex, the program makes a note of my current position for deletes, reinserts, and new documents. Then I use a DIH full-import from mysql into the build cores. Once the import is done, I run the update cycle of deletes, reinserts, and new documents on those build cores, using the position information noted earlier. Then I swap the cores so the new index is online. To adapt this for SolrCloud, I would need to use two collections, and update a collection alias for what is considered live.

To control the I/O and CPU usage, you might need some kind of throttling in your update/rebuild application. I don't need any throttling in my design. Because I'm using DIH, the import only uses a single thread for each shard on the server. I've got RAID10 for storage and half of the CPU cores are still available for queries, so it doesn't overwhelm the server.

The rebuild does lower performance, so I have the other copy of the index handle queries while the rebuild is underway. When the rebuild is done on one copy, I run it again on the other copy. Right now I'm half-upgraded -- one copy of my index is version 3.5.0, the other is 4.2.1. Switching to SolrCloud with sharding and replication would eliminate this flexibility, unless I maintained two separate clouds.

Thanks,
Shawn
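For reference, the swap step in a live/build layout like this maps onto the CoreAdmin SWAP action; a hedged example with hypothetical core names:

    curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=shard1_live&other=shard1_build'

After the swap, queries hitting the "live" name are served by the freshly built index, and the old index stays available under the "build" name for the next cycle.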
[Announce] Apache Solr 4.1 with RankingAlgorithm 1.4.7 available now -- includes realtime-search with multiple granularities
I am very excited to announce the availability of Solr 4.3 with RankingAlgorithm40 1.4.8, with realtime-search with multiple granularities. realtime-search is very fast NRT and allows you to not only look up a document by id but also to search in realtime; see http://tgels.org/realtime-nrt.jsp. The update performance is about 70,000 docs/sec. The query performance is in ms, allowing you to query a 10m-document wikipedia index (complete index) in 50 ms.

This release includes realtime-search with multiple granularities: request/intra-request. The granularity attribute controls the NRT behavior. With attribute granularity="request", all search components like search, faceting, highlighting, etc. will see a consistent view of the index and will all report the same number of documents. With granularity="intrarequest", the components may each report the most recent changes to the index. realtime-search has been contributed back to Apache Solr, see https://issues.apache.org/jira/browse/SOLR-3816.

RankingAlgorithm 1.4.8 supports the entire Lucene Query Syntax, ± and/or boolean/dismax/glob/regular expression/wildcard/fuzzy/prefix/suffix queries with boosting, etc., and is compatible with the Lucene 4.3 API.

You can get more information about realtime-search performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x

You can download Solr 4.3 with RankingAlgorithm40 1.4.8 from here: http://solr-ra.tgels.org

Please download and give the new version a try.

Regards,
Nagendra Nagarajayya
http://solr-ra.tgels.org
http://elasticsearch-ra.tgels.org
http://rankingalgorithm.tgels.org

Note: 1. Apache Solr 4.3 with RankingAlgorithm40 1.4.8 is an external project.
Re: Advice : High-traffic web site
I don't see how multi-cores will help you. Both SolrCloud and Master-Slave can work for you. Of course, SolrCloud helps you in terms of maintaining higher availability due to replica/leader failover. If your queries are always going to be limited to one country, then creating a collection per country is fine.

On Wed, May 29, 2013 at 6:12 PM, Ramzi Alqrainy ramzi.alqra...@gmail.com wrote:
<snip/>

--
Regards,
Shalin Shekhar Mangar.
Re: Reindexing strategy
On 5/29/2013 6:01 AM, Dotan Cohen wrote:

> I mean 'overload' Solr in the sense that it cannot read, process, and write data fast enough because too much data is being handled. I remind you that this system is writing hundreds of documents per minute. Certainly there is a limit to what Solr can handle. I ask how to know how close I am to this limit.

It's impossible for us to give you hard numbers. You'll have to experiment to know how fast you can reindex without killing your servers. A basic tenet for such experimentation, and something you hopefully already know: you'll want to get baseline measurements before you begin testing, for comparison.

One of the most reliable Solr-specific indicators of pushing your hardware too hard is that the QTime on your queries will start to increase dramatically. Solr 4.1 and later has more granular query time statistics in the UI - the median and 95% numbers are much more important than the average.

Outside of that, if your overall IOwait CPU percentage starts getting near (or above) 30-50%, your server is struggling. If all of your CPU cores are staying near 100% usage, then it's REALLY struggling.

Assuming you have plenty of CPU cores, using fast storage and having plenty of extra RAM will alleviate much of the I/O bottleneck. The usual rule of thumb for good query performance is that you need enough RAM to put 50-100% of your index in the OS disk cache. For blazing performance during a rebuild, that becomes 100-200%. If you had 150%, that would probably keep most indexes well-cached even during a rebuild. A rebuild will always lower performance, even with lots of RAM.

My earlier reply to your other message has some other ideas that will hopefully help.

Thanks,
Shawn
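On the I/O measurement question, a hedged example of watching it on Linux with the standard sysstat tools (nothing Solr-specific):

    iostat -x 5

The avg-cpu %iowait column is the IOwait percentage discussed above, and the per-device %util column shows how saturated the volume holding the index is.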
Re: Replica shards not updating their index when update is sent to them
I found how to solve the problem.

After sending a file to be indexed to a replica shard (node2):

curl 'http://node2:8983/solr/update?commit=true' -H 'Content-type: text/xml' --data-binary '<add><doc><field name="id">asdf</field><field name="content">big moth</field></doc></add>'

I can send a commit param to the same shard, and then it gets updated:

curl 'http://node2:8983/solr/update?commit=true'

Another option is to send, from the beginning, a commitWithin param with some milliseconds instead of a commit directly. That way, the commit happens at most (the milliseconds specified) after, but the changes get reflected in all shards, including the replica shard that received the update request:

curl 'http://node2:8983/solr/update?commitWithin=...'

As these emails get archived, I hope this may help someone in the future.

Sebastián Ramírez

On Mon, May 20, 2013 at 4:32 PM, Sebastián Ramírez sebastian.rami...@senseta.com wrote:

Yes, it's happening with the latest version, 4.2.1. Yes, it's easy to reproduce. It happened using 3 virtual machines and also happened using 3 physical nodes. Here are the details:

I installed Hortonworks (a Hadoop distribution) on the 3 nodes. That installs Zookeeper. I used the example directory and copied it to the 3 nodes. I start Zookeeper on the 3 nodes. The first time, I run this command on each node, to start Solr:

java -jar -Dbootstrap_conf=true -DzkHost='node1,node2,node3' start.jar

As I understand, the -Dbootstrap_conf=true uploads the configuration to Zookeeper, so I don't need to do that the following times that I start each SolrCore. So, the following times, I run this on each node:

java -jar -DzkHost='node0,node1,node2' start.jar

Because I ran that command on node0 first, that node became the leader shard.

I send an update to the leader shard (in this case node0). I run:

curl 'http://node0:8983/solr/update?commit=true' -H 'Content-type: text/xml' --data-binary '<add><doc><field name="id">asdf</field><field name="content">buggy</field></doc></add>'

When I query any shard I get the correct result. I run:

curl 'http://node0:8983/solr/select?q=id:asdf'
curl 'http://node1:8983/solr/select?q=id:asdf'
curl 'http://node2:8983/solr/select?q=id:asdf'

(i.e. I send the query to each node), and then I get the expected response:

... <doc><str name="id">asdf</str><arr name="content"><str>buggy</str></arr>...</doc> ...

But when I send an update to a replica shard (node2), it is updated only in the leader shard (node0) and in the other replica (node1), not in the shard that received the update (node2). I send an update to the replica node2. I run:

curl 'http://node2:8983/solr/update?commit=true' -H 'Content-type: text/xml' --data-binary '<add><doc><field name="id">asdf</field><field name="content">big moth</field></doc></add>'

Then I query each node, and I receive the updated results only from the leader shard (node0) and the other replica shard (node1).

I run (leader, node0): curl 'http://node0:8983/solr/select?q=id:asdf' and I get:

... <doc><str name="id">asdf</str><arr name="content"><str>big moth</str></arr>...</doc> ...

I run (other replica, node1): curl 'http://node1:8983/solr/select?q=id:asdf' and I get:

... <doc><str name="id">asdf</str><arr name="content"><str>big moth</str></arr>...</doc> ...

I run (first replica, the one that received the update, node2): curl 'http://node2:8983/solr/select?q=id:asdf' and I get (the old result):

... <doc><str name="id">asdf</str><arr name="content"><str>buggy</str></arr>...</doc> ...
Thanks for your interest,

Sebastián Ramírez

On Mon, May 20, 2013 at 3:30 PM, Yonik Seeley yo...@lucidworks.com wrote:

> On Mon, May 20, 2013 at 4:21 PM, Sebastián Ramírez sebastian.rami...@senseta.com wrote:
>
>> When I send an update to a non-leader (replica) shard (B), the updated results are reflected in the leader shard (A) and in the other replica shard (C), but not in the shard that received the update (B).
>
> I've never seen that before. The replica that received the update isn't treated as special in any way by the code, so it's not clear how this could happen. What version of Solr is this (and does it happen with the latest version)? How easy is this to reproduce for you?
>
> -Yonik
> http://lucidworks.com
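For completeness, the same commitWithin approach from SolrJ, as a hedged sketch (SolrJ 4.x; uses the add(doc, commitWithinMs) overload):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitWithinExample {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://node2:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "asdf");
        doc.addField("content", "big moth");
        server.add(doc, 10000); // ask Solr to commit within 10 seconds
      }
    }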
RE: Why do FQs make my spelling suggestions so slow?
Andy,

I opened this ticket so that someone can eventually investigate: https://issues.apache.org/jira/browse/SOLR-4874

Just a sanity check: I see I had misspelled "maxCollations" as "maxCollation" in my prior response. When you tested with this set the same as maxCollationTries, did you correct my spelling? The thought is that by requiring it to return this many collations back, you are guaranteed to make it try the maximum number of tries every time, giving yourself a cleaner test. I am trying to isolate here whether spellcheck is not running the queries properly, or whether the queries just naturally take that long to run over and over again.

James Dyer
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Andy Lester [mailto:a...@petdance.com]
Sent: Tuesday, May 28, 2013 4:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Why do FQs make my spelling suggestions so slow?

Thanks for looking at this.

> What are the QTimes for the 0fq, 1fq, 2fq, 4fq cases with spellcheck entirely turned off? Is it about (or a little more than) half the total when maxCollationTries=1?

With spellcheck off I get 8ms for the 4fq query.

> Also, with the varying # of fq's, how many collation tries does it take to get 10 collations?

I don't know. How can I tell?

> Possibly, a better way to test this is to set maxCollations = maxCollationTries. The reason is that it quits trying once it finds maxCollations, so if with 0fq's lots of combinations can generate hits, it doesn't need to try very many to get to 10. But with more fq's, fewer collations will pan out, so now it is trying more, up to 100, before (if ever) it gets to 10.

It does just fine doing 100 collations so long as there are no FQs. It seems to me that the FQs are taking an inordinate amount of extra time: 100 collations in (roughly) the same amount of time as a single collation, so long as there are no FQs. Why are the FQs such a drag on the collation process?

> (I'm assuming you have all non-search components like faceting turned off).

Yes, definitely.

> So say with 2fq's it takes 10ms for the query to complete with spellcheck off, and 20ms with maxCollation = maxCollationTries = 1, then it will take about 110ms with maxCollation = maxCollationTries = 10.

I can do maxCollation = maxCollationTries = 100 and it comes back in 14ms, so long as I have FQs off. Add a single FQ and it becomes 13499ms. I can do maxCollation = maxCollationTries = 1000 and it comes back in 45ms, so long as I have FQs off. Add a single FQ and it becomes 62038ms.

> But I think you're just setting maxCollationTries too high. You're asking it to do too much work in trying tens of combinations.

The results I get back with 100 tries are about twice as many as I get with 10 tries. That's a big difference to the user when it's trying to figure out misspelled phrases.

Andy

--
Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance
Escaping character at Query
I use Solr 4.2.1 and I analyze that keyword: kelile&dimle at the admin page: WT kelile&dimle SF kelile&dimle TLCF kelile&dimle However when I escape that character and search it: solr/select?q=kelile\&dimle here is what I see:

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">148</int>
      <lst name="params">
        <str name="dimle"/>
        <str name="q">kelile\</str>
      </lst>
    </lst>

I have edismax as default query parser. How can I escape that character, and why don't I get: <str name="q">kelile\&dimle</str> Any ideas?
RE: Choosing specific fields for suggestions in SpellCheckerComponent
I assume here you've got a spellcheck field like this:

  <field name="Spelling_Dictionary" type="text_general"/>
  <copyField source="field1" dest="Spelling_Dictionary" />
  <copyField source="field2" dest="Spelling_Dictionary" />
  <copyField source="field3" dest="Spelling_Dictionary" />
  <copyField source="field4" dest="Spelling_Dictionary" />

...so that a check against Spelling_Dictionary always checks all 4, right? This is the only way I know to approximate having it spellcheck across multiple fields. And as you have found, short of creating several separate versions of Spelling_Dictionary, there is no way to specify the individual fields a la carte. Although not supported, some of the work was done as part of SOLR-2993. Your best bet now is to use Spelling_Dictionary as a master dictionary, then use maxCollationTries to have it generate collations that only pertain to what the user actually searched against. This is less efficient and may not work well (or at all) with Suggest. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Wilson Passos [mailto:wrpas...@gmail.com] Sent: Tuesday, May 28, 2013 11:54 PM To: Solr User List Subject: Choosing specific fields for suggestions in SpellCheckerComponent Hi everyone, I've been searching about how to configure the SpellCheckerComponent in Solr 4.0 to support suggestion queries based on a subset of the configured fields in schema.xml. Let's say the spell checking is configured to use these 4 fields:

  <field name="field1" type="text_general"/>
  <field name="field2" type="text_general"/>
  <field name="field3" type="text_general"/>
  <field name="field4" type="text_general"/>

I'd like to know if there's any possibility to dynamically set the SpellCheckerComponent to suggest terms using just fields field2 and field3 instead of the default behavior, which always includes suggestions across the 4 defined fields. Thanks in advance for any help!
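The "several separate versions" workaround James mentions can be spelled out as multiple dictionaries registered on the SpellCheckComponent, each backed by its own copyField target -- a sketch only, with made-up names (Spelling_Dictionary_2_3 would be a second copyField target covering just field2 and field3):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">all_fields</str>
      <str name="field">Spelling_Dictionary</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
    </lst>
    <lst name="spellchecker">
      <str name="name">fields_2_3</str>
      <str name="field">Spelling_Dictionary_2_3</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
    </lst>
  </searchComponent>

The dictionary is then chosen per request with spellcheck.dictionary=fields_2_3. The cost is extra index size for each additional copyField target, which is why this only approximates a-la-carte field selection.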
Re: Why do FQs make my spelling suggestions so slow?
On May 29, 2013, at 9:46 AM, Dyer, James james.d...@ingramcontent.com wrote: Just an insanity check, I see I had misspelled maxCollations as maxCollation in my prior response. When you tested with this set the same as maxCollationTries, did you correct my spelling? Yes, definitely. Thanks for the ticket. I am looking at the effects of turning spellcheck.onlyMorePopular to true, which reduces the number of collations it seems to do, but doesn't affect the underlying question of "is the spellchecker doing FQs properly?" Thanks, Andy -- Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
Re: Escaping character at Query
Hi, try with double quotation marks (""). Carlos. 2013/5/29 Furkan KAMACI furkankam...@gmail.com ...
using HTTP caching with shards in Solr 4.3
Hello, I'd like to take advantage of Solr's HTTP caching feature (httpCaching never304="false" in solrconfig.xml). It is behaving as expected when I do a standard query against a Solr instance and then repeat it: I receive an HTTP 304 (not modified) response. However, when using the shards functionality, I seem to be unable to get the HTTP 304 behavior. When sending a request to a Solr instance that includes other Solr instances in the shards parameter, a GET request is sent to the original Solr instance, but it turns around and sends POST requests to the Solr instances referenced in shards. Since POST requests cannot generate a 304, I seem to be unable to use HTTP caching with shards. Is there a way to make the original Solr instance query the shards with a GET method? Or some other way I can leverage HTTP caching when using shards? Thanks, Ty
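For reference, the setup Ty describes lives in the requestDispatcher section of solrconfig.xml -- a sketch with an assumed max-age value:

  <requestDispatcher>
    <httpCaching never304="false" lastModifiedFrom="openTime" etagSeed="Solr">
      <cacheControl>max-age=43200</cacheControl>
    </httpCaching>
  </requestDispatcher>

This controls the Last-Modified/ETag headers on responses the node serves itself; the inter-shard fan-out requests are issued by a separate code path over POST, which is why the downstream shards never see a conditional GET.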
[Announce] Apache Solr 4.3 with RankingAlgorithm 1.4.8 available now -- includes realtime-search with multiple granularities (correction)
I am very excited to announce the availability of Solr 4.3 with RankingAlgorithm40 1.4.8 with realtime-search with multiple granularities. realtime-search is very fast NRT and allows you to not only lookup a document by id but also allows you to search in realtime, see http://tgels.org/realtime-nrt.jsp. The update performance is about 70,000 docs / sec. The query performance is in ms, allows you to query a 10m wikipedia index (complete index) in 50 ms. This release includes realtime-search with multiple granularities, request/intra-request. The granularity attribute controls the NRT behavior. With attribute granularity=request, all search components like search, faceting, highlighting, etc. will see a consistent view of the index and will all report the same number of documents. With granularity=intrarequest, the components may each report the most recent changes to the index. realtime-search has been contributed back to Apache Solr, see https://issues.apache.org/jira/browse/SOLR-3816. RankingAlgorithm 1.4.8 supports the entire Lucene Query Syntax, ± and/or boolean/dismax/glob/regular expression/wildcard/fuzzy/prefix/suffix queries with boosting, etc. and is compatible with the lucene 4.3 api. You can get more information about realtime-search performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x You can download Solr 4.3 with RankingAlgorithm40 1.4.8 from here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://elasticsearch-ra.tgels.org http://rankingalgorithm.tgels.org Note: 1. Apache Solr 4.3 with RankingAlgorithm40 1.4.8 is an external project.
Re: Escaping character at Query
When I write: solr/select?q=kelile\&dimle it still says:

  <lst name="params">
    <str name="dimle"/>
    <str name="q">kelile\</str>
  </lst>

2013/5/29 Carlos Bonilla carlosbonill...@gmail.com ...
Re: Escaping character at Query
You need to URL-encode the & as %26: ...solr/select?q=kelile%26dimle Normally, & introduces a new URL query parameter in the URL. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, May 29, 2013 10:55 AM To: solr-user@lucene.apache.org Subject: Escaping character at Query ...
Re: Escaping character at Query
Hi, I meant: solr/select?q="kelile&dimle" Cheers. 2013/5/29 Jack Krupansky j...@basetechnology.com ...
Re: Problem with xpath expression in data-config.xml
On Wed, May 29, 2013 at 6:05 PM, Hans-Peter Stricker stric...@epublius.de wrote: Replacing the contents of solr-4.3.0\example\example-DIH\solr\rss\conf\rss-data-config.xml by

  <dataConfig>
    <dataSource type="URLDataSource" />
    <document>
      <entity name="beautybooks88" pk="link"
              url="http://beautybooks88.blogspot.com/feeds/posts/default"
              processor="XPathEntityProcessor" forEach="/feed/entry"
              transformer="DateFormatTransformer">
        <field column="source" xpath="/feed/title" commonField="true" />
        <field column="source-link" xpath="/feed/link[@rel='self']/@href" commonField="true" />
        <field column="title" xpath="/feed/entry/title" />
        <field column="link" xpath="/feed/entry/link[@rel='self']/@href" />
        <field column="description" xpath="/feed/entry/content" stripHTML="true"/>
        <field column="creator" xpath="/feed/entry/author" />
        <field column="item-subject" xpath="/feed/entry/category/@term"/>
        <field column="date" xpath="/feed/entry/updated" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
      </entity>
    </document>
  </dataConfig>

and running the full dataimport from http://localhost:8983/solr/#/rss/dataimport results in an error. 1) How could I have found the reason faster than I did - by looking into which log files? DIH uses the same log file as Solr. The name/location of the log file depends on your logging configuration. 2) If you remove the first occurrence of /@href above, the import succeeds. (Note that the same pattern works for column link.) What's the reason why?!! I think there is a bug here. In my tests, xpath=/root/a/@y works, xpath=/root/a[@x='1']/@y also works. But if you use them together, the one which is defined last returns null. I'll open an issue. -- Regards, Shalin Shekhar Mangar.
Re: Escaping character at Query
So, make it: solr/select?q=kelile%26dimle -- Jack Krupansky -Original Message- From: Carlos Bonilla Sent: Wednesday, May 29, 2013 11:39 AM To: solr-user@lucene.apache.org Subject: Re: Escaping character at Query ...
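As a general pattern -- an illustrative sketch, not from the thread -- any client that builds Solr URLs by string concatenation should percent-encode each parameter value, for example with java.net.URLEncoder:

  import java.net.URLEncoder;

  public class EncodeDemo {
      public static void main(String[] args) throws Exception {
          // & would otherwise start a new URL parameter, so it becomes %26
          String q = URLEncoder.encode("kelile&dimle", "UTF-8");
          System.out.println("solr/select?q=" + q); // solr/select?q=kelile%26dimle
      }
  }

Solr then sees the single parameter value kelile&dimle; the problem here is URL parsing, not query syntax, so no backslash escaping is needed.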
Re: Why do FQs make my spelling suggestions so slow?
I also have problems getting the SolrSpellChecker to utilise existing FQ params correctly. We have some fairly monster queries, e.g.: http://pastebin.com/4XzGpfeC I cannot seem to get our FQ parameters to be honored when generating results. In essence I am getting collations that yield no results when the filter query is applied. We have items that are by default not shown when out of stock or forthcoming; the user can select whether to show these or not. Is there something wrong with my query, or perhaps my use case is not supported? I'm using nested queries and local params etc. Would very much appreciate some assistance on this one, as 2 days' worth of hacking and pestering people on IRC have not yet yielded a solution for me. I'm not even sure what I am trying is even possible! Some sort of clarification on this would really help! Cheers Nick... On 29 May 2013 15:57, Andy Lester a...@petdance.com wrote: ... -- Nick Fellows DJdownload.com --- 10 Greenland Street London NW10ND United Kingdom --- n...@djdownload.com (E) --- www.djdownload.com
Re: Not able to search Spanish word with accent in solr
Solr returns error 500 when I post data with accented chars... Any solution for that?
Re: Re: error while indexing huge filesystem with data import handler and FileListEntityProcessor
The configuration works with LineEntityProcessor, with few documents (haven't tested with many documents yet). For information, this is the config:

  <dataConfig>
    <dataSource name="myfilelist" baseUrl="file:///D:/jed/noticesBib/" type="URLDataSource" encoding="UTF-8" />
    <document>
      <!-- config with a file containing the list of xml files to open -->
      <entity name="noticebib" dataSource="myfilelist"
              processor="LineEntityProcessor" acceptLineRegex="^.*\.xml$"
              url="listeNotices.txt" rootEntity="false"
              transformer="LogTransformer"
              logTemplate="In entity noticebib" logLevel="debug">
        <entity name="processorDocument" processor="XPathEntityProcessor"
                url="file:///D:/${noticebib.rawLine}" xsl="xslt/mnb/IXM_MNb.xsl"
                forEach="/record"
                transformer="fr.bnf.solr.BnfDateTransformer,LogTransformer"
                logTemplate="In entity processorDocument fichier: file:///D:/${noticebib.rawLine}"
                logLevel="debug">
          ... fields definition ...

file:///D:/jed/noticesBib/listeNotices.txt contains the following lines: jed/noticesBib/3/4/307/34307035.xml jed/noticesBib/3/4/307/34307082.xml jed/noticesBib/3/4/307/34307110.xml jed/noticesBib/3/4/307/34307197.xml jed/noticesBib/3/4/307/34307350.xml jed/noticesBib/3/4/307/34307399.xml ... (It could have contained the full paths from the start, but I wanted to test the concatenation of filenames.) That works fine, thanks for the help!! Next step: the same without using a file. (I'll write it in another post.) Regards, Jérôme
Re: Problem with xpath expression in data-config.xml
I created https://issues.apache.org/jira/browse/SOLR-4875 On Wed, May 29, 2013 at 9:15 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: ... -- Regards, Shalin Shekhar Mangar.
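To make the failure mode concrete, a minimal reproduction along the lines described above might look like this (hypothetical entity, file and column names -- not taken from the ticket):

  <entity name="repro" processor="XPathEntityProcessor" url="test.xml" forEach="/root">
    <!-- each xpath works on its own; defined together, the one listed last returns null -->
    <field column="plain" xpath="/root/a/@y" />
    <field column="filtered" xpath="/root/a[@x='1']/@y" />
  </entity>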
Re: Not able to search Spanish word with accent in solr
On 29 May 2013 21:39, jignesh js.vishava...@gmail.com wrote: Solr returns error 500 when I post data with accented chars... Any solution for that? [...] Please look in the Solr logs for the appropriate error message. Regards, Gora
Solr Cloud Using Zookeeper SASL
Hiya all, Got a question that I hope someone can help me with. I was just wondering if anyone has ever used Solr Cloud using Zookeepers that have SASL authentication turned on? I can't seem to find any documentation on it so any help at all would be amazing! Thanks, Don Tran Developer Omnifone Island Studios 47 British Grove London W4 2NL, UK T: +44 (0)20 8600 0580 F: +44 (0)20 8600 0581 S: DonTranOmnifone E: dt...@omnifone.com
RE: Why do FQs make my spelling suggestions so slow?
Instead of maxCollationTries=0, use a value greater than zero. Zero means not to check whether the collation will return hits. 1 means to test 1 possible combination against the index and return it only if it returns hits. 2 tries up to 2 possibilities, etc. As you have spellcheck.maxCollations=8, you'll probably want maxCollationTries at least that large. Maybe 10-20 would be better. Make it as low as possible to get generally good results, or as high as possible before the performance on a query with many misspelled words gets too bad. Also, use a spellcheck.count greater than 2. This is as many corrections per misspelled term as you want it to consider. If using DirectSolrSpellChecker, you can have it set low; 5-10 might be good. If using IndexBased- or FileBased spell checkers, use at least 10. Also, do not use onlyMorePopular unless you indeed want every term in the user's query to be replaced with higher-frequency terms (even correctly-spelled terms get replaced). If you want it to suggest even for words that are in the dictionary, try spellcheck.alternativeTermCount instead. Try setting it to about half of spellcheck.count (but at least 10 if using IndexBased- or FileBased spell checkers). James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Nicholas Fellows [mailto:n...@djdownload.com] Sent: Wednesday, May 29, 2013 11:06 AM To: solr-user@lucene.apache.org Subject: Re: Why do FQs make my spelling suggestions so slow? ...
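Pulling James's advice together, request-handler defaults along these lines would be a reasonable starting point -- a sketch with assumed values, to be tuned against your own index:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.alternativeTermCount">5</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.maxCollations">8</str>
      <str name="spellcheck.maxCollationTries">10</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>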
Re: Not able to search Spanish word with accent in solr
On May 29, 2013, at 18:09, jignesh js.vishava...@gmail.com wrote: Solr returns error 500 when I post data with accented chars... Any solution for that? The solution probably involves using the correct encoding, and ensuring that the HTTP request sets the appropriate header values accordingly. In other words, more likely a pilot error than a SOLR error... at least that was the case for me :-)
Re: Why do FQs make my spelling suggestions so slow?
James, this is very useful information. Can you please add this to the wiki? On Wed, May 29, 2013 at 10:36 PM, Dyer, James james.d...@ingramcontent.com wrote: ... -- Regards, Shalin Shekhar Mangar.
RE: Why do FQs make my spelling suggestions so slow?
It has been in the wiki, more or less. See http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.count and following sections. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] Sent: Wednesday, May 29, 2013 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Why do FQs make my spelling suggestions so slow? James, this is very useful information. Can you please add this to the wiki? ...
Seeming bug in ConcurrentUpdateSolrServer
The comment here is clearly wrong, since there is no division by two. I think that the code is wrong, because this results in not starting runners when it should start runners. Am I misanalyzing?

  if (runners.isEmpty()
      || (queue.remainingCapacity() < queue.size() // queue is half full and we can add more runners
          && runners.size() < threadCount)) {
Re: Indexing Solr, Multiple Doc Types. Production of Multiple Values for UniqueKey Field Using TemplateTransformer
: org.apache.solr.common.SolrException: Document contains multiple values for : uniqueKey field: uid=[A_1, dc1999fcf12df900] By the looks of things, your TemplateTransformer is properly creating a value of A_${atest.id} where ${atest.id} == 1 for that document ... the problem seems to be that somehow another value is getting put in your uid field containing dc1999fcf12df900 Based on your stack trace, i suspect that in addition to having DIH create a value for your uid field, you also have SignatureUpdateProcessorFactory configured (in your solrconfig.xml) to generate a synthetic unique id based on the signature of some fields as well... : org.apache.solr.update.processor.SignatureUpdateProcessorFactory$SignatureUpdateProcessor.processAdd(SignatureUpdateProcessorFactory.java:194) -Hoss
Re: Seeming bug in ConcurrentUpdateSolrServer
On Wed, May 29, 2013 at 11:29 PM, Benson Margulies bimargul...@gmail.com wrote: The comment here is clearly wrong, since there is no division by two. I think that the code is wrong, because this results in not starting runners when it should start runners. Am I misanalyzing?

  if (runners.isEmpty()
      || (queue.remainingCapacity() < queue.size() // queue is half full and we can add more runners
          && runners.size() < threadCount)) {

queue.remainingCapacity() returns capacity - queue.size(), so the comment is correct. -- Regards, Shalin Shekhar Mangar.
Re: Seeming bug in ConcurrentUpdateSolrServer
Ah. So now I have to find some other explanation of why it never creates more than one thread, even when I make a very deep queue and specify 6 threads. On Wed, May 29, 2013 at 2:25 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: ...
Re: SOLR 4.3.0 - How to make fq optional?
Hoss, for some reason this doesn't work when I pass the latlong value via query. This is the query; it just returns all the values for fname='peter' (doesn't filter for Tarmac, Florida): fl=*,score&rows=10&qt=findperson&fps_latlong=26.22084,-80.29&fps_fname=peter

solrconfig.xml:

  <lst name="appends">
    <str name="fq">{!switch case='*:*' default=$fq_bbox v=$fps_latlong}</str>
  </lst>
  <lst name="invariants">
    <str name="fq_bbox">_query_:"{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}"</str>
  </lst>

Works when used via custom component: This works fine when the latlong value is passed via a custom component. We have a custom component which gets the location name via query, calculates the corresponding lat/long co-ordinates stored in a TSV file, and passes the co-ordinates to the query.

Custom component config:

  <searchComponent name="geo" class="com.customcomponent">
    <str name="placenameFile">centroids.tsv</str>
    <str name="placenameQueryParam">fps_where</str>
    <str name="latQueryParam">fps_latitude</str>
    <str name="lonQueryParam">fps_longitude</str>
    <str name="latlonQueryParam">fps_latlong</str>
    <str name="distQueryParam">fps_dist</str>
    <float name="defaultDist">48.2803</float>
    <float name="boost">1.0</float>
  </searchComponent>

Custom component query: fl=*,score&rows=10&fps_where=new york, ny&qt=findperson&fps_latlong=26.22084,-80.29&fps_dist=.10&fps_fname=peter

Is it a bug?
Re: SOLR 4.3.0 - How to make fq optional?
: Hoss, for some reason this doesn't work when I pass the latlong value via : query.. ... : fl=*,score&rows=10&qt=findperson&fps_latlong=26.22084,-80.29&fps_fname=peter Hmmm, are these appends & invariants on your findperson requestHandler? What does debugQuery=true show you the applied filters are? : <lst name="invariants"> : <str name="fq_bbox">_query_:"{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}"</str> : </lst> Why do you have the _query_ hack in there? I haven't had a chance to test this, but perhaps that hack doesn't play nicely with local-param variable substitution? It should just be... <str name="fq_bbox">{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}</str> : This works fine when the latlong value is passed via custom component. We : have a custom component which gets the location name via query, calculates : the corresponding lat long co-ordinates stored in TSV file and passes the : co-ordinates to the query. Ok wait a minute -- all bets are off about this working if you have a custom component in the mix adding/removing params. You need to provide us with more details about exactly how your component works, where it's configured in the component list, and how it is adding the fps_latlong param it generates to the query, because my guess is one of two things is happening: 1) your component is doing its logic after the query parsing has already happened and the variables have been evaluated -- at which point fps_latlong isn't set yet, so you get the case='*:*' behavior 2) your component is doing its logic before the query parsing happens, but it is setting the value of fps_latlong in a way that the query parsing code doesn't see it when resolving the local variables. -Hoss
Problem with PatternReplaceCharFilter
Hi, I have a problem when using PatternReplaceCharFilter when indexing a field. I created the following field type:

  <fieldType name="testfield" class="solr.TextField">
    <analyzer type="index">
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&#60;TextDocument[^&#62;]*&#62;" replacement=""/>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&#60;/TextDocument&#62;" replacement=""/>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&#60;TextLine[^&#60;]+ content=\&#34;([^\&#34;]*)\&#34;[^/]+/&#62;" replacement="$1 "/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true" />
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

And I created a field that is indexed and stored:

  <field name="testfield" type="testfield" indexed="true" stored="true" />

I need to index a document with such a structure in this field:

  <TextDocument filename="somefile.end" mime="..." created="..."><TextLine aa="bb" cc="dd" content="the content to search in" ee="ff" /><TextLine aa="bb" cc="dd" content="the second content line" ee="ff" /></TextDocument>

Basically I have some sort of XML structure; I need only to search in the content attribute, but when highlighting I need to get back to the enclosing XML tags. So with the 3 regexes I want to remove all unwanted tags and tokenize/index only the important data. I know that I could use HTMLStripCharFilterFactory, but then the tag names, attribute names and values get indexed too, and I don't want to search in that content. I read the following in the doc: NOTE: If you produce a phrase that has different length to source string and the field is used for highlighting for a term of the phrase, you will face a trouble. The thing is, why is this the case? When running the analysis from the Solr admin, the CharFilters generate "the content to search in the second content line", which looks perfect, but then the StandardTokenizer gets the start and end positions of the tokens wrong. Why is this the case? Does there exist another solution to my problem? Could I use the following method I saw in the doc of PatternReplaceCharFilter: protected int correct(int currentOff) Documentation: Retrieve the corrected offset. How could I solve such a task?
Support for Mongolian language
Hi All, Does Solr provide support for the Mongolian language? Also, which filters and tokenizers must be used for Chinese, Japanese and Korean languages? Regards, Sagar Chaturvedi
Re: SOLR 4.3.0 - How to make fq optional?
OK, I removed all my custom components from the findperson request handler:

  <requestHandler name="findperson" class="solr.SearchHandler" default="false">
    <lst name="defaults">
      <str name="defType">lucene</str>
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="q.op">AND</str>
      <str name="qf">person_name_all_i</str>
      <int name="score_truncation_cliff">50</int>
      <int name="fps_dist">32</int>
      <str name="q">*:*</str>
      <lst name="appends">
        <str name="fq">{!switch case='*:*' default=$fq_bbox v=$fps_latlong}</str>
      </lst>
      <lst name="invariants">
        <str name="fq_bbox">_query_:"{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}"</str>
      </lst>
    </lst>
    <arr name="components">
      <str>query</str>
      <str>debug</str>
    </arr>
  </requestHandler>

My query: select?fl=*,score&rows=10&qt=findperson&fps_latlong=42.3482,-75.1890 The above query just returns everything back from Solr (it should only return results corresponding to the lat and long values passed in the query)... I even tried changing the below hack, but got the same results: <str name="fq_bbox">{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}</str> Not sure if I am missing something...
Re: Support for Mongolian language
Check out wiki.apache.org/solr/LanguageAnalysis For some reason the above site takes a long time to open...
Re: Query syntax error: Cannot parse ....
# has a separate meaning in a URL; you need to encode it. See http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Escaping%20Special%20Characters
Re: Grouping results based on the field which matched the query
Not sure if you are looking for this: http://wiki.apache.org/solr/FieldCollapsing
Re: SOLR 4.3.0 - How to make fq optional?
: <lst name="defaults">
:   ...
:   <lst name="appends">
:     <str name="fq">{!switch case='*:*' default=$fq_bbox v=$fps_latlong}</str>
:   </lst>
:   <lst name="invariants">
:     <str name="fq_bbox">_query_:"{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}"</str>
:   </lst>
: </lst>

...you have your appends and invariants nested inside your defaults -- they should be siblings...

  <lst name="defaults"> ... </lst>
  <lst name="appends"> ... </lst>
  <lst name="invariants"> ... </lst>

-Hoss
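Spelled out with the parameter names from this thread, the corrected handler would look roughly like this -- a sketch that also drops the _query_ hack as suggested above:

  <requestHandler name="findperson" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">lucene</str>
      <int name="rows">10</int>
      <int name="fps_dist">32</int>
      <str name="q">*:*</str>
    </lst>
    <lst name="appends">
      <str name="fq">{!switch case='*:*' default=$fq_bbox v=$fps_latlong}</str>
    </lst>
    <lst name="invariants">
      <str name="fq_bbox">{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}</str>
    </lst>
  </requestHandler>

With no fps_latlong parameter the switch parser appends the match-all fq; with one present, it resolves $fq_bbox into the bbox filter.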
Re: SOLR 4.3.0 - How to make fq optional?
I totally missed that... sorry about that :) It seems to work fine now...
RE: Note on The Book
Jack, I'd prefer tons of information instead of a meager 300 page book that leaves a lot of questions. I'm looking forward to a paperback or hardcover book and price doesn't really matter, it is going to be worth it anyway. Thanks, Markus -Original message- From:Jack Krupansky j...@basetechnology.com Sent: Wed 29-May-2013 15:10 To: solr-user@lucene.apache.org Subject: Re: Note on The Book Erick, your point is well taken. Although my primary interest/skill is to produce a solid foundation reference (including tons of examples), the real goal is to then build on top of that foundation. While I focus on the hard-core material - which really does include some narrative and lots of examples in addition to tons of mere reference, my co-author, Ryan Tabora, will focus almost exclusively on... narrative and diagrams. And when I say reference, I also mean lots of examples. Even as the hard-core reference stabilizes, the examples will continue to grow (like weeds!). Once we get the current, existing, under-review, chapters packaged into the new book and available for purchase and download (maybe Lulu, not decided) - available, in a couple of weeks, it will be updated approximately every other week, both with additional reference material, and additional narrative and diagrams. One of our priorities (after we get through Stage 0 of the next few weeks) is to in fact start giving each of the long Deep Dive Chapters enough narrative lead to basically say exactly that - why you should care. A longer-term priority is to improve the balance of narrative and hard-core reference. Yeah, that will be a lot of pages. It already is. We were at 907 pages and I was about to drop in another 166 pages on update handlers when O'Reilly threw up their hands and pulled the plug. I was estimating 1200 pages at that stage. And I'll probably have another 60-80 pages on update request processors within a week or so. With more to come. That did include a lot of hard-core material and example code for Lucene, which won't be in the new Solr-only book. By focusing on an e-book the raw page count alone becomes moot. We haven't given up on print - the intent is eventually to have multiple volumes (4-8 or so, maybe more), both as cheaper e-books ($3 to $5 each) and slimmer print volumes for people who don't need everything in print. In fact, we will likely offer the revamped initial chapters of the book as a standalone introduction to Solr - narrative introduction (why should you care about Solr), basic concepts of Lucene and Solr (and why you should care!), brief tutorial walkthough of the major feature areas of Solr, and a case study. The intent would be both e-book and a slim print volume (75 pages?). Another priority (beyond Stage 0) is to develop a detailed roadmap diagram of Solr and how applications can use Solr, and then use that to show how each of the Deep Dive sections (heavy reference, but gradually adding more narrative over time.) We will probably be very open to requests - what people really wish a book would actually do for them. The only request we won't be open to is to do it all in only 300 pages. -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Wednesday, May 29, 2013 7:19 AM To: solr-user@lucene.apache.org Subject: Re: Note on The Book FWIW, picking up on Alexandre's point. One of my continual frustrations with virtually _all_ technical books is they become endless pages of details without ever mentioning why the hell I should care. 
Unfortunately, explaining use-cases for everything would only make the book about 10,000 pages long. Siiigh. I guess you can take this as a vote for narrative Erick On Tue, May 28, 2013 at 4:53 PM, Jack Krupansky j...@basetechnology.com wrote: We'll have a blog for the book. We hope to have a first raw/rough/partial/draft published as an e-book in maybe 10 days to 2 weeks. As soon as we get that process under control, we'll start the blog. I'll keep your email on file and keep you posted. -- Jack Krupansky -Original Message- From: Swati Swoboda Sent: Tuesday, May 28, 2013 1:36 PM To: solr-user@lucene.apache.org Subject: RE: Note on The Book I'd definitely prefer the spiral bound as well. E-books are great and your draft version seems very reasonably priced (aka I would definitely get it). Really looking forward to this. Is there a separate mailing list / etc. for the book for those who would like to receive updates on the status of the book? Thanks Swati Swoboda Software Developer - Igloo Software +1.519.489.4120 sswob...@igloosoftware.com Bring back Cake Fridays – watch a video you’ll actually like http://vimeo.com/64886237 -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Thursday, May 23, 2013 7:15 PM
Re: Seeming bug in ConcurrentUpdateSolrServer
I now understand the algorithm, but I don't understand why it is the way it is. Consider one of these objects configured with a handful of threads and a pretty big queue. When the first request comes in, the object creates one runner. It then won't create a second runner until the queue reaches half-full. If the idea is that we want to pile up 'a lot' (half a queue) of work before sending any of it, why start that first runner? On Wed, May 29, 2013 at 2:45 PM, Benson Margulies bimargul...@gmail.com wrote: ...
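For anyone experimenting with this behavior, a minimal SolrJ harness looks something like the following -- a sketch against the 4.x API, with a made-up URL and document loop:

  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class CussDemo {
      public static void main(String[] args) throws Exception {
          // queue of 1000 buffered requests, up to 6 runner threads draining it
          ConcurrentUpdateSolrServer server =
              new ConcurrentUpdateSolrServer("http://localhost:8983/solr", 1000, 6);
          for (int i = 0; i < 100000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", Integer.toString(i));
              server.add(doc); // returns quickly; the runner threads do the HTTP work
          }
          server.blockUntilFinished();
          server.commit();
          server.shutdown();
      }
  }

Watching the thread count (e.g. in VisualVM) while varying the queue size and thread count is a direct way to check how many runners actually get started.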
RE: Slow Highlighter Performance Even Using FastVectorHighlighter
Andy, I don't understand why it's taking 7 secs to return highlights. The size of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to 1024 for this verification purpose and that should be more than enough. The processor is plenty powerful enough as well. Running VisualVM shows all my CPU time being taken by mainly these 3 methods: org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset() org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset() org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap() That is a strange and interesting set of things to be spending most of your CPU time on. The implication, I think, is that the number of term matches in the document for terms in your query (or, at least, terms matching exact words or the beginning of phrases in your query) is extremely high. Perhaps that's coming from this partial word match you mention -- how does that work? -- Bryan My guess is that this has something to do with how I'm handling partial word matches/highlighting. I have set up another request handler that only searches the whole word fields and it returns in 850 ms with highlighting. Any ideas? - Andy -Original Message- From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com] Sent: Monday, May 20, 2013 1:39 PM To: solr-user@lucene.apache.org Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter My guess is that the problem is those 200M documents. FastVectorHighlighter is fast at deciding whether a match, especially a phrase, appears in a document, but it still starts out by walking the entire list of term vectors, and ends by breaking the document into candidate-snippet fragments, both processes that are proportional to the length of the document. It's hard to do much about the first, but for the second you could choose to expose FastVectorHighlighter's FieldPhraseList representation, and return offsets to the caller rather than fragments, building up your own snippets from a separate store of indexed files. This would also permit you to set stored=false, improving your memory/core size ratio, which I'm guessing could use some improving. It would require some work, and it would require you to store a representation of what was indexed outside the Solr core, in some constant-bytes-to-character representation that you can use offsets with (e.g. UTF-16, or ASCII+entity references). However, you may not need to do this -- it may be that you just need more memory for your search machine. Not JVM memory, but memory that the O/S can use as a file cache. What do you have now? That is, how much memory do you have that is not used by the JVM or other apps, and how big is your Solr core? One way to start getting a handle on where time is being spent is to set up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight queries, and look at where the time is being spent. If it's mostly in methods that are just reading from disk, buy more memory. If you're on Linux, look at what top is telling you. If the CPU usage is low and the wa number is above 1% more often than not, buy more memory (I don't know why that wa number makes sense, I just know that it has been a good rule of thumb for us).
-- Bryan -Original Message- From: Andy Brown [mailto:andy_br...@rhoworld.com] Sent: Monday, May 20, 2013 9:53 AM To: solr-user@lucene.apache.org Subject: Slow Highlighter Performance Even Using FastVectorHighlighter I'm providing a search feature in a web app that searches for documents that range in size from 1KB to 200MB of varying MIME types (PDF, DOC, etc). Currently there are about 3000 documents and this will continue to grow. I'm providing full word search and partial word search. For each document, there are three source fields that I'm interested in searching and highlighting on: name, description, and content. Since I'm providing both full and partial word search, I've created additional fields that get tokenized differently: name_par, description_par, and content_par. Those are indexed and stored as well for querying and highlighting. As suggested in the Solr wiki, I've got two catch all fields text and text_par for faster querying. An average search results page displays 25 results and I provide paging. I'm just returning the doc ID in my Solr search results and response times have been quite good (1 to 10 ms). The problem in performance occurs when I turn on highlighting. I'm already using the FastVectorHighlighter and depending on the query, it has taken as long as 15 seconds to get the highlight snippets. However, this isn't always the case. Certain query terms result in 1 sec or less response time. In any case, 15 seconds is way too long. I'm fairly new to Solr but I've spent days coming up with what
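As background for anyone tuning a similar setup: the FastVectorHighlighter needs term vectors with positions and offsets on every field it highlights -- schematically something like this (a generic sketch; the field name and type are assumptions):

  <field name="content" type="text_general" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>

enabled at query time with parameters along the lines of:

  ...&hl=true&hl.fl=content&hl.useFastVectorHighlighter=true&hl.snippets=3

Those vectors are the "entire list of term vectors" Bryan describes the highlighter walking, which is why the cost grows with document length.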
Solr query performance tool
Hi, Lately we are seeing increased latency times on Solr and we would like to know which queries / facet searches are the most time-consuming and heavy for our system. Is there any tool equivalent to the MySQL slow log? Does Solr keep the times each query takes in some log? Thank you for your help. -S. -- Spyros Lambrinidis Head of Engineering & Commando of PeoplePerHour.com Evmolpidon 23 118 54, Gkazi Athens, Greece Tel: +30 210 3455480
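Solr does record per-request timings out of the box: every request that reaches a handler is logged with a QTime value in milliseconds, so the slowest queries can be pulled straight from the main log. An illustrative line (the exact format depends on your logging configuration; this one is made up):

  INFO: [collection1] webapp=/solr path=/select params={q=java&facet=true&facet.field=category} hits=1042 status=0 QTime=2175

Sorting these lines by the QTime field gives a rough equivalent of the MySQL slow query log; the admin UI's plugin/stats page also exposes average and cumulative request times per handler.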
Re: Problem with PatternReplaceCharFilter
Just replace the stripped markup with the equivalent number of spaces to maintain positions. Was there some specific problem you were encountering? -- Jack Krupansky -Original Message- From: jasimop Sent: Wednesday, May 29, 2013 4:12 PM To: solr-user@lucene.apache.org Subject: Problem with PatternReplaceCharFilter ...
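PatternReplaceCharFilter can't pad a replacement to the match length by itself, so one way to follow Jack's suggestion is to blank the markup before the text reaches Solr -- a standalone sketch of the idea (plain Java, not a Solr config; the tag patterns are taken from the original post):

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class BlankMarkup {

      private static final Pattern DOC_TAGS =
          Pattern.compile("</?TextDocument[^>]*>");
      private static final Pattern LINE_TAG =
          Pattern.compile("<TextLine[^<]+content=\"([^\"]*)\"[^/]+/>");

      // Overwrite markup characters with spaces so every surviving character
      // keeps its original offset; only the content="..." values remain.
      static String blank(String s) {
          StringBuilder out = new StringBuilder(s);
          Matcher m = LINE_TAG.matcher(s);
          while (m.find()) {
              for (int i = m.start(); i < m.end(); i++) {
                  if (i < m.start(1) || i >= m.end(1)) out.setCharAt(i, ' ');
              }
          }
          m = DOC_TAGS.matcher(s);
          while (m.find()) {
              for (int i = m.start(); i < m.end(); i++) out.setCharAt(i, ' ');
          }
          return out.toString();
      }
  }

Indexing the blanked text in the testfield keeps every highlight offset valid against the original XML, which is the round trip the highlighting use case needs.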
Re: Note on The Book
Markus, Okay, more pages it is! -- Jack Krupansky -Original Message- From: Markus Jelsma Sent: Wednesday, May 29, 2013 5:35 PM To: solr-user@lucene.apache.org Subject: RE: Note on The Book Jack, I'd prefer tons of information instead of a meager 300 page book that leaves a lot of questions. I'm looking forward to a paperback or hardcover book and price doesn't really matter, it is going to be worth it anyway. Thanks, Markus -Original message- From:Jack Krupansky j...@basetechnology.com Sent: Wed 29-May-2013 15:10 To: solr-user@lucene.apache.org Subject: Re: Note on The Book Erick, your point is well taken. Although my primary interest/skill is to produce a solid foundation reference (including tons of examples), the real goal is to then build on top of that foundation. While I focus on the hard-core material - which really does include some narrative and lots of examples in addition to tons of mere reference, my co-author, Ryan Tabora, will focus almost exclusively on... narrative and diagrams. And when I say reference, I also mean lots of examples. Even as the hard-core reference stabilizes, the examples will continue to grow (like weeds!). Once we get the current, existing, under-review, chapters packaged into the new book and available for purchase and download (maybe Lulu, not decided) - available, in a couple of weeks, it will be updated approximately every other week, both with additional reference material, and additional narrative and diagrams. One of our priorities (after we get through Stage 0 of the next few weeks) is to in fact start giving each of the long Deep Dive Chapters enough narrative lead to basically say exactly that - why you should care. A longer-term priority is to improve the balance of narrative and hard-core reference. Yeah, that will be a lot of pages. It already is. We were at 907 pages and I was about to drop in another 166 pages on update handlers when O'Reilly threw up their hands and pulled the plug. I was estimating 1200 pages at that stage. And I'll probably have another 60-80 pages on update request processors within a week or so. With more to come. That did include a lot of hard-core material and example code for Lucene, which won't be in the new Solr-only book. By focusing on an e-book the raw page count alone becomes moot. We haven't given up on print - the intent is eventually to have multiple volumes (4-8 or so, maybe more), both as cheaper e-books ($3 to $5 each) and slimmer print volumes for people who don't need everything in print. In fact, we will likely offer the revamped initial chapters of the book as a standalone introduction to Solr - narrative introduction (why should you care about Solr), basic concepts of Lucene and Solr (and why you should care!), brief tutorial walkthough of the major feature areas of Solr, and a case study. The intent would be both e-book and a slim print volume (75 pages?). Another priority (beyond Stage 0) is to develop a detailed roadmap diagram of Solr and how applications can use Solr, and then use that to show how each of the Deep Dive sections (heavy reference, but gradually adding more narrative over time.) We will probably be very open to requests - what people really wish a book would actually do for them. The only request we won't be open to is to do it all in only 300 pages. -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Wednesday, May 29, 2013 7:19 AM To: solr-user@lucene.apache.org Subject: Re: Note on The Book FWIW, picking up on Alexandre's point. 
One of my continual frustrations with virtually _all_ technical books is that they become endless pages of details without ever mentioning why the hell I should care. Unfortunately, explaining use-cases for everything would only make the book about 10,000 pages long. Siiigh. I guess you can take this as a vote for narrative. Erick

On Tue, May 28, 2013 at 4:53 PM, Jack Krupansky j...@basetechnology.com wrote: We'll have a blog for the book. We hope to have a first raw/rough/partial draft published as an e-book in maybe 10 days to 2 weeks. As soon as we get that process under control, we'll start the blog. I'll keep your email on file and keep you posted. -- Jack Krupansky

-----Original Message----- From: Swati Swoboda Sent: Tuesday, May 28, 2013 1:36 PM To: solr-user@lucene.apache.org Subject: RE: Note on The Book

I'd definitely prefer the spiral-bound as well. E-books are great, and your draft version seems very reasonably priced (aka I would definitely get it). Really looking forward to this. Is there a separate mailing list, etc., for the book for those who would like to receive updates on its status? Thanks Swati Swoboda Software Developer - Igloo Software +1.519.489.4120 sswob...@igloosoftware.com
java.lang.IllegalAccessError when invoking protected method from another class in the same package path but different jar.
Hi, I am overriding the query component and creating a custom component. I am using _responseDocs from org.apache.solr.handler.component.ResponseBuilder to get the values. I have my component in the same package (org.apache.solr.handler.component) so that it can access the _responseDocs value. Everything works fine when I run the test for this component, but I get the error below when I package the custom component in a jar and place it in the lib directory (inside solr/lib - using the basic Jetty configuration). I assume this is because different class loaders load the classes at runtime. Is there a way to resolve this?

<str name="msg">java.lang.IllegalAccessError: tried to access field org.apache.solr.handler.component.ResponseBuilder._responseDocs from class org.apache.solr.handler.component.WPFastDistributedQueryComponent</str>
<str name="trace">java.lang.RuntimeException: java.lang.IllegalAccessError: tried to access field org.apache.solr.handler.component.ResponseBuilder._responseDocs from class org.apache.solr.handler.component.CustomComponent
  at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
  at org.eclipse.jetty.server.Server.handle(Server.java:365)
  at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
  at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
  at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
  at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
  at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
  at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
  at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.IllegalAccessError: tried to access field org.apache.solr.handler.component.ResponseBuilder._responseDocs from class org.apache.solr.handler.component.WPFastDistributedQueryComponent
  at org.apache.solr.handler.component.WPFastDistributedQueryComponent.handleResponses(WPFastDistributedQueryComponent.java:131)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:311)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
  ... 26 more</str>
Re: Support for Mongolian language
On Wed, May 29, 2013, at 09:34 PM, bbarani wrote: Check out wiki.apache.org/solr/LanguageAnalysis. For some reason the above site takes a long time to open. There's a known performance issue with the wiki. Admins are working on it. Upayavira
Re: java.lang.IllegalAccessError when invoking protected method from another class in the same package path but different jar.
My assumptions were right :) I was able to fix this error by copying all my custom jars into the webapp/WEB-INF/lib directory, and everything started working.
solr 4.3: write.lock is not removed
Hi, I recently upgraded Solr from 3.6.1 to 4.3. It works well, but I noticed that after indexing finishes, write.lock is NOT removed. If I index again later, it still works OK; only after I shut down Tomcat is write.lock removed. This behavior causes some problems - for example, I could not use Luke to inspect the indexed data. I did not see any error/warning messages. Is this the designed behavior? Can I get the old behavior (write.lock removed after commit) back through configuration? Thanks very much for your help, Lisheng
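In 4.x the core's IndexWriter stays open across commits, so with the default native lock the write.lock file is held until the core is closed; that is expected behavior rather than an error. The lock implementation is configurable in solrconfig.xml, though changing it won't restore the 3.x release-after-commit behavior - a sketch of where the setting lives, assuming a 4.x-style indexConfig section:

    <!-- solrconfig.xml: which lock implementation guards the index directory.
         With any of these, write.lock lives as long as the core's IndexWriter,
         so it is only removed at core unload or container shutdown. -->
    <indexConfig>
      <lockType>native</lockType> <!-- alternatives: simple, single -->
    </indexConfig>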
Re: Solr query performance tool
Hi, The regular Solr log logs QTime for each query. Otis Solr & ElasticSearch Support http://sematext.com/ On May 29, 2013 5:59 PM, Spyros Lambrinidis spy...@peopleperhour.com wrote: Hi, Lately we are seeing increased latency on Solr, and we would like to know which queries / facet searches are the most time-consuming and heavy for our system. Is there any tool equivalent to the MySQL slow log? Does Solr keep the time each query takes in some log? Thank you for your help. -S. -- Spyros Lambrinidis Head of Engineering Commando of PeoplePerHour.com http://www.peopleperhour.com Evmolpidon 23 118 54, Gkazi Athens, Greece Tel: +30 210 3455480 Follow us on Facebook http://www.facebook.com/peopleperhour Follow us on Twitter http://twitter.com/#%21/peopleperhour
Re: Solr query performance tool
The QTimes are in the Solr log; you'll see lines like: params={q=*:*} hits=32 status=0 QTime=5 QTime is the time spent serving the query but does NOT include assembling the response. Best, Erick On Wed, May 29, 2013 at 5:58 PM, Spyros Lambrinidis spy...@peopleperhour.com wrote: Hi, Lately we are seeing increased latency on Solr, and we would like to know which queries / facet searches are the most time-consuming and heavy for our system. Is there any tool equivalent to the MySQL slow log? Does Solr keep the time each query takes in some log? Thank you for your help. -S. -- Spyros Lambrinidis Head of Engineering Commando of PeoplePerHour.com http://www.peopleperhour.com Evmolpidon 23 118 54, Gkazi Athens, Greece Tel: +30 210 3455480 Follow us on Facebook http://www.facebook.com/peopleperhour Follow us on Twitter http://twitter.com/#%21/peopleperhour
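For a poor man's slow-query log along the lines Spyros asked about, a minimal sketch that scans a Solr log for the QTime=N pattern shown above and prints lines over a threshold. The log path and the 500 ms threshold are assumptions; adjust both for your setup.

    // Hypothetical slow-query extractor: prints log lines whose QTime exceeds a threshold.
    // Assumes the "QTime=<millis>" pattern shown above; path and threshold are examples.
    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SlowQueryLog {
        public static void main(String[] args) throws Exception {
            Pattern qtime = Pattern.compile("QTime=(\\d+)");
            int thresholdMs = 500; // assumption: tune for your latency budget
            try (BufferedReader in = Files.newBufferedReader(Paths.get("logs/solr.log"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    Matcher m = qtime.matcher(line);
                    if (m.find() && Integer.parseInt(m.group(1)) >= thresholdMs) {
                        System.out.println(line); // the line carries params={...}, so you see the query
                    }
                }
            }
        }
    }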
Re: java.lang.IllegalAccessError when invoking protected method from another class in the same package path but different jar.
: Subject: java.lang.IllegalAccessError when invoking protected method from : another class in the same package path but different jar. ... : I am overriding the query component and creating a custom component. I am : using _responseDocs from org.apache.solr.handler.component.ResponseBuilder : to get the values. I have my component in the same package

_responseDocs is not protected; it is package-private, which is why you can't access it from a subclass in another *runtime* package. Even if you put your custom component in the same org.apache.solr... package namespace, the runtime package is determined by the ClassLoader combined with the source package...

http://www.cooljeff.co.uk/2009/05/03/the-subtleties-of-overriding-package-private-methods/

...this is helpful to ensure plugins don't attempt to do things they shouldn't. In general, the ResponseBuilder class internals aren't very friendly in terms of allowing custom components to interact with the intermediate results of other built-in components - it's primarily designed around letting internal Solr components share data with each other in (hopefully) well-tested ways. Note that there is even a specific comment one line directly above the declaration of _responseDocs that alludes to it and several other variables being deliberately package-private:

/* private... components that don't own these shouldn't use them */
SolrDocumentList _responseDocs;
StatsInfo _statsInfo;
TermsComponent.TermsHelper _termsHelper;
SimpleOrderedMap<List<NamedList<Object>>> _pivots;

If you want access to the SolrDocumentList containing the query results, the only safe way/time to do that is by fetching it out of the response (ResponseBuilder.rsp) after the QueryComponent has put it there in its finishStage - until then, ResponseBuilder._responseDocs may not be correct (i.e. distributed search, grouped search, etc...). -Hoss
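To make that concrete, a rough sketch (not from the thread) of a custom component reading the results the supported way in finishStage. It assumes the component is registered after the query component, and that QueryComponent has stored a SolrDocumentList under the "response" key of the response, which is how it behaves in current releases but could change.

    // Sketch: read query results from the response instead of ResponseBuilder._responseDocs.
    import org.apache.solr.common.SolrDocumentList;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;

    public class MyResultsComponent extends SearchComponent {

        @Override
        public void prepare(ResponseBuilder rb) { /* nothing to prepare */ }

        @Override
        public void process(ResponseBuilder rb) { /* non-distributed case: results land in rb.rsp here */ }

        @Override
        public void finishStage(ResponseBuilder rb) {
            // Only look once the get-fields stage of a distributed request is done.
            if (rb.stage != ResponseBuilder.STAGE_GET_FIELDS) return;
            Object results = rb.rsp.getValues().get("response");
            if (results instanceof SolrDocumentList) {
                SolrDocumentList docs = (SolrDocumentList) results;
                // safe to inspect docs here; QueryComponent's finishStage has already run
            }
        }

        @Override
        public String getDescription() { return "example results reader"; }

        @Override
        public String getSource() { return ""; }
    }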
multiple field join?
http://wiki.apache.org/solr/Join

I found that the Solr join is basically a SQL subquery. Does Solr support a three-table join? The SQL looks like this:

SELECT xxx, yyy FROM collection1
WHERE outer_id IN (SELECT inner_id FROM collection1 WHERE zzz = vvv)
  AND outer_id2 IN (SELECT inner_id2 FROM collection1 WHERE ttt = xxx)
  AND outer_id3 IN (SELECT inner_id3 FROM collection1 WHERE ppp = rrr)

How do I write the Solr request URL? Thanks.
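For illustration, a sketch rather than a reply from the thread: Solr's {!join} does not nest the way SQL subqueries do, but an AND of IN-subqueries over a single core maps naturally onto one join per filter query, since fq clauses intersect. Using the field names from the SQL above, something like:

    q={!join from=inner_id to=outer_id}zzz:vvv
    &fq={!join from=inner_id2 to=outer_id2}ttt:xxx
    &fq={!join from=inner_id3 to=outer_id3}ppp:rrr
    &fl=xxx,yyy

Each join clause keeps the documents whose outer_id* value matches the inner_id* of documents satisfying the inner condition; whether terms like zzz:vvv are the right inner-query syntax depends on your field types.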
Re: Problem with PatternReplaceCharFilter
Honestly, I have no idea how to do that. PatternReplaceCharFilter doesn't seem to have a parameter like preservePositions="true" with an optional fillCharacter=" ". And I don't think I can express this purely as a regex - how would I count, in a pure regex, the length difference before and after the match? Well, the specific problem is that when highlighting, the term positions are wrong and the result is not a valid XML structure that I can handle. I expect something like:

<TextLine aa="bb" cc="dd" content="the content to <em>search</em> in" ee="ff" />

but I get:

Tex<em>tLine</em> aa="bb" cc="dd" content="the content to <em>search</em> in" ee="ff" />

Thanks for your help.
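A pure regex replacement string indeed can't compute "the same number of spaces as the match", but a replacement callback outside Solr can. A minimal sketch of the space-padding idea, done as pre-processing before the document is sent for indexing (assumes Java 11+; the patterns mirror the three charFilter rules in the schema above, and MarkupBlanker is a made-up name):

    // Sketch: replace every markup span with an equal-length run of spaces so that
    // character offsets in the result line up one-for-one with the original string.
    import java.util.regex.Pattern;

    public class MarkupBlanker {
        // Assumption: the same markup shapes as the three charFilter patterns above,
        // and content values never contain a quote or "/>".
        private static final Pattern MARKUP = Pattern.compile(
            "<TextDocument[^>]*>"            // opening TextDocument tag
            + "|</TextDocument>"             // closing TextDocument tag
            + "|<TextLine[^<]*?content=\""   // TextLine up to the content value
            + "|\"[^/>]*/>");                // content closing quote to end of tag

        public static String blank(String raw) {
            return MARKUP.matcher(raw).replaceAll(
                m -> " ".repeat(m.group().length())); // pad to identical length
        }

        public static void main(String[] args) {
            String raw = "<TextLine aa=\"bb\" cc=\"dd\" content=\"the content to search in\" ee=\"ff\" />";
            String blanked = MarkupBlanker.blank(raw);
            System.out.println(blanked);            // content text at its original offsets
            assert blanked.length() == raw.length();
        }
    }

Because every markup span keeps its length, token offsets computed on the blanked text line up character-for-character with the original XML, so highlight offsets can be mapped onto the stored original without producing broken tags.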