Re: Problem adding new requesthandler to solr branch_3x
Hoss, many thanks for the reply. Paul On 8 March 2011 19:45, Chris Hostetter hossman_luc...@fucit.org wrote: : 1. Why the problem occurs (has something changed between 1.4.1 and 3x)? Various pieces of code dealing with config parsing have changed since 1.4.1 to be better about verifying that configs are meaningful, and reporting errors when unexpected things are encountered. I'm not sure of the specific change, but the underlying point is: if 1.4.1 wasn't giving you an error for that syntax, it's because it was completely ignoring it. -Hoss
LucidGaze Monitoring tool
Hi all, Does anyone know what the 'm' on the y-axis stands for in the req/sec graph for the update handler? -- Thanks & Regards, Isan Fulia.
Re: NRT in Solr
i am using solr for NRT with this version of solr ... Solr Specification Version: 4.0.0.2010.10.26.08.43.14 Solr Implementation Version: 4.0-2010-10-26_08-05-39 1027394 - hudson - 2010-10-26 08:43:14 Lucene Specification Version: 4.0-2010-10-26_08-05-39 Lucene Implementation Version: 4.0-2010-10-26_08-05-39 1027394 - 2010-10-26 08:43:44 Is this version ready for NRT or not? It works, but if it can work better I'll update Solr ... thx - --- System: One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 1 Core with 31 Million Documents, other cores 100,000 - Solr1 for Search-Requests - commit every Minute - 4GB Xmx - Solr2 for Update-Requests - delta every 2 Minutes - 4GB Xmx -- View this message in context: http://lucene.472066.n3.nabble.com/NRT-in-Solr-tp2652689p2654472.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr UIMA Wiki page
Hi all, I just improved the Solr UIMA integration wiki page [1] so if anyone is using it and/or has any feedback it'd be more than welcome. Regards, Tommaso [1] : http://wiki.apache.org/solr/SolrUIMA
Re: NRT in Solr
question: http://wiki.apache.org/solr/NearRealtimeSearchTuning 'PERFORMANCE WARNING: Overlapping onDeckSearchers=x' - I got this message. In my solrconfig.xml, maxWarmingSearchers=4. If I set it to 1 or 2 I get an exception; with 4 I get nothing but the performance warning. The wiki article says that the best solution is to set maxWarmingSearchers to 1!!! How can this work? - --- System: One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 1 Core with 31 Million Documents, other cores 100,000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Requests - delta every Minute - 4GB Xmx
Re: getting much double-Values from solr -- timeout
Are you using shards or have everything in the same index? - shards == distributed search over several cores? = yes, but not always; in general, no. What problem did you experience with the StatsComponent? - if I use stats on my 34-million-doc index, no matter how many docs are found, the sum takes a VERY long time. How did you use it? - like in the wiki; I think StatsComponent is not so dynamically usable!? I think the right approach will be to optimize StatsComponent to do a quick sum() - how can I optimize this? Change the code of StatsComponent and build a new Solr? - --- System: One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 1 Core with 31 Million Documents, other cores 100,000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Requests - delta every Minute - 4GB Xmx
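For reference, StatsComponent is invoked with extra request parameters on a normal query; a sketch of such a request (the core URL and the field name price_d are only placeholders for your own double field):

```shell
# Ask StatsComponent for sum/min/max/etc. over a numeric field.
# rows=0 suppresses the document list so only the stats come back.
curl "http://localhost:8983/solr/select?q=*:*&rows=0&stats=true&stats.field=price_d"
```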
Re: getting much double-Values from solr -- timeout
i am using NRT, and the caches are not always warmed; I think this is also a problem!? - --- System: One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 1 Core with 31 Million Documents, other cores 100,000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Requests - delta every Minute - 4GB Xmx
Re: Solr UIMA Wiki page
Great work! On Wednesday 09 March 2011 11:20:41 Tommaso Teofili wrote: Hi all, I just improved the Solr UIMA integration wiki page [1] so if anyone is using it and/or has any feedback it'd be more than welcome. Regards, Tommaso [1] : http://wiki.apache.org/solr/SolrUIMA -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
NRT and warmupTime of filterCache
I tried to create an NRT setup like in the wiki, but I got some problems with autowarming and onDeckSearchers. Every minute I start a delta import on one core, and the other core commits the index every minute so it can be searched. The wiki says ... = 1 searcher and filterCache warmupCount=3600. With this config I got an exception that no searcher is available ... so I cannot use this config ... My config is 4 searchers and warmupCount=3000... With these settings I get the performance warning, but it works. BUT during the 30 seconds (or more) needed to warm the searcher, I cannot ping my server and I get errors ... Does it make sense to decrease my warmupCount to 0??? How many searchers do I need for 7 cores? - --- System: One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 1 Core with 31 Million Documents, other cores 100,000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Requests - delta every Minute - 4GB Xmx
Re: NRT in Solr
maxWarmingSearchers=1 is good for current stable Solr versions, where memory is important. Overlapping warming searchers can be extremely memory-consuming. I don't know how cache warming behaves with NRT. On Wednesday 09 March 2011 11:27:39 stockii wrote: [...] -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
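The setting under discussion lives in solrconfig.xml; a sketch of the relevant elements (values illustrative, not a recommendation):

```xml
<!-- Allow at most one searcher to be warming at a time -->
<maxWarmingSearchers>1</maxWarmingSearchers>
<!-- If a request arrives before the first searcher has finished warming,
     use the still-warming (cold) searcher immediately instead of blocking -->
<useColdSearcher>true</useColdSearcher>
```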
Re: True master-master fail-over without data gaps
Yes, I think this should be pushed upstream - insert a tee in the document stream so that all documents go to both masters. Then use a load balancer to make requests of the masters. The tee itself then becomes a possible single point of failure, but you didn't say anything about the architecture of the document feed. Is that also fault-tolerant? -Mike On 3/9/2011 1:06 AM, Jonathan Rochkind wrote: I'd honestly think about buffering the incoming documents in some store that's actually made for fail-over persistence reliability, maybe CouchDB or something. Then that takes care of not losing anything, and the problem becomes how we make sure that our Solr master indexes are kept in sync with the actual persistent store; which I'm still not sure about, but I'm thinking it's a simpler problem. The right tool for the right job - that kind of failover persistence is not Solr's specialty. From: Otis Gospodnetic [otis_gospodne...@yahoo.com] Sent: Tuesday, March 08, 2011 11:45 PM To: solr-user@lucene.apache.org Subject: True master-master fail-over without data gaps Hello, What are some common or good ways to handle indexing (master) fail-over? Imagine you have a continuous stream of incoming documents that you have to index without losing any of them (or with losing as few of them as possible). How do you set up your masters? In other words, you can't just have 2 masters where the secondary is the Repeater (or Slave) of the primary master and replicates the index periodically: you need to have 2 masters that are in sync at all times! How do you achieve that? * Do you just put N masters behind a LB VIP, configure them both to point to the index on some shared storage (e.g. SAN), and count on the LB to fail-over to the secondary master when the primary becomes unreachable? If so, how do you deal with index locks? You use the Native lock and count on it disappearing when the primary master goes down? 
That means you count on the whole JVM process dying, which may not be the case... * Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2 masters with 2 separate indices in sync, while making sure you write to only 1 of them via LB VIP or otherwise? * Or ... This thread is on a similar topic, but is inconclusive: http://search-lucene.com/m/aOsyN15f1qd1 Here is another similar thread, but this one doesn't cover how 2 masters are kept in sync at all times: http://search-lucene.com/m/aOsyN15f1qd1 Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: Help -DIH (mail)
Hi Peter, When I execute the commands you mentioned, nothing happened. Below I show you the commands executed and their responses. Sorry, but I don't know how to enable the log; my JRE is at its defaults. Remember I'm running the example-DIH (trunk\solr\example\example-DIH\solr): java -Dsolr.solr.home=./example-DIH/solr/ -jar start.jar.

Import: http://localhost:8983/solr/mail/dataimport?command=full-import

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">15</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages"/>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

Status: http://localhost:8983/solr/mail/dataimport?command=status

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="command">status</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages"/>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

Thank you for your help. Matias. 
2011/3/4 Peter Sturge peter.stu...@gmail.com Can you try this: Issue a full import command like this: http://localhost:8983/solr/dataimport?command=full-import (There is no core name here - if you're using a core name (db?), then add that in between solr/ and /dataimport) then, run: http://localhost:8983/solr/dataimport?command=status This will show the results of the previous import. Has it been rolled back? If so, there might be something in the log if it's enabled (see your JRE's lib/logging.properties file). (You won't see any errors unless you run the status command - that's where they're stored) HTH Peter On Sat, Mar 5, 2011 at 12:46 AM, Matias Alonso matiasgalo...@gmail.com wrote: I'm using the trunk. Thanks Peter for your concern! Matias. 2011/3/4 Peter Sturge peter.stu...@gmail.com Hi Matias, What version of Solr are you using? Are you running any patches (maybe SOLR-2245)? Thanks, Peter On Fri, Mar 4, 2011 at 8:25 PM, Matias Alonso matiasgalo...@gmail.com wrote: Hi Peter, From the DataImportHandler Development Console I made a full-import, but it didn't work. Now I execute http://localhost:8983/solr/mail/dataimport?command=full-import but nothing happens; no index; no errors. thks... Matias. 2011/3/4 Peter Sturge peter.stu...@gmail.com Hi Matias, http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport accesses the dataimport handler, but you need to tell it to do something by sending a command: http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport&command=full-import If you haven't already, have a look at: http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FDataImportHandler It gives very thorough and useful advice on getting the DIH working. 
Peter On Fri, Mar 4, 2011 at 6:59 PM, Matias Alonso matiasgalo...@gmail.com wrote: Hi Peter, I tested with deltaFetch=false, but it doesn't work :( I'm using the DataImportHandler Development Console to index ( http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport ); I'm working with example-DIH. thks... 2011/3/4 Peter Sturge peter.stu...@gmail.com Hi Matias, I haven't seen it in the posts, but I may have missed it -- what is the import command you're sending? Something like: http://localhost:8983/solr/db/dataimport?command=full-import Can you also test it with deltaFetch=false. I seem to remember having some problems with delta in the MailEntityProcessor. On Fri, Mar 4, 2011 at 6:29 PM, Matias Alonso matiasgalo...@gmail.com wrote: dataConfig document entity name=email user=myem...@gmail.com password=mypassword host=imap.gmail.com fetchMailsSince=2011-01-01 00:00:00
Re: NRT and warmupTime of filterCache
Does it make sense to update Solr to get SOLR-571??? - --- System: One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 1 Core with 31 Million Documents, other cores 100,000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Requests - delta every Minute - 4GB Xmx
Re: getting much double-Values from solr -- timeout
You have a large index with tough performance requirements on one server. I would analyze your system to see if it's got any bottlenecks. Watch out for auto-warming taking so long that it does not finish before the next commit(). Watch out for too-frequent commits. Monitor memory usage (JConsole or similar) to find out if the correct amount of RAM is allocated to each JVM. How large is your index in terms of GB? It may very well be that you need even more RAM in the server to cache more of the index files in OS memory. Try to stop the update JVM and let only the search JVM be active. This will free RAM for the OS. Then see if performance increases. Next, try an optimize() and see if that makes a difference. I'm not familiar with the implementation details of StatsComponent. But if your stats query is still slow after freeing RAM and optimize(), I would file a JIRA issue, and attach to that issue some detailed response XMLs with debugQuery=true&echoParams=all, to document exactly how you use it and how it performs. It may be possible to optimize the code. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com On 9. mars 2011, at 11.39, stockii wrote: [...]
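The debug output Jan suggests attaching can be captured with a request like the following (URL, query, and the field name price_d are only placeholders):

```shell
# echoParams=all echoes every parameter in effect; debugQuery=true adds
# per-component timing, showing where a slow stats request spends its time.
curl "http://localhost:8983/solr/select?q=*:*&stats=true&stats.field=price_d&debugQuery=true&echoParams=all"
```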
Re: Help -DIH (mail)
Hi, You've included some output in your message, so I presume something *did* happen when you ran the 'status' command (but it might not be what you wanted to happen :-) If you run: http://localhost:8983/solr/mail/dataimport?command=status and you get something like this back: <str name="status">idle</str> <str name="importResponse"/> <lst name="statusMessages"/> it means that no full-import or delta-import has been run during the life of the JVM Solr session. You should try running: http://localhost:8983/solr/mail/dataimport?command=full-import Then run: http://localhost:8983/solr/mail/dataimport?command=status to see the status of the full-import (busy, idle, error, rolled back etc.) You can enable Java logging by editing your JRE's lib/logging.properties file. Something like this should give you some log files:

handlers = java.util.logging.FileHandler
.level = INFO
java.util.logging.FileHandler.pattern = ./logs/mylogs%g.log
java.util.logging.FileHandler.level = INFO
java.util.logging.FileHandler.limit = 50
java.util.logging.FileHandler.count = 1
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

NOTE: Make sure the 'logs' folder exists (in your $cwd) before you start, or you'll get an error. HTH Peter On Wed, Mar 9, 2011 at 12:47 PM, Matias Alonso matiasgalo...@gmail.com wrote: Hi Peter, When I execute the commands you mentioned, nothing happened. Below I show you the commands executed and their responses. Sorry, but I don't know how to enable the log; my JRE is at its defaults. Remember I'm running the example-DIH (trunk\solr\example\example-DIH\solr): java -Dsolr.solr.home=./example-DIH/solr/ -jar start.jar. 
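For anyone else following the thread, Peter's import-then-status sequence against the example-DIH 'mail' core looks like this (default port assumed):

```shell
# Start the import; the command returns at once and the import runs on
curl "http://localhost:8983/solr/mail/dataimport?command=full-import"
# Poll until status is 'idle'; statusMessages then lists docs added or errors
curl "http://localhost:8983/solr/mail/dataimport?command=status"
```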
Re: NRT in Solr
Jae, NRT hasn't been implemented in Solr as of yet, I think partially because major features such as replication, caching, and uninverted faceting suddenly are no longer viable, eg, it's another round of testing etc. It's doable; however, I think the best approach is a separate request call path, to avoid altering the current [working] API. On Tue, Mar 8, 2011 at 1:27 PM, Jae Joo jaejo...@gmail.com wrote: Hi, Is NRT in Solr 4.0 from trunk? I have checked it out from trunk, but could not find the configuration for NRT. Regards Jae
Re: NRT and warmupTime of filterCache
I think it's best to turn the warmupCount down to zero, because usually there isn't time in between the creation of new searchers to run the warmup queries; i.e., warming would negatively impact the desired goal of low-latency new index readers. On Wed, Mar 9, 2011 at 3:41 AM, stockii stock.jo...@googlemail.com wrote: [...]
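In solrconfig.xml terms, turning warming off means autowarmCount="0" on the caches; a sketch (cache sizes are only placeholders):

```xml
<!-- Keep the filterCache, but copy nothing from the old searcher on commit,
     so a new searcher is usable as soon as it opens -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>
```

The same attribute exists on queryResultCache; the documentCache is never autowarmed, since internal document IDs change between searchers.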
Re: True master-master fail-over without data gaps
If you're using the delta import handler, the problem would seem to go away: you can have two separate masters running at all times, and if one fails, you can then point the slaves to the secondary master, which is guaranteed to be in sync because it's been importing from the same database? On Tue, Mar 8, 2011 at 8:45 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: [...]
Re: dataimport
This has since been fixed. The problem was that there was not enough memory on the machine. It works just fine now. On Tue, Mar 8, 2011 at 6:22 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : INFO: Creating a connection for entity id with URL: : jdbc:mysql://localhost/researchsquare_beta_library?characterEncoding=UTF8&zeroDateTimeBehavior=convertToNull : Feb 24, 2011 8:58:25 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 : call : INFO: Time taken for getConnection(): 137 : Killed : : So it looks like for whatever reason, the server crashes trying to do a full : import. When I add a LIMIT clause on the query, it works fine when the LIMIT : is only 250 records, but if I try to do 500 records, I get the same message. ...wow. that's ... weird. I've never seen a java process just log "Killed" like that. The only time I've ever seen a process log "Killed" is if it was terminated by the OS (ie: kill -9 pid). What OS are you using? How are you running Solr? (ie: are you using the simple jetty example "java -jar start.jar" or are you using a different servlet container?) ... Are you absolutely certain your machine doesn't have some sort of monitoring in place that kills jobs if they take too long, or use too much CPU? -Hoss
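When a JVM dies with a bare "Killed" and no stack trace on Linux, the kernel's OOM killer is the usual suspect, and it leaves a record in the kernel log; a quick check (commands assume a typical Linux box, log paths vary by distro):

```shell
# Evidence of the OOM killer terminating a process
dmesg | grep -i -E "out of memory|killed process"
# On distros with persistent syslog, the same lines usually land here
grep -i "oom-killer" /var/log/syslog /var/log/messages 2>/dev/null
```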
Re: Help -DIH (mail)
Peter, You're right; maybe I explained it badly because of my English. I did everything you told me. I think it does not find the folder when indexing. What do you think? Below I show you part of the log.

09/03/2011 11:52:01 org.apache.solr.core.SolrCore execute
INFO: [mail] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=0
09/03/2011 11:52:01 org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
09/03/2011 11:52:01 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
09/03/2011 11:52:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [mail] REMOVING ALL DOCUMENTS FROM INDEX
09/03/2011 11:52:01 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1 commit{dir=D:\Search Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_c,version=1298912662799,generation=12,filenames=[segments_c]
09/03/2011 11:52:01 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1298912662799
09/03/2011 11:52:01 org.apache.solr.handler.dataimport.MailEntityProcessor logConfig
INFO: user : myem...@gmail.com pwd : mypass protocol : imaps host : imap.gmail.com folders : Recibidos,recibidos,RECIBIDOS,inbox.InBox,INBOX,Mail,MAIL,mail,CORREO,correo,Correo,Mail,mail,MAIL recurse : false exclude : [] include : [] batchSize : 100 fetchSize : 32768 read timeout : 6 conection timeout : 3 custom filter : fetch mail since : Thu Mar 03 00:00:00 GFT 2011
09/03/2011 11:52:03 org.apache.solr.handler.dataimport.MailEntityProcessor connectToMailBox
INFO: Connected to mailbox
09/03/2011 11:52:03 org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully
09/03/2011 11:52:03 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
09/03/2011 11:52:03 org.apache.solr.core.SolrDeletionPolicy onCommit 
INFO: SolrDeletionPolicy.onCommit: commits:num=2 commit{dir=D:\Search Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_c,version=1298912662799,generation=12,filenames=[segments_c] commit{dir=D:\Search Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_d,version=1298912662800,generation=13,filenames=[segments_d]
09/03/2011 11:52:03 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1298912662800
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher init
INFO: Opening Searcher@1cee792 main
09/03/2011 11:52:03 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@1cee792 main fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@1cee792 main filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=5,evictions=0,size=5,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@1cee792 main queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=5,evictions=0,size=5,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03
Re: Help -DIH (mail)
Hi, When you ran the status command, what was the output?

On Wed, Mar 9, 2011 at 2:55 PM, Matias Alonso matiasgalo...@gmail.com wrote:

Peter, you're right; maybe I explained it poorly because of my English. I did everything you told me. I think it doesn't find the folder when it indexes. What do you think? Below I show you part of the log.

09/03/2011 11:52:01 org.apache.solr.core.SolrCore execute INFO: [mail] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=0
09/03/2011 11:52:01 org.apache.solr.handler.dataimport.DataImporter doFullImport INFO: Starting Full Import
09/03/2011 11:52:01 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties INFO: Read dataimport.properties
09/03/2011 11:52:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll INFO: [mail] REMOVING ALL DOCUMENTS FROM INDEX
09/03/2011 11:52:01 org.apache.solr.core.SolrDeletionPolicy onInit INFO: SolrDeletionPolicy.onInit: commits:num=1 commit{dir=D:\Search Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_c,version=1298912662799,generation=12,filenames=[segments_c]
09/03/2011 11:52:01 org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: newest commit = 1298912662799
09/03/2011 11:52:01 org.apache.solr.handler.dataimport.MailEntityProcessor logConfig INFO: user : myem...@gmail.com pwd : mypass protocol : imaps host : imap.gmail.com folders : Recibidos,recibidos,RECIBIDOS,inbox.InBox,INBOX,Mail,MAIL,mail,CORREO,correo,Correo,Mail,mail,MAIL recurse : false exclude : [] include : [] batchSize : 100 fetchSize : 32768 read timeout : 6 conection timeout : 3 custom filter : fetch mail since : Thu Mar 03 00:00:00 GFT 2011
09/03/2011 11:52:03 org.apache.solr.handler.dataimport.MailEntityProcessor connectToMailBox INFO: Connected to mailbox
09/03/2011 11:52:03 org.apache.solr.handler.dataimport.DocBuilder finish INFO: Import completed successfully
09/03/2011 11:52:03 org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
09/03/2011 11:52:03 org.apache.solr.core.SolrDeletionPolicy onCommit INFO: SolrDeletionPolicy.onCommit: commits:num=2 commit{dir=D:\Search Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_c,version=1298912662799,generation=12,filenames=[segments_c] commit{dir=D:\Search Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_d,version=1298912662800,generation=13,filenames=[segments_d]
09/03/2011 11:52:03 org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: newest commit = 1298912662800
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher init INFO: Opening Searcher@1cee792 main
09/03/2011 11:52:03 org.apache.solr.update.DirectUpdateHandler2 commit INFO: end_commit_flush
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm INFO: autowarming result for Searcher@1cee792 main fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm INFO: autowarming result for Searcher@1cee792 main filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=5,evictions=0,size=5,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm INFO: autowarming result for Searcher@1cee792 main queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=5,evictions=0,size=5,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main
Re: Help -DIH (mail)
Log:

09/03/2011 11:54:58 org.apache.solr.core.SolrCore execute INFO: [mail] webapp=/solr path=/dataimport params={command=status} status=0 QTime=0

XML response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="command">status</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2011-03-09 11:52:01</str>
    <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
    <str name="Committed">2011-03-09 11:52:03</str>
    <str name="Optimized">2011-03-09 11:52:03</str>
    <str name="Total Documents Processed">0</str>
    <str name="Time taken ">0:0:2.359</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

Thks, Matias.

2011/3/9 Peter Sturge peter.stu...@gmail.com wrote: Hi, When you ran the status command, what was the output?
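For reference, the settings logged by MailEntityProcessor above normally come from a data-config.xml along these lines. This is a sketch with placeholder credentials, not the poster's actual file; the attribute names (user, password, host, protocol, folders, fetchMailsSince, recurse) follow the DataImportHandler wiki documentation and should be checked against your Solr version:

```xml
<dataConfig>
  <document>
    <!-- Placeholder credentials. 'folders' must name folders that actually
         exist on the IMAP server; Gmail's inbox is exposed as "INBOX". -->
    <entity processor="MailEntityProcessor"
            user="someone@gmail.com"
            password="secret"
            host="imap.gmail.com"
            protocol="imaps"
            folders="INBOX"
            recurse="false"
            fetchMailsSince="2011-03-03 00:00:00"/>
  </document>
</dataConfig>
```

One plausible reading of the log above: the connection succeeds ("Connected to mailbox") and the import finishes immediately, with no folder-processing activity in between, which is consistent with none of the many configured folder names matching an actual folder, hence "Total Documents Processed: 0".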
SolrJ and digest authentication
I'm trying to do a search with SolrJ using digest authentication, but I'm getting the following error: org.apache.solr.common.SolrException: Unauthorized

I'm setting up SolrJ this way:

HttpClient client = new HttpClient();
List<String> authPrefs = new ArrayList<String>();
authPrefs.add(AuthPolicy.DIGEST);
client.getParams().setParameter(AuthPolicy.AUTH_SCHEME_PRIORITY, authPrefs);
AuthScope scope = new AuthScope(host, 443, "resin");
client.getState().setCredentials(scope, new UsernamePasswordCredentials(username, password));
client.getParams().setAuthenticationPreemptive(true);
SolrServer server = new CommonsHttpSolrServer(url, client);

Is this something that is not supported by SolrJ, or have I written something wrong in the code above?

Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
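For background on what HttpClient negotiates on the wire here: in the Digest scheme the client proves knowledge of the password by sending an MD5 hash built from the username, realm (the "resin" string above), server nonce, method, and URI. The sketch below is a simplified illustration of the RFC 2617 computation without the qop/cnonce extensions; it is not SolrJ or HttpClient code, and HttpClient performs this internally:

```java
import java.security.MessageDigest;

// Simplified RFC 2617 digest-response computation (no qop/cnonce), for
// illustration only. HttpClient computes this itself during the 401 handshake.
class DigestSketch {
    static String md5Hex(String s) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(s.getBytes("UTF-8"))) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // response = MD5( MD5(user:realm:pass) : nonce : MD5(method:uri) )
    static String digestResponse(String user, String realm, String pass,
                                 String method, String uri, String nonce)
            throws Exception {
        String ha1 = md5Hex(user + ":" + realm + ":" + pass);
        String ha2 = md5Hex(method + ":" + uri);
        return md5Hex(ha1 + ":" + nonce + ":" + ha2);
    }
}
```

Note that the realm argument to AuthScope must match the realm the server sends in its WWW-Authenticate challenge exactly, or the credentials are never offered and the request fails with Unauthorized.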
RE: True master-master fail-over without data gaps
If you have a wrapper, like an indexer app which prepares solr docs and sends them into solr, then it is simple. The wrapper is your 'tee' and it can send docs to both (or N) masters.

-Original Message-
From: Michael Sokolov [mailto:soko...@ifactory.com]
Sent: Wednesday, March 09, 2011 4:14 AM
To: solr-user@lucene.apache.org
Cc: Jonathan Rochkind
Subject: Re: True master-master fail-over without data gaps

Yes, I think this should be pushed upstream - insert a tee in the document stream so that all documents go to both masters. Then use a load balancer to make requests of the masters. The tee itself then becomes a possible single point of failure, but you didn't say anything about the architecture of the document feed. Is that also fault-tolerant?

-Mike

On 3/9/2011 1:06 AM, Jonathan Rochkind wrote:

I'd honestly think about buffering the incoming documents in some store that's actually made for fail-over persistence reliability, maybe CouchDB or something. And then that's taking care of not losing anything, and the problem becomes how we make sure that our solr master indexes are kept in sync with the actual persistent store; which I'm still not sure about, but I'm thinking it's a simpler problem. The right tool for the right job: that kind of failover persistence is not solr's specialty.

From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
Sent: Tuesday, March 08, 2011 11:45 PM
To: solr-user@lucene.apache.org
Subject: True master-master fail-over without data gaps

Hello,

What are some common or good ways to handle indexing (master) fail-over? Imagine you have a continuous stream of incoming documents that you have to index without losing any of them (or with losing as few of them as possible). How do you set up your masters? In other words, you can't just have 2 masters where the secondary is the Repeater (or Slave) of the primary master and replicates the index periodically: you need to have 2 masters that are in sync at all times! How do you achieve that?

* Do you just put N masters behind a LB VIP, configure them both to point to the index on some shared storage (e.g. SAN), and count on the LB to fail-over to the secondary master when the primary becomes unreachable? If so, how do you deal with index locks? You use the Native lock and count on it disappearing when the primary master goes down? That means you count on the whole JVM process dying, which may not be the case...
* Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2 masters with 2 separate indices in sync, while making sure you write to only 1 of them via LB VIP or otherwise?
* Or ...

This thread is on a similar topic, but is inconclusive: http://search-lucene.com/m/aOsyN15f1qd1
Here is another similar thread, but this one doesn't cover how 2 masters are kept in sync at all times: http://search-lucene.com/m/aOsyN15f1qd1

Thanks,
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
Re: True master-master fail-over without data gaps
Hi,

- Original Message -
If you're using the delta import handler the problem would seem to go away because you can have two separate masters running at all times, and if one fails, you can then point the slaves to the secondary master, that is guaranteed to be in sync because it's been importing from the same database?

Oh, there is no DB involved. Think of a document stream continuously coming in, a component listening to that stream, grabbing docs, and pushing it to master(s).

Otis
Re: True master-master fail-over without data gaps
Hi,

- Original Message -
From: Robert Petersen rober...@buy.com
To: solr-user@lucene.apache.org
Sent: Wed, March 9, 2011 11:40:56 AM
Subject: RE: True master-master fail-over without data gaps

If you have a wrapper, like an indexer app which prepares solr docs and sends them into solr, then it is simple. The wrapper is your 'tee' and it can send docs to both (or N) masters.

Doesn't this make it too easy for 2 masters to get out of sync even if the problem is not with them? E.g. something happens in this tee component and it indexes a doc to master A, but not master B.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
Re: True master-master fail-over without data gaps
Oh, there is no DB involved. Think of a document stream continuously coming in, a component listening to that stream, grabbing docs, and pushing it to master(s).

I don't think Solr is designed for this use case; e.g., I wouldn't expect deterministic results with the current architecture, as this is inherently a key component of [No]SQL databases.
Re: True master-master fail-over without data gaps
Hi,

- Original Message -
Yes, I think this should be pushed upstream - insert a tee in the document stream so that all documents go to both masters. Then use a load balancer to make requests of the masters.

Hm, but this makes the tee app aware of this. What if I want to hide that from any code of mine?

The tee itself then becomes a possible single point of failure, but you didn't say anything about the architecture of the document feed. Is that also fault-tolerant?

Let's say it is! :)

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
Re: True master-master fail-over without data gaps
Hi,

- Original Message -
Oh, there is no DB involved. Think of a document stream continuously coming in, a component listening to that stream, grabbing docs, and pushing it to master(s).

I don't think Solr is designed for this use case; e.g., I wouldn't expect deterministic results with the current architecture, as this is inherently a key component of [No]SQL databases.

You mean it's not possible to have 2 masters that are in nearly real-time sync? How about with DRBD? I know people use DRBD to keep 2 Hadoop NNs (their edit logs) in sync to avoid the current NN SPOF, for example, so I'm thinking this could be doable with Solr masters, too, no?

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
RE: True master-master fail-over without data gaps
Currently I use an application connected to a queue containing incoming data, which my indexer app turns into solr docs. I log everything to a log table and have never had an issue with losing anything. I can trace incoming docs exactly, and keep timing data in there also. If I added a second solr url for a second master and resent the same doc to master02 that I sent to master01, I would expect near 100% synchronization. The problem here is how to get the slave farm to start replicating from the second master if and when the first goes down. I can only see that as being a manual operation, repointing the slaves to master02 and restarting or reloading them, etc.

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: Wednesday, March 09, 2011 8:52 AM
To: solr-user@lucene.apache.org
Subject: Re: True master-master fail-over without data gaps

Doesn't this make it too easy for 2 masters to get out of sync even if the problem is not with them? E.g. something happens in this tee component and it indexes a doc to master A, but not master B.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
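The wrapper/'tee' approach discussed in this thread, together with the concern about masters drifting out of sync, can be sketched as a small fan-out-plus-replay loop. This is an illustration only; Master, TeeIndexer, and FakeMaster are hypothetical names, not Solr or SolrJ APIs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for one Solr master; in practice index() would be a SolrJ add()+commit.
interface Master {
    void index(String doc) throws Exception;
}

// The "tee": send every doc to all masters, remember per-master failures,
// and replay them later so a lagging master can be caught up.
class TeeIndexer {
    private final List<Master> masters;
    private final List<Object[]> misses = new ArrayList<Object[]>(); // {master, doc}

    TeeIndexer(List<Master> masters) { this.masters = masters; }

    void index(String doc) {
        for (Master m : masters) {
            try { m.index(doc); }
            catch (Exception e) { misses.add(new Object[] { m, doc }); } // out of sync: remember
        }
    }

    void replay() {
        List<Object[]> still = new ArrayList<Object[]>();
        for (Object[] miss : misses) {
            try { ((Master) miss[0]).index((String) miss[1]); }
            catch (Exception e) { still.add(miss); } // still down: keep queued
        }
        misses.clear();
        misses.addAll(still);
    }

    int pendingMisses() { return misses.size(); }
}

// Toy in-memory master used only to exercise the sketch.
class FakeMaster implements Master {
    final List<String> docs = new ArrayList<String>();
    boolean down = false;
    public void index(String doc) throws Exception {
        if (down) throw new Exception("master unreachable");
        docs.add(doc);
    }
}
```

Note the objection raised earlier in the thread still applies: if the tee itself dies between the two index() calls, the masters diverge with no record of it, so the replay queue would itself need durable storage, which the no-disk constraint mentioned later rules out.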
Re: True master-master fail-over without data gaps
Hi,

- Original Message -
I'd honestly think about buffering the incoming documents in some store that's actually made for fail-over persistence reliability, maybe CouchDB or something. And then that's taking care of not losing anything, and the problem becomes how we make sure that our solr master indexes are kept in sync with the actual persistent store; which I'm still not sure about, but I'm thinking it's a simpler problem. The right tool for the right job: that kind of failover persistence is not solr's specialty.

But check this! In some cases one is not allowed to save content to disk (think copyrights). I'm not making this up - we actually have a customer with this "cannot save to disk (but can index)" requirement. So buffering to disk is not an option, and buffering in memory is not practical because of the input document rate and their size.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
Re: True master-master fail-over without data gaps
On Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote: You mean it's not possible to have 2 masters that are in nearly real-time sync? How about with DRBD? I know people use DRBD to keep 2 Hadoop NNs (their edit logs) in sync to avoid the current NN SPOF, for example, so I'm thinking this could be doable with Solr masters, too, no? Otis If you add fault tolerance, you run into the CAP theorem. Consistency, availability, partition tolerance: choose two. You cannot have it all. wunder -- Walter Underwood
RE: True master-master fail-over without data gaps
...but the index resides on disk doesn't it??? lol -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 09, 2011 9:06 AM To: solr-user@lucene.apache.org Subject: Re: True master-master fail-over without data gaps But check this! In some cases one is not allowed to save content to disk (think copyrights). I'm not making this up - we actually have a customer with this cannot save to disk (but can index) requirement. So buffering to disk is not an option, and buffering in memory is not practical because of the input document rate and their size. Otis
Re: True master-master fail-over without data gaps
Hi, - Original Message Currently I use an application connected to a queue containing incoming data which my indexer app turns into solr docs. I log everything to a log table and have never had an issue with losing anything. Yeah, if everything goes through some storage that can be polled (either a DB or a durable JMS Topic or some such), then N masters could connect to it, not miss anything, and be more or less in near real-time sync. I can trace incoming docs exactly, and keep timing data in there also. If I added a second solr url for a second master and resent the same doc to master02 that I sent to master01, I would expect near 100% synchronization. The problem here is how to get the slave farm to start replicating from the second master if and when the first goes down. I can only see that as being a manual operation, repointing the slaves to master02 and restarting or reloading them etc... Actually, you can configure a LB to handle that, so that's less of a problem, I think. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 09, 2011 8:52 AM To: solr-user@lucene.apache.org Subject: Re: True master-master fail-over without data gaps Hi, - Original Message From: Robert Petersen rober...@buy.com To: solr-user@lucene.apache.org Sent: Wed, March 9, 2011 11:40:56 AM Subject: RE: True master-master fail-over without data gaps If you have a wrapper, like an indexer app which prepares solr docs and sends them into solr, then it is simple. The wrapper is your 'tee' and it can send docs to both (or N) masters. Doesn't this make it too easy for 2 masters to get out of sync even if the problem is not with them? e.g. something happens in this tee component and it indexes a doc to master A, but not master B. 
Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -Original Message- From: Michael Sokolov [mailto:soko...@ifactory.com] Sent: Wednesday, March 09, 2011 4:14 AM To: solr-user@lucene.apache.org Cc: Jonathan Rochkind Subject: Re: True master-master fail-over without data gaps Yes, I think this should be pushed upstream - insert a tee in the document stream so that all documents go to both masters. Then use a load balancer to make requests of the masters. The tee itself then becomes a possible single point of failure, but you didn't say anything about the architecture of the document feed. Is that also fault-tolerant? -Mike
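The "tee" idea from this thread is concrete enough to sketch. The fan-out below is a hypothetical illustration, not code from the thread: the master URLs and the send function are invented, and a real implementation would POST each document to every master's /update handler. The point is the return value: per-master failures are reported rather than swallowed, which is exactly the out-of-sync window Otis worries about.

```python
# Hypothetical sketch of a "tee" indexer: fan each document out to every
# master and record which masters missed which docs, so they can be
# re-sent later instead of silently drifting out of sync.

MASTERS = ["http://master01:8983/solr", "http://master02:8983/solr"]  # assumed URLs


def send_to_master(master_url, doc):
    """Placeholder for a real HTTP POST to the master's /update handler."""
    raise NotImplementedError


def tee_index(doc, masters=MASTERS, send=send_to_master):
    """Send doc to every master; return the list of masters that failed."""
    failed = []
    for master in masters:
        try:
            send(master, doc)
        except Exception:
            # Don't stop: the remaining masters should still get the doc.
            failed.append(master)
    return failed
```

A caller would queue (master, doc) pairs from the returned list for retry; without that retry path, the tee alone does not keep the masters in sync.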
Re: True master-master fail-over without data gaps
On disk, yes, but only indexed, and thus far enough from the original content that storing terms in Lucene's inverted index is acceptable. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Robert Petersen rober...@buy.com To: solr-user@lucene.apache.org Sent: Wed, March 9, 2011 12:07:27 PM Subject: RE: True master-master fail-over without data gaps ...but the index resides on disk doesn't it??? lol
Re: True master-master fail-over without data gaps
RAMdisk ...but the index resides on disk doesn't it??? lol -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 09, 2011 9:06 AM To: solr-user@lucene.apache.org Subject: Re: True master-master fail-over without data gaps But check this! In some cases one is not allowed to save content to disk (think copyrights). So buffering to disk is not an option, and buffering in memory is not practical because of the input document rate and their size.
Re: True master-master fail-over without data gaps
This is why there's block cipher cryptography. On Wed, Mar 9, 2011 at 9:11 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: On disk, yes, but only indexed, and thus far enough from the original content that storing terms in Lucene's inverted index is acceptable. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: dataimport
Brian, I had the same problem a while back and set the JAVA_OPTS env variable to something my machine could handle. That may also be an option for you going forward. Adam On Wed, Mar 9, 2011 at 9:33 AM, Brian Lamb brian.l...@journalexperts.com wrote: This has since been fixed. The problem was that there was not enough memory on the machine. It works just fine now. On Tue, Mar 8, 2011 at 6:22 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : INFO: Creating a connection for entity id with URL: : jdbc:mysql://localhost/researchsquare_beta_library?characterEncoding=UTF8&zeroDateTimeBehavior=convertToNull : Feb 24, 2011 8:58:25 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 : call : INFO: Time taken for getConnection(): 137 : Killed : : So it looks like for whatever reason, the server crashes trying to do a full : import. When I add a LIMIT clause on the query, it works fine when the LIMIT : is only 250 records but if I try to do 500 records, I get the same message. ...wow. that's ... weird. I've never seen a java process just log Killed like that. The only time i've ever seen a process log Killed is if it was terminated by the os (ie: kill -9 pid) What OS are you using? how are you running solr? (ie: are you using the simple jetty example java -jar start.jar or are you using a different servlet container?) ... are you absolutely certain your machine doesn't have some sort of monitoring in place that kills jobs if they take too long, or use too much CPU? -Hoss
RE: True master-master fail-over without data gaps
I guess you could put a LB between slaves and masters, never thought of that! :) -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 09, 2011 9:10 AM To: solr-user@lucene.apache.org Subject: Re: True master-master fail-over without data gaps Actually, you can configure a LB to handle that, so that's less of a problem, I think. Otis
Re: True master-master fail-over without data gaps
Right. LB VIP on both sides of master(s). Black box. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Robert Petersen rober...@buy.com To: solr-user@lucene.apache.org Sent: Wed, March 9, 2011 12:16:31 PM Subject: RE: True master-master fail-over without data gaps I guess you could put a LB between slaves and masters, never thought of that! :)
Newb query question
Is there a way to perform string logic on the key field using a subquery or some other method? I.e., if the left 4 characters of the key are ABCD, then include or exclude those from the search. Here is the layman's pseudo code for what I'm wanting to do: *:* AND LEFT(KEY, 4) 'abcd' Anyone know that one?
Re: Newb query question
How about something like: for exclusion, +*:* -KEY:abcd* ; for inclusion, +*:* +KEY:abcd* Best Erick On Wed, Mar 9, 2011 at 12:34 PM, Daniel Baughman da...@hostworks.com wrote: Is there a way to perform string logic on the key field using a subquery or some other method? I.e., if the left 4 characters of the key are ABCD, then include or exclude those from the search. Here is the layman's pseudo code for what I'm wanting to do: *:* AND LEFT(KEY, 4) 'abcd' Anyone know that one?
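The two prefix clauses can be tried without a running Solr instance. This hypothetical sketch just assembles the query strings and mimics, in plain Python, which key values a KEY:abcd* prefix clause would match (field name KEY and prefix abcd are the ones from the thread):

```python
# Build the two query variants from the reply above and simulate which
# keys a prefix clause matches in include vs. exclude mode.

def prefix_query(field, prefix, exclude=False):
    """Return the Lucene query string for including/excluding a key prefix."""
    op = "-" if exclude else "+"
    return f"+*:* {op}{field}:{prefix}*"


def matches(keys, prefix, exclude=False):
    """Simulate the query against a list of key values."""
    return [k for k in keys if k.startswith(prefix) != exclude]
```

Note that the leading +*:* clause matters in the exclusion case: a pure-negative Lucene query matches nothing, so you subtract the prefix hits from the match-all set.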
Re: True master-master fail-over without data gaps
On 3/9/2011 12:05 PM, Otis Gospodnetic wrote: But check this! In some cases one is not allowed to save content to disk (think copyrights). I'm not making this up - we actually have a customer with this cannot save to disk (but can index) requirement. Do they realize that a Solr index is on disk, and if you save it to a Solr index it's being saved to disk? If they prohibited you from putting the doc in a stored field in Solr, I guess that would at least be somewhat consistent, although annoying. But I don't think it's our customers' job to tell us HOW to implement our software to get the results they want. They can certainly make you promise not to distribute or use copyrighted material, and they can even ask to see your security procedures to make sure it doesn't get out. But if you need to buffer documents to achieve the application they want, but they won't let you... Solr can't help you with that. As I suggested before though, I might rather buffer to a NoSQL store like MongoDB or CouchDB instead of actually to disk. Perhaps your customer won't notice those stores keep data on disk just like they haven't noticed Solr does. I am not an expert in various kinds of NoSQL stores, but I think some of them in fact specialize in the area of concern here: Absolute failover reliability through replication. Solr is not a store. So buffering to disk is not an option, and buffering in memory is not practical because of the input document rate and their size. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Otis Gospodnetic [otis_gospodne...@yahoo.com] Sent: Tuesday, March 08, 2011 11:45 PM To: solr-user@lucene.apache.org Subject: True master-master fail-over without data gaps Hello, What are some common or good ways to handle indexing (master) fail-over?
Re: True master-master fail-over without data gaps (choosing CA in CAP)
Hi, - Original Message From: Walter Underwood wun...@wunderwood.org On Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote: You mean it's not possible to have 2 masters that are in nearly real-time sync? How about with DRBD? I know people use DRBD to keep 2 Hadoop NNs (their edit logs) in sync to avoid the current NN SPOF, for example, so I'm thinking this could be doable with Solr masters, too, no? If you add fault-tolerant, you run into the CAP Theorem. Consistency, availability, partition: choose two. You cannot have it all. Right, so I'll take Consistency and Availability, and I'll put my 2 masters in the same rack (which has redundant switches, power supply, etc.) and thus minimize/avoid partitioning. Assuming the above actually works, I think my Q remains: How do you set up 2 Solr masters so they are in near real-time sync? DRBD? But here is maybe a simpler scenario that more people may be considering: Imagine 2 masters on 2 different servers in 1 rack, pointing to the same index on the shared storage (SAN) that also happens to live in the same rack. 2 Solr masters are behind 1 LB VIP that indexer talks to. The VIP is configured so that all requests always get routed to the primary master (because only 1 master can be modifying an index at a time), except when this primary is down, in which case the requests are sent to the secondary master. So in this case my Q is around automation of this, around Lucene index locks, around the need for manual intervention, and such. Concretely, if you have these 2 master instances, the primary master has the Lucene index lock in the index dir. When the secondary master needs to take over (i.e., when it starts receiving documents via LB), it needs to be able to write to that same index. But what if that lock is still around? One could use the Native lock to make the lock disappear if the primary master's JVM exited unexpectedly, and in that case everything *should* work and be completely transparent, right? 
That is, the secondary will start getting new docs, it will use its IndexWriter to write to that same shared index, which won't be locked for writes because the lock is gone, and everyone will be happy. Did I miss something important here? Assuming the above is correct, what if the lock is *not* gone because the primary master's JVM is actually not dead, although maybe unresponsive, so LB thinks the primary master is dead. Then the LB will route indexing requests to the secondary master, which will attempt to write to the index, but be denied because of the lock. So a human needs to jump in, remove the lock, and manually reindex failed docs if the upstream component doesn't buffer docs that failed to get indexed and doesn't retry indexing them automatically. Is this correct or is there a way to avoid humans here? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
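The lock behaviour discussed above is configurable; a hedged sketch of the relevant solrconfig.xml fragment (Solr 1.4/3.x element names, values illustrative, not a recommendation):

```xml
<!-- solrconfig.xml sketch: with the native lock, the OS releases the lock
     when the writer's JVM process dies, covering the clean-crash case -->
<mainIndex>
  <lockType>native</lockType>
  <!-- unlockOnStartup=true would force-remove a leftover lock at startup;
       risky on shared storage if the other master might still be writing -->
  <unlockOnStartup>false</unlockOnStartup>
</mainIndex>
```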
Re: Newb query question
Hi, It sounds like if you put those 4 chars in a separate field at index time you could apply your logic on that at search time. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Daniel Baughman da...@hostworks.com To: solr-user@lucene.apache.org Sent: Wed, March 9, 2011 12:34:54 PM Subject: Newb query question Is there a way to perform string logic on the key field using a subquery or some other method? I.e., if the left 4 characters of the key are ABCD, then include or exclude those from the search. Here is the layman's pseudo code for what I'm wanting to do: *:* AND LEFT(KEY, 4) 'abcd' Anyone know that one?
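A hedged sketch of Otis's suggestion, assuming a hypothetical key_prefix field populated at index time with the first 4 characters of the key (field and type names are illustrative):

```xml
<!-- schema.xml sketch: a string field holding LEFT(KEY, 4), filled by the
     indexing client (or via copyField plus an edge n-gram analyzer) -->
<field name="key_prefix" type="string" indexed="true" stored="false"/>
```

At query time the string logic becomes a filter, e.g. fq=key_prefix:abcd to include those records, or fq=-key_prefix:abcd to exclude them.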
RE: True master-master fail-over without data gaps (choosing CA in CAP)
Can't you skip the SAN and keep the indexes locally? Then you would have two redundant copies of the index and no lock issues. Also, Can't master02 just be a slave to master01 (in the master farm and separate from the slave farm) until such time as master01 fails? Then master02 would start receiving the new documents with an indexes complete up to the last replication at least and the other slaves would be directed by LB to poll master02 also... -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 09, 2011 9:47 AM To: solr-user@lucene.apache.org Subject: Re: True master-master fail-over without data gaps (choosing CA in CAP)
Re: Solr Hanging all of sudden with update/csv
After about 4-5 hours the merge completed (ran out of heap)..as you suggested, it was having memory issues. Read queries during the merge were working just fine (they were taking longer than normal, ~30-60 seconds). I think I need to do more reading on understanding the merge/optimization processes. I am beginning to think what I need to do is have lots of segments (i.e. frequent merges of smaller-sized segments; wouldn't that speed up the merging process when it actually runs?). A couple things I'm trying to wrap my head around: increasing the number of segments will improve indexing speed on the whole. The question I have is: when it needs to actually perform a merge, will having more segments make the merge process faster, or longer? Having a 4-hour merge (aka indexing request) is not really acceptable (unless I can control when that merge happens). We are using our Solr server differently than most: frequent inserts (in batches), with few reads. I would say having a 'long' query time is acceptable (say ~60 seconds). -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Hanging-all-of-sudden-with-update-csv-tp2652903p2656457.html Sent from the Solr - User mailing list archive at Nabble.com.
how would you design schema?
Hi, I'm investigating how to set up a schema like this: I want to index accounts and the products purchased (multiValued) by that account but I also need the ability to search by the date the product was purchased. It would be easy if the purchase date wasn't part of the requirements. How would the schema be designed? Is there a better approach? Thanks, Dan
Re: how would you design schema?
Would having a solr-document represent a 'product purchase per account' solve your problem? You could then easily link the date of purchase to the document as well as the account-number. e.g: fields: orderid (key), productid, product-characteristics, order-characteristics (including date of purchase). or in case of option of multiple products having a joined orderid: fields: cat(orderid,productid) (key), orderid, productid, product-characteristics, order-characteristics (including date of purchase). The difference to your setup (i.e: one document per account) is that the suggested setup above may return multiple documents when you search by account-nr, which may or may not be what you're after. hth, Geert-Jan 2011/3/9 dan whelan d...@adicio.com Hi, I'm investigating how to set up a schema like this: I want to index accounts and the products purchased (multiValued) by that account but I also need the ability to search by the date the product was purchased. It would be easy if the purchase date wasn't part of the requirements. How would the schema be designed? Is there a better approach? Thanks, Dan
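A hedged schema.xml sketch of the one-document-per-purchase layout described above (all field names are illustrative):

```xml
<!-- schema.xml sketch: one document per product purchase -->
<fields>
  <!-- concatenation of orderid and productid, used as the unique key -->
  <field name="order_product_id" type="string" indexed="true" stored="true" required="true"/>
  <field name="account_id"    type="string" indexed="true" stored="true"/>
  <field name="product_id"    type="string" indexed="true" stored="true"/>
  <field name="purchase_date" type="date"   indexed="true" stored="true"/>
</fields>
<uniqueKey>order_product_id</uniqueKey>
```

Searching by account_id then returns one document per purchase, and purchase_date is available for range queries and sorting.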
Sorting
Hi all, I know that I can add sort=score desc to the URL to sort in descending order. However, I would like to sort a MoreLikeThis response, which returns records like this:

<lst name="moreLikeThis">
  <result name="3" numFound="113611" start="0" maxScore="0.4392774"/>
  <result name="2" numFound="" start="0" maxScore="0.5392774"/>
</lst>

I don't want them grouped by result; I would just like to have them all thrown together and then sorted according to score. I have an XSLT which does put them all together and returns the following:

<moreLikeThis>
  <similar>
    <score>x.</score>
    <id>some_id</id>
  </similar>
</moreLikeThis>

However, it appears that it basically applies the stylesheet to <result name="3">, then <result name="2">. How can I make it so that with my XSLT the results appear sorted by score?
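The usual fix is an xsl:sort inside the template that merges the lists; a hedged XSLT 1.0 sketch against Solr's stock XML response layout (element names assume the standard response format):

```xml
<!-- XSLT sketch: merge the docs from all moreLikeThis result lists and
     emit them ordered by score, highest first -->
<xsl:template match="lst[@name='moreLikeThis']">
  <moreLikeThis>
    <xsl:apply-templates select="result/doc">
      <xsl:sort select="float[@name='score']" data-type="number" order="descending"/>
    </xsl:apply-templates>
  </moreLikeThis>
</xsl:template>
```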
Re: docBoost
Anyone have any clue on this one? On Tue, Mar 8, 2011 at 2:11 PM, Brian Lamb brian.l...@journalexperts.com wrote: Hi all, I am using dataimport to create my index and I want to use docBoost to assign higher weights to certain docs. I understand the concept behind docBoost but I haven't been able to find an example anywhere that shows how to implement it. Assuming the following config file:

<document>
  <entity name="animal" dataSource="animals" pk="id" query="SELECT * FROM animals">
    <field column="id" name="id" />
    <field column="genus" name="genus" />
    <field column="species" name="species" />
    <entity name="boosters" dataSource="boosts" query="SELECT boost_score FROM boosts WHERE animal_id = ${animal.id}">
      <field column="boost_score" name="boost_score" />
    </entity>
  </entity>
</document>

How do I add in a docBoost score? The boost score is currently in a separate table as shown above.
Re: docBoost
You can use the ScriptTransformer to perform the boost calculation and addition. http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

<dataConfig>
  <script><![CDATA[
    function f1(row) {
      // Add boost
      row.put('$docBoost', 1.5);
      return row;
    }
  ]]></script>
  <document>
    <entity name="e" pk="id" transformer="script:f1" query="select * from X"/>
  </document>
</dataConfig>

Regards, Jayendra
Re: True master-master fail-over without data gaps (choosing CA in CAP)
Hi, Original Message From: Robert Petersen rober...@buy.com Can't you skip the SAN and keep the indexes locally? Then you would have two redundant copies of the index and no lock issues. I could, but then I'd have the issue of keeping them in sync, which seems more fragile. I think SAN makes things simpler overall. Also, Can't master02 just be a slave to master01 (in the master farm and separate from the slave farm) until such time as master01 fails? Then No, because it wouldn't be in sync. It would always be N minutes behind, and when the primary master fails, the secondary would not have all the docs - data loss. master02 would start receiving the new documents with an indexes complete up to the last replication at least and the other slaves would be directed by LB to poll master02 also... Yeah, complete up to the last replication is the problem. It's a data gap that now needs to be filled somehow. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: Solr Hanging all of sudden with update/csv
Hi, You'll benefit from watching this segment merging video: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html And you'll appreciate the graph at the bottom: http://code.google.com/p/zoie/wiki/ZoieMergePolicy Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: danomano dshopk...@earthlink.net To: solr-user@lucene.apache.org Sent: Wed, March 9, 2011 1:17:08 PM Subject: Re: Solr Hanging all of sudden with update/csv
Re: True master-master fail-over without data gaps (choosing CA in CAP)
Hi Otis, Have you considered using Solandra with Quorum writes to achieve master/master with CA semantics? -Jake -- http://twitter.com/tjake
Re: Solr Hanging all of sudden with update/csv
You will need to cap the maximum segment size using LogByteSizeMergePolicy.setMaxMergeMB. As then you will only have segments that are of an optimal size, and Lucene will not try to create gigantic segments. I think though on the query side you will run out of heap space due to the terms index size. What version are you using?
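A hedged sketch of how setMaxMergeMB might be wired up in solrconfig.xml (Solr 3.x-style mergePolicy configuration; the 2048 MB cap is illustrative, not a recommendation):

```xml
<!-- solrconfig.xml sketch: cap merged segment size so merges stay bounded -->
<indexDefaults>
  <mergeFactor>10</mergeFactor>
  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <double name="maxMergeMB">2048.0</double>
  </mergePolicy>
</indexDefaults>
```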
Re: docBoost
That makes sense. As a follow up, is there a way to only conditionally use the boost score? For example, in some cases I want to use the boost score and in other cases I want all documents to be treated equally.
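For the conditional case, one hedged sketch is to let the transformer itself decide: only set $docBoost when the joined boosts row actually produced a score (function and variable names are illustrative):

```xml
<script><![CDATA[
  function addBoost(row) {
    // only boost documents that got a boost_score from the joined table;
    // everything else keeps the default boost of 1.0
    var b = row.get('boost_score');
    if (b != null) {
      row.put('$docBoost', b);
    }
    return row;
  }
]]></script>
```

Note that docBoost is baked in at index time; to switch boosting on and off per query, you would instead leave the index unboosted and apply a query-time boost (e.g. a bq parameter or boost function) only when wanted.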
Excluding results from more like this
Hi all, I'm using MoreLikeThis to find similar results but I'd like to exclude records by id number. For example, I use the following URL: http://localhost:8983/solr/search/?q=id:(2 3 5)&mlt=true&mlt.fl=description,id&fl=*,score How would I exclude record 4 from the MoreLikeThis results? I tried http://localhost:8983/solr/search/?q=id:(2 3 5)&mlt=true&mlt.fl=description,id&fl=*,score&mlt.q=!4 but that still returned record 4 in the MoreLikeThis results.
Fwd: some relational-type grouping with search
- Forwarded Message - From: l blevins l.blev...@comcast.net To: solr user mail solr-user-h...@lucene.apache.org Sent: Wednesday, March 9, 2011 4:03:06 PM Subject: some relational-type grouping with search I have a large database for which we have some good search capabilities now, but am interested to see if SOLR might be usable instead. That would gain us the additional text-search features and eliminate the high fees for some of the database features. If I have fields such as person_id, document_date, and measurement_value, I need to be able to fulfill the following types of searches that I cannot figure out how to do now: * limit the search to only the most recent (or earliest) document per person along with whatever other criteria are present (each person's LAST or FIRST document), * search and only return the most recent document per person (LAST or FIRST meeting the other criteria), * limit the search to only the documents with the max or min measurement_value per person, * search and return only the max or min measurement_value per person. All of these boil down to limiting by the max or min of either a date or numeric field within a group (by person in this case). I know these features are considered relational and that SOLR has declared that it is not really a relational search engine, but a number of highly placed persons that I work for are very interested in using SOLR. If we could satisfy this type of query, SOLR could fit our needs, so I feel compelled to ask this group whether these searches are possible.
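Server-side grouping of this kind was not available in stock Solr at the time (result grouping / field collapsing was still in development), so one hedged workaround is a client-side reduction over the matched rows; a minimal Python sketch of the "most recent document per person" case (field names follow the email, the data is made up):

```python
from datetime import date

# hypothetical search results: one dict per matching Solr document
docs = [
    {"person_id": "p1", "document_date": date(2011, 1, 5), "measurement_value": 3.2},
    {"person_id": "p1", "document_date": date(2011, 3, 1), "measurement_value": 1.1},
    {"person_id": "p2", "document_date": date(2010, 12, 9), "measurement_value": 7.4},
]

def latest_per_person(rows):
    """Keep only each person's most recent document."""
    latest = {}
    for row in rows:
        current = latest.get(row["person_id"])
        if current is None or row["document_date"] > current["document_date"]:
            latest[row["person_id"]] = row
    return latest

result = latest_per_person(docs)
print(sorted(result))  # one surviving document per person
```

The same reduction with min() on measurement_value covers the max/min-per-person cases; the cost is that all candidate rows must be fetched from Solr first.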
Re: Excluding results from more like this
Brian, ...?q=id:(2 3 5) -4 Otis --- Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Brian Lamb brian.l...@journalexperts.com To: solr-user@lucene.apache.org Sent: Wed, March 9, 2011 4:05:10 PM Subject: Excluding results from more like this
Same index is ranking differently on 2 machines
Hi, I am seeing an issue I do not understand and hope that someone can shed some light on it. For a particular search we are seeing a particular result rank in position 3 on one machine and position 8 on the production machine. Position 3 is our desired and roughly expected ranking. I have a local machine with Solr and a version deployed on a production server. My local machine's Solr and the production version are both checked out from our project's SVN trunk. They are identical files except for the data files (not in SVN) and database connection settings. The index is populated exclusively via data import handler queries to a database. I have exported the production database as-is to my local development machine so that my local machine and production have access to the very same data. I execute a total full-import on both. Still, I see a different position for this document that should surely rank in the same location, all else being equal. I ran a debugQuery diff to see how the scores were being computed (see the appendix at the foot of this email). As far as I can tell, every single query normalisation block of the debug is marginally different, e.g. -0.021368012 = queryNorm (local) +0.009944122 = queryNorm (production) Which leads to a final score of -2 versus +1, which is enough to skew the results from correct to incorrect (in terms of what we expect to see). - -2.286596 (local) +1.0651637 (production) I cannot explain this difference. The database is the same. The configuration is the same. I have fully imported from scratch on both servers. What am I missing?
Thank you for your time Allistair - snip APPENDIX - debugQuery=on DIFF --- untitled +++ (clipboard) @@ -1,51 +1,49 @@ -str name=L12411p +str name=L12411 -2.286596 = (MATCH) sum of: - 1.6891675 = (MATCH) sum of: -1.3198489 = (MATCH) max plus 0.01 times others of: - 0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of: -0.011795795 = queryWeight(text:dubai^0.1), product of: - 0.1 = boost +1.0651637 = (MATCH) sum of: + 0.7871359 = (MATCH) sum of: +0.6151879 = (MATCH) max plus 0.01 times others of: + 0.10713901 = (MATCH) weight(text:dubai in 1551), product of: +0.05489459 = queryWeight(text:dubai), product of: 5.520305 = idf(docFreq=65, maxDocs=6063) - 0.021368012 = queryNorm + 0.009944122 = queryNorm 1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of: 1.4142135 = tf(termFreq(text:dubai)=2) 5.520305 = idf(docFreq=65, maxDocs=6063) 0.25 = fieldNorm(field=text, doc=1551) - 1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of: -0.32609802 = queryWeight(profile:dubai^2.0), product of: + 0.6141165 = (MATCH) weight(profile:dubai^2.0 in 1551), product of: +0.15175761 = queryWeight(profile:dubai^2.0), product of: 2.0 = boost 7.6305184 = idf(docFreq=7, maxDocs=6063) - 0.021368012 = queryNorm + 0.009944122 = queryNorm 4.0466933 = (MATCH) fieldWeight(profile:dubai in 1551), product of: 1.4142135 = tf(termFreq(profile:dubai)=2) 7.6305184 = idf(docFreq=7, maxDocs=6063) 0.375 = fieldNorm(field=profile, doc=1551) -0.36931866 = (MATCH) max plus 0.01 times others of: - 0.0018293816 = (MATCH) weight(text:product^0.1 in 1551), product of: -0.003954251 = queryWeight(text:product^0.1), product of: - 0.1 = boost +0.17194802 = (MATCH) max plus 0.01 times others of: + 0.00851347 = (MATCH) weight(text:product in 1551), product of: +0.018402064 = queryWeight(text:product), product of: 1.8505468 = idf(docFreq=2589, maxDocs=6063) - 0.021368012 = queryNorm + 0.009944122 = queryNorm 0.4626367 = (MATCH) fieldWeight(text:product in 1551), product of: 1.0 = 
tf(termFreq(text:product)=1) 1.8505468 = idf(docFreq=2589, maxDocs=6063) 0.25 = fieldNorm(field=text, doc=1551) - 0.36930037 = (MATCH) weight(profile:product^2.0 in 1551), product of: -0.1725098 = queryWeight(profile:product^2.0), product of: + 0.17186289 = (MATCH) weight(profile:product^2.0 in 1551), product of: +0.08028162 = queryWeight(profile:product^2.0), product of: 2.0 = boost 4.036637 = idf(docFreq=290, maxDocs=6063) - 0.021368012 = queryNorm + 0.009944122 = queryNorm 2.14075 = (MATCH) fieldWeight(profile:product in 1551), product of: 1.4142135 = tf(termFreq(profile:product)=2) 4.036637 = idf(docFreq=290, maxDocs=6063) 0.375 = fieldNorm(field=profile, doc=1551) - 0.59742856 = (MATCH) max plus 0.01 times others of: -0.59742856 = weight(profile:dubai product~10^0.5 in 1551), product of: - 0.12465195 = queryWeight(profile:dubai product~10^0.5), product of: +
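For what it's worth, the diff above hints at a configuration difference rather than a data one: the local scores carry a 0.1 boost (text:dubai^0.1, text:product^0.1) while production shows the same terms unboosted, and queryNorm in Lucene's DefaultSimilarity is derived from the squared weights of every clause, boosts included, so any boost difference shifts every queryNorm in the query. A hedged Python sketch of that arithmetic, simplified to just the two dubai clauses (the real query has more, so the absolute numbers will not match the debug output):

```python
import math

def query_norm(clauses):
    # DefaultSimilarity: queryNorm = 1 / sqrt(sum of squared clause weights),
    # where a term clause's weight is idf * boost
    return 1.0 / math.sqrt(sum((idf * boost) ** 2 for idf, boost in clauses))

# idf values copied from the debug output (identical on both machines)
IDF_TEXT_DUBAI = 5.520305
IDF_PROFILE_DUBAI = 7.6305184

local = query_norm([(IDF_TEXT_DUBAI, 0.1), (IDF_PROFILE_DUBAI, 2.0)])  # text:dubai^0.1
prod = query_norm([(IDF_TEXT_DUBAI, 1.0), (IDF_PROFILE_DUBAI, 2.0)])   # text:dubai, no boost
print(local, prod)  # the norms differ even though the idf values match
```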
FunctionQueries and FieldCache and OOM
Hi,

In one of the environments I'm working on (4 Solr 1.4.1 nodes with replication, 3+ million docs, ~5.5GB index size, high commit rate (~1-2min), high query rate (~50q/s), high number of updates (~1000 docs/commit)) the nodes continuously run out of memory. During development we frequently ran excessive stress tests, and after tuning JVM and Solr settings everything ran fine.

A while ago I added the DisMax bq parameter for boosting recent documents; documents older than a day receive 50% less boost, similar to the example but with a much steeper slope. For clarity, I'm not using the ordinal function but the reciprocal version in the bq parameter, which the wiki warns against when using Solr 1.4.1.

This week we started the stress tests again and the nodes are going down. I've reconfigured the nodes to have different settings for the bq parameter (or no bq parameter at all), and the bq parameter seems to be the cause of the misery. Issue SOLR- keeps popping up but it has not been resolved. Is there anyone who can confirm one of those patches fixes this issue before I waste hours of work finding out it doesn't? ;)

Am I correct to assume that Lucene FieldCache entries are added for each unique function query? In that case, every query is a unique cache entry because it operates on milliseconds. If nothing else works I might be able to reduce precision by operating on minutes or more instead of milliseconds. I cannot, however, use other math functions inside the ms() parameter, so that might make things difficult. Date math does seem available (NOW/HOUR), so I assume it would work for SOME_DATE_FIELD/HOUR as well. This way I just might prevent useless entries.

My apologies for this long mail, but it may prove useful for other users; hopefully we find the solution and can update the wiki to add this warning.

Cheers,
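For reference, the recency-boost pattern on the Solr wiki already rounds NOW to reduce the number of distinct function-query values; a hypothetical bq along those lines (the field name mydatefield and the constants are illustrative, not taken from this setup):

```
bq=recip(ms(NOW/HOUR,mydatefield),3.16e-11,1,1)
```

Rounding NOW to the hour keeps the computed boost stable for an hour at a time; whether the field value itself can be rounded the same way would need to be checked against the FunctionQuery documentation for 1.4.1.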
Re: Excluding results from more like this
That doesn't seem to do it. Record 4 is still showing up in the MoreLikeThis results.

On Wed, Mar 9, 2011 at 4:12 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Brian, ...?q=id:(2 3 5) -4 Otis --- Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Brian Lamb brian.l...@journalexperts.com To: solr-user@lucene.apache.org Sent: Wed, March 9, 2011 4:05:10 PM Subject: Excluding results from more like this

Hi all, I'm using MoreLikeThis to find similar results but I'd like to exclude records by id number. For example, I use the following URL:

http://localhost:8983/solr/search/?q=id:(2 3 5)&mlt=true&mlt.fl=description,id&fl=*,score

How would I exclude record 4 from the MoreLikeThis results? I tried:

http://localhost:8983/solr/search/?q=id:(2 3 5)&mlt=true&mlt.fl=description,id&fl=*,score&mlt.q=!4

But that still returned record 4 in the MoreLikeThis results.
Re: Same index is ranking differently on 2 machines
queryNorm is just a normalizing factor and is the same value across all the results for a query, to make the scores comparable. So even if it varies between environments, you should not worry about it.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm

Definition: queryNorm(q) is just a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk wrote:

Hi, I am seeing an issue I do not understand and hope that someone can shed some light on it. For a particular search we are seeing a particular result rank in position 3 on one machine and position 8 on the production machine. Position 3 is our desired and roughly expected ranking. I have a local machine with solr and a version deployed on a production server. My local machine's solr and the production version are both checked out from our project's SVN trunk. They are identical files except for the data files (not in SVN) and database connection settings. The index is populated exclusively via data import handler queries to a database. I have exported the production database as-is to my local development machine so that both machines have access to the self-same data. I execute a total full-import on both. Still, I see a different position for this document that should surely rank in the same location, all else being equal. I ran a debugQuery diff to see how the scores were being computed (see appendix at the foot of this email). As far as I can tell, every single query normalisation block of the debug is marginally different, e.g. 
-0.021368012 = queryNorm (local)
+0.009944122 = queryNorm (production)

which leads to a final score of -2.286596 (local) versus +1.0651637 (production), enough to skew the results from correct to incorrect (in terms of what we expect to see). I cannot explain this difference. The database is the same. The configuration is the same. I have fully imported from scratch on both servers. What am I missing?

Thank you for your time
Allistair

- snip APPENDIX - [debugQuery diff omitted; identical to the one quoted at the top of this digest]
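As background on why queryNorm can differ even with identical data: in Lucene's DefaultSimilarity it is derived from the squared weights of all terms in the query, so any change to a per-field boost shifts it. Roughly, per the Similarity javadoc linked above:

```latex
\mathrm{queryNorm}(q) = \frac{1}{\sqrt{\mathrm{sumOfSquaredWeights}}},
\qquad
\mathrm{sumOfSquaredWeights} = \sum_{t \in q} \big(\mathrm{idf}(t)\cdot\mathrm{boost}(t)\big)^2
```

So two machines producing different queryNorms for the "same" query against the same index is itself a hint that the parsed queries (boosts included) are not actually the same.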
Re: Excluding results from more like this
Yeah, that just restricts what items are in your main result set (and adding -4 has no real effect). The more-like-this set is constructed from your main result set, for each document in it. As far as I can see from here:

http://wiki.apache.org/solr/MoreLikeThis

...there seems to be no built-in way to exclude certain document ids from the 'more like this' results. I don't entirely understand what mlt.boost does, but I don't think it does anything useful for this case. So, if that's so, you are out of luck unless you want to write Java code. In that case you could try customizing or adding that feature to the MoreLikeThis search component, and either suggest your new code back as a patch, or just use your own customized version of MoreLikeThis.

On 3/9/2011 4:29 PM, Brian Lamb wrote: [quoted message snipped; see Brian's mail above]
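Short of patching the MoreLikeThis component, one pragmatic workaround is to over-fetch the MLT results (e.g. ask for mlt.count plus the number of exclusions) and drop the unwanted ids client-side once the response comes back. A minimal Java sketch; the document shape and id values are hypothetical, and this is plain client-side filtering, not a Solr API:

```java
import java.util.*;
import java.util.stream.*;

public class MltFilter {
    // Client-side workaround: drop excluded ids from an MLT result list
    // after over-fetching enough documents to cover the exclusions.
    static List<Map<String, Object>> excludeIds(List<Map<String, Object>> docs,
                                                Set<String> excluded) {
        return docs.stream()
                .filter(d -> !excluded.contains(String.valueOf(d.get("id"))))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical MLT response for one source document.
        List<Map<String, Object>> mlt = List.of(
                Map.of("id", "2", "score", 1.4),
                Map.of("id", "4", "score", 1.1),
                Map.of("id", "7", "score", 0.9));
        System.out.println(excludeIds(mlt, Set.of("4")).size());
    }
}
```

The obvious drawback is that the excluded documents still consume server-side work and slots in the MLT ranking, which is why a server-side patch would be the cleaner fix.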
Re: Same index is ranking differently on 2 machines
Yes, but the identical index with the identical solrconfig.xml and the identical query and the identical version of Solr on two different machines should produce identical results. So it's a legitimate question why it's not. But perhaps queryNorm isn't enough to answer that; sorry, it's out of my league to try and figure it out. But are you absolutely sure you have identical indexes, identical solrconfig.xml, identical queries, and identical versions of Solr and any other installed Java libraries... on both machines? One of these being different seems more likely than a bug in Solr, although that's possible.

On 3/9/2011 4:34 PM, Jayendra Patil wrote: [quoted thread and debugQuery appendix snipped]
Re: Same index is ranking differently on 2 machines
Thanks. Good to know, but even so my problem remains: the end score should not be different, and it is causing a dramatically different ranking of a document (3 versus 7 is dramatic for my client). This must be down to the scoring debug differences - it's the only difference I can find :(

On Mar 9, 2011, at 4:34 PM, Jayendra Patil wrote: [quoted thread and debugQuery appendix snipped]
Indexing a text string for faceting
Hello all,

I have a small problem with my faceting fields. For each one I create a new faceting field which is indexed and not stored, and use copyField. The problem is I facet on category names which have values like this:

Policies & Documentation (37)
http://localhost:8080/apache-solr-1.4.1/select?q=Checklist%20Employee%20Hiring&facet=on&facet.field=fcategoryName&fq=fcategoryName:Policies%20%20Documentation

Forms & Checklists (22)
http://localhost:8080/apache-solr-1.4.1/select?q=Checklist%20Employee%20Hiring&facet=on&facet.field=fcategoryName&fq=fcategoryName:Forms%20%20Checklists

Right now my fields use the string type, which is not good because I think by default it is using a tokenizer etc. I think I must define a new field type so that my category names will be properly indexed as a facet field. Here is what I have now:

<field name="categoryName" type="text" indexed="true" stored="true"/>
<field name="typeName" type="text" indexed="true" stored="true"/>
<field name="ftypeName" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="fcategoryName" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="typeName" dest="ftypeName"/>
<copyField source="categoryName" dest="fcategoryName"/>

Can someone give me a type configuration which will support my category names with whitespaces and ampersands?

Thanks in advance
Greg
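For what it's worth, the stock string type in Solr's example schema is backed by solr.StrField, which applies no tokenizer at all: the entire value, spaces and ampersands included, is indexed as a single term, which is what faceting on full category names needs. A sketch of the relevant pieces, assuming the field names above:

```xml
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<field name="fcategoryName" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="categoryName" dest="fcategoryName"/>
```

If facet values still come back broken apart, the likelier culprits are URL-encoding of the ampersand in the fq value (it needs to be sent as %26) rather than the analysis chain.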
Re: Same index is ranking differently on 2 machines
That's what I think, glad I am not going mad. I've spent half a day comparing the config files, checking out from SVN again, and ensuring the databases are identical. I cannot see what else I can do to make them equivalent. Both servers check out directly from SVN; I am convinced the files are the same. The database is definitely the same. Not sure what you mean about having identical indices - that's my problem - I don't - or do you mean something else I've missed? But yes, everything else you mention is identical, as certain as I can be. I too think there must be a difference I have missed, but I have run out of ideas for what to check! Frustrating :)

On Mar 9, 2011, at 4:38 PM, Jonathan Rochkind wrote: [quoted thread and debugQuery appendix snipped]
Re: Same index is ranking differently on 2 machines
Are you sure you have the same config? The boost seems different for the field text: text:dubai^0.1 versus text:dubai.

-2.286596 = (MATCH) sum of:
-  1.6891675 = (MATCH) sum of:
-    1.3198489 = (MATCH) max plus 0.01 times others of:
-      0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
-        0.011795795 = queryWeight(text:dubai^0.1), product of:
-          0.1 = boost
+1.0651637 = (MATCH) sum of:
+  0.7871359 = (MATCH) sum of:
+    0.6151879 = (MATCH) max plus 0.01 times others of:
+      0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
+        0.05489459 = queryWeight(text:dubai), product of:

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:38 PM, Allistair Crossley a...@roxxor.co.uk wrote: [quoted thread and debugQuery appendix snipped]
Re: Same index is ranking differently on 2 machines
On Wed, Mar 9, 2011 at 4:49 PM, Jayendra Patil jayendra.patil@gmail.com wrote: Are you sure you have the same config ... The boost seems different for the field text - text:dubai^0.1 text:dubai Yep... Try adding echoParams=all and see all the parameters solr is acting on. http://wiki.apache.org/solr/CoreQueryParameters#echoParams -Yonik http://lucidimagination.com -2.286596 = (MATCH) sum of: - 1.6891675 = (MATCH) sum of: - 1.3198489 = (MATCH) max plus 0.01 times others of: - 0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of: - 0.011795795 = queryWeight(text:dubai^0.1), product of: - 0.1 = boost +1.0651637 = (MATCH) sum of: + 0.7871359 = (MATCH) sum of: + 0.6151879 = (MATCH) max plus 0.01 times others of: + 0.10713901 = (MATCH) weight(text:dubai in 1551), product of: + 0.05489459 = queryWeight(text:dubai), product of: Regards, Jayendra On Wed, Mar 9, 2011 at 4:38 PM, Allistair Crossley a...@roxxor.co.uk wrote: Thanks. Good to know, but even so my problem remains - the end score should not be different and is causing a dramatically different ranking of a document (3 versus 7 is dramatic for my client). This must be down to the scoring debug differences - it's the only difference I can find :( On Mar 9, 2011, at 4:34 PM, Jayendra Patil wrote: queryNorm is just a normalizing factor and is the same value across all the results for a query, to just make the scores comparable. So even if it varies in different environment, you should not worried about. http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm - Defination - queryNorm(q) is just a normalizing factor used to make scores between queries comparable. 
This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable Regards, Jayendra On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk wrote: Hi, I am seeing an issue I do not understand and hope that someone can shed some light on this. The issue is that for a particular search we are seeing a particular result rank in position 3 on one machine and position 8 on the production machine. The position 3 is our desired and roughly expected ranking. I have a local machine with solr and a version deployed on a production server. My local machine's solr and the production version are both checked out from our project's SVN trunk. They are identical files except for the data files (not in SVN) and database connection settings. The index is populated exclusively via data import handler queries to a database. I have exported the production database as-is to my local development machine so that my local machine and production have access to the self same data. I execute a total full-import on both. Still, I see a different position for this document that should surely rank in the same location, all else being equal. I ran debugQuery diff to see how the scores were being computed. See appendix at foot of this email. As far as I can tell every single query normalisation block of the debug is marginally different, e.g. - 0.021368012 = queryNorm (local) + 0.009944122 = queryNorm (production) Which leads to a final score of -2 versus +1 which is enough to skew the results from correct to incorrect (in terms of what we expect to see). - -2.286596 (local) +1.0651637 = (production) I cannot explain this difference. The database is the same. The configuration is the same. I have fully imported from scratch on both servers. What am I missing? 
Thank you for your time Allistair

- snip APPENDIX - debugQuery=on DIFF

--- untitled
+++ (clipboard)
@@ -1,51 +1,49 @@
-str name=L12411p
+str name=L12411
-2.286596 = (MATCH) sum of:
- 1.6891675 = (MATCH) sum of:
- 1.3198489 = (MATCH) max plus 0.01 times others of:
- 0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
- 0.011795795 = queryWeight(text:dubai^0.1), product of:
- 0.1 = boost
+1.0651637 = (MATCH) sum of:
+ 0.7871359 = (MATCH) sum of:
+ 0.6151879 = (MATCH) max plus 0.01 times others of:
+ 0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
+ 0.05489459 = queryWeight(text:dubai), product of:
 5.520305 = idf(docFreq=65, maxDocs=6063)
- 0.021368012 = queryNorm
+ 0.009944122 = queryNorm
 1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
 1.4142135 = tf(termFreq(text:dubai)=2)
 5.520305 = idf(docFreq=65, maxDocs=6063)
 0.25 = fieldNorm(field=text, doc=1551)
- 1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
- 0.32609802 = queryWeight(profile:dubai^2.0), product of:
+ 0.6141165 = (MATCH)
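The queryWeight figures in the two explains can be recomputed from the classic Lucene similarity, which makes the configuration difference visible: the local explain carries a 0.1 boost on text:dubai that the production one lacks. A minimal sketch (the numeric values are copied from the debug output above; the formula queryWeight = boost * idf * queryNorm is the classic TF-IDF one):

```python
def query_weight(boost: float, idf: float, query_norm: float) -> float:
    # Classic Lucene similarity: queryWeight(t) = boost(t) * idf(t) * queryNorm(q)
    return boost * idf * query_norm

idf = 5.520305  # idf(docFreq=65, maxDocs=6063), identical in both explains

# Local machine: text:dubai carries a 0.1 boost (text:dubai^0.1 in the explain)
local = query_weight(0.1, idf, 0.021368012)
# Production: text:dubai is unboosted
prod = query_weight(1.0, idf, 0.009944122)

print(local)  # ~0.011795795, matching the local queryWeight
print(prod)   # ~0.05489459, matching the production queryWeight
```

Since the idf terms agree, the differing boost (and the queryNorm derived from it) accounts for the entire discrepancy.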
Re: Same index is ranking differently on 2 machines
Oh wow, how did I miss that? My apologies to anyone who read this post. I should have diffed my custom dismax handler. Looks like my SVN merge didn't work properly. Embarrassing. Thanks everyone ;) On Mar 9, 2011, at 4:51 PM, Yonik Seeley wrote: On Wed, Mar 9, 2011 at 4:49 PM, Jayendra Patil jayendra.patil@gmail.com wrote: Are you sure you have the same config ... The boost seems different for the field text - text:dubai^0.1 text:dubai Yep... Try adding echoParams=all and see all the parameters solr is acting on. http://wiki.apache.org/solr/CoreQueryParameters#echoParams -Yonik http://lucidimagination.com
Math-generated fields during query
Hi, I was wondering if it is possible during a query to create a returned field 'on the fly' (like function query, but for concrete values, not score). For example, if I input this query: q=_val_:product(15,3)fl=*,score For every returned document, I get score = 45. If I change it slightly to add *:* like this: q=*:* _val_:product(15,3)fl=*,score I get score = 32.526913. If I try my use case of _val_:product(qty_ordered,unit_price), I get varying scores depending on...well depending on something. I understand this is doing relevance scoring, but it doesn't seem to tally with the FunctionQuery Wiki [example at the bottom of the page]: q=boxname:findbox+_val_:product(product(x,y),z)fl=*,score ...where score will contain the resultant volume. Is there a trick to getting not a score, but the actual value of quantity*price (e.g. product(5,2.21) == 11.05)? Many thanks
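One way to keep the score equal to the raw function value is to make the function query the only scoring clause and push every other constraint into fq, which filters matches without contributing to the score. A sketch of building such a request (the host, handler, and the type:order filter are hypothetical; the field names come from the question above):

```python
from urllib.parse import urlencode

base = "http://localhost:8983/solr/select"

# When the function query is the *only* scoring clause, the score is the raw
# function value (as in the product(15,3) -> 45 case above). Mixing in *:* or
# other query clauses adds their relevance contribution, which is why the
# score then stops being the plain product.
params = {
    "q": '_val_:"product(qty_ordered,unit_price)"',
    "fl": "*,score",
    "fq": "type:order",  # hypothetical filter: constrains matches, never the score
}
url = base + "?" + urlencode(params)
print(url)
```

The design point is that fq restricts the result set but is excluded from scoring, so the score column stays a pure quantity*price value.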
Re: True master-master fail-over without data gaps (choosing CA in CAP)
I was just about to jump in this conversation to mention Solandra and go fig, Solandra's committer comes in. :-) It was nice to meet you at Strata, Jake. I haven't dug into the code yet but Solandra strikes me as a killer way to scale Solr. I'm looking forward to playing with it; particularly looking at disk requirements and performance measurements. ~ David Smiley On Mar 9, 2011, at 3:14 PM, Jake Luciani wrote: Hi Otis, Have you considered using Solandra with Quorum writes to achieve master/master with CA semantics? -Jake On Wed, Mar 9, 2011 at 2:48 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, Original Message From: Robert Petersen rober...@buy.com Can't you skip the SAN and keep the indexes locally? Then you would have two redundant copies of the index and no lock issues. I could, but then I'd have the issue of keeping them in sync, which seems more fragile. I think SAN makes things simpler overall. Also, Can't master02 just be a slave to master01 (in the master farm and separate from the slave farm) until such time as master01 fails? Then No, because it wouldn't be in sync. It would always be N minutes behind, and when the primary master fails, the secondary would not have all the docs - data loss. master02 would start receiving the new documents with an indexes complete up to the last replication at least and the other slaves would be directed by LB to poll master02 also... Yeah, complete up to the last replication is the problem. It's a data gap that now needs to be filled somehow. 
Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 09, 2011 9:47 AM To: solr-user@lucene.apache.org Subject: Re: True master-master fail-over without data gaps (choosing CA in CAP) Hi, - Original Message From: Walter Underwood wun...@wunderwood.org On Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote: You mean it's not possible to have 2 masters that are in nearly real-time sync? How about with DRBD? I know people use DRBD to keep 2 Hadoop NNs (their edit logs) in sync to avoid the current NN SPOF, for example, so I'm thinking this could be doable with Solr masters, too, no? If you add fault-tolerant, you run into the CAP Theorem. Consistency, availability, partition: choose two. You cannot have it all. Right, so I'll take Consistency and Availability, and I'll put my 2 masters in the same rack (which has redundant switches, power supply, etc.) and thus minimize/avoid partitioning. Assuming the above actually works, I think my Q remains: How do you set up 2 Solr masters so they are in near real-time sync? DRBD? But here is maybe a simpler scenario that more people may be considering: Imagine 2 masters on 2 different servers in 1 rack, pointing to the same index on the shared storage (SAN) that also happens to live in the same rack. 2 Solr masters are behind 1 LB VIP that indexer talks to. The VIP is configured so that all requests always get routed to the primary master (because only 1 master can be modifying an index at a time), except when this primary is down, in which case the requests are sent to the secondary master. So in this case my Q is around automation of this, around Lucene index locks, around the need for manual intervention, and such. Concretely, if you have these 2 master instances, the primary master has the Lucene index lock in the index dir. 
When the secondary master needs to take over (i.e., when it starts receiving documents via LB), it needs to be able to write to that same index. But what if that lock is still around? One could use the Native lock to make the lock disappear if the primary master's JVM exited unexpectedly, and in that case everything *should* work and be completely transparent, right? That is, the secondary will start getting new docs, it will use its IndexWriter to write to that same shared index, which won't be locked for writes because the lock is gone, and everyone will be happy. Did I miss something important here? Assuming the above is correct, what if the lock is *not* gone because the primary master's JVM is actually not dead, although maybe unresponsive, so LB thinks the primary master is dead. Then the LB will route indexing requests to the secondary master, which will attempt to write to the index, but be denied because of the lock. So a human needs to jump in, remove the lock, and manually reindex failed docs if the upstream component doesn't buffer docs that failed to get indexed and doesn't retry indexing them automatically. Is this correct or is there a way to avoid humans here? Thanks, Otis Sematext ::
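The "Native lock" behavior described here can be illustrated with an OS-level advisory lock, which the kernel releases automatically when the holding process dies - exactly the property that makes transparent failover possible, and the property a plain lock file on disk lacks. This is only an analogy using POSIX flock (Unix-only); Lucene's NativeFSLockFactory actually uses java.nio FileLock, not this call:

```python
import fcntl
import os
import tempfile

lock_path = os.path.join(tempfile.gettempdir(), "write.lock")

primary = open(lock_path, "w")  # primary master acquires the index write lock
fcntl.flock(primary, fcntl.LOCK_EX | fcntl.LOCK_NB)

secondary = open(lock_path, "w")  # secondary master tries to take over
try:
    fcntl.flock(secondary, fcntl.LOCK_EX | fcntl.LOCK_NB)
    acquired_while_held = True
except BlockingIOError:
    # Primary's JVM is alive (even if unresponsive to the LB): lock denied,
    # which is the scenario needing human intervention.
    acquired_while_held = False

primary.close()  # primary process "dies": the kernel drops its lock
fcntl.flock(secondary, fcntl.LOCK_EX | fcntl.LOCK_NB)  # takeover now succeeds
secondary.close()
print(acquired_while_held)
```

The sketch shows both halves of the question: a dead primary releases the lock with no cleanup, while a merely-unresponsive primary still holds it.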
Re: True master-master fail-over without data gaps (choosing CA in CAP)
Doesn't Solandra partition by term instead of document? On Wed, Mar 9, 2011 at 2:13 PM, Smiley, David W. dsmi...@mitre.org wrote: I was just about to jump in this conversation to mention Solandra and go fig, Solandra's committer comes in. :-) It was nice to meet you at Strata, Jake. I haven't dug into the code yet but Solandra strikes me as a killer way to scale Solr. I'm looking forward to playing with it; particularly looking at disk requirements and performance measurements. ~ David Smiley
Re: Same index is ranking differently on 2 machines
Wait, if you don't have identical indexes, then why would you expect identical results? If your indexes are different, one would expect the results for the same query to be different -- there are different documents in the index! The IDF portion of the TF-IDF-type algorithm at the base of Solr's relevancy will also be different in different indexes. http://en.wikipedia.org/wiki/Tf%E2%80%93idf Maybe I'm misunderstanding you. But if you have different indexes -- not exactly the same collection of documents indexed using exactly the same field definitions and rules -- then one should expect different relevance results. Jonathan On 3/9/2011 4:48 PM, Allistair Crossley wrote: That's what I think, glad I am not going mad. I've spent 1/2 a day comparing the config files, checking out from SVN again and ensuring the databases are identical. I cannot see what else I can do to make them equivalent. Both servers check out directly from SVN, I am convinced the files are the same. The database is definitely the same. Not sure what you mean about having identical indices - that's my problem - I don't - or do you mean something else I've missed? But yes everything else you mention is identical, I am as certain as I can be. I too think there must be a difference I have missed but I have run out of ideas for what to check! Frustrating :) On Mar 9, 2011, at 4:38 PM, Jonathan Rochkind wrote: Yes, but the identical index with the identical solrconfig.xml and the identical query and the identical version of Solr on two different machines should produce identical results. So it's a legitimate question why it's not. But perhaps queryNorm isn't enough to answer that. Sorry, it's out of my league to try and figure it out. But are you absolutely sure you have identical indexes, identical solrconfig.xml, identical queries, and identical versions of Solr and any other installed Java libraries... on both machines? 
One of these being different seems more likely than a bug in Solr, although that's possible.
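Jonathan's IDF point is easy to check against the explain output earlier in the thread: Lucene's classic idf depends only on docFreq and maxDocs, so two indexes holding different document sets necessarily produce different idf terms. A sketch of the classic formula, which reproduces the idf(docFreq=65, maxDocs=6063) = 5.520305 value that both machines happened to agree on:

```python
import math

def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Lucene's classic (TF-IDF) similarity: idf(t) = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# Matches the "5.520305 = idf(docFreq=65, maxDocs=6063)" line in the debug diff.
# If the two indexes held different documents, docFreq/maxDocs -- and hence
# every idf term -- would differ between machines.
print(round(classic_idf(65, 6063), 6))  # -> 5.520305
```

That the idf terms matched while queryWeight did not is itself a hint the indexes were the same and the query configuration was not.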
Re: NRT in Solr
Zoie adds NRT to Solr: http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin I haven't tried it yet but looks cool. ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/ On Mar 9, 2011, at 9:01 AM, Jason Rutherglen wrote: Jae, NRT hasn't been implemented in Solr as of yet, I think partially because major features such as replication, caching, and uninverted faceting suddenly are no longer viable, eg, it's another round of testing etc. It's doable, however I think the best approach is a separate request call path, to avoid altering the current [working] API. On Tue, Mar 8, 2011 at 1:27 PM, Jae Joo jaejo...@gmail.com wrote: Hi, Is NRT in Solr 4.0 from trunk? I have checked it out from trunk, but could not find the configuration for NRT. Regards Jae
Re: NRT in Solr
Interesting, does anyone have a summary of what techniques zoie uses to do this? I don't see any docs on the technical details. On 3/9/2011 5:29 PM, Smiley, David W. wrote: Zoie adds NRT to Solr: http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin I haven't tried it yet but looks cool. ~ David Smiley
Re: True master-master fail-over without data gaps (choosing CA in CAP)
Jason, Its predecessor, Lucandra, did. But Solandra is a new approach that manages shards of documents across the cluster for you and uses Solr's distributed search to query indexes. Jake On Mar 9, 2011, at 5:15 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Doesn't Solandra partition by term instead of document?
Re: NRT in Solr
Jonathan, they have a Wiki up there somewhere, including pretty diagrams. If you have Lucene in Action, Zoie is one of the case studies and is described in a lot of detail. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Jonathan Rochkind rochk...@jhu.edu To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: Smiley, David W. dsmi...@mitre.org Sent: Wed, March 9, 2011 5:34:01 PM Subject: Re: NRT in Solr Interesting, does anyone have a summary of what techniques zoie uses to do this? I don't see any docs on the technical details.
Re: True master-master fail-over without data gaps (choosing CA in CAP)
Jake, Maybe it's time to come up with the Solandra/Solr matrix so we can see Solandra's strengths (e.g. RT, no replication) and weaknesses (e.g. I think I saw a mention of some big indices?) or missing features (e.g. no delete by query), etc. Thanks! Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Jake Luciani jak...@gmail.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Wed, March 9, 2011 6:04:13 PM Subject: Re: True master-master fail-over without data gaps (choosing CA in CAP) Jason, Its predecessor, Lucandra, did. But Solandra is a new approach that manages shards of documents across the cluster for you and uses Solr's distributed search to query indexes. Jake
Re: Fwd: some relational-type grouping with search
Probably you can just sort by date (one way and then the other) and limit your result set to a single document. That should free up enough budget for the bonuses of the highly-placed people, I think :) On 3/9/2011 4:05 PM, l.blev...@comcast.net wrote: - Forwarded Message - From: l blevins l.blev...@comcast.net To: solr user mail solr-user-h...@lucene.apache.org Sent: Wednesday, March 9, 2011 4:03:06 PM Subject: some relational-type grouping with search I have a large database for which we have some good search capabilities now, but am interested to see if SOLR might be usable instead. That would gain us the additional text-search features and eliminate the high fees for some of the database features. If I have fields such as person_id, document_date, and measurement_value. I need to be able to fulfill the following types of searches that I cannot figure out how to do now: * limit search to only the most recent (or earliest) document per person along with whatever other criteria is present (each person's LAST or FIRST document), * search and only return the most recent document per person (LAST or FIRST meeting the other criteria), * limit search to only the documents with the max or min measurement_value per person, * search and return only the max or min measurement_value per person All of these boil down to limiting by the max or min of either a date or numeric field within a group (by person in this case). I know these features are considered relational and that SOLR has declared that it is not really a relational search engine, but a number of highly placed persons that I work for are very interested in using SOLR. If we could satisfy this type of query, SOLR could fit our needs so I feel compelled to ask this group if these searches are possible.
Re: some relational-type grouping with search
It is not just one document that would be returned, it's one document per person. That is a little trickier. - Original Message - From: Michael Sokolov soko...@ifactory.com To: solr-user@lucene.apache.org Cc: l blevins l.blev...@comcast.net Sent: Wednesday, March 9, 2011 7:46:10 PM Subject: Re: Fwd: some relational-type grouping with search Probably you can just sort by date (one way and then the other) and limit your result set to a single document. That should free up enough budget for the bonuses of the highly-placed people, I think :) On 3/9/2011 4:05 PM, l.blev...@comcast.net wrote: - Forwarded Message - From: l blevins l.blev...@comcast.net To: solr user mail solr-user-h...@lucene.apache.org Sent: Wednesday, March 9, 2011 4:03:06 PM Subject: some relational-type grouping with search I have a large database for which we have some good search capabilities now, but am interested to see if SOLR might be usable instead. That would gain us the additional text-search features and eliminate the high fees for some of the database features. If I have fields such as person_id, document_date, and measurement_value. I need to be able to fulfill the following types of searches that I cannot figure out how to do now: * limit search to only the most recent (or earliest) document per person along with whatever other criteria is present (each person's LAST or FIRST document), * search and only return the most recent document per person (LAST or FIRST meeting the other criteria), * limit search to only the documents with the max or min measurement_value per person, * search and return only the max or min measurement_value per person All of these boil down to limiting by the max or min of either a date or numeric field within a group (by person in this case). I know these features are considered relational and that SOLR has declared that it is not really a relational search engine, but a number of highly placed persons that I work for are very interested in using SOLR. 
If we could satisfy this type of query, SOLR could fit our needs so I feel compelled to ask this group if these searches are possible.
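All four requests reduce to the same group-wise min/max reduction. As a plain-Java sketch of the logic being asked for (the Doc class here is a made-up stand-in for a Solr document, not Solr API; I believe the closest built-in answer at the time of this thread was the field-collapsing work in SOLR-236):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LatestPerPerson {
    // Hypothetical record standing in for a Solr document; field names
    // mirror the ones in the question (person_id, document_date).
    static class Doc {
        final String personId;
        final long date; // e.g. document_date as epoch millis
        Doc(String personId, long date) { this.personId = personId; this.date = date; }
    }

    // Reduce a result set to the most recent Doc per personId -- the
    // group-wise max the poster wants the search engine to perform.
    static Map<String, Doc> latestPerPerson(List<Doc> docs) {
        Map<String, Doc> latest = new HashMap<String, Doc>();
        for (Doc d : docs) {
            Doc current = latest.get(d.personId);
            if (current == null || d.date > current.date) {
                latest.put(d.personId, d);
            }
        }
        return latest;
    }
}
```

The same loop works for min-date or min/max measurement_value by swapping the comparison; the hard part, as the poster notes, is getting the engine to do this per group rather than post-processing the full result set client-side.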
Re: True master-master fail-over without data gaps (choosing CA in CAP)
Yeah sure. Let me update this on the Solandra wiki. I'll send across the link. I think you hit the main two shortcomings atm. -Jake On Wed, Mar 9, 2011 at 6:17 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Jake, Maybe it's time to come up with the Solandra/Solr matrix so we can see Solandra's strengths (e.g. RT, no replication) and weaknesses (e.g. I think I saw a mention of some big indices?) or missing features (e.g. no delete by query), etc. Thanks! Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Jake Luciani jak...@gmail.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Wed, March 9, 2011 6:04:13 PM Subject: Re: True master-master fail-over without data gaps (choosing CA in CAP) Jason, Its predecessor, Lucandra, did. But Solandra is a new approach that manages shards of documents across the cluster for you and uses Solr's distributed search to query indexes. Jake On Mar 9, 2011, at 5:15 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Doesn't Solandra partition by term instead of document? On Wed, Mar 9, 2011 at 2:13 PM, Smiley, David W. dsmi...@mitre.org wrote: I was just about to jump in this conversation to mention Solandra and go fig, Solandra's committer comes in. :-) It was nice to meet you at Strata, Jake. I haven't dug into the code yet but Solandra strikes me as a killer way to scale Solr. I'm looking forward to playing with it; particularly looking at disk requirements and performance measurements. ~ David Smiley On Mar 9, 2011, at 3:14 PM, Jake Luciani wrote: Hi Otis, Have you considered using Solandra with Quorum writes to achieve master/master with CA semantics? -Jake On Wed, Mar 9, 2011 at 2:48 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, Original Message From: Robert Petersen rober...@buy.com Can't you skip the SAN and keep the indexes locally? 
Then you would have two redundant copies of the index and no lock issues. I could, but then I'd have the issue of keeping them in sync, which seems more fragile. I think SAN makes things simpler overall. Also, Can't master02 just be a slave to master01 (in the master farm and separate from the slave farm) until such time as master01 fails? Then No, because it wouldn't be in sync. It would always be N minutes behind, and when the primary master fails, the secondary would not have all the docs - data loss. master02 would start receiving the new documents with an index complete up to the last replication at least and the other slaves would be directed by LB to poll master02 also... Yeah, complete up to the last replication is the problem. It's a data gap that now needs to be filled somehow. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 09, 2011 9:47 AM To: solr-user@lucene.apache.org Subject: Re: True master-master fail-over without data gaps (choosing CA in CAP) Hi, - Original Message From: Walter Underwood wun...@wunderwood.org On Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote: You mean it's not possible to have 2 masters that are in nearly real-time sync? How about with DRBD? I know people use DRBD to keep 2 Hadoop NNs (their edit logs) in sync to avoid the current NN SPOF, for example, so I'm thinking this could be doable with Solr masters, too, no? If you add fault tolerance, you run into the CAP Theorem. Consistency, availability, partition: choose two. You cannot have it all. Right, so I'll take Consistency and Availability, and I'll put my 2 masters in the same rack (which has redundant switches, power supply, etc.) and thus minimize/avoid partitioning. Assuming the above actually works, I think my Q remains: How do you set up 2 Solr masters so they are in near real-time sync? DRBD? 
But here is maybe a simpler scenario that more people may be considering: Imagine 2 masters on 2 different servers in 1 rack, pointing to the same index on the shared storage (SAN) that also happens to live in the same rack. 2 Solr masters are behind 1 LB VIP that indexer talks to. The VIP is configured so that all requests always get routed to the primary master (because only 1 master can be modifying an index at a time), except when this primary is down, in which case the requests are sent to
java.lang.ClassCastException being thrown seemingly at random
Hello, I'm using a recent build of the trunk (from 3/1). I've noticed that after the index is up and running for some time I start to get intermittent errors that look like this: Mar 2, 2011 9:26:01 AM org.apache.solr.common.SolrException log SEVERE: java.lang.ClassCastException The queries I get the error against are seemingly random and do not throw the error consistently - in fact, every time I re-test a query that produced this error, it completes successfully. This is also the total extent of the error recorded in the logs; there is no stack trace. I'm not even sure how to begin debugging the problem, any suggestions or pointers as to what may be going wrong would be greatly appreciated. -Harish -- View this message in context: http://lucene.472066.n3.nabble.com/java-lang-ClassCastException-being-thrown-seemingly-at-random-tp2658331p2658331.html Sent from the Solr - User mailing list archive at Nabble.com.
Caching filter question / code review
I created the following SearchComponent that wraps a deduplicate filter around the current query and added it to last-components. It appears to be working, but is there any way I can improve the performance? Would this be considered and added to the filterCache? Am I even caching correctly? Thanks for any input/suggestions ...

private Map<String, Filter> filtersByField = new HashMap<String, Filter>();

@Override
public void prepare(ResponseBuilder rb) throws IOException {
    SolrParams params = rb.req.getParams();
    if (params.getBool(DuplicateParams.DEDUPLICATE, false)) {
        String field = params.get(DuplicateParams.DUPLICATE_FIELD);
        if (field == null) {
            throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                    "Deduplicate field is required");
        }
        Filter filter = filtersByField.get(field);
        if (filter == null) {
            filter = new CachingWrapperFilter(new DuplicateFilter(field,
                    DuplicateFilter.KM_USE_FIRST_OCCURRENCE,
                    DuplicateFilter.PM_FAST_INVALIDATION));
            filtersByField.put(field, filter);
        }
        rb.getFilters().add(new FilteredQuery(rb.getQuery(), filter));
    }
}

...
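One thing worth flagging about code like the above: a SearchComponent instance is shared across concurrent requests, so a plain HashMap used as a cache can be corrupted under load. A minimal thread-safe sketch of just the caching step (the Filter interface here is a self-contained stand-in, not the Lucene class, and the lookup pattern is a general suggestion rather than anything Solr mandates):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class FilterCacheSketch {
    // Stand-in for org.apache.lucene.search.Filter so the sketch compiles alone.
    interface Filter {}

    private final ConcurrentMap<String, Filter> filtersByField =
            new ConcurrentHashMap<String, Filter>();

    // Returns the cached filter for a field. Under a race two threads may
    // both build a filter, but putIfAbsent guarantees all callers end up
    // sharing the same instance, which is what CachingWrapperFilter needs
    // for its cache to be effective.
    Filter filterFor(String field) {
        Filter f = filtersByField.get(field);
        if (f == null) {
            // In the real component this would be:
            // new CachingWrapperFilter(new DuplicateFilter(field, ...))
            Filter created = new Filter() {};
            Filter previous = filtersByField.putIfAbsent(field, created);
            f = (previous != null) ? previous : created;
        }
        return f;
    }
}
```

Sharing one filter instance per field matters because CachingWrapperFilter caches per reader internally; handing each request a fresh wrapper would defeat that cache entirely.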
Re: java.lang.ClassCastException being thrown seemingly at random
On Wed, Mar 9, 2011 at 8:34 PM, harish.agarwal harish.agar...@gmail.com wrote: I'm using a recent build of the trunk (from 3/1). I've noticed that after the index is up and running for some time I start to get intermittent errors that look like this: Mar 2, 2011 9:26:01 AM org.apache.solr.common.SolrException log SEVERE: java.lang.ClassCastException This was probably fixed today: https://issues.apache.org/jira/browse/LUCENE-2953 -Yonik http://lucidimagination.com
Re: NRT in Solr
So it looks like it can handle adding new documents, and expiring old documents. Updating a document is not part of the game. This would work well for message boards or tweet type solutions. Solr can do this as well directly. Why wouldn't you just improve the document and facet caching so that when you append there is not a huge hit to Solr? Also we could add an expiration to documents as well. The big issue for me is that when I update Solr I need to replicate that change quickly to all slaves. If we changed replication to stream to the slaves in Near Real Time and not have to create a whole new index version, warming, etc, that would be awesome. That combined with better caching smarts and we have a near perfect solution. Thanks. On 3/9/11 3:29 PM, Smiley, David W. dsmi...@mitre.org wrote: Zoie adds NRT to Solr: http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin I haven't tried it yet but looks cool. ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/ On Mar 9, 2011, at 9:01 AM, Jason Rutherglen wrote: Jae, NRT hasn't been implemented in Solr as of yet, I think partially because major features such as replication, caching, and uninverted faceting suddenly are no longer viable, eg, it's another round of testing etc. It's doable, however I think the best approach is a separate request call path, to avoid altering the current [working] API. On Tue, Mar 8, 2011 at 1:27 PM, Jae Joo jaejo...@gmail.com wrote: Hi, Is NRT in Solr 4.0 from trunk? I have checked it out from trunk, but could not find the configuration for NRT. Regards Jae
Re: docBoost
Yes just add an if statement based on a field type and do a row.put() only if that other value is a certain value. On 3/9/11 1:39 PM, Brian Lamb brian.l...@journalexperts.com wrote: That makes sense. As a follow up, is there a way to only conditionally use the boost score? For example, in some cases I want to use the boost score and in other cases I want all documents to be treated equally. On Wed, Mar 9, 2011 at 2:42 PM, Jayendra Patil jayendra.patil@gmail.com wrote: you can use the ScriptTransformer to perform the boost calculation and addition. http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

<dataConfig>
  <script><![CDATA[
    function f1(row) {
        // Add boost
        row.put('$docBoost', 1.5);
        return row;
    }
  ]]></script>
  <document>
    <entity name="e" pk="id" transformer="script:f1" query="select * from X"/>
  </document>
</dataConfig>

Regards, Jayendra On Wed, Mar 9, 2011 at 2:01 PM, Brian Lamb brian.l...@journalexperts.com wrote: Anyone have any clue on this one? On Tue, Mar 8, 2011 at 2:11 PM, Brian Lamb brian.l...@journalexperts.com wrote: Hi all, I am using dataimport to create my index and I want to use docBoost to assign some higher weights to certain docs. I understand the concept behind docBoost but I haven't been able to find an example anywhere that shows how to implement it. Assuming the following config file:

<document>
  <entity name="animal" dataSource="animals" pk="id" query="SELECT * FROM animals">
    <field column="id" name="id" />
    <field column="genus" name="genus" />
    <field column="species" name="species" />
    <entity name="boosters" dataSource="boosts"
            query="SELECT boost_score FROM boosts WHERE animal_id = ${animal.id}">
      <field column="boost_score" name="boost_score" />
    </entity>
  </entity>
</document>

How do I add in a docBoost score? The boost score is currently in a separate table as shown above.
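On the conditional boost: the suggestion above (an if statement plus row.put() inside the transformer) boils down to row logic like the following. This is a plain-Java sketch of what the f1(row) script function from this thread would do; the column name animal_type and the "endangered" condition are made up for illustration and are not from the original config.

```java
import java.util.Map;

public class ConditionalBoost {
    // Mimics a DIH transformer: attach $docBoost to the row only when
    // some other column (hypothetical "animal_type" here) matches a
    // condition; otherwise leave the row untouched so the document gets
    // the default boost of 1.0.
    static Map<String, Object> transformRow(Map<String, Object> row) {
        Object type = row.get("animal_type");
        if ("endangered".equals(type)) {
            row.put("$docBoost", 1.5);
        }
        return row;
    }
}
```

Inside DataImportHandler the same branch would live in the JavaScript script transformer, so no recompilation is needed to tune the condition.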
Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files
Hi, I'm using Solr 1.4.1. The scenario involves user uploading multiple files. These have content extracted using SolrCell, then indexed by Solr along with other information about the user. ContentStreamUpdateRequest seemed like the right choice for this - use addFile() to send file data, and use setParam() to add normal data fields. However, when I do multiple addFile() to ContentStreamUpdateRequest, I observed that at the server side, even the file parts of this multipart post are interpreted as regular form fields by the FileUpload component. I found that FileUpload does so because the filename value in Content-Disposition headers of each part are not being set. Digging a bit further, it seems the actual root cause is in the client side solrj API ... the CommonsHttpSolrServer class is not setting filename value in Content-Disposition header while creating multipart Part instances (from HttpClient framework). I solved this problem by a hack - in CommonsHttpSolrServer.request() method where the PartBase instances are created, I overrode sendDispositionHeader() and added filename value. That solved the problem. However, my questions are: 1. Am I using ContentStreamUpdateRequest wrong, or is this actually a bug? Should I be using something else? 2. My end goal is to map contents of each file to *separate* fields, not a common field. Since the regular ExtractingRequestHandler maps all content to just one field, I believe I've to create a custom RequestHandler (possibly reusing existing SolrCell classes). Is this approach right? Thanks Karthik
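To make the root cause concrete: the only thing that distinguishes a part that commons-fileupload treats as a file from one it treats as a plain form field is the filename parameter in that part's Content-Disposition header. A minimal sketch of how that header line is formed (a hypothetical helper for illustration, not solrj or HttpClient code):

```java
public class DispositionHeader {
    // Builds the Content-Disposition line for one multipart/form-data part.
    // When filename is null -- which is what the poster observed solrj
    // sending -- the receiving end sees an ordinary form field, so Solr
    // finds no content streams and reports missing_content_stream.
    static String contentDisposition(String fieldName, String filename) {
        StringBuilder sb = new StringBuilder("Content-Disposition: form-data; name=\"")
                .append(fieldName).append('"');
        if (filename != null) {
            sb.append("; filename=\"").append(filename).append('"');
        }
        return sb.toString();
    }
}
```

The hack described in the message amounts to ensuring the filename branch above runs for each file part.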
Re: NRT in Solr
Please start new threads for new conversations. On Wed, Mar 9, 2011 at 2:27 AM, stockii stock.jo...@googlemail.com wrote: question: http://wiki.apache.org/solr/NearRealtimeSearchTuning 'PERFORMANCE WARNING: Overlapping onDeckSearchers=x i got this message. in my solrconfig.xml: maxWarmingSearchers=4, if i set this to 1 or 2 i got exception. with 4 i got nothing, but the Performance Warning. the wiki article says, that the best solution is to set the warmingSearcher to 1!!! how can this work ? - --- System One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 1 Core with 31 Million Documents other Cores 100.000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Request - delta every Minute - 4GB Xmx -- View this message in context: http://lucene.472066.n3.nabble.com/NRT-in-Solr-tp2652689p2654696.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
Re: Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files
In case the exact problem was not clear to somebody: The problem with FileUpload interpreting file data as regular form fields is that, Solr thinks there are no content streams in the request and throws a missing_content_stream exception. On Thu, Mar 10, 2011 at 10:59 AM, Karthik Shiraly karthikshiral...@gmail.com wrote: Hi, I'm using Solr 1.4.1. The scenario involves user uploading multiple files. These have content extracted using SolrCell, then indexed by Solr along with other information about the user. ContentStreamUpdateRequest seemed like the right choice for this - use addFile() to send file data, and use setParam() to add normal data fields. However, when I do multiple addFile() to ContentStreamUpdateRequest, I observed that at the server side, even the file parts of this multipart post are interpreted as regular form fields by the FileUpload component. I found that FileUpload does so because the filename value in Content-Disposition headers of each part are not being set. Digging a bit further, it seems the actual root cause is in the client side solrj API ... the CommonsHttpSolrServer class is not setting filename value in Content-Disposition header while creating multipart Part instances (from HttpClient framework). I solved this problem by a hack - in CommonsHttpSolrServer.request() method where the PartBase instances are created, I overrode sendDispositionHeader() and added filename value. That solved the problem. However, my questions are: 1. Am I using ContentStreamUpdateRequest wrong, or is this actually a bug? Should I be using something else? 2. My end goal is to map contents of each file to *separate* fields, not a common field. Since the regular ExtractingRequestHandler maps all content to just one field, I believe I've to create a custom RequestHandler (possibly reusing existing SolrCell classes). Is this approach right? Thanks Karthik