Re: Quick Questions
In example/cloud-scripts/ you will find a Solr-specific zkCli tool to upload/download configs. You will need to reload a core/collection for the changes to take effect.

Upayavira

On Fri, Mar 8, 2013, at 07:02 AM, Nathan Findley wrote:
I am setting up SolrCloud with ZooKeeper.
- I am wondering if there are nicer ways to update the ZooKeeper config files (data-import) besides restarting a node with the bootstrap option?
- Right now I kill the node manually in order to restart it. Is there a better way to restart?
Thanks, Nate -- CTO Zenlok株式会社
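For reference, a minimal sketch of that upload/reload cycle (the ZooKeeper address, config directory and config name below are assumptions to adapt to your own setup):

  # upload an edited config set to ZooKeeper
  example/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig \
      -confdir ./solr/collection1/conf -confname myconf

  # then reload the collection so the running cores pick up the change
  curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=collection1'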
SOLR - Recommendation on architecture
We are planning to use Solr 4.1 for full-text indexing. Following is the hardware configuration of the web server that we plan to install Solr on:
*CPU*: 2 x Dual Core (4 cores)
*RAM*: 12GB
*Storage*: 212GB
*OS Version*: Windows 2008 R2
The dataset to be imported will have approx. 800k records, with 450 fields per record. Query response time should be between 200ms and 800ms. Please suggest if the current single-server implementation will work and if the specified configuration is enough for the requirement.
Re: Query parsing issue
Thank you very much, I've tried both of the ways that you suggested. In the end I've chosen to re-write the parse method by extending the ExtendedDismaxQParser class.

Francesco.

-----Original Message-----
From: Tomás Fernández Löbbe [mailto:tomasflo...@gmail.com]
Sent: Wednesday, March 6, 2013 7:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Query parsing issue

It should be easy to extend ExtendedDismaxQParser and do your pre-processing in the parse() method before calling edismax's parse. Or maybe you could change the way EDismax splits the input query into clauses by extending the splitIntoClauses method?

Tomás

On Wed, Mar 6, 2013 at 6:37 AM, Francesco Valentini francesco.valent...@altiliagroup.com wrote:
Hi, I've written my own analyzer to index and query a set of documents. At indexing time everything goes well, but now I have a problem in the query phase. I need to pass the whole query string to my analyzer before the edismax query parser begins its work. In other words, I have to preprocess the raw query string. Phrase querying does not fit my needs because I don't have to match the entire set of terms/tokens. How can I achieve this? Thank you in advance. Francesco
Re: SOLR - Recommendation on architecture
On 8 March 2013 14:19, Kobe J kobe.free.wo...@gmail.com wrote: We are planning to use SOLR 4.1 for full text indexing. Following is the hardware configuration of the web server that we plan to install SOLR on:- *CPU*: 2 x Dual Core (4 cores) *R**AM:* 12GB *Storage*: 212GB *OS Version* – Windows 2008 R2 [...] As with most things, the devil is in the details: What kind of queries are you planning to run, and what search features will you be using, e.g., faceting, sorting, highlighting, etc. A desired query response time is meaningless without also specifying the number of simultaneous users. Your best bet is to set up a prototype, and benchmark your search. Having said that, your proposed hardware seems more than adequate for your needs: 1. If possible, use SSDs or fast disks 2. I would not use Windows as a server platform Regards, Gora
Re: JoinQuery and scores
I would recommend reading up on Lucene scoring, there's a lot to understand there. The join query parser (triggered by the use of {!join} syntax) searches for a list of documents matching the term specified, and provides a list of matching IDs. It then performs a second search based upon those IDs. It is that second search that will be scored, but given you are just using IDs, there's no scoring to be done.

Given that your joining term 'jeans' will exist in documents on both sides of the join, you could say:

http://localhost:8983/solr/ee/select?fq={!join%20from=oxparentid%20to=oxid}jeans&q=jeans

That would cause the term 'jeans' to be scored (the more common the term in a document, the higher it scores, etc). But by the sounds of it, it would be useful for you to understand better how scoring calculations are done, so you can see *why* a score would be the way it is.

Upayavira

On Fri, Mar 8, 2013, at 07:56 AM, Stefan Moises wrote:
Hi Erick, if I try the same query without the join I get different scores for each hit... here is an example query:

http://localhost:8983/solr/ee/select?facet=true&facet.mincount=1&facet.limit=-1&rows=10&fl=oxid,score,oxtitle&debugQuery=true&start=0&facet.sort=lex&facet.field=oxprice&facet.field=manuseo&facet.field=vendseo&facet.field=catpaths&facet.field=catpathstok&facet.field=att_EU-Groesse&facet.field=att_Schnitt&facet.field=att_Groesse&facet.field=att_Farbe&facet.field=att_Einsatzbereich&facet.field=att_Material&facet.field=att_Modell&facet.field=att_Anzeige&facet.field=att_Design&facet.field=att_Lieferumfang&facet.field=att_Washing&facet.field=att_Beschaffenheit&qt=dismax&q={!join%20from=oxparentid%20to=oxid}jeans

Anything wrong with that? Every doc returned has a score of 1.0 with the join. Without the join I get scores between 0.40337953 and 0.40530312. Thanks, Stefan

On 08.03.2013 03:21, Erick Erickson wrote:
What's the rest of your query? What you've indicated doesn't have any terms to score. Join can be thought of as a bit like a filter query in this sense; the join hit is just an inclusion/exclusion clause, not a scoring clause. Best, Erick
On Thu, Mar 7, 2013 at 10:32 AM, Stefan Moises moi...@shoptimax.de wrote: Hi List, we are using the JoinQuery (JoinQParserPlugin) via request parameter, e.g. {!join from=parentid to=productsid} in Solr 4.1, which works great for our purposes, but unfortunately all docs returned get a score of 1.0... this makes the whole search pretty useless imho, since the results are sorted totally randomly, of course. Is there any simple way to fix this, or an explanation why this is the case? Thanks a lot in advance, Stefan -- With best regards from Nürnberg, Stefan Moises *** Stefan Moises Senior Software Developer, Head of Module Development shoptimax GmbH Guntherstraße 45 a 90461 Nürnberg Amtsgericht Nürnberg HRB 21703 GF Friedrich Schreieck Fax: 0911/25566-29 moi...@shoptimax.de http://www.shoptimax.de ***
Re: Solr 4.x auto-increment/sequence/counter functionality.
So I think I took the easiest option by creating an UpdateRequestProcessor implementation (I was unsure of the performance implications and object model of ScriptUpdateProcessor). The below DocumentCreationDetailsProcessorFactory class seems to achieve my aim of allowing me to sort my Solr documents by a creation order (to an extent - I don't think it is exactly the commit order), though the auto-increment/sequence/counter functionality is not continuous.

Solr sort parameter string: sort=created_time_stamp_l asc, created_processing_sequence_number_l asc, created_by_solr_thread_id_l asc, created_by_solr_core_name_s asc, created_by_solr_shard_id_s asc

Any comments or feedback would be appreciated.

//
// UpdateRequestProcessor implementation
//
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.solr.cloud.CloudDescriptor;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreDescriptor;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class DocumentCreationDetailsProcessorFactory extends UpdateRequestProcessorFactory {

    private static final AtomicLong processingSequenceNumber = new AtomicLong();

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new DocumentCreationDetailsProcessor(req, rsp, next, processingSequenceNumber);
    }
}

class DocumentCreationDetailsProcessor extends UpdateRequestProcessor {

    private final SolrQueryRequest req;
    @SuppressWarnings("unused")
    private final SolrQueryResponse rsp;
    @SuppressWarnings("unused")
    private final UpdateRequestProcessor next;
    private final AtomicLong processingSequenceNumber;

    public DocumentCreationDetailsProcessor(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next,
                                            AtomicLong processingSequenceNumber) {
        super(next);
        this.req = req;
        this.rsp = rsp;
        this.next = next;
        this.processingSequenceNumber = processingSequenceNumber;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument solrInputDocument = cmd.getSolrInputDocument();
        // stamp the wall-clock time and a JVM-wide sequence number onto the document
        solrInputDocument.addField("created_time_stamp_l", System.currentTimeMillis());
        solrInputDocument.addField("created_processing_sequence_number_l", processingSequenceNumber.incrementAndGet());

        // record which core/shard (if any) handled the add
        String solrCoreName = null;
        String solrShardId = null;
        if (req != null && req.getCore() != null && req.getCore().getCoreDescriptor() != null) {
            SolrCore solrCore = req.getCore();
            CoreDescriptor coreDesc = null;
            CloudDescriptor cloudDesc = null;
            if (solrCore != null) {
                solrCoreName = solrCore.getName();
                coreDesc = req.getCore().getCoreDescriptor();
                if (coreDesc != null) {
                    cloudDesc = coreDesc.getCloudDescriptor();
                }
                if (cloudDesc != null) {
                    solrShardId = cloudDesc.getShardId();
                }
            }
        }
        solrInputDocument.addField("created_by_solr_thread_id_l", Thread.currentThread().getId());
        solrInputDocument.addField("created_by_solr_core_name_s", solrCoreName);
        solrInputDocument.addField("created_by_solr_shard_id_s", solrShardId);

        // pass it up the chain
        super.processAdd(cmd);
    }
}
//

//
// Added the below for a bit of context (http://wiki.apache.org/solr/SolrPlugins)
//
mkdir /opt/solr/instances/test/collection1/lib
cp /home/user/download/test-solr-plugins-0.0.1.jar /opt/solr/instances/test/collection1/lib/
chown root:tomcat7 /opt/solr/instances/test/collection1/lib/*

vim /opt/solr/instances/test/collection1/conf/solrconfig.xml

<updateRequestProcessorChain name="mychain">
  <processor class="com.test.solr.plugins.DocumentCreationDetailsProcessorFactory"></processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

vim /opt/solr/instances/test/collection1/conf/solrconfig.xml

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">mychain</str>
  </lst>
</requestHandler>

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-x-auto-increment-sequence-counter-functionality-tp4045125p4045725.html Sent from the Solr - User mailing list archive at Nabble.com.
SOLR-3076 for beginners?
Hi, blockjoin seems to be a really cool feature. Unfortunately I'm too dumb to get the patch running. I don't even know what to do :-( Is there an example, a how-to, or a cookbook anywhere, other than using elasticsearch or bare Lucene? Uwe
Re: SOLR - Recommendation on architecture
I would not recommend Windows either.

2013/3/8 Kobe J kobe.free.wo...@gmail.com
We are planning to use Solr 4.1 for full-text indexing. Following is the hardware configuration of the web server that we plan to install Solr on: *CPU*: 2 x Dual Core (4 cores) *RAM*: 12GB *Storage*: 212GB *OS Version*: Windows 2008 R2. The dataset to be imported will have approx. 800k records, with 450 fields per record. Query response time should be between 200ms and 800ms. Please suggest if the current single-server implementation will work and if the specified configuration is enough for the requirement.
Re: SOLR - Recommendation on architecture
Because?

Upayavira

On Fri, Mar 8, 2013, at 09:27 AM, Jilal Oussama wrote:
I would not recommend Windows either.
2013/3/8 Kobe J kobe.free.wo...@gmail.com
We are planning to use Solr 4.1 for full-text indexing. Following is the hardware configuration of the web server that we plan to install Solr on: *CPU*: 2 x Dual Core (4 cores) *RAM*: 12GB *Storage*: 212GB *OS Version*: Windows 2008 R2. The dataset to be imported will have approx. 800k records, with 450 fields per record. Query response time should be between 200ms and 800ms. Please suggest if the current single-server implementation will work and if the specified configuration is enough for the requirement.
Re: Quick Questions
On 03/08/2013 05:06 PM, Upayavira wrote:
In example/cloud-scripts/ you will find a Solr-specific zkCli tool to upload/download configs. You will need to reload a core/collection for the changes to take effect. Upayavira
On Fri, Mar 8, 2013, at 07:02 AM, Nathan Findley wrote: I am setting up SolrCloud with ZooKeeper. - I am wondering if there are nicer ways to update the ZooKeeper config files (data-import) besides restarting a node with the bootstrap option? - Right now I kill the node manually in order to restart it. Is there a better way to restart? Thanks, Nate -- CTO Zenlok株式会社

Ok, that is good to know. Using ZooKeeper I can see the following dataimport.properties:
last_index_time=2013-03-06 12\:02\:22
email_history.last_index_time=2013-03-06 12\:02\:22
...
The problem is that the last_index_time is not being changed when I run a delta import. Any ideas why? If it is a permissions issue, I am a bit confused, because I am testing using the root user and don't see any errors indicating that ZooKeeper is failing to write to the filesystem.

Thanks, Nate -- CTO Zenlok株式会社
Re: SOLR - Recommendation on architecture
If you are attempting to assess performance, you should use as many records as you can muster. A Lucene index does start to struggle at a certain size, and you may be getting close to that, depending upon the size of your fields. Are you suggesting that you would host other services on the server as well? I would expect your Solr instance to want sole use of the server, as an index of your size will demand it. Upayavira On Fri, Mar 8, 2013, at 10:02 AM, kobe.free.wo...@gmail.com wrote: Thanks for your suggestion Gora. Yes, we are planning to use faceting, sorting features. The number of simultaneous users would be around 500 per min. We have preferred windows since the server would also be hosting some of our Microsoft based web applications. For prototyping, given the number of records we will be working with, what number of records do you suggest should we include in prototyping? -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-Recommendation-on-architecture-tp4045718p4045734.html Sent from the Solr - User mailing list archive at Nabble.com.
Mark document as hidden
Dear all, I would like to mark documents as hidden. I could add a field "hidden" and set its value to true, but then the whole document would have to be reindexed. And external file fields are not searchable. I could store the document keys in an external database and filter the results with these ids, but if I have some millions of hidden documents, I don't think that is a great idea. Currently I will reindex the documents, but if someone has a better idea, any help will be appreciated. Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Mark document as hidden
Without java coding, you cannot filter on things that aren't in your index. You would need to re-index the document, but maybe you could make use of atomic updates to just change the hidden field without needing to push the whole document again. Upayavira On Fri, Mar 8, 2013, at 11:40 AM, lboutros wrote: Dear all, I would like to mark documents as hidden. I could add a field hidden and pass the value to true, but the whole documents will be reindexed. And External file fields are not searchable. I could store the document keys in an external database and filter the result with these ids. But if I have some millions of hidden documents, I don't think it is a great idea. Currently I will reindex the documents, but if someone has a better idea, any help will be appreciated. Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756.html Sent from the Solr - User mailing list archive at Nabble.com.
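As a rough sketch of that atomic-update route (assuming a boolean field named hidden, all other fields stored, and the standard JSON update handler; names here are only an illustration):

  curl 'http://localhost:8983/solr/collection1/update?commit=true' \
       -H 'Content-Type: application/json' \
       -d '[{"id":"doc1","hidden":{"set":true}}]'

Only the hidden field is sent; Solr rebuilds the rest of the document from its stored fields.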
ResourceLoader newInstance
Hi, can someone explain to me the point of the method public <T> T newInstance(String cname, Class<T> expectedType) in the interface org.apache.solr.common.ResourceLoader (or org.apache.lucene.analysis.util.ResourceLoader)? If I want to implement a ResourceLoader, what is the purpose of me implementing a newInstance method? The other method in the interface (openResource) makes sense to me, but I'm not sure about newInstance. Thanks, Peter
Re: Mark document as hidden
External file fields, via function queries, are still usable for filtering. Consider using the frange function query to filter out hidden documents. Erik On Mar 8, 2013, at 6:40, lboutros boutr...@gmail.com wrote: Dear all, I would like to mark documents as hidden. I could add a field hidden and pass the value to true, but the whole documents will be reindexed. And External file fields are not searchable. I could store the document keys in an external database and filter the result with these ids. But if I have some millions of hidden documents, I don't think it is a great idea. Currently I will reindex the documents, but if someone has a better idea, any help will be appreciated. Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756.html Sent from the Solr - User mailing list archive at Nabble.com.
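For illustration, if the external file field were named hidden_ext and held 1 for hidden documents and 0 otherwise (names and values here are assumptions), a filter like this would keep only the visible ones:

  fq={!frange l=0 u=0}hidden_ext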
RE: inconsistent number of results returned in solr cloud
Hi, I am using Solr 4.0 (not BETA), and have created a 2-shard, 2-replica configuration. But when I query Solr with a filter query it returns inconsistent result counts. Without the filter query it returns the same consistent result count. I don't understand why. Can anyone help with this? Best Regards, Hardik Upadhyay
SolrCloud: port out of range:-1
Hello, I have some problems with SolrCloud and ZooKeeper. I have 2 servers and I want to have a Solr instance on both servers. Both Solr instances run an embedded ZooKeeper. When I try to start the first one I get the error: port out of range:-1. The command I run to start Solr with embedded ZooKeeper:

java -Djetty.port=4110 -DzkRun=10.100.10.101:5110 -DzkHost=10.100.10.101:5110,10.100.10.102:5120 -Dbootstrap_conf=true -DnumShards=1 -Xmx1024M -Xms512M -jar start.jar

It runs Solr on port 4110 and the embedded ZK on 5110. The -DzkHost gives the URL of the localhost ZK (5110) and the URL of the other server (ZK port). When I try to start this it gives the error: port out of range:-1. What's wrong? Thanks, Roy -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-port-out-of-range-1-tp4045804.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: inconsistent number of results returned in solr cloud
Check for duplicate IDs. A quick way is to facet on the id field and set the mincount to 2. -Mike

Hardik Upadhyay wrote:
Hi, I am using Solr 4.0 (not BETA), and have created a 2-shard, 2-replica configuration. But when I query Solr with a filter query it returns inconsistent result counts. Without the filter query it returns the same consistent result count. I don't understand why. Can anyone help with this? Best Regards, Hardik Upadhyay
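A sketch of that check (the core name is an assumption; id stands for your uniqueKey field):

  http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=id&facet.mincount=2&facet.limit=-1

Any facet value returned is an id that appears in more than one document. On a big index, faceting on the uniqueKey field can be expensive, so run it when load is low.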
Re: Mark document as hidden
Excellent, Erik! It works perfectly. Normal filter queries are cached. Is it the same for frange filter queries like this one?

fq={!frange l=0 u=10}removed_revision

Thanks to both of you for your answers. Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045817.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Mark document as hidden
One more question, is there already a way to update the external file (add values) in Solr ? Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045823.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR - Recommendation on architecture
Your server seems to be about the right size, but as everyone else has said, it depends on the kinds of queries.

Solr should be the only service on the system. Solr can make heavy use of the disk, which will interfere with other processes. If you are lucky enough to get the system tuned to run from RAM, it can use 100% of the CPU. Tuning Solr will be very difficult with other services sharing the same system. If you need to meet an SLA, you will have a hard time doing that on a shared server. When you don't meet that SLA, it will be almost impossible to diagnose why.

Why not Windows?
* the Windows filesystem is not designed for heavy server use
* Windows does not allow open files to be deleted -- there are workarounds for this in Solr, but it is a continuing problem
* the Windows file cache is organized by file, not by block, which is inefficient for Solr's access pattern
* Java on Windows works, but has a number of workarounds and quirks
* the Solr community is almost all Unix users, so you will get much better help on Unix

wunder

On Mar 8, 2013, at 3:04 AM, Upayavira wrote:
If you are attempting to assess performance, you should use as many records as you can muster. A Lucene index does start to struggle at a certain size, and you may be getting close to that, depending upon the size of your fields. Are you suggesting that you would host other services on the server as well? I would expect your Solr instance to want sole use of the server, as an index of your size will demand it. Upayavira
On Fri, Mar 8, 2013, at 10:02 AM, kobe.free.wo...@gmail.com wrote: Thanks for your suggestion, Gora. Yes, we are planning to use faceting and sorting features. The number of simultaneous users would be around 500 per minute. We have preferred Windows since the server would also be hosting some of our Microsoft-based web applications. For prototyping, given the number of records we will be working with, what number of records do you suggest we include? -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-Recommendation-on-architecture-tp4045718p4045734.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Mark document as hidden
Ludovic - Yes, this query would be cached (unless you say cache=false). Erik On Mar 8, 2013, at 10:26 , lboutros wrote: Excellent Erik ! It works perfectly. Normal filter queries are cached. Is it the same for frange filter queries like this one ? : fq={!frange l=0 u=10}removed_revision Thanks to both for your answers. Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045817.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Mark document as hidden
The external file is maintained externally. Solr only reads it, and does not have a facility to write to it, if that is what you're asking. Erik On Mar 8, 2013, at 10:43 , lboutros wrote: One more question, is there already a way to update the external file (add values) in Solr ? Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045823.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Migrate Solr 3.4 w/ solr-1255 GeoHash to Solr 4
The underlying index format is unchanged between SOLR-2155 and Solr 4, provided that this is only about indexing points, and SOLR-2155 could only index points anyway. To really ensure it's drop-in compatible, specify maxLevels=12 *instead of* setting maxDistErr (which indirectly derives a maxLevels) so you can be sure the levels are the same.

Also, SOLR-2155 always did full query shape precision (to as much as the maxLevels indexed length allows, of course). By default, SpatialRecursivePrefixTreeFieldType uses 2.5% of the query shape radius as its accuracy, which buys a little more performance at the expense of accuracy. You can set distErrPct=0 if you require full precision. For example you might need a meter of indexed precision for the case when someone zooms in really low to some small region, but if the search is a huge area of an entire country or state, then do you truly need a meter of precision along the edge for that case too? I think not. distErrPct is relative to the overall size of the query shape. The default I think is probably fine, but people have observed its inaccuracy, particularly when a point is plotted outside a drawn query box, and thought it was a problem with the spatial code when it's actually a configuration default. 0 is actually quite scalable provided there isn't a ton of indexed data coinciding with the query shape edge along the query shape's entire edge.

I'd be interested to hear if the Solr 4 version is faster/slower if you have any benchmarks -- especially v4.2 due out soon, but earlier 4.x should be nearly the same.

It's weird that you're seeing the stored value coming back in search results as a geohash. In Solr 4 you get precisely what you added.

~ David

Harley wrote:
Hi David Smiley: We use 3rd-party software to load Solr 3.4, so the behavior needs to be transparent with the migration to 4.1, but I was expecting that I would need to rebuild the Solr database. I moved/added the old Solr 3.4 core to Solr 4.1, with only minor modification (commented out the old spatial type and added the new), and I was surprised I was able to query the data. The geohash is displaying as a hash, and not a coordinate, so I am checking my configuration on the geospatial class. Harley Powers Parks, GISP Booz | Allen | Hamilton Geospatial Visualization Web Developer WEB: https://www.apan.org USPACOM J73/APAN Pacific Warfighting Center Ford Island p: 808.472.7752 c: 808.377.0632 apan: harley.parks@ nipr: harley.parks.ctr@ CONTRACTOR: Booz | Allen | Hamilton e: parks_harley@
-Original Message- From: David Smiley (@MITRE.org) [mailto: DSMILEY@ ] Sent: Wednesday, March 06, 2013 9:34 PM To: solr-user@.apache Subject: Re: Migrate Solr 3.4 w/ solr-1255 GeoHash to Solr 4
Hi Harley, See: http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4 In SOLR-2155 you had to explicitly specify the prefix encoding length, whereas in Solr 4 you specify how much precision you need and it figures out what the length is that satisfies that. When you first use the field, it'll log what the derived levels figure is (if you care). The units are decimal degrees (0-180 from no distance to the reverse side of the globe -- aka latitudinal degrees). You can name the field type whatever you want, but I don't recommend geohash because this conflates it with an actual GeoHashField, and also it's more of an internal detail. You said you're having trouble with the migration... but what is the trouble? ~ David
Harley wrote: I'm having trouble migrating the geohash fields from my Solr 3.4 schema to the Solr 4 schema.
This is the 3.4 type and class:

<fieldType name="geohash" class="solr2155.solr.schema.GeoHashField" length="12"/>

Is the below Solr 4 spatial type the right configuration for data stored in fields that once used the geohash type and class in the above Solr 3.4 field type?

<fieldType name="geohash" class="solr.SpatialRecursivePrefixTreeFieldType" geo="true" distErrPct="0.025" maxDistErr="0.09" units="degrees" prefixTree="geohash" />

Is units="degrees" decimal degrees? Example: 21.0345

Harley Powers Parks, GISP - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Migrate-Solr-3-4-w-solr-1255-GeoHash-to-Solr-4-tp4045416p4045470.html Sent from the Solr - User mailing list archive at Nabble.com. - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Migrate-Solr-3-4-w-solr-1255-GeoHash-to-Solr-4-tp4045416p4045835.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Migrate Solr 3.4 w/ solr-1255 GeoHash to Solr 4
You're supposed to add geo point data in "latitude, longitude" format, although some other variations work. Is your updating process supplying a geohash instead? If so you could write a simple Solr UpdateRequestProcessor to convert it to the expected format. But that doesn't help the fact that apparently your index already has a geohash for the stored field value.

Are all your fields either stored, or, if not stored, copied from a stored field? It would then be an option to dump all data via CSV (take care with multi-valued fields) and then load it into an empty instance.

You could optimize your index, which upgrades the format as a side-effect. FYI, there's a Lucene IndexUpgrader you can use at the command line: http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/index/IndexUpgrader.html But again, if your stored field values are geohashes when you want a lat,lon then this isn't going to fix that.

~ David

Harley wrote:
David Smiley: Because we use 3rd-party software... I checked to see if this would still work... the search query still works. But adding data seems to be broken, likely because of the geohash type. So, below is the log file, which tells me to upgrade. If possible, it would be great to simply get the old 3.4 index working. What should my workflow be to get this working as is, and then to upgrade? I'm expecting to delete the data folder, then rebuild the index via the 3rd-party software adding data to Solr 4... is it possible to reindex the existing data folder? - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Migrate-Solr-3-4-w-solr-1255-GeoHash-to-Solr-4-tp4045416p4045836.html Sent from the Solr - User mailing list archive at Nabble.com.
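For reference, a minimal sketch of running that upgrader from the command line (the jar name/path and index directory are assumptions):

  java -cp lucene-core-4.1.0.jar org.apache.lucene.index.IndexUpgrader -verbose /opt/solr/instances/test/collection1/data/index

This only rewrites the index in the newer file format; as noted above, it will not turn stored geohash values into lat,lon.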
Re: Mark document as hidden
Ok, thanks Erik. Do you see any problem in modifying the Update handler in order to append some values to this file ? Ludovic - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045839.html Sent from the Solr - User mailing list archive at Nabble.com.
How can I limit my Solr search to an arbitrary set of 100,000 documents?
We've got an 11,000,000-document index. Most documents have a unique ID called flrid, plus a different ID called solrid that is Solr's PK. For some searches, we need to be able to limit the searches to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search, and it will be rare enough to call it never that any two searches will have the same set of FLRIDs to limit on.

What we're doing right now is, roughly:

q=title:dogs AND (flrid:(123 125 139 34823) OR flrid:(34837 ... 59091) OR ... OR flrid:(101294813 ... 103049934))

Each of those parenthetical groups can be 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limitations on the number of terms that can be ORed together.

The problem with this approach (besides that it's clunky) is that it seems to perform O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. If we have 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want it to be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.

How can we do this better? Things we've tried or considered:

* Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement.
* Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
* Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible.
* Considered: Translating FLRIDs to SolrIDs and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.

What we're hoping for:

* An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
* Have Solr do big ORs as a set operation, not as (what we assume is) a naive one-at-a-time matching.
* A way to create a match vector that gets passed to the query, because strings of fqs in the query seem to be a suboptimal way to do it.

I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.

* http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
* http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
* http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
* http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

Thanks, Andy

-- Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
Re: Mark document as hidden
I could create an UpdateRequestProcessorFactory that updates this file; that seems like a better approach? - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045842.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?
hi Andy, It seems like a common type of operation and I would be also curious what others think. My take on this is to create a compressed intbitset and send it as a query filter, then have the handler decompress/deserialize it, and use it as a filter query. We have already done experiments with intbitsets and it is fast to send/receive look at page 20 http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component it is not on my immediate list of tasks, but if you want to help, it can be done sooner roman On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote: We've got an 11,000,000-document index. Most documents have a unique ID called flrid, plus a different ID called solrid that is Solr's PK. For some searches, we need to be able to limit the searches to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search and it will be rare enough to call it never that any two searches will have the same set of FLRIDs to limit on. What we're doing right now is, roughly: q=title:dogs AND (flrid:(123 125 139 34823) OR flrid:(34837 ... 59091) OR ... OR flrid:(101294813 ... 103049934)) Each of those FQs parentheticals can be 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limitations on the number of terms that can be ORed together. The problem with this approach (besides that it's clunky) is that it seems to perform O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. If we have 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want it be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs. How can we do this better? Things we've tried or considered: * Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement. * Tried: Putting the FLRIDs into the fq instead of the q. No improvement. * Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible. * Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID-SolrID to do the matching. What we're hoping for: * An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database. * Have Solr do big ORs as a set operation not as (what we assume is) a naive one-at-a-time matching. * A way to create a match vector that gets passed to the query, because strings of fqs in the query seems to be a suboptimal way to do it. I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now. * http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys * http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr * http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html * http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html Thanks, Andy -- Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?
First, terms used to subset the index should be a filter query, not part of the main query. That may help, because the filter query terms are not used for relevance scoring. Have you done any system profiling? Where is the bottleneck: CPU or disk? There is no point in optimising things before you know the bottleneck. Also, your latency goals may be impossible. Assume roughly one disk access per term in the query. You are not going to be able to do 100,000 random access disk IOs in 2 seconds, let alone process the results. wunder On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote: hi Andy, It seems like a common type of operation and I would be also curious what others think. My take on this is to create a compressed intbitset and send it as a query filter, then have the handler decompress/deserialize it, and use it as a filter query. We have already done experiments with intbitsets and it is fast to send/receive look at page 20 http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component it is not on my immediate list of tasks, but if you want to help, it can be done sooner roman On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote: We've got an 11,000,000-document index. Most documents have a unique ID called flrid, plus a different ID called solrid that is Solr's PK. For some searches, we need to be able to limit the searches to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search and it will be rare enough to call it never that any two searches will have the same set of FLRIDs to limit on. What we're doing right now is, roughly: q=title:dogs AND (flrid:(123 125 139 34823) OR flrid:(34837 ... 59091) OR ... OR flrid:(101294813 ... 103049934)) Each of those FQs parentheticals can be 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limitations on the number of terms that can be ORed together. The problem with this approach (besides that it's clunky) is that it seems to perform O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. If we have 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want it be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs. How can we do this better? Things we've tried or considered: * Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement. * Tried: Putting the FLRIDs into the fq instead of the q. No improvement. * Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible. * Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID-SolrID to do the matching. What we're hoping for: * An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database. * Have Solr do big ORs as a set operation not as (what we assume is) a naive one-at-a-time matching. * A way to create a match vector that gets passed to the query, because strings of fqs in the query seems to be a suboptimal way to do it. I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now. 
* http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys * http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr * http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html * http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html Thanks, Andy -- Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
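To make the first suggestion concrete, a hedged sketch of moving the ID list out of the scoring query (IDs elided as in the original; sent as a POST so URL length isn't an issue):

  q=title:dogs
  fq=flrid:(123 125 139 34823) OR flrid:(34837 ... 59091)

The fq clauses take no part in relevance scoring, and since these ID lists essentially never repeat between searches, it may also be worth turning off caching for them with fq={!cache=false}...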
Re: InvalidShapeException when using SpatialRecursivePrefixTreeFieldType with custom worldBounds
Hi Jon. If you're able to trigger an IndexOutOfBoundsException out of the prefix tree then please file a bug (to the Lucene project, not Solr). I'll look into it when I have time. I need to add a Wiki page on the use of spatial for time ranges; there are some tricks to it. Nevertheless you've demonstrated a bug. ~ David - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/InvalidShapeException-when-using-SpatialRecursivePrefixTreeFieldType-with-custom-worldBounds-tp4045351p4045864.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.1 UI fail to display result
I know, it's a bit late on this thread, but for the record - filed and already fixed: https://issues.apache.org/jira/browse/SOLR-4349 On Saturday, February 2, 2013 at 6:35 PM, J Mohamed Zahoor wrote: It works In chrome though... ./Zahoor@iPhone On 02-Feb-2013, at 4:34 PM, J Mohamed Zahoor zah...@indix.com (mailto:zah...@indix.com) wrote: I'm not sure why .. but this sounds like the JSON Parser was called with an HTML- or XML-String? After you hit the Execute Button on the Website, on the top of the right content-area, there is a link - which is what the UI will request .. if you open that in another browser-tab or with curl/wget .. what is the response you get? Is that really JSON? Or perhaps some kind of Error Message? The link itself does not seem to be okay. It shows only this for q=*:* http://localhost:8983/solr/collection1/select? But if i add a wt=json in another tab.. i get a json response. ./zahoor
Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?
I think we speak of one use case where user wants to limit the search into a collection of documents but there is no unifying (easy) way to select those papers - besides a loong query: id:1 OR id:5 OR id:90... And no, the latency of several hundred milliseconds is perfectly achievable with several hundred thousands of ids, you should explore the link... roman On Fri, Mar 8, 2013 at 12:56 PM, Walter Underwood wun...@wunderwood.orgwrote: First, terms used to subset the index should be a filter query, not part of the main query. That may help, because the filter query terms are not used for relevance scoring. Have you done any system profiling? Where is the bottleneck: CPU or disk? There is no point in optimising things before you know the bottleneck. Also, your latency goals may be impossible. Assume roughly one disk access per term in the query. You are not going to be able to do 100,000 random access disk IOs in 2 seconds, let alone process the results. wunder On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote: hi Andy, It seems like a common type of operation and I would be also curious what others think. My take on this is to create a compressed intbitset and send it as a query filter, then have the handler decompress/deserialize it, and use it as a filter query. We have already done experiments with intbitsets and it is fast to send/receive look at page 20 http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component it is not on my immediate list of tasks, but if you want to help, it can be done sooner roman On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote: We've got an 11,000,000-document index. Most documents have a unique ID called flrid, plus a different ID called solrid that is Solr's PK. For some searches, we need to be able to limit the searches to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search and it will be rare enough to call it never that any two searches will have the same set of FLRIDs to limit on. What we're doing right now is, roughly: q=title:dogs AND (flrid:(123 125 139 34823) OR flrid:(34837 ... 59091) OR ... OR flrid:(101294813 ... 103049934)) Each of those FQs parentheticals can be 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limitations on the number of terms that can be ORed together. The problem with this approach (besides that it's clunky) is that it seems to perform O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. If we have 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want it be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs. How can we do this better? Things we've tried or considered: * Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement. * Tried: Putting the FLRIDs into the fq instead of the q. No improvement. * Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible. * Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID-SolrID to do the matching. 
What we're hoping for: * An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database. * Have Solr do big ORs as a set operation not as (what we assume is) a naive one-at-a-time matching. * A way to create a match vector that gets passed to the query, because strings of fqs in the query seems to be a suboptimal way to do it. I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now. * http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys * http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr * http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html * http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html Thanks, Andy -- Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
Re: How to add shard in 4.2-snapshot
On Mar 8, 2013, at 12:23 AM, Jam Luo cooljam2...@gmail.com wrote:
Hi, I use the 4.2-snapshot version, git sha id f4502778b263849a827e89e45d37b33861f225f9. I deployed a cluster with SolrCloud; there are 3 nodes, one core per node, each in a different shard. The JVM argument is -DnumShards=3. Now I must add a machine and add a shard, but when I start a new Solr instance and change the argument to numShards=4, the shard count does not change. How do I add a shard in this version? In 4.0, when I increased numShards, the new Solr instance would go into a new shard, but now that does not work.

Unless you control sharding yourself, you cannot currently add shards without some reindexing. Shard splitting is a feature that is coming soon, though. You can easily add replicas currently, but not shards. You can control shards yourself by not using numShards.

- Mark
Re: Dynamic schema design: feedback requested
Hi Jan,

On Mar 6, 2013, at 4:50 PM, Jan Høydahl jan@cominvent.com wrote:
Will ZK get pushed the serialized monolithic schema.xml / schema.json from the node which changed it, and then trigger an update to the rest of the cluster?

Yes.

I was kind of hoping that once we have introduced ZK into the mix as our centralized config server, we could start using it as such consistently. And so instead of ZK storing a plain xml file, we split up the schema as native ZK nodes […]

Erik Hatcher made the same suggestion on SOLR-3251: https://issues.apache.org/jira/browse/SOLR-3251?focusedCommentId=13571713&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13571713

My response on the issue: https://issues.apache.org/jira/browse/SOLR-3251?focusedCommentId=13572774&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13572774

In short, I'm not sure it's a good idea, and in any event, I don't want to implement this as part of the initial implementation - it could be added on later.

multiple collections may share the same config set and thus schema, so what happens if someone does not know this and hits PUT localhost:8983/solr/collection1/schema and it also affects the schema for collection2?

Hmm, that's a great question. Querying against a named config rather than a collection/core would not be an improvement, though, since the relationship between the two wouldn't be represented in the request. Maybe if there were requests that returned the collections using a particular named config, and vice versa, people could at least discover problematic dependencies before they send schema modification requests? Or maybe such requests already exist?

Steve
Re: SolrCloud: port out of range:-1
On 3/8/2013 7:37 AM, roySolr wrote: java -Djetty.port=4110 -DzkRun=10.100.10.101:5110 -DzkHost=10.100.10.101:5110,10.100.10.102:5120 -Dbootstrap_conf=true -DnumShards=1 -Xmx1024M -Xms512M -jar start.jar It runs Solr on port 4110, the embedded zk on 5110. The -DzkHost gives the urls of the localhost zk(5110) and the url of the other server(zk port). When i try to start this it give the error: port out of range:-1. The full log line, ideally with several lines above and below for context, is going to be crucial for figuring this out. Also, the contents of your solr.xml file may be important. Thanks, Shawn
Re: Dynamic schema design: feedback requested
On Mar 6, 2013, at 7:50 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
I think it would make a lot of sense -- not just in terms of implementation but also for end user clarity -- to have some simple, straightforward to understand caveats about maintaining schema information...
1) If you want to keep schema information in an authoritative config file that you can manually edit, then the /schema REST API will be read only.
2) If you wish to use the /schema REST API for read and write operations, then schema information will be persisted under the covers in a data store whose format is an implementation detail, just like the index file format.
3) If you are using a schema config file and you wish to switch to using the /schema REST API for managing schema information, there is a tool/command/API you can run to do so.
4) If you are using the /schema REST API for managing schema information, and you wish to switch to using a schema config file, there is a tool/command/API you can run to export the schema info in a config file format.

+1 ... whether or not the under-the-covers data store used by the REST API is JSON, or some binary data, or an XML file (just schema.xml w/o whitespace/comments) should be an implementation detail. Likewise the question of whether some new config file formats are added -- it shouldn't matter. If it's config it's config and the user owns it. If it's data it's data and the system owns it.

Calling the system-owned file 'schema.dat', rather than 'schema.json' (i.e., extension=format), would help to reinforce this black-box view.

Steve
Re: SolrCloud: port out of range:-1
A couple of comments about your deployment architecture too. You'll need to change zoo.cfg to make the ZooKeeper ensemble work with two instances as you are trying to do -- have you? The example zoo.cfg is intended for a single ZK instance, as described in the SolrCloud example.

That said, a two-instance ZK ensemble like the one you are intending to have doesn't make much sense: if ANY of your Solr servers goes down (and since you are running ZK embedded, ZK will also stop), the whole cluster will be unusable until you start that server again.

Tomás

On Fri, Mar 8, 2013 at 12:26 PM, Shawn Heisey s...@elyograg.org wrote: On 3/8/2013 7:37 AM, roySolr wrote: java -Djetty.port=4110 -DzkRun=10.100.10.101:5110 -DzkHost=10.100.10.101:5110,10.100.10.102:5120 -Dbootstrap_conf=true -DnumShards=1 -Xmx1024M -Xms512M -jar start.jar It runs Solr on port 4110, the embedded zk on 5110. The -DzkHost gives the urls of the localhost zk(5110) and the url of the other server(zk port). When i try to start this it give the error: port out of range:-1. The full log line, ideally with several lines above and below for context, is going to be crucial for figuring this out. Also, the contents of your solr.xml file may be important. Thanks, Shawn
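For reference, a standalone ensemble (three nodes is the usual minimum) is normally described by a zoo.cfg along these lines, plus a myid file in dataDir on each node; the hosts, ports and paths here are only an illustration:

  tickTime=2000
  initLimit=10
  syncLimit=5
  dataDir=/var/lib/zookeeper
  clientPort=2181
  server.1=10.100.10.101:2888:3888
  server.2=10.100.10.102:2888:3888
  server.3=10.100.10.103:2888:3888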
Re: SolrCloud: port out of range:-1
A two-server Zookeeper ensemble is actually less reliable than a one-server ensemble. With two servers, Zookeeper stops working if either of them fails, so there is a higher probability that it will go down. The minimum number for increased reliability is three servers.

wunder

On Mar 8, 2013, at 12:33 PM, Tomás Fernández Löbbe wrote: A couple of comments about your deployment architecture too. You'll need to change the zoo.cfg to make the Zookeeper ensemble work with two instances as you are trying to do, have you? The example configuration with the zoo.cfg is intended for a single ZK instance as described in the SolrCloud example. That said, really a two instances ZK ensemble as the one you are intending to have doesn't make much sense, if ANY of your Solr servers break (which as you are running embedded, ZK will also stop), the whole cluster will be useless until you start the server again. Tomás
On Fri, Mar 8, 2013 at 12:26 PM, Shawn Heisey s...@elyograg.org wrote: On 3/8/2013 7:37 AM, roySolr wrote: java -Djetty.port=4110 -DzkRun=10.100.10.101:5110 -DzkHost=10.100.10.101:5110,10.100.10.102:5120 -Dbootstrap_conf=true -DnumShards=1 -Xmx1024M -Xms512M -jar start.jar It runs Solr on port 4110, the embedded zk on 5110. The -DzkHost gives the urls of the localhost zk(5110) and the url of the other server(zk port). When i try to start this it give the error: port out of range:-1. The full log line, ideally with several lines above and below for context, is going to be crucial for figuring this out. Also, the contents of your solr.xml file may be important. Thanks, Shawn

-- Walter Underwood wun...@wunderwood.org
Re: Dynamic schema design: feedback requested
On Mar 8, 2013, at 2:57 PM, Steve Rowe sar...@gmail.com wrote:
multiple collections may share the same config set and thus schema, so what happens if someone does not know this and hits PUT localhost:8983/solr/collection1/schema and it also affects the schema for collection2?
Hmm, that's a great question. Querying against a named config rather than a collection/core would not be an improvement, though, since the relationship between the two wouldn't be represented in the request. Maybe if there were requests that returned the collections using a particular named config, and vice versa, people could at least discover problematic dependencies before they send schema modification requests? Or maybe such requests already exist?

Also, this doesn't have to be either/or (collection/core vs. config) - we could have another API that's config-specific, e.g. for the fields resource:

collection-specific: http://localhost:8983/solr/collection1/schema/fields
config-specific: http://localhost:8983/solr/configs/configA/schema/fields

Steve
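As a hedged illustration of the collection-specific form (assuming the read-only schema REST API from this work is available in your build; the core name is an assumption):

  curl 'http://localhost:8983/solr/collection1/schema/fields?wt=json'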
update some fields vs replace the whole document
Generally speaking, which has better performance for Solr?
1. updating some fields or adding new fields to a document, or
2. replacing the whole document.
As I understand it, updating fields needs to search for the corresponding doc first and then replace the field values, while replacing the whole document is just like adding a new document. Is that right?
Re: update some fields vs replace the whole document
With an atomic update, you need to retrieve the stored fields in order to build up the full document to insert back. In either case, you'll have to locate the previous version and mark it deleted before you can insert the new version. I bet that the amount of time spent retrieving stored fields is matched by the time saved by not having to transmit those fields over the wire, although I'd be very curious to see someone actually test that. Upayavira On Fri, Mar 8, 2013, at 09:51 PM, Mingfeng Yang wrote: Generally speaking, which has better performance for Solr? 1. updating some fields or adding new fields into a document. or 2. replacing the whole document. As I understand, update fields need to search for the corresponding doc first, and then replace field values. While replacing the whole document is just like adding new document. Is it right?
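To make the two options concrete, a sketch of the two payloads (field names are assumptions; atomic updates also require the relevant fields to be stored):

  [{"id":"1","price":{"set":9.99}}]                          # atomic update: only the changed field
  [{"id":"1","title":"Dog food","price":9.99,"cat":"pets"}]  # full replace: the whole document again

Either way, Solr internally deletes the old version and writes a new document.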
Re: update some fields vs replace the whole document
Then what's the difference between adding a new document vs. replacing/overwriting a document? Ming- On Fri, Mar 8, 2013 at 2:07 PM, Upayavira u...@odoko.co.uk wrote: With an atomic update, you need to retrieve the stored fields in order to build up the full document to insert back. In either case, you'll have to locate the previous version and mark it deleted before you can insert the new version. I bet that the amount of time spent retrieving stored fields is matched by the time saved by not having to transmit those fields over the wire, although I'd be very curious to see someone actually test that. Upayavira On Fri, Mar 8, 2013, at 09:51 PM, Mingfeng Yang wrote: Generally speaking, which has better performance for Solr? 1. updating some fields or adding new fields into a document. or 2. replacing the whole document. As I understand, update fields need to search for the corresponding doc first, and then replace field values. While replacing the whole document is just like adding new document. Is it right?
Re: High QTime when wildcards in hl.fl are used
I've found more interesting information about using fastVectorHighlighting combined with wildcarded highlight fields, after testing on an isolated group of documents with text content: fvh + fulltext_*: QTime ~4s (!); fvh + fulltext_1234: QTime ~50ms; no fvh + fulltext_*: QTime ~600ms; no fvh + fulltext_1234: QTime ~500ms. As you can see, very long query times are associated with using fvh combined with a wildcarded hl.fl. In the source code I found that, when wildcards are used, the fields to highlight are computed by matching a regex in a loop over the fields returned by the query for the document, so in this case, where only one field matches the given pattern, there should be no difference between using wildcards and not. Any ideas? On 08.03.2013 13:49, Karol Sikora wrote: Hi all, I'm currently stumbling over the following case: I have indexed documents with fields named like fulltext_[some id]. I'm testing highlighting on a document which has only one such field, fulltext_1234. When 'fulltext_*' is provided as hl.fl, QTime is horribly big (> 10s); when an explicit 'fulltext_1234' is provided, QTime is acceptable (~30ms). I've found that using wildcards in hl.fl can increase QTime (http://stackoverflow.com/questions/11774508/optimize-solr-highlighter), but it definitely should not cost this much. I'm using fastVectorHighlighter in both cases. Any ideas why using wildcards causes such big QTimes? Maybe there is a workaround? -- Karol Sikora +48 781 493 788 Laboratorium EE ul. Mokotowska 46A/23 | 00-543 Warszawa | www.laboratorium.ee | www.laboratorium.ee/facebook
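For reference, the two variants being compared would look roughly like the following requests (the core name, query and document id are placeholders):
http://localhost:8983/solr/core/select?q=id:1234&hl=true&hl.useFastVectorHighlighter=true&hl.fl=fulltext_*
http://localhost:8983/solr/core/select?q=id:1234&hl=true&hl.useFastVectorHighlighter=true&hl.fl=fulltext_1234
The only difference is whether hl.fl names the field explicitly or forces Solr to expand the wildcard against every field of each returned document.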
Multiple Collections in one Zookeeper
Hi, I have a solrcloud cluster running several cores and pointing at one zookeeper. For performance reasons, I'd like to move one of the cores onto its own dedicated cluster of servers. Can I use the same zookeeper to keep track of both clusters? Thanks! Jim -- View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Collections-in-one-Zookeeper-tp4045936.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Multiple Collections in one Zookeeper
Yes, but you'll need to append a sub-path (chroot) onto the zookeeper connection string for your second cluster. For example: zookeeper1.example.com,zookeeper2.example.com,zookeeper3.example.com/subpath On Mar 8, 2013 6:46 PM, jimtronic jimtro...@gmail.com wrote: Hi, I have a solrcloud cluster running several cores and pointing at one zookeeper. For performance reasons, I'd like to move one of the cores onto its own dedicated cluster of servers. Can I use the same zookeeper to keep track of both clusters? Thanks! Jim -- View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Collections-in-one-Zookeeper-tp4045936.html Sent from the Solr - User mailing list archive at Nabble.com.
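Concretely, the nodes of the second cluster would be started with a zkHost along these lines (the /cluster2 chroot name and client port are made up; depending on the Solr version you may need to create that znode in ZooKeeper first, e.g. with ZooKeeper's own zkCli):
-DzkHost=zookeeper1.example.com:2181,zookeeper2.example.com:2181,zookeeper3.example.com:2181/cluster2
Each cluster then keeps its clusterstate, configs and live_nodes under its own sub-path in the same ensemble, so the two SolrCloud clusters never see each other's state.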
RE: Migrate Solr 3.4 w/ solr-1255 GeoHash to Solr 4
Yes. Success. I was able to successfully migrate solr 3.4 w/ solr-2155 solrconfig.xml and schema.xml; but I had to rebuild the database (solr index data folder). <fieldType name="geohash_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" geo="true" distErrPct="0" maxLevels="12" units="degrees" prefixTree="geohash"/> Everything seems to be working. I need to try and see if I can convert the old 3.4 database... but when we upgrade, we always rebuild our solr index. -Original Message- From: David Smiley (@MITRE.org) [mailto:dsmi...@mitre.org] Sent: Friday, March 08, 2013 6:56 AM To: solr-user@lucene.apache.org Subject: RE: Migrate Solr 3.4 w/ solr-1255 GeoHash to Solr 4 You're supposed to add geo point data in latitude, longitude format, although some other variations work. Is your updating process supplying a geohash instead? If so, you could write a simple Solr UpdateRequestProcessor to convert it to the expected format. But that doesn't help the fact that apparently your index already has a geohash for the stored field value. Are all your fields either stored, or not stored but copied from a stored field? It would then be an option to dump all data via CSV (take care with multi-valued fields) and then load it into an empty instance. You could optimize your index, which upgrades it as a side effect. FYI, there's a Lucene IndexUpgrader you can use at the command line: http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/index/IndexUpgrader.html But again, if your stored field values are geohashes when you want a lat,lon then this isn't going to fix that. ~ David Harley wrote: David Smiley: Because we use 3rd party software, I checked to see if this would still work... the search query still works. But adding data seems to be broken, likely because of the geohash type. So, below is the log file, which tells me to upgrade. If possible, it would be great to simply get the old 3.4 index working. What should my workflow be to get this working as is, then to upgrade? I'm expecting to delete the data folder, then rebuild the index via 3rd party software adding data to Solr 4... is it possible to reindex the existing data folder? - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Migrate-Solr-3-4-w-solr-1255-GeoHash-to-Solr-4-tp4045416p4045836.html Sent from the Solr - User mailing list archive at Nabble.com.
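For illustration, with a field of the geohash_rpt type above, the update process would be expected to send a plain latitude,longitude value rather than a geohash; something like the following, where the field name and coordinates are made up:
<field name="location_rpt">43.17614,-90.57341</field>
If the third-party software can only emit geohashes, that is where the UpdateRequestProcessor suggestion above would sit, converting the incoming value to lat,lon before it reaches the index.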
Re: update some fields vs replace the whole document
Generally it will be more a matter of application semantics. Solr makes it reasonably efficient to completely overwrite the existing document and fields, if that is what you want. But, in some applications, it may be desirable to preserve some or most of the existing fields; whether that is easier to accomplish by completely regenerating the full document from data stored elsewhere in the application (e.g., an RDBMS) or by doing a selective write will depend on the application. In some apps, the rest of the data may not be maintained separately, so a selective write makes more sense. Or, maybe the existing document contains metadata fields such as timestamps or counters that would get reset if the whole document was regenerated. -- Jack Krupansky -Original Message- From: Mingfeng Yang Sent: Friday, March 08, 2013 5:41 PM To: solr-user@lucene.apache.org Subject: Re: update some fields vs replace the whole document Then what's the difference between adding a new document vs. replacing/overwriting a document? Ming- On Fri, Mar 8, 2013 at 2:07 PM, Upayavira u...@odoko.co.uk wrote: With an atomic update, you need to retrieve the stored fields in order to build up the full document to insert back. In either case, you'll have to locate the previous version and mark it deleted before you can insert the new version. I bet that the amount of time spent retrieving stored fields is matched by the time saved by not having to transmit those fields over the wire, although I'd be very curious to see someone actually test that. Upayavira On Fri, Mar 8, 2013, at 09:51 PM, Mingfeng Yang wrote: Generally speaking, which has better performance for Solr? 1. updating some fields or adding new fields into a document. or 2. replacing the whole document. As I understand, update fields need to search for the corresponding doc first, and then replace field values. While replacing the whole document is just like adding new document. Is it right?
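The counter case mentioned above is a good fit for an atomic update, since the existing value can be bumped without the client ever reading it back; a minimal sketch, with a hypothetical field name:
curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type: application/json' -d '[{"id":"doc1","view_count":{"inc":1}}]'
A full-document replace here would either reset view_count or force the client to fetch the current value first.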
Re: Search a folder with File name and retrieve all the files matched
Since this is a POC you could simply run this command with the default example schema: cd solr/example/exampledocs java -Dauto -Drecursive=0 -jar post.jar path/to/folder You will get the full file name with path in the field resourcename. If you need to search just the filename, you can achieve that by adding a new field filename, with a copyField from resourcename to filename and a custom fieldType for filename that uses a PatternReplaceFilterFactory to remove the path. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 7 March 2013 at 22:11, Alexandre Rafalovitch arafa...@gmail.com wrote: You could use DataImportHandler with FileListEntityProcessor to get the file names in: http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor Then, if it is recursive enumeration and not just one level, you probably want a tokenizer that splits on path separator characters (e.g. /). Or maybe you want to index the filename as a separate field from the full path (you can do that in FileListEntityProcessor itself). And if you combine the list of files with an inner entity using Tika, you can load the file content for searching as well: http://wiki.apache.org/solr/DataImportHandler#Tika_Integration Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Thu, Mar 7, 2013 at 3:39 PM, pavangolla pavango...@gmail.com wrote: Hi, I am new to Apache Solr and am doing a POC where there is a folder (on the file system or in some repository) which has different files with different extensions (pdf, doc, xls, ...). I want to search with a file name and retrieve all the files with a matching name. How do I proceed on this? Please help me on this. -- View this message in context: http://lucene.472066.n3.nabble.com/Search-a-folder-with-File-name-and-retrieve-all-the-files-matched-tp4045629.html Sent from the Solr - User mailing list archive at Nabble.com.
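One possible sketch of that schema change (the field and type names are made up, and the pattern assumes Unix- or Windows-style path separators), not the only way to do it:
<field name="filename" type="text_filename" indexed="true" stored="true"/>
<copyField source="resourcename" dest="filename"/>
<fieldType name="text_filename" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern=".*[/\\]" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
The KeywordTokenizer keeps the whole value as one token, the pattern filter strips everything up to the last separator, and lowercasing makes the filename search case-insensitive.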
Re: Search a folder with File name and retrieve all the files matched
Thanks, Jan, for making the post tool do this type of thing. Great stuff. The filename would be a good one to add for out-of-the-box goodness. We can easily add just the filename to the index with something like the patch below. And on that note, what else would folks want in an easy to use document search system like this? Erik Index: core/src/java/org/apache/solr/util/SimplePostTool.java === --- core/src/java/org/apache/solr/util/SimplePostTool.java (revision 1450270) +++ core/src/java/org/apache/solr/util/SimplePostTool.java (working copy) @@ -749,6 +749,7 @@ urlStr = appendParam(urlStr, "resource.name=" + URLEncoder.encode(file.getAbsolutePath(), "UTF-8")); if(urlStr.indexOf("literal.id")==-1) urlStr = appendParam(urlStr, "literal.id=" + URLEncoder.encode(file.getAbsolutePath(), "UTF-8")); + urlStr = appendParam(urlStr, "literal.filename_s=" + URLEncoder.encode(file.getName(), "UTF-8")); url = new URL(urlStr); } } else { On Mar 8, 2013, at 19:16, Jan Høydahl wrote: Since this is a POC you could simply run this command with the default example schema: cd solr/example/exampledocs java -Dauto -Drecursive=0 -jar post.jar path/to/folder You will get the full file name with path in field resourcename If you need to search just the filename, you can achieve that through adding a new field filename with a copyField resourcename-filename and a custom fieldType for filename with a PatternReplaceFilterFactory to remove the path. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 7 March 2013 at 22:11, Alexandre Rafalovitch arafa...@gmail.com wrote: You could use DataImportHandler with FileListEntityProcessor to get the file names in: http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor Then, if it is recursive enumeration and not just one level, you probably want a tokenizer that splits on path separator characters (e.g. /). Or maybe you want to index filename as a separate field from full path (can do it in FileListEntityProcessor itself). And if you combined the list of files with inner entity using Tika, you can load the file content for searching as well: http://wiki.apache.org/solr/DataImportHandler#Tika_Integration Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Thu, Mar 7, 2013 at 3:39 PM, pavangolla pavango...@gmail.com wrote: HI, I am new to apache solr, I am doing a poc, where there is a folder (in sys or some repository) which has different files with diff extensions pdf, doc, xls.., I want to search with a file name and retrieve all the files with the name matching How do i proceed on this. Please help me on this. -- View this message in context: http://lucene.472066.n3.nabble.com/Search-a-folder-with-File-name-and-retrieve-all-the-files-matched-tp4045629.html Sent from the Solr - User mailing list archive at Nabble.com.
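With a patch like that applied, each posted file would also get its bare name in filename_s (relying on the example schema's *_s dynamic string field), so a filename lookup could be as simple as the following request, where the core name and file name are hypothetical:
http://localhost:8983/solr/collection1/select?q=filename_s:report.pdf
Since *_s is an untokenized string field this is an exact-name match; for partial or case-insensitive matching, a dedicated analyzed field such as the one sketched in the previous message would be the better fit.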