spell suggestions help
Hi all, one thing I wanted to clear up: for every other query I get correct suggestions, but in these 2 cases I am not getting what should be the suggestions. 1) I have the words kettle (doc frequency = 5) and cable (doc frequency = 1) indexed in the direct Solr spell checker, but when I query for cattle I get cable as the only suggestion and not kettle. Why is this happening? I want to get kettle in the suggestions as well. I am using JaroWinkler distance, according to which the score for cattle vs. cable comes out to 0.857 and for cattle vs. kettle comes out to 0.777, so kettle should also come up in the suggestions, but it does not. How can I correct this, anyone? 2) How do I query for a sentence like "hand blandar chopper", since the space is a delimiter for a Solr query and this query is returning an error? Thanks in advance. Regards, Rohan
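For context, a minimal sketch of the pieces involved (the component name, field name and values below are illustrative assumptions, not Rohan's actual config). Two knobs commonly hide a lower-scoring candidate such as kettle (0.777): spellcheck.count defaults to 1, and accuracy acts as a score floor.

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
    <!-- score floor: candidates scoring below this are dropped -->
    <float name="accuracy">0.7</float>
    <!-- max Levenshtein edits used when generating candidates, before rescoring -->
    <int name="maxEdits">2</int>
  </lst>
</searchComponent>

and in the request handler defaults, asking for more than one suggestion:

<str name="spellcheck.count">5</str>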
Re: solr 4.2.1 still has problems with index version and index generation
Hi Hoss, we don't use autoCommit and autoSoftCommit. We don't use openSearcher. We don't use the transaction log. I can see it in the AdminGUI and with http://master_host:port/solr/replication?command=indexversion All files are replicated from master to slave, nothing lost. It is just that the gen/version differs and breaks our cronjobs, which have worked since Solr 2.x. As you mentioned, it seems that the latest commit is fetched. The strange thing is, we start with a clean, empty index on the master. With all commands we send a commit=true and, where applicable, an optimize=true. The master is always in state optimized and current when replicating. How can it be that the searcher on the master is referring to an older commit point if there is no such point? The logs show that _AFTER_ the last optimize has finished a new searcher is started and the old one is closed. Also, we have replicateAfter startup, commit and optimize set, but the AdminGUI and the replication details only report replicateAfter commit and startup. Not really an error, but not what is really set in the config! Very strange. I will try the patch. Regards, Bernd

On 08.04.2013 20:12, Chris Hostetter wrote:
: I know there was some effort to fix this but I must report
: that solr 4.2.1 has still problems with index version and
: index generation numbering in master/slave mode with replication.
...
: RESULT: slave has different (higher) version number and is with generation 1 ahead :-(

Can you please provide more details...
* are you using autocommit? with what settings?
* are you using openSearcher=false in any of your commits?
* where exactly are you looking that you see the master/slave out of sync?
* are you observing any actual problems, or just seeing that the gen/version are reported as different?

As Joel mentioned, there is an open Jira related purely to the *display* of information about gen/version between master & slave, because in many cases the searcher in use on the master may refer to an older commit point, but it doesn't mean there is any actual problem in replication -- the slave is still fetching/searching the latest commit from the master as intended: https://issues.apache.org/jira/browse/SOLR-4661

-Hoss
Re: Sub field indexing
Thanks Toke, seems to be exactly what I am trying to do. Regards, Eric

On 08/04/2013 20:02, Toke Eskildsen wrote:
It-forum [it-fo...@meseo.fr]:
For example, I have a product A; this product is compatible with a product B version 1, 5, 6. How can I index values like:
compatible_engine : [productB,ProductZ]
version_compatible : [1,5,6],[45,85,96]

Index them as
compatible_engine: productB/1
compatible_engine: productB/5
compatible_engine: productB/6
compatible_engine: productZ/45
compatible_engine: productZ/85
compatible_engine: productZ/96
in a StrField (so that it is not tokenized).

After indexing, how to search it?
compatible_engine:productZ/85 to get all products compatible with productZ, version 85
compatible_engine:productZ* to get all products compatible with any version of productZ.

- Toke Eskildsen
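For reference, a schema.xml sketch of what Toke describes (the string type is the stock definition; the field line is an assumption):

<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<!-- one engine/version pair per value, kept as a single untokenized token -->
<field name="compatible_engine" type="string" indexed="true" stored="true" multiValued="true"/>

Queries then look like compatible_engine:"productZ/85" or compatible_engine:productZ* (see Toke's follow-up below about escaping the slash).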
Latency Comparison between cloud hosting Vs Dedicated hosting
Hi, We are comparing search request latency between Amazon vs. dedicated hosting [Rackspace]. For comparison we used Solr version 3.6.1 and an Amazon small instance. The index size was less than 1 GB. We see that the latency is about 75-100% higher on Amazon. Anybody who has migrated from dedicated hosting to the cloud have any pointers for improving latency? Would a bigger instance improve latency? Regards, Sujatha
Re: Indexed data not searchable
The XML files are formatted like this. I think that is where the problem is.

<metadataContainerType>
<ns3:object>
<ns3:generic>
<ns3:provided>
<ns3:title>T0084-00371-DOWNLOAD - Blatt 184r</ns3:title>
<ns3:identifier type="METSXMLID">T0084-00371-DOWNLOAD</ns3:identifier>
<ns3:format>application/pdf</ns3:format>
</ns3:provided>
<ns3:generated>
<ns3:created>2012-11-08T00:09:57.531+01:00</ns3:created>
<ns3:lastModified>2012-11-08T00:09:57.531+01:00</ns3:lastModified>
<ns3:issued>2012-11-08T00:09:57.531+01:00</ns3:issued>
..
Re: solr 4.2.1 still has problems with index version and index generation
Looking a bit deeper showed that replication?command=commit reports the right indexversion, generation and filelist:

<arr name="commits">
  <lst>
    <long name="indexVersion">1365357951589</long>
    <long name="generation">198</long>
    <arr name="filelist"> ...

And with replication?command=details I also see the correct commit part as above, BUT where the hell is the wrong info below the commits array coming from?

<str name="isMaster">true</str>
<str name="isSlave">false</str>
<long name="indexVersion">1365357774190</long>
<long name="generation">197</long>

The command replication?command=filelist&generation=197 replies with

<str name="status">invalid index generation</str>

Had a look into the sources: ahh, it is built in getReplicationDetails with:

details.add("isMaster", String.valueOf(isMaster));
details.add("isSlave", String.valueOf(isSlave));
long[] versionAndGeneration = getIndexVersion();
details.add("indexVersion", versionAndGeneration[0]);
details.add(GENERATION, versionAndGeneration[1]);

So getIndexVersion() gets a wrong version and generation, but why? It first gets the searcher from the core and then tries to get, via the IndexReader, the IndexCommit and then the commitData. I think I should use remote debugging on the master. At least I now know that it is the master. Regards, Bernd

On 09.04.2013 08:35, Bernd Fehling wrote: [...]

--
Bernd Fehling, Dipl.-Inform. (FH)
Bielefeld University Library
LibTec - Library Technology and Knowledge Management
Universitätsstr. 25, 33615 Bielefeld
Tel. +49 521 106-4060
bernd.fehling(at)uni-bielefeld.de
BASE - Bielefeld Academic Search Engine - www.base-search.net
Re: Sub field indexing
On Tue, 2013-04-09 at 08:40 +0200, It-forum wrote:
On 08/04/2013 20:02, Toke Eskildsen wrote:
compatible_engine:productZ/85 to get all products compatible with productZ, version 85
compatible_engine:productZ* to get all products compatible with any version of productZ.

Whoops, the slash triggers regexes, so you probably need to search for
compatible_engine:"productZ/85"
or
compatible_engine:productZ\/85

- Toke
Re: Indexed data not searchable
On 9 April 2013 13:10, Max Bo maximilian.brod...@gmail.com wrote:
The XML files are formatted like this. I think that is where the problem is. [...]

Yes, to use curl to post to /solr/update you need to have XML in the form described at http://wiki.apache.org/solr/UpdateXmlMessages

Otherwise, you can use FileListEntityProcessor and XPathEntityProcessor with FileDataSource from the Solr DataImportHandler. Please see the examples at http://wiki.apache.org/solr/DataImportHandler

Regards, Gora
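As a sketch, the update message and curl call would look roughly like this (field names adapted from Max's snippet and assumed to exist in the schema):

<add>
  <doc>
    <field name="id">T0084-00371-DOWNLOAD</field>
    <field name="title">Blatt 184r</field>
    <field name="format">application/pdf</field>
  </doc>
</add>

posted with:

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' --data-binary @doc.xml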
Re: Empty Solr 4.2.1 can not create Collection
Hi, thanks for your fast answer. You don't use the Collection API - may I ask you why? That way you have to set up everything (replicas, ...) manually, which I would like to avoid. Also, what I don't understand: why do my steps work in 4.0 but not in 4.2.1... Any clues? Kind Regards, Alexander

On 2013-04-08 19:12, Joel Bernstein wrote:
The steps that I use to set up the collection are slightly different:
1) Start zk and upconfig the config set. Your approach is the same.
2) Start appservers with Solr zkHost set to the zk started in step 1.
3) Use a core admin command to spin up a new core and collection:
http://app01/solr/admin/cores?action=CREATE&name=storage-core&collection=storage&numShards=1&collection.configName=storage-conf&shard=shard1
This will spin up the new collection and initial core. I'm not using a replication factor because the following commands manually bind the replicas.
4) Spin up a replica with a core admin command:
http://app02/solr/admin/cores?action=CREATE&name=storage-core&collection=storage&shard=shard1
5) Same command as above on the 3rd server to spin up another replica.
This will spin up a new core and bind it to shard1 of the storage collection.

On Mon, Apr 8, 2013 at 9:34 AM, A.Eibner a_eib...@yahoo.de wrote:
Hi, I have a problem with setting up my solr cloud environment (on three machines). If I want to create my collections from scratch I do the following:
*) Start ZooKeeper on all machines.
*) Upload the configuration (on app02) for the collection via the following command:
zkcli.sh -cmd upconfig --zkhost app01:4181,app02:4181,app03:4181 --confdir config/solr/storage/conf/ --confname storage-conf
*) Link the configuration (on app02) via the following command:
zkcli.sh -cmd linkconfig --collection storage --confname storage-conf --zkhost app01:4181,app02:4181,app03:4181
*) Start the Tomcats (containing Solr) on app02, app03
*) Create the collection via:
http://app03/solr/admin/collections?action=CREATE&name=storage&numShards=1&replicationFactor=2&collection.configName=storage-conf
This creates the replication of the shard on app02 and app03, but neither of them is marked as leader, both are marked as DOWN. And afterwards I cannot access the collection.
In the browser I get:
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:

In the log files the following error is present:
SEVERE: Error from shard: app02:9985/solr
org.apache.solr.common.SolrException: Error CREATEing SolrCore 'storage_shard1_replica1':
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:404)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:172)
at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:135)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.solr.common.cloud.ZooKeeperException:
at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:922)
at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:892)
at org.apache.solr.core.CoreContainer.register(CoreContainer.java:841)
at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:479)
... 19 more
Caused by: org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1
at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:864)
at org.apache.solr.cloud.ZkController.register(ZkController.java:776)
at org.apache.solr.cloud.ZkController.register(ZkController.java:727)
at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:908)
... 22 more
Caused by: java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native
Re: Empty Solr 4.2.1 can not create Collection
Hi, you are right, I have removed collection1 from the solr.xml but set defaultCoreName=storage. Also, this works in 4.0 but won't in 4.2.1 - any clues? Kind Regards, Alexander

On 2013-04-08 20:06, Joel Bernstein wrote:
The scenario above needs to have collection1 removed from the solr.xml to work. This, I believe, is the Empty Solr scenario that you are talking about. If you don't remove collection1 from solr.xml on all the Solr instances, they will get tripped up on collection1 during these steps. If you start up with collection1 in solr.xml it's best to start the initial Solr instance with the bootstrap-conf parameter so Solr can properly create this collection.

On Mon, Apr 8, 2013 at 1:12 PM, Joel Bernstein joels...@gmail.com wrote: [...]
Average Solr Server Spec.
This question may not have a general answer and may be open ended, but is there any commodity server spec for a usual Solr machine? I mean, what is the average server specification for a Solr machine? (e.g. for a system running Hadoop it is not recommended to have computers with very big storage capacity.) I will use Solr for indexing web-crawled data.
Re: SOLR-4581
Hi Alexander, I have put up a test case reproducing your issue. Perhaps someone more familiar with the faceting code can debug this. For now, you can work around this issue by adding facet.method=fc to your queries.

On Mon, Apr 8, 2013 at 2:14 PM, Alexander Buhr a.b...@epages.com wrote:
Hello, I created https://issues.apache.org/jira/browse/SOLR-4581 on 14.03.2013. Can anyone help me out with this? Thank you.
Alexander Buhr
Software Engineer
ePages GmbH, Pilatuspool 2, 20355 Hamburg, Germany
a.b...@epages.com
www.epages.com

--
Regards, Shalin Shekhar Mangar.
Doc Transformer with SolrDocumentList object
I am trying to modify the results of the Solr output. Basically I need to change the ranking of the output of Solr for a query, so please can anyone help. I wrote Java code that returns a SolrDocumentList object which is a union of the results; I want this object to be displayed by Solr. That is, once the query is hit, Solr runs the Java code I wrote and whatever the Java code returns becomes the output on the screen. I have tried to use the code as a data transformer, but I am getting this error:

org.apache.solr.handler.dataimport.SolrWriter upload
WARNING: Error creating document : SolrInputDocument[id=44, category=Apparel Fash Accessories, _version_=1431753044032225280, price=ERROR:SCHEMA-INDEX-MISMATCH,stringValue=1400, description=for girls, brand=Wrangler, price_c=1400,USD, size=ERROR:SCHEMA-INDEX-MISMATCH,stringValue=12]
org.apache.solr.common.SolrException: version conflict for 44 expected=1431753044032225280 actual=-1

Please can anyone help?
Re: conditional queries?
Hi Mark,

Is it possible to do a conditional query if another query has no results? For example, say I want to search against a given field for:
- Search for car. If there are results, return them.
- Else, search for car* . If there are results, return them.
- Else, search for car~ . If there are results, return them.
Is this possible in one query? Or would I need to make 3 separate queries by implementing this logic within my client?

As far as I know, there is no such SearchComponent. But the idea of a FallbackRequestHandler has been discussed; see SOLR-1878, for example: https://issues.apache.org/jira/browse/SOLR-1878

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html
How to configure shards with SSL?
Good morning everyone, I'm running Solr 4.0 Final with ManifoldCF v1.2dev on Tomcat 7.0.37. I had shards up and running over http, but when I migrated to SSL it won't work anymore. First I got an IO Exception, but then I changed my configuration in solrconfig.xml to this:

<requestHandler name="/all" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">xml</str>
    <str name="indent">true</str>
    <str name="q.alt">*:*</str>
    <str name="fl">id, solr.title, content, category, link, pubdateiso</str>
    <str name="shards">dev:7443/solr/ProfilesJava/|dev:7443/solr/C3Files/|dev:7443/solr/Blogs/|dev:7443/solr/Communities/|dev:7443/solr/Wikis/|dev:7443/solr/Bedeworks/|dev:7443/solr/Forums/|dev:7443/solr/Web/|dev:7443/solr/Bookmarks/</str>
  </lst>
  <shardHandlerFactory class="HttpShardHandlerFactory">
    <str name="urlScheme">https://</str>
    <int name="socketTimeOut">1000</int>
    <int name="connTimeOut">5000</int>
  </shardHandlerFactory>
</requestHandler>

And now I'm getting this error:

org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:

How do I configure shards with SSL? Thanks,
Re: Latency Comparison between cloud hosting Vs Dedicated hosting
On Tue, Apr 9, 2013 at 3:33 AM, Sujatha Arun suja.a...@gmail.com wrote: Would a bigger instance improve latency? Yes, and prewarming caches would help, too. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game
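For the cache prewarming Michael mentions, a solrconfig.xml sketch (the sizes are illustrative, not a recommendation):

<!-- autowarmCount entries are re-executed against the new searcher after a commit -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>

plus, optionally, static warming queries:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str></lst>
  </arr>
</listener>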
Re: Best practice for rebuild index in SolrCloud
We're setting up two collection aliases. One's a read alias, one's a write alias. When we need to start over with a new collection, we create the collection alongside the original, and point the write alias at it. When indexing is done, we point the read alias at it. Then you can delete the old collection when you feel good about the new one. Obviously this means that none of your clients should point at the collection directly, but rather one of the aliases depending on whether they're reading or writing. HTH, Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Mon, Apr 8, 2013 at 5:45 PM, Bill Au bill.w...@gmail.com wrote: We are using SolrCloud for replication and dynamic scaling but not distribution so we are only using a single shard. From time to time we make changes to the index schema that requires rebuilding of the index. Should I treat the rebuilding as just any other index operation? It seems to me it would be better if I can somehow take a node offline and rebuild the index there, then put it back online and let the new index be replicated from there. But I am not sure how to do the latter. Bill
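For reference, a sketch of the alias juggling with the Collections API (host, alias and collection names are made up):

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products-read&collections=products_v1
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products-write&collections=products_v2

then reindex through the write alias, and when done repoint the read alias and drop the old collection:

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products-read&collections=products_v2
http://localhost:8983/solr/admin/collections?action=DELETE&name=products_v1

Re-issuing CREATEALIAS with an existing alias name simply repoints it.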
Re: conditional queries?
We do this on the client side with multiple queries. It is fairly efficient, because most responses are from the first, exact query. wunder

On Apr 9, 2013, at 6:15 AM, Koji Sekiguchi wrote: [...]
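A sketch of that client-side fallback, using the car / car* / car~ sequence from Koji's mail (the endpoint, field name and wt=json parsing are assumptions):

for q in 'car' 'car*' 'car~'; do
  # count the hits for this query variant
  n=$(curl -s "http://localhost:8983/solr/select?q=body:$q&wt=json" | grep -o '"numFound":[0-9]*' | head -1 | cut -d: -f2)
  if [ "${n:-0}" -gt 0 ]; then
    # first variant with results wins
    curl -s "http://localhost:8983/solr/select?q=body:$q&wt=json&rows=10"
    break
  fi
done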
Execution of Queries in Parallel: geotagged textual documents in Solr
I have around 100M textual documents, geotagged (lat,long). These documents are indexed with Solr 1.4. I am testing a retrieval model (written over Terrier). This model requires frequent execution of queries (bounding-box filter). These queries could be executed in parallel, one for each specific geographic tile. I was wondering if there exists a solution for speeding up the execution of queries in parallel. My naive idea is to split the index into many parts according to the geographical tiles (how to do that? SolrCloud? Solr Index Replication? What is the max number of eventual replications?) Any further practical suggestion? Thanks in advance, Massimiliano
Re: Search data who does not have x field
Sorry, I didn't explain myself well. I mean, you have to create an additional field 'hasCategory' in your schema, and then, before indexing, set the field 'hasCategory' in the indexed document to true if your document has categories, or to false if it has none. With this you will save computation time, since a query on a boolean field is much cheaper for Solr than checking for an empty string field. The query should be: q=*:*&fq=hasCategory:true

anurag.jain wrote:
another solution would be to add a boolean field, hasCategory, and use it for filtering: q= your query here &fq=hasCategory:true
I am not getting results. I am trying localhost:8983/search?q=*:*&fq=category:true and it is giving zero results. By the way, the first technique is working fine.
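A sketch of the schema side of this (the boolean type is stock; the field line is assumed):

<field name="hasCategory" type="boolean" indexed="true" stored="false"/>

and the query spelled out in full:

http://localhost:8983/solr/select?q=*:*&fq=hasCategory:true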
corrupted index in slave?
Hi guys, I'm getting exceptions in a Solr slave when accessing the TermVector component and the RealTimeGetHandler. The weird thing is that on the master and on one of the 2 slaves the documents are OK, and the same query doesn't return any exception. For now, the only way I have to solve the problem is deleting these documents and indexing them again. I upgraded Solr from 4.0 directly to 4.2, then to 4.2.1 last week. These exceptions seem to appear since the upgrade to 4.2. I didn't run the script for migrating the index files (as I did in the migration from 3.6 to 4.0), should I? Has the format of the index changed? If not, is this a known bug? If it is, sorry, I couldn't find it in JIRA. These are the exceptions I get:

{responseHeader:{status:500,QTime:1},response:{numFound:1,start:0,docs:[{itemid:105266867,text:exklusiver kann man kaum würzen safran ist das teuerste gewürz der welt handverlesen und in mühevoller kleinstarbeit hergestellt ist safran sehr selten und wird in winzigen mengen gehandelt und verwendet,title:safran,domainid:4287,date_i:2012-11-21T17:01:23Z,date:2012-11-21T17:01:09Z,category:[kultur,literatur,gesellschaft,umwelt,trinken,essen]}]},termVectors:[uniqueKeyFieldName,itemid,105266867,[uniqueKey,105266867]],error:{trace:java.lang.ArrayIndexOutOfBoundsException\n\tat org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)\n\tat org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)\n\tat org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:493)\n\tat org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:175)\n\tat org.apache.lucene.index.BaseCompositeReader.getTermVectors(BaseCompositeReader.java:97)\n\tat org.apache.lucene.index.IndexReader.getTermVector(IndexReader.java:385)\n\tat org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:313)\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)\n\tat org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)\n\tat org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)\n\tat org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)\n\tat org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)\n\tat org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)\n\tat org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)\n\tat org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)\n\tat org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)\n\tat org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)\n\tat org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)\n\tat org.mortbay.jetty.Server.handle(Server.java:326)\n\tat org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)\n\tat org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)\n\tat org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)\n\tat
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)\n\tat org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)\n\tat org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)\n\tat org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)\n,code:500}} {error:{trace:java.lang.ArrayIndexOutOfBoundsException\n\tat org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)\n\tat org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)\n\tat org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:258)\n\tat org.apache.lucene.index.SegmentReader.document(SegmentReader.java:139)\n\tat org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:116)\n\tat org.apache.lucene.index.IndexReader.document(IndexReader.java:436)\n\tat org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:640)\n\tat org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:568)\n\tat org.apache.solr.handler.component.RealTimeGetComponent.process(RealTimeGetComponent.java:176)\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)\n\tat
Re: corrupted index in slave?
Sorry, I forgot to say: the exceptions are not for every document, but only for a few... Regards, Victor

Victor Ruiz wrote: [...]
Re: solr 4.2.1 still has problems with index version and index generation
: And with replication?command=details I also see the correct commit part as
: above, BUT where the hell is the wrong info below the commits array
: coming from?

Please read the details in the previously mentioned Jira issue... https://issues.apache.org/jira/browse/SOLR-4661

The indexVersion and generation you are looking at refer to the specifics of the IndexReader as used by the *searcher* on the master server -- but in addition to situations like openSearcher=false, there are some optimizations in place such that Solr/Lucene is smart enough to realize when an empty commit doesn't change the IndexReader and it continues to use the previous commit point... https://issues.apache.org/jira/browse/SOLR-4661?focusedCommentId=13620195&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13620195 ...but from the perspective of the slave, this is still a commit that needs to be replicated and loaded. Hence the current objective of the patch in SOLR-3855: add more details to the command=details response (as well as the Admin UI) to clearly distinguish between the gen/ver of the currently replicatable commit and the gen/ver of the currently open searcher.

All available information suggests that this is purely a problem of conveying information to users via command=details -- replication is behaving as designed, using the correct information about the commit points.

-Hoss
How can I set configuration options?
Hi all, I have been working through the examples on the SolrCloud page: http://wiki.apache.org/solr/SolrCloud

I am now at the point where, rather than firing up Solr through start.jar, I'm deploying the Solr war into Tomcat instances. Taking the following command as an example:

java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2 -jar start.jar

I can't figure out from the documentation how/where I set the above properties when deploying Solr as a war file. I initially thought these might be configurable through solr.xml but can't find anything in the documentation to support this. Most grateful for any pointers here. Cheers, Edd
--
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543
Re: conditional queries?
I'm not sure, but you can create a class extending SearchComponent and include it in the last-components of your request handler; in this way you can add optional actions for whatever query hits your Solr server. Example solrconfig.xml:

<requestHandler ...>
  <arr name="last-components">
    <str>actions</str>
  </arr>
</requestHandler>

<searchComponent name="actions" class="HERE YOUR CLASS">
  <str name="params"/>
</searchComponent>

Regards

On 09/04/2013 17:05, Walter Underwood wrote: [...]
Index Replication Failing in Solr 4.2.1
Hi All, I am migrating from Solr 3.5.0 to Solr 4.2.1, and everything is running fine and set to go except the master-slave replication. We use master-slave replication with multiple cores (1 master, 10 slaves and 20-plus cores). My configuration is:

Master: Solr 3.5.0, has an existing index, and delta imports running using DIH.
Slave: Solr 4.2.1, has no startup index.

Apr 9, 2013 9:18:40 PM org.apache.solr.core.SolrCore execute
INFO: [phcare] webapp= path=/replication params={command=fetchindex&_=1365522520521&wt=json} status=0 QTime=1
Apr 9, 2013 9:18:40 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Master's generation: 107876
Apr 9, 2013 9:18:40 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave's generation: 79248
Apr 9, 2013 9:18:40 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Starting replication process
Apr 9, 2013 9:18:40 PM org.apache.solr.handler.SnapPuller fetchFileList
SEVERE: No files to download for index generation: 107876
Apr 9, 2013 9:18:40 PM org.apache.solr.core.SolrCore execute
INFO: [phcare] webapp= path=/replication params={command=details&_=1365522520556&wt=json} status=0 QTime=7

On both master and slave the file list for the replicable version is correct.

On slave:
{
  masterDetails: {
    indexSize: 4.31 MB,
    indexPath: /var/lib/fk-w3-sherlock/cores/phcare/data/index.20130124235012,
    commits: [
      [ indexVersion, 1323961124638, generation, 107856, filelist, [ _45e1.tii, _45e1.nrm, ... ] ]
    ...

On master:
[ indexVersion, 1323961124638, generation, 107856, filelist, [ _45e1.tii, _45e1.nrm, _45e2_1.del, _45e2.frq, _45e1_3.del, _45e1.tis, ... ] ]

Can someone help? Our whole migration to Solr 4.2 is blocked on this replication issue.

---
Thanks & Regards, Umesh Prasad
SolrCloud: Result Grouping - no groups with field type with precisionStep 0
Hello, I am using the Result Grouping feature with SolrCloud, and it seems that, in distributed mode, grouping does not work with field types having a precisionStep property greater than 0. I updated the SolrCloud - Getting Started page example A (Simple two shard cluster). In my schema.xml, the popularity field has an int type where I changed precisionStep from 0 to 4:

<fieldType name="int" class="solr.TrieIntField" precisionStep="4" positionIncrementGap="0"/>
<field name="popularity" type="int" indexed="true" stored="true"/>

When I'm requesting in distributed mode, grouping on this field does not return groups:

http://localhost:8983/solr/select?q=*:*&group=true&group.field=popularity&distrib=true

<lst name="grouped">
  <lst name="popularity">
    <int name="matches">1</int>
    <arr name="groups">
      <lst>
        <int name="groupValue">0</int>
        <result name="doclist" numFound="0" start="0"/>
      </lst>
    </arr>
  </lst>
</lst>

When I'm requesting on a single core, grouping on this field returns a group:

http://localhost:8983/solr/select?q=*:*&group=true&group.field=popularity&distrib=false

<lst>
  <int name="groupValue">10</int>
  <result name="doclist" numFound="1" start="0">
    <doc>
      <str name="id">MA147LL/A</str>
      ...
      <int name="popularity">10</int>
      ...
    </doc>
  </result>
</lst>

If I come back to the original configuration, changing the int type to precisionStep="0", the distributed request works:

<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>

A precisionStep greater than 0 can be useful for range queries, but is it normal that it is not compatible with grouping queries, in distributed mode only?

Elodie Sannier
Re: Execution of Queries in Parallel: geotagged textual documents in Solr
Hi, I'd move to SolrCloud 4.2.1 to benefit from sharding, replication, and the latest Lucene. How many queries you will then be able to run in parallel will depend on their complexity, index size, query cachability, latency requirements... But move to the latest setup first.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm/index.html

On Tue, Apr 9, 2013 at 11:10 AM, Massimiliano Ruocco ruo...@idi.ntnu.no wrote: [...]
Re: Latency Comparison between cloud hosting Vs Dedicated hosting
Hi Sujatha, You should really do the same stuff to improve latency in the cloud as what you would do on a dedicated server. Amazon-specific stuff: bigger EC2 instances have better IO. EBS performance varies. Some people mount N of them and stripe across them. Some people try N EBS volumes to find the best performing one(s) and discard the rest. Some people pay for provisioned IOPS.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm/index.html

On Tue, Apr 9, 2013 at 3:33 AM, Sujatha Arun suja.a...@gmail.com wrote: [...]
Solr 4.2.1 SSLInitializationException
Hi All, Deploying Solr 4.2.1 to GlassFish 3.1.1 results in the error below. I have seen similar problems being reported with Solr 4.2 and my take-away was that 4.2.1 contains the necessary fix. Any help with this will be appreciated. Thanks!

2013-04-09 10:45:06,144 [main] ERROR org.apache.solr.servlet.SolrDispatchFilter - Could not start Solr. Check solr/home property and the logs
2013-04-09 10:45:06,224 [main] ERROR org.apache.solr.core.SolrCore - null:org.apache.http.conn.ssl.SSLInitializationException: Failure initializing default system SSL context
Caused by: java.io.IOException: Keystore was tampered with, or password was incorrect
at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:772)
at sun.security.provider.JavaKeyStore$JKS.engineLoad(JavaKeyStore.java:55)
at java.security.KeyStore.load(KeyStore.java:1214)
at org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:281)
at org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:366)
... 50 more
Caused by: java.security.UnrecoverableKeyException: Password verification failed
at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:770)
Re: Solr 4.2.1 SSLInitializationException
: Deploying Solr 4.2.1 to GlassFish 3.1.1 results in the error below. I
: have seen similar problems being reported with Solr 4.2

Are you trying to use server SSL with glassfish? Can you please post the full stack trace so we can see where this error is coming from.

My best guess is that this is coming from the changes made in SOLR-4451 to use system defaults correctly when initializing HttpClient, which suggests that your problem is exactly what the error message says... Keystore was tampered with, or password was incorrect

Is it possible that the default keystore for your JVM (or as overridden by glassfish defaults - possibly using the javax.net.ssl.keyStore sysprop) has a password set on it? If so, you need to configure your JVM with the standard java system properties to specify what that password is.

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c1364232676233-4051159.p...@n3.nabble.com%3E

: [...]

-Hoss
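For reference, the standard system properties Hoss refers to look like this (paths and passwords are placeholders, not values from this thread):

-Djavax.net.ssl.keyStore=/path/to/keystore.jks
-Djavax.net.ssl.keyStorePassword=yourpassword
-Djavax.net.ssl.trustStore=/path/to/truststore.jks
-Djavax.net.ssl.trustStorePassword=yourpassword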
Re: Execution of Queries in Parallel: geotagged textual documents in Solr
: I'd move to SolrCloud 4.2.1 to benefit from sharding, replication, and
: the latest Lucene. [...]

Not to mention that geospatial query support is vastly improved in Solr 4.x vs what was possible in Solr 1.4.

-Hoss
query regarding the use of boost across the fields in edismax query
Hi all, I wanted to know what the difference would be between the results if I apply boosts across, say, 5 fields in a query. For the first setting:

title^10.0 features^7.0 cat^5.0 color^3.0 root^1.0

and for the second setting:

title^10.0 features^5.0 cat^3.0 color^2.0 root^1.0

what would the difference be, given that the weights are in the same decreasing order? Thanks in advance. Regards, Rohan
Re: query regarding the use of boost across the fields in edismax query
Not sure if I'm missing something, but in the first case the features, cat, and color fields have more weight, so matches on them will have a bigger contribution to the overall relevancy score.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Tue, Apr 9, 2013 at 1:52 PM, Rohan Thakur rohan.i...@gmail.com wrote: [...]
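To make that concrete, a sketch of how the first weighting would sit in an edismax handler (everything except the qf value is assumed):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- a features match contributes 7x a root match, a title match 10x -->
    <str name="qf">title^10.0 features^7.0 cat^5.0 color^3.0 root^1.0</str>
  </lst>
</requestHandler>

Since scaling all boosts by a constant leaves the ranking unchanged, only the ratios between the fields matter, and the two settings differ exactly in how much a features, cat or color match is worth relative to title and root.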
Re: Average Solr Server Spec.
Hi, You are right, there is no average. I saw a Solr cluster with a few EC2 micro instances yesterday, and I regularly see Solr running on 16 or 32 GB RAM and sometimes well over 100 GB RAM. Sometimes they have just 2 CPU cores, sometimes 32 or more. Some use SSDs, some HDDs, some local storage, some SAN, some EBS on AWS, etc.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Tue, Apr 9, 2013 at 7:04 AM, Furkan KAMACI furkankam...@gmail.com wrote: [...]
Re: Average Solr Server Spec.
We mostly run m1.xlarge with an 8GB heap. --wunder

On Apr 9, 2013, at 10:57 AM, Otis Gospodnetic wrote: [...]
Results Order When Performing Wildcard Query
Hi, I wrote a test for my application which revealed a Solr oddity (I think). The test, which I wrote on Windows 7 and which makes use of the solr-test-framework (http://lucene.apache.org/solr/4_1_0/solr-test-framework/index.html), fails under Ubuntu 12.04 because the Solr results I expected for a wildcard query of the test data are ordered differently under Ubuntu than under Windows. On both Windows and Ubuntu all items in the result set have a score of 1.0 and appear to be ordered by docid (which looks like it corresponds to the alphabetical unique id on Windows but not on Ubuntu). I'm guessing that the root of my issue is that a different docid was assigned to the same document on each operating system. The data was imported using a DataImportHandler configuration during a @BeforeClass step in my JUnit test on both systems. Any suggestions on how to ensure a consistently ordered wildcard query result set for testing? Thanks, Tricia
Re: How can I set configuration options?
In Ubuntu, I've added it to /etc/default/tomcat7 in the JAVA_OPTS options. For example, I have:

JAVA_OPTS="-Djava.awt.headless=true -Xmx2048m -XX:+UseConcMarkSweepGC"
JAVA_OPTS="${JAVA_OPTS} -DnumShards=2 -Djetty.port=8080 -DzkHost=zookeeper01.dev.:2181 -Dbootstrap_conf=true"

--
Nate Fox
Sr Systems Engineer

On Tue, Apr 9, 2013 at 8:55 AM, Edd Grant e...@eddgrant.com wrote: [...]
Re: How can I set configuration options?
Hi Edd; The parameters you mentioned are JVM parameters. There are two ways to define them. The first is, if you are using an IDE, to indicate them as JVM parameters, i.e. if you are using IntelliJ IDEA, when you click your Run/Debug configurations there is a line called VM Options. You can write your parameters there without the leading java command. The second is deploying your war file into Tomcat without using an IDE (I think this is what you want). Here is what to do: go to the Tomcat home folder and under the bin folder create a file called setenv.sh. Then add these lines:

#!/bin/sh
export JAVA_OPTS="$JAVA_OPTS -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2"

2013/4/9 Edd Grant e...@eddgrant.com [...]
Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase
You may also be interested in looking at things like solrbase (on Github). Otis -- Solr ElasticSearch Support http://sematext.com/ On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; First of all I should mention that I am new to Solr and doing research on it. What I am trying to do is crawl some websites with Nutch and then index them with Solr. (Nutch 2.1, Solr-SolrCloud 4.2) I wonder about something. I have a cloud of machines that crawls websites and stores those documents. Then I send those documents into SolrCloud. Solr indexes those documents, generates indexes, and saves them. I know from Information Retrieval theory that it *may* not be efficient to store indexes in a NoSQL database (they are something like linked lists, and if you store them in such a database you *may* get a sparse representation - by the way, there may be some solutions for that; if you can explain them, you are welcome to). However, Solr stores some documents too (i.e. highlights), so some of my documents will be doubled somehow. If I consider that I will have many documents, those doubled documents may cause a problem for me. So is there any way to not store those documents in Solr and instead point to them in HBase (where I save my crawled documents), or to store them directly in HBase (is that efficient or not)?
Re: Average Solr Server Spec.
Hi Walter; Could you tell me the average size of your Solr indexes and the average queries per second against your Solr? Maybe I can come up with an estimate? 2013/4/9 Walter Underwood wun...@wunderwood.org [...]
Indexing and searching documents in different languages
Hello, I'm trying to index a large number of documents in different languages. I don't know the language of the document, so I'm using TikaLanguageIdentifierUpdateProcessorFactory to identify it. So, this is my configuration in solrconfig.xml:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <bool name="langid">true</bool>
    <str name="langid.fl">title,subtitle,content</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.threshold">0.3</str>
    <str name="langid.fallback">general</str>
    <str name="langid.whitelist">en,fr,de,it,es</str>
    <bool name="langid.map">true</bool>
    <bool name="langid.map.keepOrig">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

So, the detection works fine and I put some dynamic fields in schema.xml to store the results:

<dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_de" type="text_de" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_it" type="text_it" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true"/>

My main problem now is how to search the documents without knowing the language of the searched document. I don't want to have a huge querystring like ?q=title_en:term+subtitle_en:term+title_de:term... Okay, I could use copyField and copy all fields into the text field... but text has the type text_general, so the language-specific indexing is not working. I could use at least a combined field for every language (like text_en, text_fr...), but still, my querystring gets very long and adding new languages is terribly uncomfortable. So, what can I do? Is there a better solution to index and search documents in many languages without knowing the language of the document and the query beforehand? - Geschan
Re: Number of segments
My main concern was just making sure we were getting the best search performance, and that we did not have too many segments. Every attempt I made to adjust the segment count resulted in no difference (the segment count never changed). Looking at that blog page, it looks like 30-40 segments is probably the norm. On 04/08/2013 08:43 PM, Chris Hostetter wrote: : How do I determine how many tiers it has? You may find this blog post from mccandless helpful... http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html (don't ignore the videos! watching them is really helpful to understand what he is talking about) Once you've absorbed that, then please revisit your question, specifically Upayavira's key point: what is the problem you are trying to solve? https://people.apache.org/~hossman/#xyproblem XY Problem Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341 -Hoss
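For context, the knobs that control segment counts in Solr 4.x live under <indexConfig> in solrconfig.xml. A hedged sketch, with values that I believe are the defaults:

<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- how many segments may be merged at once -->
    <int name="maxMergeAtOnce">10</int>
    <!-- how many similarly-sized segments are allowed per tier -->
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
</indexConfig>

Lowering segmentsPerTier generally means fewer, larger segments, at the cost of more merge I/O during indexing.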
Re: Indexing and searching documents in different languages
Hi, Typically people try to figure out the query language somehow. Queries are short, so LID on them is hard. But a user profile could indicate a language, or users can be asked, and such. Otis -- Solr ElasticSearch Support http://sematext.com/ On Tue, Apr 9, 2013 at 2:32 PM, d...@geschan.de wrote: [...]
Re: Solr metrics in Codahale metrics and Graphite?
If it isn't obvious, I'm glad to help test a patch for this. We can run a simulated production load in dev and report to our metrics server. wunder On Apr 8, 2013, at 1:07 PM, Walter Underwood wrote: That approach sounds great. --wunder On Apr 7, 2013, at 9:40 AM, Alan Woodward wrote: I've been thinking about how to improve this reporting, especially now that metrics-3 (which removes all of the funky thread issues we ran into last time I tried to add it to Solr) is close to release. I think we could go about it as follows: * refactor the existing JMX reporting to use metrics-3. This would mean replacing the SolrCore.infoRegistry map with a MetricsRegistry, and adding a JmxReporter, keeping the existing config logic to determine which JMX server to use. PluginInfoHandler and SolrMBeanInfoHandler translate the metrics-3 data back into SolrMBean format to keep the reporting backwards-compatible. This seems like a lot of work for no visible benefit, but… * we can then add the ability to define other metrics reporters in solrconfig.xml. There are already reporters for Ganglia and Graphite - you just add them to the Solr lib/ directory, configure them in solrconfig, and voila - Solr can be monitored using the same devops tools you use to monitor everything else. Does this sound sane? Alan Woodward www.flax.co.uk On 6 Apr 2013, at 20:49, Walter Underwood wrote: Wow, that really doesn't help at all, since these seem to only be reported in the stats page. I don't need another non-standard app-specific set of metrics, especially one that needs polling. I need metrics delivered to the common system that we use for all our servers. This is also why SPM is not useful for us, sorry Otis. Also, there is no time period on these stats. How do you graph the 95th percentile? I know there was a lot of work on these, but they seem really useless to me. I'm picky about metrics; working at Netflix does that to you. wunder On Apr 3, 2013, at 4:01 PM, Walter Underwood wrote: In the Jira, but not in the docs. It would be nice to have VM stats like GC, too, so we can have common monitoring and alerting on all our services. wunder On Apr 3, 2013, at 3:31 PM, Otis Gospodnetic wrote: It's there! :) http://search-lucene.com/?q=percentile&fc_project=Solr&fc_type=issue Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Apr 3, 2013 at 6:29 PM, Walter Underwood wun...@wunderwood.org wrote: That sounds great. I'll check out the bug, I didn't see anything in the docs about this. And if I can't find it with a search engine, it probably isn't there. --wunder On Apr 3, 2013, at 6:39 AM, Shawn Heisey wrote: On 3/29/2013 12:07 PM, Walter Underwood wrote: What are folks using for this? I don't know that this really answers your question, but Solr 4.1 and later includes a big chunk of codahale metrics internally for request handler statistics - see SOLR-1972. First we tried including the jar and using the API, but that created thread leak problems, so the source code was added. Thanks, Shawn -- Walter Underwood wun...@wunderwood.org -- Walter Underwood wun...@wunderwood.org
Re: Indexing and searching documents in different languages
Have you looked at edismax and the 'qf' fields parameter? It allows you to define the fields to search. Also, you can define those parameters in solrconfig.xml and not have to send them down the wire. Finally, you can define several different request handlers (e.g. /ensearch, /frsearch) and have each of them use different 'qf' values, possibly with the 'fl' field also defined and with field name aliasing from language-specific to generic names. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Tue, Apr 9, 2013 at 2:32 PM, d...@geschan.de wrote: [...]
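A minimal sketch of the per-language request handler idea, for solrconfig.xml. The handler name and field list are illustrative, assuming the dynamic fields from this thread:

<requestHandler name="/ensearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- search the English variants of the indexed fields -->
    <str name="qf">title_en subtitle_en content_en</str>
    <str name="fl">id,title_en,subtitle_en,score</str>
  </lst>
</requestHandler>

A query then becomes just /ensearch?q=term, with no field list on the wire, and adding a language means adding one handler rather than touching every query.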
Re: Slow qTime for distributed search
Thanks for replying. My config: - 40 dedicated servers, dual-core each - Running the Tomcat servlet container on Linux - 12 GB RAM per server, split evenly between OS and Solr - Complex queries (up to 30 conditions on different fields), 1 qps rate. Sharding my index was done for two reasons, based on tests with 2 servers (4 shards): 1. As the index grew above a few million docs, qTime rose greatly, while sharding the index into smaller pieces (about 0.5M docs) gave way better results, so I bound every shard to 0.5M docs. 2. Tests showed I was CPU-bound during queries. As I have a low qps rate (emphasis: lower than expected qTime) and as a query runs single-threaded on each shard, it made sense to dedicate a CPU core to each shard. For the same number of docs per shard I do expect a rise in total qTime for these reasons: 1. The response has to wait for the slowest shard. 2. Merging the responses from 40 different shards takes time. What I understand from your explanation is that it's the merging that takes time, and as qTime ends only after the second retrieval phase, the qTime on each shard will take longer. Meaning that during a significant proportion of the first query phase (right after the [id,score] pairs are retrieved), all CPUs are idle except the response-merger thread running on a single CPU. I thought of the merge as a simple sorting of [id,score], far simpler than 300 ms of additional CPU time. Why would a RAM increase improve my performance, as it's a response-merge (CPU resource) bottleneck? Thanks in advance, Manu On Mon, Apr 8, 2013 at 10:19 PM, Shawn Heisey s...@elyograg.org wrote: On 4/8/2013 12:19 PM, Manuel Le Normand wrote: It seems that sharding my collection to many shards slowed it down unreasonably, and I'm trying to investigate why. First, I created collection1 - a 4 shards * replicationFactor=1 collection on 2 servers. Second, I created collection2 - a 48 shards * replicationFactor=2 collection on 24 servers, keeping the same config and the same number of documents per shard. The primary reason to use shards is for index size, when your index is so big that a single index cannot give you reasonable performance. There are also sometimes performance gains when you break a smaller index into shards, but there is a limit. Going from 2 shards to 3 shards will have more of an impact than going from 8 shards to 9 shards. At some point, adding shards makes things slower, not faster, because of the extra work required for combining multiple queries into one result response. There is no reasonable way to predict when that will happen. Observations showed the following: 1. Total qTime for the same query set is 5 times higher in collection2 (150 ms - 700 ms) 2. Adding the *shard.info=true* param to the query on collection2 shows that each shard is much slower than each shard was in collection1 (about 4 times slower) 3. Querying only specific shards on collection2 (by adding the shards=shard1,shard2...shard12 param) gave me much better qTime per shard (only 2 times higher than in collection1) 4. I have a low qps rate, thus I don't suspect the replication factor of being the major cause of this. 5. The avg. cpu load on servers during querying was much higher in collection1 than in collection2 and I didn't catch any other bottleneck. A distributed query actually consists of up to two queries per shard. The first query just requests the uniqueKey field, not the entire document. If you are sorting the results, then the sort field(s) are also requested, otherwise the only additional information requested is the relevance score.
The results are compiled into a set of unique keys, then a second query is sent to the proper shards requesting specific documents. Q: 1. Why does the number of shards affect the qTime of each shard? 2. How can I bring the qTime of each shard back down? With more shards, it takes longer for the first phase to compile the results, so the second phase (document retrieval) gets delayed, and the QTime goes up. One way to reduce the total time is to reduce the number of shards. You haven't said anything about how complex your queries are, your index size(s), or how much RAM you have on each server and how it is allocated. Can you provide this information? Getting good performance out of Solr requires plenty of RAM in your OS disk cache. Query times of 150 to 700 milliseconds seem very high, which could be due to query complexity or a lack of server resources (especially RAM), or possibly both. Thanks, Shawn
Re: Results Order When Performing Wildcard Query
On 4/9/2013 12:08 PM, P Williams wrote: [...] It might be due to differences in how Java works on the two platforms, or even something as simple as different Java versions. I don't know a lot about the underlying Lucene stuff, so this next sentence may not be correct: If you are not starting from an index where the actual index directory was deleted before the test started (rather than deleting all documents), that might produce different internal Lucene document ids. The data was imported using a DataImportHandler configuration during a @BeforeClass step in my JUnit test on both systems. Any suggestions on how to ensure a consistently ordered wildcard query result set for testing? Include an explicit sort parameter. That way it will depend on the data, not the internal Lucene representation. Thanks, Shawn
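For example, a hedged sketch of such a test query (collection1, the field name, and id as the uniqueKey are assumptions, not taken from the thread):

http://localhost:8983/solr/collection1/select?q=somefield:*&sort=id+asc

With an explicit sort on a stable field, tied scores no longer leave the result order up to internal docids.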
Re: Slow qTime for distributed search
On 4/9/2013 2:10 PM, Manuel Le Normand wrote: [...] If you have not tweaked the Tomcat configuration, that can lead to problems, but if your total query volume is really only one query per second, this is probably not a worry for you. A Tomcat connector can be configured with a maxThreads parameter. The recommended value there is 10000, but Tomcat defaults to 200. You didn't include the index sizes. There's half a million docs per shard, but I don't know what that translates to in terms of MB or GB of disk space. On another email thread you mention that your documents are about 50KB each. That would translate to an index that's at least 25GB, possibly more. That email thread also says that optimization for you takes an hour, a further indication that you've got some really big indexes. You're saying that you have given 6GB out of the 12GB to Solr, leaving only 6GB for the OS and caching. Ideally you want to have enough RAM to cache the entire index, but in reality you can usually get away with caching between half and two thirds of the index. Exactly what ratio works best is highly dependent on your schema. If my numbers are even close to right, then you've got a lot more index on each server than available RAM. Based on what I can deduce, you would want 24 to 48GB of RAM per server. If my numbers are wrong, then this estimate is wrong. I would be interested in seeing your queries. If the complexity can be expressed as filter queries that get re-used a lot, the filter cache can be a major boost to performance. Solr's caches in general can make a big difference. There is no guarantee that caches will help, of course. Thanks, Shawn
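For reference, maxThreads is set on the connector in Tomcat's server.xml. A hedged sketch (the port, protocol, and timeout are just common defaults, not values from this thread):

<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="10000"
           connectionTimeout="20000" />

The high value effectively removes the thread cap so that the container, rather than a 200-thread limit, decides how much concurrency Solr sees.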
Re: How can I set configuration options?
Thanks for the replies. The problem I have is that setting them at the JVM level would mean that all instances of Solr deployed in the Tomcat instance are forced to use the same settings. I actually want to set the properties at the application level (e.g. in solr.xml, zoo.cfg or maybe an application-level Tomcat Context.xml file). I'll grab the Solr source and see if there's any way to do this, unless anyone knows how off the top of their head? Cheers, Edd On 9 April 2013 19:21, Furkan KAMACI furkankam...@gmail.com wrote: [...] -- Web: http://www.eddgrant.com Email: e...@eddgrant.com Mobile: +44 (0) 7861 394 543
Re: Results Order When Performing Wildcard Query
Hey Shawn, My gut says the difference in assignment of docids has to do with how the FileListEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor) works on the two operating systems. My guess is that the documents are updated/imported in a different order, but I haven't tested that theory. I still think it's kind of odd that there would be a difference. Indexes are created from scratch in my test, so it's not that. java -version reports the same values on both machines: java version 1.7.0_17 Java(TM) SE Runtime Environment (build 1.7.0_17-b02) Java HotSpot(TM) Client VM (build 23.7-b01, mixed mode) The explicit (arbitrary non-score) sort parameter will work as a work-around to get my test to pass in both environments while I think about this some more. Thanks! Cheers, Tricia On Tue, Apr 9, 2013 at 2:13 PM, Shawn Heisey s...@elyograg.org wrote: [...]
Re: Slow qTime for distributed search
Hi Shawn; You say that: *... your documents are about 50KB each. That would translate to an index that's at least 25GB* I know we cannot give an exact size, but what is the approximate ratio of document size to index size in your experience? 2013/4/9 Shawn Heisey s...@elyograg.org [...]
Approximately needed RAM for 5000 query/second at a Solr machine?
Is there anybody who can help me estimate the approximate RAM needed for 5000 queries/second on a Solr machine?
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
It all depends on the nature of your queries and the nature of the data in the index. Does returning results from a result cache count in your QPS? Not to mention how many cores and CPU speed and CPU caching as well. Not to mention network latency. The best way to answer is to do a proof-of-concept implementation and measure it yourself. -- Jack Krupansky -----Original Message----- From: Furkan KAMACI Sent: Tuesday, April 09, 2013 6:06 PM To: solr-user@lucene.apache.org Subject: Approximately needed RAM for 5000 query/second at a Solr machine? Is there anybody who can help me estimate the approximate RAM needed for 5000 queries/second on a Solr machine?
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
Actually I will propose a system and I should figure out the machine specifications. There will be no faceting mechanism at first, just the simple search queries of a web search engine. We can assume that I will have a commodity server (I don't know whether there is any benchmark for a usual Solr machine). 2013/4/10 Jack Krupansky j...@basetechnology.com [...]
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
On Apr 9, 2013, at 3:06 PM, Furkan KAMACI wrote: Is there anybody who can help me estimate the approximate RAM needed for 5000 queries/second on a Solr machine? No. That depends on the kind of queries you have, the size and content of the index, the required response time, how frequently the index is updated, and many more factors. So anyone who can guess that is wrong. You can only find that out by running your own benchmarks with your own queries against your own index. In our system, we can meet our response time requirements at a rate of 4000 queries/minute. We have several cores, but most traffic goes to a 3M document index. This index is small documents, mostly titles and authors of books. We have no wildcard queries and less than 5% of our queries use fuzzy matching. We update once per day and have cache hit rates of around 30%. We run new benchmarks twice each year, before our busy seasons. We use the current index and configuration and the queries from the busiest day of the previous season. Our key benchmark is the 95th percentile response time, but we also measure median, 90th, and 99th percentile. We are currently on Solr 3.3 with some customizations. We're working on transitioning to Solr 4. wunder -- Walter Underwood wun...@wunderwood.org
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
Hi Walter; Firstly, thanks for your detailed reply. I know that this is not a well-detailed question, but I don't have any metrics yet. If we talk about your system, what is the average RAM size of your Solr machines? Maybe that can help me make a comparison. 2013/4/10 Walter Underwood wun...@wunderwood.org [...]
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
We are using Amazon EC2 M1 Extra Large instances (m1.xlarge). http://aws.amazon.com/ec2/instance-types/ wunder On Apr 9, 2013, at 3:35 PM, Furkan KAMACI wrote: [...] -- Walter Underwood wun...@wunderwood.org
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
Thanks for your answer. 2013/4/10 Walter Underwood wun...@wunderwood.org [...]
Re: Pushing a whole set of pdf-files to solr
If anybody could still help me out with this, I'd really appreciate it. Thanks!
Re: Pushing a whole set of pdf-files to solr
The Apache Solr 4 Cookbook says: curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@cookbook.pdf" Is that what you want? 2013/4/10 sdspieg sdsp...@mail.ru If anybody could still help me out with this, I'd really appreciate it. Thanks!
Re: Pushing a whole set of pdf-files to solr
The newer release of SimplePostTool with Solr 4.x makes it easy to post PDF files from a directory, including automatically adding the file name to a field. But SolrCell is the direct API that it uses as well. -- Jack Krupansky -----Original Message----- From: Furkan KAMACI Sent: Tuesday, April 09, 2013 6:58 PM To: solr-user@lucene.apache.org Subject: Re: Pushing a whole set of pdf-files to solr [...]
Re: Slow qTime for distributed search
On 4/9/2013 3:50 PM, Furkan KAMACI wrote: [...] If you store the fields, that is actual size plus a small amount of overhead. Starting with Solr 4.1, stored fields are compressed. I believe that it uses LZ4 compression. Some people store all fields, some people store only a few or one - an ID field. The size of stored fields does have an impact on how much OS disk cache you need, but not as much as the other parts of an index. It's been my experience that termvectors take up almost as much space as stored data for the same fields, and sometimes more. Starting with Solr 4.2, termvectors are also compressed. Adding docValues (new in 4.2) to the schema will also make the index larger. The requirements here are similar to stored fields. I do not know whether this data gets compressed, but I don't think it does. As for the indexed data, this is where I am less clear about the storage ratios, but I think you can count on it needing almost as much space as the original data. If the schema uses types or filters that produce a lot of information, the indexed data might be larger than the original input. Examples of data explosions in a schema: trie fields with a non-zero precisionStep, the edgengram filter, the shingle filter. Thanks, Shawn
Re: How can I set configuration options?
: Thanks for the replies. The problem I have is that setting them at the JVM : level would mean that all instances of Solr deployed in the Tomcat instance : are forced to use the same settings. I actually want to set the properties : at the application level (e.g. in solr.xml, zoo.cfg or maybe an : application level Tomcat Context.xml file). the thing to keep in mind is that most of the params you referred to are things you would not typically want in a deployed setup. others are just ways of specifying defaults that are substituted into configs... : java -Dbootstrap_confdir=./solr/collection1/conf you don't want this option for a normal setup, it's just for bootstrapping (hence it's only a system property). in a production setup you would use the zookeeper tools to load the configs into your zk quorum. https://wiki.apache.org/solr/SolrCloud#Config_Startup_Bootstrap_Params ...vs... https://wiki.apache.org/solr/SolrCloud#Command_Line_Util : -Dcollection.configName=myconf -DzkRun ditto for collection.configName -- it's only for bootstrapping. zkRun is something you only use in trivial setups like the examples in the SolrCloud tutorial to run zookeeper embedded in Solr. if you are running a production cluster where you want to be able to add/remove solr nodes on the fly, then you are going to want to set up specific machines running standalone zookeeper. : -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2 zkHost can be specified in solr.xml (although i'm not sure why the example solr.xml doesn't include it, i'll update SOLR-4622 to address this), or it can be overridden by a system property. -Hoss
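A hedged sketch of carrying zkHost in the 4.x solr.xml format (host names and the core layout are placeholders, not taken from the thread):

<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1"
         hostPort="8080"
         zkHost="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>

This keeps the ZooKeeper quorum with the application rather than in container-wide JVM options, which is what the original poster was after.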
Re: Field exist in schema.xml but returns
Raymond Wiker wrote: You have misspelt the tag name in the field definition... you have fiald instead of field. Thank you Raymond, it was really hard to find in a massive schema file. - Clever, but doesn't work... If he worked, he'd manage...
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
On 4/9/2013 4:06 PM, Furkan KAMACI wrote: Is there anybody who can help me estimate the approximate RAM needed for 5000 queries/second on a Solr machine? You've already gotten some good replies, and I'm aware that they haven't really answered your question. This is the kind of question that cannot be answered. The amount of RAM that you'll need for extreme performance actually isn't hard to figure out - you need enough free RAM for the OS to cache the maximum amount of disk space all your indexes will ever use. Normally this will be twice the size of all the indexes on the machine, because that's how much disk space will likely be used in a worst-case merge scenario (optimize). That's very expensive, so it is cheaper to budget for only the size of the index. A load of 5000 queries per second is pretty high, and probably something you will not achieve with a single-server (not counting backup) approach. All of the tricks that high-volume website developers use are also applicable to Solr. Once you have enough RAM, you need to worry more about the number of servers, the number of CPU cores in each server, and the speed of those CPU cores. Testing with actual production queries is the only way to find out what you really need. Beyond hardware design, making the requests as simple as possible and taking advantage of caches is important. Solr has caches for queries, filters, and documents. You can also put a caching proxy (something like Varnish) in front of Solr, but that would make NRT updates pretty much impossible, and that kind of caching can be difficult to get working right. Thanks, Shawn
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
These are really good metrics for me. You say that RAM size should be at least the index size, and that it is better to have RAM twice the index size (because of the worst-case scenario). On the other hand, let's assume that I have more RAM than twice the index size on a machine. Can Solr use that extra RAM, or is twice the index size an approximate upper limit on what helps? 2013/4/10 Shawn Heisey s...@elyograg.org [...]
Re: Results Order When Performing Wildcard Query
: My gut says the difference in assignment of docids has to do with how the : FileListEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor) docids just represent the order documents are added to the index. if you use DIH with FileListEntityProcessor to create one doc per file, then the order of the documents will (if i remember correctly) correspond to the order of the files returned by the OS, which may vary. even if the files are ordered consistently by modification date: 1) the modification date of these files on your machines might be different; 2) the granularity of file modification dates supported by the filesystem or the file IO layer in the JVM on each machine might be different -- causing two files to appear to have identical mod times on one machine, but different mod times on the other machine. -Hoss
Re: Pushing a whole set of pdf-files to solr
Thanks for those replies. I will look into them. But if anyone knows of a site that describes step by step how a Windows user who has already installed Solr (and Tomcat) can easily feed a folder (and subfolders) with 100s of PDFs into Solr, or would be willing to write down those steps, I would really appreciate the reference. And I bet you there are lots of people like me...
Re: Pushing a whole set of pdf-files to solr
I am able to run the java -jar post.jar -help command which I found here: http://docs.lucidworks.com/display/solr/Running+Solr. But now how can I tell post.jar to post all PDF files in a certain folder (preferably recursively) to a collection? Could anybody please post the exact command for that?
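A hedged sketch of the command, assuming the 4.x SimplePostTool supports the auto and recursive flags (verify against the output of java -jar post.jar -help for your version; the path is a placeholder):

java -Dauto=yes -Drecursive=yes -jar post.jar C:\docs\pdfs

With -Dauto the tool routes PDFs through the extracting handler and derives an id field from the file name, instead of posting the raw bytes to the XML update handler.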
Re: Solr 4.2.1 SSLInitializationException
Hi Chris, Thanks for your response. My understanding is that GlassFish specifies the keystore as a system property, but does not specify the password in order to protect it from snooping. There's a keychain that requires a password to be passed from the DAS in order to unlock the key for the keystore. Is there some way to specify a different HttpClient implementation (e.g. DefaultHttpClient rather than SystemDefaultHttpClient), as we don't want the application to have access to the keystore? I have also pasted the entire stack trace below: 2013-04-09 10:45:06,144 [main] ERROR org.apache.solr.servlet.SolrDispatchFilter - Could not start Solr. Check solr/home property and the logs 2013-04-09 10:45:06,224 [main] ERROR org.apache.solr.core.SolrCore - null:org.apache.http.conn.ssl.SSLInitializationException: Failure initializing default system SSL context at org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:368) at org.apache.http.conn.ssl.SSLSocketFactory.getSystemSocketFactory(SSLSocketFactory.java:204) at org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault(SchemeRegistryFactory.java:82) at org.apache.http.impl.client.SystemDefaultHttpClient.createClientConnectionManager(SystemDefaultHttpClient.java:118) at org.apache.http.impl.client.AbstractHttpClient.getConnectionManager(AbstractHttpClient.java:466) at org.apache.solr.client.solrj.impl.HttpClientUtil.setMaxConnections(HttpClientUtil.java:179) at org.apache.solr.client.solrj.impl.HttpClientConfigurer.configure(HttpClientConfigurer.java:33) at org.apache.solr.client.solrj.impl.HttpClientUtil.configureClient(HttpClientUtil.java:115) at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:105) at org.apache.solr.handler.component.HttpShardHandlerFactory.init(HttpShardHandlerFactory.java:134) at com.sun.enterprise.glassfish.bootstrap.GlassFishImpl.start(GlassFishImpl.java:79) at com.sun.enterprise.glassfish.bootstrap.GlassFishDecorator.start(GlassFishDecorator.java:63) at com.sun.enterprise.glassfish.bootstrap.osgi.OSGiGlassFishImpl.start(OSGiGlassFishImpl.java:69) at com.sun.enterprise.glassfish.bootstrap.GlassFishMain$Launcher.launch(GlassFishMain.java:117) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at com.sun.enterprise.glassfish.bootstrap.GlassFishMain.main(GlassFishMain.java:97) at com.sun.enterprise.glassfish.bootstrap.ASMain.main(ASMain.java:55) Caused by: java.io.IOException: Keystore was tampered with, or password was incorrect at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:772) at sun.security.provider.JavaKeyStore$JKS.engineLoad(JavaKeyStore.java:55) at java.security.KeyStore.load(KeyStore.java:1214) at org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:281) at org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:366) ... 50 more Caused by: java.security.UnrecoverableKeyException: Password verification failed at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:770) ... 
54 more From: Chris Hostetter hossman_luc...@fucit.org To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Sarita Nair sarita...@yahoo.com Sent: Tuesday, April 9, 2013 1:31 PM Subject: Re: Solr 4.2.1 SSLInitializationException : Deploying Solr 4.2.1 to GlassFish 3.1.1 results in the error below. I : have seen similar problems being reported with Solr 4.2 Are you trying to use server SSL with GlassFish? Can you please post the full stack trace so we can see where this error is coming from. My best guess is that this is coming from the changes made in SOLR-4451 to use system defaults correctly when initializing HttpClient, which suggests that your problem is exactly what the error message says... Keystore was tampered with, or password was incorrect Is it possible that the default keystore for your JVM (or as overridden by GlassFish defaults - possibly using the javax.net.ssl.keyStore sysprop) has a password set on it? If so, you need to configure your JVM with the standard Java system properties to specify what that password is. http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c1364232676233-4051159.p...@n3.nabble.com%3E : 2013-04-09 10:45:06,144 [main] ERROR : org.apache.solr.servlet.SolrDispatchFilter - Could not start Solr. Check solr/home property and the logs : 2013-04-09 10:45:06,224 [main] ERROR : org.apache.solr.core.SolrCore - :
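For reference, the standard JSSE system properties that Hoss alludes to look like this (the paths and passwords are placeholders):

-Djavax.net.ssl.keyStore=/path/to/keystore.jks
-Djavax.net.ssl.keyStorePassword=changeit
-Djavax.net.ssl.trustStore=/path/to/truststore.jks
-Djavax.net.ssl.trustStorePassword=changeit

Whether these can be set per-application, rather than container-wide as the original poster wants to avoid, depends on the container.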
Re: Pushing a whole set of pdf-files to solr
On 10 April 2013 07:28, sdspieg sdsp...@mail.ru wrote: I am able to run the java -jar post.jar -help command which I found here: http://docs.lucidworks.com/display/solr/Running+Solr. But now how can I tell post to post all pdf files in a certain folder (preferably recursively) to a collection? Could anybody please post the exact command for that? [...]

There are two options:

* I am not familiar with Microsoft Windows, but writing some kind of batch script that recurses down a directory and posts files to Solr should be easy.

* One could use the Solr DataImportHandler with FileDataSource to handle the filesystem traversal, and TikaEntityProcessor to handle the indexing of rich content (a configuration sketch follows below). Please see:
  http://wiki.apache.org/solr/DataImportHandler
  http://wiki.apache.org/solr/TikaEntityProcessor

Regards, Gora
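Following up on Gora's second option, here is a minimal data-config.xml sketch. Note the assumptions: it uses FileListEntityProcessor (rather than FileDataSource) for the directory walk and BinFileDataSource to stream raw bytes to Tika; the baseDir is a placeholder; and the target field must exist in your schema:

    <dataConfig>
      <!-- streams raw bytes; PDFs are binary, so Tika needs a binary source -->
      <dataSource type="BinFileDataSource" name="bin"/>
      <document>
        <!-- walks the directory tree; dataSource="null" because this entity
             only lists files and does not read them itself -->
        <entity name="files" processor="FileListEntityProcessor"
                baseDir="C:/pdfs" fileName=".*\.pdf" recursive="true"
                rootEntity="false" dataSource="null">
          <!-- hands each file to Tika and maps the extracted body text -->
          <entity name="doc" processor="TikaEntityProcessor"
                  url="${files.fileAbsolutePath}" format="text" dataSource="bin">
            <field column="text" name="text"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

This assumes the DIH and extraction contrib jars are on the classpath and that a /dataimport handler pointing at this file is registered in solrconfig.xml.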
Re: Pushing a whole set of pdf-files to solr
Another progress report. I 'flattened' all the folders which contained the pdf files with FileBoss and then moved the pdf files to the directory where I found the post.jar file (in solr-4.2.1\solr-4.2.1\example\exampledocs). I then ran

    java -Ddata=files -jar post.jar *.pdf

and in the command window it seemed to be working fine (these are just academic articles in pdf format that I downloaded with Zotero from EBSCO):

04/10/2013 12:20 AM   159,224 Vorontsov - 2012 - The Korea-Russia Gas Pipeline Project Past, Pres.pdf
04/10/2013 12:12 AM 3,885,056 Walker - 2012 - Asia competes for energy security.pdf
04/10/2013 12:45 AM    66,195 Whitmill - 2012 - Is UK Energy Policy Driving Energy Innovation - or.pdf
04/10/2013 12:29 AM 2,208,367 Wietfeld - 2011 - Understanding Middle East Gas Exporting Behavior.pdf
04/10/2013 12:59 AM 3,011,185 Wiseman - 2011 - Expanding Regional Renewable Governance.pdf
04/10/2013 12:38 AM   180,692 Woudhuysen - 2012 - Innovation in Energy Expressions of a Crisis, and.pdf
04/10/2013 12:49 AM   229,991 Yergin - 2012 - How Is Energy Remaking the World.pdf
04/10/2013 12:40 AM 3,397,328 Young - 2012 - Industrial Gases. (cover story).pdf
04/10/2013 01:36 AM    73,125 Zimmerer - 2011 - New Geographies of Energy Introduction to the Spe.pdf

... and so on, all together some 300 articles. But then when I looked in Solr, I saw the following:

04:34:41 SEVERE SolrCore org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)
04:34:41 SEVERE SolrCore org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)

... and a lot more of those. I'd like to think I made SOME progress, but it also seems like I'm still not close to being there. Any suggestions from the experts here on what I am doing wrong? Thanks! -Stephan
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
On 4/9/2013 7:03 PM, Furkan KAMACI wrote: These are really good metrics for me: You say that RAM size should be at least the index size, and it is better to have RAM twice the index size (because of the worst-case scenario). On the other hand, let's assume that I have more RAM than twice the size of the indexes on the machine. Can Solr use that extra RAM, or is that approximately an upper limit (twice the size of the indexes)?

What we have been discussing is the OS cache, which is memory that is not used by programs. The OS uses that memory to make everything run faster. The OS will instantly give that memory up if a program requests it. Solr is a Java program, and Java uses memory a little differently, so Solr most likely will NOT use more memory when it is available. In a normal natively executed program, memory can be allocated at any time and given back to the system at any time. With Java, you tell it the maximum amount of memory the program is ever allowed to use. Because of how memory is used inside Java, most long-running Java programs (like Solr) will allocate up to the configured maximum even if they don't really need that much memory. Most Java virtual machines will never give the memory back to the system even if it is not required.

Thanks, Shawn
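For anyone who wants to see that behaviour directly, here is a tiny standalone Java sketch (not part of Solr) that prints the JVM's configured ceiling against what it has actually claimed from the OS; run it with different -Xmx values to watch the ceiling move:

    // HeapPeek.java - prints JVM heap figures via the standard Runtime API.
    public class HeapPeek {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            long mb = 1024 * 1024;
            // the -Xmx ceiling: Java will never use more heap than this
            System.out.println("max:   " + rt.maxMemory() / mb + " MB");
            // what the JVM has claimed from the OS so far; this grows toward
            // max over time and is rarely handed back, as described above
            System.out.println("total: " + rt.totalMemory() / mb + " MB");
            // claimed from the OS but currently unused inside the JVM
            System.out.println("free:  " + rt.freeMemory() / mb + " MB");
        }
    }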
Re: Pushing a whole set of pdf-files to solr
On 10 April 2013 08:11, sdspieg sdsp...@mail.ru wrote: Another progress report. I 'flattened' all the folders which contained the pdf files with FileBoss and then moved the pdf files to the directory where I found the post.jar file (in solr-4.2.1\solr-4.2.1\example\exampledocs). I then ran java -Ddata=files -jar post.jar *.pdf and in the command window it seemed to be working fine (these are just academic articles in pdf format that I downloaded with Zotero from EBSCO): [...]

If it works, great, but it is generally not advisable to have a large number of files under one directory. However, that is not the source of your error here.

But then when I looked in solr, I saw the following: 04:34:41 SEVERE SolrCore org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1) [...]

Your files seem to have some encoding other than UTF-8; my random guess would be Windows-1252. You need to convert the files to UTF-8.

Regards, Gora
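If the offending inputs really are plain-text files in a legacy encoding, a conversion along the lines Gora suggests could be scripted in Java like this (a sketch only: the Windows-1252 guess and the file names are assumptions, and this applies to text inputs, since PDF itself is a binary format):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class ToUtf8 {
        public static void main(String[] args) throws Exception {
            Path in = Paths.get("input.txt");    // placeholder input file
            Path out = Paths.get("output.txt");  // placeholder output file
            // decode with the presumed legacy charset, re-encode as UTF-8
            byte[] raw = Files.readAllBytes(in);
            String text = new String(raw, Charset.forName("windows-1252"));
            Files.write(out, text.getBytes(StandardCharsets.UTF_8));
        }
    }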
Re: Pushing a whole set of pdf-files to solr
The newer SimplePostTool can in fact recurse a directory of PDFs. Just get the usage for the tool; I'm sure it lists the command options.

-- Jack Krupansky

-Original Message-
From: sdspieg
Sent: Tuesday, April 09, 2013 9:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Pushing a whole set of pdf-files to solr

Thanks for those replies. I will look into them. But if anyone knows of a site that describes step by step how a Windows user who has already installed Solr (and Tomcat) can easily feed a folder (and subfolders) with 100s of pdfs into Solr, or would be willing to write down those steps, I would really appreciate the reference. And I bet you there are lots of people like me...
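Jack's suggestion spelled out, for the archives (a sketch for the 4.x SimplePostTool; flag support varies between releases, so confirm with java -jar post.jar -h on your version):

    java -Dauto=yes -Drecursive=yes -jar post.jar C:\pdfs

With -Dauto=yes the tool guesses each file's content type from its extension and sends rich documents such as PDFs to the extracting handler, instead of posting their raw bytes to the XML update handler.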
Re: edismax returns very less matches than regular
Adding debugQuery=true is your friend. I suspect that you'll find your first query is actually searching name:coldfusion OR defaultsearchfield:cache, and you _think_ it's searching for both coldfusion and cache in the name field.

Best
Erick

On Mon, Apr 8, 2013 at 2:50 AM, amit amit.mal...@gmail.com wrote:

I have a simple system. I put the title of webpages into the name field and the content of the web pages into the description field. I want to search both fields and give the name a little more boost. A search on the name field or description field returns records close to hundreds.

http://localhost:8983/solr/select/?q=name:%28coldfusion^2%20cache^1%29&fq=author:[*%20TO%20*]%20AND%20-author:chinmoyp&start=0&rows=10&fl=author,score,id

But a search on both fields using boosts gives just 5 matches.

http://localhost:8983/solr/mindfire/?q=%28coldfusion^2%20cache^1%29&defType=edismax&qf=name^1.5%20description^1.0&fq=author:[*%20TO%20*]%20AND%20-author:chinmoyp&start=0&rows=10&fl=author,score,id

I am wondering what is wrong, because there are valid results returned by the 1st query which are ignored by edismax. I am on solr3.6
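Erick's suggestion as a concrete request, reusing the first URL from the mail (rows=0 is added here only to suppress the result list):

    http://localhost:8983/solr/select/?q=name:%28coldfusion^2%20cache^1%29&rows=0&debugQuery=true

The parsedquery entry in the debug section of the response shows which field each clause was actually matched against; comparing it with the parsedquery of the edismax request should make the difference between the two result counts easy to pin down.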
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
I am sorry, but you said: *you need enough free RAM for the OS to cache the maximum amount of disk space all your indexes will ever use*

I have made an assumption about the indexes on my machine: let's assume they total 5 GB. So it is better to have at least 5 GB of RAM? OK, Solr will use RAM up to whatever I define for its Java process. When we think about the indexes on disk and the OS caching them in RAM, is that what you are talking about: having more than 5 GB - or 10 GB - of RAM in my machine?

2013/4/10 Shawn Heisey s...@elyograg.org

On 4/9/2013 7:03 PM, Furkan KAMACI wrote: These are really good metrics for me: You say that RAM size should be at least the index size, and it is better to have RAM twice the index size (because of the worst-case scenario). On the other hand, let's assume that I have more RAM than twice the size of the indexes on the machine. Can Solr use that extra RAM, or is that approximately an upper limit (twice the size of the indexes)?

What we have been discussing is the OS cache, which is memory that is not used by programs. The OS uses that memory to make everything run faster. The OS will instantly give that memory up if a program requests it. Solr is a Java program, and Java uses memory a little differently, so Solr most likely will NOT use more memory when it is available. In a normal natively executed program, memory can be allocated at any time and given back to the system at any time. With Java, you tell it the maximum amount of memory the program is ever allowed to use. Because of how memory is used inside Java, most long-running Java programs (like Solr) will allocate up to the configured maximum even if they don't really need that much memory. Most Java virtual machines will never give the memory back to the system even if it is not required.

Thanks, Shawn
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
On 4/9/2013 9:12 PM, Furkan KAMACI wrote: I am sorry, but you said: *you need enough free RAM for the OS to cache the maximum amount of disk space all your indexes will ever use* I have made an assumption about the indexes on my machine: let's assume they total 5 GB. So it is better to have at least 5 GB of RAM? OK, Solr will use RAM up to whatever I define for its Java process. When we think about the indexes on disk and the OS caching them in RAM, is that what you are talking about: having more than 5 GB - or 10 GB - of RAM in my machine?

If your index is 5GB, and you give 3GB of RAM to the Solr JVM, then you would want at least 8GB of total RAM for that machine - the 3GB of RAM given to Solr, plus the rest so the OS can cache the index in RAM. If you plan for double the cache memory, you'd need 13 to 14GB.

Thanks, Shawn
RE: Solr index Backup and restore of large indexs
Any update, please?

-Original Message-
From: Sandeep Kumar Anumalla
Sent: 31 March, 2013 12:08 PM
To: solr-user@lucene.apache.org
Cc: 'Joel Bernstein'
Subject: RE: Solr index Backup and restore of large indexs

Hi, I am exploring all the possible options. We want to distribute 1 TB of traffic among 3 Solr shards (Masters) and corresponding 3 Solr Slaves.

- Initially I used a Master/Slave setup. But in this case the traffic rate on the Master is very high, and because of this we are facing the below issue while replicating to the Slave:

SnapPull failed
SEVERE: SnapPull failed :org.apache.solr.common.SolrException: Unable to download _xv0_Lucene41_0.doc completely. Downloaded 0!=5935

In this case the Slave machine also has to have the same hardware and software configuration as the Master; this seems to be more expensive.

- Then I decided to use multiple Solr instances on a single machine, accessing them using EmbeddedSolrServer, and planned to query all these instances to get the required result. In this case there is no need for a Slave machine; we just need to take the backup, and we can store it on any external hard disks. Here there are 2 issues I am facing:

1. Loading is not that fast compared to the database.
2. How to take an incremental backup? Meaning, I don't want to take a full backup every time.

-

Thanks
Sandeep A

-Original Message-
From: Joel Bernstein [mailto:joels...@gmail.com]
Sent: 28 March, 2013 04:51 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr index Backup and restore of large indexs

Hi, Are you running Solr Cloud or Master/Slave? I'm assuming with 1TB a day you're sharding. With master/slave you can configure incremental index replication to another core. The backup core can be local on the server, on a separate server, or in a separate data center. With Solr Cloud, replicas can be set up to automatically have redundant copies of the index. These copies, though, are live copies and will handle queries. Replicating data to a separate data center is typically not done through Solr Cloud replication.

Joel

On Mon, Mar 25, 2013 at 11:43 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi, Try something like this: http://host/solr/replication?command=backup

See: http://wiki.apache.org/solr/SolrReplication

Otis
--
Solr & ElasticSearch Support http://sematext.com/

On Thu, Mar 21, 2013 at 3:23 AM, Sandeep Kumar Anumalla sanuma...@etisalat.ae wrote:

Hi, We are loading daily 1TB (approx.) of index data. Please let me know the best procedure to take backup and restore of the indexes. I am using Solr 4.2.

Thanks & Regards
Sandeep A
Ext : 02618-2856
M : 0502493820

--
Joel Bernstein
Professional Services
LucidWorks
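For the archives, the replication-handler backup that Otis points to can also take a destination and a retention count; a sketch (the host and path are placeholders, and the numberToKeep parameter should be checked against the SolrReplication wiki for your release):

    http://master_host:8983/solr/replication?command=backup&location=/mnt/backups/solr&numberToKeep=3

Each invocation writes a snapshot.<timestamp> directory under the given location, so scheduling this from cron gives a rolling set of backups without stopping the master.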
Re: query regarding the use of boost across the fields in edismax query
hi otis, can you explain that in some more depth? Like, if I search for led in both the cases, what would be the difference in the results I get? thanks in advance

regards Rohan

On Tue, Apr 9, 2013 at 11:25 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Not sure if I'm missing something, but in the first case the features, cat, and color fields have more weight, so matches on them will have a bigger contribution to the overall relevancy score.

Otis
--
Solr & ElasticSearch Support http://sematext.com/

On Tue, Apr 9, 2013 at 1:52 PM, Rohan Thakur rohan.i...@gmail.com wrote:

hi all, I wanted to know what could be the difference between the results if I apply boosts across, say, 5 fields in a query. For the first, settings like:

title^10.0 features^7.0 cat^5.0 color^3.0 root^1.0

and for the second, settings like:

title^10.0 features^5.0 cat^3.0 color^2.0 root^1.0

What would be the difference, given that the weights are in the same decreasing order? thanks in advance

regards Rohan
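As a concrete request, the first of Rohan's settings would look something like this (host and core are assumptions in the style of the earlier examples; led is the sample term from his follow-up):

    http://localhost:8983/solr/select?q=led&defType=edismax&qf=title^10.0+features^7.0+cat^5.0+color^3.0+root^1.0&debugQuery=true

Each per-field boost multiplies that field's score contribution, so the two settings differ in the relative weight of the middle fields: under the first, a match on led in features counts 7/10 as much as a title match; under the second, only 5/10. Adding debugQuery=true shows the exact per-field contributions.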
RE: Solr 4.2 Incremental backups
Hi Erick, My main point is that if I use replication, I have to use a similar kind of setup (hardware, storage space) as the Master, which is not cost-effective; that is why I am looking at incremental backup options, so that I can keep these backups anywhere, like external hard disks or tapes. Moreover, when I am using replication we are facing the below issue while replicating to the Slave:

SnapPull failed
SEVERE: SnapPull failed :org.apache.solr.common.SolrException: Unable to download _xv0_Lucene41_0.doc completely. Downloaded 0!=5935

Thanks

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 25 March, 2013 07:11 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.2 Incremental backups

That's essentially what replication does: it only backs up parts of the index that have changed. However, when segments merge, that might mean the entire index needs to be replicated.

Best
Erick

On Sun, Mar 24, 2013 at 12:08 AM, Sandeep Kumar Anumalla sanuma...@etisalat.ae wrote:

Hi, Is there any option to do incremental backups in Solr 4.2?

Thanks & Regards
Sandeep A
Ext : 02618-2856
M : 0502493820
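For completeness, the master side of the replication setup being discussed lives on the ReplicationHandler in solrconfig.xml; a minimal sketch (the handler name is the conventional one, and the confFiles list is an assumption to adjust to your own configuration):

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <!-- expose a new replication point after every commit and on startup -->
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">startup</str>
        <!-- configuration files shipped to slaves along with the index -->
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

The same handler serves the command=backup and command=indexversion requests mentioned earlier in the thread.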