Re: CDCR: Help With Tlog Growth Issues
Hi Shalin, when the buffer is enabled, tlogs are not removed anymore, even if they were replicated [1]: "When buffering updates, the updates log will store all the updates indefinitely." Once you disable the buffer, all the old tlogs should be cleaned (the next time the tlog cleaning process is triggered). The buffer is useful in scenarios where you want to ensure that the source cluster will not clean updates until the target clusters are fully initialized. For example, let's say we perform a whole-index replication (SOLR-6465); while the whole-index replication is performed, the source cluster should buffer updates until the whole-index replication is completed, otherwise we might miss some updates.
[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462#CrossDataCenterReplication(CDCR)-TheBufferElement
Kind Regards -- Renaud Delbru

On 01/12/2016 17:58, Shalin Shekhar Mangar wrote: Even if the buffer is enabled, the old tlogs should be removed once the updates in those tlogs have been replicated to the target. So the real question is why they haven't been removed automatically?

On Thu, Dec 1, 2016 at 9:13 PM, Renaud Delbru wrote: Hi Thomas, it looks like the buffer is enabled on the update log, and even if the updates were replicated, they are not removed. What is the output of the command `cdcr?action=STATUS` on both clusters? If you see `enabled` in the response, then the buffer is enabled. To disable it, you should run the command `/cdcr?action=DISABLEBUFFER`. Kind Regards -- Renaud Delbru

On 10/11/2016 23:09, Thomas Tickle wrote: I am having an issue with cdcr that I could use some assistance in resolving. I followed the instructions found here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462 The CDCR is set up with a single source to a single target. Both the source and target clusters are identically set up as 3 machines, each running an external zookeeper and a solr instance. I’ve enabled the data replication and successfully seen the documents replicated from the source to the target with no errors in the log files. However, when examining the /cdcr?action=QUEUES command, I noticed that the tlogTotalSize and tlogTotalCount were alarmingly high. Checking the data directory for each shard, I was able to confirm that there were several thousand log files of 3-4 MB each. It added up to almost 35 GB of tlogs. Obviously, this amount of tlogs causes a serious issue when trying to restart a solr server after activities such as patching. *Is it normal for old tlogs to never get removed in a CDCR setup?* Thomas Tickle
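For reference, checking and disabling the buffer comes down to two calls on each cluster (a sketch, with placeholder host and collection names):

  # check whether the buffer is enabled
  curl 'http://host:8983/solr/collection/cdcr?action=STATUS'
  # if the response reports the buffer as `enabled`, disable it
  curl 'http://host:8983/solr/collection/cdcr?action=DISABLEBUFFER'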
Re: CDCR: Help With Tlog Growth Issues
Hi Thomas, it looks like the buffer is enabled on the update log, and even if the updates were replicated, they are not removed. What is the output of the command `cdcr?action=STATUS` on both clusters? If you see `enabled` in the response, then the buffer is enabled. To disable it, you should run the command `/cdcr?action=DISABLEBUFFER`. Kind Regards -- Renaud Delbru

On 10/11/2016 23:09, Thomas Tickle wrote: I am having an issue with cdcr that I could use some assistance in resolving. I followed the instructions found here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462 The CDCR is set up with a single source to a single target. Both the source and target clusters are identically set up as 3 machines, each running an external zookeeper and a solr instance. I’ve enabled the data replication and successfully seen the documents replicated from the source to the target with no errors in the log files. However, when examining the /cdcr?action=QUEUES command, I noticed that the tlogTotalSize and tlogTotalCount were alarmingly high. Checking the data directory for each shard, I was able to confirm that there were several thousand log files of 3-4 MB each. It added up to almost 35 GB of tlogs. Obviously, this amount of tlogs causes a serious issue when trying to restart a solr server after activities such as patching. *Is it normal for old tlogs to never get removed in a CDCR setup?* Thomas Tickle
Re: how to sampling search result
Some people in the Elasticsearch community are using random scoring [1] to sample a document subset from the search results. Maybe something similar could be implemented for Solr? There are probably more efficient sampling solutions than this one, but this one is likely the most straightforward to implement.
[1] https://www.elastic.co/guide/en/elasticsearch/guide/current/random-scoring.html
-- Renaud Delbru

On 27/09/16 15:57, googoo wrote: Hi, Is it possible to sample based on the "search result"? Like run the query first, and the search result returns 1 million documents. With random sampling, 50% (500K) of the documents are returned for facets and stats. The sampling needs to be based on the "search result". Thanks, Yongtao
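A rough Solr analogue of the Elasticsearch trick is the built-in RandomSortField (a sketch; it assumes the default random_* dynamic field from the example schema, and note it only shuffles the result ordering deterministically per seed, rather than truly sampling the result set for facets and stats):

  # the seed is encoded in the field name; change it to get a different shuffle
  curl 'http://localhost:8983/solr/collection/select?q=*:*&sort=random_1234+asc&rows=500'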
Re: CDCR (Solr6.x) does not start
Hi Uwe, at first look, your configuration seems correct; see my comments below.

On 28/06/16 15:36, Uwe Reh wrote: 9. Start CDCR http://SOURCE:s_port/solr/scoll/cdcr?action=start&wt=json {"responseHeader":{"status":0,"QTime":13},"status":["process","started","buffer","enabled"]} ! (not even a single query to the target's zookeeper ??)

Indeed, you should have observed communication between the source cluster and the target zookeeper. Do you see any errors in the log of the source cluster? Or a log message such as: "Unable to instantiate the log reader for target collection ..."

10. Enter some test data into the SOURCE 11. Explicit commit in SOURCE http://SOURCE:s_port/solr/scoll/update?commit=true&opensearcher=true !! (at least now there should be some traffic, or?)

Replication should start even if no commit has been sent to the source cluster.

12. Check errors and queues http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=queues&wt=json {"responseHeader":{"status":0,"QTime":0},"queues":[],"tlogTotalSize":135,"tlogTotalCount":1,"updateLogSynchronizer":"stopped"} http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=errors&wt=json {"responseHeader":{"status":0,"QTime":0},"errors":[]} ! Why is the queues element empty?

The empty queue seems to indicate that there is an issue, and that cdcr was unable to instantiate the replicator for the target cluster. Just to be sure: your source cluster has 4 shards, but no replicas? If it has replicas, can you ensure that you execute these commands on the shard leaders? Kind Regards -- Renaud Delbru
Re: Solr6 CDCR issue with a 3 cloud design
Hi Dmitry,

On 28/06/16 13:19, dmitry.medve...@barclays.com wrote: No ERRORS and the queue size is equal to 0. Should I extend the logging level to Max maybe? Currently it's the default. How can I know if a commit operation has been sent to the 2 target clusters after the replication? What command should I run to check this? I submit new doc/s to my ACTIVE/PRIMARY cloud and that's all.

Commits are not replicated from the source cluster to the target cluster. You have to manually send a commit to the target cluster if you want to see all the pending docs: curl 'http://target_cluster:8983/solr/collection_name/update?commit=true' You can also try to configure an autocommit on your target cluster [1]
[1] https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-autoCommit
Kind regards -- Renaud Delbru

-----Original Message----- From: Renaud Delbru [mailto:renaud@siren.solutions] Sent: Tuesday, June 14, 2016 11:57 To: solr-user@lucene.apache.org Subject: Re: Solr6 CDCR issue with a 3 cloud design

Hi Dmitry, was a commit operation sent to the 2 target clusters after the replication? Replicated documents will not appear until a commit operation is sent. What is the output of the monitoring actions QUEUES and ERRORS? Are you seeing any errors reported? Are you seeing a queue size not equal to 0? -- Renaud Delbru

On 09/06/16 08:55, dmitry.medve...@barclays.com wrote: I've set up a 3 cloud CDCR: Source => Target1-Source2 => Target2 CDCR environment, and the replication process works perfectly, but: when I shut down the Target1-Source2 cloud (the mediator, to test for resilience), index/push some docs to the Source1 cloud, and get the Target1-Source2 cloud back online after several minutes, then only part of the docs are replicated to the 2 Target clouds (7 of 10 docs tested). Anyone has an idea what is the reason for such a behavior? Configurations attached. Thanks in advance, Dmitry Medvedev.
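To verify that the pending documents became visible, one can send the commit and then compare document counts on source and target (same placeholder names as above):

  curl 'http://target_cluster:8983/solr/collection_name/update?commit=true'
  # numFound should now match the count on the source cluster
  curl 'http://target_cluster:8983/solr/collection_name/select?q=*:*&rows=0&wt=json'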
Re: Regarding CDCR SOLR 6
Hi,

On 15/06/16 03:18, Bharath Kumar wrote: Hi Renaud, Thank you so much for your response. It is very helpful and it helped me understand the need for turning on buffering. Is it recommended to keep the buffering enabled all the time on the source cluster? If the target cluster is up and running and cdcr is started, can I turn off the buffering on the source site?

Yes, there is no need to keep buffering on if your target cluster is up and running and cdcr replication is started.

As you have mentioned, the transaction logs are kept on the source cluster until the data is replicated on the target cluster, once cdcr is started. Is there a possibility that the target cluster is out of sync with the source cluster and we need to do a hard recovery from the source cluster to sync up the target cluster?

If the target cluster goes down while cdcr is replicating, there should be no loss of information. The source cluster will try from time to time to communicate with the target and continue the replication until the target cluster is back up and running. Until it can resume communication, the source cluster will keep a pointer on where the replication should resume, and therefore the update log will not be cleaned up to this point. The pointer on the source cluster is not persistent (maybe that could be something to implement). Therefore, if the source cluster is restarted, the pointer will be lost, and the buffer should be activated until the target cluster is up and running.

Also, I have the below configuration on the source cluster to synchronize the update logs:

  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>

Regarding the monitoring of the replication, I am planning to add a script to check the queue size, to make sure the disk is not full in case the target site is down and the transaction log size keeps growing on the source site. Is there any other recommended approach?

The best is to use the monitoring API, which provides some metrics on how the replication is going. In the cwiki [1], there are also some recommendations on how to monitor the system.
[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462
Kind Regards -- Renaud Delbru

Thanks again, your inputs were very helpful.

On Tue, Jun 14, 2016 at 7:10 PM, Bharath Kumar <bharath.mvku...@gmail.com> wrote: Hi Renaud, Thank you so much for your response. It is very helpful and it helped me understand the need for turning on buffering. Is it recommended to keep the buffering enabled all the time on the source cluster? If the target cluster is up and running and the cdcr is started, can I turn off the buffering on the source site? As you have mentioned, the transaction logs are kept on the source cluster, until the data is replicated on the target cluster, once the cdcr is started, is there a possibility that if on the target cluster

On Tue, Jun 14, 2016 at 6:50 AM, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote: I must chime in to clarify something - in case 2, would the source cluster eventually start a log reader on its own? That is, would the CDCR heal over time, or would manual action be required?
-----Original Message----- From: Renaud Delbru [mailto:renaud@siren.solutions] Sent: Tuesday, June 14, 2016 4:51 AM To: solr-user@lucene.apache.org Subject: Re: Regarding CDCR SOLR 6

Hi Bharath, The buffer is useful when you need to buffer updates on the source cluster before starting cdcr, if the source cluster might receive updates in the meanwhile and you want to be sure not to miss them. To understand this better, you need to understand how cdcr cleans transaction logs. Cdcr, when started (with the START action), will instantiate a log reader for each target cluster. The position of the log reader will indicate to cdcr which transaction logs it can clean. If all the log readers are beyond a certain point, then cdcr can clean all the transaction logs up to this point. However, there might be cases when the source cluster will be up without any log readers instantiated: 1) The source cluster is started, but cdcr is not started yet; 2) The source cluster is started, cdcr is started, but the target cluster was not accessible when cdcr was started. In this case, cdcr will not be able to instantiate a log reader for this cluster. In these two scenarios, if updates are received by the source cluster, then they might be cleaned out from the transaction log as per the normal update log cleaning procedure.
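Regarding the queue-size check mentioned above, a minimal cron-style sketch (hypothetical host and collection names; assumes jq is installed; tlogTotalSize is taken from the QUEUES response shown elsewhere in these threads):

  #!/bin/bash
  # alert when the CDCR transaction logs on the source exceed a threshold
  MAX_BYTES=$((10 * 1024 * 1024 * 1024))   # 10 GB
  SIZE=$(curl -s 'http://source_host:8983/solr/collection_shard1_replica1/cdcr?action=QUEUES&wt=json' | jq '.tlogTotalSize')
  if [ "$SIZE" -gt "$MAX_BYTES" ]; then
    echo "CDCR tlog size is $SIZE bytes" | mail -s "CDCR tlog alert" ops@example.com
  fi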
Re: Encryption to Solr indexes – Using Custom Codec
Hi, maybe it is the way you created the jar? Why not apply the patch to lucene/solr trunk and use `ant jar` instead, to get the codecs jar created for you? Also, I think the directory where you put the jars should be called "lib" instead of "Lib". You can also try to use the lib directives in your solrconfig.xml [1]
[1] https://cwiki.apache.org/confluence/display/solr/Lib+Directives+in+SolrConfig
-- Renaud Delbru

On 20/06/16 15:42, Sidana, Mohit wrote: Hello, As part of my studies I am exploring solutions which can be used for Lucene/Solr index encryption. I found the patch open on Apache JIRA - Codec for index-level encryption (LUCENE-6966): https://issues.apache.org/jira/browse/LUCENE-6966 and I am currently trying to test this custom codec with Solr to perform secure search over some sensitive records. I've decided to follow the path described in the Solr wiki, setting up SimpleTextCodec, and further tried to use the encrypted codec source.

*Here are the additional details.* I've created a basic jar file out of this source code (built it as a jar from Eclipse using the Maven plugin). The Solr installation I'm using to test this is Solr 6.0.0 unzipped, started via its embedded Jetty server, and using a single core. I've placed my jar with the codec in [My_Core instance dir]\lib. In [$SolrDir]\Solr\My_Core\conf\solrconfig.xml I've added the codec declaration, and in the schema.xml file I've declared some fields and field types that should use this codec (the config snippets were lost in the archive formatting).

I'm pretty sure I've followed all the steps described in the Solr wiki; however, when I actually try to use the custom codec implementation (named "Encrypted Codec") to index some sample CSV data using the simple post tool: java -Dtype=text/csv -Durl=http://localhost:8983/solr/My_Core/update -jar post.jar Sales.csv — and I have also tried doing the same with SolrJ, but I faced the same error:

  SolrClient server = new HttpSolrClient("http://localhost:8983/solr/My_Core");
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "1234");
  doc.addField("name", "A lovely summer holiday");
  try {
    server.add(doc);
    server.commit();
    System.out.println("Document added!");
  } catch (SolrServerException | IOException e) {
    e.printStackTrace();
  }

I get the attached errors in the Solr log: org.apache.solr.common.SolrException: Exception writing document id b3e01ada-d0f1-4ddf-ad6a-2828bfe619a3 to the index; possible analysis error.
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:181) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:68) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48) at org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:335) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48) at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48) at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48) at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48) at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48) at org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:74) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48) at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.
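For the build suggestion above, a rough sketch of applying the patch and building with ant (the patch file name is hypothetical; it assumes a checkout of the lucene-solr sources matching the patch and the patch file downloaded from LUCENE-6966):

  cd lucene-solr
  patch -p1 < LUCENE-6966.patch   # hypothetical patch file name
  cd lucene
  ant jar                         # builds the module jars, including the codecs jar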
Re: Solr6 CDCR issue with a 3 cloud design
Hi Dmitry, was a commit operation sent to the 2 target clusters after the replication? Replicated documents will not appear until a commit operation is sent. What is the output of the monitoring actions QUEUES and ERRORS? Are you seeing any errors reported? Are you seeing a queue size not equal to 0? -- Renaud Delbru

On 09/06/16 08:55, dmitry.medve...@barclays.com wrote: I've set up a 3 cloud CDCR: Source => Target1-Source2 => Target2 CDCR environment, and the replication process works perfectly, but: when I shut down the Target1-Source2 cloud (the mediator, to test for resilience), index/push some docs to the Source1 cloud, and get the Target1-Source2 cloud back online after several minutes, then only part of the docs are replicated to the 2 Target clouds (7 of 10 docs tested). Anyone has an idea what is the reason for such a behavior? Configurations attached. Thanks in advance, Dmitry Medvedev.
Re: Regarding CDCR SOLR 6
Hi Bharath, The buffer is useful when you need to buffer updates on the source cluster before starting cdcr, if the source cluster might receive updates in the meanwhile and you want to be sure not to miss them. To understand this better, you need to understand how cdcr cleans transaction logs. Cdcr, when started (with the START action), will instantiate a log reader for each target cluster. The position of the log reader will indicate to cdcr which transaction logs it can clean. If all the log readers are beyond a certain point, then cdcr can clean all the transaction logs up to this point. However, there might be cases when the source cluster will be up without any log readers instantiated: 1) The source cluster is started, but cdcr is not started yet; 2) The source cluster is started, cdcr is started, but the target cluster was not accessible when cdcr was started. In this case, cdcr will not be able to instantiate a log reader for this cluster. In these two scenarios, if updates are received by the source cluster, then they might be cleaned out from the transaction log as per the normal update log cleaning procedure. That is where the buffer becomes useful. When you know that, while starting up your clusters and cdcr, you will be in one of these two scenarios, you can activate the buffer to be sure not to miss updates. Then, when the source and target clusters are properly up and cdcr replication is properly started, you can turn off this buffer. -- Renaud Delbru

On 14/06/16 06:41, Bharath Kumar wrote: Hi, I have set up cross data center replication using solr 6, and I want to know why the buffer needs to be enabled on the source cluster. Even if the buffer is not enabled, I am able to replicate the data between the source and target sites. What are the advantages of enabling the buffer on the source site? If I enable the buffer, the transaction logs are never deleted, and over a period of time we are running out of disk. Can you please let me know why enabling the buffer is required?
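In practice, the lifecycle described above maps to three calls on the source collection (a sketch with placeholder host and collection names):

  curl 'http://source:8983/solr/collection/cdcr?action=ENABLEBUFFER'    # while the target may still be unreachable
  curl 'http://source:8983/solr/collection/cdcr?action=START'          # instantiate the log readers
  curl 'http://source:8983/solr/collection/cdcr?action=DISABLEBUFFER'  # once replication is confirmed running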
Re: Need Help with Solr 6.0 Cross Data Center Replication
Hi, unfortunately no, I haven't had the time to reproduce your settings with separated zookeeper instances. I'll update if I have something. -- Renaud Delbru

On 07/06/16 16:55, Satvinder Singh wrote: Hi, Any updates on this?? Thanks, Satvinder Singh

On 5/19/16, 8:41 AM, "Satvinder Singh" wrote: Hi, So this is what I did: I created solr as a service. Below are the steps I followed for that:
$ tar xzf solr-X.Y.Z.tgz solr-X.Y.Z/bin/install_solr_service.sh --strip-components=2
$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -i /opt/solr1 -d /var/solr1 -u solr -s solr1 -p 8501
$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -i /opt/solr2 -d /var/solr2 -u solr -s solr2 -p 8502
Then, to start it in cloud mode, I modified solr1.cmd.in and solr2.cmd.in in /etc/defaults/. I added ZK_HOST=192.168.56.103:2181,192.168.56.103:2182,192.168.56.103:2183 (192.168.56.103 is where my 3 zookeeper instances are). Then I started the 2 solr services, solr1 and solr2. Then I created the configset: /bin/solr zk -upconfig -z 192.168.56.103:2181,192.168.56.103:2182,192.168.56.103:2183 -n Liferay -d server/solr/configsets/sample_techproducts_configs/conf Then I created the collection using: http://192.168.56.101:8501/solr/admin/collections?action=CREATE&name=dingdong&numShards=1&replicationFactor=2&collection.configName=liferay This created fine. Then I deleted the solrconfig.xml from the zookeeper Liferay configset. Then I uploaded the new solrconfig.xml to the configset. Then, when I do a reload on the collections, I get the error. Or if I create a new collection, I get the error. Thanks, Satvinder Singh

-----Original Message----- From: Renaud Delbru [mailto:renaud@siren.solutions] Sent: Thursday, May 19, 2016 7:13 AM To: solr-user@lucene.apache.org Subject: Re: Need Help with Solr 6.0 Cross Data Center Replication

I have reproduced your steps and the cdcr request handler started successfully. I have attached to this mail the config sets I have used. It is simply the sample_techproducts_config configset with your solrconfig.xml. I have used solr 6.0.0 with the following commands:
$ ./bin/solr start -cloud
$ ./bin/solr create_collection -c test_cdcr -d cdcr_configs
Connecting to ZooKeeper at localhost:9983 ... Uploading /solr-6.0.0/server/solr/configsets/cdcr_configs/conf for config test_cdcr to ZooKeeper at localhost:9983 Creating new collection 'test_cdcr' using command: http://localhost:8983/solr/admin/collections?action=CREATE&name=test_cdcr&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=test_cdcr { "responseHeader":{ "status":0, "QTime":5765}, "success":{"127.0.1.1:8983_solr":{ "responseHeader":{ "status":0, "QTime":4426}, "core":"test_cdcr_shard1_replica1"}}}
$ curl http://localhost:8983/solr/test_cdcr/cdcr?action=STATUS
<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">3</int></lst><lst name="status"><str name="process">stopped</str><str name="buffer">enabled</str></lst></response>
The difference is that I have used the embedded zookeeper, not a separate ensemble. Could you please provide the commands you used to create the collection? Kind Regards -- Renaud Delbru

On 16/05/16 16:55, Satvinder Singh wrote: I also am using a zk ensemble with 3 nodes on each side.
Thanks, Satvinder Singh

-----Original Message----- From: Satvinder Singh [mailto:satvinder.si...@nc4.com] Sent: Monday, May 16, 2016 11:54 AM To: solr-user@lucene.apache.org Subject: RE: Need Help with Solr 6.0 Cross Data Center Replication

Hi, So the way I am doing it is: for both the Target and Source sides, I took a copy of the sample_techproducts_config configset and created one configset. Then I modified the solrconfig.xml in there, both for the Target and Source side. And then created the collection, and I get the errors. I get the error if I create a new collection or try to reload an existing collection after the solrconfig update. Attached are the log and configs. Thanks, Satvinder Singh

-----Original Message----- From: Renaud Delbru [mailto:renaud@siren.solutions] Sent: Monday, May 16, 2016 11:45 AM To: solr-user@lucene.apache.org Subject: Re: Need Help with Solr 6.0 Cross Data Center Replication Hi,
Re: Solr 6 CDCR does not work
Hi Adam, could you check the response of the monitoring commands [1]: QUEUES, ERRORS, OPS. This might help in understanding whether documents are flowing or whether there are issues. Also, do you have an autocommit configured on the target? CDCR does not replicate commits, and therefore you have to send a commit command on the target to ensure that the latest replicated documents are visible.
[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462#CrossDataCenterReplication%28CDCR%29-Monitoringcommands
-- Renaud Delbru

On 29/05/16 12:10, Adam Majid Sanjaya wrote: I’m testing Solr 6 CDCR, but it seems not to be working.

Source configuration:
  <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="replica">
      <str name="zkHost">targetzkip:2181</str>
      <str name="source">corehol</str>
      <str name="target">corehol</str>
    </lst>
    <lst name="replicator">
      <str name="threadPoolSize">1</str>
      <str name="schedule">1000</str>
      <str name="batchSize">128</str>
    </lst>
    <lst name="updateLogSynchronizer">
      <str name="schedule">5000</str>
    </lst>
  </requestHandler>
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

Target(s) configuration:
  <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="buffer">
      <str name="defaultState">disabled</str>
    </lst>
  </requestHandler>
  <updateRequestProcessorChain name="cdcr-proccessor-chain">
    <processor class="solr.CdcrUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

Source log: no cdcr. Target log: no cdcr. I created a core (solrconfig.xml modified directly in the data_driven_schema_configs folder): #bin/solr create -c corehol -p 8983 I started cross-data center replication by running the START command on the source data center: http://sourceip:8983/solr/corehol/cdcr?action=START I disabled the buffer by running the DISABLEBUFFER command on the target data center: http://targetip:8983/solr/corehol/cdcr?action=DISABLEBUFFER The documents are not replicated to the target zone. What should I examine?
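The monitoring commands mentioned above can be queried directly on the source collection, e.g.:

  curl 'http://sourceip:8983/solr/corehol/cdcr?action=QUEUES&wt=json'
  curl 'http://sourceip:8983/solr/corehol/cdcr?action=ERRORS&wt=json'
  curl 'http://sourceip:8983/solr/corehol/cdcr?action=OPS&wt=json'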
Re: Inconsistent Solr document count on Target clouds when replicating data in Solr6 CDCR
Hi Dmitry, You can activate the debug log to see more information, such as the number of documents replicated by the cdcr replicator thread, etc. However, I think that the issue is that the indexes on the target instances are not refreshed, and therefore some of the indexed documents are not yet visible. Cdcr does not replicate commit operations, and lets the target cluster handle the refresh. You can try to manually execute a commit operation on the target cluster and see if all the documents appear. Kind Regards -- Renaud Delbru

On 19/05/16 17:39, dmitry.medve...@barclays.com wrote: I've come across a weird problem which I'm trying to debug at the moment, and was just wondering if anyone has stumbled across it too: I have an active-passive-passive configuration (1 Source cloud, 2 Targets), and NOT all the documents are being replicated to the target clouds. Example: 3 docs are being pushed/indexed on the Source cloud — S1, S2, S3 — and only 2 docs can be found (almost immediately) on the Target clouds, say T1, T3. The behavior is NOT consistent. I feel like it's a configuration issue, but it could also be a bug. How can I debug this issue? What log files should I examine? I couldn't find anything in the logs (of both the Source & Target clouds).

Source configuration:
  <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="replica">
      <str name="zkHost">10.88.52.219:9983,10.36.75.4:9983</str>
      <str name="source">demo</str>
      <str name="target">demo</str>
    </lst>
    <lst name="replicator">
      <str name="threadPoolSize">2</str>
      <str name="schedule">10</str>
      <str name="batchSize">128</str>
    </lst>
    <lst name="updateLogSynchronizer">
      <str name="schedule">1000</str>
    </lst>
  </requestHandler>
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

Target(s) configuration:
  <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="buffer">
      <str name="defaultState">disabled</str>
    </lst>
  </requestHandler>
  <updateRequestProcessorChain name="cdcr-proc-chain">
    <processor class="solr.CdcrUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

Thnx, Dmitry Medvedev, Tech search leader, Barclays Capital, Search Platform Engineering, Global Technology Infrastructure Services (GTIS)
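To activate the debug log mentioned above, one way is to raise the level for the CDCR replicator class in Solr's log4j configuration (a sketch; the class and file location are assumed from a default Solr 6 install):

  # append to server/resources/log4j.properties on each node, then restart
  echo 'log4j.logger.org.apache.solr.handler.CdcrReplicator=DEBUG' >> server/resources/log4j.properties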
Re: Need Help with Solr 6.0 Cross Data Center Replication
Hi Abdel, have you reloaded the collection [1] after uploading the configuration to zookeeper?
[1] https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api2
-- Renaud Delbru

On 16/05/16 17:29, Abdel Belkasri wrote: Thanks Renaud. Here is my setup: 1- I have created 2 sites: Main (source) and DR (target). 2- Both sites are the same before configuring CDCR. 3- The collections (source and target) are created before configuring CDCR. 4- Collections are created using interactive mode: accepting most defaults except the ports (gettingstarted collection). 5- I have a zookeeper ensemble too. 6- I change the solrconfig.xml, then I upload it using the command: # upload configset to zookeeper zkcli.bat -cmd upconfig -zkhost localhost:2181 -confname gettingstarted -solrhome C:\solr\solr-6-cloud\solr-6.0.0 -confdir C:\solr\solr-6-cloud\solr-6.0.0\server\solr\configsets\basic_configs\conf Renaud, can you send your config files... Thanks, --Abdel.

On Mon, May 16, 2016 at 12:16 PM, Satvinder Singh wrote: Thank you. To summarize, this is what I have, all VMs running on CentOS 7: Source Side |___ 1 VM running 3 Zookeeper instances on ports 2181, 2182 and 2183 (Zookeeper 3.4.8) (Java 1.8.0_91) |___ 1 VM running 2 solr 6.0 instances on ports 8501, 8502 (Solr 6.0) (Java 1.8.0_91) |___ sample_techproducts_config copied as 'liferay', and used to create collections; that is where I am modifying the solrconfig.xml Target Side |___ 1 VM running 3 Zookeeper instances on ports 2181, 2182 and 2183 (Zookeeper 3.4.8) (Java 1.8.0_91) |___ 1 VM running 2 solr 6.0 instances on ports 8501, 8502 (Solr 6.0) (Java 1.8.0_91) |___ sample_techproducts_config copied as 'liferay', and used to create collections; that is where I am modifying the solrconfig.xml Thanks, Satvinder Singh

-----Original Message----- From: Renaud Delbru [mailto:renaud@siren.solutions] Sent: Monday, May 16, 2016 11:59 AM To: solr-user@lucene.apache.org Subject: Re: Need Help with Solr 6.0 Cross Data Center Replication

Thanks Satvinder, Tomorrow I'll try to reproduce the issue with your steps and will let you know. Regards -- Renaud Delbru

On 16/05/16 16:53, Satvinder Singh wrote: Hi, So the way I am doing it is: for both the Target and Source sides, I took a copy of the sample_techproducts_config configset and created one configset. Then I modified the solrconfig.xml in there, both for the Target and Source side. And then created the collection, and I get the errors. I get the error if I create a new collection or try to reload an existing collection after the solrconfig update. Attached are the log and configs. Thanks, Satvinder Singh

-----Original Message----- From: Renaud Delbru [mailto:renaud@siren.solutions] Sent: Monday, May 16, 2016 11:45 AM To: solr-user@lucene.apache.org Subject: Re: Need Help with Solr 6.0 Cross Data Center Replication

Hi, I have tried to reproduce the problem, but was unable to. I have downloaded the Solr 6.0 distribution, added the cdcr request handler to the solr config and modified the update handler to register the CdcrUpdateLog, then started Solr in cloud mode and created a new collection using my solr config. The cdcr request handler starts properly and does not complain about the update log. Could you provide more background on how to reproduce the issue?
E.g., how do you create a new collection with the cdcr configuration? Are you trying to configure CDCR on collections that were created prior to the CDCR configuration? @Erik: I have noticed a small issue in the CDCR page of the reference guide: in the code snippet in Configuration -> Source Configuration, the <updateLog> element is nested within the <requestHandler> element. Thanks, Regards -- Renaud Delbru

On 15/05/16 23:13, Abdel Belkasri wrote: Erick, I tried the new configuration. The same issue that Satvinder is having. The log updater cannot be instantiated... class="solr.CdcrUpdateLog" — for some reason that class is causing a problem! Anyway, does anyone have a config that works? Regards, --Abdel

On Fri, May 13, 2016 at 11:57 AM, Erick Erickson wrote: I changed the CDCR doc; Oliver, could you take a glance and see if it is clear now? All I changed was the sample solrconfig sections: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462 Thanks, Erick

On Fri, May 13, 2016 at 6:23 AM, Oliver Rudolph wrote: Hi, I had the same problem. The documentation is kind of misleading here. You must not add a new <updateLog> element to your config but update the existing one. All you need to do is add the class="solr.CdcrUpdateLog" attribute to the <updateLog> element inside your existing <updateHandler>. Hope this helps! Mit freundlichen Grüßen / Kind regards, Oliver Rudolph, IBM Deutschland Research & Development GmbH
Re: Need Help with Solr 6.0 Cross Data Center Replication
Thanks Satvinder, Tomorrow I'll try to reproduce the issue with your steps and will let you know. Regards -- Renaud Delbru

On 16/05/16 16:53, Satvinder Singh wrote: Hi, So the way I am doing it is: for both the Target and Source sides, I took a copy of the sample_techproducts_config configset and created one configset. Then I modified the solrconfig.xml in there, both for the Target and Source side. And then created the collection, and I get the errors. I get the error if I create a new collection or try to reload an existing collection after the solrconfig update. Attached are the log and configs. Thanks, Satvinder Singh

-----Original Message----- From: Renaud Delbru [mailto:renaud@siren.solutions] Sent: Monday, May 16, 2016 11:45 AM To: solr-user@lucene.apache.org Subject: Re: Need Help with Solr 6.0 Cross Data Center Replication

Hi, I have tried to reproduce the problem, but was unable to. I have downloaded the Solr 6.0 distribution, added the cdcr request handler to the solr config and modified the update handler to register the CdcrUpdateLog, then started Solr in cloud mode and created a new collection using my solr config. The cdcr request handler starts properly and does not complain about the update log. Could you provide more background on how to reproduce the issue? E.g., how do you create a new collection with the cdcr configuration? Are you trying to configure CDCR on collections that were created prior to the CDCR configuration? @Erik: I have noticed a small issue in the CDCR page of the reference guide: in the code snippet in Configuration -> Source Configuration, the <updateLog> element is nested within the <requestHandler> element. Thanks, Regards -- Renaud Delbru

On 15/05/16 23:13, Abdel Belkasri wrote: Erick, I tried the new configuration. The same issue that Satvinder is having. The log updater cannot be instantiated... class="solr.CdcrUpdateLog" — for some reason that class is causing a problem! Anyway, does anyone have a config that works? Regards, --Abdel

On Fri, May 13, 2016 at 11:57 AM, Erick Erickson wrote: I changed the CDCR doc; Oliver, could you take a glance and see if it is clear now? All I changed was the sample solrconfig sections: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462 Thanks, Erick

On Fri, May 13, 2016 at 6:23 AM, Oliver Rudolph wrote: Hi, I had the same problem. The documentation is kind of misleading here. You must not add a new <updateLog> element to your config but update the existing one. All you need to do is add the class="solr.CdcrUpdateLog" attribute to the <updateLog> element inside your existing <updateHandler>. Hope this helps! Mit freundlichen Grüßen / Kind regards, Oliver Rudolph, IBM Deutschland Research & Development GmbH
Re: Need Help with Solr 6.0 Cross Data Center Replication
Hi, I have tried to reproduce the problem, but was unable to. I have downloaded the Solr 6.0 distribution, added the cdcr request handler to the solr config and modified the update handler to register the CdcrUpdateLog, then started Solr in cloud mode and created a new collection using my solr config. The cdcr request handler starts properly and does not complain about the update log. Could you provide more background on how to reproduce the issue? E.g., how do you create a new collection with the cdcr configuration? Are you trying to configure CDCR on collections that were created prior to the CDCR configuration? @Erik: I have noticed a small issue in the CDCR page of the reference guide: in the code snippet in Configuration -> Source Configuration, the <updateLog> element is nested within the <requestHandler> element. Thanks, Regards -- Renaud Delbru

On 15/05/16 23:13, Abdel Belkasri wrote: Erick, I tried the new configuration. The same issue that Satvinder is having. The log updater cannot be instantiated... class="solr.CdcrUpdateLog" — for some reason that class is causing a problem! Anyway, does anyone have a config that works? Regards, --Abdel

On Fri, May 13, 2016 at 11:57 AM, Erick Erickson wrote: I changed the CDCR doc; Oliver, could you take a glance and see if it is clear now? All I changed was the sample solrconfig sections: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462 Thanks, Erick

On Fri, May 13, 2016 at 6:23 AM, Oliver Rudolph wrote: Hi, I had the same problem. The documentation is kind of misleading here. You must not add a new <updateLog> element to your config but update the existing one. All you need to do is add the class="solr.CdcrUpdateLog" attribute to the <updateLog> element inside your existing <updateHandler>. Hope this helps! Mit freundlichen Grüßen / Kind regards, Oliver Rudolph, IBM Deutschland Research & Development GmbH
Re: Cross Data Center Replication - ERROR
Hi Abdel, Your configuration looks OK regarding the cdcr update log. Could you tell us a bit more about your Solr installation? More specifically, do the solr instances, both source and target, contain one collection that was created prior to the configuration of cdcr? Best, -- Renaud Delbru

On 11/05/16 20:46, Abdel Belkasri wrote: Hi there, I am trying to configure Cross Data Center Replication using solr 6.0. I am having issues creating collections or reloading old collections with the new solrconfig.xml, on both the target and source side. I keep getting the error "org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Solr instance is not configured with the cdcr update log".

This is my config on the Source:
  <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="buffer">
      <str name="defaultState">disabled</str>
    </lst>
  </requestHandler>
  <updateRequestProcessorChain name="cdcr-proc-chain">
    <processor class="solr.CdcrUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">500</int>
    <int name="maxNumLogsToKeep">20</int>
    <int name="numVersionBuckets">65536</int>
  </updateLog>

This is the config on the Target side:
  <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="buffer">
      <str name="defaultState">disabled</str>
    </lst>
  </requestHandler>
  <updateRequestProcessorChain name="cdcr-proc-chain">
    <processor class="solr.CdcrUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">500</int>
    <int name="maxNumLogsToKeep">20</int>
    <int name="numVersionBuckets">65536</int>
  </updateLog>

HOW SOLR IS RUNNING: the ZK_HOST parameter is set in the solr.in.sh file under /etc/default, and when you start the solr service it starts in cloud mode. Any help would be great. Thanks --Abdel.
Re: SolrCloud - Fails to delete documents when some shard is down
On 21/03/16 14:43, Erick Erickson wrote: Hmmm, you say "where I have many shards and can't have one problem causing no deletion of old data." You then have a shard that, when it comes back up, still has all the old data — and that _is_ acceptable? Seems like that would be jarring to the users when some portion of the docs in their collection reappeared... But no, there's no similar option for update that I know of. Solr tries very hard for consistency, and this would lead to an inconsistent data state. What is the root cause of your shard going down? That's the fundamental problem here...

As Erick said, what would cause a full shard and all its replicas to go down at the same time? Usually, if a shard has multiple replicas and one node goes down, the replicas on the other nodes should take the lead for this shard, and the delete queries should work. -- Renaud Delbru

Best, Erick

On Mon, Mar 21, 2016 at 7:08 AM, Tali Finelt wrote: Hi, I am using Solr 4.10.2. When one of the shards in my environment is down and fails to recover, the process of deleting documents from other shards fails as well. For example, when running: https://:8983/solr//update?stream.body=<delete><query>*:*</query></delete>&commit=true I get the following error message: No registered leader was found after waiting for 4000ms, collection: slice: This causes a problem in a big environment where I have many shards and can't have one problem causing no deletion of old data. Is there a way around that? To query data in such cases, I use the shards.tolerant=true parameter to get results even if some shards are down. Is there something similar for this case? Thanks, Tali
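For reference, the same delete can also be sent as a POST body instead of stream.body (a sketch with placeholder host and collection names):

  curl 'https://host:8983/solr/collection/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<delete><query>*:*</query></delete>'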
Re: Solr 6.0
Hi Shawn,

On 25/02/16 14:07, Shawn Heisey wrote: The CDCR functionality is currently present in the master branch, but I do not know for sure whether it will be included in the 6.0 release. I am not involved with that feature and have no idea how stable the code is.

CDCR is stable and has been running for months in a large production deployment without any known issues. Erick, who took care of committing it into trunk, was planning to release it as part of 6.0. -- Renaud Delbru
Re: Index complex JSON data in SOLR
Hi David, you might want to look at SIREn 1.4 [1], a plugin for Lucene/Solr that includes an update handler [2] which mimics the Elasticsearch index API. You can push JSON documents to the API and it will dynamically flatten and index the JSON documents into a set of fields (similar to Elasticsearch). It also indexes the full JSON into a SIREn field to support nested queries.
[1] http://siren.solutions/siren/downloads/
[2] http://siren.solutions/manual/solr-configuration-update-handler.html
-- Renaud Delbru

On 11/15/2014 10:05 PM, David Lee wrote: Hi All, How do I index complex JSON data in SOLR? For example: {prices:[{state:"CA", price:"101.0"}, {state:"NJ", price:"102.0"}, {state:"CO", price:"102.0"}]} It's simple in ElasticSearch, but in SOLR it always reports the following error: "Error parsing JSON field value. Unexpected OBJECT_START" Thanks, DL
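To illustrate the flattening mentioned above, David's document could be decomposed into dotted field paths roughly like this (a sketch of the general idea, not SIREn's exact field naming; note that plain flattening alone loses the correlation between state and price):

  {"prices.state": ["CA", "NJ", "CO"], "prices.price": [101.0, 102.0, 102.0]}

This is why SIREn additionally indexes the full JSON tree in its own field, so that correlated (nested) queries remain possible.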
[ANN] SIREn, a Lucene/Solr plugin for rich JSON data search
One of the coolest features of Lucene/Solr is its ability to index nested documents using a Blockjoin approach. While this works well for small documents and document collections, it becomes unsustainable for larger ones: Blockjoin works by splitting the original document into many documents, one per nested record. For example, a single USPTO patent (XML format converted to JSON) will end up being over 1500 documents in the index. This has massive implications for performance and scalability.

Introducing SIREn

SIREn is an open source plugin for Solr for indexing and searching rich nested JSON data. SIREn uses a sophisticated "tree indexing" design which ensures that the index is not artificially inflated. As a result, many types of nested queries can be up to 3x faster with SIREn, and, depending on the data, Blockjoin's memory requirements for faceting can be up to 10x higher than SIREn's. As such, SIREn allows you to use Solr for larger and more complex datasets, especially for sophisticated analytics. (You can read our whitepaper to find out more [1].)

SIREn is also truly schemaless: it even allows you to change the type of a property between documents without being restricted by a defined mapping. This can be very useful for data integration scenarios where data is described in different ways in different sources.

You only need a few minutes to download and try SIREn [2]. It comes with a detailed manual [3] and you have access to the code on GitHub [4]. We look forward to hearing your feedback.
[1] http://siren.solutions/siren/resources/whitepapers/comparing-siren-1-2-and-lucenes-blockjoin-performance-a-uspto-patent-search-scenario/
[2] http://siren.solutions/siren/downloads/
[3] http://siren.solutions/manual/preface.html
[4] https://github.com/sindicetech/siren
-- Renaud Delbru, CTO, SIREn Solutions
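For comparison, the Blockjoin approach mentioned above is typically queried in Solr with the parent block join query parser (a sketch; the collection name, the doc_type field marking parent documents, and the title field are hypothetical):

  curl 'http://localhost:8983/solr/patents/select' --data-urlencode 'q={!parent which="doc_type:parent"}title:engine'

Every nested record is its own document under this scheme, which is what inflates a single patent into 1500+ index documents.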
Re: Obtaining query AST?
Hi, have a look at the flexible query parser of lucene (contrib package) [1]. It provides a framework to easily create different parsing logic. You should be able to access the AST and to modify as you want how it is translated into a Lucene query (look at the processors and pipeline processors). Once you have your own query parser, it is straightforward to plug it into Solr.
[1] http://lucene.apache.org/java/3_1_0/api/contrib-queryparser/index.html
-- Renaud Delbru

On 31/05/11 19:24, dar...@ontrenet.com wrote: Hi, I want to write my own query expander. It needs to obtain the AST (abstract syntax tree) of an already parsed query string, navigate to certain parts of it (words), and make logical phrases of those words by adding to the AST where necessary. This cannot be done on the string because the query logic cannot be semantically altered (e.g. AND, OR, parens, etc.), so it must be parsed first. How can this be done with SolrJ? Thanks for any tips. Darren
Resolved- Re: Replication Error - Index fetch failed - File Not Found & OverlappingFileLockException
Hi, I found out the problem by myself. The reason was a bad deployment of Solr on Tomcat. Two instances of Solr were instantiated instead of one. The two instances were managing the same indexes, and therefore were trying to write at the same time. My apologies for the noise created on the mailing list. -- Renaud Delbru

On 30/05/11 21:52, Renaud Delbru wrote: Hi, For months we were using apache solr 3.1.0 snapshots without problems. Recently, we upgraded our index to apache solr 3.1.0 and also moved to a multi-core infrastructure (4 cores per node, each core having its own index). We found that one of the index slaves started to show failures, i.e., query errors. By looking at the log, we observed some errors during the latest snappull, due to two types of exceptions: - java.io.FileNotFoundException: File does not exist ... and - java.nio.channels.OverlappingFileLockException: null Then, after the failed pull, the index started to show some index-related failures: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:207) However, after manually restarting the node, everything went back to normal. You can find a more detailed log at [1]. We are afraid of seeing this problem occur again. Have you any idea what the cause could be? Or a solution to avoid such a problem?
[1] http://pastebin.com/vbnyrUgJ
Thanks in advance
Replication Error - Index fetch failed - File Not Found & OverlappingFileLockException
Hi, For months we were using apache solr 3.1.0 snapshots without problems. Recently, we upgraded our index to apache solr 3.1.0 and also moved to a multi-core infrastructure (4 cores per node, each core having its own index). We found that one of the index slaves started to show failures, i.e., query errors. By looking at the log, we observed some errors during the latest snappull, due to two types of exceptions: - java.io.FileNotFoundException: File does not exist ... and - java.nio.channels.OverlappingFileLockException: null Then, after the failed pull, the index started to show some index-related failures: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:207) However, after manually restarting the node, everything went back to normal. You can find a more detailed log at [1]. We are afraid of seeing this problem occur again. Have you any idea what the cause could be? Or a solution to avoid such a problem?
[1] http://pastebin.com/vbnyrUgJ
Thanks in advance -- Renaud Delbru
Re: Indexing documents with "complex multivalued fields"
Hi, you could look at this recent thread [1]; it is similar to your problem.
[1] http://search.lucidimagination.com/search/document/33ec1a98d3f93217/search_across_related_correlated_multivalue_fields_in_solr#1f66876c782c78d5
-- Renaud Delbru

On 23/05/11 14:40, anass talby wrote: Hi, I'm new to solr and would like to index documents that have complex multivalued fields. I want to do something like the following (the XML snippet was garbled in the archive; the surviving values were: 1, 1 red, 2 green, ...). How can I do this with solr? Thanks in advance.
Re: Support for huge data set?
Hi, Our system [1] consists of over 220 million semi-structured web documents (RDF, Microformats, etc.), with fairly small documents (a few KB) and large documents (a few MB). Each document has in addition a dozen extra fields for indexing and storing metadata about the document. It runs on top of Solr 3.1 with the following configuration: - 2 master indexes - 2 slave indexes. Each server is a quad-core with 32 GB of RAM and 4 SATA drives in RAID10. The indexing performance is quite good: we can reindex our full data collection in less than a day (using only the two master indexes). Live updates (a few million documents per day) are processed continuously by our masters. We replicate the changes every hour to the slave indexes. Query performance is also OK (you can try it yourself at [1]). As a side note, we are using Solr 3.1 plus a plugin we have developed for indexing semi-structured data. This plugin adds much more data to the index than plain Solr, so you can expect even better performance by using plain solr (with respect to indexing performance).
[1] http://sindice.com
-- Renaud Delbru

On 12/05/11 17:59, atreyu wrote: Hi, I have about 300 million docs (or 10TB of data) which is doubling every 3 years, give or take. The data mostly consists of Oracle records, webpage files (HTML/XML, etc.) and office doc files. There are between two and four dozen concurrent users, typically. The indexing server has over 27 GB of RAM, but it still gets extremely taxed, and this will only get worse. Would Solr be able to efficiently deal with a load of this size? I am trying to avoid the heavy cost of GSA, etc... Thanks.
Re: Search across related/correlated multivalue fields in Solr
On 27/04/11 19:50, Walter Underwood wrote: This kind of thing is really easy in an XML database. That is an XPath expression, not even a search.

Indeed; in fact, SIREn is based on an XML IR technique, i.e., a simplified node-based indexing scheme. -- Renaud Delbru
Re: Search across related/correlated multivalue fields in Solr
On 27/04/11 19:37, Renaud Delbru wrote: Hi Jason, On 27/04/11 19:25, Jason Rutherglen wrote: Renaud, can you provide a brief synopsis of how your system works? SIREn provides a new "field type" for Solr. In this particular SIREn field, the data is not a piece of text but is organised in a table. SIREn then provides query objects to query a specific cell (or group of cells) of this table, a specific row (or group of rows), etc. So, let's take the example of ronotica: you want to index a 1:N relationship between students and educations. Your Solr document will look like: doc { student_id: 100 firstname: john lastname: doe education: { [[2008], [OHIO_ST]], [[2010], [YALE]] } } where student_id, firstname and lastname are normal solr fields, and education is a siren field. This field represents a table with two columns, degreeYear and Institution, where each row represents an entry, or record, associated with the student. Then, you can use SIREn to query a document having a row matching 2010 and Yale. In this case, SIREn will not return the john doe student.

I meant "to query a document having a row matching 2010 and *OHIO_ST*" instead of Yale. Sorry for the confusion.
Re: Search across related/correlated multivalue fields in Solr
Hi Jason, On 27/04/11 19:25, Jason Rutherglen wrote: Renaud, Can you provide a brief synopsis of how your system works? SIREn provides a new "field type" for Solr. In this particular SIREn field, the data is not a piece of text, but is organised in a table. Then, SIREn provides query objects to query a specific cell (or group of cells) of this table, a specific row (or group of rows), etc. So, let's take the example of ronotica: you want to index a 1:N relationship between students and educations. Your Solr document will look like: doc { student_id: 100 firstname: john lastname: doe education: { [[2008], [OHIO_ST]], [[2010], [YALE]] } } where student_id, firstname and lastname are normal Solr fields, and education is a SIREn field. This field represents a table with two columns, degreeYear and Institution, where each row represents an entry, or record, associated to the student. Then, you can use SIREn to query a document having a row matching 2010 and Yale. In this case, SIREn will not return the john doe student. I hope my brief synopsis and example are clear; let me know if there is something that you don't understand (maybe in private). Regards, -- Renaud Delbru
Re: Search across related/correlated multivalue fields in Solr
Hi, you might want to look at the SIREn plugin [1,2], which allows you to index and query 1:N relationships such as yours in a tabular data format [3]. [1] http://siren.sindice.com/ [2] https://github.com/rdelbru/SIREn [3] https://dev.deri.ie/confluence/display/SIREn/Indexing+and+Searching+Tabular+Data Kind Regards, -- Renaud Delbru On 27/04/11 18:30, ronotica wrote: The nature of my project is such that search is needed, specifically search across related entities. We want to perform several queries involving a correlation between two or more properties of a given entity in a collection. To put things in context, here is a snippet of the domain: Student { firstname, lastname } Education { degreeCode, degreeYear, institution } The database tables look like so:

STUDENT
STUDENT_ID  FNAME    LNAME
100         John     Doe
200         Rasheed  Jones
300         Mary     Hampton

EDUCATION
EDUCATION_ID  DEGREE_CODE  DEGREE_YR  INSTITUTION  STUDENT_ID
1             MD           2008       OHIO_ST      100
2             PHD          2010       YALE         100
3             MS           2007       OHIO_ST      200
4             MD           2010       YALE         300

A student can have many educations. Currently, our documents look like this in Solr:

DOC_ID  STUDENT_ID  FNAME    LNAME    DEGREE_CODE  DEGREE_YR   INSTITUTION
100     100         John     Doe      MD, PHD      2008, 2010  OHIO_ST, YALE
101     200         Rasheed  Jones    MS           2007        OHIO_ST
102     300         Mary     Hampton  MD           2010        YALE

Searching for all students who graduated from OHIO_ST in 2010 currently gives a hit (John Doe) when it shouldn't. What is the best way to overcome this issue in Solr? This is only happening when I am searching across correlated fields, mainly because the data has been denormalized and Lucene has no notion of relationships between the various fields. One way that has come to mind is to have separate documents for "education" and perform multiple searches to get at an answer. Besides this, is there any other way? Does Solr provide any elegant solution for this? Any help will be greatly appreciated. Thanks. PS: We have about 15 of these kinds of relationships, all relating to the student, and we would like to perform search on each of them. -- View this message in context: http://lucene.472066.n3.nabble.com/Search-across-related-correlated-multivalue-fields-in-Solr-tp2871176p2871176.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Queries with undetermined field count
Hi, SIREn [1], a Lucene/Solr plugin, allows you to perform queries across an undetermined number of fields, even if you have hundreds of thousands of fields. It might be helpful for your scenario. [1] http://siren.sindice.com -- Renaud Delbru On 07/04/11 19:18, jisenhart wrote: I have a question on how to set up queries not having a predetermined field list to search on. Here are some sample docs, 1234 hihello lalachika chika boom boom 1235 foobarhappy happy joy joy some textsome more words to search . . . 4567 bedrock memeyou you super duperare we done? Now a given user, say fred, belongs to any number of groups, say only fred and group1 for this example. A query on 'foo' is easy if I know that fred belongs to only these two: _fred:foo OR _group1:foo //will find a hit on doc 1235 However, a user can belong to any number of groups. How do I perform such a search if the user's group list is arbitrarily large? Could I somehow make use of reference docs like so: fred → fred, group1 . . . wilma → wilma, group1, group5, group9, group11, group31, group40
Re: Matching on a multi valued field
Hi, you could try the SIREn plugin [1], which supports multi-valued fields. [1] http://siren.sindice.com -- Renaud Delbru On 29/03/11 21:57, Brian Lamb wrote: Hi all, I have a multi-valued field set up, and I have some records: RECORD1: "man's best friend", "pooch" RECORD2: "man's worst enemy", "friend to no one" Now if I do a search such as: http://localhost:8983/solr/search/?q=*:*&fq={!q.op=AND df=common_names}man's friend Both records are returned. However, I only want RECORD1 returned. I understand why RECORD2 is returned, but how can I structure my query so that only RECORD1 is returned? Thanks, Brian Lamb
Re: Triggering optimise based on time interval
Mainly technical administration effort. We are trying to have a Solr packaging that:
- minimises the effort to deploy the system on a machine,
- reduces errors when deploying,
- centralises the logic of the Solr system.
Ideally, we would like to have a central place (e.g., solrconfig) where the logic of the system is configured. In that case, the system administrator does not have to bother with a long list of tasks and checkpoints every time we release a new version of the Solr system or extend our clusters. He should just have to take the new release, ship it to a machine, and start up Solr. -- Renaud Delbru On 16/02/11 13:15, Stefan Matheis wrote: Renaud, just because i'm interested in .. what are your concerns about using cron for that? Stefan On Wed, Feb 16, 2011 at 2:12 PM, Renaud Delbru wrote: Hi, We would like to trigger an optimise every x hours. From what I can see, there is nothing in Solr (3.1-SNAPSHOT) that enables us to do such a thing. We have a master-slave configuration. The masters are tuned for fast indexing (large merge factor). However, for the moment, the master index is replicated as-is to the slaves, and therefore it does not provide very fast query times. Our idea was: - to configure the replication so that it only happens after an optimise, and - to schedule a partial optimise every x hours, in order to reduce the number of segments for faster querying. We do not want to rely on a cron job for executing the partial optimise every x hours; we would prefer to configure this directly within the Solr config. Our first idea was to create a SolrEventListener, triggered postCommit, that would be in charge of executing an optimise at a regular time interval. Is this a good approach? Or are there other solutions to achieve this? Thanks, -- Renaud Delbru
Triggering optimise based on time interval
Hi, We would like to trigger an optimise every x hours. From what I can see, there is nothing in Solr (3.1-SNAPSHOT) that enables us to do such a thing. We have a master-slave configuration. The masters are tuned for fast indexing (large merge factor). However, for the moment, the master index is replicated as-is to the slaves, and therefore it does not provide very fast query times. Our idea was: - to configure the replication so that it only happens after an optimise, and - to schedule a partial optimise every x hours, in order to reduce the number of segments for faster querying. We do not want to rely on a cron job for executing the partial optimise every x hours; we would prefer to configure this directly within the Solr config. Our first idea was to create a SolrEventListener, triggered postCommit, that would be in charge of executing an optimise at a regular time interval. Is this a good approach? Or are there other solutions to achieve this? Thanks, -- Renaud Delbru
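A minimal sketch of the postCommit-listener idea discussed above. PeriodicOptimizeListener and its intervalHours/url parameters are hypothetical names, not an existing Solr class; the Solr 3.x SolrEventListener and SolrJ APIs are assumed:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.core.SolrEventListener;
    import org.apache.solr.search.SolrIndexSearcher;

    public class PeriodicOptimizeListener implements SolrEventListener {
      private long intervalMs;
      private String url;
      private volatile long lastOptimize = System.currentTimeMillis();
      private final ExecutorService executor = Executors.newSingleThreadExecutor();

      public void init(NamedList args) {
        intervalMs = Long.parseLong((String) args.get("intervalHours")) * 3600L * 1000L;
        url = (String) args.get("url"); // e.g. http://localhost:8983/solr/core0
      }

      public void postCommit() {
        long now = System.currentTimeMillis();
        if (now - lastOptimize < intervalMs) return;
        lastOptimize = now;
        // Run the optimise in a background thread so we do not block
        // the commit (and the update requests waiting on it).
        executor.submit(new Runnable() {
          public void run() {
            try {
              new CommonsHttpSolrServer(url).optimize();
            } catch (Exception e) {
              // log and carry on; the next interval will retry
            }
          }
        });
      }

      public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) {}
    }

Such a listener would be registered under the update handler in solrconfig.xml with something like <listener event="postCommit" class="PeriodicOptimizeListener"><str name="intervalHours">6</str><str name="url">http://localhost:8983/solr/core0</str></listener>.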
Re: Filter Query, Filter Cache and Hit Ratio
Thanks a lot, this totally makes sense, but it was hard to figure this out. cheers -- Renaud Delbru On 28/01/11 20:39, cbenn...@job.com wrote: Ooops, I meant NOW/DAY -Original Message- From: cbenn...@job.com [mailto:cbenn...@job.com] Sent: Friday, January 28, 2011 3:37 PM To: solr-user@lucene.apache.org Subject: RE: Filter Query, Filter Cache and Hit Ratio Hi, You've used NOW in the range query, which will give a date/time accurate to the millisecond; try using NOW\DAY Colin. -Original Message- From: Renaud Delbru [mailto:renaud.del...@deri.org] Sent: Friday, January 28, 2011 2:22 PM To: solr-user@lucene.apache.org Subject: Filter Query, Filter Cache and Hit Ratio Hi, I am looking for some more information on how the filter cache works, and how the hits are incremented. We are using filter queries for certain predefined values, such as timestamp:[2011-01-21T00:00:00Z+TO+NOW] (which is the current day). From what I understand from the documentation: "the filter cache stores the results of any filter queries ("fq" parameters) that Solr is explicitly asked to execute. (Each filter is executed and cached separately. When it's time to use them to limit the number of results returned by a query, this is done using set intersections.)" So, we were imagining that if two consecutive queries (like the one above) were using the same timestamp filter query, the second query would take advantage of the filter cache, and we would see the number of hits increasing (a hit on the cached timestamp filter query). However, this is not the case: the number of hits on the filter cache does not increase and stays very low. Is this normal? INFO: [] webapp=/siren path=/select params={wt=javabin&rows=0&version=2&fl=id,score&start=0&q=*:*&isShard=true&fq=timestamp:[2011-01-21T00:00:00Z+TO+NOW]&fq=domain:my.wordpress.com&fsv=true} hits=0 status=0 QTime=139 INFO: [] webapp=/siren path=/select params={wt=javabin&rows=0&version=2&fl=id,score&start=0&q=*:*&isShard=true&fq=timestamp:[2011-01-21T00:00:00Z+TO+NOW]&fq=domain:syours.wordpress.com&fsv=true} hits=0 status=0 QTime=138 -- Renaud Delbru
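To make the point concrete: NOW resolves to the current millisecond, so each request produces a different filter string and therefore a new filterCache entry. Rounding with date math keeps the string identical for the whole day, so the cached entry is reused. A sketch, not the exact query from this thread:

    fq=timestamp:[NOW/DAY TO NOW/DAY+1DAY]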
Filter Query, Filter Cache and Hit Ratio
Hi, I am looking for some more information on how the filter cache works, and how the hits are incremented. We are using filter queries for certain predefined values, such as timestamp:[2011-01-21T00:00:00Z+TO+NOW] (which is the current day). From what I understand from the documentation: "the filter cache stores the results of any filter queries ("fq" parameters) that Solr is explicitly asked to execute. (Each filter is executed and cached separately. When it's time to use them to limit the number of results returned by a query, this is done using set intersections.)" So, we were imagining that if two consecutive queries (like the one above) were using the same timestamp filter query, the second query would take advantage of the filter cache, and we would see the number of hits increasing (a hit on the cached timestamp filter query). However, this is not the case: the number of hits on the filter cache does not increase and stays very low. Is this normal? INFO: [] webapp=/siren path=/select params={wt=javabin&rows=0&version=2&fl=id,score&start=0&q=*:*&isShard=true&fq=timestamp:[2011-01-21T00:00:00Z+TO+NOW]&fq=domain:my.wordpress.com&fsv=true} hits=0 status=0 QTime=139 INFO: [] webapp=/siren path=/select params={wt=javabin&rows=0&version=2&fl=id,score&start=0&q=*:*&isShard=true&fq=timestamp:[2011-01-21T00:00:00Z+TO+NOW]&fq=domain:syours.wordpress.com&fsv=true} hits=0 status=0 QTime=138 -- Renaud Delbru
Re: Specifying an AnalyzerFactory in the schema
Hi Chris, On 24/01/11 21:18, Chris Hostetter wrote: : I notice that in the schema, it is only possible to specify an Analyzer class, : but not a Factory class as for the other elements (Tokenizer, Filter, etc.). : This limits the use of this feature, as it is impossible to specify parameters : for the Analyzer. : I have looked at the IndexSchema implementation, and I think this requires a : simple fix. Should I open an issue about it? Support for constructing Analyzers directly is very crude, and primarily existed for making it easy for people with old indexes and analyzers to keep working. Moving forward, Lucene/Solr eventually won't "ship" concrete Analyzer implementations at all (at least, that's the last consensus I remember), so enhancing support for loading Analyzers (or AnalyzerFactories) doesn't make much sense. Practically speaking, if you have an existing Analyzer that you want to use in Solr, instead of writing an "AnalyzerFactory" for it, you could just write a "TokenizerFactory" that wraps it instead -- functionally that would let you achieve everything an AnalyzerFactory would, except that Solr would already handle letting the schema.xml specify the positionIncrementGap (which you could happily ignore if you wanted). Thanks for the trick, I hadn't thought about doing that. This should work indeed. cheers -- Renaud Delbru
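For illustration, one way the suggested TokenizerFactory wrapper might look. MyLegacyAnalyzer and the factory class name are hypothetical, the Lucene/Solr 3.x analysis API is assumed, and stream reuse/reset handling is omitted:

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.solr.analysis.BaseTokenizerFactory;

    public class MyLegacyAnalyzerTokenizerFactory extends BaseTokenizerFactory {
      private final Analyzer analyzer = new MyLegacyAnalyzer(); // hypothetical legacy analyzer

      @Override
      public Tokenizer create(Reader input) {
        // Build the analyzer's token stream, then adapt it to the Tokenizer
        // contract by sharing its attribute source and delegating increments.
        // The field name is ignored by most analyzers, hence null here.
        final TokenStream stream = analyzer.tokenStream(null, input);
        return new Tokenizer(stream, input) {
          @Override
          public boolean incrementToken() throws IOException {
            return stream.incrementToken();
          }
        };
      }
    }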
Specifying an AnalyzerFactory in the schema
Hi, I notice that in the schema, it is only possible to specify an Analyzer class, but not a Factory class as for the other elements (Tokenizer, Filter, etc.). This limits the use of this feature, as it is impossible to specify parameters for the Analyzer. I have looked at the IndexSchema implementation, and I think this requires a simple fix. Should I open an issue about it? Regards, -- Renaud Delbru
Re: Why does Solr commit block indexing?
Hi Grant, looking forward to a fix ;o). Such a fix would considerably improve the performance of Solr's update throughput (even if its performance is already quite impressive). cheers -- Renaud Delbru On 17/12/10 13:05, Grant Ingersoll wrote: I'm not sure if there is an issue open, but I know I've talked w/ Yonik about this and a few other changes to the DirectUpdateHandler2 in the past. It does indeed need to be fixed. -Grant On Dec 17, 2010, at 7:04 AM, Renaud Delbru wrote: Hi Michael, thanks for your answer. Is the Solr team aware of the problem? Is there an issue opened about this, or ongoing work on it? Regards, -- Renaud Delbru On 16/12/10 16:45, Michael McCandless wrote: Unfortunately, (I think?) Solr currently commits by closing the IndexWriter, which must wait for any running merges to complete, and then opening a new one. This is really rather silly because IndexWriter has had its own commit method (which does not block ongoing indexing nor merging) for quite some time now. I'm not sure why we haven't switched over already... there must be some trickiness involved. Mike On Thu, Dec 16, 2010 at 9:39 AM, Renaud Delbru wrote: Hi, See log at [1]. We are using the latest snapshot of lucene_branch3.1. We have configured Solr to use the ConcurrentMergeScheduler. When a commit() runs, it blocks indexing (all incoming update requests are blocked until the commit operation is finished) ... at the end of the log we notice a 4 minute gap during which none of the solr clients trying to add data receive any attention. This is a bit annoying as it leads to timeout exceptions on the client side. Here, the commit time is only 4 minutes, but it can be larger if there are merges of large segments. I thought Solr was able to handle commits and updates at the same time: the commit operation should be done in the background, and the server should still continue to receive update requests (maybe at a slower rate than normal). But it looks like this is not the case. Is this normal behaviour? [1] http://pastebin.com/KPkusyVb Regards -- Renaud Delbru -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem docs using Solr/Lucene: http://www.lucidimagination.com/search
Re: Why does Solr commit block indexing?
Hi Michael, thanks for your answer. Is the Solr team aware of the problem? Is there an issue opened about this, or ongoing work on it? Regards, -- Renaud Delbru On 16/12/10 16:45, Michael McCandless wrote: Unfortunately, (I think?) Solr currently commits by closing the IndexWriter, which must wait for any running merges to complete, and then opening a new one. This is really rather silly because IndexWriter has had its own commit method (which does not block ongoing indexing nor merging) for quite some time now. I'm not sure why we haven't switched over already... there must be some trickiness involved. Mike On Thu, Dec 16, 2010 at 9:39 AM, Renaud Delbru wrote: Hi, See log at [1]. We are using the latest snapshot of lucene_branch3.1. We have configured Solr to use the ConcurrentMergeScheduler. When a commit() runs, it blocks indexing (all incoming update requests are blocked until the commit operation is finished) ... at the end of the log we notice a 4 minute gap during which none of the solr clients trying to add data receive any attention. This is a bit annoying as it leads to timeout exceptions on the client side. Here, the commit time is only 4 minutes, but it can be larger if there are merges of large segments. I thought Solr was able to handle commits and updates at the same time: the commit operation should be done in the background, and the server should still continue to receive update requests (maybe at a slower rate than normal). But it looks like this is not the case. Is this normal behaviour? [1] http://pastebin.com/KPkusyVb Regards -- Renaud Delbru
Why does Solr commit block indexing?
Hi, See log at [1]. We are using the latest snapshot of lucene_branch3.1. We have configured Solr to use the ConcurrentMergeScheduler. When a commit() runs, it blocks indexing (all incoming update requests are blocked until the commit operation is finished) ... at the end of the log we notice a 4 minute gap during which none of the solr clients trying to add data receive any attention. This is a bit annoying as it leads to timeout exceptions on the client side. Here, the commit time is only 4 minutes, but it can be larger if there are merges of large segments. I thought Solr was able to handle commits and updates at the same time: the commit operation should be done in the background, and the server should still continue to receive update requests (maybe at a slower rate than normal). But it looks like this is not the case. Is this normal behaviour? [1] http://pastebin.com/KPkusyVb Regards -- Renaud Delbru
Re: How to Transmit and Append Indexes
Have you looked at Apache Nutch [1]? It is a distributed web crawl and search system, based on Lucene/Solr and Hadoop. [1] http://nutch.apache.org/ -- Renaud Delbru On 19/11/10 16:52, Bing Li wrote: Hi, all, I am working on a distributed searching system. Now I have one server only. It has to crawl pages from the Web, generate indexes locally and respond to users' queries. I think this is too busy for it to work smoothly. I plan to use at least two servers. The jobs to crawl pages and generate indexes are done by one of them. After that, the newly available indexes should be transmitted to another one, which is responsible for responding to users' queries. From the users' point of view, this system must be fast. However, I don't know how I can get the additional indexes which I can transmit. After transmission, how do I append them to the old indexes? Does the appending block searching? Thanks so much for your help! Bing Li
Re: How to extend IndexSchema and SchemaField
Hi Chris, I have opened an issue (SOLR-2146 [1]) following that discussion. [1] https://issues.apache.org/jira/browse/SOLR-2146 cheers -- Renaud Delbru On 14/09/10 01:06, Chris Hostetter wrote: : Yes, I have thought of that, or even extending field type. But this does not : work for my use case, since I can have multiple fields of the same type : (therefore with the same field type, and same analyzer), but each one of them : needs specific information. Therefore, I think the only "nice" way to achieve : this is to have the possibility to add attributes to any field definition. Right, at the moment custom FieldType classes can specify whatever attributes they want to use in the <fieldType> declaration -- but it's not possible to specify arbitrary attributes that can be used in the <field> declaration. By all means, please open an issue requesting this as a feature. I don't know that anyone explicitly set out to impose this limitation, but one of the reasons it likely exists is because SchemaField is not something that is intended to be customized -- while FieldType objects are constructed once at startup, SchemaField objects are frequently created on the fly when dealing with dynamicFields, so initialization complexity is kept to a minimum. That said -- this definitely seems like the type of usecase that we should try to find *some* solution for -- even if it just means having Solr automatically create hidden FieldType instances for you on startup based on attributes specified in the <field> that the corresponding FieldType class understands. -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
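For contrast, the fieldType-level mechanism Hoss refers to looks roughly like this: a custom FieldType consumes its own attributes from the init args before the standard ones are validated. A sketch against the Solr 1.4/3.x schema API; MyFieldType and myattribute are made-up names:

    import java.util.Map;
    import org.apache.solr.schema.IndexSchema;
    import org.apache.solr.schema.TextField;

    public class MyFieldType extends TextField {
      private String myAttribute;

      @Override
      protected void init(IndexSchema schema, Map<String, String> args) {
        // Consume the custom attribute before the superclass validates
        // the remaining (standard) attributes.
        myAttribute = args.remove("myattribute");
        super.init(schema, args);
      }
    }

It would then be declared as <fieldType name="mytype" class="MyFieldType" myattribute="somevalue"/>; the limitation discussed in the thread is that this works per type, not per field.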
Re: How to extend IndexSchema and SchemaField
Hi Charlie, On 10/09/10 16:11, Charlie Jackson wrote: Have you already explored the idea of using a custom analyzer for your field? Depending on your use case, that might work for you. Yes, I have thought of that, or even extending field type. But this does not work for my use case, since I can have multiple fields of the same type (therefore with the same field type, and same analyzer), but each one of them needs specific information. Therefore, I think the only "nice" way to achieve this is to have the possibility to add attributes to any field definition. cheers -- Renaud Delbru
Re: How to extend IndexSchema and SchemaField
Hi Javier, On 10/09/10 07:15, Javier Diaz wrote: Looking at the code we found out that there's no way to extend the schema. Finally we copied part of the code that reads the schema into our RequestHandler. It works but I'm not sure if it's the best way to do it. Let me know if you want our code as an example. So, do you mean you are duplicating part of the code for reading the schema, and parsing the schema your own way in your request handler? If you could share the code, it could be helpful and inspiring. cheers. -- Renaud Delbru
Re: How to extend IndexSchema and SchemaField
Hi, so I suppose there is no solution. Is there a chance that SchemaField becomes extensible in the future? Because, at the moment, all the field attributes (indexed, stored, etc.) are hardcoded inside SchemaField. Do you think it is worth opening an issue about it? -- Renaud Delbru On 07/09/10 16:13, Renaud Delbru wrote: Hi, I would like to extend the field node in the schema.xml by adding new attributes. For example, I would like to be able to write: <field name="content" type="text" indexed="true" stored="true" myattribute="somevalue"/> And be able to access myattribute directly from the IndexSchema and SchemaField objects. However, these two classes are final, and also not very easy to extend. Is there any other solution? thanks,
How to extend IndexSchema and SchemaField
Hi, I would like to extend the field node in the schema.xml by adding new attributes. For example, I would like to be able to write: <field name="content" type="text" indexed="true" stored="true" myattribute="somevalue"/> And be able to access myattribute directly from the IndexSchema and SchemaField objects. However, these two classes are final, and also not very easy to extend. Is there any other solution? thanks, -- Renaud Delbru
Re: determine which value produced a hit in multivalued field type
Hi, SIREn [1] could provide you with such information (returning the value index in the multi-valued field). But currently, only a Lucene extension is available, and you'll have to modify the SIREn query operator a little bit so that it returns the value position in the query results. [1] http://siren.sindice.com/ -- Renaud Delbru On 22/01/10 22:52, Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS] wrote: Hi, If I have a multiValued field type of text, and I put values [cat,dog,green,blue] in it, is there a way to tell, when I execute a query against that field for dog, that it was in the 1st element position for that multiValued field? Thanks! Tim
Re: Best wasy to solve Parent-Child relationship without Denormalizing?
Hi, SIREn [1] could help you to solve this task (look at the different indexing examples). But currently, only a Lucene extension is available. If you want to use it in Solr, you will have to implement your own Solr plugin (which should require only a limited amount of work). [1] http://siren.sindice.com/ -- Renaud Delbru On 19/01/10 13:14, karthi_1986 wrote: Hi, Here is an extract of my data schema, in which my user should be able to issue the following search: company_description:pharmaceutical AND product_description:cosmetic [Company profile] Company name, Company url, Company description, Company user rating [Product profile] Product name, Product category, Product description, Product rating So, I'm expecting a result where all cosmetic products created by pharmaceutical companies are returned. The problem is, I've read in posts a year old that this parent-child relationship can only be solved by indexing the denormalized data together. However, I'm dealing with 10,000,000 companies with possibly 10 products each, so my data requirements are going to be HUGE!! Is there a new feature in Solr which can handle this for me without the need for de-normalization?
Re: how to scan dynamic field without specifying each field in query
Hi, maybe SIREn [1] can help you with this task. SIREn is a Lucene plugin that allows you to index and query tabular data. You can, for example, create a SIREn field "foo", index n values in n cells, and then query a specific cell or a range of cells. Unfortunately, the Solr plugin is not yet available, and therefore you will have to write your own query syntax and parser for this task. Regards, [1] http://siren.sindice.com -- Renaud Delbru gdeconto wrote: thx for the reply. you mean into a multivalue field? possible, but was wondering if there was something more flexible than that. the ability to use a function (ie myfunction) would open up some possibilities for more complex searching and search syntax. I could write my own query parser with special extended syntax, but that is farther than I wanted to go. Manepalli, Kalyan wrote: You can copy the dynamic fields' values into a different field and query on that field. Thanks, Kalyan Manepalli
Re: Indexing arbitrary RDF resources
Hi, Here at DERI [1], we are working on an extension for Lucene / Solr to handle RDF data and structured queries. The engine is currently in use in the Sindice [2] search engine. We are planning to release our extension, called SIREn (Semantic Information Retrieval Engine), as open source in the coming month. The approach with dynamic fields could work, but it has strong limitations when dealing with a large number of fields. Among them, it leads to data duplication in the dictionary (the dictionary will quickly become very large, since multiple fields / predicates can have identical terms), and it is very inefficient for queries across all the fields. Our work overcomes such problems. We are also currently working on supporting join queries among entities / documents that are not of the most simple kind. If you want to know more, you can contact our team (or send me an email directly). Maybe it could be a good idea to join our efforts. [1] http://www.deri.ie/ [2] http://www.sindice.com/ -- Regards, Renaud Delbru re...@gmx.net wrote: Hey, all! I'm planning a project where I want to write software that takes an RDF class and uses that information to dynamically support indexing and faceted searching of resources of that type. This would (as I imagine it) function with dynamic fields in all required data types and multiplicities, and a mapping from properties to field names. The project will be part of the open CMS software Drupal, which already has a working Solr integration module. You can find details about my project idea here: http://groups.drupal.org/node/20589 Has something like this already been done or thought of by anyone here? Does anyone here have hints or remarks regarding the idea? Thanks in advance for any comments!
Re: Store content out of solr
A common approach (for web search engines) is to use HBase [1] as a "document repository". Each document indexed inside Solr has an entry (a row, identified by the document URL) in the HBase table. This works great when you deal with a large data collection (it scales better than a SQL database). The trade-off is that it is slightly slower than a local database. [1] http://hadoop.apache.org/hbase/ -- Renaud Delbru roberto wrote: Hello, We are indexing information from different sources, so we would like to centralize the information content so we can retrieve it using the ID provided by Solr. Has anyone done something like this, and do you have some advice? I am thinking of storing the information in a database like MySQL. Thanks,
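As an illustration of this pattern, a minimal sketch against the HBase client API (a 0.90-era API is assumed, and the "documents" table and "content:raw" column are made up for the example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DocumentRepository {
      private final HTable table;

      public DocumentRepository() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        table = new HTable(conf, "documents"); // one row per document URL
      }

      // Store the raw content under the document URL, the same key
      // that is indexed in Solr.
      public void store(String url, byte[] content) throws Exception {
        Put put = new Put(Bytes.toBytes(url));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"), content);
        table.put(put);
      }

      // Retrieve the content for a URL returned by a Solr query.
      public byte[] fetch(String url) throws Exception {
        Result result = table.get(new Get(Bytes.toBytes(url)));
        return result.getValue(Bytes.toBytes("content"), Bytes.toBytes("raw"));
      }
    }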
Re: [ANN] Lucid Imagination
Hi Mark, Mark Miller wrote: Hey Renaud - in the future, it's probably best to direct Gaze questions (unless they directly relate to Solr) to supp...@lucidimagination.com. Right, I was not aware of this mailing list. Gaze is a tool that stores RequestHandler statistics averages (over small intervals) for long time ranges, and then lets you view graphs of that data, either in (basically) real-time or for specific time ranges. There is a Readme explaining the install included with the Gaze download. Gaze is pre-installed in the Lucid Imagination certified distribution of Solr, and in the Readme html file for that, you will find instructions on enabling Gaze (you uncomment the Gaze request handler in solrconfig.xml). Ok, the Gaze documentation can only be found in the distribution file. That is what I was trying to find (I was looking on the Lucid Imagination website). Gaze is implemented as a Solr RequestHandler plugin and an additional webapp. The RequestHandler plugin pings chosen request handlers every interval to collect RequestHandler statistics. This info is stored in RRD databases (this is done so that Gaze has a *very* minimal overhead - it's meant for production use). The webapp is an interface for selecting which RequestHandlers you want to be monitored and other settings, as well as graph views of the collected data. There are also some other little info tools that display server/jvm and index statistics. Gaze's features look quite nice and useful. Thanks for your reply, Regards -- Renaud Delbru
Re: [ANN] Lucid Imagination
Hi, I can't find any documentation about Solr Gaze. How can I use it? Thanks, Regards -- Renaud Delbru Grant Ingersoll wrote: Hi Lucene and Solr users, As some of you may know, Yonik, Erik, Sami, Mark and I teamed up with Marc Krellenstein to create a company to provide commercial support (with SLAs), training, value-add components and services to users of Lucene and Solr. We have been relatively quiet up until now as we prepare our offerings, but I am now pleased to announce the official launch of Lucid Imagination. You can find us at http://www.lucidimagination.com/ and learn more about us at http://www.lucidimagination.com/About/. We have also launched a beta search site dedicated to searching all things in the Lucene ecosystem: Lucene, Solr, Tika, Mahout, Nutch, Droids, etc. It's powered, of course, by Lucene via Solr (we'll provide details in a separate message later about our setup.) You can search the Lucene family of websites, wikis, mail archives and JIRA issues all in one place. To try it out, browse to http://www.lucidimagination.com/search/. Any and all feedback is welcome at f...@lucidimagination.com. Thanks, Grant -- Grant Ingersoll http://www.lucidimagination.com/
Re: Solr 1.3 Maven Artifact Problem
Hi, About the second point: it was my mistake (a source dependency problem in Eclipse). -- Renaud Delbru Renaud Delbru wrote: Hi, I am using the Solr 1.3 Maven artifacts from [1]. It seems that these artifacts are not correct. I have noticed that: 1) the solr-core artifact contains the org.apache.solr.client.solrj packages, and at the same time, the solr-core artifact depends on the solr-solrj artifact; 2) the source jar does not match the compiled classes: I found different method fingerprints in EmbeddedSolr and in CoreDescriptor. Has anyone encountered the same problem? [1] http://repo1.maven.org/maven2/org/apache/solr/ Regards,
Solr 1.3 Maven Artifact Problem
Hi, I am using the Solr 1.3 Maven artifacts from [1]. It seems that these artifacts are not correct. I have noticed that: 1) the solr-core artifact contains the org.apache.solr.client.solrj packages, and at the same time, the solr-core artifact depends on the solr-solrj artifact; 2) the source jar does not match the compiled classes: I found different method fingerprints in EmbeddedSolr and in CoreDescriptor. Has anyone encountered the same problem? [1] http://repo1.maven.org/maven2/org/apache/solr/ Regards, -- Renaud Delbru
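If the duplicated solrj classes cause classpath conflicts, one possible workaround (a sketch, not an official fix for the packaging issue) is to exclude the solr-solrj dependency and rely on the copy bundled inside solr-core:

    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-core</artifactId>
      <version>1.3.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.solr</groupId>
          <artifactId>solr-solrj</artifactId>
        </exclusion>
      </exclusions>
    </dependency>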
Re: Slow deleteById request
Hi, I think the reason was indeed maxPendingDeletes, which was configured to 1000. After having updated to a Solr nightly build with Lucene 2.4, the issue seems to have disappeared. Thanks for your advice. -- Renaud Delbru Mike Klaas wrote: On 1-Jul-08, at 10:44 PM, Chris Hostetter wrote: > > : Yes, updating to a newer version of nightly Solr build could solve > the > : problem, but I am a little afraid to do it since solr-trunk has > switched to > : lucene 2.4-dev. > > but did you check whether or not you have maxPendingDeletes > configured as > yonik asked? > > That would explain exactly what you are seeing ... after a certain > number > of deletes have passed, the next one would automatically force a > commit (and > a newSearcher) and (i believe) subsequent deletes would block until > the > commit is done ... which sounds like exactly what you describe. It shouldn't cause a commit, just a flushing of deletes. However, deletes count toward both maxDocs and maxTime for autocommit purposes, so that is the likely explanation. -Mike
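For reference, maxPendingDeletes is set on the update handler in solrconfig.xml; raising it (the value below is only illustrative) reduces how often buffered deletes are flushed:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- was 1000 in the configuration discussed above -->
      <maxPendingDeletes>100000</maxPendingDeletes>
    </updateHandler>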
Re: Slow deleteById request
Yonik Seeley wrote: I'd try the latest nightly solr build... it now lets Lucene manage the deletes. Yes, updating to a newer nightly Solr build could solve the problem, but I am a little afraid to do it, since solr-trunk has switched to Lucene 2.4-dev. Thanks for your answers, Yonik. -- Renaud Delbru
Re: Slow deleteById request
Hi Yonik, We are not sending a commit with a delete. It happens when using the following command: curl http://mydomain.net:8080/index/update -s -H 'Content-type:text/xml; charset=utf-8' -d "<delete><id>http://example.org/</id></delete>" or when using the SolrJ deleteById method (which does not execute a commit, as far as I know). The strange thing is that it is not always reproduced. Ten or so delete requests will be executed fast (in a few ms), then a batch of a few delete requests will take 10, 20 or even 30 seconds. By looking more precisely at the log, it seems that, in fact, the delete request triggers the opening of a new searcher, with its auto-warming. On a large index (our case), I heard that this can take quite some time. Anyway, I still don't have a precise explanation for this problem. It is not a big issue in our case, since it occurs for only a few requests and since other concurrent requests will be handled by the other searcher. -- Renaud Delbru Yonik Seeley wrote: That's very strange... are you sending a commit with the delete perhaps? If so, the whole request would block until a new searcher is registered. -Yonik On Tue, Jul 1, 2008 at 8:54 AM, Renaud Delbru <[EMAIL PROTECTED]> wrote: Hi, We experience very slow deletes, taking more than 10 seconds. A delete is executed using deleteById (from SolrJ or from curl) at the same time documents are being added. By looking at the log (below), it seems that a delete-by-ID request is only executed during the next commit (done automatically every 1000 added documents), and that the process (SolrJ or curl) executing the deleteById request is blocked until the commit is performed. Is it a normal behavior or a misconfiguration of our Solr server? Thanks in advance for insights. [11:32:02.840]autowarming result for [EMAIL PROTECTED] main [11:32:02.840] queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,cumulative_evictions=4289} [11:32:02.840]autowarming [EMAIL PROTECTED] main from [EMAIL PROTECTED] main [11:32:02.840] documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} [11:32:02.840]autowarming result for [EMAIL PROTECTED] main [11:32:02.840] documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} [11:32:02.840]Registered new searcher [EMAIL PROTECTED] main [11:32:02.840]{delete=[http://example.org/]} 0 14212 [11:32:02.840]webapp=/index path=/update params={wt=xml&version=2.2} status=0 QTime=14212 [11:32:02.840]DirectUpdateHandler2 deleting and removing dups for 217 ids [11:32:02.840]Closing Writer DirectUpdateHandler2 [11:32:02.842]Closing [EMAIL PROTECTED] main [11:32:02.842] filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0} [11:32:02.842] queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,cumulative_evictions=4289} [11:32:02.842] documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} [11:32:02.894]Opening [EMAIL PROTECTED] DirectUpdateHandler2 [11:32:03.566]DirectUpdateHandler2 docs deleted=0 [11:32:03.566]Closing [EMAIL PROTECTED] DirectUpdateHandler2 -- Renaud Delbru
Re: Slow deleteById request
A small clarification: we are using a nightly build of Solr 1.3 (one of the nightly builds just before the integration of Lucene 2.4). -- Renaud Delbru Renaud Delbru wrote: Hi, We experience very slow delete, taking more than 10 seconds. A delete is executed using deleteById (from Solrj or from curl), at the same time documents are being added. By looking at the log (below), it seems that a delete by ID request is only executed during the next commit (done automatically every 1000 added documents), and that the process (Solrj or curl) executing the deleteById request is blocked until the commit is performed. Is it a normal behavior or a misconfiguration of our Solr server ? Thanks in advance for insights. [11:32:02.840]autowarming result for [EMAIL PROTECTED] main [11:32:02.840] queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,cumulative_evictions=4289} [11:32:02.840]autowarming [EMAIL PROTECTED] main from [EMAIL PROTECTED] main [11:32:02.840] documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} [11:32:02.840]autowarming result for [EMAIL PROTECTED] main [11:32:02.840] documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} [11:32:02.840]Registered new searcher [EMAIL PROTECTED] main [11:32:02.840]{delete=[http://example.org/]} 0 14212 [11:32:02.840]webapp=/index path=/update params={wt=xml&version=2.2} status=0 QTime=14212 [11:32:02.840]DirectUpdateHandler2 deleting and removing dups for 217 ids [11:32:02.840]Closing Writer DirectUpdateHandler2 [11:32:02.842]Closing [EMAIL PROTECTED] main [11:32:02.842] filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0} [11:32:02.842] queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,cumulative_evictions=4289} [11:32:02.842] documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} [11:32:02.894]Opening [EMAIL PROTECTED] DirectUpdateHandler2 [11:32:03.566]DirectUpdateHandler2 docs deleted=0 [11:32:03.566]Closing [EMAIL PROTECTED] DirectUpdateHandler2
Slow deleteById request
Hi, We experience very slow delete, taking more than 10 seconds. A delete is executed using deleteById (from Solrj or from curl), at the same time documents are being added. By looking at the log (below), it seems that a delete by ID request is only executed during the next commit (done automatically every 1000 added documents), and that the process (Solrj or curl) executing the deleteById request is blocked until the commit is performed. Is it a normal behavior or a misconfiguration of our Solr server ? Thanks in advance for insights. [11:32:02.840]autowarming result for [EMAIL PROTECTED] main [11:32:02.840] queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,c umulative_evictions=4289} [11:32:02.840]autowarming [EMAIL PROTECTED] main from [EMAIL PROTECTED] main [11:32:02.840] documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} [11:32:02.840]autowarming result for [EMAIL PROTECTED] main [11:32:02.840] documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} [11:32:02.840]Registered new searcher [EMAIL PROTECTED] main [11:32:02.840]{delete=[http://example.org/]} 0 14212 [11:32:02.840]webapp=/index path=/update params={wt=xml&version=2.2} status=0 QTime=14212 [11:32:02.840]DirectUpdateHandler2 deleting and removing dups for 217 ids [11:32:02.840]Closing Writer DirectUpdateHandler2 [11:32:02.842]Closing [EMAIL PROTECTED] main [11:32:02.842] filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0} [11:32:02.842] queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,cumulative_evictions=4289} [11:32:02.842] documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} [11:32:02.894]Opening [EMAIL PROTECTED] DirectUpdateHandler2 [11:32:03.566]DirectUpdateHandler2 docs deleted=0 [11:32:03.566]Closing [EMAIL PROTECTED] DirectUpdateHandler2 -- Renaud Delbru
Re: How to get incrementPositionGap value from IndexSchema ?
Hi Chris, Thanks for your reply. Indeed, there is the getPositionIncrementGap method; I forgot about it. I need this information to be able to configure my query processor. I have extended Solr with a new query parser to be able to search documents at a sentence-based granularity. Each sentence is a fieldable instance of a field 'sentences', and I execute span queries to be able to match a boolean combination of terms at the sentence level, not the document level. I hope this explanation is clear and makes sense. Regards. Chris Hostetter wrote: : I am looking for a way to access the incrementPositionGap value defined for a : field type in the schema.xml. I think you mean "positionIncrementGap". It's a property of the <fieldtype> in schema.xml, but internally it's passed to SolrAnalyzer.setPositionIncrementGap. If you want to programmatically know what the "positionIncrementGap" is for any analyzer of any field or fieldtype, regardless of whether or not it's a SolrAnalyzer, just use Analyzer.getPositionIncrementGap(String fieldName), ie: myFieldType.getAnalyzer().getPositionIncrementGap(myFieldName) If you don't mind me asking: why do you want/need this information in your custom code? -Hoss -- Renaud Delbru, E.C.S., Ph.D. Student, Semantic Information Systems and Language Engineering Group (SmILE), Digital Enterprise Research Institute, National University of Ireland, Galway. http://smile.deri.ie/
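A short sketch of that lookup from inside a Solr plugin, using the 'sentences' field name from this thread and assuming req is the current SolrQueryRequest:

    // Resolve the gap configured for the 'sentences' field in schema.xml.
    IndexSchema schema = req.getSchema();
    SchemaField field = schema.getField("sentences");
    int gap = field.getType().getAnalyzer().getPositionIncrementGap("sentences");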
How to get incrementPositionGap value from IndexSchema ?
Hi, I am looking for a way to access the incrementPositionGap value defined for a field type in the schema.xml. There is a getArgs method in the FieldType class, but it is protected and I am not able to access it. Is there another solution? Regards. -- Renaud Delbru, E.C.S., Ph.D. Student, Semantic Information Systems and Language Engineering Group (SmILE), Digital Enterprise Research Institute, National University of Ireland, Galway. http://smile.deri.ie/
Re: SpanQuery support
Very nice, I will try this approach. Thanks, Yonik. Yonik Seeley wrote: On Feb 4, 2008 11:53 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote: You could, but that would be the hard way (by a big margin). There are pluggable query parsers now (see QParserPlugin)... but the current missing piece is being able to specify a new parser plugin from solrconfig.xml Hmmm, it appears I forgot what I implemented already ;-) Support for adding new parser plugins from solrconfig.xml already exists (and I just added a test). So add something like the following to your solrconfig.xml: <queryParser name="foo" class="FooQParserPlugin"/> And then implement FooQParserPlugin in Java to create your desired query structures (span queries or whatever). See other implementations of QParserPlugin in Solr for guidance. To use your "foo" parser, set it to the default query type by adding defType="foo" to the request (or to the defaults for your handler). You can also override the current query type via q={!foo}my query -Yonik
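To make the plugin route concrete, a minimal sketch of what FooQParserPlugin might look like against the Solr 1.3 API. The field name and slop are arbitrary: this naive parser just turns whitespace-separated terms into an ordered SpanNearQuery, not a full nested-span syntax.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.search.QParser;
    import org.apache.solr.search.QParserPlugin;

    public class FooQParserPlugin extends QParserPlugin {
      public void init(NamedList args) {}

      public QParser createParser(String qstr, SolrParams localParams,
                                  SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
          public Query parse() throws ParseException {
            // Field defaults to "text"; overridable via the f local param.
            String field = localParams == null ? "text" : localParams.get("f", "text");
            String[] terms = qstr.split("\\s+");
            SpanQuery[] clauses = new SpanQuery[terms.length];
            for (int i = 0; i < terms.length; i++) {
              clauses[i] = new SpanTermQuery(new Term(field, terms[i]));
            }
            // Ordered span-near with a slop of 5 positions.
            return new SpanNearQuery(clauses, 5, true);
          }
        };
      }
    }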
Re: Querying multiple dynamicField
The idea was to keep a certain number of lines (or sentences) in a document separate, without using the position-gap trick between field instances. I found the use of multiple dynamic fields to be a cleaner and more generic approach. By using copyField, I duplicate data inside the index, but I also lose the line distinction. I think the addition of wildcards in the field name would be a good addition to the Solr features. This would give us the ability to query only a certain "type" of dynamic field (typeA_*, typeB_*, etc.). Regards. Lance Norskog wrote: You can use the <copyField> directive to copy all 'sentence_*' fields into one indexed field. You then have a named field that you can search against. Lance Norskog -Original Message- From: Renaud Delbru [mailto:[EMAIL PROTECTED] Sent: Friday, February 01, 2008 6:48 PM To: solr-user@lucene.apache.org Subject: Querying multiple dynamicField Hi, We would like to know if there is an efficient way to query multiple dynamicFields at the same time, using a wildcard in the field name. For example, we have a list of dynamic fields "sentence_*" and we would like to execute a query on all the "sentence_*" fields. Is there a way to execute such queries on Solr 1.3 / Lucene 2.3? Regards. -- Renaud Delbru
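For reference, Lance's copyField suggestion looks like this in schema.xml (the destination field name is made up for the example):

    <!-- collapse all sentence_* dynamic fields into one searchable field -->
    <copyField source="sentence_*" dest="sentences_all"/>
    <field name="sentences_all" type="text" indexed="true" stored="false" multiValued="true"/>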
Re: SpanQuery support
Yonik Seeley wrote: On Feb 2, 2008 3:43 PM, Renaud Delbru <[EMAIL PROTECTED]> wrote: I was looking at the discussion of SOLR-281. If I understand correctly, the task would be to write my own search component class, SpanQueryComponent, that extends the SearchComponent class, and then override the declaration of the "query" searchComponent in solrconfig.xml: <searchComponent name="query" class="SpanQueryComponent"/> Then, I would be able to use my own query syntax and query component? Is that correct? You could, but that would be the hard way (by a big margin). There are pluggable query parsers now (see QParserPlugin)... but the current missing piece is being able to specify a new parser plugin from solrconfig.xml -Yonik I have looked at MoreLikeThisHandler.java. I saw that all the MoreLikeThis logic is defined inside the handler and through the inner class MoreLikeThisHelper. Could I follow the same approach and define a ProximityHandler class that executes Lucene SpanQueries based on some request parameters? Is that the right way to do it? Regards. -- Renaud Delbru, E.C.S., Ph.D. Student, Semantic Information Systems and Language Engineering Group (SmILE), Digital Enterprise Research Institute, National University of Ireland, Galway. http://smile.deri.ie/
Re: SpanQuery support
Hi Yonik, Yonik Seeley wrote: On Feb 2, 2008 3:43 PM, Renaud Delbru <[EMAIL PROTECTED]> wrote: I was looking at the discussion of SOLR-281. If I understand correctly, the task would be to write my own search component class, SpanQueryComponent, that extends the SearchComponent class, and then override the declaration of the "query" searchComponent in solrconfig.xml: <searchComponent name="query" class="SpanQueryComponent"/> Then, I would be able to use my own query syntax and query component? Is that correct? You could, but that would be the hard way (by a big margin). There are pluggable query parsers now (see QParserPlugin)... but the current missing piece is being able to specify a new parser plugin from solrconfig.xml -Yonik Hum, I would prefer to follow the easiest way ;o). Could you briefly explain the easiest way? And give me some hints on which classes I need to extend to achieve my goal? Regards. -- Renaud Delbru, E.C.S., Ph.D. Student, Semantic Information Systems and Language Engineering Group (SmILE), Digital Enterprise Research Institute, National University of Ireland, Galway. http://smile.deri.ie/
Re: SpanQuery support
Thanks Yonik, I was looking at the discussion of SOLR-281. If I understand correctly, the task would be to write my own search component class, SpanQueryComponent, that extends the SearchComponent class, and then override the declaration of the "query" searchComponent in solrconfig.xml: <searchComponent name="query" class="SpanQueryComponent"/> Then, I would be able to use my own query syntax and query component? Is that correct? Regards. Yonik Seeley wrote: Solr 1.3 will have query parser plugins... so you could write your own parser that utilized span queries. -Yonik On Feb 2, 2008 2:48 PM, Renaud Delbru <[EMAIL PROTECTED]> wrote: Do you know if it is currently possible to use the SpanQuery feature of Lucene in Solr 1.3? We would like to use nested span queries such as (("A B") near ("C D")). Does a request handler support such a feature? Or, any idea how we could proceed? -- Renaud Delbru, E.C.S., Ph.D. Student, Semantic Information Systems and Language Engineering Group (SmILE), Digital Enterprise Research Institute, National University of Ireland, Galway. http://smile.deri.ie/
SpanQuery support
Hi, Do you know if it is currently possible to use the SpanQuery feature of Lucene in Solr 1.3? We would like to use nested span queries such as (("A B") near ("C D")). Does a request handler support such a feature? Or, any idea how we could proceed? Regards. -- Renaud Delbru
Querying multiple dynamicField
Hi, We would like to know if there is an efficient way to query multiple dynamicFields at the same time, using a wildcard in the field name. For example, we have a list of dynamic fields "sentence_*" and we would like to execute a query on all the "sentence_*" fields. Is there a way to execute such queries on Solr 1.3 / Lucene 2.3? Regards. -- Renaud Delbru
Re: LSA Implementation
LDA (Latent Dirichlet Allocation) is a similar technique that extends pLSI. You can find some implementations in C++ and Java on the Web. Grant Ingersoll wrote: Interesting. I am not a lawyer, but my understanding has always been that this is not something we could do. The question has come up from time to time on the Lucene mailing list: http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=Latent+Semantic&search_type=AND That being said, there may be other approaches that do similar things that aren't covered by a patent, I don't know. Is there something specific you want to do, or are you just going by the promise of better results using LSI? I suppose if someone said they had a patch for Lucene/Solr that implemented it, we could ask on legal-discuss for advice. -Grant On Nov 26, 2007, at 1:13 PM, Eswar K wrote: I was just searching for info on LSA and came across a Semantic Indexing project under the GNU license... which of course is still under development in C++ though. - Eswar On Nov 26, 2007 9:56 PM, Jack <[EMAIL PROTECTED]> wrote: Interesting. Patents are valid for 20 years, so it expires next year? :) PLSA does not seem to have been patented, at least it is not mentioned in http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis On Nov 26, 2007 6:58 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is patented, so it is not likely to happen unless the authors donate the patent to the ASF. -Grant On Nov 26, 2007, at 8:23 AM, Eswar K wrote: All, Is there any plan to implement Latent Semantic Analysis as part of Solr anytime in the near future? Regards, Eswar -- Grant Ingersoll http://lucene.grantingersoll.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ -- Grant Ingersoll http://lucene.grantingersoll.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ -- Renaud Delbru, E.C.S., M.Sc. Student, Semantic Information Systems and Language Engineering Group (SmILE), Digital Enterprise Research Institute, National University of Ireland, Galway. http://smile.deri.ie/