Re: How to avoid underscore sign indexing problem?
After trying some search cases and different parameter combinations of WordDelimiterFilter, I wonder what the best strategy is to index the string 2DA012_ISO MARK 2 so that it can be found by the term 2DA012. What if I just want _ to be removed at both query and index time: what should I configure, and how? Floyd

2013/8/22 Floyd Wu floyd...@gmail.com: Thank you all. By the way, Jack, I'm going to buy your book. Where can I buy it? Floyd

2013/8/22 Jack Krupansky j...@basetechnology.com: "I thought that the StandardTokenizer always split on punctuation" - proving that you haven't read my book! The section on the standard tokenizer details the rules that the tokenizer uses (in addition to extensive examples). That's what I mean by deep dive. -- Jack Krupansky

-Original Message- From: Shawn Heisey Sent: Wednesday, August 21, 2013 10:41 PM To: solr-user@lucene.apache.org Subject: Re: How to avoid underscore sign indexing problem?

On 8/21/2013 7:54 PM, Floyd Wu wrote: When using StandardAnalyzer to tokenize the string Pacific_Rim, the analysis page shows a single token: text pacific_rim, raw bytes [70 61 63 69 66 69 63 5f 72 69 6d], start 0, end 11, type ALPHANUM, position 1. How can this string be tokenized into the two tokens Pacific and Rim? Should I set _ as a stopword? Please kindly help with this. Many thanks.

Interesting. I thought that the StandardTokenizer always split on punctuation, but apparently that's not the case for the underscore character. You can always use the WordDelimiterFilter after the StandardTokenizer. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory Thanks, Shawn
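For reference, a field type along these lines should produce both behaviors described above. This is only a sketch: the type name is made up, and the exact WordDelimiterFilter flags depend on whether 2DA012_ISO should also stay searchable as typed.

<fieldType name="text_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Split on the underscore that StandardTokenizer leaves intact.
         splitOnNumerics="0" keeps 2DA012 together as one part;
         preserveOriginal="1" keeps 2DA012_ISO searchable as typed. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnNumerics="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

If the goal is simply to remove _ at both index and query time, a solr.PatternReplaceCharFilterFactory with pattern="_" and replacement=" " placed before the tokenizer, in a single analyzer used for both phases, achieves that without any filter tuning.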
Re: Data Import failed in solr 4.3.0
Thanks for the suggestion, but in our view re-indexing all the data every time is not the right approach when we migrate Solr from an older to the latest version. Solr needs to provide a solution for this, because re-indexing 50 lakh (5 million) documents is not an easy job. We want to know whether there is any way in Solr to do this easily. Thanks Regards Montu v Boda
Re: Solr Indexing Status
I am not using DIH for indexing CSV files; I'm pushing data through SolrJ code. But I want a status like what DIH gives, i.e. fire a command=status request and get the response. Is anything like that available for any type of file indexing done through the API?

On Thu, Aug 22, 2013 at 12:09 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Yes, you can invoke http://host:port/solr/dataimport?command=status which will return how many Solr docs have been added etc.

On Wed, Aug 21, 2013 at 4:56 PM, Prasi S prasi1...@gmail.com wrote: Hi, I am using Solr 4.4 to index CSV files. I am using SolrJ for this. At frequent intervals my user may request the status. I have to send back something like DIH gives: "Indexing in progress. Added xxx documents." Is there anything like DIH's command=status to get the status of indexing for files? Thanks, Prasi

-- Regards, Shalin Shekhar Mangar.
relation between optimize and merge
Hi All, I have some difficulty understanding the relation between optimize and merge. Can anyone give some tips about the difference? Regards
when does RAMBufferSize work when commit.
Hi all, About RAMBufferSize and commit, I have read the doc: http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/60544 but I cannot figure out how they work together. Given the settings:

<ramBufferSizeMB>10</ramBufferSizeMB>
<autoCommit>
  <maxTime>${solr.autoCommit.maxDocs:1000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

If the number of indexed docs reaches 1000 and the size of these docs is below 10MB, it will trigger a commit. If the size of the indexed docs reaches 10MB while the count is still below 1000, it will not trigger a commit; the indexed docs will just be flushed to disk, and a commit will only happen when the count reaches 1000? Are these two scenarios right? Regards
SOLUTION: Clusterstate says state:recovering, but Core says I see state: null?
Aliasing instead of swapping removed this problem! DO NOT USE SWAP WHEN IN CLOUD MODE (solr 4.3)
SolrCmdDistributor may not be threadsafe...
I have been running DIH imports (15,000,000 rows) all day, and every now and then I get some weird errors. Some examples:

A letter is replaced by an unknown character (it should have been a 'V'):

285680 [Thread-20] ERROR org.apache.solr.update.SolrCmdDistributor - shard update error StdNode: http://10.231.188.127:8080/solr/kunde0/ :org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: undefined field: KUNDE_ETTERNA?N
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
  at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:401)
  at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:375)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)

938360 [Thread-59] ERROR org.apache.solr.update.SolrCmdDistributor - shard update error StdNode: http://10.231.188.186:8080/solr/kunde0/ :org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Unexpected character 'l' (code 108) in start tag. Expected a quote at [row,col {unknown-source}]: [1,2188]
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
  at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:401)
  at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:375)
  ...

1379931 [Thread-22] ERROR org.apache.solr.update.SolrCmdDistributor - shard update error StdNode: http://10.231.188.186:8080/solr/kunde0/ :org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Unexpected character '0' (code 48) in start tag. Expected a quote

2546924 [Thread-79] ERROR org.apache.solr.update.SolrCmdDistributor - shard update error StdNode: http://10.231.188.127:8080/solr/kunde0/ :org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Unexpected character '0' (code 48) in content after '<' (malformed start element?). at [row,col {unknown-source}]: [1,6333]
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
  at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:401)
  at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:375)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)

I'm running on jdk1.7.0_21, Solr 4.4.0 (build 1504776), with 3 nodes. Has anyone seen this before?
Re: Solr Indexing Status
You can use the /admin/mbeans handler to get all system stats. You can find stats such as adds and cumulative_adds under the update handler section. http://localhost:8983/solr/collection1/admin/mbeans?stats=true -- Regards, Shalin Shekhar Mangar.
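For what it's worth, the relevant fragment of that mbeans response looks roughly like this (the values here are invented; the stat names are the ones mentioned above, under the update handler's stats section):

<lst name="solr-mbeans">
  ...
  <lst name="UPDATEHANDLER">
    <lst name="updateHandler">
      <lst name="stats">
        <long name="adds">128</long>
        <long name="cumulative_adds">54321</long>
        <long name="commits">12</long>
        ...
      </lst>
    </lst>
  </lst>
</lst>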
Re: Data Import failed in solr 4.3.0
No one is asking you to re-index data. The Solr 3.5 index can be read and written by a Solr 4.x installation. -- Regards, Shalin Shekhar Mangar.
DIH not proceeding after a few million rows
Hi, I'm using DIH to index data into Solr, version 4.4. Indexing proceeds normally in the beginning. I have some 10 data-config files: file1 - select * from table where id between 1 and 100; file2 - select * from table where id between 100 and 300; and so on. The first 4 batches go normally. For the fifth batch, I get the status from the Admin page (Dataimport) as *Duration: 2 hrs*, Indexed: 0 documents; deleted: 0 documents. And indexing stops, but no documents were indexed. I use a single external ZooKeeper for this. I don't see any exception in the Solr logs, and in ZooKeeper the status is below:

INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@627] - Got user-level KeeperException when processing sessionid:0x140a4ce824b0005 type:create cxid:0x29a zxid:0x157d txntype:-1 reqpath:n/a Error P

Any ideas?
Re: Solr Indexing Status
Thanks much. This was useful.
Re: Flushing cache without restarting everything?
But is it really good benchmarking if you flush the cache? Wouldn't you want to benchmark against a system that is comparable to what is under real (= production) load? Dmitry

On Tue, Aug 20, 2013 at 9:39 PM, Jean-Sebastien Vachon jean-sebastien.vac...@wantedanalytics.com wrote: I just want to run benchmarks and want to have the same starting conditions.

-Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: August-20-13 2:06 PM To: solr-user@lucene.apache.org Subject: Re: Flushing cache without restarting everything? Why? What are you trying to achieve with this? --wunder

On Aug 20, 2013, at 11:04 AM, Jean-Sebastien Vachon wrote: Hi All, Is there a way to flush the cache of all nodes in a Solr Cloud (by reloading all the cores, through the collection API, ...) without having to restart all nodes? Thanks
Re: Data Import failed in solr 4.3.0
Thanks. Actually the problem is that we migrated the Solr 1.4 index data to Solr 3.5 using the replication feature of Solr 3.5, so whatever data we have in Solr 3.5 is still in the Solr 1.4 format. So I do not think it will work in Solr 4.x. Please suggest your view based on the above point. Thanks Regards Montu v Boda
Re: Solr 4.2.1 update to 4.3/4.4 problem
Hello All, I am also facing a similar issue. I am using Solr 4.3. Following is the configuration I have in schema.xml:

<fieldType name="string_lower_case" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="string_id_itm" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

My requirement is that any string I give during search should be treated as a single string and matched case-insensitively. I have strings like first name and last name (for these I am using string_lower_case), and strings with the special character '/' (for these I am using string_id_itm). But I am not getting the results I expect. The first field type should also accept strings with spaces and give me results, but it doesn't, and the second field type doesn't work at all. Examples of field values: John Smith (for field type 1), MB56789/A (for field type 2). Please help.

vehovmar wrote: Thanks a lot for both replies. Helped me a lot. It seems that EdgeNGramFilterFactory on the query analyzer was really my problem; I'll have to test it a little more to be sure. As for the bf parameter, I think it's quite fine as it is. From the documentation: the bf parameter actually takes a list of function queries separated by whitespace, each with an optional boost. Example: bf=ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3. And I'm using the field function, example syntax: myFloatField or field(myFloatField). Thanks again to both of you guys!
Re: Facing Solr performance during query search
On Wed, 2013-08-21 at 10:09 +0200, sivaprasad wrote: The slave will poll for every 1hr. And are there normally changes? We have configured ~2000 facets and the machine configuration is given below. I assume that you only request a subset of those facets at a time. How much RAM does your machine have? How large is your index in GB? How many documents do you have in your index? As you are not explicitly warming your facets and since you have a lot of them, my guess is that you're performing initializing facet calls all the time. If the slave only has 32GB of RAM (and thus only about 10GB for disk cache) and if your index is substantially larger than that, the initialization will require a lot of non-cached disk access. Try disabling the slave polling, then send 1000 queries and then re-send the exact same 1000 queries. Are the response times satisfactory the second time? If so, you should consider warming your facets and/or try to come up with a solution where you don't have so many of them. https://sbdevel.wordpress.com/2013/04/16/you-are-faceting-itwrong/ - Toke Eskildsen, State and University Library, Denmark
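If warming turns out to help, the common facet calls can be pre-executed with a query listener in solrconfig.xml. A minimal sketch, with a placeholder field name:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- One entry per commonly requested facet field. -->
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
    </lst>
  </arr>
</listener>

The same block under event="newSearcher" re-warms the caches after each commit.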
Re: relation between optimize and merge
optimize is an explicit request to perform a merge. Merges occur in the background, automatically, as needed or indicated by the parameters of the merge policy. An optimize is requested from outside of Solr. -- Jack Krupansky
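Those merge-policy parameters live in the indexConfig section of solrconfig.xml; a typical Solr 4.x sketch (the values are illustrative, not recommendations):

<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- How many segments may be merged at once, and how many segments
         a tier may hold before a background merge is triggered. -->
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
</indexConfig>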
Re: Data Import failed in solr 4.3.0
Call optimize on your Solr 3.5 server which will write a new index segment in v3.5 format. Such an index should be read in Solr 4.x without any problem. -- Regards, Shalin Shekhar Mangar.
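Concretely, the optimize can be issued as a one-line XML update message (host, port, and core path assumed):

<!-- POST to http://host:port/solr/update on the 3.5 server.
     Forces a full merge, rewriting every segment in the running
     version's index format. -->
<optimize waitSearcher="true"/>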
RE: Flushing cache without restarting everything?
How can you validate that the changes you just made had any impact on the performance of the cloud if you don't have the same starting conditions? What we do, basically, is run a batch of requests to warm up the index and then launch the benchmark itself. That way we can measure the impact of our change(s). Otherwise there is absolutely no way to be sure what is responsible for the gain or loss of performance. Restarting a cloud is actually a real pain; I just want to know if there is a faster way to proceed.
Re: when does RAMBufferSize work when commit.
On 8/22/2013 2:25 AM, YouPeng Yang wrote, giving these settings:

<ramBufferSizeMB>10</ramBufferSizeMB>
<autoCommit>
  <maxTime>${solr.autoCommit.maxDocs:1000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

Your actual config seems to have its wires crossed a little bit. You have the autoCommit.maxDocs value being used in a maxTime tag, not a maxDocs tag. You may want to adjust the variable name or the tag. If that were a maxDocs tag instead of maxTime, your description would be pretty much right on the money. The space taken in the RAM buffer is typically larger than the actual document size, but the general idea is sound. The default for RAMBufferSizeMB in recent Solr versions is 100. Unless you've got super small documents, or you are in a limited memory situation and have a lot of cores, I would not go smaller than that. Thanks, Shawn
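In other words, the setting the poster probably intended looks like this (a sketch that keeps the original property name):

<ramBufferSizeMB>100</ramBufferSizeMB>
<autoCommit>
  <!-- maxDocs, not maxTime, makes the 1000-document trigger real. -->
  <maxDocs>${solr.autoCommit.maxDocs:1000}</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>

With that, a hard commit fires after 1000 added documents even if the RAM buffer has already flushed segments to disk on its own.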
Re: Flushing cache without restarting everything?
On Tue, 2013-08-20 at 20:04 +0200, Jean-Sebastien Vachon wrote: Is there a way to flush the cache of all nodes in a Solr Cloud (by reloading all the cores, through the collection API, ...) without having to restart all nodes?

As MMapDirectory shares data with the OS disk cache, flushing of Solr-related caches on a machine should involve

1) Shut down all Solr instances on the machine
2) Clear the OS read cache ('sudo echo 1 > /proc/sys/vm/drop_caches' on a Linux box)
3) Start the Solr instances

I do not know of any Solr-supported way to do step 2. For our performance tests we use custom scripts to perform the steps.

- Toke Eskildsen, State and University Library, Denmark
Adding one core to an existing core?
Dear Users, (Solr 3.6 + Tomcat 7) I have been using Solr with one core for two years; I would now like to add another core (a new database). Can I do this without re-indexing my core1? Could you point me to a good tutorial for that? (My current database is around 200GB for 86,000,000 docs.) My new database will be small, around 1000 documents of 5KB each. Thanks a lot, Bruno
Re: Adding one core to an existing core?
One small clarification: I'm on Ubuntu 12.04 LTS.
Re: Adding one core to an existing core?
First, a core is a separate index, so it is completely independent from the already existing core(s). So basically you don't need to re-index. In order to have two cores (the same applies for n cores): you must have in your solr.home the file (solr.xml) described here: http://wiki.apache.org/solr/Solr.xml%20%28supported%20through%204.x%29 Then you must obviously have one or two directories (corresponding to the instanceDir attribute). I said one or two because if the index configuration is basically the same (or something changes but is dynamically configured, i.e. the core name) you can create two instances starting from the same configuration. I mean:

<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="conf.dir" />
    <core name="core1" instanceDir="conf.dir" />
  </cores>
</solr>

Otherwise you must have two different conf directories that contain the index configurations. You should already have a first one (the current core); you just need another conf dir with solrconfig.xml, schema.xml and the other required files. In this case each core will have its own instanceDir:

<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="conf.dir.core0" />
    <core name="core1" instanceDir="conf.dir.core1" />
  </cores>
</solr>

Best, Andrea
How to access latitude and longitude with only LatLonType?
Hello All, I am currently doing a spatial query in Solr. I indexed coordinates (type="location" class="solr.LatLonType"), but the following query failed:

http://localhost/solr/quan/select?q=*:*&stats=true&stats.field=coordinates&stats.facet=township&rows=0

It showed an error: Field type location{class=org.apache.solr.schema.SpatialRecursivePrefixTreeFieldType,analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,args={distErrPct=0.025, class=solr.SpatialRecursivePrefixTreeFieldType, maxDistErr=0.09, units=degrees}} is not currently supported

I don't want to create duplicate indexed fields latitude and longitude. How can I use only coordinates to do this kind of stats on both latitude and longitude? Thanks, Quan
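One possibility, assuming the schema follows the stock Solr 4.x example: LatLonType already splits each point into two dynamic subfields, so stats can target those directly, with no duplicate fields to maintain.

<!-- Stock example-schema declarations (assumed, not from the post): -->
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>

<!-- With these, the latitude and longitude of "coordinates" are indexed as
     coordinates_0_coordinate and coordinates_1_coordinate, so the stats
     request becomes:
     stats.field=coordinates_0_coordinate&stats.field=coordinates_1_coordinate -->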
dataimporter tika fields empty
I'm trying to index an HTML page and only use the div with id="content". Unfortunately nothing is working within the tika-entity; only the standard text (content) is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without and just using text? Or should I use the update extractor?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title" />
      <field column="id" xpath="//id" />
      <field column="file" xpath="//file" />
      <field column="path" xpath="//path" />
      <field column="url" xpath="//url" />
      <field column="Author" xpath="//author" />
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test" /> -->
        <field column="text_test" xpath="//div[@id='content']" />
      </entity>
    </entity>
  </document>
</dataConfig>

docImporterUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</url>
  </doc>
</docs>
Re: dataimporter tika fields empty
Can you try the SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
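Applied to the config in the previous message, the switch sits on the Tika entity itself. A sketch (the format="html" attribute is an assumption about what pairs sensibly with the identity mapper, not something stated in the issue):

<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.path}${rec.file}" dataSource="dataUrl"
        format="html" htmlMapper="identity">
  <field column="text" name="text_test"/>
</entity>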
UpdateProcessor not working with DIH, but works with SolrJ
I have an updateProcessor defined. It seems to work perfectly when I index with SolrJ, but when I use DIH (which I do for a full index rebuild), it doesn't work. This is the case with both Solr 4.4 and Solr 4.5-SNAPSHOT, svn revision 1516342. Here's a solrconfig.xml excerpt:

<updateRequestProcessorChain name="nohtml">
  <!-- First pass converts entities and strips html. -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">ft_text</str>
    <str name="fieldName">ft_subject</str>
    <str name="fieldName">keywords</str>
    <str name="fieldName">text_preview</str>
  </processor>
  <!-- Second pass fixes dually-encoded stuff. -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">ft_text</str>
    <str name="fieldName">ft_subject</str>
    <str name="fieldName">keywords</str>
    <str name="fieldName">text_preview</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">nohtml</str>
  </lst>
</requestHandler>

If I turn on DEBUG logging for FieldMutatingUpdateProcessorFactory, I see "replace value" debugs, but the contents of the index are only changed if the update happens with SolrJ, not with DIH.

A side issue: FieldMutatingUpdateProcessorFactory has the following line in it, at about line 72:

if (destVal != srcVal) {

Shouldn't this be the following?

if (destVal.equals(srcVal)) {

Thanks, Shawn
Re: UpdateProcessor not working with DIH, but works with SolrJ
You should declare this <str name="update.chain">nohtml</str> in the defaults section of the RequestHandler that corresponds to your DataImportHandler. You should have something like this:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dih-config.xml</str>
    <str name="update.chain">nohtml</str>
  </lst>
</requestHandler>

Otherwise the default update chain will be called (and your URPs are not part of that). SolrJ, behind the scenes, is a client of the /update request handler; that's the reason why using that you can see your URPs working. Best, Gazza
Re: UpdateProcessor not working with DIH, but works with SolrJ
This results in an error parsing the config, so my cores won't start up. I saw another message via Google that talked about using update.processor instead of update.chain, so I tried that as well, with no luck. Can I ask DIH to use the /update handler that I have declared already? Thanks, Shawn
Re: UpdateProcessor not working with DIH, but works with SolrJ
You could declare your update chain as the default by adding default="true" to its declaring element:

<updateRequestProcessorChain name="nohtml" default="true">

and then you wouldn't need to declare it as the default update.chain in either of your request handlers.
Re: UpdateProcessor not working with DIH, but works with SolrJ
Yes, yes, of course you should use your already declared request handler; that was just a copied-and-pasted example :) I'm curious about what kind of error you got; I copied the snippet above from a working core (just replaced the name of the chain). BTW: AFAIK it is update.processor that has been deprecated in favor of update.chain, so this shouldn't be the problem. Best, Gazza
Re: UpdateProcessor not working with DIH, but works with SolrJ
If I did this, would it only apply the HTML processor to the fields that I have specified in those XML sections? I haven't thought through the implications, but I think it might be OK. Thanks, Shawn
Re: dataimporter tika fields empty
I put it in the tika-entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.
Re: UpdateProcessor not working with DIH, but works with SolrJ
Here's the full exception. I use xinclude heavily in my solrconfig.xml; the xinclude directives are actually almost the only thing that's in solrconfig.xml. http://apaste.info/7PB0 I'm going to try setting my update processor to default as recommended by Steve Rowe. Thanks, Shawn
Re: UpdateProcessor not working with DIH, but works with SolrJ
Ok, found it:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dih-config.xml</str>
    <str name="update.chain">nohtml</str>
  </lst>
</requestHandler>

Of course, my mistake... when I changed the name of the chain I deleted the character. Sorry.
Schema
Hello there, I have installed Solr and it's working fine on localhost. I have indexed the example files shipped with solr-4.4.0 (CSV and XML). Now I want to index a MySQL database for a Django project, search it from the user end, and also implement more features. What should I do?
Re: Schema
Now use DIH to get the data from the MySQL database into Solr: http://wiki.apache.org/solr/DataImportHandler You need to define the field mapping (between MySQL and the Solr document) in data-config.xml.
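A minimal data-config.xml sketch for a MySQL source; the table, column, and credential values are placeholders:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="dbuser" password="dbpass"/>
  <document>
    <entity name="item" query="SELECT id, name, description FROM item">
      <!-- Each column maps to a field declared in schema.xml. -->
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="description" name="description"/>
    </entity>
  </document>
</dataConfig>

The MySQL JDBC driver jar must also be on Solr's classpath (for example in the core's lib directory).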
RE: Flushing cache without restarting everything?
I was afraid someone would tell me that... thanks for your input.
Solr Ref guide question
Hi all, I think there is a gap in Solr's ref guide. The "Running Solr" section says to run Solr using the command: $ java -jar start.jar But if I do this with a fresh install, I get a stack trace like this: http://pastebin.com/5YRRccTx Is this behavior expected? - Best regards
Re: dataimporter tika fields empty
I can do it like this, but then the content isn't copied to text; it's just in text_test:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
  <field column="text" name="text_test" />
  <copyField source="text_test" dest="text" />
</entity>
Re: How to SOLR file in svn repository
I don't think there's a Solr-SVN connector available out of the box. You can write a custom SolrJ indexer program to get the necessary data from SVN (using the Java API) and add the data to Solr.
Re: Schema
On Thu, Aug 22, 2013 at 10:56 PM, SolrLover wrote: Now use DIH to get the data from the MySQL database into Solr: http://wiki.apache.org/solr/DataImportHandler

These are for versions 1.3, 1.4, 3.6 or 4.0. Why are those versions mentioned there? Doesn't it work on Solr 4.4.0? -- Kamaljeet Kaur kamalkaur188.wordpress.com facebook.com/kaur.188
Re: Solr 4.2.1 update to 4.3/4.4 problem
Your first problem is that the terms aren't getting to the field analysis chain as a unit. If you attach debug=query to your query and say you're searching lastName:(ogden erickson), you'll see something like lastName:ogden lastName:erickson when what you want is lastName:"ogden erickson" (note, this is the _parsed_ query, not the input string!). So try escaping the space, as in lastName:ogden\ erickson.

As for the second problem, _how_ is it not working at all? You're breaking up the input into separate tokens, which you say you don't want to do. If you really want all your names to be treated as strings while just ignoring, say, the '/', take a look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceCharFilterFactory and use it with your first type. Best, Erick
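A sketch of that suggestion applied to the poster's first field type (untested, with the pattern assumed from the MB56789/A example):

<fieldType name="string_lower_case" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Drop '/' before tokenizing, so MB56789/A and MB56789A normalize
         to the same single lowercase token at index and query time. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="/" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>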
Re: Flushing cache without restarting everything?
We warm the file buffers before starting Solr to avoid spending time waiting for disk IO. The script is something like this:

for core in core1 core2 core3
do
  find /apps/solr/data/${core}/index -type f | xargs cat > /dev/null
done

It makes a big difference in the first few minutes of service. Of course, it helps if you have enough RAM to hold the entire index. wunder

-- Walter Underwood wun...@wunderwood.org
Re: How to SOLR file in svn repository
After you connect to Subversion, you'll need parsers for code, etc. You might want to try Krugle instead, since they have already written all that stuff: http://krugle.org/ wunder
Highlighting and proximity search
Hello, I am dealing with an issue in highlighting, and so far the other posts that I've read have not provided a solution. When using a proximity search (coming soon~10) I get some documents with no highlights, and some documents highlight these words even when they are not within a 10-word proximity. Some more configuration details are below; any help is much appreciated. We are running Solr version 4.4.0.

Full example query:

hl.fragsize=0&hl.requireFieldMatch=true&sort=document_date_range+desc&hl.fragListBuilder=single&hl.fragmentsBuilder=colored&hl=true&version=2.2&rows=80&hl.highlightMultiTerm=true&df=text&hl.useFastVectorHighlighter=true&start=0&q=(text:(coming+soon~10))&hl.usePhraseHighlighter=true

Configuration of the field being queried:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern='([\-]{2,})|([\s\.\?\!,:;\“\”])'/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" catenateAll="1" catenateNumbers="0" catenateWords="1" generateNumberParts="1" generateWordParts="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern='([\-]{2,})|([\s\.\?\!,:;\“\”])'/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" catenateAll="0" catenateNumbers="0" catenateWords="1" generateNumberParts="1" generateWordParts="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Configuration of the highlighter in solrconfig.xml:

<fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/>
<fragmentsBuilder name="colored" class="solr.highlight.ScoreOrderFragmentsBuilder">
  <lst name="defaults">
    <str name="hl.tag.pre"><![CDATA[
      <em style="background:yellow">,<em style="background:lawngreen">,
      <em style="background:aquamarine">,<em style="background:magenta">,
      <em style="background:palegreen">,<em style="background:coral">,
      <em style="background:wheat">,<em style="background:khaki">,
      <em style="background:lime">,<em style="background:deepskyblue">]]></str>
    <str name="hl.tag.post"><![CDATA[</em>]]></str>
  </lst>
</fragmentsBuilder>
Re: How to SOLR file in svn repository
You need to: 1) crawl the SVN repository, 2) index the files, 3) make a UI that fetches the original file when you click on a search result. Solr only does #2. If you run a Subversion web-browser app, you can download the developer-only version of the LucidWorks product and crawl the SVN web viewer. This will give you #1 and #3. Lance

On 08/21/2013 09:00 AM, jiunarayan wrote: I have an SVN repository and an SVN file path. How can I make the content of the SVN files searchable with Solr?
Re: Adding one core to an existing core?
Thanks a lot !!! On 22/08/2013 16:23, Andrea Gazzarini wrote: First, a core is a separate index, so it is completely independent from the already existing core(s). So basically you don't need to reindex. In order to have two cores (but the same applies for n cores): you must have in your solr.home the file (solr.xml) described here http://wiki.apache.org/solr/Solr.xml%20%28supported%20through%204.x%29 then, you must obviously have one or two directories (corresponding to the instanceDir attribute). I said one or two because if the index configuration is basically the same (or something changes but is dynamically configured - i.e. core name) you can create two instances starting from the same configuration. I mean:

<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="conf.dir"/>
    <core name="core1" instanceDir="conf.dir"/>
  </cores>
</solr>

Otherwise you must have two different conf directories that contain the index configuration. You should already have a first one (the current core); you just need another conf dir with solrconfig.xml, schema.xml and the other required files. In this case each core will have its own instanceDir:

<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="conf.dir.core0"/>
    <core name="core1" instanceDir="conf.dir.core1"/>
  </cores>
</solr>

Best, Andrea On 08/22/2013 04:04 PM, Bruno Mannina wrote: One small detail: I'm on Ubuntu 12.04 LTS On 22/08/2013 15:56, Bruno Mannina wrote: Dear Users, (Solr 3.6 + Tomcat 7) I have used Solr with one core for two years, and I would now like to add another core (a new database). Can I do this without re-indexing my core1? Could you point me to a good tutorial for that? (My current database is around 200 GB for 86,000,000 docs.) My new database will be small, around 1000 documents of 5 KB each. Thanks a lot, Bruno
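For SolrJ users following the thread, each core is then addressed by its own URL. A minimal sketch, assuming Tomcat on port 8080 as in Bruno's setup and the core names from Andrea's example:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class TwoCoresExample {
  public static void main(String[] args) throws Exception {
    // each core behaves as an independent index with its own URL
    HttpSolrServer core0 = new HttpSolrServer("http://localhost:8080/solr/core0");
    HttpSolrServer core1 = new HttpSolrServer("http://localhost:8080/solr/core1");
    System.out.println(core0.query(new SolrQuery("*:*")).getResults().getNumFound());
    System.out.println(core1.query(new SolrQuery("*:*")).getResults().getNumFound());
  }
}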
Solr cloud hash range set to null after recovery from index corruption
Hi, I have a Solr cloud set up with 12 shards with 2 replicas each, spread across 6 servers (each server hosting 4 cores). The Solr version is 4.3.1. Due to memory errors on one machine, 3 of its 4 indexes became corrupted. I unloaded the cores, repaired the indexes with the Lucene CheckIndex tool, and added the cores again. Afterwards the Solr cloud hash range has been set to null for the shards with corrupt indexes. Could anybody point me to why this has occurred, and more importantly, how to set the range on the shards again? Thank you. Best, Rikke
Re: updating docs in solr cloud hangs
Erick, I've read over SOLR-4816 after finding your comment about the server-side stack traces showing threads locked up over semaphores and I'm curious how that issue cures the problem on the server-side as the patch only includes client-side changes. Do the servers get so tied up shuffling documents around when they're not sent to the master that they get blocked as described? If they do get blocked due to shuffling documents around is a client-side fix for this not more of a workaround than a true fix? I'm entirely willing to apply this patch to all of the code I've got that talks to my solr servers and try it out but I'm reluctant to because this looks like a client-side fix to a server-side issue. Thanks, Greg -- View this message in context: http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-tp4067388p4086160.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: updating docs in solr cloud hangs
Right, it's a little arcane. But the lockup is because the various leaders send documents to each other and wait for returns. If there are a _lot_ of incoming packets to various leaders, it can generate the distributed deadlock. So the shuffling you refer to is the root of the issue. If the leaders only receive documents for the shard they're a leader of, then they won't have to send updates to other leaders and shouldn't hit this condition. But you're right, this situation was encountered the first time by SolrJ clients sending lots and lots of parallel requests; I don't remember whether it was just one client with lots of threads or many clients. If you're not using SolrJ, then it won't do you much good since it's client-side only. As far as being a true fix or not, you can look at it as kicking the can down the road. This patch has several advantages: 1 It should pave the way for, and move towards, linear scalability as far as scaling up to many many nodes when indexing from SolrJ. 2 It should improve throughput in the normal case as well. 3 Along the way it _should_ significantly lower (perhaps remove entirely) the chance that this deadlock will occur, again when indexing from SolrJ. If you had a bunch of clients, say, posting CSV files to SolrCloud, I'd guess you'd find this happening again. So it's an improvement, not a perfect cure. But if you think it'd help Best, Erick On Thu, Aug 22, 2013 at 3:23 PM, allrightname allrightn...@gmail.com wrote: Erick, I've read over SOLR-4816 after finding your comment about the server-side stack traces showing threads locked up over semaphores and I'm curious how that issue cures the problem on the server-side as the patch only includes client-side changes. Do the servers get so tied up shuffling documents around when they're not sent to the master that they get blocked as described? If they do get blocked due to shuffling documents around is a client-side fix for this not more of a workaround than a true fix? I'm entirely willing to apply this patch to all of the code I've got that talks to my solr servers and try it out but I'm reluctant to because this looks like a client-side fix to a server-side issue. Thanks, Greg -- View this message in context: http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-tp4067388p4086160.html Sent from the Solr - User mailing list archive at Nabble.com.
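For readers of the archive: with the SOLR-4816 changes on the client, CloudSolrServer hashes each document's uniqueKey and sends the update directly to the right shard leader instead of relying on server-side forwarding. A minimal sketch (the ZooKeeper addresses and collection name are placeholders):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DirectRoutingExample {
  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "42");
    server.add(doc); // with SOLR-4816, routed straight to the leader of the shard owning id 42
    server.commit();
    server.shutdown();
  }
}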
Re: Schema
On Aug 22, 2013, at 19:53, Kamaljeet Kaur kamal.kaur...@gmail.com wrote: On Thu, Aug 22, 2013 at 10:56 PM, SolrLover [via Lucene] ml-node+s472066n4086140...@n3.nabble.com wrote: Now use DIH to get the data from the MySQL database into Solr: http://wiki.apache.org/solr/DataImportHandler These are for versions 1.3, 1.4, 3.6 or 4.0. Why are versions mentioned there? Don't they work on Solr 4.4.0? Why don't you just try?
Re: Schema
Versions mentioned in the wiki only tell you the Solr version from which each feature is available. This is not a concern in your case, as you are using the latest version: everything you find in the wiki is available in Solr 4.4. -- View this message in context: http://lucene.472066.n3.nabble.com/Schema-tp4086136p4086163.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to SOLR file in svn repository
I don't think you can go into production with that. But the Cloudera distribution (with Hue) might be a similar or better option. Regards, Alex On 22 Aug 2013 14:38, Lance Norskog goks...@gmail.com wrote: You need to: 1) crawl the SVN database 2) index the files 3) make a UI that fetches the original file when you click on a search result. Solr only has #2. If you run a subversion web browser app, you can download the developer-only version of the LucidWorks product and crawl the SVN web viewer. This will give you #1 and #3. Lance On 08/21/2013 09:00 AM, jiunarayan wrote: I have an svn repository and an svn file path. How can I search the content of the svn file with Solr? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-SOLR-file-in-svn-repository-tp4085904.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: dataimporter tika fields empty
Ah. That's because the Tika processor does not support XPath extraction. You need to nest one more level. Regards, Alex On 22 Aug 2013 13:34, Andreas Owen a...@conx.ch wrote: i can do it like this, but then the content isn't copied to text, it's just in text_test:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
  <field column="text" name="text_test"/>
  <copyField source="text_test" dest="text"/>
</entity>

On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote: i put it in the tika-entity as an attribute, but it doesn't change anything. my bigger concern is why text_test isn't populated at all On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote: Can you try the SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting htmlMapper=identity on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen a...@conx.ch wrote: i'm trying to index a html page and only use the div with the id 'content'. unfortunately nothing is working within the tika-entity; only the standard text (content) is populated. do i have to use copyField for text_test to get the data? or is there a problem with the entity hierarchy? or is the xpath wrong, even though i've tried it without and just using text? or should i use the updateextractor? data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//id"/>
      <field column="file" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="url" xpath="//url"/>
      <field column="Author" xpath="//author"/>
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
        <!-- <copyField source="text" dest="text_test"/> -->
        <field column="text_test" xpath="//div[@id='content']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<docs>
  <doc>
    <id>5</id>
    <author>tkb</author>
    <title>Startseite</title>
    <description>blabla ...</description>
    <file>http://localhost/tkb/internet/index.cfm</file>
    <url>http://localhost/tkb/internet/index.cfm</url>
    <path2>http\specialConf</path2>
  </doc>
  <doc>
    <id>6</id>
    <author>tkb</author>
    <title>Eigenheim</title>
    <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
    <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
    <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</url>
  </doc>
</docs>
RE: updating docs in solr cloud hangs
Thanks, Erick that's exactly the clarification/confirmation I was looking for! Greg
Re: Solr Ref guide question
What version of solr are you using? Have you copied a solr.xml from somewhere else? I can almost reproduce the error you're getting if I put a non-existent core in my solr.xml, e.g.:

<solr>
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="a_non_existent_core"/>
  </cores>
</solr>

... On Thu, Aug 22, 2013 at 1:30 PM, yriveiro yago.rive...@gmail.com wrote: Hi all, I think something is missing in Solr's ref doc. The section Running Solr says to run Solr using the command: $ java -jar start.jar But if I do this with a fresh install, I get a stack trace like this: http://pastebin.com/5YRRccTx Is this behavior expected? - Best regards -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Ref-guide-question-tp4086142.html Sent from the Solr - User mailing list archive at Nabble.com. -- Brendan Grainger www.kuripai.com
How to set discountOverlaps=true in Solr 4x schema.xml
If I am using solr.SchemaSimilarityFactory to allow different similarities for different fields, do I set discountOverlaps=true on the factory or per field? What is the syntax? The below does not seem to work:

<similarity class="solr.BM25SimilarityFactory" discountOverlaps="true"/>
<similarity class="solr.SchemaSimilarityFactory" discountOverlaps="true"/>

Tom
RE: How to set discountOverlaps=true in Solr 4x schema.xml
Hi Tom, Don't set it as attributes but as lists, as Solr does everywhere:

<similarity class="solr.SchemaSimilarityFactory">
  <bool name="discountOverlaps">true</bool>
</similarity>

For BM25 you can also set k1 and b, which is very convenient! Cheers -Original message- From: Tom Burton-West tburt...@umich.edu Sent: Thursday 22nd August 2013 22:42 To: solr-user@lucene.apache.org Subject: How to set discountOverlaps="true" in Solr 4x schema.xml If I am using solr.SchemaSimilarityFactory to allow different similarities for different fields, do I set discountOverlaps=true on the factory or per field? What is the syntax? The below does not seem to work:

<similarity class="solr.BM25SimilarityFactory" discountOverlaps="true"/>
<similarity class="solr.SchemaSimilarityFactory" discountOverlaps="true"/>

Tom
Re: How to set discountOverlaps=true in Solr 4x schema.xml
Thanks Markus, I set it, but it seems to make no difference in the score or statistics listed in debugQuery, or in the ranking. I'm using a field with CommonGrams and a huge list of common words, so there should be a huge difference in the document length with and without discountOverlaps. Is the default for Solr 4 true?

<similarity class="solr.BM25SimilarityFactory">
  <float name="k1">1.2</float>
  <float name="b">0.75</float>
  <bool name="discountOverlaps">false</bool>
</similarity>

On Thu, Aug 22, 2013 at 4:58 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Tom, Don't set it as attributes but as lists, as Solr does everywhere:

<similarity class="solr.SchemaSimilarityFactory">
  <bool name="discountOverlaps">true</bool>
</similarity>

For BM25 you can also set k1 and b, which is very convenient! Cheers -Original message- From: Tom Burton-West tburt...@umich.edu Sent: Thursday 22nd August 2013 22:42 To: solr-user@lucene.apache.org Subject: How to set discountOverlaps="true" in Solr 4x schema.xml If I am using solr.SchemaSimilarityFactory to allow different similarities for different fields, do I set discountOverlaps=true on the factory or per field? What is the syntax? The below does not seem to work:

<similarity class="solr.BM25SimilarityFactory" discountOverlaps="true"/>
<similarity class="solr.SchemaSimilarityFactory" discountOverlaps="true"/>

Tom
Re: How to set discountOverlaps=true in Solr 4x schema.xml
I should have said that I have set it both to true and to false, restarting Solr each time, and the rankings and the info in debugQuery showed no change. Does this have to be set at index time? Tom
Storing query results
I am in the process of setting up a search application that allows the user to view paginated query results. The documents are highly dynamic, but I want the search results to be static, i.e. I don't want the user to click the next-page button, have the query rerun, and get a different set of search results because the data changed while he was looking through it. I want the results stored somewhere else, with the successive page queries drawing from that. I know Solr has query result caching, but I want to store the result set entirely. Does Solr provide any functionality like this? I imagine it doesn't, because then you'd need to specify how long to store it, etc. I'm using Solr 4.4.0. I found someone asking something similar here http://lucene.472066.n3.nabble.com/storing-results-td476351.html but that was 6 years ago. -- View this message in context: http://lucene.472066.n3.nabble.com/Storing-query-results-tp4086182.html Sent from the Solr - User mailing list archive at Nabble.com.
SOLR Prevent solr of modifying fields when update doc
Hi, How can I prevent Solr from updating some fields when updating a doc? The problem is, I have a UUID in a field named uuid, but it is not the unique key. When an RSS source updates a feed, Solr will update the doc with the same link, but it generates a new uuid. This is not desired, because I use this id to relate feeds to a user. Can someone help me? Many Thanks
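A hedged aside for the archive: if the updates are pushed through the update handler (rather than DIH), Solr 4.x atomic updates can modify only the fields you send and leave the stored uuid untouched. This requires a uniqueKey, the update log enabled, and all fields stored. A minimal SolrJ sketch; the id value and title field are invented for illustration:

import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "feed-entry-42"); // uniqueKey of the existing document
    doc.addField("title", Collections.singletonMap("set", "Updated feed title"));
    // no "uuid" field in the update, so the stored uuid value is preserved
    solr.add(doc);
    solr.commit();
  }
}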
Re: Storing query results
Hi jfeist, Your mail reminds me of this blog post; not sure about Solr, though: http://blog.mikemccandless.com/2011/11/searcherlifetimemanager-prevents-broken.html From: jfeist jfe...@llminc.com To: solr-user@lucene.apache.org Sent: Friday, August 23, 2013 12:09 AM Subject: Storing query results I am in the process of setting up a search application that allows the user to view paginated query results. The documents are highly dynamic, but I want the search results to be static, i.e. I don't want the user to click the next-page button, have the query rerun, and get a different set of search results because the data changed while he was looking through it. I want the results stored somewhere else, with the successive page queries drawing from that. I know Solr has query result caching, but I want to store the result set entirely. Does Solr provide any functionality like this? I imagine it doesn't, because then you'd need to specify how long to store it, etc. I'm using Solr 4.4.0. I found someone asking something similar here http://lucene.472066.n3.nabble.com/storing-results-td476351.html but that was 6 years ago. -- View this message in context: http://lucene.472066.n3.nabble.com/Storing-query-results-tp4086182.html Sent from the Solr - User mailing list archive at Nabble.com.
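Absent server-side support, one client-side workaround is to freeze the result set in the application: run the query once, keep the ordered list of matching ids (e.g. in the user's session), and serve each page by fetching just that slice. A hedged SolrJ sketch; the 10,000-row cap, page bounds, and field names are arbitrary:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class FrozenResultPager {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // 1) run the query once; keep only the ids, in relevance order, capped
    SolrQuery q = new SolrQuery("some user query");
    q.setFields("id");
    q.setRows(10000);
    SolrDocumentList hits = solr.query(q).getResults();

    // 2) later, render page 2 (rows 10-19) from the frozen id list
    StringBuilder idQuery = new StringBuilder("id:(");
    for (int i = 10; i < Math.min(20, hits.size()); i++) {
      if (i > 10) idQuery.append(" OR ");
      idQuery.append('"').append(hits.get(i).getFieldValue("id")).append('"');
    }
    idQuery.append(')');
    for (SolrDocument d : solr.query(new SolrQuery(idQuery.toString())).getResults()) {
      System.out.println(d);
    }
  }
}

Note that the per-page id query does not preserve the saved order, so the fetched documents should be re-sorted against the stored id list before display.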
SOLR search by external fields
What we need is similar to what is discussed here, except not as a filter but as an actual query: http://lucene.472066.n3.nabble.com/filter-query-from-external-list-of-Solr-unique-IDs-td1709060.html We'd like to implement a query parser/scorer that would allow us to combine Solr searches with searches over external fields. This is due to the limitation of having to update an entire document even though only one field in the document needs to be updated. For example, we have a database table called document_attributes containing two columns: document_id and attribute_id. The document_id corresponds to the ID of the documents indexed in Solr. We'd like to be able to pass in a query like: attribute_id:123 OR text:some_query (attribute_id:123 OR attribute_id:456) AND text:some_query etc... Can we implement a plugin/module in Solr that's able to parse the above query, fetch the document_ids associated with the attribute_id, and combine the results with Solr's normal search processing to return one set of results for the entire query? We'd appreciate any guidance on how to implement this, if it is possible.
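One possible shape for such a plugin, offered only as a hedged sketch: a QParserPlugin (registered in solrconfig.xml with <queryParser name="attr" class="..."/>) that expands an attribute id into a disjunction of document ids fetched from the database. The class name, local-parameter name, and lookup method are all invented for illustration:

import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

// hypothetical parser: {!attr id=123} becomes a disjunction of document ids
public class AttributeQParserPlugin extends QParserPlugin {
  @Override
  public void init(NamedList args) {}

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        String attrId = localParams.get("id");
        BooleanQuery bq = new BooleanQuery();
        for (String docId : lookupDocIds(attrId)) {
          bq.add(new TermQuery(new Term("id", docId)), Occur.SHOULD);
        }
        return bq;
      }
    };
  }

  private List<String> lookupDocIds(String attrId) {
    // e.g. SELECT document_id FROM document_attributes WHERE attribute_id = ?
    throw new UnsupportedOperationException("wire up your datasource here");
  }
}

It could then be combined with normal query processing via nested queries, e.g. q=_query_:"{!attr id=123}" OR text:some_query. Caching the id lookups would matter for attributes that map to many documents.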
custom names for replicas in solrcloud
Hi, I am using Solr 4.3 with 3 Solr hosts and an external ZooKeeper ensemble of 3 servers, and just 1 shard currently. When I create collections using the Collections API, it creates cores named collection1_shard1_replica1, collection1_shard1_replica2, collection1_shard1_replica3. Is there any way to pass a custom name? Or can I have all the replicas with the same name? Any pointers will be much appreciated. Thanks, -Manasi -- View this message in context: http://lucene.472066.n3.nabble.com/custom-names-for-replicas-in-solrcloud-tp4086205.html Sent from the Solr - User mailing list archive at Nabble.com.
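One possible workaround, sketched with the caveat that it predates any official support for naming replicas: create each core yourself through the CoreAdmin API, passing the collection and shard it should join, so the core name is whatever you choose. The equivalent HTTP call is /admin/cores?action=CREATE&name=...&collection=...&shard=...; in SolrJ (assuming your version exposes these setters on CoreAdminRequest.Create):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class NamedReplicaExample {
  public static void main(String[] args) throws Exception {
    // talk to the node's core admin endpoint, not a collection URL
    HttpSolrServer admin = new HttpSolrServer("http://host1:8983/solr");
    CoreAdminRequest.Create create = new CoreAdminRequest.Create();
    create.setCoreName("collection1_myreplica"); // the custom name
    create.setCollection("collection1");
    create.setShardId("shard1");
    create.process(admin);
  }
}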
Re: Measuring SOLR performance
Hi Dmitry, So it seems solrjmeter should not assume the adminPath - perhaps it needs to be passed as an argument. When you set the adminPath, are you able to access localhost:8983/solr/statements/admin/cores ? roman On Wed, Aug 21, 2013 at 7:36 AM, Dmitry Kan solrexp...@gmail.com wrote: Hi Roman, I have noticed a difference with different solr.xml config contents. It is probably legit, but thought to let you know (tests run on fresh checkout as of today). As mentioned before, I have two cores configured in solr.xml. If the file is: [code]
<solr persistent="false">
  <!-- adminPath: RequestHandler path to manage cores. If 'null' (or absent), cores will not be manageable via request handler -->
  <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}">
    <core name="metadata" instanceDir="metadata"/>
    <core name="statements" instanceDir="statements"/>
  </cores>
</solr>
[/code] then the instruction: python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q ./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs 60 -R cms -t /solr/statements -e statements -U 100 works just fine. If however the solr.xml has adminPath set to /admin solrjmeter produces an error: [error] **ERROR** File "solrjmeter.py", line 1386, in <module> main(sys.argv) File "solrjmeter.py", line 1278, in main check_prerequisities(options) File "solrjmeter.py", line 375, in check_prerequisities error('Cannot find admin pages: %s, please report a bug' % apath) File "solrjmeter.py", line 66, in error traceback.print_stack() Cannot find admin pages: http://localhost:8983/solr/admin, please report a bug [/error] With both solr.xml configs the following url returns just fine: http://localhost:8983/solr/statements/admin/system?wt=json Regards, Dmitry On Wed, Aug 14, 2013 at 2:03 PM, Dmitry Kan solrexp...@gmail.com wrote: Hi Roman, This looks much better, thanks! The ordinary non-comparison mode works. I'll post here, if there are other findings. Thanks for quick turnarounds, Dmitry On Wed, Aug 14, 2013 at 1:32 AM, Roman Chyla roman.ch...@gmail.com wrote: Hi Dmitry, oh yes, late night fixes... :) The latest commit should make it work for you. Thanks! roman On Tue, Aug 13, 2013 at 3:37 AM, Dmitry Kan solrexp...@gmail.com wrote: Hi Roman, Something bad happened in fresh checkout: python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q ./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs 60 -R g1 -t /solr/statements -e statements -U 100 Traceback (most recent call last): File "solrjmeter.py", line 1392, in <module> main(sys.argv) File "solrjmeter.py", line 1347, in main save_into_file('before-test.json', simplejson.dumps(before_test)) File "/usr/lib/python2.7/dist-packages/simplejson/__init__.py", line 286, in dumps return _default_encoder.encode(obj) File "/usr/lib/python2.7/dist-packages/simplejson/encoder.py", line 226, in encode chunks = self.iterencode(o, _one_shot=True) File "/usr/lib/python2.7/dist-packages/simplejson/encoder.py", line 296, in iterencode return _iterencode(o, 0) File "/usr/lib/python2.7/dist-packages/simplejson/encoder.py", line 202, in default raise TypeError(repr(o) + " is not JSON serializable") TypeError: <__main__.ForgivingValue object at 0x7fc6d4040fd0> is not JSON serializable Regards, D. On Tue, Aug 13, 2013 at 8:10 AM, Roman Chyla roman.ch...@gmail.com wrote: Hi Dmitry, On Mon, Aug 12, 2013 at 9:36 AM, Dmitry Kan solrexp...@gmail.com wrote: Hi Roman, Good point. 
I managed to run the command with -C and double quotes: python solrjmeter.py -a -C g1,cms -c hour -x ./jmx/SolrQueryTest.jmx As a result got several files (html, css, js, csv) in the running directory (any way to specify where the output should be stored in this case?) i know it is confusing, i plan to change it - but later, now it is too busy here... When I look onto the comparison dashboard, I see this: http://pbrd.co/17IRI0b two things: the tests probably took more than one hour to finish, so they are not aligned - try generating the comparison with '-c 14400' (ie. 4x3600 secs) the other thing: if you have only two datapoints, the dygraph will not show anything - there must be more datapoints/measurements One more thing: all the previous tests were run with softCommit disabled. After enabling it, the tests started to fail: $ python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q ./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs 60 -R g1 -t /solr/statements -e statements -U 100 $ cd g1 Reading
Re: Flushing cache without restarting everything?
be careful with drop_caches - make sure you sync first On Thu, Aug 22, 2013 at 1:28 PM, Jean-Sebastien Vachon jean-sebastien.vac...@wantedanalytics.com wrote: I was afraid someone would tell me that... thanks for your input -Original Message- From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] Sent: August-22-13 9:56 AM To: solr-user@lucene.apache.org Subject: Re: Flushing cache without restarting everything? On Tue, 2013-08-20 at 20:04 +0200, Jean-Sebastien Vachon wrote: Is there a way to flush the cache of all nodes in a Solr Cloud (by reloading all the cores, through the collection API, ...) without having to restart all nodes? As MMapDirectory shares data with the OS disk cache, flushing of Solr-related caches on a machine should involve 1) Shut down all Solr instances on the machine 2) Clear the OS read cache ('sudo echo 1 > /proc/sys/vm/drop_caches' on a Linux box) 3) Start the Solr instances I do not know of any Solr-supported way to do step 2. For our performance tests we use custom scripts to perform the steps. - Toke Eskildsen, State and University Library, Denmark
Removing duplicates during a query
Suppose I have two documents with different id, and there is another field, for instance content-hash which is something like a 16-byte hash of the content. Can Solr be configured to return just one copy, and drop the other if both are relevant? If Solr does drop one result, do you get any indication in the document that was kept that there was another copy?
Re: How to avoid underscore sign indexing problem?
Alright, thanks for all your help. I finally fixed this problem using PatternReplaceFilterFactory + WordDelimiterFilterFactory. I first replace the _ (underscore) using PatternReplaceFilterFactory and then use WordDelimiterFilterFactory to generate the word and number parts, to increase the user search hit rate. Although this decreases search quality a little, the users need a higher recall rate than precision. Thank you all. Floyd 2013/8/22 Floyd Wu floyd...@gmail.com After trying some search case and different params combination of WordDelimeter. I wonder what is the best strategy to index string 2DA012_ISO MARK 2 and can be search by term 2DA012? What if I just want _ to be removed both query/index time, what and how to configure? Floyd 2013/8/22 Floyd Wu floyd...@gmail.com Thank you all. By the way, Jack I gonna by your book. Where to buy? Floyd 2013/8/22 Jack Krupansky j...@basetechnology.com I thought that the StandardTokenizer always split on punctuation, Proving that you haven't read my book! The section on the standard tokenizer details the rules that the tokenizer uses (in addition to extensive examples.) That's what I mean by deep dive. -- Jack Krupansky -Original Message- From: Shawn Heisey Sent: Wednesday, August 21, 2013 10:41 PM To: solr-user@lucene.apache.org Subject: Re: How to avoid underscore sign indexing problem? On 8/21/2013 7:54 PM, Floyd Wu wrote: When using StandardAnalyzer to tokenize string Pacific_Rim will get ST textraw_bytesstartendtypeposition pacific_rim[70 61 63 69 66 69 63 5f 72 69 6d]011ALPHANUM1 How to make this string to be tokenized to these two tokens Pacific, Rim? Set _ as stopword? Please kindly help on this. Many thanks. Interesting. I thought that the StandardTokenizer always split on punctuation, but apparently that's not the case for the underscore character. You can always use the WordDelimeterFilter after the StandardTokenizer. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory Thanks, Shawn
Re: How to avoid underscore sign indexing problem?
Ah, but what is the definition of punctuation in Solr? On Wed, Aug 21, 2013 at 11:15 PM, Jack Krupansky j...@basetechnology.comwrote: I thought that the StandardTokenizer always split on punctuation, Proving that you haven't read my book! The section on the standard tokenizer details the rules that the tokenizer uses (in addition to extensive examples.) That's what I mean by deep dive. -- Jack Krupansky -Original Message- From: Shawn Heisey Sent: Wednesday, August 21, 2013 10:41 PM To: solr-user@lucene.apache.org Subject: Re: How to avoid underscore sign indexing problem? On 8/21/2013 7:54 PM, Floyd Wu wrote: When using StandardAnalyzer to tokenize string Pacific_Rim will get ST textraw_**bytesstartendtypeposition pacific_rim[70 61 63 69 66 69 63 5f 72 69 6d]011ALPHANUM1 How to make this string to be tokenized to these two tokens Pacific, Rim? Set _ as stopword? Please kindly help on this. Many thanks. Interesting. I thought that the StandardTokenizer always split on punctuation, but apparently that's not the case for the underscore character. You can always use the WordDelimeterFilter after the StandardTokenizer. http://wiki.apache.org/solr/**AnalyzersTokenizersTokenFilter**s#solr.** WordDelimiterFilterFactoryhttp://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory Thanks, Shawn
Re: More on topic of Meta-search/Federated Search with Solr
You are right, but here's my null hypothesis for studying the impact on relevance. Hash the query to deterministically seed a random number generator. Pick one from column A or column B randomly. This is of course wrong - a query might find two non-relevant results in corpus A and lots of relevant results in corpus B, leading to poor precision, because the two non-relevant documents are likely to show up on the first page. You can weight on the size of the corpus, but weighting is then probably wrong on any specific query. It was an interesting thought experiment though. Erik, since LucidWorks was dinged in the 2013 Magic Quadrant on Enterprise Search due to a lack of Federated Search, the for-profit Enterprise Search companies must be doing it some way. Maybe relevance suffers (a lot), but you can do it if you want to. I have read very little of the IR literature - enough to sound like I know a little, but it is a very little. If there is literature on this, it would be an interesting read. On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson erickerick...@gmail.com wrote: The lack of global TF/IDF has been answered in the past, in the sharded case, by usually you have similar enough stats that it doesn't matter. This pre-supposes a fairly evenly distributed set of documents. But if you're talking about federated search across different types of documents, then what would you rescore with? How would you even consider scoring docs that are somewhat/totally different? Think magazine articles and meta-data associated with pictures. What I've usually found is that one can use grouping to show the top N of a variety of results. Or show tabs with different types. Or have the app intelligently combine the different types of documents in a way that makes sense. But I don't know how you'd just get the right thing to happen with some kind of scoring magic. Best Erick On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis dansm...@gmail.com wrote: I've thought about it, and I have no time to really do a meta-search during evaluation. What I need to do is to create a single core that contains both of my data sets, and then describe the architecture that would be required to do blended results, with liberal estimates. From the perspective of evaluation, I need to understand whether any of the solutions to better ranking in the absence of global IDF have been explored? I suspect that one could retrieve a much larger than N set of results from a set of shards, re-score in some way that doesn't require IDF, e.g. storing both results in the same priority queue and *re-scoring* before *re-ranking*. The other way to do this would be to have a custom SearchHandler that works differently - it performs the query, retrieves all results deemed relevant by another engine, adds them to the Lucene index, and then performs the query again in the standard way. This would be quite slow, but perhaps useful as a way to evaluate my method. I still welcome any suggestions on how such a SearchHandler could be implemented.
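The null hypothesis above is easy to pin down in code. A toy sketch, where the deterministic seed keeps the interleaving stable across repeated runs of the same query (the string lists are stand-ins for per-corpus rankings):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class RandomMerge {
  static List<String> merge(String query, List<String> resultsA, List<String> resultsB) {
    Random rnd = new Random(query.hashCode()); // same query -> same interleave
    List<String> merged = new ArrayList<String>();
    Iterator<String> a = resultsA.iterator(), b = resultsB.iterator();
    while (a.hasNext() || b.hasNext()) {
      // pick from column A or column B randomly, falling back when one is exhausted
      boolean pickA = a.hasNext() && (!b.hasNext() || rnd.nextBoolean());
      merged.add(pickA ? a.next() : b.next());
    }
    return merged;
  }
}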
Re: Removing duplicates during a query
OK - I see that this can be done with Field Collapsing/Grouping. I also see the mentions in the Wiki for avoiding duplicates using a 16-byte hash. So, question withdrawn... On Thu, Aug 22, 2013 at 10:21 PM, Dan Davis dansm...@gmail.com wrote: Suppose I have two documents with different id, and there is another field, for instance content-hash which is something like a 16-byte hash of the content. Can Solr be configured to return just one copy, and drop the other if both are relevant? If Solr does drop one result, do you get any indication in the document that was kept that there was another copy?
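For reference, a hedged sketch of the query-time half of that in SolrJ (the field name and query are illustrative; the group field must be indexed and single-valued). Index-time dedup via SignatureUpdateProcessorFactory is the other option the wiki describes:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DedupQueryExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("some query");
    q.set("group", true);
    q.set("group.field", "content_hash"); // one document kept per hash value
    q.set("group.main", true);            // flatten groups into a plain doc list
    System.out.println(solr.query(q).getResults());
  }
}

With group.main=true the response is a flat document list, so existing result-handling code keeps working; without it, each group also reports its own numFound, which answers the "was there another copy" question.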
Re: Distance sort on a multi-value field
This is actually pretty far afield from my original subject, but it turns out that I also had issues with NRT and multi-field geospatial performance in Solr 4, so I'll follow that up. I've been testing and working with David's SOLR-5170 patch ever since he posted it, and I pushed it into production with only some cosmetic changes a few hours ago. I have a relatively low update and query rate for this particular query type (something like 2 updates/sec, 10 queries/sec) but a short autoSoftCommit time (5 sec). Based on the data so far, this patch looks like it's brought my average response time down from 4 seconds to about 50ms. Very nice! On 8/20/13 7:37 PM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: The distance sorting code in SOLR-2155 is roughly equivalent to the code that RPT uses (RPT has its lineage in SOLR-2155 after all). I just reviewed it to double-check. It's possible the behavior is slightly better in SOLR-2155 because the cache (a Solr cache) contains normal hard-references whereas RPT has one based on weak references, which will linger longer. But I think the likelihood of OOM is the same. Any way, the current best option is https://issues.apache.org/jira/browse/SOLR-5170 which I posted a few days ago. ~ David Billnbell wrote: We have been using 2155 for over 6 months in production with over 2M hits every 10 minutes. No OOM yet. 2155 seems great, and would this issue be any worse than 2155? On Wed, Aug 14, 2013 at 4:08 PM, Jeff Wartes jwartes@ wrote: Hm, "Give me all the stores that only have branches in this area" might be a plausible use case for farthest distance. That's essentially a contains question though, so maybe that's already supported? I guess it depends on how contains/intersects/etc handle multi-values. I feel like multi-value interaction really deserves its own section in the documentation. I'm aware of the memory issue, but it seems like if you want to sort multi-valued points, it's either this or try to pull in the 2155 patch. In general I'd rather go with the thing that's being maintained. Thanks for the code pointer. You're right, that doesn't look like something I can easily use for more general aggregate scoring control. Ah well. On 8/14/13 12:35 PM, Smiley, David W. dsmiley@ wrote: On 8/14/13 2:26 PM, Jeff Wartes jwartes@ wrote: I'm still pondering aggregate-type operations for scoring multi-valued fields (original thread: http://goo.gl/zOX53f ), and it occurred to me that distance-sort with SpatialRecursivePrefixTreeFieldType must be doing something like that. It isn't. Somewhat surprisingly I don't see this in the documentation anywhere, but I presume the example query: (from: http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4) q={!geofilt score=distance sfield=geo pt=54.729696,-98.525391 d=10} assigns the distance/score based on the *closest* lat/long if the sfield is a multi-valued field. Yes it does. That's a reasonable default, but it's a bit arbitrary. Can I sort based on the *furthest* lat/long in the document? Or the average distance? Anyone know more about how this works and could give me some pointers? I considered briefly supporting the farthest distance but dismissed it as I saw no real use-case. I didn't think of the average distance; that's plausible. Anyway, your best bet is to dig into the code. The relevant part is ShapeFieldCacheDistanceValueSource.
FYI something to keep in mind: https://issues.apache.org/jira/browse/LUCENE-4698 ~ David -- Bill Bell billnbell@ cell 720-256-8076 - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Distance-sort-on-a-multi-value-field-tp4084666p4085797.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Distance sort on a multi-value field
Awesome! Be sure to watch the JIRA issue as it develops. The patch will improve (I've already improved it but not posted it), and one day a solution is bound to get committed. ~ David Jeff Wartes wrote: This is actually pretty far afield from my original subject, but it turns out that I also had issues with NRT and multi-field geospatial performance in Solr 4, so I'll follow that up. I've been testing and working with David's SOLR-5170 patch ever since he posted it, and I pushed it into production with only some cosmetic changes a few hours ago. I have a relatively low update and query rate for this particular query type (something like 2 updates/sec, 10 queries/sec) but a short autoSoftCommit time (5 sec). Based on the data so far, this patch looks like it's brought my average response time down from 4 seconds to about 50ms. Very nice! - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Distance-sort-on-a-multi-value-field-tp4084666p4086226.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to access latitude and longitude with only LatLonType?
Hi Quan, You claim to be using LatLonType, yet the error you posted makes it clear you are in fact using SpatialRecursivePrefixTreeFieldType (RPT). Regardless of which spatial field you use, it's not clear to me what sort of statistics could be useful on a spatial field. The stats component doesn't work with any of the spatial fields. Well... it's possible to use LatLonType and then do stats on just the latitude or just the longitude (you should see the auto-generated fields for these in the online schema browser), but that would be unlikely to be useful. ~ David zhangquan913 wrote: Hello All, I am currently doing a spatial query in Solr. I indexed coordinates (type="location" class="solr.LatLonType"), but the following query failed: http://localhost/solr/quan/select?q=*:*&stats=true&stats.field=coordinates&stats.facet=township&rows=0 It showed an error: Field type location{class=org.apache.solr.schema.SpatialRecursivePrefixTreeFieldType,analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,args={distErrPct=0.025, class=solr.SpatialRecursivePrefixTreeFieldType, maxDistErr=0.09, units=degrees}} is not currently supported I don't want to create duplicate indexed fields latitude and longitude. How can I use only coordinates to do this kind of stats on both latitude and longitude? Thanks, Quan - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-access-latitude-and-longitude-with-only-LatLonType-tp4086109p4086229.html Sent from the Solr - User mailing list archive at Nabble.com.
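To illustrate the LatLonType route David mentions: with the stock *_coordinate dynamic field, LatLonType indexes latitude and longitude as separate numeric subfields, and those do work with the stats component. A hedged sketch, assuming the field really were LatLonType and the default subfield naming applies:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CoordinateStatsExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost/solr/quan");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    q.set("stats", true);
    // latitude and longitude subfields auto-generated by LatLonType
    q.set("stats.field", "coordinates_0_coordinate", "coordinates_1_coordinate");
    q.set("stats.facet", "township");
    System.out.println(solr.query(q).getFieldStatsInfo());
  }
}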
Re: How to avoid underscore sign indexing problem?
Dan, StandardTokenizer implements the word boundary rules from the Unicode Text Segmentation standard annex UAX#29: http://www.unicode.org/reports/tr29/#Word_Boundaries Every character sequence within UAX#29 boundaries that contains a numeric or an alphabetic character is emitted as a term, and nothing else is emitted. Punctuation can be included within a term, e.g. 1,248.99 or 192.168.1.1. To split on underscores, you can convert underscores to e.g. spaces by adding PatternReplaceCharFilterFactory to your analyzer:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" "/>

This replacement will be performed prior to StandardTokenizer, which will then see token-splitting spaces instead of underscores. Steve On Aug 22, 2013, at 10:23 PM, Dan Davis dansm...@gmail.com wrote: Ah, but what is the definition of punctuation in Solr? On Wed, Aug 21, 2013 at 11:15 PM, Jack Krupansky j...@basetechnology.com wrote: I thought that the StandardTokenizer always split on punctuation, Proving that you haven't read my book! The section on the standard tokenizer details the rules that the tokenizer uses (in addition to extensive examples.) That's what I mean by deep dive. -- Jack Krupansky -Original Message- From: Shawn Heisey Sent: Wednesday, August 21, 2013 10:41 PM To: solr-user@lucene.apache.org Subject: Re: How to avoid underscore sign indexing problem? On 8/21/2013 7:54 PM, Floyd Wu wrote: When using StandardAnalyzer to tokenize string Pacific_Rim will get ST textraw_bytesstartendtypeposition pacific_rim[70 61 63 69 66 69 63 5f 72 69 6d]011ALPHANUM1 How to make this string to be tokenized to these two tokens Pacific, Rim? Set _ as stopword? Please kindly help on this. Many thanks. Interesting. I thought that the StandardTokenizer always split on punctuation, but apparently that's not the case for the underscore character. You can always use the WordDelimeterFilter after the StandardTokenizer. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory Thanks, Shawn