Re: Synonym/Tokenizer for Hyphenated Words
What does "having a problem" mean? Index time? Query time? Your problem is most likely the tokenizer, as you suspect. Try something like WhitespaceTokenizer and build up from there.

Three friends:
1> the admin/analysis page
2> admin/schema-browser
3> debugQuery=on

The first will show you what happens to tokens _after_ they get through tokenization. Be aware that this probably isn't entirely helpful when your problem is in the tokenization step itself. The second shows you what terms are actually in your index. The third shows you what your parsed query looks like.

A couple of other things:
1> There's no need to put in all the capitalization forms _if_ you put LowerCaseFilter in front of your synonym filter.
2> WhitespaceTokenizer is pretty simple. For instance, punctuation will be part of the tokens (e.g. periods at the end of sentences). So it's a place to _start_, but you'll have to think about what you really want from your tokenization process before deciding.

Best
Erick

On Thu, Nov 15, 2012 at 12:38 PM, Nathan Tallman ntall...@gmail.com wrote:
Hello Solr users,

I use Solr 3.5 via Vufind 1.3 and am having a problem with a synonym. No matter what syntax I use, it doesn't seem to have an effect. (See the various combinations below.)

antisemitism,anti-semitism,Antisemitism,Anti-Semitism,Anti-semitism,anti-Semitism
antisemitism,anti\-semitism,Antisemitism,Anti\-Semitism,Anti\-semitism,anti\-Semitism
antisemitism,anti semitism,Antisemitism,Anti Semitism,Anti semitism,anti Semitism

It was suggested to me that this was not a synonym issue but a tokenizing issue, because anti-semitism was being interpreted as "anti semitism". Does anyone have any suggestions for making the synonym work? Tweaking the tokenizer in schema.xml? Or somehow escaping the hyphen in synonyms.txt?

Many thanks,
Nathan
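To make that concrete, here is a minimal sketch of an analysis chain along the lines Erick suggests; the fieldType name is invented and this is untested against Nathan's actual Vufind schema:

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Whitespace tokenization keeps "anti-semitism" as a single token,
         so the hyphenated form survives to reach the synonym filter. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercasing first means synonyms.txt only needs lowercase entries. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

With lowercasing ahead of the synonym filter, synonyms.txt collapses to a single line:

antisemitism,anti-semitism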
Re: High Slave CPU Intermittently After Replication
That's very strange. How much memory are you giving the JVM? And how much memory is on your machine?

If your index is cut in half by an optimize, then it sounds like you're re-indexing everything. Optimize will squeeze out all the data left around by document deletes or updates, so the only reason I can imagine that your index drops by 50% is if you've replaced every document that was there originally. And I'd also guess that you don't have enough activity to trigger merges often enough to squeeze out the deleted documents' data.

But this sounds ever so much like you're running with not much memory and are getting into heavy swapping or something like that as your index crosses some threshold. That's just a guess, though.

Best
Erick

On Fri, Nov 16, 2012 at 10:23 AM, richardg richa...@dvdempire.com wrote:
We tried using the mergeFactor setting, but our CPU load/slow query time issues were more widespread; optimizing the index always alleviated the issue, which is why we are using it now. Our index is 2 GB when optimized and would balloon to over 4 GB, so we thought the issue was that it was getting too big.

I notice a small spike in CPU load after every replication, but then after a couple of seconds load returns to normal (which is less than 25%). It is just sometimes (once in the last week) that it would spike and stay high (10 minutes) until I optimized the index. Back when I would optimize the index after every commit, the issue would occur more often. We would like to stop optimizing and use the built-in merging, but we tried that before and the issue occurred more often. We were thinking of trying a mergeFactor of 2 again, but I'm afraid the issue will return.

I installed SPM and am monitoring it to see if it tells me anything; I can post the results on Monday and hopefully it will tell us something. At this time we aren't warming any caches; we weren't sure if this was an issue because our slowdowns weren't happening every time. Also, we are using the join functionality of Solr 4, if that helps.

Thanks for your help
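For reference, the optimize being described can be issued as a plain HTTP call against the update handler; a sketch (host, port and handler path are assumptions, not taken from richardg's setup):

# merge the index down to a single segment, squeezing out deleted documents' data
curl 'http://localhost:8983/solr/update?optimize=true&maxSegments=1'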
Re: inconsistent number of results returned in solr cloud
Hmmm, first an aside. If by "commit after every batch of documents" you mean after every call to server.add(doclist), there's no real need to do that unless you're striving for really low latency. The usual recommendation is to use commitWithin when adding and to commit only at the very end of the run. This shouldn't actually be germane to your issue, just an FYI.

So you're saying that the inconsistency is permanent? By that I mean it keeps coming back inconsistently for minutes/hours/days? I guess if I were trying to test this I'd need to know how you added the subsequent collections; in particular, what you did re: zookeeper as you added each collection.

Best
Erick

On Fri, Nov 16, 2012 at 2:55 PM, Buttler, David buttl...@llnl.gov wrote:
My typical way of adding documents is through SolrJ, where I commit after every batch of documents (where the batch size is configurable). I have now tried committing several times from the command line (curl), with and without openSearcher=true. It does not affect anything.
Dave

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Friday, November 16, 2012 11:04 AM
To: solr-user@lucene.apache.org
Subject: Re: inconsistent number of results returned in solr cloud

How did you do the final commit? Can you try a lone commit (with openSearcher=true) and see if that affects things? Trying to determine if this is a known issue or not.
- Mark

On Nov 16, 2012, at 1:34 PM, Buttler, David buttl...@llnl.gov wrote:
Hi all,
I buried an issue in my last post, so let me pop it up. I have a cluster with 10 collections on it. The first collection I loaded works perfectly. But every subsequent collection returns an inconsistent number of results for each query. The queries can be as simple as *:*, or more complex facet queries. If I go to the individual cores and issue the query with distrib=false, I get a consistent number of results.

I am wondering if there is some delay in returning results from my shards, and the queried node just times out and displays the number of results that it has received so far. If there is such a timeout, it must be very small, as my QTime is around 11 ms.
Dave
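A minimal SolrJ sketch of the commitWithin pattern Erick recommends; the URL, document contents and 60-second window are illustrative assumptions:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        batch.add(doc);

        // Ask Solr to make this batch visible within 60 seconds
        // instead of issuing an explicit commit per batch.
        server.add(batch, 60000);

        // ... add the remaining batches the same way ...

        // One hard commit at the very end of the run.
        server.commit();
    }
}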
Re: error opening index solr 4.0 with lukeall-4.0.0-ALPHA.jar
There was a discussion of this a bit ago, but the upshot is that the maintainer hasn't released a version compatible with 4.0 yet. Send him money G...

FWIW,
Erick

On Fri, Nov 16, 2012 at 11:16 AM, Miguel Ángel Martín miguelangel.mar...@brainsins.com wrote:
hi all:

I can't open an index created with Solr 4.0 in Luke version lukeall-4.0.0-ALPHA.jar. I get the error:

Format version is not supported (resource: NIOFSIndexInput(path=/Users/desa/data/index/_2.tvx)): 1 (needs to be between 0 and 0)
  at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:148)
  at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:130)
  at org.apache.lucene.codecs.lucene40.Lucene40TermVectorsReader.<init>(Lucene40TermVectorsReader.java:108)
  at org.apache.lucene.codecs.lucene40.Lucene40TermVectorsFormat.vectorsReader(Lucene40TermVectorsFormat.java:107)
  at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:118)
  at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:55)
  at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
  at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
  at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
  at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
  at org.getopt.luke.Luke.openIndex(Luke.java:967)
  at org.getopt.luke.Luke.openOk(Luke.java:696)
  ... (the remainder of the trace is reflection and AWT event dispatch) ...

Any ideas? I've created another index with Lucene 4.0 and this Luke opens that index fine.

thanks in advance
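Until a compatible Luke release appears, a tiny Lucene 4.0 sketch along these lines can at least verify that the index opens (the path is the one from the stack trace):

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

public class OpenIndexCheck {
    public static void main(String[] args) throws Exception {
        // Open the index with the same Lucene version that wrote it.
        DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(new File("/Users/desa/data/index")));
        System.out.println("docs: " + reader.numDocs());
        reader.close();
    }
}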
Re: Question about Solr Cloud
1> Well, it loads the local conf directory up to zookeeper so new nodes can fetch the configuration and store it locally.

2> No, you have to upload the configuration to ZK and (I think) restart the other servers. It's easy enough to test: just make your changes to the config, upload it, and look at the resulting configs to ensure that the changes have been fetched.

3> No. You can run these shards in the same JVM as far as I know. This is sometimes called "microsharding" or "oversharding" and is a pretty common approach. Search the list; I think there's been some discussion recently on this very topic.

4> Mostly the container you use is determined by which one you're comfortable with. Solr runs on Jetty, Tomcat, JBoss and a host of others. It's just simplest to start OOB with Jetty.

Best,
Erick

On Sat, Nov 17, 2012 at 2:13 AM, Cool Techi cooltec...@outlook.com wrote:
Hi,
I have just started working with Solr cloud and have a few questions related to the same.

1) In the start script we provide the following; what's the purpose of providing this?
-Dbootstrap_confdir=./solr/collection1/conf
Since we don't yet have a config in zookeeper, this parameter causes the local configuration directory ./solr/conf to be uploaded as the "myconf" config. The name "myconf" is taken from the collection.configName param below.
-Dcollection.configName=myconf sets the config to use for the new collection. Omitting this param will cause the config name to default to "configuration1".

2) When we make any changes to the config/schema, do we need to copy it to all the shards running in the cloud manually?

3) If we want to start with 10 shards on 2 machines, anticipating future growth, do all these shards need to run on separate jetty instances?

4) Is there any advantage of running Solr on Jetty rather than Tomcat?

Thanks,
Ayush
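On question 2, here is a sketch of re-uploading a changed config with the ZkCLI tool that ships with Solr 4.0; the classpath and zkhost follow the stock example layout and are assumptions, not taken from Ayush's setup:

# upload the edited conf directory to ZooKeeper under the name "myconf"
java -classpath "example/solr-webapp/webapp/WEB-INF/lib/*" \
     org.apache.solr.cloud.ZkCLI -cmd upconfig \
     -zkhost localhost:9983 \
     -confdir ./solr/collection1/conf \
     -confname myconf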
Re: Solr 4: How to call an updateRequestProcessorChain during the /dataimport?
I would _guess_ (but haven't done this with DIH) that simply putting the bodychain in the update handler (updateHandler class="solr.DirectUpdateHandler2") would do what you want. But that's purely a guess at this point on my part. Anyone want to correct me?

Best
Erick

On Fri, Nov 16, 2012 at 4:50 PM, srinalluri nallurisr...@yahoo.com wrote:
I have a new updateRequestProcessorChain called 'bodychain'. (Please note that CountFieldValuesUpdateProcessorFactory is new in Solr 4.) I want to call this bodychain during the dataimport.

<updateRequestProcessorChain name="bodychain">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">body</str>
    <str name="dest">body_count</str>
  </processor>
  <processor class="solr.CountFieldValuesUpdateProcessorFactory">
    <str name="fieldName">body_count</str>
  </processor>
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">body_count</str>
    <int name="value">0</int>
  </processor>
</updateRequestProcessorChain>

The following is my dataimport handler, which already has an 'update.chain'. I think I can't give more than one update.chain in this handler. Where can I add 'bodychain'?

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

thanks
Srini
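One hedged possibility, untested and not confirmed in this thread: since update.chain names a single chain, fold the dedupe processor into bodychain and point the DIH handler at that combined chain, roughly:

<updateRequestProcessorChain name="bodychain">
  <!-- existing bodychain processors (CloneField, CountFieldValues, DefaultValue) here -->
  <!-- dedupe settings copied over from the existing dedupe chain -->
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <!-- signature configuration from the dedupe chain goes here -->
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="update.chain">bodychain</str>
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>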
Re: Bash Script to start delta import handler
Hi Spadez,

Nabble has helpfully stripped out your script. Maybe don't use Nabble?

Steve

On Nov 16, 2012, at 5:06 PM, Spadez james_will...@hotmail.com wrote:
Hey guys,

I am after a bash script (or Python script) which I can use to trigger a delta import of XML files via cron. After a bit of digging and modification I have this: [script stripped by Nabble]

Can I get any feedback on this? Is there a better way of doing it? Any optimisations or improvements would be most welcome.
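For the archives, a minimal sketch of the kind of cron-driven delta-import script being asked about; the handler URL, log path and schedule are assumptions, since the original script was lost:

#!/bin/bash
# Trigger a DIH delta-import and log the response.
SOLR_URL="http://localhost:8080/solr/dataimport"
LOG_FILE="/var/log/solr-delta-import.log"

# clean=false keeps existing documents; commit=true makes changes visible
RESPONSE=$(curl -s "${SOLR_URL}?command=delta-import&clean=false&commit=true")

echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ${RESPONSE}" >> "${LOG_FILE}"

# A crontab entry to run this every 10 minutes might look like:
# */10 * * * * /opt/solr/bin/delta-import.sh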
Re: Question about Solr Cloud
You can force Solr to use the new configs by reloading a collection:

http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection

This'll cause all shards (and replicas) in a collection to collect new configs from ZooKeeper.

The main thing to note re Jetty is that the Jetty included within Solr is included for ease of demoing Solr, rather than for ease of deployment. Whether you are going to deploy to Tomcat, JBoss or Jetty, you would be best downloading a copy of the container and installing Solr within it (the one embedded doesn't have any startup scripts, nor any maintenance interfaces, etc., all stuff that you'd expect from a servlet container).

Upayavira

On Sat, Nov 17, 2012, at 03:04 PM, Erick Erickson wrote:
<snip: Erick's answers and Ayush's original questions, quoted in full in the earlier message>
Re: Solr/Lucene Tokenizers - cannot get the behavior I need
On 11/16/2012 12:52 PM, Shawn Heisey wrote:
On 11/16/2012 12:36 PM, Jack Krupansky wrote:
Generally, you don't need the preserveOriginal attribute for WDF. Generate both the word parts and the concatenated terms, and queries should work fine without the original. The separated terms will be indexed as a sequence, and the split/separated terms will generate a phrase query that matches the indexed sequence. And if you index the concatenated terms, that can be queried as well. With that issue out of the way, is there a remaining issue here?

You're right, that's handled by catenateWords. I do need preserveOriginal for other things, though; I think it's unimportant for this discussion. I may consider removing it at a later stage, but right now our assessment is that we need it.

The immediate problem is that when ICUTokenizer is done with an input of Word1-Word2, I am left with two tokens, Word1 and Word2. The punctuation in the middle is gone. Even if WDF is the very next thing in the analysis chain, there's nothing for it to do: the fact that Word1 and Word2 were connected by punctuation is entirely lost.

Ideally I would like to see a splitOnPunctuation option on a majority of the available tokenizers, but if a filter were available that did one subset of ICUTokenizer's functionality (splitting tokens on script changes), I would have a solution in combination with WhitespaceTokenizer.

I have been looking at the source code related to ICUTokenizer, trying to get a handle on how it works. Based on what I've learned so far, I'm not sure that punctuation can be ignored in the way that I need. If someone knows it well enough to comment, I would love to know for sure.

Thanks,
Shawn
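For readers following along, a sketch of the WDF setup under discussion; the field type name is invented and the attribute values are the ones Jack and Shawn mention, not Shawn's actual schema:

<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Whitespace tokenization leaves "Word1-Word2" intact for WDF to split. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

This indexes the word parts (Word1, Word2), the catenated form (Word1Word2), and the original hyphenated token (Word1-Word2).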
Re: Question about Solr Cloud
bq. fetch the configuration and store it locally. New nodes don't fetch the configs and store them locally - configs are loaded straight from zookeeper currently. - Mark
Re: Solr/Lucene Tokenizers - cannot get the behavior I need
On 11/16/2012 12:30 PM, Shawn Heisey wrote:
I am extremely interested in the Unicode behavior of ICUTokenizer, but I cannot disable the punctuation-splitting behavior and let WDF handle it properly, which causes recall problems. There is no filter that I can run after tokenization, either. Looking at ICUTokenizer.java, I do not see any way to write my own tokenizer that does what I need.

I have this problem with pretty much all of the tokenizers other than Whitespace. There are situations where I would like to use some of the others, but the punctuation-splitting behavior is a major problem for me. Do I have any options? I have never looked at the ICU code from IBM, so I don't know if it would require major surgery there.

Related problem: the entire reason I started down this path is that I'd like to handle CJK better with CJKBigramFilter. It appears that unless you use StandardTokenizer, ClassicTokenizer, or ICUTokenizer, CJKBigramFilter doesn't work ... but none of these tokenizers will handle punctuation right for me. I seem to remember a discussion some time ago saying that a future version of CJKBigramFilter would drop the requirement that each token be tagged. Do I need to file an issue about this, and/or start a new discussion thread?

Thanks,
Shawn
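For context, the stock Solr 4.0 example schema pairs CJKBigramFilter with StandardTokenizer roughly like this, which is the coupling Shawn describes (a sketch of the shipped text_cjk type, not his configuration):

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalize width differences between halfwidth and fullwidth forms -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- forms bigrams from the CJK-typed tokens emitted by the tokenizer -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>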
Re: HasSingleNormFile in solr
Manish,
You need to set hasSingleNormFile=0 in the schema.

On Sun, Nov 18, 2012 at 9:11 AM, Manish Bafna manish.bafna...@gmail.com wrote:
Hi,
I need to disable HasSingleNormFile in Solr, so that multiple norm files are created. Can anyone please provide information on how to disable this in Solr?

If HasSingleNormFile is 1, then the field norms are written as a single joined file (with extension .nrm); if it is 0, then each field's norms are stored as separate .fN files. See "Normalization Factors" below for details.

Thanks,
Manish.
Re: Solr Delta Import Handler not working
I think this means the pattern did not match any files:

<str name="Total Rows Fetched">0</str>

The wiki example includes a '^' at the beginning of the filename pattern, anchoring it so that it matches the complete filename:

http://wiki.apache.org/solr/DataImportHandler#Transformers_Example

More: add rootEntity="true". It cannot hurt to be explicit. The date format also needs a 'T' instead of a space:

http://en.wikipedia.org/wiki/ISO_8601

Cheers!

----- Original Message -----
| From: Spadez james_will...@hotmail.com
| To: solr-user@lucene.apache.org
| Sent: Saturday, November 17, 2012 2:49:30 PM
| Subject: Solr Delta Import Handler not working
|
| Hi,
|
| These are the exact steps that I have taken to try and get the delta
| import handler working. If I can provide any more information to help,
| let me know. I have literally spent the entire Friday night and today
| on this, and I throw in the towel. Where have I gone wrong?
|
| Added this line to the solrconfig:
|
| <requestHandler name="/dataimport"
|     class="org.apache.solr.handler.dataimport.DataImportHandler">
|   <lst name="defaults">
|     <str name="config">/home/solr/data-config.xml</str>
|   </lst>
| </requestHandler>
|
| Then my data-config.xml looks like this:
|
| <dataConfig>
|   <dataSource type="FileDataSource" />
|   <document>
|     <entity
|         name="document"
|         processor="FileListEntityProcessor"
|         baseDir="/var/lib/data"
|         fileName=".*.xml$"
|         recursive="false"
|         rootEntity="false"
|         dataSource="null">
|       <entity
|           processor="XPathEntityProcessor"
|           url="${document.fileAbsolutePath}"
|           useSolrAddSchema="true"
|           stream="true">
|       </entity>
|     </entity>
|   </document>
| </dataConfig>
|
| Then in my /var/lib/data folder I have a data.xml file that looks like
| this:
|
| <add>
|   <doc>
|     <field name="id">123</field>
|     <field name="description">This is my long description</field>
|     <field name="company">Google</field>
|     <field name="location_name">England</field>
|     <field name="date">2007-12-31 22:29:59</field>
|     <field name="source">Google</field>
|     <field name="url">www.google.com</field>
|     <field name="latlng">45.17614,45.17614</field>
|   </doc>
| </add>
|
| Finally, I then ran this command:
|
| http://localhost:8080/solr/dataimport?command=delta-import&clean=false
|
| And I get this result (failed):
|
| <response>
|   <lst name="responseHeader">
|     <int name="status">0</int>
|     <int name="QTime">1</int>
|   </lst>
|   <lst name="initArgs">
|     <lst name="defaults">
|       <str name="config">/opt/solr/example/solr/conf/data-config.xml</str>
|     </lst>
|   </lst>
|   <str name="command">delta-import</str>
|   <str name="status">idle</str>
|   <str name="importResponse"/>
|   <lst name="statusMessages">
|     <str name="Time Elapsed">0:15:9.543</str>
|     <str name="Total Requests made to DataSource">0</str>
|     <str name="Total Rows Fetched">0</str>
|     <str name="Total Documents Processed">0</str>
|     <str name="Total Documents Skipped">0</str>
|     <str name="Delta Dump started">2012-11-17 17:32:56</str>
|     <str name="Identifying Delta">2012-11-17 17:32:56</str>
|     <str name="">Indexing failed. Rolled back all changes.</str>
|     <str name="Rolledback">2012-11-17 17:32:56</str>
|   </lst>
|   <str name="WARNING">
|     This response format is experimental. It is likely to change in the
|     future.
|   </str>
| </response>
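Putting the date advice into practice, the field would become (assuming the timestamp is UTC; Solr's date fields expect the trailing 'Z'):

<field name="date">2007-12-31T22:29:59Z</field>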