Re: Synonym/Tokenizer for Hyphenated Words

2012-11-17 Thread Erick Erickson
What does "having a problem" mean? Index time? Query time?

But your problem is most likely the tokenizer as you suspect. Try something
like WhitespaceTokenizer and build up from there.

Three friends:
1 admin/analysis page
2 admin/schema-browser
3 debugQuery=on
The first will show you what happens to tokens _after_ they get through
the tokenization. Be aware that this probably isn't entirely helpful when
your problem is in the tokenization step itself.

The second shows you what terms are actually in your index.

The third shows you what your parsed query looks like.

Couple of other things:
1 there's no need to put in all the capitalization forms _if_ you put a
LowerCaseFilter in front of your synonym filter.
2 WhitespaceTokenizer is pretty simple. For instance, punctuation will be
part of the tokens (e.g. periods at the end of sentences). So it's a place
to _start_, but you'll have to think about what you really want from your
tokenization process before deciding (see the sketch below).
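
As a very rough, untested sketch (the field type name is made up, and with
the LowerCaseFilter in front you only need the lowercase synonym form):

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

and in synonyms.txt just:

antisemitism,anti-semitism

Since the whitespace tokenizer leaves anti-semitism as a single token, the
synonym entry at least has a chance to match it.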

Best
Erick


On Thu, Nov 15, 2012 at 12:38 PM, Nathan Tallman ntall...@gmail.com wrote:

 Hello Solr users,

 I use Solr 3.5 via Vufind 1.3 and am having a problem with a synonym. No
 matter what syntax I use, it doesn't seem to have an effect. (See the
 various combinations below.)
 combinations below.)


 antisemitism,anti-semitism,Antisemitism,Anti-Semitism,Anti-semitism,anti-Semitism


 antisemitism,anti\-semitism,Antisemitism,Anti\-Semitism,Anti\-semitism,anti\-Semitism

 antisemitism,anti semitism,Antisemitism,Anti Semitism,Anti semitism,anti Semitism

 It was suggested to me that this was not a synonym issue but a tokenizing
 issue, because anti-semitism was being interpreted as anti semitism.

 Does anyone have any suggestions for making the synonym work? Tweaking the
 tokenizer in schema.xml? Or somehow escaping the hyphen in synonyms.txt?

 Many thanks,
 Nathan



Re: High Slave CPU Intermittently After Replication

2012-11-17 Thread Erick Erickson
that's very strange. How much memory are you giving the JVM? And how much
memory is on your machine?

If your index is cutting in half on optimize, then it sounds like you're
re-indexing everything. Optimize will squeeze out all the data left around
by document deletes or updates, so the only reason I can imagine that your
index drops by 50% is if you've replaced every document that was there
originally. And I'd also guess that you don't have enough activity to
trigger merges often enough to squeeze out the deleted documents' data.

But this sounds ever so much like you're running with not much memory and
are getting into heavy swapping or something like that as your index crosses
some threshold.

But that's just a guess.

Best
Erick


On Fri, Nov 16, 2012 at 10:23 AM, richardg richa...@dvdempire.com wrote:

 We tried using the mergeFactor setting, but our CPU load/slow query time
 issues were more widespread; optimizing the index always alleviated the
 issue, which is why we are using it now.  Our index is 2 GB when optimized
 and would balloon to over 4 GB, so we thought the issue was that it was
 getting too big.
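
 For context, the mergeFactor in question is set in solrconfig.xml; a
 minimal sketch with the value we tried (Solr 4 puts it under indexConfig):

 <indexConfig>
   <mergeFactor>2</mergeFactor>
 </indexConfig>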

 I notice a small spike in CPU load after every replication, but after a
 couple of seconds load returns to normal (which is less than 25%). It is
 just sometimes (once in the last week) that it spikes and stays high (10
 minutes) until I optimize the index. Before I started optimizing the index
 after every commit, the issue would occur more often.

 We would like to stop optimizing and rely on the built-in merging, but we
 had tried that before and the issue occurred more often.  We were thinking
 of trying a mergeFactor of 2 again, but I'm afraid the issue will return.

 I installed SPM and am monitoring it to see if it tells me anything; I can
 post the results on Monday and hopefully it will tell us something.

 At this time we aren't warming any caches; we weren't sure if this was an
 issue because our slowdowns weren't happening every time. Also, we are
 using the join functionality of Solr 4, if that helps.

 Thanks for your help






Re: inconsistent number of results returned in solr cloud

2012-11-17 Thread Erick Erickson
Hmmm, first an aside. If by "commit after every batch of documents" you
mean after every call to server.add(doclist), there's no real need to do
that unless you're striving for really low latency. The usual
recommendation is to use commitWithin when adding and commit only at the
very end of the run. This shouldn't actually be germane to your issue, just
an FYI.
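
For reference, in the XML update format that's just an attribute on the add
command; a small sketch (the 10-second window and the field are only
illustrative):

<add commitWithin="10000">
  <doc>
    <field name="id">example-1</field>
    <!-- remaining fields as usual -->
  </doc>
</add>

SolrJ exposes the same idea via a commitWithin argument on add().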

So you're saying that the inconsistency is permanent? By that I mean it
keeps coming back inconsistently for minutes/hours/days?

I guess if I were trying to test this I'd need to know how you added
subsequent collections. In particular what you did re: zookeeper as you
added each collection.

Best
Erick


On Fri, Nov 16, 2012 at 2:55 PM, Buttler, David buttl...@llnl.gov wrote:

 My typical way of adding documents is through SolrJ, where I commit after
 every batch of documents (where the batch size is configurable)

 I have now tried committing several times, from the command line (curl)
 with and without openSearcher=true.  It does not affect anything.

 Dave

 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: Friday, November 16, 2012 11:04 AM
 To: solr-user@lucene.apache.org
 Subject: Re: inconsistent number of results returned in solr cloud

 How did you do the final commit? Can you try a lone commit (with
 openSearcher=true) and see if that affects things?

 Trying to determine if this is a known issue or not.

 - Mark

 On Nov 16, 2012, at 1:34 PM, Buttler, David buttl...@llnl.gov wrote:

  Hi all,
  I buried an issue in my last post, so let me pop it up.
 
  I have a cluster with 10 collections on it.  The first collection I
 loaded works perfectly.  But every subsequent collection returns an
 inconsistent number of results for each query.  The queries can be simply
 *:*, or more complex facet queries.  If I go to individual cores and issue
 the query, with distrib=false, I get a consistent number of results.  I am
 wondering if there is some delay in returning results from my shards, and
 the queried node just times out and displays the number of results that it
 has received so far.  If there is such a timeout, it must be very small, as
 my QTime is around 11 ms.
 
  Dave




Re: error opening index solr 4.0 with lukeall-4.0.0-ALPHA.jar

2012-11-17 Thread Erick Erickson
There was a discussion of this a bit ago, but the upshot is that the
maintainer hasn't released a version compatible with 4.0 yet. Send him
money <G>...

FWIW,
Erick


On Fri, Nov 16, 2012 at 11:16 AM, Miguel Ángel Martín 
miguelangel.mar...@brainsins.com wrote:

 Hi all:

 I can't open an index created with Solr 4.0 with Luke version
 lukeall-4.0.0-ALPHA.jar.

 I get this error:

 Format version is not supported (resource:
 NIOFSIndexInput(path=/Users/desa/data/index/_2.tvx)): 1 (needs to be
 between 0 and 0)
  at
 org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:148)
 at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:130)
  at

 org.apache.lucene.codecs.lucene40.Lucene40TermVectorsReader.<init>(Lucene40TermVectorsReader.java:108)
 at

 org.apache.lucene.codecs.lucene40.Lucene40TermVectorsFormat.vectorsReader(Lucene40TermVectorsFormat.java:107)
  at

 org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:118)
 at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:55)
  at

 org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
 at

 org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
  at

 org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
 at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
  at org.getopt.luke.Luke.openIndex(Luke.java:967)
 at org.getopt.luke.Luke.openOk(Luke.java:696)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
  at thinlet.Thinlet.invokeImpl(Thinlet.java:4579)
 at thinlet.Thinlet.invoke(Thinlet.java:4546)
  at thinlet.Thinlet.handleMouseEvent(Thinlet.java:3937)
 at thinlet.Thinlet.processEvent(Thinlet.java:2917)
  at java.awt.Component.dispatchEventImpl(Component.java:4744)
 at java.awt.Container.dispatchEventImpl(Container.java:2141)
  at java.awt.Component.dispatchEvent(Component.java:4572)
 at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4619)
  at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4280)
 at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4210)
  at java.awt.Container.dispatchEventImpl(Container.java:2127)
 at java.awt.Window.dispatchEventImpl(Window.java:2489)
  at java.awt.Component.dispatchEvent(Component.java:4572)
 at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:704)
  at java.awt.EventQueue.access$400(EventQueue.java:82)
 at java.awt.EventQueue$2.run(EventQueue.java:663)
  at java.awt.EventQueue$2.run(EventQueue.java:661)
 at java.security.AccessController.doPrivileged(Native Method)
  at

 java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
 at

 java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
  at java.awt.EventQueue$3.run(EventQueue.java:677)
 at java.awt.EventQueue$3.run(EventQueue.java:675)
  at java.security.AccessController.doPrivileged(Native Method)
 at

 java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
  at java.awt.EventQueue.dispatchEvent(EventQueue.java:674)
 at

 java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:296)
  at

 java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:211)
 at

 java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:201)
  at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:196)
 at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:188)
  at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)


 any ideas?


 I've created another index with Lucene 4.0 and this Luke opens that index
 fine.

 thanks in advance



Re: Question about Solr Cloud

2012-11-17 Thread Erick Erickson
1 Well, it loads the local conf directory up to ZooKeeper so new nodes can
fetch the configuration and store it locally.
2 No, you have to upload the configuration to ZK and (I think) restart the
other servers. It's easy enough to test: just make your changes to the
config, upload it, and look at the resulting configs to ensure that the
changes have been fetched.
3 No. You can run these shards in the same JVM as far as I know. This is
sometimes called microsharding or oversharding and is a pretty common
approach. Search the list; I think there's been some discussion recently on
this very topic.
4 Mostly the container you use is determined by which one you're
comfortable with. Solr runs on Jetty, Tomcat, JBoss and a host of others.
It's just simplest to start OOB with Jetty.

Best,
Erick


On Sat, Nov 17, 2012 at 2:13 AM, Cool Techi cooltec...@outlook.com wrote:

 Hi,

 I have just started working with Solr cloud and have a few questions
 related to the same,

 1) In the start script we provide the following; what's the purpose of
 providing this?
 -Dbootstrap_confdir=./solr/collection1/conf -- Since we don't yet have a
 config in zookeeper, this parameter causes the local configuration
 directory ./solr/conf to be uploaded as the myconf config. The name
 myconf is taken from the collection.configName param below.
 -Dcollection.configName=myconf -- sets the config to use for the new
 collection. Omitting this param will cause the config name to default to
 configuration1.
 2) When we make any changes to the config/schema, do we need to copy them
 to all the shards running in the cloud manually?
 3) If we want to start with 10 shards on 2 machines, anticipating future
 growth, do all these shards need to run on separate Jetty instances?
 4) Is there any advantage of running Solr on Jetty rather than Tomcat?

 Thanks,
 Ayush




Re: Solr 4:How to call a updateRequestProcessorChain during the /dataimport?

2012-11-17 Thread Erick Erickson
I would _guess_ (but haven't done this with DIH) that simply putting
the bodychain in the update handler (<updateHandler
class="solr.DirectUpdateHandler2">) would do what you want.

But that's purely a guess at  this point on my part.

Anyone want to correct me?
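
Another thought, since update.chain takes a single chain name: you could
fold everything into one chain and point /dataimport at it. A rough,
untested sketch (whatever processors your dedupe chain has would need to
move in too, and the chain should end with RunUpdateProcessorFactory so
documents actually get indexed):

<updateRequestProcessorChain name="bodychain">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">body</str>
    <str name="dest">body_count</str>
  </processor>
  <processor class="solr.CountFieldValuesUpdateProcessorFactory">
    <str name="fieldName">body_count</str>
  </processor>
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">body_count</str>
    <int name="value">0</int>
  </processor>
  <!-- dedupe processors would go here -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="update.chain">bodychain</str>
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>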

Best
Erick


On Fri, Nov 16, 2012 at 4:50 PM, srinalluri nallurisr...@yahoo.com wrote:

 I have a new updateRequestProcessorChain called 'bodychain'.  (Please note
 CountFieldValuesUpdateProcessorFactory is new in Solr 4). I want to call
 this bodychain during the dataimport.

 <updateRequestProcessorChain name="bodychain">
   <processor class="solr.CloneFieldUpdateProcessorFactory">
     <str name="source">body</str>
     <str name="dest">body_count</str>
   </processor>
   <processor class="solr.CountFieldValuesUpdateProcessorFactory">
     <str name="fieldName">body_count</str>
   </processor>
   <processor class="solr.DefaultValueUpdateProcessorFactory">
     <str name="fieldName">body_count</str>
     <int name="value">0</int>
   </processor>
 </updateRequestProcessorChain>

 Following is my dataimport handler, which already has 'update.chain'.  I
 don't think I can give more than one update.chain in this handler. How can
 I add 'bodychain'?

 <requestHandler name="/dataimport"
     class="org.apache.solr.handler.dataimport.DataImportHandler">
   <lst name="defaults">
     <str name="update.chain">dedupe</str>
     <str name="config">data-config.xml</str>
   </lst>
 </requestHandler>

 thanks
 Srini






Re: Bash Script to start delta import handler

2012-11-17 Thread Steve Rowe
Hi Spadez,

Nabble has helpfully stripped out your script.  Maybe don't use Nabble?

Steve

On Nov 16, 2012, at 5:06 PM, Spadez james_will...@hotmail.com wrote:

 Hey guys,
 
 I am after a bash script (or python script) which I can use to trigger a
 delta import of XML files via CRON. After a bit of digging and modification
 I have this:
 
 
 
 Can I get any feedback on this? Is there a better way of doing it? Any
 optimisations or improvements would be most welcome.
 
 
 



Re: Question about Solr Cloud

2012-11-17 Thread Upayavira
You can force Solr to use the new configs by reloading a collection:

http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection

This'll cause all shards (and replicas) in a collection to collect new
configs from ZooKeeper.

The main thing to note re Jetty is that the Jetty included within Solr
is included for ease of demoing Solr, rather than for ease of
deployment. Whether you are going to deploy to Tomcat, JBoss or Jetty,
you would be best downloading a copy of the container, and installing
Solr within it (the one embedded doesn't have any startup scripts, nor
any maintenance interfaces, etc, all stuff that you'd expect from a
servlet container).

Upayavira



 


Re: Solr/Lucene Tokenizers - cannot get the behavior I need

2012-11-17 Thread Shawn Heisey

On 11/16/2012 12:52 PM, Shawn Heisey wrote:

On 11/16/2012 12:36 PM, Jack Krupansky wrote:
Generally, you don't need the preserveOriginal attribute for WDF. 
Generate both the word parts and the concatenated terms, and queries 
should work fine without the original. The separated terms will be 
indexed as a sequence, and the split/separated terms will generate a 
phrase query that matches the indexed sequence. And if you index the 
concatenated terms, that can be queried as well.


With that issue out of the way, is there a remaining issue here?


You're right, that's handled by catenateWords.  I do need 
preserveOriginal for other things, though.  I think it's unimportant 
for this discussion.  I may consider removing it at a later stage, but 
right now our assessment is that we need it.
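
For anyone following along, these are the WDF options in question; a rough
schema.xml sketch, with the flag values purely illustrative:

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        catenateWords="1"
        preserveOriginal="1"/>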


The immediate problem is that when ICUTokenizer is done with an input 
of Word1-Word2 I am left with two tokens, Word1 and Word2.  The 
punctuation in the middle is gone.  Even if WDF is the very next thing 
in the analysis chain, there's nothing for it to do - the fact that 
Word1 and Word2 were connected by punctuation is entirely lost.


Ideally I would like to see a splitOnPunctuation option on a majority 
of available tokenizers, but if a filter were available that did one 
subset of ICUTokenizer's functionality - splitting tokens on script 
changes - I would have a solution in combination with WhiteSpaceTokenizer.


I have been looking at the source code related to ICUTokenizer, trying 
to get a handle on how it works.  Based on what I've learned so far, I'm 
not sure that punctuation can be ignored in the way that I need.  If 
someone knows it well enough to comment, I would love to know for sure.


Thanks,
Shawn



Re: Question about Solr Cloud

2012-11-17 Thread Mark Miller
bq. fetch the configuration and store it locally.

New nodes don't fetch the configs and store them locally - configs are
loaded straight from zookeeper currently.

- Mark


Re: Solr/Lucene Tokenizers - cannot get the behavior I need

2012-11-17 Thread Shawn Heisey

On 11/16/2012 12:30 PM, Shawn Heisey wrote:
I am extremely interested in the Unicode behavior of ICUTokenizer, but 
I cannot disable the punctuation-splitting behavior and let WDF handle 
it properly, which causes recall problems.  There is no filter that I 
can run after tokenization, either.  Looking at ICUTokenizer.java, I 
do not see any way to write my own tokenizer that does what I need.


I have this problem with pretty much all of the tokenizers other than 
Whitespace.  There are situations where I would like to use some of 
the others, but the punctuation-splitting behavior is a major problem 
for me.


Do I have any options?  I have never looked at the ICU code from IBM, 
so I don't know if it would require major surgery there.


Related problem: The entire reason I started down this path is because 
I'd like to handle CJK better with CJKBigramFilter.  It appears that 
unless you use StandardTokenizer, ClassicTokenizer, or ICUTokenizer, 
CJKBigramFilter doesn't work ... but none of these tokenizers will 
handle punctuation right for me.
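
For reference, the sort of chain I mean; an untested sketch (the field type
name is made up, and ICUTokenizerFactory comes from the analysis-extras
contrib):

<fieldType name="text_cjk_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>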


I seem to remember a discussion some time ago around this, saying that a 
future version of CJKBigramFilter would drop the requirement that each 
token be tagged.


Do I need to file an issue about this, and/or start a new discussion thread?

Thanks,
Shawn



Re: HasSingleNormFile in solr

2012-11-17 Thread geetha anjali
Manish,
Need to set hasSingleNormFile=0 in the schema.

On Sun, Nov 18, 2012 at 9:11 AM, Manish Bafna manish.bafna...@gmail.com wrote:

 Hi,
 I need to disable HasSingleNormFile in Solr, so that multiple norm files
 are created. Can anyone please provide information on how to disable this
 in Solr.


 If HasSingleNormFile is 1, then the field norms are written as a single
 joined file (with extension .nrm); if it is 0 then each field's norms are
 stored as separate .fN files. See Normalization Factors below for
 details.

 Thanks,
 Manish.



Re: Solr Delta Import Handler not working

2012-11-17 Thread Lance Norskog
I think this means the pattern did not match any files:
<str name="Total Rows Fetched">0</str>

The wiki example includes a '^' at the beginning of the filename pattern. This 
matches a complete line. 
http://wiki.apache.org/solr/DataImportHandler#Transformers_Example

More:
Add rootEntity=true. It cannot hurt to be explicit.

The date format needs a 'T' instead of a space:
http://en.wikipedia.org/wiki/ISO_8601
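
Concretely, the pattern and date fixes would look roughly like this (untested):

In data-config.xml:
  fileName="^.*\.xml$"

In the data file:
  <field name="date">2007-12-31T22:29:59</field>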

Cheers!

- Original Message -
| From: Spadez james_will...@hotmail.com
| To: solr-user@lucene.apache.org
| Sent: Saturday, November 17, 2012 2:49:30 PM
| Subject: Solr Delta Import Handler not working
| 
| Hi,
| 
| These are the exact steps that I have taken to try to get the delta import
| handler working. If I can provide any more information to help, let me
| know. I have literally spent the entire Friday night and today on this and
| I throw in the towel. Where have I gone wrong?
| 
| *Added this line to the solrconfig:*
| <requestHandler name="/dataimport"
|     class="org.apache.solr.handler.dataimport.DataImportHandler">
|   <lst name="defaults">
|     <str name="config">/home/solr/data-config.xml</str>
|   </lst>
| </requestHandler>
| 
| *Then my data-config.xml looks like this:*
| <dataConfig>
|   <dataSource type="FileDataSource" />
|   <document>
|     <entity
|       name="document"
|       processor="FileListEntityProcessor"
|       baseDir="/var/lib/data"
|       fileName=".*.xml$"
|       recursive="false"
|       rootEntity="false"
|       dataSource="null">
|       <entity
|         processor="XPathEntityProcessor"
|         url="${document.fileAbsolutePath}"
|         useSolrAddSchema="true"
|         stream="true">
|       </entity>
|     </entity>
|   </document>
| </dataConfig>
| 
| *Then in my var/lib/data folder I have a data.xml file that looks like
| this:*
| <add>
|   <doc>
|     <field name="id">123</field>
|     <field name="description">This is my long description</field>
|     <field name="company">Google</field>
|     <field name="location_name">England</field>
|     <field name="date">2007-12-31 22:29:59</field>
|     <field name="source">Google</field>
|     <field name="url">www.google.com</field>
|     <field name="latlng">45.17614,45.17614</field>
|   </doc>
| </add>
| 
| *Finally I then ran this command:*
| http://localhost:8080/solr/dataimport?command=delta-import&clean=false
| 
| *And I get this result (failed):*
| <response>
|   <lst name="responseHeader">
|     <int name="status">0</int>
|     <int name="QTime">1</int>
|   </lst>
|   <lst name="initArgs">
|     <lst name="defaults">
|       <str name="config">/opt/solr/example/solr/conf/data-config.xml</str>
|     </lst>
|   </lst>
|   <str name="command">delta-import</str>
|   <str name="status">idle</str>
|   <str name="importResponse"/>
|   <lst name="statusMessages">
|     <str name="Time Elapsed">0:15:9.543</str>
|     <str name="Total Requests made to DataSource">0</str>
|     <str name="Total Rows Fetched">0</str>
|     <str name="Total Documents Processed">0</str>
|     <str name="Total Documents Skipped">0</str>
|     <str name="Delta Dump started">2012-11-17 17:32:56</str>
|     <str name="Identifying Delta">2012-11-17 17:32:56</str>
|     <str name="">Indexing failed. Rolled back all changes.</str>
|     <str name="Rolledback">2012-11-17 17:32:56</str>
|   </lst>
|   <str name="WARNING">
|     This response format is experimental. It is likely to change in the
|     future.
|   </str>
| </response>
| 
| 
| 
| 
| 
| --
| View this message in context:
| 
http://lucene.472066.n3.nabble.com/Solr-Delta-Import-Handler-not-working-tp4020897.html
| Sent from the Solr - User mailing list archive at Nabble.com.
|