Re: solr 3.5 taking long to index
There were some changes in solrconfig.xml between solr3.1 and solr3.5. Always read CHANGES.txt when switching to a new version. Also helpful is comparing both versions of solrconfig.xml from the examples. Are you sure you need a MaxPermSize of 5g? Use jvisualvm to see what you really need. The same goes for all the other JAVA_OPTS.
On 11.04.2012 19:42, Rohit wrote:
We recently migrated from solr3.1 to solr3.5; we have one master and one slave configured. The master has two cores,
1) Core1 - 44555972 documents
2) Core2 - 29419244 documents
We commit every 5000 documents, but lately the commit is taking very long, 15 minutes plus in some cases. What could have caused this? I have checked the logs and the only warning I can see is:
WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version.
Memory details:
export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g
Solr Config:
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
<maxFieldLength>1</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>1</commitLockTimeout>
What could be causing this, as everything was running fine a few days back?
Regards, Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg
Re: Multi-words synonyms matching
Oh, that's right. Thanks a lot, Elisabeth
2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com:
Elisabeth - As you described, the mapping below might suit your need:
mairie => hotel de ville, mairie
mairie gets expanded to hotel de ville and mairie at index time. So both mairie and hotel de ville are searchable on the document. However, the white space tokenizer splitting at query time will still be a problem, as described by Markus.
--Jeevanandam
On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:
Have you tried the '=>' mapping instead? Something like hotel de ville => mairie might work for you.
Yes, thanks, I've tried it, but from what I understand it doesn't solve my problem, since this means hotel de ville will be replaced by mairie at index time (I use synonyms only at index time). So when the user asks for hôtel de ville, it won't match. In fact, at index time I have mairie in my data, but I want the user to be able to request mairie or hôtel de ville and have mairie as an answer, and not have mairie as an answer when requesting hôtel.
To map `mairie` to `hotel de ville` as a single token you must escape your white space. mairie, hotel\ de\ ville This results in a problem if your tokenizer splits on white space at query time.
Ok, I guess this means I have a problem. No simple solution, since at query time my tokenizer does split on white spaces. I guess my problem is more or less one of the problems discussed in http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
Thanks a lot for your answers, Elisabeth
2012/4/10 Erick Erickson erickerick...@gmail.com:
Have you tried the '=>' mapping instead? Something like hotel de ville => mairie might work for you. Best Erick
On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit elisaelisael...@gmail.com wrote:
Hello, I've read several posts on this issue, but can't find a real solution to my multi-word synonyms matching problem. I have in my synonyms.txt an entry like
mairie, hotel de ville
and my index time analyzer is configured as follows for synonyms:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
The problem I have is that now mairie matches with hotel, and I would only want mairie to match with hotel de ville and mairie. When I look into the analyzer, I see that mairie is mapped onto hotel, and the words de ville are added in second and third position. To change that, I tried to do
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
(as I read in one post) and I can see now in the analyzer that mairie is mapped to hotel de ville, but now when I query hotel de ville, it doesn't match at all with mairie. Does anyone have a clue what I'm doing wrong? I'm using Solr 3.4.
Thanks, Elisabeth
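For readers landing on this thread, a minimal sketch of the index-time expansion Jeevanandam describes; only the SynonymFilterFactory line and the synonyms.txt entry come from the thread, the tokenizer shown is an assumption:
# synonyms.txt, applied at index time only
mairie => hotel de ville, mairie

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
</analyzer>
As noted above, a query-time tokenizer that splits on whitespace will still break hôtel de ville into separate terms, so the multi-word side of the mapping remains the hard part.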
Re: Solr 3.5 takes very long to commit gradually
What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain?
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
On 12. apr. 2012, at 06:45, Rohit wrote:
We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores,
1) Core1 - 44555972 documents
2) Core2 - 29419244 documents
We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is,
WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version.
Memory details:
export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g
Solr Config:
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
<maxFieldLength>1</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>1</commitLockTimeout>
Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back?
Regards, Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg
Re: Solr 3.5 takes very long to commit gradually
Hi Rohit, What would be the average size of your documents and also can you please share your idea of having 2 cores in the master. I just wanted to know the reasoning behind the design. Thanks in advance Tirthankar On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote: What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. *
Re: Solr 3.5 takes very long to commit gradually
Hi Rohit, Can you please check the solrconfig.xml in 3.5 and compare it with 3.1 if there are any warming queries specified while opening the searchers after a commit. Thanks, Tirthankar On Apr 12, 2012, at 3:30 AM, Tirthankar Chatterjee wrote: Hi Rohit, What would be the average size of your documents and also can you please share your idea of having 2 cores in the master. I just wanted to know the reasoning behind the design. Thanks in advance Tirthankar On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote: What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. *
RE: Solr 3.5 takes very long to commit gradually
Hi Tirthankar, The average size of documents would be a few Kb's this is mostly tweets which are being saved. The two cores are storing different kind of data and nothing else. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] Sent: 12 April 2012 13:14 To: solr-user@lucene.apache.org Subject: Re: Solr 3.5 takes very long to commit gradually Hi Rohit, What would be the average size of your documents and also can you please share your idea of having 2 cores in the master. I just wanted to know the reasoning behind the design. Thanks in advance Tirthankar On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote: What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. *
RE: Solr 3.5 takes very long to commit gradually
Operating system in linux ubuntu. No not using spellchecker Only language detection in my update chain. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Jan Høydahl [mailto:jan@cominvent.com] Sent: 12 April 2012 12:50 To: solr-user@lucene.apache.org Subject: Re: Solr 3.5 takes very long to commit gradually What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg
Re: Solr 3.5 takes very long to commit gradually
thanks Rohit.. for the information. On Apr 12, 2012, at 4:08 AM, Rohit wrote: Hi Tirthankar, The average size of documents would be a few Kb's this is mostly tweets which are being saved. The two cores are storing different kind of data and nothing else. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] Sent: 12 April 2012 13:14 To: solr-user@lucene.apache.org Subject: Re: Solr 3.5 takes very long to commit gradually Hi Rohit, What would be the average size of your documents and also can you please share your idea of having 2 cores in the master. I just wanted to know the reasoning behind the design. Thanks in advance Tirthankar On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote: What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. * **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. *
Problem to integrate Solr in Jetty (the first example in the Apache Solr 3.1 Cookbook)
Hi, I'm using Apache Solr 3.5.0 and Jetty 8.1.2 with Windows 7. (Versions in the Book used... Solr 3.1, Jetty 6.1.26) I've tried to get Solr running with Jetty. - I copied the jetty.xml and the webdefault.xml from the example Solr. - I copied the solr.war to webapps - I copied the solr directory from the example dir to the jetty dir. When I try to start I get this error message: C:\\jetty-solrjava -jar start.jar java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.eclipse.jetty.start.Main.invokeMain(Main.java:457) at org.eclipse.jetty.start.Main.start(Main.java:602) at org.eclipse.jetty.start.Main.main(Main.java:82) Caused by: java.lang.ClassNotFoundException: org.mortbay.jetty.Server at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at org.eclipse.jetty.util.Loader.loadClass(Loader.java:92) at org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.nodeClass(XmlConfiguration.java:349) at org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.configure(XmlConfiguration.java:327) at org.eclipse.jetty.xml.XmlConfiguration.configure(XmlConfiguration.java:291) at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1203) at java.security.AccessController.doPrivileged(Native Method) at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138) ... 7 more Usage: java -jar start.jar [options] [properties] [configs] java -jar start.jar --help # for more information Thanks for your help, Bastian
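The ClassNotFoundException for org.mortbay.jetty.Server suggests the jetty.xml copied from the Solr example was written for Jetty 6, which lived in the org.mortbay packages, while Jetty 8 uses org.eclipse.jetty. A sketch of the difference in the top-level element of jetty.xml (the rest of the file needs the same treatment, class names below are the standard Jetty ones, not taken from this post):
<!-- Jetty 6 style, as shipped with the Solr example -->
<Configure id="Server" class="org.mortbay.jetty.Server">

<!-- Jetty 8 equivalent -->
<Configure id="Server" class="org.eclipse.jetty.server.Server">
An alternative is to keep running Solr with the Jetty bundled in the Solr example directory instead of porting its configuration files to a newer Jetty.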
Re: Facets involving multiple fields
Hi, Thanks for your answer. Let's say I have to fields : 'keywords' and 'short_title'. For these fields I'd like to make a faceted search : if 'Computer' is stored in at least one of these fields for a document I'd like to get it added in my results. doc1 = keywords : 'Computer' / short_title : 'Computer' doc2 = keywords : 'Computer' doc3 = short_title : 'Computer' In this case I'd like to have : Computer (3) I don't see how to solve this with facet.query. Thanks, Marc. On Wed, Apr 11, 2012 at 5:13 PM, Erick Erickson erickerick...@gmail.com wrote: Have you considered facet.query? You can specify an arbitrary query to facet on which might do what you want. Otherwise, I'm not sure what you mean by faceted search using two fields. How should these fields be combined into a single facet? What that means practically is not at all obvious from your problem statement. Best Erick On Tue, Apr 10, 2012 at 8:55 AM, Marc SCHNEIDER marc.schneide...@gmail.com wrote: Hi, I'd like to make a faceted search using two fields. I want to have a single result and not a result by field (like when using facet.field=f1,facet.field=f2). I don't want to use a copy field either because I want it to be dynamic at search time. As far as I know this is not possible for Solr 3.x... But I saw a new parameter named group.facet for Solr4. Could that solve my problem? If yes could somebody give me an example? Thanks, Marc.
Lexical analysis tools for German language data
Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set.
It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness.
Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out there (commercial or free) because I've seen lots of engines grokking German and the way it builds words.
Failing that, what are the proper terms to refer to these techniques so you can search more successfully?
Michael
Re: Large Index and OutOfMemoryError: Map failed
Your largest index has 66 segments (690 files) ... biggish but not insane. With 64K maps you should be able to have ~47 searchers open on each core. Enabling compound file format (not the opposite!) will mean fewer maps ... ie should improve this situation. I don't understand why Solr defaults to compound file off... that seems dangerous. Really we need a Solr dev here... to answer how long is a stale searcher kept open. Is it somehow possible 46 old searchers are being left open...? I don't see any other reason why you'd run out of maps. Hmm, unless MMapDirectory didn't think it could safely invoke unmap in your JVM. Which exact JVM are you using? If you can print the MMapDirectory.UNMAP_SUPPORTED constant, we'd know for sure. Yes, switching away from MMapDir will sidestep the too many maps issue, however, 1) MMapDir has better perf than NIOFSDir, and 2) if there really is a leak here (Solr not closing the old searchers or a Lucene bug or something...) then you'll eventually run out of file descriptors (ie, same problem, different manifestation).
Mike McCandless
http://blog.mikemccandless.com
2012/4/11 Gopal Patwa gopalpa...@gmail.com:
I have not change the mergefactor, it was 10. Compound index file is disable in my config but I read from below post, that some one had similar issue and it was resolved by switching from compound index file format to non-compound index file. and some folks resolved by changing lucene code to disable MMapDirectory. Is this best practice to do, if so is this can be done in configuration? http://lucene.472066.n3.nabble.com/MMapDirectory-failed-to-map-a-23G-compound-index-segment-td3317208.html
I have index document of core1 = 5 million, core2=8million and core3=3million and all index are hosted in single Solr instance I am going to use Solr for our site StubHub.com, see attached ls -l list of index files for all core
SolrConfig.xml:
<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <ramBufferSizeMB>4096</ramBufferSizeMB>
  <maxThreadStates>10</maxThreadStates>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>single</lockType>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <double name="forceMergeDeletesPctAllowed">0.0</double>
    <double name="reclaimDeletesWeight">10.0</double>
  </mergePolicy>
  <deletionPolicy class="solr.SolrDeletionPolicy">
    <str name="keepOptimizedOnly">false</str>
    <str name="maxCommitsToKeep">0</str>
  </deletionPolicy>
</indexDefaults>
<updateHandler class="solr.DirectUpdateHandler2">
  <maxPendingDeletes>1000</maxPendingDeletes>
  <autoCommit>
    <maxTime>90</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>${inventory.solr.softcommit.duration:1000}</maxTime>
  </autoSoftCommit>
</updateHandler>
Forwarded conversation
Subject: Large Index and OutOfMemoryError: Map failed
From: Gopal Patwa gopalpa...@gmail.com
Date: Fri, Mar 30, 2012 at 10:26 PM
To: solr-user@lucene.apache.org
I need help!! I am using Solr 4.0 nightly build with NRT and I often get this error during auto commit java.lang.OutOfMemoryError: Map failed. I have search this forum and what I found it is related to OS ulimit setting, please se below my ulimit settings. I am not sure what ulimit setting I should have? and we also get java.net.SocketException: Too many open files NOT sure how many open file we need to set?
I have 3 core with index size : core1 - 70GB, Core2 - 50GB and Core3 - 15GB, with Single shard We update the index every 5 seconds, soft commit every 1 second and hard commit every 15 minutes Environment: Jboss 4.2, JDK 1.6 , CentOS, JVM Heap Size = 24GB ulimit: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 401408 max locked memory (kbytes, -l) 1024 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t) unlimited max user processes (-u) 401408 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited ERROR: 2012-03-29 15:14:08,560 [] priority=ERROR app_name= thread=pool-3-thread-1 location=CommitTracker line=93 auto
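A sketch of the configuration changes discussed above, in solrconfig.xml terms; the directoryFactory line is only included to show where the Directory implementation is chosen in configuration, and switching it away from MMapDirectory is exactly what Mike advises against:
<indexDefaults>
  <useCompoundFile>true</useCompoundFile>
  <!-- remaining settings unchanged -->
</indexDefaults>

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>
Compound files only take effect for newly written segments, so the number of mapped files drops gradually as segments are flushed and merged after the change.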
codecs for sorted indexes
Hello, We're using a sorted index in order to implement early termination efficiently over an index of hundreds of millions of documents. As of now, we're using the default codecs coming with Lucene 4, but we believe that due to the fact that the docids are sorted, we should be able to do much better in terms of storage and achieve much better performance, especially decompression performance. In particular, Robert Muir is commenting on these lines here: https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411 We're aware that the in the bulkpostings branch there are different codecs being implemented and different experiments being done. We don't know whether we should implement our own codec (i.e. using some RLE-like techniques) or we should use one of the codecs implemented there (PFOR, Simple64, ...). Can you please give us some advice on this? Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
AW: Lexical analysis tools for German language data
Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary- backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. A simple approach would obviously be a word list and a regular expression. There will, however, be nuts and bolts to take care of. A more sophisticated and tested approach might be known to you. Michael
Re: Lexical analysis tools for German language data
Michael, I'm on this list and the lucene list since several years and have not found this yet. It's been one neglected topics to my taste. There is a CompoundAnalyzer but it requires the compounds to be dictionary based, as you indicate. I am convinced there's a way to build the de-compounding words efficiently from a broad corpus but I have never seen it (and the experts at DFKI I asked for for also told me they didn't know of one). paul Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit : Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms do refer to these techniques so you can search more successfully? Michael
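For reference, the dictionary-based filter alluded to above ships with Solr/Lucene; a minimal sketch, assuming a hand-built list of common base words (the file name and thresholds are illustrative):
<filter class="solr.DictionaryCompoundWordTokenFilterFactory"
        dictionary="german-base-words.txt"
        minWordSize="5" minSubwordSize="4" maxSubwordSize="15"
        onlyLongestMatch="true"/>
With jacke in the dictionary, indexing Windjacke also emits the subword jacke, so a query for Jacke matches; the quality then stands or falls with the word list, which is the gap Paul describes.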
Re: Lexical analysis tools for German language data
You might have a look at: http://www.basistech.com/lucene/ Am 12.04.2012 11:52, schrieb Michael Ludwig: Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms do refer to these techniques so you can search more successfully? Michael
Re: EmbeddedSolrServer and StreamingUpdateSolrServer
Hi Mikhail Khludnev, Thank you for the reply. I think the index is getting corrupted because StreamingUpdateSolrServer is keeping reference to some index files that are being deleted by EmbeddedSolrServer during commit/optimize process. As a result when I Index(Full) using EmbeddedSolrServer and then do Incremental index using StreamingUpdateSolrServer it fails with a FileNotFound exception. A special note: we don't optimize the index after Incremental indexing(StreamingUpdateSolrServer) but we do optimize it after the Full index(EmbeddedSolrServer). Please see the below log and let me know if you need further information. --- Mar 29, 2012 12:05:03 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {add=[035405]} 0 28 Mar 29, 2012 12:05:03 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update/extract params={stream.type=text/htmlliteral.stream_source_info=/snps/docs/customer/q_and_a/html/035405.htmlliteral.stream_name=035405.htmlwt=javabincollectionName=docsversion=2} status=0 QTime=28 Mar 29, 2012 12:05:03 AM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit(optimize=false,waitSearcher=true,expungeDeletes=false,softCommit=false) Mar 29, 2012 12:05:03 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {commit=} 0 10 Mar 29, 2012 12:05:03 AM org.apache.solr.common.SolrException log SEVERE: java.io.FileNotFoundException: /opt/solr/home/data/docs_index/index/_3d.cfs (No such file or directory) at java.io.RandomAccessFile.open(Native Method) at java.io.RandomAccessFile.init(RandomAccessFile.java:233) at org.apache.lucene.store.MMapDirectory.createSlicer(MMapDirectory.java:229) at org.apache.lucene.store.CompoundFileDirectory.init(CompoundFileDirectory.java:65) at org.apache.lucene.index.SegmentCoreReaders.init(SegmentCoreReaders.java:82) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:112) at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:700) at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:263) at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2852) at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2843) at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2616) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2731) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2719) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2703) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:325) at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:84) at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154) at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:107) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:52) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1477) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) - Thanks, PC Rao. -- View this message in context: http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-and-StreamingUpdateSolrServer-tp3889073p3905071.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Lexical analysis tools for German language data
If you want that query jacke matches a document containing the word windjacke or kinderjacke, you could use a custom update processor. This processor could search the indexed text for words matching the pattern .*jacke and inject the word jacke into an additional field which you can search against. You would need a whole list of possible suffixes, of course. It would slow down the update process but you don't need to split words during search. Best, Valeriy On Thu, Apr 12, 2012 at 12:39 PM, Paul Libbrecht p...@hoplahup.net wrote: Michael, I'm on this list and the lucene list since several years and have not found this yet. It's been one neglected topics to my taste. There is a CompoundAnalyzer but it requires the compounds to be dictionary based, as you indicate. I am convinced there's a way to build the de-compounding words efficiently from a broad corpus but I have never seen it (and the experts at DFKI I asked for for also told me they didn't know of one). paul Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit : Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms do refer to these techniques so you can search more successfully? Michael
Re: Lexical analysis tools for German language data
Bernd, can you please say a little more? I think this list is ok to contain some description for commercial solutions that satisfy a request formulated on list. Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? If yes, for which domain? The Google Search result (I wonder if this is politically correct to not have yours ;-)) shows me that there's an amount of job done in this direction (e.g. Gärten to match Garten) but being precise for this question would be more helpful! paul Le 12 avr. 2012 à 12:46, Bernd Fehling a écrit : You might have a look at: http://www.basistech.com/lucene/ Am 12.04.2012 11:52, schrieb Michael Ludwig: Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms do refer to these techniques so you can search more successfully? Michael
Solr Scoring
Hi,
I have a field in my index called itemDesc which I am applying EnglishMinimalStemFilterFactory to. So if I index a value to this field containing Edges, the EnglishMinimalStemFilterFactory applies stemming and Edges becomes Edge. Now when I search for Edges, documents with Edge score better than documents with the actual search word - Edges. Is there a way I can make documents with the actual search word, in this case Edges, score better than documents with Edge? I am using Solr 3.5. My field definition is shown below:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
Thanks.
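One common way to get this behavior is to index the text twice, once stemmed and once unstemmed, and boost the unstemmed field at query time; a sketch under that assumption (the extra field and type names are illustrative, not from the post):
<field name="itemDesc" type="text_en" indexed="true" stored="true"/>
<field name="itemDescExact" type="text_en_exact" indexed="true" stored="false"/>
<copyField source="itemDesc" dest="itemDescExact"/>
Here text_en_exact would be the same analyzer chain as text_en minus the EnglishMinimalStemFilterFactory. Querying both fields, for example with dismax and qf=itemDesc itemDescExact^2, lets documents containing the literal Edges outscore those that only match the stem Edge.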
two structures in solr
Hi all,
I'm a solr newbie, so sorry if I do anything wrong ;) I want to use SOLR not only for fast text search, but mainly to create a very fast search engine for a high-traffic system (MySQL would not do the job if the db grows too big). I need to store *two big structures* in SOLR: projects and contractors. Contractors will search for available projects and project owners will search for contractors who would do it for them.
So far, I have found a solr tutorial for newbies http://www.solrtutorial.com, where I found the schema file which defines the data structure: http://www.solrtutorial.com/schema-xml.html. But my case is that *I want to have two structures*. I guess running two parallel solr instances is not the idea. I took a look at http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup and I can see that the schema goes like:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
  <types>
    ...
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
    ...
  </fields>
</schema>
But still, this is a single structure. And I need 2. Great thanks in advance for any help. There are not many tutorials for SOLR in the web.
--
View this message in context: http://lucene.472066.n3.nabble.com/two-structures-in-solr-tp3905143p3905143.html
Sent from the Solr - User mailing list archive at Nabble.com.
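The usual way to host two independent structures in one Solr instance is multiple cores, each with its own schema.xml; a minimal solr.xml sketch using the names from the post (the directory layout is illustrative):
<solr persistent="false">
  <cores adminPath="/admin/cores">
    <core name="projects" instanceDir="projects"/>
    <core name="contractors" instanceDir="contractors"/>
  </cores>
</solr>
Each core then gets its own conf/schema.xml and conf/solrconfig.xml and is addressed as /solr/projects and /solr/contractors.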
Re: Question about solr.WordDelimiterFilterFactory
WordDelimiterFilterFactory will _almost_ do what you want by setting things like catenateWords=0 and catenateNumbers=1, _except_ that the punctuation will be removed. So
12.34 -> 1234
ab,cd -> ab cd
Is that close enough? Otherwise, writing a simple Filter is probably the way to go.
Best
Erick
On Wed, Apr 11, 2012 at 1:59 PM, Jian Xu joseph...@yahoo.com wrote:
Hello, I am new to solr/lucene. I am tasked to index a large number of documents. Some of these documents contain decimal points. I am looking for a way to index these documents so that adjacent numeric characters (such as [0-9.,]) are treated as single token. For example,
12.34 = 12.34
12,345 = 12,345
However, , and . should be treated as usual when around non-digital characters. For example, ab,cd = ab cd. It is so that searching for 12.34 will match 12.34 not 12 34. Searching for ab.cd should match both ab.cd and ab cd. After doing some research on solr, It seems that there is a build-in analyzer called solr.WordDelimiterFilter that supports a types attribute which map special characters as different delimiters. However, it isn't exactly what I want. It doesn't provide context check such as , or . must surround by digital characters, etc. Does anyone have any experience configuring solr to meet this requirements? Is writing my own plugin necessary for this simple thing? Thanks in advance!
-Jian
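A sketch of the filter settings Erick describes; everything other than the two catenate flags is a commonly used default rather than something from this thread:
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1"/>
With these settings 12.34 produces 12, 34 and the catenated 1234, while ab,cd produces only ab and cd, which matches the behavior described above minus the preserved punctuation.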
Dismax request handler differences Between Solr Version 3.5 and 1.4
Hi,
We are currently using solr (version 1.4.0.2010.01.13.08.09.44). We have a strange situation in the dismax request handler: when we search for a keyword and append qt=dismax, we are not getting any results. The solr request is as follows:
http://local:8983/solr/core2/select/?q=Bank&version=2.2&start=0&rows=10&indent=on&defType=dismax&debugQuery=on
The response is as follows:
<result name="response" numFound="0" start="0" />
<lst name="debug">
  <str name="rawquerystring">Bank</str>
  <str name="querystring">Bank</str>
  <str name="parsedquery">+() ()</str>
  <str name="parsedquery_toString">+() ()</str>
  <lst name="explain" />
  <str name="QParser">DisMaxQParser</str>
  <null name="altquerystring" />
  <null name="boostfuncs" />
  <lst name="timing">
    <double name="time">0.0</double>
    <lst name="prepare">
      <double name="time">0.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
    </lst>
    <lst name="process">
      <double name="time">0.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
    </lst>
  </lst>
</lst>
</response>
We are currently testing Solr version 3.5, and the same request is working fine in that version. Also the query alternative params are not working properly in Solr 1.5 when compared with version 3.5. The request seems to be the same, but we don't know where the issue is coming from. Please help me out. Thanks in advance.
Regards,
Sivaganesh
siva_srm...@yahoo.co.in
--
View this message in context: http://lucene.472066.n3.nabble.com/Dismax-request-handler-differences-Between-Solr-Version-3-5-and-1-4-tp3905192p3905192.html
Sent from the Solr - User mailing list archive at Nabble.com.
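A parsed query of +() () from dismax usually means the handler ended up with no query fields at all; in 1.4 the qf parameter generally has to be supplied on the request or in the handler defaults, whereas later releases are more forgiving, which would explain why the same request works on 3.5. A sketch of a handler definition with explicit defaults, field names being placeholders:
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">title^2 description</str>
  </lst>
</requestHandler>
Adding qf=... directly to the failing 1.4 request is a quick way to confirm whether this is the cause.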
Re: Facets involving multiple fields
facet.query=keywords:computer short_title:computer seems like what you're asking for. On Thu, Apr 12, 2012 at 3:19 AM, Marc SCHNEIDER marc.schneide...@gmail.com wrote: Hi, Thanks for your answer. Let's say I have to fields : 'keywords' and 'short_title'. For these fields I'd like to make a faceted search : if 'Computer' is stored in at least one of these fields for a document I'd like to get it added in my results. doc1 = keywords : 'Computer' / short_title : 'Computer' doc2 = keywords : 'Computer' doc3 = short_title : 'Computer' In this case I'd like to have : Computer (3) I don't see how to solve this with facet.query. Thanks, Marc. On Wed, Apr 11, 2012 at 5:13 PM, Erick Erickson erickerick...@gmail.com wrote: Have you considered facet.query? You can specify an arbitrary query to facet on which might do what you want. Otherwise, I'm not sure what you mean by faceted search using two fields. How should these fields be combined into a single facet? What that means practically is not at all obvious from your problem statement. Best Erick On Tue, Apr 10, 2012 at 8:55 AM, Marc SCHNEIDER marc.schneide...@gmail.com wrote: Hi, I'd like to make a faceted search using two fields. I want to have a single result and not a result by field (like when using facet.field=f1,facet.field=f2). I don't want to use a copy field either because I want it to be dynamic at search time. As far as I know this is not possible for Solr 3.x... But I saw a new parameter named group.facet for Solr4. Could that solve my problem? If yes could somebody give me an example? Thanks, Marc.
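For completeness, the corresponding request and how the count comes back (counts illustrative; the space in the facet.query value needs URL-encoding in a real request):
http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.query=keywords:computer short_title:computer
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="keywords:computer short_title:computer">3</int>
  </lst>
</lst>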
Re: Lexical analysis tools for German language data
Paul, nearly two years ago I requested an evaluation license and tested BASIS Tech Rosette for Lucene Solr. Was working excellent but the price was much, much too high. Yes, they also have compound analysis for several languages including German. Just configure your pipeline in solr and setup the processing pipeline in Rosette Language Processing (RLP) and thats it.
Example from my very old schema.xml config:
<fieldtype name="text_rlp" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
               rlpContext="solr/conf/rlp-index-context.xml"
               postPartOfSpeech="false"
               postLemma="true"
               postStem="true"
               postCompoundComponents="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
               rlpContext="solr/conf/rlp-query-context.xml"
               postPartOfSpeech="false"
               postLemma="true"
               postCompoundComponents="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
So you just point the tokenizer to RLP and have two RLP pipelines configured, one for indexing (rlp-index-context.xml) and one for querying (rlp-query-context.xml).
Example from my rlp-index-context.xml config:
<contextconfig>
  <properties>
    <property name="com.basistech.rex.optimize" value="false"/>
    <property name="com.basistech.ela.retokenize_for_rex" value="true"/>
  </properties>
  <languageprocessors>
    <languageprocessor>Unicode Converter</languageprocessor>
    <languageprocessor>Language Identifier</languageprocessor>
    <languageprocessor>Encoding and Character Normalizer</languageprocessor>
    <languageprocessor>European Language Analyzer</languageprocessor>
    <!--
    <languageprocessor>Script Region Locator</languageprocessor>
    <languageprocessor>Japanese Language Analyzer</languageprocessor>
    <languageprocessor>Chinese Language Analyzer</languageprocessor>
    <languageprocessor>Korean Language Analyzer</languageprocessor>
    <languageprocessor>Sentence Breaker</languageprocessor>
    <languageprocessor>Word Breaker</languageprocessor>
    <languageprocessor>Arabic Language Analyzer</languageprocessor>
    <languageprocessor>Persian Language Analyzer</languageprocessor>
    <languageprocessor>Urdu Language Analyzer</languageprocessor>
    -->
    <languageprocessor>Stopword Locator</languageprocessor>
    <languageprocessor>Base Noun Phrase Locator</languageprocessor>
    <!-- <languageprocessor>Statistical Entity Extractor</languageprocessor> -->
    <languageprocessor>Exact Match Entity Extractor</languageprocessor>
    <languageprocessor>Pattern Match Entity Extractor</languageprocessor>
    <languageprocessor>Entity Redactor</languageprocessor>
    <languageprocessor>REXML Writer</languageprocessor>
  </languageprocessors>
</contextconfig>
As you can see I used the European Language Analyzer.
Bernd
On 12.04.2012 12:58, Paul Libbrecht wrote:
Bernd, can you please say a little more? I think this list is ok to contain some description for commercial solutions that satisfy a request formulated on list. Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? If yes, for which domain? The Google Search result (I wonder if this is politically correct to not have yours ;-)) shows me that there's an amount of job done in this direction (e.g. Gärten to match Garten) but being precise for this question would be more helpful!
paul
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
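A minimal sketch of the Solr side of that setup, assuming HDFS is mounted via fuse at /mnt/hdfs (the path is illustrative):
<!-- solrconfig.xml -->
<dataDir>/mnt/hdfs/solr/data</dataDir>
To Solr the fuse mount is just a local filesystem, so no other configuration changes are needed; whether index write patterns perform acceptably over fuse/HDFS is worth benchmarking before committing to it.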
is there a downside to combining search fields with copyfield?
hello everyone, can people give me their thoughts on this. currently, my schema has individual fields to search on. are there advantages or disadvantages to taking several of the individual search fields and combining them in to a single search field? would this affect search times, term tokenization or possibly other things. example of individual fields brand category partno example of a single combined search field part_info (would combine brand, category and partno) thank you for any feedback mark -- View this message in context: http://lucene.472066.n3.nabble.com/is-there-a-downside-to-combining-search-fields-with-copyfield-tp3905349p3905349.html Sent from the Solr - User mailing list archive at Nabble.com.
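A sketch of the combined-field approach from the subject line, using the field names from the post (types and attributes are assumptions):
<field name="brand" type="string" indexed="true" stored="true"/>
<field name="category" type="string" indexed="true" stored="true"/>
<field name="partno" type="string" indexed="true" stored="true"/>
<field name="part_info" type="text_general" indexed="true" stored="false" multiValued="true"/>

<copyField source="brand" dest="part_info"/>
<copyField source="category" dest="part_info"/>
<copyField source="partno" dest="part_info"/>
copyField copies the raw field value before analysis, so part_info gets its own analysis chain, and the destination needs multiValued="true" because several sources feed it. Searching one combined field is usually simpler and at least as fast as querying several fields, at the cost of extra index size and of losing per-field boosting.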
AW: Lexical analysis tools for German language data
Von: Valeriy Felberg If you want that query jacke matches a document containing the word windjacke or kinderjacke, you could use a custom update processor. This processor could search the indexed text for words matching the pattern .*jacke and inject the word jacke into an additional field which you can search against. You would need a whole list of possible suffixes, of course. Merci, Valeriy - I agree on the feasability of such an approach. The list would likely have to be composed of the most frequently used terms for your specific domain. In our case, it's things people would buy in shops. Reducing overly complicated and convoluted product descriptions to proper basic terms - that would do the job. It's like going to a restaurant boasting fancy and unintelligible names for the dishes you may order when they are really just ordinary stuff like pork and potatoes. Thinking some more about it, giving sufficient boost to the attached category data might also do the job. That would shift the burden of supplying proper semantics to the guys doing the categorization. It would slow down the update process but you don't need to split words during search. Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit : Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. A query for Windjacke or Kinderjacke would probably not have to be de-specialized to Jacke because, well, that's the user input and users looking for specific things are probably doing so for a reason. If no matches are found you can still tell them to just broaden their search. Michael
Re: Lexical analysis tools for German language data
Hi, We've done a lot of tests with the HyphenationCompoundWordTokenFilter using a from TeX generated FOP XML file for the Dutch language and have seen decent results. A bonus was that now some tokens can be stemmed properly because not all compounds are listed in the dictionary for the HunspellStemFilter. It does introduce a recall/precision problem but it at least returns results for those many users that do not properly use compounds in their search query. There seem to be a small issue with the filter where minSubwordSize=N yields subwords of size N-1. Cheers, On Thursday 12 April 2012 12:39:44 Paul Libbrecht wrote: Michael, I'm on this list and the lucene list since several years and have not found this yet. It's been one neglected topics to my taste. There is a CompoundAnalyzer but it requires the compounds to be dictionary based, as you indicate. I am convinced there's a way to build the de-compounding words efficiently from a broad corpus but I have never seen it (and the experts at DFKI I asked for for also told me they didn't know of one). paul Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit : Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms do refer to these techniques so you can search more successfully? Michael -- Markus Jelsma - CTO - Openindex
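A sketch of the filter described above, assuming a TeX-derived FOP hyphenation file and a base-word dictionary (both file names and the thresholds are illustrative):
<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
        hyphenator="hyph_de.xml"
        dictionary="dictionary.txt"
        minWordSize="5" minSubwordSize="4" maxSubwordSize="15"
        onlyLongestMatch="true"/>
The hyphenation patterns propose split points and the dictionary constrains which subwords are actually emitted; emitting the components is also what lets the stemmer work on compounds that the stemming dictionary alone would miss, as noted above.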
Further questions about behavior in ReversedWildcardFilterFactory
I asked the question in http://lucene.472066.n3.nabble.com/A-little-onfusion-with-maxPosAsterisk-tt3889226.html However, when I did some implementation work, I got further questions. 1. Suppose I don't use ReversedWildcardFilterFactory at index time; it seems that Solr doesn't allow a leading wildcard search and will return the error: org.apache.lucene.queryParser.ParseException: Cannot parse 'sequence:*A*': '*' or '?' not allowed as first character in WildcardQuery But when I use the ReversedWildcardFilterFactory, I can use *A* in the query. As far as I know, ReversedWildcardFilterFactory should only work on the index side and should not affect query behavior. If that is true, how does this happen? 2. Based on the question above, suppose I have these tokens in the index: 1.AB/MNO/UUFI 2.BC/MNO/IUYT 3.D/MNO/QEWA 4./MNO/KGJGLI 5.QOEOEF/MNO/ Suppose I use Lucene directly: I can set the QueryParser with setAllowLeadingWildcard(true), and a search for *MNO* should return the tokens above (1-5). But in Solr, when I run the *MNO* query with ReversedWildcardFilterFactory at index time but the StandardAnalyzer at query time, I don't know what happens. The leading wildcard *MNO should be fast to match 5 with ReversedWildcardFilterFactory, and the trailing wildcard MNO* should be fast to match 4. But what about *MNO* ? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Further-questions-about-behavior-in-ReversedWildcardFilterFactory-tp3905416p3905416.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Thanks Darren. Actually, I would like the system to be homogenous - i.e., use Hadoop based tools that already provide all the necessary scaling for the lucene index (in terms of throughput, latency of writes/reads etc). Since SolrCloud adds its own layer of sharding/replication that is outside Hadoop, I feel that using SolrCloud would be redundant, and a step in the opposite direction, which is what I'm trying to avoid in the first place. Or am I mistaken? Thanks, Safdar On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni dar...@ontrenet.com wrote: You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
AW: Lexical analysis tools for German language data
Von: Markus Jelsma We've done a lot of tests with the HyphenationCompoundWordTokenFilter using a from TeX generated FOP XML file for the Dutch language and have seen decent results. A bonus was that now some tokens can be stemmed properly because not all compounds are listed in the dictionary for the HunspellStemFilter. Thank you for pointing me to these two filter classes. It does introduce a recall/precision problem but it at least returns results for those many users that do not properly use compounds in their search query. Could you define what the term recall should be taken to mean in this context? I've also encountered it on the BASIStech website. Okay, I found a definition: http://en.wikipedia.org/wiki/Precision_and_recall Dank je wel! Michael
RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
SolrCloud or any other tech-specific replication isn't going to 'just work' with Hadoop replication. But with some significant custom coding anything should be possible. Interesting idea. --- Original Message --- On 4/12/2012 09:21 AM Ali S Kureishy wrote: Thanks Darren. Actually, I would like the system to be homogenous - i.e., use Hadoop based tools that already provide all the necessary scaling for the lucene index (in terms of throughput, latency of writes/reads etc). Since SolrCloud adds its own layer of sharding/replication that is outside Hadoop, I feel that using SolrCloud would be redundant, and a step in the opposite direction, which is what I'm trying to avoid in the first place. Or am I mistaken? Thanks, Safdar
Re: Question about solr.WordDelimiterFilterFactory
Erick, Thank you for your response! The problem with this approach is that searching for 12:34 will also match 12.34 which is not what I want. From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org; Jian Xu joseph...@yahoo.com Sent: Thursday, April 12, 2012 8:01 AM Subject: Re: Question about solr.WordDelimiterFilterFactory WordDelimiterFilterFactory will _almost_ do what you want by setting things like catenateWords=0 and catenateNumbers=1, _except_ that the punctuation will be removed. So 12.34 - 1234 ab,cd - ab cd is that close enough? Otherwise, writing a simple Filter is probably the way to go. Best Erick On Wed, Apr 11, 2012 at 1:59 PM, Jian Xu joseph...@yahoo.com wrote: Hello, I am new to solr/lucene. I am tasked to index a large number of documents. Some of these documents contain decimal points. I am looking for a way to index these documents so that adjacent numeric characters (such as [0-9.,]) are treated as single token. For example, 12.34 = 12.34 12,345 = 12,345 However, , and . should be treated as usual when around non-digital characters. For example, ab,cd = ab cd. It is so that searching for 12.34 will match 12.34 not 12 34. Searching for ab.cd should match both ab.cd and ab cd. After doing some research on solr, It seems that there is a build-in analyzer called solr.WordDelimiterFilter that supports a types attribute which map special characters as different delimiters. However, it isn't exactly what I want. It doesn't provide context check such as , or . must surround by digital characters, etc. Does anyone have any experience configuring solr to meet this requirements? Is writing my own plugin necessary for this simple thing? Thanks in advance! -Jian
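If writing a custom filter feels heavy, another direction that might work is a pattern-based tokenizer (for instance solr.PatternTokenizerFactory with group=0) whose regex only keeps '.' and ',' when they sit between digits. Here is a plain-Java check of such a pattern; the regex is an illustration, not a tested schema fragment:
[code]
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberAwareTokens {
    // runs of digits optionally joined by '.' or ',' stay together;
    // other punctuation is dropped, leaving plain word-character tokens
    private static final Pattern TOKEN = Pattern.compile("\\d+(?:[.,]\\d+)*|\\w+");

    public static void main(String[] args) {
        for (String input : new String[] {"12.34", "12,345", "ab,cd", "ab.cd"}) {
            Matcher m = TOKEN.matcher(input);
            StringBuilder out = new StringBuilder(input).append(" ->");
            while (m.find()) {
                out.append(' ').append(m.group());
            }
            System.out.println(out); // e.g. "12.34 -> 12.34" and "ab,cd -> ab cd"
        }
    }
}
[/code]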
RE: SOLR 3.3 DIH and Java 1.6
Thanks guys for all the help. We moved to an upgraded O.S. version and the java script worked. - Randolf -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-3-3-DIH-and-Java-1-6-tp3841355p3905583.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr 3.4 with nTiers = 2: usage of ids param causes NullPointerException (NPE)
Can anyone help me out with this? Is this too complicated / unclear? I could share more detail if needed. On Wed, Apr 11, 2012 at 3:16 PM, Dmitry Kan dmitry@gmail.com wrote: Hello, Hopefully this question is not too complex to handle, but I'm currently stuck with it. We have a system with nTiers, that is: Solr front base --- Solr front -- shards Inside QueryComponent there is a method createRetrieveDocs(ResponseBuilder rb) which collects doc ids of each shard and sends them in different queries using the ids parameter: [code] sreq.params.add(ShardParams.IDS, StrUtils.join(ids, ',')); [/code] This actually produces NPE (same as in https://issues.apache.org/jira/browse/SOLR-1477) in the first tier, because Solr front (on the second tier) fails to process such a query. I have tried to fix this by using a unique field with a value of ids ORed (the following code substitutes the code above): [code] StringBuffer idsORed = new StringBuffer(); for (IteratorString iterator = ids.iterator(); iterator.hasNext(); ) { String next = iterator.next(); if (iterator.hasNext()) { idsORed.append(next).append( OR ); } else { idsORed.append(next); } } sreq.params.add(rb.req.getSchema().getUniqueKeyField().getName(), idsORed.toString()); [/code] This works perfectly if for rows=n there is n or less hits from a distributed query. However, if there are more than 2*n hits, the querying fails with an NPE in a completely different component, which is HighlightComponent (highlights are requested in the same query with hl=truehl.fragsize=5hl.requireFieldMatch=truehl.fl=targetTextField): SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:161) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:619) It sounds like the ids of documents somehow get shuffled and the instruction (only a hypothesis) [code] ShardDoc sdoc = rb.resultIds.get(id); [/code] returns sdoc=null, which causes the next line of code to fail with an NPE: [code] int idx = sdoc.positionInResponse; [/code] Am I missing anything? Can something be done for solving this issue? Thanks. -- Regards, Dmitry Kan -- Regards, Dmitry Kan
Re: Error
Please review: http://wiki.apache.org/solr/UsingMailingLists You haven't said whether, for instance, you're using trunk which is the only version that supports the termfreq function. Best Erick On Thu, Apr 12, 2012 at 4:08 AM, Abhishek tiwari abhishek.tiwari@gmail.com wrote: http://xyz.com:8080/newschema/mainsearch/select/?q=*%3A*version=2.2start=0rows=10indent=onsort=termfreq%28cuisine_priorities_list,%27Chinese%27%29%20desc Error : HTTP Status 400 - Missing sort order. Why i am getting error ?
Import null values from XML file
We import an XML file directly to SOLR using a the script called post.sh in the exampledocs. This is the script: FILES=$* URL=http://localhost:8983/solr/update for f in $FILES; do echo Posting file $f to $URL curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8' echo done #send the commit command to make sure all the changes are flushed and visible curl $URL --data-binary 'commit/' -H 'Content-type:text/xml; charset=utf-8' echo Our XML file looks something like this: add doc field name=ProductGuidD22BF0B9-EE3A-49AC-A4D6-000B07CDA18A/field field name=SkuGuidD22BF0B9-EE3A-49AC-A4D6-000B07CDA18A/field field name=ProductGroupId1000/field field name=VendorSkuCodeCK4475/field field name=VendorSkuAltCodeCK4475/field field name=ManufacturerSkuCodeNULL/field field name=ManufacturerSkuAltCodeNULL/field field name=UpcEanSkuCode840655037330/field field name=VendorSupersededSkuCodeNULL/field field name=VendorProductDescriptionEBC CLUTCH KIT/field field name=VendorSkuDescriptionEBC CLUTCH KIT/field /doc /add How can I tell solr that the NULL value should be treated as null? Thanks, Randolf -- View this message in context: http://lucene.472066.n3.nabble.com/Import-null-values-from-XML-file-tp3905600p3905600.html Sent from the Solr - User mailing list archive at Nabble.com.
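If treating the literal string NULL as "no value" is acceptable, one option might be a small custom update processor that drops such fields before the document is indexed. This is only a sketch against the update-processor API as I understand it; the class and package names are made up:
[code]
package com.example.solr;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class DropNullValuesProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                List<String> toRemove = new ArrayList<String>();
                for (String name : doc.getFieldNames()) {
                    if ("NULL".equals(doc.getFieldValue(name))) {
                        toRemove.add(name); // collect first to avoid modifying while iterating
                    }
                }
                for (String name : toRemove) {
                    doc.removeField(name);
                }
                super.processAdd(cmd);
            }
        };
    }
}
[/code]
The factory would then be wired into an updateRequestProcessorChain in solrconfig.xml and attached to the /update handler.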
Re: Lexical analysis tools for German language data
German noun decompounding is a little more complicated than it might seem. There can be transformations or inflections, like the s in Weihnachtsbaum (Weihnachten/Baum). Internal nouns should be recapitalized, like Baum above. Some compounds probably should not be decompounded, like Fahrrad (fahren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary. Verbs get more complicated inflections, and might need to be decapitalized, like fahren above. Und so weiter. Note that highlighting gets pretty weird when you are matching only part of a word. Luckily, a lot of compounds are simple, and you could well get a measurable improvement with a very simple algorithm. There isn't anything complicated about compounds like Orgelmusik or Netzwerkbetreuer. The Basis Technology linguistic analyzers aren't cheap or small, but they work well. wunder On Apr 12, 2012, at 3:58 AM, Paul Libbrecht wrote: Bernd, can you please say a little more? I think this list is ok to contain some description for commercial solutions that satisfy a request formulated on list. Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? If yes, for which domain? The Google Search result (I wonder if this is politically correct to not have yours ;-)) shows me that there's a fair amount of work done in this direction (e.g. Gärten to match Garten) but being precise for this question would be more helpful! paul Le 12 avr. 2012 à 12:46, Bernd Fehling a écrit : You might have a look at: http://www.basistech.com/lucene/ Am 12.04.2012 11:52, schrieb Michael Ludwig: Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out there (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms to refer to these techniques, so you can search more successfully? Michael
[Solr 4.0] Is it possible to do soft commit from code and not configuration only
Hi, I need to configure Solr so that the open searcher will see a new document immediately after it is added to the index, and I don't want to perform a commit each time a new document is added. I tried to configure maxDocs=1 under autoSoftCommit in solrconfig.xml but it didn't help. Is there a way to perform a soft commit from code in Solr 4.0? Thank you in advance. Best regards, Lyuba
AW: Lexical analysis tools for German language data
Von: Walter Underwood German noun decompounding is a little more complicated than it might seem. There can be transformations or inflections, like the s in Weihnachtsbaum (Weihnachten/Baum). I remember from my linguistics studies that the terminus technicus for these is Fugenmorphem (interstitial or joint morpheme). But there are not many of them - phrased as a regex, it's /e?[ns]/. The Weihnachtsbaum in the example above is formed from the singular (die Weihnacht), then s, then Baum. Still, it's much more complex than, say, English or Italian. Internal nouns should be recapitalized, like Baum above. Casing won't matter for indexing, I think. The way I would go about obtaining stems from compound words is by using a dictionary of stems and a regex. We'll see how far that'll take us. Some compounds probably should not be decompounded, like Fahrrad (fahren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary. Good point. Note that highlighting gets pretty weird when you are matching only part of a word. Guess it'll be weird when you get it wrong, like Noten in Notentriegelung. Luckily, a lot of compounds are simple, and you could well get a measurable improvement with a very simple algorithm. There isn't anything complicated about compounds like Orgelmusik or Netzwerkbetreuer. Exactly. The Basis Technology linguistic analyzers aren't cheap or small, but they work well. We will consider our needs and options. Thanks for your thoughts. Michael
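As a toy illustration of the stems-plus-regex idea Michael describes above - the stem list and the joint-morpheme handling are deliberately simplistic, so this is nothing more than a sketch:
[code]
import java.util.*;

public class ToyDecompounder {
    // illustrative stem list only; a real dictionary would hold thousands of entries
    private static final Set<String> STEMS = new HashSet<String>(Arrays.asList(
            "weihnacht", "baum", "wind", "jacke", "orgel", "musik", "netzwerk", "betreuer"));

    /** Returns {head, tail} if the word splits into two known stems, else null. */
    public static String[] decompound(String word) {
        String w = word.toLowerCase(Locale.GERMAN);
        for (int i = 1; i < w.length(); i++) {
            String head = w.substring(0, i);
            if (!STEMS.contains(head)) continue;
            // allow an optional joint morpheme (Fugenmorphem: e, s, n, es, en) after the head
            String tail = w.substring(i).replaceFirst("^(es|en|e|s|n)", "");
            if (STEMS.contains(tail)) return new String[] { head, tail };
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(decompound("Weihnachtsbaum"))); // [weihnacht, baum]
        System.out.println(Arrays.toString(decompound("Windjacke")));      // [wind, jacke]
        System.out.println(Arrays.toString(decompound("Fahrrad")));        // null - not in the stem list
    }
}
[/code]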
Re: [Solr 4.0] Is it possible to do soft commit from code and not configuration only
On Apr 12, 2012, at 11:28 AM, Lyuba Romanchuk wrote: Hi, I need to configure the solr so that the opened searcher will see a new document immidiately after it was adding to the index. And I don't want to perform commit each time a new document is added. I tried to configure maxDocs=1 under autoSoftCommit in solrconfig.xml but it didn't help. Can you elaborate on didn't help? You couldn't find any docs unless you did an explicit commit? If that is true and there is no user error, this would be a bug. Is there way to perform soft commit from code in Solr 4.0 ? Yes - check out the wiki docs - I can't remember how it is offhand (I think it was slightly changed recently). Thank you in advance. Best regards, Lyuba - Mark Miller lucidimagination.com
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config schema.xml You must have a _version_ field defined: field name=_version_ type=long indexed=true stored=true/ On Apr 11, 2012, at 9:10 AM, Benson Margulies wrote: I didn't have a _version_ field, since nothing in the schema says that it's required! On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni dar...@ontrenet.com wrote: Hard to say why its not working for you. Start with a fresh Solr and work forward from there or back out your configs and plugins until it works again. On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote: In my cloud configuration, if I push delete query*:*/query /delete followed by: commit/ I get no errors, the log looks happy enough, but the documents remain in the index, visible to /query. Here's what seems my relevant bit of solrconfig.xml. My URP only implements processAdd. updateRequestProcessorChain name=RNI !-- some day, add parameters when we have some -- processor class=com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory/ processor class=solr.LogUpdateProcessorFactory / processor class=solr.DistributedUpdateProcessorFactory/ processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain !-- activate RNI processing by adding the RNI URP to the chain for xml updates -- requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.chainRNI/str /lst /requestHandler - Mark Miller lucidimagination.com
Re: AW: Lexical analysis tools for German language data
Le 12 avr. 2012 à 17:46, Michael Ludwig a écrit : Some compounds probably should not be decompounded, like Fahrrad (farhren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary. Good point. More or less, Fahrrad is generally abbreviated as Rad. (even though Rad can mean wheel and bike) Note that highlighting gets pretty weird when you are matching only part of a word. Guess it'll be a weird when you get it wrong, like Noten in Notentriegelung. This decomposition should not happen because Noten-triegelung does not have a correct second term. The Basis Technology linguistic analyzers aren't cheap or small, but they work well. We will consider our needs and options. Thanks for your thoughts. My question remains as to which domain it aims at covering. We had such need for mathematics texts... I would be pleasantly surprised if, for example, Differenzen-quotient would be decompounded. paul
Re: Problem to integrate Solr in Jetty (the first example in the Apache Solr 3.1 Cookbook)
On 4/12/2012 2:21 AM, Bastian Hepp wrote: When I try to start I get this error message: C:\\jetty-solrjava -jar start.jar java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.eclipse.jetty.start.Main.invokeMain(Main.java:457) at org.eclipse.jetty.start.Main.start(Main.java:602) at org.eclipse.jetty.start.Main.main(Main.java:82) Caused by: java.lang.ClassNotFoundException: org.mortbay.jetty.Server at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at org.eclipse.jetty.util.Loader.loadClass(Loader.java:92) at org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.nodeClass(XmlConfiguration.java:349) at org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.configure(XmlConfiguration.java:327) at org.eclipse.jetty.xml.XmlConfiguration.configure(XmlConfiguration.java:291) at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1203) at java.security.AccessController.doPrivileged(Native Method) at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138) Bastian, The jetty.xml included with Solr is littered with org.mortbay class references, which are appropriate for Jetty 6. Jetty 7 and 8 use the org.eclipse prefix, and from the very small amount of investigation I did a few weeks ago, have also made other changes to the package names, so you might not be able to simply replace org.mortbay with org.eclipse. The absolutely easiest option would be to just use the jetty included with Solr, not version 8. If you want to keep using Jetty 8, you will need to find/make a new jetty.xml file. If I were set on using Jetty 8 and had to make it work, I would check out trunk (Lucene/Solr 4.0) from the Apache SVN server, find the example jetty.xml there, and use it instead. It's possible that you may need to still make changes, but that is probably the path of least resistance. The jetty version has been upgraded in trunk. Another option would be to download Jetty 6, find its jetty.xml, and compare it with the one in Solr, to find out what the Lucene developers changed from default. Then you would have to take the default jetty.xml from Jetty 8 and make similar changes to make a new config. Apparently Jetty 8 no longer supports JSP with the JRE, so you're probably going to need the JDK. The developers have eliminated JSP from trunk, so it will still work with the JRE. Thanks, Shawn
Re: Large Index and OutOfMemoryError: Map failed
On Apr 12, 2012, at 6:07 AM, Michael McCandless wrote: Your largest index has 66 segments (690 files) ... biggish but not insane. With 64K maps you should be able to have ~47 searchers open on each core. Enabling compound file format (not the opposite!) will mean fewer maps ... ie should improve this situation. I don't understand why Solr defaults to compound file off... that seems dangerous. Really we need a Solr dev here... to answer how long is a stale searcher kept open. Is it somehow possible 46 old searchers are being left open...? Probably only if there is a bug. When a new Searcher is opened, any previous Searcher is closed as soon as there are no more references to it (eg all in flight requests to that Searcher finish). I don't see any other reason why you'd run out of maps. Hmm, unless MMapDirectory didn't think it could safely invoke unmap in your JVM. Which exact JVM are you using? If you can print the MMapDirectory.UNMAP_SUPPORTED constant, we'd know for sure. Yes, switching away from MMapDir will sidestep the too many maps issue, however, 1) MMapDir has better perf than NIOFSDir, and 2) if there really is a leak here (Solr not closing the old searchers or a Lucene bug or something...) then you'll eventually run out of file descriptors (ie, same problem, different manifestation). Mike McCandless http://blog.mikemccandless.com 2012/4/11 Gopal Patwa gopalpa...@gmail.com: I have not change the mergefactor, it was 10. Compound index file is disable in my config but I read from below post, that some one had similar issue and it was resolved by switching from compound index file format to non-compound index file. and some folks resolved by changing lucene code to disable MMapDirectory. Is this best practice to do, if so is this can be done in configuration? http://lucene.472066.n3.nabble.com/MMapDirectory-failed-to-map-a-23G-compound-index-segment-td3317208.html I have index document of core1 = 5 million, core2=8million and core3=3million and all index are hosted in single Solr instance I am going to use Solr for our site StubHub.com, see attached ls -l list of index files for all core SolrConfig.xml: indexDefaults useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength-- ramBufferSizeMB4096/ramBufferSizeMB maxThreadStates10/maxThreadStates writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout lockTypesingle/lockType mergePolicy class=org.apache.lucene.index.TieredMergePolicy double name=forceMergeDeletesPctAllowed0.0/double double name=reclaimDeletesWeight10.0/double /mergePolicy deletionPolicy class=solr.SolrDeletionPolicy str name=keepOptimizedOnlyfalse/str str name=maxCommitsToKeep0/str /deletionPolicy /indexDefaults updateHandler class=solr.DirectUpdateHandler2 maxPendingDeletes1000/maxPendingDeletes autoCommit maxTime90/maxTime openSearcherfalse/openSearcher /autoCommit autoSoftCommit maxTime${inventory.solr.softcommit.duration:1000}/maxTime /autoSoftCommit /updateHandler Forwarded conversation Subject: Large Index and OutOfMemoryError: Map failed From: Gopal Patwa gopalpa...@gmail.com Date: Fri, Mar 30, 2012 at 10:26 PM To: solr-user@lucene.apache.org I need help!! I am using Solr 4.0 nightly build with NRT and I often get this error during auto commit java.lang.OutOfMemoryError: Map failed. I have search this forum and what I found it is related to OS ulimit setting, please se below my ulimit settings. I am not sure what ulimit setting I should have? 
and we also get java.net.SocketException: Too many open files NOT sure how many open file we need to set? I have 3 core with index size : core1 - 70GB, Core2 - 50GB and Core3 - 15GB, with Single shard We update the index every 5 seconds, soft commit every 1 second and hard commit every 15 minutes Environment: Jboss 4.2, JDK 1.6 , CentOS, JVM Heap Size = 24GB ulimit: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 401408 max locked memory (kbytes, -l) 1024 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t)
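For reference, the constant Mike mentions earlier in this thread can be checked with a couple of lines, assuming the same Lucene jars that your Solr uses are on the classpath:
[code]
import org.apache.lucene.store.MMapDirectory;

public class CheckUnmap {
    public static void main(String[] args) {
        // true means the JVM lets Lucene unmap memory-mapped index files when closing them
        System.out.println("UNMAP_SUPPORTED = " + MMapDirectory.UNMAP_SUPPORTED);
    }
}
[/code]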
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
On Thu, Apr 12, 2012 at 11:56 AM, Mark Miller markrmil...@gmail.com wrote: Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config Did I fail to find this in google or did I just goad you into a writing job? I'm inclined to write a JIRA asking for _version_ to be configurable just like the uniqueKey in the schema. schema.xml You must have a _version_ field defined: field name=_version_ type=long indexed=true stored=true/ On Apr 11, 2012, at 9:10 AM, Benson Margulies wrote: I didn't have a _version_ field, since nothing in the schema says that it's required! On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni dar...@ontrenet.com wrote: Hard to say why its not working for you. Start with a fresh Solr and work forward from there or back out your configs and plugins until it works again. On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote: In my cloud configuration, if I push delete query*:*/query /delete followed by: commit/ I get no errors, the log looks happy enough, but the documents remain in the index, visible to /query. Here's what seems my relevant bit of solrconfig.xml. My URP only implements processAdd. updateRequestProcessorChain name=RNI !-- some day, add parameters when we have some -- processor class=com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory/ processor class=solr.LogUpdateProcessorFactory / processor class=solr.DistributedUpdateProcessorFactory/ processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain !-- activate RNI processing by adding the RNI URP to the chain for xml updates -- requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.chainRNI/str /lst /requestHandler - Mark Miller lucidimagination.com
Re: AW: Lexical analysis tools for German language data
On Apr 12, 2012, at 8:46 AM, Michael Ludwig wrote: I remember from my linguistics studies that the terminus technicus for these is Fugenmorphem (interstitial or joint morpheme). That is some excellent linguistic jargon. I'll file that with hapax legomenon. If you don't highlight, you can get good results with pretty rough analyzers, but highlighting exposes those, even when they don't affect relevance. For example, you can get good relevance just indexing bigrams in Chinese, but it looks awful when you highlight them. As soon as you highlight, you need a dictionary-based segmenter. wunder -- Walter Underwood wun...@wunderwood.org
Re: AW: Lexical analysis tools for German language data
On Thursday 12 April 2012 18:00:14 Paul Libbrecht wrote: Le 12 avr. 2012 à 17:46, Michael Ludwig a écrit : Some compounds probably should not be decompounded, like Fahrrad (farhren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary. Good point. More or less, Fahrrad is generally abbreviated as Rad. (even though Rad can mean wheel and bike) Note that highlighting gets pretty weird when you are matching only part of a word. Guess it'll be a weird when you get it wrong, like Noten in Notentriegelung. This decomposition should not happen because Noten-triegelung does not have a correct second term. The Basis Technology linguistic analyzers aren't cheap or small, but they work well. We will consider our needs and options. Thanks for your thoughts. My question remains as to which domain it aims at covering. We had such need for mathematics texts... I would be pleasantly surprised if, for example, Differenzen-quotient would be decompounded. The HyphenationCompoundWordTokenFilter can do those things but those words must be listed in the dictionary or you'll get strange results. It still yields strange results when it emits tokens that are subwords of a subword. paul -- Markus Jelsma - CTO - Openindex
Re: AW: Lexical analysis tools for German language data
On Apr 12, 2012, at 9:00 AM, Paul Libbrecht wrote: More or less, Fahrrad is generally abbreviated as Rad. (even though Rad can mean wheel and bike) A synonym could handle this, since fahren would not be a good match. It is a judgement call, but this seems more like an equivalence Fahrrad = Rad than decompounding. wunder -- Walter Underwood wun...@wunderwood.org
Re: codecs for sorted indexes
Do you mean you are pre-sorting the documents (by what criteria?) yourself, before adding them to the index? In which case... you should already be seeing some benefits (smaller index size) than had you randomly added them (ie the vInts should take fewer bytes), I think. (Probably the savings would be greater for better intblock codecs like PForDelta, SimpleX, but I'm not sure...). Or do you mean having a codec re-sort the documents (on flush/merge)? I think this should be possible w/ the Codec API... but nobody has tried it yet that I know of. Note that the bulkpostings branch is effectively dead (nobody is iterating on it, and we've removed the old bulk API from trunk), but there is likely a GSoC project to add a PForDelta codec to trunk: https://issues.apache.org/jira/browse/LUCENE-3892 Mike McCandless http://blog.mikemccandless.com On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello, We're using a sorted index in order to implement early termination efficiently over an index of hundreds of millions of documents. As of now, we're using the default codecs coming with Lucene 4, but we believe that due to the fact that the docids are sorted, we should be able to do much better in terms of storage and achieve much better performance, especially decompression performance. In particular, Robert Muir is commenting on these lines here: https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411 We're aware that the in the bulkpostings branch there are different codecs being implemented and different experiments being done. We don't know whether we should implement our own codec (i.e. using some RLE-like techniques) or we should use one of the codecs implemented there (PFOR, Simple64, ...). Can you please give us some advice on this? Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
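As a rough illustration of the first point above - smaller docid gaps take fewer vInt bytes - here is a standalone Java comparison of a posting list whose ids are clustered (as they tend to be when similar documents are indexed adjacently) versus spread across the whole index. The toy vInt routine and the numbers are only for illustration, not Lucene's actual encoder:
[code]
import java.util.Arrays;
import java.util.Random;

public class VIntGapDemo {
    // bytes a vInt-style encoding (7 payload bits per byte) needs for a non-negative value
    static int vIntBytes(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) { value >>>= 7; bytes++; }
        return bytes;
    }

    // postings store ascending doc ids as gaps from the previous id
    static long postingSize(int[] docIds) {
        int[] sorted = docIds.clone();
        Arrays.sort(sorted);
        long total = 0;
        int prev = 0;
        for (int id : sorted) { total += vIntBytes(id - prev); prev = id; }
        return total;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int maxDoc = 1000000000;
        int[] clustered = new int[100000];
        int[] scattered = new int[100000];
        for (int i = 0; i < clustered.length; i++) {
            clustered[i] = i * 3;                // ids close together, small gaps
            scattered[i] = rnd.nextInt(maxDoc);  // ids spread over the whole index, large gaps
        }
        System.out.println("clustered gaps: " + postingSize(clustered) + " bytes");
        System.out.println("scattered gaps: " + postingSize(scattered) + " bytes");
    }
}
[/code]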
Re: Error
i am using 3.4 solr version... please assist... On Thu, Apr 12, 2012 at 8:41 PM, Erick Erickson erickerick...@gmail.comwrote: Please review: http://wiki.apache.org/solr/UsingMailingLists You haven't said whether, for instance, you're using trunk which is the only version that supports the termfreq function. Best Erick On Thu, Apr 12, 2012 at 4:08 AM, Abhishek tiwari abhishek.tiwari@gmail.com wrote: http://xyz.com:8080/newschema/mainsearch/select/?q=*%3A*version=2.2start=0rows=10indent=onsort=termfreq%28cuisine_priorities_list,%27Chinese%27%29%20desc Error : HTTP Status 400 - Missing sort order. Why i am getting error ?
Re: EmbeddedSolrServer and StreamingUpdateSolrServer
On 4/12/2012 4:52 AM, pcrao wrote: I think the index is getting corrupted because StreamingUpdateSolrServer is keeping reference to some index files that are being deleted by EmbeddedSolrServer during commit/optimize process. As a result when I Index(Full) using EmbeddedSolrServer and then do Incremental index using StreamingUpdateSolrServer it fails with a FileNotFound exception. A special note: we don't optimize the index after Incremental indexing(StreamingUpdateSolrServer) but we do optimize it after the Full index(EmbeddedSolrServer). Please see the below log and let me know if you need further information. I am a relative newbie to all this, and I've never used EmbeddedSolrServer, only CommonsHttpSolrServer and StreamingUpdateSolrServer. I'm not even sure the embedded object is an option unless your program is running in the same JVM as Solr. Mine is separate. If I am right about ESS needing to be in the same JVM as Solr, then that means it can do a more direct interaction with Solr and therefore might not be coordinated with the HTTP access that SUSS uses. I have read multiple times that the developers don't recommend using ESS. If you are going to use it, you probably have to do everything with it. SUSS does everything in the background, so you have no guarantees as to when it will happen, as well as no ability to check for completion or errors. Because of the lack of error detection, I had to stop using SUSS. Thanks, Shawn
Re: [Solr 4.0] Is it possible to do soft commit from code and not configuration only
Hi Mark, Thank you for the reply. I tried to normalize the data as in relational databases: - there are some types of documents, where: - documents with the same type have the same fields - documents with different types may have different fields - all documents have a type field and a unique key field id - there is a main type (all records with this type contain pointers to the corresponding records of other types) There is a configuration that defines what information should be stored for each type. When I get new data for indexing, first of all I check whether such a document is already in the index, using facets on the corresponding fields and a query on the relevant type. I add documents to the Solr index without committing from the code, but with autocommit and autoSoftCommit with maxDocs=1 in solrconfig.xml. But here there is a problem: if I add a new record for some type, the searcher doesn't see it immediately. This causes me to get several equal records with the same type but different ids (unique key). If I do a commit from code after each document is added it works OK, but that's not a solution. So I wanted to try to do a soft commit from code after adding documents with a non-main type. I searched the wiki documents but found only commit without parameters and commit with parameters that don't seem to be what I need. Best regards, Lyuba On Thu, Apr 12, 2012 at 6:55 PM, Mark Miller markrmil...@gmail.com wrote: On Apr 12, 2012, at 11:28 AM, Lyuba Romanchuk wrote: Hi, I need to configure the solr so that the opened searcher will see a new document immidiately after it was adding to the index. And I don't want to perform commit each time a new document is added. I tried to configure maxDocs=1 under autoSoftCommit in solrconfig.xml but it didn't help. Can you elaborate on didn't help? You couldn't find any docs unless you did an explicit commit? If that is true and there is no user error, this would be a bug. Is there way to perform soft commit from code in Solr 4.0 ? Yes - check out the wiki docs - I can't remember how it is offhand (I think it was slightly changed recently). Thank you in advance. Best regards, Lyuba - Mark Miller lucidimagination.com
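For what it's worth, my reading of the 4.0 SolrJ API is that the three-argument commit overload can issue a soft commit from code; a minimal sketch, with the server URL and field names made up:
[code]
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SoftCommitSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        doc.addField("type", "secondary");
        server.add(doc);

        // waitFlush=true, waitSearcher=true, softCommit=true:
        // makes the document visible to searchers without a full hard commit
        server.commit(true, true, true);
    }
}
[/code]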
Re: Error
The termfreq function is only valid for trunk. You're using 3.4. Since 'termfreq' is not recognized, Solr gets confused. Best Erick On Thu, Apr 12, 2012 at 10:20 AM, Abhishek tiwari abhishek.tiwari@gmail.com wrote: i am using 3.4 solr version... please assist... On Thu, Apr 12, 2012 at 8:41 PM, Erick Erickson erickerick...@gmail.comwrote: Please review: http://wiki.apache.org/solr/UsingMailingLists You haven't said whether, for instance, you're using trunk which is the only version that supports the termfreq function. Best Erick On Thu, Apr 12, 2012 at 4:08 AM, Abhishek tiwari abhishek.tiwari@gmail.com wrote: http://xyz.com:8080/newschema/mainsearch/select/?q=*%3A*version=2.2start=0rows=10indent=onsort=termfreq%28cuisine_priorities_list,%27Chinese%27%29%20desc Error : HTTP Status 400 - Missing sort order. Why i am getting error ?
Re: is there a downside to combining search fields with copyfield?
On 4/12/2012 7:27 AM, geeky2 wrote: currently, my schema has individual fields to search on. are there advantages or disadvantages to taking several of the individual search fields and combining them in to a single search field? would this affect search times, term tokenization or possibly other things. example of individual fields brand category partno example of a single combined search field part_info (would combine brand, category and partno) You end up with one multivalued field, which means that you can only have one analyzer chain. With separate fields, each field can be analyzed differently. Also, if you are indexing and/or storing the individual fields, you may have data duplication in your index, making it larger and increasing your disk/RAM requirements. That field will have a higher termcount than the individual fields, which means that searches against it will naturally be just a little bit slower. Your application will not have to do as much work to construct a query, though. If you are already planning to use dismax/edismax, then you don't need the overhead of a copyField. You can simply provide access to (e)dismax search with the qf (and possibly pf) parameters predefined, or your application can provide these parameters. http://wiki.apache.org/solr/ExtendedDisMax Thanks, Shawn
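To make the edismax suggestion concrete, here is a SolrJ sketch of querying the separate fields instead of a combined copyField; the query text and boost values are arbitrary examples:
[code]
import org.apache.solr.client.solrj.SolrQuery;

public class EdismaxAcrossFields {
    public static void main(String[] args) {
        SolrQuery query = new SolrQuery("bosch 18v drill");
        query.set("defType", "edismax");
        // search the individual fields directly, each with its own weight
        query.set("qf", "brand^2.0 category partno^3.0");
        query.set("pf", "brand category partno");
        System.out.println(query); // prints the encoded request parameters
    }
}
[/code]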
Re: Problem to integrate Solr in Jetty (the first example in the Apache Solr 3.1 Cookbook)
Thanks Shawn, I think I'll stay with the build in. I had problems with Solr Cell, but I could fix it. Greetings, Bastian Am 12. April 2012 18:02 schrieb Shawn Heisey s...@elyograg.org: Bastian, The jetty.xml included with Solr is littered with org.mortbay class references, which are appropriate for Jetty 6. Jetty 7 and 8 use the org.eclipse prefix, and from the very small amount of investigation I did a few weeks ago, have also made other changes to the package names, so you might not be able to simply replace org.mortbay with org.eclipse. The absolutely easiest option would be to just use the jetty included with Solr, not version 8. If you want to keep using Jetty 8, you will need to find/make a new jetty.xml file. If I were set on using Jetty 8 and had to make it work, I would check out trunk (Lucene/Solr 4.0) from the Apache SVN server, find the example jetty.xml there, and use it instead. It's possible that you may need to still make changes, but that is probably the path of least resistance. The jetty version has been upgraded in trunk. Another option would be to download Jetty 6, find its jetty.xml, and compare it with the one in Solr, to find out what the Lucene developers changed from default. Then you would have to take the default jetty.xml from Jetty 8 and make similar changes to make a new config. Apparently Jetty 8 no longer supports JSP with the JRE, so you're probably going to need the JDK. The developers have eliminated JSP from trunk, so it will still work with the JRE. Thanks, Shawn
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
google must not have found it - i put that in a month or so ago I believe - at least weeks. As you can see, there is still a bit to fill in, but it covers the high level. I'd like to add example snippets for the rest soon. On Thu, Apr 12, 2012 at 12:04 PM, Benson Margulies bimargul...@gmail.comwrote: On Thu, Apr 12, 2012 at 11:56 AM, Mark Miller markrmil...@gmail.com wrote: Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config Did I fail to find this in google or did I just goad you into a writing job? I'm inclined to write a JIRA asking for _version_ to be configurable just like the uniqueKey in the schema. schema.xml You must have a _version_ field defined: field name=_version_ type=long indexed=true stored=true/ On Apr 11, 2012, at 9:10 AM, Benson Margulies wrote: I didn't have a _version_ field, since nothing in the schema says that it's required! On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni dar...@ontrenet.com wrote: Hard to say why its not working for you. Start with a fresh Solr and work forward from there or back out your configs and plugins until it works again. On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote: In my cloud configuration, if I push delete query*:*/query /delete followed by: commit/ I get no errors, the log looks happy enough, but the documents remain in the index, visible to /query. Here's what seems my relevant bit of solrconfig.xml. My URP only implements processAdd. updateRequestProcessorChain name=RNI !-- some day, add parameters when we have some -- processor class=com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory/ processor class=solr.LogUpdateProcessorFactory / processor class=solr.DistributedUpdateProcessorFactory/ processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain !-- activate RNI processing by adding the RNI URP to the chain for xml updates -- requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.chainRNI/str /lst /requestHandler - Mark Miller lucidimagination.com -- - Mark http://www.lucidimagination.com
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
: Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config : : schema.xml : : You must have a _version_ field defined: : : field name=_version_ type=long indexed=true stored=true/ Seems like this is the kind of thing that should make Solr fail hard and fast on SolrCore init if it sees you are running in cloud mode and yet it doesn't find this -- similar to how some other features fail hard and fast if you don't have uniqueKey. -Hoss
Re: solr 3.4 with nTiers = 2: usage of ids param causes NullPointerException (NPE)
Dmitry, The last NPE in HighlightingComponent is just a sad coding issue. few rows later we can see that developer expected to have some docs not found // remove nulls in case not all docs were able to be retrieved rb.rsp.add(highlighting, SolrPluginUtils.removeNulls(new SimpleOrderedMap(arr))); But as you already know he forgot to check if(sdoc!=null){. Is there anything that stopping you from contributing the patch, beside of the lack of time, of course? about the core issue I can't get into it and, particularly, how the using disjunction query in place of IDS can help you. Could you please provide more detailed info like stacktraces, etc. Btw, have you checked trunk for your case? On Thu, Apr 12, 2012 at 7:08 PM, Dmitry Kan dmitry@gmail.com wrote: Can anyone help me out with this? Is this too complicated / unclear? I could share more detail if needed. On Wed, Apr 11, 2012 at 3:16 PM, Dmitry Kan dmitry@gmail.com wrote: Hello, Hopefully this question is not too complex to handle, but I'm currently stuck with it. We have a system with nTiers, that is: Solr front base --- Solr front -- shards Inside QueryComponent there is a method createRetrieveDocs(ResponseBuilder rb) which collects doc ids of each shard and sends them in different queries using the ids parameter: [code] sreq.params.add(ShardParams.IDS, StrUtils.join(ids, ',')); [/code] This actually produces NPE (same as in https://issues.apache.org/jira/browse/SOLR-1477) in the first tier, because Solr front (on the second tier) fails to process such a query. I have tried to fix this by using a unique field with a value of ids ORed (the following code substitutes the code above): [code] StringBuffer idsORed = new StringBuffer(); for (IteratorString iterator = ids.iterator(); iterator.hasNext(); ) { String next = iterator.next(); if (iterator.hasNext()) { idsORed.append(next).append( OR ); } else { idsORed.append(next); } } sreq.params.add(rb.req.getSchema().getUniqueKeyField().getName(), idsORed.toString()); [/code] This works perfectly if for rows=n there is n or less hits from a distributed query. 
However, if there are more than 2*n hits, the querying fails with an NPE in a completely different component, which is HighlightComponent (highlights are requested in the same query with hl=truehl.fragsize=5hl.requireFieldMatch=truehl.fl=targetTextField): SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:161) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:619) It sounds like the ids of documents somehow get shuffled and the instruction (only a hypothesis) [code] ShardDoc sdoc = rb.resultIds.get(id); [/code] returns sdoc=null, which causes the next line of code to fail with an NPE: [code] int idx = sdoc.positionInResponse; [/code] Am I missing anything? Can something be done for solving this issue? Thanks. -- Regards, Dmitry Kan -- Regards, Dmitry Kan -- Sincerely yours Mikhail Khludnev ge...@yandex.ru http://www.griddynamics.com mkhlud...@griddynamics.com
Re: solr 3.4 with nTiers = 2: usage of ids param causes NullPointerException (NPE)
On Wed, Apr 11, 2012 at 8:16 AM, Dmitry Kan dmitry@gmail.com wrote: We have a system with nTiers, that is: Solr front base --- Solr front -- shards Although the architecture had this in mind (multi-tier), all of the pieces are not yet in place to allow it. The errors you see are a direct result of that. -Yonik lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10
RE: solr 3.5 taking long to index
Thanks for pointing these out, but I still have one concern, why is the Virtual Memory running in 300g+? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] Sent: 12 April 2012 11:58 To: solr-user@lucene.apache.org Subject: Re: solr 3.5 taking long to index There were some changes in solrconfig.xml between solr3.1 and solr3.5. Always read CHANGES.txt when switching to a new version. Also helpful is comparing both versions of solrconfig.xml from the examples. Are you sure you need a MaxPermSize of 5g? Use jvisualvm to see what you really need. This is also for all other JAVA_OPTS. Am 11.04.2012 19:42, schrieb Rohit: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit is taking very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg
RE: Solr 3.5 takes very long to commit gradually
Thanks for pointing these out, but I still have one concern, why is the Virtual Memory running in 300g+? Regards, Rohit -Original Message- From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] Sent: 12 April 2012 13:43 To: solr-user@lucene.apache.org Subject: Re: Solr 3.5 takes very long to commit gradually thanks Rohit.. for the information. On Apr 12, 2012, at 4:08 AM, Rohit wrote: Hi Tirthankar, The average size of documents would be a few Kb's this is mostly tweets which are being saved. The two cores are storing different kind of data and nothing else. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] Sent: 12 April 2012 13:14 To: solr-user@lucene.apache.org Subject: Re: Solr 3.5 takes very long to commit gradually Hi Rohit, What would be the average size of your documents and also can you please share your idea of having 2 cores in the master. I just wanted to know the reasoning behind the design. Thanks in advance Tirthankar On Apr 12, 2012, at 3:19 AM, Jan Høydahl wrote: What operating system? Are you using spellchecker with buildOnCommit? Anything special in your Update Chain? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 06:45, Rohit wrote: We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores, 1) Core1 - 44555972 documents 2) Core2 - 29419244 documents We commit every 5000 documents, but lately the commit time gradually increase and solr is taking as very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is, WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version. Memory details: export JAVA_OPTS=$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g Solr Config: useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor ramBufferSizeMB32/ramBufferSizeMB !-- maxBufferedDocs1000/maxBufferedDocs -- maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout Also noticed, that top command show almost 350GB of Virtual memory usage. What could be causing this, as everything was running fine a few days back? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. * **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you. *
Re: term frequency outweighs exact phrase match
In that case documents 1 and 2 will not be in the results. We need them also be shown in the results but be ranked after those docs with exact match. I think omitting term frequency in calculating ranking in phrase queries will solve this issue, but I do not see that such a parameter in configs. I see omitTermFreqAndPositions=true but not sure if it is the setting I need, because its description is too vague. Thanks. Alex. -Original Message- From: Erick Erickson erickerick...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Wed, Apr 11, 2012 8:23 am Subject: Re: term frequency outweighs exact phrase match Consider boosting on phrase with a SHOULD clause, something like field:apache solr^2.. Best Erick On Tue, Apr 10, 2012 at 12:46 PM, alx...@aim.com wrote: Hello, I use solr 3.5 with edismax. I have the following issue with phrase search. For example if I have three documents with content like 1.apache apache 2. solr solr 3.apache solr then search for apache solr displays documents in the order 1,.2,3 instead of 3, 2, 1 because term frequency in the first and second documents is higher than in the third document. We want results be displayed in the order as 3,2,1 since the third document has exact match. My request handler is as follows. requestHandler name=search class=solr.SearchHandler lst name=defaults str name=defTypeedismax/str str name=echoParamsexplicit/str float name=tie0.01/float str name=qfhost^30 content^0.5 title^1.2/str str name=pfhost^30 content^20 title^22 /str str name=flurl,id, site ,title/str str name=mm2lt;-1 5lt;-2 6lt;90%/str int name=ps1/int bool name=hltrue/bool str name=q.alt*:*/str str name=hl.flcontent/str str name=f.title.hl.fragsize0/str str name=hl.fragsize165/str str name=f.title.hl.alternateFieldtitle/str str name=f.url.hl.fragsize0/str str name=f.url.hl.alternateFieldurl/str str name=f.content.hl.fragmenterregex/str str name=spellchecktrue/str str name=spellcheck.collatetrue/str str name=spellcheck.count5/str str name=grouptrue/str str name=group.fieldsite/str str name=group.ngroupstrue/str /lst arr name=last-components strspellcheck/str /arr /requestHandler Any ideas how to fix this issue? Thanks in advance. Alex.
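For reference, omitTermFreqAndPositions is set per field (or per field type) in schema.xml; a minimal sketch, assuming a content field of a hypothetical text_general type:

    <!-- stops term frequency from contributing to the score, but also discards positions -->
    <field name="content" type="text_general" indexed="true" stored="true"
           omitTermFreqAndPositions="true"/>

The catch is that without positions, phrase queries (including the pf boosts in the handler above) will no longer work against that field. Erick's alternative of adding the phrase as an optional boosting clause, e.g. content:"apache solr"^2 alongside the plain terms, keeps positions intact and only changes the ranking.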
Re: solr 3.4 with nTiers = 2: usage of ids param causes NullPointerException (NPE)
Mikhail, Thanks for sharing your thoughts. Yes I have tried checking for NULL and the entire chain of queries between tiers seems to work. But I suspect, that some docs will be missing. In principle, unless there is an OutOfMemory or a shard down, the doc ids should be retrieving valid documents. So this is just a design, as Yonik pointed out. I would be willing to contribute a patch, it is just an issue of understanding what exactly should be fixed in the architecture, and I suspect it isn't a small change.. Dmitry On Thu, Apr 12, 2012 at 9:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Dmitry, The last NPE in HighlightingComponent is just a sad coding issue. few rows later we can see that developer expected to have some docs not found // remove nulls in case not all docs were able to be retrieved rb.rsp.add(highlighting, SolrPluginUtils.removeNulls(new SimpleOrderedMap(arr))); But as you already know he forgot to check if(sdoc!=null){. Is there anything that stopping you from contributing the patch, beside of the lack of time, of course? about the core issue I can't get into it and, particularly, how the using disjunction query in place of IDS can help you. Could you please provide more detailed info like stacktraces, etc. Btw, have you checked trunk for your case? On Thu, Apr 12, 2012 at 7:08 PM, Dmitry Kan dmitry@gmail.com wrote: Can anyone help me out with this? Is this too complicated / unclear? I could share more detail if needed. On Wed, Apr 11, 2012 at 3:16 PM, Dmitry Kan dmitry@gmail.com wrote: Hello, Hopefully this question is not too complex to handle, but I'm currently stuck with it. We have a system with nTiers, that is: Solr front base --- Solr front -- shards Inside QueryComponent there is a method createRetrieveDocs(ResponseBuilder rb) which collects doc ids of each shard and sends them in different queries using the ids parameter: [code] sreq.params.add(ShardParams.IDS, StrUtils.join(ids, ',')); [/code] This actually produces NPE (same as in https://issues.apache.org/jira/browse/SOLR-1477) in the first tier, because Solr front (on the second tier) fails to process such a query. I have tried to fix this by using a unique field with a value of ids ORed (the following code substitutes the code above): [code] StringBuffer idsORed = new StringBuffer(); for (IteratorString iterator = ids.iterator(); iterator.hasNext(); ) { String next = iterator.next(); if (iterator.hasNext()) { idsORed.append(next).append( OR ); } else { idsORed.append(next); } } sreq.params.add(rb.req.getSchema().getUniqueKeyField().getName(), idsORed.toString()); [/code] This works perfectly if for rows=n there is n or less hits from a distributed query. 
However, if there are more than 2*n hits, the querying fails with an NPE in a completely different component, which is HighlightComponent (highlights are requested in the same query with hl=truehl.fragsize=5hl.requireFieldMatch=truehl.fl=targetTextField): SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:161) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at
Wildcard searching
Hi, I am using the edismax query handler with solr 3.5. From the Solr admin interface when i do a wildcard search with the string: edge*, all documents are returned with exactly the same score. When i do the same search from my application using SolrJ to the same solr instance, only a few documents have the same maximum score and all the rest have the minimum score. I was expecting all to have the same score just like in the Solr Admin. Any pointers why this is happening? Thanks.
Re: solr 3.4 with nTiers = 2: usage of ids param causes NullPointerException (NPE)
Thanks Yonik, This is what I expected. How big the change would be, if I'd start just with Query and Highlight components? Did the change to QueryComponent I made make any sense to you? It would of course mean a custom solution, which I'm willing to contribute as a patch (in case anyone interested). To make it part of a releasable trunk, one would most probably need to provide some way to configure 1st tier level. Thanks, Dmitry On Thu, Apr 12, 2012 at 9:34 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Wed, Apr 11, 2012 at 8:16 AM, Dmitry Kan dmitry@gmail.com wrote: We have a system with nTiers, that is: Solr front base --- Solr front -- shards Although the architecture had this in mind (multi-tier), all of the pieces are not yet in place to allow it. The errors you see are a direct result of that. -Yonik lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10 -- Regards, Dmitry Kan
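As an aside, the workaround snippet Dmitry posted earlier in this thread lost its generics in the archive (IteratorString should read Iterator<String>). Restored, the idea of swapping the ids parameter for an OR over the unique key looks like this (a fragment from inside QueryComponent.createRetrieveDocs, as in the original post):

    // requires java.util.Iterator; ids, sreq and rb come from the surrounding method
    StringBuffer idsORed = new StringBuffer();
    for (Iterator<String> iterator = ids.iterator(); iterator.hasNext(); ) {
        String next = iterator.next();
        if (iterator.hasNext()) {
            idsORed.append(next).append(" OR ");
        } else {
            idsORed.append(next);
        }
    }
    sreq.params.add(rb.req.getSchema().getUniqueKeyField().getName(), idsORed.toString());

As Yonik says above, this only changes how the second-tier request is phrased; the remaining multi-tier gaps (the HighlightComponent NPE among them) are elsewhere.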
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
I think someone already made a JIRA issue like that. I think Yonik might have had an opinion about it that I cannot remember right now. On Thu, Apr 12, 2012 at 2:21 PM, Chris Hostetter hossman_luc...@fucit.orgwrote: : Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config : : schema.xml : : You must have a _version_ field defined: : : field name=_version_ type=long indexed=true stored=true/ Seems like this is the kind of thing that should make Solr fail hard and fast on SolrCore init if it sees you are running in cloud mode and yet it doesn't find this -- similar to how some other features fail hard and fast if you don't have uniqueKey. -Hoss -- - Mark http://www.lucidimagination.com
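For reference, the required config quoted above reads like this once its markup is restored, together with the update log that the later messages in this thread mention (the dir value is the stock example placeholder):

    <!-- schema.xml -->
    <field name="_version_" type="long" indexed="true" stored="true"/>

    <!-- solrconfig.xml -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.data.dir:}</str>
      </updateLog>
    </updateHandler>

Per Yonik's note further down, _version_ backs update distribution from leaders to replicas as well as realtime-get and optimistic locking.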
Re: Wildcard searching
Correction, this difference between Solr admin scores and SolrJ scores happens with leading wildcard queries, e.g. *edge. On Thu, Apr 12, 2012 at 8:13 PM, Kissue Kissue kissue...@gmail.com wrote: Hi, I am using the edismax query handler with solr 3.5. From the Solr admin interface when i do a wildcard search with the string: edge*, all documents are returned with exactly the same score. When i do the same search from my application using SolrJ to the same solr instance, only a few documents have the same maximum score and all the rest have the minimum score. I was expecting all to have the same score just like in the Solr Admin. Any pointers why this is happening? Thanks.
Re: is there a downside to combining search fields with copyfield?
You end up with one multivalued field, which means that you can only have one analyzer chain. actually two of the three fields being considered for combination in to a single field ARE multivalued fields. would this be an issue? With separate fields, each field can be analyzed differently. Also, if you are indexing and/or storing the individual fields, you may have data duplication in your index, making it larger and increasing your disk/RAM requirements. this makes sense That field will have a higher termcount than the individual fields, which means that searches against it will naturally be just a little bit slower. ok Your application will not have to do as much work to construct a query, though. actually this is the primary reason this came up. If you are already planning to use dismax/edismax, then you don't need the overhead of a copyField. You can simply provide access to (e)dismax search with the qf (and possibly pf) parameters predefined, or your application can provide these parameters. http://wiki.apache.org/solr/ExtendedDisMax can you elaborate on this and how EDisMax would preclude the need for copyfield? i am using extended dismax now in my response handlers. here is an example of one of my requestHandlers requestHandler name=partItemNoSearch class=solr.SearchHandler default=false lst name=defaults str name=defTypeedismax/str str name=echoParamsall/str int name=rows5/int str name=qfitemNo^1.0/str str name=q.alt*:*/str /lst lst name=appends str name=fqitemType:1/str str name=sortrankNo asc, score desc/str /lst lst name=invariants str name=facetfalse/str /lst /requestHandler Thanks, Shawn -- View this message in context: http://lucene.472066.n3.nabble.com/is-there-a-downside-to-combining-search-fields-with-copyfield-tp3905349p3906265.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Suggester not working for digit starting terms
Well now I am really lost... 1. yes I want to suggest whole sentences too, I want the tokenizer to be taken into account, and apparently it is working for me in 3.5.0?? I get suggestions that are like foo bar abc. Maybe what you mention is only for file based dictionaries? I am using the field itself. 2. but for the digit issue, in that case nothing is suggested, not even the term 500 that is there cause I can find it with this query http://localhost:8983/solr/select/?q={!prefix f=a_suggest}500 I tried to set threshold to 0 in case the term was being removed, and is not that. Moving to 3.6.0 is not a problem (I had already downloaded the rc actually) but I still see weird things here. xab -- View this message in context: http://lucene.472066.n3.nabble.com/Suggester-not-working-for-digit-starting-terms-tp3893433p3906303.html Sent from the Solr - User mailing list archive at Nabble.com.
searching across multiple fields using edismax - am i setting this up right?
hello all, i just want to check to make sure i have this right. i was reading on this page: http://wiki.apache.org/solr/ExtendedDisMax, thanks to shawn for educating me. *i want the user to be able to fire a requestHandler but search across multiple fields (itemNo, productType and brand) WITHOUT them having to specify in the query url what fields they want / need to search on* this is what i have in my request handler requestHandler name=partItemNoSearch class=solr.SearchHandler default=false lst name=defaults str name=defTypeedismax/str str name=echoParamsall/str int name=rows5/int *str name=qfitemNo^1.0 productType^.8 brand^.5/str* str name=q.alt*:*/str /lst lst name=appends str name=sortrankNo asc, score desc/str /lst lst name=invariants str name=facetfalse/str /lst /requestHandler this would be an example of a single term search going against all three of the fields http://bogus:bogus/somecore/select?qt=partItemNoSearchq=*dishwasher*debugQuery=onrows=100 this would be an example of a multiple term search across all three of the fields http://bogus:bogus/somecore/select?qt=partItemNoSearchq=*dishwasher 123-xyz*debugQuery=onrows=100 do i understand this correctly? thank you, mark -- View this message in context: http://lucene.472066.n3.nabble.com/searching-across-multiple-fields-using-edismax-am-i-setting-this-up-right-tp3906334p3906334.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Responding to Requests with Chunks/Streaming
Hello Developers, I just want to ask don't you think that response streaming can be useful for things like OLAP, e.g. is you have sharded index presorted and pre-joined by BJQ way you can calculate counts in many cube cells in parallel? Essential distributed test for response streaming just passed. https://github.com/m-khl/solr-patches/blob/ec4db7c0422a5515392a7019c5bd23ad3f546e4b/solr/core/src/test/org/apache/solr/response/RespStreamDistributedTest.java branch is https://github.com/m-khl/solr-patches/tree/streaming Regards On Mon, Apr 2, 2012 at 10:55 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello, Small update - reading streamed response is done via callback. No SolrDocumentList in memory. https://github.com/m-khl/solr-patches/tree/streaming here is the test https://github.com/m-khl/solr-patches/blob/d028d4fabe0c20cb23f16098637e2961e9e2366e/solr/core/src/test/org/apache/solr/response/ResponseStreamingTest.java#L138 no progress in distributed search via streaming yet. Pls let me know if you don't want to have updates from my playground. Regards On Thu, Mar 29, 2012 at 1:02 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: @All Why nobody desires such a pretty cool feature? Nicholas, I have a tiny progress: I'm able to stream in javabin codec format while searching, It implies sorting by _docid_ here is the diff https://github.com/m-khl/solr-patches/commit/2f9ff068c379b3008bb983d0df69dff714ddde95 The current issue is that reading response by SolrJ is done as whole. Reading by callback is supported by EmbeddedServer only. Anyway it should not a big deal. ResponseStreamingTest.java somehow works. I'm stuck on introducing response streaming in distributes search, it's actually more challenging - RespStreamDistributedTest fails Regards On Fri, Mar 16, 2012 at 3:51 PM, Nicholas Ball nicholas.b...@nodelay.com wrote: Mikhail Ludovic, Thanks for both your replies, very helpful indeed! Ludovic, I was actually looking into just that and did some tests with SolrJ, it does work well but needs some changes on the Solr server if we want to send out individual documents a various times. This could be done with a write() and flush() to the FastOutputStream (daos) in JavBinCodec. I therefore think that a combination of this and Mikhail's solution would work best! Mikhail, you mention that your solution doesn't currently work and not sure why this is the case, but could it be that you haven't flushed the data (os.flush()) you've written in the collect method of DocSetStreamer? I think placing the output stream into the SolrQueryRequest is the way to go, so that we can access it and write to it how we intend. However, I think using the JavaBinCodec would be ideal so that we can work with SolrJ directly, and not mess around with the encoding of the docs/data etc... At the moment the entry point to JavaBinCodec is through the BinaryResponseWriter which calls the highest level marshal() method which decodes and sends out the entire SolrQueryResponse (line 49 @ BinaryResponseWriter). What would be ideal is to be able to break up the response and call the JavaBinCodec for pieces of it with a flush after each call. Did a few tests with a simple Thread.sleep and a flush to see if this would actually work and looks like it's working out perfectly. Just trying to figure out the best way to actually do it now :) any ideas? An another note, for a solution to work with the chunked transfer encoding (and therefore web browsers), a lot more development is going to be needed. 
Not sure if it's worth trying yet but might look into it later down the line. Nick On Fri, 16 Mar 2012 07:29:20 +0300, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Ludovic, I looked through. First of all, it seems to me you don't amend regular servlet solr server, but the only embedded one. Anyway, the difference is that you stream DocList via callback, but it means that you've instantiated it in memory and keep it there until it will be completely consumed. Think about a billion numfound. Core idea of my approach is keep almost zero memory for response. Regards On Fri, Mar 16, 2012 at 12:12 AM, lboutros boutr...@gmail.com wrote: Hi, I was looking for something similar. I tried this patch : https://issues.apache.org/jira/browse/SOLR-2112 it's working quite well (I've back-ported the code in Solr 3.5.0...). Is it really different from what you are trying to achieve ? Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/Responding-to-Requests-with-Chunks-Streaming-tp3827316p3829909.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev ge...@yandex.ru http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev ge...@yandex.ru
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
On Thu, Apr 12, 2012 at 2:21 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Please see the documentation: http://wiki.apache.org/solr/SolrCloud#Required_Config : : schema.xml : : You must have a _version_ field defined: : : field name=_version_ type=long indexed=true stored=true/ Seems like this is the kind of thing that should make Solr fail hard and fast on SolrCore init if it sees you are running in cloud mode and yet it doesn't find this -- similar to how some other features fail hard and fast if you don't have uniqueKey. Off the top of my head: _version_ is needed for solr cloud where a leader forwards updates to replicas, unless you're handing update distribution yourself or providing pre-built shards. _version_ is needed for realtime-get and optimistic locking We should document for sure... but at this point it's not clear what we should enforce. (not saying we shouldn't enforce anything... just that I haven't really thought about it) -Yonik lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10
[ANNOUNCE] Apache Solr 3.6 released
12 April 2012, Apache Solr™ 3.6.0 available The Lucene PMC is pleased to announce the release of Apache Solr 3.6.0. Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites. This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html (see note below). See the CHANGES.txt file included with the release for a full list of details. Solr 3.6.0 Release Highlights: * New SolrJ client connector using Apache Http Components http client (SOLR-2020) * Many analyzer factories are now multi term query aware allowing for things like field type aware lowercasing when building prefix wildcard queries. (SOLR-2438) * New Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation. (SOLR-3056) * Range Faceting (Dates Numbers) is now supported in distributed search (SOLR-1709) * HTMLStripCharFilter has been completely re-implemented, fixing many bugs and greatly improving the performance (LUCENE-3690) * StreamingUpdateSolrServer now supports the javabin format (SOLR-1565) * New LFU Cache option for use in Solr's internal caches. (SOLR-2906) * Memory performance improvements to all FST based suggesters (SOLR-2888) * New WFSTLookupFactory suggester supports finer-grained ranking for suggestions. (LUCENE-3714) * New options for configuring the amount of concurrency used in distributed searches (SOLR-3221) * Many bug fixes Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy searching, Lucene/Solr developers
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
: Off the top of my head: : _version_ is needed for solr cloud where a leader forwards updates to : replicas, unless you're handing update distribution yourself or : providing pre-built shards. : _version_ is needed for realtime-get and optimistic locking : : We should document for sure... but at this point it's not clear what : we should enforce. (not saying we shouldn't enforce anything... just : that I haven't really thought about it) well ... it may eventually make sense to globally enforce it for consistency, but in the meantime the individual components that depend on it can certainly enforce it (just like my uniqueKey example; the search components that require it check for themselves on init and fail fast) (ie: sounds like the RealTimeGetHandler and the existing DistributedUpdateProcessor should fail fast on init if the schema doesn't have it) -Hoss
RE: [ANNOUNCE] Apache Solr 3.6 released
I think this page needs updating... it says it's not out yet. https://wiki.apache.org/solr/Solr3.6 -Original Message- From: Robert Muir [mailto:rm...@apache.org] Sent: Thursday, April 12, 2012 1:33 PM To: d...@lucene.apache.org; solr-user@lucene.apache.org; Lucene mailing list; announce Subject: [ANNOUNCE] Apache Solr 3.6 released
Re: [ANNOUNCE] Apache Solr 3.6 released
Hi, Just edit it! its a wiki page anyone can edit! There are probably other out of date ones too On Thu, Apr 12, 2012 at 5:57 PM, Robert Petersen rober...@buy.com wrote: I think this page needs updating... it says it's not out yet. https://wiki.apache.org/solr/Solr3.6 -- lucidimagination.com
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
I'm probably confused, but it seems to me that the case I hit does not meet any of Yonik's criteria. I have no replicas. I'm running SolrCloud in the simple mode where each doc ends up in exactly one place. I think that it's just a bug that the code refuses to do the local deletion when there's no version info. However, if I am confused, it sure seems like a candidate for the 'at least throw instead of failing silently' policy.
Re: codecs for sorted indexes
Hello Michael, Yes, we are pre-sorting the documents before adding them to the index. We have a score associated to every document (not an IR score but a document-related score that reflects its importance). Therefore, the document with the biggest score will have the lowest docid (we add it first to the index). We do this in order to apply early termination effectively. With the actual coded, we haven't seen much of a difference in terms of space when we have the index sorted vs not sorted. So, the question would be: if we force the docids to be sorted, what is the best way to encode them?. We don't really care if the codec doesn't work for cases where the documents are not sorted (i.e. if it throws an exception if documents are not ordered when creating the index). Our idea here is that it may be possible to trade off generality but achieve very significant improvements for the specific case. Would something along the lines of RLE coding work? i.e. if we have to store docids 1 to 1500, we can represent it as 1::1499 (it would be 2 ints to represent 1500 docids). Thanks a lot for your help, Carlos On Thu, Apr 12, 2012 at 6:19 PM, Michael McCandless luc...@mikemccandless.com wrote: Do you mean you are pre-sorting the documents (by what criteria?) yourself, before adding them to the index? In which case... you should already be seeing some benefits (smaller index size) than had you randomly added them (ie the vInts should take fewer bytes), I think. (Probably the savings would be greater for better intblock codecs like PForDelta, SimpleX, but I'm not sure...). Or do you mean having a codec re-sort the documents (on flush/merge)? I think this should be possible w/ the Codec API... but nobody has tried it yet that I know of. Note that the bulkpostings branch is effectively dead (nobody is iterating on it, and we've removed the old bulk API from trunk), but there is likely a GSoC project to add a PForDelta codec to trunk: https://issues.apache.org/jira/browse/LUCENE-3892 Mike McCandless http://blog.mikemccandless.com On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello, We're using a sorted index in order to implement early termination efficiently over an index of hundreds of millions of documents. As of now, we're using the default codecs coming with Lucene 4, but we believe that due to the fact that the docids are sorted, we should be able to do much better in terms of storage and achieve much better performance, especially decompression performance. In particular, Robert Muir is commenting on these lines here: https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411 We're aware that the in the bulkpostings branch there are different codecs being implemented and different experiments being done. We don't know whether we should implement our own codec (i.e. using some RLE-like techniques) or we should use one of the codecs implemented there (PFOR, Simple64, ...). Can you please give us some advice on this? Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
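As a rough illustration of the run-length idea Carlos describes (a standalone sketch of the encoding only, not a Lucene codec), a sorted, duplicate-free postings block can be written as (start, length) pairs, so a dense run like docids 1 to 1500 collapses to two ints:

    // toy RLE over a sorted docid array; needs java.util.List and java.util.ArrayList
    static List<int[]> runLengthEncode(int[] sortedDocIds) {
        List<int[]> runs = new ArrayList<int[]>();
        int i = 0;
        while (i < sortedDocIds.length) {
            int start = sortedDocIds[i];
            int len = 1;
            // extend the run while the next docid is consecutive
            while (i + len < sortedDocIds.length && sortedDocIds[i + len] == start + len) {
                len++;
            }
            runs.add(new int[] { start, len });
            i += len;
        }
        return runs;
    }

For docids 1..1500 this yields the single pair {1, 1500}; how much it saves in practice depends entirely on how dense the runs are after sorting, which matches the later reply in this thread about grouping related documents so the deltas stay small.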
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
On Thu, Apr 12, 2012 at 2:14 PM, Mark Miller markrmil...@gmail.com wrote: google must not have found it - i put that in a month or so ago I believe - at least weeks. As you can see, there is still a bit to fill in, but it covers the high level. I'd like to add example snippets for the rest soon. Mark, is it all true? I don't have an update log or a replication handler, and neither does the default, and it all works fine in the simple case from the top of that wiki page.
Re: is there a downside to combining search fields with copyfield?
On 4/12/2012 1:37 PM, geeky2 wrote: can you elaborate on this and how EDisMax would preclude the need for copyfield? i am using extended dismax now in my response handlers. here is an example of one of my requestHandlers requestHandler name=partItemNoSearch class=solr.SearchHandler default=false lst name=defaults str name=defTypeedismax/str str name=echoParamsall/str int name=rows5/int str name=qfitemNo^1.0/str str name=q.alt*:*/str /lst lst name=appends str name=fqitemType:1/str str name=sortrankNo asc, score desc/str /lst lst name=invariants str name=facetfalse/str /lst /requestHandler I'm not sure whether or not you can use a multiValued field as the source for copyField. This is the sort of thing that the devs tend to think of, so my initial thought would be that it should work, though I would definitely test it to be absolutely sure. Your request handler above has qf set to include the field called itemNo. If you made another that had the following in it, you could do without a copyField, by using that request handler. You would want to customize the field boosts: str name=qfbrand^2.0 category^3.0 partno/str To really leverage edismax, assuming that you are using a tokenizer that splits any of these fields into multiple tokens, and that you want to use relevancy ranking, you might want to consider defining pf as well. Some observations about your handler above... you are free to ignore this: I believe that you don't really need the ^1.0 that's in qf, because there's only one field, and 1.0 is the default boost. Also, from what I can tell, because you are only using one qf field and are not using any of the dismax-specific goodies like pf or mm, you don't really need edismax at all here. If I'm right, to remove edismax, just specify itemNo as the value for the df parameter (default field) and remove the defType. The q.alt parameter might also need to come out. Solr 3.6 (should be released soon) has deprecated the defaultSearchField and defaultOperator parameters in schema.xml, the df and q.op handler parameters are the replacement. This will be enforced in Solr 4.0. http://wiki.apache.org/solr/SearchHandler#Query_Params Thanks, Shawn
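A sketch of the single-field simplification Shawn describes at the end, using the same handler and field names from the earlier post (whether q.alt stays depends on whether the match-all fallback is still wanted, and this assumes the df default is honored as Shawn describes):

    <requestHandler name="partItemNoSearch" class="solr.SearchHandler" default="false">
      <lst name="defaults">
        <str name="echoParams">all</str>
        <int name="rows">5</int>
        <str name="df">itemNo</str>   <!-- replaces defType=edismax plus qf=itemNo^1.0 -->
      </lst>
      <lst name="appends">
        <str name="fq">itemType:1</str>
        <str name="sort">rankNo asc, score desc</str>
      </lst>
      <lst name="invariants">
        <str name="facet">false</str>
      </lst>
    </requestHandler>

For the multi-field case, keeping edismax with a qf along the lines of Shawn's brand/category/partno example is the simpler route, since df only names a single field.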
Re: Solr Scoring
No, I don't think there's an OOB way to make this happen. It's a recurring theme, make exact matches score higher than stemmed matches. Best Erick On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue kissue...@gmail.com wrote: Hi, I have a field in my index called itemDesc which i am applying EnglishMinimalStemFilterFactory to. So if i index a value to this field containing Edges, the EnglishMinimalStemFilterFactory applies stemming and Edges becomes Edge. Now when i search for Edges, documents with Edge score better than documents with the actual search word - Edges. Is there a way i can make documents with the actual search word in this case Edges score better than document with Edge? I am using Solr 3.5. My field definition is shown below: fieldType name=text_en class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer /fieldType Thanks.
Re: Solr Scoring
It is easy. Create two fields, text_exact and text_stem. Don't use the stemmer in the first chain, do use the stemmer in the second. Give the text_exact a bigger weight than text_stem. wunder On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote: No, I don't think there's an OOB way to make this happen. It's a recurring theme, make exact matches score higher than stemmed matches. Best Erick On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue kissue...@gmail.com wrote: Hi, I have a field in my index called itemDesc which i am applying EnglishMinimalStemFilterFactory to. So if i index a value to this field containing Edges, the EnglishMinimalStemFilterFactory applies stemming and Edges becomes Edge. Now when i search for Edges, documents with Edge score better than documents with the actual search word - Edges. Is there a way i can make documents with the actual search word in this case Edges score better than document with Edge? I am using Solr 3.5. My field definition is shown below: fieldType name=text_en class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer /fieldType Thanks.
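A sketch of the two-field setup Walter describes, applied to the itemDesc case above (names are made up; text_en_exact would be the posted text_en chain with the EnglishMinimalStemFilterFactory line removed):

    <field name="itemDescStem"  type="text_en"       indexed="true" stored="false"/>
    <field name="itemDescExact" type="text_en_exact" indexed="true" stored="false"/>
    <copyField source="itemDesc" dest="itemDescStem"/>
    <copyField source="itemDesc" dest="itemDescExact"/>

Then weight the exact field higher at query time, for example qf=itemDescExact^4 itemDescStem with (e)dismax. A document containing Edges picks up the extra exact-field match and outscores one that only matches through the stem Edge, while both still match.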
Re: two structures in solr
You have to take off your DB hat when using Solr G... There is no problem at all having documents in the same index that are of different types. There is no penalty for field definitions that aren't used. That is, you can easily have two different types of documents in the same index. It's all about simply populating the two types of documents with different fields. in your case, I suspect you'll have a type field with two valid values, project and contractor or some such. Then just attach a filter query depending on what you want, i.e. fq=type:project or fq=type:contractor and your searches will be restricted to the proper documents. Best Erick On Thu, Apr 12, 2012 at 5:41 AM, tkoomzaaskz tomasz.du...@gmail.com wrote: Hi all, I'm a solr newbie, so sorry if I do anything wrong ;) I want to use SOLR not only for fast text search, but mainly to create a very fast search engine for a high-traffic system (MySQL would not do the job if the db grows too big). I need to store *two big structures* in SOLR: projects and contractors. Contractors will search for available projects and project owners will search for contractors who would do it for them. So far, I have found a solr tutorial for newbies http://www.solrtutorial.com, where I found the schema file which defines the data structure: http://www.solrtutorial.com/schema-xml.html. But my case is that *I want to have two structures*. I guess running two parallel solr instances is not the idea. I took a look at http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup and I can see that the schema goes like: ?xml version=1.0 encoding=UTF-8 ? schema name=example version=1.5 types ... /types fields field name=id type=string indexed=true stored=true required=true / field name=sku type=text_en_splitting_tight indexed=true stored=true omitNorms=true/ field name=name type=text_general indexed=true stored=true/ field name=alphaNameSort type=alphaOnlySort indexed=true stored=false/ ... /fields /schema But still, this is a single structure. And I need 2. Great thanks in advance for any help. There are not many tutorials for SOLR in the web. -- View this message in context: http://lucene.472066.n3.nabble.com/two-structures-in-solr-tp3905143p3905143.html Sent from the Solr - User mailing list archive at Nabble.com.
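A sketch of the single-schema layout Erick describes, with hypothetical field names (fields a given document type doesn't use are simply left unset and cost nothing):

    <field name="id"     type="string"       indexed="true" stored="true" required="true"/>
    <field name="type"   type="string"       indexed="true" stored="true"/>
    <field name="title"  type="text_general" indexed="true" stored="true"/>
    <field name="skills" type="text_general" indexed="true" stored="true" multiValued="true"/>

Index each project with type=project and each contractor with type=contractor, then filter per use case: /select?q=plumbing&fq=type:project for contractors browsing projects, and /select?q=plumbing&fq=type:contractor for project owners looking for people. The filter query is cached separately, so the restriction stays cheap.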
Re: solr 3.5 taking long to index
On 4/12/2012 12:42 PM, Rohit wrote: Thanks for pointing these out, but I still have one concern, why is the Virtual Memory running in 300g+? Solr 3.5 uses MMapDirectoryFactory by default to read the index. This does an mmap on the files that make up your index, so their entire contents are simply accessible to the application as virtual memory (over 300GB in your case); the OS automatically takes care of swapping disk pages in and out of real RAM as required. This approach has less overhead and tends to make better use of the OS disk cache than other methods. It does lead to confused questions and scary numbers in memory usage reporting, though. You have mentioned that you are giving 36GB of RAM to Solr. How much total RAM does the machine have? Thanks, Shawn
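For anyone wanting to confirm or change this, the implementation is picked by the directoryFactory element in solrconfig.xml; a sketch along the lines of the stock example (the system-property indirection is just the example's convention):

    <!-- the default resolves to a memory-mapped directory on 64-bit JVMs, hence the large
         virtual size reported by top; naming solr.MMapDirectoryFactory or another factory
         explicitly pins the choice -->
    <directoryFactory name="DirectoryFactory"
                      class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

Either way, the 300GB+ figure is mapped address space, not resident memory, so on its own it is not a problem.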
Re: Dismax request handler differences Between Solr Version 3.5 and 1.4
Then I suspect your solrconfig is different or you're using a *slightly* different URL. When you specify defType=dismax, you're NOT going to the dismax requestHandler. You're specifying a dismax style parser, and Solr expects that you're going to provide all the parameters on the URL. To whit: qf. If you add qf=field1 field2 field3... you'll see output. I found this extremely confusing when I started using Solr. If you use qt=dismax, _then_ you're specifying that you should use the requestHandler defined in your solrconfig.xml _named_ dismax. And this kind of thing was changed because it was so confusing, but I suspect your 3.5 installation is not quite the same URL. I think 3.5 was changed to use the default field in this case. BTW, 3.6 has just been released, if you're upgrading anyway you might want to jump to 3.6 Best Erick On Thu, Apr 12, 2012 at 6:08 AM, mechravi25 mechrav...@yahoo.co.in wrote: Hi, We are currently using solr (version 1.4.0.2010.01.13.08.09.44). we have a strange situation in dismax request handler. when we search for a keyword and append qt=dismax, we are not getting the any results. The solr request is as follows: http://local:8983/solr/core2/select/?q=Bankversion=2.2start=0rows=10indent=ondefType=dismaxdebugQuery=on The Response is as follows : result name=response numFound=0 start=0 / - lst name=debug str name=rawquerystringBank/str str name=querystringBank/str str name=parsedquery+() ()/str str name=parsedquery_toString+() ()/str lst name=explain / str name=QParserDisMaxQParser/str null name=altquerystring / null name=boostfuncs / - lst name=timing double name=time0.0/double - lst name=prepare double name=time0.0/double - lst name=org.apache.solr.handler.component.QueryComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.FacetComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.MoreLikeThisComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.HighlightComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.StatsComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.DebugComponent double name=time0.0/double /lst /lst - lst name=process double name=time0.0/double - lst name=org.apache.solr.handler.component.QueryComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.FacetComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.MoreLikeThisComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.HighlightComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.StatsComponent double name=time0.0/double /lst - lst name=org.apache.solr.handler.component.DebugComponent double name=time0.0/double /lst /lst /lst /lst /response We are currently testing the Solr Version 3.5, But the same is working fine in that version. Also the Query alternative params are not working properly in SOlr 1.5 when compared with version 3.5. The request seems to be the same, but dono where its making the issue. Please help me out. Thanks i advance. Regards, Sivaganesh siva_srm...@yahoo.co.in -- View this message in context: http://lucene.472066.n3.nabble.com/Dismax-request-handler-differences-Between-Solr-Version-3-5-and-1-4-tp3905192p3905192.html Sent from the Solr - User mailing list archive at Nabble.com.
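Side by side, the two requests Erick is distinguishing look roughly like this (the field names after qf are placeholders):

    # dismax as a query parser: every dismax parameter, qf above all, must come on the URL
    http://local:8983/solr/core2/select?q=Bank&defType=dismax&qf=name^2 description

    # dismax as a named requestHandler: the defaults in solrconfig.xml supply qf and friends
    http://local:8983/solr/core2/select?q=Bank&qt=dismax

With defType=dismax and no qf from anywhere, the parser has nothing to search against, which is exactly the empty parsed query (+() ()) and numFound=0 shown in the debug output above.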
Re: Further questions about behavior in ReversedWildcardFilterFactory
There is special handling build into Solr (but not Lucene I don't think) that deals with the reversed case, that's probably the source of your differences. Leading wildcards are extremely painful if you don't do some trick like Solr does with the reversed stuff. In order to run, you have to spin through _every_ term in the field to see which ones match. It won't be performant on any very large index. So I would stick with using the Solr stuff unless you have a specific need to do things at the Lucene level. In which case I'd look carefully at the Solr implementation to see what I could glean from that implementation. Best Erick On Thu, Apr 12, 2012 at 8:01 AM, neosky neosk...@yahoo.com wrote: I ask the question in http://lucene.472066.n3.nabble.com/A-little-onfusion-with-maxPosAsterisk-tt3889226.html However, when I do some implementation, I get a further questions. 1. Suppose I don't use ReversedWildcardFilterFactory in the index time, it seems that Solr doesn't allow the leading wildcard search, it will return the error: org.apache.lucene.queryParser.ParseException: Cannot parse 'sequence:*A*': '*' or '?' not allowed as first character in WildcardQuery But when I use the ReversedWildcardFilterFactory, I can use the *A* in the query. But as I know, the ReversedWildcardFilterFactory should work in the index part, should not affect the query behavior. If it is true, how does this happen? 2.Based on the question above suppose I have those tokens in index. 1.AB/MNO/UUFI 2.BC/MNO/IUYT 3.D/MNO/QEWA 4./MNO/KGJGLI 5.QOEOEF/MNO/ suppose I use the lucene, I can set the QueryParser with AllowLeadingWildcard(true), to search *MNO* it should return the tokens above(1-5) But in solr, when I conduct the *MNO* with the ReversedWildcardFilterFactory in the index, but use the StandardAnalyzer in the query, I don't know what happens here. The leading *MNO should be fast to match the 5 with ReversedWildcardFilterFactory The tailer MNO* should be fast to match 4 But What about *MNO* ? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Further-questions-about-behavior-in-ReversedWildcardFilterFactory-tp3905416p3905416.html Sent from the Solr - User mailing list archive at Nabble.com.
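As a concrete reference, the reversal is configured only on the index-time analyzer; an illustrative field type (attribute values here follow the example schema, so treat them as a starting point):

    <fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
                maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

The special handling Erick mentions is on the query side: Solr sees the factory on the index analyzer, allows the otherwise-rejected leading wildcard, and rewrites a leading-wildcard term to run against the reversed tokens (conceptually ONM* for *MNO), while MNO* keeps using the original tokens (hence withOriginal=true). A double-ended pattern like *MNO* cannot be anchored at either end, so it effectively still has to scan terms and gets no speed-up from the reversal.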
Re: Suggester not working for digit starting terms
On Thu, Apr 12, 2012 at 3:52 PM, jmlucjav jmluc...@gmail.com wrote: Well now I am really lost... 1. yes I want to suggest whole sentences too, I want the tokenizer to be taken into account, and apparently it is working for me in 3.5.0?? I get suggestions that are like foo bar abc. Maybe what you mention is only for file based dictionaries? I am using the field itself. It doesn't use *JUST* your tokenizer. It splits and applies identifier rules. Such identifier rules include things like 'cannot start with a digit'. That's why I recommend you configure a SuggestQueryConverter so you have complete control of what is going on rather than dealing with the spellchecking one. Moving to 3.6.0 is not a problem (I had already downloaded the rc actually) but I still see weird things here. Installing 3.6 isn't going to do anything magical: as mentioned above you have to configure the SuggestQueryConverter like the example in the link if you want to have total control on how the input is treated before going to the suggester. -- lucidimagination.com
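If it helps, the converter Robert refers to is registered alongside the spellcheck/suggest component in solrconfig.xml, along these lines (the class name is as of 3.6 and worth confirming against the release actually deployed):

    <!-- replaces the default SpellingQueryConverter, whose identifier-style rules
         drop tokens that start with a digit, such as 500 -->
    <queryConverter name="queryConverter" class="org.apache.solr.spelling.SuggestQueryConverter"/>

The default converter is what applies the identifier rules described above; swapping it out hands the raw query text to the suggester.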
Re: Import null values from XML file
What does treated as null mean? Deleted from the doc? The problem here is that null-ness is kind of tricky. What behaviors do you want out of Solr in the NULL case? You can drop this out of the document by writing a custom updateHandler. It's actually quite simple to do. Best Erick On Thu, Apr 12, 2012 at 9:14 AM, randolf.julian randolf.jul...@dominionenterprises.com wrote: We import an XML file directly to SOLR using a the script called post.sh in the exampledocs. This is the script: FILES=$* URL=http://localhost:8983/solr/update for f in $FILES; do echo Posting file $f to $URL curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8' echo done #send the commit command to make sure all the changes are flushed and visible curl $URL --data-binary 'commit/' -H 'Content-type:text/xml; charset=utf-8' echo Our XML file looks something like this: add doc field name=ProductGuidD22BF0B9-EE3A-49AC-A4D6-000B07CDA18A/field field name=SkuGuidD22BF0B9-EE3A-49AC-A4D6-000B07CDA18A/field field name=ProductGroupId1000/field field name=VendorSkuCodeCK4475/field field name=VendorSkuAltCodeCK4475/field field name=ManufacturerSkuCodeNULL/field field name=ManufacturerSkuAltCodeNULL/field field name=UpcEanSkuCode840655037330/field field name=VendorSupersededSkuCodeNULL/field field name=VendorProductDescriptionEBC CLUTCH KIT/field field name=VendorSkuDescriptionEBC CLUTCH KIT/field /doc /add How can I tell solr that the NULL value should be treated as null? Thanks, Randolf -- View this message in context: http://lucene.472066.n3.nabble.com/Import-null-values-from-XML-file-tp3905600p3905600.html Sent from the Solr - User mailing list archive at Nabble.com.
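A rough sketch of the custom update-chain route Erick mentions, implemented as an UpdateRequestProcessor (class, chain wiring and the exact-match on the literal string NULL are all assumptions to adjust as needed):

    import java.io.IOException;
    import java.util.ArrayList;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class NullStrippingProcessorFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            // copy the field names first, since fields are removed while iterating
            for (String name : new ArrayList<String>(doc.getFieldNames())) {
              if ("NULL".equals(doc.getFieldValue(name))) {
                doc.removeField(name);
              }
            }
            super.processAdd(cmd);
          }
        };
      }
    }

Wired into an updateRequestProcessorChain ahead of RunUpdateProcessorFactory and referenced from the /update handler, the NULL placeholders then simply never reach the index.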
Re: codecs for sorted indexes
On Thu, Apr 12, 2012 at 6:35 PM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello Michael, Yes, we are pre-sorting the documents before adding them to the index. We have a score associated to every document (not an IR score but a document-related score that reflects its importance). Therefore, the document with the biggest score will have the lowest docid (we add it first to the index). We do this in order to apply early termination effectively. With the actual coded, we haven't seen much of a difference in terms of space when we have the index sorted vs not sorted. I wouldn't expect that you will see space savings when you sort this way. The techniques I was mentioning involve sorting documents by other factors instead (such as grouping related documents from the same website together: idea being they probably share many of the same terms): this hopefully creates smaller document deltas that require less bits to represent. -- lucidimagination.com
Re: searching across multiple fields using edismax - am i setting this up right?
Looks good on a quick glance. There are a couple of things... 1 there's no need for the qt param _if_ you specify the name as /partItemNoSearch, just use blahblah/solr/partItemNoSearch There's a JIRA about when/if you need at. Either will do, it's up to you which you prefer. 2 I'd consider moving the sort from the appends section to the defaults section on the theory that you may want to override sorting sometime. 3 Simple way to see the effects of this is to simply append debugQuery=on to your URL. You'll see the results of the query, including the parsed results. It's a little hard to read, but you should be seeing your search terms spread across all three fields. Best Erick On Thu, Apr 12, 2012 at 2:06 PM, geeky2 gee...@hotmail.com wrote: hello all, i just want to check to make sure i have this right. i was reading on this page: http://wiki.apache.org/solr/ExtendedDisMax, thanks to shawn for educating me. *i want the user to be able to fire a requestHandler but search across multiple fields (itemNo, productType and brand) WITHOUT them having to specify in the query url what fields they want / need to search on* this is what i have in my request handler requestHandler name=partItemNoSearch class=solr.SearchHandler default=false lst name=defaults str name=defTypeedismax/str str name=echoParamsall/str int name=rows5/int *str name=qfitemNo^1.0 productType^.8 brand^.5/str* str name=q.alt*:*/str /lst lst name=appends str name=sortrankNo asc, score desc/str /lst lst name=invariants str name=facetfalse/str /lst /requestHandler this would be an example of a single term search going against all three of the fields http://bogus:bogus/somecore/select?qt=partItemNoSearchq=*dishwasher*debugQuery=onrows=100 this would be an example of a multiple term search across all three of the fields http://bogus:bogus/somecore/select?qt=partItemNoSearchq=*dishwasher 123-xyz*debugQuery=onrows=100 do i understand this correctly? thank you, mark -- View this message in context: http://lucene.472066.n3.nabble.com/searching-across-multiple-fields-using-edismax-am-i-setting-this-up-right-tp3906334p3906334.html Sent from the Solr - User mailing list archive at Nabble.com.
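Putting Erick's suggestions together, the handler from the question might end up looking like this (the boosts and the leading-slash name are illustrative):

    <requestHandler name="/partItemNoSearch" class="solr.SearchHandler" default="false">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="echoParams">all</str>
        <int name="rows">5</int>
        <str name="qf">itemNo^1.0 productType^0.8 brand^0.5</str>
        <str name="q.alt">*:*</str>
        <str name="sort">rankNo asc, score desc</str>  <!-- in defaults so it can be overridden -->
      </lst>
      <lst name="invariants">
        <str name="facet">false</str>
      </lst>
    </requestHandler>

called as /somecore/partItemNoSearch?q=dishwasher 123-xyz&debugQuery=on, with the debug output showing each term expanded across itemNo, productType and brand.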
Re: Solr Scoring
GAH! I had my head in make this happen in one field when I wrote my response, without being explicit. Of course Walter's solution is pretty much the standard way to deal with this. Best Erick On Thu, Apr 12, 2012 at 5:38 PM, Walter Underwood wun...@wunderwood.org wrote: It is easy. Create two fields, text_exact and text_stem. Don't use the stemmer in the first chain, do use the stemmer in the second. Give the text_exact a bigger weight than text_stem. wunder On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote: No, I don't think there's an OOB way to make this happen. It's a recurring theme, make exact matches score higher than stemmed matches. Best Erick On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue kissue...@gmail.com wrote: Hi, I have a field in my index called itemDesc which i am applying EnglishMinimalStemFilterFactory to. So if i index a value to this field containing Edges, the EnglishMinimalStemFilterFactory applies stemming and Edges becomes Edge. Now when i search for Edges, documents with Edge score better than documents with the actual search word - Edges. Is there a way i can make documents with the actual search word in this case Edges score better than document with Edge? I am using Solr 3.5. My field definition is shown below: fieldType name=text_en class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords_en.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.EnglishMinimalStemFilterFactory/ /analyzer /fieldType Thanks.
Re: solr hangs
Thanks for the response. I have given a size of 8gb for the instance and has only around few thousands of documents (with 15 fields each having small amount of data)..apparently the problem is the process (solr jetty instance) is consuming lots of threads...one time it consumed around 50k threads and the process maxed out the allowable thread allocated by the OS (centos) for the process..and in the admin page is see tons of threads under Thread Dump...it's lik solr is waiting for somethingi have two leader and replica cores/shards in two instances...and i send the documents to one of the shard through the csv update handler... On Wed, Apr 11, 2012 at 7:39 AM, Pawel Rog pawelro...@gmail.com wrote: You wrote that you can see such error OutOfMemoryError. I had such problems when my caches were to big. It means that there is no more free memory in JVM and probably full gc starts running. How big is your Java heap? Maybe cache sizes in yout solr are to big according to your JVM settings. -- Regards, Pawel On Tue, Apr 10, 2012 at 9:51 PM, Peter Markey sudoma...@gmail.com wrote: Hello, I have a solr cloud setup based on a blog ( http://outerthought.org/blog/491-ot.html) and am able to bring up the instances and cores. But when I start indexing data (through csv update), the core throws a out of memory exception (null:java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create new native thread). The thread dump from new solr ui is below: cmdDistribExecutor-8-thread-777 (827) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1bd11b79 - sun.misc.Unsafe.park(Native Method) - java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await (AbstractQueuedSynchronizer.java:2043) - org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:158) - org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking (ConnPoolByRoute.java:403) - org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry (ConnPoolByRoute.java:300) - org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection (ThreadSafeClientConnManager.java:224) - org.apache.http.impl.client.DefaultRequestDirector.execute (DefaultRequestDirector.java:401) - org.apache.http.impl.client.AbstractHttpClient.execute (AbstractHttpClient.java:820) - org.apache.http.impl.client.AbstractHttpClient.execute (AbstractHttpClient.java:754) - org.apache.http.impl.client.AbstractHttpClient.execute (AbstractHttpClient.java:732) - org.apache.solr.client.solrj.impl.HttpSolrServer.request (HttpSolrServer.java:304) - org.apache.solr.client.solrj.impl.HttpSolrServer.request (HttpSolrServer.java:209) - org.apache.solr.update.SolrCmdDistributor$1.call (SolrCmdDistributor.java:320) - org.apache.solr.update.SolrCmdDistributor$1.call (SolrCmdDistributor.java:301) - java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) - java.util.concurrent.FutureTask.run(FutureTask.java:166) - java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) - java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) - java.util.concurrent.FutureTask.run(FutureTask.java:166) - java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1110) - java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:603) - java.lang.Thread.run(Thread.java:679) Apparently I do see lots of threads like above in the thread dump. I'm using latest build from the trunk (Apr 10th). 
Any insights into this issue would be really helpful. Thanks a lot.
Re: Solr Http Caching
: Are any of you using Solr Http caching? I am interested to see how people
: use this functionality. I have an index that basically changes once a day
: at midnight. Is it okay to enable Solr Http caching for such an index and
: set the max age to 1 day? Any potential issues?
:
: I am using solr 3.5 with SolrJ.
In a past life I put Squid in front of Solr as an accelerator. I didn't bother configuring Solr to output expiration info in the Cache-Control header; I just took advantage of the ETag generated from the index version (as well as lastModifiedFrom=openTime) to ensure that Solr would short-circuit and return a 304 without doing any processing (or wasting a lot of bandwidth returning data) any time it got an If-Modified-Since or If-None-Match request indicating that the cache already had a current copy. If you know your index only changes every 24 hours, then setting a max-age would probably make sense, to eliminate even those conditional requests, but I wouldn't set it to 24H (what if a request happens 1 minute before your daily rebuild?). Set it to the longest amount of time you are willing to serve stale results. -Hoss
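A minimal solrconfig.xml sketch of the setup discussed above, assuming stale results are acceptable for up to 12 hours; the max-age value is illustrative, and the 304 behaviour comes from the ETag/Last-Modified handling rather than from max-age:

<requestDispatcher handleSelect="true">
  <!-- derive Last-Modified and the ETag from the time the current searcher
       was opened, so conditional requests can be answered with 304 Not Modified -->
  <httpCaching never304="false" lastModFrom="openTime" etagSeed="Solr">
    <!-- only set max-age if you are willing to serve results this stale -->
    <cacheControl>max-age=43200, public</cacheControl>
  </httpCaching>
</requestDispatcher>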
Re: Does the lucene can read the index file from solr?
Hi neosky, how did you do it? I need this too. Thanks.
On Thu, Apr 12, 2012 at 9:35 PM, neosky neosk...@yahoo.com wrote: Thanks! I will try again.
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Hello Ali,
I'm trying to set up a large scale *Crawl + Index + Search* infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks*, with a search latency of less than 0.5 seconds.
That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things.
Needless to mention, the search index needs to scale to 5 Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index.
Yup, OK.
However, I would like such a system to be homogeneous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer that the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment were flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server restarts).
There is no such thing just yet. There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase.
However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc., but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above.
Here is a summary of all of them:
* Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good.
* Lily - data stored in an HBase cluster gets indexed to separate Solr instance(s) on the side. Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a many-billion-document index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it.
* IndexTank - has some technical weaknesses, not integrated with Hadoop, and I'm not sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is again not integrated.
If I were you and I had to pick today, I'd pick ElasticSearch if I were completely open. If I had a Solr bias I'd give SolrCloud a try first.
Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate I would need with this setup, for regular web data (HTML text) at this scale?
I don't know off the top of my head, but I'm guessing several hundred for serving search requests.
HTH, Otis
-- Search Analytics - http://sematext.com/search-analytics/index.html Scalable Performance Monitoring - http://sematext.com/spm/index.html
Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :).
Many many thanks in advance. Thanks, Safdar
Re: term frequency outweighs exact phrase match
: I use solr 3.5 with edismax. I have the following issue with phrase
: search. For example if I have three documents with content like
:
: 1. apache apache
: 2. solr solr
: 3. apache solr
:
: then a search for apache solr displays documents in the order 1, 2, 3
: instead of 3, 2, 1, because term frequency in the first and second
: documents is higher than in the third document. We want results to be
: displayed in the order 3, 2, 1, since the third document has an exact
: match.
You need to give us a lot more info, like what other data is in the various fields for those documents, exactly what your query URL looks like, and what debugQuery=true gives you back in terms of score explanations for each document, because if that sample content is the only thing you've got indexed (even if it's in multiple fields), then documents #1 and #2 shouldn't even match your query using the mm you've specified...
: <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
...because docs #1 and #2 will only contain one clause. Otherwise it should work fine. I used the example 3.5 schema, and created 3 docs matching what you described (with name copyField'ed into text)...
<add>
  <doc><field name="id">1</field><field name="name">apache apache</field></doc>
  <doc><field name="id">2</field><field name="name">solr solr</field></doc>
  <doc><field name="id">3</field><field name="name">apache solr</field></doc>
</add>
...and then used this similar query (note mm=1) to get the results you would expect...
http://localhost:8983/solr/select/?fl=name,score&debugQuery=true&defType=edismax&qf=name+text&pf=name^10+text^5&q=apache%20solr&mm=1
<result name="response" numFound="3" start="0" maxScore="1.309231">
  <doc>
    <float name="score">1.309231</float>
    <str name="name">apache solr</str>
  </doc>
  <doc>
    <float name="score">0.022042051</float>
    <str name="name">apache apache</str>
  </doc>
  <doc>
    <float name="score">0.022042051</float>
    <str name="name">solr solr</str>
  </doc>
</result>
-Hoss
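For completeness, the same parameters Hoss passes on the URL can be baked into a handler as defaults; this is a sketch assuming the example schema's name and text fields and a handler name of /select, not configuration taken from the thread:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name text</str>
    <!-- pf rescores documents where the whole query matches as a phrase,
         which is what pushes the exact match above the high-tf documents -->
    <str name="pf">name^10 text^5</str>
    <str name="mm">1</str>
  </lst>
</requestHandler>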
RE: solr 3.5 taking long to index
The machine has a total RAM of around 46GB. My biggest concern is the Solr index time gradually increasing, and then the commit stops because of timeouts; our commit rate is very high, but I am not able to find the root cause of the issue. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg
-Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: 13 April 2012 05:15 To: solr-user@lucene.apache.org Subject: Re: solr 3.5 taking long to index
On 4/12/2012 12:42 PM, Rohit wrote: Thanks for pointing these out, but I still have one concern, why is the Virtual Memory running in 300g+?
Solr 3.5 uses MMapDirectoryFactory by default to read the index. This does an mmap on the files that make up your index, so their entire contents are simply accessible to the application as virtual memory (over 300GB in your case), and the OS automatically takes care of swapping disk pages in and out of real RAM as required. This approach has less overhead and tends to make better use of the OS disk cache than other methods. It does lead to confused questions and scary numbers in memory usage reporting, though. You have mentioned that you are giving 36GB of RAM to Solr. How much total RAM does the machine have? Thanks, Shawn
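For reference, the directory implementation can also be pinned explicitly in solrconfig.xml; this is a sketch rather than configuration from the thread, and solr.NIOFSDirectoryFactory or solr.StandardDirectoryFactory could be substituted if memory-mapping is not wanted:

<!-- solrconfig.xml: MMapDirectoryFactory maps the index files into virtual
     memory, which is why the process shows 300GB+ of virtual size; the OS
     pages the data in and out of physical RAM on demand. -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>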
Re: solr 3.5 taking long to index
On 4/12/2012 8:42 PM, Rohit wrote: The machine has a total RAM of around 46GB. My biggest concern is the Solr index time gradually increasing, and then the commit stops because of timeouts; our commit rate is very high, but I am not able to find the root cause of the issue.
For good performance, Solr relies on the OS having enough free RAM to keep critical portions of the index in the disk cache. Some numbers that I have collected from your information so far are listed below. Please let me know if I've got any of this wrong:
46GB total RAM
36GB RAM allocated to Solr
300GB total index size
This leaves only 10GB of RAM free to cache 300GB of index, assuming that this server is dedicated to Solr. The critical portions of your index are very likely considerably larger than 10GB, which causes constant reading from the disk for queries and updates. With a high commit rate and a relatively low mergeFactor of 10, your index will be doing a lot of merging during updates, and some of those merges are likely to be quite large, further complicating the I/O situation. Another thing that can lead to increasing index update times is cache warming, which is also greatly affected by high I/O levels. If you visit the /solr/corename/admin/stats.jsp#cache URL, you can see the warmupTime for each cache in milliseconds. Adding more memory to the server would probably help things. You'll want to carefully check all the server and Solr statistics you can to make sure that memory is the root of the problem before you actually spend the money. At the server level, look for things like a high iowait CPU percentage. For Solr, you can turn the logging level up to INFO in the admin interface as well as turn on the infoStream in solrconfig.xml for extensive debugging. I hope this is helpful. If not, I can try to come up with more specific things you can look at. Thanks, Shawn
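A hedged solrconfig.xml sketch of the cache-warming tuning implied above; the cache names are the standard Solr caches, but the sizes and autowarmCount values are illustrative starting points, not recommendations for this particular index:

<!-- solrconfig.xml: with a high commit rate, large autowarmCount values make
     every new searcher replay cached filters/queries, which shows up as a long
     warmupTime on the stats page. Lowering them (or setting them to 0) trades
     warm caches for faster commits. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="16"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>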