Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Gabriele Kahlout
Sorry for being unclear, and thank you for answering.
Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3),
where A,B,C are document identifiers and the ks in bracket with each are the
terms each contains.
So Solr inverted index should be something like:

k0 --> A | C
k1 --> A | B
k2 --> A | B | C
k3 --> B | C

Now let q=k1, how do I make sure C doesn't appear as a result since it
doesn't contain any occurrence of k1?

On Tue, Jun 7, 2011 at 12:21 AM, Erick Erickson erickerick...@gmail.com wrote:

 I'm having a hard time understanding what you're driving at, can
 you provide some examples? This *looks* like filter queries,
 but I think you already know about those...

 Best
 Erick

 On Mon, Jun 6, 2011 at 4:00 PM, Gabriele Kahlout
 gabri...@mysimpatico.com wrote:
  Hello,
 
  I've seen that through boosting it's possible to influence the scoring
  function, but what I would like is sort of a boolean property. In some
 way
  it's to search only the indexed documents by that keyword (or the
  intersection/union) rather than the whole set.
  Is this supported in any way?
 
 
  --
  Regards,
  K. Gabriele
 
 




-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread pravesh
k0 --> A | C
k1 --> A | B
k2 --> A | B | C
k3 --> B | C
Now let q=k1, how do I make sure C doesn't appear as a result since it
doesn't contain any occurrence of k1?
Do we bother to do that? Now, that's what Lucene does :)



Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Gabriele Kahlout
On Tue, Jun 7, 2011 at 8:43 AM, pravesh suyalprav...@yahoo.com wrote:

 k0 --> A | C
 k1 --> A | B
 k2 --> A | B | C
 k3 --> B | C
 Now let q=k1, how do I make sure C doesn't appear as a result since it
 doesn't contain any occurrence of k1?
 Do we bother to do that? Now, that's what Lucene does :)

 Lucene/Solr doesn't do that; it ranks documents based on a scoring
function, and with that it lacks the possibility of specifying that a
particular term must appear (the closest way I know of is boosting it).

The solution would be a way to tell Solr/Lucene which documents/indices to
query, i.e. query only the union/intersection of the documents in which
k1,...,kn appear, instead of querying all indexed documents and applying the
ranking function (which will give weight to documents that contain
k1...kn).







-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: Master Slave help

2011-06-07 Thread Rohit Gupta
thanks Jayendra..






From: Jayendra Patil jayendra.patil@gmail.com
To: solr-user@lucene.apache.org
Sent: Tue, 7 June, 2011 6:55:58 AM
Subject: Re: Master Slave help

Do you mean the replication happens every time you restart the server?
If so, you need to modify the events on which you want the replication to happen.

Check for the replicateAfter tag and remove the startup option, if you
don't need it.

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- Replicate on 'startup' and 'commit'. 'optimize' is also a
         valid value for replicateAfter. -->
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>

    <!-- Create a backup after 'optimize'. Other values can be
         'commit', 'startup'. It is possible to have multiple entries of this
         config string. Note that this is just for backup; replication does
         not require this. -->
    <!-- <str name="backupAfter">optimize</str> -->

    <!-- If configuration files need to be replicated, give the
         names here, separated by comma -->
    <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
    <!-- The default value of reservation is 10 secs. See the
         documentation below. Normally, you should not need to specify this -->
    <str name="commitReserveDuration">00:00:10</str>
  </lst>
</requestHandler>
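
A quick way to check what the handler is doing is to query it directly; a
hedged example, where host and port are placeholders and command=details is
the ReplicationHandler's status command:

curl "http://master-host:8983/solr/replication?command=details"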

Regards,
Jayendra

On Mon, Jun 6, 2011 at 11:24 AM, Rohit Gupta ro...@in-rev.com wrote:
 Hi,

 I have configured my master slave server and everything seems to be running
 fine; the replication completed the first time it ran. But every time I go to
 the replication link in the admin panel after restarting the server or on
 server startup, I notice the replication starting from scratch, or at least
 the stats show that.

 What could be wrong?

 Thanks,
 Rohit


Commit taking very long

2011-06-07 Thread Rohit Gupta
Hi,

My commit seems to be taking too much time. If you notice from the DataImport
status given below, to commit 1000 docs it's taking longer than 24 minutes:

<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
  <str name="Time Elapsed">0:24:43.156</str>
  <str name="Total Requests made to DataSource">1001</str>
  <str name="Total Rows Fetched">1658</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2011-06-07 09:15:17</str>
  <str name="">
    Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.
  </str>
</lst>

What can be causing this? I have tried looking for a reason or a way to
improve this, but am just not able to find one. At this rate my documents
would never get indexed, given that I have more than 100,000 records coming
into the database every hour.

Regards,
Rohit

getting numberformat exception while using tika

2011-06-07 Thread Naveen Gupta
Hi

We are using the ExtractingRequestHandler and we are getting the following
error. We are giving a Microsoft docx file for indexing.

I think that this is something to do with the date field definition, but am
not very sure... what field type should we use?

2. We are trying to index jpg files (when we search over the name of the jpg,
it does not come back, though I am passing one in the id).

3. What about zip files or rar files? Does Tika with Solr handle these?

java.lang.NumberFormatException: For input string:
"2011-01-27T07:18:00Z"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Long.parseLong(Long.java:412)
at java.lang.Long.parseLong(Long.java:461)
at org.apache.solr.schema.TrieField.createField(TrieField.java:434)
at
org.apache.solr.schema.SchemaField.createField(SchemaField.java:98)
at
org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:204)
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:277)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:198)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:238)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:619)
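
(The trace shows Tika's Last-Modified value, "2011-01-27T07:18:00Z", being
parsed as a long by a Trie field. A hedged sketch of a fix; the field name,
id and file name below are assumptions, while fmap.* is the stock
ExtractingRequestHandler way to rename Tika metadata fields:

In schema.xml:
  <field name="last_modified" type="date" indexed="true" stored="true"/>

At extract time:
  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.Last-Modified=last_modified&commit=true" -F "myfile=@test.docx")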

Thanks
Naveen


How many fields can SOLR handle?

2011-06-07 Thread roySolr
Hello,

I have a Solr implementation with 1m products. Every product has some
information; let's say a television has some information about pixels and
inches, and a computer has information about harddisk, cpu, gpu. When a user
searches for computer I want to show the correct facets. An example:

User search for Computer

Facets:

  CPU
   AMD(10)
   Intel(300)

  GPU
   Nvidia(20)
   Ati(290)

Every product has different facets. I have something like this in my schema:

 <dynamicField name="*_FACET" type="facetType" indexed="true" stored="true"
     multiValued="true"/>

In Solr I now have a lot of fields: CPU_FACET, GPU_FACET etc. How many
fields can Solr handle?

Another question: is it possible to add the FACET fields automatically to my
query, i.e. facet.field=*_FACET? Now I first do a request to a DB to get the
FACET titles and add these to the request: facet.field=cpu_FACET,gpu_FACET.
I'm afraid that *_FACET is an overkill solution.
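
For reference, the repeated-parameter form of such a request; a hedged
example using the field names from above, with host and core path assumed:

http://localhost:8983/solr/select?q=computer&facet=true&facet.field=CPU_FACET&facet.field=GPU_FACET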








function queries scope

2011-06-07 Thread Marco Martinez
Hi,

I need to use the function query operations with the score of a given
query, but only on the docset that I get from the query, and I don't know if
this is possible.

Example:

q=shops in madrid  returns > 1 docs with a specific score for each doc

but now i need to do some stuff like

q=sum(product(2,query(shops in madrid),productValueField)) but this will
return all the docs in my index.


I know that I can do it via filter queries, e.g. q=sum(product(2,query(shops
in madrid),productValueField))&fq=shops in madrid, but this will do the query
two times and I don't want this because performance is important to our
application.


Is there another approach to accomplish that?


Thanks in advance,

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


Indexing Mediawiki

2011-06-07 Thread Tod
I have a need to index an internal instance of MediaWiki.  I'd like to
use DIH if I can, since I have access to the database, but the example
provided on the Solr wiki uses a MediaWiki dump XML file.


Does anyone have any experience using DIH in this manner?  Am I barking
up the wrong tree, and would I be better off dumping and indexing the wiki
instead?
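
For the DIH route, a minimal sketch of a data-config, assuming MySQL and
MediaWiki's default page/revision/text tables; the driver, URL and
credentials are placeholders:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/wikidb" user="wiki" password="secret"/>
  <document>
    <!-- join each page to the text of its latest revision -->
    <entity name="page"
            query="SELECT p.page_id, p.page_title, t.old_text
                   FROM page p
                   JOIN revision r ON r.rev_id = p.page_latest
                   JOIN text t ON t.old_id = r.rev_text_id">
      <field column="page_id" name="id"/>
      <field column="page_title" name="title"/>
      <field column="old_text" name="text"/>
    </entity>
  </document>
</dataConfig>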




Thanks - Tod


solr 3.1 java.lang.NoClassDefFoundError org/carrot2/core/ControllerFactory

2011-06-07 Thread bryan rasmussen
As per the subject, I am getting java.lang.NoClassDefFoundError:
org/carrot2/core/ControllerFactory
when I try to run clustering.

I am using Solr 3.1:

I get the following error:

java.lang.NoClassDefFoundError: org/carrot2/core/ControllerFactory
at 
org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.init(CarrotClusteringEngine.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at java.lang.Class.newInstance0(Unknown Source)
at java.lang.Class.newInstance(Unknown Source)
at 
org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:412)
at 
org.apache.solr.handler.clustering.ClusteringComponent.inform(ClusteringComponent.java:203)
at 
org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:522)
at org.apache.solr.core.SolrCore.init(SolrCore.java:594)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:458)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:224)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.mortbay.start.Main.invokeMain(Main.java:194)
at org.mortbay.start.Main.start(Main.java:534)
at org.mortbay.start.Main.start(Main.java:441)
at org.mortbay.start.Main.main(Main.java:119)
Caused by: java.lang.ClassNotFoundException: org.carrot2.core.ControllerFactory
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)

using the following configuration


<searchComponent
    class="org.apache.solr.handler.clustering.ClusteringComponent"
    name="clustering">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>

    <!-- Engine-specific parameters -->
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
  </lst>
</searchComponent>

<requestHandler name="/search"
    class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <!--
  By default, this will register the following components:

  <arr name="components">
    <str>query</str>
    <str>facet</str>
    <str>mlt</str>
    <str>highlight</str>
    <str>debug</str>
  </arr>
  -->
</requestHandler>

<requestHandler name="clusty" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>

    <bool name="clustering">true</bool>
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>

    <!-- Fields to cluster on -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">all_text</str>

Re: Documents update

2011-06-07 Thread Denis Kuzmenok
Created the file and reloaded Solr: ExternalFileField works fine. If I
change the external files and do
curl http://127.0.0.1:4900/solr/site/update -H "Content-Type: text/xml"
--data-binary '<commit />'
then no changes are made. If I start Solr without external files and
then create them, they are not picked up.
What is wrong?

PS: Solr 3.2

 http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html

 On Tuesday 31 May 2011 15:41:32 Denis Kuzmenok wrote:
 Flags are stored to filter results and it's pretty heavily loaded; it's
 working fine, but I can't update the index very often just to keep the flags
 up to date =\
 Where can I read about using external fields / files?
 
  And it wouldn't work unless all the data is stored anyway. Currently
  there's no way to update a single field in a document, although there's
  work being done in that direction (see the column stride JIRA).
  
  What do you want to do with these fields? If it's to influence scoring,
  you could look at external fields.
  
  If the flags are a selection criteria, it's...harder. What are the flags
  used for? Could you consider essentially storing a map of the
  uniqueKey's and flags in a special document and having your app
  read that document and merge the results with the output? If this seems
  irrelevant, a more complete statement of the use-case would be helpful.
  
  Best
  Erick





Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread lee carroll
Gabriele
Lucene uses a combination of boolean and VSM for its IR.

A straightforward query for a keyword will only match docs with that keyword.

Now things quickly get subtle and complex the more sugar you add (more
complicated queries across fields and more complex
analysis chains), but I think the short answer to your question is: C
will not be returned, and it will not be scored either.

lee c

On 7 June 2011 08:30, Gabriele Kahlout gabri...@mysimpatico.com wrote:
 On Tue, Jun 7, 2011 at 8:43 AM, pravesh suyalprav...@yahoo.com wrote:

 k0 --> A | C
 k1 --> A | B
 k2 --> A | B | C
 k3 --> B | C
 Now let q=k1, how do I make sure C doesn't appear as a result since it
 doesn't contain any occurrence of k1?
 Do we bother to do that? Now, that's what Lucene does :)

 Lucene/Solr doesn't do that; it ranks documents based on a scoring
 function, and with that it lacks the possibility of specifying that a
 particular term must appear (the closest way I know of is boosting it).

 The solution would be a way to tell Solr/Lucene which documents/indices to
 query, i.e. query only the union/intersection of the documents in which
 k1,...,kn appear, instead of querying all indexed documents and applying the
 ranking function (which will give weight to documents that contain
 k1...kn).







 --
 Regards,
 K. Gabriele




clustering problems on 3.1

2011-06-07 Thread bryan rasmussen
I added the following to my configuration

  <lib dir="c:/projects/solrtest/dist/"
       regex="apache-solr-clustering-.*\.jar" />
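
The Carrot2 classes themselves ship separately from the
apache-solr-clustering jar; a hedged addition, assuming the stock Solr 3.1
layout where contrib/clustering/lib holds carrot2-core and its dependencies:

  <lib dir="c:/projects/solrtest/contrib/clustering/lib/" regex=".*\.jar" />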




<requestHandler name="clusty" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>

    <bool name="clustering">true</bool>
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>

    <!-- Fields to cluster on -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">all_text</str>
    <str name="hl.fl">all_text title</str>
    <!-- for this field, we want no fragmenting, just highlighting -->
    <str name="f.name.hl.fragsize">150</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>


<searchComponent
    class="org.apache.solr.handler.clustering.ClusteringComponent"
    name="clustering">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>

    <!-- Engine-specific parameters -->
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
  </lst>
</searchComponent>

which ended up with the message
java.lang.NoClassDefFoundError: org/carrot2/core/ControllerFactory
and whenever I did a request I got a 404 response back, and

SEVERE: REFCOUNT ERROR: unreferenced org.apache.solr.SolrCore@14db38a4 (core1)
has a reference count of 1

appeared in my console.

Any suggestions?

Thanks,
Bryan Rasmussen


Re: Commit taking very long

2011-06-07 Thread Erick Erickson
Are you optimizing? That is unnecessary when committing, and is often the
culprit.
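
If DIH is the trigger, note that full-import optimizes by default; a hedged
example of turning that off (the handler path and params are the stock ones,
the core URL is an assumption):

http://localhost:8983/solr/dataimport?command=full-import&optimize=false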


Best
Erick

On Tue, Jun 7, 2011 at 5:42 AM, Rohit Gupta ro...@in-rev.com wrote:
 Hi,

 My commit seems to be taking too much time. If you notice from the DataImport
 status given below, to commit 1000 docs it's taking longer than 24 minutes:

 <str name="status">busy</str>
 <str name="importResponse">A command is still running...</str>
 <lst name="statusMessages">
   <str name="Time Elapsed">0:24:43.156</str>
   <str name="Total Requests made to DataSource">1001</str>
   <str name="Total Rows Fetched">1658</str>
   <str name="Total Documents Skipped">0</str>
   <str name="Full Dump Started">2011-06-07 09:15:17</str>
   <str name="">
     Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.
   </str>
 </lst>

 What can be causing this? I have tried looking for a reason or a way to
 improve this, but am just not able to find one. At this rate my documents
 would never get indexed, given that I have more than 100,000 records coming
 into the database every hour.

 Regards,
 Rohit


Re: problem: zooKeeper Integration with solr

2011-06-07 Thread Mohammad Shariq
How is this method
(http://localhost:8983/solr/select?shards=Machine:Port/SolrPath,Machine:Port/SolrPath&indent=true&q=query)
better than ZooKeeper? Could you please point me to any performance doc.


On 7 June 2011 08:18, bmdakshinamur...@gmail.com bmdakshinamur...@gmail.com
 wrote:

 Instead of integrating zookeeper, you could create shards over multiple
 machines and specify the shards while you are querying solr.
 Eg: http://localhost:8983/solr/select?shards=Machine:Port/SolrPath,Machine:Port/SolrPath&indent=true&q=query



 On Mon, Jun 6, 2011 at 5:59 PM, Mohammad Shariq shariqn...@gmail.com
 wrote:

  Hi folks,
  I am using Solr to index around 100mn docs.
  Now I am planning to move to cluster-based Solr, so that I can scale the
  indexing and searching process.
  Since SolrCloud is in the development stage, I am trying to index in a
  shard-based environment using ZooKeeper.

  I followed the steps from
  http://wiki.apache.org/solr/ZooKeeperIntegration but then also I am not
  able to do distributed search.
  Once I index the docs in one shard, I am not able to query from the other
  shard and vice-versa (using the query
  http://localhost:8180/solr/select/?q=itunes&version=2.2&start=0&rows=10&indent=on
  )
 
  I am running solr3.1 on ubuntu 10.10.
 
  please help me.
 
 
  --
  Thanks and Regards
  Mohammad Shariq
 



 --
 Thanks and Regards,
 DakshinaMurthy BM




-- 
Thanks and Regards
Mohammad Shariq


RE: SpellCheckComponent performance

2011-06-07 Thread Demian Katz
As I may have mentioned before, VuFind is actually doing two Solr queries for 
every search -- a base query that gets basic spelling suggestions, and a 
supplemental spelling-only query that gets shingled spelling suggestions.  If 
there's a way to get two different spelling responses in a single query, I'd 
love to hear about it...  but the double-querying doesn't seem to be a huge 
problem -- the delays I'm talking about are in the spelling portion of the 
initial query.  Just for the sake of completeness, here are both of my spelling 
field types:

<!-- Basic Text Field for use with Spell Correction -->
<fieldType name="textSpell" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"
        version="icu4j" composed="false" remove_diacritics="true"
        remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
<!-- More advanced spell checking field. -->
<fieldType name="textSpellShingle" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
        outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
        outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

...and here are the fields:

   <field name="spelling" type="textSpell" indexed="true" stored="true"/>
   <field name="spellingShingle" type="textSpellShingle" indexed="true"
       stored="true" multiValued="true"/>

As you can probably guess, I'm using spelling in my main query and 
spellingShingle in my supplemental query.

Here are stats on the spelling field:

{field=spelling,memSize=107830314,tindexSize=249184,time=25747,phase1=25150,nTerms=1343061,bigTerms=231,termInstances=40960454,uses=1}

(I obtained these numbers by temporarily adding the spelling field as a facet 
to my warming query -- probably not a very smart way to do it, but it was the 
only way I could figure out!  If there's a more elegant and accurate approach, 
I'd be interested to know what it is.)

I should also note that my basic spelling index is 114MB and my shingled 
spelling index is 931MB -- not outrageously large.  Is there a way to persuade 
Solr to load these into memory for faster performance?

thanks,
Demian

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Monday, June 06, 2011 6:23 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SpellCheckComponent performance
 
 Hmmm, how are you configuring your spell checker? The first-time
 slowdown
 is probably due to cache warming, but subsequent 500 ms slowdowns
 seem odd. How many unique terms are there in your spellcheck index?
 
 It'd probably be best if you showed us your fieldtype and field
 definition...
 
 Best
 Erick
 
 On Mon, Jun 6, 2011 at 4:04 PM, Demian Katz demian.k...@villanova.edu
 wrote:
  I'm continuing to work on tuning my Solr server, and now I'm noticing
 that my biggest bottleneck is the SpellCheckComponent.  This is eating
 multiple seconds on most first-time searches, and still taking around
 500ms even on cached searches.  Here is my configuration:
 
   <searchComponent name="spellcheck"
       class="org.apache.solr.handler.component.SpellCheckComponent">
     <lst name="spellchecker">
       <str name="name">basicSpell</str>
       <str name="field">spelling</str>
       <str name="accuracy">0.75</str>
       <str name="spellcheckIndexDir">./spellchecker</str>
       <str name="queryAnalyzerFieldType">textSpell</str>
       <str name="buildOnOptimize">true</str>
     </lst>
   </searchComponent>
 
  I've done a bit of searching, but the best advice I could find for
 making the search component go faster involved reducing
 spellcheck.maxCollationTries, which doesn't even seem to apply to my
 settings.
 
  Does anyone have any advice on tuning this aspect of my
 configuration?  Are there any extra debug settings that might give
 deeper insight into how the component is spending its time?
 
  thanks,
  Demian
 


Re: [ANNOUNCEMENT] PHP Solr Extension 1.0.1 Stable Has Been Released

2011-06-07 Thread roySolr
Hello,

I have some problems with the installation of the new PECL package
solr-1.0.1.

I run these commands:

pecl uninstall solr-beta (to uninstall the old version, 0.9.11)
pecl install solr

The install runs, but then it gives the following error message:

/tmp/tmpKUExET/solr-1.0.1/solr_functions_helpers.c: In function
'solr_json_to_php_native':
/tmp/tmpKUExET/solr-1.0.1/solr_functions_helpers.c:1123: error: too many
arguments to function 'php_json_decode'
make: *** [solr_functions_helpers.lo] Error 1
ERROR: `make' failed

I have PHP version 5.2.17.

How can I fix this?






Re: java.lang.AbstractMethodError at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)

2011-06-07 Thread idivad
Finally figured out the problem.



Solr Cloud Query Question

2011-06-07 Thread Jamie Johnson
I am currently experimenting with the Solr Cloud code on trunk and just had
a quick question. Let's say my setup had 3 nodes a, b and c. Node a has
1000 results which meet a particular query, b has 2000 and c has 3000. When
executing this query and asking for row 900, what specifically happens? From
reading the Distributed Search wiki I would expect that node a responds with
900, node b responds with 900 and c responds with 900, and the coordinating
node is responsible for taking the top-scored items and throwing away the
rest. Is this correct, or is there some additional coordination that happens
where nodes a, b and c return back an id and a score, and the coordinating
node makes an additional request to get back the documents for the ids which
make up the top list?


Re: Solr Cloud Query Question

2011-06-07 Thread Yonik Seeley
On Tue, Jun 7, 2011 at 9:35 AM, Jamie Johnson jej2...@gmail.com wrote:
 I am currently experimenting with the Solr Cloud code on trunk and just had
 a quick question. Let's say my setup had 3 nodes a, b and c. Node a has
 1000 results which meet a particular query, b has 2000 and c has 3000. When
 executing this query and asking for row 900, what specifically happens? From
 reading the Distributed Search wiki I would expect that node a responds with
 900, node b responds with 900 and c responds with 900, and the coordinating
 node is responsible for taking the top-scored items and throwing away the
 rest. Is this correct, or is there some additional coordination that happens
 where nodes a, b and c return back an id and a score, and the coordinating
 node makes an additional request to get back the documents for the ids which
 make up the top list?

The latter is correct - the first phase only collects enough
information to merge ids from the shards, and then a second phase
requests the stored fields, highlighting, etc for the specific docs
that will be returned.

-Yonik
http://www.lucidimagination.com
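
(A hedged illustration of the two phases as they show up in shard request
logs; the values here are invented, but isShard, fl=id,score and ids are the
parameters distributed search actually uses:

phase 1:  .../select?q=...&isShard=true&rows=900&fl=id,score
phase 2:  .../select?q=...&isShard=true&ids=docA12,docB7,...)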


Re: function queries scope

2011-06-07 Thread Yonik Seeley
One way is to use the boost qparser:
http://search-lucene.com/jd/solr/org/apache/solr/search/BoostQParserPlugin.html
q={!boost b=productValueField}shops in madrid

Or you can use the edismax parser, which has a "boost" parameter that
does the same thing:
defType=edismax&q=shops in madrid&boost=productValueField


-Yonik
http://www.lucidimagination.com


On Tue, Jun 7, 2011 at 6:53 AM, Marco Martinez
mmarti...@paradigmatecnologico.com wrote:
 Hi,

 I need to use the function query operations with the score of a given
 query, but only on the docset that I get from the query, and I don't know if
 this is possible.

 Example:

 q=shops in madrid  returns > 1 docs with a specific score for each doc

 but now i need to do some stuff like

 q=sum(product(2,query(shops in madrid),productValueField)) but this will
 return all the docs in my index.


 I know that I can do it via filter queries, e.g. q=sum(product(2,query(shops
 in madrid),productValueField))&fq=shops in madrid, but this will do the query
 two times and I don't want this because performance is important to our
 application.


 Is there another approach to accomplish that?


 Thanks in advance,

 Marco Martínez Bautista
 http://www.paradigmatecnologico.com
 Avenida de Europa, 26. Ática 5. 3ª Planta
 28224 Pozuelo de Alarcón
 Tel.: 91 352 59 42



Re: function queries scope

2011-06-07 Thread Marco Martinez
Thanks, but it's not what I'm looking for, because the BoostQParserPlugin
multiplies the score of the query with the function queries defined in the b
param of the BoostQParserPlugin, and I can't use the edismax because we have
our own qparser. It seems that I have to code another qparser.


Thanks Yonik anyway,

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2011/6/7 Yonik Seeley yo...@lucidimagination.com

 One way is to use the boost qparser:

 http://search-lucene.com/jd/solr/org/apache/solr/search/BoostQParserPlugin.html
 q={!boost b=productValueField}shops in madrid

 Or you can use the edismax parser, which has a "boost" parameter that
 does the same thing:
 defType=edismax&q=shops in madrid&boost=productValueField


 -Yonik
 http://www.lucidimagination.com


 On Tue, Jun 7, 2011 at 6:53 AM, Marco Martinez
 mmarti...@paradigmatecnologico.com wrote:
  Hi,
 
  I need to use the function query operations with the score of a given
  query, but only on the docset that I get from the query, and I don't know
  if this is possible.
 
  Example:
 
  q=shops in madrid  returns > 1 docs with a specific score for each doc
 
  but now i need to do some stuff like
 
  q=sum(product(2,query(shops in madrid),productValueField)) but this will
  return all the docs in my index.
 
 
  I know that I can do it via filter queries, e.g.
  q=sum(product(2,query(shops in madrid),productValueField))&fq=shops in madrid,
  but this will do the query two times and I don't want this because
  performance is important to our application.
 
 
  Is there another approach to accomplish that?
 
 
  Thanks in advance,
 
  Marco Martínez Bautista
  http://www.paradigmatecnologico.com
  Avenida de Europa, 26. Ática 5. 3ª Planta
  28224 Pozuelo de Alarcón
  Tel.: 91 352 59 42
 



RE: SpellCheckComponent performance

2011-06-07 Thread Dyer, James
Demian,

If you omit spellcheckIndexDir from the configuration, it will create an 
in-memory spelling dictionary.  

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
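
For reference, a hedged version of the earlier config with only the
directory omitted:

  <searchComponent name="spellcheck"
      class="org.apache.solr.handler.component.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">basicSpell</str>
      <str name="field">spelling</str>
      <str name="accuracy">0.75</str>
      <!-- no spellcheckIndexDir: the dictionary is built in RAM -->
      <str name="queryAnalyzerFieldType">textSpell</str>
      <str name="buildOnOptimize">true</str>
    </lst>
  </searchComponent>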


-Original Message-
From: Demian Katz [mailto:demian.k...@villanova.edu] 
Sent: Tuesday, June 07, 2011 7:59 AM
To: solr-user@lucene.apache.org
Subject: RE: SpellCheckComponent performance

As I may have mentioned before, VuFind is actually doing two Solr queries for 
every search -- a base query that gets basic spelling suggestions, and a 
supplemental spelling-only query that gets shingled spelling suggestions.  If 
there's a way to get two different spelling responses in a single query, I'd 
love to hear about it...  but the double-querying doesn't seem to be a huge 
problem -- the delays I'm talking about are in the spelling portion of the 
initial query.  Just for the sake of completeness, here are both of my spelling 
field types:

<!-- Basic Text Field for use with Spell Correction -->
<fieldType name="textSpell" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"
        version="icu4j" composed="false" remove_diacritics="true"
        remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
<!-- More advanced spell checking field. -->
<fieldType name="textSpellShingle" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
        outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
        outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

...and here are the fields:

   <field name="spelling" type="textSpell" indexed="true" stored="true"/>
   <field name="spellingShingle" type="textSpellShingle" indexed="true"
       stored="true" multiValued="true"/>

As you can probably guess, I'm using spelling in my main query and 
spellingShingle in my supplemental query.

Here are stats on the spelling field:

{field=spelling,memSize=107830314,tindexSize=249184,time=25747,phase1=25150,nTerms=1343061,bigTerms=231,termInstances=40960454,uses=1}

(I obtained these numbers by temporarily adding the spelling field as a facet 
to my warming query -- probably not a very smart way to do it, but it was the 
only way I could figure out!  If there's a more elegant and accurate approach, 
I'd be interested to know what it is.)

I should also note that my basic spelling index is 114MB and my shingled 
spelling index is 931MB -- not outrageously large.  Is there a way to persuade 
Solr to load these into memory for faster performance?

thanks,
Demian

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Monday, June 06, 2011 6:23 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SpellCheckComponent performance
 
 Hmmm, how are you configuring your spell checker? The first-time
 slowdown
 is probably due to cache warming, but subsequent 500 ms slowdowns
 seem odd. How many unique terms are there in your spellcheck index?
 
 It'd probably be best if you showed us your fieldtype and field
 definition...
 
 Best
 Erick
 
 On Mon, Jun 6, 2011 at 4:04 PM, Demian Katz demian.k...@villanova.edu
 wrote:
  I'm continuing to work on tuning my Solr server, and now I'm noticing
 that my biggest bottleneck is the SpellCheckComponent.  This is eating
 multiple seconds on most first-time searches, and still taking around
 500ms even on cached searches.  Here is my configuration:
 
   <searchComponent name="spellcheck"
       class="org.apache.solr.handler.component.SpellCheckComponent">
     <lst name="spellchecker">
       <str name="name">basicSpell</str>
       <str name="field">spelling</str>
       <str name="accuracy">0.75</str>
       <str name="spellcheckIndexDir">./spellchecker</str>
       <str name="queryAnalyzerFieldType">textSpell</str>
       <str name="buildOnOptimize">true</str>
     </lst>
   </searchComponent>
 
  I've done a bit of searching, but the best advice I could find for
 making the search component go faster involved reducing
 

Re: Nullpointer Exception in Solr 4.x in DebugComponent when using wildcard in facet value

2011-06-07 Thread Stefan Moises

Hi Yonik,

thanks, it's working in trunk now again... I had to re-index, though,
because of exceptions at startup. Did the index format change again
between trunk of early/mid May and the current trunk?


best regards,
Stefan

Am 03.06.2011 15:32, schrieb Yonik Seeley:

This bug was introduced during the cutover from strings to BytesRef on
TermRangeQuery.
I just committed a fix.

-Yonik
http://www.lucidimagination.com

On Fri, Jun 3, 2011 at 5:42 AM, Stefan Moises moi...@shoptimax.de wrote:

Hi,

in Solr 4.x (trunk version of mid May) I have noticed a NullPointerException
if I activate debugging (debug=true) and use a wildcard to filter
by facet value, e.g.
if I have a price field

...&debug=true&facet.field=price&fq=price:[500+TO+*]
I get

SEVERE: java.lang.RuntimeException: java.lang.NullPointerException
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:538)
at
org.apache.solr.handler.component.DebugComponent.process(DebugComponent.java:77)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:239)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1298)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:465)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:555)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NullPointerException
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:402)
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:535)

This used to work in Solr 1.4 and I was wondering if it's a bug or a new
feature and if there is a trick to get this working again?

Best regards,
Stefan







--
With best regards from Nürnberg,
Stefan Moises

***
Stefan Moises
Senior Software Developer

shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck

Tel.: 0911/25566-25
Fax:  0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
***




Re: Debugging a Solr/Jetty Hung Process

2011-06-07 Thread Chris Cowan
OK... The fix I thought would fix it didn't fix it (which was to use the 
commitWithin feature). What I can gather from `ps` is that the thread has pages 
locked in memory. Currently I'm using native locking for Solr. Would switching 
to simple help alleviate this problem?

Chris

On Jun 4, 2011, at 2:48 PM, Chris Cowan wrote:

 I found this thread that looks similar to what's happening on my system. I 
 think what happens is there are multiple commits happening at once from the 
 clients and it's causing the same issue. I'm going to use the commitWithin 
 argument to the updates to see if that fixes the problem. I will report back 
 with any findings.
 
 Chris
 
 On Jun 1, 2011, at 12:42 PM, Jonathan Rochkind wrote:
 
 First guess (and it really is just a guess) would be Java garbage
 collection taking over. There are some JVM parameters you can use to
 tune the GC process; especially if the machine is multi-core, making
 sure GC happens in a separate thread is helpful.
 
 But figuring out exactly what's going on requires confusing JVM 
 debugging of which I am no expert at either.
 
 On 6/1/2011 3:04 PM, Chris Cowan wrote:
 About once a day a Solr/Jetty process gets hung on my server consuming 100% 
 of one of the CPU's. Once this happens the server no longer responds to 
 requests. I've looked through the logs to try and see if anything stands 
 out but so far I've found nothing out of the ordinary.
 
 My current remedy is to log in and just kill the single process that's
 hung. Once that happens everything goes back to normal and I'm good for a
 day or so. I'm currently running the following:
 
 solr-jetty-1.4.0+ds1-1ubuntu1
 
 which is comprised of
 
 Solr 1.4.0
 Jetty 6.1.22
 on Ubuntu 10.10
 
 I'm pretty new to managing a Jetty/Solr instance so at this point I'm just 
 looking for advice on how I should go about troubleshooting this problem.
 
 Chris
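
(On the GC guess above, a hedged example of flags for a multi-core box on
the Sun JVM of that era; the heap size is illustrative only:

java -server -Xmx1024m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -verbose:gc -Xloggc:gc.log -jar start.jar

A thread dump of the hung process, e.g. jstack <pid>, would also show
whether the spinning thread is a GC thread or a request thread.)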
 



Re: Default query parser operator

2011-06-07 Thread Brian Lamb
I feel like this should be fairly easy to do but I just don't see anywhere
in the documentation on how to do this. Perhaps I am using the wrong search
parameters.

On Mon, Jun 6, 2011 at 12:19 PM, Brian Lamb
brian.l...@journalexperts.com wrote:

 Hi all,

 Is it possible to change the query parser operator for a specific field
 without having to explicitly type it in the search field?

 For example, I'd like to use:

 http://localhost:8983/solr/search/?q=field1:word token field2:parser
 syntax

 instead of

 http://localhost:8983/solr/search/?q=field1:word AND token field2:parser
 syntax

 But, I only want it to be applied to field1, not field2 and I want the
 operator to always be AND unless the user explicitly types in OR.

 Thanks,

 Brian Lamb



Solr Custom Installation

2011-06-07 Thread Federico Czerwinski
Hey there. I was wondering if Solr can be embedded into my Java web app. As
far as I know, Solr comes as a war, or bundled with Jetty if you don't have a
container. I've opened the war's web.xml and found out that it only has a
couple of servlets and filters, and that's it.

So, would it be possible to declare those servlets in *my* web.xml, and
include the appropriate jars in my classpath, instead of having another
webapp deployed in the container? Does Solr have the jars mavenized?

Thank you

Fede.


Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Jonathan Rochkind
Um, normally that would never happen, because, well, like you say, the 
inverted index doesn't have docC for term K1, because doc C didn't 
include term K1.


If you search on q=K1, then how/why would docC ever be in your result 
set?  Are you seeing it in your result set? The question then would be 
_why_, what weird thing is going on to make that happen,  that's not 
expected.


The result set _starts_ from only the documents that actually include
the term. Boosting/relevancy ranking only affects what order these
documents appear in, but there's no reason document C should be in the
result set at all in your case of q=k1, where doc C is not indexed under k1.


On 6/7/2011 2:35 AM, Gabriele Kahlout wrote:

Sorry for being unclear, and thank you for answering.
Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3),
where A,B,C are document identifiers and the ks in bracket with each are the
terms each contains.
So Solr inverted index should be something like:

k0 --> A | C
k1 --> A | B
k2 --> A | B | C
k3 --> B | C

Now let q=k1, how do I make sure C doesn't appear as a result since it
doesn't contain any occurrence of k1?


Re: Default query parser operator

2011-06-07 Thread Jonathan Rochkind

Nope, not possible.

I'm not even sure what it would mean semantically. If you had default 
operator OR ordinarily, but default operator AND just for field2, 
then what would happen if you entered:


field1:foo field2:bar field1:baz field2:bom

Where the heck would the ANDs and ORs go?  The operators are BETWEEN the 
clauses that specify fields, they don't belong to a field. In general, 
the operators are part of the query as a whole, not any specific field.


In fact, I'd be careful of your example query:
q=field1:foo bar field2:baz

I don't think that means what you think it means; I don't think the
field1 applies to the bar in that case. Although I could be wrong,
you definitely want to check it. You need field1:foo field1:bar,
or set the default field for the query to field1, or use parens
(although that will change the execution strategy and ranking):
q=field1:(foo bar)


At any rate, even if there's a way to specify this so it makes sense, 
no, Solr/lucene doesn't support any such thing.
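
What is supported is changing the default operator for the whole query, not
per field; a hedged example with the standard parser's q.op parameter:

http://localhost:8983/solr/search/?q=field1:word token&q.op=AND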




On 6/7/2011 10:56 AM, Brian Lamb wrote:

I feel like this should be fairly easy to do but I just don't see anywhere
in the documentation on how to do this. Perhaps I am using the wrong search
parameters.

On Mon, Jun 6, 2011 at 12:19 PM, Brian Lamb
brian.l...@journalexperts.com wrote:


Hi all,

Is it possible to change the query parser operator for a specific field
without having to explicitly type it in the search field?

For example, I'd like to use:

http://localhost:8983/solr/search/?q=field1:word token field2:parser
syntax

instead of

http://localhost:8983/solr/search/?q=field1:word AND token field2:parser
syntax

But, I only want it to be applied to field1, not field2 and I want the
operator to always be AND unless the user explicitly types in OR.

Thanks,

Brian Lamb



Re: Solr Custom Installation

2011-06-07 Thread Tomás Fernández Löbbe
Hi Federico, you can take a look at this wiki page:
http://wiki.apache.org/solr/EmbeddedSolr
Solr also has some Maven support;
see the ant target generate-maven-artifacts, don't know if that's what you
need.
Regards,
Tomás
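
A minimal sketch of the embedded route from that wiki page, assuming the
SolrJ 1.4/3.x API; the solr home path and core name are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class EmbeddedExample {
    public static void main(String[] args) throws Exception {
        // solr home must contain solr.xml and the core's conf/ directory
        System.setProperty("solr.solr.home", "/path/to/solr/home");
        CoreContainer container = new CoreContainer.Initializer().initialize();
        SolrServer server = new EmbeddedSolrServer(container, "core1");

        // match-all query just to prove the core is up
        System.out.println(server.query(new SolrQuery("*:*")).getResults().getNumFound());

        container.shutdown();
    }
}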

On Tue, Jun 7, 2011 at 12:17 PM, Federico Czerwinski fed...@gmail.com wrote:

 Hey there. I was wondering if Solr can be embedded into my Java Web App. As
 far as I know, Solr comes as a war or bundled with Jetty if you don't have
 a
 container. I've opened the war's web.xml and found out that it only has a
 couple of servlets, filters and that's it.

 So, would it be possible to declare those servlets in *my* web.xml, and
 include the appropriate jars in my classpath, instead of having another
 webapp deployed in the container? Does Solr have the jars mavenized?

 Thank you

 Fede.



Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Gabriele Kahlout
You are right, Lucene will return results based on my scoring function
implementation (the Similarity class,
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html):

score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
It can be seen that whenever tf(t in d) =0 the whole score will be 0, so as
you say C will never be returned.

My issue is when the query has multiple terms (my example was too simple!),
and some are 'mandatory' while others are not. In that case I should make a
query that uses the '+' operator
(http://lucene.apache.org/java/2_9_1/queryparsersyntax.html), e.g. q=+k1.
I'm unsure I'll get the syntax right, but let's say k1 is mandatory and
k2 and k3 are optional; then q=k2 k3 +k1. I see that queries made through
solrj are received with + in place of the ' ' (default to OR), so
q=k2+k3++k1.



On Tue, Jun 7, 2011 at 5:23 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Um, normally that would never happen, because, well, like you say, the
 inverted index doesn't have docC for term K1, because doc C didn't include
 term K1.

 If you search on q=K1, then how/why would docC ever be in your result set?
  Are you seeing it in your result set? The question then would be _why_,
 what weird thing is going on to make that happen,  that's not expected.

  The result set _starts_ from only the documents that actually include the
  term. Boosting/relevancy ranking only affects what order these documents
  appear in, but there's no reason document C should be in the result set at
  all in your case of q=k1, where doc C is not indexed under k1.


 On 6/7/2011 2:35 AM, Gabriele Kahlout wrote:

 Sorry for being unclear, and thank you for answering.
 Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and
 C(k0,k2,k3),
 where A,B,C are document identifiers and the ks in bracket with each are
 the
 terms each contains.
 So Solr inverted index should be something like:

  k0 --> A | C
  k1 --> A | B
  k2 --> A | B | C
  k3 --> B | C

  Now let q=k1, how do I make sure C doesn't appear as a result since it
  doesn't contain any occurrence of k1?




-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Jonathan Rochkind
Okay, if you're using a custom similarity, I'm not sure what's going on, 
I'm not familiar with that.


But ordinarily, you are right, you would require k1 with +k1.

What you say about the + being lost suggests something is going wrong. 
Either you are not sending your query to Solr properly escaped, or 
there's a bug in your custom similarity or query parser, or (not too 
likely) there's a bug in Solr.


My experience is using the standard query parser, standard similarity
class, and contacting Solr via HTTP (are you using SolrJ or HTTP?). In
that case, when you send the q to Solr, you are responsible for
URI-encoding it when you send it. So if you want to send a query like
k2 k3 +k1, you need to URI-escape it first, and this is what you'd send:


q=k2+k3+%2Bk1

or, escaping spaces as %20 instead, which is actually more 'correct' 
with current standards:


q=k2%20k3%20%2Bk1

The important thing is that + escapes as %2B. You need to escape it
before sending it to Solr via an HTTP URI query string or HTTP form post
data. Yes, if you send a raw +, Solr will understand that as
representing a space, not an actual +. This is because the +
character is not 'safe'; it needs to be escaped. The programming
language of your choice probably already has a library function for
URI-escaping values.
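
For example, a small Java sketch (URLEncoder does form-encoding, which is
what Solr's HTTP endpoint accepts for query parameters):

import java.net.URLEncoder;

public class EncodeQuery {
  public static void main(String[] args) throws Exception {
    String q = "k2 k3 +k1";
    // Form-encoding: each space becomes '+', the literal '+' becomes %2B.
    System.out.println("q=" + URLEncoder.encode(q, "UTF-8"));
    // prints: q=k2+k3+%2Bk1
  }
}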


On 6/7/2011 11:36 AM, Gabriele Kahlout wrote:

You are right: Lucene will rank based on my scoring function
implementation (the Similarity class,
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html):

score(q,d) = coord(q,d) · queryNorm(q) · Σ over t in q of
             ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

(each factor is documented on the Similarity javadoc page linked above)
It can be seen that for a single-term query, whenever tf(t in d) = 0 the
whole score will be 0, so as you say C will never be returned.

My issue is when the query has multiple terms (my example was too simple!)
and some are 'mandatory' while others are not. In that case I should make a
query that uses the + operator
(http://lucene.apache.org/java/2_9_1/queryparsersyntax.html), e.g.
q=+k1.
I'm unsure I'll get the syntax right, but let's say k1 is mandatory and
k2 and k3 are optional; then q=k2 k3 +k1. I see that queries made through
SolrJ are received with + in place of the space (which defaults to OR), so
q=k2+k3++k1.








Data not always returned

2011-06-07 Thread Jerome Renard
Hi all,

I have a problem with my index. Even though I always index the same
data over and over again, whenever I run a couple of searches (they are
always the same, as they are issued by a unit test suite) I do not get the
same results: sometimes I get 3 successes and 2 failures, and sometimes it
is the other way around. It is unpredictable.

Here is what I am trying to do:

I created a new Solr core with its specific solrconfig.xml and schema.xml
This core stores a list of towns which I plan to use with an
auto-suggestion system, using ngrams (no Suggester)

The indexing process is always the same:
1. the import script deletes all documents in the core with
<delete><query>*:*</query></delete> and a <commit/>
2. the import script fetches data from postgres, 100 rows at a time
3. the import script adds these 100 documents and sends a <commit/>
4. once all the rows (around 40 000) have been imported, the script
sends an <optimize/> (roughly the cycle sketched in the SolrJ code below)
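
For reference, a minimal SolrJ sketch of that same cycle (an assumption on
my part: the real script may well be in another language, and the core URL,
field names and row values here are made up):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TownIndexer {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr/towns");
    server.deleteByQuery("*:*");  // 1. empty the core
    server.commit();
    for (int offset = 0; offset < 40000; offset += 100) {
      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 100; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", offset + i);                // hypothetical fields;
        doc.addField("name", "town-" + (offset + i));  // real values come from postgres
        batch.add(doc);
      }
      server.add(batch);  // 2-3. add the 100 documents
      server.commit();    //      and commit
    }
    server.optimize();    // 4. optimize once everything is in
  }
}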

Here is what happens:
I run the indexer once and search for 'foo': I get the results I expect, but
if I search for 'bar' I get nothing.
I reindex once again and search for 'foo': I get nothing, but if I
search for 'bar' I get results.
The search is made on the name field, which is a pretty common
TextField with ngrams.

I tried to physically remove the index (rm -rf path/to/index) and
reindex everything as well, and
not all searches work: sometimes the 'foo' search works, sometimes the
'bar' one.

I tried a lot of different things but now I am running out of ideas.
This is why I am asking for help.

Some useful informations :
Solr version : 3.1.0
Solr Implementation Version: 3.1.0 1085815 - grantingersoll -
2011-03-26 18:00:07
Lucene Implementation Version: 3.1.0 1085809 - 2011-03-26 18:06:58
Java 1.5.0_24 on Mac Os X
solrconfig.xml and schema.xml are attached

Thanks in advance for your help.




Question about tokenizing, searching and retrieving results.

2011-06-07 Thread Luis Cappa Banda
Hello!

My problem is as follows: I've got a field (indexed and stored set to
true) tokenized by whitespace and other patterns, with a gap of value
100. For example, if I index the following expression in the field I
mentioned:

*Expression*: A B C D E  ->  *Index*: tokenA | tokenB | tokenC | tokenD | tokenE

This behaviour is replicated in the search context, so any content
associated with this field during a search will be tokenized as I
explained. If I search the whole expression, the indexed document is
returned correctly as expected, but if I search something like:

*Expression*: A B C D E F G H I

it doesn't retrieve the document. What's happening? The expression should
match partially, and I thought that the document would be returned too. I
tried modifying the gap value but it doesn't work.


Thank you very much.


Re: Solr Cloud Query Question

2011-06-07 Thread Jamie Johnson
Thanks Yonik. I have a follow-on question: how does Solr ensure consistent
results across pages? For example, if we had my 3 theoretical Solr
instances again, and a, b and c each returned 100 documents with the same
score, and the user only requested 100 documents, how are those 100
documents chosen from the sets available from a, b and c if the documents
have the same score?

On Tue, Jun 7, 2011 at 9:38 AM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Tue, Jun 7, 2011 at 9:35 AM, Jamie Johnson jej2...@gmail.com wrote:
  I am currently experimenting with the Solr Cloud code on trunk and just
  had a quick question. Let's say my setup had 3 nodes a, b and c. Node a
  has 1000 results which meet a particular query, b has 2000 and c has
  3000. When executing this query and asking for row 900, what specifically
  happens? From reading the Distributed Search wiki I would expect that
  node a responds with 900, node b responds with 900 and c responds with
  900, and the coordinating node is responsible for taking the top-scored
  items and throwing away the rest. Is this correct, or is there some
  additional coordination that happens where nodes a, b and c return back
  an id and a score, and the coordinating node makes an additional request
  to get back the documents for the ids which make up the top list?

 The latter is correct - the first phase only collects enough
 information to merge ids from the shards, and then a second phase
 requests the stored fields, highlighting, etc for the specific docs
 that will be returned.

 -Yonik
 http://www.lucidimagination.com



Re: Default query parser operator

2011-06-07 Thread Brian Lamb
Hi Jonathan,

Thank you for your reply. Your point about my example is a good one. So let
me try to restate using your example. Suppose I want to apply AND to any
search terms within field1.

Then

field1:foo field2:bar field1:baz field2:bom

would be written as

http://localhost:8983/solr/?q=field1:foo OR field2:bar OR field1:baz OR
field2:bom

But if they were written together like:

http://localhost:8983/solr/?q=field1:(foo baz) field2:(bar bom)

I would want it to be

http://localhost:8983/solr/?q=field1:(foo AND baz) OR field2:(bar OR bom)

But it sounds like you are saying that would not be possible.

Thanks,

Brian Lamb

On Tue, Jun 7, 2011 at 11:27 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Nope, not possible.

 I'm not even sure what it would mean semantically. If you had default
 operator OR ordinarily, but default operator AND just for field2, then
 what would happen if you entered:

 field1:foo field2:bar field1:baz field2:bom

 Where the heck would the ANDs and ORs go?  The operators are BETWEEN the
 clauses that specify fields, they don't belong to a field. In general, the
 operators are part of the query as a whole, not any specific field.

 In fact, I'd be careful of your example query:
q=field1:foo bar field2:baz

 I don't think that means what you think it means, I don't think the
 field1 applies to the bar in that case. Although I could be wrong, but
 you definitely want to check it.  You need field1:foo field1:bar, or set
 the default field for the query to field1, or use parens (although that
 will change the execution strategy and ranking): q=field1:(foo bar)

 At any rate, even if there's a way to specify this so it makes sense, no,
 Solr/lucene doesn't support any such thing.




 On 6/7/2011 10:56 AM, Brian Lamb wrote:

 I feel like this should be fairly easy to do but I just don't see anywhere
 in the documentation on how to do this. Perhaps I am using the wrong
 search
 parameters.

 On Mon, Jun 6, 2011 at 12:19 PM, Brian Lamb
 brian.l...@journalexperts.com wrote:

  Hi all,

 Is it possible to change the query parser operator for a specific field
 without having to explicitly type it in the search field?

 For example, I'd like to use:

 http://localhost:8983/solr/search/?q=field1:word token field2:parser
 syntax

 instead of

 http://localhost:8983/solr/search/?q=field1:word AND token field2:parser
 syntax

 But, I only want it to be applied to field1, not field2 and I want the
 operator to always be AND unless the user explicitly types in OR.

 Thanks,

 Brian Lamb




Re: Question about tokenizing, searching and retrieving results.

2011-06-07 Thread Tomás Fernández Löbbe
My first guess would be that you are using AND as the default operator.
You can see the generated query by adding the parameter debugQuery=true.
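
For example (assuming the standard select handler):

http://localhost:8983/solr/select?q=A+B+C+D+E&debugQuery=true

The parsedquery section of the debug output shows exactly which clauses the
parser generated.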

On Tue, Jun 7, 2011 at 1:34 PM, Luis Cappa Banda luisca...@gmail.com wrote:



Re: Solr Cloud Query Question

2011-06-07 Thread Yonik Seeley
On Tue, Jun 7, 2011 at 1:01 PM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks Yonik. I have a follow-on question: how does Solr ensure consistent
 results across pages? For example, if we had my 3 theoretical Solr
 instances again, and a, b and c each returned 100 documents with the same
 score, and the user only requested 100 documents, how are those 100
 documents chosen from the sets available from a, b and c if the documents
 have the same score?

Ties within a shard are broken by docid (just like lucene), and ties
across different shards are broken by comparing the shard ids... so
yes, it's consistent.

-Yonik
http://www.lucidimagination.com


Re: Default query parser operator

2011-06-07 Thread Jonathan Rochkind

There's no feature in Solr to do what you ask, no. I don't think.

On 6/7/2011 1:30 PM, Brian Lamb wrote:

Hi Jonathan,

Thank you for your reply. Your point about my example is a good one. So let
me try to restate using your example. Suppose I want to apply AND to any
search terms within field1.

Then

field1:foo field2:bar field1:baz field2:bom

would be written as

http://localhost:8983/solr/?q=field1:foo OR field2:bar OR field1:baz OR
field2:bom

But if they were written together like:

http://localhost:8983/solr/?q=field1:(foo baz) field2:(bar bom)

I would want it to be

http://localhost:8983/solr/?q=field1:(foo AND baz) OR field2:(bar OR bom)

But it sounds like you are saying that would not be possible.

Thanks,

Brian Lamb





Re: Question about tokenizing, searching and retrieving results.

2011-06-07 Thread Yonik Seeley
On Tue, Jun 7, 2011 at 12:34 PM, Luis Cappa Banda luisca...@gmail.com wrote:
 *Expression*: A B C D E F G H I

As written, this is equivalent to

*Expression*: A default_field:B default_field:C default_field:D
default_field:E default_field:F default_field:G default_field:H
default_field:I

Try *Expression*:(A B C D E F G H I)
or *Expression*:"A B C D E F G H I" for a phrase query.
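
A quick way to see the difference is to print what the Lucene query parser
(which the standard Solr parser builds on) produces. A sketch against
Lucene 3.x, with WhitespaceAnalyzer standing in for whatever analyzer the
field really uses:

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;

public class ParseDemo {
  public static void main(String[] args) throws Exception {
    QueryParser p = new QueryParser(Version.LUCENE_31, "default_field",
        new WhitespaceAnalyzer(Version.LUCENE_31));
    System.out.println(p.parse("Expression:A B C D E"));
    // -> Expression:A default_field:B default_field:C default_field:D default_field:E
    System.out.println(p.parse("Expression:(A B C D E)"));
    // -> Expression:A Expression:B Expression:C Expression:D Expression:E
    System.out.println(p.parse("Expression:\"A B C D E\""));
    // -> Expression:"A B C D E" (a single phrase query)
  }
}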

Oh, and I highly recommend sticking to java identifiers for field
names - it will make your life much easier in the future.

-Yonik
http://www.lucidimagination.com


Solr Cloud and Range Facets

2011-06-07 Thread Jamie Johnson
I have a Solr Cloud setup with 2 servers. When executing a query against
them of the form:

http://localhost:8983/solr/select/?distrib=true&q=*:*&facet=true&facet.mincount=1&facet.range=dateTime&f.dateTime.facet.range.gap=%2B1MONTH&f.dateTime.facet.range.start=2011-06-01T00%3A00%3A00Z-1YEAR&f.dateTime.facet.range.end=2011-07-01T00%3A00%3A00Z&f.dateTime.facet.mincount=1&start=0&rows=0

I am seeing that sometimes the date facet has a count, and other times it
does not. Specifically, I sometimes see:

<lst name="facet_ranges">
  <lst name="dateTime">
    <lst name="counts"/>
    <str name="gap">+1MONTH</str>
    <date name="start">2010-06-01T00:00:00Z</date>
    <date name="end">2011-07-01T00:00:00Z</date>
  </lst>
</lst>

and other times:

<lst name="facet_ranges">
  <lst name="dateTime">
    <lst name="counts">
      <int name="2011-06-01T00:00:00Z">250</int>
    </lst>
    <str name="gap">+1MONTH</str>
    <date name="start">2010-06-01T00:00:00Z</date>
    <date name="end">2011-07-01T00:00:00Z</date>
  </lst>
</lst>

What could be causing this inconsistency?


Compound word search not what I expected

2011-06-07 Thread kenf_nc
I have a field defined as:
<field name="content" type="text" indexed="true" stored="false"
       termVectors="true" multiValued="true"/>
where text is unmodified from the schema.xml example that came with Solr
1.4.1.

I have documents with some compound words indexed, words like Sandstone,
and in several cases words that are camel case, like MaxSize. If I query
using all lower case (sandstone or maxsize), I get the documents I expect.
If I query with proper case (Sandstone or Maxsize), I get the documents I
expect. However, if the query is camel case (MaxSize or SandStone), it
doesn't find the documents. In the case of MaxSize it is particularly
frustrating because that is the actual case of the word that was indexed.
Is this expected behavior? The query analyzer definition for the text field
type is:
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" ignoreCase="true" expand="true"
          synonyms="synonyms.txt"/>
  <filter class="solr.StopFilterFactory" enablePositionIncrements="true"
          words="stopwords.txt" ignoreCase="true"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
          catenateAll="0" catenateNumbers="0" catenateWords="0"
          generateNumberParts="1" generateWordParts="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter language="English" class="solr.SnowballPorterFilterFactory"
          protected="protwords.txt"/>
</analyzer>

Is the order of the filters important? If LowerCaseFilterFactory came before
WordDelimiterFilterFactory, would that fix this? Would it break something
else?

Thanks,
Ken



Re: Compound word search not what I expected

2011-06-07 Thread Markus Jelsma
catenateWords should be set to true. Same goes for the index analyzer. 
preserveOriginal would also work.
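
For reference, the adjusted filter line might look something like this (a
sketch, untested; whether you also want catenateAll set depends on your
data):

<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
        catenateAll="0" catenateNumbers="0" catenateWords="1"
        generateNumberParts="1" generateWordParts="1" preserveOriginal="1"/>

With preserveOriginal="1" the unmodified token (MaxSize) is kept alongside
the parts (Max, Size), so the camel-case form becomes searchable again after
re-indexing.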



How to deal with many files using solr external file field

2011-06-07 Thread Bohnsack, Sven
Hi all,

we're using Solr 1.4 and the external file field ([1]) for sorting our search
results. We have about 40,000 terms for which we use this sorting option.
Currently we're running into massive OutOfMemory problems and we're not quite
sure what the matter is. It seems that the garbage collector stops working or
some processes are going wild. However, Solr starts to allocate more and more
RAM until we experience this OutOfMemory exception.

We noticed the following:

For some terms one can see in the Solr log that some
java.io.FileNotFoundExceptions appear when Solr tries to load an external
file for a term for which there is no such file, e.g. Solr tries to load the
external score file for "trousers" but there is none in the /solr/data
folder.

Question: is it possible that those exceptions are responsible for the
OutOfMemory problem, or could it be due to the large(?) number of 40k terms
for which we want to sort the result via external file field?

I'm looking forward to your answers, suggestions and ideas :)


Regards
Sven


[1]: 
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html


Available Solr Indexing strategies

2011-06-07 Thread zarni aung
Hi,

I am very new to Solr, and my client is trying to add full text search
capabilities to their product using Solr. They will also have a master
storage layer acting as the authoritative data store, which will also
serve metadata searches. Can you please point me to some indexing
strategies that people are using, for further research.

Thank you,

Zarni


Re: Data not always returned

2011-06-07 Thread Erick Erickson
Well, this is odd. Several questions:

1. What do your logs show? I'm wondering if somehow some data is getting
   rejected. I have no idea why that would be, but if you're seeing indexing
   exceptions that would explain it.
2. On the admin/stats page, are maxDocs and numDocs the same in the success
   and failure cases? And are they equal to 40,000?
3. What does debugQuery=on show in the two cases? I'd expect it to be
   identical, but...
4. In the admin/schema browser, look at your three fields and see if things
   like unique-terms are identical.
5. Are the rows being returned before indexing in the same order? I'm
   wondering if somehow you're getting documents overwritten by having the
   same id (uniqueKey).
6. Have you poked around with Luke to see what, if anything, is dissimilar?

These are shots in the dark, but my supposition is that somehow you're not
indexing what you expect; the questions above might give us a clue where to
look next.

Best
Erick

On Tue, Jun 7, 2011 at 12:02 PM, Jerome Renard jerome.ren...@gmail.com wrote:



Re: Compound word search not what I expected

2011-06-07 Thread Erick Erickson
WordDelimiterFilterFactory is doing this to you. It's not clear to me that
you want it in place at all.

Look at admin/analysis for that field to see how that filter breaks things
up; it's often surprising to people.

Best
Erick

On Tue, Jun 7, 2011 at 3:13 PM, kenf_nc ken.fos...@realestate.com wrote:



Re: Compound word search not what I expected

2011-06-07 Thread lee carroll
see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

from the wiki

Example of generateWordParts=1 and catenateWords=1:
PowerShot -> 0:Power, 1:Shot, 1:PowerShot
(where 0,1,1 are token positions)
A's+B's&C's -> 0:A, 1:B, 2:C, 2:ABC
Super-Duper-XL500-42-AutoCoder! -> 0:Super, 1:Duper, 2:XL,
2:SuperDuperXL, 3:500, 4:42, 5:Auto, 6:Coder, 6:AutoCoder

One use for WordDelimiterFilter is to help match words with different
delimiters. One way of doing so is to specify generateWordParts=1
catenateWords=1 in the analyzer used for indexing, and
generateWordParts=1 in the analyzer used for querying. Given that
the current StandardTokenizer immediately removes many intra-word
delimiters, it is recommended that this filter be used after a
tokenizer that leaves them in place (such as WhitespaceTokenizer).


Re: Compound word search not what I expected

2011-06-07 Thread kenf_nc
I tried setting catenateWords=1 on the Query analyzer and that didn't do
anything. I think what I need is to set my Index Analyzer to have
preserveOriginal=1 and then re-index everything. That will be a pain, so
I'll do a small test to make sure first. I'm really surprised
preserveOriginal=1 isn't the default. It's like saying "slice and dice
this word so I can search on all kinds of partial matches... but do NOT let
me search on the actual word itself." I know it's not quite that, but it's
close. Anyway, I'm going to try the preserveOriginal parameter on
WordDelimiterFilterFactory, on both the Index and Query side and see what
happens.

Thanks for all the suggestions,
Ken



Re: Default query parser operator

2011-06-07 Thread lee carroll
Hi Brian could your front end app do this field query logic?

(assuming you have an app in front of solr)
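
(If it can, a minimal sketch of that rewrite; all of the glue code here is
hypothetical, and the only Solr-facing part is the final query string:)

import java.util.Arrays;
import java.util.List;

public class QueryBuilder {

  // Join field1 terms with AND, field2 terms with OR, then OR the clauses.
  static String build(List<String> field1Terms, List<String> field2Terms) {
    return "field1:(" + join(field1Terms, " AND ") + ")"
        + " OR field2:(" + join(field2Terms, " OR ") + ")";
  }

  static String join(List<String> terms, String op) {
    StringBuilder sb = new StringBuilder();
    for (String t : terms) {
      if (sb.length() > 0) sb.append(op);
      sb.append(t);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(build(Arrays.asList("foo", "baz"),
                             Arrays.asList("bar", "bom")));
    // -> field1:(foo AND baz) OR field2:(bar OR bom)
  }
}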



On 7 June 2011 18:53, Jonathan Rochkind rochk...@jhu.edu wrote:
 There's no feature in Solr to do what you ask, no. I don't think.






Solr Coldfusion Search Issue

2011-06-07 Thread Alejandro Delgadillo
Hi,

I'm having some trouble using Solr through ColdFusion. The problem right
now is that when I search for a term in a custom field, the results
sometimes have the value that I sent to the custom field and not to the
field that contains the text. This is the cfsearch syntax that I'm using:

<cfsearch collection="agenda,bitacoras"
 criteria='contents:#form.search# AND custom1:#form.tema# AND custom2:#form.dia# AND custom4:#form.anio# AND custom3:#form.mon#'
 name="result" status="meta" startrow="#url.start#" maxrows="#max#"
 contextpassages="5" contexthighlightbegin="<B>"
 contexthighlightend="</B>" suggestions="always">

Every custom field gets its value from a combo box or drop-down with a list
of options. The thing is that when the user sends a search for CUSTOM1,
sometimes the results include the same searched value in CONTENTS...

Does anyone have an idea on how to fix this?

I'll appreciate all the help I can get.

Regards.
Alex


Re: Solr Coldfusion Search Issue

2011-06-07 Thread lee carroll
Can you see the query actually presented to Solr in the logs?

Maybe capture that and then run it with debug set to true in the admin
pages.

Sorry I can't help directly with your syntax.


On 7 June 2011 23:06, Alejandro Delgadillo adelgadi...@febg.org wrote:



Re: Compound word search not what I expected

2011-06-07 Thread Markus Jelsma
You must set catenateWords at index time as well.



wildcard search

2011-06-07 Thread Thomas Fischer
Hello,

I am testing Solr 3.2 and have problems with wildcards.
I am indexing values like IA 300; IC 330; IA 317; IA 318 in a field GOK,
and can't find a way to search them with wildcards.
I want to use a wildcard search to match something like IA 31? but cannot
find a way to do so.
GOK:IA\ 38* doesn't work with the contents of GOK indexed as text.
Is there a way to index and search that would meet my requirements?

Thomas




Re: Solr Coldfusion Search Issue

2011-06-07 Thread Alejandro Delgadillo
Thanks Lee for the quick response.

Let me explain it a little bit better.

In the CFSEARCH tag you use the CRITERIA attribute. What it does by default
is send the user's search query via POST to Solr, against the field where
the text is stored; in this case, since I'm indexing PDF files, the
CONTENTS variable in Solr...

The problem is that it also sends the custom field criteria to the contents
variable, and that's why I have, for example:

If I search for the value 03 in CUSTOM1, it also searches for the same
value in CONTENTS. It works, since the results are filtered by the value,
but the contents display the same value, in this case 03.

Maybe... I'm not sure... there is another way to search custom fields using
the CFSEARCH tag. I've tried changing the order, but I still get the same
result...


On 6/7/11 4:14 PM, lee carroll lee.a.carr...@googlemail.com wrote:





Re: wildcard search

2011-06-07 Thread Erick Erickson
Yes there is, but you haven't provided enough information to
make a suggestion. What is the fieldType definition? What is
the field definition?

Two resources that'll help you greatly are:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

and the admin/analysis page...

Best
Erick

On Tue, Jun 7, 2011 at 6:23 PM, Thomas Fischer fischer...@aon.at wrote:





400 MB Fields

2011-06-07 Thread Otis Gospodnetic
Hello,

What are the biggest document fields that you've ever indexed in Solr or that 
you've heard of?  Ah, it must be Tom's Hathi trust. :)

I'm asking because I just heard of a case of an index where some documents
have a field that can be around 400 MB in size! I'm curious if anyone has
any experience with such monster fields?
Crazy?  Yes, sure.
Doable?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: 400 MB Fields

2011-06-07 Thread Erick Erickson
From older (2.4) Lucene days, I once indexed the 23 volume Encyclopedia
of Michigan Civil War Volunteers in a single document/field, so it's probably
within the realm of possibility at least G...

Erick

On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:




Re: 400 MB Fields

2011-06-07 Thread Fuad Efendi
I think the question is strange... Maybe you are wondering about possible
OOM exceptions? I think we can pass Lucene a single document containing a
comma-separated list of term, term, ... (a few billion times)... except for
stored fields and TermVectorComponent...

I believe thousands of companies have already indexed millions of documents
with an average size of a few hundred MBytes... There should not be any
limits (except InputSource vs. ByteArray).

100,000 _unique_ terms vs. a single document containing 100,000,000,000,000
non-unique terms (and trying to store offsets)...

What about the spell checker feature? Has anyone tried to index a single
terabyte-sized document?

Personally, I have indexed only small (up to 1000 bytes) document fields,
but I believe 500 MB is a very common use case with PDFs (which vendors use
Lucene already? Eclipse, to index the Eclipse help file? Even Microsoft
uses Lucene...)


Fuad




On 11-06-07 7:02 PM, Erick Erickson erickerick...@gmail.com wrote:






Re: 400 MB Fields

2011-06-07 Thread Otis Gospodnetic
Hi,


 I think the question is strange... Maybe you are wondering about possible
 OOM exceptions?

No, that's an easier one. I was more wondering whether with 400 MB fields
(indexed, not stored) it becomes incredibly slow to:
* analyze
* commit / write to disk
* search

 I think we can pass to Lucene a single document containing a
 comma-separated list of term, term, ... (a few billion times)... except
 stored and TermVectorComponent...

Oh, I know it can be done, but I'm wondering how bad things (like the ones
above) get.

 I believe thousands of companies already indexed millions of documents
 with an average size of a few hundred MBytes... There should not be any
 limits (except

Which ones are you thinking about? What sort of documents?

 100,000 _unique_ terms vs. a single document containing
 100,000,000,000,000 non-unique terms (and trying to store offsets)

 Personally, I indexed only small (up to 1000 bytes) document fields, but
 I believe 500 MB is a very common use case with PDFs (which vendors use

Nah, PDF files may be big, but I think the text in them is often not *that*
big, unless those are PDFs of very big books.

Thanks,
Otis





Re: 400 MB Fields

2011-06-07 Thread Fuad Efendi
Hi Otis,


I am recalling the pagination feature; it is still unresolved (with the
default scoring implementation): even with small documents,
searching and retrieving documents 1 to 10 can take 0 milliseconds, but
from 100,000 to 100,010 can take a few minutes (I saw it with the trunk
version 6 months ago, and with very small documents, 100 million docs in
total); it is advisable to restrict search results to the top 1000 in any
case (as Google does)...

I believe things can go wrong; yes, most plain text retrieved from books
should be 2 KB per page, 500 pages := 1,000,000 bytes (or double it for
UTF-8).

Theoretically, it doesn't make any sense to index a BIG document containing
all terms from a dictionary without any term frequency calcs, but even
with them... I can't imagine we should index 1000s of docs where each is
just a (different) version of the whole of Wikipedia; that would be wrong
design...

Ok, use case: index a single HUGE document. What will we do? Create an
index with _the only_ document? All searches will return the same result
(or nothing)? Paginate it; split it into pages. I am pragmatic...


Fuad



On 11-06-07 8:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:





Re: 400 MB Fields

2011-06-07 Thread Lance Norskog
The Salesforce book is 2800 pages of PDF, last I looked.

What can you do with a field that big? Can you get all of the snippets?

On Tue, Jun 7, 2011 at 5:33 PM, Fuad Efendi f...@efendi.ca wrote:






-- 
Lance Norskog
goks...@gmail.com


RE: 400 MB Fields

2011-06-07 Thread Burton-West, Tom
Hi Otis, 

Our OCR fields average around 800 KB. My guess is that the largest docs we
index (in a single OCR field) are somewhere between 2 and 10 MB. We have
had issues where the in-memory representation of the document (the
in-memory index structures being built) is several times the size of the
text, so I would suspect that even with the largest ramBufferSizeMB you
might run into problems. (This is with the 3.x branch. Trunk might not have
this problem since it's much more memory efficient when indexing.)

Tom Burton-West
www.hathitrust.org/blogs




tika integration exception and other related queries

2011-06-07 Thread Naveen Gupta
Hi, can somebody answer this?

3. Can somebody give me an idea how to index a zip file?

1. While sending docx, we are getting the following error:

 java.lang.NumberFormatException: For input string: "2011-01-27T07:18:00Z"
 at
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
 at java.lang.Long.parseLong(Long.java:412)
 at java.lang.Long.parseLong(Long.java:461)
 at org.apache.solr.schema.TrieField.createField(TrieField.java:434)
 at
 org.apache.solr.schema.SchemaField.createField(SchemaField.java:98)
 at
 org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:204)
 at
 org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:277)
 at
 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
 at
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
 at
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
 at
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:198)
 at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:238)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
 at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
 at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
 at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
 at
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
 at
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
 at
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
 at java.lang.Thread.run(Thread.java:619)



Thanks
Naveen
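
(The Long.parseLong in that trace suggests Tika's last-modified metadata,
the ISO date 2011-01-27T07:18:00Z, is being routed into a Trie long field.
If that guess is right, the receiving field needs a date type instead; a
sketch of what the schema.xml entry might look like, field name assumed:

<field name="last_modified" type="tdate" indexed="true" stored="true"/>

Alternatively, the fmap.* parameters of the ExtractingRequestHandler can
redirect the offending metadata field to a date-typed field.)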



On Tue, Jun 7, 2011 at 3:33 PM, Naveen Gupta nkgiit...@gmail.com wrote:

 Hi

 We are using the ExtractingRequestHandler and we are getting the following
 error when we give it a Microsoft docx file for indexing.

 I think this has something to do with the date field definition... but I'm
 not very sure... what field type should we use?

 2. we are trying to index a jpg (when we search over the name of the jpg,
 it does not come back... though I am passing an id)

 3. what about zip files or rar files... does Tika with Solr handle those?





