Analyzing CSV phrase fields

2008-11-24 Thread Neal Richter
Hey all,

Very basic question: I want to index fields of comma-separated values.

Example document:
id: 1
title: Football Teams
keywords: philadelphia eagles, cleveland browns, new york jets

id: 2
title: Baseball Teams
keywords:"philadelphia phillies", "new york yankees", "cleveland indians"

A query of 'new york' should return the obvious documents, but a quoted
phrase query of "yankees cleveland" should return nothing... meaning that
a comma breaks phrases without fail.

I've created a textCSV type in the schema.xml file and used the
PatternTokenizerFactory to split on commas; from there, analysis can
proceed as normal via StopFilterFactory, LowerCaseFilter, and
RemoveDuplicatesTokenFilter.



Has anyone done this before?  Can I somehow use an existing Analyzer (or a
combination of them)?  It seems as though I need to create a
PhraseDelimiterFilter from the WordDelimiterFilter... though I am sure there
is a way to make an existing analyzer break things up the way I want.

Thanks - Neal Richter
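The textCSV type described above might look like the following in schema.xml. This is a sketch under assumptions: the split pattern, the filter order, and the Solr 1.x class names are mine, untested against Neal's setup.

```xml
<!-- sketch only: pattern and filter chain are assumptions -->
<fieldType name="textCSV" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- one token per comma-separated value -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that this tokenizer emits each comma-separated value as a single token, so phrase queries cannot cross a comma; whether word-level matching inside a value still works would depend on adding a further word-splitting filter.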


Re: new faceting algorithm

2008-11-24 Thread Yonik Seeley
And if you want to verify that the new faceting code has indeed kicked
in, some statistics are logged, like:

Nov 24, 2008 11:14:32 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field features, memSize=14584, time=47, phase1=47,
 nTerms=285, bigTerms=99, termInstances=186

-Yonik

On Mon, Nov 24, 2008 at 11:12 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> A new faceting algorithm has been committed to the development version
> of Solr, and should be available in the next nightly test build (will
> be dated 11-25).  This change should generally improve field faceting
> where the field has many unique values but relatively few values per
> document.  This new algorithm is now the default for multi-valued
> fields (including tokenized fields) so you shouldn't have to do
> anything to enable it.  We'd love some feedback on how it works to
> ensure that it actually is a win for the majority and should be the
> default.
>
> -Yonik
>


Solr/PHP/Java job opportunity

2008-11-24 Thread Kentster

Lead PHP Developer

Successful Holland, Michigan based ecommerce retailer seeks a talented
senior developer to improve and build upon our existing customer facing
systems. The role is very hands on and PHP intensive but will involve a
growing leadership presence as the company expands. A minimum of 5 years
prior experience with PHP is required. We strongly desire an individual who
has a track record of building highly scalable web applications. 

Responsibilities
•   Design new features, both on the back-end (data modeling and management)
and the front-end, using PHP, MySQL, SOLR/Lucene, AJAX and other open-source
technologies. 
•   Experience with large data sets; our environment manages over 700,000
SKUs.
•   Develop and evolve website infrastructure and back-end processes that
manage business critical tasks. 
•   Continuous improvement of development processes and architectural
patterns. 

Requirements
•   Experience building high quality, scalable web applications. Emphasis on
ecommerce.
•   Extensive experience programming in PHP 4/5 (5 preferred)
•   XML, SOAP, cURL
•   Experience with JavaScript, JSON, AJAX, CSS, and XHTML
•   Expert knowledge of SQL and relational database structures.
•   A good understanding of faceted / guided navigation and search
technologies
•   Ability to develop and document clean, object oriented code.


Prior history with the following is not required, but definitely helpful in
our environment:
•   Java
•   Tomcat
•   Spring Framework
•   Pentaho 
•   Authorize.net API integration
•   Shell scripting
•   Smarty or related template systems
•   JavaScript frameworks such as jQuery

Excellent compensation and benefits are available for the right candidate. 


-- 
View this message in context: 
http://www.nabble.com/Solr-PHP-Java-job-opportunity-tp20675538p20675538.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Facet Query (fq) and Query (q)

2008-11-24 Thread Erik Hatcher
First, terminology clarification.  fq is *filter* query.  facet.query  
is facet query.



On Nov 24, 2008, at 9:55 AM, Jae Joo wrote:
I am having some trouble utilizing the facet query. As I understand it, the
facet query has better performance than a simple query (q).


The performance is (about?) the same, caches excluded.  But fq's get  
added to the filterCache, q's to the queryCache.



Here is the example.

http://localhost:8080/test_solr/select?q=*:*&facet=true&fq=state:CA&facet.mincount=1&facet.field=city&facet.field=sector&facet.limit=-1&sort=score+desc

--> facet by sector and city for state of CA.
Any idea how to optimize this query to avoid "q=*:*"?


In this case you could use q=state:CA and remove fq=state:CA, but  
performance-wise as well as all other things considered, I'd stick  
with what you've got - just warm those filterCaches up and your  
performance should be quite acceptable.  Report back with more details  
if not.


Erik
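The filterCache warming Erik mentions lives in solrconfig.xml. A hedged sketch follows; the cache sizes and the warming query are illustrative choices, not recommendations from this thread.

```xml
<!-- filterCache holds the DocSets produced by fq clauses;
     autowarmCount repopulates the most-used entries when a new
     searcher opens -->
<filterCache class="solr.LRUCache"
             size="512"
             initialSize="512"
             autowarmCount="256"/>

<!-- optionally, a static warming query so a filter like fq=state:CA
     is already cached at startup -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="fq">state:CA</str></lst>
  </arr>
</listener>
```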


Re: broken socket in Jetty causing invalid XML ?

2008-11-24 Thread Yonik Seeley
I thought the Jetty maxFormContentSize was only for form data (not for
the POST body).
Does increasing this param help?

-Yonik


On Mon, Nov 24, 2008 at 2:45 PM, Anoop Bhatti <[EMAIL PROTECTED]> wrote:
> Hello Solr Community,
>
> I'm getting the stacktrace below when adding docs using the
> CommonsHttpSolrServer.add(Collection
> docs)
> method.  The server doesn't seem to be able to recover from this error.
> We are adding a collection with 1,000 SolrInputDocument's at a time.
> I'm using Solr 1.3.0 and Java 1.6.0_07.
>
> It seems that this problem occurs in Jetty when the TCP connection is
> broken while the stream (from the add(...) method) is being read.  The
> XML read from the broken stream is not valid.  Is this a correct
> diagnosis?
>
> Could this stacktrace be occurring when the max POST size has been
> exceeded?  I'm referring to the example/etc/jetty.xml file, which has
> the setting:
>
>   <Call class="java.lang.System" name="setProperty">
>     <Arg>org.mortbay.jetty.Request.maxFormContentSize</Arg>
>     <Arg>100</Arg>
>   </Call>
>
> Right now maxFormContentSize is set to the default 1 MB on my server.
>
> Also, in some cases I have two clients, could Jetty be blocking one
> client and causing it to finally timeout?
>
> This stacktrace doesn't happen right away; it occurs once the Lucene
> indexes are about 30 GB.
> Could the periodic merging of segments be the culprit?
>
> I was also thinking that the problem could be with writing back the
> response (the UpdateRequest).
>
> Here's the gist of my Java client code:
>
> CommonsHttpSolrServer solrServer = new CommonsHttpSolrServer(solrServerURL);
> solrServer.setConnectionTimeout(100);
> solrServer.setDefaultMaxConnectionsPerHost(100);
> solrServer.setMaxTotalConnections(100);
> solrServer.add(solrDocs); // the collection of docs
> solrServer.commit();
>
> And here's the stacktrace:
>
> Nov 20, 2008 5:25:33 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update params={wt=javabin&version=2.2}
> status=0 QTime=469
> Nov 20, 2008 5:25:37 PM org.apache.solr.common.SolrException log
> SEVERE: com.ctc.wstx.exc.WstxIOException: null
>at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
>at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
>at 
> org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321)
>at 
> org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
>at 
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
>at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>at 
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>at org.mortbay.jetty.Server.handle(Server.java:285)
>at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>at 
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>at 
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>at 
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.mortbay.jetty.EofException
>at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:302)
>at 
> org.mortbay.jetty.HttpParser$Input.blockForContent(HttpParser.java:919)
>at org.mortbay.jetty.HttpParser$Input.read(HttpParser.java:897)
>at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:411)
>at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
>at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:18

new faceting algorithm

2008-11-24 Thread Yonik Seeley
A new faceting algorithm has been committed to the development version
of Solr, and should be available in the next nightly test build (will
be dated 11-25).  This change should generally improve field faceting
where the field has many unique values but relatively few values per
document.  This new algorithm is now the default for multi-valued
fields (including tokenized fields) so you shouldn't have to do
anything to enable it.  We'd love some feedback on how it works to
ensure that it actually is a win for the majority and should be the
default.

-Yonik


Re: New user questions

2008-11-24 Thread Chris Hostetter

: I've gotten Solr up and running, I can ingest the demo objects and
: query them via the admin tool, so far so good.  Now, how do I ingest
: some basic XML, how can I pull from an existing MySQL database, what
: about pulling records in via OAI?  I'm assuming I need to write some
: schemas for that, can someone point me to some documentation or
: tutorial for that?

my "Apache Solr: Out Of The Box" apachecon talk from this year covers 
(to various degrees) all of the features of Solr -- it can give you the 
vocabulary to search the wiki and mailing list archives for more 
details...

http://people.apache.org/~hossman/apachecon2008us/ootb/



-Hoss



RE: Sorting and JVM heap size ....

2008-11-24 Thread souravm
Thanks Yonik. It explains.

Regards,
Sourav

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Monday, November 24, 2008 7:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Sorting and JVM heap size 

On Mon, Nov 24, 2008 at 9:19 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi Yonik,
>
> Thanks again for the detail input.
>
> Let me try to re-confirm my understanding -
>
> 1. What you say is - if sorting is asked for a field, the same field from all 
> documents, which are indexed, would be put in a memory in an un-inverted 
> form. So given this if I have a field of String type with say 20 characters, 
> then (assuming no multibyte characters - all ascii) for 200M documents I need 
> to have at least 20x200 MB, i.e. 4GB memory.

That's the general idea, yes.
For Strings, it's actually just the unique values in a String[], plus
an int[2] of offsets into that String[] for each document.
See Lucene's FieldCache and StringIndex.

-Yonik


> 2. So, if I want to have sorting on 2 such fields I need to allocate at least 
> 8 GB of memory.
>
> 3. Another case is - if there are 2 search requests concurrently hitting the 
> server, each with sorting on the same 20 character date field, then also it 
> would need 2x2GB memory. So if I know that I need to support at least 4 
> concurrent search requests, I need to start the JVM at least with 8 GB heap 
> size.
>
> Please let me know if my understanding is correct.
>
> Regards,
> Sourav
>
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
> Sent: Monday, November 24, 2008 6:03 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Sorting and JVM heap size 
>
> On Mon, Nov 24, 2008 at 8:48 PM, souravm <[EMAIL PROTECTED]> wrote:
>> I have around 200M documents in index. The field I'm sorting on is a date 
>> string (containing date and time in dd-mmm-  hh:mm:yy format) and the 
>> field is part of the search criteria.
>>
>> Also please note that the number of documents returned by the search 
>> criteria is much less than 200M. In fact even in case of 0 hit I found jvm 
>> out of memory exception.
>
> Right... that's just how the Lucene FieldCache used for sorting works right 
> now.
> The entire field is un-inverted and held in memory.
>
> 200M docs is a *lot*... you might try indexing your date fields as
> integer types that would take only 4 bytes per doc - and that will
> still take up 800M.  Given that 2 searchers can overlap, that still
> adds up to more than your heap - you will need to up that.
>
> The other option is to split your index across multiple nodes and use
> distributed search.  If you want to do any faceting in the future, or
> sort on multiple fields, you will need to do this anyway.
>
> -Yonik
>
>  CAUTION - Disclaimer *
> This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely
> for the use of the addressee(s). If you are not the intended recipient, please
> notify the sender by e-mail and delete the original message. Further, you are 
> not
> to copy, disclose, or distribute this e-mail or its contents to any other 
> person and
> any such actions are unlawful. This e-mail may contain viruses. Infosys has 
> taken
> every reasonable precaution to minimize this risk, but is not liable for any 
> damage
> you may sustain as a result of any virus in this e-mail. You should carry out 
> your
> own virus checks before opening the e-mail or attachment. Infosys reserves the
> right to monitor and review the content of all messages sent to or from this 
> e-mail
> address. Messages sent to or from this e-mail address may be stored on the
> Infosys e-mail system.
> ***INFOSYS End of Disclaimer INFOSYS***
>


Re: Sorting and JVM heap size ....

2008-11-24 Thread Yonik Seeley
On Mon, Nov 24, 2008 at 9:19 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi Yonik,
>
> Thanks again for the detail input.
>
> Let me try to re-confirm my understanding -
>
> 1. What you say is - if sorting is asked for a field, the same field from all 
> documents, which are indexed, would be put in a memory in an un-inverted 
> form. So given this if I have a field of String type with say 20 characters, 
> then (assuming no multibyte characters - all ascii) for 200M documents I need 
> to have at least 20x200 MB, i.e. 4GB memory.

That's the general idea, yes.
For Strings, it's actually just the unique values in a String[], plus
an int[2] of offsets into that String[] for each document.
See Lucene's FieldCache and StringIndex.

-Yonik


> 2. So, if I want to have sorting on 2 such fields I need to allocate at least 
> 8 GB of memory.
>
> 3. Another case is - if there are 2 search requests concurrently hitting the 
> server, each with sorting on the same 20 character date field, then also it 
> would need 2x2GB memory. So if I know that I need to support at least 4 
> concurrent search requests, I need to start the JVM at least with 8 GB heap 
> size.
>
> Please let me know if my understanding is correct.
>
> Regards,
> Sourav
>
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
> Sent: Monday, November 24, 2008 6:03 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Sorting and JVM heap size 
>
> On Mon, Nov 24, 2008 at 8:48 PM, souravm <[EMAIL PROTECTED]> wrote:
>> I have around 200M documents in index. The field I'm sorting on is a date 
>> string (containing date and time in dd-mmm-  hh:mm:yy format) and the 
>> field is part of the search criteria.
>>
>> Also please note that the number of documents returned by the search 
>> criteria is much less than 200M. In fact even in case of 0 hit I found jvm 
>> out of memory exception.
>
> Right... that's just how the Lucene FieldCache used for sorting works right 
> now.
> The entire field is un-inverted and held in memory.
>
> 200M docs is a *lot*... you might try indexing your date fields as
> integer types that would take only 4 bytes per doc - and that will
> still take up 800M.  Given that 2 searchers can overlap, that still
> adds up to more than your heap - you will need to up that.
>
> The other option is to split your index across multiple nodes and use
> distributed search.  If you want to do any faceting in the future, or
> sort on multiple fields, you will need to do this anyway.
>
> -Yonik
>


Re: a question about solr phrasequery process.

2008-11-24 Thread Yonik Seeley
2008/11/24 finy finy <[EMAIL PROTECTED]>:
> thanks for your help
> ***
> set the position increment of all of the
> tokens you get to 0... this should have the effect of an "OR".
> ***
> how to do this? in solr solrconfig.xml? or schema.xml?
> please help me.

I assumed you already had a custom TokenFilter or Analyzer coded in
Java to turn "oneworldonedream" into multiple tokens.  Call
Token.setPositionIncrement(0) on all but the first token for any group
of tokens you produce from single words.

-Yonik
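Yonik's suggestion can be sketched without Lucene: model each token with its increment and compute absolute positions the way Lucene's indexer does (position += increment per token). This is plain illustrative Java, not Lucene API; only the `Token.setPositionIncrement(0)` call in the message above is the real interface.

```java
import java.util.*;

// Illustrative only: shows why increment 0 stacks tokens at one
// position (the "OR" effect Yonik describes). Not Lucene code.
public class PositionIncrementDemo {

    static class Token {
        final String term;
        final int posIncrement;
        Token(String term, int posIncrement) {
            this.term = term;
            this.posIncrement = posIncrement;
        }
    }

    // Absolute positions, computed as Lucene does: pos += increment.
    static Map<String, Integer> positions(List<Token> tokens) {
        Map<String, Integer> out = new LinkedHashMap<String, Integer>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posIncrement;
            out.put(t.term, pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // "oneworldonedream" decompounded; all but the first sub-token
        // get increment 0, so every term lands at the same position.
        List<Token> tokens = Arrays.asList(
                new Token("oneworldonedream", 1),
                new Token("one", 0),
                new Token("world", 0),
                new Token("dream", 0));
        System.out.println(positions(tokens));
        // prints {oneworldonedream=0, one=0, world=0, dream=0}
    }
}
```

Because all the terms occupy position 0, a phrase or term query matching any one of them matches the document at that slot, which is the OR behavior the original poster wants.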


Re: Phrase query search with stopwords

2008-11-24 Thread Yonik Seeley
Ack!  I tried it too, and it failed for me also.
The analysis page indicates that the tokens are all in the same
positions... need to look into this deeper.
Could you open up a JIRA issue?

-Yonik

On Mon, Nov 24, 2008 at 5:58 PM, Robert Haschart <[EMAIL PROTECTED]> wrote:
> Yonik,
>
> I did make sure enablePositionIncrements="true"  for both indexing and
> queries and just did a test where I  re-indexed a couple of test record
> sets, and submitted a query from the solr admin page, this time searching
> for  title_text:"gone with the wind"  which should return three hits, and
> again it returns 0 hits.
>
> I also tried modifying SolrQueryParser to set setEnablePositionIncrements
> to true, thinking that would fix the problem, but it doesn't seem to.
>
>
> -Bob
>
>
> Yonik Seeley wrote:
>
>> Robert,
>>
>> I've reproduced (sort of) this bad behavior with the example schema.
>> There was an example configuration "bug" introduced in SOLR-521
>> where enablePositionIncrements="true" was only set on the index
>> analyzer but not the query analyzer for the "text" fieldType.
>>
>> A query on the example data of
>> features:"Optimized for High Volume Web Traffic"
>> will not match any documents.
>>
>> You seem to indicate that enablePositionIncrements="true" is set for
>> both your index and query analyzer.  Can you verify that, and verify
>> that you restarted solr and reindexed after that change was made?
>>
>> -Yonik
>>
>>
>>
>> On Thu, Nov 20, 2008 at 1:30 PM, Robert Haschart <[EMAIL PROTECTED]>
>> wrote:
>>
>>>
>>> Greetings all,
>>>
>>> I'm having trouble tracking down why a particular query is not working.
>>> A
>>> user is trying to do a search for alternate_form_title_text:"three films
>>> by
>>> louis malle"  specifically to find the 4 records that contain the phrase
>>> "Three films by Louis Malle" in their alternate_form_title_text field.
>>> However the search returns 0 records.
>>>
>>> The modified searches:
>>>
>>> alternate_form_title_text:"three films by louis malle"~1
>>>
>>> or
>>>
>>> alternate_form_title_text:"three films" AND
>>> alternate_form_title_text:"louis
>>> malle"
>>>
>>> both return the 4 records.   So it seems that the word "by", which is
>>> listed in the stopword filter list, is causing the problem.
>>>
>>> The analyzer/filter sequence for indexing the alternate_form_title_text
>>> field is _almost_ exactly the same as the sequence for querying that
>>> field.
>>>
>>> for indexing the sequence is:
>>>
>>> org.apache.solr.analysis.HTMLStripWhitespaceTokenizerFactory   {}
>>> schema.UnicodeNormalizationFilterFactory {composed=false,
>>> remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
>>> schema.CJKFilterFactory   {bigrams=false}
>>> org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
>>> ignoreCase=true, enablePositionIncrements=true}
>>>
>>> org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1,
>>> catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
>>> org.apache.solr.analysis.LowerCaseFilterFactory   {}
>>> org.apache.solr.analysis.EnglishPorterFilterFactory
>>> {protected=protwords.txt}
>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>>>
>>> for querying the sequence is:
>>>
>>> org.apache.solr.analysis.WhitespaceTokenizerFactory   {}
>>> schema.UnicodeNormalizationFilterFactory {composed=false,
>>> remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
>>> schema.CJKFilterFactory   {bigrams=false}
>>> org.apache.solr.analysis.SynonymFilterFactory   {synonyms=synonyms.txt,
>>> expand=true, ignoreCase=true}
>>> org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
>>> ignoreCase=true, enablePositionIncrements=true}
>>>
>>> org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1,
>>> catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
>>> org.apache.solr.analysis.LowerCaseFilterFactory   {}
>>> org.apache.solr.analysis.EnglishPorterFilterFactory
>>> {protected=protwords.txt}
>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>>>
>>>
>>> If I run a test through the field analysis admin page, submitting the
>>> string *three films by louis malle* through both the Field value (Index)
>>> and the Field value (Query) boxes, the results (shown below) seem to
>>> indicate that the query ought to find the 4 records in question, but it
>>> does not, and I'm at a loss to explain why.
>>>
>>>
>>>   Index Analyzer
>>>
>>> term position      1      2      4      5
>>> term text          three  film   loui   mall
>>> term type          word   word   word   word
>>> source start,end   0,5    6,11   15,20  21,26
>>>
>>>
>>>
>>>   Query Analyzer
>>>
>>> term position      1      2      4      5
>>> term text          three  film   loui   mall
>>> term type          word   word   word   word
>>> source start,end   0,5    6,11   15,20  21,26
>>>
>>>
>>>
>>>
>>>
>
>
>
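For reference, the SOLR-521-style config bug Yonik describes earlier in this thread is enablePositionIncrements="true" present on the index analyzer but missing from the query analyzer; the corrected shape sets it on both. This is a trimmed sketch, and Robert reports his config already does this, so his case may be a different issue.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- must be set here AND in the query analyzer -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```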


RE: Sorting and JVM heap size ....

2008-11-24 Thread souravm
Hi Yonik,

Thanks again for the detail input.

Let me try to re-confirm my understanding -

1. What you say is - if sorting is asked for a field, the same field from all 
documents, which are indexed, would be put in memory in an un-inverted form. 
So given this, if I have a field of String type with say 20 characters, then 
(assuming no multibyte characters - all ASCII) for 200M documents I need to 
have at least 20x200 MB, i.e. 4GB of memory.

2. So, if I want to have sorting on 2 such fields I need to allocate at least 8 
GB of memory.

3. Another case is - if there are 2 search requests concurrently hitting the 
server, each with sorting on the same 20 character date field, then also it 
would need 2x2GB memory. So if I know that I need to support at least 4 
concurrent search requests, I need to start the JVM at least with 8 GB heap 
size. 

Please let me know if my understanding is correct.

Regards,
Sourav

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Monday, November 24, 2008 6:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Sorting and JVM heap size 

On Mon, Nov 24, 2008 at 8:48 PM, souravm <[EMAIL PROTECTED]> wrote:
> I have around 200M documents in index. The field I'm sorting on is a date 
> string (containing date and time in dd-mmm-  hh:mm:yy format) and the 
> field is part of the search criteria.
>
> Also please note that the number of documents returned by the search criteria 
> is much less than 200M. In fact even in case of 0 hit I found jvm out of 
> memory exception.

Right... that's just how the Lucene FieldCache used for sorting works right now.
The entire field is un-inverted and held in memory.

200M docs is a *lot*... you might try indexing your date fields as
integer types that would take only 4 bytes per doc - and that will
still take up 800M.  Given that 2 searchers can overlap, that still
adds up to more than your heap - you will need to up that.

The other option is to split your index across multiple nodes and use
distributed search.  If you want to do any faceting in the future, or
sort on multiple fields, you will need to do this anyway.

-Yonik



Re: solr.WordDelimiterFilterFactory

2008-11-24 Thread Yonik Seeley
On Thu, Nov 20, 2008 at 9:20 AM, Daniel Rosher <[EMAIL PROTECTED]> wrote:
> I'm trying to index some content that has things like 'java/J2EE' but with
> solr.WordDelimiterFilterFactory and parameters [generateWordParts="1"
> generateNumberParts="0" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0"] this ends up tokenized as
> 'java','j','2',EE'
>
> Does anyone know a way of having this tokenized as 'java','j2ee'.
>
> Perhaps this filter need something like a protected list of tokens not to
> tokenize like EnglishPorterFilter ?

In addition to the other replies, you could use the SynonymFilter to
normalize certain terms before the WDF (assuming you want to keep the
WDF for other things).

Perhaps try the following synonym rules at both index and query time:

j2ee => javatwoee
java/j2ee => java javatwoee

-Yonik
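A sketch of how those synonym rules might be wired into the field's analyzer; the file name and the placement before the WordDelimiterFilter are assumptions drawn from Yonik's suggestion, not a verified config.

```xml
<!-- synonyms.txt (applied at both index and query time):
       j2ee => javatwoee
       java/j2ee => java javatwoee
-->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
<!-- the WDF then sees "javatwoee" instead of "j2ee" and has
     nothing to split -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="0" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="0"/>
```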


Re: Sorting and JVM heap size ....

2008-11-24 Thread Yonik Seeley
On Mon, Nov 24, 2008 at 8:48 PM, souravm <[EMAIL PROTECTED]> wrote:
> I have around 200M documents in index. The field I'm sorting on is a date 
> string (containing date and time in dd-mmm-  hh:mm:yy format) and the 
> field is part of the search criteria.
>
> Also please note that the number of documents returned by the search criteria 
> is much less than 200M. In fact even in case of 0 hit I found jvm out of 
> memory exception.

Right... that's just how the Lucene FieldCache used for sorting works right now.
The entire field is un-inverted and held in memory.

200M docs is a *lot*... you might try indexing your date fields as
integer types that would take only 4 bytes per doc - and that will
still take up 800M.  Given that 2 searchers can overlap, that still
adds up to more than your heap - you will need to up that.

The other option is to split your index across multiple nodes and use
distributed search.  If you want to do any faceting in the future, or
sort on multiple fields, you will need to do this anyway.

-Yonik
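The figures in this exchange are straightforward arithmetic; here is a back-of-envelope sketch (my own illustrative code, not Solr internals). Note Yonik's caveat elsewhere in the thread that for Strings the cache actually stores unique values plus per-document offsets, so 20 bytes/doc is a naive upper bound.

```java
// Rough heap cost of un-inverting a sort field: ~bytesPerDoc * numDocs.
public class FieldCacheEstimate {

    static long fieldCacheBytes(long numDocs, long bytesPerDoc) {
        return numDocs * bytesPerDoc;
    }

    public static void main(String[] args) {
        long numDocs = 200_000_000L; // 200M documents, per the thread

        // date kept as a 20-char ASCII string: ~20 bytes per doc
        System.out.println(fieldCacheBytes(numDocs, 20)); // 4000000000 (~4 GB)

        // date re-indexed as an int: 4 bytes per doc
        System.out.println(fieldCacheBytes(numDocs, 4));  // 800000000 (~800 MB)
    }
}
```

With two overlapping searchers, each estimate roughly doubles, which is why Yonik says the heap must be raised or the index split across nodes.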


Re: Using Solr for indexing emails

2008-11-24 Thread Timo Sirainen
On Tue, 2008-11-25 at 12:20 +1100, Norberto Meijome wrote:
> > Store the per-mailbox highest indexed UID in a new unique field created
> > like "//". Always update it by deleting the
> > old one first and then adding the new one.
> 
> you mean delete, commit, add, commit? if you replace the record, simply
> submitting the new document and committing would do (of course, you must 
> ensure
> the value of the  uniqueKey field matches, so SOLR replaces the old doc).

Oh, I thought it ignored the new document in that case. Sure, then I
won't do the delete if it gets replaced anyway.

> > So to find out the highest
> > indexed UID for a mailbox just look it up using its unique field. For
> > finding the highest indexed UID for a user's all mailboxes do a single
> > query:
> > 
> >  - fl=highestuid
> >  - q=highestuid:[* TO *]
> >  - fq=user:
> 
> would it be faster to say q=user: AND highestuid:[ * TO *]  ?

Now that I've read again what fq really does, yes, it sounds like you're right.

> ( and i
> guess you'd sort DESC and return 1 record only).

No, I'd use the above to get the highestuid value for all mailboxes
(there should be only one record per mailbox, since each mailbox has
separate UID values and hence a separate highestuid value), so I can
look at the returned highestuid values to see which mailboxes aren't
fully indexed yet.




RE: Sorting and JVM heap size ....

2008-11-24 Thread souravm
Hi Yonik,

Thanks for the reply.

I have around 200M documents in index. The field I'm sorting on is a date 
string (containing date and time in dd-mmm-  hh:mm:yy format) and the field 
is part of the search criteria.

Also please note that the number of documents returned by the search criteria 
is much less than 200M. In fact even in case of 0 hit I found jvm out of memory 
exception.

Regards,
Sourav

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Monday, November 24, 2008 5:40 PM
To: solr-user@lucene.apache.org
Subject: Re: Sorting and JVM heap size 

On Mon, Nov 24, 2008 at 6:26 PM, souravm <[EMAIL PROTECTED]> wrote:
> I have indexed data of size around 20GB. My JVM memory is 1.5GB.
>
> For this data if I do a query with sort flag on (for a single field) I always 
> get java out of memory exception even if the number of hit is 0. With no 
> sorting (or default sorting with score) the query works perfectly fine.
>
> I can understand that JVM heap size can max out when the number of records 
> hit is high, but why this is happening even when number of records hit is 0 ?
>
> The same query with sort flag on does not give me problem till 1.5 GB of data.
>
> Any explanation ?

Sorting in lucene and solr uninverts the field and creates a
FieldCache entry the first time the sort is used.
How many documents are in your index, and what is the type of the
field you are sorting on?

-Yonik

 CAUTION - Disclaimer *
This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely 
for the use of the addressee(s). If you are not the intended recipient, please 
notify the sender by e-mail and delete the original message. Further, you are 
not 
to copy, disclose, or distribute this e-mail or its contents to any other 
person and 
any such actions are unlawful. This e-mail may contain viruses. Infosys has 
taken 
every reasonable precaution to minimize this risk, but is not liable for any 
damage 
you may sustain as a result of any virus in this e-mail. You should carry out 
your 
own virus checks before opening the e-mail or attachment. Infosys reserves the 
right to monitor and review the content of all messages sent to or from this 
e-mail 
address. Messages sent to or from this e-mail address may be stored on the 
Infosys e-mail system.
***INFOSYS End of Disclaimer INFOSYS***


Re: Sorting and JVM heap size ....

2008-11-24 Thread Yonik Seeley
On Mon, Nov 24, 2008 at 6:26 PM, souravm <[EMAIL PROTECTED]> wrote:
> I have indexed data of size around 20GB. My JVM memory is 1.5GB.
>
> For this data if I do a query with sort flag on (for a single field) I always 
> get java out of memory exception even if the number of hit is 0. With no 
> sorting (or default sorting with score) the query works perfectly fine.
>
> I can understand that JVM heap size can max out when the number of records 
> hit is high, but why this is happening even when number of records hit is 0 ?
>
> The same query with sort flag on does not give me problem till 1.5 GB of data.
>
> Any explanation ?

Sorting in lucene and solr uninverts the field and creates a
FieldCache entry the first time the sort is used.
How many documents are in your index, and what is the type of the
field you are sorting on?

-Yonik
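
As a rough sketch of why the OOM hits even with 0 results (estimates only; the FieldCache is sized by the whole index, not by the hit count):

```
FieldCache for a String sort field over a 200M-doc index:

  ord array:     200,000,000 docs x 4 bytes          ~ 0.8 GB
  value lookup:  one String object per unique sort value
                 (timestamps are mostly unique -> tens of millions
                  of Strings at dozens of bytes apiece)

Either term alone can exceed a 1.5 GB heap, and the cache is built
on the first sorted query before any hits are counted.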


unsubscribe me

2008-11-24 Thread Charlie Alswiti
 
 
Charlie Alswiti
eBusiness SW Development
NORTEL Information Technology
3500 Carling Avenue | Ottawa, Ontario | K2H-8E9
Tel +1 (613) 763 7371 ESN 393
[EMAIL PROTECTED]
 


Re: Querying Ranges Problem

2008-11-24 Thread Yonik Seeley
Ensure that the fieldType maps back to solr.SortableIntField rather
than solr.IntField

-Yonik

On Mon, Nov 24, 2008 at 5:19 PM, Jake Conk <[EMAIL PROTECTED]> wrote:
> I have the following query:
>
>
> q=(+thread_title_t:test OR +posts_t_ns_mv:test) AND locked_i:0 AND
> replies_i:[50 TO *]
>
>
> I have replies_i which is an integer field set to return me back
> documents that have a value 50 or greater but the problem is I'm
> getting back results with the replied_i field column with lesser than
> 50 results.
>
> I tried other things like "replies_i:[50 TO 1000]" but I'm still
> getting results with the replies_i field under 50.
>
> Am I doing this wrong or is my other party of the query somehow
> affecting the replies_i value? I tried removing the other part of the
> query but I still get unexpected results.
>
>
> Please help!
>
> Thanks,
>
> - Jake C.
>
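
For context: solr.IntField indexes the raw string, so range queries compare terms lexically ("100" sorts before "50"), which produces exactly the symptom Jake describes. The schema change is roughly this sketch (type/field names follow the example schema):

```xml
<!-- SortableIntField encodes ints so that term order == numeric order,
     making range queries and sorts behave numerically -->
<fieldType name="sint" class="solr.SortableIntField"
           sortMissingLast="true" omitNorms="true"/>

<!-- map the *_i dynamic fields onto the sortable type -->
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
```

A reindex is needed after the change, since the stored terms themselves are encoded differently.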


Re: Using Solr for indexing emails

2008-11-24 Thread Norberto Meijome
On Mon, 24 Nov 2008 20:21:17 +0200
Timo Sirainen <[EMAIL PROTECTED]> wrote:

> I think I gave enough reasons above for why I don't like this
> solution. :) I also don't like adding new shared global state databases
> just for Solr. Solr should be the one shared global state database..

fair enough - it makes more sense to me now :)

[...]
> Store the per-mailbox highest indexed UID in a new unique field created
> like "//". Always update it by deleting the
> old one first and then adding the new one.

You mean delete, commit, add, commit? If you replace the record, simply
submitting the new document and committing would do (of course, you must ensure
the value of the uniqueKey field matches, so Solr replaces the old doc).
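
In Solr's XML update format that looks like this (assuming the schema's uniqueKey is "id"; the key and value shown are made up), a single add plus commit suffices and the new doc silently replaces the old one:

```xml
<add>
  <doc>
    <!-- same uniqueKey value as the previous doc, so Solr overwrites it -->
    <field name="id">user1/INBOX/highestuid</field>
    <field name="highestuid">4711</field>
  </doc>
</add>
```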

> So to find out the highest
> indexed UID for a mailbox just look it up using its unique field. For
> finding the highest indexed UID for a user's all mailboxes do a single
> query:
> 
>  - fl=highestuid
>  - q=highestuid:[* TO *]
>  - fq=user:

would it be faster to say q=user: AND highestuid:[ * TO *]  ?  ( and i
guess you'd sort DESC and return 1 record only).

> If messages are being simultaneously indexed by multiple processes the
> highest-uid value may sometimes (rarely) be set too low, but that
> doesn't matter. The next search will try to re-add some of the messages
> that were already in index, but because they'll have the same unique IDs
> than what already exists they won't get added again. The highest-uid
> gets updated and all is well.

B
_
{Beto|Norberto|Numard} Meijome

Mind over matter: if you don't mind, it doesn't matter

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


RE: facet sort by ranking

2008-11-24 Thread Chris Hostetter

: We having 100 category and each category having it own internal ranking.
: Let consider if I search for any product and its fall under 30 categories
: and we are showing top 10 categories in filter so that user can filter there
: results.

the faceting solr provides out of the box is explicitly called 
"SimpleFaceting" because it's designed to handle the "simple" cases :)  
what you are describing is applying specific, subjective, business logic 
and isn't covered in the SimpleFacets implementation.

if your category rankings never changed you can embed the ranking as part 
of the category name, and then tell solr not to sort the facets by count 
(ie: facet.sort=false) so 001_catZ will appear first, then 002_catM, 
etc...
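
With the rank encoded in the value itself, the request side is just plain faceting (the field name "category" is illustrative). facet.sort=false returns constraints in index order, i.e. lexically, which is what makes the 001_/002_ prefixes come out first:

```
/select?q=*:*&facet=true&facet.field=category&facet.sort=false
```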

if your category rankings can/will change at arbitrary points in time, and 
you don't want to have to reindex, you'll need to implement a custom 
Facet Component that knows about (how to find) your rankings.  I've talked 
about something like this at CNET in two previous ApacheCon talks...

http://people.apache.org/~hossman/apachecon2006us/
http://people.apache.org/~hossman/apachecon2008us/btb/


-Hoss



Re: a question about solr phrasequery process.

2008-11-24 Thread finy finy
thanks for your help
***
set the position increment of all of the
tokens you get to 0... this should have the effect of an "OR".
***
how to do this? in solr solrconfig.xml? or schema.xml?
please help me.

2008/11/24 Yonik Seeley <[EMAIL PROTECTED]>

> 2008/11/24 finy finy <[EMAIL PROTECTED]>:
> > maybe you have make a misunderstanding about what i say,
> >
> > another example:
> >
> > the keyword "oneworldonedream" in myself analyzer will be  analyzed about
> > four token:
> > one world one dream
>
> OK, this is just the way the Lucene and Solr query parsers work...
> multiple tokens from the analyzer of a single "parser level" token are
> normally treated as a phrase query.
>
> What you could try to do is set the position increment of all of the
> tokens you get to 0... this should have the effect of an "OR".
>
> -Yonik
>
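
To answer finy's question concretely: it's neither solrconfig.xml nor schema.xml alone; the position increment is set in Java, in a token filter that you then register through a factory in schema.xml. A rough, untested sketch against the Lucene 2.x-era TokenFilter API (class name is made up):

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Forces every token after the first to a position increment of 0,
 *  so the query parser treats the tokens as alternatives ("OR")
 *  instead of building a phrase query from them. */
public class ZeroPositionIncrementFilter extends TokenFilter {
    private boolean first = true;

    public ZeroPositionIncrementFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t != null && !first) {
            t.setPositionIncrement(0);
        }
        first = false;
        return t;
    }
}
```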


Re: Solr Core Admin

2008-11-24 Thread Chris Hostetter

Jeff: would you mind opening a Jira issue about this?

at a minimum there should be a better error message -- but i think we could 
probably also improve the way the file is updated to not require directory 
write perms.


: Date: Thu, 20 Nov 2008 08:59:26 -0800
: From: Jeff Newburn <[EMAIL PROTECTED]>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: Solr Core Admin
: 
: Ok just FYI solr replaces the file instead of editing.  This means that the
: webserver needs permissions in the directory to delete and create the
: solr.xml file.  Once I fixed that it no longer gave IOException errors.
: 
: 
: On 11/20/08 8:29 AM, "Jeff Newburn" <[EMAIL PROTECTED]> wrote:
: 
: > I am trying to use the api for the solr cores.  Reload works great but when
: > I try to UNLOAD I get a massive exception in IOException.  It seems to
: > unload the module but doesn't remove it from the configuration file.  The
: > solr.xml file is full read and write but still errors.  Any ideas?
: > 
: > Solr.xml
: > solr persistent="true">
: > 
: >   
: >   
: > 
: > 
: >   
: > 
: > 
: > Exception:
: > HTTP Status 500 - java.io.IOException: Permission denied
: > org.apache.solr.common.SolrException: java.io.IOException: Permission denied
: > at org.apache.solr.core.CoreContainer.persistFile(CoreContainer.java:585)
: > at org.apache.solr.core.CoreContainer.persist(CoreContainer.java:554)
: > at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:200)
: > at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
: > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1287)
: > at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
: > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
: > at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Unknown Source)
: > at org.apache.catalina.core.ApplicationFilterChain.doFilter(Unknown Source)
: > at org.apache.catalina.core.StandardWrapperValve.invoke(Unknown Source)
: > at org.apache.catalina.core.StandardContextValve.invoke(Unknown Source)
: > at org.apache.catalina.core.StandardHostValve.invoke(Unknown Source)
: > at org.apache.catalina.valves.ErrorReportValve.invoke(Unknown Source)
: > at org.apache.catalina.core.StandardEngineValve.invoke(Unknown Source)
: > at org.apache.catalina.connector.CoyoteAdapter.service(Unknown Source)
: > at org.apache.coyote.http11.Http11Processor.process(Unknown Source)
: > at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Unknown Source)
: > at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(Unknown Source)
: > at java.lang.Thread.run(Thread.java:619)
: > Caused by: java.io.IOException: Permission denied
: > at java.io.UnixFileSystem.createFileExclusively(Native Method)
: > at java.io.File.checkAndCreate(File.java:1704)
: > at java.io.File.createTempFile(File.java:1793)
: > at org.apache.solr.core.CoreContainer.persistFile(CoreContainer.java:565)
: > ... 18 more
: 



-Hoss


Re: Boosting by field contents

2008-11-24 Thread Chris Hostetter
:   text field:value
: 
: I want to return all documents with 'text'. Documents where 'field = value'
: boosted over documents where 'field = some other value'.
: 
: This query does it:
: 
:   (text field:value)^100 (text -field:value)
: 
: Is there a simpler way?  (I might have left out crucial AND/OR clauses, but
: you get the picture.)

Unless i'm *drastically* misunderstanding your question, what you are 
asking for is...

+text field:value

docs must match "text" if they also match "field:value" they will score 
higher.

depending on the finer points of how exactly you want the scores computed, 
you might want something like...

+text field:value^1
...or...
+text^10 field:value

...but in either case it's just a question of scoring, both queries will 
require that "text" match.


-Hoss



Re: Question about dismax 'mm' - give boost to searches by location

2008-11-24 Thread Chris Hostetter

: of those 4 words. So whats happening is last will and testament from all
: states are returned although user specifically asked for florida will. I
: don't want to alter the 'mm' either because its working fine for other
: searches. Just for the search terms with a 'location' , i want to be able to
: match all words. Any easy way to do this? Someone please?

Solr doesn't really have any way to know when something is a location or 
not ... if you have a specific location field, you can give it a very high 
boost value in the qf, and then (in theory) searches that include words 
which match on a location will score much higher than results in other 
locations.  But the other way to go would be to preprocess the user's query 
string, detect when words look like the name of a location, and remove 
them from the main query and treat them as an "fq" instead.
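
A minimal sketch of that preprocessing idea, assuming a hard-coded location list and an indexed field named "state" (both assumptions; a real version would use a gazetteer or a locations core):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Pulls known location names out of a user query and turns them into a
 *  Solr fq, leaving the remaining words as the main dismax q. */
public class LocationQueryRewriter {
    // Assumed location vocabulary; replace with a real gazetteer.
    private static final Set<String> LOCATIONS =
        new HashSet<>(Arrays.asList("florida", "texas", "ohio"));

    /** Returns {q, fq}; fq is null when no location word was found. */
    public static String[] rewrite(String userQuery) {
        List<String> keep = new ArrayList<>();
        List<String> locs = new ArrayList<>();
        for (String w : userQuery.trim().split("\\s+")) {
            if (LOCATIONS.contains(w.toLowerCase())) {
                locs.add(w.toLowerCase());   // route to the filter query
            } else {
                keep.add(w);                 // stays in the main query
            }
        }
        String fq = locs.isEmpty()
            ? null
            : "state:(" + String.join(" ", locs) + ")";
        return new String[] { String.join(" ", keep), fq };
    }
}
```

So "last will and testament florida" becomes q=last will and testament with fq=state:(florida), and mm keeps working unchanged on the remaining terms.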




-Hoss



Re: solr.WordDelimiterFilterFactory

2008-11-24 Thread Chris Hostetter

: I'm trying to index some content that has things like 'java/J2EE' but with
: solr.WordDelimiterFilterFactory and parameters [generateWordParts="1"
: generateNumberParts="0" catenateWords="0" catenateNumbers="0"
: catenateAll="0" splitOnCaseChange="0"] this ends up tokenized as
: 'java','j','2',EE'
: 
: Does anyone know a way of having this tokenized as 'java','j2ee'.

WDF was really designed around the assumption that if java/j2ee was 
something like a product sku, people might query it as javaj2ee 
or java-j2ee or java-j-2-ee, or java/j2-ee etc...

for more generic text, you may want to use a tokenizer that splits on "/" 
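
For example, something along these lines in schema.xml (the pattern is an assumption; it splits on runs of whitespace or slashes, so "java/J2EE" comes out as "java" and "j2ee" after lowercasing):

```xml
<fieldType name="text_slash" class="solr.TextField">
  <analyzer>
    <!-- split on whitespace or "/" instead of letting WDF chop j2ee apart -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s/]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```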




-Hoss



Re: Phrase query search with stopwords

2008-11-24 Thread Chris Hostetter

: Subject: Phrase query search with stopwords
: In-Reply-To: <[EMAIL PROTECTED]>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking




-Hoss



Re: Please Help !! Question about Query Phrase Slop (qs) in dismax

2008-11-24 Thread Chris Hostetter

: Subject: Re: Please Help !! Question about Query Phrase Slop (qs) in dismax
: 
: 
: Please help someone...i've been waiting for an answer for the last couple of
: days & no one seems to be helping out here. I did search the wiki & this

Please don't send messages like this.  

This is a volunteer community -- no one (that I know of) is paid to 
read/reply to questions on the solr-user list.  Many of us do our best to 
make sure that all user questions get addressed, but this is a fairly high 
volume list, and sometimes other things in life (work, health, 
relationships, family, etc...) make that take a little longer than we 
would like -- sometimes questions don't get answered for a few days, it's 
just the way it is, please be patient.  Sending multiple "please help, 
still no reply" type messages just adds noise to the list, and give people 
who *do* want to help more to read which means it takes that much longer 
to actually reply.

If you need an answer to a question in a hurry: read the archives and the 
docs, experiment, read the code (if you know java), or hire a consultant 
to help you figure it out.

In this specific case, debugQuery=true would have quickly shown you that 
your qs=5 value wasn't making its way into the "parsedquery" at all, 
which might have helped you understand what was happening.
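
For reference, the check is a single extra parameter (the URL assumes the stock example-server setup); the parsedquery section of the response shows whether the slop made it into the generated PhraseQuery:

```
http://localhost:8983/solr/select?qt=dismax&q=%22child+custody%22&qs=5&debugQuery=true
```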



-Hoss



Re: Question about Query Phrase Slop (qs) in dismax

2008-11-24 Thread Chris Hostetter

: >From the solr wiki, it sounded like if qs is set to 5 for example, & if the
: search term is 'child custody', only docs with 'child' & 'custody' within 5
: words of one another would be returned in results. Is this correct? If so,

No.  as explained on the wiki...

>> Amount of slop on phrase queries explicitly included in the 
>> user's query string

note the "explicitly included" part ... if the query string doesn't 
contain any quotation marks, 'qs' isn't used at all.  (as opposed to 'ps' 
which is "Amount of slop on phrase queries built for 'pf' fields")

in a query like this...

   q=child+custody&qs=5&qf=...

...the 'qs' is ignored.  if you want to require that the input words all 
appear within a set slop of each other (in at least one 'qf' field) you 
need to quote the users input...

   q="child+custody"&qs=5&qf=...
  
: in bad user experience as those docs are not so relevant. What more could i
: do to improve quality in the results?

use 'pf' with very high boosts (compared to the 'qf' boosts) so that phrse 
matching docs appear before non phrase matching docs.



-Hoss



Sorting and JVM heap size ....

2008-11-24 Thread souravm
Hi,

I have indexed data of size around 20GB. My JVM memory is 1.5GB.

For this data if I do a query with sort flag on (for a single field) I always 
get java out of memory exception even if the number of hit is 0. With no 
sorting (or default sorting with score) the query works perfectly fine.

I can understand that JVM heap size can max out when the number of records hit 
is high, but why this is happening even when number of records hit is 0 ?

The same query with sort flag on does not give me problem till 1.5 GB of data.

Any explanation ?

Regards,
Sourav



Re: Query for Distributed search -

2008-11-24 Thread Chris Hostetter

: Subject: Query for Distributed search -
: In-Reply-To: <[EMAIL PROTECTED]>



http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss



Schema Design Guidance

2008-11-24 Thread Vimal Jobanputra
Hi, and apologies in advance for the lengthy question!

I'm looking to use Solr to power searching & browsing over a large set of
product data stored in a relational db. I'm wondering what the most
appropriate schema design strategy to use is. A simplified view of the
relational data is:

Shop (~1000 rows)
-Id*
-Name

Product (~300,000 rows)
-Id*
-Name
-Availability

ProductFormat (~5 rows)
-Id*
-Name

Component (part of a product that may be sold separately) (~4,000,000 rows)
-Id*
-Name

ProductComponent (~4,000,000 rows)
-ProductId*
-ComponentId*

ShopProduct (~6,000,000 rows)
-ShopId*
-ProductId*
-ProductFormatId*
-AvailableDate

ShopProductPriceList (~15,000,000 rows)
-ShopId*
-ProductId*
-ProductFormatId*
-Applicability (Component/Product)*
-Type (Regular/SalePrice)*
-Amount

* logical primary key

This means:
-availability of a product differs from shop to shop
-the price of a product or component is dependent on the format, and also
differs from shop to shop

Textual searching is required over product & component names, and filtering
is required over Shops, Product Availability, Formats, & Prices.

The simplest approach would be to flatten out the data completely (1 solr
document per ShopProduct and ShopProductComponent). This would result in
~80 million documents, which I'm guessing would need some form of
sharding/distribution.
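
For concreteness, the fully-flattened variant would index one document per shop/product/format combination, shaped roughly like this (all field names and values invented):

```xml
<doc>
  <field name="id">shop42_prod17_fmt2</field>
  <field name="shop_id">42</field>
  <field name="product_name">Example Product</field>
  <field name="component_name">Example Component</field>
  <field name="format">Download</field>
  <field name="available_date">2008-11-01T00:00:00Z</field>
  <field name="regular_price">9.99</field>
  <field name="sale_price">7.99</field>
</doc>
```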

An alternate approach would be to construct one document per Product, and
*nest* the relational data via dynamic fields (and possibly plugins?)
Eg one document per Product; multi-value fields for ProductComponent & Shop;
dynamic fields for Availability/Format, using ShopId as part of the field
name.
This approach would result in far fewer documents (400,000), but more
complex queries. It would also require extending Solr/Lucene to search over
ProductComponents and filter by price, which I'm not quite clear on as
yet...

Any guidance on which of the two general approaches (or others) to explore
further?

Thanks!
Vim


Re: Phrase query search with stopwords

2008-11-24 Thread Robert Haschart

Yonik,

I did make sure enablePositionIncrements="true" for both indexing and 
queries, and just did a test where I re-indexed a couple of test record 
sets and submitted a query from the solr admin page, this time 
searching for title_text:"gone with the wind", which should return 
three hits, and again it returns 0 hits.


I also tried modifying SolrQueryParser to call 
setEnablePositionIncrements(true), thinking that would fix the problem, 
but it doesn't seem to.



-Bob


Yonik Seeley wrote:


Robert,

I've reproduced (sort of) this bad behavior with the example schema.
There was an example configuration "bug" introduced in SOLR-521
where enablePositionIncrements="true" was only set on the index
analyzer but not the query analyzer for the "text" fieldType.

A query on the example data of
features:"Optimized for High Volume Web Traffic"
will not match any documents.

You seem to indicate that enablePositionIncrements="true" is set for
both your index and query analyzer.  Can you verify that, and verify
that you restarted solr and reindexed after that change was made?

-Yonik



On Thu, Nov 20, 2008 at 1:30 PM, Robert Haschart <[EMAIL PROTECTED]> wrote:
 


Greetings all,

I'm having trouble tracking down why a particular query is not working.   A
user is trying to do a search for alternate_form_title_text:"three films by
louis malle"  specifically to find the 4 records that contain the phrase
"Three films by Louis Malle" in their alternate_form_title_text field.
However the search return 0 records.

The modified searches:

alternate_form_title_text:"three films by louis malle"~1

or

alternate_form_title_text:"three films" AND alternate_form_title_text:"louis
malle"

both return the 4 records.   So it seems that it is the word "by", which is
listed in the stopword filter list, that is causing the problem.

The analyzer/filter sequence for indexing the alternate_form_title_text
field is _almost_ exactly the same as the sequence for querying that field.

for indexing the sequence is:

org.apache.solr.analysis.HTMLStripWhitespaceTokenizerFactory   {}
schema.UnicodeNormalizationFilterFactory {composed=false,
remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
schema.CJKFilterFactory   {bigrams=false}
org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
ignoreCase=true, enablePositionIncrements=true}
org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1,
catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
org.apache.solr.analysis.LowerCaseFilterFactory   {}
org.apache.solr.analysis.EnglishPorterFilterFactory
{protected=protwords.txt}
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}

for querying the sequence is:

org.apache.solr.analysis.WhitespaceTokenizerFactory   {}
schema.UnicodeNormalizationFilterFactory {composed=false,
remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
schema.CJKFilterFactory   {bigrams=false}
org.apache.solr.analysis.SynonymFilterFactory   {synonyms=synonyms.txt,
expand=true, ignoreCase=true}
org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
ignoreCase=true, enablePositionIncrements=true}
org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1,
catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
org.apache.solr.analysis.LowerCaseFilterFactory   {}
org.apache.solr.analysis.EnglishPorterFilterFactory
{protected=protwords.txt}
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}


If I run a test through the field analysis admin page, submitting the
string *three films by louis malle* through both the Field value (Index) and
the Field value (query), the results (shown below) seem to indicate the
query ought to find the 4 records in question, but it does not, and I'm at a
loss to explain why.


   Index Analyzer

term position     1      2      4      5
term text         three  film   loui   mall
term type         word   word   word   word
source start,end  0,5    6,11   15,20  21,26


   Query Analyzer

term position     1      2      4      5
term text         three  film   loui   mall
term type         word   word   word   word
source start,end  0,5    6,11   15,20  21,26
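
For anyone hitting the same wall later: per SOLR-521 the stop filter's enablePositionIncrements setting has to match on both analyzers, along these lines (a trimmed sketch, not Robert's full filter chain):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true" enablePositionIncrements="true"/>
  </analyzer>
  <analyzer type="query">
    <!-- must mirror the index-side setting or stopped phrases won't match -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>
```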




   






Querying Ranges Problem

2008-11-24 Thread Jake Conk
I have the following query:


q=(+thread_title_t:test OR +posts_t_ns_mv:test) AND locked_i:0 AND
replies_i:[50 TO *]


I have replies_i which is an integer field set to return me back
documents that have a value 50 or greater but the problem is I'm
getting back results with the replied_i field column with lesser than
50 results.

I tried other things like "replies_i:[50 TO 1000]" but I'm still
getting results with the replies_i field under 50.

Am I doing this wrong or is my other party of the query somehow
affecting the replies_i value? I tried removing the other part of the
query but I still get unexpected results.


Please help!

Thanks,

- Jake C.


broken socket in Jetty causing invalid XML ?

2008-11-24 Thread Anoop Bhatti
Hello Solr Community,

I'm getting the stracktrace below when adding docs using the
CommonsHttpSolrServer.add(Collection
docs)
method.  The server doesn't seem to be able to recover from this error.
We are adding a collection with 1,000 SolrInputDocument's at a time.
I'm using Solr 1.3.0 and Java 1.6.0_07.

It seems that this problem occurs in Jetty when the TCP connection is
broken while the stream (from the add(...) method) is being read.  The
XML read from the broken stream is not valid.  Is this a correct
diagnosis?

Could this stacktrace be occurring when the max POST size has been
exceeded?  I'm referring to the example/etc/jetty.xml file, which has
the setting:
 

  org.mortbay.jetty.Request.maxFormContentSize
  100

Right now maxFormContentSize is set to the default 1 MB on my server.

Also, in some cases I have two clients, could Jetty be blocking one
client and causing it to finally timeout?

This stacktrace doesn't happen right away; it occurs once the Lucene
indexes are about 30 GB.
Could the periodic merging of segments be the culprit?

I was also thinking that the problem could be with writing back the
response (the UpdateRequest).

Here's the gist of my Java client code:

CommonsHttpSolrServer solrServer = new CommonsHttpSolrServer(solrServerURL);
solrServer.setConnectionTimeout(100);
solrServer.setDefaultMaxConnectionsPerHost(100);
solrServer.setMaxTotalConnections(100);
solrServer.add(solrDocs); //the collection of docs
solrServer.commit();
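
Not a fix for the root cause, but one way to survive transient mid-stream disconnects is to retry the batch; adds are idempotent given a uniqueKey, so re-sending a partially accepted batch is safe. A generic sketch (backoff values are arbitrary):

```java
import java.util.concurrent.Callable;

/** Minimal bounded retry with linear backoff for flaky batch operations.
 *  maxAttempts must be >= 1. */
public class Retry {
    public static <T> T withRetries(Callable<T> op, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;                          // remember the failure
                if (attempt < maxAttempts) {
                    Thread.sleep(100L * attempt);  // linear backoff before retrying
                }
            }
        }
        throw last;  // all attempts failed
    }
}
```

Usage would be wrapping the add call, e.g. Retry.withRetries(() -> solrServer.add(solrDocs), 3).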

And here's the stacktrace:

Nov 20, 2008 5:25:33 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2.2}
status=0 QTime=469
Nov 20, 2008 5:25:37 PM org.apache.solr.common.SolrException log
SEVERE: com.ctc.wstx.exc.WstxIOException: null
at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
at 
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321)
at 
org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
at 
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.mortbay.jetty.EofException
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:302)
at 
org.mortbay.jetty.HttpParser$Input.blockForContent(HttpParser.java:919)
at org.mortbay.jetty.HttpParser$Input.read(HttpParser.java:897)
at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:411)
at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
at 
com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
at 
com.ctc.wstx.sr.StreamScanner.loadMoreFromCur

Re: [VOTE] Community Logo Preferences

2008-11-24 Thread Rob Casson
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg

thanks to everyone who contributed,
rob

On Sun, Nov 23, 2008 at 11:59 AM, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> Please submit your preferences for the solr logo.
>
> For full voting details, see:
>  http://wiki.apache.org/solr/LogoContest#Voting
>
> The eligible logos are:
>  http://people.apache.org/~ryan/solr-logo-options.html
>
> Any and all members of the Solr community are encouraged to reply to this
> thread and list (up to) 5 ranked choices by listing the Jira attachment
> URLs. Votes will be assigned a point value based on rank. For each vote, 1st
> choice has a point value of 5, 5th place has a point value of 1, and all
> others follow a similar pattern.
>
> https://issues.apache.org/jira/secure/attachment/12345/yourfrstchoice.jpg
> https://issues.apache.org/jira/secure/attachment/34567/yoursecondchoice.jpg
> ...
>
> This poll will be open until Wednesday November 26th, 2008 @ 11:59PM GMT
>
> When the poll is complete, the solr committers will tally the community
> preferences and take a final vote on the logo.
>
> A big thanks to everyone who submitted possible logos -- it's great to see
> so many good options.


Re: Phrase query search with stopwords

2008-11-24 Thread Yonik Seeley
Robert,

I've reproduced (sort of) this bad behavior with the example schema.
There was an example configuration "bug" introduced in SOLR-521
where enablePositionIncrements="true" was only set on the index
analyzer but not the query analyzer for the "text" fieldType.

A query on the example data of
features:"Optimized for High Volume Web Traffic"
will not match any documents.

You seem to indicate that enablePositionIncrements="true" is set for
both your index and query analyzer.  Can you verify that, and verify
that you restarted solr and reindexed after that change was made?

-Yonik



On Thu, Nov 20, 2008 at 1:30 PM, Robert Haschart <[EMAIL PROTECTED]> wrote:
> Greetings all,
>
> I'm having trouble tracking down why a particular query is not working.   A
> user is trying to do a search for alternate_form_title_text:"three films by
> louis malle"  specifically to find the 4 records that contain the phrase
> "Three films by Louis Malle" in their alternate_form_title_text field.
> However, the search returns 0 records.
>
> The modified searches:
>
> alternate_form_title_text:"three films by louis malle"~1
>
> or
>
> alternate_form_title_text:"three films" AND alternate_form_title_text:"louis
> malle"
>
> both return the 4 records.   So it seems that the word "by", which is
> listed in the stopword filter list, is causing the problem.
>
> The analyzer/filter sequence for indexing the alternate_form_title_text
> field is _almost_ exactly the same as the sequence for querying that field.
>
> for indexing the sequence is:
>
> org.apache.solr.analysis.HTMLStripWhitespaceTokenizerFactory   {}
> schema.UnicodeNormalizationFilterFactory {composed=false,
> remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
> schema.CJKFilterFactory   {bigrams=false}
> org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
> ignoreCase=true, enablePositionIncrements=true}
> org.apache.solr.analysis.WordDelimiterFilterFactory   {generateNumberParts=1,
> catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
> org.apache.solr.analysis.LowerCaseFilterFactory   {}
> org.apache.solr.analysis.EnglishPorterFilterFactory
> {protected=protwords.txt}
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>
> for querying the sequence is:
>
> org.apache.solr.analysis.WhitespaceTokenizerFactory   {}
> schema.UnicodeNormalizationFilterFactory {composed=false,
> remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
> schema.CJKFilterFactory   {bigrams=false}
> org.apache.solr.analysis.SynonymFilterFactory   {synonyms=synonyms.txt,
> expand=true, ignoreCase=true}
> org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
> ignoreCase=true, enablePositionIncrements=true}
> org.apache.solr.analysis.WordDelimiterFilterFactory   {generateNumberParts=1,
> catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
> org.apache.solr.analysis.LowerCaseFilterFactory   {}
> org.apache.solr.analysis.EnglishPorterFilterFactory
> {protected=protwords.txt}
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>
>
> If I run a test through the field analysis admin page, submitting the
> string *three films by louis malle* through both the Field value (Index) and
> the Field value (Query), the results (shown below) seem to indicate that the
> query ought to find the 4 records in question, but it does not, and I'm at a
> loss to explain why.
>
>
> Index Analyzer
>
> term position       1      2      4      5
> term text           three  film   loui   mall
> term type           word   word   word   word
> source start,end    0,5    6,11   15,20  21,26
>
>
>
> Query Analyzer
>
> term position       1      2      4      5
> term text           three  film   loui   mall
> term type           word   word   word   word
> source start,end    0,5    6,11   15,20  21,26
>
>
>
>
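The position tables above (terms at positions 1, 2, 4, 5 because "by" is stopped out) are the heart of the problem Yonik describes in his reply: if the index side leaves the stopword hole in the positions but the query side closes it, the phrase no longer lines up. A toy sketch of that effect -- plain Java rather than Lucene, with the matching rule simplified to "same terms at the same relative positions":

```java
import java.util.*;

public class StopwordPositions {
    // Tokenize on whitespace and drop stopwords. With keepGaps=true the
    // position counter still advances past a dropped word, leaving a hole
    // (the enablePositionIncrements=true behaviour); with keepGaps=false
    // the following terms slide left to fill it.
    public static Map<Integer, String> analyze(String text, Set<String> stop, boolean keepGaps) {
        Map<Integer, String> out = new LinkedHashMap<>();
        int pos = 0;
        for (String t : text.toLowerCase().split("\\s+")) {
            pos++;
            if (stop.contains(t)) {
                if (!keepGaps) pos--;   // collapse the hole
                continue;
            }
            out.put(pos, t);
        }
        return out;
    }

    // Toy phrase match: same terms at the same relative positions.
    public static boolean phraseMatch(Map<Integer, String> indexed, Map<Integer, String> query) {
        return new ArrayList<>(indexed.values()).equals(new ArrayList<>(query.values()))
            && relative(indexed).equals(relative(query));
    }

    private static List<Integer> relative(Map<Integer, String> m) {
        List<Integer> rel = new ArrayList<>();
        int base = m.keySet().iterator().next();
        for (int p : m.keySet()) rel.add(p - base);
        return rel;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("by");
        Map<Integer, String> idx = analyze("three films by louis malle", stop, true);
        System.out.println(idx);   // {1=three, 2=films, 4=louis, 5=malle}
        // Gaps on both sides: match.  Gap only on the index side: no match.
        System.out.println(phraseMatch(idx, analyze("three films by louis malle", stop, true)));
        System.out.println(phraseMatch(idx, analyze("three films by louis malle", stop, false)));
    }
}
```

This reproduces the symptom in the thread: the index analyzer keeps the increments but the query side (before the fix) does not, so the exact phrase finds nothing while the sloppy `~1` variant succeeds.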


Re: port of Nutch CommonGrams to Solr for help with slow phrase queries

2008-11-24 Thread Walter Underwood
This technique was used at Infoseek in 1996, and is very effective.

It also gives a relevance improvement, because you have an estimate
of IDF for phrases (exact for two-word phrases). The terms "the" and
"who" will be very common, but "the who" is quite rare and will have
a big IDF.

wunder

On 11/24/08 10:31 AM, "Burton-West, Tom" <[EMAIL PROTECTED]> wrote:

> Hello all,
> 
> We are having problems with extremely slow phrase queries when the
> phrase query contains common words. We are reluctant to just use stop
> words due to various problems with false hits and some things becoming
> impossible to search with stop words turned on. (For example "to be or
> not to be", "the who", "man in the moon" vs "man on the moon" etc.)
> 
> The approach to this problem used by Nutch looks promising.  Has anyone
> ported the Nutch CommonGrams filter to Solr?
> 
> "Construct n-grams for frequently occurring terms and phrases while
> indexing. Optimize phrase queries to use the n-grams. Single terms are
> still indexed too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html
> 
> 
> Tom
> 
> Tom Burton-West
> Information Retrieval Programmer
> Digital Library Production Services
> University of Michigan Library



port of Nutch CommonGrams to Solr for help with slow phrase queries

2008-11-24 Thread Burton-West, Tom
Hello all,

We are having problems with extremely slow phrase queries when the
phrase query contains common words. We are reluctant to just use stop
words due to various problems with false hits and some things becoming
impossible to search with stop words turned on. (For example "to be or
not to be", "the who", "man in the moon" vs "man on the moon" etc.)

The approach to this problem used by Nutch looks promising.  Has anyone
ported the Nutch CommonGrams filter to Solr?

"Construct n-grams for frequently occurring terms and phrases while
indexing. Optimize phrase queries to use the n-grams. Single terms are
still indexed too, with n-grams overlaid."
http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html


Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Services
University of Michigan Library
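The CommonGrams idea can be sketched in a few lines without Lucene: alongside each single term, emit a joined bigram whenever either member of a pair is a common word; a phrase query can then be rewritten to hit the rare bigram ("the_who") instead of walking the enormous postings list for "the". The "_" joiner and the exact emission rule here are simplifying assumptions for illustration, not Nutch's actual implementation:

```java
import java.util.*;

public class CommonGramsSketch {
    // Emit each original token, plus a joined bigram whenever either
    // member of the adjacent pair is in the common-words set.
    public static List<String> grams(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()
                && (common.contains(tokens.get(i)) || common.contains(tokens.get(i + 1)))) {
                out.add(tokens.get(i) + "_" + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = Set.of("the", "who", "in", "on", "to", "be", "or", "not");
        System.out.println(grams(List.of("the", "who"), common));
        // [the, the_who, who]
        System.out.println(grams(List.of("man", "in", "the", "moon"), common));
        // [man, man_in, in, in_the, the, the_moon, moon]
    }
}
```

Because the single terms are still indexed, ordinary term queries keep working; only phrase queries over common words need rewriting to the n-grams. This is also why Walter's IDF point holds: "the_who" gets its own (very small) document frequency.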


Re: Using Solr for indexing emails

2008-11-24 Thread Timo Sirainen
On Mon, 2008-11-24 at 14:25 +1100, Norberto Meijome wrote:
> > The main problem is that before doing the search, I first have to check
> > if there are any unindexed messages and then add them to Solr. This is
> > done using a query like:
> >  - fl=uid
> >  - rows=1
> >  - sort=uid desc
> >  - q=uidv: box: user:
> 
> So, if I understand correctly, the process is :
> 
> 1. user sends search query Q to search interface
> 2. interface checks highest indexed uidv in SOLR
> 3. checks in IMAP store for mailbox if there are any objects ('emails') newer
> than uidv from 2.
> 4. anything found in 3. is processed, submitted to SOLR, committed.
> 5. interface submits search query Q to index, gets results
> 6. results are presented / returned to user

Right. Except "uid", not "uidv" (uidv =  = basically
 and  uniquely identifies a mailbox between
recreations/renames).

> It strikes me that this may work ok in some situations but may not scale. I
> would decouple the {find new documents / submit / commit } process from the
> { search / presentation} layer - SPECIALLY if you plan to have several
> mailboxes in play now.

The idea was that not all users are searching their mails, especially in
all mailboxes, so there's no point in wasting CPU and disk space on
indexing messages that are never used.

Also nothing prevents the administrator from configuring the kind of a
setup where message indexing is done on the background for all new
messages. But even if this is done, the search *must* find all the
messages that were added recently (even 1 second ago). So this kind of a
check before searching is still a requirement.

Also I hate all kinds of potential desynchronization issues. For example
if Dovecot relied on message saving to add the message to Solr
immediately there wouldn't need to be a way to do the "check what's
missing query". But this kind of a setup breaks easily if

a) Mail delivery crashes in the middle (or power is lost) between saving
message and indexing it to Solr. Now searching Solr will never find the
message.

b) Solr server breaks (e.g. hardware) and the latest changes get lost.
Since only new messages are indexed, you now have a lot of messages that
can never be searched.

Having separate nightly runs of "check what mails aren't indexed" would
work, but as the number of users increases this check becomes longer
and longer. There are installations that have millions of mailboxes..
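The crash-safety argument above boils down to a simple idempotent check: keep (or query) the highest UID Solr has seen per mailbox, and re-submit anything in the IMAP store above that watermark. Because UIDs only ascend and Solr drops duplicates on the unique key, re-running the check after a crash is harmless. A hedged sketch of just that bookkeeping (the data is made up):

```java
import java.util.*;

public class MissingUids {
    // Given the highest UID already indexed for a mailbox (0 if none) and
    // the UIDs currently in the IMAP store, return the UIDs still to index.
    public static List<Integer> missing(int highestIndexed, List<Integer> storeUids) {
        List<Integer> out = new ArrayList<>();
        for (int uid : storeUids)
            if (uid > highestIndexed) out.add(uid);
        return out;
    }

    public static void main(String[] args) {
        // Mailbox holds messages 1..7; Solr has seen up to UID 5.
        System.out.println(missing(5, List.of(1, 2, 3, 4, 5, 6, 7)));  // [6, 7]
        // Crash between save and index: re-running the check still finds them.
        System.out.println(missing(0, List.of(10, 11)));               // [10, 11]
    }
}
```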

> > So it returns the highest IMAP UID field (which is an always-ascending
> > integer) for the given mailbox (you can ignore the uidvalidity). I can
> > then add all messages with higher UIDs to Solr before doing the actual
> > search.
> > 
> > When searching multiple mailboxes the above query would have to be sent
> > to every mailbox separately. 
> 
> hmm...not sure what you mean by "query would have to be sent to every
> MAILBOX" ... 

I meant that for each mailbox that needs to be checked a separate Solr
query would have to be sent.

> > That really doesn't seem like the best
> > solution, especially when there are a lot of mailboxes. But I don't
> > think Solr has a way to return "highest uid field for each
> > box:"?
> 
> hmmm... maybe you can use facets on 'box' ... ? though you'd still have to
> query for each box, i think...

I see a lot of detailed documentation about facets in the wiki, but they
didn't really help me understand what the facets are all about.. The
"fq" parameter seemed to be somehow relevant to it. I am actually using
it when doing the actual search query:

 - fl=uid,score
 - rows=
 - sort=uid asc
 - q=body:stuff hdr:stuff or any:stuff
 - fq=uidv: box: user:

I didn't use fq with the "check what's missing query" because if there
was no q parameter Solr gave an error.

> > Is the above query even efficient for a single mailbox? 
> 
> i don't think so.

I guess that'll need changing then too.

> >I did consider
> > using separate documents for storing the highest UID for each mailbox,
> > but that causes annoying desynchronization possibilities. Especially
> > because currently I can just keep sending documents to Solr without
> > locking and let it drop duplicates automatically (should be rare). With
> > per-mailbox highest-uid documents I can't really see a way to do this
> > without locking or allowing duplicate fields to be added and later some
> > garbage collection deleting all but the one highest value (annoyingly
> > complex).
> 
> I have a feeling the issues arise from serialising the whole process (as I
> described above... ). It makes more sense (to me)  to implement something
> similar to DIH, where you load data as needed (even a 'delta query', which
> would only return new data... I am not sure whether you could use DIH ( RSS
> feed from IMAP store? )

DIH seems to be about Solr pulling data into it from an external source.
That's not really practical with Dovecot since there's no central
repository of any kind of data, so there's no way to know what has
changed since last pull.

> 

Re: [VOTE] Community Logo Preferences

2008-11-24 Thread Matthew Runo

1. 
https://issues.apache.org/jira/secure/attachment/12394263/apache_solr_a_blue.jpg
2. 
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
[EMAIL PROTECTED] - 702-943-7833

On Nov 23, 2008, at 8:59 AM, Ryan McKinley wrote:


Please submit your preferences for the solr logo.

For full voting details, see:
 http://wiki.apache.org/solr/LogoContest#Voting

The eligible logos are:
 http://people.apache.org/~ryan/solr-logo-options.html

Any and all members of the Solr community are encouraged to reply to  
this thread and list (up to) 5 ranked choices by listing the Jira  
attachment URLs. Votes will be assigned a point value based on rank.  
For each vote, 1st choice has a point value of 5, 5th place has a  
point value of 1, and all others follow a similar pattern.


https://issues.apache.org/jira/secure/attachment/12345/yourfrstchoice.jpg
https://issues.apache.org/jira/secure/attachment/34567/yoursecondchoice.jpg
...

This poll will be open until Wednesday November 26th, 2008 @ 11:59PM  
GMT


When the poll is complete, the solr committers will tally the  
community preferences and take a final vote on the logo.


A big thanks to everyone who submitted possible logos -- it's great  
to see so many good options.




Re: Very bad performance

2008-11-24 Thread Cedric Houis

Great news!

I’ll try to test your patch tomorrow.

Thanks again,

  Cédric


Yonik Seeley wrote:
> 
> On Mon, Nov 24, 2008 at 9:19 AM, Cedric Houis <[EMAIL PROTECTED]>
> wrote:
> 
>> You've said that someone will work to improve faceted search. Could you
>> tell
>> me where I can tract those evolutions?
> 
> Coming soon... see
> https://issues.apache.org/jira/browse/SOLR-475
> 
> -Yonik
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Very-bad-performance-tp20366783p20666181.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Score always 0.0

2008-11-24 Thread Cedric Houis

Thanks a lot Yonik.

Effectively, I had something like this in my code: 

SchemaField sfield = _schema.getFieldOrNull(fieldName);
if (sfield != null)
    doc.add(sfield.createField(nodeValue, 0));  // second argument is the boost -- 0 zeroes the field norm

I set the boost value to 1 and it works now…

Thanks again,

  Cédric


Yonik Seeley wrote:
> 
> Looks like the norm for the doc for that field is 0... did you perhaps
> boost the field or document by 0 when indexing?
> 
> -Yonik
> 
> On Mon, Nov 24, 2008 at 9:59 AM, Cedric Houis <[EMAIL PROTECTED]>
> wrote:
>>
>> Hi Solr team.
>>
>> I've got a question about scoring; when we make a search like this: "bush
>> obama^9", we always have a 0.0 score.
>> Should we not have a higher score when the document contains both "bush"
>> and
>> "obama"?
>>
>> Our default search field is defined like this:
>> > termVectors="true"/>
>>
>> Here is debug result:
>> 
>>bush obama^9
>>bush obama^9
>>nFullText:bush nFullText:obama^9.0
>>nFullText:bush
>> nFullText:obama^9.0
>>
>>
>>0.0 = (MATCH) product of:
>>0.0 = (MATCH) sum of:
>>0.0 = (MATCH) weight(nFullText:bush in 74), product of:
>>0.06187806 = queryWeight(nFullText:bush), product of:
>>3.8956826 = idf(docFreq=16003, numDocs=289606)
>>0.015883753 = queryNorm
>>0.0 = (MATCH) fieldWeight(nFullText:bush in 74), product of:
>>2.0 = tf(termFreq(nFullText:bush)=4)
>>3.8956826 = idf(docFreq=16003, numDocs=289606)
>>0.0 = fieldNorm(field=nFullText, doc=74)
>>0.5 = coord(1/2)
>>
>>
>>0.0 = (MATCH) product of:
>>0.0 = (MATCH) sum of:
>>0.0 = (MATCH) weight(nFullText:bush in 148), product of:
>>0.06187806 = queryWeight(nFullText:bush), product of:
>>3.8956826 = idf(docFreq=16003, numDocs=289606)
>>0.015883753 = queryNorm
>>0.0 = (MATCH) fieldWeight(nFullText:bush in 148), product of:
>>1.0 = tf(termFreq(nFullText:bush)=1)
>>3.8956826 = idf(docFreq=16003, numDocs=289606)
>>0.0 = fieldNorm(field=nFullText, doc=148)
>>0.5 = coord(1/2)
>>
>>
>>0.0 = (MATCH) product of:
>>0.0 = (MATCH) sum of:
>>0.0 = (MATCH) weight(nFullText:bush in 170), product of:
>>0.06187806 = queryWeight(nFullText:bush), product of:
>>3.8956826 = idf(docFreq=16003, numDocs=289606)
>>0.015883753 = queryNorm
>>0.0 = (MATCH) fieldWeight(nFullText:bush in 170), product of:
>>2.0 = tf(termFreq(nFullText:bush)=4)
>>3.8956826 = idf(docFreq=16003, numDocs=289606)
>>0.0 = fieldNorm(field=nFullText, doc=170)
>>0.5 = coord(1/2)
>>
>>
>>0.0 = (MATCH) product of:
>>0.0 = (MATCH) sum of:
>>0.0 = (MATCH) weight(nFullText:bush in 172), product of:
>>0.06187806 = queryWeight(nFullText:bush), product of:
>>3.8956826 = idf(docFreq=16003, numDocs=289606)
>>0.015883753 = queryNorm
>>0.0 = (MATCH) fieldWeight(nFullText:bush in 172), product of:
>>1.4142135 = tf(termFreq(nFullText:bush)=2)
>>3.8956826 = idf(docFreq=16003, numDocs=289606)
>>0.0 = fieldNorm(field=nFullText, doc=172)
>>0.5 = coord(1/2)
>>
>>
>>0.0 = (MATCH) product of:
>>0.0 = (MATCH) sum of:
>>0.0 = (MATCH) weight(nFullText:bush in 185), product of:
>>0.06187806 = queryWeight(nFullText:bush), product of:
>>3.8956826 = idf(docFreq=16003, numDocs=289606)
>>0.015883753 = queryNorm
>>0.0 = (MATCH) fieldWeight(nFullText:bush in 185), product of:
>>1.0 = tf(termFreq(nFullText:bush)=1)
>>3.8956826 = idf(docFreq=16003, numDocs=289606)
>>0.0 = fieldNorm(field=nFullText, doc=185)
>>0.5 = coord(1/2)
>>
>>
>> 
>>
>> Every explanation is welcome! Thanks in advance,
>>
>> Regards,
>>
>>  Cédric
>> --
>> View this message in context:
>> http://www.nabble.com/Score-always-0.0-tp20662445p20662445.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Score-always-0.0-tp20662445p20665969.html
Sent from the Solr - User mailing list archive at Nabble.com.



Third Hadoop Get Together @ Berlin

2008-11-24 Thread Isabel Drost

The third German Hadoop get together is going to take place at 9th of December 
at newthinking store in Berlin:

http://upcoming.yahoo.com/event/1383706/?ps=6

You can order drinks directly at the bar in the newthinking store. As this Get 
Together takes place in December - Christmas time - there will be cookies as 
well. There are quite a few good restaurants nearby, so we can go there after 
the official part.

Stefan Groschupf offered to prepare a talk on his project katta. We are still 
looking for one or more interesting talks. We would like to invite you, the 
visitor to tell your Hadoop story. If you like, you can bring slides - there 
will be a beamer. Please send your proposal at [EMAIL PROTECTED]

There will be slots of 20min each for talks on your Hadoop topic. After each 
talk there will be time to discuss.

A big Thanks goes to the newthinking store for again providing a room in the 
center of Berlin for us.

Looking forward to seeing you in Berlin,
Isabel Drost

-- 
QOTD: It's not an optical illusion, it just looks like one.   -- Phil White 
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_   VoIP:
 |,4-  ) )-,_..;\ (  `'-'  Tel: (+49) 30 6920 6101
'---''(_/--'  `-'\_) (fL)  IM:  





Lucene 2.3.1 vs 2.4 benchmarks using LuSql

2008-11-24 Thread Glen Newton
I have some simple indexing benchmarks comparing Lucene 2.3.1 with 2.4:
 http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html

In the next couple of days I will be running benchmarks comparing
Solr's DataImportHandler/JdbcDataSource indexing performance with
LuSql and will release them ASAP.

thanks,

Glen

PS. Previous Lucene benchmarks:
- http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html
- http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html
- http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html

-- 

-


RE: Query for Distributed search -

2008-11-24 Thread souravm
Hi,

I understand your point on how I could do it myself in my Java code. 

However, I'm more interested to know how the default behaviour of 
DistributedSearch work when I issue a command like "curl 
'http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr'"
 as mentioned in the wiki.

Regards,
Sourav

-Original Message-
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 24, 2008 12:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Query for Distributed search -

If you for instance use SolrJ and the HttpSolrServer, you could for  
instance add logic to your querying making your searches more efficient!  
That is partially the idea of sharding, right? :) So if the user wants to  
search for a log file in June, your application knows that June logs are  
stored on the second box, and hence will redirect the search to that box.  
Alternatively if he wants to search for logs spanning two boxes, you  
merely add the shards parameter to your query and just include the path to  
those two shards in question. I'm not really sure how Solr handles  
the merging of results etc. and whether or not the requests are done in  
parallel or sequentially, but I do know that you could easily manage this  
on your own through Java if you want to. (Simply set up one  
HttpSolrServer in your code for each shard, search them in  
parallel in separate threads, then reduce the results afterwards.)

Have a look at http://wiki.apache.org/solr/DistributedSearch for more info.
You could also take a look at Hadoop. (http://hadoop.apache.org/)
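The do-it-yourself variant described above -- one client per shard, queried in parallel threads, results reduced afterwards -- looks roughly like this sketch. The shards here are stand-in callables returning scored hits; in real code each would be an HttpSolrServer making an HTTP request, and Solr's own distributed search performs essentially this merge for you when you pass the shards parameter:

```java
import java.util.*;
import java.util.concurrent.*;

public class ShardSearch {
    public static class Hit {
        public final String id;
        public final double score;
        public Hit(String id, double score) { this.id = id; this.score = score; }
    }

    // Query every shard in parallel, then merge the partial results by
    // descending score and keep the top `rows`.
    public static List<Hit> search(List<Callable<List<Hit>>> shards, int rows) {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : pool.invokeAll(shards)) merged.addAll(f.get());
            merged.sort((a, b) -> Double.compare(b.score, a.score));
            return merged.subList(0, Math.min(rows, merged.size()));
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the per-timeline boxes from the question.
        Callable<List<Hit>> janToApr = () -> List.of(new Hit("log-17", 2.1));
        Callable<List<Hit>> mayToAug = () -> List.of(new Hit("log-42", 3.7), new Hit("log-43", 0.9));
        for (Hit h : search(List.of(janToApr, mayToAug), 2))
            System.out.println(h.id + " " + h.score);
        // log-42 3.7
        // log-17 2.1
    }
}
```

The design point is that fan-out happens concurrently (total latency is roughly the slowest shard, not the sum), while the reduce step is a cheap in-memory sort of the partial top lists.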

regards,
  Aleks

On Mon, 24 Nov 2008 06:24:51 +0100, souravm <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Looking for some insight on distributed search.
>
> Say I have an index distributed in 3 boxes and the index contains time  
> and text data (typical log file). Each box has index for different  
> timeline - say Box 1 for all Jan to April, Box 2 for May to August and  
> Box 3 for Sep to Dec.
>
> Now if I try to search for a text string, will the search would happen  
> in parallel in all 3 boxes or sequentially?
>
> Regards,
> Sourav
>
>



-- 
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no


Re: [VOTE] Community Logo Preferences

2008-11-24 Thread jm
https://issues.apache.org/jira/secure/attachment/12393951/sslogo-solr-classic.png
https://issues.apache.org/jira/secure/attachment/12394291/apache_solr_contour.png
https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png


Re: [VOTE] Community Logo Preferences

2008-11-24 Thread Fergus McMenemie
https://issues.apache.org/jira/secure/attachment/12394263/apache_solr_a_blue.jpg


Re: Score always 0.0

2008-11-24 Thread Yonik Seeley
Looks like the norm for the doc for that field is 0... did you perhaps
boost the field or document by 0 when indexing?

-Yonik
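The explain traces in this thread can be checked by hand against Lucene's DefaultSimilarity (the 2.x default): idf = 1 + ln(numDocs/(docFreq+1)), tf = sqrt(termFreq), and the per-document factor is tf * idf * fieldNorm (times the query-side weight and coord, omitted here). With docFreq=16003 and numDocs=289606 this reproduces the 3.8956826 in the trace, and a fieldNorm of 0 -- the result of a 0 boost at index time -- zeroes the whole product no matter what tf and idf are:

```java
public class ScoreCheck {
    // Lucene DefaultSimilarity (2.x) formulas.
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }
    public static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }
    // Simplified per-document field weight: tf * idf * fieldNorm.
    public static double fieldWeight(int termFreq, int docFreq, int numDocs, double fieldNorm) {
        return tf(termFreq) * idf(docFreq, numDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        System.out.println(idf(16003, 289606));                  // ~3.8956826, as in the trace
        System.out.println(tf(4));                               // 2.0, as in the trace
        System.out.println(fieldWeight(4, 16003, 289606, 0.0));  // 0.0: a 0 boost kills the score
    }
}
```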

On Mon, Nov 24, 2008 at 9:59 AM, Cedric Houis <[EMAIL PROTECTED]> wrote:
>
> Hi Solr team.
>
> I've got a question about scoring; when we make a search like this: "bush
> obama^9", we always have a 0.0 score.
> Should we not have a higher score when the document contains both "bush" and
> "obama"?
>
> Our default search field is defined like this:
>  termVectors="true"/>
>
> Here is debug result:
> 
>bush obama^9
>bush obama^9
>nFullText:bush nFullText:obama^9.0
>nFullText:bush
> nFullText:obama^9.0
>
>
>0.0 = (MATCH) product of:
>0.0 = (MATCH) sum of:
>0.0 = (MATCH) weight(nFullText:bush in 74), product of:
>0.06187806 = queryWeight(nFullText:bush), product of:
>3.8956826 = idf(docFreq=16003, numDocs=289606)
>0.015883753 = queryNorm
>0.0 = (MATCH) fieldWeight(nFullText:bush in 74), product of:
>2.0 = tf(termFreq(nFullText:bush)=4)
>3.8956826 = idf(docFreq=16003, numDocs=289606)
>0.0 = fieldNorm(field=nFullText, doc=74)
>0.5 = coord(1/2)
>
>
>0.0 = (MATCH) product of:
>0.0 = (MATCH) sum of:
>0.0 = (MATCH) weight(nFullText:bush in 148), product of:
>0.06187806 = queryWeight(nFullText:bush), product of:
>3.8956826 = idf(docFreq=16003, numDocs=289606)
>0.015883753 = queryNorm
>0.0 = (MATCH) fieldWeight(nFullText:bush in 148), product of:
>1.0 = tf(termFreq(nFullText:bush)=1)
>3.8956826 = idf(docFreq=16003, numDocs=289606)
>0.0 = fieldNorm(field=nFullText, doc=148)
>0.5 = coord(1/2)
>
>
>0.0 = (MATCH) product of:
>0.0 = (MATCH) sum of:
>0.0 = (MATCH) weight(nFullText:bush in 170), product of:
>0.06187806 = queryWeight(nFullText:bush), product of:
>3.8956826 = idf(docFreq=16003, numDocs=289606)
>0.015883753 = queryNorm
>0.0 = (MATCH) fieldWeight(nFullText:bush in 170), product of:
>2.0 = tf(termFreq(nFullText:bush)=4)
>3.8956826 = idf(docFreq=16003, numDocs=289606)
>0.0 = fieldNorm(field=nFullText, doc=170)
>0.5 = coord(1/2)
>
>
>0.0 = (MATCH) product of:
>0.0 = (MATCH) sum of:
>0.0 = (MATCH) weight(nFullText:bush in 172), product of:
>0.06187806 = queryWeight(nFullText:bush), product of:
>3.8956826 = idf(docFreq=16003, numDocs=289606)
>0.015883753 = queryNorm
>0.0 = (MATCH) fieldWeight(nFullText:bush in 172), product of:
>1.4142135 = tf(termFreq(nFullText:bush)=2)
>3.8956826 = idf(docFreq=16003, numDocs=289606)
>0.0 = fieldNorm(field=nFullText, doc=172)
>0.5 = coord(1/2)
>
>
>0.0 = (MATCH) product of:
>0.0 = (MATCH) sum of:
>0.0 = (MATCH) weight(nFullText:bush in 185), product of:
>0.06187806 = queryWeight(nFullText:bush), product of:
>3.8956826 = idf(docFreq=16003, numDocs=289606)
>0.015883753 = queryNorm
>0.0 = (MATCH) fieldWeight(nFullText:bush in 185), product of:
>1.0 = tf(termFreq(nFullText:bush)=1)
>3.8956826 = idf(docFreq=16003, numDocs=289606)
>0.0 = fieldNorm(field=nFullText, doc=185)
>0.5 = coord(1/2)
>
>
> 
>
> Every explanation is welcome! Thanks in advance,
>
> Regards,
>
>  Cédric
> --
> View this message in context: 
> http://www.nabble.com/Score-always-0.0-tp20662445p20662445.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: [VOTE] Community Logo Preferences

2008-11-24 Thread Sami Siren

https://issues.apache.org/jira/secure/attachment/12394314/apache_soir_001.jpg
https://issues.apache.org/jira/secure/attachment/12394366/solr3_maho.png
https://issues.apache.org/jira/secure/attachment/12394353/solr.s5.jpg
https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg

--
 Sami Siren


Ryan McKinley wrote:

Please submit your preferences for the solr logo.

For full voting details, see:
  http://wiki.apache.org/solr/LogoContest#Voting

The eligible logos are:
  http://people.apache.org/~ryan/solr-logo-options.html

Any and all members of the Solr community are encouraged to reply to 
this thread and list (up to) 5 ranked choices by listing the Jira 
attachment URLs. Votes will be assigned a point value based on rank. For 
each vote, 1st choice has a point value of 5, 5th place has a point 
value of 1, and all others follow a similar pattern.


https://issues.apache.org/jira/secure/attachment/12345/yourfrstchoice.jpg
https://issues.apache.org/jira/secure/attachment/34567/yoursecondchoice.jpg
...

This poll will be open until Wednesday November 26th, 2008 @ 11:59PM GMT

When the poll is complete, the solr committers will tally the community 
preferences and take a final vote on the logo.


A big thanks to everyone who submitted possible logos -- it's great to 
see so many good options.




Re: [VOTE] Community Logo Preferences

2008-11-24 Thread Alok Dhir

https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png


RE: [VOTE] Community Logo Preferences

2008-11-24 Thread Steven A Rowe
https://issues.apache.org/jira/secure/attachment/12392306/apache_solr_sun.png
https://issues.apache.org/jira/secure/attachment/12391946/apache_solr_burning.png
 
https://issues.apache.org/jira/secure/attachment/12394267/apache_solr_c_blue.jpg
https://issues.apache.org/jira/secure/attachment/12394070/sslogo-solr-finder2.0.png
https://issues.apache.org/jira/secure/attachment/12393951/sslogo-solr-classic.png

On 11/23/2008 at 11:59 AM, Ryan McKinley wrote:
> Please submit your preferences for the solr logo.
> 
> For full voting details, see:
>http://wiki.apache.org/solr/LogoContest#Voting
> 
> The eligible logos are:
>http://people.apache.org/~ryan/solr-logo-options.html
> 
> Any and all members of the Solr community are encouraged to reply to
> this thread and list (up to) 5 ranked choices by listing the Jira
> attachment URLs. Votes will be assigned a point value based on rank.
> For each vote, 1st choice has a point value of 5, 5th place has a
> point value of 1, and all others follow a similar pattern.
> 
> https://issues.apache.org/jira/secure/attachment/12345/yourfrs
> tchoice.jpg
> https://issues.apache.org/jira/secure/attachment/34567/yoursec
> ondchoice.jpg ...
> 
> This poll will be open until Wednesday November 26th, 2008 @
> 11:59PM GMT
> 
> When the poll is complete, the solr committers will tally the
> community preferences and take a final vote on the logo.
> 
> A big thanks to everyone who submitted possible logos -- it's great
> to see so many good options.
>

 



Re: Very bad performance

2008-11-24 Thread Yonik Seeley
On Mon, Nov 24, 2008 at 9:19 AM, Cedric Houis <[EMAIL PROTECTED]> wrote:
> I've made the test with the latest nightly build of Solr. Performances are
> similar.

Yep, see
http://www.nabble.com/NIO-not-working-yet-to20468152.html#a20468152

> You've said that someone will work to improve faceted search. Could you tell
> me where I can tract those evolutions?

Coming soon... see
https://issues.apache.org/jira/browse/SOLR-475

-Yonik


Re: AND query on multivalue text

2008-11-24 Thread Antonio Zippo



>On Nov 24, 2008, at 8:52 AM, Erik Hatcher wrote:

> 
> On Nov 24, 2008, at 8:37 AM, David Santamauro wrote:
 i need to search something as
 myText:billion AND guarantee
 
 i need only the records where the words exist in the same 
 value (in this case only the first record), because in the 2nd record the 
 two words are in different values
 
 is it possible?
>>> 
>>> It's not possible with a purely boolean query like this, but it is possible 
>>> with a sloppy phrase query where the position increment gap (see example 
>>> schema.xml) is greater than the slop factor.
>>> 
>>> Erik
>>> 
>> 
>> 
>> I think what is needed here is the concept of SAME, i.e., myText:billion 
>> SAME guarantee. I know a few full-text engines that can handle this operator 
>> one way or another. And without it, I don't quite understand the usefulness 
>> of multiValue fields.
> 
> Yeah, multi-valued fields are a bit awkward to grasp fully in Lucene.  
> Especially in this context where it's a full-text field.  Basically as far as 
> indexing goes, there's no such thing as a "multi-valued" field.  An indexed 
> field gets split into terms, and terms have positional information attached 
> to them (thus a position increment gap can be used to but a big virtual gap 
> between the last term of one field instance and the first term of the next 
> one).  A multi-valued field gets stored (if it is set to be stored, that is) 
> as separate strings, and is retrievable as the separate values.
> 
> Multi-valued fields are handy for facets where, say, a product can have 
> multiple categories associated with it.  In this case it's a bit clearer.  
> It's the full-text multi-valued fields that seem a bit strange.
> 
> Erik
> 

> 
> OK, it seems it is the multi-dimensional aspect that is missing
> 
> field[0]: A B C D
> field[1]:   B   D
> 
> ...and the concept of field array would need to be introduced (probably at 
> the lucene level).
> 
> Do you know if there has been any serious thought given to this, i.e., the 
> possibility of introducing a new SAME operator or is this a corner-case not
> worthy?
> 
> thanks
> David
> 

thanks for all the replies

maybe this could be an interesting request for the developers

bye
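Erik's positionIncrementGap-plus-slop recipe from earlier in the thread can be sketched numerically: the values of a multiValued field share one position stream with a large gap between consecutive values, so a sloppy phrase whose slop is smaller than the gap can only pair terms that came from the same value. Plain Java, with gap and slop values chosen for illustration:

```java
import java.util.*;

public class MultiValuedGap {
    // Flatten the values of a multiValued field into one term -> positions
    // map, leaving `gap` extra positions between consecutive values
    // (the positionIncrementGap).
    public static Map<String, List<Integer>> positions(List<String> values, int gap) {
        Map<String, List<Integer>> out = new HashMap<>();
        int pos = 0;
        for (String value : values) {
            for (String term : value.toLowerCase().split("\\s+"))
                out.computeIfAbsent(term, k -> new ArrayList<>()).add(pos++);
            pos += gap;   // the big virtual gap between field instances
        }
        return out;
    }

    // Toy sloppy phrase: do the two terms ever occur within `slop` positions?
    public static boolean within(Map<String, List<Integer>> index, String a, String b, int slop) {
        for (int pa : index.getOrDefault(a, List.of()))
            for (int pb : index.getOrDefault(b, List.of()))
                if (Math.abs(pa - pb) <= slop) return true;
        return false;
    }

    public static void main(String[] args) {
        int gap = 100, slop = 50;
        // Record 1: both words inside one value.
        Map<String, List<Integer>> rec1 = positions(List.of("a billion dollar guarantee"), gap);
        // Record 2: the words live in different values.
        Map<String, List<Integer>> rec2 = positions(List.of("a billion dollars", "a solid guarantee"), gap);
        System.out.println(within(rec1, "billion", "guarantee", slop));  // true
        System.out.println(within(rec2, "billion", "guarantee", slop));  // false
    }
}
```

So `myText:"billion guarantee"~50` with a positionIncrementGap of 100 behaves much like the SAME operator requested above, as long as individual values are shorter than the slop.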


  

Re: a question about solr phrasequery process.

2008-11-24 Thread Yonik Seeley
2008/11/24 finy finy <[EMAIL PROTECTED]>:
> maybe you have misunderstood what i said,
>
> another example:
>
> the keyword "oneworldonedream" in my own analyzer will be analyzed into
> four tokens:
> one world one dream

OK, this is just the way the Lucene and Solr query parsers work...
multiple tokens from the analyzer of a single "parser level" token is
normally treated as a phrase query.

What you could try to do is set the position increment of all of the
tokens you get to 0... this should have the effect of an "OR".

-Yonik
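[To see why position increments matter here, a toy model in plain Java (not the Lucene API; Token and toQuery are illustrative stand-ins) of how the parser of that era treated multiple tokens coming from one analyzed chunk:]

```java
import java.util.Arrays;
import java.util.List;

class PosIncDemo {

    // A token as the analyzer emits it: its text plus the position
    // increment relative to the previous token.
    static class Token {
        final String text;
        final int posInc;
        Token(String text, int posInc) { this.text = text; this.posInc = posInc; }
    }

    // Mimics the parser's decision: if any token after the first advances
    // the position (posInc >= 1), the tokens form a PhraseQuery; if they
    // are all stacked at the same position (posInc == 0), they behave like
    // synonyms and are OR'ed together.
    static String toQuery(String field, List<Token> tokens) {
        boolean phrase = false;
        for (int i = 1; i < tokens.size(); i++) {
            if (tokens.get(i).posInc > 0) { phrase = true; break; }
        }
        StringBuilder sb = new StringBuilder();
        if (phrase) {
            sb.append(field).append(":\"");
            for (int i = 0; i < tokens.size(); i++) {
                if (i > 0) sb.append(' ');
                sb.append(tokens.get(i).text);
            }
            sb.append('"');
        } else {
            for (int i = 0; i < tokens.size(); i++) {
                if (i > 0) sb.append(" OR ");
                sb.append(field).append(':').append(tokens.get(i).text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Tokens that each advance the position: the parser builds a phrase.
        List<Token> advancing = Arrays.asList(new Token("one", 1),
                new Token("world", 1), new Token("one", 1), new Token("dream", 1));
        System.out.println(toQuery("title", advancing));

        // Same tokens stacked at one position: an OR-like query instead.
        List<Token> stacked = Arrays.asList(new Token("one", 1),
                new Token("world", 0), new Token("one", 0), new Token("dream", 0));
        System.out.println(toQuery("title", stacked));
    }
}
```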


Re: a question about solr phrasequery process.

2008-11-24 Thread finy finy
maybe you have misunderstood what I said,

another example:

the keyword "oneworldonedream" in my own analyzer will be analyzed into
four tokens:
one world one dream

when I use my own analyzer in Solr and input "oneworldonedream", Solr gives
the following result:

 
oneworldonedream
oneworldonedream
PhraseQuery(title:"one world dream")
title:"one world dream"
OldLuceneQParser
...

please be careful with the result: PhraseQuery(title:"one world dream")
this is not what I want; I want to get the query: title:one title:world
title:dream
How can I do this? Does Solr have a configuration option for this?


2008/11/24 Yonik Seeley <[EMAIL PROTECTED]>

>  On Mon, Nov 24, 2008 at 3:55 AM, finy finy <[EMAIL PROTECTED]> wrote:
> > hello everyone:
> >
> >  i use solr 1.2,  i find a problem about solr1.2,
> >
> >  when I search for some keyword using my own analyzer, I find that Solr
> > considers my terms as a PhraseQuery,
> >
> >  for example,solr parser's result is: PhraseQuery( title:"i am good
> man"),
> > but i want to get the query: title:i  title:am title:good title:man
>
> If Solr is producing a phrase query, then you must have surrounded the
> words in quotes in the query string.  To group terms in a field try
> the following:
>
> title:(i am good man)
>
> -Yonik
>


Score always 0.0

2008-11-24 Thread Cedric Houis

Hi Solr team.

I’ve got a question about scoring; when we make a search like this: “bush
obama^9”, we always have a 0.0 score.
Should we not have a higher score when the document contains both “bush” and
“obama”?

Our default search field is defined like this: 
 

Here is debug result: 

bush obama^9
bush obama^9
nFullText:bush nFullText:obama^9.0
nFullText:bush
nFullText:obama^9.0


0.0 = (MATCH) product of:
0.0 = (MATCH) sum of:
0.0 = (MATCH) weight(nFullText:bush in 74), product of:
0.06187806 = queryWeight(nFullText:bush), product of:
3.8956826 = idf(docFreq=16003, numDocs=289606)
0.015883753 = queryNorm
0.0 = (MATCH) fieldWeight(nFullText:bush in 74), product of:
2.0 = tf(termFreq(nFullText:bush)=4)
3.8956826 = idf(docFreq=16003, numDocs=289606)
0.0 = fieldNorm(field=nFullText, doc=74)
0.5 = coord(1/2)


0.0 = (MATCH) product of:
0.0 = (MATCH) sum of:
0.0 = (MATCH) weight(nFullText:bush in 148), product of:
0.06187806 = queryWeight(nFullText:bush), product of:
3.8956826 = idf(docFreq=16003, numDocs=289606)
0.015883753 = queryNorm
0.0 = (MATCH) fieldWeight(nFullText:bush in 148), product of:
1.0 = tf(termFreq(nFullText:bush)=1)
3.8956826 = idf(docFreq=16003, numDocs=289606)
0.0 = fieldNorm(field=nFullText, doc=148)
0.5 = coord(1/2)


0.0 = (MATCH) product of:
0.0 = (MATCH) sum of:
0.0 = (MATCH) weight(nFullText:bush in 170), product of:
0.06187806 = queryWeight(nFullText:bush), product of:
3.8956826 = idf(docFreq=16003, numDocs=289606)
0.015883753 = queryNorm
0.0 = (MATCH) fieldWeight(nFullText:bush in 170), product of:
2.0 = tf(termFreq(nFullText:bush)=4)
3.8956826 = idf(docFreq=16003, numDocs=289606)
0.0 = fieldNorm(field=nFullText, doc=170)
0.5 = coord(1/2)


0.0 = (MATCH) product of:
0.0 = (MATCH) sum of:
0.0 = (MATCH) weight(nFullText:bush in 172), product of:
0.06187806 = queryWeight(nFullText:bush), product of:
3.8956826 = idf(docFreq=16003, numDocs=289606)
0.015883753 = queryNorm
0.0 = (MATCH) fieldWeight(nFullText:bush in 172), product of:
1.4142135 = tf(termFreq(nFullText:bush)=2)
3.8956826 = idf(docFreq=16003, numDocs=289606)
0.0 = fieldNorm(field=nFullText, doc=172)
0.5 = coord(1/2)


0.0 = (MATCH) product of:
0.0 = (MATCH) sum of:
0.0 = (MATCH) weight(nFullText:bush in 185), product of:
0.06187806 = queryWeight(nFullText:bush), product of:
3.8956826 = idf(docFreq=16003, numDocs=289606)
0.015883753 = queryNorm
0.0 = (MATCH) fieldWeight(nFullText:bush in 185), product of:
1.0 = tf(termFreq(nFullText:bush)=1)
3.8956826 = idf(docFreq=16003, numDocs=289606)
0.0 = fieldNorm(field=nFullText, doc=185)
0.5 = coord(1/2)
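Tracing the numbers in any of the blocks above as plain arithmetic (a sketch reproducing the shown formula, not Solr code) shows where the zero comes from: fieldNorm is 0.0, and everything multiplied by it collapses to 0.0, regardless of tf and idf:

```java
// Arithmetic check of the explain output above: the per-term score is
// queryWeight * fieldWeight * coord, where fieldWeight = tf * idf * fieldNorm.
class ExplainCheck {

    static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    static double score(double queryWeight, double fieldWeight, double coord) {
        return queryWeight * fieldWeight * coord;
    }

    public static void main(String[] args) {
        // Numbers copied from the explain block for doc 74.
        double tf = 2.0, idf = 3.8956826, fieldNorm = 0.0;
        double queryWeight = 0.06187806, coord = 0.5;
        double fw = fieldWeight(tf, idf, fieldNorm); // 0.0 because fieldNorm is 0.0
        System.out.println(score(queryWeight, fw, coord));
    }
}
```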




Every explanation is welcome! Thanks in advance, 

Regards,

  Cédric
-- 
View this message in context: 
http://www.nabble.com/Score-always-0.0-tp20662445p20662445.html
Sent from the Solr - User mailing list archive at Nabble.com.



Facet Query (fq) and Query (q)

2008-11-24 Thread Jae Joo
I am having some trouble utilizing the facet query. As I understand it, the
facet query (fq) gives better performance than the simple query (q).
Here is the example.

http://localhost:8080/test_solr/select?q=*:*&facet=true&fq=state:CA&facet.mincount=1&facet.field=city&facet.field=sector&facet.limit=-1&sort=score+desc

--> facet by sector and city for state of CA.
Any idea how to optimize this query to avoid "q=*:*"?

Thanks,

Jae


Re: AND query on multivalue text

2008-11-24 Thread David Santamauro


On Nov 24, 2008, at 8:52 AM, Erik Hatcher wrote:



On Nov 24, 2008, at 8:37 AM, David Santamauro wrote:

i need to search something as
myText:billion AND guarantee

I need to extract only the record where both words exist in  
the same value (in this case only the first record), because in  
the 2nd record the two words are in different values


is it possible?


It's not possible with a purely boolean query like this, but it is  
possible with a sloppy phrase query where the position increment  
gap (see example schema.xml) is greater than the slop factor.


Erik




I think what is needed here is the concept of SAME, i.e.,  
myText:billion SAME guarantee. I know a few full-text engines that  
can handle this operator one way or another. And without it, I  
don't quite understand the usefulness of multiValue fields.


Yeah, multi-valued fields are a bit awkward to grasp fully in  
Lucene.  Especially in this context where it's a full-text field.   
Basically as far as indexing goes, there's no such thing as a "multi- 
valued" field.  An indexed field gets split into terms, and terms  
have positional information attached to them (thus a position  
increment gap can be used to put a big virtual gap between the last  
term of one field instance and the first term of the next one).  A  
multi-valued field gets stored (if it is set to be stored, that is)  
as separate strings, and is retrievable as the separate values.


Multi-valued fields are handy for facets where, say, a product can  
have multiple categories associated with it.  In this case it's a  
bit clearer.  It's the full-text multi-valued fields that seem a bit  
strange.


Erik




OK, it seems it is the multi-dimensional aspect that is missing

field[0]: A B C D
field[1]:   B   D

...and the concept of field array would need to be introduced  
(probably at the lucene level).


Do you know if there has been any serious thought given to this, i.e.,  
the possibility of introducing a new SAME operator or is this a corner- 
case not worthy?


thanks
David







solr internationalization support

2008-11-24 Thread rameshgalla

hi,

1)Which languages solr supports out-of-the box other than english?

2)What are the analyzers(stemmer,synonym,tokenizer etc) it provides for each
language?

3) Can we create our own analyzers for other languages? (If possible, explain
how.)

thanx in advance
-- 
View this message in context: 
http://www.nabble.com/solr-internationalization-support-tp20661848p20661848.html



Re: Very bad performance

2008-11-24 Thread Cedric Houis

Hi Yonik.

I’ve made the test with the latest nightly build of Solr. Performance is
similar.

You said that someone will be working on improving faceted search. Could you
tell me where I can track those changes?

Thanks in advance,

Regards,

  Cédric


Yonik Seeley wrote:
> 
> On Fri, Nov 7, 2008 at 12:58 PM, Cedric Houis <[EMAIL PROTECTED]>
> wrote:
>> Another remark, why do we have better performance when we use parallel
>> instances of SOLR that use the same index on the same machine?
> 
> Internal locking.
> SOLR-465 was committed yesterday... it may improve some things slightly.
> 
> -Yonik
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Very-bad-performance-tp20366783p20661305.html



Re: AND query on multivalue text

2008-11-24 Thread Erik Hatcher


On Nov 24, 2008, at 8:37 AM, David Santamauro wrote:

i need to search something as
myText:billion AND guarantee

I need to extract only the record where both words exist in  
the same value (in this case only the first record), because in the  
2nd record the two words are in different values


is it possible?


It's not possible with a purely boolean query like this, but it is  
possible with a sloppy phrase query where the position increment  
gap (see example schema.xml) is greater than the slop factor.


Erik




I think what is needed here is the concept of SAME, i.e.,  
myText:billion SAME guarantee. I know a few full-text engines that  
can handle this operator one way or another. And without it, I don't  
quite understand the usefulness of multiValue fields.


Yeah, multi-valued fields are a bit awkward to grasp fully in Lucene.   
Especially in this context where it's a full-text field.  Basically as  
far as indexing goes, there's no such thing as a "multi-valued"  
field.  An indexed field gets split into terms, and terms have  
positional information attached to them (thus a position increment gap  
can be used to put a big virtual gap between the last term of one  
field instance and the first term of the next one).  A multi-valued  
field gets stored (if it is set to be stored, that is) as separate  
strings, and is retrievable as the separate values.


Multi-valued fields are handy for facets where, say, a product can  
have multiple categories associated with it.  In this case it's a bit  
clearer.  It's the full-text multi-valued fields that seem a bit  
strange.


Erik



Re: a question about solr phrasequery process.

2008-11-24 Thread Yonik Seeley
On Mon, Nov 24, 2008 at 3:55 AM, finy finy <[EMAIL PROTECTED]> wrote:
> hello everyone:
>
>  i use solr 1.2,  i find a problem about solr1.2,
>
> >  when I search for some keyword using my own analyzer, I find that Solr
> > considers my terms as a PhraseQuery,
>
>  for example,solr parser's result is: PhraseQuery( title:"i am good man"),
> but i want to get the query: title:i  title:am title:good title:man

If Solr is producing a phrase query, then you must have surrounded the
words in quotes in the query string.  To group terms in a field try
the following:

title:(i am good man)

-Yonik


Re: AND query on multivalue text

2008-11-24 Thread David Santamauro


Hello all, I'm new to the list but want to say great work! ... see  
comment below


On Nov 24, 2008, at 7:59 AM, Erik Hatcher wrote:



On Nov 24, 2008, at 6:12 AM, Antonio Zippo wrote:

is it possible to have an AND query on a multivalued text field?

i need to search something as
myText:billion AND guarantee

I need to extract only the record where both words exist in  
the same value (in this case only the first record), because in the  
2nd record the two words are in different values


is it possible?


It's not possible with a purely boolean query like this, but it is  
possible with a sloppy phrase query where the position increment gap  
(see example schema.xml) is greater than the slop factor.


Erik




I think what is needed here is the concept of SAME, i.e.,  
myText:billion SAME guarantee. I know a few full-text engines that can  
handle this operator one way or another. And without it, I don't quite  
understand the usefulness of multiValue fields.


David







Re: AND query on multivalue text

2008-11-24 Thread Erik Hatcher


On Nov 24, 2008, at 6:12 AM, Antonio Zippo wrote:

is it possible to have an AND query on a multivalued text field?

i need to extract the record only if the words are contained inside  
the same value


for example
1st record:


The U.S. government has announced a massive rescue package for  
Citigroup, saying it would guarantee more than $300 billion in  
company assets
while injecting an additional $20 billion in capital into the  
embattled bank. 



2nd record

bla bla bla guarantee bla bla bla
while injecting an additional $20 billion in capital into the  
embattled bank. 




i need to search something as
myText:billion AND guarantee

I need to extract only the record where both words exist in the  
same value (in this case only the first record), because in the 2nd  
record the two words are in different values


is it possible?


It's not possible with a purely boolean query like this, but it is  
possible with a sloppy phrase query where the position increment gap  
(see example schema.xml) is greater than the slop factor.


Erik
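[To make the suggestion above concrete: a hypothetical schema.xml fragment in the style of the example schema. The field type and field names (`text_gap`, `myText`) are made up for illustration; the mechanism is the positionIncrementGap attribute.]

```xml
<!-- positionIncrementGap="100" leaves a virtual gap of 100 positions
     between the values of a multi-valued field, so terms from two
     different values are always at least 100 positions apart. -->
<fieldType name="text_gap" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="myText" type="text_gap" indexed="true" stored="true"
       multiValued="true"/>
```

With this in place, a sloppy phrase query whose slop is smaller than the gap, such as `myText:"billion guarantee"~99`, can only match when both words occur within the same value, never across two values.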



Re: DateField Problem

2008-11-24 Thread Erik Hatcher


On Nov 24, 2008, at 8:03 AM, Peer Allan wrote:
I am trying to update a system that uses solr to use version 1.3,  
but have
stumbled across a problem I can’t seem to fix.  Solr is throwing  
errors on a

date and I don't know why.  Here is the request XML:


 
   Movie
   1
   Movie:1
   Napoleon Dynamite
   Cool movie about a goofy
guy
    2008-11-24T12:58:47Z

 


Posting this to solr gives me this error:

SEVERE: org.apache.solr.common.SolrException: Error while creating  
field
'time_on_xml_d{type=sdouble,properties=indexed,stored,omitNorms,sortMissingLast}'
from value '2008-11-24T12:58:47Z'


Note that it says "type=sdouble".  You need to have that mapped to a  
date field, not sdouble.  I guess you're getting caught by the *_d  
mapping from the example schema?  Try time_on_xml_dt instead, if  
you've got that mapped.


Erik
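[For reference, the relevant dynamic-field mappings in the example schema look roughly like this; exact attributes may differ in your copy, but the error message above confirms the *_d-to-sdouble mapping.]

```xml
<!-- *_d catches time_on_xml_d and parses values as sdouble, which fails
     on an ISO date string; *_dt maps to the date type instead. -->
<dynamicField name="*_d"  type="sdouble" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date"    indexed="true" stored="true"/>
```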



DateField Problem

2008-11-24 Thread Peer Allan
Hello all,

I am trying to update a system that uses solr to use version 1.3, but have
stumbled across a problem I can't seem to fix.  Solr is throwing errors on a
date and I don't know why.  Here is the request XML:


  
Movie
1
Movie:1
Napoleon Dynamite
Cool movie about a goofy
guy
2008-11-24T12:58:47Z
  


Posting this to solr gives me this error:

SEVERE: org.apache.solr.common.SolrException: Error while creating field
'time_on_xml_d{type=sdouble,properties=indexed,stored,omitNorms,sortMissingLast}'
from value '2008-11-24T12:58:47Z'

I have read the release notes for 1.3 and realize there is improved
validation which is probably the source of the new error.  I have read the
documentation on the DateField and from what I can tell it the date is in a
valid format.  Can anyone tell me what's wrong here?  Thanks.

Peer



AND query on multivalue text

2008-11-24 Thread Antonio Zippo

hi all,

is it possible to have an AND query on a multivalued text field?

i need to extract the record only if the words are contained inside the same 
value

for example
1st record:


The U.S. government has announced a massive rescue package for Citigroup, 
saying it would guarantee more than $300 billion in company assets
while injecting an additional $20 billion in capital into the embattled 
bank. 


2nd record

bla bla bla guarantee bla bla bla
while injecting an additional $20 billion in capital into the embattled 
bank. 



i need to search something as 
myText:billion AND guarantee

I need to extract only the record where both words exist in the same value 
(in this case only the first record), because in the 2nd record the two words 
are in different values

is it possible?

thanks


  

Problem generating summaries for redirected URLs

2008-11-24 Thread Elena
Hello everyone,

I am using Nutch with the Solr plugin, and I am having a problem indexing
redirected URLs. While Solr generates its fields just fine, as if they
belonged to the redirected url, Nutch leaves the summary field empty. It
seems as if Nutch tries to generate the summary of the original url and then
makes the query to Solr, which then follows the redirect and fills the rest
of the fields using the final url. But I am not quite sure of this.

I would like to know what is the way Nutch generates summaries, why it
leaves them empty when redirecting. Perhaps there is a command to generate
one field in particular, after the indexing is done.

Thanks!


I: Highlighting wildcards

2008-11-24 Thread Antonio Zippo
q=tele?* seems not to work. 
The query is OK, but the highlight returns records without the highlighted 
text (only the uniqueKey in the highlight section).


> To do it now, you'd have to switch the query parser to using the
old style wildcard (and/or prefix) query, which is slower on large
indexes and has max clause issues.

Could you explain to me how, please?
Again, for the query tele*.

thanks





- Forwarded message -
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, 21 November 2008, 20:23:10
Subject: Re: Highlighting wildcards


On 21-Nov-08, at 3:45 AM, Mark Miller wrote:

> To do it now, you'd have to switch the query parser to using the old style 
> wildcard (and/or prefix) query, which is slower on large indexes and has max 
> clause issues.

An alternative is to query for q=tele?*, which forces wildcardquery

-Mike

> 
> I think I can make it work out of the box for the next release again though. 
> see https://issues.apache.org/jira/browse/SOLR-825
> 
> Antonio Zippo wrote:
>> Hi,
>> 
>> i'm using solr 1.3.0 and SolrJ for my java application
>> 
>> I need to highlight my query words even if I use wildcards
>> 
>> for example
>> q=tele*
>> 
>> i need to highlight words as "television", "telephone", etc
>> 
>> I found this thread
>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200704.mbox/[EMAIL 
>> PROTECTED]
>> 
>> but i have not understood ho to solve my problem
>> 
>> could anyone tell me how to solve the problem with SolrJ  and with solr 
>> web (by url)?
>> 
>> thanks in advance, Revenge
>> 
>> 
>> 
>> 
> 


  

Re: Query for Distributed search -

2008-11-24 Thread James liu
That's up to your Solr client.

On Mon, Nov 24, 2008 at 1:24 PM, souravm <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Looking for some insight on distributed search.
>
> Say I have an index distributed in 3 boxes and the index contains time and
> text data (typical log file). Each box has index for different timeline -
> say Box 1 for all Jan to April, Box 2 for May to August and Box 3 for Sep to
> Dec.
>
> Now if I try to search for a text string, will the search happen in
> parallel in all 3 boxes or sequentially?
>
> Regards,
> Sourav
>
>  CAUTION - Disclaimer *
> This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended
> solely
> for the use of the addressee(s). If you are not the intended recipient,
> please
> notify the sender by e-mail and delete the original message. Further, you
> are not
> to copy, disclose, or distribute this e-mail or its contents to any other
> person and
> any such actions are unlawful. This e-mail may contain viruses. Infosys has
> taken
> every reasonable precaution to minimize this risk, but is not liable for
> any damage
> you may sustain as a result of any virus in this e-mail. You should carry
> out your
> own virus checks before opening the e-mail or attachment. Infosys reserves
> the
> right to monitor and review the content of all messages sent to or from
> this e-mail
> address. Messages sent to or from this e-mail address may be stored on the
> Infosys e-mail system.
> ***INFOSYS End of Disclaimer INFOSYS***
>



-- 
regards
j.L


Re: a question about solr queryparser

2008-11-24 Thread finy finy
i have  plugged in my custom Analyzer (correctly) in the solrconfig.xml,

my analyzer's result is: one world one dream

but when I input "oneworld onedream", Solr parses this into
PhraseQuery(title:"one world one dream"). Solr considers this a
PhraseQuery, but this is not what I want to get. I want to get the search
condition: title:one title:world title:dream

2008/9/26 Chris Hostetter <[EMAIL PROTECTED]>

>
> : (correctly) in the solrconfig.xml.  Could you paste the relevant part of
> : solrconfig.xml?  I don't recall a bug related to this, but you could
> : also try Solr 1.3 if you believe you configured things conrrectly.
>
> also check the Analysis Tool (link from the admin page) and see what it
> says your analyzer produces for your field and a *query* for...
>
>oneworld
>
> ...keep in mind that the query parser only associates "chunks" of text
> with something like "title:" if the text is quoted or the whitespace
> between the chunks is escaped, so ...
>
>title:oneworld onedream
>
> ...will cause "oneworld" to be passed to the analyzer for your title field
> and "onedream" to the analyzer for whatever your default field is.
>
>
>
> -Hoss
>
>


a question about solr phrasequery process.

2008-11-24 Thread finy finy
hello everyone:

  i use solr 1.2, and i have found a problem with it:

   when i search for some keyword using my own analyzer, i find that solr
considers my terms as a PhraseQuery,

 for example,solr parser's result is: PhraseQuery( title:"i am good man"),
but i want to get the query: title:i  title:am title:good title:man

please help me !


Re: Query for Distributed search -

2008-11-24 Thread Aleksander M. Stensby
If you for instance use SolrJ and the HttpSolrServer, you could add logic  
to your querying, making your searches more efficient! That is partially  
the idea of sharding, right? :) So if the user wants to search for a log  
file in June, your application knows that June logs are stored on the  
second box, and hence will redirect the search to that box. Alternatively,  
if he wants to search for logs spanning two boxes, you merely add the  
shards parameter to your query and just include the paths to those two  
shards in question. I'm not really sure how Solr handles the merging of  
results etc. and whether or not the requests are done in parallel or  
sequentially, but I do know that you could easily manage this on your own  
through Java if you want to. (Simply set up one HttpSolrServer in your  
code for each shard, and search them in parallel in separate threads,  
then reduce the results afterwards.)
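[The fan-out pattern described in the paragraph above can be sketched as follows. This is a stdlib-only illustration: `Shard` is a made-up stand-in interface; in a real setup each implementation would issue the HTTP request to one box's /select handler, e.g. through a per-shard HttpSolrServer.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ShardFanOut {

    // Stand-in for one shard query; a real implementation would hit that
    // box's search handler over HTTP.
    interface Shard {
        List<String> search(String query) throws Exception;
    }

    // Submits one task per shard, then blocks on each Future; the requests
    // themselves run in parallel on the pool's threads.
    static List<String> searchAll(List<Shard> shards, final String query)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<List<String>>> pending =
                    new ArrayList<Future<List<String>>>();
            for (final Shard shard : shards) {
                pending.add(pool.submit(new Callable<List<String>>() {
                    public List<String> call() throws Exception {
                        return shard.search(query);
                    }
                }));
            }
            List<String> merged = new ArrayList<String>();
            for (Future<List<String>> f : pending) {
                merged.addAll(f.get()); // waits for that shard's response
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Two mock shards standing in for the per-timeline boxes.
        Shard janToApril = new Shard() {
            public List<String> search(String q) { return Arrays.asList("doc1"); }
        };
        Shard mayToAug = new Shard() {
            public List<String> search(String q) { return Arrays.asList("doc2"); }
        };
        System.out.println(searchAll(Arrays.asList(janToApril, mayToAug), "error"));
    }
}
```

A real "reduce" step would merge by score rather than concatenate, which is what Solr's own distributed search does for you when you use the shards parameter instead.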


Have a look at http://wiki.apache.org/solr/DistributedSearch for more info.
You could also take a look at Hadoop. (http://hadoop.apache.org/)

regards,
 Aleks

On Mon, 24 Nov 2008 06:24:51 +0100, souravm <[EMAIL PROTECTED]> wrote:


Hi,

Looking for some insight on distributed search.

Say I have an index distributed in 3 boxes and the index contains time  
and text data (typical log file). Each box has index for different  
timeline - say Box 1 for all Jan to April, Box 2 for May to August and  
Box 3 for Sep to Dec.


Now if I try to search for a text string, will the search happen  
in parallel in all 3 boxes or sequentially?


Regards,
Sourav






--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no