RE: Newbie Question - getting search results from dataimport request handler

2008-11-22 Thread Norskog, Lance

As part of the ETL effort, please consider how to integrate with these two 
open-source ETL systems. I'm not asking for an implementation, just suggesting 
that having a concrete context will help you in the architecture phase.

http://kettle.pentaho.org/
http://www.talend.com/products-data-integration/talend-open-studio.php 

Thanks,

Lance

-Original Message-
From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 21, 2008 8:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Newbie Question - getting search results from dataimport request 
handler

On Sat, Nov 22, 2008 at 3:10 AM, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : > it might be worth considering a new @attribute for <field> to 
> indicate
> : > that they are going to be used purely as "component" fields (ie: 
> your
> : > first-name/last-name example) and then have DIH pass all 
> non-component
> : > fields along and error if undefined in the schema just like other 
> updating
> : > RequestHandlers do.
> : >
> : > either that, or require that people declare indexed="false"
> : > stored="false" fields in the schema for these intermediate 
> component
> : > fields so that we can properly warn them when DIH is getting data 
> it
> : > doesn't know what to do with -- protecting people from field name 
> typos
> : > and returning errors instead of silently ignoring unexpected input 
> is
> : > fairly important behavior -- especially for new users.
>
> : Actually, this is done by DIH. When the data config is loaded, DIH
> : reports this information on the console. Though it is limited, it
> : helps to a certain extent.
>
> Hmmm.
>
> Logging an error and returning successfully (without adding any docs) 
> is still inconsistent with the way all other RequestHandlers work: 
> fail the request.
>
> I know DIH isn't a typical RequestHandler, but some things (like 
> failing on failure) seem like they should be a given.
SOLR-842.
DIH is an ETL tool pretending to be a RequestHandler. Originally it was built 
to run outside of Solr using SolrJ. For better integration and ease of use we 
changed it later.

SOLR-853 aims to achieve the original goal.

The goal of DIH is to become a full featured ETL tool.



>
>
>
> -Hoss
>
>



--
--Noble Paul


DIH and repeated chunked input

2008-11-12 Thread Norskog, Lance
In http://wiki.apache.org/solr/DataImportHandler there is this
paragraph:
 
If an API supports chunking (when the dataset is too large), multiple
calls need to be made to complete the process. XPathEntityProcessor
supports this with a transformer. If the transformer returns a row which
contains a field $hasMore with the value "true", the Processor makes
another request with the same url template (the actual value is
recomputed before invoking). A transformer can pass a totally new url
for the next call by returning a row which contains a field $nextUrl
whose value must be the complete url for the next call. 
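
A transformer honoring that contract might look roughly like this (a sketch
only, assuming the Transformer API described on the DIH wiki; the
nextPageUrl field name is hypothetical):

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// Sketch of a chunking transformer: if the feed row carries a hypothetical
// nextPageUrl field, ask DIH to make one more request at that URL.
public class PagingTransformer extends Transformer {
  public Object transformRow(Map<String, Object> row, Context context) {
    Object next = row.get("nextPageUrl");
    if (next != null) {
      row.put("$hasMore", "true");           // another request is needed
      row.put("$nextUrl", next.toString());  // complete url for the next call
    }
    return row;
  }
}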
 
 
Does this translate as: "Nobody wrote this yet, but it would be really
cool"?
 
Thanks,
 
Lance


RE: Regex Transformer Error

2008-11-05 Thread Norskog, Lance
There is a nice HTML stripper inside Solr.
"solr.HTMLStripStandardTokenizerFactory" 

-Original Message-
From: Ahmed Hammad [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 05, 2008 10:43 AM
To: solr-user@lucene.apache.org
Subject: Re: Regex Transformer Error

Hi,

It works with the attribute regex="&lt;(.|\n)*?&gt;"

Sorry for the disturbance.

Regards,

ahmd


On Wed, Nov 5, 2008 at 8:18 PM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am using Solr 1.3 data import handler. One of my table fields has 
> html tags, I want to strip it of the field text. So obviously I need 
> the Regex Transformer.
>
> I added transformer="RegexTransformer" attribute to my entity and a 
> new field with:
>
>  <field column="..." regex="..." replaceWith="X"/>
>
> Everything works fine. The text is replaced without any problem. The 
> problem happened with my regular expression to strip HTML tags. So I 
> use regex="<(.|\n)*?>". Of course the characters '<' and '>' are not 
> allowed in XML. I tried the following regex="<(.|\n)*?>" and 
> regex="C;(.|\n)*?E;" but I get the following error:
>
> The value of attribute "regex" associated with an element type "field"

> must not contain the '<' character. at 
> com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown 
> Source) ...
>
> The full stack trace is following:
>
> *FATAL: Could not create importer. DataImporter config invalid
> org.apache.solr.common.SolrException: FATAL: Could not create
importer.
> DataImporter config invalid at
> org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImport
> Handler.java:114)
> at
> org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody
> (DataImportHandler.java:206)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandle
> rBase.java:131) at 
> org.apache.solr.core.SolrCore.execute(SolrCore.java:1204) at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.
> java:303)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter
> .java:232)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appli
> cationFilterChain.java:235)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFi
> lterChain.java:206)
> at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperVa
> lve.java:233)
> at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextVa
> lve.java:191)
> at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.ja
> va:128)
> at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.ja
> va:102)
> at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValv
> e.java:109)
> at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java
> :286)
> at
> org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor
> .java:857)
> at
> org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.pro
> cess(Http11AprProtocol.java:565) at 
> org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:150
> 9) at java.lang.Thread.run(Unknown Source) Caused by:
> org.apache.solr.handler.dataimport.DataImportHandlerException: 
> Exception occurred while initializing context Processing Document # at
> org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImp
> orter.java:176)
> at
> org.apache.solr.handler.dataimport.DataImporter.(DataImporter.ja
> va:93)
> at
> org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImport
> Handler.java:106) ... 17 more Caused by: 
> org.xml.sax.SAXParseException: The value of attribute "regex" 
> associated with an element type "field" must not contain the '<'
> character. at
> com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown 
> Source) at 
> com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unkn
> own
> Source) at
> org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImp
> orter.java:166)
> ... 19 more *
>
> *description* *The server encountered an internal error (FATAL: Could 
> not create importer. DataImporter config invalid
> org.apache.solr.common.SolrException: FATAL: Could not create
importer.
> DataImporter config invalid at
> org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImport
> Handler.java:114)
> at
> org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody
> (DataImportHandler.java:206)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandle
> rBase.java:131) at 
> org.apache.solr.core.SolrCore.execute(SolrCore.java:1204) at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.
> java:303)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter
> .java:232)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appli
> cationFilterChain.java:235)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFi
> lterChain.java:206)
> at
> org.apache.catalina.core.Stan

RE: DIH Http input bug - problem with two-level RSS walker

2008-11-01 Thread Norskog, Lance
The inner entity drills down and gets more detail about each item in the
outer loop. It creates one document. 

-Original Message-
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED] 
Sent: Friday, October 31, 2008 10:24 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH Http input bug - problem with two-level RSS walker

On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <[EMAIL PROTECTED]>
wrote:

> I wrote a nested HttpDataSource RSS poller. The outer loop reads an 
> rss feed which contains N links to other rss feeds. The nested loop 
> then reads each one of those to create documents. (Yes, this is an 
> obnoxious thing to do.) Let's say the outer RSS feed gives 10 items. 
> Both feeds use the same
> structure: /rss/channel with a  node and then N  nodes 
> inside the channel. This should create two separate XML streams with 
> two separate Xpath iterators, right?
>
> 
>
>
>
>
>
>
> 
>
> This does indeed walk each url from the outer feed and then fetch the 
> inner rss feed. Bravo!
>
> However, I found two separate problems in xpath iteration. They may be

> related. The first problem is that it only stores the first document 
> from each "inner" feed. Each feed has several documents with different

> title fields but it only grabs the first.
>

The idea behind nested entities is to join them together so that one
Solr document is created for each root entity and the child entities
provide more fields which are added to the parent document.

I guess you want to create separate Solr documents from the root entity
as well as the child entities. I don't think that is possible with
nested entities. Essentially, you are trying to crawl feeds, not join
them.

An integration with Apache Droids might be worth considering.
http://incubator.apache.org/projects/droids.html
http://people.apache.org/~thorsten/droids/

If you are going to crawl only one level, there may be a workaround.
However, it may be easier to implement all this with your own Java
program and just post results to Solr as usual.



> The other is an off-by-one bug. The outer loop iterates through the 10

> items and then tries to pull an 11th.  It then gives this exception 
> trace:
>
> INFO: Created URL to:  [inner url]
> Oct 31, 2008 11:21:20 PM 
> org.apache.solr.handler.dataimport.HttpDataSource
> getData
> SEVERE: Exception thrown while getting data
> java.net.MalformedURLException: no protocol: null/account.rss
>at java.net.URL.(URL.java:567)
>at java.net.URL.(URL.java:464)
>at java.net.URL.(URL.java:413)
>at
>
> org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour
> ce.jav
> a:90)
>at
>
> org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour
> ce.jav
> a:47)
>at
>
> org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.j
> ava:18
> 3)
>at
>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPat
> hEntit
> yProcessor.java:210)
>at
>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(X
> PathEn
> tityProcessor.java:180)
>at
>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathE
> ntityP
> rocessor.java:160)
>at
>
>
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.j
ava:
> 285)
>  ...
> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder
> buildDocument
> SEVERE: Exception while processing: album document :
> SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
> org.apache.solr.handler.dataimport.DataImportHandlerException: 
> Exception in invoking url null Processing Document # 11
>at
>
> org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour
> ce.jav
> a:115)
>at
>
> org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour
> ce.jav
> a:47)
>
>
>
>
>
>


--
Regards,
Shalin Shekhar Mangar.


RE: customizing results in StandardQueryHandler

2008-10-24 Thread Norskog, Lance
Ah!  This will let you post-process result sets with an XSL script:

http://wiki.apache.org/solr/XsltResponseWriter 

-Original Message-
From: Manepalli, Kalyan [mailto:[EMAIL PROTECTED] 
Sent: Friday, October 24, 2008 11:44 AM
To: solr-user@lucene.apache.org
Subject: RE: customizing results in StandardQueryHandler

Ryan,
Actually, what I need is: I always query for a set of fields say
(f1, f2, f3 .. f6). Now once I get the results, based on some logic, I
need to generate the XML which is customized and contains only fields
say (f2, f3, and some new data). 
So the fl will always be (f1 ... f6)



Thanks,
Kalyan Manepalli

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED]
Sent: Friday, October 24, 2008 1:25 PM
To: solr-user@lucene.apache.org
Subject: Re: customizing results in StandardQueryHandler

isn't this just: fl=f1,f3,f4  etc

or am I missing something?


On Oct 24, 2008, at 12:26 PM, Manepalli, Kalyan wrote:

> Hi,
>   In my usecase, I query a set of fields. Then based on the
results, I 
> want to output a customized set of fields. Can I do this without using

> a search component?
> E.g. I query for fields f1, f2, f3, f4. Now based on some conditions, 
> I want to output just f1, f3, f4 (the list of final fields may vary).
>
> How do I rewrite the resultant xml optimally?
> Any thoughts on this will be helpful
>
> Thanks,
> Kalyan



RE: scaling / sharding questions

2008-06-17 Thread Norskog, Lance
I cannot facet on one huge index; it runs out of ram when it attempts to
allocate a giant array. If I store several shards in one JVM, there is
no problem.

Are there any performance benefits to a large index vs. several small
indexes?

Lance 

-Original Message-
From: Marcus Herou [mailto:[EMAIL PROTECTED] 
Sent: Sunday, June 15, 2008 10:24 PM
To: solr-user@lucene.apache.org
Subject: Re: scaling / sharding questions

Yep got that.

Thanks.

/M

On Sun, Jun 15, 2008 at 8:42 PM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:

> With Lance's MD5 schema you'd do this:
>
> 1 shard: 0-f*
> 2 shards: 0-8*, 9-f*
> 3 shards: 0-5*, 6-a*, b-f*
> 4 shards: 0-3*, 4-7*, 8-b*, c-f*
> ...
> 16 shards: 0*, 1*, 2*... d*, e*, f*
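
A sketch of how an id's leading hex digits could be mapped to one of N shards
under this scheme (illustrative only, not part of Solr; it splits the keyspace
into 256 even buckets, so the boundaries differ slightly from the list above):

  // Illustrative only: route a 32-hex-character MD5 id to one of numShards
  // shards by its two leading hex digits (256 buckets).
  static int shardFor(String md5HexId, int numShards) {
    int bucket = Integer.parseInt(md5HexId.substring(0, 2), 16);  // 0..255
    return bucket * numShards / 256;
  }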
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message 
> > From: Marcus Herou <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Cc: [EMAIL PROTECTED]
> > Sent: Saturday, June 14, 2008 5:53:35 AM
> > Subject: Re: scaling / sharding questions
> >
> > Hi.
> >
> > We as well use md5 as the uid.
> >
> > I guess by saying each 1/16th is because the md5 is hex, right?
(0-f).
> > Thinking about md5 sharding.
> > 1 shard: 0-f
> > 2 shards: 0-7:8-f
> > 3 shards: problem!
> > 4 shards: 0-3
> >
> > This technique would require that you double the amount of shards 
> > each
> time
> > you split right ?
> >
> > Split by delete sounds really smart, damn that I didn't think of 
> > that :)
> >
> > Anyway over time the technique of moving the whole index to a new 
> > shard
> and
> > then delete would probably be more than challenging.
> >
> >
> >
> >
> > I will never ever store the data in Lucene mainly because of bad exp

> > and since I want to create modules which are fast,  scalable and 
> > flexible and storing the data alongside with the index do not match 
> > that for me at
> least.
> >
> > So yes I will have the need to do a "foreach id in ids get document"
> > approach in the searcher code, but at least I can optimize the 
> > retrieval
> of
> > docs myself and let Lucene do what it's good at: indexing and 
> > searching
> not
> > storage.
> >
> > I am more and more thinking in terms of having different levels of
> searching
> > instead of searching in all shards at the same time.
> >
> > Let's say you start with 4 shards where you each document is 
> > replicated 4 times based on publishdate. Since all shards have the 
> > same data you can
> lb
> > the query to any of the 4 shards.
> >
> > One day you find that 4 shards is not enough because of search
> performance
> > so you add 4 new shards. Now you only index these 4 new shards with 
> > the
> new
> > documents making the old ones readonly.
> >
> > The searcher would then prioritize the new shards and only if the 
> > query returns less than X results you start querying the old shards.
> >
> > This have a nice side effect of having the most relevant/recent 
> > entries
> in
> > the index which is searched the most. Since the old shards will be 
> > mostly idle you can as well convert 2 of the old shards to "new" 
> > shards reducing the need for buying new servers.
> >
> > What I'm trying to say is that you will end up with an architecture 
> > which has many nodes on top which each have few documents, and fewer
> > and fewer nodes as you go down the architecture but where each node 
> > stores more documents since the search speed gets less and less
> > relevant.
> >
> > Something like this:
> >
> >  - Primary: 10M docs per shard, make sure 95% of the results
> comes
> > from here.
> > - Standby: 100M docs per shard - merges of 10 primary
indices.
> >  zz - Archive: 1000M docs per shard - merges of 10 standby
indices.
> >
> > Search top-down.
> > The numbers are just speculative. The drawback with this 
> > architecture is that you get no indexing benefit at all if the 
> > architecture drawn above
> is
> > the same as which you use for indexing. I think personally you 
> > should use
> X
> > indexers which then merge indices (MapReduce) for max performance 
> > and lay them out as described above.
> >
> > I think Google do something like this.
> >
> >
> > Kindly
> >
> > //Marcus
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Sat, Jun 14, 2008 at 2:27 AM, Lance Norskog wrote:
> >
> > > Yes, I've done this split-by-delete several times. The halved 
> > > index
> still
> > > uses as much disk space until you optimize it.
> > >
> > > As to splitting policy: we use an MD5 signature as our unique ID. 
> > > This
> has
> > > the lovely property that we can wildcard.  'contentid:f*' denotes 
> > > 1/16
> of
> > > the whole index. This 1/16 is a very random sample of the whole
index.
> We
> > > use this for several things. If we use this for shards, we have a 
> > > query that matches a shard's contents.
> > >
> > > The Solr/Lucene syntax does not support modular arithmetic, and so 
> > > it
> will
> > > not let you query a subset

RE: Strategy for presenting fresh data

2008-06-12 Thread Norskog, Lance
You can also use a shared file system mounted on a common SAN. 
(This is a high-end server configuration.) 

-Original Message-
From: James Brady [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 12, 2008 9:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Strategy for presenting fresh data

>>
>> In the meantime, I had imagined that, although clumsy,  federated 
>> search could be used for this purpose - posting the new documents to 
>> a group of servers ('latest updates servers') with v limited amount 
>> of documents with v. fast "reload / refresh" times, and sending them 
>> again (on a work queue, possibly), to the 'core servers'. Regularly 
>> cleaning the 'latest updates servers'
>> of the
>> already posted documents to 'core servers' would keep them lean...   
>> of course,
>> this approach sucks compared to a proper solution like what James is 
>> suggesting
>> :)
>>


Otis - is there an issue I should be looking at for more information on
this?

Yes, in principle, sending updates both to a fresh, forgetful and fast
index and a larger, slower index is what I'm thinking of doing.

The only difference is that I'm talking about having the fresh index be
implemented as a RAMDirectory in the same JVM as the large index.

This means that I can avoid the slowness of cross-disk or cross-machine
replication, I can avoid having to index all documents in two places and
I cut out the extra moving part of federated search.

On the other hand, I am going to have to write my own piece to handle
the index flushes and federate searches to the fast and large indices.

Thanks for your input!
James


XSL scripting

2008-06-07 Thread Norskog, Lance
This started out in the num-docs thread, but deserves its own. And a
wiki page.

There is a more complex and general way to get the number of documents in
the index. I run a query against solr and postprocess the output with an
XSL script.

Install this XSL script as <solr home>/conf/xslt/numfound.xsl.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="response/result/@numFound"/>
  </xsl:template>
</xsl:stylesheet>

Make sure 'curl' is installed, and add numfound.sh, a unix shell script.

SHARD=localhost:8080/solr
QUERY="$1"

LINK="http://$SHARD/select?indent=on&version=2.2&q=$QUERY&start=0&rows=0&fl=*&wt=xslt&tr=numfound.xsl"
curl --silent "$LINK" -H "Content-Type:text" -X GET

Run it as 
sh numfound.sh "*:*"
 
How to install the XSLT script is to be found on the Wiki.
Star-colon-star is magic for 'all records'.
 

XSL is appalling garbage.

Cheers!
 


RE: Solr indexing configuration help

2008-06-02 Thread Norskog, Lance
Solr 1.2 ignores the 'number of documents' attribute. It honors the
"every 30 minutes" attribute.

Lance 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Sunday, June 01, 2008 6:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr indexing configuration help

On Sun, Jun 1, 2008 at 4:43 AM, Gaku Mak <[EMAIL PROTECTED]> wrote:
> I have tried Yonik's suggestions with the following:
> 1) all autowarming are off
> 2) commented out firstsearch and newsearcher event handlers
> 3) increased autocommit interval to 600 docs and 30 minutes 
> (previously 50 docs and 5 minutes)

Glad it looks like your memory issues are solved, but I really wouldn't
use "docs" at all for an autocommit criterion; it will just slow down
your full index builds.

-Yonik

> In addition, I updated the java option with the following:
> -d64 -server -Xms2048M -Xmx3072M -XX:-HeapDumpOnOutOfMemoryError 
> -XX:+UseSerialGC
>
> Results:
> I'm currently at 100,000 documents now with about 9.0GB index on a 
> quad machine with 4GB ram.  The stress test is to add 20 documents 
> every 30 seconds now.
>
> It seems like the serial GC works better than the other two 
> alternatives (-XX:+UseParallelGC or -XX:+UseConcMarkSweepGC) for some 
> reason.  I have not seen any OOM since the changes mentioned above 
> (yet).  If others have better experience with other GC and know how to

> configure it properly, please let me know because using serial GC just
doesn't sound right on a quad machine.
>
> Additional questions:
> Does anyone know how solr/lucene use heap in terms of their 
> generations (young vs tenured) on the indexing environment?  If we 
> have this answer, we would be able to better configure the 
> young/tenured ratio in the heap.  Any help is appreciated!  Thanks!
>
> Now, I'm looking into configuring the slave machines.  Well, that's a 
> separate question.
>
>
>
> Yonik Seeley wrote:
>>
>> Some things to try:
>> - turn off autowarming on the master
>> - turn off autocommit, unless you really need it, or change it to be 
>> less agressive:  autocommitting every 50 docs is bad if you are 
>> rapidly adding documents.
>> - set maxWarmingSearchers to 1 to prevent the buildup of searchers
>>
>> -Yonik
>>
>> On Fri, May 30, 2008 at 3:39 PM, Gaku Mak <[EMAIL PROTECTED]> wrote:
>>>
>>> I started running the test on 2 other machines with similar specs 
>>> but more RAM (4G). One of them now has about 60k docs and still 
>>> running fine. On the other machine, solr died at about 43k docs. A 
>>> short while before solr died, I saw that there were 5 searchers at 
>>> the same time. Do any of you know why would solr create 5 searchers,

>>> and if that could cause solr to die? Is there any way to prevent 
>>> this? Also is there a way to totally disable the searcher and 
>>> whether that is a way to optimize the solr master?
>>>
>>> I copied the following from the SOLR Statistics page in case it has 
>>> interested info:
>>>
>>> name:[EMAIL PROTECTED] main
>>> class:  org.apache.solr.search.SolrIndexSearcher
>>> version:1.0
>>> description:index searcher
>>> stats:  caching : true
>>> numDocs : 42754
>>> maxDoc : 42754
>>> readerImpl : MultiSegmentReader
>>> readerDir :
>>> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/so
>>> lr/data/index
>>> indexVersion : 1211702500453
>>> openedAt : Fri May 30 10:04:15 PDT 2008 registeredAt : Fri May 30 
>>> 10:05:05 PDT 2008
>>>
>>> name:   [EMAIL PROTECTED] main
>>> class:  org.apache.solr.search.SolrIndexSearcher
>>> version:1.0
>>> description:index searcher
>>> stats:  caching : true
>>> numDocs : 42754
>>> maxDoc : 42754
>>> readerImpl : MultiSegmentReader
>>> readerDir :
>>> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/so
>>> lr/data/index
>>> indexVersion : 1211702500453
>>> openedAt : Fri May 30 10:03:24 PDT 2008 registeredAt : Fri May 30 
>>> 10:03:41 PDT 2008
>>>
>>> name:   [EMAIL PROTECTED] main
>>> class:  org.apache.solr.search.SolrIndexSearcher
>>> version:1.0
>>> description:index searcher
>>> stats:  caching : true
>>> numDocs : 42675
>>> maxDoc : 42675
>>> readerImpl : MultiSegmentReader
>>> readerDir :
>>> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/so
>>> lr/data/index
>>> indexVersion : 1211702500450
>>> openedAt : Fri May 30 10:00:53 PDT 2008 registeredAt : Fri May 30 
>>> 10:01:05 PDT 2008
>>>
>>> name:   [EMAIL PROTECTED] main
>>> class:  org.apache.solr.search.SolrIndexSearcher
>>> version:1.0
>>> description:index searcher
>>> stats:  caching : true
>>> numDocs : 42697
>>> maxDoc : 42697
>>> readerImpl : MultiSegmentReader
>>> readerDir :
>>> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/so
>>> lr/data/index
>>> indexVersion : 1211702500451
>>> openedAt : Fri May 30 10:02:20 PDT 2008 registeredAt : Fri May 30 
>>> 10:02:22 PDT 2008
>>>
>>> name:   [EMAIL PROTECTED] main
>>> class:  org.ap

RE: MultiCore on Wiki

2008-04-30 Thread Norskog, Lance
I think I meant: this writeup implies to me that two cores could share
the same "default" index. I don't see how this would work, or be useful.

Thanks,

Lance Norskog 

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 30, 2008 9:30 PM
To: solr-user@lucene.apache.org
Subject: Re: MultiCore on Wiki


On Apr 30, 2008, at 11:52 PM, Lance Norskog wrote:
> The MultiCore writeup on the Wiki 
> (http://wiki.apache.org/solr/MultiCore
> )
> says:
>
> ...
> Configuration->core->dataDir
>   The data directory for a given core. (optional)
>
> How can a core not have its own dataDir? What happens if this is not 
> set?
>

It defaults to the "normal" location, that is whatever is specified in  
solrconfig.xml or "data"  relative to the solr.home for that directory.

(I'm not looking at the code now, so try it out and see...)

ryan


RE: indexing text containing xml tags

2008-04-19 Thread Norskog, Lance
We wrap everything in CDATA tags. Works great. 
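
For example, a helper along these lines (a sketch only; this method is not
part of Solr) builds the field element for an update message:

  /** Wrap a field value in a CDATA section so embedded markup is not parsed
   *  as XML by Solr's update handler. A CDATA section cannot itself contain
   *  the sequence "]]>", so that sequence is split if it appears. */
  static String cdataField(String name, String value) {
    String safe = value.replace("]]>", "]]]]><![CDATA[>");
    return "<field name=\"" + name + "\"><![CDATA[" + safe + "]]></field>";
  }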

-Original Message-
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 18, 2008 10:41 PM
To: [EMAIL PROTECTED]
Cc: solr-user@lucene.apache.org
Subject: Re: indexing text containing xml tags

CC'ing the solr-user mailing list because that is the right list for
usage questions.
You'll need to XML encode your title field. Basically you need to
replace '<' with &lt;, etc.; then you will be able to index them.

On Sat, Apr 19, 2008 at 10:54 AM, Saurabh Kataria <[EMAIL PROTECTED]>
wrote:

>
> Hi everyone,
>
> I am having a problem while indexing my document. A very typical field

> of my document looks like:
>
> pKa Values of the Opened

> Form of a Thieno-1,2,4-triazolo-1,4-diazepine in Water
>
> solr has a problem indexing this because of the xml tags. I was 
> wondering if there is any way that I can index this field "title" 
> without stripping off my tags. If anyone could help me out, that would
be great.
>
> Thanks,
> SK.
>



--
Regards,
Shalin Shekhar Mangar.


RE: capping term frequency?

2008-04-14 Thread Norskog, Lance
Doing this well is harder. Giving a spam score to each page and boosting
by a function on this score is probably a stronger tool. Can't remember
where I found it, but the paper below gives a solid spam score algorithm
for several easy-to-code text analyses and a scoring function. This
assumes you pre-process.

Detecting Spam Web Pages through Content Analysis
WWW 2006, May 23-26, 2006, Edinburgh, Scotland.
ACM 1-59593-323-9/06/0005.

Also "Z. Gyongyi and H. Garcia-Molina." have some interesting papers. 



-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 11, 2008 1:12 PM
To: solr-user@lucene.apache.org
Subject: Re: capping term frequency?

Hi,

Probably by writing your own Similarity (Lucene codebase) and
implementing the following method with capping:

  /** Implemented as sqrt(freq). */
  public float tf(float freq) {
return (float)Math.sqrt(freq);
  }

Then put that custom Similarity in a jar in Solr's lib and specify your
Similarity FQCN at the bottom of solrconfig.xml
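
For example, a capped variant might look like this (a minimal sketch against
the Lucene 2.x API; the cap of 5 is arbitrary and only for illustration):

import org.apache.lucene.search.DefaultSimilarity;

// Sketch: same as the default sqrt(freq), but never beyond sqrt(MAX_TF),
// so heavily repeated terms stop gaining score past the cap.
public class CappedTfSimilarity extends DefaultSimilarity {
  private static final float MAX_TF = 5f;

  public float tf(float freq) {
    return (float) Math.sqrt(Math.min(freq, MAX_TF));
  }
}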

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: peter360 <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, April 11, 2008 2:16:53 PM
Subject: capping term frequency?


Hi,
How do I cap the term frequency when computing relevancy scores in solr?

The problem is if a keyword repeats many times in the same document, I
don't want it to hijack the relevancy score.  Can I tell solr to cap the
term frequency at a certain threshold?

thanks.






RE: Facet Query

2008-04-11 Thread Norskog, Lance
Ok.

I have a query that returns a set A. Doing a facet on field F gives me:
All values of F in the index given as count(*)
And these values can include 0.

I add a facet query that returns B. The facet operation now returns
count(*) on only the values of F that are found in query B.
Query B is only used as a set, none of the counts in query B are used.

Is this it?

Thanks,

Lance

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Friday, April 11, 2008 1:36 PM
To: solr-user@lucene.apache.org
Cc: Norskog, Lance
Subject: Re: Facet Query

On Fri, Apr 11, 2008 at 4:32 PM, Lance Norskog <[EMAIL PROTECTED]>
wrote:
> What do facet queries do that is different from the regular query?  
> What is  a use case where I would use a facet.query in addition to the
regular query?

It returns the number of documents that match the query AND the
facet.query.
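
For example (illustrative field names and values): with
q=camera&facet=true&facet.field=brand&facet.query=price:[0 TO 100], the
facet_queries section of the response reports how many of the documents
matching q=camera also fall in that price range, alongside the normal
per-brand facet counts.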

-Yonik


RE: indexing slow, IO-bound?

2008-04-07 Thread Norskog, Lance
Also Linux has optional file systems that might be better for this. We
plan to try them.  ReiserFS and XFS have good reputations. (Reiser
himself, that's a different story :(

Cheers,

Lance

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED] 
Sent: Monday, April 07, 2008 12:04 PM
To: solr-user@lucene.apache.org
Subject: Re: indexing slow, IO-bound?

On 5-Apr-08, at 7:09 AM, Britske wrote:

> Indexing of these documents takes a long time. Because of the size of 
> the documents (because of the indexed fields) I am currently batching 
> 50 documents at once which takes about 2 seconds. Without adding the 
> 1 indexed fields to the document, indexing flies at about 15 ms 
> for these 50 documents. Indexing is done using SolrJ.
>
> This is on a intel core 2 6400 @2.13ghz and 2 gb ram.
>
> To speed this up I let 2 threads do the indexing in parallel. What 
> happens is that solr just takes double the time (about 4 seconds) to 
> complete these two jobs of 50 docs each in parallel. I figured because

> of the multi- core setup indexing should improve, which it doesn't.

Multiple processors really only help indexing speeds when there is heavy
analysis.

> Does this perhaps indicate that the setup is IO-bound? What would be 
> your best guess  (given the fact that the schema has a big amount of 
> indexed
> fields) to try next to improve indexing performance?

Use Lucene 2.3 with solr 1.2, or simply try out solr trunk.  The
indexing has been reworked to be considerably faster (it also makes
better use of multiple processors by spawning a background merging
thread).

-Mike


RE: Merging Solr index

2008-04-05 Thread Norskog, Lance
Thanks!

I have learned Solr as a power user and written a couple of simple
filters. I'm not a Lucene heavy. Where is this in Lucene?  Is it the
default? I don't remember Lucene having the notion of a unique id
(primary key).

In this merge code, with the latest Lucene 2.3, will the duplicates in
solr/data1 override the records in solr/data0? Or the other way around?

How do I add the new Lucene implementation?

try {
    IndexWriter writer = new IndexWriter(new File("solr/data0/index"),
        new StandardAnalyzer(), false);
    Directory[] dirs = new Directory[] {
        FSDirectory.getDirectory(new File("solr/data1/index")) };
    System.out.println(writer);
    writer.addIndexes(dirs);
    writer.close();
} catch (Exception e) {
    e.printStackTrace();
}

Thanks,

Lance Norskog


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Saturday, April 05, 2008 2:37 PM
To: solr-user@lucene.apache.org
Cc: Norskog, Lance
Subject: Re: Merging Solr index

On Fri, Apr 4, 2008 at 6:26 PM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
>  http://wiki.apache.org/solr/MergingSolrIndexes recommends using the  
> Lucene contributed app IndexMergeTool to merge two Solr indexes. What

> happens if both indexes have records with the same unique key? Will 
> they  both go into the new index?

Yes.

>  Is the implementation of unique IDs in the Solr java or in Lucene?

Both.  It was originally just in Solr, but Lucene now has an
implementation.
Neither implementation will prevent this as both just remember documents
(in memory) that were added and then periodically delete older documents
with the same id.

-Yonik


Merging Solr index

2008-04-04 Thread Norskog, Lance
Hi-
 
http://wiki.apache.org/solr/MergingSolrIndexes recommends using the
Lucene contributed app IndexMergeTool to merge two Solr indexes. What
happens if both indexes have records with the same unique key? Will they
both go into the new index?
 
Is the implementation of unique IDs in the Solr java or in Lucene? If it
is in Solr, how would I hack up a Solr IndexMergeTool?
 
Cheers,
 
Lance Norskog
 


RE: Search exact terms

2008-04-02 Thread Norskog, Lance
This is confusing advice to a beginner. A string field will not find a
word in the middle of a sentence.

To get normal searches without this confusion, copy the 'text' type and
make a variant without the Stemmer. The problem is that you are using an
English language stemmer for what appears to be Dutch. There is a Dutch
stemmer; it might be better for your needs if the content is all Dutch.

To make an exact search field which still has helpful searching
properties, make another variant of text that breaks up words but does
not stem. You might also want to add the ISOLatin1 filter which maps all
European characters to USASCII equivalents. This is also very helpful
for multi-language searching.

Lance

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 02, 2008 7:06 AM
To: solr-user@lucene.apache.org
Subject: Re: Search exact terms

search is based on the fields you index and how you index them.

If you index using the "text" field -- with stemming etc, you will have
to search with the same criteria.

If you want exact search, consider the "string" type.  If you want both,
you can use the <copyField> directive to copy the same content into multiple
fields so it is searchable multiple ways

ryan



On Apr 2, 2008, at 4:46 AM, Tim Mahy wrote:
> Hi all,
>
> is there a Solr wide setting that with which I can achieve the 
> following :
>
> if I now search for q=onderwij, I also receive documents with results 
> of "onderwijs" etc. This is of course the behavior that is described, 
> but if I search on "onderwij", I still get the "onderwijs"
> hits. I use for this field the type "text" from the schema.xml that is

> supplied with the default Solr.
>
> Is there a global setting on Solr to always search Exact ?
>
> Greetings,
>
> Tim
>
>
>
>
>
> Info Support - http://www.infosupport.com
>



RE: sort by index id descending?

2008-03-19 Thread Norskog, Lance
... another "magic" field name like "score" ...

This could be done with a separate "magic" punctuation like $score,
$mean (the mean score), etc., so $docid would work.

Cheers,

Lance

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 18, 2008 9:01 PM
To: solr-user
Subject: Re: sort by index id descending?


: Is there any way to sort by index id - descending? (by order of
indexed)

Not that i can think of.  Lucene already has support for it, so it would
probably be a fairly simple patch if someone wanted to try to implement
it, we just need some syntax to make the parameter parsing construct the
right Sort object -- although I'm loath to add another "magic" field
name like "score" since "docid" or "id" or anything else we can think of
could easily conflict with a field name in someone's schema.

if we add something like this I'd want to add configuration to
solrconfig.xml to determine what the "magic" field names for sorting by
internal id and score should be.


-Hoss



RE: Finding an empty field

2008-03-14 Thread Norskog, Lance
It was a surprise to discover that 
dateorigin_sort:""
is a syntax error, but
dateorigin_sort:["" TO *]
is legit. Does this indicate a bug in the Lucene syntax parser?

Anyway, with a little more research I discover that this query:
http://64.71.164.205:8080/solr/select/?q=*:*&version=2.2&start=0&rows=0&indent=on&facet=true&facet.field=dateorigin_sort&facet.mincount=0&facet.sort=false

This query says, "Select all records in the index. For each indexed
value of dateorigin_sort, count the number of records with that value."
It yields the following output snippet:

<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="dateorigin_sort">
   <int name="">0</int>
   <int name="...">1</int>
   <int name="...">1</int>

Umm... it has an indexed empty value that does not correspond to a
record?  Is it an unanchored data item in the index? Would optimizing
make this index data go away?

Thanks,

Lance

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 14, 2008 4:46 PM
To: solr-user@lucene.apache.org
Subject: RE: Finding an empty field


: dateorigin_sort:""  gives a syntax error.  I'm using Solr 1.2. Should
: this work in Solr 1.3? Is it legal in a newer Lucene parser?

Hmm.. not sure.  did you try the range query suggestion? ...

: well, technically range queries "work", they just don't "work" on numeric
: ranges ... they'd be lexicographical ranges on the string value, so...
: 
:   dateorigin_sort:[* TO " "] 
: 
: ...could probably help you find anything that is lexicographically lower
: than a string representation of an integer (assuming dateorigin_sort:"" 
: doesn't work)



-Hoss



RE: Finding an empty field

2008-03-14 Thread Norskog, Lance

dateorigin_sort:""  gives a syntax error.  I'm using Solr 1.2. Should
this work in Solr 1.3? Is it legal in a newer Lucene parser?



message Query parsing error: Cannot parse 'dateorigin_sort:""':
Lexical error at line 1, column 19. Encountered: <EOF> after : "\"\""

description The request sent by the client was syntactically
incorrect (Query parsing error: Cannot parse 'dateorigin_sort:""':
Lexical error at line 1, column 19. Encountered: <EOF> after :
"\"\""). 

Thanks,

Lance

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 14, 2008 11:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Finding an empty field


: Somehow the index has acquired one record out of millions in which an
: integer value has been populated by an empty string. I would like to
isolate
: this record and remove it. This field exists solely to make sorting
faster,
: and since it has an empty record, sorting blows up. 
:  
: Is it possible to find this record? Is there any way to differentiate
: between this record and all of the other records which have real
numbers
: populated?  

have you tried searching for...

 dateorigin_sort:""
?

: This query will isolate records which do not have the field populated.
(It
: works on all field types.)
: -dateorigin_sort:[* TO *]
: But, since this record is an integer (not an sint) no other range
query
: works.

well, technically range queries "work", they just don't "work" on numeric
ranges ... they'd be lexicographical ranges on the string value, so...

dateorigin_sort:[* TO " "] 

...could probably help you find anything that is lexicographically lower
than a string representation of an integer (assuming dateorigin_sort:"" 
doesn't work)


disclaimer: i haven't actually tested either of these on an index with a
bogus integer like you describe ... but i'm pretty sure they should work
(given what i'm remembering about the code)


-Hoss



RE: Tomcat and Solr - out of memory

2008-01-08 Thread Norskog, Lance
On Tomcat, an OutOfMemory on a query leaves the server in an OK state, and 
future queries work.
But a facet query that runs out of ram does not free its undone state and all 
future requests get OutOfMemory.

Lance 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Stuart Sierra
Sent: Tuesday, January 08, 2008 7:05 AM
To: solr-user@lucene.apache.org
Subject: Re: Tomcat and Solr - out of memory

On Jan 7, 2008 12:08 PM, Jae Joo <[EMAIL PROTECTED]> wrote:
> What happens if Solr application hit the max. memory of heap assigned?
>
> Will be die or just slow down?

In my (limited) experience (with Jetty), Solr will not die but it will return 
HTTP 500 errors on all requests until it is restarted.

-Stuart



RE: big perf-difference between solr-server vs. SOlrJ req.process(solrserver)

2007-12-27 Thread Norskog, Lance
As a reference:

I have several million records with about 20 fields each. One of them is
100-1k bytes, and the rest are 20-50 bytes. There is a reliable 5%
performance difference between pulling just the unique key field and
pulling all of the fields. 

-Original Message-
From: Geert-Jan Brits [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 27, 2007 8:44 AM
To: solr-user@lucene.apache.org
Subject: Re: big perf-difference between solr-server vs. SOlrJ
req.process(solrserver)

yeah, that makes sense.
so, all in all, could scanning all the fields and loading the 10 fields
add up to cost about the same as or even more than performing the initial
query? (Just making sure)

I am wondering if the following change to the schema would help in this
case:

current setup:
It's possible to have up to 2000 product-variants.
each product-variant has:
- 1 price field (stored / indexed)
- 1 multivalued field which contains product-variant characteristics
(stored / not indexed).

This adds up to the 4000 fields described. Moreover there are some
fields on the product level but these would contribute just a tiny bit to
the overall scanning / loading costs (about 50 -stored and indexed-
fields in total)

possible new setup (only the changes) :
- index but not store the price-field.
- store the price as just another one of the product-variant
characteristics in the multivalued product-variant field.

as a result this would bring back the maximum number of stored fields to
about 2050 from 4050 and thereby about halving scanning / loading costs
while leaving the current querying costs intact.
Indexing costs would increase a bit.

Would you expect the same performance gain?

Thanks,
Geert-Jan

2007/12/27, Yonik Seeley <[EMAIL PROTECTED]>:
>
> On Dec 27, 2007 11:01 AM, Britske <[EMAIL PROTECTED]> wrote:
> > after inspecting solrconfig.xml I see that I already have enabled 
> > lazy field loading by:
> > <enableLazyFieldLoading>true</enableLazyFieldLoading> (I guess it 
> > was enabled by default)
> >
> > Since any query returns about 10 fields (which differ from query to
> query) ,
> > would this mean that only these 10 of about 2000-4000 fields are
> retrieved /
> > loaded?
>
> Yes, but that's not the whole story.
> Lucene stores all of the fields back-to-back with no index (there is 
> no random access to particular stored fields)... so all of the fields 
> must be at least scanned.
>
> -Yonik
>


RE: An observation on the "Too Many Files Open" problem

2007-12-26 Thread Norskog, Lance
In Java files, database handles, and other external open resources are
not automatically closed when the object is garbage-collected. You have
to explicitly close the resource. (There is a feature called
'finalization' where you can implement this for your own classes, but
this has turned out to be a badly designed feature.) 
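
For example (a minimal sketch of the point above):

import java.io.FileInputStream;
import java.io.IOException;

// Close external resources explicitly rather than waiting for garbage
// collection/finalization to release the underlying OS file handle.
static void readAndClose(String path) throws IOException {
  FileInputStream in = new FileInputStream(path);
  try {
    // ... use the stream ...
  } finally {
    in.close();  // releases the file handle immediately
  }
}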

-Original Message-
From: Mark Baird [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 24, 2007 10:25 AM
To: Solr Mailing List
Subject: An observation on the "Too Many Files Open" problem

Running our Solr server (latest 1.2 Solr release) on a Linux machine we
ran into the "Too Many Open Files" issue quite a bit.  We've since
changed the ulimit max filehandle setting, as well as the Solr
mergeFactor setting and haven't been running into the problem anymore.
However we are seeing some behavior from Solr that seems a little odd to
me.  When we are in the middle of our batch index process and we run the
lsof command we see a lot of open file handles hanging around that
reference Solr index files that have been deleted by Solr and no longer
exist on the system.

The documents we are indexing are potentially very large, so due to
various memory constraints we only send 300 docs to Solr at a time.
With a commit between each set of 300 documents.  Now one of the things
that I read may cause old file handles to hang around was if you had an
old IndexReader still open pointing to those old files.  However
whenever you issue a commit to the server it is supposed to close the
old IndexReader and open a new one.

So my question is, when the Reader is being closed due to a commit, what
exactly is happening?  Is it just being set to null and a new instance
being created?  I'm thinking the reader may be sitting around in memory
for a while before the garbage collector finally gets to it, and in that
time it is still holding those files open.  Perhaps an explicit method
call that closes any open file handles should occur before setting the
reference to null?

After looking at the code, it looks like reader.close() is explicitly
being called as long as the closeReader property in SolrIndexSearcher is
set to true, but I'm not sure how to check if that is always getting set
to true or not.  There is one constructor of SolrIndexSearcher that sets
it to false.

Any insight here would be appreciated.  Are stale file handles something
I should just expect from the JVM?  I've never ran into the "Too Many
Files Open" exception before, so this is my first time looking at the
lsof command.  Perhaps I'm reading too much into the data it's showing
me.


Mark Baird


RE: Solr replication

2007-12-19 Thread Norskog, Lance
You can probably find an rsync port for Windows in the gnu32 or cygnus
distributions.  There is a bigger problem here.

To quote myself in another recent mail:
The replication scripts use two Unix file system tricks. 

1) Files are not directly bound with filenames; instead
there is a layer of indirection called an 'inode'. So, multiple file and
directory names point to the same physical file. The "." and ".."
directory entries are implemented this way. 

2) Physical files are bound to all open file descriptors even
after there are no file names for the files. So, file data exists until
all file names are gone AND all open files are gone.

Windows does not (I think) support these features, even if they use an
indirection in their file system. The hardlink tricks are not available.
If you want to replicate with snapshots, you will have to make a
complete copy of your new files at the source, and copy those into the
index directory at the target. You may have to stop Solr at the source
and/or target during these operations.

Lance

-Original Message-
From: Dilip.TS [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 17, 2007 8:26 PM
To: SOLR
Subject: RE: Solr replication

Hi,
 I understand that rsync is a Unix/Linux daemon which needs
to be enabled/running to achieve Solr Collection Distribution.
Do we have any similar support for Solr Collection Distribution in
the Windows environment, or do we need to write equivalent commands (in
the form of batch files) which will do the same steps as the shell
scripts placed under the solr/bin folder?

Thanks in advance.

Regards,
Dilip.

 -Original Message-
From: Bill Au [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 18, 2007 4:00 AM
To: [EMAIL PROTECTED]
Subject: Re: Solr replication


  Rsync is a Unix/Linux command.  I don't know if that's available on
Windows.  All the distribution scripts were developed and tested under
Unix/Linux.  They may or may not work on Windows.  I don't know much
about Windows so if you are running on Windows that I am the wrong
person to be asking help.  You may want to use the mailing list to see
if anyone is doing collection distribution on Windows.

  Solr is accessed through HTTP so you just need to use HTTP (for
example,
IE) on a Windows system to access a Solr server.

  Bill


  On Dec 17, 2007 8:53 AM, Dilip.TS < [EMAIL PROTECTED]> wrote:

Hi Bill,

I have a basic question (as im not an expert in unix).

I understand that rsync is a daemon thread (similar to services
in
Windows).

I'm not clear about what steps are required to set up this rsyncd
daemon thread.
(Don't mind asking this question again since I'm not very clear about
this.)

Does it mean that the SOLR servers (both master and slave) should be
made
running on a unix/linux machine only?

How is a client (using a Windows environment) able to access the
SOLR Server
running on Unix/Platform?

Any links/references would be of great help.

Thanks in advance.

Regards
Dilip




-Original Message-
From: Bill Au [mailto: [EMAIL PROTECTED]
Sent: Saturday, December 15, 2007 1:08 AM
To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: Solr replication


On Dec 14, 2007 7:00 AM, Dilip.TS <[EMAIL PROTECTED]> wrote:

> Hi,
> I have the following requirement for SOLR Collection Distribution
using
> Embedded Solr with the Jetty server:
>
> I have different data folders for multiple instances of SOLR
within the
> Same
> application.
> Im using the same SOLR_HOME with a single bin and conf folder.
>
> My query is:
> 1) Is it possible to have the same SOLR_HOME for multiple solr
instances
> and
> still be able to
>  achieve Solr Distribution?
>  (As i understand that we need to have differnet rsync port for
different
> solr instances)


Yes, solr distribution will work for multiple solr instances even if
they
all use the same SOLR_HOME.
All the distribution scripts have a command line argument for
specifying the
data directory.


>
> 2)Can i get some more information about how to start this rsyncd
daemon
> and
>  which is the best way of doing it i.e. to start during system
reboot or
> doing it manually?


Please note that the rsyncd
 
-CollectionDistributionScripts#head-1e6cdce516ecf1eb31bffceaccf2abeb72bd
ce81

So it is best to configure the master server to run the rsyncd-start
script
at system boot time.  If the rsync daemon has for some reasons been
disabled, it will not be started automatically at system reboot even
if it
is configured to do so.  If rsyncd is started manually, then one
will have
to remember to start it every time the master server is rebooted.


>
> 3)Let me know if my understanding is correct. We require 1 Master
Server
  

RE: retrieve lucene "doc id"

2007-12-18 Thread Norskog, Lance
Exactly.  We have done some projects where we extract records en masse.
With this technique we can make a query that will fetch exactly 3000
+-50  records, and walk through every 50 records using the query as a
filter. Works pretty well.

Lance

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 18, 2007 11:07 AM
To: solr-user@lucene.apache.org
Subject: Re: retrieve lucene "doc id"

Hi Lance,

You said:
We use the standard (some RFC) text representation of 32 hex
characters.
This has the advantage that F* pulls 1/16 of the total index, with a
completely randomized distribution, F**  1/256, etc.  This is very
handy for data analysis and document extraction. 

Could you elaborate on the last sentence?  Maybe give an example of what
you have in mind?
Are you thinking that this, because of uniform distribution, lets you
easily get a subset of documents of predictable size and thus have an
apriori knowledge of how large of a data set you'll get and work with?
Or something else?

Thanks,
Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: "Norskog, Lance" <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, December 17, 2007 2:43:55 PM
Subject: RE: retrieve lucene "doc id"

We are using MD5 to generate our IDs. MD5s are 128 bits creating a very
unique and very randomized number for the content. Nobody has ever
reported two different data sets that create the same MD5.

We use the standard (some RFC) text representation of 32 hex
characters.
This has the advantage that F* pulls 1/16 of the total index, with a
completely randomized distribution, F**  1/256, etc.  This is very
handy for data analysis and document extraction. 

MD5 creates 128 bits, but if your index is small enough that you are
willing to risk it, you could pick 64 bits and park them in a Java
long.
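
A sketch of generating such an id in Java (illustrative only; error handling
and the indexing pipeline are omitted):

import java.math.BigInteger;
import java.security.MessageDigest;

// Render the 128-bit MD5 digest of the content as 32 lowercase hex
// characters, zero-padded, for use as the unique id.
static String md5Hex(String content) throws Exception {
  byte[] digest = MessageDigest.getInstance("MD5").digest(content.getBytes("UTF-8"));
  return String.format("%032x", new BigInteger(1, digest));
}

Formatting with %032x keeps leading zeros, which matters for the prefix
wildcards described above (F* pulling roughly 1/16 of the index).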

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED]
Sent: Monday, December 17, 2007 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: retrieve lucene "doc id"

Yonik Seeley wrote:
> On Dec 17, 2007 1:40 AM, Ben Incani <[EMAIL PROTECTED]>
wrote:
>> I have converted to using the Solr search interface and I am trying 
>> to retrieve documents from a list of search results (where
 previously

>> I had used the doc id directly from the lucene query results) and
 the

>> solr id I have got currently indexed is unfortunately configured not
to be unique!
> 
> Ouch... I'd try to make a unique Id then!
> Or barring that, just try to make the query match exactly the docs
 you

> want back (don't do the 2 phase thing).
> 

In 1.3-dev, you can use UUIDField to have solr generate a UUID for each
doc.

ryan





RE: Issues with postOptimize

2007-12-17 Thread Norskog, Lance
Also, the script itself has to have execute permission. 

Lance
 

-Original Message-
From: climbingrose [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 17, 2007 4:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Issues with postOptimize

Make sure that the user running Solr has permission to execute
snapshooter.
Also, try ./snapshooter instead of snapshooter.

Good luck.

On Dec 18, 2007 10:57 AM, Sunny Bassan <[EMAIL PROTECTED]> wrote:

> I've set up solrconfig.xml to create a snap shot of an index after 
> doing an optimize, but the snap shot cannot be created because of 
> permission issues. I've set permissions to the bin, data and log 
> directories to read/write/execute for all users. Even with these 
> settings I cannot seem to run snapshooter on the
> postOptimize event. Any ideas?
> Could it be a java permissions issue? Thanks.
>
> Sunny
>
> Config settings:
>
> <listener event="postOptimize" class="solr.RunExecutableListener">
>   <str name="exe">snapshooter</str>
>   <str name="dir">/search/replication_test/0/index/solr/bin</str>
>   <bool name="wait">true</bool>
> </listener>
>
> Error:
>
> Dec 17, 2007 7:45:19 AM org.apache.solr.core.RunExecutableListener 
> exec
> FINE: About to exec snapshooter
> Dec 17, 2007 7:45:19 AM org.apache.solr.core.SolrException log
> SEVERE: java.io.IOException: Cannot run program "snapshooter" (in 
> directory "/search/replication_test/0/index/solr/bin"):
> java.io.IOException: error=13, Permission denied  at 
> java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>  at java.lang.Runtime.exec(Runtime.java:593)
>  at
> org.apache.solr.core.RunExecutableListener.exec(RunExecutableListener.
> ja
> va:70)
>  at
> org.apache.solr.core.RunExecutableListener.postCommit(RunExecutableLis
> te
> ner.java:97)
>  at
> org.apache.solr.update.UpdateHandler.callPostOptimizeCallbacks(UpdateH
> an
> dler.java:105)
>  at
>
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.
> java:516)
>  at
> org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateReques
> tH
> andler.java:214)
>  at
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlU
> pd
> ateRequestHandler.java:84)
>  at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandle
> rB
> ase.java:77)
>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
>  at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.
> ja
> va:191)
>  at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter
> .j
> ava:159)
>  at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appli
> ca
> tionFilterChain.java:235)
>  at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFi
> lt
> erChain.java:206)
>  at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperVa
> lv
> e.java:233)
>  at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextVa
> lv
> e.java:175)
>  at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.ja
> va
> :128)
>  at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.ja
> va
> :102)
>  at
>
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.
> java:109)
>  at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java
> :2
> 63)
>  at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:
> 84
> 4)
>  at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proces
> s(
> Http11Protocol.java:584)
>  at
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447
> )  at java.lang.Thread.run(Thread.java:619)
> Caused by: java.io.IOException: java.io.IOException: error=13, 
> Permission denied  at 
> java.lang.UNIXProcess.(UNIXProcess.java:148)
>  at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>  at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>  ... 23 more
>
>
>
>


--
Regards,

Cuong Hoang


RE: Replication hooks - changing the index while the slave is running ...

2007-12-17 Thread Norskog, Lance
It works via two Unix file system tricks. 

1) Files are not directly bound to filenames; instead there is a
layer of indirection called an 'inode'. So, multiple file and directory
names point to the same physical file. The "." and ".." directory
entries are implemented this way. 

2) Physical files are bound to all open file descriptors even after
there are no file names for the files. So, file data exists until all
file names are gone AND all open files are gone.
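
A quick shell demonstration of trick 2 on any Unix box:

  exec 3< somefile     # open a read descriptor on the file
  rm somefile          # the last name is gone...
  cat <&3              # ...but the data is still readable through the open descriptor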

Lance

-Original Message-
From: Tracy Flynn [mailto:[EMAIL PROTECTED] 
Sent: Saturday, December 15, 2007 7:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Replication hooks - changing the index while the slave is
running ...

That helps

Thanks for the prompt reply

On Dec 15, 2007, at 10:15 AM, Yonik Seeley wrote:

> On Dec 14, 2007 7:36 PM, Tracy Flynn
> <[EMAIL PROTECTED]> wrote:
>> 1) The existing index(es) being used by the Solr slave instance are 
>> physically deleted
>> 2) The new index snapshots are renamed/moved from their temporary 
>> installation location to the default index location
>> 3) The slave is sent a 'commit' to force a new IndexReader to start 
>> to read the new index.
>>
>> What happens to search requests against the existing/old index during

>> step 1) and between steps 1 and 2?
>
> Search requests will still work on the old searcher/index.
>
>> Where do they get information if
>> they need to go to disk for results that are not cached? Do they a) 
>> hang b) produce no results c) error in some other way?
>
> A lucene IndexReader keeps all the files open that aren't loaded into 
> memory... and external deletion has no effect on the ability to keep 
> reading these open files (they aren't really deleted yet).
>
> -Yonik



RE: retrieve lucene "doc id"

2007-12-17 Thread Norskog, Lance
We are using MD5 to generate our IDs. MD5s are 128-bit values, giving an
effectively unique, highly randomized number for the content. Accidental
collisions on real-world document sets are essentially unheard of.

We use the standard (some RFC) text representation of 32 hex characters.
This has the advantage that F* pulls 1/16 of the total index, with a
completely randomized distribution, a two-character prefix like F0* pulls
1/256, etc.  This is very handy
for data analysis and document extraction. 

MD5 creates 128 bits, but if your index is small enough that you are
willing to risk it, you could pick 64 bits and park them in a Java long.
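
Generating the key is trivial at feed time; for example (file name made up):

  md5sum mydoc.xml | cut -c1-32    # 32 lowercase hex chars, used as the uniqueKey value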

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 17, 2007 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: retrieve lucene "doc id"

Yonik Seeley wrote:
> On Dec 17, 2007 1:40 AM, Ben Incani <[EMAIL PROTECTED]>
wrote:
>> I have converted to using the Solr search interface and I am trying 
>> to retrieve documents from a list of search results (where previously

>> I had used the doc id directly from the lucene query results) and the

>> solr id I have got currently indexed is unfortunately configured not
>> to be unique!
> 
> Ouch... I'd try to make a unique Id then!
> Or barring that, just try to make the query match exactly the docs you

> want back (don't do the 2 phase thing).
> 

In 1.3-dev, you can use UUIDField to have solr generate a UUID for each
doc.

ryan


RE: Solr 1.3 expected release date

2007-12-12 Thread Norskog, Lance
 
... SOLR-303 (Distributed Search over HTTP)...

Woo-hoo!


-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 12, 2007 12:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 1.3 expected release date

Owens, Martin wrote:
> What date or year do we believe Solr 1.3 will be released?
> 
> Regards, Martin Owens

2008 for sure.  It will be after lucene 2.3 and that is a month(more?)
away.  My honest guess is late Jan to mid Feb.

I think the last *major* change going into 1.3 is SOLR-303 (Distributed
Search over HTTP) -- this will require some reworking of new features
like SearchComponents and solrj.  After that, changes will mostly be for
stability and clarity.

I don't really want to promote using nightly builds, but if you need 1.3
features, the current ones are stable.  The interfaces may change, but
it should not crash or anything like that.

ryan



RE: Facets - What's a better term for non technical people?

2007-12-11 Thread Norskog, Lance
In SQL terms they are roughly 'SELECT field, COUNT(*) ... GROUP BY field' - the distinct values of a single field, with counts. 

-Original Message-
From: Charles Hornberger [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 11, 2007 9:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Facets - What's a better term for non technical people?

FAST calls them "navigators" (which I think is a terrible term - YMMV of course 
:-))

I tend to think that "filters" -- or perhaps "dynamic filters" -- captures the 
essential function.

On Dec 11, 2007 2:38 AM, "DAVIGNON Andre - CETE NP/DIODé/PANDOC"
<[EMAIL PROTECTED]> wrote:
> Hi,
>
> > So, has anyone got a good example of the language they might use 
> > over, say, a set of radio buttons and fields on a web form, to 
> > indicate that selecting one or more of these would return facets. 'Show 
> > grouping by'
> > or 'List the sets that the results fall into' or something similar.
>
> Here's what i found some time : 
> http://www.searchtools.com/info/faceted-metadata.html
>
> It has been quite useful to me.
>
> André Davignon
>
>


RE: SOLR X FAST

2007-12-11 Thread Norskog, Lance
FAST is a little less flexible (no dynamic fields) and not programmable
at the Lucene level.

We recently switched from FAST to Solr because of cost reasons.  They
did not know how to license us; they are used to, say, IBM running FAST
on hundreds of servers.  We are a startup with very specific needs. It's
turned out to be worthwhile because we only want to do one thing really
well and we can customize Solr for it. 

Lance

-Original Message-
From: Nuno Leitao [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 11, 2007 5:51 PM
To: solr-user@lucene.apache.org
Subject: Re: SOLR X FAST


FAST uses two pipelines - an ingestion pipeline (for document feeding)
and a query pipeline which are fully programmable (i.e., you can
customize it fully). At ingestion time you typically prepare documents
for indexing (tokenize, character normalize, lemmatize, clean up text,
perform entity extraction for facets, perform static boosting for
certain documents, etc.), while at query time you can expand synonyms,
and do other general query side tasks (not unlike Solr).

Horizontal scalability means the ability to cluster your search engine
across a large number of servers, so you can scale up on the number of
documents, queries, crawls, etc.

There are FAST deployments out there which run on dozens, in some cases
hundreds of nodes serving multiple terabyte size indexes and achieving
hundreds of queries per seconds.

Yet again, if your requirements are relatively simple then Lucene might
do the job just fine.

Hope this helps.

--Nuno.

On 12 Dec 2007, at 01:33, Ravish Bhagdev wrote:

> Could you please elaborate on what you mean by ingestion pipeline and 
> horizontal scalability?  I apologize if this is a stupid question 
> everyone else on the forum is familiar with.
>
> Thanks,
> Ravi
>
> On Dec 12, 2007 1:09 AM, Nuno Leitao <[EMAIL PROTECTED]> wrote:
>> Depends, if you are looking for a small sized index (gigabytes rather

>> than dozens or hundreds of gigabytes or terabytes) with relatively 
>> simple requirements (a few facets, simple tokenization, English only 
>> linguistics, etc.) Solr is likely to be appropriate for most cases.
>>
>> FAST however gives you great horizontal scalability, out of the box 
>> linguistics for many languages (including CJK), contextual and scope 
>> searching, a web, file and database crawler, programmable ingestion 
>> pipeline, etc.
>>
>> Regards.
>>
>> --Nuno
>>
>>
>> On 11 Dec 2007, at 22:09, William Silva wrote:
>>
>>> Hi,
>>> How is the best way to compare SOLR and FAST Search ?
>>> Thanks,
>>> William.
>>
>>



RE: Cache use

2007-12-06 Thread Norskog, Lance
There are query and document field caches. A query cache is a list of
records that match a query. A document cache actually contains the
fields. Fetching from your query cache still has to assemble the results
from the indexed data. If the ram-based index is paging, that could be
the answer.  

Note that Lucene stores different fields of the same document, and the
index data, in different areas of the index. In my case, with very
small records of maybe 20 fields, there was a 5% difference between
fetching one field and all fields. This could be very different with
your index.
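
The knobs for both live in solrconfig.xml; the stock example values look
roughly like this (tune the sizes to your query mix):

  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
  <documentCache    class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>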

Lance 

-Original Message-
From: sfox [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 06, 2007 1:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Cache use

One possible explanation is that the OS's native file system caching is
being successful at keeping these files mostly in RAM most of the time. 
  And so the performance benefits of 'forcing' the files into RAM by
using tmpfs aren't significant.

So the slowness of the queries is the result of being CPU bound, rather
than IO bound.  The cache within Solr is faster because it is saving and
returning the information for which the CPU-bound work has already been
done.

Just one possible explanation.

Sean Fox

Matthew Phillips wrote:
> No one has a suggestion? I must be missing something because as I 
> understand it from Dennis' email, all of queries are very quick 
> (cached type response times) whereas mine are not. I can clearly see 
> time differences between queries that are cached (things that have 
> been auto
> warmed) and queries that are not. This seems odd as my whole index is 
> loaded on a tmpfs memory based file system. Thanks for the help.
> 
> Matt
> 
> On Dec 4, 2007, at 3:55 PM, Matthew Phillips wrote:
> 
>> Thanks for the suggestion, Dennis. I decided to implement this as you

>> described on my collection of about 400,000 documents, but I did not 
>> receive the results I expected.
>>
>> Prior to putting the indexes on a tmpfs, I did a bit of benchmarking 
>> and found that it usually takes a little under two seconds for each 
>> facet query. After moving my indexes from disk to a tmpfs file 
>> system, I seem to get about the same result from facet queries: about

>> two seconds.
>>
>> Does anyone have any insight into this? Doesn't it seem odd that my 
>> response times are about the same? Thanks for the help.
>>
>> Matt Phillips
>>
>> Dennis Kubes wrote:
>>> One way to do this if you are running on linux is to create a tempfs

>>> (which is ram) and then mount the filesystem in the ram.  Then your 
>>> index acts normally to the application but is essentially served 
>>> from Ram.  This is how we server the Nutch lucene indexes on our web

>>> search engine (www.visvo.com) which is ~100M pages.  Below is how 
>>> you can achieve this, assuming your indexes are in /path/to/indexes:
>>> mv /path/to/indexes /path/to/indexes.dist mkdir /path/to/indexes cd 
>>> /path/to mount -t tmpfs -o size=2684354560 none /path/to/indexes 
>>> rsync --progress -aptv indexes.dist/* indexes/ chown -R user:group 
>>> indexes This would of course be limited by the amount of RAM you 
>>> have on the machine.  But with this approach most searches are 
>>> sub-second.
>>> Dennis Kubes
>>> Evgeniy Strokin wrote:
 Hello,...
 we have 110M records index under Solr. Some queries takes a while, 
 but we need sub-second results. I guess the only solution is cache 
 (something else?)...
 We use standard LRUCache. In docs it says (as far as I understood) 
 that it loads view of index in to memory and next time works with 
 memory instead of hard drive.
 So, my question: hypothetically, we can have all index in memory if

 we'd have enough memory size, right? In this case the result should

 come up very fast. We have very rear updates. So I think this could

 be a solution.
 How should I configure the cache to achieve such approach?
 Thanks for any advise.
 Gene
> 


RE: out of heap space, every day

2007-12-04 Thread Norskog, Lance
"String[nTerms()]": Does this mean that you compare the first term, then
the second, etc.? Otherwise I don't understand how to compare multiple
terms in two records.

Lance 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Tuesday, December 04, 2007 8:06 AM
To: solr-user@lucene.apache.org
Subject: Re: out of heap space, every day

On Dec 4, 2007 10:59 AM, Brian Whitman <[EMAIL PROTECTED]> wrote:
> >
> > For faceting and sorting, yes.  For normal search, no.
> >
>
> Interesting you mention that, because one of the other changes since 
> last week besides the index growing is that we added a sort to an sint

> field on the queries.
>
> Is it reasonable that a sint sort would require over 2.5GB of heap on 
> a 8M index? Is there any empirical data on how much RAM that will
need?

int[maxDoc()] + String[nTerms()] + size_of_all_unique_terms.
Then double that to allow for a warming searcher.

One can decrease this memory usage by using an "integer" instead of an
"sint" field if you don't need range queries.  The memory usage would
then drop to a straight int[maxDoc()] (4 bytes per document).

-Yonik


RE: out of heap space, every day

2007-12-04 Thread Norskog, Lance
Thanks!

I've seen a few formulae like this go by over the months. Can someone
please make a wiki page for memory and processing estimation with
locality properties?  Or is there a Lucene page we can use?
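
(A rough worked example of Yonik's formula below, just to show the shape of
it: an 8M-doc index sorted on one sint field, guessing 1M unique terms of
~8 characters, needs int[maxDoc()] = 8M x 4 bytes = 32 MB, plus a 1M-entry
String[] and the term data - call it another few tens of MB - and then the
total doubles for a warming searcher. The term-count and term-length numbers
are pure guesses; estimating them per index is exactly what a wiki page
would help with.)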

Lance 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Tuesday, December 04, 2007 8:06 AM
To: solr-user@lucene.apache.org
Subject: Re: out of heap space, every day

On Dec 4, 2007 10:59 AM, Brian Whitman <[EMAIL PROTECTED]> wrote:
> >
> > For faceting and sorting, yes.  For normal search, no.
> >
>
> Interesting you mention that, because one of the other changes since 
> last week besides the index growing is that we added a sort to an sint

> field on the queries.
>
> Is it reasonable that a sint sort would require over 2.5GB of heap on 
> a 8M index? Is there any empirical data on how much RAM that will
need?

int[maxDoc()] + String[nTerms()] + size_of_all_unique_terms.
Then double that to allow for a warming searcher.

One can decrease this memory usage by using an "integer" instead of an
"sint" field if you don't need range queries.  The memory usage would
then drop to a straight int[maxDoc()] (4 bytes per document).

-Yonik


RE: How to delete records that don't contain a field?

2007-12-04 Thread Norskog, Lance
Oops, I should explain.  *:* means all records. This trick puts a
positive query in front of your negative query, and that allows it to
work.

Lance 

-Original Message-
From: Rob Casson [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 04, 2007 7:44 AM
To: solr-user@lucene.apache.org
Subject: Re: How to delete records that don't contain a field?

i'm using this:

*:* -[* TO *]

which is what lance suggested..works just fine.

fyi: https://issues.apache.org/jira/browse/SOLR-381
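
the full delete via curl (reusing the _title example from the quoted mail
below) would then be something like:

  curl http://localhost:8080/solr/update -H 'Content-type:text/xml; charset=utf-8' \
    --data-binary '<delete><query>*:* -_title:[* TO *]</query></delete>'

followed by a commit.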

On Dec 3, 2007 8:09 PM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> Wouldn't this be: *:* AND "negative query"
>
>
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik 
> Seeley
> Sent: Monday, December 03, 2007 2:23 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to delete records that don't contain a field?
>
> On Dec 3, 2007 5:22 PM, Jeff Leedy <[EMAIL PROTECTED]> wrote:
>
> > I was wondering if there was a way to post a delete query using curl

> > to delete all records that do not contain a certain field--something

> > like
> > this:
> >
> > curl http://localhost:8080/solr/update --data-binary
> > '<delete><query>-_title:[* TO *]</query></delete>' -H 
> > 'Content-type:text/xml; charset=utf-8'
> >
> > The minus syntax seems to return the correct list of ids (that is, 
> > all
>
> > records that do not contain the "_title" field) when I use the Solr 
> > administrative console to do the above query, so I'm wondering if 
> > Solr
>
> > just doesn't support this type of delete.
>
>
> Not yet... it makes sense to support this in the future though.
>
> -Yonik
>


RE: How to delete records that don't contain a field?

2007-12-03 Thread Norskog, Lance
Wouldn't this be: *:* AND "negative query" 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Monday, December 03, 2007 2:23 PM
To: solr-user@lucene.apache.org
Subject: Re: How to delete records that don't contain a field?

On Dec 3, 2007 5:22 PM, Jeff Leedy <[EMAIL PROTECTED]> wrote:

> I was wondering if there was a way to post a delete query using curl 
> to delete all records that do not contain a certain field--something 
> like
> this:
>
> curl http://localhost:8080/solr/update --data-binary
> '<delete><query>-_title:[* TO *]</query></delete>' -H 
> 'Content-type:text/xml; charset=utf-8'
>
> The minus syntax seems to return the correct list of ids (that is, all

> records that do not contain the "_title" field) when I use the Solr 
> administrative console to do the above query, so I'm wondering if Solr

> just doesn't support this type of delete.


Not yet... it makes sense to support this in the future though.

-Yonik


RE: LowerCaseFilterFactory and spellchecker

2007-11-30 Thread Norskog, Lance
What would also help is a query to find records for the spellcheck
dictionary builder. We would like to make separate spelling indexes: one
for all records in English, one for Spanish, etc. We would also like to
slice & dice the records by other dimensions as well, and have separate
spelling DBs for each partition.

That is, we would like to make an English spelling dictionary and a
Spanish dictionary, and also make subject-specific dictionaries like
News and Sports. These are separate orthogonal partitions of our index.

The usual practice for this is to create separate fields in the records
where one field is only populated for english records, one for spanish
records, etc. In our situation this is not practical for space reasons
and other proprietary reasons. 

Lance

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 29, 2007 6:01 PM
To: solr-user@lucene.apache.org
Subject: Re: LowerCaseFilterFactory and spellchecker

On 29-Nov-07, at 5:40 PM, Chris Hostetter wrote:

>
> I'm not very familiar with the SpellCheckerRequestHandler, but i don't

> think you are doing anything wrong.
>
> a quick skim of the code indicates that the "q" param isn't being 
> analyzed by that handler, so the raw input string is pased to the 
> SpellChecker.suggestSimilar method. This may or may not have been 
> intentional.
>
> I personally can't think of
> any reason why it wouldn't make sense to get the query analyzer for 
> the termSourceField and use it to analyze the q param before getting 
> suggestions.

It does make some sense, but I'm not sure that it should be blindly
analyzed without adding logic to handle certain cases (like the
QueryParser does).  What happens if the analyzer produces two tokens?
The spellchecker has to deal with this appropriately.  Spell checkers
should be able to "reverse analyze" the suggestions as well, so "Pyhton"
gets corrected to "Python" and not "python".  Similarly, "ad-hco" should
probably suggest "ad-hoc" and not "adhoc".

-Mike


Schema class configuration syntax

2007-11-28 Thread Norskog, Lance
Hi-
 
What is the <filter> element in an <analyzer> element that will load
this class:
 
org.apache.lucene.analysis.cn.ChineseFilter
 
This did not work:
 
 

This is in Solr 1.2.
 
Thanks,
 
Lance Norskog


RE: LowerCaseFilterFactory and spellchecker

2007-11-28 Thread Norskog, Lance
Oops, sorry, didn't think that through.

The query to the spellchecker is not filtered through the field query
definition. You have to do your own lower-case transformation when you
do the query.  This is a simple thing to resolve. But, I'm working with
international alphabets and I would like 'protege' and 'protégé' (with
both e's accented) to match. The ISOLatin1 filter does this in indexing
& querying. But I have to rip off the code and use it in my app to
preprocess words for spell-checks.

Lance

-Original Message-
From: Rob Casson [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 28, 2007 5:16 PM
To: solr-user@lucene.apache.org
Subject: Re: LowerCaseFilterFactory and spellchecker

lance,

thanks for the quick reply... looks like 'thorne' is getting added to
the dictionary, as it comes up as a suggestion for 'Thorne'

i could certainly just lowercase in my client, but just confirming that
i'm not just screwing it up in the first place :)

thanks again,
rc

On Nov 28, 2007 8:11 PM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> There are a few parameters for limiting what words are added to the 
> dictionary.  You might be trimming out 'thorne'. See this page:
>
> http://wiki.apache.org/solr/SpellCheckerRequestHandler
>
>
> -Original Message-
> From: Rob Casson [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 28, 2007 4:25 PM
> To: solr-user@lucene.apache.org
> Subject: LowerCaseFilterFactory and spellchecker
>
> think i'm just doing something wrong...
>
> was experimenting with the spellcheck handler with the nightly 
> checkout from 11-28; seems my spellchecking is case-sensitive, even 
> tho i think i'm adding the LowerCaseFilterFactory to both the index 
> and query analyzers.
>
> here's a brief rundown of my testing steps.
>
> from schema.xml:
>
>  positionIncrementGap="100">
> 
> 
> 
>  class="solr.RemoveDuplicatesTokenFilterFactory"/>
> 
> 
> 
> 
> 
>  class="solr.RemoveDuplicatesTokenFilterFactory"/>
> 
> 
> 
>
>  multiValued="true"/>
>  multiValued="true"/>
>
> 
>
> 
>
> from solrconfig.xml:
>
>  class="solr.SpellCheckerRequestHandler" startup="lazy">
> 
> 1
> 0.5
> 
> spell
> spelling
> 
>
> 
>
> adding the doc:
>
> curl http://localhost:8983/solr/update -H "Content-Type: text/xml"
> --data-binary '<add><doc><field name="title">Thorne</field></doc></add>'
> curl http://localhost:8983/solr/update -H "Content-Type: text/xml"
> --data-binary '<commit/>'
>
> 
>
> building the spellchecker:
>
> http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker&cmd=rebuil
> d
>
> 
>
> querying the spellchecker:
>
> results from 
> http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker
>
>  
> 
> 0
> 1
> 
> Thorne
> false
> 
> thorne
> 
> 
>
> results from 
> http://localhost:8983/solr/select/?q=thorne&qt=spellchecker
>
>  
> 
> 0
> 2
> 
> thorne
> true
> 
> 
>
>
> any pointers as to what i'm doing wrong, misinterpreting?  i suspect
i'm
> just doing something bone-headed in the analyzer sections...
>
> thanks as always,
>
> rob casson
> miami university libraries
>


RE: LowerCaseFilterFactory and spellchecker

2007-11-28 Thread Norskog, Lance
There are a few parameters for limiting what words are added to the
dictionary.  You might be trimming out 'thorne'. See this page:

http://wiki.apache.org/solr/SpellCheckerRequestHandler

-Original Message-
From: Rob Casson [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 28, 2007 4:25 PM
To: solr-user@lucene.apache.org
Subject: LowerCaseFilterFactory and spellchecker

think i'm just doing something wrong...

was experimenting with the spellcheck handler with the nightly checkout
from 11-28; seems my spellchecking is case-sensitive, even tho i think
i'm adding the LowerCaseFilterFactory to both the index and query
analyzers.

here's a brief rundown of my testing steps.

from schema.xml:























from solrconfig.xml:



1
0.5

spell
spelling




adding the doc:

curl http://localhost:8983/solr/update -H "Content-Type: text/xml"
--data-binary '<add><doc><field name="title">Thorne</field></doc></add>'
curl http://localhost:8983/solr/update -H "Content-Type: text/xml"
--data-binary '<commit/>'



building the spellchecker:

http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker&cmd=rebuild



querying the spellchecker:

results from http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker




0
1

Thorne
false

thorne



results from http://localhost:8983/solr/select/?q=thorne&qt=spellchecker




0
2

thorne
true




any pointers as to what i'm doing wrong, misinterpreting?  i suspect i'm
just doing something bone-headed in the analyzer sections...

thanks as always,

rob casson
miami university libraries


RE: LSA Implementation

2007-11-27 Thread Norskog, Lance
WordNet itself is English-only. There are various ontology projects for
it.

http://www.globalwordnet.org/ is a separate world language database
project. I found it at the bottom of the WordNet wikipedia page. Thanks
for starting me on the search!

Lance 

-Original Message-
From: Eswar K [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 26, 2007 6:50 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:

> The WordNet project at Princeton (USA) is a large database of
synonyms.
> If you're only working in English this might be useful instead of 
> running your own analyses.
>
> http://en.wikipedia.org/wiki/WordNet
> http://wordnet.princeton.edu/
>
> Lance
>
> -Original Message-
> From: Eswar K [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 26, 2007 6:34 PM
> To: solr-user@lucene.apache.org
> Subject: Re: LSA Implementation
>
> In addition to recording which keywords a document contains, the 
> method examines the document collection as a whole, to see which other

> documents contain some of those same words. this algo should consider 
> documents that have many words in common to be semantically close, and

> ones with few words in common to be semantically distant. This simple 
> method correlates surprisingly well with how a human being, looking at

> content, might classify a document collection. Although the algorithm 
> doesn't understand anything about what the words *mean*, the patterns 
> it notices can make it seem astonishingly intelligent.
>
> When you search an such  an index, the search engine looks at 
> similarity values it has calculated for every content word, and 
> returns the documents that it thinks best fit the query. Because two 
> documents may be semantically very close even if they do not share a 
> particular keyword,
>
> Where a plain keyword search will fail if there is no exact match, 
> this algo will often return relevant documents that don't contain the 
> keyword at all.
>
> - Eswar
>
> On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]>
wrote:
>
> >
> > On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
> >
> > > We essentially are looking at having an implementation for doing 
> > > search which can return documents having conceptually similar 
> > > words without necessarily having the original word searched for.
> >
> > Very challenging.  Say someone searches for "LSA" and hits an 
> > archived
>
> > version of the mail you sent to this list.  "LSA" is a reasonably 
> > discriminating term.  But so is "Eswar".
> >
> > If you knew that the original term was "LSA", then you might look 
> > for documents near it in term vector space.  But if you don't know 
> > the original term, only the content of the document, how do you know

> > whether you should look for docs near "lsa" or "eswar"?
> >
> > Marvin Humphrey
> > Rectangular Research
> > http://www.rectangular.com/
> >
> >
> >
>


RE: LSA Implementation

2007-11-26 Thread Norskog, Lance
The WordNet project at Princeton (USA) is a large database of synonyms.
If you're only working in English this might be useful instead of
running your own analyses.

http://en.wikipedia.org/wiki/WordNet
http://wordnet.princeton.edu/

Lance

-Original Message-
From: Eswar K [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 26, 2007 6:34 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

In addition to recording which keywords a document contains, the method
examines the document collection as a whole, to see which other
documents contain some of those same words. this algo should consider
documents that have many words in common to be semantically close, and
ones with few words in common to be semantically distant. This simple
method correlates surprisingly well with how a human being, looking at
content, might classify a document collection. Although the algorithm
doesn't understand anything about what the words *mean*, the patterns it
notices can make it seem astonishingly intelligent.

When you search such an index, the search engine looks at similarity
values it has calculated for every content word, and returns the
documents that it thinks best fit the query. Because two documents may
be semantically very close even if they do not share a particular
keyword, this algo will often return relevant documents that don't
contain the keyword at all, where a plain keyword search would fail
without an exact match.

- Eswar

On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]> wrote:

>
> On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
>
> > We essentially are looking at having an implementation for doing 
> > search which can return documents having conceptually similar words 
> > without necessarily having the original word searched for.
>
> Very challenging.  Say someone searches for "LSA" and hits an archived

> version of the mail you sent to this list.  "LSA" is a reasonably 
> discriminating term.  But so is "Eswar".
>
> If you knew that the original term was "LSA", then you might look for 
> documents near it in term vector space.  But if you don't know the 
> original term, only the content of the document, how do you know 
> whether you should look for docs near "lsa" or "eswar"?
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>


DirectUpdateHandler and DirectUpdateHandler2

2007-11-26 Thread Norskog, Lance
Hi-
 
We have a situation where we are submitting the same document several
times, and have not handled this the right way yet.  So,
DirectUpdateHandler2 overwrites the existing record.
 
If we used DirectUpdateHandler, we could use the feature where we tell
it to not overwrite existing records. This option is in the DUH2
arguments, but is not implemented in DUH2 for speed reasons.

Are there any features in DUH2 that are not in DUH? I mean semantic
differences, not just speedups.
 
Thanks,
 
Lance Norskog


RE: Performance problems for OR-queries

2007-11-26 Thread Norskog, Lance
https://issues.apache.org/jira/browse/lucene-997 is a patch to limit the time 
used for a query. 

Google clearly estimates the total # of results, and over-estimates.

Lance 

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 22, 2007 1:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Performance problems for OR-queries

On 22-Nov-07, at 6:02 AM, Jörg Kiegeland wrote:

>
>>> 1. Does Solr support this kind of index access with better 
>>> performance ?
>>> Is there anything special to define in schema.xml?
>>>
>>
>> No... Solr uses Lucene at it's core, and all matching documents for a 
>> query are scored.
>>
> So it is not possible to have a "google" like performance with Solr, 
> i.e. to search for a set of keywords and only the 10 best documents 
> are listed, without touching the other millions of (web) documents 
> matching less keywords.
> I infact would not know how to program such an index, however google 
> has done it somehow..

I can be fairly certain that google does not execute queries that match 
millions of documents on a single machine.  The default query operator is 
(mostly) AND, so the possible match sets is much smaller.  Also, I imagine they 
have relatively few documents per machine.

>>> 2. Can one switch off this ordering and just return any 100 
>>> documents fullfilling the query (though  getting best-matching 
>>> documents would be a nice feature if it would be fast)?
>>>
>>
>> a feature like this could be developed... but what is the usecase for 
>> this?  What are you tring to accomplish where either relevancy or 
>> complete matching doesn't matter?  There may be an easier workaround 
>> for your specific case.
>>
> This is not an actual Use-Case for my project, however I just wanted 
> to know if it would be possible.
>
> Because of the performance results, we designed a new type of query. I 
> would like to know how fast it would be before I implement the 
> following query:
>
> I have N keywords and execute a query of the form
>
> keyword1 AND keyword2 AND .. AND keywordN
>
> there may be again some millions of matching documents and I want to 
> get the first 100 documents.
> To have a ordering criteria, each Solr document has a field named 
> "REV" which has a natural number. The returned 100 documents shall be 
> those with the lowest numbers in the "REV" field.
>
> My questions now are:
>
> (1) Will the query perform in O(100) or in O(all possible matches)?

O(all possible matches)

> (2) If the answer to (1) is O(all possible matches), what will be the 
> performance if I dont order for the "REV" field? Will Solr order it 
> after the point of time where a document was created/ modified? What I 
> have to do to get O(100) complexity finally?

Ordering by natural document order in the index is sufficient to achieve 
O(100), but you'll have to insert code in Solr to stop after 100 docs (another 
alternative is to stop processing after a given amount of time).  Also, using 
O() in this case isn't quite accurate:  
there are costs that vary based on the number of docs in the index too.

-Mike


RE: CJK Analyzers for Solr

2007-11-26 Thread Norskog, Lance
I notice this is in the future tense. Is the CJKTokenizer available yet?
From what I can see, the CJK code should be a Filter instead anyway.
Also, the ChineseFilter and CJKTokenizer do two different things. 

CJKTokenizer turns C1C2C3C4 into 'C1C2 C2C3 C3C4'. ChineseFilter (from
2001) turns C1C2 into 'C1 C2'. I hope someone who speaks Mandarin or
Cantonese understands what this should do.

Lance

-Original Message-
From: Eswar K [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 26, 2007 10:28 AM
To: solr-user@lucene.apache.org
Subject: Re: CJK Analyzers for Solr

Hoss,

Thanks a lot. Will look into it.

Regards,
Eswar

On Nov 26, 2007 11:55 PM, Chris Hostetter <[EMAIL PROTECTED]>
wrote:

>
> : Does Solr come with Language analyzers for CJK? If not, can you 
> please
> : direct me to some good CJK analyzers?
>
> Lucene has a CJKTokenizer and CJKAnalyzer in the contrib/analyzers
jar.
> they can be used in Solr.  both have been included in Solr for a while

> now, so you can specify CJKAnalyzer in your schema with Solr 1.2, but 
> starting with Solr 1.3 a Factory for the Tokenizer will also be 
> included so it can be used in a more complex analysis chain defined in
the schema.
>
>
>
> -Hoss
>
>


RE: Weird memory error.

2007-11-20 Thread Norskog, Lance
AppPerfect has a free-for-noncommercial-use version of their tools. I've
used them before and was very impressed.

http://www.appperfect.com/products/devtest.html#versions

 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Tuesday, November 20, 2007 9:12 AM
To: solr-user@lucene.apache.org
Subject: Re: Weird memory error.

On Nov 20, 2007 11:29 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote:
> Can you recommend one? I am not familar with how to profile under
Java.

Netbeans has one for free:
http://www.netbeans.org/products/profiler/

-Yonik


RE: Solr cluster topology.

2007-11-20 Thread Norskog, Lance
http://wiki.apache.org/solr/CollectionDistribution

http://wiki.apache.org/solr/SolrCollectionDistributionScripts

http://wiki.apache.org/solr/SolrCollectionDistributionStatusStats

http://wiki.apache.org/solr/SolrOperationsTools

http://wiki.apache.org/solr/SolrCollectionDistributionOperationsOutline

http://wiki.apache.org/solr/CollectionRebuilding
 
http://wiki.apache.org/solr/SolrAdminGUI




-Original Message-
From: Matthew Runo [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 20, 2007 10:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr cluster topology.

Yes. The clients will always be a minute or two behind the master.

I like the way some people are doing it - make them all masters! Just
post your updates to each of them - you lose a bit of performance
perhaps, but it doesn't matter if a server bombs out or you have to
upgrade them, since they're all exactly the same.
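
For example, something along these lines (host names and update payload are
whatever you actually use):

  for host in solr1 solr2 solr3; do
    curl http://$host:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
      --data-binary @docs.xml
  done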

--Matthew

On Nov 20, 2007, at 7:43 AM, Alexander Wallace wrote:

> Hi All!
>
> I just started reading about Solr a couple of days ago (not full time 
> of course) and it looks like a pretty impressive set of 
> technologies... I have still a few questions I have not clearly found:
>
> Q: On a cluster, as I understand it, one and only one machine is a  
> master, and N servers could be slaves...The clients, do they all  
> talk to the master for indexing and to a load balancer for  
> searching?   Is one particular machine configured to know it is the  
> master? Or is it only the settings for replicating the index that  
> matter?   Or does one post reindex petitions to any of the slaves  
> and they will forward it to the master?
>
> How can we have failover in the master?
>
> It is a correct assumption that slaves could always be a bit out of 
> sync with the master, correct? A matter of minutes perhaps...
>
> Thanks in advance for your responses!
>
>



RE: snappuller rsync parameter error? - "solr" hardcoded

2007-11-14 Thread Norskog, Lance
Be careful. 'rsync' has different meanings for 'directory' vs.
'directory/'. I ran afoul of this.
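
For example (paths made up; the trailing slash is the part that bites):

  rsync -a snapshot.20071114  /data/solr/   # creates /data/solr/snapshot.20071114/
  rsync -a snapshot.20071114/ /data/solr/   # copies the snapshot's contents directly into /data/solr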

-Original Message-
From: Walter Underwood [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 14, 2007 8:49 AM
To: solr-user@lucene.apache.org
Subject: Re: snappuller rsync parameter error? - "solr" hardcoded

I'm not an rsync expert, but I believe that /solr/ is a virtual
directory defined in the rsyncd config. It is mapped to the real
directory.

wunder

On 11/14/07 8:43 AM, "Jae Joo" <[EMAIL PROTECTED]> wrote:

> In the snappuller, the "solr" is hardcoded. Should it be 
> "${master_data_dir}?
> 
> # rsync over files that have changed
> rsync -Wa${verbose}${compress} --delete ${sizeonly} \ ${stats} 
> rsync://${master_host}:${rsyncd_port}/solr/${name}/
> ${data_dir}/${name}-wip
> 
> Thanks,
> 
> Jae



RE: Delte all docs in a SOLR index?

2007-11-09 Thread Norskog, Lance
A safer way is to stop Solr and remove the index directory. There is
less chance of corruption, and it will be faster. 
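
For example, with the stock Jetty example layout (adjust the paths to your
install):

  # stop the servlet container first, then:
  rm -rf example/solr/data/index
  java -jar start.jar    # Solr creates a fresh, empty index on startup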

-Original Message-
From: David Neubert [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 09, 2007 10:56 AM
To: solr-user@lucene.apache.org
Subject: Re: Delte all docs in a SOLR index?

Thanks!

- Original Message 
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, November 9, 2007 1:51:03 PM
Subject: Re: Delte all docs in a SOLR index?



: Sorry for another basic question -- but what is the best safe way to
: delete all docs in a SOLR index.

I thought this was a FAQ, but it's hidden in another question
(rebuilding if schema changes)  i'll pull it out into a top level
question...

<delete><query>*:*</query></delete>

: I am in my first few days using SOLR and Lucene, am iterating the
schema
: often, starting and stoping with test docs, etc.  I like to know a
very
: quick way to clean out the index and start over repeatedly -- can't
seem
: to find it on the wiki -- maybe its Friday :)

Huh .. that's actually the FAQ that does talk about deleting all docs
 :)

"How can I rebuild my index from scratch if I change my schema?"

http://wiki.apache.org/solr/FAQ#head-9aafb5d8dff5308e8ea4fcf4b71f19f029c4bb99



-Hoss








RE: Score of exact matches

2007-11-06 Thread Norskog, Lance
What is the performance profile of this against merely searching against
one field? My situation is millions of small records with an average of
200 bytes/text field.

Lance 

-Original Message-
From: Walter Underwood [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 05, 2007 9:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Score of exact matches

This is fairly straightforward and works well with the DisMax handler.
Index the text into three different fields with three different sets of
analyzers. Use something like this in the request handler:

<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <float name="tie">0.01</float>
    <str name="qf">
      exact^16 noaccent^4 stemmed
    </str>
    <str name="pf">
      exact^16 noaccent^4 stemmed
    </str>
  </lst>
</requestHandler>

You will probably need to adjust the weights for your content, though I
expect these are a good starting place.
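
The field setup behind this is just three copies of the same source text;
the field and type names here are illustrative:

  <field name="exact"    type="text_exact"    indexed="true" stored="false"/>
  <field name="noaccent" type="text_noaccent" indexed="true" stored="false"/>
  <field name="stemmed"  type="text_stemmed"  indexed="true" stored="false"/>
  <copyField source="title" dest="exact"/>
  <copyField source="title" dest="noaccent"/>
  <copyField source="title" dest="stemmed"/>

with each type differing only in its analyzer chain.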

Per-field analyzers are very easy to use in Solr and are extremely
powerful. I wish we'd thought of that in Ultraseek.

wunder
==
Search Guy, Netflix
Formerly: Architect, Ultraseek

On 11/5/07 9:05 PM, "Papalagi Pakeha" <[EMAIL PROTECTED]> wrote:

> Hi all,
> 
> I use Solr 1.2 on a job advertising site. I started from the default 
> setup that runs all documents and queries through 
> EnglishPorterFilterFactory. As a result for example an ad with 
> "accounts" in its title is matched when someone runs a query for 
> "accountant" because both are stemmed to the "account" word and then 
> they match.
> 
> Is it somehow possible to give a higher score to exact matches and 
> sort them before matches from stemmed terms?
> 
> Close to this is a problem with accents - I can remove accents from 
> both documents and from queries and then run the query on non-accented

> terms. But I'd like to give higher score to documents where the search

> term matches exactly (i.e. including accents and possibly letter 
> capitalization, etc) and sort them before more fuzzy searches.
> 
> To me it looks like I have to run multiple sub-queries for each query,

> one for exact match, one for accents removed and one for stemmed words

> and then combine the results and compute the final score for each 
> match. Is that possible?
> 
> Thanks!
> 
> PaPa



RE: specify index location

2007-11-06 Thread Norskog, Lance
Snapshots are in that directory, and the spellchecker has its own
indexes under there.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Monday, November 05, 2007 8:57 PM
To: solr-user@lucene.apache.org
Subject: Re: specify index location

On 11/5/07, evol__ <[EMAIL PROTECTED]> wrote:
> Just a remark:
>Might be a good idea to change this to ./data/index

> to reflect the location that is expected in there.

./data is the generic solr data directory; "index" stores the main index
under the data directory.

-Yonik


RE: My filters are not used

2007-10-25 Thread Norskog, Lance
This search has up to 8000 records. Does this require a query cache of
8000 records? When is the query cache filled?

This answers a second question: the filter design is intended for small
search sets. I'm interested in selecting maybe 1/10 of a few million
records as a search limiter. Is it possible to create a similar feature
that caches low-level data areas for a query? Let's say that if the query
selects 1/10 of the document space, this means that only 40% of the
total memory area contains data for that 1/10. Is there a cheap way to
record this data? Would it be a feature like filters which records a
much lower-level data structure like disk blocks?

Thanks,

Lance Norskog

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Wednesday, October 24, 2007 8:24 PM
To: solr-user@lucene.apache.org
Subject: Re: My filters are not used

On 10/24/07, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> I am creating a filter that is never used. Here is the query sequence:
>
> q=*:*&fq=contentid:00*&start=0&rows=200
>
> q=*:*&fq=contentid:00*&start=200&rows=200
>
> q=*:*&fq=contentid:00*&start=400&rows=200
>
> q=*:*&fq=contentid:00*&start=600&rows=200
>
> q=*:*&fq=contentid:00*&start=700&rows=200
>
> Accd' to the statistics here is my filter cache usage:
>
> lookups : 1
[...]
>
> I'm completely confused. I thought this should be 1 insert, 4 lookups,

> 4 hits, and a hitratio of 100%.

Solr has a query cache too... the query cache is checked, there's a hit,
and the query process is short circuited.

-Yonik


New issue: request for limit parameter for search time, hits, and estimated ram usage

2007-10-24 Thread Norskog, Lance
http://issues.apache.org/jira/browse/SOLR-392
 
Summary:
 
It would be good for end-user applications if Solr allowed searches to
cease before finishing, and still return partial results.


RE: Converting German special characters / umlaute

2007-10-24 Thread Norskog, Lance
Isn't this what ISOLatin1Filter does?  Turn Björk into Bjork?  This should be 
much faster than PatternReplaceFilterFactory.
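
In schema terms that should just be a line like

  <filter class="solr.ISOLatin1AccentFilterFactory"/>

in both the index and query analyzer chains, assuming that factory is
available in your Solr build.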

-Original Message-
From: Matthias Eireiner [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 24, 2007 1:47 PM
To: solr-user@lucene.apache.org
Subject: AW: Converting German special characters / umlaute

Dear list,

it has been some time, but here is what I did.
I had a look at Thomas Traeger's tip to use the SnowballPorterFilterFactory, 
which does not actually do the job.
Its purpose is to convert regular ASCII into special characters. 

And I want it the other way, such that all special character are converted to 
regular ASCII.
The tip of J.J. Larrea, to use the PatternReplaceFilterFactory, solved the 
problem. 
 
And as Chris Hostetter noted, stored fields always return the initial value, 
which turned the second part of my question obsolete.

Thanks a lot for your help!

best
Matthias



-Ursprüngliche Nachricht-
Von: Thomas Traeger [mailto:[EMAIL PROTECTED] 
Gesendet: Mittwoch, 26. September 2007 23:44
An: solr-user@lucene.apache.org
Betreff: Re: Converting German special characters / umlaute


Try the SnowballPorterFilterFactory described here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

You should use the German2 variant that converts ä and ae to a, ö and oe

to o and so on. More details:
http://snowball.tartarus.org/algorithms/german2/stemmer.html

Every document in solr can have any number of fields which might have 
the same source but have different field types and are therefore handled

differently (stored as is, analyzed in different ways...). Use copyField

in your schema.xml to feed your data into multiple fields. During 
searching you decide which fields you like to search on (usually the 
analyzed ones) and which you retrieve when getting the document back.

Tom

Matthias Eireiner schrieb:
> Dear list,
>
> I have two questions regarding German special characters or umlaute.
>
> is there an analyzer which automatically converts all german special 
> characters to their specific dissected from, such as ü to ue and ä to 
> ae, etc.?!
>
> I also would like to have, that the search is always run against the 
> dissected data. But when the results are returned the initial data 
> with the non modified data should be returned.
>
> Does lucene GermanAnalyzer this job? I run across it, but I could not 
> figure out from the documentation whether it does the job or not.
>
> thanks a lot in advance.
>
> Matthias
>   




RE: Solr and security

2007-10-24 Thread Norskog, Lance
Solr does not do security itself. Servlet containers usually support
various security options: account/password through HTTP authentication
(very weak security) and certificates (very strong security) are what I
would look at first.

Lance

-Original Message-
From: Wagner,Harry [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 24, 2007 9:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr and security

One effective method is to block access to the port Solr runs on. Force
application access to come thru the HTTP server, and let it map to the
application server (i.e., like mod_jk does for Apache & Tomcat).
Simple, but effective.
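
For example, on Linux the port-blocking part can be as simple as (port and
addresses are placeholders):

  iptables -A INPUT -p tcp --dport 8080 -s 10.0.0.5 -j ACCEPT   # the web/app tier host
  iptables -A INPUT -p tcp --dport 8080 -j DROP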

Cheers!
harry

-Original Message-
From: Cool Coder [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 24, 2007 12:17 PM
To: solr-user@lucene.apache.org
Subject: Solr and security

Hi Group,
   As far as I know, to use solr, we need to deploy it as  a
server and communicate to solr using http protocol. How about its
security? i.e. how can we ensure that it only accepts request from
predefined set of users only. Is there any way we can specify this in
solr or solr depends only on web server security model. I am not sure
whether my interpretation is right?
  Your suggestion/input?
   
  - BR



RE: history

2007-07-07 Thread Norskog, Lance
We have another use case. We would like to count the number of times a
document came up in any search, and the total number of times it was
read. If these counters are not indexed, it seems like an update would
be a simple integer poke into the index. 

Also, thanks for the spellcheck info.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Saturday, July 07, 2007 9:21 AM
To: solr-user@lucene.apache.org
Subject: Re: history

On 7/7/07, Brian Whitman <[EMAIL PROTECTED]> wrote:
> I have been trying to plan out a history function for Solr. When I 
> update a document with an existing unique key, I would like the older 
> version to stay around and get tagged with the date and some metadata 
> to indicate it's not "live." Any normal search would not touch history

> documents.

Interesting...
One might be able to accomplish this with the update processors that
Ryan & I have been batting around for the last few days, in conjunction
with updateable documents, which is on-deck.

The first idea that comes to mind is that during an update, you could
change the id of the older document to be something like id_,
and reindex it with the addition of a live:false field.

For normal queries, use a filter of -live:false filter.
For all old of a document, use a prefix query id:mydocid_* for all
versions of a document, use query id:mydocid*

So if you can hold off a little bit, you shouldn't need a custom query
handler.  This will be a good use case to ensure that our request
processors and updateable documents are powerful enough.

-Yonik


RE: most popular/most commonly accessed records

2007-07-06 Thread Norskog, Lance
Documents in Lucene are read-only. You need to track accesses
separately. You can have a changeble score value in your documents, but
you'll have to re-index them for each change.

We use Google Analytics as a first cut. If you look at
http://www.divvio.com (pimping my employer) and look at the page source,
down at the bottom we trigger a message to GA. Unforch it has our
license key, which I think we want to change :) 

The next step is to use such dynamic statistics in calculating boosts;
it seems like we would want to combine relational DB accesses with
Lucene indexing to calculate relevance and boosts. 

Lance

-Original Message-
From: Karen Loughran [mailto:[EMAIL PROTECTED] 
Sent: Friday, July 06, 2007 6:59 AM
To: solr-user@lucene.apache.org
Subject: most popular/most commonly accessed records


Hi all,
Is there a way through solr to find out about "most commonly accessed"
solr documents ?  So for example, my client may wish to list the top 10
most popular videos, based on previous accesses to them in the  solr
server db.

If there are any solr features to help with this can someone point me to
them ?  Had a browse through the user documentation, but can't see
anything obvious ?

Many thanks
Karen Loughran


Checking for empty fields

2007-07-05 Thread Norskog, Lance
I understand that I cannot query on the 'null' value for a field, and so
I should make null fields -1 instead.
 
About dynamic fields: is there a way to query for the existence of a
dynamic field?
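
For example, if the dynamic field is something like rating_i (name made up),
q=rating_i:[* TO *] matches exactly the documents where that field exists.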
 
Thanks,
 
Lance Norskog
Divvio, Inc.