javax.xml.stream.XMLStreamException while indexing

2008-07-28 Thread Pieter Berkel
I've recently encountered a strange error while batch indexing around 500
average-sized documents:

HTTP Status 500 - null

javax.xml.stream.XMLStreamException
at com.bea.xml.stream.MXParser.fillBuf(MXParser.java:3700)
at com.bea.xml.stream.MXParser.more(MXParser.java:3715)
at com.bea.xml.stream.MXParser.nextImpl(MXParser.java:1756)
at com.bea.xml.stream.MXParser.next(MXParser.java:1333)
at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:323)
at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:197)
at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:125)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:128)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1038)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.hyperic.hq.product.servlet.filter.JMXFilter.doFilter(JMXFilter.java:324)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:210)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:870)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:685)
at java.lang.Thread.run(Thread.java:595)

Most other reports of this exception refer to an XML parse error at a
particular line / column; however, that is not the case here.
It doesn't seem to be a problem with the data either, since it fails on
a different set of documents on each occasion (i.e. I can't find specific
input data that reproduces the problem).  Increasing / decreasing the number
of documents still results in the same error.

The system I'm using consists of Solr 1.3 dev (compiled from SVN on
2008-07-21), Tomcat 5.5.23, and Sun Java SDK 1.5.0-11-1 running on Ubuntu
Server 7.10 with all current updates applied.  Has anybody else experienced
a similar problem to this? Would upgrading either Tomcat / Java help in this
instance?  Thanks in advance for any help.

regards,
Pieter


Re: Faceting over limited result set

2007-11-13 Thread Pieter Berkel
On Nov 14, 2007 6:44 AM, Mike Klaas <[EMAIL PROTECTED]> wrote:
>
> An implementation might look like:
>
>   DocList superlist;
>   int facetDocLimit = params.getInt(DMP.FACET_DOCLIMIT, -1);
>   if (facetDocLimit > 0 && facetDocLimit != req.getLimit()) {
>     superlist = s.getDocList(query, restrictions,
>                              SolrPluginUtils.getSort(req),
>                              req.getStart(), facetDocLimit, flags);
>     results.docSet = SearcherUtils.getDocSetFromDocList(superlist, s);
>     results.docList = superlist.subset(0, req.getLimit());
>   } else {
>
> Where getDocSetFromDocList() uses DocSetHitCollector to build a DocSet.
>
> To answer the performance question: There is a gain to be had when
> doing lots of faceting on huge indices, if N is low (say, 500-1000).
> One problem with the implementation above is that it stymies the
> query caching in SolrIndexSearcher (since the generated DocList is >
> the cache upper bound).
>
> -Mike

Thanks Mike, that looks like a good place to start.  While I really
can't think of any practical use for limiting the size of a DocSet other
than simple faceting, the new search component architecture makes it a
little more difficult to confine any implementation to only the facet
component (unless there is an efficient way to obtain a subset of a
DocSet, which there doesn't seem to be).  I'm also aware of the query
caching issues arising from SolrIndexSearcher; however, if N is
sufficiently low this (hopefully) shouldn't be too much of a problem.

I can't find either the SearcherUtils class or any reference to a
getDocSetFromDocList() method in svn trunk; is this deprecated or
custom-built code?
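In the meantime, here's a rough sketch of what I imagine such a helper could
look like (completely untested; it fills an OpenBitSet directly rather than
going through DocSetHitCollector as you describe, and assumes the BitDocSet /
OpenBitSet constructors currently in trunk):

import org.apache.solr.search.BitDocSet;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.OpenBitSet;

public class SearcherUtils {
  /** Collect the ids from a DocList into a bit set and wrap it as a DocSet. */
  public static DocSet getDocSetFromDocList(DocList list, SolrIndexSearcher s) {
    OpenBitSet bits = new OpenBitSet(s.maxDoc());
    DocIterator iter = list.iterator();
    while (iter.hasNext()) {
      bits.fastSet(iter.nextDoc());  // scores are irrelevant for set membership
    }
    return new BitDocSet(bits, list.size());
  }
}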

-Piete


Re: DISTINCT ON functionality in Solr?

2007-11-12 Thread Pieter Berkel
Currently this functionality is not available in Solr out-of-the-box,
however there is a patch implementing Field Collapsing
http://issues.apache.org/jira/browse/SOLR-236 which might be similar to what
you are trying to achieve.

Piete



On 13/11/2007, Jörg Kiegeland <[EMAIL PROTECTED]> wrote:
>
> Is there a way to define a query such that a search result
> contains only one representative of every set of documents which are
> equal on a given field (it is not important which representative
> document), i.e. to have the DISTINCT ON concept from relational
> databases in Solr?
>
> If this cannot be done with the search API of Lucene, may be one can use
> Solr server side hooks or filters to achieve this? How?
>
> The reason I do not want to do this filtering manually is that I want to
> have as many matches as possible with respect to my defined result limit
> for the query (and filtering the search result on the client side could
> leave me far short of this limit).
>
> Thanks..
>


Re: Faceting over limited result set

2007-11-12 Thread Pieter Berkel
On 13/11/2007, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
>
> can you elaborate on your use case ... the only time i've ever seen people
> ask about something like this it was because true facet counts were too
> expensive to compute, so they were doing "sampling" of the first N
> results.
>
> In Solr, sampling like this would likely be just as expensive as getting
> the full count.


It's not really a performance-related issue, the primary goal is to use the
facet information to determine the most relevant product category related to
the particular search being performed.

Generally the facets returned by simple, generic queries are fine for this
purpose (e.g. a search for "nokia" will correctly return "Mobile / Cell
Phone" as the most frequent facet), however facet data for more specific
searches are not as clear-cut (e.g. "samsung tv", where TVs will appear at
the top of the search results but will also match other "samsung" products
like mobile phones and mp3 players - obviously I could tweak the 'mm'
parameter to fix this particular case, but it wouldn't really solve my
problem).

The theory is that facet information generated from the first 'x' (let's say
100) matches to a query (ordered by score / relevance) will be more accurate
(for the above purpose) than facets obtained over the entire result set.  So
ideally, it would be useful to be able to constrain the size of the DocSet
somehow (as you mention below).


> matching occurs in increasing order of docid, so even if there was a hook
> to say "stop matching after N docs" those N wouldn't be a good
> representative sample, they would be biased towards "older" documents
> (based on when they were indexed, not on any particular date field)
>
> if what you are interested in is stats on the first N docs according to a
> specific sort (score or otherwise) then you could write a custom request
> handler that executed a search with a limit of N, got the DocList,
> iterated over it to build a DocSet, and then used that DocSet to do
> faceting ... but that would probably take even longer than just using the
> full DocSet matching the entire query.



I was hoping to avoid having to write a custom request handler but your
suggestion above sounds like it would do the trick.  I'm also debating
whether to extract my own facet info from a result set on the client side,
but this would be even slower.

Thanks for your suggestions so far,
Piete


Faceting over limited result set

2007-11-11 Thread Pieter Berkel
I'm trying to obtain faceting information based on the first 'x' (let's say
100-500) results matching a given (dismax) query.  The actual documents
matching the query are not important in this case, so intuitively the
simplest approach I can think of would be to limit the result set to 'x'
documents.

Unfortunately I can't find any easy way to limit the number of documents
matched (and returned in the set).  It might be possible to achieve the
desired result by using a function query + filter query, however that seems
a bit hack-ish and hopefully I've missed something basic that leads to a
simpler solution.

Apologies if this has already been discussed / solved before.

Thanks,
Piete


Re: SOLR 1.3 Release?

2007-10-25 Thread Pieter Berkel
On 26/10/2007, James liu <[EMAIL PROTECTED]> wrote:
>
> where can i read about the new 1.3 features?
>


Take a look at CHANGES.txt in the root directory of svn trunk, or also here:
http://svn.apache.org/viewvc/lucene/solr/trunk/CHANGES.txt

Piete


Re: Search results problem

2007-10-17 Thread Pieter Berkel
Just to clarify, <maxFieldLength> refers to the maximum number of *terms*
that will be indexed per field, not the character length of the field (I
wasn't clear about that in my previous post).

Unfortunately there is no way to specify an unlimited value, although if you
set it to a suitably large value, you shouldn't really have any problems
(other than running out of memory).
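For example, in the <indexDefaults> section of solrconfig.xml (the value
below is Integer.MAX_VALUE, i.e. as close to "unlimited" as the setting
allows):

<maxFieldLength>2147483647</maxFieldLength>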

Piete



On 17/10/2007, Thorsten Scherler <[EMAIL PROTECTED]>
wrote:
>
> On Wed, 2007-10-17 at 20:44 +1000, Pieter Berkel wrote:
> > There is a configuration option called "maxFieldLength" in
> > solrconfig.xml with the default value of 10,000.  You may need to
> > increase this value if you are indexing fields that are longer.
> >
>
> Is there a way to define an unlimited value? Like -1?
>
> TIA
>
> salu2
>
> >
> >
> > On 17/10/2007, Maximilian Hütter <[EMAIL PROTECTED]> wrote:
> > >
> > > Daniel Naber schrieb:
> > > > On Tuesday 16 October 2007 12:03, Maximilian Hütter wrote:
> > > >
> > > >> the content of one document is completely contained in another,
> > > >> but searching for a special word I only get one document as a result.
> > > >> I am absolutely sure it is contained in the other document, but I
> will
> > > >> only get the "parent" doc if I add a word.
> > > >
> > > > You should try debugging the problem with Luke, e.g. use
> "reconstruct &
> > > > edit" to see if the term is really indexed in both documents.
> > > >
> > > > Regards
> > > >  Daniel
> > > >
> > >
> > > Thank you for the tip, after using luke I can see that the term is
> > > really missing in the other document.
> > > Is there a size restriction for field content in Solr/Lucene? Because a
> > > lot of strings I expected to find seem to be missing from the "fulltext"
> > > field I use as the default field (after luke reconstruction).
> > >
> > > Best regards,
> > >
> > > Max
> > >
> > > --
> > > Maximilian Hütter
> > > blue elephant systems GmbH
> > > Wollgrasweg 49
> > > D-70599 Stuttgart
> > >
> > > Tel:  (+49) 0711 - 45 10 17 578
> > > Fax:  (+49) 0711 - 45 10 17 573
> > > e-mail :  [EMAIL PROTECTED]
> > > Sitz   :  Stuttgart, Amtsgericht Stuttgart, HRB 24106
> > > Geschäftsführer:  Joachim Hörnle, Thomas Gentsch, Holger Dietrich
> > >
> --
> Thorsten Scherler thorsten.at.apache.org
> Open Source Java  consulting, training and solutions
>
>


Re: Search results problem

2007-10-17 Thread Pieter Berkel
There is a configuration option called "maxFieldLength" in
solrconfig.xml with the default value of 10,000.  You may need to
increase this value if you are indexing fields that are longer.



On 17/10/2007, Maximilian Hütter <[EMAIL PROTECTED]> wrote:
>
> Daniel Naber schrieb:
> > On Tuesday 16 October 2007 12:03, Maximilian Hütter wrote:
> >
> >> the content of one document is completely contained in another,
> >> but searching for a special word I only get one document as a result.
> >> I am absolutely sure it is contained in the other document, but I will
> >> only get the "parent" doc if I add a word.
> >
> > You should try debugging the problem with Luke, e.g. use "reconstruct &
> > edit" to see if the term is really indexed in both documents.
> >
> > Regards
> >  Daniel
> >
>
> Thank you for the tip, after using luke I can see that the term is
> really missing in the other document.
> Is there a size restriction for field content in Solr/Lucene? Because a lot
> of strings I expected to find seem to be missing from the "fulltext" field
> I use as the default field (after luke reconstruction).
>
> Best regards,
>
> Max
>
> --
> Maximilian Hütter
> blue elephant systems GmbH
> Wollgrasweg 49
> D-70599 Stuttgart
>
> Tel:  (+49) 0711 - 45 10 17 578
> Fax:  (+49) 0711 - 45 10 17 573
> e-mail :  [EMAIL PROTECTED]
> Sitz   :  Stuttgart, Amtsgericht Stuttgart, HRB 24106
> Geschäftsführer:  Joachim Hörnle, Thomas Gentsch, Holger Dietrich
>


Re: delete by negative query

2007-10-15 Thread Pieter Berkel
You need to explicitly define the field you are referring to in order to
achieve this, otherwise the query parser will assume that the minus
character is part of the query and interpret it as field:"-solr" (where
"field" is the name of the default field set in your schema).  Try:

curl http://localhost:8983/solr/update --data-binary
'<delete><query>-field:solr</query></delete>' -H 'Content-type:text/xml;
charset=utf-8'

Piete



On 16/10/2007, Rob Casson <[EMAIL PROTECTED]> wrote:
>
> i'm having no luck deleting by a negative query
>
> indexing the example docs from 1.2, these steps work:
>
> curl http://localhost:8983/solr/update --data-binary
> '<delete><query>solr</query></delete>' -H 'Content-type:text/xml;
> charset=utf-8'
>
> curl http://localhost:8983/solr/update --data-binary '<commit/>' -H
> 'Content-type:text/xml; charset=utf-8'
>
> but if i reindex, and change the delete query to a negative, the
> non-'solr' docs don't get deleted:
>
> curl http://localhost:8983/solr/update --data-binary
> '<delete><query>-solr</query></delete>' -H 'Content-type:text/xml;
> charset=utf-8'
>
> curl http://localhost:8983/solr/update --data-binary '<commit/>' -H
> 'Content-type:text/xml; charset=utf-8'
>
> good chance i'm missing something obvious
>
> tia,
> r
>


Re: solr tuple/tag store

2007-10-09 Thread Pieter Berkel
On 10/10/2007, Ryan McKinley <[EMAIL PROTECTED]> wrote:

> > Without seeing the actual queries that are slow, it's difficult to
> determine
> > what the problem is.  Have you tried using EXPLAIN (
> > http://dev.mysql.com/doc/refman/5.0/en/explain.html) to check if your
> query
> > is using the table indexes effectively?
> >
>
> Yes, the issue is with the number of rows: with 10M rows, select(*) can
> take > 1 min.  With 10M rows, it was actually faster to remove the index
> so that it was forced to do a single iteration through all docs rather
> than use the index (I don't fully understand why)
>
> EXPLAIN says it is a simple query using the primary key, but can still
> take >30sec to complete!
>
> In general it seems like a bad idea to have mysql tables with lots of
> rows...  that is why i'm leaning towards a solr solution.
>


MySQL shouldn't really have any problem working with tables having 10M+ rows
(especially with simple select queries), most likely the issues you are
experiencing are a result of memory limits set in the mysql conf.  If you
want to persevere a little longer, try increasing the values of
"innodb_additional_mem_pool_size" and "innodb_buffer_pool_size" in your
my.cnf config file (see
http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html for more
info).
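For example (these values are illustrative only and should be tuned to the
RAM you can spare on the box):

[mysqld]
innodb_buffer_pool_size = 1G
innodb_additional_mem_pool_size = 32M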

If there is no compelling reason for sticking with an RDBMS, then the
solr solutions listed above might be a better fit.

Piete


Re: Solr and KStem

2007-10-09 Thread Pieter Berkel
Hi Harry,

I re-discovered this thread last week and have made some minor changes to
the code (removing deprecation warnings) so that it compiles with trunk.  I
think it would be quite useful to get this stemmer into Solr once all the
legal / licensing issues are resolved.  If there are no objections, I'll
open a JIRA ticket and upload my changes so we can make sure we're all
working with the same code.

cheers,
Piete



On 11/09/2007, Wagner,Harry <[EMAIL PROTECTED]> wrote:
>
> Bill,
> Currently it is a plug-in.  Put the lower case filter ahead of kstem,
> just as for porter (example below).  You can use it with porter, but I
> can't imagine why you would want to.  At least not in the same analyzer.
> Hope this helps.
>
> 
>   
> 
>  words="stopwords.txt"/>
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
> 
>  cacheSize="2"/>
> 
>   
>   
> 
>  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>  words="stopwords.txt"/>
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0"/>
> 
>  cacheSize="2"/>
> 
>   
> 
>
> Cheers... harry
>
>


Re: solr tuple/tag store

2007-10-09 Thread Pieter Berkel
Given that the tables are of type InnoDB, I think it's safe to assume that
you're not planning to use MySQL full-text search (only supported on MyISAM
tables).  If you are not concerned about transactional integrity provided by
InnoDB, perhaps you could try using MyISAM tables (although most people
report speed improvements for insert operations (on relatively small data
sets) rather than selects).

Without seeing the actual queries that are slow, it's difficult to determine
what the problem is.  Have you tried using EXPLAIN (
http://dev.mysql.com/doc/refman/5.0/en/explain.html) to check if your query
is using the table indexes effectively?

Pieter



On 10/10/2007, Lance Norskog <[EMAIL PROTECTED]> wrote:
>
> You did not give your queries. I assume that you are searching against the
> 'entryID' and updating the tag list.
>
> MySQL has a "fulltext" index. I assume this is a KWIC index but do not
> know.
> A "fulltext" index on "entryID" should be very very fast since
> single-record
> results are what Lucene does best.
>
> Lance
>


Re: Spell Check Handler

2007-10-08 Thread Pieter Berkel
I started to look at this back in August and decided to wait for
climbingrose's implementation; however, since then my priorities have changed
and I haven't had a chance to revisit it.

Sounds like there is quite a bit of interest in this feature, so it would be
great if those who have made progress on this so far shared their code so
we can avoid any further duplication of effort.  JIRA is still the best
place to upload code contributions, regardless of the amount of testing
performed or documentation included (concur with Hoss).

Thanks,
Pieter



On 08/10/2007, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> Did I miss this contribution or did it not happen?  I'm referring to the
> change to the SpellCheckerRequestHandler to handle spelling
> corrections/suggestions for multi-word queries.
>
> Any chance you can provide a patch?
>
> Thanks!
>
> Otis
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>


Re: Indexing XML

2007-10-05 Thread Pieter Berkel
> SOLR has of course a problem with the XML in the 'originalRecord' field.
> Is there a solution to this? Has anyone done this before?


I would suggest changing the field type of "originalRecord" to "string"
rather than "text", and if you're still having trouble with the XML data,
simply encapsulate the data in a CDATA section:

<field name="originalRecord"><![CDATA[ ...original XML here... ]]></field>

cheers,
Piete


Re: searching for non-empty fields

2007-09-27 Thread Pieter Berkel
While in theory -URL:"" should be valid syntax, the Lucene query parser
doesn't accept it and throws a ParseException.  I've considered raising this
issue on lucene-dev but it didn't seem to affect many users so I decided not
to pursue the matter.



On 27/09/2007, Chris Hostetter <[EMAIL PROTECTED]> wrote:

> ...and to work around the problem until you reindex...
>
> q=(URL:[* TO *] -URL:"")
>
> ...at least: i'm 97% certain that will work.  it won't help if your "empty"
> values are really " " or "  " or ...
>
>


Re: searching for non-empty fields

2007-09-26 Thread Pieter Berkel
I've experienced a similar problem before, assuming the field type is
"string" (i.e. not tokenized), there is subtle yet important difference
between a field that is null (i.e. not contained in the document) and one
that is an empty string (in the document but with no value). See
http://www.nabble.com/indexing-null-values--tf4238702.html#a12067741 for a
previous discussion of the issue.

Your query will work if you make sure the URL field is omitted from the
document at index time when the field is blank.
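For example, when the URL is blank, just leave the field out of the update
message entirely (the field names here are illustrative):

<add>
  <doc>
    <field name="id">doc42</field>
    <!-- no URL field at all when there is no value -->
  </doc>
</add>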

cheers,
Piete



On 27/09/2007, Brian Whitman <[EMAIL PROTECTED]> wrote:
>
> I have a large index with a field for a URL. For some reason or
> another, sometimes a doc will get indexed with that field blank. This
> is fine but I want a query to return only the set URL fields...
>
> If I do a query like:
>
> q=URL:[* TO *]
>
> I get a lot of empty fields back, like:
>
> <str name="URL"/>
> <str name="URL"/>
> <str name="URL">http://thing.com</str>
>
> What can I query for to remove the empty fields?
>
>
>
>


Re: Term extraction

2007-09-21 Thread Pieter Berkel
Thanks for the response guys:

Grant: I had a brief look at LingPipe, it looks quite interesting but I'm
concerned that the licensing may prevent me from using it in my project.
Michael: I have used the Yahoo API in the past but due to its generic
nature, I wasn't entirely happy with the results in my test cases.
Yonik: This is the approach I had in mind, will it still work if I put the
SynonymFilter after the word-delimiter filter in the schema config? Ideally
I want to strip out the underscore char before it gets indexed, is that
possible by using a PatternReplaceFilterFactory after the SynonymFilter?
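The analyzer chain I have in mind looks something like this (an untested
sketch; "keyphrases.txt" is a hypothetical synonyms file holding mappings
like "Bill Gates => bill_gates"):

<fieldtype name="keywords" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="keyphrases.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="_"
            replacement=" " replace="all"/>
  </analyzer>
</fieldtype>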

Cheers,
Piete



On 21/09/2007, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> On 9/19/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
> > However, I'd like to be able to
> > analyze documents more intelligently to recognize phrase keywords such
> as
> > "open source", "Microsoft Office", "Bill Gates" rather than splitting
> each
> > word into separate tokens (the field is never used in search queries so
> > matching is not an issue).  I've been looking at SynonymFilterFactory as
> a
> > possible solution to this problem but haven't been able to work out the
> > specifics of how to configure it for phrase mappings.
>
> SynonymFilter works out-of-the-box with multi-token synonyms...
>
> Microsoft Office => microsoft_office
> Bill Gates, William Gates => bill_gates
>
> Just don't use a word-delimiter filter if you use underscore to join
> words.
>
> -Yonik
>


Re: setting absolute path for snapshooter in solrconfig.xml doesn't work

2007-09-19 Thread Pieter Berkel
If you don't need to pass any command line arguments to snapshooter, remove
(or comment out) this line from solrconfig.xml:

<arr name="args"> <str>arg1</str> <str>arg2</str> </arr>

By the same token, if you're not setting environment variables either,
remove the following line as well:

<arr name="env"> <str>MYVAR=val1</str> </arr>

Once you alter / remove those two lines, snapshooter should function as
expected.

cheers,
Piete



On 20/09/2007, Yu-Hui Jin <[EMAIL PROTECTED]> wrote:
>
> Hi, Pieter,
>
> Thanks!  Now the exception is gone. However, There's no snapshot file
> created in the data directory. Strangely, the snapshooter.log seems to
> complete successfully.  Any idea what else I'm missing?
>
> $ cat var/SolrHome/solr/logs/snapshooter.log
> 2007/09/19 20:16:17 started by solruser
> 2007/09/19 20:16:17 command: /var/SolrHome/solr/bin/snapshooter arg1 arg2
> 2007/09/19 20:16:17 taking snapshot
> var/SolrHome/solr/data/snapshot.20070919201617
> 2007/09/19 20:16:17 ended (elapsed time: 0 sec)
>
> Thanks,
>
> -Hui
>
>
>
>
> On 9/19/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
> >
> > See this recent thread for some helpful info:
> >
> >
> http://www.nabble.com/solr-doesn%27t-find-exe-in-postCommit-event-tf4264879.html#a12167792
> >
> > You'll probably want to configure the "exe" with an absolute path rather
> > than relying on the "dir":
> >
> >   <str name="exe">/var/SolrHome/solr/bin/snapshooter</str>
> >   <str name="dir">.</str>
> >
> > in order to get the snapshooter working correctly.
> >
> > cheers,
> > Piete
> >
> >
> >
> > On 20/09/2007, Yu-Hui Jin <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi, there,
> > >
> > > I used an absolute path for the "dir" param in the solrconfig.xml as
> > > below:
> > >
> > > <listener event="postCommit" class="solr.RunExecutableListener">
> > >   <str name="exe">snapshooter</str>
> > >   <str name="dir">/var/SolrHome/solr/bin</str>
> > >   <bool name="wait">true</bool>
> > >   <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
> > >   <arr name="env"> <str>MYVAR=val1</str> </arr>
> > > </listener>
> > >
> > > However, I got "snapshooter: not found"  exception thrown in
> > catalina.out.
> > > I don't see why this doesn't work. Anything I'm missing?
> > >
> > >
> > > Many thanks,
> > >
> > > -Hui
> > >
> >
>
>
>
> --
> Regards,
>
> -Hui
>


Re: Term extraction

2007-09-19 Thread Pieter Berkel
Thanks Brian, I think the "smart" approaches you refer to might be outside
the scope of my current project.  The documents I am indexing already have
manually-generated keyword data, moving forward I'd like to have these
keywords automatically generated, selected from a pre-defined list of
keywords (i.e. the "simple" approach).

The data is fairly clean and domain-specific so I don't expect there will be
more than several hundred of these phrase terms to deal with, which is why I
was exploring the SynonymFilterFactory option.

Pieter



On 20/09/2007, Brian Whitman <[EMAIL PROTECTED]> wrote:
>
> On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote:
>
> > I'm currently looking at methods of term extraction and automatic
> > keyword
> > generation from indexed documents.
>
> We do it manually (not in solr, but we put the results in solr.) We
> do it the usual way - chunk (into n-grams, named entities & noun
> phrases) and count (tf & df). It works well enough. There is a bevy
> of literature on the topic if you want to get "smart" -- but be
> warned smart and fast are likely not very good friends.
>
> A lot depends on the provenance of your data -- is it clean text that
> uses a lot of domain specific terms? Is it webtext?
>
>


Re: Filter by Group

2007-09-19 Thread Pieter Berkel
Sounds like you're on the right track; if your groups overlap (i.e. a
document can be in group A and B), then you should ensure your "groups"
field is multivalued.

If you are searching for "foo" in documents contained in group "A", then it
might be more efficient to use a filter query (fq) like:

q=foo&fq=groups:A
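To restrict the search to several groups at once (your A, C and D case),
multiple values can go into a single filter query:

q=foo&fq=groups:(A OR C OR D)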

See the wiki page on common query parameters for more info:
http://wiki.apache.org/solr/CommonQueryParameters#head-6522ef80f22d0e50d2f12ec487758577506d6002

cheers,
Piete



On 20/09/2007, mark angelillo <[EMAIL PROTECTED]> wrote:
>
> Hey all,
>
> Let's say I have an index of one hundred documents, and these
> documents are grouped into 4 groups A, B, C, and D. The groups do in
> fact overlap. What would people recommend as the best way to apply a
> search query and return only the documents that are in group A? Also,
> how about if we run the same search query but return only those
> documents in groups A, C and D?
>
> I imagine that I could do this by indexing a text field populated
> with the group names and adding something like "groups:A" to the
> query but I'm wondering if there's a better solution.
>
> Thanks in advance,
> Mark
>
> mark angelillo
> snooth inc.
> o: 646.723.4328
> c: 484.437.9915
> [EMAIL PROTECTED]
> snooth -- 1.7 million ratings and counting...
>
>
>


Re: setting absolute path for snapshooter in solrconfig.xml doesn't work

2007-09-19 Thread Pieter Berkel
See this recent thread for some helpful info:
http://www.nabble.com/solr-doesn%27t-find-exe-in-postCommit-event-tf4264879.html#a12167792

You'll probably want to configure the "exe" with an absolute path rather than
relying on the "dir":

  <str name="exe">/var/SolrHome/solr/bin/snapshooter</str>
  <str name="dir">.</str>

in order to get the snapshooter working correctly.

cheers,
Piete



On 20/09/2007, Yu-Hui Jin <[EMAIL PROTECTED]> wrote:
>
> Hi, there,
>
> I used an absolute path for the "dir" param in the solrconfig.xml as
> below:
>
> <listener event="postCommit" class="solr.RunExecutableListener">
>   <str name="exe">snapshooter</str>
>   <str name="dir">/var/SolrHome/solr/bin</str>
>   <bool name="wait">true</bool>
>   <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
>   <arr name="env"> <str>MYVAR=val1</str> </arr>
> </listener>
>
> However, I got "snapshooter: not found"  exception thrown in catalina.out.
> I don't see why this doesn't work. Anything I'm missing?
>
>
> Many thanks,
>
> -Hui
>


Term extraction

2007-09-19 Thread Pieter Berkel
I'm currently looking at methods of term extraction and automatic keyword
generation from indexed documents.  I've been experimenting with
MoreLikeThis and values returned by the "mlt.interestingTerms" parameter and
so far this approach has worked well.  However, I'd like to be able to
analyze documents more intelligently to recognize phrase keywords such as
"open source", "Microsoft Office", "Bill Gates" rather than splitting each
word into separate tokens (the field is never used in search queries so
matching is not an issue).  I've been looking at SynonymFilterFactory as a
possible solution to this problem but haven't been able to work out the
specifics of how to configure it for phrase mappings.

Has anybody else dealt with this problem before, or is anyone able to offer
any insights into achieving the desired results?

Thanks in advance,
Pieter


Re: Web statistics for solr?

2007-08-22 Thread Pieter Berkel
Matthew,

Maybe the SOLR Statistics page would suit your purpose?
(click on "statistics" from the main solr page or use the following url)
http://localhost:8983/solr/admin/stats.jsp

cheers,
Piete



On 23/08/07, Matthew Runo <[EMAIL PROTECTED]> wrote:
>
> Hello!
>
> I was wondering if anyone has written a script that displays any
> stats from SOLR.. queries per second, number of docs added.. this
> sort of thing.
>
> Sort of a general dashboard for SOLR.
>
> I'd rather not write it myself if I don't need to, and I didn't see
> anything conclusive in the archives for the email list.
>
> ++
>   | Matthew Runo
>   | Zappos Development
>   | [EMAIL PROTECTED]
>   | 702-943-7833
> ++
>
>
>


Re: defining fiels to be returned when using mlt

2007-08-22 Thread Pieter Berkel
Hi Stefan,

Currently there is no way to specify the list of fields to be returned by
the MoreLikeThis handler.  I've been looking to address this issue in
https://issues.apache.org/jira/browse/SOLR-295 (point 3); however, in the
broader scheme of things, it seems logical to wait until
https://issues.apache.org/jira/browse/SOLR-281 is resolved before making
changes to MLT.

cheers,
Piete



On 22/08/07, Stefan Rinner <[EMAIL PROTECTED]> wrote:
>
> Hi
>
> Is there any way to define the number/type of fields of the documents
> returned in the "moreLikeThis" part of the response, when "mlt" is
> set to true?
>
> Currently I'm using morelikethis to show the number and sources of
> similar documents - therefore I'd need only the "source" field of
> these similar documents and not everything.
>
> - stefan
>


Re: Structured Lucene documents

2007-08-21 Thread Pieter Berkel
On 21/08/07, Pierre-Yves LANDRON <[EMAIL PROTECTED]> wrote:
>
> It seems the highlight fields must be specified, and that I can't use the
> * completion to do so.
> Am I right? Is there a way to get around this restriction?


As far as I know, dynamic fields are used mainly during indexing and
aren't expandable at query time.  It would be quite cool if Solr could do
query-time expansions of dynamic fields (e.g. hl.fl=page_*) however that
would require some knowledge of the dynamic fields already stored in the
index, which I don't think is currently available in either Solr or Lucene.

Piete


Re: clear index

2007-08-20 Thread Pieter Berkel
If you are using solr 1.2 the following command (followed by a commit /
optimize) should do the trick:

<delete><query>*:*</query></delete>
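For example, with curl (following the same pattern as the other update
commands on this list):

curl http://localhost:8983/solr/update --data-binary
'<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'

curl http://localhost:8983/solr/update --data-binary '<commit/>' -H
'Content-type:text/xml; charset=utf-8'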

cheers,
Piete


On 21/08/07, Sundling, Paul <[EMAIL PROTECTED]> wrote:
>
> what is the best approach to clearing an index?
>
> The use case is that I'm doing some performance testing with various
> index sizes.  In between indexing (embedded and soon HTTP/XML) I need to
> clear the index so I have a fresh start.
>
> What's the best approach, close the index and delete the files?  Hack
> together some query that will match all documents and delete by query?
> Looking at the Lucene API it looks like they have the same functionality
> that is exposed already (delete by id or query).
>
> Paul Sundling
>
>


Re: Indexing large documents

2007-08-20 Thread Pieter Berkel
You will probably need to increase the value of maxFieldLength in your
solrconfig.xml.  The default value is 10,000 terms, which might explain why
your documents are not being completely indexed.

Piete


On 20/08/07, Peter Manis <[EMAIL PROTECTED]> wrote:
>
> That should show some errors if something goes wrong; if not, the
> console usually will.  The errors will look like a java stacktrace
> output.  Did increasing the heap do anything for you?  Changing mine
> to 256mb max worked fine for all of our files.
>
> On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> > Well, I am using the java textmining library to extract text from
> documents,
> > then i do a post to solr
> > I do not have an error log, i only have *.request.log files in the logs
> > directory
> >
> > Thanks
> >
> > On 8/20/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > >
> > > Fouad,
> > >
> > > I would check the error log or console for any possible errors first.
> > > They may not show up, it really depends on how you are processing the
> > > word document (custom solr, feeding the text to it, etc).  We are
> > > using a custom version of solr with PDF, DOC, XLS, etc text extraction
> > > and I have successfully indexed 40mb documents.  I did have indexing
> > > problems with a large document or two and simply increasing the heap
> > > size fixed the problem.
> > >
> > > - Pete
> > >
> > > On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> > > > Hello,
> > > >
> > > > I am using solr to index text extracted from word documents, and it
> is
> > > > working really well.
> > > > Recently i started noticing that some documents are not indexed,
> that is
> > > i
> > > > know that the word foobar is in a document, but when i search for
> foobar
> > > the
> > > > id of that document is not returned.
> > > > I suspect that this has to do with the size of the document, and
> that
> > > > documents with a lot of text are not being indexed.
> > > > Please advise.
> > > >
> > > > thanks,
> > > > fmardini
> > > >
> > >
> >
>


Re: sub facets

2007-08-17 Thread Pieter Berkel
Hi Jae Joo,

Please provide a bit more information about exactly what you are trying to
achieve so we can help you.

cheers,
Piete



On 18/08/07, Jae Joo <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Can anyone help me with how to do sub facets?
> Thanks,
>
> Jae Joo
>


Re: solr + carrot2

2007-08-16 Thread Pieter Berkel
Any updates on this?  It certainly would be quite interesting to see how
well carrot2 clustering can be integrated with solr, I suppose it's a fairly
similar concept to simple faceting (maybe another candidate for SOLR-281
component?).

One concern I have is that the additional processing required at query time
would make the whole operation significantly slower (which is something I'd
like to avoid).  I've been wondering if it might be possible to calculate
(and store) clustering information at index time; however, since carrot2
seems to use the query term & result set to create clustering info, this
doesn't appear to be a practical approach.

In a similar vein, I'm also looking at methods of term extraction and
automatic keyword generation from indexed documents.  I've been
experimenting with MoreLikeThis and values returned by the
"mlt.interestingTerms" parameter, which has potential but needs a bit of
refinement before it can be truly useful.  Has anybody else discovered
clever or useful methods of term extraction using solr?

Piete



On 02/08/07, Burkamp, Christian <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> In my opinion the results from carrot2 clustering could be used in the
> same way that facet results are used.
> That's the way I'm planning to use them.
> The user of the search application can narrow the search by selecting one
> of the facets presented in the search result presentation. These facets
> could come from metadata (classic facets) or from dynamically computed
> categories which are results from carrot2.
>
> From this point of view it would be most convenient to have the
> integration for carrot2 directly in the StandardRequestHandler. This leaves
> questions open like "how should filters for categories from carrot2 be
> formulated".
>
> Is anybody already using carrot2 with solr?
>
> -- Christian
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto: [EMAIL PROTECTED] On Behalf Of
> Stanislaw Osinski
> Sent: Wednesday, 1 August 2007 14:01
> To: solr-user@lucene.apache.org
> Subject: Re: solr + carrot2
>
>
> >
> > Has anyone looked into using carrot2 clustering with solr?
> >
> > I know this is integrated with nutch:
> >
> > http://lucene.apache.org/nutch/apidocs/org/apache/nutch/clustering/carrot2/Clusterer.html
> >
> > It looks like carrot has support to read results from a solr index:
> >
> > http://demo.carrot2.org/head/api/org/carrot2/input/solr/package-summary.html
> >
> > But I'm hoping for something that returns clustered results from solr.
> >
> > Carrot also has something to read lucene indexes:
> >
> > http://demo.carrot2.org/head/api/org/carrot2/input/lucene/package-summary.html
> >
> > Any pointers or experience before I (may) delve into this?
> >
>
> First of all, apologies for a delayed response. I'm one of Carrot2
> developers and indeed we did some Solr integration, but from Carrot2's
> perspective, which I guess will not be directly useful in this case. If you
> have any ideas for integration, questions or requests for changes/patches,
> feel free to post on Carrot2 mailing list or file an issue for us.
>
> Thanks,
>
> Staszek
>


Re: Function Queries

2007-08-16 Thread Pieter Berkel
Hi Yakn,

On 17/08/07, Yakn <[EMAIL PROTECTED]> wrote:

> One example is that if you have mm being blank in the solrConfig.xml
> and not commented out, then it will throw a NumberFormatException.


The required format of the mm field is described in more detail here:
http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html
The parser is pretty fussy about how this field is formatted, when a value
is not specified the default value is "100%" which means "match documents
that contain every term in the query".
Perhaps it might be a good idea to add a simple sanity check to mm testing
for empty string?

> Another example is that without something in qf, the query, using
> dt=dismax in the query request string, does not return any results.


qf is a required parameter and is needed (in combination with q param) to
construct a Lucene query; it won't work without it (as you've discovered).
The default values in solrconfig.xml serve as an example and you'll most
probably need to change them to match your schema (the value in
solrconfig.xml is only used if qf is not set in your request query
string).

> So, what I am really looking for here is the proper way to do the whole
> solrConfig.xml, for the dismax request handler. It seems that I am somehow
> missing something.


I think the whole point of the values set in the solrconfig.xml included
with the distribution is to serve as a guide for you to try with the
examples provided.  In general most of these default values (that don't
refer to specific fields) can be left unmodified and dismax requests will
still work fine, however you can change and tweak these parameters to suit
your particular requirements if neccessary.


> The way that I understand it right now is this: for all the fields that
> will be searched on and a function query will be used, they need to be in
> the qf parameter.


Only fields in which you want to match terms from "q" need to be listed in
"qf"; it is not necessary to list fields used in a function query there.


> For the function query itself, I have just a field called importancerank
> which is a float type field. I do not use ord() or rord() or linear()
> etc... because I just want to take that value of that field and add it to
> the score.


I haven't tried this myself but it should be as simple as adding the
following to your query string: bf=importancerank
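For example (assuming the dismax handler is registered under the name
"dismax" and that qf lists your own search fields):

http://localhost:8983/solr/select?qt=dismax&q=foo&qf=title+description&bf=importancerank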


> I also have a 0.01 in tie. I have echoParams
> set to explicit. These are the only parameters that I have set up. I have
> the rest commented out such as pf, ps, q.alt, and mm. Also, what is fl? I
> could not find any documentation on that.


A lot of parameters (including fl) are common and used by both Standard and
Dismax request handlers, so you should take a look at:
http://wiki.apache.org/solr/CommonQueryParameters

> What happens currently for me is that when I put the dt=dismax parameter in
> my query request string, I get exactly the same results as if I didn't,
> meaning it didn't appear to sort it at all. What other parameters do I have
> to fill out in the request handler to make this work? What might I have done
> wrong in my thinking of how things work?


You'll have to provide more information about your query (e.g. query string
parameters, field definitions from schema.xml, the contents of your dismax
<requestHandler> in solrconfig.xml) in order to see what's going on.

> Another thing that would be helpful is to see a whole solrConfig schema for
> the dismax request handler. I have only read about bits of it and I think
> that to get a view of a full one that actually works would be very helpful.
> Thanks again.


This is the solrconfig.xml that I mentioned earlier, it is provided with the
Solr distribution (in /example/solr/conf/):
http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solrconfig.xml

Hope this helps,
Piete


Re: how to retrieve all the documents in an index?

2007-08-15 Thread Pieter Berkel
Hi Hui,

I'm not 100% certain, but I believe this syntax was added in 1.2 (it
certainly works in the svn trunk code); can anyone confirm this?

cheers,
Piete



On 14/08/07, Yu-Hui Jin <[EMAIL PROTECTED]> wrote:
>
> Piete,
>
> I tried and it doesn't work for Solr 1.1.  Is it supported for 1.2 or at
> all?
>
> (Right now, I'm using a work-around by a range query for a field whose
> range
> is known to be larger than 0.)
>
>
> Thanks,
>
> -Hui
>
>
>
> On 8/12/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
> >
> > Try using q=*:* to match all documents in the index.
> >
> > Piete
> >
> >
> >
> > On 13/08/07, Yu-Hui Jin <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi, there,
> > >
> > > I found the following post on the web. Is this still the simplest
> > > get-around
> > > to retrieve all documents in an index? (I'm asking just in case I
> don't
> > > know
> > > there's a more standard way to do that now.)
> > >
> > >
> > > thanks,
> > >
> > > -Hui
> > >
> > >
> > >
> > > From  "Fuad Efendi" < [EMAIL PROTECTED]>
> > > Subject RE: MatchAllDocsQuery in solr?
> > > Date Wed, 29 Nov 2006 01:58:25 GMT
> > >
> > > Workaround
> > > ==
> > >
> > > Define a field scan_all with constant value
> > > 'abcd' for all documents (choose a value not listed in any 'stop-word'
> > > etc.).
> > > Lucene query 'scan_all:abcd' will retrieve 'all' documents.
> > > Enjoy!
> > >
> > >
> > > -Original Message-
> > > From: Tom
> > > Sent: Tuesday, November 21, 2006 5:08 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: MatchAllDocsQuery in solr?
> > >
> > >
> > > Is there a way to do a match all docs query in solr?
> > >
> > > I mean is there something I can put in a solr URL that will get
> > > recognized by the SolrQueryParser as meaning a "match all"?
> > >
> > > Why? Because I'm porting unit tests from our internal Lucene
> > > container to Solr, and the tests usually run such a query,  upon
> > > completion, to make sure the index is in the expected state (nothing
> > > missing, nothing extra).
> > >
> > > Yes, I can create a query that will match all my docs, there are a
> > > few fields that have a relatively small range of values. I was just
> > > looking for a standard way to do it first.
> > >
> > > Thanks,
> > >
> > > Tom
> > >
> >
>
>
>
> --
> Regards,
>
> -Hui
>


Re: schema.xml changes and the impact on Solr?

2007-08-13 Thread Pieter Berkel
On 14/08/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
>
> > 2 - Question about the structure of the injected xml file... does it
> > need to exactly match the data in solr?  I know it makes sense that
> > we're only injecting the fields that solr needs and not excluding fields
> > that it needs... but how fussy is solr when it comes to matching the xml
> > in injection?
> >
>
> by design it is fussy.
>
> I think there is some way to make a non-indexed, non-stored dynamic
> field that just will ignore unknown fields.
>

It's easy to ignore unknown fields in XML input, just uncomment these lines
in your schema.xml:

<fieldtype name="ignored" stored="false" indexed="false" class="solr.StrField"/>

<dynamicField name="*" type="ignored"/>

cheers,
Piete


Re: Structured Lucene documents

2007-08-13 Thread Pieter Berkel
On 13/08/07, Pierre-Yves LANDRON <[EMAIL PROTECTED]> wrote:
>
> Hello !Thanks Pieter,That seems a good idea - if not an ideal one - even
> if it sort of an hack. I will try it as soon as possible and keep you
> informed.The hl.fl parameter doesn't have to be initialized, I think, so
> it won't be a problem.On the other hand, I will have the exact same
> problem to specify the (dynamic) field on wich the request is performed... I
> need to be able to execute the request on the full text of the page only :
> must I specify all of the -hightly variable- name of each page field in my
> query ?I think that structured index document could be of great value to
> complex documents indexation. Is there a way that someday Solr will include
> such possibility, or is it basically impossible (due to the way Lucene works
> for example) ?Kind Regards,Pierre-Yves Landron


Hi Pierre-Yves,

Maybe you could use dynamic field copy in your schema.xml to index content
from all pages stored in your document in a separate field, something like:

<copyField source="page*" dest="all_pages"/>

and then you would only need to query on the "all_pages" field.  Not quite
sure how this might be affected by the hl.requireFieldMatch=true parameter
but it's worth a try.
cheers,
Piete


Re: how to retrieve all the documents in an index?

2007-08-12 Thread Pieter Berkel
Try using q=*:* to match all documents in the index.
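For example:

http://localhost:8983/solr/select?q=*:*&rows=10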

Piete



On 13/08/07, Yu-Hui Jin <[EMAIL PROTECTED]> wrote:
>
> Hi, there,
>
> I found the following post on the web. Is this still the simplest
> get-around
> to retrieve all documents in an index? (I'm asking just in case I don't
> know
> there's a more standard way to do that now.)
>
>
> thanks,
>
> -Hui
>
>
>
> From  "Fuad Efendi" < [EMAIL PROTECTED]>
> Subject RE: MatchAllDocsQuery in solr?
> Date Wed, 29 Nov 2006 01:58:25 GMT
>
> Workaround
> ==
>
> Define a field scan_all with constant value
> 'abcd' for all documents (choose a value not listed in any 'stop-word'
> etc.).
> Lucene query 'scan_all:abcd' will retrieve 'all' documents.
> Enjoy!
>
>
> -Original Message-
> From: Tom
> Sent: Tuesday, November 21, 2006 5:08 PM
> To: solr-user@lucene.apache.org
> Subject: MatchAllDocsQuery in solr?
>
>
> Is there a way to do a match all docs query in solr?
>
> I mean is there something I can put in a solr URL that will get
> recognized by the SolrQueryParser as meaning a "match all"?
>
> Why? Because I'm porting unit tests from our internal Lucene
> container to Solr, and the tests usually run such a query,  upon
> completion, to make sure the index is in the expected state (nothing
> missing, nothing extra).
>
> Yes, I can create a query that will match all my docs, there are a
> few fields that have a relatively small range of values. I was just
> looking for a standard way to do it first.
>
> Thanks,
>
> Tom
>


Re: FunctionQuery and boosting documents using date arithmetic

2007-08-12 Thread Pieter Berkel
Do you consistently add 10,000 documents to your index every day or does the
number of new documents added per day vary?
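(For reference, recip(x,m,a,b) computes a/(m*x + b), so the F quoted below
works out to 1000/(rord(creationDate) + 1000): the multiplier is already down
to a half once a document's reverse ordinal passes 1,000, which at 10,000
documents per day happens within the first few hours of postings.)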


On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> I'm having trouble with the date boosting function as well. I'm using this
> function: F = recip(rord(creationDate),1,1000,1000)^10. However, since I
> have around 10,000 documents added in one day, rord(createDate) returns
> very different values for the same createDate. For example, the last
> document added will have rord(createdDate) = 1 while the first document
> added that day will have rord(createdDate) = 10,000. When rord(createdDate)
> exceeds 10,000, the value of F approaches 0. Therefore, the boost query
> doesn't make any difference between the last document added today and a
> document added 10 days ago. Now if I replace 1000 in F with a large number,
> the boost function suddenly gives the last few documents an enormous boost
> and makes the other query scores irrelevant.
>
> So in my case (and many others' I believe), the "true" date value would be
> more appropriate. I'm thinking along the same lines of adding a timestamp.
> It wouldn't add much overhead this way, would it?
>


Re: FunctionQuery and boosting documents using date arithmetic

2007-08-12 Thread Pieter Berkel
On 11/08/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> i would agree with you there, this is where a more robust (ie:
> less efficient) DateField-ish class that supports configuration options
> to specify:
>   1) the output format
>   2) the input format(s)
>   3) the indexed format
> ...as SimpleDateFormat pattern strings would be handy.  The
> ValueSource it uses could return seconds (or some other unit based on
> another config option) since epoch as the intValue.


That definitely sounds like a sensible and flexible approach, I'll have to
take a closer look at the ValueSource and FunctionQuery classes and see what
I can come up with.

it's been discussed before, but there are a lot of tricky issues involved
> which is probably why no one has really tackled it.


It does seem somehow related to the issue of making the value of NOW
constant during the entire execution of a query, hopefully not in the
too-hard basket.

> be careful what you wish for.  you are 100% correct that functions using
> the (r)ord value of a DateField aren't a function of true age, but
> depending on how you look at it that may be better than using the real age
> (i think so anyway).


I understand the problems you describe with using true age values, although
I wonder how much recip() (or perhaps some other logarithmic function) would
be able to dampen any unpleasant side-effects created by unusual publishing
patterns, not publishing on weekends, etc.  Using "min age" sounds like a
much better idea than using NOW to avoid any of the described weirdness too,
but that might increase the complexity of the function.

I'm still keen to get something working, at least to compare the results it
generates with the current ordinal method.

Piete


Re: Spell Check Handler

2007-08-11 Thread Pieter Berkel
On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> That's exactly what I did with my custom version of the
> SpellCheckerHandler.
> However, I didn't handle suggestionCount and only returned the one
> corrected
> phrase which contains the "best" corrected terms. There is an issue on
> Lucene issue tracker regarding multi-word spellchecker:
> https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>


I'd be interested to take a look at your modifications to the
SpellCheckerHandler: how did you handle phrase queries? Maybe we can open a
JIRA issue to expand the spell checking functionality to perform analysis on
multi-word input values.

I did find http://issues.apache.org/jira/browse/LUCENE-626 after looking at
LUCENE-550, but since these patches are not yet included in the Lucene trunk
it might be a little difficult to justify implementing them in Solr.


Re: Spell Check Handler

2007-08-10 Thread Pieter Berkel
On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> The spellchecker handler doesn't seem to work with multi-word query. For
> example, when I tried to spellcheck "Java developar", it returns nothing
> while if I tried "developar", spellchecker correctly returns "developer".
> I
> followed the setup on the wiki.


While I suppose the general case for using the spelling checker would be a
query containing a single misspelled word, it would be quite useful if the
handler applied the analyzer specified by the termSourceField fieldType to
the query input and then checked the spelling of each query token. This
would seem to be the most flexible way of supporting multi-word queries
(provided the termSourceField didn't use any stemmer filters).
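A rough sketch of the idea (hypothetical code, not the handler itself; it
leans on the Lucene 2.x TokenStream API and the contrib SpellChecker's
exist() / suggestSimilar() methods):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.spell.SpellChecker;

public class MultiWordSpellCheck {
  /** Analyze the query with the source field's analyzer, then check each token. */
  public static List<String> correct(Analyzer analyzer, SpellChecker spellChecker,
                                     String query, int numSuggestions) throws IOException {
    List<String> corrected = new ArrayList<String>();
    TokenStream ts = analyzer.tokenStream("word", new StringReader(query));
    for (Token t = ts.next(); t != null; t = ts.next()) {
      String word = t.termText();
      if (spellChecker.exist(word)) {
        corrected.add(word);  // token is already spelled correctly
      } else {
        String[] suggestions = spellChecker.suggestSimilar(word, numSuggestions);
        corrected.add(suggestions.length > 0 ? suggestions[0] : word);
      }
    }
    return corrected;
  }
}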

Piete


Re: tomcat and solr multiple instances

2007-08-09 Thread Pieter Berkel
The current working directory (Cwd) is the directory from which you started
the Tomcat server and is not dependent on the Solr instance configurations.
So as long as SolrHome is correct for each Solr instance, you shouldn't have
a problem.

cheers,
Piete



On 10/08/07, Jae Joo <[EMAIL PROTECTED]> wrote:
>
> Here are the Catalina/localhost/ files
> For "example" instance
> <Context docBase="..." debug="0" crossContext="true">
>   <Environment name="solr/home" type="java.lang.String"
>                value="/rpt/src/apache-solr-1.2.0/example/solr"
>                override="true" />
> </Context>
>
>
> For ca_companies instance
>
> <Context docBase="..." debug="0" crossContext="true">
>   <Environment name="solr/home" type="java.lang.String"
>                value="/rpt/src/apache-solr-1.2.0/ca_companies/solr"
>                override="true" />
> </Context>
>
>
> Urls
> http://host:8080/solr/admin --> pointing to the "example" instance (Problem...)
> http://host:8080/solr_ca/admin --> pointing to the "ca_companies" instance (it
> is working)
>
> -Original Message-
> From: Jae Joo [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 09, 2007 5:45 PM
> To: solr-user@lucene.apache.org
> Subject: tomcat and solr multiple instances
>
> Hi,
>
>
>
> I have built 2 solr instance - one is "example" and the other is
> "ca_companies".
>
>
>
> The "ca_companies" solr instance is working find, but "example is not
> working...
>
>
>
> In the admin page, "/solr/admin", for "example" instance, it shows that
>
>
>
> Cwd=/rpt/src/apache-solr-1.2.0/ca_companies/solr/conf
>
> --> this should be
>
> Cwd=/rpt/src/apache-solr-1.2.0/example/solr/conf
>
>
>
> SolrHome=/rpt/src/apache-solr-1.2.0/example/solr/
>
>
>
> Any one knows why?
>
>
>
> If I run Jetty for instance "example", it is working well...
>
>
>
> Thanks,
>
>
>
> Jae Joo
>
>


Re: retrieving range of fields for the results

2007-08-08 Thread Pieter Berkel
On 09/08/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
>
> Faceting ignores pagination/startat/maxresults/etc.
>


This is correct: the facet information returned is based on the entire
result set matching the query rather than the document set returned by the
query.  The start and rows parameters have no effect on facet_counts at all,
so you should be able to obtain the necessary information from a single
query.
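Incidentally, if you only need the facet counts (e.g. to pull out min / max
values as discussed earlier in this thread), you can skip document retrieval
entirely by setting rows=0:

q=foo&rows=0&facet=true&facet.field=price&facet.sort=false&facet.limit=-1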


Re: Structured Lucene documents

2007-08-08 Thread Pieter Berkel
In theory, you could store all your pages in a single document using a
dynamic field type:

<dynamicField name="page*" type="text" indexed="true" stored="true"/>

Store each page in a separate field (e.g. page1, page2, page3 .. pageN) then
at query time, use the highlighting parameters to highlight matches in the
page fields. You should be able to determine the page field that matched the
query by observing the highlighted results (I'm not certain if the hl.fl
parameter accepts dynamic field names, you may need to specify them all
manually):

hl=true&hl.fl=page1,page2,page3,pageN&hl.requireFieldMatch=true

It sounds like a bit of a rough hack and I haven't actually tried to do this
myself, maybe someone else has a better idea?

cheers,
Piete


On 08/08/2007, Pierre-Yves LANDRON <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> Is it possible to structure lucene documents via Solr, so one document
> could fit into another one? What I would like to do, for example: I want
> to retrieve full-text articles, each of which spans several pages. Results
> must take into account both the pages and the article the search terms
> come from. I can create a lucene document for each page of the article AND
> the article itself, and do two requests to get my results, but that would
> duplicate the full text in the index and would not be very efficient.
> Ideally, what I would like to do is create a document indexing the text of
> each page of the article, and group these documents in one document that
> describes the article: this way, when Lucene retrieves a requested term,
> I'll get the article and the page that contains the term. I wonder if
> there's a way to emulate this behavior elegantly with Solr?
>
> Kind Regards, Pierre-Yves Landron


Re: retrieving range of fields for the results

2007-08-08 Thread Pieter Berkel
You could probably achieve this using faceting, although it would not be a
very efficient solution (then again, your multi-query idea isn't very
efficient either).  Facets are returned in field sort order if you specify
facet.sort=false (setting it to true sorts the field values by count), so
you could use query parameters like the following:

facet=true&facet.sort=false&facet.missing=false&facet.limit=-1&facet.mincount=1&facet.field=price&facet.field=publish_date

Then simply extract the first and last values of each facet, which should
give you the minimum and maximum values.
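For example, with the book data you gave, the facet_fields section of the
response would contain something like this (middle entries omitted):

<lst name="facet_fields">
  <lst name="price">
    <int name="15.0">1</int>
    ...
    <int name="56.0">1</int>
  </lst>
</lst>

The first entry (15.0) is your minimum and the last (56.0) your maximum.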

I'm pretty certain there is currently no option to return only the minimum
and maximum facet values, but it might be an idea for a future enhancement.

cheers,
Piete


On 08/08/2007, Yu-Hui Jin <[EMAIL PROTECTED]> wrote:
>
> Piete,
>
> Thanks for the pointers and detailed info. Actually I'm aware of the
> faceting feature. Faceting provides the ability to categorize the results
> with the count for each category. However, I don't think that solves my
> problem.
>
> Let me give a more detailed example for my requirement:
>
> Let's say we indexed 100 documents representing books with the following
> fields:
>
> (id, title, price, publish_date, description)
>
> Suppose a query "title:art history" (we are using Solr 1.1, and will use
> the "sort" param when migrating to 1.2) which returns 5 results:
>
> (25, "The History of Art", 56.0, 07-10-1995, "blah blah blah")
> (54, "Art History", 38.0, 02-13-1997, "blah blah blah")
> (13, "Art", 45.0, 10-05-1980, "blah blah blah")
> (3, "The Art of War", 40.0, 12-12-2000, "blah blah blah")
> (38, "History of Everything", 15.0, 12-31-2001, "blah blah blah")
>
> Now my requirement is that along with these five results, I also want Solr
> to return the following:
>
> The [min, max] range for the 'price' field, in this case:  [15.0, 56.0]
> The [min, max] range for the 'publish_date' field, in this case:
> [10-05-1980, 12-31-2001]
>
> I don't think faceting would give me these ranges; it can only give me
> counts of ranges/values, for example if I specified facet.query with
> different price ranges.
>
> So all I can think of now is that I have to
> 1) make 'price' and 'publish_date' sortable in the schema;
> 2) after each query (which returns a non-empty result set), issue two more
> queries, one adding a sort on 'price' to the original query, the other
> adding a sort on 'publish_date' (the sort order doesn't matter);
> 3) get the respective min and max values of these two fields from the first
> and last document returned by each of the two subsequent queries.
>
> Is there any better way to accomplish this?
>
>
> Thanks,
>
> -Hui
>
>
>
> On 8/7/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
> >
> > The functionality you are describing is called "Faceting" in Solr and
> > can be achieved with a single query; take a look at the following wiki
> > pages for more info:
> >
> > http://wiki.apache.org/solr/SolrFacetingOverview
> > http://wiki.apache.org/solr/SimpleFacetParameters
> >
> > In regards to faceting date fields such as publish_date, take a look at
> > https://issues.apache.org/jira/browse/SOLR-258 which was recently
> > committed to the svn trunk (although not much documentation on that yet).
> >
> > Just a note about your query: specifying the sort order in the q
> > parameter is deprecated syntax; you are better off using the sort
> > parameter for that. Again, refer to the wiki:
> >
> > http://wiki.apache.org/solr/CommonQueryParameters
> >
> > Hope this helps,
> > Piete
> >
> >
> > On 07/08/07, Yu-Hui Jin <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi, there,
> > >
> > > We used Solr to index a set of documents and provide free-text search
> > > functionality with sorting options on a number of fields. We have a
> > > requirement that along with each search result we want to obtain the
> > > ranges
> > > of a few fields for the resulting documents. Here's an example:
> > >
> > > Let's say we indexed documents representing books with the following
> > > fields:
> > >
> > > title, price, publish_date, description
> > >
> > > All of these fields are stored in the index (that can be returned);
> and
> > > "price" and "publish_date" are sortable

Re: Configuring Synonyms

2007-08-07 Thread Pieter Berkel
What is the fieldType of your Colour field?  You must ensure that the
particular field that you are using to store Colour information is
configured to use solr.SynonymFilterFactory in your schema.xml configuration
file.
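As a minimal sketch (the type name and analyzer chain here are assumptions;
adapt them to your schema), the Colour field's type might look like:

<fieldType name="text_colour" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>

with mapping lines in synonyms.txt such as:

Straw, Melon, Buttermilk, Mustard => Yellow

With this mapping, Straw, Melon, Buttermilk and Mustard are rewritten to
Yellow as documents are indexed (and again at query time), so
q=Colour:Yellow matches those products without any manual query expansion.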

cheers,
Piete


On 07/08/07, beejenny <[EMAIL PROTECTED]> wrote:
>
>
> Hello,
>
> I am trying to configure some synonyms against an index of catalog
> products
> and I can't seem to get it right.
>
> We are indexing a field called Colour which contains the colour of a
> product.  Some of the values for colour are a little obscure and we'd like
> to map them to more common colours.
>
> For example
> Straw, Melon, Buttermilk, Mustard => Yellow
> Navy, Ocean, cobalt => Blue
>
> When I search for q=Colour:Yellow I would like to bring back all the
> documents with colour of yellow, Straw, Melon, Buttermilk and Mustard.
>
> Effectively
> q=Colour:Yellow+Colour:Straw+Colour:Melon+Colour:Buttermilk+Colour:Mustard
>
> But no matter what I do I only get back products with "Yellow" in the
> colour
> field.  Is using synonyms the best way to try to achieve this?
>
> Any pointers would be greatly appreciated.
>
> Jenny
> --
> View this message in context:
> http://www.nabble.com/Configuring-Synonyms-tf4229295.html#a12031632
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: retrieving range of fields for the results

2007-08-07 Thread Pieter Berkel
The functionality you are describing is called "Faceting" in Solr and can be
achieved with a single query, take a look at the following wiki pages for
more info:

http://wiki.apache.org/solr/SolrFacetingOverview
http://wiki.apache.org/solr/SimpleFacetParameters

In regards to faceting date fields such as publish_date, take a look at
https://issues.apache.org/jira/browse/SOLR-258 which was recently committed
to the svn trunk (although not much documentation on that yet).
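From memory, the parameters added by that patch look roughly like this
(treat the syntax as a sketch; note the gap's '+' must be URL-encoded as
%2B):

facet.date=publish_date&facet.date.start=NOW/YEAR-5YEARS&facet.date.end=NOW&facet.date.gap=%2B1YEAR

which buckets the matching documents into one-year ranges of publish_date.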

Just a note about your query: specifying the sort order in the q parameter
is deprecated syntax; you are better off using the sort parameter for that.
Again, refer to the wiki:

http://wiki.apache.org/solr/CommonQueryParameters

Hope this helps,
Piete


On 07/08/07, Yu-Hui Jin <[EMAIL PROTECTED]> wrote:
>
> Hi, there,
>
> We used Solr to index a set of documents and provide free-text search
> functionality with sorting options on a number of fields. We have a
> requirement that along with each search result we want to obtain the
> ranges
> of a few fields for the resulting documents. Here's an example:
>
> Let's say we indexed documents representing books with the following
> fields:
>
> title, price, publish_date, description
>
> All of these fields are stored in the index (that can be returned); and
> "price" and "publish_date" are sortable as well.
>
> Now suppose we post a query, for example "title:art history;price asc",
> and we want the following results:
>
> 1. the documents satisfying the query (which by default are returned by
> solr);
> 2. the range of the price field of the results.
> 3. the range of the publish_date of the results.
>
> My question is -- can solr return all of the above data within one
> response?  If not, all I can think of is to issue one more query, like
> "title:art history;publish_date asc" (or desc), so that I can use the first
> and last result of this query to get the range for publish_date.  (Due to
> the original query already asking to sort by price, we are lucky in this
> case that we don't have to issue another query to get the range of price.
> But then this does not apply in the general case.)
>
> Any idea is appreciated!
>
> --
> Regards,
>
> -Hui
>


Re: FunctionQuery and boosting documents using date arithmetic

2007-08-06 Thread Pieter Berkel
Actually, thinking about this a bit more: adding a function call such as
parseDate() might add too much overhead to the actual query. It would
probably be better to convert the date to a timestamp at index time and
store it in a field of type slong. That would be more efficient, but it
still leaves the problem of obtaining the current timestamp to use in the
boost function.
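One workaround is to have the client substitute the current time into the
boost function when it builds the query. A minimal sketch (the creationTs
field name is hypothetical; linear and recip are as documented on the
FunctionQuery wiki page):

import java.net.URLEncoder;

public class AgeBoost {
    public static void main(String[] args) throws Exception {
        // Bake the current time (seconds since the epoch) into the query,
        // since FunctionQuery has no built-in notion of "now".
        long nowSecs = System.currentTimeMillis() / 1000L;
        // Age in seconds: linear(creationTs,-1,now) = -1*creationTs + now,
        // and recip(x,1,1000,1000) = 1000/(x + 1000) decays as age grows.
        String bf = "recip(linear(creationTs,-1," + nowSecs + "),1,1000,1000)";
        System.out.println("bf=" + URLEncoder.encode(bf, "UTF-8"));
    }
}

The obvious drawback is that the generated query string changes every
second, which defeats query caching unless nowSecs is rounded (to the
nearest hour or day, say).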



On 06/08/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
>
> I've been using a simple variation of the boost function given in the
> examples used to boost more recent documents:
>
> recip(rord(creationDate),1,1000,1000)^1.3
>
> While it seems to work pretty well, I've realised that this may not be
> quite as effective as I had hoped, given that the calculation is based on the
> ordinal of the field value rather than the value of the field itself.  In
> cases where the field type is 'date' and the actual field values are not
> distributed evenly across all documents in the index, the value returned by
> rord() is not going to give a true reflection of document age.  For example,
> using Hoss' new date faceting feature, I can see that the rate at which
> documents have been added to the index I'm maintaining has been slowly but
> steadily increasing over the past few months, and I fear this fact will skew
> the boost value calculated by the function listed above.
>
> There doesn't currently seem to be any way of performing date arithmetic
> or converting a date field into an integer (seconds since epoch?). Ideally,
> I'd like to be able to do something like:
>
> recip(intval(parseDate('NOW')-parseDate(creationDate)),1,1000,1000)^1.3
>
> so that the function calculates the boost based on the actual document
> age, rather than the relative age.  Does anybody have any thoughts or
> comments on this approach?
>
> cheers,
> Piete
>
>
>


FunctionQuery and boosting documents using date arithmetic

2007-08-06 Thread Pieter Berkel
I've been using a simple variation of the boost function given in the
examples used to boost more recent documents:

recip(rord(creationDate),1,1000,1000)^1.3

While it seems to work pretty well, I've realised that this may not be quite
as effective as I had hoped, given that the calculation is based on the
ordinal of the field value rather than the value of the field itself.  In
cases where the field type is 'date' and the actual field values are not
distributed evenly across all documents in the index, the value returned by
rord() is not going to give a true reflection of document age.  For example,
using Hoss' new date faceting feature, I can see that the rate at which
documents have been added to the index I'm maintaining has been slowly but
steadily increasing over the past few months, and I fear this fact will skew
the boost value calculated by the function listed above.

There doesn't currently seem to be any way of performing date arithmetic or
converting a date field into an integer (seconds since epoch?). Ideally, I'd
like to be able to do something like:

recip(intval(parseDate('NOW')-parseDate(creationDate)),1,1000,1000)^1.3

so that the function calculates the boost based on the actual document age,
rather than the relative age.  Does anybody have any thoughts or comments on
this approach?

cheers,
Piete


Re: why solr eat my and word

2007-08-02 Thread Pieter Berkel
I realize you've fixed the problem by replacing "and" with "&&", but it's
worth noting that boolean operators in Lucene are case-sensitive: you must
use uppercase "AND" and "OR" in your query for them to work properly.

cheers,
Piete



On 02/08/07, sammael <[EMAIL PROTECTED]> wrote:
>
>
> post.addParameter("q","(post_date:[118598900 TO
> *])and(catregory_id:2)");
>
> returned result contains category_id ==1
>
> also in the case of
> post.addParameter("q","(post_date:[118598900 TO *])");
> post.addParameter("qf","((forum_id:1 forum_id:2)and(category_id:2)");
>
> returned result contains category_id ==1 too
>
> Is it a bug, or am I doing something wrong?
>
> --
> View this message in context:
> http://www.nabble.com/why-solr-eat-my-and-word-tf4204987.html#a11960988
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: MoreLikeThis handler and field collapsing.

2007-07-31 Thread Pieter Berkel
What exactly are you trying to achieve by using the MoreLikeThis handler?  I
created a patch that adds MoreLikeThis functionality (available in the
Standard request handler) to the Dismax handler in
http://issues.apache.org/jira/browse/SOLR-295, which may be of interest
(although unfortunately it's not quite the same as what you requested).

As far as I'm aware, there is no real need for MoreLikeThis to be a
standalone request handler in its own right (and to be honest the current
implementation feels a bit clumsy); rather, it should be incorporated as a
"plugin" or search component in the Standard and Dismax handlers (the way
Facets, Highlighting and Collapsing are currently implemented), which is
what Ryan is trying to achieve with SOLR-281.  I'm hoping the "search
component" idea will gain traction soon, as I'd really like to see the
Dismax request handler support MoreLikeThis functionality! (But I
digress...)

cheers,
Piete



On 31/07/07, Nuno Leitao <[EMAIL PROTECTED]> wrote:
>
> I will take a stab at patching the MoreLikeThis handler - but given
> that I have never touched a single line of Solr code this might fail
> miserably :)
>
> Maybe there is a kind soul which could provide a new patch for
> SOLR-236 which includes field collapse with MLT ?
>
> On 30/07/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> > Nuno Leitao wrote:
> > > Hi,
> > >
> > > I have a 1.3 Solr with the field collapsing patch (SOLR-236 -
> > > http://issues.apache.org/jira/browse/SOLR-236).
> > >
> > > Collapsing works great, but only using the dismax and standard query
> > > handler - I haven't managed to get it to work using the MoreLikeThis
> > > handler though - I am going for a simplistic approach where I just run
> > > a query such as:
> > >
> > > /mlt?start=0&rows=3&collapse.field=collapsefield&collapse.type=normal
> > >
> > > Looking at the SOLR-236 patch it seems the field collapsing has only
> > > been patched into the StandardRequestHandler and the
> > > DisMaxRequestHandler, which would explain why this fails to work
> > > completely, but perhaps someone has found another way ?
> > >
> >
> > I have not tried, but field collapsing should be able to work with the
> > MoreLikeThis handler -- but it is not part of the patch.
> >
> > Given that we keep trying to add more widgets to the search chain, there
> > has been talk of a "search component" based handler that can easily share
> > this sort of functionality.
> >
> > check:
> > https://issues.apache.org/jira/browse/SOLR-281
> >
> http://www.nabble.com/search-components-%28plugins%29-tf3898040.html#a11050274
> >
> > SOLR-281 is now just a quick/dirty brainstorm, but I think it is the
> > likely direction for how field collapsing will be integrated.
> >
> > In short, if you need something to work quickly: apply the same pattern
> > from DisMax and Standard to the MoreLikeThis handler.  If you have more
> > time (and interest) it would be great to add these features to SOLR-281.
>
> >
> >
> > ryan
> >
> >
> >
>


Re: Return only one result per results group

2007-07-25 Thread Pieter Berkel

Debra,

It sounds like what you are trying to do is implemented in a new feature
known as "Field collapsing" (see
https://issues.apache.org/jira/browse/SOLR-236 for more info). Unfortunately
it isn't quite mature enough to be included in the main distribution, so in
order to try it out you'll probably need to obtain the latest Solr source
code from svn trunk, apply the patch "SOLR-236-FieldCollapsing.patch" from
the above URL, and compile it yourself.
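Once patched, a query along these lines (parameter names as used by the
SOLR-236 patch; URL and field name illustrative) should return at most one
article per author:

http://localhost:8983/solr/select?q=art&collapse.field=author&collapse.type=normal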

If you do choose to do this, you might find some information on this wiki
page helpful:
http://wiki.apache.org/solr/HowToContribute

cheers,
Piete



On 26/07/07, Debra <[EMAIL PROTECTED]> wrote:
>
> Is there a way to receive only one result for each group of search
> results, depending on a specified group field?
>
> Example:
> Searching a list of articles with author as the group field would return
> articles that match the query, but will return only one article per author
> even if the author has more than one article matching the query.
>
> Thank you
> --
> View this message in context:
> http://www.nabble.com/Return-only-one-result-per-results-group-tf4148109.html#a11800407
> Sent from the Solr - User mailing list archive at Nabble.com.




Re: DisMax query and date boosting

2007-07-19 Thread Pieter Berkel

Try using a boost function (bf) parameter like this:

bf=recip(rord(listedDate),1,1000,1000)^2.5
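Here recip(x,m,a,b) computes a / (m*x + b), and rord(listedDate) is the
reverse ordinal of the field value (the most recently listed document has
rord = 1), so the newest documents get the largest boost and the
contribution decays smoothly for older ones.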

This should boost documents with more recent listedDate so they appear
higher in the results list. For more info see the wiki page on
DismaxRequestHandler and Functions:

http://wiki.apache.org/solr/DisMaxRequestHandler
http://wiki.apache.org/solr/FunctionQuery

cheers,
Piete



On 19/07/07, climbingrose <[EMAIL PROTECTED]> wrote:


Hi all,

I'm puzzling over how to boost a date field in a DisMax query. Atm, my qf
is
"title^5 summary^1". However, what I really want to do is to allow
document
with latest "listedDate" to have better score. For example, documents with
listedDate:[NOW-1DAY TO *] have additional score over documents with
listedDate:[* TO NOW-10DAY]. Any idea?

--
Regards,

Cuong Hoang