javax.xml.stream.XMLStreamException while indexing

2008-07-28 Thread Pieter Berkel
I've recently encountered a strange error while batch indexing around 500 average-sized documents: HTTP Status 500 - null javax.xml.stream.XMLStreamException at com.bea.xml.stream.MXParser.fillBuf(MXParser.java:3700) at com.bea.xml.stream.MXParser.more(MXParser.java:3715) at com.bea.x

Re: Faceting over limited result set

2007-11-13 Thread Pieter Berkel
On Nov 14, 2007 6:44 AM, Mike Klaas <[EMAIL PROTECTED]> wrote: > > An implementation might look like: > > DocList superlist; > int facetDocLimit = params.getInt(DMP.FACET_DOCLIMIT, -1); > if(facetDocLimit > 0 && facetDocLimit != req.getLimit()) { >superlist =

Re: DINSTINCT ON functionality in Solr?

2007-11-12 Thread Pieter Berkel
Currently this functionality is not available in Solr out-of-the-box, however there is a patch implementing Field Collapsing http://issues.apache.org/jira/browse/SOLR-236 which might be similar to what you are trying to achieve. Piete On 13/11/2007, Jörg Kiegeland <[EMAIL PROTECTED]> wrote: > >

Re: Faceting over limited result set

2007-11-12 Thread Pieter Berkel
On 13/11/2007, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > > can you elaborate on your use case ... the only time i've ever seen people > ask about something like this it was because true facet counts were too > expensive to compute, so they were doing "sampling" of the first N > results. > > In

Faceting over limited result set

2007-11-11 Thread Pieter Berkel
I'm trying to obtain faceting information based on the first 'x' (let's say 100-500) results matching a given (dismax) query. The actual documents matching the query are not important in this case, so intuitively the simplest approach I can think of would be to limit the result set to 'x' documents

Re: SOLR 1.3 Release?

2007-10-25 Thread Pieter Berkel
On 26/10/2007, James liu <[EMAIL PROTECTED]> wrote: > > where i can read 1.3 new features? > Take a look at CHANGES.txt in the root directory of svn trunk, or also here: http://svn.apache.org/viewvc/lucene/solr/trunk/CHANGES.txt Piete

Re: Search results problem

2007-10-17 Thread Pieter Berkel
, you shouldn't really have any problems (other than running out of memory). Piete On 17/10/2007, Thorsten Scherler <[EMAIL PROTECTED]> wrote: > > On Wed, 2007-10-17 at 20:44 +1000, Pieter Berkel wrote: > > There is a configuration option called "maxFieldLength" in > > so

Re: Search results problem

2007-10-17 Thread Pieter Berkel
There is a configuration option called "maxFieldLength" in solrconfig.xml with the default value of 10,000. You may need to increase this value if you are indexing fields that are longer. On 17/10/2007, Maximilian Hütter <[EMAIL PROTECTED]> wrote: > > Daniel Naber wrote: > > On Tuesday 16 October 2007 12:03

Re: delete by negative query

2007-10-15 Thread Pieter Berkel
You need to explicitly define the field you are referring to in order to achieve this, otherwise the query parser will assume that the minus character is part of the query and interpret it as field:"-solr" (where "field" is the name of the default field set in your schema). Try: curl http://local

Re: solr tuple/tag store

2007-10-09 Thread Pieter Berkel
On 10/10/2007, Ryan McKinley <[EMAIL PROTECTED]> wrote: > > Without seeing the actual queries that are slow, it's difficult to > determine > > what the problem is. Have you tried using EXPLAIN ( > > http://dev.mysql.com/doc/refman/5.0/en/explain.html) to check if your > query > > is using the tab

Re: Solr and KStem

2007-10-09 Thread Pieter Berkel
Hi Harry, I re-discovered this thread last week and have made some minor changes to the code (removed deprecation warnings) so that it compiles with trunk. I think it would be quite useful to get this stemmer into Solr once all the legal / licensing issues are resolved. If there are no objections

Re: solr tuple/tag store

2007-10-09 Thread Pieter Berkel
Given that the tables are of type InnoDB, I think it's safe to assume that you're not planning to use MySQL full-text search (only supported on MyISAM tables). If you are not concerned about transactional integrity provided by InnoDB, perhaps you could try using MyISAM tables (although most people

Re: Spell Check Handler

2007-10-08 Thread Pieter Berkel
I started to look at this back in August and decided to wait for climbingrose's implementation, however since then my priorities changed and I haven't had a chance to re-visit it. Sounds like there is quite a bit of interest in this feature, so it would be great if those who have made progress on t

Re: Indexing XML

2007-10-05 Thread Pieter Berkel
> SOLR has of course a problem with the XML in the 'originalRecord' field. > Is there a solution to this? Has anyone done this before? I would suggest changing the field type of "originalRecord" to "string" rather than "text", and if you're still having trouble with the XML data simply encapsulat

Re: searching for non-empty fields

2007-09-27 Thread Pieter Berkel
While in theory -URL:"" should be valid syntax, the Lucene query parser doesn't accept it and throws a ParseException. I've considered raising this issue on lucene-dev but it didn't seem to affect many users so I decided not to pursue the matter. On 27/09/2007, Chris Hostetter <[EMAIL PROTECTED

Re: searching for non-empty fields

2007-09-26 Thread Pieter Berkel
I've experienced a similar problem before. Assuming the field type is "string" (i.e. not tokenized), there is a subtle yet important difference between a field that is null (i.e. not contained in the document) and one that is an empty string (in the document but with no value). See http://www.nabble.

Re: Term extraction

2007-09-21 Thread Pieter Berkel
ctory after the SynonymFilter? Cheers, Piete On 21/09/2007, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > On 9/19/07, Pieter Berkel <[EMAIL PROTECTED]> wrote: > > However, I'd like to be able to > > analyze documents more intelligently to recognize phrase keywo

Re: setting absolute path for snapshooter in solrconfig.xml doesn't work

2007-09-19 Thread Pieter Berkel
> var/SolrHome/solr/data/snapshot.20070919201617 > 2007/09/19 20:16:17 ended (elapsed time: 0 sec) > > Thanks, > > -Hui > > > > > On 9/19/07, Pieter Berkel <[EMAIL PROTECTED]> wrote: > > > > See this recent thread for some helpful info: >

Re: Term extraction

2007-09-19 Thread Pieter Berkel
tman <[EMAIL PROTECTED]> wrote: > > On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote: > > > I'm currently looking at methods of term extraction and automatic > > keyword > > generation from indexed documents. > > We do it manually (not in solr, but we put t

Re: Filter by Group

2007-09-19 Thread Pieter Berkel
Sounds like you're on the right track. If your groups overlap (i.e. a document can be in group A and B), then you should ensure your "groups" field is multivalued. If you are searching for "foo" in documents contained in group "A", then it might be more efficient to use a filter query (fq) like: q
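The example query in the message above is cut off, so the following is only an illustrative sketch of the filter-query idea it describes: the search term goes in q and the group restriction goes in fq. The helper name and the default field name are assumptions made for the example; only the "groups" field, the term "foo" and the group "A" come from the message.

```python
from urllib.parse import urlencode

def group_filtered_params(term, group, field="groups"):
    """Build a query string that searches for `term` restricted to one group.

    The restriction is passed as a filter query (fq) rather than being
    ANDed into q, so it can be cached independently of the main query.
    """
    return urlencode({"q": term, "fq": f"{field}:{group}"})

if __name__ == "__main__":
    # Hypothetical usage: search for "foo" within group "A".
    print(group_filtered_params("foo", "A"))
```

Keeping the group restriction out of q also means it does not influence relevance scoring of the main query.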

Re: setting absolute path for snapshooter in solrconfig.xml doesn't work

2007-09-19 Thread Pieter Berkel
See this recent thread for some helpful info: http://www.nabble.com/solr-doesn%27t-find-exe-in-postCommit-event-tf4264879.html#a12167792 You'll probably want to configure your exe with an absolute path rather than the dir: /var/SolrHome/solr/bin/snapshooter . In order to get the snap

Term extraction

2007-09-19 Thread Pieter Berkel
I'm currently looking at methods of term extraction and automatic keyword generation from indexed documents. I've been experimenting with MoreLikeThis and values returned by the "mlt.interestingTerms" parameter and so far this approach has worked well. However, I'd like to be able to analyze docu

Re: Web statistics for solr?

2007-08-22 Thread Pieter Berkel
Matthew, Maybe the SOLR Statistics page would suit your purpose? (click on "statistics" from the main solr page or use the following url) http://localhost:8983/solr/admin/stats.jsp cheers, Piete On 23/08/07, Matthew Runo <[EMAIL PROTECTED]> wrote: > > Hello! > > I was wondering if anyone has w

Re: defining fiels to be returned when using mlt

2007-08-22 Thread Pieter Berkel
Hi Stefan, Currently there is no way to specify the list of fields to be returned by the MoreLikeThis handler. I've been looking to address this issue in https://issues.apache.org/jira/browse/SOLR-295 (point 3) however in the broader scheme of things, it seems logical to wait until https://issues

Re: Structured Lucene documents

2007-08-21 Thread Pieter Berkel
On 21/08/07, Pierre-Yves LANDRON <[EMAIL PROTECTED]> wrote: > > It seems the highlights fields must be specified, and that I can't use the > * completion to do so. > Am I true ? Is there a way to go throught this obligation ? As far as I know, dynamic fields are used mainly during indexing and

Re: clear index

2007-08-20 Thread Pieter Berkel
If you are using solr 1.2 the following command (followed by a commit / optimize) should do the trick: <delete><query>*:*</query></delete> cheers, Piete On 21/08/07, Sundling, Paul <[EMAIL PROTECTED]> wrote: > > what is the best approach to clearing an index? > > The use case is that I'm doing some performance testing with va
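Clearing an index this way amounts to sending a delete-by-query update message whose query is the match-all expression *:*. A minimal sketch that only builds such a message, assuming the standard delete/query update XML format; the helper name is made up for the example, and posting the message and the follow-up commit are left out:

```python
import xml.etree.ElementTree as ET

def delete_by_query_xml(query="*:*"):
    """Return a delete-by-query update message as an XML string."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")

if __name__ == "__main__":
    # The default query matches every document in the index.
    print(delete_by_query_xml())
```

Building the message with an XML library rather than string concatenation keeps any special characters in the query properly escaped.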

Re: Indexing large documents

2007-08-20 Thread Pieter Berkel
You will probably need to increase the value of maxFieldLength in your solrconfig.xml. The default value is 10000 which might explain why your documents are not being completely indexed. Piete On 20/08/07, Peter Manis <[EMAIL PROTECTED]> wrote: > > The that should show some errors if something

Re: sub facets

2007-08-17 Thread Pieter Berkel
Hi Jae Joo, Please provide a bit more information about exactly what you are trying to achieve so we can help you. cheers, Piete On 18/08/07, Jae Joo <[EMAIL PROTECTED]> wrote: > > Hi, > > Can anyone help me how to do sub faces? > Thanks, > > Jae Joo >

Re: solr + carrot2

2007-08-16 Thread Pieter Berkel
Any updates on this? It certainly would be quite interesting to see how well carrot2 clustering can be integrated with solr, I suppose it's a fairly similar concept to simple faceting (maybe another candidate for SOLR-281 component?). One concern I have is that the additional processing required

Re: Function Queries

2007-08-16 Thread Pieter Berkel
Hi Yakn, On 17/08/07, Yakn <[EMAIL PROTECTED]> wrote: > One example is that if you have mm being blank in the solrConfig.xml > and not commented out, then it will throw a NumberFormatException. The required format of the mm field is described in more detail here: http://lucene.apache.org/solr/a

Re: how to retrieve all the documents in an index?

2007-08-15 Thread Pieter Berkel
s it supported for 1.2 or at > all? > > (Right now, I'm using a work-around by a range query for a field whose > range > is known to be larger than 0.) > > > Thanks, > > -Hui > > > > On 8/12/07, Pieter Berkel <[EMAIL PROTECTED]>

Re: schema.xml changes and the impact on Solr?

2007-08-13 Thread Pieter Berkel
On 14/08/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: > > > 2 - Question about the structure of the injected xml file... does it > > need to exactly match the data in solr? I know it makes sense that > > we're only injecting the fields that solr needs and not excluding fields > > that it needs...

Re: Structured Lucene documents

2007-08-13 Thread Pieter Berkel
On 13/08/07, Pierre-Yves LANDRON <[EMAIL PROTECTED]> wrote: > > Hello! Thanks Pieter, That seems a good idea - if not an ideal one - even > if it sort of an hack. I will try it as soon as possible and keep you > informed. The hl.fl parameter doesn't have to be initialized, I think, so > it won't be a

Re: how to retrieve all the documents in an index?

2007-08-12 Thread Pieter Berkel
Try using q=*:* to match all documents in the index. Piete On 13/08/07, Yu-Hui Jin <[EMAIL PROTECTED]> wrote: > > Hi, there, > > I found the following post on the web. Is this still the simplest > get-around > to retrieve all documents in an index? (I'm asking just in case I don't > know > ther
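A small sketch of how the match-all query might be issued from a script. The base URL is an assumption (the default example host and port that appear elsewhere in this archive, plus the standard select handler path), and the helper name is made up for the example:

```python
from urllib.parse import urlencode

# Assumed location of the select handler; adjust to your own deployment.
SOLR_SELECT = "http://localhost:8983/solr/select"

def match_all_url(start=0, rows=10):
    """Return a URL that pages through every document using q=*:*."""
    params = {"q": "*:*", "start": start, "rows": rows}
    return SOLR_SELECT + "?" + urlencode(params)

if __name__ == "__main__":
    # urlencode percent-encodes '*' and ':'; the server decodes them again.
    print(match_all_url(rows=5))
```

Paging with start and rows avoids pulling the whole index back in a single response.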

Re: FunctionQuery and boosting documents using date arithmetic

2007-08-12 Thread Pieter Berkel
Do you consistently add 10,000 documents to your index every day or does the number of new documents added per day vary? On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote: > > I'm having the date boosting function as well. I'm using this function: > F = recip(rord(creationDate),1,1000,1000)^10.

Re: FunctionQuery and boosting documents using date arithmetic

2007-08-12 Thread Pieter Berkel
On 11/08/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > i would agree with you there, this is where a more robust (ie: > less efficient) DateField-ish class that supports configuration options > to specify: > 1) the output format > 2) the input format(s) > 3) the indexed format > ...as Si

Re: Spell Check Handler

2007-08-11 Thread Pieter Berkel
On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote: > > That's exactly what I did with my custom version of the > SpellCheckerHandler. > However, I didn't handle suggestionCount and only returned the one > corrected > phrase which contains the "bes

Re: Spell Check Handler

2007-08-10 Thread Pieter Berkel
On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote: > > The spellchecker handler doesn't seem to work with multi-word query. For > example, when I tried to spellcheck "Java developar", it returns nothing > while if I tried "developar", spellchecker correctly returns "developer". > I > followed the

Re: tomcat and solr multiple instances

2007-08-09 Thread Pieter Berkel
The current working directory (Cwd) is the directory from which you started the Tomcat server and is not dependent on the Solr instance configurations. So as long as SolrHome is correct for each Solr instance, you shouldn't have a problem. cheers, Piete On 10/08/07, Jae Joo <[EMAIL PROTECTED]>

Re: retrieving range of fields for the results

2007-08-08 Thread Pieter Berkel
On 09/08/07, Mike Klaas <[EMAIL PROTECTED]> wrote: > > Faceting ignores pagenation/startat/maxresults/etc. > This is correct, the facet information returned is based on the entire result set matching the query rather than the document set returned by the query. The start and rows parameters have
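Because facet counts cover every matching document regardless of paging, a request can ask for zero rows and still get full counts. A hedged sketch of such a parameter set; the helper name and the "category" field are hypothetical and not taken from the thread:

```python
from urllib.parse import urlencode

def facet_only_params(query, facet_field):
    """Build a query string that returns facet counts but no documents.

    rows=0 suppresses the document list; the counts are still computed
    over the entire set of documents matching the query.
    """
    return urlencode({
        "q": query,
        "rows": 0,
        "facet": "true",
        "facet.field": facet_field,
    })

if __name__ == "__main__":
    # "category" is a made-up field name used only for illustration.
    print(facet_only_params("solr", "category"))
```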

Re: Structured Lucene documents

2007-08-08 Thread Pieter Berkel
In theory, you could store all your pages in a single document using a dynamic field type: Store each page in a separate field (e.g. page1, page2, page3 .. pageN) then at query time, use the highlighting parameters to highlight matches in the page fields. You should be able to determine the page
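Since a later message in this thread notes that the highlight field list cannot use a * wildcard, the page fields would have to be listed explicitly in hl.fl. A minimal sketch of generating that list, assuming the page1 .. pageN naming suggested above; the helper name is made up for the example:

```python
def highlight_params(query, num_pages):
    """Return request parameters that highlight matches in page1..pageN."""
    fields = [f"page{i}" for i in range(1, num_pages + 1)]
    return {
        "q": query,
        "hl": "true",
        # hl.fl takes an explicit comma-separated list of field names.
        "hl.fl": ",".join(fields),
    }

if __name__ == "__main__":
    print(highlight_params("solr", 3))
```

The page number of a match can then be read back from the name of the field in which the highlighted snippet appears.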

Re: retrieving range of fields for the results

2007-08-08 Thread Pieter Berkel
'price' to the original query, the other > adding > sort on 'publish_date'. (the sort order doesn't matter). > 3) get the respective min, max values of these two fields from the first > and > last document returned for each of the two subsequent queries. >

Re: Configuring Synonyms

2007-08-07 Thread Pieter Berkel
What is the fieldType of your Colour field? You must ensure that the particular field that you are using to store Colour information is configured to use solr.SynonymFilterFactory in your schema.xml configuration file. cheers, Piete On 07/08/07, beejenny <[EMAIL PROTECTED]> wrote: > > > Hello,

Re: retrieving range of fields for the results

2007-08-07 Thread Pieter Berkel
The functionality you are describing is called "Faceting" in Solr and can be achieved with a single query, take a look at the following wiki pages for more info: http://wiki.apache.org/solr/SolrFacetingOverview http://wiki.apache.org/solr/SimpleFacetParameters In regards to faceting date fields s

Re: FunctionQuery and boosting documents using date arithmetic

2007-08-06 Thread Pieter Berkel
that still leaves the problem of obtaining the current timestamp to use in the boost function. On 06/08/07, Pieter Berkel <[EMAIL PROTECTED]> wrote: > > I've been using a simple variation of the boost function given in the > examples used to boost more recent docum

FunctionQuery and boosting documents using date arithmetic

2007-08-06 Thread Pieter Berkel
I've been using a simple variation of the boost function given in the examples used to boost more recent documents: recip(rord(creationDate),1,1000,1000)^1.3 While it seems to work pretty well, I've realised that this may not be quite as effective as I had hoped given that the calculation is base

Re: why solr eat my and word

2007-08-02 Thread Pieter Berkel
I realize you've fixed the problem by replacing "and" with "&&" but it's worthwhile to note that boolean operators in lucene are case-sensitive, you must use uppercase "AND" and "OR" in your query for it to work properly. cheers, Piete On 02/08/07, sammael <[EMAIL PROTECTED]> wrote: > > > post.

Re: MoreLikeThis handler and field collapsing.

2007-07-31 Thread Pieter Berkel
What exactly are you trying to achieve by using the MoreLikeThis handler? I created a patch that adds MoreLikeThis functionality (available in the Standard request handler) to the Dismax handler in http://issues.apache.org/jira/browse/SOLR-295 which

Re: Return only one result per results group

2007-07-25 Thread Pieter Berkel
Debra, It sounds like what you are trying to do is implemented in a new feature known as "Field collapsing" (see https://issues.apache.org/jira/browse/SOLR-236 for more info). Unfortunately it isn't quite mature enough to be included in the main distribution so in order to try it out you'll proba

Re: DisMax query and date boosting

2007-07-19 Thread Pieter Berkel
Try using a boost function (bf) parameter like this: bf=recip(rord(listedDate),1,1000,1000)^2.5 This should boost documents with more recent listedDate so they appear higher in the results list. For more info see the wiki page on DismaxRequestHandler and Functions: http://wiki.apache.org/solr/D
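For reference, recip(x,m,a,b) evaluates a/(m*x+b), and the ^2.5 suffix is a weight multiplied into the function's value rather than an exponent. A minimal sketch of that arithmetic in Python; the helper names and the sample ordinals are made up for illustration, and Solr itself is not called:

```python
def recip(x, m, a, b):
    """Mirror of the recip function: a / (m*x + b)."""
    return a / (m * x + b)

def date_boost(rord, weight=2.5):
    """Weighted value of recip(rord(listedDate),1,1000,1000)^weight.

    `rord` is the reverse ordinal of the date: 1 for the newest
    document, growing as documents get older.
    """
    return weight * recip(rord, 1, 1000, 1000)

if __name__ == "__main__":
    # Sample ordinals chosen for illustration only.
    for rord in (1, 1000, 10000):
        print(rord, date_boost(rord))
```

Because the input is an ordinal rather than an actual age, the decay depends on how many distinct newer dates exist in the index, not on elapsed time.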