RE: Processing solr response

2007-09-04 Thread Jonathan Woods
This kind of thing is what I was getting at in SOLR-344
(https://issues.apache.org/jira/browse/SOLR-344).  There I said I'd post a
prototype Java API - but for now, I've had to give up and go back to my
home-grown Lucene-based code.
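
For anyone needing a stop-gap until such an API exists, below is a minimal
sketch that flattens Solr's standard XML response into Java collections
using only JAXP.  The class name, error handling and the naive treatment of
multi-valued fields are illustrative, not any official API:

import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class SolrXmlFlattener {

    /** Runs a query and flattens each <doc> element into a Map of field name -> text. */
    public static List<Map<String, String>> query(String solrBase, String q) throws Exception {
        URL url = new URL(solrBase + "/select?q=" + URLEncoder.encode(q, "UTF-8"));
        InputStream in = url.openStream();
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            List<Map<String, String>> hits = new ArrayList<Map<String, String>>();
            NodeList docs = dom.getElementsByTagName("doc");
            for (int i = 0; i < docs.getLength(); i++) {
                Map<String, String> fields = new HashMap<String, String>();
                NodeList kids = docs.item(i).getChildNodes();
                for (int j = 0; j < kids.getLength(); j++) {
                    Node kid = kids.item(j);
                    if (kid instanceof Element) {
                        // each field element (<str>, <int>, <arr>...) carries the
                        // field name in its "name" attribute; multi-valued fields
                        // are flattened naively here
                        fields.put(((Element) kid).getAttribute("name"),
                                   kid.getTextContent());
                    }
                }
                hits.add(fields);
            }
            return hits;
        } finally {
            in.close();
        }
    }
}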

 -Original Message-
 From: Ravish Bhagdev [mailto:[EMAIL PROTECTED] 
 Sent: 04 September 2007 15:30
 To: solr-user@lucene.apache.org
 Subject: Processing solr response
 
 Hi,
 
 Apologies if this has been asked before but I couldn't find 
 anything when I searched...
 
 I have been looking at the SolJava examples.  I've been using 
 Nutch/Lucene before, which returns query results nicely in a class 
 with url, title and snippet (summary), while Solr seems to return 
 XML with score and other details but just the url field.
 
 Is there a way to avoid having to deal with XML after each 
 query?  I want to avoid parsing; it would be much better if I 
 could get results directly into a Java data structure like a 
 List or Map etc. through the API.
 
 Also, can anyone point me to an example or documentation that 
 clarifies the XML returned by Solr, and how to get variations of 
 it - e.g. specifying exactly which fields appear in the XML?  
 Hope I'm making sense.
 
 Thanks,
 Ravi
 
 
 



RE: performance questions

2007-08-31 Thread Jonathan Woods
Only if you think the rest of Solr would be better written in JRuby too! 

 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
 Sent: 31 August 2007 02:57
 To: solr-user@lucene.apache.org
 Subject: Re: performance questions
 
 
 On Aug 30, 2007, at 6:31 PM, Mike Klaas wrote:
  Another reason why people use stored procs is to prevent multiple 
  round-trips in a multi-stage query operation.  This is exactly what 
  complex RequestHandlers do (and the equivalent to a custom 
 stored proc 
  would be writing your own handler).
 
 And we should be writing those handlers in JRuby ;)   Who's with me?
 
   Erik
 
 
 
 



RE: range index

2007-08-27 Thread Jonathan Woods
Or you could write your own Analyzer and Tokenizer to produce single values
corresponding, say, to the start of each range.

Jon
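
For the query-time approach Erik suggests further down the thread, the
facet queries might look like this - the price field and bucket boundaries
are hypothetical, and the spaces in the ranges would need URL-encoding in a
real request:

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true
    &facet.query=price:[100 TO 200]
    &facet.query=price:[201 TO 500]
    &facet.query=price:[501 TO 1000]

Each facet.query comes back with its own count, which gives you the range
buckets without changing the index at all.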

 -Original Message-
 From: Jae Joo [mailto:[EMAIL PROTECTED] 
 Sent: 27 August 2007 16:46
 To: solr-user@lucene.apache.org
 Subject: Re: range index
 
 I could build an index with Sales Vol ranges using 
 PatternReplaceFilterFactory:
 
 <filter class="solr.PatternReplaceFilterFactory"
         pattern="(^000[1-4].*)" replacement="10M - 50M"
         replace="all"/>
 <filter class="solr.PatternReplaceFilterFactory"
         pattern="(^000[5-9].*)" replacement="50M - 100M"
         replace="all"/>
 <filter class="solr.PatternReplaceFilterFactory"
         pattern="(^00[1-9].*)" replacement="100M - 1B"
         replace="all"/>
 <filter class="solr.PatternReplaceFilterFactory"
         pattern="(^0[1-9].*)" replacement="\1B"
         replace="all"/>
 
 Thanks,
 
 Jae
 On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote:
 
 
  On Aug 27, 2007, at 9:48 AM, Jae Joo wrote:
  That works, but I am looking at how to do that at INDEXING 
  TIME, not at query time.
  
   Any way for that?
 
  I'm not sure I understand the question.  The example provided works
  at query time.  If you want to bucket things at indexing time you 
  could do that, but there's no real reason to, since Solr's caching 
  makes the range buckets fast at query time.
 
  Could you elaborate on what you are trying to do?
 
  Erik
 
 
 
  
   Thanks,
  
   Jae
  
   On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote:
  
  
   On Aug 27, 2007, at 9:32 AM, Jae Joo wrote:
  Is there any way to categorize by price range?
 
  I would like to facet by price range (e.g. 100-200, 201-500, 
  501-1000, ...).
  
  Yes, look at using facet queries with range queries.  There is an 
  example of this very thing here:
 
  http://wiki.apache.org/solr/SimpleFacetParameters#head-1da3ab3995bc4abcdce8e0f04be7355ba19e9b2c
   
  
  Erik
  
  
 
 
 



Filtering using data only available at query time

2007-08-27 Thread Jonathan Woods
I've got a Lucene-based search implementation which searches over documents
in a CMS and weeds out those hits which aren't accessible to the user
carrying out the search.  The raw search results are returned as an
iterator, and I wrap another iterator around this to silently consume the
inaccessible hits.  (Yes, I know... wasteful!)  The search is therefore
based on data (user permissions) which can't be known at indexing time.
 
I'm now porting the search implementation over to Solr.  I took a look at
FunctionQuery, and wondered if there was some way I could use it to do this
kind of filtering - but as far as I can tell, it's only about scoring a hit
- ValueSource can't signal 'don't include this at all'.  Is there a case for
introducing some kind of boolean include/exclude factor somewhere along the
API?  Or is there another obvious way to do this?  I guess I could implement
my own Query subclass and use it as a filter [query] in the search, but I
wonder whether it would still be useful in FunctionQuery.
 
Jon
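
For illustration, the 'Query subclass as a filter' idea might look like the
sketch below, written against the Lucene 2.x Filter API of the time.
PermissionChecker is a hypothetical hook into the CMS's access-control
layer, and loading every stored document makes this just as brute-force as
the iterator-wrapping approach - a sketch, not a recommendation:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

public class PermissionFilter extends Filter {

    /** Hypothetical callback into the CMS's permission model. */
    public interface PermissionChecker {
        boolean canRead(Document doc) throws IOException;
    }

    private final PermissionChecker checker;

    public PermissionFilter(PermissionChecker checker) {
        this.checker = checker;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < reader.maxDoc(); i++) {
            // evaluates permissions per stored document, once per reader
            if (!reader.isDeleted(i) && checker.canRead(reader.document(i))) {
                bits.set(i);
            }
        }
        return bits;
    }
}

At least the BitSet is computed once per IndexReader rather than once per
hit, but the per-document permission check still happens at query time.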
 


RE: Filtering using data only available at query time

2007-08-27 Thread Jonathan Woods
I know what you mean, and maybe I'm just being obstinate.  But in the
general case, it isn't possible to know these things ahead of time.  The
indexing machinery isn't told about changes in user permissions (e.g.
demotion from administrative to ordinary user), and even if it were I'd hate
to have to reindex everything just to reflect that change.

Jon

 -Original Message-
 From: Daniel Pitts [mailto:[EMAIL PROTECTED] 
 Sent: 27 August 2007 18:10
 To: solr-user@lucene.apache.org
 Subject: RE: Filtering using data only available at query time
 
 Can you add some fields that let you set a filter or query that 
 weeds out the results that the user doesn't have access to?
 
 If it's as simple as Admin versus User, you could have a 
 boolean field called AdminOnly and, when a User is querying, 
 add fq=[* TO *] -AdminOnly:true
 
 You could get more specific if you need to: just index the 
 information that you would use to determine the availability 
 of the record to any given user, and then construct the 
 filter based on the current user.
 
  -Original Message-
  From: Jonathan Woods [mailto:[EMAIL PROTECTED]
  Sent: Monday, August 27, 2007 10:00 AM
  To: solr-user@lucene.apache.org
  Subject: Filtering using data only available at query time
  
  I've got a Lucene-based search implementation which searches over 
  documents in a CMS and weeds out those hits which aren't 
 accessible to 
  the user carrying out the search.  The raw search results 
 are returned 
  as an iterator, and I wrap another iterator around this to silently 
  consume the inaccessible hits.  (Yes, I know... wasteful!)  
 The search 
  is therefore based on data (user permissions) which can't 
 be known at 
  indexing time.
   
  I'm now porting the search implementation over to Solr.  I 
 took a look 
  at FunctionQuery, and wondered if there was some way I 
 could use it to 
  do this kind of filtering - but as far as I can tell, it's 
 only about 
  scoring a hit
  - ValueSource can't signal 'don't include this at all'.  Is there a 
  case for introducing some kind of boolean include/exclude factor 
  somewhere along the API?  Or is there another obvious way 
 to do this?  
  I guess I could implement my own Query subclass and use it as a filter 
  [query] in the search, but I wonder whether it would still be useful 
  in FunctionQuery.
   
  Jon
   
  
 
 
 



RE: range index

2007-08-27 Thread Jonathan Woods
I don't know of any - sorry.  I guess this is more a Lucene issue than a
Solr one, though Solr analyzers should subclass SolrAnalyzer rather than
org.apache.lucene.analysis.Analyzer.

I guess you could Google around for something useful - I had a quick look,
but couldn't find anything compelling.  When I implemented my first
Analyzer, I explored the source code and Javadoc for Analyzer and Tokenizer,
and looked at simple Tokenizers to get an understanding of what's going on.

Jon
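
For what it's worth, a minimal sketch of the single-value idea against the
Lucene 2.x analysis API - the bucketFor() mapping is a hypothetical
stand-in for whatever start-of-range logic fits your data:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class RangeBucketAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, final Reader reader) {
        return new TokenStream() {
            private boolean done = false;

            public Token next() throws IOException {
                if (done) return null;
                done = true;
                StringBuilder sb = new StringBuilder();   // read the whole field value
                char[] buf = new char[256];
                for (int n = reader.read(buf); n != -1; n = reader.read(buf)) {
                    sb.append(buf, 0, n);
                }
                String value = sb.toString().trim();
                // collapse the entire value into one bucket token
                return new Token(bucketFor(value), 0, value.length());
            }
        };
    }

    // Hypothetical mapping from a raw sales-volume code to its range label.
    private static String bucketFor(String value) {
        if (value.startsWith("000")) return "10M - 50M";
        if (value.startsWith("00"))  return "100M - 1B";
        return "1B+";
    }
}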
 

 -Original Message-
 From: Jae Joo [mailto:[EMAIL PROTECTED] 
 Sent: 27 August 2007 17:51
 To: solr-user@lucene.apache.org
 Subject: Re: range index
 
 Any sample code or how-to on writing an Analyzer and Tokenizer available?
 
 Jae
 
 On 8/27/07, Jonathan Woods [EMAIL PROTECTED] wrote:
 
  Or you could write your own Analyzer and Tokenizer to 
 produce single 
  values corresponding, say, to the start of each range.
 
  Jon
 
   -Original Message-
   From: Jae Joo [mailto:[EMAIL PROTECTED]
   Sent: 27 August 2007 16:46
   To: solr-user@lucene.apache.org
   Subject: Re: range index
  
   I could build an index with Sales Vol ranges using 
   PatternReplaceFilterFactory:
  
   <filter class="solr.PatternReplaceFilterFactory"
           pattern="(^000[1-4].*)" replacement="10M - 50M"
           replace="all"/>
   <filter class="solr.PatternReplaceFilterFactory"
           pattern="(^000[5-9].*)" replacement="50M - 100M"
           replace="all"/>
   <filter class="solr.PatternReplaceFilterFactory"
           pattern="(^00[1-9].*)" replacement="100M - 1B"
           replace="all"/>
   <filter class="solr.PatternReplaceFilterFactory"
           pattern="(^0[1-9].*)" replacement="\1B"
           replace="all"/>
  
   Thanks,
  
   Jae
   On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote:
   
   
On Aug 27, 2007, at 9:48 AM, Jae Joo wrote:
  That works, but I am looking at how to do that at INDEXING
  TIME, not at query time.
 
  Any way for that?
   
 I'm not sure I understand the question.  The example provided works
 at query time.  If you want to bucket things at indexing time you
 could do that, but there's no real reason to, since Solr's caching
 makes the range buckets fast at query time.
   
Could you elaborate on what you are trying to do?
   
Erik
   
   
   

 Thanks,

 Jae

 On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote:


 On Aug 27, 2007, at 9:32 AM, Jae Joo wrote:
  Is there any way to categorize by price range?
 
  I would like to facet by price range (e.g. 100-200, 
  201-500, 501-1000, ...).

 Yes, look at using facet queries with range queries.  There is an
 example of this very thing here:

 http://wiki.apache.org/solr/SimpleFacetParameters#head-1da3ab3995bc4abcdce8e0f04be7355ba19e9b2c
 

Erik


   
   
  
 
 
 



RE: Filtering using data only available at query time

2007-08-27 Thread Jonathan Woods
But [the type of user] which has permission can change too.
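
To make the type-based suggestion below concrete - a hypothetical
multi-valued roles field naming the user types allowed to see each
document, with the filter built from the current user at query time:

<!-- schema.xml (hypothetical field) -->
<field name="roles" type="string" indexed="true" stored="false"
       multiValued="true"/>

q=jackets&fq=roles:user     (an ordinary user's search)
q=jackets&fq=roles:admin    (an administrator's search)

Jonathan's objection stands, though: if the set of types allowed to see a
document changes, that document still has to be reindexed.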

 -Original Message-
 From: Daniel Pitts [mailto:[EMAIL PROTECTED] 
 Sent: 27 August 2007 19:07
 To: solr-user@lucene.apache.org
 Subject: RE: Filtering using data only available at query time
 
 I think you're missing my point.
 
 Don't index which users have permission, index which type of 
 user has permission. Then _filter_ based on that.
 
  -Original Message-
  From: Jonathan Woods [mailto:[EMAIL PROTECTED]
  Sent: Monday, August 27, 2007 10:26 AM
  To: solr-user@lucene.apache.org
  Subject: RE: Filtering using data only available at query time
  
  I know what you mean, and maybe I'm just being obstinate.  
  But in the general case, it isn't possible to know these 
 things ahead 
  of time.  The indexing machinery isn't told about changes in user 
  permissions (e.g.
  demotion from administrative to ordinary user), and even if it were 
  I'd hate to have to reindex everything just to reflect that change.
  
  Jon
  
   -Original Message-
   From: Daniel Pitts [mailto:[EMAIL PROTECTED]
   Sent: 27 August 2007 18:10
   To: solr-user@lucene.apache.org
   Subject: RE: Filtering using data only available at query time
   
   Can you add some fields that let you set a filter or query that
   weeds out the results that the user doesn't have access to?
   
   If it's as simple as Admin versus User, you could have a
   boolean field called AdminOnly and, when a User is querying,
   add fq=[* TO *] -AdminOnly:true
   
   You could get more specific if you need to, just provide the 
   information that you would use to determine the 
 availability of the 
   record to any given user, and then construct the filter
  based on the
   current user.
   
-Original Message-
From: Jonathan Woods [mailto:[EMAIL PROTECTED]
Sent: Monday, August 27, 2007 10:00 AM
To: solr-user@lucene.apache.org
Subject: Filtering using data only available at query time

I've got a Lucene-based search implementation which 
 searches over 
documents in a CMS and weeds out those hits which aren't
   accessible to
the user carrying out the search.  The raw search results
   are returned
as an iterator, and I wrap another iterator around this
  to silently
consume the inaccessible hits.  (Yes, I know... wasteful!)
   The search
is therefore based on data (user permissions) which can't
   be known at
indexing time.
 
I'm now porting the search implementation over to Solr.  I
   took a look
at FunctionQuery, and wondered if there was some way I
   could use it to
do this kind of filtering - but as far as I can tell, it's
   only about
scoring a hit
- ValueSource can't signal 'don't include this at all'.  
  Is there a
case for introducing some kind of boolean 
 include/exclude factor 
somewhere along the API?  Or is there another obvious way
   to do this?  
 I guess I could implement my own Query subclass and use it as a filter
 [query] in the search, but I wonder whether it would still be useful
 in FunctionQuery.
 
Jon
 

   
   
   
  
 
 
 



RE: Embedded about 50% faster for indexing

2007-08-27 Thread Jonathan Woods
I don't think you should apologise for highlighting embedded usage.  For
circumstances in which you're at liberty to run a Solr instance in the same
JVM as an app which uses it, I find it very strange that you should have to
use anything _other_ than embedded, and jump through all the unnecessary
hoops (XML conversion, HTTP transport) that this implies.  It's a bit like
suggesting you should throw away Java method invocations altogether, and
write everything in XML-RPC.

Bit of a pet issue of mine!  I'll be creating a JIRA issue on the subject
soon.

Jon

 -Original Message-
 From: Sundling, Paul [mailto:[EMAIL PROTECTED] 
 Sent: 28 August 2007 03:24
 To: solr-user@lucene.apache.org
 Subject: RE: Embedded about 50% faster for indexing
 
 At this point I think I'm going to recommend against embedded, 
 regardless of any performance advantage.  The level of 
 documentation is just too low, while the XML API is clearly 
 documented.  It's clear that XML is preferred.
 
 The embedded example on the wiki is pretty good, but until 
 multiple-core support comes out in the next version, you have 
 to use multiple SolrCore instances.  If they are accessed in the 
 same webapp, then you can't just set JNDI (since you can only have 
 one value).  So you have to use a Config object, as alluded to 
 in the example.  However, if you look at the code, there is 
 no javadoc for the constructor.  The constructor args are 
 (String name, InputStream is, String prefix).  I think name 
 is a unique name for the Solr core, but that is a guess.  
 InputStream may be a stream to the Solr home, but it could be 
 anything.  Prefix may be a URI prefix.  These are all guesses 
 without trying to read through the code.
 
 When I look at SolrCore, it looks like it's a singleton, so 
 maybe I can't even access more than one SolrCore using 
 embedded anyway.  :(  So I apologize for highlighting Embedded.  
 
 Anyway, it's clear how to do multiple Solr cores using XML.  
 You just have a different POST URI for the different cores.  
 You can easily inject that with Spring and externalize the 
 config.  Simple and easy.  So I concede XML is the way to go. :)  
 
 Paul Sundling
 
 -Original Message-
 From: Mike Klaas [mailto:[EMAIL PROTECTED]
 Sent: Monday, August 27, 2007 5:50 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Embedded about 50% faster for indexing
 
 
 On 27-Aug-07, at 12:44 PM, Sundling, Paul wrote:
 
  Whether embedded solr should give me a performance boost or not, it
  did.
  :)  I'm not surprised, since it skips XML parsing.  
 Although you never
  know where cycles are used for sure until you profile.
 
 It certainly is possible that XML parsing dwarfs indexing, but I'd  
 expect that only to occur under very light analysis and field 
 storage  
 workloads.
 
  I tried doing more records per post (200) and it was 
 actually slightly
 
  slower and seemed to require more memory.  This makes sense because
  you
  have to take up more memory for the StringBuilder to store the much
  larger XML.  For 10,000 it was much slower.  For that size I would 
  need XML streaming or something to make it work.
 
  The solr war was on the same machine, so network overhead was only
  from
  using loopback.
 
 The big question is still your connection-handling strategy: are you 
 using persistent HTTP connections?  Are you indexing from multiple 
 threads?
 
 cheers,
 -Mike
 
  Paul Sundling
 
  -Original Message-
  From: climbingrose [mailto:[EMAIL PROTECTED]
  Sent: Monday, August 27, 2007 12:22 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Embedded about 50% faster for indexing
 
 
  Haven't tried the embedded server, but I think I have to agree with 
  Mike.  We're currently sending batches of 2000 jobs to the Solr 
  server, and the amount of time required to transfer documents over 
  HTTP is insignificant compared with the time required to index them.  
  So I do think that unless you are sending documents one by one, 
  embedded Solr shouldn't give you much more of a performance boost.
 
  On 8/25/07, Mike Klaas [EMAIL PROTECTED] wrote:
 
  On 24-Aug-07, at 2:29 PM, Wu, Daniel wrote:
 
  -Original Message-
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of 
  Yonik Seeley
  Sent: Friday, August 24, 2007 2:07 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Embedded about 50% faster for indexing
 
  One thing I'd like to avoid is everyone trying to embed just for 
  performance gains. If there is really that much 
 difference, then we
 
  need a better way for people to get that without 
 resorting to Java 
  code.
 
  -Yonik
 
 
  Theoretically and practically, an embedded solution will be faster 
  than going through HTTP/XML.
 
  This is only true if the HTTP interface adds significant overhead to 
  the cost of indexing a document, and I don't see why this should be 
  so, as indexing is relatively heavyweight.  Setting up the connection 
  could be expensive, but this can be greatly mitigated 
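
As a sketch of the batching-plus-persistent-connection approach discussed
above, using only java.net - HttpURLConnection keeps the underlying socket
alive between requests to the same host, so repeated calls reuse one
connection; the URL and payload here are illustrative:

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;

public class BatchPoster {

    /** Posts one batch of documents, e.g. "<add><doc>...</doc>...</add>". */
    public static void post(String updateUrl, String addXml) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(updateUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        Writer w = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
        w.write(addXml);
        w.close();
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("update failed: HTTP "
                    + conn.getResponseCode());
        }
        conn.getInputStream().close();  // drain so the socket can be reused
    }
}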

RE: Query optimisation - multiple filter caches?

2007-08-22 Thread Jonathan Woods
I understand - thanks, Yonik.

I notice that LuceneQueryOptimizer is still used in
SolrIndexSearcher.search(Query, Filter, Sort) - is the idea then that this
method is deprecated, or that the config parameter
query/boolTofilterOptimizer is no longer to be used?  As for the other
search() methods, they just delegate directly to
org.apache.lucene.search.IndexSearcher, so no use of caches there.

Jon
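
For reference, a request shaped the way Yonik describes below might look
like this (the fields are hypothetical); each fq clause becomes its own
entry in SolrIndexSearcher's filterCache, so common filters are computed
once and intersected cheaply thereafter:

http://localhost:8983/solr/select?q=jackets&fq=country_code:US&fq=inStock:true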

 -Original Message-
 From: Yonik Seeley [mailto:[EMAIL PROTECTED] 
 Sent: 16 August 2007 01:40
 To: solr-user@lucene.apache.org
 Subject: Re: Query optimisation - multiple filter caches?
 
 On 8/15/07, Jonathan Woods [EMAIL PROTECTED] wrote:
  I'm trying to understand how best to integrate directly with Solr 
  (Java-to-Java in the same JVM) to make the most of its query 
  optimisation - chiefly, its caching of queries which merely filter 
  rather than rank results.
 
  I notice that SolrIndexSearcher maintains a filter cache and so does 
  LuceneQueryOptimizer.  Shouldn't they be contributing to/using the 
  same cache, or are they used for different things?
 
 LuceneQueryOptimizer is no longer used, since one can directly 
 specify filters via fq parameters.
 
 -Yonik
 
 
 



Query optimisation - multiple filter caches?

2007-08-15 Thread Jonathan Woods
I'm trying to understand how best to integrate directly with Solr
(Java-to-Java in the same JVM) to make the most of its query optimisation -
chiefly, its caching of queries which merely filter rather than rank
results.
 
I notice that SolrIndexSearcher maintains a filter cache and so does
LuceneQueryOptimizer.  Shouldn't they be contributing to/using the same
cache, or are they used for different things?
 
Jon
 


RE: Best use of wildcard searches

2007-08-11 Thread Jonathan Woods
Thanks, Lance.

I recall reading that Lucene is used in a superfast RDF query engine:

http://www.deri.ie/about/press/releases/details/?uid=55&ref=213.

Jon

 -Original Message-
 From: Lance Norskog [mailto:[EMAIL PROTECTED] 
 
 The Protégé project at Stanford has nice tools for editing 
 knowledge bases, taxonomies, etc.
 
 



RE: Too many open files

2007-08-09 Thread Jonathan Woods
You could try committing updates more frequently, or maybe optimising the
index beforehand (and even during!).  I imagine you could also change the
Solr config, if you have access to it, to tweak indexing (or index creation)
parameters - http://wiki.apache.org/solr/SolrConfigXml should be of use to
you here.
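
Beyond raising the OS file-descriptor limit (e.g. ulimit -n on Linux), a
couple of illustrative solrconfig.xml settings that cut down the number of
files an index holds open - the values here are examples only:

<indexDefaults>
  <!-- pack each segment's many files into a single compound file -->
  <useCompoundFile>true</useCompoundFile>
  <!-- a lower mergeFactor means fewer live segments, so fewer open files -->
  <mergeFactor>10</mergeFactor>
</indexDefaults>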

In the unlikely event I qualify for the M&Ms, I hereby donate them back to
you for giving to someone else!

Jon

-Original Message-
From: Kevin Holmes [mailto:[EMAIL PROTECTED] 
Sent: 09 August 2007 15:23
To: solr-user@lucene.apache.org
Subject: Too many open files

<result status="1">java.io.FileNotFoundException:
/usr/local/bin/apache-solr/enr/solr/data/index/_16ik.tii (Too many open
files)</result>

 

When I'm importing, this is the error I get.  I know it's vague and obscure.
Can someone suggest where to start?  I'll buy a bag of M&Ms (not peanut) for
anyone who can help me solve this*

 

*limit one bag per successful solution for a total maximum of 1 bag to be
given




RE: Best use of wildcard searches

2007-08-09 Thread Jonathan Woods
Maybe there's a different way, in which path-like values like this are
treated explicitly.

I use a similar approach to Matthew's at www.colfes.com, where all pages are
generated from Lucene searches according to filters on a couple of
hierarchical categories ('spaces'), i.e. subject and organisational unit.
From that experience, a few things occur to me here:

1.  The structure of any particular category/space is not immediately
derivable from data, so unless we're Google or doing something RDF-like
it's something you define up front.  For this reason, and because it
makes internationalisation easier, I feel you should model this kind of
standing data independently of its representation.

So instead of searching for Departments > Men's Apparel > Jackets, I index
(and search for) a String /departments/mensapparel/jackets/, and use a
simple standing-data mapping to resolve each of the nodes along the path to
a human-readable form when necessary.  In my case, the values for any
particular resource (e.g. a news article) are defined by CMS users from
drop-downs.

2.  In my Lucene library, I redundantly indexed paths like
/departments/mensapparel/jackets/ into successive fragments, together with
the whole path value:

/departments
/departments/mensapparel
/departments/mensapparel/jackets
/departments/mensapparel/jackets/

using my own PathAnalyzer (extends Analyzer, of course) which makes it very
fast to query on path fragments: all goods anywhere in the men's apparel
section - query on /departments/mensapparel; all goods categorised as
exactly in the men's apparel section - query on
/departments/mensapparel/.

I implemented all queries like this as filters, and cached the filter
definitions.  I guess Solr's query optimisation and filter caching do all
this out of the box, so it may end up being just as fast to use the kind of
PrefixQuery suggested in this thread.

3.  However, I can post/attach/donate PathAnalyzer if anyone thinks it might
still be useful.  I started off calling it HierarchyValueAnalyzer, then
TreeNodePathAnalyzer, but now that it's PathAnalyzer I can't help thinking
it might have lots of applications.

Jon 
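
A sketch of what that PathAnalyzer might look like against the Lucene 2.x
analysis API - this is a reconstruction of the idea described above, not
the actual class, and it assumes values of the form /a/b/c/ (leading slash,
optional trailing slash marking an exact category):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class PathPrefixAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, final Reader reader) {
        return new TokenStream() {
            private String full;
            private String[] parts;
            private int i = 0;
            private final StringBuilder prefix = new StringBuilder();

            public Token next() throws IOException {
                if (parts == null) {              // read the whole value once
                    StringBuilder sb = new StringBuilder();
                    char[] buf = new char[256];
                    for (int n = reader.read(buf); n != -1; n = reader.read(buf)) {
                        sb.append(buf, 0, n);
                    }
                    full = sb.toString().trim();
                    parts = full.substring(1).split("/");
                }
                if (i < parts.length) {           // emit /a, then /a/b, then /a/b/c
                    prefix.append('/').append(parts[i++]);
                    return new Token(prefix.toString(), 0, full.length());
                }
                if (i == parts.length && full.endsWith("/")) {
                    i++;                          // finally the exact-match token
                    return new Token(full, 0, full.length());
                }
                return null;
            }
        };
    }
}

With the fragments indexed, 'anywhere under men's apparel' becomes a simple
TermQuery on /departments/mensapparel, and 'exactly in men's apparel' a
TermQuery on /departments/mensapparel/ - no PrefixQuery expansion needed.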

 -Original Message-
 From: Yonik Seeley [mailto:[EMAIL PROTECTED] 
 Sent: 09 August 2007 21:50
 To: solr-user@lucene.apache.org
 Subject: Re: Best use of wildcard searches
 
 On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
  http://66.209.92.171:8080/solr/select/?q=department_exact:Apparel%3EMen's%20Apparel%3EJackets*&fq=country_code:US&fq=brand_exact:adidas&wt=python
 
  The same exact query, with... wait..
 
  Wow. I'm making myself look like an idiot.
 
  I swear that these queries didn't work the first time I ran them...
 
  But now \  and ? give the same results, as would be expected, 
  while   returns nothing.
 
  I'm sorry for wasting your time, but I do appreciate the help!
 
 lol - these things can happen when you end up needing too many levels 
 of escaping.
 Hopefully we can improve the situation in the future to get 
 rid of the query parser escaping for certain queries such as 
 prefix and term.