RE: Processing solr response....
This kind of thing is what I was getting at in SOLR-344 (https://issues.apache.org/jira/browse/SOLR-344). There I said I'd post a prototype Java API - but for now, I've had to give up and go back to my home-grown Lucene-based code. -Original Message- From: Ravish Bhagdev [mailto:[EMAIL PROTECTED] Sent: 04 September 2007 15:30 To: solr-user@lucene.apache.org Subject: Processing solr response Hi, Apologies if this has been asked before, but I couldn't find anything when I searched... I have been looking at the SolJava examples. I've been using Nutch/Lucene before, which returns results from a query nicely in a class with url, title and snippet (summary), whereas Solr seems to return XML with the score and other details alongside just the url field. Is there a way to avoid having to deal with XML after each query? I'd like to avoid parsing it - it would be much better if I could get results directly into a Java data structure like a List or Map through the API. Also, can anyone point me to an example or some documentation that clarifies the XML returned by Solr, and how to get variations of it - for instance, specifying exactly which fields appear in the response? Hope I'm making sense. Thanks, Ravi
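[Editorial note] For later readers, the kind of Java API being asked for here is roughly what the SolrJ client provides (the Java client developed in Solr's trunk around this period and later bundled with Solr 1.3). The sketch below follows SolrJ's class and method names; treat those names, the URL and the field names as assumptions for illustration rather than a description of the Solr 1.2 API.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class SimpleSearch {
        public static void main(String[] args) throws Exception {
            // Point the client at a running Solr instance; the XML handling stays inside the client.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("title:lucene");
            query.setFields("id", "title", "score");   // ask only for the fields we want back
            query.setRows(10);

            QueryResponse response = server.query(query);
            SolrDocumentList docs = response.getResults();   // behaves like a List of SolrDocument

            for (SolrDocument doc : docs) {
                System.out.println(doc.getFieldValue("id") + " : " + doc.getFieldValue("title"));
            }
        }
    }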
RE: performance questions
Only if you think the rest of Solr would be better written in JRuby too! -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: 31 August 2007 02:57 To: solr-user@lucene.apache.org Subject: Re: performance questions On Aug 30, 2007, at 6:31 PM, Mike Klaas wrote: Another reason why people use stored procs is to prevent multiple round-trips in a multi-stage query operation. This is exactly what complex RequestHandlers do (and the equivalent to a custom stored proc would be writing your own handler). And we should be writing those handlers in JRuby ;) Who's with me? Erik
RE: range index
Or you could write your own Analyzer and Tokenizer to produce single values corresponding, say, to the start of each range. Jon -Original Message- From: Jae Joo [mailto:[EMAIL PROTECTED] Sent: 27 August 2007 16:46 To: solr-user@lucene.apache.org Subject: Re: range index I could build the index with Sales Vol ranges using PatternReplaceFilterFactory: <filter class="solr.PatternReplaceFilterFactory" pattern="(^000[1-4].*)" replacement="10M - 50M" replace="all"/> <filter class="solr.PatternReplaceFilterFactory" pattern="(^000[5-9].*)" replacement="50M - 100M" replace="all"/> <filter class="solr.PatternReplaceFilterFactory" pattern="(^00[1-9].*)" replacement="100M - 1B" replace="all"/> <filter class="solr.PatternReplaceFilterFactory" pattern="(^0[1-9].*)" replacement="\1B" replace="all"/> Thanks, Jae On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote: On Aug 27, 2007, at 9:48 AM, Jae Joo wrote: That works. But I am looking at how to do that at INDEXING TIME, not at query time. Any way for that? I'm not sure I understand the question. The example provided works at query time. If you want to bucket things at indexing time you could do that, but there's no real reason to, with Solr's caching making the range buckets fast at query time. Could you elaborate on what you are trying to do? Erik Thanks, Jae On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote: On Aug 27, 2007, at 9:32 AM, Jae Joo wrote: Is there any way to categorize by price range? I would like to facet by price range (e.g. 100-200, 201-500, 501-1000, ...). Yes, look at using facet queries with range queries. There is an example of this very thing here: http://wiki.apache.org/solr/SimpleFacetParameters#head-1da3ab3995bc4abcdce8e0f04be7355ba19e9b2c Erik
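[Editorial note] The wiki example Erik points to amounts to attaching one facet.query parameter per bucket at query time. The field name "price" and the bucket bounds below are illustrative, and for the ranges to order correctly the field needs a numerically sortable type (e.g. sint in the schema of that era):

    q=*:*&facet=true
       &facet.query=price:[100 TO 200]
       &facet.query=price:[201 TO 500]
       &facet.query=price:[501 TO 1000]

The counts come back under facet_counts/facet_queries in the response, one entry per facet.query, and each filter is cached so repeated requests are cheap.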
Filtering using data only available at query time
I've got a Lucene-based search implementation which searches over documents in a CMS and weeds out those hits which aren't accessible to the user carrying out the search. The raw search results are returned as an iterator, and I wrap another iterator around this to silently consume the inaccessible hits. (Yes, I know... wasteful!) The search is therefore based on data (user permissions) which can't be known at indexing time. I'm now porting the search implementation over to Solr. I took a look at FunctionQuery, and wondered if there was some way I could use it to do this kind of filtering - but as far as I can tell, it's only about scoring a hit - ValueSource can't signal 'don't include this at all'. Is there a case for introducing some kind of boolean include/exclude factor somewhere along the API? Or is there another obvious way to do this? I guess I could implement my own Query subclass and use it as a filter [query] in the search, but I wonder whether it would still be useful to have this in FunctionQuery. Jon
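[Editorial note] A minimal sketch of the "write your own filter" idea raised above, against the BitSet-based Filter API Lucene had at the time (double-check the exact signatures for the Lucene version in use). PermissionService and its isVisible check are hypothetical stand-ins for the CMS permission lookup, and the "id" field name is an assumption. Note that this walks every document and reads a stored field per document, so in practice the resulting bit sets would need to be cached per user, and in Solr it would have to be wired in through a custom request handler.

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    // Hypothetical service: answers "can this user see the document with this CMS id?"
    interface PermissionService {
        boolean isVisible(String userId, String cmsId);
    }

    public class PermissionFilter extends Filter {
        private final String userId;
        private final PermissionService permissions;

        public PermissionFilter(String userId, PermissionService permissions) {
            this.userId = userId;
            this.permissions = permissions;
        }

        // One bit per document; only documents the user may see are set.
        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;
                Document doc = reader.document(i);      // expensive: one stored-field read per doc
                String cmsId = doc.get("id");           // assumes an "id" field holding the CMS id
                if (cmsId != null && permissions.isVisible(userId, cmsId)) {
                    bits.set(i);
                }
            }
            return bits;
        }
    }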
RE: Filtering using data only available at query time
I know what you mean, and maybe I'm just being obstinate. But in the general case, it isn't possible to know these things ahead of time. The indexing machinery isn't told about changes in user permissions (e.g. demotion from administrative to ordinary user), and even if it were, I'd hate to have to reindex everything just to reflect that change. Jon -Original Message- From: Daniel Pitts [mailto:[EMAIL PROTECTED] Sent: 27 August 2007 18:10 To: solr-user@lucene.apache.org Subject: RE: Filtering using data only available at query time Can you add some fields that let you set a filter or query that weeds out the results that the user doesn't have access to? If it's as simple as Admin versus User, you could have a boolean field called AdminOnly, and when a User is querying, add fq=[* TO *] -AdminOnly:true You could get more specific if you need to; just provide the information that you would use to determine the availability of the record to any given user, and then construct the filter based on the current user. -Original Message- From: Jonathan Woods [mailto:[EMAIL PROTECTED] Sent: Monday, August 27, 2007 10:00 AM To: solr-user@lucene.apache.org Subject: Filtering using data only available at query time I've got a Lucene-based search implementation which searches over documents in a CMS and weeds out those hits which aren't accessible to the user carrying out the search. The raw search results are returned as an iterator, and I wrap another iterator around this to silently consume the inaccessible hits. (Yes, I know... wasteful!) The search is therefore based on data (user permissions) which can't be known at indexing time. I'm now porting the search implementation over to Solr. I took a look at FunctionQuery, and wondered if there was some way I could use it to do this kind of filtering - but as far as I can tell, it's only about scoring a hit - ValueSource can't signal 'don't include this at all'. Is there a case for introducing some kind of boolean include/exclude factor somewhere along the API? Or is there another obvious way to do this? I guess I could implement my own Query subclass and use it as a filter [query] in the search, but I wonder whether it would still be useful to have this in FunctionQuery. Jon
RE: range index
I don't know of any - sorry. I guess this is more a Lucene issue than a Solr one, though Solr analyzers should subclass SolrAnalyzer rather than org.apache.lucene.analysis.Analyzer. I guess you could Google around for something useful - I had a quick look, but couldn't find anything compelling. When I implemented my first Analyzer, I explored the source code and Javadoc for Analyzer and Tokenizer, and looked at simple Tokenizers to get an understanding of what's going on. Jon -Original Message- From: Jae Joo [mailto:[EMAIL PROTECTED] Sent: 27 August 2007 17:51 To: solr-user@lucene.apache.org Subject: Re: range index Any sample code or how-to for writing an Analyzer and Tokenizer available? Jae On 8/27/07, Jonathan Woods [EMAIL PROTECTED] wrote: Or you could write your own Analyzer and Tokenizer to produce single values corresponding, say, to the start of each range. Jon -Original Message- From: Jae Joo [mailto:[EMAIL PROTECTED] Sent: 27 August 2007 16:46 To: solr-user@lucene.apache.org Subject: Re: range index I could build the index with Sales Vol ranges using PatternReplaceFilterFactory: <filter class="solr.PatternReplaceFilterFactory" pattern="(^000[1-4].*)" replacement="10M - 50M" replace="all"/> <filter class="solr.PatternReplaceFilterFactory" pattern="(^000[5-9].*)" replacement="50M - 100M" replace="all"/> <filter class="solr.PatternReplaceFilterFactory" pattern="(^00[1-9].*)" replacement="100M - 1B" replace="all"/> <filter class="solr.PatternReplaceFilterFactory" pattern="(^0[1-9].*)" replacement="\1B" replace="all"/> Thanks, Jae On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote: On Aug 27, 2007, at 9:48 AM, Jae Joo wrote: That works. But I am looking at how to do that at INDEXING TIME, not at query time. Any way for that? I'm not sure I understand the question. The example provided works at query time. If you want to bucket things at indexing time you could do that, but there's no real reason to, with Solr's caching making the range buckets fast at query time. Could you elaborate on what you are trying to do? Erik Thanks, Jae On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote: On Aug 27, 2007, at 9:32 AM, Jae Joo wrote: Is there any way to categorize by price range? I would like to facet by price range (e.g. 100-200, 201-500, 501-1000, ...). Yes, look at using facet queries with range queries. There is an example of this very thing here: http://wiki.apache.org/solr/SimpleFacetParameters#head-1da3ab3995bc4abcdce8e0f04be7355ba19e9b2c Erik
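[Editorial note] Since sample code was asked for: a rough sketch of the single-token approach suggested above, written against the Lucene 2.x Analyzer/TokenStream API of the time (double-check the exact TokenStream and Token signatures for your Lucene version). The bucketing rules are illustrative only - mirror whatever the real sales-volume format requires.

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // Emits exactly one token per field value: the label of the range bucket the value falls into.
    public class SalesRangeAnalyzer extends Analyzer {

        public TokenStream tokenStream(String fieldName, final Reader reader) {
            return new TokenStream() {
                private boolean done = false;

                public Token next() throws IOException {
                    if (done) {
                        return null;                       // only one token per value
                    }
                    done = true;
                    String raw = readAll(reader).trim();
                    String bucket = toBucket(raw);
                    return new Token(bucket, 0, raw.length());
                }
            };
        }

        private static String readAll(Reader reader) throws IOException {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }

        // Illustrative bucketing, loosely mirroring the PatternReplaceFilterFactory rules quoted above.
        private static String toBucket(String raw) {
            if (raw.matches("^000[1-4].*")) return "10M - 50M";
            if (raw.matches("^000[5-9].*")) return "50M - 100M";
            if (raw.matches("^00[1-9].*"))  return "100M - 1B";
            return "1B+";
        }
    }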
RE: Filtering using data only available at query time
But [the type of user] which has permission can change too. -Original Message- From: Daniel Pitts [mailto:[EMAIL PROTECTED] Sent: 27 August 2007 19:07 To: solr-user@lucene.apache.org Subject: RE: Filtering using data only available at query time I think you're missing my point. Don't index which users have permission, index which type of user has permission. Then _filter_ based on that. -Original Message- From: Jonathan Woods [mailto:[EMAIL PROTECTED] Sent: Monday, August 27, 2007 10:26 AM To: solr-user@lucene.apache.org Subject: RE: Filtering using data only available at query time I know what you mean, and maybe I'm just being obstinate. But in the general case, it isn't possible to know these things ahead of time. The indexing machinery isn't told about changes in user permissions (e.g. demotion from administrative to ordinary user), and even if it were, I'd hate to have to reindex everything just to reflect that change. Jon -Original Message- From: Daniel Pitts [mailto:[EMAIL PROTECTED] Sent: 27 August 2007 18:10 To: solr-user@lucene.apache.org Subject: RE: Filtering using data only available at query time Can you add some fields that let you set a filter or query that weeds out the results that the user doesn't have access to? If it's as simple as Admin versus User, you could have a boolean field called AdminOnly, and when a User is querying, add fq=[* TO *] -AdminOnly:true You could get more specific if you need to; just provide the information that you would use to determine the availability of the record to any given user, and then construct the filter based on the current user. -Original Message- From: Jonathan Woods [mailto:[EMAIL PROTECTED] Sent: Monday, August 27, 2007 10:00 AM To: solr-user@lucene.apache.org Subject: Filtering using data only available at query time I've got a Lucene-based search implementation which searches over documents in a CMS and weeds out those hits which aren't accessible to the user carrying out the search. The raw search results are returned as an iterator, and I wrap another iterator around this to silently consume the inaccessible hits. (Yes, I know... wasteful!) The search is therefore based on data (user permissions) which can't be known at indexing time. I'm now porting the search implementation over to Solr. I took a look at FunctionQuery, and wondered if there was some way I could use it to do this kind of filtering - but as far as I can tell, it's only about scoring a hit - ValueSource can't signal 'don't include this at all'. Is there a case for introducing some kind of boolean include/exclude factor somewhere along the API? Or is there another obvious way to do this? I guess I could implement my own Query subclass and use it as a filter [query] in the search, but I wonder whether it would still be useful to have this in FunctionQuery. Jon
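[Editorial note] To make Daniel's suggestion concrete: each document gets a multi-valued field listing the role types allowed to see it, and the filter query is built from the current user's roles at search time. The field and role names below are invented for illustration:

    At index time:
        <field name="allowedRoles">admin</field>
        <field name="allowedRoles">editor</field>

    At query time, for a user currently holding the roles "editor" and "member":
        fq=allowedRoles:(editor OR member)

A change on the user side (e.g. demotion from admin) needs no reindexing under this scheme - only the fq changes - while Jon's counter-argument is that the set of roles allowed on a document can itself change, which would still mean reindexing that document.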
RE: Embedded about 50% faster for indexing
I don't think you should apologise for highlighting embedded usage. For circumstances in which you're at liberty to run a Solr instance in the same JVM as an app which uses it, I find it very strange that you should have to use anything _other_ than embedded, and jump through all the unnecessary hoops (XML conversion, HTTP transport) that this implies. It's a bit like suggesting you should throw away Java method invocations altogether and write everything in XML-RPC. Bit of a pet issue of mine! I'll be creating a JIRA issue on the subject soon. Jon -Original Message- From: Sundling, Paul [mailto:[EMAIL PROTECTED] Sent: 28 August 2007 03:24 To: solr-user@lucene.apache.org Subject: RE: Embedded about 50% faster for indexing At this point I think I'm going to recommend against embedded, regardless of any performance advantage. The level of documentation is just too low, while the XML API is clearly documented. It's clear that XML is preferred. The embedded example on the wiki is pretty good, but until multiple core support comes out in the next version, you have to use multiple SolrCores. If they are accessed in the same webapp, then you can't just set JNDI (since you can only have one value). So you have to use a Config object, as alluded to in the example. However, if you look at the code, there is no Javadoc for the constructor. The constructor args are (String name, InputStream is, String prefix). I think name is a unique name for the Solr core, but that is a guess. The InputStream may be a stream to the Solr home, but it could be anything. The prefix may be a URI prefix. These are all guesses without trying to read through the code. When I look at SolrCore, it looks like it's a singleton, so maybe I can't even access more than one SolrCore using embedded anyway. :( So I apologize for highlighting Embedded. Anyway, it's clear how to do multiple Solr cores using XML: you just have a different post URI for the different cores. You can easily inject that with Spring and externalize the config. Simple and easy. So I concede XML is the way to go. :) Paul Sundling -Original Message- From: Mike Klaas [mailto:[EMAIL PROTECTED] Sent: Monday, August 27, 2007 5:50 PM To: solr-user@lucene.apache.org Subject: Re: Embedded about 50% faster for indexing On 27-Aug-07, at 12:44 PM, Sundling, Paul wrote: Whether embedded Solr should give me a performance boost or not, it did. :) I'm not surprised, since it skips XML parsing. Although you never know where cycles are used for sure until you profile. It certainly is possible that XML parsing dwarfs indexing, but I'd expect that only to occur under very light analysis and field storage workloads. I tried doing more records per post (200) and it was actually slightly slower and seemed to require more memory. This makes sense because you have to take up more memory in the StringBuilder to store the much larger XML. For 10,000 it was much slower. For that size I would need XML streaming or something to make it work. The Solr WAR was on the same machine, so network overhead was only from using loopback. The big question is still your connection-handling strategy: are you using persistent HTTP connections? Are you indexing from multiple threads? cheers, -Mike Paul Sundling -Original Message- From: climbingrose [mailto:[EMAIL PROTECTED] Sent: Monday, August 27, 2007 12:22 AM To: solr-user@lucene.apache.org Subject: Re: Embedded about 50% faster for indexing Haven't tried the embedded server, but I think I have to agree with Mike.
We're currently sending batches of 2000 jobs to the Solr server, and the amount of time required to transfer documents over HTTP is insignificant compared with the time required to index them. So I think that unless you are sending documents one by one, embedded Solr shouldn't give you much more of a performance boost. On 8/25/07, Mike Klaas [EMAIL PROTECTED] wrote: On 24-Aug-07, at 2:29 PM, Wu, Daniel wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Friday, August 24, 2007 2:07 PM To: solr-user@lucene.apache.org Subject: Re: Embedded about 50% faster for indexing One thing I'd like to avoid is everyone trying to embed just for performance gains. If there is really that much difference, then we need a better way for people to get that without resorting to Java code. -Yonik Theoretically and practically, an embedded solution will be faster than going through HTTP/XML. This is only true if the HTTP interface adds significant overhead to the cost of indexing a document, and I don't see why this should be so, as indexing is relatively heavyweight. Setting up the connection could be expensive, but this can be greatly mitigated
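[Editorial note] On the connection-handling and batching points raised in this thread: a minimal sketch of posting a whole batch of documents in a single HTTP request using plain java.net. The update URL, field names and the shape of the input are placeholders; commons-httpclient with a reused connection, or several indexing threads, would be the next step if transport is still the bottleneck.

    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;

    public class BatchPoster {

        // Posts one <add> containing a whole batch of documents, rather than one HTTP call per document.
        public static void postBatch(List<String[]> docs) throws Exception {
            StringBuilder xml = new StringBuilder("<add>");
            for (String[] doc : docs) {
                xml.append("<doc>")
                   .append("<field name=\"id\">").append(escape(doc[0])).append("</field>")
                   .append("<field name=\"title\">").append(escape(doc[1])).append("</field>")
                   .append("</doc>");
            }
            xml.append("</add>");

            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://localhost:8983/solr/update").openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");

            Writer out = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
            out.write(xml.toString());
            out.close();

            if (conn.getResponseCode() != 200) {
                throw new RuntimeException("Update failed: HTTP " + conn.getResponseCode());
            }
            conn.getInputStream().close();   // drain the response so the connection can be reused (keep-alive)
        }

        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }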
RE: Query optimisation - multiple filter caches?
I understand - thanks, Yonik. I notice that LuceneQueryOptimizer is still used in SolrIndexSearcher.search(Query, Filter, Sort) - is the idea then that this method is deprecated, or that the config parameter query/boolTofilterOptimizer is no longer to be used? As for the other search() methods, they just delegate directly to org.apache.lucene.search.IndexSearcher, so no use of caches there. Jon -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: 16 August 2007 01:40 To: solr-user@lucene.apache.org Subject: Re: Query optimisation - multiple filter caches? On 8/15/07, Jonathan Woods [EMAIL PROTECTED] wrote: I'm trying to understand how best to integrate directly with Solr (Java-to-Java in the same JVM) to make the most of its query optimisation - chiefly, its caching of queries which merely filter rather than rank results. I notice that SolrIndexSearcher maintains a filter cache and so does LuceneQueryOptimizer. Shouldn't they be contributing to/using the same cache, or are they used for different things? LuceneQueryOptimizer is no longer used since one can directly specify filters via fq parameters. -Yonik
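[Editorial note] In practical terms, the way to hit SolrIndexSearcher's filter cache from an ordinary request is to move the non-scoring clauses into fq parameters, e.g. (field names illustrative):

    q=jakarta&fq=category:books&fq=inStock:true

Each fq is run and cached as a separate filter and then intersected with the main q query, so a repeated filter costs very little after its first use.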
Query optimisation - multiple filter caches?
I'm trying to understand how best to integrate directly with Solr (Java-to-Java in the same JVM) to make the most of its query optimisation - chiefly, its caching of queries which merely filter rather than rank results. I notice that SolrIndexSearcher maintains a filter cache and so does LuceneQueryOptimizer. Shouldn't they be contributing to/using the same cache, or are they used for different things? Jon
RE: Best use of wildcard searches
Thanks, Lance. I recall reading that Lucene is used in a superfast RDF query engine: http://www.deri.ie/about/press/releases/details/?uid=55&ref=213. Jon -Original Message- From: Lance Norskog [mailto:[EMAIL PROTECTED] The Protégé project at Stanford has nice tools for editing knowledge bases, taxonomies, etc.
RE: Too many open files
You could try committing updates more frequently, or maybe optimising the index beforehand (and even during!). I imagine you could also change the Solr config, if you have access to it, to tweak indexing (or index creation) parameters - http://wiki.apache.org/solr/SolrConfigXml should be of use to you here. In the unlikely event I qualify for the M&Ms, I hereby donate them back to you for giving to someone else! Jon -Original Message- From: Kevin Holmes [mailto:[EMAIL PROTECTED] Sent: 09 August 2007 15:23 To: solr-user@lucene.apache.org Subject: Too many open files <result status="1">java.io.FileNotFoundException: /usr/local/bin/apache-solr/enr/solr/data/index/_16ik.tii (Too many open files) When I'm importing, this is the error I get. I know it's vague and obscure. Can someone suggest where to start? I'll buy a bag of M&Ms (not peanut) for anyone who can help me solve this* *limit one bag per successful solution, for a total maximum of 1 bag to be given
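[Editorial note] Two concrete things worth checking for the "Too many open files" error: the process's file-descriptor limit (e.g. ulimit -n on Linux), and the merge settings in solrconfig.xml - a lower mergeFactor and/or the compound file format keep fewer segment files open at once. The snippet below only names the relevant elements; the values are illustrative, not recommendations:

    <indexDefaults>
      <useCompoundFile>true</useCompoundFile>  <!-- pack each segment into a single .cfs file -->
      <mergeFactor>10</mergeFactor>            <!-- lower values mean fewer live segments at a time -->
      ...
    </indexDefaults>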
RE: Best use of wildcard searches
Maybe there's a different way, in which path-like values like this are treated explicitly. I use a similar approach to Matthew's at www.colfes.com, where all pages are generated from Lucene searches according to filters on a couple of hierarchical categories ('spaces'), i.e. subject and organisational unit. From that experience, a few things occur to me here: 1. The structure of any particular category/space is not immediately derivable from data, so unless we're Google or doing something RDF-like, it's something you define up front. For this reason, and because it makes internationalisation easier, I feel you should model this kind of standing data independently of its representation. So instead of searching for Departments>Men's Apparel>Jackets, I index (and search for) a String /departments/mensapparel/jackets/, and use a simple standing-data mapping to resolve each of the nodes along the path to a human-readable form when necessary. In my case, the values for any particular resource (e.g. a news article) are defined by CMS users from drop-downs. 2. In my Lucene library, I redundantly indexed paths like /departments/mensapparel/jackets/ into successive fragments, together with the whole path value: /departments /departments/mensapparel /departments/mensapparel/jackets /departments/mensapparel/jackets/ using my own PathAnalyzer (extends Analyzer, of course), which makes it very fast to query on path fragments: all goods anywhere in the men's apparel section - query on /departments/mensapparel; all goods categorised as exactly in the men's apparel section - query on /departments/mensapparel/. I implemented all queries like this as filters, and cached the filter definitions. I guess Solr's query optimisation and filter caching do all this out of the box, so it may end up being just as fast to use the kind of PrefixQuery suggested in this thread. 3. However, I can post/attach/donate PathAnalyzer if anyone thinks it might still be useful. I started off calling it HierarchyValueAnalyzer, then TreeNodePathAnalyzer, but now that it's PathAnalyzer I can't help thinking it might have lots of applications... Jon -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: 09 August 2007 21:50 To: solr-user@lucene.apache.org Subject: Re: Best use of wildcard searches On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote: http://66.209.92.171:8080/solr/select/?q=department_exact:Apparel%3EMen's%20Apparel%3EJackets*&fq=country_code:US&fq=brand_exact:adidas&wt=python The same exact query, with... wait.. Wow. I'm making myself look like an idiot. I swear that these queries didn't work the first time I ran them... But now \ and ? give the same results, as would be expected, while returns nothing. I'm sorry for wasting your time, but I do appreciate the help! lo - these things can happen when you get too many levels of escaping needed. Hopefully we can improve the situation in the future to get rid of the query parser escaping for certain queries such as prefix and term.
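[Editorial note] For anyone curious what the redundant-fragment indexing described in point 2 amounts to, the core of it is just expanding a path into its prefixes. A trivial sketch (not the actual PathAnalyzer):

    import java.util.ArrayList;
    import java.util.List;

    public class PathFragments {

        // "/departments/mensapparel/jackets/" ->
        //   "/departments", "/departments/mensapparel",
        //   "/departments/mensapparel/jackets", "/departments/mensapparel/jackets/"
        public static List<String> expand(String path) {
            List<String> fragments = new ArrayList<String>();
            String[] parts = path.split("/");
            StringBuilder current = new StringBuilder();
            for (String part : parts) {
                if (part.length() == 0) continue;        // skip leading/trailing empty segments
                current.append('/').append(part);
                fragments.add(current.toString());       // open-ended prefix: matches the whole subtree
            }
            if (path.endsWith("/")) {
                fragments.add(current.toString() + "/"); // closed form: matches this exact node only
            }
            return fragments;
        }
    }

Indexing each fragment as its own term is what makes "everything under /departments/mensapparel" a cheap single-term query instead of a prefix expansion.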