Collection Distribution in Windows
I know this is a stupid question, but are there any collection distribution scripts available for Windows? Thanks!
UTF-8 2-byte vs 4-byte encodings
Hi, I have a question regarding UTF-8 encodings, illustrated by the utf8-example.xml file. This file contains raw, unescaped UTF-8 characters, for example the e-acute character (é), represented as the two bytes 0xC3 0xA9. When this file is added to Solr and retrieved later, the XML output contains a four-byte representation of that character, namely 0xC2 0x83 0xC2 0xA9. If, on the other hand, the input data contains this same character as an entity (&#A9;), the output contains the two-byte encoded representation 0xC3 0xA9. Why is that so, and is there a way to always get characters like these out of Solr as their two-byte representations? The reason I'm asking is that I often have to deal with CDATA sections in my input files that contain raw (two-byte) UTF-8 characters that can't be encoded as entities. Thanks, Gereon
AW: UTF-8 2-byte vs 4-byte encodings
Gereon, the four bytes do not look like a valid UTF-8 encoded character: 4-byte characters in UTF-8 start with the binary sequence 11110 (for reference, see the excellent Wikipedia article on UTF-8). Your problem looks like something interpreted your valid 2-byte UTF-8 character as two single-byte characters in some other encoding and re-encoded them. This happens if you send XML updates to Solr via HTTP without setting the encoding properly. It is not sufficient to set the encoding in the XML; you need an additional HTTP header to set the encoding (Content-type: text/xml; charset=UTF-8). --Christian
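Christian's diagnosis can be reproduced in a few lines of Python. This is a sketch: the exact mojibake bytes depend on which single-byte charset the receiving end assumed (Latin-1 here, which yields 0xC3 0x83 0xC2 0xA9 - close to, though not byte-for-byte identical with, what Gereon reports).

```python
# 'é' (U+00E9) encoded correctly in UTF-8 is the two bytes C3 A9.
correct = "é".encode("utf-8")
assert correct == b"\xc3\xa9"

# If a server misreads those two bytes as two Latin-1 characters and
# re-encodes the result as UTF-8, each byte balloons into a 2-byte
# sequence - the classic double-encoding mojibake:
mojibake = correct.decode("latin-1").encode("utf-8")
assert mojibake == b"\xc3\x83\xc2\xa9"  # four bytes instead of two
```

Sending the update with an explicit `Content-type: text/xml; charset=UTF-8` header, as Christian suggests, prevents the server from guessing a single-byte charset in the first place.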
Re: AW: UTF-8 2-byte vs 4-byte encodings
Hi Christian, > It is not sufficient to set the encoding in the XML but you need an additional HTTP header to set the encoding (Content-type: text/xml; charset=UTF-8) Thanks, that's what I was missing. Gereon
Searchproblem composite words
Hi, I have a search problem with composite words. For example, I have the composite word wishlist in my document. I can easily find the document by using the search string wishlist or wish*, but I don't get any result with list. I can do a fuzzy search, but this gives me too many results. Is there a better way to fix this problem? Kind regards, Lutz Steinborn 4c GmbH
Re: Collection Distribution in Windows
The collection distribution scripts rely on hard links and rsync. It seems that both may be available on Windows. Hard links: http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/fsutil_hardlink.mspx?mfr=true rsync: http://samba.anu.edu.au/rsync/download.html I say "may be" because I don't know if hard links on Windows work the same way as hard links on Linux/Unix. You will also need something like Cygwin to run the bash scripts. Bill
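One quick way to check whether hard links behave as expected on a given filesystem is a few lines of Python (a sketch using only the standard library; on Windows, `os.link` requires NTFS):

```python
import os
import tempfile

# Create a file and a hard link to it, then verify both names refer to
# the same underlying data (the link count of the inode goes to 2).
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "segment.dat")
dst = os.path.join(tmpdir, "segment.link")
with open(src, "w") as f:
    f.write("index data")

os.link(src, dst)  # a hard link, not a copy
assert os.stat(src).st_nlink == 2
with open(dst) as f:
    assert f.read() == "index data"
```

If the assertions pass, the filesystem supports the hard-link semantics the snapshot scripts depend on; whether the rest of the bash tooling runs under Cygwin is a separate question.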
Re: Leading wildcards
I just downloaded the latest nightly build of Lucene and compiled it with the solr 1.1.0 source, and now leading + trailing wildcards work like a charm. The only issue is, the lucene-core .jar file seems to have a runtime dependency on clover.jar. Does anyone know if this is intentional, or how I can get a lucene-core without the clover dependency? - mps
related multivalued fields
I am a newbie to Solr and found it very easy to get started! However, now I am stuck on the issue of dealing with correlated vector fields, for example data on scientific publications. A publication has a list of authors and their respective organizations. Sample data can be represented as:

<publication>
  <title>Toward better searching</title>
  <author>
    <name>John Smith</name>
    <organization>ACME</organization>
  </author>
  <author>
    <name>Mary Ann</name>
    <organization>Jumbo Inc</organization>
  </author>
</publication>

How can I make Solr handle a query like author:"John Smith" AND organization:ACME? It seems I have to collapse the above sample into:

<publication>
  <title></title>
  <author_name>John Smith, Mary Ann</author_name>
  <author_organization>ACME, Jumbo Inc</author_organization>
</publication>

which obviously won't give me the answer I want. This seems like a generic problem in handling hierarchical data, and right now I am hitting a roadblock in that Solr only handles flat, scalar field values. I would like to hear your suggestions/experience on how to handle this problem. Regards, -Jerry
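One common workaround for this (a sketch of an indexing-client technique, not a built-in Solr feature; the field name `author_org` and the `|` separator are illustrative choices) is to index each correlated pair as a single combined value in one multivalued field, so the correlation survives flattening:

```python
# Hypothetical flattening step in an indexing client: combine each
# author/organization pair into one token before sending it to Solr.
publication = {
    "title": "Toward better searching",
    "authors": [("John Smith", "ACME"), ("Mary Ann", "Jumbo Inc")],
}

author_org = ["%s|%s" % (name, org) for name, org in publication["authors"]]
assert author_org == ["John Smith|ACME", "Mary Ann|Jumbo Inc"]

# A query for John Smith at ACME then becomes a phrase match on the
# combined field, e.g. author_org:"John Smith|ACME", which cannot
# accidentally pair John Smith with Jumbo Inc.
```

The cost is that you can no longer search the name and organization parts independently in that field, so the combined field usually supplements, rather than replaces, the plain author and organization fields.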
Re: Leading wildcards
As far as I know, there is no clover dependency, at least not in the trunk version of Solr. I tried this cheap trick: $ strings lib/lucene-core-2.1.0.jar | grep -i clover Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share
Re: Leading wildcards
Try it on the nightly build, dude:

[EMAIL PROTECTED] tmp]# strings lucene-core-nightly.jar | grep -i clover | more
org/apache/lucene/LucenePackage$__CLOVER_0_0.class
org/apache/lucene/analysis/Analyzer$__CLOVER_1_0.class
org/apache/lucene/analysis/CachingTokenFilter$__CLOVER_2_0.class
org/apache/lucene/analysis/CharTokenizer$__CLOVER_3_0.class
org/apache/lucene/analysis/ISOLatin1AccentFilter$__CLOVER_4_0.class
org/apache/lucene/analysis/KeywordAnalyzer$__CLOVER_5_0.class
org/apache/lucene/analysis/KeywordTokenizer$__CLOVER_6_0.class
...

- mps
RE: NullPointerException (not schema related)
Otis, thanks for the response, that list should be very useful! Charlie

-----Original Message-----
From: Otis Gospodnetic
Sent: Wednesday, May 02, 2007 11:13 AM
To: solr-user@lucene.apache.org
Subject: Re: NullPointerException (not schema related)

Charlie, there is nothing built into Solr for that, but you can use any of the numerous free proxies/load balancers. Here is a collection that I've got: http://www.simpy.com/user/otis/search/load%2Bbalance+OR+proxy Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share

-----Original Message-----
From: Charlie Jackson
Sent: Tuesday, May 1, 2007 5:31:13 PM
Subject: RE: NullPointerException (not schema related)

I went with the first approach, which got me up and running. Your other example config (using ./snapshooter) made me realize how foolish my original problem was! Anyway, I've got the whole thing up and running and it looks pretty awesome! One quick question, though. As stated in the wiki, one of the benefits of distributing the indexes is load balancing the queries. Is there a built-in Solr mechanism for performing this query load balancing? I'm suspecting there is not, and I haven't seen anything about it in the wiki, but I wanted to check because I know I'm going to be asked. Thanks, Charlie

-----Original Message-----
From: Chris Hostetter
Sent: Tuesday, May 01, 2007 3:20 PM
Subject: RE: NullPointerException (not schema related)

: <listener event="postCommit" class="solr.RunExecutableListener">
:   <str name="exe">snapshooter</str>
:   <str name="dir">/usr/local/Production/solr/solr/bin/</str>
:   <bool name="wait">true</bool>
: </listener>

: the directory. However, when I committed data to the index, I was getting "No such file or directory" errors from the Runtime.exec call. I verified all of the permissions, etc., with the user I was trying to use.
: In the end, I wrote up a little test program to see if it was a problem with the Runtime.exec call, and I think it is. I'm running this on CentOS 4.4, and Runtime.exec seems to have a hard time directly executing bash scripts. For example, if I called Runtime.exec with a command of "test_program" (which is a bash script), it failed. If I called Runtime.exec with a command of "/bin/bash test_program", it worked.

This initial problem you were having may be a result of path issues. "dir" doesn't need to be the directory where your script lives; it's the directory where you want your script to run (the cwd of the process). It's possible that the error you were getting was because "." isn't in the PATH that was being used. You should try something like this...

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">/usr/local/Production/solr/solr/bin/snapshooter</str>
  <str name="dir">/usr/local/Production/solr/solr/bin/</str>
  <bool name="wait">true</bool>
</listener>

...or maybe even...

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">./snapshooter</str> <!-- note the ./ -->
  <str name="dir">/usr/local/Production/solr/solr/bin/</str>
  <bool name="wait">true</bool>
</listener>

-Hoss
Re: Leading wildcards
I tried, but ran into a missing ant file: lucene-nightly\build.xml:7: Cannot find common-build.xml imported from C:\download\lucene-nightly\build.xml I've posted to the lucene dev list as well; will try the lucene user list too. - mps

Otis Gospodnetic wrote: Try building your own jar ("ant jar-core" in lucene's trunk): strings /home/otis/dev/repos/lucene/java/trunk/build/lucene-core-2.2-dev.jar | grep -i clover I'll have a look at the nightly later, but you should also bring up that issue on the [EMAIL PROTECTED] list. Otis
Delete by filter?
Hi. First off, thanks for a nice piece of software. I'm wondering how to delete a range of documents with a range filter instead of a query. I want to remove all docs with a creation date between two dates. As far as I remember, range filters are much quicker than queries in Lucene. /Johan
Re: Searchproblem composite words
: For example I have the composite word wishlist in my document. I can
: easily find the document by using the search string wishlist or wish*
: but I don't get any result with list.

What you are describing is basically a substring search problem. Sometimes this can be dealt with by using something like the WordDelimiterFilter -- but only if people are using "WishList" in their documents. Another approach would be to use an NGram-based tokenizer (built-in support for this will probably be added soon), but then searches for things like "able" will match words like "cable" ... which may not be what you want (yes, it is a substring, but it is not what anyone would consider a composite word).

The best way to match what you want extremely accurately would be to use the SynonymFilter and enumerate every composite word you care about in the synonym list ... tedious, yes, but also very accurate. -Hoss
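Hoss's caveat about NGram matching can be illustrated with a toy tokenizer (a sketch in Python; Solr's eventual NGram support would do the equivalent at analysis time):

```python
def ngrams(word, n):
    """All substrings of length n - a crude stand-in for an NGram tokenizer."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# 'list' is a 4-gram of 'wishlist', so a query for list would now match...
assert "list" in ngrams("wishlist", 4)

# ...but 'able' is equally a 4-gram of 'cable' - the false positive
# Hoss warns about: a substring, not a composite word.
assert "able" in ngrams("cable", 4)
```

This is why the SynonymFilter approach, while tedious, is the only one of the three that distinguishes genuine compounds from accidental substrings.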
Re: Custom HitCollector with SolrIndexSearcher and caching
: I feel like I might be missing something, and there is in fact a way to
: use a custom HitCollector and benefit from caching, but I just don't see
: it now.

I can't think of any easy way to do what you describe ... you can always use the low-level IndexSearcher methods with a custom HitCollector that wraps a DocSetHitCollector and then explicitly cache the DocSet yourself, but that doesn't really help you with the DocList ... there definitely doesn't seem to be an *easy* way to do what you're describing at the moment, but with a little refactoring, methods like getDocListAndSet *could* take in some sort of CompositeHitCollector class with an API like...

/**
 * A HitCollector whose collect method will delegate to a specified
 * HitCollector for each match it wants collected.
 */
public abstract class CompositeHitCollector extends HitCollector {
  public abstract void setComposed(HitCollector inner);
}

...then the meat-and-potatoes methods of SolrIndexSearcher could take in your custom-written CompositeHitCollector, specify the anonymous inner HitCollector it needs to use for the case it finds itself in, and now you've got a window into the collection process where you can muck with scores or ignore certain matches. It would be a non-trivial change, but it would be possible. -Hoss
Re: Delete by filter?
: I'm wondering how to delete a range of documents with
: a range filter instead of a query. I want to remove all docs with a
: creation date within two dates.
:
: As far as I remember range filters are much quicker than queries in lucene.

Never fear, the default query parser in Solr does a lot of query magic under the covers to make things better ... if you do a deleteByQuery and your query is a range query, Solr will parse it as a ConstantScoreQuery (backed by a range filter). (FYI: Filters aren't necessarily faster than queries; they just have different memory characteristics, dictated by the number of docs instead of the number of terms.) -Hoss
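Concretely, the deletion Johan describes can be expressed as a delete-by-query update message containing a range query (a sketch; the field name `creationDate` and the dates are placeholders for whatever the actual schema uses):

```xml
<delete>
  <query>creationDate:[2007-01-01T00:00:00Z TO 2007-05-01T00:00:00Z]</query>
</delete>
```

Posted to the update handler, this deletes every document whose creation date falls inside the range, and per Hoss's note the range query is executed as a constant-score, filter-backed query under the covers.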
Group results by field?
Hello! I was wondering - is it possible to search and group the results by a given field? For example, I have an index with several million records. Most of them are different sizes of the same style_id. I'd love to be able to do something like group.by=style_id in the results, and provide the style_id as a clickable link to see all the sizes of that style. Any ideas? ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++
Re: Group results by field?
Hi Matthew, You might be able to get away with just using facets, depending on whether your goal is to provide a clickable list of style_ids to the user, or to return only one search result for each style_id. For a list of clickable styles, it's basic faceting, and works really well: http://wiki.apache.org/solr/SimpleFacetParameters Facet on style_id, present the list of facets to the user, and if the user selects style_id=37, reissue the query with one more clause (+style_id:37). If you want the ability to show only one search result from each group, then you might consider the structure of your data. Is each style/size a separate record? Or is each style a record with multi-valued sizes? The latter might give you what you really want. Or, if you really want to remove dups from search results, you could do what I've done: I ended up modifying SolrIndexSearcher, replacing FieldSortedHitQueue and ScorePriorityQueue with versions that remove dups based on a particular field. Tom
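The two requests Tom describes can be sketched as query strings (a sketch only; the host, the `select` path, and the query term "shoes" are assumptions, not from the thread):

```python
from urllib.parse import urlencode

# Step 1: the faceted search - ask Solr for facet counts on style_id.
params = urlencode({
    "q": "shoes",
    "facet": "true",
    "facet.field": "style_id",
    "rows": 10,
})
url = "http://localhost:8983/solr/select?" + params
assert "facet.field=style_id" in url

# Step 2: the drill-down - the user clicked style_id 37, so reissue the
# query with one more required clause, as Tom suggests.
drill = urlencode({"q": "shoes +style_id:37"})
assert drill == "q=shoes+%2Bstyle_id%3A37"
```

The facet counts come back alongside the normal results, so the clickable list of styles and the first page of hits cost a single request.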
Re: Group results by field?
Ahh, ok. I'll check out Saxon-B and XSLT templates. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++

On May 2, 2007, at 3:57 PM, Brian Whitman wrote: As far as I know there's no in-Solr grouping mechanism. But we use the XSLTResponseWriter for this: http://wiki.apache.org/solr/XsltResponseWriter (look near the bottom)