Collection Distribution in Windows

2007-05-02 Thread Maarten . De . Vilder
i know this is a stupid question, but are there any collection 
distribution scripts for windows available ?

thanks !

UTF-8 2-byte vs 4-byte encodings

2007-05-02 Thread Gereon Steffens
Hi,

I have a question regarding UTF-8 encodings, illustrated by the
utf8-example.xml file. This file contains raw, unescaped UTF-8 characters,
for example the e-acute character (é), represented as two bytes 0xC3 0xA9.
When this file is added to Solr and retrieved later, the XML output
contains a four-byte representation of that character, namely 0xC3 0x83
0xC2 0xA9.

If, on the other hand, the input data contains this same character as an
entity (&#xA9;), the output contains the two-byte encoded representation
0xC3 0xA9.

Why is that so, and is there a way to always get characters like these out
of Solr as their two-byte representations?

The reason I'm asking is that I often have to deal with CDATA sections in
my input files that contain raw (two-byte) UTF8 characters that can't be
encoded as entities.

Thanks,
Gereon



AW: UTF-8 2-byte vs 4-byte encodings

2007-05-02 Thread Burkamp, Christian
Gereon,

The four bytes do not look like a valid UTF-8 encoded character. 4-byte
characters in UTF-8 start with the binary sequence 11110 (for reference,
see the excellent Wikipedia article on UTF-8 encoding).
Your problem looks like someone interpreted your valid 2-byte UTF-8 encoded
character as two single-byte characters in some fancy encoding. This happens if
you send XML updates to Solr via HTTP without setting the encoding properly. It
is not sufficient to set the encoding in the XML declaration; you also need an
HTTP header that sets the encoding (Content-type: text/xml; charset=UTF-8).
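That mis-decode is easy to reproduce with the plain JDK, no Solr involved. The class name and the use of the modern StandardCharsets API are my own choices for illustration, not from this thread:

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    // Format a byte array the way the thread quotes bytes, e.g. "0xC3 0xA9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("0x%02X ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A correctly UTF-8 encoded e-acute is two bytes.
        byte[] correct = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(correct));   // 0xC3 0xA9

        // Misread those two bytes as two single-byte Latin-1 characters,
        // then re-encode as UTF-8: each character now takes two bytes,
        // and the four-byte double encoding appears.
        String misread = new String(correct, StandardCharsets.ISO_8859_1);
        byte[] doubled = misread.getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(doubled));   // 0xC3 0x83 0xC2 0xA9
    }
}
```

Declaring the charset in the Content-type header is what stops the servlet container from doing exactly this misread on the way in.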

--Christian

-----Original Message-----
From: Gereon Steffens [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 2, 2007 09:59
To: solr-user@lucene.apache.org
Subject: UTF-8 2-byte vs 4-byte encodings





Re: AW: UTF-8 2-byte vs 4-byte encodings

2007-05-02 Thread Gereon Steffens
Hi Christian,

> It is not sufficient to set the encoding in the XML but
> you need an additional HTTP header to set the encoding (Content-type:
> text/xml; charset=UTF-8)
Thanks, that's what I was missing.

Gereon



Searchproblem composite words

2007-05-02 Thread Lutz Steinborn

Hi,

I have a search problem with composite words.

For example, I have the composite word "wishlist" in my document. I can
easily find the document by using the search string "wishlist" or "wish*",
but I don't get any result with "list".

I can do a fuzzy search but this gives me too many results.

Is there a better way to fix this problem?


Kind regards,

Lutz Steinborn
4c GmbH


Re: Collection Distribution in Windows

2007-05-02 Thread Bill Au

The collection distribution scripts rely on hard links and rsync.  It
seems that both may be available on Windows:

hard links:
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/fsutil_hardlink.mspx?mfr=true

rsync:
http://samba.anu.edu.au/rsync/download.html

I say "may be" because I don't know if hard links on Windows work the same way
as hard links on Linux/Unix.

You will also need something like cygwin to run the bash scripts.
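One way to probe whether hard links on a given filesystem behave Unix-like is a tiny standalone check. The sketch below uses java.nio.file, which postdates this thread, so treat it as a present-day illustration; the file names are arbitrary:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HardLinkCheck {
    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("linkcheck");
        Path original = dir.resolve("segment.dat");
        Files.write(original, "index data".getBytes(StandardCharsets.UTF_8));

        // Create a second directory entry for the same file contents,
        // which is what the snapshot scripts rely on to avoid copying.
        Path link = dir.resolve("snapshot.dat");
        Files.createLink(link, original);

        // If hard links work as on Unix, both names see identical content.
        String viaLink = new String(Files.readAllBytes(link), StandardCharsets.UTF_8);
        System.out.println(viaLink.equals("index data"));
    }
}
```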

Bill



Re: Leading wildcards

2007-05-02 Thread Michael Pelz Sherman
I just downloaded the latest nightly build of Lucene and compiled it with the
solr 1.1.0 source, and now leading + trailing wildcards work like a charm.

The only issue is, the lucene-core .jar file seems to have a runtime
dependency on clover.jar. Does anyone know if this is intentional, or how I can
get a lucene-core without the clover dependency?

- mps


related multivalued fields

2007-05-02 Thread RJ Tang
I am a newbie to Solr and found it very easy to get started!
However, now I am stuck on the issue of dealing with correlated multi-valued
fields, for example data on scientific publications: each publication has a
list of authors and their respective organizations. Sample data can be
represented as:
<publication>
  <title>Toward better searching</title>
  <author>
    <name>John Smith</name>
    <organization>ACME</organization>
  </author>
  <author>
    <name>Mary Ann</name>
    <organization>Jumbo Inc</organization>
  </author>
</publication>

How can I make Solr handle a query like:
author:"John Smith" AND organization:ACME?

It seems I have to collapse the above sample into:
<publication>
  <title></title>
  <author_name>John Smith, Mary Ann</author_name>
  <author_organization>ACME, Jumbo Inc</author_organization>
</publication>
which obviously won't give me the answer I wanted.

This seems like a generic problem in handling hierarchical data,
and right now I am hitting a roadblock in that Solr only handles
flat, scalar field values.

Would like to hear your suggestion/experience on how to handle the problem.

Regards,
-Jerry



Re: Leading wildcards

2007-05-02 Thread Otis Gospodnetic
As far as I know, there is no clover dependency, at least not in the trunk 
version of Solr.  I tried this cheap trick:

$ strings lib/lucene-core-2.1.0.jar  | grep -i clover

Otis 

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share






Re: Leading wildcards

2007-05-02 Thread Michael Pelz Sherman
Try it on the nightly build, dude:
   
  [EMAIL PROTECTED] tmp]# strings lucene-core-nightly.jar | grep -i clover|more
org/apache/lucene/LucenePackage$__CLOVER_0_0.class
org/apache/lucene/analysis/Analyzer$__CLOVER_1_0.class
org/apache/lucene/analysis/CachingTokenFilter$__CLOVER_2_0.class
org/apache/lucene/analysis/CharTokenizer$__CLOVER_3_0.class
org/apache/lucene/analysis/ISOLatin1AccentFilter$__CLOVER_4_0.class
org/apache/lucene/analysis/KeywordAnalyzer$__CLOVER_5_0.class
org/apache/lucene/analysis/KeywordTokenizer$__CLOVER_6_0.class
...







RE: NullPointerException (not schema related)

2007-05-02 Thread Charlie Jackson
Otis,

Thanks for the response, that list should be very useful!

Charlie

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 02, 2007 11:13 AM
To: solr-user@lucene.apache.org
Subject: Re: NullPointerException (not schema related)

Charlie,

There is nothing built into Solr for that.  But you can use any of the
numerous free proxies/load balancers.  Here is a collection that I've
got:
http://www.simpy.com/user/otis/search/load%2Bbalance+OR+proxy

Otis 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Charlie Jackson [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, May 1, 2007 5:31:13 PM
Subject: RE: NullPointerException (not schema related)

I went with the first approach which got me up and running. Your other
example config (using ./snapshooter) made me realize how foolish my
original problem was!

Anyway, I've got the whole thing up and running and it looks pretty
awesome! 

One quick question, though. As stated in the wiki, one of the benefits
of distributing the indexes is load balancing the queries. Is there a
built-in Solr mechanism for performing this query load balancing? I'm
suspecting there is not, and I haven't seen anything about it in the
wiki, but I wanted to check because I know I'm going to be asked.

Thanks,
Charlie

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, May 01, 2007 3:20 PM
To: solr-user@lucene.apache.org
Subject: RE: NullPointerException (not schema related)


: <listener event="postCommit" class="solr.RunExecutableListener">
:   <str name="exe">snapshooter</str>
:   <str name="dir">/usr/local/Production/solr/solr/bin/</str>
:   <bool name="wait">true</bool>
: </listener>

: the directory. However, when I committed data to the index, I was
: getting "No such file or directory" errors from the Runtime.exec call.
: I verified all of the permissions, etc, with the user I was trying to use.
: In the end, I wrote up a little test program to see if it was a problem
: with the Runtime.exec call and I think it is. I'm running this on CentOS
: 4.4 and Runtime.exec seems to have a hard time directly executing bash
: scripts. For example, if I called Runtime.exec with a command of
: "test_program" (which is a bash script), it failed. If I called
: Runtime.exec with a command of "/bin/bash test_program" it worked.

this initial problem you were having may be a result of path issues. "dir"
doesn't need to be the directory where your script lives, it's the
directory where you want your script to run (the cwd of the process).
it's possible that the error you were getting was because "." isn't in the
PATH that was being used. you should try something like this...

 <listener event="postCommit" class="solr.RunExecutableListener">
   <str name="exe">/usr/local/Production/solr/solr/bin/snapshooter</str>
   <str name="dir">/usr/local/Production/solr/solr/bin/</str>
   <bool name="wait">true</bool>
 </listener>

...or maybe even...

 <listener event="postCommit" class="solr.RunExecutableListener">
   <str name="exe">./snapshooter</str> <!-- note the ./ -->
   <str name="dir">/usr/local/Production/solr/solr/bin/</str>
   <bool name="wait">true</bool>
 </listener>

-Hoss






Re: Leading wildcards

2007-05-02 Thread Michael Pelz Sherman
I tried, but ran into a missing ant file:

  lucene-nightly\build.xml:7: Cannot find common-build.xml imported from
  C:\download\lucene-nightly\build.xml

I've posted to the lucene dev list as well; will try the lucene user list too.

- mps

Otis Gospodnetic [EMAIL PROTECTED] wrote:
  Try building your own jar (ant jar-core in lucene's trunk):

strings /home/otis/dev/repos/lucene/java/trunk/build/lucene-core-2.2-dev.jar | 
grep -i clover

I'll have a look at the nightly later, but you should also bring up that issue 
on [EMAIL PROTECTED] list.

Otis 

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share











Delete by filter?

2007-05-02 Thread Johan Oskarsson

Hi.

First off, thanks for a nice piece of software.

I'm wondering how to delete a range of documents with
a range filter instead of a query. I want to remove all docs with a 
creation date within two dates.


As far as I remember, range filters are much quicker than queries in Lucene.

/Johan


Re: Searchproblem composite words

2007-05-02 Thread Chris Hostetter

: For example, I have the composite word "wishlist" in my document. I can
: easily find the document by using the search string "wishlist" or "wish*",
: but I don't get any result with "list".

what you are describing is basically a substring search problem ...
sometimes this can be dealt with by using something like the
WordDelimiterFilter -- but only if people are using "WishList" in their
documents.

Another approach would be to use an NGram based tokenizer (built in
support for this will probably be added soon) but then searches for things
like "able" will match words like "cable" ... which may not be what you
want (yes it is a substring, but it is not what anyone would consider a
composite word).

the best way to match what you want extremely accurately would be to use
the SynonymFilter and enumerate every composite word you care about in the
Synonym list ... tedious yes, but also very accurate.
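A rough sketch of what that could look like in schema.xml -- the factory class names are stock Solr, but the field type name and the synonyms.txt entry are made up for illustration:

```xml
<!-- synonyms.txt (illustrative): decompose each composite word by hand
wishlist => wishlist, wish, list
-->
<fieldtype name="text_decompound" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```

With expand="true" at index time, a document containing "wishlist" is also indexed under "wish" and "list", so a plain query for list matches.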



-Hoss



Re: Custom HitCollector with SolrIndexSearcher and caching

2007-05-02 Thread Chris Hostetter

: I feel like I might be missing something, and there is in fact a way to
: use a custom HitCollector and benefit from caching, but I just don't see
: it now.

I can't think of any easy way to do what you describe ... you can always
use the low level IndexSearcher methods with a custom HitCollector that
wraps a DocSetHitCollector and then explicitly cache the DocSet yourself,
but that doesn't really help you with the DocList ... there definitely
doesn't seem to be an *easy* way to do what you're describing at the
moment, but with a little refactoring, methods like getDocListAndSet
*could* take in some sort of CompositeHitCollector class with an API
like...

   /**
    * a HitCollector whose collect method will delegate to a specified
    * HitCollector for each match it wants collected
    */
   public abstract class CompositeHitCollector extends HitCollector {
     public abstract void setComposed(HitCollector inner);
   }

...then the meat and potatoes methods of SolrIndexSearcher could take in
your custom written CompositeHitCollector, specify the anonymous inner
HitCollector it needs to use for the case it finds itself in, and now
you've got a window into the collection process where you can muck with
scores or ignore certain matches.

It would be a non trivial change, but it would be possible.




-Hoss



Re: Delete by filter?

2007-05-02 Thread Chris Hostetter

: I'm wondering how to delete a range of documents with
: a range filter instead of a query. I want to remove all docs with a
: creation date within two dates.
:
: As far as I remember, range filters are much quicker than queries in Lucene.

Never fear, the default query parser in Solr does a lot of query magic
under the covers to make things better ... if you do a deleteByQuery and
your query is a range query, Solr will parse it as a ConstantScoreQuery
(backed by a range filter)


(FYI: Filters aren't necessarily faster than queries, they just have
different memory characteristics, dictated by the number of docs instead
of by the number of terms.)
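So a date-range delete can go straight through deleteByQuery. A sketch of the XML posted to the /update handler, where the field name "created" and the bounds are illustrative (the date format assumes a Solr DateField):

```xml
<delete>
  <query>created:[2007-01-01T00:00:00Z TO 2007-02-01T00:00:00Z]</query>
</delete>
```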



:
: /Johan
:



-Hoss



Group results by field?

2007-05-02 Thread Matthew Runo

Hello!

I was wondering - is it possible to search and group the results by a  
given field?


For example, I have an index with several million records. Most of  
them are different sizes of the same style_id.


I'd love to be able to do.. group.by=style_id or something like that  
in the results, and provide the style_id as a clickable link to see  
all the sizes of that style.


Any ideas?

++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++




Re: Group results by field?

2007-05-02 Thread Tom Hill

Hi Matthew,

You might be able to get away with just using facets, depending on
whether your goal is to provide a clickable list of style_ids to the user,
or if you want to only return one search result for each style_id.

For a list of clickable styles, it's basic faceting, and works really well.

http://wiki.apache.org/solr/SimpleFacetParameters
Facet on style_id, present the list of facets to the user, and if the user
selects style_id=37, then reissue the query with one more clause
(+style_id:37).
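A minimal faceted request along those lines might look like the following; the host, port, and query term are placeholders:

```
http://localhost:8983/solr/select?q=shoes&facet=true&facet.field=style_id
```

Each facet count in the response can then become the clickable link, reissuing the query with +style_id:<value> appended.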

If you want the ability to only show one search result from each group, then
you might consider the structure of your data. Is each style/size a separate
record? Or is each style a record with multi-valued sizes? The latter might
give you what you really want.

Or, if you really want to remove dups from search results, you could do what
I've done: I ended up modifying SolrIndexSearcher, replacing
FieldSortedHitQueue and ScorePriorityQueue with versions that remove dups
based on a particular field.

Tom







Re: Group results by field?

2007-05-02 Thread Matthew Runo

Ahh, ok.

I'll check out Saxon-B and XSLT templates.

++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On May 2, 2007, at 3:57 PM, Brian Whitman wrote:



As far as I know there's no in-Solr grouping mechanism. But we use  
the XSLTResponseWriter for this:


http://wiki.apache.org/solr/XsltResponseWriter (look near the bottom)