How much data can Solr handle?

2009-06-26 Thread Daniel Löfquist
We're looking to build a search solution that can contain as many as 10 million
different items, and I was wondering whether Solr can handle that volume of data.

Has anybody done any testing or published any results for a Solr installation
working on huge amounts of data like this?

//Daniel

-- 
Daniel Löfquist
Software Engineer

CDON.COM
Bergsgatan 20, Box 385, SE 201 23 Malmö, Sweden

Office: +46 40 601 61 00
Direct: +46 40 601 61 16
Fax: +46 40 601 61 20
E-mail: daniel.lofqu...@it.cdon.com <mailto:daniel.lofqu...@it.cdon.com>

CDON.COM <http://www.cdon.com/>

Confidentiality
Information contained in this e-mail is intended for the use of the
addressee only, and is confidential. Any dissemination, distribution,
copying or use of this communication without prior permission of
the addressee is strictly prohibited. If you are not the intended
addressee you must delete this e-mail and its attachments.


Group by field in Solr

2009-08-20 Thread Daniel Löfquist
Hello,

I'm trying to accomplish something akin to "GROUP BY" in SQL but in Solr.
I have an index full of songs (one song per document in Solr) by various
artists, and I would like to construct a search that gives me all of the
artists back, one row per artist. The current search returns one row per
artist and song.

So I get this right now if I search for "war" in the artist-field:

30 Years War - Ideal Means
30 Years War - Dirty Castle
30 Years War - Misinformed
All Out War - Soaked In Torment
All Out War - Claim Your Innocence
Audio War - Negativity
Audio War - One Drug
Audio War - Super Freak

But this is what I'd really like:

30 Years War - whatever song
All Out War - whatever song
Audio War - whatever song

I tried using facets but couldn't get it to work properly. Does anybody have a clue how to do
something like this?
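
One stopgap is to collapse the rows in the client rather than in the search itself. The following is a minimal sketch, assuming the hits arrive as (artist, song) pairs in ranked order; the function name and the data layout are hypothetical and are not part of Solr.

```python
def first_hit_per_artist(hits):
    """Keep only the first (artist, song) pair seen for each artist."""
    seen = set()
    collapsed = []
    for artist, song in hits:
        if artist not in seen:
            seen.add(artist)
            collapsed.append((artist, song))
    return collapsed


# The rows from the example search for "war" in the artist-field.
hits = [
    ("30 Years War", "Ideal Means"),
    ("30 Years War", "Dirty Castle"),
    ("30 Years War", "Misinformed"),
    ("All Out War", "Soaked In Torment"),
    ("All Out War", "Claim Your Innocence"),
    ("Audio War", "Negativity"),
    ("Audio War", "One Drug"),
    ("Audio War", "Super Freak"),
]

for artist, song in first_hit_per_artist(hits):
    print(f"{artist} - {song}")
```

The drawback of doing this in the client is that paging still counts documents, not artists, so a page of 25 rows may yield fewer than 25 artists.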

//Daniel


No search hits for items starting with one-letter words

2009-10-21 Thread Daniel Löfquist
Hello all,

I have an odd problem. I have a Solr-index containing songs by various artists.
When I perform a search for something that starts with a one-letter word I
receive no hits. If I remove the one-letter word I get hits though.

So for example, if I search for "a hard days night" or "i want you back" I get
0 hits, but if I search for "hard days night" or "want you back" there are hits.

This behaviour doesn't affect items starting with a number. So if a song-title
were to start with a number that's no problem, I will get hits for that.

The fieldtype I'm using for the text-field containing song-title is defined in
my schema.xml like this:

[The fieldType definition was lost when this message was archived.]

Can anyone tell me what may be the source of my problem and how to fix it?

I'm on a deadline so quick answers are greatly appreciated ;-)

Thanks for listening,

//Daniel


Solr interprets UTF-8 as ISO-8859-1

2008-03-31 Thread Daniel Löfquist

Hello,

We're building a web application that uses Solr for searching and I've
come upon a problem that I can't seem to get my head around.

We have a servlet that accepts input via XML-RPC and based on that input
constructs the correct URL to perform a search with the Solr-servlet.

I know that the call to Solr (the URL) from our servlet looks like this
(which is what it should look like):

http://myserver:8080/solrproducts/select/?q=all_SV:ljusblå+status:online&fl=id%2Cartno%2Ctitle_SV%2CtitleSort_SV%2Cdescription_SV%2C&sort=titleSort_SV+asc,id+asc&start=0&q.op=AND&rows=25

But Solr reports the input-fields (the GET-variables in the URL) as:

INFO: /select/
fl=id,artno,title_SV,titleSort_SV,description_SV,&sort=titleSort_SV+asc,id+asc&start=0&q=all_SV:ljusblå+status:online&q.op=AND&rows=25

which is all fine except where it says "ljusblå". Apparently Solr is
interpreting the UTF-8 string "ljusblå" as ISO-8859-1 and thus creates
this garbage that makes the search return 0 when it should in reality
return 3 hits.
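
The effect can be reproduced outside Solr. The sketch below assumes the client percent-encodes the term as UTF-8 and shows what the receiving side gets when it decodes those same bytes as ISO-8859-1; it uses only Python's standard library and says nothing about how Solr or the servlet container is implemented.

```python
from urllib.parse import quote, unquote

term = "ljusblå"

# What a client sends when it percent-encodes the term as UTF-8.
encoded = quote(term, encoding="utf-8")

# What a server sees if it decodes those bytes as ISO-8859-1 instead.
misread = unquote(encoded, encoding="iso-8859-1")

# What it sees when both sides agree on UTF-8.
correct = unquote(encoded, encoding="utf-8")

print(encoded)
print(misread)
print(correct)
```

The two bytes that UTF-8 uses for "å" are read as two separate ISO-8859-1 characters, so the query term no longer equals anything in the index.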

All other searches that don't use special characters work 100% fine.

I'm new to Solr so I'm not sure what I'm doing wrong here. Can anybody
help me out and point me in the direction of a solution?

Sincerely,

Daniel Löfquist



Re: Solved! Solr interprets UTF-8 as ISO-8859-1

2008-04-01 Thread Daniel Löfquist
That did the trick. I actually figured it out on my own 10 minutes after 
I posted to the mailing list. Typical ;-)

Thanks for the help anyway everybody!

//Daniel

Uwe Klosa wrote:

You should set URIEncoding="UTF-8" in your application server. For Tomcat
you can do that on the Connector in server.xml. For Glassfish you have to create a
sun-web.xml containing the corresponding parameters. Your application server
should provide a similar mechanism.

Uwe




Searching "inside of words"

2008-04-17 Thread Daniel Löfquist

Hi,

I'm still pretty new to Solr. We're using it for searching on our site 
right now though.


The configuration is however pretty much based on the example-files that 
come with Solr and there's one type of search that I can't get to work.


Each item has fields called "title" and "description", both of which are 
of type "text".


The type "text" is defined like this in our schema.xml:

[The XML tags were lost when this message was archived; only the following
attribute fragments survive.]

	words="stopwords.txt"/>
	generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>

	ignoreCase="true" expand="true"/>
	words="stopwords.txt"/>
	generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>

My problem is that if I have an item with "title"="Termobyxa", a search 
for "Termo" gives me a hit but if I search for "ermo" or "byxa" I get no 
hit. How do I make it so that this kind of search "inside a word" 
returns a hit?


Sincerely,

Daniel Löfquist



Re: Searching "inside of words"

2008-05-16 Thread Daniel Löfquist

Sorry for taking forever to reply but anyway...

We're using Solr-1.2.0 and can't for various reasons use the 
Nightly-version.
The 1.2.0-version doesn't have NGramFilterFactory and 
EdgeNGramFilterFactory so the only ones I can utilize are 
EdgeNGramTokenizerFactory and NGramTokenizerFactory.


I've done some playing around with them but the best result I've gotten 
so far is a field-type that enables searching for specific letters, for 
example I can search for an item that contains the letters a and x, but 
it returns a hit no matter where these letters are in the text, they 
don't have to be next to each other, and that's not the result I was 
going for. If the field contains "monitor" I want a hit on a search for 
"onit" but not on "rint" for example.


I've never attempted to construct a new field-type of my own before and 
I'm finding the available documentation somewhat incomplete and not very 
helpful so I really need some pointers from people who know better than 
me here.
If anyone could help me out maybe even with some example-code I'd be 
eternally grateful.


//Daniel


Otis Gospodnetic wrote:

Hi Daniel,
Well, searching "inside of words" requires special treatment, because normally 
searches work on words/terms/tokens.

Make use of the following:
$ ff \*NGram\*java
./src/java/org/apache/solr/analysis/EdgeNGramTokenizerFactory.java
./src/java/org/apache/solr/analysis/NGramTokenizerFactory.java
./src/java/org/apache/solr/analysis/NGramFilterFactory.java
./src/java/org/apache/solr/analysis/EdgeNGramFilterFactory.java

Use these to create a new field type that makes Solr tokenize and index your terms as, say, uni-grams.
Instead of (or in addition to) indexing "Termobyxa", index "T e r m o b y x a".
Do the same with the query-time analyzer, and you'll be able to search within words.
 
Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: Searching "inside of words"

2008-05-19 Thread Daniel Löfquist

Thank you for your reply.
I've been trying some things out this morning but I'm still not getting 
it to work properly. I have a feeling that I'm on the right track 
somewhat though.


The type in my schema.xml looks like this:

[The fieldType definition was lost when this message was archived.]

If I'm understanding everything correctly this should create tokens with 
the size of 2 to 18 letters at the time of indexing, right?


However, I can't search properly now. I have to slice my search-string 
up into 2-letter chunks. So if I'm searching for "monitor" I have to 
send "mo+ni+to+r" to Solr. Like this:

http://localhost:8080/solrtest/select/?q=mo+ni+to+r&q.op=AND
when I want it to be like this:
http://localhost:8080/solrtest/select/?q=monitor&q.op=AND

I'm sure I'm doing something completely wrong. I just need someone wiser 
in the ways of Lucene and Solr to point directly at what it is 
that's wrong ;-)


//Daniel

Chris Hostetter wrote:

: so the only ones I can utilize are EdgeNGramTokenizerFactory and
: NGramTokenizerFactory.
: 
: I've done some playing around with them but the best result I've gotten so far

: is a field-type that enables searching for specific letters, for example I can
: search for an item that contains the letters a and x, but it returns a hit no
: matter where these letters are in the text, they don't have to be next to each
: other, and that's not the result I was going for. If the field contains
: "monitor" I want a hit on a search for "onit" but not on "rint" for example.

NGramTokenizerFactory should work fine for this ... the key is to use it 
at indexing time with the appropriate min and max gram sizes to meet your 
needs -- at query time, don't use it at all (use keyword or 
whitespace tokenizer)


so the word "monitor" will be indexed as these tokens (but not 
necessarily in this order)...


  m o n i t o r mo on ni it to or mon oni nit ... onit ...

and at search time when the user gives you "onit" that term will exist.
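
The idea can be sketched in a few lines of Python. This is an illustration of n-gram generation only, not Lucene's tokenizer; the function name and the gram sizes (1 to 7) are assumptions chosen for the example.

```python
def ngrams(word, min_size, max_size):
    """Return every substring of word whose length is between min_size and max_size."""
    grams = set()
    for size in range(min_size, max_size + 1):
        for start in range(len(word) - size + 1):
            grams.add(word[start:start + size])
    return grams


# Index-time: store every n-gram of the field value.
indexed = ngrams("monitor", 1, 7)

# Query-time: the query is kept whole and looked up as a single term.
print("onit" in indexed)
print("rint" in indexed)
```

Because the query is not split into grams, "onit" matches only when those four letters occur next to each other in the indexed word, which rules out "rint".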

: I've never attempted to construct a new field-type of my own before and I'm
: finding the available documentation somewhat incomplete and not very helpful

FWIW: creating a new FieldType is almost never what you need if you 
are dealing with text .. creating new FieldTypes is something that 
typically only needs to be done in cases where you want specialized encoding or 
sorting.


-Hoss





Re: Searching "inside of words"

2008-05-20 Thread Daniel Löfquist
Thanks a million! That totally did the trick. It is now working at least 
95% like I want it to.


Gotta tweak it a little more but it seems like the hard part is over.

Thanks once again to everybody who helped out.

//Daniel

Chris Hostetter wrote:
: You are doing the right thing.  If you are creating n-grams at index 
: time, you have to match that at query time.  If the query is "monitor", 
: you need to pass that through n-gram tokenizer, too.  n-grams of length 
: 18 look a little weird


you don't *have* to use ngrams at query time ... his goal is "partial" 
word matching, so he wants to create various sized ngrams so that input 
like "onit" matches "monitor" but does not match "on it"


Daniel: the options for NGramTokenizerFactory are minGramSize 
and maxGramSize ... not minGram and maxGram ... you are getting the 
defaults (which are 1 and 2 i think)


it confused me too until i tried your schema changes, and then looked at 
the analysis.jsp link and saw only 1 and 2 gram tokens being created .. 
then i checked the class.
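
This also accounts for the earlier symptom of having to send "mo+ni+to+r". The sketch below is a plain-Python illustration, not Solr code: it assumes the defaults really are 1 and 2 as guessed above, takes the intended sizes 2 and 18 from the earlier message, and uses hypothetical helper names.

```python
def ngrams(word, min_size, max_size):
    """Return every substring of word whose length is between min_size and max_size."""
    grams = set()
    for size in range(min_size, max_size + 1):
        for start in range(len(word) - size + 1):
            grams.add(word[start:start + size])
    return grams


def matches(query, indexed_terms):
    """AND semantics: every whitespace-separated query term must be an indexed term."""
    return all(term in indexed_terms for term in query.split())


defaults = ngrams("monitor", 1, 2)    # what was indexed with the misnamed options
intended = ngrams("monitor", 2, 18)   # what the schema was meant to produce

print(matches("monitor", defaults))
print(matches("mo ni to r", defaults))
print(matches("monitor", intended))
print(matches("onit", intended))
```

With only 1-gram and 2-gram tokens in the index, the whole word is never a term, so only the chopped-up query can match; with the intended sizes the unmodified query works.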




-Hoss