SOLR deduplication

2011-01-26 Thread Jason Brown
Hi - I have the SOLR deduplication configured and working well.

Is there any way I can tell which documents have not been added to the index as 
a result of the deduplication rejecting subsequent identical documents?

Many Thanks

Jason Brown.

If you wish to view the St. James's Place email disclaimer, please use the link 
below

http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer


Dismax score - maximum of any one field?

2010-12-20 Thread Jason Brown

Can anyone tell me how the dismax score is computed? Is it the maximum score 
for any of the component fields that are searched? Thank You.
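
A minimal sketch of how a disjunction-max ("dismax") clause combines per-field scores for a single query term: the best field score, plus a tie-breaker fraction of the remaining field scores. The field names and score values are made up for illustration. In a real dismax request the per-term results are then summed across terms, and any phrase or function boosts are added on top.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores for one term: max plus tie * (sum of the rest)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# Hypothetical scores of one term in three fields.
scores = {"title": 2.0, "keywords": 1.0, "content": 0.5}

# tie=0.0 gives the pure maximum; tie=1.0 would give the plain sum.
print(dismax_score(list(scores.values())))
print(dismax_score(list(scores.values()), tie=0.1))
```

With the tie-breaker at 0 the answer to the question is yes, the score for a term is the maximum over the fields; a non-zero tie lets the other fields contribute a little.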



De-duplication not working as I expected - duplicates still getting into the index

2010-12-14 Thread Jason Brown
I have configured de-duplication according to the Wiki.

My signature field is defined thus...  



and my updateRequestProcessor as follows



  true
  false
  signature
  content
  org.apache.solr.update.processor.Lookup3Signature



  

I am using SOLRJ to write to the index with the binary (as opposed to XML) so 
my update handler is defined as below.

 

  dedupe

  

However I was expecting SOLR to only allow 1 instance of a duplicate document 
into the index, but I get the following results when I query my index...

I have deliberately added my ISA Letter file 4 times and can see it has 
correctly generated an identical signature for the first 4 entries 
(d91a5ce933457fd5). The fifth entry is a different document and correctly has a 
different signature. 

I was expecting to only see 1 instance of the duplicate. Am I misinterpreting 
the way it works? Many Thanks.


?

ISA Letter
d91a5ce933457fd5

?

ISA Letter
d91a5ce933457fd5

?

ISA Letter
d91a5ce933457fd5

?

ISA Letter
d91a5ce933457fd5

?

ISA Mailing pack letter
fd9d9e1c0de32fb5
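
A rough simulation of the behaviour described above, under two assumptions: that the `false` value in the processor settings is the `overwriteDupes` flag (as in the wiki example), and that when the flag is true a new document replaces any earlier document carrying the same signature. The hash used here is only a stand-in; Lookup3Signature uses a different algorithm, and the document ids are hypothetical.

```python
import hashlib


def signature(text):
    # Stand-in hash for illustration; not the Lookup3 algorithm.
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:16]


def add_document(index, doc_id, content, overwrite_dupes):
    """Add a document, optionally removing earlier ones with the same signature."""
    sig = signature(content)
    if overwrite_dupes:
        for existing_id in [k for k, v in index.items() if v["signature"] == sig]:
            del index[existing_id]
    index[doc_id] = {"content": content, "signature": sig}


def build(overwrite_dupes):
    """Index the same letter four times, then one different document."""
    index = {}
    for i in range(4):
        add_document(index, f"isa-{i}", "ISA Letter", overwrite_dupes)
    add_document(index, "pack-1", "ISA Mailing pack letter", overwrite_dupes)
    return index


print(len(build(overwrite_dupes=False)))  # 5: signatures are computed, duplicates stay
print(len(build(overwrite_dupes=True)))   # 2: one ISA Letter plus the mailing pack
```

If that assumption holds, the four identical signatures in the results are what the configuration asks for: the signature is generated and stored, but nothing is removed until overwriting is switched on.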




RE: Boost on newer documents

2010-11-30 Thread Jason Brown
Hi - you do understand my case - we tried what you suggested but as the 
relevancy is very precise we couldn't get it to do a dual-sort.

I like the idea of using one of the dismax parameters (bf) to in-effect 
increase the boost on a newer document. 

Thanks for all replies, most useful.


-Original Message-
From: Savvas-Andreas Moysidis [mailto:savvas.andreas.moysi...@googlemail.com]
Sent: Tue 30/11/2010 09:26
To: solr-user@lucene.apache.org
Subject: Re: Boost on newer documents
 
hi,

I might not understand your case right but can you not add an extra
publishedDate field and then specify a secondary (after relevance) sort by
that?

On 30 November 2010 08:05,  wrote:

> You could also put a short representation of the data (I suggest days since
> 01.01.2010) as payload and calculate boost with payload function of the
> similarity.
>
> >-Original Message-----
> >From: ext Jason Brown [mailto:jason.br...@sjp.co.uk]
> >Sent: Montag, 29. November 2010 17:28
> >To: solr-user@lucene.apache.org
> >Subject: Boost on newer documents
> >
> >
> >Hi,
> >
> >I use the dismax query to search across several fields.
> >
> >I find I have a lot of documents with the same document name (one of the
> fields that the dismax queries) so I wanted to adjust the
> >relevance so that titles with a newer published date have a higher
> relevance than documents with the same title but are older. Does
> >anyone know how I can achieve this?
> >
> >Thank You
> >
> >Jason.
> >
>




RE: Boost on newer documents

2010-11-29 Thread Jason Brown
Great - Thank You.


-Original Message-
From: Mat Brown [mailto:m...@patch.com]
Sent: Mon 29/11/2010 16:33
To: solr-user@lucene.apache.org
Subject: Re: Boost on newer documents
 
Hi Jason,

You can use boost functions in the dismax handler to do this:

http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29

Mat
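
A sketch of what such a request might look like, assuming a date field called publishedDate (the name suggested elsewhere in this thread) and hypothetical qf field names. The `recip(ms(NOW,publishedDate),3.16e-11,1,1)` function is the date-boost pattern from the Solr function query documentation; whether it suits a given schema (for example, whether the date field type supports `ms()`) is something to verify.

```python
from urllib.parse import urlencode


def recip(x, m, a, b):
    """Same shape as Solr's recip function: a / (m * x + b)."""
    return a / (m * x + b)


MS_PER_YEAR = 365.25 * 24 * 3600 * 1000

# How the boost decays with document age (3.16e-11 is roughly 1 / ms-per-year).
for years in (0, 1, 5):
    print(years, round(recip(years * MS_PER_YEAR, 3.16e-11, 1, 1), 3))

params = {
    "defType": "dismax",
    "q": "fund manager",
    "qf": "title keywords content",  # hypothetical field names
    "bf": "recip(ms(NOW,publishedDate),3.16e-11,1,1)",
}
query_string = urlencode(params)
print("/select?" + query_string)
```

The boost is added to the relevance score, so a brand-new document gets about 1.0 extra and a one-year-old document about 0.5; among documents with the same title the newer one should then rank higher.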

On Mon, Nov 29, 2010 at 11:28, Jason Brown  wrote:
>
> Hi,
>
> I use the dismax query to search across several fields.
>
> I find I have a lot of documents with the same document name (one of the 
> fields that the dismax queries) so I wanted to adjust the relevance so that 
> titles with a newer published date have a higher relevance than documents 
> with the same title but are older. Does anyone know how I can achieve this?
>
> Thank You
>
> Jason.
>
>




Boost on newer documents

2010-11-29 Thread Jason Brown

Hi,

I use the dismax query to search across several fields.

I find I have a lot of documents with the same document name (one of the fields 
that the dismax queries) so I wanted to adjust the relevance so that titles 
with a newer published date have a higher relevance than documents with the 
same title but are older. Does anyone know how I can achieve this?

Thank You

Jason.



RE: Synonym Filtering on String Fields

2010-11-26 Thread Jason Brown
Thanks Erick - I do exactly want multiple terms generated from my string field 
i.e.

I want the single term fund manager summary to be turned into 2 terms -> fund 
manager summary, fund manager report
I want the single term guide to be turned into 2 terms -> guide, product 
guide

I am using "term" synonymously with what will be in the index. (I appreciate the 
outputs of the synonym filter won't be stored per se, just added as terms to the 
index.)

The problem is that I am doing this on a field as I described below and was 
having trouble with the multi-word terms. The behaviour is:

guide is getting turned into 3 terms: guide, product, guide (I only want 2: 
guide and product guide)
fund manager summary and fund manager report have no effect in the synonym 
filter; the output is the same as the input.

I need these as strings (I don't search on this field, it's just for faceting); 
I have another text field which I do the search on.

I will give Ahmet's comments a go. Thanks All.



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Fri 26/11/2010 14:16
To: solr-user@lucene.apache.org
Subject: Re: Synonym Filtering on String Fields
 
Besides Ahmet's comments, I have to wonder if you want to do this in a
single field? The problem is that you're expanding your synonyms into a
field. Let's say you expand "memory" into "memory", "recall" and "RAM". Now
you have three tokens in your field. What does faceting mean now? Perhaps
you would be better off using the <copyField> directive to make a field for
faceting and use the Solr text type for your searchable field? Of course
this may be way off base.

About your point (1), you say synonyms aren't getting picked up. You might
be getting fooled by seeing the stored value. Look in the admin page under
"schema browser" to see the terms in the index, which would have the
synonyms. Just selecting the document via search will only show you the
stored values, which would NOT have the synonyms.

Best
Erick

On Fri, Nov 26, 2010 at 5:15 AM, Jason Brown  wrote:

>
> I have the following field type set up in my schema. The idea is to fire
> phrases of text such as 'fund manager summary' (without the quotes) at it,
> and for the synonym processing to recognise this, and add the rest of the
> synonyms (index-time synonym processing with expansion) to the index from my
> synonym file (example below)
>
>   positionIncrementGap="100">
>  
>
> ignoreCase="true" expand="true"/>
> 
>  
>
>  
>
>
>
> in synonyms.txt.
>
> fund manager summary, fund manager report
> guide, product guide
>
> I run into 2 issues...
>
> (1) After analysis of the field in SOLR, I find that both
>
> fund manager summary
> fund manager report
>
> are NOT getting picked up by the synonym factory (after processing I just
> get the source term outputted from the synonym filter)
>
> (2) If I analyse guide, I do get product and guide (*2) outputted from the
> synonym filter factory - but as separate terms (3 terms in total). I
> expected it to generate just 1 additional term - i.e. product guide
>
> It seems that it is able to pick up a single word and output two (as
> separate terms), but it fails to pick up multiple words.
>
> Can anyone help? (Incidentally, when I use this approach on a SOLR text
> field type it all works fine, but I can't use a SOLR text field type for this
> as I use this field for faceting.)
>
>
>
>




Synonym Filtering on String Fields

2010-11-26 Thread Jason Brown

I have the following field type set up in my schema. The idea is to fire 
phrases of text such as 'fund manager summary' (without the quotes) at it, and 
for the synonym processing to recognise this, and add the rest of the synonyms 
(index-time synonym processing with expansion) to the index from my synonym 
file (example below)

 
  


 
  

  
  


in synonyms.txt.

fund manager summary, fund manager report
guide, product guide

I run into 2 issues...

(1) After analysis of the field in SOLR, I find that both 

fund manager summary
fund manager report

are NOT getting picked up by the synonym factory (after processing I just get 
the source term outputted from the synonym filter)

(2) If I analyse guide, I do get product and guide (*2) outputted from the 
synonym filter factory - but as separate terms (3 terms in total). I expected 
it to generate just 1 additional term - i.e. product guide

It seems that it is able to pick up a single word and output two (as separate 
terms), but it fails to pick up multiple words.

Can anyone help? (Incidentally, when I use this approach on a SOLR text field 
type it all works fine, but I can't use a SOLR text field type for this as I use 
this field for faceting.)
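
The behaviour being asked for can be sketched as a whole-value lookup: each comma-separated entry in synonyms.txt is treated as one indivisible term, and a field value is expanded only when the entire value equals one of those entries. This is a plain Python model of the desired outcome, not Solr code. One configuration that is often suggested for this is to give the synonym filter a `tokenizerFactory` attribute pointing at the keyword tokenizer, so the synonym file is not split on whitespace; treat that as an assumption to check against the Solr version in use.

```python
def parse_synonyms(lines):
    """Map each phrase in a comma-separated rule to its whole group (expansion on)."""
    table = {}
    for line in lines:
        group = [phrase.strip().lower() for phrase in line.split(",") if phrase.strip()]
        for phrase in group:
            table[phrase] = group
    return table


def expand_value(value, table):
    """Treat the whole field value as a single term and expand it if it is known."""
    key = value.strip().lower()
    return table.get(key, [key])


rules = ["fund manager summary, fund manager report", "guide, product guide"]
table = parse_synonyms(rules)

print(expand_value("guide", table))
print(expand_value("fund manager summary", table))
print(expand_value("annual report", table))
```

Here guide yields exactly two terms (guide and product guide) and the three-word phrases are matched as units, which is the result described as expected in points (1) and (2).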





Synonym processing at index time

2010-11-24 Thread Jason Brown

Good Morning - I will explain my current config/functionality.

I have 4 fields in my index...

1) Doc Title - a text field
2) Keyword Phrase, e.g. fund manager, a text field (with some edge n gram 
functionality at index time)
3) Keyword Phrase, e.g. fund manager, a string field (for facetting)
4) Content Field, i.e. my full document text, a text field

I have a nice bit of auto-complete functionality in my UI which works as 
follows...

user searches -> fund ma

and my service layer calls SOLR to say please find all docs with fund and ma in 
it. My search results are fine, I also ask for facets and counts in this same 
query so I can use them in my auto-complete (I ask for field (3) above when 
facetting).

This allows me to use the facets and counts to show a nice auto-complete each 
time a user hits a key.

Ok so far. I have a nice auto-complete based upon business domain Keyword 
Phrases.

Now, on to synonyms: for example, fund manager and fund lead are the same 
thing in my business domain.

I was planning on simply adding the synonyms as normal entries into fields 2 
and 3 (both multi-valued fields) so that they would be inserted into the index 
and be available for my auto-complete. This would be OK and, to clarify, has 
nothing to do with the synonyms.txt file at this point.

However, as SOLR has synonym processing I should take advantage of it (also, 
with that plan my synonym fund lead would not have found its way into field 4, 
the full text of the document, where fund manager was in the content).

So I believe I should do something like...

fund manager, fund lead 

...in my synonym file that I only want to process at index time (so it appears 
in my autocomplete) with expansion on. I want wherever fund manager or fund 
lead is found, for the index to have fund manager and fund lead.

As I have expansion on and have multi word synonyms (phrases as both a source 
and target) then to use the synonym file at index time seems best.

However, I am very confused at this point.

I can see how the synonym file would be processed correctly for field 3 (a 
string field) and both terms fund manager and fund lead should go into the 
index OK.

But I can't see how it would work for the text fields (2 and 4).

My index-time filter chain has synonym processing as per the default text field 
processing (after whitespace tokenisation), so I can't see how my terms fund 
manager and fund lead can be found by the synonym filter. 

I've looked in the book by Eric Pugh and they say that for multi-word synonyms 
to work you must use synonyms at index time and with expansion - they say you 
can't do synonym processing at query time as synonym phrases aren't recognised 
after whitespace parsing - but my index chain (and the default SOLR config for 
text fields) also whitespace parses.

It would be great to take advantage of synonym processing by SOLR instead of 
my original plan - but I am confused how multi-word synonyms can be recognised 
at index time and added to the index - am I missing something about index-time 
processing of synonyms here?
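
A simplified model of why whitespace tokenisation does not by itself defeat multi-word synonyms at index time: a synonym filter can match a sequence of consecutive tokens, not just one token. The sketch below does a greedy longest match over the token stream and emits the expanded tokens with positions. It is an illustration of the idea only; the real filter's position handling is more involved. At query time the situation differs because, as I understand it, the query parser splits the input on whitespace before analysis, so the filter only ever sees one word at a time.

```python
def build_rules(lines):
    """Parse 'a b, c d' style rules into {token tuple: [all phrases in the group]}."""
    rules = {}
    for line in lines:
        group = [tuple(p.lower().split()) for p in line.split(",") if p.strip()]
        for phrase in group:
            rules[phrase] = group
    return rules


def expand_tokens(tokens, rules):
    """Greedy longest match over token sequences; returns sorted (position, token)."""
    max_len = max((len(p) for p in rules), default=1)
    out, i, pos = [], 0, 0
    while i < len(tokens):
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in rules:
                match = candidate
                break
        if match is None:
            out.append((pos, tokens[i]))
            i, pos = i + 1, pos + 1
            continue
        width = 0
        for phrase in rules[match]:
            for offset, word in enumerate(phrase):
                if (pos + offset, word) not in out:
                    out.append((pos + offset, word))
            width = max(width, len(phrase))
        i, pos = i + len(match), pos + width
    return sorted(out)


rules = build_rules(["fund manager, fund lead"])

# Index time: the filter sees the whole token stream of the field value.
print(expand_tokens("the fund manager said".split(), rules))

# Query time (modelled): each word is analysed on its own, so nothing matches.
print([expand_tokens([word], rules) for word in "fund manager".split()])
```

In the index-time case both manager and lead end up at the same position after fund, so a phrase search for either fund manager or fund lead can find the document.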

Many Thanks for any help/advice.

Jason.








RE: Filter by relevance

2010-11-04 Thread Jason Brown
I have a dismax query where I check for values in 3 fields against documents in 
the index - a title, a list of keyword tags and then full-text of the document.

I usually get lots of results and I can see that the first results are OK - 
it's giving precedence to titles and tag matches, as my dismax boosts on title 
and keywords (normal boost and phrase boost).

After say 20/30 good results I start to get matches based upon just the 
full-text, so these are less relevant. 

I am also facet counting on my keyword tags (and presenting the counts in the 
results as a way of filtering) and as you can imagine the counts are high 
because of the number of overall results. I want to somehow make the facet 
counts more associated with the higher relevancy results.

My options as I see it are - 

1) exclude full-text from the dismax altogether
2) configure the dismax normal boost on full-text to zero, but phrase boost to 
something higher (the aim here is to only really get a hit on the full-text if 
my search term is found as a phrase in the full-text)
3) limit my results by relevancy or number of results

If I do (3) above will the facet.counts respect the lower number of results - 
this is the overall aim really.
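
As far as I know, Solr computes facet counts over every document that matches the query and filters, independently of how many rows are returned, so simply asking for fewer results would not lower the counts. A sketch of one workaround is to count the facet values in the service layer over only the top N returned documents. The result list and the keywords field below are hypothetical.

```python
from collections import Counter


def facet_counts(docs, field, top_n=None):
    """Count field values over all docs, or only over the first top_n docs."""
    subset = docs if top_n is None else docs[:top_n]
    counts = Counter()
    for doc in subset:
        counts.update(doc.get(field, []))
    return counts


# Hypothetical results, already sorted by descending relevance.
results = [
    {"id": 1, "keywords": ["fund manager"]},
    {"id": 2, "keywords": ["fund manager", "guide"]},
    {"id": 3, "keywords": ["guide"]},
    {"id": 4, "keywords": ["pension"]},
]

print(facet_counts(results, "keywords"))           # counts over every match
print(facet_counts(results, "keywords", top_n=2))  # counts over the top 2 only
```

The trade-off is that the service layer has to fetch the keyword field for the top N documents, which is reasonable for a few dozen results but not for thousands.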

Thank You

Jason.


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wed 03/11/2010 23:15
To: solr-user@lucene.apache.org
Subject: Re: Filter by relevance
 
Be aware, though, that relevance isn't absolute, it's only interesting
#within# a query. And it's
then normed between 0 and 1. So picking "a certain value" is rarely doing
what you think it will.
Limiting to the top N docs is usually more reasonable

But this may be an XY problem. What is it you're trying to accomplish?
Perhaps if you
state the problem, some other suggestions may be in the offing

Best
Erick

On Wed, Nov 3, 2010 at 4:48 PM, Jason Brown  wrote:

> Is it possible to filter my search results by relevance? For example,
> anything below a certain value shouldn't be returned?
>
> I also retrieve facet counts in my search queries, so it would be useful if
> the facet counts also respected the filter on the relevance.
>
> Thank You.
>
> Jason.
>
>




phrase boost on dismax query

2010-11-03 Thread Jason Brown

I have 3 fields in my index that I use in a dismax query with boosts and phrase 
boosts.

I've realised that 1 field I'm not really interested in at all, unless the 
search term is in that field as a phrase.

Is it realistic to set the normal boost to zero for this field, but the phrase 
boost to something much higher in order to achieve the desired effect?

Thank You



Filter by relevance

2010-11-03 Thread Jason Brown
Is it possible to filter my search results by relevance? For example, anything 
below a certain value shouldn't be returned?

I also retrieve facet counts in my search queries, so it would be useful if the 
facet counts also respected the filter on the relevance.

Thank You.

Jason.



facet Prefix (or term prefix)

2010-10-22 Thread Jason Brown

I am aware of the facet.prefix facility. I am using SOLR to return a faceted 
field's contents - I use facet.prefix to restrict what SOLR returns - 
this is very useful for predictive search functionality (autocomplete).

My only issue is that the field I facet on is a string and could have 2 or 3 
words in it, so this process will only return strings that begin with what 
the user is typing into my UI search box. It would be useful if I could get 
facets back where the match is anywhere in the faceted field (not just at 
the beginning), i.e. is there a facet.contains method?

If not I'll just have to code this in my service layer having received all 
facets from SOLR (without the prefix)
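
A sketch of the service-layer fallback mentioned above: take the full list of (value, count) facets and keep those whose value contains the typed text anywhere, ordered by count. The facet values are invented for illustration. I believe later Solr releases added a facet.contains parameter for this, but that is worth checking against the version in use.

```python
def filter_facets(facets, typed, limit=10):
    """Keep facets whose value contains the typed text, highest count first."""
    needle = typed.strip().lower()
    if not needle:
        return []
    hits = [(value, count) for value, count in facets if needle in value.lower()]
    hits.sort(key=lambda vc: (-vc[1], vc[0]))
    return hits[:limit]


# Hypothetical facet values and counts returned without any prefix restriction.
facets = [("fund manager", 12), ("hedge fund", 7), ("product guide", 3)]

print(filter_facets(facets, "fund"))
print(filter_facets(facets, "gui"))
```

A prefix restriction would have returned only fund manager for the input fund; the substring match also surfaces hedge fund.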

Thanks for any help.






RE: Dismax phrase boosts on multi-value fields

2010-10-20 Thread Jason Brown
Thanks - I was hoping it wouldn't match - and I believe you've confirmed it 
won't in my case as the default positionIncrementGap is set.

Many Thanks

Jason.


-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Thu 21/10/2010 02:27
To: solr-user@lucene.apache.org
Subject: RE: Dismax phrase boosts on multi-value fields
 
Which is why the positionIncrementGap is normally set to a high number (100 in 
the sample schema.xml).  With this being so, phrases won't match across values 
in a multi-valued field. If for some reason you were using a dismax ps phrase 
slop that was higher than your positionIncrementGap, you could get phrase boost 
matches across individual values.  But normally that won't happen unless you 
do something odd to make it happen because you actually want it to, because 
positionIncrementGap is 100. If for some reason you wanted to use a phrase slop 
of over 100 but still make sure it didn't go across individual value 
boundaries, you could just set positionIncrementGap to something absurdly high 
(I'm not entirely sure why it isn't something absurdly high in the sample 
schema.xml, instead of the high-but-not-absurdly-so 100, since most people will 
probably expect individual values to be entirely separate). 

Jason, are you _trying_ to make that happen, or hoping it won't?  Ordinarily, 
it won't. 
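
The effect of the gap can be sketched with a simplified position model: every token advances the position by one, and the gap is added between consecutive values of the multi-valued field. An exact phrase (slop 0) matches only if its words sit at consecutive positions. This is a model for illustration, not Lucene's actual position bookkeeping, and it uses the three example values from this thread.

```python
def token_positions(values, gap):
    """Assign positions to the tokens of a multi-valued field (simplified model)."""
    positions = []
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # extra positions skipped between values
        for token in value.lower().split():
            pos += 1
            positions.append((pos, token))
    return positions


def phrase_matches(positions, phrase):
    """Exact phrase match: the words must occupy consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return False
    by_token = {}
    for pos, token in positions:
        by_token.setdefault(token, set()).add(pos)
    return any(
        all(start + k in by_token.get(word, set()) for k, word in enumerate(words))
        for start in by_token.get(words[0], set())
    )


values = ["my black cat", "his blue car", "her pink rabbit"]
for gap in (0, 100):
    positions = token_positions(values, gap)
    print(gap,
          phrase_matches(positions, "his blue car"),
          phrase_matches(positions, "black cat his blue"),
          phrase_matches(positions, "my blue rabbit"))
```

With no gap, black cat his blue matches across two values; with a gap of 100 it does not, while a phrase inside a single value still matches. The phrase my blue rabbit matches in neither case, because its words are never adjacent.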

From: Erick Erickson [erickerick...@gmail.com]
Sent: Wednesday, October 20, 2010 7:11 PM
To: solr-user@lucene.apache.org
Subject: Re: Dismax phrase boosts on multi-value fields

Well, it all depends (tm). your example wouldn't match, but if you
didn't have an increment gap greater than 1, "black cat his blue" #would#
match.

Best
Erick


On Wed, Oct 20, 2010 at 3:22 AM, Jason Brown  wrote:

> Thanks Jonathan.
>
> To further clarify, I understand that the match of
>
> my blue rabbit
>
> would have to be found in 1 element (of my multi-valued defined field) for
> the phrase boost on that field to kick in.
>
> If for example my document had the following 3 entries for the multi-value
> field
>
>
> my black cat
> his blue car
> her pink rabbit
>
> Then I assume the phrase boost would not kick in as the search term (my
> blue rabbit) isn't found in a single element (but can be found across them).
>
> Thanks again
>
> Jason.
>
> 
>
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Tue 19/10/2010 17:27
> To: solr-user@lucene.apache.org
> Subject: Re: Dismax phrase boosts on multi-value fields
>
>
>
> You are correct.  The query needs to match as a phrase. It doesn't need
> to match "everything". Note that if a value is:
>
> "long sentence with my blue rabbit in it",
>
> then query "my blue rabbit" will also match as a phrase, for phrase
> boosting or query purposes.
>
> Jonathan
>
> Jason Brown wrote:
> >
> >
> > Hi - I have a multi-value field, so say for example it consists of
> >
> > 'my black cat'
> > 'my white dog'
> > 'my blue rabbit'
> >
> > The field is whitespace parsed when put into the index.
> >
> > I have a phrase query boost configured on this field which I understand
> > kicks in when my search term is found entirely in this field.
> >
> > So, if the search term is 'my blue rabbit', then I understand that my
> > phrase boost will be applied as this is found entirely in this field.
> >
> > My question/presumption is that as this is a multi-valued field, only 1
> > value of the multi-value needs to match for the phrase query boost (given my
> > very imaginative set of test data :-) above, you can see that this obviously
> > matches 1 value and not them all)
> >
> > Thanks for your help.
> >
> >
> >
> >
> >
> >
> >
> >
>
>
>
>




RE: Dismax phrase boosts on multi-value fields

2010-10-20 Thread Jason Brown
Thanks Jonathan.
 
To further clarify, I understand that the match of 
 
my blue rabbit
 
would have to be found in 1 element (of my multi-valued defined field) for the 
phrase boost on that field to kick in.
 
If for example my document had the following 3 entries for the multi-value 
field
 
 
my black cat
his blue car
her pink rabbit
 
Then I assume the phrase boost would not kick in as the search term (my blue 
rabbit) isn't found in a single element (but can be found across them).
 
Thanks again
 
Jason.



From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Tue 19/10/2010 17:27
To: solr-user@lucene.apache.org
Subject: Re: Dismax phrase boosts on multi-value fields



You are correct.  The query needs to match as a phrase. It doesn't need
to match "everything". Note that if a value is:

"long sentence with my blue rabbit in it",

then query "my blue rabbit" will also match as a phrase, for phrase
boosting or query purposes.

Jonathan

Jason Brown wrote:
> 
>
> Hi - I have a multi-value field, so say for example it consists of
>
> 'my black cat'
> 'my white dog'
> 'my blue rabbit'
>
> The field is whitespace parsed when put into the index.
>
> I have a phrase query boost configured on this field which I understand kicks 
> in when my search term is found entirely in this field.
>
> So, if the search term is 'my blue rabbit', then I understand that my phrase 
> boost will be applied as this is found entirely in this field.
>
> My question/presumption is that as this is a multi-valued field, only 1 value 
> of the multi-value needs to match for the phrase query boost (given my very 
> imaginative set of test data :-) above, you can see that this obviously 
> matches 1 value and not them all)
>
> Thanks for your help.
>
>
>
>
>
>
>
>  





Dismax phrase boosts on multi-value fields

2010-10-19 Thread Jason Brown
 

Hi - I have a multi-value field, so say for example it consists of 

'my black cat'
'my white dog'
'my blue rabbit'

The field is whitespace parsed when put into the index.

I have a phrase query boost configured on this field which I understand kicks 
in when my search term is found entirely in this field.

So, if the search term is 'my blue rabbit', then I understand that my phrase 
boost will be applied as this is found entirely in this field. 

My question/presumption is that as this is a multi-valued field, only 1 value 
of the multi-value needs to match for the phrase query boost (given my very 
imaginative set of test data :-) above, you can see that this obviously matches 
1 value and not them all)

Thanks for your help.








FW: Dismax phrase boosts on multi-value fields

2010-10-19 Thread Jason Brown



-Original Message-
From: Jason Brown
Sent: Tue 19/10/2010 13:45
To: d...@lucene.apache.org
Subject: Dismax phrase boosts on multi-value fields
 

Hi - I have a multi-value field, so say for example it consists of 

'my black cat'
'my white dog'
'my blue rabbit'

The field is whitespace parsed when put into the index.

I have a phrase query boost configured on this field which I understand kicks 
in when my search term is found entirely in this field.

So, if the search term is 'my blue rabbit', then I understand that my phrase 
boost will be applied as this is found entirely in this field. 

My question/presumption is that as this is a multi-valued field, only 1 value 
of the multi-value needs to match for the phrase query boost (given my very 
imaginative set of test data :-) above, you can see that this obviously matches 
1 value and not them all)

Thanks for your help.







RE: What is the maximum number of documents that can be indexed ?

2010-10-14 Thread Jason Brown
Not related to the opening thread - but I wanted to thank Eric for his book. 
It clarified a lot of stuff and is very useful.


-Original Message-
From: Eric Pugh [mailto:ep...@opensourceconnections.com]
Sent: Thu 14/10/2010 15:34
To: solr-user@lucene.apache.org
Subject: Re: What is the maximum number of documents that can be indexed ?
 
I would recommend looking at the work the HathiTrust has done.  They have 
published some really great blog articles about the work they have done in 
scaling Solr, and have put in huge amounts of data.   

The good news is that there isn't an exact number, because "it depends".   The 
bad news is that there isn't an exact number because "it depends"!

Eric



On Oct 13, 2010, at 8:58 PM, Otis Gospodnetic wrote:

> Marco (use solr-u...@lucene list to follow up, please),
> 
> There are no precise answers to such questions.  Solr can keep indexing.  The 
> limit is, I think, the available disk space.  I've never pushed Solr or 
> Lucene to the point where Lucene index segments would become a serious pain, 
> but even that can be controlled.  Same thing with number of open files, large 
> file support, etc.
> 
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
>> 
>> From: Marco Ciaramella 
>> To: d...@lucene.apache.org
>> Sent: Wed, October 13, 2010 6:19:15 PM
>> Subject: What is the maximum number of documents that can be indexed ?
>> 
>> Hi all,
>> I am working on a performance specification document on a Solr/Lucene-based 
>> application; this document is intended for the final customer. My question 
>> is: what is the maximum number of documents I can index assuming 10 or 
>> 20 kbytes for each document? 
>> 
>> 
>> I could not find a precise answer to this question, and I tend to consider 
>> that a Solr index can be virtually limited only by the JVM, the operating 
>> system (limits to large file support), or by hardware constraints (mainly 
>> RAM, etc.). 
>> 
>> 
>> Thanks
>> Marco
>> 
>> 
>> 

-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from 
http://www.packtpub.com/solr-1-4-enterprise-search-server
Free/Busy: http://tinyurl.com/eric-cal












RE: multi level faceting

2010-10-04 Thread Jason Brown
Yes, by adding fq back into the main query you will get results increasingly 
filtered each time.

You may run into an issue if you are displaying facet counts, as the facet part 
of the query will also obey the increasingly restrictive fq, and so will no 
longer display counts for the other values of the chosen facet. Whether that 
matters depends on whether you need to keep showing counts for a facet once 
its first value has been chosen. Local params are a way to deal with this: 
they exempt the facet count from the fq restriction while the search results 
still obey it.
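
A sketch of the local-params approach, using Solr's filter tagging and exclusion syntax: the fq is given a tag, and the facet.field excludes that tag so its counts ignore the filter. The sex field and its Men value come from this thread; the query text and the tag name are made up for illustration.

```python
from urllib.parse import urlencode


def drill_down_params(query, field, chosen):
    """Build request parameters where the facet on `field` ignores its own filter."""
    tag = f"{field}_tag"  # hypothetical tag name
    return [
        ("q", query),
        ("facet", "true"),
        ("fq", f'{{!tag={tag}}}{field}:"{chosen}"'),
        ("facet.field", f"{{!ex={tag}}}{field}"),
    ]


params = drill_down_params("shirt", "sex", "Men")
for name, value in params:
    print(name, "=", value)
print(urlencode(params))
```

The search results are restricted to Men, while the facet on sex still reports counts for Women, so the user can switch between values without clearing the filter first.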



-Original Message-
From: Nguyen, Vincent (CDC/OD/OADS) (CTR) [mailto:v...@cdc.gov]
Sent: Mon 04/10/2010 16:34
To: solr-user@lucene.apache.org
Subject: RE: multi level faceting
 
Ok.  Thanks for the quick response.

Vincent Vu Nguyen
Division of Science Quality and Translation
Office of the Associate Director for Science
Centers for Disease Control and Prevention (CDC)
404-498-6154
Century Bldg 2400
Atlanta, GA 30329 


-Original Message-
From: Allistair Crossley [mailto:a...@roxxor.co.uk] 
Sent: Monday, October 04, 2010 9:40 AM
To: solr-user@lucene.apache.org
Subject: Re: multi level faceting

I think that is just sending 2 fq facet queries through. In Solr PHP I
would do that with, e.g.

$params['facet'] = true;
$params['facet.fields'] = array('Size');
$params['fq'] = array('sex' => array('Men', 'Women'));

but yes i think you'd have to send through what the current facet query
is and add it to your next drill-down

On Oct 4, 2010, at 9:36 AM, Nguyen, Vincent (CDC/OD/OADS) (CTR) wrote:

> Hi,
> 
> 
> 
> I was wondering if there's a way to display facet options based on
> previous facet values.  For example, I've seen many shopping sites where
> a user can facet by "Mens" or "Womens" apparel, then be shown "sizes" to
> facet by (for Men or Women only - whichever they chose).  
> 
> 
> 
> Is this something that would have to be handled at the application
> level?
> 
> 
> 
> Vincent Vu Nguyen
> 
> 
> 
> 
> 






Facet Counts Issue when Using Dismax Query Parser in SOLR

2010-10-01 Thread Jason Brown



I am retrieving facet counts against a specific column in my index and these 
look accurate. The query for retrieving these counts is also running a dismax 
search using the q param (against 4 columns in my index, 1 of which I am facet 
counting on as mentioned above).

So far, so good. I show my search results, I show my facets and associated 
counts.

However, I want the user to be able to 'drill-down' by re-running the same 
search (same q param), but adding in one of the facets to filter the results. 
Clearly, I can't modify the q parameter to filter against my faceted column 
(in addition to the previous q value), as dismax won't allow a q param to have 
a column specified.

So I add an fq param to filter the results by the chosen facet. This seems 
logical, but the number of search results I get is NOT the same as the count 
against the facet.

I thought that by adding an fq param I am basically saying (ensuring I keep the 
q param the same): re-run the search but filter my results where my faceted 
column has value 'x'.

However, as the number of results is not what I am expecting, I believe it may 
be using the fq param first to define the set of docs against which the q 
param is subsequently used. That doesn't seem very intuitive, but it would 
explain the difference between the facet count and the subsequent number of 
search results that I am observing.

Could someone help point out which of the 2 interpretations of fq is correct?
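
A small model of the two interpretations, under the assumption that q and fq each select a set of documents and the result is their intersection. In that model the order of evaluation cannot change which documents are returned, so the two readings give the same count. A mismatch with the facet count is then more likely to come from the filter not matching the exact facet value, for example an unquoted multi-word value being parsed word by word. The documents and the category field below are invented for illustration.

```python
from collections import Counter

# Hypothetical documents; "category" is the faceted string field.
docs = [
    {"id": 1, "text": "fund report", "category": "fund manager"},
    {"id": 2, "text": "fund review", "category": "fund manager"},
    {"id": 3, "text": "fund notes", "category": "fund lead"},
    {"id": 4, "text": "other", "category": "fund manager"},
]


def matches_q(doc):
    return "fund" in doc["text"].split()


def exact_filter(doc):
    return doc["category"] == "fund manager"


def loose_filter(doc):
    # Models an unquoted filter value that matches on any single word.
    return "fund" in doc["category"].split()


q_hits = [d for d in docs if matches_q(d)]
facets = Counter(d["category"] for d in q_hits)

q_then_fq = [d["id"] for d in q_hits if exact_filter(d)]
fq_then_q = [d["id"] for d in docs if exact_filter(d) and matches_q(d)]
loose_hits = [d["id"] for d in q_hits if loose_filter(d)]

print(facets["fund manager"], len(q_then_fq), len(fq_then_q), len(loose_hits))
```

With the exact filter the result count equals the facet count whichever way round the filters are applied; only the loose filter returns more documents than the facet count promised.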
