Re: How do I make sure the resulting documents contain the query terms?

2011-06-06 Thread pravesh
>k0 --> A | C
>k1 --> A | B
>k2 --> A | B | C
>k3 --> B | C 
>Now let q=k1, how do I make sure C doesn't appear as a result since it
>doesn't contain any occurrence of k1?
Do we even need to bother with that? That's what Lucene does :)
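
For what it's worth, a tiny in-memory Lucene sketch (3.x-era API; field
names invented) demonstrating the point: a term query walks only that
term's posting list, so C can never match q=k1.

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PostingListDemo {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(
        Version.LUCENE_31, new WhitespaceAnalyzer(Version.LUCENE_31)));
    String[][] docs = {{"A", "k0 k1 k2"}, {"B", "k1 k2 k3"}, {"C", "k0 k2 k3"}};
    for (String[] d : docs) {
      Document doc = new Document();
      doc.add(new Field("id", d[0], Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("terms", d[1], Field.Store.NO, Field.Index.ANALYZED));
      w.addDocument(doc);
    }
    w.close();
    IndexSearcher s = new IndexSearcher(dir);
    // The k1 posting list is A | B, so C is never even a candidate.
    TopDocs hits = s.search(new TermQuery(new Term("terms", "k1")), 10);
    System.out.println(hits.totalHits); // prints 2
    s.close();
  }
}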

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-do-I-make-sure-the-resulting-documents-contain-the-query-terms-tp3031637p3033451.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How do I make sure the resulting documents contain the query terms?

2011-06-06 Thread Gabriele Kahlout
Sorry for being unclear, and thank you for answering.
Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3),
where A,B,C are document identifiers and the ks in bracket with each are the
terms each contains.
So Solr's inverted index should be something like:

k0 --> A | C
k1 --> A | B
k2 --> A | B | C
k3 --> B | C

Now let q=k1, how do I make sure C doesn't appear as a result since it
doesn't contain any occurrence of k1?

On Tue, Jun 7, 2011 at 12:21 AM, Erick Erickson wrote:

> I'm having a hard time understanding what you're driving at, can
> you provide some examples? This *looks* like filter queries,
> but I think you already know about those...
>
> Best
> Erick
>
> On Mon, Jun 6, 2011 at 4:00 PM, Gabriele Kahlout
>  wrote:
> > Hello,
> >
> > I've seen that through boosting it's possible to influence the scoring
> > function, but what I would like is more of a boolean property: in a
> > way, it's to search only the documents indexed under that keyword (or
> > their intersection/union) rather than the whole set.
> > Is this supported in any way?
> >
> >
> > --
> > Regards,
> > K. Gabriele
> >
> > --- unchanged since 20/9/10 ---
> > P.S. If the subject contains "[LON]" or the addressee acknowledges the
> > receipt within 48 hours then I don't resend the email.
> > subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x)
> > < Now + 48h) ⇒ ¬resend(I, this).
> >
> > If an email is sent by a sender that is not a trusted contact or the
> email
> > does not contain a valid code then the email is not received. A valid
> code
> > starts with a hyphen and ends with "X".
> > ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> > L(-[a-z]+[0-9]X)).
> >
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: problem: zooKeeper Integration with solr

2011-06-06 Thread bmdakshinamur...@gmail.com
Instead of integrating ZooKeeper, you could create shards over multiple
machines and specify the shards while you are querying Solr.
Eg: http://localhost:8983/solr/select?shards=host1:port/solr,host2:port/solr&indent=true&q=...
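
A minimal SolrJ equivalent (3.x-era client; host names invented):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ShardQueryDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://shard1:8983/solr");
    SolrQuery q = new SolrQuery("itunes");
    // Fan the query out over both shards; each entry is host:port/path.
    q.set("shards", "shard1:8983/solr,shard2:8983/solr");
    System.out.println(server.query(q).getResults().getNumFound());
  }
}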



On Mon, Jun 6, 2011 at 5:59 PM, Mohammad Shariq wrote:

> Hi folk,
> I am using solr to index around 100mn docs.
> now I am planning to move to a cluster-based Solr, so that I can scale the
> indexing and searching process.
> Since SolrCloud is in the development stage, I am trying to index in a
> shard-based environment using ZooKeeper.
>
> I followed the steps from
> http://wiki.apache.org/solr/ZooKeeperIntegration but still I am not
> able to do distributed search.
> Once I index the docs in one shard, I am not able to query them from the other shard and
> vice-versa, (using the query
>
> http://localhost:8180/solr/select/?q=itunes&version=2.2&start=0&rows=10&indent=on
> )
>
> I am running Solr 3.1 on Ubuntu 10.10.
>
> please help me.
>
>
> --
> Thanks and Regards
> Mohammad Shariq
>



-- 
Thanks and Regards,
DakshinaMurthy BM


Re: Master Slave help

2011-06-06 Thread Jayendra Patil
Do you mean the replication happens every time you restart the server?
If so, you would need to modify the events on which you want replication to happen.

Check for the replicateAfter tag and remove the startup option, if you
don't need it.




<!-- tag names reconstructed (the archive stripped the XML markup);
     the masterUrl value is a placeholder -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
  </lst>
  <lst name="slave">
    <str name="masterUrl">http://master_host:port/solr/replication</str>
    <str name="pollInterval">00:00:10</str>
  </lst>
</requestHandler>



Regards,
Jayendra

On Mon, Jun 6, 2011 at 11:24 AM, Rohit Gupta  wrote:
> Hi,
>
> I have configured my master slave server and everything seems to be running
> fine; the replication completed the first time it ran. But every time I go to
> the replication link in the admin panel after restarting the server, I notice
> the replication starting from scratch, or at least the stats
> show that.
>
> What could be wrong?
>
> Thanks,
> Rohit


Re: synonyms problem

2011-06-06 Thread Erick Erickson
Please take a look at the analysis page for the field in question. I don't
even know what happens if you define ONLY a query analyzer (or did you
leave things out for brevity?).

Substituting synonyms into a string field is suspicious; I assume you're only
indexing single tokens in that field.

You have to re-index after an index-time analysis change to see the effects.

Best
Erick

On Mon, Jun 6, 2011 at 8:33 PM, deniz  wrote:
> Well, I was trying to say that I have changed the config files for synonyms
> and so on, but nothing happens, so I thought I needed to do something in Java
> code too... I was trying to ask about that...
>
> -
> Zeki ama calismiyor... Calissa yapar...
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/synonyms-problem-tp3014006p3032666.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: synonyms problem

2011-06-06 Thread deniz
Well, I was trying to say that I have changed the config files for synonyms
and so on, but nothing happens, so I thought I needed to do something in Java
code too... I was trying to ask about that...

-
Zeki ama calismiyor... Calissa yapar...
--
View this message in context: 
http://lucene.472066.n3.nabble.com/synonyms-problem-tp3014006p3032666.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SpellCheckComponent performance

2011-06-06 Thread Erick Erickson
Hmmm, how are you configuring your spell checker? The first-time slowdown
is probably due to cache warming, but subsequent 500 ms slowdowns
seem odd. How many unique terms are there in your spellcheck index?

It'd probably be best if you showed us your fieldtype and field definition...

Best
Erick

On Mon, Jun 6, 2011 at 4:04 PM, Demian Katz  wrote:
> I'm continuing to work on tuning my Solr server, and now I'm noticing that my 
> biggest bottleneck is the SpellCheckComponent.  This is eating multiple 
> seconds on most first-time searches, and still taking around 500ms even on 
> cached searches.  Here is my configuration:
>
>   class="org.apache.solr.handler.component.SpellCheckComponent">
>    
>      basicSpell
>      spelling
>      0.75
>      ./spellchecker
>      textSpell
>      true
>    
>  
>
> I've done a bit of searching, but the best advice I could find for making the 
> search component go faster involved reducing spellcheck.maxCollationTries, 
> which doesn't even seem to apply to my settings.
>
> Does anyone have any advice on tuning this aspect of my configuration?  Are 
> there any extra debug settings that might give deeper insight into how the 
> component is spending its time?
>
> thanks,
> Demian
>


Re: How do I make sure the resulting documents contain the query terms?

2011-06-06 Thread Erick Erickson
I'm having a hard time understanding what you're driving at, can
you provide some examples? This *looks* like filter queries,
but I think you already know about those...
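
For reference, a filter query just adds an fq clause; a minimal SolrJ
sketch (field name and URL invented):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class FilterQueryDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("k1");
    // fq restricts the result set without affecting scores and is cached separately.
    q.addFilterQuery("terms:k1");
    System.out.println(server.query(q).getResults().getNumFound());
  }
}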

Best
Erick

On Mon, Jun 6, 2011 at 4:00 PM, Gabriele Kahlout
 wrote:
> Hello,
>
> I've seen that through boosting it's possible to influence the scoring
> function, but what I would like is more of a boolean property: in a way,
> it's to search only the documents indexed under that keyword (or their
> intersection/union) rather than the whole set.
> Is this supported in any way?
>
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
> < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>


Re: Minimum Should Match + External Field + Function Query with boost

2011-06-06 Thread fbytes
I seem to have a solution, but I am still trying to figure out how/why it works.


Adding "defType=edismax" in the boost query seems to honor "mm" and
correct the boosting based on the external file source.

The new query syntax:
q={!boost b=dishRating v=$qq defType=edismax}&qq=hot chicken wings
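
Built with SolrJ, the same request might look like this sketch (the mm
value here is invented):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class BoostQueryDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // The outer {!boost} query multiplies scores by dishRating, while
    // defType=edismax parses the inner $qq query as edismax, so mm is honored.
    SolrQuery q = new SolrQuery("{!boost b=dishRating v=$qq defType=edismax}");
    q.set("qq", "hot chicken wings");
    q.set("mm", "2"); // hypothetical minimum-should-match value
    System.out.println(server.query(q).getResults().getNumFound());
  }
}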

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Minimum-Should-Match-not-enforced-with-External-Field-Function-Query-with-boost-tp2985564p3032143.html
Sent from the Solr - User mailing list archive at Nabble.com.


SpellCheckComponent performance

2011-06-06 Thread Demian Katz
I'm continuing to work on tuning my Solr server, and now I'm noticing that my 
biggest bottleneck is the SpellCheckComponent.  This is eating multiple seconds 
on most first-time searches, and still taking around 500ms even on cached 
searches.  Here is my configuration:

  <!-- tag names below are a best-guess reconstruction; the archive
       stripped the XML markup from the original message -->
  <searchComponent name="spellcheck"
   class="org.apache.solr.handler.component.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">basicSpell</str>
      <str name="field">spelling</str>
      <str name="accuracy">0.75</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="queryAnalyzerFieldType">textSpell</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

I've done a bit of searching, but the best advice I could find for making the 
search component go faster involved reducing spellcheck.maxCollationTries, 
which doesn't even seem to apply to my settings.
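
For reference, a request exercising the component might look like the
following SolrJ sketch (assuming the component is wired into the default
request handler; the dictionary name comes from the config above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SpellcheckDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("accomodation"); // deliberately misspelled
    q.set("spellcheck", true);
    q.set("spellcheck.dictionary", "basicSpell");
    q.set("spellcheck.count", 5);
    QueryResponse rsp = server.query(q);
    System.out.println(rsp.getSpellCheckResponse().getSuggestions());
  }
}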

Does anyone have any advice on tuning this aspect of my configuration?  Are 
there any extra debug settings that might give deeper insight into how the 
component is spending its time?

thanks,
Demian


How do I make sure the resulting documents contain the query terms?

2011-06-06 Thread Gabriele Kahlout
Hello,

I've seen that through boosting it's possible to influence the scoring
function, but what I would like is more of a boolean property: in a way,
it's to search only the documents indexed under that keyword (or their
intersection/union) rather than the whole set.
Is this supported in any way?


-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: Solr Indexing Patterns

2011-06-06 Thread Jonathan Rochkind

This is a start, for many common best practices:

http://wiki.apache.org/solr/SolrRelevancyFAQ

Many of the questions in there have an answer that involves 
de-normalizing, as an example. Even if your specific 
problem isn't in there, I found that reading through it 
gave me a general sense of common patterns in Solr.


( It's certainly true that some things are hard to do in Solr.  It turns 
out that an RDBMS is a remarkably flexible thing -- but when it doesn't 
do something you need well, and you turn to a specialized tool instead 
like Solr, you certainly give up some things.


One of the biggest areas of limitation involves hierarchical or 
relationship data, definitely. There are a variety of features, some 
more fully baked than others, some not yet in a Solr release, meant to 
provide tools to get at different aspects of this, including "pivot 
faceting", "join" (https://issues.apache.org/jira/browse/SOLR-2272), 
and field-collapsing.  Each, IMO, is trying to deal with different 
aspects of hierarchical or multi-class data, or data that 
consists of entities with relationships. )


On 6/6/2011 3:43 PM, Judioo wrote:

I do think that Solr would be better served if there was a *best practice
section *of the site.

Looking at the majority of emails to this list, they revolve around "how do I
do X?".

Seems like tutorials with real world examples would serve Solr no end of
good.

I still do not have an example of the best method to approach my problem,
although Erick has helped me understand the limitations of Solr.

Just thought I'd say.






On 6 June 2011 20:26, Judioo  wrote:


Thanks


On 6 June 2011 19:32, Erick Erickson  wrote:


#Everybody# (including me) who has any RDBMS background
doesn't want to flatten data, but that's usually the way to go in
Solr.

Part of whether it's a good idea or not depends on how big the index
gets, and unfortunately the only way to figure that out is to test.

But that's the first approach I'd try.

Good luck!
Erick

On Mon, Jun 6, 2011 at 11:42 AM, Judioo  wrote:

On 5 June 2011 14:42, Erick Erickson  wrote:


See: http://wiki.apache.org/solr/SchemaXml

By adding 'multiValued="true"' to the field, you can add
the same field multiple times in a doc, something like

<add>
<doc>
  <field name="myfield">value1</field>  <!-- field name is illustrative -->
  <field name="myfield">value2</field>
</doc>
</add>

I can't see how that would work as one would need to associate the right
start / end dates and price.
As I understand it, using multivalued and thus flattening the discounts
would result in:

{
"name":"The Book",
"price":"$9.99",
"price":"$3.00",
"price":"$4.00","synopsis":"thanksgiving special",
"starts":"11-24-2011",
"starts":"10-10-2011",
"ends":"11-25-2011",
"ends":"10-11-2011",
"synopsis":"Canadian thanksgiving special",
  },

How does one differentiate the different offers?




But there's no real ability  in Solr to store "sub documents",
so you'd have to get creative in how you encoded the discounts...


This is what I'm asking :)
What is the best / recommended / known patterns for doing this?




But I suspect a better approach would be to store each discount as
a separate document. If you're in the trunk version, you could then
group results by, say, ISBN and get responses grouped together...


This is an option but seems suboptimal. So say I store the discounts in
multiple documents with ISBN as an attribute and also store the title again
with ISBN as an attribute.

To get
"all books currently discounted"

requires 2 requests

* get all discounts currently active
* get all books using ISBN retrieved from above search

Not that bad. However, what happens when I want
"all books that are currently on discount in the "horror" genre containing
the word 'elm' in the title."

The only way I can see of catering for the above search is to duplicate all
searchable fields in my "book" document in my "discount" document. Coming
from an RDBMS background this seems wrong.

Is this the correct approach to take?




Best
Erick

On Sat, Jun 4, 2011 at 1:42 AM, Judioo  wrote:

Hi,
Discounts can change daily. Also there can be a lot of them (over time and
in a given time period).

Could you give an example of what you mean by multi-valuing the field.

Thanks

On 3 June 2011 14:29, Erick Erickson

wrote:

How often are the discounts changed? Because you can simply
re-index the book information with a multiValued "discounts" field
and get something similar to your example (&wt=json)


Best
Erick

On Fri, Jun 3, 2011 at 8:38 AM, Judioo  wrote:

What is the "best practice" method to index the following in Solr:

I'm attempting to use solr for a book store site.

Each book will have a price but on occasions this will be discounted. The
discounted price exists for a defined time period but there may be many
discount periods. Each discount will have a brief synopsis, start and end
time.

A subset of the desired output would be as follows:

...
"response":{"numFound":1,"start":0,"docs":[
  

Re: Solr Indexing Patterns

2011-06-06 Thread Judioo
I do think that Solr would be better served if there was a *best practice
section *of the site.

Looking at the majority of emails to this list, they revolve around "how do I
do X?".

Seems like tutorials with real world examples would serve Solr no end of
good.

I still do not have an example of the best method to approach my problem,
although Erick has helped me understand the limitations of Solr.

Just thought I'd say.






On 6 June 2011 20:26, Judioo  wrote:

> Thanks
>
>
> On 6 June 2011 19:32, Erick Erickson  wrote:
>
>> #Everybody# (including me) who has any RDBMS background
>> doesn't want to flatten data, but that's usually the way to go in
>> Solr.
>>
>> Part of whether it's a good idea or not depends on how big the index
>> gets, and unfortunately the only way to figure that out is to test.
>>
>> But that's the first approach I'd try.
>>
>> Good luck!
>> Erick
>>
>> On Mon, Jun 6, 2011 at 11:42 AM, Judioo  wrote:
>> > On 5 June 2011 14:42, Erick Erickson  wrote:
>> >
>> >> See: http://wiki.apache.org/solr/SchemaXml
>> >>
>> >> By adding ' "multiValued="true" ' to the field, you can add
>> >> the same field multiple times in a doc, something like
>> >>
>> >> <add>
>> >> <doc>
>> >>  <field name="myfield">value1</field>
>> >>  <field name="myfield">value2</field>
>> >> </doc>
>> >> </add>
>> >>
>> >> I can't see how that would work as one would need to associate the
>> right
>> > start / end dates and price.
>> > As I understand using multivalued and thus flattening the  discounts
>> would
>> > result in:
>> >
>> > {
>> >"name":"The Book",
>> >"price":"$9.99",
>> >"price":"$3.00",
>> >"price":"$4.00","synopsis":"thanksgiving special",
>> >"starts":"11-24-2011",
>> >"starts":"10-10-2011",
>> >"ends":"11-25-2011",
>> >"ends":"10-11-2011",
>> >"synopsis":"Canadian thanksgiving special",
>> >  },
>> >
>> > How does one differentiate the different offers?
>> >
>> >
>> >
>> >> But there's no real ability  in Solr to store "sub documents",
>> >> so you'd have to get creative in how you encoded the discounts...
>> >>
>> >
>> > This is what I'm asking :)
>> > What is the best / recommended / known patterns for doing this?
>> >
>> >
>> >
>> >>
>> >> But I suspect a better approach would be to store each discount as
>> >> a separate document. If you're in the trunk version, you could then
>> >> group results by, say, ISBN and get responses grouped together...
>> >>
>> >
>> > This is an option but seems suboptimal. So say I store the discounts in
>> > multiple documents with ISBN as an attribute and also store the title
>> again
>> > with ISBN as an attribute.
>> >
>> > To get
>> > "all books currently discounted"
>> >
>> > requires 2 requests
>> >
>> > * get all discounts currently active
>> > * get all books using ISBN retrieved from above search
>> >
>> > Not that bad. However what happens when I want
>> > "all books that are currently on discount in the "horror" genre
>> containing
>> > the word 'elm' in the title."
>> >
>> > The only way I can see in catering for the above search is to duplicate
>> all
>> > searchable fields in my "book" document in my "discount" document.
>> Coming
>> > from an RDBMS background this seems wrong.
>> >
>> > Is this the correct approach to take?
>> >
>> >
>> >
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Sat, Jun 4, 2011 at 1:42 AM, Judioo  wrote:
>> >> > Hi,
>> >> > Discounts can change daily. Also there can be a lot of them (over
>> time
>> >> and
>> >> > in a given time period ).
>> >> >
>> >> > Could you give an example of what you mean by multi-valuing the
>> field.
>> >> >
>> >> > Thanks
>> >> >
>> >> > On 3 June 2011 14:29, Erick Erickson 
>> wrote:
>> >> >
>> >> >> How often are the discounts changed? Because you can simply
>> >> >> re-index the book information with a multiValued "discounts" field
>> >> >> and get something similar to your example (&wt=json)
>> >> >>
>> >> >>
>> >> >> Best
>> >> >> Erick
>> >> >>
>> >> >> On Fri, Jun 3, 2011 at 8:38 AM, Judioo  wrote:
>> >> >> > What is the "best practice" method to index the following in Solr:
>> >> >> >
>> >> >> > I'm attempting to use solr for a book store site.
>> >> >> >
>> >> >> > Each book will have a price but on occasions this will be
>> discounted.
>> >> The
>> >> >> > discounted price exists for a defined time period but there may be
>> >> many
>> >> >> > discount periods. Each discount will have a brief synopsis, start
>> and
>> >> end
>> >> >> > time.
>> >> >> >
>> >> >> > A subset of the desired output would be as follows:
>> >> >> >
>> >> >> > ...
>> >> >> > "response":{"numFound":1,"start":0,"docs":[
>> >> >> >  {
>> >> >> >"name":"The Book",
>> >> >> >"price":"$9.99",
>> >> >> >"discounts":[
>> >> >> >{
>> >> >> > "price":"$3.00",
>> >> >> > "synopsis":"thanksgiving special",
>> >> >> > "starts":"11-24-2011",
>> >> >> > "ends":"11-25-2011",
>> >> >> >},
>> >> >> >{
>> >> >> > "price":"$4.00",
>> >> >> > "synopsis":"Canadian thanksgiving special",
>>

Re: Solr Indexing Patterns

2011-06-06 Thread Judioo
Thanks

On 6 June 2011 19:32, Erick Erickson  wrote:

> #Everybody# (including me) who has any RDBMS background
> doesn't want to flatten data, but that's usually the way to go in
> Solr.
>
> Part of whether it's a good idea or not depends on how big the index
> gets, and unfortunately the only way to figure that out is to test.
>
> But that's the first approach I'd try.
>
> Good luck!
> Erick
>
> On Mon, Jun 6, 2011 at 11:42 AM, Judioo  wrote:
> > On 5 June 2011 14:42, Erick Erickson  wrote:
> >
> >> See: http://wiki.apache.org/solr/SchemaXml
> >>
> >> By adding ' "multiValued="true" ' to the field, you can add
> >> the same field multiple times in a doc, something like
> >>
> >> <add>
> >> <doc>
> >>  <field name="myfield">value1</field>
> >>  <field name="myfield">value2</field>
> >> </doc>
> >> </add>
> >>
> >> I can't see how that would work as one would need to associate the right
> > start / end dates and price.
> > As I understand using multivalued and thus flattening the  discounts
> would
> > result in:
> >
> > {
> >"name":"The Book",
> >"price":"$9.99",
> >"price":"$3.00",
> >"price":"$4.00","synopsis":"thanksgiving special",
> >"starts":"11-24-2011",
> >"starts":"10-10-2011",
> >"ends":"11-25-2011",
> >"ends":"10-11-2011",
> >"synopsis":"Canadian thanksgiving special",
> >  },
> >
> > How does one differentiate the different offers?
> >
> >
> >
> >> But there's no real ability  in Solr to store "sub documents",
> >> so you'd have to get creative in how you encoded the discounts...
> >>
> >
> > This is what I'm asking :)
> > What is the best / recommended / known patterns for doing this?
> >
> >
> >
> >>
> >> But I suspect a better approach would be to store each discount as
> >> a separate document. If you're in the trunk version, you could then
> >> group results by, say, ISBN and get responses grouped together...
> >>
> >
> > This is an option but seems suboptimal. So say I store the discounts in
> > multiple documents with ISBN as an attribute and also store the title
> again
> > with ISBN as an attribute.
> >
> > To get
> > "all books currently discounted"
> >
> > requires 2 requests
> >
> > * get all discounts currently active
> > * get all books using ISBN retrieved from above search
> >
> > Not that bad. However what happens when I want
> > "all books that are currently on discount in the "horror" genre
> containing
> > the word 'elm' in the title."
> >
> > The only way I can see in catering for the above search is to duplicate
> all
> > searchable fields in my "book" document in my "discount" document. Coming
> > from an RDBMS background this seems wrong.
> >
> > Is this the correct approach to take?
> >
> >
> >
> >>
> >> Best
> >> Erick
> >>
> >> On Sat, Jun 4, 2011 at 1:42 AM, Judioo  wrote:
> >> > Hi,
> >> > Discounts can change daily. Also there can be a lot of them (over time
> >> and
> >> > in a given time period ).
> >> >
> >> > Could you give an example of what you mean by multi-valuing the
> field.
> >> >
> >> > Thanks
> >> >
> >> > On 3 June 2011 14:29, Erick Erickson  wrote:
> >> >
> >> >> How often are the discounts changed? Because you can simply
> >> >> re-index the book information with a multiValued "discounts" field
> >> >> and get something similar to your example (&wt=json)
> >> >>
> >> >>
> >> >> Best
> >> >> Erick
> >> >>
> >> >> On Fri, Jun 3, 2011 at 8:38 AM, Judioo  wrote:
> >> >> > What is the "best practice" method to index the following in Solr:
> >> >> >
> >> >> > I'm attempting to use solr for a book store site.
> >> >> >
> >> >> > Each book will have a price but on occasions this will be
> discounted.
> >> The
> >> >> > discounted price exists for a defined time period but there may be
> >> many
> >> >> > discount periods. Each discount will have a brief synopsis, start
> and
> >> end
> >> >> > time.
> >> >> >
> >> >> > A subset of the desired output would be as follows:
> >> >> >
> >> >> > ...
> >> >> > "response":{"numFound":1,"start":0,"docs":[
> >> >> >  {
> >> >> >"name":"The Book",
> >> >> >"price":"$9.99",
> >> >> >"discounts":[
> >> >> >{
> >> >> > "price":"$3.00",
> >> >> > "synopsis":"thanksgiving special",
> >> >> > "starts":"11-24-2011",
> >> >> > "ends":"11-25-2011",
> >> >> >},
> >> >> >{
> >> >> > "price":"$4.00",
> >> >> > "synopsis":"Canadian thanksgiving special",
> >> >> > "starts":"10-10-2011",
> >> >> > "ends":"10-11-2011",
> >> >> >},
> >> >> > ]
> >> >> >  },
> >> >> >  .
> >> >> >
> >> >> > A requirement is to be able to search for just discounted
> >> publications. I
> >> >> > think I could use date faceting for this ( return publications that
> >> are
> >> >> > within a discount window ). When a discount search is performed no
> >> >> > publications that are not currently discounted will be returned.
> >> >> >
> >> >> > My question are:
> >> >> >
> >> >> >   - Does solr support this type of sub documents
> >> >> >
> >> >> > In the above example th

Re: Solr performance tuning - disk i/o?

2011-06-06 Thread Erick Erickson
If you're seeing results, things must be OK. It's a little strange, though;
I'm seeing warmup times of 1 on a trivial reload of the example documents.

But I wouldn't worry too much here. Those are pretty high autowarm counts, you
might have room to reduce them but absent long autowarm times there's not much
reason to mess with them...

Best
Erick

On Mon, Jun 6, 2011 at 1:38 PM, Demian Katz  wrote:
> All of my cache autowarmCount settings are either 1 or 5.  
> maxWarmingSearchers is set to 2.  I previously shared the contents of my 
> firstSearcher and newSearcher events -- just a "queries" array surrounded by 
> a standard-looking  tag.  The events are definitely firing -- in 
> addition to the measurable performance improvement they give me, I can 
> actually see them happening in the console output during startup.  That seems 
> to cover every configuration option in my file that references warming in any 
> way, and it all looks reasonable to me.  warmupTime remains consistently 0 in 
> the statistics display.  Is there anything else I should be looking at?  In 
> any case, I'm not too alarmed by this... it just seems a little strange.
>
> thanks,
> Demian
>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Monday, June 06, 2011 11:59 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr performance tuning - disk i/o?
>>
>> Polling interval was in reference to slaves in a multi-machine
>> master/slave setup. so probably not
>> a concern just at present.
>>
>> Warmup time of 0 is not particularly normal, I'm not quite sure what's
>> going on there but you may
>> want to look at firstsearcher, newsearcher and autowarm parameters in
>> config.xml..
>>
>> Best
>> Erick
>>
>> On Mon, Jun 6, 2011 at 9:08 AM, Demian Katz 
>> wrote:
>> > Thanks once again for the helpful suggestions!
>> >
>> > Regarding the selection of facet fields, I think publishDate (which
>> is actually just a year) and callnumber-first (which is actually a very
>> broad, high-level category) are okay.  authorStr is an interesting
>> problem: it's definitely a useful facet (when a user searches for an
>> author, odds are good that they want the one who published the most
>> books... i.e. a search for dickens will probably show Charles Dickens
>> at the top of the facet list), but it has a long tail since there are
>> many minor authors who have only published one or two books...  Is
>> there a possibility that the facet.mincount parameter could be helpful
>> here, or does that have no impact on performance/memory footprint?
>> >
>> > Regarding polling interval for slaves, are you referring to a
>> distributed Solr environment, or is this something to do with Solr's
>> internals?  We're currently a single-server environment, so I don't
>> think I have to worry if it's related to a multi-server setup...  but
>> if it's something internal, could you point me to the right area of the
>> admin panel to check my stats?  I'm not seeing anything about polling
>> on the statistics page.  It's also a little strange that all of my
>> warmupTime stats on searchers and caches are showing as 0 -- is that
>> normal?
>> >
>> > thanks,
>> > Demian
>> >
>> >> -Original Message-
>> >> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >> Sent: Friday, June 03, 2011 4:45 PM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: Solr performance tuning - disk i/o?
>> >>
>> >> Quick impressions:
>> >>
>> >> The faceting is usually best done on fields that don't have lots of
>> >> unique
>> >> values for three reasons:
>> >> 1> It's questionable how much use to the user to have a gazillion
>> >> facets.
>> >>      In the case of a unique field per document, in fact, it's
>> useless.
>> >> 2> resource requirements go up as a function of the number of unique
>> >>      terms. This is true for faceting and sorting.
>> >> 3> warmup times grow the more terms have to be read into memory.
>> >>
>> >>
>> >> Glancing at your warmup stuff, things like publishDate, authorStr
>> and
>> >> maybe
>> >> callnumber-first are questionable. publishDate depends on how coarse
>> >> the
>> >> resolution is. If it's by day, that's not really much use.
>> authorStr..
>> >> How many
>> >> authors have more than one publication? Would this be better served
>> by
>> >> some
>> >> kind of autosuggest rather than facets? callnumber-first... I don't
>> >> really know, but
>> >> if it's unique per document it's probably not something the user
>> would
>> >> find useful
>> >> as a facet.
>> >>
>> >> The admin page will help you determine the number of unique terms
>> per
>> >> field,
>> >> which may guide you whether or not to continue to facet on these
>> >> fields.
>> >>
>> >> As Otis said, doing a sort on the fields during warmup will also
>> help.
>> >>
>> >> Watch your polling interval for any slaves in relation to the warmup
>> >> times.
>> >> If your polling interval is shorter than the warmup times, you ru

Re: Solr Indexing Patterns

2011-06-06 Thread Erick Erickson
#Everybody# (including me) who has any RDBMS background
doesn't want to flatten data, but that's usually the way to go in
Solr.

Part of whether it's a good idea or not depends on how big the index
gets, and unfortunately the only way to figure that out is to test.

But that's the first approach I'd try.
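
As an illustration of that flattened approach, one document per discount
with the book fields repeated, here is a minimal SolrJ sketch (3.x-era
client; field names and URL invented):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FlattenedDiscountDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Book-level fields are duplicated into each discount document so a
    // single query can filter on title, genre, and discount window at once.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "9780000000000-discount-1");
    doc.addField("isbn", "9780000000000");
    doc.addField("name", "The Book");
    doc.addField("discount_price", 3.00);
    doc.addField("synopsis", "thanksgiving special");
    doc.addField("starts", "2011-11-24T00:00:00Z");
    doc.addField("ends", "2011-11-25T00:00:00Z");
    server.add(doc);
    server.commit();
  }
}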

Good luck!
Erick

On Mon, Jun 6, 2011 at 11:42 AM, Judioo  wrote:
> On 5 June 2011 14:42, Erick Erickson  wrote:
>
>> See: http://wiki.apache.org/solr/SchemaXml
>>
>> By adding ' "multiValued="true" ' to the field, you can add
>> the same field multiple times in a doc, something like
>>
>> <add>
>> <doc>
>>  <field name="myfield">value1</field>
>>  <field name="myfield">value2</field>
>> </doc>
>> </add>
>>
>> I can't see how that would work as one would need to associate the right
> start / end dates and price.
> As I understand using multivalued and thus flattening the  discounts would
> result in:
>
> {
>    "name":"The Book",
>    "price":"$9.99",
>    "price":"$3.00",
>    "price":"$4.00",    "synopsis":"thanksgiving special",
>    "starts":"11-24-2011",
>    "starts":"10-10-2011",
>    "ends":"11-25-2011",
>    "ends":"10-11-2011",
>    "synopsis":"Canadian thanksgiving special",
>  },
>
> How does one differentiate the different offers?
>
>
>
>> But there's no real ability  in Solr to store "sub documents",
>> so you'd have to get creative in how you encoded the discounts...
>>
>
> This is what I'm asking :)
> What is the best / recommended / known patterns for doing this?
>
>
>
>>
>> But I suspect a better approach would be to store each discount as
>> a separate document. If you're in the trunk version, you could then
>> group results by, say, ISBN and get responses grouped together...
>>
>
> This is an option but seems suboptimal. So say I store the discounts in
> multiple documents with ISBN as an attribute and also store the title again
> with ISBN as an attribute.
>
> To get
> "all books currently discounted"
>
> requires 2 requests
>
> * get all discounts currently active
> * get all books using ISBN retrieved from above search
>
> Not that bad. However what happens when I want
> "all books that are currently on discount in the "horror" genre containing
> the word 'elm' in the title."
>
> The only way I can see of catering for the above search is to duplicate all
> searchable fields in my "book" document in my "discount" document. Coming
> from an RDBMS background this seems wrong.
>
> Is this the correct approach to take?
>
>
>
>>
>> Best
>> Erick
>>
>> On Sat, Jun 4, 2011 at 1:42 AM, Judioo  wrote:
>> > Hi,
>> > Discounts can change daily. Also there can be a lot of them (over time
>> and
>> > in a given time period ).
>> >
>> > Could you give an example of what you mean by multi-valuing the field.
>> >
>> > Thanks
>> >
>> > On 3 June 2011 14:29, Erick Erickson  wrote:
>> >
>> >> How often are the discounts changed? Because you can simply
>> >> re-index the book information with a multiValued "discounts" field
>> >> and get something similar to your example (&wt=json)
>> >>
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Fri, Jun 3, 2011 at 8:38 AM, Judioo  wrote:
>> >> > What is the "best practice" method to index the following in Solr:
>> >> >
>> >> > I'm attempting to use solr for a book store site.
>> >> >
>> >> > Each book will have a price but on occasions this will be discounted.
>> The
>> >> > discounted price exists for a defined time period but there may be
>> many
>> >> > discount periods. Each discount will have a brief synopsis, start and
>> end
>> >> > time.
>> >> >
>> >> > A subset of the desired output would be as follows:
>> >> >
>> >> > ...
>> >> > "response":{"numFound":1,"start":0,"docs":[
>> >> >  {
>> >> >    "name":"The Book",
>> >> >    "price":"$9.99",
>> >> >    "discounts":[
>> >> >        {
>> >> >         "price":"$3.00",
>> >> >         "synopsis":"thanksgiving special",
>> >> >         "starts":"11-24-2011",
>> >> >         "ends":"11-25-2011",
>> >> >        },
>> >> >        {
>> >> >         "price":"$4.00",
>> >> >         "synopsis":"Canadian thanksgiving special",
>> >> >         "starts":"10-10-2011",
>> >> >         "ends":"10-11-2011",
>> >> >        },
>> >> >     ]
>> >> >  },
>> >> >  .
>> >> >
>> >> > A requirement is to be able to search for just discounted
>> publications. I
>> >> > think I could use date faceting for this ( return publications that
>> are
>> >> > within a discount window ). When a discount search is performed no
>> >> > publications that are not currently discounted will be returned.
>> >> >
>> >> > My question are:
>> >> >
>> >> >   - Does solr support this type of sub documents
>> >> >
>> >> > In the above example the discounts are the sub documents. I know solr
>> is
>> >> not
>> >> > a relational DB but I would like to store and index the above
>> >> representation
>> >> > in a single document if possible.
>> >> >
>> >> >   - what is the best method to approach the above
>> >> >
>> >> > I can see in many examples the authors tend to denormalize to solve
>> >> similar
>> >> > problems. This s

Re: TIKA INTEGRATION PERFORMANCE

2011-06-06 Thread Tomás Fernández Löbbe
On Mon, Jun 6, 2011 at 1:47 PM, Naveen Gupta  wrote:

> Hi Tomas,
>
> 1. Regarding SolrInputDocument,
>
> We are not using the Java client; rather we are using PHP Solr, wrapping
> content
> in a SolrInputDocument. I am not sure how to do this in the PHP client. In this case,
> we need the Tika-related jars to get at metadata such as content .. we
> certainly don't want to handle all these things in the PHP client.
>

I don't understand, Tika IS integrated in Solr, it doesn't matter which
client or client language you are using. To add a static value, all you have
to do is add it as a request parameter with the prefix "literal". Something
like "literal.somefield=thevalue". Content and other file metadata such as
author etc (see
http://wiki.apache.org/solr/ExtractingRequestHandler#Metadata) will be added
to the document inside Solr and indexed. You don't need to handle this on
the client application.
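
For illustration, the same extract request via SolrJ might look like this
sketch (3.x-era API; the literal field names are invented):

import java.io.File;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8010/solr");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("paper.pdf"));
    req.setParam("literal.id", "doc2");            // static field supplied by the client
    req.setParam("literal.somefield", "thevalue"); // any other literal you need
    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    server.request(req);
  }
}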

>
> Secondly, what I was asking about is the commit strategy --
>
> suppose you have 100 docs:
>
> iterate over 99 docs and fire curl without commit in the url,
>
> and for the 100th doc, we will use commit ...
>
> so, in doing that, will it also update the indexes for the last 99 docs? ...
>
> while(upto 99){
> curl_command = url without commit;
> }
>
> when i = 100, url would be commit
>

You can certainly do this. The 100 documents will be available for search
after the commit. None of the documents will be available for search before
commit.
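
In other words, something like this sketch: add everything, then issue a
single commit at the end (hypothetical document list):

import java.util.List;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitOnceDemo {
  static void indexAll(CommonsHttpSolrServer server,
                       List<SolrInputDocument> docs) throws Exception {
    for (SolrInputDocument doc : docs) {
      server.add(doc);   // no commit inside the loop
    }
    server.commit();     // one commit makes all the added docs searchable
  }
}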

>
> I wanted to achieve something similar to an optimize kind of thing ...
>

The optimize command should be issued when not many queries or updates are
being sent to the index. It uses lots of resources and will slow down queries.

>
> why aren't these kinds of general-purpose use cases included in the
> examples (especially for other languages ... Java guys can easily do it using the API)?
>

They are; you can use the auto-commit feature, configured in the solrconfig.xml
file. You can either tell Solr to commit on a time interval or after a number
of documents have been updated and not committed. In the example file, the
autocommit section is commented out, but you can uncomment it.


> I am basically a Java guy, so I can feel the problem.
>
> Thanks
> Naveen
> 2011/6/6 Tomás Fernández Löbbe 
>
> > 1. About the commit strategy, all the ExtractingRequestHandler (request
> > handler that uses Tika to extract content from the input file) will do is
> > extract the content of your file and add it to a SolrInputDocument. The
> > commit strategy should not change because of this, compared to other
> > documents you might be indexing. It is usually not recommended to commit
> on
> > every new / updated document.
> >
> > 2. Don't know if I understand the question. you can add all the static
> > fields you want to the document by adding the "literal." prefix to the
> name
> > of the fields when using ExtractingRequestHandler (as you are doing with
> "
> > literal.id"). You can also leave empty fields if they are not marked as
> > "required" at the schema.xml file. See:
> > http://wiki.apache.org/solr/ExtractingRequestHandler#Literals
> >
> > 3. Solr cores can work almost as completely different Solr instances. You
> > could tell one core to replicate from another core. I don't think this
> > would
> > be of any help here. If you want to separate the indexing operations from
> > the query operations, you could probably use different machines, that's
> > usually a better option. Configure the indexing box as master and the
> query
> > box as slave. Here you have some more information about it:
> > http://wiki.apache.org/solr/SolrReplication
> >
> > Were this the answers you were looking for or did I misunderstand your
> > questions?
> >
> > Tomás
> >
> > On Mon, Jun 6, 2011 at 2:54 AM, Naveen Gupta 
> wrote:
> >
> > > Hi
> > >
> > > Since it is php, we are using solphp for calling curl based call,
> > >
> > > what my concern here is that for each user, we might be having 20-40
> > > attachments needed to be indexed each day, and there are various users
> > > ..daily we are targeting around 500-1000 users ..
> > >
> > > right now if you see, we
> > >
> > > <?php $ch = curl_init('
> > > http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true'
> );
> > >  curl_setopt ($ch, CURLOPT_POST, 1);
> > >  curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>"@paper.pdf"));
> > >  $result= curl_exec ($ch);
> > > ?>
> > >
> > > also we are planning to use other fields which are to be indexed and
> > stored
> > > ...
> > >
> > >
> > > There are couple of questions here
> > >
> > > 1. what would be the best strategies for commit. if we take all the
> > > documents in an array and iterating one by one and fire the curl and
> for
> > > the
> > > last doc, if we commit, will it work or for each doc, we need to
> commit?
> > >
> > > 2. we are having several fields which are already defined in schema and
> > few
> > > of the them are required earlier, but for this purpose, we don't want,
> > how
> > > to have two requirement together in the same schema?
> > >
> 

Re: Need query help

2011-06-06 Thread Alexey Serba
See "Tagging and excluding Filters" section

* 
http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters
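
The idea is to tag each filter and exclude that tag when faceting on the
corresponding field, so a single request returns both the filtered results
and the still-available alternatives. A SolrJ sketch with the field names
from the question (untested):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TagExcludeDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("{!tag=b}brand_id:(100 OR 150)");
    q.addFilterQuery("{!tag=f}filters:(p1s100 OR p4s20)");
    q.setFacet(true);
    // Excluding a tag makes that facet ignore the matching filter, so it
    // still shows the other selectable values.
    q.addFacetField("{!ex=b}brand_id");
    q.addFacetField("{!ex=f}filters");
    q.addFacetField("price");
    System.out.println(server.query(q).getFacetFields());
  }
}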

2011/6/6 Denis Kuzmenok :
> For now i have a collection with:
> id (int)
> price (double) multivalue
> brand_id (int)
> filters (string) multivalue
>
> I need to get the available brand_id, filters, and price values and the list
> of id's for the current query. For example, now I'm doing queries with
> facet.field=brand_id/filters/price:
> 1) to get current id's list: (brand_id:100 OR brand_id:150) AND 
> (filters:p1s100 OR filters:p4s20)
> 2) to get available filters on selected properties (same properties but
> another  values):  (brand_id:100 OR brand_id:150) AND (filters:p1s* OR
> filters:p4s*)
> 3) to get available brand_id (if any are selected, if none - take from
> 1st query results): (filters:p1s100 OR filters:p4s20)
> 4) another request to get available prices if any are selected
>
> Is there any way to simplify this task?
> Data needed:
> 1) Id's for selected filters, price, brand_id
> 2) Available filters, price, brand_id from selected values
> 3) Another values for selected properties (is any chosen)
> 4) Another brand_id for selected brand_id
> 5) Another price for selected price
>
> Will appreciate any help or thoughts!
>
> Cheers,
> Denis Kuzmenok
>
>


Re: Auto-scaling solr setup

2011-06-06 Thread Akshay
Yes, sadly.. I too don't have much of a clue about AWS.

The SolrReplication API doesn't give me exactly what I want.. For the time
being I have hacked my way around it in the Amazon image, bootstrapping the
replication check in a shell script (a very dirty curl & awk way). Once the
check succeeds I enable the server using the Solr healthcheck for
load-balancers. I was wondering if anyone has moved to the cloud.. especially
Amazon auto-scaling, where they don't have control over when a new node is
fired.. All scenarios I encountered were people creating a node.. warming
up the cache and then adding it under the HAProxy LB.

I guess warmup is not that big an issue as compared to an empty response.
Thanks for your response :)

Regards,
Akshay

On Mon, Jun 6, 2011 at 6:33 PM, Erick Erickson wrote:

> The HTTP interface (http://wiki.apache.org/solr/SolrReplication#HTTP_API)
> can be used to control lots of parts of replication.
>
> As to warmups, I don't know of a good way to test that. I don't know
> whether
> getting the current status on the slave includes whether warmup is
> completed
> or not. At worst, after replication is complete you could wait an interval
> (see
> the warmup times on your running servers) before routing requests to the
> slave.
>
> I haven't any clue at all about AWS...
>
> Best
> Erick
>
> On Mon, Jun 6, 2011 at 9:18 AM, Akshay  wrote:
> > So i am trying to setup an auto-scaling search system of ec2 solr-slaves
> > which scale up as number of requests increase and vice versa
> > Here is what I have
> > 1. A solr master and underlying slaves(scalable). And an elastic load
> > balancer to distribute the load.
> > 2. The ec2-auto-scaling setup fires nodes when traffic increases. However
> > the replication times(replication speed) for the index from the master
> > varies for these newly fired nodes.
> > 3. I want to avoid addition of these nodes to the load balancer till it
> has
> > completed initial replication and has a warmed up cache.
> >For this I need to know a way I can check if the initial replication
> has
> > completed. and also a way of warming up the cache post this.
> >
> > I can think of doing this via .. a shellscript/awk(checking times
> > replicated/index size) ... is there a cleaner way ?
> >
> > Also on the side note .. any suggestions or pointers to how one set up
> their
> > scalable solr setup on cloud(AWS mainly) would be helpful.
> >
> > Regards,
> > Akshay
> >
>


RE: Solr performance tuning - disk i/o?

2011-06-06 Thread Demian Katz
All of my cache autowarmCount settings are either 1 or 5.  
maxWarmingSearchers is set to 2.  I previously shared the contents of my 
firstSearcher and newSearcher events -- just a "queries" array surrounded by a 
standard-looking  tag.  The events are definitely firing -- in 
addition to the measurable performance improvement they give me, I can actually 
see them happening in the console output during startup.  That seems to cover 
every configuration option in my file that references warming in any way, and 
it all looks reasonable to me.  warmupTime remains consistently 0 in the 
statistics display.  Is there anything else I should be looking at?  In any 
case, I'm not too alarmed by this... it just seems a little strange.

thanks,
Demian

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Monday, June 06, 2011 11:59 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr performance tuning - disk i/o?
> 
> Polling interval was in reference to slaves in a multi-machine
> master/slave setup. so probably not
> a concern just at present.
> 
> Warmup time of 0 is not particularly normal, I'm not quite sure what's
> going on there but you may
> want to look at firstsearcher, newsearcher and autowarm parameters in
> config.xml..
> 
> Best
> Erick
> 
> On Mon, Jun 6, 2011 at 9:08 AM, Demian Katz 
> wrote:
> > Thanks once again for the helpful suggestions!
> >
> > Regarding the selection of facet fields, I think publishDate (which
> is actually just a year) and callnumber-first (which is actually a very
> broad, high-level category) are okay.  authorStr is an interesting
> problem: it's definitely a useful facet (when a user searches for an
> author, odds are good that they want the one who published the most
> books... i.e. a search for dickens will probably show Charles Dickens
> at the top of the facet list), but it has a long tail since there are
> many minor authors who have only published one or two books...  Is
> there a possibility that the facet.mincount parameter could be helpful
> here, or does that have no impact on performance/memory footprint?
> >
> > Regarding polling interval for slaves, are you referring to a
> distributed Solr environment, or is this something to do with Solr's
> internals?  We're currently a single-server environment, so I don't
> think I have to worry if it's related to a multi-server setup...  but
> if it's something internal, could you point me to the right area of the
> admin panel to check my stats?  I'm not seeing anything about polling
> on the statistics page.  It's also a little strange that all of my
> warmupTime stats on searchers and caches are showing as 0 -- is that
> normal?
> >
> > thanks,
> > Demian
> >
> >> -Original Message-
> >> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >> Sent: Friday, June 03, 2011 4:45 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Solr performance tuning - disk i/o?
> >>
> >> Quick impressions:
> >>
> >> The faceting is usually best done on fields that don't have lots of
> >> unique
> >> values for three reasons:
> >> 1> It's questionable how much use to the user to have a gazillion
> >> facets.
> >>      In the case of a unique field per document, in fact, it's
> useless.
> >> 2> resource requirements go up as a function of the number of unique
> >>      terms. This is true for faceting and sorting.
> >> 3> warmup times grow the more terms have to be read into memory.
> >>
> >>
> >> Glancing at your warmup stuff, things like publishDate, authorStr
> and
> >> maybe
> >> callnumber-first are questionable. publishDate depends on how coarse
> >> the
> >> resolution is. If it's by day, that's not really much use.
> authorStr..
> >> How many
> >> authors have more than one publication? Would this be better served
> by
> >> some
> >> kind of autosuggest rather than facets? callnumber-first... I don't
> >> really know, but
> >> if it's unique per document it's probably not something the user
> would
> >> find useful
> >> as a facet.
> >>
> >> The admin page will help you determine the number of unique terms
> per
> >> field,
> >> which may guide you whether or not to continue to facet on these
> >> fields.
> >>
> >> As Otis said, doing a sort on the fields during warmup will also
> help.
> >>
> >> Watch your polling interval for any slaves in relation to the warmup
> >> times.
> >> If your polling interval is shorter than the warmup times, you run a
> >> risk of
> >> "runaway warmups".
> >>
> >> As you've figured out, measuring responses to the first few queries
> >> doesn't
> >> always measure what you really need ..
> >>
> >> I don't have the pages handy, but autowarming is a good topic to
> >> understand,
> >> so you might spend some time tracking it down.
> >>
> >> Best
> >> Erick
> >>
> >> On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz
> >>  wrote:
> >> > Thanks to you and Otis for the suggestions!  Some more
> information:
> >> >
> >> > - Based on the Sol

Re: Auto-scaling solr setup

2011-06-06 Thread Erick Erickson
The HTTP interface (http://wiki.apache.org/solr/SolrReplication#HTTP_API)
can be used to control lots of parts of replication.
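For example, requesting http://slave_host:port/solr/replication?command=details
on a slave reports the current replication state (host and port are
placeholders; the command is documented on that wiki page).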

As to warmups, I don't know of a good way to test that. I don't know whether
getting the current status on the slave includes whether warmup is completed
or not. At worst, after replication is complete you could wait an interval (see
the warmup times on your running servers) before routing requests to the
slave.

I haven't any clue at all about AWS...

Best
Erick

On Mon, Jun 6, 2011 at 9:18 AM, Akshay  wrote:
> So I am trying to set up an auto-scaling search system of EC2 Solr slaves
> which scales up as the number of requests increases, and vice versa.
> Here is what I have
> 1. A solr master and underlying slaves(scalable). And an elastic load
> balancer to distribute the load.
> 2. The ec2-auto-scaling setup fires nodes when traffic increases. However
> the replication times(replication speed) for the index from the master
> varies for these newly fired nodes.
> 3. I want to avoid addition of these nodes to the load balancer till it has
> completed initial replication and has a warmed up cache.
>    For this I need to know a way I can check if the initial replication has
> completed, and also a way of warming up the cache after that.
>
> I can think of doing this via .. a shellscript/awk(checking times
> replicated/index size) ... is there a cleaner way ?
>
> Also on the side note .. any suggestions or pointers to how one set up their
> scalable solr setup on cloud(AWS mainly) would be helpful.
>
> Regards,
> Akshay
>


Re: SolrJ and Range Faceting

2011-06-06 Thread Jamie Johnson
Small error: I shouldn't be using this.start, but should instead be using
Double.parseDouble(this.getValue())
and
sdf.parse(count.getValue())
respectively.

On Mon, Jun 6, 2011 at 1:16 PM, Jamie Johnson  wrote:

> Thanks Martijn.  I pulled your patch and it looks like what I was looking
> for.  The original FacetField class has a getAsFilterQuery method which
> returns the criteria to use as an fq parameter, I have logic which does this
> in my class which works, any chance of getting something like this added to
> the patch as well?
>
>
>   public static class Numeric extends RangeFacet {
>
> public Numeric(String name, Number start, Number end, Number gap) {
>   super(name, start, end, gap);
> }
>
>   public String getAsFilterQuery(){
>   Double end = this.start.doubleValue() + this.gap.doubleValue() -
> 1;
>   return this.name + ":[" + this.start + " TO " + end + "]";
>   }
>
>
>   }
>
>
> and for dates (there's a parse exception below which I am not doing
> anything with currently)
>
>   public String getAsFilterQuery(){
>   RangeFacet.Date dateCount =
> (RangeFacet.Date)count.getRangeFacet();
>
>   DateMathParser parser = new DateMathParser(TimeZone.getDefault(),
> Locale.getDefault());
>   SimpleDateFormat sdf = new
> SimpleDateFormat("-MM-dd'T'HH:mm:ss");
>
>   parser.setNow(dateCount.getStart());
>   Date end = parser.parseMath(dateCount.getGap());
>   String startStr = sdf.format(dateCount.getStart()) + "Z";
>   String endStr = sdf.format(end) + "Z";
>   String label = startStr + " TO " + endStr;
>   return facetField.getName() + ":[" + label + "]";
>
>   }
>
>
> On Fri, Jun 3, 2011 at 7:05 AM, Martijn v Groningen <
> martijn.is.h...@gmail.com> wrote:
>
>> Hi Jamie,
>>
>> I don't know why range facets didn't make it into SolrJ. But I've recently
>> opened an issue for this:
>> https://issues.apache.org/jira/browse/SOLR-2523
>>
>> I hope this will be committed soon. Check the patch out and see if you
>> like
>> it.
>>
>> Martijn
>>
>> On 2 June 2011 18:22, Jamie Johnson  wrote:
>>
>> > Currently the range and date faceting in SolrJ acts a bit differently
>> than
>> > I
>> > would expect.  Specifically, range facets aren't parsed at all and date
>> > facets end up generating filterQueries which don't have the range, just
>> the
>> > lower bound.  Is there a reason why SolrJ doesn't support these?  I have
>> > written some things on my end to handle these and generate filterQueries
>> > for
>> > date ranges of the form dateTime:[start TO end] and I have a function
>> > (which
>> > I copied from the date faceting) which parses the range facets, but
>> would
>> > prefer not to have to maintain these myself.  Is there a plan to
>> implement
>> > these?  Also is there a plan to update FacetField to not have end be a
>> > date,
>> > perhaps making it a String like start so we can support date and range
>> > queries?
>> >
>>
>>
>>
>> --
>> Met vriendelijke groet,
>>
>> Martijn van Groningen
>>
>
>


Re: SolrJ and Range Faceting

2011-06-06 Thread Jamie Johnson
Thanks Martijn.  I pulled your patch and it looks like what I was looking
for.  The original FacetField class has a getAsFilterQuery method which
returns the criteria to use as an fq parameter, I have logic which does this
in my class which works, any chance of getting something like this added to
the patch as well?


  public static class Numeric extends RangeFacet {

public Numeric(String name, Number start, Number end, Number gap) {
  super(name, start, end, gap);
}

  public String getAsFilterQuery(){
  Double end = this.start.doubleValue() + this.gap.doubleValue() -
1;
  return this.name + ":[" + this.start + " TO " + end + "]";
  }


  }


and for dates (there's a parse exception below which I am not doing anything
with currently)

  public String getAsFilterQuery(){
  RangeFacet.Date dateCount =
(RangeFacet.Date)count.getRangeFacet();

  DateMathParser parser = new DateMathParser(TimeZone.getDefault(),
Locale.getDefault());
  SimpleDateFormat sdf = new
SimpleDateFormat("-MM-dd'T'HH:mm:ss");

  parser.setNow(dateCount.getStart());
  Date end = parser.parseMath(dateCount.getGap());
  String startStr = sdf.format(dateCount.getStart()) + "Z";
  String endStr = sdf.format(end) + "Z";
  String label = startStr + " TO " + endStr;
  return facetField.getName() + ":[" + label + "]";
  }


On Fri, Jun 3, 2011 at 7:05 AM, Martijn v Groningen <
martijn.is.h...@gmail.com> wrote:

> Hi Jamie,
>
> I don't know why range facets didn't make it into SolrJ. But I've recently
> opened an issue for this:
> https://issues.apache.org/jira/browse/SOLR-2523
>
> I hope this will be committed soon. Check the patch out and see if you like
> it.
>
> Martijn
>
> On 2 June 2011 18:22, Jamie Johnson  wrote:
>
> > Currently the range and date faceting in SolrJ acts a bit differently
> than
> > I
> > would expect.  Specifically, range facets aren't parsed at all and date
> > facets end up generating filterQueries which don't have the range, just
> the
> > lower bound.  Is there a reason why SolrJ doesn't support these?  I have
> > written some things on my end to handle these and generate filterQueries
> > for
> > date ranges of the form dateTime:[start TO end] and I have a function
> > (which
> > I copied from the date faceting) which parses the range facets, but would
> > prefer not to have to maintain these myself.  Is there a plan to
> implement
> > these?  Also is there a plan to update FacetField to not have end be a
> > date,
> > perhaps making it a String like start so we can support date and range
> > queries?
> >
>
>
>
> --
> Met vriendelijke groet,
>
> Martijn van Groningen
>


Re: TIKA INTEGRATION PERFORMANCE

2011-06-06 Thread Naveen Gupta
Hi Tomas,

1. Regarding SolrInputDocument,

We are not using the Java client; rather we are using the PHP Solr client.
As for wrapping content in a SolrInputDocument, I am not sure how to do that
from the PHP client. In that case we would need the Tika-related jars to get
at the metadata such as content, and we certainly don't want to handle all
of that in the PHP client.

Secondly, what I was asking about commit strategy:

Suppose you have 100 docs. We iterate over the first 99 docs and fire curl
without commit in the URL, and for the 100th doc we use commit. Doing so,
will it also update the indexes for the first 99 docs? Roughly:

for ($i = 1; $i <= 100; $i++) {
    // commit only on the last document (URL and ids are illustrative)
    $commit = ($i == 100) ? 'true' : 'false';
    $url = "http://localhost:8010/solr/update/extract?literal.id=doc$i&commit=$commit";
    // ... fire curl against $url as in the snippet quoted below
}

I wanted to achieve something similar to an optimize.

Why aren't these kinds of general-purpose use cases included in the
examples (especially for other languages; Java folks can easily do this
using the API)?

I am basically a Java guy, so I can feel the problem.

Thanks
Naveen
2011/6/6 Tomás Fernández Löbbe 

> 1. About the commit strategy, all the ExtractingRequestHandler (request
> handler that uses Tika to extract content from the input file) will do is
> extract the content of your file and add it to a SolrInputDocument. The
> commit strategy should not change because of this, compared to other
> documents you might be indexing. It is usually not recommended to commit on
> every new / updated document.
>
> 2. Don't know if I understand the question. you can add all the static
> fields you want to the document by adding the "literal." prefix to the name
> of the fields when using ExtractingRequestHandler (as you are doing with "
> literal.id"). You can also leave empty fields if they are not marked as
> "required" at the schema.xml file. See:
> http://wiki.apache.org/solr/ExtractingRequestHandler#Literals
>
> 3. Solr cores can work almost as completely different Solr instances. You
> could tell one core to replicate from another core. I don't think this
> would
> be of any help here. If you want to separate the indexing operations from
> the query operations, you could probably use different machines, that's
> usually a better option. Configure the indexing box as master and the query
> box as slave. Here you have some more information about it:
> http://wiki.apache.org/solr/SolrReplication
>
> Were these the answers you were looking for, or did I misunderstand your
> questions?
>
> Tomás
>
> On Mon, Jun 6, 2011 at 2:54 AM, Naveen Gupta  wrote:
>
> > Hi
> >
> > Since it is php, we are using solphp for calling curl based call,
> >
> > what my concern here is that for each user, we might be having 20-40
> > attachments needed to be indexed each day, and there are various users
> > ..daily we are targeting around 500-1000 users ..
> >
> > right now if you see, we
> >
> > <?php
> > $ch = curl_init('
> > http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true');
> >  curl_setopt ($ch, CURLOPT_POST, 1);
> >  curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>"@paper.pdf"));
> >  $result= curl_exec ($ch);
> > ?>
> >
> > also we are planning to use other fields which are to be indexed and
> stored
> > ...
> >
> >
> > There are couple of questions here
> >
> > 1. what would be the best strategies for commit. if we take all the
> > documents in an array and iterating one by one and fire the curl and for
> > the
> > last doc, if we commit, will it work or for each doc, we need to commit?
> >
> > 2. we are having several fields which are already defined in schema and
> few
> > of the them are required earlier, but for this purpose, we don't want,
> how
> > to have two requirement together in the same schema?
> >
> > 3. since it is frequent commit, how to use solr multicore for write and
> > read
> > operations separately ?
> >
> > Thanks
> > Naveen
> >
>


Re: How to get default result?

2011-06-06 Thread Tomás Fernández Löbbe
Hi Richard, are you setting the value to 0 at index time when the
housenumber is not present? If you are, this would be as simple as modifying
the query at the application layer to city = a, street = b,
housenumber = (14 OR 0).
If you are not doing anything at index time with the missing housenumbers,
you could do something like city:a AND street:b AND
(housenumber:14 OR NOT housenumber:[* TO *]).

First option is better if you ask me. You can set the default value on your
schema. See http://wiki.apache.org/solr/SchemaXml#Fields
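
For example, in schema.xml (a sketch; the field type name is illustrative):

<field name="housenumber" type="tint" indexed="true" stored="true" default="0"/>

With that in place, documents indexed without a housenumber automatically
get 0, so the application only needs to add "OR housenumber:0" to the query.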

On Mon, Jun 6, 2011 at 1:14 PM, richardr  wrote:

> Dear list,
>
> I've got a question regarding my address search:
> I am searching for address data. If there is one address field not defined
> (in this case the housenumber) for the specific query (e.g. city = a,
> street
> = b, housenumber=14), I am getting no result. For every street there exists
> at least one housenumber (=0).
>
> Is it possible to get this default value (housenumber 0) as a result, if
> the
> user is searching for the housenumber 14, which does not exist in our
> model?
>
> Thanks in advance,
> Richard
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-get-default-result-tp3030665p3030665.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Default query parser operator

2011-06-06 Thread Brian Lamb
Hi all,

Is it possible to change the query parser operator for a specific field
without having to explicitly type it in the search field?

For example, I'd like to use:

http://localhost:8983/solr/search/?q=field1:word token field2:parser syntax


instead of

http://localhost:8983/solr/search/?q=field1:word AND token field2:parser
syntax

But, I only want it to be applied to field1, not field2 and I want the
operator to always be AND unless the user explicitly types in OR.

Thanks,

Brian Lamb


How to get default result?

2011-06-06 Thread richardr
Dear list,

I've got a question regarding my address search:
I am searching for address data. If there is one address field not defined
(in this case the housenumber) for the specific query (e.g. city = a, street
= b, housenumber=14), I am getting no result. For every street there exists
at least one housenumber (=0).

Is it possible to get this default value (housenumber 0) as a result, if the
user is searching for the housenumber 14, which does not exist in our model?

Thanks in advance,
Richard

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-get-default-result-tp3030665p3030665.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr performance tuning - disk i/o?

2011-06-06 Thread Erick Erickson
Polling interval was in reference to slaves in a multi-machine
master/slave setup, so it's probably not a concern just at present.

A warmup time of 0 is not particularly normal; I'm not quite sure what's
going on there, but you may want to look at the firstSearcher, newSearcher
and autowarm parameters in solrconfig.xml.
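
Something like this in solrconfig.xml is the usual starting point (a
sketch; the cache sizes and the warming query are illustrative):

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">authorStr</str>
    </lst>
  </arr>
</listener>

autowarmCount pre-populates the new caches from the old searcher, and the
listener queries warm the field caches before real traffic hits the new
searcher.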

Best
Erick

On Mon, Jun 6, 2011 at 9:08 AM, Demian Katz  wrote:
> Thanks once again for the helpful suggestions!
>
> Regarding the selection of facet fields, I think publishDate (which is 
> actually just a year) and callnumber-first (which is actually a very broad, 
> high-level category) are okay.  authorStr is an interesting problem: it's 
> definitely a useful facet (when a user searches for an author, odds are good 
> that they want the one who published the most books... i.e. a search for 
> dickens will probably show Charles Dickens at the top of the facet list), but 
> it has a long tail since there are many minor authors who have only published 
> one or two books...  Is there a possibility that the facet.mincount parameter 
> could be helpful here, or does that have no impact on performance/memory 
> footprint?
>
> Regarding polling interval for slaves, are you referring to a distributed 
> Solr environment, or is this something to do with Solr's internals?  We're 
> currently a single-server environment, so I don't think I have to worry if 
> it's related to a multi-server setup...  but if it's something internal, 
> could you point me to the right area of the admin panel to check my stats?  
> I'm not seeing anything about polling on the statistics page.  It's also a 
> little strange that all of my warmupTime stats on searchers and caches are 
> showing as 0 -- is that normal?
>
> thanks,
> Demian
>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Friday, June 03, 2011 4:45 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr performance tuning - disk i/o?
>>
>> Quick impressions:
>>
>> The faceting is usually best done on fields that don't have lots of
>> unique
>> values for three reasons:
>> 1> It's questionable how much use to the user to have a gazillion
>> facets.
>>      In the case of a unique field per document, in fact, it's useless.
>> 2> resource requirements go up as a function of the number of unique
>>      terms. This is true for faceting and sorting.
>> 3> warmup times grow the more terms have to be read into memory.
>>
>>
>> Glancing at your warmup stuff, things like publishDate, authorStr and
>> maybe
>> callnumber-first are questionable. publishDate depends on how coarse
>> the
>> resolution is. If it's by day, that's not really much use. authorStr..
>> How many
>> authors have more than one publication? Would this be better served by
>> some
>> kind of autosuggest rather than facets? callnumber-first... I don't
>> really know, but
>> if it's unique per document it's probably not something the user would
>> find useful
>> as a facet.
>>
>> The admin page will help you determine the number of unique terms per
>> field,
>> which may guide you whether or not to continue to facet on these
>> fields.
>>
>> As Otis said, doing a sort on the fields during warmup will also help.
>>
>> Watch your polling interval for any slaves in relation to the warmup
>> times.
>> If your polling interval is shorter than the warmup times, you run a
>> risk of
>> "runaway warmups".
>>
>> As you've figured out, measuring responses to the first few queries
>> doesn't
>> always measure what you really need ..
>>
>> I don't have the pages handy, but autowarming is a good topic to
>> understand,
>> so you might spend some time tracking it down.
>>
>> Best
>> Erick
>>
>> On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz
>>  wrote:
>> > Thanks to you and Otis for the suggestions!  Some more information:
>> >
>> > - Based on the Solr stats page, my caches seem to be working pretty
>> well (few or no evictions, hit rates in the 75-80% range).
>> > - VuFind is actually doing two Solr queries per search (one initial
>> search followed by a supplemental spell check search -- I believe this
>> is necessary because VuFind has two separate spelling indexes, one for
>> shingled terms and one for single words).  That is probably
>> exaggerating the problem, though based on searches with debugQuery on,
>> it looks like it's always the initial search (rather than the
>> supplemental spelling search) that's consuming the bulk of the time.
>> > - enableLazyFieldLoading is set to true.
>> > - I'm retrieving 20 documents per page.
>> > - My JVM settings: -server -
>> Xloggc:/usr/local/vufind/solr/jetty/logs/gc.log -Xms4096m -Xmx4096m -
>> XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=5
>> >
>> > It appears that a large portion of my problem had to do with
>> autowarming, a topic that I've never had a strong grasp on, though
>> perhaps I'm finally learning (any recommended primer links would be
>> welcome!).  I did have some autowarming settings in solrconfig.xml (

Re: Solr Indexing Patterns

2011-06-06 Thread Judioo
On 5 June 2011 14:42, Erick Erickson  wrote:

> See: http://wiki.apache.org/solr/SchemaXml
>
> By adding ' "multiValued="true" ' to the field, you can add
> the same field multiple times in a doc, something like
>
> <add>
>   <doc>
>     <field name="discounts">value1</field>
>     <field name="discounts">value2</field>
>   </doc>
> </add>
>
I can't see how that would work, as one would need to associate the right
start / end dates and price.
As I understand it, using multiValued fields and thus flattening the
discounts would result in:

{
  "name":"The Book",
  "price":"$9.99",
  "price":"$3.00",
  "price":"$4.00",
  "synopsis":"thanksgiving special",
  "synopsis":"Canadian thanksgiving special",
  "starts":"11-24-2011",
  "starts":"10-10-2011",
  "ends":"11-25-2011",
  "ends":"10-11-2011",
},

How does one differentiate the different offers?



> But there's no real ability  in Solr to store "sub documents",
> so you'd have to get creative in how you encoded the discounts...
>

This is what I'm asking :)
What is the best / recommended / known patterns for doing this?



>
> But I suspect a better approach would be to store each discount as
> a separate document. If you're in the trunk version, you could then
> group results by, say, ISBN and get responses grouped together...
>

This is an option but seems suboptimal. So say I store the discounts in
multiple documents with ISBN as an attribute and also store the title again
with ISBN as an attribute.

To get
"all books currently discounted"

requires 2 requests:

* get all discounts currently active
* get all books using the ISBNs retrieved from the above search

Not that bad. However, what happens when I want
"all books that are currently on discount in the 'horror' genre containing
the word 'elm' in the title"?

The only way I can see to cater for the above search is to duplicate all
searchable fields from my "book" document in my "discount" document. Coming
from an RDBMS background this seems wrong.

Is this the correct approach to take?



>
> Best
> Erick
>
> On Sat, Jun 4, 2011 at 1:42 AM, Judioo  wrote:
> > Hi,
> > Discounts can change daily. Also there can be a lot of them (over time
> and
> > in a given time period ).
> >
> > Could you give an example of what you mean buy multi-valuing the field.
> >
> > Thanks
> >
> > On 3 June 2011 14:29, Erick Erickson  wrote:
> >
> >> How often are the discounts changed? Because you can simply
> >> re-index the book information with a multiValued "discounts" field
> >> and get something similar to your example (&wt=json)
> >>
> >>
> >> Best
> >> Erick
> >>
> >> On Fri, Jun 3, 2011 at 8:38 AM, Judioo  wrote:
> >> > What is the "best practice" method to index the following in Solr:
> >> >
> >> > I'm attempting to use solr for a book store site.
> >> >
> >> > Each book will have a price but on occasions this will be discounted.
> The
> >> > discounted price exists for a defined time period but there may be
> many
> >> > discount periods. Each discount will have a brief synopsis, start and
> end
> >> > time.
> >> >
> >> > A subset of the desired output would be as follows:
> >> >
> >> > ...
> >> > "response":{"numFound":1,"start":0,"docs":[
> >> >  {
> >> >"name":"The Book",
> >> >"price":"$9.99",
> >> >"discounts":[
> >> >{
> >> > "price":"$3.00",
> >> > "synopsis":"thanksgiving special",
> >> > "starts":"11-24-2011",
> >> > "ends":"11-25-2011",
> >> >},
> >> >{
> >> > "price":"$4.00",
> >> > "synopsis":"Canadian thanksgiving special",
> >> > "starts":"10-10-2011",
> >> > "ends":"10-11-2011",
> >> >},
> >> > ]
> >> >  },
> >> >  .
> >> >
> >> > A requirement is to be able to search for just discounted
> publications. I
> >> > think I could use date faceting for this ( return publications that
> are
> >> > within a discount window ). When a discount search is performed no
> >> > publications that are not currently discounted will be returned.
> >> >
> >> > My question are:
> >> >
> >> >   - Does solr support this type of sub documents
> >> >
> >> > In the above example the discounts are the sub documents. I know solr
> is
> >> not
> >> > a relational DB but I would like to store and index the above
> >> representation
> >> > in a single document if possible.
> >> >
> >> >   - what is the best method to approach the above
> >> >
> >> > I can see in many examples the authors tend to denormalize to solve
> >> similar
> >> > problems. This suggest that for each discount I am required to
> duplicate
> >> the
> >> > book data or form a document
> >> > association<
> http://stackoverflow.com/questions/2689399/solr-associations
> >> >.
> >> > Which method would you advise?
> >> >
> >> > It would be nice if solr could return a response structured as above.
> >> >
> >> > Much Thanks
> >> >
> >>
> >
>


Re: Search with Synonyms in two fields

2011-06-06 Thread Jonathan Rochkind

On 6/5/2011 3:36 AM, occurred wrote:

Ok, thx for the answer.

My idea now is to store both field-values in one field and pre- and
suffix the values from field2 with something very special.
Also then the synonyms have to have the special pre- and suffixes.


What are you actually trying to do?

Usually, what people would do is just store both original values and 
synonym expansion in one field, the end, no need to use custom suffixes. 
Then you could have a _second_ field with only the original values 
without synonym expansion, if you sometimes need to search without 
synonym expansion.


You want to search over both original values and expanded synonyms, you 
search over the field that does that. You want to, in another search, 
search only over original values without synonym expansion, you search 
over the field without synonyms expanded in it.


That's usually the sort of thing people do. "de-normalization" in Solr 
is not something to be avoided, it's instead a general pattern.




Master Slave help

2011-06-06 Thread Rohit Gupta
Hi,

I have configured my master and slave servers and everything seems to be
running fine; the replication completed the first time it ran. But every
time I go to the replication link in the admin panel after a server restart
or server startup, I notice the replication starting from scratch, or at
least the stats show that.

What could be wrong?

Thanks,
Rohit

Need query help

2011-06-06 Thread Denis Kuzmenok
For now i have a collection with:
id (int)
price (double) multivalue
brand_id (int)
filters (string) multivalue

I need to get the available brand_id, filters, and price values, and the
list of ids, for the current query. For example, now I'm doing queries with
facet.field=brand_id/filters/price:
1) to get the current id list: (brand_id:100 OR brand_id:150) AND (filters:p1s100
OR filters:p4s20)
2) to get the available filters for the selected properties (same properties
but other values): (brand_id:100 OR brand_id:150) AND (filters:p1s* OR
filters:p4s*)
3) to get the available brand_id values (if any are selected; if none, take
them from the 1st query's results): (filters:p1s100 OR filters:p4s20)
4) another request to get the available prices, if any are selected

Is there any way to simplify this task?
Data needed:
1) Ids for the selected filters, price, brand_id
2) Available filters, price, brand_id given the selected values
3) Other values for the selected properties (if any are chosen)
4) Other brand_id values for the selected brand_id
5) Other prices for the selected price

Will appreciate any help or thoughts!

Cheers,
Denis Kuzmenok



Auto-scaling solr setup

2011-06-06 Thread Akshay
So I am trying to set up an auto-scaling search system of EC2 Solr slaves
which scales up as the number of requests increases, and vice versa.
Here is what I have:
1. A Solr master and underlying slaves (scalable), and an elastic load
balancer to distribute the load.
2. The EC2 auto-scaling setup fires up nodes when traffic increases. However,
the replication time (replication speed) for the index from the master
varies for these newly fired nodes.
3. I want to avoid adding these nodes to the load balancer until each has
completed its initial replication and has a warmed-up cache.
For this I need to know a way to check whether the initial replication has
completed, and also a way of warming up the cache after that.

I can think of doing this via a shell script/awk (checking times
replicated / index size)... is there a cleaner way?
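
Concretely, the best I have so far is polling the ReplicationHandler's
indexversion command on the master and the new slave and comparing them (a
rough sketch; host names are illustrative and the JSON parsing is
deliberately crude):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplicationCheck {
    // asks a Solr instance's /replication handler for its index version
    static long indexVersion(String solrBase) throws Exception {
        URL url = new URL(solrBase + "/replication?command=indexversion&wt=json");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        StringBuilder body = new StringBuilder();
        for (String line; (line = in.readLine()) != null; ) {
            body.append(line);
        }
        in.close();
        Matcher m = Pattern.compile("\"indexversion\":(\\d+)").matcher(body);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) throws Exception {
        long master = indexVersion("http://master:8983/solr");
        long slave = indexVersion("http://new-slave:8983/solr");
        // only register the slave with the load balancer once versions match
        System.out.println(master == slave ? "in sync" : "still replicating");
    }
}

Once the versions match, firing a handful of representative queries at the
new node before registering it should take care of warming the caches.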

Also, on a side note, any suggestions or pointers on how others have set up
their scalable Solr setups in the cloud (AWS mainly) would be helpful.

Regards,
Akshay


RE: Solr performance tuning - disk i/o?

2011-06-06 Thread Demian Katz
Thanks once again for the helpful suggestions!

Regarding the selection of facet fields, I think publishDate (which is actually 
just a year) and callnumber-first (which is actually a very broad, high-level 
category) are okay.  authorStr is an interesting problem: it's definitely a 
useful facet (when a user searches for an author, odds are good that they want 
the one who published the most books... i.e. a search for dickens will probably 
show Charles Dickens at the top of the facet list), but it has a long tail 
since there are many minor authors who have only published one or two books...  
Is there a possibility that the facet.mincount parameter could be helpful here, 
or does that have no impact on performance/memory footprint?

Regarding polling interval for slaves, are you referring to a distributed Solr 
environment, or is this something to do with Solr's internals?  We're currently 
a single-server environment, so I don't think I have to worry if it's related 
to a multi-server setup...  but if it's something internal, could you point me 
to the right area of the admin panel to check my stats?  I'm not seeing 
anything about polling on the statistics page.  It's also a little strange that 
all of my warmupTime stats on searchers and caches are showing as 0 -- is that 
normal?

thanks,
Demian

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, June 03, 2011 4:45 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr performance tuning - disk i/o?
> 
> Quick impressions:
> 
> The faceting is usually best done on fields that don't have lots of
> unique
> values for three reasons:
> 1> It's questionable how much use to the user to have a gazillion
> facets.
>  In the case of a unique field per document, in fact, it's useless.
> 2> resource requirements go up as a function of the number of unique
>  terms. This is true for faceting and sorting.
> 3> warmup times grow the more terms have to be read into memory.
> 
> 
> Glancing at your warmup stuff, things like publishDate, authorStr and
> maybe
> callnumber-first are questionable. publishDate depends on how coarse
> the
> resolution is. If it's by day, that's not really much use. authorStr..
> How many
> authors have more than one publication? Would this be better served by
> some
> kind of autosuggest rather than facets? callnumber-first... I don't
> really know, but
> if it's unique per document it's probably not something the user would
> find useful
> as a facet.
> 
> The admin page will help you determine the number of unique terms per
> field,
> which may guide you whether or not to continue to facet on these
> fields.
> 
> As Otis said, doing a sort on the fields during warmup will also help.
> 
> Watch your polling interval for any slaves in relation to the warmup
> times.
> If your polling interval is shorter than the warmup times, you run a
> risk of
> "runaway warmups".
> 
> As you've figured out, measuring responses to the first few queries
> doesn't
> always measure what you really need ..
> 
> I don't have the pages handy, but autowarming is a good topic to
> understand,
> so you might spend some time tracking it down.
> 
> Best
> Erick
> 
> On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz
>  wrote:
> > Thanks to you and Otis for the suggestions!  Some more information:
> >
> > - Based on the Solr stats page, my caches seem to be working pretty
> well (few or no evictions, hit rates in the 75-80% range).
> > - VuFind is actually doing two Solr queries per search (one initial
> search followed by a supplemental spell check search -- I believe this
> is necessary because VuFind has two separate spelling indexes, one for
> shingled terms and one for single words).  That is probably
> exaggerating the problem, though based on searches with debugQuery on,
> it looks like it's always the initial search (rather than the
> supplemental spelling search) that's consuming the bulk of the time.
> > - enableLazyFieldLoading is set to true.
> > - I'm retrieving 20 documents per page.
> > - My JVM settings: -server -
> Xloggc:/usr/local/vufind/solr/jetty/logs/gc.log -Xms4096m -Xmx4096m -
> XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=5
> >
> > It appears that a large portion of my problem had to do with
> autowarming, a topic that I've never had a strong grasp on, though
> perhaps I'm finally learning (any recommended primer links would be
> welcome!).  I did have some autowarming settings in solrconfig.xml (an
> arbitrary search for a bunch of random keywords in the newSearcher and
> firstSearcher events, plus autowarmCount settings on all of my caches).
>  However, when I looked at the debugQuery output, I noticed that a huge
> amount of time was being wasted loading facets on the first search
> after restarting Solr, so I changed my newSearcher and firstSearcher
> events to this:
> >
> >      
> >        
> >          *:*
> >          0
> >          10
> >          true
> >          1

problem: zooKeeper Integration with solr

2011-06-06 Thread Mohammad Shariq
Hi folks,
I am using Solr to index around 100mn docs.
Now I am planning to move to a cluster-based Solr setup, so that I can scale
the indexing and searching process.
Since SolrCloud is still in the development stage, I am trying to index in a
shard-based environment using ZooKeeper.

I followed the steps from
http://wiki.apache.org/solr/ZooKeeperIntegration but still I am not
able to do distributed search.
Once I index the docs in one shard, I am not able to query them from the
other shard, and vice versa (using the query
http://localhost:8180/solr/select/?q=itunes&version=2.2&start=0&rows=10&indent=on
)

I am running Solr 3.1 on Ubuntu 10.10.

Please help me.


-- 
Thanks and Regards
Mohammad Shariq


Re: Applying synonyms increase the data size from MB to GBs

2011-06-06 Thread Erick Erickson
Have you considered query-time expansion rather than index-time expansion?
In general this will lead to more complex queries, but smaller indexes.

Take a look at the analysis page available from the admin page to see exactly
what happens.

What is the high-level problem you're trying to solve? Having this huge an
expansion in index size is pretty unusual, and I'm wondering if there might be
another approach to the problem...
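
For reference, query-time expansion just means keeping the
SynonymFilterFactory only in the query analyzer in schema.xml (a sketch;
the tokenizer choice is illustrative, the synonym files are yours):

<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory"
          synonyms="BODYTaxonomy.txt,ClinicalObs.txt,MicTaxo.txt,SPTaxo.txt"
          ignoreCase="true" expand="true"/>
</analyzer>

The index stays small because only the original tokens are indexed; the
trade-off is that each query expands into more terms at search time.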

Best
Erick

On Mon, Jun 6, 2011 at 6:19 AM, Ahmet Arslan  wrote:
>> Is there a way where in I can apply all those file to same
>> tag with some
>> delimiter separated?
>>
>> like this:
>>         > class="solr.SynonymFilterFactory"
>> synonyms="BODYTaxonomy.txt
>> , ClinicalObs.txt, MicTaxo.txt, SPTaxo.txt"
>> ignoreCase="true"
>> expand="true"/>
>
>
> Yes, you can feed multiple text files, separated by commas, to the
> synonyms parameter.
>
> synonyms="BODYTaxonomy.txt,ClinicalObs.txt,MicTaxo.txt,SPTaxo.txt"
>


Re: java.io.IOException: The specified network name is no longer available

2011-06-06 Thread Erick Erickson
Yep, but note the discussion. It's not at all clear that Solr is the
place to deal with an
unreliable network, and it sounds like that's the root of your issue.

It doesn't look like anyone's hot to change Solr's behavior here, and
it's arguable that
Solr isn't the place to compensate for an unreliable share, but that's
debatable. Do you
have the energy to propose a patch?

Best
Erick


On Mon, Jun 6, 2011 at 1:02 AM, Gaurav Shingala
 wrote:
>
> Hi,
>
> Yes, you are right, I have a remote file system. I have also checked and
> confirmed that there was no issue in the network.
> One more thing I need to include here: I found the same bug filed as
> SOLR-2235 on the ASF JIRA.
>
>
> Thanks,
> Gaurav
>
>> Date: Fri, 3 Jun 2011 09:13:00 -0400
>> Subject: Re: java.io.IOException: The specified network name is no longer 
>> available
>> From: erickerick...@gmail.com
>> To: solr-user@lucene.apache.org
>>
>> You've got to tell us more about your setup. We can only guess that you're
>> on a remote file system and there's a problem there, which would be a
>> network problem outside of Solr's purview
>>
>> You might want to review:
>> http://wiki.apache.org/solr/UsingMailingLists
>>
>> Best
>> Erick
>>
>> On Fri, Jun 3, 2011 at 1:52 AM, Gaurav Shingala
>>  wrote:
>> >
>> > Hi,
>> >
>> > I am using solr 1.4.1 and at the time of updating index getting following 
>> > error:
>> >
>> > 2011-06-03 05:54:06,943 ERROR [org.apache.solr.core.SolrCore] 
>> > (http-10.38.33.146-8080-4) java.io.IOException: The specified network name 
>> > is no longer available
>> >    at java.io.RandomAccessFile.readBytes(Native Method)
>> >    at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
>> >    at 
>> > org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(SimpleFSDirectory.java:132)
>> >    at 
>> > org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
>> >    at 
>> > org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>> >    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78)
>> >    at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:64)
>> >    at 
>> > org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:129)
>> >    at 
>> > org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:160)
>> >    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
>> >    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179)
>> >    at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57)
>> >    at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:1103)
>> >    at 
>> > org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:981)
>> >    at 
>> > org.apache.solr.search.SolrIndexReader.termDocs(SolrIndexReader.java:320)
>> >    at 
>> > org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:640)
>> >    at 
>> > org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
>> >    at 
>> > org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:581)
>> >    at 
>> > org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:903)
>> >    at 
>> > org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
>> >    at 
>> > org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
>> >    at 
>> > org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
>> >    at 
>> > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
>> >    at 
>> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>> >    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>> >    at 
>> > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>> >    at 
>> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>> >    at 
>> > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:274)
>> >    at 
>> > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:242)
>> >    at 
>> > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
>> >    at 
>> > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>> >    at 
>> > org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:181)
>> >    at 
>> > org.jboss.modcluster.catalina.CatalinaContext$RequestListenerValve.event(CatalinaContext.java:285)
>> >    at 
>> > org.jboss.modcluster.catalina.CatalinaContext$RequestListenerValve.invoke(CatalinaContext.java:261)
>> >    at 
>> > org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:88)
>> >    at 
>> > org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:100)
>> >    at 
>> > or

Re: TIKA INTEGRATION PERFORMANCE

2011-06-06 Thread Tomás Fernández Löbbe
1. About the commit strategy, all the ExtractingRequestHandler (request
handler that uses Tika to extract content from the input file) will do is
extract the content of your file and add it to a SolrInputDocument. The
commit strategy should not change because of this, compared to other
documents you might be indexing. It is usually not recommended to commit on
every new / updated document (see the example after point 3 below).

2. Don't know if I understand the question. you can add all the static
fields you want to the document by adding the "literal." prefix to the name
of the fields when using ExtractingRequestHandler (as you are doing with "
literal.id"). You can also leave empty fields if they are not marked as
"required" at the schema.xml file. See:
http://wiki.apache.org/solr/ExtractingRequestHandler#Literals

3. Solr cores can work almost as completely different Solr instances. You
could tell one core to replicate from another core. I don't think this would
be of any help here. If you want to separate the indexing operations from
the query operations, you could probably use different machines, that's
usually a better option. Configure the indexing box as master and the query
box as slave. Here you have some more information about it:
http://wiki.apache.org/solr/SolrReplication
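
Putting 1 and 2 together: you can POST each file to
/solr/update/extract?literal.id=docN (plus any other literal.* fields you
need; names here are illustrative) without a commit parameter, and then
issue one request such as http://localhost:8010/solr/update?commit=true at
the end. That single commit makes all of the previously added documents
searchable.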

Were these the answers you were looking for, or did I misunderstand your
questions?

Tomás

On Mon, Jun 6, 2011 at 2:54 AM, Naveen Gupta  wrote:

> Hi
>
> Since it is php, we are using solphp for calling curl based call,
>
> what my concern here is that for each user, we might be having 20-40
> attachments needed to be indexed each day, and there are various users
> ..daily we are targeting around 500-1000 users ..
>
> right now if you see, we
>
> <?php
>  $ch = curl_init('
> http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true');
>  curl_setopt ($ch, CURLOPT_POST, 1);
>  curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>"@paper.pdf"));
>  $result= curl_exec ($ch);
> ?>
>
> also we are planning to use other fields which are to be indexed and stored
> ...
>
>
> There are couple of questions here
>
> 1. what would be the best strategies for commit. if we take all the
> documents in an array and iterating one by one and fire the curl and for
> the
> last doc, if we commit, will it work or for each doc, we need to commit?
>
> 2. we are having several fields which are already defined in schema and few
> of the them are required earlier, but for this purpose, we don't want, how
> to have two requirement together in the same schema?
>
> 3. since it is frequent commit, how to use solr multicore for write and
> read
> operations separately ?
>
> Thanks
> Naveen
>


Re: synonyms problem

2011-06-06 Thread Erick Erickson
What does "call synonym methods in Java" mean? That is, what are
you trying to accomplish and from where?

Best
Erick

On Sun, Jun 5, 2011 at 9:48 PM, deniz  wrote:
> well i have changed it into text... but still confused about how to use
> synonyms...
>
> and also I want to know how to call synonym methods in java... i have tried
> to use synonymmap and some other similar things but nothing happens...
> anyone can give me a sample or a website that i can find examples about solr
> in java?
>
> -
> Zeki ama calismiyor... Calissa yapar...
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/synonyms-problem-tp3014006p3028353.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Expunging deletes from a very large index

2011-06-06 Thread Michael McCandless
You can drop your mergeFactor to 2 and then run expungeDeletes?

This will make the operation take longer but (assuming you have > 3
segments in your index) should use less transient disk space.

You could also make a custom merge policy, that expunges one segment
at a time (even slower but even less transient disk space required).

optimize(maxNumSegments) may also help, though it's not guaranteed to
reclaim disk space due to deleted docs.
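
From SolrJ, the maxNumSegments variant is exposed directly (a sketch;
"server" is an existing SolrServer instance and the segment count is
illustrative):

  // partially optimize down to at most 4 segments instead of 1
  server.optimize(true, true, 4);

and expungeDeletes itself can be issued through the update handler as
<commit expungeDeletes="true"/>.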

Mike McCandless

http://blog.mikemccandless.com

On Mon, Jun 6, 2011 at 2:16 AM, Simon Wistow  wrote:
> Due to some emergency maintenance I needed to run delete on a large
> number of documents in a 200Gb index.
>
> The problem is that it's taking an inordinately long amount of time (2+
> hours so far and counting) and is steadily eating up disk space -
> presumably up to 2x index size which is getting awfully close to the
> wire on this machine.
>
> Is that inevitable? Is there any way to speed up the process or use less
> space? Maybe do an optimize with a different number of maxSegments?
>
> I suspect not but I thought it was worth asking.
>
>
>
>
>


Re: Applying synonyms increase the data size from MB to GBs

2011-06-06 Thread Ahmet Arslan
> Is there a way where in I can apply all those file to same
> tag with some
> delimiter separated?
> 
> like this:
>          class="solr.SynonymFilterFactory"
> synonyms="BODYTaxonomy.txt
> , ClinicalObs.txt, MicTaxo.txt, SPTaxo.txt"
> ignoreCase="true"
> expand="true"/>


Yes, you can feed multiple text files, separated by commas, to the synonyms
parameter.

synonyms="BODYTaxonomy.txt,ClinicalObs.txt,MicTaxo.txt,SPTaxo.txt"


Travel Assistance applications now open for ApacheCon NA 2011

2011-06-06 Thread Simon Willnauer
The Apache Software Foundation (ASF)'s Travel Assistance Committee (TAC) is
now accepting applications for ApacheCon North America 2011, 7-11 November
in Vancouver BC, Canada.

The TAC is seeking individuals from the Apache community at-large --users,
developers, educators, students, Committers, and Members-- who would like to
attend ApacheCon, but need some financial support in order to be able to get
there. There are limited places available, and all applicants will be scored
on their individual merit.

Financial assistance is available to cover flights/trains, accommodation and
entrance fees either in part or in full, depending on circumstances.
However, the support available for those attending only the BarCamp (7-8
November) is less than that for those attending the entire event (Conference
+ BarCamp 7-11 November). The Travel Assistance Committee aims to support
all official ASF events, including cross-project activities; as such, it may
be prudent for those in Asia and Europe to wait for an event geographically
closer to them.

More information can be found at http://www.apache.org/travel/index.html
including a link to the online application and detailed instructions for
submitting.

Applications will close on 8 July 2011 at 22:00 BST (UTC/GMT +1).

We wish good luck to all those who will apply, and thank you in advance for
tweeting, blogging, and otherwise spreading the word.

Regards,
The Travel Assistance Committee


Re: Solr Field name restrictions

2011-06-06 Thread Marc SCHNEIDER
Hi,

Using Solr 3.1, I'm getting errors when trying to sort on fields containing
dashes in the name...
So it's true: stay away from dashes if you can.

Marc.

On Sun, Jun 5, 2011 at 3:46 PM, Erick Erickson wrote:

> I'd stay away from dashes too. It's too easy for the query parsers
> to mistake them for the NOT operator on a URL.
>
> You've really got two issues here:
> 1> what is allowable in the field name
> 2> what causes grief with some query parser.
>
> To avoid <2>, I'd really just stick with characters and underscores.
>
> Best
> Erick
>
> 2011/6/4 François Schiettecatte :
> > Underscores and dashes are fine, but I would think that colons (:) are
> verboten.
> >
> > François
> >
> > On Jun 4, 2011, at 9:49 PM, Jamie Johnson wrote:
> >
> >> Is there a list anywhere detailing field name restrictions.  I imagine
> >> fields containing periods (.) are problematic if you try to use that
> field
> >> when doing faceted queries, but are there any others?  Are underscores
> (_)
> >> or dashes (-) ok?
> >
> >
>


Re: Feature: skipping caches and info about cache use

2011-06-06 Thread pravesh
Solr 1.3+ logs only the fresh queries. If you re-run the same query then it
is served from the cache and not printed in the logs (unless the cache(s)
are not warmed or the searcher is reopened).

So, Otis's proposal would definitely help in doing some benchmarks &
baselining the search :)

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Feature-skipping-caches-and-info-about-cache-use-tp3020325p3028894.html
Sent from the Solr - User mailing list archive at Nabble.com.