multi cores vs filter queries for a multi tenant deployment

2010-10-11 Thread Tharindu Mathew
Hi everyone,

I'm looking into a deployment which will support multi-tenancy.
This means that there will be 1000s of tenant domains, each having 1000s of
users. I need to figure out which approach is better for this deployment
when using the Solr server.

Approach #1 - Use multi cores for each tenant and thereby use separate
indexes for each. If necessary use filter queries with user ids for users.
Approach #2 - Use filter queries with tenant ids to filter out results of
different tenant domains. Similarly, as above, use user ids as needed.
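
(For illustration only, approach #2 boils down to adding tenant and user
filters to every query; the field names here are placeholders, not from any
actual schema:)

  http://localhost:8983/solr/select?q=report&fq=tenant_id:acme&fq=user_id:1042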

My concern comes on aspects of performance and security.

Will using approach #1 be a killer for performance? With this many users, the
setup has to scale smoothly. When the deployment potentially has 1000s of
cores, how can I prevent a security vulnerability appearing between cores?

What are the implications of using approach #2? Will I have to constantly
audit the code for security checks, since only a single index is used?

Any feedback for the above concerns would be really appreciated.

Thanks in advance.

-- 
Regards,

Tharindu


Re: facet.method: enum vs. fc

2010-10-11 Thread Paolo Castagna

Thank you Erick, your explanation was helpful.
I'll stick with fc and come back to this later if I need further tuning.

Paolo

Erick Erickson wrote:

Yep, that was probably the best choice

It's a classic time/space tradeoff. The enum method creates a bitset for
#each#
unique facet value. The bit set is (maxdocs / 8) bytes in size (I'm ignoring
some overhead here). So if your facet field has 10 unique values, and 8M
documents,
you'll use up 10M bytes or so. 20 unique values will use up 20M bytes and so
on. But
this is very, very fast.

fc on the other hand, eats up cache for storing the string value for each
unique value,
plus various counter arrays (several bytes/doc). For most cases, it will use
less memory
than enum, but will be slower.

I'd stick with fc for the time being and think about enum if 1> you have a
good idea of
what the number of unique terms is or 2> you start to need to finely tune
your speed.

HTH
Erick

On Mon, Oct 11, 2010 at 11:30 AM, Paolo Castagna <
castagna.li...@googlemail.com> wrote:


Hi,
I am using Solr v1.4 and I am not sure which facet.method I should use.

What should I use if I do not know in advance if the number of values
for a given field will be high or low?

What are the pros/cons of using facet.method=enum vs. facet.method=fc?

When should I use enum vs. fc?

I have found some comments and suggestions here:

 "enum enumerates all terms in a field, calculating the set intersection
 of documents that match the term with documents that match the query.
 This was the default (and only) method for faceting multi-valued fields
 prior to Solr 1.4.
 "fc (stands for field cache), the facet counts are calculated by
 iterating over documents that match the query and summing the terms
 that appear in each document. This was the default method for single
 valued fields prior to Solr 1.4.
 The default value is fc (except for BoolField) since it tends to use
 less memory and is faster when a field has many unique terms in the
 index."
 -- http://wiki.apache.org/solr/SimpleFacetParameters#facet.method

 "facet.method=enum [...] this is excellent for fields where there is
 a small set of distinct values. The average number of values per
 document does not matter.
 facet.method=fc [...] this is excellent for situations where the
 number of indexed values for the field is high, but the number of
 values per document is low. For multi-valued fields, a hybrid approach
 is used that uses term filters from the filterCache for terms that
 match many documents."
 -- http://wiki.apache.org/solr/SolrFacetingOverview

 "If you are faceting on a field that you know only has a small number
 of values (say less than 50), then it is advisable to explicitly set
 this to enum. When faceting on multiple fields, remember to set this
 for the specific fields desired and not universally for all facets.
 The request handler configuration is a good place to put this."
 -- Book: "Solr 1.4 Enterprise Search Server", pag. 148

This is the part of the Solr code which deals with the facet.method
parameter:

  if (enumMethod) {
    counts = getFacetTermEnumCounts([...]);
  } else {
    if (multiToken) {
      UnInvertedField uif = [...]
      counts = uif.getCounts([...]);
    } else {
      [...]
      if (per_segment) {
        [...]
        counts = ps.getFacetCounts([...]);
      } else {
        counts = getFieldCacheCounts([...]);
      }
    }
  }
 --
https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/java/org/apache/solr/request/SimpleFacets.java

See also:

 -
http://stackoverflow.com/questions/2902680/how-well-does-solr-scale-over-large-number-of-facet-values

At the end, since I do not know in advance the number of different
values for my fields, I went for facet.method=fc. Does this seem
reasonable to you?

Thank you,
Paolo





Re: configuring custom CharStream in solr

2010-10-11 Thread Michael Sokolov

 On 10/11/2010 10:18 PM, Chris Hostetter wrote:

: OK - I found the answer pecking through the source - apparently the name of
: the element to configure a CharFilter is <charFilter> - fancy that :)

there's even an example, right there on the wiki...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories


-Hoss

I am just bathing myself in wizardly astuteness today

thanks

-Mike


Re: LuceneRevolution - NoSQL: A comparison

2010-10-11 Thread Dennis Gearon
It sounds, of course, a lot like transaction isolation using MVCC. It's the 
obvious solution, and has been since the late 1970s.

I hope it won't be too hard to convince people to use it :-) It's been the 
reason for the early success of Oracle.

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Mon, 10/11/10, Yonik Seeley  wrote:

> From: Yonik Seeley 
> Subject: Re: LuceneRevolution - NoSQL: A comparison
> To: solr-user@lucene.apache.org
> Date: Monday, October 11, 2010, 7:20 PM
> On Mon, Oct 11, 2010 at 8:32 PM,
> Peter Keegan 
> wrote:
> > I listened with great interest to Grant's presentation
> of the NoSQL
> > comparisons/alternatives to Solr/Lucene. It sounds
> like the jury is still
> > out on much of this. Here's a use case that might
> favor using a NoSQL
> > alternative for storing 'stored fields' outside of
> Lucene.
> >
> > When Solr does a distributed search across shards, it
> does this in 2 phases
> > (correct me if I'm wrong):
> >
> > 1. 1st query to get the docIds and facet counts
> > 2. 2nd query to retrieve the stored fields of the top
> hits
> >
> > The problem here is that the index could change
> between (1) and (2), so it's
> > not an atomic transaction.
> 
> Yep.
> 
> As I discussed with Peter at Lucene Revolution, if this
> feature is
> important to people, I think the easiest way to solve it
> would be via
> leases.
> 
> During a query, a client could request a lease for a
> certain amount of
> time on whatever index version is used to generate the
> response.  Solr
> would then return the index version to the client along
> with the
> response, and keep the index open for that amount of
> time.  The client
> could make consistent additional requests (such as the 2nd
> phase of a
> distributed request)  by requesting the same version
> of the index.
> 
> -Yonik
>


Re: LuceneRevolution - NoSQL: A comparison

2010-10-11 Thread Dennis Gearon
Well,
 I think that if someone is searching the 'whole of the dataset' to find the
'individual data', then an SQL database outside of Solr makes as much sense.
There's plenty of data in the world, or in most applications, that needs to stay
normalized or at least has benefits to being that way.
Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Mon, 10/11/10, Peter Keegan  wrote:

> From: Peter Keegan 
> Subject: LuceneRevolution - NoSQL: A comparison
> To: solr-user@lucene.apache.org
> Date: Monday, October 11, 2010, 5:32 PM
> I listened with great interest to
> Grant's presentation of the NoSQL
> comparisons/alternatives to Solr/Lucene. It sounds like the
> jury is still
> out on much of this. Here's a use case that might favor
> using a NoSQL
> alternative for storing 'stored fields' outside of Lucene.
> 
> When Solr does a distributed search across shards, it does
> this in 2 phases
> (correct me if I'm wrong):
> 
> 1. 1st query to get the docIds and facet counts
> 2. 2nd query to retrieve the stored fields of the top hits
> 
> The problem here is that the index could change between (1)
> and (2), so it's
> not an atomic transaction. If the stored fields were kept
> outside of Lucene,
> only the first query would be necessary. However, this
> would mean that the
> external NoSQL data store would have to be synchronized
> with the Lucene
> index, which might present its own problems. (I'm just
> throwing this out for
> discussion)
> 
> Peter
>


Re: LuceneRevolution - NoSQL: A comparison

2010-10-11 Thread Yonik Seeley
On Mon, Oct 11, 2010 at 8:32 PM, Peter Keegan  wrote:
> I listened with great interest to Grant's presentation of the NoSQL
> comparisons/alternatives to Solr/Lucene. It sounds like the jury is still
> out on much of this. Here's a use case that might favor using a NoSQL
> alternative for storing 'stored fields' outside of Lucene.
>
> When Solr does a distributed search across shards, it does this in 2 phases
> (correct me if I'm wrong):
>
> 1. 1st query to get the docIds and facet counts
> 2. 2nd query to retrieve the stored fields of the top hits
>
> The problem here is that the index could change between (1) and (2), so it's
> not an atomic transaction.

Yep.

As I discussed with Peter at Lucene Revolution, if this feature is
important to people, I think the easiest way to solve it would be via
leases.

During a query, a client could request a lease for a certain amount of
time on whatever index version is used to generate the response.  Solr
would then return the index version to the client along with the
response, and keep the index open for that amount of time.  The client
could make consistent additional requests (such as the 2nd phase of a
distributed request)  by requesting the same version of the index.

-Yonik


Re: configuring custom CharStream in solr

2010-10-11 Thread Chris Hostetter

: OK - I found the answer pecking through the source - apparently the name of
: the element to configure a CharFilter is <charFilter> - fancy that :)

there's even an example, right there on the wiki...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories
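
(e.g., something along these lines, using one of the stock char filter
factories; the field type name is a placeholder:)

  <fieldType name="text_html" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>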


-Hoss


Re: configuring custom CharStream in solr

2010-10-11 Thread Michael Sokolov

 On 10/11/2010 8:38 PM, Michael Sokolov wrote:

 On 10/11/2010 6:41 PM, Koji Sekiguchi wrote:

(10/10/12 5:57), Michael Sokolov wrote:
I would like to inject my CharStream (or possibly it could be a 
CharFilter;
this is all in flux at the moment) into the analysis chain for a 
field.  Can
I do this in solr using the Analyzer configuration syntax in 
schema.xml, or

would I need to define my own Analyzer?  The solr wiki describes adding
Tokenizers, but doesn't say anything about CharReaders/Filters.

Thanks for any pointers

-Mike


Hi Mike,

You can write your own CharFilterFactory that creates your own
CharStream. Please refer to the existing CharFilterFactories in Solr
to see how you can implement it.

Koji

Koji - thanks for your response.  I think I can see my way clear to 
making a factory class for my stream.  My question was really about 
how to configure the factory.  I see a number of examples of 
tokenizers and analyzers configured in the example schema.xml, but no 
readers.  For example:








configures a specific tokenizer.  If I want to configure my 
CharStream, is there an element for that?  Eg:









I am guessing that I need to create my own analyzer and hard-code the 
reader/tokenizer filter chain in there, but it would be nice if there 
were a syntax like the one I inferred above.


-Mike
OK - I found the answer pecking through the source - apparently the name 
of the element to configure a CharFilter is <charFilter> - fancy that :)


-MIke


Re: configuring custom CharStream in solr

2010-10-11 Thread Michael Sokolov

 On 10/11/2010 6:41 PM, Koji Sekiguchi wrote:

(10/10/12 5:57), Michael Sokolov wrote:
I would like to inject my CharStream (or possibly it could be a 
CharFilter;
this is all in flux at the moment) into the analysis chain for a 
field.  Can
I do this in solr using the Analyzer configuration syntax in 
schema.xml, or

would I need to define my own Analyzer?  The solr wiki describes adding
Tokenizers, but doesn't say anything about CharReaders/Filters.

Thanks for any pointers

-Mike


Hi Mike,

You can write your own CharFilterFactory that creates your own
CharStream. Please refer to the existing CharFilterFactories in Solr
to see how you can implement it.

Koji

Koji - thanks for your response.  I think I can see my way clear to 
making a factory class for my stream.  My question was really about how 
to configure the factory.  I see a number of examples of tokenizers and 
analyzers configured in the example schema.xml, but no readers.  For 
example:








configures a specific tokenizer.  If I want to configure my CharStream, 
is there an element for that?  Eg:









I am guessing that I need to create my own analyzer and hard-code the 
reader/tokenizer filter chain in there, but it would be nice if there 
were a syntax like the one I inferred above.


-Mike


LuceneRevolution - NoSQL: A comparison

2010-10-11 Thread Peter Keegan
I listened with great interest to Grant's presentation of the NoSQL
comparisons/alternatives to Solr/Lucene. It sounds like the jury is still
out on much of this. Here's a use case that might favor using a NoSQL
alternative for storing 'stored fields' outside of Lucene.

When Solr does a distributed search across shards, it does this in 2 phases
(correct me if I'm wrong):

1. 1st query to get the docIds and facet counts
2. 2nd query to retrieve the stored fields of the top hits

The problem here is that the index could change between (1) and (2), so it's
not an atomic transaction. If the stored fields were kept outside of Lucene,
only the first query would be necessary. However, this would mean that the
external NoSQL data store would have to be synchronized with the Lucene
index, which might present its own problems. (I'm just throwing this out for
discussion)
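
(For reference, a distributed request that goes through this two-phase flow is
just an ordinary query with a shards parameter; the host names here are
placeholders:)

  http://host1:8983/solr/select?q=foo&shards=host1:8983/solr,host2:8983/solr&rows=10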

Peter


multicore replication slave

2010-10-11 Thread Christopher Bottaro
Hello,

I can't get my multicore slave to replicate from the master.

The master is set up properly and the following URLs return the expected
"OK" / "No command" status response:
http://solr.mydomain.com:8983/solr/core1/replication
http://solr.mydomain.com:8983/solr/core2/replication
http://solr.mydomain.com:8983/solr/core3/replication

The following pastie shows how my slave is setup:
http://pastie.org/1214209
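
(For comparison, a typical per-core slave section in each core's solrconfig.xml
looks roughly like the following; the poll interval is only an example value:)

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://solr.mydomain.com:8983/solr/core1/replication</str>
      <str name="pollInterval">00:00:60</str>
    </lst>
  </requestHandler>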

But it's not working (i.e. I see no replication attempts in the slave's log).

Any ideas?

Thanks for the help.


Re: Trouble with exception Document [Null] missing required field DocID

2010-10-11 Thread Chris Hostetter

: Right. You're requiring that every document have an ID (via uniqueKey), but
: there's nothing
: magic about DIH that'll automagically parse a PDF file and map something
: into your ID
: field.
: 
: So you have to create a unique ID before you send your doc to Curl. I'm

a) This example isn't using DIH, it's using the extracting request handler 
directly

b) in the example URL provided, Ahson was already using the exact syntax 
you mentioned...

: > curl
: > "http://localhost:8983/solr1/update/extract?literal.DocID=123&fmap.content=Contents&commit=true"
: >  -F "myfi...@d:/solr/apache-solr-1.4.0/docs/filename1.pdf"

...note the "literal.DocID" param (where "DocID" is the field listed as 
uniqueKey in his example)

The actual root of the problem is that the "lowernames" param 
(which is declared "true" in the Solr 1.4 example declaration of 
/update/extract) is getting applied to all field names, even the literal 
ones...

http://wiki.apache.org/solr/ExtractingRequestHandler#Order_of_field_operations

Ahson: You could change your uniqueKey field to something that is all 
lowercase, or you could set lowernames=false in your config (which will 
impact all field names extracted by Tika)
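
(Assuming lowernames can be overridden per request rather than only in the
/update/extract defaults, the second option might look like this; the file path
is a placeholder:)

  curl "http://localhost:8983/solr1/update/extract?literal.DocID=123&lowernames=false&fmap.content=Contents&commit=true" \
    -F "myfile=@/path/to/filename1.pdf"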

(Personally, i think the order of operations in the 
ExtractingRequestHandler makes no sense at all)

-Hoss


Re: having problem about Solr Date Field.

2010-10-11 Thread Dennis Gearon
So, regarding DST, do you put everything in GMT, and make adjustments in the
'search for/between' date/time values before the query, for both DST and TZ?


Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Mon, 10/11/10, Chris Hostetter  wrote:

> From: Chris Hostetter 
> Subject: Re: having problem about Solr Date Field.
> To: solr-user@lucene.apache.org
> Date: Monday, October 11, 2010, 3:23 PM
> 
> : Of course if your index is for users in one time zone
> only, you may 
> : insert the local time to Solr, and everything will work
> well. However, 
> 
> This is a bad assumption to make -- it will screw you up if
> your "one time 
> zone" has anything like "Daylight Saving Time" (Because UTC
> Does not)
> 
> 
> -Hoss
>


Re: Records from DIH not easily queried for

2010-10-11 Thread Dennis Gearon
Well, found the problem: us, of course.

We were using string instead of text for the field type in the schema config 
file. So it wasn't tokenizing words or doing other 'search by word' enabling 
preprocessing before storing the document in the index. We could have only 
found whole sentences.

Now it works! But, now the long road to tuning it to find what we WANT it to 
find . . . begins.

That and getting what we want out of geospatial. We're just starting on that.


Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Sun, 10/10/10, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: Records from DIH not easily queried for
> To: solr-user@lucene.apache.org
> Date: Sunday, October 10, 2010, 8:11 AM
> The phrase that jumps out is "with
> fields slightly modified". I'm
> guessing that your modifications are off by a little.
> Here's what
> I'd check first:
> 1> check the case. Sometimes the DB <-> field link
> is case
> sensitive.
> 2> Look in your index via the admin page and look at
> your actual
> fields as reported there. Are they really what you expect?
> 3> Try your query with &debugQuery=on. Is what you
> get back
> what you expect?
> 4> Sometimes your browser cache will fool you, try the
> force-refresh
> combination on your browser.
> 
> There's no magic here, nothing special or different about
> DIH
> imported data than any other sort. So it's almost
> certainly
> some innocent-seeming change that's not, typo, incorrect
> assumption, etc.
> 
> If none of that works, you need to post your schema changes
> and
> your query results (with &debugQuery=on). Particularly,
> post
> the fieldType definitions as well as your field
> definitions...
> 
> Best
> Erick
> 
> On Sun, Oct 10, 2010 at 10:55 AM, Dennis Gearon wrote:
> 
> > With a brand new setup, per the demo/tutorial, with
> fields slightly changed
> > in the config and data, posting XML records results in
> a simple qiery being
> > able to find records.
> >
> >
> > But records imported via a plain jane DIH request can
> only be found using
> > 'q=*:*' queries.
> >
> > There's no filtering, tokenizing, blah blah. It's the
> factory settings. The
> > installation is as new at this as we are :-)
> >
> > Anyone have any ideas why we can't query for DIH
> handled records? Do they
> > have some magic juju done to them that XML Posts
> don't, or visa versa?
> >
> > Dennis Gearon
> >
> > Signature Warning
> > 
> > It is always a good idea to learn from your own
> mistakes. It is usually a
> > better idea to learn from others’ mistakes, so you
> do not have to make them
> > yourself. from '
> > http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> >
> > EARTH has a Right To Life,
> >  otherwise we all die.
> >
>


Re: Where is the lock file?

2010-10-11 Thread Chris Hostetter

: I've looked through the configuration file. I can see where it defines the
: lock type and I can see the unlock configuration. But I don't see where it
: specifies the lock file. Where is it? What is its name?

as mentioned in the stack trace you pasted, the name of the lock file in 
question is "write.lock" however what's really odd is that based on your 
stack trace you seem to be using the SingleInstanceLockFactory (ie: 
"single") which means the lock file is never written 
to disk -- it's an entirely in memory Lock object.
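
(For reference, the lock factory is selected in solrconfig.xml; the file-based
factories are what would actually produce an on-disk write.lock:)

  <indexDefaults>
    ...
    <lockType>single</lockType>   <!-- or "simple" / "native" -->
  </indexDefaults>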

If you are getting that stack trace, that suggests that something is 
seriously wack with your Solr setup -- is it possible you have multiple 
instances of Solr in the same JVM trying to use the same directory? (ie: 
an instance that wasn't shutdown cleanly, and then you started up a new 
instance using war hot deploy or something like it?)

: Also, to speed up nutch, we changed the configuration to start several map
: tasks at once. Is nutch trying to kick off several solr sessions at once and
: is that causing messages like the above? Should we just change the lock to
: simple?

i don't know enough about nutch to know what this means ... if Nutch is 
starting up multiple Solr servers (in the same JVM) then this might 
explain the exception above ... using a "simple" lock isn't going to make 
the problem go away though: only one Solr instance can be writing to an 
index at a time.



-Hoss


Re: How to get line numbers from Solr plugin to show up in stack trace

2010-10-11 Thread Chris Hostetter

: Hello, I am writing a clustering component for Solr. It registers, loads and 
: works properly. However, whenever there is an exception inside my plugin, I 
: cannot get tomcat to show me the line numbers. It always says "Unknown 
source" 
: for my classes. The stack trace in tomcat shows line numbers for everything 
up 
: to org.apache.solr.handler.component.SearchHandler class, but after that it 
: shows my class names without line numbers. My compiler in ant build file is 
set 
: to include debug info:
: 

I've never seen "debuglevel" in a build.xml ... Solr's build.xml just uses 
debug="true" and things seem to work fine.

Googling for "ant debuglevel" suggests that:
  1) you don't want "and" in that attribute
  2) you don't want any spaces in there either
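
(i.e., something along these lines in the javac task is typically enough:)

  <javac srcdir="src" destdir="build/classes" debug="true" debuglevel="lines,vars,source"/>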




-Hoss


Re: configuring custom CharStream in solr

2010-10-11 Thread Koji Sekiguchi

(10/10/12 5:57), Michael Sokolov wrote:

I would like to inject my CharStream (or possibly it could be a CharFilter;
this is all in flux at the moment) into the analysis chain for a field.  Can
I do this in solr using the Analyzer configuration syntax in schema.xml, or
would I need to define my own Analyzer?  The solr wiki describes adding
Tokenizers, but doesn't say anything about CharReaders/Filters.

Thanks for any pointers

-Mike


Hi Mike,

You can write your own CharFilterFactory that creates your own
CharStream. Please refer to the existing CharFilterFactories in Solr
to see how you can implement it.

Koji

--
http://www.rondhuit.com/en/


Re: having problem about Solr Date Field.

2010-10-11 Thread Chris Hostetter

: Of course if your index is for users in one time zone only, you may 
: insert the local time to Solr, and everything will work well. However, 

This is a bad assumption to make -- it will screw you up if your "one time 
zone" has anything like "Daylight Saving Time" (Because UTC Does not)


-Hoss


Re: StatsComponent and multi-valued fields

2010-10-11 Thread Chris Hostetter

: I'm able to execute stats queries against multi-valued fields, but when
: given a facet, the statscomponent only considers documents that have a facet
: value as the last value in the field.
: 
: As an example, imagine you are running stats on "fooCount", and you want to
: facet on "bar", which is multi-valued.  Two documents...

It's a known bug ... StatsComponent's "Faceted Stats" make some really 
gross assumptions about the Field...

https://issues.apache.org/jira/browse/SOLR-1782

-Hoss


weighted facets

2010-10-11 Thread Peter Karich
Hi,

I need a feature which is well explained by Mr Goll at this site **

So, it then would be nice to do sth. like:

facet.stats=sum(fieldX)&facet.stats.sort=fieldX

And the output (sorted against the sum-output) can look sth. like this:

 
   
 767
 892

Is there something similar, or was this answered by Hoss at the Lucene
revolution? If not I'll open a JIRA issue ...


BTW: is the work from
http://www.cs.cmu.edu/~ddash/papers/facets-cikm.pdf contributed back to
solr?


Regards,
Peter.



PS: Related issue:
https://issues.apache.org/jira/browse/SOLR-680
https://issues.apache.org/jira/secure/attachment/12400054/SOLR-680.patch



**
http://lucene.crowdvine.com/posts/14137409

Quoting his question in case the site goes offline:

Hi Chris,

Usually a facet search returns the document count for the
unique values in the facet field. Is there a way to
return a weighted facet count based on a user-defined function (sum,
product, etc.) of another field?

Here is a sum example. Assume we have the following
4 documents with 3 fields

ID facet_field weight_field
1 solr 0.4
2 lucene 0.3
3 lucene 0.1
4 lucene 0.2

Is there a way to return

solr 0.4
lucene 0.6

instead of

solr 1
lucene 3

Given the facet_field contains multiple values

ID facet_field weight_field
1 solr lucene 0.2
2 lucene 0.3
3 solr lucene 0.1
4 lucene 0.2

Is there a way to return

solr 0.3
lucene 0.8

instead of

solr 2
lucene 4

Thanks,
Johannes


Re: data import / delta question

2010-10-11 Thread Tim Heckman
Thanks, Erick. I was starting to think I may have to go the SolrJ route.

Here's a simplified version of my DIH config showing what I'm trying to do.







  










On Mon, Oct 11, 2010 at 4:25 PM, Erick Erickson  wrote:
> Without seeing your DIH config, it's really hard to say much of anything.
>
> You can gain finer control over edge cases by writing a Java
> app that uses SolrJ if necessary.
>
> HTH
> Erick
>
> On Mon, Oct 11, 2010 at 3:27 PM, Tim Heckman  wrote:
>
>> My data-import-config.xml has a parent entity and a child entity. The
>> data is coming from rdbms's.
>>
>> I'm trying to make use of the delta-import feature where a change in
>> the child entity can be used to regenerate the entire document.
>>
>> The child entity is on a different database (and a different server)
>> from the parent entity, so the child's parentDeltaQuery cannot
>> reference the table of the parent entity the way that the example on
>> the wiki does, because it's bound to the database connection for the
>> child entity's data (as far as I can tell).
>>
>> http://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command
>>
>>
>> I have tried extracting the parent's ID's from the child table in the
>> parentDeltaQuery, thinking that these id's would be fed into the
>> parent's deltaImportQuery, but this doesn't seem to work, either.
>>
>> Should this work? If not, any suggestions how to work around it?
>>
>> thanks,
>> Tim
>>
>


configuring custom CharStream in solr

2010-10-11 Thread Michael Sokolov
I would like to inject my CharStream (or possibly it could be a CharFilter;
this is all in flux at the moment) into the analysis chain for a field.  Can
I do this in solr using the Analyzer configuration syntax in schema.xml, or
would I need to define my own Analyzer?  The solr wiki describes adding
Tokenizers, but doesn't say anything about CharReaders/Filters.

Thanks for any pointers

-Mike



Re: Deleting Documents with null fields by query

2010-10-11 Thread Erick Erickson
"erase all the content". Oops.

first, I should look more carefully. You don't want the AND in there, use
*:* -content:[* TO *]

In general, don't mix and match booleans and native Lucene query syntax...

Before sending this to Solr, what do you get back when you try just the
query
in, say, the admin page? I'd be testing the query there before actually
submitting
the delete
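
(Once the query returns only the expected docs, the actual delete would look
something like this, reusing the curl form from earlier in the thread:)

  curl "http://localhost:8983/solr/update?commit=true" -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>*:* -content:[* TO *]</query></delete>'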

Best
Erick

On Mon, Oct 11, 2010 at 4:33 PM, Claudio Devecchi wrote:

> yes..
>
> dont work, doing it I erase all the content. :(
>
> or, another thing that will help me is to make a query that doesnt bring
> the
> null one.
>
> tks
>
> On Mon, Oct 11, 2010 at 5:27 PM, Erick Erickson  >wrote:
>
> > Have you tried something like:
> >
> > '*:* AND
> > -content:[* TO *]
> >
> >
> > On Mon, Oct 11, 2010 at 4:01 PM, Claudio Devecchi  > >wrote:
> >
> > > Hi everybody,
> > >
> > > I'm trying to delete by query some documents with null content (this
> > > happened because I crawled my intranet and somethings came null)
> > >
> > > When I try this works fine (I'm deleting from my solr index every
> > document
> > > that dont have wiki on the field content)
> > > curl http://localhost:8983/solr/update?commit=true -H 'Content-Type:
> > > text/xml' --data-binary '<delete><query>*:* AND
> > > -content:wiki</query></delete>'
> > >
> > > Now I need to make a query that delete every document that have the
> field
> > > content null.
> > >
> > > Somebody could help me pls?
> > >
> > > Tks
> > > CLaudio
> > >
> >
>
>
>
> --
> Claudio Devecchi
> flickr.com/cdevecchi
>


Re: Deleting Documents with null fields by query

2010-10-11 Thread Claudio Devecchi
yes..

dont work, doing it I erase all the content. :(

or, another thing that will help me is to make a query that doesnt bring the
null one.

tks

On Mon, Oct 11, 2010 at 5:27 PM, Erick Erickson wrote:

> Have you tried something like:
>
> '*:* AND
> -content:[* TO *]
>
>
> On Mon, Oct 11, 2010 at 4:01 PM, Claudio Devecchi  >wrote:
>
> > Hi everybody,
> >
> > I'm trying to delete by query some documents with null content (this
> > happened because I crawled my intranet and somethings came null)
> >
> > When I try this works fine (I'm deleting from my solr index every
> document
> > that dont have wiki on the field content)
> > curl http://localhost:8983/solr/update?commit=true -H 'Content-Type:
> > text/xml' --data-binary '<delete><query>*:* AND
> > -content:wiki</query></delete>'
> >
> > Now I need to make a query that delete every document that have the field
> > content null.
> >
> > Somebody could help me pls?
> >
> > Tks
> > CLaudio
> >
>



-- 
Claudio Devecchi
flickr.com/cdevecchi


Re: Disable (or prohibit) per-field overrides

2010-10-11 Thread Erick Erickson
I'm clueless in that case, because you're right, that's a lot of picky
maintenance

Sorry 'bout that
Erick

On Mon, Oct 11, 2010 at 4:18 PM, Markus Jelsma
wrote:

> Yes, we're using it but the problem is that there can be many fields and
> that means quite a large list of parameters to set for each request handler,
> and there can be many request handlers.
>
> It's not very practical for us to maintain such a big set of invariants.
>
> Thanks
>
>
>
> On Mon, 11 Oct 2010 16:12:35 -0400, Erick Erickson <
> erickerick...@gmail.com> wrote:
>
>> Have you looked at "invariants" in solrconfig.xml?
>>
>> Best
>> Erick
>>
>> On Mon, Oct 11, 2010 at 12:23 PM, Markus Jelsma
>> wrote:
>>
>>  Hi,
>>>
>>> Anyone knows useful method to disable or prohibit the per-field override
>>> features for the search components? If not, where to start to make it
>>> configurable via solrconfig and attempt to come up with a working patch?
>>>
>>> Cheers,
>>> --
>>> Markus Jelsma - CTO - Openindex
>>> http://www.linkedin.com/in/markus17
>>> 050-8536600 / 06-50258350
>>>
>>>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>


Re: Deleting Documents with null fields by query

2010-10-11 Thread Erick Erickson
Have you tried something like:

'*:* AND
-content:[* TO *]


On Mon, Oct 11, 2010 at 4:01 PM, Claudio Devecchi wrote:

> Hi everybody,
>
> I'm trying to delete by query some documents with null content (this
> happened because I crawled my intranet and somethings came null)
>
> When I try this works fine (I'm deleting from my solr index every document
> that dont have wiki on the field content)
> curl http://localhost:8983/solr/update?commit=true -H 'Content-Type:
> text/xml' --data-binary '<delete><query>*:* AND
> -content:wiki</query></delete>'
>
> Now I need to make a query that delete every document that have the field
> content null.
>
> Somebody could help me pls?
>
> Tks
> CLaudio
>


Re: data import / delta question

2010-10-11 Thread Erick Erickson
Without seeing your DIH config, it's really hard to say much of anything.

You can gain finer control over edge cases by writing a Java
app that uses SolrJ if necessary.

HTH
Erick

On Mon, Oct 11, 2010 at 3:27 PM, Tim Heckman  wrote:

> My data-import-config.xml has a parent entity and a child entity. The
> data is coming from rdbms's.
>
> I'm trying to make use of the delta-import feature where a change in
> the child entity can be used to regenerate the entire document.
>
> The child entity is on a different database (and a different server)
> from the parent entity, so the child's parentDeltaQuery cannot
> reference the table of the parent entity the way that the example on
> the wiki does, because it's bound to the database connection for the
> child entity's data (as far as I can tell).
>
> http://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command
>
>
> I have tried extracting the parent's ID's from the child table in the
> parentDeltaQuery, thinking that these id's would be fed into the
> parent's deltaImportQuery, but this doesn't seem to work, either.
>
> Should this work? If not, any suggestions how to work around it?
>
> thanks,
> Tim
>


Re: Solr unresponsive but still taking queries

2010-10-11 Thread Erick Erickson
The first question is "what's been changing"? I suspect something's been
growing
right along and finally tripped you up. Places I would look first:
1> how much free space is on your disk? Have your logs (or other files)
grown without bound?
2> If this is a Unix box, what does "top" report? In other words, profile
your machine and
see what the limiting resource is. You should be seeing something
pathological.
Your CPUs should be pegged (find out the program using it up). Or your
I/O is swapping
like a crazy thing. Or

Until you have some clue where you're being starved, you're just guessing...
Even negative
data is better than none (i.e. being CPU bound rules out most I/O problems
and vice-versa).

It's even possible that what's happening is that some other program on that
box is
mis-behaving and starving your searcher process. The possibilities are
endless.

A *very* quick way to test a lot would be to move the searcher onto another
box and see
what happens then.

Best
Erick

On Mon, Oct 11, 2010 at 2:36 PM, Hitendra Molleti
wrote:

> Hi,
>
>
>
> We are running a CMS based on Java and use Solr 1.4 as the indexer.
>
>
>
> Till today afternoon things were fine until we hit this Solr issue where it
> sort of becomes unresponsive. We tried to stop and restart Solr but no
> help.
>
>
>
> When we look into the logs Solr is receiving queries and running them but
> we
> do not seem to get the responses and after an endless wait the page
> generates a 503 error (Varnish on the front end).
>
>
>
> Can someone help us with any possible suggestions or solutions.
>
>
>
> Thanks
>
>
>
> Hitendra
>
>


Re: Disable (or prohibit) per-field overrides

2010-10-11 Thread Markus Jelsma
Yes, we're using it but the problem is that there can be many fields 
and that means quite a large list of parameters to set for each request 
handler, and there can be many request handlers.


It's not very practical for us to maintain such a big set of invariants.

Thanks


On Mon, 11 Oct 2010 16:12:35 -0400, Erick Erickson 
 wrote:

Have you looked at "invariants" in solrconfig.xml?

Best
Erick

On Mon, Oct 11, 2010 at 12:23 PM, Markus Jelsma
wrote:


Hi,

Anyone knows useful method to disable or prohibit the per-field 
override
features for the search components? If not, where to start to make 
it
configurable via solrconfig and attempt to come up with a working 
patch?


Cheers,
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350



--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: Disable (or prohibit) per-field overrides

2010-10-11 Thread Erick Erickson
Have you looked at "invariants" in solrconfig.xml?

Best
Erick

On Mon, Oct 11, 2010 at 12:23 PM, Markus Jelsma
wrote:

> Hi,
>
> Anyone knows useful method to disable or prohibit the per-field override
> features for the search components? If not, where to start to make it
> configurable via solrconfig and attempt to come up with a working patch?
>
> Cheers,
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>


Re: deleteByQuery issue

2010-10-11 Thread Erick Erickson
I'd guess that after you delete your documents and commit, you're still
using
an IndexReader that you haven't reopened when you search. WARNING:
I'm not all that familiar with EmbeddedSolrServer, so this may be way
off
base.

HTH
Erick

On Mon, Oct 11, 2010 at 12:04 PM, Claudio Atzori  wrote:

>  On 10/11/2010 04:06 PM, Ahmet Arslan wrote:
>
>>
>> --- On Mon, 10/11/10, Claudio Atzori  wrote:
>>
>>  From: Claudio Atzori
>>> Subject: deleteByQuery issue
>>> To: solr-user@lucene.apache.org
>>> Date: Monday, October 11, 2010, 10:38 AM
>>>  Hi everybody,
>>> in my application I use an instance of EmbeddedSolrServer
>>> (solr 1.4.1), the following snippet shows how I am
>>> instantiating it:
>>>
>>>   File home = new File(indexDataPath(solrDataDir, indexName));
>>>   container = new CoreContainer(indexDataPath(solrDataDir, indexName));
>>>   container.load(indexDataPath(solrDataDir, indexName), new File(home, "solr.xml"));
>>>   return new EmbeddedSolrServer(container, indexName);
>>>
>>> and I'm going through some issues using deleteByQuery
>>> method, in fact, when I try to delete a subset of documents,
>>> or even all the documents from the index, I see as they are
>>> correctly marked for deletion on the luke inspector (
>>> http://code.google.com/p/luke/), but after a commit I
>>> can still retrieve them, just like they haven't been
>>> removed...
>>>
>>> I can see the difference and see the documents disappear
>>> only when I restart my jetty application, but obviously this
>>> cannot be a feature... any idea?
>>>
>> I think you are accessing same solr index using both embedded server and
>> http.
>> The changes that you made using embedded server won't be reflected to http
>> until a commit issued from http. I mean if you hit this url:
>>
>> http://localhost:8983/solr/update?commit=true
>>
>> the deleted documents won't be retrieved anymore.
>>
>> P.s. if you want to expunge deleted docs completely you can either
>> optimize or commit with expungeDeletes = "true".
>>
>>
> Thanks for your reply.
> Alright, I'll better explain my scenario. I'm not exposing any http
> interface of the index. I handle the whole index 'life cycle' via java code
> with the EmbeddedSolrServer instance, so I'm handling commits,
> optimizations, feedings, index creation, all through that instance. Moreover,
> my client application calls embeddedSolrServerInstance.commit() after
> deleteByQuery, but the documents are still there
>
>


Deleting Documents with null fields by query

2010-10-11 Thread Claudio Devecchi
Hi everybody,

I'm trying to delete by query some documents with null content (this
happened because I crawled my intranet and some things came back null).

When I try this, it works fine (I'm deleting from my solr index every document
that doesn't have wiki in the field content):
curl http://localhost:8983/solr/update?commit=true -H 'Content-Type:
text/xml' --data-binary '<delete><query>*:* AND
-content:wiki</query></delete>'

Now I need to make a query that deletes every document that has the field
content null.

Somebody could help me pls?

Tks
CLaudio


Re: Prioritizing adjectives in solr search

2010-10-11 Thread Erick Erickson
You can do some interesting things with payloads. You could index a
particular value as the payload that identified the "kind" of word it was,
where "kind" is something you define. Then at query time, you could
boost depending on what part kind of word you identified it as in both
the query and at indexing time.

But I can't even imagine how one would go about supporting this in a
general search engine. This kind of thing seems far too domain
specific.

Best
Erick


On Sun, Oct 10, 2010 at 8:50 PM, Ron Mayer  wrote:

> Walter Underwood wrote:
> > I think this is a bad idea. The tf.idf algorithm will already put a
> higher weight on "hammers" than on "blue", because "hammers" will be more
> rare than "blue". Plus, you are making huge assumptions about the queries.
> In a search for "Canon camera", "Canon" is an adjective, but it is the
> important part of the query.
> >
> > Have you looked at your query logs and which queries are successful and
> which are not?
> >
> > Don't make radical changes like this unless you can justify them from the
> logs.
>
> The one radical change I'd like in the area of adjectives in noun clauses
> is if
> more weight were put when the adjectives apply to the appropriate noun.
>
> For example, a search for:
>   'red baseball cap black leather jacket'
> should find a doc with "the guy wore a red cap, blue jeans, and a leather
> jacket"
> before one that says "the guy wore a black cap, leather pants, and a red
> jacket".
>
>
> The closest I've come at doing this was to use a variety of "phrase slop"
> boosts simultaneously - so that "red [any_few_words] cap" "baseball cap"
> "leather jacket", "black [any_few_words] jacket" all add boosts to the
> score.
>
>
>
>
>
>
>
> >
> > wunder
> >
> > On Oct 4, 2010, at 8:38 PM, Otis Gospodnetic wrote:
> >
> >> Hi,
> >>
> >> If you want "blue" to be used in search, then you should not treat it as
> a
> >> stopword.
> >>
> >> Re payloads: http://search-lucene.com/?q=payload+score
> >> and http://search-lucene.com/?q=payload+score&fc_type=wiki (even
> better, look at
> >> hit #1)
> >>
> >> Otis
> >> 
> >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> >> Lucene ecosystem search :: http://search-lucene.com/
> >>
> >>
> >>
> >> - Original Message 
> >>> From: Hasnain 
> >>> To: solr-user@lucene.apache.org
> >>> Sent: Mon, October 4, 2010 9:50:46 AM
> >>> Subject: Re: Prioritizing advectives in solr search
> >>>
> >>>
> >>> Hi Otis,
> >>>
> >>> Thank you for replying,  unfortunately Im unable to fully grasp
> what
> >>> you are trying to say, can you  please elaborate what is payload with
> >>> adjective terms?
> >>>
> >>> also Im using  stopwords.txt to stop adjectives, adverbs and verbs, now
> when
> >>> I search for  "Blue hammers", solr searches for "blue hammers" and
> "hammers"
> >>> but not  "blue", but the problem here is user can also search for just
> >>> "Blue", then it  wont search for anything...
> >>>
> >>> any suggestions on this??
> >>>
> >>> --
> >>> View  this message in context:
> >>>
> http://lucene.472066.n3.nabble.com/Prioritizing-adjectives-in-solr-search-tp1613029p1629725.html
> >>>
> >>> Sent  from the Solr - User mailing list archive at Nabble.com.
> >>>
> >
> >
> >
> >
>
>


Re: CoreContainer Usage

2010-10-11 Thread Amit Nithian
Hi sorry perhaps my question wasn't very clear. Basically I am trying
to build a federated search where I blend the results of queries to
multiple cores together. This is like distributed search but I believe
the distributed search will issue network calls which I would like to
avoid.

I have read that someone will use a single core as the federated
search handler and then run the searches across multiple cores and
blend the results. This is great but I can't figure out how to easily
get access to an instance of the CoreContainer that I hope has been
initialized (so I am not having it re-parse the configuration files).

Any help would be appreciated.

Thanks!
Amit

On Thu, Oct 7, 2010 at 10:07 AM, Amit Nithian  wrote:
> I am trying to understand the multicore setup of Solr more and saw
> that SolrCore.getCore is deprecated in favor of
> CoreContainer.getCore(name). How can I get a reference to the
> CoreContainer for I assume it's been created somewhere in Solr and is
> it possible for one core to get access to another SolrCore via the
> CoreContainer?
>
> Thanks
> Amit
>


data import / delta question

2010-10-11 Thread Tim Heckman
My data-import-config.xml has a parent entity and a child entity. The
data is coming from rdbms's.

I'm trying to make use of the delta-import feature where a change in
the child entity can be used to regenerate the entire document.

The child entity is on a different database (and a different server)
from the parent entity, so the child's parentDeltaQuery cannot
reference the table of the parent entity the way that the example on
the wiki does, because it's bound to the database connection for the
child entity's data (as far as I can tell).

http://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command
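
(Roughly the shape that wiki example describes; the data sources, table names
and column names here are purely illustrative:)

  <dataConfig>
    <dataSource name="db1" driver="..." url="jdbc:..." user="..." password="..."/>
    <dataSource name="db2" driver="..." url="jdbc:..." user="..." password="..."/>
    <document>
      <entity name="item" dataSource="db1" pk="id"
              query="select id, name from item"
              deltaQuery="select id from item where updated > '${dataimporter.last_index_time}'"
              deltaImportQuery="select id, name from item where id='${dataimporter.delta.id}'">
        <entity name="feature" dataSource="db2"
                query="select description from feature where item_id='${item.id}'"
                deltaQuery="select item_id from feature where updated > '${dataimporter.last_index_time}'"
                parentDeltaQuery="select id from item where id=${feature.item_id}"/>
      </entity>
    </document>
  </dataConfig>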


I have tried extracting the parent's ID's from the child table in the
parentDeltaQuery, thinking that these id's would be fed into the
parent's deltaImportQuery, but this doesn't seem to work, either.

Should this work? If not, any suggestions how to work around it?

thanks,
Tim


Re: Prioritizing advectives in solr search

2010-10-11 Thread Chris Hostetter

: here is my scenario, im using dismax handler and my understanding is when I
: query "Blue hammer", solr brings me results for "blue hammer", "blue" and
: "hammer", and in the same hierarchy, which is understandable, is there any
: way I can manage the "blue" keyword, so that solr searches for "blue hammer"
: and "hammer" and not any results for "blue".

at a very simple level, you can achieve something like this by using a 
"qf" that points at fields where adjectives have been removed (ie: using 
StopFilter) and using "pf" fields where the adjectives have been left 
alone -- thus a query for "blue hammer" will match any doc containing 
"hammer" but the "pf" clause will boost documents matching the phrase 
"blue hammer" (documents matching only "blue" will not match, and 
documents matching "blue" and "hammer" farther apart then the "ps" param 
will not get the phrase boost)

But please note Walter's comments and consider them carefully before 
treating this as a silver bullet.
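
(A minimal sketch of that setup as dismax parameters; the field names are
placeholders, where "name_nostop" would be a copy of "name" analyzed with a
StopFilter that removes the adjectives:)

  <str name="qf">name_nostop^2.0</str>
  <str name="pf">name^4.0</str>
  <str name="ps">2</str>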

: 
: my handler is as follows...
: 
:  
: 
:  
:dismax
:explicit
:   0.6
:   name^2.3 mat_nr^0.4
:   0% 
: 
: any suggestion on this??
: -- 
: View this message in context: 
http://lucene.472066.n3.nabble.com/Prioritizing-advectives-in-solr-search-tp1613029p1613029.html
: Sent from the Solr - User mailing list archive at Nabble.com.
: 

-Hoss


Solr unresponsive but still taking queries

2010-10-11 Thread Hitendra Molleti
Hi,

 

We are running a CMS based on Java and use Solr 1.4 as the indexer.

 

Until this afternoon things were fine; then we hit this Solr issue where Solr
sort of becomes unresponsive. We tried to stop and restart Solr but no help.

 

When we look into the logs, Solr is receiving queries and running them, but we
do not seem to get the responses, and after an endless wait the page
generates a 503 error (Varnish on the front end).

 

Can someone help us with any possible suggestions or solutions.

 

Thanks

 

Hitendra



Re: Search within a subset of documents

2010-10-11 Thread Sergey Bartunov
And so I think. Actually I hope that I can do something like this:

1) tell Solr to prepare for searching
2) start my very fast filtering routine
3) asynchronously send the IDs of the filtered documents to Solr and
expect that Solr ranks them in parallel
4) get the result quickly
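
(For step 3, the filtered IDs could be sent as an ordinary filter query, e.g.:)

  q=text:foo&fq=id:(3 OR 17 OR 256)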

On 11 October 2010 21:25, Gora Mohanty  wrote:
> On Mon, Oct 11, 2010 at 8:20 PM, Sergey Bartunov  wrote:
>> Whether it will be enough effective if the subset is really large?
> [...]
>
> If the subset of IDs is large, and disjoint (so that you cannot use ranges),
> the query might look ugly, but generating it should not be much of a
> problem if you are using some automated method to create the query.
>
> If you mean whether it will be efficient enough, the only way is to try
> it out, and measure performance. Offhand, I do not think that it should
> increase the query response time by a lot.
>
> Regards,
> Gora
>


Re: How to manage different indexes for different users

2010-10-11 Thread Tharindu Mathew
Great! Just what I need. Thanks for all the help. I'll let you know how it
goes.

On Mon, Oct 11, 2010 at 11:37 PM, Markus Jelsma
wrote:

> Well, set the user ID for each document and use a filter query to filter
> only on that user ID field.
>
> On Mon, 11 Oct 2010 23:25:29 +0530, Tharindu Mathew 
> wrote:
>
>> On Mon, Oct 11, 2010 at 10:48 PM, Markus Jelsma  wrote:
>>
>>  Then you probably read on how to create [1] the new core. Keep in
>> mind, you might need to do some additional local scripting to create a
>> new instance dir.
>>
>>  Do the user share the same schema? If so, you'd be better of keeping
>> a single index and preventing the users from querying others.
>>
>> Yes, they will be sharing the same schema. If I understand correctly.
>> going with a single core is recommended in that case? But how do I
>> prevent users from querying other users data?
>>
>>  [1]: http://wiki.apache.org/solr/CoreAdmin#CREATE
>>
>>  On Mon, 11 Oct 2010 22:40:03 +0530, Tharindu Mathew  wrote:
>>
>>  Thanks Li. I checked out multi cores documentation.
>>
>>  How do I dynamically create new cores as new users are added. Is
>> that
>>  possible?
>>
>>  On Mon, Oct 11, 2010 at 2:31 PM, Li Li  wrote:
>>
>>
>>  will one user search other user's index?
>>  if not, you can use multi cores.
>>
>>  2010/10/11 Tharindu Mathew :
>>
>>  > Hi everyone,
>>  >
>>  > I'm using solr to integrate search into my web app.
>>  >
>>  > I have a bunch of users who would have to be given their own
>> individual
>>  > indexes.
>>  >
>>  > I'm wondering whether I'd have to append their user ID as I index
>> a file.
>>  > I'm not sure which approach to follow. Is there a sample or a doc
>> I can
>>  read
>>  > to understand how to approach this problem?
>>  >
>>  > Thanks in advance.
>>  >
>>  > --
>>  > Regards,
>>  >
>>  > Tharindu
>>  >
>>
>>  --
>>  Markus Jelsma - CTO - Openindex
>>  http://www.linkedin.com/in/markus17
>>  050-8536600 / 06-50258350
>>
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>



-- 
Regards,

Tharindu


Re: Sorting on arbitary 'custom' fields

2010-10-11 Thread Simon Wistow
On Sat, Oct 09, 2010 at 06:31:19PM -0400, Erick Erickson said:
> I'm confused. What do you mean that a user can "set any
> number of arbitrarily named fields on a document". It sounds
> like you are talking about a user adding arbitrarily may entries
> to a multi-valued field? Or is it some kind of key:value pairs
> in a field in your schema?

Users can add arbitrary key/values to documents. Kind of like Machine 
Tags.

So whilst a document has some standard fields (e.g title="My Random 
Document", user="Simon", date="2010-10-11") I might have added 
current_temp_in_c="32" to one of my documents but you might have put 
time_taken_to_write_in_mins="30".

We currently don't index these fields but we'd like to and be able to 
have users sort on them. 

Ideas I had:

- Everytime a user adds a new field (e.g time_taken_to_write_in_mins) 
update the global schema

But that would be horrible and would create an index with many thousands 
of fields.

- Give each user their own core and update each individual schema

Better but still inelegant

The multi valued field idea occurred to me because I could have, for 
example

user_field: [time_taken_to_write_in_mins=30, current_temp_in_c=32]

(i.e flatten the key/value)

I could then maybe write something that allowed sorting only on matched 
values of multi-value field. 

sort=user_field:time_taken_to_write_in_mins=*

or

fq=user_field:time_taken_to_write_in_mins=*&sort=user_field

It was just an idea though and I was hoping that there would be a 
simpler more orthodox way of doing it.

thanks,

Simon


Re: How to manage different indexes for different users

2010-10-11 Thread Markus Jelsma
Well, set the user ID for each document and use a filter query to
filter only on that user ID field.


On Mon, 11 Oct 2010 23:25:29 +0530, Tharindu Mathew 
 wrote:

On Mon, Oct 11, 2010 at 10:48 PM, Markus Jelsma  wrote:
 Then you probably read on how to create [1] the new core. Keep in
mind, you might need to do some additional local scripting to create 
a

new instance dir.

 Do the user share the same schema? If so, you'd be better of keeping
a single index and preventing the users from querying others.

Yes, they will be sharing the same schema. If I understand correctly.
going with a single core is recommended in that case? But how do I
prevent users from querying other users data?

  [1]: http://wiki.apache.org/solr/CoreAdmin#CREATE

 On Mon, 11 Oct 2010 22:40:03 +0530, Tharindu Mathew  wrote:
  Thanks Li. I checked out multi cores documentation.

 How do I dynamically create new cores as new users are added. Is
that
 possible?

 On Mon, Oct 11, 2010 at 2:31 PM, Li Li  wrote:

  will one user search other user's index?
 if not, you can use multi cores.

 2010/10/11 Tharindu Mathew :
 > Hi everyone,
 >
 > I'm using solr to integrate search into my web app.
 >
 > I have a bunch of users who would have to be given their own
individual
 > indexes.
 >
 > I'm wondering whether I'd have to append their user ID as I index
a file.
 > I'm not sure which approach to follow. Is there a sample or a doc
I can
 read
 > to understand how to approach this problem?
 >
 > Thanks in advance.
 >
 > --
 > Regards,
 >
 > Tharindu
 >

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350


--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: How to manage different indexes for different users

2010-10-11 Thread Tharindu Mathew
On Mon, Oct 11, 2010 at 10:48 PM, Markus Jelsma
wrote:

> Then you probably read on how to create [1] the new core. Keep in mind, you
> might need to do some additional local scripting to create a new instance
> dir.
>
> Do the user share the same schema? If so, you'd be better of keeping a
> single index and preventing the users from querying others.
>
> Yes, they will be sharing the same schema. If I understand correctly, going
with a single core is recommended in that case? But how do I prevent users
from querying other users' data?

[1]: http://wiki.apache.org/solr/CoreAdmin#CREATE
>
>
> On Mon, 11 Oct 2010 22:40:03 +0530, Tharindu Mathew 
> wrote:
>
>> Thanks Li. I checked out multi cores documentation.
>>
>> How do I dynamically create new cores as new users are added. Is that
>> possible?
>>
>> On Mon, Oct 11, 2010 at 2:31 PM, Li Li  wrote:
>>
>>  will one user search other user's index?
>>> if not, you can use multi cores.
>>>
>>> 2010/10/11 Tharindu Mathew :
>>> > Hi everyone,
>>> >
>>> > I'm using solr to integrate search into my web app.
>>> >
>>> > I have a bunch of users who would have to be given their own individual
>>> > indexes.
>>> >
>>> > I'm wondering whether I'd have to append their user ID as I index a
>>> file.
>>> > I'm not sure which approach to follow. Is there a sample or a doc I can
>>> read
>>> > to understand how to approach this problem?
>>> >
>>> > Thanks in advance.
>>> >
>>> > --
>>> > Regards,
>>> >
>>> > Tharindu
>>> >
>>>
>>>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>



-- 
Regards,

Tharindu


Re: facet.method: enum vs. fc

2010-10-11 Thread Erick Erickson
Yep, that was probably the best choice.

It's a classic time/space tradeoff. The enum method creates a bitset for #each#
unique facet value. The bit set is (maxdocs / 8) bytes in size (I'm ignoring
some overhead here). So if your facet field has 10 unique values, and 8M
documents, you'll use up 10M bytes or so. 20 unique values will use up 20M
bytes and so on. But this is very, very fast.

fc, on the other hand, eats up cache for storing the string value for each
unique value, plus various counter arrays (several bytes/doc). For most cases,
it will use less memory than enum, but will be slower.

I'd stick with fc for the time being and think about enum if 1> you have a
good idea of what the number of unique terms is or 2> you start to need to
finely tune your speed.
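
As a side note, facet.method can also be set per field, so one or two
low-cardinality fields can use enum while everything else stays on fc. A
SolrJ fragment, with invented field names:

    SolrQuery q = new SolrQuery("*:*");
    q.setFacet(true);
    q.addFacetField("country", "author");      // hypothetical facet fields
    q.set("facet.method", "fc");               // default for most fields
    q.set("f.country.facet.method", "enum");   // override only for the small field

The same f.<field>.facet.method parameter works as a plain request parameter
or in the request handler defaults in solrconfig.xml.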

HTH
Erick

On Mon, Oct 11, 2010 at 11:30 AM, Paolo Castagna <
castagna.li...@googlemail.com> wrote:

> Hi,
> I am using Solr v1.4 and I am not sure which facet.method I should use.
>
> What should I use if I do not know in advance if the number of values
> for a given field will be high or low?
>
> What are the pros/cons of using facet.method=enum vs. facet.method=fc?
>
> When should I use enum vs. fc?
>
> I have found some comments and suggestions here:
>
>  "enum enumerates all terms in a field, calculating the set intersection
>  of documents that match the term with documents that match the query.
>  This was the default (and only) method for faceting multi-valued fields
>  prior to Solr 1.4.
>  "fc (stands for field cache), the facet counts are calculated by
>  iterating over documents that match the query and summing the terms
>  that appear in each document. This was the default method for single
>  valued fields prior to Solr 1.4.
>  The default value is fc (except for BoolField) since it tends to use
>  less memory and is faster when a field has many unique terms in the
>  index."
>  -- http://wiki.apache.org/solr/SimpleFacetParameters#facet.method
>
>  "facet.method=enum [...] this is excellent for fields where there is
>  a small set of distinct values. The average number of values per
>  document does not matter.
>  facet.method=fc [...] this is excellent for situations where the
>  number of indexed values for the field is high, but the number of
>  values per document is low. For multi-valued fields, a hybrid approach
>  is used that uses term filters from the filterCache for terms that
>  match many documents."
>  -- http://wiki.apache.org/solr/SolrFacetingOverview
>
>  "If you are faceting on a field that you know only has a small number
>  of values (say less than 50), then it is advisable to explicitly set
>  this to enum. When faceting on multiple fields, remember to set this
>  for the specific fields desired and not universally for all facets.
>  The request handler configuration is a good place to put this."
>  -- Book: "Solr 1.4 Enterprise Search Server", pag. 148
>
> This is the part of the Solr code which deals with the facet.method
> parameter:
>
>  if (enumMethod) {
>counts = getFacetTermEnumCounts([...]);
>  } else {
>if (multiToken) {
>  UnInvertedField uif = [...]
>  counts = uif.getCounts([...]);
>} else {
>  [...]
>  if (per_segment) {
>[...]
>counts = ps.getFacetCounts([...]);
>  } else {
>counts = getFieldCacheCounts([...]);
>  }
>}
>  }
>  --
> https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/java/org/apache/solr/request/SimpleFacets.java
>
> See also:
>
>  -
> http://stackoverflow.com/questions/2902680/how-well-does-solr-scale-over-large-number-of-facet-values
>
> At the end, since I do not know in advance the number of different
> values for my fields I went for facet.method=fc; does this seem
> reasonable to you?
>
> Thank you,
> Paolo
>


Re: Problem with Indexing

2010-10-11 Thread Gora Mohanty
On Mon, Oct 11, 2010 at 1:27 PM, Jörg Agatz  wrote:
> ok, I have tried it.. and now I get this error:
>
> POSTing file e067f59c-d046-11df-b552-000c29e17baa_SEARCH.xml
> SimplePostTool: FATAL: Solr returned an error:
> this_writer_hit_an_OutOfMemoryError_cannot_flush__javalangIllegalStateException
[...]

Not sure in this particular case, but this looks like Solr is running out of
memory. How much RAM do you have allocated in the Java container
that Solr is running in?
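(For reference, the heap is whatever the container was started with; with the
example Jetty setup that would be along the lines of java -Xmx1024m -jar start.jar,
where 1024m is only an illustrative value.)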

Regards,
Gora


Re: Search within a subset of documents

2010-10-11 Thread Gora Mohanty
On Mon, Oct 11, 2010 at 8:20 PM, Sergey Bartunov  wrote:
> Will it be efficient enough if the subset is really large?
[...]

If the subset of IDs is large, and disjoint (so that you cannot use ranges),
the query might look ugly, but generating it should not be much of a
problem if you are using some automated method to create the query.

If you mean whether it will be efficient enough, the only way is to try
it out, and measure performance. Offhand, I do not think that it should
increase the query response time by a lot.
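
A hedged SolrJ sketch of what the generated query could look like (the IDs and
the query text are invented, and java.util.List/Arrays plus SolrJ's SolrQuery
are assumed to be imported; for very large ID lists the maxBooleanClauses limit
in solrconfig.xml may need to be raised):

    // build a filter like  id:(1234 OR 1240 OR 1301)  from an arbitrary list of IDs
    List<String> ids = Arrays.asList("1234", "1240", "1301");
    StringBuilder fq = new StringBuilder("id:(");
    for (int i = 0; i < ids.size(); i++) {
        if (i > 0) fq.append(" OR ");
        fq.append(ids.get(i));
    }
    fq.append(")");

    SolrQuery q = new SolrQuery("some keywords");
    q.addFilterQuery(fq.toString());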

Regards,
Gora


Re: How to manage different indexes for different users

2010-10-11 Thread Markus Jelsma
Then you'll probably want to read up on how to create [1] the new core. Keep in 
mind, you might need to do some additional local scripting to create a new 
instance dir.
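
For example, a CREATE call looks roughly like
http://localhost:8983/solr/admin/cores?action=CREATE&name=user42&instanceDir=cores/user42
(host, core name and path are placeholders); the instance dir and its conf/
directory have to exist before the call, which is where the local scripting
comes in.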


Do the users share the same schema? If so, you'd be better off keeping a 
single index and preventing the users from querying others.


[1]: http://wiki.apache.org/solr/CoreAdmin#CREATE

On Mon, 11 Oct 2010 22:40:03 +0530, Tharindu Mathew 
 wrote:

Thanks Li. I checked out multi cores documentation.

How do I dynamically create new cores as new users are added. Is that
possible?

On Mon, Oct 11, 2010 at 2:31 PM, Li Li  wrote:


will one user search other user's index?
if not, you can use multi cores.

2010/10/11 Tharindu Mathew :
> Hi everyone,
>
> I'm using solr to integrate search into my web app.
>
> I have a bunch of users who would have to be given their own 
individual

> indexes.
>
> I'm wondering whether I'd have to append their user ID as I index 
a file.
> I'm not sure which approach to follow. Is there a sample or a doc 
I can

read
> to understand how to approach this problem?
>
> Thanks in advance.
>
> --
> Regards,
>
> Tharindu
>



--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: How to manage different indexes for different users

2010-10-11 Thread Tharindu Mathew
Thanks Li. I checked out multi cores documentation.

How do I dynamically create new cores as new users are added. Is that
possible?

On Mon, Oct 11, 2010 at 2:31 PM, Li Li  wrote:

> will one user search other user's index?
> if not, you can use multi cores.
>
> 2010/10/11 Tharindu Mathew :
> > Hi everyone,
> >
> > I'm using solr to integrate search into my web app.
> >
> > I have a bunch of users who would have to be given their own individual
> > indexes.
> >
> > I'm wondering whether I'd have to append their user ID as I index a file.
> > I'm not sure which approach to follow. Is there a sample or a doc I can
> read
> > to understand how to approach this problem?
> >
> > Thanks in advance.
> >
> > --
> > Regards,
> >
> > Tharindu
> >
>



-- 
Regards,

Tharindu


Disable (or prohibit) per-field overrides

2010-10-11 Thread Markus Jelsma
Hi,

Does anyone know of a useful method to disable or prohibit the per-field override 
features for the search components? If not, where should I start to make it 
configurable via solrconfig and attempt to come up with a working patch?

Cheers,
-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: deleteByQuery issue

2010-10-11 Thread Claudio Atzori

 On 10/11/2010 04:06 PM, Ahmet Arslan wrote:


--- On Mon, 10/11/10, Claudio Atzori  wrote:


From: Claudio Atzori
Subject: deleteByQuery issue
To: solr-user@lucene.apache.org
Date: Monday, October 11, 2010, 10:38 AM
  Hi everybody,
in my application I use an instance of EmbeddedSolrServer
(solr 1.4.1), the following snippet shows how I am
instantiating it:


  File home = new File(indexDataPath(solrDataDir, indexName));

  container = new CoreContainer(indexDataPath(solrDataDir, indexName));
  container.load(indexDataPath(solrDataDir, indexName), new File(home, "solr.xml"));

  return new EmbeddedSolrServer(container, indexName);

and I'm going through some issues using deleteByQuery
method, in fact, when I try to delete a subset of documents,
or even all the documents from the index, I see as they are
correctly marked for deletion on the luke inspector 
(http://code.google.com/p/luke/), but after a commit I
can still retrieve them, just like they haven't been
removed...

I can see the difference and see the documents disappear
only when I restart my jetty application, but obviously this
cannot be a feature... any idea?

I think you are accessing same solr index using both embedded server and http.
The changes that you made using embedded server won't be reflected to http 
until a commit issued from http. I mean if you hit this url:

http://localhost:8983/solr/update?commit=true

the deleted documents won't be retrieved anymore.

P.s. if you want to expunge deleted docs completely you can either optimize or commit 
with expungeDeletes = "true".



Thanks for your reply.
Alright, let me explain my scenario better. I'm not exposing any HTTP 
interface of the index. I handle the whole index 'life cycle' via Java 
code with the EmbeddedSolrServer instance, so I'm handling commits, 
optimizations, feeding and index creation all through that instance. 
Moreover, my client application calls embeddedSolrServerInstance.commit() 
after deleteByQuery, but the documents are still there.
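
For what it's worth, a minimal sketch of the sequence being described, assuming
the single shared EmbeddedSolrServer instance is called server and the query
string is only an example:

    server.deleteByQuery("*:*");        // or any narrower delete query
    server.commit();                     // issued on the same instance that did the delete
    SolrQuery q = new SolrQuery("*:*");
    long found = server.query(q).getResults().getNumFound();  // expected to be 0 here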




facet.method: enum vs. fc

2010-10-11 Thread Paolo Castagna

Hi,
I am using Solr v1.4 and I am not sure which facet.method I should use.

What should I use if I do not know in advance if the number of values
for a given field will be high or low?

What are the pros/cons of using facet.method=enum vs. facet.method=fc?

When should I use enum vs. fc?

I have found some comments and suggestions here:

 "enum enumerates all terms in a field, calculating the set intersection
  of documents that match the term with documents that match the query.
  This was the default (and only) method for faceting multi-valued fields
  prior to Solr 1.4.
 "fc (stands for field cache), the facet counts are calculated by
  iterating over documents that match the query and summing the terms
  that appear in each document. This was the default method for single
  valued fields prior to Solr 1.4.
  The default value is fc (except for BoolField) since it tends to use
  less memory and is faster when a field has many unique terms in the
  index."
  -- http://wiki.apache.org/solr/SimpleFacetParameters#facet.method

 "facet.method=enum [...] this is excellent for fields where there is
  a small set of distinct values. The average number of values per
  document does not matter.
  facet.method=fc [...] this is excellent for situations where the
  number of indexed values for the field is high, but the number of
  values per document is low. For multi-valued fields, a hybrid approach
  is used that uses term filters from the filterCache for terms that
  match many documents."
  -- http://wiki.apache.org/solr/SolrFacetingOverview

 "If you are faceting on a field that you know only has a small number
  of values (say less than 50), then it is advisable to explicitly set
  this to enum. When faceting on multiple fields, remember to set this
  for the specific fields desired and not universally for all facets.
  The request handler configuration is a good place to put this."
  -- Book: "Solr 1.4 Enterprise Search Server", pag. 148

This is the part of the Solr code which deals with the facet.method
parameter:

  if (enumMethod) {
counts = getFacetTermEnumCounts([...]);
  } else {
if (multiToken) {
  UnInvertedField uif = [...]
  counts = uif.getCounts([...]);
} else {
  [...]
  if (per_segment) {
[...]
counts = ps.getFacetCounts([...]);
  } else {
counts = getFieldCacheCounts([...]);
  }
}
  }
  -- 
https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/java/org/apache/solr/request/SimpleFacets.java


See also:

 - 
http://stackoverflow.com/questions/2902680/how-well-does-solr-scale-over-large-number-of-facet-values


At the end, since I do not know in advance the number of different
values for my fields I went for facet.method=fc; does this seem
reasonable to you?

Thank you,
Paolo


Re: Search within a subset of documents

2010-10-11 Thread Sergey Bartunov
Will it be efficient enough if the subset is really large?

On 11 October 2010 18:39, Gora Mohanty  wrote:
> On Mon, Oct 11, 2010 at 7:00 PM, Sergey Bartunov  wrote:
>> Is it possible to use Solr for searching within a subset of documents
>> represented by enumeration of document IDs?
>
> Couldn't you add the document ID to the query, e.g., if the field is
> called id, you can use ?q=id:<the-id>, e.g., ?q=id:1234. You could
> use a range, etc., to include all the desired IDs.
>
> Regards,
> Gora
>


Re: Search within a subset of documents

2010-10-11 Thread Gora Mohanty
On Mon, Oct 11, 2010 at 7:00 PM, Sergey Bartunov  wrote:
> Is it possible to use Solr for searching within a subset of documents
> represented by enumeration of document IDs?

Couldn't you add the document ID to the query, e.g., if the field is
called id, you can use ?q=id:<the-id>, e.g., ?q=id:1234. You could
use a range, etc., to include all the desired IDs.

Regards,
Gora


Re: KStemmer for Solr

2010-10-11 Thread Ahmet Arslan

> Because I'm using solr from trunk and not from lucid
> imagination
> I was missing KStemmer. So I decided to add this stemmer to
> my installation.
> 
> After some modifications KStemmer is now working fine as
> stand-alone.
> Now I have a KStemmerFilter.
> Next will be to write the KStemmerFilterFactory.
> 
> I would place the Factory in
> "lucene-solr/solr/src/java/org/apache/solr/analysis/"
> to the other Factories, but where to place the Filter?
> 
> Does it make sense to place the Filter somewhere under
> "lucene-solr/modules/analysis/common/src/java/org/apache/lucene/analysis/"
> ?
> But this is for Lucene and not Solr...
> 
> Or should I place the Filter in a subdirectory of the
> Factories?

For this kind of modification you don't need to modify the standard distro.

You can jar these new classes and put that jar into the solrhome/lib directory. 
For more info: http://wiki.apache.org/solr/SolrPlugins
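
A hedged sketch of what such a factory could look like inside the plugin jar
(the package name is arbitrary, KStemmerFilter is the filter class from the
earlier mail, and BaseTokenFilterFactory is assumed to be the
org.apache.solr.analysis base class the bundled factories extend):

    package com.example.analysis;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.solr.analysis.BaseTokenFilterFactory;

    public class KStemmerFilterFactory extends BaseTokenFilterFactory {
        public TokenStream create(TokenStream input) {
            return new KStemmerFilter(input);   // the ported filter
        }
    }

With the jar dropped into solrhome/lib, the factory is then referenced from a
fieldType in schema.xml by its fully qualified class name.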


  


Re: Index time boosting is not working with boosting value in document level

2010-10-11 Thread Ahmet Arslan
> Eric,
>    Score is not coming properly even after
> giving boost value in document
> and field level.
>    Please find the solrconfig.xml,
> schema.xml, data-config.xml, the feed and
> the score & query.
>    Doc with id 'ABCDEF/L' is boosted and doc
> with id 'MA147LL/A' is not
> boosted, but both are returning same score - 0.1942141.
>    Could you please help me to find where I
> did a mistake?

It seems that you are using DIH to index feed.xml. You can directly post 
feed.xml to solr, then your boosts will be taken into account. There is a 
script named post.sh for this purpose.

As Erik said, you can always verify boosts with &debugQuery=on. 
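
If the documents are ever pushed programmatically instead of through DIH, the
boost can be attached with SolrJ as well; a rough sketch (field names follow
the example feed, the boost value is arbitrary, and server is an
already-created SolrServer):

    SolrInputDocument doc = new SolrInputDocument();
    doc.setDocumentBoost(2.0f);             // index-time document boost
    doc.addField("id", "ABCDEF/L");
    doc.addField("name", "Apple 60 GB iPod with Video Playback Black");
    server.add(doc);
    server.commit();

Keep in mind that index-time boosts only show up in the score if the relevant
fields are indexed with omitNorms="false".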





KStemmer for Solr

2010-10-11 Thread Bernd Fehling

Because I'm using Solr from trunk and not from Lucid Imagination,
I was missing KStemmer. So I decided to add this stemmer to my installation.

After some modifications KStemmer is now working fine as stand-alone.
Now I have a KStemmerFilter.
Next will be to write the KStemmerFilterFactory.

I would place the Factory in 
"lucene-solr/solr/src/java/org/apache/solr/analysis/"
to the other Factories, but where to place the Filter?

Does it make sense to place the Filter somewhere under
"lucene-solr/modules/analysis/common/src/java/org/apache/lucene/analysis/" ?
But this is for Lucene and not Solr...

Or should I place the Filter in a subdirectory of the Factories?

Any suggestion for me?

Regards,
Bernd


Re: deleteByQuery issue

2010-10-11 Thread Ahmet Arslan


--- On Mon, 10/11/10, Claudio Atzori  wrote:

> From: Claudio Atzori 
> Subject: deleteByQuery issue
> To: solr-user@lucene.apache.org
> Date: Monday, October 11, 2010, 10:38 AM
>  Hi everybody,
> in my application I use an instance of EmbeddedSolrServer
> (solr 1.4.1), the following snippet shows how I am
> instantiating it:
> 
> >         File home = new File(indexDataPath(solrDataDir, indexName));
> > 
> >         container = new CoreContainer(indexDataPath(solrDataDir, indexName));
> >         container.load(indexDataPath(solrDataDir, indexName), new File(home, "solr.xml"));
> > 
> >         return new EmbeddedSolrServer(container, indexName);
> 
> and I'm going through some issues using deleteByQuery
> method, in fact, when I try to delete a subset of documents,
> or even all the documents from the index, I see as they are
> correctly marked for deletion on the luke inspector 
> (http://code.google.com/p/luke/), but after a commit I
> can still retrieve them, just like they haven't been
> removed...
> 
> I can see the difference and see the documents disappear
> only when I restart my jetty application, but obviously this
> cannot be a feature... any idea?

I think you are accessing the same Solr index using both the embedded server and HTTP.
The changes that you made using the embedded server won't be visible on the HTTP 
side until a commit is issued from HTTP. I mean, if you hit this URL:

http://localhost:8983/solr/update?commit=true

the deleted documents won't be retrieved anymore.

P.S. if you want to expunge deleted docs completely you can either optimize or 
commit with expungeDeletes="true".






Re: How to get Term Frequency

2010-10-11 Thread Ahmet Arslan
> I have a question that how could somebody get term
> frequency as we do get in 
> lucene by the following method DocFreq(new Term("Field",
> "value")); using  solr/solrnet.

You can get term frequency with 
http://wiki.apache.org/solr/TermVectorComponent.

If you are interested in document frequency, you can use 
http://wiki.apache.org/solr/TermsComponent
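
A rough SolrJ sketch of the document-frequency case, assuming the /terms
handler from the example solrconfig is enabled and using an invented field and
prefix (SolrJ's SolrQuery and QueryResponse assumed imported):

    SolrQuery q = new SolrQuery();
    q.setQueryType("/terms");              // route to the TermsComponent handler
    q.set("terms", "true");
    q.set("terms.fl", "Field");
    q.set("terms.prefix", "value");        // terms starting with "value"
    q.set("terms.limit", "10");
    QueryResponse rsp = server.query(q);   // doc counts come back under the "terms" section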


  


Search within a subset of documents

2010-10-11 Thread Sergey Bartunov
Is it possible to use Solr for searching within a subset of documents
represented by an enumeration of document IDs?


Re: Index time boosting is not working with boosting value in document level

2010-10-11 Thread Shanmugavel SRD

Eric,
   The score is not coming out properly even after giving a boost value at document
and field level.
   Please find below the solrconfig.xml, schema.xml, data-config.xml, the feed and
the score & query.
   Doc with id 'ABCDEF/L' is boosted and doc with id 'MA147LL/A' is not
boosted, but both are returning the same score - 0.1942141.
   Could you please help me find where I made a mistake?

solrconfig.xml, data-config.xml, schema.xml

  [the XML content of these files was not preserved by the mailing-list archive;
   of the schema only the field values "id" and "name" survive]

feed

  [feed XML markup not preserved; the four documents it contained were the
   following, with 'ABCDEF/L' being the boosted one:]

  F8V7067-APL-KIT   Belkin Mobile Power Cord for iPod w/ Dock
  IW-02             iPod & iPod Mini USB 2.0 Cable
  MA147LL/A         Apple 60 GB iPod with Video Playback Black
  ABCDEF/L          Apple 60 GB iPod with Video Playback Black




Query & Response

http://localhost:8080/solr/core0/select/?q=ipod&version=2.2&start=0&rows=10&indent=on&fl=score



  [response XML markup not preserved; responseHeader: status 0, QTime 15; the
   four documents returned were:]

  score 0.27466023   IW-02             iPod & iPod Mini USB 2.0 Cable
  score 0.24276763   F8V7067-APL-KIT   Belkin Mobile Power Cord for iPod w/ Dock
  score 0.1942141    MA147LL/A         Apple 60 GB iPod with Video Playback Black
  score 0.1942141    ABCDEF/L          Apple 60 GB iPod with Video Playback Black




-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-time-boosting-is-not-working-with-boosting-value-in-document-level-tp1649072p1680215.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr start in server

2010-10-11 Thread Yavuz Selim YILMAZ
I solved it with: nohup java -jar start.jar &
Thnx.
--

Yavuz Selim YILMAZ


2010/10/11 Gora Mohanty 

> On Mon, Oct 11, 2010 at 1:23 PM, Yavuz Selim YILMAZ
>  wrote:
> > I use AIX 5.3.
> >
> > How can I handle?
> [...]
>
> Have not used AIX in ages, but this should work, assuming a sh-type of
> shell:
>  java -jar start.jar > jetty_log.txt 2>&1 &
> This will save the output from Jetty to jetty_log.txt. If you do not want
> to
> save the output (the file might get quite large depending on your usage),
> you can use
>  java -jar start.jar > /dev/null 2>&1 &
>
> Regards,
> Gora
>


Re: Tuning Solr caches with high commit rates (NRT)

2010-10-11 Thread Anders Melchiorsen
Hi,

why do you need to change the lockType? Does a readonly instance need
locks at all?


thanks,
Anders.



On Tue, 14 Sep 2010 15:00:54 +0200, Peter Karich  wrote:
> Peter Sturge,
>
> this was a nice hint, thanks again! If you are here in Germany anytime I
> can invite you to a beer or an apfelschorle ! :-)
> I only needed to change the lockType to none in the solrconfig.xml,
> disable the replication and set the data dir to the master data dir!
>
> Regards,
> Peter Karich.
>
>> Hi Peter,
>>
>> this scenario would be really great for us - I didn't know that this is
>> possible and works, so: thanks!
>> At the moment we are doing similar with replicating to the readonly
>> instance but
>> the replication is somewhat lengthy and resource-intensive at this
>> datavolume ;-)
>>
>> Regards,
>> Peter.
>>
>>
>>> 1. You can run multiple Solr instances in separate JVMs, with both
>>> having their solr.xml configured to use the same index folder.
>>> You need to be careful that one and only one of these instances will
>>> ever update the index at a time. The best way to ensure this is to use
>>> one for writing only,
>>> and the other is read-only and never writes to the index. This
>>> read-only instance is the one to use for tuning for high search
>>> performance. Even though the RO instance doesn't write to the index,
>>> it still needs periodic (albeit empty) commits to kick off
>>> autowarming/cache refresh.
>>>
>>> Depending on your needs, you might not need to have 2 separate
>>> instances. We need it because the 'write' instance is also doing a lot
>>> of metadata pre-write operations in the same jvm as Solr, and so has
>>> its own memory requirements.
>>>
>>> 2. We use sharding all the time, and it works just fine with this
>>> scenario, as the RO instance is simply another shard in the pack.
>>>
>>>
>>> On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich 
wrote:
>>>
>>>
 Peter,

 thanks a lot for your in-depth explanations!
 Your findings will be definitely helpful for my next performance
 improvement tests :-)

 Two questions:

 1. How would I do that:



> or a local read-only instance that reads the same core as the indexing
> instance (for the latter, you'll need something that periodically
> refreshes - i.e. runs commit()).
>
>
 2. Did you try sharding with your current setup (e.g. one big,
 nearly-static index and a tiny write+read index)?

 Regards,
 Peter.



> Hi,
>
> Below are some notes regarding Solr cache tuning that should prove
> useful for anyone who uses Solr with frequent commits (e.g. <5min).
>
> Environment:
> Solr 1.4.1 or branch_3x trunk.
> Note the 4.x trunk has lots of neat new features, so the notes here
> are likely less relevant to the 4.x environment.
>
> Overview:
> Our Solr environment makes extensive use of faceting, we perform
> commits every 30secs, and the indexes tend to be on the large-ish side
> (>20million docs).
> Note: For our data, when we commit, we are always adding new data,
> never changing existing data.
> This type of environment can be tricky to tune, as Solr is more geared
> toward fast reads than frequent writes.
>
> Symptoms:
> If anyone has used faceting in searches where you are also performing
> frequent commits, you've likely encountered the dreaded OutOfMemory or
> GC Overhead Exceeded errors.
> In high commit rate environments, this is almost always due to
> multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
> finish autowarming their caches before the next commit()
> comes along and invalidates them.
> Once this starts happening on a regular basis, it is likely your
> Solr's JVM will run out of memory eventually, as the number of
> searchers (and their cache arrays) will keep growing until the JVM
> dies of thirst.
> To check if your Solr environment is suffering from this, turn on INFO
> level logging, and look for: 'PERFORMANCE WARNING: Overlapping
> onDeckSearchers=x'.
>
> In tests, we've only ever seen this problem when using faceting, and
> facet.method=fc.
>
> Some solutions to this are:
>   - Reduce the commit rate to allow searchers to fully warm before the
>     next commit
>   - Reduce or eliminate the autowarming in caches
>   - Both of the above
>
> The trouble is, if you're doing NRT commits, you likely have a good
> reason for it, and reducing/eliminating autowarming will very
> significantly impact search performance in high commit rate
> environments.
>
> Solution:
> Here are some setup steps we've used that allow lots of faceting (we
> typically search with at least 20-35 different facet fields, and date
> faceting/sorting) on large indexes, and still keep decent search
> performance:
>
> 1. Firstly, you 

How to get Term Frequency

2010-10-11 Thread Ahson Iqbal
hi All

I have a question: how can somebody get term frequency in Solr/SolrNet, as we 
do in Lucene with the following method: DocFreq(new Term("Field", "value"));



  

Re: Multiple masters and replication between masters?

2010-10-11 Thread Arunkumar Ayyavu
Thanks Otis. That was helpful.

On Mon, Oct 11, 2010 at 9:19 AM, Otis Gospodnetic
 wrote:
> Arun,
>
> Yes, changing the solrconfig.xml to point to the new master could require a
> restart.
> However, if you use logical addresses (VIPs in the Load Balancer or even local
> hostname aliases if you don't have a LB) then you just need to point those
> VIPs/aliases to new IPs and the Solr slave won't have to be restarted.
>
>
> Otis
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
>> From: Arunkumar Ayyavu 
>> To: solr-user@lucene.apache.org
>> Sent: Sun, October 10, 2010 1:57:34 PM
>> Subject: Re: Multiple masters and replication between masters?
>>
>> On Mon, Oct 4, 2010 at 4:58 PM, Upayavira  wrote:
>> > On Mon,  2010-10-04 at 00:25 +0530, Arunkumar Ayyavu wrote:
>> >> I'm looking at  setting up multiple masters for redundancy (for index
>> >> updates). I  found the thread in this link
>> >>
>>(http://www.lucidimagination.com/search/document/68ac303ce8425506/multiple_masters_solr_replication_1_4)
>>
>> >>  discussed this subject more than a year back. Does Solr support  such
>> >> configuration today?
>> >
>> > Solr does not support  master/master replication. When you commit
>> > documents to SOLR, it adds a  segment to the underlying Lucene index.
>> > Replication then syncs that  segment to your slaves. To do master/master
>> > replication, you would have  to pull changes from each master, then merge
>> > those changed segments into  a single updated index. This is more complex
>> > than what is happening in  the current Solr replication (which is not
>> > much more than an rsync of  the index files).
>> >
>> > Note, if you commit your changes to two  masters, you cannot switch a
>> > slave between them, as it is unlikely that  the two masters will have
>> > matching index files. If you did so, you would  probably trigger a pull
>> > of the entire index across the network, which  (while it would likely
>> > work) would not be the most efficient  action.
>> >
>> > What you can do is think cleverly about how you organise  your
>> > master/slave setup. E.g. have a slave that doesn't get queried,  but
>> > exists to take over the role of the master in case it fails. The  index
>> > on a slave is the same as that in a master, and can immediately  take on
>> > the role of the master (receiving commits), and upon failure of  your
>> > master, you could point your other slaves at this new master, and  things
>> > should just carry on as before.
>> Wouldn't this require restart  of Solr instances?
>>
>> Sorry, I couldn't respond to you earlier as I wasn't  checking my mails
>> for sometime.
>>
>> >
>> > Also, if you have a lot  of slaves (such that they are placing too big a
>> > load on your master),  insert intermediate hosts that are both slaves off
>> > the master, and  masters to your query slaves. That way, you could have,
>> > say, two boxes  slaving off the master, then 20 or 30 slaving off them.
>> >
>> >> And  does Solr support replication between masters? Otherwise, I'll
>> >> have  to post the updates to all masters to keep the indexes of masters
>> >> in  sync. Does SolrCloud address this case? (Please note it is too
>> >> early  for me to read about SolrCloud as I'm still learning Solr)
>> >
>> > I  don't believe SolrCloud is aiming to support master/master
>> >  replication.
>> >
>> > HTH
>> >
>> >  Upayavira
>> >
>> >
>> >
>>
>>
>>
>> --
>> Arun
>>
>



-- 
Arun


Re: How to manage different indexes for different users

2010-10-11 Thread Li Li
will one user search other user's index?
if not, you can use multi cores.

2010/10/11 Tharindu Mathew :
> Hi everyone,
>
> I'm using solr to integrate search into my web app.
>
> I have a bunch of users who would have to be given their own individual
> indexes.
>
> I'm wondering whether I'd have to append their user ID as I index a file.
> I'm not sure which approach to follow. Is there a sample or a doc I can read
> to understand how to approach this problem?
>
> Thanks in advance.
>
> --
> Regards,
>
> Tharindu
>


question about SolrCore

2010-10-11 Thread Li Li
hi all,
    I want to understand the details of IndexReader usage in SolrCore. I read a
little of the SolrCore code. Here is my understanding; is it correct?
    Each SolrCore has many SolrIndexSearchers and keeps them in
_searchers, and _searcher keeps track of the latest version of the index.
Each SolrIndexSearcher has a SolrIndexReader. If there isn't any
update, all these searchers share one single SolrIndexReader. If there
is an update, then a newSearcher will be created with a new
SolrIndexReader associated with it.
    I did a simple test.
    A thread runs a query and is blocked by a breakpoint. Then I feed some
data to update the index. On commit, a newSearcher is created.
Here is the debug info:

SolrCore _searcher [solrindexsearc...@...ab]

_searchers[solrindexsearc...@...77,solrindexsearc...@...ab,solrindexsearc...@..f8]
 solrindexsearc...@...77 's SolrIndexReader is the old one,
and ab and f8 share the same newest SolrIndexReader.
    When the query finishes, solrindexsearc...@...77 is discarded. When
the newSearcher succeeds in warming up, there is only one SolrIndexSearcher.
The SolrIndexReader of the old version of the index is discarded and only
segments in the newest SolrIndexReader are referenced. Those segments not
in the new version can then be deleted because no file pointer references them.
    Then I start 3 queries. There is only one SolrIndexSearcher but RefCount=4.
It seems many searches can share one single SolrIndexSearcher.
    So in which situation will there exist more than one
SolrIndexSearcher sharing just one SolrIndexReader?
    Another question: for each version of the index, is there just one
SolrIndexReader instance associated with it? Can it happen that more
than one SolrIndexReader is opened for the same version of the
index?


How to manage different indexes for different users

2010-10-11 Thread Tharindu Mathew
Hi everyone,

I'm using solr to integrate search into my web app.

I have a bunch of users who would have to be given their own individual
indexes.

I'm wondering whether I'd have to append their user ID as I index a file.
I'm not sure which approach to follow. Is there a sample or a doc I can read
to understand how to approach this problem?

Thanks in advance.

-- 
Regards,

Tharindu


Re: Solr start in server

2010-10-11 Thread Gora Mohanty
On Mon, Oct 11, 2010 at 1:23 PM, Yavuz Selim YILMAZ
 wrote:
> I use AIX 5.3.
>
> How can I handle?
[...]

Have not used AIX in ages, but this should work, assuming a sh-type of
shell:
  java -jar start.jar > jetty_log.txt 2>&1 &
This will save the output from Jetty to jetty_log.txt. If you do not want to
save the output (the file might get quite large depending on your usage),
you can use
  java -jar start.jar > /dev/null 2>&1 &

Regards,
Gora


Re: Problem with Indexing

2010-10-11 Thread Jörg Agatz
OK, I have tried it.. and now I get this error:

POSTing file e067f59c-d046-11df-b552-000c29e17baa_SEARCH.xml
SimplePostTool: FATAL: Solr returned an error:
this writer hit an OutOfMemoryError; cannot flush -- java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot flush
  at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4204)
  at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4192)
  at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4183)
  at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2647)
  at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2601)
  at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:241)
  at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
  at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
  at org.mortbay.jetty.Server.handle(Server.java:285)
  at ...

I don't know how I can index a lot of XML (fast).


Re: Solr start in server

2010-10-11 Thread Yavuz Selim YILMAZ
I use AIX 5.3.

How can I handle this?
--

Yavuz Selim YILMAZ


2010/10/11 Gora Mohanty 

> On Mon, Oct 11, 2010 at 1:09 PM, Yavuz Selim YILMAZ
>  wrote:
> > I have a solr installation on a server. I start it with the help of putty
> (
> > with the start.jar). But when I close the putty instance, automatically
> solr
> > instance also closes. How can I solve this problem? I mean, I close
> > connection with server, but solr instance still runs?
> [...]
>
> What operating system is the server running? You will have to put the job
> in the background. For some operating systems/shells, you also have to
> configure things so that background jobs are not killed on logging out.
>
> Regards,
> Gora
>


Re: Solr start in server

2010-10-11 Thread Gora Mohanty
On Mon, Oct 11, 2010 at 1:09 PM, Yavuz Selim YILMAZ
 wrote:
> I have a solr installation on a server. I start it with the help of putty (
> with the start.jar). But when I close the putty instance, automatically solr
> instance also closes. How can I solve this problem? I mean, I close
> connection with server, but solr instance still runs?
[...]

What operating system is the server running? You will have to put the job
in the background. For some operating systems/shells, you also have to
configure things so that background jobs are not killed on logging out.

Regards,
Gora


Solr start in server

2010-10-11 Thread Yavuz Selim YILMAZ
I have a Solr installation on a server. I start it over PuTTY (with
start.jar). But when I close the PuTTY session, the Solr instance also
closes automatically. How can I solve this problem? I mean, how can I close
the connection with the server but keep the Solr instance running?
--

Yavuz Selim YILMAZ


deleteByQuery issue

2010-10-11 Thread Claudio Atzori

 Hi everybody,
in my application I use an instance of EmbeddedSolrServer (solr 1.4.1), 
the following snippet shows how I am instantiating it:



File home = new File(indexDataPath(solrDataDir, indexName));

container = new CoreContainer(indexDataPath(solrDataDir, indexName));
container.load(indexDataPath(solrDataDir, indexName), new File(home, "solr.xml"));

return new EmbeddedSolrServer(container, indexName);


and I'm going through some issues with the deleteByQuery method. In fact, 
when I try to delete a subset of documents, or even all the documents 
from the index, I can see that they are correctly marked for deletion in the 
Luke inspector (http://code.google.com/p/luke/), but after a commit I 
can still retrieve them, just as if they hadn't been removed...


I can see the difference and see the documents disappear only when I 
restart my jetty application, but obviously this cannot be a feature... 
any idea?




Re: Solr PHP PECL Extension going to Stable Release - Wishing for Any New Features?

2010-10-11 Thread Lukas Kahwe Smith

On 11.10.2010, at 07:03, Israel Ekpo wrote:

> I am currently working on a couple of bug fixes for the Solr PECL extension
> that will be available in the next release 0.9.12 sometime this month.
> 
> http://pecl.php.net/package/solr
> 
> Documentation of the current API and features for the PECL extension is
> available here
> 
> http://www.php.net/solr
> 
> A couple of users in the community were asking when the PHP extension will
> be moving from beta to stable.
> 
> The API looks stable so far with no serious issues and I am looking to
> moving it from *Beta* to *Stable *on November 20 2010
> 
> If you are using Solr via PHP and would like to see any new features in the
> extension please feel free to send me a note.
> 
> I would like to incorporate those changes in 0.9.12 so that user can try
> them out and send me some feedback before the release of version 1.0
> 
> Thanks in advance for your response.


we already had some emails about this.
imho there are so many methods for specialized tasks that it's easy to get 
lost in the API, especially since not all of them have written documentation 
yet beyond the method signatures.

also, i do think that there should be methods for escaping and also tokenizing 
lucene queries to enable "validation" of the syntax used, etc.

see here for a use case and a user land implementation:
http://pooteeweet.org/blog/1796

regards,
Lukas Kahwe Smith
m...@pooteeweet.org