Re: old searchers not closing after optimize or replication

2011-04-20 Thread Bernd Fehling

Hi Erick,



Due to the 44-minute optimization time we only optimize once a day,
during the night.

I will try with a smaller index on my development system.

Best regards,
Bernd


On 20.04.2011 17:50, Erick Erickson wrote:

It looks OK, but still doesn't explain keeping the old files around. What does
your <deletionPolicy> in your solrconfig.xml look like? It's
possible that you're seeing Solr attempt to keep around several
optimized copies of the index, but that still doesn't explain why
restarting Solr removes them unless the deletionPolicy gets invoked
at some point and your index files are aging out (I don't know the
internals of deletion well enough to say).
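
For reference, the <deletionPolicy> section in the example solrconfig.xml
looks roughly like this (values here are illustrative, not necessarily
Bernd's actual settings):

  <deletionPolicy class="solr.SolrDeletionPolicy">
    <!-- keep only the most recent commit point -->
    <str name="maxCommitsToKeep">1</str>
    <!-- keep no extra optimized commit points -->
    <str name="maxOptimizedCommitsToKeep">0</str>
    <!-- optionally age out commit points, e.g.
    <str name="maxCommitAge">1DAY</str> -->
  </deletionPolicy>

A maxOptimizedCommitsToKeep greater than 0, or a maxCommitAge that hasn't
elapsed yet, would explain old index files hanging around until a later
deletion pass.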

About optimization: it's become less important with recent code. Once
upon a time it made a substantial difference in search speed. More
recently it has very little impact on search speed, and is used
much more sparingly. Its greatest benefit is reclaiming unused resources
left over from deleted documents. So you might want to avoid the pain
of optimizing (44 minutes!) and only optimize rarely, or if you have
deleted a lot of documents.

It might be worthwhile to try (with a smaller index!) a bunch of optimize
cycles and see if the <deletionPolicy> idea has any merit. I'd expect
your index to reach a maximum size and stay there once the number of saved
copies of the index is reached...

But otherwise I'm puzzled...

Erick

On Wed, Apr 20, 2011 at 10:30 AM, Bernd Fehling
  wrote:

Hi Erick,

On 20.04.2011 15:42, Erick Erickson wrote:


Hmmm, this isn't right. You've pretty much eliminated the obvious
things. What does lsof show? I'm assuming it shows the files are
being held open by your Solr instance, but it's worth checking.


Just committed new content 3 times and finally optimized.
Again old index files are left behind.

Then I checked on my master: only the newest version of the index files is
listed with lsof. There are no file handles to the old index files, but the
old index files remain in data/index/.
That's strange.

This time replication worked fine and cleaned up old index on slaves.



I'm not getting the same behavior, admittedly on a Windows box.
The only other thing I can think of is that you have a query that's
somehow never ending, but that's grasping at straws.

Do your log files show anything interesting?


Lets see:
- it has the old generation (generation=12) and its files
- and recognizes that there have been several commits (generation=18)

20.04.2011 14:05:26 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start
commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2

  
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]

commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1302159868447


- after 44 minutes of optimizing (over 140GB and 27.8 million docs) it gets
  the SolrDeletionPolicy onCommit and has the new generation 19 listed.


20.04.2011 14:49:25 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=3

  
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]

commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]

  
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_j,version=1302159868449,

Re: Need to create dynamic indexes based on different document workspaces

2011-04-20 Thread Chandan Tamrakar
It depends on your application design how you want your indexes organized.


There is a feature called Solr cores: http://wiki.apache.org/solr/CoreAdmin
You could also keep a single index, with a field to differentiate the items
in the index.
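
For example (a sketch - the core names and the field are hypothetical): each
workspace can be its own core, created through the CoreAdmin handler:

  http://localhost:8983/solr/admin/cores?action=CREATE&name=workspaceA&instanceDir=workspaceA

Or, with a single shared index, every document carries a workspace field and
each query filters on it:

  q=some+query&fq=workspace:workspaceA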

thanks


On Thu, Apr 21, 2011 at 10:55 AM, Gaurav Shingala <
gaurav.shing...@hotmail.com> wrote:

>
>
>
>
> Hi,
>
> Is there a way to create different solr indexes for different categories?
> We have different document workspaces and ideally want each workspace to
> have its own solr index.
>
> Thanks,
> Gaurav
>




-- 
Chandan Tamrakar


Need to create dynamic indexes based on different document workspaces

2011-04-20 Thread Gaurav Shingala




Hi,

Is there a way to create different solr indexes for different categories? 
We have different document workspaces and ideally want each workspace to have 
its own solr index.

Thanks,
Gaurav
  

Re: Apache Spam Filter Blocking Messages

2011-04-20 Thread Marvin Humphrey
On Thu, Apr 21, 2011 at 12:30:29AM -0400, Trey Grainger wrote:
> (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL

Note the "HTML_MESSAGE" in the list of things SpamAssassin didn't like.

> Apparently I sound like spam when I write perfectly good English and include
> some xml and a link to a jira ticket in my e-mail (I tried a couple
> different variations).  Anyone know a way around this filter, or should I
> just respond to those involved in the e-mail chain directly and avoid the
> mailing list?

Send plain text email instead of HTML.  That solves the problem 99% of the
time.

Marvin Humphrey



Apache Spam Filter Blocking Messages

2011-04-20 Thread Trey Grainger
Hey (solr-user) mailing list admins,

I've tried replying to a thread multiple times tonight, and keep getting a
bounce-back with this response:
Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the recipient
domain. We recommend contacting the other email provider for further
information about the cause of this error. The error that the other server
returned was: 552 552 spam score (5.1) exceeded threshold
(FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
(state 18).

Apparently I sound like spam when I write perfectly good English and include
some xml and a link to a jira ticket in my e-mail (I tried a couple
different variations).  Anyone know a way around this filter, or should I
just respond to those involved in the e-mail chain directly and avoid the
mailing list?

Thanks,

-Trey


The issue of import data from database using Solr DIH

2011-04-20 Thread Kevin Xiang
Hi all,
I am new to Solr. I am importing data from a database using DIH (Solr
1.4). One document is made up of two entities; each entity is a table in
the database.
For example:
Table1: has 3 fields;
Table2: has 4 fields;
If it worked correctly, the document would have 7 fields.
But it has only 4 fields; it seems that Solr doesn't merge the fields, and
table2 overwrites table1.
The key is OS06Y.
The configuration of db-data-config.xml is the following:

[the data-config.xml was stripped by the list archiver]
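For reference, a config that produces one document with fields from both
tables typically nests the second entity inside the first. A sketch (the
table and column names here are hypothetical, since the original config was
stripped):

  <dataConfig>
    <dataSource driver="..." url="..." user="..." password="..."/>
    <document>
      <entity name="t1" query="SELECT skey, f1, f2 FROM Table1">
        <!-- nested entity: its fields merge into the parent document -->
        <entity name="t2" query="SELECT f3, f4, f5, f6 FROM Table2
                                 WHERE skey = '${t1.skey}'"/>
      </entity>
    </document>
  </dataConfig>

Two sibling root entities would instead produce separate documents; if both
emit the same uniqueKey, the second overwrites the first, which matches the
4-fields-instead-of-7 symptom described above.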
Has anyone come across this issue?
Any suggestions on how to fix this issue are much appreciated.
Thanks.


RE: Creating a TrieDateField (and other Trie fields) from Lucene Java

2011-04-20 Thread Craig Stires

Hi Yonik,

The limitations I need to work within, have to do with the index already
being built as part of an existing process.

Currently, the Solr server is in read-only mode and receives new indexes
daily from a Java application.  The Java app runs Lucene/Tika and is
indexing resources within the local network.  It builds off of a different
schema framework, then moves the finished indexes over to the Solr
deployment path.  The Solr server swaps over at that point.  The Solr server
isn't the only consumer of the indexes.  There are other Java apps which
read/write to the Lucene index, during the staging process.

This was working without issues when the types used were part of Lucene core
(String, Boolean, Integer, etc.), because they just resolved to Strings.
But the TrieDateField works off of byte data, so I needed to find a way to
create those fields using the existing classes.
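
A sketch of what that can look like in plain Lucene 3.x (assumptions: the
Solr schema defines the field as a TrieDateField with precisionStep 8, and
the field name "timestamp" is made up - match both to the real schema, and
verify against your Solr version that the indexed form lines up):

  import java.util.Date;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.NumericField;

  // TrieDateField indexes the date as a numeric long (ms since epoch),
  // so NumericField can write a compatible indexed value:
  Document doc = new Document();
  doc.add(new NumericField("timestamp", 8, Field.Store.YES, true)
      .setLongValue(new Date().getTime()));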

Thanks,
-Craig
 


-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Wednesday, 20 April 2011 11:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Creating a TrieDateField (and other Trie fields) from Lucene
Java

On Tue, Apr 19, 2011 at 11:17 PM, Craig Stires 
wrote:
> The barrier I have is that I need to build this offline (without using a
> solr server, solrconfig.xml, or schema.xml)

This is pretty unusual... can you share your use case?
Solr can also be run in embedded mode if you can't run a stand-alone
server for some reason.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco



Solr - upgrade from 1.4.1 to 3.1 - finding AbstractSolrTestCase binaries - help please?

2011-04-20 Thread Bob Sandiford
HI, all.

I'm working on upgrading from 1.4.1 to 3.1, and I'm having some troubles with 
some of the unit test code for our custom Filters.  We wrote the tests to 
extend AbstractSolrTestCase, and I've been reading the thread about the 
test-harness elements not being present in the 3.1 distributables. [1]

So, I have checked out the 3.1 branch code and built that (ant 
generate-maven-artifacts), and I've found the 
lucene-test-framework-3.1-xxx.jar(s).  However, these contain only the lucene 
level framework elements, and none of the solr ones.

Did the solr test framework actually get built and embedded in one of the solr 
jars somewhere?  Or, if not, is there some way to build a jar that contains the 
solr portion of the test harnesses?

[1] SOLR-2061 Generate jar 
containing test classes.
Thanks!

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
my documents are user entries, so i'm guessing they vary a lot.
Tomorrow i'll try 3.1 and also 4.0, and see if they bring an improvement.
thanks guys!

On Thu, Apr 21, 2011 at 3:02 AM, Yonik Seeley wrote:

> On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort  wrote:
> > Thanks
> > but i've disabled the cache already, since my concern is speed and i'm
> > willing to pay the price (memory)
>
> Then you should not disable the cache.
>
> >, and my subsets are not fixed.
> > Does the facet search do any extra work that i don't need, that i might
> be
> > able to disable (either by a flag or by a code change),
> > Somehow i feel, or rather hope, that counting the terms of 200K documents
> > and finding the top 500 should take less than 30 seconds.
>
> Using facet.enum.cache.minDf should be a little faster than just
> disabling the cache - it's a different code path.
> Using the cache selectively will speed things up, so try setting that
> minDf to 1000 or so for example.
>
> How many unique terms do you have in the index?
> Is this Solr 3.1? There were some optimizations for when there are many
> terms to iterate over.
> You could also try trunk, which has even more optimizations, or the
> bulkpostings branch if you really want to experiment.
>
> -Yonik
>


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Yonik Seeley
On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort  wrote:
> Thanks
> but i've disabled the cache already, since my concern is speed and i'm
> willing to pay the price (memory)

Then you should not disable the cache.

>, and my subsets are not fixed.
> Does the facet search do any extra work that i don't need, that i might be
> able to disable (either by a flag or by a code change),
> Somehow i feel, or rather hope, that counting the terms of 200K documents
> and finding the top 500 should take less than 30 seconds.

Using facet.enum.cache.minDf should be a little faster than just
disabling the cache - it's a different code path.
Using the cache selectively will speed things up, so try setting that
minDf to 1000 or so for example.
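
For example (assembled from the params Ofer posted - only the minDf
parameter is new):

  q=in_subset:1&rows=0&facet=true&facet.field=text
    &facet.limit=500&facet.sort=count
    &facet.method=enum&facet.enum.cache.minDf=1000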

How many unique terms do you have in the index?
Is this Solr 3.1? There were some optimizations for when there are many
terms to iterate over.
You could also try trunk, which has even more optimizations, or the
bulkpostings branch if you really want to experiment.

-Yonik


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
BTW,
i'm using solr 1.4.1, does 3.1 or 4.0 contain any performance improvements
that will make a difference as far as facet search goes?
thanks again
Ofer

On Thu, Apr 21, 2011 at 2:45 AM, Ofer Fort  wrote:

> Thanks
> but i've disabled the cache already, since my concern is speed and i'm
> willing to pay the price (memory), and my subsets are not fixed.
> Does the facet search do any extra work that i don't need, that i might be
> able to disable (either by a flag or by a code change),
> Somehow i feel, or rather hope, that counting the terms of 200K documents
> and finding the top 500 should take less than 30 seconds.
>
>
>
> On Thu, Apr 21, 2011 at 2:41 AM, Yonik Seeley 
> wrote:
>
>> On Wed, Apr 20, 2011 at 7:34 PM, Chris Hostetter
>>  wrote:
>> >
>> > : thanks, but that's what i started with, but it took an even longer
>> time and
>> > : threw this:
>> > : Approaching too many values for UnInvertedField faceting on field
>> 'text' :
>> > : bucket size=15560140
>> > : Approaching too many values for UnInvertedField faceting on field
>> 'text :
>> > : bucket size=15619075
>> > : Exception during facet counts:org.apache.solr.common.SolrException:
>> Too many
>> > : values for UnInvertedField faceting on field text
>> >
>> > right ... facet.method=fc is a good default, but cases like full text
>> > faceting can cause it to seriously blow up the memory ... i didn't even
>> > realize it was possible to get it to fail this way, i would have just
>> > expected an OutOfMemoryException.
>> >
>> > facet.method=enum is probably your best bet in this situation precisely
>> > because it does a linear scan over the terms ... it's slower because
>> > it's safer.
>> >
>> > the one speed up you might be able to get is to ensure you don't use the
>> > filterCache -- that way you don't waste time constantly
>> > caching/overwriting DocSets
>>
>> Right - or only using filterCache for high df terms via
>> http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf
>>
>> -Yonik
>> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
>> 25-26, San Francisco
>>
>
>


Re: How to return score without using _val_

2011-04-20 Thread Yonik Seeley
On Tue, Apr 19, 2011 at 11:41 PM, Bill Bell  wrote:
> I would like to influence the score but I would rather not mess with the q=
> field since I want the query to dismax for Q.
>
> Something like:
>
> fq={!type=dismax qf=$qqf v=$qspec}&
> fq={!type=dismax qt=dismaxname v=$qname}&
> q=_val_:"{!type=dismax qf=$qqf  v=$qspec}" _val_:"{!type=dismax
> qt=dismaxname v=$qname}"
>
> Is there a way to do a filter and add the FQ to the score by doing it
> another way?
>
> Also does this do multiple queries? Is this the right way to do it?

I really don't understand what you're trying to do...
Backing up, you say you want to influence the score,  but I can't
figure out how you would like to influence the score.

Do you want to:
 - add the score of another query to the main dismax query (use "bq")
 - multiply the main dismax score by another query (use edismax along
with boost, or the boost query type)
 - something else?
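
For instance (sketches using the standard dismax/edismax params; the field
names are hypothetical):

  # additive: bq adds the boost query's score to the dismax score
  q=foo&defType=dismax&qf=name&bq=category:electronics^5

  # multiplicative: edismax's boost multiplies the score by a function
  q=foo&defType=edismax&qf=name&boost=log(popularity)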

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
Thanks
but i've disabled the cache already, since my concern is speed and i'm
willing to pay the price (memory), and my subsets are not fixed.
Does the facet search do any extra work that i don't need, that i might be
able to disable (either by a flag or by a code change),
Somehow i feel, or rather hope, that counting the terms of 200K documents
and finding the top 500 should take less than 30 seconds.


On Thu, Apr 21, 2011 at 2:41 AM, Yonik Seeley wrote:

> On Wed, Apr 20, 2011 at 7:34 PM, Chris Hostetter
>  wrote:
> >
> > : thanks, but that's what i started with, but it took an even longer time
> and
> > : threw this:
> > : Approaching too many values for UnInvertedField faceting on field
> 'text' :
> > : bucket size=15560140
> > : Approaching too many values for UnInvertedField faceting on field 'text
> :
> > : bucket size=15619075
> > : Exception during facet counts:org.apache.solr.common.SolrException: Too
> many
> > : values for UnInvertedField faceting on field text
> >
> > right ... facet.method=fc is a good default, but cases like full text
> > faceting can cause it to seriously blow up the memory ... i didn't even
> > realize it was possible to get it to fail this way, i would have just
> > expected an OutOfMemoryException.
> >
> > facet.method=enum is probably your best bet in this situation precisely
> > because it does a linear scan over the terms ... it's slower because
> > it's safer.
> >
> > the one speed up you might be able to get is to ensure you don't use the
> > filterCache -- that way you don't waste time constantly
> > caching/overwriting DocSets
>
> Right - or only using filterCache for high df terms via
> http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Yonik Seeley
On Wed, Apr 20, 2011 at 7:34 PM, Chris Hostetter
 wrote:
>
> : thanks, but that's what i started with, but it took an even longer time and
> : threw this:
> : Approaching too many values for UnInvertedField faceting on field 'text' :
> : bucket size=15560140
> : Approaching too many values for UnInvertedField faceting on field 'text :
> : bucket size=15619075
> : Exception during facet counts:org.apache.solr.common.SolrException: Too many
> : values for UnInvertedField faceting on field text
>
> right ... facet.method=fc is a good default, but cases like full text
> faceting can cause it to seriously blow up the memory ... i didn't even
> realize it was possible to get it to fail this way, i would have just
> expected an OutOfMemoryException.
>
> facet.method=enum is probably your best bet in this situation precisely
> because it does a linear scan over the terms ... it's slower because it's
> safer.
>
> the one speed up you might be able to get is to ensure you don't use the
> filterCache -- that way you don't waste time constantly caching/overwriting
> DocSets

Right - or only using filterCache for high df terms via
http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Chris Hostetter

: thanks, but that's what i started with, but it took an even longer time and
: threw this:
: Approaching too many values for UnInvertedField faceting on field 'text' :
: bucket size=15560140
: Approaching too many values for UnInvertedField faceting on field 'text :
: bucket size=15619075
: Exception during facet counts:org.apache.solr.common.SolrException: Too many
: values for UnInvertedField faceting on field text

right ... facet.method=fc is a good default, but cases like full text
faceting can cause it to seriously blow up the memory ... i didn't even
realize it was possible to get it to fail this way, i would have just
expected an OutOfMemoryException.

facet.method=enum is probably your best bet in this situation precisely
because it does a linear scan over the terms ... it's slower because it's
safer.

the one speed up you might be able to get is to ensure you don't use the
filterCache -- that way you don't waste time constantly caching/overwriting
DocSets

and FWIW...

: > If facet search is not the correct approach, i thought about using
: > something
: > like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how to do this
: > in solr. Should i implement a request handler that executes this kind of

HighFreqTerms just looks at the raw docfreq for the terms, nearly 
identical to the TermsComponent -- there is no way to deal with your 
"subset of documents" requrements using an approach like that.

If the number of subsets you have to deal with is fixed, finite, and
non-overlapping, using distinct cores for each subset (which you can 
aggregate using distributed search when you don't want this type of query) 
can also be a wise choice in many situations

(ie: if you have a "books" core and a "movies" core you can search both 
using distributed search, or hit the terms component on just one of them 
to get the top terms for that core)
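
To make that concrete (hypothetical host and core names): a distributed
search across both cores looks like

  http://localhost:8983/solr/books/select?q=foo&shards=localhost:8983/solr/books,localhost:8983/solr/movies

while the top terms for a single core come from the TermsComponent:

  http://localhost:8983/solr/books/terms?terms.fl=text&terms.limit=500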

-Hoss


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
seems like the facet search is not all that suited for a full text field. (
http://search.lucidimagination.com/search/document/178f1a82ff19070c/solr_severe_error_when_doing_a_faceted_search#16562790cda76197
)

Maybe i should go another direction. I think i'll try the HighFreqTerms
approach, just not sure how to start.

On Thu, Apr 21, 2011 at 2:23 AM, Ofer Fort  wrote:

> thanks, but that's what i started with, but it took an even longer time and
> threw this:
> Approaching too many values for UnInvertedField faceting on field 'text' :
> bucket size=15560140
> Approaching too many values for UnInvertedField faceting on field 'text :
> bucket size=15619075
> Exception during facet counts:org.apache.solr.common.SolrException: Too
> many values for UnInvertedField faceting on field text
>
>
>
> On Thu, Apr 21, 2011 at 2:11 AM, Jonathan Rochkind wrote:
>
>> I think faceting is probably the best way to do that, indeed. It might be
>> slow, but it's kind of set up for exactly that case, I can't imagine any
>> other technique being faster -- there's stuff that has to be done to look up
>> the info you want.
>>
>> BUT, I see your problem:  don't use facet.method=enum. Use
>> facet.method=fc.  Works a LOT better for very high arity fields (lots and
>> lots of unique values) like you have. I bet you'll see significant speed-up
>> if you use facet.method=fc instead, hopefully fast enough to be workable.
>>
>> With facet.method=enum, I would have indeed predicted it would be horribly
>> slow, before solr 1.4 when facet.method=fc became available, it was nearly
>> impossible to facet on very high arity fields, facet.method=fc is the magic.
>> I think facet.method=fc is even the default in Solr 1.4+, if you hadn't
>> explicitly set it to enum instead!
>>
>> Jonathan
>> 
>> From: Ofer Fort [ofer...@gmail.com]
>> Sent: Wednesday, April 20, 2011 6:49 PM
>> To: solr-user@lucene.apache.org
>> Subject: Highest frequency terms for a subset of documents
>> Hi,
>> I am looking for the best way to find the terms with the highest frequency
>> for a given subset of documents. (terms in the text field)
>> My first thought was to do a count facet search , where the query defines
>> the subset of documents and the facet.field is the text field, this gives
>> me
>> the result but it is very very slow.
>> These are my params:
>> true
>> 0
>> 3
>> on
>> 500
>> enum
>> xml
>> 0
>> 2.2
>> count
>>   in_subset:1
>> text
>> 
>>
>> The index contains 7M documents, the subset is about 200K. A simple query
>> for the subset takes around 100ms, but the facet search takes 40s.
>>
>> Am i doing something wrong?
>>
>> If facet search is not the correct approach, i thought about using
>> something
>> like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how to do this
>> in solr. Should i implement a request handler that executes this kind of
>> code?
>>
>> thanks for any help
>>
>
>


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
thanks, but that's what i started with, but it took an even longer time and
threw this:
Approaching too many values for UnInvertedField faceting on field 'text' :
bucket size=15560140
Approaching too many values for UnInvertedField faceting on field 'text :
bucket size=15619075
Exception during facet counts:org.apache.solr.common.SolrException: Too many
values for UnInvertedField faceting on field text


On Thu, Apr 21, 2011 at 2:11 AM, Jonathan Rochkind  wrote:

> I think faceting is probably the best way to do that, indeed. It might be
> slow, but it's kind of set up for exactly that case, I can't imagine any
> other technique being faster -- there's stuff that has to be done to look up
> the info you want.
>
> BUT, I see your problem:  don't use facet.method=enum. Use facet.method=fc.
>  Works a LOT better for very high arity fields (lots and lots of unique
> values) like you have. I bet you'll see significant speed-up if you use
> facet.method=fc instead, hopefully fast enough to be workable.
>
> With facet.method=enum, I would have indeed predicted it would be horribly
> slow, before solr 1.4 when facet.method=fc became available, it was nearly
> impossible to facet on very high arity fields, facet.method=fc is the magic.
> I think facet.method=fc is even the default in Solr 1.4+, if you hadn't
> explicitly set it to enum instead!
>
> Jonathan
> 
> From: Ofer Fort [ofer...@gmail.com]
> Sent: Wednesday, April 20, 2011 6:49 PM
> To: solr-user@lucene.apache.org
> Subject: Highest frequency terms for a subset of documents
> Hi,
> I am looking for the best way to find the terms with the highest frequency
> for a given subset of documents. (terms in the text field)
> My first thought was to do a count facet search , where the query defines
> the subset of documents and the facet.field is the text field, this gives
> me
> the result but it is very very slow.
> These are my params:
> true
> 0
> 3
> on
> 500
> enum
> xml
> 0
> 2.2
> count
>   in_subset:1
> text
> 
>
> The index contains 7M documents, the subset is about 200K. A simple query
> for the subset takes around 100ms, but the facet search takes 40s.
>
> Am i doing something wrong?
>
> If facet search is not the correct approach, i thought about using
> something
> like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how to do this
> in solr. Should i implement a request handler that executes this kind of
> code?
>
> thanks for any help
>


Re: How to index MS SQL Server column with image type

2011-04-20 Thread Chris Hostetter

: Subject: How to index MS SQL Server column with image type
: 
: Hi all,
: 
: When I index a column(image type) of a table  via *
: http://localhost:8080/solr/dataimport?command=full-import*
: *There is a error like this: String length must be a multiple of four.*

For future reference: full error messages (with stack traces) are the best
way to get people to help you diagnose problems.

I think the crux of the issue is that DataImportHandler doesn't
currently have any way of indexing raw binary data like images.

Under the covers, Solr can deal with pure binary fields, but there aren't
a lot of good use cases i can think of for it -- particularly if you want
to *index* those bytes...

: 

...can you please explain what your goal is?  What are you ultimately
hoping to do with that field?





-Hoss


RE: Highest frequency terms for a subset of documents

2011-04-20 Thread Jonathan Rochkind
I think faceting is probably the best way to do that, indeed. It might be slow, 
but it's kind of set up for exactly that case, I can't imagine any other 
technique being faster -- there's stuff that has to be done to look up the info 
you want. 

BUT, I see your problem:  don't use facet.method=enum. Use facet.method=fc.  
Works a LOT better for very high arity fields (lots and lots of unique values) 
like you have. I bet you'll see significant speed-up if you use facet.method=fc 
instead, hopefully fast enough to be workable. 

With facet.method=enum, I would have indeed predicted it would be horribly 
slow, before solr 1.4 when facet.method=fc became available, it was nearly 
impossible to facet on very high arity fields, facet.method=fc is the magic. I 
think facet.method=fc is even the default in Solr 1.4+, if you hadn't 
explicitly set it to enum instead! 
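
Concretely, that just means changing one request parameter, e.g.:

  facet=true&facet.field=text&facet.limit=500&facet.sort=count&facet.method=fc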

Jonathan

From: Ofer Fort [ofer...@gmail.com]
Sent: Wednesday, April 20, 2011 6:49 PM
To: solr-user@lucene.apache.org
Subject: Highest frequency terms for a subset of documents
Hi,
I am looking for the best way to find the terms with the highest frequency
for a given subset of documents. (terms in the text field)
My first thought was to do a count facet search , where the query defines
the subset of documents and the facet.field is the text field, this gives me
the result but it is very very slow.
These are my params:
true
0
3
on
500
enum
xml
0
2.2
count
   in_subset:1
text


The index contains 7M documents, the subset is about 200K. A simple query
for the subset takes around 100ms, but the facet search takes 40s.

Am i doing something wrong?

If facet search is not the correct approach, i thought about using something
like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how to do this
in solr. Should i implement a request handler that executes this kind of
code?

thanks for any help


Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
Hi,
I am looking for the best way to find the terms with the highest frequency
for a given subset of documents. (terms in the text field)
My first thought was to do a count facet search , where the query defines
the subset of documents and the facet.field is the text field, this gives me
the result but it is very very slow.
These are my params:
true
0
3
on
500
enum
xml
0
2.2
count
   in_subset:1
text


The index contains 7M documents, the subset is about 200K. A simple query
for the subset takes around 100ms, but the facet search takes 40s.

Am i doing something wrong?

If facet search is not the correct approach, i thought about using something
like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how to do this
in solr. Should i implement a request handler that executes this kind of
code?

thanks for any help


entity name issue

2011-04-20 Thread tjtong
Hi guys,

I have encountered a problem with an entity name; see the data config code
below. The variable '${ea.a_aid}' was always empty. I suspect it is a
namespace issue. Does anyone know how to work around it?

This is on an Oracle database. I had to use the prefix "myschema.", otherwise
the table name was not recognized. The same thing worked on another database
without adding a prefix to the table names.
Thanks in advance!

[the data config snippet was stripped by the list archiver]
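The usual shape of such a config is a nested entity referencing the parent
entity by name, e.g. (hypothetical table/column names, since the real
snippet was stripped):

  <entity name="ea" query="SELECT a_aid FROM myschema.entity_author">
    <entity name="a" query="SELECT * FROM myschema.article
                            WHERE aid = '${ea.A_AID}'"/>
  </entity>

One Oracle-specific gotcha worth checking: Oracle returns column labels in
uppercase, so ${ea.a_aid} can come back empty while ${ea.A_AID} resolves.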

--
View this message in context: 
http://lucene.472066.n3.nabble.com/entity-name-issue-tp2843812p2843812.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: stemming filter analyzers, any favorites?

2011-04-20 Thread Robert Petersen
I have been doing that, and for the Bags example the trailing 's' is not
being removed by the Kstemmer, so if you index the word bags and search on
bag you get no matches.  Why wouldn't the trailing 's' get stemmed off?
Kstemmer is dictionary based, so bags isn't in the dictionary?  That
trailing 's' should always be dropped, no?  We don't want to have to make
synonyms for basic use cases like this.  I fear I will have to return to the
Porter stemmer.  Are there other, better ones? That is my main question.

Off-topic secondary question: sometimes I am puzzled by the output of the
analysis page.  It seems like there should be a match, but I don't get the
results during a search that I'd expect...

For example, when the WordDelimiterFilterFactory splits up a term into a
bunch of terms before the K-stemmer is applied: sometimes the matching term
ends up in position two of the final analysis, while the searcher had the
partial term alone (and thereby in position 1) in the analysis stack, and
then the search didn't match.  Am I reading this correctly?  Should that
match, or am I misreading my analysis output?

Thanks!

Robi

PS  I have a category named Bags and am catching flack for it not coming up in 
a search for bag.  hah
PPS the term is not in protwords.txt


com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory 
{protected=protwords.txt}
term position   1
term text   bags
term type   word
source start,end0,4
payload 


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, April 20, 2011 10:55 AM
To: solr-user@lucene.apache.org
Subject: Re: stemming filter analyzers, any favorites?

You can get a better sense of exactly what tranformations occur when
if you look at the analysis page (be sure to check the "verbose"
checkbox).

I'm surprised that "bags" doesn't match "bag", what does the analysis
page say?

Best
Erick

On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen  wrote:
> Stemming filter analyzers... anyone have any favorites for particular
> search domains?  Just wondering what people are using.  I'm using Lucid
> K Stemmer and having issues.   Seems like it misses a lot of common
> stems.  We went to that because of excessively loose matches on the
> solr.PorterStemFilterFactory
>
>
> I understand K Stemmer is a dictionary based stemmer.  Seems to me like
> it is missing a lot of common stem reductions.  Ie   Bags does not match
> Bag in our searches.
>
> Here is my analyzer stack:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="1" preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
>             protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="1" preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
>             protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>


Re: ConcurrentLRUCache$Stats error

2011-04-20 Thread Chris Hostetter

: https://issues.apache.org/jira/browse/SOLR-1797

that issue doesn't seem to have anything to do with the stack trace 
reported...

: > SEVERE: java.util.concurrent.ExecutionException:
: > java.lang.NoSuchMethodError:
: > org.apache.solr.common.util.ConcurrentLRUCache$Stats.add(Lorg/apache/solr/c
: > ommon/util/ConcurrentLRUCache$Stats;)V

NoSuchMethodError means that one compiled java class expects another
compiled java class to have a method that it does not actually have --
this typically happens when you have inconsistent class files (or jars) in
your classpath.

ie: you most likely have a mix of jars from two different versions of 
solr/lucene.

-Hoss


RE: Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Demian Katz
That's good news -- thanks for the help (not to mention the reassurance that 
Solr itself is actually working right)!  Hopefully 3.1.1 won't be too far off, 
though; when the analysis tool lies, life can get very confusing! :-)

- Demian

> -Original Message-
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Wednesday, April 20, 2011 2:54 PM
> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: Re: Bug in solr.KeywordMarkerFilterFactory?
> 
> No, this is only a bug in analysis.jsp.
> 
> you can see this by comparing analysis.jsp's "dontstems bees" to using
> the query debug interface:
> 
>   "dontstems bees"
>   "dontstems bees"
>   PhraseQuery(text:"dontstems bee")
>   text:"dontstems bee"
> 
> On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley
>  wrote:
> > On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz
>  wrote:
> >> I've just started experimenting with the
> solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some
> strange behavior.  It seems that every word subsequent to a protected
> word is also treated as being protected.
> >
> > You're right!  This was broken by LUCENE-2901 back in Jan.
> > I've opened this issue:
>  https://issues.apache.org/jira/browse/LUCENE-3039
> >
> > The easiest short-term workaround for you would probably be to create
> > a custom filter that looks like KeywordMarkerFilter before the
> > LUCENE-2901 change.
> >
> > -Yonik
> > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> > 25-26, San Francisco
> >


Re: Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Robert Muir
No, this is only a bug in analysis.jsp.

you can see this by comparing analysis.jsp's "dontstems bees" to using
the query debug interface:

  "dontstems bees"
  "dontstems bees"
  PhraseQuery(text:"dontstems bee")
  text:"dontstems bee"

On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley
 wrote:
> On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz  
> wrote:
>> I've just started experimenting with the solr.KeywordMarkerFilterFactory in 
>> Solr 3.1, and I'm seeing some strange behavior.  It seems that every word 
>> subsequent to a protected word is also treated as being protected.
>
> You're right!  This was broken by LUCENE-2901 back in Jan.
> I've opened this issue:  https://issues.apache.org/jira/browse/LUCENE-3039
>
> The easiest short-term workaround for you would probably be to create
> a custom filter that looks like KeywordMarkerFilter before the
> LUCENE-2901 change.
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>


RE: Solr - Multi Term highlighting issue

2011-04-20 Thread Ramanathapuram, Rajesh
Thanks Erick. 

I tried your suggestion, the issue still exists.

http://localhost:8983/searchsolr/mainCore/select?indent=on&version=2.2&q=mec+us+chile&fq=storyid%3DXXX%22&start=0&rows=10&fl=*&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=story%2C+slug&hl.fragsize=10&hl.highlightMultiTerm=true&hl.usePhraseHighlighter=true&hl.mergeContiguous=false

- 
  10 
   
  on 
  false 


... Corboba. (MEC)CHILE/FOREST FIRES ...


thanks & regards,
Rajesh Ramana 


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, April 20, 2011 11:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr - Multi Term highlighting issue

Does your configuration have "hl.mergeContiguous" set to true by any chance? 
And what happens if you explicitly set this to "false" on your query?

Best
Erick

On Wed, Apr 20, 2011 at 9:43 AM, Ramanathapuram, Rajesh 
 wrote:
> Hello,
>
> I am dealing with a highlighting issue in SOLR, I will try to explain 
> the issue.
>
> When I search for a single term in solr, it wraps an <em> tag around the
> words I want to highlight; all works well.
> But if I search multiple terms, for the most part highlighting works well,
> and then for some of the terms the highlighter returns multiple terms in a
> single <em> tag: <em> ... srchtrm1) srchtrm2</em>. I expect solr to return
> highlight terms like <em>srchtrm1</em>) ... <em>srchtrm2</em>.
>
> When I search for 'US mec chile', here is how my result appears
>  ... Corboba. (MEC)CHILE/FOREST FIRES: 
> We had ... with US and Chile ...,
>  (MEC)US  
>
> This is what I was expecting it to be
>  ... Corboba. (MEC)CHILE/FOREST
> FIRES: We had ... with US and Chile ..., 
> (MEC)US 
>
> Here is my query params
> - 
> - 
>  0
>  26
> - 
>     10
>     
>     on
>     story, slug
>     standard
>     on
>     10
>     2.2
>     true
>     *
>     0
>     mec us chile
>     standard
>     true
>     storyid="  X"
>  
>  
>
> Here are some other links I found in the forum, but no real conclusion
>
> http://www.lucidimagination.com/search/document/ac64e4f0abb6e4fc/solr_
> hi
> ghlighting_question#78163c42a67cb533
>
> I am going to try this patch, which also had no conclusive results
>   https://issues.apache.org/jira/browse/SOLR-1394
>
> Has anyone come across this issue?
> Any suggestions on how to fix this issue are much appreciated.
>
>
> thanks & regards,
> Rajesh Ramana
>


Re: Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Yonik Seeley
On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz  wrote:
> I've just started experimenting with the solr.KeywordMarkerFilterFactory in 
> Solr 3.1, and I'm seeing some strange behavior.  It seems that every word 
> subsequent to a protected word is also treated as being protected.

You're right!  This was broken by LUCENE-2901 back in Jan.
I've opened this issue:  https://issues.apache.org/jira/browse/LUCENE-3039

The easiest short-term workaround for you would probably be to create
a custom filter that looks like KeywordMarkerFilter before the
LUCENE-2901 change.
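
Robert Muir's follow-up elsewhere in this thread notes the bug turned out to
be only in analysis.jsp, but for completeness, a minimal sketch of such a
workaround filter against the Lucene 3.1 analysis API - the class name is
made up, and the key point is that the keyword flag is set on every token,
so it can never stick to the tokens that follow:

  import java.io.IOException;
  import org.apache.lucene.analysis.CharArraySet;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

  public final class ResettingKeywordMarkerFilter extends TokenFilter {
    private final KeywordAttribute keywordAttr = addAttribute(KeywordAttribute.class);
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final CharArraySet protectedWords;

    public ResettingKeywordMarkerFilter(TokenStream in, CharArraySet protectedWords) {
      super(in);
      this.protectedWords = protectedWords;
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) return false;
      // Unconditionally (re)set the flag so it never leaks to later tokens.
      keywordAttr.setKeyword(
          protectedWords.contains(termAtt.buffer(), 0, termAtt.length()));
      return true;
    }
  }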

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Demian Katz
I've just started experimenting with the solr.KeywordMarkerFilterFactory in 
Solr 3.1, and I'm seeing some strange behavior.  It seems that every word 
subsequent to a protected word is also treated as being protected.

For testing purposes, I have put the word "spelling" in my protwords.txt.  If I 
do a test for "spelling bees" in the analyze tool, the stemmer produces 
"spelling bees" - nothing is stemmed.  But if I do a test for "bees spelling", 
I get "bee spelling", the expected result with "bees" stemmed but "spelling" 
left unstemmed.  I have tried extended examples - in every case I tried, all of 
the words prior to "spelling" get stemmed, but none of the words after 
"spelling" get stemmed.  When turning on the verbose mode of the analyze tool, 
I can see that the settings of the "keyword" attribute introduced by
solr.KeywordMarkerFilterFactory correspond with the stemming behavior... so
I think the solr.KeywordMarkerFilterFactory component is to blame, and not
anything later in the analyze chain.

Any idea what might be going wrong?  Is this a known issue?

Here is my field type definition, in case it makes a difference:

[field type definition stripped by the list archiver]
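For readers hitting the same thing, a typical chain that exercises the
keyword marker looks like this sketch - the field type name and stemmer
choice are illustrative, not necessarily Demian's actual definition:

  <fieldType name="text_stem" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- marks terms from protwords.txt so the stemmer skips them -->
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>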

thanks,
Demian


Re: stemming filter analyzers, any favorites?

2011-04-20 Thread Erick Erickson
You can get a better sense of exactly what tranformations occur when
if you look at the analysis page (be sure to check the "verbose"
checkbox).

I'm surprised that "bags" doesn't match "bag", what does the analysis
page say?

Best
Erick

On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen  wrote:
> Stemming filter analyzers... anyone have any favorites for particular
> search domains?  Just wondering what people are using.  I'm using Lucid
> K Stemmer and having issues.   Seems like it misses a lot of common
> stems.  We went to that because of excessively loose matches on the
> solr.PorterStemFilterFactory
>
>
> I understand K Stemmer is a dictionary based stemmer.  Seems to me like
> it is missing a lot of common stem reductions.  Ie   Bags does not match
> Bag in our searches.
>
> Here is my analyzer stack:
>
>                 positionIncrementGap="100">
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="1" preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
>             protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="1" preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
>             protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>


stemming filter analyzers, any favorites?

2011-04-20 Thread Robert Petersen
Stemming filter analyzers... anyone have any favorites for particular
search domains?  Just wondering what people are using.  I'm using Lucid
K Stemmer and having issues.   Seems like it misses a lot of common
stems.  We went to that because of excessively loose matches on the
solr.PorterStemFilterFactory


I understand K Stemmer is a dictionary based stemmer.  Seems to me like
it is missing a lot of common stem reductions, i.e. Bags does not match
Bag in our searches.

Here is my analyzer stack:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Re: HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException

2011-04-20 Thread Robert Muir
Hi, there is a proposed patch uploaded to the issue. Maybe you can
help by reviewing/testing it?

2011/4/20 Robert Gründler :
> Hi all,
>
> i'm getting the following exception when using highlighting for a field
> containing HTMLStripCharFilterFactory:
>
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token ...
> exceeds length of provided text sized 21
>
> It seems this is a known issue:
>
> https://issues.apache.org/jira/browse/LUCENE-2208
>
> Does anyone know if there's a fix implemented yet in solr?
>
>
> thanks!
>
>
> -robert
>
>
>
>


Re: Creating a TrieDateField (and other Trie fields) from Lucene Java

2011-04-20 Thread Yonik Seeley
On Tue, Apr 19, 2011 at 11:17 PM, Craig Stires  wrote:
> The barrier I have is that I need to build this offline (without using a
> solr server, solrconfig.xml, or schema.xml)

This is pretty unusual... can you share your use case?
Solr can also be run in embedded mode if you can't run a stand-alone
server for some reason.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException

2011-04-20 Thread Robert Gründler

Hi all,

i'm getting the following exception when using highlighting for a field 
containing HTMLStripCharFilterFactory:


org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token 
... exceeds length of provided text sized 21


It seems this is a known issue:

https://issues.apache.org/jira/browse/LUCENE-2208

Does anyone know if there's a fix implemented yet in solr?


thanks!


-robert





Re: Solr - Multi Term highlighting issue

2011-04-20 Thread Erick Erickson
Does your configuration have "hl.mergeContiguous" set to true by any
chance? And what
happens if you explicitly set this to "false" on your query?

Best
Erick

On Wed, Apr 20, 2011 at 9:43 AM, Ramanathapuram, Rajesh
 wrote:
> Hello,
>
> I am dealing with a highlighting issue in SOLR, I will try to explain
> the issue.
>
> When I search for a single term in solr, it wraps an <em> tag around the
> words I want to highlight; all works well.
> But if I search multiple terms, for the most part highlighting works well,
> and then for some of the terms the highlighter returns multiple terms in a
> single <em> tag: <em> ... srchtrm1) srchtrm2</em>.
> I expect solr to return highlight terms like <em>srchtrm1</em>)
> ... <em>srchtrm2</em>.
>
> When I search for 'US mec chile', here is how my result appears
>  ... Corboba. (MEC)CHILE/FOREST FIRES: We
> had ... with US and Chile ...,
>  (MEC)US  
>
> This is what I was expecting it to be
>  ... Corboba. (MEC)CHILE/FOREST
> FIRES: We had ... with US and Chile ...,
> (MEC)US 
>
> Here is my query params
> - 
> - 
>  0
>  26
> - 
>     10
>     
>     on
>     story, slug
>     standard
>     on
>     10
>     2.2
>     true
>     *
>     0
>     mec us chile
>     standard
>     true
>     storyid="  X"
>  
>  
>
> Here are some other links I found in the forum, but no real conclusion
>
> http://www.lucidimagination.com/search/document/ac64e4f0abb6e4fc/solr_hi
> ghlighting_question#78163c42a67cb533
>
> I am going to try this patch, which also had no conclusive results
>   https://issues.apache.org/jira/browse/SOLR-1394
>
> Has anyone come across this issue?
> Any suggestions on how to fix this issue are much appreciated.
>
>
> thanks & regards,
> Rajesh Ramana
>


Re: old searchers not closing after optimize or replication

2011-04-20 Thread Erick Erickson
It looks OK, but still doesn't explain keeping the old files around. What does
your <deletionPolicy> in your solrconfig.xml look like? It's
possible that you're seeing Solr attempt to keep around several
optimized copies of the index, but that still doesn't explain why
restarting Solr removes them unless the deletionPolicy gets invoked
at some point and your index files are aging out (I don't know the
internals of deletion well enough to say).

About optimization: it's become less important with recent code. Once
upon a time it made a substantial difference in search speed. More
recently it has very little impact on search speed, and is used
much more sparingly. Its greatest benefit is reclaiming unused resources
left over from deleted documents. So you might want to avoid the pain
of optimizing (44 minutes!) and only optimize rarely, or if you have
deleted a lot of documents.

It might be worthwhile to try (with a smaller index!) a bunch of optimize
cycles and see if the <deletionPolicy> idea has any merit. I'd expect
your index to reach a maximum size and stay there once the number of saved
copies of the index is reached...

But otherwise I'm puzzled...

Erick

On Wed, Apr 20, 2011 at 10:30 AM, Bernd Fehling
 wrote:
> Hi Erick,
>
> On 20.04.2011 15:42, Erick Erickson wrote:
>>
>> Hmmm, this isn't right. You've pretty much eliminated the obvious
>> things. What does lsof show? I'm assuming it shows the files are
>> being held open by your Solr instance, but it's worth checking.
>
> Just committed new content 3 times and finally optimized.
> Again old index files are left behind.
>
> Then I checked on my master: only the newest version of the index files is
> listed with lsof. There are no file handles to the old index files, but the
> old index files remain in data/index/.
> That's strange.
>
> This time replication worked fine and cleaned up old index on slaves.
>
>>
>> I'm not getting the same behavior, admittedly on a Windows box.
>> The only other thing I can think of is that you have a query that's
>> somehow never ending, but that's grasping at straws.
>>
>> Do your log files show anything interesting?
>
> Lets see:
> - it has the old generation (generation=12) and its files
> - and recognizes that there have been several commits (generation=18)
>
> 20.04.2011 14:05:26 org.apache.solr.update.DirectUpdateHandler2 commit
> INFO: start
> commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
> 20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy onInit
> INFO: SolrDeletionPolicy.onInit: commits:num=2
>
>  commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]
>
>  commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]
> 20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy updateCommits
> INFO: newest commit = 1302159868447
>
>
> - after 44 minutes of optimizing (over 140GB and 27.8 million docs) it gets
>  the SolrDeletionPolicy onCommit and has the new generation 19 listed.
>
>
> 20.04.2011 14:49:25 org.apache.solr.core.SolrDeletionPolicy onCommit
> INFO: SolrDeletionPolicy.onCommit: commits:num=3
>
>  commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]
>
>  commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]
>
>  commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_j,version=1302159868449,generation=19,filenames=[_3xt.fnm, _3xt.nrm, _3xt.frq, _3xt.fdt, _3xt.tis, _3xt.fdx, segments_

Multiple Tags and Facets

2011-04-20 Thread Em
Hello,

I watched an online video with Chris Hostetter from Lucid Imagination. He
showed the possibility of having some facets that exclude *all* filters while
also having some facets that take care of some of the set filters while
ignoring others.

Unfortunately the webinar did not explain how they did this, and I wasn't
able to give a filter/facet more than one tag.

Here is an example:

Facets and Filters: DocType, Author

Facet:
- Author
-- George (10)
-- Brian (12)
-- Christian (78)
-- Julia (2)

-Doctype
-- PDF (70)
-- ODT (10)
-- Word (20)
-- JPEG (1)
-- PNG (1)

When clicking on "Julia" I would like to achieve the following:
Facet:
- Author
-- George (10)
-- Brian (12)
-- Christian (78)
-- Julia (2)
 Julia's Doctypes:
-- JPEG (1)
-- PNG (1)

-Doctype
-- PDF (70)
-- ODT (10)
-- Word (20)
-- JPEG (1)
-- PNG (1)

Another example, which adds special options to your GUI, could be the
following:
Imagine a fashion store.
If you search for "shirt" you get a color-facet:

colors:
- red (19)
- green (12)
- blue (4)
- black (2)

As well as a brand-facet:

brands:
- puma (18)
- nike (19)

When I click on the red color-facet, I would like to get the following back:
colors:
- red (19)
- green (12)*
- blue (4)*
- black (2)*

brands:
- puma (18)*
- nike (19)

All the filters marked with an "*" could be displayed half-transparent or
so - they just show the user that those filter options exist for his/her
search but aren't included in the result set, since he/she excluded them by
clicking the "red" filter.

This case gets more interesting if not all red shirts are from Nike: you
can then show the user that, e.g., 8 of the 19 red shirts are from the
brand they selected ("you see 8 of 19 red shirts").
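
(What is being described here is Solr's multi-select faceting, available
since Solr 1.4 through the tag/ex local params: tag the filter query, then
exclude that tag when computing the facet whose counts should ignore the
filter. A sketch for the shirt example - host and field names are
illustrative:

  http://host:port/solr/select?q=shirt
     &fq={!tag=colorTag}color:red
     &facet=true
     &facet.field={!ex=colorTag}color
     &facet.field=brand

The color facet ignores the color:red filter, so green/blue/black keep
their full counts, while the brand facet still respects it. Both tag and
ex accept comma-separated lists, e.g. fq={!tag=colorTag,anyTag}color:red
and facet.field={!ex=colorTag,brandTag}color, which is how one filter can
carry more than one tag.)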

I hope I explained what I want to achieve.

Thank you!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2843130.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: TikaEntityProcessor

2011-04-20 Thread firdous_kind86
after reading this post I hoped that I could achieve it.. but I couldn't
find any success in almost a week

http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html#a867572
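
(For anyone else hitting this: a minimal DIH data-config for the
TikaEntityProcessor, close to the example on the Solr wiki - the file path
and target field names below are illustrative:

  <dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="/path/to/sample.pdf" format="text">
        <field column="Author" name="author" meta="true" />
        <field column="text" name="text" />
      </entity>
    </document>
  </dataConfig>

This needs the dataimporthandler-extras jar plus matching Tika jars and
their dependencies on the classpath - which is exactly where the version
mismatches discussed in this thread come from.)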

--
View this message in context: 
http://lucene.472066.n3.nabble.com/TikaEntityProcessor-tp2839188p2843084.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: TikaEntityProcessor

2011-04-20 Thread Andreas Kemkes
I went down this path unsuccessfully - too many incompatibilities among
versions; some code changes and recompiling were required. See also the
thread "Solr 1.4.1 and Tika 0.9 - some tests not passing" for the remaining
issues. You'll have better luck with the newer Solr 3.1 release, which
already uses Tika 0.8 - I still re-compiled it from source (no code changes,
as far as I remember). I never tried the library replacement - I don't
think it's possible.

Andreas  




From: firdous_kind86 
To: solr-user@lucene.apache.org
Sent: Wed, April 20, 2011 12:38:02 AM
Subject: Re: TikaEntityProcessor

hi, I asked that :)

didn't get that.. what dependencies?

I am using Solr 1.4 and Tika 0.9

I replaced tika-core 0.9 and tika-parsers 0.9 at /contrib/extraction/lib
and also replaced the old version of dataimporthandler-extras with
apache-solr-dataimporthandler-extras-3.1.0.jar

but still the same problem..

someone pointed me to bug SOLR-2116 but I guess it is only for Solr 3.1

--
View this message in context: 
http://lucene.472066.n3.nabble.com/TikaEntityProcessor-tp2839188p2841936.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: old searchers not closing after optimize or replication

2011-04-20 Thread Bernd Fehling

Hi Erik,

Am 20.04.2011 15:42, schrieb Erick Erickson:

Hmmm, this isn't right. You've pretty much eliminated the obvious
things. What does lsof show? I'm assuming it shows the files are
being held open by your Solr instance, but it's worth checking.


Just committed new content 3 times and finally optimized.
Again old index files are left behind.

Then I checked on my master: only the newest version of the index files is
listed with lsof. There are no file handles to the old index files, but the
old index files remain in data/index/.
That's strange.

This time replication worked fine and cleaned up old index on slaves.



I'm not getting the same behavior, admittedly on a Windows box.
The only other thing I can think of is that you have a query that's
somehow never ending, but that's grasping at straws.

Do your log files show anything interesting?


Let's see:
- it has the old generation (generation=12) and its files
- and recognizes that there have been several commits (generation=18)

20.04.2011 14:05:26 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1302159868447


- after 44 minutes of optimizing (over 140GB and 27.8 million docs) it gets
  the SolrDeletionPolicy onCommit and has the new generation 19 listed.


20.04.2011 14:49:25 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=3
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_j,version=1302159868449,generation=19,filenames=[_3xt.fnm, _3xt.nrm, _3xt.frq, _3xt.fdt, _3xt.tis, _3xt.fdx, segments_j, _3xt.prx, _3xt.tii]
20.04.2011 14:49:25 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1302159868449


- it starts a new searcher and warms it up
- it sends SolrIndexSearcher close


20.04.2011 14:49:29 org.apache.solr.search.SolrIndexSearcher <init>
INFO: Opening Searcher@2c37425f main
20.04.2011 14:49:29 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
20.04.2011 14:49:29 org.apache.solr.search.SolrIndexSearcher warm
...
20.04.2011 14:49:29 org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to Searcher@2c37425f main
20.04.2011 14:49:29 org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={facet=true&start=0&event=newSearcher&q=solr&facet.limit=100&facet.field=f_dcyear&rows=10} hits=96 status=0 QTime=816

20.04.2011 14:49:30 org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={facet=true&start=0&event=newSearcher&q=*:*&facet.limit=100&facet.field=f_dcyear&rows=10} hits=27826100 status=0 QTime=633

20.04.2011 14:49:30 org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
20.04.2011 14:49:30 org.apache.solr.core.SolrCore registerSearcher
INFO: [] Registered new searcher Searcher@2c37425f main
20.04.2011 

Solr - Multi Term highlighting issue

2011-04-20 Thread Ramanathapuram, Rajesh
Hello,

I am dealing with a highlighting issue in Solr; I will try to explain it.

When I search for a single term in Solr, it wraps an <em> tag around the
words I want to highlight; all works well.
But if I search for multiple terms, highlighting works for the most part,
and then for some of the terms
the highlighter returns multiple terms in a single <em> tag:
<em>srchtrm1) srchtrm2</em>
I expect Solr to return highlighted terms like <em>srchtrm1)</em>
... <em>srchtrm2</em>

When I search for 'US mec chile', here is how my result appears:
  ... Corboba. (<em>MEC)CHILE</em>/FOREST FIRES: We
had ... with <em>US</em> and <em>Chile</em> ...,
  (<em>MEC)US</em>

This is what I was expecting it to be:
  ... Corboba. (<em>MEC</em>)<em>CHILE</em>/FOREST
FIRES: We had ... with <em>US</em> and <em>Chile</em> ...,
(<em>MEC</em>)<em>US</em>

Here are my query params:
[the XML element names were lost in archiving; the surviving values are:
0, 26, 10, on, story, slug, standard, on, 10, 2.2, true, *, 0,
mec us chile, standard, true, storyid="  X"]

Here are some other links I found in the forum, but with no real conclusion:
http://www.lucidimagination.com/search/document/ac64e4f0abb6e4fc/solr_highlighting_question#78163c42a67cb533
   
I am going to try this patch, which also had no conclusive results
   https://issues.apache.org/jira/browse/SOLR-1394 
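
(Two highlighter switches that may be worth ruling out here, both standard
Solr parameters - whether they fix this particular case is untested:
hl.usePhraseHighlighter, which makes highlighting respect term positions,
and hl.mergeContiguous, which when true deliberately merges adjacent
fragments and so should be left false. A sketch:

  http://host:port/solr/select?q=mec+us+chile&hl=on&hl.fl=story,slug
     &hl.usePhraseHighlighter=true&hl.mergeContiguous=false

)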

Has anyone come across this issue?
Any suggestions on how to fix this issue are much appreciated.


thanks & regards,
Rajesh Ramana 


Re: old searchers not closing after optimize or replication

2011-04-20 Thread Erick Erickson
Hmmm, this isn't right. You've pretty much eliminated the obvious
things. What does lsof show? I'm assuming it shows the files are
being held open by your Solr instance, but it's worth checking.

I'm not getting the same behavior, admittedly on a Windows box.
The only other thing I can think of is that you have a query that's
somehow never ending, but that's grasping at straws.

Do your log files show anything interesting?

Best
Erick@NotMuchHelpIKnow

On Wed, Apr 20, 2011 at 8:37 AM, Bernd Fehling
 wrote:
> Hi Erik,
>
> Am 20.04.2011 13:56, schrieb Erick Erickson:
>>
>> Does this persist? In other words, if you just watch it for
>> some time, does the disk usage go back to normal?
>
> Only after restarting the whole Solr instance does the disk usage go back to normal.
>
>>
>> Because it's typical that your index size will temporarily
>> spike after the operations you describe as new searchers
>> are warmed up. During that interval, both the old and new
>> searchers are open.
>
> Temporarily yes, but is it still there a couple of hours after optimize
> or replication?
>
>>
>> Look particularly at your warmup time in the Solr admin page,
>> that should give you an indication of how long it takes your
>> warmup to happen and give you a clue about when you should
>> expect the index sizes to drop again.
>
> We have newSearcher and firstSearcher (both with 2 simple queries) and
> <useColdSearcher>false</useColdSearcher>
> <maxWarmingSearchers>2</maxWarmingSearchers>
> The QTime is less than 500 ms (0.5 second).
>
> warmupTime=0 for all autowarming Searcher
>
>>
>> How often do you optimize on the master and replicate on the
>> slave? Because you may be getting into the runaway warmup
>> problem where a new searcher is opened before the last one
>> is autowarmed and spiraling out of control.
>
> We commit new content about every hour and do an optimize once a day.
> So replication is also once a day, after the optimize has finished and
> the system has settled down.
> No commit during optimize and replication.
>
>
> Any further hints?
>
>
>>
>> Hope that helps
>> Erick
>>
>> On Wed, Apr 20, 2011 at 2:36 AM, Bernd Fehling
>>   wrote:
>>>
>>> Hello list,
>>>
>>> we have the problem that old searchers often are not closing
>>> after optimize (on master) or replication (on slaves) and
>>> therefore have huge index volumes.
>>> The only solution so far is to stop and start Solr, which cleans
>>> everything up successfully, but this can only be a workaround.
>>>
>>> Is the parameter "waitSearcher=false" an option to solve this?
>>>
>>> Any hints what to check or to debug?
>>>
>>> We use Apache Solr 3.1.0 on Linux.
>>>
>>> Regards
>>> Bernd
>>>
>


Re: old searchers not closing after optimize or replication

2011-04-20 Thread Bernd Fehling

Hi Erik,

Am 20.04.2011 13:56, schrieb Erick Erickson:

Does this persist? In other words, if you just watch it for
some time, does the disk usage go back to normal?


Only after restarting the whole Solr instance does the disk usage go back to normal.



Because it's typical that your index size will temporarily
spike after the operations you describe as new searchers
are warmed up. During that interval, both the old and new
searchers are open.


Temporarily yes, but is it still there a couple of hours after optimize
or replication?



Look particularly at your warmup time in the Solr admin page,
that should give you an indication of how long it takes your
warmup to happen and give you a clue about when you should
expect the index sizes to drop again.


We have newSearcher and firstSearcher (both with 2 simple queries) and
<useColdSearcher>false</useColdSearcher>
<maxWarmingSearchers>2</maxWarmingSearchers>
The QTime is less than 500 ms (0.5 second).

warmupTime=0 for all autowarming Searcher
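
(For reference, this listener setup lives in solrconfig.xml. Judging from
the warming queries visible in the log output elsewhere in this thread, it
would be something along these lines - reconstructed, not copied from the
actual config:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">solr</str>
        <str name="facet">true</str>
        <str name="facet.field">f_dcyear</str>
        <str name="facet.limit">100</str>
      </lst>
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">f_dcyear</str>
        <str name="facet.limit">100</str>
      </lst>
    </arr>
  </listener>

with an equivalent firstSearcher listener alongside it.)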



How often do you optimize on the master and replicate on the
slave? Because you may be getting into the runaway warmup
problem where a new searcher is opened before the last one
is autowarmed and spiraling out of control.


We commit new content about every hour and do an optimize once a day.
So replication is also once a day, after the optimize has finished and
the system has settled down.
No commit during optimize and replication.


Any further hints?




Hope that helps
Erick

On Wed, Apr 20, 2011 at 2:36 AM, Bernd Fehling
  wrote:

Hello list,

we have the problem that old searchers often are not closing
after optimize (on master) or replication (on slaves) and
therefore have huge index volumes.
The only solution so far is to stop and start Solr, which cleans
everything up successfully, but this can only be a workaround.

Is the parameter "waitSearcher=false" an option to solve this?

Any hints what to check or to debug?

We use Apache Solr 3.1.0 on Linux.

Regards
Bernd



Re: old searchers not closing after optimize or replication

2011-04-20 Thread Erick Erickson
Does this persist? In other words, if you just watch it for
some time, does the disk usage go back to normal?

Because it's typical that your index size will temporarily
spike after the operations you describe as new searchers
are warmed up. During that interval, both the old and new
searchers are open.

Look particularly at your warmup time in the Solr admin page,
that should give you an indication of how long it takes your
warmup to happen and give you a clue about when you should
expect the index sizes to drop again.

How often do you optimize on the master and replicate on the
slave? Because you may be getting into the runaway warmup
problem where a new searcher is opened before the last one
is autowarmed and spiraling out of control.

Hope that helps
Erick

On Wed, Apr 20, 2011 at 2:36 AM, Bernd Fehling
 wrote:
> Hello list,
>
> we have the problem that old searchers often are not closing
> after optimize (on master) or replication (on slaves) and
> therefore have huge index volumes.
> The only solution so far is to stop and start Solr, which cleans
> everything up successfully, but this can only be a workaround.
>
> Is the parameter "waitSearcher=false" an option to solve this?
>
> Any hints what to check or to debug?
>
> We use Apache Solr 3.1.0 on Linux.
>
> Regards
> Bernd
>


Re: KStemmer for Solr 3.x +

2011-04-20 Thread Ofer Fort
Seems like it isn't. In my installation (1.4.1) I used
LucidKStemFilterFactory, and when switching the solr.war file to the 3.1 one
I get:
14:42:31.664 ERROR [pool-1-thread-1]: java.lang.AbstractMethodError:
org.apache.lucene.analysis.TokenStream.incrementToken()Z
at
org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:78)
at
org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:50)
at
org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:606)
at
org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:151)
at
org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1421)
at
org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1309)
at
org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
at
org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1226)
at
org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206)
at
org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
at org.apache.solr.search.QParser.getQuery(QParser.java:142)
at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:84)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
at
org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:52)
at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1169)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)
at java.lang.Thread.run(Unknown Source)

when the config is:
  [the fieldType/analyzer XML was stripped in archiving; per the text above,
   the analyzer chain included LucidKStemFilterFactory]

anybody familiar with this issue?
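
(One stopgap until a Lucene 3.x-compatible build of the Lucid KStem factory
is available would be a stemmer that ships with Solr 3.1 - note this is
Porter stemming, which is more aggressive than KStem, not a drop-in
equivalent:

  <fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.PorterStemFilterFactory" />
    </analyzer>
  </fieldType>

The AbstractMethodError above is the classic symptom of a filter built for
the old TokenStream.next() API running under Lucene 3.x, where only
incrementToken() remains, so the old factory jar has to be recompiled
against 3.x rather than just dropped in.)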

On Sat, Apr 9, 2011 at 7:00 AM, David Smiley (@MITRE.org)  wrote:

> I see no reason why it would not be compatible.
>
> -
>  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/KStemmer-for-Solr-3-x-tp2796594p2798213.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Selecting (and sorting!) by the min/max value from multiple fields

2011-04-20 Thread jmaslac
Tanguy, thanks for the answer.

Yes, I have already tried that, but the problem is that the min() function
is not yet available (it is slated for Solr 3.2).
:(


Btw. in my original post I asked whether the query could return a new field
with this computed minimal value in the results - that is redundant; I'm
only interested in the sorting part of the question.
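
(For completeness, the index-time workaround mentioned in the original post
amounts to one extra schema field - the name minPrice is invented here:

  <field name="minPrice" type="float" indexed="true" stored="true" />

populated by the import code with the smallest of the price fields, and
then sorted on directly:

  http://host:port/solr/select?q=sony&sort=minPrice+asc

The drawback, as noted, is that a precomputed field cannot adapt to
whatever subset of prices the user ticks in the checkboxes.)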



Tanguy Moal wrote:
> 
> Hello,
> 
> Have you tried reading : 
> http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
> 
>  From that page I would try something like :
> http://host:port/solr/select?q=sony&sort=min(min(priceCash,priceCreditCard),priceCoupon)+asc&rows=10&indent=on&debugQuery=on
> 
> Is that of any help ?
> 
> --
> Tanguy
> 
> On 04/20/2011 09:41 AM, jmaslac wrote:
>> Hello,
>>
>> short question is this - is there a way for a search to return a field
>> that is not defined in the schema but is the minimum/maximum value of
>> several (int/float) fields in a SolrDocument? (and what would that search
>> look like?)
>>
>> Longer explanation. I have products, and each of them can have several
>> prices (price for cash, price for credit cards, coupon price and so on) -
>> not every product has all the price options. (Don't ask why - that's the
>> use case:) )
>>
>> <field name="priceCash" type="float" indexed="true" stored="true" />
>> <field name="priceCreditCard" type="float" indexed="true" stored="true" />
>> <field name="priceCoupon" type="float" indexed="true" stored="true" />
>> +2 more
>>
>> Is there a way to ask "give me the products containing for example 'sony'
>> and in the results return me the minimal price of all possible prices
>> (for
>> each product) and SORT the results by that (minimal) price"?
>>
>> I know I can calculate the minimal price at import/index time and store
>> it in one separate field, but the idea is that users will have checkboxes
>> in which they can say: I'm only interested in products that have
>> priceCreditCard and priceCoupon; show me the smaller of those two and
>> sort by that value.
>>
>> My idea is something like this:
>> ?q=sony&minPrice:min(priceCash,priceCreditCard,priceCoupon...)
>> (the field minPrice is not defined in the schema but should be returned
>> in the results)
>>
>> For searching this actually doesn't represent a problem, as I can easily
>> compare the prices programmatically and present them to the user. The
>> problem is sorting - I could also do that programmatically, but that
>> would mean I'd have to pull out all the results the query returned (which
>> can of course be quite big) and then sort them, so that's an option I
>> would naturally like to avoid.
>>
>> Don't know if I'm asking too much of Solr :) but I can see the
>> usefulness of something like this in examples other than mine.
>> Hope the question is clear; if I'm going about things completely the
>> wrong way, please advise in the right direction.
>> (If there is a similar question asked somewhere else please redirect me -
>> I didn't find it)
>>
>> Help much appreciated!
>>
>> Josip
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Selecting-and-sorting-by-the-min-max-value-from-multiple-fields-tp2841944p2841944.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> -- 
> --
> Tanguy
> 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Selecting-and-sorting-by-the-min-max-value-from-multiple-fields-tp2841944p2842232.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: How could each core share configuration files

2011-04-20 Thread Ephraim Ofir
I just use soft-links...
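
(E.g. something like the following - the paths are invented for
illustration:

  # one real copy of each shared file lives in /opt/solr/shared-conf/
  # ...soft-linked into every core's conf directory:
  ln -s /opt/solr/shared-conf/solrconfig.xml /opt/solr/cores/core1/conf/solrconfig.xml
  ln -s /opt/solr/shared-conf/solrconfig.xml /opt/solr/cores/core2/conf/solrconfig.xml

Core-specific files simply stay as regular files in each core's own conf
folder.)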

Ephraim Ofir

-Original Message-
From: lboutros [mailto:boutr...@gmail.com] 
Sent: Wednesday, April 20, 2011 10:09 AM
To: solr-user@lucene.apache.org
Subject: Re: How could each core share configuration files

Perhaps this could help :

http://lucene.472066.n3.nabble.com/Shared-conf-td2787771.html#a2789447

Ludovic.

2011/4/20 kun xiong [via Lucene] <
ml-node+2841801-1701787156-383...@n3.nabble.com>

> Hi all,
>
> Currently in my project, most of the core configurations are the
> same (solrconfig.xml, dataimport.properties...), which are duplicated in
> each core's own folder.
>
> I am wondering how I could put the common ones in one folder that each
> core could share, and keep the core-specific ones in their own folders.
>
> Thanks
>
> Kun
>
>
> --
>  If you reply to this email, your message will be added to the
discussion
> below:
>
>
http://lucene.472066.n3.nabble.com/How-could-each-core-share-configuration-files-tp2841801p2841801.html
>  To start a new topic under Solr - User, email
> ml-node+472068-1765922688-383...@n3.nabble.com
> To unsubscribe from Solr - User, click
here.
>
>


-
Jouve
France.
--
View this message in context:
http://lucene.472066.n3.nabble.com/How-could-each-core-share-configuration-files-tp2841801p2841875.html
Sent from the Solr - User mailing list archive at Nabble.com.


Saravanan Chinnadurai/Actionimages is out of the office.

2011-04-20 Thread Saravanan . Chinnadurai
I will be out of the office starting  20/04/2011 and will not return until
21/04/2011.

Please email to itsta...@actionimages.com  for any urgent issues.


Action Images is a division of Reuters Limited and your data will therefore be 
protected
in accordance with the Reuters Group Privacy / Data Protection notice which is 
available
in the privacy footer at www.reuters.com
Registered in England No. 145516   VAT REG: 397000555


Re: Selecting (and sorting!) by the min/max value from multiple fields

2011-04-20 Thread Tanguy Moal

Hello,

Have you tried reading : 
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function


From that page I would try something like :
http://host:port/solr/select?q=sony&sort=min(min(priceCash,priceCreditCard),priceCoupon)+asc&rows=10&indent=on&debugQuery=on

Is that of any help ?

--
Tanguy

On 04/20/2011 09:41 AM, jmaslac wrote:

Hello,

short question is this - is there a way for a search to return a field that
is not defined in the schema but is the minimum/maximum value of several
(int/float) fields in a SolrDocument? (and what would that search look like?)

Longer explanation. I have products, and each of them can have several
prices (price for cash, price for credit cards, coupon price and so on) -
not every product has all the price options. (Don't ask why - that's the use
case:) )

<field name="priceCash" type="float" indexed="true" stored="true" />
<field name="priceCreditCard" type="float" indexed="true" stored="true" />
<field name="priceCoupon" type="float" indexed="true" stored="true" />
+2 more

Is there a way to ask "give me the products containing for example 'sony'
and in the results return me the minimal price of all possible prices (for
each product) and SORT the results by that (minimal) price"?

I know I can calculate the minimal price at import/index time and store it
in one separate field, but the idea is that users will have checkboxes in
which they can say: I'm only interested in products that have
priceCreditCard and priceCoupon; show me the smaller of those two and sort
by that value.

My idea is something like this:
?q=sony&minPrice:min(priceCash,priceCreditCard,priceCoupon...)
(the field minPrice is not defined in the schema but should be returned in
the results)

For searching this actually doesn't represent a problem, as I can easily
compare the prices programmatically and present them to the user. The
problem is sorting - I could also do that programmatically, but that would
mean I'd have to pull out all the results the query returned (which can of
course be quite big) and then sort them, so that's an option I would
naturally like to avoid.

Don't know if I'm asking too much of Solr :) but I can see the usefulness
of something like this in examples other than mine.
Hope the question is clear; if I'm going about things completely the wrong
way, please advise in the right direction.
(If there is a similar question asked somewhere else please redirect me - I
didn't find it)

Help much appreciated!

Josip

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Selecting-and-sorting-by-the-min-max-value-from-multiple-fields-tp2841944p2841944.html
Sent from the Solr - User mailing list archive at Nabble.com.



--
--
Tanguy



Selecting (and sorting!) by the min/max value from multiple fields

2011-04-20 Thread jmaslac
Hello,

short question is this - is there a way for a search to return a field that
is not defined in the schema but is the minimum/maximum value of several
(int/float) fields in a SolrDocument? (and what would that search look like?)

Longer explanation. I have products, and each of them can have several
prices (price for cash, price for credit cards, coupon price and so on) -
not every product has all the price options. (Don't ask why - that's the use
case:) )

<field name="priceCash" type="float" indexed="true" stored="true" />
<field name="priceCreditCard" type="float" indexed="true" stored="true" />
<field name="priceCoupon" type="float" indexed="true" stored="true" />
+2 more

Is there a way to ask "give me the products containing for example 'sony'
and in the results return me the minimal price of all possible prices (for
each product) and SORT the results by that (minimal) price"?

I know I can calculate the minimal price at import/index time and store it
in one separate field, but the idea is that users will have checkboxes in
which they can say: I'm only interested in products that have
priceCreditCard and priceCoupon; show me the smaller of those two and sort
by that value.

My idea is something like this:
?q=sony&minPrice:min(priceCash,priceCreditCard,priceCoupon...)
(the field minPrice is not defined in the schema but should be returned in
the results)

For searching this actually doesn't represent a problem, as I can easily
compare the prices programmatically and present them to the user. The
problem is sorting - I could also do that programmatically, but that would
mean I'd have to pull out all the results the query returned (which can of
course be quite big) and then sort them, so that's an option I would
naturally like to avoid.

Don't know if I'm asking too much of Solr :) but I can see the usefulness
of something like this in examples other than mine.
Hope the question is clear; if I'm going about things completely the wrong
way, please advise in the right direction.
(If there is a similar question asked somewhere else please redirect me - I
didn't find it)

Help much appreciated!

Josip

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Selecting-and-sorting-by-the-min-max-value-from-multiple-fields-tp2841944p2841944.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: TikaEntityProcessor

2011-04-20 Thread firdous_kind86
hi, I asked that :)

didn't get that.. what dependencies?

I am using Solr 1.4 and Tika 0.9

I replaced tika-core 0.9 and tika-parsers 0.9 at /contrib/extraction/lib
and also replaced the old version of dataimporthandler-extras with
apache-solr-dataimporthandler-extras-3.1.0.jar

but still the same problem..

someone pointed me to bug SOLR-2116 but I guess it is only for Solr 3.1

--
View this message in context: 
http://lucene.472066.n3.nabble.com/TikaEntityProcessor-tp2839188p2841936.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Custom Sorting

2011-04-20 Thread Michael Owen

OK, thank you for the discussion. As I thought: not possible within
performance limits.
I think the way to go is to store some more stats at index time and use
them in boost queries. :)
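
(E.g. with the dismax handler, a popularity statistic stored at index time
could feed a boost query - the field and handler parameters here are
invented for illustration:

  http://host:port/solr/select?q=shirt&defType=dismax&qf=title+description
     &bq=popularity:[100+TO+*]^2.0

Documents matching the bq clause get their score raised, which reorders
the results without any custom sort comparator.)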
Thanks
Mike

> Date: Tue, 19 Apr 2011 15:12:00 -0400
> Subject: Re: Custom Sorting
> From: erickerick...@gmail.com
> To: solr-user@lucene.apache.org
> 
> As I understand it, sorting by field is what the caches are all
> about. You have a big list in memory of all of the terms for
> a field, indexed by Lucene doc ID, so fetching the term to
> compare by doc ID is fast. That's also why the caches need
> to be warmed, and why sort fields should be single-valued.
> 
> If you try to do this yourself and fetch data from each document,
> you can incur a huge performance hit, since you'll be seeking
> all over your disk...
> 
> Score is special though since it's transient. Internally, all Lucene
> has to do is keep track of the top N scores encountered where
> N is something like "start + queryResultWindowSize", this
> latter from solrconfig.xml, with no seeks to disk at all...
> 
> Best
> Erick
> 
> On Tue, Apr 19, 2011 at 2:50 PM, Jonathan Rochkind  wrote:
> > On 4/19/2011 1:43 PM, Jan Høydahl wrote:
> >>
> >> Hi,
> >>
> >> Not possible :)
> >> Lucene compares each matching document against the query and produces a
> >> score for each.
> >> Documents are not compared to eachother like normal sort, that would be
> >> way too costly.
> >
> > That might be true for sort by 'score' (although even if you have all the
> > scores, it still seems like some kind of sort must be necessary to see which
> > comes first), but when you sort by a field value, which is also possible,
> > Lucene must be doing some kind of 'normal sort' algorithm, no?  Ah, I guess
> > it could just be using each term's position in the index, which is available
> > in constant time, always kept track of in an index? Maybe, I don't know?
> >
> >
> >
  

Re: How could each core share configuration files

2011-04-20 Thread lboutros
Perhaps this could help :

http://lucene.472066.n3.nabble.com/Shared-conf-td2787771.html#a2789447

Ludovic.

2011/4/20 kun xiong [via Lucene] <
ml-node+2841801-1701787156-383...@n3.nabble.com>

> Hi all,
>
> Currently in my project, most of the core configurations are the
> same (solrconfig.xml, dataimport.properties...), which are duplicated in
> each core's own folder.
>
> I am wondering how I could put the common ones in one folder that each
> core could share, and keep the core-specific ones in their own folders.
>
> Thanks
>
> Kun
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://lucene.472066.n3.nabble.com/How-could-each-core-share-configuration-files-tp2841801p2841801.html
>  To start a new topic under Solr - User, email
> ml-node+472068-1765922688-383...@n3.nabble.com
> To unsubscribe from Solr - User, click 
> here.
>
>


-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-could-each-core-share-configuration-files-tp2841801p2841875.html
Sent from the Solr - User mailing list archive at Nabble.com.