Re: Solrj performance bottleneck

2011-03-16 Thread rahul
Thanks for all your info.

I will try increasing the RAM and check it.

thanks,



What request handlers to use for query strings in Chinese or Japanese?

2011-03-16 Thread Andy
Hi,

For my Solr server, some of the query strings will be in Asian languages such 
as Chinese or Japanese. 

For such query strings, would the Standard or Dismax request handler work? My 
understanding is that both the Standard and the Dismax handler tokenize the 
query string by whitespace. And that wouldn't work for Chinese or Japanese, 
right? 

In that case, what request handler should I use? And if I need to set up custom 
request handlers for those languages, how do I do it?

Thanks.

Andy


  


Parent-child options

2011-03-16 Thread Otis Gospodnetic
Hi,

The dreaded parent-child without denormalization question.  What are one's 
options for the following example:

parent: shoes
3 children, each with 2 attributes/fields: color and size
 * color: red black orange
 * size: 10 11 12

The goal is to be able to search for:
1) color:red AND size:10 and get 1 hit for the above
2) color:red AND size:12 and get *no* matches because there are no red shoes of 
size 12, only size 10.

What's the best thing to do without denormalizing?
* Are Poly fields designed for this?
* Should one use JSONKeyValueTokenizerFactory from SOLR-1690 as suggested by 
Ryan in http://search-lucene.com/m/I8VaDeusnJ1 ?
* Should one use SIREn as suggested by Renaud in 
http://search-lucene.com/m/qoQWMVk3w91 ?
* Should one use SpanMaskingQuery and SpanNearQuery as suggested by Hoss in 
http://search-lucene.com/m/AEvbbeusnJ1 ?
* Should one use JOIN from https://issues.apache.org/jira/browse/SOLR-2272 ?
* Should one use Nested Document query support from LUCENE-2454 (not in trunk, 
not in Solr) ?
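
For illustration, a minimal SolrJ sketch of the trap the question wants to avoid (the field names and the flattened layout are made up, not from the mail above):

import org.apache.solr.common.SolrInputDocument;

public class FlattenedShoes {
    public static void main(String[] args) {
        // Flattening the three children into one parent document loses the
        // pairing between color and size.
        SolrInputDocument flat = new SolrInputDocument();
        flat.addField("id", "shoes-1");
        flat.addField("color", "red");
        flat.addField("color", "black");
        flat.addField("color", "orange");
        flat.addField("size", 10);
        flat.addField("size", 11);
        flat.addField("size", 12);
        // Against this document, color:red AND size:10 correctly matches (goal 1),
        // but color:red AND size:12 *also* matches (breaking goal 2), because the
        // index no longer knows which size belongs to which color.
        System.out.println(flat);
    }
}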

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: Solrj performance bottleneck

2011-03-16 Thread Bill Bell
Try giving Solr about 1.5GB by setting Java params. Solr is usually CPU bound, so 
medium or large instances are good.

Bill Bell
Sent from mobile


On Mar 16, 2011, at 10:56 AM, Asharudeen  wrote:

> Hi
> 
> Thanks for your info.
> 
> Currently my index size is around 4GB. Normally in small instances total
> available memory will be 1.6GB. In my setup, I allocated around 1GB as a
> heap size for tomcat. Hence I believe the remaining 600 MB will be used for
> the OS cache.
> 
> I believe I need to migrate my Solr instance from a small instance to a
> large one, so that some more memory will be allotted for the OS cache. But
> initially I suspected that, since I call the Solrj code from another instance,
> I needed to increase the memory in the instance from where I run the Solrj.
> But you said I need to increase the memory in the Solr instance only. I just
> want to double-check this case. Sorry for that.
> 
> Once again thanks for your replies.
> 
> Regards,
> 
> 
> On Wed, Mar 16, 2011 at 7:02 PM, Yonik Seeley 
> wrote:
> 
>> On Wed, Mar 16, 2011 at 7:25 AM, rahul  wrote:
>>> In our setup, we are having Solr index in one machine. And Solrj client
>> part
>>> (java code) in another machine. Currently, as you suggest, if it may be a
>>> 'not enough free RAM for the OS to cache' issue, do I need to increase
>>> the RAM in the machine where the Solrj query part runs? Or do I need to
>>> increase RAM for the Solr instance for the OS cache?
>> 
>> That would be RAM for the Solr instance.  If there is not enough free
>> memory for the OS to cache, then each document retrieved will be a
>> disk seek + read.
>> 
>>> Since both the systems are in the local Amazon network (Linux EC2 small
>>> instances), I believe the network won't be an issue.
>> 
>> Ah, how big is your index?
>> 
>>> Another thing: in the reply you mentioned 'client not reading fast
>>> enough'. Is that related to the network or to Solrj?
>> 
>> That was a general issue - it *can* be the client, but since you're
>> using SolrJ it would be the network.
>> 
>> -Yonik
>> http://lucidimagination.com
>> 


Re: Faceting help

2011-03-16 Thread Chris Hostetter

: I'm not sure if I get what you are trying to achieve. What do you mean
: by "constraint"?

"constraint" it fairly standard terminology when refering to facets, it's 
used extensively in our facet docs and is even listed on solr's glossary 
page (allthough not specificyly in hte context of faceting since it can 
be used more broadly then that)...

http://wiki.apache.org/solr/SolrTerminology

In a nutshell:

A facet is a way of classifying objects
A constraint is a viable way of limiting a set of objects

faceted search is a search where feedback on viable constraints 
(usually in the form of counts) is provided for each facet.  (ie: "facet 
counts" or "constraint counts" ... the terms are both used relatively 
loosely)

: > I'm trying to use facet's via widget's within Ajax-Solr. I have tried the
: > wiki for general help on configuring facets and constraints and also
: > attended the recent Lucidworks webinar on faceted search. Can anyone
: > please direct me to some reading on how to formally configure facets for
: > searching.

the beauty of faceting in solr is that it doesn't have to be formally 
configured -- you can specify it all at query time using request params as 
long as the data is indexed...

http://wiki.apache.org/solr/SolrFacetingOverview
http://wiki.apache.org/solr/SimpleFacetParameters

: > Topics < field
: >   Legislation < constraint
: >   Guidance/Policies < constraint
: >   Customer Service information/complaints procedure < constraint
: >   financial information < constraint

if you index a "Topics" field, and that field contains those values as 
indexed terms, then you will get those constraints back using 
"facet.field=Topics"


-Hoss


Re: Replication slows down massively during high load

2011-03-16 Thread Shawn Heisey

On 3/16/2011 6:09 PM, Shawn Heisey wrote:

du -hc *x


I was looking over the files in an index and I think the command needs to 
include more of the files for a true picture of RAM needs.  I get 5.9GB 
running the following command against a 16GB index.  It excludes *.fdt (stored 
field data) and *.tvf (term vector fields), but includes everything else.


du -hc `ls | egrep -v "tvf|fdt"`

If any of the experts have a better handle on which files are consulted 
on virtually all queries, that would help narrow down the OS cache 
requirements.


Thanks,
Shawn



Re: Replication slows down massively during high load

2011-03-16 Thread Shawn Heisey

On 3/16/2011 7:56 AM, Vadim Kisselmann wrote:

If the load is low, both slaves replicate with around 100MB/s from master.

But when I use Solrmeter (100-400 queries/min) for load tests (over
the load balancer), the replication slows down to an unacceptable
speed, around 100KB/s (at least that's what the replication page on
/solr/admin says).



- Same hardware for all servers: Physical machines with quad core
CPUs, 24GB RAM (JVM starts up with -XX:+UseConcMarkSweepGC -Xms10G
-Xmx10G)
- Index size is about 100GB with 40M docs


Primary assumption:  You have a 64-bit OS and a 64-bit JVM.

It sounds to me like you're I/O bound, because your machine cannot keep 
enough of your index in RAM.  Relative to your 100GB index, you only 
have a maximum of 14GB of RAM available to the OS disk cache, since 
Java's heap size is 10GB.  How much disk space do all of the index files 
that end in "x" take up?  I would venture a guess that it's 
significantly more than 14GB.  On Linux, you could do this command to 
tally it quickly:


du -hc *x

If you installed enough RAM so the disk cache can be much larger than 
the total size of those files ending in "x", you'd probably stop having 
these performance issues.  Alternatively, you could take steps to reduce 
the size of your index, or perhaps add more machines and go distributed.


My own index is distributed and replicated.  I've got nearly 53 million 
documents and a total index size of 95GB.  This is split into six shards 
that each are nearly 16GB.  Running that du command I gave you above, 
the total on one shard is 2.5GB, and there is 7GB of RAM available for 
the OS cache.


NB: I could be completely wrong about the source of the problem.

Thanks,
Shawn



Re: Sorting on multiValued fields via function query

2011-03-16 Thread Bill Bell
I agree with this, and it is even needed for function sorting on multivalued 
fields. See the geohash patch for one way to deal with multivalued fields on 
distance. Not ideal, but it works efficiently.

Bill Bell
Sent from mobile


On Mar 16, 2011, at 4:08 PM, Jonathan Rochkind  wrote:

> Huh, so lucene is actually doing what has been commonly described as 
> impossible in Solr?
> 
> But is Solr trunk, as the OP person seemed to report, still not aware of this 
> and raising an error on a sort on a multi-valued field, instead of just saying, 
> okay, we'll just pass it to lucene anyway and go with lucene's approach to sorting 
> on multi-valued fields (that is, apparently, using the largest value)?
> 
> If so... that kind of sounds like a bug/misfeature, yes, no?
> 
> Also... if lucene is already capable of sorting on a multi-valued field by 
> choosing the largest value (largest vs. smallest is presumably just 
> arbitrary there), there is presumably no performance implication to choosing 
> the smallest instead of the largest. It just chooses the largest, according 
> to Yonik.
> 
> So... if someone patched lucene, so whether it chose the largest or smallest 
> in that case was a parameter passed in -- probably not a large patch since 
> lucene, says Yonik, already has been enhanced to choose largest always -- and 
> then patched Solr to take a param and pass it to Lucene for this purpose, 
> which presumably also wouldn't be a large patch if lucene supported it... 
> then we'd have the feature OP asked for.
> 
> Based on Yonik's description (assuming I understand correctly and he's 
> correct), it doesn't sound like a lot of code. But it's still beyond my 
> unfamiliar-with-lucene-code-not-so-great-at-java abilities, nor do I have the 
> interest for my own app needs at the moment. But if OP or someone else has 
> both... sounds like a plausible feature?
> 
> On 3/16/2011 6:00 PM, Yonik Seeley wrote:
>> On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter
>>   wrote:
>>> : However, many of our multiValued fields are single valued for the majority
>>> : of documents in our index so we may not have noticed the incorrect sorting
>>> : behaviors.
>>> 
>>> that would make sense ... if you use a multiValued field as if it were
>>> single valued, you would never encounter a problem.  if you had *some*
>>> multivalued fields your results would be sorted extremely arbitrarily for
>>> those docs that did have multiple values, unless you had more distinct
>>> values than you had documents -- at which point you would get a hard crash
>>> at query time.
>> AFAIK, not any more.  Since that behavior was very unreliable, it has
>> been removed and you can reliably sort by any multi-valued field in
>> lucene (with the sort order being defined by the largest value if
>> there are multiple).
>> 
>> -Yonik
>> http://lucidimagination.com
>> 


Re: hierarchical faceting, SOLR-792 - confused on config

2011-03-16 Thread Koji Sekiguchi

(11/03/17 3:53), Jonathan Rochkind wrote:

Interesting, any documentation on the PathTokenizer anywhere?


It is PathHierarchyTokenizer:

https://hudson.apache.org/hudson/job/Solr-trunk/javadoc/org/apache/solr/analysis/PathHierarchyTokenizerFactory.html

Koji
--
http://www.rondhuit.com/en/


Re: Sorting on multiValued fields via function query

2011-03-16 Thread Jonathan Rochkind
Huh, so lucene is actually doing what has been commonly described as 
impossible in Solr?


But is Solr trunk, as the OP person seemed to report, still not aware of 
this and raising an error on a sort on a multi-valued field, instead of just 
saying, okay, we'll just pass it to lucene anyway and go with lucene's 
approach to sorting on multi-valued fields (that is, apparently, using 
the largest value)?


If so... that kind of sounds like a bug/misfeature, yes, no?

Also... if lucene is already capable of sorting on a multi-valued field by 
choosing the largest value (largest vs. smallest is presumably just 
arbitrary there), there is presumably no performance implication to 
choosing the smallest instead of the largest. It just chooses the 
largest, according to Yonik.


So... if someone patched lucene, so whether it chose the largest or 
smallest in that case was a parameter passed in -- probably not a large 
patch since lucene, says Yonik, already has been enhanced to choose 
largest always -- and then patched Solr to take a param and pass it to 
Lucene for this purpose, which presumably also wouldn't be a large patch 
if lucene supported it... then we'd have the feature OP asked for.


Based on Yonik's description (assuming I understand correctly and he's 
correct), it doesn't sound like a lot of code. But it's still beyond my 
unfamiliar-with-lucene-code-not-so-great-at-java abilities, nor do I 
have the interest for my own app needs at the moment. But if OP or 
someone else has both... sounds like a plausible feature?


On 3/16/2011 6:00 PM, Yonik Seeley wrote:

On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter
  wrote:

: However, many of our multiValued fields are single valued for the majority
: of documents in our index so we may not have noticed the incorrect sorting
: behaviors.

that would make sense ... if you use a multiValued field as if it were
single valued, you would never encounter a problem.  if you had *some*
multivalued fields your results would be sorted extremely arbitrarily for
those docs that did have multiple values, unless you had more distinct
values than you had documents -- at which point you would get a hard crash
at query time.

AFAIK, not any more.  Since that behavior was very unreliable, it has
been removed and you can reliably sort by any multi-valued field in
lucene (with the sort order being defined by the largest value if
there are multiple).

-Yonik
http://lucidimagination.com



Re: Sorting on multiValued fields via function query

2011-03-16 Thread Yonik Seeley
On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter
 wrote:
>
> : However, many of our multiValued fields are single valued for the majority
> : of documents in our index so we may not have noticed the incorrect sorting
> : behaviors.
>
> that would make sense ... if you use a multiValued field as if it were
> single valued, you would never encounter a problem.  if you had *some*
> multivalued fields your results would be sorted extremely arbitrarily for
> those docs that did have multiple values, unless you had more distinct
> values than you had documents -- at which point you would get a hard crash
> at query time.

AFAIK, not any more.  Since that behavior was very unreliable, it has
been removed and you can reliably sort by any multi-valued field in
lucene (with the sort order being defined by the largest value if
there are multiple).

-Yonik
http://lucidimagination.com


Re: Version Incompatibility (Invalid version (expected 2, but 1) or the data is not in 'javabin' format)

2011-03-16 Thread Ahmet Arslan
> > I am using the Solr 4.0 api to search from an index (made using the
> > solr 1.4 version). I am getting the error "Invalid version (expected 2,
> > but 1) or the data is not in 'javabin' format". Can anyone help me fix
> > this problem?
> 
> You need to use solrj version 1.4, which is compatible with
> your index format/version.
> 

Actually there is another solution: using XMLResponseParser instead of 
BinaryResponseParser, which is the default.

new CommonsHttpSolrServer(new URL("http://solr1.4.0Instance:8080/solr"), null, 
new XMLResponseParser(), false);
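
Filled out into a complete sketch (the host name is a placeholder):

import java.net.URL;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.QueryResponse;

public class XmlTransportClient {
    public static void main(String[] args) throws Exception {
        // Use the XML wire format instead of javabin, so a newer SolrJ can
        // talk to an older (1.4) Solr server despite the javabin version bump.
        CommonsHttpSolrServer server = new CommonsHttpSolrServer(
                new URL("http://solr14host:8080/solr"), null,
                new XMLResponseParser(), false);
        QueryResponse rsp = server.query(new SolrQuery("*:*"));
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}

Note that XML is a slower wire format than javabin, so matching the solrj version to the server, as suggested above, is still the cleaner fix.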





Re: Sorting on multiValued fields via function query

2011-03-16 Thread Chris Hostetter

: However, many of our multiValued fields are single valued for the majority
: of documents in our index so we may not have noticed the incorrect sorting
: behaviors.

that would make sense ... if you use a multiValued field as if it were 
single valued, you would never encounter a problem.  if you had *some* 
multivalued fields your results would be sorted extremely arbitrarily for 
those docs that did have multiple values, unless you had more distinct 
values than you had documents -- at which point you would get a hard crash 
at query time.

: Regardless, I understand the reasoning behind the restriction, I'm
: interested in getting around it by using a functionQuery to reduce
: multiValued fields to a single value.  It sounds like this isn't possible,

I don't think we have any functions that do that -- functions are composed 
of valuesources which may be composed of other value sources, but 
ultimately the data comes from somewhere, and in every case i can think of 
(except for constant values) that data comes from the FieldCache -- the 
same FieldCache used for sorting.

I don't think there are any value sources that will let you specify a 
multiValued field and then pick one of those values based on a 
rule/function ... even the PolyFields used for spatial search work by 
using multiple field names under the covers (N distinct field names for an 
N-dimensional space)

: is that correct?  Ideally I'd like to sort by the maximum value on
: descending sorts and the minimum value on ascending sorts.  Is there any
: movement towards implementing this sort of behavior?

this is a fairly classic use case for just having multiple fields.  even if 
the logic was implemented to support this at query time, it could never be 
faster than sorting on a single valued field that you populate with the 
min/max at indexing time -- the mantra of fast I/R is that if you can 
precompute it independently of the individual search criteria, you should 
(it's the whole foundation for why the inverted index exists)
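
As a minimal SolrJ sketch of that index-time precomputation, assuming hypothetical price/price_min/price_max fields:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;

public class MinMaxSortFields {
    // Alongside the multiValued "price" field, also populate single-valued
    // "price_min" and "price_max" so each sort direction has its own field:
    // sort=price_min asc, or sort=price_max desc.
    static SolrInputDocument build(String id, List<Integer> prices) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        for (Integer p : prices) {
            doc.addField("price", p);  // multiValued, not usable for sorting
        }
        doc.addField("price_min", Collections.min(prices));
        doc.addField("price_max", Collections.max(prices));
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(build("doc1", Arrays.asList(10, 11, 12)));
    }
}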


-Hoss


dismax parser, parens, what do they do exactly

2011-03-16 Thread Jonathan Rochkind

It looks like Dismax query parser can somehow handle parens, used for
applying, for instance, + or - to a group, distributing it. But I'm not
sure what effect they have on the overall query.

For instance, if I give dismax this:

book (dog +( cat -frog))

debugQuery shows:

+((DisjunctionMaxQuery((text:book)~0.01)
+DisjunctionMaxQuery((text:dog)~0.01)
DisjunctionMaxQuery((text:cat)~0.01)
-DisjunctionMaxQuery((text:frog)~0.01))~2) ()


How will that be treated by mm?  Let's say I have an mm of 50%.  Does
that apply to the "top-level", like either "book" needs to match or
"+(dog +( cat -frog))" needs to match?  And for "+(dog +( cat -frog))"
to match, do just 50% of that subquery need to match... or is mm ignored
there?  Or something else entirely?

Can anyone clear this up?  Continuing to try experimentally to clear it up... 
it _looks_ like the mm actually applies to each _individual_ low-level query.  
So even though the semantics of:
book (dog +( cat -frog))

are respected, if mm is 50%, the nesting is irrelevant: exactly 50% of "book", "dog", 
"+cat", and "+-frog" (distributing the operators through, I guess?) are required. I think. I'm 
getting confused even talking about it.





Re: Error during auto-warming of key

2011-03-16 Thread Markus Jelsma
Actually, i dug into the logs again and, surprise, it sometimes still occurs with 
`random` queries. Here are a few snippets from the error log. Somewhere 
during that time there might be OOM errors, but older logs are unfortunately 
rotated away.



2011-03-14 00:25:32,152 ERROR [solr.search.SolrCache] - [pool-1-thread-1] - : 
Error during auto-warming of 
key:f_sp_eigenschappen:geo:java.lang.ArrayIndexOutOfBoundsException: 431733
at org.apache.lucene.util.BitVector.get(BitVector.java:102)
at 
org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:152)
at 
org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:642)
at 
org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
at 
org.apache.solr.search.SolrIndexSearcher.cacheDocSet(SolrIndexSearcher.java:520)
at 
org.apache.solr.search.SolrIndexSearcher$2.regenerateItem(SolrIndexSearcher.java:296)
at org.apache.solr.search.FastLRUCache.warm(FastLRUCache.java:168)
at 
org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1481)
at org.apache.solr.core.SolrCore$2.call(SolrCore.java:1131)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)




2011-03-14 00:25:32,795 ERROR [solr.search.SolrCache] - [pool-1-thread-1] - : 
Error during auto-warming of key:+(titel_i:touareg^5.0 | 
f_advertentietype:touareg^2.0 | f_automodel_j:touareg^8.0 | facets:touareg^2.0 
| omschrijving_i:touareg | catlevel1_i:touareg^2.0 | 
catlevel2_i:touareg^4.0)~0.1 () 
(10.0/(7.71E-8*float(ms(const(130003560),date(sort_date)))+1.0))^10.0:java.lang.ArrayIndexOutOfBoundsException: 468554
at org.apache.lucene.util.BitVector.get(BitVector.java:102)
at 
org.apache.lucene.index.SegmentTermDocs.readNoTf(SegmentTermDocs.java:169)
at 
org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:139)
at org.apache.lucene.search.TermScorer.nextDoc(TermScorer.java:130)
at 
org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:145)
at 
org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
at 
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:246)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at 
org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:651)
at 
org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
at 
org.apache.solr.search.SolrIndexSearcher.cacheDocSet(SolrIndexSearcher.java:520)
at 
org.apache.solr.search.SolrIndexSearcher$2.regenerateItem(SolrIndexSearcher.java:296)
at org.apache.solr.search.FastLRUCache.warm(FastLRUCache.java:168)
at 
org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1481)
at org.apache.solr.core.SolrCore$2.call(SolrCore.java:1131)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)




2011-03-14 00:25:33,051 ERROR [solr.search.SolrCache] - [pool-1-thread-1] - : 
Error during auto-warming of key:+*:* 
(10.0/(7.71E-8*float(ms(const(130003560),date(sort_date)))+1.0))^10.0:java.lang.ArrayIndexOutOfBoundsException: 489479
at org.apache.lucene.util.BitVector.get(BitVector.java:102)
at 
org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
at 
org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:562)
at 
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:208)
at 
org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:525)
at 
org.apache.solr.search.function.LongFieldSource.getValues(LongFieldSource.java:57)
at 
org.apache.solr.search.function.DualFloatFunction.getValues(DualFloatFunction.java:48)
at 
org.apache.solr.search.function.ReciprocalFloatFunction.getValues(ReciprocalFloatFunction.java:61)
at 
org.apache.solr.search.function.FunctionQuery$AllScorer.<init>(FunctionQuery.java:123)
at 
org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:93)
at 
org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
at 
org.apache.lucene.search.IndexSearcher.search(IndexS

Re: FunctionQueries and FieldCache and OOM

2011-03-16 Thread Markus Jelsma
Hi,


> FWIW: it sounds like your problem wasn't actually related to your
> fieldCache, but probably instead it was because of how big your
> queryResultCache is

It's the same cluster as in the other thread. I decided a long time ago that 
documentCache and queryResultCache wouldn't be a good idea because of the 
extreme volume of queries, all hitting very different parts of the index. Hit 
ratio's were extremely low even with high cache sizes (and with appropriate 
high JVM heap). The commit rate is also high so auto warming a cache big 
enough to have a high hit ratio would take _very_ long.

> 
> : > > Am i correct when i assume that Lucene FieldCache entries are added
> : > > for each unique function query?  In that case, every query is a
> : > > unique cache
> 
> ...no, the FieldCache has one entry per field name, and the value of that
> cache is an "array" keyed off of the internal docId for every doc in the
> index, and the corresponding value (it's an uninverted version of lucene's
> inverted index for doing fast value lookups by document)
> 
> changes in the *values* used in your function queries won't affect
> FieldCache usage -- only changing the *fields* used in your functions
> would impact that.

Thanks for bringing additional clarity :)

> 
> : > > each unique function query?  In that case, every query is a unique
> : > > cache entry because it operates on milliseconds. If all doesn't work
> : > > i might be
> 
> what you describe is correct, but not in the FieldCache -- the
> queryResultCache is where queries that deal with the main result set (ie:
> paginated and/or sorted) wind up .. having lots of distinct queries in
> the "bq" (or "q") param will make the number of unique items in that cache
> grow significantly (just like having lots of distinct queries in the "fq"
> will cause your filterCache to grow significantly)
> 
> you should definitely check out what max size you have configured for your
> queryResultCache ... it sounds like it's probably too big, if you were
> getting OOM errors from having high precision dates in your boost queries.
> while i think using less precision is a wise choice, you should still
> consider dialing that max size down, so that if some other usage pattern
> still causes lots of unique queries in a short time period (a bot crawling
> your site map perhaps) it doesn't fill up and cause another OOM

That is one of the reasons i chose to disable the queryResultCache. Can you come 
up with new explanations knowing that i only use the filterCache and Lucene's 
fieldValueCache and fieldCache?

Thanks!

> 
> 
> 
> -Hoss


Re: i don't get why my index didn't grow more...

2011-03-16 Thread Yonik Seeley
On Wed, Mar 16, 2011 at 5:10 PM, Robert Petersen  wrote:
> OK I have a 30 gb index where there are lots of sparsely populated int
> fields and then one title field and one catchall field with title and
> everything else we want as keywords, the catchall field.  I figure it is
> the biggest field in our documents which, as I mentioned, is otherwise
> composed of a variety of int fields and a title.
>
>
>
> So my puzzlement is that my biggest field is copied into a double
> metaphone field and now I added another copyfield to also copy the
> catchall field into a newly created soundex field for an experiment to
> compare the effectiveness of the two.  I expected the index to grow by
> at least 25% to 30%, but it barely grew at all.  Can someone explain
> this to me?  Thanks!  :)

I assume you reindexed everything?

Anyway, the size of indexed fields generally grows sub-linearly (as
opposed to stored fields, which grow exactly linearly).
But if it really barely grew at all, this could point to other parts
of the index taking up much more space than you realize.

If you could do an "ls -l" of your index directory, we might be able
to see what parts of the index are using up the most space.

-Yonik
http://lucidimagination.com


Re: FunctionQueries and FieldCache and OOM

2011-03-16 Thread Chris Hostetter

: Alright, i can now confirm the issue has been resolved by reducing precision. 
: The garbage collector on nodes without reduced precision has a real hard time 
: keeping up and clearly shows a very different graph of heap consumption.
: 
: Consider using MINUTE, HOUR or DAY as precision in case you suffer from 
: excessive memory consumption:
: 
: recip(ms(NOW/<precision>,<date_field>),<a>,1,1)

FWIW: it sounds like your problem wasn't actually related to your 
fieldCache, but probably instead it was because of how big your 
queryResultCache is

: > > Am i correct when i assume that Lucene FieldCache entries are added for
: > > each unique function query?  In that case, every query is a unique cache

...no, the FieldCache has one entry per field name, and the value of that 
cache is an "array" keyed off of the internal docId for every doc in the 
index, and the corresponding value (it's an uninverted version of lucene's 
inverted index for doing fast value lookups by document)

changes in the *values* used in your function queries won't affect 
FieldCache usage -- only changing the *fields* used in your functions 
would impact that.
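
A small Lucene sketch of that lookup (2.9/3.x-era API; the index path and field name are placeholders):

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.store.FSDirectory;

public class FieldCachePeek {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        // One uninverted array per (reader, field), indexed by internal docId.
        long[] dates = FieldCache.DEFAULT.getLongs(reader, "sort_date");
        System.out.println("doc 0 sort_date = " + dates[0]);
        // Requesting the same field again returns the same cached array;
        // different *values* in a function query never add entries here.
        reader.close();
    }
}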

: > > each unique function query?  In that case, every query is a unique cache
: > > entry because it operates on milliseconds. If all doesn't work i might be

what you describe is correct, but not in the FieldCache -- the 
queryResultCache is where queries that deal with the main result set (ie: 
paginated and/or sorted) wind up .. having lots of distinct queries in 
the "bq" (or "q") param will make the number of unique items in that cache 
grow significantly (just like having lots of distinct queries in the "fq" 
will cause your filterCache to grow significantly)

you should definitely check out what max size you have configured for your 
queryResultCache ... it sounds like it's probably too big, if you were 
getting OOM errors from having high precision dates in your boost queries.  
while i think using less precision is a wise choice, you should still 
consider dialing that max size down, so that if some other usage pattern 
still causes lots of unique queries in a short time period (a bot crawling 
your site map perhaps) it doesn't fill up and cause another OOM



-Hoss


i don't get why my index didn't grow more...

2011-03-16 Thread Robert Petersen
OK I have a 30 gb index where there are lots of sparsely populated int
fields and then one title field and one catchall field with title and
everything else we want as keywords, the catchall field.  I figure it is
the biggest field in our documents which, as I mentioned, is otherwise
composed of a variety of int fields and a title.

 

So my puzzlement is that my biggest field is copied into a double
metaphone field and now I added another copyfield to also copy the
catchall field into a newly created soundex field for an experiment to
compare the effectiveness of the two.  I expected the index to grow by
at least 25% to 30%, but it barely grew at all.  Can someone explain
this to me?  Thanks!  :)

 



Re: faceting over ngrams

2011-03-16 Thread Dmitry Kan
Hi Yonik,

I have run the queries against a single-index solr with only 16M documents.
After attaching facet.method=fc the results seemed to come back faster (first two
queries below), but still not fast enough.

Here are the fieldValueCache stats:

(facet.limit=100&facet.mincount=5&facet.method=fc, 542094 hits, 1 min)
--> smallest result set

name: fieldValueCache  class: org.apache.solr.search.FastLRUCache  
version: 1.0  description: Concurrent LRU Cache(maxSize=10000, 
initialSize=10, minSize=9000, acceptableSize=9500, cleanupThread=false)  
stats: lookups : 400
hits : 396
hitratio : 0.99
inserts : 1
evictions : 0
size : 1
warmupTime : 0
cumulative_lookups : 400
cumulative_hits : 396
cumulative_hitratio : 0.99
cumulative_inserts : 1
cumulative_evictions : 0
item_shingleContent_trigram :
{field=shingleContent_trigram,memSize=1786355392,tindexSize=17977426,time=662387,phase1=654707,nTerms=53492050,bigTerms=38,termInstances=602090958,uses=397}

(facet.limit=100&facet.mincount=5&facet.method=fc, 2837589 hits, 3 min 8
s) --> largest result set

name: fieldValueCache  class: org.apache.solr.search.FastLRUCache  
version: 1.0  description: Concurrent LRU Cache(maxSize=10000, 
initialSize=10, minSize=9000, acceptableSize=9500, cleanupThread=false)  
stats: lookups : 401
hits : 397
hitratio : 0.99
inserts : 1
evictions : 0
size : 1
warmupTime : 0
cumulative_lookups : 401
cumulative_hits : 397
cumulative_hitratio : 0.99
cumulative_inserts : 1
cumulative_evictions : 0
item_shingleContent_trigram :
{field=shingleContent_trigram,memSize=1786355392,tindexSize=17977426,time=662387,phase1=654707,nTerms=53492050,bigTerms=38,termInstances=602090958,uses=398}
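
For reference, the request behind stats like these would look roughly as follows in SolrJ; the facet params and the field name come from this thread, while the URL and the *:* query are assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TrigramFacetRequest {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("shingleContent_trigram");
        q.setFacetLimit(100);          // facet.limit=100
        q.setFacetMinCount(5);         // facet.mincount=5
        q.set("facet.method", "fc");   // uninverted field-cache method
        QueryResponse rsp = solr.query(q);
        System.out.println("QTime: " + rsp.getQTime() + " ms");
    }
}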


On Wed, Mar 16, 2011 at 9:46 PM, Yonik Seeley wrote:

> On Wed, Mar 16, 2011 at 8:05 AM, Dmitry Kan  wrote:
> > Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over
> the
> > trigrams field with about 1 million of entries in the result set and more
> > than 100 million of entries to facet on in the index. Currently the
> faceted
> > search is very slow, taking about 5 minutes per query. Would running on a
> > cloud with Hadoop make it faster (to seconds) as faceting seems to be a
> > natural map-reduce task?
>
> How many indexed tokens does each document have (for the field you are
> faceting on) on average?
> How many unique tokens are indexed in that field over the complete index?
>
> Or you could go to the admin/stats page and cut-n-paste the
> fieldValueCache entry after your faceting request - it should contain
> most of the info to further analyze this.
>
> -Yonik
> http://lucidimagination.com
>



-- 
Regards,

Dmitry Kan


Re: Error during auto-warming of key

2011-03-16 Thread Markus Jelsma
> that is odd...
> 
> can you let us know exactly what version of Solr/Lucene you are using (if
> it's not an official release, can you let us know exactly what the version
> details on the admin info page say, i'm curious about the svn revision)

Of course, that's the stable 1.4.1.

> 
> can you also please let us know what types of queries you are generating?
> ... that's the toString output of a query and it's not entirely clear what
> the original looked like.  If you can recognize what the original query
> was, it would also be helpful to know if you can consistently reproduce
> this error on autowarming after executing that query (or queries like it
> with a slightly different date value)

It's extremely difficult to reproduce. It happened on a multinode system that's 
being prepared for production. It has been under heavy load for a long time 
already, updates and queries. It is continuously being updated with real user 
input and receives real user queries from a source that's being updated from 
logs. Solr is about to replace an existing search solution.

It is impossible to reproduce because of these uncontrollable variables; i 
tried but failed. The error, however, did occur at least a couple of times 
after i started this thread.

It hasn't reappeared after i reduced precision from milliseconds to an hour, 
see my other thread for more information:
http://web.archiveorange.com/archive/v/AAfXfFuqjPhU4tdq53Tv


> 
> One of the things that particularly boggles me is this...
> 
> : org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
> : 
> : at org.apache.solr.search.SolrIndexSearcher.cacheDocSet(SolrIndexSearcher.java:520)
> 
>   [...]
> 
> : Well, i use Dismax' bf parameter to boost very recent documents. I'm not
> : using the queryResultCache or documentCache, only filterCache and Lucene
> : fieldCache.
> 
> ... that cache warming stack trace seems to be coming from filterCache,
> but that contradicts your statement that you don't use the filterCache.
> independent of your comments, that's an odd looking query to be cached in
> the filter cache anyway, since it includes a mandatory matchalldocs
> clause, and seems to only exist for boosting on that function.

But i am using filterCache and fieldCache (forgot to mention the obvious 
fieldValueCache as well).

If you have any methods that may help to reproduce it, i'm of course willing to 
take the time and see if i can. It may prove really hard because several weird 
errors were not reproducible in a more controlled but similar environment 
(load and config) and i can't mess with the soon-to-be production cluster.

Thanks!
> 
> 
> -Hoss


Re: 'Registering' a query / Percolation

2011-03-16 Thread Chris Hostetter

: I.E. Instruct Solr that you are interested in documents that match a
: given query and then have Solr notify you (through whatever callback
: mechanism is specified) if and when a document appears that matches the
: query.
: 
: We are planning on writing some software that will effectively grind
: Solr to give us the same behaviour, but if Solr has this registration
: built in, it would be very useful and much easier on our resources...

it does not, but there are typically two ways people deal with this 
depending on the balance of your variables...

* max latency of notifications after doc is added/updated
* rate of churn of documents in index
* number of registered queries for notification

1) if you have a heavy churn of documents, and the max latency allowed for 
notification is large, then doing periodic polling at a frequency of that 
latency can be preferable to minimize the amount of redundent work

2) if the churn on documents is going to be relatively small and/or the 
number of registered queries is going to be relatively large, you can 
invert the problem and build an index where each document represents a 
query, and as documents are added/updated you use the terms in those 
documents to query your "query index" (this could even be done as an 
UpdateProcessor on your doc core, querying over to some other 
"notifications" core)


(disclaimer: i've never implemented any of these ideas personally, this is 
just what i've picked up over the years on the mailing lists)

-Hoss


Re: Using Solr 1.4.1 on most recent Tomcat 7.0.11

2011-03-16 Thread François Schiettecatte
Lewis

Quick response: I am currently using Tomcat 7.0.8 with Solr (with no issues). I 
will upgrade to 7.0.11 tonight and see if I run into the same issues.

Stay tuned as they say.

Cheers

François

On Mar 16, 2011, at 2:38 PM, McGibbney, Lewis John wrote:

> Hello list,
> 
> Is anyone running Solr (in my case 1.4.1) on above Tomcat dist? In the
> past I have been using guidance in accordance with
> http://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat
> but having upgraded from Tomcat 7.0.8 to 7.0.11 I am having problems
> E.g.
> 
> INFO: Deploying configuration descriptor wombra.xml < This is my context
> fragment
> from /home/lewis/Downloads/apache-tomcat-7.0.11/conf/Catalina/localhost
> 16-Mar-2011 16:57:36 org.apache.tomcat.util.digester.Digester fatalError
> SEVERE: Parse Fatal Error at line 4 column 6: The processing instruction
> target matching "[xX][mM][lL]" is not allowed.
> org.xml.sax.SAXParseException: The processing instruction target
> matching "[xX][mM][lL]" is not allowed.
> ...
> 16-Mar-2011 16:57:36 org.apache.catalina.startup.HostConfig
> deployDescriptor
> SEVERE: Error deploying configuration descriptor wombra.xml
> org.xml.sax.SAXParseException: The processing instruction target
> matching "[xX][mM][lL]" is not allowed.
> ...
> some more
> ...
> 
> My configuration descriptor is as follows
> 
> [Context descriptor XML stripped by the mail archive; the surviving
> attributes are crossContext="true" on the Context element and
> value="/home/lewis/Downloads/wombra" override="true" on the solr/home entry]
> 
> 
> Preferably I would upload a WAR file, but I have been working well with
> the configuration I have been using up until now, therefore I didn't
> question changing it.
> I am unfamiliar with the above errors. Can anyone please point me in the
> right direction?
> 
> Thank you
> Lewis
> 



Re: Error during auto-warming of key

2011-03-16 Thread Chris Hostetter
: 
: Yesterday's error log contains something peculiar: 
: 
:  ERROR [solr.search.SolrCache] - [pool-29-thread-1] - : Error during auto-
: warming of key:+*:* 
: (1.0/(7.71E-8*float(ms(const(1298682616680),date(sort_date)))+1.0))^20.0:java.lang.NullPointerException
: at org.apache.lucene.util.StringHelper.intern(StringHelper.java:36)
: at org.apache.lucene.search.FieldCacheImpl$Entry.<init>(FieldCacheImpl.java:275)
: at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:525)
: at org.apache.solr.search.function.LongFieldSource.getValues(LongFieldSource.java:57)
: at org.apache.solr.search.function.DualFloatFunction.getValues(DualFloatFunction.java:48)
: at org.apache.solr.search.function.ReciprocalFloatFunction.getValues(ReciprocalFloatFunction.java:61)

that is odd...

can you let us know exactly what version of Solr/Lucene you are using (if 
it's not an official release, can you let us know exactly what the version 
details on the admin info page say, i'm curious about the svn revision)

can you also please let us know what types of queries you are generating? 
... that's the toString output of a query and it's not entirely clear what 
the original looked like.  If you can recognize what the original query 
was, it would also be helpful to know if you can consistently reproduce 
this error on autowarming after executing that query (or queries like it 
with a slightly different date value)

One of the things that particularly boggles me is this...

: org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
: at org.apache.solr.search.SolrIndexSearcher.cacheDocSet(SolrIndexSearcher.java:520)

[...]

: Well, i use Dismax' bf parameter to boost very recent documents. I'm not using
: the queryResultCache or documentCache, only filterCache and Lucene fieldCache.

... that cache warming stack trace seems to be coming from filterCache, 
but that contradicts your statement that you don't use the filterCache.  
independent of your comments, that's an odd looking query to be cached in 
the filter cache anyway, since it includes a mandatory matchalldocs 
clause, and seems to only exist for boosting on that function.


-Hoss


Re: faceting over ngrams

2011-03-16 Thread Yonik Seeley
On Wed, Mar 16, 2011 at 8:05 AM, Dmitry Kan  wrote:
> Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over the
> trigrams field with about 1 million of entries in the result set and more
> than 100 million of entries to facet on in the index. Currently the faceted
> search is very slow, taking about 5 minutes per query. Would running on a
> cloud with Hadoop make it faster (to seconds) as faceting seems to be a
> natural map-reduce task?

How many indexed tokens does each document have (for the field you are
faceting on) on average?
How many unique tokens are indexed in that field over the complete index?

Or you could go to the admin/stats page and cut-n-paste the
fieldValueCache entry after your faceting request - it should contain
most of the info to further analyze this.

-Yonik
http://lucidimagination.com


Re: faceting over ngrams

2011-03-16 Thread Dmitry Kan
Hi Toke,

Thanks a lot for trying this out. I have to mention that the faceted
search hits only one specific shard by design, so in general the time to
query a shard directly and through the "proxy" SOLR should be comparable.

Would it be feasible for you to make that field ngram'ed, or is it too much
of a worry for you?

I'll check out the direct query and let you know.

On Wed, Mar 16, 2011 at 5:51 PM, Toke Eskildsen wrote:

> On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote:
> > Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over
> the
> > trigrams field with about 1 million of entries in the result set and more
> > than 100 million of entries to facet on in the index. Currently the
> faceted
> > search is very slow, taking about 5 minutes per query.
>
> I tried creating an index with 1M documents, each with 100 unique terms
> in a field. A search for "*:*" with a facet request for the first 1M
> entries in the field took about 20 seconds for the first call and about
> 1-1½ second for each subsequent call. This was with Solr trunk. The
> complexity of my setup is no doubt a lot simpler and lighter than yours,
> but 5 minutes sounds excessive.
>
> My guess is that your performance problem is due to the merging process.
> Could you try measuring the performance of a direct request to a single
> shard? If that is satisfactory, going to the cloud would not solve your
> problem. If you really need 1M entries in your result set, you would be
> better off investigating whether your index can be in a single instance.
>
>


-- 
Regards,

Dmitry Kan


Re: SOLR DIH importing MySQL "text" column as a BLOB

2011-03-16 Thread Jayendra Patil
Hi Kaushik,

If the field is being treated as a blob, you can try using the
FieldStreamDataSource mapping.
It handles blob objects and extracts the contents from them.

This feature is available only after Solr 3.1, I suppose.
http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/FieldStreamDataSource.html

Regards,
Jayendra

On Tue, Mar 15, 2011 at 11:57 PM, Kaushik Chakraborty
 wrote:
> I have a column for posts in MySQL of type `text`, and I've tried corresponding
> `field-type`s for it in Solr `schema.xml`, e.g. `string, text, text-ws`. But
> whenever I'm importing it using the DIH, it's getting imported as a BLOB
> object. I checked, and this happens only for columns of type `text`
> and not for `varchar` (they are getting indexed as strings). Hence, the posts
> field is not becoming searchable.
>
> I found out about this issue, after repeated search failures, when I did a `*:*`
> query search on Solr. A sample response:
>
>        [response XML stripped by the mail archive; the document's field
>        values were: 1.0, [B@10a33ce2, 2011-02-21T07:02:55Z,
>        test.acco...@gmail.com, Test, Account, [B@2c93c4f1, 1 -- the
>        [B@... values are the text columns coming back as raw byte arrays]
>
> The `data-config.xml` :
>        [data-config.xml stripped by the mail archive]
> The `schema.xml` :
>    [schema.xml field definitions stripped by the mail archive; the surviving
>    attributes show fields declared with indexed="true" stored="true", some
>    required="true"]
>    solr_post_status_message_id
>    solr_post_message
>
>
> Thanks,
> Kaushik
>


RE: hierarchical faceting, SOLR-792 - confused on config

2011-03-16 Thread McGibbney, Lewis John
Hi Erik,

I have been reading about the progression of SOLR-792 into pivot faceting; 
however, can you expand on where it is committed? Are you referring to trunk?
The reason I am asking is that I have been using 1.4.1 for some time now and 
have been thinking of upgrading to trunk... or branch.

Thank you Lewis

From: Erik Hatcher [erik.hatc...@gmail.com]
Sent: 16 March 2011 17:36
To: solr-user@lucene.apache.org
Subject: Re: hierarchical faceting, SOLR-792 - confused on config

Sorry, I missed the original mail on this thread

I put together that hierarchical faceting wiki page a couple of years ago when 
helping a customer evaluate SOLR-64 vs. SOLR-792 vs. other approaches.  Since 
then, SOLR-792 morphed and is committed as pivot faceting.  SOLR-64 spawned a 
PathTokenizer which is part of Solr now too.

Recently Toke updated that page with some additional info.  It's definitely not 
a "how to" page, and perhaps should get renamed/moved/revamped?  Toke?

Erik




Re: hierarchical faceting, SOLR-792 - confused on config

2011-03-16 Thread Jonathan Rochkind
Interesting, any documentation on the PathTokenizer anywhere? Or just 
have to find and look at the source? That's something I hadn't known 
about, which may be useful to some stuff I've been working on depending 
on how it works.


If nothing else, in the meantime, I'm going to take that exact message 
from Erik and just add it to the top of the wiki page, to avoid other 
people getting confused (I've been confused by that page too) until 
someone spends the time to rewrite it to be more up to date and 
accurate, or clear about its topicality.


On 3/16/2011 1:36 PM, Erik Hatcher wrote:

Sorry, I missed the original mail on this thread

I put together that hierarchical faceting wiki page a couple of years ago when 
helping a customer evaluate SOLR-64 vs. SOLR-792 vs. other approaches.  Since 
then, SOLR-792 morphed and is committed as pivot faceting.  SOLR-64 spawned a 
PathTokenizer which is part of Solr now too.

Recently Toke updated that page with some additional info.  It's definitely not a 
"how to" page, and perhaps should get renamed/moved/revamped?  Toke?

Erik

On Mar 16, 2011, at 12:39 , McGibbney, Lewis John wrote:


Hi,

This is also where I am having problems. I have not been able to understand 
very much on the wiki.
I do not understand how to configure the faceting we are referring to.
Although I know very little about this, I can't help but think that the wiki is 
quite clearly inaccurate in some way!

Any comments please
Lewis

From: kmf [kfole...@gmail.com]
Sent: 23 February 2011 17:10
To: solr-user@lucene.apache.org
Subject: Re: hierarchical faceting, SOLR-792 - confused on config

I'm really confused now.  Is this page completely out of date -
http://wiki.apache.org/solr/HierarchicalFaceting - as it seems to imply that
solr-792 is a form of hierarchical faceting. "There are currently two
similar, non-competing, approaches to generating tree/hierarchical facets
from Solr: SOLR-64 and SOLR-792"

To achieve hierarchical faceting, is the rule then that you form the
hierarchical facets using a transformer in the DIH and do nothing in
schema.xml or solrconfig.xml?   I seem to recall reading somewhere that
creating a copyField is needed.  Sorry for the entry level question but, I'm
still trying to understand how to configure solr to do hierarchical
faceting.

Thanks,
kmf






Re: faceting over ngrams

2011-03-16 Thread Jonathan Rochkind
Oh, a doc count over 100M is a very different thing than a doc count around 
1M. In your original message you said "I tried creating an index with 1M 
documents, each with 100 unique terms in a field." If you instead have 
100M documents, your use is a couple orders of magnitude larger than mine.


It also occurs to me that while I have around 3 million documents, and 
probably up to 50 million or so unique values in the multi-valued 
faceted field -- each document only has 3-10 values, not 100 each. So 
that may also be a difference that affects the faceting algorithm to 
your detriment, not sure.


Prior to Solr 1.4, it was pretty much impossible to facet over 1 
million+ unique values at all, now it works wonderfully in many use 
cases, but you may have found one that's still too much for it.


It also raises my curiosity as to why you'd want to facet over an 
n-grammed field to begin with; that's definitely not an ordinary use 
case. Perhaps there is some way to do what you need without faceting? 
But you probably know what you're doing.


Jonathan

On 3/16/2011 2:25 PM, Dmitry Kan wrote:

Hi Jonathan,

Thanks for sharing useful bits. Each shard has 16G of heap. Unless I 
do something fundamentally wrong in the SOLR configuration, I have to 
admit that counting ngrams up to trigrams across the whole set of a shard's 
documents is a pretty intensive task, as each ngram can occur anywhere 
in the index and SOLR most probably doesn't precompute the cumulative 
count of it. I'll try querying with facet.method=fc, thanks for that.


By the way, the trigrams are defined like this:

[fieldType definition stripped by the mail archive; the surviving attributes 
are positionIncrementGap="100" on the fieldType and outputUnigrams="true" 
on a shingle filter]
For the sharding -- I decided to go with it when the index size 
approached half a terabyte and the doc count went over 100M; I thought it 
would help us scale better. I also maintain a good level of caching, and 
so far the faceting over normal string fields (no ngrams) has performed 
really well (around 1 sec).



On Wed, Mar 16, 2011 at 6:23 PM, Jonathan Rochkind > wrote:


Ah, wait, you're doing sharding?  Yeah, I am NOT doing sharding,
so that could explain our different experiences.  It seems like
sharding definitely has trade-offs, makes some things faster and
other things slower. So far I've managed to avoid it, in the
interest of keeping things simpler and easier to understand (for
me, the developer/Solr manager), thinking that sharding is also a
somewhat less mature feature.

With only 1M documents are you sure you need sharding at all?
 You could still use replication to "scale out" for volume,
sharding seems more about scaling for number of documents (or
total bytes) in your index.  1M documents is not very large, for
Solr, in general.

Jonathan


On 3/16/2011 11:51 AM, Toke Eskildsen wrote:

On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote:

Hello guys. We are using shard'ed solr 1.4 for heavy
faceted search over the
trigrams field with about 1 million of entries in the
result set and more
than 100 million of entries to facet on in the index.
Currently the faceted
search is very slow, taking about 5 minutes per query.

I tried creating an index with 1M documents, each with 100
unique terms
in a field. A search for "*:*" with a facet request for the
first 1M
entries in the field took about 20 seconds for the first call
and about
1-1½ second for each subsequent call. This was with Solr
trunk. The
complexity of my setup is no doubt a lot simpler and lighter
than yours,
but 5 minutes sounds excessive.

My guess is that your performance problem is due to the
merging process.
Could you try measuring the performance of a direct request to
a single
shard? If that is satisfactory, going to the cloud would not
solve your
problem. If you really need 1M entries in your result set, you
would be
better off investigating whether your index can be in a single
instance.




--
Regards,

Dmitry Kan


Using Solr 1.4.1 on most recent Tomcat 7.0.11

2011-03-16 Thread McGibbney, Lewis John
Hello list,

Is anyone running Solr (in my case 1.4.1) on above Tomcat dist? In the
past I have been using guidance in accordance with
http://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat
but having upgraded from Tomcat 7.0.8 to 7.0.11 I am having problems
E.g.

INFO: Deploying configuration descriptor wombra.xml < This is my context
fragment
from /home/lewis/Downloads/apache-tomcat-7.0.11/conf/Catalina/localhost
16-Mar-2011 16:57:36 org.apache.tomcat.util.digester.Digester fatalError
SEVERE: Parse Fatal Error at line 4 column 6: The processing instruction
target matching "[xX][mM][lL]" is not allowed.
org.xml.sax.SAXParseException: The processing instruction target
matching "[xX][mM][lL]" is not allowed.
...
16-Mar-2011 16:57:36 org.apache.catalina.startup.HostConfig
deployDescriptor
SEVERE: Error deploying configuration descriptor wombra.xml
org.xml.sax.SAXParseException: The processing instruction target
matching "[xX][mM][lL]" is not allowed.
...
some more
...

My configuration descriptor is as follows:

   [the descriptor XML was not preserved by the list archiver]

Preferably I would upload a WAR file, but I have been working well with
the configuration I have been using up until now, therefore I didn't
question changing it.
I am unfamiliar with the above errors. Can anyone please point me in the
right direction?

Thank you
Lewis
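
For what it's worth, that SAXParseException usually means the <?xml ...?>
declaration is not the very first thing in the descriptor file (the parser
reports it at line 4, so something precedes it). A minimal context fragment of
the kind the SolrTomcat wiki page describes might look like this; the paths
here are placeholders, not the actual ones:

<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/opt/solr/home" override="true"/>
</Context>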





RE: Different options for autocomplete/autosuggestion

2011-03-16 Thread Robert Petersen
I take raw user search term data, 'collapse' it into a form where I have
only unique terms, per store, ordered by frequency of searches over some
time period.  The suggestions are then grouped and presented with store
breakouts.  That sounds kind of like what this page is talking about
here, but I could be using the wrong terminology:
http://wiki.apache.org/solr/FieldCollapsing


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Tuesday, March 15, 2011 9:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Different options for autocomplete/autosuggestion

Hi,

I actually don't follow how field collapsing helps with
autocompletion...?

Over at http://search-lucene.com we eat our own autocomplete dog food: 
http://sematext.com/products/autocomplete/index.html .  Tasty stuff.

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Kai Schlamp 
> To: solr-user@lucene.apache.org
> Sent: Mon, March 14, 2011 11:52:48 PM
> Subject: Re: Different options for autocomplete/autosuggestion
> 
> @Robert: That sounds interesting and very flexible, but also like a
> lot of work. This approach also doesn't seem to allow querying Solr
> directly by using Ajax ... one of the big benefits in my opinion when
> using Solr.
> @Bill: There are some things I don't like about the Suggester
> component. It doesn't seem to allow infix searches (at least it is not
> mentioned in the Wiki or elsewhere). It also uses a separate index
> that has to be rebuilt independently of the main index. And it doesn't
> support any filter queries.
> 
> The Lucid Imagination blog also describes a further autosuggest
> approach
> (http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/).
>
> The disadvantage here is that the source documents must have distinct
> fields (resp. the dih selects must provide distinct data). Otherwise
> duplications would come up in the Solr query result, because of the
> document nature of Solr.
> 
> In my opinion field collapsing seems to be most promising for a full
> featured autosuggestion solution. Unfortunately it is not available
> for Solr 1.4.x or 3.x (I tried patching those branches several times
> without success).
> 
> 2011/3/15 Bill Bell:
> >
> > http://lucidworks.lucidimagination.com/display/LWEUG/Spell+Checking+and+Automatic+Completion+of+User+Queries
> >
> > For Auto-Complete, find the following section in the solrconfig.xml file
> > for the collection:
> >
> >   <searchComponent name="autocomplete" class="solr.SpellCheckComponent">
> >     <lst name="spellchecker">
> >       <str name="name">autocomplete</str>
> >       <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
> >       <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
> >       <str name="field">autocomplete</str>
> >       <str name="buildOnCommit">true</str>
> >     </lst>
> >   </searchComponent>
> >
> > On 3/14/11 8:16 PM, "Andy" wrote:
> >
> >> Can you provide more details? Or a link?
> >>
> >> --- On Mon, 3/14/11, Bill Bell wrote:
> >>
> >>> See how Lucid Enterprise does it... A bit differently.
> >>>
> >>> On 3/14/11 12:14 AM, "Kai Schlamp" wrote:
> >>>
> >>>> Hi.
> >>>>
> >>>> There seem to be several options for implementing an
> >>>> autocomplete/autosuggestions feature with Solr. I am trying to
> >>>> summarize those possibilities together with their advantages and
> >>>> disadvantages. It would be really nice to read some of your opinions.
> >>>>
> >>>> * Using N-Gram filter + text field query
> >>>> + available in stable 1.4.x
> >>>> + results can be boosted
> >>>> + sorted by best matches
> >>>> - may return duplicate results
> >>>>
> >>>> * Facets
> >>>> + available in stable 1.4.x
> >>>> + no duplicate entries
> >>>> - sorted by count
> >>>> - may need an extra N-Gram field for infix queries
> >>>>
> >>>> * Terms
> >>>> + available in stable 1.4.x
> >>>> + infix query by using regex in 3.x
> >>>> - only prefix query in 1.4.x
> >>>> - regexp may be slow (just a guess)
> >>>>
> >>>> * Suggestions
> >>>> ? Did not try that yet. Does it allow infix queries?
> >>>>
> >>>> * Field Collapsing
> >>>> + no duplications
> >>>> - only available in 4.x branch
> >>>> ? Does it work together with highlighting? That would be a big plus.
> >>>>
> >>>> What are your experiences regarding autocomplete/autosuggestion with
> >>>> Solr? Any additions, suggestions or corrections? What do you prefer?
> >>>>
> >>>> Kai
> 
> 
> 
> -- 
> Dr. med. Kai Schlamp
> Am Fort Elisabeth 17
> 55131  Mainz
> Germany
> Phone +49-177-7402778
> Email: schl...@gmx.de
> 


Re: hierarchical faceting, SOLR-792 - confused on config

2011-03-16 Thread Erik Hatcher
Sorry, I missed the original mail on this thread.

I put together that hierarchical faceting wiki page a couple of years ago when 
helping a customer evaluate SOLR-64 vs. SOLR-792 vs. other approaches.  Since 
then, SOLR-792 morphed and was committed as pivot faceting.  SOLR-64 spawned a 
PathTokenizer which is part of Solr now too.

Recently Toke updated that page with some additional info.  It's definitely not 
a "how to" page, and perhaps should get renamed/moved/revamped?  Toke?

Erik

On Mar 16, 2011, at 12:39 , McGibbney, Lewis John wrote:

> Hi,
> 
> This is also where I am having problems. I have not been able to understand 
> very much from the wiki.
> I do not understand how to configure the faceting we are referring to.
> Although I know very little about this, I can't help but think that the wiki 
> is quite clearly inaccurate in some way!
> 
> Any comments please
> Lewis
> 
> From: kmf [kfole...@gmail.com]
> Sent: 23 February 2011 17:10
> To: solr-user@lucene.apache.org
> Subject: Re: hierarchical faceting, SOLR-792 - confused on config
> 
> I'm really confused now.  Is this page completely out of date -
> http://wiki.apache.org/solr/HierarchicalFaceting - as it seems to imply that
> solr-792 is a form of hierarchical faceting. "There are currently two
> similar, non-competing, approaches to generating tree/hierarchical facets
> from Solr: SOLR-64 and SOLR-792"
> 
> To achieve hierarchical faceting, is the rule then that you form the
> hierarchical facets using a transformer in the DIH and do nothing in
> schema.xml or solrconfig.xml?   I seem to recall reading somewhere that
> creating a copyField is needed.  Sorry for the entry level question but, I'm
> still trying to understand how to configure solr to do hierarchical
> faceting.
> 
> Thanks,
> kmf
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/hierarchical-faceting-SOLR-792-confused-on-config-tp2556394p2561445.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 



Error: "Unbuffered entity enclosing request can not be repeated."

2011-03-16 Thread André Santos
Hi all!

I created a SolrJ project to test Solr. So, I am inserting batches of
7000 records, each with 200 attributes, which adds up to approximately 13.77
MB per batch.

I am measuring the time it takes to add and commit each set of 7000
records to an instantiation of CommonsHttpSolrServer.
Each of the first 6 batches takes approximately 17 to 21 seconds.
The 7th batch takes 42 sec and the 8th takes 1 min.

And when it adds the 9th batch to the server it generates this error:

Mar 16, 2011 4:56:20 PM org.apache.commons.httpclient.HttpMethodDirector
executeWithRetry
INFO: I/O exception (java.net.SocketException) caught when processing
request: Connection reset
Mar 16, 2011 4:56:21 PM org.apache.commons.httpclient.HttpMethodDirector
executeWithRetry
INFO: Retrying request
Exception in thread "main" org.apache.solr.client.solrj.SolrServerException:
org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing
request can not be repeated.
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)


I googled this error, and one of the suggestions was to reduce the number
of records per batch. But I want to achieve a solution with at
least 7000 records per batch.
Any help would be appreciated.
André
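
For reference, the automatic retry that produces this error can be switched
off on the commons-httpclient 3.x client that SolrJ 1.4 uses, so that the
underlying SocketException surfaces and the whole batch can be retried at the
application level. A minimal sketch (the URL is a placeholder):

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.params.HttpMethodParams;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class NoRetrySolrClient {
    public static CommonsHttpSolrServer create() throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // The POST body is streamed (unbuffered), so HttpClient cannot replay
        // it after a connection reset; disable the automatic retry instead.
        server.getHttpClient().getParams().setParameter(
            HttpMethodParams.RETRY_HANDLER,
            new DefaultHttpMethodRetryHandler(0, false));
        return server;
    }
}

The connection reset itself often points at the server side (for example
garbage-collection pauses or merges during large commits), so the Solr logs
around the failing batch are worth checking too.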


Re: Solrj performance bottleneck

2011-03-16 Thread Yonik Seeley
On Wed, Mar 16, 2011 at 12:56 PM, Asharudeen  wrote:
> Currently my index size is around 4GB. Normally in small instances the total
> available memory is 1.6GB. In my setup, I allocated around 1GB as the
> heap size for Tomcat. Hence I believe the remaining 600 MB will be used for
> the OS cache.

Actually, even less.  A JVM with a 1GB heap will take up even
more memory than that (since the heap size does not count stuff not on the heap,
like the JVM code itself).  This is definitely your problem.

> I believe, I need to migrate my Solr instance from small instance to large.
> So that some more memory will be allotted for OS cache. But initially I
> suspect, since I call Solrj code from another instance, I need to increase
> the memory in the instance from where I run the Solrj. But you said I need
> to increase the memory in Solr instance only. Here, just I want to double
> check this case only. sorry for that.

SolrJ itself won't take up much memory.  It depends on what else your
client app is doing, but a small instance may be fine.

-Yonik
http://lucidimagination.com


Re: Solrj performance bottleneck

2011-03-16 Thread Asharudeen
Hi

Thanks for your info.

Currently my index size is around 4GB. Normally in small instances the total
available memory is 1.6GB. In my setup, I allocated around 1GB as the
heap size for Tomcat. Hence I believe the remaining 600 MB will be used for
the OS cache.

I believe I need to migrate my Solr instance from a small instance to a large
one, so that some more memory will be allotted for the OS cache. But initially
I suspected that, since I call the Solrj code from another instance, I needed
to increase the memory on the instance from where I run Solrj. But you said I
need to increase the memory on the Solr instance only. Here I just want to
double-check this case; sorry for that.

Once again thanks for your replies.

Regards,


On Wed, Mar 16, 2011 at 7:02 PM, Yonik Seeley wrote:

> On Wed, Mar 16, 2011 at 7:25 AM, rahul wrote:
> > In our setup, we have the Solr index on one machine and the Solrj client
> > part (java code) on another machine. Currently, as you suggest, if it may
> > be a case of 'not enough free RAM for the OS to cache', do I need to
> > increase the RAM on the machine where the Solrj query part runs? Or do I
> > need to increase the RAM of the Solr instance for the OS cache?
>
> That would be RAM for the Solr instance.  If there is not enough free
> memory for the OS to cache, then each document retrieved will be a
> disk seek + read.
>
> > Since both systems are on the local Amazon network (Linux EC2 small
> > instances), I believe the network won't be an issue.
>
> Ah, how big is your index?
>
> > Another thing: in your reply you mentioned 'the client not reading fast
> > enough'. Is that related to the network or to Solrj?
>
> That was a general issue - it *can* be the client, but since you're
> using SolrJ it would be the network.
>
> -Yonik
> http://lucidimagination.com
>


RE: hierarchical faceting, SOLR-792 - confused on config

2011-03-16 Thread McGibbney, Lewis John
Hi,

This is also where I am having problems. I have not been able to understand 
very much from the wiki.
I do not understand how to configure the faceting we are referring to.
Although I know very little about this, I can't help but think that the wiki is 
quite clearly inaccurate in some way!

Any comments please
Lewis

From: kmf [kfole...@gmail.com]
Sent: 23 February 2011 17:10
To: solr-user@lucene.apache.org
Subject: Re: hierarchical faceting, SOLR-792 - confused on config

I'm really confused now.  Is this page completely out of date -
http://wiki.apache.org/solr/HierarchicalFaceting - as it seems to imply that
solr-792 is a form of hierarchical faceting. "There are currently two
similar, non-competing, approaches to generating tree/hierarchical facets
from Solr: SOLR-64 and SOLR-792"

To achieve hierarchical faceting, is the rule then that you form the
hierarchical facets using a transformer in the DIH and do nothing in
schema.xml or solrconfig.xml?   I seem to recall reading somewhere that
creating a copyField is needed.  Sorry for the entry level question but, I'm
still trying to understand how to configure solr to do hierarchical
faceting.

Thanks,
kmf
--
View this message in context: 
http://lucene.472066.n3.nabble.com/hierarchical-faceting-SOLR-792-confused-on-config-tp2556394p2561445.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: SOLR DIH importing MySQL "text" column as a BLOB

2011-03-16 Thread Gora Mohanty
On Wed, Mar 16, 2011 at 9:50 PM, Kaushik Chakraborty  wrote:
> The query's there in the data-config.xml. And the query's fetching as
> expected from the database.
[...]

Doh! Sorry, had missed that somehow.

So, the relevant part is:
SELECT ... p.message as solr_post_message,

What is the field type of p.message in mysql?
I cannot remember off the top of my head for
mysql, but if it is a TEXT field, you might want
to look into the ClobTransformer:
http://wiki.apache.org/solr/DataImportHandler#ClobTransformer

Regards,
Gora
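
For reference, wiring the transformer in looks roughly like this in
data-config.xml (a sketch; the entity name and query stand in for the ones in
the original config):

<entity name="post" transformer="ClobTransformer"
        query="SELECT p.status_message_id, p.message AS solr_post_message FROM posts p">
  <field column="solr_post_message" clob="true"/>
</entity>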


Re: faceting over ngrams

2011-03-16 Thread Jonathan Rochkind
Ah, wait, you're doing sharding?  Yeah, I am NOT doing sharding, so that 
could explain our different experiences.  It seems like sharding 
definitely has trade-offs, makes some things faster and other things 
slower. So far I've managed to avoid it, in the interest of keeping 
things simpler and easier to understand (for me, the developer/Solr 
manager), thinking that sharding is also a somewhat less mature feature.


With only 1M documents are you sure you need sharding at all?  You 
could still use replication to "scale out" for volume, sharding seems 
more about scaling for number of documents (or total bytes) in your 
index.  1M documents is not very large, for Solr, in general.


Jonathan

On 3/16/2011 11:51 AM, Toke Eskildsen wrote:

> On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote:
>
>> Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over the
>> trigrams field with about 1 million of entries in the result set and more
>> than 100 million of entries to facet on in the index. Currently the faceted
>> search is very slow, taking about 5 minutes per query.
>
> I tried creating an index with 1M documents, each with 100 unique terms
> in a field. A search for "*:*" with a facet request for the first 1M
> entries in the field took about 20 seconds for the first call and about
> 1-1½ second for each subsequent call. This was with Solr trunk. The
> complexity of my setup is no doubt a lot simpler and lighter than yours,
> but 5 minutes sounds excessive.
>
> My guess is that your performance problem is due to the merging process.
> Could you try measuring the performance of a direct request to a single
> shard? If that is satisfactory, going to the cloud would not solve your
> problem. If you really need 1M entries in your result set, you would be
> better off investigating whether your index can be in a single instance.



Re: faceting over ngrams

2011-03-16 Thread Jonathan Rochkind

I don't know anything about trying to use map-reduce with Solr.

But I can tell you that with about 6 million entries in the result set, 
and around 10 million values to facet on (facetting on a multi-value 
field) -- I still get fine performance in my application. In the worst 
case it can take maybe 800ms for my complete query when nothing useful 
is in the caches, which isn't great, but is FAR from 5 minutes!


Now, 100 million values is an order of magnitude more than 10 million -- 
but it still seems like it ought not to be that slow. Not sure what's 
making it so slow for you.  Could you need more RAM allocated to the 
JVM? I have found that facetting sometimes gets pathologically slow when 
I don't have enough RAM -- even though I'm not getting any OOM errors or 
anything.  Of course, I'm not sure exactly what "enough RAM" is for your 
use case -- in my case I'm giving my JVM about 5G of heap.  I also make 
sure to use facet.method=fc for these high-ordinality fields (I forget whether 
that's the default in 1.4.1 or not).   I also do some warming queries at 
startup to try and fill the various caches that might be involved in 
facetting -- but I don't entirely understand what I'm doing there, and 
that isn't your problem, because that would only affect the first time 
you did such a facetting query, but you're getting the pathological 5min 
result times on subsequent times too.
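
For reference, a warming entry of the sort described above goes in
solrconfig.xml; a minimal sketch, with the field name borrowed from this
thread:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">trigrams</str>
      <str name="facet.method">fc</str>
    </lst>
  </arr>
</listener>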


I am definitely not an expert in the internals of Solr that effect this 
stuff, I'm just reporting my experience, and from my experience -- your 
experience does not match mine.


Jonathan

On 3/16/2011 8:05 AM, Dmitry Kan wrote:

> Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over the
> trigrams field with about 1 million of entries in the result set and more
> than 100 million of entries to facet on in the index. Currently the faceted
> search is very slow, taking about 5 minutes per query. Would running on a
> cloud with Hadoop make it faster (to seconds) as faceting seems to be a
> natural map-reduce task?
>
> Are there any other options to look into before stepping into the cloud?
>
> Please let me know, if you need specific details on the schema / solrconfig
> setup or the like.



Re: SOLR DIH importing MySQL "text" column as a BLOB

2011-03-16 Thread Kaushik Chakraborty
The query's there in the data-config.xml. And the query's fetching as
expected from the database.

Thanks,
Kaushik


On Wed, Mar 16, 2011 at 9:21 PM, Gora Mohanty  wrote:

> On Wed, Mar 16, 2011 at 2:29 PM, Stefan Matheis
>  wrote:
> > Kaushik,
> >
> > i just remembered an ML-Post few weeks ago .. same problem while
> > importing geo-data
> > (
> http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2254395.html
> )
> > - the solution was:
> >
> >> CAST( CONCAT( lat, ',', lng ) AS CHAR )
> >
> > at that time I searched a little bit for the reason, and afaik there was
> > a bug in mysql/jdbc which produced that binary output under certain
> > conditions
> [...]
>
> As Stefan mentions, there might be a way to solve this.
>
> Could you show us the query in DIH that you are using
> when you get this BLOB, i.e., the SELECT statement
> that goes to the database?
>
> It might also be instructive for you to try that same
> SELECT directly in a mysql interface.
>
> Regards,
> Gora
>


Re: SOLR DIH importing MySQL "text" column as a BLOB

2011-03-16 Thread Gora Mohanty
On Wed, Mar 16, 2011 at 2:29 PM, Stefan Matheis
 wrote:
> Kaushik,
>
> i just remembered an ML-Post few weeks ago .. same problem while
> importing geo-data
> (http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2254395.html)
> - the solution was:
>
>> CAST( CONCAT( lat, ',', lng ) AS CHAR )
>
> at that time I searched a little bit for the reason, and afaik there was
> a bug in mysql/jdbc which produced that binary output under certain
> conditions
[...]

As Stefan mentions, there might be a way to solve this.

Could you show us the query in DIH that you are using
when you get this BLOB, i.e., the SELECT statement
that goes to the database?

It might also be instructive for you to try that same
SELECT directly in a mysql interface.

Regards,
Gora


Re: faceting over ngrams

2011-03-16 Thread Toke Eskildsen
On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote:
> Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over the
> trigrams field with about 1 million of entries in the result set and more
> than 100 million of entries to facet on in the index. Currently the faceted
> search is very slow, taking about 5 minutes per query.

I tried creating an index with 1M documents, each with 100 unique terms
in a field. A search for "*:*" with a facet request for the first 1M
entries in the field took about 20 seconds for the first call and about
1-1½ second for each subsequent call. This was with Solr trunk. The
complexity of my setup is no doubt a lot simpler and lighter than yours,
but 5 minutes sounds excessive.

My guess is that your performance problem is due to the merging process.
Could you try measuring the performance of a direct request to a single
shard? If that is satisfactory, going to the cloud would not solve your
problem. If you really need 1M entries in your result set, you would be
better off investigating whether your index can be in a single instance.
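
For reference, the comparison Toke suggests can be as simple as issuing the
same facet request twice, once against one shard directly and once
distributed (host names and field name are placeholders):

http://shard1:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=trigrams&facet.limit=100

http://shard1:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=trigrams&facet.limit=100&shards=shard1:8983/solr,shard2:8983/solr

If the first is fast and the second is slow, the cost is in the distributed
merge rather than in the per-shard faceting.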



Re: Sorting on multiValued fields via function query

2011-03-16 Thread Smiley, David W.
Heh heh, you say "it worked correctly for me" yet you didn't actually have 
multi-valued data ;-)  Funny.

The only solution right now is to store the max and min into indexed 
single-valued fields at index time.  This is pretty straight-forward to do.  
Even if/when Solr supports sorting on a multi-valued field, I doubt it would 
perform as well as what I suggest.
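
For reference, a minimal sketch of that index-time approach in SolrJ (the
field names size, size_min and size_max are made up for the example):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MinMaxIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "shoe-1");
        int[] sizes = {10, 11, 12};
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        for (int s : sizes) {
            doc.addField("size", s);    // multi-valued field, used for filtering
            if (s < min) min = s;
            if (s > max) max = s;
        }
        doc.addField("size_min", min);  // single-valued: sort on this ascending
        doc.addField("size_max", max);  // single-valued: sort on this descending
        server.add(doc);
        server.commit();
    }
}

Descending sorts then use sort=size_max desc, and ascending sorts use
sort=size_min asc.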

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Mar 16, 2011, at 10:16 AM, harish.agarwal wrote:

> Hi David,
> 
> It did seem to work correctly for me - we had it running on our production
> indexes for some time and we never noticed any strange sorting behavior. 
> However, many of our multiValued fields are single valued for the majority
> of documents in our index so we may not have noticed the incorrect sorting
> behaviors.
> 
> Regardless, I understand the reasoning behind the restriction, I'm
> interested in getting around it by using a functionQuery to reduce
> multiValued fields to a single value.  It sounds like this isn't possible,
> is that correct?  Ideally I'd like to sort by the maximum value on
> descending sorts and the minimum value on ascending sorts.  Is there any
> movement towards implementing this sort of behavior?
> 
> Best,
> -Harish
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Sorting-on-multiValued-fields-via-function-query-tp2681833p2688288.html
> Sent from the Solr - User mailing list archive at Nabble.com.







Re: Sorting on multiValued fields via function query

2011-03-16 Thread harish.agarwal
Hi David,

It did seem to work correctly for me - we had it running on our production
indexes for some time and we never noticed any strange sorting behavior. 
However, many of our multiValued fields are single valued for the majority
of documents in our index so we may not have noticed the incorrect sorting
behaviors.

Regardless, I understand the reasoning behind the restriction, I'm
interested in getting around it by using a functionQuery to reduce
multiValued fields to a single value.  It sounds like this isn't possible,
is that correct?  Ideally I'd like to sort by the maximum value on
descending sorts and the minimum value on ascending sorts.  Is there any
movement towards implementing this sort of behavior?

Best,
-Harish

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-on-multiValued-fields-via-function-query-tp2681833p2688288.html
Sent from the Solr - User mailing list archive at Nabble.com.




Replication slows down massively during high load

2011-03-16 Thread Vadim Kisselmann
Hi everyone,

I have Solr running on one master and two slaves (load balanced) via
Solr 1.4.1 native replication.

If the load is low, both slaves replicate with around 100MB/s from master.

But when I use Solrmeter (100-400 queries/min) for load tests (over
the load balancer), the replication slows down to an unacceptable
speed, around 100KB/s (at least that's what the replication page on
/solr/admin says).

Going to a slave directly without load balancer yields the same result
for the slave under test:

Slave 1 gets hammered with Solrmeter and the replication slows down to 100KB/s.
At the same time, Slave 2 with only 20-50 queries/min without the load
test has no problems. It replicates with 100MB/s and the index version
is 5-10 versions ahead of Slave 1.

The replications stays in the 100KB/s range even after the load test
is over until the application server is restarted. The same issue
comes up under both Tomcat and Jetty.

The setup looks like this:

- Same hardware for all servers: Physical machines with quad core
CPUs, 24GB RAM (JVM starts up with -XX:+UseConcMarkSweepGC -Xms10G
-Xmx10G)
- Index size is about 100GB with 40M docs
- Master commits every 10 min/10k docs
- Slaves poll every minute
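
For reference, the polling setup described here corresponds to a slave-side
solrconfig.xml section along these lines (the masterUrl is a placeholder):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master:8983/solr/replication</str>
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>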

I checked this:

- Changed network interface; same behavior
- Increased thread pool size from 200 to 500 and queue size from 100
to 500 in Tomcat; same behavior
- Both disk and network I/O are not bottlenecked. Disk I/O went down
to almost zero after every query in the load test got cached. Network
isn't doing much and can put through almost a GBit/s with iPerf
(network throughput tester) while Solrmeter is running.

Any ideas what could be wrong?


Best Regards
Vadim


Re: SSL and connection pooling

2011-03-16 Thread Em
On 16.03.2011 14:12, Erlend Garåsen wrote:
>
> We are unsure whether we should use SSL in order to communicate with
> our Solr server since it will increase the cost of creating http
> connections. If we go for SSL, is it advisable to do some additional
> settings for the HttpClient in order to reduce the connection costs?
>
> After reading the Commons Http Client documentation, it is not clear
> to me whether a connection pooling mechanism is enabled by default
> since the documentation differs between version 4.1 and 3.1 (Solr uses
> the latter).
>
> Solr will run on Resin 4 with Apache 2.2, so perhaps we need to do
> some additional adjustments in the httpd.conf file as well in order to
> prevent Apache from closing the connections.
>
> Erlend
>
First: you have to use SSL when you have to. If you can live with
the fact that someone could watch your internal clear-text data
streams, then do not use SSL. On the other hand: if you cannot, then
you definitely have to use SSL. That should be the main point for
your technical decision, not performance.

Second: in my last checkout of the Solr repository (a few weeks ago),
CommonsHttpSolrServer uses a multi-threaded connection manager with 32
connections per host and 128 total connections.

Hope this helps.

Regards,
Em
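
For reference, the pooling can also be configured explicitly by handing
CommonsHttpSolrServer a pre-built HttpClient (a sketch against the
commons-httpclient 3.x API; the URL and limits are placeholders):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PooledSolrClient {
    public static CommonsHttpSolrServer create() throws Exception {
        // One pooled connection manager shared by all requests; reusing
        // connections matters even more when each new one pays an SSL handshake.
        MultiThreadedHttpConnectionManager mgr =
            new MultiThreadedHttpConnectionManager();
        mgr.getParams().setDefaultMaxConnectionsPerHost(32);
        mgr.getParams().setMaxTotalConnections(128);
        return new CommonsHttpSolrServer("https://solr.example.org/solr",
                                         new HttpClient(mgr));
    }
}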



Re: Solrj performance bottleneck

2011-03-16 Thread Yonik Seeley
On Wed, Mar 16, 2011 at 7:25 AM, rahul wrote:
> In our setup, we have the Solr index on one machine and the Solrj client part
> (java code) on another machine. Currently, as you suggest, if it may be a
> case of 'not enough free RAM for the OS to cache', do I need to increase
> the RAM on the machine where the Solrj query part runs? Or do I need to
> increase the RAM of the Solr instance for the OS cache?

That would be RAM for the Solr instance.  If there is not enough free
memory for the OS to cache, then each document retrieved will be a
disk seek + read.

> Since both systems are on the local Amazon network (Linux EC2 small
> instances), I believe the network won't be an issue.

Ah, how big is your index?

> Another thing: in your reply you mentioned 'the client not reading fast
> enough'. Is that related to the network or to Solrj?

That was a general issue - it *can* be the client, but since you're
using SolrJ it would be the network.

-Yonik
http://lucidimagination.com


SSL and connection pooling

2011-03-16 Thread Erlend Garåsen


We are unsure whether we should use SSL in order to communicate with our 
Solr server since it will increase the cost of creating http 
connections. If we go for SSL, is it advisable to do some additional 
settings for the HttpClient in order to reduce the connection costs?


After reading the Commons Http Client documentation, it is not clear to 
me whether a connection pooling mechanism is enabled by default since 
the documentation differs between version 4.1 and 3.1 (Solr uses the 
latter).


Solr will run on Resin 4 with Apache 2.2, so perhaps we need to do some 
additional adjustments in the httpd.conf file as well in order to 
prevent Apache from closing the connections.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Multicore

2011-03-16 Thread Markus Jelsma
What Solr version are you using? That filter does not exist in pre-3.1 releases.

On Wednesday 16 March 2011 13:55:21 Brian Lamb wrote:
> Hi all,
> 
> I am setting up multicore and the schema.xml file in the core0 folder says
> not to use that one because it's very stripped down. So I copied the schema
> from example/solr/conf but now I am getting a bunch of class-not-found
> exceptions:
> 
> SEVERE: org.apache.solr.common.SolrException: Error loading class
> 'solr.KeywordMarkerFilterFactory'
> 
> For example.
> 
> I also copied over the solrconfig.xml from example/solr/conf and changed
> all the <lib dir="xxx" /> paths to go up one directory higher (<lib
> dir="../xxx" /> instead). I've found that when I use my solrconfig file
> with the stripped down schema.xml file, it runs correctly. But when I use
> the full schema xml file, I get those errors.
> 
> Now this says to me I am not loading a library or two somewhere but I've
> looked through the configuration files and cannot see any other place other
> than solrconfig.xml where that would be set so what am I doing incorrectly?
> 
> Thanks,
> 
> Brian Lamb

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Multicore

2011-03-16 Thread Brian Lamb
Hi all,

I am setting up multicore and the schema.xml file in the core0 folder says
not to use that one because it's very stripped down. So I copied the schema
from example/solr/conf but now I am getting a bunch of class-not-found
exceptions:

SEVERE: org.apache.solr.common.SolrException: Error loading class
'solr.KeywordMarkerFilterFactory'

For example.

I also copied over the solrconfig.xml from example/solr/conf and changed all
the <lib dir="xxx" /> paths to go up one directory higher (<lib dir="../xxx" />
instead). I've found that when I use my solrconfig file with the stripped
down schema.xml file, it runs correctly. But when I use the full schema xml
file, I get those errors.

Now this says to me I am not loading a library or two somewhere but I've
looked through the configuration files and cannot see any other place other
than solrconfig.xml where that would be set so what am I doing incorrectly?

Thanks,

Brian Lamb


Re: Stemming question

2011-03-16 Thread Ahmet Arslan
> When I use the Porter Stemmer in
> Solr, it appears to take words that are
> stemmed and replace them with the root word in the index.
> I verified this by looking at analysis.jsp.
> 
> Is there an option to expand the stemmer to include all
> combinations of the
> word? Like include 's, ly, etc?

So you want expansion stemming (currently not supported), which expands the 
query and does not require re-indexing. As described here: 

http://www.slideshare.net/otisg/finite-state-queries-in-lucene 


Maybe you can extract stemming collisions from your index and use them in a 
huge synonyms.txt file?
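
For reference, the synonym route would hook in at query time roughly like
this (a sketch; the synonyms.txt entries are made-up examples):

In synonyms.txt:

  run => run, runs, running

In the field type's query analyzer:

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>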

> Other options besides protection?

What is protection?




  


Re: Maven : Specifying SNAPSHOT Artifacts and the Hudson Repository

2011-03-16 Thread Ahmet Arslan
> does anyone have a successful setup (=pom.xml) that
> specifies the
> Hudson snapshot repository :
> 
> https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastStableBuild/artifact/maven_artifacts
> (or that for trunk)
> 
> and entries for any solr snapshot artifacts which are then
> found by
> Maven in this repository?

This is what I use successfully:

<repositories>
  <repository>
    <id>trunk</id>
    <url>https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-trunk/lastSuccessfulBuild/artifact/maven_artifacts/</url>
  </repository>
</repositories>

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>4.0-SNAPSHOT</version>
  <scope>compile</scope>
  <type>jar</type>
</dependency>



Re: Dismax: field not returned unless in sort clause?

2011-03-16 Thread mrw
No, not setting those options in the query or schema.xml file.

I'll try what you said, however.


Thanks


Chris Hostetter-3 wrote:
> 
> : We have a "D" field (string, indexed, stored, not required) that is
> returned
> : * when we search with the standard request handler
> : * when we search with dismax request handler _and the field is specified
> in
> : the sort parameter_
> : 
> : but is not returned when using the dismax handler and the field is not
> : specified in the sort param.
> 
> are you using one of the "sortMissing" options on D or its fieldType?
> 
> I'm guessing you have sortMissingLast="true" for D, so anytime you sort on 
> it the docs that do have a value appear first.  but when you don't sort on 
> it, other factors probably lead docs that don't have a value for the D 
> field to appear first -- solr doesn't include fields in docs that don't 
> have any value for that field.
> 
> if my guess is correct, adding "fq=D:[* TO *]" to any of your queries will 
> cause the total number of results to shrink, but the first page of results 
> for your requests that don't sort on D will look exactly the same.
> 
> the LukeRequestHandler will help you see how many docs in your index don't 
> have any values indexed in the "D" field.
> 
> 
> -Hoss
> 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-field-not-returned-unless-in-sort-clause-tp2681447p2688039.html
Sent from the Solr - User mailing list archive at Nabble.com.


faceting over ngrams

2011-03-16 Thread Dmitry Kan
Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over the
trigrams field with about 1 million of entries in the result set and more
than 100 million of entries to facet on in the index. Currently the faceted
search is very slow, taking about 5 minutes per query. Would running on a
cloud with Hadoop make it faster (to seconds) as faceting seems to be a
natural map-reduce task?

Are there any other options to look into before stepping into the cloud?

Please let me know, if you need specific details on the schema / solrconfig
setup or the like.

-- 
Regards,

Dmitry Kan


Re: Stemming question

2011-03-16 Thread Markus Jelsma
Hmm, I'm not sure if it's supposed to stem that way, but if it doesn't and you 
insist, then you might be able to abuse the PatternReplaceFilterFactory.

On Wednesday 16 March 2011 06:02:32 Bill Bell wrote:
> When I use the Porter Stemmer in Solr, it appears to take words that are
> stemmed and replace them with the root word in the index.
> I verified this by looking at analysis.jsp.
> 
> Is there an option to expand the stemmer to include all combinations of the
> word? Like include 's, ly, etc?
> 
> Other options besides protection?
> 
> Bill

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Solr admin page timed out and index updating issues

2011-03-16 Thread Markus Jelsma
Yes, due to warmup queries Solr may run out of heap space at start up.

On Monday 14 March 2011 16:52:15 Ranma wrote:
> I am still stuck at the same point.
> 
> Looking here and there I could read that the memory limit (heap space) may
> need to be increased to -Xms512M -Xmx512M when launching the
> java -jar start.jar
>  command. But in my vps I've been forced to set the Xmx limit to a maximum of
> Xmx400M since at higher values it returns a VM initialization error and it
> won't run.
> 
> My first question is: could this be the problem not being able to access
> the solr admin page?
> 
> Please...! Thanks!
> 
> -
> loredanaebook.it
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-admin-page-timed-out-and-index-upd
> ating-issues-tp2664429p2676437.html Sent from the Solr - User mailing list
> archive at Nabble.com.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Solrj performance bottleneck

2011-03-16 Thread rahul
Hi,

Thanks for your information.

One simple question. Please clarify me.

In our setup, we have the Solr index on one machine and the Solrj client part
(java code) on another machine. Currently, as you suggest, if it may be a
case of 'not enough free RAM for the OS to cache', do I need to increase
the RAM on the machine where the Solrj query part runs? Or do I need to
increase the RAM of the Solr instance for the OS cache?

Since both systems are on the local Amazon network (Linux EC2 small
instances), I believe the network won't be an issue.

Another thing: in your reply you mentioned 'the client not reading fast
enough'. Is that related to the network or to Solrj?

Thanks in advance for your info.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solrj-performance-bottleneck-tp2682797p2687448.html
Sent from the Solr - User mailing list archive at Nabble.com.


Multiple spellchecker

2011-03-16 Thread royr
Hello,

I have a problem with the SOLR spellchecker component:

Search term = Company: American today, City: London (two fields,
copyfielded into one: Spell)

User search = American tuday, Londen

What I want is a collation of: American today London. SOLR returns with the
q parameter:

American
Correction: American today

tuday
Correction: American today

londen
Correction: London

Collation:  American today American today London

SOLR returns with the spellcheck.q parameter:

American tuday londen
Correction: American today

The index of Spell looks like this:
American today
London
google
France
etc.

I want SOLR to split the query into two parts: ("American today") and
("London"). Both parts have to be spellchecked, not as one term and
not as three terms.

Can somebody help me?
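
One direction to experiment with (a sketch, not something confirmed in this
thread): index the Spell copyfield with shingles, so that two-word entries
like "American today" exist in the dictionary as single terms alongside the
single words. Whether collation then picks the right granularity would need
testing.

<fieldType name="spellShingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>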


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-spellchecker-tp2687320p2687320.html
Sent from the Solr - User mailing list archive at Nabble.com.


Maven : Specifying SNAPSHOT Artifacts and the Hudson Repository

2011-03-16 Thread Chantal Ackermann
Hi all,

does anyone have a successful setup (=pom.xml) that specifies the
Hudson snapshot repository :

https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastStableBuild/artifact/maven_artifacts
(or that for trunk)

and entries for any solr snapshot artifacts which are then found by
Maven in this repository?

I have specified the repository in my pom.xml as:

<repository>
  <id>solr-snapshot-3.x</id>
  <url>https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastSuccessfulBuild/artifact/maven_artifacts</url>
  <releases>
    <enabled>false</enabled>
  </releases>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
</repository>

And the dependencies:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>3.2-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-dataimporthandler</artifactId>
  <version>3.2-SNAPSHOT</version>
</dependency>


Maven's output is (for solr-core):

Downloading:
http://192.168.2.40:8081/nexus/content/groups/public/org/apache/solr/solr-core/3.2-SNAPSHOT/solr-core-3.2-SNAPSHOT.jar
[INFO] Unable to find resource
'org.apache.solr:solr-core:jar:3.2-SNAPSHOT' in repository
solr-snapshot-3.x
(https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastSuccessfulBuild/artifact/maven_artifacts)


I'm also experimenting with specifying the exact name of the jar, but with no
success so far, and it also seems wrong as the name will be constantly
changing.
Also, searching hasn't returned anything helpful, so far.

I'd really appreciate it if someone could point me in the right
direction!
Thanks!
Chantal




RE: Faceting help

2011-03-16 Thread McGibbney, Lewis John
Hi Upayavira,

I use the term constraint to define additional options for a user to refine 
their search with under each facet. If we think of them as sub-facets
then maybe this explains it in slightly better terms.

I didn't add additional document source types in my original email, but if I 
knew that there would be xls and doc contained within the
Solr index then these would also be added as sub-facets, allowing a user to 
select them prior to entering a search query.

Can you point me towards documentation or something similar in order to 
implement the above? I am aware that I have a lot more to
learn about faceted search, namely how to properly implement it!

Thank you Lewis
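
For reference, facets over fields like the ones listed in the quoted message
below are requested purely through query parameters, and selecting a
constraint adds a filter query; a sketch with assumed field names:

q=*:*&rows=10&facet=true&facet.field=topics&facet.field=source&facet.mincount=1

and after a user picks, say, pdf under Source:

q=*:*&rows=10&facet=true&facet.field=topics&facet.field=source&facet.mincount=1&fq=source:pdf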

From: Upayavira [u...@odoko.co.uk]
Sent: 15 March 2011 22:42
To: solr-user@lucene.apache.org
Subject: Re: Faceting help

I'm not sure if I get what you are trying to achieve. What do you mean
by "constraint"?

Are you saying that you effectively want to filter the facets that are
returned?

e.g. for source field, you want to show html/pdf/email, but not, say xls
or doc?

Upayavira


> Topics < field
>   Legislation < constraint
>   Guidance/Policies < constraint
>   Customer Service information/complaints procedure < constraint
>   financial information < constraint
>   etc etc
>
> Source < field
>   html < constraint
>   pdf < constraint
>   email < constraint
>   etc etc
>
> Date < field
>< constraint
>
> Basically I need resources to understand how to implement the above
> instead of the example I currently have.
> Some guidance would be great
> Thank you kindly
>
> Lewis
>
---
Enterprise Search Consultant at Sourcesense UK,
Making Sense of Open Source



Re: SOLR DIH importing MySQL "text" column as a BLOB

2011-03-16 Thread Stefan Matheis
Kaushik,

i just remembered an ML-Post few weeks ago .. same problem while
importing geo-data
(http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2254395.html)
- the solution was:

> CAST( CONCAT( lat, ',', lng ) AS CHAR )

at that time I searched a little bit for the reason, and afaik there was
a bug in mysql/jdbc which produced that binary output under certain
conditions

Regards
Stefan

On Wed, Mar 16, 2011 at 4:57 AM, Kaushik Chakraborty  wrote:
> I've a column for posts in MySQL of type `text`, I've tried corresponding
> `field-type` for it in Solr `schema.xml` e.g. `string, text, text-ws`. But
> whenever I'm importing it using the DIH, it's getting imported as a BLOB
> object. I checked, this thing is happening only for columns of type `text`
> and not for `varchar`(they are getting indexed as string). Hence, the posts
> field is not becoming searchable.
>
> I found out about this issue, after repeated search failures, when I did a `*:*`
> query search on Solr. A sample response:
>
>        
>        
>        1.0
>        [B@10a33ce2
>        2011-02-21T07:02:55Z
>        test.acco...@gmail.com
>        Test
>        Account
>        [B@2c93c4f1
>        1
>        
>
> The `data-config.xml` :
>
>    [the data-config.xml snippet was stripped by the list archiver]
>
> The `schema.xml` :
>
>    [the schema.xml snippet was stripped by the list archiver; the fields are
>    declared indexed="true" stored="true", with uniqueKey
>    solr_post_status_message_id and defaultSearchField solr_post_message]
>
>
> Thanks,
> Kaushik
>


Re: noobie question: sorting

2011-03-16 Thread James Lin
AWESOME, thanks for your time!

Regards

James

On Wed, Mar 16, 2011 at 6:14 PM, David Smiley (@MITRE.org) <
dsmi...@mitre.org> wrote:

> Hi.  Where did you find such an obtuse example?
>
> Recently, Solr supports sorting by function query.  One such function is
> named "query" which takes a query and uses the score of the result of that
> query as the function's result.  Due to constraints of where this query is
> placed within a function query, it is necessary to use the local-params
> syntax (e.g. {!v=...}) since you can't simply state "category:445".  Or,
> there could have been a parameter dereference like $sortQ where sortQ is
> another parameter holding category:445.  Any way, the net effect is that
> documents are score-sorted based on the query category:445 instead of the
> user-query ("q" param). I'd expect category:445 docs to come up top and all
> others to appear randomly afterwards.  It would be nice if the sort query
> could simply be "category:445 desc" but that's not supported.
>
> Complicated?  You bet!  But fear not; this is about as complicated as it
> gets.
>
> References:
> http://wiki.apache.org/solr/SolrQuerySyntax
> http://wiki.apache.org/solr/CommonQueryParameters#sort
> http://wiki.apache.org/solr/FunctionQuery#query
>
> ~ David Smiley
>  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
>
> -
>  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/noobie-question-sorting-tp2685250p2685617.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
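
Putting David's description together, the request would look something like
this (a sketch, shown unencoded; $sortQ is the parameter-dereference name
from his reply):

http://localhost:8983/solr/select?q=*:*&sort=query($sortQ) desc&sortQ=category:445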


query expansion à la dismax

2011-03-16 Thread Paul Libbrecht

Hello list,

the dismax query type has one feature that is particularly nice... the ability 
to expand the tokens of a query to many fields. This is really useful for such 
jobs as "prefer a match in title, prefer exact matches over stemmed matches 
over phonetic matches".
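
For reference, the dismax expansion being described is driven by the qf (and
pf) parameters; a sketch with illustrative field names and boosts:

defType=dismax&q=user query here&qf=title^4 text exact_text^2 phonetic_text^0.5&pf=title^10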

My problem: I wish to do the same with the normal Lucene query type, because I 
wish to let power users use some syntax if they want, but I would still 
like to expand searches on the default field that are at the top level.

So I wrote my own code that filters the top level queries and expands them, 
using a similar instruction as dismax within a particular query component.

Question 1: doesn't such code already exist?
 (I haven't found it)

Question 2: should I rather make a QParserPlugin?
  (the javadoc is not very helpful)

thanks in advance

paul