query expansion à la dismax
Hello list, the dismax query type has one feature that is particularly nice: the ability to expand the tokens of a query to many fields. This is really useful for jobs such as preferring a match in title, or preferring exact matches over stemmed matches over phonetic matches. My problem: I wish to do the same with the normal Lucene query type, because I want to let power users use some syntax if they wish, but I would still like to expand searches on the default field that are at the top level. So I wrote my own code that filters the top-level queries and expands them, using a dismax-like instruction within a particular query component. Question 1: doesn't such code already exist? (I haven't found it.) Question 2: should I rather make a QParserPlugin? (The javadoc is not very helpful.) Thanks in advance, paul
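For reference, a minimal sketch of the dismax behaviour being reimplemented here, as it would appear in solrconfig.xml — the handler name, field names, and boosts are illustrative, not taken from the original post:

```xml
<!-- Sketch only: a dismax handler that expands the user's terms across
     several variants of the same content, preferring title, then exact,
     then stemmed, then phonetic matches. Field names are hypothetical. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">title^10 body_exact^4 body_stemmed^2 body_phonetic^0.5</str>
  </lst>
</requestHandler>
```

Worth noting: the extended dismax parser (edismax, in trunk at the time of this thread) aims to combine full Lucene query syntax with dismax-style qf expansion, which is close to what Question 1 asks for.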
Re: noobie question: sorting
AWESOME, thanks for your time! Regards, James On Wed, Mar 16, 2011 at 6:14 PM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: Hi. Where did you find such an obtuse example? Recently, Solr supports sorting by function query. One such function is named query, which takes a query and uses the score of the result of that query as the function's result. Due to constraints on where this query is placed within a function query, it is necessary to use the local-params syntax (e.g. {!v=...}), since you can't simply state category:445. Or, there could have been a parameter dereference like $sortQ, where sortQ is another parameter holding category:445. Anyway, the net effect is that documents are score-sorted based on the query category:445 instead of the user query (q param). I'd expect category:445 docs to come up top and all others to appear randomly afterwards. It would be nice if the sort query could simply be category:445 desc, but that's not supported. Complicated? You bet! But fear not; this is about as complicated as it gets. References: http://wiki.apache.org/solr/SolrQuerySyntax http://wiki.apache.org/solr/CommonQueryParameters#sort http://wiki.apache.org/solr/FunctionQuery#query ~ David Smiley Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/noobie-question-sorting-tp2685250p2685617.html Sent from the Solr - User mailing list archive at Nabble.com.
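The two syntaxes David describes can be sketched as request parameters (field and value are from his example; URL-escaping omitted for readability):

```
# sort by the score of an embedded query, via local-params:
sort=query({!v='category:445'}) desc

# or the same thing with parameter dereferencing:
sort=query($sortQ) desc
sortQ=category:445
```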
Re: SOLR DIH importing MySQL text column as a BLOB
Kaushik, I just remembered an ML post from a few weeks ago with the same problem while importing geo-data (http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2254395.html) - the solution was: CAST( CONCAT( lat, ',', lng ) AS CHAR ). At the time I searched a little for the reason, and AFAIK there was a bug in mysql/jdbc which produces that binary output under certain conditions. Regards, Stefan

On Wed, Mar 16, 2011 at 4:57 AM, Kaushik Chakraborty kaych...@gmail.com wrote: I've a column for posts in MySQL of type `text`. I've tried the corresponding `field-type`s for it in Solr's `schema.xml`, e.g. `string`, `text`, `text-ws`. But whenever I import it using the DIH, it gets imported as a BLOB object. I checked: this happens only for columns of type `text`, not for `varchar` (those get indexed as strings). Hence the posts field is not becoming searchable. I found out about this issue, after repeated search failures, when I did a `*:*` query on Solr. A sample response:

<result name="response" numFound="223" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <str name="solr_post_bio">[B@10a33ce2</str>
    <date name="solr_post_created_at">2011-02-21T07:02:55Z</date>
    <str name="solr_post_email">test.acco...@gmail.com</str>
    <str name="solr_post_first_name">Test</str>
    <str name="solr_post_last_name">Account</str>
    <str name="solr_post_message">[B@2c93c4f1</str>
    <str name="solr_post_status_message_id">1</str>
  </doc>
</result>

The `data-config.xml`:

<document>
  <entity name="posts" dataSource="jdbc" query="select p.person_id as solr_post_person_id, pr.first_name as solr_post_first_name, pr.last_name as solr_post_last_name, u.email as solr_post_email, p.message as solr_post_message, p.id as solr_post_status_message_id, p.created_at as solr_post_created_at, pr.bio as solr_post_bio from posts p, users u, profiles pr where p.person_id = u.id and p.person_id = pr.person_id and p.type='StatusMessage'">
    <field column="solr_post_person_id" />
    <field column="solr_post_first_name" />
    <field column="solr_post_last_name" />
    <field column="solr_post_email" />
    <field column="solr_post_message" />
    <field column="solr_post_status_message_id" />
    <field column="solr_post_created_at" />
    <field column="solr_post_bio" />
  </entity>
</document>

The `schema.xml`:

<fields>
  <field name="solr_post_status_message_id" type="string" indexed="true" stored="true" required="true" />
  <field name="solr_post_message" type="text_ws" indexed="true" stored="true" required="true" />
  <field name="solr_post_bio" type="text" indexed="false" stored="true" />
  <field name="solr_post_first_name" type="string" indexed="false" stored="true" />
  <field name="solr_post_last_name" type="string" indexed="false" stored="true" />
  <field name="solr_post_email" type="string" indexed="false" stored="true" />
  <field name="solr_post_created_at" type="date" indexed="false" stored="true" />
</fields>
<uniqueKey>solr_post_status_message_id</uniqueKey>
<defaultSearchField>solr_post_message</defaultSearchField>

Thanks, Kaushik
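If the mysql/jdbc bug Stefan mentions is the cause, one hedged workaround is to apply the same CAST(... AS CHAR) trick to the `text` columns inside the DIH query — a sketch based on the SELECT from this thread (other columns omitted for brevity):

```sql
-- Only the text-typed columns need the cast; varchar columns were fine.
SELECT p.id AS solr_post_status_message_id,
       CAST(p.message AS CHAR) AS solr_post_message,
       CAST(pr.bio    AS CHAR) AS solr_post_bio
FROM posts p
JOIN users u     ON p.person_id = u.id
JOIN profiles pr ON p.person_id = pr.person_id
WHERE p.type = 'StatusMessage';
```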
RE: Faceting help
Hi Upayavira, I use the term constraint to define additional options for a user to refine a search with under each facet. If we think of them as sub-facets, then maybe that explains it in slightly better terms. I didn't add additional document source types in my original email, but if I knew that there would be xls and doc contained within the Solr index, then these would also be added as sub-facets, allowing a user to select them prior to entering a search query. Can you point me towards documentation or something similar in order to implement the above? I am aware that I have a lot more to learn on faceted search, namely how to properly implement it! Thank you, Lewis

From: Upayavira [u...@odoko.co.uk] Sent: 15 March 2011 22:42 To: solr-user@lucene.apache.org Subject: Re: Faceting help

I'm not sure if I get what you are trying to achieve. What do you mean by constraint? Are you saying that you effectively want to filter the facets that are returned? E.g. for the source field, you want to show html/pdf/email, but not, say, xls or doc? Upayavira

Topics field: Legislation (constraint), Guidance/Policies (constraint), Customer Service information/complaints procedure (constraint), financial information (constraint), etc.
Source field: html (constraint), pdf (constraint), email (constraint), etc.
Date field: (constraint)

Basically I need resources to understand how to implement the above instead of the example I currently have. Some guidance would be great. Thank you kindly, Lewis

Glasgow Caledonian University is a registered Scottish charity, number SC021474. Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009. http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html Winner: Times Higher Education’s Outstanding Support for Early Career Researchers of the Year 2010, GCU as a lead with Universities Scotland partners. http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html

--- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
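As a starting point, plain field faceting over the fields described in this thread might look like the request below; the field names are assumptions based on the examples given, and each returned facet value with its count is the "constraint" the user clicks to refine:

```
q=*:*&facet=true&facet.field=topics&facet.field=source&facet.mincount=1

# selecting the "pdf" constraint under Source then becomes a filter query:
q=*:*&fq=source:pdf&facet=true&facet.field=topics&facet.field=source
```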
Maven : Specifying SNAPSHOT Artifacts and the Hudson Repository
Hi all, does anyone have a successfull setup (=pom.xml) that specifies the Hudson snapshot repository : https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastStableBuild/artifact/maven_artifacts (or that for trunk) and entries for any solr snapshot artifacts which are then found by Maven in this repository? I have specified the repository in my pom.xml as : repositories repository idsolr-snapshot-3.x/id urlhttps://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastSuccessfulBuild/artifact/maven_artifacts/url releases enabledfalse/enabled /releases snapshots enabledtrue/enabled /snapshots /repository /repositories And the dependencies: dependency groupIdorg.apache.solr/groupId artifactIdsolr-core/artifactId version3.2-SNAPSHOT/version /dependency dependency groupIdorg.apache.solr/groupId artifactIdsolr-dataimporthandler/artifactId version3.2-SNAPSHOT/version /dependency Maven's output is (for solr-core): Downloading: http://192.168.2.40:8081/nexus/content/groups/public/org/apache/solr/solr-core/3.2-SNAPSHOT/solr-core-3.2-SNAPSHOT.jar [INFO] Unable to find resource 'org.apache.solr:solr-core:jar:3.2-SNAPSHOT' in repository solr-snapshot-3.x (https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastSuccessfulBuild/artifact/maven_artifacts) I'm also trying around with specifying the exact name of the jar, but no success so far, and it also seems wrong as it will be constantly changing. Also, searching hasn't returned anything helpful, so far. I'd really appreciate if someone could point me into the right direction! Thanks! Chantal
Multiple spellchecker
Hello, I have a problem with the Solr spellchecker component. This is the problem: Indexed terms = Company: American today, City: London (two fields, copyField'ed into one: Spell). User search = American tuday, Londen. What I want is a collation of: American today London. Solr returns with the q parameter: American - Correction: American today; tuday - Correction: American today; londen - Correction: London; Collation: American today American today London. Solr returns with the spellcheck.q parameter: American tuday londen - Correction: American today. The index of Spell looks like this: American today, London, google, France, etc. I want Solr to split the input into two parts: (American today) and (London). Both parts have to be checked for spelling, not as one term and not as three terms. Can somebody help me?
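For reference, the collation behaviour discussed above is driven by request parameters like the following — a sketch of a typical spellcheck request, not a fix for the term-grouping problem itself:

```
q=American tuday Londen
spellcheck=true
spellcheck.collate=true
spellcheck.count=5
# alternatively, hand the raw input to the checker directly:
spellcheck.q=American tuday londen
```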
Re: Solrj performance bottleneck
Hi, thanks for your information. One simple question, please clarify: in our setup, we have the Solr index on one machine and the SolrJ client part (Java code) on another machine. As you suggest, if the problem may be 'not enough free RAM for the OS to cache', do I need to increase the RAM on the machine where the SolrJ query part runs, or increase the RAM of the Solr instance for the OS cache? Since both systems are in the local Amazon network (Linux EC2 small instances), I believe the network won't be an issue. Another thing: in your reply you mentioned 'client not reading fast enough'. Is that related to the network or to SolrJ? Thanks in advance for your info.
Re: Solr admin page timed out and index updating issues
Yes, due to warmup queries Solr may run out of heap space at start up. On Monday 14 March 2011 16:52:15 Ranma wrote: I am still stuck at the same point. Looking here and there, I read that the memory limit (heap space) may need to be increased to -Xms512M -Xmx512M when launching the java -jar start.jar command. But on my VPS I've been forced to set the Xmx limit to a maximum of -Xmx400M, since at a higher value it returns a VM initialization error and won't run. My first question is: could this be the reason I'm not able to access the Solr admin page? Please...! Thanks! - loredanaebook.it -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: Stemming question
Hmm, I'm not sure if it's supposed to stem that way, but if it doesn't and you insist, then you might be able to abuse the PatternReplaceFilterFactory. On Wednesday 16 March 2011 06:02:32 Bill Bell wrote: When I use the Porter stemmer in Solr, it appears to take words that are stemmed and replace them with the root word in the index. I verified this by looking at analysis.jsp. Is there an option to expand the stemmer to include all combinations of the word? Like including 's, ly, etc.? Other options besides protection? Bill -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
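A hypothetical sketch of the PatternReplaceFilterFactory "abuse" suggested above — normalising a possessive 's before the stemmer runs. The fieldType name and the pattern are assumptions, not a tested recipe:

```xml
<fieldType name="text_en_custom" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- strip trailing 's so "company's" and "company" meet at the same stem -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="'s$" replacement="" replace="all"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```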
faceting over ngrams
Hello guys. We are using sharded Solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. Would running on a cloud with Hadoop make it faster (down to seconds), as faceting seems to be a natural map-reduce task? Are there any other options to look into before stepping into the cloud? Please let me know if you need specific details on the schema / solrconfig setup or the like. -- Regards, Dmitry Kan
Re: Dismax: field not returned unless in sort clause?
No, not setting those options in the query or schema.xml file. I'll try what you said, however. Thanks. Chris Hostetter-3 wrote: : We have a D field (string, indexed, stored, not required) that is returned : * when we search with the standard request handler : * when we search with the dismax request handler _and the field is specified in : the sort parameter_ : : but is not returned when using the dismax handler and the field is not : specified in the sort param. Are you using one of the sortMissing options on D or its fieldType? I'm guessing you have sortMissingLast=true for D, so any time you sort on it, the docs that do have a value appear first. But when you don't sort on it, other factors probably lead docs that don't have a value for the D field to appear first -- Solr doesn't include fields in docs that don't have any value for that field. If my guess is correct, adding fq=D:[* TO *] to any of your queries will cause the total number of results to shrink, but the first page of results for your requests that don't sort on D will look exactly the same. The LukeRequestHandler will help you see how many docs in your index don't have any values indexed in the D field. -Hoss
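Hoss's two checks can be written out as requests — D is the field name from the thread, the host and path are placeholders:

```
# shrinkage test: restrict results to docs that actually have a value in D
q=your query&fq=D:[* TO *]

# Luke: inspect the D field, including how many docs carry it
http://localhost:8983/solr/admin/luke?fl=D
```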
Re: Maven : Specifying SNAPSHOT Artifacts and the Hudson Repository
does anyone have a successful setup (= pom.xml) that specifies the Hudson snapshot repository https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastStableBuild/artifact/maven_artifacts (or the one for trunk) and entries for Solr snapshot artifacts which are then found by Maven in this repository? This is what I use successfully:

<repository>
  <id>trunk</id>
  <url>https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-trunk/lastSuccessfulBuild/artifact/maven_artifacts/</url>
</repository>

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>4.0-SNAPSHOT</version>
  <scope>compile</scope>
  <type>jar</type>
</dependency>
Re: Stemming question
When I use the Porter stemmer in Solr, it appears to take words that are stemmed and replace them with the root word in the index. I verified this by looking at analysis.jsp. Is there an option to expand the stemmer to include all combinations of the word? Like including 's, ly, etc.? So you want expansion stemming (currently not supported), which expands the query and does not require re-indexing, as described here: http://www.slideshare.net/otisg/finite-state-queries-in-lucene Maybe you can extract stemming collisions from your index and use them in a huge synonyms.txt file? Other options besides protection? What do you mean by protection?
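The synonyms idea above might be wired up like this in the query analyzer — a sketch; the file name is invented, and the groups in it would be the stemming collisions extracted from your index, not these made-up examples:

```xml
<!-- query-time expansion of a stem group back to its surface forms -->
<filter class="solr.SynonymFilterFactory" synonyms="stem-expansions.txt"
        ignoreCase="true" expand="true"/>
```

where stem-expansions.txt would hold comma-separated groups such as `friend, friends, friend's`, one group per line.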
Multicore
Hi all, I am setting up multicore, and the schema.xml file in the core0 folder says not to use that one because it's very stripped down. So I copied the schema from example/solr/conf, but now I am getting a bunch of class-not-found exceptions, for example: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.KeywordMarkerFilterFactory' I also copied over the solrconfig.xml from example/solr/conf and changed all the <lib dir="xxx"/> paths to go up one directory higher (<lib dir="../xxx" /> instead). I've found that when I use my solrconfig file with the stripped-down schema.xml file, it runs correctly. But when I use the full schema.xml file, I get those errors. Now this says to me that I am not loading a library or two somewhere, but I've looked through the configuration files and cannot see any place other than solrconfig.xml where that would be set. So what am I doing incorrectly? Thanks, Brian Lamb
Re: Multicore
Which Solr version are you using? That filter is not in pre-3.1 releases. On Wednesday 16 March 2011 13:55:21 Brian Lamb wrote: Hi all, I am setting up multicore, and the schema.xml file in the core0 folder says not to use that one because it's very stripped down. So I copied the schema from example/solr/conf, but now I am getting a bunch of class-not-found exceptions, for example: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.KeywordMarkerFilterFactory' I also copied over the solrconfig.xml from example/solr/conf and changed all the <lib dir="xxx"/> paths to go up one directory higher (<lib dir="../xxx" /> instead). I've found that when I use my solrconfig file with the stripped-down schema.xml file, it runs correctly. But when I use the full schema.xml file, I get those errors. Now this says to me that I am not loading a library or two somewhere, but I've looked through the configuration files and cannot see any place other than solrconfig.xml where that would be set. So what am I doing incorrectly? Thanks, Brian Lamb -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
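If it turns out to be a library-visibility problem rather than a version mismatch, one hedged option is a sharedLib directory in solr.xml, so every core sees the same analysis jars — the path and core names below are assumptions:

```xml
<!-- solr.xml: "lib" is resolved relative to solr home and shared by all cores -->
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>
```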
SSL and connection pooling
We are unsure whether we should use SSL in order to communicate with our Solr server since it will increase the cost of creating http connections. If we go for SSL, is it advisable to do some additional settings for the HttpClient in order to reduce the connection costs? After reading the Commons Http Client documentation, it is not clear to me whether a connection pooling mechanism is enabled by default since the documentation differs between version 4.1 and 3.1 (Solr uses the latter). Solr will run on Resin 4 with Apache 2.2, so perhaps we need to do some additional adjustments in the httpd.conf file as well in order to prevent Apache from closing the connections. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Solrj performance bottleneck
On Wed, Mar 16, 2011 at 7:25 AM, rahul asharud...@gmail.com wrote: In our setup, we are having Solr index in one machine. And Solrj client part (java code) in another machine. Currently as you suggest, if it may be a 'not enough free RAM for the OS to cache' then whether I need to increase the RAM in the machine in which Solrj query part is there.??? Or need to increase RAM for Solr instance for the OS cache? That would be RAM for the Solr instance. If there is not enough free memory for the OS to cache, then each document retrieved will be a disk seek + read. Since both the system are in local Amazon network (Linux EC2 small instances), I believe the network wont be a issue. Ah, how big is your index? Another thing, in the reply you have mentioned 'client not reading fast enough'. Whether it is related to network or Solrj. That was a general issue - it *can* be the client, but since you're using SolrJ it would be the network. -Yonik http://lucidimagination.com
Re: SSL and connection pooling
On 16.03.2011 14:12, Erlend Garåsen wrote: We are unsure whether we should use SSL in order to communicate with our Solr server since it will increase the cost of creating http connections. If we go for SSL, is it advisable to do some additional settings for the HttpClient in order to reduce the connection costs? After reading the Commons Http Client documentation, it is not clear to me whether a connection pooling mechanism is enabled by default since the documentation differs between version 4.1 and 3.1 (Solr uses the latter). Solr will run on Resin 4 with Apache 2.2, so perhaps we need to do some additional adjustments in the httpd.conf file as well in order to prevent Apache from closing the connections. Erlend First: you have to use SSL when you have to. If you can live with the fact that someone could watch your internal clear-text data streams, then do not use SSL. On the other hand, if you cannot, then you definitely have to use SSL. That should be the main point for your technical decision, not performance. Second: in my last checkout of the Solr repository (a few weeks ago), CommonsHttpSolrServer uses a multi-threaded connection manager with 32 connections per host and 128 total connections. Hope this helps. Regards, Em
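If the 3.1-style defaults need tuning, here is a sketch of handing SolrJ a custom pooled HttpClient. It assumes solrj and commons-httpclient 3.1 on the classpath; the URL and the pool sizes are placeholders, not recommendations:

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PooledSolrClient {
    public static CommonsHttpSolrServer create() throws Exception {
        // HttpClient 3.1 pooling lives in the connection manager
        MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
        mgr.getParams().setDefaultMaxConnectionsPerHost(32);
        mgr.getParams().setMaxTotalConnections(128);
        HttpClient client = new HttpClient(mgr);
        // Re-use this server instance across threads; each instance keeps its own pool,
        // so SSL handshakes are paid once per pooled connection, not per request.
        return new CommonsHttpSolrServer("https://solr.example.com/solr", client);
    }
}
```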
Replication slows down massively during high load
Hi everyone, I have Solr running on one master and two slaves (load balanced) via Solr 1.4.1 native replication. If the load is low, both slaves replicate at around 100MB/s from the master. But when I use Solrmeter (100-400 queries/min) for load tests (over the load balancer), the replication slows down to an unacceptable speed, around 100KB/s (at least that's what the replication page on /solr/admin says). Going to a slave directly without the load balancer yields the same result for the slave under test: Slave 1 gets hammered with Solrmeter and its replication slows down to 100KB/s. At the same time, Slave 2, with only 20-50 queries/min and no load test, has no problems. It replicates at 100MB/s, and its index version is 5-10 versions ahead of Slave 1. The replication stays in the 100KB/s range even after the load test is over, until the application server is restarted. The same issue comes up under both Tomcat and Jetty. The setup looks like this:
- Same hardware for all servers: physical machines with quad core CPUs, 24GB RAM (JVM starts up with -XX:+UseConcMarkSweepGC -Xms10G -Xmx10G)
- Index size is about 100GB with 40M docs
- Master commits every 10 min/10k docs
- Slaves poll every minute
I checked this:
- Changed network interface; same behavior
- Increased thread pool size from 200 to 500 and queue size from 100 to 500 in Tomcat; same behavior
- Both disk and network I/O are not bottlenecked. Disk I/O went down to almost zero after every query in the load test got cached. The network isn't doing much and can put through almost a GBit/s with iPerf (network throughput tester) while Solrmeter is running.
Any ideas what could be wrong? Best Regards Vadim
Re: Sorting on multiValued fields via function query
Hi David, It did seem to work correctly for me - we had it running on our production indexes for some time and we never noticed any strange sorting behavior. However, many of our multiValued fields are single valued for the majority of documents in our index so we may not have noticed the incorrect sorting behaviors. Regardless, I understand the reasoning behind the restriction, I'm interested in getting around it by using a functionQuery to reduce multiValued fields to a single value. It sounds like this isn't possible, is that correct? Ideally I'd like to sort by the maximum value on descending sorts and the minimum value on ascending sorts. Is there any movement towards implementing this sort of behavior? Best, -Harish
Re: Sorting on multiValued fields via function query
Heh heh, you say it "worked correctly" for you, yet you didn't actually have multi-valued data ;-) Funny. The only solution right now is to store the max and min in indexed single-valued fields at index time. This is pretty straightforward to do. Even if/when Solr supports sorting on a multi-valued field, I doubt it would perform as well as what I suggest. ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/ On Mar 16, 2011, at 10:16 AM, harish.agarwal wrote: Hi David, It did seem to work correctly for me - we had it running on our production indexes for some time and we never noticed any strange sorting behavior. However, many of our multiValued fields are single valued for the majority of documents in our index so we may not have noticed the incorrect sorting behaviors. Regardless, I understand the reasoning behind the restriction, I'm interested in getting around it by using a functionQuery to reduce multiValued fields to a single value. It sounds like this isn't possible, is that correct? Ideally I'd like to sort by the maximum value on descending sorts and the minimum value on ascending sorts. Is there any movement towards implementing this sort of behavior? Best, -Harish
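David's suggestion might look like this in schema.xml — the field names and type are invented for illustration, and the min/max values would be computed by your indexing code when it builds each document:

```xml
<!-- the searchable multi-valued field, plus precomputed sort companions -->
<field name="price"     type="tfloat" indexed="true" stored="true" multiValued="true"/>
<field name="price_min" type="tfloat" indexed="true" stored="false"/>
<field name="price_max" type="tfloat" indexed="true" stored="false"/>
```

Queries would then use `sort=price_max desc` for descending sorts and `sort=price_min asc` for ascending ones.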
Re: faceting over ngrams
On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote: Hello guys. We are using sharded Solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. I tried creating an index with 1M documents, each with 100 unique terms in a field. A search for *:* with a facet request for the first 1M entries in the field took about 20 seconds for the first call and about 1-1½ seconds for each subsequent call. This was with Solr trunk. My setup is no doubt a lot simpler and lighter than yours, but 5 minutes sounds excessive. My guess is that your performance problem is due to the merging process. Could you try measuring the performance of a direct request to a single shard? If that is satisfactory, going to the cloud would not solve your problem. If you really need 1M entries in your result set, you would be better off investigating whether your index can fit in a single instance.
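To separate per-shard cost from merge cost as Toke suggests, the two measurements could look like this — hosts and the facet field name are placeholders:

```
# direct request to one shard (no merging involved):
http://shard1:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=trigrams

# distributed request (adds the merge step across shards):
http://shard1:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=trigrams&shards=shard1:8983/solr,shard2:8983/solr
```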
Re: SOLR DIH importing MySQL text column as a BLOB
On Wed, Mar 16, 2011 at 2:29 PM, Stefan Matheis matheis.ste...@googlemail.com wrote: Kaushik, i just remembered an ML-Post few weeks ago .. same problem while importing geo-data (http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2254395.html) - the solution was: CAST( CONCAT( lat, ',', lng ) AS CHAR ) at that time i search a little bit for the reason and afaik there was a bug in mysql/jdbc which produces that binary output under certain conditions [...] As Stefan mentions, there might be a way to solve this. Could you show us the query in DIH that you are using when you get this BLOB, i.e., the SELECT statement that goes to the database? It might also be instructive for you to try that same SELECT directly in a mysql interface. Regards, Gora
Re: SOLR DIH importing MySQL text column as a BLOB
The query's there in the data-config.xml. And the query's fetching as expected from the database. Thanks, Kaushik On Wed, Mar 16, 2011 at 9:21 PM, Gora Mohanty g...@mimirtech.com wrote: On Wed, Mar 16, 2011 at 2:29 PM, Stefan Matheis matheis.ste...@googlemail.com wrote: Kaushik, i just remembered an ML-Post few weeks ago .. same problem while importing geo-data ( http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2254395.html ) - the solution was: CAST( CONCAT( lat, ',', lng ) AS CHAR ) at that time i search a little bit for the reason and afaik there was a bug in mysql/jdbc which produces that binary output under certain conditions [...] As Stefan mentions, there might be a way to solve this. Could you show us the query in DIH that you are using when you get this BLOB, i.e., the SELECT statement that goes to the database? It might also be instructive for you to try that same SELECT directly in a mysql interface. Regards, Gora
Re: faceting over ngrams
I don't know anything about trying to use map-reduce with Solr. But I can tell you that with about 6 million entries in the result set, and around 10 million values to facet on (faceting on a multi-valued field), I still get fine performance in my application. In the worst case it can take maybe 800ms for my complete query when nothing useful is in the caches, which isn't great, but is FAR from 5 minutes! Now, 100 million values is an order of magnitude more than 10 million -- but it still seems like it ought not to be that slow. Not sure what's making it so slow for you. Could you need more RAM allocated to the JVM? I have found that faceting sometimes gets pathologically slow when I don't have enough RAM -- even though I'm not getting any OOM errors or anything. Of course, I'm not sure exactly what "enough RAM" is for your use case -- in my case I'm giving my JVM about 5G of heap. I also make sure to use facet.method=fc for these high-cardinality fields (I forget if that's the default in 1.4.1 or not). I also do some warming queries at startup to try and fill the various caches that might be involved in faceting -- but I don't entirely understand what I'm doing there, and that isn't your problem anyway, because it would only affect the first such faceting query, and you're getting the pathological 5-minute result times on subsequent queries too. I am definitely not an expert in the internals of Solr that affect this stuff; I'm just reporting my experience, and your experience does not match mine. Jonathan On 3/16/2011 8:05 AM, Dmitry Kan wrote: Hello guys. We are using sharded Solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. Would running on a cloud with Hadoop make it faster (down to seconds), as faceting seems to be a natural map-reduce task? Are there any other options to look into before stepping into the cloud? Please let me know if you need specific details on the schema / solrconfig setup or the like.
Re: faceting over ngrams
Ah, wait, you're doing sharding? Yeah, I am NOT doing sharding, so that could explain our different experiences. It seems like sharding definitely has trade-offs: it makes some things faster and other things slower. So far I've managed to avoid it, in the interest of keeping things simpler and easier to understand (for me, the developer/Solr manager), thinking that sharding is also a somewhat less mature feature. With only 1M documents, are you sure you need sharding at all? You could still use replication to scale out for query volume; sharding seems more about scaling for the number of documents (or total bytes) in your index. 1M documents is not very large for Solr, in general. Jonathan On 3/16/2011 11:51 AM, Toke Eskildsen wrote: On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote: Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over the trigrams field with about 1 million of entries in the result set and more than 100 million of entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. I tried creating an index with 1M documents, each with 100 unique terms in a field. A search for *:* with a facet request for the first 1M entries in the field took about 20 seconds for the first call and about 1-1½ seconds for each subsequent call. This was with Solr trunk. My setup is no doubt a lot simpler and lighter than yours, but 5 minutes sounds excessive. My guess is that your performance problem is due to the merging process. Could you try measuring the performance of a direct request to a single shard? If that is satisfactory, going to the cloud would not solve your problem. If you really need 1M entries in your result set, you would be better off investigating whether your index can fit in a single instance.
Re: SOLR DIH importing MySQL text column as a BLOB
On Wed, Mar 16, 2011 at 9:50 PM, Kaushik Chakraborty kaych...@gmail.com wrote: The query's there in the data-config.xml. And the query's fetching as expected from the database. [...] Doh! Sorry, had missed that somehow. So, the relevant part is: SELECT ... p.message as solr_post_message, ... What is the field type for p.message in MySQL? I cannot remember off the top of my head for MySQL, but if it is a TEXT field, you might want to look into the ClobTransformer: http://wiki.apache.org/solr/DataImportHandler#ClobTransformer Regards, Gora
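Gora's suggestion would look roughly like this in data-config.xml -- a sketch only, with the entity and column names taken from the thread; the transformer attribute and the per-field clob="true" flag are the documented ClobTransformer hooks:

```xml
<entity name="posts" dataSource="jdbc" transformer="ClobTransformer"
        query="select ... p.message as solr_post_message ...">
  <!-- clob="true" tells ClobTransformer to convert this column's
       CLOB value into a plain String before it is indexed -->
  <field column="solr_post_message" clob="true"/>
</entity>
```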
RE: hierarchical faceting, SOLR-792 - confused on config
Hi, This is also where I am having problems. I have not been able to understand very much from the wiki, and I do not understand how to configure the faceting we are referring to. Although I know very little about this, I can't help but think that the wiki is quite clearly inaccurate in places! Any comments please. Lewis From: kmf [kfole...@gmail.com] Sent: 23 February 2011 17:10 To: solr-user@lucene.apache.org Subject: Re: hierarchical faceting, SOLR-792 - confused on config I'm really confused now. Is this page completely out of date - http://wiki.apache.org/solr/HierarchicalFaceting - as it seems to imply that SOLR-792 is a form of hierarchical faceting: There are currently two similar, non-competing, approaches to generating tree/hierarchical facets from Solr: SOLR-64 and SOLR-792. To achieve hierarchical faceting, is the rule then that you form the hierarchical facets using a transformer in the DIH and do nothing in schema.xml or solrconfig.xml? I seem to recall reading somewhere that creating a copyField is needed. Sorry for the entry-level question, but I'm still trying to understand how to configure Solr to do hierarchical faceting. Thanks, kmf -- View this message in context: http://lucene.472066.n3.nabble.com/hierarchical-faceting-SOLR-792-confused-on-config-tp2556394p2561445.html Sent from the Solr - User mailing list archive at Nabble.com. Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems Glasgow Caledonian University is a registered Scottish charity, number SC021474 Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009. http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html Winner: Times Higher Education’s Outstanding Support for Early Career Researchers of the Year 2010, GCU as a lead with Universities Scotland partners.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
Re: Solrj performance bottleneck
Hi, Thanks for your info. Currently my index size is around 4GB. Normally in small instances the total available memory is 1.6GB. In my setup, I allocated around 1GB as the heap size for Tomcat, hence I believe the remaining 600 MB will be used for the OS cache. I believe I need to migrate my Solr instance from a small instance to a large one, so that more memory is available for the OS cache. But initially I suspected that, since I call the SolrJ code from another instance, I need to increase the memory on the instance from which I run SolrJ. But you said I need to increase the memory on the Solr instance only. Here I just want to double-check this case, sorry for that. Once again thanks for your replies. Regards, On Wed, Mar 16, 2011 at 7:02 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Wed, Mar 16, 2011 at 7:25 AM, rahul asharud...@gmail.com wrote: In our setup, we have the Solr index on one machine and the SolrJ client part (Java code) on another machine. Currently, as you suggest, if it may be a case of 'not enough free RAM for the OS to cache', do I need to increase the RAM on the machine where the SolrJ query part runs, or increase the RAM on the Solr instance for the OS cache? That would be RAM for the Solr instance. If there is not enough free memory for the OS to cache, then each document retrieved will be a disk seek + read. Since both systems are on the local Amazon network (Linux EC2 small instances), I believe the network won't be an issue. Ah, how big is your index? Another thing: in your reply you mentioned 'client not reading fast enough'. Is that related to the network or to SolrJ? That was a general issue - it *can* be the client, but since you're using SolrJ it would be the network. -Yonik http://lucidimagination.com
Re: Solrj performance bottleneck
On Wed, Mar 16, 2011 at 12:56 PM, Asharudeen asharud...@gmail.com wrote: Currently my index size is around 4GB. Normally in small instances total available memory will be 1.6GB. In my setup, I allocated around 1GB as a heap size for tomcat. Hence I believe, remaining 600 MB will be used for OS cache. Actually, even less. A JVM with a 1GB heap size will take up even more memory than that (since the heap setting does not count things that are not on the heap, like the JVM code itself). This is definitely your problem. I believe, I need to migrate my Solr instance from small instance to large. So that some more memory will be allotted for OS cache. But initially I suspect, since I call Solrj code from another instance, I need to increase the memory in the instance from where I run the Solrj. But you said I need to increase the memory in Solr instance only. Here, just I want to double check this case only. sorry for that. SolrJ itself won't take up much memory. It depends on what else your client app is doing, but a small instance may be fine. -Yonik http://lucidimagination.com
Error: Unbuffered entity enclosing request can not be repeated.
Hi all! I created a SolrJ project to test Solr. I am inserting batches of 7000 records, each with 200 attributes, which adds up to approximately 13.77 MB per batch. I am measuring the time it takes to add and commit each set of 7000 records to an instantiation of CommonsHttpSolrServer. Each of the first 6 batches takes approximately 17 to 21 seconds. The 7th batch takes 42 seconds and the 8th takes 1 minute. And when it adds the 9th batch to the server, it generates this error:

Mar 16, 2011 4:56:20 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: I/O exception (java.net.SocketException) caught when processing request: Connection reset
Mar 16, 2011 4:56:21 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: Retrying request
Exception in thread "main" org.apache.solr.client.solrj.SolrServerException: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)

I googled this error and one of the suggestions is to reduce the number of records per batch. But I want to achieve a solution with at least 7000 records per batch. Any help would be appreciated. André
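If smaller batches do turn out to be necessary, the splitting itself is simple. This is only a sketch, not André's code: the actual CommonsHttpSolrServer add/commit calls are left out because they need a running Solr, and the point is just that a failed sub-batch can be rebuilt and re-sent, sidestepping the unrepeatable-entity retry problem:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplitter {
    // Split one large list of documents into fixed-size sub-batches.
    // Each sub-batch becomes its own HTTP POST; if one fails with
    // "Unbuffered entity enclosing request can not be repeated",
    // only that sub-batch needs to be built and sent again.
    public static <T> List<List<T>> split(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<List<T>>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(new ArrayList<T>(
                    docs.subList(i, Math.min(i + batchSize, docs.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<Integer>();
        for (int i = 0; i < 7000; i++) docs.add(i);
        // 7000 docs in sub-batches of 1000 -> prints 7
        System.out.println(split(docs, 1000).size());
    }
}
```

Each sub-batch would then go through server.add(...) and commit in its own try/catch, re-sending just that sub-batch on a SolrServerException.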
Re: hierarchical faceting, SOLR-792 - confused on config
Sorry, I missed the original mail on this thread. I put together that hierarchical faceting wiki page a couple of years ago when helping a customer evaluate SOLR-64 vs. SOLR-792 vs. other approaches. Since then, SOLR-792 morphed and is committed as pivot faceting. SOLR-64 spawned a PathTokenizer which is part of Solr now too. Recently Toke updated that page with some additional info. It's definitely not a how-to page, and perhaps should get renamed/moved/revamped? Toke? Erik On Mar 16, 2011, at 12:39, McGibbney, Lewis John wrote: Hi, This is also where I am having problems. I have not been able to understand very much from the wiki, and I do not understand how to configure the faceting we are referring to. Although I know very little about this, I can't help but think that the wiki is quite clearly inaccurate in places! Any comments please. Lewis [...]
RE: Different options for autocomplete/autosuggestion
I take raw user search-term data, 'collapse' it into a form where I have only unique terms, per store, ordered by frequency of searches over some time period. The suggestions are then grouped and presented with store breakouts. That sounds kind of like what this page is talking about, but I could be using the wrong terminology: http://wiki.apache.org/solr/FieldCollapsing -----Original Message----- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Tuesday, March 15, 2011 9:00 PM To: solr-user@lucene.apache.org Subject: Re: Different options for autocomplete/autosuggestion Hi, I actually don't follow how field collapsing helps with autocompletion...? Over at http://search-lucene.com we eat our own autocomplete dog food: http://sematext.com/products/autocomplete/index.html . Tasty stuff. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Kai Schlamp schl...@gmx.de To: solr-user@lucene.apache.org Sent: Mon, March 14, 2011 11:52:48 PM Subject: Re: Different options for autocomplete/autosuggestion @Robert: That sounds interesting and very flexible, but also like a lot of work. This approach also doesn't seem to allow querying Solr directly via Ajax ... one of the big benefits, in my opinion, of using Solr. @Bill: There are some things I don't like about the Suggester component. It doesn't seem to allow infix searches (at least it is not mentioned in the wiki or elsewhere). It also uses a separate index that has to be rebuilt independently of the main index. And it doesn't support any filter queries. The Lucid Imagination blog also describes a further autosuggest approach (http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/). The disadvantage here is that the source documents must have distinct fields (that is, the DIH selects must provide distinct data).
Otherwise duplications would come up in the Solr query result, because of the document-oriented nature of Solr. In my opinion field collapsing seems the most promising route to a full-featured autosuggestion solution. Unfortunately it is not available for Solr 1.4.x or 3.x (I tried patching those branches several times without success). 2011/3/15 Bill Bell billnb...@gmail.com: http://lucidworks.lucidimagination.com/display/LWEUG/Spell+Checking+and+Automatic+Completion+of+User+Queries For Auto-Complete, find the following section in the solrconfig.xml file for the collection:

<!-- Auto-Complete component -->
<searchComponent name="autocomplete" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">autocomplete</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
    <str name="field">autocomplete</str>
    <str name="buildOnCommit">true</str>
    <!-- <str name="sourceLocation">american-english</str> -->
  </lst>
</searchComponent>

On 3/14/11 8:16 PM, Andy angelf...@yahoo.com wrote: Can you provide more details? Or a link? --- On Mon, 3/14/11, Bill Bell billnb...@gmail.com wrote: See how Lucid Enterprise does it... A bit differently. On 3/14/11 12:14 AM, Kai Schlamp kai.schl...@googlemail.com wrote: Hi. There seem to be several options for implementing an autocomplete/autosuggestion feature with Solr. I am trying to summarize those possibilities together with their advantages and disadvantages. It would be really nice to read some of your opinions.

* Using N-Gram filter + text field query
  + available in stable 1.4.x
  + results can be boosted
  + sorted by best matches
  - may return duplicate results
* Facets
  + available in stable 1.4.x
  + no duplicate entries
  - sorted by count
  - may need an extra N-Gram field for infix queries
* Terms
  + available in stable 1.4.x
  + infix query by using regex in 3.x
  - only prefix query in 1.4.x
  - regexp may be slow (just a guess)
* Suggestions
  ? Did not try that yet. Does it allow infix queries?
* Field Collapsing
  + no duplications
  - only available in 4.x branch
  ? Does it work together with highlighting? That would be a big plus.

What are your experiences regarding autocomplete/autosuggestion with Solr? Any additions, suggestions or corrections? What do you prefer? Kai -- Dr. med. Kai Schlamp Am Fort Elisabeth 17 55131 Mainz Germany Phone +49-177-7402778 Email: schl...@gmx.de
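For the edge-n-gram approach Kai lists first (and which the Lucid blog post describes), the usual field type looks roughly like this -- a sketch only; the type name and gram sizes are invented, though the analyzer factories themselves ship with Solr 1.4:

```xml
<fieldType name="autocomplete_edge" class="solr.TextField" positionIncrementGap="100">
  <!-- index side: store every prefix of the term as its own token -->
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
  <!-- query side: the user's partial input is matched as-is against the prefixes -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

This gives prefix matching only; infix matching would need a plain NGramFilterFactory instead, at the cost of a much larger index.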
Re: faceting over ngrams
Hi Jonathan, Thanks for sharing useful bits. Each shard has 16G of heap. Unless I am doing something fundamentally wrong in the SOLR configuration, I have to admit that counting ngrams up to trigrams across the whole set of a shard's documents is a pretty intensive task, as each ngram can occur anywhere in the index and SOLR most probably doesn't precompute its cumulative count. I'll try querying with facet.method=fc, thanks for that. By the way, the trigrams are defined like this:

<fieldType name="shingle_text_trigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>

As for the sharding -- I decided to go with it when the index size approached half a terabyte and the doc count went over 100M; I thought it would help us scale better. I also maintain a good level of caching, and so far faceting over normal string fields (no ngrams) has performed really well (around 1 sec). On Wed, Mar 16, 2011 at 6:23 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Ah, wait, you're doing sharding? Yeah, I am NOT doing sharding, so that could explain our different experiences. [...] -- Regards, Dmitry Kan
Using Solr 1.4.1 on most recent Tomcat 7.0.11
Hello list, Is anyone running Solr (in my case 1.4.1) on the above Tomcat dist? In the past I have followed the guidance at http://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat but having upgraded from Tomcat 7.0.8 to 7.0.11 I am having problems, e.g.:

INFO: Deploying configuration descriptor wombra.xml
16-Mar-2011 16:57:36 org.apache.tomcat.util.digester.Digester fatalError
SEVERE: Parse Fatal Error at line 4 column 6: The processing instruction target matching [xX][mM][lL] is not allowed.
org.xml.sax.SAXParseException: The processing instruction target matching [xX][mM][lL] is not allowed.
...
16-Mar-2011 16:57:36 org.apache.catalina.startup.HostConfig deployDescriptor
SEVERE: Error deploying configuration descriptor wombra.xml
org.xml.sax.SAXParseException: The processing instruction target matching [xX][mM][lL] is not allowed.
... some more ...

This is my context fragment from /home/lewis/Downloads/apache-tomcat-7.0.11/conf/Catalina/localhost:

<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/home/lewis/Downloads/wombra/wombra.war" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/home/lewis/Downloads/wombra" override="true"/>
</Context>

Preferably I would deploy a WAR file, but I have been working well with this configuration up until now, therefore I didn't question change. I am unfamiliar with the above errors. Can anyone please point me in the right direction? Thank you Lewis
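For what it's worth, that particular SAXParseException usually means something precedes the <?xml ...?> declaration -- typically a UTF-8 byte-order mark or stray whitespace saved by an editor. A quick way to check (a sketch; descriptor.xml here is a demo file created with a deliberate BOM, not Lewis's actual wombra.xml):

```shell
# Create a demo descriptor that starts with a UTF-8 BOM (bytes ef bb bf,
# written in octal below), then dump its first bytes. Anything before the
# "3c 3f" (<?) bytes triggers the "[xX][mM][lL] is not allowed" error.
printf '\357\273\277<?xml version="1.0" encoding="utf-8"?>\n' > descriptor.xml
head -c 8 descriptor.xml | od -An -tx1
# prints: ef bb bf 3c 3f 78 6d 6c
# The leading three bytes are the BOM; resave the file as UTF-8 without
# a BOM (or strip those bytes) and Tomcat should parse it again.
```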
Re: faceting over ngrams
Oh, a doc count over 100M is a very different thing than a doc count of about 1M. In your original message you said I tried creating an index with 1M documents, each with 100 unique terms in a field. If you instead have 100M documents, your use is a couple orders of magnitude larger than mine. It also occurs to me that while I have around 3 million documents, and probably up to 50 million or so unique values in the multi-valued faceted field -- each document only has 3-10 values, not 100 each. So that may also be a difference that affects the faceting algorithm to your detriment, not sure. Prior to Solr 1.4, it was pretty much impossible to facet over 1 million+ unique values at all; now it works wonderfully in many use cases, but you may have found one that's still too much for it. It also raises my curiosity as to why you'd want to facet over an n-grammed field to begin with -- that's definitely not an ordinary use case. Perhaps there is some way to do what you need without faceting? But you probably know what you're doing. Jonathan On 3/16/2011 2:25 PM, Dmitry Kan wrote: Hi Jonathan, Thanks for sharing useful bits. Each shard has 16G of heap. [...]
Re: hierarchical faceting, SOLR-792 - confused on config
Interesting -- any documentation on the PathTokenizer anywhere? Or do I just have to find and look at the source? That's something I hadn't known about, which may be useful for some stuff I've been working on, depending on how it works. If nothing else, in the meantime, I'm going to take that exact message from Erik and add it to the top of the wiki page, to avoid other people getting confused (I've been confused by that page too) until someone spends the time to rewrite it to be more up to date and accurate, or clearer about its topicality. On 3/16/2011 1:36 PM, Erik Hatcher wrote: Sorry, I missed the original mail on this thread. I put together that hierarchical faceting wiki page a couple of years ago when helping a customer evaluate SOLR-64 vs. SOLR-792 vs. other approaches. Since then, SOLR-792 morphed and is committed as pivot faceting. SOLR-64 spawned a PathTokenizer which is part of Solr now too. Recently Toke updated that page with some additional info. It's definitely not a how-to page, and perhaps should get renamed/moved/revamped? Toke? Erik On Mar 16, 2011, at 12:39, McGibbney, Lewis John wrote: Hi, This is also where I am having problems. I have not been able to understand very much from the wiki, and I do not understand how to configure the faceting we are referring to. Although I know very little about this, I can't help but think that the wiki is quite clearly inaccurate in places! Any comments please. Lewis [...]
RE: hierarchical faceting, SOLR-792 - confused on config
Hi Erik, I have been reading about the progression of SOLR-792 into pivot faceting; however, can you expand on where it is committed? Are you referring to trunk? The reason I am asking is that I have been using 1.4.1 for some time now and have been thinking of upgrading to trunk... or branch. Thank you Lewis From: Erik Hatcher [erik.hatc...@gmail.com] Sent: 16 March 2011 17:36 To: solr-user@lucene.apache.org Subject: Re: hierarchical faceting, SOLR-792 - confused on config Sorry, I missed the original mail on this thread. I put together that hierarchical faceting wiki page a couple of years ago when helping a customer evaluate SOLR-64 vs. SOLR-792 vs. other approaches. Since then, SOLR-792 morphed and is committed as pivot faceting. SOLR-64 spawned a PathTokenizer which is part of Solr now too. Recently Toke updated that page with some additional info. It's definitely not a how-to page, and perhaps should get renamed/moved/revamped? Toke? Erik
Re: SOLR DIH importing MySQL text column as a BLOB
Hi Kaushik, If the field is being treated as a blob, you can try using the FieldStreamDataSource mapping. It handles blob objects and extracts the contents from them. This feature is available only from Solr 3.1 onwards, I believe. http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/FieldStreamDataSource.html Regards, Jayendra On Tue, Mar 15, 2011 at 11:57 PM, Kaushik Chakraborty kaych...@gmail.com wrote: I've a column for posts in MySQL of type `text`. I've tried the corresponding field types for it in Solr's schema.xml, e.g. string, text, text_ws, but whenever I import it using the DIH, it gets imported as a BLOB object. I checked: this happens only for columns of type `text` and not for `varchar` (those get indexed as strings). Hence the posts field is not searchable. I found out about this issue, after repeated search failures, when I did a *:* query on Solr. A sample response:

<result name="response" numFound="223" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <str name="solr_post_bio">[B@10a33ce2</str>
    <date name="solr_post_created_at">2011-02-21T07:02:55Z</date>
    <str name="solr_post_email">test.acco...@gmail.com</str>
    <str name="solr_post_first_name">Test</str>
    <str name="solr_post_last_name">Account</str>
    <str name="solr_post_message">[B@2c93c4f1</str>
    <str name="solr_post_status_message_id">1</str>
  </doc>
</result>

The data-config.xml:

<document>
  <entity name="posts" dataSource="jdbc"
          query="select p.person_id as solr_post_person_id, pr.first_name as solr_post_first_name, pr.last_name as solr_post_last_name, u.email as solr_post_email, p.message as solr_post_message, p.id as solr_post_status_message_id, p.created_at as solr_post_created_at, pr.bio as solr_post_bio from posts p, users u, profiles pr where p.person_id = u.id and p.person_id = pr.person_id and p.type='StatusMessage'">
    <field column="solr_post_person_id" />
    <field column="solr_post_first_name" />
    <field column="solr_post_last_name" />
    <field column="solr_post_email" />
    <field column="solr_post_message" />
    <field column="solr_post_status_message_id" />
    <field column="solr_post_created_at" />
    <field column="solr_post_bio" />
  </entity>
</document>

The schema.xml:

<fields>
  <field name="solr_post_status_message_id" type="string" indexed="true" stored="true" required="true" />
  <field name="solr_post_message" type="text_ws" indexed="true" stored="true" required="true" />
  <field name="solr_post_bio" type="text" indexed="false" stored="true" />
  <field name="solr_post_first_name" type="string" indexed="false" stored="true" />
  <field name="solr_post_last_name" type="string" indexed="false" stored="true" />
  <field name="solr_post_email" type="string" indexed="false" stored="true" />
  <field name="solr_post_created_at" type="date" indexed="false" stored="true" />
</fields>
<uniqueKey>solr_post_status_message_id</uniqueKey>
<defaultSearchField>solr_post_message</defaultSearchField>

Thanks, Kaushik
Re: faceting over ngrams
Hi Toke, Thanks a lot for trying this out. I should mention that the faceted search hits only one specific shard by design, so in general the time to query a shard directly and through the proxy Solr should be comparable. Would it be feasible for you to make that field ngram'ed, or is it too much trouble for you? I'll check out the direct query and let you know. On Wed, Mar 16, 2011 at 5:51 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote: Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. I tried creating an index with 1M documents, each with 100 unique terms in a field. A search for *:* with a facet request for the first 1M entries in the field took about 20 seconds for the first call and about 1-1½ seconds for each subsequent call. This was with Solr trunk. My setup is no doubt a lot simpler and lighter than yours, but 5 minutes sounds excessive. My guess is that your performance problem is due to the merging process. Could you try measuring the performance of a direct request to a single shard? If that is satisfactory, going to the cloud would not solve your problem. If you really need 1M entries in your result set, you would be better off investigating whether your index can fit in a single instance. -- Regards, Dmitry Kan
Re: faceting over ngrams
On Wed, Mar 16, 2011 at 8:05 AM, Dmitry Kan dmitry@gmail.com wrote: Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. Would running on a cloud with Hadoop make it faster (to seconds), as faceting seems to be a natural map-reduce task? How many indexed tokens does each document have (for the field you are faceting on) on average? How many unique tokens are indexed in that field over the complete index? Or you could go to the admin/stats page and cut-n-paste the fieldValueCache entry after your faceting request - it should contain most of the info to further analyze this. -Yonik http://lucidimagination.com
Re: Using Solr 1.4.1 on most recent Tomcat 7.0.11
Lewis Quick response: I am currently using Tomcat 7.0.8 with Solr (with no issues). I will upgrade to 7.0.11 tonight and see if I run into the same issues. Stay tuned, as they say. Cheers François

On Mar 16, 2011, at 2:38 PM, McGibbney, Lewis John wrote:

Hello list, Is anyone running Solr (in my case 1.4.1) on the above Tomcat dist? In the past I have been following the guidance at http://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat but having upgraded from Tomcat 7.0.8 to 7.0.11 I am having problems, e.g.:

INFO: Deploying configuration descriptor wombra.xml
16-Mar-2011 16:57:36 org.apache.tomcat.util.digester.Digester fatalError
SEVERE: Parse Fatal Error at line 4 column 6: The processing instruction target matching [xX][mM][lL] is not allowed.
org.xml.sax.SAXParseException: The processing instruction target matching [xX][mM][lL] is not allowed.
...
16-Mar-2011 16:57:36 org.apache.catalina.startup.HostConfig deployDescriptor
SEVERE: Error deploying configuration descriptor wombra.xml
org.xml.sax.SAXParseException: The processing instruction target matching [xX][mM][lL] is not allowed.
... some more ...

My configuration descriptor (the context fragment from /home/lewis/Downloads/apache-tomcat-7.0.11/conf/Catalina/localhost) is as follows:

<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/home/lewis/Downloads/wombra/wombra.war" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/home/lewis/Downloads/wombra" override="true"/>
</Context>

Preferably I would upload a WAR file, but I have been working well with this configuration up until now, therefore I didn't question changing it. I am unfamiliar with the above errors. Can anyone please point me in the right direction? Thank you Lewis
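For context on Lewis's error: that SAXParseException ("The processing instruction target matching [xX][mM][lL] is not allowed") is the XML parser's way of saying the <?xml ...?> declaration is not the very first thing in the file. It usually means a UTF-8 BOM, blank lines, or stray characters precede the declaration; note the error points at line 4, which suggests the declaration is not on line 1 of the deployed file. A descriptor that parses must start with the declaration at byte zero:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- nothing (not even a BOM, space, or newline) may precede the line above -->
<Context docBase="/home/lewis/Downloads/wombra/wombra.war" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/home/lewis/Downloads/wombra" override="true"/>
</Context>
```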
Re: 'Registering' a query / Percolation
: I.E. Instruct Solr that you are interested in documents that match a : given query and then have Solr notify you (through whatever callback : mechanism is specified) if and when a document appears that matches the : query. : : We are planning on writing some software that will effectively grind : Solr to give us the same behaviour, but if Solr has this registration : built in, it would be very useful and much easier on our resources...

it does not, but there are typically two ways people deal with this, depending on the balance of your variables:

* max latency of notifications after a doc is added/updated
* rate of churn of documents in the index
* number of registered queries for notification

1) if you have a heavy churn of documents, and the max latency allowed for notification is large, then periodic polling at a frequency matching that latency can be preferable, to minimize the amount of redundant work

2) if the churn on documents is going to be relatively small and/or the number of registered queries is going to be relatively large, you can invert the problem and build an index where each document represents a query; as documents are added/updated you use the terms in those documents to query your query index (this could even be done as an UpdateProcessor on your doc core, querying over to some other notifications core)

(disclaimer: i've never implemented any of these ideas personally, this is just what i've picked up over the years on the mailing lists) -Hoss
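Hoss's second option (invert the problem: index the queries, match incoming documents against them) can be sketched in miniature. This is a toy illustration in plain Python, not Lucene/Solr code: the class and method names are invented, tokenization is naive whitespace splitting, and a real implementation would run actual queries against a query index (e.g. a separate notifications core) instead of doing set arithmetic in memory.

```python
from collections import defaultdict

class QueryRegistry:
    """Toy 'percolator': store registered queries as term sets, and for
    each incoming document find the queries whose terms all appear in it."""

    def __init__(self):
        self.queries = {}                    # query_id -> set of required terms
        self.term_index = defaultdict(set)   # term -> ids of queries using it

    def register(self, query_id, terms):
        required = set(t.lower() for t in terms)
        self.queries[query_id] = required
        for term in required:
            self.term_index[term].add(query_id)

    def match_document(self, text):
        doc_terms = set(text.lower().split())
        # Candidate queries share at least one term with the document...
        candidates = set()
        for term in doc_terms:
            candidates |= self.term_index.get(term, set())
        # ...and match only if all of their required terms are present.
        return sorted(q for q in candidates
                      if self.queries[q] <= doc_terms)

reg = QueryRegistry()
reg.register("q1", ["solr", "faceting"])
reg.register("q2", ["tomcat"])
print(reg.match_document("slow faceting in solr 1.4"))  # ['q1']
```

On each add/update you would look up matching registered queries and fire your notification callback for each hit.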
Re: Error during auto-warming of key
that is odd... can you let us know exactly what version of Solr/Lucene you are using (if it's not an official release, can you let us know exactly what the version details on the admin info page say, i'm curious about the svn revision)

Of course, that's the stable 1.4.1.

can you also please let us know what types of queries you are generating? ... that's the toString output of a query and it's not entirely clear what the original looked like. If you can recognize what the original query was, it would also be helpful to know if you can consistently reproduce this error on autowarming after executing that query (or queries like it with a slightly different date value)

It's extremely difficult to reproduce. It happened on a multinode system that's being prepared for production. It has been under heavy load for a long time already, with updates and queries. It is continuously being updated with real user input and receives real user queries from a source that's being replayed from logs. Solr is about to replace an existing search solution. It is impossible to reproduce because of these uncontrollable variables; i tried but failed. The error, however, did occur at least a couple of times after i started this thread. It hasn't reappeared after i reduced precision from milliseconds to an hour, see my other thread for more information: http://web.archiveorange.com/archive/v/AAfXfFuqjPhU4tdq53Tv

One of the things that particularly boggles me is this...

: org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
:   at org.apache.solr.search.SolrIndexSearcher.cacheDocSet(SolrIndexSearcher.java:520)
[...]
: Well, i use Dismax' bf parameter to boost very recent documents. I'm not
: using the queryResultCache or documentCache, only filterCache and Lucene
: fieldCache.

... that cache warming stack trace seems to be coming from the filterCache, but that contradicts your statement that you don't use the filterCache.
independent of your comments, that's an odd looking query to be cached in the filter cache anyway, since it includes a mandatory matchalldocs clause, and seems to only exist for boosting on that function.

But i am using the filterCache and fieldCache (forgot to mention the obvious fieldValueCache as well). If you have any methods that may help to reproduce it, i'm of course willing to take the time and see if i can. It may prove really hard because several weird errors were not reproducible in a more controlled but similar environment (load and config) and i can't mess with the soon-to-be production cluster. Thanks! -Hoss
Re: faceting over ngrams
Hi Yonik, I have run the queries against a single-index Solr with only 16M documents. After attaching facet.method=fc the results seemed to come back faster (first two queries below), but still not fast enough. Here are the fieldValueCache stats:

(facet.limit=100&facet.mincount=5&facet.method=fc, 542094 hits, 1 min) -- smallest result set

name: fieldValueCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=10000, initialSize=10, minSize=9000, acceptableSize=9500, cleanupThread=false)
stats:
  lookups : 400
  hits : 396
  hitratio : 0.99
  inserts : 1
  evictions : 0
  size : 1
  warmupTime : 0
  cumulative_lookups : 400
  cumulative_hits : 396
  cumulative_hitratio : 0.99
  cumulative_inserts : 1
  cumulative_evictions : 0
  item_shingleContent_trigram : {field=shingleContent_trigram,memSize=1786355392,tindexSize=17977426,time=662387,phase1=654707,nTerms=53492050,bigTerms=38,termInstances=602090958,uses=397}

(facet.limit=100&facet.mincount=5&facet.method=fc, 2837589 hits, 3 min 8 s) -- largest result set

name: fieldValueCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=10000, initialSize=10, minSize=9000, acceptableSize=9500, cleanupThread=false)
stats:
  lookups : 401
  hits : 397
  hitratio : 0.99
  inserts : 1
  evictions : 0
  size : 1
  warmupTime : 0
  cumulative_lookups : 401
  cumulative_hits : 397
  cumulative_hitratio : 0.99
  cumulative_inserts : 1
  cumulative_evictions : 0
  item_shingleContent_trigram : {field=shingleContent_trigram,memSize=1786355392,tindexSize=17977426,time=662387,phase1=654707,nTerms=53492050,bigTerms=38,termInstances=602090958,uses=398}

On Wed, Mar 16, 2011 at 9:46 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Wed, Mar 16, 2011 at 8:05 AM, Dmitry Kan dmitry@gmail.com wrote: Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over a trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query. Would running on a cloud with Hadoop make it faster (to seconds), as faceting seems to be a natural map-reduce task? How many indexed tokens does each document have (for the field you are faceting on) on average? How many unique tokens are indexed in that field over the complete index? Or you could go to the admin/stats page and cut-n-paste the fieldValueCache entry after your faceting request - it should contain most of the info to further analyze this. -Yonik http://lucidimagination.com -- Regards, Dmitry Kan
i don't get why my index didn't grow more...
OK, I have a 30 GB index where there are lots of sparsely populated int fields, one title field, and one catchall field with the title and everything else we want as keywords. I figure the catchall field is the biggest field in our documents, which as I mentioned are otherwise composed of a variety of int fields and a title. So my puzzlement is this: my biggest field is already copied into a double metaphone field, and now I added another copyField to also copy the catchall field into a newly created soundex field, as an experiment to compare the effectiveness of the two. I expected the index to grow by at least 25% to 30%, but it barely grew at all. Can someone explain this to me? Thanks! J
Re: FunctionQueries and FieldCache and OOM
: Alright, i can now confirm the issue has been resolved by reducing precision. : The garbage collector on nodes without reduced precision has a real hard time : keeping up and clearly shows a very different graph of heap consumption. : : Consider using MINUTE, HOUR or DAY as precision in case you suffer from : excessive memory consumption: : : recip(ms(NOW/PRECISION,DATE_FIELD),TIME_FRACTION,1,1)

FWIW: it sounds like your problem wasn't actually related to your fieldCache, but probably instead it was because of how big your queryResultCache is

: Am i correct when i assume that Lucene FieldCache entries are added for : each unique function query? In that case, every query is a unique cache

...no, the FieldCache has one entry per field name, and the value of that cache is an array keyed off of the internal docId for every doc in the index, with the corresponding value (it's an uninverted version of lucene's inverted index, for doing fast value lookups by document). changes in the *values* used in your function queries won't affect FieldCache usage -- only changing the *fields* used in your functions would impact that.

: each unique function query? In that case, every query is a unique cache : entry because it operates on milliseconds. If all doesn't work i might be

what you describe is correct, but not in the FieldCache -- the queryResultCache is where queries that deal with the main result set (ie: paginated and/or sorted) wind up .. having lots of distinct queries in the bq (or q) param will make the number of unique items in that cache grow significantly (just like having lots of distinct queries in the fq will cause your filterCache to grow significantly)

you should definitely check out what max size you have configured for your queryResultCache ... it sounds like it's probably too big, if you were getting OOM errors from having high precision dates in your boost queries.
while i think using less precision is a wise choice, you should still consider dialing that max size down, so that if some other usage pattern still causes lots of unique queries in a short time period (a bot crawling your site map, perhaps) it doesn't fill up and cause another OOM -Hoss
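The knob Hoss is pointing at lives in solrconfig.xml. A sketch with illustrative, deliberately modest sizes (not recommendations; size it against your own query mix):

```xml
<!-- caches results of q + sort + pagination; one entry per distinct query,
     so high-precision dates in bq can make it balloon -->
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>
```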
Re: i don't get why my index didn't grow more...
On Wed, Mar 16, 2011 at 5:10 PM, Robert Petersen rober...@buy.com wrote: OK, I have a 30 GB index where there are lots of sparsely populated int fields, one title field, and one catchall field with the title and everything else we want as keywords. I figure the catchall field is the biggest field in our documents, which as I mentioned are otherwise composed of a variety of int fields and a title. So my puzzlement is this: my biggest field is already copied into a double metaphone field, and now I added another copyField to also copy the catchall field into a newly created soundex field, as an experiment to compare the effectiveness of the two. I expected the index to grow by at least 25% to 30%, but it barely grew at all. Can someone explain this to me? Thanks! J

I assume you reindexed everything? Anyway, the size of indexed fields generally grows sub-linearly (as opposed to stored fields, which grow exactly linearly). But if it really barely grew at all, this could point to other parts of the index taking up much more space than you realize. If you could do an ls -l of your index directory, we might be able to see what parts of the index are using up the most space. -Yonik http://lucidimagination.com
Re: Error during auto-warming of key
Actually, i dug in the logs again and, surprise, it sometimes still occurs with `random` queries. Here are a few snippets from the error log. Somewhere during that time there might be OOM errors, but older logs are unfortunately rotated away.

2011-03-14 00:25:32,152 ERROR [solr.search.SolrCache] - [pool-1-thread-1] - : Error during auto-warming of key:f_sp_eigenschappen:geo:java.lang.ArrayIndexOutOfBoundsException: 431733
  at org.apache.lucene.util.BitVector.get(BitVector.java:102)
  at org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:152)
  at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:642)
  at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
  at org.apache.solr.search.SolrIndexSearcher.cacheDocSet(SolrIndexSearcher.java:520)
  at org.apache.solr.search.SolrIndexSearcher$2.regenerateItem(SolrIndexSearcher.java:296)
  at org.apache.solr.search.FastLRUCache.warm(FastLRUCache.java:168)
  at org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1481)
  at org.apache.solr.core.SolrCore$2.call(SolrCore.java:1131)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:619)

2011-03-14 00:25:32,795 ERROR [solr.search.SolrCache] - [pool-1-thread-1] - : Error during auto-warming of key:+(titel_i:touareg^5.0 | f_advertentietype:touareg^2.0 | f_automodel_j:touareg^8.0 | facets:touareg^2.0 | omschrijving_i:touareg | catlevel1_i:touareg^2.0 | catlevel2_i:touareg^4.0)~0.1 () (10.0/(7.71E-8*float(ms(const(130003560),date(sort_date)))+1.0))^10.0:java.lang.ArrayIndexOutOfBoundsException: 468554
  at org.apache.lucene.util.BitVector.get(BitVector.java:102)
  at org.apache.lucene.index.SegmentTermDocs.readNoTf(SegmentTermDocs.java:169)
  at org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:139)
  at org.apache.lucene.search.TermScorer.nextDoc(TermScorer.java:130)
  at org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:145)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:246)
  at org.apache.lucene.search.Searcher.search(Searcher.java:171)
  at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:651)
  at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
  at org.apache.solr.search.SolrIndexSearcher.cacheDocSet(SolrIndexSearcher.java:520)
  at org.apache.solr.search.SolrIndexSearcher$2.regenerateItem(SolrIndexSearcher.java:296)
  at org.apache.solr.search.FastLRUCache.warm(FastLRUCache.java:168)
  at org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1481)
  at org.apache.solr.core.SolrCore$2.call(SolrCore.java:1131)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:619)

2011-03-14 00:25:33,051 ERROR [solr.search.SolrCache] - [pool-1-thread-1] - : Error during auto-warming of key:+*:* (10.0/(7.71E-8*float(ms(const(130003560),date(sort_date)))+1.0))^10.0:java.lang.ArrayIndexOutOfBoundsException: 489479
  at org.apache.lucene.util.BitVector.get(BitVector.java:102)
  at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
  at org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:562)
  at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:208)
  at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:525)
  at org.apache.solr.search.function.LongFieldSource.getValues(LongFieldSource.java:57)
  at org.apache.solr.search.function.DualFloatFunction.getValues(DualFloatFunction.java:48)
  at org.apache.solr.search.function.ReciprocalFloatFunction.getValues(ReciprocalFloatFunction.java:61)
  at org.apache.solr.search.function.FunctionQuery$AllScorer.init(FunctionQuery.java:123)
  at org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:93)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
  at
dismax parser, parens, what do they do exactly
It looks like the dismax query parser can somehow handle parens, used for applying, for instance, + or - to a group, distributing it. But I'm not sure what effect they have on the overall query. For instance, if I give dismax this: book (dog +( cat -frog)) debugQuery shows: +((DisjunctionMaxQuery((text:book)~0.01) +DisjunctionMaxQuery((text:dog)~0.01) DisjunctionMaxQuery((text:cat)~0.01) -DisjunctionMaxQuery((text:frog)~0.01))~2) () How will that be treated by mm? Let's say I have an mm of 50%. Does that apply at the top level, like either book needs to match or +(dog +( cat -frog)) needs to match? And for +(dog +( cat -frog)) to match, does just 50% of that subquery need to match... or is mm ignored there? Or something else entirely? Can anyone clear this up?

Continuing to try experimentally to clear it up... it _looks_ like the mm actually applies to each _individual_ low-level query. So even though the semantics of book (dog +( cat -frog)) are respected, if mm is 50%, the nesting is irrelevant; exactly 50% of book, dog, +cat, and +-frog (distributing the operators through, I guess?) are required. I think. I'm getting confused even talking about it.
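For reference, this is roughly how dismax turns an mm spec into the concrete number that appears as the trailing ~N on the top-level BooleanQuery in the debug output (mm counts only the optional clauses, not the + and - ones). Below is a simplified Python rendering of the logic in Solr's SolrPluginUtils.calculateMinShouldMatch; it handles only a single integer or percentage spec (the conditional "N<M%" form is omitted), so treat it as a sketch rather than the exact implementation.

```python
def min_should_match(optional_clauses, spec):
    """Compute the minimum number of optional clauses that must match,
    given an mm spec like '2', '-1', '50%' or '-25%'."""
    result = optional_clauses
    if spec.endswith('%'):
        percent = int(spec[:-1])
        calc = int(result * percent / 100)  # truncates toward zero
        result = result + calc if calc < 0 else calc
    else:
        calc = int(spec)
        result = result + calc if calc < 0 else calc
    # clamp into the range [0, optional_clauses]
    return min(optional_clauses, max(result, 0))

print(min_should_match(4, "50%"))   # 2
print(min_should_match(3, "50%"))   # 1
print(min_should_match(5, "-1"))    # 4
```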
Re: Sorting on multiValued fields via function query
: However, many of our multiValued fields are single valued for the majority : of documents in our index so we may not have noticed the incorrect sorting : behaviors.

that would make sense ... if you use a multiValued field as if it were single valued, you would never encounter a problem. if you had *some* multivalued fields your results would be sorted extremely arbitrarily for those docs that did have multiple values, unless you had more distinct values than you had documents -- at which point you would get a hard crash at query time.

: Regardless, I understand the reasoning behind the restriction, I'm : interested in getting around it by using a functionQuery to reduce : multiValued fields to a single value. It sounds like this isn't possible,

I don't think we have any functions that do that -- functions are composed of ValueSources, which may be composed of other ValueSources, but ultimately the data comes from somewhere, and in every case i can think of (except for constant values) that data comes from the FieldCache -- the same FieldCache used for sorting. I don't think there are any ValueSources that will let you specify a multiValued field and then pick one of those values based on a rule/function ... even the PolyFields used for spatial search work by using multiple field names under the covers (N distinct field names for an N-dimensional space)

: is that correct? Ideally I'd like to sort by the maximum value on : descending sorts and the minimum value on ascending sorts. Is there any : movement towards implementing this sort of behavior?

this is a fairly classic use case for just having multiple fields.
even if the logic was implemented to support this at query time, it could never be faster than sorting on a single valued field that you populate with the min/max at indexing time -- the mantra of fast IR is that if you can precompute it independently of the individual search criteria, you should (it's the whole foundation for why the inverted index exists) -Hoss
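Hoss's closing suggestion (precompute min/max into companion single-valued fields at indexing time) is trivial on the indexing-client side. A sketch, with invented field names; you would then sort ascending on price_min and descending on price_max, while keeping the multiValued field for searching and display:

```python
def add_sort_fields(doc, field, values):
    """At indexing time, flatten a multiValued field into single-valued
    companion fields usable for asc/desc sorting."""
    doc[field] = values                 # the multiValued field itself
    doc[field + "_min"] = min(values)   # sort asc on this one
    doc[field + "_max"] = max(values)   # sort desc on this one
    return doc

doc = add_sort_fields({"id": "42"}, "price", [9.99, 4.50, 12.00])
print(doc["price_min"], doc["price_max"])  # 4.5 12.0
```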
Re: Version Incompatibility(Invalid version (expected 2, but 1) or the data in not in 'javabin' format)
I am using the Solr 4.0 api to search an index made using the solr 1.4 version. I am getting the error Invalid version (expected 2, but 1) or the data in not in 'javabin' format. Can anyone help me fix this problem?

You need to use solrj version 1.4, whose javabin wire format is compatible with your server version. Actually there exists another solution: using XMLResponseParser instead of BinaryResponseParser, which is the default:

new CommonsHttpSolrServer(new URL("http://solr1.4.0Instance:8080/solr"), null, new XMLResponseParser(), false);
Re: Sorting on multiValued fields via function query
On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : However, many of our multiValued fields are single valued for the majority : of documents in our index so we may not have noticed the incorrect sorting : behaviors. that would make sense ... if you use a multiValued field as if it were single valued, you would never encounter a problem. if you had *some* multivalued fields your results would be sorted extremely arbitrarily for those docs that did have multiple values, unless you had more distinct values than you had documents -- at which point you would get a hard crash at query time.

AFAIK, not any more. Since that behavior was very unreliable, it has been removed, and you can reliably sort by any multi-valued field in lucene (with the sort order being defined by the largest value if there are multiple). -Yonik http://lucidimagination.com
Re: Sorting on multiValued fields via function query
Huh, so lucene is actually doing what has been commonly described as impossible in Solr? But is Solr trunk, as the OP seemed to report, still not aware of this, raising an error on a sort on a multi-valued field instead of just saying, okay, we'll pass it to lucene anyway and go with lucene's approach to sorting on a multi-valued field (that is, apparently, using the largest value)? If so... that kind of sounds like a bug/misfeature, yes, no?

Also... lucene is already capable of sorting on a multi-valued field by choosing the largest value (largest vs. smallest is presumably just arbitrary there; there is presumably no performance implication to choosing the smallest instead of the largest, it just chooses the largest, according to Yonik). So... if someone patched lucene so that whether it chose the largest or smallest was a parameter passed in -- probably not a large patch, since lucene, says Yonik, has already been enhanced to always choose the largest -- and then patched Solr to take a param and pass it to Lucene for this purpose, which presumably also wouldn't be a large patch, then we'd have the feature the OP asked for.

Based on Yonik's description (assuming I understand correctly and he's correct), it doesn't sound like a lot of code. But it's still beyond my unfamiliar-with-lucene-code-not-so-great-at-java abilities, nor do I have the interest for my own app needs at the moment. But if the OP or someone else has both, it sounds like a plausible feature?

On 3/16/2011 6:00 PM, Yonik Seeley wrote: On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : However, many of our multiValued fields are single valued for the majority : of documents in our index so we may not have noticed the incorrect sorting : behaviors. that would make sense ... if you use a multiValued field as if it were single valued, you would never encounter a problem.
if you had *some* multivalued fields your results would be sorted extremely arbitrarily for those docs that did have multiple values, unless you had more distinct values than you had documents -- at which point you would get a hard crash at query time.

AFAIK, not any more. Since that behavior was very unreliable, it has been removed, and you can reliably sort by any multi-valued field in lucene (with the sort order being defined by the largest value if there are multiple). -Yonik http://lucidimagination.com
Re: hierarchical faceting, SOLR-792 - confused on config
(11/03/17 3:53), Jonathan Rochkind wrote: Interesting, any documentation on the PathTokenizer anywhere? It is PathHierarchyTokenizer: https://hudson.apache.org/hudson/job/Solr-trunk/javadoc/org/apache/solr/analysis/PathHierarchyTokenizerFactory.html Koji -- http://www.rondhuit.com/en/
Re: Sorting on multiValued fields via function query
I agree with this, and it is even needed for function sorting on multivalued fields. See the geohash patch for one way to deal with multivalued fields on distance. Not ideal, but it works efficiently. Bill Bell Sent from mobile

On Mar 16, 2011, at 4:08 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Huh, so lucene is actually doing what has been commonly described as impossible in Solr? But is Solr trunk, as the OP seemed to report, still not aware of this, raising an error on a sort on a multi-valued field instead of just saying, okay, we'll pass it to lucene anyway and go with lucene's approach to sorting on a multi-valued field (that is, apparently, using the largest value)? If so... that kind of sounds like a bug/misfeature, yes, no? Also... lucene is already capable of sorting on a multi-valued field by choosing the largest value (largest vs. smallest is presumably just arbitrary there; there is presumably no performance implication to choosing the smallest instead of the largest, it just chooses the largest, according to Yonik). So... if someone patched lucene so that whether it chose the largest or smallest was a parameter passed in -- probably not a large patch, since lucene, says Yonik, has already been enhanced to always choose the largest -- and then patched Solr to take a param and pass it to Lucene for this purpose, which presumably also wouldn't be a large patch, then we'd have the feature the OP asked for. Based on Yonik's description (assuming I understand correctly and he's correct), it doesn't sound like a lot of code. But it's still beyond my unfamiliar-with-lucene-code-not-so-great-at-java abilities, nor do I have the interest for my own app needs at the moment. But if the OP or someone else has both, it sounds like a plausible feature?
On 3/16/2011 6:00 PM, Yonik Seeley wrote: On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : However, many of our multiValued fields are single valued for the majority : of documents in our index so we may not have noticed the incorrect sorting : behaviors. that would make sense ... if you use a multiValued field as if it were single valued, you would never encounter a problem. if you had *some* multivalued fields your results would be sorted extremely arbitrarily for those docs that did have multiple values, unless you had more distinct values than you had documents -- at which point you would get a hard crash at query time. AFAIK, not any more. Since that behavior was very unreliable, it has been removed, and you can reliably sort by any multi-valued field in lucene (with the sort order being defined by the largest value if there are multiple). -Yonik http://lucidimagination.com
Re: Replication slows down massively during high load
On 3/16/2011 7:56 AM, Vadim Kisselmann wrote: If the load is low, both slaves replicate at around 100MB/s from the master. But when I use Solrmeter (100-400 queries/min) for load tests (over the load balancer), replication slows down to an unacceptable speed, around 100KB/s (at least that's what the replication page on /solr/admin says). snip - Same hardware for all servers: physical machines with quad core CPUs, 24GB RAM (JVM starts up with -XX:+UseConcMarkSweepGC -Xms10G -Xmx10G) - Index size is about 100GB with 40M docs

Primary assumption: you have a 64-bit OS and a 64-bit JVM. It sounds to me like you're I/O bound, because your machine cannot keep enough of your index in RAM. Relative to your 100GB index, you only have a maximum of 14GB of RAM available to the OS disk cache, since Java's heap size is 10GB. How much disk space do all of the index files that end in x take up? I would venture a guess that it's significantly more than 14GB. On Linux, you could do this command to tally it quickly: du -hc *x

If you installed enough RAM so the disk cache could be much larger than the total size of those files ending in x, you'd probably stop having these performance issues. Alternatively, you could take steps to reduce the size of your index, or perhaps add more machines and go distributed. My own index is distributed and replicated. I've got nearly 53 million documents and a total index size of 95GB. This is split into six shards that each are nearly 16GB. Running that du command I gave you above, the total on one shard is 2.5GB, and there is 7GB of RAM available for the OS cache. NB: I could be completely wrong about the source of the problem. Thanks, Shawn
Re: Replication slows down massively during high load
On 3/16/2011 6:09 PM, Shawn Heisey wrote:
du -hc *x

I was looking over the files in an index and I think it needs to include more of the files for a true picture of RAM needs. I get 5.9GB running the following command against a 16GB index. It excludes *.fdt (stored field data) and *.tvf (term vector fields), but includes everything else:

du -hc `ls | egrep -v 'tvf|fdt'`

(Note the quotes around the pattern -- without them, the shell would treat the | as a pipe.)

If any of the experts have a better handle on which files are consulted on virtually all queries, that would help narrow down the OS cache requirements.

Thanks,
Shawn
Re: Faceting help
: I'm not sure if I get what you are trying to achieve. What do you mean
: by constraint?

"constraint" is fairly standard terminology when referring to facets; it's used extensively in our facet docs and is even listed on Solr's glossary page (although not specifically in the context of faceting, since it can be used more broadly than that)...

http://wiki.apache.org/solr/SolrTerminology

In a nutshell:
* A facet is a way of classifying objects.
* A constraint is a viable way of limiting a set of objects.
* Faceted search is a search where feedback on viable constraints (usually in the form of counts) is provided for each facet. (ie: "facet counts" or "constraint counts" ... the terms are both used relatively loosely)

: I'm trying to use facets via widgets within Ajax-Solr. I have tried the
: wiki for general help on configuring facets and constraints and also
: attended the recent Lucidworks webinar on faceted search. Can anyone
: please direct me to some reading on how to formally configure facets for
: searching.

the beauty of faceting in solr is that it doesn't have to be formally configured -- you can specify it all at query time using request params, as long as the data is indexed...

http://wiki.apache.org/solr/SolrFacetingOverview
http://wiki.apache.org/solr/SimpleFacetParameters

: Topics field
: Legislation constraint
: Guidance/Policies constraint
: Customer Service information/complaints procedure constraint
: financial information constraint

if you index a Topics field, and the field contains those values as indexed terms, then you will get those constraints back using facet.field=Topics

-Hoss
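A toy illustration (plain Python, not Solr) of what facet.field=Topics hands back: for each indexed value of the field, the count of documents carrying it -- the "constraint counts" Hoss describes. The documents here are made up.

```python
# Compute facet (constraint) counts for a multiValued "Topics" field.
from collections import Counter

docs = [
    {"id": 1, "Topics": ["Legislation", "financial information"]},
    {"id": 2, "Topics": ["Legislation"]},
    {"id": 3, "Topics": ["Guidance/Policies"]},
]

# One count per distinct indexed term, across all documents.
facet_counts = Counter(t for d in docs for t in d["Topics"])
print(facet_counts["Legislation"])  # 2
```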
Re: Solrj performance bottleneck
Try giving Solr around 1.5GB by setting the Java heap params. Solr is usually CPU bound, so medium or large instances are good.

Bill Bell
Sent from mobile

On Mar 16, 2011, at 10:56 AM, Asharudeen asharud...@gmail.com wrote:

Hi,

Thanks for your info. Currently my index size is around 4GB. Normally in small instances the total available memory is 1.6GB. In my setup, I allocated around 1GB as the heap size for Tomcat, so I believe the remaining 600MB will be used for the OS cache. I think I need to migrate my Solr instance from a small instance to a large one, so that more memory is available for the OS cache.

But initially I suspected that, since I call the SolrJ code from another instance, I needed to increase the memory on the instance where SolrJ runs. You said I need to increase the memory on the Solr instance only. I just wanted to double-check that case, sorry.

Once again, thanks for your replies.

Regards,

On Wed, Mar 16, 2011 at 7:02 PM, Yonik Seeley yo...@lucidimagination.com wrote:

On Wed, Mar 16, 2011 at 7:25 AM, rahul asharud...@gmail.com wrote:
In our setup, we have the Solr index on one machine and the SolrJ client part (Java code) on another. Currently, as you suggest, if the problem is "not enough free RAM for the OS to cache", do I need to increase the RAM on the machine where the SolrJ query code runs, or the RAM on the Solr instance for the OS cache?

That would be RAM for the Solr instance. If there is not enough free memory for the OS to cache, then each document retrieved will be a disk seek + read.

Since both systems are on the local Amazon network (Linux EC2 small instances), I believe the network won't be an issue.

Ah, how big is your index?

Another thing: in your reply you mentioned "client not reading fast enough". Is that related to the network or to SolrJ?

That was a general issue - it *can* be the client, but since you're using SolrJ it would be the network.

-Yonik
http://lucidimagination.com
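The memory arithmetic running through this thread (and Shawn's replication thread above) is simple but easy to get backwards, so here is a back-of-the-envelope sketch using the figures quoted in the emails. This ignores memory used by other processes, so it is an upper bound on the cache budget.

```python
# RAM left for the OS disk cache is roughly total RAM minus the JVM heap.
def os_cache_budget_gb(total_ram_gb, jvm_heap_gb):
    return total_ram_gb - jvm_heap_gb

# EC2 small instance from this thread: 1.6GB total, 1GB Tomcat heap.
small = os_cache_budget_gb(1.6, 1.0)
print(round(small, 1))  # 0.6 -- far smaller than the ~4GB index

# Shawn's replication case: 24GB total, 10GB heap -> at most 14GB of cache
# against a 100GB index.
print(os_cache_budget_gb(24, 10))  # 14
```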
Parent-child options
Hi,

The dreaded parent-child without denormalization question. What are one's options for the following example:

parent: shoes
3 children, each with 2 attributes/fields, color and size:
* color: red, black, orange
* size: 10, 11, 12

The goal is to be able to search for:
1) color:red AND size:10 and get 1 hit for the above
2) color:red AND size:12 and get *no* matches, because there are no red shoes of size 12, only size 10.

What's the best thing to do without denormalizing?

* Are Poly fields designed for this?
* Should one use JSONKeyValueTokenizerFactory from SOLR-1690, as suggested by Ryan in http://search-lucene.com/m/I8VaDeusnJ1 ?
* Should one use SIREn, as suggested by Renaud in http://search-lucene.com/m/qoQWMVk3w91 ?
* Should one use SpanMaskingQuery and SpanNearQuery, as suggested by Hoss in http://search-lucene.com/m/AEvbbeusnJ1 ?
* Should one use JOIN from https://issues.apache.org/jira/browse/SOLR-2272 ?
* Should one use Nested Document query support from LUCENE-2454 (not in trunk, not in Solr)?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
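A toy illustration (plain Python, not Solr) of the false-positive problem Otis is trying to avoid: if the three children are denormalized into two multiValued fields on one parent document, the pairing between color and size is lost, so query 2 wrongly matches.

```python
# Denormalized parent doc: colors and sizes flattened into separate
# multiValued fields, losing which color goes with which size.
flattened = {"color": ["red", "black", "orange"], "size": [10, 11, 12]}

def matches_flattened(doc, color, size):
    # color:X AND size:Y -- each clause checks its own field independently.
    return color in doc["color"] and size in doc["size"]

# Query 2 (color:red AND size:12) should find nothing, since the only red
# shoe is size 10 -- but the flattened doc matches anyway.
print(matches_flattened(flattened, "red", 12))  # True -- a false positive

# Keeping the children as discrete (color, size) pairs preserves the pairing.
children = [("red", 10), ("black", 11), ("orange", 12)]
print(("red", 12) in children)  # False -- correct: no match
```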