Re: Fully automated replica creation in AWS

2015-12-09 Thread Jean-Sebastien Vachon
Not sure if this will meet all your needs, but you can probably do most of the 
work using AWS Lambda.
I haven't used it personally, but it is supposed to run custom code in response 
to events.

I guess you could create a small Java class to do the required work following 
the launch of a new server or the termination of an old one. 

If you can find an event source that provides you with the entry point, you'll 
probably be fine.
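
Roughly, the shape of such a handler (a sketch only; it assumes the
aws-lambda-java-core/events libraries and an SNS notification wired to the
auto-scaling group, and every name below is made up):

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.SNSEvent;

    public class NewSolrNodeHandler implements RequestHandler<SNSEvent, String> {
        @Override
        public String handleRequest(SNSEvent event, Context context) {
            for (SNSEvent.SNSRecord record : event.getRecords()) {
                // react to the notification here, e.g. by calling the Solr
                // Collections API for the under-replicated collection
                context.getLogger().log("Received: " + record.getSNS().getMessage());
            }
            return "done";
        }
    }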

https://aws.amazon.com/lambda/faqs/

Good luck

From: Erick Erickson 
Sent: Wednesday, December 9, 2015 2:19 PM
To: solr-user
Subject: Re: Fully automated replica creation in AWS

Not that I know of. The two systems are somewhat disconnected.
AWS doesn't know that Solr lives on those nodes; it's just spinning
one up, right? Albeit with Solr running.

There's nothing in Solr that auto-detects the existence of a new
Solr node and automagically assigns collections and/or replicas.

How would either system intuit that this new node is replacing
something else and "do the right thing"?

I'll tell you how: by interrogating ZooKeeper, seeing that for some
specific collection shardX has fewer replicas than the other shards, and
issuing the Collections API ADDREPLICA command.
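
For reference, that ADDREPLICA call is a single Collections API request; a
minimal sketch in Java (host, collection and shard names are placeholders,
the parameters are the documented ones):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class AddReplica {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://solr-host:8983/solr/admin/collections"
                    + "?action=ADDREPLICA&collection=myCollection&shard=shardX");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            System.out.println("ADDREPLICA returned HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }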

But now there are _three_ systems that need to be coordinated and
doing the right thing in your situation would be the wrong thing in
another. The last thing many sys ops want is having replicas started
without their knowledge.

And on top of that, I have doubts about the model. Having AWS
elastically spin up a new replica is a heavyweight operation from
Solr's perspective. I mean, this potentially copies a many-GB set of
index files from one place to another, which could take a long time.
Is that really what's desired here?

I have seen some folks spin up/down Solr instances based on a
schedule if they know roughly when the peak load will be, but again
there's nothing built in to handle this.

Best,
Erick

On Wed, Dec 9, 2015 at 10:15 AM, Ugo Matrangolo
 wrote:
> Hi,
>
> I was trying to setup a SolrCloud cluster in AWS backed by an ASG (auto
> scaling group) serving a replicated collection. I have just come across a
> case where one of the Solr nodes became unresponsive, with AWS killing it and
> spinning up a new one.
>
> Unfortunately, this new Solr node did not join as a replica of the existing
> collection requiring human intervention to configure it as a new replica.
>
> I was wondering if there is something around that will make this process
> fully automated by detecting that a new node just joined the cluster and
> instructing it (e.g. via Collections API) to join as a replica of a given
> collection.
>
> Best
> Ugo


Re: are there any SolrCloud supervisors?

2015-10-13 Thread Jean-Sebastien Vachon
I would be interested in seeing it in action. Do you have any documentation 
available on what it does and how?

Thanks


From: r b 
Sent: Friday, October 2, 2015 3:09 PM
To: solr-user@lucene.apache.org
Subject: are there any SolrCloud supervisors?

I've been working on something that just monitors ZooKeeper to add and
remove nodes from collections. The use case: I put SolrCloud in
an autoscaling group on EC2, and as instances go up and down, I need
them added to the collection. It's something I've built for work and
could clean up to share on GitHub if there is enough interest.

I asked in IRC about a SolrCloud supervisor utility but wanted to
extend that question to this list. Are there any more "full featured"
supervisors out there?


-renning


Re: JSON Facet & Analytics API in Solr 5.1

2015-04-17 Thread Jean-Sebastien Vachon
I prefer the second way. I find it more readable and shorter.

Thanks for making Solr even better ;)


From: Yonik Seeley ysee...@gmail.com
Sent: Friday, April 17, 2015 12:20 PM
To: solr-user@lucene.apache.org
Subject: Re: JSON Facet & Analytics API in Solr 5.1

Does anyone have any thoughts on the current general structure of JSON facets?
The current general form of a facet command is:

<facet_name> : { <facet_type> : <facet_args> }

For example:

top_authors : { terms : {
  field : author,
  limit : 5,
}}

One alternative I considered in the past is having the type in the args:

top_authors : {
  type : terms,
  field : author,
  limit : 5
}

It's a flatter structure... probably better in some ways, but worse in
other ways.
Thoughts / preferences?

-Yonik


On Tue, Apr 14, 2015 at 4:30 PM, Yonik Seeley ysee...@gmail.com wrote:
 Folks, there's a new JSON Facet API in the just released Solr 5.1
 (actually, a new facet module under the covers too).

 It's marked as experimental so we have time to change the API based on
 your feedback.  So let us know what you like, what you would change,
 what's missing, or any other ideas you may have!

 I've just started the documentation for the reference guide (on our
 confluence wiki), so for now the best doc is on my blog:

 http://yonik.com/json-facet-api/
 http://yonik.com/solr-facet-functions/
 http://yonik.com/solr-subfacets/

 I'll also be hanging out more on the #solr-dev IRC channel on freenode
 if you want to hit me up there about any development ideas.

 -Yonik


Re: indexing db records via SolrJ

2015-03-16 Thread Jean-Sebastien Vachon
Do you have any references to such integrations (Solr + Storm)?

Thanks


From: mike st. john mstj...@gmail.com
Sent: Monday, March 16, 2015 2:39 PM
To: solr-user@lucene.apache.org
Subject: Re: indexing db records via SolrJ

Take a look at some of the integrations people are using with Apache Storm;
we do something similar on a larger scale, having created a pgsql spout
and a Solr indexing bolt.


-msj

On Mon, Mar 16, 2015 at 11:08 AM, Hal Roberts 
hrobe...@cyber.law.harvard.edu wrote:

 We import anywhere from five to fifty million small documents a day from a
 Postgres database.  I wrestled with getting the DIH stuff to work for us for
 about a year and was much happier when I ditched that approach and switched
 to writing the few hundred lines of relatively simple code to directly handle
 the logic of what gets updated and how it gets queried from
 Postgres ourselves.

 The DIH stuff is great for lots of cases, but if you are getting to the
 point of trying to hack its undocumented internals, I suspect you are
 better off spending a day or two of your time just writing all of the
 update logic yourself.

 We found a relatively simple combination of Postgres triggers, export to
 CSV based on those triggers, and then just calling /update/csv to work best
 for us.

 -hal


 On 3/16/15 9:59 AM, Shawn Heisey wrote:

 On 3/16/2015 7:15 AM, sreedevi s wrote:

  I had checked this post. I don't know whether this is possible, but my query
  is whether I can use the DIH configuration for indexing via SolrJ.


 You can use SolrJ for accessing DIH.  I have code that does this, but
 only for full index rebuilds.

 It won't be particularly obvious how to do it.  Writing code that can
  interpret DIH status and know when it finishes, succeeds, or fails is
 very tricky because DIH only uses human-readable status info, not
 machine-readable, and the info is not very consistent.

 I can't just share my code, because it's extremely convoluted ... but
 the general gist is to create a SolrQuery object, use setRequestHandler
 to set the handler to /dataimport or whatever your DIH handler is, and
  set the other parameters on the request, like command to "full-import",
  and so on.
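
 A bare-bones version of that gist (a sketch; the core URL and handler path
 are whatever your setup uses):

     import org.apache.solr.client.solrj.SolrQuery;
     import org.apache.solr.client.solrj.SolrServer;
     import org.apache.solr.client.solrj.impl.HttpSolrServer;
     import org.apache.solr.client.solrj.response.QueryResponse;

     public class TriggerDih {
         public static void main(String[] args) throws Exception {
             SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");
             SolrQuery query = new SolrQuery();
             query.setRequestHandler("/dataimport");  // your DIH handler path
             query.set("command", "full-import");     // kicks off a full import
             QueryResponse rsp = solr.query(query);
             System.out.println(rsp);  // status must be parsed out of this by hand
             solr.shutdown();
         }
     }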

 Thanks,
 Shawn


 --
 Hal Roberts
 Fellow
  Berkman Center for Internet & Society
 Harvard University



Re: Delete By query on a multi-value field

2015-02-03 Thread Jean-Sebastien Vachon
Hi Lokesh, 

thanks for the information. 

I forgot to mention that the system I am working on is still using Solr 3.5, so I 
will probably have to reindex the whole set of documents.
Unless someone knows how to get around this...



From: Lokesh Chhaparwal xyzlu...@gmail.com
Sent: Monday, February 2, 2015 11:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Delete By query on a multi-value field

Hi Jean,

Please see the issues
https://issues.apache.org/jira/browse/SOLR-3862
https://issues.apache.org/jira/browse/SOLR-5992

Both of them are resolved. The *remove* clause (atomic update) was added
in the 4.9.0 release. Haven't checked it though.
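
If it behaves as documented, the SolrJ side would look roughly like this (an
untested sketch; "client" stands for an existing SolrServer instance, and the
id/value match the example quoted below; needs java.util.HashMap/Map and
org.apache.solr.common.SolrInputDocument):

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "A");
    Map<String, Object> op = new HashMap<String, Object>();
    op.put("remove", "1");     // drop the value "1" from the multivalued field
    doc.addField("XYZ", op);
    client.add(doc);
    client.commit();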

Thanks,
Lokesh


On Tue, Feb 3, 2015 at 7:26 AM, Jean-Sebastien Vachon 
jean-sebastien.vac...@wantedanalytics.com wrote:

 Hi All,


 Is there a way to delete a value from a Multi-value field without
 reindexing anything?


>  Let's say I have three documents A, B and C with field XYZ set to "1,2,3",
>  "2,3,4" and "1". I'd like to remove anything that has the value '1' in the
>  field XYZ. That is, I want to remove the value '1' from the field, deleting
>  the document only if '1' is the only value present.


>  Deleting documents such as C (single value) is easy with a delete-by-query
>  through the update handler, but what about document A?



 Thanks for any hint



Delete By query on a multi-value field

2015-02-02 Thread Jean-Sebastien Vachon
Hi All,


Is there a way to delete a value from a Multi-value field without reindexing 
anything?


Let's say I have three documents A, B and C with field XYZ set to "1,2,3", 
"2,3,4" and "1". I'd like to remove anything that has the value '1' in the 
field XYZ. That is, I want to remove the value '1' from the field, deleting the 
document only if '1' is the only value present.


Deleting documents such as C (single value) is easy with a delete-by-query 
through the update handler, but what about document A?



Thanks for any hint


RE: ANN: Solr Next

2014-06-10 Thread Jean-Sebastien Vachon
Hi Yonik,

Very impressive results. Looking forward to using this on our systems. Any idea 
what the plan is for this feature? Will it make its way into Solr 4.9? Or do we 
have to switch to Heliosearch to be able to use it?

Thanks

 -Original Message-
 From: Yonik Seeley [mailto:ysee...@gmail.com]
 Sent: June-09-14 10:50 AM
 To: solr-user@lucene.apache.org
 Subject: Re: ANN: Solr Next
 
 On Tue, Jan 7, 2014 at 1:53 PM, Yonik Seeley ysee...@gmail.com wrote:
 [...]
  Next major feature: Native Code Optimizations.
  In addition to moving more large data structures off-heap (like
  UnInvertedField?), I am planning to implement native code
  optimizations for certain hotspots.  Native code faceting would be an
  obvious first choice since it can often be a CPU bottleneck.
 
 It's in!  Abbreviated report: 2x performance increase over stock Solr faceting
 (which is already fast!) http://heliosearch.org/native-code-faceting/
 
 -Yonik
 http://heliosearch.org -- making solr shine
 
  Project resources:
 
  https://github.com/Heliosearch/heliosearch
 
  https://groups.google.com/forum/#!forum/heliosearch
  https://groups.google.com/forum/#!forum/heliosearch-dev
 
  Freenode IRC: #heliosearch #heliosearch-dev
 
  -Yonik
 


RE: Strange Behavior with Solr in Tomcat.

2014-06-06 Thread Jean-Sebastien Vachon
I would try a thread dump and check the output to see what's going on.
You could also strace the process if you're running on Unix, or change the log
level in Solr to get more information logged.
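
(The usual tool for the thread dump is jstack <pid>; from inside a JVM the same
information is available through the standard JMX API -- a minimal sketch:)

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;

    public class DumpThreads {
        public static void main(String[] args) {
            // dump all live threads of this JVM, including lock information
            for (ThreadInfo info :
                    ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
                System.out.print(info);
            }
        }
    }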

 -Original Message-
 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: June-06-14 2:33 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Strange Behavior with Solr in Tomcat.
 
 Anyone folks?
 
 
 On Wed, Jun 4, 2014 at 10:25 AM, S.L simpleliving...@gmail.com wrote:
 
   Hi Folks,
 
  I recently started using the spellchecker in my solrconfig.xml. I am
  able to build up an index in Solr.
 
  But, if I ever shut down Tomcat, I am not able to restart it. The server
  never spits out the server startup time in seconds in the logs, nor
  does it print any error messages in the catalina.out file.
 
  The only way for me to get around this is by deleting the data directory
  of the index and then starting the server; obviously this makes me lose my
 index.
 
  Just wondering if anyone faced a similar issue and if they were able
  to solve this.
 
  Thanks.
 
 
 


RE: Strange behaviour when tuning the caches

2014-06-03 Thread Jean-Sebastien Vachon
Hi Otis,

We saw some improvement when increasing the size of the caches. Since then, we 
followed Shawn's advice on the filterCache and gave some additional RAM to the 
JVM in order to reduce GC. The performance is very good right now, but we are 
still experiencing some instability, though not at the same level as before.
With our current settings the number of evictions is actually very low, so we 
might be able to shrink some caches to free up some additional memory for the 
JVM to use.

As for the queries, it is a set of 5 million queries taken from our logs, so 
they vary a lot. All I can say is that all queries involve either 
grouping/field collapsing and/or radius search around a point. Our largest 
customer is using a set of 8-10 filters that are translated as fq parameters. 
The collection contains around 13 million documents distributed over 5 shards 
with 2 replicas. The second collection has the same configuration and is used 
for indexing or as a fail-over index in case the first one fails.

We'll keep making adjustments today, but we are pretty close to having something 
that performs well while remaining stable.

Thanks all for your help.



 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
 Sent: June-03-14 12:17 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Strange behaviour when tuning the caches
 
 Hi Jean-Sebastien,
 
 One thing you didn't mention is whether, as you are increasing (I assume)
 cache sizes, you actually see performance improve?  If not, then maybe there
 is no value in increasing cache sizes.
 
 I assume you changed only one cache at a time? Were you able to get any
 one of them to the point where there were no evictions without things
 breaking?
 
 What are your queries like, can you share a few examples?
 
 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
 Solr & Elasticsearch Support * http://sematext.com/
 
 
 On Mon, Jun 2, 2014 at 11:09 AM, Jean-Sebastien Vachon  jean-
 sebastien.vac...@wantedanalytics.com wrote:
 
  Thanks for your quick response.
 
  Our JVM is configured with a heap of 8GB, so we are pretty close to
  the optimal configuration you are mentioning. The only other
  programs running are ZooKeeper (which has its own storage device) and a
  proprietary API (with a heap of 1GB) we have on top of Solr to serve our
 customers' requests.
 
  I will look into the filterCache to see if we can better use it.
 
  Thanks for your help
 
   -Original Message-
   From: Shawn Heisey [mailto:s...@elyograg.org]
   Sent: June-02-14 10:48 AM
   To: solr-user@lucene.apache.org
   Subject: Re: Strange behaviour when tuning the caches
  
   On 6/2/2014 8:24 AM, Jean-Sebastien Vachon wrote:
We have yet to determine where the exact breaking point is.
   
The two patterns we are seeing are:
   
  -  less cache (around 20-30% hit ratio), poor performance but
overall good stability
  
   When caches are too small, a low hit ratio is expected.  Increasing
   them
  is a
   good idea, but only increase them a little bit at a time.  The
  filterCache in
   particular should not be increased dramatically, especially the
   autowarmCount value.  Filters can take a very long time to execute,
   so a
  high
   autowarmCount can result in commits taking forever.
  
   Each filter entry can take up a lot of heap memory -- in terms of
   bytes,
  it is
   the number of documents in the core divided by 8.  This means that
   if the core has 10 million documents, each filter entry (for JUST
   that
   core) will take over a megabyte of RAM.
  
  -  more cache (over 90% hit ratio), improved performance but
  almost no stability. In that case, we start seeing messages such
  as "No shards hosting shard X" or "cancelElection did not find
  election node to remove"
  
   This would not be a direct result of increasing the cache size,
   unless
  perhaps
   you've increased them so they are *REALLY* big and you're running
   out of RAM for the heap or OS disk cache.
  
Anyone, has any advice on what could cause this? I am beginning to
suspect the JVM version, is there any minimal requirements
regarding the JVM?
  
   Oracle Java 7 is recommended for all releases, and required for Solr
  4.8.  You
   just need to stay away from 7u40, 7u45, and 7u51 because of bugs in
   Java itself.  Right now, the latest release is recommended, which is 7u60.
   The
   7u21 release that you are running should be perfectly fine.
  
   With six 9.4GB cores per node, you'll achieve the best performance
   if you have about 60GB of RAM left over for the OS disk cache to use
   -- the
  size of
   your index data on disk.  You did mention that you have 92GB of RAM
   per node, but you have not said how big your Java heap is, or
   whether there
  is
   other software on the machine that may be eating up RAM for its heap
   or data.
  
   http://wiki.apache.org/solr/SolrPerformanceProblems
  
   Thanks,
   Shawn

RE: Strange behaviour when tuning the caches

2014-06-03 Thread Jean-Sebastien Vachon
Yes we are already using it.

 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
 Sent: June-03-14 11:41 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Strange behaviour when tuning the caches
 
 Hi,
 
 Have you seen https://wiki.apache.org/solr/CollapsingQParserPlugin ?  May
 help with the field collapsing queries.
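 (It is applied as a filter query, e.g. fq={!collapse field=your_group_field},
 in place of group=true; the field name here is just a placeholder.)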
 
 Otis
 --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr & Elasticsearch Support * http://sematext.com/
 
 
 On Tue, Jun 3, 2014 at 8:41 AM, Jean-Sebastien Vachon  jean-
 sebastien.vac...@wantedanalytics.com wrote:
 
  Hi Otis,
 
  We saw some improvement when increasing the size of the caches. Since
   then, we followed Shawn's advice on the filterCache and gave some
   additional RAM to the JVM in order to reduce GC. The performance is
   very good right now, but we are still experiencing some instability,
   though not at the same level as before.
   With our current settings the number of evictions is actually very low,
   so we might be able to shrink some caches to free up some additional
   memory for the JVM to use.
 
  As for the queries, it is a set of 5 million queries taken from our
  logs so they vary a lot. All I can say is that all queries involve
  either grouping/field collapsing and/or radius search around a point.
  Our largest customer is using a set of 8-10 filters that are
  translated as fq parameters. The collection contains around 13 million
   documents distributed over 5 shards with 2 replicas. The second
   collection has the same configuration and is used for indexing or as a
   fail-over index in case the first one fails.
 
   We'll keep making adjustments today, but we are pretty close to having
   something that performs well while remaining stable.
 
  Thanks all for your help.
 
 
 
   -Original Message-
   From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
   Sent: June-03-14 12:17 AM
   To: solr-user@lucene.apache.org
   Subject: Re: Strange behaviour when tuning the caches
  
   Hi Jean-Sebastien,
  
    One thing you didn't mention is whether, as you are increasing (I
    assume) cache sizes, you actually see performance improve?  If not,
    then maybe there
    is no value in increasing cache sizes.
  
   I assume you changed only one cache at a time? Were you able to get
   any one of them to the point where there were no evictions without
   things breaking?
  
   What are your queries like, can you share a few examples?
  
   Otis
   --
    Performance Monitoring * Log Analytics * Search Analytics
    Solr & Elasticsearch Support * http://sematext.com/
  
  
   On Mon, Jun 2, 2014 at 11:09 AM, Jean-Sebastien Vachon  jean-
   sebastien.vac...@wantedanalytics.com wrote:
  
Thanks for your quick response.
   
 Our JVM is configured with a heap of 8GB, so we are pretty close
 to the optimal configuration you are mentioning. The only other
 programs running are ZooKeeper (which has its own storage device)
 and a proprietary API (with a heap of 1GB) we have on top of Solr
 to serve our customers' requests.
   
I will look into the filterCache to see if we can better use it.
   
Thanks for your help
   
 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: June-02-14 10:48 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Strange behaviour when tuning the caches

 On 6/2/2014 8:24 AM, Jean-Sebastien Vachon wrote:
  We have yet to determine where the exact breaking point is.
 
  The two patterns we are seeing are:
 
   -  less cache (around 20-30% hit ratio), poor performance
  but
  overall good stability

 When caches are too small, a low hit ratio is expected.
 Increasing them
is a
 good idea, but only increase them a little bit at a time.  The
filterCache in
 particular should not be increased dramatically, especially the
 autowarmCount value.  Filters can take a very long time to
 execute, so a
high
 autowarmCount can result in commits taking forever.

 Each filter entry can take up a lot of heap memory -- in terms
 of bytes,
it is
 the number of documents in the core divided by 8.  This means
 that if the core has 10 million documents, each filter entry
 (for JUST that
 core) will take over a megabyte of RAM.

   -  more cache (over 90% hit ratio), improved performance
   but
   almost no stability. In that case, we start seeing messages
   such as "No shards hosting shard X" or "cancelElection did not
   find election node to remove"

 This would not be a direct result of increasing the cache size,
 unless
perhaps
 you've increased them so they are *REALLY* big and you're
 running out of RAM for the heap or OS disk cache.

   Does anyone have any advice on what could cause this? I am
   beginning to suspect the JVM version; are there any minimal
   requirements regarding the JVM

Strange behaviour when tuning the caches

2014-06-02 Thread Jean-Sebastien Vachon
Hi All,

We have a 5-node setup running Solr 4.8.1 and we are trying to get the most 
out of it by tuning Solr's caches.
Following is the output of the script version.sh provided with Tomcat:

Server version: Apache Tomcat/7.0.39
Server built:   Mar 22 2013 12:37:24
Server number:  7.0.39.0
OS Name:Linux
OS Version: 3.0.76-0.11-default
Architecture:   amd64
JVM Version:1.7.0_21-b11
JVM Vendor: Oracle Corporation

To measure the performance, we are running a simple set of queries using JMeter 
with 25 threads from another host (not a member of our cloud).

We tried to tune the different caches (mostly the documentCache, filterCache 
and queryResultCache) to reduce the number of evictions, but the cloud
became very unstable at some point. Each server has 92GB of RAM and hosts 2 
collections (1 shard and two replicas) for a total of 6 cores per node.
Each core is around 9.4GB in size according to the Core admin panel.

We have yet to determine where the exact breaking point is.

The two patterns we are seeing are:

-  less cache (around 20-30% hit ratio): poor performance but overall 
good stability

-  more cache (over 90% hit ratio): improved performance but almost no 
stability.
In that case, we start seeing messages such as "No shards 
hosting shard X" or "cancelElection did not find election node to remove".

Does anyone have any advice on what could cause this? I am beginning to suspect 
the JVM version; are there any minimal requirements regarding the JVM?

Thanks


RE: Strange behaviour when tuning the caches

2014-06-02 Thread Jean-Sebastien Vachon
Thanks for your quick response.

Our JVM is configured with a heap of 8GB, so we are pretty close to the 
optimal configuration you are mentioning. The only other programs running are 
ZooKeeper (which has its own storage device) and a proprietary API (with a heap 
of 1GB) we have on top of Solr to serve our customers' requests. 

I will look into the filterCache to see if we can better use it.

Thanks for your help

 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: June-02-14 10:48 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Strange behaviour when tuning the caches
 
 On 6/2/2014 8:24 AM, Jean-Sebastien Vachon wrote:
  We have yet to determine where the exact breaking point is.
 
  The two patterns we are seeing are:
 
  -  less cache (around 20-30% hit ratio), poor performance but
  overall good stability
 
 When caches are too small, a low hit ratio is expected.  Increasing them is a
 good idea, but only increase them a little bit at a time.  The filterCache in
 particular should not be increased dramatically, especially the
 autowarmCount value.  Filters can take a very long time to execute, so a high
 autowarmCount can result in commits taking forever.
 
 Each filter entry can take up a lot of heap memory -- in terms of bytes, it is
 the number of documents in the core divided by 8.  This means that if the
 core has 10 million documents, each filter entry (for JUST that
 core) will take over a megabyte of RAM.
 
  -  more cache (over 90% hit ratio), improved performance but
  almost no stability. In that case, we start seeing messages such as
  "No shards hosting shard X" or "cancelElection did not find election
  node to remove"
 
 This would not be a direct result of increasing the cache size, unless perhaps
 you've increased them so they are *REALLY* big and you're running out of
 RAM for the heap or OS disk cache.
 
  Anyone, has any advice on what could cause this? I am beginning to
  suspect the JVM version, is there any minimal requirements regarding
  the JVM?
 
 Oracle Java 7 is recommended for all releases, and required for Solr 4.8.  You
 just need to stay away from 7u40, 7u45, and 7u51 because of bugs in Java
 itself.  Right now, the latest release is recommended, which is 7u60.  The
 7u21 release that you are running should be perfectly fine.
 
 With six 9.4GB cores per node, you'll achieve the best performance if you
 have about 60GB of RAM left over for the OS disk cache to use -- the size of
 your index data on disk.  You did mention that you have 92GB of RAM per
 node, but you have not said how big your Java heap is, or whether there is
 other software on the machine that may be eating up RAM for its heap or
 data.
 
 http://wiki.apache.org/solr/SolrPerformanceProblems
 
 Thanks,
 Shawn
 


RE: Question regarding the latest version of Heliosearch

2014-05-17 Thread Jean-Sebastien Vachon
Thanks for the information Yonik. 

 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: May-16-14 8:52 PM
 To: solr-user@lucene.apache.org
  Subject: Re: Question regarding the latest version of Heliosearch
 
 On Thu, May 15, 2014 at 3:44 PM, Jean-Sebastien Vachon jean-
 sebastien.vac...@wantedanalytics.com wrote:
   I spent some time today playing around with subfacets and facet functions
  now available in Heliosearch 0.05 and I have some concerns... They look
  very promising.
 
 Thanks, glad for the feedback!
 
 [...]
  the response looks good except for one little thing... the mincount is not
 respected whenever I specify the facet.stat parameter. Removing it will
 cause the mincount to be respected but then I need this parameter.
 
 Right, the mincount parameter is not yet implemented.   Hopefully soon!
 
  {
    "val":1133,
    "unique(job_id)":0,   <== what is this?
    "count":0},
   Many zero entries following...
 
  I was wondering where the extra entries were coming from... the
  position_id = 1133 above is not even a match for my query (its title is
  "Audit Consultant"). I've also noticed a similar behaviour when using subfacets. It
 looks like the number of items returned always matches the facet.limit
 parameter.
  If not enough values are present for a given entry then the bucket is filled
 with documents not matching the original query.
 
 Right... straight Solr faceting will do this too (unless you have a
  mincount > 0).  We're just looking at terms in the field and we don't
 have enough context to know if some 0's make more sense than others to
 return.
 
 -Yonik
 http://heliosearch.org - facet functions, subfacets, off-heap
  filters & fieldcache
 


Question regarding the latest version of Heliosearch

2014-05-16 Thread Jean-Sebastien Vachon
Hi All,

I spent some time today playing around with subfacets and facet functions now 
available in Heliosearch 0.05 and I have some concerns... They look very 
promising.

I indexed 10,000 documents and built some queries to look at each feature, and 
found some weird behaviour that I could not explain.

The first query I made was to find all documents having the word "java" in 
their title and then compute a facet on the field position_id with stats about 
the field job_id. Basically, I want the number of unique job_ids for each 
position_id across all matching documents.

http://localhost:8983/solr/current/select?q=title:java&facet=on&facet.field=position_id&facet.stat=unique(job_id)&rows=1&facet.limit=10&facet.mincount=1&wt=json&indent=on&fl=job_id,position_id,super_alias_id

the response looks good except for one little thing... the mincount is not 
respected whenever I specify the facet.stat parameter. Removing it will cause 
the mincount to be respected but then I need this parameter.

Without the parameter the facet looks like this:
"facet_counts":{
  "facet_queries":{},
  "facet_fields":{
    "position_id":[
      265151,5,
      927284,1,
      1662380,1,
      2625553,1,
      2862455,1,
      4128904,1,
      4253203,1]},   <=== accounted for all 11 documents

And now when adding the parameter:


"facets":{
  "position_id":{
    "stats":{
      "unique(job_id)":11,   <== again, 11 documents, which is good
      "count":11},
    "buckets":[{
        "val":265151,
        "unique(job_id)":5,
        "count":5},
      {
        "val":927284,
        "unique(job_id)":1,
        "count":1},
      {
        "val":1662380,
        "unique(job_id)":1,
        "count":1},
      {
        "val":2625553,
        "unique(job_id)":1,
        "count":1},
      {
        "val":2862455,
        "unique(job_id)":1,
        "count":1},
      {
        "val":4128904,
        "unique(job_id)":1,
        "count":1},
      {
        "val":4253203,
        "unique(job_id)":1,
        "count":1},
      {
        "val":1133,
        "unique(job_id)":0,   <== what is this?
        "count":0},
     ... Many zero entries following ...

I was wondering where the extra entries were coming from... the position_id = 
1133 above is not even a match for my query (its title is "Audit Consultant").
I've also noticed a similar behaviour when using subfacets. It looks like the 
number of items returned always matches the facet.limit parameter.
If not enough values are present for a given entry, then the bucket is filled 
with documents not matching the original query.

Am I doing something wrong?


RE: Transformation on a numeric field

2014-04-16 Thread Jean-Sebastien Vachon
Thanks for the information. I will look into this, but I'm curious to know why 
something this basic requires an external script... 

Does anyone know why we can't have an analysis chain on a numeric field? It looks 
to me like it would be very useful to be able to manipulate/transform a value 
without an external resource.
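
(If the documents are produced by custom indexing code anyway, one workaround
that needs no script at all is to compute the second value client-side before
sending the document -- a sketch with made-up values, assuming a SolrJ client
named "solr" and the field names from this thread:)

    SolrInputDocument doc = new SolrInputDocument();
    int salary = 52500;                               // value from the source record
    doc.addField("salary", salary);
    doc.addField("truncated_salary", salary / 1000);  // integer division drops the last three digits
    solr.add(doc);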

Thanks

 -Original Message-
 From: Jack Krupansky [mailto:j...@basetechnology.com]
 Sent: April-15-14 4:36 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Transformation on a numeric field
 
 You can use an update processor. The stateless script update processor will
 let you write arbitrary JavaScript code, which can do this calculation.
 
  You should be able to figure it out from the wiki:
 http://wiki.apache.org/solr/ScriptUpdateProcessor
 
 My e-book has plenty of script examples for this processor as well.
 
 We could also write a generic script that takes a source and destination field
  name and then does a specified operation on it, like add an offset or multiply
 by a scale factor.
 
 -- Jack Krupansky
 
 -Original Message-
 From: Jean-Sebastien Vachon
 Sent: Tuesday, April 15, 2014 3:57 PM
 To: 'solr-user@lucene.apache.org'
 Subject: Transformation on a numeric field
 
 Hi All,
 
  I am looking for a way to index a numeric field and its value divided by
  1,000 into another numeric field.
 I thought about using a CopyField with a PatternReplaceFilterFactory to keep
 only the first few digits (cutting the last three).
 
 Solr complains that I can not have an analysis chain on a numeric field:
 
  Core:
  org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
  Plugin init failure for [schema.xml] fieldType "truncated_salary":
  FieldType: TrieIntField (truncated_salary) does not support specifying an
  analyzer. Schema file is /data/solr/solr-no-cloud/Core1/schema.xml
 
 
  Is there a way to accomplish this?
 
 Thanks
 
 


Transformation on a numeric field

2014-04-15 Thread Jean-Sebastien Vachon
Hi All,

I am looking for a way to index a numeric field and its value divided by 1,000 
into another numeric field.
I thought about using a CopyField with a PatternReplaceFilterFactory to keep 
only the first few digits (cutting the last three).

Solr complains that I can not have an analysis chain on a numeric field:

Core: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Plugin init failure for [schema.xml] fieldType "truncated_salary": FieldType: 
TrieIntField (truncated_salary) does not support specifying an analyzer. Schema 
file is /data/solr/solr-no-cloud/Core1/schema.xml


Is there a way to accomplish this?

Thanks


RE: Were changes made to faceting on multivalued fields recently?

2014-04-11 Thread Jean-Sebastien Vachon
Thanks to both of you. I finally found the issue and you were right (again) ;)

The problem was not coming from the full indexation code containing the SQL 
replace statement, but from another process whose job is to keep our index 
up to date. This process had no idea that commas were to be replaced by spaces 
for some fields (and it should not know about this either).

I changed the tokenizer used for the field to the following and everything is 
fine now:
<tokenizer class="solr.PatternTokenizerFactory" pattern=","/>

Thanks for your help

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: April-10-14 1:54 PM
 To: solr-user@lucene.apache.org
  Subject: Re: Were changes made to faceting on multivalued fields recently?
 
 bq: The SQL query contains a Replace statement that does this
 
 Well, I suspect that's where the issue is. The facet values being reported
 include:
  <int name="4,1">134826</int>
 which indicates that the incoming text to Solr still has the commas.
 Solr is seeing the commas and all.
 
 You can cure this by using PatternReplaceCharFilterFactory and doing the
 substitution at index time if you want to.
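  (That would be something like <charFilter class="solr.PatternReplaceCharFilterFactory"
  pattern="," replacement=" "/> placed before the tokenizer in the index-time
  analyzer.)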
 
 That doesn't clarify why the behavior has changed though, but my
 supposition is that it has nothing to do with Solr, and something about your
 SQL statement is different.
 
 Best,
 Erick
 
 On Thu, Apr 10, 2014 at 9:33 AM, Jean-Sebastien Vachon jean-
 sebastien.vac...@wantedanalytics.com wrote:
  The SQL query contains a Replace statement that does this
 
  -Original Message-
  From: Shawn Heisey [mailto:s...@elyograg.org]
  Sent: April-10-14 11:30 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Were changes made to faceting on multivalued fields
 recently?
 
  On 4/10/2014 9:14 AM, Jean-Sebastien Vachon wrote:
   Here are the field definitions for both our old and new index... as
   you can
  see that are identical. We've been using this chain and field type
  starting with Solr 1.4 and never had any problem. As for the
  documents, both indexes are using the same data source. They could be
  slightly out of sync from time to time but we tend to index them on a
  daily basis. Both indexes are also using the same code (indexing through
 SolrJ) to index their content.
  
   The source is a column in MySQL that contains entries such as "4,1"
   that get stored in a multivalued field after replacing commas by
   spaces
  
   OLD (4.6.1):
  <fieldType name="text_ws" class="solr.TextField"
  positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <field name="ad_job_type_id" type="text_ws" indexed="true"
  stored="true" required="false" multiValued="true"/>
 
  Just so you know, there's nothing here that would require the field
  to be multivalued.  WhitespaceTokenizerFactory does not create
  multiple field values, it creates multiple terms.  If you are
  actually inserting multiple values for the field in SolrJ, then you would
 need a multivalued field.
 
  What is replacing the commas with spaces?  I don't see anything here
  that would do that.  It sounds like that part of your indexing is not
 working.
 
  Thanks,
  Shawn
 
 


RE: Were changes made to faceting on multivalued fields recently?

2014-04-10 Thread Jean-Sebastien Vachon
Here are the field definitions for both our old and new index... as you can see, 
they are identical. We've been using this chain and field type starting with 
Solr 1.4 and never had any problem. As for the documents, both indexes are 
using the same data source. They could be slightly out of sync from time to 
time, but we tend to index them on a daily basis. Both indexes are also using 
the same code (indexing through SolrJ) to index their content.

The source is a column in MySQL that contains entries such as "4,1" that get 
stored in a multivalued field after replacing commas by spaces.

OLD (4.6.1):
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="ad_job_type_id" type="text_ws" indexed="true" stored="true"
required="false" multiValued="true"/>

NEW (4.7.1):

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="ad_job_type_id" type="text_ws" indexed="true" stored="true"
required="false" multiValued="true"/>

It looks like the /analysis/field handler is not active in our current setup. I 
will look into this and perform additional checks later, as we are currently 
doing a full reindex of our DB.

Thanks for your time

 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: April-09-14 5:23 PM
 To: solr-user@lucene.apache.org
  Subject: Re: Were changes made to faceting on multivalued fields recently?
 
 On 4/9/2014 2:15 PM, Erick Erickson wrote:
  Right, but the response in the doc when you make a request is almost,
  but not quite totally, unrelated to how facet values are tallied. It's
  all about what tokens are actually in your index, which you can see in
  the schema browser...
 
 Supplement to what Erick has told you:
 
 SOLR-5512 seems to be related to facets using docValues. The commit for
  that issue looks like it only touches on that specifically. If you do not have
 (and never have had) docValues on this field, then SOLR-5512 should not
 apply.
 
 I am reasonably sure that for facets on fields with docValues, your facets
 would reflect the *stored* information, not the indexed information.
 
 Finally, I don't think that docValues work on fieldtypes whose class is
 solr.TextField, which is the only class that can have an analysis chain that
 would turn 4 5 1 into three separate tokens.  The response that you shared
 where the value is 4 5 1 looks like there is only one value in the field -- 
 so
 for that document, it is effectively the same as one that is single-valued.
 
 Bottom line: It looks like either your analysis chain is working differently 
 in
 the newer version, or you have documents in your newer index that are not
 in the older one.  Can you share the field and fieldType definitions from both
 versions?  Did your luceneMatchVersion change with the upgrade?  If you are
 using DIH to populate your index, can you also share your DIH config?
 
 Thanks,
 Shawn
 
 


RE: Were changes made to faceting on multivalued fields recently?

2014-04-10 Thread Jean-Sebastien Vachon
The SQL query contains a Replace statement that does this

 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: April-10-14 11:30 AM
 To: solr-user@lucene.apache.org
  Subject: Re: Were changes made to faceting on multivalued fields recently?
 
 On 4/10/2014 9:14 AM, Jean-Sebastien Vachon wrote:
  Here are the field definitions for both our old and new index... as you can
  see, they are identical. We've been using this chain and field type starting 
 with
 Solr 1.4 and never had any problem. As for the documents, both indexes are
 using the same data source. They could be slightly out of sync from time to
 time but we tend to index them on a daily basis. Both indexes are also using
 the same code (indexing through SolrJ) to index their content.
 
   The source is a column in MySQL that contains entries such as "4,1"
   that get stored in a multivalued field after replacing commas by
   spaces
 
  OLD (4.6.1):
  <fieldType name="text_ws" class="solr.TextField"
  positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <field name="ad_job_type_id" type="text_ws" indexed="true"
  stored="true" required="false" multiValued="true"/>
 
 Just so you know, there's nothing here that would require the field to be
 multivalued.  WhitespaceTokenizerFactory does not create multiple field
 values, it creates multiple terms.  If you are actually inserting multiple 
 values
 for the field in SolrJ, then you would need a multivalued field.
 
 What is replacing the commas with spaces?  I don't see anything here that
 would do that.  It sounds like that part of your indexing is not working.
 
 Thanks,
 Shawn
 
 


Were changes made to faceting on multivalued fields recently?

2014-04-09 Thread Jean-Sebastien Vachon
Hi All,

We just discovered that the response from Solr (4.7.1) when faceting on one of 
our multi-valued fields has changed considerably.

In the past (4.6.1 and prior versions as well) we used to have something like 
this: (there are 7 possible values for this attribute)

<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="ad_job_type_id">
<int name="1">11454652</int>
<int name="4">11387070</int>
<int name="5">2095603</int>
<int name="3">809992</int>
<int name="2">567244</int>
<int name="6">139389</int>
<int name="7">4120</int>
</lst>
</lst>
<lst name="facet_dates"/>
</lst>

And now with 4.7.1 we are getting this:
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="ad_job_type_id">
<int name="1">10954552</int>
<int name="4">10884418</int>
<int name="5">2000530</int>
<int name="3">784491</int>
<int name="2">535935</int>
<int name="4,1">134826</int>
<int name="5,1">11770</int>
... there are too many values to list them all ...

I checked the Change log for 4.7.1 and only saw an optimization made for 
https://issues.apache.org/jira/browse/SOLR-5512

Is there any new configuration directive that we should be aware of?

Thanks







RE: Were changes made to faceting on multivalued fields recently?

2014-04-09 Thread Jean-Sebastien Vachon
Thanks Erick I will check this as soon as I can.

In the meantime, here is a sample query and how it looks in our index. It looks 
good to me (at least, that is what shows up in our other, older
indexes).

http://10.0.5.227:8201/solr/Current/select?q=*:*&fl=ad_job_type_id&fq=ad_job_type_id:[*%20TO%20*]&facet=on&facet.field=ad_job_type_id&rows=1

<result name="response" numFound="12204004" start="0" maxScore="1.0">
  <doc>
    <arr name="ad_job_type_id">
      <str>4 5 1</str>
    </arr>
  </doc>
</result>

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: April-09-14 2:21 PM
 To: solr-user@lucene.apache.org
  Subject: Re: Were changes made to faceting on multivalued fields recently?
 
  That is...um...very strange. It looks to me like you have somehow indexed a
  bunch of new values. I'm guessing here, but it's suspicious that you have a
  value "4,1" -- should that have been indexed as "4" and "1" as separate tokens?
  
  So here's what I'd do:
  1> take a look at the solr/admin/schema browser output for that field
  in the two versions. I suspect you'll see 7 values in 4.6 and a bazillion in 
  4.7.1.
  2> if <1> is true, take a look at the admin/analysis page for the
  field in question and see some sample index-time inputs, especially for the
  theoretical "4,1" entries. I suspect that 4.6 will break these up into two
  tokens and 4.7.1 won't.
  3> if <2> is true, take a very careful look at the index-time analysis
  chains in the two versions; I bet they're different and that accounts for your
  observations.
  4> try 1-3, discover I'm totally off base, and paste the schema.xml
  definitions for the field in question in both 4.6 and 4.7.1 to this thread and
  we can take a look.
 
 This should not have changed between 4.6 and 4.7.1, at least not
 intentionally.
 
 Best,
 Erick
 
 On Wed, Apr 9, 2014 at 11:04 AM, Jean-Sebastien Vachon jean-
 sebastien.vac...@wantedanalytics.com wrote:
  Hi All,
 
  We just discovered that the response from Solr (4.7.1) when faceting on
 one of our multi-valued fields has changed considerably.
 
  In the past (4.6.1 and prior versions as well) we used to have
  something like this: (there are 7 possible values for this attribute)
 
   <lst name="facet_counts">
   <lst name="facet_queries"/>
   <lst name="facet_fields">
   <lst name="ad_job_type_id">
   <int name="1">11454652</int>
   <int name="4">11387070</int>
   <int name="5">2095603</int>
   <int name="3">809992</int>
   <int name="2">567244</int>
   <int name="6">139389</int>
   <int name="7">4120</int>
   </lst>
   </lst>
   <lst name="facet_dates"/>
   </lst>
 
  And now with 4.7.1 we are getting this:
   <lst name="facet_counts">
   <lst name="facet_queries"/>
   <lst name="facet_fields">
   <lst name="ad_job_type_id">
   <int name="1">10954552</int>
   <int name="4">10884418</int>
   <int name="5">2000530</int>
   <int name="3">784491</int>
   <int name="2">535935</int>
   <int name="4,1">134826</int>
   <int name="5,1">11770</int>
  ... there are too many values to list them all ...
 
  I checked the Change log for 4.7.1 and only saw an optimization made
  for https://issues.apache.org/jira/browse/SOLR-5512
 
  Is there any new configuration directive that we should be aware of?
 
  Thanks
 
 
 
 
 
 


RE: Update single field through SolrJ

2014-04-01 Thread Jean-Sebastien Vachon
Hi,

Thanks for pointing me in the proper direction. I managed to change my code to 
send atomic updates through SolrJ, but this morning we experienced something 
weird. I sent a large batch of updates and deletes through SolrJ and our cloud 
quickly became unusable and unresponsive (no leader for a shard, etc.).

We looked through the logs and could not find a particular reason for this. We 
waited quite some time but some nodes were not showing any progress in their 
recovery so we restarted them (we are running Tomcat 7.0.39) and everything 
came back as if nothing happened.

Has anyone experienced something similar? We are currently running Solr 4.6.1 
on a 5-node cluster with both ZK 3.4.5 and Solr on them (ZK has its own 
storage device to minimize the impact). Both are also running under JRE 
1.7.0_21 in 64-bit mode. Our index has 5 shards with 2 replicas.

Thanks for your help

 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: March-28-14 3:21 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Update single field through SolrJ
 
 On 3/28/2014 1:02 PM, Jean-Sebastien Vachon wrote:
   I'd like to know how (if it is possible) to update a field's value using
   SolrJ. I
  looked at the API and could not figure it out, so for now I'm using the
  UpdateHandler by sending it a JSON-formatted document illustrating the
  required changes.
 
 
  Is there a way to do the same through SolrJ?
 
 The feature you are after is called Atomic Updates.  In order to use this
 feature *all* of your fields must be stored, except for copyField 
 destinations.
 See especially the Caveats and Limitations section of the first link below:
 
 http://wiki.apache.org/solr/Atomic_Updates
 https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Docu
 ments
 
 To do this with SolrJ, you must use a Map for the field value instead of just
 one or more regular values:
 
 http://stackoverflow.com/questions/16234045/solr-how-to-use-the-new-
 field-update-modes-atomic-updates-with-solrj
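 
  A condensed version of what that answer shows (a sketch; field names are
  examples, "server" is an existing SolrServer, and it needs the usual
  java.util and org.apache.solr.common.SolrInputDocument imports):
 
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc1");
      Map<String, Object> op = new HashMap<String, Object>();
      op.put("set", "the new value");  // "set" replaces; "add" and "inc" are the other modifiers
      doc.addField("title", op);
      server.add(doc);
      server.commit();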
 
 Thanks,
 Shawn
 
 


Update single field through SolrJ

2014-03-28 Thread Jean-Sebastien Vachon
Hi All,

I'd like to know how (if it is possible) to update a field's value using SolrJ. I 
looked at the API and could not figure it out, so for now I'm using the 
UpdateHandler by sending it a JSON-formatted document illustrating the required 
changes.


Is there a way to do the same through SolrJ?

Thanks


RE: Search with and without spaces

2013-11-04 Thread Jean-Sebastien Vachon
Hello Antoine,

I see only two solutions to your problem.

1) Use synonyms, but you will be limited to cases known in advance only, so it 
is a solution that does not scale in the long run.

2) Otherwise, consider having a second field (probably via copyField) that does 
not use a WhitespaceTokenizer (the KeywordTokenizerFactory class looks like a 
good candidate) and search on both fields 
(fq=champ1:la redoute OR champ2:la redoute).

The admin page (/solr/admin/analysis.jsp) lets you see exactly what happens for 
different values and fields.

Also, you will have a much better chance of getting answers to your questions 
if they are written in English. ;)

Good luck

 -Original Message-
 From: Antoine REBOUL [mailto:antoine.reb...@gmail.com]
 Sent: November-04-13 11:42 AM
 To: solr-user@lucene.apache.org
  Subject: Search with and without spaces
 
  Hello,
  
  I would like searches in a text-type field to return results even if the
  spaces are mistyped (for example: la redoute = laredoute).
  
  Today my text field is defined as follows:
 
 
  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.StopFilterFactory"
              ignoreCase="true"
              words="stopwords.txt"
              enablePositionIncrements="true"
      />
      <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms2.txt"
              ignoreCase="true" expand="false"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="1"
              catenateNumbers="1"
              catenateAll="1"
              splitOnCaseChange="1"
              splitOnNumerics="1"
              preserveOriginal="1"
      />
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="1"
              catenateNumbers="0"
              catenateAll="1"
              splitOnCaseChange="1"
              preserveOriginal="1"
      />
      <filter class="solr.StopFilterFactory"
              ignoreCase="true"
              words="stopwords.txt"
              enablePositionIncrements="true"
      />
      <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>
 
 
 
 
 
 
  Thanks in advance for any replies.
  Regards.
 
 Antoine Reboul
 


RE: Regarding improving performance of the solr

2013-09-06 Thread Jean-Sebastien Vachon
Have you checked the hit ratio of the different caches? Try to tune them to get 
rid of all evictions if possible.

Tuning the size of the caches and warming your searcher can give you a pretty 
good improvement. You might want to check your analysis chain as well to make 
sure you're not doing anything unnecessary.



 -Original Message-
 From: prabu palanisamy [mailto:pr...@serendio.com]
 Sent: September-06-13 4:55 AM
 To: solr-user@lucene.apache.org
  Subject: Regarding improving performance of Solr
 
  Hi
 
  I am currently using Solr 3.5.0, with an indexed Wikipedia dump (50 GB), on
  Java 1.6.
  I am searching Solr with text (which is actually Twitter tweets).
  Currently it takes an average of 210 milliseconds for each post, of which
  200 milliseconds is consumed by the Solr server (QTime).  I used the jconsole
  monitoring tool.
 
  The stats are:
     Heap usage - 10-50 MB
     No of threads - 10-20
     No of classes - 3800
     CPU usage - 10-15%
 
  Currently I am loading all the fields of the Wikipedia dump.
  
  I only need the Freebase category and Wikipedia category. I want to know
  how to optimize the Solr server to improve the performance.
  
  Could you please help me optimize the performance?
 
 Thanks and Regards
 Prabu
 


RE: Flushing cache without restarting everything?

2013-08-22 Thread Jean-Sebastien Vachon
How can you validate that the changes you just made had any impact on the 
performance of the cloud if you don't have the same starting conditions?

What we basically do is run a batch of requests to warm up the index and 
then launch the benchmark itself. That way we can measure the impact of our 
change(s). Otherwise there is absolutely no way we can be sure what is 
responsible for the gain or loss in performance.

Restarting a cloud is actually a real pain; I just want to know if there is a 
faster way to proceed.

 -Original Message-
 From: Dmitry Kan [mailto:solrexp...@gmail.com]
 Sent: August-22-13 7:26 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Flushing cache without restarting everything?
 
  But is it really good benchmarking if you flush the cache? Wouldn't you
  want to benchmark against a system that is comparable to what is
  under real (= production) load?
 
 Dmitry
 
 
 On Tue, Aug 20, 2013 at 9:39 PM, Jean-Sebastien Vachon  jean-
 sebastien.vac...@wantedanalytics.com wrote:
 
  I just want to run benchmarks and want to have the same starting
  conditions.
 
   -Original Message-
   From: Walter Underwood [mailto:wun...@wunderwood.org]
   Sent: August-20-13 2:06 PM
   To: solr-user@lucene.apache.org
   Subject: Re: Flushing cache without restarting everything?
  
    Why? What are you trying to achieve with this? --wunder
  
   On Aug 20, 2013, at 11:04 AM, Jean-Sebastien Vachon wrote:
  
Hi All,
   
Is there a way to flush the cache of all nodes in a Solr Cloud (by
  reloading all
   the cores, through the collection API, ...) without having to
   restart
  all nodes?
   
Thanks
  
  
  
  
  
 
 


RE: Flushing cache without restarting everything?

2013-08-22 Thread Jean-Sebastien Vachon
I was afraid someone would tell me that... thanks for your input

 -Original Message-
 From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
 Sent: August-22-13 9:56 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Flushing cache without restarting everything?
 
 On Tue, 2013-08-20 at 20:04 +0200, Jean-Sebastien Vachon wrote:
  Is there a way to flush the cache of all nodes in a Solr Cloud (by
  reloading all the cores, through the collection API, ...) without
  having to restart all nodes?
 
 As MMapDirectory shares data with the OS disk cache, flushing of
 Solr-related caches on a machine should involve
 
 1) Shut down all Solr instances on the machine
  2) Clear the OS read cache ('sudo echo 1 > /proc/sys/vm/drop_caches' on
  a Linux box)
 3) Start the Solr instances
 
 I do not know of any Solr-supported way to do step 2. For our
 performance tests we use custom scripts to perform the steps.
 
 - Toke Eskildsen, State and University Library, Denmark
 
 


Flushing cache without restarting everything?

2013-08-20 Thread Jean-Sebastien Vachon
Hi All,

Is there a way to flush the cache of all nodes in a Solr Cloud (by reloading 
all the cores, through the collection API, ...) without having to restart all 
nodes?

Thanks
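
For what it's worth, a collection reload swaps in a fresh searcher and 
therefore empties Solr's own caches (modulo any autowarming), though it does 
nothing for the OS page cache; something along these lines, with the host and 
collection name taken from the queries elsewhere in this archive:

http://10.0.5.211:8201/solr/admin/collections?action=RELOAD&name=Current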


RE: Flushing cache without restarting everything?

2013-08-20 Thread Jean-Sebastien Vachon
I just want to run benchmarks and want to have the same starting conditions.

 -Original Message-
 From: Walter Underwood [mailto:wun...@wunderwood.org]
 Sent: August-20-13 2:06 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Flushing cache without restarting everything?
 
  Why? What are you trying to achieve with this? --wunder
 
 On Aug 20, 2013, at 11:04 AM, Jean-Sebastien Vachon wrote:
 
  Hi All,
 
  Is there a way to flush the cache of all nodes in a Solr Cloud (by 
  reloading all
 the cores, through the collection API, ...) without having to restart all 
 nodes?
 
  Thanks
 
 
 
 
 


Huge discrepancy between QTime and ElapsedTime

2013-08-14 Thread Jean-Sebastien Vachon
Hi All,

I am running some benchmarks to tune our Solr 4.3 cloud and noticed that while 
the reported QTime is quite satisfactory (100 ms or so), the elapsed time is 
quite large (around 5 seconds). The collection contains 12.8M documents and the 
index size on disk is about 35 GB. I have only one shard and 4 replicas (we 
intend to have 5 shards but wanted to see how Solr would perform with only one 
shard so that we could benefit from all Solr functions).

I checked for huge GC pauses but found none. I also checked whether we had 
intensive IO, and we don't. All five nodes have 48 GB of RAM, of which 4 GB is 
allocated to Tomcat 7 and Solr. The caches have a hit ratio over 80%. Zookeeper 
is running on the same boxes (5 instances, one per node) but there does not 
seem to be much activity going on.

This is a sample query:

http://10.0.5.211:8201/solr/Current/select?fq=position_first_seen_date_id:[3484 TO 3516]&q=(title:java OR semi_clean_title:java OR ad_description:java)&rows=10&start=0&fl=job_id,position_id,super_alias_id,advertiser,super_alias,credited_source_id,position_first_seen_date_id,position_last_seen_date_id, position_posted_date_id, position_refreshed_date_id, position_job_type_id, position_function_id,position_green_code,title_id,semi_clean_title_id,clean_title_id,position_empl_count,place_id, state_id,county_id,msa_id,country_id,position_id,position_job_type_mva, ad_activity_status_id, position_score, ad_score,position_salary,position_salary_range_id,position_salary_source,position_naics_6_code,position_education_level_id, is_staffing,is_bulk,is_anonymous,is_third_party,is_dirty,ref_num,tags,lat,long,position_duns_number,url,advertiser_id, title, semi_clean_title, ad_description, position_description, ad_bls_salary, position_bls_salary, covering_source_id, content_model_id,position_soc_2011_8_code&group.field=position_id&group=true&group.ngroups=false&group.main=true&sort=position_first_seen_date_id desc,score desc

Any idea what could cause this?


RE: Huge discrepancy between QTime and ElapsedTime

2013-08-14 Thread Jean-Sebastien Vachon
Thanks Shawn and Scott for your feedback. It is really appreciated.

 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: August-14-13 12:39 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Huge discrepancy between QTime and ElapsedTime
 
 On 8/14/2013 9:09 AM, Jean-Sebastien Vachon wrote:
  I am running some benchmarks to tune our Solr 4.3 cloud and noticed
  that while the reported QTime  is quite satisfactory (100 ms or so),
  the elapsed time is quite large (around 5 seconds). The collection
  contains 12.8M documents and the index size on disk is about 35 GB.. I
  have only one shard and 4 replicas (we intent to have 5 shards but
  wanted to see how Solr would perform with only one shard so that we
  could benefit from all Solr functions)
 
 As your other reply from Scott says, you may be dealing with the fact that
 Solr must fetch stored field data from the index on disk and decompress it.
 Solr 4.1 and later have compressed stored fields.  There is no way other than
 writing custom Solr code to turn off the compression.  If the documents are
 very large, the decompression step can be a big performance penalty.
 
 You have a VERY large field list - fl parameter.  Have you tried just leaving 
 that
 parameter off so that Solr will return all stored fields instead of 
 identifying
 each field?  This might not help at all, I'm just putting it out there as
 something to try.

I will give it a try. 

 You also have grouping enabled.  From what I understand, that can be slow.
 If you turn that off, what happens to your elapsed times?

Yes, grouping is slow, but not nearly as bad as it was in Solr 1.4, which we are 
still using in production with a similar index (actually it has 17M documents 
on 6 shards, all on the same server). I expect grouping to be much faster in 
4.x than in 1.4, and I don't have this problem in 1.4. It's true, however, that 
I have some additional stored fields in my new setup. But this was done to 
limit the number of times I have to fetch the information from MySQL.

 Your free RAM vs. index size is good, assuming that there's nothing else on
 your Solr servers.  With 12.8 million documents plus the use of grouping and
 sorting, you might need a larger java heap.  Try increasing it to 5GB as an
 initial test and see if that makes any difference, either good or bad.
 
 Your email says you checked for huge GC, but without knowing exactly how
 you checked, it's difficult to know what you would have actually found.

I turned GC logging on and analyzed the resulting file. I've also taken a few 
heap dumps and all generations seem to be properly sized.
I will give the larger heap a try to see if it affects the performance.

 Thanks,
 Shawn
 
 


queryResultCache showing all zeros

2013-07-31 Thread Jean-Sebastien Vachon
Hi,

We just configured a new Solr cloud (5 nodes) running Solr 4.3, ran about 200 
000 queries taken from our production environment and measured the performance 
of the cloud over a collection of 14M documents with the default Solr settings. 
We are now trying to tune the different caches and when I look at each node of 
the cloud, all of them are showing no activity (see below) regarding the 
queryResultCache... all other caches are showing some activity. Any idea what 
could cause this?


org.apache.solr.search.LRUCache
version: 1.0
description: LRU Cache(maxSize=512, initialSize=512)
src: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/solr/core/src/java/org/apache/solr/search/LRUCache.java
stats:
  lookups: 0
  hits: 0
  hitratio: 0.00
  inserts: 0
  evictions: 0
  size: 0
  warmupTime: 0
  cumulative_lookups: 0
  cumulative_hits: 0
  cumulative_hitratio: 0.00
  cumulative_inserts: 0
  cumulative_evictions: 0




RE: queryResultCache showing all zeros

2013-07-31 Thread Jean-Sebastien Vachon
Looks like the problem might not be related to Solr but to a proprietary system 
we have on top of it. 
I made some queries with facets and the cache was updated. We are looking into 
this... I should not have assumed that the problem was coming from Solr ;)

I'll let you know if there is anything

From: Chris Hostetter
Sent: Wednesday, July 31, 2013 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: queryResultCache showing all zeros

: We just configured a new Solr cloud (5 nodes) running Solr 4.3, ran
: about 200 000 queries taken from our production environment and measured
: the performance of the cloud over a collection of 14M documents with the
: default Solr settings. We are now trying to tune the different caches
: and when I look at each node of the cloud, all of them are showing no
: activity (see below) regarding the queryResultCache... all other caches
: are showing some activity. Any idea what could cause this?

Can you show us some examples of the types of queries you are executing?

Do you have useFilterForSortedQuery in your solrconfig.xml ?



-Hoss


RE: queryResultCache showing all zeros

2013-07-31 Thread Jean-Sebastien Vachon
Ok, I might have found a Solr issue after I fixed a problem in our system.

This is the kind of query we are making:

http://10.0.5.214:8201/solr/Current/select?fq=position_refreshed_date_id:[2747%20TO%203501]&fq=position_soc_2011_8_code:41101100&fq=country_id:1&fq=position_job_type_id:4&fq=position_education_level_id:8&fq=position_salary_range_id:2&fq=is_dirty:false&fq=is_staffing:false&fq=-position_soc_2011_2_code:99&fq=-covering_source_id:(839%20OR%201145%20OR%2025%20OR%20802%20OR%20777%20OR%2085%20OR%20881%20OR%20775%20OR%201558%20OR%20743%20OR%20800%20OR%201580%20OR%201147%20OR%201690%20OR%20674%20OR%20894%20OR%20791)&q=%20(title:photographer%20OR%20ad_description:photographer%20OR%20super_alias:photographer)%20AND%20(_val_:%22sum(product(75,div(5000,sum(50,sub(3500,position_refreshed_date_id,product(0.75,job_score),product(0.75,source_score))%22)&facet=true&facet.mincount=1&f.state_id.facet.limit=10&facet.field=state_id&facet.field=position_salary_range_id&facet.field=position_job_type_id&facet.field=position_naics_6_code&facet.field=place_id&facet.field=position_education_level_id&facet.field=position_soc_2011_8_code&f.position_salary_range_id.facet.limit=10&f.position_job_type_id.facet.limit=10&f.position_naics_6_code.facet.limit=10&f.place_id.facet.limit=10&f.position_education_level_id.facet.limit=10&f.position_soc_2011_8_code.facet.limit=10&rows=10&start=0&fl=job_id,position_id,super_alias_id,advertiser,super_alias,credited_source_id,position_first_seen_date_id,position_last_seen_date_id,%20position_posted_date_id,%20position_refreshed_date_id,%20position_job_type_id,%20position_function_id,position_green_code,title_id,semi_clean_title_id,clean_title_id,position_empl_count,place_id,%20state_id,county_id,msa_id,country_id,position_id,position_job_type_mva,%20ad_activity_status_id,%20position_score,%20ad_score,position_salary,position_salary_range_id,position_salary_source,position_naics_6_code,position_education_level_id,%20is_staffing,is_bulk,is_anonymous,is_third_party,is_dirty,ref_num,tags,lat,long,position_duns_number,url,advertiser_id,%20title,%20semi_clean_title,%20ad_description,%20position_description,%20ad_bls_salary,%20position_bls_salary,%20covering_source_id,%20content_model_id,position_soc_2011_8_code&group.field=position_id&group=true&group.ngroups=true&group.main=true&sort=score%20desc

It's quite long, but this request uses both faceting and grouping. If I remove 
the grouping then the cache is used. Is this normal behavior or a bug?

Thanks

From: Jean-Sebastien Vachon
Sent: Wednesday, July 31, 2013 2:38 PM
To: solr-user@lucene.apache.org
Subject: RE: queryResultCache showing all zeros

Looks like the problem might not be related to Solr but to a proprietary system 
we have on top of it.
I made some queries with facets and the cache was updated. We are looking into 
this... I should not have assumed that the problem was coming from Solr ;)

I'll let you know if there is anything

From: Chris Hostetter
Sent: Wednesday, July 31, 2013 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: queryResultCache showing all zeros

: We just configured a new Solr cloud (5 nodes) running Solr 4.3, ran
: about 200 000 queries taken from our production environment and measured
: the performance of the cloud over a collection of 14M documents with the
: default Solr settings. We are now trying to tune the different caches
: and when I look at each node of the cloud, all of them are showing no
: activity (see below) regarding the queryResultCache... all other caches
: are showing some activity. Any idea what could cause this?

Can you show us some examples of the types of queries you are executing?

Do you have useFilterForSortedQuery in your solrconfig.xml ?



-Hoss

RE: queryResultCache showing all zeros

2013-07-31 Thread Jean-Sebastien Vachon
Also, we do not have useFilterForSortedQuery in our config, so we are 
relying on the default, which I guess is false.
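
(That guess is right: the flag sits in the <query> section of solrconfig.xml 
and defaults to false. Enabling it would look like:

  <useFilterForSortedQuery>true</useFilterForSortedQuery>)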




From: Jean-Sebastien Vachon
Sent: Wednesday, July 31, 2013 3:44 PM
To: solr-user@lucene.apache.org
Subject: RE: queryResultCache showing all zeros

Ok, I might have found a Solr issue after I fixed a problem in our system.

This is the kind of query we are making:

http://10.0.5.214:8201/solr/Current/select?fq=position_refreshed_date_id:[2747%20TO%203501]&fq=position_soc_2011_8_code:41101100&fq=country_id:1&fq=position_job_type_id:4&fq=position_education_level_id:8&fq=position_salary_range_id:2&fq=is_dirty:false&fq=is_staffing:false&fq=-position_soc_2011_2_code:99&fq=-covering_source_id:(839%20OR%201145%20OR%2025%20OR%20802%20OR%20777%20OR%2085%20OR%20881%20OR%20775%20OR%201558%20OR%20743%20OR%20800%20OR%201580%20OR%201147%20OR%201690%20OR%20674%20OR%20894%20OR%20791)&q=%20(title:photographer%20OR%20ad_description:photographer%20OR%20super_alias:photographer)%20AND%20(_val_:%22sum(product(75,div(5000,sum(50,sub(3500,position_refreshed_date_id,product(0.75,job_score),product(0.75,source_score))%22)&facet=true&facet.mincount=1&f.state_id.facet.limit=10&facet.field=state_id&facet.field=position_salary_range_id&facet.field=position_job_type_id&facet.field=position_naics_6_code&facet.field=place_id&facet.field=position_education_level_id&facet.field=position_soc_2011_8_code&f.position_salary_range_id.facet.limit=10&f.position_job_type_id.facet.limit=10&f.position_naics_6_code.facet.limit=10&f.place_id.facet.limit=10&f.position_education_level_id.facet.limit=10&f.position_soc_2011_8_code.facet.limit=10&rows=10&start=0&fl=job_id,position_id,super_alias_id,advertiser,super_alias,credited_source_id,position_first_seen_date_id,position_last_seen_date_id,%20position_posted_date_id,%20position_refreshed_date_id,%20position_job_type_id,%20position_function_id,position_green_code,title_id,semi_clean_title_id,clean_title_id,position_empl_count,place_id,%20state_id,county_id,msa_id,country_id,position_id,position_job_type_mva,%20ad_activity_status_id,%20position_score,%20ad_score,position_salary,position_salary_range_id,position_salary_source,position_naics_6_code,position_education_level_id,%20is_staffing,is_bulk,is_anonymous,is_third_party,is_dirty,ref_num,tags,lat,long,position_duns_number,url,advertiser_id,%20title,%20semi_clean_title,%20ad_description,%20position_description,%20ad_bls_salary,%20position_bls_salary,%20covering_source_id,%20content_model_id,position_soc_2011_8_code,position_noc_2006_4_id&group.field=position_id&group=true&group.ngroups=true&group.main=true&sort=score%20desc

It's quite long, but this request uses both faceting and grouping. If I remove 
the grouping then the cache is used. Is this normal behavior or a bug?

Thanks

From: Jean-Sebastien Vachon
Sent: Wednesday, July 31, 2013 2:38 PM
To: solr-user@lucene.apache.org
Subject: RE: queryResultCache showing all zeros

Looks like the problem might not be related to Solr but to a proprietary system 
we have on top of it.
I made some queries with facets and the cache was updated. We are looking into 
this... I should not have assumed that the problem was coming from Solr ;)

I'll let you know if there is anything

From: Chris Hostetter
Sent: Wednesday, July 31, 2013 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: queryResultCache showing all zeros

: We just configured a new Solr cloud (5 nodes) running Solr 4.3, ran
: about 200 000 queries taken from our production environment and measured
: the performance of the cloud over a collection of 14M documents with the
: default Solr settings. We are now trying to tune the different caches
: and when I look at each node of the cloud, all of them are showing no
: activity (see below) regarding the queryResultCache... all other caches
: are showing some activity. Any idea what could cause this?

Can you show us some examples of the types of queries you are executing?

Do you have useFilterForSortedQuery in your solrconfig.xml ?



-Hoss

RE: Problem with document routing with Solr 4.2.1

2013-05-24 Thread Jean-Sebastien Vachon
Hi All,

Evan Sayer from LucidWorks found the problem in our schema, so this problem is 
not related to SolrCloud itself (well, it is, but at least it is not a bug).

I don't know why :( but at some point we changed the type of the id field from 
'string' to 'text'.
Since we are doing custom hashing and the id field was tokenized, Solr could 
not find the documents again when collecting responses from each shard.

We changed the id field back to the 'string' type and it is now working.
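
For reference, the working definition is roughly the following in schema.xml 
(the usual setup for a uniqueKey field):

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <uniqueKey>id</uniqueKey>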



-Original Message-
From: Jean-Sebastien Vachon [mailto:jean-sebastien.vac...@wantedanalytics.com] 
Sent: May-23-13 2:57 PM
To: solr-user@lucene.apache.org
Subject: RE: Problem with document routing with Solr 4.2.1

I must add that shard.keys= does not return anything on two of my nodes. But 
that is to be expected since I'm using a replication factor of 3 on a cloud of 
5 servers.

-Original Message-
From: Jean-Sebastien Vachon [mailto:jean-sebastien.vac...@wantedanalytics.com]
Sent: May-23-13 11:27 AM
To: solr-user@lucene.apache.org
Subject: RE: Problem with document routing with Solr 4.2.1

If that can help... adding distrib=false or shard.keys= gives back 
results.


-Original Message-
From: Jean-Sebastien Vachon [mailto:jean-sebastien.vac...@wantedanalytics.com]
Sent: May-23-13 10:39 AM
To: solr-user@lucene.apache.org
Subject: RE: Problem with document routing with Solr 4.2.1

I know. If I stop routing the documents and simply use a standard 'id' field 
then I am getting back my fields. 
I forgot to tell you how the collection was created:

http://localhost:8201/solr/admin/collections?action=CREATE&name=Current&numShards=15&replicationFactor=3&maxShardsPerNode=9

Since I am using the numShards parameter, composite routing should be 
working... unless I misunderstood something.

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: May-23-13 10:27 AM
To: solr-user@lucene.apache.org
Subject: Re: Problem with document routing with Solr 4.2.1

That's strange. The default value of rows param is 10 so you should be 
getting 10 results back unless your StandardRequestHandler config in solrconfig 
has set rows to 0 or if none of your fields are stored.


On Thu, May 23, 2013 at 7:40 PM, Jean-Sebastien Vachon  
jean-sebastien.vac...@wantedanalytics.com wrote:

 Hi All,

 I just started indexing data in my brand new Solr Cloud running on 4.2.1.
 Since I am a big user of the grouping feature, I need to route my 
 documents on the proper shard.
 Following the instruction found here:

 http://docs.lucidworks.com/display/solr/Shards+and+Indexing+Data+in+So
 lrCloud

 I set my document id to something like this  'fieldA!id' where fieldA 
 is the key I want to use to distribute my documents.
 (All documents with the same value for fieldA will be sent to the same 
 shard).

 When I query my index, I can see that the number of documents increase 
 but there are no fields at all in the index.

 http://10.0.5.211:8201/solr/Current/select?q=*:*

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">11</int>
      <lst name="params">
        <str name="q">*:*</str>
      </lst>
    </lst>
    <result name="response" numFound="26318" start="0" maxScore="1.0"/>
  </response>

 Specifying fields in the 'fl' parameter does nothing.

 What am I doing wrong?




--
Regards,
Shalin Shekhar Mangar.



Problem with document routing with Solr 4.2.1

2013-05-23 Thread Jean-Sebastien Vachon
Hi All,

I just started indexing data in my brand new Solr Cloud running on 4.2.1.
Since I am a big user of the grouping feature, I need to route my documents on 
the proper shard.
Following the instruction found here:
http://docs.lucidworks.com/display/solr/Shards+and+Indexing+Data+in+SolrCloud

I set my document id to something like this  'fieldA!id' where fieldA is the 
key I want to use to distribute my documents.
(All documents with the same value for fieldA will be sent to the same shard).
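
For instance (values purely illustrative), an update document routed this way 
would look like:

  <add>
    <doc>
      <field name="id">companyA!12345</field>
      <field name="title">java developer</field>
    </doc>
  </add>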

When I query my index, I can see that the number of documents increases, but 
there are no fields at all in the index.

http://10.0.5.211:8201/solr/Current/select?q=*:*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">11</int>
    <lst name="params">
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="26318" start="0" maxScore="1.0"/>
</response>

Specifying fields in the 'fl' parameter does nothing.

What am I doing wrong?


RE: Problem with document routing with Solr 4.2.1

2013-05-23 Thread Jean-Sebastien Vachon
I know. If I stop routing the documents and simply use a standard 'id' field 
then I am getting back my fields. 
I forgot to tell you how the collection was created:

http://localhost:8201/solr/admin/collections?action=CREATE&name=Current&numShards=15&replicationFactor=3&maxShardsPerNode=9

Since I am using the numShards parameter, composite routing should be 
working... unless I misunderstood something.

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: May-23-13 10:27 AM
To: solr-user@lucene.apache.org
Subject: Re: Problem with document routing with Solr 4.2.1

That's strange. The default value of rows param is 10 so you should be 
getting 10 results back unless your StandardRequestHandler config in solrconfig 
has set rows to 0 or if none of your fields are stored.


On Thu, May 23, 2013 at 7:40 PM, Jean-Sebastien Vachon  
jean-sebastien.vac...@wantedanalytics.com wrote:

 Hi All,

 I just started indexing data in my brand new Solr Cloud running on 4.2.1.
 Since I am a big user of the grouping feature, I need to route my 
 documents on the proper shard.
 Following the instruction found here:

 http://docs.lucidworks.com/display/solr/Shards+and+Indexing+Data+in+So
 lrCloud

 I set my document id to something like this  'fieldA!id' where fieldA 
 is the key I want to use to distribute my documents.
 (All documents with the same value for fieldA will be sent to the same 
 shard).

 When I query my index, I can see that the number of documents increase 
 but there are no fields at all in the index.

 http://10.0.5.211:8201/solr/Current/select?q=*:*

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">11</int>
      <lst name="params">
        <str name="q">*:*</str>
      </lst>
    </lst>
    <result name="response" numFound="26318" start="0" maxScore="1.0"/>
  </response>

 Specifying fields in the 'fl' parameter does nothing.

 What am I doing wrong?




--
Regards,
Shalin Shekhar Mangar.



RE: Problem with document routing with Solr 4.2.1

2013-05-23 Thread Jean-Sebastien Vachon
If that can help... adding distrib=false or shard.keys= gives back 
results.


-Original Message-
From: Jean-Sebastien Vachon [mailto:jean-sebastien.vac...@wantedanalytics.com] 
Sent: May-23-13 10:39 AM
To: solr-user@lucene.apache.org
Subject: RE: Problem with document routing with Solr 4.2.1

I know. If I stop routing the documents and simply use a standard 'id' field 
then I am getting back my fields. 
I forgot to tell you how the collection was created:

http://localhost:8201/solr/admin/collections?action=CREATE&name=Current&numShards=15&replicationFactor=3&maxShardsPerNode=9

Since I am using the numShards parameter, composite routing should be 
working... unless I misunderstood something.

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: May-23-13 10:27 AM
To: solr-user@lucene.apache.org
Subject: Re: Problem with document routing with Solr 4.2.1

That's strange. The default value of rows param is 10 so you should be 
getting 10 results back unless your StandardRequestHandler config in solrconfig 
has set rows to 0 or if none of your fields are stored.


On Thu, May 23, 2013 at 7:40 PM, Jean-Sebastien Vachon  
jean-sebastien.vac...@wantedanalytics.com wrote:

 Hi All,

 I just started indexing data in my brand new Solr Cloud running on 4.2.1.
 Since I am a big user of the grouping feature, I need to route my 
 documents on the proper shard.
 Following the instruction found here:

 http://docs.lucidworks.com/display/solr/Shards+and+Indexing+Data+in+So
 lrCloud

 I set my document id to something like this  'fieldA!id' where fieldA 
 is the key I want to use to distribute my documents.
 (All documents with the same value for fieldA will be sent to the same 
 shard).

 When I query my index, I can see that the number of documents increase 
 but there are no fields at all in the index.

 http://10.0.5.211:8201/solr/Current/select?q=*:*

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">11</int>
      <lst name="params">
        <str name="q">*:*</str>
      </lst>
    </lst>
    <result name="response" numFound="26318" start="0" maxScore="1.0"/>
  </response>

 Specifying fields in the 'fl' parameter does nothing.

 What am I doing wrong?




--
Regards,
Shalin Shekhar Mangar.



RE: Problem with document routing with Solr 4.2.1

2013-05-23 Thread Jean-Sebastien Vachon
I must add that shard.keys= does not return anything on two of my nodes. But 
that is to be expected since I'm using a replication factor of 3 on a cloud of 
5 servers.

-Original Message-
From: Jean-Sebastien Vachon [mailto:jean-sebastien.vac...@wantedanalytics.com] 
Sent: May-23-13 11:27 AM
To: solr-user@lucene.apache.org
Subject: RE: Problem with document routing with Solr 4.2.1

If that can help... adding distrib=false or shard.keys= gives back 
results.


-Original Message-
From: Jean-Sebastien Vachon [mailto:jean-sebastien.vac...@wantedanalytics.com]
Sent: May-23-13 10:39 AM
To: solr-user@lucene.apache.org
Subject: RE: Problem with document routing with Solr 4.2.1

I know. If I stop routing the documents and simply use a standard 'id' field 
then I am getting back my fields. 
I forgot to tell you how the collection was created:

http://localhost:8201/solr/admin/collections?action=CREATE&name=Current&numShards=15&replicationFactor=3&maxShardsPerNode=9

Since I am using the numShards parameter, composite routing should be 
working... unless I misunderstood something.

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: May-23-13 10:27 AM
To: solr-user@lucene.apache.org
Subject: Re: Problem with document routing with Solr 4.2.1

That's strange. The default value of rows param is 10 so you should be 
getting 10 results back unless your StandardRequestHandler config in solrconfig 
has set rows to 0 or if none of your fields are stored.


On Thu, May 23, 2013 at 7:40 PM, Jean-Sebastien Vachon  
jean-sebastien.vac...@wantedanalytics.com wrote:

 Hi All,

 I just started indexing data in my brand new Solr Cloud running on 4.2.1.
 Since I am a big user of the grouping feature, I need to route my 
 documents on the proper shard.
 Following the instruction found here:

 http://docs.lucidworks.com/display/solr/Shards+and+Indexing+Data+in+So
 lrCloud

 I set my document id to something like this  'fieldA!id' where fieldA 
 is the key I want to use to distribute my documents.
 (All documents with the same value for fieldA will be sent to the same 
 shard).

 When I query my index, I can see that the number of documents increase 
 but there are no fields at all in the index.

 http://10.0.5.211:8201/solr/Current/select?q=*:*

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">11</int>
      <lst name="params">
        <str name="q">*:*</str>
      </lst>
    </lst>
    <result name="response" numFound="26318" start="0" maxScore="1.0"/>
  </response>

 Specifying fields in the 'fl' parameter does nothing.

 What am I doing wrong?




--
Regards,
Shalin Shekhar Mangar.



RE: Solr 3.5 and sharding

2013-01-17 Thread Jean-Sebastien Vachon
Hi Erick,

It looks like we are saying the exact same thing but with different terms ;) 
I looked at the Solr glossary and you might be right... maybe I should talk 
about partitions instead of shards.

Since my last message, I've configured the replication between the master and 
slave and everything is working fine except for my original question about the 
number of documents not matching my expectations.

I'll try to clarify a few things and come back to this question...

Machine A (which I called the master node) is where the indexing takes place.
It consists of four Solr instances that will (eventually) each contain 1/4 of 
the entire collection. It's just that, at this moment, since I have no control 
over which partition a given document is sent to, I made copies of the same 
index for all partitions. Each Solr instance has a replication handler 
configured. I will eventually get to the point of changing the indexing code 
to distribute documents evenly across all partitions, but the person who can 
give me access to this portion is not available right now, so I can do nothing 
about it.

Machine B has the same four shards set up as replicas of the corresponding 
shards on machine A.
Machine B also contains another Solr instance with the default handler 
configured to use the four local partitions. This instance receives clients' 
requests, collects the results from each partition, and then selects the best 
matches to form the final response. We intend to add new slaves that are exact 
copies of Machine B and load balance client requests across all slaves.

My original question was: if each partition has 1000 documents matching a 
certain keyword, and I know all partitions have the same content, then I was 
expecting to receive 4*1000 documents for that keyword. But that is not the 
case.
Replication is not an issue here, since the same request on the master node 
gives me the same result.

Each shard, when queried individually, returns 1000 documents. But when I query 
them using the shards=xxx parameter, I get a little less than 4000 documents. I 
was just curious to know why this is happening... Is this a bug? Or something I 
am misunderstanding?

Thanks for your time and contribution to Solr!

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: January-17-13 8:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 3.5 and sharding

You're still confusing shards (or at least mixing up the terminology) with 
simple replication. Shards are when you split up the index into several sub 
indexes and configure the sub-indexes to know about each other. Say you have 
1M docs in 2 shards. 500K of them would go on one shard and 500K on the other. 
But logically you have a single index of 1M docs. So the two shards have to 
know about each other and when you send a request to one of them, it 
automatically queries the other (as well as itself), collects the response and 
combines them, returning the top N to the requester.
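
Concretely, in 3.x that distribution is driven by the shards parameter on the 
request; the hostnames below are placeholders:

http://host1:8983/solr/select?q=java&shards=host1:8983/solr,host2:8983/solr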

This is totally different from replication. In replication (master/slave), each 
node has all 1M documents. Each node can work totally in isolation. An incoming 
request is handled by the slave without contacting any other node.

If you're copying around indexes AND configuring them as though they were 
shards, each request will be distributed to all shards and the results 
collated, giving you the same doc repeatedly in your result set.

If you have no access to the indexing code, you really can't go to a sharded 
setup.

Polling is when the slaves periodically ask the master "has anything changed?" 
If so, the slave pulls down the changes. The polling interval is configured 
in solrconfig.xml _on the slave_. So let's say you index docs to the master. 
For some interval, until the slaves poll the master and get an updated index, 
the number of searchable docs on the master will be different than for the 
slaves. Additionally, you may have the issue of the polling intervals for the 
slaves being offset from one another, so for some brief interval the counts on 
the slaves may be different as well.

Best
Erick

On Tue, Jan 15, 2013 at 10:18 AM, Jean-Sebastien Vachon 
jean-sebastien.vac...@wantedanalytics.com wrote:
 Ok, I see what Erick meant now... Thanks.

 The original index I`m working on contains about 120k documents. Since I have 
 no access to the code that pushes documents into the index, I made four 
 copies of the same index.

 The master node contains no data at all, it simply use the data available in 
 its four shards. Knowing that I have 1000 documents matching the keyword 
 java on each shard I was expecting to receive 4000 documents out of my 
 sharded setup. There are only a few documents that are not accounted for (The 
 result count is about 3996 which is pretty close but not accurate).

 Right now, the index is static so there is no need for any replication so the 
 polling interval has no effect.
 Later this week, I

RE: Solr 3.5 and sharding

2013-01-15 Thread Jean-Sebastien Vachon
Hi Erick,

Thanks for your comments, but I am migrating an existing index (single 
instance) to a sharded setup and currently I have no access to the code 
involved in the indexing process. That's why I made a simple copy of the index 
on each shard.

In the end, the data will be distributed among all shards.

I was just curious to know why I did not get the expected number of documents 
with my four shards.

Can you elaborate on this polling interval thing? I am pretty sure I never 
heard about this... 

Regards

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: January-15-13 8:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 3.5 and sharding

You're confusing shards and slaves here. Shards are splitting a logical index 
amongst N machines, where each machine contains a portion of the index. In that 
setup, you have to configure the slaves to know about the other shards, and the 
incoming query has to be distributed amongst all the shards to find all the 
docs.

In your case, since you're really replicating (rather than sharding), you only 
have to query _one_ slave, the query doesn't need to be distributed.

So pull all the sharding stuff out of your config files, put a load balancer in 
front of your slaves and only send the request to one of them would be the 
place I'd start.

Also, don't be at all surprised if the number of hits from the _master_ (which 
you shouldn't be searching, BTW) is different than the slaves, there's the 
polling interval to consider.

Best
Erick


On Mon, Jan 14, 2013 at 9:58 AM, Jean-Sebastien Vachon  
jean-sebastien.vac...@wantedanalytics.com wrote:

 Hi,

 I'm setting up a small Solr setup consisting of 1 master node and 4 
 shards. For now, all four shards contain the exact same data. When I 
 perform a query on each individual shard for the word 'java' I am 
 receiving the same number of docs (as expected). However, when I am 
 going through the master node using the shards parameters, the number 
 of results is slightly off by a few documents. There is nothing 
 special in my setup so I`m looking for hints on why I am getting this 
 problem

 Thanks




RE: Solr 3.5 and sharding

2013-01-15 Thread Jean-Sebastien Vachon
Ok, I see what Erick meant now... Thanks.

The original index I'm working on contains about 120k documents. Since I have 
no access to the code that pushes documents into the index, I made four copies 
of the same index.

The master node contains no data at all; it simply uses the data available in 
its four shards. Knowing that I have 1000 documents matching the keyword "java" 
on each shard, I was expecting to receive 4000 documents out of my sharded 
setup. There are only a few documents that are not accounted for (the result 
count is about 3996, which is pretty close but not accurate).

Right now the index is static, so there is no need for any replication and the 
polling interval has no effect.
Later this week, I will configure the replication and have the indexing 
modified to distribute the documents to each shard using a simple ID modulo 4 
rule.

Were my expectations wrong about the number of documents? 

-Original Message-
From: Upayavira [mailto:u...@odoko.co.uk] 
Sent: January-15-13 9:21 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 3.5 and sharding

He was referring to master/slave setup, where a slave will poll the master 
periodically asking for index updates. That frequency is configured in 
solrconfig.xml on the slave.

So, you are saying that you have, say, 1m documents in your master index.
You then copy your index to four other boxes. At that point you have 1m 
documents on each of those four. Eventually, you'll delete some docs, so you'd 
have 250k on each. You're wondering why, before the deletes, you're not seeing 
1m docs on each of your instances.

Or are you wondering why you're not seeing 1m docs when you do a distributed 
query across all four of these boxes?

Is that correct? 

Upayavira

On Tue, Jan 15, 2013, at 02:11 PM, Jean-Sebastien Vachon wrote:
 Hi Erick,
 
 Thanks for your comments but I am migrating an existing index (single
 instance) to a sharded setup and currently I have no access to the 
 code involved in the indexation process. That`s why I made a simple 
 copy of the index on each shards.
 
 In the end, the data will be distributed among all shards.
 
 I was just curious to know why I had not the expected number of 
 documents with my four shards.
 
 Can you elaborate on  this polling interval thing? I am pretty sure 
 I never heard about this...
 
 Regards
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: January-15-13 8:00 AM
 To: solr-user@lucene.apache.org
  Subject: Re: Solr 3.5 and sharding
 
 You're confusing shards and slaves here. Shards are splitting a 
 logical index amongst N machines, where each machine contains a 
 portion of the index. In that setup, you have to configure the slaves 
 to know about the other shards, and the incoming query has to be 
 distributed amongst all the shards to find all the docs.
 
 In your case, since you're really replicating (rather than sharding), 
 you only have to query _one_ slave, the query doesn't need to be distributed.
 
 So pull all the sharding stuff out of your config files, put a load 
 balancer in front of your slaves and only send the request to one of 
 them would be the place I'd start.
 
 Also, don't be at all surprised if the number of hits from the 
 _master_ (which you shouldn't be searching, BTW) is different than the 
 slaves, there's the polling interval to consider.
 
 Best
 Erick
 
 
 On Mon, Jan 14, 2013 at 9:58 AM, Jean-Sebastien Vachon  
 jean-sebastien.vac...@wantedanalytics.com wrote:
 
  Hi,
 
  I'm setting up a small Solr setup consisting of 1 master node and 4 
  shards. For now, all four shards contain the exact same data. When 
  I perform a query on each individual shard for the word 'java' I am 
  receiving the same number of docs (as expected). However, when I am 
  going through the master node using the shards parameters, the 
  number of results is slightly off by a few documents. There is 
  nothing special in my setup so I`m looking for hints on why I am 
  getting this problem
 
  Thanks
 
 


RE: Solr 3.5 and sharding

2013-01-14 Thread Jean-Sebastien Vachon
Ok that was my first thought... thanks for the confirmation

-Original Message-
From: Michael Ryan [mailto:mr...@moreover.com] 
Sent: January-14-13 10:06 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 3.5 and sharding 

If you have the same documents -- with the same uniqueKey -- across multiple 
shards, the count will not be what you expect. You'll need to ensure that each 
document exists only on a single shard.

-Michael

-Original Message-
From: Jean-Sebastien Vachon [mailto:jean-sebastien.vac...@wantedanalytics.com]
Sent: Monday, January 14, 2013 9:59 AM
To: solr-user@lucene.apache.org
Subject: Solr 3.5 and sharding 

Hi,

I'm setting up a small Solr setup consisting of 1 master node and 4 shards. For 
now, all four shards contain the exact same data. When I perform a query on 
each individual shard for the word 'java' I am receiving the same number of 
docs (as expected). However, when I am going through the master node using the 
shards parameter, the number of results is slightly off by a few documents. 
There is nothing special in my setup so I'm looking for hints on why I am 
getting this problem.

Thanks



RE: FieldCache

2011-05-26 Thread Jean-Sebastien Vachon
10 unique terms on 1.5M documents each with 50+ fields? I don't think so ;)

What I mean is controlling its size like the other caches. There are
currently no options in solrconfig.xml to control this cache.
Is Solr/Lucene managing this all by itself? 

It could be that my understanding of the FieldCache is wrong. I thought this
was the main cache for Lucene. Is that right?

Thanks for your feedback

-Original Message-
From: pravesh [mailto:suyalprav...@yahoo.com] 
Sent: May-26-11 2:58 AM
To: solr-user@lucene.apache.org
Subject: Re: FieldCache

This is because you may have only 10 unique terms in your indexed
field.
BTW, what do you mean by controlling the FieldCache?

--
View this message in context:
http://lucene.472066.n3.nabble.com/FieldCache-tp2987541p2988142.html
Sent from the Solr - User mailing list archive at Nabble.com.



FieldCache

2011-05-25 Thread Jean-Sebastien Vachon
Hi All,

 

Since there is no way of controlling the size of Lucene's internal
FieldCache, how can we make sure that we are making good use of it? One of
my shards has close to 1.5M documents and the fieldCache only contains about
10 elements.

 

Is there anything we can do to control this?

 

Thanks



SOLR-2209

2011-05-19 Thread Jean-Sebastien Vachon
Hi All,

I am having some problems with the presence of unnecessary parentheses in my 
query.
A query such as:
title:software AND (title:engineer)
will return no results. Removing the parentheses fixes the issue, but since my 
users can enter parentheses themselves, I need to find a way to fix or work 
around this bug. I found that this is related to SOLR-2209, but there is no 
activity on that issue.

Anyone know if this will get fixed some time in the future or if it is already 
fixed in Solr 4?

Otherwise, could someone point me to the code handling this so that I can 
attempt to make a fix?

Thx


RE: SOLR-2209

2011-05-19 Thread Jean-Sebastien Vachon
I'm using Solr 1.4...

I thought I had a case without a NOT but it seems to work now :S
It might be a glitch on my server.

The problem is easily reproducible with the NOT operator

http://10.0.5.221:8983/jobs/select?q=title:java%20AND%20(-title:programmer)
http://10.0.5.221:8983/jobs/select?q=title:java%20AND%20(-(title:programmer))

both queries return 0 results while...

http://10.0.5.221:8983/jobs/select?q=title:java%20AND%20-(title:programmer)
(note the position of the negation operator)

returns more than 50 000 results
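
A common workaround for this query-parser quirk -- a parenthesized clause that 
contains only negative terms matches nothing -- is to anchor the negation with 
a match-all term inside the parentheses, e.g.:

http://10.0.5.221:8983/jobs/select?q=title:java%20AND%20(*:*%20-title:programmer)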

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: May-19-11 9:53 AM
To: solr-user@lucene.apache.org
Subject: Re: SOLR-2209

What version of Solr are you using? Because this works fine for me.

Could you attach the results of adding debugQuery=on in both instances?
The parsed form of the query is identical in 1.4.1 as far as I can tell. The
bug you're referencing is a peculiarity of the not (-) operator I think.

Best
Erick

On Thu, May 19, 2011 at 7:25 AM, Jean-Sebastien Vachon
jean-sebastien.vac...@wantedtech.com wrote:
 Hi All,

 I am having some problems with the presence of unnecessary parentheses in
my query.
 A query such as:
                title:software AND (title:engineer) will return no 
 results. Removing the parentheses fixes the issue, but since my users can
enter parentheses themselves I need to find a way to fix or work around
this bug. I found that this is related to SOLR-2209 but there is no activity
on this bug.

 Anyone know if this will get fixed some time in the future or if it is
already fixed in Solr 4?

 Otherwise, could someone point me to the code handling this so that I can
attempt to make a fix?

 Thx




RE: Exact match on a field with stemming

2011-04-11 Thread Jean-Sebastien Vachon
I'm curious to know why Solr is not respecting the phrase.
If it considers "manager" a phrase... shouldn't it return only documents 
containing that phrase?

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: April-11-11 3:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Exact match on a field with stemming

Hi,

Using quotes means "use this as a phrase", not "use this as a literal". :) I 
think copying to an unstemmed field is the only/common work-around.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem 
search :: http://search-lucene.com/



- Original Message 
 From: Pierre-Luc Thibeault pierre-luc.thibea...@wantedtech.com
 To: solr-user@lucene.apache.org
 Sent: Mon, April 11, 2011 2:55:04 PM
 Subject: Exact match on a field with stemming
 
 Hi all,
 
 Is there a way to perform an exact-match query on a field that has 
stemming enabled, using the standard /select handler?
 
 I thought that putting a word inside double-quotes would enable this 
behaviour, but if I query my field with a single word like “manager”
 I am receiving results containing the word “management”.
 
 I know I can use a copyField with different types, but that would 
double the size of my index… Is there an alternative?
 
 Thanks
 



RE: Exact match on a field with stemming

2011-04-11 Thread Jean-Sebastien Vachon
Thanks for the clarification. This makes sense.
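
For reference, the unstemmed-copy workaround mentioned earlier looks roughly 
like this in schema.xml (field and type names are illustrative; text_unstemmed 
would be the same analysis chain minus the stemmer):

  <field name="title" type="text_stemmed" indexed="true" stored="true"/>
  <field name="title_exact" type="text_unstemmed" indexed="true" stored="false"/>
  <copyField source="title" dest="title_exact"/>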

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: April-11-11 7:54 PM
To: solr-user@lucene.apache.org
Subject: FW: Exact match on a field with stemming


 I'm curious to know why Solr is not respecting the phrase.
 If it considers "manager" a phrase... shouldn't it return only documents
containing that phrase?

A "phrase" means to Solr (or rather to the lucene and dismax query parsers,
which are what understand double-quoted phrases) "these tokens in exactly
this order".

So a phrase of one token, "manager", is exactly the same as if you didn't use
the double quotes. It's only one token, so "all the tokens in this phrase in
exactly the order specified" is, well, just the same as one token without
phrase quotes. 

If you've set up a stemmed field at indexing time, then "manager" and
"management" are stemmed IN THE INDEX, probably to something like "manag".
There is no longer any information in the index (at least in that field) on
what the original literal was; it's been stemmed in the index.  So there's
no way possible for it to only match certain un-stemmed versions -- at least
using that field. And when you enter either 'manager' or 'management' at
query time, it is analyzed and stemmed to match that stemmed something like
"manag" in the index either way. If it didn't analyze and stem at query
time, then instead the query would just match NOTHING, because neither
'manager' nor 'management' are in the index at all, only the stemmed
versions. 

So, yes, double quotes are interpreted as a phrase, and only documents
containing that phrase are returned; you got it. 


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: April-11-11 3:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Exact match on a field with stemming

Hi,

Using quotes means "use this as a phrase", not "use this as a literal". :) I
think copying to an unstemmed field is the only/common work-around.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem
search :: http://search-lucene.com/



- Original Message 
 From: Pierre-Luc Thibeault pierre-luc.thibea...@wantedtech.com
 To: solr-user@lucene.apache.org
 Sent: Mon, April 11, 2011 2:55:04 PM
 Subject: Exact match on a field with stemming

 Hi all,

 Is there a way to perform an exact-match query on a field that has 
stemming enabled, using the standard /select handler?

 I thought that putting a word inside double-quotes would enable this 
behaviour, but if I query my field with a single word like "manager"
 I am receiving results containing the word "management".

 I know I can use a copyField with different types, but that would 
double the size of my index. Is there an alternative?

 Thanks





Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException

2010-12-01 Thread Jean-Sebastien Vachon

Try this...

http://localhost:8080/solr/select?wt=json&indent=true&q={!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}title:Art%20Loft

- Original Message - 
From: Dennis Gearon gear...@sbcglobal.net

To: solr-user@lucene.apache.org
Sent: Wednesday, December 01, 2010 7:51 PM
Subject: spatial query parsing error: 
org.apache.lucene.queryParser.ParseException



I am trying to get spatial search to work on my Solr installation. I am 
running

version 1.4.1 with the Jayway Team spatial-solr-plugin. I am performing the
search with the following url:

http://localhost:8080/solr/select?wt=json&indent=true&q=title:Art%20Loft{!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}


The result that I get is the following error:

HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse
'title:Art Loft{!spatial lat=37.326375 lng=-121.892639 radius=3 unit=km
threadCount=3}': Encountered  RANGEEX_GOOP lng=-121.892639  at line 1,
column 38. Was expecting: }

Not sure why it would be complaining about the lng parameter in the query. I
double-checked to make sure that I had the right name for the longitude 
field in

my solrconfig.xml file.

Any help/suggestions would be greatly appreciated

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better
idea to learn from others’ mistakes, so you do not have to make them 
yourself.

from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException

2010-12-01 Thread Jean-Sebastien Vachon
I just saw the parameter 'lng' in your query... I believe it should be 
'long'. Give it a try if the link I sent you is not working.


- Original Message - 
From: Dennis Gearon gear...@sbcglobal.net

To: solr-user@lucene.apache.org
Sent: Wednesday, December 01, 2010 11:39 PM
Subject: Re: spatial query parsing error: 
org.apache.lucene.queryParser.ParseException



Thanks Jean-Sebastien. I forwarded it to my partner. His membership is still
being held up.

I'll be the go between until he has access.

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better
idea to learn from others’ mistakes, so you do not have to make them 
yourself.

from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jean-Sebastien Vachon js.vac...@videotron.ca
To: solr-user@lucene.apache.org
Sent: Wed, December 1, 2010 7:12:20 PM
Subject: Re: spatial query parsing error:
org.apache.lucene.queryParser.ParseException

Try this...

http://localhost:8080/solr/select?wt=json&indent=true&q={!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}title:Art%20Loft


- Original Message - From: Dennis Gearon gear...@sbcglobal.net
To: solr-user@lucene.apache.org
Sent: Wednesday, December 01, 2010 7:51 PM
Subject: spatial query parsing error:
org.apache.lucene.queryParser.ParseException


I am trying to get spatial search to work on my Solr installation. I am 
running

version 1.4.1 with the Jayway Team spatial-solr-plugin. I am performing the
search with the following url:

http://localhost:8080/solr/select?wt=json&indent=true&q=title:Art%20Loft{!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}



The result that I get is the following error:

HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse
'title:Art Loft{!spatial lat=37.326375 lng=-121.892639 radius=3 unit=km
threadCount=3}': Encountered  RANGEEX_GOOP lng=-121.892639  at line 1,
column 38. Was expecting: }

Not sure why it would be complaining about the lng parameter in the query. I
double-checked to make sure that I had the right name for the longitude 
field in

my solrconfig.xml file.

Any help/suggestions would be greatly appreciated

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better
idea to learn from others’ mistakes, so you do not have to make them 
yourself.

from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die. 



Facet.query and collapsing

2010-11-25 Thread Jean-Sebastien Vachon

Hi All,

I'm in a situation where I need to perform a facet on a query with field 
collapsing.

Let's say the main query is something like this

title:apple&fq={!tag=sources}source_id:(33 OR 44)&facet=on&facet.field={!ex=sources}source_id&facet.query=source_id:(33 OR 44)&collapse=on&collapse.field=hash_id

I'd like my facet query to return the number of unique documents (based on
the hash_id field) that are associated with either source 33 or 44.

Right now, the query works but the count returned is larger than expected,
since no collapsing is performed on the facet query's result set.

Is there any way of doing this? I'd like to be able to do it without
performing a second request.

Thanks

NOTE: I'm using Solr 1.4.1 with patch 236 
(https://issues.apache.org/jira/browse/SOLR-236)





Re: Looking for help with Solr implementation

2010-11-13 Thread Jean-Sebastien Vachon
Yes we did. Sorry for this. We both made the same error replying to the 
mailing list.


- Original Message - 
From: Thumuluri, Sai sai.thumul...@verizonwireless.com

To: solr-user@lucene.apache.org
Sent: Saturday, November 13, 2010 8:41 AM
Subject: RE: Looking for help with Solr implementation


Please refrain from using this mailing list for soliciting and take it offline


-Original Message-
From: AC [mailto:acanuc...@yahoo.com]
Sent: Sat 11/13/2010 1:12 AM
To: solr-user@lucene.apache.org
Subject: Re: Looking for help with Solr implementation

Hey Jean-Sebastien,

Thanks for the reply. It sounds like your experience is exactly what is
needed for my project.


To give you some background, this is a personal project related to the
biomedical field that I'm trying to get off the ground.


The site is www.antibodyreview.com

It is a portal site for researchers in the biotech industry specifically
focused on antibodies - not sure how up you may be on biomedical research :)

Anyway I have collected a lot of information about proteins and antibodies
from various sources which people can search and browse. The site is and will
be free to access by anyone.


The current search uses MySQL but our requirements for how the site needs to
operate cannot be properly handled by MySQL. Searches can take ~8-10 sec and
this is clearly not acceptable.


If you try the default search on the index page you can see how slow it is.
Suggested terms to try: Akt, p53, PTEN, AIF.


So there are several different items indexed in solr that we want to search:

1. Protein Information (~42,000 MySQL DB records)
2. Products (expect to host 200,000 product records, currently ~20,000
products) http://www.antibodyreview.com/products.php (current product search
is faceted but also takes way too long)
3. Articles (text from ~120,000 articles) Article search can be accessed from
the protein pages and advanced search page:
http://www.antibodyreview.com/advsearch.php
4. Images (~100,000 image captions) Image search is found on this page
http://www.antibodyreview.com/gallery.php

The current solr search which has been set-up can be seen on this page:
www.antibodyreview.com/proteins3.php (search bar on this page uses solr). It
is clearly much faster and meets our needs so it seems clear that using solr
is the solution to the search issue.


The last programmer mentioned that he had indexed all the data and it is now
just a matter of setting up the search queries in solr. The most complicated
query to set up will be the products as it requires faceted search. The other
searches are fairly routine or have more limited facets/options.


If it looks like there is mutual interest I can share with you a document
that he created that explains how things have been set up, which should help
you get started.


Please let me know what you think.

Regards,

Abe







From: Jean-Sebastien Vachon js.vac...@videotron.ca
To: solr-user@lucene.apache.org
Sent: Fri, November 12, 2010 7:09:06 PM
Subject: Re: Looking for help with Solr implementation

Hi,

If you're still looking for someone, I might be interested in getting more
information about your project. From your initial message that does not seem
to be a lot of work, so I might be willing to give you some time.

I've been working with Solr for the last 7 months at my full-time job and I'm
currently managing a Solr-based project that uses field collapsing, faceting,
custom scoring with function queries and a custom query handler.

Contact me if you're interested

- Original Message - From: AC acanuc...@yahoo.com
To: solr-user@lucene.apache.org
Sent: Thursday, November 11, 2010 7:43 PM
Subject: Looking for help with Solr implementation


Hi,

Not sure if this is the correct place to post but I'm looking for someone to
help finish a Solr install on our LAMP based website. This would be a paid
project.


The programmer that started the project got too busy with his full-time job
to finish the project. Solr has been installed and a basic search is working
but we need to configure it to work across the site and also set up faceted
search. I tried posting on some popular freelance sites but haven't been able
to find anyone with real Solr expertise / experience.


If you think you can help me with this project please let me know and I can
supply more details.


Regards,

Abe







Re: Looking for help with Solr implementation

2010-11-12 Thread Jean-Sebastien Vachon

Hi,

If you're still looking for someone, I might be interested in getting more
information about your project. From your initial message that does not seem
to be a lot of work, so I might be willing to give you some time.

I've been working with Solr for the last 7 months at my full-time job and I'm
currently managing a Solr-based project that uses field collapsing, faceting,
custom scoring with function queries and a custom query handler.

Contact me if you're interested

- Original Message - 
From: AC acanuc...@yahoo.com

To: solr-user@lucene.apache.org
Sent: Thursday, November 11, 2010 7:43 PM
Subject: Looking for help with Solr implementation


Hi,

Not sure if this is the correct place to post but I'm looking for someone to
help finish a Solr install on our LAMP based website. This would be a paid
project.


The programmer that started the project got too busy with his full-time job
to finish the project. Solr has been installed and a basic search is working
but we need to configure it to work across the site and also set up faceted
search. I tried posting on some popular freelance sites but haven't been able
to find anyone with real Solr expertise / experience.


If you think you can help me with this project please let me know and I can
supply more details.


Regards,

Abe





Re: Looking for help with Solr implementation

2010-11-12 Thread Jean-Sebastien Vachon

Sorry all, I obviously meant to send this to the original poster

- Original Message - 
From: Jean-Sebastien Vachon js.vac...@videotron.ca

To: solr-user@lucene.apache.org
Sent: Friday, November 12, 2010 10:09 PM
Subject: Re: Looking for help with Solr implementation



Hi,

If you're still looking for someone, I might be interested in getting more
information about your project. From your initial message that does not seem
to be a lot of work, so I might be willing to give you some time.

I've been working with Solr for the last 7 months at my full-time job and I'm
currently managing a Solr-based project that uses field collapsing, faceting,
custom scoring with function queries and a custom query handler.

Contact me if you're interested

- Original Message - 
From: AC acanuc...@yahoo.com

To: solr-user@lucene.apache.org
Sent: Thursday, November 11, 2010 7:43 PM
Subject: Looking for help with Solr implementation


Hi,

Not sure if this is the correct place to post but I'm looking for someone to
help finish a Solr install on our LAMP based website. This would be a paid
project.


The programmer that started the project got too busy with his full-time job
to finish the project. Solr has been installed and a basic search is working
but we need to configure it to work across the site and also set up faceted
search. I tried posting on some popular freelance sites but haven't been able
to find anyone with real Solr expertise / experience.


If you think you can help me with this project please let me know and I can
supply more details.


Regards,

Abe







problem with wildcard

2010-11-11 Thread Jean-Sebastien Vachon

Hi All,

I'm having some trouble with a query using a wildcard and I was wondering if
anyone could tell me why these two similar queries do not return the same
number of results. Basically, the query I'm making should return all docs
whose title starts with (or contains) the string lowe'. I suspect some
analyzer is causing this behaviour and I'd like to know if there is a way to
fix this problem.

1) select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0

<result name="response" numFound="302" start="0"/>
<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">MatchAllDocsQuery(*:*)</str>
<str name="parsedquery_toString">*:*</str>
<lst name="explain"/>
<str name="QParser">LuceneQParser</str>
<arr name="filter_queries">
<str>title:(+lowe')</str>
</arr>
<arr name="parsed_filter_queries">
<str>title:low</str>
</arr>

2) select?q=*:*&fq=title:(+lowe'*)&debugQuery=on&rows=0

<result name="response" numFound="0" start="0"/>
<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">MatchAllDocsQuery(*:*)</str>
<str name="parsedquery_toString">*:*</str>
<lst name="explain"/>
<str name="QParser">LuceneQParser</str>
<arr name="filter_queries">
<str>title:(+lowe'*)</str>
</arr>
<arr name="parsed_filter_queries">
<str>title:lowe'*</str>
</arr>
...
</lst>


The title field is defined as:

<field name="title" type="text" indexed="true" stored="true" required="false"/>

where the text type is:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>






Re: problem with wildcard

2010-11-11 Thread Jean-Sebastien Vachon

On 2010-11-11, at 3:45 PM, Ahmet Arslan wrote:

 I'm having some trouble with a query using some wildcard
 and I was wondering if anyone could tell me why these two
 similar queries do not return the same number of results.
 Basically, the query I'm making should return all docs whose
 title starts
 (or contain) the string lowe'. I suspect some analyzer is
 causing this behaviour and I'd like to know if there is a
 way to fix this problem.
 
 1)
 select?q=*:*fq=title:(+lowe')debugQuery=onrows=0
 
 wildcard queries are not analyzed http://search-lucene.com/m/pnmlH14o6eM1/
 

Yeah I found out about this a couple of minutes after I posted my problem. If 
there is no analyzer then
why is Solr not finding any documents when a single quote precedes the wildcard?
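
A plausible answer, judging from the debug output above: the index-time chain (WordDelimiterFilter plus the Snowball stemmer) strips the apostrophe and reduces lowe to low, so a term starting with lowe' never exists in the index; and since wildcard terms bypass analysis, title:lowe'* can match nothing. Querying against the indexed form should return results (a sketch to try - note it also matches any other term starting with low):

select?q=*:*&fq=title:low*&debugQuery=on&rows=0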


Re: Problem escaping question marks

2010-11-04 Thread Jean-Sebastien Vachon
Have you tried encoding it with %3F?

firstname:*%3F*

On 2010-11-04, at 1:44 AM, Stephen Powis wrote:

 I'm having difficulty properly escaping ? in my search queries.  It seems as
 tho it matches any character.
 
 Some info, a simplified schema and query to explain the issue I'm having.
 I'm currently running solr1.4.1
 
 Schema:
 
 <field name="id" type="sint" indexed="true" stored="true" required="true" />
 <field name="first_name" type="string" indexed="true" stored="true"
 required="false" />
 
 I want to return any first name with a Question Mark in it
 Query: first_name: *\?*
 
 Returns all documents with any character in it.
 
 Can anyone lend a hand?
 Thanks!
 Stephen



Re: RAM increase

2010-10-21 Thread Jean-Sebastien Vachon

You will also need to switch to a 64 bits JVM
You might have to add the `-d64` flag as well as the `-Xms` and `-Xmx`
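
For example, with the stock Jetty start.jar setup (the heap sizes here are placeholders - size them to your machine):

java -d64 -server -Xms4g -Xmx4g -jar start.jar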

- Original Message - 
From: Gora Mohanty g...@mimirtech.com

To: solr-user@lucene.apache.org
Sent: Thursday, October 21, 2010 2:34 AM
Subject: Re: RAM increase


On Thu, Oct 21, 2010 at 10:46 AM, satya swaroop satya.yada...@gmail.com 
wrote:

Hi all,
I increased my RAM size to 8GB and i want 4GB of it to be used
for solr itself. can anyone tell me the way to allocate the RAM for the
solr.

[...]

You will need to set up the allocation of RAM for Java, via the the -Xmx
and -Xms variables. If you are using something like Tomcat, that would
be done in the Tomcat configuration file. E.g., this option can be added
inside /etc/init.d/tomcat6 on new Debian/Ubuntu systems.
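
As a sketch, that usually amounts to a line like the following (the exact file and variable name vary by distribution, so verify against your own setup):

JAVA_OPTS="$JAVA_OPTS -Xms4g -Xmx4g"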

Regards,
Gora 



Re: Need help with field collapsing and out of memory error

2010-09-01 Thread Jean-Sebastien Vachon
Can you tell us what your current settings are regarding the fieldCollapseCache?

I had similar issues with field collapsing and I found out that this cache was 
responsible for 
most of the OOM exceptions.

Reduce or even remove this cache from your configuration and it should help.
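
The entry to look for in solrconfig.xml is roughly the following (the attribute values here are illustrative assumptions from the SOLR-236 patch era, not recommendations - shrink the size drastically or comment the element out entirely):

<fieldCollapseCache
  class="solr.FastLRUCache"
  size="512"
  initialSize="512"
  autowarmCount="0"/>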


On 2010-09-01, at 1:10 PM, Moazzam Khan wrote:

 Hi guys,
 
 I have about 20k documents in the Solr index (and there's a lot of
 text in each of them). I have field collapsing enabled on a specific
 field (AdvisorID).
 
 The thing is if I have field collapsing enabled in the search request
 I don't get correct count for the total number of records that
 matched. It always says that the number of rows I asked to get back
 is the number of total records it found.
 
 And, when I run a query with search criteria *:* (to get the number of
 total advisors in the index) solr runs of out memory and gives me an
 error saying
 
 SEVERE: java.lang.OutOfMemoryError: Java heap space
at java.nio.CharBuffer.wrap(CharBuffer.java:350)
at java.nio.CharBuffer.wrap(CharBuffer.java:373)
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
at java.lang.StringCoding.decode(StringCoding.java:173)
 
 
 This is going to be a huge problem later on when we index 50k
 documents later on.
 
 These are the options I am running Solr with :
 
 java -Xms2048M -Xmx2048M -XX:+UseConcMarkSweepGC -XX:PermSize=1024m
 -XX:MaxPermSize=1024m -jar start.jar
 
 
 Is there any way I can get the counts and not run out of memory?
 
 Thanks in advance,
 Moazzam



bug or feature???

2010-08-11 Thread Jean-Sebastien Vachon
Hi,

Can someone tell me why the two following queries do not return the same 
results?
Is that a bug or a feature?

http://localhost:8983/jobs/select?fq=title:(NOT janitor)&fq=description:(NOT janitor)&q=*:*

http://localhost:8983/jobs/select?q=title:(NOT janitor) AND description:(NOT janitor)


The second query returns no result while the first one returns 6097276 documents
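
A likely explanation: a Lucene boolean query consisting only of negative clauses matches nothing, whereas Solr rewrites a top-level negative fq against the full document set. Assuming that is what is happening here, anchoring each negative clause to *:* should make the second form behave like the first:

http://localhost:8983/jobs/select?q=(*:* -title:janitor) AND (*:* -description:janitor)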

Thanks


Re: question about the fieldCollapseCache

2010-06-15 Thread Jean-Sebastien Vachon
They used to be in the branches if I recall correctly but you're right. They 
aren't there anymore.

Maybe someone else can explain why... it looks like they restructured the
repository for the Solr/Lucene merge.

On 2010-06-15, at 4:54 AM, Rakhi Khatwani wrote:

 Hi,
  I tried downloading solr 1.4.1 from the site. but it shows an empty
 directory. where did u get solr 1.4.1 from?
 
 Regards,
 Raakhi
 
 On Tue, Jun 8, 2010 at 10:35 PM, Jean-Sebastien Vachon 
 js.vac...@videotron.ca wrote:
 
 Hi All,
 
 I've been running some tests using 6 shards, each one containing about 1
 million documents.
 Each shard is running in its own virtual machine with 7 GB of ram (5GB
 allocated to the JVM).
 After about 1100 unique queries the shards start to struggle and run out of
 memory. I've reduced all
 other caches without significant impact.
 
 When I remove completely the fieldCollapseCache, the server can keep up for
 hours
 and use only 2 GB of ram. (I'm even considering returning to a 32 bits JVM)
 
 The size of the fieldCollapseCache was set to 5000 items. How can 5000
 items eat 3 GB of ram?
 
 Can someone tell me what is put in this cache? Has anyone experienced this
 kind of problem?
 
 I am running Solr 1.4.1 with patch 236. All requests are collapsing on a
 single field (pint) and
 collapse.maxdocs set to 200 000.
 
 Thanks for any hints...
 
 



Re: question about the fieldCollapseCache

2010-06-09 Thread Jean-Sebastien Vachon
ok great.

I believe this should be mentioned in the wiki.

Later

On 2010-06-09, at 4:06 AM, Martijn v Groningen wrote:

 The fieldCollapseCache should not be used as it is now, it uses too
 much memory. It stores any information relevant for a field collapse
 search. Like document collapse counts, collapsed document ids /
 fields, collapsed docset and uncollapsed docset (everything per unique
 search). So the memory usage will grow for each unique query (and fast
 with all this information). So its best I think to disable this cache
 for now.
 
 Martijn
 
 On 8 June 2010 19:05, Jean-Sebastien Vachon js.vac...@videotron.ca wrote:
 Hi All,
 
 I've been running some tests using 6 shards, each one containing about 1
 million documents.
 Each shard is running in its own virtual machine with 7 GB of ram (5GB 
 allocated to the JVM).
 After about 1100 unique queries the shards start to struggle and run out of 
 memory. I've reduced all
 other caches without significant impact.
 
 When I remove completely the fieldCollapseCache, the server can keep up for 
 hours
 and use only 2 GB of ram. (I'm even considering returning to a 32 bits JVM)
 
 The size of the fieldCollapseCache was set to 5000 items. How can 5000 items 
 eat 3 GB of ram?
 
 Can someone tell me what is put in this cache? Has anyone experienced this 
 kind of problem?
 
 I am running Solr 1.4.1 with patch 236. All requests are collapsing on a 
 single field (pint) and
 collapse.maxdocs set to 200 000.
 
 Thanks for any hints...
 
 



Re: Diagnosing solr timeout

2010-06-09 Thread Jean-Sebastien Vachon
Have you looked at the garbage collector statistics? I've experienced this
kind of issue in the past and I was getting huge spikes when the GC was doing
its job.

On 2010-06-09, at 10:52 AM, Paul wrote:

 Hi all,
 
 In my app, it seems like solr has become slower over time. The index
 has grown a bit, and there are probably a few more people using the
 site, but the changes are not drastic.
 
 I notice that when a solr search is made, the amount of cpu and ram
 spike precipitously.
 
 I notice in the solr log, a bunch of entries in the same second that end in:
 
 status=0 QTime=212
 status=0 QTime=96
 status=0 QTime=44
 status=0 QTime=276
 status=0 QTime=8552
 status=0 QTime=16
 status=0 QTime=20
 status=0 QTime=56
 
 and then:
 
 status=0 QTime=315919
 status=0 QTime=325071
 
 My questions: How do I figure out what to fix? Do I need to start java
 with more memory? How do I tell what is the correct amount of memory
 to use? Is there something particularly inefficient about something
 else in my configuration, or the way I'm formulating the solr request,
 and how would I narrow down what it could be? I can't tell, but it
 seems like it happens after solr has been running unattended for a
 little while. Should I have a cron job that restarts solr every day?
 Could the solr process be starved by something else on the server
 (although -- the only other thing that is particularly running is
 apache/passenger/rails app)?
 
 In other words, I'm at a total loss about how to fix this.
 
 Thanks!
 
 P.S. In case this helps, here's the exact log entry for the first item
 that failed:
 
 Jun 9, 2010 1:02:52 PM org.apache.solr.core.SolrCore execute
 INFO: [resources] webapp=/solr path=/select
 params={hl.fragsize=600&facet.missing=true&facet=false&facet.mincount=1&ids=http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.44,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.06.xml;chunk.id%3Ddiv.ww.shelleyworks.v6.67,http://pm.nlx.com/xtf/view?docId%3Dtennyson_c/tennyson_c.02.xml;chunk.id%3Ddiv.tennyson.v2.1115,http://pm.nlx.com/xtf/view?docId%3Dmarx/marx.39.xml;chunk.id%3Ddiv.marx.engels.39.325,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.80,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.116,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.115,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.75,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.76,http://pm.nlx.com/xtf/view?docId%3Demerson/emerson.05.xml;chunk.id%3Dralph.waldo.v5.d083,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.31,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.88,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.03.xml;chunk.id%3Ddiv.eliot.romola.48&facet.limit=-1&hl.fl=text&hl.maxAnalyzedChars=512000&wt=javabin&hl=true&rows=30&version=1&fl=uri,archive,date_label,genre,source,image,thumbnail,title,alternative,url,role_ART,role_AUT,role_EDT,role_PBL,role_TRL,role_EGR,role_ETR,role_CRE,freeculture,is_ocr,federation,has_full_text,source_xml,uri&start=0&q=(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES)+OR+(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES+-genre:Citation)^5&facet.field=genre&facet.field=archive&facet.field=freeculture&facet.field=has_full_text&facet.field=federation&isShard=true&fq=year:1882}
 status=0 QTime=315919



Re: Diagnosing solr timeout

2010-06-09 Thread Jean-Sebastien Vachon
I use the following article as a reference when dealing with GC related issues

http://www.petefreitag.com/articles/gctuning/

I suggest you activate the verbose option and send GC stats to a file. I
don't remember exactly what the option was, but you should find the
information easily.
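
For reference (verify against your JVM's documentation), the usual flags on a Sun JVM of this vintage are:

java -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ...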

Good luck

On 2010-06-09, at 11:35 AM, Paul wrote:

 Have you looked at the garbage collector statistics? I've experienced this 
 kind of issues in the past
 and I was getting huge spikes when the GC was doing its job.
 
 I haven't, and I'm not sure what a good way to monitor this is. The
 problem occurs maybe once a week on a server. Should I run jstat the
 whole time and redirect the output to a log file? Is there another way
 to get that info?
 
 Also, I was suspecting GC myself. So, if it is the problem, what do I
 do about it? It seems like increasing RAM might make the problem worse
 because it would wait longer to GC, then it would have more to do.



question about the fieldCollapseCache

2010-06-08 Thread Jean-Sebastien Vachon
Hi All,

I've been running some tests using 6 shards, each one containing about 1
million documents.
Each shard is running in its own virtual machine with 7 GB of ram (5GB 
allocated to the JVM).
After about 1100 unique queries the shards start to struggle and run out of 
memory. I've reduced all
other caches without significant impact. 

When I remove completely the fieldCollapseCache, the server can keep up for 
hours 
and use only 2 GB of ram. (I'm even considering returning to a 32 bits JVM)

The size of the fieldCollapseCache was set to 5000 items. How can 5000 items 
eat 3 GB of ram?

Can someone tell me what is put in this cache? Has anyone experienced this kind 
of problem?

I am running Solr 1.4.1 with patch 236. All requests are collapsing on a single 
field (pint) and
collapse.maxdocs set to 200 000.

Thanks for any hints...



Re: Faceted search not working?

2010-05-25 Thread Jean-Sebastien Vachon
Is the FacetComponent loaded at all? 

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <arr name="components">
    <str>query</str>
    <str>facet</str>
  </arr>
</requestHandler>


On 2010-05-25, at 3:32 AM, Sascha Szott wrote:

 Hi Birger,
 
 Birger Lie wrote:
 I don't think the boolean fields are mapped to on and off :)
 You can use true and on interchangeably.
 
 -Sascha
 
 
 
 -birger
 
 -Original Message-
 From: Ilya Sterin [mailto:ster...@gmail.com]
 Sent: 24. mai 2010 23:11
 To: solr-user@lucene.apache.org
 Subject: Faceted search not working?
 
 I'm trying to perform a faceted search without any luck.  Result set doesn't 
 return any facet information...
 
 http://localhost:8080/solr/select/?q=title:*&facet=on&facet.field=title
 
 I'm getting the result set, but no face information present?  Is there 
 something else that needs to happen to turn faceting on?
 
 I'm using latest Solr 1.4 release.  Data is indexed from the database using 
 dataimporter.
 
 Thanks.
 
 Ilya Sterin
 




Re: jmx issue with solr

2010-05-19 Thread Jean-Sebastien Vachon
Hi,

Try adding these options...

-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.authenticate=false
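
The full command then becomes something like this (a sketch assuming the stock Jetty start.jar; with authentication and SSL disabled, keep the port firewalled to your own machine):

java -Dsolr.solr.home=./example-DIH/solr/ -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=3000 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -jar start.jar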


On 2010-05-19, at 3:44 AM, Na_D wrote:

 
 Hi,
 
 I am trying to start solr with the following command :
 
 java -Dsolr.solr.home=./example-DIH/solr/ -Dcom.sun.management.jmxremote
 -Dcom.sun.management.jmxremote.port=3000
 
 
 On doing so an error is reported :
 
 Error: Password file read access must be restricted:
 C:\Program Files\Java\jdk1.6.0_18\jre\lib\management\jmxremote.password
 
 
 The jmxremote.password file is there in the lib\management folder and the
 same has been set to read-only, yet the error persists. I am using Windows
 XP SP3 Version 2002, just mentioning it in case it's of any help.
 Please do put in your suggestions.
 -- 
 View this message in context: 
 http://lucene.472066.n3.nabble.com/jmx-issue-with-solr-tp828478p828478.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: JTeam Spatial Plugin

2010-05-11 Thread Jean-Sebastien Vachon
Hi,

Thanks for your suggestion, but I received more information about this issue
from one of the JTeam developers, who told me that my problem was caused by
the plugin not supporting sharding at this time.

In my case, I noticed that individual shards were computing the distance
through the geo_distance field. However, the master Solr instance controlling
the shards was losing this information because of the lack of shard support.

For now there is no quick work around that I know of.

Later,

On 2010-05-11, at 2:54 PM, Michael wrote:

 Try using geo_distance in the return fields.
 
 On Thu, Apr 29, 2010 at 9:26 AM, Jean-Sebastien Vachon
 js.vac...@videotron.ca wrote:
 Hi All,
 
 I am using JTeam's Spatial Plugin RC3 to perform spatial searches on my 
 index and it works great. However, I can't seem to get it to return the 
 computed distances.
 
 My query component is run before the geoDistanceComponent and the
 distanceField is set to distance. Fields for lat/long are defined as well
 and the different tier fields are in the results. Increasing the radius
 causes the number of matches to increase, so I guess that my setup is
 working...
 
 Here is sample query and its output (I removed some of the fields to keep it 
 short):
 
 /select?passkey=sample&q={!spatial%20lat=40.27%20long=-76.29%20radius=22%20calc=arc}title:engineer&wt=json&indent=on&fl=*,distance
 
 
 
 {
  "responseHeader":{
   "status":0,
   "QTime":69,
   "params":{
     "fl":"*,distance",
     "indent":"on",
     "q":"{!spatial lat=40.27 long=-76.29 radius=22 calc=arc}title:engineer",
     "wt":"json"}},
  "response":{"numFound":223,"start":0,"docs":[
    {
     "title":"Electrical Engineer",
     "long":-76.3054962158203,
     "lat":40.037899017334,
     "_tier_9":-3.004,
     "_tier_10":-6.0008,
     "_tier_11":-12.0016,
     "_tier_12":-24.0031,
     "_tier_13":-47.0061,
     "_tier_14":-93.00122,
     "_tier_15":-186.00243,
     "_tier_16":-372.00485}
  ]}}
 
 This output suggests to me that everything is in place. Does anyone know
 how to fetch the computed distance? I tried adding the field 'distance' to
 my list of fields but it didn't work.
 
 Thanks
 



JTeam Spatial Plugin

2010-04-29 Thread Jean-Sebastien Vachon
Hi All,

I am using JTeam's Spatial Plugin RC3 to perform spatial searches on my index 
and it works great. However, I can't seem to get it to return the computed 
distances.

My query component is run before the geoDistanceComponent and the
distanceField is set to distance. Fields for lat/long are defined as well and
the different tier fields are in the results. Increasing the radius causes
the number of matches to increase, so I guess that my setup is working...

Here is sample query and its output (I removed some of the fields to keep it 
short):

/select?passkey=sample&q={!spatial%20lat=40.27%20long=-76.29%20radius=22%20calc=arc}title:engineer&wt=json&indent=on&fl=*,distance



{
 "responseHeader":{
  "status":0,
  "QTime":69,
  "params":{
    "fl":"*,distance",
    "indent":"on",
    "q":"{!spatial lat=40.27 long=-76.29 radius=22 calc=arc}title:engineer",
    "wt":"json"}},
 "response":{"numFound":223,"start":0,"docs":[
   {
    "title":"Electrical Engineer",
    "long":-76.3054962158203,
    "lat":40.037899017334,
    "_tier_9":-3.004,
    "_tier_10":-6.0008,
    "_tier_11":-12.0016,
    "_tier_12":-24.0031,
    "_tier_13":-47.0061,
    "_tier_14":-93.00122,
    "_tier_15":-186.00243,
    "_tier_16":-372.00485}
 ]}}

This output suggests to me that everything is in place. Does anyone know how
to fetch the computed distance? I tried adding the field 'distance' to my
list of fields but it didn't work.

Thanks


Re: Reg: Indexing Date Fields

2010-04-15 Thread Jean-Sebastien Vachon
I guess you can simply use a range query such as:

fq=createdDate:[ date1 TO date2 ]
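
For example, with the long values you describe (assuming they are epoch milliseconds):

fq=createdDate:[1262304000000 TO 1271289600000]

One caveat: numeric range queries only behave numerically if the field type supports it - in the Solr 1.4 example schema that means the Trie-based types (long/tlong) - whereas the old plain long types compare lexicographically, which would explain always getting all the results back.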


On 2010-04-15, at 7:30 AM, Venkata Sai Krishna Vepakomma wrote:

 Hi,
 
 1) How do I query for data between 2 dates?  I have specified the
 following field definition in Schema.xml.

   <field name="createdDate" type="long" indexed="true" stored="true" />
 
 I have long values for Date fields.  When I query with long values, I am 
 always getting all the results.
 
 2) For indexing to work efficiently and for querying between date ranges,
 is it OK to use long values or do I need to use the 'Date' type with
 specific formats?
 
 Please Let me know your thoughts.
 
 Thanks & Regards
 Venkat



Collapse problem

2010-04-12 Thread Jean-Sebastien Vachon
Hi All,

I'd like to know if anyone else is experiencing the same problem we are
facing.

Basically, we are running a query with field collapsing (Solr 1.4 with patch
236). The response tells us that there are about 2700 documents matching our
query. However, I cannot get past the 431st document. From this point on, the
response will not contain any documents.

If I run the same query without collapsing then I can iterate through all
results without problem. This tells me that the problem is not related to the
shards.

Any hints?


Re: Benchmarking Solr

2010-04-10 Thread Jean-Sebastien Vachon

Hi,

Why don't you use JMeter? It would give you greater control over the tests
you wish to make. It has many different samplers that will let you run
different scenarios using your existing set of queries.

ab is great when you want to evaluate the performance of your server under
heavy load. But other than this, I don't see much use for it. JMeter offers
many more options once you get to know it a little.
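
That said, if you want something quick and dirty, you can also replay your 5000 queries with curl in a shell loop (a sketch; it assumes queries.txt holds one URL-encoded query per line and a default Solr URL):

while read q; do
  curl -s "http://localhost:8983/solr/select?q=${q}" > /dev/null
done < queries.txt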


good luck

- Original Message - 
From: Blargy zman...@hotmail.com

To: solr-user@lucene.apache.org
Sent: Friday, April 09, 2010 9:46 PM
Subject: Benchmarking Solr




I am about to deploy Solr into our production environment and I would like
to do some benchmarking to determine how many slaves I will need to set up.
Currently the only way I know how to benchmark is to use Apache Benchmark
but I would like to be able to send random requests to the Solr... not just
one request over and over.

I have a sample data set of 5000 user entered queries and I would like to be
able to use AB to benchmark against all these random queries. Is this
possible?

FYI our current index is ~1.5 gigs with ~5m documents and we will be using
faceting quite extensively. Our average requests per day are ~2m. We will be
running RHEL with about 8-12g ram. Any idea how many slaves might be
required to handle our load?
Thanks
--
View this message in context: 
http://n3.nabble.com/Benchmarking-Solr-tp709561p709561.html
Sent from the Solr - User mailing list archive at Nabble.com. 




Excluding field from the results

2010-03-26 Thread Jean-Sebastien Vachon
Hi,

Is there an easy way to prevent a field from being returned in the response?

we can use fl=field1, field2, field3, ...

but our software has an option that toggles whether a given field appears in
the response. So what I'd like to do is tell Solr to return all fields except
one.

Does Solr support this?

I imagine the syntax could look like this:

fl=*, -description

Since this is not working, is there any other way of doing this? Otherwise, I
will have to manage multiple lists of fields.

Thanks


Spatial queries

2010-03-23 Thread Jean-Sebastien Vachon
Hi All,

I am using the package from JTeam to perform spatial searches on my index. I'd 
like to know if it is possible
to build a query that uses multiple clauses. Here is an example:

q={!spatial lat=123 long=456 radius=10} OR {!spatial lat=111 long=222 
radius=20}title:java

Basically that would return all documents having the word java in the title 
field and that are either
within 10 miles from the first location OR 20 miles from the second.

I've made a few attempts but it does not seem to be supported. I'm still
wondering if it would make sense to support this kind of query.

I could use multiple queries and merge the results myself but then I need some 
faceting.

Thanks


Re: Recommended OS

2010-03-18 Thread Jean-Sebastien Vachon

On 2010-03-18, at 1:03 PM, K Wong wrote:

 http://wiki.apache.org/solr/FAQ#What_are_the_Requirements_for_running_a_Solr_server.3F
 
 I have Solr running on CentOS 5.4. It runs fine on the OpenJDK 1.6.0
 and Tomcat 5. If I were to do it again, I'd probably just stick with
 Jetty.

Would you mind explaining why you would stick with Jetty instead of Tomcat?


 You really will need to read the docs to get the settings right as
 there is no one-size-fits-all setting. (re your mem/dsk question)
 
 K
 
 
 
 On Thu, Mar 18, 2010 at 9:51 AM, blargy zman...@hotmail.com wrote:
 
 Does anyone have any recommendations on which OS to use when setting up Solr
 search server?
 
 Any memory/disk space recommendations?
 
 Thanks
 --
 View this message in context: 
 http://old.nabble.com/Recommended-OS-tp27948306p27948306.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 



Spatial search in Solr 1.5

2010-03-15 Thread Jean-Sebastien Vachon
Hi All,

I'm trying to figure out how to perform spatial searches using Solr 1.5 (from 
the trunk).

Is support for spatial search built in? None of the patches I tried could be
applied to the source tree. If it is built in, can someone tell me how to
configure it?

I find the available information very confusing so I hope someone will be able 
to give me some hints...

Thanks


Profiling Solr

2010-03-11 Thread Jean-Sebastien Vachon
Hi,

I'm trying to identify the bottleneck to get acceptable performance from a
single shard containing 4.7 million documents using my own machine (Mac Pro -
Quad Core with 8Gb of RAM, 4Gb of it allocated to the JVM).

I tried using YourKit but I don't get anything about Solr classes. I'm new to
YourKit so I might be doing something wrong, but it seems pretty
straightforward.

I am running Solr within a Tomcat instance within Eclipse. Does anyone have an 
idea about what could be wrong in my setup?

I'm making individual requests (one at a time) and the response times are 
horrible (about 15 sec on average). I need to bring this way below 1 second.

Here is a sample query:

http://localhost:8983/jobs_part3/select/?q=*:*&collapse=true&collapse.field=hash_id&facet=true&facet.field=county_id&facet.field=advertiser_id&facet.field=county_id&sort=county_id+asc&rows=100&collapse.type=adjacent

I know that collapsing results has a big hit on performance but it is a must 
have for us.

Thanks for any hints.

= JVM Parameters =

-Xms4g -Xmx4g -d64 -server


Multi valued fields

2010-03-11 Thread Jean-Sebastien Vachon
Hi All,

I'd like to know if it is possible to do the following on a multi-value field:

Given the following data:

document A:  field1 = [A B C D]
document B:  field1 = [A B]
document C:  field1 = [A]

Can I build a query such as:

-field1:A

which would return all documents that do not have exclusively A in their
field's values? By exclusive I mean that I don't want documents that only
have A in their list of values. In my sample case, the query would return
docs A and B, because they both have other values in field1.

Is this kind of query possible with Solr/Lucene?

Thanks





Re: Index size

2010-02-26 Thread Jean-Sebastien Vachon
Hi,

Each document can be up to 10K. Most of it comes from a single field which is
both indexed and stored. The data is uncompressed because compression would
eat up too much CPU considering the volume we have. We have around 30 fields
in all. We also need to compute some facets, collapse the documents forming
the result set, and be able to sort them on any field.

Thx

On 2010-02-25, at 5:50 PM, Otis Gospodnetic wrote:

 It depends on many factors - how big those docs are (compare a tweet to a 
 news article to a book chapter) whether you store the data or just index it, 
 whether you compress it, how and how much you analyze the data, etc.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Hadoop ecosystem search :: http://search-hadoop.com/
 
 
 
 - Original Message 
 From: Jean-Sebastien Vachon js.vac...@videotron.ca
 To: solr-user@lucene.apache.org
 Sent: Wed, February 24, 2010 8:57:21 AM
 Subject: Index size
 
 Hi All,
 
 I'm currently looking at integrating Solr and I'd like to have some hints
 on the size of the index (number of documents) I could possibly host on a
 server running a Double-Quad server (16 cores) with 48Gb of RAM running
 Linux. Basically, I need to determine how many of these servers would be
 required to host about half a billion documents. Should I set up multiple
 Solr instances (in Virtual Machines or not) or should I run a single
 instance (with multicores or not) using all available memory as the cache?

 I also made some tests with sharding on this same server and I could not
 see any improvement (at least not with 4.5 million documents). Should all
 the shards be hosted on different servers? I shall try with more documents
 in the following days.
 
 Thx 
 



Index size

2010-02-24 Thread Jean-Sebastien Vachon
Hi All,

I'm currently looking at integrating Solr and I'd like to have some hints on
the size of the index (number of documents) I could possibly host on a server
running a Double-Quad server (16 cores) with 48Gb of RAM running Linux.
Basically, I need to determine how many of these servers would be required to
host about half a billion documents. Should I set up multiple Solr instances
(in Virtual Machines or not) or should I run a single instance (with
multicores or not) using all available memory as the cache?

I also made some tests with sharding on this same server and I could not see
any improvement (at least not with 4.5 million documents). Should all the
shards be hosted on different servers? I shall try with more documents in the
following days.

Thx