Re: Bad fieldNorm when using morphologic synonyms

2013-12-26 Thread Isaac Hebsh
Attached a patch to the JIRA issue.
Reviews are welcome.


On Thu, Dec 19, 2013 at 7:24 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Roman, do you have any results?

 created SOLR-5561

 Robert, if I'm wrong, you are welcome to close that issue.


 On Mon, Dec 9, 2013 at 10:50 PM, Isaac Hebsh isaac.he...@gmail.comwrote:

 You can see the norm value, in the explain text, when setting
 debugQuery=true.
 If the same item gets different norm before/after, that's it.

 Note that this configuration is in schema.xml (not solrconfig.xml...)

 On Monday, December 9, 2013, Roman Chyla wrote:

 Isaac, is there an easy way to recognize this problem? We also index
 synonym tokens in the same position (like you do, and I'm sure that our
 positions are set correctly). I could test whether the default similarity
 factory in solrconfig.xml had any effect (before/after reindexing).

 --roman


 On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Hi Robert and Manuel.
 
  The DefaultSimilarity indeed sets discountOverlap to true by default.
  BUT, the *factory*, aka DefaultSimilarityFactory, when called by
  IndexSchema (the getSimilarity method), explicitly sets this value to
 the
  value of its corresponding class member.
  This class member is initialized to be FALSE  when the instance is
 created
  (like every boolean variable in the world). It should be set when
 init
  method is called. If the parameter is not set in schema.xml, the
 default is
  true.
 
  Everything seems to be alright, but the issue is that init method is
 NOT
  called, if the similarity is not *explicitly* declared in schema.xml.
 In
  that case, init method is not called, the discountOverlaps member (of
 the
  factory class) remains FALSE, and getSimilarity explicitly calls
  setDiscountOverlaps with value of FALSE.
 
  This is very easy to reproduce and debug.
 
 
  On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir rcm...@gmail.com wrote:
 
   no, its turned on by default in the default similarity.
  
   as i said, all that is necessary is to fix your analyzer to emit the
   proper position increments.
  
   On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
   manuel.lenorm...@gmail.com wrote:
In order to set discountOverlaps to true you must have added the
 <similarity class="solr.DefaultSimilarityFactory"/> to the
 schema.xml,
   which
is commented out by default!
   
As by default this param is false, the above situation is expected
 with
correct positioning, as said.
   
In order to fix the field norms you'd have to reindex with the
  similarity
class which initializes the param to true.
   
Cheers,
Manu
  
 





Re: LocalParam for nested query without escaping?

2013-12-19 Thread Isaac Hebsh
created SOLR-5560


On Tue, Dec 10, 2013 at 8:48 AM, William Bell billnb...@gmail.com wrote:

 Sounds like a bug.


 On Mon, Dec 9, 2013 at 1:16 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

  If so, can someone suggest how a query should be escaped (securely and
  correctly)?
  Should I escape the quote mark (and backslash mark itself) only?
 
 
  On Fri, Dec 6, 2013 at 2:59 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:
 
   Obviously, there is the option of external parameter ({...
   v=$nestedq}nestedq=...)
  
   This is a good solution, but it is not practical, when having a lot of
   such nested queries.
  
   Any ideas?
  
   On Friday, December 6, 2013, Isaac Hebsh wrote:
  
   We want to set a LocalParam on a nested query. When quering with v
   inline parameter, it works fine:
  
  
 
 http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text
  v="TERM2 TERM3 \"TERM4 TERM5\""}
  
   the parsedquery_toString is
   +id:TERM1 +(text:term2 text:term3 text:"term4 term5")
  
   Query using the _query_ also works fine:
  
  
 
 http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND _query_:"{!lucene
 df=text}TERM2 TERM3 \"TERM4 TERM5\""
  
   (parsedquery is exactly the same).
  
  
   BUT, when trying to put the nested query in place, it yields syntax
  error:
  
  
 
 http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text}(TERM2
  TERM3 TERM4 TERM5)
  
   org.apache.solr.search.SyntaxError: Cannot parse '(TERM2'
  
   The previous options are less preferred, because the escaping that
  should
   be made on the nested query.
  
   Can't I set a LocalParam to a nested query without escaping the query?
  
  
 



 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076



Re: Bad fieldNorm when using morphologic synonyms

2013-12-19 Thread Isaac Hebsh
Roman, do you have any results?

created SOLR-5561

Robert, if I'm wrong, you are welcome to close that issue.


On Mon, Dec 9, 2013 at 10:50 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 You can see the norm value, in the explain text, when setting
 debugQuery=true.
 If the same item gets different norm before/after, that's it.

 Note that this configuration is in schema.xml (not solrconfig.xml...)

 On Monday, December 9, 2013, Roman Chyla wrote:

 Isaac, is there an easy way to recognize this problem? We also index
 synonym tokens in the same position (like you do, and I'm sure that our
 positions are set correctly). I could test whether the default similarity
 factory in solrconfig.xml had any effect (before/after reindexing).

 --roman


 On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Hi Robert and Manuel.
 
  The DefaultSimilarity indeed sets discountOverlap to true by default.
  BUT, the *factory*, aka DefaultSimilarityFactory, when called by
  IndexSchema (the getSimilarity method), explicitly sets this value to
 the
  value of its corresponding class member.
  This class member is initialized to be FALSE  when the instance is
 created
  (like every boolean variable in the world). It should be set when init
  method is called. If the parameter is not set in schema.xml, the
 default is
  true.
 
  Everything seems to be alright, but the issue is that init method is
 NOT
  called, if the similarity is not *explicitly* declared in schema.xml. In
  that case, init method is not called, the discountOverlaps member (of
 the
  factory class) remains FALSE, and getSimilarity explicitly calls
  setDiscountOverlaps with value of FALSE.
 
  This is very easy to reproduce and debug.
 
 
  On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir rcm...@gmail.com wrote:
 
   no, its turned on by default in the default similarity.
  
   as i said, all that is necessary is to fix your analyzer to emit the
   proper position increments.
  
   On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
   manuel.lenorm...@gmail.com wrote:
In order to set discountOverlaps to true you must have added the
 <similarity class="solr.DefaultSimilarityFactory"/> to the
 schema.xml,
   which
is commented out by default!
   
As by default this param is false, the above situation is expected
 with
correct positioning, as said.
   
In order to fix the field norms you'd have to reindex with the
  similarity
class which initializes the param to true.
   
Cheers,
Manu
  
 




Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Isaac Hebsh
Hi Robert and Manuel.

The DefaultSimilarity indeed sets discountOverlaps to true by default.
BUT, the *factory*, aka DefaultSimilarityFactory, when called by
IndexSchema (the getSimilarity method), explicitly sets this value to the
value of its corresponding class member.
This class member is initialized to FALSE when the instance is created
(like every boolean variable in the world). It should be set when the init
method is called. If the parameter is not set in schema.xml, the default is
true.

The issue is that the init method is NOT called if the similarity is not
*explicitly* declared in schema.xml. In that case, the discountOverlaps
member (of the factory class) remains FALSE, and getSimilarity explicitly
calls setDiscountOverlaps with a value of FALSE.

This is very easy to reproduce and debug.
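
To illustrate, here is a minimal paraphrase of that flow (not the exact Solr
source; the class and field names just mirror what I see in the debugger):

import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.schema.SimilarityFactory;

// Paraphrase of the interaction described above (not the exact Solr source).
public class DefaultSimilarityFactorySketch extends SimilarityFactory {
  protected boolean discountOverlaps; // plain boolean field: FALSE until something sets it

  @Override
  public void init(SolrParams params) {
    super.init(params);
    // only reached when a <similarity> element is explicitly declared in schema.xml
    discountOverlaps = params.getBool("discountOverlaps", true);
  }

  @Override
  public Similarity getSimilarity() {
    DefaultSimilarity sim = new DefaultSimilarity();
    // if init() was never called, this passes FALSE and overrides the sane default
    sim.setDiscountOverlaps(discountOverlaps);
    return sim;
  }
}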


On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir rcm...@gmail.com wrote:

 no, its turned on by default in the default similarity.

 as i said, all that is necessary is to fix your analyzer to emit the
 proper position increments.

 On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
 manuel.lenorm...@gmail.com wrote:
  In order to set discountOverlaps to true you must have added the
  <similarity class="solr.DefaultSimilarityFactory"/> to the schema.xml,
 which
  is commented out by default!
 
  As by default this param is false, the above situation is expected with
  correct positioning, as said.
 
  In order to fix the field norms you'd have to reindex with the similarity
  class which initializes the param to true.
 
  Cheers,
  Manu



Re: LocalParam for nested query without escaping?

2013-12-09 Thread Isaac Hebsh
If so, can someone suggest how a query should be escaped (securely and
correctly)?
Should I escape the quote mark (and backslash mark itself) only?
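
For illustration, this is the kind of escaping I had in mind, assuming only
the backslash and the double quote are special inside a quoted local-param
value (a sketch, not a vetted solution):

public final class LocalParamEscapeSketch {
  // Hypothetical helper: escape a nested query for embedding in a quoted
  // local-param value. Assumes only backslash and double-quote need escaping.
  static String escapeLocalParamValue(String nestedQuery) {
    return nestedQuery.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  public static void main(String[] args) {
    String nested = "TERM2 TERM3 \"TERM4 TERM5\"";
    // embed the escaped query as the v local param of the nested {!lucene} query
    String q = "TERM1 AND {!lucene df=text v=\"" + escapeLocalParamValue(nested) + "\"}";
    System.out.println(q);
  }
}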


On Fri, Dec 6, 2013 at 2:59 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Obviously, there is the option of external parameter ({...
 v=$nestedq}nestedq=...)

 This is a good solution, but it is not practical, when having a lot of
 such nested queries.

 Any ideas?

 On Friday, December 6, 2013, Isaac Hebsh wrote:

 We want to set a LocalParam on a nested query. When quering with v
 inline parameter, it works fine:

 http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND
  {!lucene df=text v="TERM2 TERM3 \"TERM4 TERM5\""}

 the parsedquery_toString is
 +id:TERM1 +(text:term2 text:term3 text:"term4 term5")

 Query using the _query_ also works fine:

 http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND
  _query_:"{!lucene df=text}TERM2 TERM3 \"TERM4 TERM5\""

 (parsedquery is exactly the same).


 BUT, when trying to put the nested query in place, it yields syntax error:

 http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND
  {!lucene df=text}(TERM2 TERM3 TERM4 TERM5)

 org.apache.solr.search.SyntaxError: Cannot parse '(TERM2'

 The previous options are less preferred, because the escaping that should
 be made on the nested query.

 Can't I set a LocalParam to a nested query without escaping the query?




Re: Global query parameters to facet query

2013-12-09 Thread Isaac Hebsh
created SOLR-5542.
Anyone else want it?


On Thu, Dec 5, 2013 at 8:55 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Hi,

 It seems that a facet query does not use the global query parameters (for
 example, field aliasing for edismax parser).
 We have an intensive use of facet queries (in some cases, we have a lot of
 facet.query for a single q), and the using of LocalParams for each
 facet.query is not convenient.

 Did I miss a normal way to solve it?
 Did anyone else encountered this requirement?



Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Isaac Hebsh
You can see the norm value, in the explain text, when setting
debugQuery=true.
If the same item gets different norm before/after, that's it.

Note that this configuration is in schema.xml (not solrconfig.xml...)

On Monday, December 9, 2013, Roman Chyla wrote:

 Isaac, is there an easy way to recognize this problem? We also index
 synonym tokens in the same position (like you do, and I'm sure that our
 positions are set correctly). I could test whether the default similarity
 factory in solrconfig.xml had any effect (before/after reindexing).

 --roman


 On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh 
  isaac.he...@gmail.com
 wrote:

  Hi Robert and Manuel.
 
  The DefaultSimilarity indeed sets discountOverlap to true by default.
  BUT, the *factory*, aka DefaultSimilarityFactory, when called by
  IndexSchema (the getSimilarity method), explicitly sets this value to the
  value of its corresponding class member.
  This class member is initialized to be FALSE  when the instance is
 created
  (like every boolean variable in the world). It should be set when init
  method is called. If the parameter is not set in schema.xml, the default
 is
  true.
 
  Everything seems to be alright, but the issue is that init method is
 NOT
  called, if the similarity is not *explicitly* declared in schema.xml. In
  that case, init method is not called, the discountOverlaps member (of the
  factory class) remains FALSE, and getSimilarity explicitly calls
  setDiscountOverlaps with value of FALSE.
 
  This is very easy to reproduce and debug.
 
 
   On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir rcm...@gmail.com
 wrote:
 
   no, its turned on by default in the default similarity.
  
   as i said, all that is necessary is to fix your analyzer to emit the
   proper position increments.
  
   On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
    manuel.lenorm...@gmail.com wrote:
In order to set discountOverlaps to true you must have added the
 <similarity class="solr.DefaultSimilarityFactory"/> to the schema.xml,
   which
is commented out by default!
   
As by default this param is false, the above situation is expected
 with
correct positioning, as said.
   
In order to fix the field norms you'd have to reindex with the
  similarity
class which initializes the param to true.
   
Cheers,
Manu
  
 



Re: Bad fieldNorm when using morphologic synonyms

2013-12-06 Thread Isaac Hebsh
1) positions look all right (for me).
2) fieldNorm is determined by the size of the termVector, isn't it? the
termVector size isn't affected by the positions.


On Fri, Dec 6, 2013 at 10:46 AM, Robert Muir rcm...@gmail.com wrote:

 Your analyzer needs to set positionIncrement correctly: sounds like its
 broken.

 On Thu, Dec 5, 2013 at 1:53 PM, Isaac Hebsh isaac.he...@gmail.com wrote:
  Hi,
  we implemented a morphologic analyzer, which stems words on index time.
  For some reasons, we index both the original word and the stem (on the
 same
  position, of course).
  The stemming is done on a specific language, so other languages are not
  stemmed at all.
 
  Because of that, two documents with the same amount of terms, may have
  different termVector size. document which contains many words that being
  stemmed, will have a double sized termVector. This behaviour affects the
  relevance score in a BAD way. the fieldNorm of these documents reduces
  thier score. This is NOT the wanted behaviour in our case.
 
  We are looking for a way to mark the stemmed words (on index time, of
  course) so they won't affect the fieldNorm. Do such a way exist?
 
  Do you have another idea?



LocalParam for nested query without escaping?

2013-12-06 Thread Isaac Hebsh
We want to set a LocalParam on a nested query. When querying with the v inline
parameter, it works fine:
http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND
{!lucene df=text v="TERM2 TERM3 \"TERM4 TERM5\""}

the parsedquery_toString is
+id:TERM1 +(text:term2 text:term3 text:"term4 term5")

Query using the _query_ also works fine:
http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND
_query_:"{!lucene df=text}TERM2 TERM3 \"TERM4 TERM5\""

(parsedquery is exactly the same).


BUT, when trying to put the nested query in place, it yields syntax error:
http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND
{!lucene df=text}(TERM2 TERM3 TERM4 TERM5)

org.apache.solr.search.SyntaxError: Cannot parse '(TERM2'

The previous options are less preferred because of the escaping that must
be applied to the nested query.

Can't I set a LocalParam to a nested query without escaping the query?


Re: LocalParam for nested query without escaping?

2013-12-06 Thread Isaac Hebsh
Obviously, there is the option of an external parameter ({...
v=$nestedq}&nestedq=...)

This is a good solution, but it is not practical when having a lot of such
nested queries.

Any ideas?

On Friday, December 6, 2013, Isaac Hebsh wrote:

 We want to set a LocalParam on a nested query. When quering with v
 inline parameter, it works fine:

 http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND
  {!lucene df=text v="TERM2 TERM3 \"TERM4 TERM5\""}

 the parsedquery_toString is
 +id:TERM1 +(text:term2 text:term3 text:"term4 term5")

 Query using the _query_ also works fine:

 http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND
  _query_:"{!lucene df=text}TERM2 TERM3 \"TERM4 TERM5\""

 (parsedquery is exactly the same).


 BUT, when trying to put the nested query in place, it yields syntax error:

 http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND
  {!lucene df=text}(TERM2 TERM3 TERM4 TERM5)

 org.apache.solr.search.SyntaxError: Cannot parse '(TERM2'

 The previous options are less preferred, because the escaping that should
 be made on the nested query.

 Can't I set a LocalParam to a nested query without escaping the query?



Bad fieldNorm when using morphologic synonyms

2013-12-05 Thread Isaac Hebsh
Hi,
we implemented a morphologic analyzer, which stems words at index time.
For some reasons, we index both the original word and the stem (at the same
position, of course).
The stemming is done for a specific language, so other languages are not
stemmed at all.

Because of that, two documents with the same number of terms may have
different termVector sizes. A document which contains many words that get
stemmed will have a double-sized termVector. This behaviour affects the
relevance score in a BAD way: the fieldNorm of these documents reduces
their score. This is NOT the wanted behaviour in our case.

We are looking for a way to mark the stemmed words (at index time, of
course) so they won't affect the fieldNorm. Does such a way exist?

Do you have another idea?
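
For context, this is roughly how the stem is emitted at the same position (a
simplified sketch of our filter; stem() is a placeholder for the real
morphologic stemmer). Tokens emitted with a position increment of 0 are exactly
the ones we would like to keep out of the norm:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Simplified sketch of the morphologic filter described above (not the real one).
public final class StemAtSamePositionFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private State savedState;
  private String pendingStem;

  public StemAtSamePositionFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingStem != null) {
      restoreState(savedState);            // copy offsets etc. from the original token
      termAtt.setEmpty().append(pendingStem);
      posIncAtt.setPositionIncrement(0);   // same position => counted as an overlap
      pendingStem = null;
      savedState = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String original = termAtt.toString();
    String stem = stem(original);          // placeholder for the real stemmer
    if (!stem.equals(original)) {
      pendingStem = stem;
      savedState = captureState();
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingStem = null;
    savedState = null;
  }

  private String stem(String s) {
    return s; // stand-in; the real implementation is language specific
  }
}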


Global query parameters to facet query

2013-12-05 Thread Isaac Hebsh
Hi,

It seems that a facet query does not use the global query parameters (for
example, field aliasing for edismax parser).
We make intensive use of facet queries (in some cases, we have a lot of
facet.query parameters for a single q), and using LocalParams for each
facet.query is not convenient.

Did I miss a normal way to solve this?
Has anyone else encountered this requirement?
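
For clarity, the workaround we currently use is LocalParams on every
facet.query; a SolrJ sketch (field names are made up):

import org.apache.solr.client.solrj.SolrQuery;

public class FacetQueryLocalParamsSketch {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("shirt");
    q.setParam("defType", "edismax");
    q.setParam("qf", "title_t body_t");   // global params: applied to q, but NOT to facet.query
    q.setFacet(true);
    // each facet.query has to repeat its own parser and aliasing via LocalParams
    q.addFacetQuery("{!edismax qf='title_t body_t'}red");
    q.addFacetQuery("{!edismax qf='title_t body_t'}blue");
    System.out.println(q);
  }
}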


Re: Bad fieldNorm when using morphologic synonyms

2013-12-05 Thread Isaac Hebsh
The field is our main textual field. In the standard case, the
length-normalization does significant work together with tf-idf; we don't
want to avoid it.

Removing duplicates won't help here, because the terms are not duplicates.
One term is stemmed, and the other is not.


On Fri, Dec 6, 2013 at 9:48 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi Isaac,

 Did you consider omitting norms completely for that field? omitNorms=true
 Are you using solr.RemoveDuplicatesTokenFilterFactory?



 On Thursday, December 5, 2013 8:55 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

 Hi,
 we implemented a morphologic analyzer, which stems words on index time.
 For some reasons, we index both the original word and the stem (on the same
 position, of course).
 The stemming is done on a specific language, so other languages are not
 stemmed at all.

 Because of that, two documents with the same amount of terms, may have
 different termVector size. document which contains many words that being
 stemmed, will have a double sized termVector. This behaviour affects the
 relevance score in a BAD way. the fieldNorm of these documents reduces
 thier score. This is NOT the wanted behaviour in our case.

 We are looking for a way to mark the stemmed words (on index time, of
 course) so they won't affect the fieldNorm. Do such a way exist?

 Do you have another idea?



Re: Solr Result Tagging

2013-10-27 Thread Isaac Hebsh
Hi,
Try using facet.query on each part; you will get the total number of hits
for every OR branch.
If you need this info per document, the answers might appear when
specifying debugQuery=true. If that info is useful, try adding
[explain] to the fl param (probably requires registering the augmenter plugin
in solrconfig).
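
For example, a rough SolrJ sketch of the above (whether [explain] is available
depends on your solrconfig):

import org.apache.solr.client.solrj.SolrQuery;

public class ResultTaggingSketch {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("(A OR B OR C) OR (X AND Y AND Z)");
    q.setFacet(true);
    // one facet.query per OR branch gives the total hit count of each branch
    q.addFacetQuery("A OR B OR C");
    q.addFacetQuery("X AND Y AND Z");
    // per-document explanation of which clauses matched
    q.setFields("id", "[explain]");
    System.out.println(q);
  }
}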

- Isaac.

On Friday, October 25, 2013, Cool Techi wrote:

 Hi,
 My search queries to solr are of the following nature,
  (A OR B OR C) OR (X AND Y AND Z) OR ((ABC AND DEF) - XYZ)
 What I am trying to achieve is that when I fire the query, the results returned
 should be tagged with which part of the OR produced each result.
 In case all three parts above are applicable, the result should
 indicate the same. I tried the group.query feature, but it doesn't seem to
 work on SolrCloud.
 Thanks,Ayush



Re: Profiling Solr Lucene for query

2013-10-01 Thread Isaac Hebsh
Hi Dmitry,

I'm trying to examine your suggestion to create a frontend node. It sounds
pretty useful.
I saw that every node in a Solr cluster can serve requests for any collection,
even if it does not hold a core of that collection. Because of that, I
thought that adding a new node to the cluster (aka the frontend/gateway
server), and creating a dummy collection (with 1 dummy core), would solve
the problem.

But I see that a request which is sent to the gateway node is not then sent
to the shards. Instead, the request is proxied to a (random) core of the
requested collection, and from there it is sent to the shards. (This is
reasonable, because the SolrCore on the gateway might run with a different
configuration, etc.) This means that my new node isn't functioning as a
frontend (which is responsible for sorting, etc.), but as a poor load
balancer. No performance improvement will come from this implementation.

So, how do you suggest implementing a frontend? On the one hand, it has to
run a core of the target collection, but on the other hand, we don't want
it to hold any shard contents.


On Fri, Sep 13, 2013 at 1:08 PM, Dmitry Kan solrexp...@gmail.com wrote:

 Manuel,

 Whether to have the front end solr as aggregator of shard results depends
 on your requirements. To repeat, we found merging from many shards very
 inefficient fo our use case. It can be the opposite for you (i.e. requires
 testing). There are some limitations with distributed search, see here:

 http://docs.lucidworks.com/display/solr/Distributed+Search+with+Index+Sharding


 On Wed, Sep 11, 2013 at 3:35 PM, Manuel Le Normand 
 manuel.lenorm...@gmail.com wrote:

  Dmitry - currently we don't have such a front end, this sounds like a
 good
  idea creating it. And yes, we do query all 36 shards every query.
 
  Mikhail - I do think 1 minute is enough data, as during this exact
 minute I
  had a single query running (that took a qtime of 1 minute). I wanted to
  isolate these hard queries. I repeated this profiling few times.
 
  I think I will take the termInterval from 128 to 32 and check the
 results.
  I'm currently using NRTCachingDirectoryFactory
 
 
 
 
  On Mon, Sep 9, 2013 at 11:29 PM, Dmitry Kan solrexp...@gmail.com
 wrote:
 
   Hi Manuel,
  
   The frontend solr instance is the one that does not have its own index
  and
   is doing merging of the results. Is this the case? If yes, are all 36
   shards always queried?
  
   Dmitry
  
  
   On Mon, Sep 9, 2013 at 10:11 PM, Manuel Le Normand 
   manuel.lenorm...@gmail.com wrote:
  
Hi Dmitry,
   
I have solr 4.3 and every query is distributed and merged back for
   ranking
purpose.
   
What do you mean by frontend solr?
   
   
On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan solrexp...@gmail.com
  wrote:
   
 are you querying your shards via a frontend solr? We have noticed,
  that
 querying becomes much faster if results merging can be avoided.

 Dmitry


 On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand 
 manuel.lenorm...@gmail.com wrote:

  Hello all
  Looking on the 10% slowest queries, I get very bad performances
  (~60
sec
  per query).
  These queries have lots of conditions on my main field (more
 than a
  hundred), including phrase queries and rows=1000. I do return
 only
   id's
  though.
  I can quite firmly say that this bad performance is due to slow
   storage
  issue (that are beyond my control for now). Despite this I want
 to
 improve
  my performances.
 
  As tought in school, I started profiling these queries and the
 data
   of
~1
  minute profile is located here:
 
  http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg
 
  Main observation: most of the time I am waiting in readVInt, whose
 stacktrace
  (2 out of 2 thread dumps) is:
 
   catalina-exec-3870 - Thread t@6615
    java.lang.Thread.State: RUNNABLE
     at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
     at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
     at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
     at org.apache.lucene.index.TermContext.build(TermContext.java:95)
     at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
     at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
     at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
     at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
     at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
     at ...

Re: Profiling Solr Lucene for query

2013-10-01 Thread Isaac Hebsh
Hi Shawn,
I know that every node operates as a frontend. This is the way our cluster
currently runs.

If I separate the frontend from the nodes which hold the shards, I can give
it a different amount of CPUs and RAM (e.g. a large amount of RAM for the JVM,
because this server won't need the OS cache for reading the index, or more
CPUs because the merging process might be more CPU intensive).

Isn't it possible?


On Wed, Oct 2, 2013 at 12:42 AM, Shawn Heisey s...@elyograg.org wrote:

 On 10/1/2013 2:35 PM, Isaac Hebsh wrote:

 Hi Dmitry,

 I'm trying to examine your suggestion to create a frontend node. It sounds
 pretty usefull.
 I saw that every node in solr cluster can serve request for any
 collection,
 even if it does not hold a core of that collection. because of that, I
 thought that adding a new node to the cluster (aka, the frontend/gateway
 server), and creating a dummy collection (with 1 dummy core), will solve
 the problem.

 But, I see that a request which sent to the gateway node, is not then sent
 to the shards. Instead, the request is proxyed to a (random) core of the
 requested collection, and from there it is sent to the shards. (It is
 reasonable, because the SolrCore on the gateway might run with different
 configuration, etc). This means that my new node isn't functioning as a
 frontend (which responsible for sorting, etc.), but as a poor load
 balancer. No performance improvement will come from this implementation.

 So, how do you suggest to implement a frontend? On the one hand, it has to
 run a core of the target collection, but on the other hand, we don't want
 it to hold any shard contents.


 With SolrCloud, every node is a frontend node.  If you're running
 SolrCloud, then it doesn't make sense to try and use that concept.

 It only makes sense to create a frontend node (or core) if you are using
 traditional distributed search, where you need to include a shards
 parameter.

 http://wiki.apache.org/solr/DistributedSearch

 Thanks,
 Shawn




Considerations about setting maxMergedSegmentMB

2013-09-30 Thread Isaac Hebsh
Hi,
Trying to solve a query performance issue, we suspect the number of index
segments, which might slow the query (due to I/O seeks, which happen for each
term in the query, multiplied by the number of segments).
We are on Solr 4.3 (TieredMergePolicy with mergeFactor of 4).

We can reduce the number of segments by enlarging maxMergedSegmentMB, from
the default 5GB to something bigger (10GB, 15GB?).

What side effects should be considered when doing this?
Has anyone changed this setting in PROD for a while?
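
For reference, this is the Lucene-level knob that the solrconfig.xml setting
maps to (a sketch only, not our actual configuration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.util.Version;

public class MergePolicySketch {
  public static void main(String[] args) {
    TieredMergePolicy mp = new TieredMergePolicy();
    mp.setMaxMergedSegmentMB(10 * 1024);   // raise the 5GB default to 10GB
    mp.setMaxMergeAtOnce(4);               // mergeFactor-like settings
    mp.setSegmentsPerTier(4);
    IndexWriterConfig iwc =
        new IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
    iwc.setMergePolicy(mp);
    System.out.println(iwc.getMergePolicy());
  }
}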


Re: Data duplication using Cloud+HDFS+Mirroring

2013-09-30 Thread Isaac Hebsh
Hi Greg, Did you get an answer?
I'm interested in the same question.

More generally, what are the benefits of HdfsDirectoryFactory, besides the
transparent restore of the shard contents in case of a disk failure, and
the ability to rebuild the index using MapReduce?
Is the following statement accurate? Blocks of a particular shard which are
replicated to another node will never be queried, since there is no Solr
core configured to read them.


On Wed, Aug 7, 2013 at 8:46 PM, Greg Walters
gwalt...@sherpaanalytics.comwrote:

 While testing Solr's new ability to store data and transaction directories
 in HDFS I added an additional core to one of my testing servers that was
 configured as a backup (active but not leader) core for a shard elsewhere.
 It looks like this extra core copies the data into its own directory rather
 than just using the existing directory with the data that's already
 available to it.

 Since HDFS likely already has redundancy of the data covered via the
 replicationFactor is there a reason for non-leader cores to create their
 own data directory rather than doing reads on the existing master copy? I
 searched Jira for anything that suggests this behavior might change and
 didn't find any issues; is there any intent to address this?

 Thanks,
 Greg



Re: Getting a query parameter in a TokenFilter

2013-09-21 Thread Isaac Hebsh
Thinking about that again,
we can do this work as a search component which manipulates the query string.
The cons are the double QParser work and the double tokenization work.

Another approach which might solve this issue easily is Dynamic query
analyze chain: https://issues.apache.org/jira/browse/SOLR-5053

What would you do?
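
A rough sketch of that search-component approach (the parameter name and
transform() are made up; the real rewrite would mirror whatever the
TokenFilter should have done):

import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class QueryParamRewriteSketch extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) {
    SolrParams params = rb.req.getParams();
    String mode = params.get("myfilter.mode");   // hypothetical request parameter
    String q = params.get(CommonParams.Q);
    if (mode == null || q == null) {
      return;
    }
    ModifiableSolrParams rewritten = new ModifiableSolrParams(params);
    rewritten.set(CommonParams.Q, transform(q, mode));
    rb.req.setParams(rewritten); // QueryComponent will parse the rewritten q
  }

  private String transform(String q, String mode) {
    return q; // placeholder for the filter-like manipulation
  }

  @Override
  public void process(ResponseBuilder rb) {
    // nothing to do at process time
  }

  @Override
  public String getDescription() {
    return "rewrites q according to a custom request parameter (sketch)";
  }

  @Override
  public String getSource() {
    return "";
  }
}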


On Tue, Sep 17, 2013 at 10:31 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Hi everyone,

 We developed a TokenFilter.
 It should act differently, depends on a parameter supplied in the
 query (for query chain only, not the index one, of course).
 We found no way to pass that parameter into the TokenFilter flow. I guess
 that the root cause is because TokenFilter is a pure lucene object.

 As a last resort, we tried to pass the parameter as the first term in the
 query text (q=...), and save it as a member of the TokenFilter instance.

 Although it is ugly, it might work fine.
 But, the problem is that it is not guaranteed that all the terms of a
 particular query will be analyzed by the same instance of a TokenFilter. In
 this case, some terms will be analyzed without the required information of
 that parameter. We can produce such a race very easily.

 How should I overcome this issue?
 Do anyone have a better resolution?



Getting a query parameter in a TokenFilter

2013-09-17 Thread Isaac Hebsh
Hi everyone,

We developed a TokenFilter.
It should act differently, depending on a parameter supplied in the
query (for the query chain only, not the index one, of course).
We found no way to pass that parameter into the TokenFilter flow. I guess
that the root cause is because TokenFilter is a pure lucene object.

As a last resort, we tried to pass the parameter as the first term in the
query text (q=...), and save it as a member of the TokenFilter instance.

Although it is ugly, it might work fine.
But, the problem is that it is not guaranteed that all the terms of a
particular query will be analyzed by the same instance of a TokenFilter. In
this case, some terms will be analyzed without the required information of
that parameter. We can produce such a race very easily.

How should I overcome this issue?
Does anyone have a better resolution?


documentCache and lazyFieldLoading

2013-08-29 Thread Isaac Hebsh
Hi,
We've investigated a memory dump, which was taken after some frequent OOM
incidents.

The main issue we found was many millions of LazyField instances,
taking ~2GB of memory, even though queries request only about 10 small
fields.

We've found that LazyDocument creates a LazyField object for every item in
a multivalued field, even if we do not want this field.

For example, documents contain a multivalued field, named "f", with a lot
of values (let's say 100 values per document). Queries set fl=id (requesting
only the document id). The documentCache will grow in memory :(

In our case, documentCache was configured to 32000. There are 2 cores per
node, so 64000 LazyDocument instances are in memory. This is pretty big
number, and we'll reduce it.


I'm curious whether this is a known issue or not, and why the
LazyDocument should know the number of values in a multivalued field which is
not requested?

Another thought I had: is it reasonable to add something like
{!cache=false} which will affect the documentCache? For example, if my query
requests id only, with a big rows parameter, I don't want the documentCache to
hold these big LazyDocument objects.

Did anyone else encounter this?


Re: documentCache and lazyFieldLoading

2013-08-29 Thread Isaac Hebsh
Thanks Hoss.

1. We currently use Solr 4.3.0.
2. I understand this architecture of LazyFields, but i did not understand
why multiple LazyFields should be created for the multivalued field. You
can't load a part of them. If you request the field, you will get ALL of
its values. so 100 (or more) placeholders are not necessary in this case.
Moreover, why should Solr KNOW how much values are in that unloaded field?
3. In our poor case, we might handle some concurrent queries, each one
requests rows=2000.

What do you think about temporary disabling documentCache, for a specific
query?




On Thu, Aug 29, 2013 at 10:11 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : The main issue we found was a lot of millions of LazyField instances,
 : taking ~2GB of memory, even though queries request about 10 small fields
 : only.

 which version of Solr are you using?  there was a really bad bug with
 lazyFieldLoading fixed in Solr 4.2.1 (SOLR-4589)

 : We've found that LazyDocument creates a LazyField object for every item
 in
 : a multivalued field, even if do not want this field.

 right, that's exactly how lazyFieldLoading is designed to work -- instead
 of loading the full field values into ram, only a small LazyField object
 is loaded in it's place and that LazyField only fetches the underlying
 data if/when it's requested.

 If the LazyField instances weren't created as placeholders, subsequent
 requests for the document that *might* request additional fields (beyond
 the 10 small fields that were requested the first time) would have no
 way of knowing if/when those additional fields existed to be able to fetch
 them from the index.

 : In our case, documentCache was configured to 32000. There are 2 cores per
 : node, so 64000 LazyDocument instances are in memory. This is pretty big
 : number, and we'll reduce it.

 FWIW: Even at 1/10 that size, that seems like a ridiculously large
 documentCache to me.


 -Hoss



Re: Sending shard requests to all replicas

2013-07-31 Thread Isaac Hebsh
Thanks to Ryan Ernst, my issue is a duplicate of SOLR-4449.
I think that this proposal might be very useful (some supporting links are
attached there; worth reading).


On Tue, Jul 30, 2013 at 11:49 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Hi,
 I submitted a new JIRA for this:
 https://issues.apache.org/jira/browse/SOLR-5092

 A (very initial) patch is already attached. Reviews are very welcome.


 On Sun, Jul 28, 2013 at 4:50 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 You'd probably start in CloudSolrServer in SolrJ code,
 as far as I know that's where the request is sent out.

 I'd think that would be better than changing Solr itself
 since if you found that this was useful you wouldn't
 be patching your Solr release, just keeping your client
 up to date.

 Best
 Erick

 On Sat, Jul 27, 2013 at 7:28 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:
  Shawn, thank you for the tips.
  I know the significant cons of virtualization, but I don't want to move
  this thread into a virtualization pros/cons in the Solr(Cloud) case.
 
  I've just asked what is the minimal code change should be made, in
 order to
  examine whether this is a possible solution or not.. :)
 
 
  On Sun, Jul 28, 2013 at 1:06 AM, Shawn Heisey s...@elyograg.org
 wrote:
 
  On 7/27/2013 3:33 PM, Isaac Hebsh wrote:
   I have about 40 shards. repFactor=2.
   The cause of slower shards is very interesting, and this is the main
   approach we took.
   Note that in every query, it is another shard which is the slowest.
 In
  20%
   of the queries, the slowest shard takes about 4 times more than the
  average
   shard qtime.
   While continuing investigation, remember it might be the
 virtualization /
   storage-access / network / gc /..., so I thought that reducing the
 effect
   of the slow shards might be a good (temporary or permanent) solution.
 
  Virtualization is not the best approach for Solr.  Assuming you're
  dealing with your own hardware and not something based in the cloud
 like
  Amazon, you can get better results by running on bare metal and having
  multiple shards per host.
 
  Garbage collection is a very likely source of this problem.
 
  http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
 
   I thought it should be an almost trivial code change (for proving the
   concept). Isn't it?
 
  I have no idea what you're saying/asking here.  Can you clarify?
 
  It seems to me that sending requests to all replicas would just
 increase
  the overall load on the cluster, with no real benefit.
 
  Thanks,
  Shawn
 
 





Re: Sending shard requests to all replicas

2013-07-27 Thread Isaac Hebsh
Hi Erick, thanks.

I have about 40 shards. repFactor=2.
The cause of slower shards is very interesting, and this is the main
approach we took.
Note that in every query, it is another shard which is the slowest. In 20%
of the queries, the slowest shard takes about 4 times more than the average
shard qtime.
While continuing investigation, remember it might be the virtualization /
storage-access / network / gc /..., so I thought that reducing the effect
of the slow shards might be a good (temporary or permanent) solution.

I thought it should be an almost trivial code change (for proving the
concept). Isn't it?


On Sat, Jul 27, 2013 at 6:11 PM, Erick Erickson erickerick...@gmail.comwrote:

 This has been suggested, but so far it's not been implemented
 as far as I know.

 I'm curious though, how many shards are you dealing with? I
 wonder if it would be a better idea to try to figure out _why_
 you so often have a slow shard and whether the problem could
 be cured with, say, better warming queries on the shards...

 Best
 Erick

 On Fri, Jul 26, 2013 at 8:23 AM, Isaac Hebsh isaac.he...@gmail.com
 wrote:
  Hi!
 
  When SolrClound executes a query, it creates shard requests, which is
 sent
  to one replica of each shard. Total QTime is determined by the slowest
  shard response (plus some extra time). [For simplicity, let's assume that
  no stored fields are requested.]
 
  I suffer from a situation where in every query, some shards are much
 slower
  than others.
 
  We might consider a different approach, which sends the shard request to
  *ALL* replicas of each shard. Solr will continue when responses are got
  from at least one replica of each shard.
 
  Of course, the amount of work that is wasted is big (multiplied by
  replicationFactor), but in my case, there are very few concurrent
 queries,
  and the most important performance is the qtime. Such a solution might
  improve qtime significantly.
 
 
  Did someone tried this before?
  Any tip from where should I start in the code?



Re: Sending shard requests to all replicas

2013-07-27 Thread Isaac Hebsh
Shawn, thank you for the tips.
I know the significant cons of virtualization, but I don't want to move
this thread into a virtualization pros/cons in the Solr(Cloud) case.

I've just asked what minimal code change should be made, in order to
examine whether this is a possible solution or not. :)


On Sun, Jul 28, 2013 at 1:06 AM, Shawn Heisey s...@elyograg.org wrote:

 On 7/27/2013 3:33 PM, Isaac Hebsh wrote:
  I have about 40 shards. repFactor=2.
  The cause of slower shards is very interesting, and this is the main
  approach we took.
  Note that in every query, it is another shard which is the slowest. In
 20%
  of the queries, the slowest shard takes about 4 times more than the
 average
  shard qtime.
  While continuing investigation, remember it might be the virtualization /
  storage-access / network / gc /..., so I thought that reducing the effect
  of the slow shards might be a good (temporary or permanent) solution.

 Virtualization is not the best approach for Solr.  Assuming you're
 dealing with your own hardware and not something based in the cloud like
 Amazon, you can get better results by running on bare metal and having
 multiple shards per host.

 Garbage collection is a very likely source of this problem.

 http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems

  I thought it should be an almost trivial code change (for proving the
  concept). Isn't it?

 I have no idea what you're saying/asking here.  Can you clarify?

 It seems to me that sending requests to all replicas would just increase
 the overall load on the cluster, with no real benefit.

 Thanks,
 Shawn




Sending shard requests to all replicas

2013-07-26 Thread Isaac Hebsh
Hi!

When SolrCloud executes a query, it creates shard requests, which are sent
to one replica of each shard. Total QTime is determined by the slowest
shard response (plus some extra time). [For simplicity, let's assume that
no stored fields are requested.]

I suffer from a situation where in every query, some shards are much slower
than others.

We might consider a different approach, which sends the shard request to
*ALL* replicas of each shard. Solr would continue as soon as a response has
arrived from at least one replica of each shard.

Of course, the amount of work that is wasted is big (multiplied by
replicationFactor), but in my case, there are very few concurrent queries,
and the most important performance is the qtime. Such a solution might
improve qtime significantly.


Has anyone tried this before?
Any tip on where I should start in the code?


MoinMoin Dump

2013-07-17 Thread Isaac Hebsh
Hi,

There was a thread about viewing the Solr Wiki offline, about 6 months ago. I'm
interested, too.

It seems that a manual (cron?) dump will do the work...

Would it be too much to ask that one of the admins will manually create
such a dump? (http://moinmo.in/HelpOnMoinCommand/ExportDump)

Otis, is there any progress made on this in Apache Infra?


Re: Wildcards and Phrase queries

2013-06-23 Thread Isaac Hebsh
Ahmet, it looks great!

Can you tell us why this code hasn't been committed into lucene+solr trunk?
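
For anyone else following: if I read the attachment correctly, it wraps
Lucene's contrib ComplexPhraseQueryParser, which can also be used directly at
the Lucene level, e.g.:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ComplexPhraseSketch {
  public static void main(String[] args) throws Exception {
    ComplexPhraseQueryParser parser = new ComplexPhraseQueryParser(
        Version.LUCENE_43, "text", new StandardAnalyzer(Version.LUCENE_43));
    // a wildcard inside a quoted phrase, which the standard parsers reject
    Query q = parser.parse("\"john smi*\"");
    System.out.println(q);
  }
}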


On Sun, Jun 23, 2013 at 2:28 PM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi Isaac,

 ComplexPhrase-4.2.1.zip should work with solr4.2.1. Zipball contains a
 ReadMe.txt file about instructions.


 You could try with higher solr versions too. If it does not work, please
 lets us know.


 https://issues.apache.org/jira/secure/attachment/12579832/ComplexPhrase-4.2.1.zip



 
  From: Isaac Hebsh isaac.he...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Saturday, June 22, 2013 9:33 PM
 Subject: Re: Wildcards and Phrase queries


 Thanks Erick.
 Maybe lucene (java-user) is a better mailing list to ask in?


 On Sat, Jun 22, 2013 at 7:30 AM, Erick Erickson erickerick...@gmail.com
 wrote:

  Wouldn't imagine they're production ready, they haven't been touched
  in months.
 
  So I'd say you're on your own here in terms of whether you wanted
  to use these for production.
 
  I confess I don't know what state they were left in or why they were
  never committed.
 
  FWIW,
  Erick
 
  On Wed, Jun 19, 2013 at 10:08 AM, Isaac Hebsh isaac.he...@gmail.com
  wrote:
   Hi,
  
   I'm trying to understand what is the status of enabling wildcards on
  phrase
   queries?
  
   Lucene JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1486
   Solr JIRA issue: https://issues.apache.org/jira/browse/SOLR-1604
  
   It looks like these issues are not going to be solved in the close
 future
   :( Will they? Did they came into a (partially) dead-end, in the current
   approach. Can I contribute anything to make them fixed into an official
   version?
  
   Does the lastest patches which attached to rthe JIRAs are production
  ready?
  
   [Should this message be sent to java-user list?]
 



Re: Wildcards and Phrase queries

2013-06-22 Thread Isaac Hebsh
Thanks Erick.
Maybe lucene (java-user) is a better mailing list to ask in?


On Sat, Jun 22, 2013 at 7:30 AM, Erick Erickson erickerick...@gmail.comwrote:

 Wouldn't imagine they're production ready, they haven't been touched
 in months.

 So I'd say you're on your own here in terms of whether you wanted
 to use these for production.

 I confess I don't know what state they were left in or why they were
 never committed.

 FWIW,
 Erick

 On Wed, Jun 19, 2013 at 10:08 AM, Isaac Hebsh isaac.he...@gmail.com
 wrote:
  Hi,
 
  I'm trying to understand what is the status of enabling wildcards on
 phrase
  queries?
 
  Lucene JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1486
  Solr JIRA issue: https://issues.apache.org/jira/browse/SOLR-1604
 
  It looks like these issues are not going to be solved in the close future
  :( Will they? Did they came into a (partially) dead-end, in the current
  approach. Can I contribute anything to make them fixed into an official
  version?
 
  Does the lastest patches which attached to rthe JIRAs are production
 ready?
 
  [Should this message be sent to java-user list?]



Wildcards and Phrase queries

2013-06-19 Thread Isaac Hebsh
Hi,

I'm trying to understand what the status is of enabling wildcards in phrase
queries.

Lucene JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1486
Solr JIRA issue: https://issues.apache.org/jira/browse/SOLR-1604

It looks like these issues are not going to be solved in the near future
:( Will they? Did they come to a (partial) dead-end with the current
approach? Can I contribute anything to get them fixed in an official
version?

Are the latest patches attached to the JIRAs production ready?

[Should this message be sent to java-user list?]


OutOfMemory while indexing (PROD environment!)

2013-06-06 Thread Isaac Hebsh
Hi everyone,

My SolrCloud cluster (4.3.0) came into production a few days ago.
Docs are being indexed into Solr using /update requestHandler, as a POST
request, containing text/xml content-type.

The collection is sharded into 36 pieces, each shard has two replicas.
There are 36 nodes (each node on separate virtual machine), so each node
holds exactly 2 cores.

Each update request contains 100 docs, which means 2-3 docs for each shard.
There are 1-2 such requests every minute. Soft-commit happens every 10
minutes, Hard-commit every 30 minutes, and ramBufferSizeMB=128.

After 48 hours of zero problems, suddenly one shard went down (both of its
cores). The log says it's OOM (GC overhead limit exceeded). The JVM is set to
Xmx=4G.
I'm pretty sure that some minutes before this incident, JVM memory wasn't
so high (even the max memory usage indicator was below 2G).

Indexing requests did not stop, and started getting HTTP 503 errors (no
server hosting shard). At this time, some other cores started to go down
(I had all of the rainbow colors: Active, Recovering, Down, Recovery Failed
and Gone :).

Then I tried to restart Tomcat on the down nodes, but some of them failed
to start, due to the error message "we are not the leader". Only shutting
down both cores and starting them gradually solved the problem,
and the whole cluster came back to a green state.

Solr is not yet exposed to users, so no queries have been made at that time
(but maybe some non-heavy auto-warm queries were executed).

I don't think that all of the 4GB were being used for justifiable reasons.
I guess that adding more RAM will not solve the problem in the long term.

Where should I start my log investigation? (About the OOM itself, and about
the chain accident that came after it.)

I did a search for previous similar issues. There are a lot, but most of
them talks about very old versions of Solr.

[Versions:
Solr: 4.3.0
Tomcat 7
JVM: Oracle 7 (last, standard, JRE), 64bit.
OS: RedHat 6.3]


Re: Prevention of heavy wildcard queries

2013-06-02 Thread Isaac Hebsh
Hi everyone.

I came across another need for term extraction: I want to find pairs of
words that appear in queries together. All of the clustering work is
ready, and the only hole is how to get the basic terms from the query.

Has nobody tried this before? Is there no clean way to do it?


On Tue, May 28, 2013 at 7:08 AM, Isaac Hebsh isaac.he...@gmail.com wrote:

 I don't want to affect on the (correctness of the) real query parsing, so
 creating a QParserPlugin is risky.
 Instead, If I'll parse the query in my search component, it will be
 detached from the real query parsing, (obviously this causes double
 parsing, but assume it's OK)...


 On Tue, May 28, 2013 at 3:52 AM, Roman Chyla roman.ch...@gmail.comwrote:

 Hi Issac,
 it is as you say, with the exception that you create a QParserPlugin, not
 a
 search component

 * create QParserPlugin, give it some name, eg. 'nw'
 * make a copy of the pipeline - your component should be at the same
 place,
 or just above, the wildcard processor

 also make sure you are setting your qparser for FQ queries, ie.
 fq={!nw}foo


 On Mon, May 27, 2013 at 5:01 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Thanks Roman.
  Based on some of your suggestions, will the steps below do the work?
 
  * Create (and register) a new SearchComponent
  * In its prepare method: Do for Q and all of the FQs (so this
  SearchComponent should run AFTER QueryComponent, in order to see all of
 the
  FQs)
  * Create
 org.apache.lucene.queryparser.flexible.core.StandardQueryParser,
  with a special implementation of QueryNodeProcessorPipeline, which
 contains
  my NodeProcessor in the top of its list.
  * Set my analyzer into that StandardQueryParser
  * My NodeProcessor will be called for each term in the query, so it can
  throw an exception if a (basic) querynode contains wildcard in both
 start
  and end of the term.
 
  Do I have a way to avoid from reimplementing the whole
 StandardQueryParser
  class?
 

 you can try subclassing it, if it allows it


  Will this work for both LuceneQParser and EdismaxQParser queries?
 

 this will not work for edismax, nothing but changing the edismax qparser
 will do the trick


 
  Any other solution/work-around? How do other production environments of
  Solr overcome this issue?
 

 you can also try modifying the standard solr parser, or even the JavaCC
 generated classes
 I believe many people do just that (or some sort of preprocessing)

 roman


 
 
  On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
 
   You are right that starting to parse the query before the query
 component
   can get soon very ugly and complicated. You should take advantage of
 the
   flex parser, it is already in lucene contrib - but if you are
 interested
  in
   the better version, look at
   https://issues.apache.org/jira/browse/LUCENE-5014
  
   The way you can solve this is:
  
   1. use the standard syntax grammar (which allows *foo*)
   2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case,
 or
   raise error etc
  
   this way, you are changing semantics - but don't need to touch the
 syntax
   definition; of course, you may also change the grammar and allow only
 one
   instance of wildcard (or some combination) but for that you should
  probably
   use LUCENE-5014
  
   roman
  
   On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com
   wrote:
  
Hi.
   
Searching terms with wildcard in their start, is solved with
ReversedWildcardFilterFactory. But, what about terms with wildcard
 in
   both
start AND end?
   
This query is heavy, and I want to disallow such queries from my
 users.
   
I'm looking for a way to cause these queries to fail.
I guess there is no built-in support for my need, so it is OK to
 write
  a
new solution.
   
My current plan is to create a search component (which will run
 before
QueryComponent). It should analyze the query string, and to drop the
   query
if too heavy wildcard are found.
   
Another option is to create a query parser, which wraps the current
(specified or default) qparser, and does the same work as above.
   
These two options require an analysis of the query text, which
 might be
   an
ugly work (just think about nested queries [using _query_], OR even
 a
  lot
of more basic scenarios like quoted terms, etc.)
   
Am I missing a simple and clean way to do this?
What would you do?
   
P.S. if no simple solution exists, timeAllowed limit is the best
work-around I could think about. Any other suggestions?
   
  
 





Prevention of heavy wildcard queries

2013-05-27 Thread Isaac Hebsh
Hi.

Searching for terms with a wildcard at their start is solved with
ReversedWildcardFilterFactory. But what about terms with a wildcard at both
start AND end?

This query is heavy, and I want to disallow such queries from my users.

I'm looking for a way to cause these queries to fail.
I guess there is no built-in support for my need, so it is OK to write a
new solution.

My current plan is to create a search component (which will run before
QueryComponent). It should analyze the query string, and drop the query
if too-heavy wildcards are found.

Another option is to create a query parser, which wraps the current
(specified or default) qparser, and does the same work as above.

These two options require an analysis of the query text, which might be
ugly work (just think about nested queries [using _query_], or even a lot
of more basic scenarios like quoted terms, etc.)

Am I missing a simple and clean way to do this?
What would you do?

P.S. if no simple solution exists, timeAllowed limit is the best
work-around I could think about. Any other suggestions?
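
Just to make the search-component idea concrete, a naive sketch (whitespace
splitting only, so it ignores the quoted-term and nested-query problems
mentioned above, and it does not look at the FQs):

import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class HeavyWildcardGuardSketch extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) {
    String q = rb.req.getParams().get(CommonParams.Q);
    if (q == null) {
      return;
    }
    for (String term : q.split("\\s+")) {
      boolean leading = term.startsWith("*") || term.startsWith("?");
      boolean trailing = term.endsWith("*") || term.endsWith("?");
      if (term.length() > 1 && leading && trailing) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "wildcard at both start and end is not allowed: " + term);
      }
    }
  }

  @Override
  public void process(ResponseBuilder rb) {
    // query rejection happens in prepare(); nothing to do here
  }

  @Override
  public String getDescription() {
    return "rejects *foo* style terms (sketch)";
  }

  @Override
  public String getSource() {
    return "";
  }
}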


Re: Prevention of heavy wildcard queries

2013-05-27 Thread Isaac Hebsh
Thanks Roman.
Based on some of your suggestions, will the steps below do the work?

* Create (and register) a new SearchComponent
* In its prepare method: Do for Q and all of the FQs (so this
SearchComponent should run AFTER QueryComponent, in order to see all of the
FQs)
* Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser,
with a special implementation of QueryNodeProcessorPipeline, which contains
my NodeProcessor in the top of its list.
* Set my analyzer into that StandardQueryParser
* My NodeProcessor will be called for each term in the query, so it can
throw an exception if a (basic) querynode contains wildcard in both start
and end of the term.

Do I have a way to avoid reimplementing the whole StandardQueryParser
class?
Will this work for both LuceneQParser and EdismaxQParser queries?

Any other solution/work-around? How do other production environments of
Solr overcome this issue?


On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com wrote:

 You are right that starting to parse the query before the query component
 can get soon very ugly and complicated. You should take advantage of the
 flex parser, it is already in lucene contrib - but if you are interested in
 the better version, look at
 https://issues.apache.org/jira/browse/LUCENE-5014

 The way you can solve this is:

 1. use the standard syntax grammar (which allows *foo*)
 2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or
 raise error etc

 this way, you are changing semantics - but don't need to touch the syntax
 definition; of course, you may also change the grammar and allow only one
 instance of wildcard (or some combination) but for that you should probably
 use LUCENE-5014

 roman

 On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Hi.
 
  Searching terms with wildcard in their start, is solved with
  ReversedWildcardFilterFactory. But, what about terms with wildcard in
 both
  start AND end?
 
  This query is heavy, and I want to disallow such queries from my users.
 
  I'm looking for a way to cause these queries to fail.
  I guess there is no built-in support for my need, so it is OK to write a
  new solution.
 
  My current plan is to create a search component (which will run before
  QueryComponent). It should analyze the query string, and to drop the
 query
  if too heavy wildcard are found.
 
  Another option is to create a query parser, which wraps the current
  (specified or default) qparser, and does the same work as above.
 
  These two options require an analysis of the query text, which might be
 an
  ugly work (just think about nested queries [using _query_], OR even a lot
  of more basic scenarios like quoted terms, etc.)
 
  Am I missing a simple and clean way to do this?
  What would you do?
 
  P.S. if no simple solution exists, timeAllowed limit is the best
  work-around I could think about. Any other suggestions?
 



Re: Prevention of heavy wildcard queries

2013-05-27 Thread Isaac Hebsh
I don't want to affect the (correctness of the) real query parsing, so
creating a QParserPlugin is risky.
Instead, if I parse the query in my search component, it will be
detached from the real query parsing (obviously this causes double
parsing, but assume that's OK)...


On Tue, May 28, 2013 at 3:52 AM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi Issac,
 it is as you say, with the exception that you create a QParserPlugin, not a
 search component

 * create QParserPlugin, give it some name, eg. 'nw'
 * make a copy of the pipeline - your component should be at the same place,
 or just above, the wildcard processor

 also make sure you are setting your qparser for FQ queries, ie.
 fq={!nw}foo


 On Mon, May 27, 2013 at 5:01 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Thanks Roman.
  Based on some of your suggestions, will the steps below do the work?
 
  * Create (and register) a new SearchComponent
  * In its prepare method: Do for Q and all of the FQs (so this
  SearchComponent should run AFTER QueryComponent, in order to see all of
 the
  FQs)
  * Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser,
  with a special implementation of QueryNodeProcessorPipeline, which
 contains
  my NodeProcessor in the top of its list.
  * Set my analyzer into that StandardQueryParser
  * My NodeProcessor will be called for each term in the query, so it can
  throw an exception if a (basic) querynode contains wildcard in both start
  and end of the term.
 
  Do I have a way to avoid from reimplementing the whole
 StandardQueryParser
  class?
 

 you can try subclassing it, if it allows it


  Will this work for both LuceneQParser and EdismaxQParser queries?
 

 this will not work for edismax, nothing but changing the edismax qparser
 will do the trick


 
  Any other solution/work-around? How do other production environments of
  Solr overcome this issue?
 

 you can also try modifying the standard solr parser, or even the JavaCC
 generated classes
 I believe many people do just that (or some sort of preprocessing)

 roman


 
 
  On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
 
   You are right that starting to parse the query before the query
 component
   can get soon very ugly and complicated. You should take advantage of
 the
   flex parser, it is already in lucene contrib - but if you are
 interested
  in
   the better version, look at
   https://issues.apache.org/jira/browse/LUCENE-5014
  
   The way you can solve this is:
  
   1. use the standard syntax grammar (which allows *foo*)
   2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case,
 or
   raise error etc
  
   this way, you are changing semantics - but don't need to touch the
 syntax
   definition; of course, you may also change the grammar and allow only
 one
   instance of wildcard (or some combination) but for that you should
  probably
   use LUCENE-5014
  
   roman
  
   On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com
   wrote:
  
Hi.
   
Searching terms with wildcard in their start, is solved with
ReversedWildcardFilterFactory. But, what about terms with wildcard in
   both
start AND end?
   
This query is heavy, and I want to disallow such queries from my
 users.
   
I'm looking for a way to cause these queries to fail.
I guess there is no built-in support for my need, so it is OK to
 write
  a
new solution.
   
My current plan is to create a search component (which will run
 before
QueryComponent). It should analyze the query string, and to drop the
   query
if too heavy wildcard are found.
   
Another option is to create a query parser, which wraps the current
(specified or default) qparser, and does the same work as above.
   
These two options require an analysis of the query text, which might
 be
   an
ugly work (just think about nested queries [using _query_], OR even a
  lot
of more basic scenarios like quoted terms, etc.)
   
Am I missing a simple and clean way to do this?
What would you do?
   
P.S. if no simple solution exists, timeAllowed limit is the best
work-around I could think about. Any other suggestions?
   
  
 



Re: SurroundQParser does not analyze the query text

2013-05-17 Thread Isaac Hebsh
Thank you Erik and Jack.

I opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-4834
I wish I will have time to submit a patch file soon.


On Fri, May 17, 2013 at 7:38 AM, Jack Krupansky j...@basetechnology.comwrote:

 (Erik: Or he can get the LucidWorks Search product and then use near and
 before operators so that he doesn't need the surround query parser!)

 -- Jack Krupansky

 -Original Message- From: Erik Hatcher
 Sent: Thursday, May 16, 2013 6:11 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SurroundQParser does not analyze the query text


 The issue can certainly be solved.  But to me, it's actually a bit of a
 feature by design for the Lucene-level surround query parser to not do
 analysis, as it seems to have been meant for advanced query writers to
 piece together sophisticated SpanQuery-based pattern matching kinds of
 things utilizing their knowledge of how text was analyzed and indexed.

 But for sure it could be modified to do analysis, probably using the
 multiterm analyzer feature in there now elsewhere now.  I looked into
 this when I did the basic work of integrating the surround query parser,
 and determined it was a lot of work because it'd need changes in the Lucene
 level code to leverage analysis, and then glue at the Solr level to be
 field type aware and savvy.

 By all means open and JIRA and contribute!

 Workaround?  Client-side calls can be made to analyze text, and the
 client-side could build up a query expression based on term-by-term (or
 phrase) analysis results.  Maybe that means a prohibitive number of
 requests to Solr to build up a query in a way that leverages Solr's field
 type analysis settings, but it is a technologically possible technique
 maybe worth considering.

 Erik



 On May 16, 2013, at 16:38 , Isaac Hebsh wrote:

  Hi,

 I'm trying to use Surround Query Parser for two reasons, which are not
 covered by proximity slops:
 1. find documents with two words within a given distance, *unordered*
 2. given two lists of words, find documents with (at least) one word from
 list A and (at least) one word from list B, within a given distance.

 The surround query parser looks great, but it have one big drawback - It
 does not analyze the query text. It is documented in the [weak :(] wiki
 page.

 Can this issue be solved somehow, or it is a bigger constraint?
 Should I open a JIRA issue for this?
 Any work-around?





Bloom Filters

2013-05-17 Thread Isaac Hebsh
Hi everyone..

I'm indexing docs into Solr using the update request handler, by POSTing
data to the REST endpoint (not SolrJ, not DIH).
My indexer should return an indication of whether the document already
existed in the collection, based on its ID.

The obvious solution is to perform a query before trying to index the
document. Do I have any better choice?
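
For example, something lightweight like the real-time get handler might do it
(just a sketch; handler path and core name follow the default example config):

  http://localhost:8983/solr/collection1/get?id=SOME_ID&fl=id

If the returned doc is null, the ID does not currently exist in the collection.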

If the query approach is chosen, I thought that BloomFilters might make
this request very efficient. After searching in wiki and JIRA, I found this:
http://wiki.apache.org/solr/BloomIndexComponent

This JIRA issue is very old, and never managed to get resolved. What effort
would be needed in order to get it resolved?


SurroundQParser does not analyze the query text

2013-05-16 Thread Isaac Hebsh
Hi,

I'm trying to use Surround Query Parser for two reasons, which are not
covered by proximity slops:
1. find documents with two words within a given distance, *unordered*
2. given two lists of words, find documents with (at least) one word from
list A and (at least) one word from list B, within a given distance.
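
For reference, in surround syntax I would expect these two cases to look
roughly like the following (untested; N is the unordered operator, W the
ordered one, and the number is the maximum distance):

  q={!surround} 5n(apple, orange)
  q={!surround} 5n(or(a1, a2, a3), or(b1, b2))

which is exactly where the lack of analysis hurts - the terms are matched
against the index as-is.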

The surround query parser looks great, but it has one big drawback - it
does not analyze the query text. It is documented in the [weak :(] wiki
page.

Can this issue be solved somehow, or it is a bigger constraint?
Should I open a JIRA issue for this?
Any work-around?


Re: Basic auth on SolrCloud /admin/* calls

2013-03-29 Thread Isaac Hebsh
Hi Tim,
Are you running Solr 4.2? (In 4.0 and 4.1, the Collections API didn't
return any failure message. see SOLR-4043 issue).

As far as I know, you can't tell Solr to use authentication credentials
when communicating with other nodes. It's a bigger issue... for example, if
you want to protect the /update requestHandler so that unauthorized users
can't delete your whole collection, it can interfere with the replication
process.

I think it's a necessary mechanism in a production environment... I'm curious
how people use SolrCloud in production without it.





On Fri, Mar 29, 2013 at 3:42 AM, Vaillancourt, Tim tvaillanco...@ea.comwrote:

 Hey guys,

 I've recently setup basic auth under Jetty 8 for all my Solr 4.x
 '/admin/*' calls, in order to protect my Collections and Cores API.

 Although the security constraint is working as expected ('/admin/*' calls
 require Basic Auth or return 401), when I use the Collections API to create
 a collection, I receive a 200 OK to the Collections API CREATE call, but
 the background Cores API calls that are ran on the Collection API's behalf
 fail on the Basic Auth on other nodes with a 401 code, as I should have
 foreseen, but didn't.

 Is there a way to tell SolrCloud to use authentication on internal Cores
 API calls that are spawned on Collections API's behalf, or is this a new
 feature request?

 To reproduce:

 1.   Implement basic auth on '/admin/*' URIs.

 2.   Perform a CREATE Collections API call to a node (which will
 return 200 OK).

 3.   Notice all Cores API calls fail (Collection isn't created). See
 stack trace below from the node that was issued the CREATE call.

 The stack trace I get is:

 org.apache.solr.common.SolrException: Server at http://HOST HERE:8983/solr
 returned non ok status:401, message:Unauthorized
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 at
 org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:169)
 at
 org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:135)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:662)

 Cheers!

 Tim





Re: Combining Solr Indexes at SolrCloud

2013-03-29 Thread Isaac Hebsh
Let's say you have machine A and machine B, and you want to shut down B.
If all the shards on B have replicas (on A), you can shut down B instantly.
If there is a shard on B that has no replica, you should create one on
machine A (using the Core API), let it replicate the whole shard contents, and
then you are safe to shut down B.

[Changing the shard count of an existing collection is not possible for
now, so MERGing cores is not relevant.]
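
A sketch of such a Core API call (host and names are placeholders; in 4.x,
passing the collection and shard parameters should make the new core register
under that shard and recover from its leader):

  http://machineA:8983/solr/admin/cores?action=CREATE&name=mycoll_shard2_replica2&collection=mycoll&shard=shard2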


On Fri, Mar 29, 2013 at 11:23 AM, Furkan KAMACI furkankam...@gmail.comwrote:

 Let's assume that I have two machine in a SolrCloud that works as a part of
 cloud. If I want to shutdown one of them an combine its indexes into other
 how can I do that?



Solr 4.2 - DocValues on id field

2013-03-13 Thread Isaac Hebsh
Hi,

The example schema.xml in Solr 4.2 does not define the id field
with docValues=true.
Is there a good reason? (other than backward compatibility with indexes from
previous versions...)

If my common case is fl=id (and no other field), DocValues is a classic fit
for me. Am I right?
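
In other words, I am asking why the example does not ship with something like
this (just a sketch, field and type names taken from the example schema):

  <field name="id" type="string" indexed="true" stored="true" required="true" docValues="true" />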


Any documentation on Solr MBeans?

2013-03-07 Thread Isaac Hebsh
Hi,

I'm trying to monitor some Solr behaviour, using JMX.
It looks like a great job was done there, but I can't find any
documentation on the MBeans themselves.

For example, the DirectUpdateHandler2 attributes: what is the difference
between adds and cumulative_adds? Does adds count only the last X seconds?
Or maybe cumulative_adds survives a core reload?


Re: Timestamp field is changed on update

2013-02-28 Thread Isaac Hebsh
Hoss Man suggested a wonderful solution for this need:
Always set update=add on the field you want to keep (if it exists), and use
FirstFieldValueUpdateProcessorFactory in the update chain, after
DistributedUpdateProcessorFactory (so the AtomicUpdate will add the
existing value first, if one exists).

This solution exactly covers my case. Thank you!
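
For anyone hitting this later, the chain I ended up with looks roughly like
this (a sketch, not copied verbatim from my solrconfig.xml; the client always
sends the timestamp field with update=add):

  <updateRequestProcessorChain name="keep-first-timestamp">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <!-- runs after the atomic-update merge, so the original (first) value wins -->
    <processor class="solr.FirstFieldValueUpdateProcessorFactory">
      <str name="fieldName">timestamp</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>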


On Wed, Feb 20, 2013 at 11:33 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Nobody responded my JIRA issue :(
 Should I commit this patch into SVN's trunk, and set the issue as Resolved?


 On Sun, Feb 17, 2013 at 9:26 PM, Isaac Hebsh isaac.he...@gmail.comwrote:

 Thank you Alex.
 Atomic Update allows you to add new values into multivalued field, for
 example... It means that the original document is being read (using
 RealTimeGet, which depends on updateLog).
 There is no reason that the list of operations (add/set/inc) will not
 include a create-only operation... I think that throwing it to the client
 is not a good idea, and even only because the required atomicity (which is
 handled in the DistributedUpdateProcessor using internal locks).

 There is no problem when using Atomic Update semantics on non-existent
 document.

 Indeed, it will work on stored fields only.


 On Sun, Feb 17, 2013 at 8:47 AM, Alexandre Rafalovitch 
 arafa...@gmail.com wrote:

 Unless it is an Atomic Update, right. In which case Solr/Lucene will
 actually look at the existing document and - I assume - will preserve
 whatever field got already populated as long as it is stored. Should work
 for default values as well, right? They get populated on first creation,
 then that document gets partially updated.

 But I can't tell from the problem description whether it can be
 reformulated as something that fits Atomic Update. I think if the client
 does not know whether this is a new record or an update one, Solr will
 complain if Atomic Update semantics is used against non-existent
 document.

 Regards,
Alex.
 P.s. Lots of conjecture here; I haven't tested exactly this use-case.

 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Sun, Feb 17, 2013 at 12:40 AM, Walter Underwood 
 wun...@wunderwood.org
 wrote:
 
  It is natural part of the update model for Solr (and for many other
 search engines). Solr does not do updates. It does add, replace, and
 delete.
 
  Every document is processed as if it was new. If there is already a
 document with that id, then the new document replaces it. The existing
 documents are not read during indexing. This allows indexing to be much
 faster than in a relational database.
 
  wunder






update fails if one doc is wrong

2013-02-26 Thread Isaac Hebsh
Hi.

I add documents to Solr by POSTing them to UpdateHandler, as bulks of add
commands (DIH is not used).

If one document contains any invalid data (e.g. string data in a numeric
field), Solr returns HTTP 400 Bad Request, and the whole bulk fails.

I'm searching for a way to tell Solr to accept the rest of the documents...
(I'll use RealTimeGet to determine which documents were added).
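(For that check, I am thinking of something like
/solr/collection1/get?ids=id1,id2,id3 - the path and core name are just a
sketch based on the default config.)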

If there is no standard way of doing it, maybe it can be implemented by
splitting the add commands into separate HTTP POSTs. Because of using
auto-soft-commit, can I say that it is almost equivalent? What is the
performance penalty of 100 POST requests (of 1 document each) against 1
request of 100 docs, if a soft commit is eventually done?

Thanks in advance...


Re: Timestamp field is changed on update

2013-02-20 Thread Isaac Hebsh
Nobody responded to my JIRA issue :(
Should I commit this patch into SVN's trunk, and set the issue as Resolved?


On Sun, Feb 17, 2013 at 9:26 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Thank you Alex.
 Atomic Update allows you to add new values into multivalued field, for
 example... It means that the original document is being read (using
 RealTimeGet, which depends on updateLog).
 There is no reason that the list of operations (add/set/inc) will not
 include a create-only operation... I think that throwing it to the client
 is not a good idea, and even only because the required atomicity (which is
 handled in the DistributedUpdateProcessor using internal locks).

 There is no problem when using Atomic Update semantics on non-existent
 document.

 Indeed, it will work on stored fields only.


 On Sun, Feb 17, 2013 at 8:47 AM, Alexandre Rafalovitch arafa...@gmail.com
  wrote:

 Unless it is an Atomic Update, right. In which case Solr/Lucene will
 actually look at the existing document and - I assume - will preserve
 whatever field got already populated as long as it is stored. Should work
 for default values as well, right? They get populated on first creation,
 then that document gets partially updated.

 But I can't tell from the problem description whether it can be
 reformulated as something that fits Atomic Update. I think if the client
 does not know whether this is a new record or an update one, Solr will
 complain if Atomic Update semantics is used against non-existent document.

 Regards,
Alex.
 P.s. Lots of conjecture here; I haven't tested exactly this use-case.

 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Sun, Feb 17, 2013 at 12:40 AM, Walter Underwood wun...@wunderwood.org
 
 wrote:
 
  It is natural part of the update model for Solr (and for many other
 search engines). Solr does not do updates. It does add, replace, and
 delete.
 
  Every document is processed as if it was new. If there is already a
 document with that id, then the new document replaces it. The existing
 documents are not read during indexing. This allows indexing to be much
 faster than in a relational database.
 
  wunder





Re: Timestamp field is changed on update

2013-02-17 Thread Isaac Hebsh
Thank you Alex.
Atomic Update allows you to add new values into a multivalued field, for
example... It means that the original document is being read (using
RealTimeGet, which depends on the updateLog).
There is no reason that the list of operations (add/set/inc) could not
include a create-only operation... I think that throwing it back to the client
is not a good idea, if only because of the required atomicity (which is
handled in the DistributedUpdateProcessor using internal locks).

There is no problem when using Atomic Update semantics on a non-existent
document.

Indeed, it will work on stored fields only.


On Sun, Feb 17, 2013 at 8:47 AM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 Unless it is an Atomic Update, right. In which case Solr/Lucene will
 actually look at the existing document and - I assume - will preserve
 whatever field got already populated as long as it is stored. Should work
 for default values as well, right? They get populated on first creation,
 then that document gets partially updated.

 But I can't tell from the problem description whether it can be
 reformulated as something that fits Atomic Update. I think if the client
 does not know whether this is a new record or an update one, Solr will
 complain if Atomic Update semantics is used against non-existent document.

 Regards,
Alex.
 P.s. Lots of conjecture here; I haven't tested exactly this use-case.

 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Sun, Feb 17, 2013 at 12:40 AM, Walter Underwood wun...@wunderwood.org
 wrote:
 
  It is natural part of the update model for Solr (and for many other
 search engines). Solr does not do updates. It does add, replace, and
 delete.
 
  Every document is processed as if it was new. If there is already a
 document with that id, then the new document replaces it. The existing
 documents are not read during indexing. This allows indexing to be much
 faster than in a relational database.
 
  wunder



Re: Timestamp field is changed on update

2013-02-16 Thread Isaac Hebsh
I opened a JIRA for this improvement request (attached a patch to
DistributedUpdateProcessor).
It's my first JIRA. please review it...
(Or, if someone has an easier solution, tell us...)

https://issues.apache.org/jira/browse/SOLR-4468


On Fri, Feb 15, 2013 at 8:13 AM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Hi.

 I have a 'timestamp' field, which is a date, with a default value of 'NOW'.
 I want it to represent the datetime when the item was inserted (at the
 first time).

 Unfortunately, when the item is updated, the timestamp is changed...

 How can I implement INSERT TIME automatically?



Re: Timestamp field is changed on update

2013-02-16 Thread Isaac Hebsh
Hi,
I do have an externally-created timestamp, but some minutes may pass before
it will be sent to Solr.


On Sat, Feb 16, 2013 at 10:39 PM, Walter Underwood wun...@wunderwood.orgwrote:

 Do you really want the time that Solr first saw it or do you want the time
 that the document was really created in the system? I think an external
 create timestamp would be a lot more useful.

 wunder

 On Feb 16, 2013, at 12:37 PM, Isaac Hebsh wrote:

  I opened a JIRA for this improvement request (attached a patch to
  DistributedUpdateProcessor).
  It's my first JIRA. please review it...
  (Or, if someone has an easier solution, tell us...)
 
  https://issues.apache.org/jira/browse/SOLR-4468
 
 
  On Fri, Feb 15, 2013 at 8:13 AM, Isaac Hebsh isaac.he...@gmail.com
 wrote:
 
  Hi.
 
  I have a 'timestamp' field, which is a date, with a default value of
 'NOW'.
  I want it to represent the datetime when the item was inserted (at the
  first time).
 
  Unfortunately, when the item is updated, the timestamp is changed...
 
  How can I implement INSERT TIME automatically?
 







Re: Timestamp field is changed on update

2013-02-16 Thread Isaac Hebsh
The component that sends the document does not know whether it is a new
document or an update; these are my internal constraints. But, guys, I
think that it's a basic feature, and it would be better if Solr supported
it without external help...


On Sun, Feb 17, 2013 at 12:37 AM, Upayavira u...@odoko.co.uk wrote:

 I think what Walter means is make the thing that sends it to Solr set
 the timestamp when it does so.

 Upayavira

 On Sat, Feb 16, 2013, at 08:56 PM, Isaac Hebsh wrote:
  Hi,
  I do have an externally-created timestamp, but some minutes may pass
  before
  it will be sent to Solr.
 
 
  On Sat, Feb 16, 2013 at 10:39 PM, Walter Underwood
  wun...@wunderwood.orgwrote:
 
   Do you really want the time that Solr first saw it or do you want the
 time
   that the document was really created in the system? I think an external
   create timestamp would be a lot more useful.
  
   wunder
  
   On Feb 16, 2013, at 12:37 PM, Isaac Hebsh wrote:
  
I opened a JIRA for this improvement request (attached a patch to
DistributedUpdateProcessor).
It's my first JIRA. please review it...
(Or, if someone has an easier solution, tell us...)
   
https://issues.apache.org/jira/browse/SOLR-4468
   
   
On Fri, Feb 15, 2013 at 8:13 AM, Isaac Hebsh isaac.he...@gmail.com
   wrote:
   
Hi.
   
I have a 'timestamp' field, which is a date, with a default value of
   'NOW'.
I want it to represent the datetime when the item was inserted (at
 the
first time).
   
Unfortunately, when the item is updated, the timestamp is changed...
   
How can I implement INSERT TIME automatically?
   
  
  
  
  
  



Re: How to limit queries to specific IDs

2013-02-12 Thread Isaac Hebsh
Thank you, Erick! Three great answers!
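
For the record, the combination I am going to try looks roughly like this (a
sketch; the IDs are placeholders):

  q=<the original user query>&fq={!cache=false}id:(17 42 933 1005)&hl=true&hl.fl=field1,field2

so the per-request ID filter never enters the filterCache, while the frequent
filters still do.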


On Wed, Feb 13, 2013 at 4:20 AM, Erick Erickson erickerick...@gmail.comwrote:

 First, it may not be a problem assuming your other filter queries are more
 frequent.

 Second, the easiest way to keep these out of the filter cache would be just
 to include them as a MUST clause, like
 +(original query) +id:(1 2 3 4).

 Third possibility, see https://issues.apache.org/jira/browse/SOLR-2429,
 but
 the short form is:
 fq={!cache=false}restoffq


 On Mon, Feb 11, 2013 at 2:41 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Hi everyone.
 
  I have queries that should be bounded to a set of IDs (the uniqueKey
 field
  of my schema).
  My client front-end sends two Solr request:
  In the first one, it wants to get the top X IDs. This result should
 return
  very fast. No time to waste on highlighting. this is a very standard
  query.
  In the aecond one, it wants to get the highlighting info (corresponding
 to
  the queried fields and terms, of course), on those documents (may be some
  sequential requests, on small bulks of the full list).
 
  These two requests are implemented as almost identical calls, to
 different
  requestHandlers.
 
  I thought to append a filter query to the second request, id:(1 2 3 4
 5).
  Is this idea good for Solr?
  If does, my problem is that I don't want these filters to flood my
  filterCache... Is there any way (even if it involves some coding...) to
 add
  a filter query which won't be added to filterCache (at least, not instead
  of standard filters)?
 
 
  Notes:
  1. It can't be assured that the the first query will remain in
  queryResultsCache...
  2. consider index size of 50M documents...
 



How to limit queries to specific IDs

2013-02-11 Thread Isaac Hebsh
Hi everyone.

I have queries that should be bounded to a set of IDs (the uniqueKey field
of my schema).
My client front-end sends two Solr requests:
In the first one, it wants to get the top X IDs. This result should return
very fast. No time to waste on highlighting. This is a very standard
query.
In the second one, it wants to get the highlighting info (corresponding to
the queried fields and terms, of course) on those documents (maybe as some
sequential requests, on small bulks of the full list).

These two requests are implemented as almost identical calls, to different
requestHandlers.

I thought of appending a filter query to the second request, id:(1 2 3 4 5).
Is this idea good for Solr?
If so, my problem is that I don't want these filters to flood my
filterCache... Is there any way (even if it involves some coding...) to add
a filter query which won't be added to the filterCache (at least, not instead
of standard filters)?


Notes:
1. It can't be assured that the first query will remain in the
queryResultsCache...
2. consider index size of 50M documents...


Re: Trying to understand soft vs hard commit vs transaction log

2013-02-08 Thread Isaac Hebsh
Shawn, what about 'flush to disk' behaviour on MMapDirectoryFactory?


On Fri, Feb 8, 2013 at 11:12 AM, Prakhar Birla prakharbi...@gmail.comwrote:

 Great explanation Shawn! BTW soft committed documents will not be
 recovered on a JVM crash.

 On 8 February 2013 13:27, Shawn Heisey s...@elyograg.org wrote:

  On 2/7/2013 9:29 PM, Alexandre Rafalovitch wrote:
 
  Hello,
 
  What actually happens when using soft (as opposed to hard) commit?
 
  I understand somewhat very high-level picture (documents become
 available
  faster, but you may loose them on power loss).
  I don't care about low-level implementation details.
 
  But I am trying to understand what is happening on the medium level of
  details.
 
  For example what are stages of a document if we are using all available
  transaction log, soft commit, hard commit options? It feels like there
 is
  three stages:
  *) Uncommitted (soft or hard): accessible only via direct real-time get?
  *) Soft-committed: accessible through all search operatons? (but not on
  disk? but where is it? in memory?)
  *) Hard-committed: all the same as soft-committed but it is now on disk
 
  Similarly,  in performance section of Wiki, it says: A commit
 (including
  a
  soft commit) will free up almost all heap memory - why would soft
 commit
  free up heap memory? I thought it was not flushed to disk.
 
  Also, with soft-commits and transaction log enabled, doesn't transaction
  log allows to replay/recover the latest state after crash? I believe
  that's
  what transaction log does for the database. If not, how does one
 recover,
  if at all?
 
  And where does openSearcher=false fits into that? Does it cause
  inconsistent results somehow?
 
  I am missing something, but I am not sure what or where. Any points in
 the
  right direction would be appreciated.
 
 
  Let's see if I can answer your questions without giving you incorrect
  information.
 
  New indexed content is not searchable until you open a new searcher,
  regardless of the type of commit that you do.
 
  A hard commit will close the current transaction log and start a new one.
   It will also instruct the Directory implementation to flush to disk.  If
  you specify openSearcher=false, then the content that has just been
  committed will NOT be searchable, as discussed in the previous paragraph.
   The existing searcher will remain open and continue to serve queries
  against the same index data.
 
  A soft commit does not flush the new content to disk, but it does open a
  new searcher.  I'm sure that the amount of memory available for caching
  this content is not large, so it's possible that if you do a lot of
  indexing with soft commits and your hard commits are too infrequent,
 you'll
  end up flushing part of the cached data to disk anyway.  I'd love to hear
  from a committer about this, because I could be wrong.
 
  There's a caveat with that 'flush to disk' operation -- the default
  Directory implementation in the Solr example config, which is
  NRTCachingDirectoryFactory, will cache the last few megabytes of indexed
  data and not flush it to disk even with a hard commit.  If your commits
 are
  small, then the net result is similar to a soft commit.  If the server or
  Solr were to crash, the transaction logs would be replayed on Solr
 startup,
  recovering that last few megabytes.  The transaction log may also recover
  documents that were soft committed, but I'm not 100% sure about that.
 
  To take full advantage of NRT functionality, you can commit as often as
  you like with soft commits.  On some reasonable interval, say every one
 to
  fifteen minutes, you can issue a hard commit with openSearcher set to
  false, to flush things to disk and cycle through transaction logs before
  they get huge.  Solr will keep a few of the transaction logs around, and
 if
  they are huge, it can take a long time to replay them.  You'll want to
  choose a hard commit interval that doesn't create giant transaction logs.
 
  If any of the info I've given here is wrong, someone should correct me!
 
  Thanks,
  Shawn
 
 


 --
 Regards,
 Prakhar Birla
 +91 9739868086



Re: IP Address as number

2013-02-07 Thread Isaac Hebsh
Small addition:
To support queries, I probably have to implement a (query-time) analyzer...
Can an analyzer be configured on a numeric (i.e. non-TEXT) field?
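
Either way, the core of it is just the dotted-quad conversion. A rough sketch
of the two directions (IPv4 only, no validation):

  // parse "10.2.3.4" into its unsigned 32-bit value, held in a long
  static long ipToLong(String ip) {
    String[] p = ip.split("\\.");
    return (Long.parseLong(p[0]) << 24) | (Long.parseLong(p[1]) << 16)
         | (Long.parseLong(p[2]) << 8)  |  Long.parseLong(p[3]);
  }

  // format the numeric value back to dotted-quad for display
  static String longToIp(long v) {
    return ((v >> 24) & 0xFF) + "." + ((v >> 16) & 0xFF) + "."
         + ((v >> 8) & 0xFF) + "." + (v & 0xFF);
  }

The open question is only where to plug these in (a custom field type vs. an
update processor plus some response-side transformation).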


On Thu, Feb 7, 2013 at 6:48 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Hi.

 I have to index a field which contains an IP address.
 Users want to query this field using RANGE queries. To support this, the
 IP is stored as its DWORD value (assume it is IPv4...). On the other side,
 users supply the IP addresses textually (xxx.xxx.xxx.xxx).

 I can write a new field type, extending TrieLongField, which will change the
 textual representation to a numeric one.
 But what about stored field retrieval? I want to return the textual
 form... maybe with a search component which changes the stored fields?

 Has anyone encountered this need before?



Re: Servlet Filter for randomizing core names

2013-02-04 Thread Isaac Hebsh
LBHttpSolrServer is only a SolrJ feature, isn't it?

I think that Solr does not balance queries among cores in the same server.
You can claim that it's a non-issue, if a single core can completely serve
multiple queries at the same time, and passing requests through different
cores adds nothing. I feel that we can achieve some improvement in this
case...


On Mon, Feb 4, 2013 at 12:45 AM, Shawn Heisey s...@elyograg.org wrote:

 On 2/3/2013 3:24 PM, Isaac Hebsh wrote:

 Thanks Shawn for your quick answer.

 When using collection name, Solr will choose the leader, when available in
 the current server (see getCoreByCollection in SolrDispatchFilter). It is
 clear that it's useful when indexing. But queries should run on replicas
 too, don't they? Moreover, the core selection seems to be consistent (that
 is, it will never get the non-first core in a specific arrangement)...

 Under the assumption that a core makes extra work for serving queries
 (e.g,
 combining results, processing every non distributed search component (?)),
 and the assumption that multithreading works well here, Is utilizing all
 the cores would not be useful?


 Here's an excerpt from the SolrCloud wiki page that suggests it handles
 load balancing across the cluster automatically:

 
 Now send a query to any of the servers to query the cluster:

 http://localhost:7500/solr/collection1/select?q=*:*

 Send this query multiple times and observe the logs from the solr servers.
 You should be able to observe Solr load balancing the requests (done via
 LBHttpSolrServer ?) across replicas, using different servers to satisfy
 each request.
 

 This is near the end of example B.

 http://wiki.apache.org/solr/SolrCloud#Example_B:_Simple_two_shard_cluster_with_shard_replicas

 Thanks,
 Shawn




Re: Servlet Filter for randomizing core names

2013-02-04 Thread Isaac Hebsh
Of course I did not mean multiple cores of the same shard...
A normal SolrCloud configuration, let's say 4 shards, on 4 servers, using
replicationFactor=3.
Of course, no matter what core was requested, the request will be forwarded
to one core of each shard.
My question is - whether this *first* request should be distributed over
all of the cores in a specific server or not.

The statement "Cores are completely thread safe and can do queries/updates
concurrently" answers me that there is no reason for my idea.


On Mon, Feb 4, 2013 at 9:28 PM, Shawn Heisey s...@elyograg.org wrote:

 On 2/4/2013 12:06 PM, Isaac Hebsh wrote:

 LBHttpSolrServer is only solrj feature.. doesn't it?

 I think that Solr does not balance queries among cores in the same server.
 You can claim that it's a non-issue, if a single core can completely serve
 multiple queries on the same time,  and passing requests through different
 cores does nothing.  I feel that we can achieve some improvement in this
 case...


 If LBHttpSolrServer is used as described in the Wiki (whoever wrote that
 wasn't sure, they were asking), then it is being used on the server side,
 not the client.

 Multiple copies of a shard on the same server is probably not a generally
 supported config with SolrCloud.  It would use more memory and disk space,
 and I'm not sure that there would be any actual benefit to query speed.
  Cores are completely thread safe and can do queries/updates concurrently.
  Whatever concurrency problems exist are likely due to resource (CPU, RAM,
 I/O) utilization rather than code limitations.  If I'm right about that,
 multiple copies would not solve the problem.  Buying a bigger/faster server
 would be the solution to that problem.

 Thanks,
 Shawn




Re: Servlet Filter for randomizing core names

2013-02-03 Thread Isaac Hebsh
Thanks Shawn for your quick answer.

When using the collection name, Solr will choose the leader, when one is
available on the current server (see getCoreByCollection in
SolrDispatchFilter). It is clear that this is useful when indexing. But
queries should run on replicas too, shouldn't they? Moreover, the core
selection seems to be consistent (that is, it will never pick the non-first
core in a specific arrangement)...

Under the assumption that a core does extra work to serve a query (e.g.,
combining results, running every non-distributed search component (?)),
and the assumption that multithreading works well here, wouldn't utilizing
all the cores be useful?


On Sun, Feb 3, 2013 at 11:49 PM, Shawn Heisey s...@elyograg.org wrote:

 On 2/3/2013 1:18 PM, Isaac Hebsh wrote:

 Hi.

 I have a SolrCloud cluster, which contains some servers. each server runs
 multiple cores.

 I want to distribute the requests over the running cores on each server,
 without knowing the cores names in the client.

 Question 1: Do I have any reason to do this (when indexing? when
 querying?).

 All of these cores are sharing the same system resources, but I guess that
 I still get a better performance if same amount of requests are going to
 each core. Am I right?


 If you are using a cloud-aware API (such as CloudSolrServer from SolrJ),
 your client knows about your zookeeper setup.  Behind the scenes, it
 consults zookeeper about how to find the various servers and cores.  You
 never have to configure any core names on the client.

 If you are not using a cloud-aware API, shouldn't you be talking to the
 collection, not the cores?  That is, talk to /solr/test, not
 /solr/test_shard1_replica1 in your program.  That should cause Solr itself
 to figure out where the cores are and forward requests as necessary.
  Couple that with a load balancer and it approaches what a cloud-aware API
 gives you in terms of reliability.

 From my attempts to help people in the IRC channel, I have concluded that
 Solr 4.0 may use the name of the collection as the name of the core on each
 server.  I have not actually used SolrCloud in 4.0, so I cannot say.

 Solr 4.1 does not do this.  If you create a collection named test with 2
 shards and 2 replicas with the collections API, you get the following cores
 distributed among your servers:

 test_shard1_replica1
 test_shard1_replica2
 test_shard2_replica1
 test_shard2_replica2


  Question 2:

 I've implemented a nice ServletFilter, which replaces the magic name
 /randomcore/ with a random core name (retrieved from CoreContainer). I'm
 using RequestDispatcher.forward, on the new URI. It works, very cool :)

 But, for making it work, I had to set <dispatcher>FORWARD</dispatcher> on
 SolrRequestFilter. This setting is explicitly inadvisable in web.xml. Can
 anyone explain why?


 No idea here.

 Thanks,
 Shawn




Re: Distributed search

2013-01-28 Thread Isaac Hebsh
Well, my index is already broken into 16 shards...
The behaviour I supposed absolutely doesn't happen... right?
Does it make sense somehow as an improvement request?
Technically, can multiple Lucene responses be intersected this way?


On Mon, Jan 28, 2013 at 9:27 PM, Mingfeng Yang mfy...@wisewindow.comwrote:

 In your case, since there is no co-current queries, adding replicas won't
 help much on improving the response speed.  However, break your index into
 a few shards do help increase query performance. I recently break an index
 with 30 million documents (30G) into 4 shards, and the boost is pretty
 impressive (roughly 2-5x faster for a complicated query)

 Ming


 On Mon, Jan 28, 2013 at 10:54 AM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Does adding replicas (on additional servers) help to improve search
  performance?
 
  It is known that each query goes to all the shards. It's clear that if we
  have massive load, then multiple cores serving the same shard are very
  useful.
 
  But what happens if I'll never have concurrent queries (one query is in
 the
  system at any time), but I want these single queries to return faster.
 Is a
  bigger replication factor will contribute?
 
  Especially, Will a complicated query (with a large amount of queried
  fields) go to multiple cores *of the same shard*? (E.g. core1 searching
 for
  term1 in field1, and core2 searching for term 2 in field2)
 
  And what about a query on a single field, which contains a lot of terms?
 
  Thanks in advance..
 



Re: secure Solr server

2013-01-27 Thread Isaac Hebsh
You can define a security constraint (filter) in WEB-INF/web.xml for specific
URL patterns.
You might want to set the URL pattern to /admin/*.

[find examples here:
http://stackoverflow.com/questions/7920092/how-can-i-bypass-security-filter-in-web-xml
]
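
A minimal sketch of such a constraint (role and realm names are placeholders,
and the users/roles themselves still have to be defined in the container,
e.g. Jetty's realm configuration):

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Solr admin</web-resource-name>
      <url-pattern>/admin/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>solr-admin</role-name>
    </auth-constraint>
  </security-constraint>
  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Solr Realm</realm-name>
  </login-config>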


On Sun, Jan 27, 2013 at 8:07 PM, Mingfeng Yang mfy...@wisewindow.comwrote:

 Before Solr 4.0, I secure solr by enable password protection in Jetty.
  However, password protection will make solrcloud not work.

 We use EC2 now, and we need the www admin interface of solr to be
 accessible (with password) from anywhere.

 How do you protect your solr sever from unauthorized access?

 Thanks,
 Ming



Re: uniqueKey field type

2013-01-23 Thread Isaac Hebsh
The id field is not serial; it is generated randomly, so range queries on this
field are almost useless.
I mentioned TrieField because solr.LongField is internally implemented as
a string, while solr.TrieLongField is a number. It might improve
performance, even without setting a precisionStep...


On Thu, Jan 24, 2013 at 3:31 AM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 I think trie type fields add value only if you do range queries on them, and
 it sounds like that is not your use case.

 Otis
 Solr  ElasticSearch Support
 http://sematext.com/
 On Jan 23, 2013 2:53 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

  Hi,
 
  In my use case, Solr have to to return only the id field, as a response
  for queries. However, it should return 1000 docs at once (rows=1000).
 
  My id field is defined as StrField, due to external systems constraints.
 
  I guess that TrieFields are more efficient than StrFields.
 *Theoretically*,
  the field content can be retrieved without loading the stored field.
 
  Should I strive that the id will be managed as a number, or it has no
  contribution to performance (search  retrieve times)?
 
  (Yes, I know that lucene has an internal id mechanism. I think it is not
  relevant to my question...)
 
 
  - Isaac.
 



Re: Solr cache considerations

2013-01-20 Thread Isaac Hebsh
Wow Erick, the MMap article is a very fundamental one. It totally changed my
view. It must be mentioned in SolrPerformanceFactors in the SolrWiki...
I'm sorry I did not know it before.
Thank you very much.
I promise to share my results when my cart starts to fly :)


On Sun, Jan 20, 2013 at 6:08 PM, Erick Erickson erickerick...@gmail.comwrote:

 About your question about document cache: Typically the document cache
 has a pretty low hit-ratio. I've rarely, if ever, seen it get hit very
 often. And remember that this cache is only hit when assembling the
 response for a few documents (your page size).

 Bottom line: I wouldn't worry about this cache much. It's quite useful
 for processing a particular query faster, but not really intended for
 cross-query use.

 Really, I think you're getting the cart before the horse here. Run it
 up the flagpole and try it. Rely on the OS to do its job
 (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
 Find  a bottleneck _then_ tune. Premature optimization and all
 that

 Several tens of millions of docs isn't that large unless the text
 fields are enormous.

 Best
 Erick

 On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:
  Ok. Thank you everyone for your helpful answers.
  I understand that fieldValueCache is not used for resolving queries.
  Is there any cache that can help this basic scenario (a lot of different
  queries, on a small set of fields)?
  Does Lucene's FieldCache help (implicitly)?
  How can I use RAM to reduce I/O in this type of queries?
 
 
  On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe 
  tomasflo...@gmail.com wrote:
 
  No, the fieldValueCache is not used for resolving queries. Only for
  multi-token faceting and apparently for the stats component too. The
  document cache maintains in memory the stored content of the fields you
 are
  retrieving or highlighting on. It'll hit if the same document matches
 the
  query multiple times and the same fields are requested, but as Eirck
 said,
  it is important for cases when multiple components in the same request
 need
  to access the same data.
 
  I think soft committing every 10 minutes is totally fine, but you should
  hard commit more often if you are going to be using transaction log.
  openSearcher=false will essentially tell Solr not to open a new searcher
  after the (hard) commit, so you won't see the new indexed data and
 caches
  wont be flushed. openSearcher=false makes sense when you are using
  hard-commits together with soft-commits, as the soft-commit is dealing
  with opening/closing searchers, you don't need hard commits to do it.
 
  Tomás
 
 
  On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh isaac.he...@gmail.com
  wrote:
 
   Unfortunately, it seems (
   http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html)
 that
   these caches are not per-segment. In this case, I want to (soft)
 commit
   less frequently. Am I right?
  
   Tomás, as the fieldValueCache is very similar to lucene's FieldCache,
 I
   guess it has a big contribution to standard (not only faceted) queries
   time. SolrWiki claims that it primarily used by faceting. What that
 says
   about complex textual queries?
  
   documentCache:
   Erick, After a query processing is finished, doesn't some documents
 stay
  in
   the documentCache? can't I use it to accelerate queries that should
   retrieve stored fields of documents? In this case, a big documentCache
  can
   hold more documents..
  
   About commit frequency:
   HardCommit: openSearch=false seems as a nice solution. Where can I
 read
   about this? (found nothing but one unexplained sentence in SolrWiki).
   SoftCommit: In my case, the required index freshness is 10 minutes.
 The
   plan to soft commit every 10 minutes is similar to storing all of the
   documents in a queue (outside to Solr), an indexing a bulk every 10
   minutes.
  
   Thanks.
  
  
   On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe 
   tomasflo...@gmail.com wrote:
  
I think fieldValueCache is not per segment, only fieldCache is.
  However,
unless I'm missing something, this cache is only used for faceting
 on
multivalued fields
   
   
On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson 
  erickerick...@gmail.com
wrote:
   
 filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters
 in
 cache). Notice the /8. This reflects the fact that the filters are
 represented by a bitset on the _internal_ Lucene ID. UniqueId has
 no
 bearing here whatsoever. This is, in a nutshell, why warming is
 required, the internal Lucene IDs may change. Note also that it's
 maxDoc, the internal arrays have holes for deleted documents.

 Note this is an _upper_ bound, if there are only a few docs that
 match, the size will be (num of matching docs) * sizeof(int)).

 fieldValueCache. I don't think so, although I'm a bit fuzzy on
 this.
 It depends on whether

Re: Solr cache considerations

2013-01-19 Thread Isaac Hebsh
Ok. Thank you everyone for your helpful answers.
I understand that fieldValueCache is not used for resolving queries.
Is there any cache that can help this basic scenario (a lot of different
queries, on a small set of fields)?
Does Lucene's FieldCache help (implicitly)?
How can I use RAM to reduce I/O for this type of query?


On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:

 No, the fieldValueCache is not used for resolving queries. Only for
 multi-token faceting and apparently for the stats component too. The
 document cache maintains in memory the stored content of the fields you are
 retrieving or highlighting on. It'll hit if the same document matches the
 query multiple times and the same fields are requested, but as Eirck said,
 it is important for cases when multiple components in the same request need
 to access the same data.

 I think soft committing every 10 minutes is totally fine, but you should
 hard commit more often if you are going to be using transaction log.
 openSearcher=false will essentially tell Solr not to open a new searcher
 after the (hard) commit, so you won't see the new indexed data and caches
 wont be flushed. openSearcher=false makes sense when you are using
 hard-commits together with soft-commits, as the soft-commit is dealing
 with opening/closing searchers, you don't need hard commits to do it.

 Tomás


 On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Unfortunately, it seems (
  http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that
  these caches are not per-segment. In this case, I want to (soft) commit
  less frequently. Am I right?
 
  Tomás, as the fieldValueCache is very similar to lucene's FieldCache, I
  guess it has a big contribution to standard (not only faceted) queries
  time. SolrWiki claims that it primarily used by faceting. What that says
  about complex textual queries?
 
  documentCache:
  Erick, After a query processing is finished, doesn't some documents stay
 in
  the documentCache? can't I use it to accelerate queries that should
  retrieve stored fields of documents? In this case, a big documentCache
 can
  hold more documents..
 
  About commit frequency:
  HardCommit: openSearch=false seems as a nice solution. Where can I read
  about this? (found nothing but one unexplained sentence in SolrWiki).
  SoftCommit: In my case, the required index freshness is 10 minutes. The
  plan to soft commit every 10 minutes is similar to storing all of the
  documents in a queue (outside to Solr), an indexing a bulk every 10
  minutes.
 
  Thanks.
 
 
  On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe 
  tomasflo...@gmail.com wrote:
 
   I think fieldValueCache is not per segment, only fieldCache is.
 However,
   unless I'm missing something, this cache is only used for faceting on
   multivalued fields
  
  
   On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson 
 erickerick...@gmail.com
   wrote:
  
filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters in
cache). Notice the /8. This reflects the fact that the filters are
represented by a bitset on the _internal_ Lucene ID. UniqueId has no
bearing here whatsoever. This is, in a nutshell, why warming is
required, the internal Lucene IDs may change. Note also that it's
maxDoc, the internal arrays have holes for deleted documents.
   
Note this is an _upper_ bound, if there are only a few docs that
match, the size will be (num of matching docs) * sizeof(int)).
   
fieldValueCache. I don't think so, although I'm a bit fuzzy on this.
It depends on whether these are per-segment caches or not. Any per
segment cache is still valid.
   
Think of documentCache as intended to hold the stored fields while
various components operate on it, thus avoiding repeatedly fetching
the data from disk. It's _usually_ not too big a worry.
   
About hard-commits once a day. That's _extremely_ long. Think instead
of committing more frequently with openSearcher=false. If nothing
else, you transaction log will grow lots and lots and lots. I'm
thinking on the order of 15 minutes, or possibly even much less. With
softCommits happening more often, maybe every 15 seconds. In fact,
 I'd
start out with soft commits every 15 seconds and hard commits
(openSearcher=false) every 5 minutes. The problem with hard commits
being once a day is that, if for any reason the server is
 interrupted,
on startup Solr will try to replay the entire transaction log to
assure index integrity. Not to mention that your tlog will be huge.
Not to mention that there is some memory usage for each document in
the tlog. Hard commits roll over the tlog, flush the in-memory tlog
pointers, close index segments, etc.
   
Best
Erick
   
On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh isaac.he...@gmail.com
wrote:
 Hi,

 I am going to build a big

Re: Solr cache considerations

2013-01-17 Thread Isaac Hebsh
Unfortunately, it seems (
http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that
these caches are not per-segment. In this case, I want to (soft) commit
less frequently. Am I right?

Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I
guess it contributes a lot to standard (not only faceted) query time.
SolrWiki claims that it is primarily used by faceting. What does that say
about complex textual queries?

documentCache:
Erick, after query processing is finished, don't some documents stay in
the documentCache? Can't I use it to accelerate queries that need to
retrieve stored fields of documents? In this case, a big documentCache can
hold more documents...

About commit frequency:
HardCommit: openSearcher=false seems like a nice solution. Where can I read
about this? (I found nothing but one unexplained sentence in the SolrWiki.)
SoftCommit: In my case, the required index freshness is 10 minutes. The
plan to soft commit every 10 minutes is similar to storing all of the
documents in a queue (outside of Solr) and indexing a bulk every 10 minutes.
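
In solrconfig.xml terms, I guess that plan would look something like the
following (values just for illustration; my case needs the 10-minute soft
commit, plus a more frequent hard commit with openSearcher=false):

  <autoSoftCommit>
    <maxTime>600000</maxTime>          <!-- new searcher every 10 minutes -->
  </autoSoftCommit>
  <autoCommit>
    <maxTime>300000</maxTime>          <!-- hard commit every 5 minutes -->
    <openSearcher>false</openSearcher>
  </autoCommit>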

Thanks.


On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:

 I think fieldValueCache is not per segment, only fieldCache is. However,
 unless I'm missing something, this cache is only used for faceting on
 multivalued fields


 On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters in
  cache). Notice the /8. This reflects the fact that the filters are
  represented by a bitset on the _internal_ Lucene ID. UniqueId has no
  bearing here whatsoever. This is, in a nutshell, why warming is
  required, the internal Lucene IDs may change. Note also that it's
  maxDoc, the internal arrays have holes for deleted documents.
 
  Note this is an _upper_ bound, if there are only a few docs that
  match, the size will be (num of matching docs) * sizeof(int)).
 
  fieldValueCache. I don't think so, although I'm a bit fuzzy on this.
  It depends on whether these are per-segment caches or not. Any per
  segment cache is still valid.
 
  Think of documentCache as intended to hold the stored fields while
  various components operate on it, thus avoiding repeatedly fetching
  the data from disk. It's _usually_ not too big a worry.
 
  About hard-commits once a day. That's _extremely_ long. Think instead
  of committing more frequently with openSearcher=false. If nothing
  else, you transaction log will grow lots and lots and lots. I'm
  thinking on the order of 15 minutes, or possibly even much less. With
  softCommits happening more often, maybe every 15 seconds. In fact, I'd
  start out with soft commits every 15 seconds and hard commits
  (openSearcher=false) every 5 minutes. The problem with hard commits
  being once a day is that, if for any reason the server is interrupted,
  on startup Solr will try to replay the entire transaction log to
  assure index integrity. Not to mention that your tlog will be huge.
  Not to mention that there is some memory usage for each document in
  the tlog. Hard commits roll over the tlog, flush the in-memory tlog
  pointers, close index segments, etc.
 
  Best
  Erick
 
  On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh isaac.he...@gmail.com
  wrote:
   Hi,
  
   I am going to build a big Solr (4.0?) index, which holds some dozens of
   millions of documents. Each document has some dozens of fields, and one
  big
   textual field.
   The queries on the index are non-trivial, and a little-bit long (might
 be
   hundreds of terms). No query is identical to another.
  
   Now, I want to analyze the cache performance (before setting up the
 whole
   environment), in order to estimate how much RAM will I need.
  
   filterCache:
   In my scenariom, every query has some filters. let's say that each
 filter
   matches 1M documents, out of 10M. Does the estimated memory usage
 should
  be
   1M * sizeof(uniqueId) * num-of-filters-in-cache?
  
   fieldValueCache:
   Due to the difference between queries, I guess that fieldValueCache is
  the
   most important factor on query performance. Here comes a generic
  question:
   I'm indexing new documents to the index constantly. Soft commits will
 be
   performed every 10 mins. Does it say that the cache is meaningless,
 after
   every 10 minutes?
  
   documentCache:
   enableLazyFieldLoading will be enabled, and fl contains a very small
  set
   of fields. BUT, I need to return highlighting on about (possibly) 20
   fields. Does the highlighting component use the documentCache? I guess
  that
   highlighting requires the whole field to be loaded into the
  documentCache.
   Will it happen only for fields that matched a term from the query?
  
   And one more question: I'm planning to hard-commit once a day. Should I
   prepare to a significant RAM usage growth between hard-commits?
  (consider a
   lot of new documents in this period...)
   Does this RAM