Re: Are there any restrictions on what kind of or how many fields you can use in Pivot Query? I get ClassCastException when I use some of my string fields, and don't when I use some other string fields

2011-02-16 Thread Tanguy Moal

Hello Ravish, Erick,

I'm facing the same issue with solr-trunk (as of r1071282)

- Field configuration : (a fieldType definition with positionIncrementGap="100" ; the rest of the XML was stripped by the archive)

- Schema configuration : (the field definitions were stripped by the archive)

In my test index, I have documents with sparse values : Some documents 
may or may not have a value for f1, f2 and/or f3

The number of indexed documents is around 25.

I'm facing the issue at query time, depending on my query and on the 
"temperature" of the index (i.e. how warmed up the caches are).


Parameters having an effect on the reproducibility :
- number of levels of the decision tree : the deeper the tree, the 
sooner the exception arises
- facet.limit parameter : the higher the limit, the sooner the exception 
arises.


Examples :

All docs, facet-pivoting on all the fields that matter, varying 
facet.limit (the full request form is spelled out right after the list) :

q=*:* pivot=f1,f2,f3 facet.limit=1  : OK
q=*:* pivot=f1,f2,f3 facet.limit=2  : OK
...
q=*:* pivot=f1,f2,f3 facet.limit=8  : OK
q=*:* pivot=f1,f2,f3 facet.limit=9  : NOT OK
retry
q=*:* pivot=f1,f2,f3 facet.limit=9  : NOT OK
retry
q=*:* pivot=f1,f2,f3 facet.limit=9  : OK
q=*:* pivot=f1,f2,f3 facet.limit=10  : NOT OK
retry
q=*:* pivot=f1,f2,f3 facet.limit=10  : NOT OK
retry
q=*:* pivot=f1,f2,f3 facet.limit=10  : NOT OK
retry
q=*:* pivot=f1,f2,f3 facet.limit=10  : NOT OK
retry
q=*:* pivot=f1,f2,f3 facet.limit=10  : NOT OK
retry
q=*:* pivot=f1,f2,f3 facet.limit=10  : OK
q=*:* pivot=f1,f2,f3 facet.limit=11  : NOT OK
...
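
For reference, the shorthand above corresponds to a request of the following 
form, assuming the facet.pivot syntax from trunk (host and port are 
illustrative) :

http://localhost:8983/solr/select?q=*:*&facet=true&facet.pivot=f1,f2,f3&facet.limit=9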

It really looks like a cache issue.
After some retries, I can finally obtain my results, and not an HTTP 500.

Once I obtain my results, I can ask for more if I wait a little.
That's very odd.

So before I continue, here is my query configuration (reconstructed : the 
archive stripped most of the XML tags, and missing attributes are shown as "...") :

<maxBooleanClauses>1024</maxBooleanClauses>
<filterCache ... autowarmCount="0"/>
<queryResultCache ... autowarmCount="0"/>
<documentCache ... autowarmCount="0"/>
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
<listener event="newSearcher" ...>
  warming query : q=solr rocks, start=0, rows=10
</listener>
<listener event="firstSearcher" ...>
  warming query : "static firstSearcher warming query from solrconfig.xml"
</listener>
<useColdSearcher>false</useColdSearcher>
<maxWarmingSearchers>2</maxWarmingSearchers>

That's very much like the default configuration.

I guess that the default cache configuration is not perfectly suitable 
for facet pivoting, so any hint on how to tweak it right is welcome.


Kind regards,

--
Tanguy

On 02/15/2011 06:05 PM, Erick Erickson wrote:

To get meaningful help, you have to post a minimum of:
1>  the relevant schema definitions for the field that makes it blow
up, including the <field> and <fieldType> tags.
2>  the query you used, with some indication of the field that makes it blow up.
3>  What version you're using
4>  any changes you've made to the standard configurations.
5>  whether you've recently installed a new version.

It might help if you reviewed: http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Tue, Feb 15, 2011 at 11:27 AM, Ravish Bhagdev
  wrote:

Looks like its a bug?  Is it not?

Ravish

On Tue, Feb 15, 2011 at 4:03 PM, Ravish Bhagdev wrote:


When I include some of the fields in my search query, I get:

SEVERE: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.solr.common.util.ConcurrentLRUCache$CacheEntry;
  at org.apache.solr.common.util.ConcurrentLRUCache$PQueue.myInsertWithOverflow(ConcurrentLRUCache.java:377)
  at org.apache.solr.common.util.ConcurrentLRUCache.markAndSweep(ConcurrentLRUCache.java:329)
  at org.apache.solr.common.util.ConcurrentLRUCache.put(ConcurrentLRUCache.java:144)
  at org.apache.solr.search.FastLRUCache.put(FastLRUCache.java:131)
  at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:904)
  at org.apache.solr.handler.component.PivotFacetHelper.doPivots(PivotFacetHelper.java:121)
  at org.apache.solr.handler.component.PivotFacetHelper.doPivots(PivotFacetHelper.java:126)
  at org.apache.solr.handler.component.PivotFacetHelper.process(PivotFacetHelper.java:85)
  at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:84)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:231)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1298)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:340)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
  at org.

Re: Selecting (and sorting!) by the min/max value from multiple fields

2011-04-20 Thread Tanguy Moal

Hello,

Have you tried reading : 
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function


From that page I would try something like :
http://host:port/solr/select?q=sony&sort=min(min(priceCash,priceCreditCard),priceCoupon)+asc&rows=10&indent=on&debugQuery=on

Is that of any help ?

--
Tanguy

On 04/20/2011 09:41 AM, jmaslac wrote:

Hello,

short question is this - is there a way for a search to return a field that
is not defined in the schema but is the minimum/maximum value of several
(int/float) fields in a SolrDocument? (and what would that search look like?)

Longer explanation. I have products and each of them can have a several
prices (price for cash, price for credit cards, coupon price and so on) -
not every product has all the price options. (Don't ask why - that's the use
case:) )




(the example price field definitions were stripped by the archive) +2 more

Is there a way to ask "give me the products containing for example 'sony'
and in the results return me the minimal price of all possible prices (for
each product) and SORT the results by that (minimal) price"?

I know I can calculate the minimal price at import/index time and store it
in one separate field but the idea is that users will have checkboxes in
which they could say - i'm only interested in products that have the
priceCreditCard and priceCoupon, show me the smaller of those two and sort
by that value.

My idea is something like this:
?q=sony&minPrice:min(priceCash,priceCreditCard,priceCoupon...)
(the field minPrice is not defined in schema but should return in the
results)

For searching this actually doesn't represent a problem as I can easily
compare the prices programmatically and present them to the user. The problem
is sorting - I could also do that programmatically but that would mean that
I'd have to pull out all the results the query returned (which can of course
be quite big) and then sort them, so that's an option I would naturally like to
avoid.

Don't know if I'm asking too much of solr:) but I can see the usefulness of
something like this in use cases other than mine.
Hope the question is clear, and if I'm going about things completely the
wrong way please point me in the right direction.
(If there is a similar question asked somewhere else please redirect me - i
didn't find it)

Help much appreciated!

Josip

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Selecting-and-sorting-by-the-min-max-value-from-multiple-fields-tp2841944p2841944.html
Sent from the Solr - User mailing list archive at Nabble.com.



--
--
Tanguy



Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-05-25 Thread Tanguy Moal

Dear list,

I'm posting here after some unsuccessful investigations.
In my setup I push documents to Solr using the StreamingUpdateSolrServer.

I'm sending a comfortable initial amount of documents (~250M) and wished 
to perform overwriting of duplicated documents at index time, during the 
update, taking advantage of the UpdateProcessorChain.


At the beginning of the indexing stage, everything is quite fast; 
documents arrive at a rate of about 1000 doc/s.
The only extra processing during the import is the computation of a couple 
of hashes that are used to uniquely identify documents given their 
content, using both stock (MD5Signature) and custom (derived from 
Lookup3Signature) update processors.

I send a commit command to the server every 500k documents sent.
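
For reference, a minimal sketch of what the push client does, assuming an 
illustrative Solr URL, queue size and thread count (the actual document source 
is stubbed out) :

import java.util.Collections;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkPusher {
    public static void main(String[] args) throws Exception {
        // queue up to 10000 documents, flushed by 4 background threads
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 10000, 4);
        long count = 0;
        for (SolrInputDocument doc : fetchDocuments()) {
            server.add(doc); // goes through the update processor chain server-side
            if (++count % 500000 == 0) {
                server.commit(); // commit every 500k documents, as described above
            }
        }
        server.commit();
    }

    // placeholder for the real document source
    private static Iterable<SolrInputDocument> fetchDocuments() {
        return Collections.<SolrInputDocument>emptyList();
    }
}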

During a first period, the server is CPU bound. After a short while (~10 
minutes), the rate at which documents are received starts to fall 
dramatically, the server being IO bound.
At first I thought it was a normal speed decrease during the commit, 
while my push client is waiting for the flush to occur. That would have 
been a normal slowdown.


What caught my attention was that, unexpectedly, the 
server was performing a lot of small reads, many more than the writes, 
which seem to be larger.
The combination of the many small reads with the constant amount of 
bigger writes seems to be creating a lot of IO contention on my commodity 
SATA drive, and the ETA of my index build started to increase scarily =D


I then restarted the JVM with JMX enabled so I could start investigating 
a little bit more. I then realized that the UpdateHandler was 
performing many reads while processing the update request.


Are there any known limitations around the UpdateProcessorChain, when 
overwriteDupes is set to true ?
I turned that off, which of course breaks the intent of my index, but 
it's good for comparison purposes.


That did the trick, indexing is fast again, even with the periodic commits.

I therefore have two questions, an interesting first one and a boring 
second one :


1 / What's the workflow of the UpdateProcessorChain when one or more 
processors have overwriting of duplicates turned on ? What happens under 
the hood ?


I tried to answer that myself looking at DirectUpdateHandler2 and my 
understanding stopped at the following :

- The document is added to the lucene IW
- The duplicates are deleted from the lucene IW
The dark magic I couldn't understand seems to occur around the idTerm 
and updateTerm things, in the addDoc method. The deletions seem to be 
buffered somewhere, I just didn't get it :-)


I might be wrong since I didn't read the code further than that, but the 
point might be in how Solr handles deletions, which is something 
still unclear to me. Anyway, a lot of reads seem to occur for that 
precise task and it tends to produce a lot of IO, killing indexing 
performance when overwriteDupes is on. I don't even understand why so 
many read operations occur at this stage since my process has a 
comfortable amount of RAM (Xms=Xmx=8GB), with only 4.5GB used 
so far.


Any help, recommendation or idea is welcome :-)

2 / In case there isn't a simple fix for this, I'll have to live with 
duplicates in my index. I don't mind since Solr offers a great grouping 
feature, which I already use in some other applications. The only thing 
I don't know yet is whether, if I rely on grouping at search time in 
combination with the Stats component (which is the intent of that 
index), limiting the results to 1 document per group, the computed 
statistics will take those duplicates into account or not ? In short, 
how well does the Stats component behave when combined with hits collapsing ?


I had initially implemented my solution using overwriteDupes because it 
would have reduced both the target size of my index and the complexity 
of the queries used to obtain statistics on the search results, all at once.


Thank you very much in advance.

--
Tanguy



Re: how can i index data in different documents

2011-05-26 Thread Tanguy Moal

Hi Romi,

A simple way to do so is to define in your schema.xml the union of all 
the columns you need plus a "type" field to distinguish your entities.


eg, In your DB

table1 :
- col1 : varchar
- col2 : int
- col3 : float
table2 :
- col1 : int
- col2 : varchar
- col3 : int
- col4 : varchar

in solr's schema :

field name="table1_col1" type="text"
field name="table1_col2" type="int"
field name="table1_col3" type="float"
field name="table2_col1" type="int"
field name="table2_col2" type="text"
field name="table2_col3" type="int"
field name="table2_col4" type="string"
field name="type" type="string" required="true" multivalued="false"

Ensure that when you add your documents, their "type" value is 
set to either "table1" or "table2".
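
At query time you can then restrict a search to one entity or the other with a 
filter query, for example (host and port are illustrative) :

http://localhost:8983/solr/select?q=whatever&fq=type:table1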


That's a possibility amongst others.

--
Tanguy

On 05/26/11 14:57, Romi wrote:

Hi, I was not getting a reply to this post, so I am reposting it here;
please reply.

In my database I have two types of entity: customer and product. I want to
index customer-related information in one document and product-related
information in another document. Is it possible via Solr, and if so how can I
achieve this?

Thanks & Regards
Romi.

-
Romi
--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-can-i-index-data-in-different-documents-tp2988621p2988621.html
Sent from the Solr - User mailing list archive at Nabble.com.



--
--
Tanguy



Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-05-30 Thread Tanguy Moal

Hello,

Sorry for re-posting this, but it seems my message got lost in the 
mailing list's message stream without catching anyone's attention... =D


Shortly, has anyone already experienced dramatic indexing slowdowns 
during large bulk imports with overwriteDupes turned on and a fairly 
high duplicates rate (around 4-8x) ?


It seems to produce a lot of deletions, which in turn appear to make the 
merging of segments pretty slow, by significantly increasing the number of 
small read operations occurring simultaneously with the regular large 
write operations of the merge. Added to the poor IO performance of a 
commodity SATA drive, indexing takes ages.


I temporarily bypassed that limitation by disabling the overwriting of 
duplicates, but that changes the way I query the index, requiring me 
to turn on field collapsing at search time.


Is this a known limitation ?

Does anyone have a few hints on how to optimize the handling of index-time 
deduplication ?


More details on my setup and the state of my understanding are in my 
previous message here-after.


Thank you very much in advance.

Regards,

Tanguy

On 05/25/11 15:35, Tanguy Moal wrote:

[original message quoted in full; see the first post of this thread above]

Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-06-01 Thread Tanguy Moal

Lee,

Thank you very much for your answer.

Using the signature field as the uniqueKey is indeed what I was 
doing, so the "overwriteDupes=true" parameter in my solrconfig was 
somewhat redundant, although I wasn't aware of it! =D


In practice it works perfectly and that's the nice part.
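
For reference, a minimal sketch of that kind of setup, following the wiki's 
Deduplication page (the field names and the list of hashed fields are 
illustrative) :

In solrconfig.xml :

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

In schema.xml the signature field is then declared as the uniqueKey :

<uniqueKey>signature</uniqueKey>

so adding a document whose signature already exists simply overwrites the 
previous one, without any explicit duplicate deletion.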

By the way, I wonder what happens when we enter the following code 
snippet when the id field is the same as the signature field, from 
addDoc@DirectUpdateHandler2(AddUpdateCommand) :

  if (del) { // ensure id remains unique
      BooleanQuery bq = new BooleanQuery();
      bq.add(new BooleanClause(new TermQuery(updateTerm), Occur.MUST_NOT));
      bq.add(new BooleanClause(new TermQuery(idTerm), Occur.MUST));
      writer.deleteDocuments(bq);
  }

Maybe all my problems started from here...

When I have some time, I'll try to reproduce using a different uniqueKey field 
and turning overwriteDupes back to "on", to see if the problem was caused by 
the signature field being the same as the uniqueKey field *and* having 
overwriteDupes on. If so, maybe a simple configuration check should be 
performed to avoid the issue. Otherwise it means that having overwriteDupes 
turned on simply doesn't scale, and that should be added to the wiki's 
Deduplication page, IMHO.


Thank you again.
Regards,

--
Tanguy

On 31/05/2011 14:58, lee carroll wrote:

Tanguy

You might have tried this already, but can you set overwriteDupes to
false and set the signature key to be the id? That way solr
will manage updates.

from the wiki

http://wiki.apache.org/solr/Deduplication



HTH

Lee


On 30 May 2011 08:32, Tanguy Moal  wrote:

[previous messages quoted in full; see above]


"Virtual field", Statistics

2010-10-14 Thread Tanguy Moal
Dear solr-user folks,

I would like to use the stats module to perform very basic statistics
(mean, min and max) which is actually working just fine.

Nevertheless I found a little limitation that bothers me a tiny bit :
how to perform the exact same statistics, but on the result of a
function query rather than on a field.

Example :
schema :
- string : id
- float : width
- float : height
- float : depth
- string : color
- float : price

What I'd like to do is something like :
select?q=price:[45.5 TO 99.99]&stats=on&stats.facet=color&stats.field={volume=product(product(width,height),depth)}
I would expect to obtain the usual stats output (mean, min, max, count, and so
on) computed for the virtual "volume" field, plus the same statistics broken
down per color facet value (the sample XML response was stripped by the
archive).

Of course computing the volume can be performed before indexing the data,
but defining virtual fields on the fly given an arbitrary function is
powerful, and I am confident that many others would
appreciate it. Especially for BI needs and so on... :-D
Is there an easy way to do it that I have not been able to
find, or is it actually impossible ?

Thank you very much in advance for your help.

--
Tanguy


Re: "Virtual field", Statistics

2010-10-18 Thread Tanguy Moal
Hello Lance, thank you for your reply.

I created the following JIRA issue:
https://issues.apache.org/jira/browse/SOLR-2171, as suggested.

Can you tell me how new issues are handled by the development teams,
and whether there's a way I could help/contribute ?

--
Tanguy

2010/10/16 Lance Norskog :
> Please add a JIRA issue requesting this. A bunch of things are not
> supported for functions: returning as a field value, for example.
>
> On Thu, Oct 14, 2010 at 8:31 AM, Tanguy Moal  wrote:
>> [original message quoted in full; see above]
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: [Wildcard query] Weird behaviour

2010-12-03 Thread Tanguy Moal
Thank you very much Robert for replying that fast and accurately.

I actually have another idea in mind to provide similar
suggestions less expensively; I was weighing the workaround option
against reporting the issue.

I don't regret it since you came up with a possible fix. I'll give it a
try as soon as possible and let the list know.

Regards,

Tanguy

2010/12/3 Robert Muir :
> Actually, i took a look at the code again, the queries you mentioned:
> "I send queries to that field in the form (*term1*term2*)"
>
> I think the patch will not fix your problem... The only way i know you
> can fix this would be to upgrade to lucene/solr trunk, where wildcard
> comparison is linear to the length of the string.
>
> In all other versions, it has a much worse runtime, and that's what you
> are experiencing.
>
> Separately, even better than this would be to see if you can index
> your content in a way to avoid these expensive queries. But this is
> just a suggestion, what you are doing should still work fine.
>
> On Fri, Dec 3, 2010 at 6:56 AM, Robert Muir  wrote:
>> On Fri, Dec 3, 2010 at 6:28 AM, Tanguy Moal  wrote:
>>> However suddenly CPU usage simply doubles, and sometimes eventually
>>> starts using all 16 cores of the server, whereas the number of handled
>>> requests is pretty stable, and even starts decreasing because of
>>> degraded user experience due to dramatic response times.
>>>
>>
>> Hi Tanguy: This was fixed here:
>> https://issues.apache.org/jira/browse/LUCENE-2620.
>>
>> You can apply the patch file there
>> (https://issues.apache.org/jira/secure/attachment/12452947/LUCENE-2620_3x.patch)
>> and recompile your own lucene 2.9.x, or you can replace the lucene jar
>> file in your solr war with the newly released lucene-2.9.4 core jar...
>> which I think is due to be released later today!
>>
>> Thanks for spending the time to report the problem... let us know if the
>> patch/lucene 2.9.4 doesn't fix it!
>>
>


Re: Autosuggest terms which GOOGLE uses?

2010-12-08 Thread Tanguy Moal
Kind of : their suggestions are based on users' queries with some filtering.
You can have a little read there :
http://www.google.com/support/websearch/bin/answer.py?hl=en&answer=106230

They perform "little" filtering to remove offending content such as
"hate speech, violence and pornography" (quoting the page).
You can also have a look at this slideshow :
http://www.slideshare.net/sturlese/use-ofsolrattrovitclassifiedads-marcsturlese
.

You'll see how they build their suggest service using a dedicated solr instance.

Hope this helps ;-)

--
Tanguy

2010/12/8 Anurag :
>
> How does Google select the autosuggest terms? Does Google use "Users'
> Queries" from log files to suggest only those terms?
>
> -
> Kumar Anurag
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Autosuggest-terms-which-GOOGLE-uses-tp2039078p2039078.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Google like search

2010-12-14 Thread Tanguy Moal
Hi Satya,

I think what you're looking for is called "highlighting", in the sense
of "highlighting" the query terms in their matching context.

You could start by googling "solr highlight", surely the first results
will make sense.

Solr's wiki results are usually a good entry point :
http://wiki.apache.org/solr/HighlightingParameters .
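
For example, a first attempt could look like this (the field name, host and
port are illustrative) :

http://localhost:8983/solr/select?q=your+terms&hl=true&hl.fl=content&hl.snippets=2&hl.fragsize=100

hl.fl selects the field(s) to highlight, while hl.snippets and hl.fragsize
control how many fragments are returned and how large they are.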

Maybe I misunderstood your question, but I hope that'll help...

Regards,

Tanguy


2010/12/14 satya swaroop :
> Hi All,
> Can we get results like Google's, having some data about the
> search... I was able to get the data that is the first 300 characters of a
> file, but it is not helpful for me; can I get the data that contains the
> first matched keyword in that file
>
> Regards,
> Satya
>


Re: Google like search

2010-12-14 Thread Tanguy Moal
Satya,

In fact the highlighter will select the relevant part of the whole
text and return it with the matched terms highlighted.

If you do so for a whole book, you will face the issue spotted by Dave
(too long text).

To address that issue, you have the possibility to split your book in
chapters, and index each chapter as a unique document.

You would then be interested in adding a field to identify uniquely
each book (using ISBN number for example) and turn on grouping (or
collapsing) on that field ... (see this very good blog post :
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/
)

Moreover, you might be interested in the following JIRA issue :
https://issues.apache.org/jira/browse/SOLR-2272 . Using this patch,
you could for example ensure that if a given document-chapter is
selected by the query, then another (or several) document(s) (maybe a
parent "book-document", or all the other chapters) get selected along
the way (by doing a self-join on the ISBN number). Here again,
grouping afterward would return one group of documents representing each
book.

Good luck!

--
Tanguy

2010/12/14 Dave Searle :
> Highlighting is exactly what you need, although if you highlight the whole 
> book, this could slow down your queries. Index/store the first 5000-1 
> characters and see how you get on
>
> -Original Message-
> From: satya swaroop [mailto:satya.yada...@gmail.com]
> Sent: 14 December 2010 10:08
> To: solr-user@lucene.apache.org
> Subject: Re: Google like search
>
> Hi Tanguy,
> I am not asking for highlighting.. I think it can be
> explained with an example.. Here I illustrate it ::
>
> when I post a query like this ::
>
> http://localhost:8080/solr/select?q=Java&version=2.2&start=0&rows=10&indent=on
>
> I would be getting the result as follows ::
>
> -
> -
> 0
> 1
> 
> -
> -
> Java%20debugging.pdf
> 122
> -
> -
> Table of Contents
> If you're viewing this document online, you can click any of the topics
> below to link directly to that section.
> 1. Tutorial tips 2
> 2. Introducing debugging  4
> 3. Overview of the basics 6
> 4. Lessons in client-side debugging 11
> 5. Lessons in server-side debugging 15
> 6. Multithread debugging 18
> 7. Jikes overview 20
> 
> 
> 
> 
> 
>
> Here the str field contains the first 300 characters of the file, as I kept a
> field that copies only the first 300 characters in schema.xml...
> But I don't want the content like this. Is there any way to make an output as
> follows ::
>
> Java is one of the best languages, java is easy to learn...
>
> where this content is from the start of the chapter where the first
> occurrence of the word "java" is found in the file...
>
> Regards,
> Satya
>


Re: Google like search

2010-12-14 Thread Tanguy Moal
To do so, you have several possibilities, I don't know if there is a best one.

It depends pretty much on the format of the input file(s), your
affinity with a given programming language, some libraries you might
need, and the time you're ready to spend on this task.

Consider having a look at SolrJ  (http://wiki.apache.org/solr/Solrj)
or at the DataImportHandler
(http://wiki.apache.org/solr/DataImportHandler) .

Cheers,

--
Tanguy

2010/12/14 satya swaroop :
> Thanks for your reply. Sorry to ask this type of question:
> how can we index each chapter of a file as a separate document? As far as I know,
> we just give the path of a file to Solr to index it... Can you provide me any
> sources on this... I mean any blogs or wikis...
> sources for this type... I mean any blogs or wiki's...
>
> Regards,
> satya
>


Re: PHPSolrClient

2010-12-16 Thread Tanguy Moal
Hi Dennis,

This is not particular to the client you use (solr-php-client) for sending
documents : think of an update as an overwrite.

This means that if you update a particular document, the previous
version indexed is lost.
Therefore, when updating a document, make sure that all the fields to
be indexed and retrieved are present in the update.

For an update to occur, only the uniqueKey id (as specified in your
schema.xml) has to be the same as that of the document you want to update.

In short, an update is like an add (and is performed the same way), except
that the added document was previously indexed. It simply gets
replaced by the update.
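
For example (illustrative id and field names), posting the following document
twice to /update, with different values the second time, leaves a single
document in the index, the latest version :

<add>
  <doc>
    <field name="id">42</field>
    <field name="title">this version replaces any previous document with id 42</field>
  </doc>
</add>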

Hope that helps,

--
Tanguy

2010/12/16 Dennis Gearon :
> First of all, it's a very nice piece of work.
>
> I am just getting my feet wet with Solr in general, so I'm not even sure
> how a
> document is NORMALLY deleted.
>
> The library's PHPDocs mention 'add', 'get', 'delete', but does anyone know about
> 'update'?
>  (obviously one can read-delete-modify-create)
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a 
> better
> idea to learn from others’ mistakes, so you do not have to make them yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>


Re: Tuning StatsComponent

2011-01-10 Thread Tanguy Moal
Hello,

You could try taking advantage of Solr's faceting feature : provided
that you have the amount stored in the amount field and the currency stored
in the currency field, try the following request :
http://host:port/solr/select?q=YOUR_QUERY&stats=on&stats.field=amount&f.amount.stats.facet=currency&rows=0

You'll get a top-level stats node with meaningless numbers, because of the mixed
currencies, but below that, you'll have one stats node per currency.

Alternatively, you could try indexing the different currencies in separate
fields (e.g. amount_usd, amount_eur, ...) and send your queries that way :

http://host:port/solr/select?q=amount_usd:*+OR+amount_eur:*[+OR+amount_...:*]&stats=on&stats.field=amount_usd&stats.field=amount_eur[&stats.field=amount_...]&rows=0

That way, in one query you'll get everything you want, except that you
can't trust the "missing" count for each sum computed. Maybe your query isn't a
"select all" one, in which case you should get results even faster.

Hope that helps a little...

--
Tanguy

2011/1/10 stockii 

>
> Hello.
>
> I'm using the StatsComponent to get the sum of amounts, but Solr's
> StatsComponent is very slow on a huge index of 30 million documents. How
> can
> I tune the StatsComponent ?
>
> the problem is that I have 5 currencies and I need to send a new request
> for each
> currency. That makes the Solr search sometimes very slow. =(
>
> any ideas ?
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tuning-StatsComponent-tp2225809p2225809.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: best way for sum of fields

2011-11-07 Thread Tanguy Moal

Hi,

If you only need to sum over "displayed" results, go with the 
post-processing of hits solution; that's fast and easy.
If you sum over the whole data set (i.e. your sum is not query 
dependent), have it computed at indexing time, depending on your 
indexing workflow.


Otherwise (sum over the whole result set, query dependent but 
independent of the displayed results), you should give sharding a try...
You generally want that when your index is too large to be searched 
quickly (see http://wiki.apache.org/solr/DistributedSearch); here the 
sum operation is part of a search query.


Basically what you need is:
- On the master host : n master instances (each being a shard)
- On the slave host : n slave instances (each being a replica of its 
master-side counterpart)


Only the slave instances will need a comfortable amount of RAM in order 
to serve queries rapidly. Slave instances can be deployed over several 
hosts if the total amount of RAM required is high.


Your main effort here might be in finding the 'n' value.
You have 45M documents in a single shard and that may be the cause of 
your issue, especially for queries returning a high number of results.

You may need to split it into more shards to achieve your goal.

This should enable you to reduce the time needed to perform the sum operation 
at search time (but it adds complexity at data indexing time : you need to 
define a way to send documents to shard #1, #2, ..., or #n).
If you keep getting more and more documents over time, maybe you'll want 
to have a fixed maximum shard size (say 5M docs, if performing the sum 
on 5M docs is fast enough) and simply add shards as required, when more 
documents are to be indexed/searched. This addresses the importing issue 
because you'll simply need to change the target shard every 5M documents.

The last shard is always the smallest.

Such sharding can involve a little overhead at search time : make sure 
you don't allow deep paging (start=k, where k is high -- see 
http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations).
-> When using the stats component, have the start and rows parameters set to 0 
if you don't need the documents themselves.
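
For illustration, a distributed request combining sharding with the stats 
component could look like this (host names and core names are assumptions) :

http://slave1:8983/solr/shard1/select?q=*:*&shards=slave1:8983/solr/shard1,slave1:8983/solr/shard2&stats=on&stats.field=amount&rows=0

Each shard computes its partial result and the core receiving the request 
merges them.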



After that, if you face high search load issues, you could still 
duplicate the slave host to match your load requirements, and 
load-balance your search traffic over slaves as required.


Hope this helps,

Tanguy

Le 07/11/2011 09:49, stockii a écrit :

Sorry.

I need the sum of the values of the found documents, e.g. the total amount for
one day. Each doc in the index has its own amount.

I tried something with the StatsComponent but with 48 million docs in the index
it's too slow.

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores,
1 Core with 45 Million Documents other Cores<  200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/best-way-for-sum-of-fields-tp3477517p3486406.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: best way for sum of fields

2011-11-07 Thread Tanguy Moal

Hi again,
Since you have a custom high availability solution over your solr 
instances, I can't help much I guess... :-)


I usually rely on master/slave replication to separate index build and 
index search processes.


The fact is that resource consumption at build time and search time is 
not necessarily the same, and therefore hardware dimensioning can be 
adapted as required.
I like to have the service-related processes isolated and easy to deploy 
wherever needed, just in case things go wrong or hardware failures occur.
Build services on the other hand don't have the same availability 
constraints and can be off for a while with no issue (unless near-realtime 
indexing comes into play, but that's another thing).


In a slave configuration, the index doesn't need to commit. It simply 
replicates its data from its associated master whenever the master 
changes and performs a reopen of the searcher. "Change" events can be 
triggered at commit, startup and / or optimize (see 
http://wiki.apache.org/solr/SolrReplication , although you seemed not to be 
interested in this feature :) ).
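
For reference, a minimal sketch of such a replication setup (URLs, core names 
and the polling interval are illustrative) :

On the master, in solrconfig.xml :

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

On the slave :

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://masterhost:8983/solr/corename/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>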


Having search and build on the same host is not a bad thing in itself.
It simply depends on available resources and on build vs service load 
requirements.
For example, with a big core such as the one you have, segment merging 
can occur from time to time, which is an IO-bound operation 
(i.e. its duration depends on disk performance). Under high IO load, a 
server can become less responsive, and therefore having the service 
separated from the build could become handy at that time.


As you see, I can't tell you what makes sense and what doesn't.
It's all about what you're doing, at which frequency, etc. :-)

Regards,

Tanguy

Le 07/11/2011 12:12, stockii a écrit :

Hi, thanks for the big reply ;)

I had the idea of several small 5M shards too,
and I think that's the next step I have to take, because our biggest index
grows by about 50K documents each day.
But does it make sense to keep searcher AND updater cores on one big server? I
don't want to use replication, because it is not possible with our own high
availability solution.

My system is split into searcher and updater cores, each with its own index.
Some search requests go over all of these 8 cores with distributed search.



-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores,
1 Core with 45 Million Documents other Cores<  200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/best-way-for-sum-of-fields-tp3477517p3486652.html
Sent from the Solr - User mailing list archive at Nabble.com.




Core reload vs servlet container restart

2011-11-10 Thread Tanguy Moal

Dear list,
I've experienced a weird (unexpected?) behaviour concerning core reload 
on a master instance.


My setup :
master/slave on separate hosts.

On the master, I update the schema.xml file, adding a dynamic field of 
type random sort field.


I reload the master using core admin.

The new field is *not* taken into account.

I restart the servlet container (jetty in my case).

The new field is taken into account, I can perform random sort operations.

On the slave side, no problem : at startup the schema.xml was replicated, 
the core was reloaded, and I was able to perform random sorts as well.


Now the question is : what was wrong with the core reload on the master ?
The output it gave me was something like : "sort param field can't be 
found : ${fieldName}".
At this point, in admin/schema view, the schema showing up was indeed 
showing the freshly added dynamic field.

I had to restart jetty (not a big issue here, but just to be sure).

Thanks!

--
Tanguy