Re: Upgrading Tika in Solr

2010-02-17 Thread Liam O'Boyle
I just copied in the newer .jars and got rid of the old ones and
everything seemed to work smoothly enough.

Liam

On Tue, 2010-02-16 at 13:11 -0500, Grant Ingersoll wrote:
> I've got a task open to upgrade to 0.6.  Will try to get to it this week.  
> Upgrading is usually pretty trivial.
> 
> 
> On Feb 14, 2010, at 12:37 AM, Liam O'Boyle wrote:
> 
> > Afternoon,
> > 
> > I've got a large collections of documents which I'm attempting to add to
> > a Solr index using Tika via the ExtractingRequestHandler, but there are
> > a large number that it has problems with (PDFs, PPTX and XLS documents
> > mainly).  
> > 
> > I've tried them with the most recent stand alone version of Tika and it
> > handles most of the failing documents correctly.  I tried using a recent
> > nightly build of Solr, but the same problems seem to occur.
> > 
> > Are there instructions somewhere on installing a more recent Tika build
> > into Solr?
> > 
> > Thanks,
> > Liam
> > 
> > 
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem using Solr/Lucene: 
> http://www.lucidimagination.com/search
> 




Re: Tomcat vs Jetty: A Comparative Analysis?

2010-02-17 Thread Ron Chan

probably not 

If there is no need to embed or programmatically start and stop the server, then 
Tomcat would be the safe choice: it's probably easier to get going with, and 
you'll find a lot more information about it.

- Original Message - 
From: "Steve Radhouani"  
To: solr-user@lucene.apache.org 
Sent: Wednesday, 17 February, 2010 7:24:01 AM 
Subject: Re: Tomcat vs Jetty: A Comparative Analysis? 

Thanks Ron. Actually, I'm developing a Web search engine. Would that 
matter? 

Thanks. 

2010/2/16 Ron Chan  

> 
> I'd doubt if a performance benchmark would be very useful, it ultimately 
> depends on what you are trying to do and what you are comfortable with. 
> 
> We've had successful deployments on both. 
> 
> Any difference in performance is far outweighed by ease of setup/support 
> that you personally find in each. 
> 
> There is far more "knowledge" around Tomcat, but Jetty is more lightweight 
> and real easy to embed. 
> 
> Ron 
> 
> - Original Message - 
> From: "Steve Radhouani"  
> To: solr-user@lucene.apache.org 
> Sent: Tuesday, 16 February, 2010 12:38:04 PM 
> Subject: Tomcat vs Jetty: A Comparative Analysis? 
> 
> Hi there, 
> 
> Is there any analysis out there that may help to choose between Tomcat and 
> Jetty to deploy Solr? I wonder whether there's a significant difference 
> between them in terms of performance. 
> 
> Any advice would be much appreciated, 
> -Steve 
> 


Re: Tomcat vs Jetty: A Comparative Analysis?

2010-02-17 Thread Steve Radhouani
Thanks a lot Ron!

2010/2/17 Ron Chan 

>
> probably not
>
> if there is no need to embed or programmatically start and stop the server
> then Tomcat would be the safe choice, probably easier to get going with to
> start with and you'll find a lot more information about it
>
> - Original Message -
> From: "Steve Radhouani" 
> To: solr-user@lucene.apache.org
> Sent: Wednesday, 17 February, 2010 7:24:01 AM
> Subject: Re: Tomcat vs Jetty: A Comparative Analysis?
>
> Thanks Ron. Actually, I'm developing a Web search engine. Would that
> matter?
>
> Thanks.
>
> 2010/2/16 Ron Chan 
>
> >
> > I'd doubt if a performance benchmark would be very useful, it ultimately
> > depends on what you are trying to do and what you are comfortable with.
> >
> > We've had successful deployments on both.
> >
> > Any difference in performance is far outweighed by ease of setup/support
> > that you personally find in each.
> >
> > There is far more "knowledge" around Tomcat, but Jetty is more
> lightweight
> > and real easy to embed.
> >
> > Ron
> >
> > - Original Message -
> > From: "Steve Radhouani" 
> > To: solr-user@lucene.apache.org
> > Sent: Tuesday, 16 February, 2010 12:38:04 PM
> > Subject: Tomcat vs Jetty: A Comparative Analysis?
> >
> > Hi there,
> >
> > Is there any analysis out there that may help to choose between Tomcat
> and
> > Jetty to deploy Solr? I wonder whether there's a significant difference
> > between them in terms of performance.
> >
> > Any advice would be much appreciated,
> > -Steve
> >
>


Incremental Backup of Indexes

2010-02-17 Thread abhishes

Hello All,

If we have a very large index, how can I back it up incrementally (one full
backup followed by multiple incremental backups)?

How do I take compressed backups?


Do I have to roll out the backup infrastructure manually, or is there something
pre-built?

-- 
View this message in context: 
http://old.nabble.com/Incremental-Backup-of-Indexes-tp27621757p27621757.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: dataimporthandler and expungeDeletes=false

2010-02-17 Thread Jorg Heymans
Looking closer at the documentation, it appears that expungeDeletes in fact
has nothing to do with 'removing deleted documents from the index' as i
thought before:

http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22commit.22_and_.22optimize.22


expungeDeletes = "true" | "false" — default is false — merge segments with
deletes away.

Is this correct ?

FWIW I worked around the issue by adding a removed flag to my data and
sending <delete> and <commit> commands after the delta-import, but it would have
been so much nicer to be able to do this all from DIH.
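
Concretely, after the delta-import finishes I post roughly the following to
/update ('removed' being the flag field I added; adjust the query to your schema):

<delete><query>removed:true</query></delete>
<commit/>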

Has anybody been able to get deletedPkQuery to work for deleting documents
during delta import ?
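
For reference, the root entity in my data-config.xml looks roughly like this
(entity, table and column names simplified):

<entity name="item" pk="id"
        query="select id, title from item"
        deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'"
        deletedPkQuery="select id from item where deleted = 1">
</entity>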

Jorg

On Tue, Feb 16, 2010 at 3:57 PM, Jorg Heymans wrote:

> Hi,
>
> Can anybody tell me if [1] still applies as of version trunk 03/02/2010 ? I
> am removing documents from my index using deletedPkQuery and a deltaimport.
> I can tell from the logs that the removal seems to be working:
>
> 16-Feb-2010 15:32:54 org.apache.solr.handler.dataimport.DocBuilder
> collectDelta
> INFO: Completed parentDeltaQuery for Entity: attachment
> 16-Feb-2010 15:32:54 org.apache.solr.handler.dataimport.DocBuilder
> deleteAll
> INFO: Deleting stale documents
> 16-Feb-2010 15:32:54 org.apache.solr.handler.dataimport.SolrWriter
> deleteDoc
> INFO: Deleting document: 33053
> 16-Feb-2010 15:32:54 org.apache.solr.core.SolrDeletionPolicy onInit
> INFO: SolrDeletionPolicy.onInit: commits:num=1
>
>  
> commit{dir=D:\lib\apache-solr-1.5-dev\example\solr\project\data\index,segFN=segments_1y,version=1265210107838,generation=70,filenames=[_2v.prx,
> _2v.fnm, _2v.tis, _2v.fdt, _2v.frq, segments_1y, _2v.fdx, _2v.tii]
> 16-Feb-2010 15:32:54 org.apache.solr.core.SolrDeletionPolicy updateCommits
> INFO: newest commit = 1265210107838
> 16-Feb-2010 15:32:54 org.apache.solr.handler.dataimport.DocBuilder doDelta
> INFO: Delta Import completed successfully
> 16-Feb-2010 15:32:54 org.apache.solr.handler.dataimport.DocBuilder finish
> INFO: Import completed successfully
> 16-Feb-2010 15:32:54 org.apache.solr.update.DirectUpdateHandler2 commit
> INFO: start
> commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
> 16-Feb-2010 15:32:54 org.apache.solr.search.SolrIndexSearcher 
> INFO: Opening searc...@182c2d9 main
> 16-Feb-2010 15:32:54 org.apache.solr.update.DirectUpdateHandler2 commit
> INFO: end_commit_flush
>
> However when i search the index the removed data is still present,
> presumably because the DirectUpdateHandler2 does not automatically do
> expungeDeletes ? Can i configure this somewhere in solrconfig.xml (SOLR-1275
> was not very clear exactly what needs to be done to activate this behaviour)
> ?
>
> Thanks
> Jorg
>
> [1] http://marc.info/?l=solr-user&m=125962049425151&w=2
>


Re: Incremental Backup of Indexes

2010-02-17 Thread Jay Ess

abhishes wrote:

Hello All,

If we have very large index size, how can I back up incrementally. (one full
backup followed by multiple incremental backups).

How do I take compressed backups?
  

http://rsnapshot.org/


xml error when indexing

2010-02-17 Thread Jan Simon Winkelmann
Hi,

I'm having a strange problem when indexing data through our application. 
Whenever I post something to the update resource, I get

Unexpected character 'a' (code 97) in prolog; expected '<'  at [row,col 
{unknown-source}]: [1,1], 


Error 400 Unexpected character 'a' (code 97) in prolog; expected '<'
 at [row,col {unknown-source}]: [1,1]

HTTP ERROR 400
Problem accessing /solr/update. Reason:
Unexpected character 'a' (code 97) in prolog; expected '<'
 at [row,col {unknown-source}]: [1,1]Powered by 
Jetty://


However, when I post the same data from an xml file using curl it works.

The add command looks like this:

145405329411702010-02-16T15:30:02Z02010-02-16T15:30:02Z2019-12-31T00:00:00Z0145-4053294«Positives Gespräch» zwischen Bielefeld und DFL«Positives Gespräch» zwischen Bielefeld und 
DFLBielefeld (dpa) - Der finanziell 
angeschlagene Zweitligist Arminia Bielefeld hat der Deutschen Fußball Liga in 
Frankfurt/Main einen Maßnahmen-Katalog präsentiert. 

Bielefeld (dpa) - Der finanziell angeschlagene Zweitligist Arminia Bielefeld hat der Deutschen Fußball Liga in Frankfurt/Main einen Maßnahmen-Katalog präsentiert.

«Daran arbeiten wir derzeit mit Hochdruck», teilte Arminia-Geschäftsführer Heinz Anders mit. Die Arminia-Delegation, zu der noch Manager Detlev Dammeier, Aufsichtsratschef Norbert Leopoldseder und Finanz-Prokurist Henrik Wiehl gehörten, habe die Lage vor den DFL-Vertretern laut Anders «offen und transparent» analysiert. Es sei ein «sehr positives Gespräch gewesen». Die nicht näher erläuterten Maßnahmen müssten nun umgesetzt und bei der DFL entsprechend nachgewiesen werden.

Die DFL kommentierte das Zusammentreffen in ihrer Frankfurter Zentrale nicht. «Zu solchen Dinge äußern wir uns nicht», erklärte ein Sprecher auf Anfrage der Deutschen Presse-Agentur dpa.

Der frühere Erstligist Bielefeld hat Verbindlichkeiten und Schulden von rund 15,5 Millionen Euro. Im operativen Geschäft dieser Saison gibt es eine Finanzierungslücke von 2,5 Millionen Euro. Der Club hat sich vor allem mit dem Ausbau und der Modernisierung der SchücoArena übernommen. Zudem ist die Entwicklung bei den Zuschauer-Zahlen und den Sponsorzuwendungen nach dem Bundesliga-Abstieg unerfreulich. Allein für das Stadion sind noch 13 Millionen Euro zu tilgen. Der Verein denkt sogar an einen Verkauf der SchücoArena.

The system we run on is Solr 1.4 with Jetty Hightide 7.0.1. Am I missing something here? I would be glad for any help.

Best,
Jan

Need feedback on solr security

2010-02-17 Thread Vijayant Kumar
Hi Group,

I need some feedback on  solr security.

To make my Solr admin password protected,
I used the Path Based Authentication from
http://wiki.apache.org/solr/SolrSecurity.
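
The relevant fragment in the servlet container's web.xml looks roughly like this
(the URL patterns and role name here are just examples):

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Solr admin and update</web-resource-name>
    <url-pattern>/admin/*</url-pattern>
    <url-pattern>/update/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>solr-admin</role-name>
  </auth-constraint>
</security-constraint>
<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Solr</realm-name>
</login-config>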

In this way my admin area, search, delete, and add-to-index are protected. But now
that I have made Solr authenticated, every update/delete from the front
end is blocked without authentication.

I do not need this authentication from the front end, so I simply pass the
username and password to Solr in my front end scripts and it is
working fine. I have done it in the way below.

http://username:passw...@localhost:8983/solr/admin/update
I need your suggestions and feedback on the above method. Is it a feasible
and secure method? To overcome this issue, is there any alternate
method?




-- 

Thank you,
Vijayant Kumar
Software Engineer
Website Toolbox Inc.
http://www.websitetoolbox.com
1-800-921-7803 x211



solr word frequency

2010-02-17 Thread michaelnazaruk

Hi all! How can I get the frequency of a word in the index?
-- 
View this message in context: 
http://old.nabble.com/solr-word-frequency-tp27622615p27622615.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr word frequency

2010-02-17 Thread Steve Radhouani
Using the "Schema Browser" of the Solr interface or Luke you can get the
frequency of a word in a specific field, but I don't know how to get it in
the entire index. A "dirty" solution would be to create a new field and copy
in it all your existing fields (), and then search the frequency of a given word in the
new field.

That being said, the frequency is available in the _i.frq file under your
index directory; perhaps you find a way to read it (I didn't it).
-Steve


2010/2/17 michaelnazaruk 

>
> hi all! How I can get the frequency for word in index?
> --
> View this message in context:
> http://old.nabble.com/solr-word-frequency-tp27622615p27622615.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


scores are the same for many diferent documents

2010-02-17 Thread Marc Sturlese

Hey there,
I see that when Solr gives me back the scores in the response, they are the
same for many different documents.

I have built a simple index for testing purposes with just documents with
one field indexed with the standard analyzer and containing pieces of text.
I have done the same with a self coded simple lucene indexer.

Querying the Solr index with qt=standard&q=title:laptop will give me back
documents, some of them with exactly the same score.
Querying the Lucene index (with a simple self-coded search app) with
title:laptop will give me back no equal scores.

When building the Solr index I have tried both omitNorms=true and
omitNorms=false. It will give me different scores, but in both cases there
are some equal scores.

I am testing this because I have a Solr component with a
FieldComparatorSource which uses the scores and other external factors for
the sorting. Having the same score for different documents, combined with
external factors, may give me back results in an unexpected, undesired order.
-- 
View this message in context: 
http://old.nabble.com/scores-are-the-same-for-many-diferent-documents-tp27623039p27623039.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Need feedback on solr security

2010-02-17 Thread Xavier Schepler

Vijayant Kumar wrote:

Hi Group,

I need some feedback on  solr security.

For Making by solr admin password protected,
 I had used the Path Based Authentication form
http://wiki.apache.org/solr/SolrSecurity.

In this way my admin area,search,delete,add to index is protected.But Now 
when I make solr authenticated then for every update/delete from the fornt

end is blocked without authentication.

I do not need this authentication from the front end so I simply pass the
username and password to the solr in my fornt end scripts and it is
working fine. I had done it in the below way.

http://username:passw...@localhost:8983/solr/admin/update
I need your suggestion and feed back on the above method.Is it fessiable
method and secure? TO over come from this issue is there any alternate
method?

Hey,

there is at least one other solution: you can set a firewall rule that
allows connections to Solr's port only from trusted IPs.




Re: solr word frequency

2010-02-17 Thread michaelnazaruk

The Schema Browser and Luke don't fit, because I need to get the frequency for a
selected word in my code. Luke displays only the first 10 words. I tried to change
some configs in solrconfig and in the schema, but it didn't help me. Maybe there is
another way to get the frequency for a word?
-- 
View this message in context: 
http://old.nabble.com/solr-word-frequency-tp27622615p27623246.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Need feedback on solr security

2010-02-17 Thread Vijayant Kumar
Hi Xavier,

Thanks for your feedback
The firewall rule for trusted IPs is not feasible for us because the
application is open to the public, so we cannot work through IP banning.
> Vijayant Kumar wrote:
>> Hi Group,
>>
>> I need some feedback on  solr security.
>>
>> For Making by solr admin password protected,
>>  I had used the Path Based Authentication form
>> http://wiki.apache.org/solr/SolrSecurity.
>>
>> In this way my admin area,search,delete,add to index is protected.But
>> Now
>> when I make solr authenticated then for every update/delete from the
>> fornt
>> end is blocked without authentication.
>>
>> I do not need this authentication from the front end so I simply pass
>> the
>> username and password to the solr in my fornt end scripts and it is
>> working fine. I had done it in the below way.
>>
>> http://username:passw...@localhost:8983/solr/admin/update
>> I need your suggestion and feed back on the above method.Is it fessiable
>> method and secure? TO over come from this issue is there any alternate
>> method?
> Hey,
>
> there is at least another solution. You can set a firewall rule that
> allow  connections to the Solr's port only from trusted IPs.
>


-- 

Thank you,
Vijayant Kumar
Software Engineer
Website Toolbox Inc.
http://www.websitetoolbox.com
1-800-921-7803 x211



Re: solr word frequency

2010-02-17 Thread Steve Radhouani
In the Schema Browser, you can specify the top X terms you want to display.
Here's what you see in the browser: Docs, Distinct, and Top Terms.
Thus, you can get the frequency of a given word, even though it's not the
most elegant solution.

2010/2/17 michaelnazaruk 

>
> Schema browser and Luke don't fit! Because I need get frequency for
> selected
> word in my code! In Luke display only first 10 words! I try to change some
> configs in solrconfig and in schema but it don't help me! Maybe there are
> another way to get frequency for word?
> --
> View this message in context:
> http://old.nabble.com/solr-word-frequency-tp27622615p27623246.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Need feedback on solr security

2010-02-17 Thread Xavier Schepler

Vijayant Kumar wrote:

Hi Xavier,

Thanks for your feedback
the firewall rule for the trusted IP is not fessiable for us because the
application is open for public so we can not work through IP banning.
  

Vijayant Kumar wrote:


Hi Group,

I need some feedback on  solr security.

For Making by solr admin password protected,
 I had used the Path Based Authentication form
http://wiki.apache.org/solr/SolrSecurity.

In this way my admin area,search,delete,add to index is protected.But
Now
when I make solr authenticated then for every update/delete from the
fornt
end is blocked without authentication.

I do not need this authentication from the front end so I simply pass
the
username and password to the solr in my fornt end scripts and it is
working fine. I had done it in the below way.

http://username:passw...@localhost:8983/solr/admin/update
I need your suggestion and feed back on the above method.Is it fessiable
method and secure? TO over come from this issue is there any alternate
method?
  

Hey,

there is at least another solution. You can set a firewall rule that
allow  connections to the Solr's port only from trusted IPs.





  

Do your users connect directly to Solr?
I mean, the firewall rule is for the Solr client, i.e. the computer that
hosts the application that connects to Solr.


Re: solr word frequency

2010-02-17 Thread michaelnazaruk

I found a more interesting way:
http://localhost:8983/solr/select?q=bongo&terms=true&terms.fl=id&terms.prefix=bongo&indent=true
In terms.prefix we set the value which we want to find :)
I hope this example helps other people...
Thanks to all who helped me :)
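
In case it helps anyone setting this up: the TermsComponent has to be registered
in solrconfig.xml. The stock example config has something like this (and terms.fl
in the query should of course be the field you want counts for):

<searchComponent name="termsComponent" class="solr.TermsComponent"/>

<requestHandler name="/terms" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
  </lst>
  <arr name="components">
    <str>termsComponent</str>
  </arr>
</requestHandler>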
-- 
View this message in context: 
http://old.nabble.com/solr-word-frequency-tp27622615p27623784.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: xml error when indexing

2010-02-17 Thread Erick Erickson
The file looks good to me, but as I remember, the xml must
be UTF-8 (but check). Is there a chance that somewhere in
the chain it's not?
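
A couple of things worth double-checking (just a sketch, with the document trimmed):
the body should start with an XML prolog that declares the encoding, and the HTTP
request should carry a matching header such as Content-Type: text/xml; charset=utf-8.
It's also worth confirming the application sends the raw XML as the POST body rather
than wrapping it in a form parameter.

<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="name">«Positives Gespräch» zwischen Bielefeld und DFL</field>
  </doc>
</add>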

HTH
Erick

2010/2/17 Jan Simon Winkelmann 

> Hi,
>
> I'm having a strange problem when indexing data through our application.
> Whenever I post something to the update resource, I get
>
> Unexpected character 'a' (code 97) in prolog; expected '<'  at [row,col
> {unknown-source}]: [1,1], 
> 
> 
> Error 400 Unexpected character 'a' (code 97) in prolog; expected
> '<'
>  at [row,col {unknown-source}]: [1,1]
> 
> HTTP ERROR 400
> Problem accessing /solr/update. Reason:
> Unexpected character 'a' (code 97) in prolog; expected '<'
>  at [row,col {unknown-source}]: [1,1]Powered by
> Jetty://
>
>
> However, when I post the same data from an xml file using curl it works.
>
> The add command looks like this:
>
>  overwriteCommitted="true">145 name="basic_module_id">4053294 name="category">1170 name="moddate">2010-02-16T15:30:02Z name="archive">0 name="valid_from">2010-02-16T15:30:02Z name="valid_till">2019-12-31T00:00:00Z name="staging">0145-4053294 name="name">«Positives Gespräch» zwischen Bielefeld und DFL name="description">«Positives Gespräch» zwischen Bielefeld und
> DFLBielefeld (dpa) - Der
> finanziell angeschlagene Zweitligist Arminia Bielefeld hat der Deutschen
> Fußball Liga in Frankfurt/Main einen Maßnahmen-Katalog präsentiert.
> 

Bielefeld (dpa) - Der > finanziell angeschlagene Zweitligist Arminia Bielefeld hat der Deutschen > Fußball Liga in Frankfurt/Main einen Maßnahmen-Katalog präsentiert. >

«Daran arbeiten wir derzeit mit Hochdruck», teilte > Arminia-Geschäftsführer Heinz Anders mit. Die Arminia-Delegation, zu der > noch Manager Detlev Dammeier, Aufsichtsratschef Norbert Leopoldseder und > Finanz-Prokurist Henrik Wiehl gehörten, habe die Lage vor den DFL-Vertretern > laut Anders «offen und transparent» analysiert. Es sei ein «sehr positives > Gespräch gewesen». Die nicht näher erläuterten Maßnahmen müssten nun > umgesetzt und bei der DFL entsprechend nachgewiesen > werden.

Die DFL kommentierte das Zusammentreffen in ihrer > Frankfurter Zentrale nicht. «Zu solchen Dinge äußern wir uns nicht», > erklärte ein Sprecher auf Anfrage der Deutschen Presse-Agentur > dpa.

Der frühere Erstligist Bielefeld hat > Verbindlichkeiten und Schulden von rund 15,5 Millionen Euro. Im operativen > Geschäft dieser Saison gibt es eine Finanzierungslücke von 2,5 Millionen > Euro. Der Club hat sich vor allem mit dem Ausbau und der Modernisierung der > SchücoArena übernommen. Zudem ist die Entwicklung bei den Zuschauer-Zahlen > und den Sponsorzuwendungen nach dem Bundesliga-Abstieg unerfreulich. Allein > für das Stadion sind noch 13 Millionen Euro zu tilgen. Der Verein denkt > sogar an einen Verkauf der SchücoArena.

> > The System we run on is Solr 1.4 with Jetty Hightide 7.0.1. > > Am I missing something here? Would be glad for any help. > > Best > Jan >

Re: Need feedback on solr security

2010-02-17 Thread Xavier Schepler

Xavier Schepler wrote:

Vijayant Kumar wrote:

Hi Xavier,

Thanks for your feedback
the firewall rule for the trusted IP is not fessiable for us because the
application is open for public so we can not work through IP banning.
 

Vijayant Kumar wrote:
   

Hi Group,

I need some feedback on  solr security.

For Making by solr admin password protected,
 I had used the Path Based Authentication form
http://wiki.apache.org/solr/SolrSecurity.

In this way my admin area,search,delete,add to index is protected.But
Now
when I make solr authenticated then for every update/delete from the
fornt
end is blocked without authentication.

I do not need this authentication from the front end so I simply pass
the
username and password to the solr in my fornt end scripts and it is
working fine. I had done it in the below way.

http://username:passw...@localhost:8983/solr/admin/update
I need your suggestion and feed back on the above method.Is it 
fessiable

method and secure? TO over come from this issue is there any alternate
method?
  

Hey,

there is at least another solution. You can set a firewall rule that
allow  connections to the Solr's port only from trusted IPs.





  

Do your users connect directly to Solr ?
I mean, the firewall rule is for the solr client, i.e. the computer 
that host the application that connect to Solr.





You could set a firewall that forbids any connection to your Solr server's
port from everyone except the computer that hosts the application that
connects to Solr.

So, only your application will be able to connect to Solr.

This idea comes from the book Solr 1.4 Enterprise Search Server.


Re: scores are the same for many diferent documents

2010-02-17 Thread Erick Erickson
OmitNorms=false is probably what you want. Did you re-create your
index for each test?

Also, what does debugQuery=true show?
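
For example, something like
http://localhost:8983/solr/select?q=title:laptop&fl=*,score&debugQuery=true
will include the score explanations, so you can see whether the tied documents
share the same tf/idf and fieldNorm values.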

You could get a copy of Luke (google Lucene Luke) and use that to
examine your index to see how things score, which would give you
some clue whether your index (and Lucene) were scoring things
identically or whether there was a different issue...

HTH
Erick

On Wed, Feb 17, 2010 at 7:35 AM, Marc Sturlese wrote:

>
> Hey there,
> I see that when solr gives me back the scores in the response it are the
> same for many different documents.
>
> I have build a simple index for testing purposes with just documents with
> one field indexed with standard analyzer and containing pices of text.
> I have done the same with a self coded simple lucene indexer.
>
> Quering to the solr index with qt=standard&q=title:laptop will give me back
> documents, some of them exactly with the same score.
> Quering to the lucene index (with a simple self coded search app) with
> title:laptop will give me back no equal scores.
>
> When building solr index I have tryied both omitNorms=true and
> omitNorms=false. It will give me different scores but in both cases there
> are some equal scores.
>
> I am testing this because I have a Solr component with a
> FieldComparatorSource wich uses the scores and other external factors for
> the sorting. Having same score for different documents combined with
> external factors may give me back results in unexpected undesired order
> --
> View this message in context:
> http://old.nabble.com/scores-are-the-same-for-many-diferent-documents-tp27623039p27623039.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


long warmup duration

2010-02-17 Thread Stefan Neumann
Hi all,

we are facing extremely increasing warmup times over the last 15 days, which
we are not able to explain, since the number of documents and their size
is stable. Before the increase we could commit our changes in nearly 20
minutes; now it takes about 2 hours.

We were able to identify the warmup of the caches (queryresultCache and
filterCache) as the reason. We tried to decrease the number of warmup
elements from 3 to 1 without any impact.

What influences the runtime during the warmup? Is there any possibility
to boost the warmup?

I attach some more information and statistics.

Thanks a lot for your help.

Stefan


Solr:   1.3
Documents:  4.000.000
-Xmx12G
index size/disc 4.7G

config:

100
200

No queries configured for warming.

CACHES:
===

name:   queryResultCache
class:  org.apache.solr.search.LRUCache
version:1.0
description:LRU Cache(maxSize=20,
  initialSize=3,
  autowarmCount=1,
regenerator=org.apache.solr.search.solrindexsearche...@36eb7331)
stats:

lookups:15958
hits :  9589
hitratio:   0.60
inserts:16211
evictions:  0
size:   16169
warmupTime :1960239
cumulative_lookups: 436250
cumulative_hits:260678
cumulative_hitratio:0.59
cumulative_inserts: 174066
cumulative_evictions:   0


name:   filterCache
class:  org.apache.solr.search.LRUCache
version:1.0
description:LRU Cache(maxSize=20,
  initialSize=3,
  autowarmCount=3,  
regenerator=org.apache.solr.search.solrindexsearche...@9818f80)
stats:  
lookups:6313622
hits:   6304004
hitratio: 0.99
inserts: 42266
evictions: 0
size: 40827
warmupTime: 1268074
cumulative_lookups: 118887830
cumulative_hits: 118605224
cumulative_hitratio: 0.99
cumulative_inserts: 296134
cumulative_evictions: 0





RE: Need feedback on solr security

2010-02-17 Thread Fuad Efendi
> You could set a firewall that forbid any connection to your Solr's
> server port to everyone, except the computer that host your application
> that connect to Solr.
> So, only your application will be able to connect to Solr.


I believe firewalling is the only possible solution since SOLR doesn't use
cookies/sessionIDs

However, 'firewall' can be implemented as an Apache HTTPD Server (or any
other front-end configured to authenticate users). (you can even configure
CISCO PIX (etc.) Firewall to authenticate users.)

HTTPD is easiest, but I haven't tried.

But again, if your use case is "many users, many IPs" you need a good
front-end (web application); if that is not the case, just restrict access to
a specific IP.


-Fuad
http://www.tokenizer.ca





RE: Need feedback on solr security

2010-02-17 Thread Fuad Efendi
 For Making by solr admin password protected,
  I had used the Path Based Authentication form
 http://wiki.apache.org/solr/SolrSecurity.
 In this way my admin area,search,delete,add to index is protected.But
 Now
 when I make solr authenticated then for every update/delete from the
 fornt
 end is blocked without authentication.


Correct, SOLR doesn't use HTTP Session (Session Cookies, Session IDs); and
it shouldn't do that.

If you have such a use case (authenticated sessions) you will need a front-end
web application.




Re: Need feedback on solr security

2010-02-17 Thread Gora Mohanty
On Wed, 17 Feb 2010 10:13:46 -0400
"Fuad Efendi"  wrote:

> > You could set a firewall that forbid any connection to your
> > Solr's server port to everyone, except the computer that host
> > your application that connect to Solr.
> > So, only your application will be able to connect to Solr.
> 
> 
> I believe firewalling is the only possible solution since SOLR
> doesn't use cookies/sessionIDs
> 
> However, 'firewall' can be implemented as an Apache HTTPD Server
> (or any other front-end configured to authenticate users). (you
> can even configure CISCO PIX (etc.) Firewall to authenticate
> users.)
[...]

If you are on Linux, or another system that supports it, iptables
rules are quite easy to set up to restrict access only to the
desired Solr client(s).

Regards,
Gora


Re: persistent cache

2010-02-17 Thread Toke Eskildsen
On Tue, 2010-02-16 at 10:35 +0100, Tim Terlegård wrote:
> I actually tried SSD yesterday. Queries which need to go to disk are
> much faster now. I did expect that warmup for sort fields would be
> much quicker as well, but that seems to be cpu bound.

That and bulk I/O. The sorter imports the Terms into RAM by iterating,
which means that the IO-access for this is sequential. Most modern SSDs
are faster than conventional harddisks for this, but not by much.

> It still takes a minute to cache the six sort fields of the 40 million 
> document index.

I am not aware of any solutions to this, besides beefing up hardware bulk
reads and processor speed (the sorter is not threaded as far as I
remember). It is technically possible to move this step to the indexer,
but the only win would be for setups with few builders and many
searchers.

> Are there any differences among SSD disks. Why is Intel X25-M your favourite?

A soft reason is that I have faith in support from Intel: There have been
problems with earlier versions of the drive (nuking content in some
edge-cases and performance degradation (which hits all SSDs)) and Intel
has responded well by acknowledging the problems and resolving them.
That's very subjective though and I'm sure that some would turn that
around and say that Intel delivered crap in the first place.

On the harder side, the Intel drive is surprisingly cheap and provides
random IO performance ahead of most competitors. Especially for random
writes, which is normally the weak point for SSDs. Some graphs can be
found at Anandtech: 
http://anandtech.com/storage/showdoc.aspx?i=3631&p=22
Anandtech is BTW a very fine starting point on SSD's as they go into
details that too many reviewers skip over.

To be truthful here, standard index building and searching with Lucene
requires three things from the IO-system: Bulk writes, bulk reads
(mainly for sorting) and random reads. The Intel drive is not stellar
for bulk writes and being superior for random writes does not make a
difference for Lucene/SOLR if we're only talking search: pick whatever
SSD you can get your hands on: They are all fine for random reads and
the CPU will probably be the bottleneck.

However, random write speed is a bonus that might show indirectly:
Untarring a million small files, updating a database and similar is
often part of the workflow with search.


Back in 2007 we were fortunate enough to get a test-machine with 2 types
of SSD, 2 10,000 RPM harddisks and 2 15,000 RPM harddisks. Some quick
notes can be found at http://wiki.statsbiblioteket.dk/summa/Hardware

The world has moved on since then, but that has only widened the gap
between SSDs and harddisks.

Regards,
Toke Eskildsen



Re: ConstantScoreQuery and wildcards

2010-02-17 Thread TCK
Thanks, this is very helpful!
-TCK



On Tue, Feb 16, 2010 at 8:16 PM, Ahmet Arslan  wrote:

> > It seems that when I do a search with a wildcard (eg,
> > +text:abc*) the Solr
> > standard SearchHandler will construct a ConstantScoreQuery
> > passing in a
> > Filter, so all the documents in the result set are scored
> > the same. Is there
> > a way to make Solr construct a BooleanQuery instead so that
> > scoring based on
> > term frequencies, etc are used?
>
> Somehow yes. http://old.nabble.com/Boost-with-wildcard.-td25959382.html
>
> > Moreover, in my application
> > I'm building a
> > Query using the Lucene api, calling toString on it and
> > passing it to Solr
> > via solrj and I would like Solr to recover the same Lucene
> > query on its
> > end... is this possible?
>
> There were a discussion about this titled "Lucene Query to Solr query"
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg22034.html
>
>
>
>


Re: Preventing mass index delete via DataImportHandler full-import

2010-02-17 Thread Daniel Shane
That's what I thought. I think I'll take the time to add something to the DIH to
prevent such things. Maybe a parameter that will cause the import to bail out
if the documents to index are less than X% of the total number of documents
already in the index.

There would also be a parameter to override this manually.

I think it would be a good safety precaution.

Daniel Shane

- Original Message -
From: "Noble Paul നോബിള്‍ नोब्ळ्" 
To: solr-user@lucene.apache.org
Sent: Wednesday, February 17, 2010 12:36:52 AM
Subject: Re: Preventing mass index delete via DataImportHandler full-import

On Wed, Feb 17, 2010 at 8:03 AM, Chris Hostetter
 wrote:
>
> : I have a small worry though. When I call the full-import functions, can
> : I configure Solr (via the XML files) to make sure there are rows to
> : index before wiping everything? What worries me is if, for some unknown
> : reason, we have an empty database, then the full-import will just wipe
> : the live index and the search will be broken.
>
> I believe if you set clear=false when doing the full-import, DIH won't
it is clean=false

or use command=import instead of command=full-import
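(e.g. something like http://localhost:8983/solr/dataimport?command=full-import&clean=false,
assuming the handler is registered at /dataimport)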
> delete the entire index before it starts.  it probably makes the
> full-import slower (most of the adds wind up being deletes followed by
> adds) but it should prevent you from having an empty index if something
> goes wrong with your DB.
>
> the big catch is you now have to be responsible for managing deletes
> (using the XmlUpdateRequestHandler) yourself ... this bug looks like its
> goal is to make this easier to deal with (but it's not really clear to
> me what "deletedPkQuery" is ... it doesn't seem to be documented.
>
> https://issues.apache.org/jira/browse/SOLR-1168
>
>
>
> -Hoss
>
>



-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


Re: Merge several queries into one result?

2010-02-17 Thread Daniel Shane
Yup, that's also what I was thinking.

However, I do think that many real world examples cannot simply use one flat 
index. If you have a big index with big documents, you may want to have a 
separate, small index, for things that update frequently etc.. You would need 
to cross reference that index with the main one to produce the final result.

In Java it would be easy to just do 2 queries, one to get the main hits, and
the other to get the smaller index. In fact, that controller could just cache
those entries in the second index.

I don't know if it would be easy to include in Solr. It would certainly require
much thought though, as some may want to cross-index another core for each hit,
while others would just want to retrieve a bunch of documents statically.

Daniel Shane

I'll see what could be done, but I don't think anything easy 
- Original Message -
From: "Erick Erickson" 
To: solr-user@lucene.apache.org
Sent: Tuesday, February 16, 2010 10:20:50 PM
Subject: Re: Merge several queries into one result?

It's generally a bad idea to try to think of
various SOLR/Lucene indexes in a database-like
way, Lucene isn't built to do RDBMS-like stuff. The
first suggestion is usually to consider flattening
your data. That would be something like
adding NY and "New York" in each document.

If that's not possible, the thread titled "Collating results from multiple
indexes" might be useful, although my very quick
read of that is that you have to do some custom work...

HTH
Erick


On Tue, Feb 16, 2010 at 4:54 PM, Daniel Shane wrote:

> Hi all!
>
> I'm trying to join 2 indexes together to produce a final result using only
> Solr + Velocity Response Writer.
>
> The problem is that each "hit" of the main index contains references to
> some common documents located in another index. For example, the hit could
> have a field that describes in what state its located. This field would have
> a value of "NY" for New York etc...
>
> Now what if, in velocity, I want to show this information in full detail.
> Instead of the NY, I would like to show "New York"? This information has not
> been indexed in the main index, but rather in a second one.
>
> Is it possible to coalesce or join these results together so that I can
> pass a simple Velocity template to generate the final HTML?
>
> Or do I have to write a webapp in java to cache all these global variables
> (the state codes, the country codes etc...)?
>
> Daniel Shane
>


Re: Tomcat vs Jetty: A Comparative Analysis?

2010-02-17 Thread gary


http://www.webtide.com/choose/jetty.jsp

>> > - Original Message -
>> > From: "Steve Radhouani" 
>> > To: solr-user@lucene.apache.org
>> > Sent: Tuesday, 16 February, 2010 12:38:04 PM
>> > Subject: Tomcat vs Jetty: A Comparative Analysis?
>> >
>> > Hi there,
>> >
>> > Is there any analysis out there that may help to choose between Tomcat
>> and
>> > Jetty to deploy Solr? I wonder whether there's a significant difference
>> > between them in terms of performance.
>> >
>> > Any advice would be much appreciated,
>> > -Steve
>> >
>>


Catching slow shards

2010-02-17 Thread Otis Gospodnetic
Hello,

Does Solr have any hooks that allow one to watch out for any slaves not 
responding to a query request in the context of distributed search?  That is, 
if a query is sent to shards A, B, and C, and if B doesn't actually respond 
(within N milliseconds), I'd like to know about it, and I'm wondering what the 
best way to get to this information is.


Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



Re: create requesthandler with default shard parameter for different query parser, stock solr 1.4

2010-02-17 Thread Jason Venner
Anyone come up with an answer for this?

I am using the Blacklight ruby app and it seems to require multiple handlers for
different styles of queries.

In particular, what I am noticing is that the facet query using q=*:* seems to
produce a single-shard answer.

This query produces 1 result and facets for the single result:
http://host:8983/solr/select?rows=10&q=*:*&facet.field=field1&facet.field=field2&spellcheck.q=*:*&wt=standard&qt=search&sort=
While
http://host:8983/solr/select?rows=10&q=*:*&facet.field=field1&facet.field=field2&spellcheck.q=*:*&wt=standard&qt=standard&sort=
Produces the faceting across the full shard space.

There is a requesthandler for "search" and for "standard"
"search" is defType=dismax, and has a shard parameter set that is identical to 
"standard".

Searches for actual terms seem to work correctly across both "standard" and 
"search".



On 1/21/10 12:05 PM, "Joe Calderon"  wrote:

thx much, i see now, having request handlers with the same name as the
query parsers was confusing me, i do however have an additional
problem, if i use defType it does indeed use the right query parser
but is there a way to not send all the query parameters in the url
(qf, pf, bf etc), its the main reason im creating the new request
handler, or do i put them all as defaults under my new request handler
and let the query parser use whichever ones it supports?

On Thu, Jan 21, 2010 at 11:45 AM, Yonik Seeley
 wrote:
> On Thu, Jan 21, 2010 at 2:39 PM, Joe Calderon  wrote:
>> hello *, what is the best way to create a requesthandler for
>> distributed search with a default shards parameter but that can use
>> different query parsers
>>
>> thus far i have
>>
>> <requestHandler name="/ds" class="solr.SearchHandler">
>>   <lst name="defaults">
>>     <str name="fl">*,score</str>
>>     <str name="wt">json</str>
>>     <str name="shards">host0:8080/solr/core0,host1:8080/solr/core1,host2:8080/solr/core2,localhost:8080/solr/core3</str>
>>   </lst>
>>   <arr name="components">
>>     <str>query</str>
>>     <str>facet</str>
>>     <str>spellcheck</str>
>>     <str>debug</str>
>>   </arr>
>> </requestHandler>
>>
>>
>> which works as long as qt=standard, if i change it to dismax it doesn't
>> use the shards parameter anymore...
>
> Legacy terminology causing some confusion I think... qt does stand for
> "query type", but it actually picks the request handler.
> "defType" defines the default query parser to use, so you probably
> don't want to be using "qt" at all.
>
> So try something like:
> http://localhost:8983/solr/ds?defType=dismax&qf=text&q=foo
>
> -Yonik
> http://www.lucidimagination.com
>



Re: Copying dynamic fields into default text field messing up fieldNorm?

2010-02-17 Thread Yu-Shan Fung
I'll take a stab. IMHO, it doesn't make much sense to propagate the boost,
and here's why:

For the typical use case, copyField is used to add other "searchable" fields
into the default "text" field for Standard queries. Say we are copying the
ModelNumber field into the text field, and we have a boost of 5.0 for the
ModelNumber field. Now, that means any document with a ModelNumber value
would have the extra boost of 5.0 multiplied into the boost of the "text"
field, for ALL terms in "text"; whereas documents with no ModelNumber would
get no such benefit, completely skewing the results.
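
Concretely, the scenario is something like this (values made up): with

  <copyField source="ModelNumber" dest="text"/>

in schema.xml, a document indexed as

  <add>
    <doc>
      <field name="ModelNumber" boost="5.0">MN-1234</field>
    </doc>
  </add>

ends up with the 5.0 folded into the norm of its "text" field, so every term in
"text" for that document is boosted, not just the model number.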

This would only make sense if boosts are per field instance and not per
field, but we know that's not the case.

Am I making sense?
Yu-Shan


On Tue, Feb 16, 2010 at 10:54 PM, Chris Hostetter
wrote:

>
> : > I belive Koji was mistaken. looking at DocumentBuilder.toDocument, the
> : > boosts have been propogated to copyField destinations since that method
> was
> : > added in 2007 (initially it didn't deal with copyfields at all, but
> once
> : > that was fixed it copied the boosts as well.)
>...
> : Hmm, I didn't know it. Thanks for correcting me.
> : But is it (propagating boost) good idea? What is use case for?
>
> No clue, to either question ... i have no opinion on whether or not it
> makes sense, i'm just telling you what i see in the code.
>
>
> -Hoss
>
>


-- 
“When nothing seems to help, I go look at a stonecutter hammering away at
his rock perhaps a hundred times without as much as a crack showing in it.
Yet at the hundred and first blow it will split in two, and I know it was
not that blow that did it, but all that had gone before.” — Jacob Riis


AW: Performance-Issues and raising numbers of "cumulative inserts"

2010-02-17 Thread Bohnsack, Sven
Sorry for the chaos posts, if anyone minds :)

My colleague posted more info here:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201002.mbox/%3c4b7bf56e.3080...@freiheit.com%3e

I would be very pleased if you could respond to his post with any ideas.

Regards,
Sven

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Wednesday, 17 February 2010 06:30
To: solr-user@lucene.apache.org
Subject: Re: Performance-Issues and raising numbers of "cumulative inserts"

These are some very large numbers. 700k ms is 700 seconds (almost 12 minutes),
and 4M ms is 4k seconds or 66 minutes. No Solr installation should take this
long to warm up.

There is something very wrong here. Have you optimized lately? What
queries do you run to warm it up? And, the basics: how many documents,
how much data per document, how much disk space is the index?

On Tue, Feb 16, 2010 at 3:02 AM, Bohnsack, Sven
 wrote:
> Hi Shalin!
>
>
>
> Thanks for quick response. Sadly it tells me, that i have to look elsewhere 
> to fix the problem.
>
>
>
> Anyone an idea what could cause the increasing warmup-Times? If required I 
> can post some stats.
>
>
>
> Thanking you in anticipation!
>
>
>
> Regards,
>
> Sven
>
>
>
> Feed: Solr-Mailing-List
> Provided on: Tuesday, 16 February 2010 09:05
> Author: Shalin Shekhar Mangar
> Subject: Re: Performance-Issues and raising numbers of "cumulative inserts"
>
>
>
> 
>
> On Tue, Feb 16, 2010 at 1:06 PM, Bohnsack, Sven wrote:
> > Hey IT-Crowd!
> >
> > I'm dealing with some performance issues during warmup of the
> > queryResultCache. Normally it takes about 11 minutes (~700,000 ms), but
> > now it takes about 4 MILLION and more ms. All I can see in the solr.log
> > is that the number of cumulative_inserts ascends from ~250,000 to
> > ~670,000.
> >
> > I asked Google about the cumulative_inserts, but did not get an answer.
> > Can anyone tell me what "cumulative inserts" are and what they stand
> > for? What does it mean, if the number of such inserts raises?
>
> cumulative_inserts are the total number of inserts into the cache since Solr
> started up. The "inserts" shows the number of inserts since the last commit.
>
> --
> Regards, Shalin Shekhar Mangar.
>
>
> View article...
> 
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: long warmup duration

2010-02-17 Thread Antonio Lobato
Drop those cache numbers.  Way down.  I warm up 30 million documents in about 2 
minutes with the following configuration:

  

  

  

  

Mind you, I also use Solr 1.4.  Also, set up a decent warming query or two, as so:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">date:[NOW-2DAYS TO NOW]</str>
      <str name="start">0</str>
      <str name="rows">100</str>
      <str name="sort">date desc</str>
    </lst>
  </arr>
</listener>

Don't warm facets that have a large number of terms or you will kill your warm-up
time.

Hope this helps!

On Feb 17, 2010, at 8:55 AM, Stefan Neumann wrote:

> Hi all,
> 
> we are facing extremly increasing warmup times the last 15 days, which
> we are not able to explain, since the number of documents and their size
> is stable. Before the increase we can commit our changes in nearly 20
> minutes, now it is about 2 hours.
> 
> We were able to identify the warmup of the caches (queryresultCache and
> filterCache) as the reason. We tried to decrease the number of warmup
> elements from 3 to 1 without any impact.
> 
> What influences the runtime during the warmup? Is there any possibility
> to boost the warmup?
> 
> I attach some more information and statistics.
> 
> Thanks a lot for your help.
> 
> Stefan
> 
> 
> Solr: 1.3
> Documents:4.000.000
> -Xmx  12G
> index size/disc 4.7G
> 
> config:
> 
> 100
> 200
> 
> No queries configured for warming.
> 
> CACHES:
> ===
> 
> name:   queryResultCache
> class:  org.apache.solr.search.LRUCache
> version:1.0
> description:LRU Cache(maxSize=20,
>  initialSize=3,
> autowarmCount=1,
>   regenerator=org.apache.solr.search.solrindexsearche...@36eb7331)
> stats:
> 
> lookups:15958
> hits :  9589
> hitratio:   0.60
> inserts:16211
> evictions:  0
> size:   16169
> warmupTime :1960239
> cumulative_lookups: 436250
> cumulative_hits:260678
> cumulative_hitratio:0.59
> cumulative_inserts: 174066
> cumulative_evictions:   0
> 
> 
> name: filterCache
> class:org.apache.solr.search.LRUCache
> version:  1.0
> description:  LRU Cache(maxSize=20,
> initialSize=3,
>  autowarmCount=3, 
>   regenerator=org.apache.solr.search.solrindexsearche...@9818f80)
> stats:
> lookups:  6313622
> hits:   6304004
> hitratio: 0.99
> inserts: 42266
> evictions: 0
> size: 40827
> warmupTime: 1268074
> cumulative_lookups: 118887830
> cumulative_hits: 118605224
> cumulative_hitratio: 0.99
> cumulative_inserts: 296134
> cumulative_evictions: 0
> 
> 
> 



Re: Tomcat vs Jetty: A Comparative Analysis?

2010-02-17 Thread Andy
This read more like a PR release or product brochure for jetty than anything 
else.

Then I poked around the website and realized why: it was written by the creator 
of Jetty, and is hosted on the website of a company with the slogan "The Java 
Experts behind Jetty"

--- On Wed, 2/17/10, g...@littlebunch.com  wrote:

From: g...@littlebunch.com 
Subject: Re: Tomcat vs Jetty: A Comparative Analysis?
To: solr-user@lucene.apache.org
Date: Wednesday, February 17, 2010, 11:27 AM



http://www.webtide.com/choose/jetty.jsp

>> > - Original Message -
>> > From: "Steve Radhouani" 
>> > To: solr-user@lucene.apache.org
>> > Sent: Tuesday, 16 February, 2010 12:38:04 PM
>> > Subject: Tomcat vs Jetty: A Comparative Analysis?
>> >
>> > Hi there,
>> >
>> > Is there any analysis out there that may help to choose between Tomcat
>> and
>> > Jetty to deploy Solr? I wonder whether there's a significant difference
>> > between them in terms of performance.
>> >
>> > Any advice would be much appreciated,
>> > -Steve
>> >
>>



  

Reindex after changing defaultSearchField?

2010-02-17 Thread Frederico Azeiteiro
Hi,

 

If i change the "defaultSearchField" in the core schema, do I need to
recreate the index?

 

Thanks,

Frederico

 



Re: Reindex after changing defaultSearchField?

2010-02-17 Thread Joe Calderon
No, you're just changing how you're querying the index, not the actual
index. You will need to restart the servlet container or reload the core
for the config change to take effect, though.
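
For example, if you are running multicore, something like
http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0
reloads a core without restarting the container (the core name here is just an
example).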

On 02/17/2010 10:04 AM, Frederico Azeiteiro wrote:

Hi,



If i change the "defaultSearchField" in the core schema, do I need to
recreate the index?



Thanks,

Frederico




   




Re: xml error when indexing

2010-02-17 Thread Chris Hostetter
: I'm having a strange problem when indexing data through our application. 
: Whenever I post something to the update resource, I get
: 
: Unexpected character 'a' (code 97) in prolog; expected '<'  at [row,col 
{unknown-source}]: [1,1], 
... 
: However, when I post the same data from an xml file using curl it works.

...that's pretty much a dead giveaway that your application isn't posting 
the exact same XML as the curl command.  You might try using a packet 
sniffer, or an HTTP Proxy that logs all the details of the requests to see 
what exactly your application is sending over the wire and how it differs 
from curl.



-Hoss



Re: Preventing mass index delete via DataImportHandler full-import

2010-02-17 Thread Chris Hostetter

: Thats what I thought. I think I'll take the time to add something to the 
: DIH to prevent such things. Maybe a parameter that will cause the import 
: to bail out if the documents to index are less than X % of the total 
: number of documents already in the index.

the devil's in the details though ... to do an efficient "full-import" DIH
deletes the index before it starts indexing anything, and for an
arbitrary datasource with an arbitrary set of entities and sub-entities
and various layers of logic it seems like it would be infeasible to know
how many rows you are going to get before you actually start.

I think this sort of thing would pretty much have to be done post-import
(w/o doing the initial delete), counting the number of docs added, and
deleting all of the ones older than that (using a deleteQuery based on a
timestamp field) if the number is above a percentage threshold.
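
e.g. something posted to /update along these lines, where the field name and the
cutoff (the time the import started) are placeholders:

<delete>
  <query>index_timestamp:[* TO 2010-02-17T00:00:00Z]</query>
</delete>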

Of course: none of this helps you with the possibility that you have 
plenty of docs, but they all contain useless data (maybe some nested 
entity query failed so you have no searchable text) ... logic for sanity 
checking an index tends to be fairly domain specific.



-Hoss



Re: Merge several queries into one result?

2010-02-17 Thread Erick Erickson
Certainly if you come up with a general solution, the whole community will
be *very* interested.

On Wed, Feb 17, 2010 at 11:14 AM, Daniel Shane wrote:

> Yup, thats also what I was thinking.
>
> However, I do think that many real world examples cannot simply use one
> flat index. If you have a big index with big documents, you may want to have
> a separate, small index, for things that update frequently etc.. You would
> need to cross reference that index with the main one to produce the final
> result.
>
> It java it would be easy to just do 2 queries, one to get the main hits,
> and the other to get the smaller index. In fact, that controller could just
> cache those entries in the second index.
>
> I don't know if it would be easy to include in Solr. It would certainly
> require much thought tough as some may want to cross index another core for
> each hit, while others would just want to retrive a bunch of documents
> statically.
>
> Daniel Shane
>
> I'll see what could be done, but I don't think anything easy
> - Original Message -
> From: "Erick Erickson" 
> To: solr-user@lucene.apache.org
> Sent: Tuesday, February 16, 2010 10:20:50 PM
> Subject: Re: Merge several queries into one result?
>
> It's generally a bad idea to try to think of
> various SOLR/Lucene indexes in a database-like
> way, Lucene isn't built to do RDBMS-like stuff. The
> first suggestion is usually to consider flattening
> your data. That would be something like
> adding NY and "New York" in each document.
>
> If that's not possible, the thread titled "Collating results from multiple
> indexes" might be useful, although my very quick
> read of that is that you have to do some custom work...
>
> HTH
> Erick
>
>
> On Tue, Feb 16, 2010 at 4:54 PM, Daniel Shane  >wrote:
>
> > Hi all!
> >
> > I'm trying to join 2 indexes together to produce a final result using
> only
> > Solr + Velocity Response Writer.
> >
> > The problem is that each "hit" of the main index contains references to
> > some common documents located in another index. For example, the hit
> could
> > have a field that describes in what state its located. This field would
> have
> > a value of "NY" for New York etc...
> >
> > Now what if, in velocity, I want to show this information in full detail.
> > Instead of the NY, I would like to show "New York"? This information has
> not
> > been indexed in the main index, but rather in a second one.
> >
> > Is it possible to coalesce or join these results together so that I can
> > pass a simple Velocity template to generate the final HTML?
> >
> > Or do I have to write a webapp in java to cache all these global
> variables
> > (the state codes, the country codes etc...)?
> >
> > Daniel Shane
> >
>


Re: parsing strings into phrase queries

2010-02-17 Thread Chris Hostetter

: take a look at PositionFilter

Right, there was another thread recently where almost the exact same issue 
was discussed...

http://old.nabble.com/Re%3A-Tokenizer-question-p27120836.html

..except that i was ignorant of the existence of PositionFilter when i 
wrote that message.



-Hoss



Re: How to reindex data without restarting server

2010-02-17 Thread Chris Hostetter

: How do I SWAP the old_core with the new_core. Is it to be done manually or
: does solr provide with a command for doing so. What if I don't make a new

you use the SWAP command, as described in the URL that was mentioned...

: > : http://wiki.apache.org/solr/CoreAdmin
: >
: > For making a schema change, the steps would be:
: >  - create a "new_core" with the new schema
: >  - reindex all the docs into "new_core"
: >  - "SWAP" "old_core" and "new_core" so all the old URLs now point at the
: > new core with the new schema.



-Hoss



Re: getting unexpected statscomponent values

2010-02-17 Thread Grant Ingersoll
Can you share the full output from the StatsComponent? 
On Feb 15, 2010, at 3:07 PM, solr-user wrote:

> 
> Has anyone encountered the following issue?
> 
> I wanted to understand the statscomponent better, so I setup a simple test
> index with a few thousand docs.  In my schema I have:
> - an indexed multivalue sint field (StatsFacetField) that can contain 
> values
> 0 thru 5 that I want to use as my stats.facet field.
> - an indexed single value sint field (ValueOfOneField) that will always
> contain the value 1 and that I want stats on for this test
> 
> When I execute the following query:
> 
> http://localhost:8080/solr/select?q=*:*&stats=true&stats.field=ValueOfOneField&stats.facet=StatsFacetField&rows=0&facet=on&facet.limit=10&facet.field=StatsFacetField
> 
> For this situation (*:*) I was expecting that the statscomponent Count/Sum
> values for each possible value in StatsFacetField to match the facet values
> for StatsFacetField.  They don’t.  Some are close (ie 204 vs 214) while
> others are way off (ie 230 vs 8000)
> 
> Shouldn’t the values match up?  If not, why?
> 
> I am using a recent copy of 1.5.0-dev solr ($Id: CHANGES.txt 906924
> 2010-02-05 12:43:11Z noble $)
> -- 
> View this message in context: 
> http://old.nabble.com/getting-unexpected-statscomponent-values-tp27599248p27599248.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: VelocityResponseWriter: Image References

2010-02-17 Thread Erik Hatcher
Unfortunately the file request handler does not support binary file
types (yet).


Lance's suggestion of hosting static content in another servlet  
container context is the best solution for now.


Erik

On Feb 15, 2010, at 8:47 AM, Chantal Ackermann wrote:


Hi all,

Google didn't come up with any helpful hits, so I'm wondering whether
this is either too simple for me to grok, or I've got some obvious
mistake in my code.


Problem:

Images that I want to load in the velocity templates (including those
referenced in CSS/JS files) for the VelocityResponseWriter do not show
up. (CSS/JS files are loaded!)

I am using the following URL (the same as for CSS/JS files (which work
fine)):


http://server:port/solr/core/admin/file?file=[path to
image]&contentType=image/png



When I try that URL in my browser (Firefox or Safari on Windows)  
they do

not return the image correctly. Firefox states that something is wrong
with the image, Safari simply displays the [?] icon.
When I download the file (removing the parameter contentType to get  
the

download dialog), something is downloaded (> 0KB) but it's a different
format (my image viewer fails to load it).

Has anyone managed to load images that are stored in the SOLR config
directory? Or do I need to move those resources to the webapps solr
folder (I'd rather avoid that)?

Thanks!
Chantal






Re: Deleting spelll checker index

2010-02-17 Thread darniz

Please bear with me and my limited understanding.
I deleted all documents and rebuilt my spell checker using the command
spellcheck=true&spellcheck.build=true&spellcheck.dictionary=default

After this I went to the schema browser and saw that mySpellText still has
around 2000 values.
How can I make sure that field gets cleaned up?
We had the same issue with facets too: even though we delete all the
documents, if we facet on make we still see the old facet values, but we can
filter them out by saying facet.mincount>0.

Again, coming back to my question: how can I make the mySpellText field get rid
of all previously indexed terms?

Thanks a lot
darniz



hossman wrote:
> 
> : But still i cant stop thinking about this.
> : i deleted my entire index and now i have 0 documents.
> : 
> : Now if i make a query with accrd i still get a suggestion of accord even
> : though there are no document returned since i deleted my entire index. i
> : hope it also clear the spell check index field.
> 
> there are two Lucene indexes when you use spell checking.
> 
> there is the "main" index which is goverend by your schema.xml and is what 
> you add your own documents to, and what searches are run agains for the 
> result section of solr responses.  
> 
> There is also the "spell" index which has only two fields and in 
> which each "document" corrisponds to a "word" that might be returend as a 
> spelling suggestion, and the other fields contain various start/end/middle 
> ngrams that represent possible misspellings.
> 
> When you use the spellchecker component it builds the "spell" index 
> makinga document out of every word it finds in whatever field name you 
> configure it to use.
> 
> deleting your entire "main" index won't automaticly delete the "spell" 
> index (allthough you should be able rebuild the "spell" index using the 
> *empty* "main" index, that should work).
> 
> : i am copying both fields to a field called 
> : 
> : 
> 
> ..at this point your "main" index has a field named mySpellText, and for 
> every document it contains a copy of make and model.
> 
> : 
> : default
> : mySpellText
> : true
> : true
> 
> ...so whenever you commit or optimize your "main" index it will take every 
> word from the mySpellText and use them all as individual documents in the 
> "spell" index.
> 
> In your previous email you said you changed the copyField declaration, and 
> then triggered a commit -- that rebuilt your "spell" index, but the data 
> was still all there in the mySpellText field of the "main" index, so the 
> rebuilt "spell" index was exactly the same.
> 
> : i have buildOnOPtmize and buildOnCommit as true so when i index new
> document
> : i want my dictionary to be created but how can i make sure i remove the
> : preivious indexed terms. 
> 
> every time the spellchecker component "builds" it will create a completely 
> new "spell" index .. but if the old data is still in the "main" index then 
> it will also be in the "spell" index.
> 
> The only reason i can think of why you'd be seeing words in your "spell" 
> index after deleting documents from your "main" index is that even if you 
> delete documents, the Terms are still there in the underlying index until 
> the segments are merged ... so if you do an optimize that will force them 
> to be expunged --- but i honestly have no idea if that is what's causing 
> your problem, because quite frankly i really don't understand what your 
> problem is ... you have to provide specifics: reproducible steps anyone 
> can take using a clean install of solr to see the behavior you are 
> seeing that seems incorrect.  (ie: modifications to the example schema, 
> and commands to execute against the demo port to see the bug)
> 
> if you can provide details like that then it's possible to understand what 
> is going wrong for you -- which is a prereq to providing useful help.
> 
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Deleting-spelll-checker-index-tp27376823p27629740.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: parsing strings into phrase queries

2010-02-17 Thread Robert Muir
I think we can improve the docs/wiki to show this example use case. I
noticed the wiki explanation for this filter gives a more complex shingles
example, which is interesting, but this seems to be a common problem and
maybe we should add this use case as well.

On Wed, Feb 17, 2010 at 1:54 PM, Chris Hostetter
wrote:

>
> : take a look at PositionFilter
>
> Right, there was another thread recently where almost the exact same issue
> was discussed...
>
> http://old.nabble.com/Re%3A-Tokenizer-question-p27120836.html
>
> ..except that i was ignorant of the existence of PositionFilter when i
> wrote that message.
>
>
>
> -Hoss
>
>


-- 
Robert Muir
rcm...@gmail.com


Re: Site search upsells & boosting by content type

2010-02-17 Thread Chris Hostetter


: 54 results with that particular event on top.  However, if I try to 
: boost another term, such as "+(old 97's) || granada^100" - I get over 
: 300 results because it adds in all of the matches for the word 

In Solr/Lucene, the keywords of "AND" and "OR" are really just syntactic 
sugar for making two clauses mandatory or optional -- which means that 
something like this...

+FOO || BAR

...causes a "FOO" clause to be created which is mandatory, and then a 
BAR clause is created and both the FOO and BAR clause are set to optional. 
(because of the binary OR specificed by "||")

You can see all of this if you look at the parsedquery in the 
debugQuery=true output.

The sucky part of overriding the "default operator" is that when you set 
it to "AND" there is no syntax to force a clause to be "optional" .. which 
is why i recommend *never* changing the default operator, and using "+" 
to denote when you want to make things mandatory.

: "granada".  This is not what I want.  Instead of AND or OR, I want AND 
: MAYBE.

In Lucene/Solr there is (really) no "AND" or "OR" or "AND MAYBE" .. there 
are just "MANDATORY", "PROHIBITED" and "OPTIONAL" ... in the expression 
"+FOO BAR", FOO becomes MANDATORY and BAR becomes optional, which is 
equivalent to "AND MAYBE" in other parsers.

nine times out of ten, when people are asking questions like this, the 
best answer is:

  1) use the dismax parser
  2) put the input from your user in the q param
  3) set the mm param to 100%
  4) put the boost query you want to use in the bq param
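
For example, a request along those lines (the field names here are made up
for illustration) might look like:

http://localhost:8983/solr/select?defType=dismax&qf=name+description&q=old+97%27s&mm=100%25&bq=venue:granada^100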


-Hoss



Re: optimize is taking too much time

2010-02-17 Thread Chris Hostetter

: in my solr u have 1,42,45,223 records having some 50GB .
: Now when iam loading a new record and when its trying optimize the docs its
: taking 2 much memory and time 

: can any body please tell do we have any property in solr to get rid of this.

Solr isn't going to optimize the index unless you tell it to -- how are 
you indexing your docs? are you sure you don't have something programmed 
to send an optimize command?


-Hoss



Re: getting unexpected statscomponent values

2010-02-17 Thread solr-user


Grant Ingersoll-6 wrote:
> 
> Can you share the full output from the StatsComponent? 
> 

Sure.  This is what I get.

   
- 
- 
  0 
  62 
- 
  on 
  *:* 
  true 
  ValueOfOne 
  10 
  StatsFacetField 
  StatsFacetField 
  0 
  
  
   
- 
   
- 
- 
  1619 
  7433 
   
  3984 
  233 
  41 
  
  
   
  
- 
- 
- 
  1.0 
  1.0 
  8627.0 
  8627 
  0 
  8627.0 
  1.0 
  0.0 
- 
- 
- 
  1.0 
  1.0 
  3758.0 
  3758 
  0 
  3758.0 
  1.0 
  0.0 
  
- 
  1.0 
  1.0 
  3915.0 
  3915 
  0 
  3915.0 
  1.0 
  0.0 
  
- 
  1.0 
  1.0 
  265.0 
  265 
  0 
  265.0 
  1.0 
  0.0 
  
- 
  1.0 
  1.0 
  37.0 
  37 
  0 
  37.0 
  1.0 
  0.0 
  
- 
  1.0 
  1.0 
  41.0 
  41 
  0 
  41.0 
  1.0 
  0.0 
  
- 
  1.0 
  1.0 
  201.0 
  201 
  0 
  201.0 
  1.0 
  0.0 
  
  
  
  
  
  
  
-- 
View this message in context: 
http://old.nabble.com/getting-unexpected-statscomponent-values-tp27599248p27631121.html
Sent from the Solr - User mailing list archive at Nabble.com.



What is largest reasonable setting for ramBufferSizeMB?

2010-02-17 Thread Burton-West, Tom
Hello all,

At some point we will need to re-build an index that totals about 2 terabytes 
in size (split over 10 shards).  At our current indexing speed we estimate that 
this will take about 3 weeks.  We would like to reduce that time.  It appears 
that our main bottleneck is disk I/O.
 We currently have ramBufferSizeMB set to 32 and our merge factor is 10.  If we 
increase ramBufferSizeMB to 320, we avoid a merge and the 9 disk writes and 
reads to merge 9+1 32MB segments into a 320MB segment.
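
For reference, a sketch of where these settings live in solrconfig.xml (the
values shown are the proposed ones):

  <indexDefaults>
    <ramBufferSizeMB>320</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
  </indexDefaults>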

 Assuming we allocate enough memory to the JVM, would it make sense to increase 
ramBufferSize to 3200MB?   What are people's experiences with very large 
ramBufferSizeMB sizes?

Tom Burton-West
University of Michigan Library
www.hathitrust.org



Re: What is largest reasonable setting for ramBufferSizeMB?

2010-02-17 Thread Mark Miller
Burton-West, Tom wrote:
> Hello all,
>
> At some point we will need to re-build an index that totals about 2 
> terrabytes in size (split over 10 shards).  At our current indexing speed we 
> estimate that this will take about 3 weeks.  We would like to reduce that 
> time.  It appears that our main bottleneck is disk I/O.
>  We currently have ramBufferSizeMB set to 32 and our merge factor is 10.  If 
> we increase ramBufferSizeMB to 320, we avoid a merge and the 9 disk writes 
> and reads to merge 9+1 32MB segments into a 320MB segment.
>
>  Assuming we allocate enough memory to the JVM, would it make sense to 
> increase ramBufferSize to 3200MB?   What are people's experiences with very 
> large ramBufferSizeMB sizes?
>
> Tom Burton-West
> University of Michigan Library
> www.hathitrust.org
>
>
>   
There is a hard limit just under about 2 gigs. There appear to be diminishing
returns as you go over a hundred to a few hundred MB. IE, you probably
picked a good number with 320. If you plan to go big anyway ( > 1 gig ),
you really have to give a lot of RAM to the JVM to avoid some nasty
paging / GC effects. I think someone who tested this had to give over 6
gigabytes to go over 1 gig without those effects? That's from memory
though. If you look at the gain eked out at that point, it's not
really worth it. I'd stick to the lower hundreds max.

-- 
- Mark

http://www.lucidimagination.com





Re: getting unexpected statscomponent values

2010-02-17 Thread Chris Hostetter

: Sure.  This is what I get.

That does look really weird, and definitely seems like a bug.

Can you open an issue in Jira? ... ideally with a TestCase (even if it's 
not a JUnit test case, just having some sample docs that can be indexed 
against the example schema and a URL showing the problem would be helpful)


:
: - 
: - 
:   0 
:   62 
: - 
:   on 
:   *:* 
:   true 
:   ValueOfOne 
:   10 
:   StatsFacetField 
:   StatsFacetField 
:   0 
:   
:   
:
: - 
:
: - 
: - 
:   1619 
:   7433 
:    
:   3984 
:   233 
:   41 
:   
:   
:
:   
: - 
: - 
: - 
:   1.0 
:   1.0 
:   8627.0 
:   8627 
:   0 
:   8627.0 
:   1.0 
:   0.0 
: - 
: - 
: - 
:   1.0 
:   1.0 
:   3758.0 
:   3758 
:   0 
:   3758.0 
:   1.0 
:   0.0 
:   
: - 
:   1.0 
:   1.0 
:   3915.0 
:   3915 
:   0 
:   3915.0 
:   1.0 
:   0.0 
:   
: - 
:   1.0 
:   1.0 
:   265.0 
:   265 
:   0 
:   265.0 
:   1.0 
:   0.0 
:   
: - 
:   1.0 
:   1.0 
:   37.0 
:   37 
:   0 
:   37.0 
:   1.0 
:   0.0 
:   
: - 
:   1.0 
:   1.0 
:   41.0 
:   41 
:   0 
:   41.0 
:   1.0 
:   0.0 
:   
: - 
:   1.0 
:   1.0 
:   201.0 
:   201 
:   0 
:   201.0 
:   1.0 
:   0.0 
:   
:   
:   
:   
:   
:   
:   
: -- 
: View this message in context: 
http://old.nabble.com/getting-unexpected-statscomponent-values-tp27599248p27631121.html
: Sent from the Solr - User mailing list archive at Nabble.com.
: 



-Hoss



Re: Getting max/min dates from solr index

2010-02-17 Thread Chris Hostetter

: Is it possible to do date faceting on multiple solr shards?

Distributed search doesn't currently support date faceting...

http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations
https://issues.apache.org/jira/browse/SOLR-1709


-Hoss



Re: getting unexpected statscomponent values

2010-02-17 Thread solr-user


hossman wrote:
> 
> 
> That does look really weird, and definitely seems like a bug.
> 
> Can you open an issue in Jira? ... ideally with a TestCase (even if it's 
> not a JUnit test case, just having some sample docs that can be indexed 
> against the example schema and a URL showing the problem would be helpful)
> 
> 

Hossman, what do you mean by including a "TestCase"?  

Will create issue in Jira asap; I will include the URL, schema and some code
to generate sample data
-- 
View this message in context: 
http://old.nabble.com/getting-unexpected-statscomponent-values-tp27599248p27631633.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Realtime search and facets with very frequent commits

2010-02-17 Thread Jan Høydahl / Cominvent
Hi,

Have you tried playing with mergeFactor or even mergePolicy?

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 16. feb. 2010, at 08.26, Janne Majaranta wrote:

> Hey Dipti,
> 
> Basically query optimizations + setting cache sizes to a very high level.
> Other than that, the config is about the same as the out-of-the-box config
> that comes with the Solr download.
> 
> I haven't found a magic switch to get very fast query responses + facet
> counts with the frequency of commits I'm having using one single SOLR
> instance.
> Adding some TOP queries for a certain type of user to static warming queries
> just moved the time of autowarming the caches to the time it took to warm
> the caches with static queries.
> I've been staging a setup where there's a small solr instance receiving all
> the updates and a large instance which doesn't receive the live feed of
> updates.
> The small index will be merged with the large index periodically (once a
> week or once a month).
> The two instances are seen by the client app as one instance using the
> sharding features of SOLR.
> The instances are running on the same server inside their own JVM / jetty.
> 
> In this setup the caches are very HOT for the large index and queries are
> extremely fast, and the small index is small enough to get extremely fast
> queries without having to warm up the caches too much.
> 
> Basically I'm able to have a commit frequency of 10 seconds in a 40M docs
> index while counting TOP5 facets over 14 fields in 200ms.
> In reality the commit frequency of 10 seconds comes from the fact that the
> updates are going into a 1M - 2M documents index, and the fast facet counts
> from the fact that the 38M documents index has hot caches and doesn't
> receive any updates.
> 
> Also, not running updates to the large index means that the SOLR instance
> reading the large index uses about half the memory it used before when
> running the updates to the large index. At least it does so on Win2k3.
> 
> -Janne
> 
> 
> 2010/2/15 dipti khullar 
> 
>> Hey Janne
>> 
>> Can you please let me know what other optimizations are you talking about
>> here. Because in our application we are committing in about 5 mins but
>> still
>> the response time is very low and at times there are some connection time
>> outs also.
>> 
>> Just wanted to confirm if you have done some major configuration changes
>> which have proved beneficial.
>> 
>> Thanks
>> Dipti
>> 
>> 



Re: Discovering Slaves

2010-02-17 Thread Jan Høydahl / Cominvent
After ZooKeeper is integrated (1.5?) there will be a way to get info about all 
nodes in your cluster including their roles, status etc. Perhaps you want to 
coordinate your dashboard effort with this version, although still very early 
in development? See http://wiki.apache.org/solr/SolrCloud

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 15. feb. 2010, at 23.53, wojtekpia wrote:

> 
> Is there a way to 'discover' slaves using ReplicationHandler? I'm writing a
> quick dashboard, and don't have access to a list of slaves, but would like
> to show some stats about their health.
> -- 
> View this message in context: 
> http://old.nabble.com/Discovering-Slaves-tp27601334p27601334.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 



Re: Collating results from multiple indexes

2010-02-17 Thread Jan Høydahl / Cominvent
Thanks for your clarification and link, Will.

Back to Aaron's question. There is some ongoing work to try to support updating 
single fields within documents (http://issues.apache.org/jira/browse/SOLR-139 
and http://issues.apache.org/jira/browse/SOLR-828) which could perhaps be part 
of a future solution.

Is it an option for you to write a smart "join" component which can live on top 
of multiple cores and do multiple sub queries in an efficient way and 
transparently return the final result? Forking the shards query code could be a 
starting point? Donating this component back to Solr may free you of 
maintenance burden, as I'm sure it will be useful to a larger audience?

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 17. feb. 2010, at 03.27, Will Johnson wrote:

> Jan Hoydal / Otis,
> 
> 
> 
> First off, Thanks for mentioning us.  We do use some utility functions from
> SOLR but our index engine is built on top of Lucene only, there are no Solr
> cores involved.  We do have a JOIN operator that allows us to perform
> relational searches while still acting like a search engine in terms of
> performance, ranking, faceting, etc.  Our CTO wrote a blog article about it
> a month ago that does a pretty good job of explaining how it’s used:
> http://www.attivio.com/blog/55-industry-insights/507-can-a-search-engine-replace-a-relational-database.html
> 
> 
> 
> The join functionality and most of our other higher level features use
> separate data structures and don’t really have much to do with Lucene/SOLR
> except where they integrate with the query execution.  If you want to learn
> more feel free to check out www.attivio.com.
> 
> 
> 
> -  w...@attivio.com
> 
> 
> On Fri, Feb 12, 2010 at 10:35 AM, Jan Høydahl / Cominvent <
> jan@cominvent.com> wrote:
> 
>> Really? The last time I looked at AIE, I am pretty sure there was Solr core
>> msgs in the logs, so I assumed it used EmbeddedSolr or something. But I may
>> be mistaken. Anyone from Attivio here who can elaborate? Is the join stuff
>> at Lucene level or on top of multiple Solr cores or what?
>> 
>> --
>> Jan Høydahl  - search architect
>> Cominvent AS - www.cominvent.com
>> 
>> On 11. feb. 2010, at 23.02, Otis Gospodnetic wrote:
>> 
>>> Minor correction re Attivio - their stuff runs on top of Lucene, not
>> Solr.  I *think* they are trying to patent this.
>>> 
>>> Otis
>>> 
>>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>>> Hadoop ecosystem search :: http://search-hadoop.com/
>>> 
>>> 
>>> 
>>> - Original Message 
 From: Jan Høydahl / Cominvent 
 To: solr-user@lucene.apache.org
 Sent: Mon, February 8, 2010 3:33:41 PM
 Subject: Re: Collating results from multiple indexes
 
 Hi,
 
 There is no JOIN functionality in Solr. The common solution is either to
>> accept
 the high volume update churn, or to add client side code to build a
>> "join" layer
 on top of the two indices. I know that Attivio (www.attivio.com) have
>> built some
 kind of JOIN functionality on top of Solr in their AIE product, but do
>> not know
 the details or the actual performance.
 
 Why not open a JIRA issue, if there is no such already, to request this
>> as a
 feature?
 
 --
 Jan Høydahl  - search architect
 Cominvent AS - www.cominvent.com
 
 On 25. jan. 2010, at 22.01, Aaron McKee wrote:
 
> 
> Is there any somewhat convenient way to collate/integrate fields from
>> separate
 indices during result writing, if the indices use the same unique keys?
 Basically, some sort of cross-index JOIN?
> 
> As a bit of background, I have a rather heavyweight dataset of every US
 business (~25m records, an on-disk index footprint of ~30g, and 5-10
>> hours to
 fully index on a decent box). Given the size and relatively stability of
>> the
 dataset, I generally only update this monthly. However, I have separate
 advertising-related datasets that need to be updated either hourly or
>> daily
 (e.g. today's coupon, click revenue remaining, etc.) . These advertiser
>> feeds
 reference the same keyspace that I use in the main index, but are
>> otherwise
 significantly lighter weight. Importing and indexing them discretely
>> only takes
 a couple minutes. Given that Solr/Lucene doesn't support field updating,
>> without
 having to drop and re-add an entire document, it doesn't seem practical
>> to
 integrate this data into the main index (the system would be under a
>> constant
 state of churn, if we did document re-inserts, and the performance
>> impact would
 probably be debilitating). It may be nice if this data could participate
>> in
 filtering (e.g. only show advertisers), but it doesn't need to
>> participate in
 scoring/ranking.
> 
> I'm guessing that someone else has had a similar need, at some point?
>> I can
 have our front-end query the smaller indic

labeling facets and highlighting question

2010-02-17 Thread adeelmahmood

Simple question: I want to give a label to my facet queries instead of the
name of the facet field. I found documentation on the Solr site saying I can do
that by specifying the key local param, with syntax something like
facet.field={!ex=dt%20key='By%20Owner'}owner

I am just not sure what the ex=dt part does. If I take it out, it throws
an error, so it seems important, but what is it for?

Also, I tried turning on highlighting and I can see that it adds the
highlighting list at the end of the XML, but it only points out the
ids of all the matching results. It doesn't actually show the text data
that it is matching against, so I am getting something like this back:

 
  
  
...

instead of the actual text that is being matched. Isn't it supposed to do
that and wrap the search terms in an em tag? How come it's not doing that in
my case?

here is my schema
 
 
 
 

-- 
View this message in context: 
http://old.nabble.com/labeling-facets-and-highlighting-question-tp27632747p27632747.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: xml error when indexing

2010-02-17 Thread Lance Norskog
What type are you posting with? Is it expecting a multipart upload?
What is the curl command and what is its mime-type for uploaded data?



On Wed, Feb 17, 2010 at 10:19 AM, Chris Hostetter
 wrote:
> : I'm having a strange problem when indexing data through our application.
> : Whenever I post something to the update resource, I get
> :
> : Unexpected character 'a' (code 97) in prolog; expected '<'  at [row,col 
> {unknown-source}]: [1,1], 
>        ...
> : However, when I post the same data from an xml file using curl it works.
>
> ...that's pretty much a dead giveaway that your application isn't posting
> the exact same XML as the curl command.  You might try using a packet
> sniffer, or an HTTP Proxy that logs all the details of the requests to see
> what exactly your application is sending over the wire and how it differs
> from curl.
>
>
>
> -Hoss
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: xml error when indexing

2010-02-17 Thread Lance Norskog
I mean: what MIME type does the POST command use?

On Wed, Feb 17, 2010 at 7:09 PM, Lance Norskog  wrote:
> What type are you posting with? Is it expecting a multipart upload?
> What is the curl command and what is its mime-type for uploaded data?
>
>
>
> On Wed, Feb 17, 2010 at 10:19 AM, Chris Hostetter
>  wrote:
>> : I'm having a strange problem when indexing data through our application.
>> : Whenever I post something to the update resource, I get
>> :
>> : Unexpected character 'a' (code 97) in prolog; expected '<'  at [row,col 
>> {unknown-source}]: [1,1], 
>>        ...
>> : However, when I post the same data from an xml file using curl it works.
>>
>> ...that's pretty much a dead giveaway that your application isn't posting
>> the exact same XML as the curl command.  You might try using a packet
>> sniffer, or an HTTP Proxy that logs all the details of the requests to see
>> what exactly your application is sending over the wire and how it differs
>> from curl.
>>
>>
>>
>> -Hoss
>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
Lance Norskog
goks...@gmail.com


Re: Deleting spelll checker index

2010-02-17 Thread Lance Norskog
This is a quirk of Lucene - when you delete a document, the indexed
terms for the document are not deleted. That is, if 2 documents have
the word 'frampton' in an indexed field, the term dictionary contains
the entry 'frampton' and pointers to those two documents. When you
delete those two documents, the index contains the entry 'frampton'
with an empty list of pointers. So, the terms are still there even
when you delete all of the documents.

Facets and the spellchecking dictionary build from this term
dictionary, not from the text string that are 'stored' and returned
when you search for the documents.

The <optimize/> command throws away these remnant terms.

http://www.lucidimagination.com/blog/2009/03/18/exploring-lucenes-indexing-code-part-2/
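
As a sketch (assuming the example port and the 'default' dictionary name used
earlier in this thread), purging the remnant terms and rebuilding the
dictionary would look something like:

curl http://localhost:8983/solr/update -H 'Content-type:text/xml' --data-binary '<optimize/>'
http://localhost:8983/solr/select?q=*:*&spellcheck=true&spellcheck.build=true&spellcheck.dictionary=default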

On Wed, Feb 17, 2010 at 12:24 PM, darniz  wrote:
>
> Please bear with me on the limitted understanding.
> i deleted all documents and i made a rebuild of my spell checker  using the
> command
> spellcheck=true&spellcheck.build=true&spellcheck.dictionary=default
>
> After this i went to the schema browser and i saw that mySpellText still has
> around 2000 values.
> How can i make sure that i clean up that field.
> We had the same issue with facets too, even though we delete all the
> documents, and if we do a facet on make we still see facets but we can
> filter out facets by saying facet.mincount>0.
>
> Again coming back to my question how can i make mySpellText fields get rid
> of all previous terms
>
> Thanks a lot
> darniz
>
>
>
> hossman wrote:
>>
>> : But still i cant stop thinking about this.
>> : i deleted my entire index and now i have 0 documents.
>> :
>> : Now if i make a query with accrd i still get a suggestion of accord even
>> : though there are no document returned since i deleted my entire index. i
>> : hope it also clear the spell check index field.
>>
>> there are two Lucene indexes when you use spell checking.
>>
>> there is the "main" index which is goverend by your schema.xml and is what
>> you add your own documents to, and what searches are run agains for the
>> result section of solr responses.
>>
>> There is also the "spell" index which has only two fields and in
>> which each "document" corrisponds to a "word" that might be returend as a
>> spelling suggestion, and the other fields contain various start/end/middle
>> ngrams that represent possible misspellings.
>>
>> When you use the spellchecker component it builds the "spell" index
>> makinga document out of every word it finds in whatever field name you
>> configure it to use.
>>
>> deleting your entire "main" index won't automaticly delete the "spell"
>> index (allthough you should be able rebuild the "spell" index using the
>> *empty* "main" index, that should work).
>>
>> : i am copying both fields to a field called
>> : 
>> : 
>>
>> ..at this point your "main" index has a field named mySpellText, and for
>> ever document it contains a copy of make and model.
>>
>> :         
>> :             default
>> :             mySpellText
>> :             true
>> :             true
>>
>> ...so whenever you commit or optimize your "main" index it will take every
>> word from the mySpellText and use them all as individual documents in the
>> "spell" index.
>>
>> In your previous email you said you changed hte copyField declaration, and
>> then triggered a commit -- that rebuilt your "spell" index, but the data
>> was still all there in the mySpellText field of the "main" index, so the
>> rebuilt "spell" index was exactly the same.
>>
>> : i have buildOnOPtmize and buildOnCommit as true so when i index new
>> document
>> : i want my dictionary to be created but how can i make sure i remove the
>> : preivious indexed terms.
>>
>> everytime the spellchecker component "builds" it will create a completley
>> new "spell" index .. but if the old data is still in the "main" index then
>> it will also be in the "spell" index.
>>
>> The only reason i can think of why you'd be seeing words in your "spell"
>> index after deleting documents from your "main" index is that even if you
>> delete documents, the Terms are still there in the underlying index untill
>> the segments are merged ... so if you do an optimize that will force them
>> to be expunged --- but i honestly have no idea if that is what's causing
>> your problem, because quite frankly i really don't understand what your
>> problem is ... you have to provide specifics: reproducible steps anyone
>> can take using a clean install of solr to see the the behavior you are
>> seeing that seems incorrect.  (ie: modifications to the example schema,
>> and commands to execute against hte demo port to see the bug)
>>
>> if you can provide details like that then it's possible to understand what
>> is going wrong for you -- which is a prereq to providing useful help.
>>
>>
>>
>> -Hoss
>>
>>
>>
>
> --
> View this message in context: 
> http://old.nabble.com/Deleting-spelll-checker-index-tp27376823p27629740.html
> Sent from the Solr - User mailing list archive at Nab

Re: parsing strings into phrase queries

2010-02-17 Thread Lance Norskog
That would be great. After reading this and the PositionFilter class I
still don't know how to use it.
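
For what it's worth, a minimal sketch of where the filter sits in a query-time
analyzer chain (the field type name and the surrounding filters are made up
for illustration; this assumes the solr.PositionFilterFactory shipped with
recent Solr builds):

  <fieldType name="text_flat" class="solr.TextField">
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- stacks every token the analyzer emits onto a single position -->
      <filter class="solr.PositionFilterFactory"/>
    </analyzer>
  </fieldType>

By default the filter gives every token after the first a position increment
of 0, so the query parser no longer sees a multi-position sequence coming out
of a single query term.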

On Wed, Feb 17, 2010 at 12:38 PM, Robert Muir  wrote:
> i think we can improve the docs/wiki to show this example use case, i
> noticed the wiki explanation for this filter gives a more complex shingles
> example, which is interesting, but this seems to be a common problem and
> maybe we should add this use case.
>
> On Wed, Feb 17, 2010 at 1:54 PM, Chris Hostetter
> wrote:
>
>>
>> : take a look at PositionFilter
>>
>> Right, there was another thread recently where almost the exact same issue
>> was discussed...
>>
>> http://old.nabble.com/Re%3A-Tokenizer-question-p27120836.html
>>
>> ..except that i was ignorant of the existence of PositionFilter when i
>> wrote that message.
>>
>>
>>
>> -Hoss
>>
>>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Lance Norskog
goks...@gmail.com


Re: labeling facets and highlighting question

2010-02-17 Thread Lance Norskog
Here's the problem: the wiki page is confusing:

http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters

The line:
q=mainquery&fq=status:public&fq={!tag=dt}doctype:pdf&facet=on&facet.field={!ex=dt}doctype

is standalone, but the later line:

facet.field={!ex=dt key=mylabel}doctype

means 'change {!ex=dt}doctype in the long query to {!ex=dt key=mylabel}doctype'.

'tag=dt' attaches a tag (name) to a filter query, and 'ex=dt' means
'exclude the filter tagged dt when computing this facet'.
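
(If no filter exclusion is involved, the label can still be applied on its
own -- note the leading '!' inside the braces -- e.g.
facet.field={!key=mylabel}doctype.)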

On Wed, Feb 17, 2010 at 4:30 PM, adeelmahmood  wrote:
>
> simple question: I want to give a label to my facet queries instead of the
> name of facet field .. i found the documentation at solr site that I can do
> that by specifying the key local param .. syntax something like
> facet.field={!ex=dt%20key='By%20Owner'}owner
>
> I am just not sure what the ex=dt part does .. if i take it out .. it throws
> an error so it seems its important but what for ???
>
> also I tried turning on the highlighting and i can see that it adds the
> highlighting items list in the xml at the end .. but it only points out the
> ids of all the matching results .. it doesnt actually shows the text data
> thats its making a match with // so i am getting something like this back
>
> 
>  
>  
> ...
>
> instead of the actual text thats being matched .. isnt it supposed to do
> that and wrap the search terms in em tag .. how come its not doing that in
> my case
>
> here is my schema
>  />
> 
> 
> 
>
> --
> View this message in context: 
> http://old.nabble.com/labeling-facets-and-highlighting-question-tp27632747p27632747.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: labeling facets and highlighting question

2010-02-17 Thread adeelmahmood

Okay, so if I don't want to do any excludes then I am assuming I should just
put in {key=label}field .. I tried that and it doesn't work .. it says
undefined field {key=label}field


Lance Norskog-2 wrote:
> 
> Here's the problem: the wiki page is confusing:
> 
> http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters
> 
> The line:
> q=mainquery&fq=status:public&fq={!tag=dt}doctype:pdf&facet=on&facet.field={!ex=dt}doctype
> 
> is standalone, but the later line:
> 
> facet.field={!ex=dt key=mylabel}doctype
> 
> mean 'change the long query from {!ex=dt}docType to {!ex=dt
> key=mylabel}docType'
> 
> 'tag=dt' creates a tag (name) for a filter query, and 'ex=dt' means
> 'exclude this filter query'.
> 
> On Wed, Feb 17, 2010 at 4:30 PM, adeelmahmood 
> wrote:
>>
>> simple question: I want to give a label to my facet queries instead of
>> the
>> name of facet field .. i found the documentation at solr site that I can
>> do
>> that by specifying the key local param .. syntax something like
>> facet.field={!ex=dt%20key='By%20Owner'}owner
>>
>> I am just not sure what the ex=dt part does .. if i take it out .. it
>> throws
>> an error so it seems its important but what for ???
>>
>> also I tried turning on the highlighting and i can see that it adds the
>> highlighting items list in the xml at the end .. but it only points out
>> the
>> ids of all the matching results .. it doesnt actually shows the text data
>> thats its making a match with // so i am getting something like this back
>>
>> 
>>  
>>  
>> ...
>>
>> instead of the actual text thats being matched .. isnt it supposed to do
>> that and wrap the search terms in em tag .. how come its not doing that
>> in
>> my case
>>
>> here is my schema
>> > required="true"
>> />
>> 
>> 
>> 
>>
>> --
>> View this message in context:
>> http://old.nabble.com/labeling-facets-and-highlighting-question-tp27632747p27632747.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
> 
> 

-- 
View this message in context: 
http://old.nabble.com/labeling-facets-and-highlighting-question-tp27632747p27634177.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Has anyone prepared a general purpose synonyms.txt for search engines

2010-02-17 Thread Lance Norskog
OpenThesaurus seems to cover European languages, not including English :)
WordNet is a venerable thesaurus project:

http://wordnet.princeton.edu/

and lucene-contrib includes a set of tools for using it.

http://www.lucidimagination.com/search/?q=wordnet

On Fri, Feb 12, 2010 at 11:51 AM, Julian Hille  wrote:
> Hi,
>
> Your welcome. Thats something google came up with some weeks ago :)
>
>
> Am 12.02.2010 um 20:42 schrieb Emad Mushtaq:
>
>> Wow thanks!! You all are awesome! :D :D
>>
>> On Sat, Feb 13, 2010 at 12:32 AM, Julian Hille  wrote:
>>
>>> Hi,
>>>
>>> at openthesaurus.org or .com you can find a mysql version of synonyms you
>>> just have to join it to fit the synonym schema of solr yourself.
>>>
>>>
>>> Am 12.02.2010 um 20:03 schrieb Emad Mushtaq:
>>>
 Hi,

 I was wondering if anyone has prepared a synonyms.txt for general purpose
 search engines,  that can be shared. If not could you refer me to places
 where such a synonym list or thesaurus can be found. Synonyms for search
 engines are different from the regular thesaurus. Any help would be
>>> highly
 appreciated. Thanks.

 --
 Muhammad Emad Mushtaq
 http://www.emadmushtaq.com/
>>>
>>> Mit freundlichen Grüßen,
>>> Julian Hille
>>>
>>>
>>>
>>
>>
>> --
>> Muhammad Emad Mushtaq
>> http://www.emadmushtaq.com/
>
> Mit freundlichen Grüßen,
> Julian Hille
>
>
> ---
> NetImpact KG
> Altonaer Straße 8
> 20357 Hamburg
>
> Tel: 040 / 6738363 2
> Mail: jul...@netimpact.de
>
> Geschäftsführer: Tarek Müller
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: optimize is taking too much time

2010-02-17 Thread mklprasad



hossman wrote:
> 
> 
> : in my solr u have 1,42,45,223 records having some 50GB .
> : Now when iam loading a new record and when its trying optimize the docs
> its
> : taking 2 much memory and time 
> 
> : can any body please tell do we have any property in solr to get rid of
> this.
> 
> Solr isn't going to optimize the index unless you tell it to -- how are 
> you indexing your docs? are you sure you don't have something programmed 
> to send an optimize command?
> 
> 
> -Hoss
> 
> Yes,
> from my code, for every load I am calling the server.optimize() method
> (now I am planning to remove this from the code).
> At the config level I have 'mergeFactor=10'.
> I have a doubt: will the mergeFactor only do a merge, or will it
> also perform the optimization?
> If not, do I need to call optimize explicitly? In that case, for my 50GB index, will it take less time?
> 
> 
> Please clarify.
> Thanks in advance
> 
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/optimize-is-taking-too-much-time-tp27561570p27634994.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Realtime search and facets with very frequent commits

2010-02-17 Thread Janne Majaranta
Hi,

Yes, I did play with mergeFactor.
I didn't play with mergePolicy.

Wouldn't that affect indexing speed and possibly memory usage ?
I don't have any problems with indexing speed ( 1000 - 2000 docs / sec via
the standard HTTP API ).

My problem is that I need very warm caches to get fast faceting, and the
autowarming of the caches takes too long compared to the frequency of
commits I'm having.
So a commit every minute means less than a minute time to warm the caches.

To give you an idea of what kind of queries need to be autowarmed in my app,
the log events indexed as documents have timestamps with different
granularities used for faceting.
For example, to get the count of log events for every hour using faceting,
there's a timestamp field with the format yyyymmddhh (for example: 2010021808,
meaning 2010-02-18 8am).
One use case is to get hourly counts over the whole index. A non-cached
query counting the hourly counts over the 40M document index takes a
while..
And to my understanding, autowarming means that this kind of
query would basically be re-executed against a cold cache. Probably not
exactly how it works, but it "feels" like it would.
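
For reference, a static warming entry for that kind of facet query would look
roughly like this in solrconfig.xml (the field name here is made up):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="rows">0</str>
        <str name="facet">true</str>
        <str name="facet.field">timestamp_hour</str>
      </lst>
    </arr>
  </listener>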

Moving the commits to a smaller index while using sharding to have a
transparent view to the index from the client app seems to solve my problem.

I'm not sure if the (upcoming?) NRT features would keep the caches more
persistent; probably not in an environment where docs get frequent updates /
deletes.

Also, I'm closely following the Ocean Realtime Search project AND it's SOLR
integration. It sounds like it has the "dream features" to enable realtime
updates to the index.

-Janne


2010/2/18 Jan Høydahl / Cominvent 

> Hi,
>
> Have you tried playing with mergeFactor or even mergePolicy?
>
> --
> Jan Høydahl  - search architect
> Cominvent AS - www.cominvent.com
>
> On 16. feb. 2010, at 08.26, Janne Majaranta wrote:
>
> > Hey Dipti,
> >
> > Basically query optimizations + setting cache sizes to a very high level.
> > Other than that, the config is about the same as the out-of-the-box
> config
> > that comes with the Solr download.
> >
> > I haven't found a magic switch to get very fast query responses + facet
> > counts with the frequency of commits I'm having using one single SOLR
> > instance.
> > Adding some TOP queries for a certain type of user to static warming
> queries
> > just moved the time of autowarming the caches to the time it took to warm
> > the caches with static queries.
> > I've been staging a setup where there's a small solr instance receiving
> all
> > the updates and a large instance which doesn't receive the live feed of
> > updates.
> > The small index will be merged with the large index periodically (once a
> > week or once a month).
> > The two instances are seen by the client app as one instance using the
> > sharding features of SOLR.
> > The instances are running on the same server inside their own JVM /
> jetty.
> >
> > In this setup the caches are very HOT for the large index and queries are
> > extremely fast, and the small index is small enough to get extremely fast
> > queries without having to warm up the caches too much.
> >
> > Basically I'm able to have a commit frequency of 10 seconds in a 40M docs
> > index while counting TOP5 facets over 14 fields in 200ms.
> > In reality the commit frequency of 10 seconds comes from the fact that
> the
> > updates are going into a 1M - 2M documents index, and the fast facet
> counts
> > from the fact that the 38M documents index has hot caches and doesn't
> > receive any updates.
> >
> > Also, not running updates to the large index means that the SOLR instance
> > reading the large index uses about half the memory it used before when
> > running the updates to the large index. At least it does so on Win2k3.
> >
> > -Janne
> >
> >
> > 2010/2/15 dipti khullar 
> >
> >> Hey Janne
> >>
> >> Can you please let me know what other optimizations are you talking
> about
> >> here. Because in our application we are committing in about 5 mins but
> >> still
> >> the response time is very low and at times there are some connection
> time
> >> outs also.
> >>
> >> Just wanted to confirm if you have done some major configuration changes
> >> which have proved beneficial.
> >>
> >> Thanks
> >> Dipti
> >>
> >>
>
>


Re: Tomcat vs Jetty: A Comparative Analysis?

2010-02-17 Thread Steve Radhouani
Totally agreed!

2010/2/17 Andy 

> This read more like a press release or product brochure for Jetty than
> anything else.
>
> Then I poked around the website and realized why: it was written by the
> creator of Jetty, and is hosted on the website of a company with the slogan
> "The Java Experts behind Jetty"
>
> --- On Wed, 2/17/10, g...@littlebunch.com  wrote:
>
> From: g...@littlebunch.com 
> Subject: Re: Tomcat vs Jetty: A Comparative Analysis?
> To: solr-user@lucene.apache.org
> Date: Wednesday, February 17, 2010, 11:27 AM
>
>
>
> http://www.webtide.com/choose/jetty.jsp
>
> >> > - Original Message -
> >> > From: "Steve Radhouani" 
> >> > To: solr-user@lucene.apache.org
> >> > Sent: Tuesday, 16 February, 2010 12:38:04 PM
> >> > Subject: Tomcat vs Jetty: A Comparative Analysis?
> >> >
> >> > Hi there,
> >> >
> >> > Is there any analysis out there that may help to choose between Tomcat
> >> and
> >> > Jetty to deploy Solr? I wonder wether there's a significant difference
> >> > between them in terms of performance.
> >> >
> >> > Any advice would be much appreciated,
> >> > -Steve
> >> >
> >>
>
>
>
>
>