Where is NGramFilter?
Hi. On the Sunspot (a Ruby Solr client) Wiki (https://github.com/outoftime/sunspot/wiki/Matching-substrings-in-fulltext-search) it says that the NGramFilter should allow substring indexing. As I never got it working, I searched a bit and found this site: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory There is only EdgeNGramFilterFactory listed (which I got working for prefix indexing), but no NGramFilterFactory. Is that filter not supported anymore, or is that list not up to date? Is there an alternative filter for getting substring searching working? Best regards, Kai
High CPU usage
Hello, We have been running read-only Solr instances for a few months now. Yesterday I noticed high CPU usage coming from the JVM; it is simply using 100% of the CPU for no apparent reason. Nothing was changed, and we are using Jetty as the servlet container for Solr. Where can I start looking for the cause? It has been using 100% CPU for almost 24 hours now. Thanks, Erez.
Re: General question about Solr Caches
Hi Hoss, Ok, that makes much more sense now. I was under the impression that values were copied as well, which seemed a bit odd... unless you have to deal with a use case similar to yours. :) Cheers, - Savvas

On 9 February 2011 02:25, Chris Hostetter hossman_luc...@fucit.org wrote:

: In my understanding, the Current Index Searcher uses a cache instance and when a New Index Searcher is registered a new cache instance is used which is also auto-warmed. However, what happens when the New Index Searcher is a view of an index which has been modified? If the entries contained in the old cache are copied during auto-warming to the new cache, wouldn't that new cache contain invalid entries?

a) I'm not sure what you mean by "view of an index which has been modified" ... except for the first time an index is created, an IndexSearcher always contains a view of an index which has been modified -- the view that the IndexSearcher represents is entirely consistent and doesn't change as documents are added/removed - that's why a new Searcher needs to be opened.

b) Entries are not copied during autowarming. The *keys* of the entries in the old cache are used to warm the new cache -- using the new searcher to generate new values. (Caveat: if you have a custom cache, you could write a custom cache regenerator that did copy the values from the old cache verbatim -- I have done that in special cases where the type of object I was caching didn't vary based on the IndexSearcher -- or did vary, but in such a way that I could use the new Searcher to determine a cheap piece of information and, based on the result, either reuse an old value that was expensive to compute or recompute it using the new Searcher ... but none of the default cache regenerators for the stock Solr caches work this way.)

-Hoss
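For reference, autowarming is configured per cache in solrconfig.xml via the autowarmCount attribute; a minimal sketch (the sizes and counts below are illustrative, not recommendations):

  <!-- autowarmCount = how many keys from the old cache are replayed against the
       newly opened searcher to regenerate fresh values -->
  <query>
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
    <!-- the document cache cannot be autowarmed, since internal doc ids change between searchers -->
    <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  </query>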
IndexOutOfBoundsException
Hi, we have a problem with our Solr test instance. This instance is running 90 cores with about 2 GB of index data per core. This worked fine for a few weeks. Now we get an exception when querying data from one core:

java.lang.IndexOutOfBoundsException: Index: 104, Size: 11
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288)
at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:277)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:129)
at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:160)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:211)
at org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:277)
at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:961)
at org.apache.lucene.index.DirectoryReader$MultiTermEnum.<init>(DirectoryReader.java:989)
at org.apache.lucene.index.DirectoryReader.terms(DirectoryReader.java:626)
at org.apache.solr.search.SolrIndexReader.terms(SolrIndexReader.java:302)
at org.apache.lucene.search.PrefixTermEnum.<init>(PrefixTermEnum.java:41)
at org.apache.lucene.search.PrefixQuery.getEnum(PrefixQuery.java:45)
at org.apache.lucene.search.MultiTermQuery$ConstantScoreAutoRewrite.rewrite(MultiTermQuery.java:227)
at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:382)
at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:438)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:311)
at org.apache.lucene.search.Query.weight(Query.java:98)
at org.apache.lucene.search.Searcher.createWeight(Searcher.java:230)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
...

All other cores are working fine with the same schema. This problem only occurs when querying for specific data, like q=fieldA:valueA%20AND%20fieldB:valueB. With the following query, data is returned: q=*:*. Has anybody any suggestions on what is causing this problem? Are 90 cores too many for a single Solr instance? Thanks in advance, Dominik
Maintain stopwords.txt and synonyms.txt
Hello everyone, I am currently developing a search solution based on Apache Solr. I have the problem that I want to offer the user the possibility to maintain synonyms and stopwords in a user-friendly tool, but so far I could not find any way to write the stopwords.txt or synonyms.txt. Are there any other solutions? Currently I have some ideas on how to handle it:
1. Implement another SynonymFilterFactory that allows other data sources like databases. I have already seen approaches for that, but no solutions yet.
2. Implement a file-writer request handler to write the stopwords.txt.
Are there other solutions which are maybe already implemented? Thanks and best regards, Timo

Timo Schmidt, Developer (Diplom Informatiker FH), AOE media GmbH, Borsigstr. 3, 65205 Wiesbaden, Germany, e-Mail: timo.schm...@aoemedia.de, Web: http://www.aoemedia.de/
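For context, both files are plain text files referenced from the analyzer chains in schema.xml, which is why an external tool that rewrites them also has to trigger a core reload before the changes take effect; a minimal sketch (field type name and filter order are illustrative):

  <!-- stopwords.txt and synonyms.txt live in the core's conf/ directory and are
       only re-read when the core is reloaded or restarted -->
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>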
Re: Maintain stopwords.txt and synonyms.txt
Timo, On Wed, Feb 9, 2011 at 11:07 AM, Timo Schmidt timo.schm...@aoemedia.de wrote:

: But currently I could not find any possibility to write the stopwords.txt or synonyms.txt.

What about writing the files from an external application and reloading your Solr core? That seems to be the simplest way to solve your problem, no? Regards Stefan
AW: Maintain stopwords.txt and synonyms.txt
Hi Stefan, I already thought about that - maybe some PHP service or something like that. But this would mean that I need additional software on that server, like a normal Apache installation, which needs to be maintained. That's why I thought a solution that is built into Solr would be nice. Thanks, Timo Schmidt (AOE media GmbH)
Re: Maintain stopwords.txt and synonyms.txt
Hi Timo, of course - that's right. You could write some JSP (I guess) which could be integrated into the already existing Jetty/Tomcat server? Just wondering: how do you perform search requests to Solr? Normally there is already some other service running which acts as a 'proxy' to the outside world? ;) Regards Stefan
Re: Nutch and Solr search on the fly
The parsed data is only sent to the Solr index if you tell a segment to be indexed: solrindex crawldb linkdb segment. If you did this only once after injecting, and then ran the subsequent fetch, parse, update, index sequence, then you of course only see those URLs. If you don't index a segment after it has been parsed, you need to do it later on.

On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:

Hi all, I am a newbie to Nutch and Solr. Well, relatively much newer to Solr than Nutch :) I have been using Nutch for the past two weeks, and I wanted to know if I can query or search my Nutch crawls on the fly (before they complete). I am asking this because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the Nutch crawler is still crawling the URLs. Someone suggested that Solr would make this possible. I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. With this process, I see only the injected URLs in the Solr search. I know I did something really foolish and the crawl never happened; I feel I am missing some information here. I think somewhere in the process there should be a crawl happening and I missed it. Just wanted to see if someone could help me point this out and where I went wrong in the process. Forgive my foolishness and thanks for your patience. Cheers, Abi

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: [WKT] Spatial Searching
The show stopper for JTS is its license, unfortunately. Otherwise, I think it would be done already! We could, since it's LGPL, make it an optional dependency, assuming someone can stub it out.

On Feb 8, 2011, at 11:18 PM, Adam Estrada wrote:

I just came across a ~nudge post over in the SIS list on what the status is for that project. This got me looking more into spatial mods with Solr 4.0. I found this enhancement in Jira: https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David mentions that he's already integrated JTS into Solr 4.0 for querying on polygons stored as WKT. It's relatively easy to get WKT strings into Solr, but does the field type exist yet? Is there a patch or something that I can test out? Here's how I would do it using GDAL/OGR and the already existing CSV update handler: http://www.gdal.org/ogr/drv_csv.html

ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT

This converts a shapefile to a CSV with the geometries intact in the form of WKT. You can then get the data into Solr by running the following command:

curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

There are lots of flavors of geometries, so I suspect that this will be a daunting task, but because JTS recognizes each geometry type it should be possible to work with them. Does anyone know of a patch or even when this functionality might be included in Solr 4.0? I need to query for polygons ;-) Thanks, Adam

-- Grant Ingersoll http://www.lucidimagination.com/
Re: [WKT] Spatial Searching
How could I stub this out, not being a Java guy? What is needed in order to do this? Licensing is always going to be an issue with JTS, which is why I am interested in the project SIS sitting in incubation right now. I'm willing to put forth the effort if I had a little direction from the peanut gallery ;-) Adam
AW: Maintain stopwords.txt and synonyms.txt
Yes we have something, but on another machine. Timo Schmidt (AOE media GmbH)
Re: [WKT] Spatial Searching
Thought I would share this on web mapping...it's a great write up and something to consider when talking about working with spatial data. http://www.tokumine.com/2010/09/20/gis-data-payload-sizes/ Adam
Re: Where is NGramFilter?
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory There is only EdgeNGramFilterFactory listed (which I got working for prefix indexing), but no NGramFilterFactory. Is that filter not supported anymore, or is that list not up to date?

It should be there. Here is the javadoc for it: https://hudson.apache.org/hudson/job/Solr-trunk/javadoc/org/apache/solr/analysis/NGramFilterFactory.html Anyone who has an account can update the wiki. Contributions are welcome! Koji -- http://www.rondhuit.com/en/
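For reference, a field type that uses NGramFilterFactory for substring matching might look like the following sketch in schema.xml (the type name and gram sizes are just examples; indexing all n-grams can grow the index considerably, so keep maxGramSize modest):

  <!-- n-grams are generated at index time only, so a query term is matched
       as-is against the indexed grams -->
  <fieldType name="text_substring" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>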
Re: [WKT] Spatial Searching
Grant, How could I stub this out, not being a Java guy? What is needed in order to do this? Licensing is always going to be an issue with JTS, which is why I am interested in the project SIS sitting in incubation right now. I'm willing to put forth the effort if I had a little direction on how to implement it from the peanut gallery ;-) Adam
Re: Maintain stopwords.txt and synonyms.txt
Timo, then use cronjobs on your Solr machine to fetch the generated synonyms file, put it in the correct location, and reload the core configuration (which is required for the synonyms file to be picked up)? :) Regards Stefan
AW: IndexOutOfBoundsException
I think we had a similar exception recently when attempting to sort on a multi-valued field ... could that be possible in your case? André
Re: high cpu usage
You can try attaching jConsole to the process to see what it shows. If you're on a *nix box you can get a gross idea what's going on with top. Best Erick
AW: IndexOutOfBoundsException
No, we do not have multi-valued fields and we do not sort (in this case). We re-indexed the CSV file and the error disappeared, but it would be interesting to know why this error occurred... Thank you for your suggestion. Dominik
Re: Where is NGramFilter?
In addition to Koji's note, see the bold comment at the top of that page that says that this is not a complete list; the definitive list is always the javadocs... Best Erick
Re: TermVector query using Solr Tutorial
Hello,

On Tue, Feb 8, 2011 at 11:12 PM, Grant Ingersoll gsing...@apache.org wrote:

> It's a little hard to read due to the indentation, but AFAICT you have two terms, usb and cabl. USB appears at position 0 and cabl at position 1. Those are the relative positions to each other. Perhaps you can explain a bit more what you are trying to do?

I am searching for the keyword "25" in the field

<field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</field>

I want to know the character position of the matched keyword in the corresponding field; "usb" or "cabl" is not what I want.
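If character offsets are what is needed, the TermVectorComponent can return start/end offsets per term, provided the field is defined with term vectors and offsets enabled and the documents are re-indexed afterwards; a sketch of the field definition (the field and type names follow the example schema), queried with the request parameters tv=true and tv.offsets=true:

  <!-- termOffsets stores the start/end character offsets of each term, which the
       TermVectorComponent returns alongside positions and frequencies -->
  <field name="features" type="text" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>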
Re: Nutch and Solr search on the fly
Hi Markus, I am sorry for not being clear; what I meant to say is this: suppose a URL, say www.somehost.com/gifts/greetingcard.html (which in turn contains links to a.html, b.html, c.html, d.html), is injected into seed.txt. After the whole process I was expecting a bunch of other pages crawled from this seed URL. However, at the end of it, all I see is the content from only this page, namely www.somehost.com/gifts/greetingcard.html, and I do not see any other pages (here a.html, b.html, c.html, d.html) crawled from this one. The crawling happens only for the URLs mentioned in seed.txt and does not proceed further from there. So I am just a bit confused: why is it not crawling the linked pages (a.html, b.html, c.html and d.html)? I get the feeling that I am missing something that the author of the blog (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed everyone would know. Thanks, Abi
Re: Nutch and Solr search on the fly
WARNING: I don't do Nutch much, but could it be that your crawl depth is 1? See http://wiki.apache.org/nutch/NutchTutorial and search for "depth". Best Erick
Re: Nutch and Solr search on the fly
Are you using the depth parameter with the crawl command or are you using the separate generate, fetch etc. commands? What's $ nutch readdb crawldb -stats returning?

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: Does Distributed Search support {!boost }?
On Tue, Feb 8, 2011 at 9:02 PM, Andy angelf...@yahoo.com wrote: Is it possible to do a query like {!boost b=log(popularity)}foo over sharded indexes? Yep, that should work fine. -Yonik http://lucidimagination.com
Re: Nutch and Solr search on the fly
Hi Erick, Thanks a bunch for the response. Could be - but what I am wondering is where to specify the depth in the whole process described at http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/. I tried specifying it during the fetcher phase but it was just ignored :( Thanks, Abi
Solr 1.4.1 using more memory than Solr 1.3
Hi Solr Users, We are in the process of upgrading from Solr 1.3 to Solr 1.4.1. While performing stress tests on Solr 1.4.1 to measure the performance improvement in query times (QTime) and the absence of blocked threads, we ran into memory issues with Solr 1.4.1.

Test setup details:
- 2 identical hosts running Solr 1.3 and Solr 1.4.1 individually.
- 3 cores with index sizes: 10 GB, 2 GB, 1 GB.
- JVM max heap: 3 GB (-Xmx3072m), total RAM: 4 GB.
- No other application/service running on the servers.
- For querying the Solr servers, we are using wget queries from a standalone host.

For the same index data and the same set of queries, Solr 1.3 is hovering between 1.5 and 2.2 GB, whereas with about 20K requests Solr 1.4.1 is reaching its 3 GB limit and performing a full GC after almost every query. The full GC is also not freeing up any memory. Has anyone else faced similar issues with Solr 1.4.1? Also, why is Solr 1.4.1 using more memory for the same amount of processing compared to Solr 1.3? Is there any particular configuration that needs to be done to avoid this high memory usage? Thanks, Rachita
Re: Solr 1.4.1 using more memory than Solr 1.3
Searching and sorting is now done on a per-segment basis, meaning that the FieldCache entries used for sorting and for function queries are created and used per-segment and can be reused for segments that don't change between index updates. While generally beneficial, this can lead to increased memory usage over 1.3 in certain scenarios:

1) A single valued field that was used for both sorting and faceting in 1.3 would have used the same top level FieldCache entry. In 1.4, sorting will use entries at the segment level while faceting will still use entries at the top reader level, leading to increased memory usage.

2) Certain function queries such as ord() and rord() require a top level FieldCache instance and can thus lead to increased memory usage. Consider replacing ord() and rord() with alternatives, such as function queries based on ms() for date boosting.

http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/CHANGES.txt

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
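Point 2) is worth checking first if date boosting is involved; a sketch of the change in a request handler's boost function in solrconfig.xml (the field name and constants are illustrative, following the common date-boosting recipe):

  <!-- rord() needs a top-level FieldCache entry in 1.4, while ms() works
       per-segment and avoids the extra memory -->
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <!-- 1.3-style boost, memory-hungry on 1.4: recip(rord(publish_date),1,1000,1000) -->
      <!-- 1.4-style replacement: -->
      <str name="bf">recip(ms(NOW,publish_date),3.16e-11,1,1)</str>
    </lst>
  </requestHandler>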
RE: Concurrent updates/commits
Solr does handle concurrency fine. But there is NOT transaction isolation like you'll get from an RDBMS. All 'pending' changes are (conceptually, anyway) held in a single queue, and any commit will commit ALL of them. There aren't going to be any data corruption issues or anything from concurrent adds (unless there's a bug in Solr; there isn't supposed to be) -- but there is no kind of transactions or isolation between different concurrent adders. So, sure, everyone can add concurrently -- but any time any of those actors issues a commit, all pending adds are committed.

In addition, there are problems with Solr's basic architecture and _too frequent_ commits (whether made by different processes or not doesn't matter). When a new commit happens, Solr fires up a new index searcher and warms it up on the new version of the index. Until the new index searcher is fully warmed, the old index searcher is still serving queries. Which can also mean that there are, for this period, TWO versions of all your caches in RAM and such. So let's say it takes 5 minutes for the new index to be fully warmed. But if you have commits happening every 1 minute -- then you'll end up with FIVE 'new indexes' being warmed -- meaning potentially 5 times the RAM usage (quickly running into a JVM out-of-memory error), and lots of CPU activity going on warming indexes that will never actually be used (because even though they aren't even done being warmed and ready to use, they've already been superseded by a later commit).

I don't know of any good way to deal with this except less frequent commits. One way to get less frequent commits is to use Solr replication, and 'stage' all your commits in a 'master' index, but only replicate to the 'slave' at a frequency slow enough that the new index is fully warmed before the next commit happens. Some new features in trunk (both Lucene and Solr) for 'near real time' search ameliorate this problem somewhat, depending on the nature of your commits.

Jonathan

From: Savvas-Andreas Moysidis [savvas.andreas.moysi...@googlemail.com] Sent: Wednesday, February 09, 2011 10:34 AM To: solr-user@lucene.apache.org Subject: Concurrent updates/commits

Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after they get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired at Solr, and we are trying to investigate how Solr handles those multiple requests.

This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that "Before a *COMMIT* is done, a lock is obtained and it is released after the operation", which in my understanding means that Solr will serialise all update/commit requests?

However, the Solr book, in the Commit, Optimise, Rollback section, reads: "if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit", which suggests that requests are *not* serialised.

Our questions are:
- Does Solr handle concurrent requests or do we need to add synchronisation logic around our code?
- If Solr *does* handle concurrent requests, does it serialise each request or does it have some other strategy for processing them?

Thanks, - Savvas
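As a guard against the pile-up of warming searchers described above, solrconfig.xml can cap concurrent warming and batch many client adds into fewer commits on the server side; a sketch (the numbers are illustrative):

  <!-- maxWarmingSearchers makes Solr reject commits that would stack up more
       simultaneously warming searchers than the limit; autoCommit turns many
       small client adds into fewer, less frequent commits -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>  <!-- commit after this many pending docs -->
      <maxTime>60000</maxTime>  <!-- or after 60 seconds, whichever comes first -->
    </autoCommit>
  </updateHandler>

  <query>
    <maxWarmingSearchers>2</maxWarmingSearchers>
  </query>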
RE: Concurrent updates/commits
> However, the Solr book, in the Commit, Optimise, Rollback section, reads: "if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit", which suggests that requests are *not* serialised.

I read this as: if two clients submit modifications and commits every couple of minutes, it could happen that the modifications of client1 get committed by client2's commit before client1 asks for a commit. As far as I understand Solr commits, they are serialized by design. And committing too often can get you into trouble if you have many warm-up queries (?). Hope this helps, Pierre
Re: Concurrent updates/commits
Hello, Thanks very much for your quick replies. So, according to Pierre, all updates will be immediately posted to Solr, but all commits will be serialised. But doesn't that contradict Jonathan's example where you can end up with FIVE 'new indexes' being warmed? If commits are serialised, then there can only ever be one Index Searcher being auto-warmed at a time or have I got this wrong? The reason we are investigating commit serialisation, is because we want to know whether the commit requests will be blocked until the previous ones finish. Cheers, - Savvas On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote: However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. I read this as If two client submit modifications and commits every couple of minutes, it could happen that modifications of client1 got committed by client2's commit before client1 asks for a commit. As far as I understand Solr commit, they are serialized by design. And committing too often could lead you to trouble if you have many warm-up queries (?). Hope this helps, Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto: savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 16:34 À : solr-user@lucene.apache.org Objet : Concurrent updates/commits Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after the get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired to Solr and we are trying to investigate how Solr handles those multiple requests. This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that Before a *COMMIT* is done , lock is obtained and its released after the operation which in my understanding means that Solr will serialise all update/commit requests? However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. Our questions are: - Does Solr handle concurrent requests or do we need to add synchronisation logic around our code? - If Solr *does* handle concurrent requests, does it serialise each request or has some other strategy for processing those? Thanks, - Savvas
Re: Concurrent updates/commits
Don't think commit, that is confusing. Solr is not a database. In particular, it does not have the isolation property from ACID. Solr indexes new documents as a batch, then installs a new version of the entire index. Installing a new index isn't instant, especially with warming queries. Solr creates the index, then warms it, then makes it available for regular queries. If you are creating indexes frequently, don't bother warming. wunder == Walter Underwood Lead Engineer, MarkLogic On Feb 9, 2011, at 8:03 AM, Savvas-Andreas Moysidis wrote: Hello, Thanks very much for your quick replies. So, according to Pierre, all updates will be immediately posted to Solr, but all commits will be serialised. But doesn't that contradict Jonathan's example where you can end up with FIVE 'new indexes' being warmed? If commits are serialised, then there can only ever be one Index Searcher being auto-warmed at a time or have I got this wrong? The reason we are investigating commit serialisation, is because we want to know whether the commit requests will be blocked until the previous ones finish. Cheers, - Savvas On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote: However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. I read this as If two client submit modifications and commits every couple of minutes, it could happen that modifications of client1 got committed by client2's commit before client1 asks for a commit. As far as I understand Solr commit, they are serialized by design. And committing too often could lead you to trouble if you have many warm-up queries (?). Hope this helps, Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto: savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 16:34 À : solr-user@lucene.apache.org Objet : Concurrent updates/commits Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after the get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired to Solr and we are trying to investigate how Solr handles those multiple requests. This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that Before a *COMMIT* is done , lock is obtained and its released after the operation which in my understanding means that Solr will serialise all update/commit requests? However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. Our questions are: - Does Solr handle concurrent requests or do we need to add synchronisation logic around our code? - If Solr *does* handle concurrent requests, does it serialise each request or has some other strategy for processing those? Thanks, - Savvas
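A sketch of what "don't bother warming" can look like in solrconfig.xml, assuming the stock cache declarations; the sizes are placeholders, and the point is only the autowarmCount="0" and the absence of newSearcher/firstSearcher warming queries:

  <query>
    <filterCache      class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="0"/>
    <documentCache    class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="0"/>
    <!-- no <listener event="newSearcher"> or <listener event="firstSearcher"> warming queries -->
  </query>

New searchers then become available almost immediately after a commit, at the cost of the first queries against them being slower.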
Re: Concurrent updates/commits
Hi Savvas, well, although it sounds strange: when a commit happens, a new Index Searcher starts warming. If another commit happens while that 'new' Index Searcher is still warming, yet another Index Searcher starts warming. So, at that point in time, you have 3 Index Searchers: the old one, the 'new' one and the newest one. I don't know whether the old one will be replaced by the new one before the newest one has finished warming, but it seems a good guess, since you can still search while the new index is being committed. You should know that Lucene is built on a segment architecture. This means every time you commit you write a completely new index segment. Example: you have one segment in your index and a searcher for it, and now you commit. After the commit finishes you have two segments and one searcher for both segments; internally your IndexSearcher consists of at least two SegmentReaders. If you commit three times at roughly the same moment, you will warm 3 new SolrIndexSearchers that contain 3, 4 and 5 SegmentReaders. Your old SolrIndexSearcher contains 2 SegmentReaders and stays valid until the newer SolrIndexSearcher based on 3 SegmentReaders is warmed. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/Concurrent-updates-commits-tp2459222p2459522.html Sent from the Solr - User mailing list archive at Nabble.com.
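Related to the several-searchers-warming-at-once scenario: solrconfig.xml caps how many searchers may warm concurrently, and commits that would push past the cap are rejected. A sketch (2 is the value shipped in the example config, used here only as an illustration):

  <!-- solrconfig.xml -->
  <maxWarmingSearchers>2</maxWarmingSearchers>

Errors about this limit are usually the first visible symptom of committing faster than warming can keep up.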
RE: Concurrent updates/commits
Well, Jonathan explanations are much more accurate than mine. :) I took the word serialization as meaning kind of isolation between commits, which is not very smart. Sorry to have introduce more confusion in this. Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto:savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 17:04 À : solr-user@lucene.apache.org Objet : Re: Concurrent updates/commits Hello, Thanks very much for your quick replies. So, according to Pierre, all updates will be immediately posted to Solr, but all commits will be serialised. But doesn't that contradict Jonathan's example where you can end up with FIVE 'new indexes' being warmed? If commits are serialised, then there can only ever be one Index Searcher being auto-warmed at a time or have I got this wrong? The reason we are investigating commit serialisation, is because we want to know whether the commit requests will be blocked until the previous ones finish. Cheers, - Savvas On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote: However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. I read this as If two client submit modifications and commits every couple of minutes, it could happen that modifications of client1 got committed by client2's commit before client1 asks for a commit. As far as I understand Solr commit, they are serialized by design. And committing too often could lead you to trouble if you have many warm-up queries (?). Hope this helps, Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto: savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 16:34 À : solr-user@lucene.apache.org Objet : Concurrent updates/commits Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after the get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired to Solr and we are trying to investigate how Solr handles those multiple requests. This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that Before a *COMMIT* is done , lock is obtained and its released after the operation which in my understanding means that Solr will serialise all update/commit requests? However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. Our questions are: - Does Solr handle concurrent requests or do we need to add synchronisation logic around our code? - If Solr *does* handle concurrent requests, does it serialise each request or has some other strategy for processing those? Thanks, - Savvas
Re: Concurrent updates/commits
Yes, we'll probably go towards that path as our index files are relatively small, so auto warming might not be extremely useful in our case.. Yep, we do realise the difference between a db and a Solr commit. :) Thanks. On 9 February 2011 16:15, Walter Underwood wun...@wunderwood.org wrote: Don't think commit, that is confusing. Solr is not a database. In particular, it does not have the isolation property from ACID. Solr indexes new documents as a batch, then installs a new version of the entire index. Installing a new index isn't instant, especially with warming queries. Solr creates the index, then warms it, then makes it available for regular queries. If you are creating indexes frequently, don't bother warming. wunder == Walter Underwood Lead Engineer, MarkLogic On Feb 9, 2011, at 8:03 AM, Savvas-Andreas Moysidis wrote: Hello, Thanks very much for your quick replies. So, according to Pierre, all updates will be immediately posted to Solr, but all commits will be serialised. But doesn't that contradict Jonathan's example where you can end up with FIVE 'new indexes' being warmed? If commits are serialised, then there can only ever be one Index Searcher being auto-warmed at a time or have I got this wrong? The reason we are investigating commit serialisation, is because we want to know whether the commit requests will be blocked until the previous ones finish. Cheers, - Savvas On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote: However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. I read this as If two client submit modifications and commits every couple of minutes, it could happen that modifications of client1 got committed by client2's commit before client1 asks for a commit. As far as I understand Solr commit, they are serialized by design. And committing too often could lead you to trouble if you have many warm-up queries (?). Hope this helps, Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto: savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 16:34 À : solr-user@lucene.apache.org Objet : Concurrent updates/commits Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after the get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired to Solr and we are trying to investigate how Solr handles those multiple requests. This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that Before a *COMMIT* is done , lock is obtained and its released after the operation which in my understanding means that Solr will serialise all update/commit requests? However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. 
Our questions are: - Does Solr handle concurrent requests or do we need to add synchronisation logic around our code? - If Solr *does* handle concurrent requests, does it serialise each request or has some other strategy for processing those? Thanks, - Savvas
Re: Concurrent updates/commits
Thanks very much Em. - Savvas On 9 February 2011 16:22, Savvas-Andreas Moysidis savvas.andreas.moysi...@googlemail.com wrote: Yes, we'll probably go towards that path as our index files are relatively small, so auto warming might not be extremely useful in our case.. Yep, we do realise the difference between a db and a Solr commit. :) Thanks. On 9 February 2011 16:15, Walter Underwood wun...@wunderwood.org wrote: Don't think commit, that is confusing. Solr is not a database. In particular, it does not have the isolation property from ACID. Solr indexes new documents as a batch, then installs a new version of the entire index. Installing a new index isn't instant, especially with warming queries. Solr creates the index, then warms it, then makes it available for regular queries. If you are creating indexes frequently, don't bother warming. wunder == Walter Underwood Lead Engineer, MarkLogic On Feb 9, 2011, at 8:03 AM, Savvas-Andreas Moysidis wrote: Hello, Thanks very much for your quick replies. So, according to Pierre, all updates will be immediately posted to Solr, but all commits will be serialised. But doesn't that contradict Jonathan's example where you can end up with FIVE 'new indexes' being warmed? If commits are serialised, then there can only ever be one Index Searcher being auto-warmed at a time or have I got this wrong? The reason we are investigating commit serialisation, is because we want to know whether the commit requests will be blocked until the previous ones finish. Cheers, - Savvas On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote: However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. I read this as If two client submit modifications and commits every couple of minutes, it could happen that modifications of client1 got committed by client2's commit before client1 asks for a commit. As far as I understand Solr commit, they are serialized by design. And committing too often could lead you to trouble if you have many warm-up queries (?). Hope this helps, Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto: savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 16:34 À : solr-user@lucene.apache.org Objet : Concurrent updates/commits Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after the get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired to Solr and we are trying to investigate how Solr handles those multiple requests. This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that Before a *COMMIT* is done , lock is obtained and its released after the operation which in my understanding means that Solr will serialise all update/commit requests? 
However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. Our questions are: - Does Solr handle concurrent requests or do we need to add synchronisation logic around our code? - If Solr *does* handle concurrent requests, does it serialise each request or has some other strategy for processing those? Thanks, - Savvas
Query regarding search term count in Solr
Hi All, This is Rahul and I am using Solr for one of my upcoming projects. I have a question regarding search term counts in Solr. We have a requirement in one of our search-based projects to filter results based on search term counts per document. For example, if a user searches for something like solr[4:9], this query should return only documents in which solr appears between 4 and 9 times (inclusive). If a user searches for something like solr lucene[4:9], this query should return only documents in which the phrase solr lucene appears between 4 and 9 times (inclusive). Is there any way in Solr to return results based on search term and phrase counts? If not, can it be customized by extending the existing Solr/Lucene libraries? -- Thanks and Regards Rahul A. Warawdekar
Re: Nutch and Solr search on the fly
Hi Abishek, depth is a param of the crawl command, not the fetch command. If you are using a custom script calling the individual stages of a nutch crawl, then depth N means running that script N times. You can put a loop in the script. Thanks, Charan On Wed, Feb 9, 2011 at 6:26 AM, .: Abhishek :. ab1s...@gmail.com wrote: Hi Erick, Thanks a bunch for the response. Could be the case, but all I am wondering is where to specify the depth in the whole process described in the URL http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried specifying it during the fetcher phase but it was just ignored :( Thanks, Abi On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson erickerick...@gmail.com wrote: WARNING: I don't do Nutch much, but could it be that your crawl depth is 1? See: http://wiki.apache.org/nutch/NutchTutorial and search for depth. Best Erick On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com wrote: Hi Markus, I am sorry for not being clear, I meant to say that... Suppose a url, namely www.somehost.com/gifts/greetingcard.html (which in turn contains links to a.html, b.html, c.html, d.html), is injected into the seed.txt; after the whole process I was expecting a bunch of other pages crawled from this seed url. However, at the end of it all I see is the content from only this page, namely www.somehost.com/gifts/greetingcard.html, and I do not see any other pages (here a.html, b.html, c.html, d.html) crawled from this one. The crawling happens only for the URLs mentioned in the seed.txt and does not proceed further from there. So I am just a bit confused. Why is it not crawling the linked pages (a.html, b.html, c.html and d.html)? I get a feeling that I am missing something that the author of the blog ( http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ ) assumed everyone would know. Thanks, Abi On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma markus.jel...@openindex.io wrote: The parsed data is only sent to the Solr index if you tell a segment to be indexed; solrindex crawldb linkdb segment If you did this only once after injecting and then the consequent fetch, parse, update, index sequence, then you, of course, only see those URL's. If you don't index a segment after it's been parsed, you need to do it later on. On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote: Hi all, I am a newbie to nutch and solr. Well, relatively much newer to Solr than Nutch :) I have been using nutch for the past two weeks, and I wanted to know if I can query or search my nutch crawls on the fly (before a crawl completes). I am asking this because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the nutch crawler is still crawling the URLs. Someone suggested that Solr would make it possible. I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. By this process, I see only the injected URLs in the Solr search. I know I did something really foolish and the crawl never happened; I feel I am missing some information here. I think somewhere in the process there should be a crawl happening and I missed it. Just wanted to see if someone could help me point this out and where I went wrong in the process. Forgive my foolishness and thanks for your patience.
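If it helps, the one-shot crawl command takes the depth directly; a sketch with placeholder seed directory and counts (not values from this thread):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50

With the step-by-step approach from the blog post, "depth 3" simply means running the generate/fetch/parse/updatedb cycle three times before the final solrindex step.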
Cheers, Abi -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
QueryWeight for Solr
Hello folks, I got a question regarding a custom QueryWeight implementation for a special use case. For this use case we want to experiment with different values for the idf, based on different algorithms, and see how they affect the scoring. Is there a way to plug in a custom weight implementation without rewriting the full query class? Let's say we extend the DismaxQParser to create an extended boolean query (let's call it EBooleanQuery, E for extended) and we implement a QueryWeight for this query class that takes into account some values that are not part of the current approach. Is this the way we have to go? Or what would you suggest? Regards -- View this message in context: http://lucene.472066.n3.nabble.com/QueryWeight-for-Solr-tp2459933p2459933.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: QueryWeight for Solr
On Wed, Feb 9, 2011 at 12:16 PM, Em mailformailingli...@yahoo.de wrote: For the current usecase we want to experiment with different values for the idf based on different algorithms and how they affect the scoring. For tf, idf, lengthNorm, coord, etc, see Similarity. Solr already allows you to specify one in the schema, and work is underway to make it per-field: https://issues.apache.org/jira/browse/SOLR-2338 -Yonik http://lucidimagination.com
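A minimal sketch of the Similarity route Yonik points at, against the Lucene 2.9 / Solr 1.4-era API; the class name and the square-root idf formula are purely illustrative, not something from the thread:

  import org.apache.lucene.search.DefaultSimilarity;

  public class ExperimentalSimilarity extends DefaultSimilarity {
      // Swap Lucene's default log-based idf for an experimental variant.
      @Override
      public float idf(int docFreq, int numDocs) {
          return (float) Math.sqrt((double) numDocs / (double) (docFreq + 1));
      }
  }

It is then registered globally at the bottom of schema.xml with <similarity class="com.example.ExperimentalSimilarity"/>; SOLR-2338 is what would make this selectable per field.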
Re: Query regarding search term count in Solr
I suspect it's worthwhile to back up and ask whether this is a reasonable requirement. What is the use-case? Because unless the input is very uniform, I wouldn't be surprised if this produces poor results. For instance, if solr appears once in a field 5 words long and 5 times in another document where the same field is 1,000,000 words long, which is preferable? This requirement can make sense if the fields being searched are uniform in length, but even then I'm not sure it would be good for the user. That said, you know your problem domain best, but before going through the effort of making this all work I'd step back and ask this question. There is no way that I know of to do this out of the box with Solr. I can imagine you could set up a custom scorer that accessed the underlying TermDocs (see TermDocs in the Lucene API), but you'd also have to provide your own query parser... I'll reiterate, though, that it might be best to see if there are already ways in Solr to get close enough behavior to satisfy the underlying requirement rather than go down this route. Best Erick On Wed, Feb 9, 2011 at 11:55 AM, Rahul Warawdekar rahul.warawde...@gmail.com wrote: Hi All, This is Rahul and am using Solr for one of my upcoming projects. I had a query regarding search term count using Solr. We have a requirement in one of our search based projects to search the results based on search term counts per document. For eg, if a user searches for something like solr[4:9], this query should return only documents in which solr appears between 4 and 9 times (inclusively). if a user searches for something like solr lucene[4:9], this query should return only documents in which the phrase solr lucene appears between 4 and 9 times (inclusively). Is there any way from Solr to return results based on the search term and phrase counts ? If not, can it be customized by extending existing Solr/Lucene libraries ? -- Thanks and Regards Rahul A. Warawdekar
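A rough sketch of the TermDocs idea against the Lucene 2.9/3.x API; the field, bounds and class name are placeholders, and in practice this logic would sit inside a custom query/scorer rather than a standalone helper:

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;

  public class TermCountFilter {
      /** True if term occurs in the given doc between min and max times (inclusive). */
      public static boolean inRange(IndexReader reader, int doc, Term term, int min, int max)
              throws IOException {
          TermDocs td = reader.termDocs(term);
          try {
              if (td.skipTo(doc) && td.doc() == doc) {
                  int freq = td.freq();   // within-document term frequency
                  return freq >= min && freq <= max;
              }
              return false;
          } finally {
              td.close();
          }
      }
  }

Phrase counts ("solr lucene"[4:9]) are harder: they need TermPositions or SpanNearQuery match counting rather than a single TermDocs, which is part of why re-examining the requirement first is worthwhile.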
Re: Solr Out of Memory Error
Dear Adam, I also got the OutOfMemory exception. I changed the JAVA_OPTS in catalina.sh as follows. ... if [ -z $LOGGING_MANAGER ]; then JAVA_OPTS=$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager else JAVA_OPTS=$JAVA_OPTS -server -Xms8096m -Xmx8096m fi ... Is this change correct? After that, I still got the same exception. The index is updated and searched frequently. I am trying to change the code to avoid the frequent updates. I guess only changing JAVA_OPTS does not work. Could you give me some help? Thanks, LB On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Is anyone familiar with the environment variable, JAVA_OPTS? I set mine to a much larger heap size and never had any of these issues again. JAVA_OPTS = -server -Xms4048m -Xmx4048m Adam On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, By adding more servers do u mean sharding of index.And after sharding , how my query performance will be affected . Will the query execution time increase. Thanks, Isan Fulia. On 19 January 2011 12:52, Grijesh pintu.grij...@gmail.com wrote: Hi Isan, It seems your index size 25GB si much more compared to you have total Ram size is 4GB. You have to do 2 things to avoid Out Of Memory Problem. 1-Buy more Ram ,add at least 12 GB of more ram. 2-Increase the Memory allocated to solr by setting XMX values.at least 12 GB allocate to solr. But if your all index will fit into the Cache memory it will give you the better result. Also add more servers to load balance as your QPS is high. Your 7 Laks data makes 25 GB of index its looking quite high.Try to lower the index size What are you indexing in your 25GB of index? - Thanx: Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p2285779.html Sent from the Solr - User mailing list archive at Nabble.com. -- Thanks Regards, Isan Fulia.
Re: QueryWeight for Solr
Hi Yonik, thanks for the fast feedback. Well, as far as I can see there is no possibility to get the original query from the similarity class... Let me ask differently: I know there are some distributed idf implementations out there. One approach is to ask every shard for its idf for a term and then aggregate those values at the master who queried them all. Afterwards they use it for their similarity etc. How do they store these idfs for the current request so that the similarity is aware of them? I do not want to reimplement distributed idf, but I want to figure out how they make it accessible to the similarity that is in use. Thank you! Regards Yonik Seeley-2-2 wrote: On Wed, Feb 9, 2011 at 12:16 PM, Em mailformailingli...@yahoo.de wrote: For the current usecase we want to experiment with different values for the idf based on different algorithms and how they affect the scoring. For tf, idf, lengthNorm, coord, etc, see Similarity. Solr already allows you to specify one in the schema, and work is underway to make it per-field: https://issues.apache.org/jira/browse/SOLR-2338 -Yonik http://lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/QueryWeight-for-Solr-tp2459933p2460386.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Out of Memory Error
Bing Li, One should be conservative when setting Xmx. Also, just setting Xmx might not do the trick at all because the garbage collector might also be the issue here. Configure the JVM to output debug logs of the garbage collector and monitor the heap usage (especially the tenured generation) with a good tool like JConsole. You might also want to take a look at your cache settings and autowarm parameters. In some scenario's with very frequent updates, a large corpus and a high load of heterogenous queries you might want to dump the documentCache and queryResultCache, the cache hitratio tends to be very low and the caches will just consume a lot of memory and CPU time. One of my projects i finally decided to only use the filterCache. Using the other caches took too much RAM and CPU while running and had a lot of evictions and still a lot hitratio. I could, of course, make the caches a lot bigger and increase autowarming but that would take a lot of time before a cache is autowarmed and a very, very, large amount of RAM. I choose to rely on the OS-cache instead. Cheers, Dear Adam, I also got the OutOfMemory exception. I changed the JAVA_OPTS in catalina.sh as follows. ... if [ -z $LOGGING_MANAGER ]; then JAVA_OPTS=$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager else JAVA_OPTS=$JAVA_OPTS -server -Xms8096m -Xmx8096m fi ... Is this change correct? After that, I still got the same exception. The index is updated and searched frequently. I am trying to change the code to avoid the frequent updates. I guess only changing JAVA_OPTS does not work. Could you give me some help? Thanks, LB On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Is anyone familiar with the environment variable, JAVA_OPTS? I set mine to a much larger heap size and never had any of these issues again. JAVA_OPTS = -server -Xms4048m -Xmx4048m Adam On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, By adding more servers do u mean sharding of index.And after sharding , how my query performance will be affected . Will the query execution time increase. Thanks, Isan Fulia. On 19 January 2011 12:52, Grijesh pintu.grij...@gmail.com wrote: Hi Isan, It seems your index size 25GB si much more compared to you have total Ram size is 4GB. You have to do 2 things to avoid Out Of Memory Problem. 1-Buy more Ram ,add at least 12 GB of more ram. 2-Increase the Memory allocated to solr by setting XMX values.at least 12 GB allocate to solr. But if your all index will fit into the Cache memory it will give you the better result. Also add more servers to load balance as your QPS is high. Your 7 Laks data makes 25 GB of index its looking quite high.Try to lower the index size What are you indexing in your 25GB of index? - Thanx: Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p228 5779.html Sent from the Solr - User mailing list archive at Nabble.com. -- Thanks Regards, Isan Fulia.
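For the GC-logging part of that advice, something along these lines in JAVA_OPTS is a common starting point; the log path is a placeholder and the flags are standard HotSpot options rather than anything Solr-specific:

  JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/solr/gc.log"

Frequent full GCs that barely shrink the tenured generation point at a heap (or caches) that really is too small for the working set; a mostly idle heap points the investigation elsewhere.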
Re: Solr Out of Memory Error
I should also add that reducing the caches and autowarm sizes (or not using them at all) drastically reduces memory consumption when a new searcher is being prepares after a commit. The memory usage will spike at these events. Again, use a monitoring tool to get more information on your specific scenario. Bing Li, One should be conservative when setting Xmx. Also, just setting Xmx might not do the trick at all because the garbage collector might also be the issue here. Configure the JVM to output debug logs of the garbage collector and monitor the heap usage (especially the tenured generation) with a good tool like JConsole. You might also want to take a look at your cache settings and autowarm parameters. In some scenario's with very frequent updates, a large corpus and a high load of heterogenous queries you might want to dump the documentCache and queryResultCache, the cache hitratio tends to be very low and the caches will just consume a lot of memory and CPU time. One of my projects i finally decided to only use the filterCache. Using the other caches took too much RAM and CPU while running and had a lot of evictions and still a lot hitratio. I could, of course, make the caches a lot bigger and increase autowarming but that would take a lot of time before a cache is autowarmed and a very, very, large amount of RAM. I choose to rely on the OS-cache instead. Cheers, Dear Adam, I also got the OutOfMemory exception. I changed the JAVA_OPTS in catalina.sh as follows. ... if [ -z $LOGGING_MANAGER ]; then JAVA_OPTS=$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager else JAVA_OPTS=$JAVA_OPTS -server -Xms8096m -Xmx8096m fi ... Is this change correct? After that, I still got the same exception. The index is updated and searched frequently. I am trying to change the code to avoid the frequent updates. I guess only changing JAVA_OPTS does not work. Could you give me some help? Thanks, LB On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Is anyone familiar with the environment variable, JAVA_OPTS? I set mine to a much larger heap size and never had any of these issues again. JAVA_OPTS = -server -Xms4048m -Xmx4048m Adam On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, By adding more servers do u mean sharding of index.And after sharding , how my query performance will be affected . Will the query execution time increase. Thanks, Isan Fulia. On 19 January 2011 12:52, Grijesh pintu.grij...@gmail.com wrote: Hi Isan, It seems your index size 25GB si much more compared to you have total Ram size is 4GB. You have to do 2 things to avoid Out Of Memory Problem. 1-Buy more Ram ,add at least 12 GB of more ram. 2-Increase the Memory allocated to solr by setting XMX values.at least 12 GB allocate to solr. But if your all index will fit into the Cache memory it will give you the better result. Also add more servers to load balance as your QPS is high. Your 7 Laks data makes 25 GB of index its looking quite high.Try to lower the index size What are you indexing in your 25GB of index? - Thanx: Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p2 28 5779.html Sent from the Solr - User mailing list archive at Nabble.com. -- Thanks Regards, Isan Fulia.
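To watch the tenured generation around commits without attaching JConsole, jstat against the servlet container's pid works as well; the pid and sampling interval below are placeholders:

  jstat -gcutil <pid> 5000    # old-gen occupancy (O column) and GC counts/times, sampled every 5 seconds

The telling pattern is the O column climbing each time a new searcher warms, followed by a full GC that reclaims very little.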
Changing value of start parameter affects numFound?
I have a data set indexed over two irons, with M docs per Solr core for a total of N cores. If I perform a query across all N cores with start=0 and rows=30, I get, say, numFound=27521. If I simply change the start param to start=27510 (simulating being on the last page of data), I get a smaller result set (say, numFound=21415). I had expected numFound to be the same in either case, since no other aspect of the query had changed. Am I mistaken? I'm using Solr 1.4.1.955763M. Faceting is enabled on the query. All cores have the same schema. Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-tp2460645p2460645.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: QueryWeight for Solr
On Wed, Feb 9, 2011 at 1:18 PM, Em mailformailingli...@yahoo.de wrote: How do they store these idfs for the current request so that the similarity is aware of them? The df (as opposed to idf) is requested from the searcher by the weight, which then uses the similarity to produce the idf. See TermWeight as an example. There's no out-of-the-box plugin to provide alternate df values though, other than the Searcher interface. If you're doing custom enough scoring, then just implementing your own query class is probably the way to go, but people might have other ideas depending on the specifics of what you're trying to do. -Yonik http://lucidimagination.com
Re: QueryWeight for Solr
Thanks, again. :) Okay, so if one wants a distributed idf one should extend a searcher instead of the query class. But it doesn't seem to be pluggable, right? Well, for our purposes extending the query class is enough, but just out of curiosity: where should one start if one wants to make some components pluggable? Since real-time search is an area where I read about the idea of making things like the searcher pluggable, this could be beneficial to the community. Regards Yonik Seeley-2-2 wrote: On Wed, Feb 9, 2011 at 1:18 PM, Em mailformailingli...@yahoo.de wrote: How do they store these idfs for the current request so that the similarity is aware of them? The df (as opposed to idf) is requested from the searcher by the weight, which then uses the similarity to produce the idf. See TermWeight as an example. There's no out-of-the-box plugin to provide alternate df values though, other than the Searcher interface. If you're doing custom enough scoring, then just implementing your own query class is probably the way to go, but people might have other ideas depending on the specifics of what you're trying to do. -Yonik http://lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/QueryWeight-for-Solr-tp2459933p2460718.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Changing value of start parameter affects numFound?
mrw wrote: I have a data set indexed over two irons, with M docs per Solr core for a total of N cores. If I perform a query across all N cores with start=0 and rows=30, I get, say, numFound=27521). If I simply change the start param to start=27510 (simulating being on the last page of data), I get a smaller result set (say, numFound=21415). I had expected numFound to be the same in either case, since no other aspect of the query had changed. Am I mistaken? I'm using Solr 1.4.1.955763M. Faceting is enabled on the query. All cores have the same schema. Thanks! More detail: numFound seems to vary unpredictably based on start value.

  start    numFound
  0-46     27521
  47-59    27520
  60       27519
  61-91    27518
  62       27517

Any ideas? -- View this message in context: http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-tp2460645p2460795.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: QueryWeight for Solr
On Wed, Feb 9, 2011 at 1:51 PM, Em mailformailingli...@yahoo.de wrote: Okay, so if one wants a distributed idf one should extend a searcher instead of the query-class. Yes. If you're interested in distributed search for Solr, there is a patch in progress: https://issues.apache.org/jira/browse/SOLR-1632 But it doesn't seem to be pluggable, right? Since you can weight with a different searcher than you query with, a searcher works fine as an extension point (but it's a lucene-level extension point, not a solr-level one). -Yonik http://lucidimagination.com
Re: Changing value of start parameter affects numFound?
On Wed, Feb 9, 2011 at 1:42 PM, mrw mikerobertsw...@gmail.com wrote: I have a data set indexed over two irons, with M docs per Solr core for a total of N cores. If I perform a query across all N cores with start=0 and rows=30, I get, say, numFound=27521). If I simply change the start param to start=27510 (simulating being on the last page of data), I get a smaller result set (say, numFound=21415). I had expected numFound to be the same in either case, since no other aspect of the query had changed. Am I mistaken? You probably have some duplicate docs in your shards (those with the same id). Solr doesn't know they are dups until it retrieves the ids of the docs to merge, and then it only takes one of the dups and decrements numFound. -Yonik http://lucidimagination.com
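One way to confirm that diagnosis is to facet on the unique key across the same shards and look for counts above 1; a sketch, assuming the key field is literally named id and with placeholder shard addresses (this is expensive on a large index, so run it off-peak):

  http://host:8983/solr/select?q=*:*&rows=0&shards=shard1:8983/solr,shard2:8983/solr&facet=true&facet.field=id&facet.limit=-1&facet.mincount=2

Any id returned by this request exists on more than one shard; when such a document lands in the merged page, Solr keeps one copy and decrements numFound, which is why the total shifts as start changes.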
Architecture decisions with Solr
Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we need a highly scalable application for multiple clients. This application will be built to serve many users, each of whom will have a client account. Each client will have a multitude of documents to index (from zero to thousands of documents). After discussion, we were talking about going multicore and having one index per client account. The reasoning is that security is achieved by having a separate index for each client. Is this the best approach? How feasible is it (dynamically creating indexes on client account creation)? Or is it better to go the faceted search capabilities route? Thanks for your help Greg
Re: Architecture decisions with Solr
What about standing up a VM (search appliance that you would make) for each client? If there's no data sharing across clients, then using the same solr server/index doesn't seem necessary. Solr will easily meet your needs though, its the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and to have one index file per client account. The reason for this is that security is achieved by having a separate index for each client etc.. Is this the best approach? How feasible is it (dynamically create indexes on client account creation. Is it better to go the faceted search capabilities route? Thanks for your help Greg
RE: Architecture decisions with Solr
From what I understand about multicore, each of the indexes are independant from each other right? Or would one index have access to the info of the other? My requirement is like you mention, a client has access only to his or her search data based in their documents. Other clients have no access to the index of other clients. Greg -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: 9 février 2011 14:28 To: solr-user@lucene.apache.org Subject: Re: Architecture decisions with Solr What about standing up a VM (search appliance that you would make) for each client? If there's no data sharing across clients, then using the same solr server/index doesn't seem necessary. Solr will easily meet your needs though, its the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and to have one index file per client account. The reason for this is that security is achieved by having a separate index for each client etc.. Is this the best approach? How feasible is it (dynamically create indexes on client account creation. Is it better to go the faceted search capabilities route? Thanks for your help Greg
Re: Architecture decisions with Solr
This application will be built to serve many users If this means that you have thousands of users, 1000s of VMs and/or 1000s of cores is not going to scale. Have an ID in the index for each user, and filter using it. Then they can see only their own documents. Assuming that you are building an app that through which they authenticate talks to solr . (i.e. all requests are filtered using their ID) -Glen On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com wrote: From what I understand about multicore, each of the indexes are independant from each other right? Or would one index have access to the info of the other? My requirement is like you mention, a client has access only to his or her search data based in their documents. Other clients have no access to the index of other clients. Greg -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: 9 février 2011 14:28 To: solr-user@lucene.apache.org Subject: Re: Architecture decisions with Solr What about standing up a VM (search appliance that you would make) for each client? If there's no data sharing across clients, then using the same solr server/index doesn't seem necessary. Solr will easily meet your needs though, its the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and to have one index file per client account. The reason for this is that security is achieved by having a separate index for each client etc.. Is this the best approach? How feasible is it (dynamically create indexes on client account creation. Is it better to go the faceted search capabilities route? Thanks for your help Greg -- -
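A sketch of the single-index, per-client filtering Glen describes, using the SolrJ API; the client_id field name and the method shape are assumptions for illustration, and clientId should come from your authentication layer, never from user input:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class ClientScopedSearch {
      /** Restricts every request to the authenticated client's own documents. */
      public static QueryResponse search(SolrServer server, String clientId, String userQuery)
              throws SolrServerException {
          SolrQuery query = new SolrQuery(userQuery);
          // clientId comes from the session, so users cannot widen the filter themselves.
          query.addFilterQuery("client_id:" + clientId);
          return server.query(query);
      }
  }

Because the fq is applied on every request and cached in the filterCache, the isolation costs very little per query, and you avoid maintaining thousands of cores or VMs.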
solr render biased search result
Hi, I have been asked whether solr can render biased search results. For example, for a search that queries all movie titles in the Comedy genre: for a user who indicates a preference for 1950's movies, can solr rank the 1950's movies higher (top of the list)? Or, if the user is a kid, can the results put G/PG rated movies at the top of the list and all the R rated movies at the bottom? I know that solr can boost the score based on a match on a particular field, but it can't favor some values over other values in the same field. Is that right? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461155.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Architecture decisions with Solr
Another option (assuming the case where a user can be granted access to a certain class of documents, and more than one user would be able to access certain documents) would be to store the access filter (as an OR query of content types) in an external cache (perhaps a database or an eternal cache that the database changes are published to periodically), then using this access filter as a facet on the base query. -sujit On Wed, 2011-02-09 at 14:38 -0500, Glen Newton wrote: This application will be built to serve many users If this means that you have thousands of users, 1000s of VMs and/or 1000s of cores is not going to scale. Have an ID in the index for each user, and filter using it. Then they can see only their own documents. Assuming that you are building an app that through which they authenticate talks to solr . (i.e. all requests are filtered using their ID) -Glen On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com wrote: From what I understand about multicore, each of the indexes are independant from each other right? Or would one index have access to the info of the other? My requirement is like you mention, a client has access only to his or her search data based in their documents. Other clients have no access to the index of other clients. Greg -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: 9 février 2011 14:28 To: solr-user@lucene.apache.org Subject: Re: Architecture decisions with Solr What about standing up a VM (search appliance that you would make) for each client? If there's no data sharing across clients, then using the same solr server/index doesn't seem necessary. Solr will easily meet your needs though, its the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and to have one index file per client account. The reason for this is that security is achieved by having a separate index for each client etc.. Is this the best approach? How feasible is it (dynamically create indexes on client account creation. Is it better to go the faceted search capabilities route? Thanks for your help Greg
Re: solr render biased search result
Cyang, why can't you, for a kid, add a boosting query genre:kid^2.0 alongside the rest? That would double the score of a match if the user is a kid. But note that you'd better calibrate the coefficient with a battery of test queries. This is part of the fine art, I think. paul Le 9 févr. 2011 à 20:44, cyang2010 a écrit : Hi, I am asked that whether solr renders biased search result? For example, for this search (query all movie title by this Comedy genre), for user who indicates a preference to 1950's movies, solr renders the 1950's movies with higher score (top in the list)?Or if user is a kid, then the result will render G/PG rated movie top in the list, and render all the R rated movie bottom in the list? I know that solr can boost score based on match on a particular field. But it can't favor some value over other value in the same field. is that right? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461155.html Sent from the Solr - User mailing list archive at Nabble.com.
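In dismax terms that suggestion is just a boost query (bq) added to the user's request; a sketch with made-up field names and boost values:

  q=comedy&defType=dismax&qf=title+genre&bq=rating:(G+OR+PG)^2.0

The bq does not filter anything out -- R-rated titles still match -- it only adds score to documents whose rating field matches, pushing them toward the top of the list. The ^2.0 factor is the knob Paul suggests calibrating with test queries.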
solr current workding directory or reading config files
Hi, I have a class (in a jar) that reads from properties (text) files. I have these files in the same jar file as the class. However, when my class reads those properties files, those files cannot be found since solr reads from tomcat's bin directory. I don't really want to put the config files in tomcat's bin directory. How do I reconcile this? Tri
pre and post processing when building index
Hi, I'm scheduling solr to build every hour or so. I'd like to do some pre and post processing for each index build. The preprocessing would do some checks and perhaps will skip the build. For post processing, I will do some checks and either commit or rollback the build. Can I write some class and plugin into solr for this? Thanks, Tri
DataImportHandler: regex debugging
I am trying to use the regex transformer but it's not returning anything. Either my regex is wrong, or I've done something else wrong in the setup of the entity. Is there any way to debug this? Making a change and waiting 7 minutes to reindex the entity sucks. entity name=boxshot query=SELECT GROUP_CONCAT(i.url, ',') boxshot_url, GROUP_CONCAT(i2.url, ',') boxshot_url_small FROM games g left join image_sizes i ON g.box_image_id = i.id AND i.size_type = 39 left join image_sizes i2 on g.box_image_id = i2.id AND i2.size_type = 40 WHERE g.game_seo_title = '${game.game_seo_title}' GROUP BY g.game_seo_title field name=main_image regex=^(.*?), sourceColName=boxshot_url / field name=small_image regex=^(.*?), sourceColName=boxshot_url_small / /entity This returns columns that are either null, or have some comma-separated strings. I want the bit up to the first comma, if it exists. Ideally I could have it log the query and the input/output of the field statements.
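A couple of things that may shorten the debug loop here, hedged since they come from the DataImportHandler wiki rather than this thread: the RegexTransformer only runs if it is declared in the entity's transformer attribute, DIH field declarations normally use column= rather than name=, and the LogTransformer can log the resolved values per row. A sketch (the log template and field names are illustrative):

  <entity name="boxshot"
          transformer="RegexTransformer,LogTransformer"
          logTemplate="boxshot_url=${boxshot.boxshot_url} main_image=${boxshot.main_image}"
          logLevel="info"
          query="SELECT ...">
    <field column="main_image"  regex="^(.*?)," sourceColName="boxshot_url"/>
    <field column="small_image" regex="^(.*?)," sourceColName="boxshot_url_small"/>
  </entity>

Note that ^(.*?), only matches when a comma is present; a pattern like ^([^,]*) also captures the value when there is no comma at all. DIH's interactive development mode (command=full-import with the debug and verbose parameters, plus rows= to limit how many rows are processed) shows per-row output without waiting for a full reindex.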
Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?
Hello Andy, did you get a final answer to your question? I am also trying to do something similar; please give me pointers if you have any. Basically I also need to use NGram with WhitespaceTokenizer -- any help will be appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/NGramFilterFactory-for-auto-complete-that-matches-the-middle-of-multi-lingual-tags-tp1619234p2459466.html Sent from the Solr - User mailing list archive at Nabble.com.
Why does the StatsComponent only work with indexed fields?
Is there a reason why the StatsComponent only deals with indexed fields? I just updated the wiki: http://wiki.apache.org/solr/StatsComponent to call this fact out since it was not apparent previously. I've briefly skimmed the source of StatsComponent, but am not familiar enough with the code or Solr yet to understand if it was omitted for performance reasons or some other reason. Any information would be appreciated. Thanks, Travis
Re: solr render biased search result
That makes sense. It is a little bit indirect: you have to translate the user preference/profile into a search field value and then have the search boost documents matching that preference value. -- View this message in context: http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461668.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Why does the StatsComponent only work with indexed fields?
What kinds of information would you expect for a stored-only field? I mean, the stored part is just a blob that Solr doesn't peek inside of, so I'm not sure what useful information *could* be returned Best Erick On Wed, Feb 9, 2011 at 3:55 PM, Travis Truman trum...@gmail.com wrote: Is there a reason why the StatsComponent only deals with indexed fields? I just updated the wiki: http://wiki.apache.org/solr/StatsComponent to call this fact out since it was not apparent previously. I've briefly skimmed the source of StatsComponent, but am not familiar enough with the code or Solr yet to understand if it was omitted for performance reasons or some other reason. Any information would be appreciated. Thanks, Travis
Re: solr render biased search result
What *could* solr do for you? You've outlined a domain-specific requirement, I'm not sure how a general-purpose search engine would incorporate that functionality Best Erick On Wed, Feb 9, 2011 at 4:08 PM, cyang2010 ysxsu...@hotmail.com wrote: That makes sense. It is a little bit indirect. You have to translate that user preference/profile into a search field value and then dictate search result boosting the doc with that preference value. -- View this message in context: http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461668.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr current workding directory or reading config files
Is your war always deployed to the same location, i.e. /usr/mycomp/myapplication/webapps/myapp.war? If so, then on startup copy the files out of your directory and put them under CATALINA_BASE/solr (/usr/mycomp/myapplication/solr), and in your war file have the META-INF/context.xml JNDI setting point to that: <Context> <Environment name="solr/home" type="java.lang.String" value="/usr/mycomp/myapplication/solr" override="true"/> </Context> If you know of a way to reference CATALINA_BASE in the context.xml, that would make it easier. On Feb 9, 2011, at 12:00 PM, Tri Nguyen wrote: Hi, I have a class (in a jar) that reads from properties (text) files. I have these files in the same jar file as the class. However, when my class reads those properties files, those files cannot be found since solr reads from tomcat's bin directory. I don't really want to put the config files in tomcat's bin directory. How do I reconcile this? Tri
communication between entity processor and solr DataImporter
Hi, I'd like to communicate errors from my entity processor to the DataImporter. Should there be an error in my entity processor, I'd like the index build to roll back. How can I do this? I want to throw an exception of some sort. The only thing I can think of is to force a runtime exception to be thrown in nextRow() of the entity processor, since runtime exceptions are not checked and do not have to be declared in the nextRow() method signature. How can I request that the nextRow() method signature be updated to throw Exception? Would that even make sense? Tri
Re: solr current working directory or reading config files
Wanted to add some more details to my problem. I have many jars that each have their own config files, so I'd have to copy files for every jar. Can Solr read from the classpath (jar files)? Yes, my war is always deployed to the same location under webapps. I already have solr/home defined in web.xml. I'll try copying my files in there, but I would have to extract every jar file and do this manually. From: Wilkes, Chris cwil...@gmail.com To: solr-user@lucene.apache.org Sent: Wed, February 9, 2011 3:44:03 PM Subject: Re: solr current working directory or reading config files Is your war always deployed to the same location, i.e. /usr/mycomp/myapplication/webapps/myapp.war? If so, then on startup copy the files out of your directory and put them under CATALINA_BASE/solr (/usr/mycomp/myapplication/solr), and in your war file have the META-INF/context.xml JNDI setting point to that: <Context> <Environment name="solr/home" type="java.lang.String" value="/usr/mycomp/myapplication/solr" override="true" /> </Context> If you know of a way to reference CATALINA_BASE in the context.xml that would make it easier. On Feb 9, 2011, at 12:00 PM, Tri Nguyen wrote: Hi, I have a class (in a jar) that reads from properties (text) files. I have these files in the same jar file as the class. However, when my class reads those properties files, those files cannot be found since Solr reads from Tomcat's bin directory. I don't really want to put the config files in Tomcat's bin directory. How do I reconcile this? Tri
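One way around copying the files at all, sketched here only as a suggestion with made-up class and file names, is to have the class load its properties from the classpath instead of from the working directory, so the files can stay inside the jar:

  import java.io.InputStream;
  import java.util.Properties;

  public class MyConfigLoader {
    public static Properties load() throws Exception {
      // Looks the file up on the classpath (including inside jars),
      // not relative to Tomcat's bin directory
      InputStream in = MyConfigLoader.class.getClassLoader()
          .getResourceAsStream("myconfig.properties");
      Properties props = new Properties();
      props.load(in);
      in.close();
      return props;
    }
  }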
Re: communication between entity processor and solr DataImporter
I can throw DataImportHandlerException (a runtime exception) from my entity processor, which will force a rollback. Tri From: Tri Nguyen tringuye...@yahoo.com To: solr-user@lucene.apache.org Sent: Wed, February 9, 2011 3:50:05 PM Subject: communication between entity processor and solr DataImporter Hi, I'd like to communicate errors from my entity processor to the DataImporter. Should there be an error in my entity processor, I'd like the index build to roll back. How can I do this? I want to throw an exception of some sort. The only thing I can think of is to force a runtime exception to be thrown in nextRow() of the entity processor, since runtime exceptions are not checked and do not have to be declared in the nextRow() method signature. How can I request that the nextRow() method signature be updated to throw Exception? Would that even make sense? Tri
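A rough sketch of what that looks like in a custom entity processor (the class and helper names are invented for illustration, and the exact DIH API can differ slightly between Solr versions):

  import java.util.Map;
  import org.apache.solr.handler.dataimport.DataImportHandlerException;
  import org.apache.solr.handler.dataimport.EntityProcessorBase;

  public class MyEntityProcessor extends EntityProcessorBase {
    @Override
    public Map<String, Object> nextRow() {
      try {
        return fetchNextRowFromSource();  // hypothetical helper that reads the real data source
      } catch (Exception e) {
        // SEVERE aborts the import, and the DataImporter rolls the index back
        throw new DataImportHandlerException(DataImportHandlerException.SEVERE,
            "Failed to read next row: " + e.getMessage());
      }
    }

    private Map<String, Object> fetchNextRowFromSource() {
      // hypothetical: pull the next record; returning null tells DIH there is no more data
      return null;
    }
  }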
Re: communication between entity processor and solr DataImporter
Tri: You might want to consider, rather than going through DIH with your own entity processor, just using SolrJ in a separate process. That allows you much finer control over the behavior of your indexing process. Making a connection to Solr via SolrJ and adding a one-field document is maybe a 20-line program. Of course the complexity will come in your database-access code and error handling, and your documents will be much larger than one field; I just included that estimate so you can gauge whether a pilot would be worthwhile... Just a thought. Erick On Wed, Feb 9, 2011 at 7:32 PM, Tri Nguyen tringuye...@yahoo.com wrote: I can throw DataImportHandlerException (a runtime exception) from my entity processor, which will force a rollback. Tri From: Tri Nguyen tringuye...@yahoo.com To: solr-user@lucene.apache.org Sent: Wed, February 9, 2011 3:50:05 PM Subject: communication between entity processor and solr DataImporter Hi, I'd like to communicate errors from my entity processor to the DataImporter. Should there be an error in my entity processor, I'd like the index build to roll back. How can I do this? I want to throw an exception of some sort. The only thing I can think of is to force a runtime exception to be thrown in nextRow() of the entity processor, since runtime exceptions are not checked and do not have to be declared in the nextRow() method signature. How can I request that the nextRow() method signature be updated to throw Exception? Would that even make sense? Tri
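To give a sense of the size Erick is describing, a minimal SolrJ indexer from that era looks roughly like this (the URL and field values are placeholders, and a real program would add the database access and error handling he mentions):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class MinimalIndexer {
    public static void main(String[] args) throws Exception {
      // Point at the Solr instance/core to index into
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      // A one-field document, as in the estimate above
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "example-1");

      server.add(doc);
      server.commit();
    }
  }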
Re: Nutch and Solr search on the fly
Hi Charan, Thanks for the clarifications. The link I have been referring to (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) does not say anything about using the crawl command. Do I have to do it after the last step mentioned? Thanks, Abi On Thu, Feb 10, 2011 at 12:58 AM, charan kumar charan.ku...@gmail.com wrote: Hi Abishek, depth is a param of the crawl command, not the fetch command. If you are using a custom script calling the individual stages of the nutch crawl, then depth N means running that script N times. You can put a loop in the script. Thanks, Charan On Wed, Feb 9, 2011 at 6:26 AM, .: Abhishek :. ab1s...@gmail.com wrote: Hi Erick, Thanks a bunch for the response. Could be a chance.. but all I am wondering is where to specify the depth in the whole entire process in the URL http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried specifying it during the fetcher phase but it was just ignored :( Thanks, Abi On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson erickerick...@gmail.com wrote: WARNING: I don't do Nutch much, but could it be that your crawl depth is 1? See: http://wiki.apache.org/nutch/NutchTutorial and search for depth. Best, Erick On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com wrote: Hi Markus, I am sorry for not being clear. I meant to say that... Suppose a url, namely www.somehost.com/gifts/greetingcard.html (which in turn contains links to a.html, b.html, c.html, d.html), is injected into the seed.txt. After the whole process I was expecting a bunch of other pages crawled from this seed url. However, at the end all I see is the contents from only this page, namely www.somehost.com/gifts/greetingcard.html, and I do not see any other pages (here a.html, b.html, c.html, d.html) crawled from this one. The crawling happens only for the URLs mentioned in the seed.txt and does not proceed further from there. So I am just a bit confused: why is it not crawling the linked pages (a.html, b.html, c.html and d.html)? I get a feeling that I am missing something that the author of the blog (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed everyone would know. Thanks, Abi On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma markus.jel...@openindex.io wrote: The parsed data is only sent to the Solr index if you tell a segment to be indexed: solrindex crawldb linkdb segment If you did this only once after injecting and then the consequent fetch, parse, update, index sequence, then you, of course, only see those URLs. If you don't index a segment after it has been parsed, you need to do it later on. On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote: Hi all, I am a newbie to nutch and solr. Well, relatively much newer to Solr than Nutch :) I have been using nutch for the past two weeks, and I wanted to know if I can query or search on my nutch crawls on the fly (before they complete). I am asking this because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the nutch crawler is still crawling the URLs. Someone suggested that Solr would make it possible. I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. By this process, I see only the injected URLs shown in the Solr search.
I know I did something really foolish and the crawl never happened; I feel I am missing some information here. I think somewhere in the process there should be a crawl happening and I missed it. Just wanted to see if someone could help me point out where I went wrong in the process. Forgive my foolishness and thanks for your patience. Cheers, Abi -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
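For readers following along: with the Nutch 1.x tooling discussed in this thread, the depth is normally given to the all-in-one crawl command rather than to the individual fetch step, roughly like this (the paths and numbers are only illustrative, and exact options vary by Nutch version):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

followed by pushing the parsed data to Solr with the solrindex step Markus mentions, e.g. bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*. With a depth of 1, only the seed URLs themselves are fetched, which matches the behaviour Abi describes.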
Re: Architecture decisions with Solr
I tried the multi-core route and it gets too complicated and cumbersome to maintain. That is just from my own personal testing... It was suggested that each user have their own ID in a single index that you can query against accordingly. In the example schema.xml I believe there is a field type called textTight, or something like that, that is meant for SKU numbers. Give each user their own GUID or MD5 hash and add that as part of all your queries. That way, only their data are returned. It would be the equivalent of something like this... SELECT * FROM mytable WHERE userid = '3F2504E04F8911D39A0C0305E82C3301' AND ... Grant Ingersoll gave a presentation at the Lucene Revolution conference that demonstrated that you can build a query to be as easy or as complicated as any SQL statement. Maybe he can share that PPT? Adam On Feb 9, 2011, at 2:47 PM, Sujit Pal wrote: Another option (assuming the case where a user can be granted access to a certain class of documents, and more than one user would be able to access certain documents) would be to store the access filter (as an OR query of content types) in an external cache (perhaps a database or an external cache that the database changes are published to periodically), then use this access filter as a facet on the base query. -sujit On Wed, 2011-02-09 at 14:38 -0500, Glen Newton wrote: "This application will be built to serve many users" If this means that you have thousands of users, 1000s of VMs and/or 1000s of cores are not going to scale. Have an ID in the index for each user, and filter using it. Then they can see only their own documents. Assuming that you are building an app through which they authenticate and which talks to Solr (i.e. all requests are filtered using their ID). -Glen On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com wrote: From what I understand about multicore, each of the indexes is independent of the others, right? Or would one index have access to the info of the other? My requirement is like you mention: a client has access only to his or her own search data based on their documents. Other clients have no access to the index of other clients. Greg -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: 9 février 2011 14:28 To: solr-user@lucene.apache.org Subject: Re: Architecture decisions with Solr What about standing up a VM (a search appliance that you would make) for each client? If there's no data sharing across clients, then using the same Solr server/index doesn't seem necessary. Solr will easily meet your needs though; it's the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and having one index per client account. The reason for this is that security is achieved by having a separate index for each client, etc. Is this the best approach? How feasible is it (dynamically creating indexes on client account creation)? Is it better to go the faceted search capabilities route? Thanks for your help Greg
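To make the per-user filtering suggested above concrete, a request could simply attach the user's identifier as a filter query (the field name is a placeholder; the hash is the one from Adam's SQL example):

  http://localhost:8983/solr/select?q=report&fq=userid:3F2504E04F8911D39A0C0305E82C3301

The fq clause restricts every result to that user's documents and is cached separately from the main query, so it is cheap to apply to all requests.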
Faceting Query
Hi, What is the significance of copyField when used for faceting? Please explain with an example. Thanks! Isha
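A common pattern, sketched here with made-up field names rather than anything from the original mail, is to search on an analyzed text field but facet on an untokenized copy of it, so that the facet values stay whole instead of being split into terms. In schema.xml that looks roughly like:

  <field name="category" type="text" indexed="true" stored="true"/>
  <field name="category_facet" type="string" indexed="true" stored="false"/>
  <copyField source="category" dest="category_facet"/>

Queries then facet on the copy, e.g. q=shoes&facet=true&facet.field=category_facet, while full-text search still runs against the analyzed category field.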
Faceting Query
What is the facet.pivot parameter? Please explain with an example.
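Briefly, and with the caveat that facet.pivot was only available in development builds at the time of this thread: facet.pivot takes a comma-separated list of fields and returns nested, decision-tree style counts, i.e. for each value of the first field, the counts of the second field within it, and so on. A request might look roughly like q=*:*&facet=true&facet.pivot=category,manufacturer (field names are illustrative), returning manufacturer counts broken down per category.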