Where is NGramFilter?

2011-02-09 Thread Kai Schlamp
Hi.

On the Sunspot (a Ruby Solr client) Wiki
(https://github.com/outoftime/sunspot/wiki/Matching-substrings-in-fulltext-search)
it says that the NGramFilter should allow substring indexing. As I
never got it working, I searched a bit and found this site:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
There is only EdgeNGramFilterFactory listed (which I got working for
prefix indexing), but no NGramFilterFactory. Is that filter not
supported anymore, or is that list not up to date? Is there an
alternative filter for getting substring searching working?

Best regards,
Kai


high cpu usage

2011-02-09 Thread Erez Zarum

Hello,
We have been running read-only Solr instances for a few months now. 
Yesterday I noticed high CPU usage coming from the JVM; it 
simply uses 100% of the CPU for no apparent reason.

Nothing was changed, we are using Jetty as a Servlet container for solr.
Where can I start looking for what is causing it? It has been using 100% CPU for 
almost 24 hours now.


Thanks,
Erez.


Re: General question about Solr Caches

2011-02-09 Thread Savvas-Andreas Moysidis
Hi Hoss,

Ok, that makes much more sense now. I was under the impression that the values
were copied as well, which seemed a bit odd..
unless you have to deal with a use case similar to yours. :)

Cheers,
- Savvas

On 9 February 2011 02:25, Chris Hostetter hossman_luc...@fucit.org wrote:

 : In my understanding, the Current Index Searcher uses a cache instance and
 : when a New Index Searcher is registered a new cache instance is used
 which
 : is also auto-warmed. However, what happens when the New Index Searcher is
 a
 : view of an index which has been modified? If the entries contained in the
 : old cache are copied during auto warming to the new cache wouldn’t that
 new
 : cache contain invalid entries?

 a) i'm not sure what you mean by view of an index which has been
 modified ... except for the first time an index is created, an Index
 Searcher always contains a view of an index which has been modified --
 that view that the IndexSearcher represents is entirely consistent and
 doesn't change as documents are added/removed - that's why a new Searcher
 needs to be opened.

 b) entries are not copied during autowarming.  the *keys* of the entries
 in the old cache are used to warm the new cache -- using the new searcher
 to generate new values.

 (caveat: if you have a custom cache, you could write a custom cache
 regenerator that did copy the values from the old cache verbatim -- i have
 done that in special cases where the type of object i was caching didn't
 vary based on the IndexSearcher -- or did vary, but in such a way that i
 could use the new Searcher to determine a cheap piece of information and
 based on the result either reuse an old value that was expensive to
 compute, or recompute it using the new Searcher.  ... but none of the
 default cache regenerators for the stock solr caches work this way)
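
 To make that concrete, a minimal sketch of such a copy-through regenerator,
 written against the Solr 1.4 CacheRegenerator interface, might look like the
 following (the class name is made up, and this is only safe when the cached
 values really don't depend on the searcher):

 import java.io.IOException;
 import org.apache.solr.search.CacheRegenerator;
 import org.apache.solr.search.SolrCache;
 import org.apache.solr.search.SolrIndexSearcher;

 public class CopyThroughRegenerator implements CacheRegenerator {
   public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                 SolrCache newCache, SolrCache oldCache,
                                 Object oldKey, Object oldVal) throws IOException {
     // reuse the old value verbatim instead of recomputing it with newSearcher
     newCache.put(oldKey, oldVal);
     return true; // continue warming the remaining old entries
   }
 }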


 :
 :
 :
 : Thanks,
 : - Savvas
 :

 -Hoss


IndexOutOfBoundsException

2011-02-09 Thread Dominik Lange

hi,

we have a problem with our solr test instance.
This instance is running with 90 cores with about 2 GB of Index-Data per core.

This worked fine for a few weeks.

Now we get an exception querying data from one core : 
java.lang.IndexOutOfBoundsException: Index: 104, Size: 11
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288)
at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:277)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
at 
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:129)
at 
org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:160)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:211)
at 
org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:277)
at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:961)
at 
org.apache.lucene.index.DirectoryReader$MultiTermEnum.<init>(DirectoryReader.java:989)
at 
org.apache.lucene.index.DirectoryReader.terms(DirectoryReader.java:626)
at 
org.apache.solr.search.SolrIndexReader.terms(SolrIndexReader.java:302)
at 
org.apache.lucene.search.PrefixTermEnum.<init>(PrefixTermEnum.java:41)
at org.apache.lucene.search.PrefixQuery.getEnum(PrefixQuery.java:45)
at 
org.apache.lucene.search.MultiTermQuery$ConstantScoreAutoRewrite.rewrite(MultiTermQuery.java:227)
at 
org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:382)
at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:438)
at 
org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:311)
at org.apache.lucene.search.Query.weight(Query.java:98)
at org.apache.lucene.search.Searcher.createWeight(Searcher.java:230)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
...

All other cores are working fine with the same schema.
This problem only occurs when querying for specific data, like:
q=fieldA:valueA%20AND%20fieldB:valueB

With the following query, data is returned:
q=*:*

Does anybody have any suggestions on what is causing this problem?
Are 90 cores too many for a single Solr instance?

Thanks in advance,

Dominik


Maintain stopwords.txt and synonyms.txt

2011-02-09 Thread Timo Schmidt
Hello everyone,

I am currently developing a search solution based on Apache Solr. I want to 
offer the user the possibility to maintain synonyms and stopwords in a 
user-friendly tool, but so far I could not find any way to write the 
stopwords.txt or synonyms.txt.

Are there any other solutions?

Currently I have some ideas for how to handle it:

1. Implement another SynonymFilterFactory that allows other data sources, like 
databases. I have already seen approaches for that, but no solutions yet.
2. Implement a file-writer request handler to write the stopwords.txt.

Are there other solutions which are maybe already implemented?

Thanks and best regards Timo


Timo Schmidt
Entwickler (Diplom Informatiker FH)


AOE media GmbH
Borsigstr. 3
65205 Wiesbaden
Germany 
Tel. +49 (0) 6122 70 70 7 - 234
Fax. +49 (0) 6122 70 70 7 -199



e-Mail: timo.schm...@aoemedia.de
Web: http://www.aoemedia.de/

Pflichtangaben laut Handelsgesetz §37a / Aktiengesetz §35a
USt-ID Nr.: DE250247455
Handelsregister: Wiesbaden B
Handelsregister Nr.: 22567 


Stammsitz: Wiesbaden
Creditreform: 625.0209354
Geschäftsführer: Kian Toyouri Gould 







Re: Maintain stopwords.txt and synonyms.txt

2011-02-09 Thread Stefan Matheis
Timo,

On Wed, Feb 9, 2011 at 11:07 AM, Timo Schmidt timo.schm...@aoemedia.de wrote:
 But currently I could not find any possibility to write the stopwords.txt or 
 synonyms.txt.

What about writing the files from an external application and reloading
your Solr core!?
That seems to be the simplest way to solve your problem, no?

Regards
Stefan


AW: Maintain stopwords.txt and synonyms.txt

2011-02-09 Thread Timo Schmidt
Hi Stefan,

I already thought about that, maybe some PHP service or something like that.
But this would mean that I need additional software on that server, like a 
normal Apache installation, which needs to be maintained. That's why I thought 
a solution that is built into Solr would be nice.

Thanks

Timo Schmidt
Entwickler (Diplom Informatiker FH)


AOE media GmbH
Borsigstr. 3
65205 Wiesbaden
Germany 
Tel. +49 (0) 6122 70 70 7 - 234
Fax. +49 (0) 6122 70 70 7 -199



e-Mail: timo.schm...@aoemedia.de
Web: http://www.aoemedia.de/

Pflichtangaben laut Handelsgesetz §37a / Aktiengesetz §35a
USt-ID Nr.: DE250247455
Handelsregister: Wiesbaden B
Handelsregister Nr.: 22567 
Stammsitz: Wiesbaden
Creditreform: 625.0209354
Geschäftsführer: Kian Toyouri Gould 


-----Original Message-----
From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
Sent: Wednesday, 9 February 2011 11:14
To: solr-user@lucene.apache.org
Subject: Re: Maintain stopwords.txt and synonyms.txt

Timo,

On Wed, Feb 9, 2011 at 11:07 AM, Timo Schmidt timo.schm...@aoemedia.de wrote:
 But currently I could not find any possibility to write the stopwords.txt or 
 synonyms.txt.

what about writing the Files from an external Application and reload
your Solr Core!?
Seemed to be the simplest way to solve your problem, not?

Regards
Stefan


Re: Maintain stopwords.txt and synonyms.txt

2011-02-09 Thread Stefan Matheis
Hi Timo,

Of course - that's right. Write some JSP (I guess) which could be
integrated into the already existing Jetty/Tomcat server?

Just wondering: how do you perform search requests to Solr?
Normally there is already some other service running which acts as a
'proxy' to the outer world? ;)

Regards
Stefan

On Wed, Feb 9, 2011 at 11:20 AM, Timo Schmidt timo.schm...@aoemedia.de wrote:
 Hi Stefan,

 i allready thought about that. Maybe some php service or something like that.
 But this would mean, that I need additional software on that server like a 
 normal
 Apache installation, which needs to be maintained. That's why I thought a 
 solution that
 is build into solr would be nice.

 Thanks

 Timo Schmidt
 Entwickler (Diplom Informatiker FH)


 AOE media GmbH
 Borsigstr. 3
 65205 Wiesbaden
 Germany
 Tel. +49 (0) 6122 70 70 7 - 234
 Fax. +49 (0) 6122 70 70 7 -199



 e-Mail: timo.schm...@aoemedia.de
 Web: http://www.aoemedia.de/

 Pflichtangaben laut Handelsgesetz §37a / Aktiengesetz §35a
 USt-ID Nr.: DE250247455
 Handelsregister: Wiesbaden B
 Handelsregister Nr.: 22567
 Stammsitz: Wiesbaden
 Creditreform: 625.0209354
 Geschäftsführer: Kian Toyouri Gould


 -----Original Message-----
 From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
 Sent: Wednesday, 9 February 2011 11:14
 To: solr-user@lucene.apache.org
 Subject: Re: Maintain stopwords.txt and synonyms.txt

 Timo,

 On Wed, Feb 9, 2011 at 11:07 AM, Timo Schmidt timo.schm...@aoemedia.de 
 wrote:
 But currently I could not find any possibility to write the stopwords.txt or 
 synonyms.txt.

 what about writing the Files from an external Application and reload
 your Solr Core!?
 Seemed to be the simplest way to solve your problem, not?

 Regards
 Stefan



Re: Nutch and Solr search on the fly

2011-02-09 Thread Markus Jelsma
The parsed data is only sent to the Solr index if you tell a segment to be 
indexed; solrindex crawldb linkdb segment

If you did this only once after injecting, and then ran the subsequent 
fetch, parse, update, index sequence, then you of course only see those URLs. 
If you don't index a segment after it has been parsed, you need to do it later 
on.
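
Roughly, one round with the separate commands looks like this (Nutch 1.x style;
the paths and the topN value are only examples):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $s
bin/nutch parse $s
bin/nutch updatedb crawl/crawldb $s
bin/nutch invertlinks crawl/linkdb $s
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb $s

The solrindex step can also be run later for any segment that has already been
parsed.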

On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
 Hi all,
 
  I am a newbie to nutch and solr. Well relatively much newer to Solr than
 Nutch :)
 
  I have been using nutch for past two weeks, and I wanted to know if I can
 query or search on my nutch crawls on the fly(before it completes). I am
 asking this because the websites I am crawling are really huge and it takes
 around 3-4 days for a crawl to complete. I want to analyze some quick
 results while the nutch crawler is still crawling the URLs. Some one
 suggested me that Solr would make it possible.
 
  I followed the steps in
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. By
 this process, I see only the injected URLs are shown in the Solr search. I
 know I did something really foolish and the crawl never happened, I feel I
 am missing some information here. I think somewhere in the process there
 should be a crawling happening and I missed it out.
 
  Just wanted to see if some one could help me pointing this out and where I
 went wrong in the process. Forgive my foolishness and thanks for your
 patience.
 
 Cheers,
 Abi

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: [WKT] Spatial Searching

2011-02-09 Thread Grant Ingersoll
The show-stopper for JTS is its license, unfortunately.  Otherwise, I think it 
would be done already!  We could, since it's LGPL, make it an optional 
dependency, assuming someone can stub it out.

On Feb 8, 2011, at 11:18 PM, Adam Estrada wrote:

 I just came across a ~nudge post over in the SIS list on what the status is 
 for that project. This got me looking more in to spatial mods with Solr4.0.  
 I found this enhancement in Jira. 
 https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David 
 mentions that he's already integrated JTS in to Solr4.0 for querying on 
 polygons stored as WKT. 
 
 It's relatively easy to get WKT strings in to Solr but does the Field type 
 exist yet? Is there a patch or something that I can test out? 
 
 Here's how I would do it using GDAL/OGR and the already existing csv update 
 handler. http://www.gdal.org/ogr/drv_csv.html
 
 ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT
 This converts a shapefile to a CSV with the geometries intact, in the form of 
 WKT. You can then get the data into Solr by running the following command.
 curl 
 http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8
 There are lots of flavors of geometries so I suspect that this will be a 
 daunting task but because JTS recognizes each geometry type it should be 
 possible to work with them. 
 Does anyone know of a patch or even when this functionality might be included 
 in to Solr4.0? I need to query for polygons ;-)
 Thanks,
 Adam
 
 
 

--
Grant Ingersoll
http://www.lucidimagination.com/



Re: [WKT] Spatial Searching

2011-02-09 Thread Estrada Groups
How could I stub this out, not being a Java guy? What is needed in order to do 
this? 

Licensing is always going to be an issue with JTS, which is why I am interested 
in the SIS project sitting in incubation right now. 

I'm willing to put forth the effort if I had a little direction from the peanut 
gallery ;-)

Adam


On Feb 9, 2011, at 7:03 AM, Grant Ingersoll gsing...@apache.org wrote:

 The show stopper for JTS is it's license, unfortunately.  Otherwise, I think 
 it would be done already!  We could, since it's LGPL, make it an optional 
 dependency, assuming someone can stub it out.
 
 On Feb 8, 2011, at 11:18 PM, Adam Estrada wrote:
 
 I just came across a ~nudge post over in the SIS list on what the status is 
 for that project. This got me looking more in to spatial mods with Solr4.0.  
 I found this enhancement in Jira. 
 https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David 
 mentions that he's already integrated JTS in to Solr4.0 for querying on 
 polygons stored as WKT. 
 
 It's relatively easy to get WKT strings in to Solr but does the Field type 
 exist yet? Is there a patch or something that I can test out? 
 
 Here's how I would do it using GDAL/OGR and the already existing csv update 
 handler. http://www.gdal.org/ogr/drv_csv.html
 
 ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT
 This converts a shapefile to a csv with the geometries in tact in the form 
 of WKT. You can then get the data in to Solr by running the following 
 command.
 curl 
 http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8
 There are lots of flavors of geometries so I suspect that this will be a 
 daunting task but because JTS recognizes each geometry type it should be 
 possible to work with them. 
 Does anyone know of a patch or even when this functionality might be 
 included in to Solr4.0? I need to query for polygons ;-)
 Thanks,
 Adam
 
 
 
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 


AW: Maintain stopwords.txt and synonyms.txt

2011-02-09 Thread Timo Schmidt
Yes we have something, but on another machine.


Timo Schmidt
Entwickler (Diplom Informatiker FH)


AOE media GmbH
Borsigstr. 3
65205 Wiesbaden
Germany 
Tel. +49 (0) 6122 70 70 7 - 234
Fax. +49 (0) 6122 70 70 7 -199



e-Mail: timo.schm...@aoemedia.de
Web: http://www.aoemedia.de/

Pflichtangaben laut Handelsgesetz §37a / Aktiengesetz §35a
USt-ID Nr.: DE250247455
Handelsregister: Wiesbaden B
Handelsregister Nr.: 22567 
Stammsitz: Wiesbaden
Creditreform: 625.0209354
Geschäftsführer: Kian Toyouri Gould 


-----Original Message-----
From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
Sent: Wednesday, 9 February 2011 11:34
To: solr-user@lucene.apache.org
Subject: Re: Maintain stopwords.txt and synonyms.txt

Hi Timo,

of course - that's right. Write some JSP (i guess) which could be
integrated in the already existing jetty/tomcat Server?

Just wondering about, how do you perform Search-Requests to Solr?
Normally, there is already any other Service running, which acts as
'proxy' to the outer world? ;)

Regards
Stefan

On Wed, Feb 9, 2011 at 11:20 AM, Timo Schmidt timo.schm...@aoemedia.de wrote:
 Hi Stefan,

 i allready thought about that. Maybe some php service or something like that.
 But this would mean, that I need additional software on that server like a 
 normal
 Apache installation, which needs to be maintained. That's why I thought a 
 solution that
 is build into solr would be nice.

 Thanks

 Timo Schmidt
 Entwickler (Diplom Informatiker FH)


 AOE media GmbH
 Borsigstr. 3
 65205 Wiesbaden
 Germany
 Tel. +49 (0) 6122 70 70 7 - 234
 Fax. +49 (0) 6122 70 70 7 -199



 e-Mail: timo.schm...@aoemedia.de
 Web: http://www.aoemedia.de/

 Pflichtangaben laut Handelsgesetz §37a / Aktiengesetz §35a
 USt-ID Nr.: DE250247455
 Handelsregister: Wiesbaden B
 Handelsregister Nr.: 22567
 Stammsitz: Wiesbaden
 Creditreform: 625.0209354
 Geschäftsführer: Kian Toyouri Gould


 -----Original Message-----
 From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
 Sent: Wednesday, 9 February 2011 11:14
 To: solr-user@lucene.apache.org
 Subject: Re: Maintain stopwords.txt and synonyms.txt

 Timo,

 On Wed, Feb 9, 2011 at 11:07 AM, Timo Schmidt timo.schm...@aoemedia.de 
 wrote:
 But currently I could not find any possibility to write the stopwords.txt or 
 synonyms.txt.

 what about writing the Files from an external Application and reload
 your Solr Core!?
 Seemed to be the simplest way to solve your problem, not?

 Regards
 Stefan



Re: [WKT] Spatial Searching

2011-02-09 Thread Estrada Groups
Thought I would share this on web mapping...it's a great write up and something 
to consider when talking about working with spatial data.

http://www.tokumine.com/2010/09/20/gis-data-payload-sizes/

Adam


On Feb 9, 2011, at 7:03 AM, Grant Ingersoll gsing...@apache.org wrote:

 The show stopper for JTS is it's license, unfortunately.  Otherwise, I think 
 it would be done already!  We could, since it's LGPL, make it an optional 
 dependency, assuming someone can stub it out.
 
 On Feb 8, 2011, at 11:18 PM, Adam Estrada wrote:
 
 I just came across a ~nudge post over in the SIS list on what the status is 
 for that project. This got me looking more in to spatial mods with Solr4.0.  
 I found this enhancement in Jira. 
 https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David 
 mentions that he's already integrated JTS in to Solr4.0 for querying on 
 polygons stored as WKT. 
 
 It's relatively easy to get WKT strings in to Solr but does the Field type 
 exist yet? Is there a patch or something that I can test out? 
 
 Here's how I would do it using GDAL/OGR and the already existing csv update 
 handler. http://www.gdal.org/ogr/drv_csv.html
 
 ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT
 This converts a shapefile to a csv with the geometries in tact in the form 
 of WKT. You can then get the data in to Solr by running the following 
 command.
 curl 
 http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8
 There are lots of flavors of geometries so I suspect that this will be a 
 daunting task but because JTS recognizes each geometry type it should be 
 possible to work with them. 
 Does anyone know of a patch or even when this functionality might be 
 included in to Solr4.0? I need to query for polygons ;-)
 Thanks,
 Adam
 
 
 
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 


Re: Where is NGramFilter?

2011-02-09 Thread Koji Sekiguchi

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
There is only EdgeNGramFilterFactory listed (which I got working for
prefix indexing), but no NGramFilterFactory. Is that filter not
supported anymore, or is that list not up to date?


It should be there. Here is the javadoc for it:

https://hudson.apache.org/hudson/job/Solr-trunk/javadoc/org/apache/solr/analysis/NGramFilterFactory.html

Anyone who has an account can update the wiki. Contributions are welcome!
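
For example, a field type along these lines (attribute values are only
examples, untested) should give substring matching for the field you apply it
to, after reindexing:

<fieldType name="text_substring" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>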

Koji
--
http://www.rondhuit.com/en/


Re: [WKT] Spatial Searching

2011-02-09 Thread Adam Estrada
Grant,

How could I stub this out, not being a Java guy? What is needed in order to do 
this? 

Licensing is always going to be an issue with JTS which is why I am interested 
in the project SIS sitting in incubation right now. 

I'm willing to put forth the effort if I had a little direction on how to 
implement it from the peanut gallery ;-)

Adam

On Feb 9, 2011, at 7:03 AM, Grant Ingersoll wrote:

 The show stopper for JTS is it's license, unfortunately.  Otherwise, I think 
 it would be done already!  We could, since it's LGPL, make it an optional 
 dependency, assuming someone can stub it out.
 
 On Feb 8, 2011, at 11:18 PM, Adam Estrada wrote:
 
 I just came across a ~nudge post over in the SIS list on what the status is 
 for that project. This got me looking more in to spatial mods with Solr4.0.  
 I found this enhancement in Jira. 
 https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David 
 mentions that he's already integrated JTS in to Solr4.0 for querying on 
 polygons stored as WKT. 
 
 It's relatively easy to get WKT strings in to Solr but does the Field type 
 exist yet? Is there a patch or something that I can test out? 
 
 Here's how I would do it using GDAL/OGR and the already existing csv update 
 handler. http://www.gdal.org/ogr/drv_csv.html
 
 ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT
 This converts a shapefile to a csv with the geometries in tact in the form 
 of WKT. You can then get the data in to Solr by running the following 
 command.
 curl 
 http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8
 There are lots of flavors of geometries so I suspect that this will be a 
 daunting task but because JTS recognizes each geometry type it should be 
 possible to work with them. 
 Does anyone know of a patch or even when this functionality might be 
 included in to Solr4.0? I need to query for polygons ;-)
 Thanks,
 Adam
 
 
 
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 



Re: [WKT] Spatial Searching

2011-02-09 Thread Adam Estrada
Thought I would share this on web mapping...it's a great write up and something 
to consider when talking about working with spatial data.

http://www.tokumine.com/2010/09/20/gis-data-payload-sizes/

Adam


On Feb 9, 2011, at 7:03 AM, Grant Ingersoll wrote:

 The show stopper for JTS is it's license, unfortunately.  Otherwise, I think 
 it would be done already!  We could, since it's LGPL, make it an optional 
 dependency, assuming someone can stub it out.
 
 On Feb 8, 2011, at 11:18 PM, Adam Estrada wrote:
 
 I just came across a ~nudge post over in the SIS list on what the status is 
 for that project. This got me looking more in to spatial mods with Solr4.0.  
 I found this enhancement in Jira. 
 https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David 
 mentions that he's already integrated JTS in to Solr4.0 for querying on 
 polygons stored as WKT. 
 
 It's relatively easy to get WKT strings in to Solr but does the Field type 
 exist yet? Is there a patch or something that I can test out? 
 
 Here's how I would do it using GDAL/OGR and the already existing csv update 
 handler. http://www.gdal.org/ogr/drv_csv.html
 
 ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT
 This converts a shapefile to a csv with the geometries in tact in the form 
 of WKT. You can then get the data in to Solr by running the following 
 command.
 curl 
 http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8
 There are lots of flavors of geometries so I suspect that this will be a 
 daunting task but because JTS recognizes each geometry type it should be 
 possible to work with them. 
 Does anyone know of a patch or even when this functionality might be 
 included in to Solr4.0? I need to query for polygons ;-)
 Thanks,
 Adam
 
 
 
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 



Re: Maintain stopwords.txt and synonyms.txt

2011-02-09 Thread Stefan Matheis
Timo,

then use a cronjob on your Solr machine to fetch the generated
synonyms file, put it in the correct location and reload the
core configuration (which is required to pick up the updated synonyms file)? :)
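
Something along these lines on the Solr machine would do (host name, paths and
core name are only placeholders, and the RELOAD call assumes you run multiple
cores via the CoreAdmin handler):

# crontab entry: pull the edited files, then reload the core
*/15 * * * * scp editor-host:/export/synonyms.txt /opt/solr/core0/conf/ && \
  curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0"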

Regards
Stefan

On Wed, Feb 9, 2011 at 1:15 PM, Timo Schmidt timo.schm...@aoemedia.de wrote:
 Yes we have something, but on another machine.


 Timo Schmidt
 Entwickler (Diplom Informatiker FH)


 AOE media GmbH
 Borsigstr. 3
 65205 Wiesbaden
 Germany
 Tel. +49 (0) 6122 70 70 7 - 234
 Fax. +49 (0) 6122 70 70 7 -199



 e-Mail: timo.schm...@aoemedia.de
 Web: http://www.aoemedia.de/

 Pflichtangaben laut Handelsgesetz §37a / Aktiengesetz §35a
 USt-ID Nr.: DE250247455
 Handelsregister: Wiesbaden B
 Handelsregister Nr.: 22567
 Stammsitz: Wiesbaden
 Creditreform: 625.0209354
 Geschäftsführer: Kian Toyouri Gould


 -----Original Message-----
 From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
 Sent: Wednesday, 9 February 2011 11:34
 To: solr-user@lucene.apache.org
 Subject: Re: Maintain stopwords.txt and synonyms.txt

 Hi Timo,

 of course - that's right. Write some JSP (i guess) which could be
 integrated in the already existing jetty/tomcat Server?

 Just wondering about, how do you perform Search-Requests to Solr?
 Normally, there is already any other Service running, which acts as
 'proxy' to the outer world? ;)

 Regards
 Stefan

 On Wed, Feb 9, 2011 at 11:20 AM, Timo Schmidt timo.schm...@aoemedia.de 
 wrote:
 Hi Stefan,

 i allready thought about that. Maybe some php service or something like that.
 But this would mean, that I need additional software on that server like a 
 normal
 Apache installation, which needs to be maintained. That's why I thought a 
 solution that
 is build into solr would be nice.

 Thanks

 Timo Schmidt
 Entwickler (Diplom Informatiker FH)


 AOE media GmbH
 Borsigstr. 3
 65205 Wiesbaden
 Germany
 Tel. +49 (0) 6122 70 70 7 - 234
 Fax. +49 (0) 6122 70 70 7 -199



 e-Mail: timo.schm...@aoemedia.de
 Web: http://www.aoemedia.de/

 Pflichtangaben laut Handelsgesetz §37a / Aktiengesetz §35a
 USt-ID Nr.: DE250247455
 Handelsregister: Wiesbaden B
 Handelsregister Nr.: 22567
 Stammsitz: Wiesbaden
 Creditreform: 625.0209354
 Geschäftsführer: Kian Toyouri Gould


 -----Original Message-----
 From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
 Sent: Wednesday, 9 February 2011 11:14
 To: solr-user@lucene.apache.org
 Subject: Re: Maintain stopwords.txt and synonyms.txt

 Timo,

 On Wed, Feb 9, 2011 at 11:07 AM, Timo Schmidt timo.schm...@aoemedia.de 
 wrote:
 But currently I could not find any possibility to write the stopwords.txt 
 or synonyms.txt.

 what about writing the Files from an external Application and reload
 your Solr Core!?
 Seemed to be the simplest way to solve your problem, not?

 Regards
 Stefan




AW: IndexOutOfBoundsException

2011-02-09 Thread André Widhani
I think we had a similar exception recently when attempting to sort on a 
multi-valued field ... could that be possible in your case?
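
Another thing that might be worth trying is Lucene's CheckIndex tool against
that core's index directory, roughly like this (the jar name and path depend on
your installation; it is read-only unless you add -fix):

java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex /path/to/core/data/index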

André

-----Original Message-----
From: Dominik Lange [mailto:dominikla...@searchmetrics.com]
Sent: Wednesday, 9 February 2011 10:55
To: solr-user@lucene.apache.org
Subject: IndexOutOfBoundsException


hi,

we have a problem with our solr test instance.
This instance is running with 90 cores with about 2 GB of Index-Data per core.

This worked fine for a few weeks.

Now we get an exception querying data from one core : 
java.lang.IndexOutOfBoundsException: Index: 104, Size: 11
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288)
at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:277)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
at 
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:129)
at 
org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:160)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:211)
at 
org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:277)
at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:961)
at 
org.apache.lucene.index.DirectoryReader$MultiTermEnum.<init>(DirectoryReader.java:989)
at 
org.apache.lucene.index.DirectoryReader.terms(DirectoryReader.java:626)
at 
org.apache.solr.search.SolrIndexReader.terms(SolrIndexReader.java:302)
at 
org.apache.lucene.search.PrefixTermEnum.<init>(PrefixTermEnum.java:41)
at org.apache.lucene.search.PrefixQuery.getEnum(PrefixQuery.java:45)
at 
org.apache.lucene.search.MultiTermQuery$ConstantScoreAutoRewrite.rewrite(MultiTermQuery.java:227)
at 
org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:382)
at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:438)
at 
org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:311)
at org.apache.lucene.search.Query.weight(Query.java:98)
at org.apache.lucene.search.Searcher.createWeight(Searcher.java:230)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
...

All other cores are working fine with the same schema.
This problem only occurs when querying for specific data like 
q=fieldA:valueA%20AND%20fieldB:valueB

By using the following query data is returned
q=*:*

Has anybody any suggestions on what is causing this problem?
Are 90 cores too much for a single solr instance?

Thanks in advance,

Dominik


Re: high cpu usage

2011-02-09 Thread Erick Erickson
You can try attaching jConsole to the process to see what it shows. If
you're on a *nix box
you can get a gross idea what's going on with top.
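
If you just want a quick look without jConsole, something like this usually
narrows it down (assuming a Linux box and a JDK that ships jstack; the pid is
that of the Jetty/Solr JVM):

top -H -p <solr-pid>            # per-thread CPU usage
jstack <solr-pid> > dump1.txt   # take a few dumps ~10s apart and compare them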

Best
Erick

On Wed, Feb 9, 2011 at 4:31 AM, Erez Zarum e...@icinga.org.il wrote:

 Hello,
 We have been running read only solr instances for a few months now,
 yesterday i have noticed an high cpu usage coming from the JVM, it simply
 use 100% of the CPU for no reason.
 Nothing was changed, we are using Jetty as a Servlet container for solr.
 Where can i start looking what cause it? it has been using 100% CPU for
 almost 24 hours now.

 Thanks,
Erez.



AW: IndexOutOfBoundsException

2011-02-09 Thread Dominik Lange
No, we do not have multi-valued fields and we do not sort (in this case).
We reindexed the CSV file and the error disappeared, but it would be interesting 
to know why this error occurred...

Thank you for your suggestion.

Dominik


-----Original Message-----
From: André Widhani [mailto:andre.widh...@digicol.de]
Sent: Wed 09.02.2011 13:58
To: solr-user@lucene.apache.org
Subject: AW: IndexOutOfBoundsException
 
I think we had a similar exception recently when attempting to sort on a 
multi-valued field ... could that be possible in your case?

André

-----Original Message-----
From: Dominik Lange [mailto:dominikla...@searchmetrics.com]
Sent: Wednesday, 9 February 2011 10:55
To: solr-user@lucene.apache.org
Subject: IndexOutOfBoundsException


hi,

we have a problem with our solr test instance.
This instance is running with 90 cores with about 2 GB of Index-Data per core.

This worked fine for a few weeks.

Now we get an exception querying data from one core : 
java.lang.IndexOutOfBoundsException: Index: 104, Size: 11
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288)
at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:277)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
at 
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:129)
at 
org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:160)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:211)
at 
org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:277)
at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:961)
at 
org.apache.lucene.index.DirectoryReader$MultiTermEnum.<init>(DirectoryReader.java:989)
at 
org.apache.lucene.index.DirectoryReader.terms(DirectoryReader.java:626)
at 
org.apache.solr.search.SolrIndexReader.terms(SolrIndexReader.java:302)
at 
org.apache.lucene.search.PrefixTermEnum.<init>(PrefixTermEnum.java:41)
at org.apache.lucene.search.PrefixQuery.getEnum(PrefixQuery.java:45)
at 
org.apache.lucene.search.MultiTermQuery$ConstantScoreAutoRewrite.rewrite(MultiTermQuery.java:227)
at 
org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:382)
at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:438)
at 
org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:311)
at org.apache.lucene.search.Query.weight(Query.java:98)
at org.apache.lucene.search.Searcher.createWeight(Searcher.java:230)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
...

All other cores are working fine with the same schema.
This problem only occurs when querying for specific data like 
q=fieldA:valueA%20AND%20fieldB:valueB

By using the following query data is returned
q=*:*

Has anybody any suggestions on what is causing this problem?
Are 90 cores too much for a single solr instance?

Thanks in advance,

Dominik





Re: Where is NGramFilter?

2011-02-09 Thread Erick Erickson
In addition to Koji's note, see the bold comment at the top of that
page that says this is not a complete list; the definitive list is
always the javadocs...

Best
Erick

On Wed, Feb 9, 2011 at 3:34 AM, Kai Schlamp schl...@gmx.de wrote:

 Hi.

 On the Sunspot (a Ruby Solr client) Wiki
 (
 https://github.com/outoftime/sunspot/wiki/Matching-substrings-in-fulltext-search
 )
 it says that the NGramFilter should allow substring indexing. As I
 never got it working, I searched a bit and found this site:

 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
 There is only EdgeNGramFilterFactory listed (which I got working for
 prefix indexing), but no NGramFilterFactory. Is that filter not
 supported anymore, or is that list not up to date? Is there an
 alternative filter for getting substring searching working?

 Best regards,
 Kai



Re: TermVector query using Solr Tutorial

2011-02-09 Thread Ryan Chan
Hello,

On Tue, Feb 8, 2011 at 11:12 PM, Grant Ingersoll gsing...@apache.org wrote:

 It's a little hard to read due to the indentation, but AFAICT you have two 
 terms, usb and cabl.  USB appears at position 0 and cabl at position 1.  
 Those are the relative positions to each other.  Perhaps you can explain a 
 bit more what you are trying to do?

I am searching for the keyword 25 in the field

<field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm
dot pitch, 700:1 contrast</field>

I want to know the character position of the matched keyword in the
corresponding field.

usb or cabl is not what I want.
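
For reference: if character offsets are what you need, then as far as I know
the field has to be indexed with offsets in its term vectors, e.g. in
schema.xml:

<field name="features" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

and a handler with the TermVectorComponent enabled can then return them with
tv=true&tv.positions=true&tv.offsets=true.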


Re: Nutch and Solr search on the fly

2011-02-09 Thread .: Abhishek :.
Hi Markus,

 I am sorry for not being clear, I meant to say that...

Suppose a URL, namely www.somehost.com/gifts/greetingcard.html (which in
turn contains links to a.html, b.html, c.html, d.html), is injected into the
seed.txt. After the whole process I was expecting a bunch of other pages
crawled from this seed URL. However, at the end of it, all I see is the
contents from only this page, namely
www.somehost.com/gifts/greetingcard.html, and I do not see any other
pages (here a.html, b.html, c.html, d.html)
crawled from this one.

 The crawling happens only for the URLs mentioned in the seed.txt and does
not proceed further from there. So I am just a bit confused: why is it not
crawling the linked pages (a.html, b.html, c.html and d.html)? I get a
feeling that I am missing something that the author of the blog
(http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
everyone would know.

Thanks,
Abi


On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma markus.jel...@openindex.io wrote:

 The parsed data is only sent to the Solr index of you tell a segment to be
 indexed; solrindex crawldb linkdb segment

 If you did this only once after injecting  and then the consequent
 fetch,parse,update,index sequence then you, of course, only see those
 URL's.
 If you don't index a segment after it's being parsed, you need to do it
 later
 on.

 On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
  Hi all,
 
   I am a newbie to nutch and solr. Well relatively much newer to Solr than
  Nutch :)
 
   I have been using nutch for past two weeks, and I wanted to know if I
 can
  query or search on my nutch crawls on the fly(before it completes). I am
  asking this because the websites I am crawling are really huge and it
 takes
  around 3-4 days for a crawl to complete. I want to analyze some quick
  results while the nutch crawler is still crawling the URLs. Some one
  suggested me that Solr would make it possible.
 
   I followed the steps in
  http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. By
  this process, I see only the injected URLs are shown in the Solr search.
 I
  know I did something really foolish and the crawl never happened, I feel
 I
  am missing some information here. I think somewhere in the process there
  should be a crawling happening and I missed it out.
 
   Just wanted to see if some one could help me pointing this out and where
 I
  went wrong in the process. Forgive my foolishness and thanks for your
  patience.
 
  Cheers,
  Abi

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350



Re: Nutch and Solr search on the fly

2011-02-09 Thread Erick Erickson
WARNING: I don't do Nutch much, but could it be that your
crawl depth is 1? See:
http://wiki.apache.org/nutch/NutchTutorial and search for depth
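
With the all-in-one crawl command, the depth (number of generate/fetch/parse
rounds) is given on the command line, for example (values are only
illustrative):

bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
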
Best
Erick

On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com wrote:

 Hi Markus,

  I am sorry for not being clear, I meant to say that...

  Suppose if a url namely www.somehost.com/gifts/greetingcard.html(which in
 turn contain links to a.html, b.html, c.html, d.html) is injected into the
 seed.txt, after the whole process I was expecting a bunch of other pages
 which crawled from this seed url. However, at the end of it all I see is
 the
 contents from only this page namely
 www.somehost.com/gifts/greetingcard.html and I do not see any other
 pages(here a.html, b.html, c.html, d.html)
 crawled from this one.

  The crawling happens only for the URLs mentioned in the seed.txt and does
 not proceed further from there. So I am just bit confused. Why is it not
 crawling the linked pages(a.html, b.html, c.html and d.html). I get a
 feeling that I am missing something that the author of the blog(
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
 everyone would know.

 Thanks,
 Abi


 On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma markus.jel...@openindex.io
 wrote:

  The parsed data is only sent to the Solr index of you tell a segment to
 be
  indexed; solrindex crawldb linkdb segment
 
  If you did this only once after injecting  and then the consequent
  fetch,parse,update,index sequence then you, of course, only see those
  URL's.
  If you don't index a segment after it's being parsed, you need to do it
  later
  on.
 
  On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
   Hi all,
  
I am a newbie to nutch and solr. Well relatively much newer to Solr
 than
   Nutch :)
  
I have been using nutch for past two weeks, and I wanted to know if I
  can
   query or search on my nutch crawls on the fly(before it completes). I
 am
   asking this because the websites I am crawling are really huge and it
  takes
   around 3-4 days for a crawl to complete. I want to analyze some quick
   results while the nutch crawler is still crawling the URLs. Some one
   suggested me that Solr would make it possible.
  
I followed the steps in
   http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this.
 By
   this process, I see only the injected URLs are shown in the Solr
 search.
  I
   know I did something really foolish and the crawl never happened, I
 feel
  I
   am missing some information here. I think somewhere in the process
 there
   should be a crawling happening and I missed it out.
  
Just wanted to see if some one could help me pointing this out and
 where
  I
   went wrong in the process. Forgive my foolishness and thanks for your
   patience.
  
   Cheers,
   Abi
 
  --
  Markus Jelsma - CTO - Openindex
  http://www.linkedin.com/in/markus17
  050-8536620 / 06-50258350
 



Re: Nutch and Solr search on the fly

2011-02-09 Thread Markus Jelsma
Are you using the depth parameter with the crawl command or are you using the 
separate generate, fetch etc. commands?

What's $  nutch readdb crawldb -stats returning?

On Wednesday 09 February 2011 15:06:40 .: Abhishek :. wrote:
 Hi Markus,
 
  I am sorry for not being clear, I meant to say that...
 
  Suppose if a url namely www.somehost.com/gifts/greetingcard.html(which in
 turn contain links to a.html, b.html, c.html, d.html) is injected into the
 seed.txt, after the whole process I was expecting a bunch of other pages
 which crawled from this seed url. However, at the end of it all I see is
 the contents from only this page namely
 www.somehost.com/gifts/greetingcard.html and I do not see any other
 pages(here a.html, b.html, c.html, d.html)
 crawled from this one.
 
  The crawling happens only for the URLs mentioned in the seed.txt and does
 not proceed further from there. So I am just bit confused. Why is it not
 crawling the linked pages(a.html, b.html, c.html and d.html). I get a
 feeling that I am missing something that the author of the blog(
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
 everyone would know.
 
 Thanks,
 Abi
 
 On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma 
markus.jel...@openindex.io wrote:
  The parsed data is only sent to the Solr index of you tell a segment to
  be indexed; solrindex crawldb linkdb segment
  
  If you did this only once after injecting  and then the consequent
  fetch,parse,update,index sequence then you, of course, only see those
  URL's.
  If you don't index a segment after it's being parsed, you need to do it
  later
  on.
  
  On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
   Hi all,
   
I am a newbie to nutch and solr. Well relatively much newer to Solr
than
   
   Nutch :)
   
I have been using nutch for past two weeks, and I wanted to know if I
  
  can
  
   query or search on my nutch crawls on the fly(before it completes). I
   am asking this because the websites I am crawling are really huge and
   it
  
  takes
  
   around 3-4 days for a crawl to complete. I want to analyze some quick
   results while the nutch crawler is still crawling the URLs. Some one
   suggested me that Solr would make it possible.
   
I followed the steps in
   
   http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this.
   By this process, I see only the injected URLs are shown in the Solr
   search.
  
  I
  
   know I did something really foolish and the crawl never happened, I
   feel
  
  I
  
   am missing some information here. I think somewhere in the process
   there should be a crawling happening and I missed it out.
   
Just wanted to see if some one could help me pointing this out and
where
  
  I
  
   went wrong in the process. Forgive my foolishness and thanks for your
   patience.
   
   Cheers,
   Abi
  
  --
  Markus Jelsma - CTO - Openindex
  http://www.linkedin.com/in/markus17
  050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Does Distributed Search support {!boost }?

2011-02-09 Thread Yonik Seeley
On Tue, Feb 8, 2011 at 9:02 PM, Andy angelf...@yahoo.com wrote:
 Is it possible to do a query like {!boost b=log(popularity)}foo over sharded 
 indexes?

Yep, that should work fine.
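
For example (host names are placeholders, and the local-params part needs to be
URL-encoded when sent over HTTP):

http://host1:8983/solr/select?shards=host1:8983/solr,host2:8983/solr&q={!boost b=log(popularity)}foo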

-Yonik
http://lucidimagination.com


Re: Nutch and Solr search on the fly

2011-02-09 Thread .: Abhishek :.
Hi Erick,

 Thanks a bunch for the response

 Could be a chance.. but all I am wondering is where to specify the depth in
the whole process described at
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried
specifying it during the fetcher phase but it was just ignored :(

Thanks,
Abi

On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson erickerick...@gmail.com wrote:

 WARNING: I don't do Nutch much, but could it be that your
 crawl depth is 1? See:
 http://wiki.apache.org/nutch/NutchTutorial

  http://wiki.apache.org/nutch/NutchTutorial and search for depth
 Best
 Erick

 On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com wrote:

  Hi Markus,
 
   I am sorry for not being clear, I meant to say that...
 
   Suppose if a url namely 
  www.somehost.com/gifts/greetingcard.html (which in
  turn contain links to a.html, b.html, c.html, d.html) is injected into
 the
  seed.txt, after the whole process I was expecting a bunch of other pages
  which crawled from this seed url. However, at the end of it all I see is
  the
  contents from only this page namely
  www.somehost.com/gifts/greetingcard.html and I do not see any other
  pages(here a.html, b.html, c.html, d.html)
  crawled from this one.
 
   The crawling happens only for the URLs mentioned in the seed.txt and
 does
  not proceed further from there. So I am just bit confused. Why is it not
  crawling the linked pages(a.html, b.html, c.html and d.html). I get a
  feeling that I am missing something that the author of the blog(
  http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
  everyone would know.
 
  Thanks,
  Abi
 
 
  On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma 
 markus.jel...@openindex.io
  wrote:
 
   The parsed data is only sent to the Solr index of you tell a segment to
  be
   indexed; solrindex crawldb linkdb segment
  
   If you did this only once after injecting  and then the consequent
   fetch,parse,update,index sequence then you, of course, only see those
   URL's.
   If you don't index a segment after it's being parsed, you need to do it
   later
   on.
  
   On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
Hi all,
   
 I am a newbie to nutch and solr. Well relatively much newer to Solr
  than
Nutch :)
   
 I have been using nutch for past two weeks, and I wanted to know if
 I
   can
query or search on my nutch crawls on the fly(before it completes). I
  am
asking this because the websites I am crawling are really huge and it
   takes
around 3-4 days for a crawl to complete. I want to analyze some quick
results while the nutch crawler is still crawling the URLs. Some one
suggested me that Solr would make it possible.
   
 I followed the steps in
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for
 this.
  By
this process, I see only the injected URLs are shown in the Solr
  search.
   I
know I did something really foolish and the crawl never happened, I
  feel
   I
am missing some information here. I think somewhere in the process
  there
should be a crawling happening and I missed it out.
   
 Just wanted to see if some one could help me pointing this out and
  where
   I
went wrong in the process. Forgive my foolishness and thanks for your
patience.
   
Cheers,
Abi
  
   --
   Markus Jelsma - CTO - Openindex
   http://www.linkedin.com/in/markus17
   050-8536620 / 06-50258350
  
 



Solr 1.4.1 using more memory than Solr 1.3

2011-02-09 Thread Rachita Choudhary
Hi Solr Users,

We are in the process of upgrading from Solr 1.3 to Solr 1.4.1.
While performing stress test on Solr 1.4.1 to measure the performance
improvement in Query times (QTime) and no more blocked threads, we ran into
memory issues with Solr 1.4.1.

Test Setup details:
- 2 identical hosts running Solr 1.3 and Solr 1.4.1 individually.
- 3 cores with index sizes : 10 GB, 2 GB, 1 GB.
- JVM Max RAM : 3GB ( Xmx3072m) , Total RAM : 4GB
- No other application/service running on the servers.
- For querying solr servers, we are using wget queries from a standalone
host.

For the same index data and same set of queries, Solr 1.3 is hovering
between 1.5 to 2.2 GB, whereas with about 20K requests Solr 1.4.1 is
reaching its 3 GB limit and performing FULL GC after almost every query. The
Full GC is also not freeing up any memory.

Has anyone also faced similar issues with Solr 1.4.1 ?

Also why is Solr 1.4.1 using more memory for the same amount of processing
compared to Solr 1.3 ?

Is there any particular configuration that needs to be done to avoid this
high memory usage ?

Thanks,
Rachita


Re: Solr 1.4.1 using more memory than Solr 1.3

2011-02-09 Thread Markus Jelsma
Searching and sorting is now done on a per-segment basis, meaning that
the FieldCache entries used for sorting and for function queries are
created and used per-segment and can be reused for segments that don't
change between index updates.  While generally beneficial, this can lead
to increased memory usage over 1.3 in certain scenarios: 
  1) A single valued field that was used for both sorting and faceting
in 1.3 would have used the same top level FieldCache entry.  In 1.4, 
sorting will use entries at the segment level while faceting will still
use entries at the top reader level, leading to increased memory usage.
  2) Certain function queries such as ord() and rord() require a top level
FieldCache instance and can thus lead to increased memory usage.  Consider
replacing ord() and rord() with alternatives, such as function queries
based on ms() for date boosting.


http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/CHANGES.txt
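
For example, a 1.3-style date boost such as rord(created_dt) could be replaced
with something along the lines of (the field name is only an example; ms()
needs a trie-based date field):

recip(ms(NOW,created_tdt),3.16e-11,1,1)

where 3.16e-11 is roughly 1 / (one year in milliseconds).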



On Wednesday 09 February 2011 16:07:01 Rachita Choudhary wrote:
 Hi Solr Users,
 
 We are in the process of upgrading from Solr 1.3 to Solr 1.4.1.
 While performing stress test on Solr 1.4.1 to measure the performance
 improvement in Query times (QTime) and no more blocked threads, we ran into
 memory issues with Solr 1.4.1.
 
 Test Setup details:
 - 2 identical hosts running Solr 1.3 and Solr 1.4.1 individually.
 - 3 cores with index sizes : 10 GB, 2 GB, 1 GB.
 - JVM Max RAM : 3GB ( Xmx3072m) , Total RAM : 4GB
 - No other application/service running on the servers.
 - For querying solr servers, we are using wget queries from a standalone
 host.
 
 For the same index data and same set of queries, Solr 1.3 is hovering
 between 1.5 to 2.2 GB, whereas with about 20K requests Solr 1.4.1 is
 reaching its 3 GB limit and performing FULL GC after almost every query.
 The Full GC is also not freeing up any memory.
 
 Has anyone also faced similar issues with Solr 1.4.1 ?
 
 Also why is Solr 1.4.1 using more memory for the same amount of processing
 compared to Solr 1.3 ?
 
 Is there any particular configuration that needs to be done to avoid this
 high memory usage ?
 
 Thanks,
 Rachita

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


RE: Concurrent updates/commits

2011-02-09 Thread Jonathan Rochkind
Solr does handle concurrency fine. But there is NOT transaction isolation 
like you'll get from an rdbms. All 'pending' changes are (conceptually, anyway) 
held in a single queue, and any commit will commit ALL of them. There isn't 
going to be any data corruption issues or anything from concurrent adds (unless 
there's a bug in Solr, there isn't supposed to be) -- but there is no kind of 
transactions or isolation between different concurrent adders. So, sure, 
everyone can add concurrently -- but any time any of those actors issues a 
commit, all pending adds are committed. 

In addition, there are problems with Solr's basic architecture and _too 
frequent_ commits (whether made by different processes or not, doesn't 
matter). When a new commit happens, Solr fires up a new index searcher and 
warms it up on the new version of the index. Until the new index searcher is 
fully warmed, the old index searcher is still serving queries.  Which can also 
mean that there are, for this period, TWO versions of all your caches in RAM 
and such. So let's say it takes 5 minutes for the new index to be fully warmed. 
 But if you have commits happening every 1 minute -- then you'll end up with 
FIVE 'new indexes' being warmed -- meaning potentially 5 times the RAM usage 
(quickly running into a JVM out of memory error), lots of CPU activity going on 
warming indexes that will never actually be used (because even though they 
aren't even done being warmed and ready to use, they've already been superseded 
by a later commit).   

I don't know of any good way to deal with this except less frequent commits. 
One way to get less frequent commits is to use Solr replication, and 'stage' 
all your commits in a 'master' index, but only replicate to 'slave' at a 
frequency slow enough so the new index is fully warmed before the next commit 
happens. 
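
As a partial guard rail, solrconfig.xml also has a maxWarmingSearchers setting that 
caps how many searchers may be warming at once; commits that would exceed it fail 
instead of stacking up yet another warming searcher. A minimal sketch (the value is 
only an example):

  <maxWarmingSearchers>2</maxWarmingSearchers>

That doesn't remove the need for a sane commit frequency, but it turns the runaway-RAM 
scenario above into an explicit error you can see and handle.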

Some new features in trunk (both lucene and solr) for 'near real time'  search 
ameliorate this problem somewhat, depending on the nature of your commits. 

Jonathan

From: Savvas-Andreas Moysidis [savvas.andreas.moysi...@googlemail.com]
Sent: Wednesday, February 09, 2011 10:34 AM
To: solr-user@lucene.apache.org
Subject: Concurrent updates/commits

Hello,

This topic has probably been covered before here, but we're still not very
clear about how multiple commits work in Solr.
We currently have a requirement to make our domain objects searchable
immediately after they get updated in the database by some user action. This
could potentially cause multiple updates/commits to be fired to Solr and we
are trying to investigate how Solr handles those multiple requests.

This thread:
http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search

suggests that Solr will handle all of the lower level details and that Before
a *COMMIT* is done , lock is obtained and its released  after the
operation
which in my understanding means that Solr will serialise all update/commit
requests?

However, the Solr book, in the Commit, Optimise, Rollback section reads:
if more than one Solr client were to submit modifications and commit them
at similar times, it is possible for part of one client's set of changes to
be committed before that client told Solr to commit
which suggests that requests are *not* serialised.

Our questions are:
- Does Solr handle concurrent requests or do we need to add synchronisation
logic around our code?
- If Solr *does* handle concurrent requests, does it serialise each request
or has some other strategy for processing those?


Thanks,
- Savvas


RE: Concurrent updates/commits

2011-02-09 Thread Pierre GOSSE
 However, the Solr book, in the Commit, Optimise, Rollback section reads:
 if more than one Solr client were to submit modifications and commit them 
 at similar times, it is possible for part of one client's set of changes to 
 be committed before that client told Solr to commit
 which suggests that requests are *not* serialised.

I read this as: if two clients submit modifications and commits every couple of 
minutes, it could happen that modifications of client1 get committed by 
client2's commit before client1 asks for a commit.

As far as I understand Solr commits, they are serialized by design. And 
committing too often could lead you into trouble if you have many warm-up queries 
(?).

Hope this helps,

Pierre
-Message d'origine-
De : Savvas-Andreas Moysidis [mailto:savvas.andreas.moysi...@googlemail.com] 
Envoyé : mercredi 9 février 2011 16:34
À : solr-user@lucene.apache.org
Objet : Concurrent updates/commits

Hello,

This topic has probably been covered before here, but we're still not very
clear about how multiple commits work in Solr.
We currently have a requirement to make our domain objects searchable
immediately after the get updated in the database by some user action. This
could potentially cause multiple updates/commits to be fired to Solr and we
are trying to investigate how Solr handles those multiple requests.

This thread:
http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search

suggests that Solr will handle all of the lower level details and that Before
a *COMMIT* is done , lock is obtained and its released  after the
operation
which in my understanding means that Solr will serialise all update/commit
requests?

However, the Solr book, in the Commit, Optimise, Rollback section reads:
if more than one Solr client were to submit modifications and commit them
at similar times, it is possible for part of one client's set of changes to
be committed before that client told Solr to commit
which suggests that requests are *not* serialised.

Our questions are:
- Does Solr handle concurrent requests or do we need to add synchronisation
logic around our code?
- If Solr *does* handle concurrent requests, does it serialise each request
or has some other strategy for processing those?


Thanks,
- Savvas


Re: Concurrent updates/commits

2011-02-09 Thread Savvas-Andreas Moysidis
Hello,

Thanks very much for your quick replies.

So, according to Pierre, all updates will be immediately posted to Solr, but
all commits will be serialised. But doesn't that contradict Jonathan's
example where you can end up with FIVE 'new indexes' being warmed? If
commits are serialised, then there can only ever be one Index Searcher being
auto-warmed at a time or have I got this wrong?

The reason we are investigating commit serialisation, is because we want to
know whether the commit requests will be blocked until the previous ones
finish.

Cheers,
- Savvas

On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote:

  However, the Solr book, in the Commit, Optimise, Rollback section
 reads:
  if more than one Solr client were to submit modifications and commit
 them
  at similar times, it is possible for part of one client's set of changes
 to
  be committed before that client told Solr to commit
  which suggests that requests are *not* serialised.

 I read this as If two client submit modifications and commits every couple
 of minutes, it could happen that modifications of client1 got committed by
 client2's commit before client1 asks for a commit.

 As far as I understand Solr commit, they are serialized by design. And
 committing too often could lead you to trouble if you have many warm-up
 queries (?).

 Hope this helps,

 Pierre
 -Message d'origine-
 De : Savvas-Andreas Moysidis [mailto:
 savvas.andreas.moysi...@googlemail.com]
 Envoyé : mercredi 9 février 2011 16:34
 À : solr-user@lucene.apache.org
 Objet : Concurrent updates/commits

 Hello,

 This topic has probably been covered before here, but we're still not very
 clear about how multiple commits work in Solr.
 We currently have a requirement to make our domain objects searchable
 immediately after the get updated in the database by some user action. This
 could potentially cause multiple updates/commits to be fired to Solr and we
 are trying to investigate how Solr handles those multiple requests.

 This thread:

 http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search

 suggests that Solr will handle all of the lower level details and that
 Before
 a *COMMIT* is done , lock is obtained and its released  after the
 operation
 which in my understanding means that Solr will serialise all update/commit
 requests?

 However, the Solr book, in the Commit, Optimise, Rollback section reads:
 if more than one Solr client were to submit modifications and commit them
 at similar times, it is possible for part of one client's set of changes to
 be committed before that client told Solr to commit
 which suggests that requests are *not* serialised.

 Our questions are:
 - Does Solr handle concurrent requests or do we need to add synchronisation
 logic around our code?
 - If Solr *does* handle concurrent requests, does it serialise each request
 or has some other strategy for processing those?


 Thanks,
 - Savvas



Re: Concurrent updates/commits

2011-02-09 Thread Walter Underwood
Don't think commit, that is confusing. Solr is not a database. In particular, 
it does not have the isolation property from ACID.

Solr indexes new documents as a batch, then installs a new version of the 
entire index. Installing a new index isn't instant, especially with warming 
queries. Solr creates the index, then warms it, then makes it available for 
regular queries.

If you are creating indexes frequently, don't bother warming.
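
In solrconfig.xml terms that usually means setting autowarmCount to 0 on the caches, 
along these lines (sizes are only illustrative):

  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

and keeping the newSearcher warming-query list empty.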

wunder
==
Walter Underwood
Lead Engineer, MarkLogic

On Feb 9, 2011, at 8:03 AM, Savvas-Andreas Moysidis wrote:

 Hello,
 
 Thanks very much for your quick replies.
 
 So, according to Pierre, all updates will be immediately posted to Solr, but
 all commits will be serialised. But doesn't that contradict Jonathan's
 example where you can end up with FIVE 'new indexes' being warmed? If
 commits are serialised, then there can only ever be one Index Searcher being
 auto-warmed at a time or have I got this wrong?
 
 The reason we are investigating commit serialisation, is because we want to
 know whether the commit requests will be blocked until the previous ones
 finish.
 
 Cheers,
 - Savvas
 
 On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote:
 
 However, the Solr book, in the Commit, Optimise, Rollback section
 reads:
 if more than one Solr client were to submit modifications and commit
 them
 at similar times, it is possible for part of one client's set of changes
 to
 be committed before that client told Solr to commit
 which suggests that requests are *not* serialised.
 
 I read this as If two client submit modifications and commits every couple
 of minutes, it could happen that modifications of client1 got committed by
 client2's commit before client1 asks for a commit.
 
 As far as I understand Solr commit, they are serialized by design. And
 committing too often could lead you to trouble if you have many warm-up
 queries (?).
 
 Hope this helps,
 
 Pierre
 -Message d'origine-
 De : Savvas-Andreas Moysidis [mailto:
 savvas.andreas.moysi...@googlemail.com]
 Envoyé : mercredi 9 février 2011 16:34
 À : solr-user@lucene.apache.org
 Objet : Concurrent updates/commits
 
 Hello,
 
 This topic has probably been covered before here, but we're still not very
 clear about how multiple commits work in Solr.
 We currently have a requirement to make our domain objects searchable
 immediately after the get updated in the database by some user action. This
 could potentially cause multiple updates/commits to be fired to Solr and we
 are trying to investigate how Solr handles those multiple requests.
 
 This thread:
 
 http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search
 
 suggests that Solr will handle all of the lower level details and that
 Before
 a *COMMIT* is done , lock is obtained and its released  after the
 operation
 which in my understanding means that Solr will serialise all update/commit
 requests?
 
 However, the Solr book, in the Commit, Optimise, Rollback section reads:
 if more than one Solr client were to submit modifications and commit them
 at similar times, it is possible for part of one client's set of changes to
 be committed before that client told Solr to commit
 which suggests that requests are *not* serialised.
 
 Our questions are:
 - Does Solr handle concurrent requests or do we need to add synchronisation
 logic around our code?
 - If Solr *does* handle concurrent requests, does it serialise each request
 or has some other strategy for processing those?
 
 
 Thanks,
 - Savvas
 






Re: Concurrent updates/commits

2011-02-09 Thread Em

Hi Savvas,

well, although it sounds strange: If a commit happens, a new Index Searcher
is warming. If a new commit happens while a 'new' Index Searcher is warming,
another Index Searcher is warming. So, at this point of time, you got 3
Index Searchers: The old one, the 'new' one and the newest one.

I don't know whether the old one will be replaced by the new one until the
newest one has finished warming, but it seems to be a good guess, since you
can search while the new index is still committing.

You should know that Lucene is built on a segment-architecture. This means
every time you do a commit you write a completely new index-segment. 

Example:
You got one segment in your index and a searcher for it, now you are
committing.
After the commit finished you got two segments for it and one searcher for
both segments.
Internally your indexSearcher consists of at least two segmentReaders.

If you are committing three times at the same moment, you will warm 3 new
SolrIndexSearchers that contain 3, 4 and 5 segmentReaders. Your old
SolrIndexSearcher contains 2 segmentReaders and is valid until the newer
SolrIndexSearcher based on 3 segmentReaders is warmed.

Regards,
Em
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Concurrent-updates-commits-tp2459222p2459522.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Concurrent updates/commits

2011-02-09 Thread Pierre GOSSE
Well, Jonathan explanations are much more accurate than mine. :)

I took the word serialization as meaning a kind of isolation between commits, 
which is not very smart. Sorry to have introduced more confusion in this.

Pierre

-Message d'origine-
De : Savvas-Andreas Moysidis [mailto:savvas.andreas.moysi...@googlemail.com] 
Envoyé : mercredi 9 février 2011 17:04
À : solr-user@lucene.apache.org
Objet : Re: Concurrent updates/commits

Hello,

Thanks very much for your quick replies.

So, according to Pierre, all updates will be immediately posted to Solr, but
all commits will be serialised. But doesn't that contradict Jonathan's
example where you can end up with FIVE 'new indexes' being warmed? If
commits are serialised, then there can only ever be one Index Searcher being
auto-warmed at a time or have I got this wrong?

The reason we are investigating commit serialisation, is because we want to
know whether the commit requests will be blocked until the previous ones
finish.

Cheers,
- Savvas

On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote:

  However, the Solr book, in the Commit, Optimise, Rollback section
 reads:
  if more than one Solr client were to submit modifications and commit
 them
  at similar times, it is possible for part of one client's set of changes
 to
  be committed before that client told Solr to commit
  which suggests that requests are *not* serialised.

 I read this as If two client submit modifications and commits every couple
 of minutes, it could happen that modifications of client1 got committed by
 client2's commit before client1 asks for a commit.

 As far as I understand Solr commit, they are serialized by design. And
 committing too often could lead you to trouble if you have many warm-up
 queries (?).

 Hope this helps,

 Pierre
 -Message d'origine-
 De : Savvas-Andreas Moysidis [mailto:
 savvas.andreas.moysi...@googlemail.com]
 Envoyé : mercredi 9 février 2011 16:34
 À : solr-user@lucene.apache.org
 Objet : Concurrent updates/commits

 Hello,

 This topic has probably been covered before here, but we're still not very
 clear about how multiple commits work in Solr.
 We currently have a requirement to make our domain objects searchable
 immediately after the get updated in the database by some user action. This
 could potentially cause multiple updates/commits to be fired to Solr and we
 are trying to investigate how Solr handles those multiple requests.

 This thread:

 http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search

 suggests that Solr will handle all of the lower level details and that
 Before
 a *COMMIT* is done , lock is obtained and its released  after the
 operation
 which in my understanding means that Solr will serialise all update/commit
 requests?

 However, the Solr book, in the Commit, Optimise, Rollback section reads:
 if more than one Solr client were to submit modifications and commit them
 at similar times, it is possible for part of one client's set of changes to
 be committed before that client told Solr to commit
 which suggests that requests are *not* serialised.

 Our questions are:
 - Does Solr handle concurrent requests or do we need to add synchronisation
 logic around our code?
 - If Solr *does* handle concurrent requests, does it serialise each request
 or has some other strategy for processing those?


 Thanks,
 - Savvas



Re: Concurrent updates/commits

2011-02-09 Thread Savvas-Andreas Moysidis
Yes, we'll probably go down that path as our index files are relatively
small, so auto warming might not be extremely useful in our case.
Yep, we do realise the difference between a db and a Solr commit. :)

Thanks.

On 9 February 2011 16:15, Walter Underwood wun...@wunderwood.org wrote:

 Don't think commit, that is confusing. Solr is not a database. In
 particular, it does not have the isolation property from ACID.

 Solr indexes new documents as a batch, then installs a new version of the
 entire index. Installing a new index isn't instant, especially with warming
 queries. Solr creates the index, then warms it, then makes it available for
 regular queries.

 If you are creating indexes frequently, don't bother warming.

 wunder
 ==
 Walter Underwood
 Lead Engineer, MarkLogic

 On Feb 9, 2011, at 8:03 AM, Savvas-Andreas Moysidis wrote:

  Hello,
 
  Thanks very much for your quick replies.
 
  So, according to Pierre, all updates will be immediately posted to Solr,
 but
  all commits will be serialised. But doesn't that contradict Jonathan's
  example where you can end up with FIVE 'new indexes' being warmed? If
  commits are serialised, then there can only ever be one Index Searcher
 being
  auto-warmed at a time or have I got this wrong?
 
  The reason we are investigating commit serialisation, is because we want
 to
  know whether the commit requests will be blocked until the previous ones
  finish.
 
  Cheers,
  - Savvas
 
  On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote:
 
  However, the Solr book, in the Commit, Optimise, Rollback section
  reads:
  if more than one Solr client were to submit modifications and commit
  them
  at similar times, it is possible for part of one client's set of
 changes
  to
  be committed before that client told Solr to commit
  which suggests that requests are *not* serialised.
 
  I read this as If two client submit modifications and commits every
 couple
  of minutes, it could happen that modifications of client1 got committed
 by
  client2's commit before client1 asks for a commit.
 
  As far as I understand Solr commit, they are serialized by design. And
  committing too often could lead you to trouble if you have many warm-up
  queries (?).
 
  Hope this helps,
 
  Pierre
  -Message d'origine-
  De : Savvas-Andreas Moysidis [mailto:
  savvas.andreas.moysi...@googlemail.com]
  Envoyé : mercredi 9 février 2011 16:34
  À : solr-user@lucene.apache.org
  Objet : Concurrent updates/commits
 
  Hello,
 
  This topic has probably been covered before here, but we're still not
 very
  clear about how multiple commits work in Solr.
  We currently have a requirement to make our domain objects searchable
  immediately after the get updated in the database by some user action.
 This
  could potentially cause multiple updates/commits to be fired to Solr and
 we
  are trying to investigate how Solr handles those multiple requests.
 
  This thread:
 
 
 http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search
 
  suggests that Solr will handle all of the lower level details and that
  Before
  a *COMMIT* is done , lock is obtained and its released  after the
  operation
  which in my understanding means that Solr will serialise all
 update/commit
  requests?
 
  However, the Solr book, in the Commit, Optimise, Rollback section
 reads:
  if more than one Solr client were to submit modifications and commit
 them
  at similar times, it is possible for part of one client's set of changes
 to
  be committed before that client told Solr to commit
  which suggests that requests are *not* serialised.
 
  Our questions are:
  - Does Solr handle concurrent requests or do we need to add
 synchronisation
  logic around our code?
  - If Solr *does* handle concurrent requests, does it serialise each
 request
  or has some other strategy for processing those?
 
 
  Thanks,
  - Savvas
 







Re: Concurrent updates/commits

2011-02-09 Thread Savvas-Andreas Moysidis
Thanks very much Em.

- Savvas

On 9 February 2011 16:22, Savvas-Andreas Moysidis 
savvas.andreas.moysi...@googlemail.com wrote:

 Yes, we'll  probably go towards that path as our index files are relatively
 small, so auto warming might not be extremely useful in our case..
 Yep, we do realise the difference between a db and a Solr commit. :)

 Thanks.


 On 9 February 2011 16:15, Walter Underwood wun...@wunderwood.org wrote:

 Don't think commit, that is confusing. Solr is not a database. In
 particular, it does not have the isolation property from ACID.

 Solr indexes new documents as a batch, then installs a new version of the
 entire index. Installing a new index isn't instant, especially with warming
 queries. Solr creates the index, then warms it, then makes it available for
 regular queries.

 If you are creating indexes frequently, don't bother warming.

 wunder
 ==
 Walter Underwood
 Lead Engineer, MarkLogic

 On Feb 9, 2011, at 8:03 AM, Savvas-Andreas Moysidis wrote:

  Hello,
 
  Thanks very much for your quick replies.
 
  So, according to Pierre, all updates will be immediately posted to Solr,
 but
  all commits will be serialised. But doesn't that contradict Jonathan's
  example where you can end up with FIVE 'new indexes' being warmed? If
  commits are serialised, then there can only ever be one Index Searcher
 being
  auto-warmed at a time or have I got this wrong?
 
  The reason we are investigating commit serialisation, is because we want
 to
  know whether the commit requests will be blocked until the previous ones
  finish.
 
  Cheers,
  - Savvas
 
  On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote:
 
  However, the Solr book, in the Commit, Optimise, Rollback section
  reads:
  if more than one Solr client were to submit modifications and commit
  them
  at similar times, it is possible for part of one client's set of
 changes
  to
  be committed before that client told Solr to commit
  which suggests that requests are *not* serialised.
 
  I read this as If two client submit modifications and commits every
 couple
  of minutes, it could happen that modifications of client1 got committed
 by
  client2's commit before client1 asks for a commit.
 
  As far as I understand Solr commit, they are serialized by design. And
  committing too often could lead you to trouble if you have many warm-up
  queries (?).
 
  Hope this helps,
 
  Pierre
  -Message d'origine-
  De : Savvas-Andreas Moysidis [mailto:
  savvas.andreas.moysi...@googlemail.com]
  Envoyé : mercredi 9 février 2011 16:34
  À : solr-user@lucene.apache.org
  Objet : Concurrent updates/commits
 
  Hello,
 
  This topic has probably been covered before here, but we're still not
 very
  clear about how multiple commits work in Solr.
  We currently have a requirement to make our domain objects searchable
  immediately after the get updated in the database by some user action.
 This
  could potentially cause multiple updates/commits to be fired to Solr
 and we
  are trying to investigate how Solr handles those multiple requests.
 
  This thread:
 
 
 http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search
 
  suggests that Solr will handle all of the lower level details and that
  Before
  a *COMMIT* is done , lock is obtained and its released  after the
  operation
  which in my understanding means that Solr will serialise all
 update/commit
  requests?
 
  However, the Solr book, in the Commit, Optimise, Rollback section
 reads:
  if more than one Solr client were to submit modifications and commit
 them
  at similar times, it is possible for part of one client's set of
 changes to
  be committed before that client told Solr to commit
  which suggests that requests are *not* serialised.
 
  Our questions are:
  - Does Solr handle concurrent requests or do we need to add
 synchronisation
  logic around our code?
  - If Solr *does* handle concurrent requests, does it serialise each
 request
  or has some other strategy for processing those?
 
 
  Thanks,
  - Savvas
 








Query regarding search term count in Solr

2011-02-09 Thread Rahul Warawdekar
Hi All,

This is Rahul and I am using Solr for one of my upcoming projects.
I had a query regarding search term count using Solr.
We have a requirement in one of our search based projects to search the
results based on search term counts per document.

For eg,
if a user searches for something like solr[4:9], this query should return
only documents in which solr appears between 4 and 9 times (inclusively).
 if a user searches for something like solr lucene[4:9], this query should
return only documents in which the phrase solr lucene appears between 4
and 9 times (inclusively).

Is there any way from Solr to return results based on the search term and
phrase counts ?
If  not, can it be customized by extending existing Solr/Lucene libraries ?


-- 
Thanks and Regards
Rahul A. Warawdekar


Re: Nutch and Solr search on the fly

2011-02-09 Thread charan kumar
Hi Abishek,

depth is a param of the crawl command, not the fetch command.

If you are using a custom script calling the individual stages of the nutch crawl,
then depth N means you run that script N times. You can put a
loop in the script.
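
For the all-in-one command the depth goes on the command line, roughly like this
(directory names and numbers are just placeholders):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50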

Thanks,
Charan

On Wed, Feb 9, 2011 at 6:26 AM, .: Abhishek :. ab1s...@gmail.com wrote:

 Hi Erick,

  Thanks a bunch for the response

  Could be a chance..but all I am wondering is where to specify the depth in
 the whole entire process in the URL
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried
 specifying it during the fetcher phase but it was just ignored :(

 Thanks,
 Abi

 On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  WARNING: I don't do Nutch much, but could it be that your
  crawl depth is 1? See:
  http://wiki.apache.org/nutch/NutchTutorial
 
  http://wiki.apache.org/nutch/NutchTutorialand search for depth
  Best
  Erick
 
  On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com
 wrote:
 
   Hi Markus,
  
I am sorry for not being clear, I meant to say that...
  
Suppose if a url namely
 www.somehost.com/gifts/greetingcard.html (which in
   turn contain links to a.html, b.html, c.html, d.html) is injected into
  the
   seed.txt, after the whole process I was expecting a bunch of other
 pages
   which crawled from this seed url. However, at the end of it all I see
 is
   the
   contents from only this page namely
   www.somehost.com/gifts/greetingcard.htmland I do not see any other
   pages(here a.html, b.html, c.html, d.html)
   crawled from this one.
  
The crawling happens only for the URLs mentioned in the seed.txt and
  does
   not proceed further from there. So I am just bit confused. Why is it
 not
   crawling the linked pages(a.html, b.html, c.html and d.html). I get a
   feeling that I am missing something that the author of the blog(
   http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
   everyone would know.
  
   Thanks,
   Abi
  
  
   On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma 
  markus.jel...@openindex.io
   wrote:
  
 The parsed data is only sent to the Solr index if you tell a segment
 to
   be
indexed; solrindex crawldb linkdb segment
   
If you did this only once after injecting  and then the consequent
fetch,parse,update,index sequence then you, of course, only see those
URL's.
If you don't index a segment after it's being parsed, you need to do
 it
later
on.
   
On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
 Hi all,

  I am a newbie to nutch and solr. Well relatively much newer to
 Solr
   than
 Nutch :)

  I have been using nutch for past two weeks, and I wanted to know
 if
  I
can
 query or search on my nutch crawls on the fly(before it completes).
 I
   am
 asking this because the websites I am crawling are really huge and
 it
takes
 around 3-4 days for a crawl to complete. I want to analyze some
 quick
 results while the nutch crawler is still crawling the URLs. Some
 one
 suggested me that Solr would make it possible.

  I followed the steps in
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for
  this.
   By
 this process, I see only the injected URLs are shown in the Solr
   search.
I
 know I did something really foolish and the crawl never happened, I
   feel
I
 am missing some information here. I think somewhere in the process
   there
 should be a crawling happening and I missed it out.

  Just wanted to see if some one could help me pointing this out and
   where
I
 went wrong in the process. Forgive my foolishness and thanks for
 your
 patience.

 Cheers,
 Abi
   
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
   
  
 



QueryWeight for Solr

2011-02-09 Thread Em

Hello folks,

I got a question regarding our own QueryWeight implementation for a special
use case.

For the current usecase we want to experiment with different values for the
idf based on different algorithms and how they affect the scoring.

Is there a way to plug in our own weight implementation without rewriting the
full query-class?

Let's say we extend the DismaxQParser to create an extended boolean Query
(let's call it EBooleanQuery, E for extended) and we implement a QueryWeight
for this Query-class that takes some values into account that are not part
of the current approach.

Is this the way we have to go? Or what would you suggest?

Regards
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/QueryWeight-for-Solr-tp2459933p2459933.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: QueryWeight for Solr

2011-02-09 Thread Yonik Seeley
On Wed, Feb 9, 2011 at 12:16 PM, Em mailformailingli...@yahoo.de wrote:
 For the current usecase we want to experiment with different values for the
 idf based on different algorithms and how they affect the scoring.

For tf, idf, lengthNorm, coord, etc, see Similarity.
Solr already allows you to specify one in the schema, and work is
underway to make it per-field:
https://issues.apache.org/jira/browse/SOLR-2338
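
As a rough sketch of that route, you can subclass DefaultSimilarity, override idf(),
and point schema.xml at it. The package/class name below is made up and the formula
is only an example of flattening idf:

  package com.example.search;

  import org.apache.lucene.search.DefaultSimilarity;

  /** Experimental Similarity that damps the influence of idf. */
  public class FlatIdfSimilarity extends DefaultSimilarity {
    @Override
    public float idf(int docFreq, int numDocs) {
      // Take the square root of the standard idf so rare terms dominate less.
      return (float) Math.sqrt(super.idf(docFreq, numDocs));
    }
  }

and in schema.xml:

  <similarity class="com.example.search.FlatIdfSimilarity"/>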

-Yonik
http://lucidimagination.com


Re: Query regarding search term count in Solr

2011-02-09 Thread Erick Erickson
I suspect it's worthwhile to back up and ask whether this is a reasonable
requirement. What is the use-case? Because unless the input is very
uniform, I wouldn't be surprised if this will produce poor results. For
instance,
if solr appears once in a field 5 words long and 5 times in another
document
where the same field is 1,000,000 words long, which is preferable?

This requirement can make sense if the fields being searched are uniform
in length, but even then I'm not sure it would be good for the user

That said, you know your problem domain best but before going through the
effort of making this all work I'd step back and ask this question.

There is no way that I know of to do this out of the box with Solr. I can
imagine you could set up a custom scorer that accessed the underlying
TermDocs (see TermDocs in the Lucene API), but you'd also have to provide
your own query parser...

I'll reiterate, though, that it might be best to see if there are already
ways in Solr to get close enough behavior to satisfy the underlying
requirement rather than go down this route.

Best
Erick

On Wed, Feb 9, 2011 at 11:55 AM, Rahul Warawdekar 
rahul.warawde...@gmail.com wrote:

 Hi All,

 This is Rahul and am using Solr for one of my upcoming projects.
 I had a query regarding search term count using Solr.
 We have a requirement in one of our search based projects to search the
 results based on search term counts per document.

 For eg,
 if a user searches for something like solr[4:9], this query should return
 only documents in which solr appears between 4 and 9 times (inclusively).
  if a user searches for something like solr lucene[4:9], this query
 should
 return only documents in which the phrase solr lucene appears between 4
 and 9 times (inclusively).

 Is there any way from Solr to return results based on the search term and
 phrase counts ?
 If  not, can it be customized by extending existing Solr/Lucene libraries ?


 --
 Thanks and Regards
 Rahul A. Warawdekar



Re: Solr Out of Memory Error

2011-02-09 Thread Bing Li
Dear Adam,

I also got the OutOfMemory exception. I changed the JAVA_OPTS in catalina.sh
as follows.

   ...
   if [ -z $LOGGING_MANAGER ]; then
 JAVA_OPTS=$JAVA_OPTS
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
   else
JAVA_OPTS=$JAVA_OPTS -server -Xms8096m -Xmx8096m
   fi
   ...

Is this change correct? After that, I still got the same exception. The
index is updated and searched frequently. I am trying to change the code to
avoid the frequent updates. I guess only changing JAVA_OPTS does not work.

Could you give me some help?

Thanks,
LB


On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada 
estrada.adam.gro...@gmail.com wrote:

 Is anyone familiar with the environment variable, JAVA_OPTS? I set
 mine to a much larger heap size and never had any of these issues
 again.

 JAVA_OPTS = -server -Xms4048m -Xmx4048m

 Adam

 On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia isan.fu...@germinait.com
 wrote:
  Hi all,
  By adding more servers do u mean sharding of index.And after sharding ,
 how
  my query performance will be affected .
  Will the query execution time increase.
 
  Thanks,
  Isan Fulia.
 
  On 19 January 2011 12:52, Grijesh pintu.grij...@gmail.com wrote:
 
 
  Hi Isan,
 
  It seems your index size 25GB si much more compared to you have total
 Ram
  size is 4GB.
  You have to do 2 things to avoid Out Of Memory Problem.
  1-Buy more Ram ,add at least 12 GB of more ram.
  2-Increase the Memory allocated to solr by setting XMX values.at least
 12
  GB
  allocate to solr.
 
  But if your all index will fit into the Cache memory it will give you
 the
  better result.
 
  Also add more servers to load balance as your QPS is high.
  Your 7 Laks data makes 25 GB of index its looking quite high.Try to
 lower
  the index size
  What are you indexing in your 25GB of index?
 
  -
  Thanx:
  Grijesh
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p2285779.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
  --
  Thanks  Regards,
  Isan Fulia.
 



Re: QueryWeight for Solr

2011-02-09 Thread Em

Hi Yonik,

thanks for the fast feedback.

Well, as far as I can see there is no possibility to get the original query
from the similarity-class...

Let me ask differently: I know there are some distributed
idf-implementations out there. 
One approach is to ask every shard for its idf for a term and then aggregate
those values at the master who queried them all. Afterwards they use it for
their similarity etc.

How do they store these idfs for the current request so that the
similarity is aware of them?

I do not want to reimplement distributed idf, but I want to figure out how
they make it accessible for the similarity that is in use.

Thank you!

Regards


Yonik Seeley-2-2 wrote:
 
 On Wed, Feb 9, 2011 at 12:16 PM, Em mailformailingli...@yahoo.de wrote:
 For the current usecase we want to experiment with different values for
 the
 idf based on different algorithms and how they affect the scoring.
 
 For tf, idf, lengthNorm, coord, etc, see Similarity.
 Solr already alows you to specify one in the schema, and work is
 underway to make it per-field:
 https://issues.apache.org/jira/browse/SOLR-2338
 
 -Yonik
 http://lucidimagination.com
 
 

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/QueryWeight-for-Solr-tp2459933p2460386.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Out of Memory Error

2011-02-09 Thread Markus Jelsma
Bing Li,

One should be conservative when setting Xmx. Also, just setting Xmx might not 
do the trick at all because the garbage collector might also be the issue 
here. Configure the JVM to output debug logs of the garbage collector and 
monitor the heap usage (especially the tenured generation) with a good tool 
like JConsole.
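
For the logging part, the usual HotSpot flags look something like this (the log path
is just an example; add -Dcom.sun.management.jmxremote if you want to attach JConsole
remotely):

  JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -Xloggc:/var/log/tomcat/gc.log"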

You might also want to take a look at your cache settings and autowarm 
parameters. In some scenarios with very frequent updates, a large corpus and 
a high load of heterogeneous queries you might want to dump the documentCache 
and queryResultCache; the cache hitratio tends to be very low and the caches 
will just consume a lot of memory and CPU time.

In one of my projects I finally decided to only use the filterCache. Using the 
other caches took too much RAM and CPU while running and had a lot of 
evictions and still a low hitratio. I could, of course, make the caches a lot 
bigger and increase autowarming but that would take a lot of time before a 
cache is autowarmed and a very, very, large amount of RAM. I choose to rely on 
the OS-cache instead.

Cheers,

 Dear Adam,
 
 I also got the OutOfMemory exception. I changed the JAVA_OPTS in
 catalina.sh as follows.
 
...
if [ -z $LOGGING_MANAGER ]; then
  JAVA_OPTS=$JAVA_OPTS
 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
else
 JAVA_OPTS=$JAVA_OPTS -server -Xms8096m -Xmx8096m
fi
...
 
 Is this change correct? After that, I still got the same exception. The
 index is updated and searched frequently. I am trying to change the code to
 avoid the frequent updates. I guess only changing JAVA_OPTS does not work.
 
 Could you give me some help?
 
 Thanks,
 LB
 
 
 On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada 
 
 estrada.adam.gro...@gmail.com wrote:
  Is anyone familiar with the environment variable, JAVA_OPTS? I set
  mine to a much larger heap size and never had any of these issues
  again.
  
  JAVA_OPTS = -server -Xms4048m -Xmx4048m
  
  Adam
  
  On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia isan.fu...@germinait.com
  
  wrote:
   Hi all,
   By adding more servers do u mean sharding of index.And after sharding ,
  
  how
  
   my query performance will be affected .
   Will the query execution time increase.
   
   Thanks,
   Isan Fulia.
   
   On 19 January 2011 12:52, Grijesh pintu.grij...@gmail.com wrote:
   Hi Isan,
   
   It seems your index size 25GB si much more compared to you have total
  
  Ram
  
   size is 4GB.
   You have to do 2 things to avoid Out Of Memory Problem.
   1-Buy more Ram ,add at least 12 GB of more ram.
   2-Increase the Memory allocated to solr by setting XMX values.at least
  
  12
  
   GB
   allocate to solr.
   
   But if your all index will fit into the Cache memory it will give you
  
  the
  
   better result.
   
   Also add more servers to load balance as your QPS is high.
   Your 7 Laks data makes 25 GB of index its looking quite high.Try to
  
  lower
  
   the index size
   What are you indexing in your 25GB of index?
   
   -
   Thanx:
   Grijesh
   --
  
   View this message in context:
  http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p228
  5779.html
  
   Sent from the Solr - User mailing list archive at Nabble.com.
   
   --
   Thanks  Regards,
   Isan Fulia.


Re: Solr Out of Memory Error

2011-02-09 Thread Markus Jelsma
I should also add that reducing the caches and autowarm sizes (or not using 
them at all) drastically reduces memory consumption when a new searcher is 
being prepared after a commit. The memory usage will spike at these events. 
Again, use a monitoring tool to get more information on your specific scenario.

 Bing Li,
 
 One should be conservative when setting Xmx. Also, just setting Xmx might
 not do the trick at all because the garbage collector might also be the
 issue here. Configure the JVM to output debug logs of the garbage
 collector and monitor the heap usage (especially the tenured generation)
 with a good tool like JConsole.
 
 You might also want to take a look at your cache settings and autowarm
 parameters. In some scenario's with very frequent updates, a large corpus
 and a high load of heterogenous queries you might want to dump the
 documentCache and queryResultCache, the cache hitratio tends to be very
 low and the caches will just consume a lot of memory and CPU time.
 
 One of my projects i finally decided to only use the filterCache. Using the
 other caches took too much RAM and CPU while running and had a lot of
 evictions and still a lot hitratio. I could, of course, make the caches a
 lot bigger and increase autowarming but that would take a lot of time
 before a cache is autowarmed and a very, very, large amount of RAM. I
 choose to rely on the OS-cache instead.
 
 Cheers,
 
  Dear Adam,
  
  I also got the OutOfMemory exception. I changed the JAVA_OPTS in
  catalina.sh as follows.
  
 ...
 if [ -z $LOGGING_MANAGER ]; then
 
   JAVA_OPTS=$JAVA_OPTS
  
  -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
  
 else
 
  JAVA_OPTS=$JAVA_OPTS -server -Xms8096m -Xmx8096m
 
 fi
 ...
  
  Is this change correct? After that, I still got the same exception. The
  index is updated and searched frequently. I am trying to change the code
  to avoid the frequent updates. I guess only changing JAVA_OPTS does not
  work.
  
  Could you give me some help?
  
  Thanks,
  LB
  
  
  On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada 
  
  estrada.adam.gro...@gmail.com wrote:
   Is anyone familiar with the environment variable, JAVA_OPTS? I set
   mine to a much larger heap size and never had any of these issues
   again.
   
   JAVA_OPTS = -server -Xms4048m -Xmx4048m
   
   Adam
   
   On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia isan.fu...@germinait.com
   
   wrote:
Hi all,
By adding more servers do u mean sharding of index.And after sharding
,
   
   how
   
my query performance will be affected .
Will the query execution time increase.

Thanks,
Isan Fulia.

On 19 January 2011 12:52, Grijesh pintu.grij...@gmail.com wrote:
Hi Isan,

It seems your index size 25GB si much more compared to you have
total
   
   Ram
   
size is 4GB.
You have to do 2 things to avoid Out Of Memory Problem.
1-Buy more Ram ,add at least 12 GB of more ram.
2-Increase the Memory allocated to solr by setting XMX values.at
least
   
   12
   
GB
allocate to solr.

But if your all index will fit into the Cache memory it will give
you
   
   the
   
better result.

Also add more servers to load balance as your QPS is high.
Your 7 Laks data makes 25 GB of index its looking quite high.Try to
   
   lower
   
the index size
What are you indexing in your 25GB of index?

-
Thanx:
Grijesh
--
   
View this message in context:
   http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p2
   28 5779.html
   
Sent from the Solr - User mailing list archive at Nabble.com.

--
Thanks  Regards,
Isan Fulia.


Changing value of start parameter affects numFound?

2011-02-09 Thread mrw

I have a data set indexed over two irons, with M docs per Solr core for a
total of N cores.

If I perform a query across all N cores with start=0 and rows=30, I get,
say, numFound=27521.  If I simply change the start param to start=27510
(simulating being on the last page of data), I get a smaller result set
(say, numFound=21415).  

I had expected numFound to be the same in either case, since no other aspect
of the query had changed.  Am I mistaken?

I'm using Solr 1.4.1.955763M.  Faceting is enabled on the query. All cores
have the same schema.

Thanks!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-tp2460645p2460645.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: QueryWeight for Solr

2011-02-09 Thread Yonik Seeley
On Wed, Feb 9, 2011 at 1:18 PM, Em mailformailingli...@yahoo.de wrote:
 How do they store these idfs for the current request so that the
 similarity is aware of them?

The df (as opposed to idf) is requested from the searcher by the
weight, which then uses the similarity to produce the idf.  See
TermWeight as an example.  There's no out-of-the-box plugin to provide
alternate df values though, other than the Searcher interface.

If you're doing custom enough scoring, then just implementing your own
query class is probably the way to go, but people might have other
ideas depending on the specifics of what you're trying to do.

-Yonik
http://lucidimagination.com


Re: QueryWeight for Solr

2011-02-09 Thread Em

Thanks, again. :)

Okay, so if one wants a distributed idf one should extend a searcher instead
of the query-class.

But it doesn't seem to be pluggable, right?

Well, for our purposes extending the query-class is enough, but just out of
curiosity: where should one start if one wants to make some components
pluggable?

Since Real-Time-Search is an issue where I read about the idea of making
things like the searcher pluggable, this could be beneficial to the
community.


Regards


Yonik Seeley-2-2 wrote:
 
 On Wed, Feb 9, 2011 at 1:18 PM, Em mailformailingli...@yahoo.de wrote:
 How do they store these idfs for the current request so that the
 similarity is aware of them?
 
 The df (as opposed to idf) is requested from the searcher by the
 weight, which then uses the similarity to produce the idf.  See
 TermWeight as an example.  There's no out-of-the-box plugin to provide
 alternate df values though, other than the Searcher interface.
 
 If you're doing custom enough scoring, then just implementing your own
 query class is probably the way to go, but people might have other
 ideas depending on the specifics of what you're trying to do.
 
 -Yonik
 http://lucidimagination.com
 
 

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/QueryWeight-for-Solr-tp2459933p2460718.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Changing value of start parameter affects numFound?

2011-02-09 Thread mrw


mrw wrote:
 
 I have a data set indexed over two irons, with M docs per Solr core for a
 total of N cores.
 
 If I perform a query across all N cores with start=0 and rows=30, I get,
 say, numFound=27521).  If I simply change the start param to start=27510
 (simulating being on the last page of data), I get a smaller result set
 (say, numFound=21415).  
 
 I had expected numFound to be the same in either case, since no other
 aspect of the query had changed.  Am I mistaken?
 
 I'm using Solr 1.4.1.955763M.  Faceting is enabled on the query. All cores
 have the same schema.
 
 Thanks!
 

More detail:  numFound seems to vary unpredictably based on start value.


start     numFound
------    --------
0-46      27521
47-59     27520
60        27519
61-91     27518
62        27517


Any ideas?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-tp2460645p2460795.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: QueryWeight for Solr

2011-02-09 Thread Yonik Seeley
On Wed, Feb 9, 2011 at 1:51 PM, Em mailformailingli...@yahoo.de wrote:
 Okay, so if one wants a distributed idf one should extend a searcher instead
 of the query-class.

Yes.
If you're interested in distributed search for Solr, there is a patch
in progress:
https://issues.apache.org/jira/browse/SOLR-1632

 But it doesn't seem to be pluggable, right?

Since you can weight with a different searcher than you query with, a
searcher works fine as an extension point (but it's a lucene-level
extension point, not a solr-level one).

-Yonik
http://lucidimagination.com


Re: Changing value of start parameter affects numFound?

2011-02-09 Thread Yonik Seeley
On Wed, Feb 9, 2011 at 1:42 PM, mrw mikerobertsw...@gmail.com wrote:

 I have a data set indexed over two irons, with M docs per Solr core for a
 total of N cores.

 If I perform a query across all N cores with start=0 and rows=30, I get,
 say, numFound=27521).  If I simply change the start param to start=27510
 (simulating being on the last page of data), I get a smaller result set
 (say, numFound=21415).

 I had expected numFound to be the same in either case, since no other aspect
 of the query had changed.  Am I mistaken?

You probably have some duplicate docs in your shards (those with the same id).
Solr doesn't know they are dups until it retrieves the ids of the docs
to merge, and then it only takes one of the dups and decrements
numFound.
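
One way to spot-check that is to facet on your unique key field across the shards
and look for counts greater than 1, e.g. (assuming the key field is called id):

  q=*:*&rows=0&facet=true&facet.field=id&facet.mincount=2&facet.limit=20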

-Yonik
http://lucidimagination.com


Architecture decisions with Solr

2011-02-09 Thread Greg Georges
Hello all,

I am looking into an enterprise search solution for our architecture and I am 
very pleased to see all the features Solr provides. In our case, we will have a 
need for a highly scalable application for multiple clients. This application 
will be built to serve many users who each will have a client account. Each 
client will have a multitude of documents to index (0-1000s of documents). 
After discussion we were talking about going multicore and to have one index 
file per client account. The reason for this is that security is achieved by 
having a separate index for each client, etc. Is this the best approach? How 
feasible is it (dynamically creating indexes on client account creation)? Is it 
better to go the faceted search capabilities route? Thanks for your help

Greg


Re: Architecture decisions with Solr

2011-02-09 Thread Darren Govoni
What about standing up a VM (search appliance that you would make) for
each client? 
If there's no data sharing across clients, then using the same solr
server/index doesn't seem necessary.

Solr will easily meet your needs though, it's the best there is.

On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote:

 Hello all,
 
 I am looking into an enterprise search solution for our architecture and I am 
 very pleased to see all the features Solr provides. In our case, we will have 
 a need for a highly scalable application for multiple clients. This 
 application will be built to serve many users who each will have a client 
 account. Each client will have a multitude of documents to index (0-1000s of 
 documents). After discussion we were talking about going multicore and to 
 have one index file per client account. The reason for this is that security 
 is achieved by having a separate index for each client etc.. Is this the best 
 approach? How feasible is it (dynamically create indexes on client account 
 creation. Is it better to go the faceted search capabilities route? Thanks 
 for your help
 
 Greg




RE: Architecture decisions with Solr

2011-02-09 Thread Greg Georges
From what I understand about multicore, each of the indexes is independent 
of the others, right? Or would one index have access to the info of the other? 
My requirement is like you mention, a client has access only to his or her 
search data based in their documents. Other clients have no access to the index 
of other clients.

Greg

-Original Message-
From: Darren Govoni [mailto:dar...@ontrenet.com] 
Sent: 9 février 2011 14:28
To: solr-user@lucene.apache.org
Subject: Re: Architecture decisions with Solr

What about standing up a VM (search appliance that you would make) for
each client? 
If there's no data sharing across clients, then using the same solr
server/index doesn't seem necessary.

Solr will easily meet your needs though, its the best there is.

On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote:

 Hello all,
 
 I am looking into an enterprise search solution for our architecture and I am 
 very pleased to see all the features Solr provides. In our case, we will have 
 a need for a highly scalable application for multiple clients. This 
 application will be built to serve many users who each will have a client 
 account. Each client will have a multitude of documents to index (0-1000s of 
 documents). After discussion we were talking about going multicore and to 
 have one index file per client account. The reason for this is that security 
 is achieved by having a separate index for each client etc.. Is this the best 
 approach? How feasible is it (dynamically create indexes on client account 
 creation. Is it better to go the faceted search capabilities route? Thanks 
 for your help
 
 Greg




Re: Architecture decisions with Solr

2011-02-09 Thread Glen Newton
 This application will be built to serve many users

If this means that you have thousands of users, 1000s of VMs and/or
1000s of cores is not going to scale.

Have an ID in the index for each user, and filter using it.
Then they can see only their own documents.

Assuming that you are building an app through which they
authenticate and which talks to solr
(i.e. all requests are filtered using their ID)
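
In practice that is just a filter query the application appends to every request,
e.g. (field name and value are made up):

  q=annual report&fq=client_id:12345

Since fq results are cached in the filterCache, the per-client restriction is cheap
once it has been used a few times.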

-Glen

On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com wrote:
 From what I understand about multicore, each of the indexes are independant 
 from each other right? Or would one index have access to the info of the 
 other? My requirement is like you mention, a client has access only to his or 
 her search data based in their documents. Other clients have no access to the 
 index of other clients.

 Greg

 -Original Message-
 From: Darren Govoni [mailto:dar...@ontrenet.com]
 Sent: 9 février 2011 14:28
 To: solr-user@lucene.apache.org
 Subject: Re: Architecture decisions with Solr

 What about standing up a VM (search appliance that you would make) for
 each client?
 If there's no data sharing across clients, then using the same solr
 server/index doesn't seem necessary.

 Solr will easily meet your needs though, its the best there is.

 On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote:

 Hello all,

 I am looking into an enterprise search solution for our architecture and I 
 am very pleased to see all the features Solr provides. In our case, we will 
 have a need for a highly scalable application for multiple clients. This 
 application will be built to serve many users who each will have a client 
 account. Each client will have a multitude of documents to index (0-1000s of 
 documents). After discussion we were talking about going multicore and to 
 have one index file per client account. The reason for this is that security 
 is achieved by having a separate index for each client etc.. Is this the 
 best approach? How feasible is it (dynamically create indexes on client 
 account creation. Is it better to go the faceted search capabilities route? 
 Thanks for your help

 Greg






-- 

-


solr render biased search result

2011-02-09 Thread cyang2010

Hi,

I am asked whether solr can render biased search results.  For example, for
this search (query all movie titles in the Comedy genre), for a user who
indicates a preference for 1950's movies, can solr render the 1950's movies with
a higher score (top of the list)?  Or if the user is a kid, will the result
render G/PG rated movies at the top of the list, and all the R rated movies
at the bottom of the list?

I know that solr can boost the score based on a match on a particular field.  But
it can't favor some values over other values in the same field.  Is that
right?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461155.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Architecture decisions with Solr

2011-02-09 Thread Sujit Pal
Another option (assuming the case where a user can be granted access to
a certain class of documents, and more than one user would be able to
access certain documents) would be to store the access filter (as an OR
query of content types) in an external cache (perhaps a database or an
external cache that the database changes are published to periodically),
then using this access filter as a facet on the base query.

-sujit

On Wed, 2011-02-09 at 14:38 -0500, Glen Newton wrote:
  This application will be built to serve many users
 
 If this means that you have thousands of users, 1000s of VMs and/or
 1000s of cores is not going to scale.
 
 Have an ID in the index for each user, and filter using it.
 Then they can see only their own documents.
 
 Assuming that you are building an app that through which they
 authenticate  talks to solr .
 (i.e. all requests are filtered using their ID)
 
 -Glen
 
 On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com wrote:
  From what I understand about multicore, each of the indexes are independant 
  from each other right? Or would one index have access to the info of the 
  other? My requirement is like you mention, a client has access only to his 
  or her search data based in their documents. Other clients have no access 
  to the index of other clients.
 
  Greg
 
  -Original Message-
  From: Darren Govoni [mailto:dar...@ontrenet.com]
  Sent: 9 février 2011 14:28
  To: solr-user@lucene.apache.org
  Subject: Re: Architecture decisions with Solr
 
  What about standing up a VM (search appliance that you would make) for
  each client?
  If there's no data sharing across clients, then using the same solr
  server/index doesn't seem necessary.
 
  Solr will easily meet your needs though, its the best there is.
 
  On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote:
 
  Hello all,
 
  I am looking into an enterprise search solution for our architecture and I 
  am very pleased to see all the features Solr provides. In our case, we 
  will have a need for a highly scalable application for multiple clients. 
  This application will be built to serve many users who each will have a 
  client account. Each client will have a multitude of documents to index 
  (0-1000s of documents). After discussion we were talking about going 
  multicore and to have one index file per client account. The reason for 
  this is that security is achieved by having a separate index for each 
  client etc.. Is this the best approach? How feasible is it (dynamically 
  create indexes on client account creation. Is it better to go the faceted 
  search capabilities route? Thanks for your help
 
  Greg
 
 
 
 
 
 



Re: solr render biased search result

2011-02-09 Thread Paul Libbrecht
Cyang,

why can't you, for a kid, add a boosting query 

genre:kid^2.0

alongside the rest of the query?
That would double the score of a match if the users are kids.
But note that you'd better calibrate the coefficient with some test battery. 
This is part of the fine art, I think.
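
As a rough illustration (untested sketch; the dismax handler and the field
names title/genre/rating are assumptions, not anyone's real schema):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class BoostExample {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("comedy");  // the user's query text
    q.set("defType", "dismax");
    q.set("qf", "title genre");
    // Boost, not filter: kid-friendly ratings float to the top, R-rated titles still match.
    q.set("bq", "rating:G^2.0 rating:PG^1.5");
    System.out.println(server.query(q).getResults().getNumFound() + " titles found");
  }
}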

paul


Le 9 févr. 2011 à 20:44, cyang2010 a écrit :

 
 Hi,
 
 I am asked that whether solr renders biased search result?  For example, for
 this search (query all movie title by this Comedy genre),  for user who
 indicates a preference to 1950's movies, solr renders the 1950's movies with
 higher score (top in the list)?Or if user is a kid, then the result will
 render G/PG rated movie top in the list, and render all the R rated movie
 bottom in the list?
 
 I know that solr can boost score based on match on a particular field.  But
 it can't favor some value over other value in the same field.  is that
 right?
 -- 
 View this message in context: 
 http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461155.html
 Sent from the Solr - User mailing list archive at Nabble.com.



solr current workding directory or reading config files

2011-02-09 Thread Tri Nguyen
Hi,

I have a class (in a jar) that reads from properties (text) files.  I have 
these 
files in the same jar file as the class.

However, when my class reads those properties files, those files cannot be 
found 
since solr reads from tomcat's bin directory.

I don't really want to put the config files in tomcat's bin directory.

How do I reconcile this?

Tri

pre and post processing when building index

2011-02-09 Thread Tri Nguyen
Hi,

I'm scheduling solr to build every hour or so.

I'd like to do some pre and post processing for each index build.  The 
preprocessing would do some checks and perhaps will skip the build.

For post processing, I will do some checks and either commit or rollback the 
build.

Can I write some class and plug it into solr for this?

Thanks,

Tri

DataImportHandler: regex debugging

2011-02-09 Thread Jon Drukman
I am trying to use the regex transformer but it's not returning anything. 
Either my regex is wrong, or I've done something else wrong in the setup of the
entity.  Is there any way to debug this?  Making a change and waiting 7 minutes
to reindex the entity sucks.

<entity name="boxshot"
  query="SELECT GROUP_CONCAT(i.url, ',') boxshot_url,
 GROUP_CONCAT(i2.url, ',') boxshot_url_small FROM games g
 left join image_sizes i ON g.box_image_id = i.id AND i.size_type = 39
 left join image_sizes i2 on g.box_image_id = i2.id AND i2.size_type = 40
 WHERE g.game_seo_title = '${game.game_seo_title}'
 GROUP BY g.game_seo_title">
  <field name="main_image" regex="^(.*?)," sourceColName="boxshot_url" />
  <field name="small_image" regex="^(.*?)," sourceColName="boxshot_url_small" />
</entity>

This returns columns that are either null, or have some comma-separated strings.
I want the bit up to the first comma, if it exists.

Ideally I could have it log the query and the input/output
of the field statements.
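
One cheap way to rule out the regex itself, independent of DIH, is to run it
against a sample value in plain Java first (the sample URLs below are invented;
my understanding is that the RegexTransformer keeps the captured group when the
pattern matches, but the point here is only to verify the pattern):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCheck {
  public static void main(String[] args) {
    // A value shaped like what the boxshot_url column would return.
    String sample = "http://img.example.com/a.jpg,http://img.example.com/b.jpg";
    Matcher m = Pattern.compile("^(.*?),").matcher(sample);
    System.out.println(m.find() ? m.group(1) : "no match");
  }
}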



Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

2011-02-09 Thread pravin

Hello,
Andy, so did you get a final answer to your question?
I am also trying to do something similar. Please give me pointers if you
have any.
Basically I also need to use NGram with WhitespaceTokenizer; any help will be
appreciated.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/NGramFilterFactory-for-auto-complete-that-matches-the-middle-of-multi-lingual-tags-tp1619234p2459466.html
Sent from the Solr - User mailing list archive at Nabble.com.


Why does the StatsComponent only work with indexed fields?

2011-02-09 Thread Travis Truman
Is there a reason why the StatsComponent only deals with indexed fields?

I just updated the wiki: http://wiki.apache.org/solr/StatsComponent to call
this fact out since it was not apparent previously.

I've briefly skimmed the source of StatsComponent, but am not familiar
enough with the code or Solr yet to understand if it was omitted for
performance reasons or some other reason.

Any information would be appreciated.

Thanks,
Travis


Re: solr render biased search result

2011-02-09 Thread cyang2010

That makes sense.  It is a little bit indirect.  You have to translate the
user preference/profile into a search field value and then boost documents
matching that value in the search results.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461668.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Why does the StatsComponent only work with indexed fields?

2011-02-09 Thread Erick Erickson
What kinds of information would you expect for a stored-only field? I mean,
the stored part is just a blob that Solr doesn't peek inside of, so I'm not
sure
what useful information *could* be returned

Best
Erick

On Wed, Feb 9, 2011 at 3:55 PM, Travis Truman trum...@gmail.com wrote:

 Is there a reason why the StatsComponent only deals with indexed fields?

 I just updated the wiki: http://wiki.apache.org/solr/StatsComponent to
 call
 this fact out since it was not apparent previously.

 I've briefly skimmed the source of StatsComponent, but am not familiar
 enough with the code or Solr yet to understand if it was omitted for
 performance reasons or some other reason.

 Any information would be appreciated.

 Thanks,
 Travis



Re: solr render biased search result

2011-02-09 Thread Erick Erickson
What *could* solr do for you? You've outlined a domain-specific requirement,
I'm not sure how a general-purpose search engine would incorporate
that functionality

Best
Erick

On Wed, Feb 9, 2011 at 4:08 PM, cyang2010 ysxsu...@hotmail.com wrote:


 That makes sense.  It is a little bit indirect.  You have to translate that
 user preference/profile into a search field value and then dictate search
 result boosting the doc with that preference value.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461668.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr current workding directory or reading config files

2011-02-09 Thread Wilkes, Chris
Is your war always deployed to the same location, ie
/usr/mycomp/myapplication/webapps/myapp.war?  If so then on startup copy the
files out of your directory and put them under CATALINA_BASE/solr
(/usr/mycomp/myapplication/solr) and in your war file have the
META-INF/context.xml JNDI setting point to that.

<Context>
   <Environment name="solr/home" type="java.lang.String"
                value="/usr/mycomp/myapplication/solr" override="true" />
</Context>

If you know of a way to reference CATALINA_BASE in the context.xml
that would make it easier.


On Feb 9, 2011, at 12:00 PM, Tri Nguyen wrote:


Hi,

I have a class (in a jar) that reads from properties (text) files.   
I have these

files in the same jar file as the class.

However, when my class reads those properties files, those files  
cannot be found

since solr reads from tomcat's bin directory.

I don't really want to put the config files in tomcat's bin directory.

How do I reconcile this?

Tri




communication between entity processor and solr DataImporter

2011-02-09 Thread Tri Nguyen
Hi,

I'd like to communicate errors between my entity processor and the DataImporter 
in case of error.

Should there be an error in my entity processor, I'd like the index build to 
rollback. How can I do this?

I want to throw an exception of some sort.  The only thing I can think of is to 
force a runtime exception to be thrown in nextRow() of the entityprocessor, since 
runtime exceptions are not checked and do not have to be declared in the 
nextRow() method signature.

How can I request the nextRow() method signature be updated to throw 
Exception?  
Would it even make sense?

Tri

Re: solr current workding directory or reading config files

2011-02-09 Thread Tri Nguyen
Wanted to add some more details to my problem.  I have many jars that have 
their 
own config files.  So I'd have to copy files for every jar.  Can solr read from 
the classpath (jar files)?

Yes my war is always deployed to the same location under webapps.  I do already 
have solr/home defined in web.xml.  I'll try copying my files into there, but I 
would have to extract every jar file and do this manually.
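
One possible way around the working-directory problem (just a sketch; the file
name myconfig.properties is a placeholder) is to load the files through the
classloader instead of the filesystem, which finds them inside any jar on the
webapp classpath regardless of where Tomcat was started from:

import java.io.InputStream;
import java.util.Properties;

public class ClasspathConfig {
  public static Properties load(String name) throws Exception {
    Properties props = new Properties();
    // Looks up the resource on the classpath, not relative to the working directory.
    InputStream in = ClasspathConfig.class.getClassLoader().getResourceAsStream(name);
    if (in == null) {
      throw new IllegalStateException("Resource not found on classpath: " + name);
    }
    try {
      props.load(in);
    } finally {
      in.close();
    }
    return props;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(load("myconfig.properties"));
  }
}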





From: Wilkes, Chris cwil...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wed, February 9, 2011 3:44:03 PM
Subject: Re: solr current workding directory or reading config files

Is your war always deployed to the same location, ie 
/usr/mycomp/myapplication/webapps/myapp.war?  If so then on startup copy the 
files out of your directory and put them under CATALINA_BASE/solr 
(/usr/mycomp/myapplication/solr) and in your war file have the 
META-INF/context.xml JNDI setting point to that.

<Context>
  <Environment name="solr/home" type="java.lang.String" 
               value="/usr/mycomp/myapplication/solr" override="true" />
</Context>

If you know of a way to reference CATALINA_BASE in the context.xml that would 
make it easier.

On Feb 9, 2011, at 12:00 PM, Tri Nguyen wrote:

 Hi,
 
 I have a class (in a jar) that reads from properties (text) files.  I have 
these
 files in the same jar file as the class.
 
 However, when my class reads those properties files, those files cannot be 
found
 since solr reads from tomcat's bin directory.
 
 I don't really want to put the config files in tomcat's bin directory.
 
 How do I reconcile this?
 
 Tri

Re: communication between entity processor and solr DataImporter

2011-02-09 Thread Tri Nguyen
I can throw DataImportHandlerException (a runtime exception) from my 
entityprocessor which will force a rollback.
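
For reference, a minimal sketch of what that might look like in a custom entity
processor (assuming the usual DIH classes; fetchNextRowSomehow is a placeholder
for the real row-fetching logic):

import java.util.Map;
import org.apache.solr.handler.dataimport.DataImportHandlerException;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

public class MyEntityProcessor extends EntityProcessorBase {
  @Override
  public Map<String, Object> nextRow() {
    try {
      return fetchNextRowSomehow();
    } catch (Exception e) {
      // SEVERE aborts the import, which is what forces the rollback; being a
      // RuntimeException, it does not need to appear in nextRow()'s signature.
      throw new DataImportHandlerException(DataImportHandlerException.SEVERE,
          "Failed to read next row", e);
    }
  }

  private Map<String, Object> fetchNextRowSomehow() {
    return null;  // returning null tells DIH this entity has no more rows
  }
}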

Tri





From: Tri Nguyen tringuye...@yahoo.com
To: solr-user@lucene.apache.org
Sent: Wed, February 9, 2011 3:50:05 PM
Subject: communication between entity processor and solr DataImporter

Hi,

I'd like to communicate errors between my entity processor and the DataImporter 
in case of error.

Should there be an error in my entity processor, I'd like the index build to 
rollback. How can I do this?

I want to throw an exception of some sort.  Only thing I can think of is to 
force a runtime exception be thrown in nextRow() of the entityprocessor since 
runtime exceptions are not checked and does not have to be declared in the 
nextRow() method signature.

How can I request the nextRow() method signature be updated to throw 
Exception?  

Would it even make sense?

Tri

Re: communication between entity processor and solr DataImporter

2011-02-09 Thread Erick Erickson
Tri:

You might want to consider, rather than going through DIH with your own
entity
processor, just using SolrJ in a separate process. That allows you much
finer
control over the behavior of your indexing process.

Making a connection to Solr via SolrJ and adding a one-field document is
maybe
a 20 line program. Of course the complexity will come in your
database-access
code and error handling, and your documents will be much larger than one
field,
I just included that estimate so you can gauge whether a pilot would be
worthwhile...
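
For anyone curious, the skeleton really is about that size (untested sketch;
the URL and the single field are placeholders):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OneFieldIndexer {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");  // the one field of this toy document
    server.add(doc);
    server.commit();  // or server.rollback() if your own pre/post checks fail
  }
}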

Just a thought
Erick

On Wed, Feb 9, 2011 at 7:32 PM, Tri Nguyen tringuye...@yahoo.com wrote:

 I can throw DataImportHandlerException (a runtime exception) from my
 entityprocessor which will force a rollback.

 Tri




 
 From: Tri Nguyen tringuye...@yahoo.com
 To: solr-user@lucene.apache.org
 Sent: Wed, February 9, 2011 3:50:05 PM
 Subject: communication between entity processor and solr DataImporter

 Hi,

 I'd like to communicate errors between my entity processor and the
 DataImporter
 in case of error.

 Should there be an error in my entity processor, I'd like the index build
 to
 rollback. How can I do this?

 I want to throw an exception of some sort.  Only thing I can think of is to
 force a runtime exception be thrown in nextRow() of the entityprocessor
 since
 runtime exceptions are not checked and does not have to be declared in the
 nextRow() method signature.

 How can I request the nextRow() method signature be updated to throw
 Exception?

 Would it even make sense?

 Tri



Re: Nutch and Solr search on the fly

2011-02-09 Thread .: Abhishek :.
Hi Charan,

 Thanks for the clarifications.

 The link I have been referring to (
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) does not say
anything about using the crawl command. Do I have to run it after the last step
mentioned?

Thanks,
Abi

On Thu, Feb 10, 2011 at 12:58 AM, charan kumar charan.ku...@gmail.comwrote:

 Hi Abishek,

 depth is a param of crawl command, not fetch command

 If you are using a custom script calling the individual stages of a nutch crawl,
 then depth N means you run that script N times. You can put a
 loop in the script.

 Thanks,
 Charan

 On Wed, Feb 9, 2011 at 6:26 AM, .: Abhishek :. ab1s...@gmail.com wrote:

  Hi Erick,
 
   Thanks a bunch for the response
 
   Could be a chance..but all I am wondering is where to specify the depth
 in
  the whole entire process in the URL
  http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried
  specifying it during the fetcher phase but it was just ignored :(
 
  Thanks,
  Abi
 
  On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   WARNING: I don't do Nutch much, but could it be that your
   crawl depth is 1? See:
   http://wiki.apache.org/nutch/NutchTutorial
  
   http://wiki.apache.org/nutch/NutchTutorialand search for depth
   Best
   Erick
  
   On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com
  wrote:
  
Hi Markus,
   
 I am sorry for not being clear, I meant to say that...
   
 Suppose if a url namely
www.somehost.com/gifts/greetingcard.html (which in
turn contain links to a.html, b.html, c.html, d.html) is injected
 into
   the
seed.txt, after the whole process I was expecting a bunch of other
  pages
which crawled from this seed url. However, at the end of it all I see
  is
the
contents from only this page namely
www.somehost.com/gifts/greetingcard.html and I do not see any other
pages (here a.html, b.html, c.html, d.html)
crawled from this one.
   
 The crawling happens only for the URLs mentioned in the seed.txt and
   does
not proceed further from there. So I am just bit confused. Why is it
  not
crawling the linked pages(a.html, b.html, c.html and d.html). I get a
feeling that I am missing something that the author of the blog(
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
everyone would know.
   
Thanks,
Abi
   
   
On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma 
   markus.jel...@openindex.io
wrote:
   
 The parsed data is only sent to the Solr index if you tell a segment to
 be indexed; solrindex crawldb linkdb segment

 If you did this only once after injecting and then the consequent
 fetch,parse,update,index sequence then you, of course, only see those
 URL's.
 If you don't index a segment after it's being parsed, you need to do it
 later on.

 On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
  Hi all,
 
   I am a newbie to nutch and solr. Well relatively much newer to
  Solr
than
  Nutch :)
 
   I have been using nutch for past two weeks, and I wanted to know
  if
   I
 can
  query or search on my nutch crawls on the fly(before it
 completes).
  I
am
  asking this because the websites I am crawling are really huge
 and
  it
 takes
  around 3-4 days for a crawl to complete. I want to analyze some
  quick
  results while the nutch crawler is still crawling the URLs. Some
  one
  suggested me that Solr would make it possible.
 
   I followed the steps in
  http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for
   this.
By
  this process, I see only the injected URLs are shown in the Solr
search.
 I
  know I did something really foolish and the crawl never happened,
 I
feel
 I
  am missing some information here. I think somewhere in the
 process
there
  should be a crawling happening and I missed it out.
 
   Just wanted to see if some one could help me pointing this out
 and
where
 I
  went wrong in the process. Forgive my foolishness and thanks for
  your
  patience.
 
  Cheers,
  Abi

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350

   
  
 



Re: Architecture decisions with Solr

2011-02-09 Thread Adam Estrada
I tried the multi-core route and it gets too complicated and cumbersome to 
maintain. That is just from my own personal testing...It was suggested that 
each user have their own ID in a single index that you can query against 
accordingly. In the example schema.xml I believe there is a field called 
texttight or something like that that is meant for skew numbers. Give each user 
their own guid or md5 hash and add that as part of all your queries. That way, 
only their data are returned. It would be the equivalent of something like 
this...

SELECT * FROM mytable WHERE userid = '3F2504E04F8911D39A0C0305E82C3301' AND 

Grant Ingersoll gave a presentation at the Lucene Revolution conference that 
demonstrated that you can build a query to be as easy or as complicated as any 
SQL statement. Maybe he can share that PPT?
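
In Solr terms that WHERE clause roughly becomes a filter query; a hypothetical
SolrJ fragment (field name and hash value made up) would be:

import org.apache.solr.client.solrj.SolrQuery;

public class PerUserQuery {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("*:*");
    // Equivalent of the SQL WHERE clause: only this client's documents come back.
    q.addFilterQuery("userid:3F2504E04F8911D39A0C0305E82C3301");
    System.out.println(q);  // roughly the parameter string that gets sent to Solr
  }
}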

Adam

On Feb 9, 2011, at 2:47 PM, Sujit Pal wrote:

 Another option (assuming the case where a user can be granted access to
 a certain class of documents, and more than one user would be able to
 access certain documents) would be to store the access filter (as an OR
 query of content types) in an external cache (perhaps a database or an
 eternal cache that the database changes are published to periodically),
 then using this access filter as a facet on the base query.
 
 -sujit
 
 On Wed, 2011-02-09 at 14:38 -0500, Glen Newton wrote:
 This application will be built to serve many users
 
 If this means that you have thousands of users, 1000s of VMs and/or
 1000s of cores is not going to scale.
 
 Have an ID in the index for each user, and filter using it.
 Then they can see only their own documents.
 
 Assuming that you are building an app through which they
 authenticate and which talks to solr.
 (i.e. all requests are filtered using their ID)
 
 -Glen
 
 On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com 
 wrote:
 From what I understand about multicore, each of the indexes is independent 
 from each other right? Or would one index have access to the info of the 
 other? My requirement is like you mention, a client has access only to his 
 or her search data based in their documents. Other clients have no access 
 to the index of other clients.
 
 Greg
 
 -Original Message-
 From: Darren Govoni [mailto:dar...@ontrenet.com]
 Sent: 9 février 2011 14:28
 To: solr-user@lucene.apache.org
 Subject: Re: Architecture decisions with Solr
 
 What about standing up a VM (search appliance that you would make) for
 each client?
 If there's no data sharing across clients, then using the same solr
 server/index doesn't seem necessary.
 
 Solr will easily meet your needs though, its the best there is.
 
 On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote:
 
 Hello all,
 
 I am looking into an enterprise search solution for our architecture and I 
 am very pleased to see all the features Solr provides. In our case, we 
 will have a need for a highly scalable application for multiple clients. 
 This application will be built to serve many users who each will have a 
 client account. Each client will have a multitude of documents to index 
 (0-1000s of documents). After discussion we were talking about going 
 multicore and to have one index file per client account. The reason for 
 this is that security is achieved by having a separate index for each 
 client etc.. Is this the best approach? How feasible is it (dynamically 
 create indexes on client account creation. Is it better to go the faceted 
 search capabilities route? Thanks for your help
 
 Greg
 
 
 
 
 
 
 



Faceting Query

2011-02-09 Thread Isha Garg

Hi,
  What is the significance of copyField when used in faceting?
Please explain with an example.


Thanks!
Isha


Faceting Query

2011-02-09 Thread Isha Garg

What is the facet.pivot field? Please explain with an example.