Re: DataImportHandler - synchronous execution

2010-01-12 Thread Noble Paul നോബിള്‍ नोब्ळ्
it can be added

On Tue, Jan 12, 2010 at 10:18 PM, Alexey Serba  wrote:
> Hi,
>
> I found that there's no explicit option to run DataImportHandler in a
> synchronous mode. I need that option to run DIH from SolrJ (
> EmbeddedSolrServer ) in the same thread. Currently I pass dummy stream
> to DIH as a workaround for this, but I think it makes sense to add
> specific option for that. Any objections?
>
> Alex
>
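
A rough SolrJ sketch of the workaround Alexey describes (the core name, the
"/dataimport" handler path and the ContentStreamUpdateRequest usage are
assumptions, not code from this thread). Passing any content stream makes DIH
run the import in the calling thread rather than in the background:

    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.ContentStreamBase;
    import org.apache.solr.core.CoreContainer;

    public class SyncDihImport {
      public static void main(String[] args) throws Exception {
        CoreContainer container = new CoreContainer.Initializer().initialize();
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "core0");

        // Target the DIH handler registered in solrconfig.xml (path assumed).
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/dataimport");
        req.setParam("command", "full-import");
        req.setParam("commit", "true");
        // The "dummy stream" workaround: with a content stream present,
        // DIH runs the import synchronously instead of spawning a thread.
        req.addContentStream(new ContentStreamBase.StringStream("dummy"));

        server.request(req);   // returns only after the import finishes
        container.shutdown();
      }
    }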



-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


Re: Multi language support

2010-01-12 Thread Walter Underwood
There is a band named "The The". And a producer named "Don Was". For a list of 
all-stopword movie titles at Netflix, see this post:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

My favorite is "To Be and To Have (Être et Avoir)", which is all stopwords in 
two languages. And a very good movie.

wunder

On Jan 12, 2010, at 6:55 PM, Robert Muir wrote:

> sorry, i forgot to include this 2009 paper comparing what stopwords do
> across 3 languages:
> 
> http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf
> 
> in my opinion, if stopwords annoy your users for very special cases
> like 'the the' then, instead consider using commongrams +
> defaultsimilarity.discountOverlaps = true so that you still get the
> benefits.
> 
> as you can see from the above paper, they can be extremely important
> depending on the language, they just don't matter so much for English.
> 
> On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog  wrote:
>> There are a lot of projects that don't use stopwords any more. You
>> might consider dropping them altogether.
>> 
>> On Mon, Jan 11, 2010 at 2:25 PM, Don Werve  wrote:
>>> This is the way I've implemented multilingual search as well.
>>> 
>>> 2010/1/11 Markus Jelsma 
>>> 
 Hello,
 
 
 We have implemented language specific search in Solr using language
 specific fields and field types. For instance, an en_text field type can
 use an English stemmer, and list of stopwords and synonyms. We, however
 did not use specific stopwords, instead we used one list shared by both
 languages.
 
 So you would have a field type like:
 >>>  
  
  
 
 etc etc.
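
A rough sketch of the kind of en_text field type Markus describes (the analyzer
components below are illustrative assumptions, not his actual configuration):

    <fieldType name="en_text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>
    <field name="title_en" type="en_text" indexed="true" stored="true"/>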
 
 
 
 Cheers,
 
 -
 Markus Jelsma  Buyways B.V.
 Technisch ArchitectFriesestraatweg 215c
 http://www.buyways.nl  9743 AD Groningen
 
 
 Alg. 050-853 6600  KvK  01074105
 Tel. 050-853 6620  Fax. 050-3118124
 Mob. 06-5025 8350  In: http://www.linkedin.com/in/markus17
 
 
 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:
 
> Hi Solr users.
> 
> I'm trying to set up a site with Solr search integrated. And I use the
> SolJava API to feed the index with search documents. At the moment I
> have only activated search on the English portion of the site. I'm
> interested in using as many features of solr as possible. Synonyms,
> Stopwords and stems all sounds quite interesting and useful but how do
> I set up this in a good way for a multilingual site?
> 
> The site doesn't have a huge text mass, so performance issues don't
> really bother me, but I'd still like to hear your suggestions before I
> try to implement a solution.
> 
> Best regards
> 
> Daniel
 
>>> 
>> 
>> 
>> 
>> --
>> Lance Norskog
>> goks...@gmail.com
>> 
> 
> 
> 
> -- 
> Robert Muir
> rcm...@gmail.com
> 



Re: question about date boosting

2010-01-12 Thread Joe Calderon

I think you need to use the new trieDateField
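
For reference, a hedged sketch of that change (the field and type names are
assumptions modeled on the Solr 1.4 example schema):

    <!-- schema.xml: a trie-based date type is what the ms() function requires -->
    <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
               precisionStep="6" positionIncrementGap="0"/>
    <field name="pub_date" type="tdate" indexed="true" stored="true"/>

The wiki FAQ's boost then becomes something like
q={!boost b=recip(ms(NOW,pub_date),3.16e-11,1,1)}your query.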
On 01/12/2010 07:06 PM, Daniel Higginbotham wrote:

Hello,

I'm trying to boost results based on date using the first example 
here:http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents 



However, I'm getting an error that reads, "Can't use ms() function on 
non-numeric legacy date field"


The date field uses solr.DateField . What am I doing wrong?

Thank you!
Daniel Higginbotham




Need help Migrating to Solr

2010-01-12 Thread Abin Mathew
Hi

I am new to the Solr technology. We have been using Lucene to handle
searching in our web application www.toostep.com, which is a knowledge-sharing
platform developed in Java using the Spring MVC architecture and iBatis
as the persistence framework. Now that the application is getting very
complex, we have decided to implement Solr on top of Lucene.
If anyone has expertise in this area, please give me some guidelines on where
to start and how to form the schema for Solr.

Thanks and Regards
Abin Mathew


Re: What is this error means?

2010-01-12 Thread Israel Ekpo
Ellery,

A preliminary look at the source code indicates that the error is happening
because the Solr server is taking longer than expected to respond to the
client.

http://code.google.com/p/solr-php-client/source/browse/trunk/Apache/Solr/Service.php

The default timeout handed down to Apache_Solr_Service::_sendRawPost() is 60
seconds, since you were calling the addDocument() method.

So if it took longer than that (1 minute), it will exit with that error
message.

You will have to increase the default value to something very high, like 10
minutes or so, on line 252 in the source code, since there is no way to
specify it in the constructor or the addDocument method.

Another alternative is to update default_socket_timeout in the
php.ini file, or in the code using ini_set.

I hope that helps



On Tue, Jan 12, 2010 at 9:33 PM, Ellery Leung  wrote:

>
> Hi, here is the stack trace:
>
> 
> Fatal error:  Uncaught exception 'Exception' with message '"0" Status: Communication Error'
>  in C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php:385
> Stack trace:
> #0 C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php(652): Apache_Solr_Service->_sendRawPost('http://127.0.0...', '...')
> #1 C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php(676): Apache_Solr_Service->add('...')
> #2 C:\nginx\html\apps\milio\lib\System\classes\SolrSearchEngine.class.php(221): Apache_Solr_Service->addDocument(Object(Apache_Solr_Document))
> #3 C:\nginx\html\apps\milio\lib\System\classes\SolrSearchEngine.class.php(262): SolrSearchEngine->buildIndex(Array, 'key')
> #4 C:\nginx\html\apps\milio\lib\System\classes\Indexer\Indexer.class.php(51): SolrSearchEngine->createFullIndex('contacts', Array, 'key', 'www')
> #5 C:\nginx\html\apps\milio\lib\System\functions\createIndex.php(64): Indexer->create('www')
> #6 {main}
>   thrown in C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php on line 385
>
> C:\nginx\html\apps\milio\htdocs\Contacts>pause
> Press any key to continue . . .
>
> Thanks for helping me.
>
>
> Grant Ingersoll-6 wrote:
> >
> > Do you have a stack trace?
> >
> > On Jan 12, 2010, at 2:54 AM, Ellery Leung wrote:
> >
> >> When I am building the index for around 2 ~ 25000 records, sometimes
> >> I
> >> came across with this error:
> >>
> >>
> >>
> >> Uncaught exception "Exception" with message '0' Status: Communication
> >> Error
> >>
> >>
> >>
> >> I search Google & Yahoo but no answer.
> >>
> >>
> >>
> >> I am now committing document to solr on every 10 records fetched from a
> >> SQLite Database with PHP 5.3.
> >>
> >>
> >>
> >> Platform: Windows 7 Home
> >>
> >> Web server: Nginx
> >>
> >> Solr Specification Version: 1.4.0
> >>
> >> Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06
> >> 12:33:40
> >>
> >> Lucene Specification Version: 2.9.1
> >>
> >> Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25
> >>
> >> Solr hosted in jetty 6.1.3
> >>
> >>
> >>
> >> All the above are in one single test machine.
> >>
> >>
> >>
> >> The situation is that sometimes when I build the index, it can be
> created
> >> successfully.  But sometimes it will just stop with the above error.
> >>
> >>
> >>
> >> Any clue?  Please help.
> >>
> >>
> >>
> >> Thank you in advance.
> >>
> >
> >
> >
>
> --
> View this message in context:
> http://old.nabble.com/What-is-this-error-means--tp27123815p27138658.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


question about date boosting

2010-01-12 Thread Daniel Higginbotham

Hello,

I'm trying to boost results based on date using the first example 
here:http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents

However, I'm getting an error that reads, "Can't use ms() function on  
non-numeric legacy date field"


The date field uses solr.DateField . What am I doing wrong?

Thank you!
Daniel Higginbotham

Re: Encountering a roadblock with my Solr schema design...use dedupe?

2010-01-12 Thread Lance Norskog
Field Collapsing is what you want - this is a classic problem with
retail store product indexing and everyone uses field collapsing.
(That is, everyone who is willing to apply the patch on their own
code.)

Dedupe is completely the wrong word. Deduping is something else
entirely - it is about trying not to index the same document twice.

On Tue, Jan 12, 2010 at 11:30 AM, Kelly Taylor  wrote:
>
> David,
>
> Thanks, and yes, I decided to travel that path last night (applying SOLR-236
> patch) and plan to have some results by the end of the day; I'll post a
> summary.
>
> I read about field collapsing in your book last night. The book is an
> excellent resource by the way (shameless commendation plug!), and it made me
> laugh to find out that my use case is crazy!
>
> Regarding dedupe, I'm not sure either.  The component is mentioned in an
> article by Amit Nithianandan
> (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Solr-and-RDBMS-design-basics).
> I had concluded from the section entitled, "Comparing the Solr Approach with
> the RDBMS," that the dedupe component was somehow used as a "field
> collapsing" alternative (in my mind anyway) but I couldn't find a real-world
> example.
>
> Amit says, "...I might create an index with multiple documents or records
> for the same exact wiper blade, each document having different location data
> (lat/long, address, etc.) to represent an individual store. Solr has a
> de-duplication component to help show unique documents in case that
> particular wiper blade is available in multiple stores near me..."
>
> In my case, I was attempting to equate Amit's "wiper blade" with my
> "product" entity, and his "individual store" my "SKU" entity.
>
> Thanks again.
>
> -Kelly
>
>
> David Smiley @MITRE.org wrote:
>>
>> Kelly,
>> This is a good question you have posed and illustrates a challenge with
>> Solr's limited schema.  I don't see how the dedup will help.  I would
>> continue with the SKU based approach and use this patch:
>> https://issues.apache.org/jira/browse/SOLR-236
>> You'll collapse on the product id.  My book, p.192, highlights this
>> component as it existed when I wrote it but it has been updated since
>> then.
>>
>> A recent separate question by you on this list suggests you're going down
>> this path.  I would grab the attached SOLR-236.patch file and attempt to
>> apply it to the 1.4 source.
>>
>> ~ David Smiley
>> Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
>>
>
> --
> View this message in context: 
> http://old.nabble.com/Encountering-a-roadblock-with-my-Solr-schema-design...use-dedupe--tp27118977p27131969.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Multi language support

2010-01-12 Thread Robert Muir
sorry, i forgot to include this 2009 paper comparing what stopwords do
across 3 languages:

http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

in my opinion, if stopwords annoy your users for very special cases
like 'the the' then, instead consider using commongrams +
defaultsimilarity.discountOverlaps = true so that you still get the
benefits.

as you can see from the above paper, they can be extremely important
depending on the language, they just don't matter so much for English.
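
A rough schema.xml sketch of that commongrams setup (the type name, tokenizer
choice and stopwords file are assumptions, not from this thread):

    <fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- keep stopwords, but also index pairs such as "the_the" -->
        <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>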

On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog  wrote:
> There are a lot of projects that don't use stopwords any more. You
> might consider dropping them altogether.
>
> On Mon, Jan 11, 2010 at 2:25 PM, Don Werve  wrote:
>> This is the way I've implemented multilingual search as well.
>>
>> 2010/1/11 Markus Jelsma 
>>
>>> Hello,
>>>
>>>
>>> We have implemented language specific search in Solr using language
>>> specific fields and field types. For instance, an en_text field type can
>>> use an English stemmer, and list of stopwords and synonyms. We, however
>>> did not use specific stopwords, instead we used one list shared by both
>>> languages.
>>>
>>> So you would have a field type like:
>>> >>  
>>>  
>>>  
>>>
>>> etc etc.
>>>
>>>
>>>
>>> Cheers,
>>>
>>> -
>>> Markus Jelsma          Buyways B.V.
>>> Technisch Architect    Friesestraatweg 215c
>>> http://www.buyways.nl  9743 AD Groningen
>>>
>>>
>>> Alg. 050-853 6600      KvK  01074105
>>> Tel. 050-853 6620      Fax. 050-3118124
>>> Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
>>>
>>>
>>> On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:
>>>
>>> > Hi Solr users.
>>> >
>>> > I'm trying to set up a site with Solr search integrated. And I use the
>>> > SolJava API to feed the index with search documents. At the moment I
>>> > have only activated search on the English portion of the site. I'm
>>> > interested in using as many features of solr as possible. Synonyms,
>>> > Stopwords and stems all sounds quite interesting and useful but how do
>>> > I set up this in a good way for a multilingual site?
>>> >
>>> > The site doesn't have a huge text mass, so performance issues don't
>>> > really bother me, but I'd still like to hear your suggestions before I
>>> > try to implement a solution.
>>> >
>>> > Best regards
>>> >
>>> > Daniel
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
Robert Muir
rcm...@gmail.com


Re: Multi language support

2010-01-12 Thread Robert Muir
I don't think this is something to consider across the board for all
languages. The same grammatical units that are part of a word in one
language (and removed by stemmers) are independent morphemes in others
(and should be stopwords)

so please take this advice on a case-by-case basis for each language.

On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog  wrote:
> There are a lot of projects that don't use stopwords any more. You
> might consider dropping them altogether.
>
> On Mon, Jan 11, 2010 at 2:25 PM, Don Werve  wrote:
>> This is the way I've implemented multilingual search as well.
>>
>> 2010/1/11 Markus Jelsma 
>>
>>> Hello,
>>>
>>>
>>> We have implemented language specific search in Solr using language
>>> specific fields and field types. For instance, an en_text field type can
>>> use an English stemmer, and list of stopwords and synonyms. We, however
>>> did not use specific stopwords, instead we used one list shared by both
>>> languages.
>>>
>>> So you would have a field type like:
>>> >>  
>>>  
>>>  
>>>
>>> etc etc.
>>>
>>>
>>>
>>> Cheers,
>>>
>>> -
>>> Markus Jelsma          Buyways B.V.
>>> Technisch Architect    Friesestraatweg 215c
>>> http://www.buyways.nl  9743 AD Groningen
>>>
>>>
>>> Alg. 050-853 6600      KvK  01074105
>>> Tel. 050-853 6620      Fax. 050-3118124
>>> Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
>>>
>>>
>>> On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:
>>>
>>> > Hi Solr users.
>>> >
>>> > I'm trying to set up a site with Solr search integrated. And I use the
>>> > SolJava API to feed the index with search documents. At the moment I
>>> > have only activated search on the English portion of the site. I'm
>>> > interested in using as many features of solr as possible. Synonyms,
>>> > Stopwords and stems all sounds quite interesting and useful but how do
>>> > I set up this in a good way for a multilingual site?
>>> >
>>> > The site don't have a huge text mass so performance issues don't
>>> > really bother me but still I'd like to hear your suggestions before I
>>> > try to implement an solution.
>>> >
>>> > Best regards
>>> >
>>> > Daniel
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
Robert Muir
rcm...@gmail.com


Re: What is this error means?

2010-01-12 Thread Ellery Leung

Hi, here is the stack trace:


Fatal error:  Uncaught exception 'Exception' with message '"0" Status: Communication Error'
 in C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php:385
Stack trace:
#0 C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php(652): Apache_Solr_Service->_sendRawPost('http://127.0.0...', '...')
#1 C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php(676): Apache_Solr_Service->add('...')
#2 C:\nginx\html\apps\milio\lib\System\classes\SolrSearchEngine.class.php(221): Apache_Solr_Service->addDocument(Object(Apache_Solr_Document))
#3 C:\nginx\html\apps\milio\lib\System\classes\SolrSearchEngine.class.php(262): SolrSearchEngine->buildIndex(Array, 'key')
#4 C:\nginx\html\apps\milio\lib\System\classes\Indexer\Indexer.class.php(51): SolrSearchEngine->createFullIndex('contacts', Array, 'key', 'www')
#5 C:\nginx\html\apps\milio\lib\System\functions\createIndex.php(64): Indexer->create('www')
#6 {main}
  thrown in C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php on line 385

C:\nginx\html\apps\milio\htdocs\Contacts>pause
Press any key to continue . . .

Thanks for helping me.


Grant Ingersoll-6 wrote:
> 
> Do you have a stack trace?  
> 
> On Jan 12, 2010, at 2:54 AM, Ellery Leung wrote:
> 
>> When I am building the index for around 2 ~ 25000 records, sometimes
>> I
>> came across with this error:
>> 
>> 
>> 
>> Uncaught exception "Exception" with message '0' Status: Communication
>> Error
>> 
>> 
>> 
>> I search Google & Yahoo but no answer.
>> 
>> 
>> 
>> I am now committing document to solr on every 10 records fetched from a
>> SQLite Database with PHP 5.3.
>> 
>> 
>> 
>> Platform: Windows 7 Home
>> 
>> Web server: Nginx
>> 
>> Solr Specification Version: 1.4.0
>> 
>> Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06
>> 12:33:40
>> 
>> Lucene Specification Version: 2.9.1
>> 
>> Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25
>> 
>> Solr hosted in jetty 6.1.3
>> 
>> 
>> 
>> All the above are in one single test machine.
>> 
>> 
>> 
>> The situation is that sometimes when I build the index, it can be created
>> successfully.  But sometimes it will just stop with the above error.
>> 
>> 
>> 
>> Any clue?  Please help.
>> 
>> 
>> 
>> Thank you in advance.
>> 
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/What-is-this-error-means--tp27123815p27138658.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: EOF IOException Query

2010-01-12 Thread Lance Norskog
The index files are corrupted. You have to create the index again from scratch.

This should have been reported as a CorruptIndexException. The code handling
index files does not catch all exceptions and wrap them as it should.
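
Before rebuilding, Lucene's CheckIndex tool is one way to confirm (and, at the
cost of losing the bad segments, repair) the corruption; the jar path and index
path below are assumptions:

    java -cp lucene-core-2.9.1.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index
    java -cp lucene-core-2.9.1.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index -fix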

On Mon, Jan 11, 2010 at 3:10 PM, Osborn Chan  wrote:
> Hi all,
>
> I got following exception for SOLR, but the index is still searchable. (At 
> least it is searchable for query "*:*".)
> I am just wondering what is the root cause.
>
> Thanks,
> Osborn
>
> INFO: [publicGalleryPostMaster] webapp=/multicore path=/select 
> params={wt=javabin&rows=12&start=0&sort=/gallery/1/postlist/1Rank_i+desc&q=%2B(comm
> unityList_s_m:/gallery/1/postlist/1)+%2Bstate_s:A&version=1} status=500 
> QTime=3
> Jan 11, 2010 12:23:01 PM org.apache.solr.common.SolrException log
> SEVERE: java.io.IOException: read past EOF
>        at 
> org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
>        at 
> org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>        at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:80)
>        at 
> org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:112)
>        at 
> org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:712)
>        at 
> org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:208)
>        at 
> org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:676)
>        at 
> org.apache.lucene.search.FieldComparator$StringOrdValComparator.setNextReader(FieldComparator.java:667)
>        at 
> org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:94)
>        at 
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:245)
>        at org.apache.lucene.search.Searcher.search(Searcher.java:171)
>        at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
>        at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
>        at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
>        at 
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
>        at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
>        at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>        at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
>        at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>        at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>



-- 
Lance Norskog
goks...@gmail.com


Re: Multi language support

2010-01-12 Thread Lance Norskog
There are a lot of projects that don't use stopwords any more. You
might consider dropping them altogether.

On Mon, Jan 11, 2010 at 2:25 PM, Don Werve  wrote:
> This is the way I've implemented multilingual search as well.
>
> 2010/1/11 Markus Jelsma 
>
>> Hello,
>>
>>
>> We have implemented language specific search in Solr using language
>> specific fields and field types. For instance, an en_text field type can
>> use an English stemmer, and list of stopwords and synonyms. We, however
>> did not use specific stopwords, instead we used one list shared by both
>> languages.
>>
>> So you would have a field type like:
>> >  
>>  
>>  
>>
>> etc etc.
>>
>>
>>
>> Cheers,
>>
>> -
>> Markus Jelsma          Buyways B.V.
>> Technisch Architect    Friesestraatweg 215c
>> http://www.buyways.nl  9743 AD Groningen
>>
>>
>> Alg. 050-853 6600      KvK  01074105
>> Tel. 050-853 6620      Fax. 050-3118124
>> Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
>>
>>
>> On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:
>>
>> > Hi Solr users.
>> >
>> > I'm trying to set up a site with Solr search integrated. And I use the
>> > SolJava API to feed the index with search documents. At the moment I
>> > have only activated search on the English portion of the site. I'm
>> > interested in using as many features of solr as possible. Synonyms,
>> > Stopwords and stems all sounds quite interesting and useful but how do
>> > I set up this in a good way for a multilingual site?
>> >
>> > The site doesn't have a huge text mass, so performance issues don't
>> > really bother me, but I'd still like to hear your suggestions before I
>> > try to implement a solution.
>> >
>> > Best regards
>> >
>> > Daniel
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

2010-01-12 Thread Lance Norskog
You can do this stripping in the DataImportHandler. You would have to
write your own stripping code using regular expressions. Also, the
ExtractingRequestHandler strips out the html markup when you use it to
index an html file:

http://wiki.apache.org/solr/ExtractingRequestHandler
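
A hedged data-config.xml sketch of the regex approach (the entity, column names
and the regex itself are assumptions; DIH's RegexTransformer does the
replacement):

    <entity name="article" transformer="RegexTransformer"
            query="SELECT id, body_html FROM articles">
      <!-- strip anything tag-like before the value reaches the "body" field -->
      <field column="body" sourceColName="body_html" regex="&lt;[^>]*>" replaceWith=""/>
    </entity>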

On Mon, Jan 11, 2010 at 1:43 PM, darniz  wrote:
>
> no problem
>
> Erick Erickson wrote:
>>
>> Ah, I read your post too fast and ignored the title. Sorry 'bout that.
>>
>> Erick
>>
>> On Mon, Jan 11, 2010 at 2:55 PM, darniz  wrote:
>>
>>>
>>> Well that's the whole discussion we are talking about.
>>> I had the impression that the html tags are filtered and then the field is
>>> stored without tags. But it looks like the html tags are removed only for
>>> indexing, and the actual text is stored in raw format.
>>>
>>> Let's say, for example, I enter a field like
>>> honda car road review
>>> When I do analysis on the body field, the html filter removes the tag and
>>> indexes the words honda, car, road, review. But when I fetch the body field to
>>> display in my document it returns honda car road review
>>>
>>> I hope i make sense.
>>> thanks
>>> darniz
>>>
>>>
>>>
>>> Erick Erickson wrote:
>>> >
>>> > This page: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>> > shows you
>>> > many
>>> > of the SOLR analyzers and filters. Would one of
>>> > the various *HTMLStrip* stuff work?
>>> >
>>> > HTH
>>> > ERick
>>> >
>>> > On Mon, Jan 11, 2010 at 2:44 PM, darniz 
>>> wrote:
>>> >
>>> >>
>>> >> Thanks, we were having the same issue.
>>> >> We are trying to store article content and we are storing a field like
>>> >> This article is for blah .
>>> >> When I see the analysis.jsp page it does strip out the tags and the field is
>>> >> indexed, but when we fetch the document it returns the field with the
>>> >> tags.
>>> >> From solr's point of view that's correct, but our issue is that these kinds of
>>> >> html tags are screwing up the display of our page. Is there an easy way to
>>> >> ensure the html tags are stripped out, or do we have to take care of it
>>> >> manually.
>>> >>
>>> >> Thanks
>>> >> Rashid
>>> >>
>>> >>
>>> >> aseem cheema wrote:
>>> >> >
>>> >> > Alright. It turns out that escapedTags is not for what I thought it
>>> is
>>> >> > for.
>>> >> > The problem that I am having with HTMLStripCharFilterFactory is that
>>> >> > it strips the html while indexing the field, but not while storing
>>> the
>>> >> > field. That is why what I see in analysis.jsp, which is index
>>> >> > analysis, does not match what gets stored... because.. well HTML is
>>> >> > stripped only for indexing. Makes so much sense.
>>> >> >
>>> >> > Thanks to Ryan McKinley for clarifying this.
>>> >> > Aseem
>>> >> >
>>> >> > On Wed, Nov 11, 2009 at 9:50 AM, aseem cheema
>>> 
>>> >> > wrote:
>>> >> >> I am trying to post a document with the following content using
>>> SolrJ:
>>> >> >> content
>>> >> >> I need the xml/html tags to be ignored. Even though this works fine
>>> in
>>> >> >> analysis.jsp, this does not work with SolrJ, as the client escapes
>>> the
>>> >> >> < and > with &lt; and &gt;, and HTMLStripCharFilterFactory does not
>>> >> >> strip those escaped tags. How can I achieve this? Any ideas will be
>>> >> >> highly appreciated.
>>> >> >>
>>> >> >> There is escapedTags in HTMLStripCharFilterFactory constructor. Is
>>> >> >> there a way to get that to work?
>>> >> >> Thanks
>>> >> >> --
>>> >> >> Aseem
>>> >> >>
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Aseem
>>> >> >
>>> >> >
>>> >>
>>> >> --
>>> >> View this message in context:
>>> >>
>>> http://old.nabble.com/XmlUpdateRequestHandler-with-HTMLStripCharFilterFactory-tp26305561p27116434.html
>>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>>> >>
>>> >>
>>> >
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://old.nabble.com/XmlUpdateRequestHandler-with-HTMLStripCharFilterFactory-tp26305561p27116601.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>
> --
> View this message in context: 
> http://old.nabble.com/XmlUpdateRequestHandler-with-HTMLStripCharFilterFactory-tp26305561p27118304.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: LongField not stripping leading zeros

2010-01-12 Thread Chris Hostetter

: Thanks. Is there any performance penalty vs. LongField? I don't need to 

The other ones do normalization by converting to a Long internally -- i 
have no idea if you would see some micro performance benefit in doing 
the 0 stripping yourself.

Sorting a LongField should take less RAM than a SortableLongField (because 
the Sortable*Fields use String based FieldCaches to support sortMissing*) 
but if you aren't doing any sorting or range queries on the field that 
shouldn't matter.


-Hoss



Re: LongField not stripping leading zeros

2010-01-12 Thread Kevin Osborn
Thanks. Is there any performance penalty vs. LongField? I don't need to do any 
range queries on these values. I am basically treating them as numerical 
strings. I thought it would just be a shortcut to strip leading zeros, which I 
can easily do on my own.





From: Chris Hostetter 
To: Solr 
Sent: Tue, January 12, 2010 3:16:13 PM
Subject: Re: LongField not stripping leading zeros

: I have some text in our database in the form 0088698183939. The leading 
: zeros are useless, but I want to be able to search it with no leading zeros 
: or several leading zeros. So, I decided to index this as a long, 
: expecting it to just store it as a number. But, instead, I see this in 
: the index:

Note the comments about LongField in the example schema...

  Plain numeric field types that store and index the text
  value verbatim (and hence don't support range queries, since the
  lexicographic ordering isn't equal to the numeric ordering)

...LongField, IntField, etc. all just index/store the exact value you 
put in -- the only distinctions between them and StrField are that they are 
rendered back as a numeric type (by the response writers) and that they use the 
numerically typed FieldCache for sorting.

You should be using TrieLongField (or SortableLongField if you need 
sortMissing* type functionality)


-Hoss


  

Re: LongField not stripping leading zeros

2010-01-12 Thread Chris Hostetter
: I have some text in our database in the form 0088698183939. The leading 
: zeros are useless, but I want to be able to search it with no leading zeros 
: or several leading zeros. So, I decided to index this as a long, 
: expecting it to just store it as a number. But, instead, I see this in 
: the index:

Note the comments about LongField in the example schema...

  Plain numeric field types that store and index the text
  value verbatim (and hence don't support range queries, since the
  lexicographic ordering isn't equal to the numeric ordering)

...LongField, IntField, etc. all just index/store the exact value you 
put in -- the only distinctions between them and StrField are that they are 
rendered back as a numeric type (by the response writers) and that they use the 
numerically typed FieldCache for sorting.

You should be using TrieLongField (or SortableLongField if you need 
sortMissing* type functionality)
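
A hedged schema.xml sketch (names modeled on the 1.4 example schema; note that
TrieLongField is a Solr 1.4 type, so on the 1.3 index described above
SortableLongField is the available route):

    <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8"
               omitNorms="true" positionIncrementGap="0"/>
    <field name="partnum" type="tlong" indexed="true" stored="true"/>

    <!-- or, if sortMissingLast/sortMissingFirst behaviour is needed -->
    <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>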


-Hoss



LongField not stripping leading zeros

2010-01-12 Thread Kevin Osborn
This is in Solr 1.3.

I have some text in our database in the form 0088698183939. The leading zeros 
are useless, but I want to be able to search it with no leading zeros or several 
leading zeros. So, I decided to index this as a long, expecting it to just 
store it as a number. But, instead, I see this in the index:


   0088698183939


Shouldn't the leading zeros be gone since this is a number? And when I search 
on it, I only get a hit if I include two leading zeros. I could just clean 
everything at both index and search time, but could someone explain what is 
going on here? Thanks.



  

Re: updating solr server

2010-01-12 Thread Yonik Seeley
On Tue, Jan 12, 2010 at 2:53 PM, Smith G  wrote:
> 4) queuesize parameter of the Streaming constructor: What could be the
> rough value when it comes
> to a real-time application having a million+ documents to be indexed? ..
>           So what is "queuesize" exactly for, if we can go on
> adding as many as we can?

The queue provides a buffer between the document producer(s) (your
application code) and the consumers (the solrj impl that sends the doc
to solr) and helps to further increase concurrency.

Consider the following scenario if you had only 1 document producer
thread (the common case) and no queue inbetween.
Your producer creates a document, and adds it... but since all of the
consumer threads are busy, it blocks.
Next, consumer threads #1 and #2 both become ready at the same time.
Consumer thread #1 takes your document, unblocking your producer thread
to create another document.  But in the meantime, consumer thread #2
sits idle.

It probably doesn't make sense to set the buffer above 10, unless your
document creation is really "bursty" and you need the additional
buffering to make sure that indexing threads are kept busy.
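
A minimal SolrJ sketch (the URL, queue size and thread count are illustrative):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class StreamingIndexer {
      public static void main(String[] args) throws Exception {
        // queueSize = 10 buffers docs between your producer thread and the
        // consumer threads; threadCount = 4 concurrent connections to Solr.
        SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 10, 4);

        for (int i = 0; i < 1000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-" + i);
          doc.addField("title", "example " + i);
          server.add(doc);   // returns quickly; a consumer thread streams it out
        }
        server.commit();     // once the whole batch has been queued
      }
    }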

-Yonik
http://www.lucidimagination.com


Localsolr wt=json and fl compatible?

2010-01-12 Thread Brian Westphal

We've got Localsolr (2.9.1 lucene-spatial library) running on Solr 1.4 with
Tomcat 1.6. Everything's looking good, except for a couple little issues.

If we specify fl=id (or fl= anything) and wt=json it seems that the fl
parameter is ignored (thus we get a lot more detail in our results than we'd
like).

If we specify fl=id and leave out wt=json (which defaults to returning xml
results), we get the expected fields back. We'd really prefer to use wt=json
because the results are easier for us to deal with (the same issue
also arises with wt=python and wt=ruby).

-- 
View this message in context: 
http://old.nabble.com/Localsolr-wt%3Djson-and-fl-compatible--tp27131973p27131973.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: updating solr server

2010-01-12 Thread Smith G
 Hello,
 Yeah, to be brief: I wanted to read documents and update them
simultaneously with different threads. The main issue I considered is how
many documents to call add / commit for, because I cannot keep
adding millions of documents one after another to
StreamingUpdateSolrServer and just sit idle assuming it will take care
of everything; doing so is not possible because of memory issues.
So, if there is a case where I can split the document set into
optimal-sized batches, then I can also go for multiple threads when
updating.

Most of my doubts are solved. Thanks for your responses.

 :: "The beauty of StreamingUpdateSolrServer is that you don't have to
worry about batch sizes " ::, So now I can just forget about batch
sizes, etc. Just keep going on adding as many as I want.

  There is one more issue.. point 4 in my first mail.
4) queuesize parameter of the Streaming constructor: What could be the
rough value when it comes
to a real-time application having a million+ documents to be indexed? ..
   So what is "queuesize" exactly for, if we can go on
adding as many as we can?

Thanks alot.

2010/1/12 Yonik Seeley :
> On Tue, Jan 12, 2010 at 1:09 PM, Smiley, David W.  wrote:
>> The beauty of StreamingUpdateSolrServer is that you don't have to worry 
>> about batch sizes; it streams them all.  Just keep calling add() with one 
>> document and it'll get enqueued.  You can pass a collection but there's no 
>> performance benefit.
>
> Right - and the problem with building your own collection and passing
> it is that it's not being streamed (if it takes any time to build
> those docs - like reading from a DB - then that thread may be idle for
> some amount of time).  If you separate and make document production
> asynchronous from document sending, then you've just re-invented
> StreamingUpdateSolrServer.
>
> I'd really recommend just starting with StreamingUpdateSolrServer for
> any amount of indexing.
>
> -Yonik
> http://www.lucidimagination.com
>


Re: What is the proper way to deploy Solr with a custom schema.xml that requires extra JARs?

2010-01-12 Thread Otis Gospodnetic
> I can't put the extra JARs in the Solr home dir's lib subdir, can I?

Why, this is indeed what you should do, Kuro.
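
A sketch of the layout Otis means (the jar and file names are placeholders):

    solr-home/
      lib/
        my-tokenizer.jar     <- extra JARs, picked up by the core's classloader at startup
      conf/
        schema.xml           <- references the custom TokenizerFactory by class name
        solrconfig.xml
      data/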

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Teruhiko Kurosaka 
> To: "solr-user@lucene.apache.org" 
> Sent: Tue, January 12, 2010 1:23:37 PM
> Subject: What is the proper way to deploy Solr with a custom  schema.xml that 
> requires extra JARs?
> 
> I have schema.xml that uses a Tokenizer that I wrote.
> 
> I understand the standard way of deploying Solr is
> to place solr.war in webapps directory, have a separate
> directory that has conf files under its conf subdirectory,
> and specify that directory as Solr home dir via either 
> JVM property or JNDI.
> 
> I can't put the extra JARs in the Solr home dir's lib subdir,
> can I?
> 
> Is there any elegant way of placing the extra JARs
> other than expanding the war in webapp directory manually
> and adding the JARs?
> 
> -kuro



Re: Solr 1.4 Field collapsing - What are the steps for applying the SOLR-236 patch?

2010-01-12 Thread Martijn v Groningen
I wouldn't use the patches on the sub-issues right now as they are
still under development (they are currently a POC). I also think
that the latest patch in SOLR-236 is currently the best option. There
are some memory-related problems with the patch that have to do with
caching. The fieldCollapse cache requires a lot of memory (best is not
to use it right now). The filterCache also becomes quite large as
well. Depending on the size of your corpus you would need to increase
your heap size and play around with that.

Martijn
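
A rough sequence for Kelly's original question, assuming a unix shell and the
trunk revision Joe mentions below; the exact patch file name depends on which
attachment you take from the JIRA issue:

    svn checkout -r 892336 http://svn.apache.org/repos/asf/lucene/solr/trunk solr-trunk
    cd solr-trunk
    patch -p0 --dry-run < /path/to/SOLR-236.patch   # check for rejects first
    patch -p0 < /path/to/SOLR-236.patch
    ant dist                                        # build the patched war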

2010/1/12 Joe Calderon :
> it seems to be in flux right now as the solr developers slowly make
> improvements and ingest the various pieces into the solr trunk, i think your
> best bet might be to use the 12/24 patch and fix any errors where it doesnt
> apply cleanly
>
> im using solr trunk r892336 with the 12/24 patch
>
>
> --joe
> On 01/11/2010 08:48 PM, Kelly Taylor wrote:
>>
>> Hi,
>>
>> Is there a step-by-step for applying the patch for SOLR-236 to enable
>> field
>> collapsing in Solr 1.4?
>>
>> Thanks,
>> Kelly
>>
>
>



-- 
Met vriendelijke groet,

Martijn van Groningen


Re: Replication problem

2010-01-12 Thread Jason Rutherglen
Multiple replicateAfter values, for example startup,optimize. I believe this
was causing the issue; I limited it to commit, and it started to work
(with no other changes to solrconfig.xml).
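
For reference, a sketch of the master section in question; multiple
replicateAfter values appear as repeated <str> tags (confFiles taken from the
snippets quoted below):

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">startup</str>
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,synonyms.txt,stopwords.txt,elevate.xml</str>
      </lst>
    </requestHandler>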

On Tue, Jan 12, 2010 at 11:24 AM, Yonik Seeley
 wrote:
> On Tue, Jan 12, 2010 at 2:17 PM, Jason Rutherglen
>  wrote:
>> It was having multiple replicateAfter values... Perhaps a bug, though
>> I probably won't spend time investigating the why right now, nor
>> reproducing in the test cases.
>
> Do you mean that you changed the config and now it's working?
> We have test cases with multiple replicateAfter values (or do you mean
> actual multiple tags in the XML?)
>
> -Yonik
> http://www.lucidimagination.com
>


Re: Meaning of this error: Failure to meet condition(s) of required/prohibited clause(s)???

2010-01-12 Thread Chris Hostetter

: Subject: Meaning of this error: Failure to meet condition(s) of
: required/prohibited clause(s)???

First of all: it's not an error -- it's a debugging statement generated when 
you asked for an explanation of a document's score...

: 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited
: clause(s)

It means that you have a BooleanQuery with some required or prohibited 
clauses, and the document didn't meet that condition.

The nested explanations go into more detail, soo...

:   0.0 = no match on required clause (((subKeywords:lcd^0.1 |
: keywords:lcd^0.5 | defaultKeywords:lcd | contributors:lcd^0.5 | title:lcd)
: (subKeywords:televis^0.1 | keywords:tvs^0.5 | defaultKeywords:tvs |
: contributors:tvs^0.5 | (title:televis title:tv title:tvs)))~1)

that entire thing is a BooleanQuery which is a required clause 
of the "outer" BooleanQuery mentioned previously, and this 
document doesn't match it.  The nested explanation tells you why...

: 0.0 = (NON-MATCH) Failure to match minimum number of optional clauses: 1

...so there you go.  This BooleanQuery consists of two clauses which are 
each DisjunctionMaxQueries (one for lcd and one for tv) and the 
minNrShouldMatch property requires that at least one must match (that's 
the "~1" in the toString info above)

Based on another email where you provided some of your schema details (but 
not enough to get a clear picture of everything going on) the one thing i 
can tell you is why you didn't get a match on the "keywords" field even 
though it had the string "LCD TVs" ... the reason is that you are using 
the KeywordTokenizer which treats all of its input as a single token -- 
but whitespace is markup for the QueryParser (just like quotes and +/-) 
... it chunks the input on whitespace boundaries before consulting the 
field analyzers, so the resulting query doesn't search the keywords 
field for "lcd tvs" as a single string unless you include it in quotes.

(you do however see later in the score explanation that it is doing this, 
and finding a match, because you have keywords in the "pf" param, which 
constructs an implicit phrase query for your entire input string -- but 
this is only a boost, it can't help a document that doesn't match the qf 
fields in the first place).

-Hoss



Re: Encountering a roadblock with my Solr schema design...use dedupe?

2010-01-12 Thread Kelly Taylor

David,

Thanks, and yes, I decided to travel that path last night (applying SOLR-236
patch) and plan to have some results by the end of the day; I'll post a
summary.

I read about field collapsing in your book last night. The book is an
excellent resource by the way (shameless commendation plug!), and it made me
laugh to find out that my use case is crazy!

Regarding dedupe, I'm not sure either.  The component is mentioned in an
article by Amit Nithianandan 
(http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Solr-and-RDBMS-design-basics).
 
I had concluded from the section entitled, "Comparing the Solr Approach with
the RDBMS," that the dedupe component was somehow used as a "field
collapsing" alternative (in my mind anyway) but I couldn't find a real-world
example.

Amit says, "...I might create an index with multiple documents or records
for the same exact wiper blade, each document having different location data
(lat/long, address, etc.) to represent an individual store. Solr has a
de-duplication component to help show unique documents in case that
particular wiper blade is available in multiple stores near me..."

In my case, I was attempting to equate Amit's "wiper blade" with my
"product" entity, and his "individual store" my "SKU" entity.

Thanks again.

-Kelly


David Smiley @MITRE.org wrote:
> 
> Kelly,
> This is a good question you have posed and illustrates a challenge with
> Solr's limited schema.  I don't see how the dedup will help.  I would
> continue with the SKU based approach and use this patch:
> https://issues.apache.org/jira/browse/SOLR-236
> You'll collapse on the product id.  My book, p.192, highlights this
> component as it existed when I wrote it but it has been updated since
> then.
> 
> A recent separate question by you on this list suggests you're going down
> this path.  I would grab the attached SOLR-236.patch file and attempt to
> apply it to the 1.4 source.
> 
> ~ David Smiley
> Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
> 

-- 
View this message in context: 
http://old.nabble.com/Encountering-a-roadblock-with-my-Solr-schema-design...use-dedupe--tp27118977p27131969.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Replication problem

2010-01-12 Thread Yonik Seeley
On Tue, Jan 12, 2010 at 2:17 PM, Jason Rutherglen
 wrote:
> It was having multiple replicateAfter values... Perhaps a bug, though
> I probably won't spend time investigating the why right now, nor
> reproducing in the test cases.

Do you mean that you changed the config and now it's working?
We have test cases with multiple replicateAfter values (or do you mean
actual multiple tags in the XML?)

-Yonik
http://www.lucidimagination.com


NullPointerException in ReplicationHandler.postCommit + question about compression

2010-01-12 Thread Stephen Weiss

Hi Solr List,

We're trying to set up java-based replication with Solr 1.4 (dist  
tarball).  We are running this to start with on a pair of test servers  
just to see how things go.


There's one major problem we can't seem to get past.  When we  
replicate manually (via the admin page) things seem to go well.   
However, when replication is triggered by a commit event on the  
master, the master gets a NullPointerException and no replication  
seems to take place.



SEVERE: java.lang.NullPointerException
	at org.apache.solr.handler.ReplicationHandler$4.postCommit(ReplicationHandler.java:922)
	at org.apache.solr.update.UpdateHandler.callPostCommitCallbacks(UpdateHandler.java:78)
	at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:411)
	at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
	at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:169)
	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:336)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:239)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
	at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
	at org.mortbay.jetty.Server.handle(Server.java:324)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
	at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:879)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:741)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:213)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)





This is the master config:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="confFiles">solrconfig_slave.xml:solrconfig.xml,schema.xml,synonyms.txt,stopwords.txt,elevate.xml</str>
      <str name="commitReserveDuration">00:00:10</str>
    </lst>
  </requestHandler>


and... the slave config:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://hostname.obscured.com:8080/solr/calendar_core/replication</str>
      <str name="pollInterval">00:00:20</str>
      <str name="compression">internal</str>
      <str name="httpConnTimeout">5000</str>
      <str name="httpReadTimeout">1</str>
    </lst>
  </requestHandler>


Does anyone know off the top of their head what this might indicate,  
or know what further troubleshooting steps we should be taking to  
isolate the issue?


Also, on a (probably) unrelated topic, we're kinda confused by this  
section of the slave config:

      <str name="compression">internal</str>

Since we *are* on a LAN, what exactly should we be doing here?  The  
language is somewhat unclear... I thought that meant that we should  
just comment out the line altogether, but others think it means that  
we should leave it set to "internal".  We get that compression is  
probably unnecessary for our more vanilla setup, we're just not 100%  
sure how to express that correctly.


Thanks in advance for any advice!

--
Steve


Re: Replication problem

2010-01-12 Thread Jason Rutherglen
It was having multiple replicateAfter values... Perhaps a bug, though
I probably won't spend time investigating the why right now, nor
reproducing in the test cases.

On Tue, Jan 12, 2010 at 11:10 AM, Jason Rutherglen
 wrote:
> Hmm...Even with the IP address in the master URL on the slave, the
> indexversion command to the master mysteriously doesn't show the
> latest commit...  Totally freakin' bizarre!
>
> On Tue, Jan 12, 2010 at 10:53 AM, Jason Rutherglen
>  wrote:
>> There's a connect exception on the client, however I'd expect this to
>> show up in the slave replication console (it's not).  Is this correct
>> behavior (i.e. not showing replication errors)?
>>
>> On Mon, Jan 11, 2010 at 9:50 AM, Jason Rutherglen
>>  wrote:
>>> Yonik,
>>>
>>> I added startup to replicateAfter, however no dice... There's no
>>> errors the Tomcat log.
>>>
>>> The output of:
>>> http://localhost-master:8080/solr/main/replication?command=indexversion
>>> <response>
>>>   <lst name="responseHeader">
>>>     <int name="status">0</int>
>>>     <int name="QTime">0</int>
>>>   </lst>
>>>   <long name="indexversion">0</long>
>>>   <long name="generation">0</long>
>>> </response>
>>>
>>> The master replication UI:
>>> Local Index      Index Version: 1263182366335, Generation: 3
>>>        Location: /mnt/solr/main/data/index
>>>        Size: 1.08 KB
>>>
>>> Master solrconfig.xml, and tomcat was restarted:
>>>
>>> 
>>>    
>>>       true
>>>       
>>>
>>>       
>>>       >> name="confFiles">schema.xml,synonyms.txt,stopwords.txt,elevate.xml
>>>       
>
>       
>        name="confFiles">schema.xml,synonyms.txt,stopwords.txt,elevate.xml
>       

Re: Problem comitting on 40GB index

2010-01-12 Thread Erick Erickson
Huh?

On Tue, Jan 12, 2010 at 2:00 PM, Chris Hostetter
wrote:

>
> : Subject: Problem comitting on 40GB index
> : In-Reply-To: <
> 7a9c48b51001120345h5a57dbd4o8a8a39fc4a98a...@mail.gmail.com>
>
> http://people.apache.org/~hossman/#threadhijack
> Thread Hijacking on Mailing Lists
>
> When starting a new discussion on a mailing list, please do not reply to
> an existing message, instead start a fresh email.  Even if you change the
> subject line of your email, other mail headers still track which thread
> you replied to and your question is "hidden" in that thread and gets less
> attention.   It makes following discussions in the mailing list archives
> particularly difficult.
> See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
>
>
>
> -Hoss
>
>


NYC Search in the Cloud meetup: Jan 20

2010-01-12 Thread Otis Gospodnetic
Hello,

If "Search Engine Integration, Deployment and Scaling in the Cloud" sounds 
interesting to you, and you are going to be in or near New York next Wednesday 
(Jan 20) evening:

http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/

Sorry for dupes to those of you subscribed to multiple @lucene lists.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



Re: Replication problem

2010-01-12 Thread Jason Rutherglen
Hmm...Even with the IP address in the master URL on the slave, the
indexversion command to the master mysteriously doesn't show the
latest commit...  Totally freakin' bizarre!

On Tue, Jan 12, 2010 at 10:53 AM, Jason Rutherglen
 wrote:
> There's a connect exception on the client, however I'd expect this to
> show up in the slave replication console (it's not).  Is this correct
> behavior (i.e. not showing replication errors)?
>
> On Mon, Jan 11, 2010 at 9:50 AM, Jason Rutherglen
>  wrote:
>> Yonik,
>>
>> I added startup to replicateAfter, however no dice... There's no
>> errors the Tomcat log.
>>
>> The output of:
>> http://localhost-master:8080/solr/main/replication?command=indexversion
>> <response>
>>   <lst name="responseHeader">
>>     <int name="status">0</int>
>>     <int name="QTime">0</int>
>>   </lst>
>>   <long name="indexversion">0</long>
>>   <long name="generation">0</long>
>> </response>
>>
>> The master replication UI:
>> Local Index      Index Version: 1263182366335, Generation: 3
>>        Location: /mnt/solr/main/data/index
>>        Size: 1.08 KB
>>
>> Master solrconfig.xml, and tomcat was restarted:
>>
>> 
>>    
>>       true
>>       
>>
>>       
>>       > name="confFiles">schema.xml,synonyms.txt,stopwords.txt,elevate.xml
>>       

       
       >>> name="confFiles">schema.xml,synonyms.txt,stopwords.txt,elevate.xml
       

Re: Problem comitting on 40GB index

2010-01-12 Thread Chris Hostetter

: Subject: Problem comitting on 40GB index
: In-Reply-To: <7a9c48b51001120345h5a57dbd4o8a8a39fc4a98a...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss



Re: Replication problem

2010-01-12 Thread Jason Rutherglen
There's a connect exception on the client, however I'd expect this to
show up in the slave replication console (it's not).  Is this correct
behavior (i.e. not showing replication errors)?

On Mon, Jan 11, 2010 at 9:50 AM, Jason Rutherglen
 wrote:
> Yonik,
>
> I added startup to replicateAfter, however no dice... There's no
> errors the Tomcat log.
>
> The output of:
> http://localhost-master:8080/solr/main/replication?command=indexversion
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">0</int>
>   </lst>
>   <long name="indexversion">0</long>
>   <long name="generation">0</long>
> </response>
>
> The master replication UI:
> Local Index      Index Version: 1263182366335, Generation: 3
>        Location: /mnt/solr/main/data/index
>        Size: 1.08 KB
>
> Master solrconfig.xml, and tomcat was restarted:
>
> 
>    
>       true
>       
>
>       
>        name="confFiles">schema.xml,synonyms.txt,stopwords.txt,elevate.xml
>       
>>>
>>>       
>>>       >> name="confFiles">schema.xml,synonyms.txt,stopwords.txt,elevate.xml
>>>       

Re: Problem comitting on 40GB index

2010-01-12 Thread Erick Erickson
You'll be able to get some valuable info by monitoring your free space on
disk.

If this occurs again, it'd help if you posted your your SOLR
configuration and told us about any warmups you're doing...

Of course, there are always gremlins...

On Tue, Jan 12, 2010 at 12:36 PM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> I restarted the solr and stopped all searches. After that, the commit() was
> normal (2 secs) and it's been working for 3h without problems (indexing and
> a few searches too)... I haven't done any optimize yet, mainly because I had
> no deletes on the index and the performance is ok, so no need to optimize I
> think..
>
> I had tried this procedure a few times in the morning and the commit always
> hung, so... I have no explanation for it suddenly starting to work.
> I'm making a commit every 2m (because I need the results updated on
> searches), so probably when I have more searches at the same time the commit
> will hang again, right?
>
> Sorry for the newbie questions and thanks for your help and explanation
> Erik.
>
> BR,
> Frederico
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: terça-feira, 12 de Janeiro de 2010 15:15
> To: solr-user@lucene.apache.org
> Subject: Re: Problem comitting on 40GB index
>
> Rebooting the machine certainly closes the searchers, but
> depending upon how you shut it down there may be stale files
> After reboot (but before you start SOLR), how much space
> is on your disk? If it's 40G, you have no stale files
>
> Yes, IR is IndexReader, which is a searcher.
>
> I'll have to leave it to others if you don't have stale files
> hanging around, although if you're optimizing while
> searchers are running, you'll use up to 3X the index size...
>
> Otherwise I'll have to leave it to others for additional insights
>
> Best
> Erick
>
> On Tue, Jan 12, 2010 at 9:22 AM, Frederico Azeiteiro <
> frederico.azeite...@cision.com> wrote:
>
> > Hi Erik,
> >
> > I'm a newbie to solr... By IR, you mean searcher? Is there a place where
> I
> > can check the open searchers? And rebooting the machine shouldn't closed
> > that searchers?
> >
> > Thanks,
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: terça-feira, 12 de Janeiro de 2010 13:54
> > To: solr-user@lucene.apache.org
> > Subject: Re: Problem comitting on 40GB index
> >
> > There are several possibilities:
> >
> > 1> you have some process holding open your indexes, probably
> > other searchers. You *probably* are OK just committing
> > new changes if there is exactly *one* searcher keeping
> > your index open. If you have some process whereby
> > you periodically open a new search but you fail to close
> > the old one, then you'll use up an extra 40G for every
> > version of your index held open by your processes. That's
> >confusing... I'm saying that if you open any number of IRs,
> >you'll have 40G consumed. Then if you add
> >some more documents and open *another* IR,  you'll have
> >another 40G consumed. They'll stay around until you close
> >your readers.
> >
> > 2> If you optimize, there can be up to 3X the index size being
> >consumed if you also have a previous reader opened.
> >
> > So I suspect that sometime recently you've opened another
> > IR.
> >
> > HTH
> > Erick
> >
> >
> >
> > On Tue, Jan 12, 2010 at 8:03 AM, Frederico Azeiteiro <
> > frederico.azeite...@cision.com> wrote:
> >
> > > Hi all,
> > >
> > > I started working with solr about 1 month ago, and everything was
> > > running well both indexing as searching documents.
> > >
> > > I have a 40GB index with about 10 000 000 documents available. I index
> > > 3k docs for each 10m and commit after each insert.
> > >
> > > Since yesterday, I can't commit no articles to index. I manage to
> search
> > > ok, and index documents without commiting. But when I start the commit
> > > is takes a long time and eats all of the available disk space
> > > left(60GB). The commit eventually stops with full disk and I have to
> > > restart SOLR and get the 60GB returned to system.
> > >
> > > Before this, the commit was taking a few seconds to complete.
> > >
> > > Can someone help to debug the problem? Where should I start? Should I
> > > try to copy the index to other machine with more free space and try to
> > > commit? Should I try an optimize?
> > >
> > > Log for the last commit I tried:
> > >
> > > INFO: start
> > >
> commit(optimize=false,waitFlush=false,waitSearcher=true,expungeDeletes=f
> > > alse)
> > > (Then, after a long time...)
> > > Exception in thread "Lucene Merge Thread #0"
> > > org.apache.lucene.index.MergePolicy$MergeException:
> java.io.IOException:
> > > No space left on device
> > >at
> > >
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(Co
> > > ncurrentMergeScheduler.java:351)
> > >at
> > >
> org.apache.lucene.index.ConcurrentM

SF Bay Area Lucene Meetup Jan. 21st

2010-01-12 Thread Grant Ingersoll
There will be a San Francisco/Bay Area meetup on Jan. 21st at 7:15 PM at the 
"Hacker Dojo" (don't ask me...) location.  

RSVP and all the details are at http://www.meetup.com/SFBay-Lucene-Solr-Meetup/

Hope to see you there,
Grant

Re: Yankee's Solr integration

2010-01-12 Thread Aleksander Stensby
They have probably added the logic for that server-side. Solr does not
support these types of features, but they are easy to implement.

Saving a search could be as easy as storing the selected query parameters.
Creating an alert (or RSS feed) for that would then be a process on the
server that executes those stored queries against Solr at regular intervals,
formats the results as either RSS or an email, and ships that off to the
client that subscribed.
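
As a rough sketch of the server-side piece using SolrJ (the query string, field
names and class name below are made up for illustration):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SavedSearchRunner {
    public static void main(String[] args) throws Exception {
        // The "saved search" is nothing more than the stored query parameters.
        String savedQuery = "category:laptops AND price:[0 TO 500]";

        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery(savedQuery);
        query.setRows(20);

        // Run the stored query; a scheduler would call this at regular intervals.
        QueryResponse response = server.query(query);
        for (SolrDocument doc : response.getResults()) {
            // Turn each hit into an RSS item or an email line and ship it off.
            System.out.println(doc.getFieldValue("id"));
        }
    }
}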

Cheers,
 Aleks


On Wed, Jan 6, 2010 at 3:12 PM, Nicolas Kern  wrote:

> Hello everybody,
>
> I was wondering how Yankee (
> http://www.yankeegroup.com/search.do?searchType=advancedSearch) manages to
> provide the ability to create alerts, save searches, and generate an RSS
> feed out of a custom search using Solr - do you have any idea?
>
> Thanks a lot,
> Best regards & happy new year !
> Nicolas
>


Re: help implementing a couple of business rules

2010-01-12 Thread Aleksander Stensby
For your first question, wouldn't it be possible to achieve that with some
simple boolean logic? I mean, if you have a requirement to match any of the
other fields AND description2, but not if it ONLY matches description 2:

say matching x against field A, B, and description 2:
((A:x OR B:x) AND description2:x)
would only give you results from description2 IF there is also a match in
either one of the other two fields.

If I misunderstood your requirements, you should also note that Solr
supports pure negative field matching as well, meaning that you CAN exclude
results from a specific field entirely. From the wiki:

> Pure negative queries (all clauses prohibited) are allowed.
> -inStock:false finds all field values where inStock is not false
>

Hope that helps,
 Aleks


On Mon, Jan 11, 2010 at 7:29 PM, Joe Calderon wrote:

> thx, but im not sure that covers all edge cases, to clarify
> 1. matching description2 is okay if other fields are matched too, but
> results matching only to description2 should be omitted
>
> 2. its okay to not match against the people field, but matches against
> the people field should only be phrase matches
>
> sorry if  i was unclear
>
> --joe
> On Mon, Jan 11, 2010 at 10:13 AM, Erik Hatcher 
> wrote:
> >
> > On Jan 11, 2010, at 12:56 PM, Joe Calderon wrote:
> >>
> >> 1. given a set of fields how to return matches that match across them
> >> but not just one specific one, ex im using a dismax parser currently
> >> but i want to exclude any results that only match against a field
> >> called 'description2'
> >
> > One way could be to add an fq parameter to the request:
> >
> >   &fq=-description2:()
> >
> >> 2. given a set of fields how to return matches that match across them
> >> but on one specific field match as a phrase only, ex im using a dismax
> >> parser currently but i want matches against a field called 'people' to
> >> only match as a phrase
> >
> > Doesn't setting pf=people accomplish this?
> >
> >Erik
> >
> >
>


What is the proper way to deploy Solr with a custom schema.xml that requires extra JARs?

2010-01-12 Thread Teruhiko Kurosaka
I have schema.xml that uses a Tokenizer that I wrote.

I understand the standard way of deploying Solr is
to place solr.war in webapps directory, have a separate
directory that has conf files under its conf subdirectory,
and specify that directory as Solr home dir via either 
JVM property or JNDI.

I can't put the extra JARs in the Solr home dir's lib subdir,
can I?

Is there any elegant way of placing the extra JARs
other than expanding the war in webapp directory manually
and adding the JARs?

-kuro


Re: updating solr server

2010-01-12 Thread Yonik Seeley
On Tue, Jan 12, 2010 at 1:09 PM, Smiley, David W.  wrote:
> The beauty of StreamingUpdateSolrServer is that you don't have to worry about 
> batch sizes; it streams them all.  Just keep calling add() with one document 
> and it'll get enqueued.  You can pass a collection but there's no performance 
> benefit.

Right - and the problem with building your own collection and passing
it is that it's not being streamed (if it takes any time to build
those docs - like reading from a DB - then that thread may be idle for
some amount of time).  If you separate and make document production
asynchronous from document sending, then you've just re-invented
StreamingUpdateSolrServer.

I'd really recommend just starting with StreamingUpdateSolrServer for
any amount of indexing.

-Yonik
http://www.lucidimagination.com


Re: updating solr server

2010-01-12 Thread Smiley, David W.
The beauty of StreamingUpdateSolrServer is that you don't have to worry about 
batch sizes; it streams them all.  Just keep calling add() with one document 
and it'll get enqueued.  You can pass a collection but there's no performance 
benefit.

StreamingUpdateSolrServer can be configured to use multiple simultaneous 
streams into Solr... I wouldn't use as many as you have CPUs; I'd go with 2 
then keep adding 1 till your docs/sec levels off.
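
A bare-bones sketch of that setup (the URL, queue size and document fields are
only illustrative):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingIndexer {
    public static void main(String[] args) throws Exception {
        // queueSize=100, threads=2: start with 2 streams and raise the count
        // until docs/sec stops improving.
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 2);

        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("name", "document " + i);
            server.add(doc);  // one doc at a time; batching happens behind the scenes
        }
        server.commit();
    }
}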

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Jan 12, 2010, at 12:52 PM, Smith G wrote:

> Hello ,
> I am using add() method which receives Collection of
> SolrInputDocuments instead of add() which receives a single document.
> I am afraid, is sending a group of documents being called as
> "batching" in Solr terminology? . If yes, then I am doing it ( by
> including additional logic in my code ). But the main point I dont get
> is how big a batch could be? How to find most suitable number of
> SolrDocs that could be sent at a time.
> Also, In case If I go for multi-threaded commons, then the
> number of threads to be used is equal to N of "N"-core processor, for
> being  optimal? .
> Thanks.
> 
> 2010/1/12 Yonik Seeley :
>> On Tue, Jan 12, 2010 at 3:48 AM, Smith G  wrote:
>>> Hello All,
>>>   I am trying to find a better approach ( perfomance wise
>>> ) to index documents. Document count is approximately a million+.
>>> First, I thought of writing multiple threads using
>>> CommonsHttpSolrServer to submit documents. But later I found out
>>> StreamingUpdateSolrServer, which says we can forget about batching.
>>> 
>>> 1) We can pass thread-count parameter to StreamingUpdateSolrServer,
>>> does it exactly serve the same as writing multiple threads using
>>> CommonsHttpSolrServer ?.
>> 
>> Not quite - streaming update solr server batches documents on the fly.
>>  So if you have a server with N CPUs, you should only need N threads
>> to saturate it.  Using multiple threads with CommonsHttpSolrServer,
>> it's still one document per request (unless you do your own batching)
>> and there is still latency between request and response, meaning it
>> would take more threads to fill in that latency.
>> 
>> -Yonik
>> http://www.lucidimagination.com
>> 




Re: updating solr server

2010-01-12 Thread Smith G
Hello ,
 I am using the add() method which receives a Collection of
SolrInputDocuments instead of the add() which receives a single document.
Is sending a group of documents what is called
"batching" in Solr terminology? If yes, then I am doing it (by
including additional logic in my code). But the main point I don't get
is how big a batch should be - how do I find the most suitable number of
SolrInputDocuments to send at a time?
 Also, in case I go for the multi-threaded CommonsHttpSolrServer, is the
optimal number of threads equal to N on an N-core processor?
Thanks.

2010/1/12 Yonik Seeley :
> On Tue, Jan 12, 2010 at 3:48 AM, Smith G  wrote:
>> Hello All,
>>               I am trying to find a better approach ( perfomance wise
>> ) to index documents. Document count is approximately a million+.
>> First, I thought of writing multiple threads using
>> CommonsHttpSolrServer to submit documents. But later I found out
>> StreamingUpdateSolrServer, which says we can forget about batching.
>>
>> 1) We can pass thread-count parameter to StreamingUpdateSolrServer,
>> does it exactly serve the same as writing multiple threads using
>> CommonsHttpSolrServer ?.
>
> Not quite - streaming update solr server batches documents on the fly.
>  So if you have a server with N CPUs, you should only need N threads
> to saturate it.  Using multiple threads with CommonsHttpSolrServer,
> it's still one document per request (unless you do your own batching)
> and there is still latency between request and response, meaning it
> would take more threads to fill in that latency.
>
> -Yonik
> http://www.lucidimagination.com
>


RE: Problem comitting on 40GB index

2010-01-12 Thread Frederico Azeiteiro
I restarted Solr and stopped all searches. After that, the commit() was 
normal (2 secs) and it's been working for 3h without problems (indexing and a 
few searches too)... I haven't done any optimize yet, mainly because I had no 
deletes on the index and the performance is OK, so no need to optimize I think.

I had tried this procedure a few times in the morning and the commit always 
hung, so I have no explanation for it suddenly starting to work.
I'm making a commit every 2 minutes (because I need the results updated on searches), 
so probably when I have more searches at the same time the commit will hang 
again, right?

Sorry for the newbie questions and thanks for your help and explanation Erik.

BR, 
Frederico

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: terça-feira, 12 de Janeiro de 2010 15:15
To: solr-user@lucene.apache.org
Subject: Re: Problem comitting on 40GB index

Rebooting the machine certainly closes the searchers, but
depending upon how you shut it down there may be stale files
After reboot (but before you start SOLR), how much space
is on your disk? If it's 40G, you have no stale files

Yes, IR is IndexReader, which is a searcher.

I'll have to leave it to others if you don't have stale files
hanging around, although if you're optimizing while
searchers are running, you'll use up to 3X the index size...

Otherwise I'll have to leave it to others for additional insights

Best
Erick

On Tue, Jan 12, 2010 at 9:22 AM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> Hi Erik,
>
> I'm a newbie to solr... By IR, you mean searcher? Is there a place where I
> can check the open searchers? And rebooting the machine shouldn't closed
> that searchers?
>
> Thanks,
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: terça-feira, 12 de Janeiro de 2010 13:54
> To: solr-user@lucene.apache.org
> Subject: Re: Problem comitting on 40GB index
>
> There are several possibilities:
>
> 1> you have some process holding open your indexes, probably
> other searchers. You *probably* are OK just committing
> new changes if there is exactly *one* searcher keeping
> your index open. If you have some process whereby
> you periodically open a new search but you fail to close
> the old one, then you'll use up an extra 40G for every
> version of your index held open by your processes. That's
>confusing... I'm saying that if you open any number of IRs,
>you'll have 40G consumed. Then if you add
>some more documents and open *another* IR,  you'll have
>another 40G consumed. They'll stay around until you close
>your readers.
>
> 2> If you optimize, there can be up to 3X the index size being
>consumed if you also have a previous reader opened.
>
> So I suspect that sometime recently you've opened another
> IR.
>
> HTH
> Erick
>
>
>
> On Tue, Jan 12, 2010 at 8:03 AM, Frederico Azeiteiro <
> frederico.azeite...@cision.com> wrote:
>
> > Hi all,
> >
> > I started working with solr about 1 month ago, and everything was
> > running well both indexing as searching documents.
> >
> > I have a 40GB index with about 10 000 000 documents available. I index
> > 3k docs for each 10m and commit after each insert.
> >
> > Since yesterday, I can't commit no articles to index. I manage to search
> > ok, and index documents without commiting. But when I start the commit
> > is takes a long time and eats all of the available disk space
> > left(60GB). The commit eventually stops with full disk and I have to
> > restart SOLR and get the 60GB returned to system.
> >
> > Before this, the commit was taking a few seconds to complete.
> >
> > Can someone help to debug the problem? Where should I start? Should I
> > try to copy the index to other machine with more free space and try to
> > commit? Should I try an optimize?
> >
> > Log for the last commit I tried:
> >
> > INFO: start
> > commit(optimize=false,waitFlush=false,waitSearcher=true,expungeDeletes=f
> > alse)
> > (Then, after a long time...)
> > Exception in thread "Lucene Merge Thread #0"
> > org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
> > No space left on device
> >at
> > org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(Co
> > ncurrentMergeScheduler.java:351)
> >at
> > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(Concurr
> > entMergeScheduler.java:315)
> > Caused by: java.io.IOException: No space left on device
> >
> > I'm using Ubuntu 9.04 and Solr 1.4.0.
> >
> > Thanks in advance,
> >
> > Frederico
> >
>


DataImportHandler - synchronous execution

2010-01-12 Thread Alexey Serba
Hi,

I found that there's no explicit option to run DataImportHandler in a
synchronous mode. I need that option to run DIH from SolrJ (
EmbeddedSolrServer ) in the same thread. Currently I pass dummy stream
to DIH as a workaround for this, but I think it makes sense to add
specific option for that. Any objections?

Alex


Re: updating solr server

2010-01-12 Thread Yonik Seeley
On Tue, Jan 12, 2010 at 3:48 AM, Smith G  wrote:
> Hello All,
>               I am trying to find a better approach ( perfomance wise
> ) to index documents. Document count is approximately a million+.
> First, I thought of writing multiple threads using
> CommonsHttpSolrServer to submit documents. But later I found out
> StreamingUpdateSolrServer, which says we can forget about batching.
>
> 1) We can pass thread-count parameter to StreamingUpdateSolrServer,
> does it exactly serve the same as writing multiple threads using
> CommonsHttpSolrServer ?.

Not quite - streaming update solr server batches documents on the fly.
 So if you have a server with N CPUs, you should only need N threads
to saturate it.  Using multiple threads with CommonsHttpSolrServer,
it's still one document per request (unless you do your own batching)
and there is still latency between request and response, meaning it
would take more threads to fill in that latency.

-Yonik
http://www.lucidimagination.com


Re: Problem comitting on 40GB index

2010-01-12 Thread Erick Erickson
Rebooting the machine certainly closes the searchers, but
depending upon how you shut it down there may be stale files
After reboot (but before you start SOLR), how much space
is on your disk? If it's 40G, you have no stale files

Yes, IR is IndexReader, which is a searcher.

I'll have to leave it to others if you don't have stale files
hanging around, although if you're optimizing while
searchers are running, you'll use up to 3X the index size...

Otherwise I'll have to leave it to others for additional insights

Best
Erick

On Tue, Jan 12, 2010 at 9:22 AM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> Hi Erik,
>
> I'm a newbie to solr... By IR, you mean searcher? Is there a place where I
> can check the open searchers? And rebooting the machine shouldn't closed
> that searchers?
>
> Thanks,
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: terça-feira, 12 de Janeiro de 2010 13:54
> To: solr-user@lucene.apache.org
> Subject: Re: Problem comitting on 40GB index
>
> There are several possibilities:
>
> 1> you have some process holding open your indexes, probably
> other searchers. You *probably* are OK just committing
> new changes if there is exactly *one* searcher keeping
> your index open. If you have some process whereby
> you periodically open a new search but you fail to close
> the old one, then you'll use up an extra 40G for every
> version of your index held open by your processes. That's
>confusing... I'm saying that if you open any number of IRs,
>you'll have 40G consumed. Then if you add
>some more documents and open *another* IR,  you'll have
>another 40G consumed. They'll stay around until you close
>your readers.
>
> 2> If you optimize, there can be up to 3X the index size being
>consumed if you also have a previous reader opened.
>
> So I suspect that sometime recently you've opened another
> IR.
>
> HTH
> Erick
>
>
>
> On Tue, Jan 12, 2010 at 8:03 AM, Frederico Azeiteiro <
> frederico.azeite...@cision.com> wrote:
>
> > Hi all,
> >
> > I started working with solr about 1 month ago, and everything was
> > running well both indexing as searching documents.
> >
> > I have a 40GB index with about 10 000 000 documents available. I index
> > 3k docs for each 10m and commit after each insert.
> >
> > Since yesterday, I can't commit no articles to index. I manage to search
> > ok, and index documents without commiting. But when I start the commit
> > is takes a long time and eats all of the available disk space
> > left(60GB). The commit eventually stops with full disk and I have to
> > restart SOLR and get the 60GB returned to system.
> >
> > Before this, the commit was taking a few seconds to complete.
> >
> > Can someone help to debug the problem? Where should I start? Should I
> > try to copy the index to other machine with more free space and try to
> > commit? Should I try an optimize?
> >
> > Log for the last commit I tried:
> >
> > INFO: start
> > commit(optimize=false,waitFlush=false,waitSearcher=true,expungeDeletes=f
> > alse)
> > (Then, after a long time...)
> > Exception in thread "Lucene Merge Thread #0"
> > org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
> > No space left on device
> >at
> > org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(Co
> > ncurrentMergeScheduler.java:351)
> >at
> > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(Concurr
> > entMergeScheduler.java:315)
> > Caused by: java.io.IOException: No space left on device
> >
> > I'm using Ubuntu 9.04 and Solr 1.4.0.
> >
> > Thanks in advance,
> >
> > Frederico
> >
>


Re: Deleting * and Re-index after schema change

2010-01-12 Thread Lee Smith
Don't worry, my bad. 

I made a mistake in my data import that gave all documents the same ID!

All working now, thank you


On 12 Jan 2010, at 14:33, Lee Smith wrote:

> Hi Erik
> 
> Done as suggested and still only showing 1 Document
> 
> Doing a *:* give me 1 document
> 
> Cant understand why ?
> 
> On 12 Jan 2010, at 14:25, Erik Hatcher wrote:
> 
>> What does a search of *:* give you?
>> 
>> As far as your steps, delete the index folder *before* restarting Solr, not 
>> after.  That might be the issue.
>> 
>>  Erik
>> 
>> 
>> On Jan 12, 2010, at 9:23 AM, Lee Smith wrote:
>> 
>>> Am I doing this right.
>>> 
>>> I have made changes to my schema so as per guide I done the following.
>>> 
>>> Stopped the application
>>> Updated the Schema
>>> Re-Started
>>> Deleted the index folder
>>> Then ran a full import & optimize command ie:  
>>> /dataimport?command=full-import&optimize=true
>>> 
>>> In the status it shows Indexing Complete. Add/Updated 100800 documents. 0 
>>> Deleted
>>> 
>>> So all good ?
>>> 
>>> But in the stats page it only shows numDocs:1
>>> 
>>> The only thing I can see maybe in the stats page it says in the reader line 
>>> segments=1  but I noticed in the index folder the file says segments_6
>>> 
>>> Any ideas ?
>>> 
>>> Thank you
>> 
> 



Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Andrew Clegg


Erik Hatcher-4 wrote:
> 
> 
> On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
>> Thanks Erik, but I'm still a little confused as to exactly where in  
>> the Solr
>> config I set these parameters.
> 
> You'd configure them within the <processor> element, something like
> this:
> 
>    <str name="minTokenLen">5</str>
> 

OK, thanks. (Should that really be str though, and not int or something?)


Erik Hatcher-4 wrote:
> 
> 
> Perhaps you could update the wiki with an example once you get it  
> working?
> 
> I'm flying a little blind here, just going off the source code, not  
> trying it out for real.
> 
> 

Sure -- it won't be til next week at the earliest though.

Cheers,

Andrew.




Re: Encountering a roadblock with my Solr schema design...use dedupe?

2010-01-12 Thread Smiley, David W.
Kelly,
This is a good question you have posed and illustrates a challenge with Solr's 
limited schema.  I don't see how the dedup will help.  I would continue with 
the SKU based approach and use this patch:
https://issues.apache.org/jira/browse/SOLR-236
You'll collapse on the product id.  My book, p.192, highlights this component 
as it existed when I wrote it but it has been updated since then.

A recent separate question by you on this list suggests you're going down this 
path.  I would grab the attached SOLR-236.patch file and attempt to apply it to 
the 1.4 source.
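
Once the patch is applied, the request could look roughly like this (parameter
names have varied between versions of the SOLR-236 patch, and product_id stands
in for whatever field holds your product key):

http://localhost:8983/solr/select?q=sku_color:green+AND+sku_size:M&fq=sku_price:[*+TO+9.99]&collapse.field=product_id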

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Jan 11, 2010, at 5:27 PM, Kelly Taylor wrote:

> 
> I am in the process of building a Solr search solution for my application and
> have run into a roadblock with the schema design.  Trying to match criteria
> in one multi-valued field with corresponding criteria in another
> multi-valued field.  Any advice would be greatly appreciated.
> 
> BACKGROUND:
> My RDBMS data model is such that for every one of my "Product" entities,
> there are one-to-many "SKU" entities available for purchase. Each SKU entity
> can have its own price, as well as one-to-many options, etc.  The web
> frontend displays available "Product" entities on both directory and detail
> pages.
> 
> In order to take advantage of Solr's facet count, paging, and sorting
> functionality, I decided to base the Solr schema on "Product" documents; so
> none of my documents currently contain duplicate "Product" data, and all
> "SKU" related data is denormalized as necessary, but into multi-valued
> fields.  For example, I have a document with an "id" field set to
> "Product:7," a "docType" field is set to "Product" as well as multi-valued
> "SKU" related fields and data like, "sku_color" {Red | Green | Blue},
> "sku_size" {Small | Medium | Large}, "sku_price" {10.00 | 10.00 | 7.99}
> 
> I hit the roadblock when I tried to answer the question, "Which products are
> available that contain skus with color Green, size M, and a price of $9.99
> or less?"...and have now begun the switch to "SKU" level indexing.  This
> also gives me what I need for faceted browsing/navigation, and search
> refinement...leading the user to "Product" entities having purchasable "SKU"
> entities.  But this also means I now have documents which are mostly
> duplicates for each "Product," and all, facet counts, paging and sorting is
> then inaccurate;  so it appears I need do this myself, with multiple Solr
> requests.
> 
> Is this really the best approach; and if so, should I use the Solr
> Deduplication update processor when indexing and querying?
> 
> Thanks in advance,
> Kelly
> 





Re: Deleting * and Re-index after schema change

2010-01-12 Thread Lee Smith
Hi Erik

Done as suggested and it is still only showing 1 document.

Doing a *:* search gives me 1 document.

Can't understand why?

On 12 Jan 2010, at 14:25, Erik Hatcher wrote:

> What does a search of *:* give you?
> 
> As far as your steps, delete the index folder *before* restarting Solr, not 
> after.  That might be the issue.
> 
>   Erik
> 
> 
> On Jan 12, 2010, at 9:23 AM, Lee Smith wrote:
> 
>> Am I doing this right.
>> 
>> I have made changes to my schema so as per guide I done the following.
>> 
>> Stopped the application
>> Updated the Schema
>> Re-Started
>> Deleted the index folder
>> Then ran a full import & optimize command ie:  
>> /dataimport?command=full-import&optimize=true
>> 
>> In the status it shows Indexing Complete. Add/Updated 100800 documents. 0 
>> Deleted
>> 
>> So all good ?
>> 
>> But in the stats page it only shows numDocs:1
>> 
>> The only thing I can see maybe in the stats page it says in the reader line 
>> segments=1  but I noticed in the index folder the file says segments_6
>> 
>> Any ideas ?
>> 
>> Thank you
> 



complex multi valued fields

2010-01-12 Thread Adamsky, Robert

I have a document that has a multi-valued field where each value in
the field is itself composed of two values.  Think of an invoice doc
with multi-valued line items - each line item having a quantity and a product name.

One option I see is to have a multi-valued line item field and, when producing
the document to pass to Solr, concat the quantity and description and put the
result in the multi-valued field, as in the sketch below.
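
A rough sketch of that workaround (the line_item field name and the "|" separator
are just placeholders):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class InvoiceIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "invoice-42");
        // Each value packs quantity and product name into a single string.
        doc.addField("line_item", "3|widget");
        doc.addField("line_item", "10|gadget");

        server.add(doc);
        server.commit();
    }
}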

My preference would be the ability to define such complex multi-valued fields
out of the box.  Is that supported in a Solr 1.4 environment?  Basically, a field
type that allows you to define the other fields that make up a field.

It could look something like this in schema.xml if it were supported:

<field name="line_item" multiValued="true">
  <field name="quantity" type="int"/>
  <field name="product_name" type="string"/>
</field>



Re: Deleting * and Re-index after schema change

2010-01-12 Thread Erik Hatcher

What does a search of *:* give you?

As far as your steps, delete the index folder *before* restarting  
Solr, not after.  That might be the issue.
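
Alternatively, you can clear the index without touching the filesystem by posting
a delete-by-query and a commit to the update handler, something like:

    <delete><query>*:*</query></delete>
    <commit/>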


Erik


On Jan 12, 2010, at 9:23 AM, Lee Smith wrote:


Am I doing this right.

I have made changes to my schema so as per guide I done the following.

Stopped the application
Updated the Schema
Re-Started
Deleted the index folder
Then ran a full import & optimize command ie:  /dataimport? 
command=full-import&optimize=true


In the status it shows Indexing Complete. Add/Updated 100800  
documents. 0 Deleted


So all good ?

But in the stats page it only shows numDocs:1

The only thing I can see maybe in the stats page it says in the  
reader line segments=1  but I noticed in the index folder the file  
says segments_6


Any ideas ?

Thank you




Deleting * and Re-index after schema change

2010-01-12 Thread Lee Smith
Am I doing this right?

I have made changes to my schema, so as per the guide I did the following.

Stopped the application
Updated the Schema
Re-Started
Deleted the index folder
Then ran a full import & optimize command ie:  
/dataimport?command=full-import&optimize=true

In the status it shows Indexing Complete. Add/Updated 100800 documents. 0 
Deleted

So all good ?

But in the stats page it only shows numDocs:1

The only thing I can see: in the stats page the reader line says
segments=1, but I noticed the file in the index folder says segments_6

Any ideas ?

Thank you 

Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Erik Hatcher


On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
Thanks Erik, but I'm still a little confused as to exactly where in  
the Solr

config I set these parameters.


You'd configure them within the <processor> element, something like
this:

   <str name="minTokenLen">5</str>
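
For a fuller picture, a sketch of the whole chain as I'd expect it to look (the
field list and parameter values here are just placeholders, and I haven't run
this exact config):

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    <str name="minTokenLen">5</str>
    <str name="quantRate">0.2</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The chain then gets referenced from the update handler (update.processor=dedupe
in its defaults), as in the example on the Deduplication wiki page.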


The example on the wiki page uses Lookup3Signature which  
(presumably) takes
no parameters, so there's no indication in the XML examples of where  
you

would set them.


Right, looking at the source code, Lookup3Signature takes no parameters.

Perhaps you could update the wiki with an example once you get it  
working?


I'm flying a little blind here, just going off the source code, not  
trying it out for real.


Erik



RE: Problem comitting on 40GB index

2010-01-12 Thread Frederico Azeiteiro
Hi Erik,

I'm a newbie to Solr... By IR, you mean a searcher? Is there a place where I can 
check the open searchers? And shouldn't rebooting the machine have closed those 
searchers?

Thanks,

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: terça-feira, 12 de Janeiro de 2010 13:54
To: solr-user@lucene.apache.org
Subject: Re: Problem comitting on 40GB index

There are several possibilities:

1> you have some process holding open your indexes, probably
 other searchers. You *probably* are OK just committing
 new changes if there is exactly *one* searcher keeping
 your index open. If you have some process whereby
 you periodically open a new search but you fail to close
 the old one, then you'll use up an extra 40G for every
 version of your index held open by your processes. That's
confusing... I'm saying that if you open any number of IRs,
you'll have 40G consumed. Then if you add
some more documents and open *another* IR,  you'll have
another 40G consumed. They'll stay around until you close
your readers.

2> If you optimize, there can be up to 3X the index size being
consumed if you also have a previous reader opened.

So I suspect that sometime recently you've opened another
IR.

HTH
Erick



On Tue, Jan 12, 2010 at 8:03 AM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> Hi all,
>
> I started working with solr about 1 month ago, and everything was
> running well both indexing as searching documents.
>
> I have a 40GB index with about 10 000 000 documents available. I index
> 3k docs for each 10m and commit after each insert.
>
> Since yesterday, I can't commit no articles to index. I manage to search
> ok, and index documents without commiting. But when I start the commit
> is takes a long time and eats all of the available disk space
> left(60GB). The commit eventually stops with full disk and I have to
> restart SOLR and get the 60GB returned to system.
>
> Before this, the commit was taking a few seconds to complete.
>
> Can someone help to debug the problem? Where should I start? Should I
> try to copy the index to other machine with more free space and try to
> commit? Should I try an optimize?
>
> Log for the last commit I tried:
>
> INFO: start
> commit(optimize=false,waitFlush=false,waitSearcher=true,expungeDeletes=f
> alse)
> (Then, after a long time...)
> Exception in thread "Lucene Merge Thread #0"
> org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
> No space left on device
>at
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(Co
> ncurrentMergeScheduler.java:351)
>at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(Concurr
> entMergeScheduler.java:315)
> Caused by: java.io.IOException: No space left on device
>
> I'm using Ubuntu 9.04 and Solr 1.4.0.
>
> Thanks in advance,
>
> Frederico
>


Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Andrew Clegg


Thanks Erik, but I'm still a little confused as to exactly where in the Solr
config I set these parameters.

The example on the wiki page uses Lookup3Signature which (presumably) takes
no parameters, so there's no indication in the XML examples of where you
would set them. Unless I'm missing something.

Thanks again,

Andrew.


Erik Hatcher-4 wrote:
> 
> 
> On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:
>> I'm interested in near-dupe removal as mentioned (briefly) here:
>>
>> http://wiki.apache.org/solr/Deduplication
>>
>> However the link for TextProfileSignature hasn't been filled in yet.
>>
>> Does anyone have an example of using TextProfileSignature that  
>> demonstrates
>> the tunable parameters mentioned in the wiki?
> 
> There are some comments in the source code*, but they weren't made  
> class-level.  I'm fixing that and committing it now, but here's the  
> comment:
> 
> /**
>   * This implementation is copied from Apache Nutch. 
>   * An implementation of a page signature. It calculates an MD5 hash
>   * of a plain text "profile" of a page.
>   * The algorithm to calculate a page "profile" takes the plain  
> text version of
>   * a page and performs the following steps:
>   * 
>   * remove all characters except letters and digits, and bring all  
> characters
>   * to lower case,
>   * split the text into tokens (all consecutive non-whitespace  
> characters),
>   * discard tokens equal or shorter than MIN_TOKEN_LEN (default 2  
> characters),
>   * sort the list of tokens by decreasing frequency,
>   * round down the counts of tokens to the nearest multiple of QUANT
>   * (QUANT = QUANT_RATE * maxFreq, where  
> QUANT_RATE is 0.01f
>   * by default, and maxFreq is the maximum token  
> frequency). If
>   * maxFreq is higher than 1, then QUANT is always higher  
> than 2 (which
>   * means that tokens with frequency 1 are always discarded).
>   * tokens, which frequency after quantization falls below QUANT,  
> are discarded.
>   * create a list of tokens and their quantized frequency,  
> separated by spaces,
>   * in the order of decreasing frequency.
>   * 
>   * This list is then submitted to an MD5 hash calculation.*/
> 
> There are two parameters this implementation takes:
> 
>  quantRate = params.getFloat("quantRate", 0.01f);
>  minTokenLen = params.getInt("minTokenLen", 2);
> 
> Hope that helps.
> 
>   Erik
> 
> 
> 
> *
> http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/update/processor/TextProfileSignature.java
> 
> 
> 




Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Erik Hatcher


On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:

I'm interested in near-dupe removal as mentioned (briefly) here:

http://wiki.apache.org/solr/Deduplication

However the link for TextProfileSignature hasn't been filled in yet.

Does anyone have an example of using TextProfileSignature that  
demonstrates

the tunable parameters mentioned in the wiki?


There are some comments in the source code*, but they weren't made  
class-level.  I'm fixing that and committing it now, but here's the  
comment:


/**
 * This implementation is copied from Apache Nutch. 
 * An implementation of a page signature. It calculates an MD5 hash
 * of a plain text "profile" of a page.
 * The algorithm to calculate a page "profile" takes the plain  
text version of

 * a page and performs the following steps:
 * 
 * remove all characters except letters and digits, and bring all  
characters

 * to lower case,
 * split the text into tokens (all consecutive non-whitespace  
characters),
 * discard tokens equal or shorter than MIN_TOKEN_LEN (default 2  
characters),

 * sort the list of tokens by decreasing frequency,
 * round down the counts of tokens to the nearest multiple of QUANT
 * (QUANT = QUANT_RATE * maxFreq, where  
QUANT_RATE is 0.01f
 * by default, and maxFreq is the maximum token  
frequency). If
 * maxFreq is higher than 1, then QUANT is always higher  
than 2 (which

 * means that tokens with frequency 1 are always discarded).
 * tokens, which frequency after quantization falls below QUANT,  
are discarded.
 * create a list of tokens and their quantized frequency,  
separated by spaces,

 * in the order of decreasing frequency.
 * 
 * This list is then submitted to an MD5 hash calculation.*/

There are two parameters this implementation takes:

quantRate = params.getFloat("quantRate", 0.01f);
minTokenLen = params.getInt("minTokenLen", 2);

Hope that helps.

Erik



* 
http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/update/processor/TextProfileSignature.java



RE: update solr index

2010-01-12 Thread Marc Des Garets
I have two ways to update the index: either I use SolrJ with
EmbeddedSolrServer, or I do it with an HTTP query. If I do it with an
HTTP query I indeed don't stop Tomcat, but I have to do some operations
(mainly taking the instance out of the cluster) and I can't automate that
process, whereas I can automate the update itself, which is my goal; that's
why I'm trying to do the update without running into garbage collection problems.

I have disabled the cache as I don't need it. Is it running a number of old
queries to re-generate the cache anyway, or is it a different cache you
are talking about? But I believe it still has to register a new
searcher. I don't know what the impact of this is, though.

I guess I will go for running more Tomcat instances, each with fewer indexes
and a lower JVM heap.

Thank you for the link and thanks for the reply.


Marc

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: 12 January 2010 07:49
To: solr-user@lucene.apache.org
Subject: Re: update solr index

On Mon, Jan 11, 2010 at 7:42 PM, Marc Des Garets
wrote:

>
> I am running solr in tomcat and I have about 35 indexes (between 2 and
> 80 millions documents each). Currently if I try to update few
documents
> from an index (let's say the one which contains 80 millions documents)
> while tomcat is running and therefore receiving requests, I am getting
> few very long garbage collection (about 60sec). I am running tomcat
with
> -Xms10g -Xmx10g -Xmn2g -XX:PermSize=256m -XX:MaxPermSize=256m. I'm
using
> ConcMarkSweepGC.
>
> I have 2 questions:
> 1. Is solr doing something specific while an index is being updated
like
> updating something in memory which would cause the garbage collection?
>

Solr's caches are thrown away and a fixed number of old queries are
re-executed to re-generated the cache on the new index (known as
auto-warming). This happens on a commit.


>
> 2. Any idea how I could solve this problem? Currently I stop tomcat,
> update index, start tomcat. I would like to be able to update my index
> while tomcat is running. I was thinking about running more tomcat
> instance with less memory for each and each running few of my indexes.
> Do you think it would be the best way to go?
>
>
If you stop tomcat, how do you update the index? Are you running a
multi-core setup? Perhaps it is better to split up the indexes among
multiple boxes. Also, you should probably lower the JVM heap so that the
full GC pause doesn't make your index unavailable for such a long time.

Also see
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles
/Scaling-Lucene-and-Solr

-- 
Regards,
Shalin Shekhar Mangar.

Re: Problem comitting on 40GB index

2010-01-12 Thread Erick Erickson
There are several possibilities:

1> you have some process holding open your indexes, probably
 other searchers. You *probably* are OK just committing
 new changes if there is exactly *one* searcher keeping
 your index open. If you have some process whereby
 you periodically open a new search but you fail to close
 the old one, then you'll use up an extra 40G for every
 version of your index held open by your processes. That's
confusing... I'm saying that if you open any number of IRs,
you'll have 40G consumed. Then if you add
some more documents and open *another* IR,  you'll have
another 40G consumed. They'll stay around until you close
your readers.

2> If you optimize, there can be up to 3X the index size being
consumed if you also have a previous reader opened.

So I suspect that sometime recently you've opened another
IR.

HTH
Erick



On Tue, Jan 12, 2010 at 8:03 AM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> Hi all,
>
> I started working with solr about 1 month ago, and everything was
> running well both indexing as searching documents.
>
> I have a 40GB index with about 10 000 000 documents available. I index
> 3k docs for each 10m and commit after each insert.
>
> Since yesterday, I can't commit no articles to index. I manage to search
> ok, and index documents without commiting. But when I start the commit
> is takes a long time and eats all of the available disk space
> left(60GB). The commit eventually stops with full disk and I have to
> restart SOLR and get the 60GB returned to system.
>
> Before this, the commit was taking a few seconds to complete.
>
> Can someone help to debug the problem? Where should I start? Should I
> try to copy the index to other machine with more free space and try to
> commit? Should I try an optimize?
>
> Log for the last commit I tried:
>
> INFO: start
> commit(optimize=false,waitFlush=false,waitSearcher=true,expungeDeletes=f
> alse)
> (Then, after a long time...)
> Exception in thread "Lucene Merge Thread #0"
> org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
> No space left on device
>at
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(Co
> ncurrentMergeScheduler.java:351)
>at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(Concurr
> entMergeScheduler.java:315)
> Caused by: java.io.IOException: No space left on device
>
> I'm using Ubuntu 9.04 and Solr 1.4.0.
>
> Thanks in advance,
>
> Frederico
>


Re: What is this error means?

2010-01-12 Thread Grant Ingersoll
Do you have a stack trace?  

On Jan 12, 2010, at 2:54 AM, Ellery Leung wrote:

> When I am building the index for around 2 ~ 25000 records, sometimes I
> came across with this error:
> 
> 
> 
> Uncaught exception "Exception" with message '0' Status: Communication Error
> 
> 
> 
> I search Google & Yahoo but no answer.
> 
> 
> 
> I am now committing document to solr on every 10 records fetched from a
> SQLite Database with PHP 5.3.
> 
> 
> 
> Platform: Windows 7 Home
> 
> Web server: Nginx
> 
> Solr Specification Version: 1.4.0
> 
> Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06
> 12:33:40
> 
> Lucene Specification Version: 2.9.1
> 
> Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25
> 
> Solr hosted in jetty 6.1.3
> 
> 
> 
> All the above are in one single test machine.
> 
> 
> 
> The situation is that sometimes when I build the index, it can be created
> successfully.  But sometimes it will just stop with the above error.
> 
> 
> 
> Any clue?  Please help.
> 
> 
> 
> Thank you in advance.
> 



Problem comitting on 40GB index

2010-01-12 Thread Frederico Azeiteiro
Hi all,

I started working with Solr about 1 month ago, and everything was
running well, both indexing and searching documents.

I have a 40GB index with about 10 000 000 documents available. I index
3k docs every 10 minutes and commit after each insert.

Since yesterday, I can't commit any articles to the index. I can still search
OK, and index documents without committing. But when I start the commit
it takes a long time and eats all of the available disk space
left (60GB). The commit eventually stops when the disk is full, and I have to
restart Solr to get the 60GB returned to the system.

Before this, the commit was taking a few seconds to complete.

Can someone help me debug the problem? Where should I start? Should I
try to copy the index to another machine with more free space and try to
commit? Should I try an optimize?

Log for the last commit I tried:

INFO: start
commit(optimize=false,waitFlush=false,waitSearcher=true,expungeDeletes=f
alse)
(Then, after a long time...)
Exception in thread "Lucene Merge Thread #0"
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
No space left on device
at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(Co
ncurrentMergeScheduler.java:351)
at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(Concurr
entMergeScheduler.java:315)
Caused by: java.io.IOException: No space left on device

I'm using Ubuntu 9.04 and Solr 1.4.0.

Thanks in advance,

Frederico


Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Andrew Clegg

Hi,

I'm interested in near-dupe removal as mentioned (briefly) here:

http://wiki.apache.org/solr/Deduplication

However the link for TextProfileSignature hasn't been filled in yet.

Does anyone have an example of using TextProfileSignature that demonstrates
the tunable parameters mentioned in the wiki?

Thanks!

Andrew.




Re: Data Full Import Error

2010-01-12 Thread Noble Paul നോബിള്‍ नोब्ळ्
It is the way you start your Solr server (the -Xmx option).
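
For example, if you run the Jetty example that ships with Solr, something like
(the heap size here is only an illustration):

    java -Xmx1024m -jar start.jar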

On Tue, Jan 12, 2010 at 6:00 PM, Lee Smith  wrote:
> Thank you for your response.
>
> Will I just need to adjust the allowed memory in a config file or is this a 
> server issue. ?
>
> Sorry I know nothing about Java.
>
> Hope you can advise !
>
> On 12 Jan 2010, at 12:26, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>> You need more memory to run dataimport.
>>
>>
>> On Tue, Jan 12, 2010 at 4:46 PM, Lee Smith  wrote:
>>> Hi All
>>>
>>> I am trying to do a data import but I am getting the following error.
>>>
>>> INFO: [] webapp=/solr path=/dataimport params={command=status} status=0 
>>> QTime=405
>>> 2010-01-12 03:08:08.576::WARN:  Error for /solr/dataimport
>>> java.lang.OutOfMemoryError: Java heap space
>>> Jan 12, 2010 3:08:05 AM org.apache.solr.handler.dataimport.DataImporter 
>>> doFullImport
>>> SEVERE: Full Import failed
>>> java.lang.OutOfMemoryError: Java heap space
>>> Exception in thread "btpool0-2" java.lang.OutOfMemoryError: Java heap space
>>> Jan 12, 2010 3:08:14 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>>> INFO: start rollback
>>> Jan 12, 2010 3:08:21 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>>> INFO: end_rollback
>>> Jan 12, 2010 3:08:23 AM org.apache.solr.update.SolrIndexWriter finalize
>>> SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a bug 
>>> -- POSSIBLE RESOURCE LEAK!!!
>>
>> This is OK. don't bother
>>>
>>> Any ideas what this can be ??
>>>
>>> Hope you can help.
>>>
>>> Lee
>>>
>>>
>>
>>
>>
>> --
>> -
>> Noble Paul | Systems Architect| AOL | http://aol.com
>
>



-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


Re: Data Full Import Error

2010-01-12 Thread Lee Smith
Thank you for your response.

Will I just need to adjust the allowed memory in a config file, or is this a
server issue?

Sorry I know nothing about Java.

Hope you can advise !

On 12 Jan 2010, at 12:26, Noble Paul നോബിള്‍ नोब्ळ् wrote:

> You need more memory to run dataimport.
> 
> 
> On Tue, Jan 12, 2010 at 4:46 PM, Lee Smith  wrote:
>> Hi All
>> 
>> I am trying to do a data import but I am getting the following error.
>> 
>> INFO: [] webapp=/solr path=/dataimport params={command=status} status=0 
>> QTime=405
>> 2010-01-12 03:08:08.576::WARN:  Error for /solr/dataimport
>> java.lang.OutOfMemoryError: Java heap space
>> Jan 12, 2010 3:08:05 AM org.apache.solr.handler.dataimport.DataImporter 
>> doFullImport
>> SEVERE: Full Import failed
>> java.lang.OutOfMemoryError: Java heap space
>> Exception in thread "btpool0-2" java.lang.OutOfMemoryError: Java heap space
>> Jan 12, 2010 3:08:14 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>> INFO: start rollback
>> Jan 12, 2010 3:08:21 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>> INFO: end_rollback
>> Jan 12, 2010 3:08:23 AM org.apache.solr.update.SolrIndexWriter finalize
>> SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a bug 
>> -- POSSIBLE RESOURCE LEAK!!!
> 
> This is OK. don't bother
>> 
>> Any ideas what this can be ??
>> 
>> Hope you can help.
>> 
>> Lee
>> 
>> 
> 
> 
> 
> -- 
> -
> Noble Paul | Systems Architect| AOL | http://aol.com



Re: Data Full Import Error

2010-01-12 Thread Noble Paul നോബിള്‍ नोब्ळ्
You need more memory to run dataimport.


On Tue, Jan 12, 2010 at 4:46 PM, Lee Smith  wrote:
> Hi All
>
> I am trying to do a data import but I am getting the following error.
>
> INFO: [] webapp=/solr path=/dataimport params={command=status} status=0 
> QTime=405
> 2010-01-12 03:08:08.576::WARN:  Error for /solr/dataimport
> java.lang.OutOfMemoryError: Java heap space
> Jan 12, 2010 3:08:05 AM org.apache.solr.handler.dataimport.DataImporter 
> doFullImport
> SEVERE: Full Import failed
> java.lang.OutOfMemoryError: Java heap space
> Exception in thread "btpool0-2" java.lang.OutOfMemoryError: Java heap space
> Jan 12, 2010 3:08:14 AM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
> Jan 12, 2010 3:08:21 AM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: end_rollback
> Jan 12, 2010 3:08:23 AM org.apache.solr.update.SolrIndexWriter finalize
> SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a bug 
> -- POSSIBLE RESOURCE LEAK!!!

This is OK. don't bother
>
> Any ideas what this can be ??
>
> Hope you can help.
>
> Lee
>
>



-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


Re: complex query

2010-01-12 Thread Wangsheng Mei
2010/1/12 Wangsheng Mei 

> I have considered building lucene index like:
> Document:  { title, content, author, friends }
> Thus, author and friends are two seperate fields. so I can boost them
> seperately.
> The problem is, if a document's author is the logged-in user, it's
> uncessary to search the friends field, because it would not appear in that
> field(someone would be the friend of himself)
>
 Sorry, someone would not be the friend of himself

>
> Then I consider merging author and friends field, like:
> Document : { title, content, authors }
> the authors field contains author and it's friends with whitespace
> seperating them.
> but how can I give author himself's article a better score than it's
> friends' article?
>
>
>
> 2010/1/12 Wangsheng Mei 
>
> Hi, ALL!
>>
>> I have two tables in database.
>>
>> t_article {
>>title,
>>content,
>>author
>> }
>>
>> t_friend {
>>   person_A,
>>   person_B
>> }
>>
>> note that in t_friend is many-to-many relation。
>>
>> When a logged-in user, search articles with a query word, 3 factors should
>> be considered in.
>> factor 1. relevency score
>> factor 2. if the article is posted by himself, give a higher score
>> factor 3. if the article is posted by it's friend, give a higher score
>>
>> HOW can I boost factor 2 and 3?
>>
>> Thanks very much!
>>
>>
>> --
>> 梅旺生
>>
>
>
>
> --
> 梅旺生
>



-- 
梅旺生


Re: complex query

2010-01-12 Thread Wangsheng Mei
I have considered building the Lucene index like:
Document:  { title, content, author, friends }
Thus, author and friends are two separate fields, so I can boost them
separately.
The problem is, if a document's author is the logged-in user, it's unnecessary
to search the friends field, because it would not appear in that
field (someone would be the friend of himself)
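
For this design, a dismax boost query is roughly what I have in mind (the user
id, fields and boost values below are only examples):

    q=some search words
    defType=dismax
    qf=title content
    bq=author:user123^5.0 friends:user123^2.0

The bq clauses only add to the score when they match, so the friends field does
not have to be part of the main query.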

Then I considered merging the author and friends fields, like:
Document : { title, content, authors }
The authors field contains the author and his friends, with whitespace
separating them.
But how can I give the author's own article a better score than his
friends' articles?



2010/1/12 Wangsheng Mei 

> Hi, ALL!
>
> I have two tables in database.
>
> t_article {
>title,
>content,
>author
> }
>
> t_friend {
>   person_A,
>   person_B
> }
>
> note that in t_friend is many-to-many relation。
>
> When a logged-in user, search articles with a query word, 3 factors should
> be considered in.
> factor 1. relevency score
> factor 2. if the article is posted by himself, give a higher score
> factor 3. if the article is posted by it's friend, give a higher score
>
> HOW can I boost factor 2 and 3?
>
> Thanks very much!
>
>
> --
> 梅旺生
>



-- 
梅旺生


complex query

2010-01-12 Thread Wangsheng Mei
Hi, ALL!

I have two tables in database.

t_article {
   title,
   content,
   author
}

t_friend {
  person_A,
  person_B
}

note that t_friend is a many-to-many relation.

When a logged-in user searches articles with a query word, 3 factors should
be considered:
factor 1. relevancy score
factor 2. if the article is posted by the user himself, give a higher score
factor 3. if the article is posted by one of his friends, give a higher score

How can I boost factors 2 and 3?

Thanks very much!


-- 
梅旺生


Data Full Import Error

2010-01-12 Thread Lee Smith
Hi All

I am trying to do a data import but I am getting the following error.

INFO: [] webapp=/solr path=/dataimport params={command=status} status=0 
QTime=405 
2010-01-12 03:08:08.576::WARN:  Error for /solr/dataimport
java.lang.OutOfMemoryError: Java heap space
Jan 12, 2010 3:08:05 AM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
SEVERE: Full Import failed
java.lang.OutOfMemoryError: Java heap space
Exception in thread "btpool0-2" java.lang.OutOfMemoryError: Java heap space
Jan 12, 2010 3:08:14 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Jan 12, 2010 3:08:21 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
Jan 12, 2010 3:08:23 AM org.apache.solr.update.SolrIndexWriter finalize
SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a bug -- 
POSSIBLE RESOURCE LEAK!!!

Any ideas what this can be ??

Hope you can help.

Lee



Re: Encountering a roadblock with my Solr schema design...use dedupe?

2010-01-12 Thread Chantal Ackermann

Hi Kelly,

"...the criteria for this hypothetical search involves multi-valued fields,
where the index of one matching criteria needs to correspond to the same
value in another multi-valued field in the same index. You can't do that..."

Just my two cents:
By storing values in two different multi-valued fields you cannot
store their relation to each other. If you want to have that in the
index as well, you need another field with the pairs (or triples or
whatever), like a map but stored as a list of patterned strings, e.g.
"size1:price1", "size2:price2", etc.
And that, of course, for every possible combination that the user can
order (in your use case). Whenever delivery of a certain combination
changes, you'll have to update the specific documents to reflect that.
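
A minimal SolrJ sketch of that pair-field idea (the field name "sku_combo",
the value pattern and the example data are assumptions made up for
illustration):

import org.apache.solr.common.SolrInputDocument;

// Hypothetical sketch: encode each sku's attribute combination as one token,
// so the relation between color and size survives in a flat index.
// "sku_combo" is assumed to be a multiValued string field in schema.xml.
public class ComboFieldSketch {
    public static SolrInputDocument productDoc() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "3");
        // one combined value per sku keeps a sku's attributes together
        doc.addField("sku_combo", "green_L");
        doc.addField("sku_combo", "blue_S");
        doc.addField("sku_combo", "blue_M");
        // plain multi-valued fields can still be kept for faceting
        doc.addField("color", "green");
        doc.addField("color", "blue");
        doc.addField("size", "L");
        doc.addField("size", "S");
        doc.addField("size", "M");
        return doc;
    }
}

A filter such as fq=sku_combo:green_M would then only match products that
really carry a green sku in size M, instead of matching "green" and "M" from
two different skus; criteria like price would have to be folded into the same
pattern (e.g. as a price bucket) to take part in the combination.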


You'll still need the other fields for faceting, I suppose. (My
experience is that you often need different fields for faceting than
for searching.)


The index is flat. If you think that storing all combinations (and that 
multiple times for all documents in your current schema) is too vast, 
then maybe you should store it in an extra index (core), and store only
id references? But I'm not sure about that. I would try to store 
everything in one flat schema (one or multiple cores), unless you really 
run into unsolvable (!) hardware/performance issues.


Cheers,
Chantal

Kelly Taylor wrote:

Hi Markus,

Thanks again. I wish this were simple boolean algebra. This is something I
have already tried. So either I am missing the boat completely, or have
failed to communicate it clearly. I didn't want to confuse the issue further
but maybe the following excerpts will help...

Excerpt from  "Solr 1.4 Enterprise Search Server" by David Smiley & Eric
Pugh...

"...the criteria for this hypothetical search involves multi-valued fields,
where the index of one matching criteria needs to correspond to the same
value in another multi-valued field in the same index. You can't do that..."

And this excerpt is from "Solr and RDBMS: The basics of designing your
application for the best of both" by Amit Nithianandan...

"...If I wanted to allow my users to search for wiper blades available in a
store nearby, I might create an index with multiple documents or records for
the same exact wiper blade, each document having different location data
(lat/long, address, etc.) to represent an individual store. Solr has a
de-duplication component to help show unique documents in case that
particular wiper blade is available in multiple stores near me..."

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Solr-and-RDBMS-design-basics

Remember, with my original schema definition I have multi-valued fields, and
when the "product" document is built, these fields do contain an array of
values retrieved from each of the related skus. Skus are children of my
products.

Using your example data, which t-shirt sku is available for purchase as a
child of t-shirt product with id 3? Is it really the green, M, or have we
found a product document related to both a green t-shirt and a Medium
t-shirt of some other color, which will thereby leave the user with nothing
to purchase?

sku = 9 [color=green, size=L, price=10.99], product id = 3
sku = 10 [color=blue, size=S, price=9.99], product id = 3
sku = 11 [color=blue, size=M, price=10.99], product id = 3


id = 1
color = [green, blue]
size = [M, S]
price = 6

id = 2
color = [red, blue]
size = [L, S]
price = 12

id = 3
color = [green, red, blue]
size = [L, S, M]
price = 5


If this is still unclear, I'll post a new question based on findings from
this conversation. Thanks for all of your help.

-Kelly


Markus Jelsma - Buyways B.V. wrote:

Hello Kelly,


Simple boolean algebra, you tell Solr you want color = green AND size = M
so it will only return green t-shirts in size M. If you, however, turn the
AND into an OR it will return all t-shirts that are green OR in size M, thus
you can then get M sized shirts in the blue color or green shirts in size
XXL.

I suggest you'd just give it a try and perhaps come back later to find
some improvements for your query. It would also be a good idea - if I may
say so - to read the links provided in the earlier message.

Hope you will find what you're looking for :)


Cheers,

Kelly Taylor said:

Hi Markus,

Thanks for your reply.

Using the current schema and query like you suggest, how can I identify
the unique combination of options and price for a given SKU?   I don't
want the user to arrive at a product which doesn't completely satisfy
their search request.  For example, with the "color:Green", "size:M",
and "price:[0 to 9.99]" search refinements applied,  no products should
be displayed which only have "size:M" in "color:Blue"

The actual data in the database for a product to display on the frontend
could be as follows:

product id = 1
product name = T-shirt

related skus...
-- sku id = 7 [color=green, size=S, price=10.99]
-- sku id = 9 [color=green, size=L, price=10.

Re: Encountering a roadblock with my Solr schema design...use dedupe?

2010-01-12 Thread Markus Jelsma
Hello,


I now believe that I really did misunderstand the problem and,
unfortunately, I don't believe I can be of much assistance, as I have not
had to implement a solution to a similar problem.


Cheers,

-  
Markus Jelsma  Buyways B.V.
Technisch ArchitectFriesestraatweg 215c
http://www.buyways.nl  9743 AD Groningen   


Alg. 050-853 6600  KvK  01074105
Tel. 050-853 6620  Fax. 050-3118124
Mob. 06-5025 8350  In: http://www.linkedin.com/in/markus17


On Mon, 2010-01-11 at 16:56 -0800, Kelly Taylor wrote:

> Hi Markus,
> 
> Thanks again. I wish this were simple boolean algebra. This is something I
> have already tried. So either I am missing the boat completely, or have
> failed to communicate it clearly. I didn't want to confuse the issue further
> but maybe the following excerpts will help...
> 
> Excerpt from  "Solr 1.4 Enterprise Search Server" by David Smiley & Eric
> Pugh...
> 
> "...the criteria for this hypothetical search involves multi-valued fields,
> where the index of one matching criteria needs to correspond to the same
> value in another multi-valued field in the same index. You can't do that..."
> 
> And this excerpt is from "Solr and RDBMS: The basics of designing your
> application for the best of both" by Amit Nithianandan...
> 
> "...If I wanted to allow my users to search for wiper blades available in a
> store nearby, I might create an index with multiple documents or records for
> the same exact wiper blade, each document having different location data
> (lat/long, address, etc.) to represent an individual store. Solr has a
> de-duplication component to help show unique documents in case that
> particular wiper blade is available in multiple stores near me..."
> 
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Solr-and-RDBMS-design-basics
> 
> Remember, with my original schema definition I have multi-valued fields, and
> when the "product" document is built, these fields do contain an array of
> values retrieved from each of the related skus. Skus are children of my
> products.
> 
> Using your example data, which t-shirt sku is available for purchase as a
> child of t-shirt product with id 3? Is it really the green, M, or have we
> found a product document related to both a green t-shirt and a Medium
> t-shirt of some other color, which will thereby leave the user with nothing
> to purchase?
> 
> sku = 9 [color=green, size=L, price=10.99], product id = 3
> sku = 10 [color=blue, size=S, price=9.99], product id = 3
> sku = 11 [color=blue, size=M, price=10.99], product id = 3
> 
> >> id = 1
> >> color = [green, blue]
> >> size = [M, S]
> >> price = 6
> >>
> >> id = 2
> >> color = [red, blue]
> >> size = [L, S]
> >> price = 12
> >>
> >> id = 3
> >> color = [green, red, blue]
> >> size = [L, S, M]
> >> price = 5
> 
> If this is still unclear, I'll post a new question based on findings from
> this conversation. Thanks for all of your help.
> 
> -Kelly
> 
> 
> Markus Jelsma - Buyways B.V. wrote:
> > 
> > Hello Kelly,
> > 
> > 
> > Simple boolean algebra, you tell Solr you want color = green AND size = M
> > so it will only return green t-shirts in size M. If you, however, turn the
> > AND into an OR it will return all t-shirts that are green OR in size M, thus
> > you can then get M sized shirts in the blue color or green shirts in size
> > XXL.
> > 
> > I suggest you'd just give it a try and perhaps come back later to find
> > some improvements for your query. It would also be a good idea - if i may
> > say so - to read the links provided in the earlier message.
> > 
> > Hope you will find what you're looking for :)
> > 
> > 
> > Cheers,
> > 
> > Kelly Taylor said:
> >>
> >> Hi Markus,
> >>
> >> Thanks for your reply.
> >>
> >> Using the current schema and query like you suggest, how can I identify
> >> the unique combination of options and price for a given SKU?   I don't
> >> want the user to arrive at a product which doesn't completely satisfy
> >> their search request.  For example, with the "color:Green", "size:M",
> >> and "price:[0 to 9.99]" search refinements applied,  no products should
> >> be displayed which only have "size:M" in "color:Blue"
> >>
> >> The actual data in the database for a product to display on the frontend
> >> could be as follows:
> >>
> >> product id = 1
> >> product name = T-shirt
> >>
> >> related skus...
> >> -- sku id = 7 [color=green, size=S, price=10.99]
> >> -- sku id = 9 [color=green, size=L, price=10.99]
> >> -- sku id = 10 [color=blue, size=S, price=9.99]
> >> -- sku id = 11 [color=blue, size=M, price=10.99]
> >> -- sku id = 12 [color=blue, size=L, price=10.99]
> >>
> >> Regards,
> >> Kelly
> >>
> >>
> >> Markus Jelsma - Buyways B.V. wrote:
> >>>
> >>> Hello Kelly,
> >>>
> >>>
> >>> I am not entirely sure if i understand your problem correctly. But i
> >>> believe your first approach is the right one.
> >>>
> >>> Your question: "Which products are available that contain skus with
> >>> color Gree

Re: updating solr server

2010-01-12 Thread Chantal Ackermann

2) Also, is CommonsHttpSolrServer  thread safe?


it is only if you initialize it with the MultiThreadedHttpConnectionManager:

http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/MultiThreadedHttpConnectionManager.html
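
A minimal sketch of that initialization (the URL and the connection limits are
illustrative assumptions; HttpClient 3.x and SolrJ 1.4 APIs assumed):

import java.net.MalformedURLException;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Hypothetical sketch: back one CommonsHttpSolrServer with a
// MultiThreadedHttpConnectionManager so it can be shared across threads.
public class SharedSolrServerFactory {
    public static SolrServer create() throws MalformedURLException {
        MultiThreadedHttpConnectionManager cm = new MultiThreadedHttpConnectionManager();
        cm.getParams().setDefaultMaxConnectionsPerHost(20);   // assumed limits
        cm.getParams().setMaxTotalConnections(100);
        HttpClient httpClient = new HttpClient(cm);
        return new CommonsHttpSolrServer("http://localhost:8983/solr", httpClient);
    }
}

The single SolrServer instance returned here can then be shared by the
indexing threads.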

Cheers,
Chantal


updating solr server

2010-01-12 Thread Smith G
Hello All,
   I am trying to find a better approach (performance-wise) to index
documents. The document count is approximately a million+.
First, I thought of writing multiple threads using
CommonsHttpSolrServer to submit documents. But later I found
StreamingUpdateSolrServer, whose documentation says we can forget about batching.

1) We can pass a thread-count parameter to StreamingUpdateSolrServer;
does it serve exactly the same purpose as writing multiple threads using
CommonsHttpSolrServer?

2) Also, is CommonsHttpSolrServer thread safe?

3) To be brief, which of the above is the better one (and why?):
writing multiple threads for CommonsHttpSolrServer, or directly adding each
and every document to StreamingUpdateSolrServer?

4) queueSize parameter: what would be a rough value for a real-time
application having a million+ documents to be indexed?

   I am not deeply familiar with these details, so please don't mind if something is wrong.

Thanks.
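
For reference, a minimal sketch of the StreamingUpdateSolrServer approach
asked about above (the URL, the queue size of 1000 and the thread count of 4
are illustrative assumptions to be tuned, not recommendations):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical sketch: let StreamingUpdateSolrServer do the batching and
// threading internally instead of managing worker threads by hand.
public class StreamingIndexer {
    public static void main(String[] args) throws Exception {
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);

        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title", "document " + i);
            server.add(doc);            // queued and sent by background threads
        }
        server.blockUntilFinished();    // wait for the queued documents to be sent
        server.commit();
    }
}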