Re: edismax available in solr 3.1?

2011-05-07 Thread Ahmet Arslan
 is edismax available in solr 3.1?  I don't see any
 documentation about it.
 
 if it is, does it support the prefix and fuzzy query?

Yes and yes. See snippet taken from changes.txt

New Features
--

* SOLR-1553: New dismax parser implementation (accessible as edismax)
  that supports full lucene syntax, improved reserved char escaping,
  fielded queries, improved proximity boosting, and improved stopword
  handling. Note: status is experimental for now. (yonik)
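As a quick illustration, an edismax request mixing fuzzy and prefix syntax could be assembled like this (a sketch; the field names, boosts, and terms are made up, only defType=edismax and the full Lucene syntax come from the changes.txt snippet):

```python
# Sketch: building query parameters for an edismax request that uses
# both a fuzzy term (solr~0.7) and a prefix term (lucen*).
# Field names and boosts below are hypothetical.
from urllib.parse import urlencode

params = {
    "defType": "edismax",     # select the extended dismax parser
    "q": "solr~0.7 lucen*",   # fuzzy + prefix, full Lucene syntax
    "qf": "title^2 body",     # query fields with boosts
}
query_string = urlencode(params)
print(query_string)
```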


Field Cache

2011-05-07 Thread samarth s
Hi,

I have read that the Lucene field cache is used in faceting and sorting. Is it
also populated/used when only selected fields are retrieved using the 'fl' or
'included fields in collapse' parameters? Is it also used for collapsing?

-- 
Regards,
Samarth


Whole unfiltered content in response document field

2011-05-07 Thread solrfan
Hi, I have a question about the content of the document fields. My configuration
is OK so far: I index a database with DIH and have configured an index
analyzer as follows:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords.txt"
          enablePositionIncrements="true"
          />
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

... 

 <fields>
   <field name="id" type="int" indexed="true" stored="true" required="true"/>
   <field name="text" type="text" indexed="true" stored="true"/>
 </fields>

On the analysis view, my filters work properly. At the end of the filter
chain I have only the tokens of interest. But when I search with Solr, I get
as a response the whole content of the indexed database field. The field
contains stopwords, whitespace, uppercase characters and so on. I search for
stopwords, and I can find them. I would expect to find in the response
document only the filtered content of the field, not the original raw
content that I sent for indexing.

Is this normal behaviour? Do I understand Solr right?

Many thanks! 
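What the poster is seeing can be modeled in a few lines (a toy sketch, not Solr code; the stopword list and input text are made up): index-time analysis changes only what is indexed, while the stored value is returned verbatim in responses.

```python
# Toy model: index-time analysis affects what is INDEXED (matched against),
# never what is STORED (returned in search responses).
stopwords = {"the", "a", "of"}

def analyze(text):
    # crude stand-in for WhitespaceTokenizer + StopFilter + LowerCaseFilter
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in stopwords]

stored_value = "The Quick Fox"          # what a search response contains
indexed_tokens = analyze(stored_value)  # what queries actually match
print(indexed_tokens)  # -> ['quick', 'fox']
```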

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Whole-unfiltered-content-in-response-document-field-tp2911588p2911588.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: uima fieldMappings and solr dynamicField

2011-05-07 Thread Koji Sekiguchi
I've opened https://issues.apache.org/jira/browse/SOLR-2503 .

Koji
-- 
http://www.rondhuit.com/en/

(11/05/06 20:15), Koji Sekiguchi wrote:
 Hello,
 
 I'd like to use a dynamicField in the feature-field mapping of the uima update
 processor. It doesn't seem to be supported currently. Is it a bad idea
 in terms of how uima is used? If it is not so bad, I'd like to try a patch.
 
 Background:
 
 Because my uima annotator can generate many types of named entity from
 a text, I don't want to implement that many types, but a single type, NamedEntity:
 
 <typeSystemDescription>
   <types>
     <typeDescription>
       <name>com.rondhuit.uima.next.NamedEntity</name>
       <description/>
       <supertypeName>uima.tcas.Annotation</supertypeName>
       <features>
         <featureDescription>
           <name>name</name>
           <description/>
           <rangeTypeName>uima.cas.String</rangeTypeName>
         </featureDescription>
         <featureDescription>
           <name>entity</name>
           <description/>
           <rangeTypeName>uima.cas.String</rangeTypeName>
         </featureDescription>
       </features>
     </typeDescription>
   </types>
 </typeSystemDescription>
 
 sample extracted named entities:
 
 name=PERSON, entity=Barack Obama
 name=TITLE, entity=the President
 
 Now, I'd like to map these named entities to Solr fields like this:
 
 PERSON_S:Barack Obama
 TITLE_S:the President
 
 Because the name types (PERSON, TITLE, etc.) can be so many,
 I'd like to use the dynamicField *_s, where * is replaced by the name
 feature of NamedEntity.
 
 I think this is a natural requirement from the Solr viewpoint, but I'm
 not sure whether my uima annotator implementation is correct. In other
 words, should I implement a separate type for each entity type?
 (e.g. PersonEntity, TitleEntity, ... instead of NamedEntity)
 
 Thank you!
 
 Koji
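For reference, the dynamicField being discussed would be declared in schema.xml roughly like this (a sketch; the string field type and stored/indexed flags are assumptions):

```xml
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
```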




Re: Whole unfiltered content in response document field

2011-05-07 Thread Ahmet Arslan

 [schema configuration quoted in full in the previous message -- snip]

 Is this a normal behaviour? Do I understand Solr right? 

In the response, Solr shows raw (stored) content. So you want to see the
analyzed/indexed content of a document in the response?

Searching and finding stop-words is not normal. Maybe you need to move the
StopFilter below the WordDelimiterFilter. Some punctuation may cause this.
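The ordering point can be illustrated with a toy model (plain Python, not the actual Solr filter classes; the tokens are made up): a stopword glued to punctuation slips past StopFilter when it runs before the delimiter splitting.

```python
# Toy model of why filter order matters in the analysis chain.
stopwords = {"the"}

def stop_filter(tokens):
    # drop exact stopword matches, case-insensitively
    return [t for t in tokens if t.lower() not in stopwords]

def word_delimiter(tokens):
    # crude stand-in for WordDelimiterFilter: strip punctuation, drop empties
    return [t.strip(".,!?") for t in tokens if t.strip(".,!?")]

tokens = "The, quick fox".split()  # WhitespaceTokenizer output: ['The,', 'quick', 'fox']
stop_first = word_delimiter(stop_filter(tokens))   # 'The,' survives StopFilter
delim_first = stop_filter(word_delimiter(tokens))  # 'The' is removed
print(stop_first, delim_first)
```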


Re: Replication Clarification Please

2011-05-07 Thread Bill Bell
I did not see answers... I am not an authority, but I will tell you what I
think.

Did you get some answers?


On 5/6/11 2:52 PM, Ravi Solr ravis...@gmail.com wrote:

Hello,
Pardon me if this has already been answered somewhere, and I
apologize for a lengthy post. I was wondering if anybody could help me
understand replication internals a bit more. We have a single
master-slave setup (Solr 1.4.1) with the configuration shown
below. Our environment is quite commit-heavy (hundreds of docs
every 5 minutes), and all indexing is done on the master and all searches
go to the slave. We are seeing that the slave replication performance
gradually degrades, the speed decreasing below 1 kbps, until replication
ultimately gets backed up. Once we reload the core on the slave it will work
fine for some time and then it gets backed up again. We have mergeFactor set
to 10 and ramBufferSizeMB set to 32MB; Solr itself is running
with 2GB of memory and lockType is simple on both master and slave.

How big is your index? How many rows, and how many GB?

Every time you replicate, several caches are reset. So if you are constantly
indexing, you need to be careful about how that performance impact will apply.


I am hoping that the following questions might help me understand the
replication performance issue better (Replication Configuration is
given at the end of the email)

1. Does the slave get the whole index every time during replication, or
just the delta since the last replication?


It depends. If you do an OPTIMIZE every time you index, then you will be
sending the whole index down.
If more than 10 segments (your mergeFactor) have cycled since the last
replication, I believe that might also trigger a whole-index copy, since
you cycled all the segments.
In that case I think you might want to increase the mergeFactor.
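For reference, mergeFactor is set in solrconfig.xml; a larger value keeps more segments around before a merge cycles them all (the values below are illustrative, not recommendations):

```xml
<indexDefaults>
  <mergeFactor>20</mergeFactor>
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexDefaults>
```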



2. If there is a huge number of queries being done on the slave, will it
affect replication? How can I improve the performance? (see the
replication details at the bottom of the page)

It seems that might be one way that you get the index.* directories. At
least I see it more frequently when there is huge load and you are trying
to replicate.
You could replicate less frequently.


3. Will the segment names be the same on master and slave after
replication? I see that they are different. Is this correct? If it
is, how does the slave know what to fetch the next time, i.e.
the delta?

Yes, they had better be. In the old days you could just rsync the data
directory from master to slave and reload the core, and that worked fine.


4. When and why does the index.TIMESTAMP folder get created ? I see
this type of folder getting created only on slave and the slave
instance is pointing to it.

I would love to know all the conditions... I believe it is supposed to
replicate to index.*, then reload to point to it. But sometimes it gets
stuck in index.* land and never goes back to straight index.

There are several bug fixes for this in 3.1.


5. Does replication process copy both the index and index.TIMESTAMP
folder ?

I believe it is supposed to copy the segment or whole index/ from master
to index.* on slave.


6. What happens if replication kicks off before the previous
invocation has completed? Will the 2nd invocation block, or will
it go through, causing more confusion?

That is not supposed to happen: if a replication is in progress, it should
not copy again until that one is complete.
Try it: delete data/*, restart Solr, and force a replication;
while it is syncing, force it again. It does not seem to work for me.
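For forcing and inspecting replication over HTTP, the slave's ReplicationHandler accepts commands as request parameters. A small sketch of the URLs involved (the host, port, and core path are assumptions; adjust for your deployment):

```python
# Sketch: URLs for the slave-side ReplicationHandler commands.
# Host/port/core path below are hypothetical.
BASE = "http://localhost:8983/solr/replication"

def replication_url(command):
    return BASE + "?command=" + command

details_url = replication_url("details")    # report replication status
fetch_url = replication_url("fetchindex")   # force a pull from the master
print(fetch_url)
```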

7. If I have to prep a new master-slave combination, is it OK to copy
the respective contents into the new master and slave and start Solr? Or
do I have to wipe the new slave and let it replicate from its new
master?

If you shut down the slave, copy the data/* directory and restart, you
should be fine. That is how we fix the data/ dir when
there is corruption.

8. Doing an 'ls | wc -l' on the index folder of master and slave gave 194
and 17968 respectively... The slave has a lot of segments_xxx files. Is
this normal?

Several bugs fixed in 3.1 for this one. Not a good thing. You are
getting leftover segments or index.* directories.

MASTER

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>

    <str name="confFiles">schema.xml,stopwords.txt</str>
    <str name="commitReserveDuration">00:00:10</str>
  </lst>
</requestHandler>


SLAVE

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">master core url</str>
    <str name="pollInterval">00:03:00</str>
    <str name="compression">internal</str>
    <str name="httpConnTimeout">5000</str>
    <str name="httpReadTimeout">1</str>
  </lst>
</requestHandler>


REPLICATION DETAILS FROM PAGE

Master master core url
Poll Interval 00:03:00
Local Index Index Version: