Cache fails to warm after Replication Recovery in solr cloud

2019-12-24 Thread Cao, Li
Hi!

I have a custom cache set up in solrconfig.xml for a SolrCloud cluster in 
Kubernetes. Each node has Kubernetes persistence set up. After I execute a 
“delete pod” command to restart a node, it goes through replication recovery 
successfully, but my custom cache’s warm() method never gets called. Is this 
expected behavior? The events I observed are:

  1.  Cache init() method called
  2.  Searcher created and registered
  3.  Replication recovery
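
For reference, the cache is declared along these lines in solrconfig.xml (the
names, sizes, and classes below are placeholders, not the exact ones we use):

  <cache name="myCustomCache"
         class="com.example.MyCustomCache"
         size="1024"
         autowarmCount="128"
         regenerator="com.example.MyCacheRegenerator"/>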


Thanks!

Li


Re: [EXTERNAL] Autoscaling simulation error

2019-12-23 Thread Cao, Li
Thank you for creating the JIRA! Will follow

On 12/19/19, 11:09 AM, "Andrzej Białecki"  wrote:

Hi,

Thanks for the data. I see the problem now - it’s a bug in the simulator. I 
filed a Jira issue to track and fix it: SOLR-14122.

> On 16 Dec 2019, at 19:13, Cao, Li  wrote:
>
>> I am using solr 8.3.0 in cloud mode. I have collection level autoscaling 
policy and the collection name is “entity”. But when I run autoscaling 
simulation all the steps failed with this message:
>>
>>   "error":{
>> "exception":"java.io.IOException: 
java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: 
org.apache.solr.common.SolrException: Could not find collection : 
entity/shards",
>> "suggestion":{
>>   "type":"repair",
>>   "operation":{
>> "method":"POST",
>> "path":"/c/entity/shards",
>> "command":{"add-replica":{
>> "shard":"shard2",
>> "node":"my_node:8983_solr",
>> "type":"TLOG",
>> "replicaInfo":null}}},





Re: [EXTERNAL] Re: "No value present" when set cluster policy for autoscaling in solr cloud mode

2019-12-23 Thread Cao, Li
Thank you, Andrzej! I am going to try the IN operand as a workaround.

On 12/19/19, 10:17 AM, "Andrzej Białecki"  wrote:

Hi,

For some strange reason global tags (such as “cores”) don’t support the 
“nodeset” syntax. For “cores” the only supported attribute is “node”, and then 
you’re only allowed to use #ANY or a single specific node name (with an optional 
“!” NOT operand), or a JSON array containing node names to indicate the IN 
operand.
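
So a rule along these lines should be accepted (the node names here are just
invented examples):

{ "set-cluster-policy" : [
    { "cores" : "<3",
      "node"  : ["nodeA:8983_solr", "nodeB:8983_solr"] } ] }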

The Ref Guide indeed is not very clear on that…


> On 17 Dec 2019, at 21:20, Cao, Li  wrote:
>
> Hi!
>
> I am trying to add a cluster policy to a freshly built 8.3.0 cluster (no 
collection added). I got this error when adding such a cluster policy
>
> { 
"set-cluster-policy":[{"cores":"<3","nodeset":{"sysprop.rex.node.type":"tlog"}}]}
>
> Basically I want to limit the number of cores for certain machines with a 
special environmental variable value.
>
> But I got this error response:
>
> {
>  "responseHeader":{
>"status":400,
>"QTime":144},
>  "result":"failure",
>  "WARNING":"This response format is experimental.  It is likely to change 
in the future.",
>  "error":{
>"metadata":[
>  "error-class","org.apache.solr.api.ApiBag$ExceptionWithErrObject",
>  
"root-error-class","org.apache.solr.api.ApiBag$ExceptionWithErrObject"],
>"details":[{
>"set-cluster-policy":[{
>"cores":"<3",
>"nodeset":{"sysprop.rex.node.type":"tlog"}}],
>    "errorMessages":["No value present"]}],
>"msg":"Error in command payload",
>"code":400}}
>
> However, this works:
>
> { "set-cluster-policy":[{"cores":"<3","node":"#ANY"}]}
>
> I read the autoscaling policy documentations and cannot figure out why. 
Could someone help me on this?
>
> Thanks!
>
> Li




"No value present" when set cluster policy for autoscaling in solr cloud mode

2019-12-17 Thread Cao, Li
Hi!

I am trying to add a cluster policy to a freshly built 8.3.0 cluster (no 
collection added). I got this error when adding the following cluster policy:

{ 
"set-cluster-policy":[{"cores":"<3","nodeset":{"sysprop.rex.node.type":"tlog"}}]}

Basically I want to limit the number of cores for certain machines that have a 
special environment variable value.

But I got this error response:

{
  "responseHeader":{
"status":400,
"QTime":144},
  "result":"failure",
  "WARNING":"This response format is experimental.  It is likely to change in 
the future.",
  "error":{
"metadata":[
  "error-class","org.apache.solr.api.ApiBag$ExceptionWithErrObject",
  "root-error-class","org.apache.solr.api.ApiBag$ExceptionWithErrObject"],
"details":[{
"set-cluster-policy":[{
"cores":"<3",
"nodeset":{"sysprop.rex.node.type":"tlog"}}],
"errorMessages":["No value present"]}],
"msg":"Error in command payload",
"code":400}}

However, this works:

{ "set-cluster-policy":[{"cores":"<3","node":"#ANY"}]}

I read the autoscaling policy documentation and cannot figure out why. Could 
someone help me with this?

Thanks!

Li


Re: [EXTERNAL] Re: Autoscaling simulation error

2019-12-16 Thread Cao, Li
Hi Andrzej ,

I have put the JSONs produced by "save" commands below:

autoscalingState.json - https://pastebin.com/CrR0TdLf
clusterState.json - https://pastebin.com/zxuYAMux
nodeState.json https://pastebin.com/hxqjVUfV
statistics.json https://pastebin.com/Jkaw8Y3j

The simulate command is:
/opt/solr-8.3.0/bin/solr autoscaling -a policy2.json -simulate  -zkHost 
rexcloud-swoods-zookeeper-headless:2181

Policy2 can be found here:
https://pastebin.com/VriJ27DE

Setup:
12 nodes on Kubernetes: 6 for TLOG and 6 for PULL replicas. The simulation is run on one 
of the nodes inside Kubernetes because it needs access to the ZooKeeper ensemble inside Kubernetes.

Thanks!

Li


On 12/15/19, 5:13 PM, "Andrzej Białecki"  wrote:

Could you please provide the exact command-line? It would also help if you 
could provide an autoscaling snapshot of the cluster (bin/solr autoscaling 
-save ) or at least the autoscaling diagnostic info.

(Please note that the mailing list removes all attachments, so just provide 
a link to the snapshot).
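
Something along these lines should produce the snapshot (the output directory
and the ZK address are up to you):

bin/solr autoscaling -zkHost <zk-host>:2181 -save /tmp/autoscaling-snapshot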


> On 15 Dec 2019, at 18:42, Cao, Li  wrote:
>
> Hi!
>
> I am using solr 8.3.0 in cloud mode. I have collection level autoscaling 
policy and the collection name is “entity”. But when I run autoscaling 
simulation all the steps failed with this message:
>
>"error":{
>  "exception":"java.io.IOException: 
java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: 
org.apache.solr.common.SolrException: Could not find collection : 
entity/shards",
>  "suggestion":{
>"type":"repair",
>"operation":{
>  "method":"POST",
>  "path":"/c/entity/shards",
>  "command":{"add-replica":{
>  "shard":"shard2",
>      "node":"my_node:8983_solr",
>  "type":"TLOG",
>  "replicaInfo":null}}},
>
> Does anyone know how to fix this? Is this a bug?
>
> Thanks!
>
> Li




Autoscaling simulation error

2019-12-15 Thread Cao, Li
Hi!

I am using Solr 8.3.0 in cloud mode. I have a collection-level autoscaling policy, 
and the collection name is “entity”. But when I run the autoscaling simulation, all 
the steps fail with this message:

"error":{
  "exception":"java.io.IOException: 
java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: 
org.apache.solr.common.SolrException: Could not find collection : 
entity/shards",
  "suggestion":{
"type":"repair",
"operation":{
  "method":"POST",
  "path":"/c/entity/shards",
  "command":{"add-replica":{
  "shard":"shard2",
  "node":"my_node:8983_solr",
  "type":"TLOG",
  "replicaInfo":null}}},

Does anyone know how to fix this? Is this a bug?

Thanks!

Li


Re: what's in cursorMark

2018-10-01 Thread Li, Yi
Hi,

Did you just do Base64 decoding?

Thanks,
Yi

On 10/1/18, 9:41 AM, "Vincenzo D'Amore"  wrote:

Hi Yi,

have you tried to decode the string?

AoE/E2Zhdm9yaXRlUGxhY2UvZjg1MzMzYzEtYzQ0NC00Y2ZiLWFmZDctMzcyODFhMDdiMGY3

seems to be only:

? favoritePlace/f85333c1-c444-4cfb-afd7-37281a07b0f7
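
(just plain Base64, e.g. something like:

echo 'AoE/E2Zhdm9yaXRlUGxhY2UvZjg1MzMzYzEtYzQ0NC00Y2ZiLWFmZDctMzcyODFhMDdiMGY3' | base64 -d

which prints a couple of leading binary bytes followed by the readable sort value)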



On Mon, Oct 1, 2018 at 3:37 PM Li, Yi  wrote:

> Hi,
>
> cursorMark appears as something like
> AoE/E2Zhdm9yaXRlUGxhY2UvZjg1MzMzYzEtYzQ0NC00Y2ZiLWFmZDctMzcyODFhMDdiMGY3
>
> and the document says it is “Base64 encoded serialized representation of
> the sort values encapsulated by this object”
>
> I like to know if I can decode and what content I will see in there.
>
> For example, If there is an object as a json:
> {
> “id”:”123”,
> “name”:”objectname”,
> “secret”:”my secret”
> }
> if I search id:123, and only that object returned with a cursorMark, will
> I be able to decode the cursorMark and get that secret?
>
> Thanks,
> Yi
>


-- 
Vincenzo D'Amore




what's in cursorMark

2018-10-01 Thread Li, Yi
Hi,

cursorMark appears as something like 
AoE/E2Zhdm9yaXRlUGxhY2UvZjg1MzMzYzEtYzQ0NC00Y2ZiLWFmZDctMzcyODFhMDdiMGY3

and the document says it is “Base64 encoded serialized representation of the 
sort values encapsulated by this object”

I'd like to know if I can decode it, and what content I will see in there.

For example, if there is an object like this JSON:
{
“id”:”123”,
“name”:”objectname”,
“secret”:”my secret”
}
If I search id:123 and only that object is returned with a cursorMark, will I be 
able to decode the cursorMark and get that secret?

Thanks,
Yi


Running Solr 5.3.1 with JDK10

2018-06-19 Thread Li, Yi
Hi,

Currently we are running Solr 5.3.1 with JDK 8 and we are trying to run Solr 
5.3.1 with JDK 10. Initially we got a few errors complaining that some JVM options 
had been removed since JDK 9. We removed those options in solr.in.sh:
UseConcMarkSweepGC
UseParNewGC
PrintHeapAtGC
PrintGCDateStamps
PrintGCTimeStamps
PrintTenuringDistribution
PrintGCApplicationStoppedTime

And the options left in solr.in.sh:

# Enable verbose GC logging
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails"

# These GC settings have shown to work well for a number of common Solr
# workloads
GC_TUNE="-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
-XX:+CMSScavengeBeforeRemark \
-XX:PretenureSizeThreshold=64m \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled"

After that, Solr runs, but it logs an error from SystemInfoHandler: "Error getting JMX 
properties".
[root@centos6 logs]# service solr status
Found 1 Solr nodes:
Solr process 4630 running on port 8983
ERROR: Failed to get system information from http://localhost:8983/solr due to: 
java.lang.NullPointerException

Can someone share their experience using Solr 5.3.x with JDK 9 or above?
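
For what it's worth, on JDK 10 we are also considering replacing the settings
above with G1 and unified GC logging, roughly like this (untested; the log
path and pause target are placeholders):

GC_LOG_OPTS="-Xlog:gc*:file=/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M"
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"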

Thanks,
Yi

P.S.
Console output:
[0.001s][warning][gc] -Xloggc is deprecated. Will use 
-Xlog:gc:/solr/logs/solr_gc.log instead.
[0.001s][warning][gc] -XX:+PrintGCDetails is deprecated. Will use -Xlog:gc* 
instead.
[0.003s][info ][gc] Using Serial
WARNING: System properties and/or JVM args set. Consider using --dry-run or 
--exec
0 INFO (main) [ ] o.e.j.u.log Logging initialized @532ms
205 INFO (main) [ ] o.e.j.s.Server jetty-9.2.11.v20150529
218 WARN (main) [ ] o.e.j.s.h.RequestLogHandler !RequestLog
220 INFO (main) [ ] o.e.j.d.p.ScanningAppProvider Deployment monitor 
file:/home/solr/solr-5.3.1/server/contexts/ at interval 0
559 INFO (main) [ ] o.e.j.w.StandardDescriptorProcessor NO JSP Support for 
/solr, did not find org.apache.jasper.servlet.JspServlet
569 WARN (main) [ ] o.e.j.s.SecurityHandler 
ServletContext@o.e.j.w.WebAppContext@1a75e76a
{/solr,file:/home/solr/solr-5.3.1/server/solr-webapp/webapp/,STARTING} 
{/home/solr/solr-5.3.1/server/solr-webapp/webapp} has uncovered http methods 
for path: /
577 INFO (main) [ ] o.a.s.s.SolrDispatchFilter SolrDispatchFilter.init(): 
WebAppClassLoader=1904783235@7188af83
625 INFO (main) [ ] o.a.s.c.SolrResourceLoader JNDI not configured for solr 
(NoInitialContextEx)
626 INFO (main) [ ] o.a.s.c.SolrResourceLoader using system property 
solr.solr.home: /solr/data
627 INFO (main) [ ] o.a.s.c.SolrResourceLoader new SolrResourceLoader for 
directory: '/solr/data/'
750 INFO (main) [ ] o.a.s.c.SolrXmlConfig Loading container configuration from 
/solr/data/solr.xml
817 INFO (main) [ ] o.a.s.c.CoresLocator Config-defined core root directory: 
/solr/data
[1.402s][info ][gc] GC(0) Pause Full (Metadata GC Threshold) 85M->7M(490M) 
37.281ms
875 INFO (main) [ ] o.a.s.c.CoreContainer New CoreContainer 1193398802
875 INFO (main) [ ] o.a.s.c.CoreContainer Loading cores into CoreContainer 
[instanceDir=/solr/data/]
875 INFO (main) [ ] o.a.s.c.CoreContainer loading shared library: /solr/data/lib
875 WARN (main) [ ] o.a.s.c.SolrResourceLoader Can't find (or read) directory 
to add to classloader: lib (resolved as: /solr/data/lib).
889 INFO (main) [ ] o.a.s.h.c.HttpShardHandlerFactory created with 
socketTimeout : 60,connTimeout : 6,maxConnectionsPerHost : 
20,maxConnections : 1,corePoolSize : 0,maximumPoolSize : 
2147483647,maxThreadIdleTime : 5,sizeOfQueue : -1,fairnessPolicy : 
false,useRetries : false,
1036 INFO (main) [ ] o.a.s.u.UpdateShardHandler Creating UpdateShardHandler 
HTTP client with params: socketTimeout=60=6=true
1038 INFO (main) [ ] o.a.s.l.LogWatcher SLF4J impl is 
org.slf4j.impl.Log4jLoggerFactory
1039 INFO (main) [ ] o.a.s.l.LogWatcher Registering Log Listener [Log4j 
(org.slf4j.impl.Log4jLoggerFactory)]
1040 INFO (main) [ ] o.a.s.c.CoreContainer Security conf doesn't exist. 
Skipping setup for authorization module.
1041 INFO (main) [ ] o.a.s.c.CoreContainer No authentication plugin used.
1179 INFO (main) [ ] o.a.s.c.CoresLocator Looking for core definitions 
underneath /solr/data
1180 INFO (main) [ ] o.a.s.c.CoresLocator Found 0 core definitions
1185 INFO (main) [ ] o.a.s.s.SolrDispatchFilter 
user.dir=/home/solr/solr-5.3.1/server
1186 INFO (main) [ ] o.a.s.s.SolrDispatchFilter SolrDispatchFilter.init() done
1216 INFO (main) [ ] o.e.j.s.h.ContextHandler Started 
o.e.j.w.WebAppContext@1a75e76a{/solr,file:/home/solr/solr-5.3.1/server/solr-webapp/webapp/,AVAILABLE}{/home/solr/solr-5.3.1/server/solr-webapp/webapp}
1224 INFO (main) [ ] o.e.j.s.ServerConnector Started ServerConnector@2102a4d5
{HTTP/1.1} {0.0.0.0:8983}
1228 INFO (main) [ ] o.e.j.s.Server Started @1762ms
14426 WARN (qtp1045997582-15) [ ] o.a.s.h.a.SystemInfoHandler Error 

Problem encountered upon starting Solr after improper exit

2018-03-14 Thread YIFAN LI
To whom it may concern,

I am running Solr 7.1.0 and encountered a problem starting Solr after I
killed the Java process running Solr without proper cleanup. The error
message that I received is as follows:

solr-7.1.0 liyifan$ bin/solr run


dyld: Library not loaded: /usr/local/opt/mpfr/lib/libmpfr.4.dylib

  Referenced from: /usr/local/bin/awk

  Reason: image not found

Your current version of Java is too old to run this version of Solr

We found version , using command 'java -version', with response:

java version "1.8.0_45"

Java(TM) SE Runtime Environment (build 1.8.0_45-b14)

Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)


Please install latest version of Java 1.8 or set JAVA_HOME properly.


Debug information:

JAVA_HOME: N/A

Active Path:

/Users/liyifan/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/opt/local/bin:/opt/local/sbin:/usr/Documents/2016\
Spring/appcivist-mobilization/activator-dist-1.3.9/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin:/opt/X11/bin:/usr/local/git/bin

After I reset the JAVA_HOME variable, it still gives me the error:

bin/solr start

dyld: Library not loaded: /usr/local/opt/mpfr/lib/libmpfr.4.dylib

  Referenced from: /usr/local/bin/awk

  Reason: image not found

Your current version of Java is too old to run this version of Solr

We found version , using command
'/Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/bin/java
-version', with response:

java version "1.8.0_45"

Java(TM) SE Runtime Environment (build 1.8.0_45-b14)

Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)


Please install latest version of Java 1.8 or set JAVA_HOME properly.


Debug information:

JAVA_HOME: /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home

Active Path:

/Users/liyifan/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/opt/local/bin:/opt/local/sbin:/usr/Documents/2016\
Spring/appcivist-mobilization/activator-dist-1.3.9/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin:/opt/X11/bin:/usr/local/git/bin:/Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/bin

and the directory /usr/local/opt/mpfr/lib/ only contains the following files:

ls /usr/local/opt/mpfr/lib/

libmpfr.6.dylib libmpfr.a libmpfr.dylib pkgconfig

Do you think this problem is caused by killing the Java process without
proper cleanup? Could you suggest some solution to this problem? Thank you
very much!

Best,
Yifan


Re: Disable leaders in SolrCloud mode

2016-05-16 Thread Li Ding
This happened the second time I performed a restart.  But after that,
every time, this collection is stuck here.  If I restart the leader node
as well, the core can get out of the recovering state.

On Mon, May 16, 2016 at 5:00 PM, Li Ding <li.d...@bloomreach.com> wrote:

> Hi Anshum,
>
> This is for restart solr with 1000 collections.  I created an environment
> with 1023 collections today All collections are empty.  During repeated
> restart test, one of the cores are marked as "recovering" and stuck there
> for ever.   The solr is 4.6.1 and we have 3 zk hosts and 8 solr hosts, here
> is the relevant logs:
>
> ---This is the logs for the core stuck at "recovering"
>
> INFO  - 2016-05-16 22:47:04.984; org.apache.solr.cloud.ZkController;
> publishing core=test_collection_112_shard1_replica2 state=down
>
> INFO  - 2016-05-16 22:47:05.999; org.apache.solr.core.SolrCore;
> [test_collection_112_shard1_replica2]  CLOSING SolrCore
> org.apache.solr.core.SolrCore@1e48619
>
> INFO  - 2016-05-16 22:47:06.001; org.apache.solr.core.SolrCore;
> [test_collection_112_shard1_replica2] Closing main searcher on request.
>
> INFO  - 2016-05-16 22:47:06.001;
> org.apache.solr.core.CachingDirectoryFactory; looking to close /mnt
> /solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index
> [CachedDir<

Re: Disable leaders in SolrCloud mode

2016-05-16 Thread Li Ding
Hi Anshum,

This is for restarting Solr with 1000 collections.  I created an environment
with 1023 collections today; all collections are empty.  During repeated
restart tests, one of the cores is marked as "recovering" and stuck there
forever.   The Solr version is 4.6.1 and we have 3 ZK hosts and 8 Solr hosts; here
are the relevant logs:

---These are the logs for the core stuck at "recovering":

INFO  - 2016-05-16 22:47:04.984; org.apache.solr.cloud.ZkController;
publishing core=test_collection_112_shard1_replica2 state=down

INFO  - 2016-05-16 22:47:05.999; org.apache.solr.core.SolrCore;
[test_collection_112_shard1_replica2]  CLOSING SolrCore
org.apache.solr.core.SolrCore@1e48619

INFO  - 2016-05-16 22:47:06.001; org.apache.solr.core.SolrCore;
[test_collection_112_shard1_replica2] Closing main searcher on request.

INFO  - 2016-05-16 22:47:06.001;
org.apache.solr.core.CachingDirectoryFactory; looking to close /mnt
/solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index

Disable leaders in SolrCloud mode

2016-05-16 Thread Li Ding
Hi all,

We have a unique scenario where we don't need leaders in every collection
to recover from failures.  The index never changes.  But we have faced
problems where either ZK marked a core as down while the core was fine for
non-distributed queries, or, during restart, the core never came up.  My
question is: is there any simple way to disable leaders and
leader election in SolrCloud?  We do use multi-shard and distributed
queries, but in our situation we don't need leaders to maintain
the correct status of the index.  So if we could get rid of that part, our
Solr restarts would be more robust.

Any suggestions will be appreciated.

Thanks,

Li


Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

2016-04-27 Thread Li Ding
Hi Erick,

I don't have the GC logs.  But after the GC finished, shouldn't the ZK ping
succeed and the core come back to the normal state?  From the log I
posted, the sequence is:

1) Solr detects it can't connect to ZK and reconnects to ZK
2) Solr marks all cores as down
3) Solr recovers each core; some succeed, some fail
4) After 30 minutes, the cores that failed are still marked as down

So my question is: during the 30-minute interval, if GC took too long,
all cores should have failed.  And GC doesn't take longer than a minute, since
all requests served to other cores succeed, and the next ZK ping should
bring the core back to normal, right?  We have an active monitor running at
the same time querying every core in distrib=false mode, and every query
succeeds.
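
For next time, we can start Solr with GC logging enabled so we can check the
pause times ourselves; a sketch (the log path is a placeholder):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
-XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log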

Thanks,

Li

On Tue, Apr 26, 2016 at 6:20 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> One of the reasons this happens is if you have very
> long GC cycles, longer than the Zookeeper "keep alive"
> timeout. During a full GC pause, Solr is unresponsive and
> if the ZK ping times out, ZK assumes the machine is
> gone and you get into this recovery state.
>
> So I'd collect GC logs and see if you have any
> stop-the-world GC pauses that take longer than the ZK
> timeout.
>
> see Mark Millers primer on GC here:
> https://lucidworks.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
>
> Best,
> Erick
>
> On Tue, Apr 26, 2016 at 2:13 PM, Li Ding <li.d...@bloomreach.com> wrote:
> > Thank you all for your help!
> >
> > The zookeeper log rolled over; this is from solr.log:
> >
> > Looks like the solr and zk connection is gone for some reason
> >
> > INFO  - 2016-04-21 12:37:57.536;
> > org.apache.solr.common.cloud.ConnectionManager; Watcher
> > org.apache.solr.common.cloud.ConnectionManager@19789a96
> > name:ZooKeeperConnection Watcher:{ZK HOSTS here} got event WatchedEvent
> > state:Disconnected type:None path:null path:null type:None
> >
> > INFO  - 2016-04-21 12:37:57.536;
> > org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
> >
> > INFO  - 2016-04-21 12:38:24.248;
> > org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection
> expired
> > - starting a new one...
> >
> > INFO  - 2016-04-21 12:38:24.262;
> > org.apache.solr.common.cloud.ConnectionManager; Waiting for client to
> > connect to ZooKeeper
> >
> > INFO  - 2016-04-21 12:38:24.269;
> > org.apache.solr.common.cloud.ConnectionManager; Connected:true
> >
> >
> > Then it publishes all cores on the hosts are down.  I just list three
> cores
> > here:
> >
> > INFO  - 2016-04-21 12:38:24.269; org.apache.solr.cloud.ZkController;
> > publishing core=product1_shard1_replica1 state=down
> >
> > INFO  - 2016-04-21 12:38:24.271; org.apache.solr.cloud.ZkController;
> > publishing core=collection1 state=down
> >
> > INFO  - 2016-04-21 12:38:24.272; org.apache.solr.cloud.ZkController;
> > numShards not found on descriptor - reading it from system property
> >
> > INFO  - 2016-04-21 12:38:24.289; org.apache.solr.cloud.ZkController;
> > publishing core=product2_shard5_replica1 state=down
> >
> > INFO  - 2016-04-21 12:38:24.292; org.apache.solr.cloud.ZkController;
> > publishing core=product2_shard13_replica1 state=down
> >
> >
> > product1 has only one shard one replica and it's able to be active
> > successfully:
> >
> > INFO  - 2016-04-21 12:38:26.383; org.apache.solr.cloud.ZkController;
> > Register replica - core:product1_shard1_replica1 address:http://
> > {internalIp}:8983/solr collection:product1 shard:shard1
> >
> > WARN  - 2016-04-21 12:38:26.385; org.apache.solr.cloud.ElectionContext;
> > cancelElection did not find election node to remove
> >
> > INFO  - 2016-04-21 12:38:26.393;
> > org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader
> > process for shard shard1
> >
> > INFO  - 2016-04-21 12:38:26.399;
> > org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found
> to
> > continue.
> >
> > INFO  - 2016-04-21 12:38:26.399;
> > org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new
> leader -
> > try and sync
> >
> > INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
> > replicas to http://{internalIp}:8983/solr/product1_shard1_replica1/
> >
> > INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
> > Success - now sync replicas to me
> >
> > INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy;
> > http://{internalIp}:898

Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

2016-04-26 Thread Li Ding
;
org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader:
http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
shard5

INFO  - 2016-04-21 12:38:26.632; org.apache.solr.common.cloud.SolrZkClient;
makePath: /collections/product2_shard5_replica1/leaders/shard5

INFO  - 2016-04-21 12:38:26.645; org.apache.solr.cloud.ZkController; We are
http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ and
leader is http://{internalIp}:8983/solr
product2_shard5_replica1_shard5_replica1/

INFO  - 2016-04-21 12:38:26.646;
org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from
ZooKeeper...


Before I restarted this server, a bunch of queries failed for this
collection product2.  But I don't think that would affect the core status.


Do you have any idea why this particular core is not published
as active?  From the log, most steps are done except the very last one,
publishing the info to ZK.


Thanks,


Li
On Thu, Apr 21, 2016 at 7:08 AM, Rajesh Hazari <rajeshhaz...@gmail.com>
wrote:

> Hi Li,
>
> Do you see timeouts like "CLUSTERSTATUS the collection time out:180s"?
> If that's the case, this may be related to
> https://issues.apache.org/jira/browse/SOLR-7940,
> and i would say either use the patch file or upgrade.
>
>
> *Thanks,*
> *Rajesh,*
> *8328789519,*
> *If I don't answer your call please leave a voicemail with your contact
> info, *
> *will return your call ASAP.*
>
> On Thu, Apr 21, 2016 at 6:02 AM, YouPeng Yang <yypvsxf19870...@gmail.com>
> wrote:
>
> > Hi
> >We have used Solr 4.6 for 2 years. If you post more logs, maybe we can
> > fix it.
> >
> > 2016-04-21 6:50 GMT+08:00 Li Ding <li.d...@bloomreach.com>:
> >
> > > Hi All,
> > >
> > > We are using SolrCloud 4.6.1.  We have observed following behaviors
> > > recently.  A Solr node in a Solrcloud cluster is up but some of the
> cores
> > > on the nodes are marked as down in Zookeeper.  If the cores are parts
> of
> > a
> > > multi-sharded collection with one replica,  the queries to that
> > collection
> > > will fail.  However, when this happened, if we issue queries to the
> core
> > > directly, it returns 200 and correct info.  But once Solr got into the
> > > state, the core will be marked down forever unless we do a restart on
> > Solr.
> > >
> > > Has anyone seen this behavior before?  Is there any to get out of the
> > state
> > > on its own?
> > >
> > > Thanks,
> > >
> > > Li
> > >
> >
>


Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

2016-04-20 Thread Li Ding
Hi All,

We are using SolrCloud 4.6.1.  We have observed the following behavior
recently.  A Solr node in a SolrCloud cluster is up, but some of the cores
on the node are marked as down in ZooKeeper.  If the cores are part of a
multi-sharded collection with one replica, the queries to that collection
will fail.  However, when this happens, if we issue queries to the core
directly, it returns 200 and correct info.  But once Solr gets into this
state, the core will be marked down forever unless we do a restart of Solr.

Has anyone seen this behavior before?  Is there any way for it to get out of
the state on its own?

Thanks,

Li


Re: Why are these two queries different?

2015-05-12 Thread Frank li
Thanks for your help. I figured it out, just as you said. Appreciate your
help. Somehow I forgot to reply to your post.
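
For the archives, the change was along the lines you described: making the
query analyzer stop splitting numbers on commas. Roughly (a sketch only; the
tokenizer and attribute values depend on the actual fieldType, which I didn't
post here):

<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- keep "4,568,649" as the single token 4568649 instead of emitting 4 / 568 / 649 -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="0" catenateNumbers="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>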

On Wed, Apr 29, 2015 at 9:24 AM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : We did two SOLR qeries and they supposed to return the same results but
 : did not:

 the short answer is: if you want those queries to return the same results,
 then you need to adjust your query time analyzer for the all_text field to
 not split intra-numeric tokens on ','

 i don't know *why* exactly it's doing that, because you didn't give us the
 full details of your field/fieldtypes (or other really important info: the
 full request params -- echoParams=all -- and the documents matched by your
 second query, etc... https://wiki.apache.org/solr/UsingMailingLists )
 ... but that's the reason the queries are different as evident from the
 parsedquery output.


 : Query 1: all_text:(US 4,568,649 A)
 :
 : parsedquery: (+((all_text:us ((all_text:4 all_text:568 all_text:649
 : all_text:4568649)~4))~2))/no_coord,
 :
 : Result: numFound: 0,
 :
 : Query 2: all_text:(US 4568649)
 :
 : parsedquery: (+((all_text:us all_text:4568649)~2))/no_coord,
 :
 :
 : Result: numFound: 2,
 :
 :
 : We assumed the two return the same result. Our default operator is AND.



 -Hoss
 http://www.lucidworks.com/



Re: JSON Facet Analytics API in Solr 5.1

2015-05-10 Thread Frank li
Thank you, Yonik!

Looks cool to me. The only problem is that it is not working for me.
I see you have "cats" and "cat" in your URL. "cat" must be a field name.
What is "cats"?

We are doing a POC with facet count ascending. Your help is really important
to us.



On Sat, May 9, 2015 at 8:05 AM, Yonik Seeley ysee...@gmail.com wrote:

 curl -g 
 "http://localhost:8983/solr/techproducts/query?q=*:*&json.facet={cats:{terms:{field:cat,sort:'count+asc'}}}"
 

 Using curl with everything in the URL is definitely trickier.
 Everything needs to be URL escaped.  If it's not, curl will often
 silently do nothing.
 For example, when I had sort:'count asc' , the command above would do
 nothing.  When I remembered to URL encode the space as a +, it
 started working.

 It's definitely easier to use -d with curl...

 curl "http://localhost:8983/solr/techproducts/query" -d
 'q=*:*&json.facet={cats:{terms:{field:cat,sort:"count asc"}}}'

 That also allows you to format it nicer for reading as well:

 curl "http://localhost:8983/solr/techproducts/query" -d
 'q=*:*&json.facet=
 {cats:{terms:{
   field:cat,
   sort:"count asc"
 }}}'

 -Yonik


 On Thu, May 7, 2015 at 5:32 PM, Frank li fudon...@gmail.com wrote:
  This one does not have problem, but how do I include sort in this facet
  query. Basically, I want to write a solr query which can sort the facet
  count ascending. Something like http://localhost:8983/solr
  /demo/query?q=applejson.facet={field=price sort='count asc'}
  
 
 
  I really appreciate your help.
 
  Frank
 
  
 
 
  On Thu, May 7, 2015 at 2:24 PM, Yonik Seeley ysee...@gmail.com wrote:
 
  On Thu, May 7, 2015 at 4:47 PM, Frank li fudon...@gmail.com wrote:
   Hi Yonik,
  
   I am reading your blog. It is helpful. One question for you, for
  following
   example,
  
   curl http://localhost:8983/solr/query -d 'q=*:*rows=0
json.facet={
  categories:{
type : terms,
field : cat,
sort : { x : desc},
facet:{
  x : avg(price),
  y : sum(price)
}
  }
}
   '
  
  
   If I want to write it in the format of this:
  
 
 http://localhost:8983/solr/query?q=applejson.facet={x:'avg(campaign_ult_defendant_cnt_is)
 '}
  ,
   how do I do?
 
  What problems do you encounter when you try that?
 
  If you try that URL with curl, be aware that curly braces {} are
  special globbing characters in curl.  Turn them off with the -g
  option:
 
  curl -g 
 
 http://localhost:8983/solr/demo/query?q=applejson.facet={x:'avg(price)'}
 
  -Yonik
 



Re: JSON Facet Analytics API in Solr 5.1

2015-05-10 Thread Frank li
Here is our Solr query:

http://qa-solr:8080/solr/select?q=type:PortalCase&json.facet={categories:{terms:{field:campaign_id_ls,sort:%27count+asc%27}}}&rows=0

I replaced "cats" with "categories". It is still not working.
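
For the record, the same request written with curl -d (as suggested), with the
parameters separated by & and the sort value quoted, would look roughly like
this (the select path and field name are ours):

curl "http://qa-solr:8080/solr/select" -d \
  'q=type:PortalCase&rows=0&json.facet={categories:{terms:{field:campaign_id_ls,sort:"count asc"}}}'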

On Sun, May 10, 2015 at 12:10 AM, Frank li fudon...@gmail.com wrote:

 Thank you, Yonik!

 Looks cool to me. Only problem is it is not working for me.
 I see you have cats and cat in your URL. cat must be a field name.
 What is cats?

 We are doing a POC with facet count ascending. You help is really
 important to us.



 On Sat, May 9, 2015 at 8:05 AM, Yonik Seeley ysee...@gmail.com wrote:

 curl -g 
 http://localhost:8983/solr/techproducts/query?q=*:*json.facet={cats:{terms:{field:cat,sort:'count+asc'}}}
 

 Using curl with everything in the URL is definitely trickier.
 Everything needs to be URL escaped.  If it's not, curl will often
 silently do nothing.
 For example, when I had sort:'count asc' , the command above would do
 nothing.  When I remembered to URL encode the space as a +, it
 started working.

 It's definitely easier to use -d with curl...

 curl  http://localhost:8983/solr/techproducts/query; -d
 'q=*:*json.facet={cats:{terms:{field:cat,sort:count asc}}}'

 That also allows you to format it nicer for reading as well:

 curl  http://localhost:8983/solr/techproducts/query; -d
 'q=*:*json.facet=
 {cats:{terms:{
   field:cat,
   sort:count asc
 }}}'

 -Yonik


 On Thu, May 7, 2015 at 5:32 PM, Frank li fudon...@gmail.com wrote:
  This one does not have problem, but how do I include sort in this
 facet
  query. Basically, I want to write a solr query which can sort the facet
  count ascending. Something like http://localhost:8983/solr
  /demo/query?q=applejson.facet={field=price sort='count asc'}
  
 
 
  I really appreciate your help.
 
  Frank
 
  
 
 
  On Thu, May 7, 2015 at 2:24 PM, Yonik Seeley ysee...@gmail.com wrote:
 
  On Thu, May 7, 2015 at 4:47 PM, Frank li fudon...@gmail.com wrote:
   Hi Yonik,
  
   I am reading your blog. It is helpful. One question for you, for
  following
   example,
  
   curl http://localhost:8983/solr/query -d 'q=*:*rows=0
json.facet={
  categories:{
type : terms,
field : cat,
sort : { x : desc},
facet:{
  x : avg(price),
  y : sum(price)
}
  }
}
   '
  
  
   If I want to write it in the format of this:
  
 
 http://localhost:8983/solr/query?q=applejson.facet={x:'avg(campaign_ult_defendant_cnt_is)
 '}
  ,
   how do I do?
 
  What problems do you encounter when you try that?
 
  If you try that URL with curl, be aware that curly braces {} are
  special globbing characters in curl.  Turn them off with the -g
  option:
 
  curl -g 
 
 http://localhost:8983/solr/demo/query?q=applejson.facet={x:'avg(price)'}
 
 
  -Yonik
 





Re: JSON Facet Analytics API in Solr 5.1

2015-05-10 Thread Frank li
I figured it out now. It works. "cats" is just a name, right? It does not
matter what is used.

Really appreciate your help. This is going to be really useful. I meant
json.facet.

On Sun, May 10, 2015 at 12:13 AM, Frank li fudon...@gmail.com wrote:

 Here is our SOLR query:


 http://qa-solr:8080/solr/select?q=type:PortalCasejson.facet={categories:{terms:{field:campaign_id_ls,sort:%27count+asc%27}}}rows=0

 I replaced cats with categories. It is still not working.

 On Sun, May 10, 2015 at 12:10 AM, Frank li fudon...@gmail.com wrote:

 Thank you, Yonik!

 Looks cool to me. Only problem is it is not working for me.
 I see you have cats and cat in your URL. cat must be a field name.
 What is cats?

 We are doing a POC with facet count ascending. You help is really
 important to us.



 On Sat, May 9, 2015 at 8:05 AM, Yonik Seeley ysee...@gmail.com wrote:

 curl -g 
 http://localhost:8983/solr/techproducts/query?q=*:*json.facet={cats:{terms:{field:cat,sort:'count+asc'}}}
 

 Using curl with everything in the URL is definitely trickier.
 Everything needs to be URL escaped.  If it's not, curl will often
 silently do nothing.
 For example, when I had sort:'count asc' , the command above would do
 nothing.  When I remembered to URL encode the space as a +, it
 started working.

 It's definitely easier to use -d with curl...

 curl  http://localhost:8983/solr/techproducts/query; -d
 'q=*:*json.facet={cats:{terms:{field:cat,sort:count asc}}}'

 That also allows you to format it nicer for reading as well:

 curl  http://localhost:8983/solr/techproducts/query; -d
 'q=*:*json.facet=
 {cats:{terms:{
   field:cat,
   sort:count asc
 }}}'

 -Yonik


 On Thu, May 7, 2015 at 5:32 PM, Frank li fudon...@gmail.com wrote:
  This one does not have problem, but how do I include sort in this
 facet
  query. Basically, I want to write a solr query which can sort the facet
  count ascending. Something like http://localhost:8983/solr
  /demo/query?q=applejson.facet={field=price sort='count asc'}
  
 
 
  I really appreciate your help.
 
  Frank
 
  
 
 
  On Thu, May 7, 2015 at 2:24 PM, Yonik Seeley ysee...@gmail.com
 wrote:
 
  On Thu, May 7, 2015 at 4:47 PM, Frank li fudon...@gmail.com wrote:
   Hi Yonik,
  
   I am reading your blog. It is helpful. One question for you, for
  following
   example,
  
   curl http://localhost:8983/solr/query -d 'q=*:*rows=0
json.facet={
  categories:{
type : terms,
field : cat,
sort : { x : desc},
facet:{
  x : avg(price),
  y : sum(price)
}
  }
}
   '
  
  
   If I want to write it in the format of this:
  
 
 http://localhost:8983/solr/query?q=applejson.facet={x:'avg(campaign_ult_defendant_cnt_is)
 '}
  ,
   how do I do?
 
  What problems do you encounter when you try that?
 
  If you try that URL with curl, be aware that curly braces {} are
  special globbing characters in curl.  Turn them off with the -g
  option:
 
  curl -g 
 
 http://localhost:8983/solr/demo/query?q=applejson.facet={x:'avg(price)'}
 
 
  -Yonik
 






Re: JSON Facet Analytics API in Solr 5.1

2015-05-08 Thread Frank li
Hi Yonik,

Any update on this question?

Thanks in advance,

Frank

On Thu, May 7, 2015 at 2:49 PM, Frank li fudon...@gmail.com wrote:

 Is there any book to read so I won't ask such dummy questions? Thanks.

 On Thu, May 7, 2015 at 2:32 PM, Frank li fudon...@gmail.com wrote:

 This one does not have problem, but how do I include sort in this facet
 query. Basically, I want to write a solr query which can sort the facet
 count ascending. Something like http://localhost:8983/solr
 /demo/query?q=applejson.facet={field=price sort='count asc'}

 I really appreciate your help.

 Frank



 On Thu, May 7, 2015 at 2:24 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Thu, May 7, 2015 at 4:47 PM, Frank li fudon...@gmail.com wrote:
  Hi Yonik,
 
  I am reading your blog. It is helpful. One question for you, for
 following
  example,
 
  curl http://localhost:8983/solr/query -d 'q=*:*rows=0
   json.facet={
 categories:{
   type : terms,
   field : cat,
   sort : { x : desc},
   facet:{
 x : avg(price),
 y : sum(price)
   }
 }
   }
  '
 
 
  If I want to write it in the format of this:
 
 http://localhost:8983/solr/query?q=applejson.facet={x:'avg(campaign_ult_defendant_cnt_is)'}
 ,
  how do I do?

 What problems do you encounter when you try that?

 If you try that URL with curl, be aware that curly braces {} are
 special globbing characters in curl.  Turn them off with the -g
 option:

 curl -g 
 http://localhost:8983/solr/demo/query?q=applejson.facet={x:'avg(price)'}
 

 -Yonik






Re: JSON Facet Analytics API in Solr 5.1

2015-05-07 Thread Frank li
Hi Yonik,

I am reading your blog. It is helpful. One question for you, about the following
example:

curl http://localhost:8983/solr/query -d 'q=*:*&rows=0&
 json.facet={
   categories:{
 type : terms,
 field : cat,
 sort : { x : desc},
 facet:{
   x : avg(price),
   y : sum(price)
 }
   }
 }
'


If I want to write it in the format of
http://localhost:8983/solr/query?q=apple&json.facet={x:'avg(campaign_ult_defendant_cnt_is)'},
how do I do that?

Thanks,

Frank


On Mon, Apr 20, 2015 at 7:35 AM, Davis, Daniel (NIH/NLM) [C] 
daniel.da...@nih.gov wrote:

 Indeed - XML is not human readable if it contains colons, JSON is not
 human readable if it is too deep, and the objects/keys are not semantic.
 I also vote for flatter.

 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
 Sent: Friday, April 17, 2015 11:16 PM
 To: solr-user@lucene.apache.org
 Subject: Re: JSON Facet  Analytics API in Solr 5.1

 Flatter please.  The other nested stuff makes my head hurt.  Until
 recently I thought I was the only person on the planet who had a hard time
 mentally parsing anything but the simplest JSON, but then I learned that
 I'm not alone at all it's just that nobody is saying it. :)

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr  Elasticsearch Support * http://sematext.com/



 On Fri, Apr 17, 2015 at 7:26 PM, Trey Grainger solrt...@gmail.com wrote:

  Agreed, I also prefer the second way. I find it more readible, less
  verbose while communicating the same information, less confusing to
  mentally parse (is 'terms' the name of my facet, or the type of my
  facet?...), and less prone to syntactlcally valid, but logically
  invalid inputs.  Let's break those topics down.
 
  *1) Less verbose while communicating the same information:* The
  flatter structure is particularly useful when you have nested facets
  to reduce unnecessary verbosity / extra levels. Let's contrast the two
  approaches with just 2 levels of subfacets:
 
  ** Current Format **
  top_genres:{
  terms:{
  field: genre,
  limit: 5,
  facet:{
  top_authors:{
  terms:{
  field: author,
  limit: 4,
  facet: {
  top_books:{
  terms:{
  field: title,
  limit: 5
 }
 }
  }
  }
  }
  }
  }
  }
 
  ** Flat Format **
  top_genres:{
  type: terms,
  field: genre,
  limit: 5,
  facet:{
  top_authors:{
  type: terms
  field: author,
  limit: 4,
  facet: {
  top_books:{
  type: terms
  field: title,
  limit: 5
 }
  }
  }
  }
  }
 
  The flat format is clearly shorter and more succinct, while
  communicating the same information. What value do the extra levels add?
 
 
  *2) Less confusing to mentally parse*
  I also find the flatter structure less confusing, as I'm consistently
  having to take a mental pause with the current format to verify
  whether terms is the name of my facet or the type of my facet and
  have to count the curly braces to figure this out.  Not that I would
  name my facets like this, but to give an extreme example of why that
  extra mental calculation is necessary due to the name of an attribute
  in the structure being able to represent both a facet name and facet
 type:
 
  terms: {
  terms: {
  field: genre,
  limit: 5,
  facet: {
  terms: {
  terms:{
  field: author
  limit: 4
  }
  }
  }
  }
  }
 
  In this example, the first terms is a facet name, the second terms
  is a facet type, the third is a facet name, etc. Even if you don't
  name your facets like this, it still requires parsing someone else's
  query mentally to ensure that's not what was done.
 
  3) *Less prone to syntactically valid, but logically invalid inputs*
  Also, given this first format (where the type is indicated by one of
  several possible attributes: terms, range, etc.), what happens if I
  pass in multiple of the valid JSON attributes... the flatter structure
  prevents this from being possible (which is a good thing!):
 
  top_authors : {
  terms : {
  field : author,
  limit : 5
  },
  range : {
  field : price,
  start : 0,
  end : 100,
  gap : 20
  }
  }
 
  I don't think the response format can currently handle this without
  adding in extra levels to make it look like the input side, so this is
  an exception case even thought it seems 

Re: JSON Facet Analytics API in Solr 5.1

2015-05-07 Thread Frank li
This one does not have a problem, but how do I include sort in this facet
query? Basically, I want to write a Solr query which can sort the facet
count ascending. Something like http://localhost:8983/solr
/demo/query?q=apple&json.facet={field=price sort='count asc'}

I really appreciate your help.

Frank

On Thu, May 7, 2015 at 2:24 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Thu, May 7, 2015 at 4:47 PM, Frank li fudon...@gmail.com wrote:
  Hi Yonik,
 
  I am reading your blog. It is helpful. One question for you, for
 following
  example,
 
  curl http://localhost:8983/solr/query -d 'q=*:*rows=0
   json.facet={
 categories:{
   type : terms,
   field : cat,
   sort : { x : desc},
   facet:{
 x : avg(price),
 y : sum(price)
   }
 }
   }
  '
 
 
  If I want to write it in the format of this:
 
 http://localhost:8983/solr/query?q=applejson.facet={x:'avg(campaign_ult_defendant_cnt_is)'}
 ,
  how do I do?

 What problems do you encounter when you try that?

 If you try that URL with curl, be aware that curly braces {} are
 special globbing characters in curl.  Turn them off with the -g
 option:

 curl -g 
 http://localhost:8983/solr/demo/query?q=applejson.facet={x:'avg(price)'}

 -Yonik



Re: JSON Facet Analytics API in Solr 5.1

2015-05-07 Thread Frank li
Is there any book to read so I won't ask such dummy questions? Thanks.

On Thu, May 7, 2015 at 2:32 PM, Frank li fudon...@gmail.com wrote:

 This one does not have problem, but how do I include sort in this facet
 query. Basically, I want to write a solr query which can sort the facet
 count ascending. Something like http://localhost:8983/solr
 /demo/query?q=applejson.facet={field=price sort='count asc'}

 I really appreciate your help.

 Frank



 On Thu, May 7, 2015 at 2:24 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Thu, May 7, 2015 at 4:47 PM, Frank li fudon...@gmail.com wrote:
  Hi Yonik,
 
  I am reading your blog. It is helpful. One question for you, for
 following
  example,
 
  curl http://localhost:8983/solr/query -d 'q=*:*rows=0
   json.facet={
 categories:{
   type : terms,
   field : cat,
   sort : { x : desc},
   facet:{
 x : avg(price),
 y : sum(price)
   }
 }
   }
  '
 
 
  If I want to write it in the format of this:
 
 http://localhost:8983/solr/query?q=applejson.facet={x:'avg(campaign_ult_defendant_cnt_is)'}
 ,
  how do I do?

 What problems do you encounter when you try that?

 If you try that URL with curl, be aware that curly braces {} are
 special globbing characters in curl.  Turn them off with the -g
 option:

 curl -g 
 http://localhost:8983/solr/demo/query?q=applejson.facet={x:'avg(price)'}
 

 -Yonik





Why are these two queries different?

2015-04-27 Thread Frank li
We did two Solr queries and they were supposed to return the same results but
did not:

Query 1: all_text:(US 4,568,649 A)

parsedquery: (+((all_text:us ((all_text:4 all_text:568 all_text:649
all_text:4568649)~4))~2))/no_coord,

Result: numFound: 0,

Query 2: all_text:(US 4568649)

parsedquery: (+((all_text:us all_text:4568649)~2))/no_coord,


Result: numFound: 2,


We assumed the two queries would return the same result. Our default operator is AND.


Re: Config join parse in solrconfig.xml

2015-04-07 Thread Frank li
Cool. It actually works after I removed those extra columns. Thanks for
your help.
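
Concretely, the change amounted to trimming the df default down to a single
field, roughly like this (assuming all_text is the field we kept):

<str name="df">all_text</str>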

On Mon, Apr 6, 2015 at 8:19 PM, Erick Erickson erickerick...@gmail.com
wrote:

 df does not allow multiple fields, it stands for default field, not
 default fields. To get what you're looking for, you need to use
 edismax or explicitly create the multiple clauses.

 I'm not quite sure what the join parser is doing with the df
 parameter. So my first question is what happens if you just use a
 single field for df?.

 Best,
 Erick

 On Mon, Apr 6, 2015 at 11:51 AM, Frank li fudon...@gmail.com wrote:
  The error message was from the query with debug=query.
 
  On Mon, Apr 6, 2015 at 11:49 AM, Frank li fudon...@gmail.com wrote:
 
  Hi Erick,
 
 
  Thanks for your response.
 
  Here is the query I am sending:
 
 
 http://dev-solr:8080/solr/collection1/select?q={!join+from=litigation_id_ls+to=lit_id_lms}all_text:applefq=type:PartyLawyerLawfirmfacet=truefacet.field=lawyer_id_lmsfacet.mincount=1rows=0
  
 
 
  You can see it has all_text:apple. I added field name all_text,
  because it gives error without it.
 
  Errors:
 
  lst name=errorstr name=msgundefined field all_text number party
  name all_code ent_name/strint name=code400/int/lst
 
 
  These fields are defined as the default search fields in our
  solr_config.xml file:
 
  str name=dfall_text number party name all_code ent_name/str
 
 
  Thanks,
 
  Fudong
 
  On Fri, Apr 3, 2015 at 1:31 PM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
  You have to show us several more things:
 
  1 what exactly does the query look like?
  2 what do you expect?
  3 output when you specify debug=query
  4 anything else that would help. You might review:
 
  http://wiki.apache.org/solr/UsingMailingLists
 
  Best,
  Erick
 
  On Fri, Apr 3, 2015 at 10:58 AM, Frank li fudon...@gmail.com wrote:
   Hi,
  
   I am starting using join parser with our solr. We have some default
  fields.
   They are defined in solrconfig.xml:
  
 lst name=defaults
  str name=defTypeedismax/str
  str name=echoParamsexplicit/str
  int name=rows10/int
  str name=dfall_text number party name all_code
 ent_name/str
  str name=qfall_text number^3 name^5 party^3 all_code^2
   ent_name^7/str
  str name=flid description market_sector_type parent
  ult_parent
   ent_name title patent_title *_ls *_lms *_is *_texts *_ac *_as *_s
 *_ss
  *_ds
   *_sms *_ss *_bs/str
  str name=q.opAND/str
/lst
  
  
   I found out once I use join parser, it does not recognize the default
   fields any more. How do I modify the configuration for this?
  
   Thanks,
  
   Fred
 
 
 



Re: Config join parse in solrconfig.xml

2015-04-06 Thread Frank li
Hi Erick,


Thanks for your response.

Here is the query I am sending:
http://dev-solr:8080/solr/collection1/select?q={!join+from=litigation_id_ls+to=lit_id_lms}all_text:apple&fq=type:PartyLawyerLawfirm&facet=true&facet.field=lawyer_id_lms&facet.mincount=1&rows=0

You can see it has "all_text:apple". I added the field name "all_text" because
it gives an error without it.

Errors:

<lst name="error"><str name="msg">undefined field all_text number party
name all_code ent_name</str><int name="code">400</int></lst>


These fields are defined as the default search fields in our
solr_config.xml file:

<str name="df">all_text number party name all_code ent_name</str>


Thanks,

Fudong

On Fri, Apr 3, 2015 at 1:31 PM, Erick Erickson erickerick...@gmail.com
wrote:

 You have to show us several more things:

 1 what exactly does the query look like?
 2 what do you expect?
 3 output when you specify debug=query
 4 anything else that would help. You might review:

 http://wiki.apache.org/solr/UsingMailingLists

 Best,
 Erick

 On Fri, Apr 3, 2015 at 10:58 AM, Frank li fudon...@gmail.com wrote:
  Hi,
 
  I am starting using join parser with our solr. We have some default
 fields.
  They are defined in solrconfig.xml:
 
lst name=defaults
 str name=defTypeedismax/str
 str name=echoParamsexplicit/str
 int name=rows10/int
 str name=dfall_text number party name all_code ent_name/str
 str name=qfall_text number^3 name^5 party^3 all_code^2
  ent_name^7/str
 str name=flid description market_sector_type parent ult_parent
  ent_name title patent_title *_ls *_lms *_is *_texts *_ac *_as *_s *_ss
 *_ds
  *_sms *_ss *_bs/str
 str name=q.opAND/str
   /lst
 
 
  I found out once I use join parser, it does not recognize the default
  fields any more. How do I modify the configuration for this?
 
  Thanks,
 
  Fred



Re: Config join parse in solrconfig.xml

2015-04-06 Thread Frank li
The error message was from the query with debug=query.

On Mon, Apr 6, 2015 at 11:49 AM, Frank li fudon...@gmail.com wrote:

 Hi Erick,


 Thanks for your response.

 Here is the query I am sending:

 http://dev-solr:8080/solr/collection1/select?q={!join+from=litigation_id_ls+to=lit_id_lms}all_text:applefq=type:PartyLawyerLawfirmfacet=truefacet.field=lawyer_id_lmsfacet.mincount=1rows=0

 You can see it has all_text:apple. I added field name all_text,
 because it gives error without it.

 Errors:

 lst name=errorstr name=msgundefined field all_text number party
 name all_code ent_name/strint name=code400/int/lst


 These fields are defined as the default search fields in our
 solr_config.xml file:

 str name=dfall_text number party name all_code ent_name/str


 Thanks,

 Fudong

 On Fri, Apr 3, 2015 at 1:31 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 You have to show us several more things:

 1 what exactly does the query look like?
 2 what do you expect?
 3 output when you specify debug=query
 4 anything else that would help. You might review:

 http://wiki.apache.org/solr/UsingMailingLists

 Best,
 Erick

 On Fri, Apr 3, 2015 at 10:58 AM, Frank li fudon...@gmail.com wrote:
  Hi,
 
  I am starting using join parser with our solr. We have some default
 fields.
  They are defined in solrconfig.xml:
 
lst name=defaults
 str name=defTypeedismax/str
 str name=echoParamsexplicit/str
 int name=rows10/int
 str name=dfall_text number party name all_code ent_name/str
 str name=qfall_text number^3 name^5 party^3 all_code^2
  ent_name^7/str
 str name=flid description market_sector_type parent
 ult_parent
  ent_name title patent_title *_ls *_lms *_is *_texts *_ac *_as *_s *_ss
 *_ds
  *_sms *_ss *_bs/str
 str name=q.opAND/str
   /lst
 
 
  I found out once I use join parser, it does not recognize the default
  fields any more. How do I modify the configuration for this?
 
  Thanks,
 
  Fred





Config join parse in solrconfig.xml

2015-04-03 Thread Frank li
Hi,

I am starting to use the join parser with our Solr. We have some default fields.
They are defined in solrconfig.xml:

  <lst name="defaults">
   <str name="defType">edismax</str>
   <str name="echoParams">explicit</str>
   <int name="rows">10</int>
   <str name="df">all_text number party name all_code ent_name</str>
   <str name="qf">all_text number^3 name^5 party^3 all_code^2
ent_name^7</str>
   <str name="fl">id description market_sector_type parent ult_parent
ent_name title patent_title *_ls *_lms *_is *_texts *_ac *_as *_s *_ss *_ds
*_sms *_ss *_bs</str>
   <str name="q.op">AND</str>
 </lst>


I found out that once I use the join parser, it does not recognize the default
fields any more. How do I modify the configuration for this?

Thanks,

Fred


sort and group.sort

2014-11-19 Thread Frank li
We have a query which has both sort and group.sort. What we are expecting
is that we can use sort to order the groups, but inside each group we have a
different sort.

However, it looks like sort is overwriting the sorting order inside the groups.
Can any one of you help us on this?

Basically we want to sort the groups in one way but sort inside the group
in another way.
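
To make the intent concrete, the request looks roughly like this (the field
names here are made up):

q=*:*&group=true&group.field=campaign_id&sort=score desc&group.sort=price asc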

Thanks,

Fudong


Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Li, Ryan
Hi Shawn,

Thanks for your reply.

The memory settings of my Solr box are:

12G physical memory.
4G for Java (-Xmx4096m).
The index size is around 4G in Solr 4.9; I think it was over 6G in Solr 4.0.

I do think the Java heap size is one of the reasons for this slowness. I'm 
doing one big commit, and when the ingestion process is 50% finished, I can see the 
Solr server has already used over 90% of the full memory.

I'll try to assign more RAM to the Solr Java process. But from your experience, does 4G 
sound like a good number for the Java heap size in my scenario? Is there any way 
to reduce memory usage during index time? (One thing I know is to do a few commits 
instead of one commit.)  My concern is that, given I have 12G in total, if I 
assign too much to the Solr server, I may not have enough for the OS to cache the Solr 
index files.

I had a look at the solrconfig file, but couldn't find anything obviously 
wrong. Just wondering which parts of that config file would impact the index 
time?
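
For instance, is it settings like these that matter most? (Just the kind of
thing I have in mind; the values are placeholders.)

<!-- inside <indexConfig>: cap indexing RAM per core -->
<ramBufferSizeMB>100</ramBufferSizeMB>

<!-- inside <updateHandler>: commit periodically instead of one huge commit -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>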

Thanks,
Ryan





One possible source of problems with that particular upgrade is the fact
that stored field compression was added in 4.1, and termvector
compression was added in 4.2.  They are on by default and cannot be
turned off.  The compression is typically fast, but with very large
documents like yours, it might result in pretty major computational
overhead.  It can also require additional java heap, which ties into
what follows:

Another problem might be RAM-related.

If your java heap is very large, or just a little bit too small, there
can be major performance issues from garbage collection.  Based on the
fact that the earlier version performed well, a too-small heap is more
likely than a very large heap.

If your index size is such that it can't be effectively cached by the
amount of total RAM on the machine (minus the java heap assigned to
Solr), that can cause performance problems.  Your index size is likely
to be several gigabytes, and might even reach double-digit gigabytes.
Can you relate those numbers -- index size, java heap size, and total
system RAM?  If you can, it would also be a good idea to share your
solrconfig.xml.

Here's a wiki page that goes into more detail about possible performance
issues.  It doesn't mention the possible compression problem:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn


RE: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Li, Ryan
Hi Erick,

As Ryan Ernst noticed, those big fields (e.g. majorTextSignalStem) are not 
stored. There are a few stored fields in my schema, but they are very small 
fields, basically the name or id of the document. I tried turning them off (only 
storing the id field) and that didn't make any difference.

Thanks,
Ryan

Ryan:

As it happens, there's a discussion on the dev list about this.

If at all possible, could you try a brief experiment? Turn off
all the storage, i.e. set stored=false on all fields. It's a lot
to ask, but it'd help the discussion.

Or join the discussion at https://issues.apache.org/jira/browse/LUCENE-5914.

Best,
Erick




Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Li, Ryan
Hi Guys,

Just an update.

I've tried Solr 4.10 (same code as for Solr 4.9), and it has the same indexing 
speed as 4.0. The only problem left now is that Solr 4.10 takes more memory 
than 4.0, so I'm trying to figure out the best number for the Java heap size.

I think that proves there is a performance issue in Solr 4.9 when indexing 
big documents (even those just over 1MB).

Thanks,
Ryan


Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-03 Thread Li, Ryan
I have a Solr server that indexes 2500 documents (up to 50MB each, average 3MB). 
When running Solr 4.0 I managed to finish indexing in 3 hours.

However, after we upgraded to Solr 4.9, indexing needs 3 days to finish.

I've done some profiling, numbers I get are:
Document size (MB)    Add time (Solr 4.0)    Add time (Solr 4.9)
1.18                  6 sec                  123 sec
2.26                  12 sec                 444 sec
3.35                  18 sec                 over 600 sec
9.65                  46 sec                 timeout

From what I can see, indexing time grows roughly linearly with document size on 
Solr 4.0, but much faster than linearly on Solr 4.9. I also tried commenting out some copied fields 
to narrow down the problem; the size of the document after indexing (we copy 
fields, and the more fields we copy, the bigger the index gets) seems to be the 
dominating factor for indexing time.

Just wondering, has anyone experienced a similar problem? Does that sound like a 
bug in Solr, or have we just used Solr 4.9 wrong?

Here is one example of  field definition in my schema file.
<fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="'+" replacement=""/> <!-- strip off all apostrophe (') characters -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="../../resources/type-index-synonyms.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <!-- Used to have language="English" - seems this param is gone in 4.9 -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="'+" replacement=""/> <!-- strip off all apostrophe (') characters -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="../../resources/type-query-colloq-synonyms.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <!-- Used to have language="English" - seems this param is gone in 4.9 -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>
Field:
<field name="majorTextSignalStem" type="text_stem" indexed="true" stored="false" multiValued="true" omitNorms="false"/>
Copy:
<copyField dest="majorTextSignalStem" source="majorTextSignalRaw"/>

Thanks,
Ryan



What is the difference between attorney:(Roger Miller) and attorney:Roger Miller

2013-11-19 Thread fudong li
We got different results for these two queries. The first one returned 115
records and the second returned 179 records.

Thanks,

Fudong
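
A likely explanation (an aside for readers of the archive, not from the thread):
the field prefix binds only to the term immediately after it, so the second form
sends Miller to the default field. With debugQuery=true, and assuming the
df/q.op defaults shown elsewhere in this archive (df=all_text, q.op=AND), the
parsed queries would look roughly like:

    attorney:(Roger Miller)   ->   +attorney:roger +attorney:miller
    attorney:Roger Miller     ->   +attorney:roger +all_text:miller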


Re: Custom FunctionQuery Guide/Tutorial (4.3.0+) ?

2013-10-21 Thread fudong li
Hi Jack,

Do you have a date for the new version of your book:
solr_4x_deep_dive_early_access?

Thanks,

Fudong


On Mon, Oct 21, 2013 at 10:39 AM, Jack Krupansky j...@basetechnology.comwrote:

 Take a look at the unit tests for various value sources, and find a Jira
 that added some value source and look at the patch for what changes had to
 be made.

 -- Jack Krupansky

 -Original Message- From: JT
 Sent: Monday, October 21, 2013 1:17 PM
 To: solr-user@lucene.apache.org
 Subject: Custom FunctionQuery Guide/Tutorial (4.3.0+) ?


 Does anyone have a good link to a guide / tutorial /etc. for writing a
 custom function query in Solr 4?

 The tutorials I've seen vary from showing half the code to being written
 for older versions of Solr.


 Any type of pointers would be appreciated, thanks.



stats on dynamic fields?

2013-10-08 Thread Li Xu
Hi,

I don't seem to be able to find any info on the possibility to get stats on
dynamic fields. stats=true&stats.field=xyz_* appears to literally treat
xyz_* as the field name with a star. Is there a way to get stats on
dynamic fields without explicitly listing them in the query?

Thanks!
Li


RE: How to share config files in SolrCloud between multiple cores(collections)

2013-03-20 Thread Li, Qiang
I just want to share the solrconfig.xml and schema.xml, as there will be 
differences between collections in the other files, such as the DIH 
configurations.

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Tuesday, March 19, 2013 11:19 AM
To: solr-user@lucene.apache.org
Subject: Re: How to share config files in SolrCloud between multiple 
cores(collections)

To share configs in SolrCloud you just upload a single config set and then link 
it to multiple collections. You don't actually use solr.xml to do it.

- Mark

On Mar 19, 2013, at 10:43 AM, Li, Qiang qiang...@msci.com wrote:

 We have multiple cores with the same configuration. Before using SolrCloud, 
 we could use relative paths in solr.xml, but with Solr 4 it seems that relative 
 paths for the schema and config are no longer allowed in solr.xml.

 Regards,
 Ivan



How to share config files in SolrCloud between multiple cores(collections)

2013-03-19 Thread Li, Qiang
We have multiple cores with the same configuration. Before using SolrCloud, we 
could use relative paths in solr.xml, but with Solr 4 it seems that relative paths 
for the schema and config are no longer allowed in solr.xml.

Regards,
Ivan



Re: build CMIS compatible Solr

2013-01-20 Thread Nicholas Li
I think this might be the one you are talking about:
https://github.com/sourcesense/solr-cmis

But I think Alfresco already has search functionality similar to Solr.
Then why did you want to use it to index docs out of Alfresco?

On Fri, Jan 18, 2013 at 8:00 PM, Upayavira u...@odoko.co.uk wrote:

 A colleague of mine when I was working for Sourcesense made a CMIS
 plugin for Solr. It was one way, and we used it to index stuff out of
 Alfresco into Solr. I can't search for it now, let me know if you can't
 find it.

 Upayavira

 On Fri, Jan 18, 2013, at 05:35 AM, Nicholas Li wrote:
  I want to make something like Alfresco, but not having that many
  features.
  And I'd like to utilise the searching ability of Solr.
 
  On Fri, Jan 18, 2013 at 4:11 PM, Gora Mohanty g...@mimirtech.com
 wrote:
 
   On 18 January 2013 10:36, Nicholas Li nicholas...@yarris.com wrote:
hi
   
I am new to solr and I would like to use Solr as my document server,
 plus
search engine. But solr is not CMIS compatible( While it shoud not
 be, as
it is not build as a pure document management server).  In that
 sense, I
would build another layer beyond Solr so that the exposed interface
 would
be CMIS compatible.
   [...]
  
   May I ask why? Solr is designed to be a search engine,
   which is a very different beast from a document repository.
   In the open-source world, Alfresco ( http://www.alfresco.com/ )
   already exists, can index into Solr, and supports CMIS-based
   access.
  
   Regards,
   Gora
  



build CMIS compatible Solr

2013-01-17 Thread Nicholas Li
hi

I am new to Solr and I would like to use Solr as my document server plus
search engine. But Solr is not CMIS compatible (while it need not be, as
it is not built as a pure document management server). For that reason, I
would build another layer on top of Solr so that the exposed interface would
be CMIS compatible.

I did some investigation and it looks like OpenCMIS is one of the choices. My
next step would be to build this CMIS bridge layer, which can marshal the
request as a CMIS request, then within the CMIS implementation marshal the
request as a Solr-compatible request and send it to Solr, and finally marshal the
Solr response into a CMIS-compatible response.

Is my logic right?

And is there any library other than OpenCMIS to do this job?

cheers.
Nick


Re: build CMIS compatible Solr

2013-01-17 Thread Nicholas Li
I want to make something like Alfresco, but not having that many features.
And I'd like to utilise the searching ability of Solr.

On Fri, Jan 18, 2013 at 4:11 PM, Gora Mohanty g...@mimirtech.com wrote:

 On 18 January 2013 10:36, Nicholas Li nicholas...@yarris.com wrote:
  hi
 
  I am new to solr and I would like to use Solr as my document server, plus
  search engine. But solr is not CMIS compatible( While it shoud not be, as
  it is not build as a pure document management server).  In that sense, I
  would build another layer beyond Solr so that the exposed interface would
  be CMIS compatible.
 [...]

 May I ask why? Solr is designed to be a search engine,
 which is a very different beast from a document repository.
 In the open-source world, Alfresco ( http://www.alfresco.com/ )
 already exists, can index into Solr, and supports CMIS-based
 access.

 Regards,
 Gora



Store document while using Solr

2012-12-20 Thread Nicholas Li
hi there,

I am quite new to Solr and have a very basic question about storing and
indexing the document.

I am trying out the Solr example, and when I run a command like 'java -jar
post.jar foo/test.xml', it gives me the feeling that Solr will index the
given file no matter where it is stored, and Solr won't re-store this file
in some other location in the file system. Am I correct?

If I want to use the file system to manage the documents, it seems like it is better
to define some location that will be used to store all the potential
files (which may need some processing to move/copy/upload the files to this
location), and then use Solr to index them under this location. Am I correct?

Cheers,
Nick


Index version generation for Solr 3.5

2012-08-22 Thread Xin Li
Hi,

I ran into an issue lately with index version generation for Solr 3.5.

In Solr 1.4, the index version on the slave increments upon each
replication. However, I noticed that this is not the case for Solr 3.5; the
index version would increase by 20 or 30 after a replication. Does anyone
know why, and is there any reference on the web for this?
The index generation does still increment after replication though.

Thanks,

Xin


Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-15 Thread Li Li
On 2012-7-2 6:37 PM, Nicholas Ball nicholas.b...@nodelay.com wrote:


 That could work, but then how do you ensure commit is called on the two
 cores at the exact same time?
That may need something like the two-phase commit used in relational DBs. Lucene
has prepareCommit, but to implement 2PC many more things need to be done.
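
A rough sketch of what that two-phase pattern looks like at the Lucene level
(illustrative only and not from this thread: plain Solr does not hand you the two
cores' IndexWriters like this, and a crash between the two phases still needs a
durable coordinator log to recover from):

    import java.io.IOException;
    import org.apache.lucene.index.IndexWriter;

    public class TwoCoreCommit {
        /** Best-effort atomic commit across two writers. */
        public static void commitBoth(IndexWriter a, IndexWriter b) throws IOException {
            try {
                a.prepareCommit();   // phase 1: flush + fsync, changes not yet visible
                b.prepareCommit();
                a.commit();          // phase 2: publish both
                b.commit();
            } catch (IOException e) {
                // rollback() discards prepared-but-uncommitted changes and closes the writer.
                // Note: if the failure happens after a.commit() succeeded, "a" cannot be
                // un-committed here -- closing that gap is exactly what a full 2PC protocol
                // (coordinator log, recovery) is for.
                a.rollback();
                b.rollback();
                throw e;
            }
        }
    }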
 Also, any way to commit a specific update rather then all the back-logged
 ones?

 Cheers,
 Nicholas

 On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com
 wrote:
  Index all documents to both cores, but do not call commit until both
  report that indexing worked. If one of the cores throws an exception,
  call roll back on both cores.
 
  On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
  nicholas.b...@nodelay.com wrote:
 
  Hey all,
 
  Trying to figure out the best way to perform atomic operation across
  multiple cores on the same solr instance i.e. a multi-core environment.
 
  An example would be to move a set of docs from one core onto another
 core
  and ensure that a softcommit is done as the exact same time. If one
 were
  to
  fail so would the other.
  Obviously this would probably require some customization but wanted to
  know what the best way to tackle this would be and where should I be
  looking in the source.
 
  Many thanks for the help in advance,
  Nicholas a.k.a. incunix


Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-15 Thread Li Li
Do you really need this?
Distributed transactions are a difficult problem. In 2PC every node can
fail, including the coordinator, so something like leader election is needed to make
sure it works; you could try ZooKeeper for that.
But if the transaction is not critically important (like transferring money in a
bank), you can do something like this.
coordinator:
On 2012-8-16 7:42 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:


 Haven't managed to find a good way to do this yet. Does anyone have any
 ideas on how I could implement this feature?
 Really need to move docs across from one core to another atomically.

 Many thanks,
 Nicholas

 On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball
 nicholas.b...@nodelay.com wrote:
  That could work, but then how do you ensure commit is called on the two
  cores at the exact same time?
 
  Cheers,
  Nicholas
 
  On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com
  wrote:
  Index all documents to both cores, but do not call commit until both
  report that indexing worked. If one of the cores throws an exception,
  call roll back on both cores.
 
  On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
  nicholas.b...@nodelay.com wrote:
 
  Hey all,
 
  Trying to figure out the best way to perform atomic operation across
  multiple cores on the same solr instance i.e. a multi-core
 environment.
 
  An example would be to move a set of docs from one core onto another
  core
  and ensure that a softcommit is done as the exact same time. If one
  were
  to
  fail so would the other.
  Obviously this would probably require some customization but wanted to
  know what the best way to tackle this would be and where should I be
  looking in the source.
 
  Many thanks for the help in advance,
  Nicholas a.k.a. incunix



Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-15 Thread Li Li
http://zookeeper.apache.org/doc/r3.3.6/recipes.html#sc_recipes_twoPhasedCommit

On Thu, Aug 16, 2012 at 7:41 AM, Nicholas Ball
nicholas.b...@nodelay.com wrote:

 Haven't managed to find a good way to do this yet. Does anyone have any
 ideas on how I could implement this feature?
 Really need to move docs across from one core to another atomically.

 Many thanks,
 Nicholas

 On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball
 nicholas.b...@nodelay.com wrote:
 That could work, but then how do you ensure commit is called on the two
 cores at the exact same time?

 Cheers,
 Nicholas

 On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com
 wrote:
 Index all documents to both cores, but do not call commit until both
 report that indexing worked. If one of the cores throws an exception,
 call roll back on both cores.

 On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
 nicholas.b...@nodelay.com wrote:

 Hey all,

 Trying to figure out the best way to perform atomic operation across
 multiple cores on the same solr instance i.e. a multi-core
 environment.

 An example would be to move a set of docs from one core onto another
 core
 and ensure that a softcommit is done as the exact same time. If one
 were
 to
 fail so would the other.
 Obviously this would probably require some customization but wanted to
 know what the best way to tackle this would be and where should I be
 looking in the source.

 Many thanks for the help in advance,
 Nicholas a.k.a. incunix


Re: how to boost exact match

2012-08-10 Thread Li Li
Create a field for exact match; use it as an optional boolean clause.
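
A minimal sketch of that idea (all the names here are invented, not from the
thread): copy the title into a lightly-analyzed field and add a boosted phrase
clause on it, so "iphone 4" outranks "iphone 4s" without changing the main field.

    <!-- schema.xml -->
    <fieldType name="text_exactish" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="title_exact" type="text_exactish" indexed="true" stored="false"/>
    <copyField source="title" dest="title_exact"/>

The query then becomes something like q=title:(iphone 4) OR title_exact:"iphone 4"^10
(or a bq clause with dismax/edismax) - the boosted clause is the optional boolean
clause mentioned above.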
On 2012-8-11 1:42 PM, abhayd ajdabhol...@hotmail.com wrote:

 hi

 I have documents like
 iphone 4 - white
 iphone 4s - black
 ipone4 - black

 when user searches for iphone 4 i would like to show iphone 4 docs first
 and
 iphone 4s after that.
 Similary when user is searching for iphone 4s i would like to show iphone
 4s
 docs first then iphone 4 docs.

 At present i use whitespace tokenizer. Any idea how to achieve this?





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-boost-exact-match-tp4000576.html
 Sent from the Solr - User mailing list archive at Nabble.com.



filed type for text search

2012-07-24 Thread Xiao Li
I have used Solr 3.4 for a long time. Recently, when I upgraded to Solr 4.0
and reindexed all the data, I found that fields specified as
string type cannot be searched via the q parameter. If I just change the type
to text_general, it works. So my question is: for Solr 4.0, must I set the
field type to text_general for text search?


Search special chars

2012-07-23 Thread Li, Qiang
Hi All,

I want to search some keywords like Non-taxable, which have a "-" in the word. 
Can I make this work in Solr through some configuration, or in any other way?

Thanks & Regards,
Ivan



Re: Solr seems to hang

2012-06-28 Thread Li Li
could you please use jstack to dump the call stacks?

On Thu, Jun 28, 2012 at 2:53 PM, Arkadi Colson ark...@smartbit.be wrote:
 It has now been hanging for 15 hours and nothing changes in the index directory.

 Tips for further debugging?


 On 06/27/2012 03:50 PM, Arkadi Colson wrote:

 I'm sending files to solr with the php Solr library. I'm doing a commit
 every 1000 documents:
       <autoCommit>
         <maxDocs>1000</maxDocs>
 <!--    <maxTime>1000</maxTime> -->
       </autoCommit>

 Hard to say how long it's hanging. At least for 1 hour. After that I
 restarted Tomcat to continue... I will have a look at the indexes next time
 it's hanging. Thanks for the tip!

 SOLR: 3.6
 TOMCAT: 7.0.28
 JAVA: 1.7.0_05-b05


 On 06/27/2012 03:13 PM, Erick Erickson wrote:

 How long is it hanging? And how are you sending files to Tika, and
 especially how often do you commit? One problem that people
 run into is that they commit too often, causing segments to be
 merged and occasionally that just takes a while and people
 think that Solr is hung.

 18G isn't very large as indexes go, so it's unlikely that's your problem,
 except if merging is going on in which case you might be copying a bunch
 of data. So try seeing if you're getting a bunch of disk activity, you
 can get
 a crude idea of what's going on if you just look at the index directory
 on
 your Solr server while it's hung.

 What version of Solr are you using? Details matter

 Best
 Erick

 On Wed, Jun 27, 2012 at 7:51 AM, Arkadi Colson ark...@smartbit.be
 wrote:

 Anybody an idea?

 The thread Dump looks like this:

 Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed
 mode):

 http-8983-6 daemon prio=10 tid=0x41126000 nid=0x5c1 in
 Object.wait() [0x7fa0ad197000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on 0x00070abf4ad0 (a
 org.apache.tomcat.util.net.JIoEndpoint$Worker)
        at java.lang.Object.wait(Object.java:485)
        at

 org.apache.tomcat.util.net.JIoEndpoint$Worker.await(JIoEndpoint.java:458)
        - locked 0x00070abf4ad0 (a
 org.apache.tomcat.util.net.JIoEndpoint$Worker)
        at
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:484)
        at java.lang.Thread.run(Thread.java:662)

 pool-4-thread-1 prio=10 tid=0x7fa0a054d800 nid=0x5be waiting on
 condition [0x7f9f962f4000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  0x000702598b30 (a
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at

 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at java.util.concurrent.DelayQueue.take(DelayQueue.java:160)
        at

 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:609)
        at

 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:602)
        at

 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
        at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:662)

 http-8983-5 daemon prio=10 tid=0x412d2800 nid=0x5bd runnable
 [0x7f9f94171000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at

 org.apache.coyote.http11.InternalInputBuffer.fill(InternalInputBuffer.java:735)
        at

 org.apache.coyote.http11.InternalInputBuffer.parseRequestLine(InternalInputBuffer.java:366)
        at

 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:814)
        at

 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
        at
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:662)

 http-8983-4 daemon prio=10 tid=0x41036000 nid=0x5b1 in
 Object.wait() [0x7f9f966c9000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on 0x00070b6e4790 (a
 org.apache.lucene.index.DocumentsWriter)
        at java.lang.Object.wait(Object.java:485)
        at

 org.apache.lucene.index.DocumentsWriter.waitIdle(DocumentsWriter.java:986)
        - locked 0x00070b6e4790 (a
 org.apache.lucene.index.DocumentsWriter)
        at
 org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:524)
        - locked 0x00070b6e4790 (a
 org.apache.lucene.index.DocumentsWriter)
        at
 org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3580)
        - locked 0x00070b6e4858 (a
 

Re: what is precisionStep and positionIncrementGap

2012-06-28 Thread Li Li
Read the "How it works" section of
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/NumericRangeQuery.html
if you can read Chinese, I have a blog explaining the details of the
implementation.
http://blog.csdn.net/fancyerii/article/details/7256379

On Thu, Jun 28, 2012 at 3:51 PM, ZHANG Liang F
liang.f.zh...@alcatel-sbell.com.cn wrote:
 Thanks a lot, but the precisionStep is still very vague to me! Could you give 
 me a example?

 -Original Message-
 From: Li Li [mailto:fancye...@gmail.com]
 Sent: 2012-06-28 11:25
 To: solr-user@lucene.apache.org
 Subject: Re: what is precisionStep and positionIncrementGap

 1. precisionStep is used for ranging query of Numeric Fields. see 
 http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/NumericRangeQuery.html
 2. positionIncrementGap is used for phrase query of multi-value fields e.g. 
 doc1 has two titles.
   title1: ab cd
   title2: xy zz
   if your positionIncrementGap is 0, then the position of the 4 terms are 
 0,1,2,3.
   if you search phrase cd xy, it will hit. But you may think it should not 
 match
   so you can adjust positionIncrementGap to a larger one. e.g. 100.
   Then the positions now are 0,1,100,101. the phrase query will not match it.

 On Thu, Jun 28, 2012 at 10:00 AM, ZHANG Liang F 
 liang.f.zh...@alcatel-sbell.com.cn wrote:
 Hi,
 in the schema.xml, usually there will be fieldType definition like
 this: <fieldType name="int" class="solr.TrieIntField"
 precisionStep="0" omitNorms="true" positionIncrementGap="0"/>

 the precisionStep and positionIncrementGap is not very clear to me. Could 
 you please elaborate more on these 2?

 Thanks!

 Liang


Re: Solr seems to hang

2012-06-27 Thread Li Li
It seems that the IndexWriter wants to flush but needs to wait for the others to become
idle. But I see the n-gram filter is working. Is your field's value too
long? You should also tell us the average load of the system, the free memory, and the
memory used by the JVM.
On 2012-6-27 7:51 PM, Arkadi Colson ark...@smartbit.be wrote:

 Anybody an idea?

 The thread Dump looks like this:

 Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode):

 http-8983-6 daemon prio=10 tid=0x41126000 nid=0x5c1 in
 Object.wait() [0x7fa0ad197000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on 0x00070abf4ad0 (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.await(JIoEndpoint.java:458)
        - locked 0x00070abf4ad0 (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:484)
        at java.lang.Thread.run(Thread.java:662)

 pool-4-thread-1 prio=10 tid=0x7fa0a054d800 nid=0x5be waiting on
 condition [0x7f9f962f4000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for 0x000702598b30 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at java.util.concurrent.DelayQueue.take(DelayQueue.java:160)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:609)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:602)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:662)

 http-8983-5 daemon prio=10 tid=0x412d2800 nid=0x5bd runnable
 [0x7f9f94171000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at org.apache.coyote.http11.InternalInputBuffer.fill(InternalInputBuffer.java:735)
        at org.apache.coyote.http11.InternalInputBuffer.parseRequestLine(InternalInputBuffer.java:366)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:814)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:662)

 http-8983-4 daemon prio=10 tid=0x41036000 nid=0x5b1 in
 Object.wait() [0x7f9f966c9000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on 0x00070b6e4790 (a org.apache.lucene.index.DocumentsWriter)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.lucene.index.DocumentsWriter.waitIdle(DocumentsWriter.java:986)
        - locked 0x00070b6e4790 (a org.apache.lucene.index.DocumentsWriter)
        at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:524)
        - locked 0x00070b6e4790 (a org.apache.lucene.index.DocumentsWriter)
        at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3580)
        - locked 0x00070b6e4858 (a org.apache.solr.update.SolrIndexWriter)
        at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3545)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2328)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2293)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:240)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:141)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:146)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:236)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
  

Re: what is precisionStep and positionIncrementGap

2012-06-27 Thread Li Li
1. precisionStep is used for ranging query of Numeric Fields. see
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/NumericRangeQuery.html
2. positionIncrementGap is used for phrase query of multi-value fields
e.g. doc1 has two titles.
   title1: ab cd
   title2: xy zz
   if your positionIncrementGap is 0, then the position of the 4 terms
are 0,1,2,3.
   if you search phrase cd xy, it will hit. But you may think it
should not match
   so you can adjust positionIncrementGap to a larger one. e.g. 100.
   Then the positions now are 0,1,100,101. the phrase query will not match it.
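
To make both points concrete, a sketch (the type and field names below follow the
stock example schema and this thread's example, not any real deployment):

    <!-- precisionStep: "0" in Solr means index each value at a single precision
         (fine for sorting and exact lookups); a small step such as 8 also indexes
         lower-precision terms, so numeric range queries like price:[100 TO 500]
         have far fewer terms to visit. -->
    <fieldType name="int"  class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

    <!-- positionIncrementGap: with the two title values "ab cd" and "xy zz",
         the gap of 100 pushes "xy" far away from "cd", so the phrase query
         title:"cd xy" no longer matches across the value boundary
         (a sloppy phrase such as title:"cd xy"~100 still would). -->
    <fieldType name="text_gap" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="title" type="text_gap" indexed="true" stored="true" multiValued="true"/>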

On Thu, Jun 28, 2012 at 10:00 AM, ZHANG Liang F
liang.f.zh...@alcatel-sbell.com.cn wrote:
 Hi,
 in the schema.xml, usually there will be fieldType definition like this: 
 <fieldType name="int" class="solr.TrieIntField" precisionStep="0" 
 omitNorms="true" positionIncrementGap="0"/>

 the precisionStep and positionIncrementGap is not very clear to me. Could you 
 please elaborate more on these 2?

 Thanks!

 Liang


Re: Query Logic Question

2012-06-27 Thread Li Li
I think they are logically the same, but 1 may be a little bit faster than 2.
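
One caveat worth adding for readers of the archive (an aside, not from the
thread): in Lucene a parenthesised sub-query that contains only negative clauses
matches nothing, and Solr only adds the implicit *:* for a purely negative query
at the top level, which would explain why form 2 returns 0 results. Spelling
the positive set out restores the intended meaning:

    (*:* -PaymentType:Finance -PaymentType:Lease) AND -PaymentType:Cash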

On Thu, Jun 28, 2012 at 5:59 AM, Rublex ruble...@hotmail.com wrote:
 Hi,

 Can someone explain to me please why these two queries return different
 results:

 1. -PaymentType:Finance AND -PaymentType:Lease AND -PaymentType:Cash *(700
 results)*

 2. (-PaymentType:Finance AND -PaymentType:Lease) AND -PaymentType:Cash *(0
 results)*

 Logically the two above queries should be return the same results no?

 Thank you

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Query-Logic-Question-tp3991689.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: what's better for in memory searching?

2012-06-11 Thread Li Li
I have roughly read the code of RAMDirectory. It uses a list of 1024-byte
arrays and has a lot of overhead.
But as far as I know, using MMapDirectory I can't prevent page
faults: the OS will swap less frequently used pages out. Even if I allocate
enough memory for the JVM, I can't guarantee all the files in the directory
stay in memory. Am I understanding this right? If so, then some less
frequent queries will be slow. How can I keep them always in memory?
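
For reference, a minimal Lucene 3.6-style sketch of the MMapDirectory route (the
path is made up); keeping the index hot is then a matter of the OS page cache and
warm-up queries rather than anything inside Lucene:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.MMapDirectory;

    public class MMapSearchSketch {
        public static void main(String[] args) throws Exception {
            MMapDirectory dir = new MMapDirectory(new File("/data/myindex"));
            IndexReader reader = IndexReader.open(dir);     // read-only reader over the mmapped files
            IndexSearcher searcher = new IndexSearcher(reader);
            // Run a few representative warm-up queries here so the OS page cache is
            // populated before real traffic arrives; whether the pages stay resident
            // is up to the OS (swappiness, memory pressure), not Lucene.
            searcher.close();
            reader.close();
            dir.close();
        }
    }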

On Fri, Jun 8, 2012 at 5:53 PM, Lance Norskog goks...@gmail.com wrote:
 Yes, use MMapDirectory. It is faster and uses memory more efficiently
 than RAMDirectory. This sounds wrong, but it is true. With
 RAMDirectory, Java has to work harder doing garbage collection.

 On Fri, Jun 8, 2012 at 1:30 AM, Li Li fancye...@gmail.com wrote:
 hi all
   I want to use lucene 3.6 providing searching service. my data is
 not very large, raw data is less that 1GB and I want to use load all
 indexes into memory. also I need save all indexes into disk
 persistently.
   I originally want to use RAMDirectory. But when I read its javadoc.

   Warning: This class is not intended to work with huge indexes.
 Everything beyond several hundred megabytes
  will waste resources (GC cycles), because it uses an internal buffer
 size of 1024 bytes, producing millions of byte
  [1024] arrays. This class is optimized for small memory-resident
 indexes. It also has bad concurrency on
  multithreaded environments.
 It is recommended to materialize large indexes on disk and use
 MMapDirectory, which is a high-performance
  directory implementation working directly on the file system cache of
 the operating system, so copying data to
  Java heap space is not useful.

    should I use MMapDirectory? it seems another contrib instantiated.
 anyone test it with RAMDirectory?



 --
 Lance Norskog
 goks...@gmail.com


Re: what's better for in memory searching?

2012-06-11 Thread Li Li
Do you mean a software RAM disk, i.e. using RAM to simulate a disk? How would you deal
with persistence?

Maybe I can hack this by increasing RAMOutputStream.BUFFER_SIZE from 1024 to 1024*1024.
That may waste some space, but I can adjust my merge policy to avoid too many segments.
I will have one big segment and one small segment. Every night I will
merge them. Newly added documents will flush into a new segment, and I
will merge that newly generated segment with the small one.
Our update operations are not very frequent.

On Mon, Jun 11, 2012 at 4:59 PM, Paul Libbrecht p...@hoplahup.net wrote:
 Li Li,

 have you considered allocating a RAM-Disk?
 It's not the most flexible thing... but it's certainly close, in performance 
 to a RAMDirectory.
 MMapping on that is likely to be useless but I doubt you can set it to zero.
 That'd need experiment.

 Also, doesn't caching and auto-warming provide the lowest latency for all 
 expected queries ?

 Paul


 Le 11 juin 2012 à 10:50, Li Li a écrit :

   I want to use lucene 3.6 providing searching service. my data is
 not very large, raw data is less that 1GB and I want to use load all
 indexes into memory. also I need save all indexes into disk
 persistently.
   I originally want to use RAMDirectory. But when I read its javadoc.




Re: what's better for in memory searching?

2012-06-11 Thread Li Li
I am sorry, I made a mistake: even using RAMDirectory, I cannot
guarantee they are not swapped out.

On Mon, Jun 11, 2012 at 4:45 PM, Michael Kuhlmann k...@solarier.de wrote:
 Set the swapiness to 0 to avoid memory pages being swapped to disk too
 early.

 http://en.wikipedia.org/wiki/Swappiness

 -Kuli

 Am 11.06.2012 10:38, schrieb Li Li:

 I have roughly read the codes of RAMDirectory. it use a list of 1024
 byte arrays and many overheads.
 But as far as I know, using MMapDirectory, I can't prevent the page
 faults. OS will swap less frequent pages out. Even if I allocate
 enough memory for JVM, I can guarantee all the files in the directory
 are in memory. am I understanding right? if it is, then some less
 frequent queries will be slow.  How can I let them always in memory?

 On Fri, Jun 8, 2012 at 5:53 PM, Lance Norskoggoks...@gmail.com  wrote:

 Yes, use MMapDirectory. It is faster and uses memory more efficiently
 than RAMDirectory. This sounds wrong, but it is true. With
 RAMDirectory, Java has to work harder doing garbage collection.

 On Fri, Jun 8, 2012 at 1:30 AM, Li Lifancye...@gmail.com  wrote:

 hi all
   I want to use lucene 3.6 providing searching service. my data is
 not very large, raw data is less that 1GB and I want to use load all
 indexes into memory. also I need save all indexes into disk
 persistently.
   I originally want to use RAMDirectory. But when I read its javadoc.

   Warning: This class is not intended to work with huge indexes.
 Everything beyond several hundred megabytes
  will waste resources (GC cycles), because it uses an internal buffer
 size of 1024 bytes, producing millions of byte
  [1024] arrays. This class is optimized for small memory-resident
 indexes. It also has bad concurrency on
  multithreaded environments.
 It is recommended to materialize large indexes on disk and use
 MMapDirectory, which is a high-performance
  directory implementation working directly on the file system cache of
 the operating system, so copying data to
  Java heap space is not useful.

    should I use MMapDirectory? it seems another contrib instantiated.
 anyone test it with RAMDirectory?




 --
 Lance Norskog
 goks...@gmail.com




Re: what's better for in memory searching?

2012-06-11 Thread Li Li
Yes, I need an average query time of less than 10 ms; the faster the better.
I have enough memory for Lucene because I know there is not too much
data, and there are not many modifications: every day there are about a few
hundred document updates. If the indexes are not in physical memory,
then IO operations will cost a few ms.
Btw, a full GC may also add uncertainty, so I need to optimize as
much as possible.
On Mon, Jun 11, 2012 at 5:27 PM, Michael Kuhlmann k...@solarier.de wrote:
 You cannot guarantee this when you're running out of RAM. You'd have a
 problem then anyway.

 Why are you caring that much? Did you yet have performance issues? 1GB
 should load really fast, and both auto warming and OS cache should help a
 lot as well. With such an index, you usually don't need to fine tune
 performance that much.

 Did you think about using a SSD? Since you want to persist your index,
 you'll need to live with disk IO anyway.

 Greetings,
 Kuli

 Am 11.06.2012 11:20, schrieb Li Li:

 I am sorry. I make a mistake. even use RAMDirectory, I can not
 guarantee they are not swapped out.

 On Mon, Jun 11, 2012 at 4:45 PM, Michael Kuhlmannk...@solarier.de
  wrote:

 Set the swapiness to 0 to avoid memory pages being swapped to disk too
 early.

 http://en.wikipedia.org/wiki/Swappiness

 -Kuli

 Am 11.06.2012 10:38, schrieb Li Li:

 I have roughly read the codes of RAMDirectory. it use a list of 1024
 byte arrays and many overheads.
 But as far as I know, using MMapDirectory, I can't prevent the page
 faults. OS will swap less frequent pages out. Even if I allocate
 enough memory for JVM, I can guarantee all the files in the directory
 are in memory. am I understanding right? if it is, then some less
 frequent queries will be slow.  How can I let them always in memory?

 On Fri, Jun 8, 2012 at 5:53 PM, Lance Norskoggoks...@gmail.com
  wrote:


 Yes, use MMapDirectory. It is faster and uses memory more efficiently
 than RAMDirectory. This sounds wrong, but it is true. With
 RAMDirectory, Java has to work harder doing garbage collection.

 On Fri, Jun 8, 2012 at 1:30 AM, Li Lifancye...@gmail.com    wrote:


 hi all
   I want to use lucene 3.6 providing searching service. my data is
 not very large, raw data is less that 1GB and I want to use load all
 indexes into memory. also I need save all indexes into disk
 persistently.
   I originally want to use RAMDirectory. But when I read its javadoc.

   Warning: This class is not intended to work with huge indexes.
 Everything beyond several hundred megabytes
  will waste resources (GC cycles), because it uses an internal buffer
 size of 1024 bytes, producing millions of byte
  [1024] arrays. This class is optimized for small memory-resident
 indexes. It also has bad concurrency on
  multithreaded environments.
 It is recommended to materialize large indexes on disk and use
 MMapDirectory, which is a high-performance
  directory implementation working directly on the file system cache of
 the operating system, so copying data to
  Java heap space is not useful.

    should I use MMapDirectory? it seems another contrib instantiated.
 anyone test it with RAMDirectory?





 --
 Lance Norskog
 goks...@gmail.com






Re: what's better for in memory searching?

2012-06-11 Thread Li Li
I found this. 
http://unix.stackexchange.com/questions/10214/per-process-swapiness-for-linux
it can provide  fine grained control of swapping

On Mon, Jun 11, 2012 at 4:45 PM, Michael Kuhlmann k...@solarier.de wrote:
 Set the swapiness to 0 to avoid memory pages being swapped to disk too
 early.

 http://en.wikipedia.org/wiki/Swappiness

 -Kuli

 Am 11.06.2012 10:38, schrieb Li Li:

 I have roughly read the codes of RAMDirectory. it use a list of 1024
 byte arrays and many overheads.
 But as far as I know, using MMapDirectory, I can't prevent the page
 faults. OS will swap less frequent pages out. Even if I allocate
 enough memory for JVM, I can guarantee all the files in the directory
 are in memory. am I understanding right? if it is, then some less
 frequent queries will be slow.  How can I let them always in memory?

 On Fri, Jun 8, 2012 at 5:53 PM, Lance Norskoggoks...@gmail.com  wrote:

 Yes, use MMapDirectory. It is faster and uses memory more efficiently
 than RAMDirectory. This sounds wrong, but it is true. With
 RAMDirectory, Java has to work harder doing garbage collection.

 On Fri, Jun 8, 2012 at 1:30 AM, Li Lifancye...@gmail.com  wrote:

 hi all
   I want to use lucene 3.6 providing searching service. my data is
 not very large, raw data is less that 1GB and I want to use load all
 indexes into memory. also I need save all indexes into disk
 persistently.
   I originally want to use RAMDirectory. But when I read its javadoc.

   Warning: This class is not intended to work with huge indexes.
 Everything beyond several hundred megabytes
  will waste resources (GC cycles), because it uses an internal buffer
 size of 1024 bytes, producing millions of byte
  [1024] arrays. This class is optimized for small memory-resident
 indexes. It also has bad concurrency on
  multithreaded environments.
 It is recommended to materialize large indexes on disk and use
 MMapDirectory, which is a high-performance
  directory implementation working directly on the file system cache of
 the operating system, so copying data to
  Java heap space is not useful.

    should I use MMapDirectory? it seems another contrib instantiated.
 anyone test it with RAMDirectory?




 --
 Lance Norskog
 goks...@gmail.com




Re: what's better for in memory searching?

2012-06-11 Thread Li Li
Is this method equivalent to setting vm.swappiness, which is global?
Or can it set the swappiness for just the JVM process?

On Tue, Jun 12, 2012 at 5:11 AM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 Point about premature optimization makes sense for me. However some time
 ago I've bookmarked potentially useful approach
 http://lucene.472066.n3.nabble.com/High-response-time-after-being-idle-tp3616599p3617604.html.

 On Mon, Jun 11, 2012 at 3:02 PM, Toke Eskildsen 
 t...@statsbiblioteket.dkwrote:

 On Mon, 2012-06-11 at 11:38 +0200, Li Li wrote:
  yes, I need average query time less than 10 ms. The faster the better.
  I have enough memory for lucene because I know there are not too much
  data. there are not many modifications. every day there are about
  hundreds of document update. if indexes are not in physical memory,
  then IO operations will cost a few ms.

 I'm with Michael on this one: It seems that you're doing a premature
 optimization. Guessing that your final index will be  5GB in size with
 1 million documents (give or take 900.000:-), relatively simple queries
 and so on, an average response time of 10 ms should be attainable even
 on spinning drives. One hundred document updates per day are not many,
 so again I would not expect problems.

 As is often the case on this mailing list, the advice is try it. Using
 a normal on-disk index and doing some warm up is the easy solution to
 implement and nearly all of your work on this will be usable for a
 RAM-based solution, if you are not satisfied with the speed. Or you
 could buy a small  cheap SSD and have no more worries...

 Regards,
 Toke Eskildsen




 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com


Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support

2012-05-27 Thread Li Li
Yes, I am also interested in good performance with 2 billion docs. How
many search nodes do you use? What are the average response time and QPS?

Another question: where can I find a related paper or other resources that
explain your algorithm in detail? Why is it better than
Google site search? (Being better than Lucene is not very interesting, because Lucene
was not originally designed to provide a Google-like search function.)

On Mon, May 28, 2012 at 1:06 AM, Darren Govoni dar...@ontrenet.com wrote:
 I think people on this list would be more interested in your approach to
 scaling 2 billion documents than modifying solr/lucene scoring (which is
 already top notch). So given that, can you share any references or
 otherwise substantiate good performance with 2 billion documents?

 Thanks.

 On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote:
 Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion
 docs. With RankingAlgorithm 1.4.3, using the parameters
 age=latestdocs=number feature, you can retrieve the NRT inserted
 documents in milliseconds from such a huge index improving query and
 faceting performance and using very little resources ...

 Currently, RankingAlgorithm 1.4.3 is only available with Solr 4.0, and
 the NRT insert performance with Solr 4.0 is about 70,000 docs / sec.
 RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon.

 Regards,

 Nagendra Nagarajayya
 http://solr-ra.tgels.org
 http://rankingalgorithm.tgels.org



 On 5/27/2012 7:32 AM, Darren Govoni wrote:
  Hi,
     Have you tested this with a billion documents?
 
  Darren
 
  On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote:
  Hi!
 
  I am very excited to announce the availability of Solr 3.6 with
  RankingAlgorithm 1.4.2.
 
  This NRT supports now works with both RankingAlgorithm and Lucene. The
  insert/update performance should be about 5000 docs in about 490 ms with
  the MbArtists Index.
 
  RankingAlgorithm 1.4.2 has multiple algorithms, improved performance
  over the earlier releases, supports the entire Lucene Query Syntax, ±
  and/or boolean queries and can scale to more than a billion documents.
 
  You can get more information about NRT performance from here:
  http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
 
  You can download Solr 3.6 with RankingAlgorithm 1.4.2 from here:
  http://solr-ra.tgels.org
 
  Please download and give the new version a try.
 
  Regards,
 
  Nagendra Nagarajayya
  http://solr-ra.tgels.org
  http://rankingalgorithm.tgels.org
 
  ps. MbArtists index is the example index used in the Solr 1.4 Enterprise
  Book
 
 
 
 





Re: How can i search site name

2012-05-22 Thread Li Li
You should define your search first.
If the site is www.google.com, how do you want to match it: full string
matching or partial matching? E.g., should "google" match? If it
should, you should write your own analyzer for this field.

On Tue, May 22, 2012 at 2:03 PM, Shameema Umer shem...@gmail.com wrote:
 Sorry,
 Please let me know how can I search site name using the solr query syntax.
 My results should show title, url and content.
 Title and content are being searched even though the
 <defaultSearchField>content</defaultSearchField>.

 I need url or site name too. please, help.

 Thanks in advance.

 On Tue, May 22, 2012 at 11:05 AM, ketan kore ketankore...@gmail.com wrote:

 you can go on www.google.com and just type the site which you want to
 search and google will show you the results as simple as that ...



Re: Installing Solr on Tomcat using Shell - Code wrong?

2012-05-22 Thread Li Li
You should find some clues in the Tomcat log.
On 2012-5-22 7:49 PM, Spadez james_will...@hotmail.com wrote:

 Hi,

 This is the install process I used in my shell script to try and get Tomcat
 running with Solr (debian server):



 I swear this used to work, but currently only Tomcat works. The Solr page
 just comes up with The requested resource (/solr/admin) is not available.

 Can anyone give me some insight into why this isnt working? Its driving me
 nuts.

 James

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Installing-Solr-on-Tomcat-using-Shell-Code-wrong-tp3985393.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr query with mandatory values

2012-05-09 Thread Li Li
+ before term is correct. in lucene term includes field and value.

Query  ::= ( Clause )*

Clause ::= ["+", "-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" | "+" ) >

<#_ESCAPED_CHAR: "\\" ~[] >


In the Lucene query syntax, you can't express a term value that includes a space.
You can use quotation marks, but Lucene will treat that as a phrase query.
So you need to escape the space, like title:hello\\ world,
which will take "hello world" as the field value. The analyzer then
tokenizes it, so you should use an analyzer which can deal with
spaces, e.g. the keyword analyzer.

As far as I know.

On Thu, May 10, 2012 at 3:35 AM, Matt Kuiper matt.kui...@issinc.com wrote:
 Yes.

 See http://wiki.apache.org/solr/SolrQuerySyntax  - The standard Solr Query 
 Parser syntax is a superset of the Lucene Query Parser syntax.
 Which links to http://lucene.apache.org/core/3_6_0/queryparsersyntax.html

 Note - Based on the info on these pages I believe the + symbol is to be 
 placed just before the mandatory value, not before the field name in the 
 query.

 Matt Kuiper
 Intelligent Software Solutions

 -Original Message-
 From: G.Long [mailto:jde...@gmail.com]
 Sent: Wednesday, May 09, 2012 10:45 AM
 To: solr-user@lucene.apache.org
 Subject: Solr query with mandatory values

 Hi :)

 I remember that in a Lucene query, there is something like mandatory values. 
 I just have to add a + symbol in front of the mandatory parameter, like: 
 +myField:my value

 I was wondering if there was something similar in Solr queries? Or is this 
 behaviour activated by default?

 Gary




Re: SOLRJ: Is there a way to obtain a quick count of total results for a query

2012-05-04 Thread Li Li
Not scoring by relevance and sorting by document id instead may speed it up a little.
I haven't done any tests of this; maybe you can give it a try. Scoring will consume
some CPU time, and you just want to match and get the total count.
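
For reference, the rows=0 approach the original poster describes looks like this in
SolrJ (the URL and query are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CountOnly {
        public static void main(String[] args) throws Exception {
            SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("field:value");
            q.setRows(0);                       // fetch no documents, only the header and numFound
            QueryResponse rsp = server.query(q);
            long total = rsp.getResults().getNumFound();
            System.out.println("numFound = " + total);
        }
    }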

On Wed, May 2, 2012 at 11:58 PM, vybe3142 vybe3...@gmail.com wrote:
 I can achieve this by building a query with start and rows = 0, and using
 queryResponse.getResults().getNumFound().

 Are there any more efficient approaches to this?

 Thanks

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SOLRJ-Is-there-a-way-to-obtain-a-quick-count-of-total-results-for-a-query-tp3955322.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Sorting result first which come first in sentance

2012-05-03 Thread Li Li
For versions below 4.0 it's not possible, because of Lucene's scoring
model. Position information is stored, but it is only used to support phrase
queries: it just tells us whether a document matched, but we can't use it to boost
a document. A similar problem is how to implement proximity boosting:
for 2 search terms, we need to return all docs that contain the 2
terms, but if they occur as a phrase we give the doc the largest boost; if there is
one word between them, we give it a smaller one; if there are 2 words
between them, a smaller score still.
All such ranking algorithms need a more flexible scoring model.
I don't know whether the latest trunk takes this into consideration.

On Fri, May 4, 2012 at 3:43 AM, Jonty Rhods jonty.rh...@gmail.com wrote:
 Hi all,



 I need a suggestion:



 I have many title like:



 1 bomb blast in kabul

 2 kabul bomb blast

 3 3 people killed in serial bomb blast in kabul



 I want the 2nd result to come first when a user searches for kabul,
 because kabul is in the 1st position in that sentence. Similarly, the 1st
 result should come 2nd and the 3rd should come last.



 Please suggest how to implement this.



 Regard

 Jonty



Re: Sorting result first which come first in sentance

2012-05-03 Thread Li Li
For this version, you may consider using payloads for a position boost:
you can store the boost values in the payloads.
I have used this at the Lucene API level, where anchor text should weigh
more than normal text, but I haven't used it in Solr.
Some useful links:
http://wiki.apache.org/solr/Payloads
http://digitalpebble.blogspot.com/2010/08/using-payloads-with-dismaxqparser-in.html
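
A rough Lucene 3.x sketch of the query side (the field name and the idea of storing a float boost per position are assumptions; the payloads themselves have to be written at index time, e.g. with DelimitedPayloadTokenFilter):

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

// Similarity that turns the per-position payload into a score factor
class PositionBoostSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
                              byte[] payload, int offset, int length) {
        if (payload == null || length < 4) {
            return 1.0f;                                   // no payload, neutral boost
        }
        return PayloadHelper.decodeFloat(payload, offset); // float boost written at index time
    }
}

// usage sketch:
//   searcher.setSimilarity(new PositionBoostSimilarity());
//   Query q = new PayloadTermQuery(new Term("title", "kabul"), new AveragePayloadFunction());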


On Fri, May 4, 2012 at 9:51 AM, Jonty Rhods jonty.rh...@gmail.com wrote:
 I am using solr version 3.4


Re: get latest 50 documents the fastest way

2012-05-01 Thread Li Li
You could reverse your sort: maybe you can override the tf method of
Similarity and return -1.0f * tf() (I don't know whether the default
collector allows scores smaller than zero). Or you can hack it by adding a
large constant, or write your own collector; in its collect(int doc) method
you can do something like this:
collect(int doc) {
    float score = scorer.score();
    score *= -1.0f;
    ...
}
If you don't sort by relevance score, just set a Sort.
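
A sketch of the "just set Sort" route, assuming the index has a sortable numeric timestamp field (the field name is made up); this skips relevance scoring entirely:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

public class Latest50 {
    // returns the 50 newest matches, newest first, without computing relevance scores
    static TopDocs latest(IndexSearcher searcher, Query query) throws Exception {
        Sort byTimeDesc = new Sort(new SortField("timestamp", SortField.LONG, true));
        return searcher.search(query, null, 50, byTimeDesc);
    }
}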

On Tue, May 1, 2012 at 10:38 PM, Yuval Dotan yuvaldo...@gmail.com wrote:
 Hi Guys
 We have a use case where we need to get the 50 *latest *documents that
 match my query - without additional ranking,sorting,etc on the results.
 My index contains 1,000,000,000 documents and i noticed that if the number
 of found documents is very big (larger than 50% of the index size -
 500,000,000 docs) than it takes more than 5 seconds to get the results even
 with rows=50 parameter.
 Is there a way to get the results faster?
 Thanks
 Yuval


question about NRT(soft commit) and Transaction Log in trunk

2012-04-28 Thread Li Li
hi
   I checked out the trunk and played with its new soft commit
feature. It's cool, but I've got a few questions about it.
   From reading some introductory articles and the wiki, and a hasty
reading of the code, my understanding of the implementation is:
   For a normal (hard) commit, we flush everything to disk and commit it.
The flush is not very time consuming because of the OS-level cache; the
most time-consuming part is the sync in the commit process.
   A soft commit just flushes postings and pending deletions to disk,
generating new segments. Then Solr can use a new searcher to read the
latest index, warm it up and register it.
   If there is no hard commit and the JVM crashes, new data may be lost.
   If my understanding is correct, then why do we need the transaction log?
   I found that in DirectUpdateHandler2, every time a command is executed,
TransactionLog records a line in the log. But the default sync level in
RunUpdateProcessorFactory is flush, which means it will not sync the log
file. Does this make sense?
   In database implementations, we usually write the log and modify data in
memory because the log is smaller than the real data; if the process
crashes, we can redo the unfinished log entries and make the data correct.
Will Solr leverage the log like this? If so, why isn't it synced?


Re: Solr Scoring

2012-04-13 Thread Li Li
Another way is to use payloads: http://wiki.apache.org/solr/Payloads
The advantage of payloads is that you only need one field and the frq file
can be smaller than with two fields, but the disadvantage is that payloads
are stored in the prx file, so I am not sure which one is faster. Maybe you
can try them both.
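
For the two-field approach Walter describes below, a SolrJ sketch of weighting the exact field above the stemmed one (the field names and boosts are assumptions):

import org.apache.solr.client.solrj.SolrQuery;

public class ExactOverStemmed {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("Edges");
        q.set("defType", "dismax");                  // use the dismax query parser
        q.set("qf", "itemDescExact^4 itemDescStem"); // exact-match field gets the bigger weight
        System.out.println(q);
    }
}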

On Fri, Apr 13, 2012 at 8:04 AM, Erick Erickson erickerick...@gmail.comwrote:

 GAH! I had my head in make this happen in one field when I wrote my
 response, without being explicit. Of course Walter's solution is pretty
 much the standard way to deal with this.

 Best
 Erick

 On Thu, Apr 12, 2012 at 5:38 PM, Walter Underwood wun...@wunderwood.org
 wrote:
  It is easy. Create two fields, text_exact and text_stem. Don't use the
 stemmer in the first chain, do use the stemmer in the second. Give the
 text_exact a bigger weight than text_stem.
 
  wunder
 
  On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote:
 
  No, I don't think there's an OOB way to make this happen. It's
  a recurring theme, make exact matches score higher than
  stemmed matches.
 
  Best
  Erick
 
  On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue kissue...@gmail.com
 wrote:
  Hi,
 
  I have a field in my index called itemDesc which i am applying
  EnglishMinimalStemFilterFactory to. So if i index a value to this field
  containing Edges, the EnglishMinimalStemFilterFactory applies
 stemming
  and Edges becomes Edge. Now when i search for Edges, documents
 with
  Edge score better than documents with the actual search word -
 Edges.
  Is there a way i can make documents with the actual search word in this
  case Edges score better than document with Edge?
 
  I am using Solr 3.5. My field definition is shown below:
 
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
    </analyzer>
  </fieldType>
 
  Thanks.
 
 
 
 
 



Re: How to read SOLR cache statistics?

2012-04-13 Thread Li Li
http://wiki.apache.org/solr/SolrCaching

On Fri, Apr 13, 2012 at 2:30 PM, Kashif Khan uplink2...@gmail.com wrote:

 Does anyone explain what does the following parameters mean in SOLR cache
 statistics?

 *name*:  queryResultCache
 *class*:  org.apache.solr.search.LRUCache
 *version*:  1.0
 *description*:  LRU Cache(maxSize=512, initialSize=512)
 *stats*:  lookups : 98
 *hits *: 59
 *hitratio *: 0.60
 *inserts *: 41
 *evictions *: 0
 *size *: 41
 *warmupTime *: 0
 *cumulative_lookups *: 98
 *cumulative_hits *: 59
 *cumulative_hitratio *: 0.60
 *cumulative_inserts *: 39
 *cumulative_evictions *: 0

 AND also this


 *name*:  fieldValueCache
 *class*:  org.apache.solr.search.FastLRUCache
 *version*:  1.0
 *description*:  Concurrent LRU Cache(maxSize=1, initialSize=10,
 minSize=9000, acceptableSize=9500, cleanupThread=false)
 *stats*:  *lookups *: 8
 *hits *: 4
 *hitratio *: 0.50
 *inserts *: 2
 *evictions *: 0
 *size *: 2
 *warmupTime *: 0
 *cumulative_lookups *: 8
 *cumulative_hits *: 4
 *cumulative_hitratio *: 0.50
 *cumulative_inserts *: 2
 *cumulative_evictions *: 0
 *item_ABC *:

 {field=ABC,memSize=340592,tindexSize=1192,time=1360,phase1=1344,nTerms=7373,bigTerms=1,termInstances=11513,uses=4}
 *item_BCD *:

 {field=BCD,memSize=341248,tindexSize=1952,time=1688,phase1=1688,nTerms=8075,bigTerms=0,termInstances=13510,uses=2}

 Without understanding these terms i cannot configure server for better
 cache
 usage. The point is searches are very slow. These stats were taken when
 server was down and restarted. I just want to understand what these terms
 mean actually


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-read-SOLR-cache-statistics-tp3907294p3907294.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: using solr to do a 'match'

2012-04-11 Thread Li Li
It's not possible now because Lucene doesn't support this.
When doing a disjunction query, it only records how many terms match the
document. I think this is a common requirement for many users.
I suggest Lucene should split the scorer into a matcher and a scorer:
the matcher just returns which doc matched and why/how it matched.
Especially for disjunction queries it should tell which terms matched and
possibly other information such as tf/idf and the distance between terms
(to support proximity search). That's the matcher's job; the scorer (a
ranking algorithm) then uses a flexible algorithm to score the document,
and the collector can collect it.

On Wed, Apr 11, 2012 at 10:28 AM, Chris Book chrisb...@gmail.com wrote:

 Hello, I have a solr index running that is working very well as a search.
  But I want to add the ability (if possible) to use it to do matching.  The
 problem is that by default it is only looking for all the input terms to be
 present, and it doesn't give me any indication as to how many terms in the
 target field were not specified by the input.

 For example, if I'm trying to match to the song title dust in the wind,
 I'm correctly getting a match if the input query is dust in wind.  But I
 don't want to get a match if the input is just dust.  Although as a
 search dust should return this result, I'm looking for some way to filter
 this out based on some indication that the input isn't close enough to the
 output.  Perhaps if I could get information that that the number of input
 terms is much less than the number of terms in the field.  Or something
 else along those line?

 I realize that this isn't the typical use case for a search, but I'm just
 looking for some suggestions as to how I could improve the above example a
 bit.

 Thanks,
 Chris



Re: using solr to do a 'match'

2012-04-11 Thread Li Li
I searched my mail but found nothing. The thread found by the keywords
boolean expression is Indexing Boolean Expressions from joaquin.delgado.
To tell which terms matched: for BooleanScorer2, a simple method is to
modify DisjunctionSumScorer and add a BitSet to record the matched scorers.
When the collector collects a document, it can get the scorer and
recursively find the matched terms.
But I think maybe it's better to add a component, maybe named matcher, that
does the matching job, and the scorer then uses the information from the
matcher to do the ranking.
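
In the meantime, along the lines of the minShouldMatch idea below, a Lucene sketch that only matches documents containing most of the input terms (the field name and the all-but-one threshold are assumptions):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class MostTermsMatch {
    static BooleanQuery build(String field, String[] terms) {
        BooleanQuery bq = new BooleanQuery();
        for (String t : terms) {
            bq.add(new TermQuery(new Term(field, t)), Occur.SHOULD);
        }
        // require nearly all input terms, so "dust" alone will not match "dust in the wind"
        bq.setMinimumNumberShouldMatch(Math.max(1, terms.length - 1));
        return bq;
    }
}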

On Wed, Apr 11, 2012 at 4:32 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Hi,

 This use case is similar to matching boolean expression problem. You can
 find recent thread about it. I have an idea that we can introduce
 disjunction query with dynamic mm (minShouldMatch parameter

 http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/BooleanQuery.html#setMinimumNumberShouldMatch(int)
 )
 i.e. 'match these clauses disjunctively but for every document use
 value
 from field cache of field xxxCount as a minShouldMatch parameter'. Also
 norms can be used as a source for dynamics mm values.

 Wdyt?

 On Wed, Apr 11, 2012 at 10:08 AM, Li Li fancye...@gmail.com wrote:

  it's not possible now because lucene don't support this.
  when doing disjunction query, it only record how many terms match this
  document.
  I think this is a common requirement for many users.
  I suggest lucene should divide scorer to a matcher and a scorer.
  the matcher just return which doc is matched and why/how the doc is
  matched.
  especially for disjuction query, it should tell which term matches and
  possible other
  information such as tf/idf and the distance of terms(to support proximity
  search).
  That's the matcher's job. and then the scorer(a ranking algorithm) use
  flexible algorithm
  to score this document and the collector can collect it.
 
  On Wed, Apr 11, 2012 at 10:28 AM, Chris Book chrisb...@gmail.com
 wrote:
 
   Hello, I have a solr index running that is working very well as a
 search.
But I want to add the ability (if possible) to use it to do matching.
   The
   problem is that by default it is only looking for all the input terms
 to
  be
   present, and it doesn't give me any indication as to how many terms in
  the
   target field were not specified by the input.
  
   For example, if I'm trying to match to the song title dust in the
 wind,
   I'm correctly getting a match if the input query is dust in wind.
  But
  I
   don't want to get a match if the input is just dust.  Although as a
   search dust should return this result, I'm looking for some way to
  filter
   this out based on some indication that the input isn't close enough to
  the
   output.  Perhaps if I could get information that that the number of
 input
   terms is much less than the number of terms in the field.  Or something
   else along those line?
  
   I realize that this isn't the typical use case for a search, but I'm
 just
   looking for some suggestions as to how I could improve the above
 example
  a
   bit.
  
   Thanks,
   Chris
  
 



 --
 Sincerely yours
 Mikhail Khludnev
 ge...@yandex.ru

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



Re: pagerank??

2012-04-04 Thread Bing Li
According to my knowledge, Solr cannot support this.

In my case, I get data by keyword-matching from Solr and then rank the data
by PageRank after that.

Thanks,
Bing

On Wed, Apr 4, 2012 at 6:37 AM, Manuel Antonio Novoa Proenza 
mano...@estudiantes.uci.cu wrote:

 Hello,

 I have in my Solr index , many indexed documents.

 Let me know any way or efficient function to calculate the page rank of
 websites indexed.




Re: Trouble Setting Up Development Environment

2012-03-24 Thread Li Li
. Runtime ClassNotFoundExceptions may result.
  solr3_5P/solr3_5Classpath Dependency Validator Message
 Classpath entry /solr3_5/ssrc/solr/lib/guava-r05.jar will not be exported
 or published. Runtime ClassNotFoundExceptions may result.  solr3_5
  P/solr3_5Classpath Dependency Validator Message
 Classpath entry /solr3_5/ssrc/solr/lib/jcl-over-slf4j-1.6.1.jar will not
 be exported or published. Runtime ClassNotFoundExceptions may result.
  solr3_5P/solr3_5Classpath Dependency Validator Message
 Classpath entry /solr3_5/ssrc/solr/lib/junit-4.7.jar will not be exported
 or published. Runtime ClassNotFoundExceptions may result.  solr3_5
  P/solr3_5Classpath Dependency Validator Message
 Classpath entry /solr3_5/ssrc/solr/lib/servlet-api-2.4.jar will not be
 exported or published. Runtime ClassNotFoundExceptions may result.
  solr3_5P/solr3_5Classpath Dependency Validator Message
 Classpath entry /solr3_5/ssrc/solr/lib/slf4j-api-1.6.1.jar will not be
 exported or published. Runtime ClassNotFoundExceptions may result.
  solr3_5P/solr3_5Classpath Dependency Validator Message
 Classpath entry /solr3_5/ssrc/solr/lib/slf4j-jdk14-1.6.1.jar will not be
 exported or published. Runtime ClassNotFoundExceptions may result.
  solr3_5P/solr3_5Classpath Dependency Validator Message
 Classpath entry /solr3_5/ssrc/solr/lib/wstx-asl-3.2.7.jar will not be
 exported or published. Runtime ClassNotFoundExceptions may result.
  solr3_5P/solr3_5Classpath Dependency Validator Message



 On Fri, Mar 23, 2012 at 3:25 AM, Li Li fancye...@gmail.com wrote:

 here is my method.
 1. check out latest source codes from trunk or download tar ball
 svn checkout http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk

 2. create a dynamic web project in eclipse and close it.
   for example, I create a project name lucene-solr-trunk in my
 workspace.

 3. copy/mv the source code to this project(it's not necessary)
   here is my directory structure
   lili@lili-desktop:~/workspace/lucene-solr-trunk$ ls
 bin.tests-framework  build  lucene_trunk  src  testindex  WebContent
  lucene_trunk is the top directory checked out from svn in step 1.
 4. remove WebContent generated by eclipse and modify it to a soft link to
  lili@lili-desktop:~/workspace/lucene-solr-trunk$ ll WebContent
 lrwxrwxrwx 1 lili lili 28 2011-08-18 18:50 WebContent -
 lucene_trunk/solr/webapp/web/
 5. open lucene_trunk/dev-tools/eclipse/dot.classpath. copy all lines like
 kind=src to a temp file
  <classpathentry kind="src" path="lucene/core/src/java"/>
  <classpathentry kind="src" path="lucene/core/src/resources"/>
 
 6. replace all string like path=xxx to path=lucene_trunk/xxx and copy
 them into .classpath file
 7. mkdir WebContent/WEB-INF/lib
 8. extract all jar file in dot.classpath to WebContent/WEB-INF/lib
I use this command:
    lili@lili-desktop:~/workspace/lucene-solr-trunk/lucene_trunk$ cat dev-tools/eclipse/dot.classpath | grep 'kind="lib"' | awk -F 'path="' '{print $2}' | awk -F '"' '{print $1}' | xargs -I{} cp {} ../WebContent/WEB-INF/lib/
 9. open this project and refresh it.
if everything is ok, it will compile all java files successfully. if
 there is something wrong, Probably we don't use the correct jar. because
 there are many versions of the same library.
 10. right click the project - debug As - debug on Server
it will fail because no solr home is specified.
 11. right click the project - debug As - debug Configuration -
 Arguments
 Tab - VM arguments
 add

 -Dsolr.solr.home=/home/lili/workspace/lucene-solr-trunk/lucene_trunk/solr/example/solr
 you can also add other vm arguments like -Xmx1g here.
 12. all fine, add a break point at SolrDispatchFilter.doFilter(). all solr
 request comes here
 13. have fun~


 On Fri, Mar 23, 2012 at 11:49 AM, Karthick Duraisamy Soundararaj 
 karthick.soundara...@gmail.com wrote:

  Hi Solr Ppl,
 I have been trying to set up solr dev env. I downloaded
 the
  tar ball of eclipse and the solr 3.5 source. Here are the exact
 sequence of
  steps I followed
 
  I extracted the solr 3.5 source and eclipse.
  I installed run-jetty-run plugin for eclipse.
  I ran ant eclipse in the solr 3.5 source directory
  I used eclipse's Open existing project option to open up the files in
  solr 3.5 directory. I got a huge tree in the name of lucene_solr.
 
  I run it and there is a SEVERE error: System property not set
 excetption. *
  solr*.test.sys.*prop1* not set and then the jetty loads solr. I then try
  localhost:8080/solr/select/ I get null pointer execpiton. I am only
 able to
  access admin page.
 
  Is there anything else I need to do?
 
  I tried to follow
 
 
 http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse
  .
  But I dont find the solr-3.5.war file. I tried ant dist to generate the
  dist folder but that has many jars and wars..
 
  I am able to compile the source

Re: Trouble Setting Up Development Environment

2012-03-23 Thread Li Li
here is my method.
1. check out latest source codes from trunk or download tar ball
svn checkout http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk

2. create a dynamic web project in eclipse and close it.
   for example, I create a project name lucene-solr-trunk in my
workspace.

3. copy/mv the source code to this project(it's not necessary)
   here is my directory structure
   lili@lili-desktop:~/workspace/lucene-solr-trunk$ ls
bin.tests-framework  build  lucene_trunk  src  testindex  WebContent
  lucene_trunk is the top directory checked out from svn in step 1.
4. remove WebContent generated by eclipse and modify it to a soft link to
  lili@lili-desktop:~/workspace/lucene-solr-trunk$ ll WebContent
lrwxrwxrwx 1 lili lili 28 2011-08-18 18:50 WebContent -
lucene_trunk/solr/webapp/web/
5. open lucene_trunk/dev-tools/eclipse/dot.classpath. copy all lines like
kind=src to a temp file
<classpathentry kind="src" path="lucene/core/src/java"/>
<classpathentry kind="src" path="lucene/core/src/resources"/>

6. replace all string like path=xxx to path=lucene_trunk/xxx and copy
them into .classpath file
7. mkdir WebContent/WEB-INF/lib
8. extract all jar file in dot.classpath to WebContent/WEB-INF/lib
I use this command:
lili@lili-desktop:~/workspace/lucene-solr-trunk/lucene_trunk$ cat dev-tools/eclipse/dot.classpath | grep 'kind="lib"' | awk -F 'path="' '{print $2}' | awk -F '"' '{print $1}' | xargs -I{} cp {} ../WebContent/WEB-INF/lib/
9. open this project and refresh it.
if everything is ok, it will compile all java files successfully. if
there is something wrong, Probably we don't use the correct jar. because
there are many versions of the same library.
10. right click the project - debug As - debug on Server
it will fail because no solr home is specified.
11. right click the project - debug As - debug Configuration - Arguments
Tab - VM arguments
 add
-Dsolr.solr.home=/home/lili/workspace/lucene-solr-trunk/lucene_trunk/solr/example/solr
 you can also add other vm arguments like -Xmx1g here.
12. all fine, add a break point at SolrDispatchFilter.doFilter(). all solr
request comes here
13. have fun~


On Fri, Mar 23, 2012 at 11:49 AM, Karthick Duraisamy Soundararaj 
karthick.soundara...@gmail.com wrote:

 Hi Solr Ppl,
I have been trying to set up solr dev env. I downloaded the
 tar ball of eclipse and the solr 3.5 source. Here are the exact sequence of
 steps I followed

 I extracted the solr 3.5 source and eclipse.
 I installed run-jetty-run plugin for eclipse.
 I ran ant eclipse in the solr 3.5 source directory
 I used eclipse's Open existing project option to open up the files in
 solr 3.5 directory. I got a huge tree in the name of lucene_solr.

 I run it and there is a SEVERE error: System property not set excetption. *
 solr*.test.sys.*prop1* not set and then the jetty loads solr. I then try
 localhost:8080/solr/select/ I get null pointer execpiton. I am only able to
 access admin page.

 Is there anything else I need to do?

 I tried to follow

 http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse
 .
 But I dont find the solr-3.5.war file. I tried ant dist to generate the
 dist folder but that has many jars and wars..

 I am able to compile the source with ant compile, get the solr in example
 directory up and running.

 Will be great if someone can help me with this.

 Thanks,
 Karthick



Re: How to avoid the unexpected character error?

2012-03-16 Thread Li Li
It's not the right place.
When you use java -Durl=http://... -jar post.jar data.xml,
the data.xml file must be a valid XML file, so you should escape special
chars in that file.
I don't know how you generate this file.
If you use a Java program (or another script) to generate it, you should
use XML tools to generate it. But if you build it by hand like this:
StringBuilder buf = new StringBuilder();
buf.append("<add>");
buf.append("<doc>");
buf.append("<field name=\"fname\">text content</field>");
then you should escape the special chars yourself.
If you use Java, you can make use of the org.apache.solr.common.util.XML class.
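
A small sketch of that (the field name and value are made up; XML.escapeCharData writes the escaped form to a Writer):

import java.io.StringWriter;
import org.apache.solr.common.util.XML;

public class BuildDoc {
    public static void main(String[] args) throws Exception {
        StringWriter sw = new StringWriter();
        XML.escapeCharData("5 < 6 & more", sw);   // escapes XML special characters such as & and <
        String value = sw.toString();
        String doc = "<add><doc><field name=\"fname\">" + value + "</field></doc></add>";
        System.out.println(doc);
    }
}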

On Fri, Mar 16, 2012 at 2:03 PM, neosky neosk...@yahoo.com wrote:

 I am sorry, but I can't get what you mean.
 I tried the  HTMLStripCharFilter and PatternReplaceCharFilter. It doesn't
 work.
 Could you give me an example? Thanks!

  fieldType name=text_html class=solr.TextField
 positionIncrementGap=100
   analyzer
 charFilter class=solr.HTMLStripCharFilterFactory/
 tokenizer class=solr.WhitespaceTokenizerFactory/
   /analyzer
  /fieldType

 I also tried:

 charFilter class=solr.PatternReplaceCharFilterFactory pattern=([^a-z])
 replacement=
 maxBlockChars=1 blockDelimiters=|/

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-avoid-the-unexpected-character-error-tp3824726p3831064.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr out of memory exception

2012-03-15 Thread Li Li
It seems you are using a 64-bit JVM (a 32-bit JVM can only allocate about 1.5GB).
You should enable pointer compression with -XX:+UseCompressedOops.

On Thu, Mar 15, 2012 at 1:58 PM, Husain, Yavar yhus...@firstam.com wrote:

 Thanks for helping me out.

 I have allocated Xms-2.0GB Xmx-2.0GB

 However i see Tomcat is still using pretty less memory and not 2.0G

 Total Memory on my Windows Machine = 4GB.

 With smaller index size it is working perfectly fine. I was thinking of
 increasing the system RAM  tomcat heap space allocated but then how come
 on a different server with exactly same system and solr configuration 
 memory it is working fine?


 -Original Message-
 From: Li Li [mailto:fancye...@gmail.com]
 Sent: Thursday, March 15, 2012 11:11 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr out of memory exception

 how many memory are allocated to JVM?

 On Thu, Mar 15, 2012 at 1:27 PM, Husain, Yavar yhus...@firstam.com
 wrote:

  Solr is giving out of memory exception. Full Indexing was completed fine.
  Later while searching maybe when it tries to load the results in memory
 it
  starts giving this exception. Though with the same memory allocated to
  Tomcat and exactly same solr replica on another server it is working
  perfectly fine. I am working on 64 bit software's including Java  Tomcat
  on Windows.
  Any help would be appreciated.
 
  Here are the logs:
 
  The server encountered an internal error (Severe errors in solr
  configuration. Check your log files for more detailed information on what
  may be wrong. If you want solr to continue after configuration errors,
  change: abortOnConfigurationErrorfalse/abortOnConfigurationError in
  null -
  java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
 at
  org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068) at
  org.apache.solr.core.SolrCore.init(SolrCore.java:579) at
 
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
  at
 
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
  at
 
 org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
  at
 
 org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
  at
 
 org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:115)
  at
 
 org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
  at
  org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
  at
 
 org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
  at
 org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
  at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
 at
  org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:943) at
  org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:778) at
  org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:504) at
  org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317) at
 
 org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
  at
 
 org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
  at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065)
 at
  org.apache.catalina.core.StandardHost.start(StandardHost.java:840) at
  org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057) at
  org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463) at
  org.apache.catalina.core.StandardService.start(StandardService.java:525)
 at
  org.apache.catalina.core.StandardServer.start(StandardServer.java:754) at
  org.apache.catalina.startup.Catalina.start(Catalina.java:595) at
  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
  sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at
  sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at
  java.lang.reflect.Method.invoke(Unknown Source) at
  org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289) at
  org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414) Caused by:
  java.lang.OutOfMemoryError: Java heap space at
 
 org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:180)
  at
 org.apache.lucene.index.TermInfosReader.init(TermInfosReader.java:91)
  at
 
 org.apache.lucene.index.SegmentReader$CoreReaders.init(SegmentReader.java:122)
  at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:652) at
  org.apache.lucene.index.SegmentReader.get(SegmentReader.java:613) at
  org.apache.lucene.index.DirectoryReader.init(DirectoryReader.java:104)
 at
 
 org.apache.lucene.index.ReadOnlyDirectoryReader.init(ReadOnlyDirectoryReader.java:27)
  at
  org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:74)
  at
 
 org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java

Re: Solr out of memory exception

2012-03-15 Thread Li Li
It can reduce memory usage. For small-heap applications (less than 4GB) it
may also speed things up, but be careful: for large-heap applications it
depends, and you should run some tests yourself.
Our application's test result was: it reduced memory usage but increased
response time. We use 25GB of heap.

http://lists.apple.com/archives/java-dev/2010/Apr/msg00157.html

Dyer, James james.d...@ingrambook.com wrote on 3/18/11 to solr-user:
Our tests showed, in our situation, the compressed oops flag caused our
minor (ParNew) generation time to decrease significantly.   We're using a
larger heap (22gb) and our index size is somewhere in the 40's gb total.  I
guess with any of these jvm parameters, it all depends on your situation
and you need to test.  In our case, this flag solved a real problem we were
having.  Whoever wrote the JRocket book you refer to no doubt had other
scenarios in mind...

On Thu, Mar 15, 2012 at 3:02 PM, C.Yunqin 345804...@qq.com wrote:

 why should enable pointer compression?




 -- Original --
 From:  Li Lifancye...@gmail.com;
 Date:  Thu, Mar 15, 2012 02:41 PM
 To:  Husain, Yavaryhus...@firstam.com;
 Cc:  solr-user@lucene.apache.orgsolr-user@lucene.apache.org;
 Subject:  Re: Solr out of memory exception


 it seems you are using 64bit jvm(32bit jvm can only allocate about 1.5GB).
 you should enable pointer compression by -XX:+UseCompressedOops

 On Thu, Mar 15, 2012 at 1:58 PM, Husain, Yavar yhus...@firstam.com
 wrote:

  Thanks for helping me out.
 
  I have allocated Xms-2.0GB Xmx-2.0GB
 
  However i see Tomcat is still using pretty less memory and not 2.0G
 
  Total Memory on my Windows Machine = 4GB.
 
  With smaller index size it is working perfectly fine. I was thinking of
  increasing the system RAM  tomcat heap space allocated but then how come
  on a different server with exactly same system and solr configuration 
  memory it is working fine?
 
 
  -Original Message-
  From: Li Li [mailto:fancye...@gmail.com]
  Sent: Thursday, March 15, 2012 11:11 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr out of memory exception
 
  how many memory are allocated to JVM?
 
  On Thu, Mar 15, 2012 at 1:27 PM, Husain, Yavar yhus...@firstam.com
  wrote:
 
   Solr is giving out of memory exception. Full Indexing was completed
 fine.
   Later while searching maybe when it tries to load the results in memory
  it
   starts giving this exception. Though with the same memory allocated to
   Tomcat and exactly same solr replica on another server it is working
   perfectly fine. I am working on 64 bit software's including Java 
 Tomcat
   on Windows.
   Any help would be appreciated.
  
   Here are the logs:
  
   The server encountered an internal error (Severe errors in solr
   configuration. Check your log files for more detailed information on
 what
   may be wrong. If you want solr to continue after configuration errors,
   change: abortOnConfigurationErrorfalse/abortOnConfigurationError in
   null -
   java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
  at
   org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068) at
   org.apache.solr.core.SolrCore.init(SolrCore.java:579) at
  
 
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
   at
  
 
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
   at
  
 
 org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
   at
  
 
 org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
   at
  
 
 org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:115)
   at
  
 
 org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
   at
  
 org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
   at
  
 
 org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
   at
  org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
   at
 org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
  at
   org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:943)
 at
   org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:778)
 at
   org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:504)
 at
   org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317) at
  
 
 org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
   at
  
 
 org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
   at
 org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065)
  at
   org.apache.catalina.core.StandardHost.start(StandardHost.java:840) at
   org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057

Re: Sorting on non-stored field

2012-03-14 Thread Li Li
It should be indexed but not analyzed; it doesn't need to be stored.
Reading field values from stored fields is extremely slow, so Lucene uses
the FieldCache StringIndex to read field values for sorting. So if you want
to sort by some field, index that field and don't analyze it.
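
A Lucene-level sketch (the field name is an example): the sort key is indexed un-analyzed and not stored, and sorting on it works fine:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class SortKeyExample {
    // add the sort key as indexed, not analyzed, not stored
    static Document withSortKey(Document doc, String sku) {
        doc.add(new Field("sku", sku, Field.Store.NO, Field.Index.NOT_ANALYZED));
        return doc;
    }

    // sort on it as a string field
    static Sort bySku() {
        return new Sort(new SortField("sku", SortField.STRING));
    }
}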

On Wed, Mar 14, 2012 at 6:43 PM, Finotti Simone tech...@yoox.com wrote:

 I was wondering: is it possible to sort a Solr result-set on a non-stored
 value?

 Thank you


Re: How to avoid the unexpected character error?

2012-03-14 Thread Li Li
There is a class org.apache.solr.common.util.XML in Solr.
You can use this wrapper:
public static String escapeXml(String s) throws IOException {
    StringWriter sw = new StringWriter();
    XML.escapeCharData(s, sw);
    return sw.getBuffer().toString();
}

On Wed, Mar 14, 2012 at 4:34 PM, neosky neosk...@yahoo.com wrote:

 I use the xml to index the data. One filed might contains some characters
 like '' =
 It seems that will produce the error
 I modify that filed doesn't index, but it doesn't work. I need to store the
 filed, but index might not be indexed.
 Thanks!

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-avoid-the-unexpected-character-error-tp3824726p3824726.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to avoid the unexpected character error?

2012-03-14 Thread Li Li
No, it has nothing to do with schema.xml.
post.jar just posts a file; it doesn't parse it.
Solr will use an XML parser to parse the file, and if you don't escape
special characters it's not a valid XML file and Solr will throw exceptions.

On Thu, Mar 15, 2012 at 12:33 AM, neosky neosk...@yahoo.com wrote:

 Thanks!
 Does the schema.xml support this parameter? I am using the example post.jar
 to index my file.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-avoid-the-unexpected-character-error-tp3824726p3825959.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr out of memory exception

2012-03-14 Thread Li Li
How much memory is allocated to the JVM?

On Thu, Mar 15, 2012 at 1:27 PM, Husain, Yavar yhus...@firstam.com wrote:

 Solr is giving out of memory exception. Full Indexing was completed fine.
 Later while searching maybe when it tries to load the results in memory it
 starts giving this exception. Though with the same memory allocated to
 Tomcat and exactly same solr replica on another server it is working
 perfectly fine. I am working on 64 bit software's including Java  Tomcat
 on Windows.
 Any help would be appreciated.

 Here are the logs:

 The server encountered an internal error (Severe errors in solr
 configuration. Check your log files for more detailed information on what
 may be wrong. If you want solr to continue after configuration errors,
 change: abortOnConfigurationErrorfalse/abortOnConfigurationError in
 null -
 java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at
 org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068) at
 org.apache.solr.core.SolrCore.init(SolrCore.java:579) at
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
 at
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
 at
 org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
 at
 org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
 at
 org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:115)
 at
 org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
 at
 org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
 at
 org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
 at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
 at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601) at
 org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:943) at
 org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:778) at
 org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:504) at
 org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317) at
 org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
 at
 org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
 at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065) at
 org.apache.catalina.core.StandardHost.start(StandardHost.java:840) at
 org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057) at
 org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463) at
 org.apache.catalina.core.StandardService.start(StandardService.java:525) at
 org.apache.catalina.core.StandardServer.start(StandardServer.java:754) at
 org.apache.catalina.startup.Catalina.start(Catalina.java:595) at
 sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
 sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at
 java.lang.reflect.Method.invoke(Unknown Source) at
 org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289) at
 org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414) Caused by:
 java.lang.OutOfMemoryError: Java heap space at
 org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:180)
 at org.apache.lucene.index.TermInfosReader.init(TermInfosReader.java:91)
 at
 org.apache.lucene.index.SegmentReader$CoreReaders.init(SegmentReader.java:122)
 at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:652) at
 org.apache.lucene.index.SegmentReader.get(SegmentReader.java:613) at
 org.apache.lucene.index.DirectoryReader.init(DirectoryReader.java:104) at
 org.apache.lucene.index.ReadOnlyDirectoryReader.init(ReadOnlyDirectoryReader.java:27)
 at
 org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:74)
 at
 org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
 at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69) at
 org.apache.lucene.index.IndexReader.open(IndexReader.java:476) at
 org.apache.lucene.index.IndexReader.open(IndexReader.java:403) at
 org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
 at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1057) at
 org.apache.solr.core.SolrCore.init(SolrCore.java:579) at
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
 at
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
 at
 org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
 at
 org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
 at
 org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:115)
 at
 

Re: index size with replication

2012-03-13 Thread Li Li
Optimize generates new segments and deletes the old ones. If your master
also serves searches during indexing, the old files may still be held open
by an old SolrIndexSearcher; they will be deleted later. So while indexing,
the index size may double, but a moment later the old segments will be
deleted.

On Wed, Mar 14, 2012 at 7:06 AM, Mike Austin mike.aus...@juggle.com wrote:

 I have a master with two slaves.  For some reason on the master if I do an
 optimize after indexing on the master it double in size from 42meg to 90
 meg.. however,  when the slaves replicate they get the 42meg index..

 Should the master and slaves always be the same size?

 Thanks,
 Mike



Re: How to limit the number of open searchers?

2012-03-06 Thread Li Li
What do you mean by programmatically? Modify the code of Solr? Because Solr
is not like Lucene: it only provides HTTP interfaces to its users, not a
Java API.

If you want to modify Solr, you can find this in SolrCore:
private final LinkedList<RefCounted<SolrIndexSearcher>> _searchers = new LinkedList<RefCounted<SolrIndexSearcher>>();
and _searcher is the current searcher.
Be careful to use searcherLock to synchronize your code.
Maybe you can write your code like:

synchronized (searcherLock) {
    if (_searchers.size() == 1) {
        ...
    }
}




On Tue, Mar 6, 2012 at 3:18 AM, Michael Ryan mr...@moreover.com wrote:

 Is there a way to limit the number of searchers that can be open at a
 given time?  I know there is a maxWarmingSearchers configuration that
 limits the number of warming searchers, but that's not quite what I'm
 looking for...

 Ideally, when I commit, I want there to only be one searcher open before
 the commit, so that during the commit and warming, there is a max of two
 searchers open.  I'd be okay with delaying the commit until there is only
 one searcher open.  Is there a way to programmatically determine how many
 searchers are currently open?

 -Michael



Re: Fw:how to make fdx file

2012-03-04 Thread Li Li
Lucene never modifies old segment files; it just flushes to a new segment
or merges old segments into a new one, and after merging the old segments
are deleted.
Once a file (such as fdt or fdx) is generated, it is never re-generated.
The only possibilities are that something went wrong while it was being
generated, or that it was deleted by another program or wrongly deleted by
a human.

On Sat, Mar 3, 2012 at 2:33 PM, C.Yunqin 345804...@qq.com wrote:

 yes,the fdt file still is there.  can i make new fdx file through fdt file.
  is there a posibilty that  during the process of updating and optimizing,
 the index will be deleted then re-generated?



  -- Original --
  From:  Erick Ericksonerickerick...@gmail.com;
  Date:  Sat, Mar 3, 2012 08:28 AM
  To:  solr-usersolr-user@lucene.apache.org;

  Subject:  Re: Fw:how to make fdx file


 As far as I know, fdx files don't just disappear, so I can only assume
 that something external removed it.

 That said, if you somehow re-indexed and had no fields where
 stored=true, then the fdx file may not be there.

 Are you seeing problems as a result? This file is used to store
 index information for stored fields. Do you have an fdt file?

 Best
 Erick

 On Fri, Mar 2, 2012 at 2:48 AM, C.Yunqin 345804...@qq.com wrote:
  Hi ,
my fdx file was unexpected gone, then the solr sever stop running;
 what I can do to recover solr?
 
   Other files still exist.
 
   Thanks very much
 
 
  /:includetail



Re: Solr HBase - Re: How is Data Indexed in HBase?

2012-02-23 Thread Bing Li
Dear Mr Gupta,

Your understanding of my solution is correct. Now both HBase and Solr
are used in my system. I hope it could work.

Thanks so much for your reply!

Best regards,
Bing

On Fri, Feb 24, 2012 at 3:30 AM, T Vinod Gupta tvi...@readypulse.comwrote:

 regarding your question on hbase support for high performance and
 consistency - i would say hbase is highly scalable and performant. how it
 does what it does can be understood by reading relevant chapters around
 architecture and design in the hbase book.

 with regards to ranking, i see your problem. but if you split the problem
 into hbase specific solution and solr based solution, you can achieve the
 results probably. may be you do the ranking and store the rank in hbase and
 then use solr to get the results and then use hbase as a lookup to get the
 rank. or you can put the rank as part of the document schema and index the
 rank too for range queries and such. is my understanding of your scenario
 wrong?

 thanks


 On Wed, Feb 22, 2012 at 9:51 AM, Bing Li lbl...@gmail.com wrote:

 Mr Gupta,

 Thanks so much for your reply!

 In my use cases, retrieving data by keyword is one of them. I think Solr
 is a proper choice.

 However, Solr does not provide a complex enough support to rank. And,
 frequent updating is also not suitable in Solr. So it is difficult to
 retrieve data randomly based on the values other than keyword frequency in
 text. In this case, I attempt to use HBase.

 But I don't know how HBase support high performance when it needs to keep
 consistency in a large scale distributed system.

 Now both of them are used in my system.

 I will check out ElasticSearch.

 Best regards,
 Bing


 On Thu, Feb 23, 2012 at 1:35 AM, T Vinod Gupta tvi...@readypulse.comwrote:

 Bing,
 Its a classic battle on whether to use solr or hbase or a combination of
 both. both systems are very different but there is some overlap in the
 utility. they also differ vastly when it compares to computation power,
 storage needs, etc. so in the end, it all boils down to your use case. you
 need to pick the technology that it best suited to your needs.
 im still not clear on your use case though.

 btw, if you haven't started using solr yet - then you might want to
 checkout ElasticSearch. I spent over a week researching between solr and ES
 and eventually chose ES due to its cool merits.

 thanks


 On Wed, Feb 22, 2012 at 9:31 AM, Ted Yu yuzhih...@gmail.com wrote:

 There is no secondary index support in HBase at the moment.

 It's on our road map.

 FYI

 On Wed, Feb 22, 2012 at 9:28 AM, Bing Li lbl...@gmail.com wrote:

  Jacques,
 
  Yes. But I still have questions about that.
 
  In my system, when users search with a keyword arbitrarily, the query
 is
  forwarded to Solr. No any updating operations but appending new
 indexes
  exist in Solr managed data.
 
  When I need to retrieve data based on ranking values, HBase is used.
 And,
  the ranking values need to be updated all the time.
 
  Is that correct?
 
  My question is that the performance must be low if keeping
 consistency in a
  large scale distributed environment. How does HBase handle this issue?
 
  Thanks so much!
 
  Bing
 
 
  On Thu, Feb 23, 2012 at 1:17 AM, Jacques whs...@gmail.com wrote:
 
   It is highly unlikely that you could replace Solr with HBase.
  They're
   really apples and oranges.
  
  
   On Wed, Feb 22, 2012 at 1:09 AM, Bing Li lbl...@gmail.com wrote:
  
   Dear all,
  
   I wonder how data in HBase is indexed? Now Solr is used in my
 system
   because data is managed in inverted index. Such an index is
 suitable to
   retrieve unstructured and huge amount of data. How does HBase deal
 with
   the
   issue? May I replaced Solr with HBase?
  
   Thanks so much!
  
   Best regards,
   Bing
  
  
  
 







How is Data Indexed in HBase?

2012-02-22 Thread Bing Li
Dear all,

I wonder how data in HBase is indexed. Solr is currently used in my system
because the data is managed in an inverted index; such an index is suitable
for retrieving unstructured and huge amounts of data. How does HBase deal
with this issue? May I replace Solr with HBase?

Thanks so much!

Best regards,
Bing


Re: Solr HBase - Re: How is Data Indexed in HBase?

2012-02-22 Thread Bing Li
Mr Gupta,

Thanks so much for your reply!

In my use cases, retrieving data by keyword is one of them. I think Solr is
a proper choice.

However, Solr does not provide complex enough support for ranking, and
frequent updating is also not a good fit for Solr. So it is difficult to
retrieve data based on values other than keyword frequency in the text. In
this case, I attempt to use HBase.

But I don't know how HBase supports high performance when it needs to keep
consistency in a large-scale distributed system.

Now both of them are used in my system.

I will check out ElasticSearch.

Best regards,
Bing


On Thu, Feb 23, 2012 at 1:35 AM, T Vinod Gupta tvi...@readypulse.comwrote:

 Bing,
 Its a classic battle on whether to use solr or hbase or a combination of
 both. both systems are very different but there is some overlap in the
 utility. they also differ vastly when it compares to computation power,
 storage needs, etc. so in the end, it all boils down to your use case. you
 need to pick the technology that it best suited to your needs.
 im still not clear on your use case though.

 btw, if you haven't started using solr yet - then you might want to
 checkout ElasticSearch. I spent over a week researching between solr and ES
 and eventually chose ES due to its cool merits.

 thanks


 On Wed, Feb 22, 2012 at 9:31 AM, Ted Yu yuzhih...@gmail.com wrote:

 There is no secondary index support in HBase at the moment.

 It's on our road map.

 FYI

 On Wed, Feb 22, 2012 at 9:28 AM, Bing Li lbl...@gmail.com wrote:

  Jacques,
 
  Yes. But I still have questions about that.
 
  In my system, when users search with a keyword arbitrarily, the query is
  forwarded to Solr. No any updating operations but appending new indexes
  exist in Solr managed data.
 
  When I need to retrieve data based on ranking values, HBase is used.
 And,
  the ranking values need to be updated all the time.
 
  Is that correct?
 
  My question is that the performance must be low if keeping consistency
 in a
  large scale distributed environment. How does HBase handle this issue?
 
  Thanks so much!
 
  Bing
 
 
  On Thu, Feb 23, 2012 at 1:17 AM, Jacques whs...@gmail.com wrote:
 
   It is highly unlikely that you could replace Solr with HBase.  They're
   really apples and oranges.
  
  
   On Wed, Feb 22, 2012 at 1:09 AM, Bing Li lbl...@gmail.com wrote:
  
   Dear all,
  
   I wonder how data in HBase is indexed? Now Solr is used in my system
   because data is managed in inverted index. Such an index is suitable
 to
   retrieve unstructured and huge amount of data. How does HBase deal
 with
   the
   issue? May I replaced Solr with HBase?
  
   Thanks so much!
  
   Best regards,
   Bing
  
  
  
 





Re: Sort by the number of matching terms (coord value)

2012-02-16 Thread Li Li
You can fool the Lucene scoring function: override each function such as
idf, queryNorm and lengthNorm and let them simply return 1.0f.
I don't know whether Lucene 4 will expose more details, but for 2.x/3.x
Lucene can only score with the vector space model and the formula can't be
replaced by users.
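
A rough Lucene 3.x sketch of that trick: a Similarity where the usual factors are flattened to 1.0 so the score is roughly the number of matching terms (treat it as an illustration, not a drop-in):

import org.apache.lucene.search.DefaultSimilarity;

public class MatchCountSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) { return 1.0f; }

    @Override
    public float idf(int docFreq, int numDocs) { return 1.0f; }

    @Override
    public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }

    @Override
    public float lengthNorm(String fieldName, int numTokens) { return 1.0f; }

    @Override
    public float coord(int overlap, int maxOverlap) { return overlap; } // number of matching clauses
}
// usage: searcher.setSimilarity(new MatchCountSimilarity()); then sort by score as usual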

On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark clark...@gmail.com wrote:

 Hi,

 I'm looking for a way to sort results by the number of matching terms.
 Being able to sort by the coord() value or by the overlap value that gets
 passed into the coord() function would do the trick. Is there a way I can
 expose those values to the sort function?

 I'd appreciate any help that points me in the right direction. I'm OK with
 making basic code modifications.

 Thanks!

 -Nick



Re: Can I rebuild an index and remove some fields?

2012-02-15 Thread Li Li
Great. I think you could make it a public tool; maybe others also need such
functionality.

On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart bstewart...@gmail.comwrote:

 I implemented an index shrinker and it works.  I reduced my test index
 from 6.6 GB to 3.6 GB by removing a single shingled field I did not
 need anymore.  I'm actually using Lucene.Net for this project so code
 is C# using Lucene.Net 2.9.2 API.  But basic idea is:

 Create an IndexReader wrapper that only enumerates the terms you want
 to keep, and that removes terms from documents when returning
 documents.

 Use the SegmentMerger to re-write each segment (where each segment is
 wrapped by the wrapper class), writing new segment to a new directory.
 Collect the SegmentInfos and do a commit in order to create a new
 segments file in new index directory

 Done - you now have a shrunk index with specified terms removed.

 Implementation uses separate thread for each segment, so it re-writes
 them in parallel.  Took about 15 minutes to do 770,000 doc index on my
 macbook.


 On Tue, Feb 14, 2012 at 10:12 PM, Li Li fancye...@gmail.com wrote:
  I have roughly read the codes of 4.0 trunk. maybe it's feasible.
 SegmentMerger.add(IndexReader) will add to be merged Readers
 merge() will call
   mergeTerms(segmentWriteState);
   mergePerDoc(segmentWriteState);
 
mergeTerms() will construct fields from IndexReaders
 for (int readerIndex = 0; readerIndex < mergeState.readers.size(); readerIndex++) {
   final MergeState.IndexReaderAndLiveDocs r =
  mergeState.readers.get(readerIndex);
   final Fields f = r.reader.fields();
   final int maxDoc = r.reader.maxDoc();
   if (f != null) {
 slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
 fields.add(f);
   }
   docBase += maxDoc;
 }
 So If you wrapper your IndexReader and override its fields() method,
  maybe it will work for merge terms.
 
 for DocValues, it can also override AtomicReader.docValues(). just
  return null for fields you want to remove. maybe it should
  traverse CompositeReader's getSequentialSubReaders() and wrapper each
  AtomicReader
 
 other things like term vectors norms are similar.
  On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart bstewart...@gmail.com
 wrote:
 
  I was thinking if I make a wrapper class that aggregates another
  IndexReader and filter out terms I don't want anymore it might work.
 And
  then pass that wrapper into SegmentMerger.  I think if I filter out
 terms
  on GetFieldNames(...) and Terms(...) it might work.
 
  Something like:
 
   HashSet<string> ignoredTerms=...;
 
  FilteringIndexReader wrapper=new FilterIndexReader(reader);
 
  SegmentMerger merger=new SegmentMerger(writer);
 
  merger.add(wrapper);
 
  merger.Merge();
 
 
 
 
 
  On Feb 14, 2012, at 1:49 AM, Li Li wrote:
 
   for method 2, delete is wrong. we can't delete terms.
 you also should hack with the tii and tis file.
  
   On Tue, Feb 14, 2012 at 2:46 PM, Li Li fancye...@gmail.com wrote:
  
   method1, dumping data
   for stored fields, you can traverse the whole index and save it to
   somewhere else.
   for indexed but not stored fields, it may be more difficult.
  if the indexed and not stored field is not analyzed(fields such as
   id), it's easy to get from FieldCache.StringIndex.
  But for analyzed fields, though theoretically it can be restored
 from
   term vector and term position, it's hard to recover from index.
  
   method 2, hack with metadata
   1. indexed fields
delete by query, e.g. field:*
   2. stored fields
 because all fields are stored sequentially. it's not easy to
  delete
   some fields. this will not affect search speed. but if you want to
 get
   stored fields,  and the useless fields are very long, then it will
 slow
   down.
 also it's possible to hack with it. but need more effort to
   understand the index file format  and traverse the fdt/fdx file.
  
 
 http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
  
   this will give you some insight.
  
  
   On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart 
 bstewart...@gmail.com
  wrote:
  
   Lets say I have a large index (100M docs, 1TB, split up between 10
   indexes).  And a bunch of the stored and indexed fields are not
  used in
   search at all.  In order to save memory and disk, I'd like to
 rebuild
  that
   index *without* those fields, but I don't have original documents to
   rebuild entire index with (don't have the full-text anymore, etc.).
  Is
   there some way to rebuild or optimize an existing index with only a
  sub-set
   of the existing indexed fields?  Or alternatively is there a way to
  avoid
   loading some indexed fields at all ( to avoid loading term infos and
  terms
   index ) ?
  
   Thanks
   Bob
  
  
  
 
 



Re: Can I rebuild an index and remove some fields?

2012-02-14 Thread Li Li
I have roughly read the code of the 4.0 trunk; this may be feasible.
SegmentMerger.add(IndexReader) adds the readers to be merged, and merge()
will call:
  mergeTerms(segmentWriteState);
  mergePerDoc(segmentWriteState);

mergeTerms() constructs the fields from the IndexReaders:
for (int readerIndex = 0; readerIndex < mergeState.readers.size(); readerIndex++) {
  final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
  final Fields f = r.reader.fields();
  final int maxDoc = r.reader.maxDoc();
  if (f != null) {
    slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
    fields.add(f);
  }
  docBase += maxDoc;
}
So if you wrap your IndexReader and override its fields() method, it may
work for merging terms.

For DocValues, the wrapper can also override AtomicReader.docValues() and
just return null for the fields you want to remove. It should probably
traverse the CompositeReader's getSequentialSubReaders() and wrap each
AtomicReader.

Other things like term vectors and norms are handled similarly.
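
As a rough, untested sketch against the 3.x API (the FilterIndexReader approach
from the reply quoted below), a wrapper might filter the field names and the
term dictionary like this. The class name and dropped-field set are made up for
illustration, and norms, termDocs/termPositions for the dropped fields, term
vectors and stored fields would all need similar treatment:

  import java.io.IOException;
  import java.util.Collection;
  import java.util.HashSet;
  import java.util.Set;

  import org.apache.lucene.index.FilterIndexReader;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.TermEnum;

  // Hides the given fields' terms from whoever consumes the wrapped reader.
  public class FieldStrippingReader extends FilterIndexReader {

    private final Set<String> droppedFields;

    public FieldStrippingReader(IndexReader in, Set<String> droppedFields) {
      super(in);
      this.droppedFields = new HashSet<String>(droppedFields);
    }

    @Override
    public Collection<String> getFieldNames(IndexReader.FieldOption option) {
      Collection<String> names = new HashSet<String>(in.getFieldNames(option));
      names.removeAll(droppedFields);
      return names;
    }

    @Override
    public TermEnum terms() throws IOException {
      final TermEnum delegate = in.terms();
      // Skip every term that belongs to one of the dropped fields.
      return new FilterTermEnum(delegate) {
        @Override
        public boolean next() throws IOException {
          while (delegate.next()) {
            if (!droppedFields.contains(delegate.term().field())) {
              return true;
            }
          }
          return false;
        }
      };
    }
  }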
On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart bstewart...@gmail.com wrote:

 I was thinking if I make a wrapper class that aggregates another
 IndexReader and filter out terms I don't want anymore it might work.   And
 then pass that wrapper into SegmentMerger.  I think if I filter out terms
 on GetFieldNames(...) and Terms(...) it might work.

 Something like:

  HashSet<String> ignoredTerms = ...;

 FilteringIndexReader wrapper=new FilterIndexReader(reader);

 SegmentMerger merger=new SegmentMerger(writer);

 merger.add(wrapper);

 merger.Merge();





 On Feb 14, 2012, at 1:49 AM, Li Li wrote:

  for method 2, delete is wrong. we can't delete terms.
you also should hack with the tii and tis file.
 
  On Tue, Feb 14, 2012 at 2:46 PM, Li Li fancye...@gmail.com wrote:
 
  method1, dumping data
  for stored fields, you can traverse the whole index and save it to
  somewhere else.
  for indexed but not stored fields, it may be more difficult.
 if the indexed and not stored field is not analyzed(fields such as
  id), it's easy to get from FieldCache.StringIndex.
 But for analyzed fields, though theoretically it can be restored from
  term vector and term position, it's hard to recover from index.
 
  method 2, hack with metadata
  1. indexed fields
   delete by query, e.g. field:*
  2. stored fields
because all fields are stored sequentially. it's not easy to
 delete
  some fields. this will not affect search speed. but if you want to get
  stored fields,  and the useless fields are very long, then it will slow
  down.
also it's possible to hack with it. but need more effort to
  understand the index file format  and traverse the fdt/fdx file.
 
 http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
 
  this will give you some insight.
 
 
  On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart bstewart...@gmail.com
 wrote:
 
  Lets say I have a large index (100M docs, 1TB, split up between 10
  indexes).  And a bunch of the stored and indexed fields are not
 used in
  search at all.  In order to save memory and disk, I'd like to rebuild
 that
  index *without* those fields, but I don't have original documents to
  rebuild entire index with (don't have the full-text anymore, etc.).  Is
  there some way to rebuild or optimize an existing index with only a
 sub-set
  of the existing indexed fields?  Or alternatively is there a way to
 avoid
  loading some indexed fields at all ( to avoid loading term infos and
 terms
  index ) ?
 
  Thanks
  Bob
 
 
 




Re: New segment file created too often

2012-02-13 Thread Li Li
 Commit is called
after adding each document


 You should add a reasonable batch of documents and then call a single
commit; commit is a costly operation.
 If you need newly added documents to be searchable right away, you could use NRT.
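
A minimal SolrJ sketch of the batching idea (the URL, field names and batch
size are illustrative; the commitWithin option on updates, or NRT, would be
other ways to bound visibility latency):

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchedIndexer {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 10000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("text_t", "some content " + i);
        batch.add(doc);

        if (batch.size() == 500) {
          server.add(batch);     // send a batch of documents ...
          server.commit();       // ... and commit once per batch, not per document
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        server.add(batch);
        server.commit();
      }
    }
  }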

On Tue, Feb 14, 2012 at 12:47 AM, Huy Le hu...@springpartners.com wrote:

 Hi,

 I am using solr 3.5.  I seeing solr keeps creating new segment files (1MB
 files) so often that it triggers segment merge about every one minute. I
 search the news archive, but could not find any info on this issue.  I am
 indexing about 10 docs of less 2KB each every second.  Commit is called
 after adding each document. Relevant config params are:

  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>1024</ramBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>

 What might be triggering this frequent new segment files creation?  Thanks!

 Huy

 --
 Huy Le
 Spring Partners, Inc.
 http://springpadit.com



Re: New segment file created too often

2012-02-13 Thread Li Li
As far as I know, there are three situations in which buffered documents are
flushed to a new segment: the RAM buffer for the posting data structures is
used up; the number of added documents exceeds the configured threshold; or
a segment has accumulated many deletions. But with your configuration it does
not look likely to flush many small segments:

<ramBufferSizeMB>1024</ramBufferSizeMB>
<maxMergeDocs>2147483647</maxMergeDocs>
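
Roughly speaking, these map onto the Lucene-level flush triggers; a small
illustrative sketch (the values and disabled triggers are arbitrary, not a
recommendation):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.util.Version;

  public class FlushTriggers {
    public static IndexWriterConfig config() {
      IndexWriterConfig conf =
          new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
      conf.setRAMBufferSizeMB(1024.0);                                      // trigger 1: RAM buffer is full
      conf.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);        // trigger 2: buffered doc count
      conf.setMaxBufferedDeleteTerms(IndexWriterConfig.DISABLE_AUTO_FLUSH); // trigger 3: buffered delete terms
      return conf;
    }
  }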
On Tue, Feb 14, 2012 at 1:10 AM, Huy Le hu...@springpartners.com wrote:

 Hi,

 I am using solr 3.5.  As I understood it, NRT is a solr 4 feature, but solr
 4 is not released yet.

 I understand commit after adding each document is expensive, but the
 application requires that documents be available after adding to the index.

 What I don't understand is why new segment files are created so often.
 Are the commit calls triggering new segment files being created?  I don't
 see this behavior in another environment of the same version of solr.

 Huy

 On Mon, Feb 13, 2012 at 11:55 AM, Li Li fancye...@gmail.com wrote:

   Commit is called
  after adding each document
 
 
   you should add enough documents and then calling a commit. commit is a
  cost operation.
   if you want to get latest feeded documents, you could use NRT
 
  On Tue, Feb 14, 2012 at 12:47 AM, Huy Le hu...@springpartners.com
 wrote:
 
   Hi,
  
   I am using solr 3.5.  I seeing solr keeps creating new segment files
  (1MB
   files) so often that it triggers segment merge about every one minute.
 I
   search the news archive, but could not find any info on this issue.  I
 am
   indexing about 10 docs of less 2KB each every second.  Commit is called
   after adding each document. Relevant config params are:
  
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
  
   What might be triggering this frequent new segment files creation?
   Thanks!
  
   Huy
  
   --
   Huy Le
   Spring Partners, Inc.
   http://springpadit.com
  
 



 --
 Huy Le
 Spring Partners, Inc.
 http://springpadit.com



Re: New segment file created too often

2012-02-13 Thread Li Li
Can you post your config file?
I found there are two places to configure ramBufferSizeMB in the latest svn
of 3.6's example solrconfig.xml. Try modifying them both:

  <indexDefaults>

    <useCompoundFile>false</useCompoundFile>

    <mergeFactor>10</mergeFactor>
    <!-- Sets the amount of RAM that may be used by Lucene indexing
         for buffering added documents and deletions before they are
         flushed to the Directory.  -->
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then
         Lucene will flush based on whichever limit is hit first.
      -->
    <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->

    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>

    ...
    <!-- <termIndexInterval>256</termIndexInterval> -->
  </indexDefaults>

  <!-- Main Index

       Values here override the values in the indexDefaults section
       for the main on disk index.
    -->
  <mainIndex>

    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    ...
  </mainIndex>

On Tue, Feb 14, 2012 at 1:10 AM, Huy Le hu...@springpartners.com wrote:

 Hi,

 I am using solr 3.5.  As I understood it, NRT is a solr 4 feature, but solr
 4 is not released yet.

 I understand commit after adding each document is expensive, but the
 application requires that documents be available after adding to the index.

 What I don't understand is why new segment files are created so often.
 Are the commit calls triggering new segment files being created?  I don't
 see this behavior in another environment of the same version of solr.

 Huy

 On Mon, Feb 13, 2012 at 11:55 AM, Li Li fancye...@gmail.com wrote:

   Commit is called
  after adding each document
 
 
   you should add enough documents and then calling a commit. commit is a
  cost operation.
   if you want to get latest feeded documents, you could use NRT
 
  On Tue, Feb 14, 2012 at 12:47 AM, Huy Le hu...@springpartners.com
 wrote:
 
   Hi,
  
   I am using solr 3.5.  I seeing solr keeps creating new segment files
  (1MB
   files) so often that it triggers segment merge about every one minute.
 I
   search the news archive, but could not find any info on this issue.  I
 am
   indexing about 10 docs of less 2KB each every second.  Commit is called
   after adding each document. Relevant config params are:
  
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
  
   What might be triggering this frequent new segment files creation?
   Thanks!
  
   Huy
  
   --
   Huy Le
   Spring Partners, Inc.
   http://springpadit.com
  
 



 --
 Huy Le
 Spring Partners, Inc.
 http://springpadit.com



Re: Can I rebuild an index and remove some fields?

2012-02-13 Thread Li Li
Method 1: dumping the data
For stored fields, you can traverse the whole index and save the values
somewhere else.
For indexed-but-not-stored fields it is more difficult. If such a field is
not analyzed (fields such as id), its values are easy to get from
FieldCache.StringIndex. For analyzed fields, though the values can in theory
be restored from term vectors and term positions, they are hard to recover
from the index.

Method 2: hacking the index metadata
1. indexed fields
   delete by query, e.g. field:*
2. stored fields
   Because all fields are stored sequentially, it is not easy to delete some
   of them. Leaving them in place will not affect search speed, but if you
   retrieve stored fields and the useless fields are very long, retrieval
   will slow down.
   It is possible to hack around this, but it takes more effort to understand
   the index file format and to traverse the fdt/fdx files.
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html

This will give you some insight.
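
A rough sketch of method 1 against Lucene 3.5: copy documents into a fresh
index from their stored values only, dropping the unwanted fields. Only stored
values survive this round trip (analyzed, unstored fields are lost, as noted
above); the field names and analyzer below are placeholders:

  import java.io.File;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class DumpStoredFields {
    public static void main(String[] args) throws Exception {
      Directory src = FSDirectory.open(new File(args[0]));
      Directory dst = FSDirectory.open(new File(args[1]));
      String[] uselessFields = {"body", "html"};   // placeholders for the fields to drop

      IndexReader reader = IndexReader.open(src);
      IndexWriter writer = new IndexWriter(dst,
          new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
      try {
        for (int i = 0; i < reader.maxDoc(); i++) {
          if (reader.isDeleted(i)) {
            continue;                        // skip deleted documents
          }
          Document doc = reader.document(i); // only stored fields come back here
          for (String field : uselessFields) {
            doc.removeFields(field);
          }
          writer.addDocument(doc);
        }
      } finally {
        writer.close();
        reader.close();
      }
    }
  }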

On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart bstewart...@gmail.com wrote:

 Lets say I have a large index (100M docs, 1TB, split up between 10
 indexes).  And a bunch of the stored and indexed fields are not used in
 search at all.  In order to save memory and disk, I'd like to rebuild that
 index *without* those fields, but I don't have original documents to
 rebuild entire index with (don't have the full-text anymore, etc.).  Is
 there some way to rebuild or optimize an existing index with only a sub-set
 of the existing indexed fields?  Or alternatively is there a way to avoid
 loading some indexed fields at all ( to avoid loading term infos and terms
 index ) ?

 Thanks
 Bob


Re: Can I rebuild an index and remove some fields?

2012-02-13 Thread Li Li
For method 2, the delete-by-query suggestion is wrong; we can't delete terms
that way. You would also have to hack the .tii and .tis files.

On Tue, Feb 14, 2012 at 2:46 PM, Li Li fancye...@gmail.com wrote:

 method1, dumping data
 for stored fields, you can traverse the whole index and save it to
 somewhere else.
 for indexed but not stored fields, it may be more difficult.
 if the indexed and not stored field is not analyzed(fields such as
 id), it's easy to get from FieldCache.StringIndex.
 But for analyzed fields, though theoretically it can be restored from
 term vector and term position, it's hard to recover from index.

 method 2, hack with metadata
 1. indexed fields
   delete by query, e.g. field:*
 2. stored fields
because all fields are stored sequentially. it's not easy to delete
 some fields. this will not affect search speed. but if you want to get
 stored fields,  and the useless fields are very long, then it will slow
 down.
also it's possible to hack with it. but need more effort to
 understand the index file format  and traverse the fdt/fdx file.
 http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html

 this will give you some insight.


 On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart bstewart...@gmail.com wrote:

 Lets say I have a large index (100M docs, 1TB, split up between 10
 indexes).  And a bunch of the stored and indexed fields are not used in
 search at all.  In order to save memory and disk, I'd like to rebuild that
 index *without* those fields, but I don't have original documents to
 rebuild entire index with (don't have the full-text anymore, etc.).  Is
 there some way to rebuild or optimize an existing index with only a sub-set
 of the existing indexed fields?  Or alternatively is there a way to avoid
 loading some indexed fields at all ( to avoid loading term infos and terms
 index ) ?

 Thanks
 Bob





more sql-like commands for solr

2012-02-07 Thread Li Li
hi all,
We have used Solr to provide search services in many products, and for each
product we have to write some configuration and query expressions. Our users
are not used to this. They are familiar with SQL, and they may describe their
needs like this: "I want a query that searches books whose title contains
java, groups the books by publishing year, and orders them by matching score
and freshness, where the weight of score is 2 and the weight of freshness
is 1."
Maybe they would be happier if they could use SQL-like statements to convey
their needs:
select * from books where title contains java group by pub_year order
by score^2, freshness^1
They might also like to insert or delete documents with something like
"delete from books where title contains java and pub_year between 2011 and
2012".
We could define a language similar to SQL and translate it to a Solr query
string such as .../select/?q=+title:java^2 +pub_year:2011
This would be roughly what Apache Hive is for Hadoop.
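
Purely as an illustration of the mapping (the field names and grouping details
are assumptions about a hypothetical schema, not a real translator), the
example statement could compile down to a SolrJ query along these lines:

  import org.apache.solr.client.solrj.SolrQuery;

  public class SqlLikeExample {
    // "select * from books where title contains java
    //  group by pub_year order by score^2, freshness^1"
    public static SolrQuery bookQuery() {
      SolrQuery q = new SolrQuery();
      q.setQuery("+title:java");
      // The score^2, freshness^1 weighting would more realistically be a boost
      // function than a sort; a plain sort is used here only to keep the sketch short.
      q.addSortField("score", SolrQuery.ORDER.desc);
      q.addSortField("freshness", SolrQuery.ORDER.desc);
      // "group by pub_year" -> Solr result grouping / field collapsing
      q.set("group", "true");
      q.set("group.field", "pub_year");
      return q;
    }
  }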


Re: Chinese Phonetic search

2012-02-07 Thread Li Li
You can convert Chinese words to pinyin and use n-grams to search for
phonetically similar words.
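
A sketch of what such a field type might look like in schema.xml; the pinyin
filter class is a placeholder for whatever third-party or custom converter is
plugged in, while solr.NGramFilterFactory is stock:

  <fieldType name="text_pinyin" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- placeholder: a filter that rewrites Chinese tokens into pinyin -->
      <filter class="com.example.PinyinTokenFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="4"/>
    </analyzer>
  </fieldType>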

On Wed, Feb 8, 2012 at 11:10 AM, Floyd Wu floyd...@gmail.com wrote:

 Hi there,

 Does anyone here ever implemented phonetic search especially with
 Chinese(traditional/simplified) using SOLR or Lucene?

 Please share some thought or point me a possible solution. (hint me search
 keywords)

 I've searched and read lot of related articles but have no luck.

 Many thanks.

 Floyd


