solr index size
Hi, We built a Solr index on a set of documents a few times. Each time, we did an optimize to reduce the index to a single segment. The index sizes are slightly different across different runs. Even though the documents are not inserted in the same order across runs, it seems to me that the final optimized index should be identical. Running CheckIndex showed that the number of docs and fields is the same, but the number of terms is slightly different. Does anyone know how to explain this? Thanks, Jun IBM Almaden Research Center K55/B1, 650 Harry Road, San Jose, CA 95120-6099 jun...@almaden.ibm.com
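For anyone wanting to run the same comparison programmatically instead of reading CheckIndex's console output, here is a minimal sketch using Lucene's CheckIndex API (the index path is hypothetical; API as in the Lucene 2.9/3.x line that Solr of this era ships):

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class IndexStats {
    public static void main(String[] args) throws Exception {
        // Open the on-disk index and run the same diagnostics as the
        // command-line CheckIndex tool, but capture the result object.
        FSDirectory dir = FSDirectory.open(new File("/path/to/solr/data/index"));
        CheckIndex checker = new CheckIndex(dir);
        CheckIndex.Status status = checker.checkIndex();
        System.out.println("clean=" + status.clean + " segments=" + status.numSegments);
        dir.close();
    }
}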
Index Version Number
Is it possible for a Solr client to determine if the index has changed since the last time it performed a query? For example, is it possible to query the current Lucene indexVersion? Thanks in advance for your help, Richard
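One approach, assuming the Luke request handler is available at /admin/luke (it is in the example solrconfig.xml): its response includes the Lucene index version, which a client can record and compare on the next call. A SolrJ sketch, with the URL hypothetical:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;

public class IndexVersionCheck {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        // The Luke response carries index-level stats, including the
        // Lucene index version; poll it and compare with the last value.
        LukeResponse rsp = new LukeRequest().process(server);
        Object version = rsp.getIndexInfo().get("version");
        System.out.println("index version: " + version);
    }
}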
Re: Solr index
Hi, Solr doesn't include such functionality. But Nutch has: [o...@localhost src]$ ff \*Signature\*java ./test/org/apache/nutch/crawl/TestSignatureFactory.java ./java/org/apache/nutch/crawl/SignatureFactory.java ./java/org/apache/nutch/crawl/MD5Signature.java ./java/org/apache/nutch/crawl/Signature.java ./java/org/apache/nutch/crawl/TextProfileSignature.java ./java/org/apache/nutch/crawl/SignatureComparator.java Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: aidahaj > To: solr-user@lucene.apache.org > Sent: Friday, April 24, 2009 12:13:47 PM > Subject: Solr index > > > Hi , > I'm using Nutch to crawl a list of web sites. > Solr is my index(Nutch-1.0 integration with solr). > I'm working on detecting web site defacement(if there's any changes in the > text of a web page). > I want to know if solr may give me the possibility to detect the changes in > the Documents in the indexe before commiting or a log file or something like > that(the text that has been changed between two points of time ). > I'm looking for your help. Thanks a lot. > -- > View this message in context: > http://www.nabble.com/Solr-index-tp23219842p23219842.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr index
Solr 1.4 (trunk) has a similar functionality. http://wiki.apache.org/solr/Deduplication On Fri, Apr 24, 2009 at 9:53 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > > Hi, > > Solr doesn't include such functionality. But Nutch has: > > [o...@localhost src]$ ff \*Signature\*java > ./test/org/apache/nutch/crawl/TestSignatureFactory.java > ./java/org/apache/nutch/crawl/SignatureFactory.java > ./java/org/apache/nutch/crawl/MD5Signature.java > ./java/org/apache/nutch/crawl/Signature.java > ./java/org/apache/nutch/crawl/TextProfileSignature.java > ./java/org/apache/nutch/crawl/SignatureComparator.java > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message > > From: aidahaj > > To: solr-user@lucene.apache.org > > Sent: Friday, April 24, 2009 12:13:47 PM > > Subject: Solr index > > > > > > Hi , > > I'm using Nutch to crawl a list of web sites. > > Solr is my index(Nutch-1.0 integration with solr). > > I'm working on detecting web site defacement(if there's any changes in > the > > text of a web page). > > I want to know if solr may give me the possibility to detect the changes > in > > the Documents in the indexe before commiting or a log file or something > like > > that(the text that has been changed between two points of time ). > > I'm looking for your help. Thanks a lot. > > -- > > View this message in context: > > http://www.nabble.com/Solr-index-tp23219842p23219842.html > > Sent from the Solr - User mailing list archive at Nabble.com. > > -- Regards, Shalin Shekhar Mangar.
Re: Solr index
Thanks a lot, I have had a look at these classes. But what I exactly want to do is to detect whether a Document (in the Solr index) has changed when I recrawl a site with Nutch. Not to block deduplication, but to detect whether a Document has changed and extract the changes into a file without writing them over the old Document. After that I decide whether to rewrite the Document or to keep both of them, the old and the new one. I hope I have been more precise. Thanks, and pardon my poor English. -- View this message in context: http://www.nabble.com/Solr-index-tp23219842p23254601.html Sent from the Solr - User mailing list archive at Nabble.com.
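One lightweight way to get this without Solr-side support: store a content hash alongside each document at index time, and on recrawl compare the new page's hash against the stored one before deciding whether to overwrite. A sketch of the idea (not Nutch's own Signature code; the class name is made up):

import java.math.BigInteger;
import java.security.MessageDigest;

public class ContentSignature {
    // Hash the extracted page text; if the hash differs from the one
    // stored in the index for this URL, the page has changed.
    public static String md5(String pageText) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(pageText.getBytes("UTF-8"));
        return new BigInteger(1, digest).toString(16);
    }
}

Querying the stored hash by URL, comparing, and writing the old version out to a log before re-adding the document would give the "what changed between two points in time" trail without blocking deduplication.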
Multi-index Design
Hi All, I'm [still!] evaluating Solr and setting up a PoC. The requirements are to index the following objects: - people - name, status, date added, address, profile, other people-specific fields like group... - organisations - name, status, date added, address, profile, other organisation-specific fields like size... - products - name, status, date added, profile, other product-specific fields like product groups.. AND...I need to isolate indexes to a number of dynamic domains (customerA, customerB...) that will grow over time. So, my initial thoughts are to do the following: - flatten the searchable objects as much as I can - use a type field to distinguish - into a single index - use a multi-core approach to segregate domains of data So, a couple of questions on this: 1) Is this approach/design sensible and do others use it? 2) By flattening the data we will only index common fields; is it unreasonable to do a second database search and union the results when doing advanced searches on non-indexed fields? Do others do this? 3) I've read that I can dynamically add a new core - this fits well with the ability to dynamically add new domains; how scalable is this approach? Would it be unreasonable to have 20-30 dynamically created cores? I guess, redundancy aside and given our one-core-per-domain approach, we could easily spill onto other physical servers without the need for replication? Thanks again for your help! rotis
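On question 3: cores can be created on the fly through the CoreAdmin handler, so one core per customer domain is workable as long as each core's instance directory (with its conf files) already exists on disk. A hedged SolrJ sketch, with names and paths made up:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class NewCustomerCore {
    public static void main(String[] args) throws Exception {
        // CoreAdmin requests target the container root URL, not a core.
        SolrServer admin = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // The instanceDir must already contain conf/solrconfig.xml and
        // conf/schema.xml before the core is created.
        CoreAdminRequest.createCore("customerC", "/opt/solr/customerC", admin);
    }
}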
Splitting the index
Hi all, Do you have any idea how to split the index into different data directories? If so, kindly let me know please.. Thanks & regards Prabhu.K -- View this message in context: http://www.nabble.com/Splitting-the-index-tp23613882p23613882.html Sent from the Solr - User mailing list archive at Nabble.com.
Index size concerns
Salaam, We are using apache-solr to index our files for faster searches, and everything works without a problem; my only concern is the size of the index. The trend seems to be that if I index 1 GB of files the index grows to 800MB, i.e. we are seeing an 80% index-to-data size ratio. Is this normal, or am I missing something in the configuration of Solr? Thanks and regards, Muhammed Sameer
Locked Index files
Hi, my Solr index files are locked and I can’t index anything. How can I remove the lock file? I can’t delete it.
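If the lock file is left over from a crashed or killed writer (rather than held by a running Solr), two common options of this era are setting <unlockOnStartup>true</unlockOnStartup> in the mainIndex section of solrconfig.xml, or releasing it with Lucene directly while Solr is stopped. A cautious sketch against the Lucene 2.9/3.x API; the index path is hypothetical:

import java.io.File;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class Unlock {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File("/path/to/solr/data/index"));
        // Only do this when no IndexWriter can still be alive, i.e.
        // Solr is shut down; otherwise you risk index corruption.
        if (IndexWriter.isLocked(dir)) {
            IndexWriter.unlock(dir);
        }
        dir.close();
    }
}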
index pdf files
I wrote a simple Java program to import a PDF file. I can get a result when I do a *:* search from the admin page, but I get nothing if I search for a word. I wonder if I did something wrong or missed setting something. Here is part of the result I get when doing a *:* search: * - - Hristovski D - application/pdf - microarray analysis, literature-based discovery, semantic predications, natural language processing - Thu Aug 12 10:58:37 EDT 2010 - Combining Semantic Relations and DNA Microarray Data for Novel Hypotheses Generation Combining Semantic Relations and DNA Microarray Data for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej Kastrin,2... * Please help me out if anyone has experience with pdf files. I really appreciate it! Thanks so much,
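A frequent cause of exactly this symptom: the text Tika extracts from the PDF lands in a field that is not the default search field, so *:* finds the document but a bare keyword does not. A hedged SolrJ sketch of posting a PDF through the extracting handler and mapping the body into a searchable field (the URL, id, and target field name "text" are assumptions to check against your schema; addFile(File) is the SolrJ 1.4-era signature):

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class IndexPdf {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        ContentStreamUpdateRequest up =
                new ContentStreamUpdateRequest("/update/extract");
        up.addFile(new File("paper.pdf"));
        up.setParam("literal.id", "paper-1");
        // Map Tika's extracted body into the "text" field so that plain
        // keyword queries hit it.
        up.setParam("fmap.content", "text");
        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        server.request(up);
    }
}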
Index time boosting
Hi there, I'm having some issues with the relevancy of some results. I have 5 fields with varying boost values, all copied into a copyField "text" which is what gets searched on... I'm sending each of these fields with the boost values (i_title is 20, i_authors is 10 ... i_description is 1). Now the issue is that some items which the query matches in the description field (which is boosted at 1) are scored higher than results where the query is matched in the authors field (which is boosted at 10). This leads me to believe that the index-time boost isn't working correctly. Have I omitted norms correctly? The score for one result which is matched by the author field is nearly the same as the score for one result which is matched by the description field, so it is doubtful the boost is kept; maybe it is normalized away in the copyField. Unfortunately I cannot switch to the dismax parser at this late stage, so I cannot do query-time boosting unless there's another way of doing it on the standard parser. thanks joe -- View this message in context: http://lucene.472066.n3.nabble.com/Index-time-boosting-tp1411105p1411105.html Sent from the Solr - User mailing list archive at Nabble.com.
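For reference, this is what per-field index-time boosts look like through SolrJ (a sketch; the values follow the post). The first thing to check: index-time boosts are encoded into the field norms, so the boosted source fields and, crucially, the "text" copyField target must not have omitNorms="true" in schema.xml, otherwise the boost is silently discarded.

import org.apache.solr.common.SolrInputDocument;

public class BoostedDoc {
    public static SolrInputDocument build() {
        SolrInputDocument doc = new SolrInputDocument();
        // Per-field index-time boosts; Lucene folds these into norms.
        doc.addField("i_title", "Some title", 20.0f);
        doc.addField("i_authors", "Some author", 10.0f);
        doc.addField("i_description", "Some description", 1.0f);
        return doc;
    }
}

Also worth knowing: norms are stored with roughly one byte of precision, so boosts get quantized, but quantization alone would not make a 10x boost vanish.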
Index with ItalianStemmer
Hi all, I am experiencing a strange behavior while indexing Italian text (an indexed, not stored text field) when stemming with the Italian language. The analyzer chain includes: <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> <filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>. If I try to index the text field with the value "mi voglio documentare su Solr e sulla sua storia" (which means "I want to study Solr and its history"), my search for "q=text:documentare" or for "q=text:documento" turns up nothing. The biggest issue is that the first one, which was intended to work both with and without stemming enabled, does not match any document. If I change the stemmer language to English and then reindex, the first of the queries above succeeds as expected because no stemming is applied. Does anyone know what could be the root cause, or if I am missing something? Thanks in advance for any help, Tommaso
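Solr's analysis page (admin/analysis.jsp) is the quickest way to see what each filter emits at index versus query time for "documentare". The same check can also be scripted against the Italian Snowball stemmer on its own; a sketch against the Lucene 2.9-era API (adjust the Version constant to your release, and in 3.1+ swap TermAttribute for CharTermAttribute):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class StemCheck {
    public static void main(String[] args) throws Exception {
        SnowballAnalyzer analyzer = new SnowballAnalyzer(Version.LUCENE_29, "Italian");
        TokenStream ts = analyzer.tokenStream("text",
                new StringReader("mi voglio documentare su Solr e sulla sua storia"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        // Print each stemmed token; run the same check on the query terms
        // to confirm both sides reduce to the same stems.
        while (ts.incrementToken()) {
            System.out.println(term.term());
        }
    }
}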
Index update issue
Dear All, I use Solr in Rails. I added a new item, and the index count update took a long time (one hour). For example, the count is now "97"; when I add a new item, it only becomes "98" after one hour. I checked all of the Solr config files, but I couldn't find the setting responsible for that. I commented out the autoCommit values (the 1 and 1000) in solrconfig.xml. But... Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Index-update-issue-tp1487956p1487956.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Optimize Index
On 11/4/2010 7:22 AM, stockiii wrote: how can i start an optimize by using DIH, but NOT after an delta- or full-import ? I'm not aware of a way to do this with DIH, though there might be something I'm not aware of. You can do it with an HTTP POST. Here's how to do it with curl: /usr/bin/curl "http://HOST:PORT/solr/CORE/update" \ -H "Content-Type: text/xml" \ --data-binary '<optimize/>' Shawn
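The same optimize through SolrJ, for clients already using it (URL and core name hypothetical):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OptimizeCore {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/core0");
        server.optimize();  // blocks until the optimize finishes by default
    }
}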
Re: Optimize Index
For what it's worth, the Solr class instructor at the Lucene Revolution conference recommended *against* optimizing, and instead suggested to just let the merge factor do its job. On Thu, Nov 4, 2010 at 2:55 PM, Shawn Heisey wrote: > On 11/4/2010 7:22 AM, stockiii wrote: > >> how can i start an optimize by using DIH, but NOT after an delta- or >> full-import ? >> > > I'm not aware of a way to do this with DIH, though there might be something > I'm not aware of. You can do it with an HTTP POST. Here's how to do it > with curl: > > /usr/bin/curl "http://HOST:PORT/solr/CORE/update" \ > -H "Content-Type: text/xml" \ > --data-binary '<optimize/>' > > Shawn > >
Re: Optimize Index
Huh? That's something new for me. Optimize removes documents that have been flagged for deletion. For relevancy it's important those are removed because document frequencies are not updated for deletes. Did I miss something? > For what it's worth, the Solr class instructor at the Lucene Revolution > conference recommended *against* optimizing, and instead suggested to just > let the merge factor do its job. > > On Thu, Nov 4, 2010 at 2:55 PM, Shawn Heisey wrote: > > On 11/4/2010 7:22 AM, stockiii wrote: > >> how can i start an optimize by using DIH, but NOT after an delta- or > >> full-import ? > > > > I'm not aware of a way to do this with DIH, though there might be > > something I'm not aware of. You can do it with an HTTP POST. Here's > > how to do it with curl: > > > > /usr/bin/curl "http://HOST:PORT/solr/CORE/update" \ > > -H "Content-Type: text/xml" \ > > --data-binary '<optimize/>' > > > > Shawn
Re: Optimize Index
What you can try is maxSegments=2 or more as a 'partial' optimize: "If the index is so large that optimizes are taking longer than desired or using more disk space during optimization than you can spare, consider adding the maxSegments parameter to the optimize command. In the XML message, this would be an attribute; the URL form and SolrJ have the corresponding option too. By default this parameter is 1, since an optimize results in a single Lucene "segment". By setting it larger than 1 but less than the mergeFactor, you permit partial optimization to no more than this many segments. Of course the index won't be fully optimized and therefore searches will be slower." from http://wiki.apache.org/solr/PacktBook2009 (I only found that link; there must be something on the real wiki for the maxSegments parameter ...) Hello. My index has ~30 million documents and an optimize=true is very heavy. It takes a long time ... how can I start an optimize by using DIH, but NOT after a delta- or full-import? I set my index to compound-index. thx -- http://jetwick.com twitter search prototype
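For SolrJ users, the quoted passage's "corresponding option" is the maxSegments argument to optimize; a sketch, assuming a SolrJ release that exposes the three-argument overload:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PartialOptimize {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // waitFlush=true, waitSearcher=true, maxSegments=2: merge down to
        // at most 2 segments instead of a full single-segment optimize.
        server.optimize(true, true, 2);
    }
}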
Re: Optimize Index
No, you didn't miss anything. The comment at Lucene Revolution was more along the lines that optimize didn't actually improve much *absent* deletes. Plus, on a significant size corpus, the doc frequencies won't change that much by deleting documents, but that's a case-by-case thing. Best Erick On Thu, Nov 4, 2010 at 4:31 PM, Markus Jelsma wrote: > Huh? That's something new for me. Optimize removes documents that have been > flagged for deletion. For relevancy it's important those are removed > because > document frequencies are not updated for deletes. > > Did I miss something? > > > For what it's worth, the Solr class instructor at the Lucene Revolution > > conference recommended *against* optimizing, and instead suggested to > just > > let the merge factor do its job. > > > > On Thu, Nov 4, 2010 at 2:55 PM, Shawn Heisey wrote: > > > On 11/4/2010 7:22 AM, stockiii wrote: > > >> how can i start an optimize by using DIH, but NOT after an delta- or > > >> full-import ? > > > > > > I'm not aware of a way to do this with DIH, though there might be > > > something I'm not aware of. You can do it with an HTTP POST. Here's > > > how to do it with curl: > > > > > > /usr/bin/curl "http://HOST:PORT/solr/CORE/update" \ > > > -H "Content-Type: text/xml" \ > > > --data-binary '<optimize/>' > > > > > > Shawn >
Export Index Data.
Hi, Is it possible to export a set of documents indexed in one Solr server, to do a synchronization with another Solr server? Thanks
only index synonyms
Hi, Can the following use case be achieved? Value to be analysed at index time: "this is a pretty line of text". Synonym list is: pretty => scenic, text => words. Value placed in the index is "scenic words". That is to say, only the matching synonyms. Basically I want to produce a normalised set of phrases for faceting. Cheers Lee C
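There is no stock "keep only the synonyms" option, but a small custom TokenFilter can do it. The sketch below assumes the synonym filter marks injected tokens with the type "SYNONYM" - true of Lucene's later synonym filters, but worth verifying for the SynonymFilterFactory version in use (some older versions copy the original token's type instead):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public final class KeepOnlySynonymsFilter extends TokenFilter {
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    public KeepOnlySynonymsFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Pass through only the tokens the synonym filter injected;
        // everything else is silently dropped.
        while (input.incrementToken()) {
            if ("SYNONYM".equals(typeAtt.type())) {
                return true;
            }
        }
        return false;
    }
}

Placed after the synonym filter in an index-time analyzer chain (via a small factory), this would leave only "scenic words" in the index for the example above.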
Optimize a Index
Hello, I have an index with 800,000 documents, and I hope it will be faster if I optimize the index; it sounds good ;-) But I can't find an example of how to optimize one of my multiple cores, or all cores.. Maybe one of you has a little example for that .. King
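There is no single call that optimizes every core, but looping over the cores does it; a SolrJ sketch with hypothetical core names and URL:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OptimizeAllCores {
    public static void main(String[] args) throws Exception {
        // List the core names from your solr.xml here.
        String[] cores = {"core0", "core1"};
        for (String core : cores) {
            SolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr/" + core);
            server.optimize();  // one optimize per core
        }
    }
}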
SCHEMA-INDEX-MISMATCH
Hi all, I use Lucene's NumericField to index the "price" field, and query it with solr.TrieDoubleField. When I use "price:[1 TO 5000]" to search, it returns all results whose price is between 1 and 5000, but the stored price value returned is: ERROR:SCHEMA-INDEX-MISMATCH,stringValue=2000.0. Anybody know why? -- View this message in context: http://old.nabble.com/SCHEMA-INDEX-MISMATCH-tp26897605p26897605.html Sent from the Solr - User mailing list archive at Nabble.com.
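A hedged guess at the cause: Solr's TrieDoubleField reads stored values back in its own format, so a stored value written directly with Lucene's NumericField can satisfy range queries (the indexed trie terms line up) while the stored value fails to decode, which is what the SCHEMA-INDEX-MISMATCH marker signals. Writing through Solr lets the field type create both the indexed terms and the stored value consistently; a minimal SolrJ sketch (URL and id hypothetical):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AddPrice {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "item-1");
        // Hand Solr a Double and let the TrieDoubleField build both the
        // indexed trie terms and the stored value, so they always agree.
        doc.addField("price", 2000.0);
        server.add(doc);
        server.commit();
    }
}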
Re: Corrupted Index
what version of solr are you running? On Jan 7, 2010, at 3:08 PM, Jake Brownell wrote: Hi all, Our application uses solrj to communicate with our solr servers. We started a fresh index yesterday after upping the maxFieldLength setting in solrconfig. Our task indexes content in batches and all appeared to be well until noonish today, when after 40k docs, I started seeing errors. I've placed three stack traces below, the first occurred once and was the initial error, the second occurred a few times before the third started occurring on each request. I'd really appreciate any insight into what could have caused this, a missing file and then a corrupt index. If you know we'll have to nuke the entire index and start over I'd like to know that too-oddly enough searches against the index appear to be working. Thanks! Jake #1 January 7, 2010 12:10:06 PM CST Caught error; TaskWrapper block 1 January 7, 2010 12:10:07 PM CST solr-home/core0/data/index/ _fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update January 7, 2010 12:10:07 PM CST solr-home/core0/data/index/ _fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update org.benetech.exception.WrappedException org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) org .apache.solr.client.solrj.request.AbstractUpdateRequest#process(105) org.apache.solr.client.solrj.SolrServer#commit(86) org.apache.solr.client.solrj.SolrServer#commit(75) org.bookshare.search.solr.SolrSearchServerWrapper#add(63) org.bookshare.search.solr.SolrSearchEngine#index(232) org .bookshare .service.task.SearchEngineIndexingTask#initialInstanceLoad(95) org.bookshare.service.task.SearchEngineIndexingTask#run(53) org.bookshare.service.scheduler.TaskWrapper#run(233) java.util.TimerThread#mainLoop(512) java.util.TimerThread#run(462) Caused by: solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update org.apache.solr.common.SolrException org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) org .apache.solr.client.solrj.request.AbstractUpdateRequest#process(105) org.apache.solr.client.solrj.SolrServer#commit(86) org.apache.solr.client.solrj.SolrServer#commit(75) org.bookshare.search.solr.SolrSearchServerWrapper#add(63) org.bookshare.search.solr.SolrSearchEngine#index(232) org .bookshare .service.task.SearchEngineIndexingTask#initialInstanceLoad(95) org.bookshare.service.task.SearchEngineIndexingTask#run(53) org.bookshare.service.scheduler.TaskWrapper#run(233) java.util.TimerThread#mainLoop(512) java.util.TimerThread#run(462) #2 January 7, 2010 12:10:10 PM CST Caught error; TaskWrapper block 1 January 7, 2010 12:10:10 PM CST org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but 
segmentInfo shows 2 request: /core0/update org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 request: /core0/update January 7, 2010 12:10:10 PM CST org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 request: /core0/update org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 request: /core0/update org.benetech.exception.WrappedException org.apache.solr.client.solrj.impl.CommonsHttpSo
RE: Corrupted Index
Yes, that would be helpful to include, sorry, the official 1.4. -Original Message- From: Ryan McKinley [mailto:ryan...@gmail.com] Sent: Thursday, January 07, 2010 2:15 PM To: solr-user@lucene.apache.org Subject: Re: Corrupted Index what version of solr are you running? On Jan 7, 2010, at 3:08 PM, Jake Brownell wrote: > Hi all, > > Our application uses solrj to communicate with our solr servers. We > started a fresh index yesterday after upping the maxFieldLength > setting in solrconfig. Our task indexes content in batches and all > appeared to be well until noonish today, when after 40k docs, I > started seeing errors. I've placed three stack traces below, the > first occurred once and was the initial error, the second occurred a > few times before the third started occurring on each request. I'd > really appreciate any insight into what could have caused this, a > missing file and then a corrupt index. If you know we'll have to > nuke the entire index and start over I'd like to know that too-oddly > enough searches against the index appear to be working. > > Thanks! > Jake > > #1 > > January 7, 2010 12:10:06 PM CST Caught error; TaskWrapper block 1 > January 7, 2010 12:10:07 PM CST solr-home/core0/data/index/ > _fsk_1uj.del (No such file or directory) > > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > request: /core0/update solr-home/core0/data/index/_fsk_1uj.del (No > such file or directory) > > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > request: /core0/update > January 7, 2010 12:10:07 PM CST solr-home/core0/data/index/ > _fsk_1uj.del (No such file or directory) > > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > request: /core0/update solr-home/core0/data/index/_fsk_1uj.del (No > such file or directory) > > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > request: /core0/update > org.benetech.exception.WrappedException > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) > > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) > > org > .apache.solr.client.solrj.request.AbstractUpdateRequest#process(105) >org.apache.solr.client.solrj.SolrServer#commit(86) >org.apache.solr.client.solrj.SolrServer#commit(75) >org.bookshare.search.solr.SolrSearchServerWrapper#add(63) >org.bookshare.search.solr.SolrSearchEngine#index(232) > > org > .bookshare > .service.task.SearchEngineIndexingTask#initialInstanceLoad(95) >org.bookshare.service.task.SearchEngineIndexingTask#run(53) > org.bookshare.service.scheduler.TaskWrapper#run(233) > java.util.TimerThread#mainLoop(512) >java.util.TimerThread#run(462) > Caused by: > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > request: /core0/update > org.apache.solr.common.SolrException > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) > > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) > > org > .apache.solr.client.solrj.request.AbstractUpdateRequest#process(105) > org.apache.solr.client.solrj.SolrServer#commit(86) >org.apache.solr.client.solrj.SolrServer#commit(75) >org.bookshare.search.solr.SolrSearchServerWrapper#add(63) >org.bookshare.search.solr.SolrSearchEngine#index(232) > > org > .bookshare > .service.task.SearchEngineIndexingTask#initialInstanceLoad(95) >org.bookshare.service.task.SearchEngineIndexingTask#run(53) >org.bookshare.service.scheduler.TaskWrapper#run(233) 
>java.util.TimerThread#mainLoop(512) >java.util.TimerThread#run(462) > > #2 > > January 7, 2010 12:10:10 PM CST Caught error; TaskWrapper block 1 > January 7, 2010 12:10:10 PM CST > org.apache.lucene.index.CorruptIndexException: doc counts differ for > segment _hug: fieldsReader shows 8 but segmentInfo shows 2 > > org.apache.lucene.index.CorruptIndexException: doc counts differ for > segment _hug: fieldsReader shows 8 but segmentInfo shows 2 > > request: /core0/update > org.apache.lucene.index.CorruptIndexException: doc counts differ for > segment _hug: fieldsReader shows 8 but segmentInfo shows 2 > > org.apache.lucene.index.CorruptIndexException: doc counts differ for > segment _hug: fieldsReader shows 8 but segmentInfo shows 2 > > request: /core0/upda
Re: Corrupted Index
If you need to fix the index and maybe lose some data (in bad segments), check Lucene's CheckIndex (cmd-line app) Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Jake Brownell > To: "solr-user@lucene.apache.org" > Sent: Thu, January 7, 2010 3:08:55 PM > Subject: Corrupted Index > > Hi all, > > Our application uses solrj to communicate with our solr servers. We started a > fresh index yesterday after upping the maxFieldLength setting in solrconfig. > Our > task indexes content in batches and all appeared to be well until noonish > today, > when after 40k docs, I started seeing errors. I've placed three stack traces > below, the first occurred once and was the initial error, the second occurred > a > few times before the third started occurring on each request. I'd really > appreciate any insight into what could have caused this, a missing file and > then > a corrupt index. If you know we'll have to nuke the entire index and start > over > I'd like to know that too-oddly enough searches against the index appear to > be > working. > > Thanks! > Jake > > #1 > > January 7, 2010 12:10:06 PM CST Caught error; TaskWrapper block 1 > January 7, 2010 12:10:07 PM CST solr-home/core0/data/index/_fsk_1uj.del (No > such > file or directory) > > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > request: /core0/update solr-home/core0/data/index/_fsk_1uj.del (No such file > or > directory) > > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > request: /core0/update > January 7, 2010 12:10:07 PM CST solr-home/core0/data/index/_fsk_1uj.del (No > such > file or directory) > > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > request: /core0/update solr-home/core0/data/index/_fsk_1uj.del (No such file > or > directory) > > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > request: /core0/update > org.benetech.exception.WrappedException > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) > > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) > > org.apache.solr.client.solrj.request.AbstractUpdateRequest#process(105) > org.apache.solr.client.solrj.SolrServer#commit(86) > org.apache.solr.client.solrj.SolrServer#commit(75) > org.bookshare.search.solr.SolrSearchServerWrapper#add(63) > org.bookshare.search.solr.SolrSearchEngine#index(232) > > org.bookshare.service.task.SearchEngineIndexingTask#initialInstanceLoad(95) > org.bookshare.service.task.SearchEngineIndexingTask#run(53) > org.bookshare.service.scheduler.TaskWrapper#run(233) > java.util.TimerThread#mainLoop(512) > java.util.TimerThread#run(462) > Caused by: > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) > > request: /core0/update > org.apache.solr.common.SolrException > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) > > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) > > org.apache.solr.client.solrj.request.AbstractUpdateRequest#process(105) > org.apache.solr.client.solrj.SolrServer#commit(86) > org.apache.solr.client.solrj.SolrServer#commit(75) > org.bookshare.search.solr.SolrSearchServerWrapper#add(63) > org.bookshare.search.solr.SolrSearchEngine#index(232) > > org.bookshare.service.task.SearchEngineIndexingTask#initialInstanceLoad(95) > org.bookshare.service.task.SearchEngineIndexingTask#run(53) > org.bookshare.service.scheduler.TaskWrapper#run(233) > 
java.util.TimerThread#mainLoop(512) > java.util.TimerThread#run(462) > > #2 > > January 7, 2010 12:10:10 PM CST Caught error; TaskWrapper block 1 > January 7, 2010 12:10:10 PM CST > org.apache.lucene.index.CorruptIndexException: > doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo > shows 2 > > org.apache.lucene.index.CorruptIndexException: doc counts differ for segment > _hug: fieldsReader shows 8 but segmentInfo shows 2 > > request: /core0/update org.apache.lucene.index.CorruptIndexException: doc > counts > differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 > > org.apache.lucene.index.CorruptIndexException: doc counts differ for segment > _hug: fieldsReader shows 8
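Following up on Otis's pointer: CheckIndex can also be driven from code, and its fixIndex() drops the segments it cannot repair, so copy the index directory aside first. A sketch against the Lucene 2.9-era API, with the path taken from the traces above:

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class FixIndex {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File("solr-home/core0/data/index"));
        CheckIndex checker = new CheckIndex(dir);
        CheckIndex.Status status = checker.checkIndex();
        if (!status.clean) {
            // Permanently discards unrecoverable segments - back up the
            // index directory before running this.
            checker.fixIndex(status);
        }
        dir.close();
    }
}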
update solr index
Hi, I am running Solr in Tomcat and I have about 35 indexes (between 2 and 80 million documents each). Currently, if I try to update a few documents in an index (say the one which contains 80 million documents) while Tomcat is running and therefore receiving requests, I get a few very long garbage collections (about 60 sec). I am running Tomcat with -Xms10g -Xmx10g -Xmn2g -XX:PermSize=256m -XX:MaxPermSize=256m. I'm using ConcMarkSweepGC. I have 2 questions: 1. Is Solr doing something specific while an index is being updated, like updating something in memory, which would cause the garbage collection? 2. Any idea how I could solve this problem? Currently I stop Tomcat, update the index, and start Tomcat. I would like to be able to update my index while Tomcat is running. I was thinking about running more Tomcat instances with less memory for each, each running a few of my indexes. Do you think that would be the best way to go? Thanks, Marc
ERROR:SCHEMA-INDEX-MISMATCH
Hi, I upgraded Solr v1.3 to v1.4, but in the new version I still use the old index. I changed the new schema to use the old fields also. I have fields in my schema - but after upgrading, when I am searching, I get results like this - * ERROR:SCHEMA-INDEX-MISMATCH,stringValue=4194304 ERROR:SCHEMA-INDEX-MISMATCH,stringValue=0 ERROR:SCHEMA-INDEX-MISMATCH,stringValue=4 ERROR:SCHEMA-INDEX-MISMATCH,stringValue=78 ERROR:SCHEMA-INDEX-MISMATCH,stringValue=5 ERROR:SCHEMA-INDEX-MISMATCH,stringValue=228 * Please help me understand why this error is coming up, how I can solve this problem, and how I can use the old index in Solr v1.4. -- DEEPAK AGRAWAL +91-9379433455 GOOD LUCK.
Re: Index size
It depends on many factors - how big those docs are (compare a tweet to a news article to a book chapter) whether you store the data or just index it, whether you compress it, how and how much you analyze the data, etc. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Message > From: Jean-Sebastien Vachon > To: solr-user@lucene.apache.org > Sent: Wed, February 24, 2010 8:57:21 AM > Subject: Index size > > Hi All, > > I'm currently looking on integrating Solr and I'd like to have some hints on > the > size of the index (number of documents) I could possibly host on a server > running a Double-Quad server (16 cores) with 48Gb of RAM running Linux. > Basically, I need to determine how many of these servers would be required to > host about half a billion documents. Should I setup multiple Solr instances > (in > Virtual Machines or not) or should I run a single instance (with multicores > or > not) using all available memory as the cache ? > > I also made some tests with shardings on this same server and I could not see > any improvement (at least not with 4.5 millions documents). Should all the > shards be hosted on different servers? I shall try with more documents in the > following days. > > Thx
Re: Index size
Hi, Each document can be up to 10K. Most of it comes from a single field which is both indexed and stored. The data is uncompressed because compression would eat up too much CPU considering the volume we have. We have around 30 fields in all. We also need to compute some facets as well as collapse the documents forming the result set and to be able to sort them on any field. Thx On 2010-02-25, at 5:50 PM, Otis Gospodnetic wrote: > It depends on many factors - how big those docs are (compare a tweet to a > news article to a book chapter) whether you store the data or just index it, > whether you compress it, how and how much you analyze the data, etc. > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Hadoop ecosystem search :: http://search-hadoop.com/ > > > > - Original Message >> From: Jean-Sebastien Vachon >> To: solr-user@lucene.apache.org >> Sent: Wed, February 24, 2010 8:57:21 AM >> Subject: Index size >> >> Hi All, >> >> I'm currently looking on integrating Solr and I'd like to have some hints on >> the >> size of the index (number of documents) I could possibly host on a server >> running a Double-Quad server (16 cores) with 48Gb of RAM running Linux. >> Basically, I need to determine how many of these servers would be required >> to >> host about half a billion documents. Should I setup multiple Solr instances >> (in >> Virtual Machines or not) or should I run a single instance (with multicores >> or >> not) using all available memory as the cache ? >> >> I also made some tests with shardings on this same server and I could not >> see >> any improvement (at least not with 4.5 millions documents). Should all the >> shards be hosted on different servers? I shall try with more documents in >> the >> following days. >> >> Thx >
Re: Optimize Index
My very first guess would be that you're removing an index that isn't the one your SOLR configuration points at. Second guess would be that your browser is caching the results of your first query and not going to SOLR at all. Stranger things have happened . Third guess is you've mis-identified the core in your URL. Can you check those three things and let us know if you still have the problem? Erick On Tue, Mar 2, 2010 at 7:36 AM, Lee Smith wrote: > Hi All > > Is there a post request method to clean the index? > > I have removed my index folder and restarted solr and its still showing > documents in the stats. > > I have run this post request: > http://localhost:8983/solr/core1/update?optimize=true > > I get no errors but the stats are still show my 4 documents > > Hope you can advise. > > Thanks
Re: Optimize Index
Ha, now I feel stupid!! I had a misspelling in the data path and you were correct. Can I ask, Erick, was the command correct though? Thank you Lee On 2 Mar 2010, at 13:54, Erick Erickson wrote: > My very first guess would be that you're removing an index that isn't > the one your SOLR configuration points at. > > Second guess would be that your browser is caching the results of > your first query and not going to SOLR at all. Stranger things have > happened . > > Third guess is you've mis-identified the core in your URL. > > Can you check those three things and let us know if you still > have the problem? > > Erick > > On Tue, Mar 2, 2010 at 7:36 AM, Lee Smith wrote: > >> Hi All >> >> Is there a post request method to clean the index? >> >> I have removed my index folder and restarted solr and its still showing >> documents in the stats. >> >> I have run this post request: >> http://localhost:8983/solr/core1/update?optimize=true >> >> I get no errors but the stats are still show my 4 documents >> >> Hope you can advise. >> >> Thanks
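For the record, optimize=true merges segments but does not remove documents; clearing an index is a delete-by-query followed by a commit. A SolrJ sketch (core URL follows the thread):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CleanIndex {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/core1");
        server.deleteByQuery("*:*");  // flag every document as deleted
        server.commit();              // make the deletion visible
    }
}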
Fwd: index merge
Hi, I have created 2 identical cores coreX and coreY (both have different dataDir values, but their index is the same). coreX - always serves the request when a user performs a search. coreY - the updates will happen to this core and then I need to synchronize it with coreX after the update process, so that coreX also has the latest data in it. After coreX and coreY are synchronized, both should be identical again. For this purpose I tried core merging of coreX and coreY once coreY is updated with the latest set of data. But I find coreX to be containing double the record count of coreY. (coreX = coreX+coreY) Is there a problem in using the MERGE concept here? If it is wrong, can someone please suggest the best approach. I tried the various merges explained in my previous mail. Any help is deeply appreciated. Thanks and Rgds, Mark. -- Forwarded message -- From: Mark Fletcher Date: Sat, Mar 6, 2010 at 9:17 AM Subject: index merge To: solr-user@lucene.apache.org Cc: goks...@gmail.com Hi, I have a doubt regarding index merging:- I have set up 2 cores COREX and COREY. COREX - always serves user requests COREY - gets updated with the latest values (dataDir is in a different location from COREX) I tried merging coreX and coreY at the end of COREY getting updated with the latest data values, so that COREX and COREY both have the latest data and the user who always queries COREX gets the latest data. Please find the various approaches I followed and the commands used. I tried these merges:- COREX = COREX and COREY merged curl ' http://localhost:8983/solr/admin/cores?action=mergeindexes&core=coreX&indexDir=/opt/solr/coreX/data/index&indexDir=/opt1/solr/coreY/data/index ' COREX = COREY and COREY merged curl ' http://localhost:8983/solr/admin/cores?action=mergeindexes&core=coreX&indexDir=/opt/solr/coreY/data/index&indexDir=/opt1/solr/coreY/data/index ' COREX = COREY and COREA merged (COREA just contains the initial 2 seed segments.. a dummy core) curl ' http://localhost:8983/solr/admin/cores?action=mergeindexes&core=coreX&indexDir=/opt/solr/coreY/data/index&indexDir=/opt1/solr/coreA/data/index ' When I check the record count in COREX and COREY, COREX always contains about double of what COREY has. Is everything fine here and just the record count is different, or is there something wrong? Note:- I have only 2 cores here and I tried the X=X+Y approach, X=Y+Y and X=Y+A approach, where A is a dummy index. Never have the record counts matched after the merging is done. Can someone please help me understand why this record count difference occurs, and whether there is anything fundamentally wrong in my approach. Thanks and Rgds, Mark.
Re: index merge
Hi Mark, On Sun, Mar 7, 2010 at 6:20 PM, Mark Fletcher wrote: > > I have created 2 identical cores coreX and coreY (both have different > dataDir values, but their index is same). > coreX - always serves the request when a user performs a search. > coreY - the updates will happen to this core and then I need to synchronize > it with coreX after the update process, so that coreX also has the > latest data in it. After coreX and coreY are synchronized, both > should again be identical again. > > For this purpose I tried core merging of coreX and coreY once coreY is > updated with the latest set of data. But I find coreX to be containing > double the record count as in coreY. > (coreX = coreX+coreY) > > Is there a problem in using MERGE concept here. If it is wrong can some one > pls suggest the best approach. I tried the various merges explained in my > previous mail. > > Index merge happens at the Lucene level which has no idea about uniqueKeys. Therefore when you merge two indexes containing exactly the same documents (by uniqueKey), you get double the document count. Looking at your scenario, it seems to me that what you want to do is a swap operation. coreX is serving the requests, coreY is updated and now you can swap coreX with coreY so that new requests hit the updated index. I suggest you look at the swap operation instead of index merge. -- Regards, Shalin Shekhar Mangar.
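The swap itself can be issued over HTTP (admin/cores?action=SWAP&core=coreX&other=coreY) or through SolrJ; a sketch, with the admin URL hypothetical:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

public class SwapCores {
    public static void main(String[] args) throws Exception {
        // CoreAdmin requests target the container root, not a core.
        SolrServer admin = new CommonsHttpSolrServer("http://localhost:8983/solr");
        CoreAdminRequest req = new CoreAdminRequest();
        req.setAction(CoreAdminAction.SWAP);
        req.setCoreName("coreX");
        req.setOtherCoreName("coreY");
        req.process(admin);
    }
}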
Re: index merge
Hi Shalin, Thank you for the reply. I got your point. So I understand merge will just duplicate things. I ran the SWAP command. Now:- COREX has the dataDir pointing to the updated dataDir of COREY. So COREX has the latest. Again, COREY (on which the update regularly runs) is pointing to the old index of COREX. So this now doesn't have the most updated index. Now shouldn't I update the index of COREY (now pointing to the old COREX) so that it has the latest footprint as in COREX (having the latest COREY index), so that when the update again happens to COREY, it has the latest and I again do the SWAP? Is a physical copying of the index named COREY (the latest and now dataDir of COREX after SWAP) to the index COREX (now the dataDir of COREY.. the original non-updated index of COREX) the best way for this, or is there any other better option? Once again, later when COREY is again updated with the latest, I will run the SWAP again and it will be fine with COREX again pointing to its original dataDir (now the updated one). So every even SWAP command run will point COREX back to its original dataDir. (same case with COREY). My only concern is after the SWAP is done, updating the old index (which was serving previously and now replaced by the new index). What is the best way to do that? Physically copy the latest index to the old one and make it in sync with the latest one, so that by the time it is to get the latest updates it has the latest in it, so that the new ones can be added to this and it becomes the latest and is again swapped? Please share your opinion. Once again your help is appreciated. I have been going in circles with multiple indexes for some days! Thanks and Rgds, Mark. On Mon, Mar 8, 2010 at 7:45 AM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > Hi Mark, > > On Sun, Mar 7, 2010 at 6:20 PM, Mark Fletcher > wrote: > > > > > I have created 2 identical cores coreX and coreY (both have different > > dataDir values, but their index is same). > > coreX - always serves the request when a user performs a search. > > coreY - the updates will happen to this core and then I need to > synchronize > > it with coreX after the update process, so that coreX also has the > > latest data in it. After coreX and coreY are synchronized, > both > > should again be identical again. > > > > For this purpose I tried core merging of coreX and coreY once coreY is > > updated with the latest set of data. But I find coreX to be containing > > double the record count as in coreY. > > (coreX = coreX+coreY) > > > > Is there a problem in using MERGE concept here. If it is wrong can some > one > > pls suggest the best approach. I tried the various merges explained in my > > previous mail. > > > > > Index merge happens at the Lucene level which has no idea about uniqueKeys. > Therefore when you merge two indexes containing exactly the same documents > (by uniqueKey), you get double the document count. > > Looking at your scenario, it seems to me that what you want to do is a swap > operation. coreX is serving the requests, coreY is updated and now you can > swap coreX with coreY so that new requests hit the updated index. I suggest > you look at the swap operation instead of index merge. > > -- > Regards, > Shalin Shekhar Mangar. >
Re: index merge
Hi Mark, On Mon, Mar 8, 2010 at 7:38 PM, Mark Fletcher wrote: > > I ran the SWAP command. Now:- > COREX has the dataDir pointing to the updated dataDir of COREY. So COREX > has the latest. > Again, COREY (on which the update regularly runs) is pointing to the old > index of COREX. So this now doesnt have the most updated index. > > Now shouldn't I update the index of COREY (now pointing to the old COREX) > so that it has the latest footprint as in COREX (having the latest COREY > index)so that when the update again happens to COREY, it has the latest and > I again do the SWAP. > > Is a physical copying of the index named COREY (the latest and now datDir > of COREX after SWAP) to the index COREX (now the dataDir of COREY.. the > orginal non-updated index of COREX) the best way for this or is there any > other better option. > > Once again, later when COREY is again updated with the latest, I will run > the SWAP again and it will be fine with COREX again pointing to its original > dataDir (now the updated one).So every even SWAP command run will point > COREX back to its original dataDir. (same case with COREY). > > My only concern is after the SWAP is done, updating the old index (which > was serving previously and now replaced by the new index). What is the best > way to do that? Physically copy the latest index to the old one and make it > in sync with the latest one so that by the time it is to get the latest > updates it has the latest in it so that the new ones can be added to this > and it becomes the latest and is again swapped? > Perhaps it is best if we take a step back and understand why you need two identical cores? -- Regards, Shalin Shekhar Mangar.
Re: index merge
Hi Shalin, Thank you for the mail. My main purpose of having 2 identical cores (COREX - always serves user requests; COREY - once every day, takes the updates/latest data and passes it on to COREX) is this: Suppose I have only one COREY, and suppose a request comes to COREY while the update of the latest data is happening on it. Wouldn't it degrade performance of the requests at that point of time? So I was planning to keep COREX and COREY always identical. Once COREY has the latest it should somehow sync with COREX so that COREX also now has the latest. COREY keeps on getting the updates at a particular time of day and it will again pass it on to COREX. This process continues every day. What is the best possible way to implement this? Thanks, Mark. On Mon, Mar 8, 2010 at 9:53 AM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > Hi Mark, > > On Mon, Mar 8, 2010 at 7:38 PM, Mark Fletcher < > mark.fletcher2...@gmail.com> wrote: > >> >> I ran the SWAP command. Now:- >> COREX has the dataDir pointing to the updated dataDir of COREY. So COREX >> has the latest. >> Again, COREY (on which the update regularly runs) is pointing to the old >> index of COREX. So this now doesnt have the most updated index. >> >> Now shouldn't I update the index of COREY (now pointing to the old COREX) >> so that it has the latest footprint as in COREX (having the latest COREY >> index)so that when the update again happens to COREY, it has the latest and >> I again do the SWAP. >> >> Is a physical copying of the index named COREY (the latest and now datDir >> of COREX after SWAP) to the index COREX (now the dataDir of COREY.. the >> orginal non-updated index of COREX) the best way for this or is there any >> other better option. >> >> Once again, later when COREY is again updated with the latest, I will run >> the SWAP again and it will be fine with COREX again pointing to its original >> dataDir (now the updated one).So every even SWAP command run will point >> COREX back to its original dataDir. (same case with COREY). >> >> My only concern is after the SWAP is done, updating the old index (which >> was serving previously and now replaced by the new index). What is the best >> way to do that? >> > > Perhaps it is best if we take a step back and understand why you need two > identical cores? > > -- > Regards, > Shalin Shekhar Mangar. >
Re: index merge
Hi Mark, On Mon, Mar 8, 2010 at 9:23 PM, Mark Fletcher wrote: > > My main purpose of having 2 identical cores > COREX - always serves user request > COREY - every day once, takes the updates/latest data and passess it on to > COREX. > is:- > > Suppose say I have only one COREY and suppose a request comes to COREY > while the update of the latest data is happening on to it. Wouldn't it > degrade performance of the requests at that point of time? > The thing to note is that both reads and writes are happening on the same box. So when you swap cores, the OS has to cache the hot segments of the new (inactive) index. If you were just re-opening the same (active) index, at least some of the existing files could remain in the OS's file cache. I think that may just degrade performance further so you should definitely benchmark before going through with this. The best practice is to use a master/slave architecture and separate the writes and reads. > So I was planning to keep COREX and COREY always identical. Once COREY has > the latest it should somehow sync with COREX so that COREX also now has the > latest. COREY keeps on getting the updates at a particular time of day and > it will again pass it on to COREX. This process continues everyday. > You could use the same approach that Solr 1.3's snapinstaller script used. It deletes the files and creates hard links to the new index files. -- Regards, Shalin Shekhar Mangar.
Re: index merge
On 03/08/2010 10:53 AM, Mark Fletcher wrote: Hi Shalin, Thank you for the mail. My main purpose of having 2 identical cores (COREX - always serves user requests; COREY - once every day, takes the updates/latest data and passes it on to COREX) is this: Suppose I have only one COREY, and suppose a request comes to COREY while the update of the latest data is happening on it. Wouldn't it degrade performance of the requests at that point of time? Yes - but you're not going to help anything by using two indexes - the best you can do is use two boxes. 2 indexes on the same box will actually be worse than one if they are identical and you are swapping between them. Writes on an index will not affect reads in the way you are thinking - only in that it uses IO and CPU that the read process can't. That's going to happen with 2 indexes on the same box too - except now you have way more data to cache and flip between, and you can't take any advantage of things just being written possibly being in the cache for reads. Lucene indexes use a write-once strategy - when writing new segments, you are not touching the segments being read from. Lucene is already doing the index juggling for you at the segment level. So I was planning to keep COREX and COREY always identical. Once COREY has the latest it should somehow sync with COREX so that COREX also now has the latest. COREY keeps on getting the updates at a particular time of day and it will again pass it on to COREX. This process continues every day. What is the best possible way to implement this? Thanks, Mark. On Mon, Mar 8, 2010 at 9:53 AM, Shalin Shekhar Mangar< shalinman...@gmail.com> wrote: Hi Mark, On Mon, Mar 8, 2010 at 7:38 PM, Mark Fletcher< mark.fletcher2...@gmail.com> wrote: I ran the SWAP command. Now:- COREX has the dataDir pointing to the updated dataDir of COREY. So COREX has the latest. Again, COREY (on which the update regularly runs) is pointing to the old index of COREX. So this now doesnt have the most updated index. Now shouldn't I update the index of COREY (now pointing to the old COREX) so that it has the latest footprint as in COREX (having the latest COREY index)so that when the update again happens to COREY, it has the latest and I again do the SWAP. Is a physical copying of the index named COREY (the latest and now datDir of COREX after SWAP) to the index COREX (now the dataDir of COREY.. the orginal non-updated index of COREX) the best way for this or is there any other better option. Once again, later when COREY is again updated with the latest, I will run the SWAP again and it will be fine with COREX again pointing to its original dataDir (now the updated one).So every even SWAP command run will point COREX back to its original dataDir. (same case with COREY). My only concern is after the SWAP is done, updating the old index (which was serving previously and now replaced by the new index). What is the best way to do that? Physically copy the latest index to the old one and make it in sync with the latest one so that by the time it is to get the latest updates it has the latest in it so that the new ones can be added to this and it becomes the latest and is again swapped? Perhaps it is best if we take a step back and understand why you need two identical cores? -- Regards, Shalin Shekhar Mangar. -- - Mark http://www.lucidimagination.com
Re: index merge
Hi All, Thank you for the very valuable suggestions. I am planning to try using the Master - Slave configuration. Best Rgds, Mark. On Mon, Mar 8, 2010 at 11:17 AM, Mark Miller wrote: > On 03/08/2010 10:53 AM, Mark Fletcher wrote: > >> Hi Shalin, >> >> Thank you for the mail. >> My main purpose of having 2 identical cores >> COREX - always serves user request >> COREY - every day once, takes the updates/latest data and passess it on to >> COREX. >> is:- >> >> Suppose say I have only one COREY and suppose a request comes to COREY >> while >> the update of the latest data is happening on to it. Wouldn't it degrade >> performance of the requests at that point of time? >> >> > Yes - but your not going to help anything by using two indexes - best you > can do it use two boxes. 2 indexes on the same box will actually > be worse than one if they are identical and you are swapping between them. > Writes on an index will not affect reads in the way you are thinking - only > in that its uses IO and CPU that the read process cant. Thats going to > happen with 2 indexes on the same box too - except now you have way more > data to cache and flip between, and you can't take any advantage of things > just being written possibly being in the cache for reads. > > Lucene indexes use a write once strategy - when writing new segments, you > are not touching the segments being read from. Lucene is already doing the > index juggling for you at the segment level. > > > So I was planning to keep COREX and COREY always identical. Once COREY has >> the latest it should somehow sync with COREX so that COREX also now has >> the >> latest. COREY keeps on getting the updates at a particular time of day and >> it will again pass it on to COREX. This process continues everyday. >> >> What is the best possible way to implement this? >> >> Thanks, >> >> Mark. >> >> >> On Mon, Mar 8, 2010 at 9:53 AM, Shalin Shekhar Mangar< >> shalinman...@gmail.com> wrote: >> >> >> >>> Hi Mark, >>> >>> On Mon, Mar 8, 2010 at 7:38 PM, Mark Fletcher< >>> mark.fletcher2...@gmail.com> wrote: >>> >>> >>> >>>> I ran the SWAP command. Now:- >>>> COREX has the dataDir pointing to the updated dataDir of COREY. So COREX >>>> has the latest. >>>> Again, COREY (on which the update regularly runs) is pointing to the old >>>> index of COREX. So this now doesnt have the most updated index. >>>> >>>> Now shouldn't I update the index of COREY (now pointing to the old >>>> COREX) >>>> so that it has the latest footprint as in COREX (having the latest COREY >>>> index)so that when the update again happens to COREY, it has the latest >>>> and >>>> I again do the SWAP. >>>> >>>> Is a physical copying of the index named COREY (the latest and now >>>> datDir >>>> of COREX after SWAP) to the index COREX (now the dataDir of COREY.. the >>>> orginal non-updated index of COREX) the best way for this or is there >>>> any >>>> other better option. >>>> >>>> Once again, later when COREY is again updated with the latest, I will >>>> run >>>> the SWAP again and it will be fine with COREX again pointing to its >>>> original >>>> dataDir (now the updated one).So every even SWAP command run will point >>>> COREX back to its original dataDir. (same case with COREY). >>>> >>>> My only concern is after the SWAP is done, updating the old index (which >>>> was serving previously and now replaced by the new index). What is the >>>> best >>>> way to do that? 
Physically copy the latest index to the old one and make >>>> it >>>> in sync with the latest one so that by the time it is to get the latest >>>> updates it has the latest in it so that the new ones can be added to >>>> this >>>> and it becomes the latest and is again swapped? >>>> >>>> >>>> >>> Perhaps it is best if we take a step back and understand why you need two >>> identical cores? >>> >>> -- >>> Regards, >>> Shalin Shekhar Mangar. >>> >>> >>> >> >> > > > -- > - Mark > > http://www.lucidimagination.com > > > >
Index field untokenized
Hi All, I want to index some data untokenized (e.g. url), but I can't find a way to do it. I know there is a way to do it in solr configuration but I want to specify this options directly in my solr xml. This is a fragment of the xml that i post in slr and I want to know if is possible to add to some field (e.g. modsCollection.name.xlink:href) an extra attribute in some other way the information about how to index it.// /// http://www.fao.org/faooa/schemas/eims/v0.9"; xmlns:mods="http://www.loc.gov/mods/v3"; xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"; xmlns:eims="http://www.fao.org/faooa/schemas/eims/v0.9"; xmlns:xlink="http://www.w3.org/1999/xlink"; xmlns:xalan="http://xml.apache.org/xalan"; xmlns:l="http://lang.data"; xmlns:fn="http://www.w3.org/2005/xpath-functions"; xmlns:dcterms="http://purl.org/dc/terms/"; xmlns:ags="http://www.fao.org/agris/agmes/schemas/0.1/"; xmlns:uvalibadmin="http://dl.lib.virginia.edu/bin/admin/admin.dtd/"; xmlns:uvalibdesc="http://dl.lib.virginia.edu/bin/dtd/descmeta/descmeta.dtd"; xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"; xmlns:dc="http://purl.org/dc/elements/1.1/"; xmlns:foxml="info:fedora/fedora-system:def/foxml#" xmlns:zs="http://www.loc.gov/zing/srw/";> eims-document:1960 . http://aims.fao.org/aos/v01/corporatebody/c_1962 iso639-2b /Regards, Alessandro http://www.fao.org/faooa/schemas/eims/v0.9"; xmlns:mods="http://www.loc.gov/mods/v3"; xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"; xmlns:eims="http://www.fao.org/faooa/schemas/eims/v0.9"; xmlns:xlink="http://www.w3.org/1999/xlink"; xmlns:xalan="http://xml.apache.org/xalan"; xmlns:l="http://lang.data"; xmlns:fn="http://www.w3.org/2005/xpath-functions"; xmlns:dcterms="http://purl.org/dc/terms/"; xmlns:ags="http://www.fao.org/agris/agmes/schemas/0.1/"; xmlns:uvalibadmin="http://dl.lib.virginia.edu/bin/admin/admin.dtd/"; xmlns:uvalibdesc="http://dl.lib.virginia.edu/bin/dtd/descmeta/descmeta.dtd"; xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"; xmlns:dc="http://purl.org/dc/elements/1.1/"; xmlns:foxml="info:fedora/fedora-system:def/foxml#" xmlns:zs="http://www.loc.gov/zing/srw/";> eims-document:1960 Active Note relative à la réforme de l'ONU et de la FAO 2010-03-11T13:37:44.537Z 2010-03-11T13:39:15.819Z 2 AUDREC1 Fedora API-M modifyDatastreamByValue DC fedoraAdmin 2010-03-11T13:37:44.801Z Initial Import of this Object AUDREC2 Fedora API-M addDatastream MODS fedoraAdmin 2010-03-11T13:39:09.348Z AUDREC3 Fedora API-M addDatastream AGRISFO fedoraAdmin 2010-03-11T13:39:11.931Z AUDREC4 Fedora API-M addDatastream EIMS fedoraAdmin 2010-03-11T13:39:13.434Z AUDREC5 Fedora API-M addDatastream SKOS fedoraAdmin 2010-03-11T13:39:15.819Z fr Note relative à la réforme de l'ONU et de la FAO pubid.fao.org:210159 FAO info:fedora/eims-document:1960 faooa:FRBR-EXPRESSION J8010 3.3 2006-06-29 fr Note relative à la réforme de l'ONU et de la FAO fao-aos-corporatebody corporate http://aims.fao.org/aos/v01/corporatebody/c_1962 en FAO, Rome (Italy). Fisheries and Aquaculture Dept. marcrelator text Author marcrelator text conference en FAO Committee on Fisheries. Sub-Committee on Aquaculture (Sess. 4 : 6-10 Oct 2008 : Puerto Varas, Chile) marcrelator text Author marcrelator text type Conference type type Non-conventional type iso639-2b code fra iso639-2b code text French text jn J8010 jn rn 210159 0 3 en KC 1 en Publication
Separate index files
Hello, there - How often in Solr used possibility to store index in separate files for different things, for example, products (at the one Solr instance)? The aim is maintain separate files for backup, independent re-indexing, something else(?). And in what extent useful that solutions? Thanks Sergei
Re: index merge
Hi All, I am running solr in 64 bit HP-UX system. The total index size is about 5GB and when i try load any new document, solr tries to merge the existing segments first and results in following error. I could see a temp file is growng within index dir around 2GB in size and later it fails with this exception. It looks like, by reaching Integer.MAXVALUE, the exception occurs. Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: File too large (errno:27) at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315) Caused by: java.io.IOException: File too large (errno:27) at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:456) at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexOutput.flushBuffer(SimpleFSDirectory.java:192) at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96) at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85) at org.apache.lucene.store.BufferedIndexOutput.close(BufferedIndexOutput.java:109) at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexOutput.close(SimpleFSDirectory.java:199) at org.apache.lucene.index.FieldsWriter.close(FieldsWriter.java:144) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:357) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:153) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5029) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4614) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:235) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291) --- The solrconfig.xml contains default values for , sections as below. ^M ^M false^M ^M 10^M ^M ^M ^M ^M 32^M ^M 1^M 1000^M 1^M ^M ^M ^ ^M ^M false^M 32^M 10^M ^M ^M ^M ^ Could anyone help me to resolve this exception? Regards, Uma -- View this message in context: http://lucene.472066.n3.nabble.com/index-merge-tp472904p829810.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: index merge
> I am running solr in 64 bit HP-UX system. The total > index size is about > 5GB and when i try load any new document, solr tries to > merge the existing > segments first and results in following error. I could see > a temp file is > growng within index dir around 2GB in size and later it > fails with this > exception. It looks like, by reaching Integer.MAXVALUE, the > exception > occurs. 32 isn't 32MB ramBufferSizeMB too small?
Re: index merge
Hi All, The problem is resolved. It is purely due to filesystem. My filesystem is of 32-bit, running on 64 bit OS. I changed to 64 bit filesystem and all works as expected. Uma -- View this message in context: http://lucene.472066.n3.nabble.com/index-merge-tp472904p832053.html Sent from the Solr - User mailing list archive at Nabble.com.
Rebuild an index
Hi, We use Drupal as the CMS and Solr for our search engine needs and are planning to have Solr Master-Slave replication setup across the data centers. I am in the process of testing my replication - what is the best means to delete the index on the Solr slave and then replicate a fresh copy from Master? We use Solr 1.3. Thanks, Sai Thumuluri My Master solrconfig.xml is startup commit commit schema.xml,synonyms.txt,stopwords.txt,elevate.xml And my slave solrconfig.xml http://masterURL:8080/solr/replication 01:00:00
stemming the index
My index contains data of 2 different languages, English & German. Now which analyzer & stemmer should be applied on this data before feeding to index -Sarfaraz
Update the index
Hi, Can some one point me to a location where it describes how to update an already indexed document? I was thinking there is and tag explained somewhere but cant find it. Thanks, -- Gavin Selvaratnam, Project Leader hSenid Mobile Solutions Phone: +94-11-2446623/4 Fax: +94-11-2307579 Web: http://www.hSenidMobile.com Make it happen Disclaimer: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to which they are addressed. The content and opinions contained in this email are not necessarily those of hSenid Software International. If you have received this email in error please contact the sender.
Index time synonyms
I have a hard time understanding the synonyms behaviour..especially because i don't have the syn filter at index time. If i have this synonym at index time Alternative Sentence,Probation before Judgement,Pretrial Diversion does all occurrence of 'alternative sentence' also get indexed as 'probation judgement' and 'pretrial diversion' ? or does it do this wierd grouping (alternative probation pretrial)(sentence diversion)judgement so all occurrences of 'alternative' will be indexed as 'sentence' and 'diversion' ? Then what about the word 'judgement'? Please someone help me understand this. I have another question related to synonyms posted here http://www.nabble.com/solr-synonyms-behaviour-td15051211.html ..please help with that too... -- View this message in context: http://www.nabble.com/Index-time-synonyms-tp15073889p15073889.html Sent from the Solr - User mailing list archive at Nabble.com.
Lucene index verifier
(Sorry, my Lucene java-user access is wonky.) I would like to verify that my snapshots are not corrupt before I enable them. What is the simplest program to verify that a Lucene index is not corrupt? Or, what is a Solr query that will verify that there is no corruption? With the minimum amount of time? Thanks, Lance Norskog
Shared index base
I know there was such discussions about the subject, but I want to ask again if somebody could share more information. We are planning to have several separate servers for our search engine. One of them will be index/search server, and all others are search only. We want to use SAN (BTW: should we consider something else?) and give access to it from all servers. So all servers will use the same index base, without any replication, same files. Is this a good practice? Did somebody do the same? Any problems noticed? Or any suggestions, even about different configurations are highly appreciated. Thanks, Gene
Merging Solr index
Hi- http://wiki.apache.org/solr/MergingSolrIndexes recommends using the Lucene contributed app IndexMergeTool to merge two Solr indexes. What happens if both indexes have records with the same unique key? Will they both go into the new index? Is the implementation of unique IDs in the Solr java or in Lucene? If it is in Solr, how would I hackup a Solr IndexMergeTool? Cheers, Lance Norskog
Re: Index splitting
Hi Nico, I don't think there is a tool to split an existing Lucene index, though I imagine one could write such a tool using http://lucene.apache.org/java/2_3_1/fileformats.html as a guide. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Nico Heid <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Tuesday, April 29, 2008 4:10:09 AM > Subject: Index splitting > > Hi, > Let me first roughly describe the scenario :-) > > We're trying to index online stored data for some thousand users. > The schema.xml has a custom identifier for the user, so FQ can be applied > and further filtering is only done for the user (more important, the user > doesn't get to see results from data not belonging to him) > > Unfortunatelly, the Index might become quite big ( we're indexing more that > 50 TB Data, all kind of files, full text (indexed only, not stored) where > possible, elsewhere fileinfos (size, date) and meta if available) > > So Question the is: > > We're thinking of starting out with multiple Solr instances (either in their > own containers or MultiCore, guess that's not the important point), on 1 to > n machines. Lets just pretend: we do modulo 5 on the user number and assign > it to one of the two machines. The index gets distributed on QuerySlaves ( > 1-m dependend on the need). > > So now the Question: > Is there a way to split a too big index into smaller ones? Do I have to > create more instances at the beginning, so that I will not run out of power > and space? (which will ad quite a bit of redundance of data) > Lets say I miscalculated and used only 2 indices, but now I see I need at > least 4. > > Any idea will be very welcome, > > Thanks, > Nico > > >
Re: Index splitting
I seem to recall Doug C. commenting on this: http://lucene.markmail.org/search/?q=FilterIndexReader#query :FilterIndexReader%20from%3A%22Doug%20Cutting%22+page:1+mid:y673avueo43ufwhm+state:results Not sure if that is exactly what you are looking for, but sounds similar. -Grant On Apr 29, 2008, at 1:10 PM, Otis Gospodnetic wrote: Hi Nico, I don't think there is a tool to split an existing Lucene index, though I imagine one could write such a tool using http://lucene.apache.org/java/2_3_1/fileformats.html as a guide. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Nico Heid <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Tuesday, April 29, 2008 4:10:09 AM Subject: Index splitting Hi, Let me first roughly describe the scenario :-) We're trying to index online stored data for some thousand users. The schema.xml has a custom identifier for the user, so FQ can be applied and further filtering is only done for the user (more important, the user doesn't get to see results from data not belonging to him) Unfortunatelly, the Index might become quite big ( we're indexing more that 50 TB Data, all kind of files, full text (indexed only, not stored) where possible, elsewhere fileinfos (size, date) and meta if available) So Question the is: We're thinking of starting out with multiple Solr instances (either in their own containers or MultiCore, guess that's not the important point), on 1 to n machines. Lets just pretend: we do modulo 5 on the user number and assign it to one of the two machines. The index gets distributed on QuerySlaves ( 1-m dependend on the need). So now the Question: Is there a way to split a too big index into smaller ones? Do I have to create more instances at the beginning, so that I will not run out of power and space? (which will ad quite a bit of redundance of data) Lets say I miscalculated and used only 2 indices, but now I see I need at least 4. Any idea will be very welcome, Thanks, Nico -- Grant Ingersoll Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Index splitting
On Tue, 29 Apr 2008 10:10:09 +0200 "Nico Heid" <[EMAIL PROTECTED]> wrote: > So now the Question: > Is there a way to split a too big index into smaller ones? Do I have to > create more instances at the beginning, so that I will not run out of power > and space? (which will ad quite a bit of redundance of data) > Lets say I miscalculated and used only 2 indices, but now I see I need at > least 4. Hi Nico, being able to split the index without having to reindex the lot would be a nice option :) One approach we use in a project I am working on is to split up the full extent of your domain (user IDs) in equal parts from the start - with this we get n clusters and it is as much as we will need to grow outwards . Then we grow each cluster in depth as needed. It obviously helps if you have an equal (or random) distribution across your clusters (we do). Given that you probably won't know how many users you'll get your case is different to ours. To even out your distribution of user-ids to cluster you can use a function of the user-id (ie, md5(user_id) ) instead of user_id itself. HIH, B _ {Beto|Norberto|Numard} Meijome Percusive Maintenance - The art of tuning or repairing equipment by hitting it. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Multiple Index creation
Hi All, I tried to search within the SOLR archive, but could not find the answer of how can I create multiple index within SOLR. In case of lucene I can create an IndexWriter with a new Index, and hence can have multiple Index, I can allow search on that multiple index. How can I create in Solr a multiple Index. --Thanks and Regards Vaijanath
SOLR index size
Hi, I'm using SOLR to keep track of customer complaints. I only need to keep recent complaints, but I want to keep as many as I can fit on my hard drive. Is there any way I can configure SOLR to dump old entries in the index when the index reaches a certain size? I'm using a month old version from trunk. Thanks, Marshall
Re: Index structuring
You could have been more specific on the dataset size. If your data volumes are growing you can partition your index into multiple shards. http://wiki.apache.org/solr/DistributedSearch --Noble On Sat, May 31, 2008 at 9:02 PM, Ritesh Ambastha <[EMAIL PROTECTED]> wrote: > > Dear Readers, > > I am a newbie in solr world. I have successfully deployed solr on my > machine, and I am able to index a large DB table. I am pretty sure that > internal index structure of solr is much capable to handle large data sets. > > But, say my data size keeps growing at jet speed, then what should be the > index structure? Do I need to follow some specific index structuring > patterns/algos for handling such massive data? > > I am sorry as I may be sounding novice in this area. I would appreciate your > thoughts/suggestions. > > Regards, > Ritesh Ambastha > -- > View this message in context: > http://www.nabble.com/Index-structuring-tp17576449p17576449.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- --Noble Paul
Re: Index structuring
For the datasize you are proposing , single index should be fine .Just give the m/c enough RAM Distributed search involves multiple requests made between shards which may be an unncessary overhead. --Noble On Wed, Jun 4, 2008 at 4:02 PM, Ritesh Ambastha <[EMAIL PROTECTED]> wrote: > > Thanks Noble, > > I maintain two separate indexes on my disk for two different search > services. > The index size of two are: 91MB and 615MB. I am pretty sure that these index > size will grow in future, and may reach 10GB. > > My doubts : > > 1. When should I start partitioning my index? > 2. Is there any performance issue with partitioning? For eg: A query on 1GB > and 500MB indexed data will take same time to give the result? Or lesser the > index size, lesser the response time? > > > Regards, > Ritesh Ambastha > > Noble Paul നോബിള് नोब्ळ् wrote: >> >> You could have been more specific on the dataset size. >> >> If your data volumes are growing you can partition your index into >> multiple shards. >> http://wiki.apache.org/solr/DistributedSearch >> --Noble >> >> On Sat, May 31, 2008 at 9:02 PM, Ritesh Ambastha <[EMAIL PROTECTED]> >> wrote: >>> >>> Dear Readers, >>> >>> I am a newbie in solr world. I have successfully deployed solr on my >>> machine, and I am able to index a large DB table. I am pretty sure that >>> internal index structure of solr is much capable to handle large data >>> sets. >>> >>> But, say my data size keeps growing at jet speed, then what should be the >>> index structure? Do I need to follow some specific index structuring >>> patterns/algos for handling such massive data? >>> >>> I am sorry as I may be sounding novice in this area. I would appreciate >>> your >>> thoughts/suggestions. >>> >>> Regards, >>> Ritesh Ambastha >>> -- >>> View this message in context: >>> http://www.nabble.com/Index-structuring-tp17576449p17576449.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> >>> >> >> >> >> -- >> --Noble Paul >> >> > > -- > View this message in context: > http://www.nabble.com/Index-structuring-tp17576449p17643690.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- --Noble Paul
Re: Index structuring
Thanks Noble, That means, I can go ahead with single Index for long. :) Regards, Ritesh Ambastha Noble Paul നോബിള് नोब्ळ् wrote: > > For the datasize you are proposing , single index should be fine .Just > give the m/c enough RAM > > Distributed search involves multiple requests made between shards > which may be an unncessary overhead. > --Noble > > On Wed, Jun 4, 2008 at 4:02 PM, Ritesh Ambastha <[EMAIL PROTECTED]> > wrote: >> >> Thanks Noble, >> >> I maintain two separate indexes on my disk for two different search >> services. >> The index size of two are: 91MB and 615MB. I am pretty sure that these >> index >> size will grow in future, and may reach 10GB. >> >> My doubts : >> >> 1. When should I start partitioning my index? >> 2. Is there any performance issue with partitioning? For eg: A query on >> 1GB >> and 500MB indexed data will take same time to give the result? Or lesser >> the >> index size, lesser the response time? >> >> >> Regards, >> Ritesh Ambastha >> >> Noble Paul നോബിള് नोब्ळ् wrote: >>> >>> You could have been more specific on the dataset size. >>> >>> If your data volumes are growing you can partition your index into >>> multiple shards. >>> http://wiki.apache.org/solr/DistributedSearch >>> --Noble >>> >>> On Sat, May 31, 2008 at 9:02 PM, Ritesh Ambastha >>> <[EMAIL PROTECTED]> >>> wrote: >>>> >>>> Dear Readers, >>>> >>>> I am a newbie in solr world. I have successfully deployed solr on my >>>> machine, and I am able to index a large DB table. I am pretty sure that >>>> internal index structure of solr is much capable to handle large data >>>> sets. >>>> >>>> But, say my data size keeps growing at jet speed, then what should be >>>> the >>>> index structure? Do I need to follow some specific index structuring >>>> patterns/algos for handling such massive data? >>>> >>>> I am sorry as I may be sounding novice in this area. I would appreciate >>>> your >>>> thoughts/suggestions. >>>> >>>> Regards, >>>> Ritesh Ambastha >>>> -- >>>> View this message in context: >>>> http://www.nabble.com/Index-structuring-tp17576449p17576449.html >>>> Sent from the Solr - User mailing list archive at Nabble.com. >>>> >>>> >>> >>> >>> >>> -- >>> --Noble Paul >>> >>> >> >> -- >> View this message in context: >> http://www.nabble.com/Index-structuring-tp17576449p17643690.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > > > -- > --Noble Paul > > -- View this message in context: http://www.nabble.com/Index-structuring-tp17576449p17643798.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index structuring
Thanks Noble, I maintain two separate indexes on my disk for two different search services. The index size of two are: 91MB and 615MB. I am pretty sure that these index size will grow in future, and may reach 10GB. My doubts : 1. When should I start partitioning my index? 2. Is there any performance issue with partitioning? For eg: A query on 1GB and 500MB indexed data will take same time to give the result? Or lesser the index size, lesser the response time? Regards, Ritesh Ambastha Noble Paul നോബിള് नोब्ळ् wrote: > > You could have been more specific on the dataset size. > > If your data volumes are growing you can partition your index into > multiple shards. > http://wiki.apache.org/solr/DistributedSearch > --Noble > > On Sat, May 31, 2008 at 9:02 PM, Ritesh Ambastha <[EMAIL PROTECTED]> > wrote: >> >> Dear Readers, >> >> I am a newbie in solr world. I have successfully deployed solr on my >> machine, and I am able to index a large DB table. I am pretty sure that >> internal index structure of solr is much capable to handle large data >> sets. >> >> But, say my data size keeps growing at jet speed, then what should be the >> index structure? Do I need to follow some specific index structuring >> patterns/algos for handling such massive data? >> >> I am sorry as I may be sounding novice in this area. I would appreciate >> your >> thoughts/suggestions. >> >> Regards, >> Ritesh Ambastha >> -- >> View this message in context: >> http://www.nabble.com/Index-structuring-tp17576449p17576449.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > > > -- > --Noble Paul > > -- View this message in context: http://www.nabble.com/Index-structuring-tp17576449p17643690.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index structuring
A lot of this also depends on the number of documents. But we have successfully used Solr with upto 10-12 million documents. On Wed, Jun 4, 2008 at 4:10 PM, Ritesh Ambastha <[EMAIL PROTECTED]> wrote: > > Thanks Noble, > > That means, I can go ahead with single Index for long. > :) > > Regards, > Ritesh Ambastha > > Noble Paul നോബിള് नोब्ळ् wrote: > > > > For the datasize you are proposing , single index should be fine .Just > > give the m/c enough RAM > > > > Distributed search involves multiple requests made between shards > > which may be an unncessary overhead. > > --Noble > > > > On Wed, Jun 4, 2008 at 4:02 PM, Ritesh Ambastha <[EMAIL PROTECTED]> > > wrote: > >> > >> Thanks Noble, > >> > >> I maintain two separate indexes on my disk for two different search > >> services. > >> The index size of two are: 91MB and 615MB. I am pretty sure that these > >> index > >> size will grow in future, and may reach 10GB. > >> > >> My doubts : > >> > >> 1. When should I start partitioning my index? > >> 2. Is there any performance issue with partitioning? For eg: A query on > >> 1GB > >> and 500MB indexed data will take same time to give the result? Or lesser > >> the > >> index size, lesser the response time? > >> > >> > >> Regards, > >> Ritesh Ambastha > >> > >> Noble Paul നോബിള് नोब्ळ् wrote: > >>> > >>> You could have been more specific on the dataset size. > >>> > >>> If your data volumes are growing you can partition your index into > >>> multiple shards. > >>> http://wiki.apache.org/solr/DistributedSearch > >>> --Noble > >>> > >>> On Sat, May 31, 2008 at 9:02 PM, Ritesh Ambastha > >>> <[EMAIL PROTECTED]> > >>> wrote: > >>>> > >>>> Dear Readers, > >>>> > >>>> I am a newbie in solr world. I have successfully deployed solr on my > >>>> machine, and I am able to index a large DB table. I am pretty sure > that > >>>> internal index structure of solr is much capable to handle large data > >>>> sets. > >>>> > >>>> But, say my data size keeps growing at jet speed, then what should be > >>>> the > >>>> index structure? Do I need to follow some specific index structuring > >>>> patterns/algos for handling such massive data? > >>>> > >>>> I am sorry as I may be sounding novice in this area. I would > appreciate > >>>> your > >>>> thoughts/suggestions. > >>>> > >>>> Regards, > >>>> Ritesh Ambastha > >>>> -- > >>>> View this message in context: > >>>> http://www.nabble.com/Index-structuring-tp17576449p17576449.html > >>>> Sent from the Solr - User mailing list archive at Nabble.com. > >>>> > >>>> > >>> > >>> > >>> > >>> -- > >>> --Noble Paul > >>> > >>> > >> > >> -- > >> View this message in context: > >> http://www.nabble.com/Index-structuring-tp17576449p17643690.html > >> Sent from the Solr - User mailing list archive at Nabble.com. > >> > >> > > > > > > > > -- > > --Noble Paul > > > > > > -- > View this message in context: > http://www.nabble.com/Index-structuring-tp17576449p17643798.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- Regards, Shalin Shekhar Mangar.
Re: Index structuring
The number of docs I have indexed till now is : 1,633,570 I am bit afraid as the number of indexed docs will grow atleast 5-10 times in very near future. Regards, Ritesh Ambastha Shalin Shekhar Mangar wrote: > > A lot of this also depends on the number of documents. But we have > successfully used Solr with upto 10-12 million documents. > > On Wed, Jun 4, 2008 at 4:10 PM, Ritesh Ambastha <[EMAIL PROTECTED]> > wrote: > >> >> Thanks Noble, >> >> That means, I can go ahead with single Index for long. >> :) >> >> Regards, >> Ritesh Ambastha >> >> Noble Paul നോബിള് नोब्ळ् wrote: >> > >> > For the datasize you are proposing , single index should be fine .Just >> > give the m/c enough RAM >> > >> > Distributed search involves multiple requests made between shards >> > which may be an unncessary overhead. >> > --Noble >> > >> > On Wed, Jun 4, 2008 at 4:02 PM, Ritesh Ambastha >> <[EMAIL PROTECTED]> >> > wrote: >> >> >> >> Thanks Noble, >> >> >> >> I maintain two separate indexes on my disk for two different search >> >> services. >> >> The index size of two are: 91MB and 615MB. I am pretty sure that these >> >> index >> >> size will grow in future, and may reach 10GB. >> >> >> >> My doubts : >> >> >> >> 1. When should I start partitioning my index? >> >> 2. Is there any performance issue with partitioning? For eg: A query >> on >> >> 1GB >> >> and 500MB indexed data will take same time to give the result? Or >> lesser >> >> the >> >> index size, lesser the response time? >> >> >> >> >> >> Regards, >> >> Ritesh Ambastha >> >> >> >> Noble Paul നോബിള് नोब्ळ् wrote: >> >>> >> >>> You could have been more specific on the dataset size. >> >>> >> >>> If your data volumes are growing you can partition your index into >> >>> multiple shards. >> >>> http://wiki.apache.org/solr/DistributedSearch >> >>> --Noble >> >>> >> >>> On Sat, May 31, 2008 at 9:02 PM, Ritesh Ambastha >> >>> <[EMAIL PROTECTED]> >> >>> wrote: >> >>>> >> >>>> Dear Readers, >> >>>> >> >>>> I am a newbie in solr world. I have successfully deployed solr on my >> >>>> machine, and I am able to index a large DB table. I am pretty sure >> that >> >>>> internal index structure of solr is much capable to handle large >> data >> >>>> sets. >> >>>> >> >>>> But, say my data size keeps growing at jet speed, then what should >> be >> >>>> the >> >>>> index structure? Do I need to follow some specific index structuring >> >>>> patterns/algos for handling such massive data? >> >>>> >> >>>> I am sorry as I may be sounding novice in this area. I would >> appreciate >> >>>> your >> >>>> thoughts/suggestions. >> >>>> >> >>>> Regards, >> >>>> Ritesh Ambastha >> >>>> -- >> >>>> View this message in context: >> >>>> http://www.nabble.com/Index-structuring-tp17576449p17576449.html >> >>>> Sent from the Solr - User mailing list archive at Nabble.com. >> >>>> >> >>>> >> >>> >> >>> >> >>> >> >>> -- >> >>> --Noble Paul >> >>> >> >>> >> >> >> >> -- >> >> View this message in context: >> >> http://www.nabble.com/Index-structuring-tp17576449p17643690.html >> >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> >> >> >> > >> > >> > >> > -- >> > --Noble Paul >> > >> > >> >> -- >> View this message in context: >> http://www.nabble.com/Index-structuring-tp17576449p17643798.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > > -- > Regards, > Shalin Shekhar Mangar. > > -- View this message in context: http://www.nabble.com/Index-structuring-tp17576449p17643909.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index structuring
Fot 16 mil docs it may not be necessary. Add the shards when you see that perf is degrading. --Noble On Wed, Jun 4, 2008 at 4:17 PM, Ritesh Ambastha <[EMAIL PROTECTED]> wrote: > > The number of docs I have indexed till now is : 1,633,570 > I am bit afraid as the number of indexed docs will grow atleast 5-10 times > in very near future. > > Regards, > Ritesh Ambastha > > > > Shalin Shekhar Mangar wrote: >> >> A lot of this also depends on the number of documents. But we have >> successfully used Solr with upto 10-12 million documents. >> >> On Wed, Jun 4, 2008 at 4:10 PM, Ritesh Ambastha <[EMAIL PROTECTED]> >> wrote: >> >>> >>> Thanks Noble, >>> >>> That means, I can go ahead with single Index for long. >>> :) >>> >>> Regards, >>> Ritesh Ambastha >>> >>> Noble Paul നോബിള് नोब्ळ् wrote: >>> > >>> > For the datasize you are proposing , single index should be fine .Just >>> > give the m/c enough RAM >>> > >>> > Distributed search involves multiple requests made between shards >>> > which may be an unncessary overhead. >>> > --Noble >>> > >>> > On Wed, Jun 4, 2008 at 4:02 PM, Ritesh Ambastha >>> <[EMAIL PROTECTED]> >>> > wrote: >>> >> >>> >> Thanks Noble, >>> >> >>> >> I maintain two separate indexes on my disk for two different search >>> >> services. >>> >> The index size of two are: 91MB and 615MB. I am pretty sure that these >>> >> index >>> >> size will grow in future, and may reach 10GB. >>> >> >>> >> My doubts : >>> >> >>> >> 1. When should I start partitioning my index? >>> >> 2. Is there any performance issue with partitioning? For eg: A query >>> on >>> >> 1GB >>> >> and 500MB indexed data will take same time to give the result? Or >>> lesser >>> >> the >>> >> index size, lesser the response time? >>> >> >>> >> >>> >> Regards, >>> >> Ritesh Ambastha >>> >> >>> >> Noble Paul നോബിള് नोब्ळ् wrote: >>> >>> >>> >>> You could have been more specific on the dataset size. >>> >>> >>> >>> If your data volumes are growing you can partition your index into >>> >>> multiple shards. >>> >>> http://wiki.apache.org/solr/DistributedSearch >>> >>> --Noble >>> >>> >>> >>> On Sat, May 31, 2008 at 9:02 PM, Ritesh Ambastha >>> >>> <[EMAIL PROTECTED]> >>> >>> wrote: >>> >>>> >>> >>>> Dear Readers, >>> >>>> >>> >>>> I am a newbie in solr world. I have successfully deployed solr on my >>> >>>> machine, and I am able to index a large DB table. I am pretty sure >>> that >>> >>>> internal index structure of solr is much capable to handle large >>> data >>> >>>> sets. >>> >>>> >>> >>>> But, say my data size keeps growing at jet speed, then what should >>> be >>> >>>> the >>> >>>> index structure? Do I need to follow some specific index structuring >>> >>>> patterns/algos for handling such massive data? >>> >>>> >>> >>>> I am sorry as I may be sounding novice in this area. I would >>> appreciate >>> >>>> your >>> >>>> thoughts/suggestions. >>> >>>> >>> >>>> Regards, >>> >>>> Ritesh Ambastha >>> >>>> -- >>> >>>> View this message in context: >>> >>>> http://www.nabble.com/Index-structuring-tp17576449p17576449.html >>> >>>> Sent from the Solr - User mailing list archive at Nabble.com. >>> >>>> >>> >>>> >>> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> --Noble Paul >>> >>> >>> >>> >>> >> >>> >> -- >>> >> View this message in context: >>> >> http://www.nabble.com/Index-structuring-tp17576449p17643690.html >>> >> Sent from the Solr - User mailing list archive at Nabble.com. 
>>> >> >>> >> >>> > >>> > >>> > >>> > -- >>> > --Noble Paul >>> > >>> > >>> >>> -- >>> View this message in context: >>> http://www.nabble.com/Index-structuring-tp17576449p17643798.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> >>> >> >> >> -- >> Regards, >> Shalin Shekhar Mangar. >> >> > > -- > View this message in context: > http://www.nabble.com/Index-structuring-tp17576449p17643909.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- --Noble Paul
Re: Updating index
Mihails, Update is done as delete + re-add. You may also want to look at SOLR-139 in Solr JIRA. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Mihails Agafonovs <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Tuesday, June 17, 2008 6:25:26 AM > Subject: Updating index > > Hi! > > Updating index with post.jar just replaces the index with the defined > xml's. But if there are, for example, two fields in all xml's that > were changed, is there a way to update only these fields (incremental > update)? If there are a lot of large xml's, it would be performance > slowdown each time rewriting the index, and also an unreal job to > change the fields manually. > Ar cieņu, Mihails
Deleting Solr index
How can I clear the whole Solr index? Ar cieņu, Mihails
Automated Index Creation
Hi, Sorry if this question sounds daft but I was wondering if there was anything built into Solr that allows you to automate the creation of new indexes once they reach a certain size or point in time. I looked briefly at the documentation on CollectionDestribution, but it seems more geared to towards replicatting to other production servers...I'm looking for something that is more along the lines of archiving indexes for later use... Thanks, Willie
Re: Index partioning
That wiki page is purely an idea proposal at this time, not a feature of Solr (yet or perhaps ever). Erik On Sep 1, 2008, at 2:23 AM, sanraj25 wrote: Hi I read the doument on http://wiki.apache.org/solr/IndexPartitioning Now i want partition my solr index into two. Based on that document I changed solrconfig.xml. But I can't visible any partitioned folder other than default one. I need help on index partitioning.give some suggestion to this. Thanks in advance -Santhanaraj -- View this message in context: http://www.nabble.com/Index-partioning-tp19249441p19249441.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index partioning
That wiki page is purely an idea proposal at this time, not a feature of Solr (yet or perhaps ever). Erik I found this thread in the archive... I'm responsible for a number of ruby on rails websites, all of which need search. Solr seems to have everything I need, but I am wondering what's the best way to maintain multiple indexes? Multiple Solr instances on different ports? Any help, much appreciated. -- John P.S., I don't know how up-to-date it is, but I have Erik's book on order from Amazon... hoping that will help.
Re: Index partioning
: I found this thread in the archive... : : I'm responsible for a number of ruby on rails websites, all of which need : search. Solr seems to have everything I need, but I am wondering what's the : best way to maintain multiple indexes? : : Multiple Solr instances on different ports? having multiple indexes is a different beast from the "Index Partitioning" topic this thread was discussing ... there's some good info on the wiki about the various options (they each have their trade offs to consider) http://wiki.apache.org/solr/MultipleIndexes -Hoss
Re: Index partioning
We do both #2 and #4 from the Wiki page. If the schemas have a lot of overlap and you don't foresee the need to scale to multiple machines (either due to index size or amount of traffic), it may be best to put all the data in a single index with different type fields (#4); this certainly minimizes maintenance. #1 or #2 seems like a better choice if it is likely you will eventually need to physically separate the indexes. Jason On Mon, Sep 8, 2008 at 6:15 PM, Chris Hostetter <[EMAIL PROTECTED]>wrote: > : I found this thread in the archive... > : > : I'm responsible for a number of ruby on rails websites, all of which need > : search. Solr seems to have everything I need, but I am wondering what's > the > : best way to maintain multiple indexes? > : > : Multiple Solr instances on different ports? > > having multiple indexes is a different beast from the "Index Partitioning" > topic this thread was discussing ... there's some good info on the wiki > about the various options (they each have their trade offs to consider) > >http://wiki.apache.org/solr/MultipleIndexes > > -Hoss > > -- Jason Rennie Head of Machine Learning Technologies, StyleFeeder http://www.stylefeeder.com/ Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/
Re: Lucene index
On Tue, Sep 23, 2008 at 5:33 PM, Dinesh Gupta <[EMAIL PROTECTED]>wrote: > > Hi, > Current we are using Lucene api to create index. > > It creates index in a directory with 3 files like > > xxx.cfs , deletable & segments. > > If I am creating Lucene indexes from Solr, these file will be created or > not? The lucene index will be created in the solr_home inside the data/index directory. > Please give me example on MySQL data base instead of hsqldb > If you are talking about DataImportHandler then there is no difference in the configuration except for using the MySql driver instead of hsqldb. -- Regards, Shalin Shekhar Mangar.
RE: Lucene index
atalogues doc.add(new Field("clg",(String) data.get("Catalogues"),Field.Store.YES,Field.Index.TOKENIZED)); //doc.add(Field.Text("clg", (String) data.get("Catalogues"))); //Product Delivery Cities doc.add(new Field("dcty",(String) data.get("DelCities"),Field.Store.YES,Field.Index.TOKENIZED)); // Additional Information //Top Selling Count String sellerCount=((Long)data.get("SellCount")).toString(); doc.add(new Field("bsc",sellerCount,Field.Store.YES,Field.Index.TOKENIZED)); I am preparing data from querying databse. Please tell me how can I migrate my logic to Solr. I have spend more than a week. But have got nothing. Please help me. Can I attach my files here? Thanks in Advance Regards Dinesh Gupta > Date: Tue, 23 Sep 2008 18:53:07 +0530 > From: [EMAIL PROTECTED] > To: solr-user@lucene.apache.org > Subject: Re: Lucene index > > On Tue, Sep 23, 2008 at 5:33 PM, Dinesh Gupta <[EMAIL PROTECTED]>wrote: > > > > > Hi, > > Current we are using Lucene api to create index. > > > > It creates index in a directory with 3 files like > > > > xxx.cfs , deletable & segments. > > > > If I am creating Lucene indexes from Solr, these file will be created or > > not? > > > The lucene index will be created in the solr_home inside the data/index > directory. > > > > Please give me example on MySQL data base instead of hsqldb > > > > If you are talking about DataImportHandler then there is no difference in > the configuration except for using the MySql driver instead of hsqldb. > > -- > Regards, > Shalin Shekhar Mangar. _ Want to explore the world? Visit MSN Travel for the best deals. http://in.msn.com/coxandkings
Re: Lucene index
Hi Dinesh, This seems straightforward for Solr. You can use the embedded jetty server for a start. Look at the tutorial on how to get started. You'll need to modify the schema.xml to define all the fields that you want to index. The wiki page at http://wiki.apache.org/solr/SchemaXml is a good start on how to do that. Each field in your code will have a counterpart in the schema.xml with appropriate flags (indexed/stored/tokenized etc.) Once that is complete, try to modify the DataImportHandler's hsqldb example for your mysql database. On Tue, Sep 23, 2008 at 7:01 PM, Dinesh Gupta <[EMAIL PROTECTED]>wrote: > > Hi Shalin Shekhar, > > Let me explain my issue. > > I have some tables in my database like > > Product > Category > Catalogue > Keywords > Seller > Brand > Country_city_group > etc. > I have a class that represent product document as > > Document doc = new Document(); >// Keywords which can be used directly for search >doc.add(new Field("id",(String) > data.get("PRN"),Field.Store.YES,Field.Index.UN_TOKENIZED)); > >// Sorting fields] >String priceString = (String) data.get("Price"); >if (priceString == null) >priceString = "0"; >long price = 0; >try { >price = (long) Double.parseDouble(priceString); >} catch (Exception e) { > >} > >doc.add(new > Field("prc",NumberUtils.pad(price),Field.Store.YES,Field.Index.UN_TOKENIZED)); >Date createDate = (Date) data.get("CreateDate"); >if (createDate == null) createDate = new Date(); > >doc.add(new Field("cdt",String.valueOf(createDate.getTime()), > Field.Store.NO,Field.Index.UN_TOKENIZED)); > >Date modiDate = (Date) data.get("ModiDate"); >if (modiDate == null) modiDate = new Date(); > >doc.add(new Field("mdt",String.valueOf(modiDate.getTime()), > Field.Store.NO,Field.Index.UN_TOKENIZED)); >//doc.add(Field.UnStored("cdt", > String.valueOf(createDate.getTime(; > >// Additional fields for search >doc.add(new Field("bnm",(String) > data.get("Brand"),Field.Store.YES,Field.Index.TOKENIZED)); >doc.add(new Field("bnm1",(String) data.get("Brand1"),Field.Store.NO > ,Field.Index.UN_TOKENIZED)); >//doc.add(Field.Text("bnm", (String) data.get("Brand"))); > //Tokenized and Unstored >doc.add(new Field("bid",(String) > data.get("BrandId"),Field.Store.YES,Field.Index.UN_TOKENIZED)); >//doc.add(Field.Keyword("bid", (String) data.get("BrandId"))); // > untokenized & >doc.add(new Field("grp",(String) data.get("Group"),Field.Store.NO > ,Field.Index.TOKENIZED)); >//doc.add(Field.Text("grp", (String) data.get("Group"))); >doc.add(new Field("gid",(String) > data.get("GroupId"),Field.Store.YES,Field.Index.UN_TOKENIZED)); >//doc.add(Field.Keyword("gid", (String) data.get("GroupId"))); //New >doc.add(new Field("snm",(String) > data.get("Seller"),Field.Store.YES,Field.Index.UN_TOKENIZED)); >//doc.add(Field.Text("snm", (String) data.get("Seller"))); >doc.add(new Field("sid",(String) > data.get("SellerId"),Field.Store.YES,Field.Index.UN_TOKENIZED)); >//doc.add(Field.Keyword("sid", (String) data.get("SellerId"))); // > New >doc.add(new Field("ttl",(String) > data.get("Title"),Field.Store.YES,Field.Index.TOKENIZED)); >//doc.add(Field.UnStored("ttl", (String) data.get("Title"), true)); > >String title1 = (String) data.get("Title"); >title1 = removeSpaces(title1); >doc.add(new Field("ttl1",title1,Field.Store.NO > ,Field.Index.UN_TOKENIZED)); > >doc.add(new Field("ttl2",title1,Field.Store.NO > ,Field.Index.TOKENIZED)); >//doc.add(Field.UnStored("ttl", (String) data.get("Title"), true)); > >// ColumnC - Product Sequence >String productSeq = (String) data.get("ProductSeq"); >if 
(productSeq == null) productSeq = ""; >doc.add(new Field("seq",productSeq,Field.Store.NO > ,Field.Index.UN_TOKENIZED)); >//doc.add(Field.Keyword("seq", productSeq)); > >// New Added >doc.add(new Field("sdc",(String) data.get("SpecialDescript
RE: Lucene index
Hi Shalin Shekhar, First of all thanks to you for quick replying. I have done the things that you have explained here Since I am creating indexes in multi threads and it takes 6-10 hours to creating for approx. 3 lac products I am using hibernate to access DB & applying custom logic to prepare data and putting in a map and finally writing to index. Now can I achieve this. I am able to search by using solr web admin but not able to add. Please tell me how can I attach my file to you. Thanks Regards, Dinesh Gupta > Date: Tue, 23 Sep 2008 19:36:22 +0530 > From: [EMAIL PROTECTED] > To: solr-user@lucene.apache.org > Subject: Re: Lucene index > > Hi Dinesh, > > This seems straightforward for Solr. You can use the embedded jetty server > for a start. Look at the tutorial on how to get started. > > You'll need to modify the schema.xml to define all the fields that you want > to index. The wiki page at http://wiki.apache.org/solr/SchemaXml is a good > start on how to do that. Each field in your code will have a counterpart in > the schema.xml with appropriate flags (indexed/stored/tokenized etc.) > > Once that is complete, try to modify the DataImportHandler's hsqldb example > for your mysql database. > > On Tue, Sep 23, 2008 at 7:01 PM, Dinesh Gupta <[EMAIL PROTECTED]>wrote: > > > > > Hi Shalin Shekhar, > > > > Let me explain my issue. > > > > I have some tables in my database like > > > > Product > > Category > > Catalogue > > Keywords > > Seller > > Brand > > Country_city_group > > etc. > > I have a class that represent product document as > > > > Document doc = new Document(); > >// Keywords which can be used directly for search > >doc.add(new Field("id",(String) > > data.get("PRN"),Field.Store.YES,Field.Index.UN_TOKENIZED)); > > > >// Sorting fields] > >String priceString = (String) data.get("Price"); > >if (priceString == null) > >priceString = "0"; > >long price = 0; > >try { > >price = (long) Double.parseDouble(priceString); > >} catch (Exception e) { > > > >} > > > >doc.add(new > > Field("prc",NumberUtils.pad(price),Field.Store.YES,Field.Index.UN_TOKENIZED)); > >Date createDate = (Date) data.get("CreateDate"); > >if (createDate == null) createDate = new Date(); > > > >doc.add(new Field("cdt",String.valueOf(createDate.getTime()), > > Field.Store.NO,Field.Index.UN_TOKENIZED)); > > > >Date modiDate = (Date) data.get("ModiDate"); > >if (modiDate == null) modiDate = new Date(); > > > >doc.add(new Field("mdt",String.valueOf(modiDate.getTime()), > > Field.Store.NO,Field.Index.UN_TOKENIZED)); > >//doc.add(Field.UnStored("cdt", > > String.valueOf(createDate.getTime(; > > > >// Additional fields for search > >doc.add(new Field("bnm",(String) > > data.get("Brand"),Field.Store.YES,Field.Index.TOKENIZED)); > >doc.add(new Field("bnm1",(String) data.get("Brand1"),Field.Store.NO > > ,Field.Index.UN_TOKENIZED)); > >//doc.add(Field.Text("bnm", (String) data.get("Brand"))); > > //Tokenized and Unstored > >doc.add(new Field("bid",(String) > > data.get("BrandId"),Field.Store.YES,Field.Index.UN_TOKENIZED)); > >//doc.add(Field.Keyword("bid", (String) data.get("BrandId"))); // > > untokenized & > >doc.add(new Field("grp",(String) data.get("Group"),Field.Store.NO > > ,Field.Index.TOKENIZED)); > >//doc.add(Field.Text("grp", (String) data.get("Group"))); > >doc.add(new Field("gid",(String) > > data.get("GroupId"),Field.Store.YES,Field.Index.UN_TOKENIZED)); > >//doc.add(Field.Keyword("gid", (String) data.get("GroupId"))); //New > >doc.add(new Field("snm",(String) > > 
data.get("Seller"),Field.Store.YES,Field.Index.UN_TOKENIZED)); > >//doc.add(Field.Text("snm", (String) data.get("Seller"))); > >doc.add(new Field("sid",(String) > > data.get("SellerId"),Field.Store.YES,Field.Index.UN_TOKENIZED)); > >//doc.add(Field.Keyword("sid", (String) data.get("SellerId"))); // > > New
Re: Lucene index
Hi Dinesh, There are two ways in which you can import data from databases. 1. Use your custom code with the Solrj client library to upload documents to Solr -- http://wiki.apache.org/solr/Solrj 2. Use DataImportHandler and write data-config.xml and custom Transformers -- http://wiki.apache.org/solr/DataImportHandler Take a look at both and use the one which suits you best. On Wed, Sep 24, 2008 at 6:37 PM, Dinesh Gupta <[EMAIL PROTECTED]>wrote: > > Hi Shalin Shekhar, > > First of all thanks to you for quick replying. > > I have done the things that you have explained here > > Since I am creating indexes in multi threads and it takes 6-10 hours to > creating for approx. 3 lac products > > I am using hibernate to access DB & applying custom logic to prepare data > and putting in a map > and finally writing to index. > > Now can I achieve this. > > I am able to search by using solr web admin > but not able to add. > Please tell me how can I attach my file to you. > > Thanks > > Regards, > Dinesh Gupta > > > Date: Tue, 23 Sep 2008 19:36:22 +0530 > > From: [EMAIL PROTECTED] > > To: solr-user@lucene.apache.org > > Subject: Re: Lucene index > > > > Hi Dinesh, > > > > This seems straightforward for Solr. You can use the embedded jetty > server > > for a start. Look at the tutorial on how to get started. > > > > You'll need to modify the schema.xml to define all the fields that you > want > > to index. The wiki page at http://wiki.apache.org/solr/SchemaXml is a > good > > start on how to do that. Each field in your code will have a counterpart > in > > the schema.xml with appropriate flags (indexed/stored/tokenized etc.) > > > > Once that is complete, try to modify the DataImportHandler's hsqldb > example > > for your mysql database. > > > > On Tue, Sep 23, 2008 at 7:01 PM, Dinesh Gupta < > [EMAIL PROTECTED]>wrote: > > > > > > > > Hi Shalin Shekhar, > > > > > > Let me explain my issue. > > > > > > I have some tables in my database like > > > > > > Product > > > Category > > > Catalogue > > > Keywords > > > Seller > > > Brand > > > Country_city_group > > > etc. 
> > > I have a class that represent product document as > > > > > > Document doc = new Document(); > > >// Keywords which can be used directly for search > > >doc.add(new Field("id",(String) > > > data.get("PRN"),Field.Store.YES,Field.Index.UN_TOKENIZED)); > > > > > >// Sorting fields] > > >String priceString = (String) data.get("Price"); > > >if (priceString == null) > > >priceString = "0"; > > >long price = 0; > > >try { > > >price = (long) Double.parseDouble(priceString); > > >} catch (Exception e) { > > > > > >} > > > > > >doc.add(new > > > > Field("prc",NumberUtils.pad(price),Field.Store.YES,Field.Index.UN_TOKENIZED)); > > >Date createDate = (Date) data.get("CreateDate"); > > >if (createDate == null) createDate = new Date(); > > > > > >doc.add(new Field("cdt",String.valueOf(createDate.getTime()), > > > Field.Store.NO,Field.Index.UN_TOKENIZED)); > > > > > >Date modiDate = (Date) data.get("ModiDate"); > > >if (modiDate == null) modiDate = new Date(); > > > > > >doc.add(new Field("mdt",String.valueOf(modiDate.getTime()), > > > Field.Store.NO,Field.Index.UN_TOKENIZED)); > > >//doc.add(Field.UnStored("cdt", > > > String.valueOf(createDate.getTime(; > > > > > >// Additional fields for search > > >doc.add(new Field("bnm",(String) > > > data.get("Brand"),Field.Store.YES,Field.Index.TOKENIZED)); > > >doc.add(new Field("bnm1",(String) data.get("Brand1"), > Field.Store.NO > > > ,Field.Index.UN_TOKENIZED)); > > >//doc.add(Field.Text("bnm", (String) data.get("Brand"))); > > > //Tokenized and Unstored > > >doc.add(new Field("bid",(String) > > > data.get("BrandId"),Field.Store.YES,Field.Index.UN_TOKENIZED)); > > >//doc.add(Field.Keyword("bid", (String) data.get("BrandId"))); > // &g
Re: Index partitioning
: I want to partition my index based on category information. Also, while : indexing I want to store particular category data to corresponding index : partition. In the same way I need to search for category information on : corresponding partition.. I found some information on wiki link : http://wiki.apache.org/solr/IndexPartitioning. But it couldn't help much I've updated that document to reflect it's status as an "whiteboard" that hasn't been updated in a while, and that some of the ideas expressed there can achieved using a combination of distributed search and multicore. deciding which "core" to update when a new doc comes in (ie: based on category in your example) or which core to search against when a query comes in is something your application would need do -- the distributed searching functionality doesn't provide a feature like that. how it would work would be somewhat dependent on your use case (if you want every category to have it's own 'partition' it's trivial; if you want to hash on the category name it's a little more code but straightforward; if you have complex rules about how certain categories shoudl be grouped together in the same partition -- then you need to implement those special rules in your client code. -Hoss
Re: Index Builder
On 3/5/06, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > What/where is the Index Builder that is referred to in > http://wiki.apache.org/solr/CollectionBuilding? It's currently client-supplied (i.e. there isn't one). Having all Solr users have to write their own builders (code that gets data from a source and posts XML documents) certainly isn't optimal. It would be nice if we could give Solr a database URL with some SQL, and have it automatically slurp and index the records. It would also be nice to be able to grab documents from a CSV or other simple structured text file and index them. These ideas are on already on the task list on the (currently down) Wiki. -Yonik
Re: Index Builder
I had a feeling that was the case. So, I was thinking I could write a driver program that takes in my files and then calls the API directly. Is this doable? How do you guys do it on your live site? Do you do it all through HTTP requests or through a driver that calls the API? I think I would prefer the API calls for bulk loading. Where should I look for these? -Grant Yonik Seeley <[EMAIL PROTECTED]> wrote: On 3/5/06, Grant Ingersoll wrote: > What/where is the Index Builder that is referred to in > http://wiki.apache.org/solr/CollectionBuilding? It's currently client-supplied (i.e. there isn't one). Having all Solr users have to write their own builders (code that gets data from a source and posts XML documents) certainly isn't optimal. It would be nice if we could give Solr a database URL with some SQL, and have it automatically slurp and index the records. It would also be nice to be able to grab documents from a CSV or other simple structured text file and index them. These ideas are on already on the task list on the (currently down) Wiki. -Yonik -- Grant Ingersoll http://www.grantingersoll.com - Yahoo! Mail Use Photomail to share photos without annoying attachments.
Re: Index Builder
: I had a feeling that was the case. So, I was thinking I could write a : driver program that takes in my files and then calls the API directly. : Is this doable? How do you guys do it on your live site? Do you do it : all through HTTP requests or through a driver that calls the API? I : think I would prefer the API calls for bulk loading. Where should I : look for these? Once upon a time, I agrued for having an robust update API, and a way to write "updater plugins" that would run within the Solr JVM ... and I was talked out of it in favor doing everything over HTTP. So yeah ... that's what I do: build/update entirely over HTTP. >From what i remember of the internal update API, you could probably write a new subclass of UpdateHandler that you register in the solrconfig.xml which pulled most of the data from wherever you want -- but it would still need to be triggered by (minimal) "" messages over HTTP. alternately, you could write your own Servlet with load-on-startup="true" that used the internal update methods directly. -Hoss
Re: Index Builder
On 3/5/06, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > So, I was thinking I could write a driver program that takes in my files and > then calls the API directly. Is this doable? It's doable... While it will be more efficient, it's not clear how much you will gain, esp if you run with multiple CPUs (IndexWriting is highly synchronized). Check out the UpdateHandler abstract class: public abstract int addDoc(AddUpdateCommand cmd) throws IOException; public abstract void delete(DeleteUpdateCommand cmd) throws IOException; public abstract void deleteByQuery(DeleteUpdateCommand cmd) throws IOException; public abstract void commit(CommitUpdateCommand cmd) throws IOException; public abstract void close() throws IOException; While the implementation of the UpdateHandler is pluggable, there isn't a place to plug in different client handlers (like there is with RequestHandler). You could create another servlet in the same webapp and get the current UpdateHandler (SolrCore.updateHandler) and use that to update the index. Seems like there isn't a getter for SolrCore.updateHandler... feel free to sumbit a patch if you want to go this route. You could even drop down to a lower level and use DocumentBuilder to create your own Lucene Document instances and write them with an IndexWriter yourself. -Yonik > Do you do it all through HTTP requests or through a driver that calls the > API? > I think I would prefer the API calls for bulk loading. Where should I look > for these? > > -Grant > > Yonik Seeley <[EMAIL PROTECTED]> wrote: On 3/5/06, Grant Ingersoll wrote: > > What/where is the Index Builder that is referred to in > > http://wiki.apache.org/solr/CollectionBuilding? > > It's currently client-supplied (i.e. there isn't one). > > Having all Solr users have to write their own builders (code that gets > data from a source and posts XML documents) certainly isn't optimal. > > It would be nice if we could give Solr a database URL with some SQL, > and have it automatically slurp and index the records. It would also > be nice to be able to grab documents from a CSV or other simple > structured text file and index them. > > These ideas are on already on the task list on the (currently down) Wiki. > > -Yonik
add/update index
Hi, I have created a process which uses xsl to convert my data to the form indicated in the examples so that it can be added to the index as the solr tutorial indicates: value ... In some cases the xsl process will create a field element with no data. (ie ) Is this considered bad input and will not be accepted? Or is this something that solr should deal with? Currently for each field element with no data I receive the message: java.lang.NullPointerException at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:78) at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:74) at org.apache.solr.core.SolrCore.readDoc(SolrCore.java:917) at org.apache.solr.core.SolrCore.update(SolrCore.java:685) at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:52) ... Just curious if the gurus out there think I should deal with the null values in my xsl process or if this can be dealt with in solr itself? Thanks, Tricia ps. Thanks for the timely fix for the UTF-8 issue!
Index-time Boosting
I'm trying to figure out how to set per-field boosts in Solr at index time. For example, if I want the title to be boosted by a factor of 8, I could do that in a query, or I could add the title text with a boost of 8 to the default text field along with the body text (with a boost of 1). For other engines I've worked with, this gives a lot more performance at the cost of some flexibility -- you need to reindex to change the weightings. I don't see an obvious way to do this in a Solr schema, though it might make sense to add a boost attribute to copyField. Any ideas? Did I miss something? wunder -- Walter Underwood Search Guru, Netflix
Starting an index...
I have played with the "example" directory for a while. Everything seems to work well. Now I'd like to start my own index and I have a few questions. 1. I suppose I can start from copying the whole example directory and name it myindex. I understand that I need to modify the solr/conf/schema.xml to suit my data. Besides that, is there anything else that I must/should change? I'll take a look at the stopwords.txt, etc. to see if any changes is required. How about solr.war? Anything else I need to customize? (I'm not a heavy java developer.) 2. For each index, do I need to copy this directory and start a solr instance? Is it possible to run one solr instance for multiple indices? 3. solr comes with jetty and it seems to work pretty well. Is there any reason that I should switch to tomcat for production servers? -- Thanks, Jack __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
Auto index update
Hello, Can anybody suggest me of what is the best method to implement auto index update on SOLR from mysql database. thanks and regards aditya
Re: index problem
On 3/29/07, James liu <[EMAIL PROTECTED]> wrote: i use freebsd6, tomcat 6(without install)+jdk1.5_07+php5+mssql i debug my program and data is ok before update to do index and index process is ok. no error. but i find index file not what i wanna. it have changed. tomcat6's server.xml,,i added "URIEncoding="UTF-8" data send to solr do index by curl (with utf-8) anyone know how to fix it? Can you reproduce with a small example that shows 1) the document being sent to Solr (complete with HTTP headers, or at least the curl command) 2) the query that shows the problem -Yonik
Re: Index Files
On 3/29/07, Michael Beccaria <[EMAIL PROTECTED]> wrote: Simple curious question from a newbie: Can I have another computer index my data, then copy the index folder files into my live system and run a commit? The project idea I have is for a library catalog which will update holdings information (whether a book is checked out) for an item record (also a solr\lucene record). My collection is small enough that it can index the entire library collection in about 20 minutes. If I have another computer continually indexing, then copy those files to my live system and commit, will that successfully update the index? Yes, this is essentially what Solr's distribution scripts do in an automated way. -Yonik
Re: index problem
Problem i fix it. Thks,Yonik. 2007/3/30, Yonik Seeley <[EMAIL PROTECTED]>: On 3/29/07, James liu <[EMAIL PROTECTED]> wrote: > i use freebsd6, tomcat 6(without install)+jdk1.5_07+php5+mssql > > i debug my program and data is ok before update to do index > > and index process is ok. no error. > > but i find index file not what i wanna. it have changed. > > tomcat6's server.xml,,i added "URIEncoding="UTF-8" > > data send to solr do index by curl (with utf-8) > > > anyone know how to fix it? Can you reproduce with a small example that shows 1) the document being sent to Solr (complete with HTTP headers, or at least the curl command) 2) the query that shows the problem -Yonik -- regards jl
Question: index performance
i find it will be OutOfMemory when i get more that 10k records. so now i index 10k records( 5k / record) if i use for to index more data. it always show OutOfMemory. i use top to moniter and find index finish, free memory is 125m,,and sometime it will be 218m it show me solr index finish and use sometime free memory? how can i index more data than 10k records and doesn't stop by OutOfMemory. tomcat i set memory 512m. -- regards jl
Re: Index corruptions?
In additional to snapshot, you can also make backup copies of your Solr index using the backup script. Backup are created the same way as snapshots using hard links. Each one is a viable full index. Bill On 5/3/07, Charlie Jackson <[EMAIL PROTECTED]> wrote: I have a couple of questions regarding index corruptions. 1) Has anyone using Solr in a production environment ever experienced an index corruption? If so, how frequently do they occur? 2) It seems like the CollectionDistribution setup would be a good way to put in place a recovery plan for (or at least have some viable backups of) the index. However, I have a small concern that if the index gets corrupted on the master server, the corruption would propagate down to the slave servers as well. Is this concern unfounded? Also, each of the snapshots taken by snapshooter are viable full indexes, correct? If so, that means I'd have a backup of the index each and every time a commit (or optimize for that matter) is done, which would be awesome. One of our biggest requirements for the indexing process is to have a good backup/recover strategy in place and I want to make sure Solr will be able to provide that. Thanks in advance! Charlie
Re: Index corruptions?
Hi Charlie, On 5/3/07, Charlie Jackson <[EMAIL PROTECTED]> wrote: I have a couple of questions regarding index corruptions. 1) Has anyone using Solr in a production environment ever experienced an index corruption? If so, how frequently do they occur? I once had all slaves complain about a missing file in the index. The master never had a problem. The problem went away at the next snapshot. Is the "cp-lr" in snapshot really guaranteed to be atomic? Or is it just fast, and unlikely to be interrupted? This has only occurred once over the last 5 months. 2) It seems like the CollectionDistribution setup would be a good way to put in place a recovery plan for (or at least have some viable backups of) the index. However, I have a small concern that if the index gets corrupted on the master server, the corruption would propagate down to the slave servers as well. Is this concern unfounded? I would expect this to be true. Also, each of the snapshots taken by snapshooter are viable full indexes, correct? If so, that means I'd have a backup of the index each and every time a commit (or optimize for that matter) is done, which would be awesome. That's my understanding. Tom
Re: Index corruptions?
On 5/7/07, Tom Hill <[EMAIL PROTECTED]> wrote: Is the "cp-lr" in snapshot really guaranteed to be atomic? Or is it just fast, and unlikely to be interrupted? It's called from Solr within a synchronized context, and it's guaranteed that no index changes (via Solr at least) will happen concurrently. -Yonik
Re: Index Concurrency
On 5/9/07, joestelmach <[EMAIL PROTECTED]> wrote: My first intuition is to give each user their own index. My thinking here is that querying would be faster (since each user's index would be much smaller than one big index,) and, more importantly, that I would dodge any concurrency issues stemming from multiple threads trying to update the same index simultaneously. I realize that Lucene implements a locking mechanism to protect against concurrent access, but I seem to hit the lock access timeout quite easily with only a couple threads. After looking at solr, I would really like to take advantage of the many features it adds to Lucene, but it doesn't look like I'll be able to achieve multiple indexes. No, not currently. Start your implementation with just a single index... unless it is very large, it will likely be fast enough. Solr also handles all the concurrency issues, and you should never hit "lock access timeout" when updating from multiple threads. -Yonik
Re: Index Concurrency
Yonik, Thanks for your fast reply. > No, not currently. Start your implementation with just a single > index... unless it is very large, it will likely be fast enough. My index will get quite large > Solr also handles all the concurrency issues, and you should never hit > "lock access timeout" when updating from multiple threads. Does solr provide any additional concurrency control over what Lucene provides? In my simple testing of indexing 2,000 messages, solr would issue lock access timeouts with as little as 10 threads. Running all 2,000 messages through sequentially yields no problems at all. Actually, I'm able churn through over 100,000 messages when no threads are involved. Am I missing some concurrency settings? Thanks, Joe -- View this message in context: http://www.nabble.com/Index-Concurrency-tf3718634.html#a10406382 Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index Concurrency
On 5/9/07, joestelmach <[EMAIL PROTECTED]> wrote: Does solr provide any additional concurrency control over what Lucene provides? Yes, coordination between the main index searcher, the index writer, and the index reader needed to delete other documents. In my simple testing of indexing 2,000 messages, solr would issue lock access timeouts with as little as 10 threads. That's weird... I've never seen that. The lucene write lock is only obtained when the IndexWriter is created. Can you post the relevant part of the log file where the exception happens? Also, unless you have at least 6 CPU cores or so, you are unlikely to see greater throughput with 10 threads. If you add multiple documents per HTTP-POST (such that HTTP latency is minimized), the best setting would probably be nThreads == nCores. For a single doc per POST, more threads will serve to cover the latency and keep Solr busy. -Yonik
Re: Index Concurrency
Though, isn't there a recent patch to allow multiple indices under a single Solr instance in JIRA? Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Yonik Seeley <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, May 9, 2007 6:32:33 PM Subject: Re: Index Concurrency On 5/9/07, joestelmach <[EMAIL PROTECTED]> wrote: > My first intuition is to give each user their own index. My thinking here is > that querying would be faster (since each user's index would be much smaller > than one big index,) and, more importantly, that I would dodge any > concurrency issues stemming from multiple threads trying to update the same > index simultaneously. I realize that Lucene implements a locking mechanism > to protect against concurrent access, but I seem to hit the lock access > timeout quite easily with only a couple threads. > > After looking at solr, I would really like to take advantage of the many > features it adds to Lucene, but it doesn't look like I'll be able to achieve > multiple indexes. No, not currently. Start your implementation with just a single index... unless it is very large, it will likely be fast enough. Solr also handles all the concurrency issues, and you should never hit "lock access timeout" when updating from multiple threads. -Yonik
Re: Index Concurrency
> Yes, coordination between the main index searcher, the index writer, > and the index reader needed to delete other documents. Can you point me to any documentation/code that describes this implementation? > That's weird... I've never seen that. > The lucene write lock is only obtained when the IndexWriter is created. > Can you post the relevant part of the log file where the exception > happens? After doing some more testing, I believe it was a stale lock file that was causing me to have these lock issues yesterday - sorry for the false alarm :) > Also, unless you have at least 6 CPU cores or so, you are unlikely to > see greater throughput with 10 threads. If you add multiple documents > per HTTP-POST (such that HTTP latency is minimized), the best setting > would probably be nThreads == nCores. For a single doc per POST, more > threads will serve to cover the latency and keep Solr busy. I agree with your thinking here. My requirement for a large number of threads is somewhat of an artifact of my current system design. I'm trying not to serialize the system's processing at the point of indexing. -- View this message in context: http://www.nabble.com/Index-Concurrency-tf3718634.html#a10424207 Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index Concurrency
On 5/10/07, joestelmach <[EMAIL PROTECTED]> wrote: > Yes, coordination between the main index searcher, the index writer, > and the index reader needed to delete other documents. Can you point me to any documentation/code that describes this implementation? Look at SolrCore.getSearcher() and DirectUpdateHandler2. -Yonik
Delete entire index
Hi, Is there a way to have Solr completely remove the current index? ? We're still in development and so our schema is wavering. Anytime we make a change and want to re-index we first have to: stop tomcat (or the solr webapp) manually remove the data/index restart tomcat (or the solr webapp) The removing of the data/index directory is where we have the most trouble, because of the file permissions. The data/index directory is owned by tomcat/tomcat so in order to remove it, we have to issue sudo rm which we'd like to avoid. Ideally if we could just tell Solr to delete all data without having to do anymore manual work, it'd be great! : ) Something else that would help is if we tell Tomcat/Solr which user/ group and/or permission to use on the data/index directory when it's created. Any thoughts on this? Matt
solr index problem
when i index 1.7m docs and 4k-5k per doc. OutOfMemory happen when it finish index ~1.13m docs I just restart tomcat , delete all lock and restart do index. No error or warning infor until it finish. anyone know why? or have the same error? -- regards jl
Re: Optimize index
You optimize by sending a command to the SOLR update handler. I'm not sure about the different index formats though.. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Aug 8, 2007, at 12:32 PM, Jae Joo wrote: Does anyone know how to optimize the index and what the difference between compound format and stand format? Thanks, Jae Joo