RE: Solr vs. Compass
Minutello, Nick wrote: Maybe spend some time playing with Compass rather than speculating ;)

I spent a few weeks studying the Compass source code three years ago, and the Compass docs said the same then as they say now: "Compass::Core provides support for two phase commit transactions (read_committed and serializable), implemented on top of Lucene index segmentations. The implementation provides fast commits (faster than Lucene), though they do require the concept of Optimizers that will keep the index at bay. Compass::Core comes with support for Local and JTA transactions, and Compass::Spring comes with Spring transaction synchronization. When only adding data to the index, Compass comes with the batch_insert transaction, which is the same IndexWriter operation with the same usual suspects for controlling performance and memory."

It is just blatant advertising, a trick; even the JavaDocs remain unchanged... The clever Compass guys can re-apply the transaction log to Lucene in case of a server crash (for instance, if the server was killed _before_ Lucene flushed a new segment to disk). Internally it is implemented as a background thread. Nothing in the docs says Lucene is part of the transaction; I studied the source - it is just 'speculating'.

Minutello, Nick wrote: If it helps, on the project where I last used Compass, we had what I consider to be a small dataset - just a few million documents. Nothing related to indexing/searching took more than a second or two - mostly it was tens or hundreds of milliseconds. That app has been live almost 3 years.

I did the same, and I was happy with Compass: I got Lucene-powered search without any development. But I ran into performance problems after a few weeks... I needed about 300 TPS, and the Compass-based approach didn't work. With SOLR, I get 4,000 index updates per second.

-Fuad
http://www.tokenizer.org
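For reference, a minimal SolrJ 1.4 sketch of the kind of batched, streamed indexing that reaches update rates like the above; the URL, queue size, thread count, and field names are assumptions, not the actual setup:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
      public static void main(String[] args) throws Exception {
        // Buffer up to 20000 docs and stream them over 4 connections;
        // commit once at the end instead of per document.
        SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 20000, 4);
        for (int i = 0; i < 1000000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", Integer.toString(i));
          doc.addField("text", "document body " + i);
          server.add(doc);
        }
        server.commit();
      }
    }

Committing once per batch rather than per document is what makes the difference here; per-document commits would collapse the throughput.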
SOLR <uniqueKey> - extremely strange behavior! Documents disappeared...
After running an application which heavily uses an MD5 HEX representation as the uniqueKey for SOLR v1.4-dev-trunk:

1. After 30 hours: 101,000,000 documents added
2. Commit: numDocs = 783,714, maxDoc = 3,975,393
3. Upload new docs to SOLR for 1 hour(!!!), then commit, then optimize: numDocs = 1,281,851, maxDoc = 1,281,851

It looks _extremely_ strange that within an hour I got such a huge increase with the same 'average' document set... I suspect something goes wrong with the Lucene buffer flush / index merge OR SOLR's unique ID handling... According to my own estimates, I should have about 10,000,000 new documents by now... I had 0.5 million within an hour, and 0.8 million within a day; same 'random' documents.

This morning the index size was about 4Gb, then it suddenly dropped below 0.5Gb. Why? I haven't issued any commit...

I am using ramBufferSizeMB=8192
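A minimal sketch of how such a uniqueKey might be produced (the URL-as-raw-key choice and the helper itself are illustrative assumptions, not the actual application code):

    import java.security.MessageDigest;

    public class Md5Key {
      // Hex-encode the MD5 digest of a raw key (e.g. a crawled URL)
      public static String md5Hex(String rawKey) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(rawKey.getBytes("UTF-8"));
        StringBuilder sb = new StringBuilder(32);
        for (byte b : digest) {
          sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
      }
    }

With this scheme, two documents get the same uniqueKey exactly when their raw keys are identical, so every re-crawl of the same URL overwrites the previous document.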
Re: JVM Heap utilization & Memory leaks with Solr
Can you please tell me how many non-tokenized single-valued fields your schema uses, and how many documents? Thanks, Fuad

Rahul R wrote: My primary issue is not an Out of Memory error at run time. It is memory leaks: heap space not being released even after a forced GC. So as progressively more heap gets utilized, I eventually start running out of memory. The verdict however seems unanimous that there are no known memory leak issues within Solr. I am still looking at my application to analyse the problem. Thank you.

On Thu, Aug 13, 2009 at 10:58 PM, Fuad Efendi f...@efendi.ca wrote: Most OutOfMemoryExceptions (if not 100%) happening with SOLR are because of http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/FieldCache.html - it is used internally in Lucene to cache field values per document ID. My very long-term observation: SOLR can run without any problems for days or months, and then an unpredictable OOM happens just because someone tried a sorted search, which populates an array with the IDs of ALL documents in the index. The only solution: calculate exactly the amount of RAM needed for the FieldCache... For instance, for 100,000,000 documents a single instance of FieldCache may require 8*100,000,000 bytes (8 bytes per document ID?), which is almost 1Gb (at least!) I haven't noticed any memory leaks since I started using 16Gb RAM for the SOLR instance (almost a year without any restart!)

-Original Message-
From: Rahul R [mailto:rahul.s...@gmail.com]
Sent: August-13-09 1:25 AM
To: solr-user@lucene.apache.org
Subject: Re: JVM Heap utilization & Memory leaks with Solr

*You should try to generate heap dumps and analyze the heap using a tool like the Eclipse Memory Analyzer. Maybe it helps spotting a group of objects holding a large amount of memory* The tool that I used also allows capturing heap snapshots. Eclipse had a lot of pre-requisites; you need to apply some three or five patches before you can start using it. My observation with this tool was that some HashMaps were taking up a lot of space, although I could not pin it down to the exact HashMap - these would be either weblogic's or Solr's. I will give Eclipse's a try anyway and see how it goes. Thanks for your input. Rahul

On Wed, Aug 12, 2009 at 2:15 PM, Gunnar Wagenknecht gun...@wagenknecht.org wrote: Rahul R schrieb: I tried using a profiling tool - Yourkit. The trial version was free for 15 days. But I couldn't find anything of significance. You should try to generate heap dumps and analyze the heap using a tool like the Eclipse Memory Analyzer. Maybe it helps spotting a group of objects holding a large amount of memory. -Gunnar

--
Gunnar Wagenknecht
gun...@wagenknecht.org
http://wagenknecht.org/
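A back-of-the-envelope check of that estimate (a sketch in Java; the 8-bytes-per-entry figure is taken from the post above, not measured):

    public class FieldCacheEstimate {
      public static void main(String[] args) {
        long numDocs = 100000000L;       // documents in the index
        long bytesPerEntry = 8L;         // per the estimate above
        long perSortField = numDocs * bytesPerEntry;
        // 800,000,000 bytes ~ 0.75 GB -- "almost 1Gb"; each distinct
        // field you sort on populates its own cache of this size
        System.out.println(perSortField / (1024.0 * 1024.0 * 1024.0)
            + " GB per sorted field");
      }
    }

Note that every distinct sort field adds its own FieldCache entry, so the budget multiplies with the number of sortable fields actually used.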
Re: SOLR <uniqueKey> - extremely strange behavior! Documents disappeared...
But how to explain that within an hour (after commit) I got about 500,000 new documents, yet within 30 hours (after commit) only 783,714? Same _random_enough_ documents... BTW, the SOLR Console was showing only a few hundred deletesById, although I don't use any deleteById explicitly; only update with allowOverwrite and uniqueId.

markrmiller wrote: I'd say you have a lot of documents that have the same id. When you add a doc with the same id, first the old one is deleted, then the new one is added (atomically, though). The deleted docs are not removed from the index immediately though - the doc id is just marked as deleted. Over time, as segments are merged due to hitting triggers while adding new documents, deletes are removed (which deletes depends on which segments have been merged). So if you add a ton of documents over time, many with the same ids, you would likely see this type of maxDoc/numDoc churn. maxDoc will include deleted docs while numDoc will not.
--
- Mark
http://www.lucidimagination.com
Re: SOLR <uniqueKey> - extremely strange behavior! Documents disappeared...
One more hour, and I have +0.5 million more (after commit/optimize). Something strange is happening with the SOLR buffer flush (if we have a single segment???)... an explicit commit prevents it...

30 hours, with index flush, commit: 783,714
+ 1 hour, commit, optimize: 1,281,851
+ 1 hour, commit, optimize: 1,786,552

Same random docs retrieved from the web...

Funtick wrote: But how to explain that within an hour (after commit) I got about 500,000 new documents, yet within 30 hours (after commit) only 783,714? Same _random_enough_ documents... BTW, the SOLR Console was showing only a few hundred deletesById although I don't use any deleteById explicitly; only update with allowOverwrite and uniqueId.
Re: JVM Heap utilization & Memory leaks with Solr
BTW, you should really prefer JRockit, which really rocks!!! Mission Control has the necessary tooling, and JRockit produces a _nice_ exception stacktrace (explaining almost everything) even in case of an OOM, which the Sun JVM still fails to produce. SolrServlet still catches Throwable:

    } catch (Throwable e) {
      SolrException.log(log, e);
      sendErr(500, SolrException.toStr(e), request, response);
    } finally {

Rahul R wrote: Otis, Thank you for your response. I know there are a few variables here, but the difference in memory utilization with and without shards somehow leads me to believe that the leak could be within Solr. I tried using a profiling tool - Yourkit. The trial version was free for 15 days. But I couldn't find anything of significance. Regards, Rahul

On Tue, Aug 4, 2009 at 7:35 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Rahul, A) There are no known (to me) memory leaks. I think there are too many variables for a person to tell you what exactly is happening, plus you are dealing with the JVM here. :) Try jmap -histo:live PID-HERE | less and see what's using your memory. Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message From: Rahul R rahul.s...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, August 4, 2009 1:09:06 AM Subject: JVM Heap utilization & Memory leaks with Solr

I am trying to track memory utilization with my application that uses Solr. Details of the setup:
- 3rd party software: Solaris 10, Weblogic 10, jdk_150_14, Solr 1.3.0
- Hardware: 12 CPU, 24 GB RAM

For testing during PSR I am using a smaller subset of the actual data that I want to work with. Details of this smaller subset: 5 million records, 4.5 GB index size.

Observations during PSR:
A) I have allocated 3.2 GB for the JVM(s) that I used. After all users log out and a forced GC, only 60% of the heap is reclaimed. As part of the logout process I am invalidating the HttpSession and doing a close() on CoreContainer. From my application's side, I don't believe I am holding on to any resource. I wanted to know if there are known issues surrounding memory leaks with Solr?
B) To further test this, I tried deploying with shards. 3.2 GB was allocated to each JVM. All JVMs had 96% free heap space after start-up. I got varying results with this.
Case 1: Used 6 weblogic domains. My application was deployed on 1 domain. I split the 5 million index into 5 parts of 1 million each and used them as shards. After multiple users used the system and a forced GC, around 94-96% of heap was reclaimed in all the JVMs.
Case 2: Used 2 weblogic domains. My application was deployed on 1 domain. On the other, I deployed the entire 5 million index as one shard. After multiple users used the system and a forced GC, around 76% of the heap was reclaimed in the shard JVM, and 96% was reclaimed in the JVM where my application was running. This result further convinces me that my application can be absolved of holding on to memory resources. I am not sure how to interpret these results.

For searching, I am using:
Without shards: EmbeddedSolrServer
With shards: CommonsHttpSolrServer

In terms of Solr objects this is what differs in my code between normal search and shards search (distributed search). After looking at Case 1, I thought that the CommonsHttpSolrServer was more memory efficient, but Case 2 proved me wrong. Or could there still be memory leaks in my application?
Any thoughts or suggestions would be welcome. Regards, Rahul
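As a follow-up to Otis's jmap hint: the Eclipse Memory Analyzer that Gunnar recommends reads binary heap dumps; assuming a Java 6 jmap (the JDK 5 syntax differs), one can be produced with:

    jmap -dump:live,format=b,file=solr-heap.hprof <pid>

and then opened in MAT to see, via the dominator tree, which HashMaps actually hold the unreclaimed heap.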
Re: SOLR <uniqueKey> - extremely strange behavior! Documents disappeared...
UPDATE: After a few more minutes (after the previous commit):
docsPending: about 7,000,000
After commit: numDocs: 2,297,231
Increase = 2,297,231 - 1,281,851 = 1,000,000 (average)

So I have 7 docs with the same ID on average. Having 100,000,000 adds and then dropping below 1,000,000 is strange; it is a bug somewhere... I need to investigate ramBufferSize and MergePolicy, including the SOLR uniqueId implementation...
Re: SOLR <uniqueKey> - extremely strange behavior! Documents disappeared...
Sorry for the typo in the previous message; it should be:
Increase = 2,297,231 - 1,786,552 = 500,000 (average)
RATE (non-unique-id : unique-id) = 7,000,000 : 500,000 = 14:1
but 125:1 (for the initial 30 hours) was very strange...
Contributions Needed: Faceting Performance, SOLR Caching
Users, Developers, Possible Contributors,

Hi, recently I did some code hacks, and I am now using frequency calculations on TermVectors instead of the default out-of-the-box DocSet intersections. It improves performance hundreds of times at the shopping engine http://www.tokenizer.org - please check http://issues.apache.org/jira/browse/SOLR-711 - I feel the term faceting (and the related architectural decision made for CNET several years ago) is completely wrong. Default SOLR response times: 30-180 seconds; with TermVectors: 0.2 seconds (25 million documents, tokenized field). For a non-tokenized field it also looks natural to use frequency calculations, but I have not done it yet. Sorry... too busy with Liferay Portal contract assignments, http://www.linkedin.com/in/liferay

Another possible performance improvement: create a safe concurrent cache for SOLR; you may check LingPipe, and also http://issues.apache.org/jira/browse/SOLR-665 and http://issues.apache.org/jira/browse/SOLR-667. Lucene developers are doing a great job removing synchronization in several places too, such as the isDeleted() method call... it would be nice to have an unsynchronized API version for read-only indexes.

Thanks!
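To illustrate the idea (this is not the actual SOLR-711 patch): a minimal Lucene 2.x sketch that derives facet counts for a result set from stored term vectors; the field name "category" is an assumption, and the field must be indexed with term vectors enabled:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    public class TermVectorFacets {
      // Sum term frequencies across the matching docs instead of running
      // one DocSet intersection per facet value.
      static Map<String, Integer> count(IndexReader reader, int[] matchingDocIds,
                                        String field) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int docId : matchingDocIds) {
          TermFreqVector tfv = reader.getTermFreqVector(docId, field);
          if (tfv == null) continue;  // doc has no term vector for this field
          String[] terms = tfv.getTerms();
          int[] freqs = tfv.getTermFrequencies();
          for (int i = 0; i < terms.length; i++) {
            Integer c = counts.get(terms[i]);
            counts.put(terms[i], c == null ? freqs[i] : c.intValue() + freqs[i]);
          }
        }
        return counts;
      }
    }

The cost here is one term-vector read per matching document, rather than one filter intersection per distinct term - which is why it pays off on tokenized fields with many distinct terms.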
background merge hit exception
Is it a file-system error? I can commit, but I cannot optimize:

Exception in thread "main" org.apache.solr.common.SolrException: background merge hit exception: _ztu:C14604370 _105b:C1690769 _105l:C340280 _105w:C336330 _1068:C336025 _106j:C330206 _106u:C338541 _1075:C337713 _1080:C463455 into _1081 [optimize]
java.io.IOException: background merge hit exception: _ztu:C14604370 _105b:C1690769 _105l:C340280 _105w:C336330 _1068:C336025 _106j:C330206 _106u:C338541 _1075:C337713 _1080:C463455 into _1081 [optimize]
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2300)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2230)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:355)
    at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
    at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:104)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:115)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:128)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1081)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:856)
    at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
    at org.apache.tomcat.util.net.AprEndpoint$SocketWithOptionsProcessor.run(AprEndpoint.java:1949)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: No space left on device
    at java.io.RandomAccessFile.writeBytes(RandomAccessFile.java)
    at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
    at org.apache.lucene.store.FSDirectory$FSIndexOutput.flushBuffer(FSDirectory.java:632)
    at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96)
    at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85)
    at org.apache.lucene.store.BufferedIndexOutput.close(BufferedIndexOutput.java:109)
    at org.apache.lucene.store.FSDirectory$FSIndexOutput.close(FSDirectory.java:639)
    at org.apache.lucene.index.FieldsWriter.close(FieldsWriter.java:133)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:361)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)
Re: background merge hit exception
I found the answer: not enough free space in the filesystem. Optimize rewrites the whole index into a single segment, so while the merge runs it can temporarily need free disk space on the order of the index size itself, or more.

Funtick wrote: Is it a file-system error? I can commit, but I cannot optimize: Exception in thread "main" org.apache.solr.common.SolrException: background merge hit exception: _ztu:C14604370 ... into _1081 [optimize]
Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading=true
Yes, it should be extremely simple! I simply can't understand how you describe it:

Britske wrote: Rows in solr represent productcategories. I will have up to 100k of them.
- Each product category can have 10k products each. These are encoded as the 10k columns / fields (all 10k fields are int values)
- At any given time at most 1 product per productcategory is returned (analogous to selecting 1 out of 10k columns). (This is the requirement that makes this scheme possible.)
- Products in the same column have certain characteristics in common, which are encoded in the column name (using dynamic fields). So the combination of these characteristics uniquely determines 1 out of 10k columns. When the user hasn't supplied all characteristics, good defaults can be chosen, so a column can always be determined.
- On top of that, each row has 20 productcategory-fields (which all possible 10k products of that category share).

1. You can't really define 10,000 columns; you are probably using a multivalued field for that. (Sorry if I am not familiar with the newest-greatest features of SOLR such as 'dynamic fields'.)
2. You are trying to pass 'normalized data' to Lucene - but it is indeed the job of Lucene to normalize data!
3. All 10k fields are int values!? Lucene is designed for full-text search... are you trying to use Lucene instead of a database?

Sorry if I don't understand your design...

Britske wrote: Funtick wrote: You are using multivalued fields, you are not using 10k fields. And 10k is huge. The design is wrong... you should define two fields only: Category and Product. Lucene will do the rest. -Fuad
;-). Well I wish it was that simple.
Re: Facet Performance
Hoss, this is still an extremely interesting area for possible improvements; I simply don't want the topic to die:
http://www.nabble.com/Facet-Performance-td7746964.html
http://issues.apache.org/jira/browse/SOLR-665
http://issues.apache.org/jira/browse/SOLR-667
http://issues.apache.org/jira/browse/SOLR-669

I am currently using faceting on a single-valued _tokenized_ field with a huge number of documents, and an _unsynchronized_ version of FIFOCache; 1.5 seconds average response time (for faceted queries only!). I think we can use an additional cache for facet results (to store the calculated values!); Lucene's FieldCache can be used only for non-tokenized single-valued non-boolean fields.

-Fuad

hossman_lucene wrote:
: Unfortunately which strategy will be chosen is currently undocumented
: and control is a bit oblique: If the field is tokenized or multivalued
: or Boolean, the FilterQuery method will be used; otherwise the
: FieldCache method. I expect I or others will improve that shortly.

Bear in mind, what's provided out of the box is SimpleFacets ... it's designed to meet simple faceting needs ... when you start talking about hundreds or thousands of constraints per facet, you are getting outside the scope of what it was intended to serve efficiently. At a certain point the only practical thing to do is write a custom request handler that makes the best choices for your data. For the record: a really simple patch someone could submit would be to add an optional field-based param indicating which type of faceting (termenum/fieldcache) should be used to generate the list of terms, and then make SimpleFacets.getFacetFieldCounts use that and call the appropriate method instead of calling getTermCounts -- that way you could force one or the other if you know it's better for your data/query.

-Hoss
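As a sketch of the "additional cache for facet results" idea - a hypothetical class, not an existing SOLR cache - memoize the computed counts per (query, facet field) pair so a repeated faceted query skips the recount entirely:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical facet-result cache: key = query + facet field, value = counts.
    public class FacetResultCache {
      private final ConcurrentHashMap<String, Map<String, Integer>> cache =
          new ConcurrentHashMap<String, Map<String, Integer>>();

      public Map<String, Integer> get(String query, String facetField) {
        return cache.get(query + "|" + facetField);
      }

      public void put(String query, String facetField, Map<String, Integer> counts) {
        cache.put(query + "|" + facetField, counts);
      }
    }

A real implementation would bound the size (FIFO/LRU eviction) and invalidate on commit, since the counts go stale whenever the index changes.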
Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading=true
Britske wrote: I do understand that, at first glance, it seems possible to use multivalued fields, but with multivalued fields it's not possible to pinpoint the exact value within the multivalued field that I need.

I used a technique with a single document consisting of a single Category and multiple Products (multi-valued), using 'Highlight'. The Highlight feature gives you the exact subset of Products matching the queried terms, as a highlighted list...
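A request along these lines illustrates the approach (the field names are assumptions; hl and hl.fl are the standard Solr highlighting parameters):

    http://localhost:8983/solr/select?q=products:(sony+tv)&hl=true&hl.fl=products&fl=id,category

Only the values of the multivalued products field that actually match the query come back as highlight snippets, which is what pinpoints them within the field.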
Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading=true
But to answer the initial question... I think your documents are huge...

Funtick wrote: I used a technique with a single document consisting of a single Category and multiple Products (multi-valued), using 'Highlight'. The Highlight feature gives you the exact subset of Products matching the queried terms, as a highlighted list...
Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading=true
A simple design with _single_ valued fields:

Id    Category    Product
001   TV          SONY 12345
002   Radio       Panasonic 54321
003   TV          Toshiba ABCD
004   Radio       ABCD Z-54321

We have 4 documents with single-valued fields. It's not necessary to store the 'Category' field in the index... The data is not 'normalized' from a DBA's viewpoint, but it is what Lucene needs...

Britske wrote: no, I'm using dynamic fields, they've been around for a pretty long time. I use int values in the 10k fields for filtering and sorting. On top of that I use a lot of full-text filtering on the other fields, as well as faceting, etc. I do understand that, at first glance, it seems possible to use multivalued fields, but with multivalued fields it's not possible to pinpoint the exact value within the multivalued field that I need. Consider the case with 1 multi-valued field, category, as you called it, which would have at most 10k values. The meaning of these values within the field is completely lost, although it is a requirement to fetch products (thus values in the multivalued field) given a specific set of criteria. In other words, there is no way of getting a specific value from a multivalued field given a set of criteria. Now compare that with my current design, in which these criteria pinpoint a specific field / column to use, and the difference should be clear. regards, Britske
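For illustration, adding one row of that table with SolrJ might look like this minimal sketch (the field names match the table above; the SolrServer instance is assumed to exist):

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.common.SolrInputDocument;

    public class ProductIndexer {
      // One row of the table above becomes one Solr document.
      static void addRow(SolrServer server, String id, String category,
                         String product) throws SolrServerException, IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("category", category);
        doc.addField("product", product);
        server.add(doc);
      }
    }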
Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading=true
With a many-to-many relationship between Category and Product we can go with a multivalued Category field, or we can even have repeated values in single-valued Category / Point-of-Interest fields; it's not necessary to store all fields in the index - you can store a pointer to a database primary key, for instance.

Id    Category      Point-of-Interest
001   Attraction    CN Tower
002   Hotel         CN Tower
003   Hotel         Sheraton
004   Restaurant    CN Tower
Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading=true
Britske wrote: When performing these queries I notice a big difference between qTime (which is mostly in the 15-30 ms range due to caching) and the total time taken to return the response (measured through SolrJ's elapsedTime), which takes between 500-1600 ms. Documents have a lot of stored fields (more than 10,000), but at any given query a maximum of, say, 20 are returned (through the fl field) or used (as part of filtering, faceting, sorting).

Hi Britske, how do you manage 10,000 field definitions? Sorry, I didn't understand...

Guys, I am constantly seeing the same problem, although I have just a few small fields defined, lazyLoading is disabled, and memory is more than enough (25Gb for SOLR, 7Gb for the OS, 3Gb index). Britske, do you see the difference with faceted queries only?

Yonik, I suspect there is a _bug_ with SOLR faceting, so that the faceted query time (qtime) is 10-20ms while the elapsed time is huge; SOLR has a filterCache where the key is a 'filter'; SOLR does not have any queryFacetResultCache where the key is a 'query' and the value is 'facets'... Am I right?

-Fuad
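To see the discrepancy directly, a minimal SolrJ check (the query and facet field are placeholders): QueryResponse exposes both the server-side qTime and the client-side elapsed time:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TimingCheck {
      static void check(SolrServer server) throws Exception {
        SolrQuery q = new SolrQuery("category:TV");
        q.setFacet(true);
        q.addFacetField("category");
        QueryResponse rsp = server.query(q);
        // qTime = time Solr reports for query execution;
        // elapsedTime = wall-clock time seen by the client, including
        // response writing, transfer and parsing.
        System.out.println("qTime=" + rsp.getQTime() + "ms, elapsed="
            + rsp.getElapsedTime() + "ms");
      }
    }

A large gap between the two points at work done outside query execution - response writing, document retrieval, or network - rather than the search itself.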
Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading=true
Britske wrote: - Rows in solr represent productcategories. I will have up to 100k of them. - Each product category can have 10k products each. These are encoded as the 10k columns / fields (all 10k fields are int values)

You are using multivalued fields; you are not using 10k fields. And 10k is huge. The design is wrong... you should define two fields only: Category and Product. Lucene will do the rest.

-Fuad
Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading=true
Funtick wrote: You are using multivalued fields; you are not using 10k fields. And 10k is huge. The design is wrong... you should define two fields only: Category and Product. Lucene will do the rest. -Fuad

That is: two _single_valued_ fields per document, Category and Product... (maybe some additional fields such as ISDN, Price, a copy-field for faceting, etc.)
Re: Upgrade lucene to 2.3
Special things:
- 2.3.1 fixes bugs with 'autocommit' in version 2.3.0
- I am hitting OutOfMemoryError constantly, and I can't yet understand where the problem is... I didn't have it with the default SOLR 1.2 installation. It's not memory-cache related; most probably it is a bug somewhere...

Yongjun Rong-2 wrote: It seems the latest lucene 2.3 has some improvements in performance. I'm just wondering if it is OK for us to easily upgrade solr's lucene from 2.1 to 2.3. Is there any special thing we need to know other than just replacing the lucene jars in the lib directory?
Re: Problems querying Russian content
Hi Daniel,

Ensure that UTF-8 is everywhere... SOLR, web server, app server, HTTP headers, etc. And do not send the Cyrillic query text raw (or as HTML numeric entities such as &#1041;...); use the URL-encoded form instead:

q=%D0%91%D0%B0%D0%BC%D0%B1%D0%B0%D1%80%D0%B1%D0%B8%D0%B0+%D0%9A%D0%B8%D1%80%D0%BA%D1%83%D0%B4%D1%83

(that is, "Бамбарбиа Киркуду" encoded as UTF-8 percent-escapes)

http://www.tokenizer.org is a search engine, SOLR powered... I need to add some large Internet shops from Russia to the crawler...

Quoting Daniel Alheiros: Hi, I'm in trouble now about how to issue queries against Solr using Russian content in my q parameter (it applies to Chinese and Arabic as well). The problem is I can't send any Russian special characters in URLs because they don't fit in the ASCII domain, so I'm doing a POST to accomplish that. My application gets the request and logs it (and the Russian characters appear correctly in my logs) and then calls the Solr server, and Solr is not receiving it correctly... I can just see the special characters as question marks in the Solr log... Did anyone face problems like that? My whole system is set to work in UTF-8 (browser, application servers). Regards, Daniel
http://www.bbc.co.uk/
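In Java, the encoding step might look like this minimal sketch (the host and core URL are assumptions):

    import java.net.URLEncoder;

    public class QueryEncoder {
      public static void main(String[] args) throws Exception {
        // Percent-encode the UTF-8 bytes of the Cyrillic query text
        String q = URLEncoder.encode("Бамбарбиа Киркуду", "UTF-8");
        // -> %D0%91%D0%B0%D0%BC%D0%B1%D0%B0%D1%80%D0%B1%D0%B8%D0%B0+%D0%9A%D0%B8%D1%80%D0%BA%D1%83%D0%B4%D1%83
        System.out.println("http://localhost:8983/solr/select?q=" + q);
      }
    }

The resulting URL is pure ASCII, so it survives any transport; Solr decodes it back to the original Cyrillic as long as it treats the request as UTF-8.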
Re: To make sure XML is UTF-8
Tiong Jeffrey wrote: Though this is not directly related to Solr, I have XML output from a MySQL database, but during indexing the XML output is not working. The problem is that part of the XML output is not in UTF-8 encoding; how can I convert it to UTF-8, and how do I know what kind of encoding it uses in the first place (the data I export from the MySQL database)? Thanks!

You won't have any problem with standard JAXP and java.util.* etc. classes, even with complex MySQL data (one column is LATIN1, another is LATIN2, another is ASCII, ...). In Java, use the standard classes: String, Long, Date. And use JAXP.
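A minimal JAXP sketch of that approach (the file names are placeholders; it assumes the source XML declares its actual encoding correctly, so the parser can decode it before re-serializing as UTF-8):

    import java.io.File;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class Utf8Rewriter {
      public static void main(String[] args) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Force the output serialization to UTF-8, whatever the input was
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.transform(new StreamSource(new File("export.xml")),
                    new StreamResult(new File("export-utf8.xml")));
      }
    }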
Re: To make sure XML is UTF-8
Tiong Jeffrey wrote: Though this is not directly related to Solr, I have XML output from a MySQL database, but during indexing the XML output is not working. The problem is that part of the XML output is not in UTF-8 encoding; how can I convert it to UTF-8, and how do I know what kind of encoding it uses in the first place (the data I export from the MySQL database)? Thanks!

How do you generate the XML output? Output itself is just a raw byte array; it has a transport and an encoding. If you save it to a file system and forget about the transport-layer encoding, you will get some new problems... "during indexing the XML output is not working" - what exactly happens, what kind of error messages?