Re: Near Real Time + Facets + Hierarchical Faceting (Pivot Table) with Date Range: huge data set
NRT does not work here: the index is updated hundreds of times per second while cache warm-up takes a few minutes, so we are stuck in a loop.

> allowing you to query your huge index in ms.

Solr also allows querying in ms. What is the difference? No one can sort 1,000,000 terms in descending count order faster than the current Solr implementation, and FieldCache and UnInvertedCache can't be used together with NRT: the cache is discarded a few times per second!

- Fuad
http://www.tokenizer.ca

On 12-08-14 8:17 AM, Nagendra Nagarajayya nnagaraja...@transaxtions.com wrote:

> You should try realtime NRT available with Apache Solr 4.0 with RankingAlgorithm 1.4.4, which allows faceting in real time. RankingAlgorithm 1.4.4 also provides an age feature that allows you to retrieve the most recently changed docs in real time, allowing you to query your huge index in ms.
>
> You can get more information and also download it from here: http://solr-ra.tgels.org
>
> Regards
> - Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> ps. Note: Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external implementation
>
> On 8/13/2012 11:38 AM, Fuad Efendi wrote:
>
>> SOLR-4.0
>>
>> I am trying to implement this; funny idea to share:
>>
>> 1. http://wiki.apache.org/solr/HierarchicalFaceting unfortunately does not support date ranges. However, there is a workaround: use String type instead of *_tdt and define fields such as
>>    published_hour
>>    published_day
>>    published_week
>>    ...
>>    Of course you will need to stick with one timezone, but you can add an index (or indexes) for each timezone. And most important, string facets are much faster than Date Trie ranges.
>>
>> 2. Our index is over 100 million (from social networks) and grows rapidly (millions a day); cache warm-up takes a few minutes; Near-Real-Time does not work with faceting.
>>
>> However... another workaround: we can have a Daily Core (optimized at midnight), plus a Current Core (only today's data, optimized), plus a Last Hour Core (near real time).
>>
>> Last Hour data is small enough that we can use facets with the Near Real Time feature.
>>
>> The service layer will accumulate search results from the three layers, so it will be near real time.
>>
>> Any thoughts? Thanks,
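The three-core layout proposed in the original post (Daily Core, Current Core, Last Hour Core) leaves the service layer to combine facet counts from each core. A minimal sketch of that accumulation step, assuming each core's facet response has already been parsed into a plain term-to-count mapping; the function name and the sample data are hypothetical, not part of Solr:

```python
from collections import Counter

def merge_facet_counts(*per_core_counts, limit=10):
    """Sum term counts from several cores and return the top `limit` terms."""
    total = Counter()
    for counts in per_core_counts:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    # Highest count first; ties broken by term so the order is deterministic.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

# Hypothetical facet counts on a published_day string field, one dict per core.
daily = {"2012-08-12": 900, "2012-08-11": 850}
current = {"2012-08-13": 400}
last_hour = {"2012-08-13": 25}

merged = merge_facet_counts(daily, current, last_hour)
print(merged)
```

One caveat with this kind of merge: if each core returns only its own top N terms, a term that ranks just below the cut in every core can be missed, so the per-core facet limit probably needs to be larger than the final limit.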
Re: Near Real Time + Facets + Hierarchical Faceting (Pivot Table) with Date Range: huge data set
You should try realtime NRT available with Apache Solr 4.0 with RankingAlgorithm 1.4.4, which allows faceting in real time. RankingAlgorithm 1.4.4 also provides an age feature that allows you to retrieve the most recently changed docs in real time, allowing you to query your huge index in ms.

You can get more information and also download it from here: http://solr-ra.tgels.org

Regards
- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

ps. Note: Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external implementation
Re: Near Real Time + Facets + Hierarchical Faceting (Pivot Table) with Date Range: huge data set
There is a per-segment faceting option, but I think it only works for single-valued fields right now?

- Mark
http://www.lucidimagination.com
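The string-field workaround from the original post in this thread (published_hour, published_day, published_week instead of a *_tdt field) amounts to precomputing bucket labels at index time. A rough sketch of that step, assuming UTC as the single fixed timezone and ISO week numbering; the label formats and the function name are illustrative choices, not something the post specifies:

```python
from datetime import datetime, timezone

def date_bucket_fields(published: datetime) -> dict:
    """Derive string facet fields from a timezone-aware timestamp."""
    utc = published.astimezone(timezone.utc)  # stick with one timezone (assumed UTC)
    iso_year, iso_week, _ = utc.isocalendar()
    return {
        "published_hour": utc.strftime("%Y-%m-%dT%H"),
        "published_day": utc.strftime("%Y-%m-%d"),
        "published_week": f"{iso_year}-W{iso_week:02d}",
    }

fields = date_bucket_fields(datetime(2012, 8, 13, 18, 38, tzinfo=timezone.utc))
print(fields)
```

Supporting another timezone would mean computing a second set of fields (or a second index) with a different target zone, which matches the "index per timezone" remark in the post.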
Re: Near Real Time Indexing and Searching with solr 3.6
Hi,

You might want to take a look at Solr's trunk (very soon to be the 4.0.0 alpha release), which already has a near-real-time solution (using Lucene's near-real-time APIs).

Lucene has NRTCachingDirectory (to use RAM for small / recently flushed segments), but I don't think Solr uses it yet.

Mike McCandless
http://blog.mikemccandless.com

On Tue, Jul 3, 2012 at 4:02 AM, thomas tho...@codemium.com wrote:

> Hi,
>
> As part of my bachelor thesis I'm trying to achieve NRT with Solr 3.6. I've come up with a basic concept and would be thrilled if I could get some feedback.
>
> The main idea is to use two different indexes: one persistent on disk and one in RAM. The plan is to route every added and modified document to the RAM index (http://imgur.com/kLfUN). After a certain period of time, this index would get cleared and the documents added to the persistent index.
>
> Some major problems I still have with this idea are:
> - deletions of documents that are in the persistent index
> - having the same unique IDs in both the RAM index and the persistent index, as a result of an updated document
> - merging search results to filter out old versions of updated documents
>
> Would such an idea be viable to pursue?
>
> Thanks for your time
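The three open problems in the quoted two-index proposal (deletions, duplicate unique IDs, stale versions in merged results) can be illustrated with a small merge step. This is only a sketch under stated assumptions: hits are plain dicts with `id` and `score` keys, the service tracks the set of all IDs currently held in the RAM index (`ram_ids`) and the set of IDs deleted since the last flush (`deleted_ids`), and all names are hypothetical rather than Solr or Lucene APIs:

```python
def merge_results(ram_hits, disk_hits, ram_ids, deleted_ids):
    """Combine hits from a RAM index and a persistent index.

    A persistent-index hit is dropped when its ID also lives in the RAM index
    (the RAM copy is newer, even if it no longer matches the query) or when
    the document has been deleted since the last flush.
    """
    superseded = set(ram_ids) | set(deleted_ids)
    fresh_disk = [hit for hit in disk_hits if hit["id"] not in superseded]
    live_ram = [hit for hit in ram_hits if hit["id"] not in deleted_ids]
    return sorted(live_ram + fresh_disk, key=lambda hit: hit["score"], reverse=True)

disk_hits = [
    {"id": "a", "score": 1.2, "version": 1},  # updated since; newer copy is in RAM
    {"id": "b", "score": 0.9, "version": 1},  # deleted since the last flush
    {"id": "c", "score": 0.5, "version": 1},  # updated; new version no longer matches
    {"id": "d", "score": 0.7, "version": 1},  # untouched
]
ram_hits = [{"id": "a", "score": 1.0, "version": 2}]

results = merge_results(ram_hits, disk_hits, ram_ids={"a", "c"}, deleted_ids={"b"})
print(results)
```

Scores from two separate indexes are computed against different term statistics, so sorting on them directly is an approximation.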
RE: Near Real Time
>> Further, without the NRT features present, what's the closest I can expect to real time for the typical use case (obviously this will vary, but for the average deploy)? One hour? One minute? It seems like there are a few hacks to get somewhat close. Thanks so much.
>
> Depends a lot on the nature of the requests and the size of the index, but one minute is often doable. On a large index that facets on many fields per request, one minute is probably still out of reach.

With no facets, what index size is considered, in general, out of reach for NRT? Is a 9GB index with 7 million records out of reach? How about 3GB with 3 million records? 3GB with 800K records? This is for a 1 min. NRT setting.

Thanks.

-- George
Re: Near Real Time
On Wed, Oct 21, 2009 at 10:19 PM, George Aroush geo...@aroush.net wrote:

>> Depends a lot on the nature of the requests and the size of the index, but one minute is often doable. On a large index that facets on many fields per request, one minute is probably still out of reach.
>
> With no facets, what index size is considered, in general, out of reach for NRT? Is a 9GB index with 7 million records out of reach? How about 3GB with 3 million records? 3GB with 800K records? This is for a 1 min. NRT setting.

With Solr 1.4, 1 min latencies should be doable in the scenarios above.

-Yonik
http://www.lucidimagination.com
Re: Near real-time search of user data
We have a similar use case and I have raised an issue for it (SOLR-880). Currently we are using an internal patch and we hope to submit one soon.

We also use an LRU-based automatic loading/unloading feature. If a request comes in for a core that is 'STOPPED', the core is 'STARTED' and the request is served. We keep an upper limit on the number of cores kept loaded, and if the limit is crossed, the least recently used core is 'STOPPED'.

--Noble

On Fri, Feb 20, 2009 at 8:53 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

> I've used a similar strategy for Simpy.com, but with raw Lucene and not Solr. The crucial piece is to close (inactive) user indices periodically and thus free the memory. Are you doing the same with your per-user Solr cores and still running into memory issues?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Mark Ferguson mark.a.fergu...@gmail.com
> To: solr-user@lucene.apache.org
> Sent: Friday, February 20, 2009 1:14:15 AM
> Subject: Near real-time search of user data
>
>> Hi,
>>
>> I am trying to come up with a strategy for a Solr setup in which a user's indexed data can be nearly immediately available to them for search. My current strategy (which is starting to cause problems) is as follows:
>>
>> - each user has their own personal index (core), which gets committed after each update
>> - there is a main index which is basically an aggregate of all user indexes. This index gets committed every 5 minutes or so.
>>
>> In this way, I can search a user's personal index to get real-time results, and concatenate the world results from the main index, which don't need to be as immediate.
>>
>> This multicore strategy worked well in test scenarios, but as the user indexes get larger it is starting to fall apart as I run into memory issues from maintaining too many cores. It's not realistic to dedicate a new machine to every 5K-10K users, and I think this is what I will have to do to maintain the multicore strategy.
>>
>> So I am hoping that someone will be able to provide some tips on how to accomplish what I am looking for. One option is to simply send a commit to the main index every couple of seconds, but I was hoping someone with experience could shed some light on whether this is a viable option before I attempt that route (i.e. can commits be sent that frequently on a large index?). The indexes are distributed but they could still be in the 2-100GB range.
>>
>> Thanks very much for any suggestions!
>>
>> Mark

--
--Noble Paul
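The LRU-based loading/unloading that Noble describes at the top of this message can be sketched roughly as follows. This is not the SOLR-880 patch; it is a generic illustration in which `start_core` and `stop_core` are hypothetical callbacks standing in for whatever actually starts and stops a core:

```python
from collections import OrderedDict

class LruCoreManager:
    """Keep at most `max_loaded` cores started; stop the least recently used one."""

    def __init__(self, max_loaded, start_core, stop_core):
        self.max_loaded = max_loaded
        self._start_core = start_core  # hypothetical: name -> core handle
        self._stop_core = stop_core    # hypothetical: (name, handle) -> None
        self._loaded = OrderedDict()   # oldest entry = least recently used

    def get(self, name):
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
        else:
            if len(self._loaded) >= self.max_loaded:
                victim, handle = self._loaded.popitem(last=False)
                self._stop_core(victim, handle)
            self._loaded[name] = self._start_core(name)
        return self._loaded[name]

events = []

def start(name):
    events.append(("start", name))
    return f"handle:{name}"

def stop(name, handle):
    events.append(("stop", name))

manager = LruCoreManager(max_loaded=2, start_core=start, stop_core=stop)
for user_core in ["u1", "u2", "u1", "u3"]:
    manager.get(user_core)
print(events)
```

In the run above, `u2` is the core that gets stopped when `u3` is requested, because `u1` was touched more recently.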
Re: Near real-time search of user data
Thanks Noble and Otis for your suggestions. After reading more messages on the mailing list relating to this problem, I decided to implement one suggestion which was to keep an archive index and a smaller delta index containing only recent updates, then do a distributed search across them. The delta index is small so can handle rapid commits (every 1-2 seconds). This setup works well for my architecture because it is easy to keep track of recent changes in the database and then send those to the archive index every hour or so, then clear out the delta. I really like your ideas about closing inactive indexes when using a multicore setup; having too many indexes open was definitely the issue plaguing me. Thanks for your great ideas and the time you take on this project! Mark
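For the archive-plus-delta setup Mark settled on, the distributed search itself can be expressed with Solr's `shards` request parameter. A minimal sketch that only builds the request URL; the host names, core names, and query are assumptions made up for illustration:

```python
from urllib.parse import urlencode

def distributed_select_url(base_url, shards, query, rows=10):
    """Build a Solr /select URL that fans the query out to the given shards."""
    params = {
        "q": query,
        "rows": rows,
        "shards": ",".join(shards),  # comma-separated host:port/path entries
    }
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical layout: a large archive core and a small, frequently committed delta core.
url = distributed_select_url(
    "http://localhost:8983/solr/delta",
    ["localhost:8983/solr/archive", "localhost:8983/solr/delta"],
    "title:solr",
)
print(url)
```

Distributed search expects unique keys to be unique across shards, so a document updated in the delta while an older copy still sits in the archive may need extra handling until the hourly flush.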