Re: Near Real Time + Facets + Hierarchical Faceting (Pivot Table) with Date Range: huge data set

2012-08-20 Thread Fuad Efendi
NRT does not work because the index updates hundreds of times per second
while cache warm-up takes a few minutes… and we are in a loop…

 allowing you to query
 your huge index in ms.

Solr also lets you query in ms. What is the difference? No one can sort
1,000,000 terms in descending count order faster than the current Solr
implementation, and FieldCache & UnInvertedField can't be used together
with NRT… the cache is discarded a few times per second!

- Fuad
http://www.tokenizer.ca




On 12-08-14 8:17 AM, Nagendra Nagarajayya
nnagaraja...@transaxtions.com wrote:

You should try the realtime NRT available with Apache Solr 4.0 with
RankingAlgorithm 1.4.4; it allows faceting in realtime.

RankingAlgorithm 1.4.4 also provides an age feature that allows you to
retrieve the most recently changed docs in realtime, allowing you to query
your huge index in ms.

You can get more information and also download from here:

http://solr-ra.tgels.org

Regards

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

ps. Note: Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external
implementation


On 8/13/2012 11:38 AM, Fuad Efendi wrote:
 SOLR-4.0

 I am trying to implement this; funny idea to share:

 1. http://wiki.apache.org/solr/HierarchicalFaceting
 Unfortunately it does not support date ranges. However, a workaround: use
 the String type instead of *_tdt and define fields such as
 published_hour
 published_day
 published_week
 …

 Of course you will need to stick with a timezone; but you can add an
 index (or indexes) for each timezone. And most important: string facets
 are much faster than trie-date ranges.
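
[For illustration, a minimal SolrJ sketch of the string-bucket idea. The
server URL and bucket formats are assumptions, and the week bucket is
omitted for brevity; field names follow the post.]

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class DateBucketFacets {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        // Pre-compute string buckets in one fixed timezone at index time.
        SimpleDateFormat hour = new SimpleDateFormat("yyyy-MM-dd'T'HH");
        SimpleDateFormat day = new SimpleDateFormat("yyyy-MM-dd");
        hour.setTimeZone(TimeZone.getTimeZone("UTC"));
        day.setTimeZone(TimeZone.getTimeZone("UTC"));

        SolrInputDocument doc = new SolrInputDocument();
        Date published = new Date();
        doc.addField("id", "doc-1");
        doc.addField("published_hour", hour.format(published)); // e.g. 2012-08-20T14
        doc.addField("published_day", day.format(published));   // e.g. 2012-08-20
        server.add(doc);
        server.commit();

        // Facet on the string bucket instead of running a trie-date range query.
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("published_day");
        q.setFacetMinCount(1);
        QueryResponse rsp = server.query(q);
        for (FacetField.Count c : rsp.getFacetField("published_day").getValues()) {
            System.out.println(c.getName() + ": " + c.getCount());
        }
    }
}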



 2. Our index is over 100 million documents (from social networks) and
 grows rapidly (millions a day); cache warm-up takes a few minutes;
 Near-Real-Time does not work with faceting.

 However… another workaround: we can have a Daily Core (optimized at
 midnight), plus a Current Core (only today's data, optimized), plus a
 Last Hour Core (near real time).

 The Last Hour data is small enough that we can use facets with the Near
 Real Time feature.

 A service layer will aggregate search results from the three cores, so
 the whole thing will be near real time.
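
[A hedged sketch of that service layer using Solr's standard distributed
search (the shards parameter), assuming three cores named daily, today,
and lasthour on one host; those names and the URL are made up. Distributed
search expects unique keys across the cores.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ThreeTierSearch {
    public static void main(String[] args) throws Exception {
        // Send the query to any one core; the shards param fans it out and
        // Solr merges documents and facet counts from all three.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/lasthour");

        SolrQuery q = new SolrQuery("text:solr");
        q.set("shards", "localhost:8983/solr/daily,"
                + "localhost:8983/solr/today,"
                + "localhost:8983/solr/lasthour");
        q.setFacet(true);
        q.addFacetField("published_day");

        QueryResponse rsp = server.query(q);
        System.out.println("merged hits: " + rsp.getResults().getNumFound());
    }
}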



 Any thoughts? Thanks,









Re: Near Real Time + Facets + Hierarchical Faceting (Pivot Table) with Date Range: huge data set

2012-08-14 Thread Nagendra Nagarajayya
You should try the realtime NRT available with Apache Solr 4.0 with
RankingAlgorithm 1.4.4; it allows faceting in realtime.


RankingAlgorithm 1.4.4 also provides an age feature that allows you to
retrieve the most recently changed docs in realtime, allowing you to query
your huge index in ms.


You can get more information and also download from here:

http://solr-ra.tgels.org

Regards

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

ps. Note: Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external 
implementation



On 8/13/2012 11:38 AM, Fuad Efendi wrote:

SOLR-4.0

I am trying to implement this; funny idea to share:

1. http://wiki.apache.org/solr/HierarchicalFaceting
Unfortunately it does not support date ranges. However, a workaround: use
the String type instead of *_tdt and define fields such as
published_hour
published_day
published_week
…

Of course you will need to stick with a timezone; but you can add an
index (or indexes) for each timezone. And most important: string facets are
much faster than trie-date ranges.



2. Our index is over 100 million documents (from social networks) and grows
rapidly (millions a day); cache warm-up takes a few minutes; Near-Real-Time
does not work with faceting.

However… another workaround: we can have a Daily Core (optimized at
midnight), plus a Current Core (only today's data, optimized), plus a Last
Hour Core (near real time).

The Last Hour data is small enough that we can use facets with the Near
Real Time feature.

A service layer will aggregate search results from the three cores, so the
whole thing will be near real time.



Any thoughts? Thanks,








Re: Near Real Time + Facets + Hierarchical Faceting (Pivot Table) with Date Range: huge data set

2012-08-13 Thread Mark Miller
There is a per-segment faceting option, but I think just for single-valued
fields right now?
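
[If memory serves, that option is exposed as facet.method=fcs; a quick
hedged sketch, with server URL and field name as assumptions. The field
must be single-valued.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class PerSegmentFacet {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("published_day"); // single-valued field
        // fcs = per-segment field-cache faceting: after a commit, only new
        // segments need to be uninverted, which helps NRT-style workloads.
        q.set("facet.method", "fcs");
        System.out.println(server.query(q).getResults().getNumFound());
    }
}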


On Mon, Aug 13, 2012 at 2:38 PM, Fuad Efendi f...@efendi.ca wrote:

 SOLR-4.0

 I am trying to implement this; funny idea to share:

 1. http://wiki.apache.org/solr/HierarchicalFaceting
 Unfortunately it does not support date ranges. However, a workaround: use
 the String type instead of *_tdt and define fields such as
 published_hour
 published_day
 published_week
 …

 Of course you will need to stick with a timezone; but you can add an
 index (or indexes) for each timezone. And most important: string facets
 are much faster than trie-date ranges.



 2. Our index is over 100 million documents (from social networks) and
 grows rapidly (millions a day); cache warm-up takes a few minutes;
 Near-Real-Time does not work with faceting.

 However… another workaround: we can have a Daily Core (optimized at
 midnight), plus a Current Core (only today's data, optimized), plus a
 Last Hour Core (near real time).

 The Last Hour data is small enough that we can use facets with the Near
 Real Time feature.

 A service layer will aggregate search results from the three cores, so
 the whole thing will be near real time.



 Any thoughts? Thanks,




 --
 Fuad Efendi
 416-993-2060
 Tokenizer Inc., Canada
 http://www.tokenizer.ca
 http://www.linkedin.com/in/lucene






-- 
- Mark

http://www.lucidimagination.com


Re: Near Real Time Indexing and Searching with solr 3.6

2012-07-03 Thread Michael McCandless
Hi,

You might want to take a look at Solr's trunk (very soon to be 4.0.0
alpha release), which already has a near-real-time solution (using
Lucene's near-real-time APIs).

Lucene has NRTCachingDirectory (to use RAM for small / recently
flushed segments), but I don't think Solr uses it yet.
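
[For reference, a minimal Lucene 4.0-style sketch of NRTCachingDirectory
plus an NRT reader; the path and RAM thresholds here are arbitrary.]

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NRTCachingDirectory;
import org.apache.lucene.util.Version;

public class NrtCachingExample {
    public static void main(String[] args) throws Exception {
        // Keep small, recently flushed segments in RAM (merges up to 5 MB,
        // 60 MB cached in total); larger segments go straight to disk.
        NRTCachingDirectory dir = new NRTCachingDirectory(
                FSDirectory.open(new File("/tmp/index")), 5.0, 60.0);

        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));

        // ... add or update documents here ...

        // An NRT reader sees the uncommitted changes without a full commit.
        DirectoryReader reader = DirectoryReader.open(writer, true);
        System.out.println("numDocs=" + reader.numDocs());
        reader.close();
        writer.close();
    }
}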

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jul 3, 2012 at 4:02 AM, thomas tho...@codemium.com wrote:
 Hi,

 As part of my bachelor thesis I'm trying to achieve NRT with Solr 3.6. I've
 come up with a basic concept and would be thrilled if I could get some
 feedback.

 The main idea is to use two different indexes: one persistent on disk and
 one in RAM. The plan is to route every added and modified document to the
 RAM index (http://imgur.com/kLfUN). After a certain period of time, this
 index would get cleared and the documents added to the persistent index.

 Some major problems I still have with this idea are:
 - deletions of documents that live in the persistent index
 - having the same unique ID in both the RAM index and the persistent index,
 as a result of an updated document
 - merging search results to filter out old versions of updated documents

 Would such an idea be viable to pursue?

 Thanks for your time
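
[A hedged sketch of the search side of this two-index design on Lucene 3.6,
using a MultiReader over the disk and RAM indexes. Paths are assumptions,
and it deliberately does not solve the duplicate-ID problem listed above.]

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TwoIndexSearch {
    public static void main(String[] args) throws Exception {
        Directory diskDir = FSDirectory.open(new File("/tmp/persistent-index"));
        Directory ramDir = new RAMDirectory();

        // Writer for the RAM index; new and modified docs would go here.
        IndexWriter ramWriter = new IndexWriter(ramDir, new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
        ramWriter.commit(); // create the (empty) index so a reader can open it

        IndexReader disk = IndexReader.open(diskDir);
        IndexReader ram = IndexReader.open(ramDir);

        // One logical view over both indexes. Note: an updated doc present in
        // both indexes will show up twice; de-duplication is still needed.
        MultiReader both = new MultiReader(disk, ram);
        IndexSearcher searcher = new IndexSearcher(both);

        TopDocs hits = searcher.search(new TermQuery(new Term("body", "solr")), 10);
        System.out.println("hits: " + hits.totalHits);

        searcher.close();
        both.close(); // also closes the sub-readers
        ramWriter.close();
    }
}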



RE: Near Real Time

2009-10-21 Thread George Aroush
   Further, without the NRT features present, what's the closest I can
  expect to real time for the typical use case (obviously this will vary,
  but for the average deploy)? One hour? One minute? It seems like there are
  a few hacks to get somewhat close. Thanks so much.
 
 Depends a lot on the nature of the requests and the size of the index,
 but one minute is often doable.
 On a large index that facets on many fields per request, one minute is
 probably still out of reach.

With no facets, what index size is considered, in general, out of reach for
NRT?  Is a 9GB index with 7 million records out of reach?  How about 3GB
with 3 million records?  3GB with 800K records?  This is for a 1 min. NRT
setting.

Thanks.

-- George



Re: Near Real Time

2009-10-21 Thread Yonik Seeley
On Wed, Oct 21, 2009 at 10:19 PM, George Aroush geo...@aroush.net wrote:
 Depends a lot on the nature of the requests and the size of the index,
 but one minute is often doable.
 On a large index that facets on many fields per request, one minute is
 probably still out of reach.

 With no facets, what index size is considered, in general, out of reach for
 NRT?  Is a 9GB index with 7 million records out of reach?  How about 3GB
 with 3 million records?  3GB with 800K records?  This is for a 1 min. NRT
 setting.

With Solr 1.4, 1 min latencies should be doable in the scenarios above.

-Yonik
http://www.lucidimagination.com
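
[For what it's worth, a minimal SolrJ sketch of that 1-minute setting driven
from the client. The URL is an assumption, and HttpSolrServer is the newer
class name (on Solr 1.4 the client was CommonsHttpSolrServer); a server-side
autoCommit maxTime of 60000 ms in solrconfig.xml is the usual alternative.]

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class MinuteCommitter {
    public static void main(String[] args) {
        final SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // Commit once a minute: anything indexed in between becomes
        // searchable with at most ~60 seconds of latency.
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    server.commit();
                } catch (Exception e) {
                    e.printStackTrace(); // one failed commit shouldn't stop the schedule
                }
            }
        }, 1, 1, TimeUnit.MINUTES);
    }
}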


Re: Near real-time search of user data

2009-02-19 Thread Noble Paul നോബിള്‍ नोब्ळ्
We have a similar use case and I have raised an issue for it (SOLR-880);
currently we are using an internal patch and we hope to submit one soon.

We also use an LRU-based automatic loading/unloading feature: if a
request comes in for a core that is 'STOPPED', the core is 'STARTED'
and the request is served.

We keep an upper limit on the number of cores kept loaded, and if
the limit is crossed, the least recently used core is 'STOPPED'.
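
[A sketch of the LRU policy Noble describes, using an access-ordered
LinkedHashMap; loadCore/unloadCore and the limit of 32 are hypothetical
stand-ins for whatever the internal patch actually does.]

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCoreCache {
    private static final int MAX_LOADED_CORES = 32; // assumed upper limit

    // Access-ordered map: iteration order is least recently used first.
    private final Map<String, Object> loaded =
            new LinkedHashMap<String, Object>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                    if (size() > MAX_LOADED_CORES) {
                        unloadCore(eldest.getKey()); // 'STOP' the LRU core
                        return true;
                    }
                    return false;
                }
            };

    public synchronized Object getCore(String name) {
        Object core = loaded.get(name);
        if (core == null) {          // core is 'STOPPED': 'START' it on demand
            core = loadCore(name);
            loaded.put(name, core);
        }
        return core;
    }

    private Object loadCore(String name) { return new Object(); } // placeholder
    private void unloadCore(String name) { } // placeholder: close searcher, free RAM
}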

--Noble


On Fri, Feb 20, 2009 at 8:53 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 I've used a similar strategy for Simpy.com, but with raw Lucene and not Solr. 
  The crucial piece is to close (inactive) user indices periodically and thus 
 free the memory.  Are you doing the same with your per-user Solr cores and 
 still running into memory issues?

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: Mark Ferguson mark.a.fergu...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Friday, February 20, 2009 1:14:15 AM
 Subject: Near real-time search of user data

 Hi,

 I am trying to come up with a strategy for a Solr setup in which a user's
 indexed data can be nearly immediately available to them for search. My
 current strategy (which is starting to cause problems) is as follows:

   - each user has their own personal index (core), which gets committed
 after each update
   - there is a main index which is basically an aggregate of all user
 indexes. This index gets committed every 5 minutes or so.

 In this way, I can search a user's personal index to get real-time results,
 and concatenate the 'world' results from the main index, which aren't as
 important to be immediate.

 This multicore strategy worked well in test scenarios, but as the user
 indexes get larger it is starting to fall apart as I run into memory issues
 in maintaining too many cores. It's not realistic to dedicate a new machine
 to every 5K-10K users, yet I think this is what I will have to do to
 maintain the multicore strategy.

 So I am hoping that someone will be able to provide some tips on how to
 accomplish what I am looking for. One option is to simply send a commit to
 the main index every couple of seconds, but I was hoping someone with
 experience could shed some light on whether this is a viable option before I
 attempt that route (i.e. can commits be sent that frequently on a large
 index?). The indexes are distributed, but they could still be in the 2-100GB
 range.

 Thanks very much for any suggestions!

 Mark





-- 
--Noble Paul


Re: Near real-time search of user data

2009-02-19 Thread Mark Ferguson
Thanks Noble and Otis for your suggestions.

After reading more messages on the mailing list relating to this problem, I
decided to implement one suggestion, which was to keep an archive index and a
smaller delta index containing only recent updates, then do a distributed
search across them. The delta index is small, so it can handle rapid commits
(every 1-2 seconds). This setup works well for my architecture because it is
easy to keep track of recent changes in the database and send those to
the archive index every hour or so, then clear out the delta.
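
[A hedged sketch of that hourly move. The core URLs are assumptions, the
recent docs are re-sent from the database rather than copied between
indexes, and HttpSolrServer is the newer class name for the SolrJ client.
The delta is cleared only after the archive commit succeeds, so no document
is ever missing from both.]

import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DeltaFlush {
    public static void flush(List<SolrInputDocument> recentDocs) throws Exception {
        HttpSolrServer archive = new HttpSolrServer("http://localhost:8983/solr/archive");
        HttpSolrServer delta = new HttpSolrServer("http://localhost:8983/solr/delta");

        // 1. Re-index the last hour's docs (fetched from the database).
        if (!recentDocs.isEmpty()) {
            archive.add(recentDocs);
        }
        archive.commit();

        // 2. Only then clear the delta index.
        delta.deleteByQuery("*:*");
        delta.commit();
    }
}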

I really like your ideas about closing inactive indexes when using a
multicore setup; having too many indexes open was definitely the issue
plaguing me. Thanks for your great ideas and the time you take on this
project!

Mark



On Thu, Feb 19, 2009 at 9:31 PM, Noble Paul നോബിള്‍ नोब्ळ् 
noble.p...@gmail.com wrote:

 We have a similar use case and I have raised an issue for it (SOLR-880);
 currently we are using an internal patch and we hope to submit one soon.

 We also use an LRU-based automatic loading/unloading feature: if a
 request comes in for a core that is 'STOPPED', the core is 'STARTED'
 and the request is served.

 We keep an upper limit on the number of cores kept loaded, and if
 the limit is crossed, the least recently used core is 'STOPPED'.

 --Noble


 On Fri, Feb 20, 2009 at 8:53 AM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
 
  I've used a similar strategy for Simpy.com, but with raw Lucene and not
 Solr.  The crucial piece is to close (inactive) user indices periodically
 and thus free the memory.  Are you doing the same with your per-user Solr
 cores and still running into memory issues?
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: Mark Ferguson mark.a.fergu...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Friday, February 20, 2009 1:14:15 AM
  Subject: Near real-time search of user data
 
  Hi,
 
  I am trying to come up with a strategy for a Solr setup in which a user's
  indexed data can be nearly immediately available to them for search. My
  current strategy (which is starting to cause problems) is as follows:
 
- each user has their own personal index (core), which gets committed
  after each update
- there is a main index which is basically an aggregate of all user
  indexes. This index gets committed every 5 minutes or so.
 
  In this way, I can search a user's personal index to get real-time results,
  and concatenate the 'world' results from the main index, which aren't as
  important to be immediate.
 
  This multicore strategy worked well in test scenarios, but as the user
  indexes get larger it is starting to fall apart as I run into memory issues
  in maintaining too many cores. It's not realistic to dedicate a new machine
  to every 5K-10K users, yet I think this is what I will have to do to
  maintain the multicore strategy.
 
  So I am hoping that someone will be able to provide some tips on how to
  accomplish what I am looking for. One option is to simply send a commit to
  the main index every couple of seconds, but I was hoping someone with
  experience could shed some light on whether this is a viable option before I
  attempt that route (i.e. can commits be sent that frequently on a large
  index?). The indexes are distributed, but they could still be in the
  2-100GB range.
 
  Thanks very much for any suggestions!
 
  Mark
 
 



 --
 --Noble Paul