Re: Fast autocomplete for large dataset

2015-08-01 Thread Olivier Austina
Thank you Erick for your replies and the link. Regards Olivier 2015-08-02 3:47 GMT+02:00 Erick Erickson: > Here's some background: > > http://lucidworks.com/blog/solr-suggester/ > > Basically, the limitation is that to build the suggester all docs in > the index need to be read to pull out the

Re: solr multicore vs sharding vs 1 big collection

2015-08-01 Thread Shawn Heisey
On 8/1/2015 6:49 PM, Jay Potharaju wrote: > I currently have a single collection with 40 million documents and index > size of 25 GB. The collection gets updated every n minutes and as a result > the number of deleted documents is constantly growing. The data in the > collection is an amalgamation

Re: Fast autocomplete for large dataset

2015-08-01 Thread Erick Erickson
Here's some background: http://lucidworks.com/blog/solr-suggester/ Basically, the limitation is that to build the suggester all docs in the index need to be read to pull out the stored field and build either the FST or the sidecar Lucene index, which can be a _very_ costly operation (as in minute
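
The snippet above explains that the suggester is rebuilt by reading the stored field out of every document in the index. As a rough illustration of that cost model (plain Python, not Solr's actual FST code): the build is a full pass over the corpus, while lookups afterwards are cheap prefix scans over a sorted structure.

```python
import bisect

def build_suggester(docs, field="title"):
    """The costly step the post describes: a full pass over every doc
    to pull out the stored field. Returns a sorted term list usable
    for fast prefix lookup (a stand-in for the FST / sidecar index)."""
    return sorted((doc[field].lower(), doc.get("weight", 1)) for doc in docs)

def suggest(entries, prefix, n=5):
    """Binary-search to the first entry >= prefix, then scan while it matches."""
    lo = bisect.bisect_left(entries, (prefix,))
    out = []
    for term, weight in entries[lo:]:
        if not term.startswith(prefix) or len(out) == n:
            break
        out.append(term)
    return out

docs = [{"title": "Solr in Action"}, {"title": "Solr suggester"},
        {"title": "Lucene FST"}]
entries = build_suggester(docs)   # O(total docs) -- grows with the index
print(suggest(entries, "solr"))   # -> ['solr in action', 'solr suggester']
```

The point of the sketch: `build_suggester` touches every document, so rebuild time scales with index size even when each individual lookup stays fast.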

Re: solr multicore vs sharding vs 1 big collection

2015-08-01 Thread Erick Erickson
40 million docs isn't really very many by modern standards, although if they're huge documents then that might be an issue. So is this a single shard or multiple shards? If you're really facing performance issues, simply making a new collection with more than one shard (independent of how many rep

solr multicore vs sharding vs 1 big collection

2015-08-01 Thread Jay Potharaju
Hi, I currently have a single collection with 40 million documents and an index size of 25 GB. The collection gets updated every n minutes, and as a result the number of deleted documents is constantly growing. The data in the collection is an amalgamation of more than 1000+ customer records. The numb

Re: Avoid re indexing

2015-08-01 Thread Nagasharath
Yes, shard splitting will only help in managing large clusters and improving query performance. In my case, as the index size is fully grown (no capacity left in the existing shards) across the collection, adding a new shard will help, and for that I have to re-index. > On 01-Aug-2015, at 6:34 p

Re: Avoid re indexing

2015-08-01 Thread Upayavira
Erm, that doesn't seem to make sense. Seems like you are talking about *merging* shards. Say you had two shards, 3m docs each:

shard1: 3m docs
shard2: 3m docs

If you split shard1, you would have:

shard1_0: 1.5m docs
shard1_1: 1.5m docs
shard2: 3m docs

You could, of course, then split shard2. Y
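
The halving described above follows from how a split partitions a shard's hash range: Solr routes each document by hashing its uniqueKey into a shard's range, and SPLITSHARD divides that range into two contiguous sub-ranges. A simplified sketch (small illustrative ranges, not Solr's real 32-bit hash space):

```python
def split_range(lo, hi):
    """Divide a shard's hash range into two contiguous halves, the way
    a shard split partitions its documents between two sub-shards."""
    mid = (lo + hi) // 2
    return (lo, mid), (mid + 1, hi)

def route(doc_hash, ranges):
    """Route a document's hash to whichever (sub-)shard owns that range."""
    for name, (lo, hi) in ranges.items():
        if lo <= doc_hash <= hi:
            return name
    raise ValueError("hash outside all ranges")

# shard1 covered 0..99; after splitting, each sub-shard owns half the range,
# so each ends up with roughly half of shard1's documents (3m -> 1.5m + 1.5m)
r0, r1 = split_range(0, 99)
ranges = {"shard1_0": r0, "shard1_1": r1}
print(route(10, ranges), route(80, ranges))  # -> shard1_0 shard1_1
```

Because hashes are roughly uniform, each sub-shard receives about half the parent's documents; splitting never increases total capacity by itself, it only spreads the same documents over more shards.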

Re: Avoid re indexing

2015-08-01 Thread Nagasharath
If my current shard is holding 3 million documents, will each new sub-shard after splitting also be able to hold 3 million documents? If that is the case, after splitting a shard in two, the sub-shards together should hold 6 million documents. Am I right? > On 01-Aug-2015, at 5:43 pm, Upa

Re: Avoid re indexing

2015-08-01 Thread Upayavira
On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote: > I am using solrj to index documents > > I agree with you regarding the index update, but I should not see any > deleted documents as it is a fresh index. Can we actually identify > those deleted documents? If you post doc
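
A common cause of deleted documents in a "fresh" index is Solr's uniqueKey overwrite semantics: re-adding a document whose id already exists replaces the old copy, and the replaced copy shows up in the deleted-document count until a segment merge expunges it. A toy sketch of that bookkeeping (hypothetical, not Solr's internals):

```python
class ToyIndex:
    """Mimics uniqueKey overwrite semantics: re-adding an existing id
    marks the old copy as deleted rather than raising an error."""
    def __init__(self):
        self.live = {}        # id -> current version of the doc
        self.num_deleted = 0  # replaced copies awaiting a merge

    def add(self, doc):
        if doc["id"] in self.live:
            self.num_deleted += 1  # the previous copy becomes a deleted doc
        self.live[doc["id"]] = doc

idx = ToyIndex()
for d in [{"id": "1", "v": 1}, {"id": "2", "v": 1}, {"id": "1", "v": 2}]:
    idx.add(d)
print(len(idx.live), idx.num_deleted)  # -> 2 1
```

So if the source data contains duplicate ids, a fresh index will legitimately report deleted documents even though nothing was explicitly deleted; checking the feed for duplicate uniqueKey values is the first diagnostic step.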

Re: Avoid re indexing

2015-08-01 Thread naga sharathrayapati
I am using SolrJ to index documents. I agree with you regarding the index update, but I should not see any deleted documents as it is a fresh index. Can we actually identify those deleted documents? If there is no option of adding shards to an existing collection, I do not like the idea of re

Re: Avoid re indexing

2015-08-01 Thread Upayavira
On Sat, Aug 1, 2015, at 10:30 PM, naga sharathrayapati wrote: > I have an exception with one of the documents after indexing 6 mil > documents > out of 10 mil. Is there any way I can avoid re-indexing the 6 mil > documents? How are you indexing your documents? Are you using the DIH? Personally, I

Avoid re indexing

2015-08-01 Thread naga sharathrayapati
I have an exception with one of the documents after indexing 6 mil documents out of 10 mil. Is there any way I can avoid re-indexing the 6 mil documents? I also see that there are a few documents that were deleted (based on the count) while indexing. Is there a way to identify what those documents
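
One common way to avoid re-sending the first 6 million documents after a failure is to index in batches and persist a checkpoint (the last offset or id successfully committed), then resume from it. A hedged sketch of the pattern, where `send_batch` stands in for the real client call (e.g. a SolrJ add-and-commit), not any specific API:

```python
def index_with_checkpoint(docs, send_batch, checkpoint=0, batch_size=3):
    """Index docs in batches, advancing the checkpoint only after each
    batch succeeds, so a crash resumes from the checkpoint instead of
    starting over. In a real run, persist the checkpoint to disk."""
    i = checkpoint
    while i < len(docs):
        batch = docs[i:i + batch_size]
        send_batch(batch)   # may raise; checkpoint stays at the last success
        i += len(batch)
    return i

# simulate resuming after 6 of 10 docs were already indexed
sent = []
docs = list(range(10))
done = index_with_checkpoint(docs, sent.extend, checkpoint=6)
print(done, sent)  # -> 10 [6, 7, 8, 9]
```

The key design point: the checkpoint only advances after a successful commit, so a bad document stops the run at a known position rather than invalidating everything indexed so far.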

Re: Personalized Search Results or Matching Documents to Users

2015-08-01 Thread Mikhail Khludnev
On Sat, Aug 1, 2015 at 9:45 PM, Upayavira wrote: > ticket? > https://issues.apache.org/jira/browse/SOLR-5944 > > On Sat, Aug 1, 2015, at 02:02 PM, Erick Erickson wrote: > > How soon? It's pretty much done AFAIK, but the folks trying to work on > > it have had their priorities re-arranged. > > >

Re: Personalized Search Results or Matching Documents to Users

2015-08-01 Thread Upayavira
ticket? On Sat, Aug 1, 2015, at 02:02 PM, Erick Erickson wrote: > How soon? It's pretty much done AFAIK, but the folks trying to work on > it have had their priorities re-arranged. > > So I really don't have a date. > > Erick > > On Fri, Jul 31, 2015 at 4:59 PM, Upayavira wrote: > > How soon?

Re: Fast autocomplete for large dataset

2015-08-01 Thread Olivier Austina
Thank you Erick. I would like to implement autocomplete for a large dataset. The autocomplete should show the phrase or the question the user wants as the user types. The requirement is that the autocomplete should be fast (not slowed down by the volume of data as the dataset becomes bigger) and easy to

Re: Fast autocomplete for large dataset

2015-08-01 Thread Erick Erickson
Not really. There's no need to use ngrams as the article suggests if the terms component does what you need. Which is why I asked you about what autocomplete means in your context. Which you have not clarified. Have you even looked at terms component? Especially the terms.prefix option? Terms com
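The reason terms.prefix needs no ngrams, as the reply above says, is that it enumerates the already-sorted term dictionary starting at the first term matching the prefix. Conceptually (plain Python over a sorted list, not Lucene's actual term enumeration):

```python
import bisect

def terms_prefix(sorted_terms, prefix, limit=10):
    """Walk a sorted term list from the first term >= prefix, the way
    terms.prefix enumerates the term dictionary: a binary search to the
    start, then a linear scan while terms still match the prefix."""
    start = bisect.bisect_left(sorted_terms, prefix)
    hits = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(hits) == limit:
            break
        hits.append(term)
    return hits

terms = sorted(["solr", "solrcloud", "sharding", "suggester", "shawn"])
print(terms_prefix(terms, "so"))  # -> ['solr', 'solrcloud']
```

Because the term dictionary is already sorted, each lookup costs a binary search plus a short scan, independent of how many documents are in the index; only the number of distinct terms matters.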

Re: Fast autocomplete for large dataset

2015-08-01 Thread Olivier Austina
Thank you Erick for your reply. If I understand correctly, it seems that these approaches use the index to hold terms. As the index grows bigger, can that become a performance issue? Is that right? Please can you check this article to see what I mean?

Re: Fast autocomplete for large dataset

2015-08-01 Thread Erick Erickson
Well, defining what you mean by "autocomplete" would be a start. If it's just a user types some letters and you suggest the next N terms in the list, TermsComponent will fix you right up. If it's more complicated, the AutoSuggest functionality might help. If it's correcting spelling, there's the

Fast autocomplete for large dataset

2015-08-01 Thread Olivier Austina
Hi, I am looking for a fast and easy-to-maintain way to do autocomplete for a large dataset in Solr. I heard about Ternary Search Trees (TST). But I would like to know if there is something I missed, such as best practices or a new Solr feature. Any sugge
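
Since the question mentions ternary search trees: a minimal TST supporting insert and prefix search, as a sketch of the classic structure (independent of Solr, which uses FSTs for its suggesters instead).

```python
class TSTNode:
    __slots__ = ("ch", "left", "mid", "right", "is_word")
    def __init__(self, ch):
        self.ch, self.is_word = ch, False
        self.left = self.mid = self.right = None

class TST:
    """Ternary search tree: each node branches three ways (less / equal /
    greater), giving trie-like prefix search with less memory per node."""
    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.left = self._insert(node.left, word, i)
        elif ch > node.ch:
            node.right = self._insert(node.right, word, i)
        elif i + 1 < len(word):
            node.mid = self._insert(node.mid, word, i + 1)
        else:
            node.is_word = True
        return node

    def _find(self, node, word, i):
        if node is None:
            return None
        ch = word[i]
        if ch < node.ch:
            return self._find(node.left, word, i)
        if ch > node.ch:
            return self._find(node.right, word, i)
        if i + 1 == len(word):
            return node
        return self._find(node.mid, word, i + 1)

    def with_prefix(self, prefix):
        """All stored words beginning with `prefix`, in sorted order."""
        node = self._find(self.root, prefix, 0)
        if node is None:
            return []
        out = [prefix] if node.is_word else []
        self._collect(node.mid, prefix, out)
        return out

    def _collect(self, node, prefix, out):
        if node is None:
            return
        self._collect(node.left, prefix, out)
        if node.is_word:
            out.append(prefix + node.ch)
        self._collect(node.mid, prefix + node.ch, out)
        self._collect(node.right, prefix, out)

tst = TST()
for w in ["solr", "solo", "sol", "search"]:
    tst.insert(w)
print(tst.with_prefix("sol"))  # -> ['sol', 'solo', 'solr']
```

A TST keeps lookups proportional to the prefix length plus the number of matches, but like any in-memory structure it must be rebuilt or updated as the dataset changes, which is the same maintenance trade-off the Solr suggesters face.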

Re: Personalized Search Results or Matching Documents to Users

2015-08-01 Thread Erick Erickson
How soon? It's pretty much done AFAIK, but the folks trying to work on it have had their priorities re-arranged. So I really don't have a date. Erick On Fri, Jul 31, 2015 at 4:59 PM, Upayavira wrote: > How soon? And will you be able to use them for querying, or just > faceting/sorting/displayin

Re: Do not match on high frequency terms

2015-08-01 Thread Mikhail Khludnev
It seems like you need to develop a custom query or query parser. Regarding SolrJ: you can try to call http://wiki.apache.org/solr/TermsComponent https://cwiki.apache.org/confluence/display/solr/The+Terms+Component I'm not sure how exactly to call TermsComponent from SolrJ; I just found https://lucene.apa
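
Whatever client is used, a TermsComponent call comes down to a handful of request parameters documented on the wiki pages above. A small sketch that only assembles such a request URL (the host, core, and handler path are placeholder values, and the sketch does not issue the request):

```python
from urllib.parse import urlencode

def terms_request(base_url, field, prefix, limit=10):
    """Build a TermsComponent request URL from the parameters documented
    on the linked wiki pages; base_url and the /terms handler path are
    placeholders for whatever the actual deployment exposes."""
    params = {
        "terms": "true",        # enable the component
        "terms.fl": field,      # field whose terms to enumerate
        "terms.prefix": prefix, # only terms starting with this prefix
        "terms.limit": limit,   # cap the number of terms returned
        "wt": "json",
    }
    return base_url + "/terms?" + urlencode(params)

url = terms_request("http://localhost:8983/solr/collection1", "title", "so")
print(url)
```

From SolrJ the same parameters can be set on a query object pointed at the terms handler; the parameter names are identical either way.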

Re: Join Parent and Child Documents

2015-08-01 Thread Mikhail Khludnev
On Sat, Aug 1, 2015 at 10:51 AM, Vineeth Dasaraju wrote: > Hi, > > I had indexed a nested json object into solr as a parent document with > child documents. Whenever I query for a term in the child document, I am > returned only the child documents. Is it possible to get the parent > document alo

Join Parent and Child Documents

2015-08-01 Thread Vineeth Dasaraju
Hi, I indexed a nested JSON object into Solr as a parent document with child documents. Whenever I query for a term in a child document, only the child documents are returned. Is it possible to get the parent document along with the child documents as part of the results? I have been tryi
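
The usual approach here is the block-join parent query parser, which maps child-side matches to their parents, combined with the [child] document transformer to bring the children back in each result. A sketch that only assembles the query parameters (the field names content_type and comment_text are illustrative, not from this thread):

```python
from urllib.parse import urlencode

def block_join_params(child_clause, parent_filter="content_type:parent"):
    """Parameters for a block-join parent query: match child documents
    with `child_clause`, return their parent documents, and re-attach
    the children via the [child] transformer. Field names are examples."""
    return {
        "q": '{!parent which="%s"}%s' % (parent_filter, child_clause),
        "fl": "*,[child parentFilter=%s]" % parent_filter,
    }

params = block_join_params("comment_text:solr")
url_query = urlencode(params)  # ready to append to a /select request
print(params["q"])  # -> {!parent which="content_type:parent"}comment_text:solr
```

The `which` filter must match all parents and no children, which is why a dedicated marker field on parent documents (here the assumed content_type) is the common pattern.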