Re: Does docValues impact termfreq ?

2015-10-26 Thread Erick Erickson
Do be aware that docValues can only be used for non-text types, i.e. numerics, strings and the like. Specifically, docValues are _not_ possible for solr.TextField, and docValues don't support analysis chains because the underlying primitive types don't. You'll get an error if you try to specify

Re: Does docValues impact termfreq ?

2015-10-26 Thread Emir Arnautovic
If I got it right, you are using a term query, using a function to get TF as the score, iterating over all documents in the results, and summing up the total number of occurrences of a specific term in the index? Is this the only way you use the index, or is this side functionality? Thanks, Emir On 24.10.2015 22:28, Aki Balogh wrote:
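
The approach summarized in this message (fetch a per-document term frequency, then add the values up on the client) can be sketched roughly as follows. This is only an illustration: the pseudo-field name `tf` is an assumption (it would come from something like `fl=tf:termfreq(field,'term')`), and the sample response is made up to show the shape of Solr's JSON output.

```python
def sum_termfreq(response, key="tf"):
    """Sum a per-document term-frequency pseudo-field over a parsed Solr JSON response."""
    docs = response.get("response", {}).get("docs", [])
    # Documents that lack the pseudo-field count as zero occurrences.
    return sum(int(doc.get(key, 0)) for doc in docs)


# Hypothetical response, shaped like Solr's standard JSON output.
sample = {"response": {"numFound": 3, "start": 0,
                       "docs": [{"tf": 2}, {"tf": 0}, {"tf": 5}]}}
print(sum_termfreq(sample))
```

The cost of this pattern grows with the number of matching documents, since every document contributes one row to the response.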

Re: Does docValues impact termfreq ?

2015-10-26 Thread Emir Arnautovic
Hi Aki, IMO this is an underuse of Solr (not to mention SolrCloud). I would recommend doing in-memory document parsing (if you need something from Lucene/Solr analysis classes, use it) and using some other cache-like solution to store term/total frequency pairs (you can try Redis). That way you
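
A minimal sketch of the idea in this message, counting terms in memory and keeping term/total-frequency pairs in a lookup table. Everything here is an assumption for illustration: the regex tokenizer is a naive stand-in for a real Lucene/Solr analysis chain, and a plain `Counter` stands in for an external store such as Redis.

```python
import re
from collections import Counter


def count_terms(documents):
    """Return the total number of occurrences of each lowercased word across all documents."""
    totals = Counter()
    for text in documents:
        # Naive tokenizer (assumption): lowercase, then split on non-word characters.
        totals.update(re.findall(r"\w+", text.lower()))
    return totals


docs = ["Solr scales well", "solr and Lucene", "Lucene powers Solr"]
totals = count_terms(docs)
print(totals["solr"], totals["lucene"])
```

Looking up a term is then a single dictionary access, independent of how many documents contain it.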

Re: Does docValues impact termfreq ?

2015-10-26 Thread Aki Balogh
Hi Emir, This is correct. This is the only way we use the index. Thanks, Aki On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic < emir.arnauto...@sematext.com> wrote: > If I got it right, you are using term query, use function to get TF as > score, iterate all documents in results and sum up

Re: Does docValues impact termfreq ?

2015-10-26 Thread Scott Stults
Aki, does the sumtotaltermfreq function do what you need? On Mon, Oct 26, 2015 at 9:43 AM, Aki Balogh wrote: > Hi Emir, > > This is correct. This is the only way we use the index. > > Thanks, > Aki > > On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic < >
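
For context, Solr's function queries include `totaltermfreq(field,term)` (occurrences of one term in the field across the index) alongside `sumtotaltermfreq(field)` (total indexed tokens in the field). The sketch below only builds request parameters that ask for both. The field name `body`, the term, and the `/select` path are assumptions, the term is not escaped, and in a sharded collection it is worth verifying whether the values are per-shard.

```python
from urllib.parse import urlencode


def corpus_count_params(field, term):
    """Build Solr select parameters requesting index-wide counts as pseudo-fields."""
    return {
        "q": "*:*",
        # These functions do not depend on the document, so one row is enough.
        "rows": 1,
        "fl": f"ttf:totaltermfreq({field},'{term}'),sttf:sumtotaltermfreq({field})",
        "wt": "json",
    }


params = corpus_count_params("body", "solr")
print("/select?" + urlencode(params))
```

If `totaltermfreq` fits the use case, the response carries a single number per term instead of one value per matching document.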

Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Gotcha - that's disheartening. One idea: when I run termfreq, I get all of the termfreqs for each document one-by-one. Is there a way to have solr sum it up before creating the request, so I only receive one number in the response? On Sat, Oct 24, 2015 at 11:05 AM, Upayavira

Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Hi Jack, I'm just using solr to get word count across a large number of documents. It's somewhat non-standard, because we're ignoring relevance, but it seems to work well for this use case otherwise. My understanding then is: 1) since termfreq is pre-processed and fetched, there's no good way

Re: Does docValues impact termfreq ?

2015-10-24 Thread Upayavira
If you just want word length, then do work during indexing - index a field for the word length. Then, I believe you can do faceting - e.g. with the json faceting API I believe you can do a sum() calculation on a field rather than the more traditional count. Thinking aloud, there might be an
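
The faceting idea in this message might look roughly like the request body below, using the JSON Facet API's `sum()` aggregation. This is a sketch under assumptions: `word_count_i` is a hypothetical field that would have to be populated at index time, and the code only builds the JSON payload without sending it.

```python
import json


def sum_facet_request(field, query="*:*"):
    """Build a JSON request body asking Solr to sum a numeric field over the matching documents."""
    return {
        "query": query,
        "limit": 0,  # no document rows needed, only the aggregate
        "facet": {"total": f"sum({field})"},
    }


# "word_count_i" is a hypothetical numeric field computed during indexing.
body = sum_facet_request("word_count_i")
print(json.dumps(body))
```

With this shape the summing happens on the server, and the response contains one aggregate value rather than one value per document.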

Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Thanks, let me think about that. We're using termfreq to get the TF score, but we don't know which term we'll need the TF for. So we'd have to do a corpuswide summing of termfreq for each potential term across all documents in the corpus. It seems like it'd require some development work to

Re: Does docValues impact termfreq ?

2015-10-24 Thread Upayavira
If you mean using the term frequency function query, then I'm not sure there's a huge amount you can do to improve performance. The term frequency is a number that is used often, so it is stored in the index pre-calculated. Perhaps, if your data is not changing, optimising your index would reduce

Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Thanks, Jack. I did some more research and found similar results. In our application, we are making multiple (think: 50) concurrent requests to calculate term frequency on a set of documents in "real-time". The faster that results return, the better. Most of these requests are unique, so cache

Re: Does docValues impact termfreq ?

2015-10-24 Thread Jack Krupansky
That's what a normal query does - Lucene takes all the terms used in the query and sums them up for each document in the response, producing a single number, the score, for each document. That's the way Solr is designed to be used. You still haven't elaborated why you are trying to use Solr in a

Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Certainly, yes. I'm just doing a word count, i.e., how often does a specific term come up in the corpus? On Oct 24, 2015 4:20 PM, "Upayavira" wrote: > yes, but what do you want to do with the TF? What problem are you > solving with it? If you are able to share that... > > On Sat,

Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Yes, sorry, I am not being clear. We are not even doing scoring, just getting the raw TF values. We're doing this in solr because it can scale well. But with large corpora, retrieving the word counts takes some time, in part because solr is splitting up word count by document and generating a

Re: Does docValues impact termfreq ?

2015-10-24 Thread Upayavira
Can you explain more what you are using TF for? Because it sounds rather like scoring. You could disable field norms and IDF and scoring would be mostly TF, no? Upayavira On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote: > Thanks, let me think about that. > > We're using termfreq to get the

Re: Does docValues impact termfreq ?

2015-10-24 Thread Upayavira
yes, but what do you want to do with the TF? What problem are you solving with it? If you are able to share that... On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote: > Yes, sorry, I am not being clear. > > We are not even doing scoring, just getting the raw TF values. We're > doing > this in

Does docValues impact termfreq ?

2015-10-23 Thread Aki Balogh
Hello, In our solr application, we use a Function Query (termfreq) very heavily. Index time and disk space are not important, but we're looking to improve performance on termfreq at query time. I've been reading up on docValues. Would this be a way to improve performance? I had read that Lucene

Re: Does docValues impact termfreq ?

2015-10-23 Thread Jack Krupansky
Term frequency applies only to the indexed terms of a tokenized field. DocValues is really just a copy of the original source text and is not tokenized into terms. Maybe you could explain how exactly you are using term frequency in function queries. More importantly, what is so "heavy" about your