People usually want to do some analysis during index time. This analysis
should be considered 'expensive' compared to any single query run. You
can think of it as indexing every day, over an 86,400-second day, versus a
200 ms query time.
Normally, you want to index as honestly as possible. That is,
What I have done for this in the past is to calculate the expected value of
a symbol within a universe, then calculate the difference between the
expected value and the actual value at the time you see the symbol. Take the
difference and use the most surprising symbols, in rank order from most
surprising.
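A rough sketch of that ranking, in Java (the symbol granularity, the
frequency tables, and the plain difference measure are all assumptions; a
log ratio works just as well):

import java.util.*;
import java.util.stream.*;

/** Rank the symbols of one document by how far their observed frequency
 *  departs from the frequency expected across the whole universe/corpus. */
public class SurpriseRanker {

    /** expectedFreq: corpus-wide relative frequency per symbol (the "universe"). */
    public static List<String> mostSurprising(Map<String, Double> expectedFreq,
                                              List<String> docSymbols,
                                              int topN) {
        // Observed relative frequency of each symbol in this document.
        Map<String, Double> observed = new HashMap<>();
        for (String s : docSymbols) {
            observed.merge(s, 1.0 / docSymbols.size(), Double::sum);
        }
        // Surprise = observed minus expected; symbols unseen in the corpus get a tiny floor.
        return observed.entrySet().stream()
                .sorted((a, b) -> Double.compare(
                        surprise(b, expectedFreq), surprise(a, expectedFreq)))
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    private static double surprise(Map.Entry<String, Double> e,
                                   Map<String, Double> expectedFreq) {
        return e.getValue() - expectedFreq.getOrDefault(e.getKey(), 1e-9);
    }
}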
You do not need stop words to do what you need to do. For one thing, stop
words require segmentation on a phrase-by-phrase basis in some cases.
That is, especially in places like Europe, there is a lot of mixed-language
text. (Your mileage may vary :).
In order to do what you want, you really need t
Walter,
When you do the query, what is the sort order of the results?
tim
On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood
wrote:
> I’ll back up a bit, since it is sort of an X/Y problem.
>
> I have an index with four shards and 17 million documents. I want to dump
> all the docs in JSON, label each
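For dumping a whole collection, the usual approach is cursor-based deep
paging, which is why the sort matters: the sort has to include the
uniqueKey. A minimal SolrJ sketch, assuming the uniqueKey is 'id' and the
collection is called 'mycollection':

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class DumpAll {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(1000);
            q.setSort(SolrQuery.SortClause.asc("id"));  // cursors require the uniqueKey in the sort
            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = client.query("mycollection", q);
                for (SolrDocument doc : rsp.getResults()) {
                    // write each doc out as JSON here, adding whatever label you need
                }
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) break;         // no more pages
                cursor = next;
            }
        }
    }
}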
Adi,
If you are looking for something specific you might want to try something
different. Before you search 'the end of a document', you might
think about segmenting the document and searching specific segments. At
the end of a lot of things, like email, there will be signatures. Those are fairly
If this is about a normalized query, I would put the normalization text
into a specific field. The reason for this is that you may want to search the
overall text during any expansion phase of the search. That is, maybe you
want to know the context up to the 120th word. At least you
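A sketch of what that looks like at index time, keeping the raw text and the
normalized text in separate fields (field names and the normalization pass
are made up):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexBothForms {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            String raw = "Dr. Smith-Jones e-mailed on 3rd Nov.";
            String normalized = normalize(raw);           // whatever your normalization pass does

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("text", raw);                    // full text, for context/expansion queries
            doc.addField("text_normalized", normalized);  // the form you actually match against
            client.add(doc);
            client.commit();
        }
    }

    // Stand-in normalization: lowercase and strip punctuation.
    static String normalize(String s) {
        return s.toLowerCase().replaceAll("[^\\p{L}\\p{Nd}\\s]", " ")
                .replaceAll("\\s+", " ").trim();
    }
}

Then the expansion-phase queries go against 'text' and the normalized
matching goes against 'text_normalized'.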
https://stackoverflow.com/questions/48348312/solr-7-how-to-do-full-text-search-w-geo-spatial-search
On Mon, Sep 30, 2019 at 10:31 AM Anushka Gupta <
anushka_gu...@external.mckinsey.com> wrote:
> Hi,
>
> I want to be able to filter on different cities and also sort the results
> based on geoproxi
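The linked answer has the details. The general shape of the request,
assuming a spatial field named 'location' of type LatLonPointSpatialField
(the city, coordinates, and radius here are made up):

import org.apache.solr.client.solrj.SolrQuery;

public class GeoQueryExample {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("offices");                   // the full-text part of the query
        q.addFilterQuery("city:Berlin");                          // filter on a specific city
        q.addFilterQuery("{!geofilt sfield=location pt=52.52,13.405 d=50}"); // within 50 km
        q.set("sort", "geodist(location,52.52,13.405) asc");      // nearest results first
        return q;
    }
}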
My two cents' worth of comment:
For our local Lucene indexes we use AES encryption. We encrypt the blocks
on the way out, decrypt on the way in.
We are using a C version of Lucene, not the Java version. But I suspect
the same methodology could be applied. This assumes the data at rest is
the at
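For flavor, the same block-level idea in Java with javax.crypto (the cipher
mode, key handling, and block framing are assumptions, and none of this is
built into Solr):

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class BlockCrypto {
    private static final int IV_LEN = 12;       // 96-bit IV, the usual choice for GCM
    private static final int TAG_BITS = 128;

    // Encrypt one block on the way out; the IV is prepended so decrypt can find it.
    static byte[] encryptBlock(SecretKey key, byte[] plain) throws Exception {
        byte[] iv = new byte[IV_LEN];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] enc = c.doFinal(plain);
        byte[] out = new byte[IV_LEN + enc.length];
        System.arraycopy(iv, 0, out, 0, IV_LEN);
        System.arraycopy(enc, 0, out, IV_LEN, enc.length);
        return out;
    }

    // Decrypt one block on the way back in.
    static byte[] decryptBlock(SecretKey key, byte[] stored) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, stored, 0, IV_LEN));
        return c.doFinal(stored, IV_LEN, stored.length - IV_LEN);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey key = kg.generateKey();
        byte[] roundTrip = decryptBlock(key, encryptBlock(key, "index block bytes".getBytes()));
        System.out.println(new String(roundTrip));
    }
}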
Venkat,
There is another way to do this. If you have a category of "thing" you are
attempting to filter over, then you create a query and tag the documents
with this category. So, create a 'categories' field and append 'thing' to
the field, updating the field if need be. (Be wary of over-generat
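The tagging itself can be an atomic update that appends to a multivalued
field. A minimal SolrJ sketch (field, id, and collection names are made up):

import java.util.Map;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TagCategory {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            // Atomic update: append 'thing' to the multivalued 'categories' field
            // of an existing document without reindexing the rest of it.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-42");
            doc.addField("categories", Map.of("add", "thing"));
            client.add(doc);
            client.commit();
        }
    }
}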
Hi Sambhav,
Calculate the percentage of letter pairs per language in the index.
Given the letter pairs in the incoming token, find the closest "match" for
the languages in the indexes.
Even on a small number of tokens you will get close to the intended
language. You can also calculate the "sourc
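A rough sketch of the letter-pair idea (in practice the per-language
profiles come from your indexes; the distance measure here is just an
assumption):

import java.util.HashMap;
import java.util.Map;

public class BigramLanguageGuesser {

    // Relative frequency of each letter pair in a body of text.
    static Map<String, Double> profile(String text) {
        String s = text.toLowerCase().replaceAll("[^\\p{L}]", "");
        Map<String, Double> freq = new HashMap<>();
        int pairs = Math.max(1, s.length() - 1);
        for (int i = 0; i + 1 < s.length(); i++) {
            freq.merge(s.substring(i, i + 2), 1.0 / pairs, Double::sum);
        }
        return freq;
    }

    // Smaller score = closer match; sum of absolute frequency differences.
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> all = new HashMap<>(a);
        b.forEach((k, v) -> all.merge(k, 0.0, Double::sum));
        double d = 0;
        for (String pair : all.keySet()) {
            d += Math.abs(a.getOrDefault(pair, 0.0) - b.getOrDefault(pair, 0.0));
        }
        return d;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> languages = Map.of(
                "en", profile("the quick brown fox jumps over the lazy dog"),
                "de", profile("der schnelle braune fuchs springt ueber den faulen hund"));
        Map<String, Double> incoming = profile("springt der hund");
        languages.forEach((lang, prof) ->
                System.out.println(lang + " distance=" + distance(incoming, prof)));
    }
}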
I am not sure how Solr is set up currently, much less on any
specific system. But for operations which are largely reading, *maybe*
like a query, you might be able to run on a read-only partition.
A firewall is a lot less work and a good start, like 90% of the problem.
To do this, you brin
Deepti,
I am going to guess the analyzer part of the .NET application is cutting
off the last token.
If you try the queries on the console of the running Solr cluster, what do
you get? If you dump that specific field for all the docs, can you find it
with grep?
tim
On Fri, Jul 20, 2018 at 10:5
We have 3.4.10 and have *tested* it with 6.6.2 at a functional level. So far it
works. We have not done any stress/load testing, but we would have to do that
prior to release.
On Tue, May 22, 2018 at 9:44 AM, Walter Underwood
wrote:
> Is anybody running Zookeeper 3.4.12 with Solr 6.6.2? Is that a recomme
A simple date range query does not really represent how people query over
time and dates. If you want any form of date query beyond a single
range, then a special field allowing tokenized queries will be the only way
to find documents.
A query for 'every tuesday in november of 2017' would have to
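One way to build such a field is to expand each date into descriptive
tokens at index time, so a tokenized query can match on terms. A small
sketch (the token vocabulary is an assumption):

import java.time.LocalDate;
import java.util.List;

public class DateTokens {

    // Expand a date into the tokens a 'tokenized date' field would hold.
    static List<String> tokensFor(LocalDate d) {
        return List.of(
                d.getDayOfWeek().name().toLowerCase(),    // "tuesday"
                d.getMonth().name().toLowerCase(),         // "november"
                String.valueOf(d.getYear()),               // "2017"
                d.toString());                             // "2017-11-07" for exact matches
    }

    public static void main(String[] args) {
        // A query for 'every tuesday in november of 2017' becomes
        // tuesday AND november AND 2017 against this field.
        System.out.println(tokensFor(LocalDate.of(2017, 11, 7)));
    }
}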
For shorter documents, TFIDFSimilarity will weight towards the shorter
ones. Another way to say this: if your documents are 5-10 terms, the
5-term documents are going to win.
You might think about having a per-token, or token-pair, weight. I would be
surprised if there was not something similar out t
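The simplest approximation of a per-token weight is a query-time boost per
term. A sketch (field name and boost values are made up):

import org.apache.solr.client.solrj.SolrQuery;

public class PerTermBoosts {
    public static SolrQuery build() {
        // Boost the terms that matter and damp the ones that do not,
        // so a short document full of low-value terms stops winning.
        return new SolrQuery("title:(laptop^4 stand^2 cheap^0.2)");
    }
}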
At my last company we ended up writing a custom analyzer to handle
punctuation, but this was for Lucene 2 or 3. That analyzer was carried
forward as we upgraded and was used for all human-derived text.
Although now there are way better analyzers and way better ways to hook
them up, as noted above by
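In current Lucene the same thing is usually only a few lines: a char filter
that normalizes the punctuation you care about, in front of a standard
tokenizer. A sketch (the specific mappings are made up):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class PunctuationAnalyzer extends Analyzer {
    private static final NormalizeCharMap MAP;
    static {
        NormalizeCharMap.Builder b = new NormalizeCharMap.Builder();
        b.add("\u2019", "'");      // curly apostrophe to straight
        b.add("\u201C", "\"");     // curly quotes to straight
        b.add("\u201D", "\"");
        b.add("\u2026", " ");      // ellipsis to space
        MAP = b.build();
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Normalize punctuation before tokenization.
        return new MappingCharFilter(MAP, reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}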
I really like JProfiler. It takes a little bit of setup, but it works.
tim
On Wed, Dec 6, 2017 at 2:04 AM, Peter Sturge wrote:
> Hi,
> We've been using JProfiler (www.ej-technologies.com) for years now.
> Without a doubt, the most comprehensive and useful profiler for java.
> Works very well,
You can add a ~3 to the query to allow the order to be reversed, but you
will get extra hits. Maybe it is a ~4, I can never remember with phrases and
reversals. I usually just try it.
Alternatively, you can create a custom query field for what you need from
dates. For example, if you want to sear
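For what it is worth, a slop of 2 is usually enough for a simple
transposition of two adjacent terms. A quick sketch to try (field name is
made up):

import org.apache.solr.client.solrj.SolrQuery;

public class PhraseSlop {
    public static SolrQuery build() {
        // "john smith"~2 also matches documents containing "smith john",
        // at the cost of some extra, looser hits as the slop grows.
        return new SolrQuery("name:\"john smith\"~2");
    }
}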
There should be a way to use a phrasal query for the specific names.
On Wed, Aug 2, 2017 at 2:15 PM, Phil Scadden wrote:
> Hopefully changing to default AND solves your problem. If so, I would be
> quite interested in what your index config looks like in the end. I also
> have upcoming need to i
deniz,
I was going to add something here. The reason that what you want is probably
hard to do is that you are asking Solr, which stores documents, to
return documents using an attribute of document pairs. As only a thought
exercise, if you stored record pairs as a single document, you could
proba
Joe,
To do this correctly and soundly, you will need to sample the data and mark
each sample as threatening or neutral. You can probably expand on this quite a
bit, but that would be a good start. You can then draw another set of
samples and see how you did. Use one set to train and one to validate.
Wha
Hendrik,
I would recommend sticking as close as possible to the query syntax as it
is in Lucene.
However, if you do your own query parsing and build-up, you can use a Lucene
Query object. I don't know where this bolts into Solr, exactly, but I
have done this extensively with Lucene. The
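For the Lucene side of that, building the query programmatically instead of
parsing a string looks roughly like this (field names and terms are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BuildQuery {
    public static Query build() {
        // Equivalent of: +title:lucene +body:"query object" -status:deleted
        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("title", "lucene")), BooleanClause.Occur.MUST)
                .add(new PhraseQuery("body", "query", "object"), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("status", "deleted")), BooleanClause.Occur.MUST_NOT)
                .build();
    }
}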
I would possibly extend this a bit further. There is the source, then the
'normalized' version of the data, then the indexed version.
Sometimes you realize you missed something in the normalized view and you
have to go back to the actual source.
The more sources you have, the more likely this is.
I have been chasing the Chegg recruiters. I expect to hear back from Glenn
sometime tomorrow.
tim
On Mon, Nov 18, 2013 at 6:37 PM, Walter Underwood wrote:
> I work at Chegg.com and I really like it, but we have more search work
> than I can do by myself, so we are hiring a senior software engi