On 6/26/2019 12:56 PM, Lucky Sharma wrote:
@Shawn: Sorry I forgot to mention the corpus size: the corpus size is
around 3 million docs, where we need to query for 1500 docs and run
aggregations, sorting, search on them.

Assuming the documents aren't HUGE, that sounds like something Solr should be able to handle pretty easily on a typical modern 64-bit system. I handled multiple indexes much larger than that with an 8GB heap on Linux servers with 64GB total memory. Most likely you won't need anything that large.

Depending on exactly what you're going to do with it, that's probably also something easily handled by a relational database or a more modern NoSQL solution ... especially if "traditional search" is not part of your goals. Solr can do things beyond search, but search is where everything is optimized, so if search is not part of your goal, you might want to look elsewhere.

@David: But will that not be a performance hit (resource intensive)?
since it will have that many terms to search upon, the query parse
tree will be big, isn't it?

The terms query parser is far more efficient than a simple boolean "OR" search with the same number of terms. It is highly recommended for use cases like you have described.
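As a rough illustration, a terms query for a batch of documents might be built like this. It is only a sketch: the field name `id`, the helper name `build_terms_query`, and the made-up document IDs are assumptions, not anything from your setup. It relies on the parser's default comma separator, so it rejects values that contain commas.

```python
def build_terms_query(field, values):
    """Return a terms-query-parser string such as {!terms f=id}a,b,c."""
    values = [str(v) for v in values]
    # The parser splits the value list on commas by default, so a comma
    # inside a value would be misread as a separator.
    if any("," in v for v in values):
        raise ValueError("values must not contain commas")
    return "{!terms f=%s}%s" % (field, ",".join(values))


# Hypothetical IDs standing in for the 1500 documents being queried.
doc_ids = ["doc%d" % i for i in range(1500)]
terms_query = build_terms_query("id", doc_ids)
```

The resulting string would typically go in an `fq` parameter, so the ID restriction acts as a filter and the main query is left free for the actual search.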

The default maxBooleanClauses limit that Lucene enforces on boolean queries is 1024 ... but this is an arbitrary value. The limit was designed as a way to prevent massive queries from running when it wasn't truly intended for such queries to have been created in the first place. It is common for users to increase the default limit.

You're probably going to want to send your queries as POST requests, because those have a 2MB default body-size restriction, which can be increased. GET requests are limited by the HTTP header size restriction, which defaults to 8192 bytes on all web server implementations I have checked, including the one that's included with Solr. Increasing that is possible, but not recommended ... especially to the sizes you would need for the queries you have described.
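To make the size issue concrete, here is a sketch that builds (but does not send) a form-encoded POST request carrying 1500 IDs. The URL, the collection name `mycollection`, the field name `id`, and the ID format are all assumptions made up for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; substitute your own host and collection name.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycollection/select"


def build_post_request(url, params):
    """Build a form-encoded POST request without sending it."""
    body = urlencode(params).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")


# Made-up IDs of a plausible length, joined for the terms query parser.
ids = ",".join("document-%06d" % i for i in range(1500))
params = {"q": "*:*", "fq": "{!terms f=id}" + ids, "rows": 1500}

request = build_post_request(SOLR_SELECT_URL, params)
body_size = len(request.data)  # bytes that would travel in the POST body
```

With IDs of this length, `body_size` is well beyond 8192 bytes but far below 2MB, which is why the same parameters would not fit in a GET request under the default header limit. Passing `request` to `urllib.request.urlopen` would actually send it.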

Thanks,
Shawn
