Hello Derek,
See answers inline.
--
Mark Bennett / LucidWorks: Search & Big Data /
[email protected]
Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513
On Jun 9, 2014, at 12:00 AM, Derek Poh <[email protected]> wrote:
My company is actively looking at alternative search engine applications
to replace our current Endeca application.
I have no experience with or knowledge of Solr and Lucene.
Please bear with me; I would like to find out if the following features
are available in Solr.
1. Aggregate results (rollups).
E.g. from a list of product search results (each with a supplier id
field), can the results be aggregated by supplier id with the original
result ordering retained?
Yes it can:
http://wiki.apache.org/solr/FieldCollapsing
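As a rough illustration, a grouped query URL could be built like this in
Python. The group, group.field and group.limit parameters are the ones
described on the FieldCollapsing page; the host, the collection_name core
and the supplier_id field name are assumptions for this sketch.

```python
from urllib.parse import urlencode

# Assumed base URL; adjust host, port and collection name for your install.
SOLR_SELECT = "http://localhost:8983/solr/collection_name/select"

def grouped_query_url(query, group_field="supplier_id", per_group=1):
    """Build a select URL that collapses results by one field."""
    params = [
        ("q", query),
        ("group", "true"),             # turn on result grouping
        ("group.field", group_field),  # field to roll up on (hypothetical name)
        ("group.limit", per_group),    # documents returned per group
    ]
    return SOLR_SELECT + "?" + urlencode(params)

print(grouped_query_url("text:budget"))
```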
2. Filter/Navigator, counts.
List a field's possible values and their counts from the indexed data
and from the returned results.
The field's values can be sorted by the value descriptions or by the
value counts in the returned results.
Yes, Solr calls these "Facets" and offers several types:
http://wiki.apache.org/solr/SimpleFacetParameters
http://wiki.apache.org/solr/HierarchicalFaceting
E.g. the field "Business Type" below with its possible values and the
count for each value (in brackets). Can the field be returned in the
result with its values sorted either by description or by count?
Business Type
Manufacturer (15269)
Exporter (12493)
Trading Company (5541)
Agent (1324)
Wholesaler (1202)
Importer (682)
Buying Office (394)
Distributor (278)
Other (157)
Retailer (116)
Consultant (54)
Absolutely, and Solr is very fast and accurate.
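A minimal sketch of such a facet request, built in Python. The facet,
facet.field, facet.sort and facet.mincount parameters are documented on
the SimpleFacetParameters page; the business_type field name and the base
URL are assumptions. facet.sort=count orders values by count, while
facet.sort=index orders them by the indexed term, which is the closest
match to "sorted by description".

```python
from urllib.parse import urlencode

# Assumed base URL; adjust for your install.
SOLR_SELECT = "http://localhost:8983/solr/collection_name/select"

def facet_query_url(query, facet_field="business_type", sort_by="count"):
    """Build a select URL that returns facet counts for one field."""
    if sort_by not in ("count", "index"):
        raise ValueError("sort_by must be 'count' or 'index'")
    params = [
        ("q", query),
        ("rows", 0),                   # only the facet counts, no documents
        ("facet", "true"),
        ("facet.field", facet_field),  # hypothetical field name
        ("facet.sort", sort_by),       # "count" or "index"
        ("facet.mincount", 1),         # hide values with zero matches
    ]
    return SOLR_SELECT + "?" + urlencode(params)

print(facet_query_url("text:budget", sort_by="index"))
```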
3. Configure and define the relevance ranking and matching logic of the
returned results.
Yes, though not by that name.
Step 1:
Configure default edismax parameters in your solrconfig.xml
Step 2:
Create additional search handlers in solrconfig.xml, and each search
handler can have its own edismax configuration.
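While you experiment, the same edismax settings can generally be passed
as plain request parameters before you freeze them into a handler. A
minimal Python sketch, assuming hypothetical title and description fields
and made-up boost and minimum-match values:

```python
from urllib.parse import urlencode

# Assumed base URL; adjust for your install.
SOLR_SELECT = "http://localhost:8983/solr/collection_name/select"

def edismax_url(query, qf="title^2 description", mm="75%"):
    """Build a select URL that asks for the edismax query parser."""
    params = [
        ("q", query),
        ("defType", "edismax"),  # select the edismax query parser
        ("qf", qf),              # fields to search, with illustrative boosts
        ("mm", mm),              # minimum-should-match (illustrative value)
    ]
    return SOLR_SELECT + "?" + urlencode(params)

print(edismax_url("budget hotel"))
```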
Normally the format of the search URL is:
http://localhost:8983/solr/collection_name/select?q=text:budget
You would replace the "select" with the name of the search handler that
has the edismax config you want.
With multiple search handlers, you'd use something like:
http://localhost:8983/solr/collection_name/search_freshest?q=text:budget
http://localhost:8983/solr/collection_name/search_most_popular?q=text:budget
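In client code, switching handlers is then just a matter of swapping one
path segment. A small Python sketch; the handler names are the example
ones above and would have to exist in your solrconfig.xml:

```python
from urllib.parse import urlencode

# Assumed base URL; adjust for your install.
SOLR_BASE = "http://localhost:8983/solr/collection_name"

def handler_url(handler, query):
    """Build a query URL for a named search handler (no leading slash)."""
    # safe=":" keeps the field:value colon readable in the output.
    return f"{SOLR_BASE}/{handler}?" + urlencode({"q": query}, safe=":")

for name in ("select", "search_freshest", "search_most_popular"):
    print(handler_url(name, "text:budget"))
```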
4. Define and configure the thesaurus (1-way or 2-way), stemming and
stop words.
Yes, Solr is very good at this, and you have both options.
Also, Solr lets you choose:
* Index time, or query time, or both
* Use expansion or reduction
You can even have more than one thesaurus file and have them each handled
differently.
For example:
* Use an english_language thesaurus, which rarely changes, and expand
that at index time
* Use your company_synonyms, which may change frequently, and expand them
at search time.
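To make "expansion" versus "reduction" concrete, here is a toy Python
illustration. It is not how Solr implements synonyms, only a conceptual
sketch, and the synonym group is made up.

```python
# Toy synonym group; in Solr these would live in a synonyms file.
SYNONYM_GROUPS = [["tv", "television", "telly"]]

def _group_for(token, groups):
    """Return the synonym group containing token, or None."""
    return next((g for g in groups if token in g), None)

def expand_synonyms(tokens, groups=SYNONYM_GROUPS):
    """Expansion: a matching token becomes every term in its group."""
    out = []
    for tok in tokens:
        group = _group_for(tok, groups)
        out.extend(group if group else [tok])
    return out

def reduce_synonyms(tokens, groups=SYNONYM_GROUPS):
    """Reduction: a matching token becomes the group's first term."""
    out = []
    for tok in tokens:
        group = _group_for(tok, groups)
        out.append(group[0] if group else tok)
    return out

print(expand_synonyms(["cheap", "telly"]))  # ['cheap', 'tv', 'television', 'telly']
print(reduce_synonyms(["cheap", "telly"]))  # ['cheap', 'tv']
```

Expansion makes the index or query larger but matches any variant;
reduction keeps things compact but must be applied the same way at index
time and at query time.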
I'll let you find these in the wiki, http://wiki.apache.org
5. Multi-language support for Simplified Chinese and Spanish.
Yes!
And for simplified Chinese, please make sure to use the SmartCN analyzer,
and not the simplistic "CJK"; SmartCN actually looks for Chinese language
word breaks using statistical methods, and therefore should give better
results.
6. Scalability.
At present, we are indexing 4 million records and the number is expected
to increase more than tenfold in the near future.
40 million documents can normally be handled on a single machine,
assuming it has enough RAM and doesn't have a lot of other stuff running.
You might want a second machine for failover.
When people use multiple machines, then the way to do that is via
SolrCloud.
7. Search results debugging. E.g. why a record was matched or why a
record was ranked as such.
Yes.
You typically add &debugQuery=true&debug.explain.structured=true to the
URL.
The output is a bit technical and takes some practice to understand.
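For example, a debug URL could be assembled like this in Python. The two
debug parameters are the ones mentioned above; the base URL is an
assumption.

```python
from urllib.parse import urlencode

# Assumed base URL; adjust for your install.
SOLR_SELECT = "http://localhost:8983/solr/collection_name/select"

def debug_query_url(query):
    """Build a select URL that also returns scoring explanations."""
    params = [
        ("q", query),
        ("debugQuery", "true"),                # include debug output
        ("debug.explain.structured", "true"),  # explanations as nested data
    ]
    return SOLR_SELECT + "?" + urlencode(params)

print(debug_query_url("text:budget"))
```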
There's also a graphical relevancy debugger with a free eval period:
http://www.lucidworks.com/market_app/lucidworks-relevancy-workbench/
Derek