common ecommerce use case

2018-07-06 Thread Sreenivas.T
Hi,

It's a common use case in the ecommerce world, and I would like to hear about
the best approaches to implement it. What are the options for implementing
queries like "red shoes" (color category), "men's bikes" (gender category),
or "Samsung TV" (brand category)?

Is it better to implement these using filters on fields like color, gender,
category, and brand? Or should we search and boost the results on those
fields? We may need to do some query preprocessing to identify the filters
for the filter query.

I know it's a business decision and mainly comes down to recall vs.
precision, but I would like to hear suggestions.
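To make the two approaches concrete, here is a minimal sketch of the request parameters each one produces. The field names (`color`, etc.) and the edismax boost syntax are assumptions for illustration, not a recommendation of one approach over the other:

```python
from urllib.parse import urlencode

def filter_style(keywords, facets):
    """Restrict results with fq: docs that don't match a facet are
    excluded entirely (precision-oriented)."""
    params = [("q", " ".join(keywords))]
    for field, value in facets.items():
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)

def boost_style(keywords, facets, boost=10):
    """Prefer docs that match the facet terms but keep the rest
    (recall-oriented), using boosted clauses instead of filters."""
    clauses = [" ".join(keywords)]
    clauses += [f'{field}:"{value}"^{boost}' for field, value in facets.items()]
    return urlencode([("q", " ".join(clauses)), ("defType", "edismax")])
```

With query preprocessing, "red shoes" would be split into the keyword `shoes` plus the detected facet `color=red` before building either form.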

Thanks & regards,
Sreenivas


SaaS-based search with Solr

2018-06-11 Thread Sreenivas.T
All,

Is anyone aware of a commercially available SaaS-based Solr search tool?

Regards,
Sreenivas


Re: Opinions on ExtractingRequestHandler

2018-02-08 Thread Sreenivas.T
Frederik,

We have also used a separate service, which uses Tika and then uses SolrJ to
index the content. The main reason we went for this approach is the
flexibility to manipulate/transform data over and above what Tika does.

My understanding is that if no other transformation is needed,
ExtractingRequestHandler should be fine in production too.

Regards,
Sreenivas

On 8 February 2018 at 17:17, Frederik Van Hoyweghen <
frederik.vanhoyweg...@chapoo.com> wrote:

> Hey everyone,
>
> What are your experiences on making (in production) use of Solr's
> ExtractingRequestHandler?
>
> I've been reading some mixed remarks so I was wondering what your actual
> experiences with it are.
>
> Personally, I feel like setting up a separate service which is solely
> responsible for parsing file contents (to be indexed by Solr later on in
> the process) using Tika is a safer approach, so we can use whatever Tika
> version we want along with other things we might want to add.
>
> Looking forward to your response!
>
> Kind regards,
> Frederik
>
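As a rough sketch of the "separate service" approach discussed above: extract text and metadata first (e.g. with a standalone Tika server), then build the update payload yourself and send it to Solr. The thread uses SolrJ for the last step; the same document also works against the JSON `/update` endpoint. Field names below are assumptions:

```python
import json

def build_update_payload(doc_id, text, metadata):
    """Build the JSON body for a POST to Solr's /update handler, after
    Tika has extracted `text` and `metadata` from the original file.
    This is where any custom transformation can happen before indexing.
    Field names (content_txt, meta_*_s) are illustrative."""
    doc = {"id": doc_id, "content_txt": text}
    # Flatten Tika metadata into dynamic string fields.
    doc.update({f"meta_{k}_s": v for k, v in metadata.items()})
    return json.dumps([doc])
```

Keeping this step outside Solr means the Tika version can be upgraded independently, and a Tika crash on a bad file cannot take down the Solr node.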


Span queries

2017-12-18 Thread Sreenivas.T
Hi,

I'm writing a span query within my own custom query parser. I get the tokens
from the query analyzer and create a span term query from each token. Then I
build a span near query from all these span term queries.

This works if all the tokens are present in the index within the specified
slop. However, if one of them is missing, no results come up. I know this is
essentially a phrase query.

Is there a way to write span queries that ignore a few missing tokens? I
would like to return results even if only some of the tokens are present.
Please suggest.

Regards,
Sreenivas
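A common way to approach this (a conceptual sketch, not the Lucene API): accept a match when at least k of the n tokens fall within the slop window. In Lucene terms this is typically approximated by OR-ing SpanNearQueries built over token subsets, or by falling back to a BooleanQuery of the span term queries with minimumNumberShouldMatch (which relaxes strict proximity). The Python below only illustrates the matching rule itself:

```python
def near_match(positions_by_token, tokens, slop, min_required):
    """Return True if at least `min_required` distinct tokens occur
    within a window of `slop` positions.
    `positions_by_token` maps token -> list of term positions in the doc."""
    # Flatten all occurrences into (position, token) pairs, sorted by position.
    occs = sorted((p, t) for t in tokens for p in positions_by_token.get(t, ()))
    for i, (start, _) in enumerate(occs):
        # Distinct tokens whose positions fall within `slop` of the window start.
        window = {t for p, t in occs[i:] if p - start <= slop}
        if len(window) >= min_required:
            return True
    return False
```

With `min_required` equal to the number of tokens this behaves like the all-or-nothing span near query described above; lowering it gives the "ignore a few tokens" behavior.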


Re: How extractingrequest handler works?

2017-12-08 Thread Sreenivas.T
Thanks Erick.

I'm using ManifoldCF to connect to the file share and index the content to
Solr, so I was thinking of customizing Solr's update processor. However, it
looks like ManifoldCF needs Tika extraction to happen before indexing to
Solr. I am not sure what our approach should be.


-Sreenivas


On 8 December 2017 at 21:41, Erick Erickson  wrote:

> I wouldn't extend the extracting request handler at all, just run the
> custom code independently of Solr. This is generally recommended
> anyway, here's a way to get started:
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> The database bits are just there because I wanted to talk about both
> at once, they're not necessary for Tika at all.
>
> Best,
> Erick
>
> On Fri, Dec 8, 2017 at 12:30 AM, Sreenivas.T  wrote:
> > All,
> >
> > How does ExtractingRequestHandler internally index Tika-extracted
> > content? Does it internally call the update processor chain?
> >
> > I have a custom update document processor that needs to work on the
> > Tika-extracted content and call an API.
> >
> > Do I need to extend ExtractingRequestHandler and customize it, or can I
> > call my custom update processor from ExtractingRequestHandler?
> >
> > Sreenivas
>
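For completeness: ExtractingRequestHandler does pass the documents it builds through the configured update processor chain, so a custom processor can usually be wired in via solrconfig.xml rather than by extending the handler. A sketch (the custom factory class name is a placeholder):

```xml
<!-- Custom processor runs before the standard log/run processors. -->
<updateRequestProcessorChain name="enrich">
  <processor class="com.example.TagEnrichingProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- Point /update/extract at the custom chain via update.chain. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="update.chain">enrich</str>
  </lst>
</requestHandler>
```

That said, Erick's point stands: running Tika and the custom code outside Solr is the more robust option.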


How extractingrequest handler works?

2017-12-08 Thread Sreenivas.T
All,

How does ExtractingRequestHandler internally index Tika-extracted content?
Does it internally call the update processor chain?

I have a custom update document processor that needs to work on the
Tika-extracted content and call an API.

Do I need to extend ExtractingRequestHandler and customize it, or can I call
my custom update processor from ExtractingRequestHandler?

Sreenivas


Re: Calling rest API from Solr custom tokenizer plugin

2017-12-06 Thread Sreenivas.T
Thanks Doug. Now I think it's better to customize ManifoldCF's output
connector for Solr.

Sreenivas
On Thu, Dec 7, 2017 at 10:01 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> A tokenizer plugin is probably not what you want, you probably want
> something more like an UpdateProcessor that can manipulate the whole
> document as it comes into Solr. Or you may want to avoid having a Solr
> plugin call an API at all and do this work outside of Solr (for example,
> what happens when the API is down: should doc updates fail?).
>
> A tokenizer plugin would definitely not be recommended. Tokenizers need to
> be fast, low-level code that splits text into tokens based on readily
> accessible config and data. The overhead of a network call would be far
> too high.
>
> You probably want to put your extracted tags into a different field
> anyway, and a tokenizer only works on text within a single field.
>
> -Doug
>
> On Wed, Dec 6, 2017 at 10:57 PM Sreenivas.T  wrote:
>
> > All,
> >
> > I need help from the experts. We are trying to build a cognitive search
> > platform over enterprise content from sources like SharePoint, file
> > shares, etc. Before the content gets indexed into Solr, I need to call
> > our internal AI platform to get additional metadata, such as
> > classification tags.
> >
> > I'm planning to leverage ManifoldCF to get the content from the sources,
> > and to write a custom tokenizer plugin to send the content to the AI
> > platform, which in turn returns the additional tags. I'll index the
> > additional tags dynamically through the plugin code.
> >
> > Is this a feasible solution? Is there any other way to achieve the same?
> > I was planning not to customize ManifoldCF.
> >
> > Please suggest.
> >
> >
> >
> >
> > Regards,
> > Sreenivas
> >
> --
> Consultant, OpenSource Connections. Contact info at
> http://o19s.com/about-us/doug-turnbull/; Free/Busy (
> http://bit.ly/dougs_cal)
>
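A minimal sketch of the "outside of Solr" option Doug suggests: enrich each document with the externally produced tags before it is ever sent to Solr. The AI endpoint is hypothetical, so it is passed in as a plain callable, and `tags_ss` is just an example dynamic multi-valued string field:

```python
def enrich(doc, tag_service):
    """Merge tags returned by an external service into the document
    before indexing. `tag_service` is any callable taking the document
    text and returning a list of tags (stands in for the AI platform)."""
    tags = tag_service(doc.get("content", ""))
    enriched = dict(doc)  # leave the original document untouched
    enriched["tags_ss"] = tags
    return enriched
```

Because the call happens in the indexing pipeline (e.g. in a ManifoldCF output connector or a standalone feeder), a tagging-service outage can be retried there without affecting Solr itself.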


Calling rest API from Solr custom tokenizer plugin

2017-12-06 Thread Sreenivas.T
All,

I need help from the experts. We are trying to build a cognitive search
platform over enterprise content from sources like SharePoint, file shares,
etc. Before the content gets indexed into Solr, I need to call our internal
AI platform to get additional metadata, such as classification tags.

I'm planning to leverage ManifoldCF to get the content from the sources, and
to write a custom tokenizer plugin to send the content to the AI platform,
which in turn returns the additional tags. I'll index the additional tags
dynamically through the plugin code.

Is this a feasible solution? Is there any other way to achieve the same? I
was planning not to customize ManifoldCF.

Please suggest.



Regards,
Sreenivas


Re: Provide suggestion on indexing performance

2017-09-14 Thread Sreenivas.T
I agree with Tom. DocValues and stored fields exist for different reasons.
DocValues is an additional structure that gets built for faster
sorting/faceting.

On Wed, Sep 13, 2017 at 11:30 PM Tom Evans  wrote:

> On Tue, Sep 12, 2017 at 4:06 AM, Aman Tandon 
> wrote:
> > Hi,
> >
> > We would like to know about indexing performance in the two scenarios
> > below. Consider a total of 10 string fields and 10 million documents.
> >
> > 1) indexed=true, stored=true
> > 2) indexed=true, docValues=true
> >
> > Which one should we prefer in terms of indexing performance? Please
> > share your experience.
> >
> > With regards,
> > Aman Tandon
>
> Your question doesn't make much sense. You turn on stored when you
> need to retrieve the original contents of the fields after searching,
> and you use docvalues to speed up faceting, sorting and grouping.
> Using docvalues to retrieve values during search is more expensive
> than simply using stored values, so if your primary aim is retrieving
> stored values, use stored=true.
>
> Secondly, the only way to answer performance questions for your schema
> and data is to try it out. Generate 10 million docs, store them in a
> file (e.g. as CSV), and then use the post tool to try different schema
> and query options.
>
> Cheers
>
> Tom
>
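To make the trade-off concrete, a schema sketch (field names are examples): enable stored on fields you need to return in results, and docValues on fields you facet, sort, or group on.

```xml
<!-- Returned in search results: stored. -->
<field name="title" type="string" indexed="true" stored="true"/>
<!-- Faceted/sorted on but not displayed: docValues, not stored. -->
<field name="brand" type="string" indexed="true" stored="false"
       docValues="true"/>
```

A field can of course have both, at the cost of extra index size and indexing work.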