Schema/Index design for disparate data sources (Federated / Google like search)

2015-12-22 Thread Susheel Kumar
Hello,

I am going through a few use cases where we have multiple disparate data
sources which in general don't have many common fields, and I was thinking
of designing a different schema/index/collection for each of them, querying
each separately, and providing different result sets to the client.

I have seen one implementation where all the different fields from these
disparate data sources are put together in a single schema/index/collection
so that it can be searched easily using a catch-all field, but it had
200+ fields including copy fields. The problem I see with this design is
that ingestion will be slower (and scaling harder), as many of the fields for
one data source will not be applicable when ingesting another data source.
Basically everything is being dumped into one huge schema/index/collection.

Given the above, I am wondering how we can design this better in
another implementation where we have the requirement to search across
disparate sources (each having 10-15 searchable fields &
10-15 stored fields) with only one common field, like description, in each of
the data sources. Most of the time the user will search on description,
and the rest of the time on a combination of different fields. This is
similar to a Google-like search where you search for "coffee" and it searches
various data sources (websites, maps, images, places, etc.).

My thought is to make separate indexes for each search scenario. For
example, for the single search box, we index description, the other key
fields which can be searched together, and the data source type into one
index/schema, so that we don't build one huge index/schema, and we use the
catch-all field for search.
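The separate lightweight "search box" index described above could carry documents shaped roughly like this; a minimal Python sketch, where all field names (`source_type`, `text_all`, the key fields) are illustrative assumptions, not an existing schema:

```python
# Build lightweight "search box" documents from heterogeneous source records.
# Only the shared/key fields plus a source_type discriminator are kept, so
# the unified index stays small. Field names are hypothetical.

def to_search_doc(record, source_type, key_fields=("title", "description")):
    doc = {"id": f"{source_type}-{record['id']}", "source_type": source_type}
    for f in key_fields:
        if f in record:
            doc[f] = record[f]
    # catch-all text field, analogous to a Solr copyField target
    doc["text_all"] = " ".join(str(record[f]) for f in key_fields if f in record)
    return doc

docs = [
    to_search_doc({"id": 1, "title": "Coffee shops", "description": "Best coffee"}, "places"),
    to_search_doc({"id": 9, "description": "Coffee beans"}, "products"),
]
```

Each document keeps its source type, so results can still be bucketed per source on the results page.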

And for the other, advanced (field-specific) search scenario, we create a
separate index/schema for each data source.

Any suggestions/guidelines on how we can better address this in terms of
responsiveness and scaling? Each data source may have 50-100+ million
documents.

Thanks,
Susheel


Re: Schema/Index design for disparate data sources (Federated / Google like search)

2015-12-22 Thread Jack Krupansky
Step one is to refine and more clearly state the requirements. Sure,
sometimes (most of the time?) the end user really doesn't know exactly what
they expect or want other than "Gee, I want to search for everything, isn't
that obvious??!!", but that simply means that an analyst is needed to
intervene before you leap to implementation. An analyst is someone who
knows how to interview all relevant parties (not just the approving
manager) to understand their true needs. I mean, who knows, maybe all they
really need is basic keyword search. Or... maybe they actually need a
full-blown data warehouse with precise access to each specific field of
each data source. Without knowing how refined user queries need to get,
there is little to go on here.

My other advice is to be careful not to overthink the problem - to imagine
that some complex solution is needed when the end users really only need to
do super basic queries. In general, managers are very poor when it comes to
analysis and requirement specification.

Do they need to do date searches on a variety of date fields?

Do they need to do numeric or range queries on specific numeric fields?

Do they need to do any exact match queries on raw character fields (as
opposed to tokenized text)?

Do they have fields like product names or numbers in addition to free-form
text?

Do they need to distinguish or weight titles from detailed descriptions?

You could have catchall fields for categories of field types like titles,
bodies, authors/names, locations, dates, numeric values. But... who
knows... this may be more than what an average user really needs.

As far as the concern about fields from different sources that are not
used, Lucene only stores and indexes fields which have values, so no
storage or performance is consumed when you have a lot of fields which are
not present for a particular data source.
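Jack's point about sparse fields means a single schema can declare fields for every source without a per-document penalty, as long as they are optional. A hedged illustration using the classic schema.xml syntax (field and type names here are assumptions):

```xml
<!-- Fields from different sources; a document that lacks a field pays
     nothing for it, since Lucene only indexes/stores values present. -->
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="source_type" type="string"       indexed="true" stored="true"/>
<field name="map_coords"  type="string"       indexed="true" stored="false"/>
<!-- catch-all target populated via copyField -->
<field name="text_all"    type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="description" dest="text_all"/>
```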

-- Jack Krupansky



Re: Schema/Index design for disparate data sources (Federated / Google like search)

2015-12-22 Thread Susheel Kumar
Thanks, Jack, for the various points. A question: when you have hundreds of
fields from different sources and also a lot of copyField
instructions for facets, sorting, or a catch-all, etc., you suffer some
performance hit during ingestion, as many of the copy instructions would
execute but do nothing since they have no data. Do you agree?

Assuming keyword search is required across the different data sources, with
results from each data source presented while the user is typing (instant /
auto-complete) in a single search box, and advanced (very field-specific)
search is required in the advanced search option, how do you suggest
designing the index/schema?

Let me know if I am missing any other info needed to get your thoughts.



Re: Google like search

2010-12-16 Thread satya swaroop
Hi All,

 Thanks for your suggestions. I got the result I expected.

Cheers,
Satya


Google like search

2010-12-14 Thread satya swaroop
Hi All,
 Can we get results like Google, showing some data about the
search? I was able to get the first 300 characters of a
file, but that is not helpful for me. Can I instead get the data around the
first occurrence of the search key in that file?

Regards,
Satya


Re: Google like search

2010-12-14 Thread Tanguy Moal
Hi Satya,

I think what you're looking for is called highlighting, in the sense
of highlighting the query terms in their matching context.

You could start by googling "solr highlight"; surely the first results
will make sense.

Solr's wiki results are usually a good entry point:
http://wiki.apache.org/solr/HighlightingParameters

Maybe I misunderstood your question, but I hope that'll help...
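The highlighting parameters on the wiki page above can be combined into a request like the following sketch; the core URL and field name are assumptions for illustration:

```python
from urllib.parse import urlencode

# Build a Solr query that asks for highlighted snippets instead of relying
# on a truncated stored field. Parameter names come from the
# HighlightingParameters wiki page; values are illustrative.
params = {
    "q": "java",
    "rows": 10,
    "hl": "true",          # turn highlighting on
    "hl.fl": "text",       # field(s) to build snippets from
    "hl.snippets": 3,      # up to 3 snippets per document
    "hl.fragsize": 100,    # roughly 100 characters per snippet
}
url = "http://localhost:8080/solr/select?" + urlencode(params)
```

`hl.fragsize` is what controls the snippet length, instead of copying a fixed 300-character prefix at index time.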

Regards,

Tanguy





Re: Google like search

2010-12-14 Thread satya swaroop
Hi Tanguy,
  I am not asking for highlighting. I think it can be
explained with an example; here I illustrate it:

When I post a query like this:

http://localhost:8080/solr/select?q=Java&version=2.2&start=0&rows=10&indent=on

I get the result as follows:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="filename">Java%20debugging.pdf</str>
      <str name="id">122</str>
      <arr name="text1">
        <str>
Table of Contents
If you're viewing this document online, you can click any of the topics
below to link directly to that section.
1. Tutorial tips 2
2. Introducing debugging 4
3. Overview of the basics 6
4. Lessons in client-side debugging 11
5. Lessons in server-side debugging 15
6. Multithread debugging 18
7. Jikes overview 20
        </str>
      </arr>
    </doc>
  </result>
</response>

Here the str field contains the first 300 characters of the file, as I kept a
field in schema.xml that copies only the first 300 characters.
But I don't want the content like this. Is there any way to make the output
as follows:

<str>Java is one of the best languages, java is easy to learn...</str>

where this content is from the start of the chapter where the first
occurrence of "java" appears in the file?

Regards,
Satya


RE: Google like search

2010-12-14 Thread Dave Searle
Highlighting is exactly what you need, although if you highlight the whole 
book, this could slow down your queries. Index/store the first 5000-1 
characters and see how you get on



Re: Google like search

2010-12-14 Thread Tanguy Moal
Satya,

In fact the highlighter will select the relevant part of the whole
text and return it with the matched terms highlighted.

If you do so for a whole book, you will face the issue spotted by Dave
(too long text).

To address that issue, you have the possibility to split your book in
chapters, and index each chapter as a unique document.

You would then be interested in adding a field to uniquely identify
each book (using its ISBN number, for example) and turning on grouping (or
collapsing) on that field (see this very good blog post:
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/
).

Moreover, you might be interested in the following JIRA issue:
https://issues.apache.org/jira/browse/SOLR-2272 . Using this patch,
you could for example ensure that if a given chapter-document is
selected by the query, then another document (or several, maybe a
parent book-document, or all the other chapters) gets selected along
the way (by doing a self-join on the ISBN number). Here again,
grouping afterward would return a group of documents representing each
book.
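The grouping suggestion above translates into query parameters roughly like this sketch; the field names (`isbn`, `text`) are assumptions:

```python
from urllib.parse import urlencode

# Hypothetical query illustrating the suggestion: index each chapter as its
# own document, then group results by the book's ISBN so each book shows up
# once with its best-matching chapters.
params = {
    "q": "text:java",
    "group": "true",            # enable result grouping (field collapsing)
    "group.field": "isbn",      # one group per book
    "group.limit": 3,           # top 3 matching chapters per book
    "hl": "true",               # highlight matched terms in the chapters
    "hl.fl": "text",
}
query_string = urlencode(params)
url = "http://localhost:8080/solr/select?" + query_string
```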

Good luck!

--
Tanguy




Re: Google like search

2010-12-14 Thread satya swaroop
Hi Tanguy,
 Thanks for your reply. Sorry to ask this type of question:
how can we index each chapter of a file as a separate document? As far as I
know, we just give the path of the file to Solr to index it. Can you provide
me any sources for this, I mean any blogs or wikis?

Regards,
satya


Re: Google like search

2010-12-14 Thread Tanguy Moal
To do so, you have several possibilities; I don't know if there is a best one.

It depends pretty much on the format of the input file(s), your
affinities with a given programming language, some libraries you might
need, and the time you're ready to spend on this task.

Consider having a look at SolrJ (http://wiki.apache.org/solr/Solrj)
or at the DataImportHandler
(http://wiki.apache.org/solr/DataImportHandler).
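Whatever client you choose, the splitting step itself is plain text processing. A minimal sketch (not SolrJ or DataImportHandler themselves) of turning one book file into per-chapter documents ready to post to Solr; the chapter-heading pattern and field names are assumptions:

```python
import re

def split_into_chapter_docs(text, isbn):
    # Split on lines that start a chapter; the heading pattern is hypothetical
    # and would need adapting to the real book format.
    chapters = re.split(r"(?m)^Chapter \d+", text)
    docs = []
    for i, body in enumerate(c.strip() for c in chapters if c.strip()):
        docs.append({
            "id": f"{isbn}-ch{i + 1}",
            "isbn": isbn,     # shared key so chapters can be grouped per book
            "chapter": i + 1,
            "text": body,
        })
    return docs

book = "Chapter 1\nJava basics...\nChapter 2\nDebugging tips..."
docs = split_into_chapter_docs(book, "978-0000000000")
```

Each dict is one Solr document, so highlighting then works on chapter-sized text instead of a whole book.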

Cheers,

--
Tanguy




Re: Google like search

2010-12-14 Thread Bhavnik Gajjar
Hi Satya,

Coming to your original question, there is one possibility to make Solr
emit snippets like Google. The Solr query syntax goes like:

http://localhost:8080/solr/DefaultInstance/select/?q=java&version=2.2&start=0&rows=10&indent=on&hl=true&hl.snippets=5&hl.fl=Field_Text&fl=Field_Text

Note that the key thing used here is the Highlighting feature provided by
Solr. Executing the above query will return two main blocks of
results. The first part contains the normal results, whereas the other part
contains highlighted snippets, based on the parameters provided in the
query. One should pick up the latter part (the snippets) and show it in the
result page UI.
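Pairing the two blocks on the UI side can be sketched as follows, assuming a JSON response (wt=json) whose shape and field names are illustrative:

```python
import json

# Hypothetical Solr JSON response body, trimmed to the two blocks described
# above: the normal results and the "highlighting" section keyed by doc id.
raw = json.dumps({
    "response": {"numFound": 1,
                 "docs": [{"id": "122", "filename": "Java debugging.pdf"}]},
    "highlighting": {"122": {"Field_Text": ["<em>Java</em> is easy to learn"]}},
})

data = json.loads(raw)
# Pair each hit with its snippets, as one would to build the result page UI.
snippets_by_doc = data.get("highlighting", {})
for doc in data["response"]["docs"]:
    snippets = snippets_by_doc.get(doc["id"], {}).get("Field_Text", [])
    print(doc["filename"], "->", snippets)
```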

Cheers,

Bhavnik Gajjar



