Recap on derived objects in Solr Index, 'schema in a can'

Dennis Gearon Mon, 20 Dec 2010 20:04:07 -0800

Based on more searches and manual consolidation, I've put together a summary
below of the ideas already suggested for this. The last item in the summary
seems to be an interesting, low-technical-cost way of doing it.


Basically, it treats the index like a 'BigTable', a la "NoSQL".

Erick Erickson pointed out: 
"...but there's absolutely no requirement 
that all documents in SOLR have the same fields..."

I guess I don't have the right understanding of what goes into a Document
in Solr. Is it just a set of fields, each with its own independent field type
declaration/id, its name, and its content?

So even though there's a schema for an index, one could ignore it and
just throw any other named fields and types and content at it at document
addition time?

So if I wanted to search on a base set of fields that all documents have, I
could then additionally filter on the (this might be the wrong use of the
term) dynamic fields?
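
To make that last question concrete, here is a minimal sketch of the request
parameters it implies: a main query (q) on a field every document shares, plus
a filter query (fq) on a field only some documents carry. The field names
`type` and `screen_s` are hypothetical, and the sketch assumes the schema
declares a dynamicField pattern such as `*_s`. As far as I understand it, the
schema is not ignored outright: a field name still has to match a declared
field or a dynamicField pattern, or the document is rejected.

```python
from urllib.parse import urlencode, parse_qs


def build_select_params(base_query, extra_filters):
    """Return URL-encoded parameters for a Solr select request.

    base_query    -- query on fields that every document has
    extra_filters -- dict of field name -> value; each becomes one fq
    """
    params = [("q", base_query)]
    # Sort so the output is stable regardless of dict ordering.
    for field, value in sorted(extra_filters.items()):
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)


# Hypothetical field names: 'type' is shared by all documents,
# 'screen_s' exists only on some of them (matched by a '*_s' pattern).
query_string = build_select_params("type:product", {"screen_s": "LCD"})
print(query_string)
```

Documents that lack `screen_s` simply do not match the filter, so they drop
out of the result without any special handling.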






Original thread that I started:
----------------------------------------
http://lucene.472066.n3.nabble.com/A-schema-inside-a-Solr-Schema-Schema-in-a-can-tt2103260.html

-----------------------------------------------------------------------------------------------------

Repeat of the problem (not actual ratios or numbers, i.e. it could be WORSE!):
-----------------------------------------------------------------------------------------------------


1/ Base object of some kind, x number of fields
2/ Derived objects representing divisions in a company, different customer
    bases, etc., each having 2 additional, unique fields.
3/ Assume 1000 such derived object types
4/ A 'flattened' index would have the x base object fields,
    ****and 2000**** additional fields
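
The arithmetic behind item 4/ can be sketched as follows. The value x = 10 is
purely illustrative (the post does not give a number for x); the 1000 types
and 2 extra fields per type come from the list above.

```python
def flattened_field_count(base_fields, derived_types, extras_per_type):
    """Total columns when every derived type gets its own extra columns."""
    return base_fields + derived_types * extras_per_type


def populated_fraction(base_fields, derived_types, extras_per_type):
    """Share of the flattened columns that a single document actually fills."""
    used = base_fields + extras_per_type
    total = flattened_field_count(base_fields, derived_types, extras_per_type)
    return used / total


x = 10  # illustrative only; not a figure from the original post
print(flattened_field_count(x, 1000, 2))
print(populated_fraction(x, 1000, 2))
```

Under these assumptions each document fills only its x base fields plus its
own 2 extras, so nearly all of the flattened columns are empty for any given
document.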

 
================================================
Solutions Posited
-----------------------

A/ First thought: multi-valued columns as key/value pairs.
      1/ Difficult to access individual items of more than one 'word' length
         for querying in multivalued fields.
      2/ All sorts of statistical stuff probably wouldn't apply?
      3/ (James Dayer said:) There's also one "gotcha" we've experienced when
         searching across multi-valued fields: SOLR will match across field
         occurrences. In the example below, if you were to search
         q=contrib_name:(james AND smith), you will get this record back. It
         matches one name from one contributor and another name from a
         different contributor. This is not what our users want.

         As a work-around, I am converting these to phrase queries with
         slop: "james smith"~50 ... Just use a slop # smaller than your
         positionIncrementGap and bigger than the # of terms entered. This
         will prevent the cross-field matches yet allow the words to occur in
         any order.

         The problem with this approach is that Lucene doesn't support
         wildcards in phrases.
B/ Dynamic fields were suggested, but I am not sure exactly how they
        work, and the person who suggested them was not sure they would
        work, either.
C/ Different field naming conventions were suggested if field types were
        similar. I can't predict that.
D/ Found this old thread, and it had other suggestions:
       1/ Use multiple cores, one for each record type/schema, and aggregate
          them during the query.
       2/ Use a fixed number of additional fields X 2. Each additional field
          is actually a pair of fields. The first of the pair gives the
          column name, the second gives the data.
            a) Although I like this, I wonder how many extra fields to use.
            b) It was pointed out that relevancy and other statistical
               criteria for queries might suffer.
       3/ Index the different objects exactly as they are, i.e. as Erick
          Erickson said:
          "I'm not entirely sure this is germane, but there's absolutely no
          requirement that all documents in SOLR have the same fields. So
          it's possible for you to index the "wildly different content" in
          "wildly different fields" <G>. Then searching for screen:LCD would
          be straightforward."...
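
The slop work-around quoted in A/ 3/ can be sketched as a small query builder
that enforces the stated rule: the slop must be bigger than the number of
terms and smaller than the field's positionIncrementGap. The function name and
the gap value of 100 are assumptions for illustration; the real gap is
whatever the field type in the schema declares.

```python
def sloppy_phrase(field, terms, slop, position_increment_gap):
    """Build a phrase query with slop for a multi-valued field.

    Raises ValueError unless len(terms) < slop < position_increment_gap,
    which is the rule of thumb given in the quoted work-around.
    """
    if not len(terms) < slop < position_increment_gap:
        raise ValueError(
            "slop must be larger than the number of terms and "
            "smaller than positionIncrementGap"
        )
    phrase = " ".join(terms)
    return f'{field}:"{phrase}"~{slop}'


# A gap of 100 is an assumed value, not one taken from the original thread.
print(sloppy_phrase("contrib_name", ["james", "smith"], 50, 100))
```

A slop equal to or larger than the gap is rejected here because it would allow
the phrase to span two separate values again, which is the very problem the
work-around is meant to avoid.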
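
To compare D/ 2/ (fixed name/data field pairs) with D/ 3/ (index the objects
as they are), here is a rough sketch of how one derived object might be laid
out under each option. All field names, including the `extra_name_N` and
`extra_value_N` slot names, are hypothetical and not from the thread.

```python
def encode_as_pairs(base, extras, slots):
    """Option D/ 2/: store each extra field as a (name, value) pair of slots."""
    if len(extras) > slots:
        raise ValueError("more extra fields than reserved slots")
    doc = dict(base)
    # Sort so a given object always lands in the same slots.
    for i, (name, value) in enumerate(sorted(extras.items()), start=1):
        doc[f"extra_name_{i}"] = name    # first of the pair: column name
        doc[f"extra_value_{i}"] = value  # second of the pair: the data
    return doc


def encode_as_is(base, extras):
    """Option D/ 3/: every extra field keeps its own name."""
    doc = dict(base)
    doc.update(extras)
    return doc


base = {"id": "tv-1", "name": "Example TV"}      # hypothetical base fields
extras = {"screen": "LCD", "size_inches": 42}    # hypothetical derived fields
print(encode_as_pairs(base, extras, slots=2))
print(encode_as_is(base, extras))
```

With the pair layout, a search has to name the slot (or try every slot) to
find screen = LCD, whereas the as-is layout allows the plain screen:LCD query
from Erick's quote.
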
Dennis Gearon


Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a
better idea to learn from others’ mistakes, so you do not have to make them
yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.
