Re: SpanNearQuery -- bug or feature?

2014-01-13 Thread Gregory Dearing
Piotr,

The 'unordered' flag allows spans to be overlapping and still be a match.
I believe this is a feature.

It may seem unusual for a term to be 'near' itself, but it may be more
intuitive if you consider spans that are more than one term long.

spanNear(
[spanNear([contents:test, contents:bunga], 0, true),
 spanNear([contents:bunga, contents:test], 0, true)],
10, false
)

This is searching for two phrases, as long as they're reasonably 'close'.
It should match your first example document even though the sub-spans
overlap on the term 'bunga'.
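
For reference, that nested query can be built with the span API along these
lines (a sketch reusing the 'contents' field from your example):

    // Two ordered two-term phrases; the unordered outer query lets their
    // spans overlap on the shared term 'bunga'.
    SpanQuery phrase1 = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("contents", "test")),
            new SpanTermQuery(new Term("contents", "bunga"))}, 0, true);
    SpanQuery phrase2 = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("contents", "bunga")),
            new SpanTermQuery(new Term("contents", "test"))}, 0, true);
    SpanNearQuery nested = new SpanNearQuery(
            new SpanQuery[] {phrase1, phrase2}, 10, false);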

Also, Mark Miller wrote a really nice article on span mechanics that may be
helpful: http://searchhub.org/2009/07/18/the-spanquery/

-Greg


On Fri, Jan 10, 2014 at 7:01 PM, Piotr Pęzik  wrote:

> Hi,
>
> could anyone please tell me if the following behavior is expected in
> Lucene 4.5?
>
> Let's assume we have an index with two documents:
>
> 1. contents: "test bunga bunga test"
> 2. contents: "test bunga test"
>
> We run two SpanNearQueries against this index:
>
> 1. spanNear([contents:bunga, contents:bunga], 0, true)
> 2. spanNear([contents:bunga, contents:bunga], 0, false)
>
> For the first query we get 1 hit. The first document in the example above
> gets matched and the second one doesn't. This makes sense, because we want
> the term "bunga" followed by another "bunga" here.
>
> For the second query both documents get matched. Why does the second
> document with a single occurrence of 'bunga' get matched?
>
> A complete example follows.
>
> Thanks in advance!
>
>
>
> Piotr
>
>
> ---
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.TextField;
> import org.apache.lucene.index.DirectoryReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.search.spans.SpanNearQuery;
> import org.apache.lucene.search.spans.SpanQuery;
> import org.apache.lucene.search.spans.SpanTermQuery;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.util.Version;
> import java.io.StringReader;
> import static org.junit.Assert.assertEquals;
>
> class SpansBug {
>
> public static void main(String [] args) throws Exception {
>
> Directory dir = new RAMDirectory();
> Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
> IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45,
> analyzer);
>
> IndexWriter writer = new IndexWriter(dir, iwc);
> String contents = "contents";
> Document doc1 = new Document();
> doc1.add(new TextField(contents, new StringReader("test bunga bunga test")));
> Document doc2 = new Document();
> doc2.add(new TextField(contents, new StringReader("test bunga test")));
>
> writer.addDocument(doc1);
> writer.addDocument(doc2);
>
> writer.commit();
>
> IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
>
> SpanQuery stq1 = new SpanTermQuery(new Term(contents,"bunga"));
> SpanQuery stq2 = new SpanTermQuery(new Term(contents,"bunga"));
> SpanQuery [] spqa = new SpanQuery[]{stq1,stq2};
>
> SpanNearQuery spanQ1 = new SpanNearQuery(spqa,0, true);
> SpanNearQuery spanQ2 = new SpanNearQuery(spqa,0, false);
>
> System.out.println(spanQ1);
>
> TopDocs tdocs1 = searcher.search(spanQ1,10);
> assertEquals(tdocs1.totalHits ,1);
>
> System.out.println(spanQ2);
>
> TopDocs tdocs2 = searcher.search(spanQ2,10);
> //Why does the following assertion fail?
> assertEquals(tdocs2.totalHits ,1);
>
>
> }
> }
>
>


Re: SpanTermQuery getSpans

2014-04-01 Thread Gregory Dearing
Martin,

Note that the documents within each index segment (leaf context) are
zero-indexed, meaning that each segment in your index will contain a
different document with a segment-relative docId of 0.

When working in a leaf context, you can calculate a document's absolute
docId with something like: "int absoluteDocId = context.docBase + docId;".
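
A minimal sketch of that (Lucene 4.6 APIs; it assumes 'query' is your
SpanQuery and 'indexReader' is your top-level reader):

    // Walk every leaf, not just leaves().get(0), so hits in later segments
    // are not missed; map segment-relative docIds back to absolute docIds.
    for (AtomicReaderContext leaf : indexReader.leaves()) {
        Spans spans = query.getSpans(leaf, leaf.reader().getLiveDocs(),
                new HashMap<Term, TermContext>());
        while (spans.next()) {
            int absoluteDocId = leaf.docBase + spans.doc();
            // spans.start(), spans.end() and any payloads belong to this hit
        }
    }

Passing the live docs as acceptDocs also keeps deleted documents out of the
spans.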

-Greg



On Tue, Apr 1, 2014 at 5:52 PM, Martin Líška  wrote:

> Dear all,
>
> I'm experiencing troubles with SpanTermQuery.getSpans(AtomicReaderContext
> context, Bits acceptDocs, Map termContexts) method in
> version 4.6. I want to use it to retrieve payloads of matched spans.
>
> First, I search the index with IndexSearcher.search(query, limit) and I get
> TopDocs. In these TopDocs, there is a certain document A. I know that the
> query is an instance of PayloadTermQuery, so I search for spans
> using query.getSpans(indexReader.leaves().get(0), null, new HashMap()); but
> this won't return the spans for document A.
>
> I observed that the getSpans method won't return any spans for documents with
> IDs greater than, say, 900, even though documents with IDs greater than 900
> were returned in the original search. All other documents below ID 900 are
> returned successfully from the getSpans method.
>
> I also tried passing all the leaves of indexReader to getSpans with no
> effect.
>
> Please help.
>
> Thank you
>


Re: Avoid memory issues when indexing terms with multiplicity

2014-04-04 Thread Gregory Dearing
Hi David,

I'm not an expert, but I've climbed through the consumers myself in the
past.  The big limit is that the full postings for a document or document
block must fit into memory.  There may be other hidden processing limits
(ie. memory used per-field).

I think it would be possible to create a custom consumer chain that avoids
these limits, but it would be a lot of work.

My suggestions would be...

1.) If you're able to index your documents when not expanding terms,
consider whether expansion is really necessary.

If you're expanding them for relevance purposes, then consider storing the
frequency as a payload.  You can use something like PayloadTermQuery and
Similarity.scorePayload() to adjust scoring based on the value.  I wouldn't
expect this to noticeably affect query times but, of course, it will depend
on your use case.
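
As a rough sketch of that idea (untested; Lucene 4.x, the filter class name is
hypothetical, and it assumes the incoming tokens still look like "dog:3"):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    // Emits each term once and carries the count in a payload instead of
    // repeating the term, so huge fields no longer blow up the postings.
    final class TermFreqPayloadFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

        TermFreqPayloadFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String token = termAtt.toString();
            int sep = token.lastIndexOf(':');
            if (sep > 0) {
                // "dog:3" -> index the single term "dog" with its count as a payload
                float freq = Float.parseFloat(token.substring(sep + 1));
                termAtt.setLength(sep);
                payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(freq)));
            }
            return true;
        }
    }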

2.) I think you could override your TermsConsumer's implementation of
finishTerm() to rewrite "dog:3" as "dog" and multiply Term Frequency by 3,
right before the term is written to the postings.  This is not for the
faint of heart, and I wouldn't recommend trying unless #1 doesn't meet your
needs.

-Greg



On Fri, Apr 4, 2014 at 6:16 AM, Dávid Nemeskey  wrote:

> Hi guys,
>
> I have just recently (re-)joined the list. I have an issue with indexing;
> I hope
> someone can help me with it.
>
> The use-case is that some of the fields in the document are made up of
> term:frequency pairs. What I am doing right now is to expand these with a
> TokenFilter, so that for e.g. "dog:3 cat:2", I return "dog dog dog cat
> cat", and
> index that. However, the problem is that when these fields contain real
> data
> (anchor text, references, etc.), the resulting field texts for some
> documents
> can be really huge; so much in fact, that I get OutOfMemory exceptions.
>
> I would be grateful if someone could tell me how this issue could be
> solved. I
> thought of circumventing the problem by maximizing the frequency I allow or
> using the logarithm thereof, but it would be nice to know if there is a
> proper
> solution for the problem. I have had a look at the code, but got lost in
> all the
> different Consumers. Here are a few questions I have come up with, but the
> real
> solution might be something entirely different...
>
> 1. Is there information on how much using payloads (and hence positions)
> slow
> down querying?
> 2. Provided that I do not want payloads, can I extend something (perhaps a
> Consumer) to achieve what I want?
> 3. Is there a documentation somewhere that describes how indexing works,
> which
> Consumer, Writer, etc. is invoked when?
> 4. Am I better off by just post-processing indices, perhaps by writing the
> frequency to a payload during indexing, and then run through the index,
> remove
> the payloads and positions and writing the posting lists myself?
>
> Thank you very much.
>
> Best,
> Dávid Nemeskey


Re: BooleanScorer - Maximum Prohibited Scorers?

2014-04-17 Thread Gregory Dearing
David,

Any document that matches a MUST_NOT clause will not match the
BooleanQuery.  By definition.

This means that "maximumNumberMustNotMatch" is effectively hardcoded to
zero.

-Greg


On Wed, Apr 16, 2014 at 3:59 PM, David Stimpert wrote:

> Hello,
> I have found useful functionality in BooleanQuery which allows me to
> specify a minimum number of matching optional terms
> (i.e. setMinimumNumberShouldMatch).  I do not, however, see similar
> functionality available for setting the maximum number of MUST_NOTs (i.e.
> setMaximumNumberMustNotMatch).  I am starting to look into how I could
> customize this functionality.  Does this seem feasible?  Do you foresee any
> major challenges?  Any advice?
>


Re: BooleanScorer - Maximum Prohibited Scorers?

2014-04-17 Thread Gregory Dearing
David,

I believe I misunderstood your question in my earlier response.

I think you can create a logical "MaximumNumberMustNotMatch" by nesting
Boolean Queries.

1.) Create a BooleanQuery using 'SHOULD' clauses and set its Minimum Number
Should Match.

2.) Wrap that BooleanQuery in a 'MUST_NOT' BooleanClause.

3.) Add the negating BooleanClause to a second, outer BooleanQuery, as
sketched below.
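
Something along these lines (a sketch against the pre-5.0 mutable BooleanQuery
API; 'mainQuery' and the field/term names are placeholders):

    // Matches documents containing at least three of the prohibited terms...
    BooleanQuery tooManyBad = new BooleanQuery();
    tooManyBad.add(new TermQuery(new Term("body", "bad1")), BooleanClause.Occur.SHOULD);
    tooManyBad.add(new TermQuery(new Term("body", "bad2")), BooleanClause.Occur.SHOULD);
    tooManyBad.add(new TermQuery(new Term("body", "bad3")), BooleanClause.Occur.SHOULD);
    tooManyBad.setMinimumNumberShouldMatch(3);   // desired maximum (2) plus one

    // ...and is then used to veto those documents from the outer query.
    BooleanQuery outer = new BooleanQuery();
    outer.add(mainQuery, BooleanClause.Occur.MUST);
    outer.add(tooManyBad, BooleanClause.Occur.MUST_NOT);

Any document matching three or more of the prohibited terms is rejected, which
behaves like a maximum of two MUST_NOT matches.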

-Greg






On Thu, Apr 17, 2014 at 10:40 AM, Gregory Dearing wrote:

> David,
>
> Any document that matches a MUST_NOT clause will not match the
> BooleanQuery.  By definition.
>
> This means that "maximumNumberMustNotMatch" is effectively hardcoded to
> zero.
>
> -Greg
>
>
> On Wed, Apr 16, 2014 at 3:59 PM, David Stimpert wrote:
>
>> Hello,
>> I have found useful functionality in BooleanQuery which allows me to
>> specify a minimum number of matching optional terms
>> (i.e. setMinimumNumberShouldMatch).  I do not, however, see similar
>> functionality available for setting the maximum number of MUST_NOTs (i.e.
>> setMaximumNumberMustNotMatch).  I am starting to look into how I could
>> customize this functionality.  Does this seem feasible?  Do you foresee
>> any
>> major challenges?  Any advice?
>>
>
>


Re: Question about JoinUtil

2014-12-16 Thread Gregory Dearing
Glen,

Lucene isn't relational at heart and may not be the right tool for
what you're trying to accomplish. Note that JoinQuery doesn't join
'left' and 'right' answers; rather, it transforms a 'left' answer set
into a 'right' answer set.

JoinQuery is able to perform this transformation with a single extra
search, which wouldn't be possible if it accepted a 'toQuery'
argument.


That being said, here are some suggestions...

1. If all you really need is data from the 'right' set of answers (the
joined TO set), then you can just add more queries to perform
right-hand filtering.

   createJoinQuery(...) AND TermQuery("country", "CA*")

Note that 'left.name' in your SQL example is no longer available.
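
A sketch of that (Lucene 4.10-era API; the field names come from your SQL
example, 'searcher' is an IndexSearcher over the shared index, and I've spelled
out the extra hop through the 'join' documents):

    // Hop 1: 'left' docs (type:fooType, name:Fred*, ...) -> 'join' docs via left.id = join.leftId
    Query leftQuery = new TermQuery(new Term("type", "fooType"));
    Query toJoinDocs = JoinUtil.createJoinQuery(
            "id", false, "leftId", leftQuery, searcher, ScoreMode.None);

    // Hop 2: 'join' docs -> 'right' docs via join.rightId = right.id
    Query toRightDocs = JoinUtil.createJoinQuery(
            "rightId", false, "id", toJoinDocs, searcher, ScoreMode.None);

    // Right-hand filtering with an ordinary BooleanQuery instead of a 'toQuery'.
    BooleanQuery rightSide = new BooleanQuery();
    rightSide.add(toRightDocs, BooleanClause.Occur.MUST);
    rightSide.add(new TermQuery(new Term("type", "barType")), BooleanClause.Occur.MUST);
    rightSide.add(new WildcardQuery(new Term("country", "Ca*")), BooleanClause.Occur.MUST);
    TopDocs hits = searcher.search(rightSide, 10);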

2. If you really need to filter both sides, and you need to return
data from both sides, it probably requires some programming.  In
pseudo-code...

  leftAnswerSet = searcher.search(fromQuery)

  foreach leftAnswer in leftAnswerSet {
rightAnswers = searcher.search(leftAnswer AND TermQuery("country", "CA*"))
results.add([leftAnswer, rightAnswers])
  }

This is obviously not very efficient, but I think it probably
represents what JoinQuery would look like if it allowed a 'toQuery'
capability and returned data from both sides of the join.

3. If you can denormalize your data into hierarchies, then you could
use index-time joining (BlockJoin) for better performance and easier
collecting of your grouped data.  This is really limiting if your
relationships are truly many to many.

Hope that helps,
Greg


On Tue, Dec 16, 2014 at 10:46 AM, Glen Newton  wrote:
> Anyone?
>
> On Thu, Dec 11, 2014 at 2:53 PM, Glen Newton  wrote:
>> Is there any reason JoinUtil (below) does not have a 'Query toQuery'
>> available? I was wanting to filter on the 'to' side as well. I feel I
>> am missing something here.
>>
>> To make sure this is not an XY problem, here is my use case:
>>
>> I have a many-to-many relationship. The left, join, and right 'table'
>> objects are all indexed in the same lucene index, with a field 'type'
>> to distinguish them.
>>
>> I need to do something like this:
>> select left.name, right.country from left, join, right where
>> left.type="fooType" and right.type="barType" and join.leftId=left.id
>> and join.rightId=right.id and left.name="Fred*" and
>> right.country="Ca*"
>>
>> Is JoinUtil the way to go?
>> Or should I roll my own?
>> Or am I indexing/using-Lucene incorrectly, thinking relational when
>> a different way to index or query would be better in an idiomatic
>> Lucene manner?  :-)
>>
>>
>> Thanks,
>> Glen
>>
>> https://lucene.apache.org/core/4_10_2/join/org/apache/lucene/search/join/JoinUtil.html
>>
>> public static Query createJoinQuery(String fromField,
>> boolean multipleValuesPerDocument,
>> String toField,
>> Query fromQuery,
>> IndexSearcher fromSearcher,
>> ScoreMode scoreMode)
>>  throws IOException
>



Re: including self-joins in parent/child queries

2014-12-16 Thread Gregory Dearing
Michael,

Note that the index doesn't contain any special information about
block-join relationships... it uses a convention that child docs are
indexed before parent docs (ie. the root doc in each hierarchy has the
largest docId in its block).

This means that it can 'join' to parents just by comparing child
docIds (from the subquery set) to the set of parent docIds.  A child's
parent is the closest parent docId that is larger than the child's
docId.

That explanation is all just to say... if your subquery matched a
parent, then joined to a parent set, and no exception was thrown, the
resulting answer will be in the NEXT BOOK.  (The closest docId in the
parent set that is larger than a parent's own docId will come from
another document block.)

I would suggest using different field names for each level of a block
hierarchy, just so you can be sure what level your original query
actually hits.  You could accomplish the same by adding a 'docType'
field.

In your case, you might consider pushing your 'Book' level fields into
a special child doc.  For example, your Book document could have no
searchable fields; its children could include both 'Chapter' child
docs and also a 'BookMetadata' child doc.
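
For what it's worth, indexing one 'Book' block along those lines might look
roughly like this (Lucene 4.x; the field names and the chapterTexts,
bookDescription and bookId inputs are placeholders):

    List<Document> block = new ArrayList<Document>();

    // Child docs first: one per chapter.
    for (String chapterText : chapterTexts) {
        Document chapter = new Document();
        chapter.add(new StringField("docType", "chapter", Field.Store.NO));
        chapter.add(new TextField("text", chapterText, Field.Store.NO));
        block.add(chapter);
    }

    // One 'BookMetadata' child carrying the book-level searchable text.
    Document metadata = new Document();
    metadata.add(new StringField("docType", "bookMetadata", Field.Store.NO));
    metadata.add(new TextField("text", bookDescription, Field.Store.NO));
    block.add(metadata);

    // Root/parent doc last in the block, with no searchable text of its own.
    Document parent = new Document();
    parent.add(new StringField("docType", "book", Field.Store.YES));
    parent.add(new StringField("bookId", bookId, Field.Store.YES));
    block.add(parent);

    writer.addDocuments(block);   // addDocuments() keeps the block contiguous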

-Greg




On Tue, Dec 16, 2014 at 10:42 AM, Michael Sokolov
 wrote:
> OK - I see looking at the code that an exception is thrown if a parent doc
> matches the subquery -- so that explains what will happen, but I guess my
> further question is -- is that necessary? Could we just not throw an
> exception there?
>
> -Mike
>
>
> On 12/16/2014 10:38 AM, Michael Sokolov wrote:
>>
>> I see in the docs of ToParentBlockJoinQuery that:
>>
>>  * The child documents must be orthogonal to the parent
>>  * documents: the wrapped child query must never
>>  * return a parent document.
>>
>> First, it would be helpful if the docs explained what would happen if that
>> assumption were violated.
>>
>> Second, I want to do that!
>>
>> My parent documents have the same fields as their child documents (title,
>> text, etc): in some cases the best match for a query is the entire book, (ie
>> a query for "Java Programming"), in other cases it is a specific chapter (a
>> query for "Java regular expressions").
>>
>> Currently I am using Solr grouping queries to roll up parent and child,
>> but I am hoping to get a performance boost by using the parent/child
>> indexing which is a natural for us since we always index a book at a time.
>>
>> If need be, I will simply index a child document that represents the
>> parent (ie duplicate the parent document but with a different type so as to
>> exclude it from the join subquery), but is this really necessary? If so, can
>> you explain why?
>>
>>
>> Thanks
>>
>> -Mike
>>
>
>



Re: including self-joins in parent/child queries

2014-12-17 Thread Gregory Dearing
Michael,

I think I understand your point.  I should mention that I'm just a user of
BlockJoin... a year ago, I was doing the same tests you are now and was
just trying to share my observations. :)

I think either rule would be reasonable, but I probably prefer that it throws
an exception... this helps provide some structure in a mechanic that already
gets a little crazy.

Also... your approach sounds fine, but I'd still like to suggest that best
practice is to ensure subqueries can only match one 'type' of document.
This becomes important if you have anything more complex than a flat
hierarchy.

As an example, suppose you had the following relationships...

1.) A 'Book' has 'Chapters', which have 'Paragraphs', which are searched
via the 'text' field.
2.) A 'Book' may have an 'Appendix', which is searched via the 'text' field.

A query like ((text:apple JoinTo type:chapter) AND (text:tree JoinTo
type:chapter)) JoinTo type:book" would seem like a reasonable way to find
books where both terms occurred in the same chapter.

But what happens is that the term queries will hit Appendices, each of
which will be joined to the first Chapter in the next Book.  The search
might return Books whose first chapter has the word 'tree', because the
previous book's Appendix had the word 'apple'. :)

It's equally possible to accidentally create a 'ToUncleJoin' or
'ToCousinJoin'.

Just my two cents,
Greg


On Tue, Dec 16, 2014 at 8:42 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:
>
> Looking at the code, there are explicit checks for if (childId ==
> parentId) throw an exception ...
>
> It seems to me that instead, the logic *could* be if (childId ==
> parentId) then --- accumulate the parentId as if it were a child *and*
> terminate the block.
>
> In your phraseology, we could change "A child's parent is the closest
> parent docId that is larger than the child's docId." to "A child's parent
> is the closest parent docId that is greater than or equal to the child's
> docId."
>
> I agree that the solution to my problem (without changing Lucene) is to
> index the parent doc fields in a new child doc (we use a docType field to
> distinguish -- changing the names of all the fields at this point would be
> kind of painful).  But I was just curious whether there was any reason in
> principle that a doc could not be its own parent.
>
> -Mike
>
>
>
> On 12/16/2014 8:20 PM, Gregory Dearing wrote:
>>
>> Michael,
>>
>> Note that the index doesn't contain any special information about
>> block-join relationships... it uses a convention that child docs are
>> indexed before parent docs (ie. the root doc in each hierarchy has the
>> largest docId in its block).
>>
>> This means that it can 'join' to parents just by comparing child
>> docIds (from the subquery set) to the set of parent docIds.  A child's
>> parent is the closest parent docId that is larger than the child's
>> docId.
>>
>> That explanation is all just to say... if your subquery matched a
>> parent, then joined to a parent set, and no exception was thrown, the
>> resulting answer will be in the NEXT BOOK.  (The closest docId that is
>> larger than a parent's docId in the parent set, will be from another
>> document block)
>>
>> I would suggest using different field names for each level of a block
>> hierarchy, just so you can be sure what level your original query
>> actually hits.  You could accomplish the same by adding a 'docType'
>> field.
>>
>> In your case, you might consider pushing your 'Book' level fields into
>> a special child doc.  For example, your Book document could have no
>> searchable fields; its children could include both 'Chapter' child
>> docs and also a 'BookMetadata' child doc.
>>
>> -Greg
>>
>>
>>
>>
>> On Tue, Dec 16, 2014 at 10:42 AM, Michael Sokolov
>>  wrote:
>>>
>>> OK - I see looking at the code that an exception is thrown if a parent doc
>>> matches the subquery -- so that explains what will happen, but I guess my
>>> further question is -- is that necessary? Could we just not throw an
>>> exception there?
>>>
>>> -Mike
>>>
>>>
>>> On 12/16/2014 10:38 AM, Michael Sokolov wrote:
>>>>
>>>> I see in the docs of ToParentBlockJoinQuery that:
>>>>
>>>>   * The child documents must be orthogonal to the parent
>>>>   * documents: the wrapped child query must never
>>>>   * return a parent document.

Re: ToChildBlockJoinQuery question

2015-01-21 Thread Gregory Dearing
James,

I haven't actually run your example, but I think the root problem is that
your source query ("NT:American") is hitting documents that have no
children.

The reason the exception is so weird is that one of your index segments
contains zero documents that match your filter.  Specifically, there's an
index segment containing docs matching "NT:american", but with no documents
matching "AGTY:np".

This will cause CachingWrapperFilter, which normally returns a FixedBitSet,
to instead return a generic "Empty" DocIdSet, which leads to the exception
from ToChildBlockJoinQuery.

The summary is, make sure that your source query only hits documents that
were actually added using 'addDocuments()'.  Since it looks like you're
extracting your block relationships from the existing index, that might
mean that you'll need to add some extra metadata to the newly created docs
instead of just cloning what already exists.

-Greg


On Wed, Jan 21, 2015 at 10:00 AM, McKinley, James T <
james.mckin...@cengage.com> wrote:

> Hi,
>
> I'm attempting to use ToChildBlockJoinQuery in Lucene 4.8.1 by following
> Mike McCandless' blog post:
>
>
> http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
>
> I have a set of child documents which are named works and a set of parent
> documents which are named persons that are the creators of the named
> works.  The parent document has a nationality and the child document does
> not.  I want to query the children (named works) limiting by the
> nationality of the parent (named person).  I've indexed the documents as
> follows (I'm pulling the docs from an existing index):
>
> private void createNamedWorkIndex(String srcIndexPath, String
> destIndexPath) throws IOException {
> FSDirectory srcDir = FSDirectory.open(new
> File(srcIndexPath));
> FSDirectory destDir = FSDirectory.open(new
> File(destIndexPath));
>
> IndexReader reader = DirectoryReader.open(srcDir);
>
> Version version = Version.LUCENE_48;
> IndexWriterConfig conf = new IndexWriterConfig(version,
> new StandardTextAnalyzer(version));
>
> Set crids = getCreatorIds(reader);
>
> String[] crida = crids.toArray(new String[crids.size()]);
>
> int numThreads = 24;
> ExecutorService executor =
> Executors.newFixedThreadPool(numThreads);
>
> int numCrids = crids.size();
> int batchSize = numCrids / numThreads;
> int remainder = numCrids % numThreads;
>
> System.out.println("Inserting work/creator blocks using "
> + numThreads + " threads...");
> try (IndexWriter writer = new IndexWriter(destDir, conf)){
> for (int i = 0; i < numThreads; i++) {
> String[] cridRange;
> if (i == numThreads - 1) {
> cridRange =
> Arrays.copyOfRange(crida, i*batchSize, ((i+1)*batchSize - 1) + remainder);
> } else {
> cridRange =
> Arrays.copyOfRange(crida, i*batchSize, ((i+1)*batchSize - 1));
> }
> String id = "" + ((char)('A' + i));
> Runnable indexer = new IndexRunnable(id ,
> reader, writer, new HashSet(Arrays.asList(cridRange)));
> executor.execute(indexer);
> }
> executor.shutdown();
> executor.awaitTermination(2, TimeUnit.HOURS);
> } catch (Exception e) {
> executor.shutdownNow();
> throw new RuntimeException(e);
> } finally {
> reader.close();
> srcDir.close();
> destDir.close();
> }
>
> System.out.println("Done!");
> }
>
> public static class IndexRunnable implements Runnable {
> private String id;
> private IndexReader reader;
> private IndexWriter writer;
> private Set crids;
>
> public IndexRunnable(String id, IndexReader reader,
> IndexWriter writer, Set crids) {
> this.id = id;
> this.reader = reader;
> this.writer = writer;
> this.crids = crids;
> }
>
> @Override
> public void run() {
> IndexSearcher searcher = new IndexSearcher(reader);
>
> try {
> int count = 0;
> for (String crid : crids) {
> List docs = new
> ArrayList<>();
>
>

Re: ToChildBlockJoinQuery question

2015-01-21 Thread Gregory Dearing
Jim,

I think you hit the nail on the head... that's not what BlockJoinQueries do.

If you're wanting to search for children and join to their parents... then
use ToParentBlockJoinQuery, with a query that matches the set of children
and a filter that matches the set of parents.

If you're searching for parents, then joining to their children... then use
ToChildBlockJoinQuery, with a query that matches the set of parents and a
filter that matches the set of children.

When you add related documents to the index (via addDocuments), make sure
that children are added before their parents.

The reason all the above is necessary is that it makes it possible to have
a nested hierarchy of relationships (ie. Parents have Children, which have
Children of their own).  You need a query to indicate which part of the
hierarchy you're starting from, and a filter indicating which part of the
hierarchy you're joining to.

Also, you will always get an exception if your query and your filter both
match the same document.  A child can't be its own parent.

BlockJoin is a very powerful feature, but what it's really doing is
modelling relationships using an index that doesn't know what a
relationship is.  The relationships are determined by a combination of the
order that you indexed the block, and the format of your query.  This
disjoin can lead to some weird behavior if you're not absolutely sure how
it works.

Thanks,
Greg





On Wed, Jan 21, 2015 at 4:34 PM, McKinley, James T <
james.mckin...@cengage.com> wrote:

>
> Am I understanding how this is supposed to work?  What I think I am (and
> should be) doing is providing a query and filter that specifies the parent
> docs and the ToChildBlockJoinQuery should return me all the child docs for
> the resulting parent docs.  Is this correct?  The reason I think I'm not
> understanding is that I don't see why I need both a filter and a query to
> specify the parent docs when a single query or filter should suffice.  Am I
> misunderstanding what parentQuery and parentFilter mean, they both refer to
> parent docs right?
>
> Jim
>


Re: ToChildBlockJoinQuery question

2015-01-22 Thread Gregory Dearing
Mike,

I agree that it's not absolutely necessary to enforce children not being
their own parent.  I was just trying to describe the current
implementation, and why you were seeing exceptions.

The difference is mostly philosophical.  The advantage of the current
approach (in my opinion) is that the BlockJoin mechanic has a lot of
terrible edge cases if used naively, and enforcing "child can't be its own
parent" can help catch quite a few of them.

I had a discussion on this list on the same topic, which might be useful: Re:
including self-joins in parent/child queries
<http://mail-archives.apache.org/mod_mbox/lucene-java-user/201412.mbox/%3ccaasl1-_ppmcnq3apjjfbt3adb4pgaspve-8o5r9gv5kldpf...@mail.gmail.com%3E>

-Greg

On Wed, Jan 21, 2015 at 7:59 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> On 1/21/2015 6:59 PM, Gregory Dearing wrote:
>
>> Jim,
>>
>> I think you hit the nail on the head... that's not what BlockJoinQueries
>> do.
>>
>> If you're wanting to search for children and join to their parents... then
>> use ToParentBlockJoinQuery, with a query that matches the set of children
>> and a filter that matches the set of parents.
>>
>> If you're searching for parents, then joining to their children... then
>> use
>> ToChildBlockJoinQuery, with a query that matches the set of parents and a
>> filter that matches the set of children.
>>
>> When you add related documents to the index (via addDocuments), make that
>> children are added before their parents.
>>
>> The reason all the above is necessary is that it makes it possible to have
>> a nested hierarchy of relationships (ie. Parents have Children, which have
>> Children of their own).  You need a query to indicate which part of the
>> hierarchy you're starting from, and a filter indicating which part of the
>> hierarchy you're joining to.
>>
>> Also, you will always get an exception if your query and your filter both
>> match the same document.  A child can't be its own parent.
>>
> That's true for the existing implementation, but seems unnecessary from
> what I can tell.  See https://github.com/safarijv/
> ifpress-solr-plugin/blob/master/src/main/java/com/
> ifactory/press/db/solr/search/SafariBlockJoinQuery.java for a variant
> that allows a child to be its own parent.
>
> -Mike
>
>


Re: ToChildBlockJoinQuery question

2015-01-23 Thread Gregory Dearing
Hey Mike,

My fault... I wasn't paying attention and thought I was replying to a
response from James.  No wonder it reminded me of our last conversation. :)

-Greg

On Thu, Jan 22, 2015 at 10:37 AM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> Yeah I know -- we've been around this block before.  I agree that the
> whole block indexing/searching feature is a bit confusing, trappy and
> error-prone, and it may be helpful to have those boundary conditions as
> signposts, but in my case relaxing the restriction enabled me to execute
> the queries I want without having to write a lot of awkward extensions to
> my indexing code.  That code uses Python's haystack, which is based on
> django models, and in order to comply with the parent-not-its-child idea, I
> would have had to introduce dummy documents to stand in as the parents,
> something that isn't at all natural or straightforward in that
> django/haystack view of the world.  Maybe the enforcement of that
> restriction could be relaxed according to an option in the query
> constructor.
>
> -Mike
>


Re: Calculate the score of an arbitrary string vs a query?

2015-04-10 Thread Gregory Dearing
Hi Ali,

The short answer to your question is... there's no good way to create a
score from your result string, without using the Lucene index, that will be
directly comparable to the Lucene score.  The reason is that the score
isn't just a function of the query and the contents of the document.  It's
also (usually) a function of the contents of the entire corpus... or rather
how common terms are across the entire corpus.

That being said... the default scoring algorithm is based on tf/idf.  The
implementation isn't in any one class... every query type (e.g. Term Query,
Boolean Query, etc...) contains its own code for calculating scores.  So
the complete scoring formula will depend on the type of queries you're
using.  Many of those implementations also call into the Similarity API
that you mentioned.
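
If it helps as a reference point, the practical scoring function documented for
TFIDFSimilarity (which DefaultSimilarity implements) boils down to roughly:

    score(q,d) = coord(q,d) * queryNorm(q)
                 * SUM over terms t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )

The idf(t) factor depends on how many documents in the whole index contain t,
which is why the corpus matters.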

If you'd like to see representative examples of scoring code, then take a
look at TermWeight/TermScorer, and also BooleanWeight, which has several
associated scorers.
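
To see those pieces for a concrete hit (assuming you already have an
IndexSearcher 'searcher' and a Query 'query'), IndexSearcher.explain() prints
the full breakdown:

    TopDocs top = searcher.search(query, 5);
    for (ScoreDoc sd : top.scoreDocs) {
        // Shows how tf, idf, norms and boosts combined for this document's score.
        Explanation explanation = searcher.explain(query, sd.doc);
        System.out.println(explanation.toString());
    }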

-Greg


On Tue, Apr 7, 2015 at 1:32 AM, Ali Akhtar  wrote:

> Hello,
>
> I'm in a situation where a search query string is being submitted
> simultaneously to Lucene, and to an external API.
>
> Results are fetched from both sources. I already have a score available for
> Lucene results, but I don't have a score for the results fetched from the
> external source.
>
> I'd like to calculate scores of results from the API, so that I can rank
> the results by the score, and show the top 5 results from both sources.
> (I.e the results would be merged.)
>
> Is there any Lucene API method, to which I can submit a search string and
> result string, and get a score back? If not, which class contains the
> source code for calculating the score, so that I can implement my own
> scoring class, using the same algorithm?
>
> I've looked at the Similarity class Javadocs, but it doesn't include any
> source code for calculating the score.
>
> Any help would be greatly appreciated. Thanks.
>