Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-17 Thread Chitra R
Case 1:
With taxonomy facets, each document's facet labels are examined at index
time; their ordinals and mappings are computed and stored in the sidecar
index.

Case 2:
With doc values facets, as I understand it, these ordinals are computed at
search time, so I expect a time and memory trade-off between the two cases.


With taxonomy, building hierarchical facets at index time should make
faceting cheaper at search time than flat facets in doc values.

Apart from memory, time, and NRT latency, is there any other difference
between hierarchical and flat facets at search time?


Kindly post your suggestions.


Regards,
Chitra

On Thu, Nov 17, 2016 at 6:40 PM, Chitra R  wrote:

> Okay. I agree with you, Taxonomy maintains and supports hierarchical
> facets during indexing. Hope hierarchical in the sense, we might index
> the field Publish date : 2010/10/15 as Publish date: 2010 ,
> Publish date: 2010/10 and Publish date: 2010/10/15 , their facet ordinals
> are maintained in sidecar index and it is mapped to the main index.
>
> For example:
>
> In search-lucene.com , I enter a term (say facet), top
> documents and their categories are displayed after performing the search.
> Say I drill down through Publish date/2010 to collect its child counts and
> after I will pass through publishdate/2010/10 to collect their child
> counts. And for each drill down, each search will be performed to collect
> its top docs and categories.
>
>
> *Even I can achieve this in flat facets by changing the drill down query.*
>
> Am I right or missed anything? yet I don't know if I missed anything...
>
> So What is the need of hierarchical facets? Could you please explain
> it(hierarchical facets) in the real-world use case?
>
>
> Regards,
> Chitra
>
> On Wed, Nov 16, 2016 at 7:36 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> You store dimension + string (a single value path, since it's not
>> hierarchical) into SSDVFF so that you can compute facet counts, either
>> ordinary drill down counts or the drill sideways counts.
>>
>> You can see examples of drill sideways at
>> http://jirasearch.mikemccandless.com, e.g. drill down on any of those
>> fields on the left and you don't lose the previous facet counts for
>> that field.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Nov 16, 2016 at 8:51 AM, Chitra R  wrote:
>> > Hi,
>> >
>> > Lucene-Drill sideways
>> >
>> > jira_issue:LUCENE-4748
>> >
>> > Is this the reason (i.e. drill sideways makes a very nice faceted
>> > search UI because we don't "lose" the facet counts after drilling in)
>> > behind storing path and dimension for the given SSDVF field? Else
>> > anything?
>> >
>> > Regards,
>> > Chitra
>> >
>> >
>> > Hey, thank you so much for the fast response, I agree NRT refresh is
>> > somewhat costly operations and this is the major pitfall, suppose we
>> > use doc value faceting.
>> >
>> >
>> > While indexing SortedSetDocValuesFacetField , it stores
>> > path and dimension of the given field internally. So Can we achieve
>> > hierarchical facets using DrillDownQuery? Hope, purpose of storing path
>> > and dimension is to achieve hierarchical facets. If yes (ie we can
>> > achieve hierarchy in SSDVFF) , so what is the need to move over
>> > taxonomy? Else I missed anything?
>> >
>> >
>> >  What is the real purpose to store path and dimension in
>> > SSDVF field?
>> >
>> >
>> > Kindly post your suggestions.
>> >
>> > Regards,
>> > Chitra
>> >
>> >
>> >
>> > On Sat, Nov 12, 2016 at 4:03 AM, Michael McCandless
>> >  wrote:
>> >>
>> >> On Fri, Nov 11, 2016 at 5:21 AM, Chitra R wrote:
>> >>
>> >> > i) Hope, when opening SortedSetDocValuesReaderState , we are
>> >> > calculating ordinals( this will be used to calculate facet count )
>> >> > for doc values field and this only made the state instance somewhat
>> >> > costly. Am I right or any other reason behind that?
>> >>
>> >> That's correct.  It adds some latency to an NRT refresh, and some heap
>> >> used to hold the ordinal mappings.
>> >>
>> >> > ii) During indexing, we are providing facet ordinals in each doc
>> >> > and I think it will be useful in search side, to calculate facet
>> >> > counts only for matching docs.  otherwise, it carries any other
>> >> > benefits?
>> >>
>> >> Well, compared to the taxonomy facets, SSDV facets don't require a
>> >> separate index.
>> >>
>> >> But they add latency/heap usage, and they cannot do hierarchical
>> >> facets yet (though this could be fixed if someone just built it).
>> >>
>> >> > iii) Is SortedSetDocValuesReaderState thread-safe (ie) multiple
>> >> > threads can call this method concurrently?
>> >>
>> >> Yes.
>> >>
>> >> Mike McCandless
>> >>
>> >> 

Re: Multi-field IDF

2016-11-17 Thread Will Martin
Are you familiar with pivoted document length normalization, in practice
or in theory? Or with Croft's recent work on relevance algorithms that
account for structured field presence?




On 11/17/2016 5:20 PM, Nicolás Lichtmaier wrote:
That depends on what you want. In this case I want to use a 
discrimination power based in all the body text, not just the titles. 
Because otherwise terms that are really not that relevant end up being 
very high!



El 17/11/16 a las 18:25, Ahmet Arslan escribió:

Hi Nicholas,

IDF, among others, is a measure of term specificity. If 'or' is not 
so usual in titles, then it has some discrimination power in that 
domain.


I think it's OK 'or' to get a high IDF value in this case.

Ahmet



On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier 
 wrote:

IDF measures the selectivity of a term. But the calculation is
per-field. That can be bad for very short fields (like titles). One
example of this problem: If I don't delete stop words, then "or", "and",
etc. should be dealt with low IDF values, however "or" is, perhaps, not
so usual in titles. Then, "or" will have a high IDF value and be treated
as an important term. That's bad.

One solution I see is to modify the Similarity to have a global, or
multi-field IDF value. This value would include in its calculation
longer fields that has more "normal text"-like stats. However this is
not trivial because I can't just add document-frequencies (I would be
counting some documents several times if "or" is present in more than
one field). I would need need to OR the bit-vectors that signal the
presence of the term, right? Not trivial.

Has anyone encountered this issue? Has it been solved? Is my thinking 
wrong?


Should I also try the developers' list?

Thanks!

Nicolás.-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






Re: Multi-field IDF

2016-11-17 Thread Nicolás Lichtmaier
That depends on what you want. In this case I want the discrimination
power to be based on all the body text, not just the titles. Otherwise
terms that are really not that relevant end up with very high weights!



El 17/11/16 a las 18:25, Ahmet Arslan escribió:

Hi Nicholas,

IDF, among others, is a measure of term specificity. If 'or' is not so usual in 
titles, then it has some discrimination power in that domain.

I think it's OK 'or' to get a high IDF value in this case.

Ahmet



On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier 
 wrote:
IDF measures the selectivity of a term. But the calculation is
per-field. That can be bad for very short fields (like titles). One
example of this problem: If I don't delete stop words, then "or", "and",
etc. should be dealt with low IDF values, however "or" is, perhaps, not
so usual in titles. Then, "or" will have a high IDF value and be treated
as an important term. That's bad.

One solution I see is to modify the Similarity to have a global, or
multi-field IDF value. This value would include in its calculation
longer fields that has more "normal text"-like stats. However this is
not trivial because I can't just add document-frequencies (I would be
counting some documents several times if "or" is present in more than
one field). I would need need to OR the bit-vectors that signal the
presence of the term, right? Not trivial.

Has anyone encountered this issue? Has it been solved? Is my thinking wrong?

Should I also try the developers' list?

Thanks!

Nicolás.-




Re: Multi-field IDF

2016-11-17 Thread Ahmet Arslan
Hi Nicolás,

IDF is, among other things, a measure of term specificity. If 'or' is not so
common in titles, then it has some discrimination power in that domain.

I think it's OK for 'or' to get a high IDF value in this case.

Ahmet



On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier 
 wrote:
IDF measures the selectivity of a term. But the calculation is 
per-field. That can be bad for very short fields (like titles). One 
example of this problem: If I don't delete stop words, then "or", "and", 
etc. should be dealt with low IDF values, however "or" is, perhaps, not 
so usual in titles. Then, "or" will have a high IDF value and be treated 
as an important term. That's bad.

One solution I see is to modify the Similarity to have a global, or 
multi-field IDF value. This value would include in its calculation 
longer fields that has more "normal text"-like stats. However this is 
not trivial because I can't just add document-frequencies (I would be 
counting some documents several times if "or" is present in more than 
one field). I would need need to OR the bit-vectors that signal the 
presence of the term, right? Not trivial.

Has anyone encountered this issue? Has it been solved? Is my thinking wrong?

Should I also try the developers' list?

Thanks!

Nicolás.-




Re: enhancement for SynonymFilter

2016-11-17 Thread Michael McCandless
Hmm, are you saying SynonymFilter in 4.10.4 has this capability but
6.3.0 lost it?

So if you have a synonym "wow that's funny" -> "wtf", you want the
token for "wow" to state that it has a synonym?

Using the PositionLengthAttribute you should be able to reconstruct
this, because when you see "wtf" with position length 3, you know it
spanned "wow", "that's", "funny".
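
The reconstruction described above can be sketched in plain Java. This is only a model, not Lucene's API: the Token class, the "word" type, and the spannedTerms helper are hypothetical, and the absolute positions are assumed to have been computed already by summing position increments while consuming the token stream. Only the "SYNONYM" type name and the position-length idea come from the discussion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SynonymSpan {
    /** Hypothetical token model: term text, absolute position, position length, type. */
    static final class Token {
        final String term;
        final int position;
        final int positionLength;
        final String type;

        Token(String term, int position, int positionLength, String type) {
            this.term = term;
            this.position = position;
            this.positionLength = positionLength;
            this.type = type;
        }
    }

    /** Returns the non-synonym terms whose positions fall inside the synonym's span. */
    static List<String> spannedTerms(List<Token> stream, Token synonym) {
        int start = synonym.position;
        int end = start + synonym.positionLength; // exclusive upper bound
        List<String> covered = new ArrayList<>();
        for (Token t : stream) {
            if (!"SYNONYM".equals(t.type) && t.position >= start && t.position < end) {
                covered.add(t.term);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        Token wtf = new Token("wtf", 0, 3, "SYNONYM");
        List<Token> stream = Arrays.asList(
                new Token("wow", 0, 1, "word"),
                wtf,
                new Token("that's", 1, 1, "word"),
                new Token("funny", 2, 1, "word"));
        System.out.println(spannedTerms(stream, wtf));
    }
}
```

A filter placed after SynonymFilter would have to buffer tokens for this to work, since the synonym token can arrive before the later words it spans.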

Mike McCandless

http://blog.mikemccandless.com


On Thu, Nov 17, 2016 at 10:22 AM, Bernd Fehling
 wrote:
> Currently I'm tackling a problem with SynonymFilter while going from 4.10.4 
> to 6.3.0.
>
> For a special solution I need to know if a word (or multiword) is producing
> synonyms in SynonymFilter.
>
> Therefore I suggest the enhancement of "hasSynonyms" for SynonymFilter.
>
> A workaroud would be to buffer all results from SynonymFilter and check if
> after a word or multiword (of any type) is the next one a SYNONYM.
>
> A function "hasSynonyms" in SynonymFilter would make things easy :-)
>
> What do you think about this?
>
> Regards
> Bernd
>
>




Multi-field IDF

2016-11-17 Thread Nicolás Lichtmaier
IDF measures the selectivity of a term, but the calculation is
per-field. That can be bad for very short fields (like titles). One
example of this problem: if I don't remove stop words, then "or", "and",
etc. should get low IDF values; however, "or" is perhaps not so common
in titles. Then "or" will have a high IDF value and be treated as an
important term. That's bad.


One solution I see is to modify the Similarity to use a global, or
multi-field, IDF value. This value would include in its calculation
longer fields that have more "normal text"-like stats. However, this is
not trivial because I can't just add document frequencies (I would be
counting some documents several times if "or" is present in more than
one field). I would need to OR the bit vectors that signal the
presence of the term, right? Not trivial.
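
The double-counting problem and the OR-based fix can be sketched with java.util.BitSet. This is a model under stated assumptions, not Lucene code: each BitSet is assumed to mark the documents that contain the term in one field, and the idf formula (1 + ln(N / (df + 1))) is a classic TF-IDF style variant chosen for illustration. Lucene's similarities use their own formulas, and the class and method names here are hypothetical.

```java
import java.util.BitSet;

final class MultiFieldDocFreq {
    /** Number of documents that contain the term in at least one of the fields. */
    static int unionDocFreq(BitSet... perFieldDocs) {
        BitSet union = new BitSet();
        for (BitSet docs : perFieldDocs) {
            union.or(docs); // a document present in several fields is counted once
        }
        return union.cardinality();
    }

    /** Illustrative IDF: 1 + ln(numDocs / (docFreq + 1)). An assumption, not Lucene's exact formula. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 5;
        BitSet title = new BitSet();
        title.set(3);                       // "or" appears in one title
        BitSet body = new BitSet();
        body.set(0); body.set(1); body.set(3); body.set(4); // and in four bodies

        int naiveSum = title.cardinality() + body.cardinality(); // counts doc 3 twice
        int union = unionDocFreq(title, body);

        System.out.println("naive sum of doc freqs: " + naiveSum);
        System.out.println("union doc freq:         " + union);
        System.out.println("title-only idf:         " + idf(title.cardinality(), numDocs));
        System.out.println("multi-field idf:        " + idf(union, numDocs));
    }
}
```

In this toy data the title-only IDF of "or" is higher than the multi-field IDF, which is the effect described above. Materializing one bit set per term and field is what makes the approach costly on a real index.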


Has anyone encountered this issue? Has it been solved? Is my thinking wrong?

Should I also try the developers' list?

Thanks!

Nicolás.-




ASCIIFoldingFilter

2016-11-17 Thread Julian Motz

Hello all,

We're currently discussing the usage of the ASCIIFoldingFilter class in
our diacritics project. This project will be about mapping diacritics
(e.g. "ü") to their associated ASCII characters (e.g. "u" and "ue" in
this example).


The ASCIIFoldingFilter class would help us a lot and therefore I hope 
you can answer the following questions we have:


1. The ASCIIFoldingFilter class only maps diacritics to their ASCII
   base characters, e.g. "ü" => "u", not to their language-specific
   ASCII equivalents, e.g. "ue" in this case. Why have you
   excluded these mappings from the class?
2. Does any of you know whether German and Norwegian are the only
   languages that have such language-specific mappings (e.g. "ü" =>
   "ue" instead of "ü" => "u")?
3. Can we use the data in your class (Apache license) in our project
   (MIT license) if we name the copyright inside the file in our
   repository, but not in the end product that users get when
   using our project? The project would use the data in the build to
   generate a JSON file.
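
The difference between base-character folding and language-specific folding raised in question 1 can be sketched as a default map plus per-language overrides. This is a hypothetical model, not the ASCIIFoldingFilter implementation: the DiacriticFolder class, its two tiny maps, and the fold method are assumptions for illustration, and only the "ü" => "u" versus "ü" => "ue" pair comes from the message.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class DiacriticFolder {
    /** Base-character folding (tiny illustrative subset), e.g. u-umlaut to "u". */
    static final Map<Character, String> BASE = new HashMap<>();
    /** German-specific overrides, e.g. u-umlaut to "ue". */
    static final Map<Character, String> GERMAN = new HashMap<>();

    static {
        BASE.put('\u00fc', "u");    // u with diaeresis
        BASE.put('\u00f6', "o");    // o with diaeresis
        BASE.put('\u00e4', "a");    // a with diaeresis
        GERMAN.put('\u00fc', "ue");
        GERMAN.put('\u00f6', "oe");
        GERMAN.put('\u00e4', "ae");
    }

    /** Folds text, preferring the language overrides and falling back to BASE. */
    static String fold(String text, Map<Character, String> overrides) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String mapped = overrides.get(c);
            if (mapped == null) {
                mapped = BASE.get(c);
            }
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "M\u00fcller";
        System.out.println(fold(name, Collections.<Character, String>emptyMap()));
        System.out.println(fold(name, GERMAN));
    }
}
```

Keeping the language-specific mappings in a separate layer means the base table stays language-neutral, which may be one reason a general-purpose filter would leave them out.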

Thanks in advance.

Cheers,
Julian


enhancement for SynonymFilter

2016-11-17 Thread Bernd Fehling
Currently I'm tackling a problem with SynonymFilter while going from 4.10.4 to 
6.3.0.

For a special solution I need to know if a word (or multiword) is producing
synonyms in SynonymFilter.

Therefore I suggest the enhancement of "hasSynonyms" for SynonymFilter.

A workaround would be to buffer all results from SynonymFilter and check
whether the token following a word or multiword (of any type) is a SYNONYM.

A function "hasSynonyms" in SynonymFilter would make things easy :-)

What do you think about this?

Regards
Bernd




Re: StartsWith on DrillDown?

2016-11-17 Thread Matt Hicks
I understand this, and that's how I'm using it now, but my situation is
that in my application I want to offer the ability to auto-complete tags
that have results based on the current query.  This is why I'm looking for
a "StartsWith" filter on the tags.  Certainly I could get back all of the
tags and then filter them myself, but eventually there could be hundreds of
thousands of tags that I'm filtering through and if the user starts typing
"but" I want to be able to show "butterfly" if there are tag matches within
the current query.  I'm currently using taxonomy facets.
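
A minimal sketch of the "get the counts, then filter by prefix" fallback, assuming the tag counts for the current query are already available as label/count pairs. The TagPrefixFilter class and the sample tags are hypothetical, and nothing here is Lucene API. Holding the labels in a sorted map turns the prefix lookup into a range query instead of a scan over every tag.

```java
import java.util.SortedMap;
import java.util.TreeMap;

final class TagPrefixFilter {
    /**
     * Returns the entries whose label starts with the prefix.
     * The range is [prefix, prefix + '\uffff'), so the lookup is a sorted-map range view.
     */
    static SortedMap<String, Integer> startsWith(TreeMap<String, Integer> counts, String prefix) {
        return counts.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Hypothetical facet counts for the tags matching the current query.
        TreeMap<String, Integer> counts = new TreeMap<>();
        counts.put("butter", 3);
        counts.put("butterfly", 5);
        counts.put("button", 2);
        counts.put("cat", 7);

        System.out.println(startsWith(counts, "but"));
    }
}
```

This still requires counting all tags for the query first; it only makes the filtering step cheap.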

On Thu, Nov 17, 2016 at 4:23 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> The idea w/ drill down is you are running a "base query" (what the
> user actually searched for, originally) and then, if the user has
> clicked to drill down on any facet labels, you are also adding
> drill-down queries.
>
> You pass the "base query" to the DrillDownQuery constructor.
>
> And, normally, to add drill-down queries, you would use the add method
> that takes only strings, when the user clicked on a dimension + label.
>
> The add method that takes a custom drill-down query is for more
> advanced use cases, where you are able to create your own query that
> accomplishes the same thing as drilling down by a label; e.g., for
> numeric range facets, you would use this method to pass a numeric
> range filter down.
>
> Have you seen the demo facet examples, e.g.
>
> https://github.com/apache/lucene-solr/blob/master/lucene/demo/src/java/org/apache/lucene/demo/facet/SimpleSortedSetFacetsExample.java
> ?
>
> Are you using SSDV facets or taxonomy facets?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Nov 16, 2016 at 5:29 PM, Matt Hicks  wrote:
> > My situation is that I simply don't understand how I'm supposed to pass a
> > `Query` into it.  Just passing in a `new QueryParser(facetName,
> > standardAnalyzer)` to `drillDown.add(facetName, queryParser.parse("valid
> > query"))` just always returns zero items.  I'm guessing it has to do with
> > needing a specific type of query?
> >
> > On Wed, Nov 16, 2016 at 4:05 PM Michael McCandless
> >  wrote:
> >>
> >> Can you post a test case showing the unexpected behavior?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Wed, Nov 16, 2016 at 1:55 PM, Matt Hicks  wrote:
> >> > Is this simply not possible to accomplish or does nobody on this list
> >> > know?
> >> >
> >> > On Mon, Nov 14, 2016 at 2:39 PM Matt Hicks  wrote:
> >> >
> >> >> I'm trying to add a sub-query to my DrillDownQuery but I keep ending
> >> >> up with no results when I use add(String dim, Query subQuery).  I'm
> >> >> trying to query the tags that start with a specific String.
> >> >>
> >> >> Any suggestions of how to do this would be greatly appreciated. I am
> >> >> using Lucene Core 6.3.0.
> >> >>
> >> >> Thank you
> >> >>
>


Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-17 Thread Chitra R
Okay. I agree with you: taxonomy maintains and supports hierarchical facets
during indexing. By hierarchical I mean that we might index the field
Publish date: 2010/10/15 as Publish date: 2010, Publish date: 2010/10,
and Publish date: 2010/10/15; their facet ordinals are maintained in the
sidecar index and mapped to the main index.

For example:

On search-lucene.com, I enter a term (say "facet"), and the top
documents and their categories are displayed after the search runs.
Say I drill down through Publish date/2010 to collect its child counts, and
then through Publish date/2010/10 to collect its child
counts. For each drill-down, a new search is performed to collect
its top docs and categories.


   *I can achieve this even with flat facets by changing the
drill-down query.*

Am I right, or have I missed anything?

So what is the need for hierarchical facets? Could you please explain
hierarchical facets with a real-world use case?
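
The flat-facet emulation described above amounts to indexing one flat label per level of the path. A minimal sketch in plain Java, assuming "/" as the separator; the FacetPathExpander class is hypothetical and not part of Lucene:

```java
import java.util.ArrayList;
import java.util.List;

final class FacetPathExpander {
    /** Splits "2010/10/15" into one label per level: 2010, 2010/10, 2010/10/15. */
    static List<String> expand(String path) {
        List<String> prefixes = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : path.split("/")) {
            if (current.length() > 0) {
                current.append('/');
            }
            current.append(part);
            prefixes.add(current.toString());
        }
        return prefixes;
    }

    public static void main(String[] args) {
        // Each label would be indexed as a separate flat facet value for the document.
        for (String label : expand("2010/10/15")) {
            System.out.println("Publish date: " + label);
        }
    }
}
```

With this emulation the application has to work out for itself which labels are children of the current level, whereas a hierarchical facet implementation tracks the parent/child relationship.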


Regards,
Chitra

On Wed, Nov 16, 2016 at 7:36 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> You store dimension + string (a single value path, since it's not
> hierarchical) into SSDVFF so that you can compute facet counts, either
> ordinary drill down counts or the drill sideways counts.
>
> You can see examples of drill sideways at
> http://jirasearch.mikemccandless.com, e.g. drill down on any of those
> fields on the left and you don't lose the previous facet counts for
> that field.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Nov 16, 2016 at 8:51 AM, Chitra R  wrote:
> > Hi,
> >
> > Lucene-Drill sideways
> >
> > jira_issue:LUCENE-4748
> >
> > Is this the reason (i.e. drill sideways makes a very nice faceted
> > search UI because we don't "lose" the facet counts after drilling in)
> > behind storing path and dimension for the given SSDVF field? Else
> > anything?
> >
> > Regards,
> > Chitra
> >
> >
> > Hey, thank you so much for the fast response, I agree NRT refresh is
> > somewhat costly operations and this is the major pitfall, suppose we use
> > doc value faceting.
> >
> >
> > While indexing SortedSetDocValuesFacetField , it stores
> > path and dimension of the given field internally. So Can we achieve
> > hierarchical facets using DrillDownQuery? Hope, purpose of storing path
> > and dimension is to achieve hierarchical facets. If yes (ie we can achieve
> > hierarchy in SSDVFF) , so what is the need to move over taxonomy?
> > Else I missed anything?
> >
> >
> >  What is the real purpose to store path and dimension in
> > SSDVF field?
> >
> >
> > Kindly post your suggestions.
> >
> > Regards,
> > Chitra
> >
> >
> >
> > On Sat, Nov 12, 2016 at 4:03 AM, Michael McCandless
> >  wrote:
> >>
> >> On Fri, Nov 11, 2016 at 5:21 AM, Chitra R wrote:
> >>
> >> > i) Hope, when opening SortedSetDocValuesReaderState , we are
> >> > calculating ordinals( this will be used to calculate facet count ) for
> >> > doc values field and this only made the state instance somewhat costly.
> >> > Am I right or any other reason behind that?
> >>
> >> That's correct.  It adds some latency to an NRT refresh, and some heap
> >> used to hold the ordinal mappings.
> >>
> >> > ii) During indexing, we are providing facet ordinals in each doc
> >> > and I think it will be useful in search side, to calculate facet
> >> > counts only for matching docs.  otherwise, it carries any other
> >> > benefits?
> >>
> >> Well, compared to the taxonomy facets, SSDV facets don't require a
> >> separate index.
> >>
> >> But they add latency/heap usage, and they cannot do hierarchical
> >> facets yet (though this could be fixed if someone just built it).
> >>
> >> > iii) Is SortedSetDocValuesReaderState thread-safe (ie) multiple
> >> > threads can call this method concurrently?
> >>
> >> Yes.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >
> >
>


Re: StartsWith on DrillDown?

2016-11-17 Thread Michael McCandless
The idea w/ drill down is you are running a "base query" (what the
user actually searched for, originally) and then, if the user has
clicked to drill down on any facet labels, you are also adding
drill-down queries.

You pass the "base query" to the DrillDownQuery constructor.

And, normally, to add drill-down queries, you would use the add method
that takes only strings, when the user clicked on a dimension + label.

The add method that takes a custom drill-down query is for more
advanced use cases, where you are able to create your own query that
accomplishes the same thing as drilling down by a label; e.g., for
numeric range facets, you would use this method to pass a numeric
range filter down.
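
The drill-down semantics described above can be modelled in a few lines of plain Java. This is a toy model, not the DrillDownQuery API: the DrillDownModel and Doc classes are hypothetical, documents carry a single label per dimension for brevity, and the rule that labels within one dimension are OR'ed while dimensions are AND'ed reflects my understanding of DrillDownQuery and should be checked against its javadoc.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

final class DrillDownModel {
    /** Hypothetical document: free text plus one facet label per dimension. */
    static final class Doc {
        final String text;
        final Map<String, String> facets;

        Doc(String text, Map<String, String> facets) {
            this.text = text;
            this.facets = facets;
        }
    }

    private final Predicate<Doc> baseQuery;
    private final Map<String, Set<String>> drillDowns = new LinkedHashMap<>();

    /** The base query is what the user originally searched for. */
    DrillDownModel(Predicate<Doc> baseQuery) {
        this.baseQuery = baseQuery;
    }

    /** Records a click on dimension + label. */
    void add(String dim, String label) {
        drillDowns.computeIfAbsent(dim, d -> new HashSet<>()).add(label);
    }

    /** Base query AND, for every drilled dimension, any of its selected labels. */
    boolean matches(Doc doc) {
        if (!baseQuery.test(doc)) {
            return false;
        }
        for (Map.Entry<String, Set<String>> e : drillDowns.entrySet()) {
            if (!e.getValue().contains(doc.facets.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        DrillDownModel query = new DrillDownModel(d -> d.text.contains("lucene"));
        query.add("tag", "search");
        Doc doc = new Doc("lucene facets", Collections.singletonMap("tag", "search"));
        System.out.println(query.matches(doc));
    }
}
```

In this model, the custom-query variant of add would correspond to replacing the label set for a dimension with an arbitrary predicate, such as a numeric range check.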

Have you seen the demo facet examples, e.g.
https://github.com/apache/lucene-solr/blob/master/lucene/demo/src/java/org/apache/lucene/demo/facet/SimpleSortedSetFacetsExample.java
?

Are you using SSDV facets or taxonomy facets?

Mike McCandless

http://blog.mikemccandless.com

On Wed, Nov 16, 2016 at 5:29 PM, Matt Hicks  wrote:
> My situation is that I simply don't understand how I'm supposed to pass a
> `Query` into it.  Just passing in a `new QueryParser(facetName,
> standardAnalyzer)` to `drillDown.add(facetName, queryParser.parse("valid
> query"))` just always returns zero items.  I'm guessing it has to do with
> needing a specific type of query?
>
> On Wed, Nov 16, 2016 at 4:05 PM Michael McCandless
>  wrote:
>>
>> Can you post a test case showing the unexpected behavior?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Wed, Nov 16, 2016 at 1:55 PM, Matt Hicks  wrote:
>> > Is this simply not possible to accomplish or does nobody on this list
>> > know?
>> >
>> > On Mon, Nov 14, 2016 at 2:39 PM Matt Hicks  wrote:
>> >
>> >> I'm trying to add a sub-query to my DrillDownQuery but I keep ending up
>> >> with no results when I use add(String dim, Query subQuery).  I'm trying
>> >> to query the tags that start with a specific String.
>> >>
>> >> Any suggestions of how to do this would be greatly appreciated. I am
>> >> using Lucene Core 6.3.0.
>> >>
>> >> Thank you
>> >>
