[jira] [Commented] (LUCENE-4619) Create a specialized path for facets counting

2012-12-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530351#comment-13530351
 ] 

Shai Erera commented on LUCENE-4619:


bq. can the facedIndexingParams be part of IndexWriterConfig?

I don't think so, b/c then core would depend on Facets? Also, IWC is for IW, 
and IW cannot handle facets itself. I.e., you need a TaxonomyWriter, CDB and 
friends to actually be able to use FIParams.

> Create a specialized path for facets counting
> -
>
> Key: LUCENE-4619
> URL: https://issues.apache.org/jira/browse/LUCENE-4619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Shai Erera
> Attachments: LUCENE-4619.patch
>
>
> Mike and I have been discussing that on several issues (LUCENE-4600, 
> LUCENE-4602) and on GTalk ... it looks like the current API abstractions may 
> be responsible for some of the performance loss that we see, compared to 
> specialized code.
> During our discussion, we've decided to target a specific use case - facets 
> counting and work on it, top-to-bottom by reusing as much code as possible. 
> Specifically, we'd like to implement a FacetsCollector/Accumulator which can 
> do only counting (i.e. respects only CountFacetRequest), no sampling, 
> partitions and complements. The API allows us to do so very cleanly, and in 
> the context of that issue, we'd like to do the following:
> * Implement a FacetsField which takes a TaxonomyWriter, FacetIndexingParams 
> and CategoryPath (List, Iterable, whatever) and adds the needed information 
> to both the taxonomy index as well as the search index.
> ** That API is similar in nature to CategoryDocumentBuilder, only easier to 
> consume -- it's just another field that you add to the Document.
> ** We'll have two extensions for it: PayloadFacetsField and 
> DocValuesFacetsField, so that we can benchmark the two approaches. 
> Eventually, one of them we believe, will be eliminated, and we'll remain w/ 
> just one (hopefully the DV one).
> * Implement either a FacetsAccumulator/Collector which takes a bunch of 
> CountFacetRequests and returns the top-counts.
> ** Aggregations are done in-collection, rather than post. Note that we have 
> LUCENE-4600 open for exploring that. Either we finish this exploration here, 
> or do it there. Just FYI that the issue exists.
> ** Reuses the CategoryListIterator, IntDecoder and Aggregator code. I'll open 
> a separate issue to explore improving that API to be bulk, and then we can 
> decide if this specialized Collector should use those abstractions, or be 
> really optimized for the facet counting case.
> * At the moment, this path will assume that a document holds multiple 
> dimensions, but only one value from each (i.e. no Author/Shai, Author/Mike 
> for a document), and therefore use OrdPolicy.NO_PARENTS.
> ** Later, we'd like to explore how to have this specialized path handle the 
> ALL_PARENTS case too, as it shouldn't be so hard to do.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4619) Create a specialized path for facets counting

2012-12-12 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530307#comment-13530307
 ] 

Gilad Barkai commented on LUCENE-4619:
--

Throwing in a crazy idea.. can the facedIndexingParams be part of 
IndexWriterConfig?

> Create a specialized path for facets counting
> -
>
> Key: LUCENE-4619
> URL: https://issues.apache.org/jira/browse/LUCENE-4619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Shai Erera
> Attachments: LUCENE-4619.patch
>
>
> Mike and I have been discussing that on several issues (LUCENE-4600, 
> LUCENE-4602) and on GTalk ... it looks like the current API abstractions may 
> be responsible for some of the performance loss that we see, compared to 
> specialized code.
> During our discussion, we've decided to target a specific use case - facets 
> counting and work on it, top-to-bottom by reusing as much code as possible. 
> Specifically, we'd like to implement a FacetsCollector/Accumulator which can 
> do only counting (i.e. respects only CountFacetRequest), no sampling, 
> partitions and complements. The API allows us to do so very cleanly, and in 
> the context of that issue, we'd like to do the following:
> * Implement a FacetsField which takes a TaxonomyWriter, FacetIndexingParams 
> and CategoryPath (List, Iterable, whatever) and adds the needed information 
> to both the taxonomy index as well as the search index.
> ** That API is similar in nature to CategoryDocumentBuilder, only easier to 
> consume -- it's just another field that you add to the Document.
> ** We'll have two extensions for it: PayloadFacetsField and 
> DocValuesFacetsField, so that we can benchmark the two approaches. 
> Eventually, one of them we believe, will be eliminated, and we'll remain w/ 
> just one (hopefully the DV one).
> * Implement either a FacetsAccumulator/Collector which takes a bunch of 
> CountFacetRequests and returns the top-counts.
> ** Aggregations are done in-collection, rather than post. Note that we have 
> LUCENE-4600 open for exploring that. Either we finish this exploration here, 
> or do it there. Just FYI that the issue exists.
> ** Reuses the CategoryListIterator, IntDecoder and Aggregator code. I'll open 
> a separate issue to explore improving that API to be bulk, and then we can 
> decide if this specialized Collector should use those abstractions, or be 
> really optimized for the facet counting case.
> * At the moment, this path will assume that a document holds multiple 
> dimensions, but only one value from each (i.e. no Author/Shai, Author/Mike 
> for a document), and therefore use OrdPolicy.NO_PARENTS.
> ** Later, we'd like to explore how to have this specialized path handle the 
> ALL_PARENTS case too, as it shouldn't be so hard to do.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4619) Create a specialized path for facets counting

2012-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530205#comment-13530205
 ] 

Michael McCandless commented on LUCENE-4619:


Maybe if we rename CDB to FacetsDocumentBuilder, move it to oal.document, make 
it a single method call for the user (FDB.addFields), that's good enough 
progress for the common case for now?

I still don't like this field/dimension duality: it feels like the facet module 
is "hiding" what should be separate fields, within a single Lucene field.  If I 
need to store these fields (because I want to present them in the the UI), I'm 
already adding them as separate fields.

I think doc.add(new FacetField(...)) is more intuitive than fdb.addFields(doc, 
) for a the common/basic use case... but at least improving CDB here would 
be progress.




> Create a specialized path for facets counting
> -
>
> Key: LUCENE-4619
> URL: https://issues.apache.org/jira/browse/LUCENE-4619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Shai Erera
> Attachments: LUCENE-4619.patch
>
>
> Mike and I have been discussing that on several issues (LUCENE-4600, 
> LUCENE-4602) and on GTalk ... it looks like the current API abstractions may 
> be responsible for some of the performance loss that we see, compared to 
> specialized code.
> During our discussion, we've decided to target a specific use case - facets 
> counting and work on it, top-to-bottom by reusing as much code as possible. 
> Specifically, we'd like to implement a FacetsCollector/Accumulator which can 
> do only counting (i.e. respects only CountFacetRequest), no sampling, 
> partitions and complements. The API allows us to do so very cleanly, and in 
> the context of that issue, we'd like to do the following:
> * Implement a FacetsField which takes a TaxonomyWriter, FacetIndexingParams 
> and CategoryPath (List, Iterable, whatever) and adds the needed information 
> to both the taxonomy index as well as the search index.
> ** That API is similar in nature to CategoryDocumentBuilder, only easier to 
> consume -- it's just another field that you add to the Document.
> ** We'll have two extensions for it: PayloadFacetsField and 
> DocValuesFacetsField, so that we can benchmark the two approaches. 
> Eventually, one of them we believe, will be eliminated, and we'll remain w/ 
> just one (hopefully the DV one).
> * Implement either a FacetsAccumulator/Collector which takes a bunch of 
> CountFacetRequests and returns the top-counts.
> ** Aggregations are done in-collection, rather than post. Note that we have 
> LUCENE-4600 open for exploring that. Either we finish this exploration here, 
> or do it there. Just FYI that the issue exists.
> ** Reuses the CategoryListIterator, IntDecoder and Aggregator code. I'll open 
> a separate issue to explore improving that API to be bulk, and then we can 
> decide if this specialized Collector should use those abstractions, or be 
> really optimized for the facet counting case.
> * At the moment, this path will assume that a document holds multiple 
> dimensions, but only one value from each (i.e. no Author/Shai, Author/Mike 
> for a document), and therefore use OrdPolicy.NO_PARENTS.
> ** Later, we'd like to explore how to have this specialized path handle the 
> ALL_PARENTS case too, as it shouldn't be so hard to do.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4619) Create a specialized path for facets counting

2012-12-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530182#comment-13530182
 ] 

Shai Erera commented on LUCENE-4619:


Let's leave the ctor for now, I'll make sure that FIP returns the right API in 
LUCENE-4621. Iterable is great, but I think that we can risk a List? ;).

I thought about the FacetsField more. FacetsField is not really a field, right? 
You cannot set its FieldType (not critical) and any of Field's set()/get() 
methods are just in the way when you look at it. Also, it cannot actually 
create two fields (won't be needed hopefully w/ DV), and adding the same 
FacetsField twice w/ different categories will create multiple positions, which 
is something that I'm not sure we should introduce.

So maybe we improve CDB's API to make it more approachable? I.e., for reusing 
FacetsField you'd have to add setCPs to it? That's really like CDB. The only 
different that remains then is the ability to Document.add(new FacetsField) vs. 
CDB.addFields(Document).

You could do new FacetsField("Author"), but that doesn't simplify a lot I 
think? I.e. you could still add the same field twice. And thinking about the 
faceted search apps that I wrote, I don't always know which facets I'm going to 
add. Sometimes I just get a list of CPs (read them from JSON, XML, programmatic 
API) and I add them all in one shot. Starting to break that into fields will be 
a mess.

While we could always say "let's add FacetsField for the common people and keep 
CDB for the crazy ones", I prefer if there was one entry point. It's also 
easier down the line to make changes / fix bugs?

And that that CDB is a Builder, lets you do iw.addDocument(cdb.addFields(new 
Document())), ain't that awesome !? :)

I mean in the end of the day, CDB is not of type Field. But though FacetsField 
is, it's very limited and app has to excess more logic when dealing with it. 
Vs. CDB where you just call addFields(CPs, doc)?

> Create a specialized path for facets counting
> -
>
> Key: LUCENE-4619
> URL: https://issues.apache.org/jira/browse/LUCENE-4619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Shai Erera
> Attachments: LUCENE-4619.patch
>
>
> Mike and I have been discussing that on several issues (LUCENE-4600, 
> LUCENE-4602) and on GTalk ... it looks like the current API abstractions may 
> be responsible for some of the performance loss that we see, compared to 
> specialized code.
> During our discussion, we've decided to target a specific use case - facets 
> counting and work on it, top-to-bottom by reusing as much code as possible. 
> Specifically, we'd like to implement a FacetsCollector/Accumulator which can 
> do only counting (i.e. respects only CountFacetRequest), no sampling, 
> partitions and complements. The API allows us to do so very cleanly, and in 
> the context of that issue, we'd like to do the following:
> * Implement a FacetsField which takes a TaxonomyWriter, FacetIndexingParams 
> and CategoryPath (List, Iterable, whatever) and adds the needed information 
> to both the taxonomy index as well as the search index.
> ** That API is similar in nature to CategoryDocumentBuilder, only easier to 
> consume -- it's just another field that you add to the Document.
> ** We'll have two extensions for it: PayloadFacetsField and 
> DocValuesFacetsField, so that we can benchmark the two approaches. 
> Eventually, one of them we believe, will be eliminated, and we'll remain w/ 
> just one (hopefully the DV one).
> * Implement either a FacetsAccumulator/Collector which takes a bunch of 
> CountFacetRequests and returns the top-counts.
> ** Aggregations are done in-collection, rather than post. Note that we have 
> LUCENE-4600 open for exploring that. Either we finish this exploration here, 
> or do it there. Just FYI that the issue exists.
> ** Reuses the CategoryListIterator, IntDecoder and Aggregator code. I'll open 
> a separate issue to explore improving that API to be bulk, and then we can 
> decide if this specialized Collector should use those abstractions, or be 
> really optimized for the facet counting case.
> * At the moment, this path will assume that a document holds multiple 
> dimensions, but only one value from each (i.e. no Author/Shai, Author/Mike 
> for a document), and therefore use OrdPolicy.NO_PARENTS.
> ** Later, we'd like to explore how to have this specialized path handle the 
> ALL_PARENTS case too, as it shouldn't be so hard to do.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

--

[jira] [Commented] (LUCENE-4619) Create a specialized path for facets counting

2012-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530175#comment-13530175
 ] 

Michael McCandless commented on LUCENE-4619:


bq. You can remove TestDemoFacets from the patch? (since you committed it)

Will do.

bq. DEFAULT_INDEXING_PARAMS is a good idea, but it can change (it's mutable). 
However, I plan to add freeze() to it in LUCENE-4621, so maybe update the 
nocommit to call freeze() on it?

OK.

bq. Adding the field more than once ... that's tricky. It will of course create 
multiple positions, but the entire code now assumes a single position for 
facets. Is there any way we can prevent it? I guess not. Maybe put a WARNING 
for now?

Yeah I think a big warning in the javadocs is all we can do for now ... would 
be nice to somehow catch it but I can't think of a way now.

Separately this duality of dimension/field is sort of confusing.  Shouldn't I 
be adding FacetField("Author") and FacetField("Publish Date") to my Document?  
Instead of a single FacetField taking these two "dimensions" as 
CategoryPaths... I know that's a major change but it's still confusing to my 
[somewhat] new eyes ...

{quote}
It should be ok to add it more than once, with different FIParams, e.g. 
different CLPs.
I guess that's where CDB simplifies things?
{quote}

Right ... I wouldn't say CDB simplifies things (for the "basic" usage).

{quote}
The check that you do in the ctor:
I think it can be simplified to just check FIP.getAllCLPs().size() != 1? If so, 
throw the exception
Also note that you have a typo in the exception msg
{quote}

OK good!  Hmm that's an Iterable ... and ... I don't really want to iterate 
over it if I'm iterating over the CPs anyway ... I think?  Or maybe we have it 
return a List instead?  Hmm but this call (in DefaultFIP) makes a new ArrayList 
every time ... we should just use Collections.singletonList here ... I'll fix 
that.  So what to do?

bq. The rest of the code just freaks the hell out of me! 

LOL!!  Me too :)  Trying to figure out how to just get a byte[] out (so I can 
do the DocValuesFacetField) is not easy ...

bq. I think that you need the stream because that's the only way to add the 
field with the drill-down term and fulltree posting.

OK.

bq. Would rather if that was simplified ... but I don't want to implement a 
different TokenStream, not sure there's much value in it.

This TokenStream somehow adds one position w/ the payload (with token 
"fulltree", if no partitioning I think?), and then provides more tokens so they 
are indexed for drilldown?

bq. Still, maybe put a TODO (not nocommit) to check if it can be simplified?

OK will do.

{quote}
BTW, when I look at the test, adding the PayloadFacetField is super neat. I 
wonder if we simplified CDB to have less API, then all this would still be 
required. E.g.:
{noformat}
List categories; // you need to prepare that w/ PayloadFacetField too!
Document doc = new Document();
cdb.addFields(doc, categories);
{noformat}
It's not a Field, but it works w/ multiple CLPs and all (basically the code 
today). Only you don't need to set the categories. It can also mask in the 
future the case where we put the fulltree under one field and the terms under 
another, or everything as a DV ... What do you think?
{quote}
I think that's a step in the right direction ... but I think a FacetField is 
even easier to consume?

bq. Sorry that I don't fix the patch myself, but I'm rather short on time now. 
I can do it later. I'm also going to tackle LUCENE-4621 now.

No prob, I'll fix.

I'm also working on hacked up DocValuesFacetField.  Wait til you see how I get 
the byte[] to feed to DV :)

> Create a specialized path for facets counting
> -
>
> Key: LUCENE-4619
> URL: https://issues.apache.org/jira/browse/LUCENE-4619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Shai Erera
> Attachments: LUCENE-4619.patch
>
>
> Mike and I have been discussing that on several issues (LUCENE-4600, 
> LUCENE-4602) and on GTalk ... it looks like the current API abstractions may 
> be responsible for some of the performance loss that we see, compared to 
> specialized code.
> During our discussion, we've decided to target a specific use case - facets 
> counting and work on it, top-to-bottom by reusing as much code as possible. 
> Specifically, we'd like to implement a FacetsCollector/Accumulator which can 
> do only counting (i.e. respects only CountFacetRequest), no sampling, 
> partitions and complements. The API allows us to do so very cleanly, and in 
> the context of that issue, we'd like to do the following:
> * Implement a FacetsField which takes a TaxonomyWriter, FacetIndexingParams 
> and CategoryPath (List, Iterable, whatever) and adds the n

[jira] [Commented] (LUCENE-4619) Create a specialized path for facets counting

2012-12-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530049#comment-13530049
 ] 

Shai Erera commented on LUCENE-4619:


Few comments:

* You can remove TestDemoFacets from the patch? (since you committed it)

* DEFAULT_INDEXING_PARAMS is a good idea, but it can change (it's mutable). 
However, I plan to add {{freeze()}} to it in LUCENE-4621, so maybe update the 
nocommit to call freeze() on it?

* Adding the field more than once ... that's tricky. It will of course create 
multiple positions, but the entire code now assumes a single position for 
facets. Is there any way we can prevent it? I guess not. Maybe put a WARNING 
for now?
** It should be ok to add it more than once, with different FIParams, e.g. 
different CLPs.
** I guess that's where CDB simplifies things?

* The check that you do in the ctor:
** I think it can be simplified to just check FIP.getAllCLPs().size() != 1? If 
so, throw the exception
** Also note that you have a typo in the exception msg

* The rest of the code just freaks the hell out of me! :)
** I think that you need the stream because that's the only way to add the 
field with the drill-down term and fulltree posting.
** Would rather if that was simplified ... but I don't want to implement a 
different TokenStream, not sure there's much value in it.
** Still, maybe put a TODO (not nocommit) to check if it can be simplified?

* BTW, when I look at the test, adding the PayloadFacetField is super neat. I 
wonder if we simplified CDB to have less API, then all this would still be 
required. E.g.:

{code}
List categories; // you need to prepare that w/ PayloadFacetField too!
Document doc = new Document();
cdb.addFields(doc, categories);
{code}

It's not a Field, but it works w/ multiple CLPs and all (basically the code 
today). Only you don't need to set the categories. It can also mask in the 
future the case where we put the fulltree under one field and the terms under 
another, or everything as a DV ... What do you think?

Sorry that I don't fix the patch myself, but I'm rather short on time now. I 
can do it later. I'm also going to tackle LUCENE-4621 now.

> Create a specialized path for facets counting
> -
>
> Key: LUCENE-4619
> URL: https://issues.apache.org/jira/browse/LUCENE-4619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Shai Erera
> Attachments: LUCENE-4619.patch
>
>
> Mike and I have been discussing that on several issues (LUCENE-4600, 
> LUCENE-4602) and on GTalk ... it looks like the current API abstractions may 
> be responsible for some of the performance loss that we see, compared to 
> specialized code.
> During our discussion, we've decided to target a specific use case - facets 
> counting and work on it, top-to-bottom by reusing as much code as possible. 
> Specifically, we'd like to implement a FacetsCollector/Accumulator which can 
> do only counting (i.e. respects only CountFacetRequest), no sampling, 
> partitions and complements. The API allows us to do so very cleanly, and in 
> the context of that issue, we'd like to do the following:
> * Implement a FacetsField which takes a TaxonomyWriter, FacetIndexingParams 
> and CategoryPath (List, Iterable, whatever) and adds the needed information 
> to both the taxonomy index as well as the search index.
> ** That API is similar in nature to CategoryDocumentBuilder, only easier to 
> consume -- it's just another field that you add to the Document.
> ** We'll have two extensions for it: PayloadFacetsField and 
> DocValuesFacetsField, so that we can benchmark the two approaches. 
> Eventually, one of them we believe, will be eliminated, and we'll remain w/ 
> just one (hopefully the DV one).
> * Implement either a FacetsAccumulator/Collector which takes a bunch of 
> CountFacetRequests and returns the top-counts.
> ** Aggregations are done in-collection, rather than post. Note that we have 
> LUCENE-4600 open for exploring that. Either we finish this exploration here, 
> or do it there. Just FYI that the issue exists.
> ** Reuses the CategoryListIterator, IntDecoder and Aggregator code. I'll open 
> a separate issue to explore improving that API to be bulk, and then we can 
> decide if this specialized Collector should use those abstractions, or be 
> really optimized for the facet counting case.
> * At the moment, this path will assume that a document holds multiple 
> dimensions, but only one value from each (i.e. no Author/Shai, Author/Mike 
> for a document), and therefore use OrdPolicy.NO_PARENTS.
> ** Later, we'd like to explore how to have this specialized path handle the 
> ALL_PARENTS case too, as it shouldn't be so hard to do.

--
This message is automatically generated by JIRA.
If you think