Re: Anti-Pattern in lucent-join jar?
On Fri, Dec 5, 2014 at 10:44 PM, Darin Amos dari...@gmail.com wrote:

> public Scorer scorer(){
>     TermsWithScoreCollector collector = new TermsWithScoreCollector();
>     JoinQuery.this.s.search(JoinQuery.this.q, collector);
>     //do the rest..
> }

Darin,

I can hardly follow, but this approach is either inefficient or doesn't work at all. Generally, a join is an O(n^2) operation, which most implementations try to reduce. weight.scorer() is invoked per segment, and the scorer yields results only from that particular segment. However, fromQuery has to run across all segments. Hence, TermsWithScoreCollector will collect IDs globally again and again. As you can see, the current JoinUtil design is much more efficient: it reuses the global hash of IDs across the searches of all "to" segments.

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
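The cost argument above can be illustrated with a toy model. This is only a sketch in plain Java, not Lucene code: the class name, the "segments" modelled as lists of join keys, and the pass counter are all assumptions made for illustration. It contrasts collecting the "from" keys inside the per-segment step with collecting them once and reusing the set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model (hypothetical, not Lucene API): counts passes over the "from" side. */
final class PerSegmentJoinCostSketch {

    /** Number of full passes made over all "from" segments so far. */
    static int globalCollections = 0;

    /** Stand-in for running fromQuery across the whole index and gathering join keys. */
    static Set<String> collectFromTerms(List<List<String>> fromSegments) {
        globalCollections++;
        Set<String> terms = new HashSet<>();
        for (List<String> segment : fromSegments) {
            terms.addAll(segment);
        }
        return terms;
    }

    /** Counts the IDs of one "to" segment that appear in the collected keys. */
    static int countMatches(List<String> toSegment, Set<String> terms) {
        int n = 0;
        for (String id : toSegment) {
            if (terms.contains(id)) {
                n++;
            }
        }
        return n;
    }

    /** Collects inside the per-segment step: one global pass per "to" segment. */
    static int joinCollectingPerSegment(List<List<String>> from, List<List<String>> to) {
        int matches = 0;
        for (List<String> toSegment : to) {
            Set<String> terms = collectFromTerms(from); // repeated for every segment
            matches += countMatches(toSegment, terms);
        }
        return matches;
    }

    /** Collects once up front and reuses the same set for every "to" segment. */
    static int joinCollectingOnce(List<List<String>> from, List<List<String>> to) {
        Set<String> terms = collectFromTerms(from); // single global pass
        int matches = 0;
        for (List<String> toSegment : to) {
            matches += countMatches(toSegment, terms);
        }
        return matches;
    }

    public static void main(String[] args) {
        List<List<String>> from = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
        List<List<String>> to = Arrays.asList(
                Arrays.asList("a", "x"), Arrays.asList("b"), Arrays.asList("c", "y"));

        globalCollections = 0;
        int perSegment = joinCollectingPerSegment(from, to);
        System.out.println("per-segment: matches=" + perSegment + " passes=" + globalCollections);

        globalCollections = 0;
        int once = joinCollectingOnce(from, to);
        System.out.println("collect-once: matches=" + once + " passes=" + globalCollections);
    }
}
```

Both variants find the same matches; the difference is only in how many times the "from" side is scanned, which grows with the number of "to" segments in the first variant.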
Re: Anti-Pattern in lucent-join jar?
I get the impression there was a concern that the caller could hold on to the query generated by JoinUtil for too long, e.g. across requests in Solr. I'm not sure why the OP thinks that would happen, though.

-Mike

On 12/08/2014 04:57 AM, Mikhail Khludnev wrote:

> Darin,
>
> I can hardly follow, but this approach is either inefficient or doesn't work at all. Generally, a join is an O(n^2) operation, which most implementations try to reduce. weight.scorer() is invoked per segment, and the scorer yields results only from that particular segment. However, fromQuery has to run across all segments. Hence, TermsWithScoreCollector will collect IDs globally again and again. As you can see, the current JoinUtil design is much more efficient: it reuses the global hash of IDs across the searches of all "to" segments.
Re: Anti-Pattern in lucent-join jar?
On Mon, Dec 8, 2014 at 5:38 PM, Michael Sokolov msoko...@safaribooksonline.com wrote:

> I get the impression there was a concern that the caller could hold on to the query generated by JoinUtil for too long, e.g. across requests in Solr.

Michael, if this still bothers you, SOLR-6234 (https://issues.apache.org/jira/browse/SOLR-6234) is free from this issue. Cache keys (queries) are fairly small and GC-friendly.

> I'm not sure why the OP thinks that would happen, though.

Could you please expand "OP"? I didn't get it.
Re: Anti-Pattern in lucent-join jar?
Hi Mikhail,

I was merely posing a thought in an effort to continue to learn and educate myself. Your point about Weight.scorer() being called per segment helps my understanding. I am in the middle of building a POC for a customer of mine, which I pointed out in this thread on Dec 5th (shortly after noon). I have spent countless hours over the weekend continuing to try to learn the internals of SOLR and Lucene.

Thanks
Darin

On Dec 8, 2014, at 4:57 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

> Darin,
>
> I can hardly follow, but this approach is either inefficient or doesn't work at all. Generally, a join is an O(n^2) operation, which most implementations try to reduce. weight.scorer() is invoked per segment, and the scorer yields results only from that particular segment. However, fromQuery has to run across all segments. Hence, TermsWithScoreCollector will collect IDs globally again and again. As you can see, the current JoinUtil design is much more efficient: it reuses the global hash of IDs across the searches of all "to" segments.
Re: Anti-Pattern in lucent-join jar?
Right - allowing Solr to manage these queries (SOLR-6234) seems like the way to go ...

OP == original poster (I lost track of who started the discussion)

-Mike

On 12/08/2014 10:19 AM, Mikhail Khludnev wrote:

> On Mon, Dec 8, 2014 at 5:38 PM, Michael Sokolov msoko...@safaribooksonline.com wrote:
>
> > I get the impression there was a concern that the caller could hold on to the query generated by JoinUtil for too long - eg across requests in Solr.
>
> Michael, if you still bother, SOLR-6234 https://issues.apache.org/jira/browse/SOLR-6234 is free from this issue. Cache keys (Queries), are fairly small and GC friendly.
>
> > I'm not sure why the OP thinks that would happen, though.
>
> Could you please expand OP? I didn't get it.
Re: Anti-Pattern in lucent-join jar?
Thanks Roman! Let's expand on it for the sake of completeness. Such an issue is not possible in Solr, because caches are associated with the searcher. As long as you follow this design (see Solr's userCache) and don't update what has already been cached, there is no chance to shoot yourself in the foot. There were a few caches inside Lucene (the old FieldCache, CachingWrapperFilter, ExternalFileField, etc.), but they are properly mapped onto segment keys, which excludes such leakage across different searchers.

On Fri, Dec 5, 2014 at 6:43 AM, Roman Chyla roman.ch...@gmail.com wrote:

> +1, additionally (as it follows from your observation) the query can get out of sync with the index, e.g. if it was saved for later use and run against a newly opened searcher
>
> Roman
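A minimal sketch of why a cache tied to the searcher cannot leak stale entries across searchers. Everything here is hypothetical plain Java: the `Searcher` class stands in for an immutable index snapshot and is not Solr's or Lucene's searcher type, and the cached computation (counting IDs by prefix) is invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical): each searcher snapshot owns its own cache. */
final class SearcherScopedCacheSketch {

    /** Stand-in for an immutable view of the index at one point in time. */
    static final class Searcher {
        private final Set<String> docIds;
        // The cache lives and dies with this searcher, so entries never outlive the snapshot.
        private final Map<String, Integer> cache = new HashMap<>();
        int computations = 0;

        Searcher(Set<String> docIds) {
            this.docIds = new HashSet<>(docIds);
        }

        /** Counts document IDs with the given prefix, computing at most once per searcher. */
        int countWithPrefix(String prefix) {
            Integer cached = cache.get(prefix);
            if (cached != null) {
                return cached;
            }
            computations++;
            int n = 0;
            for (String id : docIds) {
                if (id.startsWith(prefix)) {
                    n++;
                }
            }
            cache.put(prefix, n);
            return n;
        }
    }

    public static void main(String[] args) {
        Searcher oldSearcher = new Searcher(new HashSet<>(Arrays.asList("child-1", "child-2")));
        System.out.println("old: " + oldSearcher.countWithPrefix("child-"));
        System.out.println("old again: " + oldSearcher.countWithPrefix("child-"));

        // A newly opened searcher sees the updated index and starts with an empty cache.
        Searcher newSearcher = new Searcher(
                new HashSet<>(Arrays.asList("child-1", "child-2", "child-3")));
        System.out.println("new: " + newSearcher.countWithPrefix("child-"));
    }
}
```

Because the cache is a field of the snapshot rather than a global map keyed only by the query, a new searcher can never be served a value computed against an older view of the index.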
Re: Anti-Pattern in lucent-join jar?
Hi Mikhail,

I think you are right, it won't be a problem for SOLR, but it is likely an antipattern inside a Lucene component, because custom components may create join queries, hold on to them, and then execute them much later against a different searcher. One approach would be to postpone term collection until the query actually runs. I looked far and wide for an appropriate place, but only found createWeight() - but at least it gives developers NO opportunity to shoot themselves in the foot! ;-)

Since it may serve as an inspiration to someone, here is a link: https://github.com/romanchyla/montysolr/blob/master-next/contrib/adsabs/src/java/org/apache/lucene/search/SecondOrderQuery.java#L101

roman

On Fri, Dec 5, 2014 at 4:52 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

> Thanks Roman! Let's expand on it for the sake of completeness. Such an issue is not possible in Solr, because caches are associated with the searcher. As long as you follow this design (see Solr's userCache) and don't update what has already been cached, there is no chance to shoot yourself in the foot. There were a few caches inside Lucene (the old FieldCache, CachingWrapperFilter, ExternalFileField, etc.), but they are properly mapped onto segment keys, which excludes such leakage across different searchers.
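The difference between collecting terms when the query is built and postponing collection until the query is run against a concrete searcher can be sketched as follows. This is a toy model in plain Java with hypothetical names (`Snapshot`, `EagerJoin`, `DeferredJoin`); it does not use the Lucene API and is not the code behind the link above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model (hypothetical): eager vs. deferred collection of join keys. */
final class DeferredCollectionSketch {

    /** Stand-in for one view of the index: the "from" join keys visible at that time. */
    static final class Snapshot {
        final List<String> fromKeys;

        Snapshot(List<String> fromKeys) {
            this.fromKeys = fromKeys;
        }
    }

    /** Eager variant: keys are collected in the constructor and frozen inside the query. */
    static final class EagerJoin {
        private final Set<String> keys;

        EagerJoin(Snapshot atConstruction) {
            this.keys = new HashSet<>(atConstruction.fromKeys);
        }

        /** Ignores the snapshot it is executed against, so it can be out of date. */
        boolean matches(Snapshot current, String toId) {
            return keys.contains(toId);
        }
    }

    /** Deferred variant: keys are collected only when the query is prepared for a snapshot. */
    static final class DeferredJoin {

        /** Plays the role of a "prepare for this searcher" step. */
        Set<String> prepare(Snapshot current) {
            return new HashSet<>(current.fromKeys);
        }

        boolean matches(Snapshot current, String toId) {
            return prepare(current).contains(toId);
        }
    }

    public static void main(String[] args) {
        Snapshot before = new Snapshot(Arrays.asList("p1"));
        Snapshot after = new Snapshot(Arrays.asList("p1", "p2")); // index changed, new searcher

        EagerJoin eager = new EagerJoin(before);  // built early and held on to
        DeferredJoin deferred = new DeferredJoin();

        System.out.println("eager sees p2:    " + eager.matches(after, "p2"));
        System.out.println("deferred sees p2: " + deferred.matches(after, "p2"));
    }
}
```

The deferred variant pays the collection cost at execution time, which is the trade-off discussed in this thread: it cannot go stale, but it should be prepared once per searcher rather than once per segment.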
Re: Anti-Pattern in lucent-join jar?
Couldn’t you just keep passing the wrapped query and searcher down to Weight.scorer()? This would allow you to wait until the query is executed to do term collection. If you want to protect against creating and executing the query with different searchers, you would have to make the query factory (or constructor) visible only to the query parser or parser plugin?

I might not have followed you; this discussion challenges my understanding of Lucene and SOLR.

Darin

On Dec 5, 2014, at 12:47 PM, Roman Chyla roman.ch...@gmail.com wrote:

> Hi Mikhail,
>
> I think you are right, it won't be a problem for SOLR, but it is likely an antipattern inside a Lucene component, because custom components may create join queries, hold on to them, and then execute them much later against a different searcher. One approach would be to postpone term collection until the query actually runs. I looked far and wide for an appropriate place, but only found createWeight() - but at least it gives developers NO opportunity to shoot themselves in the foot! ;-)
>
> Since it may serve as an inspiration to someone, here is a link: https://github.com/romanchyla/montysolr/blob/master-next/contrib/adsabs/src/java/org/apache/lucene/search/SecondOrderQuery.java#L101
>
> roman
Re: Anti-Pattern in lucent-join jar?
Not sure I understand. It is the searcher that executes the query, so how would you 'convince' it to pass the query? First the Weight is created, then the weight instance creates the scorer - you would have to change the API to do the passing (or maybe not...?)

In my case, the relationships were across index segments, so I had to collect them first - but in some other situations, when you look only at the data inside one index segment, it _might_ be better to wait

On Fri, Dec 5, 2014 at 1:25 PM, Darin Amos dari...@gmail.com wrote:

> Couldn’t you just keep passing the wrapped query and searcher down to Weight.scorer()? This would allow you to wait until the query is executed to do term collection. If you want to protect against creating and executing the query with different searchers, you would have to make the query factory (or constructor) visible only to the query parser or parser plugin?
>
> I might not have followed you; this discussion challenges my understanding of Lucene and SOLR.
>
> Darin
Re: Anti-Pattern in lucent-join jar?
In this case I was thinking about something like the following, if you changed the Query implementation or created your own similar query.

If you consider this query:

    q={!scorejoin from=parent to=id}type:child

    public class JoinQuery extends Query {
        private Query q = null;
        private IndexSearcher s = null;

        public JoinQuery(Query q, IndexSearcher s){
            this.q = q; // This is the term query type:child
            this.s = s;
        }
        . . .
        public Weight createWeight(…..){
            return new Weight(){
                . . .
                public Scorer scorer(){
                    TermsWithScoreCollector collector = new TermsWithScoreCollector();
                    JoinQuery.this.s.search(JoinQuery.this.q, collector);
                    //do the rest..
                }
            };
        }
    }

This is what I was thinking in my head…. but I don’t really believe it offers any value above how the scorejoin query works today.

On Dec 5, 2014, at 2:16 PM, Roman Chyla roman.ch...@gmail.com wrote:

> Not sure I understand. It is the searcher that executes the query, so how would you 'convince' it to pass the query? First the Weight is created, then the weight instance creates the scorer - you would have to change the API to do the passing (or maybe not...?)
>
> In my case, the relationships were across index segments, so I had to collect them first - but in some other situations, when you look only at the data inside one index segment, it _might_ be better to wait
Re: Anti-Pattern in lucent-join jar?
Hello,

I wonder if you have seen https://issues.apache.org/jira/browse/SOLR-6234, which solves this problem. QueryResultCache entries are useless for joins, because they carry cropped results. Potentially you can hit the filter cache by wrapping fromQuery into this monster bridge:

    new FilteredQuery(new MatchAllDocsQuery(), filterCache.get(fromQuery).getTopFilter())

However, you refer to TermsWithScoreCollector, and filterCache doesn't store scores. fromQuery is usually not a hotspot for JoinQuery (I spoke about it at the last LuceneRevolution).

FWIW, it's common to have heavy processing at the Lucene level, e.g. see RangeQuery. The idea is to cache the result of query execution (but not the intermediate data) at the levels above, as is done in Solr's filterCache or queryResultCache.

Hope it helps

On Thu, Dec 4, 2014 at 6:49 PM, Darin Amos dari...@gmail.com wrote:

> Hello All,
>
> I have been doing a lot of research into building some custom queries, and I have been looking at the Lucene Join library as a reference. I noticed something that I believe could actually have a negative side effect.
>
> Specifically, I was looking at the JoinUtil.createJoinQuery(…) method, and within that method you see the following code:
>
>     TermsWithScoreCollector termsWithScoreCollector = TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode);
>     fromSearcher.search(fromQuery, termsWithScoreCollector);
>
> As you can see, when the JoinQuery is being built, the code executes the query that it wraps with its own collector to collect all the scores. If I were to write a query parser using this library (which someone has done here), doesn’t this reduce the benefit of the SOLR query cache? The wrapped query is executed when the join query is constructed, not when the join query itself is executed.
>
> Thanks
> Darin
Re: Anti-Pattern in lucent-join jar?
+1, additionally (as it follows from your observation) the query can get out of sync with the index, e.g. if it was saved for later use and run against a newly opened searcher

Roman

On 4 Dec 2014 10:51, Darin Amos dari...@gmail.com wrote:

> Hello All,
>
> I have been doing a lot of research into building some custom queries, and I have been looking at the Lucene Join library as a reference. I noticed something that I believe could actually have a negative side effect.
>
> Specifically, I was looking at the JoinUtil.createJoinQuery(…) method, and within that method you see the following code:
>
>     TermsWithScoreCollector termsWithScoreCollector = TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode);
>     fromSearcher.search(fromQuery, termsWithScoreCollector);
>
> As you can see, when the JoinQuery is being built, the code executes the query that it wraps with its own collector to collect all the scores. If I were to write a query parser using this library (which someone has done here), doesn’t this reduce the benefit of the SOLR query cache? The wrapped query is executed when the join query is constructed, not when the join query itself is executed.
>
> Thanks
> Darin