Re: Performance of cross join vs block join
Hi Mikhail,

I used the term "block join" incorrectly. When I said block join I was referring to a join performed on a single core, versus a cross join performed on multiple cores. But I saw your benchmark (from cache) and it seems that block join performs better. Is this functionality available in Solr 4.3.1? I did not find any such examples on Solr's wiki page.

Does this functionality require a special schema or special indexing? How would I need to index the data from my tables? In my case all the indices have a common schema anyway, since I am using dynamic fields, so I can easily add all documents from all tables to one Solr core and add a discriminator field to each document.

Could you point me to some more documentation?

Thanks in advance,
Mihaela

From: Mikhail Khludnev mkhlud...@griddynamics.com
To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com
Sent: Thursday, July 11, 2013 2:25 PM
Subject: Re: Performance of cross join vs block join

> Mihaela, For me it's reasonable that single core join takes the same time as cross core one. [...]
Re: Performance of cross join vs block join
On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote:

> Hi Mikhail, I have used wrong the term block join. When I said block join I was referring to a join performed on a single core versus cross join which was performed on multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available on Solr 4.3.1?

Nope. SOLR-3076 has been waiting for ages.

> I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or a special indexing?

Special indexing: yes.

> How would I need to index the data from my tables? In my case anyway all the indices have a common schema since I am using dynamic fields, thus I can easily add all documents from all tables in one Solr core, but for each document to add a discriminator field.

Correct, but the notion of a 'discriminator field' is a little bit different for block join.

> Could you point me to some more documentation?

I can recommend only these:
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
http://www.youtube.com/watch?v=-OiIlIijWH0

> Thanks in advance, Mihaela

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
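A note on the "special indexing" answer in the message above. The following is a rough sketch in plain Python, not Solr or Lucene code; the field names `id`, `type`, `color` and the helpers `to_block` and `build_index` are made up for illustration. It only shows the document layout that block join relies on, as described in the McCandless post linked above: the children of one parent are written contiguously, with the parent as the last document of the block, and a marker field identifies the parents.

```python
from typing import Dict, List, Tuple

Doc = Dict[str, str]


def to_block(parent: Doc, children: List[Doc]) -> List[Doc]:
    """Return one contiguous block: all children first, the parent last."""
    block = [dict(child, type="child") for child in children]
    # The marker on the parent plays the role of the 'discriminator':
    # it tells where each block ends rather than which table a row came from.
    block.append(dict(parent, type="parent"))
    return block


def build_index(rows: List[Tuple[Doc, List[Doc]]]) -> List[Doc]:
    """Concatenate blocks; a block is never interleaved with another one."""
    index: List[Doc] = []
    for parent, children in rows:
        index.extend(to_block(parent, children))
    return index


index = build_index([
    ({"id": "p1"}, [{"id": "c1", "color": "red"}, {"id": "c2", "color": "blue"}]),
    ({"id": "p2"}, [{"id": "c3", "color": "red"}]),
])
print([doc["id"] for doc in index])  # ['c1', 'c2', 'p1', 'c3', 'p2']
```

Under this layout, updating one child means re-adding its whole block, which is one reason the indexing side needs changes and not only the schema.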
Re: Performance of cross join vs block join
Hi Mikhail,

I have commented on your blog, but it seems I did something wrong, as the comment is not there. Would it be possible to share the test setup (script)?

I have found that the crucial thing with joins is the number of 'joins' [hits returned], and it seems that the experiments I have seen so far were geared towards small collections. Even if Erick's index was 26M, the number of hits was probably small. You can see a very different story if you face some [other] real data.

Here is a citation network, where I was comparing Lucene's join (i.e. not the block joins, because these cannot be used for citation data: we cannot reasonably index them into one segment):

https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png

Notice that the y axis uses a square-root scale, so the running time for the Lucene join is growing, and growing very fast! It takes Lucene 30s to do the search that selects 1M hits.

The comparison is against our own implementation of a similar search, but the main point I am making is that join benchmarks should show the number of hits selected by the join operation. Otherwise, a very important detail is hidden.

Best,
roman

On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

> On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote:
>> [...] Is this functionality available on Solr 4.3.1?
> nope SOLR-3076 awaits for ages. [...]
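A note on Roman's point that a join benchmark should report how many hits the join selects. A minimal harness along those lines might look like the sketch below. It is only an illustration: `benchmark` and `toy_join` are hypothetical names, and `toy_join` is a stand-in for a real join call, not a Solr or Lucene API.

```python
import time
from typing import Callable, Iterable, List, Tuple


def benchmark(join: Callable[[int], List[int]],
              sizes: Iterable[int]) -> List[Tuple[int, int, float]]:
    """Run `join` for each input size and record (size, hits, seconds)."""
    results = []
    for size in sizes:
        start = time.perf_counter()
        hits = join(size)
        elapsed = time.perf_counter() - start
        # The hit count is stored next to the timing so that a fast run
        # over a tiny result set cannot be mistaken for a fast join.
        results.append((size, len(hits), elapsed))
    return results


def toy_join(n_children: int) -> List[int]:
    """Stand-in join: every 10 consecutive children share one parent."""
    return sorted({child // 10 for child in range(n_children)})


for size, hits, seconds in benchmark(toy_join, [1_000, 100_000]):
    print(f"children={size} parents_returned={hits} seconds={seconds:.4f}")
```

Plotting seconds against the hit count, rather than against index size alone, is what would expose the growth Roman describes.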
Re: Performance of cross join vs block join
Hello Roman,

Thanks for your interest. I briefly looked at your approach, and I'm really interested in your numbers. Here is the trivial code; I'd rather rely on your testing framework, and I can provide you with a version of Solr 4.2 with SOLR-3076 applied. Do you need it?

https://github.com/m-khl/join-tester

What you are saying about benchmark representativeness definitely makes sense. I didn't try to establish a complete, absolutely representative benchmark. I just wanted rough numbers relevant to my use case. I'm from eCommerce, and that volume was enough for me.

What I didn't get is 'not the block joins, because these cannot be used for citation data - we cannot reasonably index them into one segment'. Usually there is no problem with blocks in a multi-segment index; only a single block can't span across segments. Anyway, please elaborate.

One of the benefits of block join is the ability to hit only the first matched child in a group and jump over the following ones. It isn't applicable in general, but it sometimes gives a huge gain.

On Fri, Jul 12, 2013 at 8:29 PM, Roman Chyla roman.ch...@gmail.com wrote:

> Hi Mikhail, I have commented on your blog, but it seems I have done st wrong, as the comment is not there. Would it be possible to share the test setup (script)? [...]
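A note on "hit only the first matched child in a group and jump over the following ones". The idea can be sketched in plain Python as below. This is an illustration under assumptions, not the Lucene implementation: doc ids are plain integers, each parent is the last document of its block, and `block_join_parents` is a made-up name.

```python
from bisect import bisect_right
from typing import List


def block_join_parents(child_hits: List[int], parent_ids: List[int]) -> List[int]:
    """Map matching children to their parents, touching one child per block.

    child_hits: sorted doc ids of children matching the child query.
    parent_ids: sorted doc ids of all parent docs (each closes its block).
    """
    parents = []
    i = 0
    while i < len(child_hits):
        # The parent of a child is the first parent doc id after it.
        parent = parent_ids[bisect_right(parent_ids, child_hits[i])]
        parents.append(parent)
        # Jump over the remaining matching children of the same block.
        i = bisect_right(child_hits, parent, lo=i)
    return parents


# Blocks: [0, 1, 2 | 3], [4, 5 | 6], [7, 8 | 9]; parents are 3, 6 and 9.
print(block_join_parents([0, 1, 2, 7], [3, 6, 9]))  # [3, 9]
```

Here four children match but only two are examined, one per matching block. When scores from every child are needed, all of them must be visited, which is why the shortcut does not apply in general.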
Performance of cross join vs block join
Hello,

Does anyone know of any performance measurements for cross joins compared to joins inside a single index? Is a join inside a single index that stores all documents of various types (from the parent table or from the children tables) with a discriminator field faster than a cross join (where each document type resides in its own index)?

I have performed some tests, but it seems to me that a join in a single (bigger) index does not add much speed improvement compared to cross joins. Why would a block join be faster than a cross join if this is the case? What are the variables that count when trying to improve the query execution time?

Thanks!
Mihaela
Re: Performance of cross join vs block join
Mihaela,

To me it's reasonable that a single-core join takes the same time as a cross-core one. I just can't see what gain could be obtained in the former case. I'm hardly able to comment on the join code; I looked into it, and it's not trivial, at least.

Block join doesn't need to obtain parentId term values/numbers and look up parents by them. Both of these actions are expensive. Also, block join works as an iterator, whereas join needs to allocate memory for a parents bitset and populate it out of order, which impacts scalability. Also, in None scoring mode BJQ doesn't need to walk through all children, only the first hit. Another nice feature is 'both-side leapfrog': if a highly restrictive filter/query intersects with BJQ, it can skip many parents and children as well. That's not possible in Join, which has a fairly 'full-scan' nature. The main performance factor for Join is the number of child docs.

I'm not sure I got all your questions; please specify them in more detail if something is still unclear.

Have you seen my benchmark? http://blog.griddynamics.com/2012/08/block-join-query-performs.html

On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com wrote:

> Hello, Does anyone know about some measurements in terms of performance for cross joins compared to joins inside a single index? [...]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
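A note on why the query-time join is described above as having a 'full-scan' nature. The two expensive steps Mikhail names (collecting parentId values from children, then looking parents up by them and filling a bitset out of order) can be sketched in plain Python as follows. This is only a toy model under assumptions: `query_time_join`, the `parent_id` field, and the dictionary standing in for a term lookup are all hypothetical, and none of it is the actual Solr join code.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]


def query_time_join(children: List[Doc],
                    parent_doc_by_id: Dict[str, int],
                    max_doc: int,
                    matches: Callable[[Doc], bool]) -> List[int]:
    """Return parent doc ids for all children accepted by `matches`."""
    # Step 1: every matching child is visited to collect its join key.
    keys = {child["parent_id"] for child in children if matches(child)}
    # Step 2: each key is looked up, and a bitset sized to the whole
    # index is filled in no particular doc-id order.
    bits = bytearray(max_doc)
    for key in keys:
        doc = parent_doc_by_id.get(key)
        if doc is not None:
            bits[doc] = 1
    return [doc for doc, bit in enumerate(bits) if bit]


children = [
    {"id": "c1", "parent_id": "p2", "color": "red"},
    {"id": "c2", "parent_id": "p1", "color": "red"},
    {"id": "c3", "parent_id": "p2", "color": "blue"},
]
parents = query_time_join(children, {"p1": 0, "p2": 4}, max_doc=6,
                          matches=lambda c: c["color"] == "red")
print(parents)  # [0, 4]
```

The work in step 1 grows with the number of matching children and the bitset in step 2 grows with the index size, which matches the remark that the number of child docs is the main performance factor for Join.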
Re: Performance of cross join vs block join
In my current use case I have 4 tables with a one-to-many relationship between them (one is the parent and the rest are the children), and I have created a separate Solr core for each table. Now I have a request to return all parents that match a certain criterion, or that have a child matching the same criterion or a different one.

Given that moving all these documents into a single core implies more changes to the current code than keeping the cores as they are, I also considered the solution of a union of cross joins. I then ran some tests and saw that a join in a single core does not add much compared to a union of cross joins, so I don't know which solution to adopt. Do you see a use case where I would hit the wall if I keep the documents in separate cores?

BTW the link below does not work (I found it while searching this topic); it displays an empty page.

Thanks,
Mihaela

From: Mikhail Khludnev mkhlud...@griddynamics.com
To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com
Sent: Thursday, July 11, 2013 2:25 PM
Subject: Re: Performance of cross join vs block join

> [...] have you saw my benchmark http://blog.griddynamics.com/2012/08/block-join-query-performs.html ?
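A note on the "union of cross joins" mentioned in the message above. One way such a query could be assembled is sketched below, using Solr's `{!join from=... to=... fromIndex=...}` query parser and the `_query_:"..."` nested-query syntax. The core names (`orders`, `contacts`), the field names (`parent_id`, `id`, `name`, `status`, `city`), and the helper functions are assumptions for illustration only; the thread does not show the actual schema or queries.

```python
from typing import Dict


def cross_core_join(core: str, child_query: str,
                    from_field: str = "parent_id", to_field: str = "id") -> str:
    """Build a join clause that runs `child_query` against another core."""
    return f"{{!join from={from_field} to={to_field} fromIndex={core}}}{child_query}"


def nested(query: str) -> str:
    """Wrap a query so it can be combined with OR in the main query."""
    escaped = query.replace("\\", "\\\\").replace('"', '\\"')
    return f'_query_:"{escaped}"'


def union_query(parent_query: str, child_queries: Dict[str, str]) -> str:
    """Parents matching directly OR through any of the child cores."""
    clauses = [f"({parent_query})"]
    clauses += [nested(cross_core_join(core, q)) for core, q in child_queries.items()]
    return " OR ".join(clauses)


q = union_query("name:acme", {"orders": "status:open", "contacts": "city:Paris"})
print(q)
```

Each join clause is evaluated separately, so the cost of this query is roughly the sum of the individual joins; whether that is acceptable depends on how many child documents each clause selects.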