Re: Solr relevancy tuning
I realize I never responded to this thread, shame on me! Jorge/Giovanni, Kelvin looks pretty cool -- thanks for sharing it. When we use Quepid, we sometimes do it at places with existing relevancy test scripts like Kelvin. Quepid and test scripts tend to satisfy different niches. In addition to testing, Quepid is a GUI that helps you explain, investigate, and sandbox. Sometimes this is nice for fuzzier, more qualitative judgments, especially when you want to collaborate with non-technical stakeholders. It's been our replacement for the spreadsheet that a lot of our clients used before Quepid, where the non-technical folks would list queries and the results they expected. Scripts work very well for getting that pass/fail response. It's nice that Kelvin gives you a temperature instead of necessarily a pass/fail; that level of fuzziness is definitely useful. We certainly see value in both (and will probably be doing more to integrate Quepid with continuous integration/scripting). Cheers, -Doug

On Mon, May 5, 2014 at 2:47 AM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote: [...]
Re: Solr relevancy tuning
One good thing about Kelvin is that it's more of a programmatic task, so you can execute the scripts after a few changes or a deployment and get a general idea of whether the new changes have impacted the search experience. Sure, the changing catalog is still a problem, but I kind of like being able to execute a few commands and, presto, get it done. This could become a must-run test in the test suite of the app. I kind of do this already, but testing from the user interface, using the test library provided by Symfony2 (the framework I'm using) and its functional tests. It's not test-driven search relevancy per se, but we make sure not to mess up some basic queries we use to test the search feature.

- Original Message - From: Giovanni Bricconi giovanni.bricc...@banzai.it To: solr-user solr-user@lucene.apache.org Cc: Ahmet Arslan iori...@yahoo.com Sent: Friday, April 11, 2014 5:15:56 AM Subject: Re: Solr relevancy tuning [...]
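The must-run test idea above might be sketched roughly like this (the `search` function and the query list are invented stand-ins for illustration; a real suite would call the application's search endpoint instead):

```python
# Sketch of a "must-run" search smoke test for a CI suite. The search()
# function is a hypothetical stand-in for the real search call.

def search(query):
    # Fake index so the sketch is self-contained and runnable.
    catalog = {
        "ethernet cable": ["Ethernet cable Cat5e", "Ethernet cable Cat6"],
        "network cable": ["Ethernet cable Cat5e"],
    }
    return catalog.get(query, [])

SMOKE_TESTS = [
    # (query, word that must appear in every returned title)
    ("ethernet cable", "ethernet"),
    ("network cable", "cable"),
]

def run_smoke_tests():
    failures = []
    for query, required_word in SMOKE_TESTS:
        titles = search(query)
        if not titles or not all(required_word in t.lower() for t in titles):
            failures.append(query)
    return failures  # empty list means all basic queries still work

print(run_smoke_tests())  # prints []
```

Wired into a CI pipeline, a non-empty failure list would fail the build, which is exactly the pass/fail style of check being described.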
Re: Solr relevancy tuning
Hello Doug, I have just watched the Quepid demonstration video, and I strongly agree with your introduction: it is very hard to involve marketing/business people in repeated testing sessions, and spreadsheets or other kinds of files are not the right tool to use. Currently I'm quite alone in my tuning task, and having a visual approach could be beneficial for me; you are giving me many good inputs!

I see that Kelvin (my scripted tool) and Quepid follow the same path. In Quepid someone quickly watches the results and applies colours to them; in Kelvin you enter one or more queries (network cable, ethernet cable) and state that the results must contain ethernet in the title, or must come from a list of product categories. I also do diffs of results, before and after changes, to check what is going on, but I have to do that in a very Unix-scripted way. Have you considered placing a counter of total red/bad results in Quepid? I use this index to get a quick overview of a change's impact across all queries. Actually I repeat tests in production from time to time, and if I see the Kelvin temperature rising (the number of errors going up) I know I have to check what's going on, because new products may be having a bad impact on the index. I also keep counters of products with low-quality images, no images at all, or too-short listings; sometimes these are useful to better understand what will happen if you change some bq/fq in the application.

I see also that after changes in Quepid someone has to check the grey results and assign them a colour; in Kelvin's case the conditions can sometimes do a bit of magic (new product names still contain SM-G900F) but can sometimes introduce false errors (the new product name contains only Galaxy 5 and not the product code SM-G900F). So some checks are needed, but with Quepid everybody can do the check, while with Kelvin you have to change some lines of a script, and not everybody is able/willing to do that.
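The before/after diff described above could be done without Unix plumbing with a few lines of code; this sketch (the ids and field names are invented) compares the ranked result ids for one query under two engine configurations:

```python
# Self-contained sketch of a before/after result diff: report which
# result ids entered, left, or changed position in the ranking.

def diff_results(before, after):
    entered = [d for d in after if d not in before]
    left = [d for d in before if d not in after]
    moved = [d for d in after
             if d in before and before.index(d) != after.index(d)]
    return {"entered": entered, "left": left, "moved": moved}

# Hypothetical top-5 ids for "network cable" before and after a bq change.
before = ["cable-cat6", "cable-cat5e", "hub-8port", "adapter-usb", "cable-flat"]
after  = ["cable-cat5e", "cable-cat6", "cable-flat", "hub-8port", "switch-5p"]

# switch-5p entered, adapter-usb left, four ids changed position.
print(diff_results(before, after))
```

Summing the `entered`/`left` counts across all tracked queries gives a single churn number that can serve as the quick overview of a change's impact.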
The idea of a static index is a good suggestion; I will try to have one in the next round of search-engine improvements. Thank you Doug!

2014-04-09 17:48 GMT+02:00 Doug Turnbull dturnb...@opensourceconnections.com: [...]
Solr relevancy tuning
It is about one year that I have been working on an e-commerce site, and unfortunately I have no information retrieval background, so I am probably missing some important practices about relevance tuning and search engines. During this period I had to fix many bugs about bad search results, which I have solved sometimes by tuning edismax weights, sometimes by creating ad hoc query filters or query boosting; but I am still not able to figure out what the correct process to improve search result relevance should be. These are the practices I am following; I would really appreciate any comments about them, and any hints about what practices you follow in your projects:

- In order to have a measure of search quality I have written many test cases, such as: if the user searches for nike sport watch, the search results should display at least four TomTom products with the words nike and sportwatch in the title. I have written a tool that reads such tests from JSON files, applies them to my application, and then counts the number of results that do not match the criteria stated in the test cases. (For those interested, this tool is available at https://github.com/gibri/kelvin but it is still quite a prototype.)
- I use this count as a quality index; I have tried various times to change the edismax weights to lower the total number of errors, or to add new filters/boostings to the application to try to decrease the error count.
- The pro of this is that at least you have a number to look at, and a quick way of checking the impact of a modification.
- The bad side is that you have to maintain the test cases: I now have about 800 tests and my product catalogue changes often, which means that some products exit the catalog and some test cases can't pass anymore.
- I am populating the test cases using errors reported by users, and I feel that this is driving the test cases too much toward pathological cases. Moreover, I haven't many tests for cases that are working well now.
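The tool described in the first bullet might look something like the following sketch (the JSON field names are illustrative, not Kelvin's actual schema, and `run_query` is a stub in place of a real Solr call):

```python
import json

# Hypothetical test-case format: each case states a query, a word the
# matching titles must contain, and how many such results are required.
TEST_CASES = json.loads("""
[
  {"query": "nike sport watch", "title_must_contain": "sportwatch", "min_matches": 4},
  {"query": "ethernet cable",   "title_must_contain": "ethernet",   "min_matches": 2}
]
""")

def run_query(query):
    # Stand-in for the real search-engine call (e.g. Solr /select).
    fake_index = {
        "nike sport watch": ["TomTom Nike+ SportWatch GPS"] * 4,
        "ethernet cable": ["Ethernet cable Cat5e", "Ethernet cable Cat6"],
    }
    return fake_index.get(query, [])

def temperature(cases):
    """Count failing cases -- the 'temperature': 0 is healthy, higher is worse."""
    errors = 0
    for case in cases:
        titles = run_query(case["query"])
        matches = sum(case["title_must_contain"] in t.lower() for t in titles)
        if matches < case["min_matches"]:
            errors += 1
    return errors

print(temperature(TEST_CASES))  # prints 0
```

Tracking that single error count before and after an edismax change is the quality index the bullets describe.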
I would like to use search logs as drivers to generate tests, but I feel I haven't picked the right path. Using top queries, manually reviewing results, and then writing tests is a slow process; moreover, many top queries are ambiguous or are driven by site ads, and many, many queries are unique per user. How do you deal with these cases? How are you using your logs to find test cases to fix? Are you looking for queries where the user does not open any returned result? Which KPI have you chosen to find queries that are not providing good results? And what are you using as a KPI for the whole search, besides the conversion rate? Can you suggest any other practices you are using in your projects? Thank you very much in advance, Giovanni
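One log-driven KPI of the kind asked about above -- the share of searches for a query where the user opened no result at all -- can be computed in a few lines. The log rows here are invented for illustration:

```python
from collections import defaultdict

# Hypothetical click log: (query, number of results the user clicked).
log = [
    ("nike sport watch", 0),
    ("nike sport watch", 2),
    ("ethernet cable", 1),
    ("sm-g900f", 0),
    ("sm-g900f", 0),
]

def zero_click_rate(rows):
    """Per query: fraction of searches where the user clicked nothing."""
    searches = defaultdict(int)
    zero_clicks = defaultdict(int)
    for query, clicks in rows:
        searches[query] += 1
        if clicks == 0:
            zero_clicks[query] += 1
    return {q: zero_clicks[q] / searches[q] for q in searches}

rates = zero_click_rate(log)
# Queries with a high zero-click rate are candidates for new test cases.
print(sorted(rates.items(), key=lambda kv: -kv[1]))
```

Run over real logs, the queries at the top of this list would point at likely relevance problems without having to review every top query by hand.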
Re: Solr relevancy tuning
Hi Giovanni, here are some relevant pointers:
http://www.lucenerevolution.org/2013/Test-Driven-Relevancy-How-to-Work-with-Content-Experts-to-Optimize-and-Maintain-Search-Relevancy
http://rosenfeldmedia.com/books/search-analytics/
http://www.sematext.com/search-analytics/index.html
Ahmet

On Wednesday, April 9, 2014 12:17 PM, Giovanni Bricconi giovanni.bricc...@banzai.it wrote: [...]
Re: Solr relevancy tuning
Thank you for the links. The book is really useful; I will definitely have to spend some time reformatting the logs to get access to the number of results found, the session id, and much more. I'm also quite happy that my test cases produce results similar to the precision reports shown at the beginning of the book. Giovanni

2014-04-09 12:59 GMT+02:00 Ahmet Arslan iori...@yahoo.com: [...]
Re: Solr relevancy tuning
Hey Giovanni, nice to meet you. I'm the person that did the Test-Driven Relevancy talk. We've got a product, Quepid (http://quepid.com), that lets you gather good/bad results for queries and do a sort of test-driven development against search relevancy. Sounds similar to your existing scripted approach. Have you considered keeping a static catalog for testing purposes? We had a project with a lot of updates and date-dependent relevancy; a static catalog lets you create some test scenarios against a static data set. However, one downside is that you can't recreate production problems in your test setup exactly -- you have to find a similar issue that reflects what you're seeing. Cheers, -Doug

On Wed, Apr 9, 2014 at 10:42 AM, Giovanni Bricconi giovanni.bricc...@banzai.it wrote: [...]
-- Doug Turnbull Search Big Data Architect OpenSource Connections http://o19s.com