I realize I never responded to this thread, shame on me! Jorge/Giovanni, Kelvin looks pretty cool -- thanks for sharing it. When we use Quepid, we sometimes do it at places with existing relevancy test scripts like Kelvin. Quepid and test scripts tend to satisfy different niches. In addition to testing, Quepid is a GUI for helping you explain, investigate, and sandbox. Sometimes this is nice for fuzzier, more qualitative judgments, especially when you want to collaborate with non-technical stakeholders. It's been our replacement for the "spreadsheet" that a lot of our clients used before Quepid -- where the non-technical folks would list queries and mark which results looked good or bad.
Scripts work very well for getting that pass/fail response. It's nice that Kelvin gives you a "temperature" instead of necessarily a pass/fail; that level of fuzziness is definitely useful. We certainly see value in both (and will probably be doing more to integrate Quepid with continuous integration/scripting).

Cheers,
-Doug

On Mon, May 5, 2014 at 2:47 AM, Jorge Luis Betancourt González <jlbetanco...@uci.cu> wrote:

> One good thing about Kelvin is that it's more of a programmatic task, so you could execute the scripts after a few changes/deployments and get a general idea of whether the new changes have impacted the search experience; yeah, sure, the changing catalog is still a problem, but I kind of like being able to execute a few commands and, presto, get it done. This could become a must-run test in the test suite of the app. I kind of do this already, but testing from the user interface, using the test library provided by Symfony2 (the framework I'm using) and its functional tests. It's not test-driven search relevancy per se, but we make sure not to break some basic queries we use to exercise the search feature.
>
> ----- Original Message -----
> From: "Giovanni Bricconi" <giovanni.bricc...@banzai.it>
> To: "solr-user" <solr-user@lucene.apache.org>
> Cc: "Ahmet Arslan" <iori...@yahoo.com>
> Sent: Friday, April 11, 2014 5:15:56 AM
> Subject: Re: Solr relevancy tuning
>
> Hello Doug
>
> I have just watched the Quepid demonstration video, and I strongly agree with your introduction: it is very hard to involve marketing/business people in repeated testing sessions, and spreadsheets or other kinds of files are not the right tool to use. Currently I'm quite alone in my tuning task, and having a visual approach could be beneficial for me; you are giving me many good inputs!
>
> I see that Kelvin (my scripted tool) and Quepid follow the same path. In Quepid someone quickly watches the results and applies colours to them; in Kelvin you enter one or more queries (network cable, ethernet cable) and state that the results must contain ethernet in the title, or must come from a list of product categories.
>
> I also do diffs of results, before and after changes, to check what is going on; but I have to do that in a very unix-scripted way.
>
> Have you considered placing a counter of total red/bad results in Quepid? I use this index to get a quick overview of the impact of changes across all queries. Actually I repeat the tests in production from time to time, and if I see the "kelvin temperature" rising (the number of errors going up) I know I have to check what's going on, because new products may be having a bad impact on the index.
>
> I also keep counters of products with low-quality images, no images at all, or too-short listings; these are sometimes useful to understand better what will happen if you change some bq/fq in the application.
>
> I also see that after changes in Quepid someone has to check "gray" results and assign them a colour. In Kelvin's case the conditions can sometimes do a bit of magic (new product names still contain SM-G900F) but can sometimes introduce false errors (the new product name contains only Galaxy 5 and not the product code SM-G900F). So some checks are needed, but with Quepid everybody can do the check, while with Kelvin you have to change some lines of a script, and not everybody is able/willing to do that.
>
> The idea of a static index is a good suggestion; I will try to have it in the next round of search engine improvements.
>
> Thank you Doug!
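[Editor's note: Giovanni's description of a Kelvin assertion above (a query or two, plus a "must contain this word in the title" or "must come from these categories" condition on the results) translates naturally into a small script. The sketch below is only an illustration of that idea, not Kelvin's actual file format or API: the JSON layout, the field names (title, category), and the core URL are all assumptions.]

import json
import urllib.parse
import urllib.request

# Illustrative test case in the spirit of Kelvin's JSON files (hypothetical schema):
# for each query, assert something about the top results.
TEST_CASE = {
    "queries": ["network cable", "ethernet cable"],
    "top_n": 10,
    "title_must_contain": "ethernet",                 # word expected in the title
    "allowed_categories": ["cables", "networking"],   # acceptable product categories
}

SOLR_SELECT = "http://localhost:8983/solr/products/select"  # assumed core name


def run_query(query, rows):
    """Run a query against Solr and return the result docs (JSON response writer)."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    with urllib.request.urlopen(f"{SOLR_SELECT}?{params}") as resp:
        return json.load(resp)["response"]["docs"]


def check_case(case):
    """Count the top-N documents that violate the case's conditions ("red" results).

    Assumes single-valued string fields named title and category.
    """
    errors = 0
    for query in case["queries"]:
        for doc in run_query(query, case["top_n"]):
            title_ok = case["title_must_contain"] in doc.get("title", "").lower()
            category_ok = doc.get("category") in case["allowed_categories"]
            if not (title_ok or category_ok):
                errors += 1
    return errors


if __name__ == "__main__":
    print("red results:", check_case(TEST_CASE))

Summing this error count over a few hundred such cases, before and after a config change, gives the single "temperature" number discussed in the thread.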
> 2014-04-09 17:48 GMT+02:00 Doug Turnbull <dturnb...@opensourceconnections.com>:
>
> > Hey Giovanni, nice to meet you.
> >
> > I'm the person that did the Test Driven Relevancy talk. We've got a product, Quepid (http://quepid.com), that lets you gather good/bad results for queries and do a sort of test-driven development against search relevancy. Sounds similar to your existing scripted approach. Have you considered keeping a static catalog for testing purposes? We had a project with a lot of updates and date-dependent relevancy. This lets you create some test scenarios against a static data set. However, one downside is you can't recreate problems from production in your test setup exactly -- you have to find a similar issue that reflects what you're seeing.
> >
> > Cheers,
> > -Doug
> >
> > On Wed, Apr 9, 2014 at 10:42 AM, Giovanni Bricconi <giovanni.bricc...@banzai.it> wrote:
> >
> > > Thank you for the links.
> > >
> > > The book is really useful; I will definitely have to spend some time reformatting the logs to get access to the number of results found, the session id, and much more.
> > >
> > > I'm also quite happy that my test cases produce similar results to the precision reports shown at the beginning of the book.
> > >
> > > Giovanni
> > >
> > > 2014-04-09 12:59 GMT+02:00 Ahmet Arslan <iori...@yahoo.com>:
> > >
> > > > Hi Giovanni,
> > > >
> > > > Here are some relevant pointers:
> > > >
> > > > http://www.lucenerevolution.org/2013/Test-Driven-Relevancy-How-to-Work-with-Content-Experts-to-Optimize-and-Maintain-Search-Relevancy
> > > >
> > > > http://rosenfeldmedia.com/books/search-analytics/
> > > >
> > > > http://www.sematext.com/search-analytics/index.html
> > > >
> > > > Ahmet
> > > >
> > > > On Wednesday, April 9, 2014 12:17 PM, Giovanni Bricconi <giovanni.bricc...@banzai.it> wrote:
> > > > I have been working on an e-commerce site for about a year, and unfortunately I have no "information retrieval" background, so I am probably missing some important practices about relevance tuning and search engines. During this period I have had to fix many "bugs" about bad search results, which I have solved sometimes by tuning edismax weights, sometimes by creating ad hoc query filters or query boosting; but I am still not able to figure out what the correct process for improving search result relevance should be.
> > > >
> > > > These are the practices I am following; I would really appreciate any comments about them, and any hints about the practices you follow in your projects:
> > > >
> > > > - In order to have a measure of search quality I have written many test cases such as "if the user searches for <<nike sport watch>> the search results should display at least four <<tom tom>> products with the words <<nike>> and <<sportwatch>> in the title". I have written a tool that reads such tests from JSON files, applies them to my application, and then counts the number of results that do not match the criteria stated in the test cases.
> > > > (For those interested, this tool is available at https://github.com/gibri/kelvin but it is still quite a prototype.)
> > > >
> > > > - I use this count as a quality index. I have tried various times to change the edismax weights to lower the overall number of errors, or to add new filters/boosts to the application to try to decrease the error count.
> > > >
> > > > - The pro of this is that at least you have a number to look at, and a quick way of checking the impact of a modification.
> > > >
> > > > - The bad side is that you have to maintain the test cases: I now have about 800 tests and my product catalogue changes often, which implies that some products exit the catalog and some test cases can't pass anymore.
> > > >
> > > > - I am populating the test cases using errors reported by users, and I feel that this is driving the test cases too much toward pathological cases. Moreover, I don't have many tests for cases that are working well now.
> > > >
> > > > I would like to use search logs as drivers to generate tests, but I feel I haven't picked the right path. Using top queries, manually reviewing results, and then writing tests is a slow process; moreover, many top queries are ambiguous or are driven by site ads.
> > > >
> > > > Many, many queries are unique per user. How do you deal with these cases?
> > > >
> > > > How are you using your logs to find test cases to fix? Are you looking for queries where the user does not "open" any returned results? Which KPIs have you chosen to find queries that are not providing good results? And what are you using as a KPI for search as a whole, besides the conversion rate?
> > > >
> > > > Can you suggest any other practices you are using in your projects?
> > > >
> > > > Thank you very much in advance
> > > >
> > > > Giovanni
> >
> > --
> > Doug Turnbull
> > Search & Big Data Architect
> > OpenSource Connections <http://o19s.com>

--
Doug Turnbull
Search & Big Data Architect
OpenSource Connections <http://o19s.com>
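[Editor's note: on Giovanni's closing questions about mining logs for test cases, one common heuristic hinted at above is to flag frequent queries that return zero results, or that users never click through on. The sketch below assumes a simplified CSV log with hypothetical query, num_found, and clicked columns; a real access log would need its own parsing.]

import csv
from collections import Counter

# Assumed simplified search log: one row per query event, with the raw query,
# the number of results returned, and whether any result was clicked.

def candidate_queries(log_path, min_freq=20):
    """Return frequent queries that returned nothing or were never clicked."""
    freq = Counter()
    zero_results = Counter()
    clicked = Counter()
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):   # columns: query, num_found, clicked
            q = row["query"].strip().lower()
            freq[q] += 1
            if int(row["num_found"]) == 0:
                zero_results[q] += 1
            if row["clicked"] == "1":
                clicked[q] += 1
    return [
        q for q, n in freq.most_common()
        if n >= min_freq and (zero_results[q] == n or clicked[q] == 0)
    ]


if __name__ == "__main__":
    # Print the top candidates to review manually before turning them into test cases.
    for q in candidate_queries("search_log.csv")[:50]:
        print(q)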