Re: Solr relevancy tuning

2014-06-12 Thread Doug Turnbull
I realize I never responded to this thread, shame on me!

Jorge/Giovanni -- Kelvin looks pretty cool, thanks for sharing it. When we
use Quepid, we sometimes do it at places that already have relevancy test
scripts like Kelvin. Quepid and test scripts tend to satisfy different
niches. In addition to testing, Quepid is a GUI that helps you explain,
investigate, and sandbox. Sometimes this is nice for fuzzier, more
qualitative judgments, especially when you want to collaborate with
non-technical stakeholders. It's been our replacement for the spreadsheet
that a lot of our clients used before Quepid -- the one where the
non-technical folks would list queries and mark the results as good or bad.

Scripts work very well for getting that pass/fail response. It's nice that
Kelvin gives you a temperature instead of just a pass/fail; that level of
fuzziness is definitely useful.

We certainly see value in both (and will probably be doing more to
integrate Quepid with continuous integration/scripting).

Cheers,
-Doug


Re: Solr relevancy tuning

2014-05-05 Thread Jorge Luis Betancourt González
One good thing about Kelvin is that it's a more programmatic task, so you can
execute the scripts after a few changes or a deployment and get a general idea
of whether the new changes have impacted the search experience. Sure, the
changing catalog is still a problem, but I like being able to execute a few
commands and, presto, get it done. This could become a must-run test in the
test suite of the app. I kind of do this already, but testing from the user
interface, using the test library provided by Symfony2 (the framework I'm
using) and its functional tests. It's not test-driven search relevancy per se,
but it ensures we don't mess up the basic queries we use to test the search
feature.
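
Just to make the must-run idea concrete, such a check can stay very small.
Here is a rough sketch in Python (the Solr URL, the queries and the expected
ids are invented for the example; a real test would use our own data):

# A rough sketch of a must-run relevancy smoke test (pytest style).
# The Solr URL, the queries and the expected ids below are invented examples.
import json
import urllib.parse
import urllib.request

SOLR_SELECT = "http://localhost:8983/solr/products/select"  # assumed core

# query -> a document id that must stay in the top 10 results
MUST_FIND = {
    "ethernet cable": "SKU-12345",
    "nike sportwatch": "SKU-67890",
}

def top_ids(query, rows=10):
    params = urllib.parse.urlencode({"q": query, "wt": "json", "rows": rows})
    with urllib.request.urlopen(SOLR_SELECT + "?" + params) as resp:
        docs = json.loads(resp.read().decode("utf-8"))["response"]["docs"]
    return [doc["id"] for doc in docs]

def test_basic_queries():
    for query, expected in MUST_FIND.items():
        assert expected in top_ids(query), "%r no longer returns %s" % (query, expected)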


Re: Solr relevancy tuning

2014-04-11 Thread Giovanni Bricconi
Hello Doug

I have just watched the Quepid demonstration video, and I strongly agree
with your introduction: it is very hard to involve marketing/business
people in repeated testing sessions, and spreadsheets or other kinds of
files are not the right tool to use.
Currently I'm quite alone in my tuning task, and a visual approach could be
beneficial for me; you are giving me many good inputs!

I see that Kelvin (my scripted tool) and Quepid follow the same path. In
Quepid someone quickly looks through the results and applies colours to
them; in Kelvin you enter one or more queries (network cable, ethernet
cable) and state that the results must contain ethernet in the title, or
must come from a list of product categories.

I also do diffs of results, before and after changes, to check what is
going on; but I have to do that in a very unix-scripted way.
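
For what it's worth, my diff is roughly equivalent to something like the
following Python sketch (the Solr URL, the id field and the file layout are
only placeholders):

# Sketch: snapshot the top-N ids per query before a change, diff afterwards.
import json
import urllib.parse
import urllib.request

SOLR_SELECT = "http://localhost:8983/solr/products/select"  # placeholder

def top_ids(query, rows=10):
    params = urllib.parse.urlencode({"q": query, "wt": "json", "rows": rows})
    with urllib.request.urlopen(SOLR_SELECT + "?" + params) as resp:
        docs = json.loads(resp.read().decode("utf-8"))["response"]["docs"]
    return [doc["id"] for doc in docs]

def snapshot(queries, path):
    with open(path, "w") as out:
        json.dump({q: top_ids(q) for q in queries}, out, indent=2, sort_keys=True)

def diff(before_path, after_path):
    with open(before_path) as f:
        before = json.load(f)
    with open(after_path) as f:
        after = json.load(f)
    for query, old_ids in sorted(before.items()):
        new_ids = after.get(query, [])
        dropped = [i for i in old_ids if i not in new_ids]
        entered = [i for i in new_ids if i not in old_ids]
        if dropped or entered:
            print(query, "dropped:", dropped, "entered:", entered)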

Have you considered placing a counter of total red/bad results in Quepid?
I use this number to get a quick overview of the impact of changes across
all queries. I also repeat the tests in production from time to time, and
if I see the Kelvin temperature rising (the number of errors going up) I
know I have to check what's going on, because new products may be having a
bad impact on the index.

I also keep counters of products with low-quality images, no images at
all, or too-short listings; they are sometimes useful to better understand
what will happen if you change some bq/fq in the application.
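
Counters like these can be kept with a couple of facet queries, roughly like
this (the image_count and description_length fields are made-up names; the
real schema differs):

# Sketch: catalogue-quality counters via facet queries.
import json
import urllib.parse
import urllib.request

def quality_counters(select_url="http://localhost:8983/solr/products/select"):
    params = urllib.parse.urlencode([
        ("q", "*:*"), ("rows", "0"), ("wt", "json"), ("facet", "true"),
        ("facet.query", "image_count:0"),                 # products without images
        ("facet.query", "description_length:[0 TO 50]"),  # too-short listings
    ])
    with urllib.request.urlopen(select_url + "?" + params) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["facet_counts"]["facet_queries"]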

I also see that after changes someone has to check the gray results in
Quepid and assign them a colour. In Kelvin's case the conditions can
sometimes do a bit of magic (new product names still contain SM-G900F), but
they can also introduce false errors (the new product name contains only
Galaxy 5 and not the product code SM-G900F). So some checks are needed, but
with Quepid everybody can do the check, while with Kelvin you have to
change some lines of a script, and not everybody is able or willing to do
that.

The idea of a static index is a good suggestion; I will try to have one in
the next round of search engine improvements.

Thank you Doug!




Solr relevancy tuning

2014-04-09 Thread Giovanni Bricconi
I have been working on an e-commerce site for about a year and,
unfortunately, I have no information retrieval background, so I am probably
missing some important practices about relevance tuning and search engines.
During this period I have had to fix many bugs about bad search results,
which I have solved sometimes by tuning edismax weights, sometimes by
creating ad hoc query filters or query boosting; but I am still not able to
figure out what the correct process to improve search result relevance
should be.

These are the practices I am following; I would really appreciate any
comments about them, and any hints about the practices you follow in your
own projects:

- In order to have a measure of search quality I have written many test
cases such as if the user searches for nike sport watch, the search
results should display at least four tom tom products with the words nike
and sportwatch in the title. I have written a tool that reads such tests
from json files, applies them to my application, and then counts the number
of results that do not match the criteria stated in the test cases (for
those interested, this tool is available at https://github.com/gibri/kelvin
but it is still quite a prototype). A rough sketch of this kind of test
case is shown right after this list.

- I use this count as a quality index. I have tried, at various times, to
change the edismax weights to lower the overall number of errors, or to add
new filters/boostings to the application to try to decrease the error
count.

- The upside of this is that you at least have a number to look at, and a
quick way of checking the impact of a modification.

- The downside is that you have to maintain the test cases: I now have
about 800 tests and my product catalogue changes often, which means that
some products exit the catalog and some test cases can't pass anymore.

- I am populating the test cases using errors reported by users, and I
feel that this is driving the test cases too much toward pathological
cases. Moreover, I don't have many tests for cases that are working well
now.
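
As promised above, here is a rough illustration of such a test case and a
minimal checker in Python. The json layout, the field names and the Solr
URL are invented for the example and are not Kelvin's exact schema:

# Illustration only: one relevancy test case and a minimal checker.
import json
import urllib.parse
import urllib.request

TEST_CASE = {
    "queries": ["nike sport watch", "nike sportwatch"],
    "title_must_contain": ["nike", "sportwatch"],
    "min_matches": 4,
}

def search(query, rows=10,
           select_url="http://localhost:8983/solr/products/select"):
    params = urllib.parse.urlencode({"q": query, "wt": "json", "rows": rows})
    with urllib.request.urlopen(select_url + "?" + params) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]["docs"]

def count_errors(case):
    errors = 0
    for query in case["queries"]:
        matching = [doc for doc in search(query)
                    if all(word in doc.get("title", "").lower()
                           for word in case["title_must_contain"])]
        if len(matching) < case["min_matches"]:
            errors += 1  # this query fails and adds to the overall error count
    return errors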

I would like to use search logs to drive test generation, but I feel I
haven't picked the right path. Taking top queries, manually reviewing the
results, and then writing tests is a slow process; moreover, many top
queries are ambiguous or are driven by site ads.

Many, many queries are unique to a single user. How do you deal with these
cases?

How are you using your logs to find test cases to fix? Are you looking for
queries where the user does not open any of the returned results? Which
KPIs have you chosen to find queries that are not providing good results?
And what do you use as a KPI for search as a whole, besides the conversion
rate?
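
To make the question more concrete, the kind of analysis I have in mind is
roughly the following sketch, assuming a tab-separated log line per search
with the query, the session id and the id of a clicked result (empty when
nothing was clicked) -- the log format here is invented:

# Sketch only: rank queries by the share of sessions that clicked nothing.
import csv
from collections import defaultdict

def zero_click_report(log_path, min_sessions=20):
    ran = defaultdict(set)      # query -> sessions that issued it
    clicked = defaultdict(set)  # query -> sessions that clicked a result
    with open(log_path) as log:
        for query, session, doc in csv.reader(log, delimiter="\t"):
            ran[query].add(session)
            if doc:
                clicked[query].add(session)
    report = []
    for query, sessions in ran.items():
        if len(sessions) >= min_sessions:  # skip one-off queries
            zero_click_rate = 1.0 - len(clicked[query]) / float(len(sessions))
            report.append((zero_click_rate, query))
    return sorted(report, reverse=True)  # worst queries first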

Can you suggest any other practices you are using on your projects?

Thank you very much in advance

Giovanni


Re: Solr relevancy tuning

2014-04-09 Thread Ahmet Arslan
Hi Giovanni,

Here are some relevant pointers:

http://www.lucenerevolution.org/2013/Test-Driven-Relevancy-How-to-Work-with-Content-Experts-to-Optimize-and-Maintain-Search-Relevancy
 

http://rosenfeldmedia.com/books/search-analytics/ 

http://www.sematext.com/search-analytics/index.html 


Ahmet



Re: Solr relevancy tuning

2014-04-09 Thread Giovanni Bricconi
Thank you for the links.

The book is really useful. I will definitely have to spend some time
reformatting the logs to get access to the number of results found, the
session id, and much more.

I'm also quite happy that my test cases produce results similar to the
precision reports shown at the beginning of the book.

Giovanni




Re: Solr relevancy tuning

2014-04-09 Thread Doug Turnbull
Hey Giovanni, nice to meet you.

I'm the person who did the Test Driven Relevancy talk. We've got a product,
Quepid (http://quepid.com), that lets you gather good/bad results for
queries and do a sort of test-driven development against search relevancy.
It sounds similar to your existing scripted approach. Have you considered
keeping a static catalog for testing purposes? We had a project with a lot
of updates and date-dependent relevancy, and a static catalog lets you
create test scenarios against a data set that doesn't change. One downside,
however, is that you can't exactly recreate production problems in your
test setup -- you have to find a similar issue that reflects what you're
seeing.
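
Building the static set doesn't have to be fancy -- something along the
lines of the sketch below is usually enough. The core URLs, the sample size
and the field handling are placeholders, not a prescribed setup:

# Sketch: copy a fixed sample of documents from the live core into a
# separate test core, so relevancy tests always run against the same data.
import json
import urllib.parse
import urllib.request

LIVE = "http://localhost:8983/solr/products"       # placeholder core URLs
TEST = "http://localhost:8983/solr/products_test"

def export_sample(rows=5000):
    params = urllib.parse.urlencode({"q": "*:*", "wt": "json", "rows": rows,
                                     "sort": "id asc"})  # stable ordering
    with urllib.request.urlopen(LIVE + "/select?" + params) as resp:
        docs = json.loads(resp.read().decode("utf-8"))["response"]["docs"]
    for doc in docs:
        doc.pop("_version_", None)  # drop Solr's internal field before re-indexing
    return docs

def load_test_core(docs):
    body = json.dumps(docs).encode("utf-8")
    request = urllib.request.Request(TEST + "/update?commit=true", data=body,
                                     headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request).read()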

Cheers,
-Doug


-- 
Doug Turnbull
Search & Big Data Architect
OpenSource Connections http://o19s.com