OK, single-action recs are working, so output to Solr now works with only [B'B] and B.

On Aug 13, 2013, at 10:52 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:

Corrections inline

> On Aug 13, 2013, at 10:49 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> 
> I finally got some time to work on this and have a first cut at output to 
> Solr working on the github repo. It only works on 2-action input but I'll 
> have that cleaned up soon so it will work with one action. Solr indexing has 
> not been tested yet and the field names and/or types may need tweaking. 
> 
> It takes the result of the previous drop:
> 1) DRMs for B (user history of B items, action1) and A (user history of A 
> items, action2)
> 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
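
For readers unfamiliar with the two similarity measures named in the list above, here is a minimal Python sketch; it is not the project's code. It assumes each DRM can be treated as a map from user ID to a list of item IDs. `cooccurrence` counts the users shared by a B item and an A item (the entries of [B'A]), and `llr` is the standard log-likelihood ratio on a 2x2 contingency table, the kind of score used for [B'B]. The function names and sample data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of co-occurrence counts.

    k11: users with both items, k12: first item only,
    k21: second item only, k22: neither item.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def cooccurrence(b_hist, a_hist):
    """Count, for every (B item, A item) pair, the users who have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for user, b_items in b_hist.items():
        for b, a in product(set(b_items), set(a_hist.get(user, ()))):
            counts[b][a] += 1
    return counts


# Hypothetical sample histories (assumption, not project data).
b_hist = {"u1": ["iphone", "ipad"], "u2": ["iphone"]}
a_hist = {"u1": ["iphone", "ipad", "galaxy"], "u2": ["galaxy"]}
b_a = cooccurrence(b_hist, a_hist)
```

A table with no association, such as `llr(1, 1, 1, 1)`, scores zero, while strongly associated counts score high. That is why LLR is useful for filtering out pairs that co-occur only by chance.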
> 
> There are two final outputs created using mapreduce but requiring 2 in-memory 
> hashmaps. I think this will work on a cluster (the hashmaps are instantiated 
> on each node) but haven't tried yet. It orders items in #2 fields by strength 
> of "link", which is the similarity value used in [B'B] or [B'A]. It would be 
> nice to order #1 by recency but there is no provision for passing through 
> timestamps at present so they are ordered by the strength of preference. This 
> is probably not useful and so can be ignored. Ordering by recency might be 
> useful for truncating queries by recency while leaving the training data 
> containing 100% of available history.
> 
> 1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks 
> like this:
> id,history_b,history_a
> u1,iphone ipad,iphone ipad galaxy
> ...
> 
> 2) It joins #2 DRMs to produce a single set of docs in CSV form, which looks 
> like this:
> id,b_b_links,b_a_links
> iphone,iphone ipad,iphone ipad galaxy
> …
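
The two joins described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the mapreduce job in the repo. It assumes each DRM is available as a map from row ID to `{item: strength}`. The helper name `join_to_csv` and the strength values are hypothetical; the field names and item IDs come from the example rows above.

```python
import csv
import io


def join_to_csv(header, left, right):
    """Outer-join two {id: {item: strength}} maps into CSV text.

    Items in each field are space-separated and ordered by descending
    strength; ties are broken by item ID so the output is stable.
    """
    def ordered(weights):
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        return " ".join(item for item, _ in ranked)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for key in sorted(set(left) | set(right)):
        writer.writerow([key, ordered(left.get(key, {})), ordered(right.get(key, {}))])
    return out.getvalue()


# Hypothetical strengths (assumption): higher means a stronger link.
b_b = {"iphone": {"ipad": 2.0, "iphone": 3.0}}
b_a = {"iphone": {"galaxy": 1.0, "iphone": 3.0, "ipad": 2.0}}
doc = join_to_csv(("id", "b_b_links", "b_a_links"), b_b, b_a)
```

The same helper covers the user-history join by passing the B and A histories with the header `("id", "history_b", "history_a")`.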
> 
> It may work on a cluster; I haven't tried yet. As soon as someone has some 
> large-ish sample log files I'll give them a try. Check the sample input files 
> in the resources dir for format.
> 
> https://github.com/pferrel/solr-recommender
> 
> 
> On Aug 13, 2013, at 10:17 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> 
> When I started looking at this I was a bit skeptical. As a Search engine Solr 
> may be peerless, but as yet another NoSQL db?
> 
> However getting further into this I see one very large benefit. It has one 
> feature that sets it completely apart from the typical NoSQL db. The type of 
> queries you do return fuzzy results--in the very best sense of that word. The 
> most interesting queries are based on similarity to some exemplar. Results 
> are returned in order of similarity strength, not ordered by a sort field.
> 
> Wherever similarity based queries are important I'll look at Solr first. 
> SolrJ looks like an interesting way to get Solr queries on POJOs. It's 
> probably at least an alternative to using docs and CSVs to import the data 
> from Mahout.
> 
> 
> 
> On Aug 12, 2013, at 2:32 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
> Yes.  That would be interesting.
> 
> 
> 
> 
> On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
> 
>> A little digression: Might a Matrix implementation that is backed by a Solr
>> index and uses SolrJ for querying help at all with the Solr recommendation
>> approach?
>> 
>> It supports multiple fields of String, Text, or boolean flags.
>> 
>> Best
>> Gokhan
>> 
>> 
>> On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>> 
>>> Also a question about user history.
>>> 
>>> I was planning to write these into separate directories so Solr could
>>> fetch them from different sources but it occurs to me that it would be
>>> better to join A and B by user ID and output a doc per user ID with three
>>> fields, id, A item history, and B item history. Other fields could be
>>> added for user metadata.
>>> 
>>> Sound correct? This is what I'll do unless someone stops me.
>>> 
>>> On Aug 7, 2013, at 11:25 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>> 
>>> Once you have a sample or example of what you think the
>>> "log file" version will look like, can you post it? It would be great to
>>> have example lines for two actions with or without the same item IDs.
>>> I'll make sure we can digest it.
>>> 
>>> I thought more about the ingest part and I don't think the one-item-space
>>> is actually a problem. It just means one item dictionary. A and B will have
>>> the right content; all I have to do is make sure the right ranks are input
>>> to the MM, Transpose, and RSJ. This in turn is only one extra count of the
>>> # of items in A's item space. This should be a very easy change if my
>>> thinking is correct.
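
One possible reading of the single-item-dictionary idea above, sketched in Python. This is an interpretation, not the project's ingest code. It assumes log events arrive as `(user, action, item)` tuples with actions labeled "A" and "B"; the function name `build_matrices` and the sample events are hypothetical.

```python
def build_matrices(events):
    """Index items from both actions with one shared dictionary.

    Returns the dictionary, sparse rows for A and B as
    {user: set(column index)}, the total item count, and the number
    of distinct items that appear under action A.
    """
    item_index = {}
    rows = {"A": {}, "B": {}}
    for user, action, item in events:
        # The same item ID always maps to the same column in A and B.
        col = item_index.setdefault(item, len(item_index))
        rows[action].setdefault(user, set()).add(col)
    a_columns = set().union(*rows["A"].values()) if rows["A"] else set()
    return item_index, rows["A"], rows["B"], len(item_index), len(a_columns)


# Hypothetical events (assumption, not real log data).
events = [
    ("u1", "B", "iphone"),
    ("u1", "A", "galaxy"),
    ("u2", "A", "iphone"),
]
item_index, a_rows, b_rows, n_items, n_a_items = build_matrices(events)
```

Because the columns are shared, the only extra bookkeeping in this sketch is the count of items seen under action A, which is what a downstream job would need to size its input.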
>>> 
>>> 
>>> On Aug 7, 2013, at 8:09 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> 
>>> On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>>> 
>>>> 4) To add more metadata to the Solr output will be left to the consumer
>>>> for now. If there is a good data set to use we can illustrate how to do it
>>>> in the project. Ted may have some data for this from musicbrainz.
>>> 
>>> 
>>> I am working on this issue now.
>>> 
>>> The current state is that I can bring in a bunch of track names and links
>>> to artist names and so on.  This would provide the basic set of items
>>> (artists, genres, tracks and tags).
>>> 
>>> There is a hitch in bringing in the data needed to generate the logs since
>>> that part of MB is not Apache compatible.  I am working on that issue.
>>> 
>>> Technically, the data is in a massively normalized relational form right
>>> now, but it isn't terribly hard to denormalize into a form that we need.
>>> 
>>> 
>>> 
>> 
> 
> 
> 
