Hi everyone,

I am using the SimilarProducts template. I have around 3 million events
for about 180k unique items, collected over a period of 2 months. The
original event data is about 900MB, but after training the model data
shrinks to only 16KB. When I query for predictions, I receive
predictions for only 15 items.

I have only $set and view events in my event store, which look like the
following:


{
  "event" : "$set",
  "entityType" : "item",
  "entityId" : "someEntityId",
  "properties" : {
    "property1" : "property1_value",
    "property2" : "property2_value"
  }
}


{
  "event" : "view",
  "entityType" : "user",
  "entityId" : "userSessionId",
  "targetEntityType" : "item",
  "targetEntityId" : "someTargetEntityId",
  "properties" : {}
}



A few facts about my implementation:
- I have removed the requirement in the engine template to set a user before
that user can view an item, as described here:
https://github.com/apache/incubator-predictionio/tree/develop/examples/scala-parallel-similarproduct/no-set-user

- Since I don't want to track users, I'm using the user's session id as the
entityId in the view event.
- In my case I cannot tell whether an item has already been set in the event
store. For this reason, I set the item before every view event. I have read
many times in the forums that multiple $set events for the same item do not
affect predictions.
- I'm using MySQL to store everything (event data, model data, metadata
etc.) because of certain requirements.
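
To make the points above concrete, here is a minimal sketch (in Python, not
my actual code) of what gets recorded for each page view. The function name
build_events and its parameters are made up for illustration; only the event
fields come from the examples above.

```python
from typing import Any, Dict, List


def build_events(session_id: str, item_id: str,
                 item_properties: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the $set and view events recorded for one page view."""
    # The item is re-set on every view because I cannot check
    # whether it already exists in the event store.
    set_event = {
        "event": "$set",
        "entityType": "item",
        "entityId": item_id,
        "properties": item_properties,
    }
    # The session id stands in for a user id, so no user $set is sent.
    view_event = {
        "event": "view",
        "entityType": "user",
        "entityId": session_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
        "properties": {},
    }
    return [set_event, view_event]


events = build_events("userSessionId", "someEntityId",
                      {"property1": "property1_value"})
```

So every view produces exactly one $set for the item followed by one view
from the session to that item.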

I have the following questions about this problem:
1: Why is the model data so small, and why am I getting predictions for only
a few items?
2: Is this an event data quality problem? If yes, how can I test and improve
the data quality?
3: Is it safe to remove old duplicate $set events with a MySQL query and keep
only the latest $set event for each item? Will that help with data quality?
4: I see various settings for the ALS algorithm in the engine.json file. Can
tweaking those settings help in some way? Are they explained
somewhere?
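
Regarding question 2, this is the kind of check I have in mind, as a rough
sketch only. It assumes the events have been exported into a list of dicts
with the fields shown above; the function names and the sample data are
hypothetical. My guess (an assumption, not something I have confirmed) is
that sessions with a single viewed item carry little information about which
items are viewed together, so I would like to count how many sessions viewed
at least two distinct items.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def items_per_session(events: Iterable[dict]) -> Dict[str, Set[str]]:
    """Map each session id to the set of distinct items it viewed."""
    views = defaultdict(set)
    for e in events:
        if e.get("event") == "view":
            views[e["entityId"]].add(e["targetEntityId"])
    return dict(views)


def summarize(events: Iterable[dict]) -> Dict[str, int]:
    """Count sessions, multi-item sessions, and items seen in the latter."""
    per_session = items_per_session(events)
    multi = {s: items for s, items in per_session.items() if len(items) >= 2}
    co_viewed = set().union(*multi.values())
    return {
        "sessions": len(per_session),
        "sessions_with_2plus_items": len(multi),
        "items_in_those_sessions": len(co_viewed),
    }


# Tiny made-up sample: s1 views two items, s2 views one.
sample = [
    {"event": "$set", "entityType": "item", "entityId": "a", "properties": {}},
    {"event": "view", "entityId": "s1", "targetEntityId": "a"},
    {"event": "view", "entityId": "s1", "targetEntityId": "b"},
    {"event": "view", "entityId": "s2", "targetEntityId": "a"},
]
summary = summarize(sample)
```

If the last two counts turn out to be very small compared to 180k items, that
would at least tell me whether the problem is in the data rather than in the
engine settings.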

Currently my ALS algorithm settings look like this:

"algorithms": [
  {
    "name": "als",
    "params": {
      "rank": 10,
      "numIterations": 10,
      "lambda": 0.01,
      "seed": 3
    }
  }
]


Many thanks for your time and suggestions.

Best,
Tahir
