Hi everyone,
I am using the SimilarProducts template. I have about 3 million events
for about 180k unique items, collected over a period of 2 months.
The original event data is about 900MB, but after training the model data
is only 16KB. When I try to get predictions, I receive
predictions for only 15 items.
I have only $set and view events in my event store, which look like the
following:
{
  "event" : "$set",
  "entityType" : "item",
  "entityId" : "someEntityId",
  "properties" : {
    "property1" : "property1_value",
    "property2" : "property2_value"
  }
}
{
  "event" : "view",
  "entityType" : "user",
  "entityId" : "userSessionId",
  "targetEntityType" : "item",
  "targetEntityId" : "someTargetEntityId",
  "properties" : {}
}
A few facts about my implementation:
- I have removed the engine template's requirement that a user be set before
that user can view an item, as described here:
https://github.com/apache/incubator-predictionio/tree/develop/examples/scala-parallel-similarproduct/no-set-user
- Since I don't want to track users, I'm using the user's session id as the
entityId in the view event.
- In my case I cannot track whether an item has already been set in the event
store. For this reason, I set the item before every view event. I have read
many times in the forums that multiple $set events for the same item do not
affect predictions.
- I'm using MySQL to store everything (event data, model data, metadata
etc.) because of certain requirements.
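To make the third point concrete, here is a minimal sketch of the two payloads I build for every page view. The function name `build_events` and the sample values are only for illustration; the field names are the ones from the events shown above.

```python
import json


def build_events(session_id, item_id, item_properties):
    """Return the $set event and the view event sent for one page view.

    build_events is a hypothetical helper, not part of any PredictionIO SDK.
    """
    # The item is (re)set on every view because I cannot tell whether
    # it already exists in the event store.
    set_event = {
        "event": "$set",
        "entityType": "item",
        "entityId": item_id,
        "properties": item_properties,
    }
    # The session id stands in for the user id.
    view_event = {
        "event": "view",
        "entityType": "user",
        "entityId": session_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
        "properties": {},
    }
    return [set_event, view_event]


if __name__ == "__main__":
    sample = build_events(
        "userSessionId", "someEntityId", {"property1": "property1_value"}
    )
    for event in sample:
        print(json.dumps(event))
```

So every view produces one additional $set event for the viewed item, which is why the event store contains many duplicate $set events.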
I have the following questions about this problem:
1: Why is the model data so small, and why am I getting predictions for only
15 items?
2: Is this an event data quality problem? If yes, how can I test and improve
the data quality?
3: Is it safe to remove old duplicate $set events with a MySQL query and keep
only the latest $set event for each item? Would that help with data quality?
4: I see several settings for the ALS algorithm in the engine.json file. Can
tweaking those settings help in some way? Are they explained
somewhere?
Currently my ALS algorithm settings look like this:
"algorithms": [
  {
    "name": "als",
    "params": {
      "rank": 10,
      "numIterations": 10,
      "lambda": 0.01,
      "seed": 3
    }
  }
]
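Regarding question 2, one rough check I could run is sketched below. It counts how many sessions viewed more than one distinct item; my assumption (which may be wrong) is that sessions with a single view add little signal for item similarity. The sketch expects the events as one JSON object per line, and `summarize_views` and the sample lines are hypothetical.

```python
import json
from collections import defaultdict


def summarize_views(lines):
    """Summarize view events given as an iterable of JSON strings."""
    items_by_session = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Ignore $set and any other non-view events.
        if event.get("event") != "view":
            continue
        items_by_session[event["entityId"]].add(event["targetEntityId"])

    # Sessions that viewed at least two distinct items.
    multi = {s: items for s, items in items_by_session.items() if len(items) > 1}
    return {
        "sessions": len(items_by_session),
        "sessions_with_multiple_items": len(multi),
        "items_viewed": len(set().union(*items_by_session.values())),
        "items_in_multi_item_sessions": len(set().union(*multi.values())),
    }


if __name__ == "__main__":
    sample_lines = [
        '{"event": "$set", "entityType": "item", "entityId": "a", "properties": {}}',
        '{"event": "view", "entityType": "user", "entityId": "s1", '
        '"targetEntityType": "item", "targetEntityId": "a", "properties": {}}',
        '{"event": "view", "entityType": "user", "entityId": "s1", '
        '"targetEntityType": "item", "targetEntityId": "b", "properties": {}}',
        '{"event": "view", "entityType": "user", "entityId": "s2", '
        '"targetEntityType": "item", "targetEntityId": "a", "properties": {}}',
    ]
    print(summarize_views(sample_lines))
```

If only a small share of the 180k items appears in sessions with multiple views, that might explain the small model, but I am not sure this is the right way to look at it.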
Many thanks for your time and suggestions.
Best,
Tahir