The definition of an atom is missing (at least from the page referred to); is it
the basic piece of information?
It also seems that numeric is continuous (temperature, financial data) and
categorical and ordinal are discrete (words, ratings).
As such all these data types will be more naturally categorized
Hi Greg,
Thank you for your time debugging this!
Maybe we should simply make TanimotoCoefficientSimilarity return
Double.NaN in case of no overlap?
--sebastian
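
For illustration, a minimal sketch of that proposal, assuming each user's preferences are represented as a set of item IDs. The class `TanimotoSketch` and the method `tanimoto` are hypothetical names for this example and are not Mahout's TanimotoCoefficientSimilarity code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Mahout.
final class TanimotoSketch {

    // Tanimoto coefficient |A ∩ B| / |A ∪ B| over two sets of item IDs.
    // Assumption under discussion: return NaN (undefined) when nothing overlaps,
    // instead of the mathematically defined value 0.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        if (intersection.isEmpty()) {
            return Double.NaN;
        }
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<Long> u1 = Set.of(1L, 2L, 3L);
        Set<Long> u2 = Set.of(2L, 3L, 4L);
        Set<Long> u3 = Set.of(7L, 8L);
        System.out.println(tanimoto(u1, u2)); // 2 shared items out of 4 distinct
        System.out.println(tanimoto(u1, u3)); // no shared items
    }
}
```

Callers would then need to test the result with `Double.isNaN` before ranking, since NaN does not compare as smaller or larger than any similarity value.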
On 29.11.2011 06:28, Greg H wrote:
Sorry for taking so long to reply but I think I found where the problem is.
After comparing the
Hi,
There was a conversation some time ago about incorporating time
dependency for preferences:
http://thread.gmane.org/gmane.comp.apache.mahout.user/2951
Has there been any more discussion about this? Has anything been
checked into Mahout? Is anyone working on it? I might be able to
We are still in the process of resolving this features and weights problem.
I think normally you convert documents based on features when you have a
dictionary of features defined.
In the case of an apple, we need to define the size, weight, and color,
just as in the case of geography you have country, city,
Yeah the non-distributed implementation returns NaN in this case, which is
a bit of an abuse, since it is defined to be 0. In practice I have always
thought that is the right thing to do for consistency with other
implementations, where no overlap means undefined similarity. You could
argue it
Hello Anatoliy,
On 29.11.2011, at 10:32, Anatoliy Kats wrote:
Hi,
There was a conversation some time ago about incorporating time dependency
for preferences: http://thread.gmane.org/gmane.comp.apache.mahout.user/2951
Has there been any more discussion about this? Has anything been
Does anyone have an easy algorithm for coloring clusters in a nice way? That is,
given k clusters, color each centroid and all of its associated points in such
a way that it is visually appealing and avoids, to the extent it can, coloring
two distinct clusters the same color.
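
One simple approach, sketched here under assumptions rather than as a definitive answer: space k hues evenly around the color wheel and index the resulting palette by cluster id. The class `ClusterColors` and the method `palette` are hypothetical names; the saturation and brightness values are arbitrary choices.

```java
import java.awt.Color;

// Hypothetical helper for illustration only.
final class ClusterColors {

    // Builds k colors with evenly spaced hues, so no two clusters share a hue.
    static Color[] palette(int k) {
        Color[] colors = new Color[k];
        for (int i = 0; i < k; i++) {
            float hue = (float) i / k;                       // fraction of the color wheel
            colors[i] = Color.getHSBColor(hue, 0.8f, 0.9f);  // fixed saturation and brightness
        }
        return colors;
    }

    public static void main(String[] args) {
        Color[] colors = palette(5);
        int clusterId = 3; // the centroid and all of its points use the same entry
        System.out.println("Cluster " + clusterId + " -> " + colors[clusterId]);
    }
}
```

For large k, neighboring hues become hard to tell apart, so varying brightness or marker shape as well may be needed.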
Also, the same goes
On 29.11.2011 11:52, Sean Owen wrote:
Yeah the non-distributed implementation returns NaN in this case, which is
a bit of an abuse, since it is defined to be 0. In practice I have always
thought that is the right thing to do for consistency with other
implementations, where no overlap means
Currently I am running into a problem. My taste-web runs fine
with Grouplens 1M without changing the MAVEN_OPTS heap size.
However, my own data set, which has only 1200++ users with 1++ ratings,
can only run if I change the heap size to 2048. Any suggestion to
solve this
This doesn't make sense -- it runs with 1M ratings, but not with 1? Are
you sure you have your numbers right? 1 is a tiny data set.
Heap size should not be 2048, but something like 2048M.
On Tue, Nov 29, 2011 at 2:03 PM, Tan Shern Shiou
shernshiou@mnc.com.my wrote:
Currently I am
I am sorry if I didn't make myself clear.
I have 2 datasets here.
1. Grouplens 1M
2. My own with 10,000 ratings (very small test data)
taste-web runs fine with the default heap size. But when I load
my own dataset (10,000+), it crashes after the slope-one recommendation. I need
to change to
Coloring is pretty easy in R, which is what I use. I just build a color
map with the right number of indices and use the cluster id to index the
colormap. For grins, I vary the transparency according to how seriously
down-sampled the cluster is. That lets me get a good visual feel for the
Manuel,
If you can blind your data sufficiently to release it publicly, it would
make it much easier to get others to help with this.
On Tue, Nov 29, 2011 at 3:21 AM, Manuel Blechschmidt
manuel.blechschm...@gmx.de wrote:
Hello Anatoliy,
On 29.11.2011, at 10:32, Anatoliy Kats wrote:
Hi,
In general, you have to increase a JVM's heap size if you're running
anything that needs non-trivial memory. I think the default heap size is
still 32M or 64M, which is quite small for these purposes. So I am not
surprised if you must increase the heap size, in general.
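
As a quick check, here is a minimal sketch (the class name `HeapCheck` is made up for illustration) that prints the maximum heap the running JVM is allowed to use. It can help confirm that a setting such as `-Xmx2048m` actually took effect.

```java
// Hypothetical helper for illustration only.
final class HeapCheck {

    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as `java -Xmx2048m HeapCheck` should report a value close to 2048; the exact number can differ slightly by JVM.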
It is still surprising to
Thanks for the advice. I will look into it.
Another question: taste-web can support Cassandra without a major rewrite,
right?
On 29/11/2011 10:32 PM, Sean Owen wrote:
like your input is actually not
what you think it is, or something else you're doing is consuming a great
deal of memory. I
You can stick CassandraDataModel into your non-distributed recommender,
yes. It is still going to cache the Cassandra data in memory -- even
reading out of a fast Cassandra cluster is too slow for this kind of
intense access pattern -- but yes it will read just fine.
On Tue, Nov 29, 2011 at 2:37
Hi Ted,
I agree with you. I would love to release it.
Unfortunately it is not my data, so I cannot just release it to the public,
not even anonymized. If someone is willing to contribute new algorithms I can
release anonymized data sets on a personal basis.
The problem is that there are
The deanonymization attacks depend on some aspect of the data being related
to real-world events or products. The attack on the Netflix data depended
on the movies being identified so that ratings could be correlated to
ratings on other systems.
If you blind product IDs and user IDs then none
I find this taxonomy excessive and over-done. The distinctions I find
useful include
- continuous variables
- discrete variables with a known set of values (I call these categorical,
usually). This includes ordinal variables since ordering rarely makes a
lot of difference.
- discrete
Hi Manuel,
Thank you for the reference. I am just testing the waters for now,
trying to find out what's available. I should have a use case in a
couple of weeks. I'll reread what's said here then, and continue the
thread.
Cheers,
Anatoliy
On 11/29/2011 03:21 PM, Manuel Blechschmidt
On 29 November 2011 16:11, Ted Dunning ted.dunn...@gmail.com wrote:
The deanonymization attacks depend on some aspect of the data being related
to real-world events or products. The attack on the netflix data depended
on the movies being identified so that ratings could be correlated to
I'm still learning R, do you have code handy you could share?
On Nov 29, 2011, at 6:25 AM, Ted Dunning wrote:
Coloring is pretty easy in R, which is what I use. I just build a color
map with the right number of indices and use the cluster id to index the
colormap. For grins, I vary the
Hi,
I brought up this question in dev a few weeks ago. I have a
recommendation algorithm that learns the similarity matrix relying on
both current items, and expired ones that should not be recommended.
However, AverageAbsoluteDifferenceRecommenderEvaluator compares the
predicted and
It seems a bit surprising that there is no method along these lines in
DataModel (or some subclass thereof):
setPreference(long userID, long itemID, float value, long time)
Am I just overlooking something? Are you always expected to just go in
under the covers of some DataModel's
Hi Jamey,
On 29.11.2011, at 18:32, Jamey Wood wrote:
It seems a bit surprising that there is no method along these lines in
DataModel (or some subclass thereof):
setPreference(long userID, long itemID, float value, long time)
Am I just overlooking something?
Yes you do, here you go:
Thanks for the response, Manuel. But what I'm asking about here is a way
to _set_ a time when storing a preference into a DataModel (i.e. control
the value that'll subsequently be returned by the getPreferenceTime method).
Thanks,
Jamey
On Tue, Nov 29, 2011 at 10:41 AM, Manuel Blechschmidt
I think the idea is that this method always sets the time to now, since it's
called now, whenever now is. Historical data is ingested differently,
through files or databases, and these can specify a time for each datum.
Does that address what you need to do?
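
To make the distinction concrete, here is a minimal sketch of a store that offers both variants. Everything in it is hypothetical (`TimedPreferenceStore`, its nested class, and the string key scheme); it is not Mahout's DataModel and only mirrors the method names used in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store for illustration only; not part of Mahout.
final class TimedPreferenceStore {

    private static final class TimedPreference {
        final float value;
        final long timestampMillis;

        TimedPreference(float value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    private final Map<String, TimedPreference> prefs = new HashMap<>();

    // Live update: the preference is stamped with the current time.
    void setPreference(long userID, long itemID, float value) {
        setPreference(userID, itemID, value, System.currentTimeMillis());
    }

    // Historical ingest: the caller supplies the time for the datum.
    void setPreference(long userID, long itemID, float value, long time) {
        prefs.put(key(userID, itemID), new TimedPreference(value, time));
    }

    // Returns the stored time in milliseconds, or null if no preference exists.
    Long getPreferenceTime(long userID, long itemID) {
        TimedPreference p = prefs.get(key(userID, itemID));
        return p == null ? null : p.timestampMillis;
    }

    private static String key(long userID, long itemID) {
        return userID + ":" + itemID;
    }

    public static void main(String[] args) {
        TimedPreferenceStore store = new TimedPreferenceStore();
        store.setPreference(1L, 42L, 4.0f, 1322559120000L); // explicit historical time
        System.out.println(store.getPreferenceTime(1L, 42L));
    }
}
```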
On Tue, Nov 29, 2011 at 5:50 PM, Jamey
Hi Jamey,
On 29.11.2011, at 18:50, Jamey Wood wrote:
Thanks for the response, Manuel. But what I'm asking about here is a way
to _set_ a time when storing a preference into a DataModel (i.e. control
the value that'll subsequently be returned by the getPreferenceTime method).
Actually as far
On Tuesday, 29 November 2011, 10:32:39, Anatoliy Kats wrote:
Hello,
There was a conversation some time ago about incorporating time
dependency for preferences:
http://thread.gmane.org/gmane.comp.apache.mahout.user/2951
Has there been any more discussion about this? Has anything been
The organization meeting for Austin SIGKDD was an outstanding success.
Seventeen people attended the meeting. Everyone was very interested in
furthering their professional skills and starting a weekly hackers dojo. The
focus of the group will be on big data machine learning. For the initial
Ah wow, thanks for that list. I will take a look at some of those
within the next couple of weeks.
On 11/30/2011 02:12 AM, Christoph Hermann wrote:
On Tuesday, 29 November 2011, 10:32:39, Anatoliy Kats wrote:
Hello,
There was a conversation some time ago about incorporating time
Hi Sean,
OK, I understand, thanks. I am working with Boolean data for the time
being, so I'm using the IRStatsEvaluator. But I'll revisit the issue if
and when I go back to integer preferences.
On 11/29/2011 08:19 PM, Sean Owen wrote:
The recommendation process ends with steps:
1.