Re: Data class taxonomy for machine learning

2011-11-29 Thread Konstantin Shmakov
The definition of atom is missing (at least on the page referred to); is it the basic piece of information? It also seems that numeric is continuous (temperature, financial data) and categorical and ordinal are discrete (words, ratings). As such, all these data types would be more naturally categorized

Re: ItemSimilarityJob's results differ from non-distributed version

2011-11-29 Thread Sebastian Schelter
Hi Greg, Thank you for your time debugging this! Maybe we should simply make TanimotoCoefficientSimilarity return Double.NaN in case of no overlap? --sebastian On 29.11.2011 06:28, Greg H wrote: Sorry for taking so long to reply but I think I found where the problem is. After comparing the

Time-based preferences for recommendation

2011-11-29 Thread Anatoliy Kats
Hi, There was a conversation some time ago about incorporating time dependency for preferences: http://thread.gmane.org/gmane.comp.apache.mahout.user/2951 Has there been any more discussion about this? Has anything been checked into Mahout? Is anyone working on it? I might be able to

RE: Relevance score - Classification

2011-11-29 Thread Faizan(Aroha)
We are still in the process of resolving this features-and-weights problem. I think you normally convert documents based on features once you have a dictionary of features defined. In the case of an apple, we need to define the size, weight, and color, just as in the case of geography you have country, city,

Re: ItemSimilarityJob's results differ from non-distributed version

2011-11-29 Thread Sean Owen
Yeah the non-distributed implementation returns NaN in this case, which is a bit of an abuse, since it is defined to be 0. In practice I have always thought that is the right thing to do for consistency with other implementations, where no overlap means undefined similarity. You could argue it
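
To make the two conventions discussed here concrete, the following is a minimal sketch of a Tanimoto coefficient over two sets of user IDs. It is not Mahout's implementation; the function name `tanimoto` and the flag `nan_on_no_overlap` are hypothetical and exist only for illustration.

```python
import math


def tanimoto(users_a, users_b, nan_on_no_overlap=True):
    """Tanimoto (Jaccard) coefficient of two collections of user IDs.

    Hypothetical helper for illustration only, not Mahout code.
    With no overlap the mathematical value is 0; the flag switches
    between returning that 0 and returning NaN ("undefined").
    """
    a, b = set(users_a), set(users_b)
    intersection = len(a & b)
    if intersection == 0:
        # No user has interacted with both items.
        return math.nan if nan_on_no_overlap else 0.0
    return intersection / len(a | b)
```

With `nan_on_no_overlap=True` the sketch follows the "no overlap means undefined similarity" convention described in this message; with `False` it returns the value the coefficient is defined to have.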

Re: Time-based preferences for recommendation

2011-11-29 Thread Manuel Blechschmidt
Hello Anatoliy, On 29.11.2011, at 10:32, Anatoliy Kats wrote: Hi, There was a conversation some time ago about incorporating time dependency for preferences: http://thread.gmane.org/gmane.comp.apache.mahout.user/2951 Has there been any more discussion about this? Has anything been

Clustering graph coloring and layout

2011-11-29 Thread Grant Ingersoll
Anyone have an easy algorithm for coloring clusters in a nice way? That is, given k clusters, color each centroid and all of its associated points in such a way that it is visually appealing and avoids, to the extent it can, coloring two unique clusters the same color. Also, the same goes

Re: ItemSimilarityJob's results differ from non-distributed version

2011-11-29 Thread Sebastian Schelter
On 29.11.2011 11:52, Sean Owen wrote: Yeah the non-distributed implementation returns NaN in this case, which is a bit of an abuse, since it is defined to be 0. In practice I have always thought that is the right thing to do for consistency with other implementations, where no overlap means

Re: Using Brisk with Mahout

2011-11-29 Thread Tan Shern Shiou
Currently I am running into a problem. My taste-web runs fine with GroupLens 1M without changing the MAVEN_OPTS heap size. However, my own data set, which has only 1200++ users with 1++ ratings, only runs if I change the heap size to 2048. Any suggestion to solve this

Re: Using Brisk with Mahout

2011-11-29 Thread Sean Owen
This doesn't make sense -- it runs with 1M ratings, but not with 1? Are you sure you have your numbers right? 1 is a tiny data set. Heap size should not be 2048, but something like 2048M. On Tue, Nov 29, 2011 at 2:03 PM, Tan Shern Shiou shernshiou@mnc.com.my wrote: Currently I am

Re: Using Brisk with Mahout

2011-11-29 Thread Tan Shern Shiou
I am sorry if I didn't make myself clear. I have 2 datasets here. 1. GroupLens 1M 2. My own with 10,000 ratings (very small test data) The taste-web runs fine with the default heap size. But when I load my own dataset (10,000+), it crashes after the slope-one recommendation. I need to change to

Re: Clustering graph coloring and layout

2011-11-29 Thread Ted Dunning
Coloring is pretty easy in R, which is what I use. I just build a color map with the right number of indices and use the cluster id to index the colormap. For grins, I vary the transparency according to how seriously down-sampled the cluster is. That lets me get a good visual feel for the
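
The approach described here (a color map indexed by cluster ID, with transparency tied to down-sampling) can be sketched outside R as well. The following is a rough Python illustration, not the poster's R code. The names `cluster_colors` and `point_color` are hypothetical, and the direction of the transparency mapping is an assumption: here, points from heavily down-sampled clusters are drawn more opaque, since each one stands for more original points.

```python
import colorsys


def cluster_colors(k, saturation=0.65, value=0.9):
    """Return k RGB tuples (components in 0..1) with evenly spaced hues."""
    if k <= 0:
        return []
    return [colorsys.hsv_to_rgb(i / k, saturation, value) for i in range(k)]


def point_color(cluster_id, palette, kept, total, min_alpha=0.2):
    """RGBA color for a point of the given cluster.

    kept / total is the fraction of the cluster that survived
    down-sampling. Assumption: the smaller that fraction, the more
    opaque the point, from min_alpha (nothing dropped) up to 1.0.
    """
    r, g, b = palette[cluster_id]
    fraction_kept = kept / total if total else 1.0
    alpha = min_alpha + (1.0 - min_alpha) * (1.0 - fraction_kept)
    return (r, g, b, alpha)
```

Evenly spaced hues keep neighboring cluster IDs visually distinct for small k; for large k, adjacent hues become hard to tell apart and a hand-picked palette may work better.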

Re: Time-based preferences for recommendation

2011-11-29 Thread Ted Dunning
Manuel, If you can blind your data sufficiently to release it publicly, it would make it much easier to get others to help with this. On Tue, Nov 29, 2011 at 3:21 AM, Manuel Blechschmidt manuel.blechschm...@gmx.de wrote: Hello Anatoliy, On 29.11.2011, at 10:32, Anatoliy Kats wrote: Hi,

Re: Using Brisk with Mahout

2011-11-29 Thread Sean Owen
In general, you have to increase a JVM's heap size if you're running anything that needs non-trivial memory. I think the default heap size is still 32M or 64M, which is quite small for these purposes. So I am not surprised if you must increase the heap size, in general. It is still surprising to

Re: Using Brisk with Mahout

2011-11-29 Thread Tan Shern Shiou
Thanks for the advice. I will look into it. Another question: taste-web can support Cassandra without a major rewrite, right? On 29/11/2011 10:32 PM, Sean Owen wrote: like your input is actually not what you think it is, or something else you're doing is consuming a great deal of memory. I

Re: Using Brisk with Mahout

2011-11-29 Thread Sean Owen
You can stick CassandraDataModel into your non-distributed recommender, yes. It is still going to cache the Cassandra data in memory -- even reading out of a fast Cassandra cluster is too slow for this kind of intense access pattern -- but yes it will read just fine. On Tue, Nov 29, 2011 at 2:37

Re: Time-based preferences for recommendation

2011-11-29 Thread Manuel Blechschmidt
Hi Ted, I agree with you. I would love to release it. Unfortunately, it is not my data, so I cannot just release it to the public, not even anonymized. If someone is willing to contribute new algorithms, I can release anonymized data sets on a personal basis. The problem is that there are

Re: Time-based preferences for recommendation

2011-11-29 Thread Ted Dunning
The deanonymization attacks depend on some aspect of the data being related to real-world events or products. The attack on the Netflix data depended on the movies being identified so that ratings could be correlated to ratings on other systems. If you blind product IDs and user IDs then none
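
As a rough illustration of what blinding both product and user IDs could look like, the sketch below replaces every ID with a randomly assigned integer and throws the mapping away. The function name `blind_ids` and the `(user, item, value)` tuple layout are assumptions made for this example; whether such blinding is sufficient for a public release is a separate question that this sketch does not settle.

```python
import random


def blind_ids(ratings, seed=None):
    """Replace user and item IDs in (user, item, value) tuples with
    randomly assigned integers.

    Hypothetical helper for illustration. The mapping is never
    returned, so the output alone does not reveal the original IDs.
    """
    rng = random.Random(seed)
    users = sorted({user for user, _, _ in ratings})
    items = sorted({item for _, item, _ in ratings})

    # Shuffle the replacement codes so they carry no ordering information.
    user_codes = list(range(len(users)))
    item_codes = list(range(len(items)))
    rng.shuffle(user_codes)
    rng.shuffle(item_codes)

    user_map = dict(zip(users, user_codes))
    item_map = dict(zip(items, item_codes))
    return [(user_map[u], item_map[i], v) for u, i, v in ratings]
```

The co-occurrence structure (who rated what, and how) is preserved, which is what a recommender needs; only the link from IDs back to real users and products is removed.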

Re: Data class taxonomy for machine learning

2011-11-29 Thread Ted Dunning
I find this taxonomy excessive and over-done. The distinctions I find useful include - continuous variables - discrete variables with a known set of values (I call these categorical, usually). This includes ordinal variables since ordering rarely makes a lot of difference. - discrete

Re: Time-based preferences for recommendation

2011-11-29 Thread Anatoliy Kats
Hi Manuel, Thank you for the reference. I am just testing the waters for now, trying to find out what's available. I should have a use case in a couple of weeks. I'll reread what's said here then, and continue the thread. Cheers, Anatoliy On 11/29/2011 03:21 PM, Manuel Blechschmidt

Re: Time-based preferences for recommendation

2011-11-29 Thread Dan Brickley
On 29 November 2011 16:11, Ted Dunning ted.dunn...@gmail.com wrote: The deanonymization attacks depend on some aspect of the data being related to real-world events or products. The attack on the Netflix data depended on the movies being identified so that ratings could be correlated to

Re: Clustering graph coloring and layout

2011-11-29 Thread Grant Ingersoll
I'm still learning R, do you have code handy you could share? On Nov 29, 2011, at 6:25 AM, Ted Dunning wrote: Coloring is pretty easy in R, which is what I use. I just build a color map with the right number of indices and use the cluster id to index the colormap. For grins, I vary the

Evaluating recommendations with expired items

2011-11-29 Thread Anatoliy Kats
Hi, I brought up this question in dev a few weeks ago. I have a recommendation algorithm that learns the similarity matrix relying on both current items, and expired ones that should not be recommended. However, AverageAbsoluteDifferenceRecommenderEvaluator compares the predicted and

Including a timestamp when setting preferences

2011-11-29 Thread Jamey Wood
It seems a bit surprising that there is no method along these lines in DataModel (or some subclass thereof): setPreference(long userID, long itemID, float value, long time) Am I just overlooking something? Are you always expected to just go in under the covers of some DataModel's

Re: Including a timestamp when setting preferences

2011-11-29 Thread Manuel Blechschmidt
Hi Jamey, On 29.11.2011, at 18:32, Jamey Wood wrote: It seems a bit surprising that there is no method along these lines in DataModel (or some subclass thereof): setPreference(long userID, long itemID, float value, long time) Am I just overlooking something? Yes, you are. Here you go:

Re: Including a timestamp when setting preferences

2011-11-29 Thread Jamey Wood
Thanks for the response, Manuel. But what I'm asking about here is a way to _set_ a time when storing a preference into a DataModel (i.e. control the value that'll subsequently be returned by the getPreferenceTime method). Thanks, Jamey On Tue, Nov 29, 2011 at 10:41 AM, Manuel Blechschmidt

Re: Including a timestamp when setting preferences

2011-11-29 Thread Sean Owen
I think the idea is that this method always sets the time to now, since it's called now, whenever now is. Historical data is ingested differently, through files or databases, and these can specify a time for each datum. Does that address what you need to do? On Tue, Nov 29, 2011 at 5:50 PM, Jamey
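
The split described in this message (live updates stamped with the current time, historical data carrying its own timestamps) can be sketched with a toy in-memory model. This is not Mahout's DataModel; the class `TimestampedPreferences` and its methods are hypothetical stand-ins that only mirror the names used in the thread.

```python
import time


class TimestampedPreferences:
    """Toy preference store, for illustration only."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so the time source can be tested
        self._prefs = {}     # (user_id, item_id) -> (value, timestamp_ms)

    def set_preference(self, user_id, item_id, value):
        # Live update: the caller cannot choose the time, it is always "now".
        now_ms = int(self._clock() * 1000)
        self._prefs[(user_id, item_id)] = (value, now_ms)

    def load_history(self, rows):
        # Bulk ingestion: each row carries its own timestamp.
        for user_id, item_id, value, timestamp_ms in rows:
            self._prefs[(user_id, item_id)] = (value, timestamp_ms)

    def get_preference_time(self, user_id, item_id):
        entry = self._prefs.get((user_id, item_id))
        return None if entry is None else entry[1]
```

Under this design, controlling the stored time means going through the ingestion path rather than the live setter, which is the trade-off the question in this thread is about.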

Re: Including a timestamp when setting preferences

2011-11-29 Thread Manuel Blechschmidt
Hi Jamey, On 29.11.2011, at 18:50, Jamey Wood wrote: Thanks for the response, Manuel. But what I'm asking about here is a way to _set_ a time when storing a preference into a DataModel (i.e. control the value that'll subsequently be returned by the getPreferenceTime method). Actually as far

Re: Time-based preferences for recommendation

2011-11-29 Thread Christoph Hermann
On Tuesday, 29 November 2011, 10:32:39, Anatoliy Kats wrote: Hello, There was a conversation some time ago about incorporating time dependency for preferences: http://thread.gmane.org/gmane.comp.apache.mahout.user/2951 Has there been any more discussion about this? Has anything been

Successful Organization Meeting for Austin SIGKDD

2011-11-29 Thread David Boney
The organization meeting for Austin SIGKDD was an outstanding success. Seventeen people attended the meeting. Everyone was very interested in furthering their professional skills and starting a weekly hackers dojo. The focus of the group will be on big data machine learning. For the initial

Re: Time-based preferences for recommendation

2011-11-29 Thread Anatoliy Kats
Ah wow, thanks for that list. I will take a look at some of those within the next couple of weeks. On 11/30/2011 02:12 AM, Christoph Hermann wrote: On Tuesday, 29 November 2011, 10:32:39, Anatoliy Kats wrote: Hello, There was a conversation some time ago about incorporating time

Re: Evaluating recommendations with expired items

2011-11-29 Thread Anatoliy Kats
Hi Sean, OK, I understand, thanks. I am working with Boolean data for the time being, so I'm using the IRStatsEvaluator. But I'll revisit the issue if and when I go back to integer preferences. On 11/29/2011 08:19 PM, Sean Owen wrote: The recommendation process ends with steps: 1.