There was a variant of Cholesky decomposition in Mahout at one time not so
long ago. I would guess that it is still there.
It is difficult to make a truly distributed version of QR decomposition,
but for the purposes of the randomized SVD in Mahout, a true QR wasn't
actually necessary.
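For illustration, the essence of a randomized SVD can be sketched in a few lines of NumPy. This is a generic sketch of the technique, not Mahout's SSVD code; the function name and parameters are my own. The key point is that any orthonormal basis for the projected columns suffices, so a strict distributed QR can be avoided:

```python
import numpy as np

def randomized_svd(A, k, n_power=2, oversample=10, seed=0):
    """Sketch of randomized SVD: compress A into a small subspace found
    by random projection, then take an exact SVD of the compression."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random test matrix; oversampling improves the captured subspace.
    omega = rng.standard_normal((n, k + oversample))
    Y = A @ omega
    # Optional power iterations sharpen slowly decaying spectra.
    for _ in range(n_power):
        Y = A @ (A.T @ Y)
    # Any orthonormal basis for range(Y) will do; it need not come from
    # a single distributed QR factorization.
    Q, _ = np.linalg.qr(Y)
    B = Q.T @ A                      # small (k + oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]
```

Power iterations (`n_power`) trade a few extra passes over the data for substantially better accuracy when the spectrum decays slowly.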
hows and products; it also helps that there is traffic driven
from external sources.
>
> Thanks for the detailed hints - now it's time to see what comes out of
> this.
>
> Johannes
>
> On Sun, Nov 12, 2017 at 7:52 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote
years of batch :)
>
> Thanks for your thoughts, I am happy I can rule something out given the
> domain (Poisson LLR). Luckily the domain I'm working on is event
> recommendations, so there is a natural deterministic item expiry (as
> compared to Christmas-like seasonal items).
>
> Aga
rsonalized but would yield “hot in
> Greece”
>
I think that this is a good approach.
>
> Ted’s “Christmas video” tag is what I was calling a business rule and can
> be added to either of the above techniques.
>
But the (not) hotness feature might help with automating this.
On Sat, May 6, 2017 at 2:43 PM, Scott C. Cote wrote:
> Will you be wearing “one of those t-shirts” on Monday in Houston :) ?
>
Not likely.
It is in the archive.
ng old Mahout
> > > color palette if one were to dab their brush in the appropriate
> > > colors. This could also be represented in any single color. (Not sure
> > > what that does to our TM, is it ok i
>
> On Thu, Apr 27, 2017 at 3:48 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> > Do you have constructive input (guidance or opinion is welcome input) or
> > would you like to discontinue the contest. If the latter, -1 now.
> On Apr 27, 2017, at 3:42 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> I thought that none of the proposals were worth continuing with.
>
>
On Thu, Apr 27, 2017 at 3:36 PM, Pat Ferrel wrote:
> Yes, -1 means you hate them all or think the designers are not worth
> paying. We have to pay to continue, I’ll foot the bill (donations
>
> Regards,
> Arun
>
> On 2 April 2017 at 11:59, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Arun,
> >
> > That's good news.
> >
> > The second limitation will be how much data you have for each document
> and
> > whether you have a good measure
>
> It won't be a problem for me to use the LAN path for configurations and
> index. I can use the larger document base.
>
> Thanks and Regards,
> Arun
>
> On 2 April 2017 at 07:00, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
On Sat, Apr 1, 2017 at 6:21 PM, arun abraham
wrote:
> As a first step I am trying to recommend a minimum of two documents (as my
> Solr document index is ~100 docs).
>
This is kind of weird.
Can you say why you have so very few documents?
There may be something special
On Fri, Mar 24, 2017 at 8:27 AM, Pat Ferrel wrote:
> maybe we should drop the name Mahout altogether.
I have been told that there is a cool secondary interpretation of Mahout as
well.
I think that the Hebrew word is pronounced roughly like Mahout.
מַהוּת
The cool
From my perspective, the state of the art of machine learning is with
systems like TensorFlow and DL4J. If you can deal with the limits of a
non-clustered GPU system, then Theano and Caffe are very useful. Keras
papers over the differences between back-ends nicely.
Tensorflow and Theano
This actually sounds like a very small problem.
My guess is that there are bad settings for the interaction and frequency
cuts.
On Thu, Jun 23, 2016 at 11:07 AM, Pat Ferrel wrote:
> In addition to increasing downsampling there are some other things to
> note. The
On Sat, Jun 4, 2016 at 10:14 AM, forme book wrote:
> Lucene already has this implementation by default; what I
> struggle to understand is the advantage of having lucene.vector in
> Mahout when Lucene offers that feature out of the box?
>
> Maybe I'm
It just means that there is an association. Causation is much more
difficult to ascertain.
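For reference, the LLR (G²) association test being discussed can be computed directly from a 2x2 table of counts. This is an illustrative sketch of the statistic, not the Mahout source; the names are mine:

```python
import math

def entropy(*counts):
    """Unnormalized Shannon entropy: N*log(N) - sum(k*log(k))."""
    total = sum(counts)
    return total * math.log(total) - sum(k * math.log(k) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 cooccurrence table:
    k11 = both events, k12/k21 = one without the other, k22 = neither.
    Near zero for independent events; large for strong association."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)
```

Note that, as said above, a large LLR only flags an association; it says nothing about causation.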
On Wed, May 4, 2016 at 6:06 AM, Nikaash Puri wrote:
> Hi,
>
> Just wanted to clarify a small doubt. On running LLR with primary
> indicator as view and secondary indicator as
Mahout is considerably better at sparse operations and optimizations than
dense ones.
Beyond that, I would expect that you would do better with traditional math
libraries.
And, are you really trying to invert a matrix? The common maxim is that
this implies an error in your method because
On Thu, Feb 25, 2016 at 6:52 AM, wrote:
> Thank you for your answer
> What other tools you advise me to use?
> Do you recommend Rhadoop?
>
Try H2O instead. Good R interface. Decent model building.
See here:
https://ssc.io/pdf/rec11-schelter.pdf
On Fri, Feb 19, 2016 at 3:16 AM, Lee S wrote:
> Hi:
> Does anybody know which paper the MR algorithm is based on?
>
Did you want textual similarity?
Or semantic similarity?
The actual semantics of a message can be opaque from the content, but clear
from the usage.
On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl wrote:
> David,
> LDA or LSI can work quite nicely for similarity (YMMV of
On Sun, Nov 29, 2015 at 9:36 PM, Niklas Ekvall
wrote:
> My conclusion is that recommenditembased in Mahout works better for ratings
> than binary data. What are your conclusions?
>
Still operator error somewhere. Binary data works much better as a real
recommender.
There are a few problems that you have.
1) user-based recommendation is often slower than item-based (sometimes
MUCH slower). This can make a 2-10x difference in practice
2) pre-computing recommendations is usually much less efficient than
computing them on the fly (because typically few users
No. Not entirely surprising, but it is *really* nice to get some public
results on this.
The treatment of the negatives as a separate cross term instead of just
lumping them together is a very significant difference.
On Tue, Nov 3, 2015 at 3:42 PM, Peter Jaumann
On Tue, Nov 3, 2015 at 3:20 PM, Pat Ferrel wrote:
> For the strict out there we did not directly isolate the two actions,
> which is work remaining so some of the lift might be due to just having
> more data but it’s a really good first step because more data doesn't
>
om SVD right ? thanks, canal
>
>
> On Monday, October 5, 2015 2:25 PM, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>
> That isn't enough detail.
>
> How do you mean to compute degrees of freedom? Why do you need the inverse
> to do this?
>
> Where di
ay, October 5, 2015 6:25 AM, Peter Jaumann <
> peter.jauma...@gmail.com> wrote:
>
>
> This should be done with a matrix solver indeed!!!
>
>
>
> On Oct 4, 2015 11:53 AM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
> >
> >
> > It is almos
On Sun, Oct 4, 2015 at 10:32 PM, go canal wrote:
> in fact I need to support both double and complex double for either
> distributed memory based or out-of-core.
Ahh...
Well Mahout doesn't support complex anything. So this isn't going to help
you.
> I will be more than interested to extend to complex double, when the
> solver is ready for double data type. thanks, canal
>
>
> On Monday, October 5, 2015 2:02 PM, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>
> On Sun, Oct 4, 2015 at 10:32
roject requires the
> inversion of a very large matrix. I will have to revert to ScaLAPACK or
> MR-based solutions, I guess.
> thanks, canal
>
>
> On Saturday, October 3, 2015 11:31 PM, Ted Dunning
> <ted.dunn...@gmail.com> wrote:
>
>
I doubt seriously that Samsara will support matrix inversion per se. The
problems are:
a) it densifies sparse matrices
b) it is much more costly than solving a linear system
Samsara is roughly memory based, but different back-ends will try to spill
to disk if necessary. It is likely that the
On Tue, Sep 22, 2015 at 5:51 PM, Ankit Goel wrote:
> What I wanted to do was modify the clustering algorithm, in hopes of
> experimenting with different versions of it. I'm not much hung up on the MR
> part of things, rather the clustering algo itself.
>
Have at it.
My own feeling is that the right answer is to look at average squared
distance on your training data and on held-out data.
As long as these values are nearly the same, you likely have a value of k
smaller than (or equal to) the optimal one. When the average squared
distance is significantly less on the
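The train/held-out comparison described above can be sketched with a toy Lloyd's k-means. This is illustrative only; the function names are assumptions, not Mahout APIs:

```python
import numpy as np

def avg_sq_dist(X, centers):
    """Average squared distance from each point to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def init_centers(X, k):
    """Deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    return np.array(centers)

def kmeans(X, k, iters=25):
    """Minimal Lloyd's k-means, just enough to illustrate the check."""
    centers = init_centers(X, k)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers
```

Fit centers on training data, then compare `avg_sq_dist(train, centers)` with `avg_sq_dist(heldout, centers)` for increasing k; once the training value becomes clearly smaller than the held-out value, k has started to overfit.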
The most central point in a cluster is often referred to as a medoid
(similar to median, but multi-dimensional).
The Mahout code does not compute medoids. In general, they are difficult
to compute and implementing a full k-medoid clustering algorithm even more
so.
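Although Mahout does not compute medoids, extracting the medoid of a cluster you already have is simple, if O(n²) in the cluster size. A minimal sketch (the function name is mine):

```python
import numpy as np

def medoid(points):
    """Return the member of `points` minimizing the sum of Euclidean
    distances to all other members (the multi-dimensional analogue of
    the median mentioned above)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return points[dists.sum(axis=1).argmin()]
```

The hard part alluded to above is the full k-medoids clustering loop (e.g. PAM), not this per-cluster extraction.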
On Mon, Jul 20, 2015 at 6:25
suggestions how I should go about that? So far I'm using
Nutch to crawl, Solr to index, and now I'm here on Mahout.
On Tue, Jul 21, 2015 at 7:10 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
The standard approach is to re-run the off-line learning.
It is possible, though not yet supported in Mahout tools, to do real-time
updates.
See here for some details:
https://www.mapr.com/resources/videos/fully-real-time-recommendation-%E2%80%93-ted-dunning-sf-data-mining
On Fri, Jun 19
The streaming k-means works by building a sketch of the data which is then
used to do real clustering.
It might be that this sketch would be acceptable to do k-medoids, but that
is definitely not guaranteed.
Similarly, it might be possible to build a medoid sketch instead of a mean
based sketch,
Mahout is deprecating pretty much all of the classic MapReduce
implementations in any case in favor of algorithms based fundamentally on a
new linear algebra system known as Mahout-Samsara.
On Fri, May 29, 2015 at 10:52 PM, Punit Naik naik.puni...@gmail.com wrote:
Hello all users
I just
Actually, this is probably done more easily using a simple matrix
multiplication. The reason for not using recommendation code for this is
that your problem is entirely dense.
How exactly you should go about this is a different question. Up to tens
of thousands of stars, you can probably do
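To illustrate the matrix multiplication approach for a fully dense problem: all-pairs cosine similarity of dense feature rows is one normalized matrix product. A generic sketch, not Mahout code:

```python
import numpy as np

def all_pairs_cosine(X):
    """All-pairs cosine similarity of the rows of a dense feature
    matrix, computed with a single matrix multiplication."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T
```

An n x d feature matrix yields an n x n result, so memory is the binding constraint; tens of thousands of rows means on the order of 10^8 entries, which is still feasible on a single machine.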
On Fri, Apr 10, 2015 at 9:34 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
Ah... forgot this.
+1 (binding)
On Fri, Apr 10, 2015 at 11:14 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
I downloaded and tested the signatures and check-sums on {binary,source} x
{zip,tar} + pom. All were correct.
One thing that I worry a little about is that the name of the artifact
doesn't include apache. Not sure that is a hard requirement, but it
seems a good thing to do.
On Fri, Apr 10,
For practical recommendation systems, ratings are almost irrelevant.
Ratings were prominent in the original academic work on recommendations
largely because with the early research systems, users had no recordable
interactions with content other than ratings. The Taste component of
Mahout was
Are you sure that the problem is writing the results? It seems to me that
the real problem is the use of a user-based recommender.
For such a small data set, for instance, a search-based recommender will be
able to make recommendations in less than a millisecond with multiple
recommendations
, 2015 at 2:45 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
Also, if you can include linking information between documents, you
should
be able to substantially improve accuracy. Same goes for behavioral data
like browsing history.
On Thu, Mar 26, 2015 at 6:10 AM, Hersheeta
Lanczos may be more accurate than SSVD, but if you use a power step or
three, this difference goes away as well.
The best way to select k is actually to pick a value k_max larger than you
expect to need and then pick random vectors instead of singular vectors.
To evaluate how many singular
in text format.
Destination IP address is not implicit; in fact it's in the second row and
is a server.
Kindly suggest how I can do the k-means clustering with respect to the
timestamp, or is there a better way?
Regards, Raghuveer
On Thursday, March 26, 2015 6:34 AM, Ted Dunning
ted.dunn...@gmail.com wrote
On Thu, Mar 26, 2015 at 6:10 AM, Hersheeta Chandankar
hersheetachandan...@gmail.com wrote:
Thank you so much Chirag and
helpful
if you can show me a sample for this issue. Kindly suggest.
Thanks,
Raghuveer
On Tuesday, February 17, 2015 12:24 AM, Ted Dunning
ted.dunn...@gmail.com wrote:
Please take questions like this to the Mahout mailing list.
I really prefer to answer these questions in public
Glad to help.
You can help us by reporting your results when you get them.
We look forward to that!
On Tue, Mar 10, 2015 at 4:22 AM, Efi Koulouri ekoulou...@gmail.com wrote:
Things got clearer with your help!
Thank you very much
On 9 March 2015 at 01:50, Ted Dunning ted.dunn
interesting but in my case I
think that building the recommender using the Java classes is more
appropriate as I need to use both approaches (post-filtering and
pre-filtering). Am I right?
On 8 March 2015 at 16:08, Ted Dunning ted.dunn...@gmail.com wrote:
By far the easiest way to build a recommender (especially for production)
is to use the search engine approach (what Pat was recommending).
Post-filtering can be done using the search engine far more easily than
using Java classes.
On Sat, Mar 7, 2015 at 8:44 AM, Pat Ferrel
On Sat, Mar 7, 2015 at 3:05 AM, Tevfik Aytekin tevfik.ayte...@gmail.com
wrote:
There can be two solutions:
1. There should be a parameter n, which determines the minimum number
of common ratings needed to compute a similarity otherwise the system
should return NaN.
2. The similarity should
The terms “main” and “secondary” are a bit confusing.
The easiest definition is that cooccurrence analyzes the record of actions
you want to recommend. Cross-occurrence tries to transfer from one behavior
to another.
In practice, it has been common to conflate many behaviors into one precisely
On Sat, Feb 14, 2015 at 6:05 AM, Eugenio Tacchini
eugenio.tacch...@gmail.com wrote:
Hi Pat, I don't understand why it is not a Mahout problem, my goal is to
evaluate (RMSE) the output of a user based algorithm comparing different
user similarity measures, Mahout already has everything I need
We haven't had anyone volunteer as a mentor this year as far as I know.
On Sun, Feb 15, 2015 at 12:36 PM, Prasad Priyadarshana Fernando
bpp...@gmail.com wrote:
Hi,
I am interested in doing a project on recommender system framework for GSOC
2015. Can somebody tell me whether Apache is
On Fri, Feb 13, 2015 at 9:37 AM, Eugenio Tacchini
eugenio.tacch...@gmail.com wrote:
If I need to use a classical user-based technique, however, the only
alternative is the Taste-oriented code, am I right?
Right.
Still, I can't see how
to perform a prediction for a user/item couple, is
On Fri, Feb 13, 2015 at 11:11 AM, Eugenio Tacchini
eugenio.tacch...@gmail.com wrote:
Is there anyone who can give me some hints about this task?
Another way to look at this is to try to wedge this into the item
similarity code.
There are hooks available in the map-reduce version of item
That is a really old paper that basically pre-dates all of the recent
important work in neural networks.
You should look for works on Rectified Linear Units (ReLU), drop-out
regularization, parameter servers (downpour sgd) and deep learning.
Map-reduce as you have used it will not produce
Juanjo,
Using the Taste components, it will be almost impossible to get really high
performance. For that, using the itemsimilarity program to feed a search
index is the best alternative.
The Scala version of the itemsimilarity program could be called fairly
easily as
The old Taste code is not the state of the art. User-based recommenders
built on that will be slow.
On Thu, Jan 15, 2015 at 7:10 AM, Juanjo Ramos jjar...@gmail.com wrote:
Hi David,
You implement your custom algorithm and create your own class that
implements the UserSimilarity interface.
On Thu, Jan 15, 2015 at 5:23 AM, Miguel Angel Martin junquera
mianmarjun.mailingl...@gmail.com wrote:
My question is:
Is it better to scale up these dimensions directly in the final mixed
tf-idf sequence file using these correction factors, OR first scale up
each of the tf-vectors and
trying to find a scalable solution for my problem, I tried to
fit it in what's already implemented in Mahout (for clustering), but
it's not so obvious to me.
I'm open to suggestions, I'm still new to all of this.
Thanks,
Marko
On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote
The easiest way is to scale those dimensions up.
On Wed, Jan 14, 2015 at 2:41 AM, Miguel Angel Martin junquera
mianmarjun.mailingl...@gmail.com wrote:
hi all,
I am clustering several text documents from distinct sources using k-means,
and I have generated the sparse vectors of each
Have you considered implementing this using something like Spark? That could
be much easier than raw MapReduce.
On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni unmeshab...@gmail.com
wrote:
In a KNN-like algorithm we need to load the model data into cache to predict
the records.
Here is the
On Sat, Jan 10, 2015 at 3:02 AM, Marko Dinic marko.di...@nissatech.com
wrote:
For example, mean of two sinusoids while one of them is shifted by Pi is
0. And that's definitely not a good centroid in my case.
Well, if you think that phase shifts represent small distance proportional
to phase
with others in the cluster (some kind of centroid/medoid).
What do you think about this approach and about the scalability?
I would highly appreciate your answer, thanks.
On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic marko.di
On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic marko.di...@nissatech.com
wrote:
1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout that
could be used as a distance measure for clustering?
No.
2) Why isn't there an implementation of k-medoids in Mahout? I'm guessing
that
On Wed, Jan 7, 2015 at 2:20 PM, chirag lakhani chirag.lakh...@gmail.com
wrote:
In the Mahout in Action book I got the impression that the term memo will
seed the random number generator and I wanted to confirm that means I will
have consistency if I deploy this vectorizer in both my Hadoop
On Tue, Dec 23, 2014 at 7:39 AM, AlShater, Hani halsha...@souq.com wrote:
@Ted, it is a 3-node small cluster for POC. The Spark executor is given 2g
and YARN is configured accordingly. I am trying to avoid Spark memory caching.
Have you tried the map-reduce version?
On Tue, Dec 23, 2014 at 9:16 AM, Pat Ferrel p...@occamsmachete.com wrote:
To use the Hadoop MapReduce version (Ted’s suggestion) you’ll lose the
cross-cooccurrence indicators and you’ll have to translate your IDs into
Mahout IDs. This means mapping user and item IDs from your values into
Can you say what kind of cluster you have?
How many machines? How much memory? How much memory is given to Spark?
On Sun, Dec 21, 2014 at 11:44 PM, AlShater, Hani halsha...@souq.com wrote:
Hi All,
I am trying to use spark-itemsimilarity on 160M user interactions dataset.
The job launches
How much data are you going to be collecting? How many users and how many
presentations per user?
Are you saying that the products for each video are completely fixed? Does
the same product appear for more than one video?
Do users interact with products outside of the narrow confines that you
Natalia,
It sounds like you are starting from the assumption that ratings are being
done.
This can happen, but in production recommendation settings, ratings are
typically a very low-value input because the meaning of a rating is very
complex and because so few users actually do ratings unless
On Thu, Dec 4, 2014 at 5:38 AM, Shahid Shaikh shaikhshah...@gmail.com
wrote:
I see the problem is with the way the data is written
What exactly do you mean by this?
etc)
Maybe location, sales per item (similarity might lead to knowledge of people
who share the same purchasing patterns), etc.
On Wed, Dec 3, 2014 at 5:28 PM, Ted Dunning ted.dunn...@gmail.com wrote:
On Wed, Dec 3, 2014 at 6:22 AM, Yash Patel yashpatel1...@gmail.com
wrote:
I have multiple
On Wed, Dec 3, 2014 at 6:22 AM, Yash Patel yashpatel1...@gmail.com wrote:
I have multiple different columns such as category, shipping location, item
price, online user, etc.
How can I use all these different columns to improve recommendation
quality (i.e. calculate more precise similarity
On Sat, Nov 29, 2014 at 8:31 PM, 3316 Chirag Nagpal
chiragnagpal_12...@aitpune.edu.in wrote:
Since density-based clustering algorithms are being utilised extensively,
especially by the GIS research groups, it is a bit sad that there isn't a
MapReduce implementation available.
I think I
'. I think scalability should not be an
issue for a MapReduce implementation.
Chirag Nagpal
University of Pune, India
www.chiragnagpal.com
From: Ted Dunning ted.dunn...@gmail.com
Sent: Sunday, November 30, 2014 6:29 PM
To: user@mahout.apache.org
The error message that you got indicated that some input was textual and
needed to be an integer.
Is there a chance that the type of some of your input is incorrect in your
sequence files?
On Mon, Nov 24, 2014 at 3:47 PM, Ashok Harnal ashokhar...@gmail.com wrote:
Thanks for the reply. I did not
There is no inherent mathematical difference, but there may be some pretty
significant practical differences.
Using the three-matrix form (X = USV') puts the normalization constants
into a place where you can control them a bit more easily. This can be
useful if you want *both* user and item vectors
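To illustrate the point about normalization: the three-matrix form lets you choose how to fold S into the two sides. Splitting it symmetrically puts user and item vectors on comparable scales (a toy NumPy sketch with assumed variable names):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))        # toy user-by-item matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Fold sqrt(S) into each side: both sets of vectors then carry half of
# the scale, and plain dot products still reproduce X.
user_vecs = U * np.sqrt(s)             # one row per user
item_vecs = Vt.T * np.sqrt(s)          # one row per item
recon = user_vecs @ item_vecs.T        # equals U @ diag(s) @ Vt
```

Other splits (all of S on one side, or none) change the relative scaling of the two sets of vectors, which is exactly the control being described.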
Check out H2O.
http://0xdata.com/
On Mon, Nov 10, 2014 at 1:38 AM, zhonghong...@yy.com zhonghong...@yy.com
wrote:
So are there any scalable RBMs available?
I'm going to implement a recommender based on one.
From: Ted Dunning
Date: 2014-11-10 15:34
To: user@mahout.apache.org
Subject: Re
What should the input be?
On Tue, Nov 4, 2014 at 12:28 AM, Lee S sle...@gmail.com wrote:
Hi all:
I'm wondering why the input and output of most algorithms like
kmeans and naivebayes are all sequence files. One more conversion step
needs to be done if we want the algorithm to work. And
I think
in
vector (dense or sparse) format, so a conversion step
needs to be done before algorithms deal with the data. Is that right?
2014-11-04 23:56 GMT+08:00 Ted Dunning ted.dunn...@gmail.com:
process
from scratch or can it be done incrementally?
Best,
Mahesh.B.
On Thu, Oct 23, 2014 at 1:13 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
Yes. Mahout can do this.
Pro: MapR classifiers are pretty easy to integrate because of a simple
API.
Con: The state of the art
. The Python API
uses the standard CPython implementation, and can call into existing C
libraries for Python such as NumPy.
On Thu, Oct 23, 2014 at 1:11 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
Hmmm
I don't think that the array formats used by Spark are compatible
vibhanshugs...@gmail.com
wrote:
Actually, Spark is available in Python also, so users of Spark have an
upper hand over traditional users of Mahout. This applies to all of the
Python libraries (including NumPy).
On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning ted.dunn
Yes. Mahout can do this.
Pro: MapR classifiers are pretty easy to integrate because of a simple API.
Con: The state of the art with MapR classifiers is pretty far behind the
rest-of-the-world state of the art.
On Tue, Oct 21, 2014 at 5:26 PM, Si Chen sic...@opensourcestrategies.com
wrote:
On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel p...@occamsmachete.com wrote:
The problem is not in building Spark it is in building Mahout using the
correct Spark jars. If you are using CDH and hadoop 2 the correct jars are
in the repos.
This should be true for MapR as well.
On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija balijamahesh@gmail.com
wrote:
I am trying to differentiate between Mahout and Spark; here is the small
list:
Feature                   Mahout  Spark
Clustering                Y       Y
Classification            Y       Y
Regression                Y       Y
Dimensionality Reduction  Y       Y
Java                      Y       Y
Scala                     N       Y
On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel p...@occamsmachete.com wrote:
Getting off the dubious Spark 1.0.1 version is turning out to be a bit of
work. Does anyone object to upgrading our Spark dependency? I’m not sure if
Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean
On Mon, Oct 13, 2014 at 11:56 AM, Reinis Vicups mah...@orbit-x.de wrote:
I have my own implementation of SimilarityAnalysis and by tuning number of
tasks I have reached HUGE performance gains.
Since I couldn't find how to pass the number of tasks to shuffle
operations directly, I have set
On Mon, Oct 13, 2014 at 12:32 PM, Reinis Vicups mah...@orbit-x.de wrote:
Do you think that simply increasing this parameter is a safe and sane
thing
to do?
Why would it be unsafe?
In my own implementation I am using 400 tasks on my 4-node-2cpu cluster
and the execution times of largest
me some pointers on how I can apply it in this
setting?
Thanks,
Rohit
On Tue, Sep 30, 2014 at 4:33 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
This is an incredibly tiny dataset. If you delete singletons, it is
likely
to get significantly smaller.
I think that something like LDA
I'll try to share something if I succeed.
Arian Pasquali
http://about.me/arianpasquali
2014-09-24 5:12 GMT+01:00 Suneel Marthi suneel.mar...@gmail.com:
Lucene 4.x supports Okapi BM25, so it should be easy to implement.
On Tue, Sep 23, 2014 at 11:57 PM, Ted Dunning ted.dunn
On Wed, Oct 1, 2014 at 7:52 AM, Arian Pasquali ar...@arianpasquali.com
wrote:
My dataset is a collection of documents in German and I can say that the
scores seem better compared to my TF-IDF scores. Results make more sense
now, especially my bi-grams.
OK.
I will take note.
I would recommend that you look at actions other than ratings as well.
Did a user expand and read 1 review? did they read 3 reviews?
Did they mark a rating as useful?
Did they ask for contact information?
You know your system better than I possibly could, but using other
information in
How are you using LLR to compute user similarity? It is normally used to
compute item similarity.
Also, what is your scale? How many users? How many items? How many
actions per user?
On Mon, Sep 29, 2014 at 6:24 PM, Parimi Rohit rohit.par...@gmail.com
wrote:
Hi,
I am exploring a
again!
On Sun, Sep 21, 2014 at 10:06 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
On Fri, Sep 19, 2014 at 3:29 AM, mario.al...@gmail.com wrote:
So my question was: shouldn't we consider both the frequency distribution
of item sales *and* of users' purchases in the same formula
Can you say how many words you are seeing?
How many unique bigrams?
As Suneel asked, which version of Mahout?
On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster bu...@collectiveip.com
wrote:
I've been implementing the RowSimilarityJob on our 40-node cluster and have
run into some serious