Re: [ANNOUNCE] Mahout Con 2020 (A sub-track of ApacheCon @ Home)

2020-08-12 Thread Pat Ferrel
Big fun. Thanks for putting this together. I’ll abuse my few Twitter followers with the announcement. From: Trevor Grant Reply: user@mahout.apache.org Date: August 12, 2020 at 5:59:45 AM To: Mahout Dev List , user@mahout.apache.org Subject:  [ANNOUNCE] Mahout Con 2020 (A sub-track of

Users of Scala 2.11

2018-04-24 Thread Pat Ferrel
Hi all, Mahout has hit a bit of a bump in releasing a Scala 2.11 version. I was able to build 0.13.0 for Scala 2.11 and have published it on github as a Maven compatible repo. I’m also using it from SBT. If anyone wants access let me know.

Re: "LLR with time"

2017-11-12 Thread Pat Ferrel
>> Thanks for your thoughts, I am happy I can rule something out given the >> domain (Poisson LLR). Luckily the domain I'm working on is event >> recommendations, so there is a natural deterministic item expiry (as >> compared to Christmas-like stuff). >> >> Again

Re: "LLR with time"

2017-11-11 Thread Pat Ferrel
t) vs itemA(t-1) > .. > > and derive multiple indicators per item to be indexed. > > But this all relies on discretizing time into buckets and not looking at > the distribution of time between events like in presentation above - maybe > there is something way smarter > > Johann

Re: "LLR with time"

2017-11-10 Thread Pat Ferrel
. On Nov 10, 2017, at 4:12 PM, Pat Ferrel <p...@occamsmachete.com> wrote: So your idea is to find anomalies in event frequencies to detect “hot” items? Interesting, maybe Ted will chime in. What I do is take the frequency and its first and second derivatives as measures of popularity, incr

Re: "LLR with time"

2017-11-10 Thread Pat Ferrel
So your idea is to find anomalies in event frequencies to detect “hot” items? Interesting, maybe Ted will chime in. What I do is take the frequency and its first and second derivatives as measures of popularity, increasing popularity, and increasingly increasing popularity. Put another way popular,
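
The three measures above can be sketched with discrete differences over time buckets. This is only an illustration, not code from the thread: the function name `popularity_signals`, the bucketing scheme, and the choice to look at the three most recent buckets are all assumptions.

```python
from collections import Counter


def popularity_signals(event_times, bucket_size):
    """Return (frequency, first difference, second difference) for one item.

    event_times: timestamps of events for a single item (any numeric unit).
    bucket_size: width of a time bucket in the same unit (hypothetical knob).
    The differences are discrete stand-ins for the first and second derivatives.
    """
    if not event_times:
        return 0, 0, 0
    # Count events per time bucket.
    buckets = Counter(int(t // bucket_size) for t in event_times)
    last = max(buckets)
    # Counts for the newest bucket and the two buckets before it.
    f0, f1, f2 = (buckets.get(last - i, 0) for i in range(3))
    velocity = f0 - f1                      # is popularity increasing?
    acceleration = (f0 - f1) - (f1 - f2)    # is the increase itself increasing?
    return f0, velocity, acceleration
```

Ranking items by the first value favors popular items, by the second favors trending items, and by the third favors items whose growth is speeding up.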

Re: Running Mahout on a Spark cluster

2017-10-03 Thread Pat Ferrel
<trevor.d.gr...@gmail.com> wrote: The spark is included via maven classifier- the sbt line should be libraryDependencies += "org.apache.mahout" % "mahout-spark_2.11" % "0.13.1-SNAPSHOT" classifier "spark_2.1" On Tue, Oct 3, 2017 at 2:55 PM, Pat

Re: Running Mahout on a Spark cluster

2017-10-03 Thread Pat Ferrel
Actually if you require scala 2.11 and spark 2.1 you have to use the current master (0.13.0 does not support these) and also can’t use sbt, unless you have some trick I haven’t discovered. On Oct 3, 2017, at 12:55 PM, Pat Ferrel <p...@occamsmachete.com> wrote: I’m the aforementioned p

Re: Running Mahout on a Spark cluster

2017-10-03 Thread Pat Ferrel
I’m the aforementioned pferrel @Hoa, thanks for that reference, I forgot I had that example. First don’t use the Hadoop part of Mahout, it is not supported and will be deprecated. The Spark version of cooccurrence will be supported. You find it in the SimilarityAnalysis object. If you go back

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-30 Thread Pat Ferrel
Matt, I’m interested in following up on this. If you can’t do a PR, can you describe what you did a bit more? On Aug 21, 2017, at 12:05 PM, Pat Ferrel <p...@occamsmachete.com> wrote: Matt I’ll create a feature branch of Mahout in my git repo for simplicity (we are in code freeze for

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
_________ From: Pat Ferrel <p...@occamsmachete.com> Sent: Monday, August 21, 2017 2:26:58 PM To: user@mahout.apache.org Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs) That looks like ancient code from the old mapreduce days.

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
s one thing, but if you have to move the data to CPU and back >> to memory to distribute it around possibly multiple times, you may wind up >> with something much slower than you would have had if you were to attack >> the problem directly. >> >> >> >>

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
through the implications. If I can also test it I have some large real-world data where I can test real-world speedup. On Aug 21, 2017, at 10:53 AM, Pat Ferrel <p...@occamsmachete.com> wrote: Interesting indeed. What is “massive”? Does the change pass all unit tests? On Aug 17, 2017, at 1

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
he CPU and munch on > it a bit it is one thing, but if you have to move the data to CPU and back > to memory to distribute it around possibly multiple times, you may wind up > with something much slower than you would have had if you were to attack > the problem directly. >

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-16 Thread Pat Ferrel
he javadoc on those methods mentions they shouldn't be used unless absolutely necessary due to their O(log n) complexity. Thanks for your time...this is fun stuff! Matt On 8/15/17, 10:15 AM, "Pat Ferrel" <p...@occamsmachete.com> wrote: > Great, this is the best way to use th

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-15 Thread Pat Ferrel
> O(log n) operations I mentioned seem to take >95% of runtime. > > Thanks, > Matt > > From: Pat Ferrel <p...@occamsmachete.com> > Sent: Monday, August 14, 2017 11:02:42 PM > To: user@mahout.apache.org > Subject: Re: spark-itemsimil

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-14 Thread Pat Ferrel
Are you using the CLI? If so it’s likely that there is only one partition of the data. If you use Mahout in the Spark shell or using it as a lib, do a repartition on the input data before passing it into SimilarityAnalysis.cooccurrencesIDSs. I repartition to 4*total cores to start with and set

Re: [DISCUSS] Naming convention for multiple spark/scala combos

2017-07-07 Thread Pat Ferrel
IIRC these all fit sbt’s conventions? On Jul 7, 2017, at 2:05 PM, Trevor Grant wrote: So to tie all of this together- org.apache.mahout:mahout-spark_2.10:0.13.1_spark_1_6 org.apache.mahout:mahout-spark_2.10:0.13.1_spark_2_0

Re: Proposal for changing Mahout's Git branching rules

2017-06-21 Thread Pat Ferrel
s to develop instead of master? Do they need to PR against develop branch, and if not, who is responsible for conflict resolution then that is to arise from diffing and merging into different targets? On Tue, Jun 20, 2017 at 10:09 AM, Pat Ferrel <p...@actionml.com> wrote: > As I said I

Re: Proposal for changing Mahout's Git branching rules

2017-06-20 Thread Pat Ferrel
ast couple of months but I forget what it was at the moment. Trevor Grant Data Scientist https://github.com/rawkintrevo http://stackexchange.com/users/3002022/rawkintrevo http://trevorgrant.org *"Fortunate is he, who is able to know the causes of things." -Virgil* On Mon, Ju

Re: Proposal for changing Mahout's Git branching rules

2017-06-19 Thread Pat Ferrel
branches that are created and ephemeral with this method. On Jun 19, 2017, at 5:52 PM, Pat Ferrel <p...@occamsmachete.com> wrote: I just heard we are not using git flow (the process not the tool), we are checking unclean (untested in any significant way) changes to master

Re: Proposal for changing Mahout's Git branching rules

2017-06-19 Thread Pat Ferrel
l.com> > wrote: > > Cool, I'll make a new dev branch now. > > Dev, develop, any preference? > > On Sat, Apr 22, 2017 at 10:30 AM, Pat Ferrel <p...@occamsmachete.com> > wrote: > >> It hasn't been often but I’ve been bit by it and had to ask users of a >

Re: New Website is Staged

2017-05-09 Thread Pat Ferrel
Are you guys ready for serious comments on the new design or is this just a first running version? On May 9, 2017, at 8:20 AM, Trevor Grant wrote: In the interest of getting this thing up and running, use DFW Meetup video as a place holder for time being? Trevor

Attack email

2017-05-03 Thread Pat Ferrel
If anyone gets a Google Docs share from me don’t click it. The URL is https:// accounts.google.com……. but it is an attack to get your contacts. Delete it.

Re: New logo

2017-05-03 Thread Pat Ferrel
nterlocking solid yellow/blue background 3rd is simple letter M as wireframe but prefer the diagram be in yellow. I don't care for the loopy curved logos (sorry Andrew!) Good luck!! Ellen Friedman On Thu, Apr 27, 2017 at 12:56 PM, Pat Ferrel <p...@occamsmachete.com <mailto:p...@occ

Re: Scaling up spark Iitem similarity on big data data sets

2017-05-01 Thread Pat Ferrel
I just ran into the opposite case Sebastian mentions, where a very large % of users have only one interaction. They come from Social media or Search and see only one thing and leave. Processing this data turned into a huge job but led to virtually no change in the model since users with very few

Re: New logo

2017-04-27 Thread Pat Ferrel
hu, Apr 27, 2017 at 5:54 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > Fair enough, I think Trevor feels the same. > > The blue man can continue, all it takes is a -1 > > > On Apr 27, 2017, at 3:50 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: > &

Re: New logo

2017-04-27 Thread Pat Ferrel
uggest a better path and I hate negative feedback. But there it is. On Thu, Apr 27, 2017 at 3:48 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > Do you have constructive input (guidance or opinion is welcome input) or > would you like to discontinue the contest. If the latter, -1 now. >

Re: New logo

2017-04-27 Thread Pat Ferrel
Apr 27, 2017 at 3:36 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > Yes, -1 means you hate them all or think the designers are not worth > paying. We have to pay to continue, I’ll foot the bill (donations > appreciated) but don’t want to unless people think it will lead t

Re: New logo

2017-04-27 Thread Pat Ferrel
ments/84/84017/attachment_84017937 >> >> I like the stylized and simple "M" and it reminds me of diagrams showing >> vector multiplication. >> >> On Thu, Apr 27, 2017 at 12:56 PM, Pat Ferrel <p...@occamsmachete.com> >> wrote: >> >>> We

Re: New logo

2017-04-27 Thread Pat Ferrel
you have 24 hours to vote Here’s my +1 to continue refining. On Apr 27, 2017, at 11:41 AM, Pat Ferrel <p...@occamsmachete.com> wrote: Here is a second group, hopefully picked to be unique. https://99designs.com/contests/poll/vl7xed We got a lot of responses, these 2 polls contain th

Re: New logo

2017-04-27 Thread Pat Ferrel
Here is a second group, hopefully picked to be unique. https://99designs.com/contests/poll/vl7xed We got a lot of responses, these 2 polls contain the best afaict. On Apr 27, 2017, at 11:25 AM, Pat Ferrel <p...@occamsmachete.com> wrote: Vote: https://99designs.com/contests/poll/rqcg

New logo

2017-04-27 Thread Pat Ferrel
Vote: https://99designs.com/contests/poll/rqcgif We asked for something “mathy” and asked for no elephant and rider. We have the rest of the week to tweak so leave comments about what you like or would like to change. We don’t have to pick one of these, so if you hate them all, make that known

New site and logo

2017-04-24 Thread Pat Ferrel
The Mahout site is moving to Jekyll with a bit of a new look and so it might be nice to get an update of the logo. I think the consensus was to keep the Mahout name but I didn’t get a feel for the logo. One concern mentioned is that Mahout is no longer attached to Hadoop (the elephant) so

Re: Proposal for changing Mahout's Git branching rules

2017-04-22 Thread Pat Ferrel
p, any preference? On Sat, Apr 22, 2017 at 10:30 AM, Pat Ferrel <p...@occamsmachete.com> wrote: > It hasn't been often but I’ve been bit by it and had to ask users of a > dependent project to checkout a specific commit, nasty. > > The main effect would be to automation efforts that are cu

Re: Proposal for changing Mahout's Git branching rules

2017-04-22 Thread Pat Ferrel
er/dev branch approach is solid. On Sat, Apr 22, 2017 at 10:06 AM, Pat Ferrel <p...@occamsmachete.com> wrote: > I’ve been introduced to what is now being called git-flow, which at its > simplest is just a branching strategy with several key benefits. The most > important part of

Proposal for changing Mahout's Git branching rules

2017-04-22 Thread Pat Ferrel
I’ve been introduced to what is now being called git-flow, which at its simplest is just a branching strategy with several key benefits. The most important part of it is that the master branch is rock solid all the time because we use the “develop” branch for integrating Jiras, PRs, features,

Re: Lambda and Kappa CCO

2017-04-17 Thread Pat Ferrel
> Pat- > > What can we do from the mahout side? Would we need any new data > structures? Trevor and I were just discussing some of the troubles of > near real time matrix streaming. > ---------- > *From:* Pat Ferrel <p...@occamsmachete.com> > *

Re: Loading data from files - Samsara

2017-04-04 Thread Pat Ferrel
Mahout-Samsara has a couple CLI drivers but these are mostly for examples. They read from csv files but may not do what you want. Mahout can also run in a Spark Shell or as a library to your app, which gives you all the data loading functions of Spark or Scala. For instance I use

Re: Reg:-Integrating Mahout with Solr

2017-04-02 Thread Pat Ferrel
017 at 23:46, Pat Ferrel <p...@occamsmachete.com> wrote: > You want to create “Behavioral Search”? This is where you boost items that > have the search terms in them more likely to be favored by the individual > user? > > You want to use the CCO algorithm in Mahout. You need to

Re: Reg:-Integrating Mahout with Solr

2017-04-01 Thread Pat Ferrel
You want to create “Behavioral Search”? This is where you boost items that have the search terms in them more likely to be favored by the individual user? You want to use the CCO algorithm in Mahout. You need to collect behavioral information like conversions, detailed page views, etc. Run each

Re: Samsara's learning curve

2017-03-29 Thread Pat Ferrel
While I agree with D and T, I’ll add a few things to watch out for. One of the hardest things to learn is the new model of execution, it’s not quite Spark or any other compute engine. You need to create contexts that have virtualized the actual compute engine. But you will probably need to use

Re: Lambda and Kappa CCO

2017-03-27 Thread Pat Ferrel
causes many cooccurrences to change. This becomes feasible if you include the effect of down-sampling, but that has to be in the algorithm. From: Pat Ferrel <p...@occamsmachete.com> Sent: Saturday, March 25, 2017 12:01:00 PM To: Trevor Grant; user@mahout.apache.org Cc: Ted Dunning; s..

Re: Marketing

2017-03-25 Thread Pat Ferrel
2017 7:22 PM (GMT-08:00) To: user@mahout.apache.org Cc: Mahout Dev List <d...@mahout.apache.org> Subject: Re: Marketing On Fri, Mar 24, 2017 at 8:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote: > maybe we should drop the name Mahout altogether. I have been told that there is a co

Re: Marketing

2017-03-24 Thread Pat Ferrel
." -Virgil* On Thu, Mar 23, 2017 at 5:43 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > The little blue man (the mahout) was reborn (samsara) as a honey-badger? > He must be close indeed to reaching true enlightenment, or is that Buddhism? > > > On Mar 23, 2017, at

Re: Marketing

2017-03-23 Thread Pat Ferrel
The little blue man (the mahout) was reborn (samsara) as a honey-badger? He must be close indeed to reaching true enlightenment, or is that Buddhism? On Mar 23, 2017, at 12:42 PM, Andrew Palumbo wrote: +1 on revamp. Sent from my Verizon Wireless 4G LTE smartphone

Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

2017-03-16 Thread Pat Ferrel
OK, my tests passed including the last blocker, will test again on the new RC. On Mar 16, 2017, at 8:56 AM, Andrew Musselman wrote: Cancelling vote due to https://issues.apache.org/jira/browse/MAHOUT-1955 On Wed, Mar 15, 2017 at 8:55 AM, Andrew Musselman <

Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

2017-03-14 Thread Pat Ferrel
The release was not made due to broken drivers, now fixed. I assume a new RC will come shortly? On Mar 11, 2017, at 9:54 PM, Andrew Musselman wrote: This is the vote for release 0.13.0 of Apache Mahout. The vote will be going for at least 72 hours and will be closed on

Re: 0.13.0-RC not fully compatible with Spark 1.6.3?

2017-03-06 Thread Pat Ferrel
--master spark://ubuntu:7077 --input ~/data/rating_200k.csv --output ~/data/rating_200k_output --itemIDColumn 1 --rowIDColumn 0 --sparkExecutorMem 6g -Ursprüngliche Nachricht- Von: Pat Ferrel [mailto:p...@occamsmachete.com] Gesendet: Freitag, 3. März 2017 20:49 An: Michael Müller Cc

Re: 0.13.0-RC not fully compatible with Spark 1.6.3?

2017-03-03 Thread Pat Ferrel
Thanks, I’ll see if I can reproduce. So you are downloading the binary and running the Mahout spark-itemsimilarity driver from that binary? You say “using the same Spark cluster”. How is this set up, an env var like MASTER=? Can you supply how you point to the cluster and your CLI for the job?

Re: Universal Recommender. How to rank items returned by query on three types of indicators?

2017-02-06 Thread Pat Ferrel
Feb 5, 2017 at 10:36 AM, Pat Ferrel <p...@occamsmachete.com <mailto:p...@occamsmachete.com>> wrote: > Nice, someone does read the math :-) > > Content: The type of personalized “content” indicators talked about in the > slides are not supported by the Universal Recommende

Re: Universal Recommender. How to rank items returned by query on three types of indicators?

2017-02-05 Thread Pat Ferrel
Nice, someone does read the math :-) Content: The type of personalized “content” indicators talked about in the slides are not supported by the Universal Recommender and have little value unless you have no collaborative filtering data. They can theoretically be mixed with other indicators but

Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread Pat Ferrel
My perspective comes from the data side. I work in recommenders and that means log analysis for huge amounts of data. Even a small shop doing this will immediately run out of the capacity in Python or R on a single node. MLlib is a set of prepackaged algorithms that will work (mostly) with big

Re: Question about spark-itemsimilarity

2017-01-15 Thread Pat Ferrel
urns is about 500 events per user. > Best regards, Niklas > > > 2016-12-15 3:23 GMT+01:00 Pat Ferrel <p...@occamsmachete.com>: > >> Cross-occurrence allows us to ask the question: are 2 events correlated. >> >> To use the Ecom example, purchase is the

Re: Question about spark-itemsimilarity

2016-12-14 Thread Pat Ferrel
-occurrences - purchase history/clicks or downloads Best, Niklas 2016-12-01 18:47 GMT+01:00 Pat Ferrel <p...@occamsmachete.com>: > No you can’t, the value is ignored. The algorithm looks at occurrences, > cooccurrences, and cross-occurrences of several event types not values > atta

Re: Question about spark-itemsimilarity

2016-12-01 Thread Pat Ferrel
No you can’t, the value is ignored. The algorithm looks at occurrences, cooccurrences, and cross-occurrences of several event types not values attached to events. If you are trying to use rating info, this has been pretty much discarded as being not very useful. For instance you may like
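
A toy sketch of why an attached rating makes no difference: when only occurrences are counted, the value column never enters the computation. This is not Mahout's implementation; the function name `cooccurrence_counts` and the (user, item, value) tuple layout are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(events):
    """Count, for each item pair, how many users interacted with both items.

    events: iterable of (user, item, value) tuples. The value (for example a
    rating) is deliberately ignored; only the fact that the event happened counts.
    """
    items_by_user = defaultdict(set)
    for user, item, _value in events:
        items_by_user[user].add(item)

    counts = defaultdict(int)
    for items in items_by_user.values():
        # Each unordered pair of items seen by the same user cooccurs once.
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)
```

Feeding the same events with different values produces identical counts, which is the behaviour described above.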

using root LLR

2016-11-15 Thread Pat Ferrel
around 20-30 for raw LLR which corresponds to about 5 for root LLR. I often eyeball the lists of indicators for items that I understand to find a point where the list of indicators becomes about half noise, half useful indicators. On Sat, Jan 2, 2016 at 2:15 PM, Pat Ferrel <p...@occamsmachete
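
The relation between the two thresholds is a square root: raw LLR values of 20 to 30 correspond to root LLR values of roughly 4.5 to 5.5, hence "about 5". The sketch below uses the common entropy-based formulation of the log-likelihood ratio on a 2x2 contingency table. It is written for illustration and is not copied from Mahout; the function names and the sign convention for the root are assumptions.

```python
import math


def _x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Raw log-likelihood ratio for a 2x2 table of event counts.

    k11: both events together, k12: first without second,
    k21: second without first, k22: neither.
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    """Square root of the LLR, negated when the events co-occur less than expected."""
    root = math.sqrt(llr(k11, k12, k21, k22))
    row1, row2 = k11 + k12, k21 + k22
    if row1 > 0 and row2 > 0 and k11 / row1 < k21 / row2:
        root = -root
    return root
```

A threshold chosen by eyeballing indicator lists, as described above, can then be applied on either scale by squaring or taking the root.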

Re: spark-itemsimilarity slower than itemsimilarity

2016-10-03 Thread Pat Ferrel
Except for reading the input it now takes ~5 minutes to train. On Sep 30, 2016, at 5:12 PM, Pat Ferrel <p...@occamsmachete.com> wrote: Yeah, I bet Sebastian is right. I see no reason not to try running with --master local[4] or some number of cores on localhost. This will avo

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-30 Thread Pat Ferrel
Yeah, I bet Sebastian is right. I see no reason not to try running with --master local[4] or some number of cores on localhost. This will avoid all serialization. With times that low and small data there is no benefit to separate machines. We are using this with ~1TB of data. Using Mahout as a

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-28 Thread Pat Ferrel
. This brings the cost to a quite reasonable range. You are very unlikely to need machines that large anyway but you could afford it if you only pay for the time they are actually used. On Sep 26, 2016, at 12:30 AM, Arnau Sanchez <pyar...@gmail.com> wrote: On Sun, 25 Sep 2016 09:01:43 -07

Re: spark-itemsimilarity slower than itemsimilarity

2016-09-25 Thread Pat Ferrel
AWS EMR is usually not very well suited for Spark. Spark gets most of its speed from in-memory calculations. So to see speed gains you have to have enough memory. Also partitioning will help in many cases. If you read in data from a single file, that partitioning will usually follow the

Recommenders and MABs

2016-09-17 Thread Pat Ferrel
I’ve been thinking about how one would implement an application that only shows recommendations. This is partly because people want to build such things. There are many problems with this including cold start and overfit. However these problems also face MABs and are solved with sampling

Re: Scaling up spark Iitem similarity on big data data sets

2016-06-23 Thread Pat Ferrel
In addition to increasing downsampling there are some other things to note. The original OOM was caused by the use of BiMaps to store your row and column ids. These will increase with the size of the total storage needed for 2 hashmaps per id type. With only 16g you may have very little else

Re: Clustering options

2016-05-24 Thread Pat Ferrel
Mahout Samsara is more about rolling your own algo, though it has already implemented several as examples. If you want to build your own clustering you will find a lot of what you need in the R-like DSL. But if you want something already built you may want to look at Spark’s MLlib kmeans.

Re: Welcome Trevor Grant as a new Mahout Committer

2016-05-24 Thread Pat Ferrel
Kokanee too? Welcome indeed! On May 24, 2016, at 6:34 AM, Shannon Quinn wrote: Welcome Trevor! On 5/24/16 7:14 AM, Stevo Slavić wrote: > Congratulations Trevor, well deserved, welcome to the team! > > On Tue, May 24, 2016 at 12:32 PM, Suneel Marthi

Re: Read output of sparkrowsimilairty in scala

2016-05-12 Thread Pat Ferrel
There are several ways to do this. The design was meant to be extended by a trait that would do the actual read/write. Check out TDIndexedDatasetReader. You can create a similar trait called MySQLIndexedDatasetReader. There are other examples in that file for reading and writing. Also check the

Re: RowSimilakrity : NotSerializableException

2016-05-07 Thread Pat Ferrel
I think you have to create a SparkDistributedContext, which has Mahout specific Kryo serialization and adds Mahout jars. If you let Mahout create the Spark context it’s simpler: implicit val mc = mahoutSparkContext(masterUrl = "local", appName = "SparkExample") As I recall the sc will then

Re: Mahout rowSimilarity

2016-05-04 Thread Pat Ferrel
n-app.html> >>> >>> Let me know if you need more help. >>> >>> Thank you, >>> Nikaash Puri >>>> On 03-May-2016, at 9:49 PM, Rohit Jain <rohitkjai...@gmail.com> wrote: >>>> >>>> Hello Pat, >>>>

Re: Mahout rowSimilarity

2016-05-03 Thread Pat Ferrel
Sure, but at least some would be Scala. There are examples in Mahout that take PairRDDs as input but anything that constructs an IndexedDataset would be fine. I use this code in a system that creates an RDD from HBase. Think of the task as one of how to create a Spark RDD from your DB content.

Re: Custom Apache mahout Recommender over Hadoop

2016-04-06 Thread Pat Ferrel
Mahout in Action is out of date and the code mentioned is being deprecated. Many of the examples don’t run anymore. These days we run on more modern compute platforms like Spark. For the latest Mahout Recommender you can start with the Command Line Interface to spark-itemsimilarity and

Re: Removing MAHOUT_LOCAL option

2016-03-20 Thread Pat Ferrel
Reduce-based jobs which officially became deprecated in 0.10.0. On Sun, Mar 20, 2016 at 10:25 AM, Andrew Musselman < andrew.mussel...@gmail.com> wrote: > Yes as I understand it. > > > On Sunday, March 20, 2016, Pat Ferrel <p...@occamsmachete.com> wrote: > >> Are we

Re: Removing MAHOUT_LOCAL option

2016-03-20 Thread Pat Ferrel
Are we just talking about Hadoop Mapreduce? I thought it was ignored when using Spark. On Mar 20, 2016, at 8:20 AM, alok tanna wrote: -1 MAHOUT_LOCAL is very useful for quick POC . Thanks, Alok Tanna Sent from my iPhone > On Mar 20, 2016, at 5:01 AM, Mihai Dascalu

Re: New Mahout "Samsara" Book

2016-02-25 Thread Pat Ferrel
I’m working on something, @Dmitriy mentioned that we might be able to add to the second edition of the Beyond Mapreduce book, also looking at a free self-published PDF book. Not sure how much it helps to have paper or a publisher behind promotion. Any thoughts are welcome. In the meantime it

Re: New Mahout "Samsara" Book

2016-02-25 Thread Pat Ferrel
This is awesome news! Can’t wait to get a copy. Congratulations Dmitriy and Andrew. Also thanks for the invitation Scott. I feel like Mahout has gone through a rebirth apropos of the Samsara name and it’s time people hear about it. On Feb 25, 2016, at 8:45 AM, scott cote

Re: mahout spark-itemsimilarity does not work on EMR 4.3

2016-02-24 Thread Pat Ferrel
Another way to get Mahout item-similarity based recommender is to use the Universal Recommender here: https://github.com/actionml/template-scala-parallel-universal-recommendation/tree/v0.3.0 It includes an event input pipeline, periodic Mahout+Spark based model generation and a realtime

Re: Document similarity

2016-02-24 Thread Pat Ferrel
a new document is added. In case of LDA ... I guess the best way is to calculate the topics on the new document using the topics from the previous LDA run ... And then every once in a while to recalculate the topics with the new documents? On Sun, Feb 14, 2016 at 10:02 PM, Pat Fer

Re: What's the mr item-based recommend algorithm essay?

2016-02-19 Thread Pat Ferrel
The reboot of those old MR engines in Mahout-Samsara is what we call Correlated Cross-Occurrence (CCO). This is the core of a multi-modal recommender engine that can use almost any information about the user, context, or items to make recommendations. It is the first Open Source version of this

Re: Document similarity

2016-02-14 Thread Pat Ferrel
Something we are working on for purely content based similarity is using a KNN engine (search engine) but creating features from word2vec and an NER (Named Entity Recognizer). Putting the generated features into fields of a doc can really help with similarity because w2v and NER create

Re: Exception in task 0.0 in stage 13.0 (TID 13) java.lang.OutOfMemoryError: Java heap space

2016-02-13 Thread Pat Ferrel
>> >>>> export HIVE_SERVER2_THRIFT_BIND_HOST=0.0.0.0 >>>> export HIVE_SERVER2_THRIFT_PORT=10001 >>>> >>>> export SPARK_DRIVER_MEMORY=15G >>>> export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS >>>> -XX:On

Re: Exception in task 0.0 in stage 13.0 (TID 13) java.lang.OutOfMemoryError: Java heap space

2016-02-12 Thread Pat Ferrel
ilure: Task 0 in stage 12.0 failed 1 times, most recent failure: > Lost task 0.0 in stage 12.0 (TID 24, localhost): java.lang.OutOfMemoryError: > GC overhead limit exceeded > ……. > ….. > .. > . > > Driver stacktrace: > Caused by: java.lang.OutOfMemoryError: GC overhead

Re: Exception in task 0.0 in stage 13.0 (TID 13) java.lang.OutOfMemoryError: Java heap space

2016-02-01 Thread Pat Ferrel
You probably need to increase your driver memory and 8g will not work. 16g is probably the smallest stand alone machine that will work since the driver and executors run on it. > On Feb 1, 2016, at 1:24 AM, jg...@konodrac.com wrote: > > Hello everybody, > > We are experiencing problems when

Re: User similarity in Mahout

2016-01-03 Thread Pat Ferrel
Your problem will be that there isn’t enough cooccurrence between users since, well, how many jobs can any one user apply for and how likely is another user to apply for the same or overlapping jobs? The JDs have a short lifetime and so don’t lend themselves to the older single action

Some test results

2015-12-30 Thread Pat Ferrel
As many of you know Mahout-Samsara includes an interesting and important extension to cooccurrence similarity, which supports cross-cooccurrence and log-likelihood downsampling. This, when combined with a search engine, gives us a multimodal recommender. Some of us integrated Mahout with a DB

Re: CachingUserSimilarity concurrency issue.

2015-12-27 Thread Pat Ferrel
That is from some very old code that is on the deprecation path. Mahout doesn’t accept Hadoop Mapreduce code anymore and this is even older, part of the Taste in-memory recommender. So if you change it, you may have to maintain it yourself. If you want something more modern, check out the

Re: root LLR support in org.apache.mahout.math.cf.SimilarityAnalysis

2015-12-15 Thread Pat Ferrel
No, if you want to work on that feel free, it should be pretty easy to add that option. However be aware that LLR is used in the downsampling step so you don’t get all elements of llr(A’A) for reasons that keep the calculation at O(n). Downsampling is based on number of non-zero elements in a

Re: Mahout - Recommenditemvalue with magnitude of 1

2015-11-29 Thread Pat Ferrel
, what are your conclusions? Best, Niklas 2015-11-24 21:56 GMT+01:00 Pat Ferrel <p...@occamsmachete.com>: > > >> On Nov 24, 2015, at 12:21 PM, Niklas Ekvall <niklas.ekv...@gmail.com> > wrote: >> >> Okay! >> >> No pre-filter and the user/item i

Re: Mahout - Recommenditemvalue with magnitude of 1

2015-11-24 Thread Pat Ferrel
we use all data as input to Mahout and do the filtering inside Mahout? We use the second latest version of Mahout! Best regards, Niklas On Tuesday, November 24, 2015, Pat Ferrel <p...@occamsmachete.com <javascript:_e(%7B%7D,'cvml','p...@occamsmachete.com');>> wrote: > Do your ids s

Re: Mahout - Recommenditemvalue with magnitude of 1

2015-11-24 Thread Pat Ferrel
outcorrect? Yes, but I wouldn't filter. The recs will very likely be better than random with only a small number of events. > > We do the same pre-filter for Spark item-similarity, is that wrong too? No, spark-itemsimilarity uses string ids. > > Best regards, Niklas > >

Re: Mahout - Recommenditemvalue with magnitude of 1

2015-11-24 Thread Pat Ferrel
111346:1.0,112201:1.0,65759:1.0,133127:1.0,61378:1.0,16413:1.0,113289:1.0,49675:1.0,14995:1.0,141028:1.0,27506:1.0] Best regards, Niklas 2015-11-24 16:48 GMT+01:00 Pat Ferrel <p...@occamsmachete.com>: > Sounds like you may not have the input right. Recommendations should be > sorted by t

Re: Mahout - Recommenditemvalue with magnitude of 1

2015-11-24 Thread Pat Ferrel
Sounds like you may not have the input right. Recommendations should be sorted by the strength and so shouldn’t all be 1 unless the data is very odd. Can you give us a small sample of the input? BTW a newer recommender using Mahout’s Spark based code and a search engine is here:

Mahout 0.11.1

2015-11-09 Thread Pat Ferrel
Can someone forward the announcement directly to my email? I didn’t get the announcement of release.

Haters get Love too

2015-11-03 Thread Pat Ferrel
A colleague of mine just built a MAP@k precision evaluator for the Mahout based cooccurrence recommender we’ve been working on and we ran some data scraped from rottentomatoes.com. They have “fresh” and “rotten” reviews tied to reviewer ids. A fair bit of discussion
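
For readers unfamiliar with the metric, here is a minimal sketch of MAP@k (mean average precision at k). It is not the evaluator mentioned in the post. The function names are hypothetical, recommendation lists are assumed to contain no duplicates, and dividing by min(number of relevant items, k) is one of several normalizations in use.

```python
def average_precision_at_k(recommended, relevant, k):
    """Average precision over the top-k recommendations for one user."""
    relevant = set(relevant)
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), k)


def map_at_k(recs_by_user, relevant_by_user, k):
    """Mean of average_precision_at_k over every user with held-out data."""
    users = list(relevant_by_user)
    if not users:
        return 0.0
    total = sum(
        average_precision_at_k(recs_by_user.get(u, []), relevant_by_user[u], k)
        for u in users
    )
    return total / len(users)
```

Here `relevant_by_user` would hold each user's held-out positive events (for instance "fresh" reviews) and `recs_by_user` the ranked recommendations produced from the remaining data.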

Re: Is Mahout obsolete now?

2015-10-20 Thread Pat Ferrel
ribution). On Tue, Oct 20, 2015 at 12:05 PM, Pavan K Narayanan < pavan.naraya...@gmail.com> wrote: > Perhaps this page <http://mahout.apache.org/users/basics/algorithms.html> > needs > to be updated with algorithms and features of 0.11.0? > > On 19 October 2015 at 18:29,

Re: Is Mahout obsolete now?

2015-10-19 Thread Pat Ferrel
BTW this use of Mahout-Samsara on Spark for recs has really expanded. The Samsara part I’m calling a Correlation Engine, it can be used to mix usage, content, and context to make recs. I look back on 2 years ago as pretty much groping around for solutions. Things are much clearer now (for me at

Re: Like/No Rating/Dislike Dataset Representation to Mahout

2015-10-11 Thread Pat Ferrel
Actually there is another way to do this but you need a multi-action recommender. Turns out that dislikes can actually predict likes. These are both actions the user takes that give us some idea of their taste. To use both actions we need to pick one that is most indicative of a user’s
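
One way to picture using two actions together is a cross-occurrence count: for each pair of a secondary-action item and a primary-action item, count the users who did both. The sketch below is a toy illustration, not Mahout's implementation; the function name, the (user, item) tuple layout, and treating "like" as the primary action are assumptions.

```python
from collections import defaultdict


def cross_occurrence_counts(primary, secondary):
    """Count users who took the secondary action on one item and the primary on another.

    primary, secondary: iterables of (user, item) pairs, for example likes
    and dislikes. Returns {(secondary_item, primary_item): user_count}.
    """
    primary_by_user = defaultdict(set)
    for user, item in primary:
        primary_by_user[user].add(item)

    secondary_by_user = defaultdict(set)
    for user, item in secondary:
        secondary_by_user[user].add(item)

    counts = defaultdict(int)
    for user, sec_items in secondary_by_user.items():
        # Users with no primary action contribute nothing.
        for s in sec_items:
            for p in primary_by_user.get(user, ()):
                counts[(s, p)] += 1
    return dict(counts)
```

A pair with a high count, such as many users who disliked one item and liked another, is a candidate indicator. A significance test would normally be applied before trusting the raw count.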

Re: Exception in thread "main" java.lang.IllegalArgumentException: Unable to read output from "mahout -spark classpath"

2015-10-07 Thread Pat Ferrel
ank you Pat. I was having that issue when I was trying to do something > like that. > Just curious, how should I prepare the data so that it can > satisfy drmDfsRead (path) ? DRM format and how to create the DRM file > ? thanks, canal > > > On Wednesday, October 7, 2015 4:09 AM

Re: Exception in thread "main" java.lang.IllegalArgumentException: Unable to read output from "mahout -spark classpath"

2015-10-06 Thread Pat Ferrel
y much for the help. I will try Spark 1.4. I would like to try distributed matrix multiplication. not sure if there are sample codes available. I am very new to this stack. thanks, canal On Monday, October 5, 2015 12:23 AM, Pat Ferrel <p...@occamsmachete.com> wrote: Mahout 0.1

Re: Exception in thread "main" java.lang.IllegalArgumentException: Unable to read output from "mahout -spark classpath"

2015-10-04 Thread Pat Ferrel
Mahout 0.11.0 is built on Spark 1.4 and so 1.5.1 is a bit unknown. I think the Mahout Shell does not run on 1.5.1. That may not be the error below, which is caused when Mahout tries to create a set of jars to use in the Spark executors. The code runs `mahout -spark classpath` to get these. So

Re: How to deal with catogrical and date data in mahout ?

2015-09-04 Thread Pat Ferrel
Mahout Samsara executes on Spark and doesn’t include a full recommender because it uses a search engine to perform the last calculation and serve results. The newer code does not require the mapping. You can use the provided text-delimited format for input or write your own.

Re: What does maxRating parameter in ALS recommend algorithm mean?

2015-09-01 Thread Pat Ferrel
Google Alternating Least Squares. The science is similar to dimensionality reduction. The algorithm was originally designed to predict user ratings for things like the early Netflix 5 star ratings. Now we tend to look at ranking items as more important. The max rating is probably to set the
