Re: [ANNOUNCE] Mahout Con 2020 (A sub-track of ApacheCon @ Home)

2020-12-13 Thread Eric Link
Unsubscribe

On Wed, Aug 12, 2020, 7:59 AM Trevor Grant  wrote:

> Hey all,
>
> We got enough people to volunteer for talks that we are going to be putting
> on our very own track at ApacheCon (@Home) this year!
>
> Check out the schedule here:
> https://www.apachecon.com/acna2020/tracks/mahout.html
>
> To see the talks live / in real time, please register at:
> https://hopin.to/events/apachecon-home
>
> But if you can't make it- we plan on pushing all of the recorded sessions
> to the website after.
>
> Thanks so much everyone, and can't wait to 'see' you there!
>
> tg
>


Re: [ANNOUNCE] Apache Mahout 14.1 Release

2020-12-13 Thread Eric Link
Unsubscribe

On Thu, Oct 8, 2020, 9:14 AM Andrew Musselman  wrote:

> The Apache Mahout PMC is pleased to announce the release of Mahout 14.1.
> Mahout's goal is to create an environment for quickly creating
> machine-learning applications that scale and run on the highest-performance
> parallel computation engines available. Mahout comprises an interactive
> environment and library that support generalized scalable linear algebra
> and include many modern machine-learning algorithms. This release ships
> some major changes from 0.14.0, most in support of refactoring the build
> system.
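As a quick illustration of the "interactive environment and library that support generalized scalable linear algebra" mentioned above, here is a minimal sketch in the Samsara Scala DSL. It assumes the Mahout Spark bindings are on the classpath; the object name, matrix values, and local master URL are illustrative and not taken from the release notes.

```scala
// Minimal Samsara sketch: build a distributed row matrix (DRM) and compute A' * A.
// Assumes Mahout's Spark bindings are available; the values here are made up.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

object SamsaraSketch extends App {
  // Local Spark-backed Mahout context for experimentation (assumption: local[2] master).
  implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "samsara-sketch")

  // An in-core matrix, then parallelized as a DRM across the engine.
  val inCore = dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
  val drmA = drmParallelize(inCore, 2)

  // R-like distributed algebra: transpose-times-self, collected back in core.
  val gram = (drmA.t %*% drmA).collect
  println(gram)
}
```

The DSL is designed to be engine-agnostic, which is what lets the same algebraic expressions target different distributed back ends.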
>
> To get started with Apache Mahout 14.1, download the release artifacts and
> signatures from https://downloads.apache.org/mahout/14.1/.
>
> Many thanks to the contributors and committers who were part of this
> release.
>
>
> RELEASE HIGHLIGHTS
>
> The theme of the 14.1 release is a major refactor for simplicity of usage
> and maintenance. POM structure and components have moved, so please ask on
> the mailing lists for help if anything is not where you expect it.
>
>
> STATS
>
> A total of 17 separate JIRA issues are addressed in this release [1].
>
>
> GETTING STARTED
>
> Download the release artifacts and signatures at
> https://mahout.apache.org/general/downloads.html. The examples directory
> contains several working examples of the core functionality available in
> Mahout. These can be run via scripts in the examples/bin directory. Most
> examples do not need a Hadoop cluster in order to run.
>
>
> FUTURE PLANS
>
> 14.2
>
> As the project moves towards a 14.2 release, we are working on the
> following:
>
> * Further Native Integration for increased speedups
>
> * JCuda backing for In-core Matrices and CUDA solvers
>
> * Enumeration across multiple GPUs per JVM instance on a given machine
>
> * GPU/OpenMP Acceleration for linear solvers
>
> * Runtime probing and optimization of available hardware, to cache the
> most appropriate solver
>
> * Python bindings for DSL
>
>
>
> CONTRIBUTING
>
> If you are interested in contributing, please see our How to Contribute [2]
> page or contact us via email at d...@mahout.apache.org.
>
>
> CREDITS
>
> As with every release, we wish to thank all of the users and contributors
> to Mahout. Please see the JIRA Release Notes [1] for individual credits.
> Big thanks to Chris Dutz for his effort on the refactoring and cleanup in
> this release.
>
>
> KNOWN ISSUES:
>
> * The classify-wikipedia.sh example has an outdated link to the data files.
> A workaround is to change the download section of the script to:
> `curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p002336425p003046511.bz2 -o ${WORK_DIR}/wikixml/enwiki-latest-pages-articles.xml.bz2`
>
> * Currently GPU acceleration for supported operations is limited to a
> single JVM instance
>
> * Occasional segfault with certain GPU models and computations
>
> * On older GPUs some tests fail when building ViennaCL due to card
> limitations
>
> * Currently automatic probing of a system’s hardware happens at each
> supported operation, adding some overhead
>
> * Currently the example in the main README errors out due to a packaging
> error; we will be fixing this in the next point release
>
>
>
> [1]
>
> https://issues.apache.org/jira/browse/MAHOUT-2068?jql=project%20%3D%20MAHOUT%20AND%20issuetype%20in%20(standardIssueTypes()%2C%20subTaskIssueTypes())%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20in%20(0.14.1%2C%200.14.0)
>
> [2] https://mahout.apache.org/developers/how-to-contribute
>


Re: Error spark-mahout when spark-submit mode cluster

2018-08-08 Thread Eric Link
ost recent failure:
> Lost task 1.3 in stage 0.0 (TID 6, 10.0.2.15, executor 0):
> java.lang.IllegalStateException: unread block data
> at
> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2773)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1599)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
> at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:301)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> Driver stacktrace:
> 18/08/01 14:18:53 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7)
> on 10.0.2.15, executor 0: java.lang.IllegalStateException (unread block
> data) [duplicate 7]
> 18/08/01 14:18:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
> have all completed, from pool
> 18/08/01 14:18:53 INFO DAGScheduler: Job 0 failed: collect at
> GenerateIndicator.scala:38, took 5.265593 s
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage
> 0.0 (TID 6, 10.0.2.15, executor 0): java.lang.IllegalStateException: unread
> block data
> at
> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2773)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1599)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
> at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:301)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Thanks a lot for your time.
> Cheers.



-- 
Eric Link
214.641.5465


Re: "LLR with time"

2017-11-19 Thread Eric Link
system. If there were a way to use the hotness when calculating the
> >>>>> indicators for subpopulations it would be great, especially for a
> >>>>> cross recommender.
> >>>>>
> >>>>> e.g. people in Greece _now_ are viewing this show/product/whatever
> >>>>>
> >>>>> And here the popularity of the recommended item in this subpopulation
> >>>>> could be overlooked when just looking at the overall derivatives of
> >>>>> activity.
> >>>>>
> >>>>> Maybe one could do multiple G-Tests using sliding windows:
> >>>>> * itemA&itemB vs. population (classic)
> >>>>> * itemA&itemB(t) vs. itemA&itemB(t-1)
> >>>>> ...
> >>>>>
> >>>>> and derive multiple indicators per item to be indexed.
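Sketching that idea concretely: the G-test here is the standard 2x2 log-likelihood ratio, applied once against the population (classic co-occurrence) and once window-over-window. All counts below are invented for illustration, and the window layout is an assumption, not Mahout code.

```scala
// Self-contained sketch of sliding-window G-tests (log-likelihood ratio) for item pairs.
// Counts are made up; only the shape of the computation matters here.
object SlidingWindowLLR extends App {

  // Unnormalized entropy terms used by the G-test.
  def xLogX(x: Long): Double = if (x == 0L) 0.0 else x * math.log(x.toDouble)
  def entropy(counts: Long*): Double = xLogX(counts.sum) - counts.map(xLogX).sum

  // G = 2 * (rowEntropy + colEntropy - matrixEntropy) for a 2x2 contingency table.
  def gTest(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowE = entropy(k11 + k12, k21 + k22)
    val colE = entropy(k11 + k21, k12 + k22)
    val matE = entropy(k11, k12, k21, k22)
    math.max(0.0, 2.0 * (rowE + colE - matE)) // clamp tiny negative round-off
  }

  // Classic indicator: k11 = A&B together, k12 = A without B,
  //                    k21 = B without A,  k22 = neither, all in the current window.
  val classic = gTest(120, 480, 300, 9100)

  // Window-over-window indicator: rows are (window t, window t-1),
  // columns are (A&B co-occurrences, all other events).
  val windowed = gTest(120, 480, 40, 560)

  println(f"classic LLR = $classic%.2f, window-over-window LLR = $windowed%.2f")
}
```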
> >>>>>
> >>>>> But this all relies on discretizing time into buckets and not looking
> >>>>> at the distribution of time between events like in the presentation
> >>>>> above - maybe there is something smarter.
> >>>>>
> >>>>> Johannes
> >>>>>
> >>>>> On Sat, Nov 11, 2017 at 2:50 AM, Pat Ferrel  >>
> >>>> wrote:
> >>>>>
> >>>>>> BTW you should take time buckets that are relatively free of daily
> >>>>>> cycles, like 3-day, week, or month buckets, for “hot”. This is to
> >>>>>> remove cyclical effects from the frequencies as much as possible,
> >>>>>> since you need 3 buckets to see the change in the change, 2 for the
> >>>>>> change, and 1 for the event volume.
> >>>>>>
> >>>>>>
> >>>>>> On Nov 10, 2017, at 4:12 PM, Pat Ferrel 
> >>> wrote:
> >>>>>>
> >>>>>> So your idea is to find anomalies in event frequencies to detect
> >> “hot”
> >>>>>> items?
> >>>>>>
> >>>>>> Interesting, maybe Ted will chime in.
> >>>>>>
> >>>>>> What I do is take the frequency and its first and second derivatives
> >>>>>> as measures of popularity, increasing popularity, and increasingly
> >>>>>> increasing popularity. Put another way: popular, trending, and hot.
> >>>>>> This is simple to do by taking 1, 2, or 3 time buckets and looking at
> >>>>>> the number of events, the derivative (difference), and the second
> >>>>>> derivative. Ranking all items by these values gives various measures
> >>>>>> of popularity or its increase.
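A small self-contained sketch of that bucketed ranking; the weekly buckets and the counts are made up:

```scala
// Popular / trending / hot from 1, 2, or 3 time buckets, as described above.
object HotnessRanking extends App {

  // Event counts per item in the three most recent buckets, oldest first
  // (e.g. weekly buckets; the numbers are illustrative).
  val counts: Map[String, Seq[Long]] = Map(
    "item-a" -> Seq(100L, 110L, 115L), // big but barely growing: popular
    "item-b" -> Seq(10L, 40L, 160L),   // small but accelerating: hot
    "item-c" -> Seq(80L, 60L, 40L)     // declining
  )

  def popular(c: Seq[Long]): Long = c(2)                        // event volume, 1 bucket
  def trending(c: Seq[Long]): Long = c(2) - c(1)                // first difference, 2 buckets
  def hot(c: Seq[Long]): Long = (c(2) - c(1)) - (c(1) - c(0))   // second difference, 3 buckets

  counts.toSeq.sortBy { case (_, c) => -hot(c) }.foreach { case (item, c) =>
    println(s"$item popular=${popular(c)} trending=${trending(c)} hot=${hot(c)}")
  }
}
```

These per-item values are the sort of thing a "hot" ranking field can be populated with.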
> >>>>>>
> >>>>>> If your use is in a recommender you can add a ranking field to all
> >>> items
> >>>>>> and query for “hot” by using the ranking you calculated.
> >>>>>>
> >>>>>> If you want to bias recommendations by hotness, query with user
> >>> history
> >>>>>> and boost by your hot field. I suspect the hot field will tend to
> >>>>> overwhelm
> >>>>>> your user history in this case as it would if you used anomalies
> > so
> >>>> you’d
> >>>>>> also have to normalize the hotness to some range closer to the one
> >>>>> created
> >>>>>> by the user-history matching score. I haven’t found a very good way
> >>>>>> to mix
> >>>>>> these in a model so use hot as a method of backfill if you cannot
> >>> return
> >>>>>> enough recommendations or in places where you may want to show
> > just
> >>> hot
> >>>>>> items. There are several benefits to this method of using hot to
> >> rank
> >>>> all
> >>>>>> items including the fact that you can apply business rules to them
> >>> just
> >>>>> as
> >>>>>> normal recommendations—so you can ask for hot in “electronics” if
> >> you
> >>>>> know
> >>>>>> categories, or hot "in-stock" items, or ...
> >>>>>>
> >>>>>> Still anomaly detection does sound like an interesting approach.
> >>>>>>
> >>>>>>
> >>>>>> On Nov 10, 2017, at 3:13 PM, Johannes Schulte <
> >>>>> johannes.schu...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hi "all",
> >>>>>>
> >>>>>> I am wondering what would be the best way to incorporate event
> > time
> >>>>>> information into the calculation of the G-Test.
> >>>>>>
> >>>>>> There is a claim here
> >>>>>> https://de.slideshare.net/tdunning/finding-changes-in-real-data
> >>>>>>
> >>>>>> saying "Time aware variant of G-Test is possible"
> >>>>>>
> >>>>>> I remember I experimented with exponentially decayed counts some
> >>>>>> years ago, and this involved changing the counts to doubles, but I
> >>>>>> suspect there is some smarter way. What I don't get is the relation
> >>>>>> to a data structure like T-Digest when working with a lot of counts /
> >>>>>> cells for every combination of items. Keeping a t-digest for every
> >>>>>> combination seems unfeasible.
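For reference, one minimal reading of "exponentially decayed counts ... changing the counts to doubles" is a counter that decays by event time with an assumed half-life. This is a sketch of the idea only, not a Mahout API; the half-life and timestamps are assumptions.

```scala
// Exponentially decayed counter: recent events count (almost) fully, old ones fade.
// The one-week half-life and the timestamps are assumptions for illustration.
final case class DecayedCount(value: Double, lastUpdate: Long)

object DecayedCounter {
  val halfLifeMs: Double = 7L * 24 * 60 * 60 * 1000 // assumed half-life of one week

  def decay(c: DecayedCount, now: Long): DecayedCount = {
    val factor = math.pow(0.5, (now - c.lastUpdate).toDouble / halfLifeMs)
    DecayedCount(c.value * factor, now)
  }

  // Record one event at time `now`: decay the stored value, then add 1.
  def observe(c: DecayedCount, now: Long): DecayedCount = {
    val d = decay(c, now)
    d.copy(value = d.value + 1.0)
  }
}

object DecayedCountDemo extends App {
  val day = 24L * 60 * 60 * 1000
  var c = DecayedCount(0.0, lastUpdate = 0L)
  (1 to 10).foreach(_ => c = DecayedCounter.observe(c, now = 0L)) // 10 events on day 0
  println(DecayedCounter.decay(c, now = 14 * day).value)          // ~2.5 after two half-lives
}
```

Decayed values like these can stand in for the raw cell counts, which is why the counts become doubles.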
> >>>>>>
> >>>>>> How would one incorporate event time into recommendations to
> > detect
> >>>>>> "hotness" of certain relations? Glad if someone has an idea...
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Johannes
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
>
>


-- 
Eric Link
214.641.5465


Re: Algorithms of prediction

2016-02-29 Thread Eric Link
unsubscribe

On Thu, Feb 25, 2016 at 2:47 PM Keith Aumiller <
keith.aumil...@stlouisintegration.com> wrote:

> I use h2o and it's good, with an easy interface for a new user to learn,
> even without the R libraries.
>
>
> On Thu, Feb 25, 2016 at 10:54 AM, Ted Dunning 
> wrote:
>
> > On Thu, Feb 25, 2016 at 6:52 AM,  wrote:
> >
> > > Thank you for your answer
> > > What other tools you advise me to use?
> > > Do you recommend Rhadoop?
> > >
> >
> > Try h2o instead.  Good R interface. Decent model building.
> >
>
>
>
> --
> Thanks,
>
> Keith Aumiller
> MBA - IT Professional
> Lafayette Hill PA
> 314-369-0811
>


Re: clusterpp is only writing directories for about half of my clusters.

2012-10-20 Thread Eric Link
We are looking at using Mahout in our organization. We have a need to do
statistical analysis, clustering, and recommendations. What is the 'sweet
spot' for doing this with Mahout? Meaning, what types of data sets and data
volumes are the best fit for a tool like Mahout, versus doing things, say, in
a SQL database? I hear big data doesn't really start until you have terabytes
and petabytes of data, so I'm not sure the data sets I have are worthy!
Thanks for any thoughts on the proper fit for a tool like Mahout. - Eric



On Oct 20, 2012, at 2:44 PM, Matt Molek  wrote:

> First off, thank you everyone for your help so far. This mailing list
> has been a great help getting me up and running with Mahout
> 
> Right now, I'm clustering a set of ~3M documents into 300 clusters.
> Then I'm using clusterpp to split the documents up into directories
> containing the vectors belonging to each cluster. After I perform the
> clustering, clusterdump shows that each cluster has between ~800 and
> ~200,000 documents. This isn't a great spread, but the point is that
> none of the clusters are empty.
> 
> Here are my commands:
> 
> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
> -k 300 -x 15 -cl -ow
> 
> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o clusterdump.txt
> 
> bin/mahout clusterpp -i pca-clusters -o bottom
> 
> 
> Since none of my clusters are empty, I would expect clusterpp to
> create 300 directories in "bottom", one for each cluster. Instead,
> only 147 directories are created. The other 153 outputs are just empty
> part-r-* files sitting in the "bottom" directory.
> 
> I haven't found too much information when searching on this issue but
> I did come across one mailing list post from a while back:
> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%3c4f3e52fc.7000...@windwardsolutions.com%3E
> 
> In that discussion someone said, "If that is the only thing that is
> contained in the part-r-* file [it had no vectors], then the reducer
> responsible to write to that part-r-* file did not receive any input
> records to write to it. This happens because the program uses the
> default hash partitioner which sometimes maps records belonging to
> different clusters to the same reducer, thus leaving some reducers
> without any input records."
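That explanation is easy to reproduce with a small simulation of default hash partitioning; the cluster IDs and reducer count below are invented, but the effect (a sizable fraction of reducers receiving no cluster at all) is the one described in the quote.

```scala
// Simulate Hadoop's default HashPartitioner assigning cluster IDs to reducers:
// partition = (key.hashCode & Int.MaxValue) % numReducers. Even with as many
// reducers as clusters, collisions leave many reducers with no input, hence
// empty part-r-* files. IDs here are random stand-ins for the VL-374xxxx IDs.
object HashPartitionDemo extends App {
  val numReducers = 300
  val rng = new scala.util.Random(42)
  val clusterIds: Seq[Int] = Seq.fill(300)(3740000 + rng.nextInt(100000))

  def partition(key: Int): Int = (key.hashCode & Int.MaxValue) % numReducers

  val reducersUsed = clusterIds.map(partition).toSet
  println(s"${clusterIds.size} clusters map to ${reducersUsed.size} reducers; " +
    s"${numReducers - reducersUsed.size} reducers get no records (empty part files)")
}
```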
> 
> So if that's correct, is that what's happening to me? Half of my
> clusters are being sent to the overlapping reducers? That seems like a
> big issue, making clusterpp pretty much useless for my purposes. I
> can't have documents randomly being sent to the wrong cluster's
> directory, especially not 50+% of them.
> 
> One final detail: I'm not sure if this matters, but the clusters
> output by kmeans are not numbered 1 to 300. They have an odd looking,
> nonsequential numbering sequence. The first 5 clusters are:
> VL-3740844
> VL-3741044
> VL-3741140
> VL-3741161
> VL-3741235
> 
> I haven't done much with kmeans before, so I wasn't sure if this was
> an unexpected behavior or not.



Re: Recommendations for new users

2012-10-12 Thread Eric Link
Do you have a link to your Stack Overflow answer? Thx. - Eric


On Oct 12, 2012, at 10:54 AM, Sean Owen  wrote:

> See my answer on StackOverflow. Yes it is important.
> On Oct 12, 2012 4:23 PM, "Ahmet Yılmaz"  wrote:
> 
>> Hi,
>> We are planning to use Mahout for our movie recommender system. And we are
>> planning to use SVD for model building.
>> 
>> When a new user comes we will require him/her to rate a certain number of
>> movies (say 10).
>> 
>> In order to recommend movies to this new user we have to rebuild the
>> entire model. But this is not appealing in terms of computational load.
>> 
>> 
>> I'm looking for better solutions.
>> 
>> 
>> For FunkSVD, one solution seems to be retraining the model *only* on the
>> new user, in order to learn the factors associated with him.
>> Since there are not many ratings associated with the new user, you can
>> learn the new user's factors in negligible time.
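A rough, self-contained sketch of that fold-in idea (not a Mahout API): hold the already-trained item factors fixed and run a few SGD epochs over just the new user's ratings. The factor count, learning rate, regularization, and toy data are all assumptions.

```scala
// FunkSVD-style fold-in for one new user: item factors are frozen, only the
// new user's factor vector is learned from their handful of ratings.
object NewUserFoldIn extends App {
  val k = 10          // latent factors (assumed)
  val lrate = 0.01    // learning rate (assumed)
  val lambda = 0.05   // regularization (assumed)
  val epochs = 200
  val rng = new scala.util.Random(0)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Stand-in for the item factors produced by the full training run.
  val itemFactors: Map[Int, Array[Double]] =
    (1 to 50).map(i => i -> Array.fill(k)(rng.nextGaussian() * 0.1)).toMap

  // The handful of ratings the new user just gave: itemId -> rating.
  val newUserRatings = Seq(3 -> 5.0, 7 -> 4.0, 12 -> 1.0, 19 -> 3.0, 25 -> 5.0, 40 -> 4.0)

  val u = Array.fill(k)(rng.nextGaussian() * 0.1) // the only parameters we update

  for (_ <- 1 to epochs; (item, r) <- newUserRatings) {
    val q = itemFactors(item)
    val err = r - dot(u, q)
    for (f <- 0 until k) u(f) += lrate * (err * q(f) - lambda * u(f)) // items stay fixed
  }

  // Score unseen items for the new user with the freshly learned vector.
  val seen = newUserRatings.map(_._1).toSet
  val top = itemFactors.toSeq.collect { case (i, q) if !seen(i) => i -> dot(u, q) }
    .sortBy(-_._2).take(5)
  println(top)
}
```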
>> 
>> Actually, this solution does not seem difficult to implement. So, I
>> wonder why this is not implemented in Mahout given that in commercial
>> settings it is very important to be able to immediately recommend items to
>> users after they give some ratings.
>> 
>> Thank you
>> Ahmet