[jira] [Commented] (MAHOUT-1164) Make ARFF integration generate meta-data in JSON format

2013-03-27 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616044#comment-13616044 ] Marty Kube commented on MAHOUT-1164: Yep, the changes look good to me. Ship it!

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Marty Kube
n the email comes through (<10% chance lately) then I lose track. On Wed, Mar 27, 2013 at 7:30 AM, Marty Kube wrote: So I'd like to continue to improve the RF classifier code. I've been posting patches along the lines of the refactoring discussed here. The patches are not being looked

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Marty Kube
r 27, 2013, at 12:14 AM, Sebastian Schelter wrote: > Totally agree on that. The impact of making Mahout more usable is much > higher than that of adding a new algorithm. > > On 27.03.2013 05:41, Ted Dunning wrote: >> It is critically important. >> >> On Wed, Mar 27,

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Marty Kube
IMHO usability is really important. I've posted a couple of patches recently around making the RF classifiers easier to use. I found myself working on consistent data format and command line option support. It's not glamorous but it's important. On 3/26/2013 8:26 PM, Ted Dunning wrote: Go

Re: Discussion Of ML environment/MR, Mahout

2013-03-15 Thread Marty Kube
Hey Sean, I hear what you are saying. I've been working on the RF classifiers and the community/code could use a little more cohesion. Having "most ML on most platforms" would be a good thing. You point out valid organizational hurdles. Are there organizational changes that could be made to

Re: ARFF file support for RF classifiers

2013-03-15 Thread Marty Kube
2013 05:15, Marty Kube wrote: Hey, I've been working on adding ARFF support for RF classifiers. I've posted a couple of patches and wanted to explain what I had in mind. The larger goal here is to run any integration to generate a dictionary/meta-data and sequence file that can be consume

ARFF file support for RF classifiers

2013-03-14 Thread Marty Kube
Hey, I've been working on adding ARFF support for RF classifiers. I've posted a couple of patches and wanted to explain what I had in mind. The larger goal here is to run any integration to generate a dictionary/meta-data and sequence file that can be consumed directly by the RF classifiers. T

[jira] [Updated] (MAHOUT-1164) Make ARFF integration generate meta-data in JSON format

2013-03-14 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marty Kube updated MAHOUT-1164: --- Attachment: MAHOUT-1164.patch > Make ARFF integration generate meta-data in JSON for

[jira] [Updated] (MAHOUT-1164) Make ARFF integration generate meta-data in JSON format

2013-03-14 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marty Kube updated MAHOUT-1164: --- Summary: Make ARFF integration generate meta-data in JSON format (was: Make ARFF intregration

[jira] [Updated] (MAHOUT-1164) Make ARFF intregration generate meta-data in JSON format

2013-03-14 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marty Kube updated MAHOUT-1164: --- Status: Patch Available (was: Open) This patch adds a command line option to generate JSON meta

[jira] [Updated] (MAHOUT-1164) Make ARFF intregration generate meta-data in JSON format

2013-03-14 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marty Kube updated MAHOUT-1164: --- Description: Add a command line option to generate meta-data in a JSON format. This ticket

[jira] [Created] (MAHOUT-1164) Make ARFF intregration generate meta-data in JSON format

2013-03-14 Thread Marty Kube (JIRA)
Marty Kube created MAHOUT-1164: -- Summary: Make ARFF intregration generate meta-data in JSON format Key: MAHOUT-1164 URL: https://issues.apache.org/jira/browse/MAHOUT-1164 Project: Mahout Issue

[jira] [Updated] (MAHOUT-1163) Make random forest classifier meta-data file human readable

2013-03-13 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marty Kube updated MAHOUT-1163: --- Attachment: MAHOUT-1163.patch > Make random forest classifier meta-data file human reada

[jira] [Updated] (MAHOUT-1163) Make random forest classifier meta-data file human readable

2013-03-13 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marty Kube updated MAHOUT-1163: --- Status: Patch Available (was: Open) The attached patch generates and reads meta-data files in a

[jira] [Created] (MAHOUT-1163) Make random forest classifier meta-data file human readable

2013-03-13 Thread Marty Kube (JIRA)
Marty Kube created MAHOUT-1163: -- Summary: Make random forest classifier meta-data file human readable Key: MAHOUT-1163 URL: https://issues.apache.org/jira/browse/MAHOUT-1163 Project: Mahout

Re: Out-of-core random forest implementation

2013-03-08 Thread Marty Kube
riant data typically induce an overhead of an order of magnitude compared to approaches specialized for distributed iterations. Best, Sebastian On 08.03.2013 14:36, Marty Kube wrote: What about using one map reduce job per iteration? The models you load into distributed cache are the model fro

Re: Out-of-core random forest implementation

2013-03-08 Thread Marty Kube
On 03/07/2013 04:56 PM, Ted Dunning wrote: On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg wrote: ... Right now what we have is a single-machine procedure for scanning through some data, building a set of histograms, combining histograms and then expanding the tree. The next step is to decide the
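The scan, build, combine and expand steps described in this message can be illustrated with a small sketch. It is only an illustration under stated assumptions: it uses exact fixed-width bins rather than approximate streaming histograms, and the function names (`build_histogram`, `combine`, `best_split`) are hypothetical, not taken from Mahout or from any code discussed in the thread.

```python
from collections import Counter


def build_histogram(records, bin_width):
    """Count (bin, label) pairs for one partition of (value, label) records."""
    hist = Counter()
    for value, label in records:
        hist[(int(value // bin_width), label)] += 1
    return hist


def combine(histograms):
    """Merge per-partition histograms by adding their counts."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total


def best_split(hist, bin_width):
    """Return the bin boundary with the fewest misclassifications, or None.

    Each side of a candidate boundary predicts its majority label.
    """
    bins = sorted({b for b, _ in hist})
    best = None
    for boundary in bins[1:]:
        left, right = Counter(), Counter()
        for (b, label), n in hist.items():
            (left if b < boundary else right)[label] += n
        errors = sum(sum(side.values()) - max(side.values())
                     for side in (left, right))
        if best is None or errors < best[0]:
            best = (errors, boundary * bin_width)
    return None if best is None else best[1]


if __name__ == "__main__":
    partitions = [[(0.5, "a"), (1.5, "a")], [(2.5, "b"), (3.5, "b")]]
    merged = combine(build_histogram(p, 1.0) for p in partitions)
    print(best_split(merged, 1.0))
```

Because the histograms are plain counts, they can be built independently on each partition and merged in any order, which is what makes the combine step cheap.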

Re: Out-of-core random forest implementation

2013-03-08 Thread Marty Kube
What about using one map reduce job per iteration? The models you load into distributed cache are the model from the last round and the reducer can emit the expanded model. We are presumably working with large data sets so I would not expect start-up latency to be an issue. On 03/07/2013 04:
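The one-job-per-iteration idea above can be sketched in plain Python, with no Hadoop involved. Everything here is an assumption made for illustration: the model is a dictionary of node id to (feature, threshold), passing `model` to the mapper stands in for loading last round's model from the distributed cache, and the split rule (mean threshold, fewest misclassifications) is a deliberately naive placeholder.

```python
from collections import defaultdict


def route(model, x):
    """Follow the splits in `model` and return the id of the leaf x reaches."""
    node = 0
    while node in model:
        feature, threshold = model[node]
        node = 2 * node + 1 if x[feature] <= threshold else 2 * node + 2
    return node


def map_phase(model, records):
    """Mapper: uses last round's model to emit (leaf_id, (features, label))."""
    for x, y in records:
        yield route(model, x), (x, y)


def reduce_phase(values):
    """Reducer: choose a split for one leaf, or None if no split helps."""
    if len({y for _, y in values}) < 2:
        return None  # pure leaf, nothing to expand
    best = None
    for feature in range(len(values[0][0])):
        # Naive threshold: the mean of the feature over this leaf's records.
        threshold = sum(x[feature] for x, _ in values) / len(values)
        left = [y for x, y in values if x[feature] <= threshold]
        right = [y for x, y in values if x[feature] > threshold]
        if not left or not right:
            continue
        errors = sum(len(side) - max(side.count(label) for label in set(side))
                     for side in (left, right))
        if best is None or errors < best[0]:
            best = (errors, feature, threshold)
    return None if best is None else (best[1], best[2])


def run_iteration(model, records):
    """One simulated job: group by leaf, then emit the expanded model."""
    grouped = defaultdict(list)
    for leaf_id, value in map_phase(model, records):
        grouped[leaf_id].append(value)
    expanded = dict(model)
    for leaf_id, values in grouped.items():
        split = reduce_phase(values)
        if split is not None:
            expanded[leaf_id] = split
    return expanded


def train(records, max_rounds=10):
    """Run one job per round until the model stops changing."""
    model = {}
    for _ in range(max_rounds):
        expanded = run_iteration(model, records)
        if expanded == model:
            break
        model = expanded
    return model
```

Each round reads the full data set once, so the number of jobs grows with tree depth; that is the start-up cost the message argues is acceptable for large inputs.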

[jira] [Resolved] (MAHOUT-1149) Partial Implementation wiki page is out of date

2013-02-28 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marty Kube resolved MAHOUT-1149. Resolution: Fixed Fix Version/s: 0.8 I've updated the wiki for the classifier, r

Re: ARFF file support for random forest classifiers

2013-02-28 Thread Marty Kube
r parts of the application consume these files... On 02/28/2013 01:14 PM, Ted Dunning wrote: making this consistent would be very helpful. On Thu, Feb 28, 2013 at 9:33 AM, Marty Kube wrote: Hey, I've been looking at consuming ARFF files for random forest classification. If you look

ARFF file support for random forest classifiers

2013-02-28 Thread Marty Kube
Hey, I've been looking at consuming ARFF files for random forest classification. If you look at the partial implementation example page one is asked to download an ARFF file, edit the ARFF file to remove the meta-data, and then recreate the same meta-data with command line arguments to the De
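To make the point concrete, the attribute meta-data is already present in the ARFF header and can be read mechanically instead of being retyped by hand. The sketch below is an assumption-laden illustration, not Mahout's code: the function name `describe_arff` and the JSON layout it produces are hypothetical, and it does not handle nominal values that themselves contain commas.

```python
import json
import re

# Matches "@attribute <name> <type>"; the name may be single- or double-quoted.
ATTRIBUTE = re.compile(r"""@attribute\s+('[^']*'|"[^"]*"|\S+)\s+(.+)""",
                       re.IGNORECASE)


def describe_arff(lines):
    """Collect attribute descriptions from the header of an ARFF file."""
    attributes = []
    for line in lines:
        line = line.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section begins
        match = ATTRIBUTE.match(line)
        if not match:
            continue  # comments, @relation and blank lines are skipped
        name = match.group(1).strip("'\"")
        spec = match.group(2).strip()
        if spec.startswith("{"):
            # Nominal attribute: the allowed values are listed in braces.
            values = [v.strip().strip("'\"")
                      for v in spec.strip("{}").split(",")]
            attributes.append(
                {"name": name, "type": "categorical", "values": values})
        else:
            attributes.append({"name": name, "type": spec.lower()})
    return attributes


if __name__ == "__main__":
    header = [
        "@relation weather",
        "@attribute 'out look' {sunny, overcast, rainy}",
        "@ATTRIBUTE temperature NUMERIC",
        "@data",
        "sunny,85",
    ]
    print(json.dumps(describe_arff(header), indent=2))
```

Keeping the description in a text format such as JSON means it can be inspected and corrected in an editor before training.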

[jira] [Commented] (MAHOUT-1149) Partial Implementation wiki page is out of date

2013-02-28 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589583#comment-13589583 ] Marty Kube commented on MAHOUT-1149: I figured out how to edit the wiki page

[jira] [Updated] (MAHOUT-1149) Partial Implementation wiki page is out of date

2013-02-28 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marty Kube updated MAHOUT-1149: --- Summary: Partial Implementation wiki page is out of date (was: Partial Implementation wiki page is

[jira] [Updated] (MAHOUT-1150) ARFF Integration does not support quoted identifiers

2013-02-27 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marty Kube updated MAHOUT-1150: --- Description: I ran the NSL-KDD data set (http://nsl.cs.unb.ca/NSL-KDD/) through the ARFF

[jira] [Updated] (MAHOUT-1150) ARFF Integeration does not support quoted identifiers

2013-02-27 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marty Kube updated MAHOUT-1150: --- Description: I ran NSL-KDD data set (http://nsl.cs.unb.ca/NSL-KDD/) through the ARFF integration

[jira] [Updated] (MAHOUT-1150) ARFF Integeration does not support quoted identifiers

2013-02-27 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marty Kube updated MAHOUT-1150: --- Attachment: MAHOUT-1150.patch Attaching patch > ARFF Integeration does not supp

[jira] [Created] (MAHOUT-1150) ARFF Integeration does not support quoted identifiers

2013-02-27 Thread Marty Kube (JIRA)
Marty Kube created MAHOUT-1150: -- Summary: ARFF Integeration does not support quoted identifiers Key: MAHOUT-1150 URL: https://issues.apache.org/jira/browse/MAHOUT-1150 Project: Mahout Issue

Re: Out-of-core random forest implementation

2013-02-21 Thread Marty Kube
, 2013 at 5:01 PM, Andy Twigg wrote: Even better, there is already a good implementation of the histograms: https://github.com/bigmlcom/histogram -Andy On 20 February 2013 22:50, Marty Kube wrote:

Re: Out-of-core random forest implementation

2013-02-20 Thread Marty Kube
mplementing that, if anyone else is interested? [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf On 20 February 2013 14:27, Andy Twigg wrote: Why don't we start from https://github.com/ashenfad/hadooptree ? On 20 February 2013 13:25, Marty Kube wrote: Hi Lorenz,

Re: Out-of-core random forest implementation

2013-02-20 Thread Marty Kube
e start from >> >> https://github.com/ashenfad/hadooptree ? >> >> On 20 February 2013 13:25, Marty Kube wrote: >>> Hi Lorenz, >>> >>> Very interesting, that's what I was asking for when I mentioned non-MR >>> implementations :-) >>

Re: Out-of-core random forest implementation

2013-02-20 Thread Marty Kube
rking on a PLANET-like implementation on top of spark: http://spark-project.org I think this framework is a nice fit for the problem. If the input data fits into the "total cluster memory" you benefit from the caching of the RDD's. regards, lorenz On Feb 20, 2013, at 2:42 AM, Mar

Re: Out-of-core random forest implementation

2013-02-19 Thread Marty Kube
all of the resource management that is built in. On 02/19/2013 08:04 PM, Ted Dunning wrote: If non-MR means map-only job with communicating mappers and a state store, I am down with that. What did you mean? On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube < martyk...@beavercreekconsulting.com

Re: Out-of-core random forest implementation

2013-02-19 Thread Marty Kube
ution? On 02/15/2013 03:09 AM, deneche abdelhakim wrote: On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: On 01/28/2013 02:33 PM, Ted Dunning wrote: I think I was suggesting something weaker. I was suggesting that trees get built against a por

Re: Out-of-core random forest implementation

2013-02-15 Thread Marty Kube
ementation... On 02/15/2013 03:09 AM, deneche abdelhakim wrote: On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: On 01/28/2013 02:33 PM, Ted Dunning wrote: I think I was suggesting something weaker. I was suggesting that trees get built against

Re: Out-of-core random forest implementation

2013-02-15 Thread Marty Kube
Even if you are not doing map reduce exactly, Hadoop does give you a nice infrastructure for running jobs across a lot of hosts. On 02/15/2013 04:00 PM, Ted Dunning wrote: Remember that Hadoop != map-reduce. If there is another style that we need to use, that isn't such a bad thing. On Fri, F

Re: Out-of-core random forest implementation

2013-02-15 Thread Marty Kube
/2013 03:09 AM, deneche abdelhakim wrote: On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: On 01/28/2013 02:33 PM, Ted Dunning wrote: I think I was suggesting something weaker. I was suggesting that trees get built against a portion of the data an

Re: Out-of-core random forest implementation

2013-02-14 Thread Marty Kube
On 01/28/2013 02:33 PM, Ted Dunning wrote: I think I was suggesting something weaker. I was suggesting that trees get built against a portion of the data and each node builds some number of trees against just the data it sees. This is in the spirit of random forests, but not the letter. I'm lo

[jira] [Created] (MAHOUT-1149) Partial Implementation wiki page is out off date

2013-02-09 Thread Marty Kube (JIRA)
Marty Kube created MAHOUT-1149: -- Summary: Partial Implementation wiki page is out off date Key: MAHOUT-1149 URL: https://issues.apache.org/jira/browse/MAHOUT-1149 Project: Mahout Issue Type

Re: Out-of-core random forest implementation

2013-01-28 Thread Marty Kube
I think the best design can be selected by starting with the simplest approach and letting results/data make the decisions. Is there a data set we could use to drive the design choices? On 01/28/2013 02:33 PM, Ted Dunning wrote: I think I was suggesting something weaker. I was suggesting that

Re: Out-of-core random forest implementation

2013-01-25 Thread Marty Kube
Hey Andy, What is the use case that is driving your question? Are you looking at the training phase - I didn't realise that one needed to keep the data in memory. I have a use case where even keeping the trees, much less the data, in memory during classification is an issue. Ted, do you have

hadoop 2.0

2013-01-11 Thread Marty Kube
Hey, I did a build against 2.0.2-alpha. It seemed to go well enough. Everything compiled and the unit tests passed. I was thinking the next step is to check runtime dependencies by running a few examples. Any suggestions on what would be good examples to try out? If the runtime shakes o

[jira] [Commented] (MAHOUT-1137) Same to MAHOUT-1061: ClassNotFoundException in mahout split -xm mapreduce: org.apache.mahout.utils.SplitInputJob$SplitInputMapper

2013-01-10 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550692#comment-13550692 ] Marty Kube commented on MAHOUT-1137: I'm pretty sure CH4 is not supported

[jira] [Commented] (MAHOUT-1136) Cannot import project into eclipse with m2e 1.2

2013-01-07 Thread Marty Kube (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546491#comment-13546491 ] Marty Kube commented on MAHOUT-1136: +1 The patch fixed the eclipse import pro

Re: mahout-pmml

2012-12-27 Thread Marty Kube
this module independently so that you don't have to wait for others to commit partial results. On Wed, Dec 26, 2012 at 6:52 PM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: I took a look at JPMML... At the bottom of it they have run a JAXB compiler on the PMML V4 schema

Re: mahout-pmml

2012-12-26 Thread Marty Kube
regards, Simon -- Forwarded message -- From: Simon Vocella Date: Mon, Dec 17, 2012 at 1:50 AM Subject: mahout-pmml To: Grant Ingersoll Cc: Marty Kube Hi Grant, I have started this project, https://github.com/voxsim/mahout-pmml (I have pushed only the skeleton for now), with mahout and jpmml int

Re: coding in Mahout

2012-12-10 Thread Marty Kube
Grant Ingersoll wrote: On Dec 6, 2012, at 9:46 PM, Marty Kube wrote: I'd work on model import before export. It seems to me that mahout has the scalable execution platform. Being able to import a model might be nice for cross-validation/QA against a model developed on a less scalable platfo

Re: coding in Mahout

2012-12-10 Thread Marty Kube
structures that they need. Logistic regression can be a special case of neural nets. There is also (I think) a specific PMML structure for them. On Sun, Dec 9, 2012 at 1:22 AM, Grant Ingersoll wrote: On Dec 6, 2012, at 9:46 PM, Marty Kube wrote: I'd work on model import before export. I

Re: coding in Mahout

2012-12-08 Thread Marty Kube
re is any problem of license? BSD vs Apache? Marty if you want to help me, or if you already start something, it's ok, my idea is to work with github, I already forked mahout. Simon On Fri, Dec 7, 2012 at 3:46 AM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: I'd wor

Re: coding in Mahout

2012-12-06 Thread Marty Kube
I'd work on model import before export. It seems to me that mahout has the scalable execution platform. Being able to import a model might be nice for cross-validation/QA against a model developed on a less scalable platform. On 12/06/2012 08:28 AM, Simon Vocella wrote: Ok , have you got al

Re: Link to mahout build status

2012-11-27 Thread Marty Kube
That's a good URL :-) Thanks! How would one fix the link on the home page? On 11/27/2012 07:16 PM, Ted Dunning wrote: Try https://builds.apache.org//job/Mahout-Quality/ instead. On Tue, Nov 27, 2012 at 4:00 PM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: Hey, I'

Link to mahout build status

2012-11-27 Thread Marty Kube
Hey, I've been working through getting a Mahout development environment set up. Sometimes things don't work out for me, so my first question is "is the build okay?" So, I go to http://mahout.apache.org/ and click on the "Code Quality Reports