[
https://issues.apache.org/jira/browse/MAHOUT-56?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deneche A. Hakim updated MAHOUT-56:
-----------------------------------
Attachment: watchmaker-tsp.patch
*Changes*
* org.apache.mahout.ga.watchmaker.MahoutEvaluator removes any existing input
directory before storing the population
* org.apache.mahout.ga.watchmaker.cd.FileInfosParser uses the CATEGORICAL token
for symbolic (nominal) attributes. This makes it easy to identify a token using
the first character.
* org.apache.mahout.ga.watchmaker.cd.tool.CDInfosTool is used to generate the
.infos file needed by the CDGA for a new dataset.
The new tool works as follows:
* it is invoked using the following command (the dataset path is given as a
parameter):
{noformat}
$ ~/hadoop-0.17.0/bin/hadoop jar apache-mahout-0.1-dev-ex.jar
org.apache.mahout.ga.watchmaker.cd.tool.CDInfosTool dataset_path
{noformat}
* the tool searches for an existing infos file, in the same directory as the
dataset, with the same name and the ".infos" extension, that contains the
type of each attribute:
** 'N' numerical attribute
** 'C' categorical attribute
** 'L' label (this is also a categorical attribute)
** 'I' to ignore the attribute
Each attribute is on a separate line.
* the tool uses a Hadoop job to parse the dataset and collect the information
* the results are written back to the same .infos file, in a format compatible
with CDGA
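
The steps above can be sketched in a few lines. This is only an in-memory illustration in Python, not the Hadoop job from the patch: the function name {{collect_infos}} is hypothetical, the dataset is assumed to be non-empty and already split into fields, and the {{IGNORED}} output token is an assumption (the example file in this comment only shows NUMERICAL, CATEGORICAL and LABEL lines). Number formatting may also differ from what the Java tool writes.

```python
def collect_infos(descriptors, rows):
    """Build .infos-style lines from per-attribute descriptors and data rows.

    descriptors: one character per attribute ('N', 'C', 'L' or 'I').
    rows: iterable of rows, each a list of string fields.
    Hypothetical helper: a single-machine stand-in for the Hadoop job.
    """
    ranges = {}   # attribute index -> [min, max] for numerical attributes
    values = {}   # attribute index -> dict used as an insertion-ordered set

    for row in rows:
        for i, kind in enumerate(descriptors):
            if kind == "N":
                x = float(row[i])
                if i not in ranges:
                    ranges[i] = [x, x]
                else:
                    ranges[i][0] = min(ranges[i][0], x)
                    ranges[i][1] = max(ranges[i][1], x)
            elif kind in ("C", "L"):
                values.setdefault(i, {})[row[i]] = None
            # 'I' attributes are skipped entirely

    lines = []
    for i, kind in enumerate(descriptors):
        if kind == "N":
            lo, hi = ranges[i]
            lines.append(f"NUMERICAL, {lo},{hi}")
        elif kind == "C":
            lines.append("CATEGORICAL, " + ",".join(values[i]))
        elif kind == "L":
            lines.append("LABEL, " + ",".join(values[i]))
        elif kind == "I":
            lines.append("IGNORED")  # assumed token, not shown in the example
        else:
            raise ValueError(f"unknown descriptor {kind!r} for attribute {i}")
    return lines


demo_rows = [
    ["0", "tcp", "x", "normal."],
    ["5.5", "udp", "y", "smurf."],
    ["2", "tcp", "z", "normal."],
]
demo_lines = collect_infos(["N", "C", "I", "L"], demo_rows)
```

Numerical attributes reduce to a min/max pair and categorical ones to a set of distinct values, so both can be merged across partial results, which is presumably what makes the collection a good fit for a Hadoop job.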
For example, this is the .infos file generated for the [KDDCup
(1999)|http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html] 10% Training
Dataset:
{panel:title=kddcup.data_10_percent.infos}
NUMERICAL, 0.0,58329.0
CATEGORICAL, icmp,udp,tcp
CATEGORICAL, rje,login,time,systat,ntp_u,mtp,uucp_path,bgp,nntp,efs,Z39_50,csnet_ns,tim_i,X11,telnet,ftp_data,finger,other,exec,uucp,netstat,klogin,ecr_i,remote_job,urh_i,netbios_dgm,pop_2,auth,private,shell,printer,kshell,urp_i,vmnet,pop_3,echo,daytime,iso_tsap,courier,tftp_u,sunrpc,red_i,ctf,supdup,gopher,ssh,sql_net,name,smtp,hostnames,netbios_ssn,ftp,IRC,imap4,netbios_ns,http,ldap,eco_i,link,http_443,domain_u,discard,nnsp,pm_dump,domain,whois
CATEGORICAL, S2,SF,OTH,S0,S3,RSTR,RSTO,SH,S1,RSTOS0,REJ
NUMERICAL, 0.0,6.9337562E8
NUMERICAL, 0.0,5155468.0
CATEGORICAL, 0,1
NUMERICAL, 0.0,3.0
NUMERICAL, 0.0,3.0
NUMERICAL, 0.0,30.0
NUMERICAL, 0.0,5.0
CATEGORICAL, 0,1
NUMERICAL, 0.0,884.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,2.0
NUMERICAL, 0.0,993.0
NUMERICAL, 0.0,28.0
NUMERICAL, 0.0,2.0
NUMERICAL, 0.0,8.0
NUMERICAL, 0.0,1.4E-45
CATEGORICAL, 0
CATEGORICAL, 0,1
NUMERICAL, 0.0,511.0
NUMERICAL, 0.0,511.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,255.0
NUMERICAL, 0.0,255.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
LABEL, teardrop.,ipsweep.,phf.,nmap.,land.,portsweep.,warezmaster.,smurf.,guess_passwd.,ftp_write.,perl.,loadmodule.,back.,imap.,normal.,pod.,spy.,neptune.,satan.,buffer_overflow.,rootkit.,warezclient.,multihop.
{panel}
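
To illustrate why distinct first characters make the tokens easy to identify, here is a minimal Python sketch of reading one line of a file like the one above. It is not the FileInfosParser code: the function name {{parse_infos_line}} and the returned tuples are hypothetical, each attribute is assumed to sit on a single line (a long value list wrapped by the mail client would need to be joined first), and the handling of an 'I' token is an assumption since the example file contains none.

```python
def parse_infos_line(line):
    """Parse one .infos line, dispatching on the token's first character.

    Hypothetical helper; returns a tuple describing the attribute.
    """
    token, _, rest = line.partition(",")
    token = token.strip()
    rest = rest.strip()
    if not token:
        raise ValueError("empty line")

    kind = token[0]  # 'N', 'C', 'L' or 'I' is enough to tell tokens apart
    if kind == "N":
        lo, hi = (float(v) for v in rest.split(","))
        return ("numerical", lo, hi)
    if kind == "C":
        return ("categorical", rest.split(","))
    if kind == "L":
        return ("label", rest.split(","))
    if kind == "I":
        return ("ignored",)  # assumed: an ignored attribute carries no values
    raise ValueError(f"unknown token {token!r}")


parsed_numeric = parse_infos_line("NUMERICAL, 0.0,58329.0")
parsed_protocol = parse_infos_line("CATEGORICAL, icmp,udp,tcp")
parsed_single = parse_infos_line("CATEGORICAL, 0")
```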
*What's Next*
* I think I found a quick workaround to allow CDGA to handle multi-class
classification; I should implement it and try it on the KDD dataset
* Run the code on a small cluster and hope that it will work :P
> Watchmaker Integration
> ----------------------
>
> Key: MAHOUT-56
> URL: https://issues.apache.org/jira/browse/MAHOUT-56
> Project: Mahout
> Issue Type: Task
> Components: Genetic Algorithms
> Reporter: Deneche A. Hakim
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.1
>
> Attachments: libs.zip, libs.zip, libs.zip, tsp-screenshot-1.jpg,
> watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch,
> watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch,
> watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch,
> watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch,
> watchmaker-tsp.patch, watchmaker-tsp.patch
>
>
> The goal of this task is to allow watchmaker defined problems to be solved in
> Mahout.