[
https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karl Wettin updated MAHOUT-61:
------------------------------
Attachment: MAHOUT-61.txt
This is how it works now:
1. InstanceHandler gathers instances
2. TokenizationMapper, Reducer and Combiner create one intermediate
MapWritable instance (see [4]). These are reduced down to unique feature names
and class values.
3. The features and class values are placed in maps, assigned column indices
and numeric values, and stored as a MapFile on DFS.
4. VectorBuilderMapper is a map-only job that uses the results from [2] and
[3] to produce sparse vectors.
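
For readers who want the data flow of steps [2] to [4] without Hadoop, here is
a rough in-memory sketch in Python. It is not the code in the attachment: the
function names, the whitespace tokenizer, the sample instances and the use of
term counts as vector values are all assumptions made for illustration.

```python
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_index(instances):
    """Roughly steps [2] and [3]: reduce all instances down to unique
    feature names and class values, then assign each a column index."""
    features = sorted({tok for text, _ in instances for tok in tokenize(text)})
    classes = sorted({label for _, label in instances})
    feature_index = {name: i for i, name in enumerate(features)}
    class_index = {name: i for i, name in enumerate(classes)}
    return feature_index, class_index

def build_vectors(instances, feature_index, class_index):
    """Roughly step [4]: a map-only pass that turns each instance into
    (class value, sparse vector), where the sparse vector is a dict of
    column index -> term count (the count weighting is an assumption)."""
    vectors = []
    for text, label in instances:
        counts = Counter(tokenize(text))
        sparse = {feature_index[tok]: float(n) for tok, n in counts.items()}
        vectors.append((class_index[label], sparse))
    return vectors

# Hypothetical instances: (text, class value).
instances = [
    ("the cat sat", "animals"),
    ("the car stopped", "autos"),
    ("cat cat", "animals"),
]
feature_index, class_index = build_index(instances)
vectors = build_vectors(instances, feature_index, class_index)
```

In the real job the two index maps would be read back from the MapFile on DFS
rather than passed around in memory, which is why step [4] needs no reducer.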
> Text problem matrix builder
> ----------------------------
>
> Key: MAHOUT-61
> URL: https://issues.apache.org/jira/browse/MAHOUT-61
> Project: Mahout
> Issue Type: New Feature
> Reporter: Karl Wettin
> Assignee: Karl Wettin
> Priority: Minor
> Attachments: MAHOUT-61.txt, MAHOUT-61.txt, MAHOUT-61.txt
>
>
> A set of classes that builds matrices from text.
> Currently the API consists of TokenMatrixBuilder and TokenInstanceBuilder.
> Should be thread safe.
> PostReader imports 20news-bydate. This takes several GB heap. It would be
> nice to bounce the data via JDBM or perhaps using the PersistentHashMap in
> MAHOUT-19.