[
https://issues.apache.org/jira/browse/MAHOUT-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022174#comment-13022174
]
Stanley Xu commented on MAHOUT-677:
-----------------------------------
Hi Sean,
I thought Ted already knew that the slowdown comes from addToVector, since he wrote
"Doing all of the allocations involved in such a copy-heavy programming style
costs quite a bit, and lots of people focus on reducing allocation costs by
re-using data structures extensively. The fact is, however, that the real cost
is not so much the cost of allocation, but the cost of copying the data over
and over. An additional cost is due to the fact that constructing a new String
object in Java involves creating the hash code for each new string.
Computation of the hash code costs almost as much as copying the data." on
page 252 of the book.
And even without counting the cost of String copying and hashCode, the fast
version still improves IO and text parsing.
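
To illustrate the kind of parsing meant here, below is a minimal sketch of
reading a double straight from a byte buffer without creating a String. It is
not the FastLineReader code from SimpleCsvExamples.java; the class name
ByteDoubleParser and the method parseDouble are hypothetical, and the sketch
assumes plain ASCII decimals with no exponent, NaN, or Infinity.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Mahout.
public class ByteDoubleParser {

    // Parses a plain decimal such as "-12.5" from buf[start, end) without
    // allocating a String. Assumes ASCII input, no exponent, and few enough
    // digits that the mantissa fits in a long.
    public static double parseDouble(byte[] buf, int start, int end) {
        int i = start;
        boolean negative = false;
        if (i < end && (buf[i] == '-' || buf[i] == '+')) {
            negative = buf[i] == '-';
            i++;
        }
        long mantissa = 0;      // all digits, with the decimal point ignored
        double scale = 1.0;     // 10^(number of digits after the point)
        boolean seenDot = false;
        for (; i < end; i++) {
            byte b = buf[i];
            if (b == '.' && !seenDot) {
                seenDot = true;
            } else if (b >= '0' && b <= '9') {
                mantissa = mantissa * 10 + (b - '0');
                if (seenDot) {
                    scale *= 10.0;
                }
            } else {
                throw new NumberFormatException(
                    "Unexpected byte " + b + " at offset " + i);
            }
        }
        double value = mantissa / scale;
        return negative ? -value : value;
    }

    public static void main(String[] args) {
        byte[] line = "3.25,-0.5,42".getBytes(StandardCharsets.US_ASCII);
        int fieldStart = 0;
        // Split on commas by scanning bytes; no intermediate Strings.
        for (int i = 0; i <= line.length; i++) {
            if (i == line.length || line[i] == ',') {
                System.out.println(parseDouble(line, fieldStart, i));
                fieldStart = i + 1;
            }
        }
    }
}
```

A parser like this trades generality for speed, so it would need a fallback
to Double.parseDouble for inputs outside the assumed format.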
But I think your idea of using fully binary input would be really helpful in
real production use: since the SGD algorithm itself is blazing fast, any
performance improvement in feature parsing, hashing, or encoding would improve
the overall performance a lot.
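
As a rough sketch of what binary input could look like, the example below
writes feature values as raw doubles and reads them back with the standard
java.io data streams, so no text parsing happens at read time. The class name
BinaryFeatureIo and the length-prefixed layout are assumptions made for
illustration only, not a format that Mahout defines.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical helper, not part of Mahout.
public class BinaryFeatureIo {

    // Writes a length prefix followed by each value as an 8-byte double.
    public static byte[] write(double[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (double v : values) {
                out.writeDouble(v);
            }
        }
        return bytes.toByteArray();
    }

    // Reads the values back; no String is created and no text is parsed.
    public static double[] read(byte[] data) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            double[] values = new double[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readDouble();
            }
            return values;
        }
    }

    public static void main(String[] args) throws IOException {
        double[] features = {3.25, -0.5, 42.0};
        double[] roundTrip = read(write(features));
        System.out.println(Arrays.equals(features, roundTrip));
    }
}
```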
> The SimpleCsvExamples didn't really parsed the double correctly with the
> FastLine and FastLineReader
> ----------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-677
> URL: https://issues.apache.org/jira/browse/MAHOUT-677
> Project: Mahout
> Issue Type: Bug
> Components: Examples
> Affects Versions: 0.5
> Reporter: Stanley Xu
> Priority: Minor
> Fix For: 0.5
>
> Attachments: simplecsvexamplebugfix.diff
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> The FastLineReader in SimpleCsvExamples.java tries to parse each line quickly
> by reading the bytes directly from the stream, avoiding the cost of copying
> Strings. But it does not parse the line correctly: all double values come
> out as zero in fast parsing mode.