[ 
https://issues.apache.org/jira/browse/MAHOUT-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022174#comment-13022174
 ] 

Stanley Xu commented on MAHOUT-677:
-----------------------------------

Hi Sean,

I thought Ted already knew that the slowdown comes from addToVector, since he 
wrote on page 252 of the book: "Doing all of the allocations involved in such 
a copy-heavy programming style costs quite a bit, and lots of people focus on 
reducing allocation costs by re-using data structures extensively. The fact 
is, however, that the real cost is not so much the cost of allocation, but the 
cost of copying the data over and over. An additional cost is due to the fact 
that constructing a new String object in Java involves creating the hash code 
for each new string. Computation of the hash code costs almost as much as 
copying the data."

Even without counting the cost of String copying and hashCode, the fast 
version still improves on IO and text parsing. 
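
To make the text parsing point concrete, here is a minimal sketch of reading a 
plain decimal value straight from a byte range without building a String 
first. It is not the code in SimpleCsvExamples or in the attached diff; the 
class name ByteDoubleParser and its parse method are made up for illustration, 
and it only handles an optional sign, digits, and one decimal point (no 
exponents, NaN, or infinity).

```java
/** Hypothetical helper, for illustration only; not part of Mahout. */
final class ByteDoubleParser {
    private ByteDoubleParser() {}

    /**
     * Parses a plain decimal such as "-12.375" from buf[start, end) without
     * allocating a String. Assumes ASCII input and at most about 18 digits in
     * total, because the digits are accumulated in a long.
     */
    static double parse(byte[] buf, int start, int end) {
        int i = start;
        boolean negative = false;
        if (i < end && (buf[i] == '-' || buf[i] == '+')) {
            negative = buf[i] == '-';
            i++;
        }
        long mantissa = 0;      // all digits, ignoring the decimal point
        double scale = 1.0;     // 10^(number of digits after the point)
        boolean seenPoint = false;
        boolean seenDigit = false;
        for (; i < end; i++) {
            byte b = buf[i];
            if (b >= '0' && b <= '9') {
                mantissa = mantissa * 10 + (b - '0');
                seenDigit = true;
                if (seenPoint) {
                    scale *= 10.0;
                }
            } else if (b == '.' && !seenPoint) {
                seenPoint = true;
            } else {
                throw new NumberFormatException("Unexpected byte at offset " + i);
            }
        }
        if (!seenDigit) {
            throw new NumberFormatException("No digits found");
        }
        double value = mantissa / scale;
        return negative ? -value : value;
    }
}
```

A caller would locate the field boundaries (the commas) in the line buffer and 
pass each range to parse, so no String is created per field.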

But I think your idea of using fully binary input would be really helpful in 
real production use. Since the SGD algorithm itself is blazing fast, any 
performance improvement in feature parsing, hashing, or encoding would improve 
the overall performance a lot.
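
As a rough sketch of what binary input could look like, the snippet below 
stores each row as an int count followed by that many 8-byte doubles and reads 
it back with java.nio.ByteBuffer, so there is no text parsing at all. The 
class name BinaryRecords and this record layout are assumptions for 
illustration only, not an existing Mahout format.

```java
import java.nio.ByteBuffer;

/** Hypothetical record layout, for illustration only; not a Mahout format. */
final class BinaryRecords {
    private BinaryRecords() {}

    /** Encodes each row as an int length followed by that many doubles. */
    static ByteBuffer encode(double[][] rows) {
        int size = 0;
        for (double[] row : rows) {
            size += Integer.BYTES + row.length * Double.BYTES;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (double[] row : rows) {
            buf.putInt(row.length);
            for (double v : row) {
                buf.putDouble(v);
            }
        }
        buf.flip(); // switch the buffer from writing to reading
        return buf;
    }

    /**
     * Reads the next row. The reuse array is filled and returned when its
     * length matches, so a read loop does not allocate per row.
     */
    static double[] readRow(ByteBuffer buf, double[] reuse) {
        int n = buf.getInt();
        double[] row = (reuse != null && reuse.length == n) ? reuse : new double[n];
        for (int i = 0; i < n; i++) {
            row[i] = buf.getDouble();
        }
        return row;
    }
}
```

Passing the previous row back in as reuse keeps the read loop free of 
per-record allocation when all rows have the same width.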


> The SimpleCsvExamples didn't really parsed the double correctly with the 
> FastLine and FastLineReader
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-677
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-677
>             Project: Mahout
>          Issue Type: Bug
>          Components: Examples
>    Affects Versions: 0.5
>            Reporter: Stanley Xu
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: simplecsvexamplebugfix.diff
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The FastLineReader in SimpleCsvExamples.java tries to parse each line 
> quickly by reading the bytes directly from the stream, avoiding the cost of 
> copying Strings. However, it does not parse the line correctly, and all 
> double values come out as zero in fast parsing mode.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
