[
https://issues.apache.org/jira/browse/MAHOUT-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022152#comment-13022152
]
Stanley Xu commented on MAHOUT-677:
-----------------------------------
Hi Sean,
First, most of the time is spent not on IO or even parsing, but on copying a
String over and over, which is also mentioned in your book <Mahout In Action>,
and I reproduced the same result with a test. The code that costs the most
time is:
encoder[i].addToVector(x.get(i), v);
which copies a String once and, I guess, has to generate the hashCode once
again. So for the SGD algorithm, what should actually be avoided is the
conversion to a String when parsing the data.
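To illustrate avoiding the String conversion, here is a minimal sketch. It is
not the code from the patch and not Mahout's encoder API: it hashes each
comma-separated field straight from the byte buffer into a plain double[] that
stands in for the vector. ByteFieldHasher, hashRange, bucket and addFields are
hypothetical names, and FNV-1a is just a hash picked for the example.

```java
// Hypothetical sketch: hash CSV fields directly from bytes, no String created.
final class ByteFieldHasher {

    // 32-bit FNV-1a over buf[start, end). Chosen only for illustration.
    static int hashRange(byte[] buf, int start, int end) {
        int h = 0x811c9dc5;
        for (int i = start; i < end; i++) {
            h ^= buf[i] & 0xff;
            h *= 0x01000193;
        }
        return h;
    }

    // Map the hash of a byte range to a non-negative index below numFeatures.
    static int bucket(byte[] buf, int start, int end, int numFeatures) {
        return Math.floorMod(hashRange(buf, start, end), numFeatures);
    }

    // Walk the first `length` bytes of `line`, split on ',', and add 1.0 to
    // the bucket of each field. `v` is a stand-in for a feature vector.
    static void addFields(byte[] line, int length, double[] v) {
        int start = 0;
        for (int i = 0; i <= length; i++) {
            if (i == length || line[i] == ',') {
                v[bucket(line, start, i, v.length)] += 1.0;
                start = i + 1;
            }
        }
    }
}
```

The point is only that the field boundaries and the hash can both be computed
on the raw bytes, so no intermediate String or second hashCode pass is needed.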
I agree that the performance could be optimized further with a customized
binary input format. But I think the example here is good enough, since it
proves the idea and is easy to read. Using a customized binary format might
make the code or data hard to read, and in my experience a binary protocol
like Thrift is even slower at parsing the data than a customized parser for
plain text.
Anyway, it is your call. Why don't you ask the author of Chapter 16.3.4 of
<Mahout In Action> to decide whether you need a better example or can just use
the patch here?
> The SimpleCsvExamples didn't really parse the doubles correctly with the
> FastLine and FastLineReader
> ----------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-677
> URL: https://issues.apache.org/jira/browse/MAHOUT-677
> Project: Mahout
> Issue Type: Bug
> Components: Examples
> Affects Versions: 0.5
> Reporter: Stanley Xu
> Priority: Minor
> Fix For: 0.5
>
> Attachments: simplecsvexamplebugfix.diff
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> The FastLineReader in SimpleCsvExamples.java tries to parse each line quickly
> by parsing the bytes directly from the stream, without the cost of copying
> Strings. But it doesn't parse the line correctly, and all double values come
> out as zero in fast parsing mode.
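
As a rough illustration of what parsing a double directly from bytes can look
like, here is a minimal sketch. It is not the FastLineReader code and not the
attached patch. ByteDoubleParser and parseDouble are hypothetical names, and
the parser is deliberately limited: plain decimals with an optional sign, no
exponent, and no more digits than fit in a long.

```java
// Hypothetical sketch: read a decimal number from buf[start, end) without
// building a String first.
final class ByteDoubleParser {

    static double parseDouble(byte[] buf, int start, int end) {
        int i = start;
        boolean negative = false;
        if (i < end && (buf[i] == '-' || buf[i] == '+')) {
            negative = buf[i] == '-';
            i++;
        }
        long mantissa = 0;      // all digits, ignoring the decimal point
        double divisor = 1.0;   // 10^(number of digits after the point)
        boolean seenPoint = false;
        boolean seenDigit = false;
        for (; i < end; i++) {
            byte c = buf[i];
            if (c >= '0' && c <= '9') {
                mantissa = mantissa * 10 + (c - '0');
                if (seenPoint) {
                    divisor *= 10.0;
                }
                seenDigit = true;
            } else if (c == '.' && !seenPoint) {
                seenPoint = true;
            } else {
                throw new NumberFormatException("unexpected byte at offset " + i);
            }
        }
        if (!seenDigit) {
            throw new NumberFormatException("no digits in field");
        }
        double value = mantissa / divisor;
        return negative ? -value : value;
    }
}
```

A parser like this has to accumulate the digits after the decimal point as
well as those before it; if the fractional part or the final scaling is
skipped, the values come out wrong even though the line is read quickly.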