[ 
https://issues.apache.org/jira/browse/MAHOUT-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022098#comment-13022098
 ] 

Stanley Xu commented on MAHOUT-677:
-----------------------------------

Hi Sean,

The code is a proof of concept showing that the time spent parsing the 
inputs can be optimized. This optimization matters for a sequential algorithm 
like SGD, because most of SGD's time is spent parsing and hashing the 
features, as mentioned in Chapter 16.3.4 of Mahout in Action. In our tests, 
more than 80% of the time was spent parsing the inputs and putting them into 
a Vector, while training took only about 10% of the time and IO took another 
10%. So I would say the optimization can be considered "required".

The fast mode is meant to be used on the text generated by 
SimpleCsvExamples.java, whose values are integers. My guess as to why they use 
double is that there was no Vector implementation that uses an int array to 
store the content.
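
To make the idea of parsing bytes without creating Strings concrete, here is a 
minimal sketch. It is not the FastLineReader code from SimpleCsvExamples.java 
or the attached patch; the class name ByteLineParser and the method parseLine 
are hypothetical names chosen for illustration. It assumes ASCII input, comma 
separators, and plain decimal numbers (no exponents, no overflow handling).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not Mahout's FastLineReader.
final class ByteLineParser {

    /**
     * Parses comma-separated decimal numbers from buf[0..len) up to the first
     * newline, writing them into out. Returns how many values were parsed.
     * No String objects are allocated along the way.
     */
    static int parseLine(byte[] buf, int len, double[] out) {
        int count = 0;
        int i = 0;
        while (i < len && buf[i] != '\n' && count < out.length) {
            boolean negative = false;
            if (buf[i] == '-') {
                negative = true;
                i++;
            }
            long mantissa = 0;      // all digits seen, ignoring the decimal point
            double divisor = 1.0;   // 10^(number of digits after the point)
            boolean seenPoint = false;
            while (i < len) {
                byte b = buf[i];
                if (b >= '0' && b <= '9') {
                    mantissa = mantissa * 10 + (b - '0');
                    if (seenPoint) {
                        divisor *= 10.0;
                    }
                } else if (b == '.' && !seenPoint) {
                    seenPoint = true;
                } else {
                    break;          // separator, newline, or unexpected byte
                }
                i++;
            }
            double value = mantissa / divisor;
            out[count++] = negative ? -value : value;
            if (i < len && buf[i] == ',') {
                i++;                // skip the separator and parse the next field
            } else {
                break;              // end of line or unexpected byte: stop
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] line = "12,3.5,-7\n".getBytes(StandardCharsets.US_ASCII);
        double[] values = new double[3];
        int n = parseLine(line, line.length, values);
        System.out.println(n + " values: " + Arrays.toString(values));
    }
}
```

The point of the sketch is that the fractional part has to be handled 
explicitly; a parser that only accumulates integer digits will work on 
integer input but will not produce correct doubles.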

> The SimpleCsvExamples didn't really parsed the double correctly with the 
> FastLine and FastLineReader
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-677
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-677
>             Project: Mahout
>          Issue Type: Bug
>          Components: Examples
>    Affects Versions: 0.5
>            Reporter: Stanley Xu
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: simplecsvexamplebugfix.diff
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The FastLineReader in SimpleCsvExamples.java tries to parse each line quickly 
> by parsing the bytes directly from the stream, without the cost of copying 
> Strings. But it does not parse the line correctly, and all double values 
> come out as zero in fast parsing mode.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
