[ 
https://issues.apache.org/jira/browse/MAHOUT-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-633:
-----------------------------

    Attachment: MAHOUT-633.patch

I'm through core now, but no tests or other modules. It's already an epic 
patch, but I think it's quite good. It actually deletes 100 lines more of code 
than it adds, and reduces some duplication, and even does a bit better job at 
cleanup.

I'm still working on this patch.

The one possible downside is that this implementation creates a new Writable 
for each read. This is mildly positive in that it avoids some common bugs in 
reading from sequence files wherein the caller doesn't realize it's storing a 
copy to a Writable that's changing. (The Mahout code is cloning/instantiating 
new ones in most cases anyways as it has to) It does mean more objects 
allocated though. While I think the overhead of that is minor, probably, 
compared to the I/O of the read itself, it's not obviously trivial.

I think this can be designed around in the wrapper -- for instance often the 
key is not used so that can be embedded in the logic of the iterator to avoid 
allocations.

The only gotchas to watch out for in this patch are, first, that change / fix? 
to VectorCache. And second Dmitriy when I ran into some of your latest commits 
I left several uses of SequenceFile.Reader as-is since it needs to be done that 
way, but ended up accumulating a load of small style changes. That bit of the 
patch is noisy. (There was also a section where I think I elided use of a 
"closeables" object in the case where it seemed to be a no-op?)

Thoughts?

> Add SequenceFileIterable; put Iterable stuff in one place
> ---------------------------------------------------------
>
>                 Key: MAHOUT-633
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-633
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: iterable, iterator, sequence-file
>             Fix For: 0.5
>
>         Attachments: MAHOUT-633.patch, MAHOUT-633.patch
>
>
> In another project I have a useful little class, SequenceFileIterable, which 
> simplifies iterating over a sequence file. It's like FileLineIterable. I'd 
> like to add it, then use it throughout the code. See patch, which for now 
> merely has the proposed new classes. 
> Well it also moves some other iterator-related classes that seemed to be 
> outside their rightful home in common.iterator.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to