>> How much code in our current implementation depends on the data
>> representation?
> Not much, actually. It now basically boils down to writing a new
> splitter object. Everything else remains the same. So I would say
> that it amounts to ~300 lines of Cython (out of the 2300 lines in our
> implementation).
Gilles,
The *Splitter* base class assumes X to be a dense, 2-dimensional NumPy array.
The available splitters are *BestSplitter*, *PresortBestSplitter* and
*RandomSplitter* (all inheriting from *Splitter*).
That said, I think there are these possible architectures:
1-
- make the "Splitter" base class support both dense and sparse representations
- implicitly assume that "BestSplitter", "PresortBestSplitter" and
"RandomSplitter" are dealing with a dense structure
- create "SparseSplitter" that inherits from "Splitter" and implicitly
assumes a sparse structure
2-
- replace "Splitter" by "DenseSplitter"
- make "BestSplitter", "PresortBestSplitter" and "RandomSplitter" inherit
from "DenseSplitter"
- create "SparseSplitter"
- force "SparseSplitter" usage when the input is a sparse matrix
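
To make option 2 concrete, here is a rough pure-Python sketch of the hierarchy
and the dispatch step. This is only an illustration: the real splitters are
Cython classes, and the "looks_sparse" and "select_splitter" helpers, the
"init" signatures and the class bodies are all hypothetical names I made up
for the example. Option 1 would look the same, except that a single
"Splitter" base would remain and "SparseSplitter" would inherit from it.

```python
# Hypothetical sketch of option 2; not the actual Cython implementation.


def looks_sparse(X):
    # Assumption for this sketch: sparse matrices expose a tocsc() method
    # (as scipy.sparse matrices do), while dense arrays do not.
    return hasattr(X, "tocsc")


class DenseSplitter:
    """Base class for splitters that index X as a dense 2-D array."""

    def init(self, X):
        if looks_sparse(X):
            raise TypeError("DenseSplitter requires a dense array")
        self.X = X


class BestSplitter(DenseSplitter):
    pass  # dense best-split search would live here


class PresortBestSplitter(DenseSplitter):
    pass  # dense presorted search would live here


class RandomSplitter(DenseSplitter):
    pass  # dense random-threshold search would live here


class SparseSplitter:
    """Separate class that only ever sees a sparse X."""

    def init(self, X):
        if not looks_sparse(X):
            raise TypeError("SparseSplitter requires a sparse matrix")
        self.X = X


def select_splitter(X, dense_cls=BestSplitter):
    # "Force SparseSplitter usage when the input is a sparse matrix."
    splitter = SparseSplitter() if looks_sparse(X) else dense_cls()
    splitter.init(X)
    return splitter
```

With this layout the tree builder never needs to know which representation
it was given; only the dispatch function does.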
Which one should we use?
Also, which splitting strategy should "SparseSplitter" implement? Why not
provide "SparseBestSplitter", "SparsePresortBestSplitter" and
"SparseRandomSplitter", mirroring the dense splitters?