> 1- I removed a "with nogil" statement [2]. Is there a way to keep it?


In order to keep it, you can pass directly the bumpy array indices, indptr, 
data and nnz to the splitter. Detecting the input
type could be done easily in the python code.

> 2- Toy sparse input test is failing, and I think it's due to incorrect values 
> being read from X.data [3] (the method considers the matrix is in CSC 
> format). Can you see something I'm doing wrong here?


Tests seems to fail, because you also test with the other splitters  (random 
splitter, pre-sort best splitter)
which don't implement the sparse matrix support.


I have looked a bit at your code and it’s a great start. It would be easier to 
help you if you open a pull request.

Best,
Arnaud



On 07 Feb 2014, at 19:55, Felipe Eltermann <[email protected]> wrote:

> Arnaud,
> 
> I added a issparse attribute to Splitter base class.
> Doing so, I think I managed to introduce sparse support without the need of 
> replicating Splitters' business logic code.
> I'm working on this branch [1]
> 
> I have two questions:
> 1- I removed a "with nogil" statement [2]. Is there a way to keep it?
> 2- Toy sparse input test is failing, and I think it's due to incorrect values 
> being read from X.data [3] (the method considers the matrix is in CSC 
> format). Can you see something I'm doing wrong here?
> 
> 
> [1] https://github.com/eltermann/scikit-learn/commits/DT-sparse
> [2] 
> https://github.com/eltermann/scikit-learn/commit/8388bedff4e225cda9a1b2b6e3fc250bb7d22276#diff-a2cead4f3702cc4b9f76562bb2777edbL2297
> [3] 
> https://github.com/eltermann/scikit-learn/commit/5ba9c367661446c3eba7e6ea54adc1ff5cdfd39f#diff-a2cead4f3702cc4b9f76562bb2777edbR1281
> 
> On Wed, Feb 5, 2014 at 10:34 AM, Arnaud Joly <[email protected]> wrote:
> I think that I would go for the option that minimize the amount of code 
> duplication.
> I would probably start with 2. Since we don’t pickle anymore the Splitter and 
> criterion, the constructor 
> arguments could be used to pass the X and the y matrix.
> 
> Cheers,
> Arnaud
> 
>  
> 
> On 04 Feb 2014, at 17:38, Felipe Eltermann <[email protected]> wrote:
> 
>> >>  How much code in our current implementation depends on the data 
>> >> representation?
>> > Not much actually. It now basically boils down to simply write a new 
>> > splitter object. Everything else remains the same. So basically, I would 
>> > say that it amounts to 300~ lines of Cython (out of the 2300 lines in our 
>> > implementation).
>> 
>> Gilles,
>> 
>> Splitter base class assumes X to be a dense, 2-dimensional, numpy array.
>> Available splitters include: BestSplitter, PresortBestSplitter and 
>> RandomSplitter (all inheriting Splitter).
>> 
>> That said, I think there are these possible architectures:
>> 
>> 1-
>> - make "Splitter" base class support both dense and sparse representation
>> - implicitly assume that "BestSplitter", "PresortBestSplitter" and 
>> "RandomSplitter" are dealing with dense structure
>> - create "SparseSplitter" that inherits "DenseSplitter" and implicitly 
>> assumes the sparse structure
>> 
>> 2-
>> - replace "Splitter" by "DenseSplitter"
>> - make "BestSplitter", "PresortBestSplitter" and "RandomSplitter" inherit 
>> from "DenseSplitter"
>> - create "SparseSplitter"
>> - force "SparseSplitter" usage when the input is a sparse matrix
>> 
>> Which one should we use?
>> Also, which splitting approach should be used on SparseSplitter? Why not 
>> "SparseBestSplitter", "SparsePresortBestSplitter" and "SparseRandomSplitter"?
>> ------------------------------------------------------------------------------
>> Managing the Performance of Cloud-Based Applications
>> Take advantage of what the Cloud has to offer - Avoid Common Pitfalls.
>> Read the Whitepaper.
>> http://pubads.g.doubleclick.net/gampad/clk?id=121051231&iu=/4140/ostg.clktrk_______________________________________________
>> 
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 
> 
> ------------------------------------------------------------------------------
> Managing the Performance of Cloud-Based Applications
> Take advantage of what the Cloud has to offer - Avoid Common Pitfalls.
> Read the Whitepaper.
> http://pubads.g.doubleclick.net/gampad/clk?id=121051231&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 
> 
> ------------------------------------------------------------------------------
> Managing the Performance of Cloud-Based Applications
> Take advantage of what the Cloud has to offer - Avoid Common Pitfalls.
> Read the Whitepaper.
> http://pubads.g.doubleclick.net/gampad/clk?id=121051231&iu=/4140/ostg.clktrk_______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

------------------------------------------------------------------------------
Managing the Performance of Cloud-Based Applications
Take advantage of what the Cloud has to offer - Avoid Common Pitfalls.
Read the Whitepaper.
http://pubads.g.doubleclick.net/gampad/clk?id=121051231&iu=/4140/ostg.clktrk
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to