Re: [Numpy-discussion] Histograms of extremely large data sets

eric jones Tue, 12 Dec 2006 23:42:33 -0800

Hey Cameron,

I wrote a simple weave based histogram function that should work foryour problem. It should work for any array input data type. The neededfiles (and a few tests and examples) are attached.

Below is the output from the histogram_speed.py file attached. The testtakes about 10 seconds to bin a uniformly distributed set of data from a1000x1000x100 uint8 array into 256 bins. It compares Travis' nifty newiterator based indexing in numpy to raw C indexing of a contiguousarray. The two algorithms give identical results, and the speeddifference is negligible. That's cool because the iterator based stuffmakes this sort of algorithms quite easy to handle for N-dimensional.


Hope that helps,
eric

ps. For those who care, I had to make a minor change to the array typeconverters so that they can be used with the iterator interface moreeasily. Later this will be folded into weave, but for now I sub-classedthe standard array converter and made the modifications.


# speed test output.
c:\eric\code\histogram> histogram_speed.py
type: uint8
millions of elements: 100.0
sec (C indexing based): 9.52776707654
[390141 390352 390598 389706 390985 390856 389785 390262 389929 391024
391854 390243 391255 390723 390525 391751 389842 391612 389601 391210
390799 391674 390693 390381 390460 389839 390185 390909 390215 391271
390934 390818 390528 389990 389982 389667 391035 390317 390616 390916
390191 389771 391448 390325 390556 391333 390148 390894 389611 390511
390614 390999 389646 391255 391284 391214 392106 391067 391480 389991
391091 390271 389801 390044 391459 390644 391309 390450 390200 391537
390907 390160 391117 390738 391638 391200 390815 390611 390355 389925
390939 390932 391569 390287 389987 389545 391140 391280 389773 389794
389559 390085 389991 391372 390189 391010 390863 390432 390743 390959
389271 390210 390967 390999 391177 389777 391748 390623 391597 392009
389308 390557 390213 390930 390449 390327 390600 390626 389985 390816
389671 390187 390595 390973 390921 390599 390167 391196 390381 391345
392166 389709 390656 389886 390646 390355 391273 391342 390234 390751
390515 390048 390455 391122 391069 390968 390488 390708 391027 391179
391110 390453 390632 390825 391369 390844 390001 391487 390778 390788
390609 390254 389907 391803 391508 391414 391012 389987 389284 390699
391094 390658 390463 390291 390848 389616 390894 389561 390971 391165
391378 391698 389434 390591 390027 391088 390787 391165 390169 391212
389799 389829 389764 390435 391158 391834 391206 390041 391537 390237
390253 391025 392336 391081 390005 391057 390226 390240 390197 389906
391164 391157 390639 391501 389125 389922 390961 390012 389832 389650
390018 390461 390695 390140 390939 389089 391094 390076 391123 389518
391340 390039 390786 391751 391133 390675 392305 390667 391243 389889
390103 390438 389215 389805 392180 391351 389923 390932 390136 390556
389684 390324 390152 390982 391355]
sec (numpy iteration based): 10.3055525213
[390141 390352 390598 389706 390985 390856 389785 390262 389929 391024
391854 390243 391255 390723 390525 391751 389842 391612 389601 391210
390799 391674 390693 390381 390460 389839 390185 390909 390215 391271
390934 390818 390528 389990 389982 389667 391035 390317 390616 390916
390191 389771 391448 390325 390556 391333 390148 390894 389611 390511
390614 390999 389646 391255 391284 391214 392106 391067 391480 389991
391091 390271 389801 390044 391459 390644 391309 390450 390200 391537
390907 390160 391117 390738 391638 391200 390815 390611 390355 389925
390939 390932 391569 390287 389987 389545 391140 391280 389773 389794
389559 390085 389991 391372 390189 391010 390863 390432 390743 390959
389271 390210 390967 390999 391177 389777 391748 390623 391597 392009
389308 390557 390213 390930 390449 390327 390600 390626 389985 390816
389671 390187 390595 390973 390921 390599 390167 391196 390381 391345
392166 389709 390656 389886 390646 390355 391273 391342 390234 390751
390515 390048 390455 391122 391069 390968 390488 390708 391027 391179
391110 390453 390632 390825 391369 390844 390001 391487 390778 390788
390609 390254 389907 391803 391508 391414 391012 389987 389284 390699
391094 390658 390463 390291 390848 389616 390894 389561 390971 391165
391378 391698 389434 390591 390027 391088 390787 391165 390169 391212
389799 389829 389764 390435 391158 391834 391206 390041 391537 390237
390253 391025 392336 391081 390005 391057 390226 390240 390197 389906
391164 391157 390639 391501 389125 389922 390961 390012 389832 389650
390018 390461 390695 390140 390939 389089 391094 390076 391123 389518
391340 390039 390786 391751 391133 390675 392305 390667 391243 389889
390103 390438 389215 389805 392180 391351 389923 390932 390136 390556
389684 390324 390152 390982 391355]
0


Cameron Walsh wrote:

Hi all,

I'm trying to generate histograms of extremely large datasets.  I've
tried a few methods, listed below, all with their own shortcomings.
Mailing-list archive and google searches have not revealed any
solutions.

Method 1:

import numpy
import matplotlib

data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace this line with actual data samples, but the size and types
are correct.

histogram = pylab.hist(data, bins=range(0,256))
pylab.xlim(0,256)
pylab.show()

The problem with this method is it appears to never finish.  It is
however, extremely fast for smaller data sets, like 5x1000x1000 (1-2
seconds) instead of 500x1000x1000.


Method 2:

import numpy
import matplotlib

data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace this line with actual data samples, but the size and types
are correct.

bins=numpy.zeros((256),dtype="uint32")
   for val in data.flat:
       bins[val]+=1
barchart = pylab.bar(xrange(256),bins,align="center")
pylab.xlim(0,256)
pylab.show()

The problem with this method is it is incredibly slow, taking up to 30
seconds for a 1x1000x1000 sample, I have neither the patience nor the
inclination to time a 500x1000x1000 sample.


Method 3:

import numpy

data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace this line with actual data samples, but the size and types
are correct.

a=numpy.histogram(data,256)


The problem with this one is:

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/usr/local/lib/python2.5/site-packages/numpy/lib/function_base.py",
line 96, in histogram
   n = sort(a).searchsorted(bins)
ValueError: dimensions too large.


It seems that iterating over the entire array and doing it manually is
the slowest possible method, but that the rest are not much better.
Is there a faster method available, or do I have to implement method 2
in C and submit the change as a patch?

Thanks and best regards,

Cameron.
_______________________________________________
Numpy-discussion mailing list
[email protected]
http://projects.scipy.org/mailman/listinfo/numpy-discussion

from numpy import array, zeros, asarray, sort, int32
from scipy import weave
from typed_array_converter import converters

def histogram(ary, bins):
    
    ary = asarray(ary)

    # Ensure the bins are in order from smallest to largest.
    bins = sort(asarray(bins))
    
    # Create an array to hold the histogram count results.
    results = zeros(len(bins)-1,dtype=int32)
    
    # The C++ code that actually does the histogramming.    
    code = """
           PyArrayIterObject *iter = (PyArrayIterObject*)PyArray_IterNew(py_ary);
           
           while(iter->index < iter->size)
           {
           
               //////////////////////////////////////////////////////////
               // binary search
               //////////////////////////////////////////////////////////
               
               // This requires an update to weave 
               ary_data_type value = *((ary_data_type*)iter->dataptr);
               int bin_index=-1;
               int left=0;
               int right=Nbins[0]-1;
               int mid_point;
                              
               // handle out-of-range cases first
               if (value>=bins[left] && value<bins[right])
               {
                   // search for the "bin" index that this value falls in.
                   while(left < right)
                   {
                       mid_point = (right+left)/2;
                       if (value >= bins[mid_point] && 
                           value < bins[mid_point+1])
                       {
                           bin_index = mid_point;
                           break;
                       }
                       else if (value < bins[mid_point])
                           right = mid_point;
                       else if (value > bins[mid_point])
                           left = mid_point+1;
                   }
               } 
               
               //////////////////////////////////////////////////////////
               // Bin counter increment
               //////////////////////////////////////////////////////////

               // If the value was found, increment the counter for that bin.
               if (bin_index != -1)
               {
                   results[bin_index]++;
               }    
               PyArray_ITER_NEXT(iter);
           }
           """
    weave.inline(code, ['ary','bins','results'], 
                 type_converters=converters, 
                 compiler='gcc')
                 
    return results

from numpy import arange, product, sum, zeros, uint8
from numpy.random import randint

from weave_histogram import histogram
from weave_contiguous_histogram import histogram_contiguous

import time

shape = 1000,1000,100
size = product(shape)
data = randint(0,256,size).astype(uint8)
bins = arange(256)

print 'type:', data.dtype
print 'millions of elements:', size/1e6

t1 = time.clock()
res1 = histogram_contiguous(data, bins)
t2 = time.clock()
print 'sec (C indexing based):', t2-t1
print res1

# Reshape data so that it is 3 dimensional.
data.shape = shape
t1 = time.clock()
res2 = histogram(data, bins)
t2 = time.clock()
print 'sec (numpy iteration based):', t2-t1
print res2

print sum(res2-res1)

from unittest import TestCase
from weave_histogram import histogram
from numpy.testing import assert_array_equal
from numpy import array

class HistogramTestCase(TestCase):

    def test_below(self):
        
        bins = [0,1]
        ary = [-1]
        result = histogram(ary, bins)
        desired = array([0])
        assert_array_equal(desired, result)

    def test_above(self):
        
        bins = [0,1]
        ary = [2]
        result = histogram(ary, bins)
        desired = array([0])
        assert_array_equal(desired, result)

    def test_on_lower_bound(self):
        """ Values on the lower bound are included in bin
        """        
        bins = [0,1]
        ary = [0]
        result = histogram(ary, bins)
        desired = array([1])
        assert_array_equal(desired, result)

    def test_on_upper_bound(self):
        """ Values on the upper bound are not included in bin
        """
        bins = [0,1]
        ary = [1]
        result = histogram(ary, bins)
        desired = array([0])
        assert_array_equal(desired, result)

    def test_in_fisrt_bin(self):
        """ Values on the upper bound are not included in bin
        """
        bins = [0,1]
        ary = [0.5]
        result = histogram(ary, bins)
        desired = array([1])
        assert_array_equal(desired, result)

    def test_in_second_bin(self):
        """ Values on the upper bound are not included in bin
        """
        bins = [0,1,2]
        ary = [1]
        result = histogram(ary, bins)
        desired = array([0,1])
        assert_array_equal(desired, result)

    def test_multiple_int(self):
        """ Try multiple integer values
        """
        bins = [0,2,4]
        ary = [1,2,3]
        result = histogram(ary, bins)
        desired = array([1,2])
        assert_array_equal(desired, result)

    def test_multiple_float(self):
        """ Try multiple floating point values
        """
        bins = [0,1,2]
        ary = [.5,1, 1.5]
        result = histogram(ary, bins)
        desired = array([1,2])
        assert_array_equal(desired, result)

if __name__ == "__main__":
    from unittest import main
    main()

""" Modified version of array_converter that adds a type definition macro
   for each array.
"""

from scipy.weave.standard_array_spec import array_converter
from scipy.weave import c_spec

##############################################################################
# Weave modifications
##############################################################################

class typed_array_converter(array_converter):
    """ Minor change to the original array type converter that adds a macro
        for the array's data type.
    """
    
    def declaration_code(self,templatize = 0,inline=0):
        code = super(typed_array_converter, self).declaration_code(templatize, 
                                                                inline)
        res = self.template_vars(inline=inline)
        # need to add a macro that defines the array's data type
        code += '#define %(name)s_data_type %(num_type)s\n' % res
        
        return code

# Create a list of type converters that includes this array converter instead
# of the standard one.
converters = [c_spec.int_converter(),
              c_spec.float_converter(),
              c_spec.complex_converter(),
              c_spec.unicode_converter(),
              c_spec.string_converter(),
              c_spec.list_converter(),
              c_spec.dict_converter(),
              c_spec.tuple_converter(),
              c_spec.file_converter(),
              c_spec.instance_converter(),
              typed_array_converter()]


##############################################################################

""" This version only works with contiguous arrays.
"""

from numpy import array, zeros, asarray, sort, int32
from scipy import weave

def histogram_contiguous(ary, bins):
    
    ary = asarray(ary)

    # Ensure the bins are in order from smallest to largest.
    bins = sort(asarray(bins))
    
    # Create an array to hold the histogram count results.
    results = zeros(len(bins)-1,dtype=int32)
    
    # The C++ code that actually does the histogramming.    
    code = """           
           for(int i=0;i<Nary[0];i++) 
           {
           
               //////////////////////////////////////////////////////////
               // binary search
               //////////////////////////////////////////////////////////
               
               // This shameful macro makes it easier for weave to handle
               // different data types...
               #define value ary[i]
               int bin_index=-1;
               int left=0;
               int right=Nbins[0]-1;
               int mid_point;
               
               
               // handle out-of-range cases first
               if (value>=bins[left] && value<bins[right])
               {
                   // search for the "bin" index that this value falls in.
                   while(left < right)
                   {
                       mid_point = (right+left)/2;
                       if (value >= bins[mid_point] && 
                           value < bins[mid_point+1])
                       {
                           bin_index = mid_point;
                           break;
                       }
                       else if (value < bins[mid_point])
                           right = mid_point;
                       else if (value > bins[mid_point])
                           left = mid_point+1;
                   }
               } 
               
               //////////////////////////////////////////////////////////
               // Bin counter increment
               //////////////////////////////////////////////////////////

               // If the value was found, increment the counter for that bin.
               if (bin_index != -1)
               {
                   results[bin_index]++;
               }    
           }           
           """
    weave.inline(code, ['ary','bins','results'], compiler='gcc')
    return results

_______________________________________________
Numpy-discussion mailing list
[email protected]
http://projects.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] Histograms of extremely large data sets

Reply via email to