On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:
>
> Is 22GB too much? It seems like just uncompressing this and storing it 
> naturally would be fine on a large machine. How big are the categorical 
> integers? Would storing an index to an integer really help? It seems like 
> it would only help if the integers are larger than the indices.
>

For example, I just checked the first million instances of one of the 
variables and found only 1050 distinct values, even though those values 
are 10-digit integers, as often happens with identifiers like this.  So 
let's assume that we can store the indices as Uint16.  For this variable 
we then obtain the equivalent information by storing a relatively small 
vector of Ints representing the actual values, plus a memory-mapped file 
at two bytes per record.
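A concrete sketch of that pooled layout (written in current Julia syntax; 
the file name and identifier values below are made up for illustration):

```julia
using Mmap

# Hypothetical raw column: 10-digit identifiers with few distinct values.
raw = [4111111111, 4222222222, 4111111111, 4333333333, 4222222222]

pool = sort(unique(raw))                       # small vector of the actual values
lookup = Dict(v => UInt16(i) for (i, v) in enumerate(pool))
refs = UInt16[lookup[v] for v in raw]          # two bytes per record

# Persist the index vector and memory-map it back.
open("refs.bin", "w") do io
    write(io, refs)
end
mapped = open(io -> Mmap.mmap(io, Vector{UInt16}, length(raw)), "refs.bin")

@assert pool[mapped] == raw                    # round trip recovers the column
```

So 350 million records of this variable would occupy about 700 MB on disk, 
versus 2.8 GB if each value were stored as an Int64.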

To me it seems that working from the original textual representation as a 
.csv.gz file is going to involve a lot of storage, I/O, and conversion of 
strings to integers.
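The distinct-value scan over the compressed file could look something like 
this (a sketch only; the file name and column number are hypothetical, and 
I'm assuming the GZip package's `gzopen` with a do-block):

```julia
using GZip

# Count distinct values of one categorical column (say, column 4)
# while reading the gzipped CSV sequentially, never decompressing to disk.
vals = Set{Int}()
gzopen("transactions.csv.gz") do io
    readline(io)                       # skip the header line
    for line in eachline(io)
        fields = split(line, ",")
        push!(vals, parse(Int, fields[4]))
    end
end
println(length(vals))                  # number of distinct values seen
```

One pass like this per categorical column tells us which unsigned integer 
width suffices for each index vector.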
 

>
>
> On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates <dmb...@gmail.com> wrote:
>
>> On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:
>>>
>>> It is sometimes difficult to obtain realistic "Big" data sets.  A 
>>> Revolution Analytics blog post yesterday
>>>
>>> http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html
>>>
>>> mentioned the competition
>>>
>>> http://www.kaggle.com/c/acquire-valued-shoppers-challenge
>>>
>>> with a very large data set, which may be useful in looking at 
>>> performance bottlenecks.
>>>
>>> You do need to sign up to be able to download the data and you must 
>>> agree only to use the data for the purposes of the competition and to 
>>> remove the data once the competition is over.
>>>
>>
>> I did download the largest of the data files, which consists of about 350 
>> million records on 11 variables in CSV format.  The compressed file is 
>> around 2.6 GB; uncompressed it would be over 22 GB.  Fortunately, the GZip 
>> package allows working with the compressed file through sequential access.
>>
>> Most of the variables are what I would call categorical (stored as 
>> integer values) and could be represented as a pooled data vector.  One 
>> variable is a date and one is a price which could be stored as an integer 
>> value (number of cents) or as a Float32.
>>
>> So the first task would be parsing all those integers and creating a 
>> binary representation.  This could be done using a relational database, but 
>> I think that might be overkill for a static table like this.  I have been 
>> thinking of storing each column as a memory-mapped array in a format like 
>> pooled data.  That is, store only the indices into a table of values, so 
>> that the indices can be represented by whatever size of unsigned integer is 
>> large enough for the table size.
>>
>> To work out the storage format I should first determine the number of 
>> distinct values for each categorical variable.  I was planning on using 
>> split(readline(gzfilehandle), ","), applying int() to the appropriate 
>> fields, and storing the values in a Set or perhaps an IntSet.  Does this 
>> seem like a reasonable way to start?
>>
>
>
