Ah, ok, yes – if there aren't very many distinct values, it could definitely help. With strings it's always nice to convert from variable-length values to fixed-size indices.
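Something like this is roughly what I have in mind (untested, using 0.3-era names like Uint16 and ASCIIString; `pool` is just an illustrative name):

    # Map each distinct string to a small fixed-size code, keeping one
    # table of the distinct values and a Uint16 code per record
    # (assumes fewer than 2^16 distinct values).
    function pool(values::Vector{ASCIIString})
        table = ASCIIString[]               # distinct values, in order seen
        index = Dict{ASCIIString,Uint16}()  # value => code
        codes = Array(Uint16, length(values))
        for (i, v) in enumerate(values)
            if haskey(index, v)
                codes[i] = index[v]
            else
                push!(table, v)
                codes[i] = index[v] = uint16(length(table))
            end
        end
        table, codes
    end

    pool(["foo","bar","foo","baz","bar"])  # => (["foo","bar","baz"], Uint16[1,2,1,3,2])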
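For the first counting pass you describe below, something along these lines seems reasonable (also untested; the column positions in catcols and the file name are guesses, so adjust them to the actual layout):

    using GZip

    # One sequential pass over the compressed file, collecting the distinct
    # values of each categorical column in a Set and returning the counts.
    function distinctcounts(fname::String, catcols::Vector{Int})
        sets = [Set{Int}() for c in catcols]
        gzfilehandle = GZip.open(fname)
        readline(gzfilehandle)                  # skip the header line
        while !eof(gzfilehandle)
            flds = split(chomp(readline(gzfilehandle)), ",")
            for (j, c) in enumerate(catcols)
                push!(sets[j], int(flds[c]))
            end
        end
        close(gzfilehandle)
        [length(s) for s in sets]
    end

    distinctcounts("transactions.csv.gz", [1, 3, 4])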
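And once the table sizes are known, the codes for each column can go straight into a memory-mapped file, two bytes per record for a Uint16 column. A sketch of that piece, with nrec and the file name as placeholders:

    nrec = 350000000                   # total number of records, roughly
    io = open("col1.codes", "w+")
    codes = mmap_array(Uint16, (nrec,), io)  # maps (and grows) the file to 2*nrec bytes
    # second pass over the data: codes[i] = code of record i's value
    close(io)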
On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates <dmba...@gmail.com> wrote:

> On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:
>>
>> Is 22GB too much? It seems like just uncompressing this and storing it
>> naturally would be fine on a large machine. How big are the categorical
>> integers? Would storing an index to an integer really help? It seems like
>> it would only help if the integers are larger than the indices.
>
> For example, I just checked the first million instances of one of the
> variables and there are only 1050 distinct values, even though those
> values are 10-digit integers, as often happens with identifiers like this.
> So let's assume that we can store the indices as Uint16. We obtain the
> equivalent information by storing a relatively small vector of Ints
> representing the actual values, plus a memory-mapped file at two bytes per
> record for this variable.
>
> To me it seems that working from the original textual representation as a
> .csv.gz file is going to involve a lot of storage, I/O, and conversion of
> strings to integers.
>
>> On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates <dmb...@gmail.com> wrote:
>>
>>> On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:
>>>>
>>>> It is sometimes difficult to obtain realistic "Big" data sets. A
>>>> Revolution Analytics blog post yesterday
>>>>
>>>> http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html
>>>>
>>>> mentioned the competition
>>>>
>>>> http://www.kaggle.com/c/acquire-valued-shoppers-challenge
>>>>
>>>> with a very large data set, which may be useful in looking at
>>>> performance bottlenecks.
>>>>
>>>> You do need to sign up to be able to download the data, and you must
>>>> agree only to use the data for the purposes of the competition and to
>>>> remove the data once the competition is over.
>>>
>>> I did download the largest of the data files, which consists of about
>>> 350 million records on 11 variables in CSV format. The compressed file
>>> is around 2.6 GB; uncompressed it would be over 22 GB. Fortunately, the
>>> GZip package allows for working with the compressed file for sequential
>>> access.
>>>
>>> Most of the variables are what I would call categorical (stored as
>>> integer values) and could be represented as a pooled data vector. One
>>> variable is a date and one is a price, which could be stored as an
>>> integer value (number of cents) or as a Float32.
>>>
>>> So the first task would be parsing all those integers and creating a
>>> binary representation. This could be done using a relational database,
>>> but I think that might be overkill for a static table like this. I have
>>> been thinking of storing each column as a memory-mapped array in a
>>> format like pooled data. That is, store only the indices into a table
>>> of values, so that the indices can be represented as whatever size of
>>> unsigned int is large enough for the table size.
>>>
>>> To work out the storage format I should first determine the number of
>>> distinct values for each categorical variable. I was planning on using
>>> split(readline(gzfilehandle), ","), applying int() to the appropriate
>>> fields, and storing the values in a Set or perhaps an IntSet. Does this
>>> seem like a reasonable way to start?