Ah, ok, yes – if there aren't very many distinct values, it could
definitely help. With strings it's always nice to convert from
variable-length strings to fixed-size indices.
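
To make the idea concrete, here is a minimal sketch of pooling, written in Python purely for illustration rather than in Julia. The function name `pool` and the sample identifiers are made up for this example. It assumes at most 65536 distinct values, so that each index fits in an unsigned 16-bit slot, and that `array("H")` uses two bytes per item, as it does on common platforms.

```python
from array import array


def pool(values):
    """Split values into (distinct values in first-seen order, per-record indices)."""
    lookup = {}           # value -> index into pool_values
    pool_values = []      # the small table of actual values
    indices = array("H")  # unsigned short: two bytes per record on common platforms
    for v in values:
        idx = lookup.get(v)
        if idx is None:
            idx = len(pool_values)
            if idx > 0xFFFF:
                raise OverflowError("more distinct values than a 16-bit index can hold")
            lookup[v] = idx
            pool_values.append(v)
        indices.append(idx)
    return pool_values, indices


# Hypothetical 10-digit identifiers, repeated as identifiers usually are.
ids = [1234567890, 9876543210, 1234567890, 5555555555, 9876543210]
pool_values, indices = pool(ids)

# The original column is recovered by looking each index up in the pool.
restored = [pool_values[i] for i in indices]
```

The per-record cost is then the index width, regardless of how wide the original values or strings were.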

On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates <dmba...@gmail.com> wrote:

> On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:
>> Is 22GB too much? It seems like just uncompressing this and storing it
>> naturally would be fine on a large machine. How big are the categorical
>> integers? Would storing an index to an integer really help? It seems like
>> it would only help if the integers are larger than the indices.
> For example I just checked the first million instances of one of the
> variables and there are only 1050 distinct values, even though those values
> are 10 digit integers, as often happens with identifiers like this.  So
> let's assume that we can store the indices as Uint16.  We obtain the
> equivalent information by storing a relatively small vector of Ints
> representing the actual values, plus a memory-mapped file at two bytes per
> record for this variable.
> To me it seems that working from the original textual representation as a
> .csv.gz file is going to involve a lot of storage, i/o and conversion of
> strings to integers.
>> On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates <dmb...@gmail.com> wrote:
>>> On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:
>>>> It is sometimes difficult to obtain realistic "Big" data sets.  A
>>>> Revolution Analytics blog post yesterday
>>>> http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html
>>>> mentioned the competition
>>>> http://www.kaggle.com/c/acquire-valued-shoppers-challenge
>>>> with a very large data set, which may be useful in looking at
>>>> performance bottlenecks.
>>>> You do need to sign up to be able to download the data and you must
>>>> agree only to use the data for the purposes of the competition and to
>>>> remove the data once the competition is over.
>>> I did download the largest of the data files, which consists of about 350
>>> million records on 11 variables in CSV format.  The compressed file is
>>> around 2.6 GB; uncompressed it would be over 22 GB.  Fortunately, the GZip
>>> package allows for working with the compressed file for sequential access.
>>> Most of the variables are what I would call categorical (stored as
>>> integer values) and could be represented as a pooled data vector.  One
>>> variable is a date and one is a price which could be stored as an integer
>>> value (number of cents) or as a Float32.
>>> So the first task would be parsing all those integers and creating a
>>> binary representation.  This could be done using a relational database, but
>>> I think that might be overkill for a static table like this.  I have been
>>> thinking of storing each column as a memory-mapped array in a format like
>>> pooled data.  That is, store only the indices into a table of values so
>>> that the indices can be represented as whatever size of unsigned int is
>>> large enough for the table size.
>>> To work out the storage format I should first determine the number of
>>> distinct values for each categorical variable.  I was planning on using
>>> split(readline(gzfilehandle), ","), applying int() to the appropriate
>>> fields and storing the values in a Set or perhaps an IntSet.  Does this
>>> seem like a reasonable way to start?
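
The first pass described above, counting distinct values per categorical column and choosing an index width from the count, might be sketched as follows. This is in Python for illustration only, not the Julia code under discussion. The helper names `distinct_counts` and `index_bytes`, the column names, and the sample rows are all hypothetical, and the sketch assumes the file has a header row and that the chosen columns hold plain integers.

```python
import csv
import gzip
import io


def distinct_counts(gz_source, columns):
    """Stream a gzipped CSV once and count the distinct integers in the given columns."""
    seen = {c: set() for c in columns}
    # gzip.open accepts a path or an open binary file object; "rt" decodes to text.
    with gzip.open(gz_source, mode="rt", newline="") as fh:
        reader = csv.reader(fh)
        next(reader)  # skip the header row (assumed to be present)
        for row in reader:
            for c in columns:
                seen[c].add(int(row[c]))
    return {c: len(s) for c, s in seen.items()}


def index_bytes(n_distinct):
    """Smallest unsigned index width, in bytes, that can address n_distinct values."""
    for nbytes in (1, 2, 4, 8):
        if n_distinct <= 1 << (8 * nbytes):
            return nbytes
    raise ValueError("too many distinct values for a 64-bit index")


# A tiny in-memory stand-in for the real .csv.gz file (made-up rows).
sample = b"id,chain,price\n86246,205,0.80\n86246,205,1.59\n86252,205,0.99\n"
gz_bytes = gzip.compress(sample)

counts = distinct_counts(io.BytesIO(gz_bytes), columns=[0, 1])
widths = {c: index_bytes(n) for c, n in counts.items()}
```

With the 1050 distinct values mentioned earlier in the thread, `index_bytes(1050)` gives 2, matching the two-bytes-per-record estimate. For a real file the sets could grow large for high-cardinality columns, so memory use during this pass depends on the data.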
