It would certainly be useful to have prepackaged large datasets for people 
to work with. The question is what kinds of operations one would want to do 
on such a dataset. If you could provide a set of well-defined benchmarks 
(simple kernel codes that developers can work with), that would be very 
helpful.

-viral

On Thursday, May 1, 2014 4:03:41 AM UTC+5:30, Cameron McBride wrote:
>
> If there is some desire for "big data" tests, there are a fair number of 
> public astronomical datasets that wouldn't be too hard to package up. 
>
> The catalog-level versions aren't too different from the type of dataset 
> mentioned by Doug. There are a number of fairly simple analyses that could 
> be done on them for testing, either simple predictions or classifications. 
> These wouldn't be hard to document and/or describe. I can produce 
> examples if people care.
>
> For example, SDSS (a survey I work on) has public catalog data of ~470 
> million objects (rows), of which something like ~3 million have more 
> in-depth information (many more columns). Depending on the test questions, 
> these can be trimmed to provide datasets of various sizes. 
> Numbers pulled from: http://www.sdss3.org/dr10/scope.php. 
>
> Anyhow, I guess the advantage here is the data is public and can be used 
> indefinitely.  And it's astronomy data, so naturally it's awesome.  ;) 
>  (However, it might suffer from the "who cares in the real world" issue.)
>
> Cameron
>
>
>
> On Wed, Apr 30, 2014 at 3:02 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
>
>> Ah, ok, yes – if there aren't very many distinct values, it could 
>> definitely help. It's always nice to convert variable-length strings to 
>> fixed-size indices.
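>>
>> Something like this is what I mean (just a sketch; the names are made up):
>>
>>     codes = Dict{String,UInt16}()     # string value -> fixed-size code
>>     pool  = String[]                  # code -> original string
>>     function encode(s::AbstractString)
>>         str = String(s)
>>         get!(codes, str) do           # assumes < 2^16 distinct values
>>             push!(pool, str)
>>             UInt16(length(pool))
>>         end
>>     end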
>>  
>>
>> On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates <dmba...@gmail.com> wrote:
>>
>>> On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:
>>>>
>>>> Is 22GB too much? It seems like just uncompressing this and storing it 
>>>> naturally would be fine on a large machine. How big are the categorical 
>>>> integers? Would storing an index to an integer really help? It seems like 
>>>> it would only help if the integers are larger than the indices.
>>>>
>>>
>>> For example, I just checked the first million instances of one of the 
>>> variables and found only 1050 distinct values, even though those values 
>>> are 10-digit integers, as often happens with identifiers like this. So 
>>> let's assume we can store the indices as Uint16. We then get the 
>>> equivalent information by storing a relatively small vector of Ints 
>>> holding the actual values, plus a memory-mapped file at two bytes per 
>>> record, for this variable.
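>>>
>>> In code, roughly (an untested sketch, with raw standing in for the parsed 
>>> column of identifiers):
>>>
>>>     pool   = sort(unique(raw))                # ~1050 distinct Int values
>>>     lookup = Dict(v => UInt16(i) for (i, v) in enumerate(pool))
>>>     inds   = UInt16[lookup[v] for v in raw]   # two bytes per record
>>>     # the original value for record k is pool[inds[k]]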
>>>
>>> To me it seems that working from the original textual representation as 
>>> a .csv.gz file is going to involve a lot of storage, I/O, and conversion 
>>> of strings to integers.
>>>  
>>>
>>>>
>>>>
>>>> On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates <dmb...@gmail.com> wrote:
>>>>
>>>>> On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:
>>>>>>
>>>>>> It is sometimes difficult to obtain realistic "Big" data sets.  A 
>>>>>> Revolution Analytics blog post yesterday
>>>>>>
>>>>>> http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html
>>>>>>
>>>>>> mentioned the competition
>>>>>>
>>>>>> http://www.kaggle.com/c/acquire-valued-shoppers-challenge
>>>>>>
>>>>>> with a very large data set, which may be useful in looking at 
>>>>>> performance bottlenecks.
>>>>>>
>>>>>> You do need to sign up to be able to download the data, and you must 
>>>>>> agree to use the data only for the purposes of the competition and to 
>>>>>> remove the data once the competition is over.
>>>>>>
>>>>>
>>>>> I did download the largest of the data files, which consists of about 
>>>>> 350 million records on 11 variables in CSV format. The compressed file 
>>>>> is around 2.6 GB; uncompressed it would be over 22 GB. Fortunately, the 
>>>>> GZip package allows working with the compressed file sequentially.
>>>>>
>>>>> Most of the variables are what I would call categorical (stored as 
>>>>> integer values) and could be represented as pooled data vectors. One 
>>>>> variable is a date and one is a price, which could be stored as an 
>>>>> integer value (number of cents) or as a Float32.
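>>>>>
>>>>> For those two, something simple should do (a sketch; it assumes the 
>>>>> price is a decimal string and the date is in ISO yyyy-mm-dd form):
>>>>>
>>>>>     using Dates
>>>>>     # price "3.49" -> 349 cents in an Int32
>>>>>     price_cents(s) = round(Int32, 100 * parse(Float64, s))
>>>>>     # date "2013-03-25" -> days since the epoch, as an Int
>>>>>     date_days(s) = Dates.value(Date(s))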
>>>>>
>>>>> So the first task would be parsing all those integers and creating a 
>>>>> binary representation. This could be done with a relational database, 
>>>>> but I think that might be overkill for a static table like this. I have 
>>>>> been thinking of storing each column as a memory-mapped array in a 
>>>>> pooled-data format. That is, store only the indices into a table of 
>>>>> values, so that the indices can be represented by whatever size of 
>>>>> unsigned integer is large enough for the table size.
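>>>>>
>>>>> For the memory-mapped part, I am picturing something like this 
>>>>> (untested; inds is the UInt16 index vector for one column and the file 
>>>>> name is a placeholder):
>>>>>
>>>>>     using Mmap
>>>>>     # write one column of indices as raw UInt16s, two bytes per record
>>>>>     open(io -> write(io, inds), "var1.idx", "w")
>>>>>     # map it back in later without reading the whole column into memory
>>>>>     mapped = Mmap.mmap("var1.idx", Vector{UInt16}, (length(inds),))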
>>>>>
>>>>> To work out the storage format I should first determine the number of 
>>>>> distinct values for each categorical variable. I was planning on using 
>>>>> split(readline(gzfilehandle), ","), applying int() to the appropriate 
>>>>> fields, and storing the values in a Set or perhaps an IntSet (sketched 
>>>>> below). Does this seem like a reasonable way to start?
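>>>>>
>>>>> Concretely, something along these lines (untested; the file name and 
>>>>> the positions of the categorical columns are placeholders):
>>>>>
>>>>>     using GZip
>>>>>     catcols = (1, 2, 3, 4)       # categorical column positions (placeholder)
>>>>>     seen = Dict(j => Set{Int}() for j in catcols)
>>>>>     gzopen("transactions.csv.gz") do io
>>>>>         readline(io)             # skip the header line
>>>>>         for line in eachline(io)
>>>>>             flds = split(line, ",")
>>>>>             for j in catcols
>>>>>                 push!(seen[j], parse(Int, flds[j]))
>>>>>             end
>>>>>         end
>>>>>     end
>>>>>     Dict(j => length(s) for (j, s) in seen)   # distinct count per column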
>>>>>
>>>>
>>>>
>>
>
