Re: [julia-users] Re: A Big Data stress test

2014-05-01 Thread Viral Shah
This would certainly be useful - to have prepackaged large datasets for 
people to work with. The question is what kind of operations one would want 
to do on such a dataset. If you could provide a set of well-defined 
benchmarks (simple kernel codes that developers can work with), that would 
be a great starting point.

-viral

On Thursday, May 1, 2014 4:03:41 AM UTC+5:30, Cameron McBride wrote:

 If there is some desire for big data tests, there are a fair number of 
 public astronomical datasets that wouldn't be too hard to package up. 

 The catalog-level versions aren't too different from the type of dataset 
 mentioned by Doug. There are a number of fairly simple analyses that could 
 be done on them for testing, either simple predictions or classifications. 
 These wouldn't be hard to document and/or describe. I can produce 
 examples if people care.

 For example, SDSS (a survey I work on) has public catalog data of ~470 
 million objects (rows), with something like ~3 million of those that have 
 more in-depth information (many more columns).  Depending on the test 
 questions, these can be trimmed to provide datasets of various sizes. 
  Numbers pulled from: http://www.sdss3.org/dr10/scope.php. 

 Anyhow, I guess the advantage here is the data is public and can be used 
 indefinitely.  And it's astronomy data, so naturally it's awesome.  ;) 
  (However, it might suffer from the "who cares in the real world" issue.)

 Cameron



 On Wed, Apr 30, 2014 at 3:02 PM, Stefan Karpinski ste...@karpinski.org wrote:

 Ah, ok, yes – if there aren't very many distinct values, it could 
 definitely help. With strings it's always nice to convert from 
 variable-length strings to fixed-size indices.
  

 On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates dmba...@gmail.com wrote:

 On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:

 Is 22GB too much? It seems like just uncompressing this and storing it 
 naturally would be fine on a large machine. How big are the categorical 
 integers? Would storing an index to an integer really help? It seems like 
 it would only help if the integers are larger than the indices.


 For example, I just checked the first million instances of one of the 
 variables and there are only 1050 distinct values, even though those values 
 are 10-digit integers, as often happens with identifiers like this. So 
 let's assume that we can store the indices as Uint16. We obtain the 
 equivalent information by storing a relatively small vector of Ints 
 representing the actual values, plus a memory-mapped file at two bytes per 
 record, for this variable.

 To me it seems that working from the original textual representation as 
 a .csv.gz file is going to involve a lot of storage, i/o and conversion of 
 strings to integers.
  



 On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates dmb...@gmail.com wrote:

 On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:

 It is sometimes difficult to obtain realistic Big data sets.  A 
 Revolution Analytics blog post yesterday

 http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html

 mentioned the competition

 http://www.kaggle.com/c/acquire-valued-shoppers-challenge

 with a very large data set, which may be useful in looking at 
 performance bottlenecks.

 You do need to sign up to be able to download the data and you must 
 agree only to use the data for the purposes of the competition and to 
 remove the data once the competition is over.


 I did download the largest of the data files, which consists of about 
 350 million records on 11 variables in CSV format. The compressed file is 
 around 2.6 GB; uncompressed it would be over 22 GB. Fortunately, the GZip 
 package allows working with the compressed file through sequential access.

 Most of the variables are what I would call categorical (stored as 
 integer values) and could be represented as a pooled data vector.  One 
 variable is a date and one is a price which could be stored as an integer 
 value (number of cents) or as a Float32.

 So the first task would be parsing all those integers and creating a 
 binary representation. This could be done using a relational database, but 
 I think that might be overkill for a static table like this. I have been 
 thinking of storing each column as a memory-mapped array in a format like 
 pooled data. That is, store only the indices into a table of values, so 
 that the indices can be represented as whatever size of unsigned int is 
 large enough for the table size.

 To work out the storage format I should first determine the number of 
 distinct values for each categorical variable. I was planning on using 
 split(readline(gzfilehandle), ","), applying int() to the appropriate 
 fields, and storing the values in a Set or perhaps an IntSet. Does this 
 seem like a reasonable way to start?
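
A minimal sketch of the distinct-value pass described in the plan quoted
above, written in the 0.3-era syntax used throughout this thread (int(),
gzopen; current Julia would use parse(Int, x) and Mmap/CodecZlib). The file
name and column index are placeholders:

    using GZip

    # Count the distinct values in one categorical column (here column 2)
    # of the compressed CSV, reading the gzipped file sequentially.
    function distinct_values(fname, col)
        vals = Set{Int}()
        io = gzopen(fname)
        readline(io)                        # skip the header line
        while !eof(io)
            fields = split(chomp(readline(io)), ",")
            push!(vals, int(fields[col]))   # parse(Int, ...) on current Julia
        end
        close(io)
        vals
    end

    vals = distinct_values("transactions.csv.gz", 2)
    println(length(vals), " distinct values")

If the count for a column stays under 2^16, its indices fit in a Uint16,
which matches the 1050-value example Doug gives in the thread.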






Re: [julia-users] Re: A Big Data stress test

2014-04-30 Thread Stefan Karpinski
Is 22GB too much? It seems like just uncompressing this and storing it
naturally would be fine on a large machine. How big are the categorical
integers? Would storing an index to an integer really help? It seems like
it would only help if the integers are larger than the indices.
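
For scale, a rough back-of-the-envelope comparison, assuming the roughly 350
million records Doug mentions in the message quoted below (the numbers are
only illustrative):

    n = 350 * 10^6      # approximate number of records
    n * 8 / 2^30        # ~2.6 GiB for one column stored as 64-bit integers
    n * 2 / 2^30        # ~0.65 GiB for the same column as two-byte indices

So per column the index representation is roughly a quarter of the size,
provided the number of distinct values fits in two bytes.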


On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates dmba...@gmail.com wrote:

 On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:

 It is sometimes difficult to obtain realistic Big data sets.  A
 Revolution Analytics blog post yesterday

 http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html

 mentioned the competition

 http://www.kaggle.com/c/acquire-valued-shoppers-challenge

 with a very large data set, which may be useful in looking at performance
 bottlenecks.

 You do need to sign up to be able to download the data and you must agree
 only to use the data for the purposes of the competition and to remove the
 data once the competition is over.


 I did download the largest of the data files, which consists of about 350
 million records on 11 variables in CSV format. The compressed file is
 around 2.6 GB; uncompressed it would be over 22 GB. Fortunately, the GZip
 package allows working with the compressed file through sequential access.

 Most of the variables are what I would call categorical (stored as integer
 values) and could be represented as a pooled data vector.  One variable is
 a date and one is a price which could be stored as an integer value (number
 of cents) or as a Float32.

 So the first task would be parsing all those integers and creating a
 binary representation. This could be done using a relational database, but
 I think that might be overkill for a static table like this. I have been
 thinking of storing each column as a memory-mapped array in a format like
 pooled data. That is, store only the indices into a table of values, so
 that the indices can be represented as whatever size of unsigned int is
 large enough for the table size.

 To work out the storage format I should first determine the number of
 distinct values for each categorical variable. I was planning on using
 split(readline(gzfilehandle), ","), applying int() to the appropriate
 fields, and storing the values in a Set or perhaps an IntSet. Does this
 seem like a reasonable way to start?



Re: [julia-users] Re: A Big Data stress test

2014-04-30 Thread Douglas Bates
On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:

 Is 22GB too much? It seems like just uncompressing this and storing it 
 naturally would be fine on a large machine. How big are the categorical 
 integers? Would storing an index to an integer really help? It seems like 
 it would only help if the integers are larger than the indices.


For example, I just checked the first million instances of one of the 
variables and there are only 1050 distinct values, even though those values 
are 10-digit integers, as often happens with identifiers like this. So 
let's assume that we can store the indices as Uint16. We obtain the 
equivalent information by storing a relatively small vector of Ints 
representing the actual values, plus a memory-mapped file at two bytes per 
record, for this variable.

To me it seems that working from the original textual representation as a 
.csv.gz file is going to involve a lot of storage, i/o and conversion of 
strings to integers.
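
A minimal sketch of that pooled, memory-mapped layout, again in 0.3-era
syntax (Uint16 and mmap_array; current Julia spells these UInt16 and
Mmap.mmap). The raw column and the file name below are stand-ins for the
parsed data:

    # Stand-in for one parsed column of 10-digit identifiers.
    raw = [4171920031, 4171920032, 4171920031, 4171920033]

    # The small pool of distinct values, plus a lookup from value to index.
    pool = unique(raw)
    lookup = Dict{Int,Uint16}()
    for (i, v) in enumerate(pool)
        lookup[v] = i
    end

    # One two-byte index per record.
    idx = Uint16[lookup[v] for v in raw]

    # Write the index column to disk, then memory-map it back in.
    open("column1.idx", "w") do io
        write(io, idx)
    end
    io = open("column1.idx", "r")
    mapped = mmap_array(Uint16, (length(raw),), io)

    # The original values are recovered on demand from the pool.
    reconstructed = pool[mapped]

The same pool-plus-index file works for any column whose number of distinct
values fits in the chosen unsigned integer size.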
 



 On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates dmb...@gmail.com wrote:

 On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:

 It is sometimes difficult to obtain realistic Big data sets.  A 
 Revolution Analytics blog post yesterday

 http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html

 mentioned the competition

 http://www.kaggle.com/c/acquire-valued-shoppers-challenge

 with a very large data set, which may be useful in looking at 
 performance bottlenecks.

 You do need to sign up to be able to download the data and you must 
 agree only to use the data for the purposes of the competition and to 
 remove the data once the competition is over.


 I did download the largest of the data files, which consists of about 350 
 million records on 11 variables in CSV format. The compressed file is 
 around 2.6 GB; uncompressed it would be over 22 GB. Fortunately, the GZip 
 package allows working with the compressed file through sequential access.

 Most of the variables are what I would call categorical (stored as 
 integer values) and could be represented as a pooled data vector.  One 
 variable is a date and one is a price which could be stored as an integer 
 value (number of cents) or as a Float32.

 So the first task would be parsing all those integers and creating a 
 binary representation. This could be done using a relational database, but 
 I think that might be overkill for a static table like this. I have been 
 thinking of storing each column as a memory-mapped array in a format like 
 pooled data. That is, store only the indices into a table of values, so 
 that the indices can be represented as whatever size of unsigned int is 
 large enough for the table size.

 To work out the storage format I should first determine the number of 
 distinct values for each categorical variable. I was planning on using 
 split(readline(gzfilehandle), ","), applying int() to the appropriate 
 fields, and storing the values in a Set or perhaps an IntSet. Does this 
 seem like a reasonable way to start?




Re: [julia-users] Re: A Big Data stress test

2014-04-30 Thread Stefan Karpinski
Ah, ok, yes – if there aren't very many distinct values, it could
definitely help. With strings it's always nice to convert from
variable-length strings to fixed-size indices.
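
As a tiny illustration of that string-to-index conversion (0.3-era syntax
again; the identifiers are made up), the idea is the same pooling used for
the integer columns:

    ids = ["cust-0001932", "cust-0000415", "cust-0001932"]

    # Store each distinct string once and keep a fixed-size code per record.
    pool = unique(ids)
    codes = Uint16[findfirst(pool, s) for s in ids]  # use a Dict for real data

    pool[codes]     # recovers the original strings on demand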


On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates dmba...@gmail.com wrote:

 On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:

 Is 22GB too much? It seems like just uncompressing this and storing it
 naturally would be fine on a large machine. How big are the categorical
 integers? Would storing an index to an integer really help? It seems like
 it would only help if the integers are larger than the indices.


 For example, I just checked the first million instances of one of the
 variables and there are only 1050 distinct values, even though those values
 are 10-digit integers, as often happens with identifiers like this. So
 let's assume that we can store the indices as Uint16. We obtain the
 equivalent information by storing a relatively small vector of Ints
 representing the actual values, plus a memory-mapped file at two bytes per
 record, for this variable.

 To me it seems that working from the original textual representation as a
 .csv.gz file is going to involve a lot of storage, i/o and conversion of
 strings to integers.




 On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates dmb...@gmail.com wrote:

 On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:

 It is sometimes difficult to obtain realistic Big data sets.  A
 Revolution Analytics blog post yesterday

 http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html

 mentioned the competition

 http://www.kaggle.com/c/acquire-valued-shoppers-challenge

 with a very large data set, which may be useful in looking at
 performance bottlenecks.

 You do need to sign up to be able to download the data and you must
 agree only to use the data for the purposes of the competition and to
 remove the data once the competition is over.


 I did download the largest of the data files, which consists of about 350
 million records on 11 variables in CSV format. The compressed file is
 around 2.6 GB; uncompressed it would be over 22 GB. Fortunately, the GZip
 package allows working with the compressed file through sequential access.

 Most of the variables are what I would call categorical (stored as
 integer values) and could be represented as a pooled data vector.  One
 variable is a date and one is a price which could be stored as an integer
 value (number of cents) or as a Float32.

 So the first task would be parsing all those integers and creating a
 binary representation. This could be done using a relational database, but
 I think that might be overkill for a static table like this. I have been
 thinking of storing each column as a memory-mapped array in a format like
 pooled data. That is, store only the indices into a table of values, so
 that the indices can be represented as whatever size of unsigned int is
 large enough for the table size.

 To work out the storage format I should first determine the number of
 distinct values for each categorical variable. I was planning on using
 split(readline(gzfilehandle), ","), applying int() to the appropriate
 fields, and storing the values in a Set or perhaps an IntSet. Does this
 seem like a reasonable way to start?





Re: [julia-users] Re: A Big Data stress test

2014-04-30 Thread Cameron McBride
If there is some desire for big data tests, there are a fair number of
public astronomical datasets that wouldn't be too hard to package up.

The catalog-level versions aren't too different from the type of dataset
mentioned by Doug. There are a number of fairly simple analyses that could
be done on them for testing, either simple predictions or classifications.
These wouldn't be hard to document and/or describe. I can produce
examples if people care.

For example, SDSS (a survey I work on) has public catalog data of ~470
million objects (rows), with something like ~3 million of those that have
more in-depth information (many more columns).  Depending on the test
questions, these can be trimmed to provide datasets of various sizes.
 Numbers pulled from: http://www.sdss3.org/dr10/scope.php.

Anyhow, I guess the advantage here is the data is public and can be used
indefinitely.  And it's astronomy data, so naturally it's awesome.  ;)
 (However, it might suffer from the "who cares in the real world" issue.)

Cameron



On Wed, Apr 30, 2014 at 3:02 PM, Stefan Karpinski ste...@karpinski.org wrote:

 Ah, ok, yes – if there aren't very many distinct values, it could
 definitely help. With strings it's always nice to convert from
 variable-length strings to fixed-size indices.


 On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates dmba...@gmail.com wrote:

 On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:

 Is 22GB too much? It seems like just uncompressing this and storing it
 naturally would be fine on a large machine. How big are the categorical
 integers? Would storing an index to an integer really help? It seems like
 it would only help if the integers are larger than the indices.


 For example, I just checked the first million instances of one of the
 variables and there are only 1050 distinct values, even though those values
 are 10-digit integers, as often happens with identifiers like this. So
 let's assume that we can store the indices as Uint16. We obtain the
 equivalent information by storing a relatively small vector of Ints
 representing the actual values, plus a memory-mapped file at two bytes per
 record, for this variable.

 To me it seems that working from the original textual representation as a
 .csv.gz file is going to involve a lot of storage, i/o and conversion of
 strings to integers.




 On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates dmb...@gmail.com wrote:

 On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:

 It is sometimes difficult to obtain realistic Big data sets.  A
 Revolution Analytics blog post yesterday

 http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html

 mentioned the competition

 http://www.kaggle.com/c/acquire-valued-shoppers-challenge

 with a very large data set, which may be useful in looking at
 performance bottlenecks.

 You do need to sign up to be able to download the data and you must
 agree only to use the data for the purposes of the competition and to
 remove the data once the competition is over.


 I did download the largest of the data files, which consists of about
 350 million records on 11 variables in CSV format. The compressed file is
 around 2.6 GB; uncompressed it would be over 22 GB. Fortunately, the GZip
 package allows working with the compressed file through sequential access.

 Most of the variables are what I would call categorical (stored as
 integer values) and could be represented as a pooled data vector.  One
 variable is a date and one is a price which could be stored as an integer
 value (number of cents) or as a Float32.

 So the first task would be parsing all those integers and creating a
 binary representation. This could be done using a relational database, but
 I think that might be overkill for a static table like this. I have been
 thinking of storing each column as a memory-mapped array in a format like
 pooled data. That is, store only the indices into a table of values, so
 that the indices can be represented as whatever size of unsigned int is
 large enough for the table size.

 To work out the storage format I should first determine the number of
 distinct values for each categorical variable. I was planning on using
 split(readline(gzfilehandle), ","), applying int() to the appropriate
 fields, and storing the values in a Set or perhaps an IntSet. Does this
 seem like a reasonable way to start?