Re: [Numpy-discussion] Fortran order in recarray.

Alex Rogozhnikov Wed, 22 Feb 2017 08:59:01 -0800

Hi Matthew, 
this may not be the best place to discuss pandas problems, but to show that 
I am not missing something, let's consider a simple example.


import numpy
import pandas

# simplest DataFrame
x = pandas.DataFrame(dict(a=numpy.arange(10), b=numpy.arange(10, 20)))

# simplest indexing. Can you predict the results without looking at the comments?
x[:2]              # returns the first two rows, as expected
x[[0, 1]]          # returns a copy of x, the whole DataFrame
x[numpy.array(2)]  # fails with IndexError: indices are out-of-bounds (can you guess why?)
x[[0, 1], :]       # TypeError: unhashable type: list

Just in case: I know about .loc and .iloc. But when you write code with many 
subroutines, you think in terms of numpy inputs. At some point you simply 
forget to convert some of the data to numpy, and the code continues to work 
but yields wrong results (even though you tested everything, you tested it 
with numpy inputs). Checking all the inputs in each small subroutine is 
strange.
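
The defensive conversion described above would look roughly like the sketch 
below. The function name first_rows and its count parameter are hypothetical, 
chosen only for illustration; the point is that every subroutine that does 
integer indexing would need the same numpy.asarray call at its entry.

```python
import numpy


def first_rows(values, count=2):
    """Hypothetical subroutine: return the first `count` entries by position."""
    # numpy.asarray drops any pandas index, so the integer indexing below is
    # always positional. For inputs that are already ndarrays it returns the
    # same object without copying.
    values = numpy.asarray(values)
    return values[numpy.arange(count)]


from_array = first_rows(numpy.arange(10, 20))
from_list = first_rows([5, 6, 7])
```

This works, but it has to be repeated in every small subroutine, which is 
exactly the overhead that is easy to forget.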

Ok, a bit more:
x[x['a'] > 5]        # works as expected
x[x['a'] > 5, :]     # 'Series' objects are mutable, thus they cannot be hashed
lookup = numpy.arange(10)
x[lookup[x['a']] > 5]     # works as expected
x[lookup[x['a']] > 5, :]  # TypeError: unhashable type: 'numpy.ndarray'

x[lookup]['a']   # IndexError
x['a'][lookup]   # works as expected
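
For comparison, here is a small sketch of the same kinds of row selection on 
a plain two-dimensional numpy array. The array layout (column 0 playing the 
role of 'a', column 1 the role of 'b') is an assumption made for this 
illustration; with numpy, every form below selects rows by position.

```python
import numpy

# 10 rows, 2 columns: column 0 holds 0..9 ('a'), column 1 holds 10..19 ('b')
data = numpy.arange(20).reshape(2, 10).T

head = data[:2]                  # first two rows
picked = data[[0, 1]]            # the same two rows, via an integer list
single = data[numpy.array(2)]    # row number 2, via a 0-d integer array
picked_2d = data[[0, 1], :]      # the same two rows, with an explicit column slice

mask = data[:, 0] > 5            # boolean mask over rows
masked = data[mask]              # rows where 'a' > 5
masked_2d = data[mask, :]        # same rows, with an explicit column slice

lookup = numpy.arange(10)
via_lookup = data[lookup[data[:, 0]] > 5, :]   # mask computed through a lookup table
```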

Now let's go a bit further and split the data into train and test parts for 
machine learning (again, the most frequent operation)

from sklearn.model_selection import train_test_split
x1, x2 = train_test_split(x, random_state=42)

# compare the following operations on a column of a pandas.DataFrame
col = x1['a']
print(col[:2])               # first two elements
print(col[[0, 1]])           # doesn't fail (although there is no row with index 0), fills it with NaN
print(col[numpy.arange(2)])  # same as previous

print(col[col > 4])          # as expected
print(col[col.values > 4])   # as expected
print(col.values[col > 4])   # converts boolean to int, uses int indexing, but at least raises a warning

Mistakes caused by such silent misbehavior are not easy to detect (when your 
data pipeline consists of several steps); it is quite hard to locate the 
source of the problem and almost impossible to be sure that you have indeed 
avoided all such pitfalls. Code review turns into a paranoid process (if you 
care about the result, of course).

Things are even worse: I've demonstrated this for my installation, and if you 
run this with some other pandas installation you will probably get different 
results (and these were really basic operations). So things that worked fine 
in one version may work differently in another, and this becomes completely 
intractable. 

Pandas may be nice if you need a report and you need to get it done by 
tomorrow, after which you'll throw away the code. When we initially used 
pandas as the main data storage in yandex/rep, it looked like a good idea, but 
a year later it was obvious that this was the wrong decision. When you build a 
data pipeline or research code that should still work several years later (on 
some other installation, run by someone else), the use of pandas should be 
minimal. 

That's why I am looking for a reliable pandas substitute, which should: 
- be completely consistent with numpy, and fail when something isn't 
  implemented or is impossible
- introduce fewer new abstractions; nobody wants to learn 
  one-more-way-to-manipulate-the-data, especially other researchers
- possibly be less convenient for interactive data munging
  - in particular, fewer methods is ok
- lead to code that is easy to interpret and can hardly be misinterpreted
- not be super slow; datasets of 1-10 gigabytes are a normal situation
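
As a rough illustration of these requirements, here is a minimal sketch built 
on the "dict of arrays" idea mentioned earlier in the thread. The class name 
ColumnTable and all of its behaviour are hypothetical, not an existing 
library: a string key returns a column, any other key is passed to numpy 
unchanged as a positional row index, and everything else fails loudly.

```python
import numpy


class ColumnTable(object):
    """Hypothetical minimal table: named columns, no row labels."""

    def __init__(self, **columns):
        self._columns = {}
        for name, values in columns.items():
            column = numpy.asarray(values)
            if column.ndim != 1:
                raise ValueError("column %r must be one-dimensional" % name)
            self._columns[name] = column
        lengths = set(len(column) for column in self._columns.values())
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")

    def __getitem__(self, key):
        # A string selects one column and returns the numpy array itself.
        if isinstance(key, str):
            return self._columns[key]
        # Anything that is not implemented fails instead of guessing.
        if isinstance(key, tuple):
            raise TypeError("only one-dimensional row indexing is implemented")
        if not isinstance(key, slice) and numpy.ndim(key) == 0:
            raise TypeError("scalar row keys are not implemented in this sketch")
        # Slices, integer lists/arrays and boolean masks go to numpy unchanged,
        # so row selection is positional and follows numpy's own rules.
        selected = dict((name, column[key]) for name, column in self._columns.items())
        return ColumnTable(**selected)


table = ColumnTable(a=numpy.arange(10), b=numpy.arange(10, 20))
head = table[:2]                  # first two rows
picked = table[[0, 1]]            # the same two rows
masked = table[table['a'] > 5]    # rows where column 'a' exceeds 5
```

Because each column is stored as its own array, columns of different dtypes 
stay separate, and the only new abstraction is the string key for column 
access.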

Well, that's it. 
Sorry for the long letter.

Alex.



> On 22 Feb 2017, at 18:38, Matthew Harrigan <[email protected]> wrote:
> 
> Alex,
> 
> Can you please post some code showing exactly what you are trying to do and 
> any issues you are having, particularly the "irritating problems with its row 
> indexing and some other problems" you quote above?
> 
> On Wed, Feb 22, 2017 at 10:34 AM, Robert McLeod <[email protected] 
> <mailto:[email protected]>> wrote:
> Just as a note, Appveyor supports uploading modules to "public websites":
> 
> https://packaging.python.org/appveyor/
> 
> The main issue I see with this is that my PyPI password is stored on my 
> machine in a plain text file.  I'm not sure whether there's a way to 
> provide Appveyor with an SSH key instead.
> 
> On Wed, Feb 22, 2017 at 4:23 PM, Alex Rogozhnikov <[email protected] 
> <mailto:[email protected]>> wrote:
> Hi Francesc, 
> thanks a lot for your reply and for your impressive work on bcolz! 
> 
> Bcolz seems to emphasize compression, which is not of much interest to me, 
> but the ctable and chunked operations look very appropriate to me now. 
> (Of course, I'll need to test it a lot before I can say this for sure; 
> that's my current impression.)
> 
> My strongest concern with bcolz so far is that it seems to be far from 
> trivial to install on Windows systems, while pip provides numpy binaries 
> for most (or all?) OSes. 
> I haven't built pip binary wheels myself, but is it hard / impossible to 
> prepare pip-installable binaries?
> 
>> You can change shapes of numpy arrays, but that usually involves copies of 
>> the whole container.
> sure, but this is ok for me, as I plan to organize column editing in 
> 'batches', so this should rarely require copying. 
> It would be nice to see an example to understand how deep I need to go 
> inside numpy.
> 
> Cheers, 
> Alex. 
>  
> 
> 
> 
>> On 22 Feb 2017, at 17:03, Francesc Alted <[email protected]> wrote:
>> 
>> Hi Alex,
>> 
>> 2017-02-22 12:45 GMT+01:00 Alex Rogozhnikov <[email protected] 
>> <mailto:[email protected]>>:
>> Hi Nathaniel, 
>> 
>> 
>>> pandas
>> 
>> yup, the idea was to have minimal pandas.DataFrame-like storage (which I was 
>> using for a long time), 
>> but without irritating problems with its row indexing and some other 
>> problems like interaction with matplotlib.
>> 
>>> A dict of arrays?
>> 
>> 
>> that's what I started with and implemented, but at some point I decided 
>> that I was reinventing the wheel and that numpy already has something. In 
>> principle, I can ignore this 'column-oriented' storage requirement, but 
>> it may turn out to be quite slow-ish if the dtype's size is large.
>> 
>> Suggestions are welcome.
>> 
>> You may want to try bcolz:
>> 
>> https://github.com/Blosc/bcolz
>> 
>> bcolz is a columnar storage, basically as you require, but data is 
>> compressed by default even when stored in-memory (although you can disable 
>> compression if you want to).
>> 
>>  
>> 
>> Another strange question:
>> in general, it is assumed that once a numpy.array is created, its shape 
>> does not change. 
>> But if I want to keep the same recarray and change its dtype and/or shape, 
>> is there a way to do this?
>> 
>> You can change shapes of numpy arrays, but that usually involves copies of 
>> the whole container.  With bcolz you can change length and add/del columns 
>> without copies.  If your containers are large, it is better to tell bcolz 
>> their estimated final size in advance.  See:
>> 
>> http://bcolz.blosc.org/en/latest/opt-tips.html
>> 
>> Francesc
>>  
>> 
>> Thanks, 
>> Alex.
>> 
>> 
>> 
>>> On 22 Feb 2017, at 3:53, Nathaniel Smith <[email protected]> wrote:
>>> 
>>> On Feb 21, 2017 3:24 PM, "Alex Rogozhnikov" <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> Ah, got it. Thanks, Chris!
>>> I thought a recarray could only be one-dimensional (like tables with named 
>>> columns).
>>> 
>>> Maybe it's better to ask directly what I was looking for: 
>>> something that works like a table with named columns (but no labelling for 
>>> rows), and keeps data (of different dtypes) in a column-by-column way (and 
>>> this is numpy, not pandas). 
>>> 
>>> Is there such a magic thing?
>>> 
>>> Well, that's what pandas is for...
>>> 
>>> A dict of arrays?
>>> 
>>> -n
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> [email protected] <mailto:[email protected]>
>>> https://mail.scipy.org/mailman/listinfo/numpy-discussion 
>>> <https://mail.scipy.org/mailman/listinfo/numpy-discussion>
>> 
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Francesc Alted
> 
> 
> 
> 
> 
> -- 
> Robert McLeod, Ph.D.
> Center for Cellular Imaging and Nano Analytics (C-CINA)
> Biozentrum der Universität Basel
> Mattenstrasse 26, 4058 Basel
> Work: +41.061.387.3225 <tel:+41%2061%20387%2032%2025>
> [email protected] <mailto:[email protected]>
> [email protected] <mailto:[email protected]>
> [email protected] <mailto:[email protected]>
> 
> 
> 
