Re: [Numpy-discussion] record data previous to Numpy use

2017-07-07 Thread Derek Homeier
On 07 Jul 2017, at 4:24 PM, paul.carr...@free.fr wrote:
> 
> PS: I'd like to use the following code, which is much more familiar to me :-)
> 
> COMP_list = np.asarray(COMP_list, dtype = np.float64) 
> i = np.arange(1,NumberOfRecords,2)
> COMP_list = np.delete(COMP_list,i)
> 
Not sure about the background of this, but if you want to remove every
second entry (if NumberOfRecords is the full length of the list, that is),
it would always be preferable to make the changes on the list, or even
better, to extract only the entries you want:

COMP_list = np.asarray(COMP_list[::2], dtype = np.float64)
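
For illustration, a quick check with dummy data that the two forms agree
(the data and lengths here are made up):

import numpy as np

COMP_list = list(range(10))
NumberOfRecords = len(COMP_list)
# Delete-based version: drop the odd indices after conversion.
a = np.delete(np.asarray(COMP_list, dtype=np.float64),
              np.arange(1, NumberOfRecords, 2))
# Slicing version: keep every second entry from the start.
b = np.asarray(COMP_list[::2], dtype=np.float64)
assert np.array_equal(a, b)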

Have a good weekend

Derek



Re: [Numpy-discussion] record data previous to Numpy use

2017-07-07 Thread paul . carrico
Hi (all) 

Once again I would like to thank the community for the support.

I am making progress in moving my code to Python.

In my mind some parts remain quite ugly (and burn my eyes), but it works
and I'll optimize it in the future; so far I can work with the data in a
single read.

I built some blocks in a text file and used Astropy to read it (it works
fine now - I'll test pandas as a next step).

Not finished yet, but significant progress compared to yesterday :-)

Have a good weekend

Paul 

PS: I'd like to use the following code, which is much more familiar
to me :-)

COMP_list = np.asarray(COMP_list, dtype = np.float64) 
i = np.arange(1,NumberOfRecords,2)
COMP_list = np.delete(COMP_list,i) 

On 2017-07-07 12:04, Derek Homeier wrote:

> On 7 Jul 2017, at 1:59 am, Chris Barker  wrote: 
> 
>> On Thu, Jul 6, 2017 at 10:55 AM,   wrote:
>> It's just a thought, but for huge files one solution might be to first
>> split/write/build the array in a dedicated file (2x O(n) iterations: one
>> to identify the block sizes, an additional one to read and write), and
>> then to load it into memory and work with numpy -
>> 
>> I may have your use case confused, but if you have a huge file with multiple 
>> "blocks" in it, there shouldn't be any problem with loading it in one go -- 
>> start at the top of the file and load one block at a time (accumulating in a 
>> list) -- then you only have the memory overhead issues for one block at a 
>> time, should be no problem.
>> 
>> at this stage the dimensions are known and some packages will be fast and
>> better suited (pandas or astropy, as suggested).
>> 
>> pandas at least is designed to read variations of CSV files; I'm not sure
>> whether you could use the optimized part to read an array out of part of an
>> open file from a particular point or not.
> The fragmented structure indeed would probably be the biggest challenge,
> although astropy, while it cannot read from an open file handle, at least
> should be able to directly parse a block of input lines, e.g. collected
> with readline() in a list. I guess pandas could do the same.
> Alternatively the line positions of the blocks could be passed directly
> to the data_start and data_end keywords, but that would require opening
> and at least partially reading the file multiple times. In fact, if the
> blocks are relatively small, the overhead may be too large to make it
> worth using the faster parsers - if you look at the timing notebooks I
> had linked to earlier, it takes at least ~100 input lines before they
> show any speed gains over genfromtxt, and ~1000 to see roughly linear
> scaling. In that case writing your own customised reader could be the
> best option after all.
> 
> Cheers,
> Derek


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-07 Thread Derek Homeier
On 7 Jul 2017, at 1:59 am, Chris Barker  wrote:
> 
> On Thu, Jul 6, 2017 at 10:55 AM,   wrote:
> It's just a thought, but for huge files one solution might be to first
> split/write/build the array in a dedicated file (2x O(n) iterations: one
> to identify the block sizes, an additional one to read and write), and
> then to load it into memory and work with numpy -
> 
> 
> I may have your use case confused, but if you have a huge file with multiple 
> "blocks" in it, there shouldn't be any problem with loading it in one go -- 
> start at the top of the file and load one block at a time (accumulating in a 
> list) -- then you only have the memory overhead issues for one block at a 
> time, should be no problem.
> 
> at this stage the dimensions are known and some packages will be fast and
> better suited (pandas or astropy, as suggested).
> 
> pandas at least is designed to read variations of CSV files; I'm not sure
> whether you could use the optimized part to read an array out of part of an
> open file from a particular point or not.
> 
The fragmented structure indeed would probably be the biggest challenge,
although astropy, while it cannot read from an open file handle, at least
should be able to directly parse a block of input lines, e.g. collected
with readline() in a list. I guess pandas could do the same.
Alternatively the line positions of the blocks could be passed directly
to the data_start and data_end keywords, but that would require opening
and at least partially reading the file multiple times. In fact, if the
blocks are relatively small, the overhead may be too large to make it
worth using the faster parsers - if you look at the timing notebooks I
had linked to earlier, it takes at least ~100 input lines before they
show any speed gains over genfromtxt, and ~1000 to see roughly linear
scaling. In that case writing your own customised reader could be the
best option after all.

Cheers,
Derek


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-06 Thread Chris Barker
On Thu, Jul 6, 2017 at 10:55 AM,  wrote:
>
> It's just a thought, but for huge files one solution might be to first
> split/write/build the array in a dedicated file (2x O(n) iterations: one
> to identify the block sizes, an additional one to read and write), and
> then to load it into memory and work with numpy -
>

I may have your use case confused, but if you have a huge file with
multiple "blocks" in it, there shouldn't be any problem with loading it in
one go -- start at the top of the file and load one block at a time
(accumulating in a list) -- then you only have the memory overhead issues
for one block at a time, should be no problem.

> at this stage the dimensions are known and some packages will be fast and
> better suited (pandas or astropy, as suggested).
>
pandas at least is designed to read variations of CSV files; I'm not sure
whether you could use the optimized part to read an array out of part of an
open file from a particular point or not.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-06 Thread Robert Kern
On Thu, Jul 6, 2017 at 3:19 AM,  wrote:
>
> Thanks Robert for your effort - I'll have a look at it
>
> ... the goal is to be guided in how to proceed (and to understand), not
to have a "ready-made solution" ... but I honestly appreciate it :-)

Sometimes it's easier to just write the code than to try to explain in
prose what to do. :-)

--
Robert Kern


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-06 Thread paul . carrico
Thanks all for your advice

Well, many things to look at, but it's obvious now that I first have to
work on a (better) strategy than the one I was considering previously
(i.e. loading all the files and results in one step).

It's just a thought, but for huge files one solution might be to first
split/write/build the array in a dedicated file (2x O(n) iterations: one
to identify the block sizes, an additional one to read and write), and
then to load it into memory and work with numpy - at this stage the
dimensions are known and some packages will be fast and better suited
(pandas or astropy, as suggested).

Thanks all for your time and help 

Paul


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-06 Thread Chris Barker
OK, you have two performance "issues"

1) memory use: if you need to read a file to build a numpy array and
don't know how big it is when you start, you need to accumulate the
values first and then make an array out of them. And numpy arrays are
fixed size, so they cannot efficiently accumulate values.

The usual way to handle this is to read the data into a list with
.append() or the like, and then make an array from it. This is quite
fast -- lists are fast and efficient at appending. However, you are then
storing (at least) a pointer and a Python float object for each value,
which is a lot more memory than a single float value in a numpy array,
and you need to make the array from it, which means you have the full
list and all its Python floats AND the array in memory at once.
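
A minimal sketch of the idiom (the file name is illustrative):

import numpy as np

values = []
with open("data.txt") as f:
    for line in f:
        values.append(float(line))          # list.append is amortized O(1)
arr = np.asarray(values, dtype=np.float64)  # one conversion at the end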

Frankly, computers have a lot of memory these days, so this is a non-issue
in most cases.

Nonetheless, a while back I wrote an extendable numpy array object to
address just this issue. You can find the code on GitHub here:

https://github.com/PythonCHB/NumpyExtras/blob/master/numpy_extras/accumulator.py

I have not tested it with recent numpy versions, but I expect it still
works fine. It's also py2, but it wouldn't take much to port.

In practice, it uses less memory than the "build a list, then make it into
an array" approach, but isn't any faster unless you add (.extend) a bunch
of values at once rather than one at a time. (If you do it one at a time,
the whole Python-float-to-numpy-float conversion and function call
overhead takes just as long.)

But it will generally be as fast or faster than using a list, and use
less memory, so it is a fine basis for a big ASCII file reader.
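
The idea in toy form (a sketch of the technique only, not the actual class
in that repo):

import numpy as np

class Accumulator(object):
    """Growable 1-D array: double the buffer whenever it fills up."""
    def __init__(self, dtype=np.float64):
        self._buf = np.empty(16, dtype=dtype)
        self._n = 0

    def extend(self, values):
        values = np.asarray(values, dtype=self._buf.dtype)
        need = self._n + len(values)
        if need > len(self._buf):
            # Geometric growth keeps the amortized cost per element O(1).
            new_buf = np.empty(max(need, 2 * len(self._buf)),
                               dtype=self._buf.dtype)
            new_buf[:self._n] = self._buf[:self._n]
            self._buf = new_buf
        self._buf[self._n:need] = values
        self._n = need

    @property
    def array(self):
        return self._buf[:self._n]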

However, it looks like while your files may be huge, they hold a number of
arrays, so each array may not be large enough to bother with any of this.

2) parsing and converting overhead -- for the most part, Python/numpy
text file reading code reads the text into a Python string, converts it
to Python number objects, then puts them in a list or converts them to
native numbers in an array. This whole process is a bit slow (though
reading files is slow anyway, so usually not worth worrying about, which
is why the built-in file reading methods do this). To improve on this,
you need code that reads the file and parses it in C and puts the values
straight into a numpy array without passing through Python. This is what
the pandas (and I assume astropy) text file readers do.

But if you don't want those dependencies, there is the "fromfile()"
function in numpy -- it is not very robust, but if your files are
well-formed, then it is quite fast. So your code would look something like:

with open(the_filename) as infile:
    while True:
        line = infile.readline()
        if not line:
            break
        # work with line to figure out the next block
        if ready_to_read_a_block:
            arr = np.fromfile(infile, dtype=np.int32, count=num_values, sep=' ')
            # sep specifies that you are reading text, not binary!
            arr.shape = the_shape_it_should_be


But Robert is right -- get it to work with the "usual" methods -- i.e. put
the numbers in a list, then make an array out of it -- first, and then
worry about making it faster.

-CHB


On Thu, Jul 6, 2017 at 1:49 AM,  wrote:

> Dear All
>
>
> First of all thanks for the answers and the information (I'll dig into
> it) and let me try to add comments on what I want to:
>
>1. My ASCII file mainly contains data (float and int) in a single column
>2. (it is not always the case but I can easily manage it – as well I
>saw I can use the 'split' instruction if necessary)
>3. Comments/text indicate the beginning of a block, immediately
>followed by the number of sub-blocks
>4. So I need to read/record all the values in order to build a matrix
>before working on it (using Numpy & vectorization)
>   - Columns 2 and 3 have been added for further processing
>   - The '0' values will be specifically treated afterward
>
>
> Numpy won't be a problem I guess (I did some basic tests and I'm quite
> confident on how to proceed), but I'm really blocked on recording the data … I'm
> trying to find a way to efficiently read and record the data in a matrix:
>
>- avoiding dynamic memory allocation (here meaning Python's 'append',
>not np.append),
>- dealing with huge ASCII files: the latest file I got contains more
>than *60 million lines*
>
>
> Please find in attachment an extract of the input format
> (‘example_of_input’), and the matrix I’m trying to create and manage with
> Numpy
>
>
> Thanks again for your time
>
> Paul
>
>
> ###
>
> ##BEGIN *-> line number x in the original file*
> 42   *-> indicates the number of sub-blocks*
> 1 *-> number of the 1st sub-block*
> 6 *-> gives how many values belong to the sub-block*
> 12
> 47
> 2
> 46
> 3
> 51
> ….
> 13  *-> another type of sub-block with 25 values*
> 25
> 15
> 88

Re: [Numpy-discussion] record data previous to Numpy use

2017-07-06 Thread paul . carrico
Thanks Robert for your effort - I'll have a look at it

... the goal is to be guided in how to proceed (and to understand), not
to have a "ready-made solution" ... but I honestly appreciate it :-)

Paul 

On 2017-07-06 11:51, Robert Kern wrote:

> On Thu, Jul 6, 2017 at 1:49 AM,  wrote:
>> 
>> Dear All
>> 
>> First of all thanks for the answers and the information (I'll dig into
>> it) and let me try to add comments on what I want to:
>> 
>> My ASCII file mainly contains data (float and int) in a single column
>> (it is not always the case but I can easily manage it - as well I saw I can
>> use the 'split' instruction if necessary)
>> Comments/text indicate the beginning of a block, immediately followed by the
>> number of sub-blocks
>> So I need to read/record all the values in order to build a matrix before
>> working on it (using Numpy & vectorization)
>> 
>> Columns 2 and 3 have been added for further processing
>> The '0' values will be specifically treated afterward
>> 
>> 
>> Numpy won't be a problem I guess (I did some basic tests and I'm quite
>> confident on how to proceed), but I'm really blocked on recording the data ... I'm
>> trying to find a way to efficiently read and record the data in a matrix:
>> 
>> avoiding dynamic memory allocation (here meaning Python's 'append',
>> not np.append),
> 
> Although you can avoid some list appending in your case (because the blocks 
> self-describe their length), I would caution you against prematurely avoiding 
> it. It's often the most natural way to write the code in Python, so go ahead 
> and write it that way first. Once you get it working correctly, but it's too 
> slow or memory intensive, then you can puzzle over how to preallocate the 
> numpy arrays later. But quite often, it's fine. In this case, the reading and 
> handling of the text data itself is probably the bottleneck, not appending to 
> the lists. As I said, Python lists are cleverly implemented to make appending 
> fast. Accumulating numbers in a list then converting to an array afterwards 
> is a well-accepted numpy idiom. 
> 
>> dealing with huge ASCII files: the latest file I got contains more than 60
>> million lines
>> 
>> Please find in attachment an extract of the input format 
>> ('example_of_input'), and the matrix I'm trying to create and manage with 
>> Numpy
>> 
>> Thanks again for your time
> 
> Try something like the attached. The function will return a list of blocks. 
> Each block will itself be a list of numpy arrays, which are the sub-blocks 
> themselves. I didn't bother adding the first three columns to the sub-blocks 
> or trying to assemble them all into a uniform-width matrix by padding with 
> trailing 0s. Since you say that the trailing 0s are going to be "specially 
> treated afterwards", I suspect that you can more easily work with the lists 
> of arrays instead. I assume floating-point data rather than trying to figure 
> out whether int or float from the data. The code can handle multiple data 
> values on one line (not especially well-tested, but it ought to work), but it 
> assumes that the number of sub-blocks, index of the sub-block, and sub-block
> size are each on their own line. The code gets a little more complicated if
> that's not the case.
> 
> --
> Robert Kern 


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-06 Thread Robert Kern
On Thu, Jul 6, 2017 at 1:49 AM,  wrote:
>
> Dear All
>
> First of all thanks for the answers and the information (I'll dig into
it) and let me try to add comments on what I want to:
>
> My ASCII file mainly contains data (float and int) in a single column
> (it is not always the case but I can easily manage it – as well I saw I
can use the 'split' instruction if necessary)
> Comments/text indicate the beginning of a block, immediately followed by
the number of sub-blocks
> So I need to read/record all the values in order to build a matrix before
working on it (using Numpy & vectorization)
>
> Columns 2 and 3 have been added for further processing
> The '0' values will be specifically treated afterward
>
>
> Numpy won't be a problem I guess (I did some basic tests and I'm quite
confident on how to proceed), but I'm really blocked on recording the data … I'm
trying to find a way to efficiently read and record the data in a matrix:
>
> avoiding dynamic memory allocation (here meaning Python's 'append',
not np.append),

Although you can avoid some list appending in your case (because the blocks
self-describe their length), I would caution you against prematurely
avoiding it. It's often the most natural way to write the code in Python,
so go ahead and write it that way first. Once you get it working correctly,
but it's too slow or memory intensive, then you can puzzle over how to
preallocate the numpy arrays later. But quite often, it's fine. In this
case, the reading and handling of the text data itself is probably the
bottleneck, not appending to the lists. As I said, Python lists are
cleverly implemented to make appending fast. Accumulating numbers in a list
then converting to an array afterwards is a well-accepted numpy idiom.

> dealing with huge ASCII files: the latest file I got contains more than 60
million lines
>
> Please find in attachment an extract of the input format
(‘example_of_input’), and the matrix I’m trying to create and manage with
Numpy
>
> Thanks again for your time

Try something like the attached. The function will return a list of blocks.
Each block will itself be a list of numpy arrays, which are the sub-blocks
themselves. I didn't bother adding the first three columns to the
sub-blocks or trying to assemble them all into a uniform-width matrix by
padding with trailing 0s. Since you say that the trailing 0s are going to
be "specially treated afterwards", I suspect that you can more easily work
with the lists of arrays instead. I assume floating-point data rather than
trying to figure out whether int or float from the data. The code can
handle multiple data values on one line (not especially well-tested, but it
ought to work), but it assumes that the number of sub-blocks, index of the
sub-block, and sub-block size are each on their own line. The code gets a
little more complicated if that's not the case.

--
Robert Kern
from __future__ import print_function

import numpy as np


def write_random_file(filename, n_blocks=42, n_elems=60*1000*1000):
    q, r = divmod(n_elems, n_blocks)
    block_lengths = [q] * n_blocks
    block_lengths[-1] += r
    with open(filename, 'w') as f:
        print('##BEGIN', file=f)
        print(n_blocks, file=f)
        for i, block_length in enumerate(block_lengths, 1):
            print(i, file=f)
            print(block_length, file=f)
            block = np.random.randint(0, 1000, size=block_length)
            for x in block:
                print(x, file=f)


def read_blocked_file(filename):
    blocks = []
    with open(filename, 'r') as f:
        # Loop over all blocks.
        while True:
            # Consume lines until the start of the next block.
            # Unfortunately, we can't use `for line in f:` because we need to
            # use `f.readline()` later.
            line = f.readline()
            found_block = True
            while '##BEGIN' not in line:
                line = f.readline()
                if line == '':
                    # We've reached the end of the file.
                    found_block = False
                    break
            if not found_block:
                # We iterated to the end of the file. Break out of the `while`
                # loop.
                break

            # Read the number of sub-blocks.
            # This assumes that it is on a line all by itself.
            n_subblocks = int(f.readline())
            subblocks = []
            for i_subblock in range(1, n_subblocks + 1):
                read_i_subblock = int(f.readline())
                # These ought to match.
                if read_i_subblock != i_subblock:
                    raise RuntimeError("Mismatched sub-block index")
                # Read the size of the sub-block.
                subblock_size = int(f.readline())
                # Allocate an array for the contents.
                subblock_data = np.empty(subblock_size, dtype=float)
                i = 0
                # The attachment was truncated here in the archive; the rest
                # of the loop is reconstructed from the behaviour described
                # above: fill the array, allowing several values on one line.
                while i < subblock_size:
                    line = f.readline()
                    if line == '':
                        raise RuntimeError("Unexpected end of file in a sub-block")
                    for token in line.split():
                        subblock_data[i] = float(token)
                        i += 1
                subblocks.append(subblock_data)
            blocks.append(subblocks)
    return blocks
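
A quick usage sketch of the two functions above (the file name and sizes
are illustrative):

if __name__ == '__main__':
    write_random_file('blocks.txt', n_blocks=3, n_elems=30)
    blocks = read_blocked_file('blocks.txt')
    print(len(blocks))       # 1: the generated file has a single '##BEGIN'
    print(len(blocks[0]))    # 3 sub-blocks
    print(blocks[0][0][:5])  # first values of the first sub-block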

Re: [Numpy-discussion] record data previous to Numpy use

2017-07-06 Thread paul . carrico
Dear All

First of all thanks for the answers and the information (I'll dig
into it) and let me try to add comments on what I want to:

* My ASCII file mainly contains data (float and int) in a single column
* (it is not always the case but I can easily manage it - as well I
saw I can use the 'split' instruction if necessary)
* Comments/text indicate the beginning of a block, immediately
followed by the number of sub-blocks

* So I need to read/record all the values in order to build a matrix
before working on it (using Numpy & vectorization)

* Columns 2 and 3 have been added for further processing
* The '0' values will be specifically treated afterward

Numpy won't be a problem I guess (I did some basic tests and I'm quite
confident on how to proceed), but I'm really blocked on recording the data … I'm
trying to find a way to efficiently read and record the data in a matrix:

* avoiding dynamic memory allocation (here meaning Python's 'append',
not np.append),
* dealing with huge ASCII files: the latest file I got contains more
than 60 MILLION LINES

Please find in attachment an extract of the input format
('example_of_input'), and the matrix I'm trying to create and manage
with Numpy 

Thanks again for your time 

Paul 

###

##BEGIN _-> line number x in the original file_
42   _-> indicates the number of sub-blocks_
1 _-> number of the 1st sub-block_
6 _-> gives how many values belong to the sub-block_
12
47
2
46
3
51
….
13  _-> another type of sub-block with 25 values_
25
15
88
21
42
22
76
19
89
0
18
80
23
38
24
73
20
81
0
90
0
41
0
39
0
77
…
42 _-> another type of sub-block with 2 values_
2
115
109

###

THE MATRIX RESULT

1 0 0 6 12 47 2 46 3 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 6 3 50 11 70 12 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 8 11 50 3 49 4 54 5 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 8 12 70 11 66 9 65 10 68 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 8 2 47 12 68 10 44 1 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 8 5 56 6 58 7 61 11 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 8 11 61 7 60 8 63 9 66 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 19 12 47 2 46 3 51 0 13 97 14 92 15 96 0 72 0 48 0 52 0 0 0 0 0 0
9 0 0 19 13 97 14 92 15 96 0 16 86 17 82 18 85 0 95 0 91 0 90 0 0 0 0 0 0
10 0 0 19 3 50 11 70 12 51 0 15 89 19 94 13 96 0 52 0 71 0 72 0 0 0 0 0 0
11 0 0 19 15 89 19 94 13 96 0 18 81 20 84 16 85 0 90 0 77 0 95 0 0 0 0 0 0
12 0 0 25 3 49 4 54 5 57 11 50 0 15 88 21 42 22 76 19 89 0 52 0 53 0 55 0 71
13 0 0 25 15 88 21 42 22 76 19 89 0 18 80 23 38 24 73 20 81 0 90 0 41 0 39 0 77
14 0 0 25 11 66 9 65 10 68 12 70 0 19 78 25 99 26 98 13 94 0 71 0 67 0 69 0 72
….

### 

AN EXAMPLE OF THE CODE I STARTED TO WRITE 

# -*- coding: utf-8 -*-

import time, sys, os, re
import itertools
import numpy as np

PATH = str(os.path.abspath(''))
input_file_name = '/example_of_input.txt'

## check if the file exists, then if it's empty or not
if (os.path.isfile(PATH + input_file_name)):
    if (os.stat(PATH + input_file_name).st_size > 0):
        ## go through the file in order to find specific sentences
        ## specific blocks will be defined afterward
        Block_position = []; j = 0
        with open(PATH + input_file_name, "r") as data:
            for line in data:
                if '##BEGIN' in line:
                    Block_position.append(j)
                j = j + 1
        ## just some tests to get all the values
        #i = 0
        #data = np.zeros( (505), dtype=np.int )
        #with open(PATH + input_file_name, "r") as f:
        #    for i in range (0,505):
        #        data[i] = int(f.read(Block_position[0]+1+i))
        #        print ("i = ", i)
        #    for line in itertools.islice(f,Block_position[0],516):
        #        data[i]=f.read(0+i)
        #        i=i+1
    else:
        print "The file %s is empty : post-processing cannot be performed !!!\n" % input_file_name
else:
    print "Error : the file %s does not exist: post-processing stops !!!\n" % input_file_name


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-05 Thread Robert McLeod
While I'm going to bet that the fastest way to build an ndarray from ASCII
is with an `io.BytesIO` stream, NumPy does have a function to load from text,
`numpy.loadtxt`, that works well enough for most purposes.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html

It's hard to tell from the original post if the ASCII is being continuously
generated or not. If it's being produced in an ongoing fashion, then a
stream object is definitely the way to go, as the array chunks can be
produced by `numpy.frombuffer()`.

https://docs.python.org/3/library/io.html

https://docs.scipy.org/doc/numpy/reference/generated/numpy.frombuffer.html
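
For illustration, minimal sketches of both routes (the sample data is
made up):

import io

import numpy as np

# Parsing ASCII text that is already in memory:
text = u"1.0 2.0 3.0\n4.0 5.0 6.0\n"
arr = np.loadtxt(io.StringIO(text))

# Decoding raw binary data from a bytes buffer:
buf = np.arange(6, dtype=np.float64).tobytes()
arr2 = np.frombuffer(buf, dtype=np.float64)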

Robert


On Wed, Jul 5, 2017 at 3:21 PM, Robert Kern  wrote:

> On Wed, Jul 5, 2017 at 5:41 AM,  wrote:
> >
> > Dear all
> >
> > I'm sorry if my question is too basic (not fully in relation to Numpy –
> while it is to build matrices and to work with Numpy afterward), but I'm
> spending a lot of time and effort to find a way to record data from an
> ASCII file and reassign it into a matrix/array … unsuccessfully!
> >
> > The only way I found is to use the 'append()' instruction, involving
> dynamic memory allocation. :-(
>
> Are you talking about appending to Python list objects? Or the np.append()
> function on numpy arrays?
>
> In my experience, it is usually fine to build a list with the `.append()`
> method while reading the file of unknown size and then converting it to an
> array afterwards, even for dozens of millions of lines. The list object is
> quite smart about reallocating memory so it is not that expensive. You
> should generally avoid the np.append() function, though; it is not smart.
>
> > From my current experience under Scilab (a Matlab-like scientific
> solver), it is well known:
> >
> > Step 1 : matrix initialization like 'np.zeros((n,n))'
> > Step 2 : record the data
> > and write it in the matrix (step 3)
> >
> > I’m obviously influenced by my current experience, but I’m interested in
> moving to Python and its packages
> >
> > For huge ASCII files (involving tens of millions of lines), my strategy
> is to work by 'blocks' as:
> >
> > Find the line index of the beginning and the end of one block (this
> implies that the file is read once)
> > Read the block
> > (process repeated on the different other blocks)
>
> Are the blocks intrinsic parts of the file? Or are you just trying to
> break up the file into fixed-size chunks?
>
> > I tried different codes such as the one below, but each time Python is
> telling me I cannot mix iteration and record method
> >
> > #
> >
> > position = []; j=0
> > with open(PATH + file_name, "r") as rough_data:
> >     for line in rough_data:
> >         if my_criteria in line:
> >             position.append(j) ## huge blocks but limited in number
> >         j=j+1
> >
> > i = 0
> > blockdata = np.zeros( (size_block), dtype=np.float)
> > with open(PATH + file_name, "r") as f:
> >     for line in itertools.islice(f,1,size_block):
> >         blockdata [i]=float(f.readline() )
>
> For what it's worth, this is the line that is causing the error that you
> describe. When you iterate over the file with the `for line in
> itertools.islice(f, ...):` loop, you already have the line text. You don't
> (and can't) call `f.readline()` to get it again. It would mess up the
> iteration if you did and cause you to skip lines.
>
> By the way, it is useful to help us help you if you copy-paste the exact
> code that you are running as well as the full traceback instead of
> paraphrasing the error message.
>
> --
> Robert Kern


-- 
Robert McLeod, Ph.D.
robert.mcl...@unibas.ch
robert.mcl...@bsse.ethz.ch 
robbmcl...@gmail.com


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-05 Thread Derek Homeier
Hi Paul,

> ASCII is the input format (and the only one I can deal with)
> 
> HDF5 might be an export format (it's one of the options) in order to
> speed up the post-processing stage
> 
> Paul
> 
> On 2017-07-05 20:19, Thomas Caswell wrote:
> 
>> Are you tied to ASCII files?   HDF5 (via h5py or pytables) might be a better 
>> storage format for what you are describing.
>>  
>> Tom
>> 
>> On Wed, Jul 5, 2017 at 8:42 AM  wrote:
>> Dear all
>> 
>> 
>> 
>> I'm sorry if my question is too basic (not fully in relation to Numpy –
>> while it is to build matrices and to work with Numpy afterward), but I'm
>> spending a lot of time and effort to find a way to record data from an
>> ASCII file and reassign it into a matrix/array ... unsuccessfully!
>> 
>> 
>> 
>> The only way I found is to use the 'append()' instruction, involving
>> dynamic memory allocation. :-(
>> 
>> 
>> 
>> From my current experience under Scilab (a Matlab-like scientific solver),
>> it is well known:
>> 
>>  • Step 1 : matrix initialization like 'np.zeros((n,n))'
>>  • Step 2 : record the data
>>  • and write it in the matrix (step 3)
>> 
>> 
>> I'm obviously influenced by my current experience, but I'm interested in 
>> moving to Python and its packages
>> 
>> 
>> 
>> For huge ASCII files (involving tens of millions of lines), my strategy is
>> to work by 'blocks' as:
>> 
>>  • Find the line index of the beginning and the end of one block (this
>> implies that the file is read once)
>>  • Read the block
>>  • (process repeated on the different other blocks)
>> 
>> 
>> I tried different codes such as the one below, but each time Python is
>> telling me I cannot mix iteration and record method
>> 

if you are indeed tied to using ASCII input data, you will of course have
to deal with significant performance handicaps, but there are at least
some gains to be had by using an input parser that does not do all the
conversions at the Python level, but with a compiled (C) reader - either
pandas, as Tom already mentioned, or astropy - see e.g.
https://github.com/dhomeier/astropy-notebooks/blob/master/io/ascii/ascii_read_bench.ipynb
for the almost one order of magnitude speed gains you may get.
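
For example, a minimal sketch of astropy's fast reader (the file name is
illustrative):

from astropy.io import ascii

# fast_reader=True asks for the compiled C parser where possible.
table = ascii.read('block.txt', format='no_header', fast_reader=True)
data = table.as_array()   # numpy (structured) array view of the table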

In your example it is not clear what "record" method you were trying to
use that raised the errors you mention - we would certainly need a full
traceback of the error to find out more.

In principle your approach of allocating the numpy matrix first and
reading the data in chunks makes sense, as it will avoid the much larger
temporary lists created during read-in.
But it might be more convenient to just read the block into a list of
lines and pass that to a higher-level reader like np.genfromtxt or the
faster astropy.io.ascii.read or pandas.read_csv
to speed up the parsing of the numbers themselves.
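
A minimal sketch of that approach with np.genfromtxt (the sample lines
are made up; recent numpy accepts a list of strings directly):

import numpy as np

block_lines = ["1 0 0 6", "2 0 0 6", "3 0 0 8"]
arr = np.genfromtxt(block_lines)
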
That said, on most systems these readers should still be able to handle
files up to a few 10^8 items (expect ~ 25-55 bytes of memory for each
input number allocated for temporary lists), so if saving memory is not
an absolute priority, directly reading the entire file might still be
the best choice (and would also save the first pass reading).

Cheers,
Derek



Re: [Numpy-discussion] record data previous to Numpy use

2017-07-05 Thread Robert Kern
On Wed, Jul 5, 2017 at 5:41 AM,  wrote:
>
> Dear all
>
> I'm sorry if my question is too basic (not fully in relation to Numpy –
while it is to build matrices and to work with Numpy afterward), but I'm
spending a lot of time and effort to find a way to record data from an
ASCII file and reassign it into a matrix/array … unsuccessfully!
>
> The only way I found is to use the 'append()' instruction, involving
dynamic memory allocation. :-(

Are you talking about appending to Python list objects? Or the np.append()
function on numpy arrays?

In my experience, it is usually fine to build a list with the `.append()`
method while reading the file of unknown size and then converting it to an
array afterwards, even for dozens of millions of lines. The list object is
quite smart about reallocating memory so it is not that expensive. You
should generally avoid the np.append() function, though; it is not smart.
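
To illustrate the difference with a toy loop:

import numpy as np

data = []
for x in range(1000):
    data.append(float(x))       # amortized O(1) per append
arr = np.array(data)

# np.append() copies the whole array on every call -- avoid it in loops:
arr2 = np.array([])
for x in range(1000):
    arr2 = np.append(arr2, x)   # O(n) per call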

> From my current experience under Scilab (a Matlab-like scientific
solver), it is well known:
>
> Step 1 : matrix initialization like 'np.zeros((n,n))'
> Step 2 : record the data
> and write it in the matrix (step 3)
>
> I’m obviously influenced by my current experience, but I’m interested in
moving to Python and its packages
>
> For huge ASCII files (involving tens of millions of lines), my strategy
is to work by 'blocks' as:
>
> Find the line index of the beginning and the end of one block (this
implies that the file is read once)
> Read the block
> (process repeated on the different other blocks)

Are the blocks intrinsic parts of the file? Or are you just trying to break
up the file into fixed-size chunks?

> I tried different codes such as the one below, but each time Python is telling
me I cannot mix iteration and record method
>
> #
>
> position = []; j=0
> with open(PATH + file_name, "r") as rough_data:
>     for line in rough_data:
>         if my_criteria in line:
>             position.append(j) ## huge blocks but limited in number
>         j=j+1
>
> i = 0
> blockdata = np.zeros( (size_block), dtype=np.float)
> with open(PATH + file_name, "r") as f:
>     for line in itertools.islice(f,1,size_block):
>         blockdata [i]=float(f.readline() )

For what it's worth, this is the line that is causing the error that you
describe. When you iterate over the file with the `for line in
itertools.islice(f, ...):` loop, you already have the line text. You don't
(and can't) call `f.readline()` to get it again. It would mess up the
iteration if you did and cause you to skip lines.
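
A corrected sketch of that loop (the file name and block size here are
illustrative stand-ins for the PATH + file_name and size_block of the
posted code):

import itertools

import numpy as np

file_name = "data.txt"
size_block = 100
blockdata = np.zeros(size_block, dtype=np.float64)
with open(file_name, "r") as f:
    # Use `line` directly instead of calling f.readline() in the loop.
    for i, line in enumerate(itertools.islice(f, 1, size_block + 1)):
        blockdata[i] = float(line)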

By the way, it is useful to help us help you if you copy-paste the exact
code that you are running as well as the full traceback instead of
paraphrasing the error message.

--
Robert Kern


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-05 Thread paul . carrico
Hi 

Thanks for the answer: 

ASCII is the input format (and the only one I can deal with)

HDF5 might be an export format (it's one of the options) in order to
speed up the post-processing stage

Paul 

On 2017-07-05 20:19, Thomas Caswell wrote:

> Are you tied to ASCII files?   HDF5 (via h5py or pytables) might be a better 
> storage format for what you are describing. 
> 
> Tom 
> 
> On Wed, Jul 5, 2017 at 8:42 AM  wrote: 
> 
>> Dear all 
>> 
>> I'm sorry if my question is too basic (not fully in relation to Numpy -
>> while it is to build matrices and to work with Numpy afterward), but I'm
>> spending a lot of time and effort to find a way to record data from an
>> ASCII file and reassign it into a matrix/array ... unsuccessfully!
>> 
>> The only way I found is to use the _'append()'_ instruction, involving
>> dynamic memory allocation. :-(
>> 
>> From my current experience under Scilab (a Matlab-like scientific solver),
>> it is well known:
>> 
>> * Step 1 : matrix initialization like _'np.zeros((n,n))'_
>> * Step 2 : record the data
>> * and write it in the matrix (step 3)
>> 
>> I'm obviously influenced by my current experience, but I'm interested in 
>> moving to Python and its packages 
>> 
>> For huge ASCII files (involving tens of millions of lines), my strategy is
>> to work by 'blocks' as:
>> 
>> * Find the line index of the beginning and the end of one block (this
>> implies that the file is read once)
>> * Read the block
>> * (process repeated on the different other blocks)
>> 
>> I tried different codes such as the one below, but each time Python is
>> telling me I CANNOT MIX ITERATION AND RECORD METHOD
>> 
>> #
>>
>> position = []; j=0
>>
>> with open(PATH + file_name, "r") as rough_data:
>>     for line in rough_data:
>>         if _my_criteria_ in line:
>>             position.append(j) ## huge blocks but limited in number
>>         j=j+1
>>
>> i = 0
>> blockdata = np.zeros( (size_block), dtype=np.float)
>> with open(PATH + file_name, "r") as f:
>>     for line in itertools.islice(f,1,size_block):
>>         blockdata [i]=float(f.readline() )
>>         i=i+1
>>
>> #
>> 
>> Should I work on lists using f.readlines() (but this implies loading the
>> whole file in memory)?
>> 
>> Additional question: can I record with vectorization, with
>> 'i = np.arange(0,65406)', if I stay with the previous example?
>> 
>> Thanks for your time and comprehension 
>> 
>> (I'm obviously interested by doc references speaking about those specific 
>> tasks) 
>> 
>> Paul 
>> 
>> PS: for Chuck: I'll have a look at the pandas package, but in a code
>> optimization step :-) (nearly 2000 doc pages)
>> 


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-05 Thread Thomas Caswell
Are you tied to ASCII files?   HDF5 (via h5py or pytables) might be a
better storage format for what you are describing.
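
For reference, a minimal h5py sketch (file and dataset names are
illustrative):

import h5py
import numpy as np

arr = np.arange(10, dtype=np.float64)
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('block_0', data=arr)   # write one block

with h5py.File('data.h5', 'r') as f:
    back = f['block_0'][:]                  # read it back as a numpy array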

Tom

On Wed, Jul 5, 2017 at 8:42 AM  wrote:

> Dear all
>
>
> I'm sorry if my question is too basic (not fully in relation to Numpy –
> while it is to build matrices and to work with Numpy afterward), but I'm
> spending a lot of time and effort to find a way to record data from an
> ASCII file and reassign it into a matrix/array … unsuccessfully!
>
>
> The only way I found is to use the *'append()'* instruction, involving
> dynamic memory allocation. :-(
>
>
> From my current experience under Scilab (a Matlab-like scientific solver),
> it is well known:
>
>1. Step 1 : matrix initialization like *'np.zeros((n,n))'*
>2. Step 2 : record the data
>3. and write it in the matrix (step 3)
>
>
> I’m obviously influenced by my current experience, but I’m interested in
> moving to Python and its packages
>
>
> For huge ASCII files (involving tens of millions of lines), my strategy
> is to work by 'blocks' as:
>
>- Find the line index of the beginning and the end of one block (this
>implies that the file is read once)
>- Read the block
>- (process repeated on the different other blocks)
>
>
> I tried different codes such as the one below, but each time Python is telling me *I
> cannot mix iteration and record method*
>
> #
>
> position = []; j=0
>
> with open(PATH + file_name, "r") as rough_data:
>     for line in rough_data:
>         if *my_criteria* in line:
>             position.append(j) ## huge blocks but limited in number
>         j=j+1
>
> i = 0
> blockdata = np.zeros( (size_block), dtype=np.float)
> with open(PATH + file_name, "r") as f:
>     for line in itertools.islice(f,1,size_block):
>         blockdata [i]=float(f.readline() )
>         i=i+1
>
>  #
>
>
> Should I work on lists using f.readlines() (but this implies loading the
> whole file in memory)?
>
>
> *Additional question*: can I record with vectorization, with
> 'i = np.arange(0,65406)', if I stay with the previous example?
>
>
>
> Thanks for your time and comprehension
>
> (I’m obviously interested by doc references speaking about those specific
> tasks)
>
>
> Paul
>
>
> PS: for Chuck: I'll have a look at the pandas package, but in a code
> optimization step :-) (nearly 2000 doc pages)
>
>
>
>
>


[Numpy-discussion] record data previous to Numpy use

2017-07-05 Thread paul . carrico
Dear all 

I'm sorry if my question is too basic (not fully in relation to Numpy -
while it is to build matrices and to work with Numpy afterward), but I'm
spending a lot of time and effort to find a way to record data from an
ASCII file and reassign it into a matrix/array … unsuccessfully!

The only way I found is to use the _'append()'_ instruction, involving
dynamic memory allocation. :-(

From my current experience under Scilab (a Matlab-like scientific
solver), it is well known:

* Step 1 : matrix initialization like _'np.zeros((n,n))'_
* Step 2 : record the data
* and write it in the matrix (step 3)

I'm obviously influenced by my current experience, but I'm interested in
moving to Python and its packages 

For huge ASCII files (involving tens of millions of lines), my strategy
is to work by 'blocks' as:

* Find the line index of the beginning and the end of one block (this
implies that the file is read once)
* Read the block
* (process repeated on the different other blocks)

I tried different codes such as the one below, but each time Python is
telling me I CANNOT MIX ITERATION AND RECORD METHOD

#

position = []; j=0

with open(PATH + file_name, "r") as rough_data:
    for line in rough_data:
        if _my_criteria_ in line:
            position.append(j) ## huge blocks but limited in number
        j=j+1

i = 0
blockdata = np.zeros( (size_block), dtype=np.float)
with open(PATH + file_name, "r") as f:
    for line in itertools.islice(f,1,size_block):
        blockdata [i]=float(f.readline() )
        i=i+1

#

Should I work on lists using f.readlines() (but this implies loading the
whole file in memory)?

Additional question: can I record with vectorization, with
'i = np.arange(0,65406)', if I stay with the previous example?

Thanks for your time and comprehension 

(I'm obviously interested by doc references speaking about those
specific tasks) 

Paul 

PS: for Chuck: I'll have a look at the pandas package, but in a code
optimization step :-) (nearly 2000 doc pages)