Re: help: pandas and 2d table

2024-04-16 Thread jak via Python-list

Stefan Ram ha scritto:

jak  wrote or quoted:

Stefan Ram ha scritto:

df = df.where( df == 'zz' ).stack().reset_index()
result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}

Since I don't know Pandas, I will need a month at least to understand
these 2 lines of code. Thanks again.


   Here's a technique to better understand such code:

   Transform it into a program with small statements and small
   expressions with no more than one call per statement if possible.
   (After each litte change check that the output stays the same.)

import pandas as pd

# Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
 print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )
# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

selection = df.where( df == 'zz' )
selection_stack = selection.stack()
df = selection_stack.reset_index()
df0 = df.iloc[ :, 0 ]
df1 = df.iloc[ :, 1 ]
z = zip( df0, df1 )
l = list( z )
result ={ 'zz': l }
print( result )

   I suggest to next insert print statements to print each intermediate
   value:

# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
print( 'df = \n', type( df ), ':\n"', df, '"\n' )

selection = df.where( df == 'zz' )
print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
   selection, '"\n' )

selection_stack = selection.stack()
print( 'result of stack() = \n', type( selection_stack ), ':\n"',
   selection_stack, '"\n' )

df = selection_stack.reset_index()
print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )

df0 = df.iloc[ :, 0 ]
print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )

df1 = df.iloc[ :, 1 ]
print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )

z = zip( df0, df1 )
print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )

l = list( z )
print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )

result ={ 'zz': l }
print( "value of { 'zz': l }= \n", type( result ), ':\n"',
   result, '"\n' )

print( result )

   Now you can see what each single step does!

df =
   :
"  foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1   aa   ab   zz   ad   ae   af
foo2   ba   bb   bc   bd   zz   bf
foo3   ca   zz   cc   cd   ce   zz
foo4   da   db   dc   dd   de   df
foo5   ea   eb   ec   zz   ee   ef
foo6   fa   fb   fc   fd   fe   ff "

result of where( df == 'zz' ) =
   :
"  foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1  NaN  NaN   zz  NaN  NaN  NaN
foo2  NaN  NaN  NaN  NaN   zz  NaN
foo3  NaN   zz  NaN  NaN  NaN   zz
foo4  NaN  NaN  NaN  NaN  NaN  NaN
foo5  NaN  NaN  NaN   zz  NaN  NaN
foo6  NaN  NaN  NaN  NaN  NaN  NaN "

result of stack() =
   :
" obj
foo1  foo3zz
foo2  foo5zz
foo3  foo2zz
   foo6zz
foo5  foo4zz
dtype: object "

result of reset_index() =
   :
" obj level_1   0
0  foo1foo3  zz
1  foo2foo5  zz
2  foo3foo2  zz
3  foo3foo6  zz
4  foo5foo4  zz "

value of .iloc[ :, 0 ]=
   :
" 0foo1
1foo2
2foo3
3foo3
4foo5
Name: obj, dtype: object "

value of .iloc[ :, 1 ] =
   :
" 0foo3
1foo5
2foo2
3foo6
4foo4
Name: level_1, dtype: object "

result of zip( df0, df1 )=
   :
" "

result of list( z )=
   :
" [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 
'foo4')]"

value of { 'zz': l }=
   :
" {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), 
('foo5', 'foo4')]}"

{'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), 
('foo5', 'foo4')]}

   The script reads a CSV file and stores the data in a Pandas
   DataFrame object named "df". The "index_col=0" parameter tells
   Pandas to use the first column as the index for the DataFrame,
   which is kinda like column headers.

   The "where" creates a new DataFrame selection that contains
   the same data as df, but with all values replaced by NaN (Not
   a Number) except for the values that are equal to 'zz'.

   "stack" returns a Series with a multi-level index created
   by pivoting the columns. Here it gives a Series with the
   row-col-addresses of a all the non-NaN values. The general
   meaning of "stack" might be the most complex operation of
   this script. It's explained in the pandas manual (see there).

   "reset_index" then just transforms this Series back into a
   DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
   first and second column, respectively, of that DataFrame. These
   then are zipped to get the desired form as a list of pairs.



And this is a technique very similar to reverse engineering. Thanks for
the explanation and examples. All this is really clear and I was able to
follow 

Re: help: pandas and 2d table

2024-04-16 Thread jak via Python-list

Stefan Ram ha scritto:

df = df.where( df == 'zz' ).stack().reset_index()
result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}


Since I don't know Pandas, I will need a month at least to understand
these 2 lines of code. Thanks again.
--
https://mail.python.org/mailman/listinfo/python-list


Re: help: pandas and 2d table

2024-04-13 Thread Tim Williams via Python-list
On Sat, Apr 13, 2024 at 1:10 PM Mats Wichmann via Python-list <
python-list@python.org> wrote:

> On 4/13/24 07:00, jak via Python-list wrote:
>
> doesn't Pandas have a "where" method that can do this kind of thing? Or
> doesn't it match what you are looking for?  Pretty sure numpy does, but
> that's a lot to bring in if you don't need the rest of numpy.
>
> pandas.DataFrame.where — pandas 2.2.2 documentation (pydata.org)

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: help: pandas and 2d table

2024-04-13 Thread Mats Wichmann via Python-list

On 4/13/24 07:00, jak via Python-list wrote:

Stefan Ram ha scritto:

jak  wrote or quoted:

Would you show me the path, please?


   I was not able to read xls here, so I used csv instead; Warning:
   the script will overwrite file "file_20240412201813_tmp_DML.csv"!

import pandas as pd

with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
 print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )

df = pd.read_csv( 'file_20240412201813_tmp_DML.csv' )

result = {}

for rownum, row in df.iterrows():
 iterator = row.items()
 _, rowname = next( iterator )
 for colname, value in iterator:
 if value not in result: result[ value ]= []
 result[ value ].append( ( rowname, colname ))

print( result )



In reality what I wanted to achieve was this:

     what = 'zz'
     result = {what: []}

     for rownum, row in df.iterrows():
     iterator = row.items()
     _, rowname = next(iterator)
     for colname, value in iterator:
     if value == what:
     result[what] += [(rowname, colname)]
     print(result)

In any case, thank you again for pointing me in the right direction. I
had lost myself looking for a pandas method that would do this in a
single shot or almost.




doesn't Pandas have a "where" method that can do this kind of thing? Or 
doesn't it match what you are looking for?  Pretty sure numpy does, but 
that's a lot to bring in if you don't need the rest of numpy.



--
https://mail.python.org/mailman/listinfo/python-list


Re: help: pandas and 2d table

2024-04-13 Thread jak via Python-list

Stefan Ram ha scritto:

jak  wrote or quoted:

Would you show me the path, please?


   I was not able to read xls here, so I used csv instead; Warning:
   the script will overwrite file "file_20240412201813_tmp_DML.csv"!

import pandas as pd

with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
 print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )

df = pd.read_csv( 'file_20240412201813_tmp_DML.csv' )

result = {}

for rownum, row in df.iterrows():
 iterator = row.items()
 _, rowname = next( iterator )
 for colname, value in iterator:
 if value not in result: result[ value ]= []
 result[ value ].append( ( rowname, colname ))

print( result )



In reality what I wanted to achieve was this:

what = 'zz'
result = {what: []}

for rownum, row in df.iterrows():
iterator = row.items()
_, rowname = next(iterator)
for colname, value in iterator:
if value == what:
result[what] += [(rowname, colname)]
print(result)

In any case, thank you again for pointing me in the right direction. I
had lost myself looking for a pandas method that would do this in a
single shot or almost.


--
https://mail.python.org/mailman/listinfo/python-list