Re: Extracting dataframe column with multiple conditions on row values

Edmondo Giovannozzi Sun, 09 Jan 2022 11:13:41 -0800

On Saturday, 8 January 2022 at 23:01:13 UTC+1, Avi Gross wrote:
> I have to wonder whether something that looks like HOMEWORK should be 
> answered in detail, let alone using methods beyond what is expected in class. 
> The goal of this particular project seems to be to find one (or perhaps more) 
> columns in some data structure like a dataframe that match two conditions 
> (containing a copy of two numbers in one or more places) and then KNOW what 
> column it was in. The reason I say that is because the next fairly 
> nonsensical request is to then explicitly return what that column has in the 
> row called 2, meaning the third row. 
> Perhaps stated another way: "what is the item in row/address 2 of the column 
> that somewhere contains two additional specified contents called key1 and 
> key2"  
> My guess is that if the instructor wanted this to be solved using methods 
> being taught, then loops may well be a way to go. Python and numpy/pandas 
> make it often easier to do things with columns rather than in rows across 
> them, albeit many things allow you to specify an axis. So, yes, transposing 
> is a way to go that transforms the problem in a way easier to solve without 
> thinking deeply. Some other languages allow relatively easy access in both 
> directions of horizontally versus vertically. And this may be an example 
> where solving it as a list of lists may also be easier.  
> Is the solution at the bottom a solution? Before I check, I want to see if I 
> understand the required functionality and ask if it is completely and 
> unambiguously specified.  
> For completeness, the question being asked may need to deal with a uniqueness 
> issue. Is it possible multiple columns match the request and thus more than 
> one answer is required to be returned? Is the row called 2 allowed to 
> participate in the match or must it be excluded and the question becomes to 
> find one (or more) columns that contain key1 somewhere else than row 2 and 
> key2 (which may have to be different than key1 or not) somewhere else and 
> THEN provide the corresponding entry from row 2 and that (or those) 
> column(s)? 
> So in looking at the solution offered, what exactly was this supposed to do 
> when dft is the transpose?
> idt = (dft[0] == 1) & (dft[1] == 5)
> Was the code (way below in this message) tried out or just written for us to 
> ponder? I tried it. I got an answer of:
>     0  1  2
> V2  1  5  6
> That is not my understanding of what was requested. Row 2 (shown transposed 
> as a column) is being shown as a whole. The request was for item "2" which 
> would be just 6. Something more like this: 
> print(dft[idt][2]) 
> 
> But the code makes no sense to me. It seems to explicitly test the first column 
> (0) to see if it contains a 1 and then the second column (1) to see if it 
> contains a 5. Not sure who cares about this hard-wired query as this is not 
> my understanding of the question. You want any of the original three rows 
> (now transposed)  tested to see if it contains BOTH.  
> I may have read the requirements wrong or it may not be explained well. Until 
> I am sure what is being asked and whether there is a good reason someone 
> wants a different solution, I see no reason to provide yet another 
> solution. But just for fun, assuming dft contains the transpose of the 
> original data, will this work? 
> first = dft[dft.values == key1]
> second = first[first.values == key2]
> print(second[2]) 
> I get a 6 as an answer and suppose it could be done in one more complex 
> expression if needed! LOL!
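
If the intended reading is the one described above (a column that contains both keys anywhere), a hedged sketch on the untransposed frame might look like the following. It assumes the OP's sample data and lets row 2 take part in the match, which may or may not be what is wanted; the names has_both and answer are only illustrative.

```python
import io
import pandas as pd

CSV = """V0,V1,V2,V3
4,1,1,1
6,4,5,2
2,3,6,7
"""
df = pd.read_csv(io.StringIO(CSV))
key1, key2 = 1, 5

# One boolean per column: does the column contain key1 anywhere
# AND key2 anywhere? (Row 2 is allowed to take part in the match.)
has_both = df.eq(key1).any() & df.eq(key2).any()

# Value(s) in the row labelled 2 for every matching column.
answer = df.loc[2, has_both]
print(answer.to_dict())
```

With the sample data only V2 contains both a 1 and a 5, so more than one match would show up here as several entries rather than being silently dropped.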
> -----Original Message----- 
> From: Edmondo Giovannozzi <[email protected]> 
> To: [email protected] 
> Sent: Sat, Jan 8, 2022 8:00 am 
> Subject: Re: Extracting dataframe column with multiple conditions on row 
> values 
> 
> On Saturday, 8 January 2022 at 02:21:40 UTC+1, dn wrote: 
> > Salaam Mahmood, 
> > On 08/01/2022 12.07, Mahmood Naderan via Python-list wrote: 
> > > I have a csv file like this 
> > > V0,V1,V2,V3 
> > > 4,1,1,1 
> > > 6,4,5,2 
> > > 2,3,6,7 
> > > 
> > > And I want to search two rows for a match and find the column. For 
> > > example, I want to search row[0] for 1 and row[1] for 5. The 
> > > corresponding 
> > > column is V2 (which is the third column). Then I want to return the value 
> > > at row[2] and the found column. The result should be 6 then. 
> > Not quite: isn't the "found column" also required? 
> > > I can manually extract the specified rows (with index 0 and 1 which are 
> > > fixed) and manually iterate over them like arrays to find a match. Then I 
> > Perhaps this idea has been influenced by a similar solution in another 
> > programming language. May I suggest that the better-answer you seek lies 
> > in using Python idioms (as well as Python's tools)... 
> > > key1 = 1 
> > > key2 = 5 
> > Fine, so far - excepting that this 'problem' is likely to be a small 
> > part of some larger system. Accordingly, consider writing it as a 
> > function. In which case, these two "keys" will become 
> > function-parameters (and the two 'results' become return-values). 
> > > row1 = df.iloc[0] # row=[4,1,1,1] 
> > > row2 = df.iloc[1] # row=[6,4,5,2] 
> > This is likely not native-Python. Let's create lists for 'everything', 
> > just-because: 
> > 
> > >>> headings = [ "V0","V1","V2","V3" ] 
> > >>> row1 = [4,1,1,1] 
> > >>> row2 = [6,4,5,2] 
> > >>> results = [ 2,3,6,7 ] 
> > 
> > 
> > Note how I'm using the Python REPL (in a "terminal", type "python" (as 
> > appropriate to your OpSys) at the command-line). IMHO the REPL is a 
> > grossly under-rated tool, and is a very good means towards 
> > trial-and-error, and learning by example. Highly recommended! 
> > 
> > 
> > > for i in range(len(row1)): 
> > 
> > This construction is very much a "code smell" for thinking that it is 
> > not "pythonic". (and perhaps the motivation for this post) 
> > 
> > In Python (compared with many other languages) the "for" loop should 
> > actually be pronounced "for-each". In other words when we pair the 
> > code-construct with a list (for example): 
> > 
> > for each item in the list the computer should perform some suite of 
> > commands. 
> > 
> > (the "suite" is everything 'inside' the for-each-loop - NB my 
> > 'Python-betters' will quickly point-out that this feature is not limited 
> > to Python-lists, but will work with any "iterable" - ref: 
> > https://docs.python.org/3/tutorial/controlflow.html#for-statements) 
> > 
> > 
> > Thus: 
> > 
> > >>> for item in headings: print( item ) 
> > ... 
> > V0 
> > V1 
> > V2 
> > V3 
> > 
> > 
> > The problem is that when working with matrices/matrixes, a math 
> > background equips one with the idea of indices/indexes, eg the 
> > ubiquitous subscript-i. Accordingly, when reading 'math' where a formula 
> > uses the upper-case Greek "sigma" character, remember that it means "for 
> > all" or "for each"! 
> > 
> > So, if Python doesn't use indexing or "pointers", how do we deal with 
> > the problem? 
> > 
> > Unfortunately, at first glance, the pythonic approach may seem 
> > more-complicated or even somewhat convoluted, but once the concepts 
> > (and/or the Python idioms) are learned, it is quite manageable (and 
> > applicable to many more applications than matrices/matrixes!)... 
> > >     if row1[i] == key1: 
> > >         for j in range(len(row2)): 
> > >             if row2[j] == key2: 
> > >                 res = df.iloc[:,j] 
> > >                 print(res) # 6 
> > > 
> > > Is there any way to use built-in function for a more efficient code? 
> > This is where your idea bears fruit! 
> > 
> > There is a Python "built-in function": zip(), which will 'join' lists. 
> > NB do not become confused between zip() and zip archive/compressed files! 
> > 
> > Most of the time reference book and web-page examples show zip() being 
> > used to zip-together two lists into a single data-construct (which is an 
> > iterable(!)). However, zip() will actually zip-together multiple (more 
> > than two) "iterables". As the manual says: 
> > 
> > «zip() returns an iterator of tuples, where the i-th tuple contains the 
> > i-th element from each of the argument iterables.» 
> > 
> > Ah, so that's where the math-idea of subscript-i went! It has become 
> > 'hidden' in Python's workings - or putting that another way: Python 
> > looks after the subscripting for us (and given that 'out by one' errors 
> > in pointers is a major source of coding-error in other languages, 
> > thank-you very much Python!) 
> > 
> > First re-state the source-data as Python lists, (per above) - except 
> > that I recommend the names be better-chosen to be more meaningful (to 
> > your application)! 
> > 
> > 
> > Now, (in the REPL) try using zip(): 
> > 
> > >>> zip( headings, row1, row2, results ) 
> > <zip object at 0x7f655cca6bc0> 
> > 
> > Does that seem a very good illustration? Not really, but re-read the 
> > quotation from the manual (above) where it says that zip returns an 
> > iterator. If we want to see the values an iterator will produce, then 
> > turn it into an iterable data-structure, eg: 
> > 
> > >>> list( zip( headings, row1, row2, results ) ) 
> > [('V0', 4, 6, 2), ('V1', 1, 4, 3), ('V2', 1, 5, 6), ('V3', 1, 2, 7)] 
> > 
> > or, to see things more clearly, let me re-type it as: 
> > 
> > [ 
> > ('V0', 4, 6, 2), 
> > ('V1', 1, 4, 3), 
> > ('V2', 1, 5, 6), 
> > ('V3', 1, 2, 7) 
> > ] 
> > 
> > 
> > What we now see is actually a "transpose" of the original 'matrix' 
> > presented in the post/question! 
> > 
> > (NB Python will perform this layout for us - read about the pprint library) 
> > 
> > 
> > Another method which can also be employed (and which will illustrate the 
> > loop required to code the eventual-solution(!)) is that Python's next() 
> > will extract the first row of the transpose: 
> > 
> > >>> row = next( zip( headings, row1, row2, results ) ) 
> > >>> row 
> > ('V0', 4, 6, 2) 
> > 
> > 
> > This is all-well-and-good, but that result is a tuple of four items 
> > (corresponding to one column in the way the source-data was explained). 
> > 
> > If we need to consider the four individual data-items, that can be 
> > improved using a Python feature called "tuple unpacking". Instead of the 
> > above delivering a tuple which is then assigned to "row", the tuple can 
> > be assigned to four "identifiers", eg 
> > 
> > >>> heading, row1_item, row2_item, result= next( zip( headings, row1, 
> > row2, results ) ) 
> > 
> > (apologies about email word-wrapping - this is a single line of 
> > Python-code) 
> > 
> > 
> > Which, to prove the case, could be printed: 
> > 
> > >>> heading, row1_item, row2_item, result 
> > ('V0', 4, 6, 2) 
> > 
> > 
> > (ref: 
> > https://docs.python.org/3/tutorial/datastructures.html?highlight=tuple%20unpacking#tuples-and-sequences)
> >  
> > 
> > 
> > Thus, if we repeatedly ask for the next() row from the zip-ped 
> > transpose, eventually it will respond with the row starting 'V2' - which 
> > is the desired-result, ie the row containing the 1, the 5, and the 6 - 
> > and if you follow-through using the REPL, will be clearly visible. 
> > 
> > 
> > Finally, 'all' that is required, is a for-each-loop which will iterate 
> > across/down the zip object, one tuple (row of the transpose) at a time, 
> > AND perform the "tuple-unpacking" all in one command, with an 
> > if-statement to detect the correct row/column: 
> > 
> > >>> for *tuple-unpacking* in *zip() etc*: 
> > ...     if row1_item == *what?* and row2_item == *what?*: 
> > ...         print( *which* and *which identifier* ) 
> > ... 
> > V2 6 
> > 
> > Yes, three lines. It's as easy as that! 
> > (when you know how) 
> > 
> > Worse: when you become more expert, you'll be able to compress all of 
> > that down into a single-line solution - but it won't be as "readable" as 
> > is this! 
> > 
> > 
> > NB this question has a 'question-smell' of 'homework', so I'll not 
> > complete the code for you - this is something *you* asked to learn and 
> > the best way to learn is by 'doing' (not by 'reading'). 
> > 
> > However, please respond with your solution, or any further question 
> > (with the next version of the code so-far, per this first-post - which 
> > we appreciate!) 
> > 
> > Regardless, you asked 'the right question' (curiosity is the key to 
> > learning) and in the right way/manner. Well done! 
> > 
> > 
> > NBB the above code-outline does not consider the situation where the 
> > search fails/the keys are not found! 
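
On the not-found case: one possible way to handle it with the zip() approach is to collect every match into a list, so that a failed search simply gives an empty list. This is only a sketch under the assumption that the data are the four lists defined earlier; the function name find_matches is made up for illustration.

```python
headings = ["V0", "V1", "V2", "V3"]
row1 = [4, 1, 1, 1]
row2 = [6, 4, 5, 2]
results = [2, 3, 6, 7]


def find_matches(key1, key2):
    """Return (heading, result) for each column whose row1/row2 items equal the keys."""
    return [
        (heading, result)
        for heading, row1_item, row2_item, result in zip(headings, row1, row2, results)
        if row1_item == key1 and row2_item == key2
    ]


found = find_matches(1, 5)    # keys present in column V2
missing = find_matches(9, 9)  # keys present nowhere
print(found)    # [('V2', 6)]
print(missing)  # []
```

Returning a list also covers the case where several columns match, at the cost of the caller having to check its length.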
> > 
> > 
> > For further information, please review: 
> > https://docs.python.org/3/library/functions.html?highlight=zip#zip 
> > 
> > Also, further to the above discussion of combining lists and loops: 
> > https://docs.python.org/3/tutorial/datastructures.html?highlight=zip#looping-techniques
> >  
> > 
> > and with a similar application (to this post): 
> > https://docs.python.org/3/faq/programming.html?highlight=zip#how-can-i-sort-one-list-by-values-from-another-list
> >  
> > 
> > -- 
> > Regards, 
> 
> You may also transpose your dataset. Then the index will become your column 
> name and the column name become your index: 
> To read your dataset: 
> 
> import pandas as pd 
> import io 
> 
> DN = """ 
> V0,V1,V2,V3 
> 4,1,1,1 
> 6,4,5,2 
> 2,3,6,7 
> """ 
> df = pd.read_csv(io.StringIO(DN)) 
> 
> Transpose it: 
> 
> dft = df.T 
> 
> Find all the index with your condition: 
> 
> idt = (dft[0] == 1) & (dft[1] == 5) 
> 
> Print the columns that satisfy your condition: 
> 
> print(dft[idt]) 
> 
> As you see, without explicit loop.
> -- 
> https://mail.python.org/mailman/listinfo/python-list


I was just showing that one can transpose the dataframe and use logical 
indexing, as the OP was asking for a fast and efficient solution. It is just a 
hint and not a complete solution. And, of course, one has to put everything 
inside a function.
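
For example, a minimal sketch of such a function, assuming the OP's sample data; the name find_value and the target_row parameter are just illustrative choices, not something the OP specified:

```python
import io
import pandas as pd


def find_value(df, key1, key2, target_row=2):
    """Map column name -> value at target_row for every column
    whose row 0 equals key1 and whose row 1 equals key2."""
    dft = df.T  # column names become the index, row labels become columns
    mask = (dft[0] == key1) & (dft[1] == key2)  # logical indexing
    return dft.loc[mask, target_row].to_dict()


CSV = """V0,V1,V2,V3
4,1,1,1
6,4,5,2
2,3,6,7
"""
df = pd.read_csv(io.StringIO(CSV))
result = find_value(df, 1, 5)
print(result)  # {'V2': 6}
```

Returning a dict gives both the found column and the value, and it comes back empty when nothing matches.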
 
And yes, I tested what I have written.

There could be other problems, of course, if some columns are not numeric, but 
all these details depend on the problem at hand.
But I may not have completely understood the problem presented by the OP.
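
To illustrate the non-numeric point, here is a small sketch with a made-up variant of the data in which V3 holds a string. This is my own assumption, not the OP's file. Transposing such a frame turns every column into object dtype, so one alternative is to skip the transpose and build the mask from the rows directly:

```python
import io
import pandas as pd

# Hypothetical variant of the sample data: V3 starts with a non-numeric value.
CSV = """V0,V1,V2,V3
4,1,1,x
6,4,5,2
2,3,6,7
"""
df = pd.read_csv(io.StringIO(CSV))

print(df.dtypes)    # V3 is no longer an integer column
print(df.T.dtypes)  # after transposing, every column is object

# Compare row 0 and row 1 across the columns, without transposing.
mask = (df.loc[0] == 1) & (df.loc[1] == 5)
matches = df.loc[2, mask]
print(matches.to_dict())
```

Note that the "2" in V3 is read as a string here, so it would not compare equal to the integer 2; whether that matters depends on the real data.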

Cheers, :-)

-- 
https://mail.python.org/mailman/listinfo/python-list
