On 15Jun2019 14:51, Sean Murphy <mhysnm1...@gmail.com> wrote:
I am not sure how to tackle this issue. I am using Windows 10 and Python 3.6 from Activestate.

I have a list of x number of elements. Some of the elements have similar
words in them. For example:

Dog food Pal
Dog Food Pal qx1323
Cat food kitty
Absolute cleaning inv123
Absolute Domestic cleaning inv 222
Absolute d 3333
Fitness first 02/19
Fitness first

I'm going to assume that you have a list of strings, each being a line from a file.

I wish to remove duplicates. I could use the collection.Count method. This
fails due to the strings are not unique, only some of the words are.

You need to define this more tightly. Suppose the above were your input. What would it look like after "removing duplicates"? By providing an explicit example of what you expect afterwards it is easier for us to understand you, and will also help you with your implementation.

Do you intend to discard the second occurrence of every word, turning line 2 above into "qx1323"? Or to remove similar lines, for some definition of "similar",
which might discard line 2 above?

Your code examples below seem to suggest that you want to discard words you've already seen.
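
If that first reading is what you mean, a minimal sketch might look like the following. It assumes words are compared case-insensitively (which is what turning line 2 into "qx1323" would require, since "Food" and "food" differ only in case); the names "seen" and "cleaned" are just for illustration.

```python
# First three sample lines from the question.
lines = [
    "Dog food Pal",
    "Dog Food Pal qx1323",
    "Cat food kitty",
]

seen = set()    # case-folded words encountered so far (assumption: ignore case)
cleaned = []    # a new list; the original "lines" is left untouched
for line in lines:
    kept_words = []
    for word in line.split():
        key = word.casefold()
        if key not in seen:
            seen.add(key)
            kept_words.append(word)
    cleaned.append(" ".join(kept_words))

print(cleaned)
```

Whether that output is what you want is exactly the question you need to answer before writing the real code.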

My
thinking and is only rough sudo code as I am not sure how to do this and

Aside: "pseudo", not "sudo".

wish to learn and not sure how to do without causing traceback errors. I
want to delete the match pattern from the list of strings. Below is my
attempt and I hope this makes sense.

description = load_files() # returns a list
for text in description:
   words = text.split()
   for i in enumerate(words):

enumerate() yields a sequence of (i, v) pairs, so you need both i and v in the loop:

 for i, word in enumerate(words):

Or you need the loop variable to be a tuple and to pull out the enumeration counter and the associated value inside the loop:

 for x in enumerate(words):
   i, word = x

       Word = ' '.join(words[:i])

Variable names in Python are case sensitive. You want "word", not "Word".

However, if you really want each word of the line, you've already got that from text.split(). Note that "words" is a list, so the expression "words[:i]" is a slice of that list: the words from index 0 through to i-1. For example, ["Cat", "food"] if "i" were 2, which the join turns into "Cat food". That is the leading part of the line, not the current word.

Be careful not to slice or join a single word by mistake. The join string operation joins an iterable of strings. Unfortunately for you, a string is itself iterable: you get each character, but as a string (Python does not have a distinct "character" type, it just has single character strings). So if you joined the string "kitt" (the first 4 letters of "kitty"), you would get:

 "k i t t"

from the join. Likely not what you want.
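
To see how slicing and join behave on a list of words versus a single string, here is a quick sketch (the sample values come from your list; the result names are made up):

```python
words = "Cat food kitty".split()

# Slicing a list gives a shorter list of whole words.
first_two = words[:2]
joined_words = " ".join(first_two)

# Slicing a string gives a shorter string of characters.
prefix = "kitty"[:4]
# Joining a string iterates over its characters.
joined_chars = " ".join(prefix)

print(first_two)
print(joined_words)
print(prefix)
print(joined_chars)
```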

What _do_ you want?

       print (word)
       answer = input('Keep word?')
       if answer == 'n':
           continue
       for i, v in enumerate(description):
           if word in description[i]:
               description.pop[i]

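There are some problems here. A small one is that "description.pop[i]" needs parentheses, not square brackets, because pop is a method call: "description.pop(i)". The big one is that you're modifying a list while you're iterating over it. This is always hazardous: it usually leads to accidentally skipping elements. Or not, depending how the iteration happens.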

It is generally safer to iterate over the list and construct a distinct new list to replace it, without modifying the original list. This way the enumerate cannot get confused. So instead of discarding from the list, you conditionally add to the new list:

 new_description = []
 for line in description:
   # keep only the lines which do not contain the chosen word
   if word not in line:
     new_description.append(line)

Note the "not" above. We invert the condition ("not in" instead of "in") because we're inverting the action (appending something instead of discarding it).
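
The same filter as a runnable sketch. The sample data and the chosen word "Dog" are assumptions for illustration; the list comprehension is just a more compact spelling of the loop.

```python
descriptions = [
    "Dog food Pal",
    "Dog Food Pal qx1323",
    "Cat food kitty",
    "Fitness first",
]
word = "Dog"  # hypothetical word the user chose to remove

# Loop form: keep only the lines which do not contain the word.
kept = []
for description in descriptions:
    if word not in description:
        kept.append(description)

# Equivalent list comprehension.
kept_again = [d for d in descriptions if word not in d]

print(kept)
```

Note that "in" on strings is a substring test, so "Dog" would also match a line containing "Dogma". If you need whole-word matching, test against description.split() instead.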

However, I think you have some fundamental confusion about what you're iterating over.

I recommend that you adopt better variable names, and more formally describe your data.

If "description" is actually a list of descriptions then give it a plural name like "descriptions". When you iterate over it, you can then use the singular form for each element i.e. "description" instead of "text".

Instead of writing loops like:

 for i, v in enumerate(descriptions):

give "v" a better name, like "description". That way your code inside the loop is better described, and mistakes more obvious because the code will suddenly read badly in some way.

The initial issues I see with the above is the popping of an element from
description list will cause a error.

It often won't. Instead it will mangle your iteration because after the pop the index "i" no longer refers to what you expect, it now points one element further along.

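If the indices were computed before the loop started, for example with range(len(description)), you'll get an IndexError towards the _end_ of the loop, but only once "i" starts to exceed the length of the list (because you've been shortening it). With enumerate over the same list the loop instead just finishes early, silently.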
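
A tiny demonstration of the skipping, using made-up data with two adjacent matches:

```python
items = ["a", "b", "b", "c"]

# Hazardous: popping from the list we are iterating over.
for i, item in enumerate(items):
    if item == "b":
        items.pop(i)

# No traceback, but one "b" survives: after the first pop the second "b"
# moved down to index 1, and the loop had already moved on to index 2.
print(items)
```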

If I copy the description list into a
new list. And use the new list for the outer loop. I will receive multiple
occurrences of the same text. This could be addressed by a if test. But I am
wondering if there is a better method.

The common idiom is to leave the original unchanged and copy into a new list as in my example above. But taking a copy and iterating over that is also reasonable.

You will still have issues with the popping, because the index "i" will no longer be aligned with the modified list.

If you really want to modify in place, avoid enumerate. Instead, make "i" an index into the list as you do, but maintain it yourself. Loop from left to right in the list until you come off the end:

 i = 0
 while i < len(description):
   if ... we want to pop the element ...:
     description.pop(i)
   else:
     i = i + 1

Here we _either_ discard from the list and _do not_ advance "i", or we advance "i". Either way "i" then points at the next element, in the former case because the next element has shuffled down one position and in the latter because "i" has moved forwards. Either way "i" gets closer to the end of the list. We leave the loop when "i" gets past the end.
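
Filled in as a runnable sketch, assuming the condition is "the line contains the chosen word" (the sample data and the word "Dog" are made up for the example):

```python
descriptions = [
    "Dog food Pal",
    "Dog Food Pal qx1323",
    "Cat food kitty",
    "Fitness first",
]
word = "Dog"  # hypothetical word the user chose to remove

i = 0
while i < len(descriptions):
    if word in descriptions[i]:
        # Discard in place; do not advance, the next element is now at i.
        descriptions.pop(i)
    else:
        i += 1

print(descriptions)
```

The two adjacent "Dog" lines are both removed, which is exactly the case the enumerate version gets wrong.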

2nd code example:

description = load_files() # returns a list
search_txt = description.copy() # I have not verify if this is the right
syntax for the copy method.]

A quick way is:

 search_txt = description[:]

but lists have a .copy method which does the same thing.
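
A short check that both spellings give an independent list (sample data made up). These are shallow copies, which is fine here because the elements are strings and strings are immutable.

```python
description = ["Dog food Pal", "Cat food kitty"]

by_slice = description[:]        # copy via a full slice
by_method = description.copy()   # copy via the list method

# Changing the original does not affect either copy.
description.pop(0)

print(description)
print(by_slice)
print(by_method)
```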

for text in search_txt:
   words = text.split()
   for i in enumerate(words):
       Word = ' '.join(words[:i])
       print (word)
       answer = input('Keep word (ynq)?')
       if answer == 'n':
           continue
       elif answer = 'q':
           break
       for i, v in enumerate(description):
           if word in description[i]:
               description.pop[i]

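The inner for loop still has all the same issues as before. The outer loop is now more robust because you're iterating over the copy. Note also that "elif answer = 'q':" is a syntax error: a comparison needs "==", as in the "if" above it.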

Cheers,
Cameron Simpson <c...@cskk.id.au>