Re: Catogorising strings into random versus non-random
On Sunday, December 20, 2015 at 10:22:57 PM UTC-6, Chris Angelico wrote: > DuckDuckGo doesn't give a result count, so I skipped it. Yahoo search yielded: So why bother to mention it then? Is this another one of your "pikeish" propaganda campaigns? -- https://mail.python.org/mailman/listinfo/python-list
Re: Catogorising strings into random versus non-random
On 21/12/15 16:49, Ian Kelly wrote: > On Mon, Dec 21, 2015 at 9:40 AM, duncan smith wrote: >> Finite state machine / transition matrix. Learn from some English text >> source. Then process your strings by lower casing, replacing underscores >> with spaces, removing trailing numeric characters etc. Base your score >> on something like the mean transition probability. I'd expect to see two >> pretty well separated groups of scores. > > Sounds like a case for a Hidden Markov Model. > Perhaps. That would allow the encoding of marginal probabilities and distinct transition matrices for each class - if we could learn those extra parameters. Duncan -- https://mail.python.org/mailman/listinfo/python-list
Re: Catogorising strings into random versus non-random
Steven D'Aprano writes: > Does anyone have any suggestions for how to do this? Preferably something > already existing. I have some thoughts and/or questions: I think I'd just look at the set of digraphs or trigraphs in each name and see if there are a lot that aren't found in English. > - I think nltk has a "language detection" function, would that be suitable? > - If not nltk, are there are suitable language detection libraries? I suspect these need longer strings to work. > - Is this the sort of problem that neural networks are good at solving? > Anyone know a really good tutorial for neural networks in Python? > - How about Bayesian filters, e.g. SpamBayes? You want large training sets for these approaches. -- https://mail.python.org/mailman/listinfo/python-list
Re: Catogorising strings into random versus non-random
On 21/12/2015 16:49, Ian Kelly wrote: On Mon, Dec 21, 2015 at 9:40 AM, duncan smith wrote: Finite state machine / transition matrix. Learn from some English text source. Then process your strings by lower casing, replacing underscores with spaces, removing trailing numeric characters etc. Base your score on something like the mean transition probability. I'd expect to see two pretty well separated groups of scores. Sounds like a case for a Hidden Markov Model. In which case https://pypi.python.org/pypi/Markov/0.1 would seem to be a starting point. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: Catogorising strings into random versus non-random
On Mon, Dec 21, 2015 at 9:40 AM, duncan smith wrote: > Finite state machine / transition matrix. Learn from some English text > source. Then process your strings by lower casing, replacing underscores > with spaces, removing trailing numeric characters etc. Base your score > on something like the mean transition probability. I'd expect to see two > pretty well separated groups of scores. Sounds like a case for a Hidden Markov Model. -- https://mail.python.org/mailman/listinfo/python-list
Re: Catogorising strings into random versus non-random
On 21/12/15 03:01, Steven D'Aprano wrote: > I have a large number of strings (originally file names) which tend to fall > into two groups. Some are human-meaningful, but not necessarily dictionary > words e.g.: > > > baby lions at play > saturday_morning12 > Fukushima > ImpossibleFork > > > (note that some use underscores, others spaces, and some CamelCase) while > others are completely meaningless (or mostly so): > > > xy39mGWbosjY > 9sjz7s8198ghwt > rz4sdko-28dbRW00u > > > Let's call the second group "random" and the first "non-random", without > getting bogged down into arguments about whether they are really random or > not. I wish to process the strings and automatically determine whether each > string is random or not. I need to split the strings into three groups: > > - those that I'm confident are random > - those that I'm unsure about > - those that I'm confident are non-random > > Ideally, I'll get some sort of numeric score so I can tweak where the > boundaries fall. > > Strings are *mostly* ASCII but may include a few non-ASCII characters. > > Note that false positives (detecting a meaningful non-random string as > random) is worse for me than false negatives (miscategorising a random > string as non-random). > > Does anyone have any suggestions for how to do this? Preferably something > already existing. I have some thoughts and/or questions: > > - I think nltk has a "language detection" function, would that be suitable? > > - If not nltk, are there are suitable language detection libraries? > > - Is this the sort of problem that neural networks are good at solving? > Anyone know a really good tutorial for neural networks in Python? > > - How about Bayesian filters, e.g. SpamBayes? > > > > Finite state machine / transition matrix. Learn from some English text source. Then process your strings by lower casing, replacing underscores with spaces, removing trailing numeric characters etc. Base your score on something like the mean transition probability. 
I'd expect to see two pretty well separated groups of scores. Duncan -- https://mail.python.org/mailman/listinfo/python-list
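[Editor's sketch] Duncan's recipe can be illustrated in a few lines: learn a character-bigram transition matrix from some English text, preprocess (lower-case, underscores to spaces), and score by the mean log transition probability. This is a minimal illustration, not code from the thread; the tiny inline corpus and the 1e-6 floor for unseen transitions are placeholder choices of mine.

```python
from collections import defaultdict
from math import log

def train_transitions(corpus):
    """Count character pairs and normalise to P(next char | current char)."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    probs = {}
    for a, following in counts.items():
        total = sum(following.values())
        probs[a] = {b: n / total for b, n in following.items()}
    return probs

def mean_log_prob(text, probs, floor=1e-6):
    """Mean log transition probability after simple preprocessing;
    unseen transitions fall back to a small floor instead of zero."""
    text = text.lower().replace("_", " ")
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return log(floor)
    return sum(log(probs.get(a, {}).get(b, floor)) for a, b in pairs) / len(pairs)

# Toy training text standing in for "some English text source".
corpus = ("the cat sat on the mat and the dog ran in the park "
          "while the children played in the garden all morning")
probs = train_transitions(corpus)
```

With a real corpus the two groups should separate much more cleanly than this toy training text allows.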
Re: Catogorising strings into random versus non-random
On Mon, Dec 21, 2015 at 7:25 AM, Vlastimil Brom wrote: > > baby lions at play > > saturday_morning12 > > Fukushima > > ImpossibleFork > > > > > > (note that some use underscores, others spaces, and some CamelCase) while > > others are completely meaningless (or mostly so): > > > > > > xy39mGWbosjY > > 9sjz7s8198ghwt > > rz4sdko-28dbRW00u My first thought is to search google for each word or phrase and count the results (google gives a count). For example, if you search for "xy39mGWbosjY" there is one result as of now, which is an archive of this thread. If you search for any given word or even a phrase, for example "baby lions at play", you get a much larger set of results, ~500. I assume there are many ways to search google with python; this looks like one: https://pypi.python.org/pypi/google Vincent Davis -- https://mail.python.org/mailman/listinfo/python-list
Re: Catogorising strings into random versus non-random
2015-12-21 4:01 GMT+01:00 Steven D'Aprano : > I have a large number of strings (originally file names) which tend to fall > into two groups. Some are human-meaningful, but not necessarily dictionary > words e.g.: > > > baby lions at play > saturday_morning12 > Fukushima > ImpossibleFork > > > (note that some use underscores, others spaces, and some CamelCase) while > others are completely meaningless (or mostly so): > > > xy39mGWbosjY > 9sjz7s8198ghwt > rz4sdko-28dbRW00u > > > Let's call the second group "random" and the first "non-random", without > getting bogged down into arguments about whether they are really random or > not. I wish to process the strings and automatically determine whether each > string is random or not. I need to split the strings into three groups: > > - those that I'm confident are random > - those that I'm unsure about > - those that I'm confident are non-random > > Ideally, I'll get some sort of numeric score so I can tweak where the > boundaries fall. > > Strings are *mostly* ASCII but may include a few non-ASCII characters. > > Note that false positives (detecting a meaningful non-random string as > random) is worse for me than false negatives (miscategorising a random > string as non-random). > > Does anyone have any suggestions for how to do this? Preferably something > already existing. I have some thoughts and/or questions: > > - I think nltk has a "language detection" function, would that be suitable? > > - If not nltk, are there are suitable language detection libraries? > > - Is this the sort of problem that neural networks are good at solving? > Anyone know a really good tutorial for neural networks in Python? > > - How about Bayesian filters, e.g. SpamBayes? 
> > > > > -- > Steven > > -- > https://mail.python.org/mailman/listinfo/python-list

Hi, as you probably already know, NLTK could be helpful for some parts of this task; if you can handle the most likely "word" splitting involved by underscores, CamelCase etc., you could try to tag the parts of speech of the words and interpret the results according to your needs. In the online demo http://text-processing.com/demo/tag/ your sample (with different approaches to splitting the words) yields:

baby/NN lions/NNS at/IN play/VB
saturday/NN morning/NN 12/CD
Fukushima/NNP
Impossible/JJ Fork/NNP
xy39mGWbosjY/-None-
9sjz7s8198ghwt/-None-
rz4sdko/-None- -/: 28dbRW00u/-None-

or with more splittings on case or letter-digit boundaries:

baby/NN lions/NNS at/IN play/VB
saturday/NN morning/NN 12/CD
Fukushima/NNP
Impossible/JJ Fork/NNP
xy/-None- 39/CD m/-None- G/NNP Wbosj/-None- Y/-None-
9/CD sjz/-None- 7/CD s/-None- 8198/-NONE- ghwt/-None-
rz/-None- 4/CD sdko/-None- -/: 28/CD db/-None- R/NNP W/-None- 00/-None- u/-None-

The tagset might be compatible with https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html There is sample code with comparable output to this demo: http://stackoverflow.com/questions/23953709/how-do-i-tag-a-sentence-with-the-brown-or-conll2000-tagger-chunker

For the given minimal sample, the results look useful (maybe with the exception of the capitalised words sometimes tagged as proper names - but it might not be that relevant here). Of course, no scoring is available with this approach, but you could check the proportion of recognised "words" relative to the total number of "words" in the respective filename. Training the tagger should be possible in NLTK too, but I don't have experience with that.

regards, vbr -- https://mail.python.org/mailman/listinfo/python-list
Re: Catogorising strings into random versus non-random
Am 21.12.15 um 11:53 schrieb Christian Gollwitzer:

So for the spaces, either use proper training material (some long corpus from Wikipedia or such), with punctuation removed. Then it will catch the correct probabilities at word boundaries. Or preprocess by removing the spaces.

Christian

PS: The real log-likelihood would become -infinity when some pair does not appear at all in the training set (esp. the numbers, e.g.). I used the 1/total in the defaultdict to mitigate that. You could tweak that value a bit. The larger the corpus, the sharper it will divide by itself, too.

Christian -- https://mail.python.org/mailman/listinfo/python-list
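[Editor's sketch] The add-one floor described in the PS can be isolated into a tiny helper (the helper name and toy counts below are mine; Christian's actual script builds the same thing into a defaultdict): each seen pair gets (count+1)/N and each unseen pair 1/N, with N = total count + number of distinct pairs, so log() never sees zero.

```python
from collections import Counter

def smoothed(pair_counts):
    """Add-one (Laplace-style) floor over digraph counts:
    seen pairs get (count+1)/N, unseen pairs get 1/N."""
    total = sum(pair_counts.values())
    N = total + len(pair_counts)
    return lambda pair: (pair_counts.get(pair, 0) + 1) / N

dist = smoothed(Counter({"th": 3, "he": 1}))
```

This is a pragmatic floor rather than a properly normalised distribution over every possible pair, which is fine here since only relative scores matter.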
Re: Catogorising strings into random versus non-random
Am 21.12.15 um 11:36 schrieb Steven D'Aprano:

On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:

Apfelkiste:Tests chris$ python score_my.py
-8.74 baby lions at play
-7.63 saturday_morning12
-6.38 Fukushima
-5.72 ImpossibleFork
-10.6 xy39mGWbosjY
-12.9 9sjz7s8198ghwt
-12.1 rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-9.43 bnsip atl ayba loy

Thanks Christian and Peter for the suggestion, I'll certainly investigate this further. But the scoring doesn't seem very good. "baby lions at play" is 100% English words, and ought to have a radically different score from (say) xy39mGWbosjY which is extremely non-English like. (How many English words do you know of with W, X, two Y, and J?) And yet they are only two units apart. "baby lions..." has a score almost as negative as the authentic gibberish, while Fukushima (a Japanese word) has a much less negative score.

It is the spaces, which do not occur in the training wordlist (I mentioned that above, maybe not prominently enough). /usr/share/dict/words contains one word per line. The underscore _ is probably putting the saturday morning low, while the spaces put the babies low.

Using trigraphs:

Apfelkiste:Tests chris$ python score_my.py
-11.5 baby lions at play
-9.88 saturday_morning12
-9.85 Fukushima
-7.68 ImpossibleFork
-13.4 xy39mGWbosjY
-14.2 9sjz7s8198ghwt
-14.2 rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'babylionsatplay'
-8.74 babylionsatplay
Apfelkiste:Tests chris$ python score_my.py 'saturdaymorning12'
-8.93 saturdaymorning12
Apfelkiste:Tests chris$

So for the spaces, either use proper training material (some long corpus from Wikipedia or such), with punctuation removed. Then it will catch the correct probabilities at word boundaries. Or preprocess by removing the spaces.

Christian -- https://mail.python.org/mailman/listinfo/python-list
Re: Catogorising strings into random versus non-random
On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:

> Apfelkiste:Tests chris$ python score_my.py
> -8.74 baby lions at play
> -7.63 saturday_morning12
> -6.38 Fukushima
> -5.72 ImpossibleFork
> -10.6 xy39mGWbosjY
> -12.9 9sjz7s8198ghwt
> -12.1 rz4sdko-28dbRW00u
> Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
> -9.43 bnsip atl ayba loy

Thanks Christian and Peter for the suggestion, I'll certainly investigate this further. But the scoring doesn't seem very good. "baby lions at play" is 100% English words, and ought to have a radically different score from (say) xy39mGWbosjY, which is extremely non-English-like. (How many English words do you know of with W, X, two Y, and J?) And yet they are only two units apart. "baby lions..." has a score almost as negative as the authentic gibberish, while Fukushima (a Japanese word) has a much less negative score.

Using trigraphs doesn't change that:

> -11.5 baby lions at play
> -9.85 Fukushima
> -13.4 xy39mGWbosjY

So this test appears to find that English-like words are nearly as "random" as actual random strings. But it's certainly worth looking into.

-- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Catogorising strings into random versus non-random
Am 21.12.15 um 09:24 schrieb Peter Otten: Steven D'Aprano wrote: I have a large number of strings (originally file names) which tend to fall into two groups. Some are human-meaningful, but not necessarily dictionary words e.g.: baby lions at play saturday_morning12 Fukushima ImpossibleFork (note that some use underscores, others spaces, and some CamelCase) while others are completely meaningless (or mostly so): xy39mGWbosjY 9sjz7s8198ghwt rz4sdko-28dbRW00u Let's call the second group "random" and the first "non-random", without getting bogged down into arguments about whether they are really random or not. I wish to process the strings and automatically determine whether each string is random or not. I need to split the strings into three groups: - those that I'm confident are random - those that I'm unsure about - those that I'm confident are non-random Ideally, I'll get some sort of numeric score so I can tweak where the boundaries fall. Strings are *mostly* ASCII but may include a few non-ASCII characters. Note that false positives (detecting a meaningful non-random string as random) is worse for me than false negatives (miscategorising a random string as non-random). Does anyone have any suggestions for how to do this? Preferably something already existing. I have some thoughts and/or questions: - I think nltk has a "language detection" function, would that be suitable? - If not nltk, are there are suitable language detection libraries? - Is this the sort of problem that neural networks are good at solving? Anyone know a really good tutorial for neural networks in Python? - How about Bayesian filters, e.g. SpamBayes? A dead simple approach -- look at the pairs in real words and calculate the ratio pairs-also-found-in-real-words/num-pairs Sounds reasonable. 
Building on this approach, a few simple improvements:

- calculate the log-likelihood instead, which also makes use of the frequency of the digraphs in the training set
- use trigraphs instead of digraphs
- preprocess the string (lowercase), but more sophisticated preprocessing could be an option (i.e. converting under_scores and CamelCase to spaces)

The main reason for the low score of the baby lions is the space character, I think - the word list does not contain that many spaces. Maybe one should feed in some long wikipedia article to calculate the digraph/trigraph probabilities

=
Apfelkiste:Tests chris$ cat score_my.py
from __future__ import division
from collections import Counter, defaultdict
from math import log
import sys

WORDLIST = "/usr/share/dict/words"

SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()

def extract_pairs(text):
    for i in range(len(text)-1):
        yield text.lower()[i:i+2]  # or len(text)-2 and i:i+3

def load_pairs():
    pairs = Counter()
    with open(WORDLIST) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    # normalize to sum
    total_count = sum([pairs[x] for x in pairs])
    N = total_count + len(pairs)
    dist = defaultdict(lambda: 1/N, ((x, (pairs[x]+1)/N) for x in pairs))
    return dist

def get_score(text, dist):
    ll = 0
    for i, x in enumerate(extract_pairs(text), 1):
        ll += log(dist[x])
    return ll / i

def main():
    pair_dist = load_pairs()
    for text in sys.argv[1:] or SAMPLE:
        score = get_score(text, pair_dist)
        print("%.3g %s" % (score, text))

if __name__ == "__main__":
    main()

Apfelkiste:Tests chris$ python score_my.py
-8.74 baby lions at play
-7.63 saturday_morning12
-6.38 Fukushima
-5.72 ImpossibleFork
-10.6 xy39mGWbosjY
-12.9 9sjz7s8198ghwt
-12.1 rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-9.43 bnsip atl ayba loy
Apfelkiste:Tests chris$

and using trigraphs:

Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-12.5 bnsip atl ayba loy
Apfelkiste:Tests chris$ python score_my.py
-11.5 baby lions at play
-9.88 saturday_morning12
-9.85 Fukushima
-7.68 ImpossibleFork
-13.4 xy39mGWbosjY
-14.2 9sjz7s8198ghwt
-14.2 rz4sdko-28dbRW00u
==
-- https://mail.python.org/mailman/listinfo/python-list
Re: Catogorising strings into random versus non-random
On Monday 21 December 2015 15:22, Chris Angelico wrote: > On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano > wrote: >> I have a large number of strings (originally file names) which tend to >> fall into two groups. Some are human-meaningful, but not necessarily >> dictionary words e.g.: [...] > The first thing that comes to my mind is poking the string into a > search engine and seeing how many results come back. You might need to > do some preprocessing to recognize multi-word forms (maybe a handful > of recognized cases like snake_case, CamelCase, > CamelCasewiththeLittleWordsLeftUnchanged, etc), I could possibly split the string into "words", based on CamelCase, spaces, hyphens or underscores. That would cover most of the cases. > How many of these keywords would you be looking up, and would a > network transaction (a search engine API call) for each one be too > expensive? Tens or hundreds of thousands of strings, and yes a network transaction probably would be a bit much. I'd rather not have Google or Bing be a dependency :-) -- Steve -- https://mail.python.org/mailman/listinfo/python-list
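[Editor's sketch] That splitting could be done with a couple of regular expressions (the function name below is mine, and runs of capitals like "RW" are deliberately left together):

```python
import re

def split_words(name):
    """Split a file name into candidate words on underscores, hyphens,
    spaces, and lower-to-upper CamelCase boundaries."""
    s = re.sub(r"[_\-]+", " ", name)            # separators -> spaces
    s = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", s)  # CamelCase boundary
    return s.split()
```

This covers the cases named in the post; a handful of strings (digit runs, ALLCAPS acronyms) would still need special handling.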
Re: Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random)
On Monday 21 December 2015 14:45, Ben Finney wrote:

> Steven D'Aprano writes: >> Let's call the second group "random" and the first "non-random", >> without getting bogged down into arguments about whether they are >> really random or not. > > I think we should discuss it, even at risk of getting bogged down. As > you know better than I, “random” is not an observable property of the > value, but of the process that produced it. > > So, I don't think “random” is at all helpful as a descriptor of the > criteria you need for discriminating these values. > > Can you give a better definition of what criteria distinguish the > values, based only on their observable properties?

No, not really. This *literally* is a case of "I'll know it when I see it", which suggests that some sort of machine-learning solution (neural network?) may be useful. I can train it on a bunch of strings which I can hand-classify, and let the machine pick out the correlations, then apply it to the rest of the strings. The best I can say is that the "non-random" strings either are, or consist of, mostly English words, names, or things which look like they might be English words, containing no more than a few non-ASCII characters, punctuation, or digits.

> You used “meaningless”; that seems at least more hopeful as a criterion > we can use by examining text values. So, what counts as meaningless?

Strings made up of random-looking sequences of characters, like you often see on sites like imgur or tumblr. Characters from non-Latin character sets that I can't read (e.g. Japanese, Korean, Arabic, etc). Jumbled up words, e.g. "python" is non-random, "nyohtp" would be random. [...]

> Perhaps you could measure Shannon entropy (“expected information value”) > https://en.wikipedia.org/wiki/Entropy_%28information_theory%29 as > a proxy? Or maybe I don't quite understand the criteria.

That's a possibility. 
At least, it might be able to distinguish some strings, although if I understand correctly, the two strings "python" and "nhoypt" have identical entropy, so this alone won't be sufficient. -- Steve -- https://mail.python.org/mailman/listinfo/python-list
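[Editor's note] The anagram point is easy to verify: character-level Shannon entropy depends only on symbol frequencies, never on their order, so "python" and any shuffle of it score identically. A minimal check (the helper name is mine):

```python
from collections import Counter
from math import log2

def char_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    n = len(text)
    return -sum(c / n * log2(c / n) for c in Counter(text).values())
```

So entropy alone can flag very low-variety strings, but it cannot tell English from its anagrams; it would have to be combined with something order-sensitive like a digraph model.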
Re: Catogorising strings into random versus non-random
Steven D'Aprano wrote: > I have a large number of strings (originally file names) which tend to > fall into two groups. Some are human-meaningful, but not necessarily > dictionary words e.g.: > > > baby lions at play > saturday_morning12 > Fukushima > ImpossibleFork > > > (note that some use underscores, others spaces, and some CamelCase) while > others are completely meaningless (or mostly so): > > > xy39mGWbosjY > 9sjz7s8198ghwt > rz4sdko-28dbRW00u > > > Let's call the second group "random" and the first "non-random", without > getting bogged down into arguments about whether they are really random or > not. I wish to process the strings and automatically determine whether > each string is random or not. I need to split the strings into three > groups: > > - those that I'm confident are random > - those that I'm unsure about > - those that I'm confident are non-random > > Ideally, I'll get some sort of numeric score so I can tweak where the > boundaries fall. > > Strings are *mostly* ASCII but may include a few non-ASCII characters. > > Note that false positives (detecting a meaningful non-random string as > random) is worse for me than false negatives (miscategorising a random > string as non-random). > > Does anyone have any suggestions for how to do this? Preferably something > already existing. I have some thoughts and/or questions: > > - I think nltk has a "language detection" function, would that be > suitable? > > - If not nltk, are there are suitable language detection libraries? > > - Is this the sort of problem that neural networks are good at solving? > Anyone know a really good tutorial for neural networks in Python? > > - How about Bayesian filters, e.g. SpamBayes? 
A dead simple approach -- look at the pairs in real words and calculate the ratio pairs-also-found-in-real-words/num-pairs

$ cat score.py
import sys

WORDLIST = "/usr/share/dict/words"

SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()

def extract_pairs(text):
    for i in range(len(text)-1):
        yield text[i:i+2]

def load_pairs():
    pairs = set()
    with open(WORDLIST) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    return pairs

def get_score(text, popular_pairs):
    m = 0
    for i, p in enumerate(extract_pairs(text), 1):
        if p in popular_pairs:
            m += 1
    return m/i

def main():
    popular_pairs = load_pairs()
    for text in sys.argv[1:] or SAMPLE:
        score = get_score(text, popular_pairs)
        print("%4.2f %s" % (score, text))

if __name__ == "__main__":
    main()

$ python3 score.py
0.65 baby lions at play
0.76 saturday_morning12
1.00 Fukushima
0.92 ImpossibleFork
0.36 xy39mGWbosjY
0.31 9sjz7s8198ghwt
0.31 rz4sdko-28dbRW00u

However:

$ python3 -c 'import random, sys; a = list(sys.argv[1]); random.shuffle(a); print("".join(a))' 'baby lions at play'
bnsip atl ayba loy
$ python3 score.py 'bnsip atl ayba loy'
0.65 bnsip atl ayba loy
-- https://mail.python.org/mailman/listinfo/python-list
Re: Catogorising strings into random versus non-random
On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano wrote: > I have a large number of strings (originally file names) which tend to fall > into two groups. Some are human-meaningful, but not necessarily dictionary > words e.g.: > > > baby lions at play > saturday_morning12 > Fukushima > ImpossibleFork > > > (note that some use underscores, others spaces, and some CamelCase) while > others are completely meaningless (or mostly so): > > > xy39mGWbosjY > 9sjz7s8198ghwt > rz4sdko-28dbRW00u > > I need to split the strings into three groups: > > - those that I'm confident are random > - those that I'm unsure about > - those that I'm confident are non-random > > Ideally, I'll get some sort of numeric score so I can tweak where the > boundaries fall.

The first thing that comes to my mind is poking the string into a search engine and seeing how many results come back. You might need to do some preprocessing to recognize multi-word forms (maybe a handful of recognized cases like snake_case, CamelCase, CamelCasewiththeLittleWordsLeftUnchanged, etc), but doing that manually on the above text gives me:

* baby lions at play
* saturday morning 12
* fukushima
* impossible fork
* xy 39 mgwbosjy
* 9 sjz 7 s 8198 ghwt
* rz 4 sdko 28 dbrw 00 u

Putting those into Google without quotes yields:

* About 23,800,000 results
* About 227,000,000 results
* About 32,500,000 results
* About 16,400,000 results
* About 1,180 results
* 7 results
* About 30,300 results

DuckDuckGo doesn't give a result count, so I skipped it. Yahoo search yielded:

* 6,040,000 results
* 123,000,000 results
* 3,920,000 results
* 720,000 results
* No results at all
* No results at all
* 2 results

Bing produces much more chaotic results, though:

* 34,000,000 RESULTS
* 15,600,000 RESULTS
* 11,000,000 RESULTS
* 1,620,000 RESULTS
* 5,720,000 RESULTS
* 1,580,000,000 RESULTS
* 3,380,000 RESULTS

This suggests that search engine results MAY be useful, but in some cases, tweaks may be necessary (I couldn't force Bing to do phrase search, for some reason probably related to my inexperience with it), and also that the boundary between "meaningful" and "non-meaningful" will depend on the engine used (I'd use 1,000,000 as the boundary with Google, but probably 100,000 with Yahoo). You might want to handle numerics differently, too - converting "9" into "nine" could improve the result reliability.

How many of these keywords would you be looking up, and would a network transaction (a search engine API call) for each one be too expensive?

ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random)
Steven D'Aprano writes: > Let's call the second group "random" and the first "non-random", > without getting bogged down into arguments about whether they are > really random or not. I think we should discuss it, even at risk of getting bogged down. As you know better than I, “random” is not an observable property of the value, but of the process that produced it. So, I don't think “random” is at all helpful as a descriptor of the criteria you need for discriminating these values. Can you give a better definition of what criteria distinguish the values, based only on their observable properties? You used “meaningless”; that seems at least more hopeful as a criterion we can use by examining text values. So, what counts as meaningless? > I wish to process the strings and automatically determine whether each > string is random or not. I need to split the strings into three groups: > > - those that I'm confident are random > - those that I'm unsure about > - those that I'm confident are non-random > > Ideally, I'll get some sort of numeric score so I can tweak where the > boundaries fall. Perhaps you could measure Shannon entropy (“expected information value”) https://en.wikipedia.org/wiki/Entropy_%28information_theory%29> as a proxy? Or maybe I don't quite understand the criteria. -- \ “Actually I made up the term “object-oriented”, and I can tell | `\you I did not have C++ in mind.” —Alan Kay, creator of | _o__)Smalltalk, at OOPSLA 1997 | Ben Finney -- https://mail.python.org/mailman/listinfo/python-list
Catogorising strings into random versus non-random
I have a large number of strings (originally file names) which tend to fall into two groups. Some are human-meaningful, but not necessarily dictionary words e.g.:

baby lions at play
saturday_morning12
Fukushima
ImpossibleFork

(note that some use underscores, others spaces, and some CamelCase) while others are completely meaningless (or mostly so):

xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u

Let's call the second group "random" and the first "non-random", without getting bogged down into arguments about whether they are really random or not. I wish to process the strings and automatically determine whether each string is random or not. I need to split the strings into three groups:

- those that I'm confident are random
- those that I'm unsure about
- those that I'm confident are non-random

Ideally, I'll get some sort of numeric score so I can tweak where the boundaries fall.

Strings are *mostly* ASCII but may include a few non-ASCII characters.

Note that false positives (detecting a meaningful non-random string as random) is worse for me than false negatives (miscategorising a random string as non-random).

Does anyone have any suggestions for how to do this? Preferably something already existing. I have some thoughts and/or questions:

- I think nltk has a "language detection" function, would that be suitable?
- If not nltk, are there suitable language detection libraries?
- Is this the sort of problem that neural networks are good at solving? Anyone know a really good tutorial for neural networks in Python?
- How about Bayesian filters, e.g. SpamBayes?

-- Steven -- https://mail.python.org/mailman/listinfo/python-list