Re: Catogorising strings into random versus non-random

2015-12-21 Thread Rick Johnson
On Sunday, December 20, 2015 at 10:22:57 PM UTC-6, Chris Angelico wrote:
> DuckDuckGo doesn't give a result count, so I skipped it. Yahoo search yielded:

So why bother to mention it then? Is this another one of your "pikeish" 
propaganda campaigns?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread duncan smith
On 21/12/15 16:49, Ian Kelly wrote:
> On Mon, Dec 21, 2015 at 9:40 AM, duncan smith  wrote:
>> Finite state machine / transition matrix. Learn from some English text
>> source. Then process your strings by lower casing, replacing underscores
>> with spaces, removing trailing numeric characters etc. Base your score
>> on something like the mean transition probability. I'd expect to see two
>> pretty well separated groups of scores.
> 
> Sounds like a case for a Hidden Markov Model.
> 

Perhaps. That would allow the encoding of marginal probabilities and
distinct transition matrices for each class - if we could learn those
extra parameters.
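
A minimal sketch of that idea with observed rather than hidden states
(so really a pair of class-conditional Markov chains rather than a full
HMM), assuming a few hand-labelled examples of each class are available
to learn the parameters from:

from collections import defaultdict
from math import log

def train_bigram_logprob(strings, alpha=1.0, alphabet_size=256):
    """Character-bigram transition counts with add-alpha smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in strings:
        s = s.lower()
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    def logprob(a, b):
        row = counts.get(a, {})
        total = sum(row.values())
        return log((row.get(b, 0.0) + alpha) / (total + alpha * alphabet_size))
    return logprob

def mean_loglik(s, logprob):
    s = s.lower()
    pairs = list(zip(s, s[1:]))
    return sum(logprob(a, b) for a, b in pairs) / max(len(pairs), 1)

# hypothetical hand-labelled training data (stand-ins, not a real corpus)
meaningful = ["baby lions at play", "saturday morning", "impossible fork"]
gibberish = ["xy39mGWbosjY", "9sjz7s8198ghwt", "rz4sdko-28dbRW00u"]

lp_meaningful = train_bigram_logprob(meaningful)
lp_gibberish = train_bigram_logprob(gibberish)

for name in ("Fukushima", "28dbRW00u"):
    # positive favours the "meaningful" class, negative the "random" one
    score = mean_loglik(name, lp_meaningful) - mean_loglik(name, lp_gibberish)
    print("%+.3f  %s" % (score, name))

Classifying on the sign of the score, with values near zero treated as
"unsure", would give the three buckets Steven asked for.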

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread Paul Rubin
Steven D'Aprano  writes:
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:

I think I'd just look at the set of digraphs or trigraphs in each name
and see if there are a lot that aren't found in English.
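
A rough sketch of that check, assuming a word list such as
/usr/share/dict/words is available to stand in for "English":

def trigraphs(s):
    s = s.lower()
    return set(s[i:i+3] for i in range(len(s) - 2))

# collect every trigraph that occurs in the word list
english = set()
with open("/usr/share/dict/words") as f:
    for line in f:
        english |= trigraphs(line.strip())

def unseen_fraction(name):
    t = trigraphs(name)
    if not t:
        return 0.0
    return len(t - english) / float(len(t))

for name in ("ImpossibleFork", "xy39mGWbosjY"):
    print("%.2f  %s" % (unseen_fraction(name), name))

Names where most trigraphs never occur in the word list would land in
the "confident it's random" bucket.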

> - I think nltk has a "language detection" function, would that be suitable?
> - If not nltk, are there suitable language detection libraries?

I suspect these need longer strings to work.

> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
> - How about Bayesian filters, e.g. SpamBayes?

You want large training sets for these approaches.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread Mark Lawrence

On 21/12/2015 16:49, Ian Kelly wrote:

> On Mon, Dec 21, 2015 at 9:40 AM, duncan smith  wrote:
>> Finite state machine / transition matrix. Learn from some English text
>> source. Then process your strings by lower casing, replacing underscores
>> with spaces, removing trailing numeric characters etc. Base your score
>> on something like the mean transition probability. I'd expect to see two
>> pretty well separated groups of scores.
>
> Sounds like a case for a Hidden Markov Model.



In which case https://pypi.python.org/pypi/Markov/0.1 would seem to be a 
starting point.


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread Ian Kelly
On Mon, Dec 21, 2015 at 9:40 AM, duncan smith  wrote:
> Finite state machine / transition matrix. Learn from some English text
> source. Then process your strings by lower casing, replacing underscores
> with spaces, removing trailing numeric characters etc. Base your score
> on something like the mean transition probability. I'd expect to see two
> pretty well separated groups of scores.

Sounds like a case for a Hidden Markov Model.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread duncan smith
On 21/12/15 03:01, Steven D'Aprano wrote:
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
> 
> 
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
> 
> 
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
> 
> 
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
> 
> 
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
> 
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
> 
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
> 
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
> 
> Note that false positives (detecting a meaningful non-random string as
> random) are worse for me than false negatives (miscategorising a random
> string as non-random).
> 
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
> 
> - I think nltk has a "language detection" function, would that be suitable?
> 
> - If not nltk, are there suitable language detection libraries?
> 
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
> 
> - How about Bayesian filters, e.g. SpamBayes?
> 
> 
> 
> 

Finite state machine / transition matrix. Learn from some English text
source. Then process your strings by lower casing, replacing underscores
with spaces, removing trailing numeric characters etc. Base your score
on something like the mean transition probability. I'd expect to see two
pretty well separated groups of scores.
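
To make that concrete, a rough sketch (assuming /usr/share/dict/words,
or any large chunk of English text, is available as the training
source):

import re
from collections import defaultdict

def preprocess(name):
    # lower case, underscores to spaces, trailing digits removed (as above)
    name = name.lower().replace("_", " ")
    return re.sub(r"\d+$", "", name).strip()

def learn_transitions(text):
    """Estimate P(next char | current char) from some English text."""
    counts = defaultdict(lambda: defaultdict(int))
    text = text.lower()
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = float(sum(row.values()))
        probs[a] = dict((b, c / total) for b, c in row.items())
    return probs

def mean_transition_probability(name, probs, floor=1e-6):
    # floor = arbitrary small probability for transitions never seen
    s = preprocess(name)
    pairs = list(zip(s, s[1:]))
    if not pairs:
        return floor
    return sum(probs.get(a, {}).get(b, floor) for a, b in pairs) / len(pairs)

with open("/usr/share/dict/words") as f:   # any large English text will do
    probs = learn_transitions(f.read())

for name in ("baby lions at play", "saturday_morning12", "xy39mGWbosjY"):
    print("%.4f  %s" % (mean_transition_probability(name, probs), name))

The floor for unseen transitions is arbitrary and mostly controls how
harshly digits and punctuation get punished; training on running prose
rather than a word list would also learn sensible probabilities for the
spaces between words.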

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread Vincent Davis
On Mon, Dec 21, 2015 at 7:25 AM, Vlastimil Brom 
wrote:

> > baby lions at play
> > saturday_morning12
> > Fukushima
> > ImpossibleFork
> >
> >
> > (note that some use underscores, others spaces, and some CamelCase) while
> > others are completely meaningless (or mostly so):
> >
> >
> > xy39mGWbosjY
> > 9sjz7s8198ghwt
> > rz4sdko-28dbRW00u
>

My first thought is to search Google for each word or phrase and count
the results (Google gives a count). For example, if you search for
"xy39mGWbosjY" there is one result as of now, which is an archive of
this thread. If you search for any given word or even the phrase, for
example "baby lions at play", you get a much larger set of results
(~500). I assume there are many ways to search Google with Python; this
looks like one: https://pypi.python.org/pypi/google

Vincent Davis
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread Vlastimil Brom
2015-12-21 4:01 GMT+01:00 Steven D'Aprano :
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
>
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
>
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>
> Note that false positives (detecting a meaningful non-random string as
> random) are worse for me than false negatives (miscategorising a random
> string as non-random).
>
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
>
> - I think nltk has a "language detection" function, would that be suitable?
>
> - If not nltk, are there suitable language detection libraries?
>
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
>
> - How about Bayesian filters, e.g. SpamBayes?
>
>
>
>
> --
> Steven
>
> --
> https://mail.python.org/mailman/listinfo/python-list

Hi,
as you probably already know, NLTK could be helpful for some parts of
this task; if you can handle the "word" splitting implied by
underscores, CamelCase etc., you could try to tag the parts of speech
of the words and interpret the results according to your needs.
In the online demo
http://text-processing.com/demo/tag/
your sample (with different approaches to splitting the words) yields:

baby/NN lions/NNS at/IN play/VB saturday/NN morning/NN 12/CD
Fukushima/NNP Impossible/JJ Fork/NNP xy39mGWbosjY/-None-
9sjz7s8198ghwt/-None- rz4sdko/-None- -/: 28dbRW00u/-None-

or with more splittings on case or letter-digit boundaries:
baby/NN lions/NNS at/IN play/VB saturday/NN morning/NN 12/CD
Fukushima/NNP Impossible/JJ Fork/NNP xy/-None- 39/CD m/-None- G/NNP
Wbosj/-None- Y/-None- 9/CD sjz/-None- 7/CD s/-None- 8198/-NONE-
ghwt/-None- rz/-None- 4/CD sdko/-None- -/: 28/CD db/-None- R/NNP
W/-None- 00/-None- u/-None-

The tagset might be compatible with
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

There is sample code with a comparable output to this demo:
http://stackoverflow.com/questions/23953709/how-do-i-tag-a-sentence-with-the-brown-or-conll2000-tagger-chunker

For the given minimal sample, the results look useful (maybe with the
exception of capitalised words sometimes being tagged as proper names -
but that might not be relevant here).
Of course, no scoring is available with this approach, but you could
maybe check the proportion of recognised "words" compared to the total
number of "words" for the respective filename.
Training the tagger should also be possible in NLTK, but I don't have
experience with this.
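
A very rough stand-in for that proportion, using a plain word list
rather than the tagger (this assumes NLTK's "words" corpus has been
downloaded, e.g. via nltk.download('words')):

from nltk.corpus import words

VOCAB = set(w.lower() for w in words.words())

def known_word_ratio(tokens):
    # tokens = the "words" obtained from whatever splitting scheme is used
    tokens = [t.lower() for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in VOCAB for t in tokens) / float(len(tokens))

print(known_word_ratio(["baby", "lions", "at", "play"]))  # high: mostly dictionary words
print(known_word_ratio(["xy39mGWbosjY"]))                 # 0.0: nothing recognisable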

regards,
 vbr
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread Christian Gollwitzer

On 21.12.15 at 11:53, Christian Gollwitzer wrote:
> So for the spaces, either use proper training material (some long
> corpus from Wikipedia or such), with punctuation removed. Then it will
> catch the correct probabilities at word boundaries. Or preprocess by
> removing the spaces.
>
>  Christian


PS: The real log-likelihood would become -infinity, when some pair does 
not appear at all in the training set (esp. the numbers, e.g.). I used 
the 1/total in the defaultdict to mitigate that. You could tweak that 
value a bit. The larger the corpus, the sharper it will divide by 
itself, too.
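
On toy numbers, the effect of that fallback (the same add-one counting
as in the posted load_pairs()) looks like this:

from math import log

# a toy "corpus" in which only three distinct pairs were ever observed
counts = {"ab": 5, "bc": 3, "cd": 2}
N = sum(counts.values()) + len(counts)     # 10 observations + 3 pseudo-counts = 13
p_seen = dict((k, (v + 1.0) / N) for k, v in counts.items())
p_unseen = 1.0 / N                         # the defaultdict fallback for unseen pairs
print("%.4f %.4f" % (p_seen["ab"], p_unseen))   # 0.4615 0.0769
print("%.4f" % log(p_unseen))                   # -2.5649, finite instead of log(0)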


Christian
--
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread Christian Gollwitzer

On 21.12.15 at 11:36, Steven D'Aprano wrote:
> On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:
>
>> Apfelkiste:Tests chris$ python score_my.py
>> -8.74  baby lions at play
>> -7.63  saturday_morning12
>> -6.38  Fukushima
>> -5.72  ImpossibleFork
>> -10.6  xy39mGWbosjY
>> -12.9  9sjz7s8198ghwt
>> -12.1  rz4sdko-28dbRW00u
>> Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
>> -9.43  bnsip atl ayba loy
>
> Thanks Christian and Peter for the suggestion, I'll certainly investigate
> this further.
>
> But the scoring doesn't seem very good. "baby lions at play" is 100% English
> words, and ought to have a radically different score from (say)
> xy39mGWbosjY which is extremely non-English like. (How many English words
> do you know of with W, X, two Y, and J?) And yet they are only two units
> apart. "baby lions..." is a score almost as negative as the authentic
> gibberish, while Fukushima (a Japanese word) has a much less negative
> score.


It is the spaces, which do not occur in the training word list (I
mentioned that above, maybe not prominently enough);
/usr/share/dict/words contains one word per line. The underscore is
probably what pulls saturday_morning12 down, while the spaces pull
"baby lions at play" down. Using trigraphs:



Apfelkiste:Tests chris$ python score_my.py
-11.5  baby lions at play
-9.88  saturday_morning12
-9.85  Fukushima
-7.68  ImpossibleFork
-13.4  xy39mGWbosjY
-14.2  9sjz7s8198ghwt
-14.2  rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'babylionsatplay'
-8.74  babylionsatplay
Apfelkiste:Tests chris$ python score_my.py 'saturdaymorning12'
-8.93  saturdaymorning12
Apfelkiste:Tests chris$

So for the spaces, either use proper training material (some long
corpus from Wikipedia or such), with punctuation removed. Then it will
catch the correct probabilities at word boundaries. Or preprocess by
removing the spaces.


Christian
--
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread Steven D'Aprano
On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:

> Apfelkiste:Tests chris$ python score_my.py
> -8.74  baby lions at play
> -7.63  saturday_morning12
> -6.38  Fukushima
> -5.72  ImpossibleFork
> -10.6  xy39mGWbosjY
> -12.9  9sjz7s8198ghwt
> -12.1  rz4sdko-28dbRW00u
> Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
> -9.43  bnsip atl ayba loy

Thanks Christian and Peter for the suggestion, I'll certainly investigate
this further.

But the scoring doesn't seem very good. "baby lions at play" is 100% English
words, and ought to have a radically different score from (say)
xy39mGWbosjY which is extremely non-English like. (How many English words
do you know of with W, X, two Y, and J?) And yet they are only two units
apart. "baby lions..." is a score almost as negative as the authentic
gibberish, while Fukushima (a Japanese word) has a much less negative
score. Using trigraphs doesn't change that:

> -11.5  baby lions at play
> -9.85  Fukushima
> -13.4  xy39mGWbosjY

So this test appears to find that English-like words are nearly as "random"
as actual random strings.

But it's certainly worth looking into.


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread Christian Gollwitzer

On 21.12.15 at 09:24, Peter Otten wrote:
> Steven D'Aprano wrote:
>
>> I have a large number of strings (originally file names) which tend to
>> fall into two groups. Some are human-meaningful, but not necessarily
>> dictionary words e.g.:
>>
>> baby lions at play
>> saturday_morning12
>> Fukushima
>> ImpossibleFork
>>
>> (note that some use underscores, others spaces, and some CamelCase) while
>> others are completely meaningless (or mostly so):
>>
>> xy39mGWbosjY
>> 9sjz7s8198ghwt
>> rz4sdko-28dbRW00u
>>
>> Let's call the second group "random" and the first "non-random", without
>> getting bogged down into arguments about whether they are really random or
>> not. I wish to process the strings and automatically determine whether
>> each string is random or not. I need to split the strings into three
>> groups:
>>
>> - those that I'm confident are random
>> - those that I'm unsure about
>> - those that I'm confident are non-random
>>
>> Ideally, I'll get some sort of numeric score so I can tweak where the
>> boundaries fall.
>>
>> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>>
>> Note that false positives (detecting a meaningful non-random string as
>> random) are worse for me than false negatives (miscategorising a random
>> string as non-random).
>>
>> Does anyone have any suggestions for how to do this? Preferably something
>> already existing. I have some thoughts and/or questions:
>>
>> - I think nltk has a "language detection" function, would that be
>> suitable?
>>
>> - If not nltk, are there suitable language detection libraries?
>>
>> - Is this the sort of problem that neural networks are good at solving?
>> Anyone know a really good tutorial for neural networks in Python?
>>
>> - How about Bayesian filters, e.g. SpamBayes?
>
> A dead simple approach -- look at the pairs in real words and calculate the
> ratio
>
> pairs-also-found-in-real-words/num-pairs


Sounds reasonable. Building on this approach, a few simple improvements:
- calculate the log-likelihood instead, which also makes use of the
frequency of the digraphs in the training set
- use trigraphs instead of digraphs
- preprocess the string (lowercase); more sophisticated
preprocessing could be an option (i.e. converting under_scores and
CamelCase to spaces)


The main reason for the low score of the baby lions is the space
character, I think - the word list does not contain many spaces.
Maybe one should feed in some long Wikipedia article to calculate the
digraph/trigraph probabilities.


=
Apfelkiste:Tests chris$ cat score_my.py
from __future__ import division
from collections import Counter, defaultdict
from math import log
import sys
WORDLIST = "/usr/share/dict/words"

SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()

def extract_pairs(text):
    for i in range(len(text)-1):
        yield text.lower()[i:i+2]
        # or len(text)-2 and i:i+3


def load_pairs():
    pairs = Counter()
    with open(WORDLIST) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    # normalize to sum
    total_count = sum([pairs[x] for x in pairs])
    N = total_count + len(pairs)
    dist = defaultdict(lambda: 1/N, ((x, (pairs[x]+1)/N) for x in pairs))
    return dist


def get_score(text, dist):
    ll = 0
    for i, x in enumerate(extract_pairs(text), 1):
        ll += log(dist[x])
    return ll / i


def main():
    pair_dist = load_pairs()
    for text in sys.argv[1:] or SAMPLE:
        score = get_score(text, pair_dist)
        print("%.3g  %s" % (score, text))


if __name__ == "__main__":
    main()

Apfelkiste:Tests chris$ python score_my.py
-8.74  baby lions at play
-7.63  saturday_morning12
-6.38  Fukushima
-5.72  ImpossibleFork
-10.6  xy39mGWbosjY
-12.9  9sjz7s8198ghwt
-12.1  rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-9.43  bnsip atl ayba loy
Apfelkiste:Tests chris$

and using trigraphs:

Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-12.5  bnsip atl ayba loy
Apfelkiste:Tests chris$ python score_my.py
-11.5  baby lions at play
-9.88  saturday_morning12
-9.85  Fukushima
-7.68  ImpossibleFork
-13.4  xy39mGWbosjY
-14.2  9sjz7s8198ghwt
-14.2  rz4sdko-28dbRW00u
==

--
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread Steven D'Aprano
On Monday 21 December 2015 15:22, Chris Angelico wrote:

> On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano 
> wrote:
>> I have a large number of strings (originally file names) which tend to
>> fall into two groups. Some are human-meaningful, but not necessarily
>> dictionary words e.g.:
[...]

> The first thing that comes to my mind is poking the string into a
> search engine and seeing how many results come back. You might need to
> do some preprocessing to recognize multi-word forms (maybe a handful
> of recognized cases like snake_case, CamelCase,
> CamelCasewiththeLittleWordsLeftUnchanged, etc),

I could possibly split the string into "words", based on CamelCase, spaces, 
hyphens or underscores. That would cover most of the cases.
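
Something along these lines would probably do (a quick sketch, not
tested against the real data):

import re

def split_name(name):
    """Split a file name into candidate words on spaces, hyphens,
    underscores and CamelCase/letter-digit boundaries."""
    name = re.sub(r"[_\-]+", " ", name)                  # snake_case, hyphens
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)  # CamelCase boundary
    name = re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", " ", name)
    return name.lower().split()

print(split_name("saturday_morning12"))   # ['saturday', 'morning', '12']
print(split_name("ImpossibleFork"))       # ['impossible', 'fork']
print(split_name("rz4sdko-28dbRW00u"))    # ['rz', '4', 'sdko', '28', 'db', 'rw', '00', 'u']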

> How many of these keywords would you be looking up, and would a
> network transaction (a search engine API call) for each one be too
> expensive?

Tens or hundreds of thousands of strings, and yes a network transaction 
probably would be a bit much. I'd rather not have Google or Bing be a 
dependency :-)


-- 
Steve

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random)

2015-12-21 Thread Steven D'Aprano
On Monday 21 December 2015 14:45, Ben Finney wrote:

> Steven D'Aprano  writes:
> 
>> Let's call the second group "random" and the first "non-random",
>> without getting bogged down into arguments about whether they are
>> really random or not.
> 
> I think we should discuss it, even at risk of getting bogged down. As
> you know better than I, “random” is not an observable property of the
> value, but of the process that produced it.
> 
> So, I don't think “random” is at all helpful as a descriptor of the
> criteria you need for discriminating these values.
> 
> Can you give a better definition of what criteria distinguish the
> values, based only on their observable properties?

No, not really. This *literally* is a case of "I'll know it when I see it", 
which suggests that some sort of machine-learning solution (neural network?) 
may be useful. I can train it on a bunch of strings which I can hand-
classify, and let the machine pick out the correlations, then apply it to 
the rest of the strings.

The best I can say is that the "non-random" strings either are, or consist 
of, mostly English words, names, or things which look like they might be 
English words, containing no more than a few non-ASCII characters, 
punctuation, or digits.


> You used “meaningless”; that seems at least more hopeful as a criterion
> we can use by examining text values. So, what counts as meaningless?

Strings made up of random-looking sequences of characters, like you often 
see on sites like imgur or tumblr. Characters from non-Latin character sets 
that I can't read (e.g. Japanese, Korean, Arabic, etc). Jumbled up words, 
e.g. "python" is non-random, "nyohtp" would be random.


[...]
> Perhaps you could measure Shannon entropy (“expected information value”)
> <https://en.wikipedia.org/wiki/Entropy_%28information_theory%29> as
> a proxy? Or maybe I don't quite understand the criteria.

That's a possibility. At least, it might be able to distinguish some 
strings, although if I understand correctly, the two strings "python" and 
"nhoypt" have identical entropy, so this alone won't be sufficient.

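For example, per-character Shannon entropy depends only on the
character frequencies, not on their order, so a word and its anagram
always score the same:

from collections import Counter
from math import log

def char_entropy(s):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(s)
    n = float(len(s))
    return -sum((c / n) * log(c / n, 2) for c in counts.values())

print(char_entropy("python"))        # ~2.585 bits: six distinct characters
print(char_entropy("nhoypt"))        # ~2.585 bits: same multiset, same entropy
print(char_entropy("xy39mGWbosjY"))  # higher, but so is any long varied string

So it might help flag strings with unusually flat or varied character
distributions, but on its own it can't separate a word from a shuffle
of that word.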



-- 
Steve

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread Peter Otten
Steven D'Aprano wrote:

> I have a large number of strings (originally file names) which tend to
> fall into two groups. Some are human-meaningful, but not necessarily
> dictionary words e.g.:
> 
> 
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
> 
> 
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
> 
> 
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
> 
> 
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether
> each string is random or not. I need to split the strings into three
> groups:
> 
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
> 
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
> 
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
> 
> Note that false positives (detecting a meaningful non-random string as
> random) are worse for me than false negatives (miscategorising a random
> string as non-random).
> 
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
> 
> - I think nltk has a "language detection" function, would that be
> suitable?
> 
> - If not nltk, are there suitable language detection libraries?
> 
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
> 
> - How about Bayesian filters, e.g. SpamBayes?

A dead simple approach -- look at the pairs in real words and calculate the 
ratio

pairs-also-found-in-real-words/num-pairs

$ cat score.py
import sys
WORDLIST = "/usr/share/dict/words"

SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()

def extract_pairs(text):
    for i in range(len(text)-1):
        yield text[i:i+2]


def load_pairs():
    pairs = set()
    with open(WORDLIST) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    return pairs


def get_score(text, popular_pairs):
    m = 0
    for i, p in enumerate(extract_pairs(text), 1):
        if p in popular_pairs:
            m += 1
    return m/i


def main():
    popular_pairs = load_pairs()
    for text in sys.argv[1:] or SAMPLE:
        score = get_score(text, popular_pairs)
        print("%4.2f  %s" % (score, text))


if __name__ == "__main__":
    main()

$ python3 score.py
0.65  baby lions at play
0.76  saturday_morning12
1.00  Fukushima
0.92  ImpossibleFork
0.36  xy39mGWbosjY
0.31  9sjz7s8198ghwt
0.31  rz4sdko-28dbRW00u

However:

$ python3 -c 'import random, sys; a = list(sys.argv[1]); random.shuffle(a); print("".join(a))' 'baby lions at play'
bnsip atl ayba loy
$ python3 score.py 'bnsip atl ayba loy'
0.65  bnsip atl ayba loy


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-20 Thread Chris Angelico
On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano  wrote:
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
>
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
>
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
>
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
> I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.

The first thing that comes to my mind is poking the string into a
search engine and seeing how many results come back. You might need to
do some preprocessing to recognize multi-word forms (maybe a handful
of recognized cases like snake_case, CamelCase,
CamelCasewiththeLittleWordsLeftUnchanged, etc), but doing that
manually on the above text gives me:

* baby lions at play
* saturday morning 12
* fukushima
* impossible fork
* xy 39 mgwbosjy
* 9 sjz 7 s 8198 ghwt
* rz 4 sdko 28 dbrw 00 u

Putting those into Google without quotes yields:

* About 23,800,000 results
* About 227,000,000 results
* About 32,500,000 results
* About 16,400,000 results
* About 1,180 results
* 7 results
* About 30,300 results

DuckDuckGo doesn't give a result count, so I skipped it. Yahoo search yielded:

* 6,040,000 results
* 123,000,000 results
* 3,920,000 results
* 720,000 results
* No results at all
* No results at all
* 2 results

Bing produces much more chaotic results, though:
* 34,000,000 RESULTS
* 15,600,000 RESULTS
* 11,000,000 RESULTS
* 1,620,000 RESULTS
* 5,720,000 RESULTS
* 1,580,000,000 RESULTS
* 3,380,000 RESULTS

This suggests that search engine results MAY be useful, but in some
cases, tweaks may be necessary (I couldn't force Bing to do phrase
search, for some reason probably related to my inexperience with it),
and also that the boundary between "meaningful" and "non-meaningful"
will depend on the engine used (I'd use 1,000,000 as the boundary with
Google, but probably 100,000 with Yahoo). You might want to handle
numerics differently, too - converting "9" into "nine" could improve
the result reliability.

How many of these keywords would you be looking up, and would a
network transaction (a search engine API call) for each one be too
expensive?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Categorising strings on meaningful–meaningless spectrum (was: Catogorising strings into random versus non-random)

2015-12-20 Thread Ben Finney
Steven D'Aprano  writes:

> Let's call the second group "random" and the first "non-random",
> without getting bogged down into arguments about whether they are
> really random or not.

I think we should discuss it, even at risk of getting bogged down. As
you know better than I, “random” is not an observable property of the
value, but of the process that produced it.

So, I don't think “random” is at all helpful as a descriptor of the
criteria you need for discriminating these values.

Can you give a better definition of what criteria distinguish the
values, based only on their observable properties?

You used “meaningless”; that seems at least more hopeful as a criterion
we can use by examining text values. So, what counts as meaningless?

> I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.

Perhaps you could measure Shannon entropy (“expected information value”)
<https://en.wikipedia.org/wiki/Entropy_%28information_theory%29> as
a proxy? Or maybe I don't quite understand the criteria.

-- 
 \  “Actually I made up the term “object-oriented”, and I can tell |
  `\you I did not have C++ in mind.” —Alan Kay, creator of |
_o__)Smalltalk, at OOPSLA 1997 |
Ben Finney

-- 
https://mail.python.org/mailman/listinfo/python-list


Catogorising strings into random versus non-random

2015-12-20 Thread Steven D'Aprano
I have a large number of strings (originally file names) which tend to fall
into two groups. Some are human-meaningful, but not necessarily dictionary
words e.g.:


baby lions at play
saturday_morning12
Fukushima
ImpossibleFork


(note that some use underscores, others spaces, and some CamelCase) while
others are completely meaningless (or mostly so):


xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u


Let's call the second group "random" and the first "non-random", without
getting bogged down into arguments about whether they are really random or
not. I wish to process the strings and automatically determine whether each
string is random or not. I need to split the strings into three groups:

- those that I'm confident are random
- those that I'm unsure about
- those that I'm confident are non-random

Ideally, I'll get some sort of numeric score so I can tweak where the
boundaries fall.

Strings are *mostly* ASCII but may include a few non-ASCII characters.

Note that false positives (detecting a meaningful non-random string as
random) are worse for me than false negatives (miscategorising a random
string as non-random).

Does anyone have any suggestions for how to do this? Preferably something
already existing. I have some thoughts and/or questions:

- I think nltk has a "language detection" function, would that be suitable?

- If not nltk, are there suitable language detection libraries?

- Is this the sort of problem that neural networks are good at solving?
Anyone know a really good tutorial for neural networks in Python?

- How about Bayesian filters, e.g. SpamBayes?




-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list