[CODE4LIB] automatic greeking of sample files

2011-12-09 Thread BRIAN TINGLE
Hi,

I'm now in the group that produces XTF, and for XTF4.0, I'm thinking about 
updating the EAD XSLT based on the Online Archive of California's stylesheets.

For our EAD samples that we distribute with the XTF tutorial, we are using 6 
EAD files from the Library of Congress (which presumably are public domain).

I'd like to start a collection of pathological EAD examples that we have the 
rights to redistribute with the XTF tutorials and to use for testing.

Anticipating that potential contributors might not want to release their actual 
records for inclusion in an open source project, I hacked a little script to 
systematically change names and nouns to Pig Latin:

https://gist.github.com/1429538

Here is a sample run:

Input: (from http://www.oac.cdlib.org/findaid/ark:/13030/kt3580374v/ )

The NASA Space Shuttle Challenger disaster occurred on January 28, 1986 when 
Space Shuttle Challenger broke apart 73 seconds into its flight, leading to the 
deaths of its seven crew members. Disintegration of the entire vehicle began 
after an O-ring seal in its right solid rocket booster failed at liftoff. The 
disaster resulted in the formation of the Rogers Commission, a special 
commission appointed by United States President Ronald Reagan to investigate 
the accident. The Presidential Commission found that NASA's organizational 
culture and decision-making processes had been a key contributing factor to the 
accident. NASA managers had known that contractor Morton Thiokol's design of 
the solid rocket boosters contained a potentially catastrophic flaw in the 
O-rings, but they failed to address it properly. They also disregarded warnings 
from engineers about the dangers of launching posed by the low temperatures of 
that morning.

output:

The Nasaay Acespay Uttleshay Allengerchay isasterday occurred on Anuaryjay 28, 
1986 when Acespay Uttleshay Allengerchay okebray apartway 73 econdsays into its 
flight, leading to the eathdays of its seven ewcray embermays. Isintegrationday 
of the entire ehiclevay began after an O-ring ealsay in its ightray solid 
ocketray oosterbay failed at iftofflay. The isasterday resulted in the 
ormationfay of the Ogersray Ommissioncay, a special ommissioncay appointed by 
Itedunay States Esidentpray Onaldray Eaganray to investigate the accidentway. 
The Esidentialpray Ommissioncay found that Nasaay's organizational ulturecay 
and decision-making ocessprays had been a key ontributingcay actorfay to the 
accidentway. Nasaay anagermays had known that ontractorcay Ortonmay Iokolthay's 
esignday of the solid ocketray oosterbays contained a potentially catastrophic 
awflay in the ingO-rays, but they failed to addressway it properly. They also 
disregarded arningways from engineerways about the angerdays of launching posed 
by the low emperaturetays of that orningmay.

Does anyone have any thoughts or feedback on this?  Is this totally silly?  Is 
there something besides Pig Latin that I could transform the words to?  Any 
obvious ways I could improve the Python?
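
The gist itself is not reproduced in this thread. As a rough illustration of
the word-level transformation only, here is a minimal sketch that assumes
plain textbook Pig Latin rules. The names pig_latin and greek_text are
hypothetical, and the sketch converts every word rather than only names and
nouns, so it will not reproduce the sample output above exactly.

```python
import re

VOWELS = "aeiouAEIOU"
WORD = re.compile(r"[A-Za-z]+")


def pig_latin(word):
    """Return a textbook Pig Latin form of one alphabetic word."""
    if not word.isalpha():
        return word
    if word[0] in VOWELS:
        # Vowel-initial words just gain a "way" suffix.
        result = word + "way"
    else:
        # Move the leading consonant cluster to the end, then add "ay".
        head = re.match(r"[^aeiouAEIOU]+", word).group(0)
        result = word[len(head):] + head + "ay"
    # Keep an initial capital if the input had one.
    return result.capitalize() if word[0].isupper() else result.lower()


def greek_text(text):
    """Apply pig_latin to every run of ASCII letters in text."""
    return WORD.sub(lambda m: pig_latin(m.group(0)), text)


print(greek_text("Space Shuttle Challenger broke apart"))
```

Restricting the change to nouns would need a part-of-speech tagger in front
of this, which the sketch leaves out.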


Re: [CODE4LIB] automatic greeking of sample files

2011-12-12 Thread Michael B. Klein
Hi Brian,

Your contributors might not consider Pig Latin, or anything else that can
be easily turned back into plaintext, to be "not releasing their actual
records." :-)

Here's a snippet that will completely randomize the contents of an
arbitrary string while retaining the general flow: vowels are replaced with
vowels and consonants with consonants (with case retained in both
instances), digits are replaced with digits, and everything else is left alone.

https://gist.github.com/1468557
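
The gist is not reproduced in this thread. A minimal Python sketch of the
behaviour described above might look like the following. The function name
randomize and the choice to count "y" as a vowel are assumptions made for
illustration, not details taken from the gist.

```python
import random
import string

# Assumption: "y" is treated as a vowel; the gist may differ.
VOWELS = "aeiouy"
CONSONANTS = "".join(c for c in string.ascii_lowercase if c not in VOWELS)


def randomize(text, rng=random):
    """Replace each character with a random one of the same class."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in VOWELS:
            repl = rng.choice(VOWELS)
        elif lower in CONSONANTS:
            repl = rng.choice(CONSONANTS)
        elif ch in string.digits:
            repl = rng.choice(string.digits)
        else:
            # Punctuation, whitespace and non-ASCII characters pass through.
            out.append(ch)
            continue
        out.append(repl.upper() if ch.isupper() else repl)
    return "".join(out)


print(randomize("The NASA Space Shuttle Challenger, 1986"))
```

Because nothing is seeded, every call gives different text, so the same word
will usually come out differently each time it appears.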

Here's your NASA sample run through the randomizer:

Vny RUPY Xsase Pwuccpo Lnipbaxjew fipewsof eqfugvof if Xeleufe 60, 1295
wtos Mvimo Jlehcve Lbobvezbyh vlozi odohl 77 cyfuzbq ilne ybl sponsf,
meojacz gu cmi piyngf ed abr fotor gloc cumcetj. Ruzildasfebaod if fdu
ejsosa rumozzi ginaq arhan or A-pont kaon ew eqv jejlk vutuq kalsaj roumhyl
teopyf is midqokz. Kda mitoxhuh rugoxhal on pxu pelqeseul az msu Tawivg
Luwjutmaol, i mqubyip wulvyffaak evviivhek qe Afykox Cfaron Mkefyfipq
Kybuvz Riufyl ba awwevrogixe bde uhliwekp. Hsu Gqugydatgyyp Qemgybmuix
diytr tvix VYXE'h irjybefakiyzil cibkeco udx numojuaf-pogezn dquziqpyb fod
heip a fee lannjuluxymk qejvet la vmy ymriqexc. BUJI fegucuzz syj wviwx
wmin cyvvgintoj Jufhyq Gnoeham'v dosyzv ar xzy detib xyzvyf raazkapk
lizniutyp u cypimsiufte zetesjzesmam dgyj ag cki U-juzrm, dys gnai jausul
gi iqlbyhf es ksumapfu. Bsau ittu qojsarahlih mozpyhbb dpon okxotuosd ebuih
cde xoqhewd ow koahznygl xuwoh by xce huf jujjybexohyp og xjoc gagnysx.



Re: [CODE4LIB] automatic greeking of sample files

2011-12-12 Thread Brian Tingle
On Mon, Dec 12, 2011 at 10:56 AM, Michael B. Klein wrote:

> Here's a snippet that will completely randomize the contents of an
> arbitrary string while retaining the general flow: vowels are replaced with
> vowels and consonants with consonants (with case retained in both
> instances), digits are replaced with digits, and everything else is left alone.
>
> https://gist.github.com/1468557  


I like the way the output looks, but one problem with the random output is
that the same word might come out to different values.  The distribution of
unique words would also be affected; I'm not sure if that would
impact relevance/searching/index size.  Also, I was sort of hoping to be
able to have some sort of browsing, so I'm looking for something that is
like a pronounceable one-way hash.  Maybe if I take the md5 of the
word, use that as the seed for random, and then run
your algorithm, NASA would always "hash" to the same thing?

Potential contributors of specimens would have to be okay with the fact
that a determined person could recreate their original records.  The goal
is that an end user who might stumble across a random XTF tutorial
installation would not mistake what they are seeing for a real collection
description.

Hopefully nothing transforms to a swear word; I guess that is a problem
with Pig Latin as well...

Thanks for the feedback and the suggestion.  I'll play with this some
tonight and see if setting the seed based on the input word works to get
the same pseudo-random result, seems like it should.
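
The seeding idea can be sketched as follows, assuming the same
vowel/consonant/digit classes as the randomizer described earlier in the
thread. The names word_rng and greek_word are hypothetical, and lowercasing
the word before hashing is an added assumption so that "NASA" and "nasa" get
the same letters.

```python
import hashlib
import random
import string

# Assumption: "y" is treated as a vowel.
VOWELS = "aeiouy"
CONSONANTS = "".join(c for c in string.ascii_lowercase if c not in VOWELS)


def word_rng(word):
    """Return a random generator seeded from the md5 of the word."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return random.Random(int(digest, 16))


def greek_word(word):
    """Deterministically replace letters and digits, class for class."""
    rng = word_rng(word)
    out = []
    for ch in word:
        lower = ch.lower()
        if lower in VOWELS:
            repl = rng.choice(VOWELS)
        elif lower in CONSONANTS:
            repl = rng.choice(CONSONANTS)
        elif ch in string.digits:
            repl = rng.choice(string.digits)
        else:
            out.append(ch)
            continue
        out.append(repl.upper() if ch.isupper() else repl)
    return "".join(out)


# The same input always yields the same output within one Python version.
print(greek_word("NASA"), greek_word("NASA"))
```

This is not a one-way hash in any cryptographic sense: short, common words
can be recovered by running candidate words through the same function.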


Re: [CODE4LIB] automatic greeking of sample files

2011-12-12 Thread Nate Vack
On Mon, Dec 12, 2011 at 2:06 PM, Brian Tingle wrote:

> Potential contributors of specimens would have to be okay with the fact
> that a determined person could recreate their original records.

To make things simpler, you might just see how many contributors would
just be OK with the original records, and skip the obfuscation.

-n


Re: [CODE4LIB] automatic greeking of sample files

2011-12-12 Thread Joe Hourcle
On Dec 12, 2011, at 3:06 PM, Brian Tingle wrote:

> On Mon, Dec 12, 2011 at 10:56 AM, Michael B. Klein wrote:
> 
>> Here's a snippet that will completely randomize the contents of an
>> arbitrary string while retaining the general flow: vowels are replaced with
>> vowels and consonants with consonants (with case retained in both
>> instances), digits are replaced with digits, and everything else is left alone.
>> 
>> https://gist.github.com/1468557  
> 
> 
> I like the way the output looks, but one problem with the random output is
> that the same word might come out to different values.  The distribution of
> unique words would also be affected; I'm not sure if that would
> impact relevance/searching/index size.  Also, I was sort of hoping to be
> able to have some sort of browsing, so I'm looking for something that is
> like a pronounceable one-way hash.  Maybe if I take the md5 of the
> word, use that as the seed for random, and then run
> your algorithm, NASA would always "hash" to the same thing?

If the list of missions / agencies / etc is rather small, it'd be possible to
just come up with a random list of nouns, and make a sort of secret
decoder ring, assigning each mission name that needs to be replaced
with a random (but consistent) word.

I just tend to replace all of my mission / spacecraft / instrument acronyms
with 'BOGUS' when I have to do similar stuff to generate records when
we're testing data systems, but I tend to just have the acronyms, not
the full spelled out names (which are looked up from the acronyms),
and I don't have large amounts of free text to worry about.

-Joe
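
A minimal sketch of such a decoder ring, assuming the names to hide and a
pool of replacement nouns are both supplied by hand. The functions
build_decoder_ring and apply_ring and the example word lists are
hypothetical.

```python
import random
import re


def build_decoder_ring(names, nouns, seed=None):
    """Assign each distinct name a distinct, randomly chosen noun."""
    unique = sorted(set(names))
    # random.sample raises ValueError if there are fewer nouns than names.
    picks = random.Random(seed).sample(sorted(set(nouns)), len(unique))
    return dict(zip(unique, picks))


def apply_ring(text, ring):
    """Replace whole-word occurrences of each name with its noun."""
    if not ring:
        return text
    # Longest names first, so a longer name wins over a shorter prefix.
    names = sorted(ring, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: ring[m.group(1)], text)


ring = build_decoder_ring(
    ["NASA", "Challenger"], ["walrus", "teapot", "anvil"], seed=1
)
print(apply_ring("NASA managers reviewed the Challenger launch.", ring))
```

Keeping the ring private is what keeps the mapping hard to reverse; passing
the same seed and the same word lists rebuilds the same ring.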


Re: [CODE4LIB] automatic greeking of sample files

2011-12-12 Thread Brian Tingle
On Mon, Dec 12, 2011 at 12:27 PM, Nate Vack wrote:

> On Mon, Dec 12, 2011 at 2:06 PM, Brian Tingle wrote:
>
> > Potential contributors of specimens would have to be okay with the fact
> > that a determined person could recreate their original records.
>
> To make things simpler, you might just see how many contributors would
> just be OK with the original records, and skip the obfuscation.


True, but I'm also worried about end user support questions if we end up
having something like an ead-demo.xtf.cdlib.org.

Plus, I'm also using this as an excuse to play with NLTK (Natural Language
Toolkit) and learn more Python.

But yes, I'm sure I'm prematurely optimizing this problem.

On Mon, Dec 12, 2011 at 12:48 PM, Joe Hourcle wrote:

> If the list of missions / agencies / etc is rather small, it'd be possible
> to just come up with a random list of nouns, and make a sort of secret
> decoder ring, assigning each mission name that needs to be replaced
> with a random (but consistent) word.


This is a great idea.  I think if I reset the pseudo-random seed based on
the input, then I don't even have to worry about keeping a decoder ring,
and it will work with any noun.  As long as the results look so silly that
no end user could mistake them for real, this might work.

Maybe I'll create an option switch for the text replacement method:
Pig Latin, vowel/consonant-sensitive random letters, or random dictionary
word.


Re: [CODE4LIB] automatic greeking of sample files

2011-12-12 Thread Michael B. Klein
I've altered my previous function (https://gist.github.com/1468557) into
something that's pretty much a straight letter-substitution cipher. It
could be turned back into plaintext pretty easily by someone who really
wanted to (by using frequency analysis and other hints like single-letter
words), but I can't imagine anyone going to the trouble over finding aids.
:) This keeps words (and therefore word frequency/distribution) consistent,
even across changes in case. But if you really want it to index
realistically, it would need to be altered to leave common suffixes (-s, -ies,
-ed, -ing, etc.) alone (assuming the indexer uses some sort of stemming
algorithm).
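
The altered function is not shown in the thread. A minimal sketch of a
case-consistent substitution cipher of this kind, assuming vowels still map
to vowels and consonants to consonants, could look like this. The name
make_cipher, the fixed seed, the treatment of "y" as a vowel, and the
shuffling of digits are all assumptions.

```python
import random
import string

# Assumption: "y" is treated as a vowel.
VOWELS = "aeiouy"
CONSONANTS = "".join(c for c in string.ascii_lowercase if c not in VOWELS)


def make_cipher(seed=0):
    """Build a str.translate table that permutes letters within a class."""
    rng = random.Random(seed)
    table = {}
    for group in (VOWELS, CONSONANTS, string.digits):
        shuffled = list(group)
        rng.shuffle(shuffled)
        for src, dst in zip(group, shuffled):
            # The same mapping is used for both cases of a letter.
            table[ord(src)] = dst
            table[ord(src.upper())] = dst.upper()
    return table


cipher = make_cipher(seed=7)
print("NASA managers and nasa Managers".translate(cipher))
```

Because the table is fixed for a whole document, a given word always maps to
the same output, and its upper- and lower-case forms stay in step. Leaving
suffixes alone would need an extra step that splits each word before
translating it.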



Re: [CODE4LIB] automatic greeking of sample files

2011-12-13 Thread BRIAN TINGLE
On Dec 12, 2011, at 6:35 PM, Michael B. Klein wrote:

> I've altered my previous function (https://gist.github.com/1468557) into
> something that's pretty much a straight letter-substitution cipher.

This is what I ended up using:
https://github.com/tingletech/greeker.py/blob/3ba1e84bc1ea51fa501c1a479f8758593bac5ffd/greeker.py#L131-150
It uses a different straight letter substitution for every unique word, using 
the input word as the seed for random.
It does not look as pretty as your code.

> But if you really want it to index
> realistically, it would need to be altered to leave common suffixes (-s, -ies,
> -ed, -ing, etc.) alone (assuming the indexer uses some sort of stemming
> algorithm).

I'm only doing nouns, and I'm matching inflection.  I guess I could investigate 
stemming as well.

I'd still like to play with substituting nouns using a dictionary of nouns of 
the same length, but I have not found a dictionary of nouns to use.  I thought I 
would find one in NLTK somewhere, but I did not figure out how to use WordNet 
when I looked at it.
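
One possible way to get such a noun list is NLTK's WordNet corpus reader.
The sketch below is not what greeker.py does. It assumes NLTK is installed
and the WordNet data has been fetched with nltk.download("wordnet"); the
function names are hypothetical. The substitution step works on any list of
nouns, so it can be tried without NLTK.

```python
import hashlib
import random
from collections import defaultdict


def load_wordnet_nouns():
    """Return single-word noun lemmas from WordNet (needs NLTK data)."""
    from nltk.corpus import wordnet as wn
    names = wn.all_lemma_names(pos=wn.NOUN)
    # Multi-word lemmas contain underscores, so isalpha() filters them out.
    return sorted({n.lower() for n in names if n.isalpha()})


def index_by_length(nouns):
    """Group nouns by length so lookups are cheap."""
    by_length = defaultdict(list)
    for noun in sorted(set(nouns)):
        by_length[len(noun)].append(noun)
    return by_length


def substitute_noun(word, by_length):
    """Pick a same-length noun, seeded by the input word."""
    candidates = by_length.get(len(word))
    if not word or not candidates:
        return word  # nothing suitable, so leave the word as it is
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    choice = random.Random(int(digest, 16)).choice(candidates)
    return choice.capitalize() if word[0].isupper() else choice


nouns_by_length = index_by_length(["comet", "ladle", "walrus", "teapot"])
print(substitute_noun("Space", nouns_by_length))
```

With the real list, nouns_by_length would be built from
load_wordnet_nouns() instead of the four sample words.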