"James Stroud" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > [EMAIL PROTECTED] wrote: >> Hello, >> >> I am looking for python code useful to process >> tables that are in ASCII text. The code must >> determine where are the columns (fields). >> Concerned tables for my application are various, >> but their columns are not very complicated >> to locate for a human, because even >> when ignoring the semantic of words, >> our eyes see vertical alignments >> >> Here is a sample table (must be viewed >> with fixed-width font to see alignments): >> ================================= >> >> 44544 ipod apple black 102 >> GFGFHHF-12 unknown thing bizar brick mortar tbc >> 45fjk do not know + is less biac >> disk seagate 250GB 130 >> 5G_gff tbd tbd >> gjgh88hgg media record a and b 12 >> hjj foo bar hop zip >> hg uy oi hj uuu ii a qqq ccc v ZZZ Ughj >> qdsd zert nope nope >> >> ================================= >> >> I want the python code that builds a representation >> of this table (for exemple a list of lists, where each list >> represents a table line, each element of the list >> being a field value). >> >> Any hints? >> thanks >> > > As promised. I call this the "cast a shadow" algorithm for table > discovery. This is about as obfuscated as I could make it. It will be up > to you to explain it to your teacher ;-) >

James - I used your same algorithm, but I guess I used more brute force
(and didn't use pyparsing, either!).

-- Paul


data = """\
44544      ipod          apple     black        102
GFGFHHF-12 unknown thing bizar     brick mortar tbc
45fjk      do not know   + is less              biac
           disk          seagate   250GB        130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                12
hjj        foo           bar       hop          zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope         nope""".split('\n')

# find rightmost space characters delimiting text columns:
# start from every column index, then drop each column that holds
# a non-space character in at least one line
spaceCols = set(range(max(map(len, data)))) - \
    set( [col for line in data
              for col,c in enumerate(line.expandtabs())
              if not c.isspace() ] )
# within each run of blank columns, keep only the rightmost one
spaceCols -= set( [c for c in spaceCols if c+1 in spaceCols ] )

# convert to sorted list of leading col characters
# (a real list, so the concatenations below also work on Python 3,
# where map() returns an iterator)
spaceCols = [c+1 for c in sorted(spaceCols)]

# get and pretty-print data fields
dataFields = \
    [ [line.expandtabs()[start:stop]
          for (start,stop) in zip([0]+spaceCols,spaceCols+[None])]
      for line in data ]
import pprint
pprint.pprint( dataFields )

Gives:

[['44544      ', 'ipod          ', 'apple     ', 'black        ', '102'],
 ['GFGFHHF-12 ', 'unknown thing ', 'bizar     ', 'brick mortar ', 'tbc'],
 ['45fjk      ', 'do not know   ', '+ is less ', '             ', 'biac'],
 ['           ', 'disk          ', 'seagate   ', '250GB        ', '130'],
 ['5G_gff     ', '              ', 'tbd       ', 'tbd', ''],
 ['gjgh88hgg  ', 'media record  ', 'a and b   ', '             ', '12'],
 ['hjj        ', 'foo           ', 'bar       ', 'hop          ', 'zip'],
 ['hg uy oi   ', 'hj uuu ii a   ', 'qqq ccc v ', 'ZZZ Ughj', ''],
 ['qdsd       ', 'zert          ', '          ', 'nope         ', 'nope']]
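
One possible way to package the same column-finding idea for reuse, written for Python 3, is sketched below. The function name split_columns and the decision to strip the padding from each field are assumptions of this sketch, not part of the code above, and it expects a non-empty list of lines.

```python
def split_columns(lines):
    """Split aligned text lines into fields at columns that are blank in every line.

    Hypothetical helper: the name and the stripping of padding are
    illustrative choices, not taken from the original post.
    """
    lines = [line.expandtabs() for line in lines]
    width = max(len(line) for line in lines)

    # Columns that hold a non-space character in at least one line.
    used = {col for line in lines
                for col, ch in enumerate(line)
                if not ch.isspace()}
    blank = set(range(width)) - used

    # A field starts at column 0 and just after the rightmost column
    # of each run of blank columns.
    starts = [0] + sorted(c + 1 for c in blank if c + 1 not in blank)
    stops = starts[1:] + [None]

    return [[line[a:b].strip() for a, b in zip(starts, stops)]
            for line in lines]
```

Called on a list of aligned lines such as the data list above, it should return one list of trimmed fields per line, with '' standing in for an empty cell.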