On Oct 14, 8:48 am, [EMAIL PROTECTED] wrote:
> Hi all,
>
> I started Python just a little while ago and I am stuck on something
> that is really simple, but I just can't figure out.
>
> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file. The
> information is always EINECS number, CAS, chemical name, and formula
> in tables. I need to organize them into lines with | in between. So
> it goes from:
>
> 200-763-1 71-73-8
> nátrium-tiopentál C11H18N2O2S.Na to:
>
> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová
>
> I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off.

Pyparsing might be overkill for this example, but it makes a good
demo. If you end up doing lots of data extraction like this,
pyparsing is a useful tool.

In pyparsing, you define expressions using pyparsing classes and
built-in strings, then use the constructed expression to parse the
data (using parseString, scanString, searchString, or
transformString). In this example, searchString is the easiest to
use.

After the parsing is done, the parsed fields are returned in a
ParseResults object, which has some list-style and some dict-style
behavior. I've given each field a name based on your post, so that
you can read the tokens right out of the results as if they were
attributes of an object.

This example emits your '|'-delimited data, but the commented lines
show how you could access the individually parsed fields, too.

Learn more about pyparsing at http://pyparsing.wikispaces.com/ .

-- Paul


# -*- coding: iso-8859-15 -*-

data = """200-720-7 69-93-2
kyselina mocová C5H4N4O3
200-001-8 50-00-0
formaldehyd CH2O
200-002-3 50-01-1
guanidínium-chlorid CH5N3.ClH
"""

from pyparsing import Word, nums, OneOrMore, alphas, alphas8bit

# define expressions for each part in the input data

# a numeric id starts with a number, and is followed by
# any number of numbers or '-'s
numericId = Word(nums, nums+"-")

# a chemical name is one or more words, each made up of
# lowercase alphas (including 8-bit alphas) or '-'s
chemName = OneOrMore(Word(alphas.lower()+alphas8bit.lower()+"-"))

# when returning the chemical name, rejoin the separate
# words into a single string, with spaces
chemName.setParseAction(lambda t: " ".join(t))

# a chemical formula is a 'word' starting with an uppercase
# alpha, followed by alphas, numbers, or '.'s (lowercase and '.'
# are needed for formulas such as CH5N3.ClH)
chemFormula = Word(alphas.upper(), alphas+nums+".")

# put all expressions into overall form, and attach field names
entry = numericId("EINECS") + \
        numericId("CAS") + \
        chemName("name") + \
        chemFormula("formula")

# search through input data, and print out retrieved data
for chemData in entry.searchString(data):
    print("%(EINECS)s|%(CAS)s|%(name)s|%(formula)s" % chemData)
    # or print each field by itself
    # print(chemData.EINECS)
    # print(chemData.CAS)
    # print(chemData.name)
    # print(chemData.formula)
    # print("")

prints:
200-720-7|69-93-2|kyselina mocová|C5H4N4O3
200-001-8|50-00-0|formaldehyd|CH2O
200-002-3|50-01-1|guanidínium-chlorid|CH5N3.ClH

--
http://mail.python.org/mailman/listinfo/python-list
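
Since the reply itself notes that pyparsing may be overkill for data this regular, here is a rough standard-library sketch of the same extraction using the re module. It is not part of the original reply. The names ENTRY and to_pipe_lines are made up for illustration, and the pattern assumes that chemical names never contain a word starting with an uppercase ASCII letter and that formulas consist only of ASCII letters, digits, and '.'.

```python
# -*- coding: utf-8 -*-
import re

# Sample input, laid out as in the original question:
# ids on one line, name and formula on the next.
data = """200-720-7 69-93-2
kyselina močová C5H4N4O3
200-001-8 50-00-0
formaldehyd CH2O
200-002-3 50-01-1
guanidínium-chlorid CH5N3.ClH
"""

# Hypothetical pattern: two numeric ids, a name of one or more
# whitespace-separated words, then a formula that starts with an
# uppercase ASCII letter. The name group is lazy, so it stops at the
# first word that can be read as a formula.
ENTRY = re.compile(
    r"(?<!\S)(?P<EINECS>\d[\d-]*)\s+"
    r"(?P<CAS>\d[\d-]*)\s+"
    r"(?P<name>\S+(?:\s+\S+)*?)\s+"
    r"(?P<formula>[A-Z][A-Za-z0-9.]*)(?!\S)"
)


def to_pipe_lines(text):
    """Return one 'EINECS|CAS|name|formula' string per entry found."""
    lines = []
    for m in ENTRY.finditer(text):
        # collapse any internal whitespace in the name to single spaces
        name = " ".join(m.group("name").split())
        lines.append("|".join(
            [m.group("EINECS"), m.group("CAS"), name, m.group("formula")]
        ))
    return lines


for line in to_pipe_lines(data):
    print(line)
```

The named groups play the same role as the field names in the pyparsing version, so m.group("name") corresponds to chemData.name. If the real tables turn out to be less regular than the samples, the pyparsing grammar is likely to be easier to extend than this pattern.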