On Thursday 24 January 2008 10:28, Alex Brelsfoard wrote:
> Here's a more detailed explanation:
> We are reading in feeds (typically TSV or CSV files),

Ah, by "feeds" I understood RSS/Atom feeds. OK, so TSV/CSV is a
different beast. In the general case it is notoriously tricky to parse.
I'd recommend you look into Text::CSV, Text::CSV::Simple and
Text::CSV_XS.

> parsing them,
> reformatting the data, and spitting it out as another file.
> These feeds can come from all sorts of people/places/things.
> So sometimes they are wonderfully formatted and we understand their
> content. Sometime they are not, and we do not.

You first need to figure out, for a given file, whether it is CSV or
not, and then branch your processing accordingly. I don't think there
is a general, robust solution for data that can be of any kind...

> So now we have a feed where the description column may have linebreaks in
> it.
> So I can't just split on any form of linebreak.
>
> Does this make a bit more sense?

It does. I've been bitten by line breaks within a CSV field before.
People add line breaks in spreadsheets and resize spreadsheet columns
until it looks good on their screen/computer/spreadsheet, and don't
realize it almost certainly won't look like that anywhere else...

It can be an insidious and frustrating problem until you realize that
the CSV format allows for that. Quoting from Text::CSV_XS:

"The CSV file format does not require a specific character encoding,
byte order, or line terminator format. Each record is one line
terminated by a line feed (ASCII/LF=0x0A) or a carriage return and line
feed pair (ASCII/CRLF=0x0D 0x0A), however, line-breaks can be embedded."

> btw, there's no chance that I could define $/ as a regex could I?

No. Unless you use awk...

Bernardo

_______________________________________________
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm
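
The embedded line break problem discussed in this message can be
reproduced in a few lines. This is only a minimal sketch, and it is not
Perl: it uses Python's standard csv module purely so that it is
self-contained, and the sample feed and the read_records name are made
up for illustration. It shows why splitting on line breaks miscounts
records, while a CSV-aware reader keeps a quoted line break inside its
field. The same idea should carry over to the Text::CSV_XS reader
mentioned in the message.

```python
import csv
import io

# Hypothetical sample feed: the description field of the first data
# record contains a line break inside double quotes.
SAMPLE = 'id,description\r\n1,"first line\nsecond line"\r\n2,plain\r\n'


def read_records(text):
    """Parse CSV text into a list of rows, letting the parser find record ends."""
    # newline="" leaves line endings untranslated, so the csv module sees
    # the embedded break exactly as it appears in the data.
    return list(csv.reader(io.StringIO(text, newline="")))


# Naive approach: treat every line break as a record boundary.
naive = SAMPLE.splitlines()

# CSV-aware approach: quoted line breaks stay inside their field.
records = read_records(SAMPLE)

print(len(naive), "lines from naive splitting")
print(len(records), "records from the CSV parser")
```

The naive split sees one more "record" than the parser does, because it
cuts the quoted description in two.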