On Thursday 24 January 2008 10:28, Alex Brelsfoard wrote:
> Here's a more detailed explanation:
> We are reading in feeds (typically TSV or CSV files),

Ah, I took "feeds" to mean RSS/Atom feeds. OK, so TSV/CSV is a different
beast, and in the general case it is notoriously tricky to parse. I'd
recommend looking into Text::CSV, Text::CSV::Simple and Text::CSV_XS.

> parsing them, 
> reformatting the data, and spitting it out as another file.
> These feeds can come from all sorts of people/places/things.
> So sometimes they are wonderfully formatted and we understand their
> content. Sometimes they are not, and we do not.

You first need to figure out, for a given file, whether it is CSV, TSV or
something else, and then branch your processing accordingly. I don't think
there is a general, robust solution for data that can be of any kind...
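
As a rough illustration of that first step, here is a crude sketch that
guesses the separator from the first line of a file. The name
guess_separator is made up for this example, and the heuristic is an
assumption of mine: it only counts tabs and commas, so a header with quoted
commas in a TSV file could fool it.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the field separator from one line of input.
# Returns "\t", "," or undef when neither character appears in the line.
sub guess_separator {
    my ($line) = @_;

    # tr/// with an empty replacement list counts without modifying $line.
    my $tabs   = ($line =~ tr/\t//);
    my $commas = ($line =~ tr/,//);

    return unless $tabs || $commas;          # no idea what this is
    return $tabs >= $commas ? "\t" : ",";    # ties go to tab
}

my $sep = guess_separator("id\tdescription\tprice");
print defined $sep && $sep eq "\t" ? "looks like TSV\n" : "looks like CSV\n";
```

The result could then be handed to the parser as its sep_char option, and
files for which the guess is undefined could go to a manual review pile.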

> So now we have a feed where the description column may have linebreaks in
> it.
> So I can't just split on any form of linebreak.
>
> Does this make a bit more sense?

It does. I've been bitten by linebreaks within a CSV field before. People
add line breaks in spreadsheets and resize spreadsheet columns until it
looks good on their screen/computer/spreadsheet, and don't realize it almost
assuredly won't look anything like it anywhere else... It can be an insidious 
and frustrating problem until you realize that the CSV format allows for 
that. Quoting from Text::CSV_XS: "The CSV file format does not require a 
specific character encoding, byte order, or line terminator format.
Each record is one line terminated by a line feed (ASCII/LF=0x0A) or a 
carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A), however, 
line-breaks can be embedded."
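
A minimal sketch of reading such a file with Text::CSV, letting the parser
rather than a split on linebreaks decide where each record ends. The sample
data and column names are invented for the example; the important part is
the binary option, which the module needs in order to accept line breaks
inside quoted fields.

```perl
use strict;
use warnings;
use Text::CSV;

# Made-up sample data: the description of the first record contains
# an embedded line break inside double quotes.
my $data = qq{id,description,price\n}
         . qq{1,"A widget\nwith two lines",9.99\n}
         . qq{2,"Plain gadget",4.50\n};

# binary => 1 lets quoted fields contain line breaks;
# auto_diag => 1 reports parse errors instead of failing silently.
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

# An in-memory filehandle keeps the example self-contained;
# a handle opened on a real file is used the same way.
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;    # each $row is an array ref of fields
}
close $fh;

printf "parsed %d records (header included)\n", scalar @rows;
```

The point of using getline instead of reading lines yourself is that the
parser keeps reading until the quoted field is closed, so the embedded
break ends up inside the field rather than splitting the record.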

> btw, there's no chance that I could define $/ as a regex could I?

No. $/ is always treated as a literal string, never as a regex. Unless you
use awk...
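
If the goal is only to cope with mixed line terminators, one workaround is
to slurp the input and split it with a regex yourself. This is just a sketch
with made-up sample data, and note that it does not help with line breaks
embedded in quoted CSV fields; for those a real CSV parser is still needed.

```perl
use strict;
use warnings;

# Made-up input mixing LF, CRLF and bare CR line endings.
my $data = "alpha\nbeta\r\ngamma\rdelta\n";

open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

# Locally undefining $/ makes the read return the whole input at once.
my $content = do { local $/; <$fh> };
close $fh;

# Split on any of the three terminators; CRLF is listed first so that
# it is treated as a single break rather than two.
my @lines = split /\r\n|\n|\r/, $content;

print scalar(@lines), " lines\n";
```

This reads the whole file into memory, which is fine for small feeds but
may not be for very large ones.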

Bernardo
 
_______________________________________________
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm
