On Wednesday 01 Jul 2009, pracheer gupta wrote:
> I have one big fat file(8GB) which is in the format
>
> "<DOC> <DOCNO> 93489fjdf -adsf0a-t9-4q </DOCNO>
> sdf0934lkrsjfamkf-q39qjkfrev-dafkvad ,43-0=-toqtgegedag=d0fga </DOC>
> <DOC> <DOCNO> 9348943jikfsdf0adfa-4q </DOCNO>
> sdf0934lkrsjfamkf-q39qjkfrev-dafkvad,34
> r09mkfas0923rfs;a[qr0qfsfvsdsaf
> </DOC>"
>
> note that the file looks like an xml at first glance but it isnt.
> This file has a new line character anywhere and everywhere. hence
> usage of .* becomes difficult.
>
> now the problem is i need to extract data between </DOCNO> till
> </DOC> and store it in a file by the name mentioned between <DOCNO>
> and </DOCNO>.

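If neither field contains "<", something like this works. It slurps the 
whole file into memory and only prints each name and content pair, so 
the print still has to be replaced with a write to the named file: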

perl -e 'undef $/; $t = <STDIN>;
  # loop only while a record was matched and removed, so stray text cannot hang it
  while ($t =~ s/<DOC>\s*<DOCNO>([^<]+)<\/DOCNO>([^<]+)<\/DOC>\s*//) {
    $file = $1; $content = $2; print "<$file> <$content>\n";
  }'

Regards,

-- Raju
-- 
Raj Mathur                r...@kandalaya.org      http://kandalaya.org/
       GPG: 78D4 FC67 367F 40E2 0DD5  0FEF C968 D0EF CC68 D17F
PsyTrance & Chill: http://schizoid.in/   ||   It is the mind that moves

