On Wednesday 01 Jul 2009, pracheer gupta wrote:
> I have one big fat file(8GB) which is in the format
>
> "<DOC> <DOCNO> 93489fjdf -adsf0a-t9-4q </DOCNO>
> sdf0934lkrsjfamkf-q39qjkfrev-dafkvad ,43-0=-toqtgegedag=d0fga </DOC>
> <DOC> <DOCNO> 9348943jikfsdf0adfa-4q </DOCNO>
> sdf0934lkrsjfamkf-q39qjkfrev-dafkvad,34
> r09mkfas0923rfs;a[qr0qfsfvsdsaf
> </DOC>"
>
> note that the file looks like an xml at first glance but it isnt.
> This file has a new line character anywhere and everywhere. hence
> usage of .* becomes difficult.
>
> now the problem is i need to extract data between </DOCNO> till
> </DOC> and store it in a file by the name mentioned between <DOCNO>
> and </DOCNO>.

If neither field contains "<", something like this works:

    perl -e 'undef $/; $t = <STDIN>;
      while ($t =~ s/<DOC>\s*<DOCNO>([^<]+)<\/DOCNO>([^<]+)<\/DOC>\s*//s) {
        $file = $1; $content = $2;
        print "<$file> <$content>\n";
      }'

This reads the whole file into memory and prints each name and content pair. Looping on the substitution itself makes it stop as soon as no further record matches, instead of spinning forever on leftover text.

Regards,

-- Raju
--
Raj Mathur r...@kandalaya.org http://kandalaya.org/