I have one very large file (8 GB) in the following format:

"<DOC> <DOCNO> 93489fjdf -adsf0a-t9-4q </DOCNO> 
sdf0934lkrsjfamkf-q39qjkfrev-dafkvad
,43-0=-toqtgegedag=d0fga </DOC> <DOC> <DOCNO> 9348943jikfsdf0adfa-4q </DOCNO>
sdf0934lkrsjfamkf-q39qjkfrev-dafkvad,34
r09mkfas0923rfs;a[qr0qfsfvsdsaf
</DOC>"

Note that the file looks like XML at first glance, but it isn't.
Newline characters can appear anywhere in the file, so a pattern such as .*
is difficult to use, because it does not match across line breaks.


Now the problem: I need to extract the data between </DOCNO> and </DOC> and
store it in a file named after the text between <DOCNO> and </DOCNO>.
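
To make the requirement concrete, the intended behaviour can be sketched in
Python roughly as follows. This is only an illustration, not a sed or awk
answer, and it rests on several assumptions that the description above does
not confirm: the tags themselves are never split by a newline, the file can
be read as Latin-1 text, and whitespace or "/" inside a DOCNO may be replaced
by "_" to form a usable file name. The names iter_records and split_file are
made up for this sketch. The file is read in chunks so that the whole 8 GB
never has to sit in memory at once.

```python
import os
import re

# One record, without its closing tag. re.DOTALL lets "." match the
# newlines that are scattered through the file.
RECORD = re.compile(r"<DOC>\s*<DOCNO>(.*?)</DOCNO>(.*)", re.DOTALL)
END = "</DOC>"


def iter_records(stream, chunk_size=1 << 20):
    """Yield (docno, body) pairs while reading the stream chunk by chunk."""
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Emit every complete record currently held in the buffer.
        while True:
            end = buf.find(END)
            if end == -1:
                break
            match = RECORD.search(buf[:end])
            if match:
                yield match.group(1), match.group(2)
            buf = buf[end + len(END):]


def split_file(src_path, out_dir):
    """Write each record body to out_dir/<docno>; return the file count."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # Assumption: Latin-1 maps every byte to a character, so decoding
    # cannot fail; newline="" keeps the original line breaks untouched.
    with open(src_path, "r", encoding="latin-1", newline="") as src:
        for docno, body in iter_records(src):
            # Assumption: collapse whitespace and "/" so the DOCNO is a
            # valid file name; records with an empty DOCNO are skipped.
            name = "_".join(docno.split()).replace("/", "_")
            if not name:
                continue
            path = os.path.join(out_dir, name)
            with open(path, "w", encoding="latin-1", newline="") as out:
                out.write(body)
            count += 1
    return count
```

Calling split_file("big.txt", "out") (both names hypothetical) would write
one file per record under out/ and return the number of files written. Two
records with the same DOCNO would overwrite each other in this sketch.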

At first glance it looked like a job for awk, but after some unsuccessful
attempts I tried sed and couldn't quite get the regex pattern right.

Can anyone help with a regex pattern, or with a sed or awk solution?

pracheer gupta

_______________________________________________
ilugd mailinglist -- ilugd@lists.linux-delhi.org
http://frodo.hserus.net/mailman/listinfo/ilugd
Archives at: http://news.gmane.org/gmane.user-groups.linux.delhi 
http://www.mail-archive.com/ilugd@lists.linux-delhi.org/