what to call my module

2003-02-10 Thread Paul Tremblay

I am planning to distribute my module, but am confused about what to call
it.

I have a series of modules in a directory that I will call Rtf2xml. (I
checked on CPAN, and don't believe this namespace is taken.)

I named each module by a simple name--for example Pict.pm. In the main
script, I have:

use Rtf2xml::Pict;
...
Pict::process_pict();


In the actual module, I have:

package Pict;


Everything works fine this way. But when I look at other modules, I
notice that they use a different naming convention. 

My question is whether I should follow this syntax instead:

package Rtf2xml::Pict;  # for the actual module

Rtf2xml::Pict::process_pict();  # for calling a subroutine in the package


I believe that the second way provides a more distinct namespace?
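
A minimal sketch of the second convention, with the package declared inline so it runs as a single file. The body of process_pict here is a hypothetical stand-in, not the real routine; in a distribution the package block would presumably live in lib/Rtf2xml/Pict.pm and end with "1;".

```perl
use strict;
use warnings;

# Inline stand-in for lib/Rtf2xml/Pict.pm (hypothetical contents).
{
    package Rtf2xml::Pict;

    # Placeholder body: the real routine would process picture data.
    sub process_pict {
        my ($data) = @_;
        return "processed: $data";
    }
}

# The caller uses the fully qualified name, so it cannot clash with
# some other module that also happens to be called Pict.
print Rtf2xml::Pict::process_pict("picture data"), "\n";
```

The package name matching the path under lib/ is what lets "use Rtf2xml::Pict;" find the file and the fully qualified call find the subroutine.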

I *have* read the documentation on how to distribute a module, but I
couldn't follow most of the jargon. 

Thanks

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: how to set up Makefile.pl

2003-02-06 Thread Paul Tremblay

Thanks Bob, and thanks RF. I used seek and tell, and now have the data
as part of a module. 

Paul


On Wed, Feb 05, 2003 at 05:13:24PM -0500, Bob Showalter wrote:
 From: Bob Showalter [EMAIL PROTECTED]
 To: 'Paul Tremblay' [EMAIL PROTECTED], [EMAIL PROTECTED]
 Subject: RE: how to set up Makefile.pl
 Date: Wed, 5 Feb 2003 17:13:24 -0500 
 
 Paul Tremblay wrote:
  Thanks, but this won't work. I need to open the data file and read
  some data. Later in the script, I need to open the data file again
  and read more data. When I use <DATA>, perl apparently reads one
  line at a time until it finds what I want. It then starts at the line
  where I left off when I need to read more data. It does not start at the
  beginning again, so I cannot find the data I need. 
  
  The only solution I can think of using the __DATA__ method would be to
  have my script print out *all* data to a temporary file. I could then
  open and close this file when I wanted data. But this seems like kind
  of a hack--and it would take a bit more time, though only a second or
  two. 
 
 Two alternate approaches:
 
 1) read the data into a memory structure (array or hash)
 
 2) use tell()/seek() on the DATA file handle to move around. 
 
 n.b. seek(DATA, 0, SEEK_SET) does not put you at the first line after
 __DATA__, it puts you at the top of your script file. So use tell() to mark
 the start point before you start reading.
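
Both approaches can be sketched in a few lines. To keep the example runnable on its own it reads from an in-memory filehandle rather than a real DATA section; that substitution, the sample table, and the name lookup are assumptions made for illustration. With a real module the handle would be \*DATA.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Stand-in for DATA so the sketch runs anywhere; in the real module the
# table would sit after __DATA__ and \*DATA would replace $fh.
my $table_text = "a 61\nb 62\nc 63\n";
open my $fh, '<', \$table_text or die "cannot open in-memory handle: $!";

# Approach 2: mark the start of the data before the first read ...
my $start = tell $fh;

# ... and rewind to that mark (not to offset 0) before every search.
sub lookup {
    my ($key) = @_;
    seek $fh, $start, SEEK_SET or die "cannot seek: $!";
    while (my $line = <$fh>) {
        my ($k, $v) = split ' ', $line;
        return $v if $k eq $key;
    }
    return;
}

print lookup('c'), " ", lookup('a'), "\n";

# Approach 1: read everything once into a hash and never seek again.
seek $fh, $start, SEEK_SET or die "cannot seek: $!";
my %table;
while (my $line = <$fh>) {
    my ($k, $v) = split ' ', $line;
    $table{$k} = $v;
}
print "b maps to $table{b}\n";
```

For a small lookup table the hash is usually the simpler choice; the tell/seek version only pays off when the data is too large to hold in memory.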
 





Re: how to set up Makefile.pl

2003-02-05 Thread Paul Tremblay
Thanks, but this won't work. I need to open the data file and read some
data. Later in the script, I need to open the data file again and read
more data. When I use <DATA>, perl apparently reads one line at a time
until it finds what I want. It then starts at the line where I left off
when I need to read more data. It does not start at the beginning again,
so I cannot find the data I need.

The only solution I can think of using the __DATA__ method would be to have
my script print out *all* data to a temporary file. I could then open
and close this file when I wanted data. But this seems like kind of a
hack--and it would take a bit more time, though only a second or two.

Paul

On Wed, Feb 05, 2003 at 09:13:13AM -0500, RF wrote:
 
 However, I have several files that aren't modules but are needed
 for my script. One of these is a data file. How do I set up Makefile.PL
 to make sure this data file gets put in a place where the script can
 read it?
 
 The answer verbatim from The Perl Cookbook
 (http://www.oreilly.com/catalog/cookbook)
 
 Problem
 
You have data that you want to bundle with your
 program and treat as though it were in a file, but you don't want it to
 be in a different file.
 
Solution
 
Use the __DATA__ or __END__ tokens after your program
 code to mark the start of a data block, which can be
read inside your program or module from the DATA
 filehandle.
 
Use __DATA__ within a module:
 
while (<DATA>) {
    # process the line
}
__DATA__
# your data goes here
 
Similarly, use __END__ within the main program file:
 
while (<main::DATA>) {
    # process the line
}
__END__
# your data goes here
 
Discussion
 
__DATA__ and __END__ indicate the logical end of a
 module or script before the physical end of file is reached.
Text after __DATA__ or __END__ can be read through
 the per-package DATA filehandle. For example, take the
hypothetical module Primes. Text after __DATA__ in
 Primes.pm can be read from the Primes::DATA filehandle.
 
__END__ behaves as a synonym for __DATA__ in the main
 package. Text after __END__ tokens in modules is
inaccessible.
 
This lets you write self-contained programs that
 would ordinarily keep data kept in separate files. Often this is used
for documentation. Sometimes it's configuration data
 or old test data that the program was originally developed
with, left lying about in case it ever needs to be
 recreated.
 
Another trick is to use DATA to find out the current
 program's or module's size or last modification date. On most
systems, the $0 variable will contain the full
 pathname to your running script. On systems where $0 is not correct,
you could try the DATA filehandle instead. This can
 be used to pull in the size, modification date, etc. Put a special
token __DATA__ at the end of the file (and maybe a
 warning not to delete it), and the DATA filehandle will be to
the script itself.
 
use POSIX qw(strftime);
 
$raw_time = (stat(DATA))[9];
$size     = -s DATA;
$kilosize = int($size / 1024) . 'k';
 
print "<P>Script size is $kilosize\n";
print strftime("<P>Last script update: %c (%Z)\n",
    localtime($raw_time));
 
__DATA__
DO NOT REMOVE THE PRECEDING LINE.
 
Everything else in this file will be ignored.
 
See Also
 
The "Scalar Value Constructors" section of perldata(1), and the "Other
 literal tokens" section of Chapter 2 of Programming Perl
 
 
 





how to set up Makefile.pl

2003-02-04 Thread Paul Tremblay
I'm trying to write a Makefile.PL to distribute a module. By just playing
around, I believe I've kind of got the hang of how to make sure my
module gets put in the right place.

However, I have several files that aren't modules but are needed
for my script. One of these is a data file. How do I set up Makefile.PL
to make sure this data file gets put in a place where the script can
read it?

Also, I would like the Makefile.PL to put an executable in an
appropriate place (/usr/bin, or wherever) so the user can run my script
with just a command.

I have looked on the web but haven't found any straightforward
documentation.

Thanks

Paul





Re: how to distribute a module

2003-02-03 Thread Paul Tremblay
On Sun, Feb 02, 2003 at 10:10:02PM -0800, Randal L. Schwartz wrote:
 

 
 rtf is a Really Bad Name for a module.
 
 First, it starts with a lowercase letter, which is reserved
 for system packages and modules.  Not You.
 
 Second, if you ever plan on distributing it outside your local group,
 the name should be selected by contacting [EMAIL PROTECTED]  If
 you *don't* plan on distributing it outside your local group, pick
 a group prefix uniquely identifying your group.  I use Stonehenge::.
 

Thanks. I will check this out. I really don't want to create a module
so much as I want to package my script for distribution. The script
consists of one main script which uses my own modules. I doubt anyone
else will use these modules themselves, so perhaps I shouldn't even try
to package them as modules? 

Right now I have this line in the main script:

use lib '/perl5/rtf';

I could have the user put the rtf directory wherever s/he wants, and
then just change the line in the main script. This just seemed like kind
of a hack.
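
One way to avoid editing that line by hand is to compute the path from the script's own location with the core FindBin module. This is only a sketch: it assumes the modules are shipped in a directory named lib next to the main script, which is not how the files are laid out above.

```perl
use strict;
use warnings;
use FindBin qw($Bin);

# Look for modules in a "lib" directory beside the script itself.
# The directory name is an assumption made for this sketch.
use lib "$Bin/lib";

print "first module search directory: $INC[0]\n";
```

With this in place the user can unpack the whole directory anywhere and the script still finds its modules.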

 Third... you *are* aware that there are a lot RTF things already,
 right?  So you might just be reinventing some portion of something
 that is already tested and deployed.
 
 See search.cpan.org for more details.

No. Every time I ask for help on this script, on different mailing
lists, I get this response. But there is no open source project that
converts RTF to XML. There are scripts that convert RTF to other
formats, but not XML. I have checked several times. I am working with
someone else, and he has also checked.

(Of course, there is always the possibility that I am still wrong!)

Thanks

Paul






how to distribute a module

2003-02-02 Thread Paul Tremblay
I have a series of modules that I would like to distribute for public
use. These modules convert RTF to XML, and they are all contained in a
folder called rtf.

Can anyone tell me how to package these modules for distribution? The
*Perl Cookbook* doesn't seem to be much help in this area. If I type 

h2xs -XA rtf

Then I am given just the tidbits to get started with making my module
ready for distribution.

How do I get all the modules in the right perl library? And how do I get
the executable to the right place? Also, I have a data table that the
script needs, and I have a dtd that also needs to be put in the right
place. I don't know exactly what the right place is right now--as long
as the script knows where they are, it will work.

I could have the user install everything manually, but that seems like
it is kind of a hack. I would like a user to be able to type 

make
make install
make test

and have everything working right.

Thanks

Paul





Re: how to distribute a module

2003-02-02 Thread Paul Tremblay
On Sun, Feb 02, 2003 at 08:23:04PM +0100, Paul Johnson wrote:
 
  I have a series of modules that I would like to distribute for public
  use. These modules convert RTF to XML, and they are all contained in a
  folder called rtf.
  
  I could have the user install everything manually, but that seems like
  it is kind of a hack. I would like a user to be able to type 
  
  make
  make install
  make test
  
  and have everything working right.
 
 You need to create a Makefile.PL.  For documentation on how to do this
 see:
 
   perldoc ExtUtils::MakeMaker
 

Thanks. I have run 

h2xs -XA -n rtf

This creates a skeleton Makefile.PL. I checked out an example
Makefile.PL from a module called HTML-Format. This person put all the
modules in a directory called HTML, then put this directory in a
directory called lib. So the Makefile.PL looks like: 


WriteMakefile(
    NAME         => 'HTML-Format',
    VERSION_FROM => 'lib/HTML/Formatter.pm',
    PREREQ_PM    => {
        'HTML::Element' => 1.44,
        'Font::AFM'     => 1.17,
    },
    dist => { COMPRESS => 'gzip -9f', SUFFIX => 'gz', },
);

In my case, I believe I just need to do the same thing, except change the
line 

VERSION_FROM => 'lib/HTML/Formatter.pm',

to 

VERSION_FROM => 'lib/rtf/some_file_with_a_version',

? 

In addition, I would delete the line starting with PREREQ_PM, since my
modules require no other modules.

However, how would I tell the Makefile.PL to install an executable, and
how do I put the data file in the right place? This data file contains the
data for character encoding. Obviously, the main script needs to know
where to find this data file. 

Last, there is a DTD. Once my module converts the RTF to XML, it writes
a path to the DTD. Again, the script needs to put this DTD someplace,
and needs to know where it is.
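
For what it's worth, ExtUtils::MakeMaker has an EXE_FILES parameter for scripts and a PM parameter that maps source files to install locations, which can carry non-module files along with the modules. The fragment below is a sketch under assumed names: the distribution name, every path, and the file names are hypothetical and would need to match the real layout.

```perl
# Makefile.PL -- a sketch; all names and paths below are hypothetical
use strict;
use warnings;
use ExtUtils::MakeMaker;

WriteMakefile(
    NAME         => 'Rtf2xml',
    VERSION_FROM => 'lib/Rtf2xml.pm',      # file that sets $VERSION
    EXE_FILES    => ['bin/rtf2xml'],       # installed with other scripts
    # Listing PM explicitly replaces the default scan of lib/, so every
    # file that should be installed has to appear here.
    PM           => {
        'lib/Rtf2xml.pm'             => '$(INST_LIB)/Rtf2xml.pm',
        'lib/Rtf2xml/Pict.pm'        => '$(INST_LIB)/Rtf2xml/Pict.pm',
        'lib/Rtf2xml/char_data.data' => '$(INST_LIB)/Rtf2xml/char_data.data',
        'lib/Rtf2xml/rtf2xml.dtd'    => '$(INST_LIB)/Rtf2xml/rtf2xml.dtd',
    },
);
```

Because the data file and the DTD end up beside the modules, a module could locate them at run time relative to its own path (for example from __FILE__) instead of relying on a hard-coded directory.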

Thanks

Paul






finding invalide options in getopt

2002-09-01 Thread Paul Tremblay

I would like to have my script print a short help message and then
quit if a user uses an invalid option. I am using the Getopt::Long
module.

Also, I've noticed that some scripts have a trick whereby if a user
types 

scriptname --help

The script prints out the pod. There is a link on perldoc.com for
Pod::Usage, but this link is broken.
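
A minimal sketch of the first request, relying on the fact that Getopt::Long returns false when it meets an option it does not know. The option names, the usage text, and the helper parse_options are made up for illustration. For printing the pod itself, the core Pod::Usage module provides pod2usage, which is not shown here.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical usage text and option names, for illustration only.
my $USAGE = "usage: rtf2xml [--help] [--output FILE] input.rtf\n";

# Returns a hash ref of options, or undef if the command line was invalid.
sub parse_options {
    my @args = @_;
    my %opt;
    # GetOptionsFromArray warns about unknown options and returns false.
    my $ok = GetOptionsFromArray(\@args, \%opt, 'help', 'output=s');
    return unless $ok;
    $opt{args} = \@args;    # whatever is left: the input files
    return \%opt;
}

my $opt = parse_options(@ARGV);
if (!$opt)        { print STDERR $USAGE; exit 2; }
if ($opt->{help}) { print $USAGE;        exit 0; }

print "input files: @{ $opt->{args} }\n";
```

Exiting with a non-zero status on a bad option lets shell scripts that call the program notice the failure.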

(I read your tutorial, Drieux, but it was a bit too advanced for me!)

Thanks

Paul





Re: where to put data files

2002-08-29 Thread Paul Tremblay

On Thu, Aug 29, 2002 at 06:47:26AM -0700, drieux wrote:
 
 ah! I see! you really are in the
 
   have code, the rest is temporary data files

I'm not sure I catch your drift here. Did a line get dropped from the
email?
 
 sorry for not quite catching your drift the first time.
 
 just a dumb-bunny question - but if I type
 
   rtf2xml --help
 
 will your code dump me some 'advice'? and/or
 
   perldoc rtf2xml
 
 show me if
 
   rtf2xml input_file output_file
 

No to both questions! Which means I need to include both. I have just
added some switches for this script. I guess I can just add a switch
for --help. 

# I've already defined my switch --help and attached it to $help
if ($help) {
    print "In order to use this script...\n";
}

As far as perldoc goes, if I include any pod in the script, then
typing perldoc rtf2xml should output the documentation? (I'll try this
out right now. I'll probably have the answer before you respond!)

 would work - or that in this release it is only
 sending output to STDOUT?
 
 that being the other rack of basics - all you
 really need to do to distribute it would be
 to hang the code on a webPage and let folks 'save to file'
 from their browser
 
 The next level of complexity is to either go with
 the Make::Maker approach of building an installer
 based upon the usual
 
   perl *.PL
   make install
 
 model - or you will want to look at the other
 option of
 
   a) hand crafting an installation script
   b) adopting someone else's package installer
   - rpm/pkgadd
 
 and complying with their standard.
 
 The basic stuff that you will want to put into the
 general release are
 
   a) README - what we are about
   - clues to any other documentation
   b) ChangeLog - when we did what to this for why
   c) The Installation Stuff
   - how to check for all the dependencies
   - how to check for the installer's desires and wishes
   as well as explain to them if they try to install into
   places they shouldn't
   - how to actually install it...
   d) The Code Stuff
   - the one item - at this point
 
   e) Manifest - what should be in the tarball
 
 This way the person who just wants 'the application' Can
 just install it - without having to understand what all
 the rest of that stuff is about - while others will be able
 to keep track of how this evolves as it evolves...
 


Thanks. This is all useful info. Right now in order to run the script,
a user needs:

1. the code
2. the character_set file
3. a folder called rtf2xml_dir located in /usr/share
4. a temp folder located in the above folder.

It is pretty easy to write a script to make a folder and copy the
character_set file to this folder. Will all users have permission to
make such a folder? 

And where should I put the executable script? In /bin, or /usr/bin, or
what? 

I think I'll go with simply writing my own simple script for making
the installation. I'll also give explicit instructions in README on
how to install the few components by hand, (if the user wants to) as
well as how to change the place the script writes its temporary files,
and where it reads its data. (I put a variable called $directory at
the beginning of the script so a user can change it easily if he
wants.)

I guess I need to work on the pod documentation. I believe I should be
able to simply follow the Perl Cookbook.

Thanks

Paul
 
 
 
 





convert decimal to hexidecimal

2002-08-28 Thread Paul Tremblay

Is there a way to convert a decimal number to a hexadecimal number in
perl? 

I have expressions like this:

\u8195\'20
\u9824\'3f

The number after the u is a decimal number that needs to be converted
to hexadecimal. The number after the second backslash simply needs to be
eliminated. Thus, the two lines above should look like:

&#x2003;
&#x2660;
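
One way to do this is sprintf with the %X format inside a substitution. The sketch below assumes the fallback after the Unicode value is always \' followed by two hex digits, as in the two samples; the variable names are illustrative.

```perl
use strict;
use warnings;

my $rtf = q{\u8195\'20 and \u9824\'3f};

# Capture the decimal value after \u, drop the \'xx fallback that
# follows it, and emit a hexadecimal character reference instead.
(my $xml = $rtf) =~ s/\\u(\d+)\\'[0-9a-fA-F]{2}/sprintf('&#x%X;', $1)/ge;

print "$xml\n";    # prints: &#x2003; and &#x2660;
```

The /e modifier makes Perl evaluate the replacement as code, which is what allows the sprintf call there.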

Thanks

Paul





Re: convert decimal to hexidecimal

2002-08-28 Thread Paul Tremblay

On Wed, Aug 28, 2002 at 09:51:11PM -0700, Randal L. Schwartz wrote:
 Why?  &#8195; &#9824; is perfectly valid, unless you're talking about
 something besides XML or HTML.
 
 -- 

Yes, it certainly is! I realized this as soon as I sent off the email.
Most of RTF (I am writing a script to convert RTF to XML) uses
hexadecimal, so I was thinking along those lines.

But yes, I only have to do a simple substitution. 

None-the-less, I am glad for the tips I have gotten. Who knows if I'll
have to convert to hexadecimal sometime in the future?

Thanks everyone

Paul





Re: where to put data files

2002-08-27 Thread Paul Tremblay

On Wed, Aug 21, 2002 at 11:43:43AM -0700, drieux wrote:
 On Wednesday, August 21, 2002, at 10:06 , Paul Tremblay wrote:
 
  I am writing a script that converts RTF to XML, and this script needs
  to read an external data file to form a hash. I plan to make this
  script available to anyone who needs it, and am wondering what to do
  with this external data file.
 
 when you say 'external data file' - do you mean things
 that are required for the script itself to work? if so,
 you may want to look at the traditional solution of
 getting those to install into the site_perl as a perl module,
 hence you will most likely want to look at h2xs as a
 first round for how to do that

Okay, I'm getting closer to finishing the d**n beast of a script, so I
need to think about how to distribute it. This script will be a
command line utility. Right now it has no switches:

rtf2xml file

I want to make it easy to use for people who don't know perl. Should I
still use a module? 

My understanding of a module is that it provides code for other perl
scripts, rather than being a stand alone utility by itself. 

I am thinking that if I distribute this script as a module, I will
have to write a wrapper for it? That is, say I have a script like
this:

#!perl
print "hello world";

I have to rewrite it so it looks like:

sub main {
    print "hello world";
}

Now I call this module hello_world.pm. This installs in the proper
places. I write another script that looks like this:

#!perl
use hello_world;
main();
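
A sketch of how that split might look once it is filled in. The package is declared inline here so the example runs as one file; in a distribution it would sit in its own .pm file ending with "1;". The names Hello::World and run are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for a module file such as lib/Hello/World.pm (hypothetical).
{
    package Hello::World;

    # All the real work lives in the module ...
    sub run {
        my ($class, @args) = @_;
        my $name = @args ? $args[0] : 'world';
        return "hello $name\n";
    }
}

# ... and the installed command is a thin wrapper that hands over
# the command-line arguments.
print Hello::World->run(@ARGV);
```

Calling the routine through the package name avoids having to export main into the script's namespace.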

Thanks

Paul

 
 or do you mean, as seems to be suggested below,
 any 'temporary files' that are created in transit?
 
  Should I simply make the data file available with the script and tell
  the user where to put it? Right now, I have it in
  /usr/share/rtf2xml/char_data.data.
 
 a reasonable alternative plan - but going with the
 traditional perl module approach will mean that
 they can then just do the usual
 
   perl Makefile.PL
   make
   make test
   make install
 
 and let it go at that...
 
 This way if you need to upgrade along the way - and it's
 all nice and neatly in a perl module - you can then distribute
 a new release of it - and not need to upgrade the core code itself.
 
 [..]
 
  Also, the script outputs to standard output. Future versions might
  require that the script make one run through the file, write to a temp
  file, then make a run through the temp file. Where should I put this
  temp file (if I need to make one, that is)?
 
 [..]
 
 you might want to look at
 
   IO::File
 
 and use the new_tmpfile method to create a temp file
 'for the duration' - before sending it to stdout
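
A small sketch of that two-pass idea using IO::File's new_tmpfile. The three sample lines stand in for the intermediate output and are made up for illustration.

```perl
use strict;
use warnings;
use IO::File;

# An anonymous temporary file: no name to choose, nothing to clean up.
my $tmp = IO::File->new_tmpfile or die "cannot create temp file: $!";

# First pass: write the intermediate result.
print {$tmp} "line $_\n" for 1 .. 3;

# Second pass: rewind to the start and read it back.
seek $tmp, 0, 0 or die "cannot rewind: $!";
my @lines = <$tmp>;

print @lines;    # final output goes to STDOUT
```

Since the file has no name, the question of which directory to put it in does not come up.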
 
 also IF in this future upgrade you start doing the
 
   myCode Infile Outfile
 
 then you could use the file named as Outfile as the temp file,
 unless you allow things like
 
   myCode Infile - | doFoo
 
 where you expressly allow '-' as the 'cheat' that you want
 the output to finally go to stdout
 
 ciao
 drieux
 
 ---
 
 





where to put data files

2002-08-21 Thread Paul Tremblay

I am writing a script that converts RTF to XML, and this script needs
to read an external data file to form a hash. I plan to make this
script available to anyone who needs it, and am wondering what to do
with this external data file. 

Should I simply make the data file available with the script and tell
the user where to put it? Right now, I have it in
/usr/share/rtf2xml/char_data.data. 

How about on a Windows system, which I don't know much about? 

Should I write a script that creates a directory and puts the data file
in that directory? What is the standard procedure here? 

Also, the script outputs to standard output. Future versions might
require that the script make one run through the file, write to a temp
file, then make a run through the temp file. Where should I put this
temp file (if I need to make one, that is)?

Thanks

Paul 





to slurp or not to slurp

2002-08-15 Thread Paul Tremblay

I am writing a script to convert RTF to XML, and my output looks like
this: (I explain this ugliness below)

<id1listlevel1>
text
<id1listlevel2>
text
</id1listlevel2>
</id1listlevel1>
text
<id1listlevel1>
text
<id1listlevel2>
text
</id1listlevel2>
</id1listlevel1>

I know that is ugly to read, but I'm just pointing out that the tags
repeat themselves. It should look like this:

<id1listlevel1>
text
<id1listlevel2>
text
text
text
text
</id1listlevel2>
</id1listlevel1>

In other words, the list starts and then stops in the middle. In order
to get rid of these excessive tags, I was thinking of reading the
whole file into memory at once, and then doing this substitution:

my @array = split /(<id1listlevel1>.*<\/id1listlevel1>)/s, $_;
for my $name (@array) {
    if ($name =~ /<id1listlevel1>/) {
        $name =~ s/<\/?id1listlevel1>//g;
        $name = "<id1listlevel1>$name</id1listlevel1>";
    }
    print $name;
}

my @array = split /(<id1listlevel2>.*<\/id1listlevel2>)/s, $_;


I would actually use a loop for each level.

However, isn't it a bad idea to read the whole file in at once? What
happens if the user had a really huge file? 

My other method was to read my result file one line at a time. Once I
found <id1listlevel1> or anything that matches a similar pattern, I
would push it into an array. If I found it again, I would simply
delete it.  Then, read the file in backwards one line at a time, and
look for the pattern </id1listlevel1>, and allow only the first one of
these.

So, should I slurp or do it one line at a time?
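
The line-at-a-time idea can be sketched without reading backwards: make one pass to note where each tag first opens and last closes, then a second pass that drops every other occurrence. This assumes each tag sits on its own line and that all lists at a given level should be merged into one, as in the sample above. The function name and the in-memory filehandles are illustrative.

```perl
use strict;
use warnings;

# Keep only the first opening and the last closing line of each list tag.
sub merge_list_tags {
    my ($in, $out) = @_;
    my $tag = qr{^<(/?)(id\d+listlevel\d+)>\s*$};
    my (%first_open, %last_close);

    # Pass 1: record line numbers, holding nothing else in memory.
    my $n = 0;
    while (my $line = <$in>) {
        $n++;
        next unless $line =~ $tag;
        if ($1) { $last_close{$2} = $n }
        else    { $first_open{$2} = $n unless exists $first_open{$2} }
    }

    # Pass 2: rewind and copy, skipping the redundant tag lines.
    seek $in, 0, 0 or die "cannot rewind: $!";
    $n = 0;
    while (my $line = <$in>) {
        $n++;
        if ($line =~ $tag) {
            my $keep = $1 ? $last_close{$2} : $first_open{$2};
            next unless $keep == $n;
        }
        print {$out} $line;
    }
}

my $input = join '', map { "$_\n" } (
    '<id1listlevel1>', 'text a', '<id1listlevel2>', 'text b',
    '</id1listlevel2>', '</id1listlevel1>', 'text c',
    '<id1listlevel1>', 'text d', '<id1listlevel2>', 'text e',
    '</id1listlevel2>', '</id1listlevel1>',
);
open my $in,  '<', \$input     or die "cannot open input: $!";
open my $out, '>', \my $result or die "cannot open output: $!";
merge_list_tags($in, $out);
close $out;
print $result;
```

Memory use stays flat however large the file is, at the cost of reading it twice.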

In case you are wondering why my output has extra tags in the middle,
you can blame good ol Bill Gates. RTF really does suck. In earlier
versions of RTF, the code told you when the user was skipping an item
in a list (but was still continuing the list). For a reason only known
to the morons at micro$oft, they changed this code in word 97 and
2000. 

Thanks

Paul









Re: speed and perl

2002-08-03 Thread Paul Tremblay

On Fri, Aug 02, 2002 at 08:43:45AM -0500, Bryan DeLuca wrote:
 If you are interested in language benchmarks you might want to check out
 the Great Computer Language Shootout:
 
 http://www.bagley.org/~doug/shootout/
 
 It has some surprising results.
 

Yes, the results are surprising. According to these benchmarks, perl is
not that much faster than python when doing regexes. Can this be so? It
seems when I have done my own anecdotal, unscientific tests, perl was so
much faster that I decided to write my script in perl.

I would have expected perl to blow python or java or even C right out of
the water when it comes to munging text.

Paul





Re: speed and perl

2002-08-03 Thread Paul Tremblay

On Sat, Aug 03, 2002 at 10:58:24AM +0200, Paul Johnson wrote:
 
 On Fri, Aug 02, 2002 at 02:08:25AM -0400, Paul Tremblay wrote:
 
  (I know I did a little test with sed, a python script, and a perl
  script, just changing the word "the" to "teh" in a huge file. Sed and
  python took about the same time, while perl was six times faster.)
 
 This is from the perl source code (sv.c if you are interested):
 
 /* Here is some breathtakingly efficient cheating */
 
     if (rslen) {
         while (cnt > 0) {                   /* this     |  eat */
             cnt--;
             if ((*bp++ = *ptr++) == rslast) /* really   |  dust */
                 goto thats_all_folks;       /* screams  |  sed :-) */
         }
     }
     else {
         Copy(ptr, bp, cnt, char);           /* this     |  eat */
         bp += cnt;                          /* screams  |  dust */
         ptr += cnt;                         /* louder   |  sed :-) */
         cnt = 0;
     }
 
 

Is this for real, and not a joke? If so, it is pretty funny!

Paul




changing multiple flags and changing them back

2002-08-01 Thread Paul Tremblay

I have a series of flags that I need to change all at once, and then
change back, and was wondering if I could use an array or hash to do
this.

I am parsing an RTF file, and when I find a footnote, I need to preserve
the flags of the non-footnote text. So if I was in a table, I need to
save the $in_table flag. Then when I am done with the footnote text, I
need to re-set the $in_table flag to its previous state. 

So far I have this:

sub start_footnote {
    $previous_in_table = $in_table;
    ...
}

sub end_footnote {
    $in_table = $previous_in_table;
    ...
}

This works fine except I might have 15 or 20 flags I need to set or
reset. I would like to use an array like this:

@flags = ($in_table, $after_cell, $in_paragraph);

When I finish with my footnote, I will have an array of the previous
values. Now how do I assign these values to the variables?
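
With the array as written, a list assignment such as ($in_table, $after_cell, $in_paragraph) = @flags; puts the values back. A sketch of an alternative that scales better to 15 or 20 flags keeps them all in one hash and saves snapshots on a stack; the hash layout, the starting values, and the choice to clear every flag on entering a footnote are assumptions made here.

```perl
use strict;
use warnings;

# All flags live in one hash instead of separate scalars.
my %flag = (in_table => 1, after_cell => 0, in_paragraph => 1);
my @saved;    # stack of snapshots, so nesting would also work

sub start_footnote {
    push @saved, { %flag };          # copy the current state
    $flag{$_} = 0 for keys %flag;    # start the footnote with a clean slate
}

sub end_footnote {
    %flag = %{ pop @saved };         # restore everything in one step
}

start_footnote();
$flag{in_paragraph} = 1;             # footnote text changes flags freely
end_footnote();

print "$_=$flag{$_}\n" for sort keys %flag;
```

Adding a new flag then means adding one hash key, with no change to the save and restore code.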

Thanks

Paul





speed and perl

2002-08-01 Thread Paul Tremblay

This question may be too vague for a good answer, but my curiosity makes
me ask it anyway. I thought I read somewhere that perl is actually
faster than C for certain tasks. The vagueness of the question probably
lies in exactly what task, who writes the program, the size and type of
data, and a dozen other factors.

Specifically, I am writing a perl script to convert RTF to XML. I
downloaded a utility called unrtf, written in C, which converts RTF to
HTML, a somewhat easier task. I got depressed when I ran unrtf on
small files and saw that it didn't take *any* time. But when I ran it on
a big file of 1.8 megabytes, it actually took 6 minutes and 30 seconds.
My script only took a minute! (Hooray, after all this hard work, that's
a little encouraging to see!)

When I ran the same file through a java utility called majix, it took
over a minute. It is hard to say exactly, because majix supplies their
own timer, and this timer only starts when java starts processing the
documents, some 20 or 30 seconds after you launch it.

I had actually considered learning C++ to make my little script really
fast so that people would consider using it. I really doubt I would have
really gone through all that trouble, but now I am wondering if C++
would have given that much of a time advantage--if any at all.

Certainly, a perl script would be easier to maintain and debug.

Thoughts on how C, java, and perl compare on speed? 

(I know I did a little test with sed, a python script, and a perl
script, just changing the word "the" to "teh" in a huge file. Sed and
python took about the same time, while perl was six times faster.)

Paul







Re: Perl IDE's v. Perl Editors was Re: Editor

2002-07-31 Thread Paul Tremblay

I'm surprised that more posters didn't advocate vim as *the* editor.
I think it was Linux World that did a survey and found that 80 percent of
the users picked vim as their favorite editor. 

I originally started using vim, then switched to nedit, and now have
switched back to vim. Nedit is very nice, but I had problems with
keyboard commands. About half the time when I would press keys like
ctrl-z to undo a command, I would get a ^ack or something like that
on the screen.

I can see why vim is the most popular linux editor. It is extremely
powerful. It has all sorts of options for automatic indenting, as well as
a feature called folding, which I still have to learn how to use; it
basically hides lines on your screen. So if you were working between
your main program and a subroutine 1000 lines below, you could hide
those thousand lines. Of course, vim has full highlighting capabilities.
Vim offers so many options that I doubt I could learn them all.

The drawback to vim is that it is a bit hard to learn at first. It is
keyboard driven, which goes against how most people learn to operate
a computer, with a mouse. Of course, there is a full graphical interface
version of vim, which I use, called gvim. Once you get over the initial
difficulty of learning vim (which should only take a few days?), then it
can be easier to use, depending on your preferences.

Vim is also one hundred percent free--as is Nedit. I wouldn't pay for an
editor, with all the excellent free choices out there.

But nedit is also a good choice as an editor. It is very intuitive and also
powerful.

Paul





Re: So what is munging?

2002-07-29 Thread Paul Tremblay

On Mon, Jul 29, 2002 at 12:07:05PM -0400, Jeff 'japhy' Pinyan wrote:

 
 I've got an article in the July Linux Magazine ("Hitting the
 Motherlode") about regexes (mainly in Perl).  In it, I use the term
 "munge".  FOLDOC[1] says "a derogatory term meaning to imperfectly
 transform data", but I don't think it's such a bad term.
 
 For me, "munge" just means using whatever means necessary to massage data
 from one format to another -- that might mean extracting stuff, changing
 its layout, or transforming it to an entirely different format.  And I use
 "massage" here, because it's accurate:  sometimes, you get a gentle
 backrub, and sometimes, you get painful hand-chopping on your spine.

Huh, pretty funny. 

Would you make a distinction between parsing and munging? For example,
if you wrote a script to convert Word RTF to XML, is that parsing or is
that still munging? 

Paul





Re: fastest way to substitute

2002-07-29 Thread Paul Tremblay

On Sat, Jul 27, 2002 at 02:40:52PM -0700, John W. Krahn wrote:
 
   s[([&<>]|(?<=\\)$rx)][$rep{$1}]go;
 

John, this doesn't work. 

My representative line is:


\ldblquote \rdblquote & < > \par

My code is:


my %rep = qw(
&   &amp;
>   &gt;
<   &lt;
ldblquote   <lt_quote/>
rdblquote   <rt_quote/>

);  

my $rx = join "|", map quotemeta, keys %rep;

s[([&<>]|(?<=\\)$rx)][$rep{$1}]go;

And the result is:

\<lt_quote/> \<rt_quote/> &amp; &lt; &gt; \par

I'm getting the backslashes in front of my XML tags. I need 

<lt_quote/> <rt_quote/> &amp; &lt; &gt; \par

I've tried a number of different variations with no success. Perhaps I
will need two lines of substitution in my script, one for backslashed
characters, and one for characters that are not backslashed?
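One way to keep a single substitution is to give the backslashed control words and the bare special characters their own branches, and let the first branch consume the backslash. This is only a sketch: the two tables, the sub name convert, and the choice to leave the space after a control word in place are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical tables: control words map to XML elements,
# special characters map to XML entities.
my %word = (ldblquote => '<lt_quote/>', rdblquote => '<rt_quote/>');
my %char = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');

# Longest keys first so a shorter key never shadows a longer one.
my $words = join '|', map quotemeta,
            sort { length $b <=> length $a } keys %word;

sub convert {
    my ($text) = @_;
    # Branch 1 matches (and so removes) the backslash plus the control word.
    # Branch 2 matches a single special character.
    $text =~ s{\\($words)\b|([&<>])}
              { defined $1 ? $word{$1} : $char{$2} }ge;
    return $text;
}

print convert('\ldblquote \rdblquote & < > \par'), "\n";
```

Because the backslash is matched rather than only looked at, it disappears together with the control word, and unknown control words such as \par are left alone.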

Thanks

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: fastest way to substitute

2002-07-27 Thread Paul Tremblay

On Sat, Jul 27, 2002 at 11:23:59AM -0400, Jeff 'japhy' Pinyan wrote:
 
 That's because my method requires creating a hash each time.  If you were
 to take that out of the function, it would run faster.

Right. This makes perfect sense.

I  followed your advice and put the hash outside of the subroutine. I
also took out the ampersand substitution to make things equal.

I got some interesting results. If I used a long line full of tokens,
then each method was as fast. But if I used a more representative line with
just a few tokens, then your method was around twice as fast. 

That makes sense. The read_each_line method has to read the non-tokens
35 times (once for each substitution). Your method gets to skip over
them.

If I used a line with no tokens, your method ran 10 times faster. That's
why your suggestion above:

'\\this ' => '<this/>',

might not be a good idea. If I wrote my hash like this, then as perl
searches the line, it has to search each item in the hash. Your
original suggestion looked like this:

 $l =~ s[\\($rx) ][<$rep{$1}/>]go;

With this method, perl stops searching if it doesn't find a \.
That means it can skip over lines with no \, which in turn means that
it will run around 10 times faster than using my original method.

The only problem is how I should replace &, <, and >. I think I'll
do single line subs for this text. Even with huge files it shouldn't
take more than 1/2 a second or so, and that allows me to use your
original method to speed things up.

Or this just occurred to me:

s[(&)|(<)|(>)|\\($rx)][<$rep{$1}/>]go;

Yea, that should work!

I also realize that I shouldn't initialize my hashes in my subroutines.
Making them global should also speed up my script a bit.

Thanks!

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: fastest way to substitute

2002-07-27 Thread Paul Tremblay

On Sat, Jul 27, 2002 at 02:40:52PM -0700, John W. Krahn wrote:

 
 I thought '<' and '/>' were already in the hash values?  If so, wouldn't
 this work?
 
   s[([&<>]|(?<=\\)$rx)][$rep{$1}]go;
 
 
 

I don't understand this syntax:


   s[([&<>]|(?<=\\)$rx)][$rep{$1}]go;
            ^^^^

Is that another way of telling the regex that you don't want to save the
value?
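For reference, the construct in question is a lookbehind assertion rather than a non-capturing group: it checks what comes before the match without making it part of the match. A small sketch with a made-up sample string shows the difference:

```perl
use strict;
use warnings;

my $consumed = my $kept = '\tab text';

# The backslash is part of the match, so it is replaced too.
$consumed =~ s/\\tab/<tab\/>/;

# (?<=\\) only asserts that a backslash precedes "tab";
# the backslash is not part of the match and survives.
$kept =~ s/(?<=\\)tab/<tab\/>/;

print "$consumed\n";   # <tab/> text
print "$kept\n";       # \<tab/> text
```

A non-capturing group is written (?:...) and does consume what it matches; it just does not save it in $1, $2, and so on.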

If my hash looks like this:

my %rep = qw(
ldblquote   <rt_quote/>
rdblquote   <lt_quote/>
&   &amp;
>   &gt;
<   &lt;
.
);

Then your solution should work. I planned to change my hash, but I guess I
got ahead of myself in my email!


Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: fastest way to substitute

2002-07-26 Thread Paul Tremblay

On Fri, Jul 26, 2002 at 02:02:54PM -0400, Jeff 'japhy' Pinyan wrote:
 
 It's best to come up with a hash of strings and replacements:
 
   my %rep = qw(
 ldblquote rt_quote
 rdblquote lt_quote
 emdash    em_dash
 rquote    r_quote
 tab       tab
 lquote    l_quote
   );
 
 Then create a regex:
 
   my $rx = join "|", map quotemeta, keys %rep;
 
 Then use it in a larger regex:
 
   $source =~ s[\\($rx) ][<$rep{$1}/>]g;
 
 Ta da!  ONLY one pass through the string. 

This looks really nice! I'll have to test it with a timer. I'd imagine
it would be much faster because you only make one pass through. On
the other hand, doesn't perl have to recompile the $rx each time because
it is a variable? After all, $rx might have changed--though in my case,
it definitely wouldn't have.
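On the recompilation question, one common way to make the intent explicit is to compile the pattern once with qr// and reuse the compiled object. The sketch below assumes a three-entry table and a hypothetical sub name convert_line. As far as I understand, perl only recompiles an interpolated pattern when the interpolated string has changed, so the gain may be small, but qr// removes the doubt.

```perl
use strict;
use warnings;

my %rep = (ldblquote => 'rt_quote', rdblquote => 'lt_quote', emdash => 'em_dash');
my $alternatives = join '|', map quotemeta, keys %rep;

# qr// compiles the pattern here, once; the compiled object is reused below.
my $control_word = qr/\\($alternatives) /;

sub convert_line {
    my ($line) = @_;
    $line =~ s/$control_word/<$rep{$1}\/>/g;
    return $line;
}

print convert_line('\ldblquote Hi\emdash there'), "\n";
```

Keeping the hash and the compiled pattern outside the sub also means neither is rebuilt on every call.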

 You'll need to beef up the hash
 and the regex as needed, if not everything is '\\IN ' and not every
 replacement is '<OUT/>'.

As a matter of fact, the expressions take only two forms:

\emdash Regular text
\'9oeRegular text

Some of the expressions (the ones for foreign characters) don't have a
space after the control word. So I think:


 $source =~ s[\\($rx)(?:\s)*][<$rep{$1}/>]g;

 Should work?

On another note, my script is 1100 lines long, and seems to work.
It seems like there is a need for converting RTF to XML, since the perl
converters available only convert to HTML. 

I would like to release the script at some point, but when I get tips
off this site, I realize how much better an experienced perl programmer
could do things. It would be much more effective to work on this as part
of a team, but I've never done something like this before. I guess I'll
post feelers on other mailing lists.

(This really should be another thread!)

Thanks!

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: time comparison

2002-07-26 Thread Paul Tremblay

On Fri, Jul 26, 2002 at 06:51:08AM -0700, lz wrote:
 
 I went to search.cpan.org trying to look for utility
 that will return current GMT time, and couldn't find
 any.
 

According to *Perl Cookbook*:

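use Time::gmtime;
$tm = gmtime;
$seconds = $tm->sec;

The last line is just an example of reading one field, the seconds (0-59)
of the current GMT time. For the seconds that have passed since 1970, a
useful value when you are trying to figure out if a file is older than a
certain date, call the built-in time function instead.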

But I believe you should be able to use any time functions with the
gmtime module that you can with the localtime module.
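A version that needs only core perl might look like the sketch below. The timestamp format and the helper name older_than are choices made for this example, not something from the Cookbook.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

my $now   = time;              # seconds since the epoch (1970-01-01 00:00:00 UTC)
my @parts = gmtime($now);      # (sec, min, hour, mday, mon, year, ...) in GMT
my $stamp = strftime('%Y-%m-%d %H:%M:%S', @parts);

print "epoch seconds: $now\n";
print "GMT time:      $stamp\n";

# Hypothetical helper: is a file older than the given number of seconds?
sub older_than {
    my ($file, $seconds) = @_;
    my $mtime = (stat $file)[9];          # modification time, epoch seconds
    return defined $mtime && ($now - $mtime) > $seconds;
}
```

The built-in gmtime takes epoch seconds and returns the broken-down time in GMT, so comparing file ages can stay in plain epoch seconds throughout.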

Hope that helps

Paul
-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: fastest way to substitute

2002-07-26 Thread Paul Tremblay

On Fri, Jul 26, 2002 at 02:02:54PM -0400, Jeff 'japhy' Pinyan wrote:

Jeff:

I ran a benchmark on your method, and it actually proved slower. I ran a
test line 10,000 times. Directly substituting each line took 38 wall
seconds. Using your method took 60. 

I included my test below, in case I made a mistake.

Thanks

Paul

##

#!/usr/bin/perl #-w
use strict;
use Benchmark;
my $loopcount=10_000;
##my $loopcount = 1;

my $line = "\\tab Paul & Tom said \\ldblquote we are brothers \\rdblquote \\emdash 
which was the truth. \\e4 \\e5 \\e6 text text \\e7 text \\e8 text \\e9 text 
text text \\e10 text \\e11 text \\e12 text \\e13 text \\e14 text text \\e15  text 
\\e16 text \\e17 text \\e18 text \\e19 text \\e20 text e\\21 text \\e21 
text \\e22 text \\e23 text \\e24 text \\e25 text \\e26 text \\e27 text \\e28 
text  \\e29 text \\e30 ";

sub each_line{
$_ = $line;
 
s/&/&amp;/g;
s/</&lt;/g;
s/>/&gt;/g;
s/\\ldblquote /<rt_quote\/>/g;
s/\\rdblquote /<lt_quote\/>/g;
s/\\emdash /<em_dash\/>/g;
s/\\rquote /<r_quote\/>/g;
s/\\tab /<tab\/>/g;
s/\\lquote /<l_quote\/>/g;
s/\\e4 /<e4\/>/g;
s/\\e5 /<e5\/>/g;
s/\\e6 /<e6\/>/g;
s/\\e7 /<e7\/>/g;
s/\\e8 /<e8\/>/g;
s/\\e9 /<e9\/>/g;
s/\\e10 /<e10\/>/g;
s/\\e11 /<e11\/>/g;
s/\\e12 /<e12\/>/g;
s/\\e13 /<e13\/>/g;
s/\\e14 /<e14\/>/g;
s/\\e15 /<e15\/>/g;
s/\\e16 /<e16\/>/g;
s/\\e17 /<e17\/>/g;
s/\\e18 /<e18\/>/g;
s/\\e19 /<e19\/>/g;
s/\\e20 /<e20\/>/g;
s/\\e21 /<e21\/>/g;
s/\\e22 /<e22\/>/g;
s/\\e23 /<e23\/>/g;
s/\\e24 /<e24\/>/g;
s/\\e25 /<e25\/>/g;
s/\\e26 /<e26\/>/g;
s/\\e27 /<e27\/>/g;
s/\\e28 /<e28\/>/g;
s/\\e29 /<e29\/>/g;
s/\\e30 /<e30\/>/g;
##print $_;

}
sub hash_method{
my $line = $line;
my %rep = qw(
ldblquote   rt_quote
rdblquote   lt_quote
emdash  em_dash   
rquote  r_quote
tab tab
lquote  l_quote
&   &amp;
<   &lt;
>   &gt;
e4  e4
e5  e5
e6  e6
e7  e7
e8  e8
e9  e9
e10 e10
e11 e11
e12 e12
e13 e13
e14 e14
e15 e15
e16 e16
e17 e17
e18 e18
e19 e19
e20 e20
e21 e21
e22 e22
e23 e23
e24 e24
e25 e25
e26 e26
e27 e27
e28 e28
e29 e29
e30 e30
);


  my $rx = join "|", map quotemeta, keys %rep;


  $line =~ s[\\($rx) ][<$rep{$1}/>]go;
  ##print "$line\n";

 }




#--
# the main loop section

timethese $loopcount, {
each_line => \&each_line,
hash_method => \&hash_method,

};



# end of the world as I knew it [EMAIL PROTECTED] all rights reserved

###
 
 On Jul 26, Paul Tremblay said:
 
 Is there a quicker way to substitute an item in a line than reading the
 line in each time?
 
 I am writing a script to convert RTF to XML. One part of the script
 involves simple substitution, like this:
 
 s/\\ldblquote /<rt_quote\/>/g;
 s/\\rdblquote /<lt_quote\/>/g;
 s/\\emdash /<em_dash\/>/g;
 s/\\rquote /<r_quote\/>/g;
 s/\\tab /<tab\/>/g;
 s/\\lquote /<l_quote\/>/g;
 
 It's best to come up with a hash of strings and replacements:
 
   my %rep = qw(
 ldblquote rt_quote
 rdblquote lt_quote
 emdash    em_dash
 rquote    r_quote
 tab       tab
 lquote    l_quote
   );
 
 Then create a regex:
 
   my $rx = join "|", map quotemeta, keys %rep;
 
 Then use it in a larger regex:
 
   $source =~ s[\\($rx) ][<$rep{$1}/>]g;
 
 Ta da!  ONLY one pass through the string.  You'll need to beef up the hash
 and the regex as needed, if not everything is '\\IN ' and not every
 replacement is '<OUT/>'.
 
 -- 
 Jeff japhy Pinyan  [EMAIL PROTECTED]  http://www.pobox.com/~japhy/
 RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
 ** Look for Regular Expressions in Perl published by Manning, in 2002 **
 stu what does y/// stand for?  tenderpuss why, yansliterate of course.
 [  I'm looking for programming work.  If you like my work, let me know.  ]

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






create anonymous hash here?

2002-07-22 Thread Paul Tremblay

I have a small array that looks like this:

@array = qw(rtf1_deff fonttbl s33_up7);

I generate this array from parsing an rtf file. I have to process each
element in the array. I have to separate the letters from the numbers. So
the first element would break down to rtf and 1. Based on the first
part of the element (the rtf part), I have to determine a set of
actions. 

Later on in the program, I need to once again process each element in
the array, and that means separating the letters from the numbers. It
seems inefficient to do this twice.

Should I create a hash within my array? In other words, each element
in the array will have one or more values attached to it:

rtf1_deff: ignore, 1 (ignore this, its number is one);
fonttbl: ignore
s33_up7: print, 33

I'm not sure how to create this data structure. The *Cookbook* shows you
how to create hashes within hashes, but not hashes within arrays. (Or
do I want to create an array within an array?)
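An array of anonymous hashes is one way to hold this. The sketch below parses each element once and stores the pieces; the rule table %action_for and the field names are invented for the example.

```perl
use strict;
use warnings;

my @array = qw(rtf1_deff fonttbl s33_up7);

# Hypothetical rule table: which action goes with which control word.
my %action_for = (rtf => 'ignore', fonttbl => 'ignore', s => 'print');

# Parse each element once and keep the result as an anonymous hash.
my @parsed = map {
    my ($word, $number) = /^([a-z]+)(\d*)/;
    +{
        raw    => $_,
        word   => $word,
        number => (length $number ? $number : undef),
        action => $action_for{$word} // 'ignore',
    };
} @array;

for my $item (@parsed) {
    printf "%-10s %-7s %s\n",
        $item->{raw}, $item->{action}, $item->{number} // '-';
}
```

Later passes can read $item->{word} and $item->{number} directly instead of splitting the string again. The leading + tells perl that the braces build a hash reference rather than open a block.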

My second question is whether I should bother. I would rather just
process the elements twice by sending them to a subroutine, but I don't
want to unnecessarily slow down my script.

Thanks!

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: script too slow?

2002-07-15 Thread Paul Tremblay

On Mon, Jul 15, 2002 at 10:26:25AM +0200, Janek Schleicher wrote:

 
 To increase speed, we can make also a lookahead statement:
 
 my @tokens = split / ( \\ (?=\S)# there's never a whitespace
   (?: [^\s{}]+  |
   [^\s\\}]+ |
   [\\}]   )
| }
  )/x
  => $line;
 

Can you explain the lookahead statement to me, or better yet,
point me to some good documentation? It is not explained in *Perl
Cookbook.*
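For what it is worth, lookahead is described in the perlre manual page under lookaround assertions. A lookahead such as (?=\S) checks that the next character is a non-whitespace character but does not consume it. A small sketch with a made-up sample line:

```perl
use strict;
use warnings;

my $line = '\pard\plain text \ end';

# Offsets of backslashes that are followed by a non-whitespace character.
my @starts;
while ($line =~ /\\(?=\S)/g) {
    push @starts, $-[0];       # $-[0] is where the match began
}
print "@starts\n";             # 0 5

# Because the lookahead consumes nothing, the following characters are
# still available for the rest of the pattern to match.
my @words = $line =~ /\\(?=\S)([a-z]+)/g;
print "@words\n";              # pard plain
```

The backslash near the end of the sample is followed by a space, so the assertion fails there and that position is skipped.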

As John pointed out, your solution leaves off the leading open
bracket. However, it is about twice as quick as the previous
solution. Your solution tokenizes this line:

{\i italics}

This way:

'{'
'\i'
' italics'

Actually, I can deal with this. I would just have to set a flag
in my program. If the preceding token was '{' then the next
token is part of an opening group.

However, your solution brings up another problem. It does not
separate true brackets (which convey formatting information)
from escaped brackets (those in the text). Likewise, no
distinction is made between true back slashes and escaped back
slashes.

Here is a very typical line, and the tokens it should be split
into.

\pard\plain \{All of this text {\i italicized words} is \{\}
\}\{between brackets\} \\escaped_back_slash\par

'\pard'
''
'\plain'
' '
'\{'
'All of this text '
'{\i'
' italicized words'
'}'
' is '
'\{'
''
'\}'
' '
'\}'
''
'\{between'
' brackets'
'\}'
''
'\\'
'\\escaped_back_slash'
'\par'
' 
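One way to keep escaped characters apart from real delimiters is to list the escaped forms first in the alternation, so they are recognized before a lone brace or backslash can match. This is only a sketch: the sub name tokenize and the four token classes are assumptions, and unlike the list above it emits '{' and the control word after it as separate tokens and produces no empty tokens.

```perl
use strict;
use warnings;

sub tokenize {
    my ($line) = @_;
    # In list context, a match with the g flag returns every match in order.
    return $line =~ m/
          \\[\\{}]            # escaped backslash or brace: literal text
        | \\[a-zA-Z]+-?\d*    # control word with optional numeric argument
        | [{}]                # real group delimiters
        | [^\\{}]+            # run of plain text
    /gx;
}

my @tokens = tokenize('\pard\plain \{text {\i words}\} \\\\x\par');
print "'$_'\n" for @tokens;
```

With this ordering '\{' and '\\' come out as their own tokens, so the later stages can treat them as text, while a bare '{' or '}' always marks a group.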

Thanks

Paul


-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: script too slow?

2002-07-14 Thread Paul Tremblay

On Sun, Jul 14, 2002 at 04:45:19AM -0700, John W. Krahn wrote:
 So your split could be simplified to:
 
 my @tokens = split /({\\[^\s}{]+|\\[^\s\\}]+|\\[\\}]|})/, $line;
 
 

Ah, that cuts the tokenize process in half. My entire script now
takes 40 seconds to run instead of 50. 

Thanks!

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






script too slow?

2002-07-13 Thread Paul Tremblay

I just finished my first version of a script that converts rtf to
xml and was wondering if I went about writing it the wrong way.

My method was to read in one line at a time and split the lines
into tokens, and then to read one token at a time. I used this
line to split  up the text:
 
@tokens = split(/({\\[^\s\n\}{]+)|(\\[^\s\n\\}]+)|()|(})|(\\})/,$line);

Splitting up the text on my test file of 1.8 megabytes took 25
seconds. The entire script took 50 seconds. 

I had written a previous uncompleted version in which I relied on
regular expressions rather than tokens, and this script took only
10 seconds to run. I gave up on this method because it seemed
there would always be an exception that would require another
regexp.

So why does splitting a text into tokens take so long? Has
anybody done something similar to what I am trying, and do you
have any advice? 

The good news is that relatively speaking, perl is very, very
fast. I tried a similar script in python using a lexer called
plex, and the 1.8 megabyte file took 12 minutes to parse!

In case you are wondering why I'm seemingly obsessed with speed,
I would like to make this script available to anyone. Right now
the only free utility for converting rtf to xml is a java
utility called majix, which deletes your footnotes and only allows
for 9 user-defined styles. If my perl script is too slow, it won't be
very useful.

Thanks

Paul


-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: script too slow?

2002-07-13 Thread Paul Tremblay

On Sat, Jul 13, 2002 at 08:08:50PM -0700, Marco Antonio Valenzuela Escárcega wrote:
 Subject: Re: script too slow?

 
 maybe you should check this out:
 http://search.cpan.org/search?dist=RTF-Tokenizer
 http://search.cpan.org/search?dist=RTF-Parser
 


These modules could be exactly what I need. However, since the
documentation is so minimal, I don't know how to use them. I have
a feeling that the second module converts to html and not xml? 

Is the first module just a tokenizer? Splitting the document into
tokens seems to be the easiest part of the job. 

Thanks

Paul
-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: script too slow?

2002-07-13 Thread Paul Tremblay

On Sat, Jul 13, 2002 at 10:57:04PM -0400, Tanton Gibbs wrote:

 
 I'm not exactly sure what the problems are; however, here are a couple of
 things to try
 1.) If you don't need to save the value of each of the subexpressions, then
 tell perl so by using ?: after each opening paren.

Once I tokenize the text, I don't use regular expressions at all.

Here is an example of an rtf line:


\pard \s1\fi720 \ldblquote Big guy, I didn\rquote t expect to see you so early,
\rdblquote  Joe said. \par

Here are the tokens:

'\pard':'\s1':'\fi720':'\ldblquote':' Big guy, I didn':
'\rquote':'t expect to see you so early,':'\rdblquote':
'  Joe said. ':'\par'

Each of the escaped sequences represents some type of info that I
have to decide what to do with. I use the substr function to
determine the nature of the token.

Actually, I have simplified the list of tokens. My split
function actually produced 31 empty ('') tokens for this one
line. So perl is doing a lot of searching.


 2.) Usually alternation is much slower than doing separate
 regexes...however, in your case separating the regexes is seemingly
 impossible.

I'm not sure what alternation is. But now I am thinking that
regexes are really not at all impossible. Perhaps they require a
little more thought. That's not what stopped me from using them.
I thought that as I encountered more complex rtf, with
different (and insidious) versions of word, I would have to
tweak my code so much that I wouldn't be able to maintain it.
However, on second thought, I don't think the problem is that
complicated.

Let's take a look at the line above. It starts with \pard; this
means start a paragraph with a new style. The style names are
stored in the escaped sequences afterwards. So this style name is
\s1 (style 1), followed by \fi720. The fi means first indent, here by 36 pts.
There are a zillion other tokens, all of which I don't
understand. What I need to know is when the text starts. I could
just look for non-escaped text. But the '\ldblquote' actually
marks the start of the text because it means left quote. 

You can start to see some of the complexities and why I thought
it better to handle one token at a time. However, I was just
playing around with perl. I substituted every instance of
\ldblquote and 4 other control sequences (right quote, em-dash,
tab, and right curly). That only took 4 seconds for a 1.8
megabyte document.

So I am thinking of doing the simple substitutions first, and
then proceeding. For example, if I substitute

s/\\ldblquote/<lft_quote\/>/g;
s/\\rdblquote/<rt_quote\/>/g;

then my line looks like this:

\pard \s1\fi720 <lft_quote/> Big guy, I didn\rquote t expect to see you so early,
<rt_quote/>  Joe said. \par 

now I can substitute:

# pard, followed by a space, followed by any character that is
# not a backslash
s/\\pard(.*?)\s[^\\]/<para style=\"$1\">/;


The most difficult part will be dealing with footnotes. They look
something like this:

{\footnote \pard \fi720 {\i italics word} text {\b bold words}}

This line contains a nested structure, and I have to determine
when it ends, because the paragraph styles are independent of the
styles in the main body. For this I will have to use //g as you
suggested, and keep counting the open and closed brackets until
they equal zero.
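The bracket counting could be sketched like this. The sub name group_at is made up, and the code assumes that escaped braces and backslashes (\{, \}, \\) should never count as delimiters.

```perl
use strict;
use warnings;

# Return the text of the group that opens at offset $start,
# or nothing if the group never closes.
sub group_at {
    my ($text, $start) = @_;
    my $depth = 0;
    pos($text) = $start;                      # begin scanning at the opening brace
    while ($text =~ /(\\[\\{}])|([{}])/g) {
        next if defined $1;                   # escaped character, not a delimiter
        $depth += $2 eq '{' ? 1 : -1;
        return substr($text, $start, pos($text) - $start) if $depth == 0;
    }
    return;
}

my $rtf = 'text {\footnote \pard \fi720 {\i italics word} text {\b bold words}} more';
print group_at($rtf, index($rtf, '{\footnote')), "\n";
```

The loop adds one for every real opening brace and subtracts one for every real closing brace; the group ends where the count returns to zero.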

One last note on why I think I can change my strategy. an rtf
line can look like this:

\pard He was reading {
\i The Sun Also Rises} when he heard the dog bark.\par

This line should look like 

\pard He was reading {\i The Sun Also Rises} ...

In other words, rtf is so screwed up that it even splits tokens
across lines. However, I just read the Perl Cookbook and realize
I can do this:

$/ = "\\par";   # input record separator: read up to each \par
# read in each record
s/\n//g;        # get rid of line endings. This will work. The only
                # line ending should come at the \par delimiter
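One caveat with a \par record separator: the separator is a plain string, not a pattern, so it would also cut in the middle of \pard. A sketch that sidesteps this, assuming the whole file fits in memory (the sub name paragraphs is invented here), splits on the position after \par only when no letter follows:

```perl
use strict;
use warnings;

sub paragraphs {
    my ($rtf) = @_;
    # Split after each \par that is not the start of a longer
    # control word such as \pard, swallowing any whitespace that follows.
    my @paragraphs = split /(?<=\\par)(?![a-zA-Z])\s*/, $rtf;
    tr/\n//d for @paragraphs;      # remove line endings inside each paragraph
    return @paragraphs;
}

my $rtf = "\\pard He was reading {\n\\i The Sun Also Rises} when the dog barked.\\par\n"
        . "\\pard Next.\\par\n";

print "$_\n" for paragraphs($rtf);
```

Each returned element is one paragraph on a single line, ending in \par.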

Also, rtf does this 

\pard {\i The Sun Also Rises \par
}

I have to read the whole file in and switch it so it reads:

\pard {\i The Sun Also Rises} \par

I tried this on my big document, and it took only 4 tenths of a
second.

In sum, I am thinking that the regexes are so super fast in perl
that if I choose carefully what to substitute first, I can parse
my document much faster.

Thanks!


-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: fastest regexp: split or (.*)?

2002-07-12 Thread Paul Tremblay

On Thu, Jul 11, 2002 at 09:17:18PM -0700, drieux wrote:
 http://www.wetware.com/drieux/pbl/Other/BenchMarks/split_v.re_for_RTF.txt
 
 

Thanks. It appears splitting is quickest. What is bizarre about your link is that the 
code appears *almost identical* to what I had written. I mean, I asked a question with 
an example line, and the same example line was used; I chose two variables, and the 
same two variables were used. I kept having to tell myself, "No, they didn't use the code 
you posted a few hours ago."

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






fastest regexp: split or (.*)?

2002-07-11 Thread Paul Tremblay

I am writing a script to parse rtf into xml and have a number of
questions about regular expressions.

The first is:

I have this line:

\sbknone\linemod0\linex0\cols1\endnhere \pard\plain text\par

I want to split it at the expression \endnhere. I can either
use 

@array = split(/endnhere/, $line, 2);
$first_part_of_line = $array[0];
$second_part_of_line = $array[1];

Or:

$line =~ /(.*?)endnhere(.*)/;
$first_part_of_line = $1;
$second_part_of_line = $2;

Which is quickest? I am using this method repeatedly in my
script, so I wanted the quickest method.
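Both approaches can be written so that they give the same two pieces, which makes them easy to compare. This is a sketch using the line from the question; the variable names are made up, and the pattern includes the backslash of \endnhere so that it does not stay behind at the end of the first part.

```perl
use strict;
use warnings;

my $line = '\sbknone\linemod0\linex0\cols1\endnhere \pard\plain text\par';

# split takes the string as its second argument and the limit as its third.
my ($first, $second) = split /\\endnhere/, $line, 2;

# The same cut with one match and two captures.
my ($first_re, $second_re) = $line =~ /^(.*?)\\endnhere(.*)$/s;

print "first:  $first\n";
print "second: $second\n";
print "same result\n" if $first eq $first_re && $second eq $second_re;
```

To measure which is quicker on real data, the core Benchmark module can time the two statements against each other.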

I also have come across the \G anchor. For example:

$line =~ /(.*?)endnhere/g;
($rest_of_line) = $line =~ /\G(.*)/;

Is it advisable to use this anchor? I know that the $` and $' are
discouraged because they slow down a script (at least according to
*Perl Cookbook*). How about these anchors? They seem like they
would be very useful.

Thanks!

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: converting html to text

2002-04-05 Thread Paul Tremblay

On Fri, Apr 05, 2002 at 05:15:08AM -0800, drieux wrote:

 
 ### #!/usr/bin/perl
 ###
 ### use HTML::Parser;
 ### use HTML::FormatText;
 ### use HTML::TreeBuilder;
 ###
 ### my $html_text;
 ### my $filename = $ARGV[0];
 ### open(FH, $filename) or die "unable to open file $filename :$!\n";
 ### while (<FH>) { $html_text .= $_ ; }
 ### ###my $plain_text = HTML::FormatText->new->format(parse_html($html_text));
 ### my $tree = HTML::TreeBuilder->new->parse($html_text);
 ### my $plain_text = HTML::FormatText->new->format($tree);
 ###
 ### print "$plain_text\n";
 ###

I tried this code, and it did not work.

I also tried this code:

use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse_file("/tmp/cleanup");

use HTML::FormatText;
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 70);
#print $formatter->format($tree);
my $ascii = $formatter->format($tree);

It also did not work.

The problem is that the filter deletes all of my text and outputs this:

[TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT
SHOWN][TABLE NOT SHOWN]

I have tried it on five different files. All of these files were
from the same website. It appears that this module is broken.
That is, it can't handle certain html (which is valid when looked
at in a browser). 

I think I'm ready to try other filters (those not in perl).

Paul
 


-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






converting html to text

2002-04-04 Thread Paul Tremblay

I spent several hours last night trying to convert an html file
to text, so that I could include it in an email.

Someone from a mailing list sent me a simple perl script, which
worked for my purpose.

However, this script simply eliminates tables and lists.

I am wondering if there isn't a CPAN module already written.
Converting html to text seems like such a common task, that there
ought to be some robust scripts out there. Interestingly enough,
I found many scripts to convert html to rtf and LaTeX and every
other format, but not plain old text!

Paul 

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: converting html to text

2002-04-04 Thread Paul Tremblay

On Thu, Apr 04, 2002 at 10:36:36AM -0800, Agustin Rivera wrote:

 
 Are you looking to keep the basic formatting of the HTML intact during the
 conversion, or just want the HTML stripped?  I wouldn't imagine that it
 would be that hard to convert the HTML to text if the HTML wasn't overly
 complicated.
 

I am just trying to strip the tags and preserve formatting. Of
course, you can do a quick hack (s/<.*?>//g;#etc), but that still
leaves you with messy line breaks. I've already tried saving the
file as text from both Netscape and Lynx, and they did a pretty
poor job.

A good script might put a * before list items, might
approximate tables, etc. 

No sense in hacking something together when someone probably
wrote a powerful version. 

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






Re: anonymous subroutine problem

2002-03-29 Thread Paul Tremblay

On Wed, Mar 27, 2002 at 09:05:02PM -0800, bob ackerman wrote:

 i copied the code as is, and got no error.
 are you sure  line #29 is where you are calling $_incr_count-()?
 are you sure the code you are executing is what you posted?
 

No, apparently this is not the code that gives me the error
message!

Below is the code that gives me the same error message.

#!/usr/bin/perl -w 

my $cd = CD::Music->new("Canon in D", "Pachelbel",
"Boering Mussak GmbH", 
"1729-67836847-1",
1,
8,8,
5.0);





# How many CDs in the entire collection?
print "The number of CDs is ", CD::Music->get_count, "\n";

package CD::Music;
use strict;

{
my $_count = 0;
sub get_count   {$_count}
my $_incr_count = sub {++$_count};#create an anonymous subroutine



sub new{
$_incr_count->();#call anonymous subroutine

#
# Instead I could put this code?:
# my $_incr_count = sub {++$_count};
#$_incr_count->();
###

$_incr_count->();   #call the anonymous subroutine
#Gives me an error message shown below

my ($class) = @_; 
bless {
_name   =>  $_[1],  #second value passed to subroutine
_artist =>  $_[2],  #third value passed to subroutine
_publisher =>   $_[3],  #and so on
_ISBN   =>  $_[4],
_tracks =>  $_[5],
_room   =>  $_[6],
_shelf  =>  $_[7],
_rating =>  $_[8],
}, $class;
}
}



#ERROR MESSAGE WITH 'use strict' COMMENTED:
#
#Use of uninitialized value in subroutine entry at 
#/home/paul/bin/test16.pl line 27.
#Undefined subroutine main:: called at /home/paul/bin/test16.pl 
#line 27.



#ERROR MESSAGE WITH 'use strict'
#Use of uninitialized value in subroutine entry at 
#/home/paul/bin/test16.pl line 27.
#Can't use string ("") as a subroutine ref while "strict refs" 
#in use at /home/paul/bin/test16.pl line 27.



Again, I believe that the reference to the anonymous subroutine
expires once I start the 'new' subroutine. If I put the anonymous
subroutine within the 'new' subroutine block, then the script
executes correctly, and I still make sure that you can't
increment the value of the number of CDs directly.
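Another possible explanation, offered as a reading of the code rather than a confirmed diagnosis: named subroutines such as new are defined when the file is compiled, but the assignment to $_incr_count only happens when execution reaches that line. Since the call to CD::Music->new sits above the package block, it would run while $_incr_count is still undefined. The sketch below wraps the block in BEGIN so the assignments run at compile time; the shortened argument list is a simplification made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $cd = CD::Music->new('Canon in D', 'Pachelbel');
print 'The number of CDs is ', CD::Music->get_count, "\n";

package CD::Music;

# BEGIN runs the block as soon as it has been compiled, so the
# assignments below happen before the main code above starts.
BEGIN {
    my $_count = 0;
    sub get_count { $_count }
    my $_incr_count = sub { ++$_count };    # private: no name to call it by

    sub new {
        my ($class, $name, $artist) = @_;
        $_incr_count->();
        return bless { _name => $name, _artist => $artist }, $class;
    }
}
```

Moving the package block above the main code, or putting it in its own .pm file loaded with use, should have the same effect, and the counter still cannot be incremented from outside the block.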

Thanks!

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




path for personal library

2002-03-27 Thread Paul Tremblay

I am trying to set up a library for my own modules. According to
*Object Oriented Perl* (Conway), I should do the following:

PERL5LIB=${PERL5LIB}:/home/paul/perl5
export PERL5LIB

However, this doesn't work for me. If I put a module in the
directory /home/paul/perl5, then my perl scripts tell me it
can't find the module. If I use

use lib "/home/paul/perl5";

Then I don't get an error message.

How do I set a permanent path for my own personal library?
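One thing that may be worth checking, as a guess rather than a diagnosis: the two shell lines only affect shells that have read them, so they usually need to live in a startup file such as ~/.profile and the shell has to be restarted. A small diagnostic sketch (the sub name in_inc is made up) shows what perl itself sees:

```perl
use strict;
use warnings;

# True if $dir is one of the directories perl searches for modules.
sub in_inc {
    my ($dir) = @_;
    return scalar grep { !ref($_) && $_ eq $dir } @INC;
}

my $dir = '/home/paul/perl5';   # directory from the question

print 'PERL5LIB: ', (defined $ENV{PERL5LIB} ? $ENV{PERL5LIB} : '(not set)'), "\n";
print "$dir is ", (in_inc($dir) ? '' : 'not '), "in \@INC\n";
```

If PERL5LIB shows up as not set when the script runs, the export never reached the environment the script was started from.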

Thanks

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






anonymous subroutine problem

2002-03-27 Thread Paul Tremblay

I am copying the code below directly from *Object Oriented
Perl* by Conway. (Error message is below.)


package CD::Music;
#use strict; # turn this off for testing purposes

{

my $_count = 0;
sub get_count   {$_count}
my $_incr_count = sub {++$_count};#create an anonymous subroutine
$_incr_count->();# it works if I call it here

sub new{
$_incr_count->();#call the anonymous subroutine
#this line above causes the error below


#Isn't this where the anonymous subroutine should go?
#my $_incr_count = sub {++$_count};


my ($class) = @_; 
bless {
_name   =>  $_[1],  #second value passed to subroutine
_artist =>  $_[2],  #third value passed to subroutine
_publisher =>   $_[3],  #and so on
_ISBN   =>  $_[4],
_tracks =>  $_[5],
_room   =>  $_[6],
_shelf  =>  $_[7],
_rating =>  $_[8],
}, $class;
}
}

I am getting this error message:

/home/paul/bin/test16.pl
Use of uninitialized value in subroutine entry at /home/paul/bin/test16.pl 
line 29.
Undefined subroutine main:: called at /home/paul/bin/test16.pl line 29.

I am nearly positive that this error message results because the 
reference to the anonymous subroutine expires once I start the 
subroutine new.

Could a more experienced user confirm this?

I believe that the anonymous subroutine (and the other code
that initializes the counter) should go in the new subroutine
block. The idea is to make sure you can't access the subroutine
directly. This happens as long as I make an anonymous
subroutine--no matter where I put it.

Thanks! 

PS: Typos in computer books can really cause you to lose your
mind. You keep thinking that you typed something wrong!

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*






looking for book on object oriented perl

2002-03-18 Thread Paul Tremblay

I finished *Learning Perl*(The O'Reilly book), and thought it one of the best books 
relating to a linux subject I have yet read. 

(I know that many people on this mailing list probably use Windows. If you are a linux 
user and have had to suffer through some of the awful documentation on various aspects 
of linux, then you will think books like *Learning Perl* a masterpiece!)

However, I now believe I should learn at least some of the concepts of object oriented 
perl, especially since I want to use modules like XML::Parser.

My search on amazon.com came up with a book called *Object Oriented Perl,* written by 
Damian Conway and others. Amazon.com lets regular people review the book, and almost 
every one of these reviewers raved about the book. It was the best book they had ever 
read, they wished they had read it a long time ago. 

However, I believe I came upon a portion of this book on line. This is the url:

http://www.google.com/search?q=cache:NiHGVcWGmw8C:www.csse.monash.edu.au/~damian/papers/PDF/cyberdigest.pdf+object+oriented+perl&hl=en

(1) Could someone tell me if this in fact is from the same book?

From what I read on this site, I was not too impressed with the book at all. It 
seemed to go on forever explaining theory without giving any concrete examples with 
perl code. My second question is:

(2) Does this book get better in chapters 2 and 3? Do people on this mailing list 
recommend it?

Keep in mind that I am a beginner with no previous experience in any object oriented 
language.

Thanks!

Paul

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: looking for book on object oriented perl

2002-03-18 Thread Paul Tremblay

Thanks. Looks like I'll give it a try. I came across an excellent
tutorial: 

http://www.extropia.com/tutorials/perl5/oop.html

I recommend this as a place to start for any beginner.

I have the Cookbook, and although it is excellent in most parts,
its treatment of oop is confusing, unless you are already
grounded in oop. 

I understand what you mean when you say that you have to have
some theory before tackling object oriented programming. The web
page I looked at, however, went on and on without ever trying to
ground the theory in practice. I think the book will be better.

Paul


On Mon, Mar 18, 2002 at 10:58:50AM +, Jonathan E. Paton wrote:
 Date: Mon, 18 Mar 2002 10:58:50 + (GMT)
 From: Jonathan E. Paton [EMAIL PROTECTED]
 Subject: Re: looking for book on object oriented perl
 To: [EMAIL PROTECTED]
 
   (1) Could someone tell me if this in fact
   is from the same book?
  
   From what I read on this site, I was not
   too impressed with the book at all. It
   seemed to go on forever explaining theory
   without giving any concrete examples with
   perl code.
 
 This is a series of extracts from Object
 Oriented Perl, and it looks familiar
 enough but isn't any one chapter - it's a
 preview as it were.
 
  In my opinion, the balance between theory
  and technique in the book is good, you really
  can't expect a book about object oriented to
  be without any theory at all right?
 
 There is a lot of theory, explaining the
 various approaches then following with an
 implementation of what was discussed.
 
 In OO there is NO ONE WAY to implement a
 system, you need to know what the different
 techniques are and when to apply them -
 this book will teach you that.
 
   (2) Does this book get better in
   chapters 2 and 3? Do people on this
   mailing list recommend it?
 
 Yes, they do - I bought it on other
 people's recommendations and didn't regret
 doing so :)
 
   Keep in mind that I am a beginner
   with no previous experience in any
   object oriented language.
 
  I was very fresh on OO perl when I first
  read the book and I did find a lot of
  good tips and advice in it.  If you are
  fresh to OO perl too, I would say this is
  a good place to start.
 
 OO is difficult at first, but makes larger
 problems easier.  However, before you can
 do anything complex you need to learn a
 lot of things first.
 
 However, you should remember the Cookbook
 and the Camel both have chapters on OO that
 you may want to start with - although they
 are geared towards those with some prior
 experience.
 
 Jonathan Paton
 

-- 


*Paul Tremblay *
*[EMAIL PROTECTED]*

