LWP getting special (multibyte) characters from webpages

2008-12-26 Thread John Refior
Hello,

I am writing Perl scripts that go to webpages, download certain content, and then create a CSV file with the relevant data. I am trying to be a friendly web robot, so I am using the LWP::RobotUA module.

my $ua = LWP::RobotUA->new('product_name', 'my_email');
$ua->delay(1/60); #

Re: LWP getting special (multibyte) characters from webpages

2008-12-26 Thread Chas. Owens
On Fri, Dec 26, 2008 at 15:03, John Refior wrote:
snip
> The problem I am having is that a number of these webpages have special
> multibyte characters on them, such as the trademark symbol and registered
> trademark symbol. For example, in the CSV, the trademark (TM) symbol
> shows up like
>
>

Re: LWP getting special (multibyte) characters from webpages

2008-12-26 Thread John Refior
> The file is already in UTF-8, otherwise it wouldn't display properly
> in Firefox or IE. The problem is either your display or perl doesn't
> know that the file is in UTF-8.
>
> The first step is to make sure Perl knows it is working with UTF-8. Add
>
> export PERL_UNICODE=SDL
>
> to your .profile
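
As a complement to the PERL_UNICODE setting quoted above, the encoding can also be declared on the output filehandle inside the script itself. The following is only a minimal sketch under stated assumptions: the product name, the price, and the temporary file standing in for the CSV are hypothetical and do not come from the original scripts.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical product name containing the trademark sign (U+2122).
my $name = "Widget\x{2122}";

# A temporary file stands in for the real CSV output (assumption).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Declare the encoding on the handle so Perl encodes characters
# as UTF-8 when writing.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
print {$out} qq{"$name",19.99\n};
close $out or die "Cannot close $path: $!";

# Read the file back as raw bytes to inspect what was written.
open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

# Print each byte in hex; the trademark sign appears as E2 84 A2.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "$hex\n";
```

Whatever reads the CSV afterwards still has to be told the file is UTF-8; if it assumes a single-byte encoding, the three bytes of the trademark sign are displayed as three separate characters.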

Re: LWP getting special (multibyte) characters from webpages

2009-01-02 Thread Mark Wagner
On Fri, Dec 26, 2008 at 12:03, John Refior wrote:
> Hello,
>
> I am writing Perl scripts that go to webpages, download certain content,
> and then create a CSV file with the relevant data. I am trying to be a
> friendly web robot, so I am using the LWP::RobotUA module.
> my $page = $r

Re: LWP getting special (multibyte) characters from webpages

2009-01-09 Thread John Refior
> $response->content gives you the exact byte values returned by
> the server; decoded_content turns it into Perl's internal Unicode
> representation (assuming the server is telling the truth about what
> encoding the page is in).

Thanks for the clarification, as I wasn't sure of the difference be
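
The difference between the two accessors described above can be seen without any network access by building an HTTP::Response by hand. This is a sketch only: the one-character body and the Content-Type header are made up for illustration, and it assumes the HTTP::Response module that ships with libwww-perl is installed.

```perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand so no network access is needed.
# The body is the three UTF-8 bytes of the trademark sign (U+2122);
# the header declares the charset, as a well-behaved server would.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "\xE2\x84\xA2",
);

my $raw     = $response->content;          # bytes exactly as sent
my $decoded = $response->decoded_content;  # characters, per the declared charset

printf "content:         %d byte(s)\n", length $raw;
printf "decoded_content: %d character(s), U+%04X\n",
    length $decoded, ord $decoded;
```

Here content holds three bytes while decoded_content holds a single character, U+2122. If the declared charset did not match the actual bytes, the decoded text would be wrong, which is the "telling the truth" caveat in the quoted reply.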