The '#' is being used as a delimiter here, to avoid the need to escape all the 
slashes.

-----Original Message-----
From: Rob Wilkerson [mailto:rwilker...@lotame.com] 
Sent: Friday, October 01, 2010 8:00 AM
To: pig-user@hadoop.apache.org
Subject: Re: Grouping & Counting

On Fri, Oct 1, 2010 at 7:44 AM, David Vrensk <da...@icehouse.se> wrote:
> I would just preprocess the file with Perl or Ruby:
>
> perl -ne 'next unless m#/#; s#(.*)/(.*)#\1\t\2#; print;' infile > outfile

What is the "#" representing? I have a semi-educated guess, but I can't find 
that particular symbol in any examples.

Also, as far as I can tell, this regex misses the top-level path because it 
has no children. For example, the "Arts" path. It catches "Arts/Anime" and 
below nicely, of course.

> Come to think of it, if your entire file is just 800k lines, I'd do 
> the entire thing with Perl.

I thought about that when PHP couldn't handle it, but my Perl skills are light 
and it was a chance to learn something entirely new.

Thanks for your help.

-- 
+rw
 
