The '#' is being used as a delimiter here, to avoid the need to escape all the slashes.
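
To illustrate, here is a small sketch that runs the same program twice, once with '#' as the delimiter and once with the default '/'. The sample paths, the temporary directory and the out_hash/out_slash file names are made up for this example, and $1/$2 are used in place of the \1/\2 from David's one-liner. Both runs should produce identical output.

```shell
# Hypothetical sample input: one top-level path and two nested ones.
dir=$(mktemp -d)
printf 'Arts\nArts/Anime\nArts/Anime/Fandom\n' > "$dir/infile"

# '#' as the delimiter: the literal slash needs no escaping.
perl -ne 'next unless m#/#; s#(.*)/(.*)#$1\t$2#; print;' "$dir/infile" > "$dir/out_hash"

# Default '/' delimiter: every literal slash has to be written as \/.
perl -ne 'next unless /\//; s/(.*)\/(.*)/$1\t$2/; print;' "$dir/infile" > "$dir/out_slash"

# The two outputs match. "Arts" is absent because the first statement
# skips any line that contains no slash.
cmp "$dir/out_hash" "$dir/out_slash" && cat "$dir/out_hash"
```

Any punctuation character that does not appear in the pattern would work equally well as the delimiter; '#' is just a common choice when the pattern is full of slashes.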
-----Original Message-----
From: Rob Wilkerson [mailto:rwilker...@lotame.com]
Sent: Friday, October 01, 2010 8:00 AM
To: pig-user@hadoop.apache.org
Subject: Re: Grouping & Counting

On Fri, Oct 1, 2010 at 7:44 AM, David Vrensk <da...@icehouse.se> wrote:
> I would just preprocess the file with Perl or Ruby:
>
> perl -ne 'next unless m#/#; s#(.*)/(.*)#\1\t\2#; print;' infile > outfile

What is the "#" representing? I have a semi-educated guess, but I can't find that particular symbol in any examples.

Also, as far as I can tell, this regex also misses the top level path because it has no children. For example, the "Arts" path. It catches "Arts/Anime" and below nicely, of course.

> Come to think of it, if your entire file is just 800k lines, I'd do
> the entire thing with Perl.

I thought about that when PHP couldn't handle it, but my Perl skills are light and it was a chance to learn something entirely new.

Thanks for your help.

--
+rw