Thank you for the suggestions.
Writing a custom LoadFunc seems like a valid solution for me, given that I
don't currently have Hive or HCatalog installed and I'm working on more of an
ad-hoc problem at this point.
HCatalog seems like a good solution for doing this type of thing on a repeated basis.
Hi,
Forget the question raised before. It's solved.
There are two possibilities that come to mind.
1. Write a custom LoadFunc in which you can handle these regular
expressions. *Not the most ideal solution*
2. Use HCatalog. The example they have in their documentation seems to fit
your use case perfectly. (http://incubator.apache.org/hcatalog/docs/r0.
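For concreteness, a minimal sketch of both options (untested; the table
name, partition column, and loader class below are made-up assumptions,
not something from this thread):

    -- Option 2: let HCatalog prune partitions
    -- (assumes a table 'events' partitioned by a string column 'datestamp')
    events = LOAD 'events' USING org.apache.hcatalog.pig.HCatLoader();
    range  = FILTER events BY datestamp >= '2012-11-07' AND datestamp <= '2013-05-26';

    -- Option 1: a hypothetical custom LoadFunc that expands the date range itself
    raw = LOAD '/data' USING com.example.DateRangeLoader('2012-11-07', '2013-05-26');

With HCatLoader, a FILTER on partition columns placed right after the LOAD
gets pushed down, so only the matching date directories are read.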
Hi,
I am currently running Pig from Eclipse on a Hadoop cluster.
I added the Hadoop conf location to the runtime configuration.
But the MapReduce jobs failed because the compiled Pig class files cannot
be loaded by Hadoop.
I added the class file location to the classpath, but it did not work.
Any hints?
Let's say I have my input data from the past 12 months organized into subdirs
by date:
/data/2012-06-10
/data/2012-06-11
...
/data/2013-06-09
And now say that I want to run a Pig script to process data from a range of
dates within the last 12 months, say 2012-11-07 through 2013-05-26. The regex
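For illustration, Pig's LOAD path does accept Hadoop glob patterns, so
something like the following sketch (hypothetical paths, and only
approximating the endpoints) is possible, though an exact arbitrary range
gets unwieldy fast:

    -- Hadoop glob syntax: [12] matches '1' or '2', * matches anything
    late_2012  = LOAD '/data/2012-1[12]-*' USING PigStorage();
    early_2013 = LOAD '/data/2013-0[1-5]-*' USING PigStorage();
    combined   = UNION late_2012, early_2013;

Note these globs over-match at both ends (e.g. they include 2012-11-01
through 2012-11-06), which is why exact range handling needs something
smarter.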
Bzip2 is only splittable in newer versions of Hadoop.
On Jun 10, 2013 10:28 PM, "Alan Crosswell" wrote:
> Ignore what I said and see
> https://forums.aws.amazon.com/thread.jspa?threadID=51232
>
> bzip2 was documented somewhere as being splittable, but this appears
> not to actually be implemented
Ignore what I said and see
https://forums.aws.amazon.com/thread.jspa?threadID=51232
bzip2 was documented somewhere as being splittable, but this appears not
to actually be implemented, at least in AWS S3.
/a
On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell wrote:
> Suggest that if you have a choice
Hi Shahab,
It would be great if someone could delete this email from the PIG group.
I am aware of the mistake and posted this issue to the HIVE group almost
immediately.
Regards,
Gourav
On Mon, Jun 10, 2013 at 5:28 PM, Shahab Yunus wrote:
> Gourav, this is not a HIVE mailing list. It is PIG's.
>
> Regards,
Dunno, I'm guessing it would since each file is a different mapper.
On Mon, Jun 10, 2013 at 1:12 PM, William Oberman
wrote:
> I'm using gzip as I had a huge S3 bucket of uncompressed files, and
> s3distcp only supported {gz, lzo, snappy}.
>
> I haven't ever done this, but can I mix/match files?
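If it helps, a hypothetical sketch of mixing (made-up bucket; Hadoop picks
the codec per file from its extension, so one LOAD over a mixed directory
should work):

    -- each .gz file is one unsplittable map task;
    -- each .bz2 file can be split across several mappers
    logs = LOAD 's3://mybucket/backups/' USING PigStorage();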
I'm using gzip as I had a huge S3 bucket of uncompressed files, and
s3distcp only supported {gz, lzo, snappy}.
I haven't ever done this, but can I mix/match files? My backup processes
add files to these buckets, so I could upload new files as *.bz. But then
I'd have some files as *.gz, and others as *.bz.
Suggest that if you have a choice, you use bzip2 compression instead of
gzip, as bzip2 is block-based and Pig can split reading a large bzipped
file across multiple mappers, while gzip can't be split that way.
On Mon, Jun 10, 2013 at 12:06 PM, William Oberman
wrote:
> I still don't fully understand
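A minimal sketch of the bzip2 route (untested; made-up paths, standard
Pig/Hadoop property names):

    SET output.compression.enabled true;
    SET output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';

    -- reading: Pig splits a .bz2 file across mappers on its own
    big = LOAD '/data/big.bz2' USING PigStorage();
    -- writing: PigStorage output is bzip2-compressed via the settings above
    STORE big INTO '/data/out';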
Gourav, this is not a HIVE mailing list. It is PIG's.
Regards,
Shahab
On Mon, Jun 10, 2013 at 10:39 AM, Gourav Sengupta
wrote:
> Hi,
>
> On running the following query I am getting multiple records with the
> same value of F1
>
> SELECT F1, COUNT(*)
> FROM
> (
> SELECT F1, F2, COUNT(*)
> FROM TABLE1
I still don't fully understand (and am still debugging), but I have a
"problem file" and a theory.
The file has a "corrupt line" that is a huge block of null characters
followed by a "\n" (other lines are JSON followed by "\n"). I'm thinking
that's a problem with my Cassandra -> S3 process, but I
Hi,
On running the following query I am getting multiple records with the
same value of F1:
SELECT F1, COUNT(*)
FROM
(
SELECT F1, F2, COUNT(*)
FROM TABLE1
GROUP BY F1, F2
) a
GROUP BY F1;
As I understand it, the number of duplicate records depends on the
number of reducers.
Replicating the test
The Hadoop wiki gives a brief explanation:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
The logic is indeed the same for Pig because, under the hood, Pig will
generate and optimize a workflow of MapReduce jobs.
With a splittable 500MB file, if the provided number of maps (10) is
accepted,
Hi Pedro,
Yes, Pig Latin is always compiled to MapReduce.
Usually you don't have to specify the number of mappers (I am not sure
whether you really can). If you have a file of 500MB and it is splittable,
then the number of mappers automatically equals 500MB / 64MB (the block
size), which is around 8.
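For what it's worth, a quick sketch of the knobs that do exist (usual
Hadoop 1.x / Pig property names; the values are made-up examples):

    -- mappers can only be influenced indirectly, via split size
    SET mapred.max.split.size 134217728;      -- 128MB max per split
    SET pig.maxCombinedSplitSize 134217728;   -- Pig combines small splits up to this
    -- reducers can be set directly
    SET default_parallel 10;

    data    = LOAD '/data/input' AS (key:chararray, value:int);
    grouped = GROUP data BY key PARALLEL 10;  -- PARALLEL overrides per operator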
Yes, I understand the previous answers now. The reason for my question is
that I was trying to "split" a file with Pig Latin by loading the file
and writing portions of the file back to HDFS. With both replies, it
seems that Pig Latin uses MapReduce to compute the scripts, correct?
And in MapReduce
I wasn't clear. Specifying the size of the files is not your real aim, I
guess, but you think that's what is needed in order to solve your
underlying problem, which we don't know about. 500MB is not a really big
file in itself and is not an issue for HDFS and MapReduce.
There is no absolute way to know how much
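If the goal is simply to rewrite the data in N pieces, a minimal sketch
(made-up paths; each reducer writes one part file):

    data   = LOAD '/data/big_file' AS (line:chararray);
    -- forcing a reduce phase with 8 reducers yields 8 output part files
    sorted = ORDER data BY line PARALLEL 8;
    STORE sorted INTO '/data/split_output';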