Re: While Inserting data into hive Why I colud not able to query ?

2013-07-16 Thread Alan Gates
This question should be sent to u...@hive.apache.org. Alan. On Jul 16, 2013, at 3:23 AM, samir das mohapatra wrote: Dear All, Did any one faced the issue : While Loading huge dataset into hive table , hive restricting me to query from same table. I have set

Re: cross product of 2 data sets

2011-09-01 Thread Alan Gates
http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html search on cross matches Alan. On Sep 1, 2011, at 11:44 AM, Marc Sturlese wrote: Hey there, I would like to do the cross product of two data sets, any of them feeds in memory. I've seen pig has the cross operation. Can

Re: Reply: Can pig-0.8.1 can work with junit 4.3.1 or 4.8.1 or 4.8.2?

2011-08-22 Thread Alan Gates
When I download the Pig 0.8.1 tarball I don't find any junit class files, just a license file (which probably doesn't need to be there). If you build it, it will pull those via Ivy, but they are not in the tarball. AFAIK it will work with any JUnit 4.x, but 4.5 is what we use in our testing.

Re: [DISCUSS] Apache Pig bylaws

2010-10-01 Thread Alan Gates
to move as quickly as possible. Is this strong enough for you Ben? Alan. On Sep 27, 2010, at 6:18 PM, Alan Gates wrote: As directed in our vote to become a TLP, we (Pig's PMC) need to set out bylaws for the project. I have put up a first proposal for these bylaws at http://wiki.apache.org/pig

Re: project on pigerry

2010-09-29 Thread Alan Gates
We keep tabs on projects we have worked on, are working on, and are thinking of working on at http://wiki.apache.org/pig/PigJournal This should give you some ideas for projects. Alan. On Sep 28, 2010, at 11:38 AM, yoomeosym...@yahoo.com wrote: Kindly give a set of project on the above,

Re: Accessing Nested Json

2010-09-29 Thread Alan Gates
Are you loading them as tuples or maps? If you're loading them as tuples then you should be able to say x.keyA.pA (which should return vA). If you're loading them as maps then it would be x#'keyA'#'pA' Alan. On Sep 28, 2010, at 12:45 PM, rakesh kothari wrote: Hi, Is there a good way

Re: [DISCUSS] Apache Pig bylaws

2010-09-28 Thread Alan Gates
-Original Message- From: Alan Gates [mailto:ga...@yahoo-inc.com] Sent: Monday, September 27, 2010 6:18 PM To: pig-user@hadoop.apache.org Subject: [DISCUSS] Apache Pig bylaws As directed in our vote to become a TLP, we (Pig's PMC) need to set out bylaws for the project. I have put up a first

Re: specify temp folder?

2010-09-13 Thread Alan Gates
Pig puts results between MR jobs into HDFS. Results from maps go into local files (like any other MR job). For results between MR jobs, you want them in HDFS where they will get replicated. Else your next MR job will not have a sufficient number of places it could be run, and you're much

Re: Pig 0.8.0 and Hadoop 0.21.0

2010-09-08 Thread Alan Gates
On Sep 8, 2010, at 7:40 AM, Aditya Muralidharan wrote: Hi, Thanks for your great work on pig. I've been trying to use the code from pig 0.7.0, and the pig 0.8.0 branch to submit jobs to a hadoop 0.21.0 cluster. Submissions don't seem to work due to API incompatibilities. I found issue

Re: Research projects with Hadoop

2010-09-07 Thread Alan Gates
Luan, Pig keeps a list at http://wiki.apache.org/pig/PigJournal of all the Pig projects we know of. Many of these are more project based, but some could be turned into actual research. If you do choose one of these, please let us know (over on pig-...@hadoop.apache.org) so we can mark

Re: Request for Comments: Piggybank future

2010-08-30 Thread Alan Gates
On Aug 28, 2010, at 11:39 AM, Milind A Bhandarkar wrote: +1 on the direction. A few questions: 1. With Pig marching towards becoming a TLP at Apache, can Piggybank become a full-fledged subproject (with its own releases and all)? 2. Or since the ultimate goal is to have a common UDF

Re: Failed to create DataStorage

2010-08-27 Thread Alan Gates
Pig 0.7 runs on 20.x. Alan. On Aug 27, 2010, at 2:58 PM, Saurav Datta wrote: Thanks Alan ! Will Pig 0.7.0 run on Hadoop 0.20.x ? Or should we use any other Hadoop release ? Regards, Saurav On Aug 27, 2010, at 2:50 PM, Alan Gates wrote: Pig has not been tested with Hadoop 0.21, so I

Re: [VOTE] Pig to become a TLP

2010-08-26 Thread Alan Gates
With 15 +1 votes (14 from PMC members) the proposal passes. Thanks for voting. Owen, please push this to the Apache board for their consideration. Alan. On Aug 23, 2010, at 10:38 AM, Alan Gates wrote: I propose that Pig become a top level Apache project. The Pig development community has

Re: image processing on a low level using PIG...Possible?

2010-07-26 Thread Alan Gates
Pig itself does not contain image processing primitives. But if you write your image processing in a UDF, then Pig can be a great framework for dealing with the parallelism, running it on Hadoop, etc. Alan. On Jul 26, 2010, at 11:56 AM, Ifeanyichukwu Osuji wrote: Hi all,

Re: ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/mapreduce/TableInputFormat

2010-07-26 Thread Alan Gates
At this point HBaseStorage is only a load function and not a store function. If you're interested in taking it on, we'd love to have someone extend it to be a store function as well. Alan. On Jul 22, 2010, at 2:05 PM, preethi vinayak sunny wrote: Hi All, This is my first mail in the

Re: Split Indexes

2010-07-22 Thread Alan Gates
Pig has implemented map side merge joins in this way. If the storage mechanism contains an index (e.g. Zebra) it can use it. Alan. On Jul 21, 2010, at 5:22 PM, Deem, Mike wrote: We are planning to use Hadoop to run a number of recurring jobs that involve map side joins. Rather than

Re: Scaling Pig Projects - The Hairy Pig

2010-06-22 Thread Alan Gates
Here at Yahoo we use Oozie for managing large workflows (latest open source edition at http://github.com/tucu00/oozie1 though they expect to make another drop before the Hadoop summit). There are plans to make Oozie a full open source project (instead of just making drops to github).

Re: Scaling Pig Projects - The Hairy Pig

2010-06-22 Thread Alan Gates
On Jun 22, 2010, at 1:06 PM, Dmitriy Ryaboy wrote: I think everyone has some sort of an ad-hoc system for building and managing these types of things. Seems like a prime candidate for some community development -- we would all benefit from sharing a framework like that, and it should be

Re: Pig loader 0.6 to 0.7 migration guide

2010-06-18 Thread Alan Gates
-- ** addJobConf() is public, but not expected to be used by end-users, right? Several public methods here look like they need better documentation, and the class itself could use a javadoc entry with some example uses. On May 24, 2010, at 11:06 AM, Alan Gates wrote: Scott, I made an effort

Fwd: First International Mapreduce Workshop 2010: Paper submission deadline July 15 , 2010

2010-06-18 Thread Alan Gates
Begin forwarded message: From: Milind A Bhandarkar mili...@yahoo-inc.com Date: May 31, 2010 9:16:38 PM PDT To: common-u...@hadoop.apache.org common-u...@hadoop.apache.org, mapreduce-u...@hadoop.apache.org mapreduce- u...@hadoop.apache.org, gene...@hadoop.apache.org

Re: Why hadoop-u...@lucene.a.o ?

2010-06-18 Thread Alan Gates
Ancient history. Hadoop started as a subproject of Lucene. Alan. On Jun 17, 2010, at 10:22 PM, Otis Gospodnetic wrote: Hello, I've noticed people send emails to the following address: hadoop-u...@lucene.apache.org Why? Is this supposed to be related to common-user@hadoop.apache.org

Re: Behavior of JOIN

2010-06-10 Thread Alan Gates
great if C = JOIN A by id, B b id; is alias for C1 = COGROUP A by id, B by id; C2 = filter C1 by IsEmpty(A) OR IsEmpty(B); C = foreach C2 generate FLATTEN(A), FLATTEN(B); On Tue, Jun 8, 2010 at 12:03 PM, Alan Gates ga...@yahoo-inc.com wrote: Historically C = JOIN A by a, B by a was defined

Re: Pig facility analogous to SQL's IN?

2010-06-02 Thread Alan Gates
That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted. - George Boole, quoted in Iverson's Turing Award Lecture - Original Message From: Alan Gates ga...@yahoo-inc.com To: pig-user@hadoop.apache.org Sent

Fwd: OpenSQLCamp EU 2010 - Call for participation is now open

2010-06-01 Thread Alan Gates
Begin forwarded message: From: Giuseppe Maxia g.ma...@gmail.com Date: May 31, 2010 5:44:08 AM PDT To: databases-disc...@opensolaris.org, derby-u...@db.apache.org, firebird-de...@lists.sourceforge.net, gene...@hadoop.apache.org, hbase-u...@hadoop.apache.org,

Re: Pig facility analogous to SQL's IN?

2010-06-01 Thread Alan Gates
In general mapside cogroups are not possible unless the underlying storage mechanism can guarantee that all instances of the key you are cogrouping on are in a single map instance. At this point only Zebra can guarantee that. If you're interested I can give more details on why join

Pig loader 0.6 to 0.7 migration guide

2010-05-21 Thread Alan Gates
At the Bay Area HUG on Wednesday someone (Eli I think, though I might be remembering incorrectly) asked if there was a migration guide for moving Pig load and store functions from 0.6 to 0.7. I said there was but I couldn't remember if it had been posted yet or not. In fact it had

Re: Is pig 0.7 compatible with 18.3?

2010-05-18 Thread Alan Gates
No. Pig versions 0.5 and later are only compatible with Hadoop 0.20. Alan. On May 18, 2010, at 4:22 PM, Brian Donaldson wrote: I get this error message: 2010-05-18 16:20:30,490 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system

[Travel Assistance] - Applications Open for ApacheCon NA 2010

2010-05-17 Thread Alan Gates
The Travel Assistance Committee is now taking in applications for those wanting to attend ApacheCon North America (NA) 2010, which is taking place between the 1st and 5th November in Atlanta. The Travel Assistance Committee is looking for people who would like to be able to attend

Re: CONCAT multiple fields

2010-05-14 Thread Alan Gates
On May 14, 2010, at 12:20 AM, Russell Jurney wrote: Should I make a JIRA then submit the patch? Yes. Alan.

Re: MiniCluster

2010-05-12 Thread Alan Gates
Check out the PigUnit patch in https://issues.apache.org/jira/browse/PIG-1404 and see if that will meet your needs. Alan. On Apr 29, 2010, at 9:28 AM, Corbin Hoenes wrote: I see MiniCluster.java in the pig source code and want to do something similar in my own tests tried just copying

Re: CONCAT multiple fields

2010-05-12 Thread Alan Gates
Can't we just change the built-in CONCAT to accept additional fields? This would be totally backward compatible. I know it won't help now. Alan. On May 12, 2010, at 4:15 PM, Russell Jurney wrote: The CONCAT in the oink project (LinkedIn's UDFs) does concatenation of any number of string

Re: GenericOptionsParser Warning

2010-05-11 Thread Alan Gates
On May 10, 2010, at 5:13 PM, Syed Wasti wrote: I keep seeing this warning message while running my scripts, is this a concern ? Any info please. How can I get rid of this ? It is not a concern. There's no way for you to remove it. It's caused by code in Hadoop complaining at the way Pig

Re: UDF with two Bag one per group and one 'static'

2010-04-30 Thread Alan Gates
You need to change your group to a cogroup so that both bags are in your data stream. If you don't want to group bag b by the same keys as a (that is, you want all of b available for each group of a) then you can load b as a side file inside your udf. Alan. On Apr 30, 2010, at 4:32 AM,

Re: ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.AVG

2010-04-29 Thread Alan Gates
What is the type of the field you are trying to take the average of? Alan. On Apr 29, 2010, at 11:10 AM, Katukuri, Jay wrote: Hi, I have encountered the following error in using pig's built in function AVG. ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.AVG

Re: chaining pig scripts

2010-04-27 Thread Alan Gates
Use the PigServer interface from Java. This way your Pig Latin and Java can be intermixed. You will be guaranteed that your middle Java code will start immediately after PigServer finishes running the first Pig Latin script. Alan. On Apr 26, 2010, at 7:41 PM, Katukuri, Jay wrote:

Re: Can you setup PigStorage to respect delimieters escaped with quotes?

2010-04-24 Thread Alan Gates
PigStorage doesn't have an escaping mechanism at the moment. You could create a load function that extends PigStorage and adds escaping for field delimiters. Alan. On Apr 23, 2010, at 7:28 PM, Toli Kuznets wrote: Hi, I'm trying to read in a comma-separated file with a simple command: a

Re: How do I generate a row id?

2010-04-23 Thread Alan Gates
Unique identifiers are easy enough. Row ids (monotonically increasing values) are impossible because of the parallel nature of map reduce. If you just want to generate a unique identifier you can write a UDF to wrap Java's UUID class (or use the new GenericInvoker UDF if you're working
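As a rough illustration of the "wrap Java's UUID class" suggestion above, here is a minimal sketch in plain Java. It deliberately leaves out the Pig UDF plumbing so that it compiles without Pig on the classpath; the class name `UniqueIdSketch` and method name `nextId` are hypothetical, and a real UDF would call something like this from its own exec method.

```java
import java.util.UUID;

// Hypothetical helper, not part of Pig. A real UDF would wrap this call
// inside whatever UDF base class the Pig version in use provides.
final class UniqueIdSketch {
    private UniqueIdSketch() {}

    /** Returns a random UUID rendered as a 36-character string. */
    static String nextId() {
        // Every call is independent, so no coordination between map or
        // reduce tasks is needed. The values are unique for practical
        // purposes but are NOT monotonically increasing row ids.
        return UUID.randomUUID().toString();
    }
}
```

Because each task generates ids independently, this approach fits the parallel execution model, at the cost of ids that carry no ordering.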

Re: How do I generate a row id?

2010-04-23 Thread Alan Gates
another block. This way I can get a guaranteed unique ID. (And it's probably faster and smaller this way than generating UUID) Does pig use zookeeper to do anything? Can I connect to that one if it does? On Fri, Apr 23, 2010 at 12:08 PM, Alan Gates ga...@yahoo-inc.com wrote: Unique

Re: pig-6.0 SequenceFileLoader throws Error

2010-04-20 Thread Alan Gates
No. It might be useful though. AFAIK no one monitors #pig. Alan. Unrelated question. Does PIG have an IRC on freenode. #pig seems to be invite only.

Re: How to create complex structures in foreach..generate?

2010-04-20 Thread Alan Gates
The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. However, I'd consider putting these in builtin (in main Pig) instead. These are things everyone asks for and they seem

Re: A small question about Pig

2010-04-18 Thread Alan Gates
Pig 0.6 works on Hadoop 0.20.x. There was never an official release of Pig with Hadoop 0.19. Pig 0.4 was the last release to work on 0.18. There is a patch to convert this to 0.19 (see https://issues.apache.org/jira/browse/PIG-573) . Since Pig uses the new map reduce APIs in 0.20 a

Re: How to obtain dataset

2010-04-15 Thread Alan Gates
Take a look at https://issues.apache.org/jira/browse/PIG-200, the perf-0.6.patch contains scripts to generate skewed and unskewed data. Alan. On Apr 15, 2010, at 5:16 PM, Radhika Parvathaneni wrote: Dear Pig users, Please assist in obtaining 2 skewed data sets and 2 non-skewed

Re: why not ELSE in SPLIT

2010-04-14 Thread Alan Gates
I don't think this fits split's semantics. split does not necessarily send a tuple to only one destination. So for a split clause like: split A into big if size 100, into really_big if size 1000 big would contain all the records that really_big contains and all records with size = 100
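The non-exclusive behaviour described above can be mimicked in plain Java. This is only a sketch of the semantics, not Pig code: the comparison operators are missing from the archived message, so the conditions `size > 100` and `size > 1000` are assumptions, and the class name `SplitSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of a two-way split with overlapping conditions.
final class SplitSketch {
    private SplitSketch() {}

    static Map<String, List<Integer>> split(List<Integer> sizes) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        out.put("big", new ArrayList<>());
        out.put("really_big", new ArrayList<>());
        for (int size : sizes) {
            // Each condition is tested independently; there is no "else",
            // so one record can be routed to both outputs or to neither.
            if (size > 100) {
                out.get("big").add(size);
            }
            if (size > 1000) {
                out.get("really_big").add(size);
            }
        }
        return out;
    }
}
```

With these assumed thresholds, a record of size 5000 lands in both outputs, which is why an ELSE branch has no single obvious meaning here.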

Re: Few questions - (out-of-memory, DISTINCT/COUNT , not in)

2010-04-13 Thread Alan Gates
On Apr 13, 2010, at 3:54 PM, Katukuri, Jay wrote: Hello , I have few questions about the out-of-memory issues that I am running into. If you could please answer them, that will be great. I am using Pig0.40 on hadoop 0.18.3 in map reduce mode. The data set is fairly huge (The whole data

Re: Elephant Bird released

2010-03-31 Thread Alan Gates
I added a link to this on http://wiki.apache.org/pig/PigTools Alan. On Mar 29, 2010, at 2:51 PM, Dmitriy Ryaboy wrote: Hi folks, We (but mostly Kevin Weil) just open-sourced some of the code we use at Twitter to make working with Hadoop and Pig easier. Most of what is currently included in

Re: not in via join

2010-03-30 Thread Alan Gates
What you gave seems like it should work. But I'd try it as: C = COGROUP A BY id, B BY id; D = FILTER C BY COUNT(A) = 0; E = FOREACH D GENERATE FLATTEN(B); Alan. On Mar 29, 2010, at 7:06 PM, Kent Shi wrote: Hi, I am trying to get the elements of B not in A. My code is like this C = JOIN A
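To make the COGROUP-then-FILTER idea above concrete, the following plain-Java sketch simulates it on in-memory lists of ids. It is an illustration of the logic only, under the assumption that records are identified by a single string key; the names `CogroupAntiJoinSketch`, `notIn`, and `bags` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of: cogroup A and B by id, keep groups whose
// A bag is empty, then flatten the B bag of each remaining group.
final class CogroupAntiJoinSketch {
    private CogroupAntiJoinSketch() {}

    /** Returns the records of b whose id never appears in a, grouped by id. */
    static List<String> notIn(List<String> a, List<String> b) {
        // key -> pair of bags: index 0 holds records from A, index 1 from B
        Map<String, List<List<String>>> groups = new LinkedHashMap<>();
        for (String id : a) {
            bags(groups, id).get(0).add(id);
        }
        for (String id : b) {
            bags(groups, id).get(1).add(id);
        }
        List<String> result = new ArrayList<>();
        for (List<List<String>> group : groups.values()) {
            if (group.get(0).isEmpty()) {       // the filter on an empty A bag
                result.addAll(group.get(1));    // the flatten of the B bag
            }
        }
        return result;
    }

    private static List<List<String>> bags(Map<String, List<List<String>>> groups, String key) {
        return groups.computeIfAbsent(key, k -> {
            List<List<String>> pair = new ArrayList<>();
            pair.add(new ArrayList<>());
            pair.add(new ArrayList<>());
            return pair;
        });
    }
}
```

Duplicate ids in b are preserved, which matches what flattening a bag would do.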

Re: Compiling 0.7.0 to run against Hadoop 0.19.x

2010-03-30 Thread Alan Gates
Since 0.5 Pig has run against Hadoop 0.20, and since 0.6 it has used the new Hadoop APIs (available only in 20+). Reverting this would be very difficult. There is a patch for Pig 0.4 that will make it run against Hadoop 19 (https://issues.apache.org/jira/browse/PIG-573). Alan. On Mar

Re: Using external jar in UDF

2010-03-15 Thread Alan Gates
The UDF interface does not currently include the ability for a UDF to indicate additional jars it would like to have packaged and sent along. Alan. On Mar 10, 2010, at 2:21 AM, Tamir Kamara wrote: Hi, Register is working fine but it means that the user needs to know when it's needed to

Re: Can't run the tutorial

2010-03-15 Thread Alan Gates
Which version of Pig is this? If it's trunk, then you should check that you can run Hadoop on your machine, as it appears it is not connecting to Hadoop. (As of version 0.7 Pig uses a local instance of Hadoop in local mode.) Alan. On Mar 9, 2010, at 8:26 AM, Pavel Gutin

Re: ERROR 6017: Execution failed, while processing

2010-03-15 Thread Alan Gates
On Mar 12, 2010, at 10:36 AM, hc busy wrote: Is there any work towards something like C languages '#include' in Pig? My large pig script is actually developed separately in several smaller pig files. Individually the pig files do not run because they depend on previous scripts, but

Re: ERROR 6017: Execution failed, while processing

2010-03-15 Thread Alan Gates
.. -D On Mon, Mar 15, 2010 at 2:23 PM, Alan Gates ga...@yahoo-inc.com wrote: On Mar 12, 2010, at 10:36 AM, hc busy wrote: Is there any work towards something like C languages '#include' in Pig? My large pig script is actually developed separately in several smaller pig files

Re: User opinions needed

2010-03-04 Thread Alan Gates
On Mar 4, 2010, at 10:19 AM, Dmitriy Ryaboy wrote: Thanks to Gerrit and Bill who responded. Unfortunately they said the exact opposite thing so we are still at an impasse :-). Anyone else care to venture an opinion? Cause if Alan and I have a commiter fight, he'll win and y'all will have to

Pig 0.6.0 released

2010-03-01 Thread Alan Gates
Pig 0.6.0 is released. This release includes performance and memory usage improvements, a new Accumulator interface for UDFs, and many bug fixes. You can see the details of the release at http://hadoop.apache.org/pig/releases.html Alan.

Re: Map input to UDF

2010-02-16 Thread Alan Gates
PigStorage (the loader you are using) creates all values as bytearrays, which in Java is represented as a DataByteArray. So when you get the id element of your map, it is a DataByteArray. If all you really want to do is cast from bytearray to a long you don't need a UDF for that.

Re: Map input to UDF

2010-02-16 Thread Alan Gates
On Feb 16, 2010, at 4:02 PM, Kelvin Moss wrote: Thanks for the reply. Actually I have more than 10 keys in my map. I tried the following in UDF and it seems to work Long id; if (m.get(id) != null) { id = Long.parseLong(m.get(id).toString()); } This is correct
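The null-check-then-parse pattern quoted above can be written as a small self-contained helper. This is a sketch, not Pig API: the map key `"id"` is assumed from the earlier reply in this thread (the archived snippet shows it without quotes), and the class and method names are hypothetical. It relies only on `toString()`, so it works for any value type whose string form is a decimal number.

```java
import java.util.Map;

// Hypothetical helper showing the lookup-and-parse step in isolation.
final class MapIdSketch {
    private MapIdSketch() {}

    /** Returns the "id" entry as a Long, or null when the key is absent. */
    static Long idFromMap(Map<String, Object> m) {
        Object raw = m.get("id");   // assumed key name
        if (raw == null) {
            return null;
        }
        // A non-numeric value makes parseLong throw NumberFormatException;
        // the caller decides whether to catch it or let the task fail.
        return Long.parseLong(raw.toString());
    }
}
```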

Re: Fixing the release status of Pig in JIRA

2010-02-11 Thread Alan Gates
Done. Alan. On Feb 10, 2010, at 7:21 PM, Lars Francke wrote: Hi, I have a (hopefully) small request regarding JIRA. I quite like the Road Map feature[1] but unfortunately it doesn't work correctly for Pig as all versions (except 0.0.0) are set to Unreleased[2]. Would anyone with the power to

Re: reusing pig scripts

2010-02-09 Thread Alan Gates
You are not wrong. This is a feature we'd like to add but haven't gotten to yet. Alan. On Feb 9, 2010, at 8:12 PM, prasenjit mukherjee wrote: May be I was not clear enough on my problem. I would like to call another pig-script from a pig-script. How can I do that. As far as I understand,

Re: Automatically REGISTER jars

2010-02-05 Thread Alan Gates
5, 2010 at 2:46 PM, Alan Gates ga...@yahoo-inc.com wrote: Putting the jars on your classpath works as long as the classes you need are directly referenced in your script. So: B = foreach A generate com.mycompany.myudf($0); If myudf is in a jar somewhere in your classpath

Re: Filter equality with tuple

2010-02-03 Thread Alan Gates
This is a bug. Looking at the code EqualTo isn't implemented for Tuple, even though it is defined in the functional spec ( http://wiki.apache.org/pig/PigTypesFunctionalSpec ) and referenced in the user manual. Please file a JIRA on this so we can track it and get it fixed. In the

Re: Various pig questions

2010-02-02 Thread Alan Gates
Answers inlined: On Feb 2, 2010, at 3:15 AM, Guy Jeffery wrote: Hi, Hope this gets to the right list... I'm fairly new to Pig, been playing around with it for a couple of days. Essentially I'm doing a bit of work to evaluate Pig and its ability to simplify the use of Hadoop -

Re: piggybank build problem

2010-01-27 Thread Alan Gates
Before building in piggybank you need to do 'ant jar compile-test' at the top level. From the error messages I'm guessing you didn't do that. Alan. On Jan 26, 2010, at 10:53 PM, felix gao wrote: Hi all, Just downloaded it and when following the instruction to build there is compilation

Re: Conf Path

2010-01-27 Thread Alan Gates
PIG_CLASSPATH=your_config_directory pig Alan. On Jan 27, 2010, at 11:54 AM, Aryeh Berkowitz wrote: When I run Pig, I connect to the local file system, when I run (java -cp pig-0.5.0-core.jar:$HADOOP_HOME/conf org.apache.pig.Main) I connect to hdfs. It seems like Pig is not finding my

Re: WordCount Results Version 2 - Pig 0.6.0

2010-01-21 Thread Alan Gates
: Hi Alan, I'm not quite sure what you mean. As shown in my pig script, I have stated to have 56 reducers for the group by task. And the number of mappers is decided by hadoop. Is there any way to optimize my pig script further? On 20 Jan 2010 19:07, Alan Gates ga...@yahoo-inc.com wrote

Re: WordCount Results Version 2 - Pig 0.6.0

2010-01-20 Thread Alan Gates
Are you setting parallel as Mirdul suggests? Or does your cluster have a default parallelism set? Alan. On Jan 20, 2010, at 1:58 AM, Rob Stewart wrote: Hi again, The results have been produced. I can tell you that I made the following improvements: 1. Removed unnecessary words =

Re: skewed optimizations

2010-01-20 Thread Alan Gates
..:-) Cheers, /R On 1/20/10 1:41 AM, Alan Gates ga...@yahoo-inc.com wrote: Let me elaborate on what Rekha said. He's correct that Pig does it automatically for order by. It has to sample the input to the order by to decide how to distribute the keys. As part of this it notices any skew and spreads

Re: a pig screencast on the basics

2010-01-19 Thread Alan Gates
Mat, This looks really nice. Are you okay with me posting it at http://wiki.apache.org/pig/PigTalksPapers so other Pig users can benefit from it? Alan. On Jan 16, 2010, at 8:20 PM, Mat Kelcey wrote: based on a talk i gave at work recently hope it might help someone as an intro to pig mat

Re: Pig script very slow to start actual processing

2010-01-15 Thread Alan Gates
Jeremy, Usually the mails get bounced when the sender isn't a subscriber to pig-user. Usually we see this sit and wait behavior when other jobs are running and there are no slots open on the cluster. If you see this behavior again can you look at the job tracker GUI. It will tell you

Re: Survey: Do you have your own Tuple and DataBag implementation ?

2010-01-15 Thread Alan Gates
Qui tacet consenti No one has spoken up, so I think you're free to make the change. Alan. On Jan 6, 2010, at 8:14 AM, Jeff Zhang wrote: Hi all, I am currently working on a JIRA which will change the interface of Tuple and DataBag: PIG-1166 https://issues.apache.org/jira/browse/PIG-1166

Re: How to run PigMix?

2010-01-15 Thread Alan Gates
That's correct. See Ying's comments near the bottom on how to make the patches there work together. Alan. On Jan 15, 2010, at 6:30 PM, Matei Zaharia wrote: Hi, I'm interested in running the PigMix benchmark described at http://wiki.apache.org/pig/PigMix to test some scheduling work in

Re: Piglet: a Ruby DSL for writing Pig scripts

2010-01-14 Thread Alan Gates
Done. Alan. On Jan 13, 2010, at 10:01 PM, Theo Hultberg wrote: Please do! T# On Thu, Jan 14, 2010 at 12:02 AM, Alan Gates ga...@yahoo-inc.com wrote: Theo, This looks really interesting. Can I put a link to it on our page for tools use with Pig, http://wiki.apache.org/pig/PigTools

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Alan Gates
Rob, Feel free to update the wiki with your findings. You don't have to be a committer to change the wiki. Alan. On Jan 14, 2010, at 12:15 PM, Rob Stewart wrote: Hello Dmitry! I have it solved, it was just a bit of trial and error based on the Hive bug report/fix I found. The report

Re: Bible Code and some input format ideas

2010-01-12 Thread Alan Gates
I'm guessing that you want to set the width of the text to avoid the issue where if you split by block, then all splits but the first will have an unknown offset. Most texts have natural divisions in them which I'm guessing you'll want to respect anyway. In the Bible this would be the

Re: Simulating a SOP in pig for optimization/debugging purpose

2010-01-11 Thread Alan Gates
The script you give below will run twice, once for the dump, and once for the store. And dump is implemented as store plus cat. So I don't think this will do what you want. Alan. On Dec 18, 2009, at 1:48 AM, prasenjit mukherjee wrote: I am trying to figure out a way to identify the

Re: Tracing from an UDF

2010-01-05 Thread Alan Gates
In MR mode, the output of your UDFs will turn up in the logs of the map and reduce tasks, not in the pig log. There is currently no channel for pig to send log messages back from the cluster to your machine to put the messages in the pig log. Alan. On Jan 5, 2010, at 7:02 AM, Vincent

Re: CSV format loader

2009-12-08 Thread Alan Gates
Definitely. Alan. On Dec 8, 2009, at 3:12 PM, James Kebinger wrote: Hi all, I realized a week or two ago that PigStorage(',') wasn't adequate to parse files that had commas embedded in properly CSV quoted fields. I went ahead and built a CSV parser for pig 0.3 that deals with embedded

Re: Why we name it zebra ?

2009-11-30 Thread Alan Gates
On Nov 26, 2009, at 7:39 AM, Jeff Zhang wrote: Hi all, I'd like to know where's the name zebra come from ? does it convey the meaning of this meta data system that the columnar storage format is like the lines on the zebra's skin. Pretty much, yes. We've fallen into the habit of giving

Re: Diffing two bags?

2009-11-25 Thread Alan Gates
Do you want to keep the distinct values separate by input, or mingle them? The following script will keep them separate. A = load 'students' as (name); B = load 'employees' as (name); C = cogroup A by name, B by name; D = filter C by IsEmpty(A); E = foreach D generate flatten(B); store E into

Re: Need help with grouping.

2009-11-25 Thread Alan Gates
On Nov 25, 2009, at 2:59 PM, Dmitriy Ryaboy wrote: snip This is a good use case that manages to expose a with the UDF apis -- it would be nice to output multiple records per processed tuple in exec(), to allow the kind of processing actual Pig operators sometimes do, with buffering

Re: Dataloading problem with HBaseStorage.

2009-11-20 Thread Alan Gates
HBaseStorage is broken in Pig 0.5.0, see https://issues.apache.org/jira/browse/PIG-970 The fix for that has been checked into trunk. You can either check out from trunk and build to get that, or can check out from the 0.5.0 branch and then apply the patches in PIG-970 to that code base.

Yahoo is hiring for Hadoop development

2009-11-20 Thread Alan Gates
All, Yahoo has a number of Hadoop development positions open. There are engineering, architect, management, and QA positions all open. See http://developer.yahoo.net/blogs/hadoop/2009/11/updated_do_you_have_what_it_ta.html for details. Alan.

Re: Could pig dynamic change the reduce number according the mapper task number ?

2009-11-13 Thread Alan Gates
On Nov 12, 2009, at 2:49 PM, Scott Carey wrote: Is it possible to have a script at least use the default configured Hadoop value? Or is there a way to do that already? If the user doesn't specify a parallelism Pig doesn't set a value in JobConf for the reduce, which means it will pick up

Re: properties or conf dir?

2009-11-13 Thread Alan Gates
Looks like it is missing from the distribution. You can see the file at http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.5/conf/pig.properties?revision=815933&view=markup You can also get it from svn. Alan. On Nov 13, 2009, at 9:01 AM, John Hayward wrote: I downloaded hadoop 0.20.1

Re: properties or conf dir?

2009-11-13 Thread Alan Gates
Filed https://issues.apache.org/jira/browse/PIG-1093 for this issue. Alan. On Nov 13, 2009, at 10:26 AM, Alan Gates wrote: Looks like it is missing from the distribution. You can see the file at http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.5/conf/pig.properties?revision

Re: Could pig dynamic change the reduce number according the mapper task number ?

2009-11-12 Thread Alan Gates
I agree that it would be very useful to have a dynamic number of reducers. However, I'm not sure how to accomplish it. MapReduce requires that we set the number of reducers up front in JobConf, when we submit the job. But we don't know the number of maps until getSplits is called after

Re: Could pig dynamic change the reduce number according the mapper task number ?

2009-11-12 Thread Alan Gates
to the jobtracker. No, it's a copy. Changes made in it don't end up affecting the job. Alan. ben Alan Gates wrote: I agree that it would be very useful to have a dynamic number of reducers. However, I'm not sure how to accomplish it. MapReduce requires that we set the number of reducers up

Re: Follow Up Questions: PigMix, DataGenerator etc...

2009-11-09 Thread Alan Gates
On Nov 8, 2009, at 7:08 AM, Rob Stewart wrote: snip So, Alan, you're correct, MapReduce, on its own does not provide me with loops, I have to wrap a loop around this MapReduce method getAllChildren() to get all children of john. When you say that I would have to wrap Java around Pig to

Re: Using elements of a tuple in other tuples FOREACH statement.

2009-11-03 Thread Alan Gates
I'm not sure I understand your question, but it sounds like you want to comingle data from two relations, X and Y without doing a join or cross. Is that correct? If so, you can't do that. If you have a script like: X = load 'file_data'; Y = load 'tuple_data'; Z = do something with X and

Re: Follow Up Questions: PigMix, DataGenerator etc...

2009-11-02 Thread Alan Gates
On Oct 31, 2009, at 11:22 AM, Rob Stewart wrote: snip Map and reduce parallelism are controlled differently in Hadoop. Map parallelism is controlled by the InputSplit. IS determines how many maps to start and which file blocks to assign to which maps. In the case of PigMix, both the MR

Re: tolowercase function?

2009-10-21 Thread Alan Gates
Check out LOWER in piggybank. Alan. On Oct 21, 2009, at 8:32 AM, Vincent Barat wrote: Hello, Quick question: is there a set of ready to use PIG UDFs functions ? I'm looking to TOLOWERCASE function... Cheers,

Re: Stream within a Group

2009-10-13 Thread Alan Gates
What you propose below will result in all of the records for a given group going to a single instance of sessions.pl. Alan. On Oct 13, 2009, at 4:04 PM, Paul B wrote: I'm setting up a pig job that needs to stream a grouped set of data to an instance of a perl script. I need to ensure

Re: [Help]Re:UDF Implementing a LoadFunc Interface

2009-10-02 Thread Alan Gates
It's dying when trying to write out the contents of the tuples that are in the bag. What is the schema of the tuples inside the bag? Alan. On Oct 1, 2009, at 8:37 PM, miryala vignesh wrote: Hie, I was implementing LoadFunc Interface , in that getNext() returns a tuple . I have a bag

Re: Storing Pig output into HBase tables

2009-09-18 Thread Alan Gates
/TableStorer.java in Pig's contrib directory. Alan. On Sep 9, 2009, at 6:20 PM, Liu Xianglong wrote: Hi, Alan. I am interest in this store function, could you mind sending me some details? -- From: Alan Gates ga...@yahoo-inc.com Sent: Thursday

Re: Issue with LoadFunc Slicer

2009-09-17 Thread Alan Gates
Kevin, Please take a look at the proposal for reworking load and store functions that was posted a couple of days ago and see if it will address your issues with plugability of load functions. http://wiki.apache.org/pig/LoadStoreRedesignProposal Alan. On Sep 14, 2009, at 8:58 AM, Kevin

Re: Using Pig to process Nutch files

2009-09-14 Thread Alan Gates
I don't know much about Nutch, or its format. If it is not a text format separated by some single character value (such as comma, tab, etc.) you'll need to write a load function to read it and parse it into Pig tuples. You can find more info on writing load functions at

Re: Is the any plan to support hadoop-0.20.0 and hbase-0.20.0 in the roadmap?

2009-09-11 Thread Alan Gates
Our plan is to switch Pig to Hadoop 0.20 as soon as Hadoop 0.20.1 is released, because there are some features in that release we would like to have. Last I knew 0.20.1 was in the vote phase to be released. Integration with hbase 0.20 will need someone to pick it up and work on it. I am

Re: Double logs in grunt and ^d don't work ?

2009-09-11 Thread Alan Gates
Pig uses jline to do command line editing in grunt, so it supports whatever jline supports. Alan. On Sep 11, 2009, at 7:46 AM, Vincent BARAT wrote: Ashutosh Chauhan a écrit : Do you mean using ^D to kill grunt and return to OS shell ? No, just ^d to delete the character just after the

Re: Storing Pig output into HBase tables

2009-09-09 Thread Alan Gates
I do not know if there is a general hbase load/import tool. That would be a good question for the hbase-user list. Right now Pig does not have a store function to write data into hbase. It is possible to write such a function. If you are interested I can send you specific details on how

Re: OutOfMemory Errors when loading a Gzip file

2009-09-08 Thread Alan Gates
How large are the records in your file? Do you expect any single record to be in the multi-megabyte size? Have you tried decompressing the file and reading it to see if the issue is the compression? Alan. On Sep 8, 2009, at 7:58 AM, Irfan Mohammed wrote: Hi, I am trying to load a large

Re: Nesting and Grouping by Multiple Fields

2009-09-08 Thread Alan Gates
In other mails you're using Pig's multi-query feature to group the same data different ways. Is that the same thing you're wanting to do here, or something different? Alan. On Sep 3, 2009, at 1:08 PM, zaki rahaman wrote: I have a set of logfiles that I'm parsing and analyzing using Pig in
