Re: Pig mascot questions

2021-07-27 Thread Alan Gates
On Tue, Jul 27, 2021 at 6:30 PM Cat Lee Ball wrote: > Hi everyone, > > I've been wondering and wanted to ask about the Apache Pig mascot: > > - > https://svn.apache.org/repos/asf/comdev/project-logos/originals/pig.svg > > > In particular: > > - Does anyone know if there's any history how this

Re: Co Group vs Join in pig

2016-09-29 Thread Alan Gates
Sep 28, 2016, at 17:06, Kashif Hussain wrote: > > Will a co group with filter be equivalent to join ? > I mean will pig optimize the former to achieve performance equivalent to > latter ? I assume that single map reduce job will be spawned in both cases. > > On Wed, Sep 28, 2

Re: Co Group vs Join in pig

2016-09-28 Thread Alan Gates
Cogroup is only the first half of join. It collects the records with the matching key together. It does not do the cross product of records with matching keys. If you are going to do a join (that is, you want to produce the matching records) join is usually better as there are a number of joi
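
As a rough illustration of the distinction drawn above, here is a plain-Python model (not Pig code, and not how Pig implements either operator). The `cogroup` and `inner_join` helpers and the `users`/`clicks` sample data are hypothetical names made up for this sketch: cogroup only collects matching records per key, and the join then takes the cross product within each key.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right, key=lambda rec: rec[0]):
    """Collect records from both inputs under their shared key (no pairing)."""
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[key(rec)][0].append(rec)
    for rec in right:
        groups[key(rec)][1].append(rec)
    return dict(groups)


def inner_join(left, right, key=lambda rec: rec[0]):
    """Finish the join: cross product of the two bags for each key."""
    joined = []
    for left_bag, right_bag in cogroup(left, right, key).values():
        # An empty bag on either side yields no pairs, as in an inner join.
        for l, r in product(left_bag, right_bag):
            joined.append(l + r)
    return joined


# Hypothetical sample data: (key, value) tuples.
users = [(1, "ann"), (2, "bob")]
clicks = [(1, "home"), (1, "cart"), (3, "faq")]

grouped = cogroup(users, clicks)    # one entry per key, bags kept separate
joined = inner_join(users, clicks)  # only key 1 has records on both sides
```

In this model the cogroup result still has an entry for keys 2 and 3 (with one empty bag each), while the join drops them.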

Re: How Tez work in Hive and Pig

2016-08-18 Thread Alan Gates
> On Aug 12, 2016, at 01:38, darion.yaphet wrote: > > Hi team : > > > We using Tez as our execute engine on hive and pig . I'm very curious about > how to Hive and pig use it to execute plan . > > > Is there some design document or implement detail about it ? thanks :) https://cwiki.apa

Re: Optimally assigning reducers

2016-07-06 Thread Alan Gates
My first guess is that your join has significant skew in the keys, so many are getting assigned to a single reducer. Have you tried the skew join algorithm[1]? Alan. 1. https://pig.apache.org/docs/r0.16.0/perf.html#skewed-joins > On Jul 6, 2016, at 08:55, Nigam, Vibhor wrote: > > Hi > > I a

Re: How does Pig Pass Data from First Job and its next Job

2015-09-15 Thread Alan Gates
Pig writes the data to disk in its own format. Given that in the cluster you don't know which machines tasks will run on, storing it in memory directly is not feasible. You can use something like HDFS's in-memory files (which Pig doesn't do yet) or Spark's RDDs for this. Alan. Argho Chatte

Re: Query | Join Internals

2015-07-30 Thread Alan Gates
Here's the original design doc: https://wiki.apache.org/pig/PigSkewedJoinSpec Alan. Gagan Juneja July 29, 2015 at 21:30 Any help? Regards, Gagan Gagan Juneja July 14, 2015 at 4:56 Hi Team, We are using Pig intensively in

Re: PigMix extension

2015-07-15 Thread Alan Gates
The initial goal of PigMix was definitely to give the project a way to measure itself against MapReduce and between different versions of releases. So that falls into your synthetic category. That said, if adding a field enables extending the benchmark into new territory and makes it more us

Re: pig 0.11.x release download

2015-04-15 Thread Alan Gates
https://archive.apache.org/dist/pig/ Alan. Alex Nastetsky April 15, 2015 at 3:45 Does anyone know where I can get a 0.11.x release of Pig? This site has 2 links -- one to releases 0.8 and later, and to 0.7 and earlier: https://pig.apache.org/releases.htm

Re: REGISTER ... with or without quotes

2015-04-09 Thread Alan Gates
Though I'm tempted to say the O'Reilly book is always right, the official stance on this is the one in the pig documentation on the website. Alan. Michael Howard April 9, 2015 at 9:43 Q: When using the REGISTER statement to register .jar files containing UDFs, s

Re: Help with HCatLoader against remote Hive2

2015-03-23 Thread Alan Gates
HiveMetaStore is definitely meant to be hit remotely. Your URI should be thrift://your.host.com:9083. Alan. Adam Silberstein March 22, 2015 at 17:40 Hi, Having some trouble getting hcatloader to work. My script is this: A = LOAD 'testTable' USING org.apache.hive.hc

Re: Pig Meetup at LinkedIn 3/14

2014-01-16 Thread Alan Gates
or countries like me. > > cheers, > > Joao > > > On Wed, Jan 15, 2014 at 3:39 PM, Alan Gates wrote: > >> A Pig Meetup is scheduled for March 14th. Planned talks include Pig on >> Tez, Pig on Storm, Intel Graph Builder, PigPen (MR for Clojure) and >> Accu

Pig Meetup at LinkedIn 3/14

2014-01-14 Thread Alan Gates
oup > Subject: Invitation: Pig User Meetup > Date: January 14, 2014 at 4:28:30 PM PST > To: ga...@apache.org > > > > > NEW MEETUP > Pig User Meetup > Pig user group > Added by Alan Gates > Friday, March 14, 2014 > 2:00 PM > LinkedIn > 2025 Stie

Re: Does Pig support HCatalogStorer table with buckets

2013-12-09 Thread Alan Gates
No. HCat explicitly checks if a table is bucketed, and if so disables storing to it to avoid writing to the table in a destructive way. Alan. On Dec 6, 2013, at 3:45 PM, Araceli Henley wrote: > Hi > > > : > > QUESTION: > > : > > Can anyone confirm if HCatalogStore works wit

Re: Bag of tuples

2013-11-06 Thread Alan Gates
Do you mean you want to find the top 5 per input record? Also, what is your ordering criteria? Just sort by id? Something like this should order all tuples in each bag by id and then produce the top 5. My syntax may be a little off as I'm working offline and don't have the manual in front of

Re: support for distributed cache archives

2013-11-04 Thread Alan Gates
I don't see why we couldn't. Step one would be to file a JIRA. After that, if you have the time and inclination feel free to provide a patch for it. Alan. On Nov 1, 2013, at 10:31 PM, Jim Donofrio wrote: > Any thoughts on this? > > On 10/22/2013 10:36 AM, Jim Donofrio wrote: >> JobControlCom

Re: convert rows to columns in Pig

2013-10-21 Thread Alan Gates
I think the following will do what you want: A = load 'input'; B = group A all; C = foreach B generate flatten(BagToTuple(A)); Note that this will collect all records into one bag and produce one output record. That won't scale well, and may not be what you want. Alan. On Oct 18, 2013, at 8:3

Re: number of M/R jobs for a Pig Script

2013-10-15 Thread Alan Gates
Pig handles doing multiple group bys on the same input, often in a single MR job. So: A = load 'file'; B = group A by $0; C = foreach B generate group, COUNT(A); store C into 'output1'; D = group A by $1; E = foreach D generate group, COUNT(A); store E into 'output2'; This can be done in a sing

Re: Accessig paritcular folder

2013-10-04 Thread Alan Gates
For any Pig loader that reads files from HDFS, filenames are passed directly to HDFS. This means HDFS style globs are supported, which means the answer to your question depends on the version of HDFS you have. For your version of Hadoop, take a look at the documentation for FileSystem.globStat

Re: piglipstick

2013-09-06 Thread Alan Gates
On the Hive side, the Netflix team recently told me they are working on "honey", an equivalent thing for Hive. I believe a prototype is in their github. Alan. On Sep 5, 2013, at 11:40 PM, ajay kumar wrote: > Hi all, > any one worked on piglipstick? > > Please share some info about piglipstic

Re: Grunt Shell hangs on Cygwin.

2013-08-08 Thread Alan Gates
prompt? > > Regards, > Darpan > > On 6 August 2013 02:34, Alan Gates wrote: > >> You might try running Pig trunk without cygwin. Much work has been done >> lately to make Pig work directly on windows. >> >> Alan. >> >> On Aug 4, 2013, at 9:49 PM

Re: Grunt Shell hangs on Cygwin.

2013-08-05 Thread Alan Gates
You might try running Pig trunk without cygwin. Much work has been done lately to make Pig work directly on windows. Alan. On Aug 4, 2013, at 9:49 PM, Darpan R wrote: > Thanks Sudhir, > I tried running scripts , it takes a long time to start pig and stop ( > setup/cleanup) . > Please keep us u

Re: Execute multiple PIG scripts parallely

2013-07-22 Thread Alan Gates
If you write your scripts as one large Pig script Pig will execute them in parallel. You can keep from confusing your individual scripts by writing one master script that has imports (see http://pig.apache.org/docs/r0.11.1/cont.html#import-macros ). You just need to make sure your various scr

Re: something about builtin.TOP

2013-07-22 Thread Alan Gates
Agreed. Please file a JIRA on this. Alan. On Jul 22, 2013, at 1:57 AM, Qian, Chen(AWF) wrote: > Hi all, > > builtin.TOP() function can't ignore NULL value, it'll lead to NULL Pointer > error. > > That doesn't make sense > > Best, > Ned >

Re: DISTINCT and paritioner

2013-07-18 Thread Alan Gates
You're correct. It looks like an optimization was put in to make distinct use a special partitioner which prevents the user from setting the partitioner. Could you file a JIRA against the docs so we can get that fixed? Alan. On Jul 17, 2013, at 11:27 AM, William Oberman wrote: > The docs say

Re: Which Pig Version with Hadoop 0.22

2013-07-17 Thread Alan Gates
We have never produced a release that works with Hadoop 0.22. There were some patches for it, see https://issues.apache.org/jira/browse/PIG-2277 You might be able to build your own version. Alan. On Jul 17, 2013, at 10:41 AM, vivek thakre wrote: > Hello All, > > Which Apache Pig Release woul

Re: question about syntax for nested evaluations using bincond

2013-07-15 Thread Alan Gates
No, both are equally correct. == has higher precedence than ?: Alan. On Jul 5, 2013, at 1:39 PM, mark meyer wrote: > hello, > > i am new to pig and have a question regarding the syntax arrangement for > nested evaluations using bincond. > > both of these seem to work and produce identical re

Re: join with 2 skewed tables - a suggestion

2013-06-19 Thread Alan Gates
On Jun 17, 2013, at 7:24 AM, Ido Hadanny wrote: > Hey, > > We noticed that the current skewed join supports only 1 skewed table, and > assumes that the second table isn't skewed. > Please review this suggestion for a 2 skewed tables design: > > - Sample both tables > - for each skewed key (

Fwd: DesignLounge @ HadoopSummit

2013-06-12 Thread Alan Gates
Begin forwarded message: > From: Eric Baldeschwieler > Date: June 11, 2013 10:46:25 AM PDT > To: "common-...@hadoop.apache.org" > Subject: DesignLounge @ HadoopSummit > Reply-To: common-...@hadoop.apache.org > > Hi Folks, > > We thought we'd try something new at Hadoop Summit this year to bu

Re: Single Output file from STORE command

2013-05-28 Thread Alan Gates
Nothing that uses MapReduce as an underlying execution engine creates a single file when running multiple reducers because MapReduce doesn't. The real question is if you want to keep the file on Hadoop, why worry about whether it's a single file? Most applications on Hadoop will take a directo

Fwd: Hadoop In Seoul 2013 Conference Calls For Speakers

2013-05-21 Thread Alan Gates
Begin forwarded message: > From: "Edward J. Yoon" > Date: May 21, 2013 1:29:06 AM PDT > To: gene...@hadoop.apache.org > Subject: Hadoop In Seoul 2013 Conference Calls For Speakers > Reply-To: gene...@hadoop.apache.org > > Hi, > > I'm planning the Hadoop In Seoul 2013 Open Conference with some

Re: PIG: Transform based on value in field

2013-05-14 Thread Alan Gates
B = foreach A generate a1, (a2 == 0 ? a2 + 1 : a2) as a2, a3; Alan. On May 14, 2013, at 9:10 AM, Ashish Gupta wrote: > I want to something like this > > B = FOREACH A GENERATE a1, *if a2 = 0: a2=a2+1 else a2*, a3) > > how to do " if a2 = 0: a2=a2+1 else a2" in PIG > > (or it could be "if a2 m

Re: Pig Unique Counts on Multiple Subsets of a Large Input

2013-05-06 Thread Alan Gates
In the script you gave I'd be surprised if it's spending time in the map phase, as the map should be very simple. It's the reduce phase I'd expect to be very expensive because your mapping UDF prevents Pig from using the algebraic nature of count (that is, it has to ship all of the records to r
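
To make the "algebraic nature of count" concrete, here is a small Python model (a sketch only; the function names `count_initial` and `count_final` and the sample splits are hypothetical and are not Pig's Algebraic interface). A count can be computed as per-split partial counts that are then summed, so only one number per split needs to reach the reducer instead of every record.

```python
def count_initial(split):
    """Map/combine side: count the records in one input split."""
    return len(split)


def count_final(partials):
    """Reduce side: add the per-split partial counts together."""
    return sum(partials)


# Hypothetical input, already divided into three map splits.
splits = [["a", "b", "c"], ["d"], ["e", "f"]]

partials = [count_initial(s) for s in splits]   # one small number per split
algebraic_total = count_final(partials)

# Non-algebraic path: every record is shipped to one place, then counted.
naive_total = len([rec for s in splits for rec in s])
```

Both paths give the same total; the difference is how much data has to move before the final step.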

Re: Hbase Hex Values

2013-05-06 Thread Alan Gates
I am not aware of any built in or Piggybank UDF that converts Hex to Int, but it would be a welcome contribution if you wanted to write it. Alan. On May 5, 2013, at 8:14 PM, John Meek wrote: > Hey all, > > If I need to load a Hbase table with Hex values into Pig, does that require a > specifi

Fwd: CfP 2013 Workshop on Middleware for HPC and Big Data Systems (MHPC'13)

2013-04-25 Thread Alan Gates
Begin forwarded message: > From: MHPC 2013 > Date: April 24, 2013 10:23:55 AM PDT > To: u...@hadoop.apache.org > Subject: Fwd: CfP 2013 Workshop on Middleware for HPC and Big Data Systems > (MHPC'13) > Reply-To: u...@hadoop.apache.org > > > we apologize if you receive multiple copies of this

Re: long parse time

2013-03-29 Thread Alan Gates
What version of Pig are you using? Unreasonably long parse times were an issue in Pig 0.9 and 0.10; I believe those issues were fixed in Pig 0.11. Alan. On Mar 28, 2013, at 12:51 PM, Patrick Salami wrote: > We have some very long pig scripts that run several times per day. We > believe that th

Re: Reaching source code

2013-03-14 Thread Alan Gates
You can use explain to show you the plan Pig will use to execute your script. This won't show you the exact Java code. If you want to find out exactly what Java code is running for a particular operator the easiest thing to do is probably run the query in local mode and attach a debugger. Ala

Re: How Pig generates DAG

2013-02-25 Thread Alan Gates
, 2013, at 1:39 PM, Preeti Gupta wrote: >> a set of MapReduce jobs > > On Feb 25, 2013, at 1:35 PM, Alan Gates wrote: > >> Pig generates several DAGs (a logical plan, a physical plan, a set of >> MapReduce jobs). Which one are you interested in? >> >> Al

Re: How Pig generates DAG

2013-02-25 Thread Alan Gates
Pig generates several DAGs (a logical plan, a physical plan, a set of MapReduce jobs). Which one are you interested in? Alan. On Feb 25, 2013, at 12:02 PM, Preeti Gupta wrote: > Hi, > > I need to do some modifications here and need to know how Pig generates DAG. > Can someone throw some li

Re: Just started

2013-02-24 Thread Alan Gates
For books, check out http://www.amazon.com/Programming-Pig-Alan-Gates/dp/1449302645/ref=sr_1_1?ie=UTF8&qid=1361724828&sr=8-1&keywords=programming+pig There's also pretty good docs on pig.apache.org under the documentation tab. Alan. On Feb 24, 2013, at 8:44 AM, William Kang

Re: Reduce Tasks

2013-02-01 Thread Alan Gates
Setting mapred.reduce.tasks won't work, as Pig overrides it. See http://pig.apache.org/docs/r0.10.0/perf.html#parallel for info on how to set the number of reducers in Pig. Alan. On Feb 1, 2013, at 4:53 PM, Mohit Anchlia wrote: > Just slightly different problem I tried setting SET mapred.red

Re: Run a job async

2013-01-24 Thread Alan Gates
> blocks until pig job is complete. > > Sent from my iPhone > > On Jan 24, 2013, at 9:31 AM, Alan Gates wrote: > >> If you're looking for an app server for Pig I'd take a look at a couple of >> other projects already out there that can do this: >>

Re: Run a job async

2013-01-24 Thread Alan Gates
If you're looking for an app server for Pig I'd take a look at a couple of other projects already out there that can do this: 1) webhcat (fka Templeton, now part of the HCatalog project). It provides a REST API that launches Pig, Hive, or MR jobs and allows you to manage them, get results, etc

Re: Hard-coded inline relations

2013-01-24 Thread Alan Gates
I agree this would be useful for debugging, but I'd go about it a different way. Rather than add new syntax as you propose, it seems we could easily create an inline loader, so your script would look something like: A = load '{(Hello), (World)}' using InlineLoader(); dump A; Alan. On Jan 18,

Re: Pig error

2013-01-15 Thread Alan Gates
Could you share your script or a script that gets this error message? Alan. On Jan 14, 2013, at 2:19 PM, Phanish Lakkarasu wrote: > Hi all, > > When am using JOIN operator in pig, am getting following error > > Pig joins inner plans can only have one output leaf? > > Can any one tell me, why

Re: JsonLoader schema field order shouldn't matter

2013-01-08 Thread Alan Gates
ted by something other than Pig since the ordering might change. > What do you think? > > I didn't see a bug for it in Jira, so would this (still open) one be > the place to mention it? Or should I make a new one? > https://issues.apache.org/jira/browse/PIG-1914 > > ~T > >

Re: JsonLoader schema field order shouldn't matter

2013-01-07 Thread Alan Gates
Currently the JsonLoader does assume ordering of the fields. It does not do any name matching against the given schema to find the right field. Alan. On Jan 7, 2013, at 11:56 AM, Tim Sell wrote: > When using JsonLoader with Pig 0.10.0 > > if I have an input.json file that looks like this: >

Re: Multiple input file

2012-12-22 Thread Alan Gates
Yes. See http://pig.apache.org/docs/r0.10.0/basic.html#load for a discussion of how to use globs in file paths. Alan. On Dec 21, 2012, at 10:38 PM, Mohit Anchlia wrote: > Is it possible to load multiple files in the same load command? I have > files in different path that I need to load, is t

Re: pig ship tar files

2012-12-20 Thread Alan Gates
See http://pig.apache.org/docs/r0.10.0/basic.html#define-udfs especially the section on SHIP. Alan. On Dec 20, 2012, at 10:01 AM, Danfeng Li wrote: > I read alot of about pig can ship a tar file and untar it before execution. > However, I couldn't find any example. Can someone provide an examp

Re: Do we have any plan for "Cost based optimizer"?

2012-12-06 Thread Alan Gates
I am not aware of any work going on for this or plans in this area at the moment. Alan. On Dec 4, 2012, at 6:32 PM, lulynn_2008 wrote: > Hi All, > > I just noticed that Pig Committer DaiJianYong has mentioned "Cost based > optimizer" for pig performanceoptimization. > My question are: > Do we

Re: Physical Plan

2012-11-26 Thread Alan Gates
No, it need not be binary. A split can have multiple children. Alan. On Nov 17, 2012, at 4:32 PM, Sarah Mohamed wrote: > Is the Physical Plan binary tree ? (i.e. Could any node have more than two > Physical Operators child ?) > -- > Regards, > Sarah M. Hassan

Re: computing avg in pig

2012-11-06 Thread Alan Gates
A = load 'input_file'; B = group A all; C = foreach B generate AVG(A.$1); This groups all of your records into one bag and then takes the average of the second column. Alan. On Nov 6, 2012, at 11:19 AM, jamal sasha wrote: >> I have data in format > >> >> >>1,1.2 >> >>2,1.3 >> >>

Re: CONCAT(null, "something") == NULL ?

2012-11-05 Thread Alan Gates
; We are documenting them, but apparently, many users find it confusing. I am > wondering if there is anything that we can do better. > > Thanks, > Cheolsoo > > On Fri, Nov 2, 2012 at 3:33 PM, Alan Gates wrote: > >> To give some context, the null semantics in Pig follow

Re: CONCAT(null, "something") == NULL ?

2012-11-02 Thread Alan Gates
To give some context, the null semantics in Pig follow SQL's. In SQL, null is viral, so any operation with null results in null. The idea is that null means unknown, not empty. So concat('x', unknown) = unknown. Alan. On Nov 2, 2012, at 3:09 PM, Yang wrote: > looks a more intuitive result s
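
The "null is viral" rule described above can be modeled in a few lines of Python, with `None` standing in for null. This is only a sketch of the semantics as stated in the message; the `concat` helper is a hypothetical stand-in, not Pig's CONCAT implementation.

```python
def concat(a, b):
    """Model of SQL-style null propagation: unknown in, unknown out."""
    if a is None or b is None:
        return None
    return a + b


known = concat("x", "y")        # both inputs known, so the result is known
unknown = concat("x", None)     # one input unknown, so the result is unknown
also_unknown = concat(None, "something")
```

Reading null as "unknown" rather than "empty" is what makes the second and third results null instead of the other operand.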

Re: Is that possible to use Pig to do an optimized secondary sort.

2012-10-31 Thread Alan Gates
Seeing your Pig Latin script will help us determine whether this will work in your case. But in general Pig uses secondary sort when you do an order by in a nested foreach. So if you are grouping you could order within that group and then pass it to your UDF. Alan. On Oct 31, 2012, at 1:20 A

Re: Reading fixed width files in pig

2012-10-26 Thread Alan Gates
I am not aware of any. Alan. On Oct 23, 2012, at 6:03 AM, ranjith raghunath wrote: > Team, > > Are any out of the box load functions for fixed width files?

Re: Welcome our newest committer Cheolsoo Park

2012-10-26 Thread Alan Gates
Welcome Cheolsoo, and well deserved. Alan. On Oct 26, 2012, at 2:54 PM, Julien Le Dem wrote: > All, > > Please join me in welcoming Cheolsoo Park as our newest Pig committer. > He's been contributing to Pig for a while now, helping fixing the > build and improve Pig. We look forward to him bein

Re: FOREACH GENERATE Conditional?

2012-10-24 Thread Alan Gates
Are you sure Pig is spawning extra map jobs for this? The multi-query optimizer should be pushing these back together into one job. If it isn't, you should be able to accomplish the same thing with trinary logic and a single filter: all = foreach main_set ((blah == 'a' and meh == 'b') ? 'likes

Re: About full pipeline between pig jobs

2012-10-22 Thread Alan Gates
At this point, no. In the current MapReduce infrastructure it would take a lot of hackery that breaks the MR abstraction to make this work[1]. This is one thing we'd like to do as we move Pig to work on Hadoop 2.0 (aka YARN) where it is easier for applications to build these types of features.

Re: There's nothing like an "#include" statement for splicing common text into a pig script, right?

2012-10-11 Thread Alan Gates
See http://pig.apache.org/docs/r0.10.0/cont.html#import-macros Alan. On Oct 11, 2012, at 7:36 AM, Trager Corey wrote: > Several scripts start by loading the same file. I'd like to have the text > for the field names and types in one place. Doable? > > > The i

Re: Decide if function is algebraic at planning phase

2012-10-09 Thread Alan Gates
There is one way you could shoe-horn this in. EvalFuncs can implement funcToArgMapping, which is built to allow functions to pick a different instance of themselves for different types (e.g. SUM(long) vs SUM(double)). You could implement your logic in this function and then return an EvalFunc

Re: Question about UDFs and tuple ordering

2012-10-05 Thread Alan Gates
Many operators, such as join and group by, are not implemented by a single physical operation. Also, they are spread through the code as they have logical components and physical components. The logical components of join are in org.apache.pig.newplan.logical.relational.LOJoin.java. That gets

Re: Loading text file

2012-10-03 Thread Alan Gates
There is not a pre-built load function to do that. In fact I am not aware of a Hadoop InputFormat that does that. So you would first need to subclass Hadoop's FileInputFormat and then write a Load Func. Both should be fairly straightforward since all you need to do is remove the record and f

Re: Using matches in generate clause?

2012-09-27 Thread Alan Gates
>} >catch(Exception e) >{ > throw WrappedIOException.wrap("ouch!", e); >} > } > } > > > and use it just like this: > > b = foreach html_pages generate portal_id, MyMatch('some pattern', html) as > wp_match; > >

Re: Using matches in generate clause?

2012-09-27 Thread Alan Gates
What version of Pig are you using? Alan. On Sep 27, 2012, at 8:54 AM, James Kebinger wrote: > Hello, I'm having some trouble doing something I thought would be easy: I'd > like to use matches to generate a boolean flag but this seems to not > compile: > > FOREACH html_pages GENERATE portal_id,

Re: How can I access secure HBase in UDF

2012-09-25 Thread Alan Gates
You can use the UDFContext to pass information for the UDF in the JobConf without writing files. Alan. On Sep 25, 2012, at 10:48 AM, Rohini Palaniswamy wrote: > Ray, > Looking at the EvalFunc interface, I can not see a way or loophole to do > it. EvalFunc does not have a reference to Job o

Re: Removing unnecessary disambiguation marks

2012-09-18 Thread Alan Gates
The added foreach will not generate another MR job. Alan. On Sep 18, 2012, at 8:54 AM, Ruslan Al-Fakikh wrote: > Hey, > > You can try cleaning in a separate FOREACH. I don't think it'll > trigger another MR job, but you better check it. > Example: > resultCleaned = FOREACH result GENERATE >

Re: How to force the script finish the job and continue the follow script?

2012-09-16 Thread Alan Gates
'exec' will force your job to start. However, I strongly doubt this will solve your OOME problem, as some one part of your job is running out of memory. Whichever part that is will still fail I suspect. Pig jobs don't generally accrue memory as they go since most memory intensive operations a

Re: access schema defined in LOAD statement in custom LoadFunc?

2012-09-15 Thread Alan Gates
Unfortunately, no. I agree we should add that to the LoadFunc interface. Alan. On Sep 15, 2012, at 1:13 AM, Jim Donofrio wrote: > Is there anyway within a LoadFunc to access the schema that a user defines > after AS in a LOAD statement? Is there some property I can access in the > UDFContext

Re: Json and split into multiple files

2012-09-12 Thread Alan Gates
I don't understand your use case or why you need to use exec or outputSchema. Would it be possible to send a more complete example that makes clear why you need these? Alan. A tuple can contain a tuple, so it's certainly possible with outputSchema() to generate a schema that declares both you

Re: Storing field in a bag

2012-09-10 Thread Alan Gates
You can achieve equivalent functionality by saying: page = foreach b generate page; store page into '/flume_vol/flume/input/page.dat'; network = foreach b generate network; store network into '/flume_vol/flume/input/network.dat'; Alan. On Sep 10, 2012, at 4:05 PM, Ruslan Al-Fakikh wrote: > Hi, M

Re: Json and split into multiple files

2012-09-06 Thread Alan Gates
Loading the JSON below should give you a Pig record like: (user: tuple(id: int, name: chararray), product: tuple(id: int, name:chararray)) In that case your Pig Latin would look like: A = load 'inputfile' using JsonLoader() as (user: tuple(id: int, name: chararray), product: tuple(id: int, name:

Re: Count of all the rows

2012-09-04 Thread Alan Gates
ch-all key >> "all"), so that you can run a function on them. >> >> Don't know if that helps. >> >> 2012/8/29 Mohit Anchlia >> >>> Thanks! Why is grouping necessary? Is it to send it to the reducer? >>> >>> On Wed,

Re: Count of all the rows

2012-08-29 Thread Alan Gates
Why is grouping necessary? Is it to send it to the reducer? > > On Wed, Aug 29, 2012 at 4:03 PM, Alan Gates wrote: > >> A = load 'foo'; >> B = group A all; >> C = foreach B generate COUNT(A); >> >> Alan. >> On Aug 29, 2012, at 3:51 PM, Mohit

Re: Count of all the rows

2012-08-29 Thread Alan Gates
A = load 'foo'; B = group A all; C = foreach B generate COUNT(A); Alan. On Aug 29, 2012, at 3:51 PM, Mohit Anchlia wrote: > How do I get count of all the rows? All the examples of COUNT use group by.

Re: Help with Log Processing

2012-08-24 Thread Alan Gates
The issue you're going to run into is that Pig's default load function uses FileInputFormat, which always divides records on line end. You could clone FileInputFormat and twiddle your version to break on paragraph ends instead of line ends. You could then make a version of PigStorage that uses

Re: add a field, ordered

2012-08-23 Thread Alan Gates
Take a look at https://issues.apache.org/jira/browse/PIG-2353 I believe that's the JIRA for where they're doing the work. Alan. On Aug 14, 2012, at 3:38 AM, Lauren Blau wrote: > Is the source for it available in the development area? I'd be happy to > help if I can. > Lauren > > On Tue, Aug 1

Re: Fallback for output data storage

2012-08-23 Thread Alan Gates
You can simply store the data twice at the end of your script. Pig will split it and send it to both. It shouldn't fail the HDFS storage if the dbstorage fails (but test this first to make sure I'm correct.) So your script would look like: A = load ... store Z into 'db' using DBStorage(); sto

Re: Issues with Bincond

2012-08-22 Thread Alan Gates
Use "is null" instead of "== null". Equality, inequality, boolean, and arithmetic operators that encounter a null returning null is standard trinary logic. The only possible answer to "is this equal to an unknown" is "unknown". Alan. On Aug 22, 2012, at 11:43 AM, Alex Rovner wrote: > Thanks
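
A small Python model of the three-valued logic described above may help show why "== null" never selects anything. `None` stands in for null, the helpers `eq` and `is_null` and the sample `rows` are hypothetical, and the filter behavior assumed here (only rows whose condition is true are kept) is the standard SQL-style rule.

```python
def eq(a, b):
    """Three-valued equality: comparing with an unknown gives unknown."""
    if a is None or b is None:
        return None
    return a == b


def is_null(a):
    """The null test always answers true or false, never unknown."""
    return a is None


# Hypothetical rows of (name, value).
rows = [("a", 1), ("b", None), ("c", 2)]

# A filter keeps a row only when its condition is true; unknown is not true.
kept_by_eq_null = [r for r in rows if eq(r[1], None) is True]
kept_by_is_null = [r for r in rows if is_null(r[1])]
```

The equality version keeps no rows at all, because every comparison against null is itself null; the `is_null` version keeps the one row whose value is missing.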

Re: Pig as Connector with MongoDB and Node.js

2012-08-22 Thread Alan Gates
ig blog. This is consistent with common practice. > > The real point here is to get common place to recognize, index and > distribute blog post HOWTOs as documentation. If there's value in the > post, we should reblog it with a link back. > > Russell Jurney http://datasyndrome.c

Re: Pig as Connector with MongoDB and Node.js

2012-08-21 Thread Alan Gates
n decouple the community interests from the corporate interests. > > Thoughts? > > Santhosh > > > > From: Alan Gates > To: user@pig.apache.org > Sent: Friday, August 17, 2012 3:20 PM > Subject: Re: Pig as Connector with MongoD

Re: runtime exception when load and store multiple files using avro in pig

2012-08-21 Thread Alan Gates
Moving it into core makes sense to me, as Avro is a format we should be supporting. Alan. On Aug 21, 2012, at 6:03 PM, Cheolsoo Park wrote: > Hi Dan, > > Glad to hear that it worked. I totally agree that AvroStorage can be > improved. In fact, it was written for Pig 0.7, so it can be written m

Re: Pig as Connector with MongoDB and Node.js

2012-08-17 Thread Alan Gates
blog, and > of course, if people feel uncomfortable they should voice that opinion. > > I think it is good to show that a variety of people use Pig, and I mean, > it's not really a surprise that Pig is developed, used, and promoted by > corporations :) > > 2012/8/17 Ala

Re: Pig as Connector with MongoDB and Node.js

2012-08-17 Thread Alan Gates
I'm happy to repost these kinds of blog entries on the Pig blog. But one thing we as a community need to decide is how we want to handle references to corporate blogs. My proposal would be that any entries supporting and promoting Apache Pig should be allowed. But I have an obvious conflict o

Re: Distributed accumulator functions

2012-08-13 Thread Alan Gates
On Aug 13, 2012, at 9:05 AM, Benjamin Smedberg wrote: > I'm a new-ish pig user querying data on an hbase cluster. I have a question > about accumulator-style functions. > > When writing an accumulator-style UDF, is all of the data shipped to a single > machine before it is reduced/accumulated?

Re: FileAlreadyExistsException while running pig

2012-08-10 Thread Alan Gates
Usually that means the directory you are trying to store to already exists. Pig won't overwrite existing data. You should either move or remove the directory or change the directory name in your store function. Alan. On Aug 9, 2012, at 7:42 PM, Haitao Yao wrote: > hi, all > I got t

Re: User Defined Comparator

2012-08-09 Thread Alan Gates
There isn't a replacement for ComparisonFunc. That was written before Pig had types so that users could do type specific comparison functions. With the addition of types it was felt that ComparisonFunc was no longer necessary. That said, it's never been removed. The testing is limited at th

Next Pig Hackathon

2012-07-30 Thread Alan Gates
Hortonworks will be hosting the next Pig Hackathon on August 24th. http://www.meetup.com/PigUser/events/75286212/ The agenda: - Help newcomers get started on their first UDF or patch and walk through the Apache submission process - Get the committers to look at patches that are ready but have

Re: Trunk version does not like my macros

2012-07-26 Thread Alan Gates
Apache mail servers strip attachments. Could you post your script somewhere or send it inline? Alan. On Jul 26, 2012, at 7:41 AM, Alex Rovner wrote: > Gentlemen, > > We have recently attempted to compile and use the latest trunk code and have > encountered a rather strange issue. Our job whi

Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix

2012-07-26 Thread Alan Gates
How will you handle ambiguities when there is an A::b and B::b? Alan. On Jul 26, 2012, at 6:54 AM, Alex Rovner wrote: > I am proposing to patch avrostorage to have an option of storing field names > without their relation name. A::b will be saved as "b". > > Thoughts? > > Sent from my iPhone

Re: when Algebraic UDF are used ?

2012-07-25 Thread Alan Gates
It can't use the algebraic interface in this case because the data has to be sorted (which means it has to see all the data) before passing it to your UDF. If you remove the ORDER statement then the algebraic portion of your UDF will be invoked. Alan. On Jul 25, 2012, at 9:32 AM, Benoit Mathi

Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix

2012-07-25 Thread Alan Gates
Basically you need to transform the schema, not the data. The easiest way I can think of to do that is to use a UDF that has an outputSchema function that renames columns. The exec call can then be a simple pass through. If you wanted to you could have it consolidate the join keys. You impl

Re: None. wtf is None?

2012-07-24 Thread Alan Gates
Can you attach a sample of the input data? I'm guessing None came from the input data. Alan. On Jul 23, 2012, at 10:49 PM, Russell Jurney wrote: > Can someone explain this script to me? It is freaking me out. When did Pig > start spitting out 'None' in place of null? > > register /me/pig/bu

Re: Can't JOIN self?

2012-07-20 Thread Alan Gates
It isn't a bug that you need to declare the join twice in your script. That is necessary for clarity and semantic correctness. That is, if we allowed: A = load 'bla'; B = join A by user, A by user; then you'd have two user fields in the B with no way to disambiguate. What's a bug (or missed

Re: apache tar releases don't contain piggybank as a jar

2012-07-16 Thread Alan Gates
The big reason is we'd like to split off piggybank into a separate source control system (like github) rather than keeping it in Pig proper. Given this, it doesn't make sense to be releasing piggybank with Pig. Alan. On Jul 12, 2012, at 9:37 AM, David Capwell wrote: > Is there a reason that p

Re: Join with greater/less then condition

2012-07-05 Thread Alan Gates
Pig can only do equi-joins. Theta joins are hard in MapReduce. So the way to do this is do the equi-join and then filter afterwards. This will not create significant additional cost since the join results will be filtered before being materialized to disk. C = Join table_a on user_id, title_
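
The "equi-join, then filter" approach described above can be sketched in plain Python. This is a model of the idea only, not Pig code: the `equi_join` helper, the `views`/`purchases` data, and the `user_id`/`ts` field names are all hypothetical and do not come from the original thread.

```python
def equi_join(left, right, left_key, right_key):
    """Hash join on equality of one key field; returns (left, right) pairs."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [(l, r) for l in left for r in index.get(l[left_key], [])]


# Hypothetical inputs: page views and purchases, each with a timestamp.
views = [
    {"user_id": 1, "ts": 10},
    {"user_id": 1, "ts": 50},
    {"user_id": 2, "ts": 20},
]
purchases = [{"user_id": 1, "ts": 30}, {"user_id": 2, "ts": 5}]

# Step 1: join on the equality part of the condition only.
matched = equi_join(views, purchases, "user_id", "user_id")

# Step 2: apply the less-than part as an ordinary filter on the joined pairs.
before_purchase = [(v, p) for v, p in matched if v["ts"] < p["ts"]]
```

The inequality never has to be part of the join itself; it only prunes pairs that the equality join has already produced.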

Re: One file with sorted results.

2012-07-03 Thread Alan Gates
You can set different parallel levels at different parts of your script by attaching parallel to the different operations. For example: Y = join W by a, X by b parallel 100; Z = order Y by a parallel 1; store Z into 'onefile'; If your output is big I would suggest trying out ordering in paralle

Re: Best Practice: store depending on data content

2012-07-02 Thread Alan Gates
on tools, etc. And it does not bind you to one storage format. That said, if you don't need any of these things Avro may be a good solution for your situation. Definitely choose the tool that best fits your need. Alan. > > Best Regards, > Ruslan > > On Fri, Jun 29, 201

Re: Best Practice: store depending on data content

2012-06-29 Thread Alan Gates
On a different topic, I'm interested in why you refuse to use a project in the incubator. Incubation is the Apache process by which a community is built around the code. It says nothing about the maturity of the code. Alan. On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote: > Hi Markus, >

Re: modulize pig scripts via 'run'; pass param containing special chars

2012-06-29 Thread Alan Gates
Does putting the parameters in a file using -param_file help? Alan. On Jun 27, 2012, at 9:02 AM, Markus Resch wrote: > Hey everyone, > > we're still using CDH3u3 pig (0.8.1). > As out pig scripts are growing we like to split them to modules and call > them via run. the parameter substitution
