Re: generic union types in piggybank

2014-03-26 Thread Stan Rosenberg
Hi Cheolsoo, Thanks for your reply! (Liang and I work together.) The restriction to "simple" union types is still there in the latest code; see lines 83-95, here: https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/avro/AvroStorageSchemaConversionUtilities.java I know that ele

Re: Introducing Parquet: efficient columnar storage for Hadoop.

2013-03-12 Thread Stan Rosenberg
Dmitriy, Please excuse my ignorance. What is/was wrong with trevni (https://github.com/cutting/trevni) ? Thanks, stan On Tue, Mar 12, 2013 at 11:45 AM, Dmitriy Ryaboy wrote: > Fellow Hadoopers, > > We'd like to introduce a joint project between Twitter and Cloudera > engineers -- a new column

Re: How can I set the mapper number for pig script?

2012-06-23 Thread Stan Rosenberg
On Sat, Jun 23, 2012 at 3:30 AM, Sheng Guo wrote: > I know it is automatically set. But I have a large data set, and I want it to > allocate more mappers during midnight so that more computing resources can > be used to speed things up. > Any suggestions? Pig uses CombineInputFormat by default, which attempts
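The knob that governs this combining behaviour is worth showing concretely. A minimal sketch, assuming the standard split-combining properties (pig.splitCombination, pig.maxCombinedSplitSize); the 128 MB figure and the input path are purely illustrative:

  -- shrink the combined-split cap so more map tasks are created
  SET pig.splitCombination true;
  SET pig.maxCombinedSplitSize 134217728;  -- 128 MB; lower it for more mappers
  A = LOAD 'input_data' USING PigStorage('\t');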

running pig on remote cluster

2012-06-08 Thread Stan Rosenberg
Hi, I am trying to submit a pig job to a remote cluster by setting mapred.job.tracker and fs.default.name accordingly. The job does get executed on the remote cluster; however, all intermediate output is stored on the local cluster from which pig is run. From the job configuration I can see that that

Re: LIMIT operator doesn't work with variables

2012-04-10 Thread Stan Rosenberg
I believe the syntax of LIMIT does not admit an arbitrary expression; it only admits constants. At least this is what the documentation says. stan On Tue, Apr 10, 2012 at 4:33 PM, James Newhaven wrote: > Hi, > > I am trying to limit the output size using LIMIT. I want the limit > size to
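For reference, the form the 2012-era documentation describes is a literal count. A minimal sketch; the file, field names, and the figure 100 are illustrative assumptions:

  A = LOAD 'scores.txt' USING PigStorage('\t') AS (id:chararray, score:int);
  B = ORDER A BY score DESC;
  top = LIMIT B 100;  -- the operand must be a constant, not an expression or variable
  DUMP top;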

Re: Schema mismatch for files with changing avro schemas

2012-04-05 Thread Stan Rosenberg
AFAIK, by default AvroStorage enforces that all input files have exactly the same schema. I've submitted a patch to improve this somewhat by allowing different input schemas so long as a union schema can be derived; e.g., say schema 1 contains field 'foo' which is not in schema 2, and schema 2 con
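A sketch of how the patched loader is meant to be used, assuming the piggybank AvroStorage class; the path and the field name 'foo' are assumptions for illustration, not taken from the thread:

  -- files under the directory may carry either schema version; the patched loader
  -- derives a union schema, and fields absent from a given file come back as null
  events = LOAD '/data/events' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
  DESCRIBE events;
  with_foo = FILTER events BY foo IS NOT NULL;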

Re: Working with changing schemas (avro) in Pig

2012-03-28 Thread Stan Rosenberg
There is a patch for Avro to deal with this use case: https://issues.apache.org/jira/browse/PIG-2579 (See the attached pig example which loads two avro input files with different schemas.) Best, stan On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick wrote: > Hi guys, > > I use Pig to process some click

Re: Best Practice: lookup table

2012-03-27 Thread Stan Rosenberg
Hi Markus, I would start with a "replicated" join: join InputTable by BrowserId, BrowserLookup by Id USING 'replicated'; The idea is to perform a map-side join by loading the smaller relation, in this case BrowserLookup, into memory. If all you're doing is lookup, then the replicated join is lik
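As an illustration of the replicated join being suggested (file and field names are assumptions, not from Markus's data):

  -- the smaller relation is listed last and is replicated into mapper memory
  input_table    = LOAD 'input_table.txt'    USING PigStorage('\t') AS (browser_id:int, url:chararray);
  browser_lookup = LOAD 'browser_lookup.txt' USING PigStorage('\t') AS (id:int, name:chararray);
  joined = JOIN input_table BY browser_id, browser_lookup BY id USING 'replicated';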

Re: Trying to store a bag of tuples using AvroStorage.

2012-03-26 Thread Stan Rosenberg
e why that happens. I will investigate further once I can execute your scripts. Best, stan On Sun, Mar 25, 2012 at 10:41 AM, Stan Rosenberg wrote: > Hi Dan, > > This looks like an avro bug.  I'll have a look later tonight unless someone > else has a more immediate answer. >

Re: What should storefuncs do on parse errors while reading?

2012-03-25 Thread Stan Rosenberg
I typically increment a counter and have a bounded log of randomly sampled erroneous data. stan On Mar 24, 2012 6:50 PM, "fatal.er...@gmail.com" wrote: > Can do a counter and log the first few thousand rows or something ... > > > > On Mar 24, 2012, at 10:33 AM, Bill Graham wrote: > > > The pat

Re: Trying to store a bag of tuples using AvroStorage.

2012-03-25 Thread Stan Rosenberg
Hi Dan, This looks like an avro bug. I'll have a look later tonight unless someone else has a more immediate answer. Best, stan On Mar 25, 2012 12:36 AM, "Dan Young" wrote: > Hello all, > > I'm trying to store a bag of tuples using AvroStorage but am not able to > figure out what I'm doing wr

Re: Globbing several AVRO files with different (extended) schemes

2012-03-21 Thread Stan Rosenberg
There is a patch for AvroStorage which computes a union schema, thereby allowing input avro files with different schemas, specifically (un-nested) records with different fields. https://issues.apache.org/jira/browse/PIG-2579 Best, stan On Wed, Mar 21, 2012 at 8:31 PM, Jonathan Coveney wrote:

Re: config/reference data files for UDFS

2012-03-13 Thread Stan Rosenberg
Hi Alan, I am also curious to see how the distributed cache is used in a UDF. However, the code you reference in the patch doesn't appear to contain such an example. What is the name of the source file? Thanks, stan On Mon, Mar 12, 2012 at 7:24 PM, Alan Gates wrote: > Take a look at the builtin U

Re: Understanding LoadFunc sequence

2012-02-03 Thread Stan Rosenberg
Hi Bill, I've used the following in my UDFs:

  public static boolean isBackend(JobContext ctx) {
    // HACK borrowed from HCatLoader: this property should only be set on the backend
    return ctx.getConfiguration().get("mapred.task.id", "").length() > 0;
  }

I recall

Re: Passing schema inside Load functionc

2012-02-03 Thread Stan Rosenberg
is there an easy way to do > it or am I reading something wrong? > Now I will focus on what you have suggested, but I hope there is some easy > solution to my problem > > Praveenesh > > On Sat, Feb 4, 2012 at 4:12 AM, Stan Rosenberg < > srosenb...@proclivitysystems.com>

Re: Passing schema inside Load functionc

2012-02-03 Thread Stan Rosenberg
solve the above scenario in pig ? > > Praveenesh > > > On Sat, Feb 4, 2012 at 4:02 AM, Stan Rosenberg < > srosenb...@proclivitysystems.com> wrote: > >> My hunch is you'll have to write a custom loader, but I'll let the >> experts chime in.  E.g., AvroS

Re: Passing schema inside Load functionc

2012-02-03 Thread Stan Rosenberg
My hunch is you'll have to write a custom loader, but I'll let the experts chime in. E.g., AvroStorage loader can parse the schema from a json file passed to it via the constructor. I don't think PigStorage has the same option. stan On Fri, Feb 3, 2012 at 7:35 AM, praveenesh kumar wrote: > Hey
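For contrast with the PigStorage side of the question, the usual way to give PigStorage a schema is an inline AS clause rather than an external schema file; the field names here are purely illustrative:

  A = LOAD 'input.txt' USING PigStorage('\t') AS (id:int, name:chararray, ts:long);
  DESCRIBE A;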

Re: Pig/Avro Question

2012-02-03 Thread Stan Rosenberg
Check the code in PigAvroInputFormat; it overrides 'listStatus' from FileInputFormat so that files not ending in .avro are filtered out. stan On Fri, Feb 3, 2012 at 1:58 PM, Russell Jurney wrote: > btw - the weird thing is... I've read the code.  There isn't a filter for > .avro in there.  Does Hado

Re: explode operation

2012-01-30 Thread Stan Rosenberg
On Mon, Jan 30, 2012 at 2:25 AM, Aniket Mokashi wrote: > Isnt FLATTEN similar to explode? Not quite. EXPLODE would take a record with n fields and generate n records.

Re: explode operation

2012-01-29 Thread Stan Rosenberg
al question. You would have to write a >> custom UDF to do this. >> >> Thanks, >> Prashant >> >> On Jan 25, 2012, at 7:32 PM, Stan Rosenberg >> wrote: >> >> > To clarify, here is our input: >> > >> > X = LOAD 'input.txt'

Re: explode operation

2012-01-25 Thread Stan Rosenberg
6 PM, Stan Rosenberg wrote: > I don't see how flatten would help in this case. > > On Wed, Jan 25, 2012 at 10:19 PM, Prashant Kommireddi > wrote: >> Hi Stan, >> >> Would using FLATTEN and then DISTINCT work? >> >> Thanks, >> Prashant >

Re: explode operation

2012-01-25 Thread Stan Rosenberg
I don't see how flatten would help in this case. On Wed, Jan 25, 2012 at 10:19 PM, Prashant Kommireddi wrote: > Hi Stan, > > Would using FLATTEN and then DISTINCT work? > > Thanks, > Prashant > > On Wed, Jan 25, 2012 at 7:11 PM, Stan Rosenberg < > srosenb...@pr

explode operation

2012-01-25 Thread Stan Rosenberg
Hi Guys, I came across a use case that seems to require an 'explode' operation which to my knowledge is not currently available. That is, given a tuple (x,y,z), 'explode' would generate the tuples (x), (y), (z). E.g., consider a relation that contains an arbitrary number of different identifier c
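One way to get this behaviour with builtins, assuming a Pig version that has TOBAG (0.9+), is to build a bag from the fields and flatten it; the input path and field names below are illustrative assumptions:

  X = LOAD 'input.txt' USING PigStorage('\t') AS (x:chararray, y:chararray, z:chararray);
  exploded = FOREACH X GENERATE FLATTEN(TOBAG(x, y, z)) AS id;  -- (x,y,z) becomes three rows
  deduped  = DISTINCT exploded;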

Re: Multiple files with AvroStorage and comma separated lists

2012-01-25 Thread Stan Rosenberg
y wrote: > Please submit. > > Russell Jurney > twitter.com/rjurney > russell.jur...@gmail.com > datasyndrome.com > > On Jan 24, 2012, at 8:22 AM, Stan Rosenberg > wrote: > >> Philipp, >> >> I would say that it is a bug.  I ran into the same problem some tim

is svn repo down?

2012-01-25 Thread Stan Rosenberg
Hi, I wanted to submit a patch for AvroStorage. However, the repo appears to be down: http://svn.apache.org/repos/asf/pig/trunk Thanks, stan

Re: DBLoader

2012-01-24 Thread Stan Rosenberg
Actually, I don't see the loading capability. Unless I am looking at the wrong code, org.apache.pig.piggybank.storage.DBStorage extends StoreFunc; it does not implement 'getNext'. stan On Tue, Jan 24, 2012 at 5:17 PM, Stan Rosenberg wrote: > My bad; I should have looked a

Re: DBLoader

2012-01-24 Thread Stan Rosenberg
My bad; I should have looked at the code. Thanks Ashutosh! stan On Tue, Jan 24, 2012 at 5:14 PM, Ashutosh Chauhan wrote: > DBStorage can be used for both load and store. > > Hope it helps, > Ashutosh > > On Tue, Jan 24, 2012 at 14:10, Stan Rosenberg < > srosenb...@proc

DBLoader

2012-01-24 Thread Stan Rosenberg
Hi, Quick question: is anyone aware of a DBLoad UDF, preferably based on hadoop's DBInputFormat? I am aware that there are other better solutions, e.g., sqoop. I can see DBStorage in piggybank, but not DBLoad. Thanks, stan

Re: Multiple files with AvroStorage and comma separated lists

2012-01-24 Thread Stan Rosenberg
ect: Re: Multiple files with AvroStorage and comma separated lists > To: user@pig.apache.org > > > Hi Philipp, > > This is in fact a bug, so if you wouldn't mind submitting the patch, that > would be great. > > thanks, > Bill > > > On Tue, Jan 24, 2012 at

Re: Multiple files with AvroStorage and comma separated lists

2012-01-24 Thread Stan Rosenberg
Philipp, I would say that it is a bug. I ran into the same problem some time ago. Essentially, AvroStorage does not recognize globs and does not recognize commas, both of which are supported by hadoop's FileInputFormat. I ended up patching AvroStorage to make it compatible with hadoop's semanti
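For reference, the comma and glob forms in question are plain hadoop FileInputFormat path syntax and work as-is with PigStorage; the paths below are made up for illustration:

  -- comma-separated paths plus a glob in a single LOAD
  logs = LOAD '/logs/2012-01-23/part-*,/logs/2012-01-2{4,5}/part-*' USING PigStorage('\t');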

Re: STORING each relation in its own file

2012-01-13 Thread Stan Rosenberg
Hi Yulia, One way to accomplish this is by writing your own StoreFunc. Take a look at org.apache.pig.piggybank.storage.MultiStorage. You'd need to create your own output format and possibly a record writer. stan On Fri, Jan 13, 2012 at 4:35 PM, Yulia Tolskaya wrote: > Hello, > I am trying to
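For orientation, what MultiStorage itself does is split output by the value of one field; a minimal usage sketch (the output path and the choice of field index 0 as the split key are assumptions):

  A = LOAD 'input.txt' USING PigStorage('\t') AS (key:chararray, val:chararray);
  -- writes one output subdirectory per distinct value of field 0 (key)
  STORE A INTO '/output/by_key'
      USING org.apache.pig.piggybank.storage.MultiStorage('/output/by_key', '0');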

Re: Simple AvroStorage LOAD and STORE with Avro 1.6.0

2012-01-10 Thread Stan Rosenberg
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:65) > at > org.apache.pig.piggybank.storage.avro.PigAvroDatumWriter.write(PigAvroDatumWriter.java:99) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:57) >

Re: Simple AvroStorage LOAD and STORE with Avro 1.6.0

2012-01-09 Thread Stan Rosenberg
Generally, AvroStorage works fine for us with Avro 1.6. However, we also patched AvroStorage on a couple of occasions, e.g., see PIG-2330. stan On Mon, Jan 9, 2012 at 3:47 PM, Russell Jurney wrote: > I could only make AvroStorage work with Avro 1.4.1. > > Russell Jurney > twitter.com/rjurney >

Re: Simple AvroStorage LOAD and STORE with Avro 1.6.0

2012-01-09 Thread Stan Rosenberg
Andrew, The source of the problem may be AvroStorage in piggybank. Could you please include the entire stack trace? stan On Mon, Jan 9, 2012 at 4:15 AM, Andrew Kenworthy wrote: > Hallo, > > When I run a simple pig script to LOAD and STORE avro data, I get:- > > java.lang.ClassCastException: or

Re: Partition keys in LoadMetadata is broken in 0.10?

2011-12-31 Thread Stan Rosenberg
Just to be clear, the concrete syntax had a typo; should have been: A = load 'daily_activity' USING HiveLoader WHERE date_partition >= 20110101 and date_partition <= 20110201; On Sat, Dec 31, 2011 at 10:34 PM, Stan Rosenberg wrote: > > A = load 'daily_acti

Re: Partition keys in LoadMetadata is broken in 0.10?

2011-12-31 Thread Stan Rosenberg
re is a > bug in implementation, this should be fixed in PIG-2346 and will be > included in all subsequent releases. > > Thanks, > Daniel > > On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg < > srosenb...@proclivitysystems.com> wrote: > >> Howdy All, &g

Fwd: Partition keys in LoadMetadata is broken in 0.10?

2011-12-30 Thread Stan Rosenberg
ny thanks! stan ------ Forwarded message -- From: Stan Rosenberg Date: Wed, Dec 7, 2011 at 12:24 PM Subject: Partition keys in LoadMetadata is broken in 0.10? To: user@pig.apache.org Hi, I am trying to implement a loader which is partition-aware.  As prescribed, my loader implements

Re: Using AvroStorage()

2011-12-13 Thread Stan Rosenberg
I write the exact same thing in one line, it works.. I remember seeing a > JIRA for this some time back, but am not able to find it now. > > On Wed, Dec 14, 2011 at 12:23 AM, Stan Rosenberg < > srosenb...@proclivitysystems.com> wrote: > >> There is something syntac

Re: Using AvroStorage()

2011-12-13 Thread Stan Rosenberg
Validator.java:970) >        at > org.apache.pig.parser.AstValidator.general_statement(AstValidator.java:574) >        at > org.apache.pig.parser.AstValidator.statement(AstValidator.java:396) >        at org.apache.pig.parser.AstValidator.query(AstValidator.java:306) >        at > org.apac

Re: Using AvroStorage()

2011-12-13 Thread Stan Rosenberg
The following test script works for me: = A = load '$LOGS' using org.apache.pig.piggybank.storage.avro.AvroStorage(); describe A; B = foreach A generate region as my_region, google_ip; dump B; store B into './output' using org.apache.pig.piggybank.sto
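Reflowed for readability, the quoted script reads roughly as follows; the final store line is completed under the assumption that it uses the same piggybank AvroStorage class as the load:

  A = load '$LOGS' using org.apache.pig.piggybank.storage.avro.AvroStorage();
  describe A;
  B = foreach A generate region as my_region, google_ip;
  dump B;
  store B into './output' using org.apache.pig.piggybank.storage.avro.AvroStorage();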

Partition keys in LoadMetadata is broken in 0.10?

2011-12-07 Thread Stan Rosenberg
Hi, I am trying to implement a loader which is partition-aware. As prescribed, my loader implements LoadMetadata, however, getPartitionKeys is never invoked. The script is of this form: X = LOAD 'input' USING MyLoader(); X = FILTER X BY partition_col == 'some_string'; and the schema returned by

Re: PigServer and dynamic invokers

2011-11-16 Thread Stan Rosenberg
Hi Dmitriy, The script does run if invoked from the command line, but only if we set PIG_CLASSPATH to point at the jar. stan On Nov 16, 2011 11:18 PM, "Dmitriy Ryaboy" wrote: > Does the script run if you launch it from the pig command line instead > of via PigServer? > > On Wed, Nov 16, 2011 at 3:0

Re: hive queries from pig

2011-11-14 Thread Stan Rosenberg
On Mon, Nov 14, 2011 at 5:30 PM, Dmitriy Ryaboy wrote: > If you manually create the hive table + partitions to match the format > Pig writes things in, it should just work. Hive table already exists. However, we don't want to write directly into its warehouse location because it may result in a

Re: hive queries from pig

2011-11-14 Thread Stan Rosenberg
On Mon, Nov 14, 2011 at 3:08 PM, Dmitriy Ryaboy wrote: > My lack of imagination is showing -- can you explain what you mean by > integrating hive queries with pig, For example, we implemented a storage function which creates path partitioning based on a given sequence of columns; the output is st

hive queries from pig

2011-11-14 Thread Stan Rosenberg
Hi, We are trying to brainstorm on how best to integrate hive queries into pig. All suggestions are greatly appreciated! Note, we are trying to use hcatalog but there are a couple of problems with that approach. We also considered using jython to communicate with a thrift server but jython seems

Re: UDF Counters

2011-11-09 Thread Stan Rosenberg
On Wed, Nov 9, 2011 at 2:45 PM, Daan Gerits wrote: > Hello everyone, > > Is it possible to update a counter from within an UDF? I know there is some > information on updating counters using log messages, but I have never done > that before and have no idea if it is working with pig. > This seem

get schema in StorageFunc

2011-11-07 Thread Stan Rosenberg
Hi All, I'd like to get the schema of a relation that is used in conjunction with my custom StorageFunc. I found 'checkSchema' to be useful for this case; however, it seems to work only in local mode. When run in distributed mode, 'checkSchema' is not invoked in mappers. Is there some other mean

Re: creating a graph over time

2011-11-05 Thread Stan Rosenberg
Hi Guys, Sorry for joining this discussion so late. I would suggest using interval trees for dealing with overlapping time intervals. There is a fairly nice treatment of interval trees in CLR, sect. 14.3. The data structure is essentially a red-black tree, and I surmise that one could extend jav

Re: python modules

2011-10-18 Thread Stan Rosenberg
e you are hitting https://issues.apache.org/jira/browse/PIG-1824 > > -Clay > > On Mon, 17 Oct 2011, Stan Rosenberg wrote: > >> Hi, >> >> What's a proper way to deploy python udfs? I've dropped the latest >> version of jython.jar in $PIG_HOME/lib.

python modules

2011-10-17 Thread Stan Rosenberg
Hi, What's a proper way to deploy python udfs? I've dropped the latest version of jython.jar in $PIG_HOME/lib. Things work in "local" mode, but when I run on a cluster, built-in python modules cannot be found. E.g., urlparse cannot be located: ImportError: No module named urlparse at org

Re: calling python udfs with varargs

2011-10-17 Thread Stan Rosenberg
ng 0 > args so I need to add a special case in the JythonFunction to handle > varargs. I'll create a JIRA for this. > For now you can not use varargs as they will always be called with no > parameters. > Julien > > On Mon, Oct 17, 2011 at 9:54 AM, Stan Rosenberg <

calling python udfs with varargs

2011-10-17 Thread Stan Rosenberg
Hi, I have a simple python udf which takes a variable number of (string) arguments and returns the first non-empty one. I can see that the udf is invoked from pig but no arguments are being passed. Here is the script: = #!/usr/bin/python f

Re: jython udfs

2011-10-13 Thread Stan Rosenberg
On Thu, Oct 13, 2011 at 11:52 AM, Norbert Burger wrote: > Also the output schema for dummy3() doesn't match what's being returned. >  You're returning a list of strings, but the outputschema specifies a bag, > which translates into a list of tuples (of something, eg. strings). > Part of my questi

jython udfs

2011-10-12 Thread Stan Rosenberg
Hi, I have three constant udfs in jython:

  @outputSchema("m:map[bag{tuple()}]")
  def dummy1():
      return {"key": [("value1", "value2")]}

  @outputSchema("m:map[tuple()]")
  def dummy2():
      return {"key": ("value1", "value2")}

  # doesn't work!
  @outputSchema("m:map[bag{}]")
  def dummy3():
      return {"k

Re: output partitioning

2011-10-04 Thread Stan Rosenberg
On Tue, Oct 4, 2011 at 2:06 PM, Alan Gates wrote: > Can you explain what you mean by secondary output partitioning? HCatalog > supports the same partitioning that Hive does. > "Currently HCatStorer only supports writing to one partition." We need to partition our data by client id, then by da

Re: output partitioning

2011-10-04 Thread Stan Rosenberg
On Tue, Oct 4, 2011 at 1:27 PM, Alan Gates wrote: > If you want to use Pig and Hive together, you should also consider > HCatalog, which was built exactly to address that use case. > http://incubator.apache.org/hcatalog We'll definitely consider HCatalog but unfortunately it does not seem to be

Re: output partitioning

2011-10-03 Thread Stan Rosenberg
tly appreciated. Thanks, stan On Mon, Oct 3, 2011 at 11:09 PM, Stan Rosenberg < srosenb...@proclivitysystems.com> wrote: > Hi, > > I'd like to store the output relation partitioned by >

output partitioning

2011-10-03 Thread Stan Rosenberg
Hi, I'd like to store the output relation partitioned by

Re: Conditional execution of 'generate' clauses

2011-10-03 Thread Stan Rosenberg
> > > Z = filter Y by isEmpty(t); > > > > OR: t can't be empty if the thing you are distincting is not empty, so > this > > should work: > > > > Y = filter X by IsEmpty(thing_you_wanted_to_distinct); > > Z = foreach Y { > > -- the thin

Re: Conditional execution of 'generate' clauses

2011-10-03 Thread Stan Rosenberg
Y { > -- the thing you are distincting is now guaranteed to have at least 1 > value > t = distinct .. > generate foo... > } > > On Sun, Oct 2, 2011 at 9:28 AM, Stan Rosenberg < > srosenb...@proclivitysystems.com> wrote: > > > Hi Folks, > > > > I came

Conditional execution of 'generate' clauses

2011-10-02 Thread Stan Rosenberg
Hi Folks, I came across a use case where I'd like to do something like this: FOREACH X { ... t = DISTINCT (...) if (!IsEmpty(t)) GENERATE foo, ... } Thus, 'generate' is conditionally executed and the control flow depends on the value of some tuple 't'. Can this be done in pig? Th
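The workaround sketched in the replies is to filter out the empty case before the nested FOREACH, so that GENERATE always runs on non-empty data. A minimal version, assuming a relation whose 'items' field is a bag; the names are illustrative:

  X = LOAD 'input.txt' AS (id:chararray, items:bag{t:tuple(v:chararray)});
  Y = FILTER X BY NOT IsEmpty(items);   -- drop records whose bag is empty
  Z = FOREACH Y {
      d = DISTINCT items;
      GENERATE id, d;
  };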

Conditional execution of 'generate' clauses

2011-10-02 Thread Stan Rosenberg
Hi Folks, I came across a use case where I'd like to do something like this: FOREACH X { if (!IsEmpty(t)) }