Hi Cheolsoo,
Thanks for your reply! (Liang and I work together.) The restriction to
"simple" union types is still there in the latest code; see lines 83-95,
here:
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/avro/AvroStorageSchemaConversionUtilities.java
I know that ele
Dmitriy,
Please excuse my ignorance. What is/was wrong with trevni
(https://github.com/cutting/trevni) ?
Thanks,
stan
On Tue, Mar 12, 2013 at 11:45 AM, Dmitriy Ryaboy wrote:
> Fellow Hadoopers,
>
> We'd like to introduce a joint project between Twitter and Cloudera
> engineers -- a new column
On Sat, Jun 23, 2012 at 3:30 AM, Sheng Guo wrote:
> I know it is automatically set. But I have a large data set, and I want it to
> allocate more mappers overnight so that more computing resources can be
> used to speed things up.
> Any suggestions?
Pig uses CombineInputFormat by default, which attempts to combine small
input splits into larger ones; the number of map tasks is therefore driven
by the combined split size rather than set directly.
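If the goal is more map tasks, the split-combination knobs are worth a look
(property names as I recall them; please verify against your Pig version):

set pig.maxCombinedSplitSize 67108864;  -- cap combined splits at 64 MB so more map tasks get created
-- or disable split combination entirely:
-- set pig.noSplitCombination true;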
Hi,
I am trying to submit a pig job to a remote cluster by setting
mapred.job.tracker and fs.default.name accordingly.
The job does get executed on the remote cluster; however, all
intermediate output is stored on the local cluster from which
pig is run. From the job configuration I can see that that
I believe the syntax of LIMIT does not admit an arbitrary expression;
it only admits constants. At least this is what the documentation
says.
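For example, this parses fine:

B = LIMIT A 100;

whereas LIMIT A n, with n coming from another relation or expression, is
rejected (at least in the versions I've looked at).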
stan
On Tue, Apr 10, 2012 at 4:33 PM, James Newhaven
wrote:
> Hi,
>
> I am trying to limit the output size using LIMIT. I want to set the limit
> size to
AFAIK, by default AvroStorage enforces that all input files have
exactly the same schema. I've submitted a patch to improve
this somewhat by allowing different input schemas so long as a union
schema can be derived; e.g., say schema 1 contains field 'foo' which is
not in schema 2, and schema 2 contains field 'bar' which is not in
schema 1; the derived union schema contains both fields.
There is a patch for Avro to deal with this use case:
https://issues.apache.org/jira/browse/PIG-2579
(See the attached pig example which loads two avro input files with
different schemas.)
Best,
stan
On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick wrote:
> Hi guys,
>
> I use Pig to process some click
Hi Markus,
I would start with a "replicated" join:
join InputTable by BrowserId, BrowserLookup by Id USING 'replicated';
The idea is to perform a map-side join by loading the smaller
relation, in this case BrowserLookup, into memory.
If all you're doing is a lookup, then the replicated join is likely
the most efficient option.
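A slightly fuller sketch (the schemas are made up; only the lookup side has
to fit in memory):

InputTable = load 'input' as (BrowserId:long, url:chararray);
BrowserLookup = load 'lookup' as (Id:long, name:chararray);
Joined = join InputTable by BrowserId, BrowserLookup by Id USING 'replicated';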
e why that happens.
I will investigate further once I can execute your scripts.
Best,
stan
On Sun, Mar 25, 2012 at 10:41 AM, Stan Rosenberg
wrote:
> Hi Dan,
>
> This looks like an avro bug. I'll have a look later tonight unless someone
> else has a more immediate answer.
>
I typically increment a counter and have a bounded log of randomly sampled
erroneous data.
stan
On Mar 24, 2012 6:50 PM, "fatal.er...@gmail.com"
wrote:
> Can do a counter and log the first few thousand rows or something ...
>
>
>
> On Mar 24, 2012, at 10:33 AM, Bill Graham wrote:
>
> > The pat
Hi Dan,
This looks like an avro bug. I'll have a look later tonight unless someone
else has a more immediate answer.
Best,
stan
On Mar 25, 2012 12:36 AM, "Dan Young" wrote:
> Hello all,
>
> I'm trying to store a bag of tuples using AvroStorage but am not able to
> figure out what I'm doing wrong.
There is a patch for AvroStorage which computes a union schema, thereby
allowing input avro files with different schemas, specifically
(un-nested) records with different fields.
https://issues.apache.org/jira/browse/PIG-2579
Best,
stan
On Wed, Mar 21, 2012 at 8:31 PM, Jonathan Coveney wrote:
Hi Alan,
I am also curious to see how the distributed cache is used in a UDF.
However, the code you reference in the patch doesn't appear to contain
such an example. What is the name of the source file?
Thanks,
stan
On Mon, Mar 12, 2012 at 7:24 PM, Alan Gates wrote:
> Take a look at the builtin U
Hi Bill,
I've used the following in my UDFs:
public static boolean isBackend(JobContext ctx) {
    // HACK borrowed from HCatLoader: this property should only be set on the backend
    return ctx.getConfiguration().get("mapred.task.id", "").length() > 0;
}
I recall
is there a easy way to do
> it or am I reading something wrong.
> Now I will focus on what you have suggested. but I hope there is some easy
> solution to my problem
>
> Praveenesh
>
> On Sat, Feb 4, 2012 at 4:12 AM, Stan Rosenberg <
> srosenb...@proclivitysystems.com>
solve the above scenario in pig ?
>
> Praveenesh
>
>
> On Sat, Feb 4, 2012 at 4:02 AM, Stan Rosenberg <
> srosenb...@proclivitysystems.com> wrote:
>
>> My hunch is you'll have to write a custom loader, but I'll let the
>> experts chime in. E.g., AvroS
My hunch is you'll have to write a custom loader, but I'll let the
experts chime in. E.g., AvroStorage loader can parse the schema
from a json file passed to it via the constructor. I don't think
PigStorage has the same option.
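For instance, something along these lines (the parameter name and path are
from memory and may differ across AvroStorage versions, so double-check):

A = load 'input' using org.apache.pig.piggybank.storage.avro.AvroStorage(
        'schema_file', 'hdfs:///schemas/record.avsc');
-- hypothetical path; 'schema_file' is the parameter name I recall, but verify it for your build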
stan
On Fri, Feb 3, 2012 at 7:35 AM, praveenesh kumar wrote:
> Hey
Check the code in PigAvroInputFormat; it overrides 'listStatus' from
FileInputFormat so that files not ending in .avro are filtered out.
stan
On Fri, Feb 3, 2012 at 1:58 PM, Russell Jurney wrote:
> btw - the weird thing is... I've read the code. There isn't a filter for
> .avro in there. Does Hado
On Mon, Jan 30, 2012 at 2:25 AM, Aniket Mokashi wrote:
> Isnt FLATTEN similar to explode?
Not quite. EXPLODE would take a record with n fields and generate n records.
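One way to emulate it with existing operators (sketch; assumes x, y and z
share a type):

exploded = foreach X generate flatten(TOBAG(x, y, z)) as value;
-- one input row (x, y, z) becomes three output rows: (x), (y), (z)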
al question. You would have to write a
>> custom UDF to do this.
>>
>> Thanks,
>> Prashant
>>
>> On Jan 25, 2012, at 7:32 PM, Stan Rosenberg
>> wrote:
>>
>> > To clarify, here is our input:
>> >
>> > X = LOAD 'input.txt
Stan Rosenberg wrote:
> I don't see how flatten would help in this case.
>
> On Wed, Jan 25, 2012 at 10:19 PM, Prashant Kommireddi
> wrote:
>> Hi Stan,
>>
>> Would using FLATTEN and then DISTINCT work?
>>
>> Thanks,
>> Prashant
>
I don't see how flatten would help in this case.
On Wed, Jan 25, 2012 at 10:19 PM, Prashant Kommireddi
wrote:
> Hi Stan,
>
> Would using FLATTEN and then DISTINCT work?
>
> Thanks,
> Prashant
>
> On Wed, Jan 25, 2012 at 7:11 PM, Stan Rosenberg <
> srosenb...@pr
Hi Guys,
I came across a use case that seems to require an 'explode' operation
which to my knowledge is not currently available.
That is, given a tuple (x,y,z), 'explode' would generate the tuples
(x), (y), (z).
E.g., consider a relation that contains an arbitrary number of
different identifier c
Russell Jurney wrote:
> Please submit.
>
> Russell Jurney
> twitter.com/rjurney
> russell.jur...@gmail.com
> datasyndrome.com
>
> On Jan 24, 2012, at 8:22 AM, Stan Rosenberg
> wrote:
>
>> Philipp,
>>
>> I would say that it is a bug. I ran into the same problem some tim
Hi,
I wanted to submit a patch for AvroStorage. However, the repo appears to be down:
http://svn.apache.org/repos/asf/pig/trunk
Thanks,
stan
Actually, I don't see the loading capability. Unless I am looking at
the wrong code, org.apache.pig.piggybank.storage.DBStorage extends
StoreFunc; it does not implement 'getNext'.
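For reference, the store side is used roughly like this (driver, URL and
query are placeholders; check the DBStorage constructor in your piggybank build):

store results into 'ignored' using org.apache.pig.piggybank.storage.DBStorage(
        'com.mysql.jdbc.Driver', 'jdbc:mysql://dbhost/mydb', 'user', 'pass',
        'INSERT INTO results (a, b) VALUES (?, ?)');
-- the 'into' location is ignored by DBStorage; the insert statement drives the output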
stan
On Tue, Jan 24, 2012 at 5:17 PM, Stan Rosenberg
wrote:
> My bad; I should have looked a
My bad; I should have looked at the code. Thanks Ashutosh!
stan
On Tue, Jan 24, 2012 at 5:14 PM, Ashutosh Chauhan wrote:
> DBStorage can be used for both load and store.
>
> Hope it helps,
> Ashutosh
>
> On Tue, Jan 24, 2012 at 14:10, Stan Rosenberg <
> srosenb...@proc
Hi,
Quick question: is anyone aware of a DBLoad UDF, preferably based on
hadoop's DBInputFormat? I am aware that there are other better
solutions, e.g., sqoop.
I can see DBStorage in piggybank, but not DBLoad.
Thanks,
stan
> Subject: Re: Multiple files with AvroStorage and comma separated lists
> To: user@pig.apache.org
>
>
> Hi Philipp,
>
> This is in fact a bug, so if you wouldn't mind submitting the patch, that
> would be great.
>
> thanks,
> Bill
>
>
> On Tue, Jan 24, 2012 at
Philipp,
I would say that it is a bug. I ran into the same problem some time
ago. Essentially, AvroStorage does not recognize globs and does not
recognize commas, both of which
are supported by hadoop's FileInputFormat. I ended up patching
AvroStorage to make it compatible with hadoop's semantics.
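With the patch, the path syntax FileInputFormat accepts should work, e.g.
(made-up paths):

A = load '/data/2012-01-0[1-7]/part-*.avro,/data/extra/part-00000.avro'
        using org.apache.pig.piggybank.storage.avro.AvroStorage();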
Hi Yulia,
One way to accomplish this is by writing your own StoreFunc. Take a
look at org.apache.pig.piggybank.storage.MultiStorage. You'd need to
create your own output format and possibly a record writer.
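For reference, MultiStorage itself is invoked like this; it splits the output
into sub-directories keyed by one field ('0' below is the index of that field):

store A into '/out' using org.apache.pig.piggybank.storage.MultiStorage('/out', '0');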
stan
On Fri, Jan 13, 2012 at 4:35 PM, Yulia Tolskaya wrote:
> Hello,
> I am trying to
> at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:65)
> at
> org.apache.pig.piggybank.storage.avro.PigAvroDatumWriter.write(PigAvroDatumWriter.java:99)
> at
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:57)
>
Generally, AvroStorage works fine for us with Avro 1.6. However, we
also patched AvroStorage on a couple of occasions, e.g., see PIG-2330.
stan
On Mon, Jan 9, 2012 at 3:47 PM, Russell Jurney wrote:
> I could only make AvroStorage work with Avro 1.4.1.
>
> Russell Jurney
> twitter.com/rjurney
>
Andrew,
The source of the problem may be AvroStorage in piggybank. Could you
please include the entire stack trace?
stan
On Mon, Jan 9, 2012 at 4:15 AM, Andrew Kenworthy wrote:
> Hallo,
>
> When I run a simple pig script to LOAD and STORE avro data, I get:-
>
> java.lang.ClassCastException: or
Just to be clear, the concrete syntax had a typo; should have been:
A = load 'daily_activity' USING HiveLoader WHERE date_partition >=
20110101 and date_partition <= 20110201;
On Sat, Dec 31, 2011 at 10:34 PM, Stan Rosenberg
wrote:
>
> A = load 'daily_acti
re is a
> bug in implementation, this should be fixed in PIG-2346 and will be
> included in all subsequent releases.
>
> Thanks,
> Daniel
>
> On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <
> srosenb...@proclivitysystems.com> wrote:
>
>> Howdy All,
Many thanks!
stan
------ Forwarded message --
From: Stan Rosenberg
Date: Wed, Dec 7, 2011 at 12:24 PM
Subject: Partition keys in LoadMetadata is broken in 0.10?
To: user@pig.apache.org
Hi,
I am trying to implement a loader which is partition-aware. As
prescribed, my loader implements
I write the exact same thing in one line, it works.. I remember seeing a
> JIRA for this some time back, but am not able to find it now.
>
> On Wed, Dec 14, 2011 at 12:23 AM, Stan Rosenberg <
> srosenb...@proclivitysystems.com> wrote:
>
>> There is something syntac
Validator.java:970)
> at
> org.apache.pig.parser.AstValidator.general_statement(AstValidator.java:574)
> at
> org.apache.pig.parser.AstValidator.statement(AstValidator.java:396)
> at org.apache.pig.parser.AstValidator.query(AstValidator.java:306)
> at
> org.apac
The following test script works for me:
=
A = load '$LOGS' using org.apache.pig.piggybank.storage.avro.AvroStorage();
describe A;
B = foreach A generate region as my_region, google_ip;
dump B;
store B into './output' using org.apache.pig.piggybank.storage.avro.AvroStorage();
Hi,
I am trying to implement a loader which is partition-aware. As
prescribed, my loader implements LoadMetadata, however,
getPartitionKeys is never invoked.
The script is of this form:
X = LOAD 'input' USING MyLoader();
X = FILTER X BY partition_col == 'some_string';
and the schema returned by
Hi Dmitriy,
The script does run if invoked from the command line, but only if we set
PIG_CLASSPATH to point at the jar.
stan
On Nov 16, 2011 11:18 PM, "Dmitriy Ryaboy" wrote:
> Does the script run if you launch it from the pig command line instead
> of via PigServer?
>
> On Wed, Nov 16, 2011 at 3:0
On Mon, Nov 14, 2011 at 5:30 PM, Dmitriy Ryaboy wrote:
> If you manually create the hive table + partitions to match the format
> Pig writes things in, it should just work.
Hive table already exists. However, we don't want to write directly
into its warehouse location because it may result in a
On Mon, Nov 14, 2011 at 3:08 PM, Dmitriy Ryaboy wrote:
> My lack of imagination is showing -- can you explain what you mean by
> integrating hive queries with pig,
For example, we implemented a storage function which creates path
partitioning based on a given sequence of columns; the output is
st
Hi,
We are trying to brainstorm on how best to integrate hive queries into
pig. All suggestions are greatly appreciated!
Note, we are trying to use hcatalog but there are a couple of problems
with that approach.
We also considered using jython to communicate with a thrift server
but jython seems
On Wed, Nov 9, 2011 at 2:45 PM, Daan Gerits wrote:
> Hello everyone,
>
> Is it possible to update a counter from within an UDF? I know there is some
> information on updating counters using log messages, but I have never done
> that before and have no idea if it is working with pig.
>
This seem
Hi All,
I'd like to get the schema of a relation that is used in conjunction
with my custom StoreFunc. I found 'checkSchema' to be useful for
this case; however, it seems to work only in local mode. When run in
distributed mode, 'checkSchema' is not invoked in mappers.
Is there some other mean
Hi Guys,
Sorry for joining this discussion so late. I would suggest using
interval trees for dealing with overlapping time intervals.
There is a fairly nice treatment of interval trees in CLR, sect. 14.3.
The data structure is essentially a red-black tree, and I surmise
that one
could extend jav
e you are hitting https://issues.apache.org/jira/browse/PIG-1824
>
> -Clay
>
> On Mon, 17 Oct 2011, Stan Rosenberg wrote:
>
>> Hi,
>>
>> What's a proper way to deploy python udfs? I've dropped the latest
>> version of jython.jar in $PIG_HOME/lib.
Hi,
What's a proper way to deploy python udfs? I've dropped the latest
version of jython.jar in $PIG_HOME/lib.
Things work in "local" mode, but when I run on a cluster, built-in
python modules cannot be found. E.g., urlparse cannot be located:
ImportError: No module named urlparse
at org
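For what it's worth, the registration itself is the easy part (the script
name below is a placeholder); the ImportError comes from the modules the
script imports not being visible on the task nodes:

register 'myudfs.py' using jython as myudfs;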
ng 0
> args so I need to add a special case in the JythonFunction to handle
> varargs. I'll create a JIRA for this.
> For now you can not use varargs as they will always be called with no
> parameters.
> Julien
>
> On Mon, Oct 17, 2011 at 9:54 AM, Stan Rosenberg <
Hi,
I have a simple python udf which takes a variable number of (string)
arguments and returns the first non-empty one.
I can see that the udf is invoked from pig but no arguments are being passed.
Here is the script:
=
#!/usr/bin/python
f
On Thu, Oct 13, 2011 at 11:52 AM, Norbert Burger
wrote:
> Also the output schema for dummy3() doesn't match what's being returned.
> You're returning a list of strings, but the outputschema specifies a bag,
> which translates into a list of tuples (of something, eg. strings).
>
Part of my questi
Hi,
I have three constant udfs in jython:
@outputSchema("m:map[bag{tuple()}]")
def dummy1():
    return {"key": [("value1", "value2")]}

@outputSchema("m:map[tuple()]")
def dummy2():
    return {"key": ("value1", "value2")}

# doesn't work!
@outputSchema("m:map[bag{}]")
def dummy3():
    return {"key": ["value1", "value2"]}
On Tue, Oct 4, 2011 at 2:06 PM, Alan Gates wrote:
> Can you explain what you mean by secondary output partitioning? HCatalog
> supports the same partitioning that Hive does.
>
"Currently HCatStorer only supports writing to one partition."
We need to partition our data by client id, then by da
On Tue, Oct 4, 2011 at 1:27 PM, Alan Gates wrote:
> If you want to use Pig and Hive together, you should also consider
> HCatalog, which was built exactly to address that use case.
> http://incubator.apache.org/hcatalog
We'll definitely consider HCatalog but unfortunately it does not seem to be
tly appreciated.
Thanks,
stan
On Mon, Oct 3, 2011 at 11:09 PM, Stan Rosenberg <
srosenb...@proclivitysystems.com> wrote:
> Hi,
>
> I'd like to store the output relation partitioned by
>
Hi,
I'd like to store the output relation partitioned by
> >
> > Z = filter Y by isEmpty(t);
> >
> > OR: t can't be empty if the thing you are distincting is not empty, so
> this
> > should work:
> >
> > Y = filter X by IsEmpty(thing_you_wanted_to_distinct);
> > Z = foreach Y {
> > -- the thin
Y {
> -- the thing you are distincting is now guaranteed to have at least 1
> value
> t = distinct ..
> generate foo...
> }
>
> On Sun, Oct 2, 2011 at 9:28 AM, Stan Rosenberg <
> srosenb...@proclivitysystems.com> wrote:
>
> > Hi Folks,
> >
> > I came
Hi Folks,
I came across a use case where I'd like to do something like this:
FOREACH X {
    ...
    t = DISTINCT (...)
    if (!IsEmpty(t))
        GENERATE foo, ...
}
Thus, 'generate' is conditionally executed and the control flow depends on
the value of some tuple 't'.
Can this be done in pig?
Thanks,
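For reference, the workaround suggested in the thread boils down to doing the
(possibly empty) DISTINCT unconditionally and filtering afterwards; a sketch
with made-up relation and field names:

grouped = group clicks by user_id;
result = foreach grouped {
        purchases = filter clicks by action == 'purchase';
        t = distinct purchases.item_id;
        generate group as user_id, t as items;   -- 'items' may be an empty bag here
};
-- GENERATE cannot be made conditional, so drop the empty ones in a separate step:
nonempty = filter result by not IsEmpty(items);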