Re: run pig 0.12.0 under hadoop 2.2.0

2013-12-12 Thread Dmitriy Ryaboy
you have the wrong hadoop on your classpath, or you did not recompile against hadoop 2. On Thu, Dec 12, 2013 at 12:18 PM, qiaoresearcher wrote: > Everything was fine with hadoop 1.x and pig 0.11. > Recently I installed hadoop 2.2.0 and pig 0.12.0, and run some simple one > line script : load som

Re: how to load custom Writable class from sequence file?

2013-09-24 Thread Dmitriy Ryaboy
on from any Java object type in the > sequence file to pig types. See > https://issues.apache.org/jira/browse/PIG-1777 > > On Tue, Sep 24, 2013 at 5:22 AM, Dmitriy Ryaboy > wrote: > > I assume by scala you mean scalding? > > If so, yeah, scalding should be much easier

Re: how to load custom Writable class from sequence file?

2013-09-24 Thread Dmitriy Ryaboy
I assume by scala you mean scalding? If so, yeah, scalding should be much easier for working with custom data types. Pig doesn't handle generic "objects" well. You have to write converters to and from, like the ones we created in ElephantBird for Protocol Buffers and Thrift (and a bunch of writabl

Re: Whether this is a bug of count function

2013-09-23 Thread Dmitriy Ryaboy
That's actually the documented behavior: https://pig.apache.org/docs/r0.10.0/func.html#count There was some discussion about changing this: https://issues.apache.org/jira/browse/PIG-1014 Patches gratefully accepted.. D On Sat, Sep 14, 2013 at 12:01 AM, centerqi hu wrote: > The sample.txt fil

Re: DataByteArray as Input in Load Function

2013-09-23 Thread Dmitriy Ryaboy
Loaders and UDFs are all initialized at the compilation phase, so you can't pass dynamically calculated values in (you can do some things by pre-calculating constants like current time, etc, using variable binding via the define keyword, but you are trying to do something far more fancy). Moreover

Re: Pig with CombinedFileInputFormat & CombineFileRecordReader (Not working Pig 0.8)

2013-09-23 Thread Dmitriy Ryaboy
Don't use CombinedFile InputFormat / Record Reader. Just let Pig do its thing. On Wed, Sep 18, 2013 at 9:08 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I tried this > http://pig.apache.org/docs/r0.8.1/cookbook.html#Combine+Small+Input+Files > > Test Job Details > Input 7 Files * 51MB each > > HDFS Counters of

Re: Setting pig job name

2013-04-20 Thread Dmitriy Ryaboy
Also, if you run the pig job from a script rather than from the grunt shell, the name is the name of the script (so, "pig foo.pig" names spawned jobs foo.pig) On Apr 15, 2013, at 5:50 PM, Bill Graham wrote: > You can do this in your script as well: > > SET job.name 'my job'; > > > > > On

Re: GSoC 2013

2013-04-08 Thread Dmitriy Ryaboy
is a matrix including vertex id and its starting position > > > > > > > > > > > > graph = load 'graph' using PigStorage() (vertex:int, follower:int) - > > > > --load the graph file > > > > vertex = COGROUP graph BY (vertex); > > > >

Re: Pig JasonParser

2013-04-06 Thread Dmitriy Ryaboy
' USING > com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]); > DUMP inputData; > > > On Thu, Sep 27, 2012 at 8:48 AM, Dmitriy Ryaboy > wrote: > > > Yep. It's just JsonLoader. > > By default it works on top of whatever's returned by

Re: simple script generating 'too many counters' error

2013-04-04 Thread Dmitriy Ryaboy
Do you have any special properties set? Like the pig.udf.profile one maybe.. D On Thu, Apr 4, 2013 at 6:25 AM, Lauren Blau < lauren.b...@digitalreasoning.com> wrote: > I'm running a simple script to add a sequence_number to a relation, sort > the result and store to a file: > > a0 = load '' usin

Re: GSoC 2013

2013-04-01 Thread Dmitriy Ryaboy
it's clear :) > > Thanks > Best Regards... > > > On Fri, Mar 29, 2013 at 6:10 PM, Dmitriy Ryaboy > wrote: > > > Hi Burakk, > > The general idea of making graph processing easier is a good one. I'm not > > sure what exactly you are proposing to d

Re: GSoC 2013

2013-03-29 Thread Dmitriy Ryaboy
Hi Burakk, The general idea of making graph processing easier is a good one. I'm not sure what exactly you are proposing to do, though. Could you be more detailed about what you are thinking? On Thu, Mar 28, 2013 at 1:28 PM, burakkk wrote: > Hi, > I might be a little bit late. I come up with a

Re: EvalFunc finish() closing connections prematurely

2013-03-25 Thread Dmitriy Ryaboy
Mike, have you tried adding logging to any EvalFunc methods that communicate with Mongo to see which of them is calling it after finish() ? Are you sure something else doesn't close Mongo connection for you? On Fri, Mar 22, 2013 at 8:28 AM, Mike Sukmanowsky wrote: > Bump - any thoughts? > > > O

Re: Usage of 'limit' with Pig for Hbase

2013-03-14 Thread Dmitriy Ryaboy
To explain what's going on: -limit for HBaseStorage limits the number of rows returned from *each region* in the hbase table. It's an optimization -- there is no way for the LIMIT operator to be pushed down to the loader, so you can do it explicitly if you know you only need a few rows and don't wa
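
A minimal sketch of the per-region behaviour described above, written as a plain-Python model rather than HBase or Pig code. The region contents and the helper name `scan_with_region_limit` are made up for illustration; the point is only that a per-region limit can return up to limit x regions rows, so a final LIMIT is still needed for an exact count.

```python
def scan_with_region_limit(regions, limit):
    """Return at most `limit` rows from each region, in region order."""
    rows = []
    for region_rows in regions:
        rows.extend(region_rows[:limit])
    return rows


# Three hypothetical regions holding 3, 2 and 4 rows.
regions = [["r1", "r2", "r3"], ["r4", "r5"], ["r6", "r7", "r8", "r9"]]

# Loader-side limit of 2 still yields 2 rows per region.
loaded = scan_with_region_limit(regions, 2)

# The script-level LIMIT then trims the combined result.
final = loaded[:2]
```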

Re: pig 0.12.0 ERROR 2998: Unhandled internal error. com.google.common.collect.ImmutableSet.of

2013-03-12 Thread Dmitriy Ryaboy
jar) > and put it in my run directory, but still got the same error message. My > CLASSPATH is > > CLASSPATH=.:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/lib/dt.jar:. > > So it should look in the current directory, right? > > Thanks > Dan > > -

Re: pig 0.12.0 ERROR 2998: Unhandled internal error. com.google.common.collect.ImmutableSet.of

2013-03-12 Thread Dmitriy Ryaboy
11.0 is currently required. On Tue, Mar 12, 2013 at 2:54 PM, Danfeng Li wrote: > Thanks for the quick response, which guava version I should use? > > -Original Message- > From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] > Sent: Tuesday, March 12, 2013 2:52 PM > To: u

Re: null pointer error with a simple pig program

2013-03-12 Thread Dmitriy Ryaboy
g bug in S3 this thing tickles isn't being triggered). On Tue, Mar 12, 2013 at 2:55 PM, Dmitriy Ryaboy wrote: > Sounds like a bug in the S3 implementation of FileSystem? Does this happen > with pig 0.10 or 0.11? > > > > On Mon, Mar 11, 2013 at 12:11 AM, Yang wrote: >

Re: null pointer error with a simple pig program

2013-03-12 Thread Dmitriy Ryaboy
Sounds like a bug in the S3 implementation of FileSystem? Does this happen with pig 0.10 or 0.11? On Mon, Mar 11, 2013 at 12:11 AM, Yang wrote: > the following code gave null pointer exception > > > --- > > rbl

Re: Read Hive LazySimpleSerde with Pig

2013-03-12 Thread Dmitriy Ryaboy
How does LazySimpleSerde store data? On Tue, Mar 12, 2013 at 11:17 AM, Shawn Hermans wrote: > All, > Is there an easy way to read Hive LazySimpleSerde encoded files in Pig? I > did some research and found support for Hive's columnar format and for > SequenceFiles, but did not see anything for L

Re: pig 0.12.0 ERROR 2998: Unhandled internal error. com.google.common.collect.ImmutableSet.of

2013-03-12 Thread Dmitriy Ryaboy
Sounds like you have a bad (older? newer?) version of guava on the classpath. On Tue, Mar 12, 2013 at 2:50 PM, Danfeng Li wrote: > When I try to run pig 0.12.0, I got the following error > > $ pig12 -param input="t" -param output="s" -c b224G_1.pig > log4j:ERROR Could not find value for key lo

Introducing Parquet: efficient columnar storage for Hadoop.

2013-03-12 Thread Dmitriy Ryaboy
nity development, we plan to contribute Parquet to the Apache Incubator when the development is farther along. Regards, Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy, Jonathan Coveney, and friends.

Re: Parsing a Complex JSON String?

2013-02-28 Thread Dmitriy Ryaboy
Does the EB json loader with elephantbird.jsonloader.nestedLoad = true Work? On Thu, Feb 28, 2013 at 10:44 AM, Eli Finkelshteyn wrote: > > Hi Folks, > > I want to parse a string of complex JSON in Pig. Specifically, I want Pig to understand my JSON array as a bag instead of as a single charar

Re: jsonStorage and pig maps, not sure whats wrong with this?

2013-02-27 Thread Dmitriy Ryaboy
Sounds odd. Can you send a complete script that reproduces the error (include sample data and load statements). On Thu, Feb 21, 2013 at 2:55 AM, Robert McCarthy < robert.mark.mccar...@gmail.com> wrote: > If I have some information in A, that contains dt_dt and platform, I want > to store it in a

Re: Pig with NetCDF

2013-02-25 Thread Dmitriy Ryaboy
I don't think I've seen anyone write loaders for NetCDF, but there is no reason one couldn't, as far as I know. Just need to write a Hadoop InputFormat / RecordReader that implements the format, and wrap a thin LoadFunc around it. There is some basic documentation here: https://pig.apache.org/do

Re: Question about properties for Loader

2013-02-24 Thread Dmitriy Ryaboy
Hi Jeff, It does not sound like you need properties (or a configuration). It sounds like you want to pass arguments to your LoadFunc. You can create a LoadFunc that takes an arbitrary number of String arguments. For example, the default loader, PigStorage, takes 2 arguments: the first is a delimite

Pig 0.11: new features and improvements

2013-02-22 Thread Dmitriy Ryaboy
I pulled together some of the highlights of the pig 0.11 release on the Apache Pig blog (which now officially exists!): https://blogs.apache.org/pig/ D

Re: Hard-coded inline relations

2013-01-24 Thread Dmitriy Ryaboy
ach languages generate flatten(TOBAG(*)); * -- language_bag is a relation with three rows, ('en'), ('fr'), ('jp') * } */ On Thu, Jan 24, 2013 at 1:03 PM, Dmitriy Ryaboy wrote: > > I have a loader that does exactly that. Let me see about dropping into Elepha

Re: Hard-coded inline relations

2013-01-24 Thread Dmitriy Ryaboy
I have a loader that does exactly that. Let me see about dropping into Elephant-Bird. On Thu, Jan 24, 2013 at 8:15 AM, Alan Gates wrote: > I agree this would be useful for debugging, but I'd go about it a > different way. Rather than add new syntax as you propose, it seems we > could easily cr

Re: Parallelism for small input data

2013-01-13 Thread Dmitriy Ryaboy
"The udf (simple extends eval func) refers and reads a dictionary file of 6 MB for each input phrase." Any reason to keep re-reading the dictionary instead of just reading it once? D On Sun, Jan 13, 2013 at 4:47 AM, Dipesh Kumar Singh wrote: > The udf (simple extends eval func) refers and reads

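A minimal sketch of the "read it once" idea from the reply above, in plain Python rather than as a Pig EvalFunc. The class name `PhraseLookup`, the `loader` callback and the `loads` counter are hypothetical; they only illustrate caching the dictionary on first use instead of re-reading it for every input phrase.

```python
class PhraseLookup:
    """Loads the dictionary lazily, once, and reuses it for every call."""

    def __init__(self, loader):
        self._loader = loader      # callable that reads the dictionary
        self._dictionary = None    # filled on first use
        self.loads = 0             # how many times the loader actually ran

    def evaluate(self, phrase):
        if self._dictionary is None:
            self._dictionary = self._loader()
            self.loads += 1
        return phrase in self._dictionary


lookup = PhraseLookup(lambda: {"pig", "hadoop"})
results = [lookup.evaluate(p) for p in ["pig", "hive", "hadoop"]]
```
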
Re: generate multiple output files?

2013-01-11 Thread Dmitriy Ryaboy
Yang, Try MultiStorage: https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html On Wed, Jan 9, 2013 at 2:37 PM, Yang wrote: > let's say I have an input dataset, each row has 2 fields, the first field > is a value among 100 possible values. I want to just split

Re: Sequence File processing

2013-01-10 Thread Dmitriy Ryaboy
Please see the list of editor plugins in https://cwiki.apache.org/confluence/display/PIG/PigTools D On Mon, Dec 24, 2012 at 9:42 PM, Kshiva Kps wrote: > Hi, > > Is there any PIG editors and where we can write 100 to 150 pig scripts > I'm believing is not possible to do in CLI mode . > Like ID

Re: JsonLoader schema field order shouldn't matter

2013-01-10 Thread Dmitriy Ryaboy
Tim, can you open a github issue with EB about compiling against 0.10? I think this is an easy fix. On Tue, Jan 8, 2013 at 9:38 AM, Alan Gates wrote: > I would open a new JIRA, since 1914 is focussed on building an alternative > that discovers schema, while you are wanting to improve the existi

Re: Escaping Dollar Sign in Map in Pig 0.10

2013-01-10 Thread Dmitriy Ryaboy
Two back slashes? On Thu, Jan 10, 2013 at 6:01 PM, Eli Finkelshteyn wrote: > This wasn't a problem in 0.9.2, but in 0.10, when I try to access a key in > a map that has a dollar sign in it, I get hammered with errors that I > haven't defined the variable. Specifically: > >blah = FOREACH meh

Re: pig latin & lucene

2013-01-07 Thread Dmitriy Ryaboy
Details: https://github.com/kevinweil/elephant-bird/wiki/Elephant-Bird-Lucene On Fri, Jan 4, 2013 at 7:55 AM, Bill Graham wrote: > ElephantBird now has pig-lucene support: > > > https://github.com/kevinweil/elephant-bird/blob/master/pig-lucene/src/main/java/com/twitter/elephantbird/pig/load/Luc

Re: Making Pig run faster in local mode

2013-01-07 Thread Dmitriy Ryaboy
Try jstacking it a few times while it's running. Is it just sitting idly in a sleep() ? On Mon, Jan 7, 2013 at 11:56 AM, Cheolsoo Park wrote: > Typo: it makes much sense to run them in cluster => it doesn't make much > sense to run them in cluster. > > On Mon, Jan 7, 2013 at 11:55 AM, Cheolsoo P

Re: Failing to make sense of an error.

2012-12-03 Thread Dmitriy Ryaboy
Are you running in local mode? Heap error implies the *local* JVM is running into trouble (so, either you are doing the compute locally, or something odd is going on with processing the script or collecting the results). What is your java Xmx set to on the local (client) machine? On Wed, Nov 28,

Re: Running pig script as different user

2012-12-03 Thread Dmitriy Ryaboy
This should not work in versions of hadoop that support security for fairly obvious reasons. On Fri, Nov 30, 2012 at 5:52 PM, Prashant Kommireddi wrote: > Hi Miki, > > What version of hadoop are you on? I can confirm this works on 0.20.2 but > never tried this on the newer versions. > > Try hadoo

Re: Push limit into custom LoadFunc

2012-11-29 Thread Dmitriy Ryaboy
Mike, it's done automatically -- the operator will just stop asking the loader for more elements. If you observe something to the contrary, please let us know! On Wed, Nov 28, 2012 at 7:11 PM, Mike Drob wrote: > Hello, > > According to https://issues.apache.org/jira/browse/PIG-1270 the execution

Re: PigStorage

2012-11-16 Thread Dmitriy Ryaboy
That sounds reasonable, I've run into the same problem. Do you mind submitting a patch? On Fri, Nov 16, 2012 at 12:48 PM, pablomar wrote: > hi all, > > I'm using Pig 0.9.2 (Apache Pig version 0.9.2-cdh4.0.1, precisely) > I got a case today on which I needed to clean up some fields before > proces

Re: About SpillableMemoryManager

2012-11-02 Thread Dmitriy Ryaboy
a.opts from -Xmx200m to -Xmx1024m . It seems it doesn't > help. And that threshold value is still the same. > when I monitor the java process by top command, it seems the setting of > mapred.child.java.opts have NO influence on both VIRT and RES, it seems > mapred.child.java.opts has be

Re: About SpillableMemoryManager

2012-11-01 Thread Dmitriy Ryaboy
Rather than increase memory, rewrite the script so it does not need so much ram to begin with. You can split on $2, group and generate what you need, then join things back. Hard to tell what exactly you are going for without schemas and expected inputs/outputs. If the hadoop configs are the same,

Re: JOIN comparasion PIG V/S HIVE

2012-10-22 Thread Dmitriy Ryaboy
Could you provide sample data and script that would allow us to reproduce this? Hive is faster at some things. Pig is faster at others. Both produce correct results. D On Mon, Oct 22, 2012 at 11:22 AM, yogesh dhari wrote: > > Hi All, > > Is it true that Pig's JOIN operation is not so efficient a

Re: Help on running pig in local mode

2012-10-22 Thread Dmitriy Ryaboy
10-19 11:06:57,382 [main] INFO org.apache.hadoop.ipc.Client - > Retrying connect to server: localhost/127.0.0.1:9001. Already tried 1 > time(s). > 2012-10-19 11:06:58,383 [main] INFO org.apache.hadoop.ipc.Client - > Retrying connect to server: localhost/127.0.0.1:9001. Already tried 2 >

Re: debug feature??

2012-10-22 Thread Dmitriy Ryaboy
Some testing tips: 1) parametrize your load/store statements so that if you have to run in hadoop mode, it's easy to switch to debug inputs / outputs (and debug input/output loaders and storers). It's vastly preferable to test in local mode when possible, since the iterations are so much faster.

Re: _SUCCESS file -> _FAILURE file?

2012-10-18 Thread Dmitriy Ryaboy
That's a Hadoop mapreduce feature, not a Pig feature, so that request should go there. Can't really do the _failure thing though, if you think about it -- programs can fail by crashing, in which case they might not be able to write a file. Or maybe they are not crashing, but there is a problem tal

Re: How can I read Hive text files on S3 from Pig?

2012-10-18 Thread Dmitriy Ryaboy
again > Martin > > On 18 October 2012 05:15, Dmitriy Ryaboy wrote: > >> Yeah that's a bug in FileLocalizer, apparently it assumes local or >> hdfs, only. Could you file a jira? >> >> D >> >> On Sat, Oct 13, 2012 at 2:53 AM, Martin

Re: How can I read Hive text files on S3 from Pig?

2012-10-17 Thread Dmitriy Ryaboy
odAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:601) > at org.apache.hadoop.util.RunJar.main(RunJar.java:197) > > > Thanks for taking a look. I will start looking into HCatalog too. > > Martin > > > On 12 October 2012 18:56, Dmit

Re: NEED HELP in Hive Query

2012-10-17 Thread Dmitriy Ryaboy
B = group A by ( name, date, url); -- B now has 2 fields: "group" which is a tuple of (name, date, url) and "A" which is a collection of tuples from A with the same name-date-url -- try "illustrate B" or "describe B" to see what that looks like counts = foreach B generate flatten(group) as (name,
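
A rough plain-Python analogue of the group-then-generate pattern above. The snippet is cut off before the aggregate, so counting rows per (name, date, url) key is an assumption based on the relation being called `counts`; the sample rows are invented.

```python
from collections import Counter


def count_by_key(rows):
    # Each row is (name, date, url); the whole tuple plays the role of "group".
    return Counter(rows)


A = [
    ("ann", "2012-10-01", "/a"),
    ("ann", "2012-10-01", "/a"),
    ("bob", "2012-10-02", "/b"),
]
counts = count_by_key(A)
```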

Re: Pig storage and load functions and Cache

2012-10-17 Thread Dmitriy Ryaboy
. BinStorage is internal to pig, and you shouldn't use it unless you really know what you are doing. None of this is relevant to how pig optimizes queries. D On Sun, Oct 7, 2012 at 9:10 PM, Dmitriy Ryaboy wrote: > Pig has multi-query execution optimization built-in. If you compute > multiple r

Re: Help on running pig in local mode

2012-10-17 Thread Dmitriy Ryaboy
I think it's trying to find the staging directory set in your configuration, not finding it, and isn't able to create it. depending on your configs, that could be in different places, but usually it's looking under /tmp/mapred . Check permissions there. D On Mon, Oct 15, 2012 at 4:35 PM, lei tang

Re: NEED HELP in PigStorage

2012-10-12 Thread Dmitriy Ryaboy
Sounds like however you wrote the data, it has some sort of a binary delimiter. Figure out what that delimiter is, and tell PigStorage to use it. For example: my_data = load 'path/to/data' using PigStorage('\\u001'); D On Thu, Oct 11, 2012 at 10:23 AM, yogesh dhari wrote: > > Hi All , > > How t
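
A small sketch of the "figure out what that delimiter is" step, in plain Python. It assumes the delimiter turns out to be a single control character (Ctrl-A, byte 0x01, in the invented sample line); the helper names are hypothetical.

```python
def control_chars(line):
    """List control characters in a line, ignoring tab and line endings."""
    return sorted({ch for ch in line if ord(ch) < 32 and ch not in "\t\n\r"})


def split_record(line, delimiter="\x01"):
    """Split one record on the delimiter found above."""
    return line.rstrip("\n").split(delimiter)


sample = "alice\x0142\x01nyc\n"
found = control_chars(sample)
fields = split_record(sample, found[0])
```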

Re: Parallel Join with Pig

2012-10-12 Thread Dmitriy Ryaboy
The default partitioning algorithm is basically this: reducer_id = key.hashCode() % num_reducers If you are joining on values that all map to the same reducer_id using this function, they will go to the same reducer. But if you have a reasonable hash code distribution and a decent volume of uniqu
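
A minimal sketch of the rule above in plain Python. It imitates Java's `String.hashCode()` for simple strings and masks the sign bit before the modulo, which is how Hadoop's `HashPartitioner` keeps the result non-negative. The sample keys and the reducer count of 4 are made up.

```python
def java_string_hash(s):
    """Mimic Java's String.hashCode() (signed 32-bit) for BMP-only strings."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def reducer_id(key, num_reducers):
    # Mask the sign bit so the modulo never goes negative.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


keys = ["en", "fr", "jp", "en", "fr", "en"]
assignment = {k: reducer_id(k, 4) for k in keys}
```

Equal keys always land on the same reducer, so a join key with very few distinct values cannot be spread across more reducers than it has values.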

Re: How can I read Hive text files on S3 from Pig?

2012-10-12 Thread Dmitriy Ryaboy
Martin, Do you have the complete stack trace? Generally, for Hive interop I recommend HCatalog; AllLoader is neat but it's a 3rd party contrib and we don't really know it too well. I can check out the error dump and see if there's anything obvious though. D On Fri, Oct 12, 2012 at 8:48 AM, Martin

Re: Non static nested Algebraic functions and their constructor

2012-10-11 Thread Dmitriy Ryaboy
Yeah.. Joys of reflection. Note that if you are writing algebraics against pig 0.11 you probably want to extend AlgebraicEvalFunc -- that gives you the normal exec() and the accumulative implementation for free. D On Wed, Oct 10, 2012 at 10:20 AM, Ugljesa Stojanovic wrote: > Yeah i managed to fi

Re: Pig storage and load functions and Cache

2012-10-07 Thread Dmitriy Ryaboy
Pig has multi-query execution optimization built-in. If you compute multiple relations in your script that share parent relations, those parent relations will be computed only once. You don't have to do anything to make that happen. If you prefer to handle your own caching, you would have to handl

Re: Optimizations in pig

2012-10-04 Thread Dmitriy Ryaboy
bucketing and partitioning is just setting the files up right. you can do that explicitly. Pig also lets you push down any filtering and projection into the loader, as long as said loader is aware of how to deal with filters and projections. Using any such loader will give you the benefits. HCatLo

Re: regular expression as delimiter in PigStorage?

2012-09-28 Thread Dmitriy Ryaboy
Hi Lei, This is currently not supported. However one can always create a new loadfunc and implement his own parsing (perhaps by extending PigStorage and overriding the parsing bits). D On Fri, Sep 28, 2012 at 4:05 PM, lei tang wrote: > Hi, > > Is it possible to use a regular expression as a del

Re: Pig multiple groupby problem

2012-09-28 Thread Dmitriy Ryaboy
ry I am bit hazy over here... > > On Fri, Sep 28, 2012 at 3:12 PM, Dmitriy Ryaboy > wrote: > > > When you tried 2888, did you have pig.exec.mapPartAgg set to true, > > and pig.exec.mapPartAgg.minReduction set to a low value (2 or 3)? > > > > You said you ap

Re: Pig multiple groupby problem

2012-09-28 Thread Dmitriy Ryaboy
atch and see if that makes any > > difference.. > > > > Thanks very much for responding > > > > > > > > On Tue, Aug 28, 2012 at 11:45 PM, Dmitriy Ryaboy >wrote: > > > >> Couple of ideas: > >> > >> 1) do you need exact distinct counts? The

Re: How can I access secure HBase in UDF

2012-09-27 Thread Dmitriy Ryaboy
If someone figures this out all the way to working code, could you blog it? :) D On Thu, Sep 27, 2012 at 10:54 AM, Rohini Palaniswamy < rohini.adi...@gmail.com> wrote: > Ray, >In the frontend, you can do a new JobConf(HBaseConfiguration.create()) > and pass that to TableMapReduceUtil.initCred

Re: Using matches in generate clause?

2012-09-27 Thread Dmitriy Ryaboy
With Pig 0.9 you can do this, though: FOREACH html_pages GENERATE portal_id, (html matches 'some pattern' ? 1 : 0) as wp_match:int; On Thu, Sep 27, 2012 at 10:38 AM, Alan Gates wrote: > In Pig 0.9 boolean was not yet a first class data type, so boolean types > were not allowed in foreach stat
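
A plain-Python analogue of the ternary above, turning a pattern test into a 1/0 integer column. It assumes `matches` tests the whole string against a Java-style regular expression, which is modelled here with `re.fullmatch`; the function name and sample strings are invented.

```python
import re


def wp_match(html, pattern):
    """Return 1 if the entire string matches the pattern, else 0."""
    return 1 if re.fullmatch(pattern, html) is not None else 0


page = "<html>wp-content</html>"
whole = wp_match(page, ".*wp-content.*")   # pattern covers the whole string
partial = wp_match(page, "wp-content")     # substring only, so no match
```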

Re: Pig JasonParser

2012-09-26 Thread Dmitriy Ryaboy
son in Pig, not that I would recommend that). D On Wed, Sep 26, 2012 at 9:34 PM, Russell Jurney wrote: > Does that work without lzo? > > Russell Jurney http://datasyndrome.com > > On Sep 26, 2012, at 9:00 PM, Dmitriy Ryaboy wrote: > > > Try asking Michael May on GitHub? This

Re: Pig JasonParser

2012-09-26 Thread Dmitriy Ryaboy
Try asking Michael May on GitHub? This seems to be an issue with his Loader.. The JsonLoader in ElephantBird should work in this case if you turn on nested parsing ( https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/load/JsonLoader.java ) D On W

Re: processing .z files

2012-09-19 Thread Dmitriy Ryaboy
What are ".Z files"? On Wed, Sep 19, 2012 at 12:22 AM, Srini wrote: > Hello All, > > Is there any Build-in Load function for loading ".Z" files ? > > Regards, > Srini >

Re: Issues with SAMPLE in PIG v0.8.1

2012-09-18 Thread Dmitriy Ryaboy
ndering if anyone had seen this in their > scripts before. > >Brian > > > On Sun, Sep 16, 2012 at 10:24 PM, Dmitriy Ryaboy > wrote: > > > I just ran this very script three times using Pig 0.8 (svn revision > > 1148107) on a set of 2.5 million

Re: reuse same Tuple and ArrayList for every getNext call in LoadFunc?

2012-09-16 Thread Dmitriy Ryaboy
only supplies input to the mapper? > > When you are talking about downstream code from the loader that assumes that > each tuple is a new Tuple, is there any code in Pig that assumes that or are > you just talking about UDF's and other 3rd party libs that people write for > Pi

Re: Issues with SAMPLE in PIG v0.8.1

2012-09-16 Thread Dmitriy Ryaboy
rovide. > > brian > > > On Sun, Sep 16, 2012 at 10:02 PM, Dmitriy Ryaboy wrote: > >> Brian, could you provide a complete script that reproduces the issue? >> What version of pig are you on? >> >> Thanks, >> -D >> >> On Sun, Sep 1

Re: access schema defined in LOAD statement in custom LoadFunc?

2012-09-16 Thread Dmitriy Ryaboy
I am not sure why pushProjection doesn't solve your dilemma? This is what we use in HBaseStorage, and ElephantBird uses in thrift and protobuf loaders. D On Sun, Sep 16, 2012 at 8:11 PM, Jim Donofrio wrote: > I guess a workaround could be to Base64 decode the pig.script property and > look for A

Re: Issues with SAMPLE in PIG v0.8.1

2012-09-16 Thread Dmitriy Ryaboy
Brian, could you provide a complete script that reproduces the issue? What version of pig are you on? Thanks, -D On Sun, Sep 16, 2012 at 8:15 PM, Brian Choi wrote: > Yes - i saw this issue with SAMPLE() in multiple runs. The strangest thing > about this is that it approaches the correct values f

Re: How can I split the data with more reducers?

2012-09-16 Thread Dmitriy Ryaboy
> > On 2012-9-16, at 5:05 PM, Haitao Yao wrote: > >> here's the explain result compressed.(The apache mail server does not allow >> big attachments.) >> >> >> >> Haitao Yao >> yao.e...@gmail.com >> weibo: @haitao_yao >> Skype: haita

Re: reuse same Tuple and ArrayList for every getNext call in LoadFunc?

2012-09-16 Thread Dmitriy Ryaboy
I looked into this a while back -- trouble comes when something downstream from the loader tries to collect inputs into a bag, and doesn't do its own copies. One can easily argue that if someone wants to do such collection, it should be their responsibility to ensure they aren't just collecting the

Re: How can I split the data with more reducers?

2012-09-16 Thread Dmitriy Ryaboy
Still would like to see the script or the explain plan.. D On Sat, Sep 15, 2012 at 7:50 PM, Haitao Yao wrote: > No, I also thought it is a mapper , but It surely is a reducer. all the > mappers succeeded and the reducer failed. > > > > Haitao Yao > yao.e...@gmail.com > weibo: @haitao_yao > Skyp

Re: Approaches to storing arbitrary schema in a sequencefile

2012-09-15 Thread Dmitriy Ryaboy
We tend to write protobuf or thrift definition for complex objects, but that introduces severe latency into the development process. I suppose you could try something like kryo (and create a corresponding deserializer for EB).. the core of the problem is that you need to carry around the schema, an

Re: Apache Pig slides from the

2012-09-15 Thread Dmitriy Ryaboy
Wow, that's a fantastic presentation Adam! Nice job on all the examples and slides. D On Sat, Sep 15, 2012 at 3:16 AM, Adam Kawa wrote: > Hi All, > > I would like to share my slides from the presentation about Apache Pig > that I gave at the 3rd meeting of WHUG (Warsaw Hadoop User Group) a > cou

Re: Reading BytesWritable in sequence file

2012-09-13 Thread Dmitriy Ryaboy
execute goal > com.github.igor-petruk.protobuf:protobuf-maven-pl > ugin:0.4:run (default) on project elephant-bird-core: Unable to find > 'protoc' -> > [Help 1] > [ERROR] > > On Tue, Sep 11, 2012 at 4:24 PM, Mohit Anchlia wrote: > >> Thanks! I'll try it out. >

Re: Batching transformations in Pig

2012-09-13 Thread Dmitriy Ryaboy
Group, and pass the grouped sets to your batch-processing UDF? so: data: id1 bucket1 id2 bucket2 id3 bucket2 id4 bucket1 bucketized = group data by bucket_id; bucket1, { (id1, id4) } bucket2, { (id2, id3) } batch_processed = foreach bucketized generate MyUDF(data); D On Wed, Sep 12, 2012 at
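
The same idea as a plain-Python sketch, using the sample rows from the message. `my_udf` is a hypothetical stand-in for the batch-processing UDF; here it just joins the ids it receives.

```python
from collections import defaultdict


def group_by_bucket(rows):
    """Collect ids per bucket, like `group data by bucket_id`."""
    groups = defaultdict(list)
    for row_id, bucket in rows:
        groups[bucket].append(row_id)
    return dict(groups)


def my_udf(ids):
    # Hypothetical batch step: receives the whole bag for one bucket.
    return ",".join(ids)


data = [("id1", "bucket1"), ("id2", "bucket2"),
        ("id3", "bucket2"), ("id4", "bucket1")]
bucketized = group_by_bucket(data)
batch_processed = {b: my_udf(ids) for b, ids in bucketized.items()}
```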

Re: Reading BytesWritable in sequence file

2012-09-11 Thread Dmitriy Ryaboy
Yup: https://github.com/kevinweil/elephant-bird D On Tue, Sep 11, 2012 at 4:00 PM, Mohit Anchlia wrote: > Is it the code that I checkout and build? > > On Tue, Sep 11, 2012 at 3:27 PM, Dmitriy Ryaboy wrote: > >> Try the one in Elephant-Bird. >> >> On Tue, S

Re: Reading BytesWritable in sequence file

2012-09-11 Thread Dmitriy Ryaboy
Try the one in Elephant-Bird. On Tue, Sep 11, 2012 at 11:22 AM, Mohit Anchlia wrote: > Is there a way to read BytesWritable using sequence file loader from > piggybank? If not then how should I go about implementing one?

Re: Using LoadFunc to get arbitrary data into Pig script

2012-09-07 Thread Dmitriy Ryaboy
Hi Thomas, This isn't a complete answer, but take a look at mock.Storage that Julien wrote to make testing easy: http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/builtin/mock/Storage.java D On Fri, Sep 7, 2012 at 6:34 AM, Thomas Schlosser wrote: > Hi all, > does anybody know what th

Re: Machine Learning + Pig?

2012-09-06 Thread Dmitriy Ryaboy
Please take a look at Alek and Jimmy's paper on ML in Pig; there are also a few presentations they did on this, here's one from the Hadoop Summit: https://speakerdeck.com/u/lintool/p/large-scale-machine-learning-at-twitter Note also that Ted Dunning has taken some of the stuff we open-sourced that

Re: Extremely slow when loading small amount of data from HBase

2012-09-04 Thread Dmitriy Ryaboy
etting better. > Should I continue merging? > > > 2012/8/29 Dmitriy Ryaboy : >> Can you try the same scans with a regular hbase mapreduce job? If you see >> the same problem, it's an hbase issue. Otherwise, we need to see the script >> and some facts about your ta

Re: UDF Performance Problem

2012-09-03 Thread Dmitriy Ryaboy
That's cause you used "group all" which groups everything into one group, which by definition can only go to one reducer. What if instead you group into some large-enough number of buckets? A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int); A_PRIME = FOREACH A generate *, ROUND(RAN
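
A plain-Python sketch of the bucketing idea: tag each record with a random bucket, aggregate per bucket (the part that can run on many reducers), then combine the small per-bucket results. The bucket count of 10, the fixed seed and the use of a simple count as the aggregate are assumptions; the original script is cut off before those details.

```python
import random
from collections import defaultdict


def bucketed_count(record_ids, num_buckets=10, seed=0):
    rng = random.Random(seed)
    partial = defaultdict(int)
    for _ in record_ids:
        # Stage 1: a random bucket spreads records over many groups.
        partial[rng.randrange(num_buckets)] += 1
    # Stage 2: combining the per-bucket results is cheap.
    return sum(partial.values()), dict(partial)


total, per_bucket = bucketed_count(range(1000))
```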

Re: Custom DB Loader UDF

2012-09-02 Thread Dmitriy Ryaboy
You can also look at what Vertica did for their Pig connector: https://github.com/vertica/Vertica-Hadoop-Connector/blob/master/pig-connector/com/vertica/pig/VerticaLoader.java (it's apache licensed, so if you reuse any code, you have to indicate the Vertica copyright and apache license in credits

Re: wrong sort order (lexical vs numeric) in a nested foreach

2012-08-31 Thread Dmitriy Ryaboy
I tried to reproduce this and haven't been able to -- all my devious attempts to get something that is actually a string to show up as an int in "describe" wind up in class cast exceptions and blown up jobs (not devious enough, clearly). Can you put together an example that reproduces the iss

Re: Unable to open iterator

2012-08-28 Thread Dmitriy Ryaboy
Please take a look at your job tracker page. It will have a failed job, which will have failed tasks, which will have more detailed error logs. On Aug 28, 2012, at 5:52 PM, Mohit Anchlia wrote: > I have this simple pig script but when I run I get: > > 2012-08-28 17:50:24,924 [main] INFO > org

Re: Pig multiple groupby problem

2012-08-28 Thread Dmitriy Ryaboy
Couple of ideas: 1) do you need exact distinct counts? There are approximate distinct counting approaches that may be appropriate and much more efficient. 2) can you try with PIG-2888? On Aug 28, 2012, at 1:35 PM, Deepak Tiwari wrote: > Hi, > > I am processing huge dataset and need to aggrega

Re: Extremely slow when loading small amount of data from HBase

2012-08-28 Thread Dmitriy Ryaboy
Can you try the same scans with a regular hbase mapreduce job? If you see the same problem, it's an hbase issue. Otherwise, we need to see the script and some facts about your table (how many regions, how many rows, how big a cluster, is the small range all on one region server, etc) On Aug 27,

Re: Parameterized Expression in Filter

2012-08-27 Thread Dmitriy Ryaboy
I think you just want this: filt = filter colors_in by $color_filter; (no quotes) D On Mon, Aug 27, 2012 at 1:50 PM, Duckworth, Will wrote: > I am trying to use a parameter as the expression in a filter. > > Assuming: > > colors_in = load ‘$in_path’ as (color:chararray); > flt = filter colors_

Re: Question Regarding HBaseStorage Pig 0.8.1

2012-08-25 Thread Dmitriy Ryaboy
It works. Dan, pig should have printed out the name of a file it's logging errors to. That file will have a more complete error trace. Can you send that? D On Sat, Aug 25, 2012 at 5:43 PM, Subir S wrote: > I think HBaseStorage does not work in this version of pig. There were > few JIRAs, I cann

Re: Pig testing in maven projects

2012-08-22 Thread Dmitriy Ryaboy
Yeah, these should be published to maven. D On Wed, Aug 15, 2012 at 3:49 AM, Віталій Тимчишин wrote: > Hello. > > We are starting to use pig for our data analysis. > To be exact, actual work will be performed by amazon elastic map reduce. > That's why we are using 0.9.2 for now. > Everything wor

Re: Javascript UDFs don't work in local mode? At all?

2012-08-15 Thread Dmitriy Ryaboy
This class is part of the Sun Java 6 JDK . What version of Java are you running? You should have something along the lines of /usr/lib/jvm/java-6-openjdk/jre/lib/rhino.jar on your classpath. Dmitriy On Wed, Aug 15, 2012 at 9:46 AM, Russell Jurney wrote: > Cross posting in hopes a user has this

Re: Operator and Function Reference

2012-08-13 Thread Dmitriy Ryaboy
That would be quite handy I think. D On Thu, Aug 9, 2012 at 12:24 PM, Xavier Stevens wrote: > Does anyone else think it would make sense to have all operators and > functions listed on a single page somewhere as a reference? Right now they > are split up over the "Pig Latin Basics" and "Built In

Re: Distributed accumulator functions

2012-08-13 Thread Dmitriy Ryaboy
For CSV excel, check out http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html D >> Also, is PigStorage compatible with the quoting expected by excel >> tab-delimited files? AIUI that would require quoting the values with >> "value\tvalue" and escaping doub

Re: Using Distributed Cache in PIG

2012-08-13 Thread Dmitriy Ryaboy
You are talking about changing the way hadoop works; something like this would be transparent to Pig. Note that Hadoop Distributed Cache != "distributed memory cache". I suppose you could replace the value of fs.file.impl from org.apache.hadoop.fs.LocalFileSystem to something else.. might be qui

Re: Pig 0.10.0 slow startup

2012-08-13 Thread Dmitriy Ryaboy
Julien removed a dozen or so loader/storer instantiations. That can do it if you do work in constructors. D On Fri, Aug 10, 2012 at 1:15 PM, Prashant Kommireddi wrote: > Thanks Chun. > > Jon, any idea what on 0.11 might have fixed it? > > On Thu, Aug 9, 2012 at 3:32 PM, Chun Yang > wrote: > >> I

Re: [pig-0.10.0 e2e test]: some tests failed with "Sort check failed" error

2012-08-06 Thread Dmitriy Ryaboy
I'm just curious, why do you expect pig 0.10 tests to succeed on 0.9.2? D On Mon, Aug 6, 2012 at 6:57 AM, lulynn_2008 wrote: > Hi All, > I am running pig-0.10.0 e2e test with pig-0.9.2 and hadoop-1.0.3. There are > 12 tests failed with "Sort check failed" error. > > I list Order_6 as a example

Re: pig with Hbase

2012-08-06 Thread Dmitriy Ryaboy
Sounds like your hbase conf is not on the classpath. D On Mon, Aug 6, 2012 at 11:31 AM, Mohit Anchlia wrote: > I am trying to read records from HBase using HBaseStorage. When I execute > simple load I get this error. I think I am missing some property, but I am > running pig on the cluster where

Re: Next Pig Hackathon

2012-07-31 Thread Dmitriy Ryaboy
Won't be able to make it. Would love to see what you guys come up with about the UDFs. D On Mon, Jul 30, 2012 at 9:42 AM, Alan Gates wrote: > Hortonworks will be hosting the next Pig Hackathon on August 24th. > http://www.meetup.com/PigUser/events/75286212/ > > The agenda: > > - Help newcomers

Re: DATA not storing as comma-separted

2012-07-25 Thread Dmitriy Ryaboy
>> - Failed to produce result in: "file:/tmp/temp61624047/tmp1087576502" >>>> 2012-07-25 17:20:36,107 [main] INFO >>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher >>>> - Failed! >>>> 2012-07-25 17:20

Re: DATA not storing as comma-separted

2012-07-25 Thread Dmitriy Ryaboy
Using the store expression you wrote should work. Dump is its own thing and doesn't know anything about the format you store things in. To see files created on hdfs, you can use cat. On Jul 25, 2012, at 3:48 AM, wrote: > Hi All, > > I am new to PIG, trying to store data in HDFS as comma sepa
