Re: Need help getting started.

2014-01-11 Thread Ruslan Al-Fakikh
Usually Hadoop is used within a distro. Those can be Cloudera, Hortonworks, EMR, etc. On Jan 11, 2014 at 2:05, "Mariano Kamp" wrote: > Hi Josh. > > Ok, got it. Interesting. > > Downloaded ant, recompiled and now it works. > > Thank you. > > > On Fri, Jan 10, 2014 at 10:16 PM, Josh Elser

Re: trying to understand local mode and core-site.xml

2014-01-11 Thread Ruslan Al-Fakikh
For simple testing you can use the Cloudera QuickStart VM. All the LZO stuff can be configured in Cloudera Manager with a few clicks. On Jan 10, 2014 at 21:55, "Peter Sanford" wrote: > Hello everybody! > > I'm getting started with pig and I'm trying to understand how to > configure io.comp

Re: AvroStorage schema_uri pointing to local file doesn't work

2013-12-25 Thread Ruslan Al-Fakikh
vro.AvroStorage('{ "index" : 1, "schema": $SCHEMA_LITERAL}'); Best Regards, Ruslan Al-Fakikh On Wed, Dec 25, 2013 at 11:48 AM, Cheolsoo Park wrote: > avro to bcc: > > >> Why can't it use the schema file from front-end invocation? >

AvroStorage schema_uri pointing to local file doesn't work

2013-12-24 Thread Ruslan Al-Fakikh
I am running the pig script from, cannot find the file in the local file system. Why can't it use the schema file from front-end invocation? Does it mean that I am limited to either HDFS locations for schema_uri or embedding the schema string in AvroStorage parameters? Thanks in advance Ruslan Al-Fakikh

Re: Help splitting a line into multiple lines

2013-12-24 Thread Ruslan Al-Fakikh
I guess you are getting a bag of tuples here. Try to apply FLATTEN on the bag. Thanks On Wed, Dec 18, 2013 at 12:20 AM, Tim Robertson wrote: > Hi all, > > I am new to Pig, and struggle to split up a long text line into multiple > lines. > I have an input format from a legacy mysqldump like: > >
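A minimal sketch of the FLATTEN-on-a-bag suggestion above, using the builtin TOKENIZE as a stand-in for whatever produces the bag (the path and field names are hypothetical):

    raw = LOAD 'input.txt' AS (line:chararray);
    -- TOKENIZE returns a bag of word tuples; FLATTEN turns each element into its own output row
    exploded = FOREACH raw GENERATE FLATTEN(TOKENIZE(line)) AS word;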

Re: Unoverride load in PigUint

2013-12-21 Thread Ruslan Al-Fakikh
I did On Sat, Dec 21, 2013 at 9:44 PM, Serega Sheypak wrote: > https://issues.apache.org/jira/browse/PIG-3638 > "like" it :) > > > 2013/12/21 Ruslan Al-Fakikh > > > It seems to be a heavy PigUnit limitation. Maybe you can open a jira for > > this?:) &

Re: Unoverride load in PigUint

2013-12-21 Thread Ruslan Al-Fakikh
to use "native" loader/storage. > Looks like that the only solution is to create wrapper: data-driven tester > which feeds script to local pig server and verifies output. > We did in Megafon. I tried to use "recommended approach" - pig unit for my > own purposes. >

Re: Unoverride load in PigUint

2013-12-20 Thread Ruslan Al-Fakikh
Hi Serega! Have you resolved the issue? I am going to encounter the same problem, but I don't know a solution. Thanks On Sun, Dec 15, 2013 at 6:07 PM, Serega Sheypak wrote: > Hi! > By default PigUnit does override LOAD statements > Is there any possiblity to void this? > I'm using AvroStorage

Re: ON ERROR

2013-12-20 Thread Ruslan Al-Fakikh
Hi Russell, Could you be more specific? What would this operator do? Does it have something to do with control logic (like IF/ELSE, WHILE, etc.)? AFAIK, those are not present in Pig because they would make Pig less clean. Thanks On Sat, Dec 21, 2013 at 12:31 AM, Russell Jurney wrote: > Does anyon

Re: Pig syntax to access fields of records in an array

2013-11-28 Thread Ruslan Al-Fakikh
I think your expression ends up with a bag with just that column. Can you give the full context where it is not working? On Nov 28, 2013 at 2:14, "ey-chih chow" wrote: > Hi, > > We have an Avro file of which a field that is an array of tuples as > follows: > > > cam:bag{ARRAY_ELEM:tup

Re: Multiple Input Schemas in AvroStorage() fails

2013-11-27 Thread Ruslan Al-Fakikh
I guess you need to specify 'multiple_schemas' in AvroStorage On Thu, Nov 28, 2013 at 4:07 AM, Mangtani, Kushal < kushal.mangt...@viasat.com> wrote: > Hi, > > I'm one of the several Pig Developer/User community. I have a question > regarding Avro 1.6.1 and Pig 0.11 compatibility. In ref to > https:/

Re: read csv file as schema

2013-11-27 Thread Ruslan Al-Fakikh
In my company we had to write our own Loader/Storer UDFs for this. On Wed, Nov 27, 2013 at 6:00 PM, Noam Lavie wrote: > Hi, > > Is there a way to load a csv file with header as schema? (the header's > fields are the properties of the schema and the other list in the csv file > will be in the sc

Re: Pig 0.12.0 issue wirth Hadoop 2.2

2013-11-20 Thread Ruslan Al-Fakikh
I've had a similar issue before. Not sure if I had the same versions. This helped me: the solution was to compile with the -Dhadoopversion=23 option On Nov 20, 2013 at 12:41, "Hiro Gangwani" wrote: > Dear Team, > > I have downloaded 0.12.0 version of pig and trying to use with Hadoop 2.

Re: Pig Schema contains a name that is not allowed in Avro

2013-11-19 Thread Ruslan Al-Fakikh
Hey Johannes! Have you solved the problem? I also see it. But I don't see it when I pass the schema as a string in the AvroStorage parameters. I see it only when I try to use an external schema file. And if I specify a non-existent external schema file, the error is the same. Ruslan On Tue, Oct 22, 2

Re: pass relation value to the file path of store command

2013-11-18 Thread Ruslan Al-Fakikh
Hi soniya, In your example you are hard-coding the ID to 2 in your filter statement. You could also hard-code it in STORE: STORE A INTO '/main/abc/2'; If you want to separate the rows by the value of a field then you could try MultiStorage from piggybank; see the sketch below. Thanks On Mon, Nov 18, 2013 at 6:01 AM, s
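A minimal MultiStorage sketch for splitting output by a field's value (paths and the field index are hypothetical); it requires registering the piggybank jar:

    REGISTER piggybank.jar;  -- path to the jar is environment-specific
    A = LOAD '/main/input' USING PigStorage('\t') AS (id:chararray, value:chararray);
    -- '0' is the index of the field whose values become output subdirectories under /main/abc
    STORE A INTO '/main/abc' USING org.apache.pig.piggybank.storage.MultiStorage('/main/abc', '0');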

Re: STORE USING AvroStorage - ignores Pig field names, only using their position

2013-11-16 Thread Ruslan Al-Fakikh
including this last message to pig user list On Sun, Nov 17, 2013 at 7:40 AM, Ruslan Al-Fakikh wrote: > Russel, > > Actually this problem came from the situation when I had the same names in > pig relation schema and avro schema. And it turned out that AvroStorage > switches fiel

Re: STORE USING AvroStorage - ignores Pig field names, only using their position

2013-11-16 Thread Ruslan Al-Fakikh
Thanks, Russell! Do you mean that this is the expected behavior? Shouldn't AvroStorage map the Pig fields by their names (not their field order), matching them to the names in the Avro schema? Thanks, Ruslan Al-Fakikh On Sun, Nov 17, 2013 at 6:53 AM, Russell Jurney wrote: > Pig tup

STORE USING AvroStorage - ignores Pig field names, only using their position

2013-11-16 Thread Ruslan Al-Fakikh
* --{"b":"data_a","nonsense_name":"data_b"} --{"b":"data_a","nonsense_name":"data_b"} AvroStorage is build from the latest piggybank code. Using AvroStorage "debug": 5 parameter didn't help. $ pig -version Apache Pig version 0.11.0-cdh4.3.0 (rexported) compiled May 27 2013, 20:48:21 Any help would be appreciated. Thanks, Ruslan Al-Fakikh

Re: SQL- WHILE in Pig

2013-11-13 Thread Ruslan Al-Fakikh
There is no such control logic in Pig. Maybe the FOREACH statement can help, but it's not a loop, rather a processing operator. Also you may want to use Pig embedding to launch Pig from other languages. Thanks On Wed, Oct 30, 2013 at 11:45 AM, Murphy Ric wrote: > I have a code in SQL to con

Re: ORDER BY a map value fails with a syntax error - pig bug?

2013-10-29 Thread Ruslan Al-Fakikh
pressions." > > I think "you cannot order ... by expressions" means the behavior you see > is expected. > > William F Dowling > Senior Technologist > Thomson Reuters > > > -Original Message- > From: Ruslan Al-Fakikh [mailto:metarus..

ORDER BY a map value fails with a syntax error - pig bug?

2013-10-29 Thread Ruslan Al-Fakikh
s one: LOAD 'input' AS (M:map []); named = foreach A generate *, M#'key1' as myfield; sorted = ORDER named BY myfield; dump sorted; is OK Is it a bug in Pig? Best Regards, Ruslan Al-Fakikh
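The workaround above, reconstructed as a complete sketch (the input path and map key are hypothetical): project the map value into a named field first, then ORDER BY the named field.

    A = LOAD 'input' AS (M:map[]);
    named = FOREACH A GENERATE *, M#'key1' AS myfield;
    sorted = ORDER named BY myfield;
    DUMP sorted;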

Re: Small files

2013-09-30 Thread Ruslan Al-Fakikh
Hi, It says that your command returns a non-zero code. Does it return it when you invoke it manually outside of Pig? Otherwise I don't think I have any valuable ideas. Thanks On Mon, Sep 30, 2013 at 10:37 AM, Anastasis Andronidis < andronat_...@hotmail.com> wrote: > Hello again, > > any comme

Re: Need to parse the data from [ ]

2013-09-26 Thread Ruslan Al-Fakikh
I suppose you need to use RegExp groups for that, something like ([(.*),(.*)...]), and I think you need to escape the []. Basically this is not a Pig problem; I would test the RegExp in Java first. Ruslan On Thu, Sep 26, 2013 at 4:36 PM, Muni mahesh wrote: > *Input Data :* > > ([37.77916,-122.42
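A minimal sketch using the builtin REGEX_EXTRACT_ALL with capture groups (the pattern and the lat/lon field names are hypothetical, with the brackets escaped as suggested above):

    raw = LOAD 'input.txt' AS (line:chararray);
    -- matches a line shaped like ([<num>,<num>]) and pulls out the two groups
    parsed = FOREACH raw GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '\\(\\[(.*),(.*)\\]\\)')) AS (lat:chararray, lon:chararray);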

Re: Create hive tables dynamically from pig script

2013-09-26 Thread Ruslan Al-Fakikh
Probably you'll need Pig embedding: http://pig.apache.org/docs/r0.11.1/cont.html for doing some logic that is not MapReduce and depends on your Pig script's output. For loading data into the created tables, you can take a look at HCatalog, though I am not sure whether your very old version of the Hadoop dist

Re: what are the compatible versions between Pig and HBase

2013-09-25 Thread Ruslan Al-Fakikh
Hi, Are you trying to install them yourself? Usually a Hadoop distro is used (Cloudera, Hortonworks, Amazon EMR, etc.) and the components are already compatible within a distro. Thanks On Wed, Sep 25, 2013 at 2:02 PM, yonghu wrote: > hello, > > Can anyone give me a list of compatible Versions between Pi

Re: ISOToUNix working in Pig 0.8.1 but not in Pig 0.11.0

2013-09-20 Thread Ruslan Al-Fakikh
What was the error? Not an issue, but why do you call the columns dt1, dt2 and then refer to them by ordinal number ($0) instead of by name? On Fri, Sep 20, 2013 at 6:00 PM, Muni mahesh wrote: > Hi Hadoopers, > > I did the same thing in Pig 0.8.1 but not Pig 0.11.0 > > register /usr/lib/pig/piggyba

Re: Pig Parameter Substitution

2013-09-16 Thread Ruslan Al-Fakikh
Hi, Not sure whether it helps, but I did a lot of testing in such cases. "Test and see" was my main approach. It is really tricky sometimes. Also you can try the -dryrun option when launching pig. Best Regards, Ruslan Al-Fakikh https://www.odesk.com/users/~015b7b5f617eb89923 On T

Re: Splitting by unique values in a relation

2013-09-15 Thread Ruslan Al-Fakikh
found piggybank's MultiStorage method much closer to > what > > I > > > am looking for. I was just wondering is there a better or different way > > to > > > do the same. > > > > > > Regards > > > Praveenesh > > > > >

Re: Splitting by unique values in a relation

2013-09-15 Thread Ruslan Al-Fakikh
Hi! Have you tried the SPLIT operator? http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT After splitting the relation into two separate relations you can STORE them into different locations. Best Regards, Ruslan Al-Fakikh https://www.odesk.com/users/~015b7b5f617eb89923 On Sun, Sep 15, 2013
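A minimal sketch of the SPLIT-then-STORE approach (field names, conditions, and paths are hypothetical):

    A = LOAD '/data/input' USING PigStorage('\t') AS (category:chararray, value:int);
    SPLIT A INTO small IF value < 100, large IF value >= 100;
    STORE small INTO '/data/out/small' USING PigStorage('\t');
    STORE large INTO '/data/out/large' USING PigStorage('\t');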

Re: self join in pig

2013-09-15 Thread Ruslan Al-Fakikh
Best Regards, Ruslan Al-Fakikh https://www.odesk.com/users/~015b7b5f617eb89923 On Sun, Sep 15, 2013 at 10:10 PM, Shahab Yunus wrote: > You need to load your data twice and then use it as any other join. > Self-join is just like any other join to Pig. > > Regards, > Shahab > &

Re: COALESCE UDF?

2013-09-04 Thread Ruslan Al-Fakikh
Hi, I think you could mimic it with an expression like this: b = foreach a generate ((field1 is null) ? ((field2 is null) ? null : field2) : field1); Hope that helps, Ruslan On Wed, Sep 4, 2013 at 9:50 AM, Something Something < mailinglist...@gmail.com> wrote: > Is there a UDF in Piggybank (or
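A runnable sketch of the same idea (field names are hypothetical); for two fields the simpler (field1 IS NOT NULL ? field1 : field2) is equivalent to the nested bincond above:

    a = LOAD 'input' USING PigStorage('\t') AS (field1:chararray, field2:chararray);
    -- returns field1 when present, otherwise falls back to field2, like SQL COALESCE
    b = FOREACH a GENERATE ((field1 IS NOT NULL) ? field1 : field2) AS coalesced;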

Re: Avro to Tuples during UnitTest

2013-08-30 Thread Ruslan Al-Fakikh
; >>logic to a java method outside of the UDF and test it within normal junit > Who will convert sample avro data for me to Tuples and feed them to java > method? I don't want to invent the wheel and repeat AvroStorage > functionality. > > > 2013/8/30 Ruslan Al-Fakikh

Re: Reading json file.

2013-08-30 Thread Ruslan Al-Fakikh
Hi, There are different JSON loaders available, but none of them worked for me when I had to deal with JSON. I ended up loading the file as a text file, reading one line at a time, and parsing the JSON inside my UDF with a Java JSON library Best Regards, Ruslan On Fri, Aug 30, 2013 at 2:53 AM, j

Re: Avro to Tuples during UnitTest

2013-08-30 Thread Ruslan Al-Fakikh
Hi, What exactly do you want to test? The logic inside UDFs? In that case I would recommend not worrying about the input format of the whole Pig script. You can use plain text files as input for the test. Or you can extract the logic to a Java method outside of the UDF and test it within normal JUnit

Re: Misplaced pigsample_123456.... file fails the pig job !

2013-08-29 Thread Ruslan Al-Fakikh
Which Hadoop distro are you using? I've heard Hortonworks has a Windows-compatible Hadoop. On Wed, Aug 28, 2013 at 2:36 PM, Darpan R wrote: > Hi folks, > I am facing a wiered issue. > I am running PIG 0.11 on windows7/64 bit machine with latest version of > cygwin. > > I am a weblog which I wan

Re: Date Function in Pig

2013-08-27 Thread Ruslan Al-Fakikh
Hi, I think the easiest way would be to use the piggybank conversion functions for such tasks: http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/datetime/convert/ Best Regards, Ruslan On Mon, Aug 26, 2013 at 7:43 PM, Serega Sheypak
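A minimal sketch using those piggybank conversion functions (the input format string, path, and field names are hypothetical):

    REGISTER piggybank.jar;  -- path to the jar is environment-specific
    DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
    DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
    raw = LOAD 'input' AS (ts:chararray);
    -- convert a custom-formatted timestamp to ISO, then to Unix time in milliseconds
    converted = FOREACH raw GENERATE ISOToUnix(CustomFormatToISO(ts, 'yyyy-MM-dd HH:mm:ss')) AS epoch_ms;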

Re: dev How can I add a row number per input file to the data

2013-08-21 Thread Ruslan Al-Fakikh
Hi! Probably these can help: http://pig.apache.org/docs/r0.11.1/basic.html#rank http://pig.apache.org/docs/r0.11.1/func.html#pigstorage (look for -tagsource) I've never tried this, but probably you could group by tagsource and then apply RANK Ruslan On Fri, Aug 16, 2013 at 6:17 AM, Leo wrote:
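A minimal sketch of the two building blocks mentioned above (paths and field names are hypothetical); note that RANK on its own gives a global row number, so a per-file number would still need extra work on top of this:

    -- assumes Pig 0.11+ (for RANK) and a PigStorage that supports the -tagsource option
    raw = LOAD '/data/input' USING PigStorage('\t', '-tagsource') AS (filename:chararray, line:chararray);
    ranked = RANK raw;  -- prepends a sequential row number as the first field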

Re: Removing characters from a bag

2013-06-29 Thread Ruslan Al-Fakikh
here > > Those are 2 lines however it gets broken down as 5 lines because of \n in > between and the real line ends. I tried to use foreach generate > REPLACE('\n',''); . Is that the right thing to do? Does it replace all \n > or only the first one? > > On Tue

Re: Removing characters from a bag

2013-06-25 Thread Ruslan Al-Fakikh
Hi Mohit, I don't clearly understand your use case. It depends on how you read the input and how you use the newlines: as the row separator, or just inside a row as a normal character. Can you put together a simple example of the input and output that you need? Thanks On Mon, Jun 24, 2013 at 10:18 PM, Mohit

Re: nested FOREACH statements

2013-06-25 Thread Ruslan Al-Fakikh
be 2 levels of nesting: http://hortonworks.com/blog/new-features-in-apache-pig-0-10/ see Nested Cross/Foreach Hope that helps Ruslan Al-Fakikh On Fri, Jun 21, 2013 at 7:09 PM, Adamantios Corais < adamantios.cor...@gmail.com> wrote: > It seems that group is not supported in nest

Re: count total number of tuples in a bag?

2013-06-25 Thread Ruslan Al-Fakikh
Hi! What are you trying to do with define c COV('a','b','c') exactly? Can you try out = foreach grp generate group, COV(A.$0,A.$1,A.$2); without the define statement? Ruslan Al-Fakikh On Tue, Jun 18, 2013 at 1:17 PM, achile wandji wrote: > Hi, > I'

Re: save several 64MB files in Pig Latin

2013-06-10 Thread Ruslan Al-Fakikh
Hi Pedro, Yes, Pig Latin is always compiled to MapReduce. Usually you don't have to specify the number of mappers (I am not sure whether you really can). If you have a file of 500MB and it is splittable then the number of mappers is automatically equal to 500MB / 64MB (the block size), which is around

Re: Tracking parts of a job taking the most time

2013-06-06 Thread Ruslan Al-Fakikh
questions On Wed, Jun 5, 2013 at 2:29 PM, John Meek wrote: > hi Ruslan , > Not sure how to do this? Can you be specific?? Whats DAG? Thanks. > > > > > > > > -Original Message- > From: Ruslan Al-Fakikh > To: user > Sent: Wed, Jun 5, 2013 4:04 am &

Re: Tracking parts of a job taking the most time

2013-06-05 Thread Ruslan Al-Fakikh
Hi! You can look at the Pig script stats after the script is finished. There is a DAG of MR jobs there. You can look at the individual MR jobs' stats to see how much time each MR job takes Ruslan On Wed, Jun 5, 2013 at 10:15 AM, Johnny Zhang wrote: > How about disable multi-query execution an

Re: DBStorage incompatibility with other Storage in pig Script

2013-05-27 Thread Ruslan Al-Fakikh
I'd recommend trying Sqoop for RDBMS-related tasks. On Mon, May 27, 2013 at 4:41 PM, Hardik Shah wrote: > DBStorage is not working with other storage in pig script. means DBStorage > is not working with multiple storage statement. > > What I was trying for: 1) I was trying to Store one output u

Re: Multiple STORE statements in one script

2013-05-09 Thread Ruslan Al-Fakikh
ve to split the processing and first generate multiple HDFS files > and then use SQOOP to load RDMS, then why not write few more short PIG > scripts to load those HDFS files in RDMS? > > Regards, > Shahab > > > On Wed, May 8, 2013 at 12:27 PM, Ruslan Al-Fakikh >wrote: &g

Re: Json array parsing issue with Pig JsonLoader

2013-05-08 Thread Ruslan Al-Fakikh
I was also having issues with the builtin JsonLoader and tried some other loaders: Elephant-bird (which doesn't work with CDH 4 :( ), Mozilla Aleka. There is also another JsonLoader in piggybank in some newer version of Pig. I ended up just loading data as text and processing it inside a UDF with a

Re: Multiple STORE statements in one script

2013-05-08 Thread Ruslan Al-Fakikh
Hi, It is possible to have multiple store statements, but I can't tell why you have nothing in the result. I recommend splitting the task between the appropriate tools: store everything in HDFS and then run Sqoop to upload the data to an RDBMS. Ruslan On Wed, May 8, 2013 at 6:11 PM, Shahab Yunus wrote:

Re: Filter on tuple question, and how to deal with dity datas?

2013-04-19 Thread Ruslan Al-Fakikh
Hi, Q1: maybe there is something wrong with the UDF itself? Q2: How do you identify the data as dirty? If one of your 6 fields is null, then you could do something like: FILTER BY ($0 IS NULL OR $1 IS NULL...) Ruslan On Fri, Apr 19, 2013 at 6:57 AM, 何琦 wrote: > > Hi, > > Q1:I have a question about h

Re: Unable to load data using PigStorage that was previously stored using PigStorage

2013-04-19 Thread Ruslan Al-Fakikh
; > > > > Beginning in Pig 0.9, a map can declare its values to all be of the > > > same > > > > > type... " > > > > > > > > > > I agree that all values in the map can be of the same type but this > > is > > > > not > > > > >

Re: Unable to load data using PigStorage that was previously stored using PigStorage

2013-04-17 Thread Ruslan Al-Fakikh
orker.runTask(ThreadPoolExecutor.java:895) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:680) > > > Best Regards, > > Jerry > > > On Wed, Apr 17, 2013 at 3:26 PM, Ruslan Al-Faki

Re: Unable to load data using PigStorage that was previously stored using PigStorage

2013-04-17 Thread Ruslan Al-Fakikh
$Worker.runTask(ThreadPoolExecutor.java:895) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:680) > > Best Regards, > > Jerry > > > On Wed, Apr 17, 2013 at 1:24 PM, Ruslan Al-Fakikh >

Re: Unable to load data using PigStorage that was previously stored using PigStorage

2013-04-17 Thread Ruslan Al-Fakikh
Hey, and as for converting a map of tuples, probably I got you wrong. If you can get to every value manually within a FOREACH then I see no problem in doing so. On Wed, Apr 17, 2013 at 9:22 PM, Ruslan Al-Fakikh wrote: > I am not sure whether you can convert a map to a tuple. > But I am c

Re: Unable to load data using PigStorage that was previously stored using PigStorage

2013-04-17 Thread Ruslan Al-Fakikh
e flatten(b); > > I don't have controls over the input. It is passed as Map of Maps. I guess > it makes lookup easier using a map with keys. > > Can I convert map to tuple? > > Best Regards, > > Jerry > > > > On Wed, Apr 17, 2013 at 11:57 AM, Ruslan Al-Fak

Re: Stopping after load with no tuples?

2013-04-17 Thread Ruslan Al-Fakikh
h Pig with a bash wrapper. 2) Embed Pig into Java or Python, etc. (just like you would embed SQL into a regular language). Pig supports it out of the box 3) Use Oozie or something similar for job orchestration. Hope that helps Ruslan Al-Fakikh On Tue, Apr 9, 2013 at 5:28 PM, John Far

Re: how to get the detailed warn or error log message from pig udfs

2013-04-17 Thread Ruslan Al-Fakikh
Hi Lucas, It seems that you are using org.apache.pig.EvalFunc.warn(String, Enum), which acts differently. Check the code or Javadocs. It works through Hadoop counters, I guess. You can use regular log4j warnings or just System.out.println. But keep in mind that your UDF is implemented on a remote

Re: Unable to load data using PigStorage that was previously stored using PigStorage

2013-04-17 Thread Ruslan Al-Fakikh
Hi Jerry, I would recommend debugging the issue step by step. Just after this line: A = load 'data.txt' as document:[]; and then right after that: DESCRIBE A; DUMP A; and so on... To be honest I haven't used maps that much. Just curious, why did you choose to use them? You can also use regular tup

Re: Classpath issues with a custom loadfunc

2013-04-16 Thread Ruslan Al-Fakikh
Hey Niels, This is not a Pig question, it is more of a Java packaging question. What exactly went wrong with the maven assembly plugin? Maybe the maven shade plugin would work better? (though I've never tried it myself) For me the simplest way is to just register all the needed dependencies and I

Re: ORDER failed

2013-04-13 Thread Ruslan Al-Fakikh
Hi Lei, It seems there is something wrong with creating a sampler. The ORDER command is not trivial: it works by creating a sampler, and I guess something went wrong with it: Input path does not exist: file:/home/dliu/ApacheLogAnalysisWithPig/pigsample_259943398_1365820592017 I suppose pigsample is n

Re: Debugging UDFs

2013-04-13 Thread Ruslan Al-Fakikh
James, Try to execute in mapreduce mode, at least on a pseudo-distributed cluster, and try to find them in the specific tasks' logs. Also you can try to throw an exception, just to make sure your code is actually getting there, something like if (true) throw new RuntimeException("My warning"); Best Rega

Re: Pig JasonParser

2013-04-09 Thread Ruslan Al-Fakikh
: > https://github.com/rangadi/elephant-bird/tree/hadoop-2.0-support > > > On Thu, Apr 4, 2013 at 6:39 PM, Ruslan Al-Fakikh >wrote: > > > Hi guys, > > > > As for elephant-bird, it seems that it is not compatible with Pig 0.10 > > (CDH4) :( > > I am using th

Builtin JsonLoader: how to load a flat array?

2013-04-04 Thread Ruslan Al-Fakikh
Hey guys, I have a complex json file. I can load simple properties, but I am having problems with a property that has an array as its value. Suppose I have input.json with the contents: {"images": ["url1","url2"]} when I do: a = LOAD 'input.json' using JsonLoader('images: {(image: chararray)}');

Re: Pig JasonParser

2013-04-04 Thread Ruslan Al-Fakikh
Hi guys, As for elephant-bird, it seems that it is not compatible with Pig 0.10 (CDH4) :( I am using this configuration: pig -version Apache Pig version 0.10.0-cdh4.1.1 (rexported) hadoop version Hadoop 2.0.0-cdh4.1.1 and getting just the same error as Tim explained: java.lang.IncompatibleClassCha

Re: JsonLoader schema field order shouldn't matter

2013-04-04 Thread Ruslan Al-Fakikh
Tim, have you resolved the issue of using the elephant-bird with pig 0.10? meghana, I am using just the same configuration: pig -version Apache Pig version 0.10.0-cdh4.1.1 (rexported) hadoop version Hadoop 2.0.0-cdh4.1.1 and getting just the same error as Tim explained: java.lang.IncompatibleCla

Re: removing last item in a bag

2013-03-15 Thread Ruslan Al-Fakikh
1917004,200409672,2013-02-01 > 21:29:45),(S:382290531917004,200443484,2013-02-01 21:24:19)},3) > > The error is not present when I comment out the "last_removed..." line and > uncommented out the one below it. > > > > > On Tue, Mar 12, 2013 at 8:06 PM, Ruslan Al-Fak

Re: removing last item in a bag

2013-03-12 Thread Ruslan Al-Fakikh
Chan, Sorry, I meant ordered = ORDER inputData BY date; not ordered = ORDER inputData BY key; On Wed, Mar 13, 2013 at 7:06 AM, Ruslan Al-Fakikh wrote: > Hi Chan, > > Your tasks seems to be not trivial in Pig. Basically bags are not ordered, > so you have to either sort before or to

Re: removing last item in a bag

2013-03-12 Thread Ruslan Al-Fakikh
f it helps. Best Regards, Ruslan Al-Fakikh On Wed, Mar 13, 2013 at 4:40 AM, Johnny Zhang wrote: > Hi, Chan: > That's fine. How did you generate the bag with different size in runtime. > It will be easier for me to come out a solution by this information. > Thanks. > > Johnny &

Pig 0.10.0-cdh4.1.1 uses its old JodaTime instead of my new JodaTime dependency in UDF

2013-01-20 Thread Ruslan Al-Fakikh
Hi guys, I am having a JodaTime Maven version issue. I have a Java UDF in the form of a Maven project with this dependency: joda-time:joda-time:2.1 Pig itself is dependent on JodaTime 1.6: https://issues.apache.org/jira/browse/PIG-3031 When my UDF uses a method that exists only in the new versio

Pig 0.10.0-cdh4.1.1: Annoying deprecation warnings

2013-01-20 Thread Ruslan Al-Fakikh
Hi guys, When running Pig I have a lot of WARNs like these: 2013-01-20 19:09:21,318 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS 2013-01-20 19:09:22,756 [main] WARN org.apache.hadoop.conf.Configuration - io.bytes.per.checksum is depre

Re: How to perfom a logical diff on two PigStorage files

2012-11-30 Thread Ruslan Al-Fakikh
Hi, As for point 1: it will always be cumbersome to work on such files. I would recommend using Avro, where the schema is included in the file. Also you could try to sort the contents or apply some transformation to force the files to look the same. Then just diff the files outside of Pig, that's just an

Re: How to filter by pig datatype?

2012-11-22 Thread Ruslan Al-Fakikh
Hi, Just in case, can you execute: DESCRIBE data; As per my understanding a relation has a schema for all rows and it cannot have a schema per row. I guess that you will have to treat the field as one type, as chararray for example, and then try to get the type from its contents. Ruslan On Thu,

Re: distributed cache

2012-11-14 Thread Ruslan Al-Fakikh
Maybe this is what you are looking for: http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html see "Replicated join" On Tue, Nov 13, 2012 at 11:46 AM, yingnan.ma wrote: > Hi , > > I used the distributed cache in the hadoop though the "setup" and "static" > store an hashset in the
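A minimal sketch of the replicated join referenced above, as an alternative to a hand-rolled distributed-cache lookup (paths and field names are hypothetical); the right-hand relation must be small enough to fit in memory:

    big = LOAD '/data/big' USING PigStorage('\t') AS (id:chararray, value:chararray);
    small = LOAD '/data/small' USING PigStorage('\t') AS (id:chararray, label:chararray);
    -- 'replicated' loads the last relation into memory on each map task, avoiding a reduce phase
    joined = JOIN big BY id, small BY id USING 'replicated';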

Re: AvroStorage compression ratio

2012-10-23 Thread Ruslan Al-Fakikh
t; SET mapred.output.compress true > searches = load '/user/testuser/aol_search_logs.avro' using > org.apache.pig.piggybank.storage.avro.AvroStorage(); > store searches into '/user/testuser/aol_search_logs.snappy.avro' using > org.apache.pig.piggybank.storage.avro.AvroSto

Re: AvroStorage compression ratio

2012-10-22 Thread Ruslan Al-Fakikh
How do you generate your Avro files? It worked OK for me with: SET avro.mapred.deflate.level 5 inputData = LOAD 'input path' USING org.apache.pig.piggybank.storage.avro.AvroStorage(); STORE inputData INTO 'output path' USING org.apache.pig.piggybank.storage.avro.AvroStorage(); But I did these tes

Re: debug feature??

2012-10-22 Thread Ruslan Al-Fakikh
As for: >the >best scenario is to put a "marker" so that certain variables are stored or >skipped computation but instead LOADed I remember there was some discussion on this in the past. Actually this is not trivial. What would it do if you changed a UDF's internal code, for example? How would it kno

Re: debug feature??

2012-10-19 Thread Ruslan Al-Fakikh
Hi, Basically it would be perfect if you first test with a small amount of data in local mode and then run the script on the big data to verify correctness. If this is not possible you can store a relation at any point of your script with a STORE statement, so as not to lose intermediate results.

Re: Decide if function is algebraic at planning phase

2012-10-09 Thread Ruslan Al-Fakikh
Hi! Out of curiosity: what for? Algebraic works faster in most cases. Possible solutions: 1) Maybe you can disable the use of the combiner or something else that is related. Maybe if you set pig.exec.nocombiner=true in the configuration, that will disable the use of Algebraic, but I am not sure, th

Re: Counting elements in a bag

2012-09-20 Thread Ruslan Al-Fakikh
Sorry, I meant: or just c = foreach b generate COUNT(a); --without group to eliminate the keys On Thu, Sep 20, 2012 at 1:37 PM, Ruslan Al-Fakikh wrote: > Hey, try this: > > [cloudera@localhost workpig]$ cat input > James > John > Lisa > Larry > Amanda > Amanda &
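Both counting variants from the two messages above, as a minimal sketch (the path and field names are hypothetical):

    a = LOAD 'input' AS (name:chararray);
    grouped = GROUP a BY name;
    per_key = FOREACH grouped GENERATE group, COUNT(a);  -- count per distinct name
    all_grp = GROUP a ALL;
    total = FOREACH all_grp GENERATE COUNT(a);           -- total number of rows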

Re: Counting elements in a bag

2012-09-20 Thread Ruslan Al-Fakikh
Hey, try this: [cloudera@localhost workpig]$ cat input James John Lisa Larry Amanda Amanda John James Lisa John [cloudera@localhost workpig]$ pig -x local 2012-09-20 13:35:06,225 [main] INFO org.apache.pig.Main - Logging error messages to: /home/cloudera/workpig/pig_1348133706198.log 2012-09-20 1

Re: ClassCastException: java.lang.Integer cannot be cast to java.lang.Double

2012-09-20 Thread Ruslan Al-Fakikh
Hi! Are you sure about your types? Can you add a DESCRIBE statement for all relations before the line that causes the error? Ruslan On Wed, Sep 19, 2012 at 4:22 PM, Björn-Elmar Macek wrote: > Hi, > > during execution of the following PIG script i ran into the class cast > exception mentioned in th

Re: Replacing elements in a bag via a join

2012-09-20 Thread Ruslan Al-Fakikh
Hi Terry, It looks like you should FLATTEN the data relation first, so that your ids are not nested, and then join like this (or just remove the GROUP statement): joined = JOIN dataFlattened by id, lookup by id USING 'replicated'; (the replicated join is recommended if your lookup relation is smal
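A minimal sketch of the flatten-then-replicated-join suggestion (the relation layout, paths, and field names are hypothetical):

    data = LOAD '/data/main' USING PigStorage('\t') AS (key:chararray, ids:bag{t:(id:chararray)});
    lookup = LOAD '/data/lookup' USING PigStorage('\t') AS (id:chararray, newid:chararray);
    dataFlattened = FOREACH data GENERATE key, FLATTEN(ids) AS id;  -- un-nest the ids first
    joined = JOIN dataFlattened BY id, lookup BY id USING 'replicated';
    replaced = FOREACH joined GENERATE dataFlattened::key AS key, lookup::newid AS id;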

Re: Removing unnecessary disambiguation marks

2012-09-18 Thread Ruslan Al-Fakikh
Hey, You can try cleaning in a separate FOREACH. I don't think it'll trigger another MR job, but you'd better check it. Example: resultCleaned = FOREACH result GENERATE name::group::fieldName AS fieldName; Ruslan On Tue, Sep 18, 2012 at 3:01 AM, R

Re: Input and output path

2012-09-13 Thread Ruslan Al-Fakikh
MiaoMiao, Mohit, If we are talking about embedding Pig into Python, I'd like to add that we can also embed Pig into Java using PigServer http://wiki.apache.org/pig/EmbeddedPig MiaoMiao, what's the purpose of embedding here (if we already have the parameter substitution feature)? I guess Pig embedding

Re: Input and output path

2012-09-11 Thread Ruslan Al-Fakikh
ds On Tue, Sep 11, 2012 at 3:29 AM, Mohit Anchlia wrote: > On Mon, Sep 10, 2012 at 4:17 PM, Ruslan Al-Fakikh wrote: > >> Mohit, >> >> I guess you could use parameters substitution here >> http://wiki.apache.org/pig/ParameterSubstitution >> >> thanks thi

Re: Input and output path

2012-09-10 Thread Ruslan Al-Fakikh
Mohit, I guess you could use parameter substitution here http://wiki.apache.org/pig/ParameterSubstitution Also, a note about your architecture: you can consider using Hive partitions to effectively select appropriate dates in the folder names. But as your tool is Pig, not Hive, you can use HCata
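A minimal sketch of parameter substitution for date-based paths (the parameter name and paths are hypothetical):

    -- invoked as: pig -param rundate=2012-09-10 script.pig
    raw = LOAD '/incoming/logs/$rundate' USING PigStorage('\t') AS (line:chararray);
    STORE raw INTO '/processed/logs/$rundate' USING PigStorage('\t');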

Re: loading data to mysql using pig

2012-09-10 Thread Ruslan Al-Fakikh
Hi, Probably DBStorage is more convenient (I haven't tried it), but you can also use Sqoop if you are ok with storing data to HDFS first and then using Sqoop to insert the data into MySQL Ruslan On Tue, Sep 11, 2012 at 2:26 AM, Ranjith wrote: > Question for you pig experts.Trying to determine the

Re: Storing field in a bag

2012-09-10 Thread Ruslan Al-Fakikh
Hi, Mohit, http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#STORE I guess you can only STORE relations, not fields, etc Ruslan On Mon, Sep 10, 2012 at 9:53 PM, Mohit Anchlia wrote: > I am trying to store field in a bag command but it fails with > > store b.page into '/flume_vol/flume/input

Re: Custom DB Loader UDF

2012-08-31 Thread Ruslan Al-Fakikh
at each InputSplit would correspond to a map task, >> > but what I see in the JobTracker is that the submitted job only has 1 >> > map task which executes each split serially. Is my understanding even >> > correct that a split can be effectively assigned to a single map task? >> > If so, can I coerce the submitted MR job to properly get each of my >> > splits to execute in its own map task? >> > >> > Thanks, >> > -Terry >> > >> >> >> >> -- >> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com >> datasyndrome.com >> > > > > -- > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com -- Best Regards, Ruslan Al-Fakikh

Re: Custom DB Loader UDF

2012-08-31 Thread Ruslan Al-Fakikh
n the JobTracker is that the submitted job only has 1 >> map task which executes each split serially. Is my understanding even >> correct that a split can be effectively assigned to a single map task? >> If so, can I coerce the submitted MR job to properly get each of my >> splits to execute in its own map task? >> >> Thanks, >> -Terry >> > > > > -- > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com -- Best Regards, Ruslan Al-Fakikh

Re: Custom DB Loader UDF

2012-08-31 Thread Ruslan Al-Fakikh
t a split can be effectively assigned to a single map task? If so, > can I coerce the submitted MR job to properly get each of my splits to > execute in its own map task? > > Thanks, > -Terry -- Best Regards, Ruslan Al-Fakikh

Re: Project the last field of a tuple

2012-08-23 Thread Ruslan Al-Fakikh
enius wrote: > Hi, > > is there anyway to project the last field of a tuple (when you don't > know how many fields there are) without creating a UDF? > > > Thanks, > > Fabian -- Best Regards, Ruslan Al-Fakikh

Re: Loading data from a SQL database?

2012-08-10 Thread Ruslan Al-Fakikh
t;> allowing to perform computation on data comming from HBase, SQL and Hadoop >> files, if possible without having to deal with workflow tools like Oozie). >> >> What is your recommendations about that ? >> >> Cheers >> >> > -- Best Regards, Ruslan Al-Fakikh

RE: foreach in PIG is not working.

2012-07-25 Thread Ruslan Al-fakikh
Hi, It seems that you are having problems with separators. Even your first dump shows columns where the first one contains everything and the second one is empty Ruslan -Original Message- From: yogesh.kuma...@wipro.com [mailto:yogesh.kuma...@wipro.com] Sent: Wednesday, July 25, 2012 10:0

Re: how can I delete a file in pig only after checking if the file exists?

2012-07-23 Thread Ruslan Al-Fakikh
am not sure if it > exists, this statement will give some error when I run it. > > So is there any method so that I can delete a file in pig script only after > checking the file exists? > > Thanks! -- Best Regards, Ruslan Al-Fakikh

Re: Best Practice: store depending on data content

2012-07-05 Thread Ruslan Al-Fakikh
hly offtopic. Sorry.) > > D > > On Tue, Jul 3, 2012 at 2:56 AM, Ruslan Al-Fakikh > wrote: >> Dmirtiy, >> >> In our organization we use file paths for this purpose like this: >> /incoming/datasetA >> /incoming/datasetB >> /reports/datasetC >>

Re: Using average function is really slow

2012-07-04 Thread Ruslan Al-Fakikh
Hi James, AVG is Algebraic, which means that it will use the combiner when it can. It seems that your job is not using the combiner. Can you give the full script? Also check the job config of the running job. If it is using the combiner then you should see something like pig.job.feature=GROUP_BY,COMBINER pig.a

Re: Does pig support in clause?

2012-07-04 Thread Ruslan Al-Fakikh
Hi Johannes, Try this C = LOAD 'in.dat' AS (A1); A = LOAD 'in2.dat' AS (A1); joined = JOIN A BY A1 LEFT OUTER, C BY A1; DESCRIBE joined; newEntries = FILTER joined BY C::A1 IS NULL; DUMP newEntries; Ruslan On Wed, Jul 4, 2012 at 4:42 PM, Johannes Schwenk wrote: > Hi Alan, > > I'd like to us

Re: What is the best way to do counting in pig?

2012-07-03 Thread Ruslan Al-Fakikh
roup m_skills_filter by member_id; >>> > grpd = group m_skill_group all; >>> > cnt = foreach grpd generate COUNT(m_skill_group); >>> > >>> > cnt_filter = limit cnt 10; >>> > dump cnt_filter; >>> > >>> > >>> > but sometimes, when the records get larger, it takes lots of time and >>> hang >>> > up, and or die. >>> > I thought counting should be simple enough, so what is the best way to >>> do a >>> > counting in pig? >>> > >>> > Thanks! >>> > >>> > Sheng >>> > >>> >> -- Best Regards, Ruslan Al-Fakikh

Re: Best Practice: store depending on data content

2012-07-03 Thread Ruslan Al-Fakikh
gt; And that's exactly why you want it. > > D > > On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh wrote: >> Hey Alan, >> >> I am not familiar with Apache processes, so I could be wrong in my >> point 1, I am sorry. >> Basically my impressions was th

Re: Best Practice: store depending on data content

2012-07-02 Thread Ruslan Al-Fakikh
> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote: > >> Hi Markus, >> >> Currently I am doing almost the same task. But in Hive. >> In Hive you can use the native Avro+Hive integration: >> https://issues.apache.org/jira/browse/HIVE-895 >> Or haivvreo
