Usually Hadoop is used within a distro. Those can be Cloudera, Hortonworks,
EMR, etc.
On Jan 11, 2014, 2:05 AM, "Mariano Kamp" wrote:
> Hi Josh.
>
> Ok, got it. Interesting.
>
> Downloaded ant, recompiled and now it works.
>
> Thank you.
>
>
> On Fri, Jan 10, 2014 at 10:16 PM, Josh Elser
For simple testing you can use the Cloudera QuickStart VM. All the LZO stuff
can be configured in Cloudera Manager with a few clicks.
On Jan 10, 2014, 9:55 PM, "Peter Sanford" wrote:
> Hello everybody!
>
> I'm getting started with pig and I'm trying to understand how to
> configure io.comp
vro.AvroStorage('{
"index" : 1,
"schema": $SCHEMA_LITERAL}');
Best Regards,
Ruslan Al-Fakikh
On Wed, Dec 25, 2013 at 11:48 AM, Cheolsoo Park wrote:
> avro to bcc:
>
> >> Why can't it use the schema file from front-end invocation?
>
I am running the pig script from, cannot find the
file in the local file system.
Why can't it use the schema file from front-end invocation?
Does it mean that I am only limited to either HDFS locations for schema_uri
or embedding the schema string in AvroStorage parameters?
Thanks in advance
Ruslan Al-Fakikh
I guess you are getting a bag of tuples here.
Try to apply FLATTEN on the bag.
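A minimal sketch of what that could look like (relation and field names are
made up):
a = LOAD 'input' AS (line:chararray);
b = FOREACH a GENERATE TOKENIZE(line) AS words; -- words is a bag of tuples
c = FOREACH b GENERATE FLATTEN(words) AS word;  -- one output row per bag element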
Thanks
On Wed, Dec 18, 2013 at 12:20 AM, Tim Robertson
wrote:
> Hi all,
>
> I am new to Pig, and am struggling to split a long text line into multiple
> lines.
> I have an input format from a legacy mysqldump like:
>
>
I did
On Sat, Dec 21, 2013 at 9:44 PM, Serega Sheypak wrote:
> https://issues.apache.org/jira/browse/PIG-3638
> "like" it :)
>
>
> 2013/12/21 Ruslan Al-Fakikh
>
> > It seems to be a heavy PigUnit limitation. Maybe you can open a jira for
> > this?:)
to use "native" loader/storage.
> Looks like the only solution is to create a wrapper: a data-driven tester
> which feeds the script to a local pig server and verifies the output.
> We did that in Megafon. I tried to use the "recommended approach" - PigUnit - for my
> own purposes.
>
Hi Serega!
Have you resolved the issue? I am going to encounter the same problem, but
I don't know a solution.
Thanks
On Sun, Dec 15, 2013 at 6:07 PM, Serega Sheypak wrote:
> Hi!
> By default PigUnit overrides LOAD statements.
> Is there any possibility to avoid this?
> I'm using AvroStorage
Hi Russell,
Could you be more specific? What would this operator do?
Does it have something to do with control logic? (Like IF/ELSE, WHILE, etc)
AFAIK, those are not present in Pig because it would make Pig less clean.
Thanks
On Sat, Dec 21, 2013 at 12:31 AM, Russell Jurney
wrote:
> Does anyon
I think your expression ends up with a bag with just that column. Can you
give the full context where it is not working?
On Nov 28, 2013, 2:14 AM, "ey-chih chow" wrote:
> Hi,
>
> We have an Avro file of which a field that is an array of tuples as
> follows:
>
>
> cam:bag{ARRAY_ELEM:tup
I guess you need to specify 'multiple_schemas' in AvroStorage
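If I remember correctly, the piggybank AvroStorage takes it as a constructor
string argument, roughly like this (path made up; please double-check the
option spelling against the AvroStorage docs):
records = LOAD '/data/avro_dir' USING
org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');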
On Thu, Nov 28, 2013 at 4:07 AM, Mangtani, Kushal <
kushal.mangt...@viasat.com> wrote:
> Hi,
>
> I'm one of the many in the Pig developer/user community. I have a question
> regarding Avro 1.6.1 and Pig 0.11 compatibility. In ref to
> https:/
In my company we had to write our own Loader/Storer UDFs for this.
On Wed, Nov 27, 2013 at 6:00 PM, Noam Lavie wrote:
> Hi,
>
> Is there a way to load a csv file with header as schema? (the header's
> fields are the properties of the schema and the other list in the csv file
> will be in the sc
I've had a similar issue before. Not sure if I had the same versions. This
helped me: the solution was to compile with the -Dhadoopversion=23 option.
On Nov 20, 2013, 12:41 PM, "Hiro Gangwani" wrote:
> Dear Team,
>
> I have downloaded version 0.12.0 of Pig and am trying to use it with Hadoop 2.
Hey Johannes!
Have you solved the problem? I also see it.
But I don't see it when I pass the schema as a string in the AvroStorage
parameters. I see it only when I try to use an external schema file. And if
I specify a non-existent external schema file, the error is the same.
Ruslan
On Tue, Oct 22, 2
Hi Soniya,
In your example you are hard-coding the ID to 2 in your filter statement.
You could also hard-code it in STORE:
STORE A INTO '/main/abc/2';
If you want to separate the rows by the value of a field then you could try
MultiStorage from piggybank, as in the sketch below.
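A sketch of the MultiStorage variant, assuming the ID you split by is the
first field ($0); the paths are made up:
REGISTER /usr/lib/pig/piggybank.jar;
-- writes each row into /main/abc/<value of $0>/ based on the split field
STORE A INTO '/main/abc' USING
org.apache.pig.piggybank.storage.MultiStorage('/main/abc', '0');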
Thanks
On Mon, Nov 18, 2013 at 6:01 AM, s
including this last message to pig user list
On Sun, Nov 17, 2013 at 7:40 AM, Ruslan Al-Fakikh wrote:
> Russel,
>
> Actually this problem came from the situation when I had the same names in
> pig relation schema and avro schema. And it turned out that AvroStorage
> switches fiel
Thanks, Russel!
Do you mean that this is the expected behavior? Shouldn't AvroStorage map
the pig fields by their names (not their field order) matching them to the
names in the avro schema?
Thanks,
Ruslan Al-Fakikh
On Sun, Nov 17, 2013 at 6:53 AM, Russell Jurney wrote:
> Pig tup
*
--{"b":"data_a","nonsense_name":"data_b"}
--{"b":"data_a","nonsense_name":"data_b"}
AvroStorage is built from the latest piggybank code.
Using AvroStorage "debug": 5 parameter didn't help.
$ pig -version
Apache Pig version 0.11.0-cdh4.3.0 (rexported)
compiled May 27 2013, 20:48:21
Any help would be appreciated.
Thanks,
Ruslan Al-Fakikh
There is no such control logic in Pig.
Maybe the FOREACH statement can help, but it's not a loop; it's rather a
per-record processing operator.
Also we may want to use Pig Embedding to launch pig from other languages.
Thanks
On Wed, Oct 30, 2013 at 11:45 AM, Murphy Ric wrote:
> I have a code in SQL to con
pressions."
>
> I think "you cannot order ... by expressions" means the behavior you see
> is expected.
>
> William F Dowling
> Senior Technologist
> Thomson Reuters
>
>
> -Original Message-
> From: Ruslan Al-Fakikh [mailto:metarus..
s one:
A = LOAD 'input' AS (M:map[]);
named = foreach A generate *, M#'key1' as myfield;
sorted = ORDER named BY myfield;
dump sorted;
is OK
Is it a bug in Pig?
Best Regards,
Ruslan Al-Fakikh
Hi,
It says that your command returns a non-zero code. Does it return it when
you invoke it manually, outside of Pig?
Otherwise I don't have any valuable ideas.
Thanks
On Mon, Sep 30, 2013 at 10:37 AM, Anastasis Andronidis <
andronat_...@hotmail.com> wrote:
> Hello again,
>
> any comme
I suppose you need to use the RegExp groups for that, something like
([(.*),(.*)...]), and I think you need to escape []
Basically this is not a Pig problem, I would test the RegExp in Java first.
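A sketch with the built-in REGEX_EXTRACT_ALL, using a simplified two-group
pattern (not tested against the actual data):
a = LOAD 'input' AS (line:chararray);
-- REGEX_EXTRACT_ALL returns a tuple of the captured groups; ( [ ] ) are escaped
b = FOREACH a GENERATE REGEX_EXTRACT_ALL(line, '\\(\\[(.*),(.*)\\]\\)');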
Ruslan
On Thu, Sep 26, 2013 at 4:36 PM, Muni mahesh wrote:
> *Input Data :*
>
> ([37.77916,-122.42
Probably you'll need Pig embedding:
http://pig.apache.org/docs/r0.11.1/cont.html
For doing some logic that is not MapReduce and depends on your Pig script's output.
For loading data to the created tables, you can take a look at HCatalog,
though I am not sure whether your very old version of the Hadoop dist
Hi,
Are you trying to install them yourself? Usually a Hadoop distro is used
(Cloudera, Hortonworks, Amazon EMR, etc), and the components are already
compatible with each other within a distro.
Thanks
On Wed, Sep 25, 2013 at 2:02 PM, yonghu wrote:
> hello,
>
> Can anyone give me a list of compatible Versions between Pi
What was the error?
Not an issue, but why do you name the columns dt1, dt2, and then not use the
names, using the ordinal number instead: $0?
On Fri, Sep 20, 2013 at 6:00 PM, Muni mahesh wrote:
> Hi Hadoopers,
>
> I did the same thing in Pig 0.8.1 but not Pig 0.11.0
>
> register /usr/lib/pig/piggyba
Hi,
Not sure whether it helps, but I did a lot of testing in such cases. "Test
and see" was my main approach. It is really tricky sometimes. Also you can
try the -dryrun option when launching pig.
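For reference, a hypothetical -dryrun invocation (it expands parameters and
writes the resulting script out without running it):
pig -dryrun -param INPUT=/data/2013-11-01 myscript.pig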
Best Regards,
Ruslan Al-Fakikh
https://www.odesk.com/users/~015b7b5f617eb89923
On T
found piggybank's MultiStorage method much closer to
> what
> > I
> > > am looking for. I was just wondering is there a better or different way
> > to
> > > do the same.
> > >
> > > Regards
> > > Praveenesh
> > >
> >
Hi!
Have you tried the SPLIT operator?
http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
After splitting the relation into two separate relations you can STORE them
into different locations.
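A minimal sketch with made-up field names and paths:
SPLIT data INTO good IF status == 'ok', bad IF status != 'ok';
STORE good INTO '/output/good';
STORE bad INTO '/output/bad';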
Best Regards,
Ruslan Al-Fakikh
https://www.odesk.com/users/~015b7b5f617eb89923
On Sun, Sep 15, 2013
Best Regards,
Ruslan Al-Fakikh
https://www.odesk.com/users/~015b7b5f617eb89923
On Sun, Sep 15, 2013 at 10:10 PM, Shahab Yunus wrote:
> You need to load your data twice and then use it as any other join.
> Self-join is just like any other join to Pig.
>
> Regards,
> Shahab
>
Hi,
I think you could mimic it with an expression like this:
b = foreach a generate ((field1 is null) ? ((field2 is null) ? null :
field2) : field1);
Hope that helps,
Ruslan
On Wed, Sep 4, 2013 at 9:50 AM, Something Something <
mailinglist...@gmail.com> wrote:
> Is there a UDF in Piggybank (or
>> logic to a java method outside of the UDF and test it within normal junit
> Who will convert sample avro data for me to Tuples and feed them to java
> method? I don't want to reinvent the wheel and duplicate AvroStorage
> functionality.
>
>
> 2013/8/30 Ruslan Al-Fakikh
Hi,
There are different JSON loaders available, but none of them worked for me
when I had to deal with JSON. I ended up loading the file as a text file,
reading one line at a time, and then parsing the JSON inside my UDF with a
Java JSON library.
Best Regards,
Ruslan
On Fri, Aug 30, 2013 at 2:53 AM, j
Hi,
What exactly do you want to test? The logic inside UDFs? In that case I
would recommend not bothering about the input format of the whole Pig script.
You can use plain text files as input for the test. Or you can extract the
logic to a Java method outside of the UDF and test it with normal JUnit.
Which Hadoop distro are you using? I've heard Hortonworks has a
Windows-compatible Hadoop.
On Wed, Aug 28, 2013 at 2:36 PM, Darpan R wrote:
> Hi folks,
> I am facing a weird issue.
> I am running PIG 0.11 on windows7/64 bit machine with latest version of
> cygwin.
>
> I have a weblog which I wan
Hi,
I think the easiest way would be to use the piggybank conversion functions
for such tasks:
http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/datetime/convert/
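For example, a sketch assuming an ISO-8601 timestamp field (ISOToUnix is one
of the functions in the package linked above):
REGISTER /usr/lib/pig/piggybank.jar;
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
a = LOAD 'input' AS (ts:chararray);
b = FOREACH a GENERATE ISOToUnix(ts) AS epoch_millis; -- milliseconds since the epoch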
Best Regards,
Ruslan
On Mon, Aug 26, 2013 at 7:43 PM, Serega Sheypak
Hi!
Probably these can help:
http://pig.apache.org/docs/r0.11.1/basic.html#rank
http://pig.apache.org/docs/r0.11.1/func.html#pigstorage (look for
-tagsource)
I've never tried this, but probably you could group by tagsource and then
apply RANK
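A sketch of that untested idea (field names are made up; -tagsource prepends
the source file name as the first field):
a = LOAD '/data' USING PigStorage(',', '-tagsource') AS (file:chararray, value:chararray);
ranked = RANK a;               -- note: RANK numbers rows across the whole relation
byfile = GROUP ranked BY file;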
Ruslan
On Fri, Aug 16, 2013 at 6:17 AM, Leo wrote:
here
>
> Those are 2 lines however it gets broken down as 5 lines because of \n in
> between and the real line ends. I tried to use foreach generate
> REPLACE('\n',''); . Is that the right thing to do? Does it replace all \n
> or only the first one?
>
> On Tue
Hi Mohit,
I don't clearly understand your use case. It depends on how you read the
input and how you use the newlines: as the row separator, or just inside a
row as a normal character.
Can you put a simple example of the input and the output that you need?
Thanks
On Mon, Jun 24, 2013 at 10:18 PM, Mohit
be 2 levels of nesting:
http://hortonworks.com/blog/new-features-in-apache-pig-0-10/
see
Nested Cross/Foreach
Hope that helps
Ruslan Al-Fakikh
On Fri, Jun 21, 2013 at 7:09 PM, Adamantios Corais <
adamantios.cor...@gmail.com> wrote:
> It seems that group is not supported in nest
Hi!
What are you trying to do with define c COV('a','b','c') exactly?
Can you try
out = foreach grp generate group, COV(A.$0,A.$1,A.$2);
without the define statement?
Ruslan Al-Fakikh
On Tue, Jun 18, 2013 at 1:17 PM, achile wandji wrote:
> Hi,
> I'
Hi Pedro,
Yes, Pig Latin is always compiled to MapReduce.
Usually you don't have to specify the number of mappers (I am not sure
whether you really can). If you have a file of 500MB and it is splittable,
then the number of mappers automatically equals 500MB / 64MB (the block
size), which is around 8.
questions
On Wed, Jun 5, 2013 at 2:29 PM, John Meek wrote:
> hi Ruslan ,
> Not sure how to do this? Can you be specific?? Whats DAG? Thanks.
>
>
>
>
>
>
>
> -Original Message-
> From: Ruslan Al-Fakikh
> To: user
> Sent: Wed, Jun 5, 2013 4:04 am
Hi!
You can look at the Pig script stats after the script is finished. There is
a DAG of MR jobs there. You can look at the individual MR jobs' stats to
see how much time each MR job takes
Ruslan
On Wed, Jun 5, 2013 at 10:15 AM, Johnny Zhang wrote:
> How about disable multi-query execution an
I'd recommend to try Sqoop for RDBMS-related tasks.
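For example, a hypothetical Sqoop export of a Pig output directory into MySQL
(connection details made up):
sqoop export --connect jdbc:mysql://dbhost/mydb --username user -P \
  --table results --export-dir /user/me/pig_output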
On Mon, May 27, 2013 at 4:41 PM, Hardik Shah wrote:
> DBStorage is not working with other storage in pig script. means DBStorage
> is not working with multiple storage statement.
>
> What I was trying for: 1) I was trying to Store one output u
ve to split the processing and first generate multiple HDFS files
> and then use SQOOP to load the RDBMS, then why not write a few more short PIG
> scripts to load those HDFS files into the RDBMS?
>
> Regards,
> Shahab
>
>
> On Wed, May 8, 2013 at 12:27 PM, Ruslan Al-Fakikh >wrote:
I also was having issues with the builtin JsonLoader and tried some other
loaders: Elephant-bird (which doesn't work with CDH 4 :( ), Mozilla Akela.
There is also another JsonLoader in piggybank in some newer version of Pig.
I ended up just loading data as text and processing it inside a UDF with a
Hi,
It is possible to have multiple STORE statements, but I can't tell why you
have nothing in the result.
I recommend splitting the task across the appropriate tools: store everything
in HDFS and then run Sqoop to upload the data to an RDBMS.
Ruslan
On Wed, May 8, 2013 at 6:11 PM, Shahab Yunus wrote:
Hi:
Q1: maybe there is something wrong with the UDF itself?
Q2: How do you specify the data as dirty? One of your 6 fields is null?
Then you could do something like: FILTER BY ($0 IS NULL OR $1 IS NULL...)
Ruslan
On Fri, Apr 19, 2013 at 6:57 AM, 何琦 wrote:
>
> Hi,
>
> Q1:I have a question about h
> > > > > Beginning in Pig 0.9, a map can declare its values to all be of the
> > > same
> > > > > type... "
> > > > >
> > > > > I agree that all values in the map can be of the same type but this
> > is
> > > > not
> > > > >
orker.runTask(ThreadPoolExecutor.java:895)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:680)
>
>
> Best Regards,
>
> Jerry
>
>
> On Wed, Apr 17, 2013 at 3:26 PM, Ruslan Al-Faki
$Worker.runTask(ThreadPoolExecutor.java:895)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:680)
>
> Best Regards,
>
> Jerry
>
>
> On Wed, Apr 17, 2013 at 1:24 PM, Ruslan Al-Fakikh >
Hey, and as for converting a map of tuples, probably I got you wrong. If
you can get to every value manually within a FOREACH then I see no problem
in doing so.
On Wed, Apr 17, 2013 at 9:22 PM, Ruslan Al-Fakikh wrote:
> I am not sure whether you can convert a map to a tuple.
> But I am c
e flatten(b);
>
> I don't have controls over the input. It is passed as Map of Maps. I guess
> it makes lookup easier using a map with keys.
>
> Can I convert map to tuple?
>
> Best Regards,
>
> Jerry
>
>
>
> On Wed, Apr 17, 2013 at 11:57 AM, Ruslan Al-Fak
1) Launch Pig with a bash wrapper.
2) Embed Pig into Java or Python, etc (just like you would embed SQL into
a regular language). Pig supports this out of the box.
3) Use Oozie or something similar for the jobs orchestration.
Hope that helps
Ruslan Al-Fakikh
On Tue, Apr 9, 2013 at 5:28 PM, John Far
Hi Lucas,
It seems that you are using org.apache.pig.EvalFunc.warn(String, Enum),
which acts differently. Check the code or the Javadocs; it works through Hadoop
counters, I guess. You can use regular log4j warnings or just
System.out.println. But keep in mind that your UDF is executed on a
remote
Hi Jerry,
I would recommend debugging the issue step by step. Just after this line:
A = load 'data.txt' as document:[];
and then right after that:
DESCRIBE A;
DUMP A;
and so on...
To be honest I haven't used maps that much. Just curious, why did you
choose to use them? You can also use regular tup
Hey Niels,
This is not a Pig question, it is more of a Java packaging question. What
exactly went wrong with the maven assembly plugin? Maybe the maven shade
plugin would work better? (though I've never tried it myself)
For me - the simplest way is to just register all the needed dependencies
and I
Hi Lei,
It seems there is something wrong with creating a sampler. The ORDER
command is not trivial: it works by first running a sampling job, and I guess
something went wrong there:
Input path
does not exist:
file:/home/dliu/ApacheLogAnalysisWithPig/pigsample_259943398_1365820592017
I suppose pigsample is n
James,
Try to execute in mapreduce mode, at least on a pseudo-distributed cluster,
and try to find them in the logs of specific tasks. Also you can try to throw an
exception, just to make sure your code is actually getting there, something
like
if (true) throw new RuntimeException("My warning");
Best Rega
:
> https://github.com/rangadi/elephant-bird/tree/hadoop-2.0-support
>
>
> On Thu, Apr 4, 2013 at 6:39 PM, Ruslan Al-Fakikh >wrote:
>
> > Hi guys,
> >
> > As for elephant-bird, it seems that it is not compatible with Pig 0.10
> > (CDH4) :(
> > I am using th
Hey guys,
I have a complex json file, I can load simple properties, but I am having
problems with a property that has an array as its value:
Suppose I have input.json with the contents:
{"images": ["url1","url2"]}
when I do:
a = LOAD 'input.json' using JsonLoader('images: {(image: chararray)}');
Hi guys,
As for elephant-bird, it seems that it is not compatible with Pig 0.10
(CDH4) :(
I am using this configuration:
pig -version
Apache Pig version 0.10.0-cdh4.1.1 (rexported)
hadoop version
Hadoop 2.0.0-cdh4.1.1
and getting just the same error as Tim explained:
java.lang.IncompatibleClassCha
Tim,
have you resolved the issue of using the elephant-bird with pig 0.10?
meghana,
I am using just the same configuration:
pig -version
Apache Pig version 0.10.0-cdh4.1.1 (rexported)
hadoop version
Hadoop 2.0.0-cdh4.1.1
and getting just the same error as Tim explained:
java.lang.IncompatibleCla
1917004,200409672,2013-02-01
> 21:29:45),(S:382290531917004,200443484,2013-02-01 21:24:19)},3)
>
> The error is not present when I comment out the "last_removed..." line and
> uncommented out the one below it.
>
>
>
>
> On Tue, Mar 12, 2013 at 8:06 PM, Ruslan Al-Fak
Chan,
Sorry, I meant
ordered = ORDER inputData BY date;
not
ordered = ORDER inputData BY key;
On Wed, Mar 13, 2013 at 7:06 AM, Ruslan Al-Fakikh wrote:
> Hi Chan,
>
> Your task seems to be non-trivial in Pig. Basically bags are not ordered,
> so you have to either sort before or to
f it helps.
Best Regards,
Ruslan Al-Fakikh
On Wed, Mar 13, 2013 at 4:40 AM, Johnny Zhang wrote:
> Hi, Chan:
> That's fine. How did you generate the bag with different sizes at runtime?
> It will be easier for me to come up with a solution with this information.
> Thanks.
>
> Johnny
Hi guys,
I am having a JodaTime maven version issue.
I have a Java UDF in the form of a Maven project with this dependency:
<dependency>
  <groupId>joda-time</groupId>
  <artifactId>joda-time</artifactId>
  <version>2.1</version>
</dependency>
Pig itself is dependent on JodaTime 1.6:
https://issues.apache.org/jira/browse/PIG-3031
When my UDF uses a method that exists only in the new versio
Hi guys,
When runnig Pig I have a lot of WARNs like these:
2013-01-20 19:09:21,318 [main] WARN org.apache.hadoop.conf.Configuration -
fs.default.name is deprecated. Instead, use fs.defaultFS
2013-01-20 19:09:22,756 [main] WARN org.apache.hadoop.conf.Configuration -
io.bytes.per.checksum is depre
Hi,
As for point 1: it will always be cumbersome to work on such files. I would
recommend using Avro where the schema is included in the file.
Also you could try to sort contents or apply some transformation to force
the files look the same. Then just diff the files outside of Pig, that's
just an
Hi,
Just in case, can you execute:
DESCRIBE data;
As per my understanding a relation has a schema for all rows and it cannot
have a schema per row. I guess that you will have to treat the field as one
type, as chararray for example, and then try to get the type from its
contents.
Ruslan
On Thu,
Maybe this is what you are looking for:
http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html
see "Replicated join"
On Tue, Nov 13, 2012 at 11:46 AM, yingnan.ma wrote:
> Hi ,
>
> I used the distributed cache in the hadoop through the "setup" and "static"
> store a hashset in the
t; SET mapred.output.compress true
> searches = load '/user/testuser/aol_search_logs.avro' using
> org.apache.pig.piggybank.storage.avro.AvroStorage();
> store searches into '/user/testuser/aol_search_logs.snappy.avro' using
> org.apache.pig.piggybank.storage.avro.AvroSto
How do you generate your Avro files?
It worked OK for me with:
SET avro.mapred.deflate.level 5
inputData = LOAD 'input path' USING
org.apache.pig.piggybank.storage.avro.AvroStorage();
STORE inputData INTO 'output path' USING
org.apache.pig.piggybank.storage.avro.AvroStorage();
But I did these tes
As for:
>the
>best scenario is to put a "marker" so that certain variables are stored or
>skipped computation but instead LOADed
I remember there was some discussion on this in the past. Actually
this is not trivial. What would it do if you changed a UDF internal
code, for example? How would it kno
Hi,
Basically it would be perfect if you first test with a small amount of
data in local mode and then run the script on the big data to verify
the correctness.
If this is not possible you can store a relation at any point of your
script with a STORE statement, so not to lose intermediate results.
Hi!
Out of curiosity: what for? Algebraic works faster in most cases.
Possible solutions:
1) Maybe you can disable the use of combiner or something else that is
related. Maybe if you change the Configuration to
pig.exec.nocombiner=true
that will disable the use of Algebraic, but I am not sure, th
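If that property is honored (I have not verified it), it could also be set
from within the script:
SET pig.exec.nocombiner true;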
Sorry,
I meant:
or just
c = foreach b generate COUNT(a); --without group
to eliminate the keys
On Thu, Sep 20, 2012 at 1:37 PM, Ruslan Al-Fakikh wrote:
> Hey, try this:
>
> [cloudera@localhost workpig]$ cat input
> James
> John
> Lisa
> Larry
> Amanda
> Amanda
Hey, try this:
[cloudera@localhost workpig]$ cat input
James
John
Lisa
Larry
Amanda
Amanda
John
James
Lisa
John
[cloudera@localhost workpig]$ pig -x local
2012-09-20 13:35:06,225 [main] INFO org.apache.pig.Main - Logging
error messages to: /home/cloudera/workpig/pig_1348133706198.log
2012-09-20 1
Hi!
Are you sure about your types? Can you add a DESCRIBE statement for all
relations before the line that causes the error?
Ruslan
On Wed, Sep 19, 2012 at 4:22 PM, Björn-Elmar Macek
wrote:
> Hi,
>
> during execution of the following PIG script i ran into the class cast
> exception mentioned in th
Hi Terry,
It looks like you should FLATTEN the data relation first, so that your
ids are not nested, and then join like this (or just remove the GROUP
statement):
joined = JOIN dataFlattened by id, lookup by id USING 'replicated';
(the replicated join is recommended if your lookup relation is small
Hey,
You can try cleaning in a separate FOREACH. I don't think it'll
trigger another MR job, but you'd better check.
Example:
resultCleaned = FOREACH result GENERATE
name::group::fieldName AS
fieldName;
Ruslan
On Tue, Sep 18, 2012 at 3:01 AM, R
MiaoMiao, Mohit,
If we are talking about embedding Pig into Python, I'd like to add
that we can also embed Pig into Java using PigServer
http://wiki.apache.org/pig/EmbeddedPig
MiaoMiao, what's the purpose of embedding here (if we already have the
parameter substitution feature)? I guess Pig embedding
ds
On Tue, Sep 11, 2012 at 3:29 AM, Mohit Anchlia wrote:
> On Mon, Sep 10, 2012 at 4:17 PM, Ruslan Al-Fakikh wrote:
>
>> Mohit,
>>
>> I guess you could use parameters substitution here
>> http://wiki.apache.org/pig/ParameterSubstitution
>>
>> thanks thi
Mohit,
I guess you could use parameters substitution here
http://wiki.apache.org/pig/ParameterSubstitution
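A minimal sketch of parameter substitution (names made up):
-- myscript.pig
data = LOAD '/logs/$DATE' AS (line:chararray);
-- invoked as: pig -param DATE=2012-09-10 myscript.pig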
Also, a note about your architecture:
You can consider using Hive partitions to effectively select
appropriate dates in the folder names. But as your tool is Pig, not
Hive, you can use HCata
Hi,
Probably DBStorage is more convenient (I haven't tried it), but you
can also use Sqoop if you are ok with storing data to HDFS first and
then using Sqoop to insert the data into MySQL.
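For comparison, a DBStorage sketch as I understand its piggybank constructor
(driver, JDBC URL, user, password, insert statement; all values made up):
REGISTER /usr/lib/pig/piggybank.jar;
STORE results INTO 'ignored' USING org.apache.pig.piggybank.storage.DBStorage(
    'com.mysql.jdbc.Driver', 'jdbc:mysql://dbhost/mydb', 'user', 'pass',
    'INSERT INTO results (id, cnt) VALUES (?, ?)');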
Ruslan
On Tue, Sep 11, 2012 at 2:26 AM, Ranjith wrote:
> Question for you pig experts. Trying to determine the
Hi, Mohit,
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#STORE
I guess you can only STORE relations, not fields, etc
Ruslan
On Mon, Sep 10, 2012 at 9:53 PM, Mohit Anchlia wrote:
> I am trying to store field in a bag command but it fails with
>
> store b.page into '/flume_vol/flume/input
at each InputSplit would correspond to a map task,
>> > but what I see in the JobTracker is that the submitted job only has 1
>> > map task which executes each split serially. Is my understanding even
>> > correct that a split can be effectively assigned to a single map task?
>> > If so, can I coerce the submitted MR job to properly get each of my
>> > splits to execute in its own map task?
>> >
>> > Thanks,
>> > -Terry
>> >
>>
>>
>>
>> --
>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
>> datasyndrome.com
>>
>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
--
Best Regards,
Ruslan Al-Fakikh
n the JobTracker is that the submitted job only has 1
>> map task which executes each split serially. Is my understanding even
>> correct that a split can be effectively assigned to a single map task?
>> If so, can I coerce the submitted MR job to properly get each of my
>> splits to execute in its own map task?
>>
>> Thanks,
>> -Terry
>>
>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
--
Best Regards,
Ruslan Al-Fakikh
t a split can be effectively assigned to a single map task? If so,
> can I coerce the submitted MR job to properly get each of my splits to
> execute in its own map task?
>
> Thanks,
> -Terry
--
Best Regards,
Ruslan Al-Fakikh
enius
wrote:
> Hi,
>
> is there anyway to project the last field of a tuple (when you don't
> know how many fields there are) without creating a UDF?
>
>
> Thanks,
>
> Fabian
--
Best Regards,
Ruslan Al-Fakikh
>> allowing to perform computation on data coming from HBase, SQL and Hadoop
>> files, if possible without having to deal with workflow tools like Oozie).
>>
>> What is your recommendations about that ?
>>
>> Cheers
>>
>>
>
--
Best Regards,
Ruslan Al-Fakikh
Hi,
It seems that you are having problems with separators. Even your first dump
shows columns where the first one contains everything and the second one is
empty.
Ruslan
-Original Message-
From: yogesh.kuma...@wipro.com [mailto:yogesh.kuma...@wipro.com]
Sent: Wednesday, July 25, 2012 10:0
am not sure if it
> exists, this statement will give some error when I run it.
>
> So is there any method so that I can delete a file in pig script only after
> checking the file exists?
>
> Thanks!
--
Best Regards,
Ruslan Al-Fakikh
hly offtopic. Sorry.)
>
> D
>
> On Tue, Jul 3, 2012 at 2:56 AM, Ruslan Al-Fakikh
> wrote:
>> Dmirtiy,
>>
>> In our organization we use file paths for this purpose like this:
>> /incoming/datasetA
>> /incoming/datasetB
>> /reports/datasetC
>>
Hi James,
AVG is Algebraic, which means that it will use the combiner when it can. It
seems that your job is not using the combiner. Can you give the full
script? Also check the job config of the running job. If it is using the
combiner then you should see something like
pig.job.feature=GROUP_BY,COMBINER
pig.a
Hi Johannes,
Try this
C = LOAD 'in.dat' AS (A1);
A = LOAD 'in2.dat' AS (A1);
joined = JOIN A BY A1 LEFT OUTER, C BY A1;
DESCRIBE joined;
newEntries = FILTER joined BY C::A1 IS NULL;
DUMP newEntries;
Ruslan
On Wed, Jul 4, 2012 at 4:42 PM, Johannes Schwenk
wrote:
> Hi Alan,
>
> I'd like to us
roup m_skills_filter by member_id;
>>> > grpd = group m_skill_group all;
>>> > cnt = foreach grpd generate COUNT(m_skill_group);
>>> >
>>> > cnt_filter = limit cnt 10;
>>> > dump cnt_filter;
>>> >
>>> >
>>> > but sometimes, when the records get larger, it takes lots of time and
>>> > hangs up, and/or dies.
>>> > I thought counting should be simple enough, so what is the best way to
>>> do a
>>> > counting in pig?
>>> >
>>> > Thanks!
>>> >
>>> > Sheng
>>> >
>>>
>>
--
Best Regards,
Ruslan Al-Fakikh
> And that's exactly why you want it.
>
> D
>
> On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh wrote:
>> Hey Alan,
>>
>> I am not familiar with Apache processes, so I could be wrong in my
>> point 1, I am sorry.
>> Basically my impression was th
> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
>
>> Hi Markus,
>>
>> Currently I am doing almost the same task. But in Hive.
>> In Hive you can use the native Avro+Hive integration:
>> https://issues.apache.org/jira/browse/HIVE-895
>> Or haivvreo