[jira] Created: (PIG-1424) Error logs of streaming should not be placed in output location

2010-05-20 Thread Ashutosh Chauhan (JIRA)
Error logs of streaming should not be placed in output location
---

 Key: PIG-1424
 URL: https://issues.apache.org/jira/browse/PIG-1424
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


This becomes a problem when the output location is anything other than a 
filesystem. Output will be written to the DB, but where should the logs 
generated by streaming go? Clearly, they can't be written into the DB. This 
blocks PIG-1229, which introduces writing to a DB from Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1415) LoadFunc signature is not correct in LoadFunc.getSchema sometimes

2010-05-14 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867633#action_12867633
 ] 

Ashutosh Chauhan commented on PIG-1415:
---

+1 Please commit if all unit tests pass.

> LoadFunc signature is not correct in LoadFunc.getSchema sometimes
> -
>
> Key: PIG-1415
> URL: https://issues.apache.org/jira/browse/PIG-1415
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1415-1.patch
>
>
> The following script does not set signature correctly when we call 
> LoadFunc.getSchema.
> a = load 'xxx' using TableLoader('xxx') as (a, b, c);
> However, if we don't give schema to a, we get the right signature:
> a = load 'xxx' using TableLoader('xxx');
> Diagnosis:
> The parser generates the LoadClause before it reaches the generation "Alias = 
> LoadClause", which is what actually sets the signature on the LOLoad. When we 
> give a schema, the parser calls LOLoad.setSchema(), which internally calls 
> LoadFunc.determineSchema. At that time, the signature has not been set yet. 
> This relates to the change that caches determinedSchema in LOLoad 
> [PIG-1317|https://issues.apache.org/jira/browse/PIG-1317]. Before that 
> change, we would later call LoadFunc.getSchema() again with the right 
> signature. Now that we cache determinedSchema, LoadFunc doesn't get a chance 
> to see the right signature inside LoadFunc.getSchema().
> Solution:
> We shall not call LoadFunc.determineSchema inside LOLoad.setSchema().




[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-05-13 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1229:
--

Attachment: pig-1229.patch

Ankur,

Sorry for getting back late on this. I fiddled with your latest patch and was 
able to make some progress on it. I was able to get rid of those Path problems 
(it looks like Pig itself is not dealing with them correctly in one place). I 
think the patch that I attached should work, but I am not able to get the test 
case to pass because of an hsqldb problem that I have not been able to resolve. 
I keep getting this error from it:
{noformat}
Caused by: java.sql.SQLException: The database is already in use by another 
process: org.hsqldb.persist.niolockf...@4abea04e[file 
=/private/tmp/batchtest.lck, exists=true, locked=false, valid=false, fl =null]: 
java.lang.Exception: checkHeartbeat(): lock file [/private/tmp/batchtest.lck] 
is presumably locked by another process.
at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
at org.hsqldb.jdbc.jdbcConnection.<init>(Unknown Source)
at org.hsqldb.jdbcDriver.getConnection(Unknown Source)
at org.hsqldb.jdbcDriver.connect(Unknown Source)
at java.sql.DriverManager.getConnection(DriverManager.java:582)
at java.sql.DriverManager.getConnection(DriverManager.java:185)
at org.apache.pig.piggybank.storage.DBStorage.prepareToWrite(DBStorage.java:274)
{noformat}
Anyways here are the changes I made:
1.
{code}
Index:src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
===
-conf.set("pig.streaming.log.dir", 
-new Path(outputPath, LOG_DIR).toString());
+//conf.set("pig.streaming.log.dir", 
+//new Path(outputPath, LOG_DIR).toString());
 conf.set("pig.streaming.task.output.dir", outputPath);
 }
{code}
This looks like a problem in Pig. Here Pig incorrectly assumes that it can 
put the logs generated by the stream command in the output location, which is 
wrong if the output location is something like a DB. Since this needs changes 
in the main Pig code, I suggest opening a new jira for it and tracking it there.

2. Then in DBStorage.java
{code}
@Override
public void setStoreLocation(String location, Job job) throws IOException {
  // Front end: stash the connection string in the job configuration.
  job.getConfiguration().set("pig.db.conn.string", location);
}
@Override
public RecordWriter getRecordWriter(
    TaskAttemptContext context) throws IOException, InterruptedException {
  // Back end: read the connection string back from the task's configuration.
  jdbcURL = context.getConfiguration().get("pig.db.conn.string");
  return null;
}
{code} 
We need to save the db connection string in the job in setStoreLocation() and 
then retrieve it on the backend in getRecordWriter(). 

3. In DBStorage.java
{code}
@Override
public void cleanupOnFailure(String location, Job job) throws 
IOException {
  log.error("Job has failed.");
}
{code}
You must override this StoreFunc method, as the default 
implementation assumes a FileSystem as the output location. Currently, I left 
it as a no-op apart from logging, but it can be improved to do rollbacks, 
release db connections, etc. 
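
The hand-off in point 2 above (store the connection string on the front end, read it back on the backend) can be sketched without Hadoop on the classpath. This is only an illustration under stated assumptions: `DbStoreLocationSketch` and its methods are hypothetical names, and a plain `Map` stands in for Hadoop's `Configuration`. Only the key `pig.db.conn.string` comes from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for DBStorage; a Map replaces Hadoop's Configuration. */
final class DbStoreLocationSketch {
    // Key name taken from the DBStorage snippet above.
    static final String CONN_KEY = "pig.db.conn.string";

    /** Front end: remember the store location (the JDBC URL) in the job configuration. */
    static void setStoreLocation(String location, Map<String, String> jobConf) {
        jobConf.put(CONN_KEY, location);
    }

    /** Back end: recover the JDBC URL from the configuration shipped with the task. */
    static String jdbcUrlFor(Map<String, String> taskConf) {
        String url = taskConf.get(CONN_KEY);
        if (url == null) {
            // Fail early instead of passing a null URL on to the JDBC driver.
            throw new IllegalStateException(CONN_KEY + " was not set on the front end");
        }
        return url;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // The URL below is an example value, not one taken from the patch.
        setStoreLocation("jdbc:hsqldb:file:/tmp/example", conf);
        System.out.println(jdbcUrlFor(conf));
    }
}
```

Failing early when the key is missing is a design choice of this sketch; the snippet above simply reads the value.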


> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.patch
>
>
> UDF to store data into a DB




[jira] Commented: (PIG-1381) Need a way for Pig to take an alternative property file

2010-05-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867220#action_12867220
 ] 

Ashutosh Chauhan commented on PIG-1381:
---

+1 on the changes. 
For completeness, we can also check in an empty pig.properties in the conf dir 
and then add comments in both pig.properties and pig-default.properties saying 
that passing properties through pig-default.properties will have no effect, and 
that users should instead put any properties they want to add or override in 
pig.properties.
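
The default-plus-override arrangement discussed here can be sketched with `java.util.Properties`, which supports a fallback set of defaults. This is a minimal sketch, not Pig's actual loading code: the class name `PropertyLayeringSketch`, the in-memory strings standing in for the two files, and the `example.*` keys are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch of default-plus-override property loading; not Pig's loader. */
final class PropertyLayeringSketch {

    /** Parses the defaults first, then layers the user's entries on top of them. */
    static Properties load(String defaultsText, String userText) {
        try {
            Properties defaults = new Properties();
            defaults.load(new StringReader(defaultsText));
            // Lookups that miss in 'merged' fall back to 'defaults'.
            Properties merged = new Properties(defaults);
            merged.load(new StringReader(userText));
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the contents of pig-default.properties and pig.properties.
        String defaults = "example.tmpdir=/tmp\nexample.parallel=1\n";
        String user = "example.parallel=10\n";
        Properties props = load(defaults, user);
        System.out.println(props.getProperty("example.tmpdir"));
        System.out.println(props.getProperty("example.parallel"));
    }
}
```

With this layering, an empty user file leaves every default visible, which is why checking in an empty pig.properties is harmless.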

> Need a way for Pig to take an alternative property file
> ---
>
> Key: PIG-1381
> URL: https://issues.apache.org/jira/browse/PIG-1381
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: V.V.Chaitanya Krishna
> Fix For: 0.7.0, 0.8.0
>
> Attachments: PIG-1381-1.patch, PIG-1381-2.patch, PIG-1381-3.patch, 
> PIG-1381-4.patch
>
>
> Currently, Pig reads the first pig.properties it finds in CLASSPATH. Pig has a 
> default pig.properties, and if the user has a different pig.properties, there 
> will be a conflict since we can only read one. There are a couple of ways to 
> solve it:
> 1. Give a command line option for user to pass an additional property file
> 2. Change the name for default pig.properties to pig-default.properties, and 
> user can give a pig.properties to override
> 3. Further, can we consider to use pig-default.xml/pig-site.xml, which seems 
> to be more natural for hadoop community. If so, we shall provide backward 
> compatibility to also read pig.properties, pig-cluster-hadoop-site.xml. 




[jira] Commented: (PIG-1390) Provide a target to generate eclipse-related classpath and files

2010-04-30 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862951#action_12862951
 ] 

Ashutosh Chauhan commented on PIG-1390:
---

I gave it a go and did as mentioned in previous comment

{noformat}
These are the steps that could be followed and imported to eclipse in a faster 
way :
1. checkout the trunk code.
2. run "ant eclipse-files".
3. open eclipse and "import" the existing project.
{noformat}

Pig itself compiled fine and is ready to go, but the contrib projects 
(owl, zebra, piggybank/hiverc) didn't compile, I think because it either didn't 
download the dependencies of those projects or didn't include them in the build 
path. So an unfriendly red cross appears next to the project. If I remove them 
from the build path, things are good. Did I do something wrong, or is this 
expected?

> Provide a target to generate eclipse-related classpath and files
> 
>
> Key: PIG-1390
> URL: https://issues.apache.org/jira/browse/PIG-1390
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.7.0, 0.8.0
>Reporter: V.V.Chaitanya Krishna
>Assignee: V.V.Chaitanya Krishna
> Fix For: 0.8.0
>
> Attachments: PIG-1390-2.patch, PIG-1390-3.patch, 
> PIG-eclipse_support.patch
>
>
> Currently, after checking out from the svn repository, there is no provision to 
> auto-generate eclipse-related classpath and files, which could help in 
> importing into eclipse directly.




[jira] Updated: (PIG-1395) Mapside cogroup runs out of memory

2010-04-27 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1395:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch checked-in with updated comment.

> Mapside cogroup runs out of memory
> --
>
> Key: PIG-1395
> URL: https://issues.apache.org/jira/browse/PIG-1395
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: cogrp_mem.patch
>
>
> In a particular scenario, when there aren't a lot of tuples with the same key in a 
> relation (i.e. there aren't many repeating keys), map tasks doing cogroup 
> fail with a GC overhead exception.




[jira] Commented: (PIG-1381) Need a way for Pig to take an alternative property file

2010-04-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861186#action_12861186
 ] 

Ashutosh Chauhan commented on PIG-1381:
---

Do we need to have two different property files? One possibility is to not 
package pig.properties in pig.jar and instead include it in the classpath 
when invoking Pig. (We can modify the pig shell script to include it in the 
path by default.) Then the user can add, delete, or modify entries in 
pig.properties as they wish, as well as override default properties. 

The disadvantage of two property files is that it is sometimes confusing which 
property is getting picked up (the one in the default file or the one in the 
user-specified file). If there is only one property file, there is only one way 
to specify properties to Pig, which I think is the better way of doing it. 


> Need a way for Pig to take an alternative property file
> ---
>
> Key: PIG-1381
> URL: https://issues.apache.org/jira/browse/PIG-1381
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
> Fix For: 0.8.0
>
>
> Currently, Pig reads the first pig.properties it finds in CLASSPATH. Pig has a 
> default pig.properties, and if the user has a different pig.properties, there 
> will be a conflict since we can only read one. There are a couple of ways to 
> solve it:
> 1. Give a command line option for user to pass an additional property file
> 2. Change the name for default pig.properties to pig-default.properties, and 
> user can give a pig.properties to override
> 3. Further, can we consider to use pig-default.xml/pig-site.xml, which seems 
> to be more natural for hadoop community. If so, we shall provide backward 
> compatibility to also read pig.properties, pig-cluster-hadoop-site.xml. 




[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-04-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861177#action_12861177
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

Ankur,

The stack trace above is out of sync with trunk. Can you upload the patch with 
the alternative approach that you are trying? I think it might be possible to 
get this working.

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch
>
>
> UDF to store data into a DB




[jira] Updated: (PIG-1395) Mapside cogroup runs out of memory

2010-04-26 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1395:
--

Status: Patch Available  (was: Open)

> Mapside cogroup runs out of memory
> --
>
> Key: PIG-1395
> URL: https://issues.apache.org/jira/browse/PIG-1395
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: cogrp_mem.patch
>
>
> In a particular scenario, when there aren't a lot of tuples with the same key in a 
> relation (i.e. there aren't many repeating keys), map tasks doing cogroup 
> fail with a GC overhead exception.




[jira] Updated: (PIG-1395) Mapside cogroup runs out of memory

2010-04-26 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1395:
--

Attachment: cogrp_mem.patch

While doing cogroup, we first put tuples from all the relations in a heap, then 
we drain the heap and generate the output tuples as appropriate. We need to look 
ahead at least one tuple in each relation before generating an output 
tuple, to be sure that we have all the tuples belonging to a key. Currently, we 
look too far ahead, and tuples accumulate in the heap faster than we 
drain them. At a certain point we have enough information to generate the 
output tuple instead of waiting and putting another tuple in the heap. This 
patch generates the output tuple at that point.
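
The idea of looking ahead exactly one tuple per relation can be sketched as a k-way merge over inputs that are already sorted by key. This is a hypothetical illustration, not the code in cogrp_mem.patch: the class `CogroupMergeSketch`, the `Tuple` type, and the string output format are all assumptions. The heap holds at most one pending tuple per relation, and a group is emitted as soon as the smallest pending key differs from the current key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of a bounded-lookahead cogroup; not the actual patch. */
final class CogroupMergeSketch {

    static final class Tuple {
        final int key;
        final String value;
        Tuple(int key, String value) { this.key = key; this.value = value; }
    }

    /** Each relation must be sorted by key. Returns one "key=[bag0, bag1, ...]" string per key. */
    static List<String> cogroup(List<List<Tuple>> relations) {
        int n = relations.size();
        List<Iterator<Tuple>> inputs = new ArrayList<>();
        Tuple[] heads = new Tuple[n]; // the single looked-ahead tuple of each relation
        // The heap stores relation indices, ordered by the key of that relation's head.
        Comparator<Integer> byHeadKey = Comparator.comparingInt((Integer r) -> heads[r].key);
        PriorityQueue<Integer> heap = new PriorityQueue<>(byHeadKey);
        for (int r = 0; r < n; r++) {
            Iterator<Tuple> it = relations.get(r).iterator();
            inputs.add(it);
            if (it.hasNext()) { heads[r] = it.next(); heap.add(r); }
        }
        List<String> out = new ArrayList<>();
        List<List<String>> bags = null;
        int currentKey = 0;
        while (!heap.isEmpty()) {
            int r = heap.poll();
            Tuple t = heads[r];
            if (bags != null && t.key != currentKey) {
                // The smallest pending key has moved on, so the current group is complete.
                out.add(currentKey + "=" + bags);
                bags = null;
            }
            if (bags == null) {
                bags = new ArrayList<>();
                for (int i = 0; i < n; i++) bags.add(new ArrayList<>());
                currentKey = t.key;
            }
            bags.get(r).add(t.value);
            // Refill only from the relation just consumed: the heap never exceeds n entries.
            Iterator<Tuple> it = inputs.get(r);
            if (it.hasNext()) { heads[r] = it.next(); heap.add(r); }
        }
        if (bags != null) out.add(currentKey + "=" + bags);
        return out;
    }

    public static void main(String[] args) {
        List<List<Tuple>> relations = Arrays.asList(
            Arrays.asList(new Tuple(1, "a"), new Tuple(2, "b"), new Tuple(2, "c")),
            Arrays.asList(new Tuple(2, "x"), new Tuple(3, "y")));
        for (String group : cogroup(relations)) System.out.println(group);
    }
}
```

Because only the relation that was just consumed is refilled, memory in this sketch is bounded by the number of relations plus the size of the group currently being built.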

> Mapside cogroup runs out of memory
> --
>
> Key: PIG-1395
> URL: https://issues.apache.org/jira/browse/PIG-1395
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: cogrp_mem.patch
>
>
> In a particular scenario, when there aren't a lot of tuples with the same key in a 
> relation (i.e. there aren't many repeating keys), map tasks doing cogroup 
> fail with a GC overhead exception.




[jira] Created: (PIG-1395) Mapside cogroup runs out of memory

2010-04-26 Thread Ashutosh Chauhan (JIRA)
Mapside cogroup runs out of memory
--

 Key: PIG-1395
 URL: https://issues.apache.org/jira/browse/PIG-1395
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0


In a particular scenario, when there aren't a lot of tuples with the same key in a 
relation (i.e. there aren't many repeating keys), map tasks doing cogroup fail 
with a GC overhead exception.




[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861122#action_12861122
 ] 

Ashutosh Chauhan commented on PIG-798:
--

1.
{noformat}
 b = foreach a generate (chararray) $0 as name; 
{noformat}

2.
{noformat}
B = foreach A generate $0 as name:chararray;
{noformat}

@Viraj,

Discussed with Alan and Daniel. The language semantics for achieving this 
functionality, whatever the loader, is form 1. The fact that form 2 works for 
BinStorage is unfortunate and is a bug. It is currently there for 
backward compatibility and will eventually be removed. 


> Schema errors when using PigStorage and none when using BinStorage in 
> FOREACH??
> ---
>
> Key: PIG-798
> URL: https://issues.apache.org/jira/browse/PIG-798
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0
>Reporter: Viraj Bhat
> Attachments: binstoragecreateop, schemaerr.pig, visits.txt
>
>
> In the following script I have a tab separated text file, which I load using 
> PigStorage() and store using BinStorage()
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
> url:chararray, time:chararray);
> B = group A by name;
> store B into '/user/viraj/binstoragecreateop' using BinStorage();
> dump B;
> {code}
> I later load file 'binstoragecreateop' in the following way.
> {code}
> A = load '/user/viraj/binstoragecreateop' using BinStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> Result
> ===
> (Amy)
> (Fred)
> ===
> The above code works properly and returns the right results. If I use 
> PigStorage() to achieve the same, I get the following error.
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> ===
> {code}
> 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
> Field Schema: name: chararray
> Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
> {code}
> ===
> So why should the semantics of BinStorage() be different from PigStorage() 
> where it is ok not to specify a schema??? Should it not be consistent across 
> both?




[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

2010-04-24 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860614#action_12860614
 ] 

Ashutosh Chauhan commented on PIG-1211:
---

Oh, I got confused. From your earlier comment, I understood you to be saying 
that we should add a -checkscript command line option. From your previous 
comment, are you suggesting that we should add a syntax checker that always 
runs (i.e., without needing any command line directive) before the query starts 
to execute, thereby catching as many user errors as possible? I think this is a 
reasonable ask and will be useful to users. This might be the first step 
towards making the distinction between Pig compile time and run time explicit 
to the user. If we go the full length here, we might as well do what Milind 
suggested earlier (and in a recent mail thread). We can add a "compilation" 
phase which first runs a syntax checker, then generates "object code" 
(essentially the job jar) from the pig script. This compiled object can then be 
handed over to the run-time (hadoop cluster). Wow, pig-latin is evolving 
towards a "true language" :)   

> Pig script runs half way after which it reports syntax error
> 
>
> Key: PIG-1211
> URL: https://issues.apache.org/jira/browse/PIG-1211
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
> Fix For: 0.8.0
>
>
> I have a Pig script which is structured in the following way
> {code}
> register cp.jar
> dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, 
> col3, col4, col5);
> filtered_dataset = filter dataset by (col1 == 1);
> proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
> rmf $output1;
> store proj_filtered_dataset into '$output1' using PigStorage();
> second_stream = foreach filtered_dataset  generate col2, col4, col5;
> group_second_stream = group second_stream by col4;
> output2 = foreach group_second_stream {
>  a =  second_stream.col2
>  b =   distinct second_stream.col5;
>  c = order b by $0;
>  generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
> }
> rmf  $output2;
> --syntax error here
> store output2 to '$output2' using PigStorage();
> {code}
> I run this script using the Multi-query option. It runs successfully until the 
> first store but later fails with a syntax error. 
> The use of the HDFS command "rmf" causes the first store to execute. 
> The only options I have are to run an explain before running this script 
> grunt> explain -script myscript.pig -out explain.out
> or to move the rmf statements to the top of the script.
> Here are some questions:
> a) Can we have an option to do something like "checkscript" instead of 
> explain to get the same syntax error?  In this way I can ensure that I do not 
> run for 3-4 hours before encountering a syntax error
> b) Can pig not figure out a way to re-order the rmf statements since all the 
> store directories are variables
> Thanks
> Viraj




[jira] Commented: (PIG-1339) International characters in column names not supported

2010-04-24 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860606#action_12860606
 ] 

Ashutosh Chauhan commented on PIG-1339:
---

This works fine on grunt. 
{code}
grunt> a = load '1-3.txt' using PigStorage() as (あいうえお);
grunt> dump a;
{code}

gives the expected result. The problem arises when it is fed to Pig as a script:
{code}
bin/pig myscript.pig
{code}
gives the exception you showed above. This looks like a bug in 
PigScriptParser.jj, which should read the stream from the script file as UTF-8.
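
The difference between decoding a script as UTF-8 and decoding it with some other charset can be shown with plain `java.io` classes. This is a sketch only: `ScriptDecodingSketch` and `readFirstLine` are hypothetical names, an in-memory byte array stands in for the script file, and the sketch does not reproduce how PigScriptParser actually opens its input.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: decode script bytes with an explicit charset. */
final class ScriptDecodingSketch {

    /** Reads the first line of the stream, decoding bytes with the given charset. */
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The five hiragana characters from the example, written as escapes so the
        // source compiles regardless of the compiler's default encoding.
        String column = "\u3042\u3044\u3046\u3048\u304a";
        String statement = "a = load '1-3.txt' using PigStorage() as (" + column + ");";
        byte[] fileBytes = statement.getBytes(StandardCharsets.UTF_8);

        String asUtf8 = readFirstLine(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8);
        String asLatin1 = readFirstLine(new ByteArrayInputStream(fileBytes), StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 round trip intact: " + statement.equals(asUtf8));
        System.out.println("ISO-8859-1 round trip intact: " + statement.equals(asLatin1));
    }
}
```

Passing the charset explicitly to `InputStreamReader` keeps the result independent of the platform default encoding.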

> International characters in column names not supported
> --
>
> Key: PIG-1339
> URL: https://issues.apache.org/jira/browse/PIG-1339
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0, 0.7.0, 0.8.0
>Reporter: Viraj Bhat
>
> There is a particular use-case in which someone specifies a column name to be 
> in International characters.
> {code}
> inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
> describe inputdata;
> dump inputdata;
> {code}
> ==
> Pig Stack Trace
> ---
> ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  
> Encountered: "\u3042" (12354), after : ""
> org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 
> 1, column 64.  Encountered: "\u3042" (12354), after : ""
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
> at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:391)
> ==
> Thanks Viraj




[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-24 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860598#action_12860598
 ] 

Ashutosh Chauhan commented on PIG-798:
--

You can specify schema in FOREACH GENERATE with PigStorage loader as follows:
{code}
grunt> a = load 'data' using PigStorage();
grunt> b = foreach a generate (chararray) $0 as name; 
grunt> describe b;
b: {name: chararray}
grunt> dump b;
{code}

I get the expected result.

> Schema errors when using PigStorage and none when using BinStorage in 
> FOREACH??
> ---
>
> Key: PIG-798
> URL: https://issues.apache.org/jira/browse/PIG-798
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0
>Reporter: Viraj Bhat
> Attachments: binstoragecreateop, schemaerr.pig, visits.txt
>
>
> In the following script I have a tab separated text file, which I load using 
> PigStorage() and store using BinStorage()
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
> url:chararray, time:chararray);
> B = group A by name;
> store B into '/user/viraj/binstoragecreateop' using BinStorage();
> dump B;
> {code}
> I later load file 'binstoragecreateop' in the following way.
> {code}
> A = load '/user/viraj/binstoragecreateop' using BinStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> Result
> ===
> (Amy)
> (Fred)
> ===
> The above code works properly and returns the right results. If I use 
> PigStorage() to achieve the same, I get the following error.
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> ===
> {code}
> 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
> Field Schema: name: chararray
> Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
> {code}
> ===
> So why should the semantics of BinStorage() be different from PigStorage() 
> where it is ok not to specify a schema??? Should it not be consistent across 
> both?




[jira] Assigned: (PIG-1390) Provide a target to generate eclipse-related classpath and files

2010-04-22 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-1390:
-

Assignee: V.V.Chaitanya Krishna

> Provide a target to generate eclipse-related classpath and files
> 
>
> Key: PIG-1390
> URL: https://issues.apache.org/jira/browse/PIG-1390
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.7.0, 0.8.0
>Reporter: V.V.Chaitanya Krishna
>Assignee: V.V.Chaitanya Krishna
> Fix For: 0.8.0
>
> Attachments: PIG-eclipse_support.patch
>
>
> Currently, after checking out from the svn repository, there is no provision to 
> auto-generate eclipse-related classpath and files, which could help in 
> importing into eclipse directly.




[jira] Updated: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script

2010-04-21 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1345:
--

Parent: PIG-908
Issue Type: Sub-task  (was: Improvement)

> Link casting errors in POCast to actual lines numbers in Pig script
> ---
>
> Key: PIG-1345
> URL: https://issues.apache.org/jira/browse/PIG-1345
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>
> For the purpose of easy debugging, it would be nice to find out where in the 
> Pig script my warnings are coming from. 
> The only known process is to comment out lines in the Pig script and see if 
> these warnings go away.
> 2010-01-13 21:34:13,697 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 
> 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
> 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26
> I think this may require us to keep track of the line numbers of the Pig script 
> (via our javacc parser) and maintain them in the logical and physical plans.
> It would help users in debugging simple errors/warning related to casting.
> Is this enhancement listed in the  http://wiki.apache.org/pig/PigJournal?
> Do we need to change the parser to something other than javacc to make this 
> task simpler?
> "Standardize on Parser and Scanner Technology"
> Viraj




[jira] Commented: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script

2010-04-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859471#action_12859471
 ] 

Ashutosh Chauhan commented on PIG-1345:
---

This will involve recording line numbers (and possibly more metadata) from 
parser to logical layer, then to physical layer and then to backend and then 
back in case of exceptions. This has been discussed before in some detail in 
PIG-908. Linking it against that.
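
As a rough illustration of the idea above (not Pig's actual classes), a line 
number captured at parse time could be copied from a logical operator to its 
physical counterpart and then used when a warning is reported. The names 
LogicalOp, PhysicalOp and warningFor are hypothetical and exist only for this 
sketch:

```java
// Minimal sketch, assuming hypothetical operator classes; these are not Pig's real APIs.
final class LineTrackingSketch {

    /** Hypothetical logical-plan node that remembers its line in the script. */
    static final class LogicalOp {
        final String name;
        final int line;

        LogicalOp(String name, int line) {
            this.name = name;
            this.line = line;
        }
    }

    /** Hypothetical physical-plan node that inherits the line during translation. */
    static final class PhysicalOp {
        final String name;
        final int line;

        PhysicalOp(LogicalOp source) {
            this.name = "PO" + source.name;
            this.line = source.line; // carried across the logical-to-physical boundary
        }

        String warn(String warningCode) {
            return "Warning " + warningCode + " at line " + line + " (" + name + ")";
        }
    }

    /** Builds the warning text for an operator that was parsed at the given script line. */
    static String warningFor(String opName, int line, String warningCode) {
        return new PhysicalOp(new LogicalOp(opName, line)).warn(warningCode);
    }
}
```

Calling LineTrackingSketch.warningFor("Cast", 22, "IMPLICIT_CAST_TO_MAP") 
builds a message that names the script line. Getting the same information back 
from the backend would additionally require serializing it with the plan, 
which this sketch does not cover.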

> Link casting errors in POCast to actual lines numbers in Pig script
> ---
>
> Key: PIG-1345
> URL: https://issues.apache.org/jira/browse/PIG-1345
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>
> For the purpose of easy debugging, it would be nice to find out where in the 
> Pig script my warnings are coming from. 
> The only known process is to comment out lines in the Pig script and see if 
> these warnings go away.
> 2010-01-13 21:34:13,697 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 
> 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
> 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26
> I think this may need us to keep track of the line numbers of the Pig script 
> (via our javacc parser) and maintain them in the logical and physical plan.
> It would help users in debugging simple errors/warning related to casting.
> Is this enhancement listed in the  http://wiki.apache.org/pig/PigJournal?
> Do we need to change the parser to something other than javacc to make this 
> task simpler?
> "Standardize on Parser and Scanner Technology"
> Viraj




[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

2010-04-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859462#action_12859462
 ] 

Ashutosh Chauhan commented on PIG-1211:
---

bq. Can we have an option to do something like "checkscript" instead of explain 
to get the same syntax error? In this way I can ensure that I do not run for 
3-4 hours before encountering a syntax error

It is possible to add something like checkscript, but it would only be 
syntactic sugar, since it would do exactly what explain does, minus printing 
the plan at the end. So I am wondering whether we should tell users to run 
explain to catch syntax errors instead of adding a new command-line option. 
What do others think?

> Pig script runs half way after which it reports syntax error
> 
>
> Key: PIG-1211
> URL: https://issues.apache.org/jira/browse/PIG-1211
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
> Fix For: 0.8.0
>
>
> I have a Pig script which is structured in the following way
> {code}
> register cp.jar
> dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, 
> col3, col4, col5);
> filtered_dataset = filter dataset by (col1 == 1);
> proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
> rmf $output1;
> store proj_filtered_dataset into '$output1' using PigStorage();
> second_stream = foreach filtered_dataset  generate col2, col4, col5;
> group_second_stream = group second_stream by col4;
> output2 = foreach group_second_stream {
>  a =  second_stream.col2
>  b =   distinct second_stream.col5;
>  c = order b by $0;
>  generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
> }
> rmf  $output2;
> --syntax error here
> store output2 to '$output2' using PigStorage();
> {code}
> I run this script using the Multi-query option, it runs successfully till the 
> first store but later fails with a syntax error. 
> The usage of HDFS option, "rmf" causes the first store to execute. 
> The only option that I have is to run an explain before running this script 
> grunt> explain -script myscript.pig -out explain.out
> or moving the rmf statements to the top of the script
> Here are some questions:
> a) Can we have an option to do something like "checkscript" instead of 
> explain to get the same syntax error?  In this way I can ensure that I do not 
> run for 3-4 hours before encountering a syntax error
> b) Can pig not figure out a way to re-order the rmf statements since all the 
> store directories are variables
> Thanks
> Viraj




[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859159#action_12859159
 ] 

Ashutosh Chauhan commented on PIG-798:
--

Viraj,

I am confused by this description. It seems to me that you are first storing 
some data using BinStorage and then loading it using PigStorage. If that is so, 
obviously it will not work. PigStorage and BinStorage aren't interoperable in 
this way. Specifically, data stored using BinStorage, can only be loaded using 
BinStorage.

> Schema errors when using PigStorage and none when using BinStorage in 
> FOREACH??
> ---
>
> Key: PIG-798
> URL: https://issues.apache.org/jira/browse/PIG-798
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Viraj Bhat
> Attachments: binstoragecreateop, schemaerr.pig, visits.txt
>
>
> In the following script I have a tab separated text file, which I load using 
> PigStorage() and store using BinStorage()
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
> url:chararray, time:chararray);
> B = group A by name;
> store B into '/user/viraj/binstoragecreateop' using BinStorage();
> dump B;
> {code}
> I later load file 'binstoragecreateop' in the following way.
> {code}
> A = load '/user/viraj/binstoragecreateop' using BinStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> Result
> ===
> (Amy)
> (Fred)
> ===
> The above code works properly and returns the right results. If I use 
> PigStorage() to achieve the same, I get the following error.
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> ===
> {code}
> 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
> Field Schema: name: chararray
> Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
> {code}
> ===
> So why should the semantics of BinStorage() be different from PigStorage() 
> where is ok not to specify a schema??? Should it not be consistent across 
> both.




[jira] Commented: (PIG-1341) BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-04-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859157#action_12859157
 ] 

Ashutosh Chauhan commented on PIG-1341:
---

I think BinStorage is an internal way of moving data around in Pig and it 
should be treated that way. I think we should discourage users from relying 
on it. Otherwise, we need to add capabilities such as the one requested here. 
An important consequence of adding them is that we can't then swap out 
BinStorage for other storage mechanisms. If Avro (or protobuf or whatever) 
proves to be a better replacement for BinStorage, we can't just swap it in 
unless we add to it all the capabilities that BinStorage has. 
Therefore, I suggest keeping the capabilities of BinStorage minimal.

> BinStorage cannot convert DataByteArray to Chararray and results in 
> FIELD_DISCARDED_TYPE_CONVERSION_FAILED
> --
>
> Key: PIG-1341
> URL: https://issues.apache.org/jira/browse/PIG-1341
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>Assignee: Richard Ding
> Attachments: PIG-1341.patch
>
>
> Script reads in BinStorage data and tries to convert a column which is in 
> DataByteArray to Chararray. 
> {code}
> raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
> --filter out null columns
> A = filter raw by col1#'bcookie' is not null;
> B = foreach A generate col1#'bcookie'  as reqcolumn;
> describe B;
> --B: {reqcolumn: bytearray}
> X = limit B 5;
> dump X;
> B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
> describe B;
> --B: {convertedcol: chararray}
> X = limit B 5;
> dump X;
> {code}
> The first dump produces:
> (36co9b55onr8s)
> (36co9b55onr8s)
> (36hilul5oo1q1)
> (36hilul5oo1q1)
> (36l4cj15ooa8a)
> The second dump produces:
> ()
> ()
> ()
> ()
> ()
> It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
> time(s).
> Viraj




[jira] Commented: (PIG-1339) International characters in column names not supported

2010-04-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859152#action_12859152
 ] 

Ashutosh Chauhan commented on PIG-1339:
---

This is not reproducible on trunk. I get the expected output. Viraj, can you 
please verify if it works for you in trunk ?

> International characters in column names not supported
> --
>
> Key: PIG-1339
> URL: https://issues.apache.org/jira/browse/PIG-1339
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>
> There is a particular use-case in which someone specifies a column name to be 
> in International characters.
> {code}
> inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
> describe inputdata;
> dump inputdata;
> {code}
> ==
> Pig Stack Trace
> ---
> ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  
> Encountered: "\u3042" (12354), after : ""
> org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 
> 1, column 64.  Encountered: "\u3042" (12354), after : ""
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
> at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:391)
> ==
> Thanks Viraj




[jira] Commented: (PIG-1378) har url not usable in Pig scripts

2010-04-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858709#action_12858709
 ] 

Ashutosh Chauhan commented on PIG-1378:
---

{noformat}
grunt> a = load 
'har://namenode-location/user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}

 This is incorrect. You need to do the following:
{noformat}
grunt> a = load 
'har://hdfs-namenode.foo.com:8020/user/viraj/project/subproject/files/size/data';
 
grunt> dump a;
{noformat}

Note that after har:// comes the underlying scheme (hdfs), then a dash (-), 
then the namenode host, then a colon, then the port number (8020), and finally 
the location of your har archive. 
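
The URL layout used in the example above (har://, underlying scheme, dash, 
host, colon, port, archive path) can be assembled as in the following sketch. 
The class and method names are made up for illustration, and plain string 
concatenation is used instead of any Hadoop API:

```java
// Minimal sketch: builds a har URL of the form har://<scheme>-<host>:<port><path>.
final class HarUriSketch {

    /**
     * @param underlyingScheme scheme of the file system that holds the archive, e.g. "hdfs"
     * @param host             namenode host name
     * @param port             namenode port
     * @param archivePath      absolute path inside the file system, starting with "/"
     */
    static String harUrl(String underlyingScheme, String host, int port, String archivePath) {
        // Concatenation keeps the port free of any locale-specific digit formatting.
        return "har://" + underlyingScheme + "-" + host + ":" + port + archivePath;
    }
}
```

For instance, harUrl("hdfs", "namenode.foo.com", 8020, 
"/user/viraj/project/subproject/files/size/data") yields the same string that 
is passed to load in the corrected grunt example.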


> har url not usable in Pig scripts
> -
>
> Key: PIG-1378
> URL: https://issues.apache.org/jira/browse/PIG-1378
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
> Fix For: 0.8.0
>
>
> I am trying to use har (Hadoop Archives) in my Pig script.
> I can use them through the HDFS shell
> {noformat}
> $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
> Found 1 items
> -rw---   5 viraj users1537234 2010-04-14 09:49 
> user/viraj/project/subproject/files/size/data/part-1
> {noformat}
> Using similar URL's in grunt yields
> {noformat}
> grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; 
> grunt> dump a;
> {noformat}
> {noformat}
> 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2998: Unhandled internal error. 
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible 
> file URI scheme: har : hdfs
> 2010-04-14 22:08:48,814 [main] WARN  org.apache.pig.tools.grunt.Grunt - There 
> is no log file to write to.
> 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
> java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
> Incompatible file URI scheme: har : hdfs
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
> at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
> at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
> at org.apache.pig.Main.main(Main.java:357)
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
> Incompatible file URI scheme: har : hdfs
> at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249)
> at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472)
> ... 13 more
> {noformat}
> According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the 
> following as stated in the original description
> {noformat}
> grunt> a = load 
> 'har://namenode-location/user/viraj/project/subproject/files/size/data'; 
> grunt> dump a;
> {noformat}
> {noformat}
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: 
> Unable to create input splits for: 
> har://namenode-location/user/viraj/project/subproject/files/size/data'; 
> ... 8 more
> Caused by: java.io.IOException: No FileSystem for scheme: namenode-location
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
> at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat

[jira] Updated: (PIG-1353) Map-side joins

2010-04-16 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1353:
--

  Status: Resolved  (was: Patch Available)
Release Note: 
With this patch, it is now possible to perform [left|right|full] outer joins on 
two tables as well as inner joins on more than two tables in Pig on the map side 
if the data is sorted and one of the loaders implements the {{CollectableLoader}} 
interface. The primary algorithm is based on sort-merge join. 

Additional implementation details:
1) No other operations can be done between load and join statements.
2) Data must be sorted in ASC order.
3) Nulls are considered smaller than everything. So, if data contains null 
keys, they should occur before anything else.
4) Left-most loader must implement CollectableLoader interface as well as 
OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.   

Note that the Zebra loader satisfies all of these conditions, so it can be used 
out of the box.
Similar conditions apply to map-side cogroups (PIG-1309) as well.  
  Resolution: Fixed

Patch checked-in.
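
The ordering requirement in the release note above (keys in ascending order, 
with nulls sorting before everything else) can be expressed as a simple 
pre-check. This is only a sketch over string keys; the class and method names 
are hypothetical and not part of Pig:

```java
import java.util.Comparator;
import java.util.List;

// Minimal sketch: checks that join keys are ascending with nulls first.
final class MergeJoinOrderCheck {

    // Nulls compare as smaller than any non-null key; non-null keys use natural order.
    static final Comparator<String> NULLS_FIRST_ASC =
            Comparator.nullsFirst(Comparator.<String>naturalOrder());

    /** Returns true if no key is smaller than the key before it. */
    static boolean isSortedForMergeJoin(List<String> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (NULLS_FIRST_ASC.compare(keys.get(i - 1), keys.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }
}
```

A key sequence such as (null, null, "a", "b") passes this check, while 
("a", null) does not, because the null key would have to come first.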


> Map-side joins
> --
>
> Key: PIG-1353
> URL: https://issues.apache.org/jira/browse/PIG-1353
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1353.patch, pig-1353.patch
>
>
> Pig already has couple of map-side join implementations: Merge Join and 
> Fragmented-Replicate Join. But both of them are pretty restrictive. Merge 
> Join can only join two tables and that too can only do inner join. FR Join 
> can join multiple relations, but it can also only do inner and left outer 
> joins. Further it restricts the sizes of side relations. It will be nice if 
> we can do map side joins on multiple tables as well do inner, left outer, 
> right outer and full outer joins. 
> Lot of groundwork for this has already been done in PIG-1309. Remaining will 
> be tracked in this jira.   

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods

2010-04-16 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857932#action_12857932
 ] 

Ashutosh Chauhan commented on PIG-1354:
---

Dmitriy,
Neat work! 

This patch makes it possible to call a few existing methods in the JDK libs 
(which are already compiled and available at runtime). 
Thinking aloud, what would it take to go one step further from here? That is, 
to take uncompiled Java code blocks, compile them at runtime, and make them 
available as UDFs. If we can get that, then we can allow users to write the 
Java code of their UDFs inline in the Pig script, and Pig can compile it at 
runtime and do the other necessary plumbing to make it work. Then there is no 
more need to write Java code separately, compile it, jar it, register it, etc. 
All user code will be in one file. This will make writing UDFs a lot easier. 
http://docs.codehaus.org/display/JANINO/Home and http://commons.apache.org/jci/ 
might be of help. 
Thoughts?
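
For context, calling an already-compiled JDK method by name through reflection 
might look roughly like the sketch below. It is not the code from the attached 
patch; the class name DynamicInvokeSketch and the helper invokeStatic are 
assumptions made for illustration:

```java
import java.lang.reflect.Method;

// Minimal sketch: looks up a public static method by name and invokes it reflectively.
final class DynamicInvokeSketch {

    /**
     * @param className  fully qualified class name, e.g. "java.lang.Math"
     * @param methodName name of a public static method on that class
     * @param paramTypes exact parameter types used to select the overload
     * @param args       arguments; primitives are passed boxed
     */
    static Object invokeStatic(String className, String methodName,
                               Class<?>[] paramTypes, Object... args) {
        try {
            Method method = Class.forName(className).getMethod(methodName, paramTypes);
            return method.invoke(null, args); // null receiver because the method is static
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                    "Could not invoke " + className + "." + methodName, e);
        }
    }
}
```

With this helper, invokeStatic("java.lang.Math", "max", new Class<?>[] 
{int.class, int.class}, 3, 7) returns the boxed result of Math.max(3, 7). 
Compiling inline source at runtime, as suggested above, would need a compiler 
library on top of this and is not shown.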

> UDFs for dynamic invocation of simple Java methods
> --
>
> Key: PIG-1354
> URL: https://issues.apache.org/jira/browse/PIG-1354
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch
>
>
> The need to create wrapper UDFs for simple Java functions creates unnecessary 
> work for Pig users, slows down the development process, and produces a lot of 
> trivial classes. We can use Java's reflection to allow invoking a number of 
> methods on the fly, dynamically, by creating a generic UDF to accomplish this.





[jira] Commented: (PIG-1353) Map-side joins

2010-04-16 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857887#action_12857887
 ] 

Ashutosh Chauhan commented on PIG-1353:
---

Yes, the visibility change of the visitor methods from public to protected is 
unrelated to the issue. I did it as part of the cleanup of LogToPhyTranslator. 
I don't see a reason why visitor methods should be public. In general we strive 
not to make things public, and all usages of those methods were in the same 
package, so changing them from public to protected is a safe choice. 

> Map-side joins
> --
>
> Key: PIG-1353
> URL: https://issues.apache.org/jira/browse/PIG-1353
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1353.patch, pig-1353.patch
>
>
> Pig already has couple of map-side join implementations: Merge Join and 
> Fragmented-Replicate Join. But both of them are pretty restrictive. Merge 
> Join can only join two tables and that too can only do inner join. FR Join 
> can join multiple relations, but it can also only do inner and left outer 
> joins. Further it restricts the sizes of side relations. It will be nice if 
> we can do map side joins on multiple tables as well do inner, left outer, 
> right outer and full outer joins. 
> Lot of groundwork for this has already been done in PIG-1309. Remaining will 
> be tracked in this jira.   





[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-15 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1363:
--

   Status: Patch Available  (was: Reopened)
Affects Version/s: 0.8.0
   (was: 0.7.0)

> Unnecessary loadFunc instantiations
> ---
>
> Key: PIG-1363
> URL: https://issues.apache.org/jira/browse/PIG-1363
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1363.patch, pig-1363_1.patch
>
>
> In MRCompiler loadfuncs are instantiated at multiple locations in different 
> visit methods. This is inconsistent and confusing. LoadFunc should be 
> instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). 
> A getter should be added to POLoad to retrieve this instantiated loadFunc 
> wherever it is needed in later stages of compilation. 





[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-15 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1363:
--

Attachment: pig-1363_1.patch

Can't get away without passing the loader signature to the backend for Merge 
Join. So, set it.

> Unnecessary loadFunc instantiations
> ---
>
> Key: PIG-1363
> URL: https://issues.apache.org/jira/browse/PIG-1363
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1363.patch, pig-1363_1.patch
>
>
> In MRCompiler loadfuncs are instantiated at multiple locations in different 
> visit methods. This is inconsistent and confusing. LoadFunc should be 
> instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). 
> A getter should be added to POLoad to retrieve this instantiated loadFunc 
> wherever it is needed in later stages of compilation. 





[jira] Reopened: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-15 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reopened PIG-1363:
---


The patch fixes the issue in one place but breaks things in another. Need to 
re-fix.

> Unnecessary loadFunc instantiations
> ---
>
> Key: PIG-1363
> URL: https://issues.apache.org/jira/browse/PIG-1363
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1363.patch, pig-1363_1.patch
>
>
> In MRCompiler loadfuncs are instantiated at multiple locations in different 
> visit methods. This is inconsistent and confusing. LoadFunc should be 
> instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). 
> A getter should be added to POLoad to retrieve this instantiated loadFunc 
> wherever it is needed in later stages of compilation. 





[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-04-14 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857154#action_12857154
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

As per the http://www.mail-archive.com/pig-u...@hadoop.apache.org/msg02257.html 
thread, I am wondering if it will be safe and possible to make sure that a job 
using this storage has speculative execution turned off. Otherwise, with S.E. 
turned on, there are too many scenarios we would have to handle. What do you 
think?
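
As a sketch of what "turned off" could mean in job configuration terms, the 
snippet below sets the two speculative-execution properties to false. The 
property names are the ones used by Hadoop releases of that era and should be 
checked against the Hadoop version in use; java.util.Properties stands in for 
the real job configuration object, and the class name is hypothetical:

```java
import java.util.Properties;

// Minimal sketch: java.util.Properties is a stand-in for the actual job configuration.
final class SpeculativeExecutionSketch {

    // Assumed property names; verify them for the Hadoop version you run.
    static final String MAP_SPECULATIVE = "mapred.map.tasks.speculative.execution";
    static final String REDUCE_SPECULATIVE = "mapred.reduce.tasks.speculative.execution";

    /** Returns a copy of the given settings with speculative execution disabled. */
    static Properties withSpeculativeExecutionOff(Properties jobProps) {
        Properties copy = new Properties();
        copy.putAll(jobProps); // leave the caller's settings untouched
        copy.setProperty(MAP_SPECULATIVE, "false");
        copy.setProperty(REDUCE_SPECULATIVE, "false");
        return copy;
    }
}
```

Whether the storage function itself or the user should be responsible for 
applying such settings is exactly the open question raised above.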

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch
>
>
> UDF to store data into a DB





[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-14 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1363:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch checked-in.

> Unnecessary loadFunc instantiations
> ---
>
> Key: PIG-1363
> URL: https://issues.apache.org/jira/browse/PIG-1363
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1363.patch
>
>
> In MRCompiler loadfuncs are instantiated at multiple locations in different 
> visit methods. This is inconsistent and confusing. LoadFunc should be 
> instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). 
> A getter should be added to POLoad to retrieve this instantiated loadFunc 
> wherever it is needed in later stages of compilation. 





[jira] Commented: (PIG-1353) Map-side joins

2010-04-14 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857149#action_12857149
 ] 

Ashutosh Chauhan commented on PIG-1353:
---

Hudson.. Oh Hudson.. when y'll get better ! Ran the full test suite. All of 
them passed. Ran test-patch:
{noformat}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 12 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

{noformat}

Patch is ready for review.

> Map-side joins
> --
>
> Key: PIG-1353
> URL: https://issues.apache.org/jira/browse/PIG-1353
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1353.patch, pig-1353.patch
>
>
> Pig already has couple of map-side join implementations: Merge Join and 
> Fragmented-Replicate Join. But both of them are pretty restrictive. Merge 
> Join can only join two tables and that too can only do inner join. FR Join 
> can join multiple relations, but it can also only do inner and left outer 
> joins. Further it restricts the sizes of side relations. It will be nice if 
> we can do map side joins on multiple tables as well do inner, left outer, 
> right outer and full outer joins. 
> Lot of groundwork for this has already been done in PIG-1309. Remaining will 
> be tracked in this jira.   





[jira] Commented: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-14 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856992#action_12856992
 ] 

Ashutosh Chauhan commented on PIG-1363:
---

Hudson is flaky (again). Result of test-patch:
{noformat}
 [exec] 
 [exec] -1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] -1 tests included.  The patch doesn't appear to include any new or modified tests.
 [exec] Please justify why no tests are needed for this patch.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.
 [exec] 

{noformat} 
Patch is ready for review.

> Unnecessary loadFunc instantiations
> ---
>
> Key: PIG-1363
> URL: https://issues.apache.org/jira/browse/PIG-1363
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1363.patch
>
>
> In MRCompiler, loadfuncs are instantiated at multiple locations in different
> visit methods. This is inconsistent and confusing. LoadFunc should be
> instantiated in only one place, ideally in LogToPhyTranslation#visit(LOLoad).
> A getter should be added to POLoad to retrieve this instantiated loadFunc
> wherever it is needed in later stages of compilation.





[jira] Updated: (PIG-1353) Map-side joins

2010-04-13 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1353:
--

   Status: Patch Available  (was: Open)
Fix Version/s: 0.8.0

> Map-side joins
> --
>
> Key: PIG-1353
> URL: https://issues.apache.org/jira/browse/PIG-1353
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1353.patch, pig-1353.patch
>
>
> Pig already has a couple of map-side join implementations: Merge Join and
> Fragmented-Replicate Join. But both of them are pretty restrictive. Merge
> Join can only join two tables, and can only do an inner join. FR Join can
> join multiple relations, but it can only do inner and left outer joins.
> Further, it restricts the sizes of the side relations. It would be nice if
> we could do map-side joins on multiple tables and also support inner, left
> outer, right outer and full outer joins.
> A lot of groundwork for this has already been done in PIG-1309. The
> remaining work will be tracked in this jira.





[jira] Updated: (PIG-1353) Map-side joins

2010-04-13 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1353:
--

Attachment: pig-1353.patch

Running through hudson. 

> Map-side joins
> --
>
> Key: PIG-1353
> URL: https://issues.apache.org/jira/browse/PIG-1353
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-1353.patch, pig-1353.patch
>
>
> Pig already has a couple of map-side join implementations: Merge Join and
> Fragmented-Replicate Join. But both of them are pretty restrictive. Merge
> Join can only join two tables, and can only do an inner join. FR Join can
> join multiple relations, but it can only do inner and left outer joins.
> Further, it restricts the sizes of the side relations. It would be nice if
> we could do map-side joins on multiple tables and also support inner, left
> outer, right outer and full outer joins.
> A lot of groundwork for this has already been done in PIG-1309. The
> remaining work will be tracked in this jira.





[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-13 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1363:
--

Status: Patch Available  (was: Open)

> Unnecessary loadFunc instantiations
> ---
>
> Key: PIG-1363
> URL: https://issues.apache.org/jira/browse/PIG-1363
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1363.patch
>
>
> In MRCompiler, loadfuncs are instantiated at multiple locations in different
> visit methods. This is inconsistent and confusing. LoadFunc should be
> instantiated in only one place, ideally in LogToPhyTranslation#visit(LOLoad).
> A getter should be added to POLoad to retrieve this instantiated loadFunc
> wherever it is needed in later stages of compilation.





[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-13 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1363:
--

Status: Open  (was: Patch Available)

> Unnecessary loadFunc instantiations
> ---
>
> Key: PIG-1363
> URL: https://issues.apache.org/jira/browse/PIG-1363
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1363.patch
>
>
> In MRCompiler, loadfuncs are instantiated at multiple locations in different
> visit methods. This is inconsistent and confusing. LoadFunc should be
> instantiated in only one place, ideally in LogToPhyTranslation#visit(LOLoad).
> A getter should be added to POLoad to retrieve this instantiated loadFunc
> wherever it is needed in later stages of compilation.





[jira] Assigned: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-12 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-1363:
-

Assignee: Ashutosh Chauhan

> Unnecessary loadFunc instantiations
> ---
>
> Key: PIG-1363
> URL: https://issues.apache.org/jira/browse/PIG-1363
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1363.patch
>
>
> In MRCompiler, loadfuncs are instantiated at multiple locations in different
> visit methods. This is inconsistent and confusing. LoadFunc should be
> instantiated in only one place, ideally in LogToPhyTranslation#visit(LOLoad).
> A getter should be added to POLoad to retrieve this instantiated loadFunc
> wherever it is needed in later stages of compilation.





[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-12 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1363:
--

Status: Patch Available  (was: Open)

> Unnecessary loadFunc instantiations
> ---
>
> Key: PIG-1363
> URL: https://issues.apache.org/jira/browse/PIG-1363
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1363.patch
>
>
> In MRCompiler, loadfuncs are instantiated at multiple locations in different
> visit methods. This is inconsistent and confusing. LoadFunc should be
> instantiated in only one place, ideally in LogToPhyTranslation#visit(LOLoad).
> A getter should be added to POLoad to retrieve this instantiated loadFunc
> wherever it is needed in later stages of compilation.





[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-12 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1363:
--

Attachment: pig-1363.patch

The ideal solution to this problem is to have {{{LoadFunc}}} implement 
{{{Serializable}}}. LoadFunc would then be instantiated once, the first time it 
is needed (in LOLoad), and that one object would be used everywhere. But this 
would be backward incompatible, as all load func implementations would then 
have to implement Serializable. So, for now we will live with this. 
This patch gets rid of the multiple load func instantiations in the front end 
wherever they could be avoided without making LoadFunc Serializable. No test 
cases are needed since this is purely code cleanup and doesn't add/delete/modify 
any existing functionality, so the current regression tests suffice. 
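
The "instantiate once, retrieve through a getter" idea from the issue description could look roughly like the sketch below. It is only an illustration and not the code in the attached patch: LoaderSketch, TextLoaderSketch and PhysicalLoadSketch are hypothetical stand-ins for LoadFunc, a concrete loader and POLoad, and the instance counter exists only to make the behaviour visible.

```java
// Hypothetical stand-in for Pig's LoadFunc; not the real class.
abstract class LoaderSketch {
    static int instances = 0;      // counts constructions, for illustration only
    LoaderSketch() { instances++; }
    abstract String name();
}

// Hypothetical concrete loader.
class TextLoaderSketch extends LoaderSketch {
    String name() { return "text"; }
}

// Hypothetical stand-in for POLoad: it keeps the loader that was created
// during logical-to-physical translation and exposes it through a getter.
class PhysicalLoadSketch {
    private final LoaderSketch loader;
    PhysicalLoadSketch(LoaderSketch loader) { this.loader = loader; }
    LoaderSketch getLoadFunc() { return loader; }
}

class InstantiateOnceDemo {
    public static void main(String[] args) {
        // The loader is constructed exactly once, at translation time.
        PhysicalLoadSketch load = new PhysicalLoadSketch(new TextLoaderSketch());
        // Later compilation stages ask the operator for it instead of
        // constructing a fresh instance.
        LoaderSketch first = load.getLoadFunc();
        LoaderSketch second = load.getLoadFunc();
        System.out.println(first == second);
    }
}
```

Both calls return the same object, so later visit methods never need to know how the loader was constructed.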

> Unnecessary loadFunc instantiations
> ---
>
> Key: PIG-1363
> URL: https://issues.apache.org/jira/browse/PIG-1363
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1363.patch
>
>
> In MRCompiler, loadfuncs are instantiated at multiple locations in different
> visit methods. This is inconsistent and confusing. LoadFunc should be
> instantiated in only one place, ideally in LogToPhyTranslation#visit(LOLoad).
> A getter should be added to POLoad to retrieve this instantiated loadFunc
> wherever it is needed in later stages of compilation.





[jira] Commented: (PIG-506) Does pig need a NATIVE keyword?

2010-04-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855434#action_12855434
 ] 

Ashutosh Chauhan commented on PIG-506:
--

Ashitosh,

When I click on that link, I get:
{noformat}
You do not have the required role. 
{noformat}

Do you need to set permissions for it to be world-readable? (if that is what 
you are intending to do) 

> Does pig need a NATIVE keyword?
> ---
>
> Key: PIG-506
> URL: https://issues.apache.org/jira/browse/PIG-506
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Minor
>
> Assume a user had a job that broke easily into three pieces.  Further assume 
> that pieces one and three were easily expressible in pig, but that piece two 
> needed to be written in map reduce for whatever reason (performance, 
> something that pig could not easily express, legacy job that was too 
> important to change, etc.).  Today the user would either have to use map 
> reduce for the entire job or manually handle the stitching together of pig 
> and map reduce jobs.  What if instead pig provided a NATIVE keyword that 
> would allow the script to pass off the data stream to the underlying system 
> (in this case map reduce).  The semantics of NATIVE would vary by underlying 
> system.  In the map reduce case, we would assume that this indicated a 
> collection of one or more fully contained map reduce jobs, so that pig would 
> store the data, invoke the map reduce jobs, and then read the resulting data 
> to continue.  It might look something like this:
> {code}
> A = load 'myfile';
> X = load 'myotherfile';
> B = group A by $0;
> C = foreach B generate group, myudf(B);
> D = native (jar=mymr.jar, infile=frompig outfile=topig);
> E = join D by $0, X by $0;
> ...
> {code}
> This differs from streaming in that it allows the user to insert an arbitrary 
> amount of native processing, whereas streaming allows the insertion of one 
> binary.  It also differs in that, for streaming, data is piped directly into 
> and out of the binary as part of the pig pipeline.  Here the pipeline would 
> be broken, data written to disk, and the native block invoked, then data read 
> back from disk.
> Another alternative is to say this is unnecessary because the user can do the 
> coordination from java, using the PigServer interface to run pig and calling 
> the map reduce job explicitly.  The advantages of the native keyword are that 
> the user need not be worried about coordination between the jobs, pig will 
> take care of it.  Also the user can make use of existing java applications 
> without being a java programmer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan resolved PIG-1362.
---

Resolution: Fixed

Hudson is flaky once again, so I ran the full test suite. All of it passed. 
Ran test-patch:

{noformat}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{noformat}

Patch checked in to the 0.7 branch.

> Provide udf context signature in ensureAllKeysInSameSplit() method of loader
> 
>
> Key: PIG-1362
> URL: https://issues.apache.org/jira/browse/PIG-1362
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
>Priority: Critical
> Fix For: 0.7.0
>
> Attachments: backport.patch
>
>
> As a part of PIG-1292, a check was introduced to make sure that the loader
> used in a "collected" group-by implements CollectableLoader (a new interface
> in that patch). In its method, the loader may use the udf context to store
> some info. We need to make sure that the udf context signature is set up
> correctly in such cases. This is already the case in trunk; it needs to be
> backported to the 0.7 branch.




[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-04-07 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854740#action_12854740
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

You can get rid of this stack trace by overriding 
relToAbsPathForStoreLocation() of StoreFunc, which DBStorage extends, and 
turning it into a no-op. Since the DB location is always absolute, there is no 
need for the default behavior in StoreFunc. 

As for DataType.find(), I found that even PigStorage does the same, so this 
patch is no worse than PigStorage in that respect.
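
A rough, self-contained sketch of the suggested override follows. StoreFuncSketch and DbStorageSketch are hypothetical stand-ins, not the Pig classes, and the method signature is simplified to plain strings as an assumption; the real StoreFunc method should be checked against the Pig javadoc before copying this shape.

```java
// Hypothetical stand-in for StoreFunc's default behaviour: a relative
// location is resolved against the current working directory.
class StoreFuncSketch {
    String relToAbsPathForStoreLocation(String location, String curDir) {
        return location.startsWith("/") ? location : curDir + "/" + location;
    }
}

// Hypothetical stand-in for DBStorage: the location is a database address,
// which is already absolute, so the override hands it back untouched.
class DbStorageSketch extends StoreFuncSketch {
    @Override
    String relToAbsPathForStoreLocation(String location, String curDir) {
        return location; // no-op
    }
}

class NoOpPathDemo {
    public static void main(String[] args) {
        String cwd = "/user/pig";
        // Default behaviour prefixes the working directory.
        System.out.println(new StoreFuncSketch().relToAbsPathForStoreLocation("out", cwd));
        // The DB variant leaves the location as it was given.
        System.out.println(new DbStorageSketch().relToAbsPathForStoreLocation("jdbc:example", cwd));
    }
}
```

With the no-op in place, the location string never goes through path resolution, which is the step the comment above points to as the source of the stack trace.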

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch
>
>
> UDF to store data into a DB




[jira] Created: (PIG-1363) Unnecessary loadFunc instantiations

2010-04-07 Thread Ashutosh Chauhan (JIRA)
Unnecessary loadFunc instantiations
---

 Key: PIG-1363
 URL: https://issues.apache.org/jira/browse/PIG-1363
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


In MRCompiler, loadfuncs are instantiated at multiple locations in different 
visit methods. This is inconsistent and confusing. LoadFunc should be 
instantiated in only one place, ideally in LogToPhyTranslation#visit(LOLoad). A 
getter should be added to POLoad to retrieve this instantiated loadFunc 
wherever it is needed in later stages of compilation. 




[jira] Reopened: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reopened PIG-1362:
---


> Provide udf context signature in ensureAllKeysInSameSplit() method of loader
> 
>
> Key: PIG-1362
> URL: https://issues.apache.org/jira/browse/PIG-1362
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
>Priority: Critical
> Fix For: 0.7.0
>
> Attachments: backport.patch
>
>
> As a part of PIG-1292, a check was introduced to make sure that the loader
> used in a "collected" group-by implements CollectableLoader (a new interface
> in that patch). In its method, the loader may use the udf context to store
> some info. We need to make sure that the udf context signature is set up
> correctly in such cases. This is already the case in trunk; it needs to be
> backported to the 0.7 branch.




[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data

2010-04-07 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854643#action_12854643
 ] 

Ashutosh Chauhan commented on PIG-1348:
---

1) As far as I can see, TextOutputFormat has a synchronized write() because it 
is meant to work even with mappers implementing MultithreadedMapRunner. But 
since that's not the case for Pig, we can get rid of it, especially now that we 
are putting in our own PigTextOutputFormat instead of using TextOutputFormat. 

3) That's what I meant: if a Schema is available, we should use it to find 
types, instead of reflecting on every call. I suggested the workaround of 
caching for the case where we know the user did provide a Schema but we don't 
have a handle on it. Clearly, if there is no schema, we need to find the type 
every time. I can see that dealing with complex types even when there is a 
schema is not straightforward. In any case, the casts that are currently there 
for simple types are unnecessary.

As for performance numbers, both of these will save CPU time. If we are 
convinced that we are always I/O bound, we can leave these things as they are. 

> PigStorage making unnecessary byte array copy when storing data
> ---
>
> Key: PIG-1348
> URL: https://issues.apache.org/jira/browse/PIG-1348
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Richard Ding
> Fix For: 0.7.0
>
> Attachments: PIG-1348.patch, PIG-1348_2.patch
>
>
> InternalCachedBag estimates the memory available to the VM by using
> Runtime.getRuntime().maxMemory(). It then uses 10% (by default, though
> configurable) of this memory and divides it among the bags. It keeps track
> of the memory used by the bags and proactively spills if their memory usage
> gets close to these limits. Given all this, in theory InternalCachedBag
> should not run out of memory when presented with more data than it can
> handle. But in practice we find OOM happening.




[jira] Updated: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1362:
--

Status: Patch Available  (was: Open)

> Provide udf context signature in ensureAllKeysInSameSplit() method of loader
> 
>
> Key: PIG-1362
> URL: https://issues.apache.org/jira/browse/PIG-1362
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Priority: Critical
> Fix For: 0.7.0
>
> Attachments: backport.patch
>
>
> As a part of PIG-1292, a check was introduced to make sure that the loader
> used in a "collected" group-by implements CollectableLoader (a new interface
> in that patch). In its method, the loader may use the udf context to store
> some info. We need to make sure that the udf context signature is set up
> correctly in such cases. This is already the case in trunk; it needs to be
> backported to the 0.7 branch.




[jira] Assigned: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-1362:
-

Assignee: Ashutosh Chauhan

> Provide udf context signature in ensureAllKeysInSameSplit() method of loader
> 
>
> Key: PIG-1362
> URL: https://issues.apache.org/jira/browse/PIG-1362
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
>Priority: Critical
> Fix For: 0.7.0
>
> Attachments: backport.patch
>
>
> As a part of PIG-1292, a check was introduced to make sure that the loader
> used in a "collected" group-by implements CollectableLoader (a new interface
> in that patch). In its method, the loader may use the udf context to store
> some info. We need to make sure that the udf context signature is set up
> correctly in such cases. This is already the case in trunk; it needs to be
> backported to the 0.7 branch.




[jira] Updated: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1362:
--

Attachment: backport.patch

Simple one line fix. Test cases included.

> Provide udf context signature in ensureAllKeysInSameSplit() method of loader
> 
>
> Key: PIG-1362
> URL: https://issues.apache.org/jira/browse/PIG-1362
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Priority: Critical
> Fix For: 0.7.0
>
> Attachments: backport.patch
>
>
> As a part of PIG-1292, a check was introduced to make sure that the loader
> used in a "collected" group-by implements CollectableLoader (a new interface
> in that patch). In its method, the loader may use the udf context to store
> some info. We need to make sure that the udf context signature is set up
> correctly in such cases. This is already the case in trunk; it needs to be
> backported to the 0.7 branch.




[jira] Created: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader

2010-04-07 Thread Ashutosh Chauhan (JIRA)
Provide udf context signature in ensureAllKeysInSameSplit() method of loader


 Key: PIG-1362
 URL: https://issues.apache.org/jira/browse/PIG-1362
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Critical
 Fix For: 0.7.0


As a part of PIG-1292, a check was introduced to make sure that the loader 
used in a "collected" group-by implements CollectableLoader (a new interface 
in that patch). In its method, the loader may use the udf context to store 
some info. We need to make sure that the udf context signature is set up 
correctly in such cases. This is already the case in trunk; it needs to be 
backported to the 0.7 branch. 




[jira] Commented: (PIG-959) Merge Join fails when there is a blocking operator before it in query.

2010-04-06 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854340#action_12854340
 ] 

Ashutosh Chauhan commented on PIG-959:
--

Patch includes two new tests. Not sure why hudson thought otherwise. Patch is 
ready for review.

> Merge Join fails when there is a blocking operator before it in query.
> --
>
> Key: PIG-959
> URL: https://issues.apache.org/jira/browse/PIG-959
> Project: Pig
>  Issue Type: Bug
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-959.patch
>
>
> If there is an order-by, distinct or any other blocking operator in query 
> followed by Merge Join, pig fails to compile it.




[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data

2010-04-06 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854292#action_12854292
 ] 

Ashutosh Chauhan commented on PIG-1348:
---

Since this is mostly performance related, there are a few more things which we 
can get in, depending on the complexity/speedup tradeoff:
1) PigLineRecordWriter#write() is synchronized. Is that needed? I don't see a 
scenario where multiple threads are writing using the same object and thus 
potentially stomping on each other. Am I missing something here?
2) Within write() I think it can be safely assumed that value is of type Tuple, 
because the argument of putNext() is of type Tuple. Then we can get rid of the 
instanceof.
3) In StorageUtil.putField(), is it possible to get rid of DataType.findType(), 
possibly by getting hold of the schema and getting the type information from 
there? If not, then maybe we can cache the type info the first time, instead of 
finding it on every call. At the very least, we should get rid of the casts for 
simple types, as they are unnecessary. DataType.isComplex() can be used to 
determine that. 
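
The caching idea in point 3 could be sketched roughly as below. This is a self-contained illustration rather than Pig code: FieldTypeCacheSketch, its type constants and its findType() are hypothetical stand-ins for StorageUtil.putField() and DataType.findType(), and the sketch assumes that every tuple carries the same field types in the same positions, which is only safe when the user supplied a schema.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names are Pig classes.
class FieldTypeCacheSketch {
    static final byte UNKNOWN = 0;
    static final byte INTEGER = 1;
    static final byte CHARARRAY = 2;

    int lookups = 0;        // how many times the slow lookup ran
    private byte[] cached;  // per-column types, filled from the first tuple

    // Stand-in for an instanceof-based lookup done per field.
    byte findType(Object field) {
        lookups++;
        if (field instanceof Integer) return INTEGER;
        if (field instanceof String) return CHARARRAY;
        return UNKNOWN;
    }

    // Returns the type of column i, inspecting objects only for the first tuple.
    byte typeOf(List<Object> tuple, int i) {
        if (cached == null) {
            cached = new byte[tuple.size()];
            for (int j = 0; j < cached.length; j++) {
                cached[j] = findType(tuple.get(j));
            }
        }
        // Fall back to a fresh lookup if a later tuple is wider than the first.
        return i < cached.length ? cached[i] : findType(tuple.get(i));
    }
}

class FieldTypeCacheDemo {
    public static void main(String[] args) {
        FieldTypeCacheSketch cache = new FieldTypeCacheSketch();
        List<Object> first = Arrays.<Object>asList(1, "a");
        List<Object> second = Arrays.<Object>asList(2, "b");
        cache.typeOf(first, 0);
        cache.typeOf(second, 1);
        // Only the two columns of the first tuple were inspected.
        System.out.println(cache.lookups); // prints 2
    }
}
```

The per-call cost drops to an array read, at the price of wrong answers if the field types ever vary between tuples, so without a schema the lookup on every call is still needed.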

> PigStorage making unnecessary byte array copy when storing data
> ---
>
> Key: PIG-1348
> URL: https://issues.apache.org/jira/browse/PIG-1348
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Richard Ding
> Fix For: 0.7.0
>
> Attachments: PIG-1348.patch
>
>
> InternalCachedBag estimates the memory available to the VM by using
> Runtime.getRuntime().maxMemory(). It then uses 10% (by default, though
> configurable) of this memory and divides it among the bags. It keeps track
> of the memory used by the bags and proactively spills if their memory usage
> gets close to these limits. Given all this, in theory InternalCachedBag
> should not run out of memory when presented with more data than it can
> handle. But in practice we find OOM happening.




[jira] Updated: (PIG-959) Merge Join fails when there is a blocking operator before it in query.

2010-04-06 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-959:
-

Status: Patch Available  (was: Open)

Running through hudson.

> Merge Join fails when there is a blocking operator before it in query.
> --
>
> Key: PIG-959
> URL: https://issues.apache.org/jira/browse/PIG-959
> Project: Pig
>  Issue Type: Bug
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-959.patch
>
>
> If there is an order-by, distinct or any other blocking operator in query 
> followed by Merge Join, pig fails to compile it.




[jira] Updated: (PIG-959) Merge Join fails when there is a blocking operator before it in query.

2010-04-06 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-959:
-

Attachment: pig-959.patch

Attached a patch which lifts the restriction that no blocking operator could 
be placed before a merge join. With this work, data can be ordered and joined 
in the same script. 

> Merge Join fails when there is a blocking operator before it in query.
> --
>
> Key: PIG-959
> URL: https://issues.apache.org/jira/browse/PIG-959
> Project: Pig
>  Issue Type: Bug
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-959.patch
>
>
> If there is an order-by, distinct or any other blocking operator in query 
> followed by Merge Join, pig fails to compile it.




[jira] Updated: (PIG-1353) Map-side joins

2010-04-02 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1353:
--

Attachment: pig-1353.patch

An illustrative patch which achieves this.

> Map-side joins
> --
>
> Key: PIG-1353
> URL: https://issues.apache.org/jira/browse/PIG-1353
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-1353.patch
>
>
> Pig already has a couple of map-side join implementations: Merge Join and 
> Fragmented-Replicate Join. But both of them are pretty restrictive. Merge 
> Join can join only two tables, and it can do only inner joins. FR Join can 
> join multiple relations, but it too can do only inner and left outer joins. 
> Further, it restricts the sizes of the side relations. It would be nice if we 
> could do map-side joins on multiple tables and also do inner, left outer, 
> right outer and full outer joins. 
> A lot of the groundwork for this has already been done in PIG-1309. The rest 
> will be tracked in this jira.




[jira] Created: (PIG-1353) Map-side joins

2010-04-02 Thread Ashutosh Chauhan (JIRA)
Map-side joins
--

 Key: PIG-1353
 URL: https://issues.apache.org/jira/browse/PIG-1353
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan


Pig already has a couple of map-side join implementations: Merge Join and 
Fragmented-Replicate Join. But both of them are pretty restrictive. Merge Join 
can join only two tables, and it can do only inner joins. FR Join can join 
multiple relations, but it too can do only inner and left outer joins. Further, 
it restricts the sizes of the side relations. It would be nice if we could do 
map-side joins on multiple tables and also do inner, left outer, right outer 
and full outer joins. 

A lot of the groundwork for this has already been done in PIG-1309. The rest 
will be tracked in this jira.




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-04-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Attachment: pig-1309_2.patch

Updated the patch to fix test failures and javac warnings, and to add more comments.

Result of test-patch on latest patch:
{noformat}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 9 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
{noformat}

Result of test-commit:
{noformat}
test-commit:
[mkdir] Created dir: /homes/chauhana/scratch/latest/build/test/logs
[junit] Running org.apache.pig.test.TestAdd
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.036 sec
:
:
[junit] Running org.apache.pig.test.TestTypeCheckingValidatorNoSchema
[junit] Tests run: 13, Failures: 0, Errors: 0, Time elapsed: 0.165 sec
BUILD SUCCESSFUL
{noformat}

Patch checked in to trunk.


> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch
>
>
> In the never-ending quest to make Pig go faster, we want to parallelize as many 
> relational operations as possible. It's already possible to do Group-by 
> (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to 
> add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Commented: (PIG-1348) InternalCachedBag running out of memory

2010-04-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852511#action_12852511
 ] 

Ashutosh Chauhan commented on PIG-1348:
---

To reproduce, cogroup page_views (from the PigMix dataset) with page_views on 
user, and this exception should occur. Apart from making InternalCachedBag more 
robust, the important thing to figure out here is where 90% of the available 
memory is being used. Also, a related fix went in recently: PIG-1307. This 
might be related to that issue. 

> InternalCachedBag running out of memory
> ---
>
> Key: PIG-1348
> URL: https://issues.apache.org/jira/browse/PIG-1348
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Richard Ding
>
> InternalCachedBag estimates the memory available to the VM by using 
> Runtime.getRuntime().maxMemory(). It then takes 10% of this memory (the 
> default, though configurable) and divides it by the number of bags. It keeps 
> track of the memory used by the bags and proactively spills when their memory 
> usage gets close to these limits. Given all this, in theory InternalCachedBag 
> should not run out of memory even when presented with more data than it can 
> handle. But in practice we see OOMs happening. 




[jira] Created: (PIG-1348) InternalCachedBag running out of memory

2010-04-01 Thread Ashutosh Chauhan (JIRA)
InternalCachedBag running out of memory
---

 Key: PIG-1348
 URL: https://issues.apache.org/jira/browse/PIG-1348
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding


InternalCachedBag estimates the memory available to the VM by using 
Runtime.getRuntime().maxMemory(). It then takes 10% of this memory (the 
default, though configurable) and divides it by the number of bags. It keeps 
track of the memory used by the bags and proactively spills when their memory 
usage gets close to these limits. Given all this, in theory InternalCachedBag 
should not run out of memory even when presented with more data than it can 
handle. But in practice we see OOMs happening. 
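
A minimal sketch of the budgeting arithmetic described above, to make the 
per-bag limit concrete. The class and method names (BagMemoryBudget, 
perBagLimit, shouldSpill) are hypothetical and are not taken from Pig's 
source; only Runtime.getRuntime().maxMemory() and the 10% default come from 
the description.

```java
// Hypothetical illustration of the budget described in this issue;
// this is not Pig's InternalCachedBag implementation.
final class BagMemoryBudget {
    private final long maxMemoryBytes;
    private final double fraction;

    BagMemoryBudget(long maxMemoryBytes, double fraction) {
        this.maxMemoryBytes = maxMemoryBytes;
        this.fraction = fraction;
    }

    // Builds a budget from the VM's reported maximum heap size.
    static BagMemoryBudget fromRuntime(double fraction) {
        return new BagMemoryBudget(Runtime.getRuntime().maxMemory(), fraction);
    }

    // Share of the budget that each of numBags bags may hold in memory.
    long perBagLimit(int numBags) {
        if (numBags <= 0) {
            throw new IllegalArgumentException("numBags must be positive");
        }
        return (long) (maxMemoryBytes * fraction) / numBags;
    }

    // A bag should spill once its tracked usage reaches its share.
    boolean shouldSpill(long bagBytesUsed, int numBags) {
        return bagBytesUsed >= perBagLimit(numBags);
    }
}
```

With a 1000-byte heap, the 10% default and 4 bags, each bag gets 25 bytes 
before it should spill. The limit is only as good as the tracked usage 
figure, which is an estimate in any scheme of this kind.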




[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-03-31 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852190#action_12852190
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

A few suggestions:

Reading from the test case, store statements currently look like:
{code}
 b = store a into 'dummy' using 
org.apache.pig.piggybank.storage.DBStorage('org.hsqldb.jdbcDriver','jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100','insert
 into a...');
{code}
Here 'dummy' is totally ignored. While this works, from a user-experience 
standpoint the following might be better:

{code}
 b = store a into 'jdbc:hsqldb:file:/tmp/batchtest' using 
org.apache.pig.piggybank.storage.DBStorage('org.hsqldb.jdbcDriver','hsqldb.default_table_type=cached;hsqldb.cache_rows=100','insert
 into a');
{code}
That is, use the DB URL as the store location and the second param of the store 
func for the DB params. You can use setStoreLocation() to store the URL. Apart 
from giving a more intuitive store statement, this will also allow you to check 
whether the DB is reachable at compile time itself, instead of at runtime. You 
can do that via checkOutputSpecs(). 

Doing DataType.findType() on every element of every tuple will be expensive. I 
am wondering if you can get hold of the schema in your store func and use that 
to map Pig types to SQL types.
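
A rough sketch of that idea, assuming the store func can see the schema once 
up front: the column types are resolved a single time and then reused for 
every tuple. FieldType and SchemaTypeMapper are hypothetical stand-ins (not 
Pig classes), and the particular Pig-to-SQL pairing chosen here is an 
assumption; only java.sql.Types is a real API.

```java
import java.sql.Types;
import java.util.List;

// Hypothetical sketch: resolve SQL types once per schema, not once per value.
final class SchemaTypeMapper {
    // Stand-in for the field types a store func would read from its schema.
    enum FieldType { INTEGER, LONG, FLOAT, DOUBLE, CHARARRAY, BYTEARRAY }

    // Assumed mapping from a field type to a java.sql.Types constant.
    static int toSqlType(FieldType type) {
        switch (type) {
            case INTEGER:   return Types.INTEGER;
            case LONG:      return Types.BIGINT;
            case FLOAT:     return Types.REAL;
            case DOUBLE:    return Types.DOUBLE;
            case CHARARRAY: return Types.VARCHAR;
            case BYTEARRAY: return Types.VARBINARY;
            default:
                throw new IllegalArgumentException("unmapped type: " + type);
        }
    }

    // Computed once when the schema is known; indexed by column position.
    static int[] sqlTypesFor(List<FieldType> schema) {
        int[] sqlTypes = new int[schema.size()];
        for (int i = 0; i < sqlTypes.length; i++) {
            sqlTypes[i] = toSqlType(schema.get(i));
        }
        return sqlTypes;
    }
}
```

The per-tuple path then only indexes into the precomputed array when binding 
values, instead of inspecting each element's runtime type.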

All of these suggestions can come in as later patches. So, if you want to get 
this committed and track these separately, I think that will also work, as this 
patch is functionally complete. 

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch
>
>
> UDF to store data into a DB




[jira] Commented: (PIG-1309) Map-side Cogroup

2010-03-30 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851661#action_12851661
 ] 

Ashutosh Chauhan commented on PIG-1309:
---

To build the index, we sample every split and get an index entry corresponding 
to that split. After sampling, all the index entries are sorted and then the 
index is written to disk. When I first wrote MergeJoin, I wasn't able to figure 
out how to use Hadoop sorting to sort the index. So, there is a comment in 
MRCompiler about that:
{noformat}
// Sorting of index can possibly be achieved by using Hadoop sorting 
// between map and reduce instead of Pig doing sort. If that is so, 
// it will simplify lot of the code below.
{noformat}
Now I have figured it out :) By default, if LocalRearranges produce keys of type 
tuple, Pig supplies a raw binary comparator (PigTupleWritableComparator) to 
Hadoop to compare tuples, which ignores the semantics of the tuple. We need to 
override that behavior so that Pig supplies the correct tuple comparator 
(PigTupleRawComparator). This information needs to be communicated from 
MRCompiler to JobControlCompiler, so I am doing that through the 
MapReduceOper object. 

Nice side effects of this:
a) The code in MRCompiler is indeed simplified now.
b) We got rid of the extra index sorting inside the reducer. 
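
A small, self-contained illustration of why a byte-wise comparison can 
disagree with a semantic one. It does not reproduce either Pig comparator 
named above; RawVsSemanticCompare and its methods are hypothetical, and the 
big-endian int encoding is an assumption picked for the demonstration.

```java
import java.nio.ByteBuffer;

// Hypothetical demo: byte-wise ordering vs. value ordering of serialized ints.
final class RawVsSemanticCompare {
    // Big-endian, 4-byte encoding of an int (ByteBuffer's default byte order).
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Lexicographic comparison of unsigned bytes, with no knowledge of types.
    static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Decodes both sides first, then compares the values themselves.
    static int compareSemantic(byte[] a, byte[] b) {
        return Integer.compare(ByteBuffer.wrap(a).getInt(), ByteBuffer.wrap(b).getInt());
    }
}
```

For -1 and 1, compareRaw puts -1 last (its first byte is 0xff) while 
compareSemantic puts it first, so an index sorted with the wrong comparator 
would not be in key order.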

> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: mapsideCogrp.patch, pig-1309_1.patch
>
>
> In the never-ending quest to make Pig go faster, we want to parallelize as many 
> relational operations as possible. It's already possible to do Group-by 
> (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to 
> add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-03-29 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Status: Patch Available  (was: Open)

> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: mapsideCogrp.patch, pig-1309_1.patch
>
>
> In the never-ending quest to make Pig go faster, we want to parallelize as many 
> relational operations as possible. It's already possible to do Group-by 
> (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to 
> add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-03-29 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Attachment: pig-1309_1.patch

Getting closer. Running through Hudson to find out if it breaks anything.

> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: mapsideCogrp.patch, pig-1309_1.patch
>
>
> In the never-ending quest to make Pig go faster, we want to parallelize as many 
> relational operations as possible. It's already possible to do Group-by 
> (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to 
> add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-03-29 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Attachment: (was: pig-1309.patch)

> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: mapsideCogrp.patch, pig-1309_1.patch
>
>
> In the never-ending quest to make Pig go faster, we want to parallelize as many 
> relational operations as possible. It's already possible to do Group-by 
> (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to 
> add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Commented: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader

2010-03-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849827#action_12849827
 ] 

Ashutosh Chauhan commented on PIG-1315:
---

Aah.. I thought you had put SortedTableSplitComparable in its own class file. 
That doesn't seem to be the case. I need to test this version to make sure it 
works.

> [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
> 
>
> Key: PIG-1315
> URL: https://issues.apache.org/jira/browse/PIG-1315
> Project: Pig
>  Issue Type: New Feature
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: zebra.0324, zebra.0324
>
>
> OrderedLoadFunc interface is used by Pig to do merge join and mapside 
> cogrouping. For Zebra, implementing this interface is necessary to support 
> mapside cogrouping.




[jira] Commented: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader

2010-03-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849814#action_12849814
 ] 

Ashutosh Chauhan commented on PIG-1315:
---

+1 it passes Pig side of things.

> [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
> 
>
> Key: PIG-1315
> URL: https://issues.apache.org/jira/browse/PIG-1315
> Project: Pig
>  Issue Type: New Feature
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: zebra.0324, zebra.0324
>
>
> OrderedLoadFunc interface is used by Pig to do merge join and mapside 
> cogrouping. For Zebra, implementing this interface is necessary to support 
> mapside cogrouping.




[jira] Commented: (PIG-1329) Pig version incorrect post-0.7 split

2010-03-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849777#action_12849777
 ] 

Ashutosh Chauhan commented on PIG-1329:
---

+1, I also hit this issue yesterday. Seems to be a typo introduced while branching for 0.7.

> Pig version incorrect post-0.7 split
> 
>
> Key: PIG-1329
> URL: https://issues.apache.org/jira/browse/PIG-1329
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
>Priority: Trivial
> Fix For: 0.8.0
>
> Attachments: PIG-1329.patch
>
>
> There's a typo in build.xml that makes the current pig version 0..0




[jira] Resolved: (PIG-1254) up the memory for junit to run tests

2010-03-25 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan resolved PIG-1254.
---

Resolution: Invalid

I upped the memory limits to 512M and 768M, but didn't see any improvements in 
runtime. Resolving as invalid.

> up the memory for junit to run tests
> 
>
> Key: PIG-1254
> URL: https://issues.apache.org/jira/browse/PIG-1254
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Reporter: Ashutosh Chauhan
>
> Currently junit is configured to run with only 256M of memory. This is too 
> low considering that most tests create a MiniCluster, run the Hadoop local 
> job runner, run tests etc., all within the same JVM. This results in transient 
> failures, longer test run times etc. This should be raised to at least 
> 512M.
> build.xml:
> {noformat}
>  fork="yes" maxmemory="256m" dir="${basedir}" timeout="${test.timeout}" 
> errorProperty="tests.failed" failureProperty="tests.failed">
> {noformat}




[jira] Commented: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader

2010-03-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849601#action_12849601
 ] 

Ashutosh Chauhan commented on PIG-1315:
---

While reading records in the reducer, Pig uses reflection to instantiate typed 
data objects. SortedTableSplitComparable, which is the writable comparable 
required by OrderedLoadFunc, is an inner class of SortedTableSplit. As a 
result, reflection fails and an exception is thrown. To make it work, 
SortedTableSplitComparable may need to move into its own class with public 
visibility. 
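
A minimal sketch of one way this kind of failure arises, assuming the problem 
is the missing no-argument constructor (the exact exception seen with Zebra is 
not stated here, and restricted visibility could equally be the cause). 
ReflectionDemo, Holder, InnerKey and NestedKey are hypothetical names.

```java
import java.lang.reflect.Constructor;

// Hypothetical demo of reflective instantiation of inner vs. nested classes.
final class ReflectionDemo {
    static class Holder {
        // Non-static inner class: its constructor implicitly takes a Holder,
        // so there is no true no-argument constructor to look up.
        class InnerKey { }

        // Static nested class: has an ordinary no-argument constructor.
        static class NestedKey {
            public NestedKey() { }
        }
    }

    // Tries what a framework typically does: look up and call the no-arg constructor.
    static boolean canInstantiate(Class<?> type) {
        try {
            Constructor<?> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            ctor.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }
}
```

Here canInstantiate returns false for InnerKey and true for NestedKey, which 
is why moving the comparable into a standalone public class (or a public 
static nested class) is the usual remedy.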

> [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
> 
>
> Key: PIG-1315
> URL: https://issues.apache.org/jira/browse/PIG-1315
> Project: Pig
>  Issue Type: New Feature
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: zebra.0324
>
>
> OrderedLoadFunc interface is used by Pig to do merge join and mapside 
> cogrouping. For Zebra, implementing this interface is necessary to support 
> mapside cogrouping.




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-03-24 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Attachment: pig-1309.patch

Did an offline review with Alan. Found a subtle bug in POMergeCogroup#getNext(). 
Fixed that and added more tests. Still need to tidy things up in a few places. 
Looking for suggestions for better test cases that cover all the edge cases. 

> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: mapsideCogrp.patch, pig-1309.patch
>
>
> In the never-ending quest to make Pig go faster, we want to parallelize as many 
> relational operations as possible. It's already possible to do Group-by 
> (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to 
> add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Commented: (PIG-1298) Restore file traversal behavior to Pig loaders

2010-03-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847630#action_12847630
 ] 

Ashutosh Chauhan commented on PIG-1298:
---

+1

> Restore file traversal behavior to Pig loaders
> --
>
> Key: PIG-1298
> URL: https://issues.apache.org/jira/browse/PIG-1298
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.7.0
>
> Attachments: PIG-1298.patch, PIG-1298_1.patch
>
>
> Given a location, a Pig loader is expected to recursively load all the 
> files under that location (i.e., all the files returned by the "ls -R" 
> command). However, after the transition to the Hadoop 20 API, only the files 
> returned by the "ls" command are loaded.
>  
>   
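
The difference between the two behaviors can be shown on a local directory 
tree. This sketch uses java.nio.file rather than the Hadoop FileSystem API, so 
it is only an analogy for what the loaders do; ListingDemo and its methods are 
hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical local-filesystem analogy for "ls" versus "ls -R".
final class ListingDemo {
    // Counts regular files directly under dir, like a plain "ls".
    static long countFlat(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts regular files anywhere under dir, like "ls -R".
    static long countRecursive(Path dir) {
        try (Stream<Path> entries = Files.walk(dir)) {
            return entries.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a temp tree with one file at the top and one in a subdirectory.
    static Path makeSampleTree() {
        try {
            Path root = Files.createTempDirectory("loader-demo");
            Files.createFile(root.resolve("part-0"));
            Path sub = Files.createDirectory(root.resolve("sub"));
            Files.createFile(sub.resolve("part-1"));
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On the sample tree the flat listing sees one file and the recursive walk sees 
two, which mirrors the files that went missing after the API transition.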




[jira] Created: (PIG-1309) Map-side Cogroup

2010-03-19 Thread Ashutosh Chauhan (JIRA)
Map-side Cogroup


 Key: PIG-1309
 URL: https://issues.apache.org/jira/browse/PIG-1309
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Attachments: mapsideCogrp.patch

In the never-ending quest to make Pig go faster, we want to parallelize as many 
relational operations as possible. It's already possible to do Group-by 
(PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to 
add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-03-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Attachment: mapsideCogrp.patch

Preliminary patch to discuss the approach. Not ready for inclusion yet.

> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: mapsideCogrp.patch
>
>
> In the never-ending quest to make Pig go faster, we want to parallelize as many 
> relational operations as possible. It's already possible to do Group-by 
> (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to 
> add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Updated: (PIG-1292) Interface Refinements

2010-03-16 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1292:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked in with the changes suggested in the previous comment. The core 
test failure reported by Hudson was transient. It passed on my machine.

> Interface Refinements
> -
>
> Key: PIG-1292
> URL: https://issues.apache.org/jira/browse/PIG-1292
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: pig-1292.patch, pig-interfaces.patch
>
>
> A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
> are abstract classes instead of being interfaces.




[jira] Updated: (PIG-1292) Interface Refinements

2010-03-12 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1292:
--

Status: Patch Available  (was: Open)

Hudson has been fickle recently. Hopefully, this patch gets lucky and is tested 
correctly.

> Interface Refinements
> -
>
> Key: PIG-1292
> URL: https://issues.apache.org/jira/browse/PIG-1292
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: pig-1292.patch, pig-interfaces.patch
>
>
> A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
> are abstract classes instead of being interfaces.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1292) Interface Refinements

2010-03-12 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1292:
--

Attachment: pig-1292.patch

I didn't follow the point about LoadMetaData and ResourceSchema. LoadMetaData is 
one of those interfaces which loaders can choose to implement. ResourceSchema is 
an independent class of its own.

New patch incorporating the changes suggested in the above comments. This patch 
also adds checks in MRCompiler to enforce that the loader implements the new 
CollectableLoader interface if there is a map-side grouping (PIG-984) in the 
script.

> Interface Refinements
> -
>
> Key: PIG-1292
> URL: https://issues.apache.org/jira/browse/PIG-1292
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: pig-1292.patch, pig-interfaces.patch
>
>
> A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
> are abstract classes instead of being interfaces.




[jira] Commented: (PIG-1292) Interface Refinements

2010-03-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844632#action_12844632
 ] 

Ashutosh Chauhan commented on PIG-1292:
---

One reason for not putting it in LoadFunc is to keep LoadFunc simple and not 
have such highly specific methods in there. We want to move such specialized 
capabilities out of LoadFunc into their own interfaces. This is also the 
reason PIG-966 decided to split LoadFunc into separate interfaces like 
LoadPushDown, LoadCaster, LoadMetaData etc. rather than put all of them in 
LoadFunc. This frees LoadFunc implementers from having to think about them if 
they don't want to. And if one wants such a specific capability in a loader, 
one has to think about it anyway, whether it's in LoadFunc or in its own 
interface. 

That said, I agree that a boolean return value for the method is confusing, so 
the method's return type should be void.

> Interface Refinements
> -
>
> Key: PIG-1292
> URL: https://issues.apache.org/jira/browse/PIG-1292
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: pig-interfaces.patch
>
>
> A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
> are abstract classes instead of being interfaces.




[jira] Commented: (PIG-1292) Interface Refinements

2010-03-11 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844301#action_12844301
 ] 

Ashutosh Chauhan commented on PIG-1292:
---

Thanks for the review, Xuefu.

1. That's a valid point. Where possible, we want loadfunc implementers to deal 
with Hadoop concepts and not with Pig concepts.

2. So, let's assume there is a loader which is capable of implementing this 
interface, but only if the underlying data is sorted (information which is 
available to the loader only at run time). This loader will implement the 
interface, indicating to Pig that it is capable of doing it. But just because 
it is capable of doing it doesn't necessarily imply it will do it (possibly for 
performance reasons). Then, when Pig calls the interface method, it is telling 
the loader that it wants the data in a particular fashion, no matter what. 
Inside this method, the loader may learn from some metadata (for example, by 
reading its schema or contacting a metadata repo) that the data is not sorted, 
and decide that it can't honor the contract because of information which is 
available to it only at run time. The loader may then return false from the 
method. Pig may then choose to rewrite the query and still carry on with the 
execution. Because of these scenarios, I think having a boolean return value is 
useful. What do you think?

3. Can't come up with a better name. Feel free to suggest one :)

> Interface Refinements
> -
>
> Key: PIG-1292
> URL: https://issues.apache.org/jira/browse/PIG-1292
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: pig-interfaces.patch
>
>
> A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
> are abstract classes instead of being interfaces.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1292) Interface Refinements

2010-03-11 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-1292:
-

Assignee: Ashutosh Chauhan

> Interface Refinements
> -
>
> Key: PIG-1292
> URL: https://issues.apache.org/jira/browse/PIG-1292
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: pig-interfaces.patch
>
>
> A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
> are abstract classes instead of being interfaces.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1292) Interface Refinements

2010-03-11 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1292:
--

Attachment: pig-interfaces.patch

A preview patch with suggested changes.

> Interface Refinements
> -
>
> Key: PIG-1292
> URL: https://issues.apache.org/jira/browse/PIG-1292
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: pig-interfaces.patch
>
>
> A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
> are abstract classes instead of being interfaces.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1292) Interface Refinements

2010-03-11 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844247#action_12844247
 ] 

Ashutosh Chauhan commented on PIG-1292:
---

Currently LoadFunc is an abstract class. OrderedLoadFunc is another abstract 
class which extends LoadFunc and adds a method that tells Pig in what order 
to read the splits. Similarly, IndexableLoadFunc also extends LoadFunc and 
adds the ability for the loader to seek to a position near specified keys. 
It's not hard to imagine a loader that can do both. Currently there can't be 
such a loader, since both of these are abstract classes. The proposal is to 
change them to interfaces. 

Further, a loader may also provide a guarantee that all instances of a key 
appear together in one split. Such a loader is required for map-side groups 
(PIG-984). Currently, it's assumed that the underlying loader provides data in 
the expected way. We should formalize this assumption by introducing a new 
interface and checking whether the loader implements it.
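
The proposal above can be sketched roughly as follows. This is an illustration 
in Python rather than Pig's Java, so the abstract mixin classes only stand in 
for Java interfaces. Every class and function name here (OrderedLoader, 
IndexableLoader, CollectableLoader, SortedIndexLoader, 
supports_map_side_group) is hypothetical and is not Pig's API.

```python
import bisect
from abc import ABC, abstractmethod


class LoadFunc(ABC):
    """Stand-in for the base loader contract."""

    @abstractmethod
    def get_next(self):
        """Return the next record, or None when the input is exhausted."""


class OrderedLoader(ABC):
    """Capability: the loader can tell the engine in what order to read splits."""

    @abstractmethod
    def split_order(self, splits):
        """Return the splits in the order they should be read."""


class IndexableLoader(ABC):
    """Capability: the loader can seek near a given key."""

    @abstractmethod
    def seek_near(self, key):
        """Position the loader at the first record >= key."""


class CollectableLoader(ABC):
    """Marker capability: all instances of a key appear in one split."""


class SortedIndexLoader(LoadFunc, OrderedLoader, IndexableLoader, CollectableLoader):
    """A loader that offers several capabilities at once."""

    def __init__(self, records):
        self._records = sorted(records)
        self._pos = 0

    def get_next(self):
        if self._pos >= len(self._records):
            return None
        record = self._records[self._pos]
        self._pos += 1
        return record

    def split_order(self, splits):
        return sorted(splits)

    def seek_near(self, key):
        self._pos = bisect.bisect_left(self._records, key)


def supports_map_side_group(loader):
    """The engine checks the capability instead of assuming it."""
    return isinstance(loader, CollectableLoader)
```

With capabilities split out this way, the engine can refuse a map-side group 
when the check returns False instead of silently trusting the loader.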


> Interface Refinements
> -
>
> Key: PIG-1292
> URL: https://issues.apache.org/jira/browse/PIG-1292
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
> Fix For: 0.7.0
>
>
> A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
> are abstract classes instead of being interfaces.




[jira] Created: (PIG-1292) Interface Refinements

2010-03-11 Thread Ashutosh Chauhan (JIRA)
Interface Refinements
-

 Key: PIG-1292
 URL: https://issues.apache.org/jira/browse/PIG-1292
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.7.0


A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
are abstract classes instead of being interfaces.




[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-03-04 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841391#action_12841391
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

Sure. By the way, I am not sure whether the hsqldb license 
http://hsqldb.org/web/hsqlLicense.html is compatible with Apache. 
Though, I think if we are pulling it through ivy, we will be fine. Am I 
correct?

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.7.0
>
> Attachments: hsqldb.jar, jira-1229.patch
>
>
> UDF to store data into a DB




[jira] Commented: (PIG-928) UDFs in scripting languages

2010-03-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841039#action_12841039
 ] 

Ashutosh Chauhan commented on PIG-928:
--

@Woody

I agree frameworks will not be performant. I think their usefulness depends on 
what we want to achieve. If we want to support many different languages, then 
they might prove useful. If we are only interested in supporting a language or 
two (Python and Ruby seem to be the most popular ones), then it won't make 
sense to pay the overhead associated with them.

> UDFs in scripting languages
> ---
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
> Attachments: package.zip, scripting.tgz, scripting.tgz
>
>
> It should be possible to write UDFs in scripting languages such as python, 
> ruby, etc.  This frees users from needing to compile Java, generate a jar, 
> etc.  It also opens Pig to programmers who prefer scripting languages over 
> Java.




[jira] Commented: (PIG-928) UDFs in scripting languages

2010-03-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841033#action_12841033
 ] 

Ashutosh Chauhan commented on PIG-928:
--

@Prasen

bq. can we not implement it along the lines of DEFINE commands. 
Yes, this functionality could be partially simulated using a DEFINE / Streaming 
combination, but that may not be the most efficient way to achieve it. First of 
all, the streaming script would run in a separate process (as opposed to the 
same JVM in the approaches discussed above), so there will be a CPU cost 
involved in moving data between the Java process and the stream script process. 
Then there is the cost of serializing and deserializing the parameters, and you 
lose all the type information of the parameters. Once you are in the same 
runtime you can start doing interesting things. Also, having scripts in define 
statements will get kludgy as soon as you start to do complicated things there.  

bq. no need to include scripting-specific jars (jython etc.)
Do you mean included in the Pig distribution or in Pig's classpath at runtime? 
In either case that may not necessarily be a problem. For the first part, we 
can use ivy to pull the jars for us instead of including them in the 
distribution, and for the second part we can ship all the jars required by Pig 
to the compute nodes.
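
The loss of type information mentioned above can be illustrated with a small 
sketch. It assumes, purely for illustration, that fields cross the process 
boundary as one tab-delimited text line; the helper names to_stream_line and 
from_stream_line are hypothetical and do not describe Pig's actual streaming 
serialization.

```python
def to_stream_line(fields):
    """Serialize fields as one tab-delimited text line for the pipe."""
    return "\t".join(str(field) for field in fields) + "\n"


def from_stream_line(line):
    """Parse a line read back from the pipe; every field is now a string."""
    return line.rstrip("\n").split("\t")


record = (42, 3.5, "pig")
round_tripped = from_stream_line(to_stream_line(record))
```

After the round trip every field is a str, so the receiving side has to 
re-parse or guess the original types. A UDF running in the same runtime would 
receive the typed values directly.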

> UDFs in scripting languages
> ---
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
> Attachments: package.zip, scripting.tgz, scripting.tgz
>
>
> It should be possible to write UDFs in scripting languages such as python, 
> ruby, etc.  This frees users from needing to compile Java, generate a jar, 
> etc.  It also opens Pig to programmers who prefer scripting languages over 
> Java.




[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-03-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841003#action_12841003
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

Ankur,

With the recent Load-Store interface changes, the patch doesn't compile. Can 
you regenerate it? And while you are at it, can you also change ivy.xml so 
that hsqldb.jar is pulled over the internet instead of being bundled 
with the Pig distribution?

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.7.0
>
> Attachments: hsqldb.jar, jira-1229.patch
>
>
> UDF to store data into a DB




[jira] Commented: (PIG-1265) Change LoadMetadata and StoreMetadata to use Job instead of Configuraiton and add a cleanupOnFailure method to StoreFuncInterface

2010-03-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839938#action_12839938
 ] 

Ashutosh Chauhan commented on PIG-1265:
---

Looked at the diff of the first and second patches. +1 for that part. 

One thing unrelated to the patch I want to highlight: these kinds of issues 
creep in because our API is too wide. There are multiple ways of getting the 
same thing done in Pig (executing a script through the Java API in this case), 
each exercising different code paths. So the first thing we need is to 
establish what is part of our API and what is meant for Pig's internal purposes 
only. Is every single public method part of the API? Or only those with 
properly documented Javadocs? Or only those documented on the wiki page?
Once we establish what constitutes our public API, we should 
systematically start decreasing the visibility of those methods which are 
public but not meant as API, then deprecate them and eventually remove them. 

> Change LoadMetadata and StoreMetadata to use Job instead of Configuraiton and 
> add a cleanupOnFailure method to StoreFuncInterface
> -
>
> Key: PIG-1265
> URL: https://issues.apache.org/jira/browse/PIG-1265
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Fix For: 0.7.0
>
> Attachments: PIG-1265-2.patch, PIG-1265.patch
>
>
> Speaking to the hadoop team folks, the direction in hadoop is to use Job 
> instead of Configuration - for example InputFormat/OutputFormat 
> implementations use Job to store input/output location. So pig should also do 
> the same in LoadMetadata and StoreMetadata to be closer to hadoop.
> Currently when a job fails, pig assumes the output locations (corresponding 
> to the stores in the job) are hdfs locations and attempts to delete them. 
> Since output locations could be non hdfs locations, this cleanup should be 
> delegated to the StoreFuncInterface implementation - hence a new method - 
> cleanupOnFailure() should be introduced in StoreFuncInterface and a default 
> implementation should be provided in the StoreFunc abstract class which 
> checks if the location exists on hdfs and deletes it if so.
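
A rough sketch of the cleanup delegation described above, written in Python 
rather than Pig's Java. The local filesystem stands in for HDFS, and the class 
names FileStore and DbStore, the snake_case method name, and the JDBC-style 
location string are all hypothetical.

```python
import os
import shutil
import tempfile
from abc import ABC, abstractmethod


class StoreFuncInterface(ABC):
    @abstractmethod
    def cleanup_on_failure(self, location):
        """Undo partial output at `location` after a failed job."""


class StoreFunc(StoreFuncInterface):
    """Base class with the default behaviour: delete the location if it exists."""

    def cleanup_on_failure(self, location):
        if not os.path.exists(location):
            return False
        if os.path.isdir(location):
            shutil.rmtree(location)
        else:
            os.remove(location)
        return True


class FileStore(StoreFunc):
    """File-based store: inherits the default cleanup unchanged."""


class DbStore(StoreFunc):
    """Non-file store: overrides cleanup, since there is no path to delete."""

    def __init__(self):
        self.rolled_back = []

    def cleanup_on_failure(self, location):
        # A real implementation might delete rows or roll back a transaction.
        self.rolled_back.append(location)
        return True


# Demo: a failed job left a partial output directory behind.
partial_output = tempfile.mkdtemp()
removed = FileStore().cleanup_on_failure(partial_output)
still_there = os.path.exists(partial_output)

db_store = DbStore()
db_store.cleanup_on_failure("jdbc:hypothetical://host/table")
```

The engine calls the same method on every store and no longer needs to assume 
that an output location is a file path.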




[jira] Commented: (PIG-1251) Move SortInfo calculation earlier in compilation

2010-02-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839000#action_12839000
 ] 

Ashutosh Chauhan commented on PIG-1251:
---

@Dmitriy

I agree. ResourceSchema encapsulates both SortInfo and PigSchema. So 
SortInfo should not really be exposed. It should be a private member of 
ResourceSchema, and where SortInfo is required it should be accessed via 
ResourceSchema. I thought about making all those changes in this patch, but 
they were more involved than I expected, so I backed off to keep the changes 
minimal for this patch. We can track that as a separate ticket. 

> Move SortInfo calculation earlier in compilation 
> -
>
> Key: PIG-1251
> URL: https://issues.apache.org/jira/browse/PIG-1251
> Project: Pig
>  Issue Type: Bug
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: pig-1251.patch, pig-1251_1.patch
>
>
> In LSR Pig does Input Output Validation by calling hadoop's checkSpecs(). A 
> storefunc might need schema to do such a validation. So, we should call 
> checkSchema() before doing the validation. checkSchema() in turn requires 
> SortInfo which is calculated later in compilation phase. We need to move it 
> earlier in compilation phase. 




[jira] Resolved: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

2010-02-26 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan resolved PIG-1216.
---

Resolution: Fixed

> New load store design does not allow Pig to validate inputs and outputs up 
> front
> 
>
> Key: PIG-1216
> URL: https://issues.apache.org/jira/browse/PIG-1216
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Alan Gates
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: pig-1216.patch, pig-1216_1.patch
>
>
> In Pig 0.6 and before, Pig attempts to verify existence of inputs and 
> non-existence of outputs during parsing to avoid run time failures when 
> inputs don't exist or outputs can't be overwritten.  The downside to this was 
> that Pig assumed all inputs and outputs were HDFS files, which made 
> implementation harder for non-HDFS based load and store functions.  In the 
> load store redesign (PIG-966) this was delegated to InputFormats and 
> OutputFormats to avoid this problem and to make use of the checks already 
> being done in those implementations.  Unfortunately, for Pig Latin scripts 
> that run more than one MR job, this does not work well.  MR does not do 
> input/output verification on all the jobs at once.  It does them one at a 
> time.  So if a Pig Latin script results in 10 MR jobs and the file to store 
> to at the end already exists, the first 9 jobs will be run before the 10th 
> job discovers that the whole thing was doomed from the beginning.  
> To avoid this a validate call needs to be added to the new LoadFunc and 
> StoreFunc interfaces.  Pig needs to pass this method enough information that 
> the load function implementer can delegate to InputFormat.getSplits() and the 
> store function implementer to OutputFormat.checkOutputSpecs() if s/he decides 
> to.  Since 90% of all load and store functions use HDFS and PigStorage will 
> also need to, the Pig team should implement a default file existence check on 
> HDFS and make it available as a static method to other Load/Store function 
> implementers.  
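
The up-front check described above can be sketched as follows. This is a 
simplified Python illustration, not Pig's implementation: jobs are reduced to 
(name, output_path) pairs, the `exists` callback stands in for a file 
existence check on HDFS, and the function names are hypothetical.

```python
import os


def validate_outputs(output_paths, exists=os.path.exists):
    """Return every output location that already exists."""
    return [path for path in output_paths if exists(path)]


def run_script(jobs, exists=os.path.exists):
    """Check all outputs before launching any job.

    `jobs` is a list of (name, output_path) pairs in execution order.
    """
    conflicts = validate_outputs([output for _, output in jobs], exists)
    if conflicts:
        raise FileExistsError("output already exists: " + ", ".join(conflicts))
    # Stand-in for submitting the jobs one at a time.
    return [name for name, _ in jobs]
```

With ten jobs where only the last output already exists, this fails before the 
first job starts instead of after nine jobs have run.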




[jira] Updated: (PIG-1251) Move SortInfo calculation earlier in compilation

2010-02-26 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1251:
--

   Resolution: Fixed
Fix Version/s: 0.7.0
   Status: Resolved  (was: Patch Available)

Patch committed. 

> Move SortInfo calculation earlier in compilation 
> -
>
> Key: PIG-1251
> URL: https://issues.apache.org/jira/browse/PIG-1251
> Project: Pig
>  Issue Type: Bug
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: pig-1251.patch, pig-1251_1.patch
>
>
> In LSR Pig does Input Output Validation by calling hadoop's checkSpecs(). A 
> storefunc might need schema to do such a validation. So, we should call 
> checkSchema() before doing the validation. checkSchema() in turn requires 
> SortInfo which is calculated later in compilation phase. We need to move it 
> earlier in compilation phase. 




[jira] Updated: (PIG-1251) Move SortInfo calculation earlier in compilation

2010-02-26 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1251:
--

Attachment: pig-1251_1.patch

Resynced with trunk and updated to address Daniel's comments. Will be 
committing it shortly.

> Move SortInfo calculation earlier in compilation 
> -
>
> Key: PIG-1251
> URL: https://issues.apache.org/jira/browse/PIG-1251
> Project: Pig
>  Issue Type: Bug
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-1251.patch, pig-1251_1.patch
>
>
> In LSR Pig does Input Output Validation by calling hadoop's checkSpecs(). A 
> storefunc might need schema to do such a validation. So, we should call 
> checkSchema() before doing the validation. checkSchema() in turn requires 
> SortInfo which is calculated later in compilation phase. We need to move it 
> earlier in compilation phase. 




[jira] Commented: (PIG-1260) Param Subsitution results in parser error if there is no EOL after last line in script

2010-02-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838551#action_12838551
 ] 

Ashutosh Chauhan commented on PIG-1260:
---

The workaround is to add an EOL after the last line in the script file.
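
One way to apply this workaround automatically is sketched below. The helper 
ensure_trailing_newline is a hypothetical name, not part of Pig, and the demo 
runs on a throwaway copy of the script's last line.

```python
import os
import tempfile


def ensure_trailing_newline(path):
    """Append a newline to the file if its last byte is not one.

    Returns True if the file was modified, False otherwise.
    """
    with open(path, "rb+") as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        if size == 0:
            return False  # empty file: nothing to terminate
        handle.seek(size - 1)
        if handle.read(1) == b"\n":
            return False
        handle.seek(0, os.SEEK_END)
        handle.write(b"\n")
        return True


# Demo on a temporary script whose last line has no EOL.
with tempfile.TemporaryDirectory() as tmp_dir:
    script = os.path.join(tmp_dir, "myscript.pig")
    with open(script, "wb") as handle:
        handle.write(b"store B into '$OUTPUT' USING PigStorage();")
    first_call = ensure_trailing_newline(script)
    second_call = ensure_trailing_newline(script)
    with open(script, "rb") as handle:
        content = handle.read()
```

The helper is idempotent, so it can be run on every script before invoking pig 
with -param without adding extra blank lines.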

> Param Subsitution results in parser error if there is no EOL after last line 
> in script
> --
>
> Key: PIG-1260
> URL: https://issues.apache.org/jira/browse/PIG-1260
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
> Fix For: 0.7.0
>
>
> {noformat}
> A = load '$INPUT' using PigStorage(':');
> B = foreach A generate $0 as id;
> store B into '$OUTPUT' USING PigStorage();
> {noformat}
> Invoking the above script, which has no EOL after its last line, as 
> follows:
> {noformat} 
> pig -param INPUT=mydata/input -param OUTPUT=mydata/output myscript.pig
> {noformat}
> results in parser error:
> {noformat}
> [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during 
> parsing. Lexical error at line 3, column 42.  Encountered:  after : ""
> {noformat}




[jira] Created: (PIG-1260) Param Subsitution results in parser error if there is no EOL after last line in script

2010-02-25 Thread Ashutosh Chauhan (JIRA)
Param Subsitution results in parser error if there is no EOL after last line in 
script
--

 Key: PIG-1260
 URL: https://issues.apache.org/jira/browse/PIG-1260
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.7.0


{noformat}
A = load '$INPUT' using PigStorage(':');
B = foreach A generate $0 as id;
store B into '$OUTPUT' USING PigStorage();
{noformat}

Invoking the above script, which has no EOL after its last line, as 
follows:

{noformat} 
pig -param INPUT=mydata/input -param OUTPUT=mydata/output myscript.pig
{noformat}

results in parser error:
{noformat}
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during 
parsing. Lexical error at line 3, column 42.  Encountered:  after : ""
{noformat}




[jira] Commented: (PIG-1251) Move SortInfo calculation earlier in compilation

2010-02-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837534#action_12837534
 ] 

Ashutosh Chauhan commented on PIG-1251:
---

Seems some problem with hudson. I ran the unit tests manually. They passed. 
Patch is ready for review.

> Move SortInfo calculation earlier in compilation 
> -
>
> Key: PIG-1251
> URL: https://issues.apache.org/jira/browse/PIG-1251
> Project: Pig
>  Issue Type: Bug
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-1251.patch
>
>
> In LSR Pig does Input Output Validation by calling hadoop's checkSpecs(). A 
> storefunc might need schema to do such a validation. So, we should call 
> checkSchema() before doing the validation. checkSchema() in turn requires 
> SortInfo which is calculated later in compilation phase. We need to move it 
> earlier in compilation phase. 




[jira] Created: (PIG-1254) up the memory for junit to run tests

2010-02-23 Thread Ashutosh Chauhan (JIRA)
up the memory for junit to run tests


 Key: PIG-1254
 URL: https://issues.apache.org/jira/browse/PIG-1254
 Project: Pig
  Issue Type: Bug
  Components: build
Reporter: Ashutosh Chauhan


Currently junit is configured to run with only 256M of memory. This is too low 
considering that most tests create a MiniCluster, run the Hadoop local job 
runner, run tests, etc., all within the same JVM. This results in transient 
failures, longer test run times, etc. This should be raised to at least 
512M.

build.xml:
{noformat}

{noformat}




[jira] Updated: (PIG-1251) Move SortInfo calculation earlier in compilation

2010-02-22 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1251:
--

Attachment: pig-1251.patch

Patch which moves SortInfo calculation from LogToPhyTranslation to 
SortInfoSetter. This facilitates calling checkSchema() before 
checkOutputSpecs() during store location validation. 
Result of test-patch
{noformat}
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] -1 javadoc.  The javadoc tool appears to have generated 1 
warning messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
 [exec] 
{noformat}

Javadoc warning is unrelated to the patch.

> Move SortInfo calculation earlier in compilation 
> -
>
> Key: PIG-1251
> URL: https://issues.apache.org/jira/browse/PIG-1251
> Project: Pig
>  Issue Type: Bug
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-1251.patch
>
>
> In LSR Pig does Input Output Validation by calling hadoop's checkSpecs(). A 
> storefunc might need schema to do such a validation. So, we should call 
> checkSchema() before doing the validation. checkSchema() in turn requires 
> SortInfo which is calculated later in compilation phase. We need to move it 
> earlier in compilation phase. 




[jira] Updated: (PIG-1251) Move SortInfo calculation earlier in compilation

2010-02-22 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1251:
--

Status: Patch Available  (was: Open)

> Move SortInfo calculation earlier in compilation 
> -
>
> Key: PIG-1251
> URL: https://issues.apache.org/jira/browse/PIG-1251
> Project: Pig
>  Issue Type: Bug
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-1251.patch
>
>
> In LSR Pig does Input Output Validation by calling hadoop's checkSpecs(). A 
> storefunc might need schema to do such a validation. So, we should call 
> checkSchema() before doing the validation. checkSchema() in turn requires 
> SortInfo which is calculated later in compilation phase. We need to move it 
> earlier in compilation phase. 




[jira] Created: (PIG-1251) Move SortInfo calculation earlier in compilation

2010-02-22 Thread Ashutosh Chauhan (JIRA)
Move SortInfo calculation earlier in compilation 
-

 Key: PIG-1251
 URL: https://issues.apache.org/jira/browse/PIG-1251
 Project: Pig
  Issue Type: Bug
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan


In LSR Pig does Input Output Validation by calling hadoop's checkSpecs(). A 
storefunc might need schema to do such a validation. So, we should call 
checkSchema() before doing the validation. checkSchema() in turn requires 
SortInfo which is calculated later in compilation phase. We need to move it 
earlier in compilation phase. 




[jira] Commented: (PIG-928) UDFs in scripting languages

2010-02-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836108#action_12836108
 ] 

Ashutosh Chauhan commented on PIG-928:
--

Hey Woody,

Great work!! This will definitely be useful for a lot of Pig users. I only 
looked at your work hastily. One thing that struck me is that you are doing 
a lot of heavy lifting to provide multi-language support: figuring out 
which language the user is asking for and then using reflection to load the 
appropriate interpreter and so on. I think it might be easier to use one of the 
frameworks here (BSF or javax.script), which hide this and allow handling of 
multiple languages transparently (at least, that's what they claim to do). Have 
you taken a look at them? These frameworks will arguably help us support 
more languages without maintaining a lot of code on our part. Though I am sure 
they will come at a performance cost (certainly CPU and possibly memory too). 

> UDFs in scripting languages
> ---
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
> Attachments: package.zip, scripting.tgz, scripting.tgz
>
>
> It should be possible to write UDFs in scripting languages such as python, 
> ruby, etc.  This frees users from needing to compile Java, generate a jar, 
> etc.  It also opens Pig to programmers who prefer scripting languages over 
> Java.




[jira] Updated: (PIG-1215) Make Hadoop jobId more prominent in the client log

2010-02-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1215:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked-in.

> Make Hadoop jobId more prominent in the client log
> --
>
> Key: PIG-1215
> URL: https://issues.apache.org/jira/browse/PIG-1215
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: pig-1215.patch, pig-1215.patch, pig-1215_1.patch, 
> pig-1215_3.patch, pig-1215_4.patch
>
>
> This is a request from applications that want to be able to programmatically 
> parse client logs to find hadoop Ids.
> They would like to see each job id on a separate line in the following format:
> hadoopJobId: job_123456789
> They would also like to see the jobs in the order they are executed.
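
A sketch of how a client application might consume log lines in the requested 
format. It assumes only the "hadoopJobId: job_..." shape quoted above. Any 
prefix a logger adds to the line is tolerated, and the function name and sample 
log text are made up for illustration.

```python
import re

# Matches the requested "hadoopJobId: job_<id>" shape anywhere in a line.
JOB_ID_PATTERN = re.compile(r"hadoopJobId:\s*(job_\w+)")


def extract_job_ids(log_text):
    """Return the job ids found in the log, in the order they appear."""
    job_ids = []
    for line in log_text.splitlines():
        match = JOB_ID_PATTERN.search(line)
        if match:
            job_ids.append(match.group(1))
    return job_ids
```

Because each id sits on its own line and the lines are in execution order, the 
list returned here also reflects the order in which the jobs ran.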



