[jira] Created: (PIG-1424) Error logs of streaming should not be placed in output location
Error logs of streaming should not be placed in output location
---------------------------------------------------------------

                 Key: PIG-1424
                 URL: https://issues.apache.org/jira/browse/PIG-1424
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.7.0
            Reporter: Ashutosh Chauhan
             Fix For: 0.8.0

This becomes a problem when the output location is anything other than a filesystem. Output will be written to the DB, but where should the logs generated by streaming go? Clearly, they can't be written into the DB. This blocks PIG-1229, which introduces writing to a DB from Pig.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1415) LoadFunc signature is not correct in LoadFunc.getSchema sometimes
[ https://issues.apache.org/jira/browse/PIG-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867633#action_12867633 ]

Ashutosh Chauhan commented on PIG-1415:
---------------------------------------

+1. Please commit if all unit tests pass.

> LoadFunc signature is not correct in LoadFunc.getSchema sometimes
> -----------------------------------------------------------------
>
>                 Key: PIG-1415
>                 URL: https://issues.apache.org/jira/browse/PIG-1415
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: 0.8.0
>
>         Attachments: PIG-1415-1.patch
>
>
> The following script does not set the signature correctly when we call LoadFunc.getSchema:
> a = load 'xxx' using TableLoader('xxx') as (a, b, c);
> However, if we don't give a schema to a, we get the right signature:
> a = load 'xxx' using TableLoader('xxx');
> Diagnosis:
> The parser generates the LoadClause before it reaches the production "Alias = LoadClause", which is what actually sets the signature on the LOLoad. When we give a schema, the parser calls LOLoad.setSchema(), which internally calls LoadFunc.determineSchema. At that time, the signature has not been set yet.
> This relates to the change that caches determinedSchema in LOLoad [PIG-1317|https://issues.apache.org/jira/browse/PIG-1317]. Before that change, we would later call LoadFunc.getSchema() again using the right signature. Now that we cache determinedSchema, LoadFunc doesn't get a chance to see the right signature inside LoadFunc.getSchema().
> Solution:
> We shall not call LoadFunc.determineSchema inside LOLoad.setSchema().
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1229:
----------------------------------

    Attachment: pig-1229.patch

Ankur,

Sorry for getting back late on this. I fiddled with your latest patch and was able to make some progress on it. I was able to get rid of those Path problems (it looks like Pig itself is not dealing with them correctly in one place). I think the patch I attached should work, but I cannot get the test case to pass because of an hsqldb problem that I have not been able to resolve. I keep getting this error from it:
{noformat}
Caused by: java.sql.SQLException: The database is already in use by another process: org.hsqldb.persist.niolockf...@4abea04e[file =/private/tmp/batchtest.lck, exists=true, locked=false, valid=false, fl =null]: java.lang.Exception: checkHeartbeat(): lock file [/private/tmp/batchtest.lck] is presumably locked by another process.
        at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
        at org.hsqldb.jdbc.jdbcConnection.<init>(Unknown Source)
        at org.hsqldb.jdbcDriver.getConnection(Unknown Source)
        at org.hsqldb.jdbcDriver.connect(Unknown Source)
        at java.sql.DriverManager.getConnection(DriverManager.java:582)
        at java.sql.DriverManager.getConnection(DriverManager.java:185)
        at org.apache.pig.piggybank.storage.DBStorage.prepareToWrite(DBStorage.java:274)
{noformat}
Anyway, here are the changes I made:

1.
{code}
Index: src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
===================================================================
-conf.set("pig.streaming.log.dir",
-new Path(outputPath, LOG_DIR).toString());
+//conf.set("pig.streaming.log.dir",
+//new Path(outputPath, LOG_DIR).toString());
 conf.set("pig.streaming.task.output.dir", outputPath);
 }
{code}
This looks like a problem in Pig. Here Pig incorrectly assumes that it can put the logs generated during a stream command in the output location, which is wrong if the output location is something like a DB. Since this needs changes in the main Pig code, I suggest opening a new jira for it and tracking it there.

2. Then in DBStorage.java
{code}
@Override
public void setStoreLocation(String location, Job job) throws IOException {
    // Front end: remember the connection string in the job configuration.
    job.getConfiguration().set("pig.db.conn.string", location);
}

@Override
public RecordWriter getRecordWriter(TaskAttemptContext context)
        throws IOException, InterruptedException {
    // Back end: read the connection string back from the configuration.
    jdbcURL = context.getConfiguration().get("pig.db.conn.string");
    return null;
}
{code}
We need to save the db connection string in the job in setStoreLocation() and then retrieve it in the backend in getRecordWriter().

3. In DBStorage.java
{code}
@Override
public void cleanupOnFailure(String location, Job job) throws IOException {
    log.error("Job has failed.");
}
{code}
You must override this function of StoreFunc, as the default implementation assumes a FileSystem as the output location. Currently I have left it as a no-op, but it can be improved to do rollbacks, release db connections, etc.

> allow pig to write output into a JDBC db
> ----------------------------------------
>
>                 Key: PIG-1229
>                 URL: https://issues.apache.org/jira/browse/PIG-1229
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Ian Holsman
>            Assignee: Ankur
>            Priority: Minor
>             Fix For: 0.8.0
>
>         Attachments: jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.patch
>
>
> UDF to store data into a DB
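The hand-off described in change 2 of the PIG-1229 comment above (store a value in the job configuration on the front end, read it back on the back end) can be sketched without Hadoop on the classpath. This is only an illustration: `FakeConf` is a hypothetical stand-in for the Hadoop job configuration, and `ConnStringHandoffSketch` and `jdbcUrlForTask` are made-up names, not part of DBStorage. Only the key "pig.db.conn.string" comes from the comment itself.

```java
import java.util.HashMap;
import java.util.Map;

class ConnStringHandoffSketch {
    /** Hypothetical stand-in for the job configuration: a string map shipped with the job. */
    static final class FakeConf {
        private final Map<String, String> values = new HashMap<>();
        void set(String key, String value) { values.put(key, value); }
        String get(String key) { return values.get(key); }
    }

    /** Configuration key used in the snippet from the comment. */
    static final String CONN_KEY = "pig.db.conn.string";

    /** Front end: remember the store location (the JDBC URL) in the configuration. */
    static void setStoreLocation(String location, FakeConf conf) {
        conf.set(CONN_KEY, location);
    }

    /** Back end: recover the JDBC URL, failing loudly if the front end never set it. */
    static String jdbcUrlForTask(FakeConf conf) {
        String url = conf.get(CONN_KEY);
        if (url == null) {
            throw new IllegalStateException(CONN_KEY + " was not set on the front end");
        }
        return url;
    }

    public static void main(String[] args) {
        FakeConf conf = new FakeConf();
        setStoreLocation("jdbc:hsqldb:file:/tmp/example", conf); // example URL, an assumption
        System.out.println(jdbcUrlForTask(conf));
    }
}
```

Failing fast when the key is missing is a design choice of this sketch; the snippet in the comment simply assigns whatever the configuration returns.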
[jira] Commented: (PIG-1381) Need a way for Pig to take an alternative property file
[ https://issues.apache.org/jira/browse/PIG-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867220#action_12867220 ]

Ashutosh Chauhan commented on PIG-1381:
---------------------------------------

+1 on the changes. For completeness, we can also check in an empty pig.properties in the conf dir and then add comments in both pig.properties and pig-default.properties explaining that passing properties through pig-default.properties will have no effect; users should instead put any properties they want to add or override in pig.properties.

> Need a way for Pig to take an alternative property file
> -------------------------------------------------------
>
>                 Key: PIG-1381
>                 URL: https://issues.apache.org/jira/browse/PIG-1381
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: V.V.Chaitanya Krishna
>             Fix For: 0.7.0, 0.8.0
>
>         Attachments: PIG-1381-1.patch, PIG-1381-2.patch, PIG-1381-3.patch, PIG-1381-4.patch
>
>
> Currently, Pig reads the first pig.properties it finds in the CLASSPATH. Pig has a default pig.properties, and if the user has a different pig.properties, there will be a conflict since we can only read one. There are a couple of ways to solve it:
> 1. Give a command line option for the user to pass an additional property file
> 2. Change the name of the default pig.properties to pig-default.properties, and the user can give a pig.properties to override it
> 3. Further, can we consider using pig-default.xml/pig-site.xml, which seems more natural for the hadoop community? If so, we shall provide backward compatibility by also reading pig.properties and pig-cluster-hadoop-site.xml.
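The default-plus-override behaviour discussed in PIG-1381 above (a pig-default.properties whose entries a user's pig.properties can add to or override) can be sketched with plain java.util.Properties. This is a minimal sketch under assumptions, not the code from the attached patches: the class name, the method name, and the example keys are hypothetical, and the file contents are passed as strings so the example stays self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class LayeredPropertiesSketch {
    /**
     * Loads the default properties first, then layers the user's properties on top.
     * Lookups through getProperty() fall back to the defaults for keys the user did not set.
     */
    static Properties load(String defaultsText, String userText) {
        try {
            Properties defaults = new Properties();
            defaults.load(new StringReader(defaultsText));

            // Passing `defaults` to the constructor makes it the fallback table.
            Properties merged = new Properties(defaults);
            merged.load(new StringReader(userText));
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical keys, for illustration only.
        Properties p = load("example.level=INFO\nexample.dir=/tmp\n", "example.level=DEBUG\n");
        System.out.println(p.getProperty("example.level")); // user value wins
        System.out.println(p.getProperty("example.dir"));   // falls back to the default
    }
}
```

With this layering, editing the defaults file only changes values the user has not overridden, which is the reason to tell users to put their changes in pig.properties.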
[jira] Commented: (PIG-1390) Provide a target to generate eclipse-related classpath and files
[ https://issues.apache.org/jira/browse/PIG-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862951#action_12862951 ]

Ashutosh Chauhan commented on PIG-1390:
---------------------------------------

I gave it a go and did as mentioned in the previous comment:
{noformat}
These are the steps that could be followed and imported to eclipse in a faster way :
1. checkout the trunk code.
2. run "ant eclipse-files".
3. open eclipse and "import" the existing project.
{noformat}
Although pig itself compiled fine and is ready to go, the contrib projects (owl, zebra, piggybank/hiverc) didn't compile, I think because it either didn't download the dependencies of those projects or didn't include them in the build path. So an unfriendly red cross appears next to the project. If I remove them from the build path, things are good. Did I do something wrong, or is this expected?

> Provide a target to generate eclipse-related classpath and files
> ----------------------------------------------------------------
>
>                 Key: PIG-1390
>                 URL: https://issues.apache.org/jira/browse/PIG-1390
>             Project: Pig
>          Issue Type: Improvement
>          Components: build
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: V.V.Chaitanya Krishna
>            Assignee: V.V.Chaitanya Krishna
>             Fix For: 0.8.0
>
>         Attachments: PIG-1390-2.patch, PIG-1390-3.patch, PIG-eclipse_support.patch
>
>
> Currently, after checking out from the svn repository, there is no provision to auto-generate the eclipse-related classpath and files, which would help in importing into eclipse directly.
[jira] Updated: (PIG-1395) Mapside cogroup runs out of memory
[ https://issues.apache.org/jira/browse/PIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1395:
----------------------------------

        Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Patch checked in with updated comment.

> Mapside cogroup runs out of memory
> ----------------------------------
>
>                 Key: PIG-1395
>                 URL: https://issues.apache.org/jira/browse/PIG-1395
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Ashutosh Chauhan
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.8.0
>
>         Attachments: cogrp_mem.patch
>
>
> In a particular scenario, when there aren't a lot of tuples with the same key in a relation (i.e. there aren't many repeating keys), map tasks doing cogroup fail with a GC overhead exception.
[jira] Commented: (PIG-1381) Need a way for Pig to take an alternative property file
[ https://issues.apache.org/jira/browse/PIG-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861186#action_12861186 ]

Ashutosh Chauhan commented on PIG-1381:
---------------------------------------

Do we need to have two different property files? One possibility is to not package pig.properties in pig.jar and instead include it in the classpath when invoking Pig. (We can modify the pig shell script to include it in the path by default.) Then the user can add, delete, or modify entries in pig.properties as he wishes, as well as override the default properties. The disadvantage of two property files is that it is sometimes confusing which property is getting picked up (the one in the default file or the one the user specified). If there is only one property file, there is only one way to specify properties to Pig, which I think is the better way of doing it.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861177#action_12861177 ]

Ashutosh Chauhan commented on PIG-1229:
---------------------------------------

Ankur,

The stack trace above is out of sync with trunk. Can you upload the patch with the alternative approach that you are trying? I think it might be possible to get this working.
[jira] Updated: (PIG-1395) Mapside cogroup runs out of memory
[ https://issues.apache.org/jira/browse/PIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1395:
----------------------------------

    Status: Patch Available  (was: Open)
[jira] Updated: (PIG-1395) Mapside cogroup runs out of memory
[ https://issues.apache.org/jira/browse/PIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1395:
----------------------------------

    Attachment: cogrp_mem.patch

While doing cogroup, we first put tuples from all the relations in a heap, then we drain the heap and generate the output tuple as appropriate. We need to look ahead at least one tuple from all the relations before generating an output tuple, to be sure that we have all the tuples belonging to a key. Currently we look too far ahead, and tuples start to accumulate in the heap faster than we drain them. At a certain point we already have enough information to generate the output tuple, instead of waiting and putting another tuple in the heap. This patch generates the output tuple at that point.
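The idea in the PIG-1395 comment above (look ahead exactly one tuple per relation, and emit a group as soon as the heap shows the key has changed) can be illustrated with a small k-way merge over key-sorted inputs. This is a sketch under assumptions, not the code in cogrp_mem.patch: the class and type names are hypothetical, tuples are reduced to an integer key plus one string value, and the inputs are assumed to be sorted by key already.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class CogroupSketch {
    /** A tuple reduced to its grouping key plus one payload field. */
    static final class Tuple {
        final int key;
        final String value;
        Tuple(int key, String value) { this.key = key; this.value = value; }
    }

    /** Heap entry: the single look-ahead tuple of one relation. */
    static final class Head {
        final Tuple tuple;
        final int relation;
        Head(Tuple tuple, int relation) { this.tuple = tuple; this.relation = relation; }
    }

    /** Cogroups key-sorted relations; the heap never holds more than one tuple per relation. */
    static List<String> cogroup(List<List<Tuple>> sortedRelations) {
        int n = sortedRelations.size();
        List<Iterator<Tuple>> readers = new ArrayList<>();
        PriorityQueue<Head> heap =
                new PriorityQueue<>(Comparator.comparingInt((Head h) -> h.tuple.key));

        // Prime the heap with the first tuple of every non-empty relation.
        for (int r = 0; r < n; r++) {
            Iterator<Tuple> it = sortedRelations.get(r).iterator();
            readers.add(it);
            if (it.hasNext()) heap.add(new Head(it.next(), r));
        }

        List<String> output = new ArrayList<>();
        while (!heap.isEmpty()) {
            int key = heap.peek().tuple.key;
            List<List<String>> bags = new ArrayList<>();
            for (int r = 0; r < n; r++) bags.add(new ArrayList<>());

            // Drain tuples with the current key, refilling from the relation just read.
            while (!heap.isEmpty() && heap.peek().tuple.key == key) {
                Head h = heap.poll();
                bags.get(h.relation).add(h.tuple.value);
                Iterator<Tuple> it = readers.get(h.relation);
                if (it.hasNext()) heap.add(new Head(it.next(), h.relation));
            }
            // The smallest remaining key is larger, so this group is complete: emit it now.
            output.add(key + "=" + bags);
        }
        return output;
    }
}
```

Because each relation contributes at most one entry to the heap, memory held by the heap is bounded by the number of relations; only the bags for the key currently being assembled grow with the data.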
[jira] Created: (PIG-1395) Mapside cogroup runs out of memory
Mapside cogroup runs out of memory
----------------------------------

                 Key: PIG-1395
                 URL: https://issues.apache.org/jira/browse/PIG-1395
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: 0.8.0
            Reporter: Ashutosh Chauhan
            Assignee: Ashutosh Chauhan
             Fix For: 0.8.0

In a particular scenario, when there aren't a lot of tuples with the same key in a relation (i.e. there aren't many repeating keys), map tasks doing cogroup fail with a GC overhead exception.
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861122#action_12861122 ]

Ashutosh Chauhan commented on PIG-798:
--------------------------------------

1.
{noformat}
b = foreach a generate (chararray) $0 as name;
{noformat}
2.
{noformat}
B = foreach A generate $0 as name:chararray;
{noformat}

@Viraj,
I discussed this with Alan and Daniel. The language semantics for achieving this functionality, whatever the loader, is 1. The fact that 2 works for BinStorage is unfortunate and is a bug. It is currently there for backward compatibility and will eventually be removed.

> Schema errors when using PigStorage and none when using BinStorage in FOREACH??
> -------------------------------------------------------------------------------
>
>                 Key: PIG-798
>                 URL: https://issues.apache.org/jira/browse/PIG-798
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0
>            Reporter: Viraj Bhat
>         Attachments: binstoragecreateop, schemaerr.pig, visits.txt
>
>
> In the following script I have a tab-separated text file, which I load using PigStorage() and store using BinStorage():
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray);
> B = group A by name;
> store B into '/user/viraj/binstoragecreateop' using BinStorage();
> dump B;
> {code}
> I later load the file 'binstoragecreateop' in the following way:
> {code}
> A = load '/user/viraj/binstoragecreateop' using BinStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> Result
> ===
> (Amy)
> (Fred)
> ===
> The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error:
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> ===
> {code}
> 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field Schema: name: chararray
> Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
> {code}
> ===
> So why should the semantics of BinStorage() be different from PigStorage(), where it is ok not to specify a schema? Should it not be consistent across both?
[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error
[ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860614#action_12860614 ]

Ashutosh Chauhan commented on PIG-1211:
---------------------------------------

Oh, I got confused. From your earlier comment, I understood you to be saying that we should add a -checkscript command line option. From your previous comment, are you suggesting that we should add a syntax checker that always runs (i.e., without needing any command line directive) before the query starts to execute, thereby catching as many user errors as possible? I think this is a reasonable ask and will be useful to users.

This might be the first step towards making the distinction between Pig compile time and run time explicit to the user. If we go the full length here, we might as well do what Milind suggested earlier (and in a recent mail thread). We can add a "compilation" phase which first runs a syntax checker, then generates "object code" (essentially a job jar) from the pig script. This compiled object can then be handed over to the run-time (hadoop cluster).

Wow, pig-latin is evolving towards a "true language" :)

> Pig script runs half way after which it reports syntax error
> ------------------------------------------------------------
>
>                 Key: PIG-1211
>                 URL: https://issues.apache.org/jira/browse/PIG-1211
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>             Fix For: 0.8.0
>
>
> I have a Pig script which is structured in the following way:
> {code}
> register cp.jar
> dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, col3, col4, col5);
> filtered_dataset = filter dataset by (col1 == 1);
> proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
> rmf $output1;
> store proj_filtered_dataset into '$output1' using PigStorage();
> second_stream = foreach filtered_dataset generate col2, col4, col5;
> group_second_stream = group second_stream by col4;
> output2 = foreach group_second_stream {
>  a = second_stream.col2
>  b = distinct second_stream.col5;
>  c = order b by $0;
>  generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
> }
> rmf $output2;
> --syntax error here
> store output2 to '$output2' using PigStorage();
> {code}
> I run this script using the Multi-query option. It runs successfully until the first store but later fails with a syntax error.
> The usage of the HDFS option "rmf" causes the first store to execute.
> The only option that I have is to run an explain before running this script:
> grunt> explain -script myscript.pig -out explain.out
> or to move the rmf statements to the top of the script.
> Here are some questions:
> a) Can we have an option to do something like "checkscript" instead of explain to get the same syntax error? In this way I can ensure that I do not run for 3-4 hours before encountering a syntax error.
> b) Can pig not figure out a way to re-order the rmf statements, since all the store directories are variables?
> Thanks
> Viraj
[jira] Commented: (PIG-1339) International characters in column names not supported
[ https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860606#action_12860606 ]

Ashutosh Chauhan commented on PIG-1339:
---------------------------------------

This works fine in grunt:
{code}
grunt> a = load '1-3.txt' using PigStorage() as (あいうえお);
grunt> dump a;
{code}
gives the expected result. The problem is that feeding it to Pig as a script
{code}
bin/pig myscript.pig
{code}
gives the exception you showed above. This looks like a bug in PigScriptParser.jj, which should read the stream from the script file as UTF-8.

> International characters in column names not supported
> ------------------------------------------------------
>
>                 Key: PIG-1339
>                 URL: https://issues.apache.org/jira/browse/PIG-1339
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0, 0.7.0, 0.8.0
>            Reporter: Viraj Bhat
>
> There is a particular use case in which someone specifies a column name in international characters.
> {code}
> inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
> describe inputdata;
> dump inputdata;
> {code}
> ==
> Pig Stack Trace
> ---------------
> ERROR 1000: Error during parsing. Lexical error at line 1, column 64. Encountered: "\u3042" (12354), after : ""
> org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, column 64. Encountered: "\u3042" (12354), after : ""
>         at org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
>         at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
>         at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
>         at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
>         at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
>         at org.apache.pig.Main.main(Main.java:391)
> ==
> Thanks
> Viraj
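The fix suggested in the PIG-1339 comment above, reading the script stream as UTF-8 rather than in the platform default charset, can be sketched in plain Java. This only illustrates the decoding step; it is not the change to PigScriptParser.jj, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8ScriptReaderSketch {
    /** Reads the whole stream, decoding its bytes explicitly as UTF-8. */
    static String readScript(InputStream in) {
        StringBuilder text = new StringBuilder();
        // Naming the charset avoids depending on the JVM's default encoding.
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The column name from the issue, written with Unicode escapes.
        String script = "a = load 'x' as (\u3042\u3044\u3046\u3048\u304a);";
        byte[] raw = script.getBytes(StandardCharsets.UTF_8);
        System.out.println(readScript(new ByteArrayInputStream(raw)).equals(script));
    }
}
```

A reader built without an explicit charset decodes with the platform default, so the same script file can parse on one machine and fail on another.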
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860598#action_12860598 ]

Ashutosh Chauhan commented on PIG-798:
--------------------------------------

You can specify a schema in FOREACH GENERATE with the PigStorage loader as follows:
{code}
grunt> a = load 'data' using PigStorage();
grunt> b = foreach a generate (chararray) $0 as name;
grunt> describe b;
b: {name: chararray}
grunt> dump b;
{code}
I get the expected result.
[jira] Assigned: (PIG-1390) Provide a target to generate eclipse-related classpath and files
[ https://issues.apache.org/jira/browse/PIG-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan reassigned PIG-1390:
------------------------------------

    Assignee: V.V.Chaitanya Krishna
[jira] Updated: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script
[ https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1345:
----------------------------------

        Parent: PIG-908
    Issue Type: Sub-task  (was: Improvement)

> Link casting errors in POCast to actual lines numbers in Pig script
> -------------------------------------------------------------------
>
>                 Key: PIG-1345
>                 URL: https://issues.apache.org/jira/browse/PIG-1345
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>
> For the purpose of easy debugging, it would be nice to find out where in the pig script my warnings are coming from.
> The only known process is to comment out lines in the Pig script and see if these warnings go away.
> 2010-01-13 21:34:13,697 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22
> 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
> 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26
> I think this may need us to keep track of the line numbers of the Pig script (via our javacc parser) and maintain them in the logical and physical plans.
> It would help users in debugging simple errors/warnings related to casting.
> Is this enhancement listed in http://wiki.apache.org/pig/PigJournal?
> Do we need to change the parser to something other than javacc to make this task simpler?
> "Standardize on Parser and Scanner Technology"
> Viraj
[jira] Commented: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script
[ https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859471#action_12859471 ]

Ashutosh Chauhan commented on PIG-1345:
---------------------------------------

This will involve passing line numbers (and possibly more metadata) from the parser to the logical layer, then to the physical layer, then to the backend, and back again in case of exceptions. This has been discussed before in some detail in PIG-908. Linking it against that.
[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error
[ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859462#action_12859462 ] Ashutosh Chauhan commented on PIG-1211: --- bq. Can we have an option to do something like "checkscript" instead of explain to get the same syntax error? In this way I can ensure that I do not run for 3-4 hours before encountering a syntax error It is possible to add something like checkscript, but it would only be syntactic sugar: it would do exactly what explain does, minus printing the plan at the end. So I am wondering whether we should simply tell users to run explain to catch syntax errors instead of adding a new command-line option. What do others think? > Pig script runs half way after which it reports syntax error > > > Key: PIG-1211 > URL: https://issues.apache.org/jira/browse/PIG-1211 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat > Fix For: 0.8.0 > > > I have a Pig script which is structured in the following way > {code} > register cp.jar > dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, > col3, col4, col5); > filtered_dataset = filter dataset by (col1 == 1); > proj_filtered_dataset = foreach filtered_dataset generate col2, col3; > rmf $output1; > store proj_filtered_dataset into '$output1' using PigStorage(); > second_stream = foreach filtered_dataset generate col2, col4, col5; > group_second_stream = group second_stream by col4; > output2 = foreach group_second_stream { > a = second_stream.col2 > b = distinct second_stream.col5; > c = order b by $0; > generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc; > } > rmf $output2; > --syntax error here > store output2 to '$output2' using PigStorage(); > {code} > I run this script using the Multi-query option, it runs successfully till the > first store but later fails with a syntax error.
> The usage of HDFS option, "rmf" causes the first store to execute. > The only option the I have is to run an explain before running his script > grunt> explain -script myscript.pig -out explain.out > or moving the rmf statements to the top of the script > Here are some questions: > a) Can we have an option to do something like "checkscript" instead of > explain to get the same syntax error? In this way I can ensure that I do not > run for 3-4 hours before encountering a syntax error > b) Can pig not figure out a way to re-order the rmf statements since all the > store directories are variables > Thanks > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
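The second workaround mentioned in the issue, moving the rmf statements to the top of the script, could be automated with a small pre-processing step. The sketch below is a hypothetical helper in Python, not something Pig provides; it assumes each rmf statement sits on its own line and that nothing in the script reads from the paths being removed.

```python
def hoist_rmf(script: str) -> str:
    """Return the script with all 'rmf' lines moved to the top.

    Assumption: every rmf statement occupies its own line, and no
    statement loads from a path that an rmf removes.
    """
    lines = script.splitlines()

    def is_rmf(line: str) -> bool:
        return line.strip().lower().startswith("rmf ")

    rmf_lines = [ln for ln in lines if is_rmf(ln)]
    other_lines = [ln for ln in lines if not is_rmf(ln)]
    # Relative order is preserved within both groups.
    return "\n".join(rmf_lines + other_lines)
```

With the rmf statements out of the way, nothing forces the first store to execute before the rest of the script has been parsed, so a later syntax error should surface before any job is launched.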
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859159#action_12859159 ] Ashutosh Chauhan commented on PIG-798: -- Viraj, I am confused with this description. It seems to me that you are first storing some data using BinStorage and then loading it using PigStorage. If that is so, obviously it will not work. PigStorage and BinStorage aren't interoperable in this way. Specifically, data stored using BinStorage, can only be loaded using BinStorage. > Schema errors when using PigStorage and none when using BinStorage in > FOREACH?? > --- > > Key: PIG-798 > URL: https://issues.apache.org/jira/browse/PIG-798 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.2.0 >Reporter: Viraj Bhat > Attachments: binstoragecreateop, schemaerr.pig, visits.txt > > > In the following script I have a tab separated text file, which I load using > PigStorage() and store using BinStorage() > {code} > A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, > url:chararray, time:chararray); > B = group A by name; > store B into '/user/viraj/binstoragecreateop' using BinStorage(); > dump B; > {code} > I later load file 'binstoragecreateop' in the following way. > {code} > A = load '/user/viraj/binstoragecreateop' using BinStorage(); > B = foreach A generate $0 as name:chararray; > dump B; > {code} > Result > === > (Amy) > (Fred) > === > The above code work properly and returns the right results. If I use > PigStorage() to achieve the same, I get the following error. > {code} > A = load '/user/viraj/visits.txt' using PigStorage(); > B = foreach A generate $0 as name:chararray; > dump B; > {code} > === > {code} > 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1022: Type mismatch merging schema prefix. Field Schema: bytearray. 
Other > Field Schema: name: chararray > Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log > {code} > === > So why should the semantics of BinStorage() be different from PigStorage() > where is ok not to specify a schema??? Should it not be consistent across > both.
[jira] Commented: (PIG-1341) BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED
[ https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859157#action_12859157 ] Ashutosh Chauhan commented on PIG-1341: --- I think BinStorage is an internal mechanism for moving data around in Pig and should be treated that way, so we should discourage users from relying on it directly. Otherwise, we will need to keep adding capabilities such as the one requested here. An important consequence of adding them is that we could no longer easily swap BinStorage out for another storage mechanism: if Avro (or protobuf, or whatever) proved to be a better replacement, we could not just drop it in unless it also had every capability that BinStorage has. Therefore, I suggest keeping BinStorage's capabilities to a minimum. > BinStorage cannot convert DataByteArray to Chararray and results in > FIELD_DISCARDED_TYPE_CONVERSION_FAILED > -- > > Key: PIG-1341 > URL: https://issues.apache.org/jira/browse/PIG-1341 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat >Assignee: Richard Ding > Attachments: PIG-1341.patch > > > Script reads in BinStorage data and tries to convert a column which is in > DataByteArray to Chararray. > {code} > raw = load 'sampledata' using BinStorage() as (col1,col2, col3); > --filter out null columns > A = filter raw by col1#'bcookie' is not null; > B = foreach A generate col1#'bcookie' as reqcolumn; > describe B; > --B: {regcolumn: bytearray} > X = limit B 5; > dump X; > B = foreach A generate (chararray)col1#'bcookie' as convertedcol; > describe B; > --B: {convertedcol: chararray} > X = limit B 5; > dump X; > {code} > The first dump produces: > (36co9b55onr8s) > (36co9b55onr8s) > (36hilul5oo1q1) > (36hilul5oo1q1) > (36l4cj15ooa8a) > The second dump produces: > () > () > () > () > () > It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 > time(s).
> Viraj
[jira] Commented: (PIG-1339) International characters in column names not supported
[ https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859152#action_12859152 ] Ashutosh Chauhan commented on PIG-1339: --- This is not reproducible on trunk. I get the expected output. Viraj, can you please verify if it works for you in trunk ? > International characters in column names not supported > -- > > Key: PIG-1339 > URL: https://issues.apache.org/jira/browse/PIG-1339 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat > > There is a particular use-case in which someone specifies a column name to be > in International characters. > {code} > inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお); > describe inputdata; > dump inputdata; > {code} > == > Pig Stack Trace > --- > ERROR 1000: Error during parsing. Lexical error at line 1, column 64. > Encountered: "\u3042" (12354), after : "" > org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line > 1, column 64. 
Encountered: "\u3042" (12354), after : "" > at > org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) > at > org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) > at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) > at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) > at org.apache.pig.PigServer.registerQuery(PigServer.java:425) > at > org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) > at org.apache.pig.Main.main(Main.java:391) > == > Thanks Viraj -- This message is automatically generated by JIRA. 
[jira] Commented: (PIG-1378) har url not usable in Pig scripts
[ https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858709#action_12858709 ] Ashutosh Chauhan commented on PIG-1378: --- {noformat} grunt> a = load 'har://namenode-location/user/viraj/project/subproject/files/size/data'; grunt> dump a; {noformat} This is incorrect. You need to do the following: {noformat} grunt> a = load 'har://hdfs-namenode.foo.com:8020/user/viraj/project/subproject/files/size/data'; grunt> dump a; {noformat} Note the structure of the part after har://: first the underlying scheme (hdfs), then a dash (-), then the namenode host name, then a colon, then the port number (8020), and finally the location of your har archive. > har url not usable in Pig scripts > - > > Key: PIG-1378 > URL: https://issues.apache.org/jira/browse/PIG-1378 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 >Reporter: Viraj Bhat > Fix For: 0.8.0 > > > I am trying to use har (Hadoop Archives) in my Pig script. > I can use them through the HDFS shell > {noformat} > $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data' > Found 1 items > -rw--- 5 viraj users1537234 2010-04-14 09:49 > user/viraj/project/subproject/files/size/data/part-1 > {noformat} > Using similar URL's in grunt yields > {noformat} > grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; > grunt> dump a; > {noformat} > {noformat} > 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2998: Unhandled internal error. > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible > file URI scheme: har : hdfs > 2010-04-14 22:08:48,814 [main] WARN org.apache.pig.tools.grunt.Grunt - There > is no log file to write to.
> 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - > java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: > Incompatible file URI scheme: har : hdfs > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) > at > org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) > at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) > at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) > at org.apache.pig.PigServer.registerQuery(PigServer.java:425) > at > org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) > at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) > at org.apache.pig.Main.main(Main.java:357) > Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: > Incompatible file URI scheme: har : hdfs > at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249) > at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472) > ... 
13 more > {noformat} > According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the > following as stated in the original description > {noformat} > grunt> a = load > 'har://namenode-location/user/viraj/project/subproject/files/size/data'; > grunt> dump a; > {noformat} > {noformat} > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: > Unable to create input splits for: > har://namenode-location/user/viraj/project/subproject/files/size/data'; > ... 8 more > Caused by: java.io.IOException: No FileSystem for scheme: namenode-location > at .apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375) > at .apache.hadoop.fs.FileSystem.access(200(FileSystem.java:66) > at .apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) > at .apache.hadoop.fs.FileSystem.get(FileSystem.java:196) > at .apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104) > at .apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) > at .apache.hadoop.fs.FileSystem.get(FileSystem.java:193) > at .apache.hadoop.fs.Path.getFileSystem(Path.java:175) > at > .apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat
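The fully qualified har URL layout given in the comment on this issue (underlying scheme, dash, namenode host, colon, port, archive path) can be captured in a small helper. This is a minimal sketch in Python; `har_uri` is a hypothetical name, and the host and port in the usage are the ones from the comment, not values to copy blindly.

```python
def har_uri(underlying_scheme: str, host: str, port: int, archive_path: str) -> str:
    """Build a fully qualified har:// URI.

    Layout assumed from the comment on this issue:
    har://<underlying scheme>-<host>:<port><absolute path>
    """
    if not archive_path.startswith("/"):
        raise ValueError("archive_path must be absolute")
    return f"har://{underlying_scheme}-{host}:{port}{archive_path}"
```

For example, `har_uri("hdfs", "namenode.foo.com", 8020, "/user/viraj/project/subproject/files/size/data")` yields the URL used in the working load statement quoted in the comment.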
[jira] Updated: (PIG-1353) Map-side joins
[ https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1353: -- Status: Resolved (was: Patch Available) Release Note: With this patch, Pig can perform [left|right|full] outer joins on two tables, as well as inner joins on more than two tables, on the map side, provided the data is sorted and one of the loaders implements the {{CollectableLoader}} interface. The primary algorithm is based on sort-merge join. Additional implementation details: 1) No other operations can be done between the load and join statements. 2) Data must be sorted in ASC order. 3) Nulls are considered smaller than everything else, so if the data contains null keys, they should occur before anything else. 4) The left-most loader must implement the CollectableLoader interface as well as OrderedLoadFunc. 5) All other loaders must implement IndexableLoadFunc. Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box. Similar conditions apply to map-side cogroups (PIG-1309) as well. Resolution: Fixed Patch checked-in. > Map-side joins > -- > > Key: PIG-1353 > URL: https://issues.apache.org/jira/browse/PIG-1353 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1353.patch, pig-1353.patch > > > Pig already has couple of map-side join implementations: Merge Join and > Fragmented-Replicate Join. But both of them are pretty restrictive. Merge > Join can only join two tables and that too can only do inner join. FR Join > can join multiple relations, but it can also only do inner and left outer > joins. Further it restricts the sizes of side relations. It will be nice if > we can do map side joins on multiple tables as well do inner, left outer, > right outer and full outer joins. > Lot of groundwork for this has already been done in PIG-1309. Remaining will > be tracked in this jira.
-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
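To make the sorting requirements in the PIG-1353 release note concrete, here is a minimal sort-merge inner join over two in-memory lists. It is a Python sketch of the general technique, not the patch's implementation: `merge_join` is a hypothetical name, and skipping null keys (so that they never match) is an assumption borrowed from usual join semantics rather than something the release note states.

```python
def merge_join(left, right):
    """Inner sort-merge join of two lists of (key, value) pairs.

    Both inputs must be sorted ascending by key, with None keys first.
    Returns (key, left_value, right_value) tuples.
    """
    out = []
    i = j = 0
    # Null keys sort before everything else; assume they never match.
    while i < len(left) and left[i][0] is None:
        i += 1
    while j < len(right) and right[j][0] is None:
        j += 1
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left row that carries the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out
```

Because each input is scanned once from front to back, the join needs no shuffle or sort of its own, which is why the algorithm only works when the data already arrives in ascending key order with nulls first.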
[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857932#action_12857932 ] Ashutosh Chauhan commented on PIG-1354: --- Dmitriy, Neat work! This patch makes it possible to call a few existing methods in the JDK libraries (which are already compiled and available at runtime). Thinking aloud: what would it take to go one step further, that is, to take uncompiled Java code blocks, compile them at runtime, and make them available as UDFs? If we can get that, users could write the Java code for their UDFs inline in the Pig script, and Pig could compile it at runtime and do the other necessary plumbing to make it work. There would then be no need to write the Java code separately, compile it, jar it, register it, etc. All user code would live in one file, which would make writing UDFs a lot easier. http://docs.codehaus.org/display/JANINO/Home and http://commons.apache.org/jci/ might be of help. Thoughts? > UDFs for dynamic invocation of simple Java methods > -- > > Key: PIG-1354 > URL: https://issues.apache.org/jira/browse/PIG-1354 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.8.0 >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.8.0 > > Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch > > > The need to create wrapper UDFs for simple Java functions creates unnecessary > work for Pig users, slows down the development process, and produces a lot of > trivial classes. We can use Java's reflection to allow invoking a number of > methods on the fly, dynamically, by creating a generic UDF to accomplish this.
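The idea behind the issue, one generic function that looks up an existing library method by name and calls it, can be illustrated by analogy. The sketch below uses Python's import machinery instead of Java reflection, so it says nothing about the patch's actual API; `invoke` is a hypothetical name.

```python
import importlib


def invoke(qualified_name: str, *args):
    """Call an already-available library function given its dotted name.

    Example of a dotted name: "math.sqrt" (module "math", function "sqrt").
    """
    module_name, _, func_name = qualified_name.rpartition(".")
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    return func(*args)
```

A single generic entry point like this removes the need for one trivial wrapper per library function, which is the same trade the issue description makes for Java methods.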
[jira] Commented: (PIG-1353) Map-side joins
[ https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857887#action_12857887 ] Ashutosh Chauhan commented on PIG-1353: --- Yes, the visibility change of the visitor methods from public to protected is unrelated to the issue. I did it as part of a cleanup of LogToPhyTranslator. I don't see a reason why visitor methods should be public. In general we strive not to make things public, and all usages of those methods were in the same package, so changing them from public to protected is a safe choice. > Map-side joins > -- > > Key: PIG-1353 > URL: https://issues.apache.org/jira/browse/PIG-1353 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1353.patch, pig-1353.patch > > > Pig already has couple of map-side join implementations: Merge Join and > Fragmented-Replicate Join. But both of them are pretty restrictive. Merge > Join can only join two tables and that too can only do inner join. FR Join > can join multiple relations, but it can also only do inner and left outer > joins. Further it restricts the sizes of side relations. It will be nice if > we can do map side joins on multiple tables as well do inner, left outer, > right outer and full outer joins. > Lot of groundwork for this has already been done in PIG-1309. Remaining will > be tracked in this jira.
[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations
[ https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1363: -- Status: Patch Available (was: Reopened) Affects Version/s: 0.8.0 (was: 0.7.0) > Unnecessary loadFunc instantiations > --- > > Key: PIG-1363 > URL: https://issues.apache.org/jira/browse/PIG-1363 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1363.patch, pig-1363_1.patch > > > In MRCompiler loadfuncs are instantiated at multiple locations in different > visit methods. This is inconsistent and confusing. LoadFunc should be > instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). > A getter should be added to POLoad to retrieve this instantiated loadFunc > wherever it is needed in later stages of compilation. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
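The design proposed in the PIG-1363 description (instantiate the load function in exactly one place and hand it out through a getter) looks roughly like this. The sketch is in Python with hypothetical names (`LoadOperator`, `get_load_func`); it illustrates the pattern only and is not the code in the attached patches.

```python
class LoadOperator:
    """Stand-in for a physical load operator that owns its loader."""

    def __init__(self, loader_factory):
        # The only place where the loader is instantiated.
        self._load_func = loader_factory()

    def get_load_func(self):
        # Later compilation stages reuse the same instance instead of
        # constructing their own.
        return self._load_func
```

Every later stage that needs the loader asks the operator for it, so any state set on the instance (such as a signature) is seen consistently by all of them.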
[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations
[ https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1363: -- Attachment: pig-1363_1.patch Can't get away without passing the loader signature to the backend for Merge Join, so this patch sets it. > Unnecessary loadFunc instantiations > --- > > Key: PIG-1363 > URL: https://issues.apache.org/jira/browse/PIG-1363 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1363.patch, pig-1363_1.patch > > > In MRCompiler loadfuncs are instantiated at multiple locations in different > visit methods. This is inconsistent and confusing. LoadFunc should be > instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). > A getter should be added to POLoad to retrieve this instantiated loadFunc > wherever it is needed in later stages of compilation.
[jira] Reopened: (PIG-1363) Unnecessary loadFunc instantiations
[ https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan reopened PIG-1363: --- The patch fixes the issue in one place but breaks it in another. It needs to be re-fixed. > Unnecessary loadFunc instantiations > --- > > Key: PIG-1363 > URL: https://issues.apache.org/jira/browse/PIG-1363 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1363.patch, pig-1363_1.patch > > > In MRCompiler loadfuncs are instantiated at multiple locations in different > visit methods. This is inconsistent and confusing. LoadFunc should be > instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). > A getter should be added to POLoad to retrieve this instantiated loadFunc > wherever it is needed in later stages of compilation.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857154#action_12857154 ] Ashutosh Chauhan commented on PIG-1229: --- As per http://www.mail-archive.com/pig-u...@hadoop.apache.org/msg02257.html thread I am wondering if it will be safe and possible to make sure that job using this storage has speculative execution turned-off. Otherwise, with S.E. turned on, there are too many scenarios we would have to handle. What do you think? > allow pig to write output into a JDBC db > > > Key: PIG-1229 > URL: https://issues.apache.org/jira/browse/PIG-1229 > Project: Pig > Issue Type: New Feature > Components: impl >Reporter: Ian Holsman >Assignee: Ankur >Priority: Minor > Fix For: 0.8.0 > > Attachments: jira-1229-v2.patch > > > UDF to store data into a DB -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations
[ https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1363: -- Status: Resolved (was: Patch Available) Resolution: Fixed Patch checked-in. > Unnecessary loadFunc instantiations > --- > > Key: PIG-1363 > URL: https://issues.apache.org/jira/browse/PIG-1363 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1363.patch > > > In MRCompiler loadfuncs are instantiated at multiple locations in different > visit methods. This is inconsistent and confusing. LoadFunc should be > instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). > A getter should be added to POLoad to retrieve this instantiated loadFunc > wherever it is needed in later stages of compilation. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (PIG-1353) Map-side joins
[ https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857149#action_12857149 ] Ashutosh Chauhan commented on PIG-1353: --- Hudson.. Oh Hudson.. when y'll get better ! Ran the full test suite. All of them passed. Ran test-patch: {noformat} [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 12 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. {noformat} Patch is ready for review. > Map-side joins > -- > > Key: PIG-1353 > URL: https://issues.apache.org/jira/browse/PIG-1353 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1353.patch, pig-1353.patch > > > Pig already has couple of map-side join implementations: Merge Join and > Fragmented-Replicate Join. But both of them are pretty restrictive. Merge > Join can only join two tables and that too can only do inner join. FR Join > can join multiple relations, but it can also only do inner and left outer > joins. Further it restricts the sizes of side relations. It will be nice if > we can do map side joins on multiple tables as well do inner, left outer, > right outer and full outer joins. > Lot of groundwork for this has already been done in PIG-1309. Remaining will > be tracked in this jira. -- This message is automatically generated by JIRA. 
[jira] Commented: (PIG-1363) Unnecessary loadFunc instantiations
[ https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856992#action_12856992 ] Ashutosh Chauhan commented on PIG-1363: --- Hudson is flaky (again). Result of test-patch: {noformat} [exec] [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appear to include any new or modified tests. [exec] Please justify why no tests are needed for this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] {noformat} Patch is ready for review. > Unnecessary loadFunc instantiations > --- > > Key: PIG-1363 > URL: https://issues.apache.org/jira/browse/PIG-1363 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1363.patch > > > In MRCompiler loadfuncs are instantiated at multiple locations in different > visit methods. This is inconsistent and confusing. LoadFunc should be > instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). > A getter should be added to POLoad to retrieve this instantiated loadFunc > wherever it is needed in later stages of compilation. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (PIG-1353) Map-side joins
[ https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1353: -- Status: Patch Available (was: Open) Fix Version/s: 0.8.0 > Map-side joins > -- > > Key: PIG-1353 > URL: https://issues.apache.org/jira/browse/PIG-1353 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1353.patch, pig-1353.patch > > > Pig already has couple of map-side join implementations: Merge Join and > Fragmented-Replicate Join. But both of them are pretty restrictive. Merge > Join can only join two tables and that too can only do inner join. FR Join > can join multiple relations, but it can also only do inner and left outer > joins. Further it restricts the sizes of side relations. It will be nice if > we can do map side joins on multiple tables as well do inner, left outer, > right outer and full outer joins. > Lot of groundwork for this has already been done in PIG-1309. Remaining will > be tracked in this jira. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (PIG-1353) Map-side joins
[ https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1353: -- Attachment: pig-1353.patch Running through hudson. > Map-side joins > -- > > Key: PIG-1353 > URL: https://issues.apache.org/jira/browse/PIG-1353 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: pig-1353.patch, pig-1353.patch > > > Pig already has couple of map-side join implementations: Merge Join and > Fragmented-Replicate Join. But both of them are pretty restrictive. Merge > Join can only join two tables and that too can only do inner join. FR Join > can join multiple relations, but it can also only do inner and left outer > joins. Further it restricts the sizes of side relations. It will be nice if > we can do map side joins on multiple tables as well do inner, left outer, > right outer and full outer joins. > Lot of groundwork for this has already been done in PIG-1309. Remaining will > be tracked in this jira. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations
[ https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1363: -- Status: Patch Available (was: Open) > Unnecessary loadFunc instantiations > --- > > Key: PIG-1363 > URL: https://issues.apache.org/jira/browse/PIG-1363 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1363.patch > > > In MRCompiler loadfuncs are instantiated at multiple locations in different > visit methods. This is inconsistent and confusing. LoadFunc should be > instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). > A getter should be added to POLoad to retrieve this instantiated loadFunc > wherever it is needed in later stages of compilation. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations
[ https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1363: -- Status: Open (was: Patch Available) > Unnecessary loadFunc instantiations > --- > > Key: PIG-1363 > URL: https://issues.apache.org/jira/browse/PIG-1363 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1363.patch > > > In MRCompiler loadfuncs are instantiated at multiple locations in different > visit methods. This is inconsistent and confusing. LoadFunc should be > instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). > A getter should be added to POLoad to retrieve this instantiated loadFunc > wherever it is needed in later stages of compilation. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Assigned: (PIG-1363) Unnecessary loadFunc instantiations
[ https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan reassigned PIG-1363: - Assignee: Ashutosh Chauhan > Unnecessary loadFunc instantiations > --- > > Key: PIG-1363 > URL: https://issues.apache.org/jira/browse/PIG-1363 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1363.patch > > > In MRCompiler loadfuncs are instantiated at multiple locations in different > visit methods. This is inconsistent and confusing. LoadFunc should be > instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). > A getter should be added to POLoad to retrieve this instantiated loadFunc > wherever it is needed in later stages of compilation. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations
[ https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1363: -- Status: Patch Available (was: Open) > Unnecessary loadFunc instantiations > --- > > Key: PIG-1363 > URL: https://issues.apache.org/jira/browse/PIG-1363 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1363.patch > > > In MRCompiler loadfuncs are instantiated at multiple locations in different > visit methods. This is inconsistent and confusing. LoadFunc should be > instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). > A getter should be added to POLoad to retrieve this instantiated loadFunc > wherever it is needed in later stages of compilation. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (PIG-1363) Unnecessary loadFunc instantiations
[ https://issues.apache.org/jira/browse/PIG-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1363: -- Attachment: pig-1363.patch The ideal solution to this problem is to have {{LoadFunc}} implement {{Serializable}}. LoadFunc would then be instantiated once, the first time it is needed (in LOLoad), and that one object would be used everywhere. But this would be backward incompatible, because every existing load func implementation would then have to implement Serializable. So, for now, we will live with this. This patch gets rid of the multiple load func instantiations in the front end wherever they can be avoided without making LoadFunc Serializable. No new test cases are needed, since this is purely code cleanup and does not add, delete, or modify any existing functionality; the current regression tests suffice. > Unnecessary loadFunc instantiations > --- > > Key: PIG-1363 > URL: https://issues.apache.org/jira/browse/PIG-1363 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan > Fix For: 0.8.0 > > Attachments: pig-1363.patch > > > In MRCompiler loadfuncs are instantiated at multiple locations in different > visit methods. This is inconsistent and confusing. LoadFunc should be > instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). > A getter should be added to POLoad to retrieve this instantiated loadFunc > wherever it is needed in later stages of compilation.
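The "instantiate in one place, retrieve through a getter" idea from this issue can be sketched as below. This is a minimal, self-contained illustration and not the code in pig-1363.patch: SketchLoadFunc, SketchPOLoad, SketchTranslator, and getLoadFunc() are hypothetical stand-ins for Pig's LoadFunc, POLoad, the logical-to-physical translation step, and the proposed getter.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a LoadFunc; counts how many times it is constructed.
class SketchLoadFunc {
    static final AtomicInteger CREATED = new AtomicInteger();
    final String funcSpec;

    SketchLoadFunc(String funcSpec) {
        this.funcSpec = funcSpec;
        CREATED.incrementAndGet();
    }
}

// Hypothetical stand-in for POLoad: holds the one loader instance and exposes it.
class SketchPOLoad {
    private final SketchLoadFunc loadFunc;

    SketchPOLoad(SketchLoadFunc loadFunc) {
        this.loadFunc = loadFunc;
    }

    SketchLoadFunc getLoadFunc() {
        return loadFunc;
    }
}

// Hypothetical stand-in for the logical-to-physical translation step.
class SketchTranslator {
    // The only place in this sketch where a loader is constructed.
    static SketchPOLoad visitLoad(String funcSpec) {
        return new SketchPOLoad(new SketchLoadFunc(funcSpec));
    }
}

class SingleInstantiationDemo {
    // Simulates two later compilation stages asking for the loader.
    static int loadersCreated() {
        SketchLoadFunc.CREATED.set(0);
        SketchPOLoad poLoad = SketchTranslator.visitLoad("PigStorage");
        SketchLoadFunc first = poLoad.getLoadFunc();
        SketchLoadFunc second = poLoad.getLoadFunc();
        if (first != second) {
            throw new IllegalStateException("expected the same instance");
        }
        return SketchLoadFunc.CREATED.get();
    }
}
```

Later stages call getLoadFunc() instead of constructing their own loader, so the construction count stays at one regardless of how many visit methods need the loader.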
[jira] Commented: (PIG-506) Does pig need a NATIVE keyword?
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855434#action_12855434 ] Ashutosh Chauhan commented on PIG-506: -- Ashitosh, When I click on that link, I get: {noformat} You do not have the required role. {noformat} Do you need to set permissions for it to be world-readable? (if that is what you are intending to do) > Does pig need a NATIVE keyword? > --- > > Key: PIG-506 > URL: https://issues.apache.org/jira/browse/PIG-506 > Project: Pig > Issue Type: New Feature > Components: impl >Reporter: Alan Gates >Assignee: Alan Gates >Priority: Minor > > Assume a user had a job that broke easily into three pieces. Further assume > that pieces one and three were easily expressible in pig, but that piece two > needed to be written in map reduce for whatever reason (performance, > something that pig could not easily express, legacy job that was too > important to change, etc.). Today the user would either have to use map > reduce for the entire job or manually handle the stitching together of pig > and map reduce jobs. What if instead pig provided a NATIVE keyword that > would allow the script to pass off the data stream to the underlying system > (in this case map reduce)? The semantics of NATIVE would vary by underlying > system. In the map reduce case, we would assume that this indicated a > collection of one or more fully contained map reduce jobs, so that pig would > store the data, invoke the map reduce jobs, and then read the resulting data > to continue. It might look something like this: > {code} > A = load 'myfile'; > X = load 'myotherfile'; > B = group A by $0; > C = foreach B generate group, myudf(B); > D = native (jar=mymr.jar, infile=frompig outfile=topig); > E = join D by $0, X by $0; > ... > {code} > This differs from streaming in that it allows the user to insert an arbitrary > amount of native processing, whereas streaming allows the insertion of one > binary. It also differs in that, for streaming, data is piped directly into > and out of the binary as part of the pig pipeline. Here the pipeline would > be broken, data written to disk, and the native block invoked, then data read > back from disk. > Another alternative is to say this is unnecessary because the user can do the > coordination from java, using the PigServer interface to run pig and calling > the map reduce job explicitly. The advantages of the native keyword are that > the user need not be worried about coordination between the jobs, pig will > take care of it. Also the user can make use of existing java applications > without being a java programmer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
[ https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan resolved PIG-1362. --- Resolution: Fixed Since Hudson is flaky once again, I ran the full test suite; all of it passed. I also ran test-patch: {noformat} [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. {noformat} Patch checked in to the 0.7 branch. > Provide udf context signature in ensureAllKeysInSameSplit() method of loader > > > Key: PIG-1362 > URL: https://issues.apache.org/jira/browse/PIG-1362 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan >Priority: Critical > Fix For: 0.7.0 > > Attachments: backport.patch > > > As a part of PIG-1292 a check was introduced to make sure loader used in > "collected" group-by implements CollectableLoader (new interface in that > patch). In its method, loader may use udf context to store some info. We need > to make sure that udf context signature is set up correctly in such cases. > This is already the case in trunk, need to backport it to 0.7 branch.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854740#action_12854740 ] Ashutosh Chauhan commented on PIG-1229: --- You can get rid of this stack trace by overriding relToAbsPathForStoreLocation() of StoreFunc, which DBStorage extends, and turning it into a no-op. Since a DB location is always absolute, there is no need for the default behavior in StoreFunc. As for DataType.findType(), I found that even PigStorage does the same, so this patch is no worse than PigStorage in that respect. > allow pig to write output into a JDBC db > > > Key: PIG-1229 > URL: https://issues.apache.org/jira/browse/PIG-1229 > Project: Pig > Issue Type: New Feature > Components: impl >Reporter: Ian Holsman >Assignee: Ankur >Priority: Minor > Fix For: 0.8.0 > > Attachments: jira-1229-v2.patch > > > UDF to store data into a DB
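The no-op override suggested in this comment can be sketched as below. This is a self-contained illustration under stated assumptions: SketchStoreFunc and SketchDBStorage are hypothetical stand-ins, the current directory is passed as a plain String for simplicity, and the default path resolution shown here is a simplified guess at the kind of behavior being overridden, not Pig's actual implementation.

```java
// Hypothetical stand-in for a store function base class whose default
// behaviour resolves a relative location against the current directory.
class SketchStoreFunc {
    String relToAbsPathForStoreLocation(String location, String curDir) {
        // Simplified assumption: absolute paths and scheme://-style URIs pass through.
        if (location.startsWith("/") || location.contains("://")) {
            return location;
        }
        return curDir + "/" + location;
    }
}

// Hypothetical stand-in for a DB-backed store function.
class SketchDBStorage extends SketchStoreFunc {
    // A DB location is always absolute, so return it untouched (no-op).
    @Override
    String relToAbsPathForStoreLocation(String location, String curDir) {
        return location;
    }
}
```

With the default in this sketch, a JDBC URL such as 'jdbc:hsqldb:file:/tmp/batchtest' would be treated as a relative path and prefixed with the current directory; the override leaves it as given.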
[jira] Created: (PIG-1363) Unnecessary loadFunc instantiations
Unnecessary loadFunc instantiations --- Key: PIG-1363 URL: https://issues.apache.org/jira/browse/PIG-1363 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 In MRCompiler loadfuncs are instantiated at multiple locations in different visit methods. This is inconsistent and confusing. LoadFunc should be instantiated at only one place, ideally in LogToPhyTanslation#visit(LOLoad). A getter should be added to POLoad to retrieve this instantiated loadFunc wherever it is needed in later stages of compilation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Reopened: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
[ https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan reopened PIG-1362: --- > Provide udf context signature in ensureAllKeysInSameSplit() method of loader > > > Key: PIG-1362 > URL: https://issues.apache.org/jira/browse/PIG-1362 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan >Priority: Critical > Fix For: 0.7.0 > > Attachments: backport.patch > > > As a part of PIG-1292 a check was introduced to make sure loader used in > "collected" group-by implements CollectableLoader (new interface in that > patch). In its method, loader may use udf context to store some info. We need > to make sure that udf context signature is setup correctly in such cases. > This is already the case in trunk, need to backport it to 0.7 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data
[ https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854643#action_12854643 ] Ashutosh Chauhan commented on PIG-1348: --- 1) As far as I can see, TextOutputFormat has a synchronized write() because it is meant to work even with mappers that use MultithreadedMapRunner. Since that is not the case for Pig, we can get rid of it, especially now that we are putting in our own PigTextOutputFormat instead of using TextOutputFormat. 3) That's what I meant: if a schema is available, we should use it to find types instead of reflecting on every call. I suggested caching as a workaround for the case where we know the user provided a schema but we don't have a handle on it. Clearly, if there is no schema, we need to find the type every time. I can see that dealing with complex types is not straightforward even when there is a schema. In any case, the casts currently there for simple types are unnecessary. As for performance numbers, both of these will save CPU time; if we are convinced that we are always I/O bound, we can leave these things as they are. > PigStorage making unnecessary byte array copy when storing data > --- > > Key: PIG-1348 > URL: https://issues.apache.org/jira/browse/PIG-1348 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Richard Ding > Fix For: 0.7.0 > > Attachments: PIG-1348.patch, PIG-1348_2.patch > > > InternalCachedBag makes estimate of memory available to the VM by using > Runtime.getRuntime().maxMemory(). It then uses 10%(by default, though > configurable) of this memory and divides this memory into number of bags. It > keeps track of the memory used by bags and then proactively spills if bags > memory usage reach close to these limits. Given all this in theory when > presented with data more than it can handle InternalCachedBag should not run > out of memory. But in practice we find OOM happening.
[jira] Updated: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
[ https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1362: -- Status: Patch Available (was: Open) > Provide udf context signature in ensureAllKeysInSameSplit() method of loader > > > Key: PIG-1362 > URL: https://issues.apache.org/jira/browse/PIG-1362 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Priority: Critical > Fix For: 0.7.0 > > Attachments: backport.patch > > > As a part of PIG-1292 a check was introduced to make sure loader used in > "collected" group-by implements CollectableLoader (new interface in that > patch). In its method, loader may use udf context to store some info. We need > to make sure that udf context signature is setup correctly in such cases. > This is already the case in trunk, need to backport it to 0.7 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
[ https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan reassigned PIG-1362: - Assignee: Ashutosh Chauhan > Provide udf context signature in ensureAllKeysInSameSplit() method of loader > > > Key: PIG-1362 > URL: https://issues.apache.org/jira/browse/PIG-1362 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan >Priority: Critical > Fix For: 0.7.0 > > Attachments: backport.patch > > > As a part of PIG-1292 a check was introduced to make sure loader used in > "collected" group-by implements CollectableLoader (new interface in that > patch). In its method, loader may use udf context to store some info. We need > to make sure that udf context signature is setup correctly in such cases. > This is already the case in trunk, need to backport it to 0.7 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
[ https://issues.apache.org/jira/browse/PIG-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1362: -- Attachment: backport.patch Simple one line fix. Test cases included. > Provide udf context signature in ensureAllKeysInSameSplit() method of loader > > > Key: PIG-1362 > URL: https://issues.apache.org/jira/browse/PIG-1362 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Priority: Critical > Fix For: 0.7.0 > > Attachments: backport.patch > > > As a part of PIG-1292 a check was introduced to make sure loader used in > "collected" group-by implements CollectableLoader (new interface in that > patch). In its method, loader may use udf context to store some info. We need > to make sure that udf context signature is setup correctly in such cases. > This is already the case in trunk, need to backport it to 0.7 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1362) Provide udf context signature in ensureAllKeysInSameSplit() method of loader
Provide udf context signature in ensureAllKeysInSameSplit() method of loader Key: PIG-1362 URL: https://issues.apache.org/jira/browse/PIG-1362 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Priority: Critical Fix For: 0.7.0 As a part of PIG-1292 a check was introduced to make sure loader used in "collected" group-by implements CollectableLoader (new interface in that patch). In its method, loader may use udf context to store some info. We need to make sure that udf context signature is setup correctly in such cases. This is already the case in trunk, need to backport it to 0.7 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-959) Merge Join fails when there is a blocking operator before it in query.
[ https://issues.apache.org/jira/browse/PIG-959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854340#action_12854340 ] Ashutosh Chauhan commented on PIG-959: -- Patch includes two new tests. Not sure why hudson thought otherwise. Patch is ready for review. > Merge Join fails when there is a blocking operator before it in query. > -- > > Key: PIG-959 > URL: https://issues.apache.org/jira/browse/PIG-959 > Project: Pig > Issue Type: Bug >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: pig-959.patch > > > If there is an order-by, distinct or any other blocking operator in query > followed by Merge Join, pig fails to compile it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1348) PigStorage making unnecessary byte array copy when storing data
[ https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854292#action_12854292 ] Ashutosh Chauhan commented on PIG-1348: --- Since this is mostly performance related, there are a few more things we could get in, depending on the complexity/speedup tradeoff: 1) PigLineRecordWriter#write() is synchronized. Is that needed? I don't see a scenario where multiple threads write using the same object and thus potentially stomp on each other. Am I missing something here? 2) Within write(), I think it can safely be assumed that value is of type Tuple, because the argument of putNext() is of type Tuple. Then we can get rid of the instanceof. 3) In StorageUtil.putField(), is it possible to get rid of DataType.findType(), possibly by getting hold of the schema and taking the type information from there? If not, then maybe we can cache the type info the first time instead of finding it on every call. At the very least, we should get rid of the casts for simple types, as they are unnecessary. DataType.isComplex() can be used to determine that. > PigStorage making unnecessary byte array copy when storing data > --- > > Key: PIG-1348 > URL: https://issues.apache.org/jira/browse/PIG-1348 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Richard Ding > Fix For: 0.7.0 > > Attachments: PIG-1348.patch > > > InternalCachedBag makes estimate of memory available to the VM by using > Runtime.getRuntime().maxMemory(). It then uses 10%(by default, though > configurable) of this memory and divides this memory into number of bags. It > keeps track of the memory used by bags and then proactively spills if bags > memory usage reach close to these limits. Given all this in theory when > presented with data more than it can handle InternalCachedBag should not run > out of memory. But in practice we find OOM happening.
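The "cache the type info the first time" suggestion in point 3 of this comment can be sketched as below. This is an illustration only: SketchTypeCache, its type codes, and its findType() are hypothetical stand-ins rather than Pig's DataType API, and the sketch assumes every tuple has the same column types (the case where the user supplied a schema). Null fields and schema-less data are not handled.

```java
import java.util.List;

// Hypothetical per-writer cache of column types.
class SketchTypeCache {
    static final byte UNKNOWN = 0, INTEGER = 1, LONG = 2, CHARARRAY = 3;

    int lookups = 0;        // how many times the slow lookup ran
    private byte[] cached;  // one type code per column, filled on the first tuple

    // Stand-in for a per-value type lookup based on instanceof checks.
    byte findType(Object o) {
        lookups++;
        if (o instanceof Integer) return INTEGER;
        if (o instanceof Long) return LONG;
        if (o instanceof String) return CHARARRAY;
        return UNKNOWN;
    }

    // Resolves the column types once and reuses them for later tuples
    // with the same number of fields.
    byte[] typesFor(List<Object> tuple) {
        if (cached == null || cached.length != tuple.size()) {
            cached = new byte[tuple.size()];
            for (int i = 0; i < cached.length; i++) {
                cached[i] = findType(tuple.get(i));
            }
        }
        return cached;
    }
}
```

For a three-column relation the slow lookup runs three times in total instead of three times per tuple; whether that saving matters depends on whether the job is CPU bound or I/O bound, as the later comment on this issue notes.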
[jira] Updated: (PIG-959) Merge Join fails when there is a blocking operator before it in query.
[ https://issues.apache.org/jira/browse/PIG-959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-959: - Status: Patch Available (was: Open) Running through hudson. > Merge Join fails when there is a blocking operator before it in query. > -- > > Key: PIG-959 > URL: https://issues.apache.org/jira/browse/PIG-959 > Project: Pig > Issue Type: Bug >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: pig-959.patch > > > If there is an order-by, distinct or any other blocking operator in query > followed by Merge Join, pig fails to compile it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-959) Merge Join fails when there is a blocking operator before it in query.
[ https://issues.apache.org/jira/browse/PIG-959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-959: - Attachment: pig-959.patch Attached a patch which lifts the restriction that no blocking operator can be placed before a merge join. With this work, data can be ordered and joined in the same script. > Merge Join fails when there is a blocking operator before it in query. > -- > > Key: PIG-959 > URL: https://issues.apache.org/jira/browse/PIG-959 > Project: Pig > Issue Type: Bug >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: pig-959.patch > > > If there is an order-by, distinct or any other blocking operator in query > followed by Merge Join, pig fails to compile it.
[jira] Updated: (PIG-1353) Map-side joins
[ https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1353: -- Attachment: pig-1353.patch An illustrative patch which achieves this. > Map-side joins > -- > > Key: PIG-1353 > URL: https://issues.apache.org/jira/browse/PIG-1353 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: pig-1353.patch > > > Pig already has couple of map-side join implementations: Merge Join and > Fragmented-Replicate Join. But both of them are pretty restrictive. Merge > Join can only join two tables and that too can only do inner join. FR Join > can join multiple relations, but it can also only do inner and left outer > joins. Further it restricts the sizes of side relations. It will be nice if > we can do map side joins on multiple tables as well do inner, left outer, > right outer and full outer joins. > Lot of groundwork for this has already been done in PIG-1309. Remaining will > be tracked in this jira. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1353) Map-side joins
Map-side joins -- Key: PIG-1353 URL: https://issues.apache.org/jira/browse/PIG-1353 Project: Pig Issue Type: Improvement Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Pig already has a couple of map-side join implementations: Merge Join and Fragmented-Replicate Join. But both of them are pretty restrictive. Merge Join can only join two tables, and can only do an inner join. FR Join can join multiple relations, but it can also only do inner and left outer joins. Further, it restricts the sizes of the side relations. It would be nice if we could do map-side joins on multiple tables as well as do inner, left outer, right outer and full outer joins. A lot of the groundwork for this has already been done in PIG-1309. The remaining work will be tracked in this jira.
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1309: -- Attachment: pig-1309_2.patch Updated the patch to fix test failures and javac warnings and to add more comments. Result of test-patch on latest patch: {noformat} [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 9 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] {noformat} Result of test-commit: {noformat} test-commit: [mkdir] Created dir: /homes/chauhana/scratch/latest/build/test/logs [junit] Running org.apache.pig.test.TestAdd [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.036 sec : : [junit] Running org.apache.pig.test.TestTypeCheckingValidatorNoSchema [junit] Tests run: 13, Failures: 0, Errors: 0, Time elapsed: 0.165 sec BUILD SUCCESSFUL {noformat} Patch checked in to trunk. > Map-side Cogroup > > > Key: PIG-1309 > URL: https://issues.apache.org/jira/browse/PIG-1309 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch > > > In the never-ending quest to make Pig go faster, we want to parallelize as many > relational operations as possible. It's already possible to do Group-by( > PIG-984 ) and Joins( PIG-845 , PIG-554 ) purely in map-side in Pig. This jira > is to add map-side implementation of Cogroup in Pig. Details to follow.
[jira] Commented: (PIG-1348) InternalCachedBag running out of memory
[ https://issues.apache.org/jira/browse/PIG-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852511#action_12852511 ] Ashutosh Chauhan commented on PIG-1348: --- To reproduce, cogroup page_views (from PigMix's dataset) with page_views on user, and this exception should occur. Apart from making InternalCachedBag more robust, the important thing to figure out here is where 90% of the available memory is being used. Also, a related fix went in recently: PIG-1307. This might be related to that issue. > InternalCachedBag running out of memory > --- > > Key: PIG-1348 > URL: https://issues.apache.org/jira/browse/PIG-1348 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Richard Ding > > InternalCachedBag makes estimate of memory available to the VM by using > Runtime.getRuntime().maxMemory(). It then uses 10%(by default, though > configurable) of this memory and divides this memory into number of bags. It > keeps track of the memory used by bags and then proactively spills if bags > memory usage reach close to these limits. Given all this in theory when > presented with data more than it can handle InternalCachedBag should not run > out of memory. But in practice we find OOM happening.
[jira] Created: (PIG-1348) InternalCachedBag running out of memory
InternalCachedBag running out of memory --- Key: PIG-1348 URL: https://issues.apache.org/jira/browse/PIG-1348 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding InternalCachedBag estimates the memory available to the VM using Runtime.getRuntime().maxMemory(). It then uses 10% (by default, though configurable) of this memory and divides it among the bags. It keeps track of the memory used by bags and proactively spills if a bag's memory usage gets close to these limits. Given all this, in theory, when presented with more data than it can handle, InternalCachedBag should not run out of memory. But in practice we find OOMs happening. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
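The budgeting scheme described in PIG-1348 can be sketched roughly as follows. This is a minimal illustration, not Pig's InternalCachedBag code: the class name, method names, and the even split across bags are assumptions made for the example. Only Runtime.getRuntime().maxMemory() and the 10% default come from the issue text.

```java
// Hypothetical sketch of the budgeting scheme described above; this is not
// Pig's InternalCachedBag code, and all names here are invented.
final class BagMemoryBudget {

    /** Bytes one bag may keep in memory before it should spill to disk. */
    static long perBagLimit(long maxMemoryBytes, double fraction, int numBags) {
        if (numBags <= 0) {
            throw new IllegalArgumentException("numBags must be positive");
        }
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        // Share of the heap reserved for all bags together.
        long sharedBudget = (long) (maxMemoryBytes * fraction);
        // Assumption: the shared budget is split evenly across the bags.
        return sharedBudget / numBags;
    }

    /** True once a bag's tracked usage has reached its share of the budget. */
    static boolean shouldSpill(long bytesUsedByBag, long perBagLimit) {
        return bytesUsedByBag >= perBagLimit;
    }

    public static void main(String[] args) {
        long maxMemory = Runtime.getRuntime().maxMemory();
        long limit = perBagLimit(maxMemory, 0.10, 4);
        System.out.println("max heap bytes: " + maxMemory);
        System.out.println("per-bag limit for 4 bags at 10%: " + limit);
        System.out.println("spill at limit? " + shouldSpill(limit, limit));
    }
}
```

A budget like this only bounds the memory that is actually tracked. If the size estimates are low, or the rest of the heap is consumed elsewhere, an OOM is still possible, which matches the question raised in the issue about where the rest of the memory goes.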
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852190#action_12852190 ] Ashutosh Chauhan commented on PIG-1229: --- A few suggestions. Reading from the test case, store statements currently look like: {code} b = store a into 'dummy' using org.apache.pig.piggybank.storage.DBStorage('org.hsqldb.jdbcDriver','jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100','insert into a...'); {code} Here 'dummy' is totally ignored. While this works, from a user-experience standpoint the following might be better: {code} b = store a into 'jdbc:hsqldb:file:/tmp/batchtest' using org.apache.pig.piggybank.storage.DBStorage('org.hsqldb.jdbcDriver','hsqldb.default_table_type=cached;hsqldb.cache_rows=100','insert into a'); {code} That is, have the DB URL as the store location and the second parameter of the store func as the DB params. You can use setStoreLocation() to store the URL. Apart from giving a more intuitive store statement, this will also allow you to check whether the DB is reachable at compile time, instead of at runtime. You can do that via checkOutputSpecs(). Doing DataType.findType() on every element of every tuple will be expensive. I am wondering if you can get hold of the schema in your store func and use that to map Pig types to SQL types. All of these suggestions may come in as later patches. So, if you want to get this committed and track these separately, I think that will also work, as this patch is functionally complete. > allow pig to write output into a JDBC db > > > Key: PIG-1229 > URL: https://issues.apache.org/jira/browse/PIG-1229 > Project: Pig > Issue Type: New Feature > Components: impl >Reporter: Ian Holsman >Assignee: Ankur >Priority: Minor > Fix For: 0.8.0 > > Attachments: jira-1229-v2.patch > > > UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
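The last suggestion in the PIG-1229 comment above, mapping types once from the schema instead of inspecting every element of every tuple, might look something like the sketch below. The FieldType enum and the method names are hypothetical stand-ins, not Pig's DataType or schema classes. Only the java.sql.Types constants are real JDK API.

```java
import java.sql.Types;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: resolve the SQL type of each column once, up front,
// from a schema, rather than per tuple element. Not the DBStorage code.
final class SchemaTypeMapping {

    /** Stand-in for a handful of Pig field types (assumption for this sketch). */
    enum FieldType { INT, LONG, DOUBLE, CHARARRAY }

    /** Maps one field type to a java.sql.Types constant. */
    static int toSqlType(FieldType type) {
        switch (type) {
            case INT:       return Types.INTEGER;
            case LONG:      return Types.BIGINT;
            case DOUBLE:    return Types.DOUBLE;
            case CHARARRAY: return Types.VARCHAR;
            default:
                throw new IllegalArgumentException("unmapped type: " + type);
        }
    }

    /** Computes the per-column SQL types a single time for the whole store. */
    static int[] sqlTypesFor(List<FieldType> schema) {
        int[] sqlTypes = new int[schema.size()];
        for (int i = 0; i < sqlTypes.length; i++) {
            sqlTypes[i] = toSqlType(schema.get(i));
        }
        return sqlTypes;
    }

    public static void main(String[] args) {
        List<FieldType> schema = Arrays.asList(FieldType.INT, FieldType.CHARARRAY);
        // Each tuple written later can reuse this array by column index.
        System.out.println(Arrays.toString(sqlTypesFor(schema)));
    }
}
```

With the array computed once, binding a tuple becomes an index lookup per column, so the per-element type discovery cost disappears from the write path.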
[jira] Commented: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851661#action_12851661 ] Ashutosh Chauhan commented on PIG-1309: --- To build the index, we sample every split and get an index entry corresponding to the split. After sampling, all the index entries are sorted and then the index is written to disk. When I first wrote MergeJoin I wasn't able to figure out how to use Hadoop sorting to sort the index. So, there is a comment in MRCompiler for that: {noformat} // Sorting of index can possibly be achieved by using Hadoop sorting // between map and reduce instead of Pig doing sort. If that is so, // it will simplify lot of the code below. {noformat} Now I figured it out :) By default, if LocalRearranges produce a key of type tuple, Pig supplies a raw binary comparator (PigTupleWritableComparator) to Hadoop to compare tuples, which ignores the semantics of the tuple. We need to override that behavior to make Pig supply the correct tuple comparator (PigTupleRawComparator). We need to communicate this from MRCompiler to JobControlCompiler, so I am doing that through the MapReduceOper object. Two nice side effects of this: a) the code in MRCompiler is indeed simplified now; b) we got rid of the extra index sorting inside the reducer. > Map-side Cogroup > > > Key: PIG-1309 > URL: https://issues.apache.org/jira/browse/PIG-1309 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: mapsideCogrp.patch, pig-1309_1.patch > > > In the never-ending quest to make Pig go faster, we want to parallelize as many > relational operations as possible. It's already possible to do Group-by > (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira > is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
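To illustrate why a comparator that looks only at raw bytes can disagree with one that respects the meaning of the key, here is a small self-contained sketch. It does not use Pig's or Hadoop's comparator classes. The big-endian int encoding and every name below are assumptions chosen for the example, and Pig's tuple serialization is more involved than this.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: the same keys sorted by raw bytes and by value.
final class RawVersusSemanticOrder {

    /** Encodes an int as 4 big-endian bytes (ByteBuffer's default order). */
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    static int deserialize(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    /** Compares serialized keys byte by byte as unsigned values. */
    static final Comparator<byte[]> RAW = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    };

    /** Decodes the keys first, then compares them as signed integers. */
    static final Comparator<byte[]> SEMANTIC =
            (a, b) -> Integer.compare(deserialize(a), deserialize(b));

    /** Serializes the keys, sorts them with the given order, decodes them back. */
    static int[] sortWith(Comparator<byte[]> order, int... keys) {
        byte[][] encoded = new byte[keys.length][];
        for (int i = 0; i < keys.length; i++) {
            encoded[i] = serialize(keys[i]);
        }
        Arrays.sort(encoded, order);
        int[] sorted = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = deserialize(encoded[i]);
        }
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println("raw:      " + Arrays.toString(sortWith(RAW, 3, -1, 2)));
        System.out.println("semantic: " + Arrays.toString(sortWith(SEMANTIC, 3, -1, 2)));
    }
}
```

In this encoding a negative key has its high bit set, so the raw order places -1 after the positive keys while the semantic order places it first. An index that must come out in key order therefore needs the semantic comparator.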
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1309: -- Status: Patch Available (was: Open) > Map-side Cogroup > > > Key: PIG-1309 > URL: https://issues.apache.org/jira/browse/PIG-1309 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: mapsideCogrp.patch, pig-1309_1.patch > > > In the never-ending quest to make Pig go faster, we want to parallelize as many > relational operations as possible. It's already possible to do Group-by > (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira > is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1309: -- Attachment: pig-1309_1.patch Getting closer. Running through Hudson to find out if it breaks anything. > Map-side Cogroup > > > Key: PIG-1309 > URL: https://issues.apache.org/jira/browse/PIG-1309 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: mapsideCogrp.patch, pig-1309_1.patch > > > In the never-ending quest to make Pig go faster, we want to parallelize as many > relational operations as possible. It's already possible to do Group-by > (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira > is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1309: -- Attachment: (was: pig-1309.patch) > Map-side Cogroup > > > Key: PIG-1309 > URL: https://issues.apache.org/jira/browse/PIG-1309 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: mapsideCogrp.patch, pig-1309_1.patch > > > In the never-ending quest to make Pig go faster, we want to parallelize as many > relational operations as possible. It's already possible to do Group-by > (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira > is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
[ https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849827#action_12849827 ] Ashutosh Chauhan commented on PIG-1315: --- Aah.. I thought you had put SortedTableSplitComparable in its own class file. That doesn't seem to be the case. I need to test this version to make sure it works. > [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader > > > Key: PIG-1315 > URL: https://issues.apache.org/jira/browse/PIG-1315 > Project: Pig > Issue Type: New Feature >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: zebra.0324, zebra.0324 > > > OrderedLoadFunc interface is used by Pig to do merge join and mapside > cogrouping. For Zebra, implementing this interface is necessary to support > mapside cogrouping. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
[ https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849814#action_12849814 ] Ashutosh Chauhan commented on PIG-1315: --- +1 it passes Pig side of things. > [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader > > > Key: PIG-1315 > URL: https://issues.apache.org/jira/browse/PIG-1315 > Project: Pig > Issue Type: New Feature >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: zebra.0324, zebra.0324 > > > OrderedLoadFunc interface is used by Pig to do merge join and mapside > cogrouping. For Zebra, implementing this interface is necessary to support > mapside cogrouping. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1329) Pig version incorrect post-0.7 split
[ https://issues.apache.org/jira/browse/PIG-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849777#action_12849777 ] Ashutosh Chauhan commented on PIG-1329: --- +1, I also hit this issue yesterday. It seems to be a typo introduced while branching for 0.7. > Pig version incorrect post-0.7 split > > > Key: PIG-1329 > URL: https://issues.apache.org/jira/browse/PIG-1329 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy >Priority: Trivial > Fix For: 0.8.0 > > Attachments: PIG-1329.patch > > > There's a typo in build.xml that makes the current pig version 0..0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1254) up the memory for junit to run tests
[ https://issues.apache.org/jira/browse/PIG-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan resolved PIG-1254. --- Resolution: Invalid I upped the memory limits to 512M and 768M, but didn't see any improvement in runtime. Resolving as invalid. > up the memory for junit to run tests > > > Key: PIG-1254 > URL: https://issues.apache.org/jira/browse/PIG-1254 > Project: Pig > Issue Type: Bug > Components: build >Reporter: Ashutosh Chauhan > > Currently junit is configured to run with only 256M of memory. This is too > low considering that most tests create a MiniCluster, run the Hadoop local > job runner, run tests etc., all within the same JVM. This results in transient > failures, longer test completion times etc. This should be upped to at least > 512M. > build.xml: > {noformat} > fork="yes" maxmemory="256m" dir="${basedir}" timeout="${test.timeout}" > errorProperty="tests.failed" failureProperty="tests.failed"> > {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1315) [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader
[ https://issues.apache.org/jira/browse/PIG-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849601#action_12849601 ] Ashutosh Chauhan commented on PIG-1315: --- While reading records in the reducer, Pig uses reflection to instantiate typed data objects. SortedTableSplitComparable, the writable comparable required by OrderedLoadFunc, is an inner class of SortedTableSplit. As a result, reflection fails and an exception is thrown. To make it work, SortedTableSplitComparable may need to move into its own class with public visibility. > [Zebra] Implementing OrderedLoadFunc interface for Zebra TableLoader > > > Key: PIG-1315 > URL: https://issues.apache.org/jira/browse/PIG-1315 > Project: Pig > Issue Type: New Feature >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: zebra.0324 > > > OrderedLoadFunc interface is used by Pig to do merge join and mapside > cogrouping. For Zebra, implementing this interface is necessary to support > mapside cogrouping. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
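One way reflective instantiation can fail for an inner class is sketched below. This is an illustration in plain Java, not the Zebra or Pig code, and the comment above does not say which exact failure was hit, so the non-static inner class case shown here is an assumption. The class names are invented.

```java
// Hypothetical sketch: a non-static inner class has no true no-argument
// constructor, because its constructor implicitly takes the enclosing
// instance. A static nested class (or a top-level class) does not have
// that hidden parameter.
final class InnerClassReflection {

    /** Non-static inner class: needs an InnerClassReflection instance. */
    class InnerKey { }

    /** Static nested class with a public no-argument constructor. */
    public static class NestedKey {
        public NestedKey() { }
    }

    /** Tries to create an instance the way a generic framework might. */
    static boolean canInstantiate(Class<?> type) {
        try {
            type.getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // For InnerKey this is a NoSuchMethodException.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("inner:  " + canInstantiate(InnerKey.class));
        System.out.println("nested: " + canInstantiate(NestedKey.class));
    }
}
```

Moving the key class to its own file, as suggested in the comment, removes the hidden constructor parameter and makes the visibility explicit.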
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1309: -- Attachment: pig-1309.patch Did an offline review with Alan. Found a subtle bug in POMergeCogroup#getNext(). Fixed that and added more tests. Still need to tidy things up in a few places. Looking for suggestions for better test cases that cover all the edge cases. > Map-side Cogroup > > > Key: PIG-1309 > URL: https://issues.apache.org/jira/browse/PIG-1309 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: mapsideCogrp.patch, pig-1309.patch > > > In the never-ending quest to make Pig go faster, we want to parallelize as many > relational operations as possible. It's already possible to do Group-by > (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira > is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1298) Restore file traversal behavior to Pig loaders
[ https://issues.apache.org/jira/browse/PIG-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847630#action_12847630 ] Ashutosh Chauhan commented on PIG-1298: --- +1 > Restore file traversal behavior to Pig loaders > -- > > Key: PIG-1298 > URL: https://issues.apache.org/jira/browse/PIG-1298 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.7.0 > > Attachments: PIG-1298.patch, PIG-1298_1.patch > > > Given a location to a Pig loader, it is expected to recursively load all the > files under the location (i.e., all the files returned with "ls -R" > command). However, after the transition to using Hadoop 20 API, only files > returned with "ls" command are loaded. > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1309) Map-side Cogroup
Map-side Cogroup Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Attachments: mapsideCogrp.patch In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1309: -- Attachment: mapsideCogrp.patch Preliminary patch to discuss the approach. Not ready for inclusion yet. > Map-side Cogroup > > > Key: PIG-1309 > URL: https://issues.apache.org/jira/browse/PIG-1309 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: mapsideCogrp.patch > > > In the never-ending quest to make Pig go faster, we want to parallelize as many > relational operations as possible. It's already possible to do Group-by > (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira > is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1292: -- Resolution: Fixed Status: Resolved (was: Patch Available) Patch checked-in with changes suggested in previous comment. Core test failure reported by hudson was transient. It passed on my machine. > Interface Refinements > - > > Key: PIG-1292 > URL: https://issues.apache.org/jira/browse/PIG-1292 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: pig-1292.patch, pig-interfaces.patch > > > A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both > are abstract classes instead of being interfaces. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1292: -- Status: Patch Available (was: Open) Hudson is fickle recently. Hopefully, this patch gets lucky and is tested correctly. > Interface Refinements > - > > Key: PIG-1292 > URL: https://issues.apache.org/jira/browse/PIG-1292 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: pig-1292.patch, pig-interfaces.patch > > > A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both > are abstract classes instead of being interfaces. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1292: -- Attachment: pig-1292.patch I didn't follow the point about LoadMetadata and ResourceSchema. LoadMetadata is one of those interfaces which loaders can choose to implement. ResourceSchema is an independent class of its own. The new patch incorporates the changes suggested in the above comments. It also adds checks in the MRCompiler to enforce that the loader implements the new CollectableLoader interface if there is a map-side grouping (PIG-984) in the script. > Interface Refinements > - > > Key: PIG-1292 > URL: https://issues.apache.org/jira/browse/PIG-1292 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: pig-1292.patch, pig-interfaces.patch > > > A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both > are abstract classes instead of being interfaces. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844632#action_12844632 ] Ashutosh Chauhan commented on PIG-1292: --- One reason for not putting it in LoadFunc is to keep LoadFunc simple and not have such highly specific methods in there. We want to move such specialized capabilities away from LoadFunc into their own interfaces. This is also the reason PIG-966 decided to split LoadFunc into separate interfaces like LoadPushDown, LoadCaster, LoadMetadata etc. and not put all of them in LoadFunc. This frees LoadFunc implementers from thinking about them if they don't want to. And if one wants such a specific capability in a loader, one has to think about it anyway, whether it's in LoadFunc or in its own interface. That said, I agree that a boolean return value for the method seems confusing, so the method's return type should be void. > Interface Refinements > - > > Key: PIG-1292 > URL: https://issues.apache.org/jira/browse/PIG-1292 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: pig-interfaces.patch > > > A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both > are abstract classes instead of being interfaces. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844301#action_12844301 ] Ashutosh Chauhan commented on PIG-1292: --- Thanks for the review, Xuefu. 1. That's a valid point. Where possible we want loadfunc implementers to deal with Hadoop concepts and not with Pig concepts. 2. So, let's assume there is a loader which is capable of implementing this interface, but only if the underlying data is sorted (information which is available to the loader only at run time). This loader will implement the interface, indicating to Pig that it is capable of doing it. But just because it is capable of doing it doesn't necessarily imply it will do it (possibly for performance reasons). Then, when Pig calls the interface's method, it is communicating to the loader that it wants data in a particular fashion, no matter what. Inside this method, the loader may learn some metadata (for example that the data is not sorted, possibly by reading its schema or contacting some metadata repo) and decide that it can't honor the contract because of information which is available to it only at run time. The loader may then return false from the method. Pig may then choose to rewrite the query and still carry on the execution. Because of these scenarios, I think having a boolean return value is useful. What do you think? 3. Can't come up with a better name. Feel free to suggest one :) > Interface Refinements > - > > Key: PIG-1292 > URL: https://issues.apache.org/jira/browse/PIG-1292 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: pig-interfaces.patch > > > A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both > are abstract classes instead of being interfaces. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan reassigned PIG-1292: - Assignee: Ashutosh Chauhan > Interface Refinements > - > > Key: PIG-1292 > URL: https://issues.apache.org/jira/browse/PIG-1292 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: pig-interfaces.patch > > > A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both > are abstract classes instead of being interfaces. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1292: -- Attachment: pig-interfaces.patch A preview patch with suggested changes. > Interface Refinements > - > > Key: PIG-1292 > URL: https://issues.apache.org/jira/browse/PIG-1292 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: pig-interfaces.patch > > > A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both > are abstract classes instead of being interfaces. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844247#action_12844247 ] Ashutosh Chauhan commented on PIG-1292: --- Currently LoadFunc is an abstract class. OrderedLoadFunc is another abstract class which extends LoadFunc and adds a method which tells Pig in what order to read the splits. Similarly, there is IndexableLoadFunc, which also extends LoadFunc and adds the ability for the loader to seek to a position near specified keys. It's not hard to imagine a loader which can do both. Currently there can't be such a loader, since both of these are abstract classes and a Java class can extend only one. The proposal is to change them to interfaces. Further, a loader may also provide a guarantee that all instances of a key appear together in one split. Such a loader is required for map-side groups (PIG-984). Currently, it's assumed that the underlying loader provides data in the expected way. We should formalize this assumption by introducing a new interface and checking whether the loader implements it. > Interface Refinements > - > > Key: PIG-1292 > URL: https://issues.apache.org/jira/browse/PIG-1292 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan > Fix For: 0.7.0 > > > A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both > are abstract classes instead of being interfaces. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
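The proposal in PIG-1292 can be illustrated with a small sketch: once the optional capabilities are interfaces, one loader can extend the base class and implement both. All types below are hypothetical simplifications invented for the example; they are not Pig's LoadFunc, OrderedLoadFunc, or IndexableLoadFunc, whose real method signatures differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of "capabilities as interfaces"; not Pig's API.
final class LoaderInterfacesDemo {
    public static void main(String[] args) {
        SortedIndexedLoader loader = new SortedIndexedLoader("a", "c", "e");
        loader.seekNear("b");
        System.out.println("first key at or after b: " + loader.next());
        System.out.println("split position: " + loader.splitPosition("part-00003"));
    }
}

/** Stand-in for the single abstract base class every loader extends. */
abstract class SketchLoader {
    abstract String next();
}

/** Optional capability: tells the caller where a split falls in the order. */
interface SplitOrdered {
    int splitPosition(String splitName);
}

/** Optional capability: positions the loader near a given key. */
interface KeySeekable {
    void seekNear(String key);
}

/** One loader with both capabilities, possible because they are interfaces. */
final class SortedIndexedLoader extends SketchLoader implements SplitOrdered, KeySeekable {
    private final List<String> keys;
    private int cursor = 0;

    SortedIndexedLoader(String... sortedKeys) {
        this.keys = Arrays.asList(sortedKeys);
    }

    @Override
    String next() {
        return cursor < keys.size() ? keys.get(cursor++) : null;
    }

    @Override
    public int splitPosition(String splitName) {
        // Assumption for this sketch: names end in "-<number>", e.g. part-00003.
        return Integer.parseInt(splitName.substring(splitName.lastIndexOf('-') + 1));
    }

    @Override
    public void seekNear(String key) {
        cursor = 0;
        while (cursor < keys.size() && keys.get(cursor).compareTo(key) < 0) {
            cursor++;
        }
    }
}
```

A compiler-side check for a required capability then reduces to an instanceof test against the relevant interface, which is the kind of formal check the comment asks for.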
[jira] Created: (PIG-1292) Interface Refinements
Interface Refinements - Key: PIG-1292 URL: https://issues.apache.org/jira/browse/PIG-1292 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.7.0 A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both are abstract classes instead of being interfaces. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841391#action_12841391 ] Ashutosh Chauhan commented on PIG-1229: --- Sure. By the way, I am not sure whether the hsqldb license http://hsqldb.org/web/hsqlLicense.html is compatible with Apache. Though, I think if we are pulling it through ivy, we will be fine. Am I correct? > allow pig to write output into a JDBC db > > > Key: PIG-1229 > URL: https://issues.apache.org/jira/browse/PIG-1229 > Project: Pig > Issue Type: New Feature > Components: impl >Reporter: Ian Holsman >Assignee: Ankur >Priority: Minor > Fix For: 0.7.0 > > Attachments: hsqldb.jar, jira-1229.patch > > > UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841039#action_12841039 ] Ashutosh Chauhan commented on PIG-928: -- @Woody I agree frameworks will not be performant. I think their usefulness depends on what we want to achieve. If we want to support many different languages, then they might prove useful; if we are only interested in supporting a language or two (Python and Ruby seem to be the most popular ones), then it won't make sense to pay the overhead associated with them. > UDFs in scripting languages > --- > > Key: PIG-928 > URL: https://issues.apache.org/jira/browse/PIG-928 > Project: Pig > Issue Type: New Feature >Reporter: Alan Gates > Attachments: package.zip, scripting.tgz, scripting.tgz > > > It should be possible to write UDFs in scripting languages such as python, > ruby, etc. This frees users from needing to compile Java, generate a jar, > etc. It also opens Pig to programmers who prefer scripting languages over > Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841033#action_12841033 ] Ashutosh Chauhan commented on PIG-928: -- @Prasen bq. can we not implement it along the lines of DEFINE commands. Yes, this functionality could be partially simulated using a DEFINE / Streaming combination. But that may not be the most efficient way to achieve it. First of all, the streaming script would run in a separate process (as opposed to the same JVM in the approaches discussed above), so there will be a CPU cost involved in moving data between the Java process and the stream script process. Then, there is the cost of serialization and deserialization of parameters. You lose all the type information of the parameters. Once you are in the same runtime you can start doing interesting things. Also, having scripts in define statements will get kludgy as soon as you start to do complicated things there. bq. no need to include scripting-specific jars (jython etc.) Do you mean included in the Pig distribution or in Pig's classpath at runtime? In either case that may not necessarily be a problem. For the first part, we can use ivy to pull the jars for us instead of including them in the distribution, and for the second part we can ship all the jars required by Pig to the compute nodes. > UDFs in scripting languages > --- > > Key: PIG-928 > URL: https://issues.apache.org/jira/browse/PIG-928 > Project: Pig > Issue Type: New Feature >Reporter: Alan Gates > Attachments: package.zip, scripting.tgz, scripting.tgz > > > It should be possible to write UDFs in scripting languages such as python, > ruby, etc. This frees users from needing to compile Java, generate a jar, > etc. It also opens Pig to programmers who prefer scripting languages over > Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841003#action_12841003 ] Ashutosh Chauhan commented on PIG-1229: --- Ankur, with the recent Load-Store interface changes, the patch doesn't compile. Can you regenerate it? And while you are at it, can you also make changes in ivy.xml so that hsqldb.jar is pulled over the internet instead of needing to be bundled with the Pig distribution? > allow pig to write output into a JDBC db > > > Key: PIG-1229 > URL: https://issues.apache.org/jira/browse/PIG-1229 > Project: Pig > Issue Type: New Feature > Components: impl >Reporter: Ian Holsman >Assignee: Ankur >Priority: Minor > Fix For: 0.7.0 > > Attachments: hsqldb.jar, jira-1229.patch > > > UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1265) Change LoadMetadata and StoreMetadata to use Job instead of Configuraiton and add a cleanupOnFailure method to StoreFuncInterface
[ https://issues.apache.org/jira/browse/PIG-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839938#action_12839938 ] Ashutosh Chauhan commented on PIG-1265: --- Looked at the diff of the first and second patch. +1 for that part. One thing unrelated to the patch that I want to highlight: these kinds of issues creep in because our API is too wide. There are multiple ways of getting the same thing done in Pig (executing a script through the Java API in this case), each exercising different code paths. So the first thing we need is to establish what is part of our API and what is meant for Pig's internal purposes only. Is every single public method part of the API? Or only those which have properly documented Javadocs? Or only those documented on the wiki page? Once we establish what constitutes our public API, we should systematically start decreasing the visibility of those methods which are public but not meant as API, then deprecate them and eventually remove them. > Change LoadMetadata and StoreMetadata to use Job instead of Configuraiton and > add a cleanupOnFailure method to StoreFuncInterface > - > > Key: PIG-1265 > URL: https://issues.apache.org/jira/browse/PIG-1265 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Fix For: 0.7.0 > > Attachments: PIG-1265-2.patch, PIG-1265.patch > > > Speaking to the hadoop team folks, the direction in hadoop is to use Job > instead of Configuration - for example InputFormat/OutputFormat > implementations use Job to store input/output location. So pig should also do > the same in LoadMetadata and StoreMetadata to be closer to hadoop. > Currently when a job fails, pig assumes the output locations (corresponding > to the stores in the job) are hdfs locations and attempts to delete them. 
> Since output locations could be non hdfs locations, this cleanup should be > delegated to the StoreFuncInterface implementation - hence a new method - > cleanupOnFailure() should be introduced in StoreFuncInterface and a default > implementation should be provided in the StoreFunc abstract class which > checks if the location exists on hdfs and deletes it if so. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1251) Move SortInfo calculation earlier in compilation
[ https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839000#action_12839000 ] Ashutosh Chauhan commented on PIG-1251: --- @Dmitriy I agree. ResourceSchema encapsulates both SortInfo and PigSchema within it, so SortInfo should not really be exposed. It should be a private member of ResourceSchema, and where SortInfo is required, it should be accessed via ResourceSchema. I thought about doing all those changes in this patch, but the changes were more involved than I thought, so I backed off to keep the changes minimal for this patch. We can track that as a separate ticket. > Move SortInfo calculation earlier in compilation > - > > Key: PIG-1251 > URL: https://issues.apache.org/jira/browse/PIG-1251 > Project: Pig > Issue Type: Bug >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: pig-1251.patch, pig-1251_1.patch > > > In LSR Pig does Input Output Validation by calling hadoop's checkSpecs() A > storefunc might need schema to do such a validation. So, we should call > checkSchema() before doing the validation. checkSchema() in turn requires > SortInfo which is calculated later in compilation phase. We need to move it > earlier in compilation phase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front
[ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan resolved PIG-1216. --- Resolution: Fixed > New load store design does not allow Pig to validate inputs and outputs up > front > > > Key: PIG-1216 > URL: https://issues.apache.org/jira/browse/PIG-1216 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Alan Gates >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: pig-1216.patch, pig-1216_1.patch > > > In Pig 0.6 and before, Pig attempts to verify existence of inputs and > non-existence of outputs during parsing to avoid run time failures when > inputs don't exist or outputs can't be overwritten. The downside to this was > that Pig assumed all inputs and outputs were HDFS files, which made > implementation harder for non-HDFS based load and store functions. In the > load store redesign (PIG-966) this was delegated to InputFormats and > OutputFormats to avoid this problem and to make use of the checks already > being done in those implementations. Unfortunately, for Pig Latin scripts > that run more then one MR job, this does not work well. MR does not do > input/output verification on all the jobs at once. It does them one at a > time. So if a Pig Latin script results in 10 MR jobs and the file to store > to at the end already exists, the first 9 jobs will be run before the 10th > job discovers that the whole thing was doomed from the beginning. > To avoid this a validate call needs to be added to the new LoadFunc and > StoreFunc interfaces. Pig needs to pass this method enough information that > the load function implementer can delegate to InputFormat.getSplits() and the > store function implementer to OutputFormat.checkOutputSpecs() if s/he decides > to. 
Since 90% of all load and store functions use HDFS and PigStorage will > also need to, the Pig team should implement a default file existence check on > HDFS and make it available as a static method to other Load/Store function > implementers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
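The failure mode described in PIG-1216 (nine jobs run before the tenth discovers its output already exists) can be illustrated with a minimal sketch. This is not Pig's implementation: `find_existing_outputs`, `run_script`, and the `exists` / `run_job` callables are hypothetical names used only for illustration, and the existence check is passed in so that a non-HDFS store could supply its own.

```python
def find_existing_outputs(store_locations, exists):
    """Return the store locations that already exist (hypothetical helper).

    `exists` is any callable taking a location string and returning a bool,
    e.g. an HDFS check for file-based stores or a custom check otherwise.
    """
    return [loc for loc in store_locations if exists(loc)]


def run_script(jobs, exists, run_job):
    """Validate every store location first, then run the jobs in order.

    `jobs` is a list of (job_name, store_location) pairs. If any store
    location already exists, fail before the first job is launched.
    """
    conflicts = find_existing_outputs([loc for _, loc in jobs], exists)
    if conflicts:
        raise ValueError("output already exists: " + ", ".join(conflicts))
    for name, _ in jobs:
        run_job(name)
```

With a check like this, a script whose final store location already exists fails immediately instead of after all the earlier jobs have run.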
[jira] Updated: (PIG-1251) Move SortInfo calculation earlier in compilation
[ https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1251: -- Resolution: Fixed Fix Version/s: 0.7.0 Status: Resolved (was: Patch Available) Patch committed. > Move SortInfo calculation earlier in compilation > - > > Key: PIG-1251 > URL: https://issues.apache.org/jira/browse/PIG-1251 > Project: Pig > Issue Type: Bug >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: pig-1251.patch, pig-1251_1.patch > > > In LSR Pig does Input Output Validation by calling hadoop's checkSpecs() A > storefunc might need schema to do such a validation. So, we should call > checkSchema() before doing the validation. checkSchema() in turn requires > SortInfo which is calculated later in compilation phase. We need to move it > earlier in compilation phase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1251) Move SortInfo calculation earlier in compilation
[ https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1251: -- Attachment: pig-1251_1.patch Resynced with trunk and updated to address Daniel's comments. Will be committing it shortly. > Move SortInfo calculation earlier in compilation > - > > Key: PIG-1251 > URL: https://issues.apache.org/jira/browse/PIG-1251 > Project: Pig > Issue Type: Bug >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: pig-1251.patch, pig-1251_1.patch > > > In LSR Pig does Input Output Validation by calling hadoop's checkSpecs() A > storefunc might need schema to do such a validation. So, we should call > checkSchema() before doing the validation. checkSchema() in turn requires > SortInfo which is calculated later in compilation phase. We need to move it > earlier in compilation phase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1260) Param Subsitution results in parser error if there is no EOL after last line in script
[ https://issues.apache.org/jira/browse/PIG-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838551#action_12838551 ] Ashutosh Chauhan commented on PIG-1260: --- The workaround is to add an EOL after the last line in the script file. > Param Subsitution results in parser error if there is no EOL after last line > in script > -- > > Key: PIG-1260 > URL: https://issues.apache.org/jira/browse/PIG-1260 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan > Fix For: 0.7.0 > > > {noformat} > A = load '$INPUT' using PigStorage(':'); > B = foreach A generate $0 as id; > store B into '$OUTPUT' USING PigStorage(); > {noformat} > Invoking above script which contains no EOL in the last line of script as > following: > {noformat} > pig -param INPUT=mydata/input -param OUTPUT=mydata/output myscript.pig > {noformat} > results in parser error: > {noformat} > [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during > parsing. Lexical error at line 3, column 42. Encountered: after : "" > {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
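The workaround above can be applied mechanically before invoking pig. The following is a small sketch, not part of Pig: `ensure_trailing_newline` is a hypothetical helper name, and it only appends a single `\n` when the script file does not already end with one.

```python
import os


def ensure_trailing_newline(path):
    """Append a newline to the file at `path` if its last byte is not one.

    Returns True if the file was modified, False if it was left untouched.
    """
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # empty file: nothing to fix
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b"\n":
            return False  # already ends with an EOL
        f.seek(0, os.SEEK_END)  # make the write position explicit
        f.write(b"\n")
        return True
```

For example, calling `ensure_trailing_newline("myscript.pig")` before running the `pig -param ...` command from the report would add the missing EOL once and leave the file alone on later runs.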
[jira] Created: (PIG-1260) Param Subsitution results in parser error if there is no EOL after last line in script
Param Subsitution results in parser error if there is no EOL after last line in script -- Key: PIG-1260 URL: https://issues.apache.org/jira/browse/PIG-1260 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.7.0 {noformat} A = load '$INPUT' using PigStorage(':'); B = foreach A generate $0 as id; store B into '$OUTPUT' USING PigStorage(); {noformat} Invoking the above script, which has no EOL after its last line, as follows: {noformat} pig -param INPUT=mydata/input -param OUTPUT=mydata/output myscript.pig {noformat} results in a parser error: {noformat} [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 3, column 42. Encountered: after : "" {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1251) Move SortInfo calculation earlier in compilation
[ https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837534#action_12837534 ] Ashutosh Chauhan commented on PIG-1251: --- There seems to be some problem with Hudson. I ran the unit tests manually and they passed. Patch is ready for review. > Move SortInfo calculation earlier in compilation > - > > Key: PIG-1251 > URL: https://issues.apache.org/jira/browse/PIG-1251 > Project: Pig > Issue Type: Bug >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: pig-1251.patch > > > In LSR Pig does Input Output Validation by calling hadoop's checkSpecs() A > storefunc might need schema to do such a validation. So, we should call > checkSchema() before doing the validation. checkSchema() in turn requires > SortInfo which is calculated later in compilation phase. We need to move it > earlier in compilation phase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1254) up the memory for junit to run tests
up the memory for junit to run tests Key: PIG-1254 URL: https://issues.apache.org/jira/browse/PIG-1254 Project: Pig Issue Type: Bug Components: build Reporter: Ashutosh Chauhan Currently junit is configured to run with only 256M of memory. This is too low considering that most tests create a MiniCluster, run the Hadoop local job runner, run tests etc., all within the same JVM. This results in transient failures, longer test run times etc. This should be upped to at least 512M. build.xml: {noformat} {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1251) Move SortInfo calculation earlier in compilation
[ https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1251: -- Attachment: pig-1251.patch Patch which moves the SortInfo calculation from LogToPhyTranslation to SortInfoSetter. This facilitates calling checkSchema() before checkOutputSpecs() during store location validation. Result of test-patch {noformat} [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] {noformat} The Javadoc warning is unrelated to the patch. > Move SortInfo calculation earlier in compilation > - > > Key: PIG-1251 > URL: https://issues.apache.org/jira/browse/PIG-1251 > Project: Pig > Issue Type: Bug >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: pig-1251.patch > > > In LSR Pig does Input Output Validation by calling hadoop's checkSpecs() A > storefunc might need schema to do such a validation. So, we should call > checkSchema() before doing the validation. checkSchema() in turn requires > SortInfo which is calculated later in compilation phase. We need to move it > earlier in compilation phase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1251) Move SortInfo calculation earlier in compilation
[ https://issues.apache.org/jira/browse/PIG-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1251: -- Status: Patch Available (was: Open) > Move SortInfo calculation earlier in compilation > - > > Key: PIG-1251 > URL: https://issues.apache.org/jira/browse/PIG-1251 > Project: Pig > Issue Type: Bug >Reporter: Ashutosh Chauhan >Assignee: Ashutosh Chauhan > Attachments: pig-1251.patch > > > In LSR Pig does Input Output Validation by calling hadoop's checkSpecs() A > storefunc might need schema to do such a validation. So, we should call > checkSchema() before doing the validation. checkSchema() in turn requires > SortInfo which is calculated later in compilation phase. We need to move it > earlier in compilation phase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1251) Move SortInfo calculation earlier in compilation
Move SortInfo calculation earlier in compilation - Key: PIG-1251 URL: https://issues.apache.org/jira/browse/PIG-1251 Project: Pig Issue Type: Bug Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan In LSR Pig does Input Output Validation by calling hadoop's checkSpecs(). A storefunc might need the schema to do such a validation, so we should call checkSchema() before doing the validation. checkSchema() in turn requires SortInfo, which is calculated later in the compilation phase. We need to move it earlier in the compilation phase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836108#action_12836108 ] Ashutosh Chauhan commented on PIG-928: -- Hey Woody, Great work!! This will definitely be useful for a lot of Pig users. I have only looked at your work hastily. One question that struck me: you are doing a lot of heavy lifting to provide multi-language support, figuring out which language the user is asking for and then using reflection to load the appropriate interpreter and so on. I think it might be easier to use one of the frameworks here (BSF or javax.script), which hide this and allow handling of multiple languages transparently (at least, that's what they claim to do). Have you taken a look at them? These frameworks will arguably help us to provide support for more languages without maintaining a lot of code on our part, though I am sure they will come at a performance cost (certainly CPU and possibly memory too). > UDFs in scripting languages > --- > > Key: PIG-928 > URL: https://issues.apache.org/jira/browse/PIG-928 > Project: Pig > Issue Type: New Feature >Reporter: Alan Gates > Attachments: package.zip, scripting.tgz, scripting.tgz > > > It should be possible to write UDFs in scripting languages such as python, > ruby, etc. This frees users from needing to compile Java, generate a jar, > etc. It also opens Pig to programmers who prefer scripting languages over > Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1215) Make Hadoop jobId more prominent in the client log
[ https://issues.apache.org/jira/browse/PIG-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1215: -- Resolution: Fixed Status: Resolved (was: Patch Available) Patch checked-in. > Make Hadoop jobId more prominent in the client log > -- > > Key: PIG-1215 > URL: https://issues.apache.org/jira/browse/PIG-1215 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: pig-1215.patch, pig-1215.patch, pig-1215_1.patch, > pig-1215_3.patch, pig-1215_4.patch > > > This is a request from applications that want to be able to programmatically > parse client logs to find hadoop Ids. > The woould like to see each job id on a separate line in the following format: > hadoopJobId: job_123456789 > They would also like to see the jobs in the order they are executed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
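Since PIG-1215 exists so that applications can parse job ids out of the client log, here is a minimal sketch of such a parser. It assumes only the `hadoopJobId: job_...` line format quoted in the issue; the function name `extract_job_ids` and the tolerance for a log prefix (timestamp, level) before the marker are assumptions, not something the patch specifies.

```python
import re

# Matches the "hadoopJobId: job_123456789" format requested in the issue.
# Allowing arbitrary text before the marker is an assumption about the log layout.
JOB_ID_LINE = re.compile(r"hadoopJobId:\s*(job_\w+)")


def extract_job_ids(log_text):
    """Return the Hadoop job ids found in `log_text`, in order of appearance."""
    job_ids = []
    for line in log_text.splitlines():
        match = JOB_ID_LINE.search(line)
        if match:
            job_ids.append(match.group(1))
    return job_ids
```

Because each id is on its own line and the ids are logged in execution order, a line-by-line scan like this preserves the order in which the jobs ran.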