[jira] [Commented] (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454985#comment-13454985 ]

Jakob Homan commented on PIG-1748:
----------------------------------

@deb - questions like these should be directed to the pig user list, not JIRA. You'll receive assistance there.

> Add load/store function AvroStorage for avro data
> -------------------------------------------------
>
> Key: PIG-1748
> URL: https://issues.apache.org/jira/browse/PIG-1748
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: lin guo
> Assignee: lin guo
> Fix For: 0.9.0
>
> Attachments: avro_storage.patch, AvroStorageUtils-bagfix.patch, avro_test_files.tar.gz, PIG-1748-2.patch, PIG-1748-3.patch
>
>
> We want to use Pig to process arbitrary Avro data and store results as Avro files. AvroStorage() extends two PigFuncs: LoadFunc and StoreFunc.
> Due to discrepancies of Avro and Pig data models, AvroStorage has:
> 1. Limited support for "record": we do not support recursively defined record because the number of fields in such records is data dependent.
> 2. Limited support for "union": we only accept nullable union like ["null", "some-type"].
> For simplicity, we also make the following assumptions:
> If the input directory is a leaf directory, then we assume Avro data files in it have the same schema;
> If the input directory contains sub-directories, then we assume Avro data files in all sub-directories have the same schema.
> AvroStorage takes no input parameters when used as a LoadFunc (except for "debug [debug-level]").
> Users can provide parameters to AvroStorage when used as a StoreFunc. If they don't, Avro schema of output data is derived from its Pig schema.
> Detailed documentation can be found in http://linkedin.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
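The issue description's second limitation (only nullable unions such as ["null", "some-type"] are accepted) can be illustrated with a minimal sketch. It models a union as a plain list of type names rather than using the Avro API, and `NullableUnionCheck` and `isNullableUnion` are hypothetical names that do not come from the patch. Accepting "null" in either position is an assumption of this sketch, not something the description states.

```java
import java.util.List;

/** Hypothetical helper, not part of AvroStorage: a union is modeled as a list of type names. */
final class NullableUnionCheck {

    /**
     * Returns true for a two-branch union in which exactly one branch is "null",
     * for example ["null", "string"]. Allowing "null" in either position is an
     * assumption of this sketch.
     */
    static boolean isNullableUnion(List<String> branches) {
        if (branches.size() != 2) {
            return false;
        }
        boolean firstIsNull = "null".equals(branches.get(0));
        boolean secondIsNull = "null".equals(branches.get(1));
        return firstIsNull != secondIsNull; // exactly one "null" branch
    }

    public static void main(String[] args) {
        System.out.println(isNullableUnion(List.of("null", "string")));        // true
        System.out.println(isNullableUnion(List.of("int", "string")));         // false
        System.out.println(isNullableUnion(List.of("null", "int", "string"))); // false
    }
}
```

A union like ["int", "string"] has no single Pig type to map to, which is presumably why the description restricts support to the nullable form.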
[jira] [Commented] (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454762#comment-13454762 ]

deb ashish commented on PIG-1748:
---------------------------------

REGISTER /path/avro-1.4.1.jar
REGISTER /path/json-simple-1.1.jar
REGISTER /path/piggybank.jar
REGISTER /path/jackson-core-asl-1.5.5.jar
REGISTER /path/jackson-mapper-asl-1.5.5.jar
avro = LOAD '/hdfs path/part-r-0.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

I'm trying this code, but it is unable to read the Avro file and shows the following exception:

Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: Error: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias sc. Backend error : Unable to recreate exception from backed error: Error: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
        at org.apache.pig.PigServer.openIterator(PigServer.java:742)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:612)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
        at org.apache.pig.Main.run(Main.java:406)
        at org.apache.pig.Main.main(Main.java:107)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: Error: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:337)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:378)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1198)
        at org.apache.pig.PigServer.storeEx(PigServer.java:874)
        at org.apache.pig.PigServer.store(PigServer.java:816)
        at org.apache.pig.PigServer.openIterator(PigServer.java:728)
        ... 7 more

Please help me asap.
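The message "Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected" is the kind of error the JVM raises when code compiled against a version of a type that is an interface runs against a version where that type is a class. TaskAttemptContext is a class in the Hadoop 0.20/1.x line and an interface in the 0.23/2.x line, so one plausible reading is that the registered jar and the cluster were built against different Hadoop lines. The thread does not say which versions were involved. The sketch below is a generic diagnostic for checking what a given classpath actually provides; `TypeKindCheck` and `kindOf` are hypothetical names.

```java
/** Diagnostic sketch: reports whether a named type on the classpath is a class or an interface. */
final class TypeKindCheck {

    /** Returns "interface", "class", or "unavailable" for a fully qualified type name. */
    static String kindOf(String typeName) {
        try {
            // initialize=false: inspect the type without running its static initializers
            Class<?> type = Class.forName(typeName, false, TypeKindCheck.class.getClassLoader());
            return type.isInterface() ? "interface" : "class";
        } catch (ClassNotFoundException | LinkageError e) {
            // The type is not on this classpath, or could not be linked
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        // Defaults to the type named in the error; pass another name as the first argument
        String name = args.length > 0 ? args[0] : "org.apache.hadoop.mapreduce.TaskAttemptContext";
        System.out.println(name + " -> " + kindOf(name));
    }
}
```

Run with the same classpath the Pig job uses, this prints "interface" or "class" for the Hadoop jars actually loaded, which can then be compared with the Hadoop line the registered jar was built for.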
[jira] [Commented] (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164848#comment-13164848 ]

Russell Jurney commented on PIG-1748:
-------------------------------------

Thanks, Doug. Created https://issues.apache.org/jira/browse/PIG-2411 and submitted patch.
[jira] [Commented] (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164828#comment-13164828 ]

Doug Cutting commented on PIG-1748:
-----------------------------------

Russell, this patch looks good to me, except for the print statement. But you should probably open a new issue and add it there, as this issue has already been committed, closed and released.
[jira] [Commented] (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164217#comment-13164217 ]

Russell Jurney commented on PIG-1748:
-------------------------------------

I've attached a patch that fixes a bug I ran into in serializing bags of tuples.
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988893#comment-12988893 ]

Daniel Dai commented on PIG-1748:
---------------------------------

Patch committed to trunk. Thanks Lin, Jakob!
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988253#action_12988253 ]

Dmitriy V. Ryaboy commented on PIG-1748:
----------------------------------------

The TestPigStorageSchema thing is mine, someone else just opened a ticket. Will fix.
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986551#action_12986551 ]

Felix Gao commented on PIG-1748:
--------------------------------

I noticed the avro loader does not support file globbing.

log_load = LOAD '/user/felix/avro/access_log.test.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();  <--- works fine

but

log_load = LOAD '/user/felix/avro/*.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

ERROR 1018: Problem determining schema during load

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Problem determining schema during load
        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1342)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1286)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:460)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:738)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:163)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:139)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
        at org.apache.pig.Main.main(Main.java:414)
Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem determining schema during load
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:752)
        at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1336)
        ... 8 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018: Problem determining schema during load
        at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:156)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:750)
        ... 10 more
Caused by: java.io.FileNotFoundException: File does not exist: /user/felix/avro/*.avro
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1586)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1577)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:428)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:185)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:431)
        at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:181)
        at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:133)
        at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:108)
        at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:233)
        at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:169)
        at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:150)
        ... 11 more
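The innermost cause in the trace above shows the literal pattern /user/felix/avro/*.avro being passed to FileSystem.open, so the glob was never expanded before the schema was read. A minimal sketch of the missing step follows: expand the pattern to concrete files first, then pick one file to read the schema from. It uses java.nio on a local directory so that it runs without Hadoop; on HDFS the analogous call would be Hadoop's `FileSystem.globStatus(Path)`. `GlobExpansion`, `expand`, and `firstMatch` are hypothetical names, and this is not the fix that was eventually applied to AvroStorage.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: expand a glob to concrete files before opening anything. */
final class GlobExpansion {

    /** Expands a glob such as "*.avro" within dir into a sorted list of regular files. */
    static List<Path> expand(Path dir, String glob) throws IOException {
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path candidate : stream) {
                if (Files.isRegularFile(candidate)) {
                    matches.add(candidate);
                }
            }
        }
        Collections.sort(matches); // deterministic choice of the schema file
        return matches;
    }

    /** Picks the file whose schema would be read; fails clearly when nothing matches. */
    static Path firstMatch(Path dir, String glob) throws IOException {
        List<Path> matches = expand(dir, glob);
        if (matches.isEmpty()) {
            throw new FileNotFoundException("No files match " + dir + "/" + glob);
        }
        return matches.get(0);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("avro-glob-demo");
        Files.createFile(dir.resolve("b.avro"));
        Files.createFile(dir.resolve("a.avro"));
        Files.createFile(dir.resolve("notes.txt"));
        System.out.println(firstMatch(dir, "*.avro").getFileName()); // a.avro
    }
}
```

Reading the schema from a single matched file relies on the issue description's stated assumption that all Avro data files in the input share one schema.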
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986204#action_12986204 ]

Scott Carey commented on PIG-1748:
----------------------------------

@Jakob
{quote}
I can't say I'm convinced, and am in fact more concerned from your example, given that this approach essentially builds dependencies on all of those projects into Avro.
{quote}

Avro is completely modularized now, so there would not be any dependency mess like that. It is now easy to add separate modules such as 'avro-pig.jar' or 'avro-hive.jar'. It already has 'avro-mapred.jar'.
https://cwiki.apache.org/confluence/display/AVRO/Build+Documentation#BuildDocumentation-Java

As this is getting off topic, we can use the Avro developer mailing list. Related issues are https://issues.apache.org/jira/browse/AVRO-647 and the issues linked to it, as well as https://issues.apache.org/jira/browse/AVRO-592. There is no ticket yet on the broader scope stuff.
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985972#action_12985972 ]

Jakob Homan commented on PIG-1748:
----------------------------------

@Scott I can't say I'm convinced, and am in fact more concerned from your example, given that this approach essentially builds dependencies on all of those projects into Avro. However, this JIRA isn't the best place to discuss this. Is there a discussion about this type of integration going on in Avro that the community can contribute to? Is there a JIRA? Thanks.
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985958#action_12985958 ]

Scott Carey commented on PIG-1748:
----------------------------------

@Jakob
Of course projects can do what they wish. I'm simply hoping many can collaborate together on this general problem category.

{quote}
This seems like an odd approach to me, essentially inverting the domain knowledge of each application to Avro, rather than the application itself where its developers frolic and work. Is there something I'm missing here?
{quote}

Writing a Pig storage adapter requires Avro domain knowledge and Pig domain knowledge. I found that it required more knowledge of Avro than Pig to do well.

If all you ever want to achieve is:
Pig -> Avro file -> Pig,
then maybe it doesn't matter who hosts it. But what if you want to do:
Pig -> Avro file -> Cascading -> Avro file -> Hive -> Avro file -> Pig ?
Now which project should host what defines how all those data models can interact through a common schema system? pig contrib? hive contrib? howl? cascading (gpl . . .)?

In the longer term, the common elements needed by all of the above can crystallize out into an avro module general to all, and individual modules hosted by each project can use that. What that might look like won't be apparent until there are enough example use cases, however.
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985936#action_12985936 ]

Jakob Homan commented on PIG-1748:
----------------------------------

@Daniel- Let me take a look.

@Scott - It's worth noting that projects can include Avro support as they wish, just as Avro can incorporate that work as it wishes. But I'm not sure I understand. You're saying that you'd rather have any higher-level application supporting Avro to have that support hosted in Avro, rather than treating it as a library to be included? This seems like an odd approach to me, essentially inverting the domain knowledge of each application to Avro, rather than the application itself where its developers frolic and work. Is there something I'm missing here?
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985030#action_12985030 ]

Daniel Dai commented on PIG-1748:
---------------------------------

To Jakob:
I get 2 failures in TestAvroStorage: testArrayWithSame and testRecordWithFieldSchema. The error messages are similar:

could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments '[same, src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_array.avro]'

java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments '[same, src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_array.avro]'
        at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:494)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.NonEvalFuncSpec(QueryParser.java:5660)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:4034)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1501)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:1013)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:825)
        at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1708)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1658)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:546)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:570)
        at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.testAvroStorage(Unknown Source)
        at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.testArrayWithSame(Unknown Source)
Caused by: java.lang.reflect.InvocationTargetException
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:484)
Caused by: java.net.ConnectException: Call to localhost.localdomain/127.0.0.1:57284 failed on connection exception: java.net.ConnectException: Connection refused
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:767)
        at org.apache.hadoop.ipc.Client.call(Client.java:743)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy9.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:180)
        at org.apache.pig.piggybank.storage.avro.AvroStorage.init(AvroStorage.java:372)
        at org.apache.pig.piggybank.storage.avro.AvroStorage.<init>(AvroStorage.java:110)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)
        at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:860)
        at org.apache.hadoop.ipc.Client.call(Client.java:720)

Do you know what happened?
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985022#action_12985022 ]

Scott Carey commented on PIG-1748:
----------------------------------

About plans on the Avro side:

I plan on merging my work with this (great!) work into the Avro project. In the long run the Avro project is a better place for this for several reasons, but in the short term it does not matter. It will be some time before it is available from Avro.

* Avro is fully mavenized in 1.5.0 (due out in a few weeks), meaning it is easy to add sub-module jars such as 'avro-pig.jar'. Furthermore, it's easy to have multiple versions for each version of Pig if needed. For example, we could simultaneously release avro-pig0.7.jar, avro-pig0.8.jar, etc. as part of Avro 1.6.0 if that were necessary due to API breakage or extra features enabled in newer versions of Pig.
* A lot of the work here is applicable to multiple systems; I plan to share code with Avro Hive SerDes when those are implemented. This may lead to a general module that helps projects translate their schemas to Avro and back.

None of this impacts the work here in the short term, but I'm sure people will be interested in these plans and may have other ideas or suggestions on how to work on this in a way that is not too fragmented.

> Add load/store function AvroStorage for avro data
> -------------------------------------------------
>
>                 Key: PIG-1748
>                 URL: https://issues.apache.org/jira/browse/PIG-1748
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: lin guo
>            Assignee: Jakob Homan
>         Attachments: avro_storage.patch, avro_test_files.tar.gz, PIG-1748-2.patch
>
>
> We want to use Pig to process arbitrary Avro data and store results as Avro files. AvroStorage() extends two PigFuncs: LoadFunc and StoreFunc.
> Due to discrepancies of Avro and Pig data models, AvroStorage has:
> 1. Limited support for "record": we do not support recursively defined record because the number of fields in such records is data dependent.
> 2. Limited support for "union": we only accept nullable union like ["null", "some-type"].
> For simplicity, we also make the following assumptions:
> If the input directory is a leaf directory, then we assume Avro data files in it have the same schema;
> If the input directory contains sub-directories, then we assume Avro data files in all sub-directories have the same schema.
> AvroStorage takes no input parameters when used as a LoadFunc (except for "debug [debug-level]").
> Users can provide parameters to AvroStorage when used as a StoreFunc. If they don't, Avro schema of output data is derived from its Pig schema.
> Detailed documentation can be found in http://snaprojects.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
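
As an illustration of the "nullable union" restriction in the quoted issue description, here is a minimal Java sketch of the check. It is not taken from the AvroStorage patch: the class and method names are hypothetical, union branches are modelled as plain type-name strings rather than Avro Schema objects, and accepting "null" in either position is an assumption, since the description only shows ["null", "some-type"].

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of AvroStorage: branches are type names as strings.
public class NullableUnionCheck {

    /** True when the union has exactly two branches and exactly one of them is "null". */
    static boolean isNullableUnion(List<String> branches) {
        if (branches.size() != 2) {
            return false;
        }
        long nulls = branches.stream().filter("null"::equals).count();
        return nulls == 1;
    }

    /** Returns the non-null branch, or throws if the union is not of the supported shape. */
    static String nonNullBranch(List<String> branches) {
        if (!isNullableUnion(branches)) {
            throw new IllegalArgumentException("unsupported union: " + branches);
        }
        return "null".equals(branches.get(0)) ? branches.get(1) : branches.get(0);
    }

    public static void main(String[] args) {
        // Supported shape: ["null", "some-type"]
        System.out.println(isNullableUnion(Arrays.asList("null", "string")));
        // Not supported: no "null" branch
        System.out.println(isNullableUnion(Arrays.asList("int", "string")));
        // Not supported: more than two branches
        System.out.println(isNullableUnion(Arrays.asList("null", "int", "string")));
        // The Pig-side field type would be derived from the non-null branch
        System.out.println(nonNullBranch(Arrays.asList("null", "long")));
    }
}
```

Under these assumptions, a union such as ["int", "string"] is rejected up front, which matches the stated limitation that only nullable unions are accepted.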
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984964#action_12984964 ]

Daniel Dai commented on PIG-1748:
---------------------------------

OK, got it; it is in a separate file.
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984963#action_12984963 ]

Daniel Dai commented on PIG-1748:
---------------------------------

It seems you forgot contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_array.avro
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970011#action_12970011 ]

lin guo commented on PIG-1748:
------------------------------

Thanks... and I will update it to use the latest version.

Best,
Lin
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969880#action_12969880 ]

Doug Cutting commented on PIG-1748:
-----------------------------------

I skimmed this, and overall it looks great! The only thing I noticed is that it should probably depend on avro-1.4.1 rather than 1.4.0.