[jira] [Commented] (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454985#comment-13454985 ]

Jakob Homan commented on PIG-1748:
----------------------------------

@deb - questions like these should be directed to the pig user list, not JIRA. You'll receive assistance there.

Add load/store function AvroStorage for avro data
-------------------------------------------------

Key: PIG-1748
URL: https://issues.apache.org/jira/browse/PIG-1748
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: lin guo
Assignee: lin guo
Fix For: 0.9.0
Attachments: avro_storage.patch, AvroStorageUtils-bagfix.patch, avro_test_files.tar.gz, PIG-1748-2.patch, PIG-1748-3.patch

We want to use Pig to process arbitrary Avro data and store results as Avro files. AvroStorage() extends two PigFuncs: LoadFunc and StoreFunc.

Due to discrepancies between the Avro and Pig data models, AvroStorage has:
1. Limited support for records: we do not support recursively defined records because the number of fields in such records is data dependent.
2. Limited support for unions: we only accept nullable unions like [null, some-type].

For simplicity, we also make the following assumptions:
If the input directory is a leaf directory, then we assume the Avro data files in it have the same schema;
If the input directory contains sub-directories, then we assume the Avro data files in all sub-directories have the same schema.

AvroStorage takes no input parameters when used as a LoadFunc (except for debug [debug-level]). Users can provide parameters to AvroStorage when used as a StoreFunc. If they don't, the Avro schema of the output data is derived from its Pig schema.

Detailed documentation can be found in http://linkedin.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
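The union restriction in the description ("we only accept nullable union like [null, some-type]") can be illustrated with a small check. This is only a sketch and not the AvroStorage code: the class and method names are hypothetical, branch types are modeled as plain strings instead of Avro schema objects, and accepting the null branch in either position is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of AvroStorage or Avro.
final class NullableUnionSketch {
    // Accepts only two-branch unions in which exactly one branch is "null".
    // Assumption: the position of the null branch does not matter.
    static boolean isAcceptableUnion(List<String> branchTypes) {
        if (branchTypes.size() != 2) {
            return false;
        }
        boolean firstIsNull = "null".equals(branchTypes.get(0));
        boolean secondIsNull = "null".equals(branchTypes.get(1));
        return firstIsNull ^ secondIsNull;
    }

    public static void main(String[] args) {
        // A nullable string is accepted; a general int/string union is not.
        System.out.println(isAcceptableUnion(Arrays.asList("null", "string")));
        System.out.println(isAcceptableUnion(Arrays.asList("int", "string")));
    }
}
```

A union such as [int, string] has no single Pig type it could map to, which is presumably why only the nullable form is supported.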
[jira] [Commented] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure
[ https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429456#comment-13429456 ]

Jakob Homan commented on PIG-1891:
----------------------------------

This looks good to me. +1 on the patch, for what it's worth. This is what we're looking for. [~billgraham], how does this look to you?

Enable StoreFunc to make intelligent decision based on job success or failure
-----------------------------------------------------------------------------

Key: PIG-1891
URL: https://issues.apache.org/jira/browse/PIG-1891
Project: Pig
Issue Type: New Feature
Affects Versions: 0.10.0
Reporter: Alex Rovner
Priority: Minor
Labels: patch
Attachments: PIG-1891-1.patch

We are in the process of using Pig for various data processing and component integration tasks. Here is where we feel Pig storage funcs fall short: they are not aware of whether the overall job has succeeded. This creates a problem for storage funcs that need to upload results into another system: a DB, FTP, another file system, etc.

I looked at DBStorage in the piggybank (http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java?view=markup), and what I see is essentially a mechanism that does the following for each task:
1. Creates a record writer (in this case, opens a connection to the DB).
2. Opens a transaction.
3. Writes records into a batch.
4. Executes a commit or rollback depending on whether the task was successful.

While this approach works great at the task level, it does not work at all at the job level. If certain tasks succeed but the overall job fails, partial records are going to get uploaded into the DB.

Any ideas on a workaround?

Our current workaround is fairly ugly: we created a Java wrapper that launches Pig jobs and then uploads to the DBs once the Pig job is successful. While the approach works, it's not really integrated into Pig.
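The gap described in PIG-1891, per-task commits with no job-level view, can be sketched as follows. This illustrates the requested behavior only and is not the attached patch: every class and method name here is hypothetical, and the in-memory staging map stands in for whatever an external system (DB, FTP, another file system) would provide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names are Pig or PIG-1891 APIs.
final class JobLevelCommitSketch {
    private final Map<String, List<String>> staged = new LinkedHashMap<>();
    private final List<String> published = new ArrayList<>();

    // Called once per successful task: buffer its records instead of uploading them.
    void stageTaskOutput(String taskId, List<String> records) {
        staged.put(taskId, new ArrayList<>(records));
    }

    // Called once, after the whole job has finished.
    void onJobFinished(boolean jobSucceeded) {
        if (jobSucceeded) {
            for (List<String> records : staged.values()) {
                published.addAll(records);
            }
        }
        staged.clear(); // on failure the staged records are simply discarded
    }

    List<String> published() {
        return published;
    }

    public static void main(String[] args) {
        JobLevelCommitSketch sink = new JobLevelCommitSketch();
        sink.stageTaskOutput("task-0", Arrays.asList("r1", "r2"));
        sink.onJobFinished(false); // some other task failed, so the job failed
        System.out.println(sink.published()); // nothing was published
    }
}
```

The point of the sketch is that the publish decision moves from the end of each task to a single hook that knows the outcome of the whole job.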
[jira] [Updated] (PIG-2031) NPE in TOP
[ https://issues.apache.org/jira/browse/PIG-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakob Homan updated PIG-2031:
----------------------------

Assignee: Jacob Perkins

NPE in TOP
----------

Key: PIG-2031
URL: https://issues.apache.org/jira/browse/PIG-2031
Project: Pig
Issue Type: Bug
Reporter: Jacob Perkins
Assignee: Jacob Perkins
Labels: newbie
Attachments: toppatch.txt

If a NULL DataBag is passed to org.apache.pig.builtin.TOP then an NPE is thrown. Consider:

{code}
$: cat foo.tsv
a	{(foo,1),(bar,2)}
b
c	{(fyha,4),(asdf,9)}
{code}

then:

{code}
data = LOAD 'foo.tsv' AS (key:chararray, a_bag:bag {t:tuple (name:chararray, value:int)});
tpd = FOREACH data {
  top_n = TOP(1, 1, a_bag);
  GENERATE key AS key, top_n AS top_n;
};
DUMP tpd;
{code}

will throw an NPE when it gets to the row with no bag.
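The failure mode above comes down to a missing null check before the bag is iterated. The sketch below shows such a guard using plain Java collections in place of Pig's DataBag and Tuple. It is not the content of toppatch.txt; the class name is hypothetical, and returning null for a null bag is an assumption (returning an empty bag would be another reasonable choice).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the null check; not the code in toppatch.txt.
final class TopNullGuardSketch {
    // Returns the n rows with the largest numeric value in column fieldIndex,
    // or null when the bag itself is null (assumed behavior).
    static List<List<Object>> top(int n, int fieldIndex, List<List<Object>> bag) {
        if (bag == null) {
            return null; // the guard: a missing bag yields null instead of an NPE
        }
        List<List<Object>> sorted = new ArrayList<>(bag);
        sorted.sort(Comparator.comparingLong(
                (List<Object> row) -> ((Number) row.get(fieldIndex)).longValue()).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
                Arrays.<Object>asList("foo", 1),
                Arrays.<Object>asList("bar", 2));
        System.out.println(top(1, 1, bag));  // row "a" from foo.tsv
        System.out.println(top(1, 1, null)); // row "b": no bag, no exception
    }
}
```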
[jira] [Updated] (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakob Homan updated PIG-1748:
----------------------------

Assignee: lin guo (was: Jakob Homan)

Add load/store function AvroStorage for avro data
-------------------------------------------------

Key: PIG-1748
URL: https://issues.apache.org/jira/browse/PIG-1748
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: lin guo
Assignee: lin guo
Fix For: 0.9.0
Attachments: PIG-1748-2.patch, PIG-1748-3.patch, avro_storage.patch, avro_test_files.tar.gz
[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage
[ https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030845#comment-13030845 ]

Jakob Homan commented on PIG-1890:
----------------------------------

@Ken - any update now that we're in a new week?

Fix piggybank unit test TestAvroStorage
---------------------------------------

Key: PIG-1890
URL: https://issues.apache.org/jira/browse/PIG-1890
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.9.0
Reporter: Daniel Dai
Assignee: Jakob Homan
Fix For: 0.9.0
Attachments: PIG-1890-1.patch

TestAvroStorage fails on trunk. There are two reasons:
1. After PIG-1680, we call LoadFunc.setLocation one more time.
2. The schema for AvroStorage seems to be wrong. For example, in the first test case, testArrayDefault, the schema for "in" is set to PIG_WRAPPER: (FIELD: {PIG_WRAPPER: (ARRAY_ELEM: float)}). It seems PIG_WRAPPER is redundant. This issue was hidden until PIG-1188 was checked in.
[jira] Commented: (PIG-1872) Fix bug in AvroStorage
[ https://issues.apache.org/jira/browse/PIG-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999689#comment-12999689 ]

Jakob Homan commented on PIG-1872:
----------------------------------

+1. Looks good to me.

Fix bug in AvroStorage
----------------------

Key: PIG-1872
URL: https://issues.apache.org/jira/browse/PIG-1872
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: lin guo
Priority: Minor
Fix For: 0.9.0
Attachments: my.diff

AvroStorageUtils.containsRecursiveRecord() has a bug and returns true for a record with multiple fields of the same type, e.g.

{ type:record, name:Event,
  fields:[{name:f1, type:{ type:record,name:EntityID, }},
          {name:f2,type:EntityID},
          {name:f3,type:EntityID} ]}

Patch contains bug fix and unit tests.
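The report does not show the faulty code, so the following is only a guess at the shape of the problem. One common way to get this false positive is to keep a single "already seen" set for the whole traversal, which flags EntityID the second time it appears as a sibling field. The sketch tracks names only along the current path from the root and removes them on backtracking. The Rec class and all names are hypothetical stand-ins for Avro schemas, and this is not the content of my.diff.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not AvroStorageUtils and not the attached patch.
final class RecursiveRecordSketch {
    // Minimal schema model: a named record whose fields are other records.
    static final class Rec {
        final String name;
        final List<Rec> fields = new ArrayList<>();
        Rec(String name) { this.name = name; }
    }

    // A record is recursive only if its name reappears on the current path
    // from the root, so names are removed again when the walk backtracks.
    static boolean containsRecursiveRecord(Rec rec, Set<String> path) {
        if (!path.add(rec.name)) {
            return true; // name already on the path: genuine self-reference
        }
        for (Rec field : rec.fields) {
            if (containsRecursiveRecord(field, path)) {
                return true;
            }
        }
        path.remove(rec.name); // backtrack so sibling fields may reuse this type
        return false;
    }

    static boolean containsRecursiveRecord(Rec rec) {
        return containsRecursiveRecord(rec, new HashSet<String>());
    }

    public static void main(String[] args) {
        Rec entityId = new Rec("EntityID");
        Rec event = new Rec("Event");
        event.fields.add(entityId); // f1
        event.fields.add(entityId); // f2
        event.fields.add(entityId); // f3
        System.out.println(containsRecursiveRecord(event)); // repeated type, not recursive

        Rec node = new Rec("Node");
        node.fields.add(node); // a record that refers to itself
        System.out.println(containsRecursiveRecord(node));
    }
}
```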
[jira] Updated: (PIG-1872) Fix bug in AvroStorage
[ https://issues.apache.org/jira/browse/PIG-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakob Homan updated PIG-1872:
----------------------------

Fix Version/s: site (was: 0.9.0)
Affects Version/s: site (was: 0.9.0)
Status: Patch Available (was: Open)

Fix bug in AvroStorage
----------------------

Key: PIG-1872
URL: https://issues.apache.org/jira/browse/PIG-1872
Project: Pig
Issue Type: Bug
Affects Versions: site
Reporter: lin guo
Priority: Minor
Fix For: site
Attachments: my.diff
[jira] Updated: (PIG-1872) Fix bug in AvroStorage
[ https://issues.apache.org/jira/browse/PIG-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakob Homan updated PIG-1872:
----------------------------

Priority: Major (was: Minor)

Fix bug in AvroStorage
----------------------

Key: PIG-1872
URL: https://issues.apache.org/jira/browse/PIG-1872
Project: Pig
Issue Type: Bug
Affects Versions: site
Reporter: lin guo
Fix For: site
Attachments: my.diff
[jira] Created: (PIG-1833) Contrib's build.xml points to an invalid hadoop-conf
Contrib's build.xml points to an invalid hadoop-conf
----------------------------------------------------

Key: PIG-1833
URL: https://issues.apache.org/jira/browse/PIG-1833
Project: Pig
Issue Type: Bug
Reporter: Jakob Homan

As discovered in testing PIG-1748, the build.xml in the contrib/piggybank/java module has {{junit.hadoop.conf}}, which points to {{${user.home}/pigtest/conf/}}. In this directory is a hadoop-conf.xml that defines a value for {{fs.default.name}} which is valid during the regular test runs but not for the contrib modules. As a result, any tests in contrib that try to access a non-fully-qualified file via FileSystem will be routed to this value and will then fail when they can't reach it. If, however, one runs the tests directly from the contrib module without the pigtest directory existing, the tests will pass.

Do any of the contrib modules actually need this variable? If not, it should be removed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985972#action_12985972 ]

Jakob Homan commented on PIG-1748:
----------------------------------

@Scott I can't say I'm convinced, and am in fact more concerned from your example, given that this approach essentially builds dependencies on all of those projects into Avro. However, this JIRA isn't the best place to discuss this. Is there a discussion about this type of integration going on in Avro that the community can contribute to? Is there a JIRA? Thanks.

Add load/store function AvroStorage for avro data
-------------------------------------------------

Key: PIG-1748
URL: https://issues.apache.org/jira/browse/PIG-1748
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: lin guo
Assignee: Jakob Homan
Attachments: avro_storage.patch, avro_test_files.tar.gz, PIG-1748-2.patch

Detailed documentation can be found in http://snaprojects.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data
[jira] Updated: (PIG-1748) Add load/store function AvroStorage for avro data
[ https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakob Homan updated PIG-1748:
----------------------------

Attachment: avro_test_files.tar.gz

Attaching binary test avro files used by unit tests. Need to be untgz'ed and placed in contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files by reviewer/committer

Add load/store function AvroStorage for avro data
-------------------------------------------------

Key: PIG-1748
URL: https://issues.apache.org/jira/browse/PIG-1748
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: lin guo
Attachments: avro_storage.patch, avro_test_files.tar.gz, PIG-1748-2.patch
[jira] Updated: (PIG-1749) Update Pig parser so that function arguments can contain newline characters
[ https://issues.apache.org/jira/browse/PIG-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakob Homan updated PIG-1749:
----------------------------

Attachment: PIG-1749-2.patch

I'll be finishing this patch for Lin. Updated patch. Added a test to TestQueryParser, which required updating the new grammar (my grammar skills are a bit rusty), and removed the reference to AvroStorage but left in its details. These references don't actually call out to the class and are just used for filler purposes.

Update Pig parser so that function arguments can contain newline characters
---------------------------------------------------------------------------

Key: PIG-1749
URL: https://issues.apache.org/jira/browse/PIG-1749
Project: Pig
Issue Type: Improvement
Reporter: lin guo
Attachments: parser.patch, PIG-1749-2.patch

We want to add this feature so that users can put long function argument strings on multiple lines. PIG-1748 depends on this.