[jira] Updated: (PIG-1015) [piggybank] DateExtractor should take into account timezones
[ https://issues.apache.org/jira/browse/PIG-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1015: --- Fix Version/s: 0.6.0 Status: Patch Available (was: Open) [piggybank] DateExtractor should take into account timezones Key: PIG-1015 URL: https://issues.apache.org/jira/browse/PIG-1015 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.6.0 Attachments: date_extractor.patch The current implementation defaults to the local timezone when parsing strings, thereby providing inconsistent results depending on the settings of the computer the program is executing on (this is causing unit test failures). We should set the timezone to a consistent default, and allow users to override this default. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
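The fix described above -- a consistent default timezone with a user override -- can be sketched as follows. This is an illustrative sketch only, not the actual piggybank DateExtractor code; the class name, the UTC default, and the constructor shape are all assumptions made for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Illustrative sketch only -- not the actual piggybank DateExtractor.
// Parsing is pinned to a fixed zone (UTC here, an assumed default) so
// results do not depend on the machine's local timezone settings;
// callers can override the default explicitly.
public class ZonedDateParser {
    private final SimpleDateFormat format;

    public ZonedDateParser(String pattern) {
        this(pattern, TimeZone.getTimeZone("UTC")); // consistent default
    }

    public ZonedDateParser(String pattern, TimeZone tz) {
        this.format = new SimpleDateFormat(pattern);
        this.format.setTimeZone(tz);                // user-supplied override
    }

    public long parseMillis(String text) {
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("unparseable date: " + text, e);
        }
    }
}
```

With the zone pinned, the same input string yields the same epoch millis on every machine, which is exactly the property the failing unit tests rely on.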
[jira] Commented: (PIG-868) indexof / lastindexof / lower / replace / substring udf's
[ https://issues.apache.org/jira/browse/PIG-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764533#action_12764533 ] Dmitriy V. Ryaboy commented on PIG-868: --- The dateExtractor issue is addressed by PIG-1015 ; just changing the testcase is not sufficient, as the testcase will still break in some parts of the world because it relies on local settings. indexof / lastindexof / lower / replace / substring udf's - Key: PIG-868 URL: https://issues.apache.org/jira/browse/PIG-868 Project: Pig Issue Type: New Feature Reporter: Bennie Schut Priority: Trivial Attachments: addSomeUDFsPatch.patch, dateExtractorPatch.patch We parse some apache logs using pig and are using some pretty simple udf's like this: B = FOREACH A GENERATE substring(uri, lastindexof(uri, '/')+1, indexof(uri, '.txt')) as lang; It's pretty simple stuff but I figured someone else might find it useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
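The Pig expression above reduces to standard String operations; a plain-Java rendering of the same logic (purely illustrative -- the actual patch wraps these String methods in Pig EvalFunc classes, and the class and method names here are invented for the example) is:

```java
// Plain-Java rendering of the Pig expression
//   substring(uri, lastindexof(uri, '/')+1, indexof(uri, '.txt'))
// purely to illustrate what the proposed UDFs compute. The class and
// method names are hypothetical, not part of the patch.
public class UriLangExtractor {
    public static String extractLang(String uri) {
        int start = uri.lastIndexOf('/') + 1;
        int end = uri.indexOf(".txt");
        if (end < start) {
            return null; // no ".txt" suffix after the last slash
        }
        return uri.substring(start, end);
    }
}
```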
[jira] Created: (PIG-990) Provide a way to pin LogicalOperator Options
Provide a way to pin LogicalOperator Options Key: PIG-990 URL: https://issues.apache.org/jira/browse/PIG-990 Project: Pig Issue Type: Bug Components: impl Reporter: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.6.0 This is a proactive patch, setting up the groundwork for adding an optimizer. Some of the LogicalOperators have options. For example, LOJoin has a variety of join types (regular, fr, skewed, merge), which can be set by the user or chosen by a hypothetical optimizer. If a user selects a join type, Pig philosophy guides us to always respect the user's choice and not explore alternatives. Therefore, we need a way to pin options. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
[ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761070#action_12761070 ] Dmitriy V. Ryaboy commented on PIG-984: --- Good idea. It should be straightforward to look at the sort info associated with the ResourceSchema (see the load/store proposal) to know whether the data is sorted; this frees us from relying on loaders, lets us follow ORDER BYs and LIMITs, etc. Still, this is not quite safe unless you know that the distribution key is a subset of your group key. A simple sorted input stream can still be split among mappers with some rows with the same key going to one, and some to the other. Do you have thoughts on how to handle such cases? This is something that can be inferred looking at the schema and distribution key. I understand wanting a manual handle to turn on the behavior while developing, but the production version of this can be done automatically ( if distributed by and sorted on a subset of group keys, apply map-side group rule in the optimizer). PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data Key: PIG-984 URL: https://issues.apache.org/jira/browse/PIG-984 Project: Pig Issue Type: New Feature Reporter: Richard Ding The general group by operation in Pig needs both mappers and reducers (the aggregation is done in reducers). This incurs disk writes/reads between mappers and reducers. However, in the cases where the input data has the following properties 1. The records with the same key are grouped together (such as the data is sorted by the keys). 2. The records with the same key are in the same mapper input. the group by operation can be performed in the mappers only and thus remove the overhead of disk writes/reads. Alan proposed adding a hint to the group by clause like this one: {code} A = load 'input' using SomeLoader(...); B = group A by $0 using mapside; C = foreach B generate ... 
{code} The proposed addition of using mapside to group introduces a map-side group operator that collects all records for a given key into a buffer. When it sees a key change, it will emit the key and the bag of records it had buffered. It will assume that all records for a given key are collected together and thus there is no need to buffer across keys. It is expected that SomeLoader will be implemented by data systems such as Zebra to ensure the data emitted by the loader satisfies the above properties (1) and (2). It will be the responsibility of the user (or the loader) to guarantee properties (1) and (2) before invoking the mapside hint for the group by clause. The Pig runtime can't check for errors in the input data. For group by clauses with the mapside hint, Pig Latin will only support group by columns (including *), not group by expressions nor group all. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
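The buffering behavior described above -- collect records until the key changes, then emit the key with its bag -- can be sketched in plain Java. This is only an illustration of the algorithm; the real operator works on Pig tuples inside a mapper, and the class, method, and record layout here are assumptions made for the example. It presumes the input satisfies properties (1) and (2), so a single buffer suffices.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the map-side group buffering described above.
// Each record is a String[2] of {key, value}; equal keys are assumed
// adjacent (property 1) and never split across inputs (property 2).
public class MapSideGroup {
    public static Map<String, List<String>> group(List<String[]> records) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String currentKey = null;
        List<String> buffer = new ArrayList<>();
        for (String[] rec : records) {
            if (currentKey != null && !currentKey.equals(rec[0])) {
                groups.put(currentKey, buffer); // key changed: emit buffered bag
                buffer = new ArrayList<>();
            }
            currentKey = rec[0];
            buffer.add(rec[1]);
        }
        if (currentKey != null) {
            groups.put(currentKey, buffer);     // emit the final bag
        }
        return groups;
    }
}
```

Note that if the input violated property (1) -- equal keys not adjacent -- this logic would silently emit the same key twice, which is why the runtime cannot detect such errors cheaply.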
[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12757979#action_12757979 ] Dmitriy V. Ryaboy commented on PIG-966: --- The comments below are from both me and Ashutosh. We'd like to preface this with saying that we think overall, the proposed changes are very useful and important, and are likely to result in significantly reducing the barriers to Pig adoption in the broader Hadoop user community. There are a lot of suggestions and critiques below, but that's just because we care :-) On to the notes. *Names of interfaces* Can you explain why everything has a Load prefix? Seems like this limits the interfaces unnecessarily, and is a bit inconsistent semantically (LoadMetadata does not represent metadata associated with loading -- it loads metadata. LoadStatistics does not load statistics; it represents statistics, and is loaded using LoadMetadata). How about: LoadCaster -> PigTypeCaster, LoadPushDown -> Filterable, Projectionable (the latter may need a better name) (clearly, we are also suggesting breaking down the interface into multiple interfaces -- more on that later), LoadSchema -> ResourceSchema, LoadFieldSchema -> FieldSchema or ResourceFieldSchema, LoadMetadata -> MetadataReader, StoreMetadata -> MetadataWriter, LoadStatistics -> ResourceStatistics. *LoadFunc* In regards to the appropriate parameters for setURI -- can you explain the advantage of this over Strings in more detail? I think the current setLocation approach is preferable; it gives users more flexibility. Plus Hadoop Paths are constructed from strings, not URIs, so we are forcing a string-uri-string conversion on the common case. The _getLoadCaster_ method -- perhaps _getTypeCaster_ or _getPigTypeCaster_ is a better name? _prepareToRead_: does it need a _finishReading()_ mate? I would like to see a standard method for getting the jobconf (or whatever it is called in 20/21), both for LoadFunc and StoreFunc.
*LoadCaster (Or PigTypeCaster..)* This interface is implemented by UTF8StorageConverter. Let's decide on what these are -- _casters_ or _converters_ -- and use one term. *LoadMetadata (or MetadataLoader)* Some thoughts on the problem of what happens when the loader is loading multiple resources or a resource with multiple partitions. We think that the schema should be uniform for everything a single instance of a loader is responsible for loading (and the loader can fill in null or defaults where appropriate if some resources are missing fields). Statistics should be aggregated, since the collection of resources will be treated as one (knowledge of relevant partitions would be used by a Filterable/Projectionable/Pushdownable loader to push selections down, not, I think, by downstream operators). So we have two options. In option 1, getStatistics would return a collection (lower c) of stats associated with the resources that the loader is loading, perhaps as a Map of String -> ResourceStatistics. These would need to go through a stat aggregator of some sort that would know how to deal with unifying statistics across multiple resources in a generic way. In option 2, getStatistics would be responsible for its own implementation of aggregation, which would give it flexibility in terms of how such aggregation is done. Since we don't expect many stat stores, this seems preferable to us, as generic aggregation is going to be hard to get right. (Of course there is option 3, where we have a default stat aggregator class that can be extended/overridden by individual MetadataLoaders, but I imagine this would be a hard sell.) *LoadSchema (or ResourceSchema)* Should org.apache.pig.impl.logicalLayer.schema.Schema be changed to use this as an internal representation? I like how sort information is handled here. Perhaps we can consider using this approach instead of _SortInfo_ in PIG-953. If _PigSchema_ implements or contains _ResourceSchema_, _SortInfo_ will no longer be needed.
_PartitionKeys_ aren't really part of schema; they are a storage/distribution property. This should go into the Metadata and refer to the schema. *LoadStatistics (or ResourceStatistics)* Why the public fields? Not that I am a huge fan of getters and setters but I sense findbugs warnings heading our way. We need to account for some statistics being missing. What should numRecords be set to if we don't know the number of records? We can use Long and set it to null; we can use a magic value (-1?); we can wrap in a getter and throw an exception (ugh). I had envisioned statistics as more of a key-value thing, with some keys predefined in a separate class. So we would have:
{code}
ResourceStats.NUM_RECORDS
ResourceStats.SIZE_IN_BYTES
// etc
{code}
and to get the stats we would call
{code}
MyResourceStats.getLong(ResourceStats.NUM_RECORDS)
{code}
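The key-value statistics idea sketched in the comment above could look something like the following. This is a hypothetical sketch, not a proposed Pig API: the key names and class shape are invented for illustration, and a null return stands in for a missing statistic (avoiding both magic values and exceptions).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the key/value statistics idea from the
// comment above; class and key names are illustrative, not a real
// Pig API. Missing statistics come back as null rather than -1.
public class ResourceStats {
    public static final String NUM_RECORDS = "numRecords";
    public static final String SIZE_IN_BYTES = "sizeInBytes";

    private final Map<String, Long> values = new HashMap<>();

    public void setLong(String key, long value) {
        values.put(key, value);
    }

    public Long getLong(String key) {
        return values.get(key); // null when the statistic was not collected
    }
}
```

New statistics can then be added without touching the base class, which is the extensibility concern raised later in the thread.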
[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758037#action_12758037 ] Dmitriy V. Ryaboy commented on PIG-966: --- Hi Alan, Responses to responses: bq. Perhaps it's best to leave this as strings but look for a scheme at the beginning and interpret it as a URI if it has one (which is what Pig does now). I understand the motivation more clearly now, thanks for the explanation. Agreed with the quoted approach. bq. [regarding single schema for partitioned datasets] Agreed, that is what I was trying to say. Perhaps it wasn't clear. Nope, it was clear, I just have a very verbose way of saying yes. Regarding merging the Schemas you said: bq. No. It serves a different purpose, which is to define the content of data flows inside the logical plan. We should not tie these two together. I don't really understand the difference, but I'll defer to your superior knowledge of the codebase and accept your decision :-). bq. I'm not inclined to bend my programming style to match that of whoever wrote findbugs. +9.3 from the Russian judge. Gleefully accepted. bq. We need partition keys as part of this interface, as Pig will need to be able to pass partition keys to loaders that are capable of doing partition pruning. So we could add getPartitionKeys to the LoadMetadata interface. That's precisely what I am suggesting -- take it out of Schema, put it in LoadMetadata (or MetadataReader, as I like to call it). bq. The problem with key/value set ups like this is it can be hard for people to understand what is already there. So they end up not using what already exists, or worse, re-inventing the wheel. My hope is that by versioning this we can get around the need for this key/value stuff. Hm, I see your point. I am interested in being able to augment the set of available statistics without requiring changes to the base classes, however. I guess that's where inheritance comes in handy.
Any comments on how to handle missing data? Primitive types still don't work for that. bq. So what happens tomorrow when some loaders can do merge joins on sorted data? Now we have to have another interface. I want this to be easily extensible. I must not be clear on what pushing down to a loader does. My interpretation was that it allows pushing down operations to the point where you don't read unnecessary data off disk. A classic example of filter pushdown would be filtering by a partition key (so, dt > sysdate-30, and our data is stored in files, one per day). An example of projection pushdown is when we have a column store that simply avoids loading some of the columns. I don't see how a loader can push down a join. That seems to require reading and changing data. Is the idea that such a join can be performed without an MR step? That seems like a Pig thing, not a loader thing. In any case, yes, I think something like this would require a new interface in the same namespace, since it's a drastically different capability. Any thoughts on the advisability of simplifying projection pushdown to just work on an int array? I know it's limiting, but it's going to be a heck of a lot easier for users to implement. bq. I'm assuming that a given StoreFunc is tied to a particular metadata instance, so it would return its implementation of StoreMetadata. I was assuming that Pig would have a preferred metadata store (such as Owl), and it would attempt to use it unless instructed otherwise. We could even try some cascading thing: if the user specifies a metadata store on the command line, use that; if not, see whether the loader suggests one; if not, use Owl; if Owl doesn't have anything, see if it's a file in a known scheme (hdfs, file, s3n...) and at least get some file-level metadata such as create date and size. StoreMetadata can do the same (except for the hdfs part). I'll take another look at PIG-967.
Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces --- Key: PIG-966 URL: https://issues.apache.org/jira/browse/PIG-966 Project: Pig Issue Type: Improvement Components: impl Reporter: Alan Gates Assignee: Alan Gates I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for full details -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12757310#action_12757310 ] Dmitriy V. Ryaboy commented on PIG-948: --- I don't see a problem with url construction in Pig code. If Hadoop exposed this, then sure, it would be better to use such a feature. Since Hadoop does not expose it (afaik), it's more useful for the end-user to have this url than to have a jobid. Maintenance on this piece of code is minimal -- after all, it's just a simple string concatenation we are talking about. If Hadoop changes how this url is constructed, it will take about 3 minutes to fix, 2.5 of which will be spent opening a Jira ticket. In the meantime, users will have a more usable product than they would without this one line of code. [Usability] Relating pig script with MR jobs Key: PIG-948 URL: https://issues.apache.org/jira/browse/PIG-948 Project: Pig Issue Type: Improvement Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Priority: Minor Attachments: pig-948.patch Currently it's hard to relate a pig script with a specific MR job. In a loaded cluster with multiple simultaneous job submissions, it's not easy to figure out which specific MR jobs were launched for a given pig script. If Pig can provide this info, it will be useful to debug and monitor the jobs resulting from a pig script. At the very least, Pig should be able to provide the user the following information: 1) Job id of the launched job. 2) Complete web url of the jobtracker running this job. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
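The "one line of code" under discussion amounts to a concatenation along these lines. The `jobdetails.jsp` path matches the JobTracker web UI of this era, but treat the exact pattern -- and the class and method names -- as assumptions made for illustration, not a stable Hadoop API.

```java
// Sketch of the job-details URL concatenation discussed above. The
// jobdetails.jsp path is an assumption based on the JobTracker web UI
// of this era; class and method names are invented for the example.
public class JobTrackerUrl {
    public static String jobDetailsUrl(String jtHttpAddress, String jobId) {
        return "http://" + jtHttpAddress + "/jobdetails.jsp?jobid=" + jobId;
    }
}
```

If Hadoop later changes the URL layout, only this one method needs to change, which is the maintenance argument made in the comment.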
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753801#action_12753801 ] Dmitriy V. Ryaboy commented on PIG-953: --- Pradeep, First, I think this is very important to have, not just for Merge but for other things that might benefit from knowing sort orders as well. A few minor nits from a cursory glance at the code. I didn't check the actual logic very carefully yet -- it looks like the large diff blocks in MergeSort et al are mostly moves of code blocks, not significant code changes, correct? On to the comments: seekNear seems ambiguous, as "near" is a generic concept that does not necessarily imply "before or to, but not after" -- which is what this method is required to do. How about seekBefore()? Why do getAscColumns and getSortColumns make a copy of the list? Seems like we can save some memory and cpu here. For that matter, why not use a map of (String)colName -> (Boolean)ascending instead of 2 lists? One structure, plus O(1) lookup. Not sure about the use of super() in the constructor of a class that doesn't extend anything but Object. Is there some magic that requires it? In Log2PhysTranslator, why hardcode the Limit operator? There are other operators that don't change sort order, such as filter. Perhaps add a method to Logical Operators that indicates if they alter sort order of their inputs? In Utils, checkNullEquals is better written as
{code}
if (obj1 == null || obj2 == null) {
    return obj1 == obj2;
} else {
    return checkEquality ? obj1.equals(obj2) : true;
}
{code}
Even with this rewrite, this seems like an odd function. It being as odd as it is leads to it not being used safely when you set checkEquality to false (just a few lines later) -- if obj1 is null and obj2 is not, the func returns true, you try to call a method on obj1, and get an NPE.
Probably better not to roll all this into one amorphous function and simply write
{code}
Util.bothNull(obj1, obj2) || (Util.notNull(obj1, obj2) && obj1.equals(obj2));
{code}
(the implementations of bothNull and notNull are obvious -- just the conjunction and disjunction of the obj == null checks) In StoreConfig, this comment has a typo (and instead of an): * 1) the store does not follow and order by Enable merge join in pig to work with loaders and store functions which can internally index sorted data - Key: PIG-953 URL: https://issues.apache.org/jira/browse/PIG-953 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Attachments: PIG-953.patch Currently the merge join implementation in pig includes construction of an index on sorted data and use of that index to seek into the right input to efficiently perform the join operation. Some loaders (notably the zebra loader) internally implement an index on sorted data and can perform this seek efficiently using their index. So the use of the index needs to be abstracted in such a way that when the loader supports indexing, pig uses it (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
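Spelled out, the two helpers suggested in the comment above, plus the null-safe equality check they compose into, look like this. The class name is invented for the sketch (the comment just says "Util"); only the logic is taken from the thread.

```java
// Sketch of the bothNull/notNull helpers suggested above, plus the
// null-safe equality check they compose into. Class name is invented.
public class NullUtil {
    public static boolean bothNull(Object a, Object b) {
        return a == null && b == null;
    }

    public static boolean notNull(Object a, Object b) {
        return a != null && b != null;
    }

    // True iff both are null, or both are non-null and equal; never NPEs.
    public static boolean nullSafeEquals(Object a, Object b) {
        return bothNull(a, b) || (notNull(a, b) && a.equals(b));
    }
}
```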
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753833#action_12753833 ] Dmitriy V. Ryaboy commented on PIG-953: --- I got my trues and falses reversed on the NPE thing. You are right, the function works as intended. I still think it's too verbose, but agree that it's a style issue -- I guess if the committers like it, it's fine :-) Enable merge join in pig to work with loaders and store functions which can internally index sorted data - Key: PIG-953 URL: https://issues.apache.org/jira/browse/PIG-953 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Attachments: PIG-953.patch Currently the merge join implementation in pig includes construction of an index on sorted data and use of that index to seek into the right input to efficiently perform the join operation. Some loaders (notably the zebra loader) internally implement an index on sorted data and can perform this seek efficiently using their index. So the use of the index needs to be abstracted in such a way that when the loader supports indexing, pig uses it (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-943) Pig crash when it cannot get counter from hadoop
[ https://issues.apache.org/jira/browse/PIG-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12751631#action_12751631 ] Dmitriy V. Ryaboy commented on PIG-943: --- Hi Daniel, My apologies, I worded my comment poorly. I wasn't minus-oneing the patch, I was saying that the use of -1 as a magic value is a bit hacky. I think inserting Long.NaN or null and checking for it on the other end, instead of checking for -1, is cleaner. Pig crash when it cannot get counter from hadoop Key: PIG-943 URL: https://issues.apache.org/jira/browse/PIG-943 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.4.0 Attachments: PIG-943-1.patch We see the following call stacks in Pig: Case 1:
Caused by: java.lang.NullPointerException
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.computeWarningAggregate(MapReduceLauncher.java:390)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:238)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
Case 2:
Caused by: java.lang.NullPointerException
at org.apache.pig.tools.pigstats.PigStats.accumulateMRStats(PigStats.java:150)
at org.apache.pig.tools.pigstats.PigStats.accumulateStats(PigStats.java:91)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:192)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
In both cases, hadoop jobs finish without error. The cause of both problems is that RunningJob.getCounters() returns null, and Pig does not currently check for that. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
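The null check being discussed reduces to a simple guard. A hedged sketch, where a plain Map stands in for Hadoop's Counters object (which RunningJob.getCounters() may return as null): the point is only that "counters unavailable" should propagate as null rather than trigger an NPE or be conflated with a real value like -1.

```java
import java.util.Map;

// Sketch of the guard discussed above. The Map stands in for Hadoop's
// Counters object -- this is not Hadoop API, just the null-propagation
// pattern: a null result from getCounters() becomes "unknown" (null)
// instead of an NPE or a magic value.
public class CounterGuard {
    public static Long safeCounterValue(Map<String, Long> counters, String name) {
        if (counters == null) {
            return null;           // counters unavailable: report "unknown"
        }
        return counters.get(name); // may still be null if the counter is absent
    }
}
```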
[jira] Commented: (PIG-936) making dump and PigDump independent from Tuple.toString
[ https://issues.apache.org/jira/browse/PIG-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749255#action_12749255 ] Dmitriy V. Ryaboy commented on PIG-936: --- Patch makes sense. pig.data doesn't seem like the right package for this class -- perhaps pig.tools ? Also please make sure to format your code in accordance with the style guidelines (http://java.sun.com/docs/codeconv/ ), and use 4 spaces -- not tabs -- for indentation. making dump and PigDump independent from Tuple.toString --- Key: PIG-936 URL: https://issues.apache.org/jira/browse/PIG-936 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Olga Natkovich Assignee: Jeff Zhang Fix For: 0.4.0 Since Tuple is an interface, a toString implementation can change from one tuple implementation to the next. This means that format of dump and PigDump will be different depending on the tuples processed. This could be quite confusing to the users. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
[ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749196#action_12749196 ] Dmitriy V. Ryaboy commented on PIG-934: --- Throwing an exception when a seek is past the file boundary seems acceptable to me (and preferable to adding new functions and changing upstream code that shouldn't care about this detail). Especially since if there is a way to get a consistent ordering among files in a directory, it's trivial to later update this code to seek past file boundaries and into the next file. Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index -- Key: PIG-934 URL: https://issues.apache.org/jira/browse/PIG-934 Project: Pig Issue Type: Bug Affects Versions: 0.3.1 Reporter: Pradeep Kamath Assignee: Ashutosh Chauhan Attachments: pig-934.patch We use POLoad to seek into the right file, which has the following code:
{noformat}
public void setUp() throws IOException {
    String filename = lFile.getFileName();
    loader = (LoadFunc)PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
    is = FileLocalizer.open(filename, pc);
    loader.bindTo(filename, new BufferedPositionedInputStream(is),
                  this.offset, Long.MAX_VALUE);
}
{noformat}
Between opening the stream and bindTo we do not seek to the right offset. bindTo itself does not perform any seek. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745518#action_12745518 ] Dmitriy V. Ryaboy commented on PIG-924: --- Owen -- I may not have made the intent clear; the idea is that when Pig is rewritten to use the future-proofed APIs, the shims will go away (presumably for 0.5). Right now, Pig is not using the new APIs; even the 20 patch posted by Olga uses the deprecated mapred calls. This is only to make life easier in the transitional period while Pig is using the old, mutating APIs. Check out the pig user list archives for motivation of why these shims are needed. Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.2.patch, pig_924.3.patch, pig_924.patch The current Pig build scripts package hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745109#action_12745109 ] Dmitriy V. Ryaboy commented on PIG-924: --- Regarding deprecation -- I tried setting it back to off, and adding @SuppressWarnings("deprecation") to the shims for 20, but ant complained about deprecation nonetheless. Not sure what its deal is. Adding something like this to the main build.xml works. Does this seem like a reasonable solution?
{code}
<!-- set deprecation off if hadoop version greater or equals 20 -->
<target name="set_deprecation">
    <condition property="hadoop_is20">
        <equals arg1="${hadoop.version}" arg2="20"/>
    </condition>
    <antcall target="if_hadoop_is20"/>
    <antcall target="if_hadoop_not20"/>
</target>
<target name="if_hadoop_is20" if="hadoop_is20">
    <property name="javac.deprecation" value="off"/>
</target>
<target name="if_hadoop_not20" unless="hadoop_is20">
    <property name="javac.deprecation" value="on"/>
</target>
<target name="init" depends="set_deprecation">
[]
{code}
Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.2.patch, pig_924.3.patch, pig_924.patch The current Pig build scripts package hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-924: -- Status: Patch Available (was: Open) Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.2.patch, pig_924.patch The current Pig build scripts package hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-924: -- Attachment: pig_924.2.patch This patch addresses the reviewer comments. I put the factor of 0.9 into the 18 shim to restore the old behavior (not sure what the motivation was for changing this for 20). I set the default hadoop version to 18, so that we can verify correctness by running the automated tests. The existing unit tests are sufficient verification of this patch (at least as far as 18 is concerned). Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.2.patch, pig_924.patch The current Pig build scripts package hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-924: -- Status: Patch Available (was: Open) Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.2.patch, pig_924.3.patch, pig_924.patch The current Pig build scripts package Hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-923) Allow setting logfile location in pig.properties
[ https://issues.apache.org/jira/browse/PIG-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-923: -- Status: Patch Available (was: Open) Allow setting logfile location in pig.properties Key: PIG-923 URL: https://issues.apache.org/jira/browse/PIG-923 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Dmitriy V. Ryaboy Fix For: 0.4.0 Attachments: pig_923.patch Local log file location can be specified through the -l flag, but it cannot be set in pig.properties. This JIRA proposes a change to Main.java that allows it to read the pig.logfile property from the configuration. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744307#action_12744307 ] Dmitriy V. Ryaboy commented on PIG-924: --- Thanks for looking, Todd -- most of those changes, like the factor of 0.9, deprecation, excluding the HBase test, etc., are consistent with the 0.20 patch posted to PIG-660. Moving junit.hadoop.conf is critical -- there are comments about this in 660 -- without it, resetting hadoop.version doesn't actually work, as some information from a previous build sticks around. I'll fix the whitespace; this wasn't a final patch, more of a proof of concept. The point is that this approach could work, but currently can't, because Hadoop is bundled in the jar. I am looking for comments from the core developer team regarding the possibility of un-bundling. Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: pig_924.patch The current Pig build scripts package Hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-911: -- Status: Open (was: Patch Available) [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_911.2.patch, pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-911: -- Attachment: pig_911.2.patch Addressed Alan's comments. [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_911.2.patch, pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744343#action_12744343 ] Dmitriy V. Ryaboy commented on PIG-911: --- Concerning making this a StoreFunc as well -- the StoreFunc interface is not very friendly to this. All you get in the bind call is the output stream; for LoadFunc, you also get the name of the file (or, presumably, whatever it was the user passed in under the guise of a file name). This means that for the LoadFunc, I was able to use the passed-in filename to back into a Path and a FileSystem. I can't do the same for StoreFunc, where the filename is not available -- only the output stream is. That means I can't create the appropriate SequenceFile.Writer. Is there a way around this limitation that does not involve requiring special constructor parameters to be used? Is it possible to change the StoreFunc API to provide this information, or to make it available through some side channel (MapRedUtils or similar)? [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_911.2.patch, pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-845) PERFORMANCE: Merge Join
[ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742562#action_12742562 ] Dmitriy V. Ryaboy commented on PIG-845: --- Alan, Ashutosh -- maybe I am misunderstanding where null keys come from in the Indexer. I assumed this was due to the processing that happens in the plan the indexer deserializes and attaches to its POLocalRearrange. In regards to errors, I was referring to this: {code} catch(PlanException e){ int errCode = 2034; String msg = "Error compiling operator " + joinOp.getClass().getCanonicalName(); throw new MRCompilerException(msg, errCode, PigException.BUG, e); } {code} The only central place for error codes seems to be the Wiki. A class with a bunch of static final error codes would be a better place. Ashutosh, I completely disagree with you on changing all tests to run in MR mode. The tests are already impossible to run on a laptop (people, myself included, actually submit patches to JIRA just to see if tests pass). Running in MR mode will incur significant overhead per test. Only things that actually rely on the MR bits should be tested in MR mode (and use mock objects if possible; there's been some advancement on that front in Hadoop 20, though I haven't looked at it yet). I would love to see a more efficient indexing MR job (which would reduce load on the JobTracker, keep schedulers less busy, and incur less task-startup overhead by requiring fewer tasks), but perhaps not before 0.4 is out the door with existing functionality. Just to be clear, I don't think more than one record per block is necessary, but more than one block per task would probably be a good thing. Any thoughts on how to choose which of the two relations to index? We get locality on the non-indexed relation, but not on the indexed one, which probably throws a kink in the normal way of thinking about this.
PERFORMANCE: Merge Join --- Key: PIG-845 URL: https://issues.apache.org/jira/browse/PIG-845 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ashutosh Chauhan Attachments: merge-join.patch This join would work if the data for both tables is sorted on the join key. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
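The comment above suggests replacing bare numeric error codes (like the 2034 in the snippet) with a central class of static final constants. A minimal sketch of what that could look like; PigError and PigErrors are hypothetical names taken from the discussion, not existing Pig classes:

```java
// Sketch of the suggestion: centralize error codes as lightweight static
// final constants instead of declaring them deep inside operator classes.
// PigError/PigErrors are hypothetical names, not actual Pig code.
final class PigError {
    final int code;
    final String name;
    final String description;

    PigError(int code, String name, String description) {
        this.code = code;
        this.name = name;
        this.description = description;
    }
}

final class PigErrors {
    static final PigError ERROR_COMPILING_OPERATOR = new PigError(
            2034, "ERROR_COMPILING_OPERATOR", "Error compiling operator");

    private PigErrors() {} // constants holder, never instantiated
}
```

A compiler could then throw using PigErrors.ERROR_COMPILING_OPERATOR.code instead of a bare 2034, and the wiki's error-code table could be generated from this class rather than maintained by hand.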
[jira] Commented: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742565#action_12742565 ] Dmitriy V. Ryaboy commented on PIG-911: --- Alan, Thanks for the feedback. I'll add the try/catch. In regards to the UTF8StorageConverter -- I think I added that because, before that, the code broke if you didn't declare a typed schema at load time (so, a = load 'foo' using SequenceFileLoader() as (a,b) instead of a = load 'foo' using SequenceFileLoader() as (a:chararray, b:double)). I'll figure out what exactly is going on with that and remove the UTF8StorageConverter. Will add Store as time allows. [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742083#action_12742083 ] Dmitriy V. Ryaboy commented on PIG-833: --- Alan -- if it's not finding .dfs , it's probably not linking hadoop20.jar Try my patch in 660 :-) Storage access layer Key: PIG-833 URL: https://issues.apache.org/jira/browse/PIG-833 Project: Pig Issue Type: New Feature Reporter: Jay Tang Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code. This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata. Eventually it could also support predicate pushdown for further performance improvement. Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742170#action_12742170 ] Dmitriy V. Ryaboy commented on PIG-833: --- Alan, this means Pig contrib/ is no longer compatible with Hadoop 18, which probably means you need to either roll this back or roll 660 in (and add the hadoop20.jar file to lib/). Otherwise the build is broken. Storage access layer Key: PIG-833 URL: https://issues.apache.org/jira/browse/PIG-833 Project: Pig Issue Type: New Feature Reporter: Jay Tang Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code. This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata. Eventually it could also support predicate pushdown for further performance improvement. Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-845) PERFORMANCE: Merge Join
[ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12741589#action_12741589 ] Dmitriy V. Ryaboy commented on PIG-845: --- Some comments below. It's a big patch, so a lot of comments... 1. EndOfAllInput flag -- could you add comments here about what the point of this flag is? You explain what EndOfAllInputSetter does (which is actually rather self-explanatory) but not what the flag means and how it's used. There is a bit of an explanation in PigMapBase, but it really belongs here. 2. Could you explain the relationship between EndOfAllInput and the (deleted) POStream? 3. Comments in MRCompiler alternate between referring to the left MROp as LeftMROper and curMROper. Choose one. 4. I am curious about the decision to throw compiler exceptions if MergeJoin requirements regarding the number of inputs, etc., aren't satisfied. It seems like a better user experience would be to log a warning and fall back to a regular join. 5. Style notes for visitMergeJoin: it's a 200-line method. Any way you can break it up into smaller components? As is, it's hard to follow. The if statements should be broken up into multiple lines to agree with the style guides. Variable naming: you've got topPrj, prj, pkg, lr, ce, nig... one at a time they are fine, but together in a 200-line method they are unreadable. Please consider more descriptive names. 6. Kind of a global comment, since it applies to more than just MergeJoin: it seems to me like we need a Builder for operators to clean up some of the new, set, set, set stuff.
Having the setters return this, and a Plan's add() method return the plan, would let us replace this: {code} POProject topPrj = new POProject(new OperatorKey(scope, nig.getNextNodeId(scope))); topPrj.setColumn(1); topPrj.setResultType(DataType.TUPLE); topPrj.setOverloaded(true); rightMROpr.reducePlan.add(topPrj); rightMROpr.reducePlan.connect(pkg, topPrj); {code} with this: {code} POProject topPrj = new POProject(new OperatorKey(scope, nig.getNextNodeId(scope))) .setColumn(1).setResultType(DataType.TUPLE) .setOverloaded(true); rightMROpr.reducePlan.add(topPrj).connect(pkg, topPrj); {code} 7. Is the change to List<List<Byte>> keyTypes in POFRJoin related to MergeJoin or just rolled in? 8. MergeJoin: break getNext() into components. I don't see you supporting left outer joins. Plans for that? At least document the planned approach. Error codes being declared deep inside classes, and documented on the wiki, is a poor practice, imo. They should be pulled out into PigErrors (as lightweight final objects that have an error code, a name, and a description). I thought Santhosh made progress on this already, no? Could you explain the problem with splits and streams? Why can't this work for them? 9. Sampler/Indexer: 9a. Looks like you create the same number of map tasks for this as you do for a join; all a sampling map task does is read one record and emit a single tuple. That seems wasteful; there is a lot of overhead in setting up these tiny jobs, which might get stuck behind other jobs running on the cluster, etc. If the underlying file has sync points, a smaller number of MR tasks can be created. If we know the ratio of sample tasks to full tasks, we can figure out how many records we should emit per job ( ceil(full_tasks/sample_tasks) ). We can approximately achieve this by seeking through (end-offset)/num_to_emit and doing a sync() after that seek. It's approximate, but close enough for an index.
9b. Consider renaming to something like SortedFileIndexer, since it's conceivable that this component can be reused in a context other than a Merge Join. 10. Would it make sense to expose this to users via a 'CREATE INDEX' (or similar) command? That way the index could be persisted, and the user could tell you to use an existing index instead of rescanning the data. 11. I am not sure about the approach of pushing sampling above filters. Have you guys benchmarked this? Seems like you'd wind up reading the whole file in the sample job if the filter is selective enough (and high filter selectivity would also make materialize-then-sample go much faster). Testing: 12a. You should test for refusal to do a 3-way join and other error conditions (or a warning and successful failover to a regular join -- my preference). 12b. You should do a proper unit test for the MergeJoinIndexer (or whatever we are calling it). PERFORMANCE: Merge Join --- Key: PIG-845 URL: https://issues.apache.org/jira/browse/PIG-845 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ashutosh Chauhan Attachments: merge-join-1.patch, merge-join-for-review.patch This join would work if the data for both tables is sorted on the join key. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-561) Need to generate empty tuples and bags as a part of Pig Syntax
[ https://issues.apache.org/jira/browse/PIG-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12741133#action_12741133 ] Dmitriy V. Ryaboy commented on PIG-561: --- I believe PIG-773 fixes this. Can we close this? Need to generate empty tuples and bags as a part of Pig Syntax -- Key: PIG-561 URL: https://issues.apache.org/jira/browse/PIG-561 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Viraj Bhat There is a need to sometimes generate empty tuples and bags as a part of the Pig syntax rather than using UDF's {code} a = load 'mydata.txt' using PigStorage(); b =foreach a generate ( ) as emptytuple; c = foreach a generate { } as emptybag; dump c; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740241#action_12740241 ] Dmitriy V. Ryaboy commented on PIG-660: --- The shim patch posted above doesn't work as cleanly as desired; the current build.xml has junit.hadoop.conf pointing to a directory in ${user.home}. This has an undesired effect -- a hadoop config file gets created the first time you run ant, which among other things sets which class implements the FileSystem interface. When ant gets re-run with a different hadoop version, 'ant clean' does not clean out this file -- so an incorrect fs class name gets used. Deleting the directory created by junit.hadoop.conf before rerunning fixes the problem; so does making the value of junit.hadoop.conf relative to ${build.dir} instead of ${user.home}. As I am not sure how the Y! developers use the pigconf directories this thing references, I do not know the appropriate way to proceed. Comments? Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch, pig_660_shims.patch, pig_660_shims_2.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately.
For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-660: -- Attachment: pig_660_shims_3.patch The attached patch fixes the mentioned issue with junit.hadoop.conf by setting it to $build.dir/conf This can be overridden by build.properties if individual contributors want to revert to the old behavior. Also added a compatibility shim for hadoop19 (from PIG-573) Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch, pig_660_shims.patch, pig_660_shims_2.patch, pig_660_shims_3.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740339#action_12740339 ] Dmitriy V. Ryaboy commented on PIG-660: --- Nate, Your stacktrace shows hadoop.dfs calls (as opposed to hdfs), which tells me it's looking for -- and finding -- hadoop 18 classes. Can you do this: export PIG_HADOOP_VERSION=20; ant clean; ant -Dhadoop.version=20 and try again? Just to be sure, try moving hadoop1* out of the lib directory (so that it fails for sure if it's trying to look for 18). Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch, pig_660_shims.patch, pig_660_shims_2.patch, pig_660_shims_3.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739568#action_12739568 ] Dmitriy V. Ryaboy commented on PIG-893: --- Jeff, Thanks for the contribution! Just a few comments: 0) could you name your patch files *.patch? That makes them easier to review, as the proper highlighting mode is chosen. 1) Other class names in the utils package imply that the class name for this should be CastUtils 2) Spacing in POCast.java is a bit messed up. Please make sure all spacing is to project conventions 3) In TestSchema -- Numberic isn't a word, you mean Numeric (no b) 4) I am not sure about naming the methods chararrayTo. Since they take String as an argument, being in Java-land, I think it would be more straightforward to say stringToXXX. 5) Implementation of the casts -- you call str.toBytes(), and hand off to a bytesToXXX method. That method, in turn, converts the bytes back into a string, and proceeds to do the conversion. That seems like redundant work. Wouldn't it be better to have stringToXXX perform the conversion, and have bytesToXXX convert to a string, then call the stringToXXX method? 6) TestCharArray2Numeric.java -- the convention is to spell out To instead of using the number 2 7) The tests in TestCharArray2Numeric look very similar to each other. Could you pull out the common functionality so the code is not repeated? About the tests themselves: since you are just testing conversions, this can be a straightforward unit test -- make a few strings, assert that they convert to the expected value. Hit the edge cases (overflows, special cases for parsing, etc). We don't need to spin up a whole Pig query. 8) I don't like testing random values, as this creates tests that might sometimes pass and sometimes not. Recommend using known data for reproducible test results.
9) You extracted functionality from Utf8StorageConverter by duplicating the code; I would prefer to see Utf8StorageConverter modified to hand off conversions to CastUtils. support cast of chararray to other simple types --- Key: PIG-893 URL: https://issues.apache.org/jira/browse/PIG-893 Project: Pig Issue Type: New Feature Affects Versions: 0.4.0 Reporter: Thejas M Nair Assignee: Jeff Zhang Fix For: 0.4.0 Attachments: Pig_893_Patch.txt Pig should support casting of chararray to integer, long, float, double, bytearray. If the conversion fails for reasons such as overflow, the cast should return null and log a warning. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-660: -- Attachment: pig_660_shims_2.patch Sure is.. uploading a patch with the fixed package name. Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch, pig_660_shims.patch, pig_660_shims_2.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739570#action_12739570 ] Dmitriy V. Ryaboy commented on PIG-909: --- Sorry I am being slow -- which libraries are missing from the classpath you posted? Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739643#action_12739643 ] Dmitriy V. Ryaboy commented on PIG-909: --- Oh I see. I have this in my bashrc: export PIG_CLASSPATH=$PIGDIR/pig.jar I thought this was included in a README somewhere. I guess we can modify bin/pig to use this as a default value (so a user can still override by setting PIG_CLASSPATH to something else). Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-911: -- Attachment: pig_sequencefile.patch The attached patch is an initial implementation of a loader for SequenceFiles. It works with keys and values of the following types: Text, IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, ByteWritable I would appreciate some comments on how to properly handle errors (casting errors, IO errors, etc). [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-908) Need a way to correlate MR jobs with Pig statements
Need a way to correlate MR jobs with Pig statements --- Key: PIG-908 URL: https://issues.apache.org/jira/browse/PIG-908 Project: Pig Issue Type: Wish Reporter: Dmitriy V. Ryaboy Complex Pig Scripts often generate many Map-Reduce jobs, especially with the recent introduction of multi-store capabilities. For example, the first script in the Pig tutorial produces 5 MR jobs. There is currently very little support for debugging resulting jobs; if one of the MR jobs fails, it is hard to figure out which part of the script it was responsible for. Explain plans help, but even with the explain plan, a fair amount of effort (and sometimes, experimentation) is required to correlate the failing MR job with the corresponding PigLatin statements. This ticket is created to discuss approaches to alleviating this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-908) Need a way to correlate MR jobs with Pig statements
[ https://issues.apache.org/jira/browse/PIG-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739125#action_12739125 ] Dmitriy V. Ryaboy commented on PIG-908: --- An idea that might work (I haven't evaluated the complexity of implementing this): When LogicalOperators are created, a bit of metadata is attached to them, listing the line number that they come from. Multiple LOs may be created from a single line, and multiple lines may be associated with a single operator. This metadata is passed down to the Physical Operators. When an MR job is created, a log message is written listing the job name and the line numbers associated with the POs in that map-reduce job. Thoughts? Need a way to correlate MR jobs with Pig statements --- Key: PIG-908 URL: https://issues.apache.org/jira/browse/PIG-908 Project: Pig Issue Type: Wish Reporter: Dmitriy V. Ryaboy Complex Pig Scripts often generate many Map-Reduce jobs, especially with the recent introduction of multi-store capabilities. For example, the first script in the Pig tutorial produces 5 MR jobs. There is currently very little support for debugging resulting jobs; if one of the MR jobs fails, it is hard to figure out which part of the script it was responsible for. Explain plans help, but even with the explain plan, a fair amount of effort (and sometimes, experimentation) is required to correlate the failing MR job with the corresponding PigLatin statements. This ticket is created to discuss approaches to alleviating this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-909: -- Attachment: pig_909.patch The attached patch modifies bin/pig as described. Tested locally by setting and unsetting HADOOP_HOME and making sure the right configurations, etc, are picked up. Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
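The behavior described in the issue can be sketched as a small shell function. This is not the actual patch; the `hadoop-*.jar` glob and `conf/` directory layout are assumptions about a typical Hadoop install:

```shell
#!/bin/sh
# Sketch of the described bin/pig logic (not the actual patch): use the
# jars from an external Hadoop install when HADOOP_HOME names a valid
# directory, otherwise fall back to the hadoop jar bundled with Pig.
pig_hadoop_classpath() {
    hadoop_home=$1
    pig_home=$2
    version=$3
    if [ -n "$hadoop_home" ] && [ -d "$hadoop_home" ]; then
        cp=""
        # pick up the external hadoop jars and its conf directory
        for jar in "$hadoop_home"/hadoop-*.jar; do
            cp=$cp:$jar
        done
        echo "$cp:$hadoop_home/conf"
    else
        # no (or invalid) HADOOP_HOME: the bundled jar, as bin/pig does today
        echo ":$pig_home/hadoop${version}.jar"
    fi
}

pig_hadoop_classpath "" /opt/pig 18   # prints :/opt/pig/hadoop18.jar
```

Wrapping the whole lookup in the `if` is what keeps the external Hadoop optional: nothing changes for users who never set HADOOP_HOME.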
[jira] Updated: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-909: -- Attachment: pig_909.2.patch added ivy jars to classpath Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.2.patch, pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739287#action_12739287 ] Dmitriy V. Ryaboy commented on PIG-909: --- Daniel, not sure what you mean. Do you mean that the patch makes it necessary to have an external version of hadoop to build/run pig? That's not the case, as I wrapped the whole thing in an if -- external hadoop jars will only be used instead of the bundled hadoop.jar if HADOOP_HOME is defined (and valid). Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.2.patch, pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739297#action_12739297 ] Dmitriy V. Ryaboy commented on PIG-909: --- Actually, I looked at build.xml for Pig, and it includes the Ivy dependencies in pig.jar, which explains why this has been working for me. I'll delete the second patch -- that change is unnecessary. Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-909: -- Attachment: (was: pig_909.2.patch) Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-660: -- Attachment: pig_660_shims.patch The attached patch, pig_660_shims.patch, introduces a compatibility layer similar to that in https://issues.apache.org/jira/browse/HIVE-487 . HadoopShims.java contains wrappers that hide interface differences between Hadoop 18 and 20; when an interface change affects Pig, a shim is added to this class and used by Pig. Separate versions of the shims are maintained for different Hadoop versions. This way, Pig users can compile against either Hadoop 18 or Hadoop 20 by simply changing an ant property, either via the -D flag or build.properties, instead of having to go through the process of patching. There has been discussion of officially moving Pig to 0.20; this way, we sidestep the whole question, and only need to worry about version compatibility when using specific Hadoop APIs. I propose that we use this mechanism until Pig is moved to use the new, future-proofed API. Pig compiled against 18 won't be able to use some of the newest features, such as Zebra storage; Ant can be configured not to build Zebra when the Hadoop version is 18. Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch, pig_660_shims.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. 
The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
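One way to picture the shim mechanism is as a per-version source directory chosen at build time. The sketch below is hypothetical; the directory names, the `hadoopversion` property, and the version handling are made up for illustration, not taken from the patch:

```shell
#!/bin/sh
# Hypothetical illustration of version selection for a shim layer:
# one source tree per supported Hadoop version, picked by a single
# build property (e.g. ant -Dhadoopversion=18).
shim_source_dir() {
    case "$1" in
        18) echo src/shims/hadoop18 ;;
        20) echo src/shims/hadoop20 ;;
        *)  echo "unsupported Hadoop version: $1" >&2; return 1 ;;
    esac
}

# The build then compiles whichever HadoopShims.java lives in the
# selected directory, so core Pig code never branches on version.
shim_source_dir 20   # prints src/shims/hadoop20
```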
[jira] Created: (PIG-903) ILLUSTRATE fails on 'Distinct' operator
ILLUSTRATE fails on 'Distinct' operator --- Key: PIG-903 URL: https://issues.apache.org/jira/browse/PIG-903 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Using the latest Pig from trunk (0.3+) in mapreduce mode, running through the tutorial script script1-hadoop.pig works fine. However, executing the following illustrate command throws an exception: illustrate ngramed2
Pig Stack Trace --- ERROR 2999: Unexpected internal error. Unrecognized logical operator.
java.lang.RuntimeException: Unrecognized logical operator.
        at org.apache.pig.pen.EquivalenceClasses.GetEquivalenceClasses(EquivalenceClasses.java:60)
        at org.apache.pig.pen.DerivedDataVisitor.evaluateOperator(DerivedDataVisitor.java:368)
        at org.apache.pig.pen.DerivedDataVisitor.visit(DerivedDataVisitor.java:226)
        at org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:104)
        at org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:37)
        at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
        at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
        at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:98)
        at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:90)
        at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:106)
        at org.apache.pig.PigServer.getExamples(PigServer.java:724)
        at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:541)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:195)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:361)
This works: illustrate ngramed1; although it does throw a few NPEs:
java.lang.NullPointerException
        at org.apache.pig.pen.util.DisplayExamples.ShortenField(DisplayExamples.java:205)
        at org.apache.pig.pen.util.DisplayExamples.MakeArray(DisplayExamples.java:190)
        at org.apache.pig.pen.util.DisplayExamples.PrintTabular(DisplayExamples.java:86)
        [...]
(illustrate also doesn't work on bzipped input, but that's a separate issue) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735676#action_12735676 ] Dmitriy V. Ryaboy commented on PIG-893: --- +1 for string-numeric conversion via casting. support cast of chararray to other simple types --- Key: PIG-893 URL: https://issues.apache.org/jira/browse/PIG-893 Project: Pig Issue Type: New Feature Reporter: Thejas M Nair Pig should support casting of chararray to integer,long,float,double,bytearray. If the conversion fails for reasons such as overflow, cast should return null and log a warning. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-660: -- Attachment: PIG-660_5.patch Updating the patch to set PIG_HADOOP_VERSION to 20 by default. Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader
[ https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729751#action_12729751 ] Dmitriy V. Ryaboy commented on PIG-879: --- Having this be a global flag through properties wouldn't work for scripts that require both behaviors in different load statements. Maybe a boolean performPathConversion flag which is true by default, and can be overridden via the load statement? Custom Loaders could change what their default is. I think a boolean flag is more straightforward than a method you have to override with a no-op. Pig should provide a way for input location string in load statement to be passed as-is to the Loader - Key: PIG-879 URL: https://issues.apache.org/jira/browse/PIG-879 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Due to multiquery optimization, Pig always converts the filenames to absolute URIs (see http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section about Incompatible Changes - Path Names and Schemes). This is necessary since the script may have cd .. statements between load or store statements and if the load statements have relative paths, we would need to convert to absolute paths to know where to load/store from. To do this QueryParser.massageFilename() has the code below[1] which basically gives the fully qualified hdfs path However the issue with this approach is that if the filename string is something like hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2, the code below[1] actually translates this to hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2 and throws an exception that it is an incorrect path. Some loaders may want to interpret the filenames (the input location string in the load statement) in any way they wish and may want Pig to not make absolute paths out of them. 
There are a few options to address this: 1) A command-line switch to indicate to Pig that pathnames in the script are all absolute, so that Pig should not alter them and should pass them as-is to Loaders and Storers. 2) A keyword in the load and store statements to indicate the same intent to Pig. 3) A property which users can supply on the command line or in pig.properties to indicate the same intent. 4) A method in LoadFunc - relativeToAbsolutePath(String filename, String curDir) - which does the conversion to absolute; this way a Loader can choose to implement it as a no-op. Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-863) Function (UDF) automatic namespace resolution is really needed
[ https://issues.apache.org/jira/browse/PIG-863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723683#action_12723683 ] Dmitriy V. Ryaboy commented on PIG-863: --- I believe PIG-832 addresses this. Function (UDF) automatic namespace resolution is really needed -- Key: PIG-863 URL: https://issues.apache.org/jira/browse/PIG-863 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz The Apache PiggyBank documentation says that to reference a function, I need to specify a function as: org.apache.pig.piggybank.evaluation.string.UPPER(text) As in the example: {code} REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ; TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text) MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ; {code} Why can't we implement automatic namespace resolution so we can just reference UPPER without namespace qualifiers? {code} REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ; TweetsInaug = FILTER Tweets BY UPPER(text) MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ; {code} I know about the workaround: {code} define org.apache.pig.piggybank.evaluation.string.UPPER UPPER {code} But this is really a pain to do if I have lots of functions. Just warn if there is a collision and suggest I use the define workaround in the warning messages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-855) Filter to determine if a UserAgent string is a bot
Filter to determine if a UserAgent string is a bot -- Key: PIG-855 URL: https://issues.apache.org/jira/browse/PIG-855 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Priority: Minor A PiggyBank contrib that would allow one to filter records by whether a UserAgent string represents a bot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-855) Filter to determine if a UserAgent string is a bot
[ https://issues.apache.org/jira/browse/PIG-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721012#action_12721012 ] Dmitriy V. Ryaboy commented on PIG-855: --- Jeff, the approach depends on whether you care more about false positives or false negatives. The right way to do this is probably not to write a boolean function, but something that returns one of several codes -- known browser, known crawler, monitor, stuff like wget and curl, and unknown. IAB has a standard list of bots and spiders (http://www.iab.net/sites/login.php), and maintains an industry standard for the filters that should be applied before numbers are reported. Filter to determine if a UserAgent string is a bot -- Key: PIG-855 URL: https://issues.apache.org/jira/browse/PIG-855 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Priority: Minor A PiggyBank contrib that would allow one to filter records by whether a UserAgent string represents a bot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
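The multi-code classification suggested in the comment could look something like the sketch below. The patterns are toy examples only; a production filter should be driven by the IAB bots-and-spiders list rather than hand-written globs:

```shell
#!/bin/sh
# Toy sketch: classify a UserAgent string into one of several
# categories instead of a yes/no bot flag. Patterns are illustrative.
classify_user_agent() {
    case "$1" in
        *Googlebot*|*Slurp*|*bingbot*) echo known_crawler ;;
        *Pingdom*|*UptimeRobot*)       echo monitor ;;
        *curl*|*Wget*)                 echo tool ;;
        *Mozilla*)                     echo known_browser ;;
        *)                             echo unknown ;;
    esac
}

# Order matters: many crawlers also claim "Mozilla", so the crawler
# patterns must be checked before the browser pattern.
classify_user_agent "Mozilla/5.0 (compatible; Googlebot/2.1)"   # prints known_crawler
```

A Pig UDF version of this would return the code as a chararray, letting scripts group or filter on whichever categories they care about.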
[jira] Updated: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2
[ https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-830: -- Status: Patch Available (was: Open) Port Apache Log parsing piggybank contrib to Pig 0.2 Key: PIG-830 URL: https://issues.apache.org/jira/browse/PIG-830 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig-830-v2.patch, pig-830-v3.patch, pig-830.patch, TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt The piggybank contribs (pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was merged in. They should be updated to work with the current APIs and added back into trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2
Port Apache Log parsing piggybank contrib to Pig 0.2 Key: PIG-830 URL: https://issues.apache.org/jira/browse/PIG-830 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Dmitriy V. Ryaboy Priority: Minor The piggybank contribs (pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was merged in. They should be updated to work with the current APIs and added back into trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2
[ https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-830: -- Attachment: pig-830-v2.patch Sorry about that. New version attached, passes the test this time. Port Apache Log parsing piggybank contrib to Pig 0.2 Key: PIG-830 URL: https://issues.apache.org/jira/browse/PIG-830 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig-830-v2.patch, pig-830.patch, TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt The piggybank contribs (pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was merged in. They should be updated to work with the current APIs and added back into trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2
[ https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-830: -- Status: Patch Available (was: Open) Port Apache Log parsing piggybank contrib to Pig 0.2 Key: PIG-830 URL: https://issues.apache.org/jira/browse/PIG-830 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig-830-v2.patch, pig-830.patch, TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt The piggybank contribs (pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was merged in. They should be updated to work with the current APIs and added back into trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2
[ https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-830: -- Attachment: pig-830-v3.patch As I experimented with these classes, I realized that the naive implementation, which used a regex to capture strings and returned a tuple of strings, is not appropriate for the typed version of Pig, since one may want to cast various fields to integers, etc. The attached version returns a tuple of DataByteArrays instead. Port Apache Log parsing piggybank contrib to Pig 0.2 Key: PIG-830 URL: https://issues.apache.org/jira/browse/PIG-830 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Dmitriy V. Ryaboy Priority: Minor Attachments: pig-830-v2.patch, pig-830-v3.patch, pig-830.patch, TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt The piggybank contribs (pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was merged in. They should be updated to work with the current APIs and added back into trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-825) PIG_HADOOP_VERSION should be 18
PIG_HADOOP_VERSION should be 18 --- Key: PIG-825 URL: https://issues.apache.org/jira/browse/PIG-825 Project: Pig Issue Type: Bug Components: grunt Reporter: Dmitriy V. Ryaboy PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now considered default. Patch coming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-825) PIG_HADOOP_VERSION should be 18
[ https://issues.apache.org/jira/browse/PIG-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-825: -- Attachment: pig-825.patch Attached trivial patch, please review. PIG_HADOOP_VERSION should be 18 --- Key: PIG-825 URL: https://issues.apache.org/jira/browse/PIG-825 Project: Pig Issue Type: Bug Components: grunt Reporter: Dmitriy V. Ryaboy Attachments: pig-825.patch PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now considered default. Patch coming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-825) PIG_HADOOP_VERSION should be 18
[ https://issues.apache.org/jira/browse/PIG-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-825: -- Attachment: pig-825.patch Minor update to minor patch -- fixed a typo in the bug number in CHANGES.txt PIG_HADOOP_VERSION should be 18 --- Key: PIG-825 URL: https://issues.apache.org/jira/browse/PIG-825 Project: Pig Issue Type: Bug Components: grunt Reporter: Dmitriy V. Ryaboy Attachments: pig-825.patch, pig-825.patch PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now considered default. Patch coming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.