[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader
[ https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12779131#action_12779131 ] Richard Ding commented on PIG-879: -- The test-patch run results: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 23 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 333 release audit warnings (more than the trunk's current 331 warnings). Pig should provide a way for input location string in load statement to be passed as-is to the Loader - Key: PIG-879 URL: https://issues.apache.org/jira/browse/PIG-879 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Assignee: Richard Ding Attachments: PIG-879.patch, PIG-879.patch Due to multiquery optimization, Pig always converts the filenames to absolute URIs (see http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section about Incompatible Changes - Path Names and Schemes). This is necessary since the script may have cd .. statements between load or store statements and if the load statements have relative paths, we would need to convert to absolute paths to know where to load/store from. To do this QueryParser.massageFilename() has the code below[1] which basically gives the fully qualified hdfs path However the issue with this approach is that if the filename string is something like hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2, the code below[1] actually translates this to hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2 and throws an exception that it is an incorrect path. Some loaders may want to interpret the filenames (the input location string in the load statement) in any way they wish and may want Pig to not make absolute paths out of them. There are a few options to address this: 1)A command line switch to indicate to Pig that pathnames in the script are all absolute and hence Pig should not alter them and pass them as-is to Loaders and Storers. 2)A keyword in the load and store statements to indicate the same intent to pig 3)A property which users can supply on cmdline or in pig.properties to indicate the same intent. 4)A method in LoadFunc - relativeToAbsolutePath(String filename, String curDir) which does the conversion to absolute - this way Loader can chose to implement it as a noop. Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader
[ https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777316#action_12777316 ] Richard Ding commented on PIG-879: -- As a related issue, there is a feature in Pig right now that allows user to specify a local file in a load statement even if Pig is running in MapReduce mode. Namely, {code} A = load 'file:test/org/apache/pig/test/data/passwd' using PigStorage(':'); {code} is a valid Pig statement. Internally Pig moves the file from the local file system to the HDFS file system and gives the corresponding HDFS URI to the loader: {code} hdfs://localhost.localdomain:37575/tmp/temp957104276/tmp124591329 {code} As we move to the new Load/Store API, there are two options: 1. Stop supporting this feature. The above load statement can be replaced by the following statements: {code} copyFromLocal ./test/org/apache/pig/test/data/passw ./passw A = load './passw' using PigStorage(':'); {code} 2. The default implementation of _relativeToAbsolutePath(String location, Path curDir)_ method will move the local file (specified by location) to the HDFS and return the corresponding HDFS URI. Any comments on these options? Especially does anyone have strong opinion against choosing option 1? Pig should provide a way for input location string in load statement to be passed as-is to the Loader - Key: PIG-879 URL: https://issues.apache.org/jira/browse/PIG-879 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Assignee: Richard Ding Due to multiquery optimization, Pig always converts the filenames to absolute URIs (see http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section about Incompatible Changes - Path Names and Schemes). This is necessary since the script may have cd .. statements between load or store statements and if the load statements have relative paths, we would need to convert to absolute paths to know where to load/store from. To do this QueryParser.massageFilename() has the code below[1] which basically gives the fully qualified hdfs path However the issue with this approach is that if the filename string is something like hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2, the code below[1] actually translates this to hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2 and throws an exception that it is an incorrect path. Some loaders may want to interpret the filenames (the input location string in the load statement) in any way they wish and may want Pig to not make absolute paths out of them. There are a few options to address this: 1)A command line switch to indicate to Pig that pathnames in the script are all absolute and hence Pig should not alter them and pass them as-is to Loaders and Storers. 2)A keyword in the load and store statements to indicate the same intent to pig 3)A property which users can supply on cmdline or in pig.properties to indicate the same intent. 4)A method in LoadFunc - relativeToAbsolutePath(String filename, String curDir) which does the conversion to absolute - this way Loader can chose to implement it as a noop. Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader
[ https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729744#action_12729744 ] Hong Tang commented on PIG-879: --- 1) and 3) are kind of equivalent to user, and are preferred for customized loaders that do not wish pig to do the escaping at all. Pig should provide a way for input location string in load statement to be passed as-is to the Loader - Key: PIG-879 URL: https://issues.apache.org/jira/browse/PIG-879 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Due to multiquery optimization, Pig always converts the filenames to absolute URIs (see http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section about Incompatible Changes - Path Names and Schemes). This is necessary since the script may have cd .. statements between load or store statements and if the load statements have relative paths, we would need to convert to absolute paths to know where to load/store from. To do this QueryParser.massageFilename() has the code below[1] which basically gives the fully qualified hdfs path However the issue with this approach is that if the filename string is something like hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2, the code below[1] actually translates this to hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2 and throws an exception that it is an incorrect path. Some loaders may want to interpret the filenames (the input location string in the load statement) in any way they wish and may want Pig to not make absolute paths out of them. There are a few options to address this: 1)A command line switch to indicate to Pig that pathnames in the script are all absolute and hence Pig should not alter them and pass them as-is to Loaders and Storers. 2)A keyword in the load and store statements to indicate the same intent to pig 3)A property which users can supply on cmdline or in pig.properties to indicate the same intent. 4)A method in LoadFunc - relativeToAbsolutePath(String filename, String curDir) which does the conversion to absolute - this way Loader can chose to implement it as a noop. Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader
[ https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729751#action_12729751 ] Dmitriy V. Ryaboy commented on PIG-879: --- Having this be a global flag through properties wouldn't work for scripts that require both behaviors in different load statements. Maybe a boolean performPathConversion flag which is true by default, and can be overridden via the load statement? Custom Loaders could change what their default is. I think a boolean flag is more straightforward than a method you have to override with a no-op. Pig should provide a way for input location string in load statement to be passed as-is to the Loader - Key: PIG-879 URL: https://issues.apache.org/jira/browse/PIG-879 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Due to multiquery optimization, Pig always converts the filenames to absolute URIs (see http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section about Incompatible Changes - Path Names and Schemes). This is necessary since the script may have cd .. statements between load or store statements and if the load statements have relative paths, we would need to convert to absolute paths to know where to load/store from. To do this QueryParser.massageFilename() has the code below[1] which basically gives the fully qualified hdfs path However the issue with this approach is that if the filename string is something like hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2, the code below[1] actually translates this to hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2 and throws an exception that it is an incorrect path. Some loaders may want to interpret the filenames (the input location string in the load statement) in any way they wish and may want Pig to not make absolute paths out of them. There are a few options to address this: 1)A command line switch to indicate to Pig that pathnames in the script are all absolute and hence Pig should not alter them and pass them as-is to Loaders and Storers. 2)A keyword in the load and store statements to indicate the same intent to pig 3)A property which users can supply on cmdline or in pig.properties to indicate the same intent. 4)A method in LoadFunc - relativeToAbsolutePath(String filename, String curDir) which does the conversion to absolute - this way Loader can chose to implement it as a noop. Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader
[ https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729755#action_12729755 ] Thejas M Nair commented on PIG-879: --- The problem with 1 3 is that the setting is universal to the grunt shell or script. In cases where user wants to read from read from multiple sources with different loaders, it will be inconvenient to be forced to use absolute uri's for all of them. Pig should provide a way for input location string in load statement to be passed as-is to the Loader - Key: PIG-879 URL: https://issues.apache.org/jira/browse/PIG-879 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Due to multiquery optimization, Pig always converts the filenames to absolute URIs (see http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section about Incompatible Changes - Path Names and Schemes). This is necessary since the script may have cd .. statements between load or store statements and if the load statements have relative paths, we would need to convert to absolute paths to know where to load/store from. To do this QueryParser.massageFilename() has the code below[1] which basically gives the fully qualified hdfs path However the issue with this approach is that if the filename string is something like hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2, the code below[1] actually translates this to hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2 and throws an exception that it is an incorrect path. Some loaders may want to interpret the filenames (the input location string in the load statement) in any way they wish and may want Pig to not make absolute paths out of them. There are a few options to address this: 1)A command line switch to indicate to Pig that pathnames in the script are all absolute and hence Pig should not alter them and pass them as-is to Loaders and Storers. 2)A keyword in the load and store statements to indicate the same intent to pig 3)A property which users can supply on cmdline or in pig.properties to indicate the same intent. 4)A method in LoadFunc - relativeToAbsolutePath(String filename, String curDir) which does the conversion to absolute - this way Loader can chose to implement it as a noop. Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader
[ https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729771#action_12729771 ] Hong Tang commented on PIG-879: --- Both are valid arguments. The problem of 2) and 4) are that they require change to the load statement syntax or load-func api and would take longer to get there. I guess we could structure the fix in two phases: Phase One: supporting 1) and 3), so that we can have the minimum to move along without having to disable multi-query optimization completely. User should be able to modify the script to change all relative paths to absolute ones (the chance of such usage should be rare that most people should not be impacted). Phase Two: support either 2) or 4) (but I do not think we need both). And personally I think 4) would be better because loader should be the one that interprets the location string syntax. Pig should provide a way for input location string in load statement to be passed as-is to the Loader - Key: PIG-879 URL: https://issues.apache.org/jira/browse/PIG-879 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Due to multiquery optimization, Pig always converts the filenames to absolute URIs (see http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section about Incompatible Changes - Path Names and Schemes). This is necessary since the script may have cd .. statements between load or store statements and if the load statements have relative paths, we would need to convert to absolute paths to know where to load/store from. To do this QueryParser.massageFilename() has the code below[1] which basically gives the fully qualified hdfs path However the issue with this approach is that if the filename string is something like hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2, the code below[1] actually translates this to hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2 and throws an exception that it is an incorrect path. Some loaders may want to interpret the filenames (the input location string in the load statement) in any way they wish and may want Pig to not make absolute paths out of them. There are a few options to address this: 1)A command line switch to indicate to Pig that pathnames in the script are all absolute and hence Pig should not alter them and pass them as-is to Loaders and Storers. 2)A keyword in the load and store statements to indicate the same intent to pig 3)A property which users can supply on cmdline or in pig.properties to indicate the same intent. 4)A method in LoadFunc - relativeToAbsolutePath(String filename, String curDir) which does the conversion to absolute - this way Loader can chose to implement it as a noop. Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader
[ https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729859#action_12729859 ] Milind Bhandarkar commented on PIG-879: --- I see some long term issues with all the approaches/options. First, not all loaders require a path. (e.g. DBLoader) Some paths (e.g. hftp:// or hsftp://) do not have a notion of relative or absolute. Indeed, the right way to fix this is to change the syntax of load and store statements, so that the loader itself deals with the path handling, and not pig. Second, take out copyToLocal, cp, mv, and all the dfs shell functionality from pig. These are side effects and impose a barrier for optimization. In the current form, they do not belong in a dataflow language. Grunt could still support it. Pig should provide a way for input location string in load statement to be passed as-is to the Loader - Key: PIG-879 URL: https://issues.apache.org/jira/browse/PIG-879 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Due to multiquery optimization, Pig always converts the filenames to absolute URIs (see http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section about Incompatible Changes - Path Names and Schemes). This is necessary since the script may have cd .. statements between load or store statements and if the load statements have relative paths, we would need to convert to absolute paths to know where to load/store from. To do this QueryParser.massageFilename() has the code below[1] which basically gives the fully qualified hdfs path However the issue with this approach is that if the filename string is something like hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2, the code below[1] actually translates this to hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2 and throws an exception that it is an incorrect path. Some loaders may want to interpret the filenames (the input location string in the load statement) in any way they wish and may want Pig to not make absolute paths out of them. There are a few options to address this: 1)A command line switch to indicate to Pig that pathnames in the script are all absolute and hence Pig should not alter them and pass them as-is to Loaders and Storers. 2)A keyword in the load and store statements to indicate the same intent to pig 3)A property which users can supply on cmdline or in pig.properties to indicate the same intent. 4)A method in LoadFunc - relativeToAbsolutePath(String filename, String curDir) which does the conversion to absolute - this way Loader can chose to implement it as a noop. Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.