Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
We have not discussed the advantages of standalone Python vs. Jython driven from a Maven POM (http://code.google.com/p/jy-maven-plugin/). The language is about the same, and it does not need to be separately installed, which is an advantage on Windows.
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
If we decide to go with Maven then there's no point in complicating the picture with Jython. This time I will keep the offensive about *yton to myself ;) Cos

On Sat, Nov 24, 2012 at 10:26 PM, Radim Kolar wrote: We have not discussed the advantages of standalone Python vs. Jython driven from a Maven POM (http://code.google.com/p/jy-maven-plugin/). The language is about the same, and it does not need to be separately installed, which is an advantage on Windows.
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
The discussion seems to have ended; let's start the vote.
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
On 22 November 2012 02:40, Chris Nauroth cnaur...@hortonworks.com wrote: It seems like the trickiest issue is preservation of permissions and symlinks in tar files. I suspect that any JVM-based solution like custom Maven plugins, Groovy, or jtar would be limited in this respect. According to the Ant documentation, it's a JDK limitation, so I suspect all of these would have the same problem. I haven't tried any of them, though. (If there were a feasible solution, Ant would likely have incorporated it long ago.) If anyone wants to try, though, we might learn something from that. Thank you, --Chris

You are limited by what File.canRead(), canWrite() and canExecute() tell you. The absence of a way to detect file permissions in Java is a consequence of the lowest-common-denominator approach of the Java FS APIs, which have to support FAT32 (odd case logic, no perms or symlinks), NTFS (odd case logic, ACLs over perms, symlinks historically very hard to create), HFS+ (a case-insensitive Unix fs!), as well as classic Unixy filesystems.

Ant's tarfileset lets you specify permissions on the filesets you pull into the tar; they are generated cross-platform, which is the other reason you declare them in the tar - you can still generate proper tar files even from a Windows box.

Symlinks are problematic - even detecting them cross-platform is pretty unreliable. To really do them you'd need to add a new symlinkfileset entity for tar that would take the link declaration. I could imagine how to do that - and if it were stuck into the hadoop tools JAR, it wouldn't even depend on a new version of Ant. Maven just adds extra layers in the way. -Steve
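As a minimal sketch of what the proposed Python dependency buys on this specific point: the standard library's os.lstat() exposes the full POSIX mode bits and detects symlinks without following them, which is exactly what java.io.File cannot report. The path in the example is purely illustrative.

```python
import os
import stat

def describe(path):
    """Return the full permission bits and symlink status of a path.

    java.io.File only exposes canRead()/canWrite()/canExecute(); os.lstat()
    returns the complete POSIX mode, and because lstat does not follow
    links, symlinks can be detected and their targets read reliably.
    """
    st = os.lstat(path)
    info = {
        "mode": oct(stat.S_IMODE(st.st_mode)),   # e.g. 0755
        "is_symlink": stat.S_ISLNK(st.st_mode),
    }
    if info["is_symlink"]:
        info["target"] = os.readlink(path)
    return info

if __name__ == "__main__":
    # Illustrative path; any file in a build tree would do.
    print(describe("bin/hadoop"))
```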
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
On 21 November 2012 19:15, Matt Foley ma...@apache.org wrote: This discussion started in HADOOP-8924. Those of us involved in the branch-1-win port of Hadoop to Windows without use of Cygwin have faced the issue of frequent use of shell scripts throughout the system, both at build time (e.g., the utility saveVersion.sh) and at run time (config files like hadoop-env.sh and the start/stop scripts in bin/*). Similar usages exist throughout the Hadoop stack, in all projects. The vast majority of these shell scripts do not do anything platform-specific; they can be expressed in a posix-conforming way. Therefore, it seems to us that it makes sense to start using a cross-platform scripting language, such as python, in place of shell for these purposes. For those rare occasions where platform-specific functionality really is needed, python also supports quite a lot of platform-specific functionality on both Linux and Windows; but where that is inadequate, one could still conditionally invoke a platform-specific module written in shell (for Linux/*nix) or powershell or bat (for Windows). The primary motive for moving to a cross-platform scripting language is maintainability. The alternative would be to maintain two complete suites of scripts, one for Linux and one for Windows (and perhaps others in the future). We want to avoid the need to update dual modules in two different languages when functionality changes, especially given that many Linux developers are not familiar with powershell or bat, and many Windows developers are not familiar with shell or bash.

I'd argue that a lot of Hadoop Java developers aren't that familiar with bash either. It's only in the last six months that I've come to hate it properly. In the Ant project, it was the launcher scripts that had the worst bugrep:line ratio, because of:
- variations in .sh behaviour, especially under Cygwin, but also in things that weren't bash (AIX, ...)
- requirements on the entire Unix command set for real work
- variants in the parameters/behaviour of those commands between Linux and other widely used Unix systems (e.g. OS X)
- lack of inclusion of the .sh scripts in the JUnit test suite
- lack of understanding of bash.
In the Ant project we added a Python launcher in, what, 2001, based on the Perl launcher supplied by one steve_l@users.sourceforge

For run-time, there is likely to be a lot more discussion. Lots of folks, including me, aren't real happy with use of active scripts for configuration, and various others, including I believe some of the Bigtop folks, have issues with the way the start/stop scripts work. Nevertheless, all those scripts exist today and are widely used. And they present an impediment to porting to Windows-without-Cygwin.

They're a maintenance and support cost on Unix too. Too many scripts, even more in YARN; weakly nondeterministic logic for loading env variables, especially between init.d and bin/hadoop; not much diagnostics. And, as with Ant, a relatively under-comprehended language with no unit test coverage. I'd replace the bash logic with python for Unix dev and maintenance alone. You could put your logic into a shared python module in /usr/lib/hadoop/bin, and have PyUnit test the inner functions as part of the build and test process (Jenkins); see the sketch below.

Nothing about run-time use of scripts has changed significantly over the past three years, and I don't think we should hold up the Windows port while we have a huge discussion about issues that veer dangerously into religious/aesthetic domains. It would be fun to have that discussion, but I don't want this decision to be dependent on it!

With YARN it's got more complex: more env variables to set, more support calls when they aren't.

So I propose that we go ahead and also approve python as a run-time dependency, and allow the inclusion of python scripts in place of current shell-based functionality. The unpleasant alternative is to spawn a bunch of powershell scripts in parallel to the current shell scripts, with a very negative impact on maintainability. The Windows port must, after all, be allowed to proceed.

+1 to any vote to allow .py at run time as a new feature.
=0 to ripping out and replacing the existing .sh scripts with python code; even though I don't like the scripts, replacing them could be traumatic downstream.
+1 to a gradual migration to .py for new code, starting with the yarn scripts.
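To make the "shared python module plus PyUnit" idea concrete, here is a minimal sketch. The file name, the helper function and the heap-size rule are hypothetical, not existing Hadoop code; the 1000 MB default only mirrors the commented-out value in hadoop-env.sh for illustration.

```python
# hadoop_env_test.py - hypothetical example: a helper that could live in a
# shared module under /usr/lib/hadoop/bin, with PyUnit coverage in the same
# file so it can run as part of the build (e.g. under Jenkins).
import unittest

def resolve_heap_opts(env):
    """Return JVM heap options, preferring an explicit HADOOP_HEAPSIZE.

    The default of 1000 MB mirrors the value documented in hadoop-env.sh;
    the exact rule here is only for illustration.
    """
    size = env.get("HADOOP_HEAPSIZE", "1000")
    return "-Xmx%sm" % size

class ResolveHeapOptsTest(unittest.TestCase):
    def test_default_when_unset(self):
        self.assertEqual("-Xmx1000m", resolve_heap_opts({}))

    def test_explicit_override(self):
        self.assertEqual("-Xmx2048m", resolve_heap_opts({"HADOOP_HEAPSIZE": "2048"}))

if __name__ == "__main__":
    unittest.main()
```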
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Hey Matt, We already require java/mvn/protoc/cmake/forrest (forrest is hopefully on its way out with the move of the docs to APT). Why not do a maven-plugin to do that? Colin already has something to simplify all the cmake calls from the builds using a maven-plugin (https://issues.apache.org/jira/browse/HADOOP-8887). We could do the same with protoc, thus simplifying the POMs. saveVersion.sh seems like another prime candidate for a maven plugin, and in this case it would not require external tools. Does this make sense? Thx

On Wed, Nov 21, 2012 at 11:15 AM, Matt Foley ma...@apache.org wrote: This discussion started in HADOOP-8924 (https://issues.apache.org/jira/browse/HADOOP-8924), where it was proposed to replace the build-time utility saveVersion.sh with a python script. This would require Python as a build-time dependency. Here's the background:

Those of us involved in the branch-1-win port of Hadoop to Windows without use of Cygwin have faced the issue of frequent use of shell scripts throughout the system, both at build time (e.g., the utility saveVersion.sh) and at run time (config files like hadoop-env.sh and the start/stop scripts in bin/*). Similar usages exist throughout the Hadoop stack, in all projects. The vast majority of these shell scripts do not do anything platform-specific; they can be expressed in a posix-conforming way. Therefore, it seems to us that it makes sense to start using a cross-platform scripting language, such as python, in place of shell for these purposes. For those rare occasions where platform-specific functionality really is needed, python also supports quite a lot of platform-specific functionality on both Linux and Windows; but where that is inadequate, one could still conditionally invoke a platform-specific module written in shell (for Linux/*nix) or powershell or bat (for Windows).

The primary motive for moving to a cross-platform scripting language is maintainability. The alternative would be to maintain two complete suites of scripts, one for Linux and one for Windows (and perhaps others in the future). We want to avoid the need to update dual modules in two different languages when functionality changes, especially given that many Linux developers are not familiar with powershell or bat, and many Windows developers are not familiar with shell or bash.

Regarding the choice of python:
- There are already a few instances of python usage in Hadoop, such as the utility (currently broken) relnotes.py, and massive usage of python in the examples/ and contrib/ directories.
- Python is also used at Bigtop build time.
- The Python language is available for free on essentially all platforms, under an Apache-compatible license (http://www.apache.org/legal/resolved.html).
- It is supported in Eclipse and similar IDEs.
- Most importantly, it is widely accepted as a reasonably good OO scripting language, and it is easily learned by anyone who already knows shell or perl, or other common scripting languages.
- On the TIOBE index of programming language popularity (http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html), which seeks to measure the relative number of software engineers who know and use each language, Python far exceeds Perl and Ruby. The only more well-known scripting languages are PHP and Visual Basic, neither of which seems a prime candidate for this use.
For build-time usage, I think we should immediately approve python as a build-time dependency, and allow people who are motivated to do so to open jiras for migrating existing build-time shell scripts to python.

For run-time, there is likely to be a lot more discussion. Lots of folks, including me, aren't real happy with use of active scripts for configuration, and various others, including I believe some of the Bigtop folks, have issues with the way the start/stop scripts work. Nevertheless, all those scripts exist today and are widely used. And they present an impediment to porting to Windows-without-Cygwin. Nothing about run-time use of scripts has changed significantly over the past three years, and I don't think we should hold up the Windows port while we have a huge discussion about issues that veer dangerously into religious/aesthetic domains. It would be fun to have that discussion, but I don't want this decision to be dependent on it!

So I propose that we go ahead and also approve python as a run-time dependency, and allow the inclusion of python scripts in place of current shell-based functionality. The unpleasant alternative is to spawn a bunch of powershell scripts in parallel to the current shell scripts, with a very negative impact on maintainability. The Windows port must, after all, be allowed to proceed. Let's have a discussion, and then I'll put both issues, separately, to a vote (unless we miraculously achieve consensus without a vote :-)
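For the build-time case, a rough sketch of what a cross-platform replacement for saveVersion.sh could look like. This is not the actual HADOOP-8924 patch; the output format, field names and use of git are all assumptions for illustration.

```python
#!/usr/bin/env python
# Hypothetical sketch of a cross-platform saveVersion replacement; the output
# format, fields and use of git are illustrative, not the HADOOP-8924 patch.
import getpass
import subprocess
import sys
import time

def run(cmd):
    """Run a command and return its stripped stdout, or 'unknown' on failure."""
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                universal_newlines=True)
        out, _ = proc.communicate()
        return out.strip() or "unknown"
    except OSError:
        return "unknown"

def main(version, out_path):
    # The same calls work on Linux, OS X and Windows as long as git is on PATH.
    revision = run(["git", "rev-parse", "HEAD"])
    branch = run(["git", "rev-parse", "--abbrev-ref", "HEAD"])
    with open(out_path, "w") as out:
        out.write("version=%s\n" % version)
        out.write("revision=%s\n" % revision)
        out.write("branch=%s\n" % branch)
        out.write("user=%s\n" % getpass.getuser())
        out.write("date=%s\n" % time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```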
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Got it, thx. BTW, for branch-1, how about doing an Ant task as part of the build that does that? Thx

On Wed, Nov 21, 2012 at 11:44 AM, Matt Foley mfo...@hortonworks.com wrote: Hi Alejandro, For build-time issues in branch-2 and beyond, this may make sense (although I'm concerned about obscuring functionality in a way that only maven experts will be able to understand). In the particular case of saveVersion.sh, I'd be happy to see it done automatically by the build tools. However, for build-time issues in the non-mavenized branch-1, and for run-time issues in both worlds, the need for cross-platform scripting remains. Thanks, --Matt

On Wed, Nov 21, 2012 at 11:25 AM, Alejandro Abdelnur t...@cloudera.com wrote: Hey Matt, We already require java/mvn/protoc/cmake/forrest (forrest is hopefully on its way out with the move of the docs to APT). Why not do a maven-plugin to do that? Colin already has something to simplify all the cmake calls from the builds using a maven-plugin (https://issues.apache.org/jira/browse/HADOOP-8887). We could do the same with protoc, thus simplifying the POMs. saveVersion.sh seems like another prime candidate for a maven plugin, and in this case it would not require external tools. Does this make sense? Thx -- Alejandro
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Sorry, to clarify my point a little more: Ant does allow you to make declarations to explicitly set the desired file permissions via the fileMode attribute of a tarfileset. However, it does not have the capability to preserve whatever permissions were naturally created on files earlier in the build process. This is a difference in maintainability, as adding new files to the build may then require extra maintenance of the Ant directives to apply the desired fileMode. This is an easy thing to overlook. A solution that preserves the natural permissions requires less maintenance overhead. I couldn't find a way to make the assembly plugin preserve permissions like this either; it just has explicit fileMode directives similar to Ant's. (Let me know if I missed something, though.) To see symlinks show up in distribution tarballs, you need to build with the native components, like libhadoop.so or bundled Snappy. Thanks, --Chris

On Wed, Nov 21, 2012 at 1:30 PM, Radim Kolar h...@filez.com wrote: On 21.11.2012 22:03, Chris Nauroth wrote: For creation of the distribution tarballs, the Maven Ant Plugin (and actually the underlying Ant tool) cannot preserve file permissions or symlinks.

The Maven assembly plugin can deal with file permissions; not sure about symlinks. I don't remember the dist tar having symlinks inside.
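This is the gap a Python-based packaging step could close: the standard tarfile module records whatever mode bits and symlinks already exist on disk when files are added, so no per-file fileMode declarations need to be maintained as the build grows. A minimal sketch, with the staging paths purely illustrative:

```python
import os
import tarfile

def make_dist_tarball(src_dir, tar_path):
    """Create a gzipped tar that keeps whatever permissions and symlinks
    already exist in the staged distribution directory.

    tarfile.add() records each file's current mode bits and symlink targets
    in the archive (links are not dereferenced by default), so nothing has
    to be declared per file as new content is added to the build.
    """
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))

if __name__ == "__main__":
    # Illustrative paths for a staged Hadoop distribution.
    make_dist_tarball("target/hadoop-dist", "target/hadoop-dist.tar.gz")
```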
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
This predates me, so I don't know the rationale for repackaging Tomcat inside HttpFS. I suspect that there was a desire to create a fully stand-alone distribution package, including a full web server. The Maven Jetty plugin isn't directly applicable to this use case. I don't know why it was decided to use Tomcat instead of Jetty. (If anyone else out there has the background, please respond.) Regardless, if the desire is to package a full web server instead of just the war, then switching to Jetty would not change the challenges of the build process. We'd still need to preserve whatever permissions are present in the Jetty distribution.

In general, when I was working on this, I did not question whether the current packaging was correct. I assumed that whatever changes I made for Windows compatibility must yield the exact same distribution, without changes, on currently supported platforms like Linux. If there are questions around actually changing the output of the build process, then that will steer the conversation in another direction and increase the scope of this effort.

It seems like the trickiest issue is preservation of permissions and symlinks in tar files. I suspect that any JVM-based solution like custom Maven plugins, Groovy, or jtar would be limited in this respect. According to the Ant documentation, it's a JDK limitation, so I suspect all of these would have the same problem. I haven't tried any of them, though. (If there were a feasible solution, Ant would likely have incorporated it long ago.) If anyone wants to try, though, we might learn something from that. Thank you, --Chris

On Wed, Nov 21, 2012 at 5:55 PM, Radim Kolar h...@filez.com wrote: On 22.11.2012 1:14, Chris Nauroth wrote: The trickiest maintenance issue is hadoop-hdfs-httpfs, where we unpack and repack a Tomcat.

Why is it not possible to just ship a WAR file? It seems to be a special-purpose app, and it needs hand-done security setup anyway plus integration with existing firewall/web infrastructure. Did you consider using Jetty? It has really good Maven support: http://wiki.eclipse.org/Jetty/Feature/Jetty_Maven_Plugin. I am using Jetty 8 instead of Tomcat and run it with java -jar start.jar; no extra file permissions like the x bit are needed. If you really need to create the tar by hand, there is a Java library for doing it - http://code.google.com/p/jtar/ - and it can be used from any JVM-based scripting language, so you have plenty of choices.
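If the Tomcat repack does stay in the build, a Python step could in principle handle the unpack/repack round trip without losing execute bits or links on POSIX platforms. A hedged sketch only: all paths, the archive name and the intermediate build step are assumptions, not the existing hadoop-hdfs-httpfs build.

```python
import tarfile

def repack_tomcat(src_tarball, dest_tarball, extract_dir):
    """Unpack a downloaded Tomcat tarball and repack it after the build has
    layered in the HttpFS war and configuration.

    extractall() restores the permission bits recorded in the source archive,
    and add() writes them (and any symlinks) back out, so execute bits on
    bin/*.sh survive the round trip on POSIX platforms; on Windows the mode
    bits are simply carried through the archive metadata.
    """
    with tarfile.open(src_tarball, "r:gz") as src:
        src.extractall(extract_dir)
    # ... the build would drop the war and config into extract_dir here ...
    with tarfile.open(dest_tarball, "w:gz") as dest:
        dest.add(extract_dir, arcname="tomcat")
```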