Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

2012-11-24 Thread Radim Kolar
We have not discussed the advantages of stand-alone Python vs. a
Jython-in-Maven POM:


http://code.google.com/p/jy-maven-plugin/

The language is about the same, and it does not need to be installed, which
is an advantage on Windows.


Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

2012-11-24 Thread Konstantin Boudnik
If we decide to go with Maven then there's no point in complicating the
picture with Jython. This time I will keep the offensive remarks about *ython
to myself ;)

Cos

On Sat, Nov 24, 2012 at 10:26PM, Radim Kolar wrote:
 we have not discussed the advantages of stand-alone Python vs. a
 Jython-in-Maven POM:
 
 http://code.google.com/p/jy-maven-plugin/
 
 the language is about the same, and it does not need to be installed,
 which is an advantage on Windows.


Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

2012-11-23 Thread Radim Kolar

The discussion seems to have ended; let's start the vote.


Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

2012-11-22 Thread Steve Loughran
On 22 November 2012 02:40, Chris Nauroth cnaur...@hortonworks.com wrote:


 It seems like the trickiest issue is preservation of permissions and
 symlinks in tar files.  I suspect that any JVM-based solution like custom
 Maven plugins, Groovy, or jtar would be limited in this respect.  According
 to Ant documentation, it's a JDK limitation, so I suspect all of these
 would have the same problem.  I haven't tried any of them though.  (If
 there was a feasible solution, then Ant likely would have incorporated it
 long ago.)  If anyone wants to try though, we might learn something from
 that.

 Thank you,
 --Chris


You are limited by what File.canRead(), canWrite() and canExecute() tell you.

The absence of a way to detect file permissions in Java is down to the
lowest-common-denominator approach of the Java FS APIs, which support FAT32
(odd case logic, no perms or symlinks), NTFS (odd case logic, ACLs over
perms, symlinks historically very hard to create), HFS+ (a case-insensitive
Unix FS!) as well as classic Unixy filesystems.
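By contrast, a scripting runtime like Python exposes the full POSIX mode bits directly, which is exactly what the JavaFS API hides. A minimal sketch (POSIX-only behaviour; on Windows, os.chmod can only toggle the read-only bit):

```python
import os
import stat
import tempfile

# Create a scratch file and give it a specific POSIX mode.
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o754)  # rwxr-xr--

# os.stat exposes the full permission bits, not just a coarse
# readable/writable/executable answer like java.io.File does.
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # 0o754 on POSIX systems

os.remove(path)
```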

Ant's tarfileset lets you specify permissions on the filesets you pull into
the tar; they are applied cross-platform, which is the other reason why you
declare them in the tar: you keep the ability to generate proper tar files
even if you build on a Windows box.

Symlinks are problematic: even detecting them cross-platform is pretty
unreliable. To really support them you'd need to add a new symlinkfileset
entity for tar that would take the link declaration. I can imagine how to do
that, and if it were stuck into the hadoop tools JAR, it wouldn't even depend
on a new version of Ant.
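For comparison, Python's standard-library tarfile module (which a python-based build step could use) records both the mode bits and symlinks natively; this is precisely the gap being discussed for the JDK-based tools. A hedged sketch, with file names invented for illustration:

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
script = os.path.join(workdir, "start.sh")
with open(script, "w") as f:
    f.write("#!/bin/sh\necho hello\n")
os.chmod(script, 0o755)                               # set the executable bit
os.symlink("start.sh", os.path.join(workdir, "run"))  # relative symlink

archive = os.path.join(workdir, "dist.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    # tarfile preserves the permission bits and stores symlinks
    # as symlinks rather than following them.
    tar.add(script, arcname="bin/start.sh")
    tar.add(os.path.join(workdir, "run"), arcname="bin/run")

with tarfile.open(archive) as tar:
    info = {m.name: m for m in tar.getmembers()}
    print(oct(info["bin/start.sh"].mode))  # 0o755
    print(info["bin/run"].issym())         # True
    print(info["bin/run"].linkname)        # start.sh
```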

Maven just adds extra layers in the way.

-Steve


Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

2012-11-22 Thread Steve Loughran
On 21 November 2012 19:15, Matt Foley ma...@apache.org wrote:

 This discussion started in


 Those of us involved in the branch-1-win port of Hadoop to Windows without
 use of Cygwin, have faced the issue of frequent use of shell scripts
 throughout the system, both in build time (eg, the utility
 saveVersion.sh),
 and run time (config files like hadoop-env.sh and the start/stop scripts
 in bin/* ).  Similar usages exist throughout the Hadoop stack, in all
 projects.

 The vast majority of these shell scripts do not do anything platform
 specific; they can be expressed in a posix-conforming way.  Therefore, it
 seems to us that it makes sense to start using a cross-platform scripting
 language, such as python, in place of shell for these purposes.  For those
 rare occasions where platform-specific functionality really is needed,
 python also supports quite a lot of platform-specific functionality on both
 Linux and Windows; but where that is inadequate, one could still
 conditionally invoke a platform-specific module written in shell (for
 Linux/*nix) or powershell or bat (for Windows).
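The conditional dispatch described above could be sketched in Python along these lines; the hook names and script locations here are hypothetical, not part of the proposal:

```python
import platform
import subprocess

def platform_hook_command(name, system=None):
    """Build the command line for a platform-specific hook script.
    The hook locations (hooks/*.sh, hooks\\*.ps1) are hypothetical."""
    system = system or platform.system()
    if system == "Windows":
        return ["powershell", "-ExecutionPolicy", "Bypass",
                "-File", "hooks\\%s.ps1" % name]
    return ["sh", "hooks/%s.sh" % name]

def run_platform_hook(name):
    # Delegate the real work to shell on Unix-likes, PowerShell on Windows.
    return subprocess.call(platform_hook_command(name))
```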

 The primary motive for moving to a cross-platform scripting language is
 maintainability.  The alternative would be to maintain two complete suites
 of scripts, one for Linux and one for Windows (and perhaps others in the
 future).  We want to avoid the need to update dual modules in two different
 languages when functionality changes, especially given that many Linux
 developers are not familiar with powershell or bat, and many Windows
 developers are not familiar with shell or bash.


I'd argue that a lot of Hadoop java developers aren't that familiar with
bash. It's only in the last six months that I've come to hate it properly.

In the Ant project, it was the launcher scripts that had the worst
bug-report:line ratio, because of:
 - variations in .sh behaviour, especially under Cygwin, but also shells
that weren't bash (AIX, ...)
 - reliance on the entire Unix command set for real work
 - variants in the parameters/behaviour of those commands between Linux and
other widely used Unix systems (e.g. OS X)
 - lack of inclusion of the .sh scripts in the JUnit test suite
 - lack of understanding of bash.

In the ant project we added a Python launcher in, what, 2001, based on the
Perl launcher supplied by one steve_l@users.sourceforge


 For run-time, there is likely to be a lot more discussion.  Lots of folks,
 including me, aren't real happy with use of active scripts for
 configuration, and various others, including I believe some of the Bigtop
 folks, have issues with the way the start/stop scripts work.  Nevertheless,
 all those scripts exist today and are widely used.  And they present an
 impediment to porting to Windows-without-cygwin.


They're a maintenance and support cost on Unix: too many scripts (even more
in YARN), weakly-nondeterministic logic for loading env variables,
especially between init.d and bin/hadoop, and not much in the way of
diagnostics. And as with Ant, a relatively under-comprehended language with
no unit test coverage.

I'd replace the bash logic with Python for Unix dev and maintenance alone.
You could put your logic into a shared Python module in /usr/lib/hadoop/bin
and have PyUnit test the inner functions as part of the build and test
process (Jenkins).
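A sketch of what that could look like; the function below is invented for illustration and is not actual Hadoop code, just the kind of inner logic bin/hadoop currently computes in bash:

```python
import os
import unittest

def build_classpath(entries):
    """Join non-empty classpath entries with the platform's path separator.
    (A stand-in for the kind of inner function bin/hadoop builds in bash.)"""
    return os.pathsep.join(e for e in entries if e)

class ClasspathTest(unittest.TestCase):
    def test_skips_empty_entries(self):
        cp = build_classpath(["/usr/lib/hadoop/a.jar", "", "conf"])
        self.assertEqual(cp.split(os.pathsep),
                         ["/usr/lib/hadoop/a.jar", "conf"])

if __name__ == "__main__":
    unittest.main(exit=False)
```

Because the function is pure Python with no shell involved, the same test passes unchanged on Linux and Windows (os.pathsep handles the `:` vs `;` difference).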



 Nothing about run-time use of scripts has changed significantly over the
 past three years, and I don't think we should hold up the Windows port
 while we have a huge discussion about issues that veer dangerously into
 religious/aesthetic domains. It would be fun to have that discussion, but I
 don't want this decision to be dependent on it!


With YARN it's got more complex: more env variables to set, and more support
calls when they aren't.


 So I propose that we go ahead and also approve python as a run-time
 dependency, and allow the inclusion of python scripts in place of current
 shell-based functionality.  The unpleasant alternative is to spawn a bunch
 of powershell scripts in parallel to the current shell scripts, with a very
 negative impact on maintainability.  The Windows port must, after all, be
 allowed to proceed.


+1 to any vote to allow .py at run time as a new feature

=0 to ripping out and replacing the existing .sh scripts with python code,
as even though I don't like the scripts, replacing them could be traumatic
downstream.

+1 to a gradual migration to .py for new code, starting with the yarn
scripts.


Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

2012-11-21 Thread Alejandro Abdelnur
Hey Matt,

We already require java/mvn/protoc/cmake/forrest (forrest is hopefully on
its way out with the move of docs to APT)

Why not do a maven-plugin to do that?

Colin already has something to simplify all the cmake calls from the builds
using a maven-plugin (https://issues.apache.org/jira/browse/HADOOP-8887)

We could do the same with protoc, thus simplifying the POMs.

The saveVersion.sh seems like another prime candidate for a maven plugin,
and in this case it would not require external tools.

Does this make sense?

Thx

On Wed, Nov 21, 2012 at 11:15 AM, Matt Foley ma...@apache.org wrote:

 This discussion started in
 HADOOP-8924 (https://issues.apache.org/jira/browse/HADOOP-8924), where it
 was proposed to replace the build-time utility saveVersion.sh
 with a python script.  This would require Python as a build-time
 dependency.  Here's the background:

 Those of us involved in the branch-1-win port of Hadoop to Windows without
 use of Cygwin, have faced the issue of frequent use of shell scripts
 throughout the system, both in build time (eg, the utility
 saveVersion.sh),
 and run time (config files like hadoop-env.sh and the start/stop scripts
 in bin/* ).  Similar usages exist throughout the Hadoop stack, in all
 projects.

 The vast majority of these shell scripts do not do anything platform
 specific; they can be expressed in a posix-conforming way.  Therefore, it
 seems to us that it makes sense to start using a cross-platform scripting
 language, such as python, in place of shell for these purposes.  For those
 rare occasions where platform-specific functionality really is needed,
 python also supports quite a lot of platform-specific functionality on both
 Linux and Windows; but where that is inadequate, one could still
 conditionally invoke a platform-specific module written in shell (for
 Linux/*nix) or powershell or bat (for Windows).

 The primary motive for moving to a cross-platform scripting language is
 maintainability.  The alternative would be to maintain two complete suites
 of scripts, one for Linux and one for Windows (and perhaps others in the
 future).  We want to avoid the need to update dual modules in two different
 languages when functionality changes, especially given that many Linux
 developers are not familiar with powershell or bat, and many Windows
 developers are not familiar with shell or bash.

 Regarding the choice of python:

   - There are already a few instances of python usage in Hadoop, such as
   the utility (currently broken) relnotes.py, and massive usage of python
   in the examples/ and contrib/ directories.
   - Python is also used in Bigtop at build-time.
   - The Python language is available for free on essentially all
   platforms, under an Apache-compatible license
   (http://www.apache.org/legal/resolved.html).
   - It is supported in Eclipse and similar IDEs.
   - Most importantly, it is widely accepted as a reasonably good OO
   scripting language, and it is easily learned by anyone who already knows
   shell or perl, or other common scripting languages.
   - On the Tiobe index of programming language popularity
   (http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html),
   which seeks to measure the relative number of software engineers who
   know and use each language, Python far exceeds Perl and Ruby. The only
   more well-known scripting languages are PHP and Visual Basic, neither of
   which seems a prime candidate for this use.

 For build-time usage, I think we should immediately approve python as a
 build-time dependency, and allow people who are motivated to do so, to open
 jiras for migrating existing build-time shell scripts to python.
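As an illustration of such a migration, a python replacement for saveVersion.sh might look roughly like this; the field names, default version string, and generated Java shape are illustrative sketches, not the actual script:

```python
import getpass
import subprocess
import time

def get_version_info(version="1.2.0-SNAPSHOT"):
    """Collect build metadata roughly as saveVersion.sh does.
    The fields and default version string are illustrative."""
    try:
        rev = subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip()
    except (OSError, subprocess.CalledProcessError):
        rev = "unknown"   # not in a git checkout, or git not installed
    try:
        user = getpass.getuser()
    except Exception:
        user = "unknown"
    return {"version": version, "revision": rev, "user": user,
            "date": time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime())}

def render_java(info):
    # Emit a Java constants fragment; the real script generates an
    # annotated package-info.java, so this shape is only illustrative.
    lines = ["public interface VersionInfoConstants {"]
    for key, value in sorted(info.items()):
        lines.append('  String %s = "%s";' % (key.upper(), value))
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_java(get_version_info()))
```

The same script runs unchanged on Linux and Windows, which is the maintainability argument being made.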

 For run-time, there is likely to be a lot more discussion.  Lots of folks,
 including me, aren't real happy with use of active scripts for
 configuration, and various others, including I believe some of the Bigtop
 folks, have issues with the way the start/stop scripts work.  Nevertheless,
 all those scripts exist today and are widely used.  And they present an
 impediment to porting to Windows-without-cygwin.

 Nothing about run-time use of scripts has changed significantly over the
 past three years, and I don't think we should hold up the Windows port
 while we have a huge discussion about issues that veer dangerously into
 religious/aesthetic domains. It would be fun to have that discussion, but I
 don't want this decision to be dependent on it!

 So I propose that we go ahead and also approve python as a run-time
 dependency, and allow the inclusion of python scripts in place of current
 shell-based functionality.  The unpleasant alternative is to spawn a bunch
 of powershell scripts in parallel to the current shell scripts, with a very
 negative impact on maintainability.  The Windows port must, after all, be
 allowed to proceed.

 Let's have a discussion, and then I'll put both issues, separately, to a
 vote (unless we miraculously achieve consensus without a vote :-)

 I 

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

2012-11-21 Thread Alejandro Abdelnur
Got it, thx.

BTW, for branch-1, how about doing an Ant task as part of the build that
does that?

Thx



On Wed, Nov 21, 2012 at 11:44 AM, Matt Foley mfo...@hortonworks.com wrote:

 Hi Alejandro,
 For build-time issues in branch-2 and beyond, this may make sense (although
 I'm concerned about obscuring functionality in a way that only maven
 experts will be able to understand).  In the particular case of
 saveVersion.sh, I'd be happy to see it done automatically by the build
 tools.

 However, for build-time issues in the non-mavenized branch-1, and for
 run-time issues in both worlds, the need for cross-platform scripting
 remains.

 Thanks,
 --Matt

 On Wed, Nov 21, 2012 at 11:25 AM, Alejandro Abdelnur t...@cloudera.com
 wrote:

  Hey Matt,
 
  We already require java/mvn/protoc/cmake/forrest (forrest is hopefully on
  its way out with the move of docs to APT)
 
  Why not do a maven-plugin to do that?
 
  Colin already has something to simplify all the cmake calls from the
 builds
  using a maven-plugin (https://issues.apache.org/jira/browse/HADOOP-8887)
 
  We could do the same with protoc, thus simplifying the POMs.
 
  The saveVersion.sh seems like another prime candidate for a maven plugin,
  and in this case it would not require external tools.
 
  Does this make sense?
 
  Thx
 
  --
  Alejandro
 




-- 
Alejandro


Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

2012-11-21 Thread Chris Nauroth
Sorry, to clarify my point a little more, Ant does allow you to make
declarations to explicitly set the desired file permissions via the
fileMode attribute of a tarfileset.  However, it does not have the
capability to preserve whatever permissions were naturally created on files
earlier in the build process.  This is a difference in maintainability, as
adding new files to the build may then require extra maintenance of the Ant
directives to apply the desired fileMode.  This is an easy thing to
overlook.  A solution that preserves the natural permissions requires less
maintenance overhead.

I couldn't find a way to make the assembly plugin preserve permissions like
this either. It just has explicit fileMode directives similar to Ant's.
(Let me know if I missed something though.)

To see symlinks show up in distribution tarballs, you need to build with
the native components, like libhadoop.so or bundled Snappy.

Thanks,
--Chris


On Wed, Nov 21, 2012 at 1:30 PM, Radim Kolar h...@filez.com wrote:

 On 21.11.2012 22:03, Chris Nauroth wrote:

  For creation of the distribution tarballs, the Maven
 Ant Plugin (and actually the underlying Ant tool) cannot preserve file
 permissions or symlinks.

 The Maven Assembly Plugin can deal with file permissions; I'm not sure
 about symlinks. I do not remember the dist tar having symlinks inside.



Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

2012-11-21 Thread Chris Nauroth
This predates me, so I don't know the rationale for repackaging Tomcat
inside HTTPFS.  I suspect that there was a desire to create a fully
stand-alone distribution package, including a full web server.  The Maven
Jetty plugin isn't directly applicable to this use case.  I don't know why
it was decided to use Tomcat instead of Jetty.  (If anyone else out there
has the background, please respond.)  Regardless, if the desire is to
package a full web server instead of just the war, then switching to Jetty
would not change the challenges of the build process.  We'd still need to
preserve whatever permissions are present in the Jetty distribution.

In general, when I was working on this, I did not question whether the
current packaging was correct.  I assumed that whatever changes I made
for Windows compatibility must yield the exact same distribution without
changes on currently supported platforms like Linux.  If there are
questions around actually changing the output of the build process, then
that will steer the conversation in another direction and increase the
scope of this effort.

It seems like the trickiest issue is preservation of permissions and
symlinks in tar files.  I suspect that any JVM-based solution like custom
Maven plugins, Groovy, or jtar would be limited in this respect.  According
to Ant documentation, it's a JDK limitation, so I suspect all of these
would have the same problem.  I haven't tried any of them though.  (If
there was a feasible solution, then Ant likely would have incorporated it
long ago.)  If anyone wants to try though, we might learn something from
that.

Thank you,
--Chris


On Wed, Nov 21, 2012 at 5:55 PM, Radim Kolar h...@filez.com wrote:

 On 22.11.2012 1:14, Chris Nauroth wrote:

  The trickiest maintenance issue is hadoop-hdfs-httpfs, where we unpack
 and repack a Tomcat.

 Why isn't it possible to just ship the WAR file? It seems to be a
 special-purpose app, and it needs hands-on security setup anyway, plus
 integration with the existing firewall/web infrastructure.

 Did you consider using Jetty? It has really good Maven support:
 http://wiki.eclipse.org/Jetty/Feature/Jetty_Maven_Plugin
 I am using Jetty 8 instead of Tomcat and run it with java -jar start.jar;
 no extra file permissions like the x bit are needed.

 If you really need to create the tar by hand, there is a Java library for
 doing it (http://code.google.com/p/jtar/), and it can be used from any
 JVM-based scripting language, so you have plenty of choices.