Re: spark support on windows

2017-01-16 Thread Steve Loughran

On 16 Jan 2017, at 11:06, Hyukjin Kwon wrote:

> Hi,
>
> I just looked through Jacek's page and I believe that is the correct way.
>
> That seems to be a Hadoop library specific issue[1]. To my knowledge,
> winutils and the binaries in the private repo are built by a Hadoop PMC
> member on a dedicated Windows VM, which I believe makes them pretty
> trustworthy.

thank you :)

I also check out and build the specific git commit SHA1 of the release, not any 
(moveable) tag, so my builds use sources identical to those of the matching 
releases.

> This can be compiled from source. If you think it is not reliable or not
> safe, you can build it yourself.
>
> I agree it would be great if there were documentation about this, as we
> have a weak promise for Windows[2] and I believe it always requires some
> overhead to install Spark on Windows. FWIW, in the case of SparkR, there
> is some documentation [3].
>
> For bundling it, it seems even Hadoop itself does not include this in its
> releases. I think documentation would be enough.

Really, Hadoop itself should be doing the release of the Windows binaries. It's 
just that it complicates the release process: the Linux build/test/release 
would have to be done, then the Windows binaries would somehow need to be built 
on another machine and mixed in. That's the real barrier: extra work. That 
said, maybe it's time.




> As for the many JIRAs, I am at least resolving them one by one.
>
> I hope my answer is helpful and makes sense.
>
> Thanks.
>
>
> [1] https://wiki.apache.org/hadoop/WindowsProblems
> [2] https://github.com/apache/spark/blob/f3a3fed76cb74ecd0f46031f337576ce60f54fb2/docs/index.md
> [3] https://github.com/apache/spark/blob/master/R/WINDOWS.md


> 2017-01-16 19:35 GMT+09:00 assaf.mendelson:
>
> > Hi,
> > In the documentation it says Spark is supported on Windows.
> > The problem, however, is that the documentation for Windows is lacking.
> > There are sources (such as
> > https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html
> > and many more) which explain how to make Spark run on Windows; however,
> > they all involve downloading a third-party winutils.exe file.
> > Since this file is downloaded from a repository belonging to a private
> > person, this can be an issue (e.g. when getting approval to install it
> > on a company computer).
> > There are tons of JIRA tickets on the subject (most are marked as
> > duplicate or not a problem); however, I believe that if we say Spark is
> > supported on Windows, there should be a clear explanation of how to run
> > it, and one shouldn’t have to use an executable from a private person.
> >
> > If indeed using winutils.exe is the correct solution, I believe it
> > should be bundled with the Spark binary distribution along with clear
> > instructions on how to add it.
> > Assaf.
> >
> > View this message in context: spark support on windows
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.




Re: spark support on windows

2017-01-16 Thread Steve Loughran

On 16 Jan 2017, at 10:35, assaf.mendelson wrote:

> Hi,
> In the documentation it says Spark is supported on Windows.
> The problem, however, is that the documentation for Windows is lacking.
> There are sources (such as
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html
> and many more) which explain how to make Spark run on Windows; however,
> they all involve downloading a third-party winutils.exe file.
> Since this file is downloaded from a repository belonging to a private
> person,

A repository belonging to me, ste...@apache.org

> this can be an issue (e.g. when getting approval to install it on a
> company computer).


As I'm a committer on the Hadoop PMC, those signed artifacts are no less 
trustworthy than anything you get from the ASF itself. They are clean builds 
from a Windows VM that is only ever used for build/test of Hadoop code, no 
other use at all; the VM is powered off most of its life. This actually makes 
it less of a security risk than the main desktop. And you can check the GPG 
signature of the artifacts to see they've not been tampered with.
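
As an illustration of that signature check, here is a minimal Python sketch that shells out to `gpg --verify`. It assumes GnuPG is installed and on the PATH, that the signer's public key has already been imported, and that the detached signature sits next to the binary; the file names `winutils.exe` and `winutils.exe.asc` and both function names are hypothetical placeholders, not taken from this thread.

```python
import subprocess
from pathlib import Path
from typing import List


def gpg_verify_command(artifact: Path, signature: Path) -> List[str]:
    """Build the argv for checking a detached GPG signature."""
    # `gpg --verify <signature> <file>` checks a detached signature.
    return ["gpg", "--verify", str(signature), str(artifact)]


def is_signature_valid(artifact: Path, signature: Path) -> bool:
    """Return True only if gpg is available and exits with status 0."""
    try:
        result = subprocess.run(
            gpg_verify_command(artifact, signature),
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # gpg itself is not installed or not on the PATH.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical file names; substitute the files you actually downloaded.
    exe = Path("winutils.exe")
    sig = Path("winutils.exe.asc")
    print("signature OK" if is_signature_valid(exe, sig) else "signature check FAILED")
```

A zero exit status only says the signature matches a key in the local keyring; deciding whether that key really belongs to the signer is a separate step.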

> There are tons of JIRA tickets on the subject (most are marked as
> duplicate or not a problem); however, I believe that if we say Spark is
> supported on Windows, there should be a clear explanation of how to run
> it, and one shouldn’t have to use an executable from a private person.

While I recognise your concerns, if I wanted to run code on your machines, rest 
assured, I wouldn't do it in such an obvious way.

I'd do it via transitive maven artifacts with a harmless name like 
"org.example.xml-unit-diags" which would do something useful except in the 
special case that it's running in your subnet, get a patch into a pom.xml to 
pull it into org.apache.hadoop somewhere, release a version of Hadoop with that 
dependency, then wait for it to propagate downstream into everything, including 
all those server farms running Linux only.

Writing a malicious Windows native executable would require me to write C/C++ 
Windows code, and I don't want to go there.

Of course, if I did any of these I'd be in trouble when caught, lose my job, 
never be trusted to submit a line of code to any OSS project, lose all my 
friends, etc, etc. I have nothing to gain by doing so.

If you really don't trust me, the instructions for building it are up online: 
build a Windows system for compiling Hadoop, check out the branch and then run

  mvn -T 1C package -Pdist -Dmaven.javadoc.skip=true -DskipTests

Or go to hortonworks.com, download the Windows version 
and lift the Windows binaries. Same thing, built on a release VM managed by a 
colleague.
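
Whichever route the binaries come from, Hadoop's Windows support looks for `winutils.exe` under the `bin` directory of the Hadoop home, normally given by the `HADOOP_HOME` environment variable. The following is a small Python sketch for checking that layout before launching Spark; the function name `find_winutils` is hypothetical and the check is only a convenience, not part of Hadoop or Spark.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_winutils(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the path to winutils.exe under HADOOP_HOME/bin, or None if absent."""
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        # HADOOP_HOME is unset or empty, so there is nowhere to look.
        return None
    candidate = Path(hadoop_home) / "bin" / "winutils.exe"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = find_winutils()
    if found is None:
        print("winutils.exe not found; set HADOOP_HOME so that bin\\winutils.exe exists")
    else:
        print(f"using {found}")
```

Passing the environment in as a mapping keeps the check easy to exercise without touching the real process environment.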


> If indeed using winutils.exe is the correct solution, I believe it should
> be bundled with the Spark binary distribution along with clear
> instructions on how to add it.

I recognise that it is good to question the provenance of every line of code 
executed on machines you care about. I am reasonably confident as to the 
quality of this code: given that it was a checkout & build of the ASF tagged 
release, then signed by me, a compromise would need either my VM corrupted, my 
VM's feed from the ASF HTTPS repo subverted by a fake SSL cert, or someone 
getting hold of my GPG key and GitHub keys and uploading something malicious in 
my name. Interestingly, that is a vulnerability, one I covered last year in my 
"Household infosec in a post-Sony era" talk: 
https://www.youtube.com/watch?v=tcRjG1CCrPs

You'll be pleased to know that the relevant keys now live on a yubikey, so even 
malicious code executed on my desktop cannot get the secrets off the 
(encrypted) local drive. It'd need physical access to the key, and I'd notice 
it was missing, revoke everything, etc, etc, making the risk of my keys being 
stolen low. That leaves the general problem of "our entire build process is 
based on the assumption that we trust the maven repositories and the people 
who wrote the JARs".

That's a far more serious problem than the provenance of a single exe file on 
GitHub.

-Steve


Re: spark support on windows

2017-01-16 Thread Hyukjin Kwon

Hi,

I just looked through Jacek's page and I believe that is the correct way.

That seems to be a Hadoop library specific issue[1]. To my knowledge,
winutils and the binaries in the private repo are built by a Hadoop PMC
member on a dedicated Windows VM, which I believe makes them pretty
trustworthy.
This can be compiled from source. If you think it is not reliable or not
safe, you can build it yourself.

I agree it would be great if there were documentation about this, as we
have a weak promise for Windows[2] and I believe it always requires some
overhead to install Spark on Windows. FWIW, in the case of SparkR, there
is some documentation [3].

For bundling it, it seems even Hadoop itself does not include this in its
releases. I think documentation would be enough.

As for the many JIRAs, I am at least resolving them one by one.

I hope my answer is helpful and makes sense.

Thanks.


[1] https://wiki.apache.org/hadoop/WindowsProblems
[2]
https://github.com/apache/spark/blob/f3a3fed76cb74ecd0f46031f337576ce60f54fb2/docs/index.md
[3] https://github.com/apache/spark/blob/master/R/WINDOWS.md


2017-01-16 19:35 GMT+09:00 assaf.mendelson :

> Hi,
>
> In the documentation it says Spark is supported on Windows.
>
> The problem, however, is that the documentation for Windows is lacking.
> There are sources (such as
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html
> and many more) which explain how to make Spark run on Windows; however,
> they all involve downloading a third-party winutils.exe file.
>
> Since this file is downloaded from a repository belonging to a private
> person, this can be an issue (e.g. when getting approval to install it on
> a company computer).
>
> There are tons of JIRA tickets on the subject (most are marked as
> duplicate or not a problem); however, I believe that if we say Spark is
> supported on Windows, there should be a clear explanation of how to run
> it, and one shouldn’t have to use an executable from a private person.
>
> If indeed using winutils.exe is the correct solution, I believe it should
> be bundled with the Spark binary distribution along with clear
> instructions on how to add it.
>
> Assaf.