Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Holden Karau
Can you send me the output of those two commands?

On Fri, May 1, 2020 at 1:46 PM Xiangyu Li  wrote:

> Hi Holden,
>
> Please check the second email of mine in this email chain. I did that
> originally and to quote my email:
>
>
> ===
> In the spark-2.4.5 src directory, I just did a simple:
>
> `./build/mvn -DskipTests clean package`
>
>
> And then went to the python directory and did:
>
>
> `python setup.py sdist` followed by `pip install
> dist/pyspark-2.4.5.tar.gz` (as mentioned in the make-distribution.sh.)
>
>
> *This ran into "error: package directory `deps/jars` does not exist".*
>
>
> ===
>
>
> So exactly as what you said, which is also one of the printout message in
> the make-distribution.sh script.
>
> On Fri, May 1, 2020 at 4:39 PM Holden Karau  wrote:
>
>> Your problem isn't the missing license per-se (that just happens to be
>> the first error).
>>
>> I don't believe that is the way we expect users to pip install the Python
>> library. pip will only install directories/targets underneath the directory
>> where setup.py, hence the deps directory which is constructed by setup.py
>> with a bunch of symlinks. It assumes that you are either building Spark
>> from source in which case you should follow it's instructions:
>>
>> To build Spark with maven you can run:
>>   ./build/mvn -DskipTests clean package
>> Building the source dist is done in the Python directory:
>>   cd python
>>   python setup.py sdist
>>   pip install dist/*.tar.gz
>>
>>
>> On Fri, May 1, 2020 at 1:32 PM Xiangyu Li  wrote:
>>
>>> make-distribution.sh with --pip would run a `python setup.py sdist`
>>> within that make-distribution.sh script.
>>> I also tested `make-distribution.sh` without --pip, and the same error
>>> happens.
>>>
>>> Correct me if I'm wrong, but pyspark binary has always been successfully
>>> built, it is the pyspark pip package that is failing.
>>>
>>> On Fri, May 1, 2020 at 4:23 PM Sean Owen  wrote:
>>>
>>>> Hm, others may have to chime in here. Either that's not how you create
>>>> the pyspark binary from the source release (make-distribution.sh doesn't do
>>>> that?) or there is a small but important issue here, that the source
>>>> release doesn't contain one thing that the binary release script expects,
>>>> which is LICENSE-binary et al. If it's the latter, we could move around the
>>>> LICENSE bits in the source tree so that both are "source" files included in
>>>> the source release, so you can make the binary release with it, but, I'd
>>>> probably say it's easier/better to simply skip adding the license in this
>>>> path (if it's supposed to work this way at all) as the use case, a custom
>>>> derived work, doesn't need the *ASF's* license statement.
>>>>
>>>>
>>>> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li  wrote:
>>>>
>>>>> To reproduce this, I just did
>>>>>
>>>>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>> tar xzf spark-2.4.5.tgz
>>>>> cd spark-2.4.5
>>>>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>>>>> mv spark-2.4.5-bin-custom-spark.tgz ../
>>>>> cd ..
>>>>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>>>>> cd spark-2.4.5-bin-custom-spark/python/
>>>>> sudo python setup.py install
>>>>>
>>>>> And here is the output:
>>>>> [image: image.png]
>>>>>
>>>>>
>>>>> On Fri, May 1, 2020 at 2:48 PM Sean Owen  wrote:
>>>>>
>>>>>> You wrote:
>>>>>>
>>>>>> "
>>>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>>>> install` inside the python directory.
>>>>>>
>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>> "
>>>>>>
>>>>>> That shouldn't depend on the license file, and the script you showed
>>>>>> does not fail when not present, so I am wondering what this means.
>>>>>> I'm not sure there's a JIRA here yet.
>>>>>>
>>>>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li  wrote:
>>>>>>
>>>>>>> Hmm, sorry I don't get what part of my email were you referring to
>>>>>>> when you said "the build fails?".
>>>>>>>
>>>>>>> So I am trying to build a custom spark binary distribution with,
>>>>>>> say, different Hadoop versions and R support.
>>>>>>>
>>>>>>> Then I stored this custom build on S3, so as I am building more
>>>>>>> machines I can just directly download this custom build from S3. But
>>>>>>> besides spark-submit and what not, I also wanted to install the pyspark
>>>>>>> python package to the machine I am building.
>>>>>>>
>>>>>>> The lack of the LICENSE file in the custom build would prevent
>>>>>>> pyspark from being successfully built.
>>>>>>>
>>>>>>> Hopefully this answers your question.
>>>>>>>
>>>>>>> The second part of my last email was about building pyspark inside
>>>>>>> spark source directory, I will raise an issue on Jira for that, as it is
>>>>>>> more of a clean cut problem with t

Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Xiangyu Li
Hi Holden,

Please check my second email in this chain. I did that originally; to
quote that email:

===
In the spark-2.4.5 src directory, I just did a simple:

`./build/mvn -DskipTests clean package`


And then went to the python directory and did:


`python setup.py sdist` followed by `pip install dist/pyspark-2.4.5.tar.gz`
(as mentioned in the make-distribution.sh.)


*This ran into "error: package directory `deps/jars` does not exist".*

===


So that is exactly what you described, and it is also one of the messages
printed by the make-distribution.sh script.

On Fri, May 1, 2020 at 4:39 PM Holden Karau  wrote:

> Your problem isn't the missing license per-se (that just happens to be the
> first error).
>
> I don't believe that is the way we expect users to pip install the Python
> library. pip will only install directories/targets underneath the directory
> where setup.py, hence the deps directory which is constructed by setup.py
> with a bunch of symlinks. It assumes that you are either building Spark
> from source in which case you should follow it's instructions:
>
> To build Spark with maven you can run:
>   ./build/mvn -DskipTests clean package
> Building the source dist is done in the Python directory:
>   cd python
>   python setup.py sdist
>   pip install dist/*.tar.gz
>
>
> On Fri, May 1, 2020 at 1:32 PM Xiangyu Li  wrote:
>
>> make-distribution.sh with --pip would run a `python setup.py sdist`
>> within that make-distribution.sh script.
>> I also tested `make-distribution.sh` without --pip, and the same error
>> happens.
>>
>> Correct me if I'm wrong, but pyspark binary has always been successfully
>> built, it is the pyspark pip package that is failing.
>>
>> On Fri, May 1, 2020 at 4:23 PM Sean Owen  wrote:
>>
>>> Hm, others may have to chime in here. Either that's not how you create
>>> the pyspark binary from the source release (make-distribution.sh doesn't do
>>> that?) or there is a small but important issue here, that the source
>>> release doesn't contain one thing that the binary release script expects,
>>> which is LICENSE-binary et al. If it's the latter, we could move around the
>>> LICENSE bits in the source tree so that both are "source" files included in
>>> the source release, so you can make the binary release with it, but, I'd
>>> probably say it's easier/better to simply skip adding the license in this
>>> path (if it's supposed to work this way at all) as the use case, a custom
>>> derived work, doesn't need the *ASF's* license statement.
>>>
>>>
>>> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li  wrote:
>>>
>>>> To reproduce this, I just did
>>>>
>>>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>> tar xzf spark-2.4.5.tgz
>>>> cd spark-2.4.5
>>>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>>>> mv spark-2.4.5-bin-custom-spark.tgz ../
>>>> cd ..
>>>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>>>> cd spark-2.4.5-bin-custom-spark/python/
>>>> sudo python setup.py install
>>>>
>>>> And here is the output:
>>>> [image: image.png]
>>>>
>>>>
>>>> On Fri, May 1, 2020 at 2:48 PM Sean Owen  wrote:
>>>>
>>>>> You wrote:
>>>>>
>>>>> "
>>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>>> install` inside the python directory.
>>>>>
>>>>> Step 2 would fail because of missing the licenses directory.
>>>>> "
>>>>>
>>>>> That shouldn't depend on the license file, and the script you showed
>>>>> does not fail when not present, so I am wondering what this means.
>>>>> I'm not sure there's a JIRA here yet.
>>>>>
>>>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li  wrote:
>>>>>
>>>>>> Hmm, sorry I don't get what part of my email were you referring to
>>>>>> when you said "the build fails?".
>>>>>>
>>>>>> So I am trying to build a custom spark binary distribution with, say,
>>>>>> different Hadoop versions and R support.
>>>>>>
>>>>>> Then I stored this custom build on S3, so as I am building more
>>>>>> machines I can just directly download this custom build from S3. But
>>>>>> besides spark-submit and what not, I also wanted to install the pyspark
>>>>>> python package to the machine I am building.
>>>>>>
>>>>>> The lack of the LICENSE file in the custom build would prevent
>>>>>> pyspark from being successfully built.
>>>>>>
>>>>>> Hopefully this answers your question.
>>>>>>
>>>>>> The second part of my last email was about building pyspark inside
>>>>>> spark source directory, I will raise an issue on Jira for that, as it is
>>>>>> more of a clean cut problem with the documentation on the website and the
>>>>>> comments in make-distribution.sh.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen  wrote:
>>>>>>
>>>>>>> Hm, the build fails? you can see this is just skipped if not
>>>>>>> p

Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Xiangyu Li
I do need the pip packaging; the whole point of these efforts is to get a
pyspark pip package.

On Fri, May 1, 2020 at 4:38 PM Sean Owen  wrote:

> I see, that makes more sense, though I have limited knowledge of how the
> pip packaging works. You don't need pip packaging, do you? just pyspark
> itself right. Omit --pip?
>
> On Fri, May 1, 2020 at 3:32 PM Xiangyu Li  wrote:
>
>> make-distribution.sh with --pip would run a `python setup.py sdist`
>> within that make-distribution.sh script.
>> I also tested `make-distribution.sh` without --pip, and the same error
>> happens.
>>
>> Correct me if I'm wrong, but pyspark binary has always been successfully
>> built, it is the pyspark pip package that is failing.
>>
>> On Fri, May 1, 2020 at 4:23 PM Sean Owen  wrote:
>>
>>> Hm, others may have to chime in here. Either that's not how you create
>>> the pyspark binary from the source release (make-distribution.sh doesn't do
>>> that?) or there is a small but important issue here, that the source
>>> release doesn't contain one thing that the binary release script expects,
>>> which is LICENSE-binary et al. If it's the latter, we could move around the
>>> LICENSE bits in the source tree so that both are "source" files included in
>>> the source release, so you can make the binary release with it, but, I'd
>>> probably say it's easier/better to simply skip adding the license in this
>>> path (if it's supposed to work this way at all) as the use case, a custom
>>> derived work, doesn't need the *ASF's* license statement.
>>>
>>>
>>> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li  wrote:
>>>
>>>> To reproduce this, I just did
>>>>
>>>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>> tar xzf spark-2.4.5.tgz
>>>> cd spark-2.4.5
>>>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>>>> mv spark-2.4.5-bin-custom-spark.tgz ../
>>>> cd ..
>>>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>>>> cd spark-2.4.5-bin-custom-spark/python/
>>>> sudo python setup.py install
>>>>
>>>> And here is the output:
>>>> [image: image.png]
>>>>
>>>>
>>>> On Fri, May 1, 2020 at 2:48 PM Sean Owen  wrote:
>>>>
>>>>> You wrote:
>>>>>
>>>>> "
>>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>>> install` inside the python directory.
>>>>>
>>>>> Step 2 would fail because of missing the licenses directory.
>>>>> "
>>>>>
>>>>> That shouldn't depend on the license file, and the script you showed
>>>>> does not fail when not present, so I am wondering what this means.
>>>>> I'm not sure there's a JIRA here yet.
>>>>>
>>>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li  wrote:
>>>>>
>>>>>> Hmm, sorry I don't get what part of my email were you referring to
>>>>>> when you said "the build fails?".
>>>>>>
>>>>>> So I am trying to build a custom spark binary distribution with, say,
>>>>>> different Hadoop versions and R support.
>>>>>>
>>>>>> Then I stored this custom build on S3, so as I am building more
>>>>>> machines I can just directly download this custom build from S3. But
>>>>>> besides spark-submit and what not, I also wanted to install the pyspark
>>>>>> python package to the machine I am building.
>>>>>>
>>>>>> The lack of the LICENSE file in the custom build would prevent
>>>>>> pyspark from being successfully built.
>>>>>>
>>>>>> Hopefully this answers your question.
>>>>>>
>>>>>> The second part of my last email was about building pyspark inside
>>>>>> spark source directory, I will raise an issue on Jira for that, as it is
>>>>>> more of a clean cut problem with the documentation on the website and the
>>>>>> comments in make-distribution.sh.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen  wrote:
>>>>>>
>>>>>>> Hm, the build fails? you can see this is just skipped if not
>>>>>>> present, for this reason.
>>>>>>> I'm not clear why you need the file for its own sake, for your own
>>>>>>> internal modification that you don't redistribute.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Sean,
>>>>>>>>
>>>>>>>> Thanks for the quick response! Yes, what you described about how
>>>>>>>> LICENSE file should be distributed makes sense.
>>>>>>>>
>>>>>>>> The reason I learned about this is that I was trying to build
>>>>>>>> spark-2.4.5-bin-custom.tgz, then distributes this build to multiple
>>>>>>>> machines, so that:
>>>>>>>>
>>>>>>>> 1. These machines can run spark with the built.
>>>>>>>> 2. On each machine, I can install pyspark by running `python
>>>>>>>> setup.py install` inside the python directory.
>>>>>>>>
>>>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>>>>
>>>>>>>> Building pyspark out of a binary distribution is a bit
>>>>>>>> unconventional, but I did this after failing to do what the official doc
>>>>>>>> recommended (
>>>>>>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>>>>>> so taking a step back to describe what

Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Holden Karau
Your problem isn't the missing license per se (that just happens to be the
first error).

I don't believe that is the way we expect users to pip install the Python
library. pip will only install directories/targets underneath the directory
where setup.py lives, hence the deps directory, which setup.py constructs
out of a bunch of symlinks. It assumes that you are building Spark from
source, in which case you should follow its instructions:

To build Spark with maven you can run:
  ./build/mvn -DskipTests clean package
Building the source dist is done in the Python directory:
  cd python
  python setup.py sdist
  pip install dist/*.tar.gz
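
Before running the sdist step, it can help to confirm that the jars setup.py would link into deps actually exist. A minimal sketch, assuming a source checkout keeps its built jars under assembly/target/scala-*/jars and an unpacked binary distribution keeps them under jars/. Both paths are assumptions about the Spark tree, and find_jars_dir is a hypothetical helper, not part of Spark:

```python
import glob
import os


def find_jars_dir(spark_home):
    """Return the directory that would supply the jars, or None if absent.

    Assumed layout (not confirmed in this thread): a source checkout keeps
    built jars in assembly/target/scala-*/jars, while an unpacked binary
    distribution keeps them in jars/.
    """
    # Binary distribution: jars/ sits directly under the Spark home.
    release_jars = os.path.join(spark_home, "jars")
    if os.path.isdir(release_jars):
        return release_jars
    # Source checkout: the Scala version in the path is not known up front.
    pattern = os.path.join(spark_home, "assembly", "target", "scala-*", "jars")
    built = [d for d in sorted(glob.glob(pattern)) if os.path.isdir(d)]
    return built[0] if built else None
```

If this returns None for the source tree, the Maven package step has probably not produced the assembly jars yet, which would be consistent with the `deps/jars` error quoted in this thread.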


On Fri, May 1, 2020 at 1:32 PM Xiangyu Li  wrote:

> make-distribution.sh with --pip would run a `python setup.py sdist` within
> that make-distribution.sh script.
> I also tested `make-distribution.sh` without --pip, and the same error
> happens.
>
> Correct me if I'm wrong, but pyspark binary has always been successfully
> built, it is the pyspark pip package that is failing.
>
> On Fri, May 1, 2020 at 4:23 PM Sean Owen  wrote:
>
>> Hm, others may have to chime in here. Either that's not how you create
>> the pyspark binary from the source release (make-distribution.sh doesn't do
>> that?) or there is a small but important issue here, that the source
>> release doesn't contain one thing that the binary release script expects,
>> which is LICENSE-binary et al. If it's the latter, we could move around the
>> LICENSE bits in the source tree so that both are "source" files included in
>> the source release, so you can make the binary release with it, but, I'd
>> probably say it's easier/better to simply skip adding the license in this
>> path (if it's supposed to work this way at all) as the use case, a custom
>> derived work, doesn't need the *ASF's* license statement.
>>
>>
>> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li  wrote:
>>
>>> To reproduce this, I just did
>>>
>>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>> tar xzf spark-2.4.5.tgz
>>> cd spark-2.4.5
>>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>>> mv spark-2.4.5-bin-custom-spark.tgz ../
>>> cd ..
>>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>>> cd spark-2.4.5-bin-custom-spark/python/
>>> sudo python setup.py install
>>>
>>> And here is the output:
>>> [image: image.png]
>>>
>>>
>>> On Fri, May 1, 2020 at 2:48 PM Sean Owen  wrote:
>>>
>>>> You wrote:
>>>>
>>>> "
>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>> install` inside the python directory.
>>>>
>>>> Step 2 would fail because of missing the licenses directory.
>>>> "
>>>>
>>>> That shouldn't depend on the license file, and the script you showed
>>>> does not fail when not present, so I am wondering what this means.
>>>> I'm not sure there's a JIRA here yet.
>>>>
>>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li  wrote:
>>>>
>>>>> Hmm, sorry I don't get what part of my email were you referring to
>>>>> when you said "the build fails?".
>>>>>
>>>>> So I am trying to build a custom spark binary distribution with, say,
>>>>> different Hadoop versions and R support.
>>>>>
>>>>> Then I stored this custom build on S3, so as I am building more
>>>>> machines I can just directly download this custom build from S3. But
>>>>> besides spark-submit and what not, I also wanted to install the pyspark
>>>>> python package to the machine I am building.
>>>>>
>>>>> The lack of the LICENSE file in the custom build would prevent pyspark
>>>>> from being successfully built.
>>>>>
>>>>> Hopefully this answers your question.
>>>>>
>>>>> The second part of my last email was about building pyspark inside
>>>>> spark source directory, I will raise an issue on Jira for that, as it is
>>>>> more of a clean cut problem with the documentation on the website and the
>>>>> comments in make-distribution.sh.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen  wrote:
>>>>>
>>>>>> Hm, the build fails? you can see this is just skipped if not present,
>>>>>> for this reason.
>>>>>> I'm not clear why you need the file for its own sake, for your own
>>>>>> internal modification that you don't redistribute.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Sean,
>>>>>>>
>>>>>>> Thanks for the quick response! Yes, what you described about how
>>>>>>> LICENSE file should be distributed makes sense.
>>>>>>>
>>>>>>> The reason I learned about this is that I was trying to build
>>>>>>> spark-2.4.5-bin-custom.tgz, then distributes this build to multiple
>>>>>>> machines, so that:
>>>>>>>
>>>>>>> 1. These machines can run spark with the built.
>>>>>>> 2. On each machine, I can install pyspark by running `python
>>>>>>> setup.py install` inside the python directory.
>>>>>>>
>>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>>>
>>>>>>> Building pyspark out of a binary distribution is a bit
>>>>>>> un

Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Sean Owen
I see, that makes more sense, though I have limited knowledge of how the
pip packaging works. You don't need pip packaging, do you? Just pyspark
itself, right? If so, omit --pip?

On Fri, May 1, 2020 at 3:32 PM Xiangyu Li  wrote:

> make-distribution.sh with --pip would run a `python setup.py sdist` within
> that make-distribution.sh script.
> I also tested `make-distribution.sh` without --pip, and the same error
> happens.
>
> Correct me if I'm wrong, but pyspark binary has always been successfully
> built, it is the pyspark pip package that is failing.
>
> On Fri, May 1, 2020 at 4:23 PM Sean Owen  wrote:
>
>> Hm, others may have to chime in here. Either that's not how you create
>> the pyspark binary from the source release (make-distribution.sh doesn't do
>> that?) or there is a small but important issue here, that the source
>> release doesn't contain one thing that the binary release script expects,
>> which is LICENSE-binary et al. If it's the latter, we could move around the
>> LICENSE bits in the source tree so that both are "source" files included in
>> the source release, so you can make the binary release with it, but, I'd
>> probably say it's easier/better to simply skip adding the license in this
>> path (if it's supposed to work this way at all) as the use case, a custom
>> derived work, doesn't need the *ASF's* license statement.
>>
>>
>> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li  wrote:
>>
>>> To reproduce this, I just did
>>>
>>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>> tar xzf spark-2.4.5.tgz
>>> cd spark-2.4.5
>>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>>> mv spark-2.4.5-bin-custom-spark.tgz ../
>>> cd ..
>>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>>> cd spark-2.4.5-bin-custom-spark/python/
>>> sudo python setup.py install
>>>
>>> And here is the output:
>>> [image: image.png]
>>>
>>>
>>> On Fri, May 1, 2020 at 2:48 PM Sean Owen  wrote:
>>>
>>>> You wrote:
>>>>
>>>> "
>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>> install` inside the python directory.
>>>>
>>>> Step 2 would fail because of missing the licenses directory.
>>>> "
>>>>
>>>> That shouldn't depend on the license file, and the script you showed
>>>> does not fail when not present, so I am wondering what this means.
>>>> I'm not sure there's a JIRA here yet.
>>>>
>>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li  wrote:
>>>>
>>>>> Hmm, sorry I don't get what part of my email were you referring to
>>>>> when you said "the build fails?".
>>>>>
>>>>> So I am trying to build a custom spark binary distribution with, say,
>>>>> different Hadoop versions and R support.
>>>>>
>>>>> Then I stored this custom build on S3, so as I am building more
>>>>> machines I can just directly download this custom build from S3. But
>>>>> besides spark-submit and what not, I also wanted to install the pyspark
>>>>> python package to the machine I am building.
>>>>>
>>>>> The lack of the LICENSE file in the custom build would prevent pyspark
>>>>> from being successfully built.
>>>>>
>>>>> Hopefully this answers your question.
>>>>>
>>>>> The second part of my last email was about building pyspark inside
>>>>> spark source directory, I will raise an issue on Jira for that, as it is
>>>>> more of a clean cut problem with the documentation on the website and the
>>>>> comments in make-distribution.sh.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen  wrote:
>>>>>
>>>>>> Hm, the build fails? you can see this is just skipped if not present,
>>>>>> for this reason.
>>>>>> I'm not clear why you need the file for its own sake, for your own
>>>>>> internal modification that you don't redistribute.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Sean,
>>>>>>>
>>>>>>> Thanks for the quick response! Yes, what you described about how
>>>>>>> LICENSE file should be distributed makes sense.
>>>>>>>
>>>>>>> The reason I learned about this is that I was trying to build
>>>>>>> spark-2.4.5-bin-custom.tgz, then distributes this build to multiple
>>>>>>> machines, so that:
>>>>>>>
>>>>>>> 1. These machines can run spark with the built.
>>>>>>> 2. On each machine, I can install pyspark by running `python
>>>>>>> setup.py install` inside the python directory.
>>>>>>>
>>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>>>
>>>>>>> Building pyspark out of a binary distribution is a bit
>>>>>>> unconventional, but I did this after failing to do what the official doc
>>>>>>> recommended (
>>>>>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>>>>> so taking a step back to describe what I did originally:
>>>>>>>
>>>>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>>>>
>>>>>>> `./build/mvn -DskipTests clean package`
>>>>>>>
>>>>>>>
>>>>>>> And then went to the python directory and did:
>>>>>>>
>>>>>>>
>>>>>>> `python setup.py sdist` followed b

Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Xiangyu Li
Running make-distribution.sh with --pip runs `python setup.py sdist` inside
the script itself.
I also tested `make-distribution.sh` without --pip, and the same error
occurs.

Correct me if I'm wrong, but the pyspark binary has always built
successfully; it is the pyspark pip package that fails.

On Fri, May 1, 2020 at 4:23 PM Sean Owen  wrote:

> Hm, others may have to chime in here. Either that's not how you create the
> pyspark binary from the source release (make-distribution.sh doesn't do
> that?) or there is a small but important issue here, that the source
> release doesn't contain one thing that the binary release script expects,
> which is LICENSE-binary et al. If it's the latter, we could move around the
> LICENSE bits in the source tree so that both are "source" files included in
> the source release, so you can make the binary release with it, but, I'd
> probably say it's easier/better to simply skip adding the license in this
> path (if it's supposed to work this way at all) as the use case, a custom
> derived work, doesn't need the *ASF's* license statement.
>
>
> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li  wrote:
>
>> To reproduce this, I just did
>>
>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>> tar xzf spark-2.4.5.tgz
>> cd spark-2.4.5
>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>> mv spark-2.4.5-bin-custom-spark.tgz ../
>> cd ..
>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>> cd spark-2.4.5-bin-custom-spark/python/
>> sudo python setup.py install
>>
>> And here is the output:
>> [image: image.png]
>>
>>
>> On Fri, May 1, 2020 at 2:48 PM Sean Owen  wrote:
>>
>>> You wrote:
>>>
>>> "
>>> 2. On each machine, I can install pyspark by running `python setup.py
>>> install` inside the python directory.
>>>
>>> Step 2 would fail because of missing the licenses directory.
>>> "
>>>
>>> That shouldn't depend on the license file, and the script you showed
>>> does not fail when not present, so I am wondering what this means.
>>> I'm not sure there's a JIRA here yet.
>>>
>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li  wrote:
>>>
>>>> Hmm, sorry I don't get what part of my email were you referring to when
>>>> you said "the build fails?".
>>>>
>>>> So I am trying to build a custom spark binary distribution with, say,
>>>> different Hadoop versions and R support.
>>>>
>>>> Then I stored this custom build on S3, so as I am building more
>>>> machines I can just directly download this custom build from S3. But
>>>> besides spark-submit and what not, I also wanted to install the pyspark
>>>> python package to the machine I am building.
>>>>
>>>> The lack of the LICENSE file in the custom build would prevent pyspark
>>>> from being successfully built.
>>>>
>>>> Hopefully this answers your question.
>>>>
>>>> The second part of my last email was about building pyspark inside
>>>> spark source directory, I will raise an issue on Jira for that, as it is
>>>> more of a clean cut problem with the documentation on the website and the
>>>> comments in make-distribution.sh.
>>>>
>>>>
>>>>
>>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen  wrote:
>>>>
>>>>> Hm, the build fails? you can see this is just skipped if not present,
>>>>> for this reason.
>>>>> I'm not clear why you need the file for its own sake, for your own
>>>>> internal modification that you don't redistribute.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li  wrote:
>>>>>
>>>>>> Hi Sean,
>>>>>>
>>>>>> Thanks for the quick response! Yes, what you described about how
>>>>>> LICENSE file should be distributed makes sense.
>>>>>>
>>>>>> The reason I learned about this is that I was trying to build
>>>>>> spark-2.4.5-bin-custom.tgz, then distributes this build to multiple
>>>>>> machines, so that:
>>>>>>
>>>>>> 1. These machines can run spark with the built.
>>>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>>>> install` inside the python directory.
>>>>>>
>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>>
>>>>>> Building pyspark out of a binary distribution is a bit
>>>>>> unconventional, but I did this after failing to do what the official doc
>>>>>> recommended (
>>>>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>>>> so taking a step back to describe what I did originally:
>>>>>>
>>>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>>>
>>>>>> `./build/mvn -DskipTests clean package`
>>>>>>
>>>>>>
>>>>>> And then went to the python directory and did:
>>>>>>
>>>>>>
>>>>>> `python setup.py sdist` followed by `pip install
>>>>>> dist/pyspark-2.4.5.tar.gz` (as mentioned in the
>>>>>> make-distribution.sh.)
>>>>>>
>>>>>>
>>>>>> This ran into "error: package directory `deps/jars` does not exist".
>>>>>>
>>>>>>
>>>>>> However, directly running
>>>>>>
>>>>>>
>>>>>> `sudo python setup.py install`
>>>>>>
>>>>>>
>>>>>> worked.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, May

Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Sean Owen
Hm, others may have to chime in here. Either that's not how you create the
pyspark binary from the source release (make-distribution.sh doesn't do
that?), or there is a small but important issue here: the source release
doesn't contain one thing that the binary release script expects, namely
LICENSE-binary et al. If it's the latter, we could move the LICENSE bits
around in the source tree so that both are "source" files included in the
source release, letting you make the binary release from it. But I'd
probably say it's easier/better to simply skip adding the license in this
path (if it's supposed to work this way at all), as the use case, a custom
derived work, doesn't need the *ASF's* license statement.
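
Skipping the license when it is absent could look roughly like the following. This is only a sketch of the "copy it if it is there" idea: copy_if_present is a hypothetical helper, and the real make-distribution.sh is a shell script whose exact logic is not shown in this thread.

```python
import os
import shutil


def copy_if_present(src, dest_dir):
    """Copy src into dest_dir when it exists; return True if it was copied."""
    if not os.path.isfile(src):
        # A missing file is not an error: a custom derived build may omit it.
        return False
    os.makedirs(dest_dir, exist_ok=True)
    shutil.copy(src, dest_dir)
    return True
```

The return value lets a calling build step log whether the license was included without failing the build either way.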


On Fri, May 1, 2020 at 3:13 PM Xiangyu Li  wrote:

> To reproduce this, I just did
>
> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
> tar xzf spark-2.4.5.tgz
> cd spark-2.4.5
> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
> mv spark-2.4.5-bin-custom-spark.tgz ../
> cd ..
> tar xzf spark-2.4.5-bin-custom-spark.tgz
> cd spark-2.4.5-bin-custom-spark/python/
> sudo python setup.py install
>
> And here is the output:
> [image: image.png]
>
>
> On Fri, May 1, 2020 at 2:48 PM Sean Owen  wrote:
>
>> You wrote:
>>
>> "
>> 2. On each machine, I can install pyspark by running `python setup.py
>> install` inside the python directory.
>>
>> Step 2 would fail because of missing the licenses directory.
>> "
>>
>> That shouldn't depend on the license file, and the script you showed does
>> not fail when not present, so I am wondering what this means.
>> I'm not sure there's a JIRA here yet.
>>
>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li  wrote:
>>
>>> Hmm, sorry I don't get what part of my email were you referring to when
>>> you said "the build fails?".
>>>
>>> So I am trying to build a custom spark binary distribution with, say,
>>> different Hadoop versions and R support.
>>>
>>> Then I stored this custom build on S3, so as I am building more machines
>>> I can just directly download this custom build from S3. But besides
>>> spark-submit and what not, I also wanted to install the pyspark python
>>> package to the machine I am building.
>>>
>>> The lack of the LICENSE file in the custom build would prevent pyspark
>>> from being successfully built.
>>>
>>> Hopefully this answers your question.
>>>
>>> The second part of my last email was about building pyspark inside spark
>>> source directory, I will raise an issue on Jira for that, as it is more of
>>> a clean cut problem with the documentation on the website and the comments
>>> in make-distribution.sh.
>>>
>>>
>>>
>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen  wrote:
>>>
>>>> Hm, the build fails? you can see this is just skipped if not present,
>>>> for this reason.
>>>> I'm not clear why you need the file for its own sake, for your own
>>>> internal modification that you don't redistribute.
>>>>
>>>>
>>>>
>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li  wrote:
>>>>
>>>>> Hi Sean,
>>>>>
>>>>> Thanks for the quick response! Yes, what you described about how
>>>>> LICENSE file should be distributed makes sense.
>>>>>
>>>>> The reason I learned about this is that I was trying to build
>>>>> spark-2.4.5-bin-custom.tgz, then distributes this build to multiple
>>>>> machines, so that:
>>>>>
>>>>> 1. These machines can run spark with the built.
>>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>>> install` inside the python directory.
>>>>>
>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>
>>>>> Building pyspark out of a binary distribution is a bit unconventional,
>>>>> but I did this after failing to do what the official doc recommended (
>>>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>>> so taking a step back to describe what I did originally:
>>>>>
>>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>>
>>>>> `./build/mvn -DskipTests clean package`
>>>>>
>>>>>
>>>>> And then went to the python directory and did:
>>>>>
>>>>>
>>>>> `python setup.py sdist` followed by `pip install
>>>>> dist/pyspark-2.4.5.tar.gz` (as mentioned in the make-distribution.sh.)
>>>>>
>>>>>
>>>>> This ran into "error: package directory `deps/jars` does not exist".
>>>>>
>>>>>
>>>>> However, directly running
>>>>>
>>>>>
>>>>> `sudo python setup.py install`
>>>>>
>>>>>
>>>>> worked.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen  wrote:
>>>>>
>>>>>> The source distribution has the source LICENSE file. The binary
>>>>>> distribution has the LICENSE-binary license file. The source release isn't
>>>>>> supposed to have LICENSE-binary as it would not be accurate for that
>>>>>> release; LICENSE is. If you're redistributing a build, you'll have your own
>>>>>> process for modifying and building it, including modifying the LICENSE file
>>>>>> as appropriate; these LICENSE files represent what

Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Xiangyu Li
To reproduce this, I just did

curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
tar xzf spark-2.4.5.tgz
cd spark-2.4.5
./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
mv spark-2.4.5-bin-custom-spark.tgz ../
cd ..
tar xzf spark-2.4.5-bin-custom-spark.tgz
cd spark-2.4.5-bin-custom-spark/python/
sudo python setup.py install

And here is the output:
[image: image.png]
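
Since the screenshot is not preserved here, a small check like the one below can show which license-related entries an extracted distribution lacks before `setup.py` is run. This is only a sketch: the function name `report_missing_license_files` is made up for illustration, and the three entries it looks for (`LICENSE`, `NOTICE`, `licenses`) are an assumption based on what the official binary tarball ships at its top level.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Spark): print each license-related entry
# that is absent from the top level of an extracted distribution.
report_missing_license_files() {
  local dist_dir="$1"
  local entry
  for entry in LICENSE NOTICE licenses; do
    # -e is true for regular files and directories alike
    if [ ! -e "$dist_dir/$entry" ]; then
      echo "missing: $entry"
    fi
  done
}

# Example (directory name taken from the steps above):
# report_missing_license_files spark-2.4.5-bin-custom-spark
```

The function prints nothing when all three entries are present, so an empty result means the distribution at least has the same top-level license layout as the official one.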


On Fri, May 1, 2020 at 2:48 PM Sean Owen  wrote:

> You wrote:
>
> "
> 2. On each machine, I can install pyspark by running `python setup.py
> install` inside the python directory.
>
> Step 2 would fail because of missing the licenses directory.
> "
>
> That shouldn't depend on the license file, and the script you showed does
> not fail when the file is absent, so I am wondering what this means.
> I'm not sure there's a JIRA here yet.
>

Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Sean Owen
You wrote:

"
2. On each machine, I can install pyspark by running `python setup.py
install` inside the python directory.

Step 2 would fail because of missing the licenses directory.
"

That shouldn't depend on the license file, and the script you showed does
not fail when the file is absent, so I am wondering what this means.
I'm not sure there's a JIRA here yet.
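
For reference, the skip-if-absent behaviour can be sketched roughly as follows. This is an approximation written for illustration, not a verbatim copy of `make-distribution.sh`: the function name `copy_binary_license_files` and its two arguments are assumptions, and the companion names `licenses-binary` and `NOTICE-binary` are assumed here to follow the `LICENSE-binary` convention discussed in this thread.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the real logic lives in dev/make-distribution.sh.
# Copies the binary license files into the distribution directory when they
# exist, and otherwise skips the step without failing the build.
copy_binary_license_files() {
  local spark_home="$1"   # source tree (assumed argument)
  local dist_dir="$2"     # distribution being assembled (assumed argument)
  if [ -e "$spark_home/LICENSE-binary" ]; then
    cp "$spark_home/LICENSE-binary" "$dist_dir/LICENSE"
    cp -r "$spark_home/licenses-binary" "$dist_dir/licenses"
    cp "$spark_home/NOTICE-binary" "$dist_dir/NOTICE"
  else
    # No LICENSE-binary in the source tree: report it and carry on.
    echo "Skipping copying LICENSE files"
  fi
}
```

With a source tree that has no `LICENSE-binary`, the function only prints the skip message, which is why the resulting distribution ends up without a `LICENSE` file or `licenses` directory.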

On Fri, May 1, 2020 at 1:46 PM Xiangyu Li  wrote:

> Hmm, sorry, I don't understand which part of my email you were referring
> to when you said "the build fails?".
>
> I am trying to build a custom Spark binary distribution with, say, a
> different Hadoop version and R support.
>
> I then stored this custom build on S3 so that, as I set up more machines,
> I can download it directly from S3. Besides spark-submit and whatnot, I
> also wanted to install the pyspark Python package on each machine I set up.
>
> The lack of the LICENSE file in the custom build prevents pyspark from
> being built successfully.
>
> Hopefully this answers your question.
>
> The second part of my last email was about building pyspark inside the
> Spark source directory. I will raise an issue on Jira for that, as it is
> more of a clear-cut problem with the documentation on the website and the
> comments in make-distribution.sh.
>
>
>

Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Xiangyu Li
Hmm, sorry, I don't understand which part of my email you were referring to
when you said "the build fails?".

I am trying to build a custom Spark binary distribution with, say, a
different Hadoop version and R support.

I then stored this custom build on S3 so that, as I set up more machines, I
can download it directly from S3. Besides spark-submit and whatnot, I also
wanted to install the pyspark Python package on each machine I set up.

The lack of the LICENSE file in the custom build prevents pyspark from
being built successfully.

Hopefully this answers your question.

The second part of my last email was about building pyspark inside the
Spark source directory. I will raise an issue on Jira for that, as it is
more of a clear-cut problem with the documentation on the website and the
comments in make-distribution.sh.



On Fri, May 1, 2020 at 1:31 PM Sean Owen  wrote:

> Hm, the build fails? You can see that the copy is simply skipped if the
> file is not present, for exactly this reason.
> I'm not clear why you need the file for its own sake, for an internal
> modification that you don't redistribute.
>
>
>

Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Sean Owen
Hm, the build fails? You can see that the copy is simply skipped if the
file is not present, for exactly this reason.
I'm not clear why you need the file for its own sake, for an internal
modification that you don't redistribute.



On Fri, May 1, 2020 at 11:43 AM Xiangyu Li  wrote:

> Hi Sean,
>
> Thanks for the quick response! Yes, what you described about how the
> LICENSE file should be distributed makes sense.
>
> The reason I ran into this is that I was trying to build
> spark-2.4.5-bin-custom.tgz and then distribute this build to multiple
> machines, so that:
>
> 1. These machines can run Spark with the build.
> 2. On each machine, I can install pyspark by running `python setup.py
> install` inside the python directory.
>
> Step 2 would fail because the licenses directory is missing.
>
> Building pyspark out of a binary distribution is a bit unconventional, but
> I did this after failing to do what the official doc recommended (
> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable).
> So, taking a step back, here is what I did originally:
>
> In the spark-2.4.5 src directory, I ran:
>
> `./build/mvn -DskipTests clean package`
>
>
> I then went to the python directory and ran:
>
>
> `python setup.py sdist` followed by `pip install
> dist/pyspark-2.4.5.tar.gz` (as mentioned in make-distribution.sh).
>
>
> This ran into "error: package directory `deps/jars` does not exist".
>
>
> However, directly running
>
>
> `sudo python setup.py install`
>
>
> worked.
>
>
>


Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Xiangyu Li
Hi Sean,

Thanks for the quick response! Yes, what you described about how the LICENSE
file should be distributed makes sense.

The reason I ran into this is that I was trying to build
spark-2.4.5-bin-custom.tgz and then distribute this build to multiple
machines, so that:

1. These machines can run Spark with the build.
2. On each machine, I can install pyspark by running `python setup.py
install` inside the python directory.

Step 2 would fail because the licenses directory is missing.

Building pyspark out of a binary distribution is a bit unconventional, but
I did this after failing to do what the official doc recommended (
https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable).
So, taking a step back, here is what I did originally:

In the spark-2.4.5 src directory, I ran:

`./build/mvn -DskipTests clean package`


I then went to the python directory and ran:


`python setup.py sdist` followed by `pip install dist/pyspark-2.4.5.tar.gz`
(as mentioned in make-distribution.sh).


This ran into "error: package directory `deps/jars` does not exist".


However, directly running


`sudo python setup.py install`


worked.



On Fri, May 1, 2020 at 11:30 AM Sean Owen  wrote:

> The source distribution has the source LICENSE file. The binary
> distribution has the LICENSE-binary license file. The source release isn't
> supposed to have LICENSE-binary as it would not be accurate for that
> release; LICENSE is. If you're redistributing a build, you'll have your own
> process for modifying and building it, including modifying the LICENSE file
> as appropriate; these LICENSE files represent what the project delivers to
> you rather than what you deliver to others. You could get the
> LICENSE-binary file from the right hash commit from git, if desired, as
> part of your build.
>

-- 
Sincerely
Xiangyu Li




Re: No LICENSE file in spark custom build distribution

2020-05-01 Thread Sean Owen
The source distribution has the source LICENSE file. The binary
distribution has the LICENSE-binary license file. The source release isn't
supposed to have LICENSE-binary as it would not be accurate for that
release; LICENSE is. If you're redistributing a build, you'll have your own
process for modifying and building it, including modifying the LICENSE file
as appropriate; these LICENSE files represent what the project delivers to
you rather than what you deliver to others. You could get the
LICENSE-binary file from the right hash commit from git, if desired, as
part of your build.
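
One way to follow that suggestion is sketched below, assuming the release is tagged `v2.4.5` in the `apache/spark` GitHub repository and that the binary license files sit at the top level of the tree as `LICENSE-binary`, `NOTICE-binary` and `licenses-binary`. The function name `stage_binary_license_files` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the binary license files from a git checkout of
# Spark into an extracted source release, so that a later
# make-distribution.sh run can find them.
stage_binary_license_files() {
  local checkout="$1"     # git checkout at the matching tag
  local source_dir="$2"   # extracted source release, e.g. spark-2.4.5
  local name
  for name in LICENSE-binary NOTICE-binary licenses-binary; do
    # Stop at the first entry the checkout does not contain.
    if [ ! -e "$checkout/$name" ]; then
      echo "not found in checkout: $name" >&2
      return 1
    fi
    cp -r "$checkout/$name" "$source_dir/"
  done
}

# Example usage (needs network access; the tag name is an assumption):
# git clone --depth 1 --branch v2.4.5 https://github.com/apache/spark.git spark-git
# stage_binary_license_files spark-git spark-2.4.5
```

Whether the copied files are accurate for a given custom build still has to be checked by whoever redistributes it, as noted above.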

On Fri, May 1, 2020 at 10:19 AM Xiangyu Li  wrote:

> Hello,
>
> I downloaded spark-2.4.5 source from
> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
> After extracting it and running:
>
> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>
>
> It creates a Spark binary distribution named:
> spark-2.4.5-bin-custom-spark.tgz
>
> So this file is supposedly a ready-to-distribute Spark binary file like
> the one you can download from
> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>
> However, one big difference between this custom build and the official
> build is that the custom build has no LICENSE file. I don't know much
> about the Apache license, but I would suppose a custom build
> distribution should have one.
>
> The file is missing because of the following code in
> make-distribution.sh:
> [image: image.png]
>
> There is no LICENSE-binary file in the official spark-2.4.5.tgz file,
> therefore there will be no LICENSE file in your custom build.
>
> I am aware of two pull requests related to this:
>
> https://github.com/apache/spark/pull/22436
> which started to use LICENSE-binary instead of just LICENSE,
>
> and
> https://github.com/apache/spark/pull/22840
> which avoids a failure when there is no LICENSE-binary in the Spark
> source directory.
>
> I think we need to change make-distribution.sh to make sure that the
> LICENSE file is copied over to its corresponding custom build distribution.
> However, I am not ready to do a pull request, so hopefully we can discuss
> it here first.
> --
> Sincerely
> Xiangyu Li
>
> 
>