Re: No LICENSE file in spark custom build distribution
Can you send me the output of those two commands?

On Fri, May 1, 2020 at 1:46 PM Xiangyu Li wrote:

> Hi Holden,
>
> Please check the second email of mine in this email chain. I did that
> originally and to quote my email:
>
> ===
> In the spark-2.4.5 src directory, I just did a simple:
>
> `./build/mvn -DskipTests clean package`
>
> And then went to the python directory and did:
>
> `python setup.py sdist` followed by `pip install
> dist/pyspark-2.4.5.tar.gz` (as mentioned in the make-distribution.sh.)
>
> *This ran into "error: package directory `deps/jars` does not exist".*
> ===
>
> So exactly as what you said, which is also one of the printout messages
> in the make-distribution.sh script.
Re: No LICENSE file in spark custom build distribution
Hi Holden,

Please check the second email of mine in this email chain. I did that originally and to quote my email:

===
In the spark-2.4.5 src directory, I just did a simple:

`./build/mvn -DskipTests clean package`

And then went to the python directory and did:

`python setup.py sdist` followed by `pip install dist/pyspark-2.4.5.tar.gz` (as mentioned in the make-distribution.sh.)

*This ran into "error: package directory `deps/jars` does not exist".*
===

So exactly as what you said, which is also one of the printout messages in the make-distribution.sh script.

On Fri, May 1, 2020 at 4:39 PM Holden Karau wrote:

> Your problem isn't the missing license per-se (that just happens to be the
> first error).
>
> I don't believe that is the way we expect users to pip install the Python
> library. pip will only install directories/targets underneath the directory
> where setup.py is, hence the deps directory which is constructed by
> setup.py with a bunch of symlinks. It assumes that you are either building
> Spark from source in which case you should follow its instructions:
>
>     To build Spark with maven you can run:
>       ./build/mvn -DskipTests clean package
>     Building the source dist is done in the Python directory:
>       cd python
>       python setup.py sdist
>       pip install dist/*.tar.gz
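The `pip install dist/*.tar.gz` step quoted above expands to every tarball in `dist/`, so a stale archive from an earlier build could be picked up alongside the new one. A minimal sketch of a guard follows; `find_single_sdist` is a hypothetical helper, not something shipped with Spark, and the `pyspark-*.tar.gz` name pattern is assumed from the `pyspark-2.4.5.tar.gz` file mentioned in this thread.

```shell
# Hypothetical helper (not part of Spark): print the one sdist tarball found
# in <python_dir>/dist, or fail if there is none or more than one.
find_single_sdist() {
    sdist_dir=$1/dist
    sdist_count=0
    sdist_path=
    for sdist_candidate in "$sdist_dir"/pyspark-*.tar.gz; do
        # An unmatched glob stays literal, so confirm the file really exists.
        [ -f "$sdist_candidate" ] || continue
        sdist_count=$((sdist_count + 1))
        sdist_path=$sdist_candidate
    done
    if [ "$sdist_count" -ne 1 ]; then
        echo "expected exactly one sdist in $sdist_dir, found $sdist_count" >&2
        return 1
    fi
    printf '%s\n' "$sdist_path"
}
```

After `python setup.py sdist` has run, something like `pip install "$(find_single_sdist python)"` from the Spark source root would then install exactly one archive, or stop with a message if the `dist` directory is not in the expected state.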
Re: No LICENSE file in spark custom build distribution
I need the pip packaging; all these efforts are actually to get a pyspark pip package.

On Fri, May 1, 2020 at 4:38 PM Sean Owen wrote:

> I see, that makes more sense, though I have limited knowledge of how the
> pip packaging works. You don't need pip packaging, do you? Just pyspark
> itself, right? Omit --pip?
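The failure reported in this thread, where `python setup.py install` inside an unpacked custom distribution stops because the licenses directory is missing, can be spotted before the install is attempted. The sketch below is only an illustration: `report_missing_license_entries` is a hypothetical name, and the two entries it looks for (`LICENSE` and `licenses`) are simply the ones mentioned in this thread, not a complete list of what setup.py needs.

```shell
# Hypothetical pre-flight check (not part of Spark): list which of the
# license entries discussed in this thread are absent from an unpacked
# distribution directory. Returns non-zero if anything is missing.
report_missing_license_entries() {
    dist_root=$1
    missing_count=0
    if [ ! -f "$dist_root/LICENSE" ]; then
        echo "missing file: $dist_root/LICENSE"
        missing_count=$((missing_count + 1))
    fi
    if [ ! -d "$dist_root/licenses" ]; then
        echo "missing directory: $dist_root/licenses"
        missing_count=$((missing_count + 1))
    fi
    [ "$missing_count" -eq 0 ]
}
```

For the layout used in the reproduction, this could be called as `report_missing_license_entries spark-2.4.5-bin-custom-spark` before changing into its `python/` directory.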
Re: No LICENSE file in spark custom build distribution
Your problem isn't the missing license per-se (that just happens to be the first error).

I don't believe that is the way we expect users to pip install the Python library. pip will only install directories/targets underneath the directory where setup.py is, hence the deps directory which is constructed by setup.py with a bunch of symlinks. It assumes that you are either building Spark from source in which case you should follow its instructions:

    To build Spark with maven you can run:
      ./build/mvn -DskipTests clean package
    Building the source dist is done in the Python directory:
      cd python
      python setup.py sdist
      pip install dist/*.tar.gz

On Fri, May 1, 2020 at 1:32 PM Xiangyu Li wrote:

> make-distribution.sh with --pip would run a `python setup.py sdist` within
> that make-distribution.sh script.
> I also tested `make-distribution.sh` without --pip, and the same error
> happens.
>
> Correct me if I'm wrong, but pyspark binary has always been successfully
> built; it is the pyspark pip package that is failing.
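As a rough illustration of the point about the deps directory being built from symlinks: before running `python setup.py sdist`, one can check that the jars it would link to actually exist. The locations probed below are an assumption about where setup.py looks (`assembly/target/scala-*/jars` in a source build, or `jars/` next to a `RELEASE` file in a binary distribution), and `find_spark_jars_dir` is a hypothetical helper rather than anything shipped with Spark.

```shell
# Hypothetical helper (not part of Spark): print the directory that holds
# the built Spark jars under the given Spark home, or return 1 if none of
# the assumed locations exists.
find_spark_jars_dir() {
    jars_home=$1
    # Assumed source-build layout: assembly/target/scala-<version>/jars
    for jars_candidate in "$jars_home"/assembly/target/scala-*/jars; do
        if [ -d "$jars_candidate" ]; then
            printf '%s\n' "$jars_candidate"
            return 0
        fi
    done
    # Assumed binary-distribution layout: a RELEASE file next to jars/
    if [ -f "$jars_home/RELEASE" ] && [ -d "$jars_home/jars" ]; then
        printf '%s\n' "$jars_home/jars"
        return 0
    fi
    return 1
}
```

If this returns non-zero for the tree being packaged, an error such as "package directory `deps/jars` does not exist" would not be surprising, since there is nothing for the symlink to point at.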
Re: No LICENSE file in spark custom build distribution
I see, that makes more sense, though I have limited knowledge of how the pip packaging works. You don't need pip packaging, do you? Just pyspark itself, right? Omit --pip?

On Fri, May 1, 2020 at 3:32 PM Xiangyu Li wrote:

> make-distribution.sh with --pip would run a `python setup.py sdist` within
> that make-distribution.sh script.
> I also tested `make-distribution.sh` without --pip, and the same error
> happens.
>
> Correct me if I'm wrong, but pyspark binary has always been successfully
> built; it is the pyspark pip package that is failing.
Re: No LICENSE file in spark custom build distribution
make-distribution.sh with --pip would run a `python setup.py sdist` within that make-distribution.sh script.
I also tested `make-distribution.sh` without --pip, and the same error happens.

Correct me if I'm wrong, but pyspark binary has always been successfully built; it is the pyspark pip package that is failing.

On Fri, May 1, 2020 at 4:23 PM Sean Owen wrote:

> Hm, others may have to chime in here. Either that's not how you create the
> pyspark binary from the source release (make-distribution.sh doesn't do
> that?) or there is a small but important issue here, that the source
> release doesn't contain one thing that the binary release script expects,
> which is LICENSE-binary et al. If it's the latter, we could move around the
> LICENSE bits in the source tree so that both are "source" files included in
> the source release, so you can make the binary release with it, but, I'd
> probably say it's easier/better to simply skip adding the license in this
> path (if it's supposed to work this way at all) as the use case, a custom
> derived work, doesn't need the *ASF's* license statement.
Re: No LICENSE file in spark custom build distribution
Hm, others may have to chime in here. Either that's not how you create the pyspark binary from the source release (make-distribution.sh doesn't do that?) or there is a small but important issue here, that the source release doesn't contain one thing that the binary release script expects, which is LICENSE-binary et al. If it's the latter, we could move around the LICENSE bits in the source tree so that both are "source" files included in the source release, so you can make the binary release with it, but, I'd probably say it's easier/better to simply skip adding the license in this path (if it's supposed to work this way at all) as the use case, a custom derived work, doesn't need the *ASF's* license statement.

On Fri, May 1, 2020 at 3:13 PM Xiangyu Li wrote:

> To reproduce this, I just did
>
> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
> tar xzf spark-2.4.5.tgz
> cd spark-2.4.5
> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
> mv spark-2.4.5-bin-custom-spark.tgz ../
> cd ..
> tar xzf spark-2.4.5-bin-custom-spark.tgz
> cd spark-2.4.5-bin-custom-spark/python/
> sudo python setup.py install
>
> And here is the output:
> [image: image.png]
>
> On Fri, May 1, 2020 at 2:48 PM Sean Owen wrote:
>
>> You wrote:
>>
>> "
>> 2. On each machine, I can install pyspark by running `python setup.py
>> install` inside the python directory.
>>
>> Step 2 would fail because of missing the licenses directory.
>> "
>>
>> That shouldn't depend on the license file, and the script you showed does
>> not fail when not present, so I am wondering what this means.
>> I'm not sure there's a JIRA here yet.
>>
>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li wrote:
>>
>>> Hmm, sorry I don't get what part of my email were you referring to when
>>> you said "the build fails?".
>>>
>>> So I am trying to build a custom spark binary distribution with, say,
>>> different Hadoop versions and R support.
>>>
>>> Then I stored this custom build on S3, so as I am building more machines
>>> I can just directly download this custom build from S3. But besides
>>> spark-submit and what not, I also wanted to install the pyspark python
>>> package to the machine I am building.
>>>
>>> The lack of the LICENSE file in the custom build would prevent pyspark
>>> from being successfully built.
>>>
>>> Hopefully this answers your question.
>>>
>>> The second part of my last email was about building pyspark inside spark
>>> source directory, I will raise an issue on Jira for that, as it is more of
>>> a clean cut problem with the documentation on the website and the comments
>>> in make-distribution.sh.
>>>
>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen wrote:
>>>
>>>> Hm, the build fails? you can see this is just skipped if not present,
>>>> for this reason.
>>>> I'm not clear why you need the file for its own sake, for your own
>>>> internal modification that you don't redistribute.
>>>>
>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li wrote:
>>>>
>>>>> Hi Sean,
>>>>>
>>>>> Thanks for the quick response! Yes, what you described about how
>>>>> LICENSE file should be distributed makes sense.
>>>>>
>>>>> The reason I learned about this is that I was trying to build
>>>>> spark-2.4.5-bin-custom.tgz, then distributes this build to multiple
>>>>> machines, so that:
>>>>>
>>>>> 1. These machines can run spark with the built.
>>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>>> install` inside the python directory.
>>>>>
>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>
>>>>> Building pyspark out of a binary distribution is a bit unconventional,
>>>>> but I did this after failing to do what the official doc recommended (
>>>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>>> so taking a step back to describe what I did originally:
>>>>>
>>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>>
>>>>> `./build/mvn -DskipTests clean package`
>>>>>
>>>>> And then went to the python directory and did:
>>>>>
>>>>> `python setup.py sdist` followed by `pip install
>>>>> dist/pyspark-2.4.5.tar.gz` (as mentioned in the make-distribution.sh.)
>>>>>
>>>>> This ran into "error: package directory `deps/jars` does not exist".
>>>>>
>>>>> However, directly running
>>>>>
>>>>> `sudo python setup.py install`
>>>>>
>>>>> worked.
>>>>>
>>>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen wrote:
>>>>>
>>>>>> The source distribution has the source LICENSE file. The binary
>>>>>> distribution has the LICENSE-binary license file. The source release
>>>>>> isn't supposed to have LICENSE-binary as it would not be accurate for
>>>>>> that release; LICENSE is. If you're redistributing a build, you'll have
>>>>>> your own process for modifying and building it, including modifying the
>>>>>> LICENSE file as appropriate; these LICENSE files represent what
Re: No LICENSE file in spark custom build distribution
To reproduce this, I just did:

    curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
    tar xzf spark-2.4.5.tgz
    cd spark-2.4.5
    ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
    mv spark-2.4.5-bin-custom-spark.tgz ../
    cd ..
    tar xzf spark-2.4.5-bin-custom-spark.tgz
    cd spark-2.4.5-bin-custom-spark/python/
    sudo python setup.py install

And here is the output:

[image: image.png]
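Since the output above survives only as an image, a quick way to see what the unpacked build actually contains is to list the license-related entries directly. The following is a minimal sketch, not part of the thread: the function name `check_dist_license` is hypothetical, and the three names it looks for (LICENSE, NOTICE, licenses) are an assumption based on the files mentioned in this discussion.

```shell
# Report which license-related entries a Spark distribution directory contains.
# Usage: check_dist_license <dist_dir>
# Returns 0 if LICENSE is present, 1 otherwise.
# NOTE: the list of names checked here is an assumption, not taken from Spark.
check_dist_license() {
  dist_dir="$1"
  status=0
  for name in LICENSE NOTICE licenses; do
    if [ -e "$dist_dir/$name" ]; then
      echo "present: $name"
    else
      echo "missing: $name"
      if [ "$name" = "LICENSE" ]; then
        status=1
      fi
    fi
  done
  return "$status"
}
```

Run from the parent directory after the second `tar xzf`, this would be called as `check_dist_license spark-2.4.5-bin-custom-spark`.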
Re: No LICENSE file in spark custom build distribution
You wrote:

"
2. On each machine, I can install pyspark by running `python setup.py install` inside the python directory.

Step 2 would fail because of missing the licenses directory.
"

That shouldn't depend on the license file, and the script you showed does not fail when the file is not present, so I am wondering what this means. I'm not sure there's a JIRA here yet.
Re: No LICENSE file in spark custom build distribution
Hmm, sorry, I don't get which part of my email you were referring to when you said "the build fails?".

I am trying to build a custom Spark binary distribution with, say, different Hadoop versions and R support.

I then store this custom build on S3, so that as I build more machines I can download it directly from S3. Besides spark-submit and the like, I also want to install the pyspark Python package on each machine I am building.

The lack of the LICENSE file in the custom build would prevent pyspark from being successfully built.

Hopefully this answers your question.

The second part of my last email was about building pyspark inside the Spark source directory. I will raise an issue on Jira for that, as it is more of a clear-cut problem with the documentation on the website and the comments in make-distribution.sh.
Re: No LICENSE file in spark custom build distribution
Hm, the build fails? You can see that this step is just skipped if the file is not present, for this reason. I'm not clear why you need the file for its own sake, for your own internal modification that you don't redistribute.
Re: No LICENSE file in spark custom build distribution
Hi Sean,

Thanks for the quick response! Yes, what you described about how the LICENSE file should be distributed makes sense.

The reason I learned about this is that I was trying to build spark-2.4.5-bin-custom.tgz and then distribute this build to multiple machines, so that:

1. These machines can run Spark with the build.
2. On each machine, I can install pyspark by running `python setup.py install` inside the python directory.

Step 2 would fail because the licenses directory is missing.

Building pyspark out of a binary distribution is a bit unconventional, but I did this after failing to do what the official doc recommended (https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable). So, taking a step back, here is what I did originally:

In the spark-2.4.5 src directory, I just ran:

`./build/mvn -DskipTests clean package`

Then I went to the python directory and ran:

`python setup.py sdist` followed by `pip install dist/pyspark-2.4.5.tar.gz` (as mentioned in make-distribution.sh).

This ran into "error: package directory `deps/jars` does not exist".

However, directly running

`sudo python setup.py install`

worked.

--
Sincerely
Xiangyu Li
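The cause of the `deps/jars` error is not established in this message. One precondition that can be checked before running `python setup.py sdist` is that the built jars are actually on disk. The sketch below is an illustration only: the function name `find_spark_jars_dir` is hypothetical, and the two directory layouts it probes (a Maven build tree and an unpacked binary distribution) are assumptions that should be verified against the Spark version in use.

```shell
# Print the directory that holds the Spark jars, or fail if none is found.
# Usage: find_spark_jars_dir <spark_home>
find_spark_jars_dir() {
  spark_home="$1"
  # Assumed layout after building from source with Maven.
  # If the glob matches nothing, the pattern stays literal and -d is false.
  for dir in "$spark_home"/assembly/target/scala-*/jars; do
    if [ -d "$dir" ]; then
      echo "$dir"
      return 0
    fi
  done
  # Assumed layout of an unpacked binary distribution.
  if [ -d "$spark_home/jars" ]; then
    echo "$spark_home/jars"
    return 0
  fi
  echo "no jars directory under $spark_home; build Spark first" >&2
  return 1
}
```

From the top of the source tree, `find_spark_jars_dir . && (cd python && python setup.py sdist)` would only attempt the packaging step once a jars directory has been located.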
Re: No LICENSE file in spark custom build distribution
The source distribution has the source LICENSE file. The binary distribution has the LICENSE-binary license file. The source release isn't supposed to have LICENSE-binary, as it would not be accurate for that release; LICENSE is. If you're redistributing a build, you'll have your own process for modifying and building it, including modifying the LICENSE file as appropriate; these LICENSE files represent what the project delivers to you rather than what you deliver to others. You could get the LICENSE-binary file from the matching commit in git, if desired, as part of your build.

On Fri, May 1, 2020 at 10:19 AM Xiangyu Li wrote:

> Hello,
>
> I downloaded the spark-2.4.5 source from
> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
> After extracting it, I ran:
>
> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr
> -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>
> It creates a Spark binary distribution named:
> spark-2.4.5-bin-custom-spark.tgz
>
> So this file is supposedly a ready-to-distribute Spark binary file like
> the one you can download from
> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>
> However, one big difference between this custom build and the official
> build is that there is no LICENSE file in the custom build. I don't
> know much about the Apache license, but I would suppose a custom build
> distribution should have one.
>
> The file is missing because of the following code in
> make-distribution.sh:
> [image: image.png]
>
> There is no LICENSE-binary file in the official spark-2.4.5.tgz file,
> therefore there will be no LICENSE file in your custom build.
>
> I am aware of two pull requests related to this:
>
> https://github.com/apache/spark/pull/22436
> started to use LICENSE-binary instead of just the LICENSE.
>
> And
> https://github.com/apache/spark/pull/22840
> to avoid failure when there is no LICENSE-binary in the spark-2.4.5 source
> directory.
>
> I think we need to change make-distribution.sh to make sure that the
> LICENSE file is copied over to its corresponding custom build distribution.
> However, I am not ready to do a pull request, so hopefully we can discuss
> it here first.
> --
> Sincerely
> Xiangyu Li
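The code in question was attached as an image and is not reproduced in the thread, so the following is only a sketch of the behaviour being discussed, not the contents of make-distribution.sh. It copies LICENSE-binary when present and otherwise falls back to the source LICENSE, which is one possible reading of the proposal in the quoted message. The function name `copy_license` and the fallback branch are hypothetical, and per the reply above, the source LICENSE may not be accurate for a binary build, so the fallback would at most suit an internal build that is not redistributed.

```shell
# Copy a license file into a distribution directory.
# Usage: copy_license <source_dir> <dist_dir>
# Prefers LICENSE-binary; the fallback to LICENSE is a hypothetical policy,
# not something make-distribution.sh is known to do.
copy_license() {
  src_dir="$1"
  dist_dir="$2"
  if [ -e "$src_dir/LICENSE-binary" ]; then
    cp "$src_dir/LICENSE-binary" "$dist_dir/LICENSE"
  elif [ -e "$src_dir/LICENSE" ]; then
    cp "$src_dir/LICENSE" "$dist_dir/LICENSE"
  else
    echo "Skipping copying LICENSE files" >&2
  fi
}
```

Fetching LICENSE-binary from the matching git commit before building, as suggested in the reply, avoids the fallback entirely because the first branch is then taken.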