Can you send me the output of those two commands?

On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <yisky...@gmail.com> wrote:
> Hi Holden,
>
> Please check my second email in this email chain. I did that originally, and to quote my email:
>
> ===========================================================================================
> In the spark-2.4.5 src directory, I just did a simple:
>
> `./build/mvn -DskipTests clean package`
>
> And then went to the python directory and did:
>
> `python setup.py sdist` followed by `pip install dist/pyspark-2.4.5.tar.gz` (as mentioned in the make-distribution.sh.)
>
> *This ran into "error: package directory `deps/jars` does not exist".*
> ===============================================================================================
>
> So this is exactly what you said, which is also one of the messages printed by the make-distribution.sh script.
>
> On Fri, May 1, 2020 at 4:39 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Your problem isn't the missing license per se (that just happens to be the first error).
>>
>> I don't believe that is the way we expect users to pip install the Python library. pip will only install directories/targets underneath the directory where setup.py is, hence the deps directory, which is constructed by setup.py with a bunch of symlinks. It assumes that you are either building Spark from source, in which case you should follow its instructions:
>>
>> To build Spark with maven you can run:
>> ./build/mvn -DskipTests clean package
>> Building the source dist is done in the Python directory:
>> cd python
>> python setup.py sdist
>> pip install dist/*.tar.gz
>>
>> On Fri, May 1, 2020 at 1:32 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>
>>> make-distribution.sh with --pip would run a `python setup.py sdist` within that make-distribution.sh script.
>>> I also tested `make-distribution.sh` without --pip, and the same error happens.
>>>
>>> Correct me if I'm wrong, but the pyspark binary has always been built successfully; it is the pyspark pip package that is failing.
>>>
>>> On Fri, May 1, 2020 at 4:23 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> Hm, others may have to chime in here. Either that's not how you create the pyspark binary from the source release (make-distribution.sh doesn't do that?) or there is a small but important issue here, that the source release doesn't contain one thing that the binary release script expects, which is LICENSE-binary et al. If it's the latter, we could move around the LICENSE bits in the source tree so that both are "source" files included in the source release, so you can make the binary release with it, but, I'd probably say it's easier/better to simply skip adding the license in this path (if it's supposed to work this way at all) as the use case, a custom derived work, doesn't need the *ASF's* license statement.
>>>>
>>>> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>
>>>>> To reproduce this, I just did
>>>>>
>>>>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>> tar xzf spark-2.4.5.tgz
>>>>> cd spark-2.4.5
>>>>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>>>>> mv spark-2.4.5-bin-custom-spark.tgz ../
>>>>> cd ..
>>>>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>>>>> cd spark-2.4.5-bin-custom-spark/python/
>>>>> sudo python setup.py install
>>>>>
>>>>> And here is the output:
>>>>> [image: image.png]
>>>>>
>>>>> On Fri, May 1, 2020 at 2:48 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> You wrote:
>>>>>>
>>>>>> "
>>>>>> 2. On each machine, I can install pyspark by running `python setup.py install` inside the python directory.
>>>>>>
>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>> "
>>>>>>
>>>>>> That shouldn't depend on the license file, and the script you showed does not fail when not present, so I am wondering what this means.
>>>>>> I'm not sure there's a JIRA here yet.
>>>>>>
>>>>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>>
>>>>>>> Hmm, sorry, I don't get which part of my email you were referring to when you said "the build fails?".
>>>>>>>
>>>>>>> So I am trying to build a custom Spark binary distribution with, say, different Hadoop versions and R support.
>>>>>>>
>>>>>>> Then I stored this custom build on S3, so as I am building more machines I can just directly download this custom build from S3. But besides spark-submit and whatnot, I also wanted to install the pyspark Python package on the machine I am building.
>>>>>>>
>>>>>>> The lack of the LICENSE file in the custom build would prevent pyspark from being successfully built.
>>>>>>>
>>>>>>> Hopefully this answers your question.
>>>>>>>
>>>>>>> The second part of my last email was about building pyspark inside the Spark source directory. I will raise an issue on Jira for that, as it is more of a clear-cut problem with the documentation on the website and the comments in make-distribution.sh.
>>>>>>>
>>>>>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hm, the build fails? You can see this is just skipped if not present, for this reason.
>>>>>>>> I'm not clear why you need the file for its own sake, for your own internal modification that you don't redistribute.
>>>>>>>>
>>>>>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Sean,
>>>>>>>>>
>>>>>>>>> Thanks for the quick response! Yes, what you described about how the LICENSE file should be distributed makes sense.
>>>>>>>>>
>>>>>>>>> The reason I learned about this is that I was trying to build spark-2.4.5-bin-custom.tgz and then distribute this build to multiple machines, so that:
>>>>>>>>>
>>>>>>>>> 1. These machines can run Spark with the build.
>>>>>>>>> 2. On each machine, I can install pyspark by running `python setup.py install` inside the python directory.
>>>>>>>>>
>>>>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>>>>>
>>>>>>>>> Building pyspark out of a binary distribution is a bit unconventional, but I did this after failing to do what the official doc recommended (https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable), so taking a step back to describe what I did originally:
>>>>>>>>>
>>>>>>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>>>>>>
>>>>>>>>> `./build/mvn -DskipTests clean package`
>>>>>>>>>
>>>>>>>>> And then went to the python directory and did:
>>>>>>>>>
>>>>>>>>> `python setup.py sdist` followed by `pip install dist/pyspark-2.4.5.tar.gz` (as mentioned in the make-distribution.sh.)
>>>>>>>>>
>>>>>>>>> This ran into "error: package directory `deps/jars` does not exist".
>>>>>>>>>
>>>>>>>>> However, directly running
>>>>>>>>>
>>>>>>>>> `sudo python setup.py install`
>>>>>>>>>
>>>>>>>>> worked.
>>>>>>>>>
>>>>>>>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The source distribution has the source LICENSE file. The binary distribution has the LICENSE-binary license file. The source release isn't supposed to have LICENSE-binary as it would not be accurate for that release; LICENSE is. If you're redistributing a build, you'll have your own process for modifying and building it, including modifying the LICENSE file as appropriate; these LICENSE files represent what the project delivers to you rather than what you deliver to others.
>>>>>>>>>> You could get the LICENSE-binary file from the right hash commit from git, if desired, as part of your build.
>>>>>>>>>>
>>>>>>>>>> On Fri, May 1, 2020 at 10:19 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I downloaded the spark-2.4.5 source from https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>>>>>>>> After extracting it and running:
>>>>>>>>>>>
>>>>>>>>>>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>>>>>>>>>>>
>>>>>>>>>>> It creates a Spark binary distribution named: spark-2.4.5-bin-custom-spark.tgz
>>>>>>>>>>>
>>>>>>>>>>> So this file is supposedly a ready-to-distribute Spark binary file like the one you can download from http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>>>>>>>>>>
>>>>>>>>>>> However, one big difference between this custom build and the official build is that you do not have a LICENSE file in the custom build. I don't know much about the Apache license, but I would suppose a custom build distribution should have one.
>>>>>>>>>>>
>>>>>>>>>>> The file is missing because of the following code in make-distribution.sh:
>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>
>>>>>>>>>>> There is no LICENSE-binary file in the official spark-2.4.5.tgz file, therefore there will be no LICENSE file in your custom build.
>>>>>>>>>>>
>>>>>>>>>>> I am aware of two pull requests related to this:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/spark/pull/22436
>>>>>>>>>>> started to use LICENSE-binary instead of just the LICENSE.
>>>>>>>>>>>
>>>>>>>>>>> And
>>>>>>>>>>> https://github.com/apache/spark/pull/22840
>>>>>>>>>>> To avoid failure when there is no LICENSE-binary in spark-2.4.5 source directory.
>>>>>>>>>>>
>>>>>>>>>>> I think we need to change make-distribution.sh to make sure that the LICENSE file is copied over to its corresponding custom build distribution. However, I am not ready to do a pull request, so hopefully we can discuss it here first.
>>>>>>>>>>> --
>>>>>>>>>>> Sincerely
>>>>>>>>>>> Xiangyu Li
>>>>>>>>>>>
>>>>>>>>>>> <yisky...@gmail.com>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau