I do need the pip packaging; the whole point of these efforts is to get a pyspark pip package.
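
A minimal sketch of the kind of license-copy step discussed in this thread, so that a custom build always ends up with a LICENSE file. This is not the actual make-distribution.sh code. The function name `copy_license_files` and its two arguments are hypothetical, and the `licenses-binary`, `NOTICE-binary`, `licenses`, and `NOTICE` names are assumptions about the tree layout. It prefers LICENSE-binary when present and otherwise falls back to the source LICENSE; whether the source LICENSE is accurate for a binary build is a separate question.

```shell
# Hypothetical helper (not part of Spark): copy license files from a source
# tree into a distribution directory.
#   $1 = source tree root, $2 = distribution directory (both illustrative)
copy_license_files() {
  local src="$1"
  local dist="$2"
  mkdir -p "$dist"

  if [ -e "$src/LICENSE-binary" ]; then
    # Preferred case: binary-specific license files are available.
    cp "$src/LICENSE-binary" "$dist/LICENSE"
    if [ -d "$src/licenses-binary" ]; then
      cp -r "$src/licenses-binary" "$dist/licenses"
    fi
    if [ -e "$src/NOTICE-binary" ]; then
      cp "$src/NOTICE-binary" "$dist/NOTICE"
    fi
  elif [ -e "$src/LICENSE" ]; then
    # Assumed fallback: reuse the source-release license files.
    cp "$src/LICENSE" "$dist/LICENSE"
    if [ -d "$src/licenses" ]; then
      cp -r "$src/licenses" "$dist/licenses"
    fi
    if [ -e "$src/NOTICE" ]; then
      cp "$src/NOTICE" "$dist/NOTICE"
    fi
  else
    # Nothing to copy: report it and carry on without failing.
    echo "Skipping copying LICENSE files" >&2
  fi
}
```

It could be called as `copy_license_files ./spark-2.4.5 ./spark-2.4.5/dist` (paths are illustrative) before the python packaging step runs.
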

On Fri, May 1, 2020 at 4:38 PM Sean Owen <sro...@gmail.com> wrote:

> I see, that makes more sense, though I have limited knowledge of how the
> pip packaging works. You don't need pip packaging, do you? just pyspark
> itself right. Omit --pip?
>
> On Fri, May 1, 2020 at 3:32 PM Xiangyu Li <yisky...@gmail.com> wrote:
>
>> make-distribution.sh with --pip would run a `python setup.py sdist`
>> within that make-distribution.sh script.
>> I also tested `make-distribution.sh` without --pip, and the same error
>> happens.
>>
>> Correct me if I'm wrong, but pyspark binary has always been successfully
>> built, it is the pyspark pip package that is failing.
>>
>> On Fri, May 1, 2020 at 4:23 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> Hm, others may have to chime in here. Either that's not how you create
>>> the pyspark binary from the source release (make-distribution.sh doesn't do
>>> that?) or there is a small but important issue here, that the source
>>> release doesn't contain one thing that the binary release script expects,
>>> which is LICENSE-binary et al. If it's the latter, we could move around the
>>> LICENSE bits in the source tree so that both are "source" files included in
>>> the source release, so you can make the binary release with it, but, I'd
>>> probably say it's easier/better to simply skip adding the license in this
>>> path (if it's supposed to work this way at all) as the use case, a custom
>>> derived work, doesn't need the *ASF's* license statement.
>>>
>>> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>>
>>>> To reproduce this, I just did
>>>>
>>>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>> tar xzf spark-2.4.5.tgz
>>>> cd spark-2.4.5
>>>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>>>> mv spark-2.4.5-bin-custom-spark.tgz ../
>>>> cd ..
>>>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>>>> cd spark-2.4.5-bin-custom-spark/python/
>>>> sudo python setup.py install
>>>>
>>>> And here is the output:
>>>> [image: image.png]
>>>>
>>>> On Fri, May 1, 2020 at 2:48 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> You wrote:
>>>>>
>>>>> "
>>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>>> install` inside the python directory.
>>>>>
>>>>> Step 2 would fail because of missing the licenses directory.
>>>>> "
>>>>>
>>>>> That shouldn't depend on the license file, and the script you showed
>>>>> does not fail when not present, so I am wondering what this means.
>>>>> I'm not sure there's a JIRA here yet.
>>>>>
>>>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>
>>>>>> Hmm, sorry I don't get what part of my email were you referring to
>>>>>> when you said "the build fails?".
>>>>>>
>>>>>> So I am trying to build a custom spark binary distribution with, say,
>>>>>> different Hadoop versions and R support.
>>>>>>
>>>>>> Then I stored this custom build on S3, so as I am building more
>>>>>> machines I can just directly download this custom build from S3. But
>>>>>> besides spark-submit and what not, I also wanted to install the pyspark
>>>>>> python package to the machine I am building.
>>>>>>
>>>>>> The lack of the LICENSE file in the custom build would prevent
>>>>>> pyspark from being successfully built.
>>>>>>
>>>>>> Hopefully this answers your question.
>>>>>>
>>>>>> The second part of my last email was about building pyspark inside
>>>>>> spark source directory, I will raise an issue on Jira for that, as it is
>>>>>> more of a clean cut problem with the documentation on the website and the
>>>>>> comments in make-distribution.sh.
>>>>>>
>>>>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> Hm, the build fails? you can see this is just skipped if not
>>>>>>> present, for this reason.
>>>>>>> I'm not clear why you need the file for its own sake, for your own
>>>>>>> internal modification that you don't redistribute.
>>>>>>>
>>>>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Sean,
>>>>>>>>
>>>>>>>> Thanks for the quick response! Yes, what you described about how
>>>>>>>> the LICENSE file should be distributed makes sense.
>>>>>>>>
>>>>>>>> The reason I learned about this is that I was trying to build
>>>>>>>> spark-2.4.5-bin-custom.tgz, then distribute this build to multiple
>>>>>>>> machines, so that:
>>>>>>>>
>>>>>>>> 1. These machines can run spark with the build.
>>>>>>>> 2. On each machine, I can install pyspark by running `python
>>>>>>>> setup.py install` inside the python directory.
>>>>>>>>
>>>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>>>>
>>>>>>>> Building pyspark out of a binary distribution is a bit
>>>>>>>> unconventional, but I did this after failing to do what the official doc
>>>>>>>> recommended (
>>>>>>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>>>>>> so taking a step back to describe what I did originally:
>>>>>>>>
>>>>>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>>>>>
>>>>>>>> `./build/mvn -DskipTests clean package`
>>>>>>>>
>>>>>>>> And then went to the python directory and did:
>>>>>>>>
>>>>>>>> `python setup.py sdist` followed by `pip install
>>>>>>>> dist/pyspark-2.4.5.tar.gz` (as mentioned in make-distribution.sh).
>>>>>>>>
>>>>>>>> This ran into "error: package directory `deps/jars` does not exist".
>>>>>>>>
>>>>>>>> However, directly running
>>>>>>>>
>>>>>>>> `sudo python setup.py install`
>>>>>>>>
>>>>>>>> worked.
>>>>>>>>
>>>>>>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The source distribution has the source LICENSE file. The binary
>>>>>>>>> distribution has the LICENSE-binary license file. The source release isn't
>>>>>>>>> supposed to have LICENSE-binary as it would not be accurate for that
>>>>>>>>> release; LICENSE is. If you're redistributing a build, you'll have your own
>>>>>>>>> process for modifying and building it, including modifying the LICENSE file
>>>>>>>>> as appropriate; these LICENSE files represent what the project delivers to
>>>>>>>>> you rather than what you deliver to others. You could get the
>>>>>>>>> LICENSE-binary file from the right hash commit from git, if desired, as
>>>>>>>>> part of your build.
>>>>>>>>>
>>>>>>>>> On Fri, May 1, 2020 at 10:19 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I downloaded spark-2.4.5 source from
>>>>>>>>>> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>>>>>>> After extracting it and running:
>>>>>>>>>>
>>>>>>>>>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>>>>>>>>>>
>>>>>>>>>> It creates a Spark binary distribution named:
>>>>>>>>>> spark-2.4.5-bin-custom-spark.tgz
>>>>>>>>>>
>>>>>>>>>> So this file is supposedly a ready-to-distribute Spark binary
>>>>>>>>>> file like the one you can download from
>>>>>>>>>> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>>>>>>>>>
>>>>>>>>>> However, one big difference between this custom build and the
>>>>>>>>>> official build is that you do not have a LICENSE file in the custom
>>>>>>>>>> build.
>>>>>>>>>> I don't know much about the Apache license, but I would suppose a
>>>>>>>>>> custom build distribution should have one.
>>>>>>>>>>
>>>>>>>>>> The file is missing because of the following code in
>>>>>>>>>> make-distribution.sh:
>>>>>>>>>> [image: image.png]
>>>>>>>>>>
>>>>>>>>>> There is no LICENSE-binary file in the official spark-2.4.5.tgz
>>>>>>>>>> file, therefore there will be no LICENSE file in your custom build.
>>>>>>>>>>
>>>>>>>>>> I am aware of two pull requests related to this:
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/pull/22436
>>>>>>>>>> started to use LICENSE-binary instead of just the LICENSE.
>>>>>>>>>>
>>>>>>>>>> And
>>>>>>>>>> https://github.com/apache/spark/pull/22840
>>>>>>>>>> To avoid failure when there is no LICENSE-binary in spark-2.4.5
>>>>>>>>>> source directory.
>>>>>>>>>>
>>>>>>>>>> I think we need to change make-distribution.sh to make sure that
>>>>>>>>>> the LICENSE file is copied over to its corresponding custom build
>>>>>>>>>> distribution. However, I am not ready to do a pull request, so
>>>>>>>>>> hopefully we can discuss it here first.
>>>>>>>>>> --
>>>>>>>>>> Sincerely
>>>>>>>>>> Xiangyu Li
>>>>>>>>>>
>>>>>>>>>> <yisky...@gmail.com>

--
Sincerely
Xiangyu Li

<yisky...@gmail.com>