I see, that makes more sense, though I have limited knowledge of how the pip packaging works. You don't need the pip packaging, do you? Just pyspark itself, right? If so, can you omit --pip?
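To pin down which license files each step actually sees, a quick check along these lines might help. This is only a sketch: `report_license_files` is a hypothetical helper, not part of Spark, and the three names it checks are simply the ones mentioned in this thread (LICENSE, LICENSE-binary, and the licenses directory).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Spark): report which of the license
# artifacts mentioned in this thread exist under a given directory.
report_license_files() {
  local dir="$1"
  local name
  for name in LICENSE LICENSE-binary licenses; do
    if [ -e "$dir/$name" ]; then
      echo "$name: present"
    else
      echo "$name: missing"
    fi
  done
}

# Example: compare the extracted source release with the custom build.
# The directory names come from the reproduction steps in this thread;
# adjust them to wherever the trees were extracted.
# report_license_files spark-2.4.5
# report_license_files spark-2.4.5-bin-custom-spark
```

Running it against both trees should show whether the licenses directory that the install step reportedly needs is absent from the source release, the custom build, or both.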

On Fri, May 1, 2020 at 3:32 PM Xiangyu Li <yisky...@gmail.com> wrote:

> make-distribution.sh with --pip would run a `python setup.py sdist` within
> that make-distribution.sh script.
> I also tested `make-distribution.sh` without --pip, and the same error
> happens.
>
> Correct me if I'm wrong, but pyspark binary has always been successfully
> built, it is the pyspark pip package that is failing.
>
> On Fri, May 1, 2020 at 4:23 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Hm, others may have to chime in here. Either that's not how you create
>> the pyspark binary from the source release (make-distribution.sh doesn't do
>> that?) or there is a small but important issue here, that the source
>> release doesn't contain one thing that the binary release script expects,
>> which is LICENSE-binary et al. If it's the latter, we could move around the
>> LICENSE bits in the source tree so that both are "source" files included in
>> the source release, so you can make the binary release with it, but, I'd
>> probably say it's easier/better to simply skip adding the license in this
>> path (if it's supposed to work this way at all) as the use case, a custom
>> derived work, doesn't need the *ASF's* license statement.
>>
>> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>
>>> To reproduce this, I just did
>>>
>>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>> tar xzf spark-2.4.5.tgz
>>> cd spark-2.4.5
>>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>>> mv spark-2.4.5-bin-custom-spark.tgz ../
>>> cd ..
>>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>>> cd spark-2.4.5-bin-custom-spark/python/
>>> sudo python setup.py install
>>>
>>> And here is the output:
>>> [image: image.png]
>>>
>>> On Fri, May 1, 2020 at 2:48 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> You wrote:
>>>>
>>>> "
>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>> install` inside the python directory.
>>>>
>>>> Step 2 would fail because of missing the licenses directory.
>>>> "
>>>>
>>>> That shouldn't depend on the license file, and the script you showed
>>>> does not fail when not present, so I am wondering what this means.
>>>> I'm not sure there's a JIRA here yet.
>>>>
>>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>
>>>>> Hmm, sorry I don't get what part of my email were you referring to
>>>>> when you said "the build fails?".
>>>>>
>>>>> So I am trying to build a custom spark binary distribution with, say,
>>>>> different Hadoop versions and R support.
>>>>>
>>>>> Then I stored this custom build on S3, so as I am building more
>>>>> machines I can just directly download this custom build from S3. But
>>>>> besides spark-submit and what not, I also wanted to install the pyspark
>>>>> python package to the machine I am building.
>>>>>
>>>>> The lack of the LICENSE file in the custom build would prevent pyspark
>>>>> from being successfully built.
>>>>>
>>>>> Hopefully this answers your question.
>>>>>
>>>>> The second part of my last email was about building pyspark inside
>>>>> spark source directory, I will raise an issue on Jira for that, as it is
>>>>> more of a clean cut problem with the documentation on the website and the
>>>>> comments in make-distribution.sh.
>>>>>
>>>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> Hm, the build fails? you can see this is just skipped if not present,
>>>>>> for this reason.
>>>>>> I'm not clear why you need the file for its own sake, for your own
>>>>>> internal modification that you don't redistribute.
>>>>>>
>>>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Sean,
>>>>>>>
>>>>>>> Thanks for the quick response! Yes, what you described about how
>>>>>>> LICENSE file should be distributed makes sense.
>>>>>>>
>>>>>>> The reason I learned about this is that I was trying to build
>>>>>>> spark-2.4.5-bin-custom.tgz, then distributes this build to multiple
>>>>>>> machines, so that:
>>>>>>>
>>>>>>> 1. These machines can run spark with the built.
>>>>>>> 2. On each machine, I can install pyspark by running `python
>>>>>>> setup.py install` inside the python directory.
>>>>>>>
>>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>>>
>>>>>>> Building pyspark out of a binary distribution is a bit
>>>>>>> unconventional, but I did this after failing to do what the official doc
>>>>>>> recommended (
>>>>>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>>>>> so taking a step back to describe what I did originally:
>>>>>>>
>>>>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>>>>
>>>>>>> `./build/mvn -DskipTests clean package`
>>>>>>>
>>>>>>> And then went to the python directory and did:
>>>>>>>
>>>>>>> `python setup.py sdist` followed by `pip install
>>>>>>> dist/pyspark-2.4.5.tar.gz` (as mentioned in the
>>>>>>> make-distribution.sh.)
>>>>>>>
>>>>>>> This ran into "error: package directory `deps/jars` does not exist".
>>>>>>>
>>>>>>> However, directly running
>>>>>>>
>>>>>>> `sudo python setup.py install`
>>>>>>>
>>>>>>> worked.
>>>>>>>
>>>>>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> The source distribution has the source LICENSE file. The binary
>>>>>>>> distribution has the LICENSE-binary license file. The source release isn't
>>>>>>>> supposed to have LICENSE-binary as it would not be accurate for that
>>>>>>>> release; LICENSE is. If you're redistributing a build, you'll have your own
>>>>>>>> process for modifying and building it, including modifying the LICENSE file
>>>>>>>> as appropriate; these LICENSE files represent what the project delivers to
>>>>>>>> you rather than what you deliver to others. You could get the
>>>>>>>> LICENSE-binary file from the right hash commit from git, if desired, as
>>>>>>>> part of your build.
>>>>>>>>
>>>>>>>> On Fri, May 1, 2020 at 10:19 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I downloaded spark-2.4.5 source from
>>>>>>>>> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>>>>>> After extracting it and running:
>>>>>>>>>
>>>>>>>>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz
>>>>>>>>> -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
>>>>>>>>> -Pkubernetes
>>>>>>>>>
>>>>>>>>> It creates a Spark binary distribution named:
>>>>>>>>> spark-2.4.5-bin-custom-spark.tgz
>>>>>>>>>
>>>>>>>>> So this file is supposedly a ready-to-distribute Spark binary file
>>>>>>>>> like the one you can download from
>>>>>>>>> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>>>>>>>>
>>>>>>>>> However, one big difference between this custom build and the
>>>>>>>>> official build is that you do not have a LICENSE file in the custom build.
>>>>>>>>> I don't know much about Apache license, but I would suppose a custom build
>>>>>>>>> distribution should have one.
>>>>>>>>>
>>>>>>>>> The reason we are missing the file is caused by the following code
>>>>>>>>> in make-distribution.sh:
>>>>>>>>> [image: image.png]
>>>>>>>>>
>>>>>>>>> There is no LICENSE-binary file in the official spark-2.4.5.tgz
>>>>>>>>> file, therefore there will be no LICENSE file in your custom build.
>>>>>>>>>
>>>>>>>>> I am aware of two pull requests related to this:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/pull/22436
>>>>>>>>> started to use LICENSE-binary instead of just the LICENSE.
>>>>>>>>>
>>>>>>>>> And
>>>>>>>>> https://github.com/apache/spark/pull/22840
>>>>>>>>> To avoid failure when there is no LICENSE-binary in spark-2.4.5
>>>>>>>>> source directory.
>>>>>>>>>
>>>>>>>>> I think we need to change make-distribution.sh to make sure that
>>>>>>>>> the LICENSE file is copied over to its corresponding custom build
>>>>>>>>> distribution. However, I am not ready to do a pull request, so
>>>>>>>>> hopefully we can discuss it here first.
>>>>>>>>> --
>>>>>>>>> Sincerely
>>>>>>>>> Xiangyu Li
>>>>>>>>>
>>>>>>>>> <yisky...@gmail.com>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sincerely
>>>>>>> Xiangyu Li
>>>>>>>
>>>>>>> <yisky...@gmail.com>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Sincerely
>>>>> Xiangyu Li
>>>>>
>>>>> <yisky...@gmail.com>
>>>>>
>>>>
>>>
>>> --
>>> Sincerely
>>> Xiangyu Li
>>>
>>> <yisky...@gmail.com>
>>>
>>
>
> --
> Sincerely
> Xiangyu Li
>
> <yisky...@gmail.com>
>