Running make-distribution.sh with --pip runs `python setup.py sdist` as part
of the script itself.
I also tested `make-distribution.sh` without --pip, and the same error
occurs.
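
For reference, the license-copy step discussed further down the thread can be sketched roughly as follows. This is a minimal sketch reconstructed from the discussion, not the actual text of make-distribution.sh. The function name `copy_license_files`, its two arguments, and the `licenses-binary` source directory are assumptions made for illustration. Nothing in the sketch looks at --pip, which would be consistent with seeing the same result with or without that flag.

```shell
#!/bin/sh
# Hypothetical helper sketching the behavior described in this thread:
# copy the binary license files when the source tree has them, otherwise skip.
copy_license_files() {
  src="$1"   # source tree (assumed layout)
  dist="$2"  # distribution directory being assembled
  if [ -e "$src/LICENSE-binary" ]; then
    cp "$src/LICENSE-binary" "$dist/LICENSE"
    # Assumption: per-dependency license texts live in licenses-binary/
    if [ -d "$src/licenses-binary" ]; then
      cp -r "$src/licenses-binary" "$dist/licenses"
    fi
  else
    echo "Skipping copying LICENSE files"
  fi
}

# Demo on throwaway directories: one tree without LICENSE-binary (like the
# source release described in this thread) and one tree that has it.
work=$(mktemp -d)
mkdir -p "$work/src_release" "$work/git_checkout/licenses-binary" \
         "$work/dist_a" "$work/dist_b"
touch "$work/git_checkout/LICENSE-binary"

copy_license_files "$work/src_release" "$work/dist_a"
copy_license_files "$work/git_checkout" "$work/dist_b"
```

With a tree that lacks LICENSE-binary, the sketch leaves the distribution without a LICENSE file or a licenses directory, which matches what the custom build looks like.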

Correct me if I'm wrong, but the pyspark binary has always built
successfully; it is the pyspark pip package that is failing.
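
One way to catch this before running `python setup.py install` is to check the extracted distribution for the pieces this thread says are missing. A minimal sketch, assuming the `LICENSE` file and the `licenses` directory sit at the top level of the distribution. The function name `check_dist_layout` is hypothetical, and the real setup.py may well require other paths that are not checked here.

```shell
#!/bin/sh
# Hypothetical pre-flight check: report which of the license-related paths
# mentioned in this thread are absent from an extracted distribution.
check_dist_layout() {
  dist_dir="$1"
  status=0
  [ -f "$dist_dir/LICENSE" ] || { echo "missing file: $dist_dir/LICENSE"; status=1; }
  [ -d "$dist_dir/licenses" ] || { echo "missing directory: $dist_dir/licenses"; status=1; }
  return "$status"
}

# Demo on throwaway directories standing in for two extracted distributions.
tmp=$(mktemp -d)
mkdir -p "$tmp/incomplete" "$tmp/complete/licenses"
touch "$tmp/complete/LICENSE"

check_dist_layout "$tmp/incomplete" || echo "incomplete: do not run setup.py yet"
check_dist_layout "$tmp/complete" && echo "complete: layout looks fine"
```

In practice the argument would be the extracted `spark-2.4.5-bin-custom-spark` directory rather than a temporary one.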

On Fri, May 1, 2020 at 4:23 PM Sean Owen <sro...@gmail.com> wrote:

> Hm, others may have to chime in here. Either that's not how you create the
> pyspark binary from the source release (make-distribution.sh doesn't do
> that?) or there is a small but important issue here, that the source
> release doesn't contain one thing that the binary release script expects,
> which is LICENSE-binary et al. If it's the latter, we could move around the
> LICENSE bits in the source tree so that both are "source" files included in
> the source release, so you can make the binary release with it, but, I'd
> probably say it's easier/better to simply skip adding the license in this
> path (if it's supposed to work this way at all) as the use case, a custom
> derived work, doesn't need the *ASF's* license statement.
>
>
> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li <yisky...@gmail.com> wrote:
>
>> To reproduce this, I just did
>>
>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>> tar xzf spark-2.4.5.tgz
>> cd spark-2.4.5
>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>> mv spark-2.4.5-bin-custom-spark.tgz ../
>> cd ..
>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>> cd spark-2.4.5-bin-custom-spark/python/
>> sudo python setup.py install
>>
>> And here is the output:
>> [image: image.png]
>>
>>
>> On Fri, May 1, 2020 at 2:48 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> You wrote:
>>>
>>> "
>>> 2. On each machine, I can install pyspark by running `python setup.py
>>> install` inside the python directory.
>>>
>>> Step 2 would fail because of missing the licenses directory.
>>> "
>>>
>>> That shouldn't depend on the license file, and the script you showed
>>> does not fail when not present, so I am wondering what this means.
>>> I'm not sure there's a JIRA here yet.
>>>
>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>>
>>>> Hmm, sorry, I don't understand which part of my email you were
>>>> referring to when you said "the build fails?".
>>>>
>>>> So I am trying to build a custom spark binary distribution with, say,
>>>> different Hadoop versions and R support.
>>>>
>>>> I then store this custom build on S3 so that, as I build more
>>>> machines, I can download it directly from S3. Besides spark-submit
>>>> and so on, I also want to install the pyspark Python package on each
>>>> machine I build.
>>>>
>>>> The lack of the LICENSE file in the custom build would prevent pyspark
>>>> from being successfully built.
>>>>
>>>> Hopefully this answers your question.
>>>>
>>>> The second part of my last email was about building pyspark inside
>>>> the spark source directory. I will raise an issue on Jira for that, as
>>>> it is more of a clear-cut problem with the documentation on the website
>>>> and the comments in make-distribution.sh.
>>>>
>>>>
>>>>
>>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> Hm, the build fails? You can see the license copy is just skipped if
>>>>> the file is not present, for this reason.
>>>>> I'm not clear why you need the file for its own sake, for your own
>>>>> internal modification that you don't redistribute.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>
>>>>>> Hi Sean,
>>>>>>
>>>>>> Thanks for the quick response! Yes, what you described about how
>>>>>> the LICENSE file should be distributed makes sense.
>>>>>>
>>>>>> The reason I learned about this is that I was trying to build
>>>>>> spark-2.4.5-bin-custom.tgz and then distribute this build to multiple
>>>>>> machines, so that:
>>>>>>
>>>>>> 1. These machines can run spark with the build.
>>>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>>>> install` inside the python directory.
>>>>>>
>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>>
>>>>>> Building pyspark out of a binary distribution is a bit
>>>>>> unconventional, but I did this after failing to do what the official doc
>>>>>> recommended (
>>>>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>>>> so taking a step back to describe what I did originally:
>>>>>>
>>>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>>>
>>>>>> `./build/mvn -DskipTests clean package`
>>>>>>
>>>>>>
>>>>>> And then went to the python directory and did:
>>>>>>
>>>>>>
>>>>>> `python setup.py sdist` followed by `pip install
>>>>>> dist/pyspark-2.4.5.tar.gz` (as mentioned in the comments of
>>>>>> make-distribution.sh).
>>>>>>
>>>>>>
>>>>>> This ran into "error: package directory `deps/jars` does not exist".
>>>>>>
>>>>>>
>>>>>> However, directly running
>>>>>>
>>>>>>
>>>>>> `sudo python setup.py install`
>>>>>>
>>>>>>
>>>>>> worked.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> The source distribution has the source LICENSE file. The binary
>>>>>>> distribution has the LICENSE-binary license file. The source release
>>>>>>> isn't supposed to have LICENSE-binary as it would not be accurate for
>>>>>>> that release; LICENSE is. If you're redistributing a build, you'll have
>>>>>>> your own process for modifying and building it, including modifying the
>>>>>>> LICENSE file as appropriate; these LICENSE files represent what the
>>>>>>> project delivers to you rather than what you deliver to others. You
>>>>>>> could get the LICENSE-binary file from the right hash commit from git,
>>>>>>> if desired, as part of your build.
>>>>>>>
>>>>>>> On Fri, May 1, 2020 at 10:19 AM Xiangyu Li <yisky...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I downloaded spark-2.4.5 source from
>>>>>>>> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>>>>> After extracting it and running:
>>>>>>>>
>>>>>>>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz 
>>>>>>>> -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn 
>>>>>>>> -Pkubernetes
>>>>>>>>
>>>>>>>>
>>>>>>>> It creates a Spark binary distribution named:
>>>>>>>> spark-2.4.5-bin-custom-spark.tgz
>>>>>>>>
>>>>>>>> So this file is supposedly a ready-to-distribute Spark binary file
>>>>>>>> like the one you can download from
>>>>>>>> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>>>>>>>
>>>>>>>> However, one big difference between this custom build and the
>>>>>>>> official build is that you do not have a LICENSE file in the custom
>>>>>>>> build. I don't know much about the Apache license, but I would suppose
>>>>>>>> a custom build distribution should have one.
>>>>>>>>
>>>>>>>> The file is missing because of the following code in
>>>>>>>> make-distribution.sh:
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> There is no LICENSE-binary file in the official spark-2.4.5.tgz
>>>>>>>> file, therefore there will be no LICENSE file in your custom build.
>>>>>>>>
>>>>>>>> I am aware of two pull requests related to this:
>>>>>>>>
>>>>>>>> https://github.com/apache/spark/pull/22436
>>>>>>>> started using LICENSE-binary instead of just LICENSE.
>>>>>>>>
>>>>>>>> And
>>>>>>>> https://github.com/apache/spark/pull/22840
>>>>>>>> avoids a failure when there is no LICENSE-binary in the spark-2.4.5
>>>>>>>> source directory.
>>>>>>>>
>>>>>>>> I think we need to change make-distribution.sh to make sure that
>>>>>>>> the LICENSE file is copied over to its corresponding custom build
>>>>>>>> distribution. However, I am not ready to do a pull request, so
>>>>>>>> hopefully we can discuss it here first.
>>>>>>>> --
>>>>>>>> Sincerely
>>>>>>>> Xiangyu Li
>>>>>>>>
>>>>>>>> <yisky...@gmail.com>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sincerely
>>>>>> Xiangyu Li
>>>>>>
>>>>>> <yisky...@gmail.com>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Sincerely
>>>> Xiangyu Li
>>>>
>>>> <yisky...@gmail.com>
>>>>
>>>
>>
>> --
>> Sincerely
>> Xiangyu Li
>>
>> <yisky...@gmail.com>
>>
>

-- 
Sincerely
Xiangyu Li

<yisky...@gmail.com>
