I do need the pip packaging; all of this effort is actually to get a pyspark
pip package.

On Fri, May 1, 2020 at 4:38 PM Sean Owen <sro...@gmail.com> wrote:

> I see, that makes more sense, though I have limited knowledge of how the
> pip packaging works. You don't need pip packaging, do you? Just pyspark
> itself, right? Could you omit --pip?
>
> On Fri, May 1, 2020 at 3:32 PM Xiangyu Li <yisky...@gmail.com> wrote:
>
>> make-distribution.sh with --pip would run a `python setup.py sdist`
>> within that make-distribution.sh script.
>> I also tested `make-distribution.sh` without --pip, and the same error
>> happens.
>>
>> Correct me if I'm wrong, but the pyspark binary has always built
>> successfully; it is the pyspark pip package that is failing.
>>
>> On Fri, May 1, 2020 at 4:23 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> Hm, others may have to chime in here. Either that's not how you create
>>> the pyspark binary from the source release (make-distribution.sh doesn't do
>>> that?) or there is a small but important issue here, that the source
>>> release doesn't contain one thing that the binary release script expects,
>>> which is LICENSE-binary et al. If it's the latter, we could move around the
>>> LICENSE bits in the source tree so that both are "source" files included in
>>> the source release, so you can make the binary release with it, but, I'd
>>> probably say it's easier/better to simply skip adding the license in this
>>> path (if it's supposed to work this way at all) as the use case, a custom
>>> derived work, doesn't need the *ASF's* license statement.
>>>
>>>
>>> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>>
>>>> To reproduce this, I just did
>>>>
>>>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>> tar xzf spark-2.4.5.tgz
>>>> cd spark-2.4.5
>>>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>>>> mv spark-2.4.5-bin-custom-spark.tgz ../
>>>> cd ..
>>>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>>>> cd spark-2.4.5-bin-custom-spark/python/
>>>> sudo python setup.py install
>>>>
>>>> And here is the output:
>>>> [image: image.png]
>>>>
>>>>
>>>> On Fri, May 1, 2020 at 2:48 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> You wrote:
>>>>>
>>>>> "
>>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>>> install` inside the python directory.
>>>>>
>>>>> Step 2 would fail because of missing the licenses directory.
>>>>> "
>>>>>
>>>>> That shouldn't depend on the license file, and the script you showed
>>>>> does not fail when not present, so I am wondering what this means.
>>>>> I'm not sure there's a JIRA here yet.
>>>>>
>>>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>
>>>>>> Hmm, sorry, I don't understand which part of my email you were
>>>>>> referring to when you said "the build fails?".
>>>>>>
>>>>>> So I am trying to build a custom spark binary distribution with, say,
>>>>>> different Hadoop versions and R support.
>>>>>>
>>>>>> Then I stored this custom build on S3, so as I am building more
>>>>>> machines I can just directly download this custom build from S3. But
>>>>>> besides spark-submit and what not, I also wanted to install the pyspark
>>>>>> python package to the machine I am building.
>>>>>>
>>>>>> The lack of the LICENSE file in the custom build would prevent
>>>>>> pyspark from being successfully built.
>>>>>>
>>>>>> Hopefully this answers your question.
>>>>>>
>>>>>> The second part of my last email was about building pyspark inside
>>>>>> the spark source directory. I will raise an issue on Jira for that, as it
>>>>>> is more of a clear-cut problem with the documentation on the website and
>>>>>> the comments in make-distribution.sh.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> Hm, the build fails? You can see this is just skipped if not
>>>>>>> present, for this reason.
>>>>>>> I'm not clear why you need the file for its own sake, for your own
>>>>>>> internal modification that you don't redistribute.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li <yisky...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Sean,
>>>>>>>>
>>>>>>>> Thanks for the quick response! Yes, what you described about how
>>>>>>>> the LICENSE file should be distributed makes sense.
>>>>>>>>
>>>>>>>> The reason I learned about this is that I was trying to build
>>>>>>>> spark-2.4.5-bin-custom.tgz, then distribute this build to multiple
>>>>>>>> machines, so that:
>>>>>>>>
>>>>>>>> 1. These machines can run spark with the build.
>>>>>>>> 2. On each machine, I can install pyspark by running `python
>>>>>>>> setup.py install` inside the python directory.
>>>>>>>>
>>>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>>>>
>>>>>>>> Building pyspark out of a binary distribution is a bit
>>>>>>>> unconventional, but I did this after failing to do what the official doc
>>>>>>>> recommended (
>>>>>>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>>>>>> so taking a step back to describe what I did originally:
>>>>>>>>
>>>>>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>>>>>
>>>>>>>> `./build/mvn -DskipTests clean package`
>>>>>>>>
>>>>>>>>
>>>>>>>> And then went to the python directory and did:
>>>>>>>>
>>>>>>>>
>>>>>>>> `python setup.py sdist` followed by
>>>>>>>> `pip install dist/pyspark-2.4.5.tar.gz` (as mentioned in
>>>>>>>> make-distribution.sh).
>>>>>>>>
>>>>>>>>
>>>>>>>> This ran into "error: package directory `deps/jars` does
>>>>>>>> not exist".
>>>>>>>>
>>>>>>>>
>>>>>>>> However, directly running
>>>>>>>>
>>>>>>>>
>>>>>>>> `sudo python setup.py install`
>>>>>>>>
>>>>>>>>
>>>>>>>> worked.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The source distribution has the source LICENSE file. The binary
>>>>>>>>> distribution has the LICENSE-binary license file. The source release isn't
>>>>>>>>> supposed to have LICENSE-binary as it would not be accurate for that
>>>>>>>>> release; LICENSE is. If you're redistributing a build, you'll have your own
>>>>>>>>> process for modifying and building it, including modifying the LICENSE file
>>>>>>>>> as appropriate; these LICENSE files represent what the project delivers to
>>>>>>>>> you rather than what you deliver to others. You could get the
>>>>>>>>> LICENSE-binary file from the right hash commit from git, if desired, as
>>>>>>>>> part of your build.
>>>>>>>>>
>>>>>>>>> On Fri, May 1, 2020 at 10:19 AM Xiangyu Li <yisky...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I downloaded spark-2.4.5 source from
>>>>>>>>>> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>>>>>>> After extracting it and running:
>>>>>>>>>>
>>>>>>>>>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It creates a Spark binary distribution named:
>>>>>>>>>> spark-2.4.5-bin-custom-spark.tgz
>>>>>>>>>>
>>>>>>>>>> So this file is supposedly a ready-to-distribute Spark binary
>>>>>>>>>> file like the one you can download from
>>>>>>>>>> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>>>>>>>>>
>>>>>>>>>> However, one big difference between this custom build and the
>>>>>>>>>> official build is that you do not have a LICENSE file in the custom build.
>>>>>>>>>> I don't know much about Apache license, but I would suppose a custom build
>>>>>>>>>> distribution should have one.
>>>>>>>>>>
>>>>>>>>>> The file is missing because of the following code in
>>>>>>>>>> make-distribution.sh:
>>>>>>>>>> [image: image.png]
>>>>>>>>>>
>>>>>>>>>> There is no LICENSE-binary file in the official spark-2.4.5.tgz
>>>>>>>>>> file, therefore there will be no LICENSE file in your custom build.
>>>>>>>>>>
>>>>>>>>>> I am aware of two pull requests related to this:
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/pull/22436
>>>>>>>>>> which started to use LICENSE-binary instead of just the LICENSE.
>>>>>>>>>>
>>>>>>>>>> And
>>>>>>>>>> https://github.com/apache/spark/pull/22840
>>>>>>>>>> which avoids failure when there is no LICENSE-binary in the
>>>>>>>>>> source directory.
>>>>>>>>>>
>>>>>>>>>> I think we need to change make-distribution.sh to make sure that
>>>>>>>>>> the LICENSE file is copied over to its corresponding custom build
>>>>>>>>>> distribution. However, I am not ready to do a pull request, so hopefully we
>>>>>>>>>> can discuss it here first.
>>>>>>>>>> --
>>>>>>>>>> Sincerely
>>>>>>>>>> Xiangyu Li
>>>>>>>>>>
>>>>>>>>>> <yisky...@gmail.com>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sincerely
>>>>>>>> Xiangyu Li
>>>>>>>>
>>>>>>>> <yisky...@gmail.com>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sincerely
>>>>>> Xiangyu Li
>>>>>>
>>>>>> <yisky...@gmail.com>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Sincerely
>>>> Xiangyu Li
>>>>
>>>> <yisky...@gmail.com>
>>>>
>>>
>>
>> --
>> Sincerely
>> Xiangyu Li
>>
>> <yisky...@gmail.com>
>>
>

-- 
Sincerely
Xiangyu Li

<yisky...@gmail.com>
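
Note on the workaround Sean suggested above (getting LICENSE-binary from git
as part of your own build): below is a minimal sketch of that idea, not
something taken from make-distribution.sh. The function name
add_license_files is hypothetical, and the licenses-binary directory name and
the LICENSE / licenses/ target layout are assumptions about what the unpacked
distribution is expected to contain. Whether this alone is enough for
`python setup.py install` to succeed has not been verified here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Spark): copy the binary-license files from
# a git checkout of the matching commit into an unpacked custom distribution.
# Assumed layout: <checkout>/LICENSE-binary and, optionally,
# <checkout>/licenses-binary/; targets are <dist>/LICENSE and <dist>/licenses/.
add_license_files() {
  local checkout="$1" dist="$2"

  # Fail loudly instead of silently skipping when the file is absent.
  if [ ! -f "$checkout/LICENSE-binary" ]; then
    echo "LICENSE-binary not found in $checkout" >&2
    return 1
  fi

  mkdir -p "$dist/licenses"
  cp "$checkout/LICENSE-binary" "$dist/LICENSE"

  # Copy the per-dependency license texts if the checkout has them.
  if [ -d "$checkout/licenses-binary" ]; then
    cp -r "$checkout/licenses-binary/." "$dist/licenses/"
  fi
  return 0
}
```

It would be called with the path to the git checkout and the path to the
unpacked spark-2.4.5-bin-custom-spark directory, before running setup.py in
its python/ subdirectory. For a build that is redistributed, the copied
LICENSE would still need to be adjusted to match what is actually shipped, as
Sean pointed out.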
