[jira] [Comment Edited] (SPARK-34100) pyspark 2.4 packages can't be installed via pip on Amazon Linux 2

2021-01-13 Thread Devin Boyer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264309#comment-17264309
 ] 

Devin Boyer edited comment on SPARK-34100 at 1/13/21, 5:43 PM:
---

Noting that I found a workaround here: this appears to be due to [an issue with 
the version of the setuptools package|https://stackoverflow.com/a/55167875/316079] 
bundled with the Python distribution on Amazon Linux 2, combined with the 
"wheel" library not being installed.

If this command is run on an Amazon Linux 2 installation with Python 3.7 
installed, then pyspark 2.4.x package installation succeeds:

 

{{pip3 install --upgrade --force-reinstall setuptools && pip3 install wheel}}

 

I noticed this doesn't happen with 3.0.x package versions, so maybe there's a 
difference in how the package is distributed between 2.4 and 3.x?
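
For reference, here is a minimal Dockerfile sketch of the workaround applied on 
top of the repro from the issue description (hypothetical; it just folds the two 
commands above into the repro image):

{code}
# Same repro image as in the issue description, with the workaround applied
# before installing pyspark 2.4.x.
FROM amazonlinux:2
RUN yum install -y python3
# Refresh setuptools and install wheel first; the setuptools bundled with the
# Amazon Linux 2 python3 package otherwise fails during the pyspark sdist build
# with "ValueError: bad marshal data (unknown type code)".
RUN pip3 install --upgrade --force-reinstall setuptools && pip3 install wheel
RUN pip3 install pyspark==2.4.7
{code}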

 


was (Author: drboyer):
Noting that I found a workaround here: it appears that this is due to an issue 
with the version of the setuptools package bundled into the Python distribution 
with Amazon Linux 2, and the "wheel" library not being installed.

If this command is run on an Amazon Linux 2 installation with Python 3.7 
installed, then pyspark 2.4.x package installation succeeds:

 

{{pip3 install --upgrade --force-reinstall setuptools && pip3 install wheel}}

 

I noticed this doesn't happen with 3.0.x package versions, so maybe there's a 
difference in how the package is distributed between 2.4 and 3.x?

 

> pyspark 2.4 packages can't be installed via pip on Amazon Linux 2
> -
>
> Key: SPARK-34100
> URL: https://issues.apache.org/jira/browse/SPARK-34100
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, PySpark
>Affects Versions: 2.4.7
> Environment: Amazon Linux 2, with Python 3.7.9 and pip 9.0.3 (also 
> tested with pip 20.3.3), using Docker or EMR 5.32.0
>  
> Example Dockerfile to reproduce:
> {{FROM amazonlinux:2}}
> {{RUN yum install -y python3}}
> {{RUN pip3 install pyspark==2.4.7}}
>  
>Reporter: Devin Boyer
>Priority: Minor
>
> I'm unable to install the pyspark Python package on Amazon Linux 2, whether 
> in a Docker image or an EMR cluster. Amazon Linux 2 currently ships with 
> Python 3.7 and pip 9.0.3, but upgrading pip yields the same result.
>  
> When installing the package, the installation will fail with the error 
> "ValueError: bad marshal data (unknown type code)". Full example stack below.
>  
> This bug prevents using pyspark in simple testing environments and prevents 
> using tools for which the pyspark package is a dependency, like 
> [https://github.com/awslabs/python-deequ].
>  
> Stack Trace:
> {{Step 3/3 : RUN pip3 install pyspark==2.4.7}}
> {{ ---> Running in 2c6e1c1de62f}}
> {{WARNING: Running pip install with root privileges is generally not a good 
> idea. Try `pip3 install --user` instead.}}
> {{Collecting pyspark==2.4.7}}
> {{ Downloading 
> https://files.pythonhosted.org/packages/e2/06/29f80e5a464033432eedf89924e7aa6ebbc47ce4dcd956853a73627f2c07/pyspark-2.4.7.tar.gz
>  (217.9MB)}}
> {{ Complete output from command python setup.py egg_info:}}
> {{ Could not import pypandoc - required to package PySpark}}
> {{ /usr/lib64/python3.7/distutils/dist.py:274: UserWarning: Unknown 
> distribution option: 'long_description_content_type'}}
> {{ warnings.warn(msg)}}
> {{ zip_safe flag not set; analyzing archive contents...}}
> {{ Traceback (most recent call last):}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 154, 
> in save_modules}}
> {{ yield saved}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 195, 
> in setup_context}}
> {{ yield}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 250, 
> in run_setup}}
> {{ _execfile(setup_script, ns)}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 45, in 
> _execfile}}
> {{ exec(code, globals, locals)}}
> {{ File "/tmp/easy_install-l742j64w/pypandoc-1.5/setup.py", line 111, in 
> }}
> {{ # using Python imports instead which will be resolved correctly.}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/__init__.py", line 129, 
> in setup}}
> {{ return distutils.core.setup(**attrs)}}
> {{ File "/usr/lib64/python3.7/distutils/core.py", line 148, in setup}}
> {{ dist.run_commands()}}
> {{ File "/usr/lib64/python3.7/distutils/dist.py", line 966, in run_commands}}
> {{ self.run_command(cmd)}}
> {{ File "/usr/lib64/python3.7/distutils/dist.py", line 985, in run_command}}
> {{ cmd_obj.run()}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", 
> line 218, in run}}
> {{ os.path.join(archive_root, 'EGG-INFO'), self.zip_safe()}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", 
> line 269, in zip_safe}}

[jira] [Commented] (SPARK-34100) pyspark 2.4 packages can't be installed via pip on Amazon Linux 2

2021-01-13 Thread Devin Boyer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264309#comment-17264309
 ] 

Devin Boyer commented on SPARK-34100:
-

Noting that I found a workaround here: this appears to be due to an issue with 
the version of the setuptools package bundled with the Python distribution on 
Amazon Linux 2, combined with the "wheel" library not being installed.

If this command is run on an Amazon Linux 2 installation with Python 3.7 
installed, then pyspark 2.4.x package installation succeeds:

 

{{pip3 install --upgrade --force-reinstall setuptools && pip3 install wheel}}

 

I noticed this doesn't happen with 3.0.x package versions, so maybe there's a 
difference in how the package is distributed between 2.4 and 3.x?

 

> pyspark 2.4 packages can't be installed via pip on Amazon Linux 2
> -
>
> Key: SPARK-34100
> URL: https://issues.apache.org/jira/browse/SPARK-34100
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, PySpark
>Affects Versions: 2.4.7
> Environment: Amazon Linux 2, with Python 3.7.9 and pip 9.0.3 (also 
> tested with pip 20.3.3), using Docker or EMR 5.32.0
>  
> Example Dockerfile to reproduce:
> {{FROM amazonlinux:2}}
> {{RUN yum install -y python3}}
> {{RUN pip3 install pyspark==2.4.7}}
>  
>Reporter: Devin Boyer
>Priority: Minor
>
> I'm unable to install the pyspark Python package on Amazon Linux 2, whether 
> in a Docker image or an EMR cluster. Amazon Linux 2 currently ships with 
> Python 3.7 and pip 9.0.3, but upgrading pip yields the same result.
>  
> When installing the package, the installation will fail with the error 
> "ValueError: bad marshal data (unknown type code)". Full example stack below.
>  
> This bug prevents using pyspark in simple testing environments and prevents 
> using tools for which the pyspark package is a dependency, like 
> [https://github.com/awslabs/python-deequ].
>  
> Stack Trace:
> {{Step 3/3 : RUN pip3 install pyspark==2.4.7}}
> {{ ---> Running in 2c6e1c1de62f}}
> {{WARNING: Running pip install with root privileges is generally not a good 
> idea. Try `pip3 install --user` instead.}}
> {{Collecting pyspark==2.4.7}}
> {{ Downloading 
> https://files.pythonhosted.org/packages/e2/06/29f80e5a464033432eedf89924e7aa6ebbc47ce4dcd956853a73627f2c07/pyspark-2.4.7.tar.gz
>  (217.9MB)}}
> {{ Complete output from command python setup.py egg_info:}}
> {{ Could not import pypandoc - required to package PySpark}}
> {{ /usr/lib64/python3.7/distutils/dist.py:274: UserWarning: Unknown 
> distribution option: 'long_description_content_type'}}
> {{ warnings.warn(msg)}}
> {{ zip_safe flag not set; analyzing archive contents...}}
> {{ Traceback (most recent call last):}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 154, 
> in save_modules}}
> {{ yield saved}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 195, 
> in setup_context}}
> {{ yield}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 250, 
> in run_setup}}
> {{ _execfile(setup_script, ns)}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 45, in 
> _execfile}}
> {{ exec(code, globals, locals)}}
> {{ File "/tmp/easy_install-l742j64w/pypandoc-1.5/setup.py", line 111, in 
> }}
> {{ # using Python imports instead which will be resolved correctly.}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/__init__.py", line 129, 
> in setup}}
> {{ return distutils.core.setup(**attrs)}}
> {{ File "/usr/lib64/python3.7/distutils/core.py", line 148, in setup}}
> {{ dist.run_commands()}}
> {{ File "/usr/lib64/python3.7/distutils/dist.py", line 966, in run_commands}}
> {{ self.run_command(cmd)}}
> {{ File "/usr/lib64/python3.7/distutils/dist.py", line 985, in run_command}}
> {{ cmd_obj.run()}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", 
> line 218, in run}}
> {{ os.path.join(archive_root, 'EGG-INFO'), self.zip_safe()}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", 
> line 269, in zip_safe}}
> {{ return analyze_egg(self.bdist_dir, self.stubs)}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", 
> line 379, in analyze_egg}}
> {{ safe = scan_module(egg_dir, base, name, stubs) and safe}}
> {{ File "/usr/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", 
> line 416, in scan_module}}
> {{ code = marshal.load(f)}}
> {{ ValueError: bad marshal data (unknown type code)}}{{During handling of the 
> above exception, another exception occurred:}}{{Traceback (most recent call 
> last):}}
> {{ File "", line 1, in }}
> {{ File "/tmp/pip-build-j3d56a0n/pyspark/setup.py", line 224, in }}
> {{ 'Programming Language :: Python :: Implementation :: PyPy']}}
> {{ File 

[jira] [Created] (SPARK-34100) pyspark 2.4 packages can't be installed via pip on Amazon Linux 2

2021-01-13 Thread Devin Boyer (Jira)
Devin Boyer created SPARK-34100:
---

 Summary: pyspark 2.4 packages can't be installed via pip on Amazon 
Linux 2
 Key: SPARK-34100
 URL: https://issues.apache.org/jira/browse/SPARK-34100
 Project: Spark
  Issue Type: Bug
  Components: Deploy, PySpark
Affects Versions: 2.4.7
 Environment: Amazon Linux 2, with Python 3.7.9 and pip 9.0.3 (also 
tested with pip 20.3.3), using Docker or EMR 5.32.0

 

Example Dockerfile to reproduce:

{{FROM amazonlinux:2}}
{{RUN yum install -y python3}}
{{RUN pip3 install pyspark==2.4.7}}

 
Reporter: Devin Boyer


I'm unable to install the pyspark Python package on Amazon Linux 2, whether in 
a Docker image or an EMR cluster. Amazon Linux 2 currently ships with Python 
3.7 and pip 9.0.3, but upgrading pip yields the same result.

 

When installing the package, the installation will fail with the error 
"ValueError: bad marshal data (unknown type code)". Full example stack below.

 

This bug prevents using pyspark in simple testing environments and prevents 
using tools for which the pyspark package is a dependency, like 
[https://github.com/awslabs/python-deequ].

 

Stack Trace:

{{Step 3/3 : RUN pip3 install pyspark==2.4.7}}
{{ ---> Running in 2c6e1c1de62f}}
{{WARNING: Running pip install with root privileges is generally not a good 
idea. Try `pip3 install --user` instead.}}
{{Collecting pyspark==2.4.7}}
{{ Downloading 
https://files.pythonhosted.org/packages/e2/06/29f80e5a464033432eedf89924e7aa6ebbc47ce4dcd956853a73627f2c07/pyspark-2.4.7.tar.gz
 (217.9MB)}}
{{ Complete output from command python setup.py egg_info:}}
{{ Could not import pypandoc - required to package PySpark}}
{{ /usr/lib64/python3.7/distutils/dist.py:274: UserWarning: Unknown 
distribution option: 'long_description_content_type'}}
{{ warnings.warn(msg)}}
{{ zip_safe flag not set; analyzing archive contents...}}
{{ Traceback (most recent call last):}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 154, in 
save_modules}}
{{ yield saved}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 195, in 
setup_context}}
{{ yield}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 250, in 
run_setup}}
{{ _execfile(setup_script, ns)}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/sandbox.py", line 45, in 
_execfile}}
{{ exec(code, globals, locals)}}
{{ File "/tmp/easy_install-l742j64w/pypandoc-1.5/setup.py", line 111, in 
}}
{{ # using Python imports instead which will be resolved correctly.}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/__init__.py", line 129, in 
setup}}
{{ return distutils.core.setup(**attrs)}}
{{ File "/usr/lib64/python3.7/distutils/core.py", line 148, in setup}}
{{ dist.run_commands()}}
{{ File "/usr/lib64/python3.7/distutils/dist.py", line 966, in run_commands}}
{{ self.run_command(cmd)}}
{{ File "/usr/lib64/python3.7/distutils/dist.py", line 985, in run_command}}
{{ cmd_obj.run()}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", 
line 218, in run}}
{{ os.path.join(archive_root, 'EGG-INFO'), self.zip_safe()}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", 
line 269, in zip_safe}}
{{ return analyze_egg(self.bdist_dir, self.stubs)}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", 
line 379, in analyze_egg}}
{{ safe = scan_module(egg_dir, base, name, stubs) and safe}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", 
line 416, in scan_module}}
{{ code = marshal.load(f)}}
{{ ValueError: bad marshal data (unknown type code)}}{{During handling of the 
above exception, another exception occurred:}}{{Traceback (most recent call 
last):}}
{{ File "", line 1, in }}
{{ File "/tmp/pip-build-j3d56a0n/pyspark/setup.py", line 224, in }}
{{ 'Programming Language :: Python :: Implementation :: PyPy']}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/__init__.py", line 128, in 
setup}}
{{ _install_setup_requires(attrs)}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/__init__.py", line 123, in 
_install_setup_requires}}
{{ dist.fetch_build_eggs(dist.setup_requires)}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/dist.py", line 461, in 
fetch_build_eggs}}
{{ replace_conflicting=True,}}
{{ File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 866, 
in resolve}}
{{ replace_conflicting=replace_conflicting}}
{{ File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 
1146, in best_match}}
{{ return self.obtain(req, installer)}}
{{ File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 
1158, in obtain}}
{{ return installer(requirement)}}
{{ File "/usr/lib/python3.7/site-packages/setuptools/dist.py", line 528, in 
fetch_build_egg}}
{{ return cmd.easy_install(req)}}
{{ File 

[jira] [Commented] (SPARK-29574) spark with user provided hadoop doesn't work on kubernetes

2020-04-01 Thread Devin Boyer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073145#comment-17073145
 ] 

Devin Boyer commented on SPARK-29574:
-

Will this change be, or can it be, backported to a future 2.4.x release? Doing so 
would mean I wouldn't have to manually patch or fork the entrypoint.sh file 
in my Docker images.

 

It's unclear to me whether this introduces a backwards-incompatible change.
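
For context, what I do today is roughly the following sketch; the paths and 
image name are assumptions based on the standard Spark image layout, not taken 
from this ticket:

{code}
# Rough, hypothetical sketch of the manual patch/fork: keep spark-env.sh outside
# /opt/spark/conf (which the pod-created config volume overwrites) and ship a
# locally patched entrypoint.sh that sources it at startup.
# The base image name is illustrative; use whatever 2.4-based image you build.
FROM spark-py:2.4.4
COPY spark-env.sh /opt/spark-env.sh
# entrypoint.sh here is a local copy of the stock /opt/entrypoint.sh, edited to
# source /opt/spark-env.sh near the top, then copied over the original.
COPY entrypoint.sh /opt/entrypoint.sh
RUN chmod +x /opt/spark-env.sh /opt/entrypoint.sh
{code}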

> spark with user provided hadoop doesn't work on kubernetes
> --
>
> Key: SPARK-29574
> URL: https://issues.apache.org/jira/browse/SPARK-29574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.4
>Reporter: Michał Wesołowski
>Assignee: Shahin Shakeri
>Priority: Major
> Fix For: 3.0.0
>
>
> When spark-submit is run with an image built from the "hadoop free" Spark 
> distribution and user-provided Hadoop, it fails on Kubernetes (the Hadoop 
> libraries are not on Spark's classpath). 
> I downloaded spark [Pre-built with user-provided Apache 
> Hadoop|https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-without-hadoop.tgz].
>  
> I created a Docker image using 
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh].
>  
>  
> Based on this image (2.4.4-without-hadoop), I created another one with the 
> following Dockerfile:
> {code:java}
> FROM spark-py:2.4.4-without-hadoop
> ENV SPARK_HOME=/opt/spark/
> # This is needed for newer kubernetes versions
> ADD 
> https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.6.1/kubernetes-client-4.6.1.jar
>  $SPARK_HOME/jars
> COPY spark-env.sh /opt/spark/conf/spark-env.sh
> RUN chmod +x /opt/spark/conf/spark-env.sh
> RUN wget -qO- 
> https://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz 
> | tar xz  -C /opt/
> ENV HADOOP_HOME=/opt/hadoop-3.2.1
> ENV PATH=${HADOOP_HOME}/bin:${PATH}
> {code}
> Contents of spark-env.sh:
> {code:java}
> #!/usr/bin/env bash
> export SPARK_DIST_CLASSPATH=$(hadoop 
> classpath):$HADOOP_HOME/share/hadoop/tools/lib/*
> export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
> {code}
> spark-submit run with an image created this way fails, since spark-env.sh is 
> overwritten by the [volume created when the pod 
> starts|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L108]
> As a quick workaround I tried to modify the [entrypoint 
> script|https://github.com/apache/spark/blob/ea8b5df47476fe66b63bd7f7bcd15acfb80bde78/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh]
>  to run spark-env.sh during startup, moving spark-env.sh to a different 
> directory. 
>  The driver starts without issues in this setup; however, even though 
> SPARK_DIST_CLASSPATH is set, the executor is constantly failing:
> {code:java}
> PS 
> C:\Sandbox\projekty\roboticdrive-analytics\components\docker-images\spark-rda>
>  kubectl logs rda-script-1571835692837-exec-12
> ++ id -u
> + myuid=0
> ++ id -g
> + mygid=0
> + set +e
> ++ getent passwd 0
> + uidentry=root:x:0:0:root:/root:/bin/ash
> + set -e
> + '[' -z root:x:0:0:root:/root:/bin/ash ']'
> + source /opt/spark-env.sh
> +++ hadoop classpath
> ++ export 
> 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoo++
>  
> SPARK_DIST_CLASSPATH='/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
> ++ export LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> ++ LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> ++ echo 
> 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
> 

[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog

2019-07-16 Thread Devin Boyer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886352#comment-16886352
 ] 

Devin Boyer commented on SPARK-23443:
-

FWIW, a little while back AWS released its implementation of the Glue Data 
Catalog client for Hive and Spark as an open-source repository. It includes 
instructions for integrating this library into Spark builds, which 
unfortunately currently requires hand-patching Hive.

https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore#building-the-spark-client

> Spark with Glue as external catalog
> ---
>
> Key: SPARK-23443
> URL: https://issues.apache.org/jira/browse/SPARK-23443
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Ameen Tayyebi
>Priority: Major
>
> AWS Glue Catalog is an external Hive metastore backed by a web service. It 
> allows permanent storage of catalog data for BigData use cases.
> To find out more information about AWS Glue, please consult:
>  * AWS Glue - [https://aws.amazon.com/glue/]
>  * Using Glue as a Metastore catalog for Spark - 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html]
> Today, the integration of Glue and Spark is through the Hive layer. Glue 
> implements the IMetaStore interface of Hive and for installations of Spark 
> that contain Hive, Glue can be used as the metastore.
> The feature set that Glue supports does not align 1-1 with the set of 
> features that the latest version of Spark supports. For example, the Glue 
> interface supports more advanced partition pruning than the latest version of 
> Hive embedded in Spark.
> To enable a more natural integration with Spark and to allow leveraging 
> latest features of Glue, without being coupled to Hive, a direct integration 
> through Spark's own Catalog API is proposed. This Jira tracks this work.






[jira] [Updated] (SPARK-26505) Catalog class Function is missing "database" field

2018-12-30 Thread Devin Boyer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devin Boyer updated SPARK-26505:

Description: 
This change fell out of the review of 
[https://github.com/apache/spark/pull/20658,] which is the implementation of 
https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class 
[Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function]
 contains a `database` attribute, while the [Python 
version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32]
 does not.

 

To be consistent, it would likely be best to add the `database` attribute to 
the Python class. This would be a breaking API change, though (as discussed in 
[this PR 
comment|[https://github.com/apache/spark/pull/20658#issuecomment-368561007]]), 
so it would have to be made for Spark 3.0.0, the next major version where 
breaking API changes can occur.

  was:
This change fell out of the review of 
[https://github.com/apache/spark/pull/20658,] which is the implementation of 
https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class 
[Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function]
 contains a `database` attribute, while the [Python 
version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32]
 does not.

 

To be consistent, it would likely be best to add the `database` attribute to 
the Python class. This would be a breaking API change, though (as discussed in 
[this PR 
comment|[https://github.com/apache/spark/pull/20658#issuecomment-368561007]|https://github.com/apache/spark/pull/20658#issuecomment-368561007]),
 so it would have to be made for Spark 3.0.0, the next major version where 
breaking API changes can occur.


> Catalog class Function is missing "database" field
> --
>
> Key: SPARK-26505
> URL: https://issues.apache.org/jira/browse/SPARK-26505
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Devin Boyer
>Priority: Minor
>
> This change fell out of the review of 
> [https://github.com/apache/spark/pull/20658,] which is the implementation of 
> https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class 
> [Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function]
>  contains a `database` attribute, while the [Python 
> version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32]
>  does not.
>  
> To be consistent, it would likely be best to add the `database` attribute to 
> the Python class. This would be a breaking API change, though (as discussed 
> in [this PR 
> comment|[https://github.com/apache/spark/pull/20658#issuecomment-368561007]]),
>  so it would have to be made for Spark 3.0.0, the next major version where 
> breaking API changes can occur.






[jira] [Updated] (SPARK-26505) Catalog class Function is missing "database" field

2018-12-30 Thread Devin Boyer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devin Boyer updated SPARK-26505:

Description: 
This change fell out of the review of 
[https://github.com/apache/spark/pull/20658,] which is the implementation of 
https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class 
[Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function]
 contains a `database` attribute, while the [Python 
version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32]
 does not.

 

To be consistent, it would likely be best to add the `database` attribute to 
the Python class. This would be a breaking API change, though (as discussed in 
[this PR 
comment|[https://github.com/apache/spark/pull/20658#issuecomment-368561007]|https://github.com/apache/spark/pull/20658#issuecomment-368561007]),
 so it would have to be made for Spark 3.0.0, the next major version where 
breaking API changes can occur.

  was:
This change fell out of the review of 
[https://github.com/apache/spark/pull/20658,] which is the implementation of 
https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class 
[Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function]
 contains a `database` attribute, while the [Python 
version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32]
 does not.

 

To be consistent, it would likely be best to add the `database` attribute to 
the Python class. This would be a breaking API change, though (as discussed in 
[this PR 
comment|[https://github.com/apache/spark/pull/20658#issuecomment-368561007]),] 
so it would have to be made for Spark 3.0.0, the next major version where 
breaking API changes can occur.


> Catalog class Function is missing "database" field
> --
>
> Key: SPARK-26505
> URL: https://issues.apache.org/jira/browse/SPARK-26505
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Devin Boyer
>Priority: Minor
>
> This change fell out of the review of 
> [https://github.com/apache/spark/pull/20658,] which is the implementation of 
> https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class 
> [Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function]
>  contains a `database` attribute, while the [Python 
> version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32]
>  does not.
>  
> To be consistent, it would likely be best to add the `database` attribute to 
> the Python class. This would be a breaking API change, though (as discussed 
> in [this PR 
> comment|[https://github.com/apache/spark/pull/20658#issuecomment-368561007]|https://github.com/apache/spark/pull/20658#issuecomment-368561007]),
>  so it would have to be made for Spark 3.0.0, the next major version where 
> breaking API changes can occur.






[jira] [Created] (SPARK-26505) Catalog class Function is missing "database" field

2018-12-30 Thread Devin Boyer (JIRA)
Devin Boyer created SPARK-26505:
---

 Summary: Catalog class Function is missing "database" field
 Key: SPARK-26505
 URL: https://issues.apache.org/jira/browse/SPARK-26505
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Devin Boyer


This change fell out of the review of 
[https://github.com/apache/spark/pull/20658,] which is the implementation of 
https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class 
[Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function]
 contains a `database` attribute, while the [Python 
version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32]
 does not.

 

To be consistent, it would likely be best to add the `database` attribute to 
the Python class. This would be a breaking API change, though (as discussed in 
[this PR 
comment|[https://github.com/apache/spark/pull/20658#issuecomment-368561007]),] 
so it would have to be made for Spark 3.0.0, the next major version where 
breaking API changes can occur.






[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog

2018-03-01 Thread Devin Boyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382085#comment-16382085
 ] 

Devin Boyer commented on SPARK-23443:
-

I would also be interested in helping if needed, or certainly testing this!

> Spark with Glue as external catalog
> ---
>
> Key: SPARK-23443
> URL: https://issues.apache.org/jira/browse/SPARK-23443
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ameen Tayyebi
>Priority: Major
>
> AWS Glue Catalog is an external Hive metastore backed by a web service. It 
> allows permanent storage of catalog data for BigData use cases.
> To find out more information about AWS Glue, please consult:
>  * AWS Glue - [https://aws.amazon.com/glue/]
>  * Using Glue as a Metastore catalog for Spark - 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html]
> Today, the integration of Glue and Spark is through the Hive layer. Glue 
> implements the IMetaStore interface of Hive and for installations of Spark 
> that contain Hive, Glue can be used as the metastore.
> The feature set that Glue supports does not align 1-1 with the set of 
> features that the latest version of Spark supports. For example, the Glue 
> interface supports more advanced partition pruning than the latest version of 
> Hive embedded in Spark.
> To enable a more natural integration with Spark and to allow leveraging 
> latest features of Glue, without being coupled to Hive, a direct integration 
> through Spark's own Catalog API is proposed. This Jira tracks this work.






[jira] [Updated] (SPARK-23488) Add other missing Catalog methods to Python API

2018-02-21 Thread Devin Boyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devin Boyer updated SPARK-23488:

Target Version/s:   (was: 2.2.2, 2.3.1)

> Add other missing Catalog methods to Python API
> ---
>
> Key: SPARK-23488
> URL: https://issues.apache.org/jira/browse/SPARK-23488
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.1
>Reporter: Devin Boyer
>Priority: Minor
>
> I noticed the Python Catalog API was missing some methods that are present in 
> the Scala API. These would be handy to have in the Python API as well, 
> especially the databaseExists()/tableExists() methods.
> I have a PR ready to add these, which I can open. All methods added:
>  * databaseExists()
>  * tableExists()
>  * functionExists()
>  * getDatabase()
>  * getTable()
>  * getFunction()






[jira] [Created] (SPARK-23488) Add other missing Catalog methods to Python API

2018-02-21 Thread Devin Boyer (JIRA)
Devin Boyer created SPARK-23488:
---

 Summary: Add other missing Catalog methods to Python API
 Key: SPARK-23488
 URL: https://issues.apache.org/jira/browse/SPARK-23488
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.2.1
Reporter: Devin Boyer


I noticed the Python Catalog API was missing some methods that are present in 
the Scala API. These would be handy to have in the Python API as well, 
especially the databaseExists()/tableExists() methods.

I have a PR ready to add these, which I can open. All methods added:
 * databaseExists()
 * tableExists()
 * functionExists()
 * getDatabase()
 * getTable()
 * getFunction()


