[jira] [Commented] (SPARK-6764) Add wheel package support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603785#comment-14603785 ]

Punya Biswal commented on SPARK-6764:
-

Some packages need to be installed on workers; it's not enough just to put archived versions on the PYTHONPATH. Is there a reason to avoid using pip on the workers?

> Add wheel package support for PySpark
> -
>
> Key: SPARK-6764
> URL: https://issues.apache.org/jira/browse/SPARK-6764
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, PySpark
> Reporter: Takao Magoori
> Priority: Minor
> Labels: newbie
>
> We can do _spark-submit_ with one or more Python packages (.egg, .zip and
> .jar) via the *--py-files* option.
> h4. zip packaging
> Spark puts a zip file in its working directory and adds the absolute path to
> Python's sys.path. When the user program imports it,
> [zipimport|https://docs.python.org/2.7/library/zipimport.html] is
> automatically invoked under the hood. That is, data files and dynamic
> modules (.pyd, .so) cannot be used, since zipimport supports only .py, .pyc
> and .pyo.
> h4. egg packaging
> Spark puts an egg file in its working directory and adds the absolute path to
> Python's sys.path. Unlike zipimport, egg can handle data files and dynamic
> modules as long as the author of the package uses the [pkg_resources
> API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations]
> properly. But many Python modules do not use the pkg_resources API, which
> causes "ImportError" or "No such file" errors. Moreover, creating eggs of
> dependencies and their further dependencies is a troublesome job.
> h4. wheel packaging
> Supporting the new standard Python package format
> "[wheel|https://wheel.readthedocs.org/en/latest/]" would be nice. With wheel,
> we can do spark-submit with complex dependencies simply as follows.
> 1. Write a requirements.txt file.
> {noformat}
> SQLAlchemy
> MySQL-python
> requests
> simplejson>=3.6.0,<=3.6.5
> pydoop
> {noformat}
> 2. Do wheel packaging with a single command. All dependencies are wheel-ed.
> {noformat}
> $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement requirements.txt
> {noformat}
> 3. Do spark-submit.
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find
> /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver.py
> {noformat}
> If your pyspark driver is a package consisting of many modules:
> 1. Write a setup.py for your pyspark driver package.
> {noformat}
> from setuptools import (
>     find_packages,
>     setup,
> )
> setup(
>     name='yourpkg',
>     version='0.0.1',
>     packages=find_packages(),
>     install_requires=[
>         'SQLAlchemy',
>         'MySQL-python',
>         'requests',
>         'simplejson>=3.6.0,<=3.6.5',
>         'pydoop',
>     ],
> )
> {noformat}
> 2. Do wheel packaging with a single command. Your driver package and all
> dependencies are wheel-ed.
> {noformat}
> your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/.
> {noformat}
> 3. Do spark-submit.
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find
> /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g')
> your_driver_bootstrap.py
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
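The comma-joined {{--py-files}} value that the find/sed pipeline above produces can also be assembled in Python. This is an illustrative sketch, not part of Spark; the helper name and the wheelhouse path are assumptions:

```python
# Sketch: build the comma-separated --py-files argument for spark-submit
# from a wheelhouse directory, equivalent to the find/sed pipeline above.
import glob
import os


def py_files_arg(wheelhouse):
    """Join every .whl file under `wheelhouse` into one comma-separated
    string, suitable for passing to spark-submit's --py-files option."""
    wheels = sorted(glob.glob(os.path.join(wheelhouse, "*.whl")))
    return ",".join(wheels)
```

One could then call, for example, `py_files_arg("/tmp/wheelhouse")` inside a launcher script and interpolate the result into the spark-submit command line; unlike the sed variant, this avoids the trailing separator that `-print0 | sed` leaves behind.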
[jira] [Created] (SPARK-8397) Allow custom configuration for TestHive
Punya Biswal created SPARK-8397:
---

Summary: Allow custom configuration for TestHive
Key: SPARK-8397
URL: https://issues.apache.org/jira/browse/SPARK-8397
Project: Spark
Issue Type: Improvement
Affects Versions: 1.4.0
Reporter: Punya Biswal
Priority: Minor

We encourage people to use {{TestHive}} in unit tests, because it's impossible to create more than one {{HiveContext}} within one process. The current implementation locks people into using a {{local[2]}} {{SparkContext}} underlying their {{HiveContext}}. We should make it possible to override this using a system property, so that people can test against {{local-cluster}} or remote Spark clusters to make their tests more realistic.
[jira] [Updated] (SPARK-7515) Update documentation for PySpark on YARN with cluster mode
[ https://issues.apache.org/jira/browse/SPARK-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punya Biswal updated SPARK-7515:
Fix Version/s: 1.4.1

> Update documentation for PySpark on YARN with cluster mode
> --
>
> Key: SPARK-7515
> URL: https://issues.apache.org/jira/browse/SPARK-7515
> Project: Spark
> Issue Type: Bug
> Components: Documentation
> Affects Versions: 1.4.0
> Reporter: Kousuke Saruta
> Assignee: Kousuke Saruta
> Priority: Minor
> Fix For: 1.4.1, 1.5.0
>
> Now that PySpark on YARN with cluster mode is supported, let's update the documentation.
[jira] [Updated] (SPARK-7899) PySpark sql/tests breaks pylint validation
[ https://issues.apache.org/jira/browse/SPARK-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punya Biswal updated SPARK-7899:
Description:

The pyspark.sql.types module is dynamically renamed to {{types}} from {{_types}}, which breaks pylint validation.

From [~justin.uang] below:

In commit 04e44b37, the migration to Python 3, {{pyspark/sql/types.py}} was renamed to {{pyspark/sql/\_types.py}}, and then some magic in {{pyspark/sql/\_\_init\_\_.py}} dynamically renamed the module back to {{types}}. I imagine that this is some naming conflict with Python 3, but what was the error that showed up?

The reason I'm asking is that this breaks pylint, since pylint can no longer statically find the module. I also tried importing the package in an init-hook so that {{\_\_init\_\_}} would be run, but that isn't what the discovery mechanism uses; I imagine it's probably just crawling the directory structure.

One way to work around this would be something akin to this (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports), where I would have to create a fake module, but I would probably be missing a ton of pylint features for users of that module, and it's pretty hacky.

> PySpark sql/tests breaks pylint validation
> --
>
> Key: SPARK-7899
> URL: https://issues.apache.org/jira/browse/SPARK-7899
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Tests
> Affects Versions: 1.4.0
> Reporter: Michael Nazario
>
> The pyspark.sql.types module is dynamically renamed to {{types}} from
> {{_types}}, which breaks pylint validation.
> From [~justin.uang] below:
> In commit 04e44b37, the migration to Python 3, {{pyspark/sql/types.py}} was
> renamed to {{pyspark/sql/\_types.py}}, and then some magic in
> {{pyspark/sql/\_\_init\_\_.py}} dynamically renamed the module back to
> {{types}}. I imagine that this is some naming conflict with Python 3, but
> what was the error that showed up?
> The reason I'm asking is that this breaks pylint, since pylint can no longer
> statically find the module. I also tried importing the package in an
> init-hook so that {{\_\_init\_\_}} would be run, but that isn't what the
> discovery mechanism uses; I imagine it's probably just crawling the
> directory structure.
> One way to work around this would be something akin to this
> (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
> where I would have to create a fake module, but I would probably be missing
> a ton of pylint features for users of that module, and it's pretty hacky.
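The dynamic-rename pattern described above can be sketched in plain Python. The module names below are illustrative, not PySpark's actual layout: a module object registered in {{sys.modules}} under a public name is importable at runtime, but a static tool that crawls the filesystem never sees a corresponding `.py` file, which is exactly why pylint reports an import error here.

```python
# Minimal sketch of the dynamic-rename magic described above: code that
# lives under a private module name is re-exposed under a public name at
# import time by registering a module object in sys.modules directly.
import sys
import types as _stdlib_types


def install_renamed_module(private_name, public_name, attrs):
    """Create a module object, populate it with `attrs`, and register it
    in sys.modules under both the private and the public name."""
    mod = _stdlib_types.ModuleType(public_name)
    for key, value in attrs.items():
        setattr(mod, key, value)
    sys.modules[private_name] = mod
    sys.modules[public_name] = mod
    return mod
```

After calling, say, `install_renamed_module("demo._types", "demo.types", {...})`, a runtime lookup of `sys.modules["demo.types"]` succeeds even though no `demo/types.py` exists on disk, so any analyzer that resolves imports from the directory structure alone will miss it.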
[jira] [Commented] (SPARK-6907) Create an isolated classloader for the Hive Client.
[ https://issues.apache.org/jira/browse/SPARK-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522573#comment-14522573 ]

Punya Biswal commented on SPARK-6907:
-

Makes sense, thanks for clarifying. I guess a weaker version of my question is: can we write this in Java (rather than Scala) to set it up for future separation?

> Create an isolated classloader for the Hive Client.
> ---
>
> Key: SPARK-6907
> URL: https://issues.apache.org/jira/browse/SPARK-6907
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
> Assignee: Michael Armbrust
[jira] [Commented] (SPARK-6907) Create an isolated classloader for the Hive Client.
[ https://issues.apache.org/jira/browse/SPARK-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515066#comment-14515066 ]

Punya Biswal commented on SPARK-6907:
-

Would it make sense to do this as a separate project (repository)? It seems like a generic problem that's applicable more broadly than just Spark.

> Create an isolated classloader for the Hive Client.
> ---
>
> Key: SPARK-6907
> URL: https://issues.apache.org/jira/browse/SPARK-6907
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
[jira] [Commented] (SPARK-7175) Upgrade Hive to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514861#comment-14514861 ]

Punya Biswal commented on SPARK-7175:
-

[~pwendell] and [~vanzin] explained to me that this is quite hard to do at present, and pointed me to SPARK-6906. I'm leaving this ticket open for now, to revisit once the necessary architectural improvements have been made.

> Upgrade Hive to 1.1.0
> -
>
> Key: SPARK-7175
> URL: https://issues.apache.org/jira/browse/SPARK-7175
> Project: Spark
> Issue Type: Dependency upgrade
> Components: SQL
> Affects Versions: 1.3.1
> Reporter: Punya Biswal
>
> Spark SQL currently supports Hive 0.13 (June 2014), but the latest version of
> Hive is 1.1.0 (March 2015). Among other improvements, it includes new UDFs
> for date manipulation that I'd like to avoid rebuilding.
[jira] [Created] (SPARK-7175) Upgrade Hive to 1.1.0
Punya Biswal created SPARK-7175:
---

Summary: Upgrade Hive to 1.1.0
Key: SPARK-7175
URL: https://issues.apache.org/jira/browse/SPARK-7175
Project: Spark
Issue Type: Dependency upgrade
Components: SQL
Affects Versions: 1.3.1
Reporter: Punya Biswal

Spark SQL currently supports Hive 0.13 (June 2014), but the latest version of Hive is 1.1.0 (March 2015). Among other improvements, it includes new UDFs for date manipulation that I'd like to avoid rebuilding.
[jira] [Updated] (SPARK-6996) DataFrame should support map types when creating DFs from JavaBeans.
[ https://issues.apache.org/jira/browse/SPARK-6996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punya Biswal updated SPARK-6996:
Description: If we have a JavaBean class with fields of map types, SQL throws an exception in {{createDataFrame}} because those types are not matched in {{SQLContext#inferDataType}}. Similar to SPARK-6475.
was: If we have a JavaBean class with fields of collection or map types, SQL throws an exception in {{createDataFrame}} because those types are not matched in {{SQLContext#inferDataType}}. Similar to SPARK-6475.
Summary: DataFrame should support map types when creating DFs from JavaBeans. (was: DataFrame should support collection types when creating DFs from JavaBeans.)

> DataFrame should support map types when creating DFs from JavaBeans.
> -
>
> Key: SPARK-6996
> URL: https://issues.apache.org/jira/browse/SPARK-6996
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Punya Biswal
>
> If we have a JavaBean class with fields of map types, SQL throws an exception
> in {{createDataFrame}} because those types are not matched in
> {{SQLContext#inferDataType}}.
> Similar to SPARK-6475.
[jira] [Created] (SPARK-6996) DataFrame should support collection types when creating DFs from JavaBeans.
Punya Biswal created SPARK-6996:
---

Summary: DataFrame should support collection types when creating DFs from JavaBeans.
Key: SPARK-6996
URL: https://issues.apache.org/jira/browse/SPARK-6996
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Punya Biswal

If we have a JavaBean class with fields of collection or map types, SQL throws an exception in {{createDataFrame}} because those types are not matched in {{SQLContext#inferDataType}}. Similar to SPARK-6475.
[jira] [Commented] (SPARK-6475) DataFrame should support array types when creating DFs from JavaBeans.
[ https://issues.apache.org/jira/browse/SPARK-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501609#comment-14501609 ]

Punya Biswal commented on SPARK-6475:
-

Would it be reasonable to recognize Java iterables and maps as well? I'd be happy to work on a PR if that seems like a good idea.

> DataFrame should support array types when creating DFs from JavaBeans.
> --
>
> Key: SPARK-6475
> URL: https://issues.apache.org/jira/browse/SPARK-6475
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Fix For: 1.4.0
>
> If we have a JavaBean class with array fields, SQL throws an exception in
> `createDataFrame` because arrays are not matched in `getSchema` from a
> JavaBean class.
[jira] [Commented] (SPARK-6952) spark-daemon.sh PID reuse check fails on long classpath
[ https://issues.apache.org/jira/browse/SPARK-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499695#comment-14499695 ]

Punya Biswal commented on SPARK-6952:
-

Would it be reasonable to backport this to branch-1.3, or is it too late for that?

> spark-daemon.sh PID reuse check fails on long classpath
> ---
>
> Key: SPARK-6952
> URL: https://issues.apache.org/jira/browse/SPARK-6952
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.3.0
> Reporter: Punya Biswal
> Assignee: Punya Biswal
> Priority: Minor
> Fix For: 1.4.0
>
> {{sbin/spark-daemon.sh}} uses {{ps -p "$TARGET_PID" -o args=}} to figure out
> whether the process running with the expected PID is actually a Spark daemon.
> When running with a large classpath, the output of {{ps}} gets truncated and
> the check fails spuriously.
> I think we should weaken the check to see if it's a java command (which is
> something we do in other parts of the script) rather than looking for the
> specific main class name. This means that SPARK-4832 might happen under a
> slightly broader range of circumstances (a *java* program happened to reuse
> the same PID), but it seems worthwhile compared to failing consistently with
> a large classpath.
[jira] [Commented] (SPARK-6940) PySpark ML.Tuning Wrappers are missing
[ https://issues.apache.org/jira/browse/SPARK-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498930#comment-14498930 ]

Punya Biswal commented on SPARK-6940:
-

Sorry about the duplicate bug - [~omede] and I were talking about the issue offline and we managed to step on each other's toes.

> PySpark ML.Tuning Wrappers are missing
> --
>
> Key: SPARK-6940
> URL: https://issues.apache.org/jira/browse/SPARK-6940
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 1.3.0
> Reporter: Omede Firouz
>
> PySpark doesn't currently have wrappers for any of the ML.Tuning classes:
> CrossValidator, CrossValidatorModel, ParamGridBuilder
[jira] [Updated] (SPARK-6952) spark-daemon.sh PID reuse check fails on long classpath
[ https://issues.apache.org/jira/browse/SPARK-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punya Biswal updated SPARK-6952:
Summary: spark-daemon.sh PID reuse check fails on long classpath (was: spark-daemon.sh fails on long classpath)

> spark-daemon.sh PID reuse check fails on long classpath
> ---
>
> Key: SPARK-6952
> URL: https://issues.apache.org/jira/browse/SPARK-6952
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.3.0
> Reporter: Punya Biswal
>
> {{sbin/spark-daemon.sh}} uses {{ps -p "$TARGET_PID" -o args=}} to figure out
> whether the process running with the expected PID is actually a Spark daemon.
> When running with a large classpath, the output of {{ps}} gets truncated and
> the check fails spuriously.
> I think we should weaken the check to see if it's a java command (which is
> something we do in other parts of the script) rather than looking for the
> specific main class name. This means that SPARK-4832 might happen under a
> slightly broader range of circumstances (a *java* program happened to reuse
> the same PID), but it seems worthwhile compared to failing consistently with
> a large classpath.
[jira] [Created] (SPARK-6952) spark-daemon.sh fails on long classpath
Punya Biswal created SPARK-6952:
---

Summary: spark-daemon.sh fails on long classpath
Key: SPARK-6952
URL: https://issues.apache.org/jira/browse/SPARK-6952
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.3.0
Reporter: Punya Biswal

{{sbin/spark-daemon.sh}} uses {{ps -p "$TARGET_PID" -o args=}} to figure out whether the process running with the expected PID is actually a Spark daemon. When running with a large classpath, the output of {{ps}} gets truncated and the check fails spuriously.

I think we should weaken the check to see if it's a java command (which is something we do in other parts of the script) rather than looking for the specific main class name. This means that SPARK-4832 might happen under a slightly broader range of circumstances (a *java* program happened to reuse the same PID), but it seems worthwhile compared to failing consistently with a large classpath.
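The weakened check described above can be sketched as follows, assuming a POSIX {{ps}}: instead of searching the (possibly truncated) argument list for the Spark main class, only verify that the process under the recorded PID is a java command, by asking {{ps}} for the executable name ({{comm}}), which is unaffected by classpath length. The helper name is an illustrative assumption, not what spark-daemon.sh itself does:

```python
# Sketch of the weakened PID liveness check: compare the executable name
# (ps -o comm=) rather than the full argument list (ps -o args=), since
# comm is never truncated by a long classpath.
import subprocess


def pid_runs_java(pid):
    """Return True if a process with this PID exists and its executable
    name is java; False if the PID is free or runs something else."""
    try:
        out = subprocess.check_output(
            ["ps", "-p", str(pid), "-o", "comm="], text=True
        )
    except subprocess.CalledProcessError:
        return False  # ps exits non-zero when no such process exists
    return out.strip().endswith("java")
```

As the description notes, this trades a small amount of precision (any java process that reuses the PID now passes the check) for a check that no longer fails spuriously on long classpaths.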
[jira] [Created] (SPARK-6947) Make ml.tuning accessible from Python API
Punya Biswal created SPARK-6947:
---

Summary: Make ml.tuning accessible from Python API
Key: SPARK-6947
URL: https://issues.apache.org/jira/browse/SPARK-6947
Project: Spark
Issue Type: Improvement
Components: ML, PySpark
Affects Versions: 1.3.0
Reporter: Punya Biswal

{{CrossValidator}} and {{ParamGridBuilder}} should be available for use in PySpark-based ML pipelines.
[jira] [Created] (SPARK-6731) Upgrade Apache commons-math3 to 3.4.1
Punya Biswal created SPARK-6731:
---

Summary: Upgrade Apache commons-math3 to 3.4.1
Key: SPARK-6731
URL: https://issues.apache.org/jira/browse/SPARK-6731
Project: Spark
Issue Type: Dependency upgrade
Components: Spark Core
Affects Versions: 1.3.0
Reporter: Punya Biswal

Spark depends on Apache commons-math3 version 3.1.1, which is 2 years old. The current version (3.4.1) includes approximate percentile statistics (among other things).