[jira] [Commented] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on
[ https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365658#comment-15365658 ] zenglinxi commented on SPARK-16408: --- As shown in https://issues.apache.org/jira/browse/SPARK-4687, we have two functions in SparkContext.scala:
{code}
def addFile(path: String): Unit = { addFile(path, false) }
def addFile(path: String, recursive: Boolean): Unit = { ... }
{code}
But there is no configuration to turn recursive on or off, and Spark always calls addFile(path) by default, which means the value of recursive is false; this is why we get the exception.
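For completeness, the two-argument overload is public on SparkContext, so application code that knows it is adding a directory can opt in explicitly; a minimal Scala sketch (the paths and app name are illustrative):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("add-file-example"))

// A single file works with the one-argument overload (recursive defaults to false).
sc.addFile("hdfs://xxx/user/test/part-00000")

// A directory must be added with recursive = true; otherwise SparkContext throws
// "is a directory and recursive is not turned on", as reported above.
sc.addFile("hdfs://xxx/user/test", recursive = true)
{code}
The SQL {{ADD FILE}} command has no way to reach the second overload, which is the gap this ticket describes.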
[jira] [Commented] (SPARK-14743) Improve delegation token handling in secure clusters
[ https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365656#comment-15365656 ] Apache Spark commented on SPARK-14743: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/14065 > Improve delegation token handling in secure clusters > > > Key: SPARK-14743 > URL: https://issues.apache.org/jira/browse/SPARK-14743 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > > In a way, I'd consider this a parent bug of SPARK-7252. > Spark's current support for delegation tokens is a little all over the place: > - for HDFS, there's support for re-creating tokens if a principal and keytab > are provided > - for HBase and Hive, Spark will fetch delegation tokens so that apps can > work in cluster mode, but will not re-create them, so apps that need those > will stop working after 7 days > - for anything else, Spark doesn't do anything. Lots of other services use > delegation tokens, and supporting them as data sources in Spark becomes more > complicated because of that. e.g., Kafka will (hopefully) soon support them. > It would be nice if Spark had consistent support for handling delegation > tokens regardless of who needs them. I'd list these as the requirements: > - Spark to provide a generic interface for fetching delegation tokens. This > would allow Spark's delegation token support to be extended using some plugin > architecture (e.g. Java services), meaning Spark itself doesn't need to > support every possible service out there. > This would be used to fetch tokens when launching apps in cluster mode, and > when a principal and a keytab are provided to Spark. > - A way to manually update delegation tokens in Spark. For example, a new > SparkContext API, or some configuration that tells Spark to monitor a file > for changes and load tokens from said file. > This would allow external applications to manage tokens outside of Spark and > be able to update a running Spark application (think, for example, a job > sever like Oozie, or something like Hive-on-Spark which manages Spark apps > running remotely). > - A way to notify running code that new delegation tokens have been loaded. > This may not be strictly necessary; it might be possible for code to detect > that, e.g., by peeking into the UserGroupInformation structure. But an event > sent to the listener bus would allow applications to react when new tokens > are available (e.g., the Hive backend could re-create connections to the > metastore server using the new tokens). > Also, cc'ing [~busbey] and [~steve_l] since you've talked about this in the > mailing list recently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14743) Improve delegation token handling in secure clusters
[ https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14743: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-14743) Improve delegation token handling in secure clusters
[ https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14743: Assignee: Apache Spark
[jira] [Created] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on
zenglinxi created SPARK-16408: - Summary: SparkSQL Added file get Exception: is a directory and recursive is not turned on Key: SPARK-16408 URL: https://issues.apache.org/jira/browse/SPARK-16408 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.6.2 Reporter: zenglinxi

When using Spark SQL to execute SQL like:
{code}
add file hdfs://xxx/user/test;
{code}
if the HDFS path (hdfs://xxx/user/test) is a directory, we get an exception like:
{noformat}
org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory and recursive is not turned on.
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
{noformat}
[jira] [Resolved] (SPARK-16398) Make cancelJob and cancelStage API public
[ https://issues.apache.org/jira/browse/SPARK-16398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16398. - Resolution: Fixed Assignee: Mitesh Patel Fix Version/s: 2.1.0 > Make cancelJob and cancelStage API public > - > > Key: SPARK-16398 > URL: https://issues.apache.org/jira/browse/SPARK-16398 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Mitesh >Assignee: Mitesh Patel >Priority: Trivial > Fix For: 2.1.0 > > > Make the SparkContext {{cancelJob}} and {{cancelStage}} APIs public. This > allows applications to use {{SparkListener}} to do their own management of > jobs via events, but without using the REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
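As an illustration of the use case described above, a small sketch of a listener-driven cancellation policy, assuming the cancel APIs are public as this ticket requests (the class name, timeout, and wiring are illustrative):
{code}
import scala.collection.JavaConverters._
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Tracks when each job started so the driver can cancel jobs that run too long.
class JobWatchdog(sc: SparkContext, maxMillis: Long) extends SparkListener {
  private val started = new ConcurrentHashMap[Int, Long]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    started.put(jobStart.jobId, System.currentTimeMillis())

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    started.remove(jobEnd.jobId)

  // Called periodically from application code (e.g. a scheduled executor).
  def cancelStragglers(): Unit = {
    val now = System.currentTimeMillis()
    started.asScala.foreach { case (jobId, startTime) =>
      if (now - startTime > maxMillis) sc.cancelJob(jobId)
    }
  }
}

// val watchdog = new JobWatchdog(sc, maxMillis = 10 * 60 * 1000)
// sc.addSparkListener(watchdog)
{code}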
[jira] [Commented] (SPARK-16021) Zero out freed memory in test to help catch correctness bugs
[ https://issues.apache.org/jira/browse/SPARK-16021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365641#comment-15365641 ] Apache Spark commented on SPARK-16021: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/14084 > Zero out freed memory in test to help catch correctness bugs > > > Key: SPARK-16021 > URL: https://issues.apache.org/jira/browse/SPARK-16021 > Project: Spark > Issue Type: Improvement >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.1.0 > > > In both on-heap and off-heap modes, it would be helpful to immediately zero > out (or otherwise fill with a sentinel value) memory when an object is > deallocated. > Currently, in on-heap mode, freed memory can be accessed without visible > error if no other consumer has written to the same space. Similarly, off-heap > memory can be accessed without fault if the allocation library has not > released the pages back to the OS. Zeroing out freed memory would make these > errors immediately visible as a correctness problem. > Since this would add some performance overhead, it would make sense to > conf-flag and enable only in test. > cc [~sameerag] [~hvanhovell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
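A toy sketch of the idea for the on-heap case (not Spark's actual allocator code; the system property name and sentinel byte are made up): when a test-only flag is set, freed blocks are overwritten so stale readers see an obviously wrong pattern instead of old, still-valid data.
{code}
import java.util.Arrays

object DebugAllocator {
  // Hypothetical test-only flag and sentinel pattern.
  val fillFreedMemory: Boolean = sys.props.get("example.test.fillFreedMemory").contains("true")
  val FreedSentinel: Byte = 0x5a

  def allocate(size: Int): Array[Byte] = new Array[Byte](size)

  def free(block: Array[Byte]): Unit = {
    if (fillFreedMemory) {
      // Overwrite the block so any later read is immediately recognizable as use-after-free.
      Arrays.fill(block, FreedSentinel)
    }
    // On-heap: the array is then simply dropped and left to the GC.
  }
}
{code}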
[jira] [Updated] (SPARK-14743) Improve delegation token handling in secure clusters
[ https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-14743: Component/s: YARN
[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365599#comment-15365599 ] Semet commented on SPARK-16367: --- Wheels are tagged by OS, architecture and Python version, and that seems to be enough for a package compiled on one machine to work on another, if compatible. pip install is responsible for finding the right wheel for a wanted module. For example, on my machine, when I do "pip install numpy" there is no compilation; pip directly takes the binary wheel from PyPI, so installation is fast. But if you have an older version of Python, for instance 2.6, since there are no wheels for 2.6, pip install will compile some C modules and store the wheel in ~/.cache/pip, so future installations will not require compilation. You can even take this wheel and add it to your pypi-local repository on Artifactory so the package will be available on your PyPI mirror (see the doc about Artifactory support for PyPI).
> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
> Issue Type: New Feature
> Components: Deploy, PySpark
> Affects Versions: 1.6.1, 1.6.2, 2.0.0
> Reporter: Semet
> Labels: newbie, python, python-wheel, wheelhouse
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> *Rationale*
> It is recommended, in order to deploy packages written in Scala, to build big fat jar files. This allows all dependencies to live in one package, so the only "cost" is the copy time to deploy this file on every Spark node.
> On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with IT to deploy the packages on the virtualenv of each node.
> *Previous approaches*
> I based the current proposal on the two following issues related to this point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support wheel installation and virtualenv creation.
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark*
> In Python, the packaging standard is now the "wheel" file format, which goes further than the good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture or environment. For example, look at https://pypi.python.org/pypi/numpy for all the different wheel versions available.
> The {{pip}} tool knows how to select the right wheel file matching the current system, and how to install the package at light speed (without compilation). Said otherwise, a package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from a wheel file.
> {{pypi.python.org}} already provides wheels for major Python versions. If the wheel is not available, pip will compile it from source anyway. Mirroring of PyPI is possible through projects such as http://doc.devpi.net/latest/ (untested) or the PyPI mirror support in Artifactory (tested personally).
> {{pip}} also provides the ability to easily generate all the wheels of all the packages used for a given project inside a "virtualenv". This is called a "wheelhouse". You can even skip this compilation step and retrieve the wheels directly from pypi.python.org.
> *Use Case 1: no internet connectivity*
> Here is my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a PyPI mirror. In this case the simplest way to deploy a project with several dependencies is to build and then ship the complete "wheelhouse":
> - you are writing a PySpark script that grows in size and dependencies. Deploying it on Spark, for example, requires building numpy or Theano and other dependencies
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your script into a standard Python package:
> -- write a {{requirements.txt}}. I recommend specifying all package versions. You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt
> {code}
> astroid==1.4.6 # via pylint
> autopep8==1.2.4
> click==6.6 # via pip-tools
> colorama==0.3.7 # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0 # via spark-testing-base
> first==2.0.1 # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2 # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0 # via autopep8
> pip-tools==1.6.5
> py==1.4.31 # via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via
[jira] [Commented] (SPARK-16381) Update SQL examples and programming guide for R language binding
[ https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365597#comment-15365597 ] Xin Ren commented on SPARK-16381: - Hi Cheng, do you mind telling me where to find the RC date, or the release schedule? I tried here: https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:versions-panel, but didn't find much information there.
> Update SQL examples and programming guide for R language binding
>
> Key: SPARK-16381
> URL: https://issues.apache.org/jira/browse/SPARK-16381
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, Examples
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
> Assignee: Xin Ren
>
> Please follow guidelines listed in this SPARK-16303 [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575].
[jira] [Commented] (SPARK-16380) Update SQL examples and programming guide for Python language binding
[ https://issues.apache.org/jira/browse/SPARK-16380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365591#comment-15365591 ] Cheng Lian commented on SPARK-16380: [~wm624] Considering 2.0.0 RC2 has already been cut, it's possible that we can't have this in 2.0.0. However, we'd like to have it in 2.0.0 if there's another RC. > Update SQL examples and programming guide for Python language binding > - > > Key: SPARK-16380 > URL: https://issues.apache.org/jira/browse/SPARK-16380 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Examples >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Miao Wang > > Please follow guidelines listed in this SPARK-16303 > [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16303) Update SQL examples and programming guide for Scala and Java language bindings
[ https://issues.apache.org/jira/browse/SPARK-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365590#comment-15365590 ] Cheng Lian commented on SPARK-16303: [~aokolnychyi] Considering 2.0.0 RC2 has already been cut, it's possible that we can't have this in 2.0.0. However, we'd like to have it in 2.0.0 if there's another RC. > Update SQL examples and programming guide for Scala and Java language bindings > -- > > Key: SPARK-16303 > URL: https://issues.apache.org/jira/browse/SPARK-16303 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Examples >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Anton Okolnychyi > > We need to update SQL examples code under the {{examples}} sub-project, and > then replace hard-coded snippets in the SQL programming guide with snippets > automatically extracted from actual source files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16381) Update SQL examples and programming guide for R language binding
[ https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365583#comment-15365583 ] Cheng Lian commented on SPARK-16381: Thanks for volunteering! I've assigned this ticket to you. Considering 2.0.0 RC2 has already been cut, it's possible that we can't have this in 2.0.0. However, we'd like to have it if there's another RC.
[jira] [Updated] (SPARK-16381) Update SQL examples and programming guide for R language binding
[ https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16381: --- Assignee: Xin Ren
[jira] [Commented] (SPARK-16380) Update SQL examples and programming guide for Python language binding
[ https://issues.apache.org/jira/browse/SPARK-16380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365579#comment-15365579 ] Cheng Lian commented on SPARK-16380: I just noticed that I put "Scala" into the JIRA ticket title by mistake. Please note that the scope of this ticket only covers Python examples.
[jira] [Updated] (SPARK-16380) Update SQL examples and programming guide for Python language binding
[ https://issues.apache.org/jira/browse/SPARK-16380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16380: --- Summary: Update SQL examples and programming guide for Python language binding (was: Update SQL examples and programming guide for Scala Python language binding)
[jira] [Resolved] (SPARK-16374) Remove Alias from MetastoreRelation and SimpleCatalogRelation
[ https://issues.apache.org/jira/browse/SPARK-16374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16374. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14053 [https://github.com/apache/spark/pull/14053] > Remove Alias from MetastoreRelation and SimpleCatalogRelation > - > > Key: SPARK-16374 > URL: https://issues.apache.org/jira/browse/SPARK-16374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > Different from the other leaf nodes, `MetastoreRelation` and > `SimpleCatalogRelation` have a pre-defined `alias`, which is used to change > the qualifier of the node. However, based on the existing alias handling, > alias should be put in `SubqueryAlias`. > This PR is to separate alias handling from `MetastoreRelation` and > `SimpleCatalogRelation` to make it consistent with the other nodes. > For example, below is an example query for `MetastoreRelation`, which is > converted to `LogicalRelation`: > {noformat} > SELECT tmp.a + 1 FROM test_parquet_ctas tmp WHERE tmp.a > 2 > {noformat} > Before changes, the analyzed plan is > {noformat} > == Analyzed Logical Plan == > (a + 1): int > Project [(a#951 + 1) AS (a + 1)#952] > +- Filter (a#951 > 2) >+- SubqueryAlias tmp > +- Relation[a#951] parquet > {noformat} > After changes, the analyzed plan becomes > {noformat} > == Analyzed Logical Plan == > (a + 1): int > Project [(a#951 + 1) AS (a + 1)#952] > +- Filter (a#951 > 2) >+- SubqueryAlias tmp > +- SubqueryAlias test_parquet_ctas > +- Relation[a#951] parquet > {noformat} > **Note: the optimized plans are the same.** > For `SimpleCatalogRelation`, the existing code always generates two > Subqueries. Thus, no change is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16374) Remove Alias from MetastoreRelation and SimpleCatalogRelation
[ https://issues.apache.org/jira/browse/SPARK-16374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16374: Assignee: Xiao Li
[jira] [Resolved] (SPARK-14839) Support for other types as option in OPTIONS clause
[ https://issues.apache.org/jira/browse/SPARK-14839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-14839. --- Resolution: Resolved Assignee: Hyukjin Kwon Fix Version/s: 2.1.0 > Support for other types as option in OPTIONS clause > --- > > Key: SPARK-14839 > URL: https://issues.apache.org/jira/browse/SPARK-14839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > This was found in https://github.com/apache/spark/pull/12494. > Currently, Spark SQL does not support other types and {{null}} as a value of > an options. > For example, > {code} > CREATE ... > USING csv > OPTIONS (path "your-path", quote null) > {code} > throws an exception below > {code} > Unsupported SQL statement > == SQL == > CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, > modelName string, comments string, grp string) USING csv OPTIONS (path > "your-path", quote null) > org.apache.spark.sql.catalyst.parser.ParseException: > Unsupported SQL statement > == SQL == > CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, > modelName string, comments string, grp string) USING csv OPTIONS (path > "your-path", quote null) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:195) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:764) > ... > {code} > Currently, Scala API supports to take options with the types, {{String}}, > {{Long}}, {{Double}} and {{Boolean}} and Python API also supports other > types. I think in this way we can support data sources in a consistent way. > It looks it is okay to to provide other types as arguments just like > [Microsoft SQL|https://msdn.microsoft.com/en-us/library/ms190322.aspx] > because [SQL-1992|http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] > standard mentions options as below: > {quote} > An implementation remains conforming even if it provides user op- > tions to process nonconforming SQL language or to process conform- > ing SQL language in a nonconforming manner. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
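For comparison, the Scala DataFrameReader already accepts non-string values through its typed {{option}} overloads, which is the kind of consistency this ticket asks the SQL OPTIONS clause to match; a small sketch, assuming {{spark}} is an existing SparkSession and using illustrative CSV options and paths:
{code}
// `spark` is assumed to be an in-scope SparkSession (e.g. in spark-shell).
val cars = spark.read
  .format("csv")
  .option("header", true)        // Boolean overload
  .option("samplingRatio", 0.5)  // Double overload
  .option("maxColumns", 20480L)  // Long overload
  .option("quote", "\"")         // String overload
  .load("/path/to/cars.csv")
{code}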
[jira] [Created] (SPARK-16407) Allow users to supply custom StreamSinkProviders
holdenk created SPARK-16407: --- Summary: Allow users to supply custom StreamSinkProviders Key: SPARK-16407 URL: https://issues.apache.org/jira/browse/SPARK-16407 Project: Spark Issue Type: Improvement Components: Streaming Reporter: holdenk

The current DataStreamWriter allows users to specify a class name as the format; however, it could be easier for people to directly pass in a specific provider instance - e.g., for a user equivalent of ForeachSink or another sink with non-string parameters.
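A rough sketch of the difference, assuming {{spark}} is an existing SparkSession; the provider class name and the instance-based {{sink(...)}} call are hypothetical and only illustrate the requested API, not existing Spark methods:
{code}
// `spark` is assumed to be an in-scope SparkSession; the socket source is just a placeholder.
val df = spark.readStream.format("socket").option("host", "localhost").option("port", "9999").load()

// Today: the sink is named by a class name string, so it can only be configured
// through string-valued options.
df.writeStream
  .format("com.example.MyCustomSinkProvider")
  .option("endpoint", "https://example.invalid")
  .start()

// What this ticket asks for (hypothetical API, not in Spark today): pass a configured
// provider instance directly, so it can carry non-string state, similar to what
// the foreach/ForeachSink path already allows.
// df.writeStream.sink(new MyCustomSinkProvider(myClient)).start()
{code}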
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365547#comment-15365547 ] SuYan commented on SPARK-3630: -- Snappy-java has supported concatenated streams since snappy 1.1.2: https://github.com/xerial/snappy-java/issues/103
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365543#comment-15365543 ] SuYan commented on SPARK-3630: -- Maybe the reason is that snappy 1.0.4.1 does not support concatenation? The code path is "UnsafeShuffleWriter -> mergeSpillsWithFastFileStream", which concatenates the Snappy-compressed data of the same partition from different spilled files.
> Identify cause of Kryo+Snappy PARSING_ERROR
> ---
>
> Key: SPARK-3630
> URL: https://issues.apache.org/jira/browse/SPARK-3630
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 1.1.0, 1.2.0
> Reporter: Andrew Ash
> Assignee: Josh Rosen
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
> com.esotericsoftware.kryo.io.Input.require(Input.java:169)
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325)
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624)
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master.
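A standalone illustration of the suspected failure mode, using snappy-java directly rather than Spark's shuffle code; whether the final read succeeds is exactly the version-dependent behavior discussed in the comments above:
{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.xerial.snappy.{SnappyInputStream, SnappyOutputStream}

// Compress two payloads independently and concatenate the results, mimicking
// per-partition spill data being appended into one merged file.
def compress(payload: Array[Byte]): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val snappy = new SnappyOutputStream(bytes)
  snappy.write(payload)
  snappy.close()
  bytes.toByteArray
}

val concatenated = compress("spill-1".getBytes("UTF-8")) ++ compress("spill-2".getBytes("UTF-8"))

// Reading the concatenated data back in one pass relies on the library accepting
// concatenated streams; per the comment above, that support arrived in snappy-java 1.1.2.
val in = new SnappyInputStream(new ByteArrayInputStream(concatenated))
val buf = new Array[Byte](64)
Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach { n =>
  print(new String(buf, 0, n, "UTF-8"))
}
{code}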
[jira] [Closed] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN
[ https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao closed SPARK-16342. --- Resolution: Duplicate
> Add a new Configurable Token Manager for Spark Running on YARN
> ---
>
> Key: SPARK-16342
> URL: https://issues.apache.org/jira/browse/SPARK-16342
> Project: Spark
> Issue Type: New Feature
> Components: YARN
> Reporter: Saisai Shao
>
> Current Spark on YARN token management has some problems:
> 1. The supported services are hard-coded; only HDFS, Hive and HBase are supported for token fetching. For other third-party services that need to communicate with Spark in a Kerberized way, currently the only option is to modify Spark code.
> 2. The current token renewal and update mechanism is also hard-coded, which means other third-party services cannot benefit from this system and will fail when their tokens expire.
> 3. Also, at the code level, the current token obtain-and-update code is placed in several different places without elegant structure, which makes it hard to maintain and extend.
> So here I propose a new Configurable Token Manager class to solve the issues mentioned above. Basically this proposal has two changes:
> 1. Abstract a ServiceTokenProvider for different services; this is configurable and pluggable. By default there will be HDFS, HBase and Hive providers, and users can add their own services through configuration. This interface offers a way to retrieve the tokens and the token renewal interval.
> 2. Provide a ConfigurableTokenManager to manage all the added token providers, and expose APIs for external modules to get and update tokens.
> Details are in the design doc (https://docs.google.com/document/d/1piUvrQywWXiSwyZM9alN6ilrdlX9ohlNOuP4_Q3A6dc/edit?usp=sharing); any suggestions and comments are greatly appreciated.
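For readers who skip the design doc, a rough sketch of what the two proposed abstractions might look like; the trait, class, and method signatures below are illustrative pseudocode of the proposal (per-service providers discovered as Java services), not existing Spark classes:
{code}
import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials

// Hypothetical provider contract: each service (HDFS, Hive, HBase, Kafka, ...)
// ships one implementation and registers it as a Java service.
trait ServiceTokenProvider {
  def serviceName: String
  // Fetch tokens into `creds` and return the next renewal time, if any.
  def obtainTokens(hadoopConf: Configuration, creds: Credentials): Option[Long]
}

// Hypothetical manager that discovers providers via ServiceLoader and fetches
// tokens from all of them at launch time or when tokens need to be renewed.
class ConfigurableTokenManager(hadoopConf: Configuration) {
  private val providers: Seq[ServiceTokenProvider] =
    ServiceLoader.load(classOf[ServiceTokenProvider]).asScala.toSeq

  def obtainAll(creds: Credentials): Option[Long] =
    providers.flatMap(_.obtainTokens(hadoopConf, creds)).reduceOption(_ min _)
}
{code}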
[jira] [Commented] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN
[ https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365535#comment-15365535 ] Saisai Shao commented on SPARK-16342: - Closing this JIRA as a duplicate and moving the discussion to SPARK-14743.
[jira] [Commented] (SPARK-14743) Improve delegation token handling in secure clusters
[ https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365534#comment-15365534 ] Saisai Shao commented on SPARK-14743: - Post design doc here and move SPARK-16342 to here.
[jira] [Commented] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365533#comment-15365533 ] Gayathri Murali commented on SPARK-16240: - I can work on this.
> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
> Issue Type: Bug
> Reporter: yuhao yang
> Priority: Minor
>
> After resolving the matrix conversion issue, the LDA model still cannot load 1.6 models, as one of the parameter names was changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.
[jira] [Comment Edited] (SPARK-14743) Improve delegation token handling in secure clusters
[ https://issues.apache.org/jira/browse/SPARK-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365534#comment-15365534 ] Saisai Shao edited comment on SPARK-14743 at 7/7/16 3:18 AM: - Post design doc and move SPARK-16342 to here. was (Author: jerryshao): Post design doc here and move SPARK-16342 to here.
[jira] [Commented] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN
[ https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365483#comment-15365483 ] Saisai Shao commented on SPARK-16342: - OK, I see. Sorry I didn't notice your JIRA; let me consolidate things into the JIRA you opened, if you don't mind.
[jira] [Commented] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN
[ https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365482#comment-15365482 ] Marcelo Vanzin commented on SPARK-16342: I'm not working on it, I filed a bug because it's a missing feature that's needed. I'm just saying that instead of filing a bug with pretty much the same contents, it's better just to consolidate things.
[jira] [Issue Comment Deleted] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN
[ https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-16342: Comment: was deleted (was: Thanks [~vanzin] for pointing out the jira, looks like most part of the ideas are similar. I'm not sure what is your progress on it, I think it would be great to collaborate to make that happen. Thanks a lot. ) > Add a new Configurable Token Manager for Spark Running on YARN > --- > > Key: SPARK-16342 > URL: https://issues.apache.org/jira/browse/SPARK-16342 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Saisai Shao > > Current Spark on YARN token management has several problems: > 1. The supported services are hard-coded: only HDFS, Hive and HBase are supported > for token fetching. For any other third-party service that needs to communicate > with Spark in a Kerberized way, currently the only option is to modify Spark code. > 2. The token renewal and update mechanism is also hard-coded, which means > other third-party services cannot benefit from it and will fail once their tokens expire. > 3. At the code level, the token-fetching and update logic is scattered > across several places without a clear structure, which makes it hard > to maintain and extend. > So this issue proposes a new Configurable Token Manager class to solve the issues > mentioned above. > Basically this proposal involves two changes: > 1. Abstract a ServiceTokenProvider for different services. This is > configurable and pluggable; by default there will be HDFS, HBase and Hive > providers, and users can add their own services through configuration. This > interface offers a way to retrieve tokens and the token renewal interval. > 2. Provide a ConfigurableTokenManager to manage all the added token > providers and expose APIs for external modules to get and update tokens. > Details are in the design doc > (https://docs.google.com/document/d/1piUvrQywWXiSwyZM9alN6ilrdlX9ohlNOuP4_Q3A6dc/edit?usp=sharing); > any suggestions and comments are greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16342) Add a new Configurable Token Manager for Spark Running on YARN
[ https://issues.apache.org/jira/browse/SPARK-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365477#comment-15365477 ] Saisai Shao commented on SPARK-16342: - Thanks [~vanzin] for pointing out the JIRA; it looks like most of the ideas are similar. I'm not sure what your progress on it is, but I think it would be great to collaborate to make it happen. Thanks a lot. > Add a new Configurable Token Manager for Spark Running on YARN > --- > > Key: SPARK-16342 > URL: https://issues.apache.org/jira/browse/SPARK-16342 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Saisai Shao > > Current Spark on YARN token management has several problems: > 1. The supported services are hard-coded: only HDFS, Hive and HBase are supported > for token fetching. For any other third-party service that needs to communicate > with Spark in a Kerberized way, currently the only option is to modify Spark code. > 2. The token renewal and update mechanism is also hard-coded, which means > other third-party services cannot benefit from it and will fail once their tokens expire. > 3. At the code level, the token-fetching and update logic is scattered > across several places without a clear structure, which makes it hard > to maintain and extend. > So this issue proposes a new Configurable Token Manager class to solve the issues > mentioned above. > Basically this proposal involves two changes: > 1. Abstract a ServiceTokenProvider for different services. This is > configurable and pluggable; by default there will be HDFS, HBase and Hive > providers, and users can add their own services through configuration. This > interface offers a way to retrieve tokens and the token renewal interval. > 2. Provide a ConfigurableTokenManager to manage all the added token > providers and expose APIs for external modules to get and update tokens. > Details are in the design doc > (https://docs.google.com/document/d/1piUvrQywWXiSwyZM9alN6ilrdlX9ohlNOuP4_Q3A6dc/edit?usp=sharing); > any suggestions and comments are greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16174) Improve `OptimizeIn` optimizer to remove literal repetitions
[ https://issues.apache.org/jira/browse/SPARK-16174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16174: -- Summary: Improve `OptimizeIn` optimizer to remove literal repetitions (was: Improve `OptimizeIn` optimizer to remove deterministic repetitions) > Improve `OptimizeIn` optimizer to remove literal repetitions > > > Key: SPARK-16174 > URL: https://issues.apache.org/jira/browse/SPARK-16174 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun >Priority: Minor > > This issue improves `OptimizeIn` optimizer to remove the deterministic > repetitions from SQL `IN` predicates. This optimizer prevents user mistakes > and also can optimize some queries like > [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16174) Improve `OptimizeIn` optimizer to remove literal repetitions
[ https://issues.apache.org/jira/browse/SPARK-16174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16174: -- Description: This issue improves `OptimizeIn` optimizer to remove the literal repetitions from SQL `IN` predicates. This optimizer prevents user mistakes and also can optimize some queries like [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19]. (was: This issue improves `OptimizeIn` optimizer to remove the deterministic repetitions from SQL `IN` predicates. This optimizer prevents user mistakes and also can optimize some queries like [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].) > Improve `OptimizeIn` optimizer to remove literal repetitions > > > Key: SPARK-16174 > URL: https://issues.apache.org/jira/browse/SPARK-16174 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun >Priority: Minor > > This issue improves `OptimizeIn` optimizer to remove the literal repetitions > from SQL `IN` predicates. This optimizer prevents user mistakes and also can > optimize some queries like > [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
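To make the OptimizeIn improvement above (SPARK-16174) concrete, here is a small sketch of the kind of predicate it targets. This is illustrative only: the column name and values are made up, and the "equivalent to" comment states the expected effect described in the issue, not output captured from Spark.
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OptimizeInSketch").getOrCreate()

// An IN list with repeated literals, as might be produced by a query generator.
val df = spark.range(2000, 2010).toDF("d_year")
  .where("d_year IN (2000, 2000, 2001, 2001, 2002)")

// With the improved rule, the duplicated literals are dropped before execution,
// so the filter is equivalent to: d_year IN (2000, 2001, 2002).
df.explain(true)
{code}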
[jira] [Commented] (SPARK-16406) Reference resolution for large number of columns should be faster
[ https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365362#comment-15365362 ] Apache Spark commented on SPARK-16406: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/14083 > Reference resolution for large number of columns should be faster > - > > Key: SPARK-16406 > URL: https://issues.apache.org/jira/browse/SPARK-16406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > Resolving columns in a LogicalPlan on average takes n / 2 (n being the number > of columns). This gets problematic as soon as you try to resolve a large > number of columns (m) on a large table: O(m * n / 2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
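The cost described in SPARK-16406 comes from resolving each requested column with a linear scan over the plan's output. A minimal sketch of the difference follows; this is not Spark's actual resolution code, and {{Attribute}} here is just a stand-in for Spark's named expression:
{code}
case class Attribute(name: String)

// Linear scan: resolving m columns against n output attributes costs O(m * n / 2) on average.
def resolveLinear(output: Seq[Attribute], name: String): Option[Attribute] =
  output.find(_.name.equalsIgnoreCase(name))

// Pre-built index: one O(n) pass, then each lookup is ~O(1), so m lookups cost O(n + m).
// (Handling of duplicate/ambiguous names is omitted in this sketch.)
def buildIndex(output: Seq[Attribute]): Map[String, Attribute] =
  output.map(a => a.name.toLowerCase -> a).toMap

def resolveIndexed(index: Map[String, Attribute], name: String): Option[Attribute] =
  index.get(name.toLowerCase)
{code}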
[jira] [Assigned] (SPARK-16406) Reference resolution for large number of columns should be faster
[ https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16406: Assignee: Apache Spark (was: Herman van Hovell) > Reference resolution for large number of columns should be faster > - > > Key: SPARK-16406 > URL: https://issues.apache.org/jira/browse/SPARK-16406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Herman van Hovell >Assignee: Apache Spark > > Resolving columns in a LogicalPlan on average takes n / 2 (n being the number > of columns). This gets problematic as soon as you try to resolve a large > number of columns (m) on a large table: O(m * n / 2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16406) Reference resolution for large number of columns should be faster
[ https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16406: Assignee: Herman van Hovell (was: Apache Spark) > Reference resolution for large number of columns should be faster > - > > Key: SPARK-16406 > URL: https://issues.apache.org/jira/browse/SPARK-16406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > Resolving columns in a LogicalPlan on average takes n / 2 (n being the number > of columns). This gets problematic as soon as you try to resolve a large > number of columns (m) on a large table: O(m * n / 2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16381) Update SQL examples and programming guide for R language binding
[ https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16381: Assignee: (was: Apache Spark) > Update SQL examples and programming guide for R language binding > > > Key: SPARK-16381 > URL: https://issues.apache.org/jira/browse/SPARK-16381 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Examples >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Please follow guidelines listed in this SPARK-16303 > [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16381) Update SQL examples and programming guide for R language binding
[ https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16381: Assignee: Apache Spark > Update SQL examples and programming guide for R language binding > > > Key: SPARK-16381 > URL: https://issues.apache.org/jira/browse/SPARK-16381 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Examples >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > Please follow guidelines listed in this SPARK-16303 > [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16381) Update SQL examples and programming guide for R language binding
[ https://issues.apache.org/jira/browse/SPARK-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365352#comment-15365352 ] Apache Spark commented on SPARK-16381: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/14082 > Update SQL examples and programming guide for R language binding > > > Key: SPARK-16381 > URL: https://issues.apache.org/jira/browse/SPARK-16381 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Examples >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Please follow guidelines listed in this SPARK-16303 > [comment|https://issues.apache.org/jira/browse/SPARK-16303?focusedCommentId=15362575=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15362575]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16406) Reference resolution for large number of columns should be faster
Herman van Hovell created SPARK-16406: - Summary: Reference resolution for large number of columns should be faster Key: SPARK-16406 URL: https://issues.apache.org/jira/browse/SPARK-16406 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Herman van Hovell Assignee: Herman van Hovell Resolving columns in a LogicalPlan on average takes n / 2 (n being the number of columns). This gets problematic as soon as you try to resolve a large number of columns (m) on a large table: O(m * n / 2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16403) Example cleanup and fix minor issues
[ https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365340#comment-15365340 ] Apache Spark commented on SPARK-16403: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/14081 > Example cleanup and fix minor issues > > > Key: SPARK-16403 > URL: https://issues.apache.org/jira/browse/SPARK-16403 > Project: Spark > Issue Type: Sub-task > Components: Examples, PySpark >Reporter: Bryan Cutler >Priority: Trivial > > General cleanup of examples, focused on PySpark ML, to remove unused imports, > sync with Scala examples, improve consistency and fix minor issues such as > arg checks etc. > * consistent appNames, most are camel case > * fix formatting, add newlines if difficult to read - many examples are just > solid blocks of code > * should use __future__ print function > * pipeline_example is a duplicate of simple_text_classification_pipeline > * some spelling errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16403) Example cleanup and fix minor issues
[ https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16403: Assignee: (was: Apache Spark) > Example cleanup and fix minor issues > > > Key: SPARK-16403 > URL: https://issues.apache.org/jira/browse/SPARK-16403 > Project: Spark > Issue Type: Sub-task > Components: Examples, PySpark >Reporter: Bryan Cutler >Priority: Trivial > > General cleanup of examples, focused on PySpark ML, to remove unused imports, > sync with Scala examples, improve consistency and fix minor issues such as > arg checks etc. > * consistent appNames, most are camel case > * fix formatting, add newlines if difficult to read - many examples are just > solid blocks of code > * should use __future__ print function > * pipeline_example is a duplicate of simple_text_classification_pipeline > * some spelling errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16403) Example cleanup and fix minor issues
[ https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16403: Assignee: Apache Spark > Example cleanup and fix minor issues > > > Key: SPARK-16403 > URL: https://issues.apache.org/jira/browse/SPARK-16403 > Project: Spark > Issue Type: Sub-task > Components: Examples, PySpark >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Trivial > > General cleanup of examples, focused on PySpark ML, to remove unused imports, > sync with Scala examples, improve consistency and fix minor issues such as > arg checks etc. > * consistent appNames, most are camel case > * fix formatting, add newlines if difficult to read - many examples are just > solid blocks of code > * should use __future__ print function > * pipeline_example is a duplicate of simple_text_classification_pipeline > * some spelling errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16021) Zero out freed memory in test to help catch correctness bugs
[ https://issues.apache.org/jira/browse/SPARK-16021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16021. - Resolution: Fixed Assignee: Eric Liang (was: Apache Spark) Fix Version/s: 2.1.0 > Zero out freed memory in test to help catch correctness bugs > > > Key: SPARK-16021 > URL: https://issues.apache.org/jira/browse/SPARK-16021 > Project: Spark > Issue Type: Improvement >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.1.0 > > > In both on-heap and off-heap modes, it would be helpful to immediately zero > out (or otherwise fill with a sentinel value) memory when an object is > deallocated. > Currently, in on-heap mode, freed memory can be accessed without visible > error if no other consumer has written to the same space. Similarly, off-heap > memory can be accessed without fault if the allocation library has not > released the pages back to the OS. Zeroing out freed memory would make these > errors immediately visible as a correctness problem. > Since this would add some performance overhead, it would make sense to > conf-flag and enable only in test. > cc [~sameerag] [~hvanhovell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
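As a rough illustration of the idea in SPARK-16021, the sketch below poisons freed memory with a sentinel byte so that reads-after-free surface as obviously wrong data. It is not Spark's memory manager; the flag name and sentinel value are assumptions made for the example.
{code}
object SentinelAllocator {
  // Enabled only in tests, mirroring the conf-flag suggestion in the issue.
  val debugFill: Boolean = sys.props.get("memory.debugFill").contains("true")
  private val Sentinel: Byte = 0xa5.toByte

  def allocate(size: Int): Array[Byte] = new Array[Byte](size)

  def free(block: Array[Byte]): Unit = {
    if (debugFill) {
      // Poison the block so any later read of "freed" memory is immediately visible.
      java.util.Arrays.fill(block, Sentinel)
    }
    // A real allocator would return the block to a pool or release it to the OS here.
  }
}
{code}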
[jira] [Updated] (SPARK-16403) Example cleanup and fix minor issues
[ https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-16403: - Description: General cleanup of examples, focused on PySpark ML, to remove unused imports, sync with Scala examples, improve consistency and fix minor issues such as arg checks etc. * consistent appNames, most are camel case * fix formatting, add newlines if difficult to read - many examples are just solid blocks of code * should use __future__ print function * pipeline_example is a duplicate of simple_text_classification_pipeline * some spelling errors was:General cleanup of examples, focused on PySpark ML, to remove unused imports, sync with Scala examples, improve consistency and fix minor issues such as arg checks etc. > Example cleanup and fix minor issues > > > Key: SPARK-16403 > URL: https://issues.apache.org/jira/browse/SPARK-16403 > Project: Spark > Issue Type: Sub-task > Components: Examples, PySpark >Reporter: Bryan Cutler >Priority: Trivial > > General cleanup of examples, focused on PySpark ML, to remove unused imports, > sync with Scala examples, improve consistency and fix minor issues such as > arg checks etc. > * consistent appNames, most are camel case > * fix formatting, add newlines if difficult to read - many examples are just > solid blocks of code > * should use __future__ print function > * pipeline_example is a duplicate of simple_text_classification_pipeline > * some spelling errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module
[ https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365321#comment-15365321 ] Shixiong Zhu commented on SPARK-6028: - Changing the default RPC to Netty mostly because we want to test it broadly before dropping Akka. The class version conflict is a different story anyway. Even if we don't switch to Netty, you probably will see some other weird error. > Provide an alternative RPC implementation based on the network transport > module > --- > > Key: SPARK-6028 > URL: https://issues.apache.org/jira/browse/SPARK-6028 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 1.6.0 > > > Network transport module implements a low level RPC interface. We can build a > new RPC implementation on top of that to replace Akka's. > Design document: > https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16382) YARN - Dynamic allocation with spark.executor.instances should increase max executors.
[ https://issues.apache.org/jira/browse/SPARK-16382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-16382. --- Resolution: Won't Fix > YARN - Dynamic allocation with spark.executor.instances should increase max > executors. > -- > > Key: SPARK-16382 > URL: https://issues.apache.org/jira/browse/SPARK-16382 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Ryan Blue > > SPARK-13723 changed the behavior of dynamic allocation when > {{--num-executors}} ({{spark.executor.instances}}) is set. Rather than > turning off dynamic allocation, the value is used as the initial number of > executors. This did not change the behavior of > {{spark.dynamicAllocation.maxExecutors}}. We've noticed that some users set > {{--num-executors}} higher than the max and the expectation is that the max > increases. > I think that either max should be increased, or Spark should fail and > complain that the executors requested is higher than the max. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16382) YARN - Dynamic allocation with spark.executor.instances should increase max executors.
[ https://issues.apache.org/jira/browse/SPARK-16382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365255#comment-15365255 ] Ryan Blue commented on SPARK-16382: --- [~jerryshao], [~tgraves], I think you're both right that this is currently caught. The behavior I observed was in our local copy with an older patch for SPARK-13723 that used {{spark.executor.instances}} to increase the min rather than the initial number of executors. For those jobs where min was then higher than max, Spark would try to get the min number of executors and never let go of any and there wasn't a problem that it was higher than max. I was originally suggesting that max should be increased, which doesn't currently happen, but then I thought that it may be better to fail so I added that to the description. That's why I missed that Spark already fails. I'll close this. Thanks! > YARN - Dynamic allocation with spark.executor.instances should increase max > executors. > -- > > Key: SPARK-16382 > URL: https://issues.apache.org/jira/browse/SPARK-16382 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Ryan Blue > > SPARK-13723 changed the behavior of dynamic allocation when > {{--num-executors}} ({{spark.executor.instances}}) is set. Rather than > turning off dynamic allocation, the value is used as the initial number of > executors. This did not change the behavior of > {{spark.dynamicAllocation.maxExecutors}}. We've noticed that some users set > {{--num-executors}} higher than the max and the expectation is that the max > increases. > I think that either max should be increased, or Spark should fail and > complain that the executors requested is higher than the max. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
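For reference, the configuration combination being discussed looks roughly like this (the values are made up): since SPARK-13723, {{spark.executor.instances}} seeds the initial executor count, and setting it above the dynamic-allocation maximum is the case that current Spark already rejects, as noted above.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.executor.instances", "200")             // initial executor count
  .set("spark.dynamicAllocation.maxExecutors", "100") // lower than the requested initial count
{code}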
[jira] [Commented] (SPARK-16405) Add metrics and source for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-16405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365247#comment-15365247 ] Apache Spark commented on SPARK-16405: -- User 'lovexi' has created a pull request for this issue: https://github.com/apache/spark/pull/14080 > Add metrics and source for external shuffle service > --- > > Key: SPARK-16405 > URL: https://issues.apache.org/jira/browse/SPARK-16405 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: YangyangLiu > Labels: Metrics, Monitoring, features > > ExternalShuffleService is essential for spark. In order to better monitor > shuffle service, we added various metrics in shuffle service and > ExternalShuffleServiceSource for metric system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16405) Add metrics and source for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-16405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16405: Assignee: (was: Apache Spark) > Add metrics and source for external shuffle service > --- > > Key: SPARK-16405 > URL: https://issues.apache.org/jira/browse/SPARK-16405 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: YangyangLiu > Labels: Metrics, Monitoring, features > > ExternalShuffleService is essential for spark. In order to better monitor > shuffle service, we added various metrics in shuffle service and > ExternalShuffleServiceSource for metric system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16405) Add metrics and source for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-16405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16405: Assignee: Apache Spark > Add metrics and source for external shuffle service > --- > > Key: SPARK-16405 > URL: https://issues.apache.org/jira/browse/SPARK-16405 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: YangyangLiu >Assignee: Apache Spark > Labels: Metrics, Monitoring, features > > ExternalShuffleService is essential for spark. In order to better monitor > shuffle service, we added various metrics in shuffle service and > ExternalShuffleServiceSource for metric system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8425) Add blacklist mechanism for task scheduling
[ https://issues.apache.org/jira/browse/SPARK-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365244#comment-15365244 ] Apache Spark commented on SPARK-8425: - User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/14079 > Add blacklist mechanism for task scheduling > --- > > Key: SPARK-8425 > URL: https://issues.apache.org/jira/browse/SPARK-8425 > Project: Spark > Issue Type: Improvement > Components: Scheduler, YARN >Reporter: Saisai Shao >Assignee: Imran Rashid >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16405) Add metrics and source for external shuffle service
YangyangLiu created SPARK-16405: --- Summary: Add metrics and source for external shuffle service Key: SPARK-16405 URL: https://issues.apache.org/jira/browse/SPARK-16405 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: YangyangLiu ExternalShuffleService is essential for spark. In order to better monitor shuffle service, we added various metrics in shuffle service and ExternalShuffleServiceSource for metric system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
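A rough sketch of what such a metrics source could look like, using the Dropwizard/Codahale metrics library that Spark's metrics system is built on. The class and metric names below are illustrative assumptions, not the names from the linked pull request:
{code}
import com.codahale.metrics.{Counter, MetricRegistry, Timer}

class ExternalShuffleServiceSource {
  val sourceName: String = "externalShuffleService"
  val metricRegistry: MetricRegistry = new MetricRegistry()

  // Example metrics one might expose for the shuffle service:
  val openBlockRequestLatency: Timer = metricRegistry.timer("openBlockRequestLatencyMillis")
  val registerExecutorRequests: Counter = metricRegistry.counter("registerExecutorRequests")
  val blocksServed: Counter = metricRegistry.counter("blocksServed")
  val bytesServed: Counter = metricRegistry.counter("bytesServed")
}
{code}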
[jira] [Commented] (SPARK-16404) LeastSquaresAggregator in Linear Regression serializes unnecessary data
[ https://issues.apache.org/jira/browse/SPARK-16404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365227#comment-15365227 ] Seth Hendrickson commented on SPARK-16404: -- cc [~dbtsai] I looked into using the @transient tag, but this prevents the coefficients from being serialized and broadcast to the executors at all, resulting in a {{NullPointerException}}. I am not sure of a way around this. I can submit a patch utilizing the same strategy as in LoR later this week. > LeastSquaresAggregator in Linear Regression serializes unnecessary data > --- > > Key: SPARK-16404 > URL: https://issues.apache.org/jira/browse/SPARK-16404 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson > > This is basically the same issue as > [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], but for > linear regression, where {{coefficients}} and {{featuresStd}} are > unnecessarily serialized between stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
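The strategy referred to above (the one logistic regression uses) is to broadcast the shared arrays once and dereference them lazily on the executors, instead of letting every serialized aggregator carry its own copy. Below is a minimal sketch under that assumption; the class shape and names are illustrative, not Spark's actual aggregator:
{code}
import org.apache.spark.broadcast.Broadcast

class LeastSquaresAggregatorSketch(
    bcCoefficients: Broadcast[Array[Double]],
    bcFeaturesStd: Broadcast[Array[Double]]) extends Serializable {

  // @transient lazy vals keep the arrays out of the serialized closure but still
  // re-materialize them from the broadcast after deserialization on the executor,
  // avoiding the NullPointerException that a plain @transient field would cause.
  @transient private lazy val coefficients: Array[Double] = bcCoefficients.value
  @transient private lazy val featuresStd: Array[Double] = bcFeaturesStd.value

  def add(features: Array[Double], label: Double): this.type = {
    var dot = 0.0
    var i = 0
    while (i < features.length) {
      if (featuresStd(i) != 0.0) dot += coefficients(i) * (features(i) / featuresStd(i))
      i += 1
    }
    // ... accumulate loss and gradient from (dot - label) here ...
    this
  }
}
{code}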
[jira] [Created] (SPARK-16404) LeastSquaresAggregator in Linear Regression serializes unnecessary data
Seth Hendrickson created SPARK-16404: Summary: LeastSquaresAggregator in Linear Regression serializes unnecessary data Key: SPARK-16404 URL: https://issues.apache.org/jira/browse/SPARK-16404 Project: Spark Issue Type: Improvement Components: ML Reporter: Seth Hendrickson This is basically the same issue as [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], but for linear regression, where {{coefficients}} and {{featuresStd}} are unnecessarily serialized between stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11857) Remove Mesos fine-grained mode subject to discussions
[ https://issues.apache.org/jira/browse/SPARK-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11857: Assignee: Reynold Xin (was: Apache Spark) > Remove Mesos fine-grained mode subject to discussions > - > > Key: SPARK-11857 > URL: https://issues.apache.org/jira/browse/SPARK-11857 > Project: Spark > Issue Type: Sub-task > Components: Mesos >Reporter: Reynold Xin >Assignee: Reynold Xin > > See discussions in > http://apache-spark-developers-list.1001551.n3.nabble.com/Removing-the-Mesos-fine-grained-mode-td15277.html > and > http://apache-spark-developers-list.1001551.n3.nabble.com/Please-reply-if-you-use-Mesos-fine-grained-mode-td14930.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11857) Remove Mesos fine-grained mode subject to discussions
[ https://issues.apache.org/jira/browse/SPARK-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11857: Assignee: Apache Spark (was: Reynold Xin) > Remove Mesos fine-grained mode subject to discussions > - > > Key: SPARK-11857 > URL: https://issues.apache.org/jira/browse/SPARK-11857 > Project: Spark > Issue Type: Sub-task > Components: Mesos >Reporter: Reynold Xin >Assignee: Apache Spark > > See discussions in > http://apache-spark-developers-list.1001551.n3.nabble.com/Removing-the-Mesos-fine-grained-mode-td15277.html > and > http://apache-spark-developers-list.1001551.n3.nabble.com/Please-reply-if-you-use-Mesos-fine-grained-mode-td14930.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11857) Remove Mesos fine-grained mode subject to discussions
[ https://issues.apache.org/jira/browse/SPARK-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365190#comment-15365190 ] Apache Spark commented on SPARK-11857: -- User 'mgummelt' has created a pull request for this issue: https://github.com/apache/spark/pull/14078 > Remove Mesos fine-grained mode subject to discussions > - > > Key: SPARK-11857 > URL: https://issues.apache.org/jira/browse/SPARK-11857 > Project: Spark > Issue Type: Sub-task > Components: Mesos >Reporter: Reynold Xin >Assignee: Reynold Xin > > See discussions in > http://apache-spark-developers-list.1001551.n3.nabble.com/Removing-the-Mesos-fine-grained-mode-td15277.html > and > http://apache-spark-developers-list.1001551.n3.nabble.com/Please-reply-if-you-use-Mesos-fine-grained-mode-td14930.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16403) Example cleanup and fix minor issues
[ https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365174#comment-15365174 ] Bryan Cutler commented on SPARK-16403: -- I'm working on this > Example cleanup and fix minor issues > > > Key: SPARK-16403 > URL: https://issues.apache.org/jira/browse/SPARK-16403 > Project: Spark > Issue Type: Sub-task > Components: Examples, PySpark >Reporter: Bryan Cutler > > General cleanup of examples, focused on PySpark ML, to remove unused imports, > sync with Scala examples, improve consistency and fix minor issues such as > arg checks etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16403) Example cleanup and fix minor issues
[ https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-16403: - Priority: Trivial (was: Major) > Example cleanup and fix minor issues > > > Key: SPARK-16403 > URL: https://issues.apache.org/jira/browse/SPARK-16403 > Project: Spark > Issue Type: Sub-task > Components: Examples, PySpark >Reporter: Bryan Cutler >Priority: Trivial > > General cleanup of examples, focused on PySpark ML, to remove unused imports, > sync with Scala examples, improve consistency and fix minor issues such as > arg checks etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16403) Example cleanup and fix minor issues
Bryan Cutler created SPARK-16403: Summary: Example cleanup and fix minor issues Key: SPARK-16403 URL: https://issues.apache.org/jira/browse/SPARK-16403 Project: Spark Issue Type: Sub-task Components: Examples, PySpark Reporter: Bryan Cutler General cleanup of examples, focused on PySpark ML, to remove unused imports, sync with Scala examples, improve consistency and fix minor issues such as arg checks etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16402) JDBC source: Implement save API
[ https://issues.apache.org/jira/browse/SPARK-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16402: Assignee: Apache Spark > JDBC source: Implement save API > --- > > Key: SPARK-16402 > URL: https://issues.apache.org/jira/browse/SPARK-16402 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Currently, we are unable to call the `save` API of `DataFrameWriter` when the > source is JDBC. For example, > {noformat} > df.write > .format("jdbc") > .option("url", url1) > .option("dbtable", "TEST.TRUNCATETEST") > .option("user", "testUser") > .option("password", "testPass") > .save() > {noformat} > The error message users will get is like > {noformat} > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > java.lang.RuntimeException: > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > {noformat} > However, the `save` API is very common for all the data sources, like parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16402) JDBC source: Implement save API
[ https://issues.apache.org/jira/browse/SPARK-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365150#comment-15365150 ] Apache Spark commented on SPARK-16402: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14077 > JDBC source: Implement save API > --- > > Key: SPARK-16402 > URL: https://issues.apache.org/jira/browse/SPARK-16402 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, we are unable to call the `save` API of `DataFrameWriter` when the > source is JDBC. For example, > {noformat} > df.write > .format("jdbc") > .option("url", url1) > .option("dbtable", "TEST.TRUNCATETEST") > .option("user", "testUser") > .option("password", "testPass") > .save() > {noformat} > The error message users will get is like > {noformat} > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > java.lang.RuntimeException: > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > {noformat} > However, the `save` API is very common for all the data sources, like parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16402) JDBC source: Implement save API
[ https://issues.apache.org/jira/browse/SPARK-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16402: Assignee: (was: Apache Spark) > JDBC source: Implement save API > --- > > Key: SPARK-16402 > URL: https://issues.apache.org/jira/browse/SPARK-16402 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, we are unable to call the `save` API of `DataFrameWriter` when the > source is JDBC. For example, > {noformat} > df.write > .format("jdbc") > .option("url", url1) > .option("dbtable", "TEST.TRUNCATETEST") > .option("user", "testUser") > .option("password", "testPass") > .save() > {noformat} > The error message users will get is like > {noformat} > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > java.lang.RuntimeException: > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > {noformat} > However, the `save` API is very common for all the data sources, like parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16402) JDBC source: Implement save API
Xiao Li created SPARK-16402: --- Summary: JDBC source: Implement save API Key: SPARK-16402 URL: https://issues.apache.org/jira/browse/SPARK-16402 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li Currently, we are unable to call the `save` API of `DataFrameWriter` when the source is JDBC. For example, {noformat} df.write .format("jdbc") .option("url", url1) .option("dbtable", "TEST.TRUNCATETEST") .option("user", "testUser") .option("password", "testPass") .save() {noformat} The error message users will get is like {noformat} org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not allow create table as select. java.lang.RuntimeException: org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not allow create table as select. {noformat} However, the `save` API is very common for all the data sources, like parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
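Until the generic {{save()}} path is implemented, the dedicated JDBC writer method is the path that does work; a small sketch for contrast, reusing the {{df}} and {{url1}} placeholders from the snippet above:
{code}
import java.util.Properties

val props = new Properties()
props.setProperty("user", "testUser")
props.setProperty("password", "testPass")

// Works today: the JDBC-specific writer method.
df.write.jdbc(url1, "TEST.TRUNCATETEST", props)

// What this issue asks for: the same behaviour through the generic API.
// df.write.format("jdbc").option("url", url1).option("dbtable", "TEST.TRUNCATETEST").save()
{code}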
[jira] [Commented] (SPARK-16379) Spark on mesos is broken due to race condition in Logging
[ https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365117#comment-15365117 ] Charles Allen commented on SPARK-16379: --- That's great, thanks a ton! > Spark on mesos is broken due to race condition in Logging > - > > Key: SPARK-16379 > URL: https://issues.apache.org/jira/browse/SPARK-16379 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Stavros Kontopoulos >Assignee: Sean Owen >Priority: Blocker > Fix For: 2.0.0 > > Attachments: out.txt > > > This commit introduced a transient lazy log val: > https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec > This has caused problems in the past: > https://github.com/apache/spark/pull/1004 > One commit before that everything works fine. > I spotted that when my CI started to fail: > https://ci.typesafe.com/job/mit-docker-test-ref/191/ > You can easily verify it by installing mesos on your machine and try to > connect with spark shell from bin dir: > ./spark-shell --master mesos://zk://localhost:2181/mesos --conf > spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz > It gets stuck at the point where it tries to create the SparkContext. > Logging gets stuck here: > I0705 12:10:10.076617 9303 group.cpp:700] Trying to get > '/mesos/json.info_000152' in ZooKeeper > I0705 12:10:10.076920 9304 detector.cpp:479] A new leading master > (UPID=master@127.0.1.1:5050) is detected > I0705 12:10:10.076956 9303 sched.cpp:326] New master detected at > master@127.0.1.1:5050 > I0705 12:10:10.077057 9303 sched.cpp:336] No credentials provided. > Attempting to register without authentication > I0705 12:10:10.090709 9301 sched.cpp:703] Framework registered with > 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001 > I verified it also by changing @transient lazy val log to def and it works as > expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16379) Spark on mesos is broken due to race condition in Logging
[ https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365112#comment-15365112 ] Sean Owen commented on SPARK-16379: --- https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20Priority%20%3D%20Blocker%20AND%20%22Target%20Version%2Fs%22%20%3D%202.0.0%20AND%20Resolution%20%3D%20Unresolved ? you can filter JIRA how you like. Target Version should be pretty reliable. > Spark on mesos is broken due to race condition in Logging > - > > Key: SPARK-16379 > URL: https://issues.apache.org/jira/browse/SPARK-16379 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Stavros Kontopoulos >Assignee: Sean Owen >Priority: Blocker > Fix For: 2.0.0 > > Attachments: out.txt > > > This commit introduced a transient lazy log val: > https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec > This has caused problems in the past: > https://github.com/apache/spark/pull/1004 > One commit before that everything works fine. > I spotted that when my CI started to fail: > https://ci.typesafe.com/job/mit-docker-test-ref/191/ > You can easily verify it by installing mesos on your machine and try to > connect with spark shell from bin dir: > ./spark-shell --master mesos://zk://localhost:2181/mesos --conf > spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz > It gets stuck at the point where it tries to create the SparkContext. > Logging gets stuck here: > I0705 12:10:10.076617 9303 group.cpp:700] Trying to get > '/mesos/json.info_000152' in ZooKeeper > I0705 12:10:10.076920 9304 detector.cpp:479] A new leading master > (UPID=master@127.0.1.1:5050) is detected > I0705 12:10:10.076956 9303 sched.cpp:326] New master detected at > master@127.0.1.1:5050 > I0705 12:10:10.077057 9303 sched.cpp:336] No credentials provided. > Attempting to register without authentication > I0705 12:10:10.090709 9301 sched.cpp:703] Framework registered with > 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001 > I verified it also by changing @transient lazy val log to def and it works as > expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16379) Spark on mesos is broken due to race condition in Logging
[ https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365109#comment-15365109 ] Charles Allen commented on SPARK-16379: --- [~srowen] is there a list of blockers somewhere? I also want to get branch-2.0 tested from our side but would like to know what sort of caveats to expect. > Spark on mesos is broken due to race condition in Logging > - > > Key: SPARK-16379 > URL: https://issues.apache.org/jira/browse/SPARK-16379 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Stavros Kontopoulos >Assignee: Sean Owen >Priority: Blocker > Fix For: 2.0.0 > > Attachments: out.txt > > > This commit introduced a transient lazy log val: > https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec > This has caused problems in the past: > https://github.com/apache/spark/pull/1004 > One commit before that everything works fine. > I spotted that when my CI started to fail: > https://ci.typesafe.com/job/mit-docker-test-ref/191/ > You can easily verify it by installing mesos on your machine and try to > connect with spark shell from bin dir: > ./spark-shell --master mesos://zk://localhost:2181/mesos --conf > spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz > It gets stuck at the point where it tries to create the SparkContext. > Logging gets stuck here: > I0705 12:10:10.076617 9303 group.cpp:700] Trying to get > '/mesos/json.info_000152' in ZooKeeper > I0705 12:10:10.076920 9304 detector.cpp:479] A new leading master > (UPID=master@127.0.1.1:5050) is detected > I0705 12:10:10.076956 9303 sched.cpp:326] New master detected at > master@127.0.1.1:5050 > I0705 12:10:10.077057 9303 sched.cpp:336] No credentials provided. > Attempting to register without authentication > I0705 12:10:10.090709 9301 sched.cpp:703] Framework registered with > 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001 > I verified it also by changing @transient lazy val log to def and it works as > expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16344) Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+
[ https://issues.apache.org/jira/browse/SPARK-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365062#comment-15365062 ] Ryan Blue commented on SPARK-16344: --- It looks like the main change is to specifically catch the 3-level name structure, {{list-name (LIST) -> "list" -> "element"}}. The problem with this approach is that it doesn't solve the problem entirely either. Let me try to give a bit more background. In parquet-avro, there are two {{isElementType}} methods; one in the schema converter and one in the record converter. The one in the schema converter will guess whether the Parquet type uses a 3-level list or a 2-level list when it can't be determined according to the spec's backward-compatibility rules. That guess assumes a 2-level structure by default and at the next major release will guess a 3-level structure. (This can be controlled by a property.) But this is only used when the reader doesn't supply a read schema / expected schema and the code has to convert from Parquet's type to get one. Ideally, we always have a read schema from the file, from the reader's expected class (if using Java objects), or from the reader passing in the expected schema. That's why the other {{isElementType}} method exists: it looks at the expected schema and the file schema to determine whether the caller has passed in a schema with the extra single-field list/element struct. That code has to distinguish between two cases for a 3-level list: 1. When the caller expects {{List}}, with the extra record layer that was originally returned when Avro only knew about 2-level lists. 2. When the caller expects {{List}}, without an extra layer. The code currently assumes that if the element schema appears to match the repeated type that the caller has passed a schema indicating case 1. This issue points out that the matching isn't perfect and an element with a single field named "element" will incorrectly match case 1 when it was really case 2. The problem with the solution in PR #14013, if it were applied to Avro, is that it breaks if the caller is actually passing a schema for case 1. I'm not sure whether Spark works like Avro and has two {{isElementType}} methods. If Spark can guarantee that the table schema is never case 1, then it is correct to use the logic in the PR. I don't think that's always the case because the table schema may come from user objects in a Dataset or from the Hive MetaStore. But, this may be a reasonable heuristic if you think case 2 is far more common than case 1. For parquet-avro, I think the user supplying a single-field record with the inner field named "element" is rare enough that it doesn't really matter, but it's up to you guys in the Spark community on this issue. One last thing: based on the rest of the schema structure, there should be only one way to match the expected schema to the file schema. You could always try both and fall back to the other case, or have a more complicated {{isElementType}} method that recurses down the sub-trees to find a match. I didn't implement this in parquet-avro because I think it's a rare problem and not worth the time. > Array of struct with a single field name "element" can't be decoded from > Parquet files written by Spark 1.6+ > > > Key: SPARK-16344 > URL: https://issues.apache.org/jira/browse/SPARK-16344 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > This is a weird corner case. 
Users may hit this issue if they have a schema > that > # has an array field whose element type is a struct, and > # the struct has one and only one field, and > # that field is named "element". > The following Spark shell snippet for Spark 1.6 reproduces this bug: > {code} > case class A(element: Long) > case class B(f: Array[A]) > val path = "/tmp/silly.parquet" > Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path) > val df = sqlContext.read.parquet(path) > df.printSchema() > // root > // |-- f0: array (nullable = true) > // ||-- element: struct (containsNull = true) > // |||-- element: long (nullable = true) > df.show() > {code} > Exception thrown: > {noformat} > org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in > block -1 in file > file:/tmp/silly.parquet/part-r-7-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at >
[jira] [Resolved] (SPARK-16379) Spark on mesos is broken due to race condition in Logging
[ https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16379. - Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.0.0 > Spark on mesos is broken due to race condition in Logging > - > > Key: SPARK-16379 > URL: https://issues.apache.org/jira/browse/SPARK-16379 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Stavros Kontopoulos >Assignee: Sean Owen >Priority: Blocker > Fix For: 2.0.0 > > Attachments: out.txt > > > This commit introduced a transient lazy log val: > https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec > This has caused problems in the past: > https://github.com/apache/spark/pull/1004 > One commit before that everything works fine. > I spotted that when my CI started to fail: > https://ci.typesafe.com/job/mit-docker-test-ref/191/ > You can easily verify it by installing mesos on your machine and try to > connect with spark shell from bin dir: > ./spark-shell --master mesos://zk://localhost:2181/mesos --conf > spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz > It gets stuck at the point where it tries to create the SparkContext. > Logging gets stuck here: > I0705 12:10:10.076617 9303 group.cpp:700] Trying to get > '/mesos/json.info_000152' in ZooKeeper > I0705 12:10:10.076920 9304 detector.cpp:479] A new leading master > (UPID=master@127.0.1.1:5050) is detected > I0705 12:10:10.076956 9303 sched.cpp:326] New master detected at > master@127.0.1.1:5050 > I0705 12:10:10.077057 9303 sched.cpp:336] No credentials provided. > Attempting to register without authentication > I0705 12:10:10.090709 9301 sched.cpp:703] Framework registered with > 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001 > I verified it also by changing @transient lazy val log to def and it works as > expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
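The difference at the heart of SPARK-16379 is how the logger field is initialized. A minimal sketch of the two variants discussed in the report follows; this is not Spark's actual Logging trait, just an illustration:
{code}
import org.slf4j.{Logger, LoggerFactory}

trait LazyValLogging {
  // The variant introduced by the offending commit: a lazy val is initialized under a
  // lock the first time it is accessed, which can go wrong when the first log call
  // arrives on another thread during SparkContext startup.
  @transient lazy val log: Logger = LoggerFactory.getLogger(getClass)
}

trait DefLogging {
  // The workaround the reporter verified: a def has no initialization lock,
  // at the cost of looking the logger up on every call.
  def log: Logger = LoggerFactory.getLogger(getClass)
}
{code}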
[jira] [Assigned] (SPARK-16400) Remove InSet filter pushdown from Parquet
[ https://issues.apache.org/jira/browse/SPARK-16400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16400: Assignee: Apache Spark > Remove InSet filter pushdown from Parquet > - > > Key: SPARK-16400 > URL: https://issues.apache.org/jira/browse/SPARK-16400 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Filter pushdown that needs to be evaluated per row is not useful to Spark, > since parquet-mr own filtering is likely to be less performant than Spark's > due to boxing and virtual function dispatches. > To simplify the code base, we should remove the InSet filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16400) Remove InSet filter pushdown from Parquet
[ https://issues.apache.org/jira/browse/SPARK-16400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365035#comment-15365035 ] Apache Spark commented on SPARK-16400: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/14076 > Remove InSet filter pushdown from Parquet > - > > Key: SPARK-16400 > URL: https://issues.apache.org/jira/browse/SPARK-16400 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin > > Filter pushdown that needs to be evaluated per row is not useful to Spark, > since parquet-mr own filtering is likely to be less performant than Spark's > due to boxing and virtual function dispatches. > To simplify the code base, we should remove the InSet filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16400) Remove InSet filter pushdown from Parquet
[ https://issues.apache.org/jira/browse/SPARK-16400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16400: Assignee: (was: Apache Spark) > Remove InSet filter pushdown from Parquet > - > > Key: SPARK-16400 > URL: https://issues.apache.org/jira/browse/SPARK-16400 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin > > Filter pushdown that needs to be evaluated per row is not useful to Spark, > since parquet-mr own filtering is likely to be less performant than Spark's > due to boxing and virtual function dispatches. > To simplify the code base, we should remove the InSet filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
[ https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15740: -- Fix Version/s: (was: 2.0.0) 2.1.0 2.0.1 > Word2VecSuite "big model load / save" caused OOM in maven jenkins builds > > > Key: SPARK-15740 > URL: https://issues.apache.org/jira/browse/SPARK-15740 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Antonio Murgia >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > [~andrewor14] noticed some OOM errors caused by "test big model load / save" > in Word2VecSuite, e.g., > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. > It doesn't show up in the test result because it was OOMed. > I'm going to disable the test first and leave this open for a proper fix. > cc [~tmnd91] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
[ https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15740: -- Target Version/s: (was: 2.0.0) > Word2VecSuite "big model load / save" caused OOM in maven jenkins builds > > > Key: SPARK-15740 > URL: https://issues.apache.org/jira/browse/SPARK-15740 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Antonio Murgia >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > [~andrewor14] noticed some OOM errors caused by "test big model load / save" > in Word2VecSuite, e.g., > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. > It doesn't show up in the test result because it was OOMed. > I'm going to disable the test first and leave this open for a proper fix. > cc [~tmnd91] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364972#comment-15364972 ] Jeff Zhang commented on SPARK-16367: [~gae...@xeberon.net] I still don't understand how binary wheels work across different OS machines, since you build the wheel on the client machine. > Wheelhouse Support for PySpark > -- > > Key: SPARK-16367 > URL: https://issues.apache.org/jira/browse/SPARK-16367 > Project: Spark > Issue Type: New Feature > Components: Deploy, PySpark >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Semet > Labels: newbie, python, python-wheel, wheelhouse > Original Estimate: 168h > Remaining Estimate: 168h > > *Rationale* > To deploy packages written in Scala, it is recommended to build big fat jar files. This puts all dependencies in one package, so the only "cost" is the copy time to deploy this file on every Spark node. > On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to bother the IT team to deploy the packages into the virtualenv of each node. > *Previous approaches* > I based the current proposal on the two following issues related to this point: > - SPARK-6764 ("Wheel support for PySpark") > - SPARK-13587 ("Support virtualenv in PySpark") > The first part of my proposal is to merge them, in order to support wheel installation and virtualenv creation. > *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* > In Python, the packaging standard is now the "wheel" file format, which goes further than the good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture or environment. > For example, look at https://pypi.python.org/pypi/numpy for all the different wheels available. > The {{pip}} tool knows how to select the right wheel file matching the current system, and how to install the package very quickly (without compilation). Said otherwise, a package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installed from a wheel file. > {{pypi.python.org}} already provides wheels for the major Python versions. If a wheel is not available, pip will compile it from source anyway. Mirroring of PyPI is possible through projects such as http://doc.devpi.net/latest/ (untested) or the PyPI mirror support in Artifactory (tested personally). > {{pip}} also makes it easy to generate all the wheels of all the packages used by a given project inside a "virtualenv". This is called a "wheelhouse". You can even skip the compilation entirely and retrieve the wheels directly from pypi.python.org. > *Use Case 1: no internet connectivity* > Here is my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a PyPI mirror. In this case the simplest way to deploy a project with several dependencies is to build and then ship the complete "wheelhouse": > - you are writing a PySpark script that grows in size and dependencies. Deploying it on Spark, for example, requires building numpy or Theano and other dependencies > - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this script into a standard Python package: > -- write a {{requirements.txt}}. I recommend pinning all package versions.
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt > {code} > astroid==1.4.6 # via pylint > autopep8==1.2.4 > click==6.6 # via pip-tools > colorama==0.3.7 # via pylint > enum34==1.1.6 # via hypothesis > findspark==1.0.0 # via spark-testing-base > first==2.0.1 # via pip-tools > hypothesis==3.4.0 # via spark-testing-base > lazy-object-proxy==1.2.2 # via astroid > linecache2==1.0.0 # via traceback2 > pbr==1.10.0 > pep8==1.7.0 # via autopep8 > pip-tools==1.6.5 > py==1.4.31 # via pytest > pyflakes==1.2.3 > pylint==1.5.6 > pytest==2.9.2 # via spark-testing-base > six==1.10.0 # via astroid, pip-tools, pylint, unittest2 > spark-testing-base==0.0.7.post2 > traceback2==1.4.0 # via unittest2 > unittest2==1.1.0 # via spark-testing-base > wheel==0.29.0 > wrapt==1.10.8 # via astroid > {code} > -- write a setup.py with some entry points or packages. Use [PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining a setup.py file really easy > -- create a virtualenv if not already in one: > {code} > virtualenv env > {code} > -- Work on your environment, define the requirements you need in {{requirements.txt}}, do all the {{pip install}} you need. > - create
[jira] [Updated] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
[ https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15740: -- Assignee: Antonio Murgia > Word2VecSuite "big model load / save" caused OOM in maven jenkins builds > > > Key: SPARK-15740 > URL: https://issues.apache.org/jira/browse/SPARK-15740 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Antonio Murgia >Priority: Critical > Fix For: 2.0.0 > > > [~andrewor14] noticed some OOM errors caused by "test big model load / save" > in Word2VecSuite, e.g., > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. > It doesn't show up in the test result because it was OOMed. > I'm going to disable the test first and leave this open for a proper fix. > cc [~tmnd91] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16401) Data Source APIs: Extending RelationProvider and CreatableRelationProvider Without SchemaRelationProvider
[ https://issues.apache.org/jira/browse/SPARK-16401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364970#comment-15364970 ] Apache Spark commented on SPARK-16401: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14075 > Data Source APIs: Extending RelationProvider and CreatableRelationProvider > Without SchemaRelationProvider > - > > Key: SPARK-16401 > URL: https://issues.apache.org/jira/browse/SPARK-16401 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Priority: Critical > > When users try to implement a data source API with extending only > RelationProvider and CreatableRelationProvider, they will hit an error when > resolving the relation. > {noformat} > spark.read > .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") > .load() > .write. > format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") > .save() > {noformat} > The error they hit is like > {noformat} > xyzDataSource does not allow user-specified schemas.; > org.apache.spark.sql.AnalysisException: xyzDataSource does not allow > user-specified schemas.; > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16401) Data Source APIs: Extending RelationProvider and CreatableRelationProvider Without SchemaRelationProvider
[ https://issues.apache.org/jira/browse/SPARK-16401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16401: Assignee: Apache Spark > Data Source APIs: Extending RelationProvider and CreatableRelationProvider > Without SchemaRelationProvider > - > > Key: SPARK-16401 > URL: https://issues.apache.org/jira/browse/SPARK-16401 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Critical > > When users try to implement a data source API with extending only > RelationProvider and CreatableRelationProvider, they will hit an error when > resolving the relation. > {noformat} > spark.read > .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") > .load() > .write. > format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") > .save() > {noformat} > The error they hit is like > {noformat} > xyzDataSource does not allow user-specified schemas.; > org.apache.spark.sql.AnalysisException: xyzDataSource does not allow > user-specified schemas.; > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16401) Data Source APIs: Extending RelationProvider and CreatableRelationProvider Without SchemaRelationProvider
[ https://issues.apache.org/jira/browse/SPARK-16401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16401: Assignee: (was: Apache Spark) > Data Source APIs: Extending RelationProvider and CreatableRelationProvider > Without SchemaRelationProvider > - > > Key: SPARK-16401 > URL: https://issues.apache.org/jira/browse/SPARK-16401 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Priority: Critical > > When users try to implement a data source API with extending only > RelationProvider and CreatableRelationProvider, they will hit an error when > resolving the relation. > {noformat} > spark.read > .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") > .load() > .write. > format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") > .save() > {noformat} > The error they hit is like > {noformat} > xyzDataSource does not allow user-specified schemas.; > org.apache.spark.sql.AnalysisException: xyzDataSource does not allow > user-specified schemas.; > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
[ https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-15740. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13509 [https://github.com/apache/spark/pull/13509] > Word2VecSuite "big model load / save" caused OOM in maven jenkins builds > > > Key: SPARK-15740 > URL: https://issues.apache.org/jira/browse/SPARK-15740 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > Fix For: 2.0.0 > > > [~andrewor14] noticed some OOM errors caused by "test big model load / save" > in Word2VecSuite, e.g., > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. > It doesn't show up in the test result because it was OOMed. > I'm going to disable the test first and leave this open for a proper fix. > cc [~tmnd91] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16401) Data Source APIs: Extending RelationProvider and CreatableRelationProvider Without SchemaRelationProvider
Xiao Li created SPARK-16401: --- Summary: Data Source APIs: Extending RelationProvider and CreatableRelationProvider Without SchemaRelationProvider Key: SPARK-16401 URL: https://issues.apache.org/jira/browse/SPARK-16401 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li Priority: Critical When users try to implement a data source API with extending only RelationProvider and CreatableRelationProvider, they will hit an error when resolving the relation. {noformat} spark.read .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") .load() .write. format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") .save() {noformat} The error they hit is like {noformat} xyzDataSource does not allow user-specified schemas.; org.apache.spark.sql.AnalysisException: xyzDataSource does not allow user-specified schemas.; at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
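For context, a minimal sketch of the kind of source the description refers to: a class that extends only RelationProvider and CreatableRelationProvider, without SchemaRelationProvider, which is the combination that hits the "does not allow user-specified schemas" check when the relation is resolved. The schema and the no-op write below are made up for illustration; this is not the actual test class.
{code}
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, RelationProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative only: a data source implementing the read and write providers
// but no SchemaRelationProvider.
class DefaultSourceWithoutUserSpecifiedSchema
    extends RelationProvider with CreatableRelationProvider {

  private def relation(ctx: SQLContext): BaseRelation = new BaseRelation {
    override val sqlContext: SQLContext = ctx
    override val schema: StructType = StructType(StructField("a", StringType) :: Nil)
  }

  // Read path: no user-specified schema is involved.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = relation(sqlContext)

  // Write path: a real source would persist `data` here before returning.
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = relation(sqlContext)
}
{code}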
[jira] [Commented] (SPARK-16371) IS NOT NULL clause gives false for nested not empty column
[ https://issues.apache.org/jira/browse/SPARK-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364959#comment-15364959 ] Apache Spark commented on SPARK-16371: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/14074 > IS NOT NULL clause gives false for nested not empty column > -- > > Key: SPARK-16371 > URL: https://issues.apache.org/jira/browse/SPARK-16371 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 2.0.0 > > > I have df where column1 is struct type and there is 1M rows. > (sample data from https://issues.apache.org/jira/browse/SPARK-16320) > {code} > df.where("column1 is not null").count() > {code} > gives: > 1M in Spark 1.6 > *0* in Spark 2.0 > Is there a change in IS NOT NULL behaviour in Spark 2.0 ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16371) IS NOT NULL clause gives false for nested not empty column
[ https://issues.apache.org/jira/browse/SPARK-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16371. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.0.0 > IS NOT NULL clause gives false for nested not empty column > -- > > Key: SPARK-16371 > URL: https://issues.apache.org/jira/browse/SPARK-16371 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 2.0.0 > > > I have df where column1 is struct type and there is 1M rows. > (sample data from https://issues.apache.org/jira/browse/SPARK-16320) > {code} > df.where("column1 is not null").count() > {code} > gives: > 1M in Spark 1.6 > *0* in Spark 2.0 > Is there a change in IS NOT NULL behaviour in Spark 2.0 ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16212) code cleanup of kafka-0-8 to match review feedback on 0-10
[ https://issues.apache.org/jira/browse/SPARK-16212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364926#comment-15364926 ] Apache Spark commented on SPARK-16212: -- User 'koeninger' has created a pull request for this issue: https://github.com/apache/spark/pull/14073 > code cleanup of kafka-0-8 to match review feedback on 0-10 > -- > > Key: SPARK-16212 > URL: https://issues.apache.org/jira/browse/SPARK-16212 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Cody Koeninger >Assignee: Cody Koeninger > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16400) Remove InSet filter pushdown from Parquet
Reynold Xin created SPARK-16400: --- Summary: Remove InSet filter pushdown from Parquet Key: SPARK-16400 URL: https://issues.apache.org/jira/browse/SPARK-16400 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Filter pushdown that needs to be evaluated per row is not useful to Spark, since parquet-mr's own filtering is likely to be less performant than Spark's due to boxing and virtual function dispatches. To simplify the code base, we should remove the InSet filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
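A small sketch of the direction proposed here, not Spark's actual ParquetFilters code: when translating data source filters (the {{In}} source filter being the counterpart of Catalyst's {{InSet}}) into parquet-mr predicates, simply decline to translate the per-row ones. The string results below stand in for real parquet-mr {{FilterPredicate}} objects.
{code}
import org.apache.spark.sql.sources._

// Illustrative only: return None for In so it is never pushed to parquet-mr,
// while cheap equality/range filters can still be pushed down as usual.
def translateFilter(filter: Filter): Option[String] = filter match {
  case EqualTo(attr, value)     => Some(s"eq($attr, $value)")
  case GreaterThan(attr, value) => Some(s"gt($attr, $value)")
  case LessThan(attr, value)    => Some(s"lt($attr, $value)")
  case In(_, _)                 => None  // evaluated per row; keep it in Spark
  case _                        => None
}
{code}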
[jira] [Comment Edited] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363438#comment-15363438 ] Vladimir Ivanov edited comment on SPARK-16334 at 7/6/16 6:54 PM: - Hi, we discovered problem with the same stacktrace in Spark 2.0. In our case it's thrown during {noformat}DataFrame.rdd{noformat} call. Moreover it somehow depends on volume of data, because it is not thrown when we change filter criteria accordingly. We used SparkSQL to write these parquet files and didn't explicitly specify WriterVersion option so I believe whatever version is set by default was used. was (Author: vivanov): Hi, we discovered problem with the same stacktrace in Spark 2.0. In our case it's thrown during DataFrame.rdd call. Moreover it somehow depends on volume of data, because it is not thrown when we change filter criteria accordingly. We used SparkSQL to write these parquet files and didn't explicitly specify WriterVersion option so I believe whatever version is set by default was used. > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363438#comment-15363438 ] Vladimir Ivanov edited comment on SPARK-16334 at 7/6/16 6:54 PM: - Hi, we discovered problem with the same stacktrace in Spark 2.0. In our case it's thrown during DataFrame.rdd call. Moreover it somehow depends on volume of data, because it is not thrown when we change filter criteria accordingly. We used SparkSQL to write these parquet files and didn't explicitly specify WriterVersion option so I believe whatever version is set by default was used. was (Author: vivanov): Hi, we discovered problem with the same stacktrace in Spark 2.0. In our case it's thrown during {noformat}DataFrame.rdd{noformat} call. Moreover it somehow depends on volume of data, because it is not thrown when we change filter criteria accordingly. We used SparkSQL to write these parquet files and didn't explicitly specify WriterVersion option so I believe whatever version is set by default was used. > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer
[ https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364870#comment-15364870 ] Dongjoon Hyun commented on SPARK-16387: --- Oh, it means Pull Request. Since you know the `JdbcDialect` class, I think you can make a code patch for that. > Reserved SQL words are not escaped by JDBC writer > - > > Key: SPARK-16387 > URL: https://issues.apache.org/jira/browse/SPARK-16387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Lev > > Here is the code (imports are omitted): > object Main extends App { > val sqlSession = SparkSession.builder().config(new SparkConf(). > setAppName("Sql Test").set("spark.app.id", "SQLTest"). > set("spark.master", "local[2]"). > set("spark.ui.enabled", "false") > .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" )) > ).getOrCreate() > import sqlSession.implicits._ > val localprops = new Properties > localprops.put("user", "") > localprops.put("password", "") > val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order") > val writer = df.write > .mode(SaveMode.Append) > writer > .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops) > } > The resulting error is: > com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error > in your SQL syntax; check the manual that corresponds to your MySQL server > version for the right syntax to use near 'order TEXT )' at line 1 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > Clearly the reserved word has to be quoted -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
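To illustrate the kind of patch being suggested, here is a minimal sketch, not Spark's actual JdbcUtils/JdbcDialect code, of the quoting the JDBC writer needs when it generates the CREATE TABLE statement; the helper name and the MySQL backtick quote character are assumptions.
{code}
// Illustrative only: wrap every column name in the dialect's identifier quote
// so reserved words such as "order" no longer break the generated DDL.
def createTableStatement(
    table: String,
    columns: Seq[(String, String)],
    quote: String = "`"): String = {
  val cols = columns.map { case (name, jdbcType) => s"$quote$name$quote $jdbcType" }
  s"CREATE TABLE $table (${cols.mkString(", ")})"
}

// createTableStatement("jira_test", Seq("order" -> "TEXT"))
// => CREATE TABLE jira_test (`order` TEXT)
{code}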
[jira] [Commented] (SPARK-8425) Add blacklist mechanism for task scheduling
[ https://issues.apache.org/jira/browse/SPARK-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364834#comment-15364834 ] Thomas Graves commented on SPARK-8425: -- Added some questions to the design doc > Add blacklist mechanism for task scheduling > --- > > Key: SPARK-8425 > URL: https://issues.apache.org/jira/browse/SPARK-8425 > Project: Spark > Issue Type: Improvement > Components: Scheduler, YARN >Reporter: Saisai Shao >Assignee: Imran Rashid >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"
[ https://issues.apache.org/jira/browse/SPARK-16399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16399: Assignee: Apache Spark > Set PYSPARK_PYTHON to point to "python" instead of "python2.7" > -- > > Key: SPARK-16399 > URL: https://issues.apache.org/jira/browse/SPARK-16399 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Manoj Kumar >Assignee: Apache Spark >Priority: Minor > > Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even > though higher versions of Python seem to be installed. > It should be better to force "PYSPARK_PYTHON" to python instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"
[ https://issues.apache.org/jira/browse/SPARK-16399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16399: Assignee: (was: Apache Spark) > Set PYSPARK_PYTHON to point to "python" instead of "python2.7" > -- > > Key: SPARK-16399 > URL: https://issues.apache.org/jira/browse/SPARK-16399 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Manoj Kumar >Priority: Minor > > Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even > though higher versions of Python seem to be installed. > It should be better to force "PYSPARK_PYTHON" to python instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"
[ https://issues.apache.org/jira/browse/SPARK-16399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364814#comment-15364814 ] Apache Spark commented on SPARK-16399: -- User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/14016 > Set PYSPARK_PYTHON to point to "python" instead of "python2.7" > -- > > Key: SPARK-16399 > URL: https://issues.apache.org/jira/browse/SPARK-16399 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Manoj Kumar >Priority: Minor > > Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even > though higher versions of Python seem to be installed. > It should be better to force "PYSPARK_PYTHON" to python instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"
Manoj Kumar created SPARK-16399: --- Summary: Set PYSPARK_PYTHON to point to "python" instead of "python2.7" Key: SPARK-16399 URL: https://issues.apache.org/jira/browse/SPARK-16399 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Manoj Kumar Priority: Minor Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even though higher versions of Python seem to be installed. It would be better to point "PYSPARK_PYTHON" to "python" instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16394) Timestamp conversion error in pyspark.sql.Row.asDict because of timezones
[ https://issues.apache.org/jira/browse/SPARK-16394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364792#comment-15364792 ] Martin Tapp commented on SPARK-16394: - It seems the root problem is that the conversion from Spark's internal representation to a PySpark Row object already causes the timezone conversion problem. Hence, the only fix we have for now is to cast the column to StringType. > Timestamp conversion error in pyspark.sql.Row.asDict because of timezones > - > > Key: SPARK-16394 > URL: https://issues.apache.org/jira/browse/SPARK-16394 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Martin Tapp >Priority: Minor > > We use DataFrame.map to convert each row to a dictionary using Row.asDict(). > The problem occurs when a Timestamp column is converted. It seems the > Timestamp gets converted to a naive Python datetime. This causes processing > errors since all naive datetimes get adjusted to the process' timezone. For > instance, a Timestamp with a time of midnight see's it's time bounce based on > the local timezone (+/- x hours). > Current fix is to apply the pytz.utc timezone to each datetime instance. > Proposed solution is to make all datetime instances aware and use the > pytz.utc timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16394) Timestamp conversion error in pyspark.sql.Row because of timezones
[ https://issues.apache.org/jira/browse/SPARK-16394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Tapp updated SPARK-16394: Summary: Timestamp conversion error in pyspark.sql.Row because of timezones (was: Timestamp conversion error in pyspark.sql.Row.asDict because of timezones) > Timestamp conversion error in pyspark.sql.Row because of timezones > -- > > Key: SPARK-16394 > URL: https://issues.apache.org/jira/browse/SPARK-16394 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Martin Tapp >Priority: Minor > > We use DataFrame.map to convert each row to a dictionary using Row.asDict(). > The problem occurs when a Timestamp column is converted. It seems the > Timestamp gets converted to a naive Python datetime. This causes processing > errors since all naive datetimes get adjusted to the process' timezone. For > instance, a Timestamp with a time of midnight see's it's time bounce based on > the local timezone (+/- x hours). > Current fix is to apply the pytz.utc timezone to each datetime instance. > Proposed solution is to make all datetime instances aware and use the > pytz.utc timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module
[ https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364762#comment-15364762 ] Charles Allen commented on SPARK-6028: -- ClassLoader problem on my side. The loader was pulling in 1.5.2 classes for the driver but 1.6.1 classes in the tasks. Ideally the default behavior would not have changed: the tasks would have launched, and the class version conflicts would have shown up in the logs, rather than as a URI naming conflict. > Provide an alternative RPC implementation based on the network transport > module > --- > > Key: SPARK-6028 > URL: https://issues.apache.org/jira/browse/SPARK-6028 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 1.6.0 > > > Network transport module implements a low level RPC interface. We can build a > new RPC implementation on top of that to replace Akka's. > Design document: > https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module
[ https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364757#comment-15364757 ] Charles Allen commented on SPARK-6028: -- It was semi-related. The patch changed the default from akka to netty, and I had an improper classloader in my app that was loading the 1.5.2 classes instead of the 1.6.1 classes. > Provide an alternative RPC implementation based on the network transport > module > --- > > Key: SPARK-6028 > URL: https://issues.apache.org/jira/browse/SPARK-6028 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 1.6.0 > > > Network transport module implements a low level RPC interface. We can build a > new RPC implementation on top of that to replace Akka's. > Design document: > https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16398) Make cancelJob and cancelStage API public
[ https://issues.apache.org/jira/browse/SPARK-16398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mitesh updated SPARK-16398: --- Description: Make the SparkContext {{cancelJob}} and {{cancelStage}} APIs public. This allows applications to use {{SparkListener}} to do their own management of jobs via events, but without using the REST API. (was: Make the SparkContext {{cancelJob}} and {{cancelStage}} APIs public. This allows applications to use `SparkListener` to do their own management of jobs via events, but without using the REST API.) > Make cancelJob and cancelStage API public > - > > Key: SPARK-16398 > URL: https://issues.apache.org/jira/browse/SPARK-16398 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Mitesh >Priority: Trivial > > Make the SparkContext {{cancelJob}} and {{cancelStage}} APIs public. This > allows applications to use {{SparkListener}} to do their own management of > jobs via events, but without using the REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
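A sketch of the usage this request would enable, assuming {{cancelJob}} becomes public as proposed; the listener and method names below are illustrative, not an existing API.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Illustrative only: track job ids from listener events so the application
// can later cancel a job directly, without going through the REST API.
class CancellableJobs(sc: SparkContext) extends SparkListener {
  @volatile private var lastJobId: Int = -1

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    lastJobId = jobStart.jobId
  }

  def cancelLastJob(): Unit = {
    if (lastJobId >= 0) sc.cancelJob(lastJobId)  // public under this proposal
  }
}

// Registration, e.g.: sc.addSparkListener(new CancellableJobs(sc))
{code}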