[jira] [Updated] (SPARK-20407) ParquetQuerySuite 'Enabling/disabling ignoreCorruptFiles' flaky test
[ https://issues.apache.org/jira/browse/SPARK-20407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20407: Fix Version/s: 2.1.1 > ParquetQuerySuite 'Enabling/disabling ignoreCorruptFiles' flaky test > > > Key: SPARK-20407 > URL: https://issues.apache.org/jira/browse/SPARK-20407 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu > Fix For: 2.1.1, 2.2.0 > > > ParquetQuerySuite test "Enabling/disabling ignoreCorruptFiles" can sometimes > fail. This is caused by the fact that when one task fails, the driver call > returns and test code continues, but there might still be tasks running that > will be killed at the next killing point. > There are 2 specific issues created by this: > 1. Files can be closed some time after the test finishes, so > DebugFilesystem.assertNoOpenStreams fails. One solution for this is to change > SharedSqlContext and call assertNoOpenStreams inside eventually {} > 2. ParquetFileReader constructor from apache parquet 1.8.2 can leak a stream > at line 538. This happens when the next line throws an exception. So, the > constructor fails and Spark doesn't have any way to close the file. > This happens in this test because the test deletes the temporary directory at > the end (but while tasks might still be running). Deleting the directory > causes the constructor to fail. > The solution for this could be to Thread.sleep at the end of the test or to > somehow wait for all tasks to be definitely killed before finishing the test -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
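The first fix mentioned above, calling assertNoOpenStreams inside eventually {}, relies on retrying an assertion until straggler tasks have released their files. A minimal Python sketch of that retry pattern (ScalaTest's eventually provides the JVM equivalent used in SharedSqlContext; all names below are illustrative stand-ins, not Spark test code):

```python
import time

def eventually(assertion, timeout=10.0, interval=0.1):
    """Retry `assertion` until it stops raising AssertionError or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return assertion()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)

# Simulate streams that straggler tasks close shortly after the driver call returns.
open_streams = ["stream-1", "stream-2"]

def assert_no_open_streams():
    open_streams.pop()  # stand-in for a stream being closed between retries
    assert not open_streams, "open streams remain: %s" % open_streams

eventually(assert_no_open_streams)
print("no open streams")
```

The point is that the assertion is allowed to fail transiently while tasks wind down, instead of failing the test on the first check.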
[jira] [Commented] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980268#comment-15980268 ]

Yan Facai (颜发才) commented on SPARK-16957:
-----------------------------------------

[~vlad.feinberg] Hi, I found that R's gbm uses the mean value, not the weighted mean. Hence the first phrase has been removed from the description.

> Use weighted midpoints for split values.
> ----------------------------------------
>
>                 Key: SPARK-16957
>                 URL: https://issues.apache.org/jira/browse/SPARK-16957
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Vladimir Feinberg
>            Priority: Trivial
>
> We should be using weighted split points rather than the actual continuous
> binned feature values. For instance, in a dataset containing binary features
> (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}}
> and {{x > 0.0}}. For any real data with some smoothness qualities, this is
> asymptotically bad compared to GBM's approach. The split point should be a
> weighted split point of the two values of the "innermost" feature bins; e.g.,
> if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at
> {{0.75}}.
>
> Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333
> {code}
[jira] [Updated] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Facai (颜发才) updated SPARK-16957:
------------------------------------
    Description: 
We should be using weighted split points rather than the actual continuous binned feature values. For instance, in a dataset containing binary features (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some smoothness qualities, this is asymptotically bad compared to GBM's approach. The split point should be a weighted split point of the two values of the "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at {{0.75}}.

Example:
{code}
+--------+--------+-----+-----+
|feature0|feature1|label|count|
+--------+--------+-----+-----+
|     0.0|     0.0|  0.0|   23|
|     1.0|     0.0|  0.0|    2|
|     0.0|     0.0|  1.0|    2|
|     0.0|     1.0|  0.0|    7|
|     1.0|     0.0|  1.0|   23|
|     0.0|     1.0|  1.0|   18|
|     1.0|     1.0|  1.0|    7|
|     1.0|     1.0|  0.0|   18|
+--------+--------+-----+-----+
DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
    Predict: -0.56
   Else (feature 1 > 0.0)
    Predict: 0.29333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
    Predict: 0.56
   Else (feature 1 > 0.0)
    Predict: -0.29333
{code}

  was:
Just like R's gbm, we should be using weighted split points rather than the actual continuous binned feature values. For instance, in a dataset containing binary features (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some smoothness qualities, this is asymptotically bad compared to GBM's approach. The split point should be a weighted split point of the two values of the "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at {{0.75}}.

Example:
{code}
+--------+--------+-----+-----+
|feature0|feature1|label|count|
+--------+--------+-----+-----+
|     0.0|     0.0|  0.0|   23|
|     1.0|     0.0|  0.0|    2|
|     0.0|     0.0|  1.0|    2|
|     0.0|     1.0|  0.0|    7|
|     1.0|     0.0|  1.0|   23|
|     0.0|     1.0|  1.0|   18|
|     1.0|     1.0|  1.0|    7|
|     1.0|     1.0|  0.0|   18|
+--------+--------+-----+-----+
DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
    Predict: -0.56
   Else (feature 1 > 0.0)
    Predict: 0.29333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
    Predict: 0.56
   Else (feature 1 > 0.0)
    Predict: -0.29333
{code}


> Use weighted midpoints for split values.
> ----------------------------------------
>
>                 Key: SPARK-16957
>                 URL: https://issues.apache.org/jira/browse/SPARK-16957
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Vladimir Feinberg
>            Priority: Trivial
>
> We should be using weighted split points rather than the actual continuous
> binned feature values. For instance, in a dataset containing binary features
> (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}}
> and {{x > 0.0}}. For any real data with some smoothness qualities, this is
> asymptotically bad compared to GBM's approach. The split point should be a
> weighted split point of the two values of the "innermost" feature bins; e.g.,
> if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at
> {{0.75}}.
>
> Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333
> {code}
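For concreteness, here is one convention for a weighted split point that reproduces the {{0.75}} figure in the description: each bin value is weighted by the count on the opposite side, which pulls the split toward the sparser bin. This formula is an illustration of the idea only, not Spark's or GBM's actual implementation:

```python
def weighted_midpoint(x_left, n_left, x_right, n_right):
    # Weight each bin value by the count of the *other* bin: with 30 points
    # at x = 0 and 10 points at x = 1 the split lands at 0.75, close to the
    # sparse bin, instead of the naive midpoint 0.5.
    return (n_right * x_left + n_left * x_right) / (n_left + n_right)

print(weighted_midpoint(0.0, 30, 1.0, 10))  # 0.75
```

With equal counts on both sides, the formula reduces to the plain midpoint of the two bin values.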
[jira] [Assigned] (SPARK-20132) Add documentation for column string functions
[ https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-20132: --- Assignee: Michael Patterson > Add documentation for column string functions > - > > Key: SPARK-20132 > URL: https://issues.apache.org/jira/browse/SPARK-20132 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Assignee: Michael Patterson >Priority: Minor > Labels: documentation, newbie > Fix For: 2.3.0 > > > Four Column string functions do not have documentation for PySpark: > rlike > like > startswith > endswith > These functions are called through the _bin_op interface, which allows the > passing of a docstring. I have added docstrings with examples to each of the > four functions.
[jira] [Resolved] (SPARK-20132) Add documentation for column string functions
[ https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-20132. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17469 [https://github.com/apache/spark/pull/17469] > Add documentation for column string functions > - > > Key: SPARK-20132 > URL: https://issues.apache.org/jira/browse/SPARK-20132 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Assignee: Michael Patterson >Priority: Minor > Labels: documentation, newbie > Fix For: 2.3.0 > > > Four Column string functions do not have documentation for PySpark: > rlike > like > startswith > endswith > These functions are called through the _bin_op interface, which allows the > passing of a docstring. I have added docstrings with examples to each of the > four functions.
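The _bin_op mechanism referred to above can be sketched as follows. This is a simplified model of how PySpark builds Column operator methods and attaches docstrings to them; the Column class here is a stand-in for illustration, not the real pyspark.sql.Column:

```python
def _bin_op(name, doc="binary operator"):
    """Build a method that forwards to the wrapped JVM column object,
    attaching `doc` as its docstring (simplified model of PySpark's helper)."""
    def _(self, other):
        return getattr(self._jc, name)(other)
    _.__doc__ = doc
    _.__name__ = name
    return _

class Column:
    """Stand-in for pyspark.sql.Column."""
    def __init__(self, jc):
        self._jc = jc  # wrapped JVM column (here: any object with the named methods)

    # Passing a docstring means help(Column.rlike) is no longer empty.
    rlike = _bin_op("rlike", "Return a boolean Column from a SQL RLIKE (regex) match.")
    startswith = _bin_op("startswith", "Return a boolean Column: string starts with `other`.")

print(Column.rlike.__doc__)
```

Because the docstring is attached where the method is generated, all four functions can be documented without unrolling the shared delegation logic.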
[jira] [Updated] (SPARK-20440) Allow SparkR session and context to have delayed binding
[ https://issues.apache.org/jira/browse/SPARK-20440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinayak Joshi updated SPARK-20440:
----------------------------------
    Description: 
It would be useful if users could do something like this without first invoking {{sparkR.session()}}:

{code}
delayedAssign(".sparkRsession", { sparkR.session(..) }, assign.env=SparkR:::.sparkREnv)
{code}

This would help providers of interactive environments that bootstrap Spark for their users but where the user code need not always include SparkR, so the possibility of lazy semantics for setting up a SparkSession/Context would be very useful.

Note that the SparkR API does not have a single entry object (such as the Scala/Python SparkSession classes), so it is the only environment where such lazy setup is currently difficult to achieve; this enhancement will make it easier.

The changes required are minor and do not affect the external API or functionality in any way. I will attach a PR with the changes needed for consideration shortly.

  was:
It would be useful if users could do something like this without first invoking {{sparkR.session()}}:

{code}
delayedAssign(".sparkRsession", { sparkR.session(..) }, assign.env=SparkR:::.sparkREnv)
{code}

This would help providers of interactive environments that bootstrap Spark for their users, but the user code need not always include SparkR, and so the possibility of lazy semantics for setting up a SparkSession/Context would be very useful.

Note that the SparkR API does not have a single entry object (such as the Scala/Python SparkSession classes), so it is the only environment where such lazy setup is currently difficult to achieve; this enhancement will make it easier.

The changes required are minor and do not affect the external API or functionality in any way. I will attach a PR with the changes needed for consideration shortly.


> Allow SparkR session and context to have delayed binding
> --------------------------------------------------------
>
>                 Key: SPARK-20440
>                 URL: https://issues.apache.org/jira/browse/SPARK-20440
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 2.1.0
>            Reporter: Vinayak Joshi
>
> It would be useful if users could do something like this without first
> invoking {{sparkR.session()}}:
> {code}
> delayedAssign(".sparkRsession", { sparkR.session(..) },
> assign.env=SparkR:::.sparkREnv)
> {code}
> This would help providers of interactive environments that bootstrap Spark
> for their users but where the user code need not always include SparkR, so
> the possibility of lazy semantics for setting up a SparkSession/Context
> would be very useful.
> Note that the SparkR API does not have a single entry object (such as the
> Scala/Python SparkSession classes), so it is the only environment where such
> lazy setup is currently difficult to achieve; this enhancement will make it
> easier.
> The changes required are minor and do not affect the external API or
> functionality in any way. I will attach a PR with the changes needed for
> consideration shortly.
[jira] [Assigned] (SPARK-20440) Allow SparkR session and context to have delayed binding
[ https://issues.apache.org/jira/browse/SPARK-20440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20440: Assignee: (was: Apache Spark) > Allow SparkR session and context to have delayed binding > > > Key: SPARK-20440 > URL: https://issues.apache.org/jira/browse/SPARK-20440 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Vinayak Joshi > > It would be useful if users could do something like this without first > invoking {{sparkR.session()}}: > {code} > delayedAssign(".sparkRsession", { sparkR.session(..) }, > assign.env=SparkR:::.sparkREnv) > {code} > This would help providers of interactive environments that bootstrap Spark > for their users but the user code need not always include SparkR and so > possibility of lazy semantics for setting up a SparkSession/Context would be > very useful. > Note that SparkR API does not have a single entry object (such as > Scala/Python SparkSession classes) so it's the only env where such lazy setup > is currently difficult to achieve, so doing this enhancement will make it > easier. > The changes required are minor and do not affect the external API or > functionality in any way. I will attach a PR with the changes needed for > consideration shortly.
[jira] [Assigned] (SPARK-20440) Allow SparkR session and context to have delayed binding
[ https://issues.apache.org/jira/browse/SPARK-20440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20440: Assignee: Apache Spark > Allow SparkR session and context to have delayed binding > > > Key: SPARK-20440 > URL: https://issues.apache.org/jira/browse/SPARK-20440 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Vinayak Joshi >Assignee: Apache Spark > > It would be useful if users could do something like this without first > invoking {{sparkR.session()}}: > {code} > delayedAssign(".sparkRsession", { sparkR.session(..) }, > assign.env=SparkR:::.sparkREnv) > {code} > This would help providers of interactive environments that bootstrap Spark > for their users but the user code need not always include SparkR and so > possibility of lazy semantics for setting up a SparkSession/Context would be > very useful. > Note that SparkR API does not have a single entry object (such as > Scala/Python SparkSession classes) so it's the only env where such lazy setup > is currently difficult to achieve, so doing this enhancement will make it > easier. > The changes required are minor and do not affect the external API or > functionality in any way. I will attach a PR with the changes needed for > consideration shortly.
[jira] [Commented] (SPARK-20440) Allow SparkR session and context to have delayed binding
[ https://issues.apache.org/jira/browse/SPARK-20440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980163#comment-15980163 ] Apache Spark commented on SPARK-20440: -- User 'vijoshi' has created a pull request for this issue: https://github.com/apache/spark/pull/17731 > Allow SparkR session and context to have delayed binding > > > Key: SPARK-20440 > URL: https://issues.apache.org/jira/browse/SPARK-20440 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Vinayak Joshi > > It would be useful if users could do something like this without first > invoking {{sparkR.session()}}: > {code} > delayedAssign(".sparkRsession", { sparkR.session(..) }, > assign.env=SparkR:::.sparkREnv) > {code} > This would help providers of interactive environments that bootstrap Spark > for their users but the user code need not always include SparkR and so > possibility of lazy semantics for setting up a SparkSession/Context would be > very useful. > Note that SparkR API does not have a single entry object (such as > Scala/Python SparkSession classes) so it's the only env where such lazy setup > is currently difficult to achieve, so doing this enhancement will make it > easier. > The changes required are minor and do not affect the external API or > functionality in any way. I will attach a PR with the changes needed for > consideration shortly.
[jira] [Created] (SPARK-20440) Allow SparkR session and context to have delayed binding
Vinayak Joshi created SPARK-20440: - Summary: Allow SparkR session and context to have delayed binding Key: SPARK-20440 URL: https://issues.apache.org/jira/browse/SPARK-20440 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.1.0 Reporter: Vinayak Joshi It would be useful if users could do something like this without first invoking {{sparkR.session()}}: {code} delayedAssign(".sparkRsession", { sparkR.session(..) }, assign.env=SparkR:::.sparkREnv) {code} This would help providers of interactive environments that bootstrap Spark for their users but the user code need not always include SparkR and so possibility of lazy semantics for setting up a SparkSession/Context would be very useful. Note that SparkR API does not have a single entry object (such as Scala/Python SparkSession classes) so it's the only env where such lazy setup is currently difficult to achieve, so doing this enhancement will make it easier. The changes required are minor and do not affect the external API or functionality in any way. I will attach a PR with the changes needed for consideration shortly.
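The delayedAssign() idea above, where bootstrap code registers a promise and the session is only created when user code first touches it, can be sketched in Python. The names are illustrative only; SparkR's actual mechanism is the R lazy promise shown in the issue description:

```python
class LazyBinding:
    """Create the wrapped value on first access only, like R's delayedAssign()."""
    def __init__(self, factory):
        self._factory = factory
        self._value = None

    def get(self):
        if self._value is None:
            self._value = self._factory()  # expensive bootstrap runs here, once
        return self._value

created = []

def boot_session():
    created.append("bootstrapped")  # stand-in for starting the JVM / session
    return "session"

binding = LazyBinding(boot_session)
assert created == []               # registering the binding starts nothing
assert binding.get() == "session"  # first access triggers the bootstrap
binding.get()
assert created == ["bootstrapped"]  # and it ran exactly once
```

This is exactly the property the JIRA asks for: environments that pre-register the binding pay nothing unless the user's code actually uses SparkR.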
[jira] [Assigned] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables
[ https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20439: Assignee: Xiao Li (was: Apache Spark) > Catalog.listTables() depends on all libraries used to create tables > --- > > Key: SPARK-20439 > URL: https://issues.apache.org/jira/browse/SPARK-20439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > spark.catalog.listTables() and getTable > You may get an error on the table serde library: > java.lang.RuntimeException: java.lang.ClassNotFoundException: > com.amazon.emr.kinesis.hive.KinesisHiveInputFormat > Or if the database contains any table (e.g., index) with a table type that is > not accessible by Spark SQL, it will fail the whole listTable API.
[jira] [Assigned] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables
[ https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20439: Assignee: Apache Spark (was: Xiao Li) > Catalog.listTables() depends on all libraries used to create tables > --- > > Key: SPARK-20439 > URL: https://issues.apache.org/jira/browse/SPARK-20439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > spark.catalog.listTables() and getTable > You may get an error on the table serde library: > java.lang.RuntimeException: java.lang.ClassNotFoundException: > com.amazon.emr.kinesis.hive.KinesisHiveInputFormat > Or if the database contains any table (e.g., index) with a table type that is > not accessible by Spark SQL, it will fail the whole listTable API.
[jira] [Commented] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables
[ https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980155#comment-15980155 ] Apache Spark commented on SPARK-20439: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17730 > Catalog.listTables() depends on all libraries used to create tables > --- > > Key: SPARK-20439 > URL: https://issues.apache.org/jira/browse/SPARK-20439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > spark.catalog.listTables() and getTable > You may get an error on the table serde library: > java.lang.RuntimeException: java.lang.ClassNotFoundException: > com.amazon.emr.kinesis.hive.KinesisHiveInputFormat > Or if the database contains any table (e.g., index) with a table type that is > not accessible by Spark SQL, it will fail the whole listTable API.
[jira] [Assigned] (SPARK-20438) R wrappers for split and repeat
[ https://issues.apache.org/jira/browse/SPARK-20438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20438: Assignee: (was: Apache Spark) > R wrappers for split and repeat > --- > > Key: SPARK-20438 > URL: https://issues.apache.org/jira/browse/SPARK-20438 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz > > SparkR wrappers for {{o.a.s.sql.functions.split}} and > {{o.a.s.sql.functions.repeat}}.
[jira] [Assigned] (SPARK-20438) R wrappers for split and repeat
[ https://issues.apache.org/jira/browse/SPARK-20438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20438: Assignee: Apache Spark > R wrappers for split and repeat > --- > > Key: SPARK-20438 > URL: https://issues.apache.org/jira/browse/SPARK-20438 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark > > SparkR wrappers for {{o.a.s.sql.functions.split}} and > {{o.a.s.sql.functions.repeat}}.
[jira] [Commented] (SPARK-20438) R wrappers for split and repeat
[ https://issues.apache.org/jira/browse/SPARK-20438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980153#comment-15980153 ] Apache Spark commented on SPARK-20438: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17729 > R wrappers for split and repeat > --- > > Key: SPARK-20438 > URL: https://issues.apache.org/jira/browse/SPARK-20438 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz > > SparkR wrappers for {{o.a.s.sql.functions.split}} and > {{o.a.s.sql.functions.repeat}}.
[jira] [Created] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables
Xiao Li created SPARK-20439: --- Summary: Catalog.listTables() depends on all libraries used to create tables Key: SPARK-20439 URL: https://issues.apache.org/jira/browse/SPARK-20439 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Xiao Li Assignee: Xiao Li spark.catalog.listTables() and getTable You may get an error on the table serde library: java.lang.RuntimeException: java.lang.ClassNotFoundException: com.amazon.emr.kinesis.hive.KinesisHiveInputFormat Or if the database contains any table (e.g., index) with a table type that is not accessible by Spark SQL, it will fail the whole listTable API.
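One way to read the bug above: a single table whose serde library is missing, or whose type Spark SQL cannot read, should not fail the entire listing call. A sketch of per-table fault isolation, with stand-in names rather than Spark's internal API (the JIRA does not spell out the eventual fix, so this is only an illustration of the principle):

```python
def list_tables_tolerant(table_names, load_metadata):
    """Collect metadata per table, degrading to a name-only entry when a
    table's libraries are unavailable instead of failing the whole call."""
    results = []
    for name in table_names:
        try:
            results.append({"name": name, "meta": load_metadata(name)})
        except RuntimeError:  # e.g. a ClassNotFoundException surfaced by Hive
            results.append({"name": name, "meta": None})
    return results

def load_metadata(name):
    # Simulates a table whose input format class is not on the classpath.
    if name == "kinesis_tbl":
        raise RuntimeError("ClassNotFoundException: "
                           "com.amazon.emr.kinesis.hive.KinesisHiveInputFormat")
    return {"format": "parquet"}

tables = list_tables_tolerant(["plain_tbl", "kinesis_tbl"], load_metadata)
print([t["name"] for t in tables])  # both tables listed, one without metadata
```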
[jira] [Created] (SPARK-20438) R wrappers for split and repeat
Maciej Szymkiewicz created SPARK-20438: -- Summary: R wrappers for split and repeat Key: SPARK-20438 URL: https://issues.apache.org/jira/browse/SPARK-20438 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.2.0 Reporter: Maciej Szymkiewicz SparkR wrappers for {{o.a.s.sql.functions.split}} and {{o.a.s.sql.functions.repeat}}.
[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final
[ https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980133#comment-15980133 ]

Pawel Szulc commented on SPARK-19552:
-------------------------------------

Wherever I go these days (while working on Spark-based projects) I have to deal with this issue. Elastic4s is on 4.1.x; mongo clients are on 4.1.x. I understand this is a breaking change, but could it be treated with a higher priority? I can only imagine I'm not the only person with this issue...

> Upgrade Netty version to 4.1.8 final
> ------------------------------------
>
>                 Key: SPARK-19552
>                 URL: https://issues.apache.org/jira/browse/SPARK-19552
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.1.0
>            Reporter: Adam Roberts
>            Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous
> major versions (like Netty 4.0.x), see
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern but not one we'd be
> exposed to with Spark "out of the box". Let's upgrade the version we use to
> be on the safe side, as the security fix I'm especially interested in is not
> available in the 4.0.x release line.
> We should move up anyway to take on a bunch of other bug fixes cited in the
> release notes (and if anyone were to use Spark with netty and tcnative, they
> shouldn't be exposed to the security problem) - we should be good citizens
> and make this change.
> As this 4.1 version involves API changes we'll need to implement a few
> methods and possibly adjust the Sasl tests. This JIRA and associated pull
> request start the process, which I'll work on - and any help would be much
> appreciated!
> Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise)
>     throws Exception {
>   if (!foundEncryptionHandler) {
>     foundEncryptionHandler =
>       ctx.channel().pipeline().get(encryptHandlerName) != null; // <-- this returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java
> {code}
> requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java
> {code}
> requires the above methods too
> {code}
> common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java
> {code}
> With "dummy" implementations so we can at least compile and test, we'll see
> five new test failures to address.
> These are:
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}
[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980081#comment-15980081 ] Maciej Szymkiewicz commented on SPARK-20208: [~felixcheung] I believe this can be marked as resolved. > Document R fpGrowth support in vignettes, programming guide and code example > > > Key: SPARK-20208 > URL: https://issues.apache.org/jira/browse/SPARK-20208 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Maciej Szymkiewicz >
[jira] [Commented] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests
[ https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980056#comment-15980056 ] Marcelo Vanzin commented on SPARK-20435: bq. Providing passwords that way is supported by Spark That's not an argument. You can type secrets in any command line and that doesn't make it OK. bq. and ps ax works for users with appropriate privileges "ps ax" works for all users. Try it for yourself. {noformat} vanzin@vanzin-t460p:/work/apache/spark-prs$ sudo sleep 10 {noformat} {noformat} vanzin@vanzin-t460p:/tmp$ ps axu | grep sleep root 32583 0.0 0.0 73264 4524 pts/5S+ 10:55 0:00 sudo sleep 10 root 32584 0.0 0.0 7296 760 pts/5S+ 10:55 0:00 sleep 10 vanzin 32586 0.0 0.0 14232 1072 pts/2S+ 10:56 0:00 grep --color=auto sleep {noformat} I'm not saying redacting from logs is useless, but I'm saying that a user that is providing secrets in the command line is giving up any security, and redaction won't save him. > More thorough redaction of sensitive information from logs/UI, more unit tests > -- > > Key: SPARK-20435 > URL: https://issues.apache.org/jira/browse/SPARK-20435 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Mark Grover > > SPARK-18535 and SPARK-19720 were works to redact sensitive information (e.g. > hadoop credential provider password, AWS access/secret keys) from event logs > + YARN logs + UI and from the console output, respectively. > While some unit tests were added along with these changes - they asserted > when a sensitive key was found, that redaction took place for that key. They > didn't assert globally that when running a full-fledged Spark app (whether or > YARN or locally), that sensitive information was not present in any of the > logs or UI. Such a test would also prevent regressions from happening in the > future if someone unknowingly adds extra logging that publishes out sensitive > information to disk or UI. 
> Consequently, it was found that in some Java configurations, sensitive > information was still being leaked in the event logs under the > {{SparkListenerEnvironmentUpdate}} event, like so: > {code} > "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf > spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ... > {code} > "secret_password" should have been redacted. > Moreover, the previous redaction logic only checked whether the key matched the > secret regex pattern; if it did, its value was redacted. That worked for most cases. > However, in the above case, the key (sun.java.command) doesn't tell much, so > the value needs to be searched. So the check needs to be expanded to match > against values as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
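[Editor's note] The expanded key-and-value check described above can be sketched as follows. This is a hypothetical Python illustration, not Spark's actual Scala implementation; the regexes and names here are assumptions:

```python
import re

# Keys that look sensitive (assumed pattern, not Spark's real default regex).
SECRET_KEY_PATTERN = re.compile(r"(?i)secret|password|token|access[.]key")
REDACTION_TEXT = "*********(redacted)"

def redact(conf):
    """Return a copy of conf with sensitive keys *and* values redacted."""
    redacted = {}
    for key, value in conf.items():
        if SECRET_KEY_PATTERN.search(key):
            # The key itself is sensitive: hide its value entirely.
            redacted[key] = REDACTION_TEXT
        else:
            # The key is innocuous (e.g. sun.java.command), so scan the value
            # for embedded "<secret-looking-key>=<secret>" fragments instead.
            redacted[key] = re.sub(
                r"(\S*(?:secret|password|token)\S*=)\S+",
                r"\1" + REDACTION_TEXT,
                value,
                flags=re.IGNORECASE,
            )
    return redacted
```

With the key-only check, the `sun.java.command` entry below would pass through untouched; scanning values catches it.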
[jira] [Resolved] (SPARK-20430) Throw an NullPointerException in range when wholeStage is off
[ https://issues.apache.org/jira/browse/SPARK-20430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20430. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.2.0 > Throw an NullPointerException in range when wholeStage is off > - > > Key: SPARK-20430 > URL: https://issues.apache.org/jira/browse/SPARK-20430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro > Fix For: 2.2.0 > > > I hit an exception below in master; > {code} > sql("SET spark.sql.codegen.wholeStage=false") > sql("SELECT * FROM range(1)").show > 17/04/20 17:11:05 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.NullPointerException > at > org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:54) > at > org.apache.spark.sql.execution.RangeExec.numSlices(basicPhysicalOperators.scala:343) > at > org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:506) > at > org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:505) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:320) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20437) R wrappers for rollup and cube
[ https://issues.apache.org/jira/browse/SPARK-20437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20437: Assignee: Apache Spark > R wrappers for rollup and cube > -- > > Key: SPARK-20437 > URL: https://issues.apache.org/jira/browse/SPARK-20437 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Minor > > Add SparkR wrappers for {{Dataset.cube}} and {{Dataset.rollup}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20437) R wrappers for rollup and cube
[ https://issues.apache.org/jira/browse/SPARK-20437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20437: Assignee: (was: Apache Spark) > R wrappers for rollup and cube > -- > > Key: SPARK-20437 > URL: https://issues.apache.org/jira/browse/SPARK-20437 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > Add SparkR wrappers for {{Dataset.cube}} and {{Dataset.rollup}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20437) R wrappers for rollup and cube
[ https://issues.apache.org/jira/browse/SPARK-20437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979949#comment-15979949 ] Apache Spark commented on SPARK-20437: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17728 > R wrappers for rollup and cube > -- > > Key: SPARK-20437 > URL: https://issues.apache.org/jira/browse/SPARK-20437 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > Add SparkR wrappers for {{Dataset.cube}} and {{Dataset.rollup}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20437) R wrappers for rollup and cube
Maciej Szymkiewicz created SPARK-20437: -- Summary: R wrappers for rollup and cube Key: SPARK-20437 URL: https://issues.apache.org/jira/browse/SPARK-20437 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.2.0 Reporter: Maciej Szymkiewicz Priority: Minor Add SparkR wrappers for {{Dataset.cube}} and {{Dataset.rollup}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
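[Editor's note] For context on what the proposed wrappers expose: rollup and cube differ only in which grouping sets they expand the grouping columns into. A minimal Python sketch of those semantics (illustrative only, not the SparkR or Dataset API):

```python
from itertools import combinations

def rollup_sets(cols):
    """rollup(a, b, c) groups by every prefix: (a,b,c), (a,b), (a,), ()."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    """cube(a, b, c) groups by every subset of the columns (the power set)."""
    return [subset
            for r in range(len(cols), -1, -1)
            for subset in combinations(cols, r)]
```

So for n grouping columns, rollup produces n+1 grouping sets while cube produces 2^n.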
[jira] [Assigned] (SPARK-20386) The log info "Added %s in memory on %s (size: %s, free: %s)" in function "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate if the block exist
[ https://issues.apache.org/jira/browse/SPARK-20386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20386: - Assignee: eaton > The log info "Added %s in memory on %s (size: %s, free: %s)" in function > "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate > if the block exists on the slave already > -- > > Key: SPARK-20386 > URL: https://issues.apache.org/jira/browse/SPARK-20386 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: eaton >Assignee: eaton >Priority: Trivial > Fix For: 2.2.0 > > > The log info "Added %s in memory on %s (size: %s, free: %s)" in function > "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate > if the block exists on the slave already; > the current code is: > {code} > if (storageLevel.useMemory) { > blockStatus = BlockStatus(storageLevel, memSize = memSize, diskSize = > 0) > _blocks.put(blockId, blockStatus) > _remainingMem -= memSize > logInfo("Added %s in memory on %s (size: %s, free: %s)".format( > blockId, blockManagerId.hostPort, Utils.bytesToString(memSize), > Utils.bytesToString(_remainingMem))) > } > {code} > If the block exists on the slave already, the added memory should be memSize > - originalMemSize, where originalMemSize is _blocks.get(blockId).memSize -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20386) The log info "Added %s in memory on %s (size: %s, free: %s)" in function "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate if the block exist
[ https://issues.apache.org/jira/browse/SPARK-20386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20386. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17683 [https://github.com/apache/spark/pull/17683] > The log info "Added %s in memory on %s (size: %s, free: %s)" in function > "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate > if the block exists on the slave already > -- > > Key: SPARK-20386 > URL: https://issues.apache.org/jira/browse/SPARK-20386 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: eaton >Priority: Trivial > Fix For: 2.2.0 > > > The log info "Added %s in memory on %s (size: %s, free: %s)" in function > "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate > if the block exists on the slave already; > the current code is: > {code} > if (storageLevel.useMemory) { > blockStatus = BlockStatus(storageLevel, memSize = memSize, diskSize = > 0) > _blocks.put(blockId, blockStatus) > _remainingMem -= memSize > logInfo("Added %s in memory on %s (size: %s, free: %s)".format( > blockId, blockManagerId.hostPort, Utils.bytesToString(memSize), > Utils.bytesToString(_remainingMem))) > } > {code} > If the block exists on the slave already, the added memory should be memSize > - originalMemSize, where originalMemSize is _blocks.get(blockId).memSize -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
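[Editor's note] The fix described in the last sentence can be sketched as follows. Names and structure are hypothetical, not Spark's actual BlockManagerInfo code; the point is that when a block is re-added, the logged "added" amount should be the delta (memSize - originalMemSize), not the full memSize:

```python
class BlockManagerInfoSketch:
    def __init__(self, max_mem):
        self.remaining_mem = max_mem
        self.blocks = {}  # block_id -> mem_size

    def update_block(self, block_id, mem_size):
        # If the block already exists, only the size difference is new.
        original = self.blocks.get(block_id, 0)
        self.blocks[block_id] = mem_size
        self.remaining_mem -= mem_size - original
        added = mem_size - original  # log the delta, not mem_size
        return "Added %d bytes in memory (free: %d)" % (added, self.remaining_mem)
```

Re-adding an existing block then reports only the additional memory consumed, keeping the "free" figure consistent.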
[jira] [Commented] (SPARK-20436) NullPointerException when restart from checkpoint file
[ https://issues.apache.org/jira/browse/SPARK-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979820#comment-15979820 ] fangfengbin commented on SPARK-20436: - Other people have hit this problem as well: http://stackoverflow.com/questions/39039157/check-pointing-several-filestreams-in-my-spark-streaming-context > NullPointerException when restart from checkpoint file > -- > > Key: SPARK-20436 > URL: https://issues.apache.org/jira/browse/SPARK-20436 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.5.0 >Reporter: fangfengbin > > I have written a Spark Streaming application which has two DStreams. > Code is : > object KafkaTwoInkfk { > def main(args: Array[String]) { > val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args > val ssc = StreamingContext.getOrCreate(checkPointDir, () => > createContext(args)) > ssc.start() > ssc.awaitTermination() > } > def createContext(args : Array[String]) : StreamingContext = { > val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args > val sparkConf = new SparkConf().setAppName("KafkaWordCount") > val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong)) > ssc.checkpoint(checkPointDir) > val topicArr1 = topic1.split(",") > val topicSet1 = topicArr1.toSet > val topicArr2 = topic2.split(",") > val topicSet2 = topicArr2.toSet > val kafkaParams = Map[String, String]( > "metadata.broker.list" -> brokers > ) > val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicSet1) > val words1 = lines1.map(_._2).flatMap(_.split(" ")) > val wordCounts1 = words1.map(x => { > (x, 1L)}).reduceByKey(_ + _) > wordCounts1.print() > val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicSet2) > val words2 = lines1.map(_._2).flatMap(_.split(" ")) > val wordCounts2 = words2.map(x => { > (x, 1L)}).reduceByKey(_ + _) > wordCounts2.print() > return ssc 
> } > } > When restarting from the checkpoint file, it throws a NullPointerException: > java.lang.NullPointerException > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) > at > org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at
[jira] [Created] (SPARK-20436) NullPointerException when restart from checkpoint file
fangfengbin created SPARK-20436: --- Summary: NullPointerException when restart from checkpoint file Key: SPARK-20436 URL: https://issues.apache.org/jira/browse/SPARK-20436 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 1.5.0 Reporter: fangfengbin I have written a Spark Streaming application which has two DStreams. Code is : object KafkaTwoInkfk { def main(args: Array[String]) { val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args val ssc = StreamingContext.getOrCreate(checkPointDir, () => createContext(args)) ssc.start() ssc.awaitTermination() } def createContext(args : Array[String]) : StreamingContext = { val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args val sparkConf = new SparkConf().setAppName("KafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong)) ssc.checkpoint(checkPointDir) val topicArr1 = topic1.split(",") val topicSet1 = topicArr1.toSet val topicArr2 = topic2.split(",") val topicSet2 = topicArr2.toSet val kafkaParams = Map[String, String]( "metadata.broker.list" -> brokers ) val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet1) val words1 = lines1.map(_._2).flatMap(_.split(" ")) val wordCounts1 = words1.map(x => { (x, 1L)}).reduceByKey(_ + _) wordCounts1.print() val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet2) val words2 = lines1.map(_._2).flatMap(_.split(" ")) val wordCounts2 = words2.map(x => { (x, 1L)}).reduceByKey(_ + _) wordCounts2.print() return ssc } } When restarting from the checkpoint file, it throws a NullPointerException: java.lang.NullPointerException at org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126) at org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) at 
org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) at org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441) at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528) at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) at org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at
[jira] [Assigned] (SPARK-17928) No driver.memoryOverhead setting for mesos cluster mode
[ https://issues.apache.org/jira/browse/SPARK-17928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17928: Assignee: Apache Spark > No driver.memoryOverhead setting for mesos cluster mode > --- > > Key: SPARK-17928 > URL: https://issues.apache.org/jira/browse/SPARK-17928 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.1 >Reporter: Drew Robb >Assignee: Apache Spark > > Mesos cluster mode does not have a configuration setting for the driver's > memory overhead. This makes scheduling long running drivers on mesos using > dispatcher very unreliable. There is an equivalent setting for yarn-- > spark.yarn.driver.memoryOverhead. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17928) No driver.memoryOverhead setting for mesos cluster mode
[ https://issues.apache.org/jira/browse/SPARK-17928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17928: Assignee: (was: Apache Spark) > No driver.memoryOverhead setting for mesos cluster mode > --- > > Key: SPARK-17928 > URL: https://issues.apache.org/jira/browse/SPARK-17928 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.1 >Reporter: Drew Robb > > Mesos cluster mode does not have a configuration setting for the driver's > memory overhead. This makes scheduling long running drivers on mesos using > dispatcher very unreliable. There is an equivalent setting for yarn-- > spark.yarn.driver.memoryOverhead. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17928) No driver.memoryOverhead setting for mesos cluster mode
[ https://issues.apache.org/jira/browse/SPARK-17928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979818#comment-15979818 ] Apache Spark commented on SPARK-17928: -- User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/17726 > No driver.memoryOverhead setting for mesos cluster mode > --- > > Key: SPARK-17928 > URL: https://issues.apache.org/jira/browse/SPARK-17928 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.1 >Reporter: Drew Robb > > Mesos cluster mode does not have a configuration setting for the driver's > memory overhead. This makes scheduling long running drivers on mesos using > dispatcher very unreliable. There is an equivalent setting for yarn-- > spark.yarn.driver.memoryOverhead. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
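[Editor's note] For reference, the YARN-side setting mentioned in the description is an ordinary Spark conf entry; a Mesos counterpart (property name hypothetical, pending this issue's resolution) would presumably take the same form:

```properties
# spark-defaults.conf -- on YARN, driver memory overhead has an explicit knob:
spark.yarn.driver.memoryOverhead   512

# A Mesos-side equivalent (name hypothetical, not yet in Spark as of this issue)
# might look like:
# spark.mesos.driver.memoryOverhead  512
```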
[jira] [Resolved] (SPARK-19897) Unable to access History server web UI
[ https://issues.apache.org/jira/browse/SPARK-19897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19897. -- Resolution: Invalid Questions should be asked on the mailing list. I am resolving this. > Unable to access History server web UI > -- > > Key: SPARK-19897 > URL: https://issues.apache.org/jira/browse/SPARK-19897 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.1.0 > Environment: centos-7 >Reporter: Eduardo Rodrigues > > Can't access the History server web UI, although the log after executing > ./sbin/start-history-server.sh says: > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/03/10 11:53:37 INFO HistoryServer: Started daemon with process name: > 25390@centos-1 > 17/03/10 11:53:37 INFO SignalUtils: Registered signal handler for TERM > 17/03/10 11:53:37 INFO SignalUtils: Registered signal handler for HUP > 17/03/10 11:53:37 INFO SignalUtils: Registered signal handler for INT > 17/03/10 11:53:37 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/03/10 11:53:37 INFO SecurityManager: Changing view acls to: centos > 17/03/10 11:53:37 INFO SecurityManager: Changing modify acls to: centos > 17/03/10 11:53:37 INFO SecurityManager: Changing view acls groups to: > 17/03/10 11:53:37 INFO SecurityManager: Changing modify acls groups to: > 17/03/10 11:53:37 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(centos); groups > with view permissions: Set(); users with modify perm$ > 17/03/10 11:53:38 INFO FsHistoryProvider: Replaying log path: > file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/app-20170310115034-0017 > 17/03/10 11:53:38 INFO Utils: Successfully started service on port 18080. 
> 17/03/10 11:53:38 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://10.1.0.185:18080 > 17/03/10 11:53:38 INFO FsHistoryProvider: Replaying log path: > file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/app-20170310113453-0016 > 17/03/10 11:53:38 INFO FsHistoryProvider: Replaying log path: > file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/app-20170310105045-0010 > 17/03/10 11:53:38 INFO FsHistoryProvider: Replaying log path: > file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/app-20170310104828-0009 > 17/03/10 11:53:38 INFO FsHistoryProvider: Replaying log path: > file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/app-20170310104352-0007 > I changed spark-defaults to: > spark.eventLog.enabled true > spark.eventLog.dir /home/centos/spark-2.1.0-bin-hadoop2.7/work/ > spark.history.fs.logDirectory > file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979788#comment-15979788 ] Helena Edelson commented on SPARK-18057: Confirming that https://issues.apache.org/jira/browse/KAFKA-4879 - KafkaConsumer.position may hang forever when deleting a topic - is the only blocker. I upgraded in my fork with some minor code changes and the delete-related tests in spark-sql-kafka-0-10 hang. I can submit this as a PR as soon as that is resolved. > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org