[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844571#comment-16844571 ]

Gowtham SB commented on SPARK-15348:

I hit the same issue (Spark with Hive ACID tables) and was able to work around it with a JDBC call from Spark. That JDBC route may have to do until Spark gets native ACID support. Thanks.

> Hive ACID
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0
> Reporter: Ran Haim
> Priority: Major
>
> Spark does not support any feature of Hive's transactional tables: you cannot use Spark to delete or update such a table, and Spark also has problems reading the aggregated data when no compaction has been done.
> Compaction itself also appears to be unsupported: alter table ... partition ... COMPACT 'major'.
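A minimal sketch of the JDBC workaround described in the comment above: reading the Hive ACID table through HiveServer2 rather than Spark's native readers. The host/port, table name, and driver class below are placeholders, not values from the ticket.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("acid-via-jdbc").getOrCreate()

# Hypothetical connection details; substitute your own HiveServer2 endpoint and table.
acid_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:hive2://hiveserver2-host:10000/default")
    .option("dbtable", "my_acid_table")
    .option("driver", "org.apache.hive.jdbc.HiveDriver")  # driver jar must be on the classpath
    .load()
)
acid_df.show()
{code}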
[jira] [Assigned] (SPARK-27787) Eliminate unnecessary job to compute SSreg
[ https://issues.apache.org/jira/browse/SPARK-27787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27787:

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-27787) Eliminate unnecessary job to compute SSreg
[ https://issues.apache.org/jira/browse/SPARK-27787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27787:

    Assignee: Apache Spark
[jira] [Created] (SPARK-27787) Eliminate unnecessary job to compute SSreg
zhengruifeng created SPARK-27787:

Summary: Eliminate unnecessary job to compute SSreg
Key: SPARK-27787
URL: https://issues.apache.org/jira/browse/SPARK-27787
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng

In {{RegressionMetrics}}, a separate job is currently needed to compute SSreg. However, we only need to aggregate a summary of the prediction column; SSreg can then be computed directly from that summary.
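For readers following the proposal, a hedged pure-Python sketch of the algebra that makes this possible: given the prediction column's count, mean, and centered second moment (plus the label mean), SSreg = sum((pred_i - label_mean)^2) falls out without another pass over the data. The function and variable names are illustrative, not Spark API.

{code:python}
def ssreg_from_summary(n, pred_mean, pred_m2, label_mean):
    # SSreg = sum((pred_i - label_mean)^2)
    #       = sum((pred_i - pred_mean)^2) + n * (pred_mean - label_mean)^2
    return pred_m2 + n * (pred_mean - label_mean) ** 2

# Quick check against the direct definition on a toy example.
preds, label_mean = [1.0, 2.0, 4.0], 2.0
n = len(preds)
pred_mean = sum(preds) / n
pred_m2 = sum((p - pred_mean) ** 2 for p in preds)
direct = sum((p - label_mean) ** 2 for p in preds)
assert abs(ssreg_from_summary(n, pred_mean, pred_m2, label_mean) - direct) < 1e-9
{code}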
[jira] [Updated] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-19248:

    Labels:  (was: bulk-closed)

> Regex_replace works in 1.6 but not in 2.0
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.2, 2.4.3
> Reporter: Lucas Tittmann
> Priority: Major
>
> We found an error in Spark 2.0.2's execution of regexes. Using PySpark in 1.6.2, we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex behaves differently depending on the Spark version. We checked both regexes in Java, and both should be correct and work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not have the possibility to confirm on 2.1 at the moment.
[jira] [Updated] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-19248:

    Affects Version/s: 2.4.3
[jira] [Reopened] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas reopened SPARK-19248:
[jira] [Updated] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-19248:

    Component/s: PySpark
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844547#comment-16844547 ]

Nicholas Chammas commented on SPARK-19248:

Looks like Spark 2.4.3 still exhibits the behavior reported in the original issue:

{code:java}
Welcome to Spark version 2.4.3

Using Python version 3.7.3 (default, Mar 27 2019 13:25:00)
SparkSession available as 'spark'.
>>> df = spark.createDataFrame([('.. 5.',)], ['col'])
>>> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
>>> dfout
[Row(col='5')]
>>> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
>>> dfout2
[Row(col='')]
>>>
{code}

[~hyukjin.kwon] - I'm going to reopen this issue.
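One plausible reading of the regression, offered here as a hedged guess rather than a confirmed diagnosis: Spark 2.x treats backslashes in SQL string literals as escapes, so the literal '( |\.)*' reaches the regex engine as '( |.)*', which matches everything. Under that reading, doubling the backslash restores the 1.6 behaviour:

{code:python}
# Assumes the same SparkSession / DataFrame as in the comment above.
df = spark.createDataFrame([('.. 5.',)], ['col'])

# Double the backslash so the dot stays escaped after SQL literal parsing.
df.selectExpr("regexp_replace(col, '( |\\\\.)*', '') AS col").collect()
# If the escaping explanation holds, this returns [Row(col='5')] on 2.x as well.
{code}

Later 2.x releases also expose spark.sql.parser.escapedStringLiterals, which may be the less invasive way to restore the pre-2.0 literal handling.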
[jira] [Assigned] (SPARK-27637) If an exception occurred while fetching blocks by the Netty block transfer service, check whether the related executor is alive before retrying
[ https://issues.apache.org/jira/browse/SPARK-27637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-27637:

    Assignee: feiwang

> If an exception occurred while fetching blocks by the Netty block transfer service, check whether the related executor is alive before retrying
>
> Key: SPARK-27637
> URL: https://issues.apache.org/jira/browse/SPARK-27637
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.3.3, 2.4.3
> Reporter: feiwang
> Assignee: feiwang
> Priority: Major
> Fix For: 3.0.0
>
> There are several kinds of shuffle client: blockTransferService and externalShuffleClient.
> For the externalShuffleClient, there is a corresponding external shuffle service, which serves the shuffle block data regardless of the state of executors.
> The blockTransferService is used to fetch broadcast blocks, and to fetch shuffle data when the external shuffle service is not enabled.
> When fetching data through the blockTransferService, the shuffle client connects to the related executor's blockManager, so if that executor is dead, the fetch can never succeed.
> When spark.shuffle.service.enabled is true and spark.dynamicAllocation.enabled is true, an executor is removed once it has been idle for more than idleTimeout.
> If a blockTransferService creates a connection to the related executor successfully, but that executor is removed just as the broadcast block fetch begins, it will retry (see RetryingBlockFetcher), which is ineffective.
> If spark.shuffle.io.retryWait and spark.shuffle.io.maxRetries are large, say 30s and 10 retries, this can waste 5 minutes.
> So I think we should check whether the related executor is alive before retrying.
[jira] [Resolved] (SPARK-27637) If an exception occurred while fetching blocks by the Netty block transfer service, check whether the related executor is alive before retrying
[ https://issues.apache.org/jira/browse/SPARK-27637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-27637.

    Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24533: https://github.com/apache/spark/pull/24533
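To make the waste described in the ticket concrete, here is the arithmetic with the two settings it names; the values are the ticket's example, not recommendations.

{code:python}
from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.shuffle.io.maxRetries", "10")   # ticket's example: 10 attempts
    .set("spark.shuffle.io.retryWait", "30s")   # ticket's example: 30 s between attempts
)

# If the target executor is already gone, every retry is wasted:
wasted_seconds = 10 * 30
print(f"worst case spent retrying a dead executor: {wasted_seconds} s (~5 minutes)")
{code}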
[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844530#comment-16844530 ]

Nicholas Chammas commented on SPARK-18277:

[~hyukjin.kwon] - If I still think this issue is relevant, should I just reopen it?

> na.fill() and friends should work on struct fields
>
> Key: SPARK-18277
> URL: https://issues.apache.org/jira/browse/SPARK-18277
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Nicholas Chammas
> Priority: Minor
> Labels: bulk-closed
>
> It appears that you cannot use {{fill()}} and friends to quickly modify struct fields.
> For example:
> {code}
> >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), Row(a=Row(b=None), c=None)])
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  |    |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> >>> df.show()
> +-----------+-------+
> |          a|      c|
> +-----------+-------+
> |[yeah yeah]|alright|
> |     [null]|   null|
> +-----------+-------+
> >>> df.na.fill('').show()
> +-----------+-------+
> |          a|      c|
> +-----------+-------+
> |[yeah yeah]|alright|
> |     [null]|       |
> +-----------+-------+
> {code}
> {{c}} got filled in, but {{a.b}} didn't.
> I don't know if it's "appropriate", but it would be nice if {{fill()}} and friends worked automatically on struct fields.
> As things are today, there doesn't appear to be a way to fill in null values inside structs. If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, '')}} it doesn't catch the null values.
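A hedged workaround sketch (not part of the ticket): rebuild the struct column with coalesce() on each nested field, since na.fill() only touches top-level columns. This assumes the same df as in the description above.

{code:python}
from pyspark.sql import functions as F

filled = df.na.fill('').withColumn(
    'a',
    F.struct(F.coalesce(F.col('a.b'), F.lit('')).alias('b'))
)
filled.show()  # a.b is now '' instead of null on the second row
{code}

This gets tedious quickly for wide or deeply nested structs, which is part of the argument for native support.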
[jira] [Updated] (SPARK-1921) Allow duplicate jar files among the app jar and secondary jars in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-1921:

    Labels: bulk-closed  (was: )

> Allow duplicate jar files among the app jar and secondary jars in yarn-cluster mode
>
> Key: SPARK-1921
> URL: https://issues.apache.org/jira/browse/SPARK-1921
> Project: Spark
> Issue Type: Sub-task
> Components: Deploy
> Affects Versions: 1.0.0
> Reporter: Xiangrui Meng
> Priority: Minor
> Labels: bulk-closed
>
> In yarn-cluster mode, jars are uploaded to a staging folder on HDFS. If there are duplicates among the app jar and secondary jars, there will be overwrites that cause inconsistent timestamps. I saw the following message:
> {code}
> Application application_1400965808642_0021 failed 2 times due to AM Container for appattempt_1400965808642_0021_02 exited with exitCode: -1000 due to: Resource hdfs://localhost.localdomain:8020/user/cloudera/.sparkStaging/application_1400965808642_0021/app_2.10-0.1.jar changed on src filesystem (expected 1400998721965, was 1400998723123
> {code}
> Tested on a CDH-5 quickstart VM.
[jira] [Updated] (SPARK-3251) Clarify learning interfaces
[ https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-3251:

    Labels: bulk-closed  (was: )

> Clarify learning interfaces
>
> Key: SPARK-3251
> URL: https://issues.apache.org/jira/browse/SPARK-3251
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.1.0, 1.1.1
> Reporter: Christoph Sawade
> Priority: Major
> Labels: bulk-closed
>
> *Make threshold mandatory*
> Currently, the output of predict for an example is either the score or the class. This side effect is caused by clearThreshold. To clarify that behaviour, three different kinds of predict (predictScore, predictClass, predictProbability) were introduced; the threshold is no longer optional.
> *Clarify classification interfaces*
> Currently, some functionality is spread over multiple models. In order to clarify the structure and simplify the implementation of more complex models (like multinomial logistic regression), two new classes are introduced:
> - BinaryClassificationModel: for all models that derive a binary classification from a single weight vector. It comprises the thresholding functionality to derive a prediction from a score, and basically captures SVMModel and LogisticRegressionModel.
> - ProbabilisticClassificationModel: this trait defines the interface for models that return a calibrated confidence score (aka probability).
> *Misc*
> - some renaming
> - add a test for probabilistic output
[jira] [Updated] (SPARK-4229) Create hadoop configuration in a consistent way
[ https://issues.apache.org/jira/browse/SPARK-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-4229:

    Labels: bulk-closed  (was: )

> Create hadoop configuration in a consistent way
>
> Key: SPARK-4229
> URL: https://issues.apache.org/jira/browse/SPARK-4229
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Cody Koeninger
> Priority: Minor
> Labels: bulk-closed
>
> Some places use SparkHadoopUtil.get.conf, some create a new Hadoop config. Prefer SparkHadoopUtil so that spark.hadoop.* properties are pulled in.
> http://apache-spark-developers-list.1001551.n3.nabble.com/Hadoop-configuration-for-checkpointing-td9084.html
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-3717:

    Labels: bulk-closed  (was: )

> DecisionTree, RandomForest: Partition by feature
>
> Key: SPARK-3717
> URL: https://issues.apache.org/jira/browse/SPARK-3717
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
> Priority: Major
> Labels: bulk-closed
>
> h1. Summary
> Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees.
> h1. Details
> Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better.
> Notation:
> * P = # workers
> * N = # instances
> * M = # features
> * D = depth of tree
> h2. Partitioning Features
> Algorithm sketch:
> * Each worker stores:
> ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column).
> ** all labels
> * Train one level at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node in current level
> * On each iteration:
> ** Each worker: for each node in the level, compute (best feature to split, info gain).
> ** Reduce (P x M) values to M values to find the best split for each node.
> ** Workers who have features used in best splits communicate left/right for relevant instances. Gather a total of N bits to the master, then broadcast.
> * Total communication:
> ** Depth D iterations.
> ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each).
> ** Estimate: D * (M * 8 + N)
> h2. Partitioning Instances
> Algorithm sketch:
> * Train one group of nodes at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node
> * On each iteration:
> ** Each worker: for each instance, add to aggregate statistics.
> ** The aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
> *** ("# classes" is for classification; 3 for regression)
> ** Reduce the aggregate.
> ** The master chooses the best split for each node in the group and broadcasts it.
> * Local training: once all instances for a node fit on one machine, it can be best to shuffle data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained.
> * Summing over all iterations, reduce a total of:
> ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
> ** Estimate: 2^D * M * B * C * 8
> h2. Comparing Partitioning Methods
> Partitioning features cost < partitioning instances cost when:
> * D * (M * 8 + N) < 2^D * M * B * C * 8
> * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right-hand side)
> * N < [ 2^D * M * B * C * 8 ] / D
> Example with many instances:
> * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5)
> * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
> * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8
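The closing example can be reproduced with a few lines of arithmetic; this snippet just restates the estimates above in code.

{code:python}
# 2M instances, 3500 features, 100 bins, 5 classes, trees of depth 5 (6 levels).
N, M, B, C = 2_000_000, 3_500, 100, 5
depth, levels = 5, 6
bytes_per_value = 8

partition_by_feature = levels * (M * bytes_per_value + N)            # ~1.2e7
partition_by_instance = (2 ** depth) * M * B * C * bytes_per_value   # ~4.5e8

print(f"partition by feature:  ~{partition_by_feature:.2e}")
print(f"partition by instance: ~{partition_by_instance:.2e}")
# Per the sketch, partitioning by feature wins whenever N < (2^D * M * B * C * 8) / D.
{code}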
[jira] [Resolved] (SPARK-6619) Improve Jar caching on executors
[ https://issues.apache.org/jira/browse/SPARK-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-6619.

    Resolution: Incomplete

> Improve Jar caching on executors
>
> Key: SPARK-6619
> URL: https://issues.apache.org/jira/browse/SPARK-6619
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Mingyu Kim
> Priority: Major
> Labels: bulk-closed
>
> Taking SPARK-2713 one step further so that:
> - The cached jars can be used by multiple applications. In order to do that, I'm planning to use MD5 as the cache key as opposed to URL hash and timestamp.
> - The cached jars are hard-linked into the work directory as opposed to being copied.
> Re: perf. Computing MD5 using "openssl" on my local MacBook Pro took 1.2 s for 158 jars with a total size of 56 MB, and these take ~10 s to ship to the executor at start-up.
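A small sketch of the MD5-keyed cache idea from the description: hash each jar's contents so identical jars map to the same cache entry regardless of URL or timestamp. The directory name below is a placeholder.

{code:python}
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical local jar directory; the per-jar digest would serve as the cache key.
cache_keys = {p.name: md5_of(p) for p in Path("libs").glob("*.jar")}
{code}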
[jira] [Resolved] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.
[ https://issues.apache.org/jira/browse/SPARK-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-3601.

    Resolution: Incomplete

> Kryo NPE for output operations on Avro complex Objects even after registering.
>
> Key: SPARK-3601
> URL: https://issues.apache.org/jira/browse/SPARK-3601
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Environment: local, standalone cluster
> Reporter: mohan gaddam
> Priority: Major
> Labels: bulk-closed
>
> The Kryo serializer works well when Avro objects carry simple data, but when the same Avro object carries complex data (like unions/arrays), Kryo fails during output operations; map operations are fine. Note that I have registered all the Avro-generated classes with Kryo. I'm using Java as the programming language.
> When a complex message is used, it throws an NPE; stack trace as follows:
> {noformat}
> ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> value (xyz.Datum)
> data (xyz.ResMsg)
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}
> In the above exception, Datum and ResMsg are project-specific classes generated by Avro using the AVDL snippet below:
> {noformat}
> record KeyValueObject {
>   union {boolean, int, long, float, double, bytes, string} name;
>   union {boolean, int, long, float, double, bytes, string, array<union {boolean, int, long, float, double, bytes, string, KeyValueObject}>, KeyValueObject} value;
> }
> record Datum {
>   union {boolean, int, long, float, double, bytes, string, array<union {boolean, int, long, float, double, bytes, string, KeyValueObject}>, KeyValueObject} value;
> }
> record ResMsg {
>   string version;
>   string sequence;
>   string resourceGUID;
>   string GWID;
>   string GWTimestamp;
>   union {Datum, array<Datum>} data;
> }
> {noformat}
> Avro message samples are as follows:
> 1) {"version": "01", "sequence": "1", "resourceGUID": "001", "GWID": "002", "GWTimestamp": "1409823150737", "data": {"value": "30"}}
> 2) {"version": "01", "sequence": "1", "resource": "sensor-001", "controller": "002", "controllerTimestamp": "1411038710358", "data": {"value": [ {"name": "Temperature", "value": "30"}, {"name": "Speed", "value": "60"}, {"name": "Location", "value": ["+401213.1", "-0750015.1"]}, {"name": "Timestamp", "value": "2014-09-09T08:15:25-05:00"}]}}
> Both 1 and 2 adhere to the Avro schema, so the decoder is able to convert them into Avro objects in the Spark Streaming API. BTW, the messages were pulled from a Kafka source and decoded using a Kafka decoder.
[jira] [Resolved] (SPARK-5431) SparkSubmitSuite and DriverSuite hang indefinitely if Master fails
[ https://issues.apache.org/jira/browse/SPARK-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-5431.

    Resolution: Incomplete

> SparkSubmitSuite and DriverSuite hang indefinitely if Master fails
>
> Key: SPARK-5431
> URL: https://issues.apache.org/jira/browse/SPARK-5431
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Reporter: Andrew Or
> Priority: Major
> Labels: bulk-closed
>
> Steps to reproduce:
> (1) Add a throw new exception anywhere in Master's constructor
> (2) Run the DriverSuite or SparkSubmitSuite
> (3) zzz
> We need to somehow propagate the exception all the way to the suites themselves.
[jira] [Resolved] (SPARK-4206) BlockManager warnings in local mode: "Block $blockId already exists on this machine; not re-adding it"
[ https://issues.apache.org/jira/browse/SPARK-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-4206.

    Resolution: Incomplete

> BlockManager warnings in local mode: "Block $blockId already exists on this machine; not re-adding it"
>
> Key: SPARK-4206
> URL: https://issues.apache.org/jira/browse/SPARK-4206
> Project: Spark
> Issue Type: Bug
> Components: Block Manager
> Environment: local mode, branch-1.1 & master
> Reporter: Imran Rashid
> Priority: Minor
> Labels: bulk-closed
>
> When running in local mode, you often get log warning messages like:
> WARN storage.BlockManager: Block input-0-1415022975000 already exists on this machine; not re-adding it
> (e.g., try running the TwitterPopularTags example in local mode)
> I think these warning messages are pretty unsettling for a new user and should be removed. If they are truly innocuous, they should be changed to logInfo, or maybe even logDebug. Or, if they might actually indicate a problem, we should find the root cause and fix it.
> I *think* the problem is caused by a replication level > 1 when running in local mode. In BlockManager.doPut, first the block is put locally:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L692
> and then, if the replication level > 1, a request is sent out to replicate the block:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L827
> However, in local mode there isn't anywhere else to replicate the block; the request comes back to the same node, which then issues the warning that the block has already been added.
> If that analysis is right, the easy fix would be to make sure replicationLevel = 1 in local mode. But it's a little disturbing that a replication request could result in an attempt to replicate on the same node -- and that if something is wrong, we only issue a warning and keep going. If this really is the culprit, then it might be worth taking a closer look at the logic of replication.
[jira] [Resolved] (SPARK-4716) Avoid shuffle when all-to-all operation has single input and output partition
[ https://issues.apache.org/jira/browse/SPARK-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-4716.

    Resolution: Incomplete

> Avoid shuffle when all-to-all operation has single input and output partition
>
> Key: SPARK-4716
> URL: https://issues.apache.org/jira/browse/SPARK-4716
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.2.0
> Reporter: Sandy Ryza
> Priority: Major
> Labels: bulk-closed
>
> I encountered an application that performs joins on a bunch of small RDDs, unions the results, and then performs larger aggregations across them. Many of these small RDDs fit in a single partition. For these operations with only a single partition, there's no reason to write data to disk and then fetch it over a socket.
[jira] [Resolved] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large
[ https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-1823.

    Resolution: Incomplete

> ExternalAppendOnlyMap can still OOM if one key is very large
>
> Key: SPARK-1823
> URL: https://issues.apache.org/jira/browse/SPARK-1823
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Andrew Or
> Priority: Major
> Labels: bulk-closed
>
> If the values for one key do not collectively fit into memory, then the map will still OOM when you merge the spilled contents back in.
> This is a problem especially for PySpark, since we hash the keys (Python objects) before a shuffle, and there are only so many integers out there in the world, so there could potentially be many collisions.
[jira] [Resolved] (SPARK-5079) Detect failed jobs / batches in Spark Streaming unit tests
[ https://issues.apache.org/jira/browse/SPARK-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-5079.

    Resolution: Incomplete

> Detect failed jobs / batches in Spark Streaming unit tests
>
> Key: SPARK-5079
> URL: https://issues.apache.org/jira/browse/SPARK-5079
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Reporter: Josh Rosen
> Assignee: Ilya Ganelin
> Priority: Major
> Labels: bulk-closed
>
> Currently, it is possible to write Spark Streaming unit tests where Spark jobs fail but the streaming tests succeed, because we rely on wall-clock time plus output comparison to check whether a test has passed, and hence may miss cases where errors occurred if they didn't affect these results. We should strengthen the tests to check that no job failures occurred while processing batches.
> See https://github.com/apache/spark/pull/3832#issuecomment-68580794 for additional context.
> The StreamingTestWaiter in https://github.com/apache/spark/pull/3801 might also fix this.
[jira] [Resolved] (SPARK-4911) Report the inputs and outputs of Spark jobs so that external systems can track data lineage
[ https://issues.apache.org/jira/browse/SPARK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-4911.

    Resolution: Incomplete

> Report the inputs and outputs of Spark jobs so that external systems can track data lineage
>
> Key: SPARK-4911
> URL: https://issues.apache.org/jira/browse/SPARK-4911
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 1.2.0
> Reporter: Sandy Ryza
> Priority: Major
> Labels: bulk-closed
>
> When Spark runs a job, it would be useful to log its filesystem inputs and outputs somewhere. This allows external tools to track which persisted datasets are derived from other persisted datasets.
[jira] [Resolved] (SPARK-5043) Implement updated Receiver API
[ https://issues.apache.org/jira/browse/SPARK-5043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-5043.

    Resolution: Incomplete

> Implement updated Receiver API
>
> Key: SPARK-5043
> URL: https://issues.apache.org/jira/browse/SPARK-5043
> Project: Spark
> Issue Type: Sub-task
> Components: DStreams
> Reporter: Tathagata Das
> Assignee: Tathagata Das
> Priority: Major
> Labels: bulk-closed
>
> Implement the updated Receiver API based on this design:
> https://docs.google.com/document/d/1J66JCF1g-H1y2YnBnzdshTA1-hEJy3VvxXSEGgFpqws/edit?usp=sharing
[jira] [Resolved] (SPARK-4488) Add control over map-side aggregation
[ https://issues.apache.org/jira/browse/SPARK-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-4488.

    Resolution: Incomplete

> Add control over map-side aggregation
>
> Key: SPARK-4488
> URL: https://issues.apache.org/jira/browse/SPARK-4488
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 1.1.0
> Reporter: Genmao Yu
> Priority: Minor
> Labels: bulk-closed
[jira] [Resolved] (SPARK-5713) Support python serialization for RandomForest
[ https://issues.apache.org/jira/browse/SPARK-5713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-5713.

    Resolution: Incomplete

> Support python serialization for RandomForest
>
> Key: SPARK-5713
> URL: https://issues.apache.org/jira/browse/SPARK-5713
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.2.0
> Environment: Tested on MacOS
> Reporter: Guillaume Charhon
> Priority: Major
> Labels: bulk-closed
>
> I was trying to pickle a trained random forest model. Unfortunately, it is impossible to serialize the model for future use:
> model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={}, numTrees=nb_tree, featureSubsetStrategy="auto", impurity='variance', maxDepth=depth)
> output = open('model.ml', 'wb')
> pickle.dump(model, output)
> I am getting this error:
> TypeError: can't pickle lock objects
> I am using Spark 1.2.0. I have also tested Spark 1.2.1.
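A hedged workaround sketch (not from the ticket): the model object wraps a JVM handle, which is where the un-picklable lock comes from. Later releases (1.3+) added native save/load on MLlib tree models, which sidesteps pickle entirely; the path below is a placeholder, and trainingData, nb_tree, and depth are the ticket's own variables.

{code:python}
from pyspark.mllib.tree import RandomForest, RandomForestModel

model = RandomForest.trainRegressor(
    trainingData, categoricalFeaturesInfo={}, numTrees=nb_tree,
    featureSubsetStrategy="auto", impurity="variance", maxDepth=depth)

model.save(sc, "hdfs:///models/rf-regressor")               # assumed output path
restored = RandomForestModel.load(sc, "hdfs:///models/rf-regressor")
{code}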
[jira] [Resolved] (SPARK-6798) Fix Date serialization in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-6798.

    Resolution: Incomplete

> Fix Date serialization in SparkR
>
> Key: SPARK-6798
> URL: https://issues.apache.org/jira/browse/SPARK-6798
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Reporter: Shivaram Venkataraman
> Assignee: Davies Liu
> Priority: Minor
> Labels: bulk-closed
>
> SparkR's date serialization currently sends strings from R to the JVM. We should convert this to integers and also account for timezones correctly by using DateUtils.
[jira] [Resolved] (SPARK-6497) Class is not registered: scala.reflect.ManifestFactory$$anon$9
[ https://issues.apache.org/jira/browse/SPARK-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-6497.

    Resolution: Incomplete

> Class is not registered: scala.reflect.ManifestFactory$$anon$9
>
> Key: SPARK-6497
> URL: https://issues.apache.org/jira/browse/SPARK-6497
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.5.2
> Reporter: Daniel Darabos
> Priority: Major
> Labels: bulk-closed
>
> This is a slight regression from Spark 1.2.1 to 1.3.0.
> {noformat}
> spark-1.2.1-bin-hadoop2.4/bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrationRequired=true --conf 'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
> scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
> res0: Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(1)))
> {noformat}
> {noformat}
> spark-1.3.0-bin-hadoop2.4/bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrationRequired=true --conf 'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
> scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
> Lost task 1.0 in stage 3.0 (TID 25, localhost): com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Class is not registered: scala.reflect.ManifestFactory$$anon$9
> Note: To register this class use: kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
> Serialization trace:
> evidence$1 (org.apache.spark.util.collection.CompactBuffer)
> at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
> at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
> at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:161)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalArgumentException: Class is not registered: scala.reflect.ManifestFactory$$anon$9
> Note: To register this class use: kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
> at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
> at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
> at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:561)
> ... 13 more
> {noformat}
> In our production code the exception is actually about {{scala.reflect.ManifestFactory$$anon$8}} instead of {{scala.reflect.ManifestFactory$$anon$9}}, but it's probably the same thing. Any idea what changed from 1.2.1 to 1.3.0 that could be causing this?
> We also get exceptions in 1.3.0 for {{scala.reflect.ClassTag$$anon$1}} and {{java.lang.Class}}, but I haven't reduced them to a spark-shell reproduction yet. We can of course just register these classes ourselves.
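A hedged sketch of the workaround the report already mentions ("register these classes ourselves"): append the classes named in the exceptions to the existing registration list. Whether registering these JVM-internal anonymous classes by name behaves well across Spark versions is not something the ticket confirms.

{code:python}
from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true")
    .set(
        "spark.kryo.classesToRegister",
        ",".join([
            "scala.collection.mutable.WrappedArray$ofRef",
            "[Lscala.Tuple2;",
            # classes taken from the error messages above
            "scala.reflect.ManifestFactory$$anon$9",
            "scala.reflect.ManifestFactory$$anon$8",
            "scala.reflect.ClassTag$$anon$1",
            "java.lang.Class",
        ]),
    )
)
{code}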
[jira] [Resolved] (SPARK-2913) Spark's log4j.properties should always appear ahead of Hadoop's on classpath
[ https://issues.apache.org/jira/browse/SPARK-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-2913.

    Resolution: Incomplete

> Spark's log4j.properties should always appear ahead of Hadoop's on classpath
>
> Key: SPARK-2913
> URL: https://issues.apache.org/jira/browse/SPARK-2913
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.0.0, 1.0.2, 1.1.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Major
> Labels: bulk-closed
>
> In the current {{compute-classpath}} scripts, the Hadoop conf directory may appear before Spark's conf directory in the computed classpath. This leads to Hadoop's log4j.properties being used instead of Spark's, preventing users from easily changing Spark's logging settings.
> To fix this, we should add a new classpath entry for Spark's log4j.properties file.
[jira] [Resolved] (SPARK-3115) Improve task broadcast latency for small tasks
[ https://issues.apache.org/jira/browse/SPARK-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-3115.

    Resolution: Incomplete

> Improve task broadcast latency for small tasks
>
> Key: SPARK-3115
> URL: https://issues.apache.org/jira/browse/SPARK-3115
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Shivaram Venkataraman
> Assignee: Reynold Xin
> Priority: Major
> Labels: bulk-closed
>
> Broadcasting the task information helps reduce the amount of data transferred for large tasks. However, we've seen that this adds more latency for small tasks. It would be great to profile and fix this.
[jira] [Resolved] (SPARK-6815) Support accumulators in R
[ https://issues.apache.org/jira/browse/SPARK-6815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-6815.

    Resolution: Incomplete

> Support accumulators in R
>
> Key: SPARK-6815
> URL: https://issues.apache.org/jira/browse/SPARK-6815
> Project: Spark
> Issue Type: New Feature
> Components: SparkR
> Reporter: Shivaram Venkataraman
> Priority: Minor
> Labels: bulk-closed
>
> SparkR doesn't support accumulators right now. It might be good to add support for this to get feature parity with PySpark.
[jira] [Resolved] (SPARK-2253) [Core] Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-2253.

    Resolution: Incomplete

> [Core] Disable partial aggregation automatically when reduction factor is low
>
> Key: SPARK-2253
> URL: https://issues.apache.org/jira/browse/SPARK-2253
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Reynold Xin
> Priority: Major
> Labels: bulk-closed
>
> Once we have seen enough rows in partial aggregation and don't observe any reduction, the Aggregator should just turn off partial aggregation. This reduces memory usage for high-cardinality aggregations.
> This ticket is for Spark core. There is another ticket tracking this for SQL.
[jira] [Resolved] (SPARK-4545) If first Spark Streaming batch fails, it waits 10x batch duration before stopping
[ https://issues.apache.org/jira/browse/SPARK-4545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-4545.

    Resolution: Incomplete

> If first Spark Streaming batch fails, it waits 10x batch duration before stopping
>
> Key: SPARK-4545
> URL: https://issues.apache.org/jira/browse/SPARK-4545
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 1.1.0, 1.2.1
> Reporter: Sean Owen
> Priority: Major
> Labels: bulk-closed
>
> (I'd like to track the issue raised at http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdKY=QCT0YUdrkvbVuqXdFCGp1+6g-=s71fk8zr4uat...@mail.gmail.com%3E as a JIRA since I think it's a legitimate issue that I can take a look into, with some help.)
> This bit of {{JobGenerator.stop()}} executes, since the message appears in the logs:
> {code}
> def haveAllBatchesBeenProcessed = {
>   lastProcessedBatch != null && lastProcessedBatch.milliseconds == stopTime
> }
> logInfo("Waiting for jobs to be processed and checkpoints to be written")
> while (!hasTimedOut && !haveAllBatchesBeenProcessed) {
>   Thread.sleep(pollTime)
> }
> // ... 10x batch duration wait here, before seeing the next line log:
> logInfo("Waited for jobs to be processed and checkpoints to be written")
> {code}
> I think that {{lastProcessedBatch}} is always null since no batch ever succeeds. Of course, for all this code knows, the next batch might succeed, and so it waits for it. But it should proceed after one more batch completes, even if that batch failed.
> {{JobGenerator.onBatchCompleted}} is only called for a successful batch. Can it be called when a batch fails too? I think that would fix it.
> Should the condition also not be {{lastProcessedBatch.milliseconds <= stopTime}} instead of ==?
[jira] [Resolved] (SPARK-1107) Add shutdown hook on executor stop to stop running tasks
[ https://issues.apache.org/jira/browse/SPARK-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-1107.

    Resolution: Incomplete

> Add shutdown hook on executor stop to stop running tasks
>
> Key: SPARK-1107
> URL: https://issues.apache.org/jira/browse/SPARK-1107
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 0.9.0
> Reporter: Andrew Ash
> Priority: Major
> Labels: bulk-closed
>
> Originally reported by aash: http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3CCA%2B-p3AHXYhpjXH9fr8jQ5%2B_gc%3DNHjLbOiJB9bHSahfEET5aHBQ%40mail.gmail.com%3E
> Latest in thread: http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3CCA+-p3AFi7vz=2oty3caa0g+5ekg+a84uvqrl9tgstvgwgyb...@mail.gmail.com%3E
> The most popular approach is to add a shutdown hook that stops running tasks in the executors.
[jira] [Resolved] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers
[ https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-636.

    Resolution: Incomplete

> Add mechanism to run system management/configuration tasks on all workers
>
> Key: SPARK-636
> URL: https://issues.apache.org/jira/browse/SPARK-636
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Josh Rosen
> Priority: Major
> Labels: bulk-closed
>
> It would be useful to have a mechanism to run a task on all workers in order to perform system management tasks, such as purging caches or changing system properties. This is useful for automated experiments and benchmarking; I don't envision this being used for heavy computation.
> Right now, I can mimic this with something like
> {code}
> sc.parallelize(0 until numMachines, numMachines).foreach { }
> {code}
> but this does not guarantee that every worker runs a task and requires my user code to know the number of workers.
> One sample use case is setup and teardown for benchmark tests. For example, I might want to drop cached RDDs, purge shuffle data, and call {{System.gc()}} between test runs. It makes sense to incorporate some of this functionality, such as dropping cached RDDs, into Spark itself, but it might be helpful to have a general mechanism for running ad-hoc tasks like {{System.gc()}}.
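A PySpark rendering of the workaround quoted in the description, with the same caveats: it only approximates "one task per worker" and the caller still has to know the worker count. num_machines and the per-worker action are placeholders.

{code:python}
num_machines = 8  # assumed; must match the cluster

def per_worker_maintenance(_):
    import gc
    gc.collect()  # stand-in for whatever management task is needed

sc.parallelize(range(num_machines), num_machines).foreach(per_worker_maintenance)
{code}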
[jira] [Resolved] (SPARK-2280) Java & Scala reference docs should describe function reference behavior.
[ https://issues.apache.org/jira/browse/SPARK-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2280. - Resolution: Incomplete > Java & Scala reference docs should describe function reference behavior. > > > Key: SPARK-2280 > URL: https://issues.apache.org/jira/browse/SPARK-2280 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.0.0 >Reporter: Hans Uhlig >Priority: Minor > Labels: bulk-closed, starter > > Example > JavaPairRDD> groupBy(Function f) > Return an RDD of grouped elements. Each group consists of a key and a > sequence of elements mapping to that key. > T and K are not described and there is no explanation of what the function's > inputs and outputs should be and how GroupBy uses this information. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2545) Add a diagnosis mode for closures to figure out what they're bringing in
[ https://issues.apache.org/jira/browse/SPARK-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2545. - Resolution: Incomplete > Add a diagnosis mode for closures to figure out what they're bringing in > > > Key: SPARK-2545 > URL: https://issues.apache.org/jira/browse/SPARK-2545 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Aaron Davidson >Priority: Major > Labels: bulk-closed > > Today, it's pretty hard to figure out why your closure is bigger than > expected, because it's not obvious what objects are being included or who is > including them. We should have some sort of diagnosis available to users with > very large closures that displays the contents of the closure. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4500) Improve exact stratified sampling implementation
[ https://issues.apache.org/jira/browse/SPARK-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4500. - Resolution: Incomplete > Improve exact stratified sampling implementation > > > Key: SPARK-4500 > URL: https://issues.apache.org/jira/browse/SPARK-4500 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > > The current implementation for exact stratified sampling (sampleByKeyExact) > could be more efficient. Proposed algorithm sketch: > * Sampling is done separately for each stratum. Here, all counts are w.r.t. > a fixed stratum. > * Let n_partition = number of elements in a given partition > * This method uses 2-tiered sampling: > ** Tier 1 (on driver): Select the sample sizes {n_partition} for each > partition. > *** Without replacement, this uses a multivariate hypergeometric distribution. > *** With replacement, this should use a multinomial distribution. > ** Tier 2 (in parallel): Select a sample of size n_partition. > If anyone is interested in implementing this, I have a rough draft which > works without replacement, but it needs to be cleaned up and augmented to do > sampling with replacement too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
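For context, a short usage sketch of the existing exact sampler whose implementation the ticket above wants to speed up; the RDD contents, fractions, and seed below are made up for illustration.
{code}
// sampleByKeyExact draws a sample whose per-key (per-stratum) sizes match the requested
// fractions exactly; the ticket proposes a 2-tier (driver + per-partition) scheme to do
// this more efficiently than the current implementation.
val data: org.apache.spark.rdd.RDD[(String, Double)] = sc.parallelize(Seq(
  ("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0), ("b", 5.0)))
val fractions = Map("a" -> 0.5, "b" -> 0.4)  // one sampling fraction per stratum (key)
val exact = data.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 42L)
{code}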
[jira] [Resolved] (SPARK-5488) SPARK_LOCAL_IP not read by mesos scheduler
[ https://issues.apache.org/jira/browse/SPARK-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5488. - Resolution: Incomplete > SPARK_LOCAL_IP not read by mesos scheduler > -- > > Key: SPARK-5488 > URL: https://issues.apache.org/jira/browse/SPARK-5488 > Project: Spark > Issue Type: Bug > Components: Mesos, Scheduler >Affects Versions: 1.1.1 >Reporter: Martin Tapp >Priority: Minor > Labels: bulk-closed > > My environment sets SPARK_LOCAL_IP and my driver sees it, but Mesos sees the > address of my first available network adapter. > I can even see that SPARK_LOCAL_IP is read correctly by Utils.localHostName > and Utils.localIpAddress > (core/src/main/scala/org/apache/spark/util/Utils.scala); it seems the Spark Mesos > framework doesn't use it. > The workaround for now is to disable my first adapter so that the second one > becomes the one seen by Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3916) recognize appended data in textFileStream()
[ https://issues.apache.org/jira/browse/SPARK-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3916. - Resolution: Incomplete > recognize appended data in textFileStream() > --- > > Key: SPARK-3916 > URL: https://issues.apache.org/jira/browse/SPARK-3916 > Project: Spark > Issue Type: New Feature > Components: DStreams >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Major > Labels: bulk-closed > > Right now, we only find new data from new files, the data written to old > files (processed in last batch) will not be processed. > In order to support this, we need partialRDD(), which is an RDD for part of > file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features
[ https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5272. - Resolution: Incomplete > Refactor NaiveBayes to support discrete and continuous labels,features > -- > > Key: SPARK-5272 > URL: https://issues.apache.org/jira/browse/SPARK-5272 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed, clustering > > This JIRA is to discuss refactoring NaiveBayes in order to support both > discrete and continuous labels and features. > Currently, NaiveBayes supports only discrete labels and features. > Proposal: Generalize it to support continuous values as well. > Some items to discuss are: > * How commonly are continuous labels/features used in practice? (Is this > necessary?) > * What should the API look like? > ** E.g., should NB have multiple classes for each type of label/feature, or > should it take a general Factor type parameter? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5077) Map output statuses can still exceed spark.akka.frameSize
[ https://issues.apache.org/jira/browse/SPARK-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5077. - Resolution: Incomplete > Map output statuses can still exceed spark.akka.frameSize > - > > Key: SPARK-5077 > URL: https://issues.apache.org/jira/browse/SPARK-5077 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.2.0, 1.3.0, 1.4.1 >Reporter: Josh Rosen >Priority: Major > Labels: bulk-closed > > Since HighlyCompressedMapOutputStatuses uses a bitmap for tracking empty > blocks, its size is not bounded and thus Spark is still susceptible to > "MapOutputTrackerMasterActor: Map output statuses > were 11141547 bytes which exceeds spark.akka.frameSize"-type errors, even in > 1.2.0. > We needed to use a bitmap for tracking zero-sized blocks (see SPARK-3740; > this isn't just a performance issue; it's necessary for correctness). This > will require a bit more effort to fix, since we'll either have to find a way > to use a fixed size / capped size encoding for MapOutputStatuses (which might > require changes to let us fetch empty blocks safely) or figure out some other > strategy for shipping these statues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6026) Eliminate the bypassMergeThreshold parameter and associated hash-ish shuffle within the Sort shuffle code
[ https://issues.apache.org/jira/browse/SPARK-6026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6026. - Resolution: Incomplete > Eliminate the bypassMergeThreshold parameter and associated hash-ish shuffle > within the Sort shuffle code > - > > Key: SPARK-6026 > URL: https://issues.apache.org/jira/browse/SPARK-6026 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.3.0 >Reporter: Kay Ousterhout >Priority: Major > Labels: bulk-closed > > The bypassMergeThreshold parameter (and associated use of a hash-ish shuffle > when the number of partitions is less than this) is basically a workaround > for SparkSQL, because the fact that the sort-based shuffle stores > non-serialized objects is a deal-breaker for SparkSQL, which re-uses objects. > Once the sort-based shuffle is changed to store serialized objects, we > should never be secretly doing hash-ish shuffle even when the user has > specified to use sort-based shuffle (because of its otherwise worse > performance). > [~rxin][~adav], masters of shuffle, it would be helpful to get agreement from > you on this proposal (and also a sanity check that I've correctly > characterized the issue). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4144) Support incremental model training of Naive Bayes classifier
[ https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4144. - Resolution: Incomplete > Support incremental model training of Naive Bayes classifier > > > Key: SPARK-4144 > URL: https://issues.apache.org/jira/browse/SPARK-4144 > Project: Spark > Issue Type: Improvement > Components: DStreams, MLlib >Reporter: Chris Fregly >Priority: Major > Labels: bulk-closed > > Per Xiangrui Meng from the following user list discussion: > http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E > > "For Naive Bayes, we need to update the priors and conditional > probabilities, which means we should also remember the number of > observations for the updates." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2408) RDD.map(func) dependencies issue after checkpoint & count
[ https://issues.apache.org/jira/browse/SPARK-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2408. - Resolution: Incomplete > RDD.map(func) dependencies issue after checkpoint & count > - > > Key: SPARK-2408 > URL: https://issues.apache.org/jira/browse/SPARK-2408 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.1, 1.0.0 >Reporter: Daniel Fry >Priority: Major > Labels: bulk-closed > > i am noticing strange behavior with a simple example use of rdd.checkpoint(). > you can paste the following code into any spark-shell (e.g. with > MASTER=local[*]) > // build an array of 100 random lowercase strings of length 10 > val r = new scala.util.Random() > val str_arr = (1 to 100).map(a => (1 to 10).map(b => new > Character(((Math.abs(r.nextInt) % 26) + 97).toChar)).mkString("")) > // make this into an rdd > val str_rdd = sc.parallelize(str_arr) > // checkpoint & count > sc.setCheckpointDir("hdfs://[namenode]:54310/path/to/some/spark_checkpoint_dir") > str_rdd.checkpoint() > str_rdd.count > // rdd.map some dummy function > def test(a : String) : String = { return a } > str_rdd.map(test).count > this results in a surprising exception! > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: scala.util.Random > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
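A hedged workaround sketch for the reproduction above (the ticket itself does not propose a fix): keeping the non-serializable scala.util.Random instance out of the enclosing shell scope may avoid dragging it into later closures; the variable names are made up for illustration.
{code}
// Build the random strings via the scala.util.Random companion object inside the mapping
// function, so no Random instance lives in the shell's enclosing scope and the later
// rdd.map(...) closure has nothing non-serializable to capture.
val strArr = (1 to 100).map { _ =>
  (1 to 10).map(_ => ('a' + scala.util.Random.nextInt(26)).toChar).mkString
}
val strRdd = sc.parallelize(strArr)
{code}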
[jira] [Resolved] (SPARK-4698) Data-locality aware Partitioners
[ https://issues.apache.org/jira/browse/SPARK-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4698. - Resolution: Incomplete > Data-locality aware Partitioners > > > Key: SPARK-4698 > URL: https://issues.apache.org/jira/browse/SPARK-4698 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kevin Mader >Priority: Minor > Labels: bulk-closed > > The current hash and range partitioner tools do not seem to respect the > existing data-locality. A 'dictionary' driven partitioner that calculated the > partitions based on the existing key locations instead of re-calculating them > would be ideal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3835) Spark applications that are killed should show up as "KILLED" or "CANCELLED" in the Spark UI
[ https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3835. - Resolution: Incomplete > Spark applications that are killed should show up as "KILLED" or "CANCELLED" > in the Spark UI > > > Key: SPARK-3835 > URL: https://issues.apache.org/jira/browse/SPARK-3835 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.1.0 >Reporter: Matt Cheah >Priority: Major > Labels: UI, bulk-closed > > Spark applications that crash or are killed are listed as FINISHED in the > Spark UI. > It looks like the Master only passes back a list of "Running" applications > and a list of "Completed" applications, All of the applications under > "Completed" have status "FINISHED", however if they were killed manually they > should show "CANCELLED", or if they failed they should read "FAILED". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4489) JavaPairRDD.collectAsMap from checkpoint RDD may fail with ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4489. - Resolution: Incomplete > JavaPairRDD.collectAsMap from checkpoint RDD may fail with ClassCastException > - > > Key: SPARK-4489 > URL: https://issues.apache.org/jira/browse/SPARK-4489 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.1.0 >Reporter: Christopher Ng >Priority: Major > Labels: bulk-closed > > Calling collectAsMap() on a JavaPairRDD reconstructed from a checkpoint fails > with a ClassCastException: > Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; > cannot be cast to [Lscala.Tuple2; > at > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:595) > at > org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:569) > at org.facboy.spark.CheckpointBug.main(CheckpointBug.java:46) > Code sample reproducing the issue: > https://gist.github.com/facboy/8387e950ffb0746a8272 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
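A hedged driver-side workaround sketch for the failure above, written in Scala for brevity and assuming the pair RDD is small enough to collect; it sidesteps the PairRDDFunctions.collectAsMap path that throws the ClassCastException by building the map from a plain collect().
{code}
import org.apache.spark.rdd.RDD

// Workaround sketch: avoid collectAsMap() on the checkpoint-restored RDD and assemble
// the driver-side map from the collected tuples instead.
def collectAsLocalMap[K, V](pairs: RDD[(K, V)]): Map[K, V] =
  pairs.collect().toMap
{code}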
[jira] [Resolved] (SPARK-5091) Hooks for PySpark tasks
[ https://issues.apache.org/jira/browse/SPARK-5091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5091. - Resolution: Incomplete > Hooks for PySpark tasks > --- > > Key: SPARK-5091 > URL: https://issues.apache.org/jira/browse/SPARK-5091 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Davies Liu >Priority: Major > Labels: bulk-closed > > Currently, it's not convenient to add a package to PYTHONPATH on the executors (we > do not assume the environments of the driver and executors are identical). > It would be nice to have a hook called before/after every task; users could then > manipulate sys.path with pre-task hooks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3380) DecisionTree: overflow and precision in aggregation
[ https://issues.apache.org/jira/browse/SPARK-3380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3380. - Resolution: Incomplete > DecisionTree: overflow and precision in aggregation > --- > > Key: SPARK-3380 > URL: https://issues.apache.org/jira/browse/SPARK-3380 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > Labels: bulk-closed > > DecisionTree does not check for overflows or loss of precision while > aggregating sufficient statistics (binAggregates). It uses Double, which may > be a problem for DecisionTree regression since the variance calculation could > blow up. At the least, it could check for overflow and renormalize as needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3153) shuffle will run out of space when disks have different free space
[ https://issues.apache.org/jira/browse/SPARK-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3153. - Resolution: Incomplete > shuffle will run out of space when disks have different free space > -- > > Key: SPARK-3153 > URL: https://issues.apache.org/jira/browse/SPARK-3153 > Project: Spark > Issue Type: Bug > Components: Shuffle >Reporter: Davies Liu >Priority: Major > Labels: bulk-closed > > If we have several disks in SPARK_LOCAL_DIRS, and one of them is much smaller > than the others (maybe added by mistake, or a special disk such as an SSD), then the > shuffle will run out of space on this smaller disk. > PySpark also has this issue during spilling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4868) Twitter DStream.map() throws "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4868. - Resolution: Incomplete > Twitter DStream.map() throws "Task not serializable" > > > Key: SPARK-4868 > URL: https://issues.apache.org/jira/browse/SPARK-4868 > Project: Spark > Issue Type: Bug > Components: DStreams, Spark Shell >Affects Versions: 1.1.1, 1.2.0 > Environment: * Spark 1.1.1 or 1.2.0 > * EC2 cluster with 1 slave spun up using {{spark-ec2}} > * twitter4j 3.0.3 > * {{spark-shell}} called with {{--jars}} argument to load > {{spark-streaming-twitter_2.10-1.0.0.jar}} as well as all the twitter4j jars. >Reporter: Nicholas Chammas >Priority: Minor > Labels: bulk-closed > > _(Continuing the discussion [started here on the Spark user > list|http://apache-spark-user-list.1001560.n3.nabble.com/NotSerializableException-in-Spark-Streaming-td5725.html].)_ > The following Spark Streaming code throws a serialization exception I do not > understand. > {code} > import twitter4j.auth.{Authorization, OAuthAuthorization} > import twitter4j.conf.ConfigurationBuilder > import org.apache.spark.streaming.{Seconds, Minutes, StreamingContext} > import org.apache.spark.streaming.twitter.TwitterUtils > def getAuth(): Option[Authorization] = { > System.setProperty("twitter4j.oauth.consumerKey", "consumerKey") > System.setProperty("twitter4j.oauth.consumerSecret", "consumerSecret") > System.setProperty("twitter4j.oauth.accessToken", "accessToken") > System.setProperty("twitter4j.oauth.accessTokenSecret", "accessTokenSecret") > Some(new OAuthAuthorization(new ConfigurationBuilder().build())) > } > def noop(a: Any): Any = { > a > } > val ssc = new StreamingContext(sc, Seconds(5)) > val liveTweetObjects = TwitterUtils.createStream(ssc, getAuth()) > val liveTweets = liveTweetObjects.map(_.getText) > liveTweets.map(t => noop(t)).print() // exception here > ssc.start() > {code} > So before I even start the StreamingContext, I get the following stack trace: > {code} > scala> liveTweets.map(t => noop(t)).print() > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) > at org.apache.spark.SparkContext.clean(SparkContext.scala:1264) > at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:438) > at $iwC$$iwC$$iwC$$iwC.(:27) > at $iwC$$iwC$$iwC.(:32) > at $iwC$$iwC.(:34) > at $iwC.(:36) > at (:38) > at .(:42) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780) > at 
org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633) > at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.in
[jira] [Resolved] (SPARK-4653) DAGScheduler refactoring and cleanup
[ https://issues.apache.org/jira/browse/SPARK-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4653. - Resolution: Incomplete > DAGScheduler refactoring and cleanup > > > Key: SPARK-4653 > URL: https://issues.apache.org/jira/browse/SPARK-4653 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Labels: bulk-closed > > This is an umbrella JIRA for DAGScheduler refactoring and cleanup. Please > comment or open sub-issues if you have refactoring suggestions that should > fall under this umbrella. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3717. - Resolution: Incomplete > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. > * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. > ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. > * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. > * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle data and training subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
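A small arithmetic check of the example in the ticket above, written as a hedged Scala sketch. Note the ticket's own estimates plug in the number of levels (6) on the feature-partitioning side and 2^D with D=5 on the instance-partitioning side; those numbers are reproduced as-is rather than "corrected".
{code}
// Reproduce the ticket's cost estimates for N = 2e6 instances, M = 3500 features,
// B = 100 bins, C = 5 classes, depth 5 (6 levels).
val (levels, n, m, b, c) = (6, 2e6, 3500.0, 100.0, 5.0)
val featurePartitionCost  = levels * (m * 8 + n)            // ~1.2e7, matches the ticket
val instancePartitionCost = math.pow(2, 5) * m * b * c * 8  // ~4.5e8, matches the ticket
println(f"$featurePartitionCost%.3g vs $instancePartitionCost%.3g")
{code}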
[jira] [Resolved] (SPARK-5506) java.lang.ClassCastException using lambda expressions in combination of spark and Servlet
[ https://issues.apache.org/jira/browse/SPARK-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5506. - Resolution: Incomplete > java.lang.ClassCastException using lambda expressions in combination of spark > and Servlet > - > > Key: SPARK-5506 > URL: https://issues.apache.org/jira/browse/SPARK-5506 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 1.2.0 > Environment: spark server: Ubuntu 14.04 amd64 > $ java -version > java version "1.8.0_25" > Java(TM) SE Runtime Environment (build 1.8.0_25-b17) > Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode) >Reporter: Milad Khajavi >Priority: Major > Labels: bulk-closed > > I'm trying to build a web API for my Apache spark jobs using sparkjava.com > framework. My code is: > @Override > public void init() { > get("/hello", > (req, res) -> { > String sourcePath = "hdfs://spark:54310/input/*"; > SparkConf conf = new SparkConf().setAppName("LineCount"); > conf.setJars(new String[] { > "/home/sam/resin-4.0.42/webapps/test.war" }); > File configFile = new File("config.properties"); > String sparkURI = "spark://hamrah:7077"; > conf.setMaster(sparkURI); > conf.set("spark.driver.allowMultipleContexts", "true"); > JavaSparkContext sc = new JavaSparkContext(conf); > @SuppressWarnings("resource") > JavaRDD log = sc.textFile(sourcePath); > JavaRDD lines = log.filter(x -> { > return true; > }); > return lines.count(); > }); > } > If I remove the lambda expression or put it inside a simple jar rather than a > web service (somehow a Servlet) it will run without any error. But using a > lambda expression inside a Servlet will result this exception: > 15/01/28 10:36:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > hamrah): java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.f$1 of type > org.apache.spark.api.java.function.Function in instance of > org.apache.spark.api.java.JavaRDD$$anonfun$filter$1 > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2089) > at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1999) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > P.S: I tried combination of jersey and javaspark with jetty, tomcat and resin > and all of them led me to the same result. > Here the same issue: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-java-lang-ClassCastException-SerializedLambda-to-org-apache-spark-api-java-function-Fu1-tt21261.html > This is my colleague question in stackoverflow: > http://stackoverflow.com/questions/28186607/java-lang-classcastexception-using-lambda-expressions-in-spark-job-on-remote-ser -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e
[jira] [Resolved] (SPARK-5497) start-all script not working properly on Standalone HA cluster (with Zookeeper)
[ https://issues.apache.org/jira/browse/SPARK-5497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5497. - Resolution: Incomplete > start-all script not working properly on Standalone HA cluster (with > Zookeeper) > --- > > Key: SPARK-5497 > URL: https://issues.apache.org/jira/browse/SPARK-5497 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.2.0 >Reporter: Roque Vassal'lo >Priority: Major > Labels: Configuration, Deployment, Spark, bulk-closed, start > > I have configured a Standalone HA cluster with Zookeeper with: > - 3 Zookeeper nodes > - 2 Spark master nodes (1 alive and 1 in standby mode) > - 2 Spark slave nodes > While executing start-all.sh on each master, it will start the master and > start a worker on each configured slave. > If alive master goes down, those worker are supposed to reconfigure > themselves to use the new active master automatically. > I have noticed that the spark-env property SPARK_MASTER_IP is used in both > called scripts, start-master and start-slaves. > The problem is that if you configure SPARK_MASTER_IP with the active master > ip, when it goes down, workers don't reassign themselves to the new active > master. > And if you configure SPARK_MASTER_IP with the masters cluster route (well, an > approximation, because you have to write master's port in all-but-last ips, > that is "master1:7077,master2", in order to make it work), slaves start > properly but master doesn't. > So, the start-master script needs SPARK_MASTER_IP property to contain its ip > in order to start master properly; and start-slaves script needs > SPARK_MASTER_IP property to contain the masters cluster ips (that is > "master1:7077,master2") > To test that idea, I have modified start-slaves and spark-env scripts on > master nodes. > On spark-env.sh, I have set SPARK_MASTER_IP property to master's own ip on > each master node (that is, on master node 1, SPARK_MASTER_IP=master1; and on > master node 2, SPARK_MASTER_IP=master2) > On spark-env.sh, I have added a new property SPARK_MASTER_CLUSTER_IP with the > pseudo-masters-cluster-ips (SPARK_MASTER_CLUSTER_IP=master1:7077,master2) on > both masters. > On start-slaves.sh, I have modified all references to SPARK_MASTER_IP to > SPARK_MASTER_CLUSTER_IP. > I have tried that and it works great! When active master node goes down, all > workers reassign themselves to the new active node. > Maybe there is a better fix for this issue. > Hope this quick-fix idea can help. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5685) Show warning when users open text files compressed with non-splittable algorithms like gzip
[ https://issues.apache.org/jira/browse/SPARK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5685. - Resolution: Incomplete > Show warning when users open text files compressed with non-splittable > algorithms like gzip > --- > > Key: SPARK-5685 > URL: https://issues.apache.org/jira/browse/SPARK-5685 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nicholas Chammas >Priority: Minor > Labels: bulk-closed > > This is a usability or user-friendliness issue. > It's extremely common for people to load a text file compressed with gzip, > process it, and then wonder why only 1 core in their cluster is doing any > work. > Some examples: > * http://stackoverflow.com/q/28127119/877069 > * http://stackoverflow.com/q/27531816/877069 > I'm not sure how this problem can be generalized, but at the very least it > would be helpful if Spark displayed some kind of warning in the common case > when someone opens a gzipped file with {{sc.textFile}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
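A hedged illustration of the symptom the ticket above wants to warn about, with a made-up HDFS path: a single gzip file is not splittable, so the resulting RDD has one partition and one core does all the work.
{code}
// Hypothetical gzipped input; everything downstream of this textFile runs in a single
// task until the data is repartitioned (or stored in a splittable format).
val logs = sc.textFile("hdfs:///data/access_log.gz")
println(logs.partitions.length)    // typically 1 for a single .gz file
val spread = logs.repartition(64)  // common manual workaround after reading
{code}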
[jira] [Resolved] (SPARK-5045) Update FlumePollingReceiver to use updated Receiver API
[ https://issues.apache.org/jira/browse/SPARK-5045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5045. - Resolution: Incomplete > Update FlumePollingReceiver to use updated Receiver API > --- > > Key: SPARK-5045 > URL: https://issues.apache.org/jira/browse/SPARK-5045 > Project: Spark > Issue Type: Sub-task > Components: DStreams >Reporter: Tathagata Das >Assignee: Hari Shreedharan >Priority: Major > Labels: bulk-closed > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6462) UpdateStateByKey should allow inner join of new with old keys
[ https://issues.apache.org/jira/browse/SPARK-6462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6462. - Resolution: Incomplete > UpdateStateByKey should allow inner join of new with old keys > - > > Key: SPARK-6462 > URL: https://issues.apache.org/jira/browse/SPARK-6462 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 1.3.0 >Reporter: Andre Schumacher >Priority: Major > Labels: bulk-closed > > In a nutshell: provide an inner join instead of a cogroup for > updateStateByKey in StateDStream. > Details: > It is common to read data (say, weblog data) from a streaming source (say, > Kafka) and each time update the state of a relatively small number of keys. > If only the state changes need to be propagated to a downstream sink, then instead of > making the user program filter out unchanged state, the API could > provide this functionality directly (say, by adding an > updateStateChangesByKey method). > Note that this is related but not identical to: > https://issues.apache.org/jira/browse/SPARK-2629 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
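A hedged sketch of the status quo the ticket above wants to improve, with made-up types and stream names: with updateStateByKey the user has to carry a "changed" flag in the state and filter on it to emit only the keys touched in the current batch.
{code}
import org.apache.spark.streaming.dstream.DStream

case class Counter(total: Long, changedThisBatch: Boolean)  // hypothetical state type

// `events` is a hypothetical DStream of (key, increment) pairs from Kafka or similar.
def changedStateOnly(events: DStream[(String, Long)]): DStream[(String, Counter)] = {
  val updated = events.updateStateByKey[Counter] { (newValues: Seq[Long], old: Option[Counter]) =>
    val prev = old.map(_.total).getOrElse(0L)
    Some(Counter(prev + newValues.sum, changedThisBatch = newValues.nonEmpty))
  }
  // The manual filtering a built-in updateStateChangesByKey could make unnecessary.
  updated.filter { case (_, state) => state.changedThisBatch }
}
{code}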
[jira] [Resolved] (SPARK-4540) Improve Executor ID Logging
[ https://issues.apache.org/jira/browse/SPARK-4540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4540. - Resolution: Incomplete > Improve Executor ID Logging > --- > > Key: SPARK-4540 > URL: https://issues.apache.org/jira/browse/SPARK-4540 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Arun Ahuja >Priority: Minor > Labels: bulk-closed > > A few things that would be useful here: > - An executor should log which executor ID it is running as; AFAICT this does not > happen, and only the driver reports that executor 10 is running on xyz.host.com > - For YARN, when an executor fails, in addition to reporting the executor ID > of the lost executor, report the container ID as well > The latter is useful when multiple executors run on the same machine, where > it may be easier to find the container directly than via the executor ID or > host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3134) Update block locations asynchronously in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3134. - Resolution: Incomplete > Update block locations asynchronously in TorrentBroadcast > - > > Key: SPARK-3134 > URL: https://issues.apache.org/jira/browse/SPARK-3134 > Project: Spark > Issue Type: Sub-task > Components: Block Manager >Reporter: Reynold Xin >Priority: Major > Labels: bulk-closed > > Once the TorrentBroadcast gets the data blocks, it needs to tell the master > the new location. We should make the location update non-blocking to reduce > roundtrips we need to launch tasks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5674) Spark Job Explain Plan Proof of Concept
[ https://issues.apache.org/jira/browse/SPARK-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5674. - Resolution: Incomplete > Spark Job Explain Plan Proof of Concept > --- > > Key: SPARK-5674 > URL: https://issues.apache.org/jira/browse/SPARK-5674 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis >Priority: Major > Labels: bulk-closed > > This is just a prototype of creating an explain plan for a job. Code can be > found here: https://github.com/ksakellis/spark/tree/kostas-explainPlan-poc > The code was written very quickly and so doesn't have any comments, tests and > is probably buggy - hence it being a proof of concept. > *How to Use* > # {code}sc.explainOn <=> sc.explainOff{code} This will generate the explain > plain and print it in the logs > # {code}sc.enableExecution <=> sc.disableExecution{code} This will disable > executing of the job. > Using these two knobs a user can choose to print the explain plan and/or > disable the running of the job if they only want to see the plan. > *Implementation* > This is only a prototype and it is by no means production ready. The code is > pretty hacky in places and a few shortcuts were made just to get the > prototype working. > The most interesting part of this commit is in the ExecutionPlanner.scala > class. This class creates its own private instance of the DAGScheduler and > passes into it a NoopTaskScheduler. The NoopTaskScheduler receives the > created TaskSets from the DAGScheduler and records the stages -> tasksets. > The NoopTaskScheduler also creates fake CompletionsEvents and sends them to > the DAGScheduler to move the scheduling along. It is done this way so that we > can use the DAGScheduler unmodified thus reducing code divergence. > The rest of the code is about processing the information produced by the > ExecutionPlanner, creating a DAG with a bunch of metadata and printing it as > a pretty ascii drawing. For drawing the DAG, > https://github.com/mdr/ascii-graphs is used. This was just easier again to > prototype. > *How is this different than RDD#toDebugString?* > The execution planner runs the job through the entire DAGScheduler so we can > collect some metrics that are not presently available in the debugString. For > example, we can report the binary size of the task which might be important > if the closures are referencing large object. > In addition, because we execute the scheduler code from an action, we can get > a more accurate picture of where the stage boundaries and dependencies. An > action such ask treeReduce will generate a number of stages that you can't > get just by doing .toDebugString on the rdd. > *Limitations of this Implementation* > Because this is a prototype there are is a lot of lame stuff in this commit. > # All of the code in SparkContext in particular sucks. This adds some code in > the runJob() call and when it gets the plan it just writes it to the INFO > log. We need to find a better way of exposing the plan to the caller so that > they can print it, analyze it etc. Maybe we can use implicits or something? > Not sure how best to do this yet. > # Some of the actions will return through exceptions because we are basically > faking a runJob(). If you want ot try this, it is best to just use count() > instead of say collect(). This will get fixed when we fix 1) > # Because the ExplainPlanner creates its own DAGScheduler, there currently is > no way to map the real stages to the "explainPlan" stages. 
So if a user turns > on explain plan, and doesn't disable execution, we can't automatically add > more metrics to the explain plan as they become available. The stageId in the > plan and the stageId in the real scheduler will be different. This is > important for when we add it to the webUI and users can track progress on the > DAG > # We are using https://github.com/mdr/ascii-graphs to draw the DAG - not sure > if we want to depend on that project. > *Next Steps* > # It would be good to get a few people to take a look at the code > specifically at how the plan gets generated. Clone the package and give it a > try with some of your jobs > # If the approach looks okay overall, I can put together a mini design doc > and add some answers to the above limitations of this approach. > #Feedback most welcome. > *Example Code:* > {code} > sc.explainOn > sc.disableExecution > val rdd = sc.parallelize(1 to 10, 4).map(key => (key.toString, key)) > val rdd2 = sc.parallelize(1 to 5, 2).map(key => (key.toString, key)) > rdd.join(rdd2) >.count() > {code} > *Example Output:* > {noformat} > EXPLAIN PLAN: > +---+ +---+ > | | | | > |Stage: 0 @ map | |Stage: 1 @ m
[jira] [Resolved] (SPARK-6165) Aggregate and reduce should be able to work with very large number of tasks.
[ https://issues.apache.org/jira/browse/SPARK-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6165. - Resolution: Incomplete > Aggregate and reduce should be able to work with very large number of tasks. > > > Key: SPARK-6165 > URL: https://issues.apache.org/jira/browse/SPARK-6165 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Mridul Muralidharan >Priority: Minor > Labels: bulk-closed > > To prevent data from workers causing OOM at the master, we have the property > 'spark.driver.maxResultSize'. > But the OOM at the master can be due to two reasons: > a) Data being sent from workers is too large, causing OOM at the master. > b) A large number of moderate (to low) sized results being sent to the master, causing > OOM. > (For example: 500k tasks, 1k each) > spark.driver.maxResultSize protects against both, but (b) should be handled > more gracefully by the master: for example, spool it to disk, aggregate without > waiting for the entire result set to be fetched, etc. > Currently we are forced to use treeReduce and co. to work around this problem, > adding to the latency of jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
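For reference, a hedged sketch of the treeReduce workaround mentioned in the last sentence of the ticket above, with a hypothetical RDD: partial results are combined in extra intermediate rounds, so the driver never has to hold one result per task at once.
{code}
import org.apache.spark.rdd.RDD

// With depth > 1, executors pre-combine results in intermediate stages, trading extra
// latency (the cost the ticket complains about) for a smaller load on the driver.
def safeSum(values: RDD[Long]): Long =
  values.treeReduce((a, b) => a + b, depth = 3)
{code}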
[jira] [Resolved] (SPARK-2581) complete or withdraw visitedStages optimization in DAGScheduler’s stageDependsOn
[ https://issues.apache.org/jira/browse/SPARK-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2581. - Resolution: Incomplete > complete or withdraw visitedStages optimization in DAGScheduler’s > stageDependsOn > > > Key: SPARK-2581 > URL: https://issues.apache.org/jira/browse/SPARK-2581 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Aaron Staple >Priority: Minor > Labels: bulk-closed > > Right now the visitedStages HashSet is populated with stages, but never > queried to limit examination of previously visited stages. It may make sense > to check whether a mapStage has been visited previously before visiting it > again, as in the nearby visitedRdds check. Or it may be that the existing > visitedRdds check sufficiently optimizes this function, and visitedStages can > simply be removed. > See discussion here: > https://github.com/apache/spark/pull/1362#discussion-diff-15018046L1107 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun
[ https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5490. - Resolution: Incomplete > KMeans costs can be incorrect if tasks need to be rerun > --- > > Key: SPARK-5490 > URL: https://issues.apache.org/jira/browse/SPARK-5490 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Priority: Major > Labels: bulk-closed, clustering > > KMeans uses accumulators inside ShuffleMapTasks to compute the cost of a clustering at each > iteration. > Each time a ShuffleMapTask completes, it increments the accumulators at the > driver; if a task runs twice because of failures, the accumulators get > incremented twice. This means that a task's contribution to the cost > can end up being double-counted. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
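A hedged, KMeans-free illustration of the pattern the ticket above describes, using the Spark 1.x accumulator API: incrementing an accumulator inside a transformation means every task attempt adds to it, so a retried task contributes twice.
{code}
// Toy example of the double-counting pitfall: the accumulator update lives inside a
// transformation, so re-executed tasks (failures, speculative attempts) add their
// contribution again. KMeans' per-iteration cost accumulator has the same shape.
val cost = sc.accumulator(0.0)
val data = sc.parallelize(1 to 1000000, 100)
val centered = data.map { x => cost += x.toDouble; x - 1 }
centered.count()
println(cost.value)  // may exceed the true sum if any task attempt was retried
{code}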
[jira] [Resolved] (SPARK-5142) Possibly data may be ruined in Spark Streaming's WAL mechanism.
[ https://issues.apache.org/jira/browse/SPARK-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5142. - Resolution: Incomplete > Possibly data may be ruined in Spark Streaming's WAL mechanism. > --- > > Key: SPARK-5142 > URL: https://issues.apache.org/jira/browse/SPARK-5142 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.2.0 >Reporter: Saisai Shao >Priority: Major > Labels: bulk-closed > > Currently in Spark Streaming's WAL manager, data is written into HDFS > with multiple retries on failure. Because there is no transactional > guarantee, previously partially-written data is not rolled back and the retried > data is appended after it; this can corrupt the file and make the > WriteAheadLogReader fail when reading it back. > Firstly, I think this problem is hard to fix because HDFS does not support a > truncate operation (HDFS-3107) or random writes at a specific offset. > Secondly, I think that when we hit such a write exception it is better not to retry, > since retrying corrupts the file and makes reads fail. > Sorry if I misunderstand anything. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5480. - Resolution: Incomplete > GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException: > --- > > Key: SPARK-5480 > URL: https://issues.apache.org/jira/browse/SPARK-5480 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0, 1.3.1 > Environment: Yarn client >Reporter: Stephane Maarek >Priority: Major > Labels: bulk-closed > > Running the following code: > val subgraph = graph.subgraph ( > vpred = (id,article) => //working predicate) > ).cache() > println( s"Subgraph contains ${subgraph.vertices.count} nodes and > ${subgraph.edges.count} edges") > val prGraph = subgraph.staticPageRank(5).cache > val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) { > (v, title, rank) => (rank.getOrElse(0.0), title) > } > titleAndPrGraph.vertices.top(13) { > Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1) > }.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1)) > Returns a graph with 5000 nodes and 4000 edges. > Then it crashes during the PageRank with the following: > 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage > 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes) > 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage > 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1 > at > org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64) > at > org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91) > at > org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75) > at > org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110) > at > org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108) > at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.
[jira] [Resolved] (SPARK-4885) Enable fetched blocks to exceed 2 GB
[ https://issues.apache.org/jira/browse/SPARK-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4885. - Resolution: Incomplete > Enable fetched blocks to exceed 2 GB > > > Key: SPARK-4885 > URL: https://issues.apache.org/jira/browse/SPARK-4885 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Sandy Ryza >Priority: Major > Labels: bulk-closed > > {code} > 14/12/18 09:53:13 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught > exception in thread Thread[handle-message-executor-12,5,main] > java.lang.OutOfMemoryError: Requested array size exceeds VM limit > at java.util.Arrays.copyOf(Arrays.java:2271) > at > java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) > at > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) > at > java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at > java.io.BufferedOutputStream.write(BufferedOutputStream.java:126) > at com.esotericsoftware.kryo.io.Output.flush(Output.java:155) > at > com.esotericsoftware.kryo.io.Output.require(Output.java:135) > at > com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477) > at > com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:212) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:200) > at > com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) > at > com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) > at > org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:119) > at > org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110) > at > org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1054) > at > org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1063) > at > org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:154) > at > org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:428) > at > org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:394) > at > org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:100) > at > org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:79) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1921) Allow duplicate jar files among the app jar and secondary jars in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-1921. - Resolution: Incomplete > Allow duplicate jar files among the app jar and secondary jars in > yarn-cluster mode > --- > > Key: SPARK-1921 > URL: https://issues.apache.org/jira/browse/SPARK-1921 > Project: Spark > Issue Type: Sub-task > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Xiangrui Meng >Priority: Minor > Labels: bulk-closed > > In yarn-cluster mode, jars are uploaded to a staging folder on hdfs. If there > are duplicates among the app jar and secondary jars, there will be overwrites > that cause inconsistent timestamps. I saw the following message: > {code} > Application application_1400965808642_0021 failed 2 times due to AM Container > for appattempt_1400965808642_0021_02 exited with exitCode: -1000 due to: > Resource > hdfs://localhost.localdomain:8020/user/cloudera/.sparkStaging/application_1400965808642_0021/app_2.10-0.1.jar > changed on src filesystem (expected 1400998721965, was 1400998723123 > {code} > Tested on a CDH-5 quickstart VM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3631) Add docs for checkpoint usage
[ https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3631. - Resolution: Incomplete > Add docs for checkpoint usage > - > > Key: SPARK-3631 > URL: https://issues.apache.org/jira/browse/SPARK-3631 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Andrew Ash >Priority: Major > Labels: bulk-closed > > We should include general documentation on using checkpoints. Right now the > docs only cover checkpoints in the Spark Streaming use case which is slightly > different from Core. > Some content to consider for inclusion from [~brkyvz]: > {quote} > If you set the checkpointing directory however, the intermediate state of the > RDDs will be saved in HDFS, and the lineage will pick off from there. > You won't need to keep the shuffle data before the checkpointed state, > therefore those can be safely removed (will be removed automatically). > However, checkpoint must be called explicitly as in > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291 > ,just setting the directory will not be enough. > {quote} > {quote} > Yes, writing to HDFS is more expensive, but I feel it is still a small price > to pay when compared to having a Disk Space Full error three hours in > and having to start from scratch. > The main goal of checkpointing is to truncate the lineage. Clearing up > shuffle writes come as a bonus to checkpointing, it is not the main goal. The > subtlety here is that .checkpoint() is just like .cache(). Until you call an > action, nothing happens. Therefore, if you're going to do 1000 maps in a > row and you don't want to checkpoint in the meantime until a shuffle happens, > you will still get a StackOverflowError, because the lineage is too long. > I went through some of the code for checkpointing. As far as I can tell, it > materializes the data in HDFS, and resets all its dependencies, so you start > a fresh lineage. My understanding would be that checkpointing still should be > done every N operations to reset the lineage. However, an action must be > performed before the lineage grows too long. > {quote} > A good place to put this information would be at > https://spark.apache.org/docs/latest/programming-guide.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
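A minimal sketch of the checkpoint-then-action pattern described in SPARK-3631 above (illustrative only, not Spark documentation; the local master, the /tmp checkpoint directory, and the every-10-iterations cadence are assumptions):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]"))
    // Setting the directory alone does nothing; checkpoint() must be called explicitly,
    // and an action must run before the lineage grows too long.
    sc.setCheckpointDir("/tmp/spark-checkpoints")
    var rdd = sc.parallelize(1 to 1000)
    for (i <- 1 to 100) {
      rdd = rdd.map(_ + 1)
      if (i % 10 == 0) {
        rdd.checkpoint() // like cache(): lazy until an action runs
        rdd.count()      // the action materializes the checkpoint and truncates the lineage
      }
    }
    sc.stop()
  }
}
{code}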
[jira] [Resolved] (SPARK-6378) srcAttr in graph.triplets don't update when the size of graph is huge
[ https://issues.apache.org/jira/browse/SPARK-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6378. - Resolution: Incomplete > srcAttr in graph.triplets don't update when the size of graph is huge > - > > Key: SPARK-6378 > URL: https://issues.apache.org/jira/browse/SPARK-6378 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.1 >Reporter: zhangzhenyue >Priority: Major > Labels: bulk-closed > Attachments: TripletsViewDonotUpdate.scala > > > When the size of the graph is huge (0.2 billion vertices, 6 billion edges), the > srcAttr and dstAttr in graph.triplets don't update after > Graph.outerJoinVertices changes the vertex data. > The code and the log are as follows: > {quote} > g = graph.outerJoinVertices()... > g.vertices.count() > g.edges.count() > println("example edge " + g.triplets.filter(e => e.srcId == > 51L).collect() > .map(e =>(e.srcId + ":" + e.srcAttr + ", " + e.dstId + ":" + > e.dstAttr)).mkString("\n")) > println("example vertex " + g.vertices.filter(e => e._1 == > 51L).collect() > .map(e => (e._1 + "," + e._2)).mkString("\n")) > {quote} > The result: > {quote} > example edge 51:0, 2467451620:61 > 51:0, 1962741310:83 // attr of vertex 51 is 0 in > Graph.triplets > example vertex 51,2 // attr of vertex 51 is 2 in > Graph.vertices > {quote} > When the graph is smaller (10 million vertices), the code is OK: the triplets > update when the vertex data is changed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6415) Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills the app
[ https://issues.apache.org/jira/browse/SPARK-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6415. - Resolution: Incomplete > Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills > the app > - > > Key: SPARK-6415 > URL: https://issues.apache.org/jira/browse/SPARK-6415 > Project: Spark > Issue Type: Improvement > Components: DStreams >Reporter: Hari Shreedharan >Priority: Major > Labels: bulk-closed > > Of course, this would have to be done as a configurable param, but such a > fail-fast is useful; otherwise it is painful to figure out what is happening when > there are cascading failures. In some cases, the SparkContext shuts down and > streaming keeps scheduling jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6148) cachedDataSourceTables may store outdated metadata if the table is updated from another HiveContext
[ https://issues.apache.org/jira/browse/SPARK-6148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6148. - Resolution: Incomplete > cachedDataSourceTables may store outdated metadata if the table is updated > from another HiveContext > --- > > Key: SPARK-6148 > URL: https://issues.apache.org/jira/browse/SPARK-6148 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Major > Labels: bulk-closed > > If we have two HiveContexts and we change a table through one (e.g. append or > overwrite with new data), cachedDataSourceTables in the other one will not be > updated automatically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6637) Test lambda weighting in implicit ALS
[ https://issues.apache.org/jira/browse/SPARK-6637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6637. - Resolution: Incomplete > Test lambda weighting in implicit ALS > - > > Key: SPARK-6637 > URL: https://issues.apache.org/jira/browse/SPARK-6637 > Project: Spark > Issue Type: Task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Priority: Major > Labels: bulk-closed > > As discussed on the user-list, there is some inconsistency between 1.2 and > 1.3 on how we scale lambda in the implicit formulation. In 1.2 we scale by > the number of explicit ratings, while in 1.3 we scale by the number of > users/items. > https://www.mail-archive.com/user@spark.apache.org/msg24817.html > It is worth testing which is more proper for implicit feedback. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6312) ChiSqTest should check for too few counts
[ https://issues.apache.org/jira/browse/SPARK-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6312. - Resolution: Incomplete > ChiSqTest should check for too few counts > - > > Key: SPARK-6312 > URL: https://issues.apache.org/jira/browse/SPARK-6312 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > Labels: bulk-closed > > ChiSqTest assumes that elements of the contingency matrix are large enough > (have enough counts) s.t. the central limit theorem kicks in. It would be > reasonable to do one or more of the following: > * Add a note in the docs about making sure there are a reasonable number of > instances being used (or counts in the contingency table entries, to be more > precise and account for skewed category distributions). > * Add a check in the code which could: > ** Log a warning message > ** Alter the p-value to make sure it indicates the test result is > insignificant -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
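A rough sketch of the low-count warning proposed in SPARK-6312 above (plain Scala, not MLlib code; the threshold of 5 expected counts per cell is the usual rule of thumb and is assumed here):
{code}
object ChiSqCountCheck {
  // Common rule-of-thumb threshold for the chi-squared approximation; assumed, not from MLlib.
  val minExpectedCount = 5.0

  def warnOnLowCounts(counts: Array[Array[Double]]): Unit = {
    val rowSums = counts.map(_.sum)
    val colSums = counts.transpose.map(_.sum)
    val total = rowSums.sum
    for (i <- counts.indices; j <- counts(i).indices) {
      val expected = rowSums(i) * colSums(j) / total
      if (expected < minExpectedCount) {
        Console.err.println(
          s"WARN: expected count $expected in cell ($i, $j) is below $minExpectedCount; " +
            "the chi-squared p-value may be unreliable")
      }
    }
  }
}
{code}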
[jira] [Resolved] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3306. - Resolution: Incomplete > Addition of external resource dependency in executors > - > > Key: SPARK-3306 > URL: https://issues.apache.org/jira/browse/SPARK-3306 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Yan >Priority: Major > Labels: bulk-closed > > Currently, Spark executors only support static and read-only external > resources of side files and jar files. With emerging disparate data sources, > there is a need to support more versatile external resources, such as > connections to data sources, to facilitate efficient data access to the > sources. For one, the JDBCRDD, with some modifications, could benefit from > this feature by reusing JDBC connections previously established in the same Spark > context. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5748) Improve Vectors.sqdist implementation
[ https://issues.apache.org/jira/browse/SPARK-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5748. - Resolution: Incomplete > Improve Vectors.sqdist implementation > - > > Key: SPARK-5748 > URL: https://issues.apache.org/jira/browse/SPARK-5748 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: bulk-closed > > Saw some regression of k-means in the 1.3 performance tests. I think the problem > is that the sqdist implementation is slow, but I am not very sure. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5372) Change the default storage level of window operators
[ https://issues.apache.org/jira/browse/SPARK-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5372. - Resolution: Incomplete > Change the default storage level of window operators > > > Key: SPARK-5372 > URL: https://issues.apache.org/jira/browse/SPARK-5372 > Project: Spark > Issue Type: Task > Components: DStreams >Affects Versions: 1.2.0 >Reporter: Saisai Shao >Priority: Minor > Labels: bulk-closed > > The current storage level of window operators is MEMORY_ONLY_SER; if the memory > is not enough to hold all the window data, the cached RDD will be discarded, > which leads to unexpected behavior. > Besides, the default storage level of input data is MEMORY_AND_DISK_SER_2, so it > is better to align with it and change the storage level of > window operators to MEMORY_AND_DISK_SER. > This change has no effect when memory is sufficient, so I'd propose changing > the default storage level to MEMORY_AND_DISK_SER. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
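Until such a default change lands, callers can already choose the storage level explicitly. A minimal sketch of overriding the window storage level (the socket source, window sizes, and local master are assumptions, not part of SPARK-5372):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowPersistSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("window-persist-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    // Override the MEMORY_ONLY_SER default on the windowed stream so blocks
    // spill to disk instead of being dropped when memory is tight.
    val windowed = lines
      .window(Seconds(30), Seconds(10))
      .persist(StorageLevel.MEMORY_AND_DISK_SER)
    windowed.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}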
[jira] [Resolved] (SPARK-6808) Checkpointing after zipPartitions results in NODE_LOCAL execution
[ https://issues.apache.org/jira/browse/SPARK-6808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6808. - Resolution: Incomplete > Checkpointing after zipPartitions results in NODE_LOCAL execution > - > > Key: SPARK-6808 > URL: https://issues.apache.org/jira/browse/SPARK-6808 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 1.3.0 > Environment: EC2 Ubuntu r3.8xlarge machines >Reporter: Xinghao Pan >Priority: Minor > Labels: bulk-closed > > I'm encountering a weird issue where a simple iterative zipPartitions is > PROCESS_LOCAL before checkpointing, but turns NODE_LOCAL for all iterations > after checkpointing. More often than not, tasks are fetching remote blocks > from the network, leading to a 10x increase in runtime. > Here's an example snippet of code: > var R : RDD[(Long,Int)] > = sc.parallelize((0 until numPartitions), numPartitions) > .mapPartitions(_ => new Array[(Long,Int)](1000).map(i => > (0L,0)).toSeq.iterator).cache() > sc.setCheckpointDir(checkpointDir) > var iteration = 0 > while (iteration < 50){ > R = R.zipPartitions(R)((x,y) => x).cache() > if ((iteration+1) % checkpointIter == 0) R.checkpoint() > R.foreachPartition(_ => {}) > iteration += 1 > } > I've also tried to unpersist the old RDDs, and increased spark.locality.wait, > but neither helps. > Strangely, adding a simple identity map > R = R.map(x => x).cache() > after the zipPartitions appears to partially mitigate the issue. > The problem was originally triggered when I attempted to checkpoint after > doing joinVertices in GraphX, but the above example shows that the issue is > in Spark core too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1863) Allowing user jars to take precedence over Spark jars does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-1863. - Resolution: Incomplete > Allowing user jars to take precedence over Spark jars does not work as > expected > --- > > Key: SPARK-1863 > URL: https://issues.apache.org/jira/browse/SPARK-1863 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: koert kuipers >Priority: Minor > Labels: bulk-closed > > See here: > http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-td5832.html > The issue seems to be that within ChildExecutorURLClassLoader, userClassLoader > has no visibility on classes managed by parentClassLoader because there is no > parent/child relationship. What this means is that if a class is loaded by > userClassLoader and it refers to a class loaded by parentClassLoader, you get > a NoClassDefFoundError. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2610) When spark.serializer is set as org.apache.spark.serializer.KryoSerializer, importing a method causes multiple spark applications creations
[ https://issues.apache.org/jira/browse/SPARK-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2610. - Resolution: Incomplete > When spark.serializer is set as org.apache.spark.serializer.KryoSerializer, > importing a method causes multiple spark applications creations > - > > Key: SPARK-2610 > URL: https://issues.apache.org/jira/browse/SPARK-2610 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.0.1 >Reporter: Yin Huai >Priority: Minor > Labels: bulk-closed > > To reproduce, set > {code} > spark.serializer org.apache.spark.serializer.KryoSerializer > {code} > in conf/spark-defaults.conf and launch a spark shell. > Then, execute > {code} > class X() { println("What!"); def y = 3 } > val x = new X > import x.y > case class Person(name: String, age: Int) > val serializer = org.apache.spark.serializer.Serializer.getSerializer(null) > val kryoSerializer = serializer.newInstance > val value = kryoSerializer.serialize(Person("abc", 1)) > kryoSerializer.deserialize(value): Person > // Once you execute this line, you will see ... > // What! > // What! > // res1: Person = Person(abc,1) > {code} > Basically, importing a method of a class causes the constructor of that class > to be called twice. > It affects our branch 1.0 and master. > For the master, you can use > {code} > val serializer = org.apache.spark.serializer.Serializer.getSerializer(None) > {code} > to get the serializer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6160) ChiSqSelector should keep test statistic info
[ https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6160. - Resolution: Incomplete > ChiSqSelector should keep test statistic info > - > > Key: SPARK-6160 > URL: https://issues.apache.org/jira/browse/SPARK-6160 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > Labels: bulk-closed > > It is useful to have the test statistics explaining selected features, but > these data are thrown out when constructing the ChiSqSelectorModel. The data > are expensive to recompute, so the ChiSqSelectorModel should store and expose > them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5046) Update KinesisReceiver to use updated Receiver API
[ https://issues.apache.org/jira/browse/SPARK-5046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5046. - Resolution: Incomplete > Update KinesisReceiver to use updated Receiver API > -- > > Key: SPARK-5046 > URL: https://issues.apache.org/jira/browse/SPARK-5046 > Project: Spark > Issue Type: Sub-task > Components: DStreams >Reporter: Tathagata Das >Priority: Major > Labels: bulk-closed > > Currently the KinesisReceiver is not reliable as it does not correctly > acknowledge the source. This task is to update the receiver to use the > updated Receiver API in SPARK-5042 and implement a reliable receiver. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5150) Strange implicit resolution behavior in Spark REPL
[ https://issues.apache.org/jira/browse/SPARK-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5150. - Resolution: Incomplete > Strange implicit resolution behavior in Spark REPL > -- > > Key: SPARK-5150 > URL: https://issues.apache.org/jira/browse/SPARK-5150 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Tobias Schlatter >Priority: Major > Labels: bulk-closed > > Consider the following Spark REPL session: > {code} > scala> def showInt(implicit x: Int) = println(x) > showInt: (implicit x: Int)Unit > scala> object IntHolder { implicit val myInt = 5 } > defined module IntHolder > scala> import IntHolder.myInt > import IntHolder.myInt > scala> showInt > 5 > scala> class A; showInt > :11: error: could not find implicit value for parameter x: Int > showInt > ^ > {code} > This was most likely caused by the fix to SPARK-2632 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5104) Distributed Representations of Sentences and Documents
[ https://issues.apache.org/jira/browse/SPARK-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5104. - Resolution: Incomplete > Distributed Representations of Sentences and Documents > -- > > Key: SPARK-5104 > URL: https://issues.apache.org/jira/browse/SPARK-5104 > Project: Spark > Issue Type: Wish > Components: ML, MLlib >Reporter: Guoqiang Li >Priority: Major > Labels: bulk-closed > > The Paper [Distributed Representations of Sentences and > Documents|http://arxiv.org/abs/1405.4053] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3750) Log ulimit settings at warning if they are too low
[ https://issues.apache.org/jira/browse/SPARK-3750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3750. - Resolution: Incomplete > Log ulimit settings at warning if they are too low > -- > > Key: SPARK-3750 > URL: https://issues.apache.org/jira/browse/SPARK-3750 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Priority: Minor > Labels: bulk-closed > > In recent versions of Spark the shuffle implementation is much more > aggressive about writing many files out to disk at once. Most linux kernels > have a default limit in the number of open files per process, and Spark can > exhaust this limit. The current hash-based shuffle implementation requires > as many files as the product of the map and reduce partition counts in a wide > dependency. > In order to reduce the errors we're seeing on the user list, we should > determine a value that is considered "too low" for normal operations and log > a warning on executor startup when that value isn't met. > 1. determine what ulimit is acceptable > 2. log when that value isn't met -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
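A rough sketch of the startup check proposed in SPARK-3750 above (not Spark code; it shells out to `ulimit -n`, and the 10,000-file threshold is purely illustrative, not the "acceptable value" the issue asks to determine):
{code}
import scala.sys.process._
import scala.util.Try

object UlimitCheck {
  // Illustrative threshold only; the issue asks for a properly determined value.
  val recommendedOpenFiles = 10000L

  // Read the soft open-files limit via the shell; None if it cannot be parsed (e.g. "unlimited").
  def softOpenFileLimit(): Option[Long] =
    Try(Seq("sh", "-c", "ulimit -n").!!.trim.toLong).toOption

  def warnIfTooLow(): Unit = softOpenFileLimit() match {
    case Some(limit) if limit < recommendedOpenFiles =>
      Console.err.println(
        s"WARN: open-file limit $limit is below $recommendedOpenFiles; " +
          "shuffle-heavy jobs may fail with 'Too many open files'")
    case _ => // unknown or high enough; nothing to log
  }
}
{code}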
[jira] [Resolved] (SPARK-3504) KMeans optimization: track distances and unmoved cluster centers across iterations
[ https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3504. - Resolution: Incomplete > KMeans optimization: track distances and unmoved cluster centers across > iterations > -- > > Key: SPARK-3504 > URL: https://issues.apache.org/jira/browse/SPARK-3504 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.2 >Reporter: Derrick Burns >Priority: Major > Labels: bulk-closed, clustering > > The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because it > recomputes all distances to all cluster centers on each iteration. In later > iterations of Lloyd's algorithm, points don't change clusters and clusters > don't move. > By 1) tracking which clusters move and 2) tracking for each point which > cluster it belongs to and the distance to that cluster, one can avoid > recomputing distances in many cases with very little increase in memory > requirements. > I implemented this new algorithm and the results were fantastic. Using 16 > c3.8xlarge machines on EC2, the clusterer converged in 13 iterations on > 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here > are the running times for the first 7 rounds: > 6 minutes and 42 seconds > 7 minutes and 7 seconds > 7 minutes 13 seconds > 1 minute 18 seconds > 30 seconds > 18 seconds > 12 seconds > Without this improvement, all rounds would have taken roughly 7 minutes, > resulting in Lloyd's iterations taking 7 * 13 = 91 minutes. In other words, > this improvement resulted in a reduction of roughly 75% in running time with > no loss of accuracy. > My implementation is a rewrite of the existing 1.0.2 implementation. It is > not a simple modification of the existing implementation. Please let me know > if you are interested in this new implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
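A simplified, single-machine sketch of the bookkeeping described in SPARK-3504 above (not the reporter's implementation and not MLlib code): cache each point's assigned center and distance; if that center did not move, only centers that moved can beat the cached distance, so only those are re-evaluated:
{code}
object FastAssignSketch {
  type Point = Array[Double]

  def sqdist(a: Point, b: Point): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  // One Lloyd assignment step reusing cached (centerIndex, distance) pairs.
  def assign(points: Array[Point],
             centers: Array[Point],
             moved: Array[Boolean],
             cached: Array[(Int, Double)]): Array[(Int, Double)] =
    points.indices.map { p =>
      val (oldC, oldD) = cached(p)
      if (moved(oldC)) {
        // Our assigned center moved, so the cached distance is stale: full scan for this point.
        var best = 0
        var bestD = sqdist(points(p), centers(0))
        for (c <- 1 until centers.length) {
          val d = sqdist(points(p), centers(c))
          if (d < bestD) { best = c; bestD = d }
        }
        (best, bestD)
      } else {
        // Unmoved centers kept their old distances, which were all >= oldD,
        // so only moved centers can improve on the cached assignment.
        var best = oldC
        var bestD = oldD
        for (c <- centers.indices if moved(c)) {
          val d = sqdist(points(p), centers(c))
          if (d < bestD) { best = c; bestD = d }
        }
        (best, bestD)
      }
    }.toArray
}
{code}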
[jira] [Resolved] (SPARK-799) Windows versions of the deploy scripts
[ https://issues.apache.org/jira/browse/SPARK-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-799. Resolution: Incomplete > Windows versions of the deploy scripts > -- > > Key: SPARK-799 > URL: https://issues.apache.org/jira/browse/SPARK-799 > Project: Spark > Issue Type: Bug > Components: Deploy, Windows >Reporter: Matei Zaharia >Priority: Major > Labels: Starter, bulk-closed > > Although the Spark daemons run fine on Windows with run.cmd, the deploy > scripts (bin/start-all.sh and such) don't do so unless you have Cygwin. It > would be nice to make .cmd versions of those. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4684) Add a script to run JDBC server on Windows
[ https://issues.apache.org/jira/browse/SPARK-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4684. - Resolution: Incomplete > Add a script to run JDBC server on Windows > -- > > Key: SPARK-4684 > URL: https://issues.apache.org/jira/browse/SPARK-4684 > Project: Spark > Issue Type: New Feature > Components: SQL, Windows >Reporter: Matei Zaharia >Assignee: Cheng Lian >Priority: Minor > Labels: bulk-closed > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5575. - Resolution: Incomplete > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov >Priority: Major > Labels: bulk-closed > > *Goal:* Implement various types of artificial neural networks > *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) > Having deep learning within Spark's ML library is a question of convenience. > Spark has broad analytic capabilities and it is useful to have deep learning > as one of these tools at hand. Deep learning is a model of choice for several > important modern use-cases, and Spark ML might want to cover them. > Eventually, it is hard to explain, why do we have PCA in ML but don't provide > Autoencoder. To summarize this, Spark should have at least the most widely > used deep learning models, such as fully connected artificial neural network, > convolutional network and autoencoder. Advanced and experimental deep > learning features might reside within packages or as pluggable external > tools. These 3 will provide a comprehensive deep learning set for Spark ML. > We might also include recurrent networks as well. > *Requirements:* > # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, > Layer, Error, Regularization, Forward and Backpropagation etc. should be > implemented as traits or interfaces, so they can be easily extended or > reused. Define the Spark ML API for deep learning. This interface is similar > to the other analytics tools in Spark and supports ML pipelines. This makes > deep learning easy to use and plug in into analytics workloads for Spark > users. > # Efficiency. The current implementation of multilayer perceptron in Spark is > less than 2x slower than Caffe, both measured on CPU. The main overhead > sources are JVM and Spark's communication layer. For more details, please > refer to https://github.com/avulanov/ann-benchmark. Having said that, the > efficient implementation of deep learning in Spark should be only few times > slower than in specialized tool. This is very reasonable for the platform > that does much more than deep learning and I believe it is understood by the > community. > # Scalability. Implement efficient distributed training. It relies heavily on > the efficient communication and scheduling mechanisms. The default > implementation is based on Spark. More efficient implementations might > include some external libraries but use the same interface defined. > *Main features:* > # Multilayer perceptron classifier (MLP) > # Autoencoder > # Convolutional neural networks for computer vision. The interface has to > provide few architectures for deep learning that are widely used in practice, > such as AlexNet > *Additional features:* > # Other architectures, such as Recurrent neural network (RNN), Long-short > term memory (LSTM), Restricted boltzmann machine (RBM), deep belief network > (DBN), MLP multivariate regression > # Regularizers, such as L1, L2, drop-out > # Normalizers > # Network customization. The internal API of Spark ANN is designed to be > flexible and can handle different types of layers. However, only a part of > the API is made public. 
We have to limit the number of public classes in > order to make it simpler to support other languages. This forces us to use > (String or Number) parameters instead of introducing of new public classes. > One of the options to specify the architecture of ANN is to use text > configuration with layer-wise description. We have considered using Caffe > format for this. It gives the benefit of compatibility with well known deep > learning tool and simplifies the support of other languages in Spark. > Implementation of a parser for the subset of Caffe format might be the first > step towards the support of general ANN architectures in Spark. > # Hardware specific optimization. One can wrap other deep learning > implementations with this interface allowing users to pick a particular > back-end, e.g. Caffe or TensorFlow, along with the default one. The interface > has to provide few architectures for deep learning that are widely used in > practice, such as AlexNet. The main motivation for using specialized > libraries for deep learning would be to fully take advantage of the hardware > where Spark runs, in particular GPUs. Having the default interface in Spark, > we will need to wrap only a subset of functions from a given specialized > library. It does require an effort, how
[jira] [Resolved] (SPARK-5915) Spillable should check every N bytes rather than every 32 elements
[ https://issues.apache.org/jira/browse/SPARK-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5915. - Resolution: Incomplete > Spillable should check every N bytes rather than every 32 elements > -- > > Key: SPARK-5915 > URL: https://issues.apache.org/jira/browse/SPARK-5915 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.0.0 >Reporter: Mingyu Kim >Priority: Major > Labels: bulk-closed > > Spillable currently checks for spill every 32 elements. However, this puts it > at risk of OOM if each element is large enough. A better alternative is to > check every N bytes accumulated. > N should be set to a reasonable value via proper testing. > This is a follow-up of SPARK-4808, and was discussed originally in > https://github.com/apache/spark/pull/4420. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
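A hedged sketch of the byte-based check proposed in SPARK-5915 above (hypothetical trait, not Spark's Spillable; the 1 MB check interval and 5 MB threshold stand in for the "reasonable value via proper testing"):
{code}
trait ByteBasedSpillable {
  // Placeholder values; the issue asks for these to be tuned via testing.
  private val checkIntervalBytes: Long = 1L << 20        // check roughly every 1 MB added
  private val memoryThresholdBytes: Long = 5L * (1L << 20)
  private var bytesSinceLastCheck: Long = 0L

  // Called after inserting an element of `elementBytes` bytes; returns true if we spilled.
  def maybeSpill(currentSizeBytes: Long, elementBytes: Long): Boolean = {
    bytesSinceLastCheck += elementBytes
    if (bytesSinceLastCheck >= checkIntervalBytes) {
      bytesSinceLastCheck = 0L
      if (currentSizeBytes > memoryThresholdBytes) {
        spill()
        return true
      }
    }
    false
  }

  // Subclasses write their in-memory data to disk here.
  protected def spill(): Unit
}
{code}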
[jira] [Resolved] (SPARK-3244) Add fate sharing across related files in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3244. - Resolution: Incomplete > Add fate sharing across related files in Jenkins > > > Key: SPARK-3244 > URL: https://issues.apache.org/jira/browse/SPARK-3244 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 1.1.0 >Reporter: Andrew Or >Priority: Major > Labels: bulk-closed > > A few files are closely linked with each other. For instance, changes in > "bin/spark-submit" must be reflected in "bin/spark-submit.cmd" and > "SparkSubmitDriverBootstrapper.scala". It would be good if Jenkins gives a > warning if one file is changed but not the related ones. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1272) Don't fail job if some local directories are buggy
[ https://issues.apache.org/jira/browse/SPARK-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-1272. - Resolution: Incomplete > Don't fail job if some local directories are buggy > -- > > Key: SPARK-1272 > URL: https://issues.apache.org/jira/browse/SPARK-1272 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Patrick Wendell >Priority: Major > Labels: bulk-closed > > If Spark cannot create shuffle directories inside of a local directory it > might make sense to just log an error and continue, provided that at least > one valid shuffle directory exists. Otherwise if a single disk is wonky the > entire job can fail. > The down side is that this might mask failures if the person actually > misconfigures the local directories to point to the wrong disk(s). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
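A minimal sketch of the behavior suggested in SPARK-1272 above (not Spark's DiskBlockManager; the probe subdirectory name is made up): probe each configured local directory, skip the ones that cannot be created, and fail only when none is usable:
{code}
import java.io.File

object LocalDirProbe {
  def usableDirs(configured: Seq[String]): Seq[File] = {
    val ok = configured.flatMap { path =>
      try {
        val dir = new File(path, "spark-local-probe") // hypothetical subdirectory name
        if (dir.isDirectory || dir.mkdirs()) Some(dir)
        else {
          Console.err.println(s"WARN: ignoring unusable local dir $path")
          None
        }
      } catch {
        case e: SecurityException =>
          Console.err.println(s"WARN: ignoring local dir $path: ${e.getMessage}")
          None
      }
    }
    require(ok.nonEmpty, "No usable local directories; cannot continue")
    ok
  }
}
{code}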
[jira] [Resolved] (SPARK-4524) Add documentation on packaging Python dependencies / installing them on clusters
[ https://issues.apache.org/jira/browse/SPARK-4524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4524. - Resolution: Incomplete > Add documentation on packaging Python dependencies / installing them on > clusters > > > Key: SPARK-4524 > URL: https://issues.apache.org/jira/browse/SPARK-4524 > Project: Spark > Issue Type: New Feature > Components: Documentation >Reporter: Josh Rosen >Priority: Major > Labels: bulk-closed > > The documentation should have a section on ways to package Python > dependencies, add them to jobs, install them on clusters, etc. with an > overview of the different mechanisms and discussion of best practices. > This could incorporate some information from > https://stackoverflow.com/questions/24686474/shipping-python-modules-in-pyspark-to-other-nodes/24686708#24686708 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6208) executor-memory does not work when using local cluster
[ https://issues.apache.org/jira/browse/SPARK-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6208. - Resolution: Incomplete > executor-memory does not work when using local cluster > -- > > Key: SPARK-6208 > URL: https://issues.apache.org/jira/browse/SPARK-6208 > Project: Spark > Issue Type: New Feature > Components: Spark Submit >Reporter: Yin Huai >Priority: Minor > Labels: bulk-closed > > It seems the executor memory set with a local cluster is not correctly applied (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L377). > Also, totalExecutorCores seems to have the same issue > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L379). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS
[ https://issues.apache.org/jira/browse/SPARK-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3735. - Resolution: Incomplete > Sending the factor directly or AtA based on the cost in ALS > --- > > Key: SPARK-3735 > URL: https://issues.apache.org/jira/browse/SPARK-3735 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: bulk-closed > > It is common to have some super popular products in the dataset. In this > case, sending many user factors to the target product block could be more > expensive than sending the normal equation `\sum_i u_i u_i^T` and `\sum_i u_i > r_ij` to the product block. The cost of sending a single factor is `k`, while > the cost of sending a normal equation is much more expensive, `k * (k + 3) / > 2`. However, if we use normal equation for all products associated with a > user, we don't need to send this user factor. > Determining the optimal assignment is hard. But we could use a simple > heuristic. Inside any rating block, > 1) order the product ids by the number of user ids associated with them in > desc order > 2) starting from the most popular product, mark popular products as "use > normal eq" and calculate the cost > Remember the best assignment that comes with the lowest cost and use it for > computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
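A tiny sketch of the cost comparison quoted in SPARK-3735 above (assumed helper, not ALS code): sending one factor costs k values, while sending the normal equation (the upper triangle of `\sum_i u_i u_i^T` plus `\sum_i u_i r_ij`) costs k * (k + 3) / 2 values, so the normal equation wins once enough user factors would otherwise be shipped:
{code}
object AlsSendCost {
  // Cost (in number of values) of shipping one k-dimensional factor.
  def factorCost(k: Int): Long = k.toLong

  // Cost of shipping the normal equation: upper triangle of AtA (k*(k+1)/2 values) plus Atb (k values).
  def normalEqCost(k: Int): Long = k.toLong * (k + 3) / 2

  // For a product rated by `numUsers` users in a rating block, is shipping
  // the normal equation once cheaper than shipping every user factor?
  def preferNormalEq(k: Int, numUsers: Long): Boolean =
    normalEqCost(k) < numUsers * factorCost(k)
}
{code}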
[jira] [Resolved] (SPARK-2296) Refactor util.JsonProtocol for evolvability
[ https://issues.apache.org/jira/browse/SPARK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2296. - Resolution: Incomplete > Refactor util.JsonProtocol for evolvability > --- > > Key: SPARK-2296 > URL: https://issues.apache.org/jira/browse/SPARK-2296 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Priority: Major > Labels: bulk-closed > > The current design is not very evolvable. For backwards compatibility, every > time we add a new field in one of the relevant objects (e.g. StageInfo) we > need to add a default value to the field. Otherwise, the test suite still > passes, but it throws some sort of obscure json exception if the field does > not exist. > We should let a common interface (JsonSerializable) handle this logic, so we > don't need to do it for all classes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
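A hedged sketch of the kind of interface SPARK-2296 above suggests (hypothetical trait and field names; Spark's event logs do use json4s, but this is not the actual JsonProtocol code). Each type supplies defaults for fields that may be missing from logs written by older versions:
{code}
import org.json4s._
import org.json4s.JsonDSL._

trait JsonSerializable[T] {
  def toJson(value: T): JValue
  def fromJson(json: JValue): T
}

// Stand-in for something like StageInfo; attemptId plays the role of the "field added later".
case class StageInfoLike(stageId: Int, name: String, attemptId: Int = 0)

object StageInfoLikeJson extends JsonSerializable[StageInfoLike] {
  implicit val formats: Formats = DefaultFormats

  def toJson(s: StageInfoLike): JValue =
    ("Stage ID" -> s.stageId) ~ ("Stage Name" -> s.name) ~ ("Attempt ID" -> s.attemptId)

  def fromJson(json: JValue): StageInfoLike =
    StageInfoLike(
      stageId = (json \ "Stage ID").extract[Int],
      name = (json \ "Stage Name").extract[String],
      // Default when the field is absent in logs written by an older version.
      attemptId = (json \ "Attempt ID").extractOpt[Int].getOrElse(0))
}
{code}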
[jira] [Resolved] (SPARK-5918) Spark Thrift server reports metadata for VARCHAR column as STRING in result set schema
[ https://issues.apache.org/jira/browse/SPARK-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5918. - Resolution: Incomplete > Spark Thrift server reports metadata for VARCHAR column as STRING in result > set schema > -- > > Key: SPARK-5918 > URL: https://issues.apache.org/jira/browse/SPARK-5918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.1, 1.2.0 >Reporter: Holman Lan >Assignee: Cheng Lian >Priority: Major > Labels: bulk-closed > > This is reproducible using the open source JDBC driver by executing a query > that will return a VARCHAR column then retrieving the result set metadata. > The type name returned by the JDBC driver is VARCHAR which is expected but > reports the column type as string[12] and precision/column length as > 2147483647 (which is what the JDBC driver would return for STRING column) > even though we created a VARCHAR column with max length of 1000. > Further investigation indicates the GetResultSetMetadata Thrift client API > call returns the incorrect metadata. > We have confirmed this behaviour in versions 1.1.1 and 1.2.0. We have not > yet tested this against 1.2.1 but will do so and report our findings. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2723) Block Manager should catch exceptions in putValues
[ https://issues.apache.org/jira/browse/SPARK-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2723. - Resolution: Incomplete > Block Manager should catch exceptions in putValues > -- > > Key: SPARK-2723 > URL: https://issues.apache.org/jira/browse/SPARK-2723 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Shivaram Venkataraman >Priority: Major > Labels: bulk-closed > > The BlockManager should catch exceptions encountered while writing out files > to disk. Right now these exceptions get counted as user-level task failures > and the job is aborted after failing 4 times. We should either fail the > executor or handle this better to prevent the job from dying. > I ran into an issue where one disk on a large EC2 cluster failed and this > resulted in a long running job terminating. Longer term, we should also look > at black-listing local directories when one of them become unusable ? > Exception pasted below: > 14/07/29 00:55:39 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException > java.io.FileNotFoundException: > /mnt2/spark/spark-local-20140728175256-e7cb/28/broadcast_264_piece20 > (Input/output error) > at java.io.FileOutputStream.open(Native Method) > at java.io.FileOutputStream.(FileOutputStream.java:221) > at java.io.FileOutputStream.(FileOutputStream.java:171) > at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:79) > at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:66) > at > org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:847) > at > org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:267) > at > org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:256) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.storage.MemoryStore.ensureFreeSpace(MemoryStore.scala:256) > at > org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:179) > at > org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:76) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:663) > at org.apache.spark.storage.BlockManager.put(BlockManager.scala:574) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
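A rough sketch of the defensive handling SPARK-2723 above asks for (not BlockManager code; names are made up): turn an I/O failure during a disk write into a storage-level decision instead of a user-level task failure:
{code}
import java.io.{File, FileOutputStream, IOException}

object SafeDiskWrite {
  // Returns true if the block was written; false if the write failed, in which case
  // the caller can drop the block or blacklist the directory instead of failing the task.
  def tryWriteBlock(file: File, bytes: Array[Byte]): Boolean =
    try {
      val out = new FileOutputStream(file)
      try out.write(bytes) finally out.close()
      true
    } catch {
      case e: IOException =>
        Console.err.println(s"WARN: failed to write block ${file.getName}: ${e.getMessage}")
        false
    }
}
{code}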