[jira] [Assigned] (SPARK-14399) Remove unnecessary excludes from POMs

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14399:


Assignee: Josh Rosen  (was: Apache Spark)

> Remove unnecessary excludes from POMs
> -
>
> Key: SPARK-14399
> URL: https://issues.apache.org/jira/browse/SPARK-14399
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In many cases, our Maven POMs are overly complicated by unnecessary excludes. 
> I believe that we can significantly simplify the build by removing many of 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14399) Remove unnecessary excludes from POMs

2016-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225793#comment-15225793
 ] 

Apache Spark commented on SPARK-14399:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12171

> Remove unnecessary excludes from POMs
> -
>
> Key: SPARK-14399
> URL: https://issues.apache.org/jira/browse/SPARK-14399
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In many cases, our Maven POMs are overly complicated by unnecessary excludes. 
> I believe that we can significantly simplify the build by removing many of 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14370) Avoid creating duplicate ids in OnlineLDAOptimizer

2016-04-04 Thread Pravin Gadakh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225786#comment-15225786
 ] 

Pravin Gadakh commented on SPARK-14370:
---

I can work on it if no one has already taken it up.

> Avoid creating duplicate ids in OnlineLDAOptimizer
> --
>
> Key: SPARK-14370
> URL: https://issues.apache.org/jira/browse/SPARK-14370
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> In {{OnlineLDAOptimizer}}'s {{submitMiniBatch}} method, we create a list of 
> ids {{val ids: List[Int]}} before calling {{variationalTopicInference}}, 
> which then creates a duplicate set of ids.  {{variationalTopicInference}} 
> should reuse the same set.
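
As a standalone illustration of the proposed change (the method names below are
hypothetical stand-ins, not the actual MLlib internals), the caller can derive the
ids once and pass them down instead of letting the callee recompute them:

{code}
object DuplicateIdSketch {
  // Before: the callee re-derives the ids from the term-count map (duplicated work).
  def inferBefore(termCounts: Map[Int, Double]): List[Int] =
    termCounts.keys.toList

  // After: the caller computes the ids once and passes them down.
  def inferAfter(termCounts: Map[Int, Double], ids: List[Int]): List[Int] =
    ids

  def main(args: Array[String]): Unit = {
    val termCounts = Map(0 -> 2.0, 3 -> 1.0, 7 -> 4.0)
    val ids = termCounts.keys.toList  // computed once, as submitMiniBatch already does
    assert(inferBefore(termCounts).sorted == inferAfter(termCounts, ids).sorted)
    println(ids)
  }
}
{code}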



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14400) ScriptTransformation does not fail the job for bad user command

2016-04-04 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-14400:

Description: If the `script` to be run is an incorrect command, Spark does 
not catch the failure of the sub-process, and the job is marked as 
successful.
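
A minimal sketch of the desired behaviour (plain Scala around scala.sys.process,
not Spark's actual ScriptTransformation code): check the subprocess exit code and
fail loudly instead of silently succeeding.

{code}
import scala.sys.process._

object ScriptExitCodeSketch {
  def runScript(command: Seq[String]): Unit = {
    val exitCode = Process(command).!  // blocks until the subprocess exits
    if (exitCode != 0) {
      throw new IllegalStateException(
        s"Script '${command.mkString(" ")}' failed with exit code $exitCode")
    }
  }

  def main(args: Array[String]): Unit = {
    runScript(Seq("sh", "-c", "exit 0"))  // succeeds silently
    try runScript(Seq("sh", "-c", "no_such_command"))
    catch { case e: IllegalStateException => println(e.getMessage) }
  }
}
{code}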

> ScriptTransformation does not fail the job for bad user command
> ---
>
> Key: SPARK-14400
> URL: https://issues.apache.org/jira/browse/SPARK-14400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> If the `script` to be run is an incorrect command, Spark does not catch the 
> failure of the sub-process, and the job is marked as successful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14400) ScriptTransformation does not fail the job for bad user command

2016-04-04 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-14400:
---

 Summary: ScriptTransformation does not fail the job for bad user 
command
 Key: SPARK-14400
 URL: https://issues.apache.org/jira/browse/SPARK-14400
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: Tejas Patil
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14398) Audit non-reserved keyword list in ANTLR4 parser.

2016-04-04 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225767#comment-15225767
 ] 

Bo Meng commented on SPARK-14398:
-

Not a problem. I will work on it tomorrow.  Thanks.

> Audit non-reserved keyword list in ANTLR4 parser.
> -
>
> Key: SPARK-14398
> URL: https://issues.apache.org/jira/browse/SPARK-14398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
> Fix For: 2.0.0
>
>
> We need to check if all keywords that were non-reserved in the `old` ANTLR3 
> parser are still non-reserved in the ANTLR4 parser. Notable exceptions are the 
> join keywords {{LEFT}}, {{RIGHT}}, and {{FULL}}; these used to be non-reserved 
> and are now reserved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14399) Remove unnecessary excludes from POMs

2016-04-04 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-14399:
--

 Summary: Remove unnecessary excludes from POMs
 Key: SPARK-14399
 URL: https://issues.apache.org/jira/browse/SPARK-14399
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Josh Rosen
Assignee: Josh Rosen


In many cases, our Maven POMs are overly complicated by unnecessary excludes. I 
believe that we can significantly simplify the build by removing many of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14348) Support native execution of SHOW TBLPROPERTIES command

2016-04-04 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-14348.
-
  Resolution: Resolved
Assignee: Dilip Biswal
Target Version/s: 2.0.0

> Support native execution of SHOW TBLPROPERTIES command
> --
>
> Key: SPARK-14348
> URL: https://issues.apache.org/jira/browse/SPARK-14348
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>
> 1. Support parsing of SHOW TBLPROPERTIES command
> 2. Support the native execution of SHOW TBLPROPERTIES command
> The syntax for SHOW commands is described at the following link:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables
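
For reference, a usage sketch once the native command is in place (assumes a
Spark 2.0-style SparkSession and an existing table named `src` with some
properties set; the exact output schema may differ):

{code}
import org.apache.spark.sql.SparkSession

object ShowTblPropertiesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("show-tblproperties").getOrCreate()
    // All properties of the table, returned as (key, value) rows.
    spark.sql("SHOW TBLPROPERTIES src").show(truncate = false)
    // A single property looked up by key.
    spark.sql("SHOW TBLPROPERTIES src('created.by')").show(truncate = false)
    spark.stop()
  }
}
{code}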



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14398) Audit non-reserved keyword list in ANTLR4 parser.

2016-04-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225765#comment-15225765
 ] 

Herman van Hovell commented on SPARK-14398:
---

cc [~bomeng] want to take a look at this?

> Audit non-reserved keyword list in ANTLR4 parser.
> -
>
> Key: SPARK-14398
> URL: https://issues.apache.org/jira/browse/SPARK-14398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
> Fix For: 2.0.0
>
>
> We need to check if all keywords that were non-reserved in the `old` ANTLR3 
> parser are still non-reserved in the ANTLR4 parser. Notable exceptions are the 
> join keywords {{LEFT}}, {{RIGHT}}, and {{FULL}}; these used to be non-reserved 
> and are now reserved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14397) <html> and <body> tags are nested in LogPage

2016-04-04 Thread Maarten Kesselaers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maarten Kesselaers updated SPARK-14397:
---
Comment: was deleted

(was: Can I try this one, please?)

> <html> and <body> tags are nested in LogPage
> 
>
> Key: SPARK-14397
> URL: https://issues.apache.org/jira/browse/SPARK-14397
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In `LogPage`, the content to be rendered is defined as follows.
> {code}
> val content =
>   
> 
>   {linkToMaster}
>   
> {backButton}
> {range}
> {nextButton}
>   
>   
>   
> {logText}
>   
> 
>   
> UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
> {code}
> As you can see, <html> and <body> tags will be rendered.
> On the other hand, `UIUtils.basicSparkPage` also renders those tags, so they 
> end up nested.
> {code}
>   def basicSparkPage(
>   content: => Seq[Node],
>   title: String,
>   useDataTables: Boolean = false): Seq[Node] = {
> 
>   
> {commonHeaderNodes}
> {if (useDataTables) dataTablesHeaderNodes else Seq.empty}
> {title}
>   
>   
> 
>   
> 
>   
> 
>src={prependBaseUri("/static/spark-logo-77x50px-hd.png")} />
>style="margin-right: 
> 15px;">{org.apache.spark.SPARK_VERSION}
> 
> {title}
>   
> 
>   
>   {content}
> 
>   
> 
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14397) <html> and <body> tags are nested in LogPage

2016-04-04 Thread Maarten Kesselaers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225762#comment-15225762
 ] 

Maarten Kesselaers commented on SPARK-14397:


Can I try this one, please?

> <html> and <body> tags are nested in LogPage
> 
>
> Key: SPARK-14397
> URL: https://issues.apache.org/jira/browse/SPARK-14397
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In `LogPage`, the content to be rendered is defined as follows.
> {code}
> val content =
>   
> 
>   {linkToMaster}
>   
> {backButton}
> {range}
> {nextButton}
>   
>   
>   
> {logText}
>   
> 
>   
> UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
> {code}
> As you can see, <html> and <body> tags will be rendered.
> On the other hand, `UIUtils.basicSparkPage` also renders those tags, so they 
> end up nested.
> {code}
>   def basicSparkPage(
>   content: => Seq[Node],
>   title: String,
>   useDataTables: Boolean = false): Seq[Node] = {
> 
>   
> {commonHeaderNodes}
> {if (useDataTables) dataTablesHeaderNodes else Seq.empty}
> {title}
>   
>   
> 
>   
> 
>   
> 
>src={prependBaseUri("/static/spark-logo-77x50px-hd.png")} />
>style="margin-right: 
> 15px;">{org.apache.spark.SPARK_VERSION}
> 
> {title}
>   
> 
>   
>   {content}
> 
>   
> 
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14397) <html> and <body> tags are nested in LogPage

2016-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225760#comment-15225760
 ] 

Apache Spark commented on SPARK-14397:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/12170

> <html> and <body> tags are nested in LogPage
> 
>
> Key: SPARK-14397
> URL: https://issues.apache.org/jira/browse/SPARK-14397
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In `LogPage`, the content to be rendered is defined as follows.
> {code}
> val content =
>   
> 
>   {linkToMaster}
>   
> {backButton}
> {range}
> {nextButton}
>   
>   
>   
> {logText}
>   
> 
>   
> UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
> {code}
> As you can see, <html> and <body> tags will be rendered.
> On the other hand, `UIUtils.basicSparkPage` also renders those tags, so they 
> end up nested.
> {code}
>   def basicSparkPage(
>   content: => Seq[Node],
>   title: String,
>   useDataTables: Boolean = false): Seq[Node] = {
> 
>   
> {commonHeaderNodes}
> {if (useDataTables) dataTablesHeaderNodes else Seq.empty}
> {title}
>   
>   
> 
>   
> 
>   
> 
>src={prependBaseUri("/static/spark-logo-77x50px-hd.png")} />
>style="margin-right: 
> 15px;">{org.apache.spark.SPARK_VERSION}
> 
> {title}
>   
> 
>   
>   {content}
> 
>   
> 
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14397) <html> and <body> tags are nested in LogPage

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14397:


Assignee: Apache Spark

> <html> and <body> tags are nested in LogPage
> 
>
> Key: SPARK-14397
> URL: https://issues.apache.org/jira/browse/SPARK-14397
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In `LogPage`, the content to be rendered is defined as follows.
> {code}
> val content =
>   
> 
>   {linkToMaster}
>   
> {backButton}
> {range}
> {nextButton}
>   
>   
>   
> {logText}
>   
> 
>   
> UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
> {code}
> As you can see, <html> and <body> tags will be rendered.
> On the other hand, `UIUtils.basicSparkPage` also renders those tags, so they 
> end up nested.
> {code}
>   def basicSparkPage(
>   content: => Seq[Node],
>   title: String,
>   useDataTables: Boolean = false): Seq[Node] = {
> 
>   
> {commonHeaderNodes}
> {if (useDataTables) dataTablesHeaderNodes else Seq.empty}
> {title}
>   
>   
> 
>   
> 
>   
> 
>src={prependBaseUri("/static/spark-logo-77x50px-hd.png")} />
>style="margin-right: 
> 15px;">{org.apache.spark.SPARK_VERSION}
> 
> {title}
>   
> 
>   
>   {content}
> 
>   
> 
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14397) <html> and <body> tags are nested in LogPage

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14397:


Assignee: (was: Apache Spark)

> <html> and <body> tags are nested in LogPage
> 
>
> Key: SPARK-14397
> URL: https://issues.apache.org/jira/browse/SPARK-14397
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In `LogPage`, the content to be rendered is defined as follows.
> {code}
> val content =
>   
> 
>   {linkToMaster}
>   
> {backButton}
> {range}
> {nextButton}
>   
>   
>   
> {logText}
>   
> 
>   
> UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
> {code}
> As you can see, <html> and <body> tags will be rendered.
> On the other hand, `UIUtils.basicSparkPage` also renders those tags, so they 
> end up nested.
> {code}
>   def basicSparkPage(
>   content: => Seq[Node],
>   title: String,
>   useDataTables: Boolean = false): Seq[Node] = {
> 
>   
> {commonHeaderNodes}
> {if (useDataTables) dataTablesHeaderNodes else Seq.empty}
> {title}
>   
>   
> 
>   
> 
>   
> 
>src={prependBaseUri("/static/spark-logo-77x50px-hd.png")} />
>style="margin-right: 
> 15px;">{org.apache.spark.SPARK_VERSION}
> 
> {title}
>   
> 
>   
>   {content}
> 
>   
> 
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14398) Audit non-reserved keyword list in ANTLR4 parser.

2016-04-04 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-14398:
-

 Summary: Audit non-reserved keyword list in ANTLR4 parser.
 Key: SPARK-14398
 URL: https://issues.apache.org/jira/browse/SPARK-14398
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Herman van Hovell


We need to check if all keywords that were non-reserved in the `old` ANTLR3 
parser are still non-reserved in the ANTLR4 parser. Notable exceptions are the 
join keywords {{LEFT}}, {{RIGHT}}, and {{FULL}}; these used to be non-reserved 
and are now reserved.
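
One cheap way to probe individual keywords from the SQL surface (purely
illustrative; the real audit compares the ANTLR3 and ANTLR4 grammar keyword
lists directly, and the keyword sample below is arbitrary):

{code}
import org.apache.spark.sql.SparkSession
import scala.util.Try

object NonReservedKeywordProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("keyword-probe").getOrCreate()
    val candidates = Seq("sort", "limit", "view", "left", "right", "full")
    candidates.foreach { kw =>
      // If the keyword is non-reserved, it can be used as a bare column alias.
      val usable = Try(spark.sql(s"SELECT 1 AS $kw")).isSuccess
      println(f"$kw%-6s usable as identifier: $usable")
    }
    spark.stop()
  }
}
{code}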





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14397) <html> and <body> tags are nested in LogPage

2016-04-04 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-14397:
--

 Summary: <html> and <body> tags are nested in LogPage
 Key: SPARK-14397
 URL: https://issues.apache.org/jira/browse/SPARK-14397
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.0
Reporter: Kousuke Saruta
Priority: Minor


In `LogPage`, the content to be rendered is defined as follows.

{code}
val content =
  

  {linkToMaster}
  
{backButton}
{range}
{nextButton}
  
  
  
{logText}
  

  
UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
{code}

As you can see, <html> and <body> tags will be rendered.

On the other hand, `UIUtils.basicSparkPage` also renders those tags, so they 
end up nested.

{code}
  def basicSparkPage(
  content: => Seq[Node],
  title: String,
  useDataTables: Boolean = false): Seq[Node] = {

  
{commonHeaderNodes}
{if (useDataTables) dataTablesHeaderNodes else Seq.empty}
{title}
  
  

  

  

  
  {org.apache.spark.SPARK_VERSION}

{title}
  

  
  {content}

  

  }

{code}
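
A boiled-down sketch of the problem and the fix (simplified markup, not the
actual LogPage/UIUtils source; Scala XML literals need the scala-xml module on
2.11+): the page template already wraps its argument in <html>/<body>, so callers
should pass only the inner fragment.

{code}
import scala.xml.Node

object NestedTagsSketch {
  // Stand-in for UIUtils.basicSparkPage: wraps the content in <html>/<body> itself.
  def basicPage(content: Seq[Node], title: String): Seq[Node] =
    <html><head><title>{title}</title></head><body>{content}</body></html>

  def main(args: Array[String]): Unit = {
    // Problematic: the fragment carries its own <html>/<body>, so the tags end up nested.
    val nested = basicPage(<html><body><pre>log text</pre></body></html>, "log page")
    // Fixed: hand over only the inner markup and let basicPage add the outer tags once.
    val flat = basicPage(<pre>log text</pre>, "log page")
    println(nested)
    println(flat)
  }
}
{code}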



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14396) Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)

2016-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225722#comment-15225722
 ] 

Apache Spark commented on SPARK-14396:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12169

> Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)
> ---
>
> Key: SPARK-14396
> URL: https://issues.apache.org/jira/browse/SPARK-14396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Because the concept of partitioning is associated with physical tables, we 
> disable all support for partitioned views, which are defined in the 
> following three commands in [Hive DDL 
> Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
> {noformat}
> ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];
> ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;
> CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
> column_comment], ...) ]
>   [COMMENT view_comment]
>   [TBLPROPERTIES (property_name = property_value, ...)]
>   AS SELECT ...;
> {noformat}
>  
> An exception is thrown when users issue any of these three DDL commands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14396) Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14396:


Assignee: Apache Spark

> Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)
> ---
>
> Key: SPARK-14396
> URL: https://issues.apache.org/jira/browse/SPARK-14396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Because the concept of partitioning is associated with physical tables, we 
> disable all support for partitioned views, which are defined in the 
> following three commands in [Hive DDL 
> Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
> {noformat}
> ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];
> ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;
> CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
> column_comment], ...) ]
>   [COMMENT view_comment]
>   [TBLPROPERTIES (property_name = property_value, ...)]
>   AS SELECT ...;
> {noformat}
>  
> An exception is thrown when users issue any of these three DDL commands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14396) Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14396:


Assignee: (was: Apache Spark)

> Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)
> ---
>
> Key: SPARK-14396
> URL: https://issues.apache.org/jira/browse/SPARK-14396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Because the concept of partitioning is associated with physical tables, we 
> disable all support for partitioned views, which are defined in the 
> following three commands in [Hive DDL 
> Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
> {noformat}
> ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];
> ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;
> CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
> column_comment], ...) ]
>   [COMMENT view_comment]
>   [TBLPROPERTIES (property_name = property_value, ...)]
>   AS SELECT ...;
> {noformat}
>  
> An exception is thrown when users issue any of these three DDL commands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14396) Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)

2016-04-04 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14396:

Description: 
Because the concept of partitioning is associated with physical tables, we 
disable all support for partitioned views, which are defined in the 
following three commands in [Hive DDL 
Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
{noformat}
ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];

ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
{noformat}
 
An exception is thrown when users issue any of these three DDL commands.


  was:
Because the concept of partitioning is associated with physical tables, we 
disable all support for partitioned views, which are defined in the 
following three commands in [Hive DDL 
Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
{no format}
ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];

ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
{no format}
 
An exception is thrown when users issue any of these three DDL commands.



> Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)
> ---
>
> Key: SPARK-14396
> URL: https://issues.apache.org/jira/browse/SPARK-14396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Because the concept of partitioning is associated with physical tables, we 
> disable all support for partitioned views, which are defined in the 
> following three commands in [Hive DDL 
> Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
> {noformat}
> ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];
> ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;
> CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
> column_comment], ...) ]
>   [COMMENT view_comment]
>   [TBLPROPERTIES (property_name = property_value, ...)]
>   AS SELECT ...;
> {noformat}
>  
> An exception is thrown when users issue any of these three DDL commands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14362) DDL Native Support: Drop View

2016-04-04 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14362:

Description: 
Native parsing and native analysis of DDL command: Drop View.

Based on the Hive DDL document for 
[DROP_VIEW_WEB_LINK](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropView), 
`DROP VIEW` is defined as follows.

Syntax:
{noformat}
DROP VIEW [IF EXISTS] [db_name.]view_name;
{noformat}
 - to remove metadata for the specified view. 
 - illegal to use DROP TABLE on a view.
 - illegal to use DROP VIEW on a table.


  was:
Native parsing and native analysis of DDL command: Drop View.

Based on the Hive DDL document for 
[DROP_VIEW_WEB_LINK](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropView), 
`DROP VIEW` is defined as follows.

Syntax:

DROP VIEW [IF EXISTS] [db_name.]view_name;

 - to remove metadata for the specified view. 
 - illegal to use DROP TABLE on a view.
 - illegal to use DROP VIEW on a table.



> DDL Native Support: Drop View
> -
>
> Key: SPARK-14362
> URL: https://issues.apache.org/jira/browse/SPARK-14362
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Native parsing and native analysis of DDL command: Drop View.
> Based on the Hive DDL document for 
> [DROP_VIEW_WEB_LINK](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropView), 
> `DROP VIEW` is defined as follows.
> Syntax:
> {noformat}
> DROP VIEW [IF EXISTS] [db_name.]view_name;
> {noformat}
>  - to remove metadata for the specified view. 
>  - illegal to use DROP TABLE on a view.
>  - illegal to use DROP VIEW on a table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14396) Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)

2016-04-04 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14396:

Description: 
Because the concept of partitioning is associated with physical tables, we 
disable all support for partitioned views, which are defined in the 
following three commands in [Hive DDL 
Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
{no format}
ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];

ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
{no format}
 
An exception is thrown when users issue any of these three DDL commands.


  was:
Because the concept of partitioning is associated with physical tables, we 
disable all support for partitioned views, which are defined in the 
following three commands in [Hive DDL 
Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
{no format}
ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];

ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
{noformat}
 
An exception is thrown when users issue any of these three DDL commands.



> Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)
> ---
>
> Key: SPARK-14396
> URL: https://issues.apache.org/jira/browse/SPARK-14396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Because the concept of partitioning is associated with physical tables, we 
> disable all support for partitioned views, which are defined in the 
> following three commands in [Hive DDL 
> Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
> {no format}
> ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];
> ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;
> CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
> column_comment], ...) ]
>   [COMMENT view_comment]
>   [TBLPROPERTIES (property_name = property_value, ...)]
>   AS SELECT ...;
> {no format}
>  
> An exception is thrown when users issue any of these three DDL commands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14396) Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)

2016-04-04 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14396:

Description: 
Because the concept of partitioning is associated with physical tables, we 
disable all support for partitioned views, which are defined in the 
following three commands in [Hive DDL 
Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
{no format}
ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];

ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
{noformat}
 
An exception is thrown when users issue any of these three DDL commands.


  was:
Because the concept of partitioning is associated with physical tables, we 
disable all support for partitioned views, which are defined in the 
following three commands in [Hive DDL 
Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
{{{code
ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];

ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
}}}
 
An exception is thrown when users issue any of these three DDL commands.



> Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)
> ---
>
> Key: SPARK-14396
> URL: https://issues.apache.org/jira/browse/SPARK-14396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Because the concept of partitioning is associated with physical tables, we 
> disable all support for partitioned views, which are defined in the 
> following three commands in [Hive DDL 
> Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
> {no format}
> ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];
> ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;
> CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
> column_comment], ...) ]
>   [COMMENT view_comment]
>   [TBLPROPERTIES (property_name = property_value, ...)]
>   AS SELECT ...;
> {noformat}
>  
> An exception is thrown when users issue any of these three DDL commands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14396) Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)

2016-04-04 Thread Xiao Li (JIRA)
Xiao Li created SPARK-14396:
---

 Summary: Throw Exceptions for DDLs of Partitioned Views (CREATE 
VIEW and ALTER VIEW)
 Key: SPARK-14396
 URL: https://issues.apache.org/jira/browse/SPARK-14396
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Because the concept of partitioning is associated with physical tables, we 
disable all support for partitioned views, which are defined in the 
following three commands in [Hive DDL 
Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
{{{code
ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];

ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
}}}
 
An exception is thrown when users issue any of these three DDL commands.
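
A hypothetical sketch of the intended guard (plain Scala, not Spark's actual
parser or analyzer code): view-partition commands are rejected up front with a
clear error.

{code}
object PartitionedViewGuard {
  sealed trait Command
  case class AlterViewAddPartition(view: String) extends Command
  case class AlterViewDropPartition(view: String) extends Command
  case class CreateView(name: String, partitioned: Boolean) extends Command

  def check(cmd: Command): Unit = cmd match {
    case AlterViewAddPartition(v) =>
      throw new UnsupportedOperationException(s"ALTER VIEW $v ADD PARTITION is not supported")
    case AlterViewDropPartition(v) =>
      throw new UnsupportedOperationException(s"ALTER VIEW $v DROP PARTITION is not supported")
    case CreateView(v, true) =>
      throw new UnsupportedOperationException(s"CREATE VIEW $v with partitioning is not supported")
    case _ => ()  // everything else passes through unchanged
  }

  def main(args: Array[String]): Unit = {
    check(CreateView("v1", partitioned = false))  // fine
    try check(AlterViewAddPartition("v1"))
    catch { case e: UnsupportedOperationException => println(e.getMessage) }
  }
}
{code}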




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14240) PySpark Standalone Application hangs without any Error message

2016-04-04 Thread Sayak Ghosh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225704#comment-15225704
 ] 

Sayak Ghosh commented on SPARK-14240:
-

The issue has been resolved. It's a GC issue, as you mentioned earlier. Thanks 
for the guidance.

> PySpark Standalone Application hangs without any Error message
> --
>
> Key: SPARK-14240
> URL: https://issues.apache.org/jira/browse/SPARK-14240
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, PySpark
>Affects Versions: 1.6.0
>Reporter: Sayak Ghosh
> Attachments: DAG visualisation of pending stages.png, 
> error_terminal.png, event timeline.png
>
>
> I am relatively new to Spark and wrote a simple script using Python and Spark 
> SQL. My problem is that everything is fine at the start of the execution, but it 
> gradually slows down, and at the end of the last phase the whole application 
> hangs.
> Here is my code snippet - 
>   hivectx.registerDataFrameAsTable(aggregatedDataV1,"aggregatedDataV1")
> q1 = "SELECT *, (Total_Sale/Sale_Weeks) as Average_Sale_Per_SaleWeek, 
> (Total_Weeks/Sale_Weeks) as Velocity FROM aggregatedDataV1"
> aggregatedData = hivectx.sql(q1)
> aggregatedData.show(100)
> == Terminal Hanging with the following =
> 16/03/29 09:05:50 INFO TaskSetManager: Finished task 96.0 in stage 416.0 (TID 
> 19992) in 41924 ms on 10.9.0.7 (104/200)
> 16/03/29 09:05:50 INFO TaskSetManager: Finished task 108.0 in stage 416.0 
> (TID 20004) in 24608 ms on 10.9.0.10 (105/200)
> 16/03/29 09:05:50 INFO TaskSetManager: Finished task 105.0 in stage 416.0 
> (TID 20001) in 24610 ms on 10.9.0.10 (106/200)
> 16/03/29 09:05:55 INFO TaskSetManager: Starting task 116.0 in stage 416.0 
> (TID 20012, 10.9.0.10, partition 116,NODE_LOCAL, 2240 bytes)
> 16/03/29 09:06:31 INFO TaskSetManager: Finished task 99.0 in stage 416.0 (TID 
> 19995) in 78435 ms on 10.9.0.7 (110/200)
> 16/03/29 09:06:40 INFO TaskSetManager: Starting task 119.0 in stage 416.0 
> (TID 20015, 10.9.0.10, partition 119,NODE_LOCAL, 2240 bytes)
> 16/03/29 09:07:12 INFO TaskSetManager: Starting task 122.0 in stage 416.0 
> (TID 20018, 10.9.0.7, partition 122,NODE_LOCAL, 2240 bytes) 
> 16/03/29 09:07:16 INFO TaskSetManager: Starting task 123.0 in stage 416.0 
> (TID 20019, 10.9.0.7, partition 123,NODE_LOCAL, 2240 bytes)
> 16/03/29 09:07:28 INFO TaskSetManager: Finished task 111.0 in stage 416.0 
> (TID 20007) in 110198 ms on 10.9.0.7 (114/200)
> 16/03/29 09:07:52 INFO TaskSetManager: Starting task 124.0 in stage 416.0 
> (TID 20020, 10.9.0.10, partition 124,NODE_LOCAL, 2240 bytes)
> 16/03/29 09:08:08 INFO TaskSetManager: Finished task 110.0 in stage 416.0 
> (TID 20006) in 150023 ms on 10.9.0.7 (115/200)
> 16/03/29 09:08:12 INFO TaskSetManager: Finished task 113.0 in stage 416.0 
> (TID 20009) in 154120 ms on 10.9.0.7 (116/200)
> 16/03/29 09:08:16 INFO TaskSetManager: Finished task 116.0 in stage 416.0 
> (TID 20012) in 145691 ms on 10.9.0.10 (117/200)
> There is no sign of error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14125) View related commands

2016-04-04 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225675#comment-15225675
 ] 

Xiao Li commented on SPARK-14125:
-

Now working on TOK_ALTERVIEW_ADDPARTS/TOK_ALTERVIEW_DROPPARTS. I will also 
disable CREATE VIEW for partitioned views, because the concept of partitioning 
is associated with physical tables. Thanks!

> View related commands
> -
>
> Key: SPARK-14125
> URL: https://issues.apache.org/jira/browse/SPARK-14125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>
> We should support the following commands.
> TOK_ALTERVIEW_AS
> TOK_ALTERVIEW_PROPERTIES
> TOK_ALTERVIEW_RENAME
> TOK_DROPVIEW
> TOK_DROPVIEW_PROPERTIES
> For TOK_ALTERVIEW_ADDPARTS/TOK_ALTERVIEW_DROPPARTS, we should throw 
> exceptions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14298) LDA should support disable checkpoint

2016-04-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14298:
--
Issue Type: Bug  (was: Improvement)

> LDA should support disable checkpoint
> -
>
> Key: SPARK-14298
> URL: https://issues.apache.org/jira/browse/SPARK-14298
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> LDA should support disabling checkpointing by setting checkpointInterval = -1
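
Intended usage, assuming the patch lands with exactly this semantics (before the
change, a non-positive value was not accepted by the parameter):

{code}
import org.apache.spark.ml.clustering.LDA

object DisableLdaCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val lda = new LDA()
      .setK(10)
      .setMaxIter(50)
      .setCheckpointInterval(-1)  // proposed: never write checkpoints, even if a checkpoint dir is set
    println(lda.explainParam(lda.checkpointInterval))
    // val model = lda.fit(dataset)  // `dataset` needs a "features" vector column
  }
}
{code}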



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14298) LDA should support disable checkpoint

2016-04-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14298:
--
 Shepherd: Joseph K. Bradley
 Assignee: Yanbo Liang
Affects Version/s: 2.0.0
   1.5.2
   1.6.1
 Target Version/s: 1.5.3, 1.6.2, 2.0.0

> LDA should support disable checkpoint
> -
>
> Key: SPARK-14298
> URL: https://issues.apache.org/jira/browse/SPARK-14298
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> LDA should support disabling checkpointing by setting checkpointInterval = -1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14359) Improve user experience for typed aggregate functions in Java

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14359:


Assignee: Apache Spark

> Improve user experience for typed aggregate functions in Java
> -
>
> Key: SPARK-14359
> URL: https://issues.apache.org/jira/browse/SPARK-14359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> See the Scala version in SPARK-14285. The main problem we'd need to work 
> around is that Java cannot return primitive types in generics, and as a 
> result we would have to return boxed types.
> One requirement is that we should add tests for both Java 7 style (in 
> sql/core) and Java 8 style lambdas (in external/java-8...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14359) Improve user experience for typed aggregate functions in Java

2016-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225599#comment-15225599
 ] 

Apache Spark commented on SPARK-14359:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/12168

> Improve user experience for typed aggregate functions in Java
> -
>
> Key: SPARK-14359
> URL: https://issues.apache.org/jira/browse/SPARK-14359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> See the Scala version in SPARK-14285. The main problem we'd need to work 
> around is that Java cannot return primitive types in generics, and as a 
> result we would have to return boxed types.
> One requirement is that we should add tests for both Java 7 style (in 
> sql/core) and Java 8 style lambdas (in external/java-8...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14359) Improve user experience for typed aggregate functions in Java

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14359:


Assignee: (was: Apache Spark)

> Improve user experience for typed aggregate functions in Java
> -
>
> Key: SPARK-14359
> URL: https://issues.apache.org/jira/browse/SPARK-14359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> See the Scala version in SPARK-14285. The main problem we'd need to work 
> around is that Java cannot return primitive types in generics, and as a 
> result we would have to return boxed types.
> One requirement is that we should add tests for both Java 7 style (in 
> sql/core) and Java 8 style lambdas (in external/java-8...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14368) Support spark.python.worker.memory with upper-case unit

2016-04-04 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-14368:
---
Fix Version/s: 2.0.0
   1.6.2

> Support spark.python.worker.memory with upper-case unit
> ---
>
> Key: SPARK-14368
> URL: https://issues.apache.org/jira/browse/SPARK-14368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Masahiro TANAKA
>Assignee: Yong Tang
>Priority: Trivial
> Fix For: 1.6.2, 2.0.0
>
>
> According to the 
> [document|https://spark.apache.org/docs/latest/configuration.html], 
> spark.python.worker.memory uses the same format as a JVM memory string, but an 
> upper-case unit is not accepted in `spark.python.worker.memory`. It should be 
> allowed.
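
For context, the setting in question is a plain configuration entry; only the
unit parsing on the Python worker side needs to change. A usage sketch (the fix
should make both spellings mentioned in the comment behave the same):

{code}
import org.apache.spark.SparkConf

object PythonWorkerMemorySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("python-worker-memory")
      // "512m" is accepted today; "512M" should be equivalent once the fix is in.
      .set("spark.python.worker.memory", "512M")
    println(conf.get("spark.python.worker.memory"))
  }
}
{code}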



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14368) Support spark.python.worker.memory with upper-case unit

2016-04-04 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-14368.

Resolution: Fixed

> Support spark.python.worker.memory with upper-case unit
> ---
>
> Key: SPARK-14368
> URL: https://issues.apache.org/jira/browse/SPARK-14368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Masahiro TANAKA
>Assignee: Yong Tang
>Priority: Trivial
> Fix For: 1.6.2, 2.0.0
>
>
> According to the 
> [document|https://spark.apache.org/docs/latest/configuration.html], 
> spark.python.worker.memory uses the same format as a JVM memory string, but an 
> upper-case unit is not accepted in `spark.python.worker.memory`. It should be 
> allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14368) Support spark.python.worker.memory with upper-case unit

2016-04-04 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-14368:
---
Assignee: Yong Tang

> Support spark.python.worker.memory with upper-case unit
> ---
>
> Key: SPARK-14368
> URL: https://issues.apache.org/jira/browse/SPARK-14368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Masahiro TANAKA
>Assignee: Yong Tang
>Priority: Trivial
>
> According to the 
> [document|https://spark.apache.org/docs/latest/configuration.html], 
> spark.python.worker.memory uses the same format as a JVM memory string, but an 
> upper-case unit is not accepted in `spark.python.worker.memory`. It should be 
> allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13211) StreamingContext throws NoSuchElementException when created from non-existent checkpoint directory

2016-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225593#comment-15225593
 ] 

Apache Spark commented on SPARK-13211:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12167

> StreamingContext throws NoSuchElementException when created from non-existent 
> checkpoint directory
> --
>
> Key: SPARK-13211
> URL: https://issues.apache.org/jira/browse/SPARK-13211
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> {code}
> scala> new StreamingContext("_checkpoint")
> 16/02/05 08:51:10 INFO Checkpoint: Checkpoint directory _checkpoint does not 
> exist
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:108)
>   at 
> org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:114)
>   ... 43 elided
> {code}
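
A defensive pattern on the user side (not the fix in the linked PR, which
improves the failure itself) is to go through StreamingContext.getOrCreate so a
missing checkpoint directory falls back to a factory function:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoverySketch {
  def main(args: Array[String]): Unit = {
    val checkpointDir = "_checkpoint"
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => {
      // Only runs when no valid checkpoint exists at checkpointDir.
      val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-demo")
      val newSsc = new StreamingContext(conf, Seconds(1))
      newSsc.checkpoint(checkpointDir)
      newSsc
    })
    println(ssc.sparkContext.appName)
    ssc.stop()
  }
}
{code}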



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14386) spark.ml DecisionTreeModel abstraction should not be exposed

2016-04-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14386.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12158
[https://github.com/apache/spark/pull/12158]

> spark.ml DecisionTreeModel abstraction should not be exposed
> 
>
> Key: SPARK-14386
> URL: https://issues.apache.org/jira/browse/SPARK-14386
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 2.0.0
>
>
> In spark.ml, GBT and RandomForest expose the trait DecisionTreeModel in the 
> {{trees}} method, but they should not since it is a private trait (and not 
> ready to be made public).  It will also be more useful to users if we return 
> the concrete types.
> Proposal: return concrete types
> * The MIMA checks appear to be OK with this change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13783) Model export/import for spark.ml: GBTs

2016-04-04 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225525#comment-15225525
 ] 

Yanbo Liang commented on SPARK-13783:
-

[~josephkb] I will work on this.

> Model export/import for spark.ml: GBTs
> --
>
> Key: SPARK-13783
> URL: https://issues.apache.org/jira/browse/SPARK-13783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for both GBTClassifier and GBTRegressor.  The implementation 
> should reuse the one for DecisionTree*.
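
A usage sketch of the requested API, assuming it mirrors the existing
DecisionTree/RandomForest persistence (an MLWriter plus a companion-object
load); it is not available until this ticket is done:

{code}
import org.apache.spark.ml.classification.GBTClassificationModel

object GbtPersistenceSketch {
  def roundTrip(model: GBTClassificationModel, path: String): GBTClassificationModel = {
    model.write.overwrite().save(path)  // persist the trees, weights and metadata
    GBTClassificationModel.load(path)   // read them back as the same concrete type
  }
}
{code}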



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14287) Method to determine if Dataset is bounded or not

2016-04-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14287.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12080
[https://github.com/apache/spark/pull/12080]

> Method to determine if Dataset is bounded or not
> 
>
> Key: SPARK-14287
> URL: https://issues.apache.org/jira/browse/SPARK-14287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Burak Yavuz
> Fix For: 2.0.0
>
>
> With the addition of StreamExecution (ContinuousQuery) to Datasets, data will 
> become unbounded. With unbounded data, the execution of some methods and 
> operations will not make sense, e.g. Dataset.count().
> A simple API is required to check whether the data in a Dataset is bounded or 
> unbounded. This will allow users to check whether their Dataset is in 
> streaming mode or not. ML algorithms may check if the data is unbounded and 
> throw an exception for example.
> The implementation of this method is simple; however, naming it is the 
> challenge. Some possible names for this method are:
>  - isStreaming
>  - isContinuous
>  - isBounded
>  - isUnbounded
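
A usage sketch, assuming the method ends up with the name {{isStreaming}} (one
of the candidates listed above):

{code}
import org.apache.spark.sql.{Dataset, SparkSession}

object BoundedCheckSketch {
  // Guard helper: reject unbounded Datasets before running a batch-only operation.
  def requireBatch[T](ds: Dataset[T]): Unit = {
    if (ds.isStreaming) {
      throw new IllegalArgumentException(
        "This operation is only defined on bounded (non-streaming) Datasets")
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("bounded-check").getOrCreate()
    import spark.implicits._
    val batch = Seq(1, 2, 3).toDS()
    requireBatch(batch)         // passes: an in-memory Dataset is bounded
    println(batch.isStreaming)  // false
    spark.stop()
  }
}
{code}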



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14240) PySpark Standalone Application hangs without any Error message

2016-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14240.
---
Resolution: Not A Problem

Yeah all of this indicates you are not giving your job enough memory. It's not 
a Spark problem.

> PySpark Standalone Application hangs without any Error message
> --
>
> Key: SPARK-14240
> URL: https://issues.apache.org/jira/browse/SPARK-14240
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, PySpark
>Affects Versions: 1.6.0
>Reporter: Sayak Ghosh
> Attachments: DAG visualisation of pending stages.png, 
> error_terminal.png, event timeline.png
>
>
> I am relatively new to Spark and wrote a simple script using Python and Spark 
> SQL. My problem is that everything is fine at the start of the execution, but it 
> gradually slows down, and at the end of the last phase the whole application 
> hangs.
> Here is my code snippet - 
>   hivectx.registerDataFrameAsTable(aggregatedDataV1,"aggregatedDataV1")
> q1 = "SELECT *, (Total_Sale/Sale_Weeks) as Average_Sale_Per_SaleWeek, 
> (Total_Weeks/Sale_Weeks) as Velocity FROM aggregatedDataV1"
> aggregatedData = hivectx.sql(q1)
> aggregatedData.show(100)
> == Terminal Hanging with the following =
> 16/03/29 09:05:50 INFO TaskSetManager: Finished task 96.0 in stage 416.0 (TID 
> 19992) in 41924 ms on 10.9.0.7 (104/200)
> 16/03/29 09:05:50 INFO TaskSetManager: Finished task 108.0 in stage 416.0 
> (TID 20004) in 24608 ms on 10.9.0.10 (105/200)
> 16/03/29 09:05:50 INFO TaskSetManager: Finished task 105.0 in stage 416.0 
> (TID 20001) in 24610 ms on 10.9.0.10 (106/200)
> 16/03/29 09:05:55 INFO TaskSetManager: Starting task 116.0 in stage 416.0 
> (TID 20012, 10.9.0.10, partition 116,NODE_LOCAL, 2240 bytes)
> 16/03/29 09:06:31 INFO TaskSetManager: Finished task 99.0 in stage 416.0 (TID 
> 19995) in 78435 ms on 10.9.0.7 (110/200)
> 16/03/29 09:06:40 INFO TaskSetManager: Starting task 119.0 in stage 416.0 
> (TID 20015, 10.9.0.10, partition 119,NODE_LOCAL, 2240 bytes)
> 16/03/29 09:07:12 INFO TaskSetManager: Starting task 122.0 in stage 416.0 
> (TID 20018, 10.9.0.7, partition 122,NODE_LOCAL, 2240 bytes) 
> 16/03/29 09:07:16 INFO TaskSetManager: Starting task 123.0 in stage 416.0 
> (TID 20019, 10.9.0.7, partition 123,NODE_LOCAL, 2240 bytes)
> 16/03/29 09:07:28 INFO TaskSetManager: Finished task 111.0 in stage 416.0 
> (TID 20007) in 110198 ms on 10.9.0.7 (114/200)
> 16/03/29 09:07:52 INFO TaskSetManager: Starting task 124.0 in stage 416.0 
> (TID 20020, 10.9.0.10, partition 124,NODE_LOCAL, 2240 bytes)
> 16/03/29 09:08:08 INFO TaskSetManager: Finished task 110.0 in stage 416.0 
> (TID 20006) in 150023 ms on 10.9.0.7 (115/200)
> 16/03/29 09:08:12 INFO TaskSetManager: Finished task 113.0 in stage 416.0 
> (TID 20009) in 154120 ms on 10.9.0.7 (116/200)
> 16/03/29 09:08:16 INFO TaskSetManager: Finished task 116.0 in stage 416.0 
> (TID 20012) in 145691 ms on 10.9.0.10 (117/200)
> There is no sign of error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12425) DStream union optimisation

2016-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12425:
--
Assignee: Guillaume Poulin

> DStream union optimisation
> --
>
> Key: SPARK-12425
> URL: https://issues.apache.org/jira/browse/SPARK-12425
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Guillaume Poulin
>Assignee: Guillaume Poulin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, `DStream.union` always uses `UnionRDD` on the underlying `RDD`. 
> However, using `PartitionerAwareUnionRDD` when possible would yield better 
> performance.
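
As a rough illustration of the idea (not the actual patch), the selection logic 
could look like the sketch below; note that `PartitionerAwareUnionRDD` is internal 
to Spark, so this only compiles inside the org.apache.spark package:

{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{PartitionerAwareUnionRDD, RDD, UnionRDD}

// Sketch only: use the partitioner-aware union when every input RDD shares the
// same defined partitioner, otherwise fall back to the generic UnionRDD.
def unionPreservingPartitioner[T: ClassTag](sc: SparkContext, rdds: Seq[RDD[T]]): RDD[T] = {
  val partitioners = rdds.map(_.partitioner).distinct
  if (partitioners.size == 1 && partitioners.head.isDefined) {
    new PartitionerAwareUnionRDD(sc, rdds)  // keeps the common partitioner
  } else {
    new UnionRDD(sc, rdds)                  // generic union, drops the partitioner
  }
}
{code}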



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12425) DStream union optimisation

2016-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12425.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10382
[https://github.com/apache/spark/pull/10382]

> DStream union optimisation
> --
>
> Key: SPARK-12425
> URL: https://issues.apache.org/jira/browse/SPARK-12425
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Guillaume Poulin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, `DStream.union` always uses `UnionRDD` on the underlying `RDD`. 
> However, using `PartitionerAwareUnionRDD` when possible would yield better 
> performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel

2016-04-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225506#comment-15225506
 ] 

Joseph K. Bradley commented on SPARK-13048:
---

[~jvstein] Would you be able to test whether the patch I just sent takes care 
of the problem you encountered?

> EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
> --
>
> Key: SPARK-13048
> URL: https://issues.apache.org/jira/browse/SPARK-13048
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
> Environment: Standalone Spark cluster
>Reporter: Jeff Stein
>Assignee: Joseph K. Bradley
>
> In EMLDAOptimizer, all checkpoints are deleted before returning the 
> DistributedLDAModel.
> The most recent checkpoint is still necessary for operations on the 
> DistributedLDAModel under a couple scenarios:
> - The graph doesn't fit in memory on the worker nodes (e.g. very large data 
> set).
> - Late worker failures that require reading the now-dependent checkpoint.
> I ran into this problem running a 10M record LDA model in a memory starved 
> environment. The model consistently failed in either the {{collect at 
> LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the 
> {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the 
> model). In both cases, a FileNotFoundException is thrown attempting to access 
> a checkpoint file.
> I'm not sure what the correct fix is here; it might involve a class signature 
> change. An alternative simple fix is to leave the last checkpoint around and 
> expect the user to clean the checkpoint directory themselves.
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071
> {noformat}
> Relevant code is included below.
> LDAOptimizer.scala:
> {noformat}
>   override private[clustering] def getLDAModel(iterationTimes: 
> Array[Double]): LDAModel = {
> require(graph != null, "graph is null, EMLDAOptimizer not initialized.")
> this.graphCheckpointer.deleteAllCheckpoints()
> // The constructor's default arguments assume gammaShape = 100 to ensure 
> equivalence in
> // LDAModel.toLocal conversion
> new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, 
> this.vocabSize,
>   Vectors.dense(Array.fill(this.k)(this.docConcentration)), 
> this.topicConcentration,
>   iterationTimes)
>   }
> {noformat}
> PeriodicCheckpointer.scala
> {noformat}
>   /**
>* Call this at the end to delete any remaining checkpoint files.
>*/
>   def deleteAllCheckpoints(): Unit = {
> while (checkpointQueue.nonEmpty) {
>   removeCheckpointFile()
> }
>   }
>   /**
>* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files.
>* This prints a warning but does not fail if the files cannot be removed.
>*/
>   private def removeCheckpointFile(): Unit = {
> val old = checkpointQueue.dequeue()
> // Since the old checkpoint is not deleted by Spark, we manually delete 
> it.
> val fs = FileSystem.get(sc.hadoopConfiguration)
> getCheckpointFiles(old).foreach { checkpointFile =>
>   try {
> fs.delete(new Path(checkpointFile), true)
>   } catch {
> case e: Exception =>
>   logWarning("PeriodicCheckpointer could not remove old checkpoint 
> file: " +
> checkpointFile)
>   }
> }
>   }
> {noformat}
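
A minimal sketch of the "leave the last checkpoint around" alternative described 
above, written as a hypothetical addition to the PeriodicCheckpointer class quoted 
here (the method name is an assumption, not Spark API):

{code}
// Hypothetical method on PeriodicCheckpointer (sketch only): delete older
// checkpoints but keep the most recent one, which the returned
// DistributedLDAModel may still need in order to recompute lost partitions.
def deleteAllCheckpointsButLast(): Unit = {
  while (checkpointQueue.size > 1) {
    removeCheckpointFile()
  }
}
{code}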



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13048:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
> --
>
> Key: SPARK-13048
> URL: https://issues.apache.org/jira/browse/SPARK-13048
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
> Environment: Standalone Spark cluster
>Reporter: Jeff Stein
>Assignee: Apache Spark
>
> In EMLDAOptimizer, all checkpoints are deleted before returning the 
> DistributedLDAModel.
> The most recent checkpoint is still necessary for operations on the 
> DistributedLDAModel under a couple scenarios:
> - The graph doesn't fit in memory on the worker nodes (e.g. very large data 
> set).
> - Late worker failures that require reading the now-dependent checkpoint.
> I ran into this problem running a 10M record LDA model in a memory starved 
> environment. The model consistently failed in either the {{collect at 
> LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the 
> {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the 
> model). In both cases, a FileNotFoundException is thrown attempting to access 
> a checkpoint file.
> I'm not sure what the correct fix is here; it might involve a class signature 
> change. An alternative simple fix is to leave the last checkpoint around and 
> expect the user to clean the checkpoint directory themselves.
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071
> {noformat}
> Relevant code is included below.
> LDAOptimizer.scala:
> {noformat}
>   override private[clustering] def getLDAModel(iterationTimes: 
> Array[Double]): LDAModel = {
> require(graph != null, "graph is null, EMLDAOptimizer not initialized.")
> this.graphCheckpointer.deleteAllCheckpoints()
> // The constructor's default arguments assume gammaShape = 100 to ensure 
> equivalence in
> // LDAModel.toLocal conversion
> new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, 
> this.vocabSize,
>   Vectors.dense(Array.fill(this.k)(this.docConcentration)), 
> this.topicConcentration,
>   iterationTimes)
>   }
> {noformat}
> PeriodicCheckpointer.scala
> {noformat}
>   /**
>* Call this at the end to delete any remaining checkpoint files.
>*/
>   def deleteAllCheckpoints(): Unit = {
> while (checkpointQueue.nonEmpty) {
>   removeCheckpointFile()
> }
>   }
>   /**
>* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files.
>* This prints a warning but does not fail if the files cannot be removed.
>*/
>   private def removeCheckpointFile(): Unit = {
> val old = checkpointQueue.dequeue()
> // Since the old checkpoint is not deleted by Spark, we manually delete 
> it.
> val fs = FileSystem.get(sc.hadoopConfiguration)
> getCheckpointFiles(old).foreach { checkpointFile =>
>   try {
> fs.delete(new Path(checkpointFile), true)
>   } catch {
> case e: Exception =>
>   logWarning("PeriodicCheckpointer could not remove old checkpoint 
> file: " +
> checkpointFile)
>   }
> }
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel

2016-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225502#comment-15225502
 ] 

Apache Spark commented on SPARK-13048:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/12166

> EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
> --
>
> Key: SPARK-13048
> URL: https://issues.apache.org/jira/browse/SPARK-13048
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
> Environment: Standalone Spark cluster
>Reporter: Jeff Stein
>Assignee: Joseph K. Bradley
>
> In EMLDAOptimizer, all checkpoints are deleted before returning the 
> DistributedLDAModel.
> The most recent checkpoint is still necessary for operations on the 
> DistributedLDAModel under a couple scenarios:
> - The graph doesn't fit in memory on the worker nodes (e.g. very large data 
> set).
> - Late worker failures that require reading the now-dependent checkpoint.
> I ran into this problem running a 10M record LDA model in a memory starved 
> environment. The model consistently failed in either the {{collect at 
> LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the 
> {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the 
> model). In both cases, a FileNotFoundException is thrown attempting to access 
> a checkpoint file.
> I'm not sure what the correct fix is here; it might involve a class signature 
> change. An alternative simple fix is to leave the last checkpoint around and 
> expect the user to clean the checkpoint directory themselves.
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071
> {noformat}
> Relevant code is included below.
> LDAOptimizer.scala:
> {noformat}
>   override private[clustering] def getLDAModel(iterationTimes: 
> Array[Double]): LDAModel = {
> require(graph != null, "graph is null, EMLDAOptimizer not initialized.")
> this.graphCheckpointer.deleteAllCheckpoints()
> // The constructor's default arguments assume gammaShape = 100 to ensure 
> equivalence in
> // LDAModel.toLocal conversion
> new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, 
> this.vocabSize,
>   Vectors.dense(Array.fill(this.k)(this.docConcentration)), 
> this.topicConcentration,
>   iterationTimes)
>   }
> {noformat}
> PeriodicCheckpointer.scala
> {noformat}
>   /**
>* Call this at the end to delete any remaining checkpoint files.
>*/
>   def deleteAllCheckpoints(): Unit = {
> while (checkpointQueue.nonEmpty) {
>   removeCheckpointFile()
> }
>   }
>   /**
>* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files.
>* This prints a warning but does not fail if the files cannot be removed.
>*/
>   private def removeCheckpointFile(): Unit = {
> val old = checkpointQueue.dequeue()
> // Since the old checkpoint is not deleted by Spark, we manually delete 
> it.
> val fs = FileSystem.get(sc.hadoopConfiguration)
> getCheckpointFiles(old).foreach { checkpointFile =>
>   try {
> fs.delete(new Path(checkpointFile), true)
>   } catch {
> case e: Exception =>
>   logWarning("PeriodicCheckpointer could not remove old checkpoint 
> file: " +
> checkpointFile)
>   }
> }
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13048:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
> --
>
> Key: SPARK-13048
> URL: https://issues.apache.org/jira/browse/SPARK-13048
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
> Environment: Standalone Spark cluster
>Reporter: Jeff Stein
>Assignee: Joseph K. Bradley
>
> In EMLDAOptimizer, all checkpoints are deleted before returning the 
> DistributedLDAModel.
> The most recent checkpoint is still necessary for operations on the 
> DistributedLDAModel under a couple scenarios:
> - The graph doesn't fit in memory on the worker nodes (e.g. very large data 
> set).
> - Late worker failures that require reading the now-dependent checkpoint.
> I ran into this problem running a 10M record LDA model in a memory starved 
> environment. The model consistently failed in either the {{collect at 
> LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the 
> {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the 
> model). In both cases, a FileNotFoundException is thrown attempting to access 
> a checkpoint file.
> I'm not sure what the correct fix is here; it might involve a class signature 
> change. An alternative simple fix is to leave the last checkpoint around and 
> expect the user to clean the checkpoint directory themselves.
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071
> {noformat}
> Relevant code is included below.
> LDAOptimizer.scala:
> {noformat}
>   override private[clustering] def getLDAModel(iterationTimes: 
> Array[Double]): LDAModel = {
> require(graph != null, "graph is null, EMLDAOptimizer not initialized.")
> this.graphCheckpointer.deleteAllCheckpoints()
> // The constructor's default arguments assume gammaShape = 100 to ensure 
> equivalence in
> // LDAModel.toLocal conversion
> new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, 
> this.vocabSize,
>   Vectors.dense(Array.fill(this.k)(this.docConcentration)), 
> this.topicConcentration,
>   iterationTimes)
>   }
> {noformat}
> PeriodicCheckpointer.scala
> {noformat}
>   /**
>* Call this at the end to delete any remaining checkpoint files.
>*/
>   def deleteAllCheckpoints(): Unit = {
> while (checkpointQueue.nonEmpty) {
>   removeCheckpointFile()
> }
>   }
>   /**
>* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files.
>* This prints a warning but does not fail if the files cannot be removed.
>*/
>   private def removeCheckpointFile(): Unit = {
> val old = checkpointQueue.dequeue()
> // Since the old checkpoint is not deleted by Spark, we manually delete 
> it.
> val fs = FileSystem.get(sc.hadoopConfiguration)
> getCheckpointFiles(old).foreach { checkpointFile =>
>   try {
> fs.delete(new Path(checkpointFile), true)
>   } catch {
> case e: Exception =>
>   logWarning("PeriodicCheckpointer could not remove old checkpoint 
> file: " +
> checkpointFile)
>   }
> }
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14395) remove some unused code in REPL module

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14395:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove some unused code in REPL module
> --
>
> Key: SPARK-14395
> URL: https://issues.apache.org/jira/browse/SPARK-14395
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14395) remove some unused code in REPL module

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14395:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove some unused code in REPL module
> --
>
> Key: SPARK-14395
> URL: https://issues.apache.org/jira/browse/SPARK-14395
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14395) remove some unused code in REPL module

2016-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225494#comment-15225494
 ] 

Apache Spark commented on SPARK-14395:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12164

> remove some unused code in REPL module
> --
>
> Key: SPARK-14395
> URL: https://issues.apache.org/jira/browse/SPARK-14395
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14347) Require Java 8 for Spark 2.x

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14347:


Assignee: (was: Apache Spark)

> Require Java 8 for Spark 2.x
> 
>
> Key: SPARK-14347
> URL: https://issues.apache.org/jira/browse/SPARK-14347
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Spark Core, SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> Putting this down as a JIRA to advance the discussion -- I think it is far 
> enough along toward consensus for that.
> The change here is to require Java 8. This means:
> - Require Java 8 in the build
> - Only build and test with Java 8, removing other older Jenkins configs
> - Remove MaxPermSize
> - Remove reflection to use Java 8-only methods
> - Move external/java8-tests to core/streaming and remove profile
> And optionally:
> - Update all Java 8 code to take advantage of 8+ features, like lambdas, for 
> simplification



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14347) Require Java 8 for Spark 2.x

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14347:


Assignee: Apache Spark

> Require Java 8 for Spark 2.x
> 
>
> Key: SPARK-14347
> URL: https://issues.apache.org/jira/browse/SPARK-14347
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Spark Core, SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>
> Putting this down as a JIRA to advance the discussion -- I think it is far 
> enough along toward consensus for that.
> The change here is to require Java 8. This means:
> - Require Java 8 in the build
> - Only build and test with Java 8, removing other older Jenkins configs
> - Remove MaxPermSize
> - Remove reflection to use Java 8-only methods
> - Move external/java8-tests to core/streaming and remove profile
> And optionally:
> - Update all Java 8 code to take advantage of 8+ features, like lambdas, for 
> simplification



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14347) Require Java 8 for Spark 2.x

2016-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225495#comment-15225495
 ] 

Apache Spark commented on SPARK-14347:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12165

> Require Java 8 for Spark 2.x
> 
>
> Key: SPARK-14347
> URL: https://issues.apache.org/jira/browse/SPARK-14347
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Spark Core, SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> Putting this down as a JIRA to advance the discussion -- I think it is far 
> enough along toward consensus for that.
> The change here is to require Java 8. This means:
> - Require Java 8 in the build
> - Only build and test with Java 8, removing other older Jenkins configs
> - Remove MaxPermSize
> - Remove reflection to use Java 8-only methods
> - Move external/java8-tests to core/streaming and remove profile
> And optionally:
> - Update all Java 8 code to take advantage of 8+ features, like lambdas, for 
> simplification



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14395) remove some unused code in REPL module

2016-04-04 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-14395:
---

 Summary: remove some unused code in REPL module
 Key: SPARK-14395
 URL: https://issues.apache.org/jira/browse/SPARK-14395
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Wenchen Fan
Assignee: Wenchen Fan
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14368) Support python.spark.worker.memory with upper-case unit

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14368:


Assignee: Apache Spark

> Support python.spark.worker.memory with upper-case unit
> ---
>
> Key: SPARK-14368
> URL: https://issues.apache.org/jira/browse/SPARK-14368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Masahiro TANAKA
>Assignee: Apache Spark
>Priority: Trivial
>
> According to the 
> [documentation|https://spark.apache.org/docs/latest/configuration.html], 
> spark.python.worker.memory takes the same format as a JVM memory string, but an 
> upper-case unit is not accepted in `spark.python.worker.memory`. It should be 
> allowed.
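
For illustration only, a Scala sketch of the kind of case-insensitive parsing the 
description asks for; this is not Spark's actual parser, and the helper name is an 
assumption:

{code}
// Sketch: normalize the unit suffix so "512M" and "512m" are both accepted,
// returning the value in megabytes. Not the real PySpark worker code.
def parseMemoryToMB(s: String): Int = {
  val lower = s.trim.toLowerCase
  val factors = Map("k" -> 1.0 / 1024, "m" -> 1.0, "g" -> 1024.0, "t" -> 1024.0 * 1024)
  factors.get(lower.takeRight(1)) match {
    case Some(factor) => (lower.dropRight(1).toDouble * factor).toInt
    case None         => (lower.toDouble / (1 << 20)).toInt  // bare byte count
  }
}
{code}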



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14368) Support python.spark.worker.memory with upper-case unit

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14368:


Assignee: (was: Apache Spark)

> Support python.spark.worker.memory with upper-case unit
> ---
>
> Key: SPARK-14368
> URL: https://issues.apache.org/jira/browse/SPARK-14368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Masahiro TANAKA
>Priority: Trivial
>
> According to the 
> [documentation|https://spark.apache.org/docs/latest/configuration.html], 
> spark.python.worker.memory takes the same format as a JVM memory string, but an 
> upper-case unit is not accepted in `spark.python.worker.memory`. It should be 
> allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14368) Support python.spark.worker.memory with upper-case unit

2016-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225486#comment-15225486
 ] 

Apache Spark commented on SPARK-14368:
--

User 'yongtang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12163

> Support python.spark.worker.memory with upper-case unit
> ---
>
> Key: SPARK-14368
> URL: https://issues.apache.org/jira/browse/SPARK-14368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Masahiro TANAKA
>Priority: Trivial
>
> According to the 
> [documentation|https://spark.apache.org/docs/latest/configuration.html], 
> spark.python.worker.memory takes the same format as a JVM memory string, but an 
> upper-case unit is not accepted in `spark.python.worker.memory`. It should be 
> allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14368) Support python.spark.worker.memory with upper-case unit

2016-04-04 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225471#comment-15225471
 ] 

Yong Tang commented on SPARK-14368:
---

That looks like an easy fix. Will create a pull request shortly.

> Support python.spark.worker.memory with upper-case unit
> ---
>
> Key: SPARK-14368
> URL: https://issues.apache.org/jira/browse/SPARK-14368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Masahiro TANAKA
>Priority: Trivial
>
> According to the 
> [documentation|https://spark.apache.org/docs/latest/configuration.html], 
> spark.python.worker.memory takes the same format as a JVM memory string, but an 
> upper-case unit is not accepted in `spark.python.worker.memory`. It should be 
> allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14289) Support multiple eviction strategies for cached RDD partitions

2016-04-04 Thread Yuanzhen Geng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanzhen Geng updated SPARK-14289:
--
Summary: Support multiple eviction strategies for cached RDD partitions  
(was: Add support to multiple eviction strategies for cached RDD partitions)

> Support multiple eviction strategies for cached RDD partitions
> --
>
> Key: SPARK-14289
> URL: https://issues.apache.org/jira/browse/SPARK-14289
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 3
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Yuanzhen Geng
>Priority: Minor
>
> Currently, there is only one eviction strategy for cached RDD partitions in 
> Spark. The default RDD eviction strategy is LRU (with an additional rule that 
> it does not evict another block belonging to the same RDD as the partition 
> currently being created).
> When memory space is not sufficient for RDD caching, several partitions will 
> be evicted; if these partitions are used again later, they will be reproduced 
> from the lineage information and cached in memory again. The reproduce phase 
> brings in additional cost, and LRU gives no guarantee of the lowest reproduce 
> cost. 
> The first RDD that needs to be cached is usually generated by reading from 
> HDFS and doing several transformations. The reading operation usually costs 
> more time than other Spark transformations. 
> For example, in one stage we have the following DAG structure: hdfs -> 
> \[A\] -> B -> \[C\] -> D -> \[E\] -> \[F\], where RDD A, C, E, F need to be 
> cached in memory; F is created during this stage while A, B and E had 
> already been created in previous stages. When using the LRU eviction 
> strategy, a partition of A will be evicted first. However, the time cost of 
> \[A\] -> B -> \[C\] may be much less than hdfs -> \[A\], so evicting \[C\] 
> may be better than evicting \[A\]. 
> An eviction strategy based on creation cost may be better than LRU: record 
> each transformation's time during the creation of a cached RDD partition 
> (e.g. \[E\] only needs to record the time cost of \[C\] -> D and D -> 
> \[E\]) plus the time cost of any needed shuffle reads. When memory for RDD 
> storage is not sufficient, the partition with the least creation cost is 
> evicted first. This strategy can be called LCS. My current demo shows a 
> better performance gain than the default LRU.
> This strategy needs to consider the following situations:
> 1. Unified Memory Management is provided after Spark 1.6; the memory used for 
> execution while recomputing a partition may be quite different from the 
> first time the partition was created. So until this is better thought 
> through, LCS may not be allowed in UMM mode. (Though my demo also shows LCS 
> improving on LRU in UMM mode.)
> 2. MEMORY_AND_DISK_SER and similar storage levels may serialize RDD 
> partitions. By estimating the ser/deserialize cost and comparing it to the 
> creation cost, a partition whose ser/deserialize cost is even larger than its 
> recreation cost should not be serialized but directly removed from memory. As 
> existing storage levels only apply to the whole RDD, a new storage level may 
> be needed so that each RDD partition can directly decide whether to serialize 
> or just be removed from memory.
> Besides LCS, FIFO or LFU would be easy to implement.
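
To make the proposed policy concrete, here is an illustrative Scala sketch of 
least-creation-cost victim selection; the names and the CachedBlock record are 
hypothetical and not part of Spark's BlockManager API:

{code}
// Sketch of LCS victim selection: among evictable cached blocks, drop the one
// with the smallest recorded creation cost instead of the least recently used,
// while still never evicting blocks of the RDD currently being materialized.
case class CachedBlock(blockId: String, rddId: Int, sizeBytes: Long, creationCostMs: Long)

def selectVictimLCS(blocks: Seq[CachedBlock], creatingRddId: Int): Option[CachedBlock] = {
  blocks
    .filter(_.rddId != creatingRddId)  // same-RDD rule carried over from LRU
    .sortBy(_.creationCostMs)          // cheapest-to-recompute first
    .headOption
}
{code}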



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14289) Add support to multiple eviction strategies for cached RDD partitions

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14289:


Assignee: (was: Apache Spark)

> Add support to multiple eviction strategies for cached RDD partitions
> -
>
> Key: SPARK-14289
> URL: https://issues.apache.org/jira/browse/SPARK-14289
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 3
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Yuanzhen Geng
>Priority: Minor
>
> Currently, there is only one eviction strategy for cached RDD partitions in 
> Spark. The default RDD eviction strategy is LRU (with an additional rule that 
> it does not evict another block belonging to the same RDD as the partition 
> currently being created).
> When memory space is not sufficient for RDD caching, several partitions will 
> be evicted; if these partitions are used again later, they will be reproduced 
> from the lineage information and cached in memory again. The reproduce phase 
> brings in additional cost, and LRU gives no guarantee of the lowest reproduce 
> cost. 
> The first RDD that needs to be cached is usually generated by reading from 
> HDFS and doing several transformations. The reading operation usually costs 
> more time than other Spark transformations. 
> For example, in one stage we have the following DAG structure: hdfs -> 
> \[A\] -> B -> \[C\] -> D -> \[E\] -> \[F\], where RDD A, C, E, F need to be 
> cached in memory; F is created during this stage while A, B and E had 
> already been created in previous stages. When using the LRU eviction 
> strategy, a partition of A will be evicted first. However, the time cost of 
> \[A\] -> B -> \[C\] may be much less than hdfs -> \[A\], so evicting \[C\] 
> may be better than evicting \[A\]. 
> An eviction strategy based on creation cost may be better than LRU: record 
> each transformation's time during the creation of a cached RDD partition 
> (e.g. \[E\] only needs to record the time cost of \[C\] -> D and D -> 
> \[E\]) plus the time cost of any needed shuffle reads. When memory for RDD 
> storage is not sufficient, the partition with the least creation cost is 
> evicted first. This strategy can be called LCS. My current demo shows a 
> better performance gain than the default LRU.
> This strategy needs to consider the following situations:
> 1. Unified Memory Management is provided after Spark 1.6; the memory used for 
> execution while recomputing a partition may be quite different from the 
> first time the partition was created. So until this is better thought 
> through, LCS may not be allowed in UMM mode. (Though my demo also shows LCS 
> improving on LRU in UMM mode.)
> 2. MEMORY_AND_DISK_SER and similar storage levels may serialize RDD 
> partitions. By estimating the ser/deserialize cost and comparing it to the 
> creation cost, a partition whose ser/deserialize cost is even larger than its 
> recreation cost should not be serialized but directly removed from memory. As 
> existing storage levels only apply to the whole RDD, a new storage level may 
> be needed so that each RDD partition can directly decide whether to serialize 
> or just be removed from memory.
> Besides LCS, FIFO or LFU would be easy to implement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14289) Add support to multiple eviction strategies for cached RDD partitions

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14289:


Assignee: Apache Spark

> Add support to multiple eviction strategies for cached RDD partitions
> -
>
> Key: SPARK-14289
> URL: https://issues.apache.org/jira/browse/SPARK-14289
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 3
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Yuanzhen Geng
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, there is only one eviction strategy for cached RDD partitions in 
> Spark. The default RDD eviction strategy is LRU (with an additional rule that 
> it does not evict another block belonging to the same RDD as the partition 
> currently being created).
> When memory space is not sufficient for RDD caching, several partitions will 
> be evicted; if these partitions are used again later, they will be reproduced 
> from the lineage information and cached in memory again. The reproduce phase 
> brings in additional cost, and LRU gives no guarantee of the lowest reproduce 
> cost. 
> The first RDD that needs to be cached is usually generated by reading from 
> HDFS and doing several transformations. The reading operation usually costs 
> more time than other Spark transformations. 
> For example, in one stage we have the following DAG structure: hdfs -> 
> \[A\] -> B -> \[C\] -> D -> \[E\] -> \[F\], where RDD A, C, E, F need to be 
> cached in memory; F is created during this stage while A, B and E had 
> already been created in previous stages. When using the LRU eviction 
> strategy, a partition of A will be evicted first. However, the time cost of 
> \[A\] -> B -> \[C\] may be much less than hdfs -> \[A\], so evicting \[C\] 
> may be better than evicting \[A\]. 
> An eviction strategy based on creation cost may be better than LRU: record 
> each transformation's time during the creation of a cached RDD partition 
> (e.g. \[E\] only needs to record the time cost of \[C\] -> D and D -> 
> \[E\]) plus the time cost of any needed shuffle reads. When memory for RDD 
> storage is not sufficient, the partition with the least creation cost is 
> evicted first. This strategy can be called LCS. My current demo shows a 
> better performance gain than the default LRU.
> This strategy needs to consider the following situations:
> 1. Unified Memory Management is provided after Spark 1.6; the memory used for 
> execution while recomputing a partition may be quite different from the 
> first time the partition was created. So until this is better thought 
> through, LCS may not be allowed in UMM mode. (Though my demo also shows LCS 
> improving on LRU in UMM mode.)
> 2. MEMORY_AND_DISK_SER and similar storage levels may serialize RDD 
> partitions. By estimating the ser/deserialize cost and comparing it to the 
> creation cost, a partition whose ser/deserialize cost is even larger than its 
> recreation cost should not be serialized but directly removed from memory. As 
> existing storage levels only apply to the whole RDD, a new storage level may 
> be needed so that each RDD partition can directly decide whether to serialize 
> or just be removed from memory.
> Besides LCS, FIFO or LFU would be easy to implement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14289) Add support to multiple eviction strategies for cached RDD partitions

2016-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225469#comment-15225469
 ] 

Apache Spark commented on SPARK-14289:
--

User 'Earne' has created a pull request for this issue:
https://github.com/apache/spark/pull/12162

> Add support to multiple eviction strategies for cached RDD partitions
> -
>
> Key: SPARK-14289
> URL: https://issues.apache.org/jira/browse/SPARK-14289
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 3
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Yuanzhen Geng
>Priority: Minor
>
> Currently, there is only one eviction strategy for cached RDD partitions in 
> Spark. The default RDD eviction strategy is LRU (with an additional rule that 
> it does not evict another block belonging to the same RDD as the partition 
> currently being created).
> When memory space is not sufficient for RDD caching, several partitions will 
> be evicted; if these partitions are used again later, they will be reproduced 
> from the lineage information and cached in memory again. The reproduce phase 
> brings in additional cost, and LRU gives no guarantee of the lowest reproduce 
> cost. 
> The first RDD that needs to be cached is usually generated by reading from 
> HDFS and doing several transformations. The reading operation usually costs 
> more time than other Spark transformations. 
> For example, in one stage we have the following DAG structure: hdfs -> 
> \[A\] -> B -> \[C\] -> D -> \[E\] -> \[F\], where RDD A, C, E, F need to be 
> cached in memory; F is created during this stage while A, B and E had 
> already been created in previous stages. When using the LRU eviction 
> strategy, a partition of A will be evicted first. However, the time cost of 
> \[A\] -> B -> \[C\] may be much less than hdfs -> \[A\], so evicting \[C\] 
> may be better than evicting \[A\]. 
> An eviction strategy based on creation cost may be better than LRU: record 
> each transformation's time during the creation of a cached RDD partition 
> (e.g. \[E\] only needs to record the time cost of \[C\] -> D and D -> 
> \[E\]) plus the time cost of any needed shuffle reads. When memory for RDD 
> storage is not sufficient, the partition with the least creation cost is 
> evicted first. This strategy can be called LCS. My current demo shows a 
> better performance gain than the default LRU.
> This strategy needs to consider the following situations:
> 1. Unified Memory Management is provided after Spark 1.6; the memory used for 
> execution while recomputing a partition may be quite different from the 
> first time the partition was created. So until this is better thought 
> through, LCS may not be allowed in UMM mode. (Though my demo also shows LCS 
> improving on LRU in UMM mode.)
> 2. MEMORY_AND_DISK_SER and similar storage levels may serialize RDD 
> partitions. By estimating the ser/deserialize cost and comparing it to the 
> creation cost, a partition whose ser/deserialize cost is even larger than its 
> recreation cost should not be serialized but directly removed from memory. As 
> existing storage levels only apply to the whole RDD, a new storage level may 
> be needed so that each RDD partition can directly decide whether to serialize 
> or just be removed from memory.
> Besides LCS, FIFO or LFU would be easy to implement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13456) Cannot create encoders for case classes defined in Spark shell after upgrading to Scala 2.11

2016-04-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-13456.
-
Resolution: Fixed

> Cannot create encoders for case classes defined in Spark shell after 
> upgrading to Scala 2.11
> 
>
> Key: SPARK-13456
> URL: https://issues.apache.org/jira/browse/SPARK-13456
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Spark 2.0 started to use Scala 2.11 by default since [PR 
> #10608|https://github.com/apache/spark/pull/10608].  Unfortunately, after 
> this upgrade, Spark fails to create encoders for case classes defined in REPL:
> {code}
> import sqlContext.implicits._
> case class T(a: Int, b: Double)
> val ds = Seq(1 -> T(1, 1D), 2 -> T(2, 2D)).toDS()
> {code}
> Exception thrown:
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to generate an encoder for 
> inner class `T` without access to the scope that this class was defined in.
> Try moving this class out of its parent class.;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:565)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:561)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:304)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:353)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:333)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)

[jira] [Commented] (SPARK-13456) Cannot create encoders for case classes defined in Spark shell after upgrading to Scala 2.11

2016-04-04 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225468#comment-15225468
 ] 

Wenchen Fan commented on SPARK-13456:
-

This is a Scala bug: https://issues.scala-lang.org/browse/SI-9734

I'm going to resolve this JIRA since it is not a Spark issue.

> Cannot create encoders for case classes defined in Spark shell after 
> upgrading to Scala 2.11
> 
>
> Key: SPARK-13456
> URL: https://issues.apache.org/jira/browse/SPARK-13456
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Spark 2.0 started to use Scala 2.11 by default since [PR 
> #10608|https://github.com/apache/spark/pull/10608].  Unfortunately, after 
> this upgrade, Spark fails to create encoders for case classes defined in REPL:
> {code}
> import sqlContext.implicits._
> case class T(a: Int, b: Double)
> val ds = Seq(1 -> T(1, 1D), 2 -> T(2, 2D)).toDS()
> {code}
> Exception thrown:
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to generate an encoder for 
> inner class `T` without access to the scope that this class was defined in.
> Try moving this class out of its parent class.;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:565)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:561)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:304)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:353)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:333)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableO

[jira] [Updated] (SPARK-14289) Add support to multiple eviction strategies for cached RDD partitions

2016-04-04 Thread Yuanzhen Geng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanzhen Geng updated SPARK-14289:
--
Summary: Add support to multiple eviction strategies for cached RDD 
partitions  (was: Add support to multiple eviction strategys for cached RDD 
partitions)

> Add support to multiple eviction strategies for cached RDD partitions
> -
>
> Key: SPARK-14289
> URL: https://issues.apache.org/jira/browse/SPARK-14289
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 3
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Yuanzhen Geng
>Priority: Minor
>
> Currently, there is only one eviction strategy for cached RDD partitions in 
> Spark. The default RDD eviction strategy is LRU (with an additional rule that 
> it does not evict another block belonging to the same RDD as the partition 
> currently being created).
> When memory space is not sufficient for RDD caching, several partitions will 
> be evicted; if these partitions are used again later, they will be reproduced 
> from the lineage information and cached in memory again. The reproduce phase 
> brings in additional cost, and LRU gives no guarantee of the lowest reproduce 
> cost. 
> The first RDD that needs to be cached is usually generated by reading from 
> HDFS and doing several transformations. The reading operation usually costs 
> more time than other Spark transformations. 
> For example, in one stage we have the following DAG structure: hdfs -> 
> \[A\] -> B -> \[C\] -> D -> \[E\] -> \[F\], where RDD A, C, E, F need to be 
> cached in memory; F is created during this stage while A, B and E had 
> already been created in previous stages. When using the LRU eviction 
> strategy, a partition of A will be evicted first. However, the time cost of 
> \[A\] -> B -> \[C\] may be much less than hdfs -> \[A\], so evicting \[C\] 
> may be better than evicting \[A\]. 
> An eviction strategy based on creation cost may be better than LRU: record 
> each transformation's time during the creation of a cached RDD partition 
> (e.g. \[E\] only needs to record the time cost of \[C\] -> D and D -> 
> \[E\]) plus the time cost of any needed shuffle reads. When memory for RDD 
> storage is not sufficient, the partition with the least creation cost is 
> evicted first. This strategy can be called LCS. My current demo shows a 
> better performance gain than the default LRU.
> This strategy needs to consider the following situations:
> 1. Unified Memory Management is provided after Spark 1.6; the memory used for 
> execution while recomputing a partition may be quite different from the 
> first time the partition was created. So until this is better thought 
> through, LCS may not be allowed in UMM mode. (Though my demo also shows LCS 
> improving on LRU in UMM mode.)
> 2. MEMORY_AND_DISK_SER and similar storage levels may serialize RDD 
> partitions. By estimating the ser/deserialize cost and comparing it to the 
> creation cost, a partition whose ser/deserialize cost is even larger than its 
> recreation cost should not be serialized but directly removed from memory. As 
> existing storage levels only apply to the whole RDD, a new storage level may 
> be needed so that each RDD partition can directly decide whether to serialize 
> or just be removed from memory.
> Besides LCS, FIFO or LFU would be easy to implement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14394) Generate AggregateHashMap class during TungstenAggregate codegen

2016-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225450#comment-15225450
 ] 

Apache Spark commented on SPARK-14394:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12161

> Generate AggregateHashMap class during TungstenAggregate codegen
> 
>
> Key: SPARK-14394
> URL: https://issues.apache.org/jira/browse/SPARK-14394
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14394) Generate AggregateHashMap class during TungstenAggregate codegen

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14394:


Assignee: (was: Apache Spark)

> Generate AggregateHashMap class during TungstenAggregate codegen
> 
>
> Key: SPARK-14394
> URL: https://issues.apache.org/jira/browse/SPARK-14394
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14394) Generate AggregateHashMap class during TungstenAggregate codegen

2016-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14394:


Assignee: Apache Spark

> Generate AggregateHashMap class during TungstenAggregate codegen
> 
>
> Key: SPARK-14394
> URL: https://issues.apache.org/jira/browse/SPARK-14394
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14394) Generate AggregateHashMap class during TungstenAggregate codegen

2016-04-04 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-14394:
--

 Summary: Generate AggregateHashMap class during TungstenAggregate 
codegen
 Key: SPARK-14394
 URL: https://issues.apache.org/jira/browse/SPARK-14394
 Project: Spark
  Issue Type: Sub-task
Reporter: Sameer Agarwal






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce

2016-04-04 Thread Jason Piper (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Piper updated SPARK-14393:

Description: 
When utilising monotonicallyIncreasingId with a coalesce, it appears that every 
partition uses the same offset (0) leading to non-monotonically increasing IDs.

See examples below

{code}
>>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
+---+
|monotonicallyincreasingid()|
+---+
|25769803776|
|51539607552|
|77309411328|
|   103079215104|
|   128849018880|
|   163208757248|
|   188978561024|
|   214748364800|
|   240518168576|
|   266287972352|
+---+

>>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
+---+

>>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  1|
|  0|
|  0|
|  1|
|  2|
|  3|
|  0|
|  1|
|  2|
+---+
{code}

  was:
When utilising monotonicallyIncreasingId with a coalesce, it appears that every 
partition uses the same offset (0) leading to non-monotonically increasing IDs.

See examples below

{code}
>>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
+---+
|monotonicallyincreasingid()|
+---+
|25769803776|
|51539607552|
|77309411328|
|   103079215104|
|   128849018880|
|   163208757248|
|   188978561024|
|   214748364800|
|   240518168576|
|   266287972352|
+---+
```
>>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
+---+

>>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  1|
|  0|
|  0|
|  1|
|  2|
|  3|
|  0|
|  1|
|  2|
+---+
{code}


> monotonicallyIncreasingId not monotonically increasing with downstream 
> coalesce
> ---
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jason Piper
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |25769803776|
> |51539607552|
> |77309411328|
> |   103079215104|
> |   128849018880|
> |   163208757248|
> |   188978561024|
> |   214748364800|
> |   240518168576|
> |   266287972352|
> +---+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |

[jira] [Updated] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce

2016-04-04 Thread Jason Piper (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Piper updated SPARK-14393:

Description: 
When utilising monotonicallyIncreasingId with a coalesce, it appears that every 
partition uses the same offset (0) leading to non-monotonically increasing IDs.

See examples below
```
>>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
+---+
|monotonicallyincreasingid()|
+---+
|25769803776|
|51539607552|
|77309411328|
|   103079215104|
|   128849018880|
|   163208757248|
|   188978561024|
|   214748364800|
|   240518168576|
|   266287972352|
+---+
```
>>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
+---+

>>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  1|
|  0|
|  0|
|  1|
|  2|
|  3|
|  0|
|  1|
|  2|
+---+

  was:
When utilising monotonicallyIncreasingId with a coalesce, it appears that every 
partition uses the same offset (0) leading to non-monotonically increasing IDs.

See examples below

>>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
+---+
|monotonicallyincreasingid()|
+---+
|25769803776|
|51539607552|
|77309411328|
|   103079215104|
|   128849018880|
|   163208757248|
|   188978561024|
|   214748364800|
|   240518168576|
|   266287972352|
+---+

>>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
+---+

>>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  1|
|  0|
|  0|
|  1|
|  2|
|  3|
|  0|
|  1|
|  2|
+---+


> monotonicallyIncreasingId not monotonically increasing with downstream 
> coalesce
> ---
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jason Piper
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> ```
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |25769803776|
> |51539607552|
> |77309411328|
> |   103079215104|
> |   128849018880|
> |   163208757248|
> |   188978561024|
> |   214748364800|
> |   240518168576|
> |   266287972352|
> +---+
> ```
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  0|
> |  

[jira] [Updated] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce

2016-04-04 Thread Jason Piper (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Piper updated SPARK-14393:

Description: 
When utilising monotonicallyIncreasingId with a coalesce, it appears that every 
partition uses the same offset (0) leading to non-monotonically increasing IDs.

See examples below

{code}
>>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
+---+
|monotonicallyincreasingid()|
+---+
|25769803776|
|51539607552|
|77309411328|
|   103079215104|
|   128849018880|
|   163208757248|
|   188978561024|
|   214748364800|
|   240518168576|
|   266287972352|
+---+
```
>>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
+---+

>>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  1|
|  0|
|  0|
|  1|
|  2|
|  3|
|  0|
|  1|
|  2|
+---+
{code}

  was:
When utilising monotonicallyIncreasingId with a coalesce, it appears that every 
partition uses the same offset (0) leading to non-monotonically increasing IDs.

See examples below
```
>>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
+---+
|monotonicallyincreasingid()|
+---+
|25769803776|
|51539607552|
|77309411328|
|   103079215104|
|   128849018880|
|   163208757248|
|   188978561024|
|   214748364800|
|   240518168576|
|   266287972352|
+---+
```
>>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
+---+

>>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  1|
|  0|
|  0|
|  1|
|  2|
|  3|
|  0|
|  1|
|  2|
+---+


> monotonicallyIncreasingId not monotonically increasing with downstream 
> coalesce
> ---
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jason Piper
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |25769803776|
> |51539607552|
> |77309411328|
> |   103079215104|
> |   128849018880|
> |   163208757248|
> |   188978561024|
> |   214748364800|
> |   240518168576|
> |   266287972352|
> +---+
> ```
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  

[jira] [Created] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce

2016-04-04 Thread Jason Piper (JIRA)
Jason Piper created SPARK-14393:
---

 Summary: monotonicallyIncreasingId not monotonically increasing 
with downstream coalesce
 Key: SPARK-14393
 URL: https://issues.apache.org/jira/browse/SPARK-14393
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Jason Piper


When utilising monotonicallyIncreasingId with a coalesce, it appears that every 
partition uses the same offset (0) leading to non-monotonically increasing IDs.

See examples below

>>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
+---+
|monotonicallyincreasingid()|
+---+
|25769803776|
|51539607552|
|77309411328|
|   103079215104|
|   128849018880|
|   163208757248|
|   188978561024|
|   214748364800|
|   240518168576|
|   266287972352|
+---+

>>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
+---+

>>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
+---+
|monotonicallyincreasingid()|
+---+
|  0|
|  1|
|  0|
|  0|
|  1|
|  2|
|  3|
|  0|
|  1|
|  2|
+---+
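
These values are consistent with an ID layout that packs the partition index into the 
upper bits and a per-partition row counter into the lower 33 bits, which is why the 
first listing shows multiples of 2^33 and why the output degenerates to small counters 
once the partition offset is lost after the coalesce. A rough illustrative sketch of 
that layout (not the actual expression implementation):

{code}
// Illustrative only -- not the real MonotonicallyIncreasingID code.
// Each generated value is roughly (partitionIndex << 33) + rowIndexWithinPartition.
def illustrativeId(partitionIndex: Int, rowIndexWithinPartition: Long): Long =
  (partitionIndex.toLong << 33) + rowIndexWithinPartition

illustrativeId(3, 0)  // 25769803776, the first value in the listing above
illustrativeId(0, 0)  // 0, what every row collapses to once the offset is lost
{code}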



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-04 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225263#comment-15225263
 ] 

Hyukjin Kwon edited comment on SPARK-14103 at 4/5/16 12:45 AM:
---

Oh, sorry, I should have mentioned that it reads all the data (including line 
separators), regardless of line separators, once it meets a quote character that is 
not closed.

In {{BulkCsvReader}}, it sort of uses a {{Reader}} converted from an {{Iterator}}, 
meaning the data is not processed line by line from the point of view of the Univocity 
parser.

If this were processed with an {{Iterator}} with each line as an input, it would be 
just as you said; but it is processed with a {{Reader}} with the whole data as input. 
So it even ignores line separators as well as delimiters, which ends up reading all 
the data after a quote as a single value.

-Actually, this is one of the reasons why I am thinking of changing this library to 
Apache's. It seems Univocity only takes input as a {{Reader}} whereas Apache's takes a 
{{String}}, which can be easily produced from an {{Iterator}} (as far as I remember).-
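
For readers unfamiliar with that conversion, here is a rough sketch of the idea 
(illustrative only, not the actual {{BulkCsvReader}} code; the class name is made up): 
the parser sees one continuous character stream with line separators re-inserted, 
rather than one String per line.

{code}
// Illustrative Iterator-to-Reader adapter, not the actual BulkCsvReader code.
import java.io.Reader

class IteratorReader(lines: Iterator[String], sep: String = "\n") extends Reader {
  private var current: String = ""
  private var pos = 0

  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    if (pos >= current.length && lines.hasNext) {
      current = lines.next() + sep   // line separator manually re-inserted
      pos = 0
    }
    if (pos >= current.length) {
      -1                             // no more lines: end of stream
    } else {
      val n = math.min(len, current.length - pos)
      current.getChars(pos, pos + n, cbuf, off)
      pos += n
      n
    }
  }

  override def close(): Unit = ()
}
{code}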


was (Author: hyukjin.kwon):
Oh, sorry, I should have mentioned that it reads all the data (including line 
separators), regardless of line separators, once it meets a quote character that is 
not closed.

In {{BulkCsvReader}}, it sort of uses a {{Reader}} converted from an {{Iterator}}, 
meaning the data is not processed line by line from the point of view of the Univocity 
parser.

If this were processed with an {{Iterator}} with each line as an input, it would be 
just as you said; but it is processed with a {{Reader}} with the whole data as input. 
So it even ignores line separators as well as delimiters, which ends up reading all 
the data after a quote as a single value.

Actually, this is one of the reasons why I am thinking of changing this library to 
Apache's. It seems Univocity only takes input as a {{Reader}} whereas Apache's takes a 
{{String}}, which can be easily produced from an {{Iterator}} (as far as I remember).

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following 
> command on a large tab-separated file, I get the contents of the file 
> written to stderr.
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
>

[jira] [Commented] (SPARK-13456) Cannot create encoders for case classes defined in Spark shell after upgrading to Scala 2.11

2016-04-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225389#comment-15225389
 ] 

Yin Huai commented on SPARK-13456:
--

[~rxin] This is the repl issue.

> Cannot create encoders for case classes defined in Spark shell after 
> upgrading to Scala 2.11
> 
>
> Key: SPARK-13456
> URL: https://issues.apache.org/jira/browse/SPARK-13456
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Spark 2.0 started to use Scala 2.11 by default since [PR 
> #10608|https://github.com/apache/spark/pull/10608].  Unfortunately, after 
> this upgrade, Spark fails to create encoders for case classes defined in REPL:
> {code}
> import sqlContext.implicits._
> case class T(a: Int, b: Double)
> val ds = Seq(1 -> T(1, 1D), 2 -> T(2, 2D)).toDS()
> {code}
> Exception thrown:
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to generate an encoder for 
> inner class `T` without access to the scope that this class was defined in.
> Try moving this class out of its parent class.;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:565)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:561)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:304)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:353)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:333)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collecti

[jira] [Commented] (SPARK-13456) Cannot create encoders for case classes defined in Spark shell after upgrading to Scala 2.11

2016-04-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225390#comment-15225390
 ] 

Yin Huai commented on SPARK-13456:
--

The original jira for this should be 
https://issues.apache.org/jira/browse/SPARK-1199.

> Cannot create encoders for case classes defined in Spark shell after 
> upgrading to Scala 2.11
> 
>
> Key: SPARK-13456
> URL: https://issues.apache.org/jira/browse/SPARK-13456
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Spark 2.0 started to use Scala 2.11 by default since [PR 
> #10608|https://github.com/apache/spark/pull/10608].  Unfortunately, after 
> this upgrade, Spark fails to create encoders for case classes defined in REPL:
> {code}
> import sqlContext.implicits._
> case class T(a: Int, b: Double)
> val ds = Seq(1 -> T(1, 1D), 2 -> T(2, 2D)).toDS()
> {code}
> Exception thrown:
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to generate an encoder for 
> inner class `T` without access to the scope that this class was defined in.
> Try moving this class out of its parent class.;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:565)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:561)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:304)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:353)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:333)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIter

[jira] [Comment Edited] (SPARK-13456) Cannot create encoders for case classes defined in Spark shell after upgrading to Scala 2.11

2016-04-04 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223758#comment-15223758
 ] 

Wenchen Fan edited comment on SPARK-13456 at 4/5/16 12:20 AM:
--

I found a minimal case to reproduce this issue:
{code}
scala> :pa
// Entering paste mode (ctrl-D to finish)

case class Data(i: Int)
val d = Data(1)

// Exiting paste mode, now interpreting.

defined class Data
d: Data = Data@1a536164

scala> val d2: Data = d
:28: error: type mismatch;
 found   : Data
 required: Data
   val d2: Data = d
  ^
{code}

It's not related to our encoder framework, but looks like a fundamental problem 
in the Spark Shell. Looking into it.


was (Author: cloud_fan):
I found a minimal case to reproduce this issue:
{code}
scala> class Wrapper[T](t: T)
defined class Wrapper

scala> :pa
// Entering paste mode (ctrl-D to finish)

case class Data(i: Int)
val w = new Wrapper(Data(1))

// Exiting paste mode, now interpreting.

defined class Data
w: Wrapper[Data] = Wrapper@1a536164

scala> val w2: Wrapper[Data] = w
:28: error: type mismatch;
 found   : Wrapper[Data]
 required: Wrapper[Data]
   val w2: Wrapper[Data] = w
  ^
{code}

It's not related to our encoder framework, but looks like a fundamental problem 
in the Spark Shell. Looking into it.

> Cannot create encoders for case classes defined in Spark shell after 
> upgrading to Scala 2.11
> 
>
> Key: SPARK-13456
> URL: https://issues.apache.org/jira/browse/SPARK-13456
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Spark 2.0 started to use Scala 2.11 by default since [PR 
> #10608|https://github.com/apache/spark/pull/10608].  Unfortunately, after 
> this upgrade, Spark fails to create encoders for case classes defined in REPL:
> {code}
> import sqlContext.implicits._
> case class T(a: Int, b: Double)
> val ds = Seq(1 -> T(1, 1D), 2 -> T(2, 2D)).toDS()
> {code}
> Exception thrown:
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to generate an encoder for 
> inner class `T` without access to the scope that this class was defined in.
> Try moving this class out of its parent class.;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:565)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:561)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:304)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:353)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:333)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.appl

[jira] [Updated] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit

2016-04-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14087:
--
Target Version/s:   (was: 2.0.0)

> PySpark ML JavaModel does not properly own params after being fit
> -
>
> Key: SPARK-14087
> URL: https://issues.apache.org/jira/browse/SPARK-14087
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Attachments: feature.py
>
>
> When a PySpark model is created after fitting data, its UID is initialized to 
> the parent estimator's value.  Before this assignment, any params defined in 
> the model are copied from the object to the class in 
> {{Params._copy_params()}} and assigned a different parent UID.  This causes 
> PySpark to think the params are not owned by the model and can lead to a 
> {{ValueError}} raised from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file feature.py for the WIP.
> Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
> produces the error above.
> {noformat}
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
> [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", 
> "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)
> for p in model.params:
>   print(str(p))
> model.transform(df).show(truncate=False)
> {noformat}
> output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation of this, the model overrides the UID value, which 
> the Params use when they are constructed, so they all end up with the parent 
> estimator UID.  
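
For comparison, a self-contained sketch of the Scala-side pattern described in the 
last paragraph (the Estimator/Model/Param classes below are simplified stand-ins, not 
the real spark.ml types):

{code}
// Simplified stand-ins for the real spark.ml classes, to illustrate the pattern:
// the model is constructed with the parent estimator's uid, so every Param built
// against it reports that uid as its parent.
case class Param(parentUid: String, name: String)

class Model(val uid: String) {
  val outputCol: Param = Param(uid, "outputCol")
}

class Estimator(val uid: String) {
  def fit(): Model = new Model(this.uid)  // the model inherits the estimator's uid
}

val model = new Estimator("CountVectorizer_4c8e9fd539542d783e66").fit()
assert(model.outputCol.parentUid == model.uid)  // uids match, unlike the PySpark output above
{code}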



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel

2016-04-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13048:
--
Target Version/s: 2.0.0

I'll send a PR for this.  I'd prefer to fix this in 2.0 only since it will 
require a public API change (adding a Param saying not to delete the last 
checkpoint).
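
As a rough illustration of that direction (the method name and placement are 
hypothetical, not the merged API), such a change could sit next to 
{{deleteAllCheckpoints()}} in the {{PeriodicCheckpointer}} code quoted below, keeping 
only the newest checkpoint alive for the returned model:

{noformat}
  /**
   * Hypothetical variant of deleteAllCheckpoints(): delete all but the most recent
   * checkpoint, so the returned DistributedLDAModel can still read it.
   * checkpointQueue is dequeued oldest-first, so stopping at size 1 keeps only the
   * newest checkpoint on disk.
   */
  def deleteAllCheckpointsButLast(): Unit = {
    while (checkpointQueue.size > 1) {
      removeCheckpointFile()
    }
  }
{noformat}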

> EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
> --
>
> Key: SPARK-13048
> URL: https://issues.apache.org/jira/browse/SPARK-13048
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
> Environment: Standalone Spark cluster
>Reporter: Jeff Stein
>Assignee: Joseph K. Bradley
>
> In EMLDAOptimizer, all checkpoints are deleted before returning the 
> DistributedLDAModel.
> The most recent checkpoint is still necessary for operations on the 
> DistributedLDAModel under a couple scenarios:
> - The graph doesn't fit in memory on the worker nodes (e.g. very large data 
> set).
> - Late worker failures that require reading the now-dependent checkpoint.
> I ran into this problem running a 10M record LDA model in a memory starved 
> environment. The model consistently failed in either the {{collect at 
> LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the 
> {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the 
> model). In both cases, a FileNotFoundException is thrown attempting to access 
> a checkpoint file.
> I'm not sure what the correct fix is here; it might involve a class signature 
> change. An alternative simple fix is to leave the last checkpoint around and 
> expect the user to clean the checkpoint directory themselves.
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071
> {noformat}
> Relevant code is included below.
> LDAOptimizer.scala:
> {noformat}
>   override private[clustering] def getLDAModel(iterationTimes: 
> Array[Double]): LDAModel = {
> require(graph != null, "graph is null, EMLDAOptimizer not initialized.")
> this.graphCheckpointer.deleteAllCheckpoints()
> // The constructor's default arguments assume gammaShape = 100 to ensure 
> equivalence in
> // LDAModel.toLocal conversion
> new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, 
> this.vocabSize,
>   Vectors.dense(Array.fill(this.k)(this.docConcentration)), 
> this.topicConcentration,
>   iterationTimes)
>   }
> {noformat}
> PeriodicCheckpointer.scala
> {noformat}
>   /**
>* Call this at the end to delete any remaining checkpoint files.
>*/
>   def deleteAllCheckpoints(): Unit = {
> while (checkpointQueue.nonEmpty) {
>   removeCheckpointFile()
> }
>   }
>   /**
>* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files.
>* This prints a warning but does not fail if the files cannot be removed.
>*/
>   private def removeCheckpointFile(): Unit = {
> val old = checkpointQueue.dequeue()
> // Since the old checkpoint is not deleted by Spark, we manually delete 
> it.
> val fs = FileSystem.get(sc.hadoopConfiguration)
> getCheckpointFiles(old).foreach { checkpointFile =>
>   try {
> fs.delete(new Path(checkpointFile), true)
>   } catch {
> case e: Exception =>
>   logWarning("PeriodicCheckpointer could not remove old checkpoint 
> file: " +
> checkpointFile)
>   }
> }
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel

2016-04-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-13048:
-

Assignee: Joseph K. Bradley

> EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
> --
>
> Key: SPARK-13048
> URL: https://issues.apache.org/jira/browse/SPARK-13048
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
> Environment: Standalone Spark cluster
>Reporter: Jeff Stein
>Assignee: Joseph K. Bradley
>
> In EMLDAOptimizer, all checkpoints are deleted before returning the 
> DistributedLDAModel.
> The most recent checkpoint is still necessary for operations on the 
> DistributedLDAModel under a couple scenarios:
> - The graph doesn't fit in memory on the worker nodes (e.g. very large data 
> set).
> - Late worker failures that require reading the now-dependent checkpoint.
> I ran into this problem running a 10M record LDA model in a memory starved 
> environment. The model consistently failed in either the {{collect at 
> LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the 
> {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the 
> model). In both cases, a FileNotFoundException is thrown attempting to access 
> a checkpoint file.
> I'm not sure what the correct fix is here; it might involve a class signature 
> change. An alternative simple fix is to leave the last checkpoint around and 
> expect the user to clean the checkpoint directory themselves.
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071
> {noformat}
> Relevant code is included below.
> LDAOptimizer.scala:
> {noformat}
>   override private[clustering] def getLDAModel(iterationTimes: 
> Array[Double]): LDAModel = {
> require(graph != null, "graph is null, EMLDAOptimizer not initialized.")
> this.graphCheckpointer.deleteAllCheckpoints()
> // The constructor's default arguments assume gammaShape = 100 to ensure 
> equivalence in
> // LDAModel.toLocal conversion
> new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, 
> this.vocabSize,
>   Vectors.dense(Array.fill(this.k)(this.docConcentration)), 
> this.topicConcentration,
>   iterationTimes)
>   }
> {noformat}
> PeriodicCheckpointer.scala
> {noformat}
>   /**
>* Call this at the end to delete any remaining checkpoint files.
>*/
>   def deleteAllCheckpoints(): Unit = {
> while (checkpointQueue.nonEmpty) {
>   removeCheckpointFile()
> }
>   }
>   /**
>* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files.
>* This prints a warning but does not fail if the files cannot be removed.
>*/
>   private def removeCheckpointFile(): Unit = {
> val old = checkpointQueue.dequeue()
> // Since the old checkpoint is not deleted by Spark, we manually delete 
> it.
> val fs = FileSystem.get(sc.hadoopConfiguration)
> getCheckpointFiles(old).foreach { checkpointFile =>
>   try {
> fs.delete(new Path(checkpointFile), true)
>   } catch {
> case e: Exception =>
>   logWarning("PeriodicCheckpointer could not remove old checkpoint 
> file: " +
> checkpointFile)
>   }
> }
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13326) Dataset in spark 2.0.0-SNAPSHOT missing columns

2016-04-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13326.
-
Resolution: Later

See SPARK-14155.

We will re-introduce UDT in Spark 2.1 to make it work with Datasets. This will 
require some design work.


> Dataset in spark 2.0.0-SNAPSHOT missing columns
> ---
>
> Key: SPARK-13326
> URL: https://issues.apache.org/jira/browse/SPARK-13326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: koert kuipers
>Priority: Minor
>
> I noticed some things stopped working on Datasets in Spark 2.0.0-SNAPSHOT, 
> with a confusing error message (cannot resolve some column, given input 
> columns []).
> for example in 1.6.0-SNAPSHOT:
> {noformat}
> scala> val ds = sc.parallelize(1 to 10).toDS
> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
> scala> ds.map(x => Option(x))
> res0: org.apache.spark.sql.Dataset[Option[Int]] = [value: int]
> {noformat}
> and same commands in 2.0.0-SNAPSHOT:
> {noformat}
> scala> val ds = sc.parallelize(1 to 10).toDS
> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
> scala> ds.map(x => Option(x))
> org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input 
> columns: [];
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:283)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:162)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:172)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:176)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:176)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:181)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:121)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnaly

[jira] [Commented] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit

2016-04-04 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225326#comment-15225326
 ] 

Bryan Cutler commented on SPARK-14087:
--

I don't think this would completely solve it; please see my comment in the PR.

> PySpark ML JavaModel does not properly own params after being fit
> -
>
> Key: SPARK-14087
> URL: https://issues.apache.org/jira/browse/SPARK-14087
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Attachments: feature.py
>
>
> When a PySpark model is created after fitting data, its UID is initialized to 
> the parent estimator's value.  Before this assignment, any params defined in 
> the model are copied from the object to the class in 
> {{Params._copy_params()}} and assigned a different parent UID.  This causes 
> PySpark to think the params are not owned by the model and can lead to a 
> {{ValueError}} raised from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file feature.py for the WIP.
> Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
> produces the error above.
> {noformat}
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
> [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", 
> "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)
> for p in model.params:
>   print(str(p))
> model.transform(df).show(truncate=False)
> {noformat}
> output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation of this, the model overrides the UID value, which 
> the Params use when they are constructed, so they all end up with the parent 
> estimator UID.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14366) Remove SBT-Idea plugin

2016-04-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-14366.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12151
[https://github.com/apache/spark/pull/12151]

> Remove SBT-Idea plugin
> --
>
> Key: SPARK-14366
> URL: https://issues.apache.org/jira/browse/SPARK-14366
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Joan Goyeau
>Assignee: Luciano Resende
>Priority: Trivial
> Fix For: 2.0.0
>
>
> We should remove this deprecated sbt-idea plugin, as it generates outdated 
> IDEA project files.
> IDEA displays a warning asking you to load the SBT project directly in IDEA instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14366) Remove SBT-Idea plugin

2016-04-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-14366:
---
Assignee: Luciano Resende

> Remove SBT-Idea plugin
> --
>
> Key: SPARK-14366
> URL: https://issues.apache.org/jira/browse/SPARK-14366
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Joan Goyeau
>Assignee: Luciano Resende
>Priority: Trivial
> Fix For: 2.0.0
>
>
> We should remove this deprecated sbt-idea plugin, as it generates outdated 
> IDEA project files.
> IDEA displays a warning asking you to load the SBT project directly in IDEA instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11157) Allow Spark to be built without assemblies

2016-04-04 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225322#comment-15225322
 ] 

Josh Rosen commented on SPARK-11157:


I just merged the final patch for this, so I'm marking this as fixed. If anyone 
runs into new bugs related to this, please link them here.

> Allow Spark to be built without assemblies
> --
>
> Key: SPARK-11157
> URL: https://issues.apache.org/jira/browse/SPARK-11157
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Spark Core, YARN
>Reporter: Marcelo Vanzin
> Fix For: 2.0.0
>
> Attachments: no-assemblies.pdf
>
>
> For reasoning, discussion of pros and cons, and other more detailed 
> information, please see attached doc.
> The idea is to be able to build a Spark distribution that has just a 
> directory full of jars instead of the huge assembly files we currently have.
> Getting there requires changes in a bunch of places; I'll try to list the 
> ones I identified in the document, in the order that I think is needed to 
> avoid breaking things:
> * make streaming backends not be assemblies
> Since people may depend on the current assembly artifacts in their 
> deployments, we can't really remove them; but we can make them be dummy jars 
> and rely on dependency resolution to download all the jars.
> PySpark tests would also need some tweaking here.
> * make examples jar not be an assembly
> Probably requires tweaks to the {{run-example}} script. The location of the 
> examples jar would have to change (it won't be able to live in the same place 
> as the main Spark jars anymore).
> * update YARN backend to handle a directory full of jars when launching apps
> Currently YARN localizes the Spark assembly (depending on the user 
> configuration); it needs to be modified so that it can localize all needed 
> libraries instead of a single jar.
> * Modify launcher library to handle the jars directory
> This should be trivial
> * Modify {{assembly/pom.xml}} to generate assembly or a {{libs}} directory 
> depending on which profile is enabled.
> We should keep the option to build with the assembly on by default, for 
> backwards compatibility, to give people time to prepare.
> Filing this bug as an umbrella; please file sub-tasks if you plan to work on 
> a specific part of the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13808) Don't build assembly in dev/run-tests

2016-04-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-13808.

   Resolution: Fixed
 Assignee: Marcelo Vanzin  (was: Josh Rosen)
Fix Version/s: 2.0.0

This was done as part of SPARK-13579.

> Don't build assembly in dev/run-tests 
> --
>
> Key: SPARK-13808
> URL: https://issues.apache.org/jira/browse/SPARK-13808
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core, YARN
>Reporter: Josh Rosen
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> As of SPARK-9284 we should no longer need to build the full Spark assembly 
> JAR in order to run tests. Therefore, we should remove the assembly step from 
> {{dev/run-tests}} in order to reduce build + test time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11157) Allow Spark to be built without assemblies

2016-04-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-11157.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Allow Spark to be built without assemblies
> --
>
> Key: SPARK-11157
> URL: https://issues.apache.org/jira/browse/SPARK-11157
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Spark Core, YARN
>Reporter: Marcelo Vanzin
> Fix For: 2.0.0
>
> Attachments: no-assemblies.pdf
>
>
> For reasoning, discussion of pros and cons, and other more detailed 
> information, please see attached doc.
> The idea is to be able to build a Spark distribution that has just a 
> directory full of jars instead of the huge assembly files we currently have.
> Getting there requires changes in a bunch of places, I'll try to list the 
> ones I identified in the document, in the order that I think would be needed 
> to not break things:
> * make streaming backends not be assemblies
> Since people may depend on the current assembly artifacts in their 
> deployments, we can't really remove them; but we can make them be dummy jars 
> and rely on dependency resolution to download all the jars.
> PySpark tests would also need some tweaking here.
> * make examples jar not be an assembly
> Probably requires tweaks to the {{run-example}} script. The location of the 
> examples jar would have to change (it won't be able to live in the same place 
> as the main Spark jars anymore).
> * update YARN backend to handle a directory full of jars when launching apps
> Currently YARN localizes the Spark assembly (depending on the user 
> configuration); it needs to be modified so that it can localize all needed 
> libraries instead of a single jar.
> * Modify launcher library to handle the jars directory
> This should be trivial
> * Modify {{assembly/pom.xml}} to generate assembly or a {{libs}} directory 
> depending on which profile is enabled.
> We should keep the option to build with the assembly on by default, for 
> backwards compatibility, to give people time to prepare.
> Filing this bug as an umbrella; please file sub-tasks if you plan to work on 
> a specific part of the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13579) Stop building assemblies for Spark

2016-04-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-13579.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11796
[https://github.com/apache/spark/pull/11796]

> Stop building assemblies for Spark
> --
>
> Key: SPARK-13579
> URL: https://issues.apache.org/jira/browse/SPARK-13579
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> See parent bug for more details. This change needs to wait for the other 
> sub-tasks to be finished, so that the code knows what to do when there's only 
> a bunch of jars to work with.
> This should cover both maven and sbt builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-04 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225305#comment-15225305
 ] 

Hyukjin Kwon edited comment on SPARK-14103 at 4/4/16 11:52 PM:
---

Just to cut it short, the input is being read as a byte stream, byte by byte, 
across every line produced by {{Iterator}}, with manually inserted line 
separators.


was (Author: hyukjin.kwon):
Just to cut it short, the input is being read as a byte stream, bytes by bytes 
from each line produced by {{Iterator}} with manually inserted line separators.

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using the spark from the master branch and when I run the following 
> command on a large tab separated file then I get the contents of the file 
> being written to the stderr
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecut

[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-04 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225305#comment-15225305
 ] 

Hyukjin Kwon commented on SPARK-14103:
--

Just to cut it short, the input is being read as a byte stream, byte by byte, 
from each line produced by {{Iterator}}, with manually inserted line separators.
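
To make that wrapping concrete, here is a minimal Scala sketch of exposing an 
{{Iterator[String]}} of lines to a {{Reader}}-based parser as one continuous 
character stream. It is a hypothetical helper written for illustration only, 
not the actual class in {{CSVParser.scala}}:

{code}
import java.io.Reader

// Hypothetical helper (illustration only): turn an Iterator[String] into a
// character stream, re-adding the '\n' separators the iterator has stripped,
// so a Reader-based parser sees the whole input as a single stream.
class LineIteratorReader(lines: Iterator[String]) extends Reader {
  private var buf: String = ""
  private var pos: Int = 0

  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    while (pos >= buf.length && lines.hasNext) {
      buf = lines.next() + "\n"   // manually inserted line separator
      pos = 0
    }
    if (pos >= buf.length) {
      -1                          // no more lines: end of stream
    } else {
      val n = math.min(len, buf.length - pos)
      buf.getChars(pos, pos + n, cbuf, off)
      pos += n
      n
    }
  }

  override def close(): Unit = ()
}

// Usage: the parser now reads "a\tb\nc\td\n" character by character.
// val reader = new LineIteratorReader(Iterator("a\tb", "c\td"))
{code}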

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using the spark from the master branch and when I run the following 
> command on a large tab separated file then I get the contents of the file 
> being written to the stderr
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; 
> aborting job
>

[jira] [Commented] (SPARK-14256) Remove parameter sqlContext from as.DataFrame

2016-04-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225296#comment-15225296
 ] 

Felix Cheung commented on SPARK-14256:
--

This was discussed previously, and it was suggested not to omit the sqlContext 
parameter.

> Remove parameter sqlContext from as.DataFrame
> -
>
> Key: SPARK-14256
> URL: https://issues.apache.org/jira/browse/SPARK-14256
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>
> Currently, the user is required to pass the sqlContext parameter to both 
> createDataFrame and as.DataFrame. Since the sqlContext is a global singleton, 
> this parameter should be optional in the signature of as.DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-04 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225263#comment-15225263
 ] 

Hyukjin Kwon edited comment on SPARK-14103 at 4/4/16 11:37 PM:
---

Oh, sorry, I should have mentioned that once it meets a quote character that 
never closes, it reads all the remaining data (line separators included), 
regardless of any line separator.

In {{BulkCsvReader}}, it essentially uses a {{Reader}} converted from an 
{{Iterator}}, meaning that from the point of view of the Univocity parser the 
data is not processed line by line.

If this were processed with an {{Iterator}}, with each line as an input, then it 
would be just as you said; but it is processed with a {{Reader}}, with the whole 
data as input. So it even ignores line separators as well as delimiters, and 
ends up reading all the data after the unclosed quote as a single value.

(Actually, this is one of the reasons why I am thinking of changing this library 
to Apache's. It seems Univocity only takes input as a {{Reader}}, whereas 
Apache's takes a {{String}}, which can easily be produced from an {{Iterator}}, 
as far as I remember.)
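
To see why handing the whole stream to a quote-aware parser matters, here is a 
small self-contained Scala sketch (plain Scala, not Spark or Univocity code): a 
scanner that treats {{\n}} as a record boundary only outside quotes, which is 
how a stream-based CSV parser behaves, swallows every following line separator 
once a quote is left unclosed.

{code}
object UnclosedQuoteDemo {
  // Split the input into records, treating '\n' as a boundary only while
  // outside quotes (the behaviour described in the comment above).
  def records(input: String): Seq[String] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[String]
    val cur = new StringBuilder
    var inQuote = false
    input.foreach {
      case '"'              => inQuote = !inQuote; cur.append('"')
      case '\n' if !inQuote => out += cur.toString; cur.clear()
      case c                => cur.append(c)
    }
    if (cur.nonEmpty) out += cur.toString
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    println(records("a\tb\nc\td\n").size)     // 2: one record per line
    println(records("a\t\"b\nc\td\n").size)   // 1: the unclosed quote swallows the '\n'
  }
}
{code}

Parsing each line as its own {{String}} would bound the damage to a single 
record, which is the contrast with Apache's parser drawn above.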


was (Author: hyukjin.kwon):
Oh, sorry I should have mentioned that it reads all the data (including roe 
separator) regardless of a line separator once it meets a quote character which 
does not end.

In {{BulkCsvReader}}, it sort of uses a {{Reader}} converted from {{Iterator}}, 
meaning it processes data not line by line in the point of Univocity parser.

If this were processed with {{Iterator}} with each line as a input, then it 
would be just like you said but it is processed with {{Reader}} with whole data 
as input. So, this even ignores line separators as well as delimiters which 
ends up reading whole data after a quote as a value.

(Actually this is one of the reasons why I am thinkng changing this library to 
Apache's. It seems Univocity only takes input {{Reader}} whereas Apache's takes 
{{String}} (as far as I remember).

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using the spark from the master branch and when I run the following 
> command on a large tab separated file then I get the contents of the file 
> being written to the stderr
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.Bu

[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-04 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225263#comment-15225263
 ] 

Hyukjin Kwon edited comment on SPARK-14103 at 4/4/16 11:34 PM:
---

Oh, sorry, I should have mentioned that once it meets a quote character that 
never closes, it reads all the remaining data (line separators included), 
regardless of any line separator.

In {{BulkCsvReader}}, it essentially uses a {{Reader}} converted from an 
{{Iterator}}, meaning that from the point of view of the Univocity parser the 
data is not processed line by line.

If this were processed with an {{Iterator}}, with each line as an input, then it 
would be just as you said; but it is processed with a {{Reader}}, with the whole 
data as input. So it even ignores line separators as well as delimiters, and 
ends up reading all the data after the unclosed quote as a single value.

(Actually, this is one of the reasons why I am thinking of changing this library 
to Apache's. It seems Univocity only takes input as a {{Reader}}, whereas 
Apache's takes a {{String}}, as far as I remember.)


was (Author: hyukjin.kwon):
Oh, sorry I should have mentioned that it reads all the data (including roe 
separator) regardless of a line separator once it meets a quote character which 
does not end.

In {{BulkCsvReader}}, it sort of uses a {{Reader}} converted from {{Iterator}}, 
meaning it processes data not line by line in the point of Univocity parser.

If this were processed with {{Iterator}} with each line as a input, then it 
would be just like you said but it is processed with {{Reader}} with whole data 
as input. So, this even ignores line separators as well as delimiters which 
ends up reading whole data after a quote as a value.

(Actually this is one of the reasons why I am thinkng changing this library to 
Apache's)

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using the spark from the master branch and when I run the following 
> command on a large tab separated file then I get the contents of the file 
> being written to the stderr
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
> org.apache.spark.sql.execution.da

[jira] [Commented] (SPARK-14392) CountVectorizer Estimator should include binary toggle Param

2016-04-04 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225277#comment-15225277
 ] 

Miao Wang commented on SPARK-14392:
---

I will work on this one.

Thanks!

Miao

> CountVectorizer Estimator should include binary toggle Param
> 
>
> Key: SPARK-14392
> URL: https://issues.apache.org/jira/browse/SPARK-14392
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> CountVectorizerModel contains a "binary" toggle Param.  The Estimator should 
> contain it as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14388) Create Table and Drop table

2016-04-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14388:
-
Description: For now, we still ask Hive to handle creating Hive tables and 
dropping tables. We should handle these operations ourselves.

> Create Table and Drop table
> ---
>
> Key: SPARK-14388
> URL: https://issues.apache.org/jira/browse/SPARK-14388
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> For now, we still ask Hive to handle creating Hive tables and dropping 
> tables. We should handle these operations ourselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-04 Thread Steve Johnston (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225268#comment-15225268
 ] 

Steve Johnston commented on SPARK-14389:


Related to [What influences the space complexity of Spark 
operations?|http://apache-spark-developers-list.1001551.n3.nabble.com/What-influences-the-space-complexity-of-Spark-operations-tp16944.html]

> OOM during BroadcastNestedLoopJoin
> --
>
> Key: SPARK-14389
> URL: https://issues.apache.org/jira/browse/SPARK-14389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0
> Hadoop: Amazon 2.7.1
> Spark 1.6.0
> Ganglia 3.7.2
> Master: m3.xlarge
> Core: m3.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
>Reporter: Steve Johnston
> Attachments: lineitem.tbl, sample_script.py, stdout.txt
>
>
> When executing attached sample_script.py in client mode with a single 
> executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", 
> during the self join of a small table, TPC-H lineitem generated for a 1M 
> dataset. Also see execution log stdout.txt attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-04 Thread Steve Johnston (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225265#comment-15225265
 ] 

Steve Johnston commented on SPARK-14389:


The sample script, data, etc. are contrived in order to demonstrate the problem. 
This OOM occurs at various data sizes, query complexities, and cluster 
configurations. We mostly see it as a deviation in our experiments.

> OOM during BroadcastNestedLoopJoin
> --
>
> Key: SPARK-14389
> URL: https://issues.apache.org/jira/browse/SPARK-14389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0
> Hadoop: Amazon 2.7.1
> Spark 1.6.0
> Ganglia 3.7.2
> Master: m3.xlarge
> Core: m3.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
>Reporter: Steve Johnston
> Attachments: lineitem.tbl, sample_script.py, stdout.txt
>
>
> When executing attached sample_script.py in client mode with a single 
> executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", 
> during the self join of a small table, TPC-H lineitem generated for a 1M 
> dataset. Also see execution log stdout.txt attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-04 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225263#comment-15225263
 ] 

Hyukjin Kwon commented on SPARK-14103:
--

Oh, sorry, I should have mentioned that once it meets a quote character that 
never closes, it reads all the remaining data (line separators included), 
regardless of any line separator.

In {{BulkCsvReader}}, it essentially uses a {{Reader}} converted from an 
{{Iterator}}, meaning that from the point of view of the Univocity parser the 
data is not processed line by line.

If this were processed with an {{Iterator}}, with each line as an input, then it 
would be just as you said; but it is processed with a {{Reader}}, with the whole 
data as input. So it even ignores line separators as well as delimiters, and 
ends up reading all the data after the unclosed quote as a single value.

(Actually, this is one of the reasons why I am thinking of changing this library 
to Apache's.)

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using the spark from the master branch and when I run the following 
> command on a large tab separated file then I get the contents of the file 
> being written to the stderr
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.s

[jira] [Comment Edited] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-04-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225254#comment-15225254
 ] 

Joseph K. Bradley edited comment on SPARK-13629 at 4/4/16 11:29 PM:


I just realized that we should have added the binary toggle Param to 
CountVectorizer (the Estimator) as well.  (We need all Estimators to contain 
the Model Params so that users can configure the whole Pipeline/Estimator 
before running fit.)
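
As a concrete illustration of configuring the Estimator up front, here is a 
spark-shell style Scala sketch. It assumes the {{sqlContext}} provided by the 
shell, and note that {{setBinary}} on the Estimator is exactly what SPARK-14392 
proposes; at the time of this comment only the Model has it, so treat this as 
the intended usage rather than the current API:

{code}
import org.apache.spark.ml.feature.CountVectorizer

val df = sqlContext.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("label", "raw")

// Everything is configured on the Estimator before fit, so the whole
// Pipeline/Estimator is set up front and the fitted model inherits binary = true.
val cv = new CountVectorizer()
  .setInputCol("raw")
  .setOutputCol("vectors")
  .setBinary(true)   // proposed by SPARK-14392; not yet on the Estimator

val model = cv.fit(df)
model.transform(df).show(truncate = false)
{code}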


was (Author: josephkb):
I just realized that we should have added the binary toggle Param to 
CountVectorizer (the Estimator) as well.  (We need all Estimators to contain 
the Model Params so that users can configure the whole Pipeline/Estimator 
before running fit. I'll create a JIRA for that.)  I'll create and link a JIRA 
for this and HashingTF.

> Add binary toggle Param to CountVectorizer
> --
>
> Key: SPARK-13629
> URL: https://issues.apache.org/jira/browse/SPARK-13629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> It would be handy to add a binary toggle Param to CountVectorizer, as in the 
> scikit-learn one: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]
> If set, then all non-zero counts will be set to 1.
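
In other words, the toggle keeps term presence and drops term frequency. A tiny 
REPL-style Scala illustration of the rule, independent of any Spark API:

{code}
// A term-count vector becomes a 0/1 presence vector when the binary toggle is set.
val counts = Array(3.0, 0.0, 2.0, 1.0)
val binary = counts.map(c => if (c > 0) 1.0 else 0.0)   // Array(1.0, 0.0, 1.0, 1.0)
{code}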



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit

2016-04-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225260#comment-15225260
 ] 

Joseph K. Bradley commented on SPARK-14087:
---

I believe [SPARK-14392] is the right way to solve this issue. [~bryanc], could 
you please confirm this?

> PySpark ML JavaModel does not properly own params after being fit
> -
>
> Key: SPARK-14087
> URL: https://issues.apache.org/jira/browse/SPARK-14087
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Attachments: feature.py
>
>
> When a PySpark model is created after fitting data, its UID is initialized to 
> the parent estimator's value.  Before this assignment, any params defined in 
> the model are copied from the object to the class in 
> {{Params._copy_params()}} and assigned a different parent UID.  This causes 
> PySpark to think the params are not owned by the model and can lead to a 
> {{ValueError}} raised from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file feature.py for the WIP.
> Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
> produces the error above.
> {noformat}
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
> [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", 
> "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)
> for p in model.params:
>   print(str(p))
> model.transform(df).show(truncate=False)
> {noformat}
> output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation of this, the model overrides the UID value, which 
> the Params use when they are constructed, so they all end up with the parent 
> estimator UID.  
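
To make the uid pattern in the last paragraph concrete, here is a self-contained, 
REPL-style Scala mock (no Spark classes; all names are invented for 
illustration): params remember the uid of the object that constructs them, so 
the model must receive the estimator's uid before any of its params are created.

{code}
// Purely illustrative mock of the ownership rule, not Spark ML code.
case class MockParam(parentUid: String, name: String)

class MockModel(val uid: String) {
  // Constructed with the model's final uid, so ownership checks pass.
  val outputCol: MockParam = MockParam(uid, "outputCol")
}

class MockEstimator(val uid: String = "CountVectorizer_mock") {
  // The Scala pattern: the estimator's uid is passed into the model's
  // constructor, so every param the model defines carries the parent's uid.
  def fit(): MockModel = new MockModel(this.uid)
}

val model = new MockEstimator().fit()
assert(model.outputCol.parentUid == model.uid)   // holds here; in the PySpark bug it did not
{code}

In the PySpark code path described above, the params are copied before the 
model's uid is overwritten with the parent's value, which is why the ownership 
check fails.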



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14392) CountVectorizer Estimator should include binary toggle Param

2016-04-04 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14392:
-

 Summary: CountVectorizer Estimator should include binary toggle 
Param
 Key: SPARK-14392
 URL: https://issues.apache.org/jira/browse/SPARK-14392
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


CountVectorizerModel contains a "binary" toggle Param.  The Estimator should 
contain it as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-04-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225254#comment-15225254
 ] 

Joseph K. Bradley commented on SPARK-13629:
---

I just realized that we should have added the binary toggle Param to 
CountVectorizer (the Estimator) as well.  (We need all Estimators to contain 
the Model Params so that users can configure the whole Pipeline/Estimator 
before running fit. I'll create a JIRA for that.)  I'll create and link a JIRA 
for this and HashingTF.

> Add binary toggle Param to CountVectorizer
> --
>
> Key: SPARK-13629
> URL: https://issues.apache.org/jira/browse/SPARK-13629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> It would be handy to add a binary toggle Param to CountVectorizer, as in the 
> scikit-learn one: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]
> If set, then all non-zero counts will be set to 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


