[jira] [Commented] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546598#comment-14546598 ] Murtaza Kanchwala commented on SPARK-7661: -- OK, I'll correct my terms. My case is exactly like this one: https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3c30d8e3e3-95db-492b-8b49-73a99d587...@gmail.com%3E The only difference is that my no. of Kinesis shards is updated by this utility provided by AWS, and when the no. of shards increases, my Spark Streaming consumer hangs and goes into a waiting state. > Support for dynamic allocation of executors in Kinesis Spark Streaming > -- > > Key: SPARK-7661 > URL: https://issues.apache.org/jira/browse/SPARK-7661 > Project: Spark > Issue Type: New Feature > Components: Streaming >Affects Versions: 1.3.1 > Environment: AWS-EMR >Reporter: Murtaza Kanchwala > > Currently the no. of cores is (N + 1), where N is no. of shards in a Kinesis > Stream. > My Requirement is that if I use this Resharding util for Amazon Kinesis : > Amazon Kinesis Resharding : > https://github.com/awslabs/amazon-kinesis-scaling-utils > Then there should be some way to allocate executors on the basis of no. of > shards directly (for Spark Streaming only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Murtaza Kanchwala updated SPARK-7661: - Description: Currently the no. of cores is (N + 1), where N is no. of shards in a Kinesis Stream. My Requirement is that if I use this Resharding util for Amazon Kinesis : Amazon Kinesis Resharding : https://github.com/awslabs/amazon-kinesis-scaling-utils Then there should be some way to allocate executors on the basis of no. of shards directly (for Spark Streaming only). was: Currently the logic for the no. of executors is (N + 1), where N is no. of shards in a Kinesis Stream. My Requirement is that if I use this Resharding util for Amazon Kinesis : Amazon Kinesis Resharding : https://github.com/awslabs/amazon-kinesis-scaling-utils Then there should be some way to allocate executors on the basis of no. of shards directly (for Spark Streaming only). > Support for dynamic allocation of executors in Kinesis Spark Streaming > -- > > Key: SPARK-7661 > URL: https://issues.apache.org/jira/browse/SPARK-7661 > Project: Spark > Issue Type: New Feature > Components: Streaming >Affects Versions: 1.3.1 > Environment: AWS-EMR >Reporter: Murtaza Kanchwala > > Currently the no. of cores is (N + 1), where N is no. of shards in a Kinesis > Stream. > My Requirement is that if I use this Resharding util for Amazon Kinesis : > Amazon Kinesis Resharding : > https://github.com/awslabs/amazon-kinesis-scaling-utils > Then there should be some way to allocate executors on the basis of no. of > shards directly (for Spark Streaming only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
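The (N + 1) sizing rule stated in the ticket can be sketched as a small helper. This is a minimal Python sketch, not Spark code: the function name and the cores_per_executor parameter are illustrative; only the "one core per shard receiver plus one for processing" rule comes from the ticket.

```python
def kinesis_streaming_cores(num_shards, cores_per_executor=1):
    """Estimate cores/executors for a Spark Streaming Kinesis job.

    Per the ticket, a stream with N shards needs N receiver cores plus
    at least one core for processing, i.e. N + 1 total.
    """
    required_cores = num_shards + 1
    # Round up to a whole number of executors.
    executors = -(-required_cores // cores_per_executor)
    return required_cores, executors

print(kinesis_streaming_cores(4))      # → (5, 5): 4 shards, 1-core executors
print(kinesis_streaming_cores(9, 4))   # → (10, 3): 9 shards, 4-core executors
```

After a reshard changes the shard count, such a calculation would have to be re-run and the allocation adjusted, which is exactly the dynamic-allocation gap the ticket describes.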
[jira] [Updated] (SPARK-7473) Use reservoir sample in RandomForest when choosing features per node
[ https://issues.apache.org/jira/browse/SPARK-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7473: - Assignee: Ai He > Use reservoir sample in RandomForest when choosing features per node > > > Key: SPARK-7473 > URL: https://issues.apache.org/jira/browse/SPARK-7473 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Ai He >Priority: Trivial > Fix For: 1.4.0 > > > See sampling in selectNodesToSplit method -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
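For context on the fix above: reservoir sampling picks k items uniformly from a stream in one pass without knowing its length up front, which is why it suits per-node feature selection. A minimal Python sketch of Algorithm R follows; it is illustrative, not Spark's selectNodesToSplit code.

```python
import random

def reservoir_sample(items, k, rng=random):
    """Return up to k items chosen uniformly at random in one pass (Algorithm R)."""
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

print(sorted(reservoir_sample(range(3), 5)))  # → [0, 1, 2] (fewer items than k)
```

Each incoming item ends up in the final sample with probability k/n, and memory stays O(k) regardless of stream length.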
[jira] [Commented] (SPARK-7681) Add SparseVector support for gemv with DenseMatrix
[ https://issues.apache.org/jira/browse/SPARK-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546561#comment-14546561 ] Apache Spark commented on SPARK-7681: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/6209 > Add SparseVector support for gemv with DenseMatrix > -- > > Key: SPARK-7681 > URL: https://issues.apache.org/jira/browse/SPARK-7681 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Liang-Chi Hsieh > > Current gemv only works on DenseVector (with DenseMatrix and SparseMatrix). > This ticket is proposed to add SparseVector support for DenseMatrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7681) Add SparseVector support for gemv with DenseMatrix
[ https://issues.apache.org/jira/browse/SPARK-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7681: --- Assignee: (was: Apache Spark) > Add SparseVector support for gemv with DenseMatrix > -- > > Key: SPARK-7681 > URL: https://issues.apache.org/jira/browse/SPARK-7681 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Liang-Chi Hsieh > > Current gemv only works on DenseVector (with DenseMatrix and SparseMatrix). > This ticket is proposed to add SparseVector support for DenseMatrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7681) Add SparseVector support for gemv with DenseMatrix
[ https://issues.apache.org/jira/browse/SPARK-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7681: --- Assignee: Apache Spark > Add SparseVector support for gemv with DenseMatrix > -- > > Key: SPARK-7681 > URL: https://issues.apache.org/jira/browse/SPARK-7681 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > Current gemv only works on DenseVector (with DenseMatrix and SparseMatrix). > This ticket is proposed to add SparseVector support for DenseMatrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7681) Add SparseVector support for gemv with DenseMatrix
Liang-Chi Hsieh created SPARK-7681: -- Summary: Add SparseVector support for gemv with DenseMatrix Key: SPARK-7681 URL: https://issues.apache.org/jira/browse/SPARK-7681 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Liang-Chi Hsieh Current gemv only works on DenseVector (with DenseMatrix and SparseMatrix). This ticket is proposed to add SparseVector support for DenseMatrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
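To illustrate what a SparseVector-aware gemv buys: the product y := alpha * A @ x + beta * y only needs to touch the columns of A where x is nonzero. A minimal pure-Python sketch under that assumption (not Spark's BLAS wrapper; the sparse vector is represented as parallel index/value lists):

```python
def sparse_gemv(alpha, A, x_indices, x_values, beta, y):
    """y := alpha * A @ x + beta * y, with A a dense row-major matrix
    and x a sparse vector given as (indices, values).
    Only x's nonzero entries are visited per row."""
    result = [beta * yi for yi in y]
    for i, row in enumerate(A):
        acc = 0.0
        for j, v in zip(x_indices, x_values):
            acc += row[j] * v
        result[i] += alpha * acc
    return result

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
# sparse x = [0, 2, 0]: single nonzero 2.0 at index 1
print(sparse_gemv(1.0, A, [1], [2.0], 0.0, [0.0, 0.0]))  # → [4.0, 10.0]
```

With few nonzeros the inner loop is far shorter than the dense O(n) dot product per row, which is the motivation for the ticket.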
[jira] [Assigned] (SPARK-6649) DataFrame created through SQLContext.jdbc() failed if columns table must be quoted
[ https://issues.apache.org/jira/browse/SPARK-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6649: --- Assignee: (was: Apache Spark) > DataFrame created through SQLContext.jdbc() failed if columns table must be > quoted > -- > > Key: SPARK-6649 > URL: https://issues.apache.org/jira/browse/SPARK-6649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Frédéric Blanc >Priority: Minor > > If I want to import the content of a table from Oracle that contains a column > named COMMENT (a reserved keyword), I cannot use a DataFrame that maps all > the columns of this table. > {code:title=ddl.sql|borderStyle=solid} > CREATE TABLE TEST_TABLE ( > "COMMENT" VARCHAR2(10) > ); > {code} > {code:title=test.java|borderStyle=solid} > SQLContext sqlContext = ... > DataFrame df = sqlContext.jdbc(databaseURL, "TEST_TABLE"); > df.rdd(); // => fails if the table contains a column with a reserved > keyword > {code} > The same problem can be encountered if a reserved keyword is used as the table > name. > The JDBCRDD Scala class could be improved if the columnList initializer > appended double quotes around each column. (line: 225) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6649) DataFrame created through SQLContext.jdbc() failed if columns table must be quoted
[ https://issues.apache.org/jira/browse/SPARK-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546552#comment-14546552 ] Apache Spark commented on SPARK-6649: - User 'frreiss' has created a pull request for this issue: https://github.com/apache/spark/pull/6208 > DataFrame created through SQLContext.jdbc() failed if columns table must be > quoted > -- > > Key: SPARK-6649 > URL: https://issues.apache.org/jira/browse/SPARK-6649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Frédéric Blanc >Priority: Minor > > If I want to import the content a table from oracle, that contains a column > with name COMMENT (a reserved keyword), I cannot use a DataFrame that map all > the columns of this table. > {code:title=ddl.sql|borderStyle=solid} > CREATE TABLE TEST_TABLE ( > "COMMENT" VARCHAR2(10) > ); > {code} > {code:title=test.java|borderStyle=solid} > SQLContext sqlContext = ... > DataFrame df = sqlContext.jdbc(databaseURL, "TEST_TABLE"); > df.rdd(); // => failed if the table contains a column with a reserved > keyword > {code} > The same problem can be encounter if reserved keyword are used on table name. > The JDBCRDD scala class could be improved, if the columnList initializer > append the double-quote for each column. (line : 225) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6649) DataFrame created through SQLContext.jdbc() failed if columns table must be quoted
[ https://issues.apache.org/jira/browse/SPARK-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6649: --- Assignee: Apache Spark > DataFrame created through SQLContext.jdbc() failed if columns table must be > quoted > -- > > Key: SPARK-6649 > URL: https://issues.apache.org/jira/browse/SPARK-6649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Frédéric Blanc >Assignee: Apache Spark >Priority: Minor > > If I want to import the content a table from oracle, that contains a column > with name COMMENT (a reserved keyword), I cannot use a DataFrame that map all > the columns of this table. > {code:title=ddl.sql|borderStyle=solid} > CREATE TABLE TEST_TABLE ( > "COMMENT" VARCHAR2(10) > ); > {code} > {code:title=test.java|borderStyle=solid} > SQLContext sqlContext = ... > DataFrame df = sqlContext.jdbc(databaseURL, "TEST_TABLE"); > df.rdd(); // => failed if the table contains a column with a reserved > keyword > {code} > The same problem can be encounter if reserved keyword are used on table name. > The JDBCRDD scala class could be improved, if the columnList initializer > append the double-quote for each column. (line : 225) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
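The fix the reporter suggests, quoting each column in the generated column list, can be sketched in a few lines. A hedged Python sketch follows (illustrative helper names, not JDBCRDD's actual code); it follows the standard SQL convention of double-quoting identifiers and doubling embedded quotes:

```python
def quote_identifier(name):
    """Wrap a SQL identifier in double quotes, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'

def column_list(columns):
    """Build a SELECT column list where every identifier is quoted,
    so reserved words like COMMENT survive."""
    return ", ".join(quote_identifier(c) for c in columns)

print(column_list(["ID", "COMMENT"]))  # → "ID", "COMMENT"
```

Note that quoting usually also makes identifiers case-sensitive on databases like Oracle, so a real fix has to match the case the table was created with.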
[jira] [Issue Comment Deleted] (SPARK-7473) Use reservoir sample in RandomForest when choosing features per node
[ https://issues.apache.org/jira/browse/SPARK-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ai He updated SPARK-7473: - Comment: was deleted (was: Hi Joseph, it's AiHe. Thank you for reviewing and merging this PR. ) > Use reservoir sample in RandomForest when choosing features per node > > > Key: SPARK-7473 > URL: https://issues.apache.org/jira/browse/SPARK-7473 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Trivial > Fix For: 1.4.0 > > > See sampling in selectNodesToSplit method -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7473) Use reservoir sample in RandomForest when choosing features per node
[ https://issues.apache.org/jira/browse/SPARK-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546550#comment-14546550 ] Ai He commented on SPARK-7473: -- Hi Joseph, it's AiHe. Thank you for reviewing and merging this PR. > Use reservoir sample in RandomForest when choosing features per node > > > Key: SPARK-7473 > URL: https://issues.apache.org/jira/browse/SPARK-7473 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Trivial > Fix For: 1.4.0 > > > See sampling in selectNodesToSplit method -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7473) Use reservoir sample in RandomForest when choosing features per node
[ https://issues.apache.org/jira/browse/SPARK-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546546#comment-14546546 ] Ai He commented on SPARK-7473: -- Hi Joseph, it's AiHe. Thank you for reviewing and merging this PR. > Use reservoir sample in RandomForest when choosing features per node > > > Key: SPARK-7473 > URL: https://issues.apache.org/jira/browse/SPARK-7473 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Trivial > Fix For: 1.4.0 > > > See sampling in selectNodesToSplit method -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7673) DataSourceStrategy's buildPartitionedTableScan always list list file status for all data files
[ https://issues.apache.org/jira/browse/SPARK-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-7673: -- Summary: DataSourceStrategy's buildPartitionedTableScan always list list file status for all data files (was: DataSourceStrategy''s buildPartitionedTableScan always list list file status for all data files ) > DataSourceStrategy's buildPartitionedTableScan always list list file status > for all data files > --- > > Key: SPARK-7673 > URL: https://issues.apache.org/jira/browse/SPARK-7673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Yin Huai >Assignee: Cheng Lian >Priority: Blocker > > See > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala#L134-141 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7473) Use reservoir sample in RandomForest when choosing features per node
[ https://issues.apache.org/jira/browse/SPARK-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7473. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5988 [https://github.com/apache/spark/pull/5988] > Use reservoir sample in RandomForest when choosing features per node > > > Key: SPARK-7473 > URL: https://issues.apache.org/jira/browse/SPARK-7473 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Trivial > Fix For: 1.4.0 > > > See sampling in selectNodesToSplit method -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7621) Report KafkaReceiver MessageHandler errors so StreamingListeners can take action
[ https://issues.apache.org/jira/browse/SPARK-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy A. Lucas updated SPARK-7621: --- Fix Version/s: (was: 1.3.1) > Report KafkaReceiver MessageHandler errors so StreamingListeners can take > action > > > Key: SPARK-7621 > URL: https://issues.apache.org/jira/browse/SPARK-7621 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0, 1.3.1 >Reporter: Jeremy A. Lucas > Attachments: SPARK-7621.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, when a MessageHandler (for any of the Kafka Receiver > implementations) encounters an error handling a message, the error is only > logged with: > {code:none} > case e: Exception => logError("Error handling message", e) > {code} > It would be _incredibly_ useful to be able to notify any registered > StreamingListener of this receiver error (especially since this > {{try...catch}} block masks more fatal Kafka connection exceptions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
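The pattern the ticket asks for, surfacing handler errors to registered listeners instead of swallowing them in a log call, can be sketched generically. This is a hypothetical Python sketch, not Spark's Receiver or StreamingListener API:

```python
import logging

class ReceiverErrorReporter:
    """Run a message handler and forward any exception to listeners,
    rather than only logging it (illustrative, not Spark code)."""

    def __init__(self):
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def handle(self, message, handler):
        try:
            return handler(message)
        except Exception as e:
            logging.error("Error handling message", exc_info=e)
            for listener in self.listeners:
                listener(e)  # listeners decide how to react (alert, restart, ...)
            return None

reporter = ReceiverErrorReporter()
seen = []
reporter.add_listener(seen.append)
reporter.handle("bad", lambda m: 1 / 0)
print(type(seen[0]).__name__)  # → ZeroDivisionError
```

The point of the ticket is exactly this hand-off: once listeners see the exception, a fatal Kafka connection error is no longer masked by the catch-all log statement.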
[jira] [Resolved] (SPARK-7543) Break dataframe.py into multiple files
[ https://issues.apache.org/jira/browse/SPARK-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7543. Resolution: Fixed Fix Version/s: 1.4.0 > Break dataframe.py into multiple files > -- > > Key: SPARK-7543 > URL: https://issues.apache.org/jira/browse/SPARK-7543 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > Fix For: 1.4.0 > > > dataframe.py is getting large again. We should just make each class its own > file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7073) Clean up Python data type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-7073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7073. Resolution: Fixed Fix Version/s: 1.4.0 > Clean up Python data type hierarchy > --- > > Key: SPARK-7073 > URL: https://issues.apache.org/jira/browse/SPARK-7073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > Fix For: 1.4.0 > > > We recently removed PrimitiveType in Scala, but in Python we still have that > (internal) concept. We should revisit and clean those as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7680) Add a fake Receiver that generates random strings, useful for prototyping
Tathagata Das created SPARK-7680: Summary: Add a fake Receiver that generates random strings, useful for prototyping Key: SPARK-7680 URL: https://issues.apache.org/jira/browse/SPARK-7680 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6964) Support Cancellation in the Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6964: --- Assignee: Apache Spark > Support Cancellation in the Thrift Server > - > > Key: SPARK-6964 > URL: https://issues.apache.org/jira/browse/SPARK-6964 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Critical > > There is already a hook in {{ExecuteStatementOperation}}, we just need to > connect it to the job group cancellation support we already have and make > sure the various drivers support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6964) Support Cancellation in the Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546518#comment-14546518 ] Apache Spark commented on SPARK-6964: - User 'dongwang218' has created a pull request for this issue: https://github.com/apache/spark/pull/6207 > Support Cancellation in the Thrift Server > - > > Key: SPARK-6964 > URL: https://issues.apache.org/jira/browse/SPARK-6964 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Michael Armbrust >Priority: Critical > > There is already a hook in {{ExecuteStatementOperation}}, we just need to > connect it to the job group cancellation support we already have and make > sure the various drivers support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6964) Support Cancellation in the Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6964: --- Assignee: (was: Apache Spark) > Support Cancellation in the Thrift Server > - > > Key: SPARK-6964 > URL: https://issues.apache.org/jira/browse/SPARK-6964 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Michael Armbrust >Priority: Critical > > There is already a hook in {{ExecuteStatementOperation}}, we just need to > connect it to the job group cancellation support we already have and make > sure the various drivers support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
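The cancellation hook described above, a statement whose running work periodically checks a cancel flag, can be sketched without any Spark machinery. A minimal Python sketch (illustrative class and state names, not the Thrift server's ExecuteStatementOperation):

```python
import threading

class StatementOperation:
    """Sketch of a cancellable statement: the work loop checks a cancel
    flag between chunks, mirroring a job-group-cancellation hook."""

    def __init__(self):
        self._cancelled = threading.Event()

    def cancel(self):
        # Called from another thread (e.g. the client's cancel request).
        self._cancelled.set()

    def run(self, chunks):
        done = []
        for chunk in chunks:
            if self._cancelled.is_set():
                return done, "CANCELED"
            done.append(chunk * 2)  # stand-in for real per-partition work
        return done, "FINISHED"

op = StatementOperation()
print(op.run([1, 2, 3]))  # → ([2, 4, 6], 'FINISHED')
```

In Spark terms, the cancel() side would map to cancelling the job group the statement's jobs were submitted under, so all of its running stages stop.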
[jira] [Resolved] (SPARK-7575) Example code for OneVsRest
[ https://issues.apache.org/jira/browse/SPARK-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7575. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6115 [https://github.com/apache/spark/pull/6115] > Example code for OneVsRest > -- > > Key: SPARK-7575 > URL: https://issues.apache.org/jira/browse/SPARK-7575 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Ram Sriharsha > Fix For: 1.4.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API
[ https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7654: --- Priority: Blocker (was: Major) > DataFrameReader and DataFrameWriter for input/output API > > > Key: SPARK-7654 > URL: https://issues.apache.org/jira/browse/SPARK-7654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Blocker > > We have a proliferation of save options now. It'd make more sense to have a > builder pattern for write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7673) DataSourceStrategy''s buildPartitionedTableScan always list list file status for all data files
[ https://issues.apache.org/jira/browse/SPARK-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7673: Assignee: Cheng Lian > DataSourceStrategy''s buildPartitionedTableScan always list list file status > for all data files > > > Key: SPARK-7673 > URL: https://issues.apache.org/jira/browse/SPARK-7673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Yin Huai >Assignee: Cheng Lian >Priority: Blocker > > See > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala#L134-141 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7673) DataSourceStrategy''s buildPartitionedTableScan always list list file status for all data files
[ https://issues.apache.org/jira/browse/SPARK-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7673: Fix Version/s: (was: 1.4.0) > DataSourceStrategy''s buildPartitionedTableScan always list list file status > for all data files > > > Key: SPARK-7673 > URL: https://issues.apache.org/jira/browse/SPARK-7673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Yin Huai >Priority: Blocker > > See > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala#L134-141 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7673) DataSourceStrategy''s buildPartitionedTableScan always list list file status for all data files
[ https://issues.apache.org/jira/browse/SPARK-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7673: Target Version/s: 1.4.0 > DataSourceStrategy''s buildPartitionedTableScan always list list file status > for all data files > > > Key: SPARK-7673 > URL: https://issues.apache.org/jira/browse/SPARK-7673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Yin Huai >Priority: Blocker > > See > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala#L134-141 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7563) OutputCommitCoordinator.stop() should only be executed in driver
[ https://issues.apache.org/jira/browse/SPARK-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546468#comment-14546468 ] Patrick Wendell commented on SPARK-7563: I pulled the fix into 1.4.0, but not yet 1.3.2 (didn't feel comfortable doing the backport). > OutputCommitCoordinator.stop() should only be executed in driver > > > Key: SPARK-7563 > URL: https://issues.apache.org/jira/browse/SPARK-7563 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1 > Environment: Red Hat Enterprise Linux Server release 7.0 (Maipo) > Spark 1.3.1 Release >Reporter: Hailong Wen >Priority: Critical > Fix For: 1.4.0 > > > I am from the IBM Platform Symphony team and we are integrating Spark 1.3.1 with > EGO (a resource management product). > In EGO we use a fine-grained dynamic allocation policy, and each Executor will > exit after its tasks are all done. When testing *spark-shell*, we find that > when an executor of the first job exits, it stops the OutputCommitCoordinator, which > results in all future jobs failing.
Details are as follows: > We got the following error in executor when submitting job in *spark-shell* > the second time (the first job submission is successful): > {noformat} > 15/05/11 04:02:31 INFO spark.util.AkkaUtils: Connecting to > OutputCommitCoordinator: > akka.tcp://sparkDriver@whlspark01:50452/user/OutputCommitCoordinator > Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: > ActorSelection[Anchor(akka.tcp://sparkDriver@whlspark01:50452/), > Path(/user/OutputCommitCoordinator)] > at > akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) > at > akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) > at > akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) > at > akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) > at > akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) > at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) > at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) > at > akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74) > at > akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110) > at > akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267) > at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89) > at > akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937) > at 
akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {noformat} > And in driver side, we see a log message telling that the > OutputCommitCoordinator is stopped after the first submission: > {noformat} > 15/05/11 04:01:23 INFO > spark.scheduler.OutputCommitCoordinator$OutputCommitCoordinatorActor: > OutputCommitCoordinator stopped! > {noformat} > We examine the code of OutputCommitCoordinator, and find that executor will > reuse the ref of driver's OutputCommitCoordinatorActor. So when an executor > exits, it will eventually call SparkEnv.stop(): > {noformat} > private[spark] def stop() { > isStopped = true > pythonWorkers.foreach { case(key, worker) => worker.stop() } > Option(httpFileServer).foreach(_.stop()) > mapOutputTracker.stop
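The fix the ticket title describes, guarding stop() so driver-side services survive executor exits, reduces to a role check. A hedged Python sketch of the shape of that guard (illustrative names, not SparkEnv's actual code):

```python
class Coordinator:
    """Stand-in for a driver-owned service such as OutputCommitCoordinator."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

class Env:
    """Sketch: stop shared driver-side services only on the driver, so an
    executor shutting down cannot tear down the coordinator for everyone."""
    def __init__(self, is_driver, coordinator):
        self.is_driver = is_driver
        self.coordinator = coordinator

    def stop(self):
        # Executor-local resources would be released here in either case.
        if self.is_driver:
            self.coordinator.stop()

coord = Coordinator()
Env(is_driver=False, coordinator=coord).stop()
print(coord.stopped)  # → False: executor exit leaves the coordinator running
Env(is_driver=True, coordinator=coord).stop()
print(coord.stopped)  # → True: only the driver's shutdown stops it
```

This matches the symptom in the report: once the first executor's SparkEnv.stop() reached the shared coordinator, every later job's commit handshake hit a dead actor.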
[jira] [Updated] (SPARK-7563) OutputCommitCoordinator.stop() should only be executed in driver
[ https://issues.apache.org/jira/browse/SPARK-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7563: --- Fix Version/s: 1.4.0 > OutputCommitCoordinator.stop() should only be executed in driver > > Key: SPARK-7563 > URL: https://issues.apache.org/jira/browse/SPARK-7563 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.3.1 > Environment: Red Hat Enterprise Linux Server release 7.0 (Maipo) > Spark 1.3.1 Release > Reporter: Hailong Wen > Priority: Critical > Fix For: 1.4.0 > > > I am from the IBM Platform Symphony team, and we are integrating Spark 1.3.1 with EGO (a resource management product). > In EGO we use a fine-grained dynamic allocation policy, and each executor exits after all of its tasks are done. When testing *spark-shell*, we found that when the executors of the first job exit, they stop the OutputCommitCoordinator, which causes all subsequent job submissions to fail.
[jira] [Updated] (SPARK-7563) OutputCommitCoordinator.stop() should only be executed in driver
[ https://issues.apache.org/jira/browse/SPARK-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7563: --- Target Version/s: 1.3.2, 1.4.0
[jira] [Comment Edited] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones
[ https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546463#comment-14546463 ] Davies Liu edited comment on SPARK-6411 at 5/16/15 1:02 AM: Since TimestampType in Spark SQL does not support timezone, there is no way to get the timezone back after a round trip. So we should drop the timezone for datetime before serializing (convert to UTC). I will send out a PR soon. was (Author: davies): Since TimestampType in Spark SQL does not support timezone, there is no way to get the timezone back after a round trip. So we should drop the timezone for datetime before serializing. I will send out a PR soon. > PySpark DataFrames can't be created if any datetimes have timezones > --- > > Key: SPARK-6411 > URL: https://issues.apache.org/jira/browse/SPARK-6411 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0 >Reporter: Harry Brundage >Assignee: Davies Liu > > I am unable to create a DataFrame with PySpark if any of the {{datetime}} > objects that pass through the conversion process have a {{tzinfo}} property > set. > This works fine: > {code} > In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, > 10),)]).toDF().collect() > Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))] > {code} > as expected, the tuple's schema is inferred as having one anonymous column > with a datetime field, and the datetime roundtrips through to the Java side > python deserialization and then back into python land upon {{collect}}. This > however: > {code} > In [5]: from dateutil.tz import tzutc > In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, > tzinfo=tzutc()),)]).toDF().collect() > {code} > explodes with > {code} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage > 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid > pickle data for datetime; expected 1 or 7 args, got 2 > at > net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69) > at > net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at > 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:114) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:114) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:114) > at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) > at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520) > at org.apache.spa
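The normalization Davies Liu proposes — convert aware datetimes to UTC, then drop the tzinfo before serializing, since TimestampType cannot round-trip a timezone anyway — can be done on the Python side with the standard library. This is an illustrative helper, not the PySpark API:

```python
from datetime import datetime, timezone, timedelta

def drop_timezone(dt):
    """Normalize a timezone-aware datetime to naive UTC: convert to UTC,
    then discard tzinfo. Naive datetimes pass through unchanged.
    (Illustrative sketch of the proposed serialization step.)"""
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt

est = timezone(timedelta(hours=-5))
aware = datetime(2014, 7, 8, 11, 10, tzinfo=est)
print(drop_timezone(aware))  # 2014-07-08 16:10:00 (naive, in UTC)
```

With this applied before pickling, the value reaching Pyrolite is a plain 7-argument datetime, avoiding the "expected 1 or 7 args, got 2" PickleException in the stack trace above.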
[jira] [Commented] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones
[ https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546463#comment-14546463 ] Davies Liu commented on SPARK-6411: --- Since TimestampType in Spark SQL does not support timezone, there is no way to get the timezone back after a round trip. So we should drop the timezone for datetime before serializing. I will send out a PR soon.
[jira] [Assigned] (SPARK-7073) Clean up Python data type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-7073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7073: --- Assignee: Apache Spark (was: Davies Liu) > Clean up Python data type hierarchy > --- > > Key: SPARK-7073 > URL: https://issues.apache.org/jira/browse/SPARK-7073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > We recently removed PrimitiveType in Scala, but in Python we still have that > (internal) concept. We should revisit and clean those as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7073) Clean up Python data type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-7073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7073: --- Assignee: Davies Liu (was: Apache Spark)
[jira] [Commented] (SPARK-7073) Clean up Python data type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-7073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546450#comment-14546450 ] Apache Spark commented on SPARK-7073: --- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6206
[jira] [Assigned] (SPARK-6820) Convert NAs to null type in SparkR DataFrames
[ https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6820: --- Assignee: Apache Spark (was: Qian Huang) > Convert NAs to null type in SparkR DataFrames > - > > Key: SPARK-6820 > URL: https://issues.apache.org/jira/browse/SPARK-6820 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman >Assignee: Apache Spark >Priority: Critical > > While converting RDD or local R DataFrame to a SparkR DataFrame we need to > handle missing values or NAs. > We should convert NAs to SparkSQL's null type to handle the conversion > correctly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6820) Convert NAs to null type in SparkR DataFrames
[ https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6820: --- Assignee: Qian Huang (was: Apache Spark)
[jira] [Commented] (SPARK-6820) Convert NAs to null type in SparkR DataFrames
[ https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546451#comment-14546451 ] Apache Spark commented on SPARK-6820: --- User 'hqzizania' has created a pull request for this issue: https://github.com/apache/spark/pull/6190
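The conversion SPARK-6820 asks for — missing values becoming SQL nulls at the serialization boundary — can be illustrated in a few lines. This is a hypothetical sketch (R's NA typically surfaces as NaN when crossing into the numeric serialization layer; the helper name is invented, not SparkR code):

```python
import math

def na_to_null(values):
    """Map missing numeric values (NaN, standing in for R's NA) to None,
    which serializes as Spark SQL's null; everything else passes through."""
    return [None if isinstance(v, float) and math.isnan(v) else v
            for v in values]

print(na_to_null([1.0, float("nan"), 3.0]))  # [1.0, None, 3.0]
```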
[jira] [Resolved] (SPARK-7676) Cleanup unnecessary code and fix small bug in the stage timeline view
[ https://issues.apache.org/jira/browse/SPARK-7676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-7676. --- Resolution: Fixed Fix Version/s: 1.4.0 > Cleanup unnecessary code and fix small bug in the stage timeline view > - > > Key: SPARK-7676 > URL: https://issues.apache.org/jira/browse/SPARK-7676 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.4.0 > > > SPARK-7296 added a per-stage visualization to the UI. There's some unneeded > code left in this commit from the many iterations that should be removed. We > should also remove the functionality to highlight the row in the task table > when someone mouses over one of the tasks in the visualization, because there > are typically far too many tasks in the table for this to be useful (because > the user can't see which row is highlighted). > There's also a small bug where the end time is based on the last task's > launch time, rather than the last task's finish time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
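The "small bug" in the description — the stage end time derived from the last task's launch time instead of its finish time — is easy to see with a minimal sketch (hypothetical task data, not the UI code):

```python
# Each task has a launch and a finish timestamp; a long-running task
# launched early makes the two formulas diverge.
tasks = [
    {"launch": 0, "finish": 9},  # launched first, finishes last
    {"launch": 2, "finish": 5},
    {"launch": 3, "finish": 4},  # launched last
]

wrong_end = max(t["launch"] for t in tasks)  # 3: what the buggy view used
right_end = max(t["finish"] for t in tasks)  # 9: actual stage end time
print(wrong_end, right_end)  # 3 9
```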
[jira] [Updated] (SPARK-7676) Cleanup unnecessary code and fix small bug in the stage timeline view
[ https://issues.apache.org/jira/browse/SPARK-7676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-7676: -- Component/s: Web UI
[jira] [Commented] (SPARK-6980) Akka timeout exceptions indicate which conf controls them
[ https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546426#comment-14546426 ] Apache Spark commented on SPARK-6980: - User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/6205 > Akka timeout exceptions indicate which conf controls them > - > > Key: SPARK-6980 > URL: https://issues.apache.org/jira/browse/SPARK-6980 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Imran Rashid >Assignee: Harsh Gupta >Priority: Minor > Labels: starter > Attachments: Spark-6980-Test.scala > > > If you hit one of the akka timeouts, you just get an exception like > {code} > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > {code} > The exception doesn't indicate how to change the timeout, though there is > usually (always?) a corresponding setting in {{SparkConf}} . It would be > nice if the exception including the relevant setting. > I think this should be pretty easy to do -- we just need to create something > like a {{NamedTimeout}}. It would have its own {{await}} method, catches the > akka timeout and throws its own exception. We should change > {{RpcUtils.askTimeout}} and {{RpcUtils.lookupTimeout}} to always give a > {{NamedTimeout}}, so we can be sure that anytime we have a timeout, we get a > better exception. > Given the latest refactoring to the rpc layer, this needs to be done in both > {{AkkaUtils}} and {{AkkaRpcEndpoint}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
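The "named timeout" idea from the issue can be sketched as a small wrapper that catches the bare timeout and re-raises it with the controlling configuration key in the message. This is an illustrative Python model of the pattern, not Spark's Scala API; the conf key passed in is whatever the call site knows applies:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
import time

def await_result(future, timeout_s, conf_key):
    """Wait on a future, but turn an anonymous timeout into one that
    names the setting a user should tune (the NamedTimeout idea)."""
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(
            f"Futures timed out after [{timeout_s} seconds]; "
            f"this timeout is controlled by {conf_key}") from None

with ThreadPoolExecutor(max_workers=1) as pool:
    slow = pool.submit(time.sleep, 1.0)
    try:
        await_result(slow, 0.1, "spark.rpc.askTimeout")
    except TimeoutError as e:
        print(e)  # message now names the conf key to raise
```

Routing all waits through one such helper is what makes the guarantee "anytime we have a timeout, we get a better exception" enforceable.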
[jira] [Commented] (SPARK-6289) PySpark doesn't maintain SQL date Types
[ https://issues.apache.org/jira/browse/SPARK-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546425#comment-14546425 ] Michael Nazario commented on SPARK-6289: This does work for me, but it seems odd that this behavior would change based on the way you access data. > PySpark doesn't maintain SQL date Types > --- > > Key: SPARK-6289 > URL: https://issues.apache.org/jira/browse/SPARK-6289 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.2.1 >Reporter: Michael Nazario >Assignee: Davies Liu > > For the DateType, Spark SQL requires a datetime.date in Python. However, if > you collect a row based on that type, you'll end up with a returned value > which is type datetime.datetime. > I have tried to reproduce this using the pyspark shell, but have been unable > to. This is definitely a problem coming from pyrolite though: > https://github.com/irmen/Pyrolite/ > Pyrolite is being used for datetime and date serialization, but appears to > not map to date objects, but maps to datetime objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
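Part of why this slips through unnoticed: in Python, datetime is a subclass of date, so a deserializer that hands back datetime objects still satisfies isinstance checks against date. A short stdlib demonstration, with a defensive normalization helper (illustrative, not PySpark API):

```python
from datetime import date, datetime

d = date(2015, 3, 11)
dt = datetime(2015, 3, 11, 0, 0)

print(isinstance(dt, date))  # True: a datetime "is a" date
print(type(dt) is date)      # False: but it is not exactly a date

def as_date(value):
    """Collapse a datetime back to the date it represents; pass plain
    dates (and anything else) through untouched."""
    if isinstance(value, datetime):
        return value.date()
    return value

print(as_date(dt) == d)  # True
```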
[jira] [Commented] (SPARK-6902) Row() object can be mutated even though it should be immutable
[ https://issues.apache.org/jira/browse/SPARK-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546417#comment-14546417 ] Jonathan Arfa commented on SPARK-6902: -- [~davies] it works for me simply because I *now* know not to mis-use the Row() object like that. I filed the bug because I didn't want other people to tear their hair out over the same problem. I think editing a Row like I did should throw an error, because had it done so in version 1.2 I would've saved a ton of time by not having to hunt for the bug in my code. > Row() object can be mutated even though it should be immutable > -- > > Key: SPARK-6902 > URL: https://issues.apache.org/jira/browse/SPARK-6902 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.2.0 >Reporter: Jonathan Arfa >Assignee: Davies Liu > > See the below code snippet, IMHO it shouldn't let you assign {{x.c = 5}} and > should just give you an error. > {quote} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 1.2.0-SNAPSHOT > /_/ > Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36) > SparkContext available as sc. > >>> from pyspark.sql import * > >>> x = Row(a=1, b=2, c=3) > >>> x > Row(a=1, b=2, c=3) > >>> x.__dict__ > \{'__FIELDS__': ['a', 'b', 'c']\} > >>> x.c > 3 > >>> x.c = 5 > >>> x > Row(a=1, b=2, c=3) > >>> x.__dict__ > \{'__FIELDS__': ['a', 'b', 'c'], 'c': 5\} > >>> x.c > 5 > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
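The fail-fast behavior the reporter asks for can be sketched with a minimal immutable Row-like class. This is illustrative only (PySpark's actual Row is a tuple subclass, and its internals differ): overriding `__setattr__` makes the misuse raise instead of silently shadowing the field.

```python
class Row:
    """Toy immutable record: field writes raise instead of creating an
    instance attribute that shadows the real field."""

    def __init__(self, **fields):
        object.__setattr__(self, "_fields", dict(fields))

    def __getattr__(self, name):
        # Called only when normal lookup fails, so no recursion on _fields.
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        raise AttributeError("Row objects are immutable")

x = Row(a=1, b=2, c=3)
print(x.c)  # 3
try:
    x.c = 5
except AttributeError as e:
    print(e)  # Row objects are immutable
```

With this shape, the `x.c = 5` from the report fails immediately, instead of leaving a stray `'c': 5` in `__dict__` that disagrees with `repr(x)`.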
[jira] [Updated] (SPARK-6811) Building binary R packages for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6811: --- Assignee: Shivaram Venkataraman > Building binary R packages for SparkR > - > > Key: SPARK-6811 > URL: https://issues.apache.org/jira/browse/SPARK-6811 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman >Priority: Blocker > > We should figure out how to distribute binary packages for SparkR as a part > of the release process. R packages for Windows might need to be built > separately and we could offer a separate download link for Windows users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2883: --- Priority: Critical (was: Blocker) > Spark Support for ORCFile format > > > Key: SPARK-2883 > URL: https://issues.apache.org/jira/browse/SPARK-2883 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Reporter: Zhan Zhang >Priority: Critical > Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 > pm jobtracker.png, orc.diff > > > Verify the support of OrcInputFormat in Spark, fix issues if they exist, and add > documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546394#comment-14546394 ] Patrick Wendell commented on SPARK-2883: Since this is a feature, I'm going to drop it down to critical priority, as we'll start the release candidates soon. However, I think it's fine to slip this in between RC's because it's purely additive, so IMO it's very likely this will make it into Spark 1.4. > Spark Support for ORCFile format > > > Key: SPARK-2883 > URL: https://issues.apache.org/jira/browse/SPARK-2883 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Reporter: Zhan Zhang >Priority: Blocker > Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 > pm jobtracker.png, orc.diff > > > Verify the support of OrcInputFormat in Spark, fix issues if they exist, and add > documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7355) FlakyTest - o.a.s.DriverSuite
[ https://issues.apache.org/jira/browse/SPARK-7355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7355: --- Priority: Critical (was: Blocker) > FlakyTest - o.a.s.DriverSuite > - > > Key: SPARK-7355 > URL: https://issues.apache.org/jira/browse/SPARK-7355 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Reporter: Tathagata Das >Assignee: Andrew Or >Priority: Critical > Labels: flaky-test > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7644) Ensure all scoped RDD operations are tested and cleaned
[ https://issues.apache.org/jira/browse/SPARK-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7644: --- Priority: Critical (was: Blocker) > Ensure all scoped RDD operations are tested and cleaned > --- > > Key: SPARK-7644 > URL: https://issues.apache.org/jira/browse/SPARK-7644 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL, Streaming >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > If all goes well, this will be a "Won't Fix". Before releasing we should make > sure all operations wrapped in `RDDOperationScope.withScope` are actually > tested and enclosed closures are actually cleaned. This is because a big > change went into `ClosureCleaner` and wrapping methods in closures may change > whether they are serializable. > TL;DR we should run all the wrapped operations to make sure we don't run into > java.lang.NotSerializableException. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7284) Update streaming documentation for Spark 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7284: --- Priority: Critical (was: Blocker) > Update streaming documentation for Spark 1.4.0 release > -- > > Key: SPARK-7284 > URL: https://issues.apache.org/jira/browse/SPARK-7284 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Things to update (continuously updated list) > - Python API for Kafka Direct > - Pointers to the new Streaming UI > - Update Kafka version to 0.8.2.1 > - Add ref to RDD.foreachPartitionWithIndex (if merged) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7672) Number format exception with spark.kryoserializer.buffer.mb
[ https://issues.apache.org/jira/browse/SPARK-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7672: --- Component/s: Spark Core > Number format exception with spark.kryoserializer.buffer.mb > --- > > Key: SPARK-7672 > URL: https://issues.apache.org/jira/browse/SPARK-7672 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Nishkam Ravi > > With spark.kryoserializer.buffer.mb 1000 : > Exception in thread "main" java.lang.NumberFormatException: Size must be > specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes (g), > tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m. > Fractional values are not supported. Input was: 100.0 > at > org.apache.spark.network.util.JavaUtils.parseByteString(JavaUtils.java:238) > at > org.apache.spark.network.util.JavaUtils.byteStringAsKb(JavaUtils.java:259) > at org.apache.spark.util.Utils$.byteStringAsKb(Utils.scala:1037) > at org.apache.spark.SparkConf.getSizeAsKb(SparkConf.scala:245) > at > org.apache.spark.serializer.KryoSerializer.<init>(KryoSerializer.scala:53) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:269) > at > org.apache.spark.SparkEnv$.instantiateClassFromConf$1(SparkEnv.scala:280) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:283) > at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:188) > at > org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7672) Number format exception with spark.kryoserializer.buffer.mb
[ https://issues.apache.org/jira/browse/SPARK-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7672: --- Priority: Critical (was: Major) > Number format exception with spark.kryoserializer.buffer.mb > --- > > Key: SPARK-7672 > URL: https://issues.apache.org/jira/browse/SPARK-7672 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Nishkam Ravi >Priority: Critical > > With spark.kryoserializer.buffer.mb 1000 : > Exception in thread "main" java.lang.NumberFormatException: Size must be > specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes (g), > tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m. > Fractional values are not supported. Input was: 100.0 > at > org.apache.spark.network.util.JavaUtils.parseByteString(JavaUtils.java:238) > at > org.apache.spark.network.util.JavaUtils.byteStringAsKb(JavaUtils.java:259) > at org.apache.spark.util.Utils$.byteStringAsKb(Utils.scala:1037) > at org.apache.spark.SparkConf.getSizeAsKb(SparkConf.scala:245) > at > org.apache.spark.serializer.KryoSerializer.<init>(KryoSerializer.scala:53) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:269) > at > org.apache.spark.SparkEnv$.instantiateClassFromConf$1(SparkEnv.scala:280) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:283) > at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:188) > at > org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
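The trace above fails inside `JavaUtils.parseByteString` because the value derived from the deprecated `spark.kryoserializer.buffer.mb` setting ends up as the fractional string "100.0", which the size parser rejects. The rules quoted in the error message (an integer value plus an optional unit suffix, fractions not supported) can be sketched as follows; this is an illustrative re-implementation, not Spark's actual parser:

```python
import re

# Unit suffixes quoted in the error message, as powers of 1024 (kibi/mebi/...).
UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3,
         "t": 1024 ** 4, "p": 1024 ** 5}

def parse_byte_string(s, default_unit="b"):
    """Parse '50b', '100k', '250m' into a byte count; reject '100.0'."""
    m = re.fullmatch(r"\s*(\d+)\s*([bkmgtp]?)\s*", s.lower())
    if m is None:
        # Fractional or otherwise malformed inputs fail here, as in the trace.
        raise ValueError("Fractional values are not supported. Input was: " + s)
    value, unit = m.groups()
    return int(value) * UNITS[unit or default_unit]

assert parse_byte_string("250m") == 250 * 1024 ** 2
```

With such rules, a legacy integer-only `.mb` value would have to be translated into a suffixed string such as "1000m" before parsing, never into a fractional decimal.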
[jira] [Assigned] (SPARK-7679) Update AWS SDK and KCL versions to 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7679: --- Assignee: Apache Spark (was: Tathagata Das) > Update AWS SDK and KCL versions to 1.2.1 > > > Key: SPARK-7679 > URL: https://issues.apache.org/jira/browse/SPARK-7679 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7679) Update AWS SDK and KCL versions to 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7679: --- Assignee: Tathagata Das (was: Apache Spark) > Update AWS SDK and KCL versions to 1.2.1 > > > Key: SPARK-7679 > URL: https://issues.apache.org/jira/browse/SPARK-7679 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7679) Update AWS SDK and KCL versions to 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546341#comment-14546341 ] Apache Spark commented on SPARK-7679: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/6147 > Update AWS SDK and KCL versions to 1.2.1 > > > Key: SPARK-7679 > URL: https://issues.apache.org/jira/browse/SPARK-7679 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7679) Update AWS SDK and KCL versions to 1.2.1
Tathagata Das created SPARK-7679: Summary: Update AWS SDK and KCL versions to 1.2.1 Key: SPARK-7679 URL: https://issues.apache.org/jira/browse/SPARK-7679 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7621) Report KafkaReceiver MessageHandler errors so StreamingListeners can take action
[ https://issues.apache.org/jira/browse/SPARK-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7621: --- Assignee: (was: Apache Spark) > Report KafkaReceiver MessageHandler errors so StreamingListeners can take > action > > > Key: SPARK-7621 > URL: https://issues.apache.org/jira/browse/SPARK-7621 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0, 1.3.1 >Reporter: Jeremy A. Lucas > Fix For: 1.3.1 > > Attachments: SPARK-7621.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, when a MessageHandler (for any of the Kafka Receiver > implementations) encounters an error handling a message, the error is only > logged with: > {code:none} > case e: Exception => logError("Error handling message", e) > {code} > It would be _incredibly_ useful to be able to notify any registered > StreamingListener of this receiver error (especially since this > {{try...catch}} block masks more fatal Kafka connection exceptions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7621) Report KafkaReceiver MessageHandler errors so StreamingListeners can take action
[ https://issues.apache.org/jira/browse/SPARK-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7621: --- Assignee: Apache Spark > Report KafkaReceiver MessageHandler errors so StreamingListeners can take > action > > > Key: SPARK-7621 > URL: https://issues.apache.org/jira/browse/SPARK-7621 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0, 1.3.1 >Reporter: Jeremy A. Lucas >Assignee: Apache Spark > Fix For: 1.3.1 > > Attachments: SPARK-7621.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, when a MessageHandler (for any of the Kafka Receiver > implementations) encounters an error handling a message, the error is only > logged with: > {code:none} > case e: Exception => logError("Error handling message", e) > {code} > It would be _incredibly_ useful to be able to notify any registered > StreamingListener of this receiver error (especially since this > {{try...catch}} block masks more fatal Kafka connection exceptions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7621) Report KafkaReceiver MessageHandler errors so StreamingListeners can take action
[ https://issues.apache.org/jira/browse/SPARK-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546326#comment-14546326 ] Apache Spark commented on SPARK-7621: - User 'jerluc' has created a pull request for this issue: https://github.com/apache/spark/pull/6204 > Report KafkaReceiver MessageHandler errors so StreamingListeners can take > action > > > Key: SPARK-7621 > URL: https://issues.apache.org/jira/browse/SPARK-7621 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0, 1.3.1 >Reporter: Jeremy A. Lucas > Fix For: 1.3.1 > > Attachments: SPARK-7621.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, when a MessageHandler (for any of the Kafka Receiver > implementations) encounters an error handling a message, the error is only > logged with: > {code:none} > case e: Exception => logError("Error handling message", e) > {code} > It would be _incredibly_ useful to be able to notify any registered > StreamingListener of this receiver error (especially since this > {{try...catch}} block masks more fatal Kafka connection exceptions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
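What SPARK-7621 asks for, in miniature: catch the handler error, log it, and also fan it out to registered listeners instead of swallowing it. The names below (`ReceiverErrorReporter`, `add_listener`) are illustrative only, not Spark's StreamingListener API:

```python
import logging

class ReceiverErrorReporter:
    """Illustrative sketch: report message-handler errors to listeners
    instead of only logging them (not Spark's actual API)."""

    def __init__(self):
        self._listeners = []

    def add_listener(self, listener):
        self._listeners.append(listener)

    def handle(self, message, handler):
        try:
            return handler(message)
        except Exception as exc:
            # Previously the error stopped here, invisible to listeners.
            logging.error("Error handling message", exc_info=exc)
            for listener in self._listeners:
                listener(exc)  # let StreamingListener-like hooks react
            return None

reporter = ReceiverErrorReporter()
errors = []
reporter.add_listener(errors.append)

def bad_handler(message):
    raise ValueError("failed on: " + message)

reporter.handle("payload", bad_handler)
assert len(errors) == 1
```

The key design point is that the `except` block no longer terminates the error's propagation: every registered observer sees the exception, so connection-level failures are not masked by the logging call.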
[jira] [Commented] (SPARK-6216) Check Python version in worker before run PySpark job
[ https://issues.apache.org/jira/browse/SPARK-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546321#comment-14546321 ] Apache Spark commented on SPARK-6216: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6203 > Check Python version in worker before run PySpark job > - > > Key: SPARK-6216 > URL: https://issues.apache.org/jira/browse/SPARK-6216 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.4.0 > > > PySpark can only run when the driver and the worker use the same major Python version (both 2.6 or both 2.7); mixing 2.7 in the driver with 2.6 in the worker (or vice versa) causes random errors. > For example: > {code} > davies@localhost:~/work/spark$ PYSPARK_PYTHON=python2.6 > PYSPARK_DRIVER_PYTHON=python2.7 bin/pyspark > Using Python version 2.7.7 (default, Jun 2 2014 12:48:16) > SparkContext available as sc, SQLContext available as sqlCtx. 
> >>> sc.textFile('LICENSE').map(lambda l: l.split()).count() > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main > process() > File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in > process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2251, in > pipeline_func > return func(split, prev_func(split, iterator)) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2251, in > pipeline_func > return func(split, prev_func(split, iterator)) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2251, in > pipeline_func > return func(split, prev_func(split, iterator)) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 281, in func > return f(iterator) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 931, in <lambda> > return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 931, in > <genexpr> > return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > File "<stdin>", line 1, in <lambda> > TypeError: 'bool' object is not callable > at > org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:136) > at > org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:177) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:95) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
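The guard SPARK-6216 adds can be sketched simply: the worker compares its own major.minor Python version against the one the driver reports and fails fast with a clear message, instead of crashing later with a confusing `TypeError` like the one above. Function name and message wording here are illustrative, not the actual patch:

```python
import sys

def check_python_version(driver_version):
    """Fail fast if this (worker) interpreter's major.minor version does
    not match the driver's. Illustrative sketch of the SPARK-6216 check."""
    worker_version = "%d.%d" % sys.version_info[:2]
    if worker_version != driver_version:
        raise RuntimeError(
            "Python in worker has different version %s than that in "
            "driver %s; PySpark cannot run with mismatched versions"
            % (worker_version, driver_version))

# Same version on both sides: passes silently.
check_python_version("%d.%d" % sys.version_info[:2])
```

In practice the driver would ship its version string to the worker alongside the serialized task, so the check runs before any user code is deserialized.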
[jira] [Commented] (SPARK-6902) Row() object can be mutated even though it should be immutable
[ https://issues.apache.org/jira/browse/SPARK-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546274#comment-14546274 ] Davies Liu commented on SPARK-6902: --- [~jarfa] Python is a dynamic language; it's not common to provide a read-only interface (there are many ways to break it), so I'd like to leave it as is (won't fix). Does this work for you? > Row() object can be mutated even though it should be immutable > -- > > Key: SPARK-6902 > URL: https://issues.apache.org/jira/browse/SPARK-6902 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.2.0 >Reporter: Jonathan Arfa >Assignee: Davies Liu > > See the below code snippet, IMHO it shouldn't let you assign {{x.c = 5}} and > should just give you an error. > {quote} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 1.2.0-SNAPSHOT > /_/ > Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36) > SparkContext available as sc. > >>> from pyspark.sql import * > >>> x = Row(a=1, b=2, c=3) > >>> x > Row(a=1, b=2, c=3) > >>> x.__dict__ > \{'__FIELDS__': ['a', 'b', 'c']\} > >>> x.c > 3 > >>> x.c = 5 > >>> x > Row(a=1, b=2, c=3) > >>> x.__dict__ > \{'__FIELDS__': ['a', 'b', 'c'], 'c': 5\} > >>> x.c > 5 > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
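For readers who want the behaviour the reporter asked for anyway, a read-only record can be built on top of a tuple so that attribute assignment fails loudly instead of silently shadowing the stored data. This is a minimal illustrative sketch, not PySpark's actual Row implementation:

```python
class FrozenRow(tuple):
    """Read-only record: fields come from keyword arguments, sorted by name.
    Illustrative only; PySpark's Row is implemented differently."""

    def __new__(cls, **kwargs):
        names = sorted(kwargs)
        row = super().__new__(cls, tuple(kwargs[n] for n in names))
        # Bypass our own __setattr__ guard to store field names once.
        object.__setattr__(row, "_fields", names)
        return row

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails: map field -> value.
        try:
            return self[self._fields.index(name)]
        except ValueError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        # Reject mutation instead of quietly adding an instance attribute.
        raise AttributeError("Row is immutable; cannot set %r" % name)

x = FrozenRow(a=1, b=2, c=3)
assert x.c == 3
try:
    x.c = 5          # raises AttributeError, unlike the 1.2 behaviour above
except AttributeError:
    pass
assert x.c == 3      # the stored value is untouched
```

As Davies notes, a determined caller can still break this (e.g. via `object.__setattr__`), so in a dynamic language the guard is best understood as catching accidental misuse rather than enforcing true immutability.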
[jira] [Resolved] (SPARK-7556) User guide update for feature transformer: Binarizer
[ https://issues.apache.org/jira/browse/SPARK-7556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7556. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6116 [https://github.com/apache/spark/pull/6116] > User guide update for feature transformer: Binarizer > > > Key: SPARK-7556 > URL: https://issues.apache.org/jira/browse/SPARK-7556 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Joseph K. Bradley >Assignee: Liang-Chi Hsieh > Fix For: 1.4.0 > > > Copied from [SPARK-7443]: > {quote} > Now that we have algorithms in spark.ml which are not in spark.mllib, we > should start making subsections for the spark.ml API as needed. We can follow > the structure of the spark.mllib user guide. > * The spark.ml user guide can provide: (a) code examples and (b) info on > algorithms which do not exist in spark.mllib. > * We should not duplicate info in the spark.ml guides. Since spark.mllib is > still the primary API, we should provide links to the corresponding > algorithms in the spark.mllib user guide for more info. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7511) PySpark ML seed Param should be varied per class
[ https://issues.apache.org/jira/browse/SPARK-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7511: - Description: Currently, Scala's HasSeed mix-in uses a random Long as the default value for seed. Python uses 42. After discussions, we've decided to use a seed which varies based on the class name, but which is fixed instead of random. This will make behavior reproducible, rather than random, by default. Users will still be able to change the random seed. The default seed should be produced via some hash of the class name. Scala's seed will be fixed in a separate patch. was: Currently, Scala's HasSeed mix-in uses a random Long as the default value for seed. Python uses 42. After discussions, we've decided to use a seed which varies based on the class name, but which is fixed instead of random. This will make behavior reproducible, rather than random, by default. Users will still be able to change the random seed. Scala's seed will be fixed in a separate patch. > PySpark ML seed Param should be varied per class > > > Key: SPARK-7511 > URL: https://issues.apache.org/jira/browse/SPARK-7511 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > Currently, Scala's HasSeed mix-in uses a random Long as the default value for > seed. Python uses 42. After discussions, we've decided to use a seed which > varies based on the class name, but which is fixed instead of random. This > will make behavior reproducible, rather than random, by default. Users will > still be able to change the random seed. > The default seed should be produced via some hash of the class name. > Scala's seed will be fixed in a separate patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7677) Enable Kafka In Scala 2.11 Build
[ https://issues.apache.org/jira/browse/SPARK-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-7677. Resolution: Fixed Fix Version/s: 1.4.0 Fixed by pull request: https://github.com/apache/spark/pull/6149 > Enable Kafka In Scala 2.11 Build > > > Key: SPARK-7677 > URL: https://issues.apache.org/jira/browse/SPARK-7677 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Patrick Wendell >Assignee: Iulian Dragos > Fix For: 1.4.0 > > > Now that we upgraded Kafka in SPARK-2808 we can enable it in the Scala 2.11 > build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7678) Scala ML seed Param should vary per class
Joseph K. Bradley created SPARK-7678: Summary: Scala ML seed Param should vary per class Key: SPARK-7678 URL: https://issues.apache.org/jira/browse/SPARK-7678 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor In [SPARK-7511], we decided to use fixed random seeds per class, rather than generating a new random seed whenever an algorithm is instantiated. We need to change this in Scala's HasSeed Param. The default seed should be produced via some hash of the class name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
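A fixed-but-per-class default seed, as described above, can be derived by hashing the class name. The sketch below is one assumed scheme, not the hash the eventual patch uses; note it deliberately avoids Python's built-in `hash()`, whose value for strings changes between interpreter runs (`PYTHONHASHSEED`):

```python
import zlib

def default_seed(class_name):
    """Hypothetical default-seed scheme for SPARK-7678/SPARK-7511:
    deterministic across runs and machines, but different per class.
    zlib.crc32 is used because str.__hash__ is randomized per process."""
    return zlib.crc32(class_name.encode("utf-8"))

# Reproducible for the same class, distinct across classes:
assert default_seed("KMeans") == default_seed("KMeans")
assert default_seed("KMeans") != default_seed("Word2Vec")
```

Any stable string hash (CRC32, MD5 truncated to a Long, etc.) satisfies the requirement; the essential property is determinism, so results are reproducible by default while users can still override the seed.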
[jira] [Updated] (SPARK-7678) Scala ML seed Param should be fixed but vary per class
[ https://issues.apache.org/jira/browse/SPARK-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7678: - Summary: Scala ML seed Param should be fixed but vary per class (was: Scala ML seed Param should vary per class) > Scala ML seed Param should be fixed but vary per class > -- > > Key: SPARK-7678 > URL: https://issues.apache.org/jira/browse/SPARK-7678 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > In [SPARK-7511], we decided to use fixed random seeds per class, rather than > generating a new random seed whenever an algorithm is instantiated. We need > to change this in Scala's HasSeed Param. > The default seed should be produced via some hash of the class name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7511) PySpark ML seed Param should be varied per class
[ https://issues.apache.org/jira/browse/SPARK-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7511: - Description: Currently, Scala's HasSeed mix-in uses a random Long as the default value for seed. Python uses 42. After discussions, we've decided to use a seed which varies based on the class name, but which is fixed instead of random. This will make behavior reproducible, rather than random, by default. Users will still be able to change the random seed. Scala's seed will be fixed in a separate patch. was: Currently, Scala's HasSeed mix-in uses a random Long as the default value for seed. Python uses 42. After discussions, we've decided to use a seed which varies based on the class name, but which is fixed instead of random. This will make behavior reproducible, rather than random, by default. Users will still be able to change the random seed. Scala's seed will be fixed in a separate patch > PySpark ML seed Param should be varied per class > > > Key: SPARK-7511 > URL: https://issues.apache.org/jira/browse/SPARK-7511 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > Currently, Scala's HasSeed mix-in uses a random Long as the default value for > seed. Python uses 42. After discussions, we've decided to use a seed which > varies based on the class name, but which is fixed instead of random. This > will make behavior reproducible, rather than random, by default. Users will > still be able to change the random seed. > Scala's seed will be fixed in a separate patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7676) Cleanup unnecessary code and fix small bug in the stage timeline view
[ https://issues.apache.org/jira/browse/SPARK-7676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7676: --- Assignee: Kay Ousterhout (was: Apache Spark) > Cleanup unnecessary code and fix small bug in the stage timeline view > - > > Key: SPARK-7676 > URL: https://issues.apache.org/jira/browse/SPARK-7676 > Project: Spark > Issue Type: Improvement >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > > SPARK-7296 added a per-stage visualization to the UI. There's some unneeded > code left in this commit from the many iterations that should be removed. We > should also remove the functionality to highlight the row in the task table > when someone mouses over one of the tasks in the visualization, because there > are typically far too many tasks in the table for this to be useful (because > the user can't see which row is highlighted). > There's also a small bug where the end time is based on the last task's > launch time, rather than the last task's finish time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7676) Cleanup unnecessary code and fix small bug in the stage timeline view
[ https://issues.apache.org/jira/browse/SPARK-7676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546239#comment-14546239 ] Apache Spark commented on SPARK-7676: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/6202 > Cleanup unnecessary code and fix small bug in the stage timeline view > - > > Key: SPARK-7676 > URL: https://issues.apache.org/jira/browse/SPARK-7676 > Project: Spark > Issue Type: Improvement >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > > SPARK-7296 added a per-stage visualization to the UI. There's some unneeded > code left in this commit from the many iterations that should be removed. We > should also remove the functionality to highlight the row in the task table > when someone mouses over one of the tasks in the visualization, because there > are typically far too many tasks in the table for this to be useful (because > the user can't see which row is highlighted). > There's also a small bug where the end time is based on the last task's > launch time, rather than the last task's finish time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7676) Cleanup unnecessary code and fix small bug in the stage timeline view
[ https://issues.apache.org/jira/browse/SPARK-7676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7676: --- Assignee: Apache Spark (was: Kay Ousterhout) > Cleanup unnecessary code and fix small bug in the stage timeline view > - > > Key: SPARK-7676 > URL: https://issues.apache.org/jira/browse/SPARK-7676 > Project: Spark > Issue Type: Improvement >Reporter: Kay Ousterhout >Assignee: Apache Spark > > SPARK-7296 added a per-stage visualization to the UI. There's some unneeded > code left in this commit from the many iterations that should be removed. We > should also remove the functionality to highlight the row in the task table > when someone mouses over one of the tasks in the visualization, because there > are typically far too many tasks in the table for this to be useful (because > the user can't see which row is highlighted). > There's also a small bug where the end time is based on the last task's > launch time, rather than the last task's finish time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
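The launch-vs-finish mix-up described in this issue is easy to state concretely. A toy sketch of the corrected computation, with each task modeled as a `(launch_time, finish_time)` pair (a hypothetical minimal model, not Spark's UI code):

```python
def stage_end_time(tasks):
    # Correct: the stage ends when the *last task finishes*...
    return max(finish for _launch, finish in tasks)

def buggy_stage_end_time(tasks):
    # ...not when the last task was *launched*, which is the bug being fixed.
    return max(launch for launch, _finish in tasks)

tasks = [(0, 10), (5, 30), (8, 12)]
assert stage_end_time(tasks) == 30
assert buggy_stage_end_time(tasks) == 8  # underestimates the stage's end
```

The buggy variant truncates the timeline whenever the longest-running task is not the last one launched, which is exactly the case with stragglers.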
[jira] [Updated] (SPARK-7511) PySpark ML seed Param should be varied per class
[ https://issues.apache.org/jira/browse/SPARK-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7511: - Description: Currently, Scala's HasSeed mix-in uses a random Long as the default value for seed. Python uses 42. After discussions, we've decided to use a seed which varies based on the class name, but which is fixed instead of random. This will make behavior reproducible, rather than random, by default. Users will still be able to change the random seed. Scala's seed will be fixed in a separate patch was:Currently, Scala's HasSeed mix-in uses a random Long as the default value for seed. Python should too. (Currently, it seems to use "42") > PySpark ML seed Param should be varied per class > > > Key: SPARK-7511 > URL: https://issues.apache.org/jira/browse/SPARK-7511 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > Currently, Scala's HasSeed mix-in uses a random Long as the default value for > seed. Python uses 42. After discussions, we've decided to use a seed which > varies based on the class name, but which is fixed instead of random. This > will make behavior reproducible, rather than random, by default. Users will > still be able to change the random seed. > Scala's seed will be fixed in a separate patch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
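The per-class fixed seed described above could be sketched as follows. This is a hypothetical illustration, not the actual PySpark implementation (the real mix-in lives in PySpark's shared ML params); the class names and the use of `zlib.crc32` are assumptions made for the sketch:

```python
import zlib


class HasSeed(object):
    """Mix-in giving every subclass a fixed default seed derived from
    its class name: reproducible across runs, but different per class."""

    def __init__(self):
        # crc32 is stable across processes, unlike Python 3's salted
        # built-in hash(), so the default never changes between runs.
        self.seed = zlib.crc32(type(self).__name__.encode("utf-8"))


class KMeans(HasSeed):
    pass


class GaussianMixture(HasSeed):
    pass
```

With this scheme two runs of the same estimator get the same default seed, different estimators get different defaults, and users can still override `seed` explicitly.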
[jira] [Updated] (SPARK-7677) Enable Kafka In Scala 2.11 Build
[ https://issues.apache.org/jira/browse/SPARK-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7677: --- Description: Now that we upgraded Kafka in SPARK-2808 we can enable it in the Scala 2.11 build. > Enable Kafka In Scala 2.11 Build > > > Key: SPARK-7677 > URL: https://issues.apache.org/jira/browse/SPARK-7677 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Patrick Wendell >Assignee: Iulian Dragos > > Now that we upgraded Kafka in SPARK-2808 we can enable it in the Scala 2.11 > build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7511) PySpark ML seed Param should be varied per class
[ https://issues.apache.org/jira/browse/SPARK-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7511: - Summary: PySpark ML seed Param should be varied per class (was: PySpark ML seed Param should be random by default) > PySpark ML seed Param should be varied per class > > > Key: SPARK-7511 > URL: https://issues.apache.org/jira/browse/SPARK-7511 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > Currently, Scala's HasSeed mix-in uses a random Long as the default value for > seed. Python should too. (Currently, it seems to use "42") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input
[ https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7651: - Fix Version/s: 1.3.2 > PySpark GMM predict, predictSoft should fail on bad input > - > > Key: SPARK-7651 > URL: https://issues.apache.org/jira/browse/SPARK-7651 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Meethu Mathew >Priority: Minor > Fix For: 1.3.2, 1.4.0 > > > In PySpark, GaussianMixtureModel predict and predictSoft test if the argument > is an RDD and operate correctly if so. But if the argument is not an RDD, > they fail silently, returning nothing. > [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176] > Instead, they should raise errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
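The fix the ticket asks for amounts to raising an error on non-RDD input instead of silently returning nothing. A minimal sketch of the intended behavior (here `RDD` is a stand-in class for this example, not the real pyspark.rdd.RDD, and the returned placeholders are not the real results):

```python
class RDD(object):
    """Stand-in for pyspark.rdd.RDD, used only in this sketch."""


class GaussianMixtureModel(object):
    def predict(self, x):
        if isinstance(x, RDD):
            return x  # placeholder for the real per-point cluster labels
        # Previously a non-RDD argument fell through and returned None
        # silently; raising makes the misuse visible to the caller.
        raise TypeError("x should be an RDD, got %s" % type(x).__name__)

    def predictSoft(self, x):
        if isinstance(x, RDD):
            return x  # placeholder for the real membership matrix
        raise TypeError("x should be an RDD, got %s" % type(x).__name__)
```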
[jira] [Created] (SPARK-7677) Enable Kafka In Scala 2.11 Build
Patrick Wendell created SPARK-7677: -- Summary: Enable Kafka In Scala 2.11 Build Key: SPARK-7677 URL: https://issues.apache.org/jira/browse/SPARK-7677 Project: Spark Issue Type: Sub-task Reporter: Patrick Wendell Assignee: Iulian Dragos -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7543) Break dataframe.py into multiple files
[ https://issues.apache.org/jira/browse/SPARK-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7543: --- Assignee: Apache Spark (was: Davies Liu) > Break dataframe.py into multiple files > -- > > Key: SPARK-7543 > URL: https://issues.apache.org/jira/browse/SPARK-7543 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > dataframe.py is getting large again. We should just make each class its own > file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7543) Break dataframe.py into multiple files
[ https://issues.apache.org/jira/browse/SPARK-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7543: --- Assignee: Davies Liu (was: Apache Spark) > Break dataframe.py into multiple files > -- > > Key: SPARK-7543 > URL: https://issues.apache.org/jira/browse/SPARK-7543 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > > dataframe.py is getting large again. We should just make each class its own > file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7543) Break dataframe.py into multiple files
[ https://issues.apache.org/jira/browse/SPARK-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546228#comment-14546228 ] Apache Spark commented on SPARK-7543: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6201 > Break dataframe.py into multiple files > -- > > Key: SPARK-7543 > URL: https://issues.apache.org/jira/browse/SPARK-7543 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > > dataframe.py is getting large again. We should just make each class its own > file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7676) Cleanup unnecessary code and fix small bug in the stage timeline view
[ https://issues.apache.org/jira/browse/SPARK-7676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-7676: -- Summary: Cleanup unnecessary code and fix small bug in the stage timeline view (was: Cleanup unnecessary code in the stage timeline view) > Cleanup unnecessary code and fix small bug in the stage timeline view > - > > Key: SPARK-7676 > URL: https://issues.apache.org/jira/browse/SPARK-7676 > Project: Spark > Issue Type: Improvement >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > > SPARK-7296 added a per-stage visualization to the UI. There's some unneeded > code left in this commit from the many iterations that should be removed. We > should also remove the functionality to highlight the row in the task table > when someone mouses over one of the tasks in the visualization, because there > are typically far too many tasks in the table for this to be useful (because > the user can't see which row is highlighted). > There's also a small bug where the end time is based on the last task's > launch time, rather than the last task's finish time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7676) Cleanup unnecessary code in the stage timeline view
[ https://issues.apache.org/jira/browse/SPARK-7676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-7676: -- Description: SPARK-7296 added a per-stage visualization to the UI. There's some unneeded code left in this commit from the many iterations that should be removed. We should also remove the functionality to highlight the row in the task table when someone mouses over one of the tasks in the visualization, because there are typically far too many tasks in the table for this to be useful (because the user can't see which row is highlighted). There's also a small bug where the end time is based on the last task's launch time, rather than the last task's finish time. was:SPARK-7296 added a per-stage visualization to the UI. There's some unneeded code left in this commit from the many iterations that should be removed. We should also remove the functionality to highlight the row in the task table when someone mouses over one of the tasks in the visualization, because there are typically far too many tasks in the table for this to be useful (because the user can't see which row is highlighted). > Cleanup unnecessary code in the stage timeline view > --- > > Key: SPARK-7676 > URL: https://issues.apache.org/jira/browse/SPARK-7676 > Project: Spark > Issue Type: Improvement >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > > SPARK-7296 added a per-stage visualization to the UI. There's some unneeded > code left in this commit from the many iterations that should be removed. We > should also remove the functionality to highlight the row in the task table > when someone mouses over one of the tasks in the visualization, because there > are typically far too many tasks in the table for this to be useful (because > the user can't see which row is highlighted). > There's also a small bug where the end time is based on the last task's > launch time, rather than the last task's finish time. 
[jira] [Commented] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546216#comment-14546216 ] Tathagata Das commented on SPARK-7661: -- What do you mean by "currently the logic is (N + 1) executors"? Is it documented somewhere that you have to have exactly N + 1? > Support for dynamic allocation of executors in Kinesis Spark Streaming > -- > > Key: SPARK-7661 > URL: https://issues.apache.org/jira/browse/SPARK-7661 > Project: Spark > Issue Type: New Feature > Components: Streaming >Affects Versions: 1.3.1 > Environment: AWS-EMR >Reporter: Murtaza Kanchwala > > Currently the logic for the no. of executors is (N + 1), where N is no. of > shards in a Kinesis Stream. > My Requirement is that if I use this Resharding util for Amazon Kinesis : > Amazon Kinesis Resharding : > https://github.com/awslabs/amazon-kinesis-scaling-utils > Then there should be some way to allocate executors on the basis of no. of > shards directly (for Spark Streaming only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7655) Akka timeout exception
[ https://issues.apache.org/jira/browse/SPARK-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546204#comment-14546204 ] Apache Spark commented on SPARK-7655: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/6200 > Akka timeout exception > -- > > Key: SPARK-7655 > URL: https://issues.apache.org/jira/browse/SPARK-7655 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Yin Huai >Assignee: Shixiong Zhu >Priority: Blocker > > I got the following exception when I was running a query with broadcast join. > {code} > 15/05/15 01:15:49 [WARN] AkkaRpcEndpointRef: Error sending message [message = > UpdateBlockInfo(BlockManagerId(driver, 10.0.171.162, > 54870),broadcast_758_piece0,StorageLevel(false, false, false, false, > 1),0,0,0)] in 1 attempts > java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.storage.BlockManagerMaster.updateBlockInfo(BlockManagerMaster.scala:58) > at > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$tryToReportBlockStatus(BlockManager.scala:374) > at > org.apache.spark.storage.BlockManager.reportBlockStatus(BlockManager.scala:350) > at > org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1107) > at > org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1083) > at > 
org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1083) > at scala.collection.immutable.Set$Set2.foreach(Set.scala:94) > at > org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1083) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:65) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:78) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
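The "in 1 attempts" warning above comes from askWithRetry, which wraps each RPC in a bounded wait and retries when that wait times out. Sketched generically (function and parameter names here are illustrative, not Spark's actual Scala API):

```python
import time


def ask_with_retry(ask, max_attempts=3, pause_s=0.0):
    """Call ask() up to max_attempts times; re-raise the last
    TimeoutError only after every attempt has failed."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return ask()
        except TimeoutError as e:
            # e.g. "Futures timed out after [120 seconds]" in the log above
            last_error = e
            if pause_s:
                time.sleep(pause_s)
    raise last_error
```

Under this pattern a single slow response is absorbed by a retry, but a persistently unresponsive endpoint still surfaces the timeout to the caller, which is what the stack trace above shows.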
[jira] [Commented] (SPARK-6917) Broken data returned to PySpark dataframe if any large numbers used in Scala land
[ https://issues.apache.org/jira/browse/SPARK-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546201#comment-14546201 ] Yin Huai commented on SPARK-6917: - [~adrian-wang] Can you take a look? > Broken data returned to PySpark dataframe if any large numbers used in Scala > land > - > > Key: SPARK-6917 > URL: https://issues.apache.org/jira/browse/SPARK-6917 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0 > Environment: Spark 1.3, Python 2.7.6, Scala 2.10 >Reporter: Harry Brundage >Assignee: Yin Huai >Priority: Critical > Attachments: part-r-1.parquet > > > When trying to access data stored in a Parquet file with an INT96 column > (read: TimestampType() encoded for Impala), if the INT96 column is included > in the fetched data, other, smaller numeric types come back broken. > {code} > In [1]: > sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").select('int_col', > 'long_col').first() > Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10')) > In [2]: > sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").first() > Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, > str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, > date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo= 'America/Toronto' EDT-1 day, 19:00:00 DST>)) > {code} > Note the {{\{u'__class__': u'scala.runtime.BoxedUnit'}}} values being > returned for the {{int_col}} and {{long_col}} columns in the second loop > above. This only happens if I select the {{date_col}} which is stored as > {{INT96}}. > I don't know much about Scala boxing, but I assume that somehow by including > numeric columns that are bigger than a machine word I trigger some different, > slower execution path somewhere that boxes stuff and causes this problem. > If anyone could give me any pointers on where to get started fixing this I'd > be happy to dive in! 
[jira] [Updated] (SPARK-6820) Convert NAs to null type in SparkR DataFrames
[ https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6820: - Assignee: Qian Huang > Convert NAs to null type in SparkR DataFrames > - > > Key: SPARK-6820 > URL: https://issues.apache.org/jira/browse/SPARK-6820 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman >Assignee: Qian Huang >Priority: Critical > > While converting RDD or local R DataFrame to a SparkR DataFrame we need to > handle missing values or NAs. > We should convert NAs to SparkSQL's null type to handle the conversion > correctly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6820) Convert NAs to null type in SparkR DataFrames
[ https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546174#comment-14546174 ] Shivaram Venkataraman commented on SPARK-6820: -- PR open at https://github.com/apache/spark/pull/6190 > Convert NAs to null type in SparkR DataFrames > - > > Key: SPARK-6820 > URL: https://issues.apache.org/jira/browse/SPARK-6820 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman >Priority: Critical > > While converting RDD or local R DataFrame to a SparkR DataFrame we need to > handle missing values or NAs. > We should convert NAs to SparkSQL's null type to handle the conversion > correctly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
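The NA-to-null handling requested here happens during serialization from R to the JVM; expressed in Python terms the rule is roughly the following. This is a hedged sketch of the conversion idea only, not SparkR's actual code, and the assumption that R's numeric NA surfaces as NaN after serialization is specific to this illustration:

```python
import math


def na_to_null(value):
    """Map R-style missing values to SQL null (None)."""
    if value is None:
        return None
    # R's numeric NA is a NaN payload, so a NaN double arriving from R
    # is folded into SQL null here as well.
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def convert_row(row):
    # Apply the conversion field-by-field before handing the row to SQL.
    return [na_to_null(v) for v in row]
```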
[jira] [Created] (SPARK-7676) Cleanup unnecessary code in the stage timeline view
Kay Ousterhout created SPARK-7676: - Summary: Cleanup unnecessary code in the stage timeline view Key: SPARK-7676 URL: https://issues.apache.org/jira/browse/SPARK-7676 Project: Spark Issue Type: Improvement Reporter: Kay Ousterhout Assignee: Kay Ousterhout SPARK-7296 added a per-stage visualization to the UI. There's some unneeded code left in this commit from the many iterations that should be removed. We should also remove the functionality to highlight the row in the task table when someone mouses over one of the tasks in the visualization, because there are typically far too many tasks in the table for this to be useful (because the user can't see which row is highlighted). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7226) Support math functions in R DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-7226. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6170 [https://github.com/apache/spark/pull/6170] > Support math functions in R DataFrame > - > > Key: SPARK-7226 > URL: https://issues.apache.org/jira/browse/SPARK-7226 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Reynold Xin >Assignee: Qian Huang >Priority: Critical > Fix For: 1.4.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6418) Add simple per-stage visualization to the UI
[ https://issues.apache.org/jira/browse/SPARK-6418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-6418: -- Fix Version/s: 1.4.0 > Add simple per-stage visualization to the UI > > > Key: SPARK-6418 > URL: https://issues.apache.org/jira/browse/SPARK-6418 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Kay Ousterhout > Fix For: 1.4.0 > > Attachments: Screen Shot 2015-03-18 at 6.13.04 PM.png > > > Visualizing how tasks in a stage spend their time can be very helpful to > understanding performance. Many folks have started using the visualization > tools here: https://github.com/kayousterhout/trace-analysis (see the README > at the bottom) to analyze their jobs after they've finished running, but it > would be great if this functionality were natively integrated into Spark's UI. > I'd propose adding a relatively simple visualization to the stage detail > page, that's hidden by default but that users can view by clicking on a > drop-down menu. The plan is to implement this using D3; a mock up of how > this would look (that uses D3) is attached. One change we'll make for the > initial implementation, compared to the attached visualization, is tasks will > be sorted by start time. > This is intended to be a much simpler and more limited version of SPARK-3468 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6439) Show per-task metrics when you hover over a task in the web UI visualization
[ https://issues.apache.org/jira/browse/SPARK-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-6439. --- Resolution: Fixed Fix Version/s: 1.4.0 > Show per-task metrics when you hover over a task in the web UI visualization > > > Key: SPARK-6439 > URL: https://issues.apache.org/jira/browse/SPARK-6439 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Kay Ousterhout > Fix For: 1.4.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6418) Add simple per-stage visualization to the UI
[ https://issues.apache.org/jira/browse/SPARK-6418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-6418. --- Resolution: Fixed Assignee: (was: Pradyumn Shroff) > Add simple per-stage visualization to the UI > > > Key: SPARK-6418 > URL: https://issues.apache.org/jira/browse/SPARK-6418 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Kay Ousterhout > Attachments: Screen Shot 2015-03-18 at 6.13.04 PM.png > > > Visualizing how tasks in a stage spend their time can be very helpful to > understanding performance. Many folks have started using the visualization > tools here: https://github.com/kayousterhout/trace-analysis (see the README > at the bottom) to analyze their jobs after they've finished running, but it > would be great if this functionality were natively integrated into Spark's UI. > I'd propose adding a relatively simple visualization to the stage detail > page, that's hidden by default but that users can view by clicking on a > drop-down menu. The plan is to implement this using D3; a mock up of how > this would look (that uses D3) is attached. One change we'll make for the > initial implementation, compared to the attached visualization, is tasks will > be sorted by start time. > This is intended to be a much simpler and more limited version of SPARK-3468 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6438) Indicate which tasks ran on which executors in per-stage visualization in UI
[ https://issues.apache.org/jira/browse/SPARK-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-6438. --- Resolution: Fixed > Indicate which tasks ran on which executors in per-stage visualization in UI > > > Key: SPARK-6438 > URL: https://issues.apache.org/jira/browse/SPARK-6438 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Kay Ousterhout > > One way to do this would be to have a filter for the visualization, where you > could filter to see only the tasks for a particular executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7080) Binary processing based aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-7080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546163#comment-14546163 ] Michael Armbrust commented on SPARK-7080: - That sounds like a good idea to me. > Binary processing based aggregate operator > -- > > Key: SPARK-7080 > URL: https://issues.apache.org/jira/browse/SPARK-7080 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Josh Rosen > Fix For: 1.4.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7296) Timeline view for Stage page
[ https://issues.apache.org/jira/browse/SPARK-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-7296. --- Resolution: Fixed Fix Version/s: 1.4.0 Target Version/s: 1.4.0 (was: 1.4.0, 1.5.0) > Timeline view for Stage page > > > Key: SPARK-7296 > URL: https://issues.apache.org/jira/browse/SPARK-7296 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Patrick Wendell >Assignee: Kousuke Saruta > Fix For: 1.4.0 > > > May be a stretch for 1.4 but would like to see if we can get it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6917) Broken data returned to PySpark dataframe if any large numbers used in Scala land
[ https://issues.apache.org/jira/browse/SPARK-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-6917: -- Assignee: Yin Huai (was: Davies Liu) > Broken data returned to PySpark dataframe if any large numbers used in Scala > land > - > > Key: SPARK-6917 > URL: https://issues.apache.org/jira/browse/SPARK-6917 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0 > Environment: Spark 1.3, Python 2.7.6, Scala 2.10 >Reporter: Harry Brundage >Assignee: Yin Huai >Priority: Critical > Attachments: part-r-1.parquet > > > When trying to access data stored in a Parquet file with an INT96 column > (read: TimestampType() encoded for Impala), if the INT96 column is included > in the fetched data, other, smaller numeric types come back broken. > {code} > In [1]: > sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").select('int_col', > 'long_col').first() > Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10')) > In [2]: > sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").first() > Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, > str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, > date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo= 'America/Toronto' EDT-1 day, 19:00:00 DST>)) > {code} > Note the {{\{u'__class__': u'scala.runtime.BoxedUnit'}}} values being > returned for the {{int_col}} and {{long_col}} columns in the second loop > above. This only happens if I select the {{date_col}} which is stored as > {{INT96}}. > I don't know much about Scala boxing, but I assume that somehow by including > numeric columns that are bigger than a machine word I trigger some different, > slower execution path somewhere that boxes stuff and causes this problem. > If anyone could give me any pointers on where to get started fixing this I'd > be happy to dive in! 
[jira] [Comment Edited] (SPARK-6917) Broken data returned to PySpark dataframe if any large numbers used in Scala land
[ https://issues.apache.org/jira/browse/SPARK-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546160#comment-14546160 ] Davies Liu edited comment on SPARK-6917 at 5/15/15 8:58 PM: [~yhuai] It's a bug in SQL or Parquet library: {code} scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet") res1: org.apache.spark.sql.DataFrame = [long_col: decimal(18,0), str_col: string, int_col: decimal(18,0), date_col: timestamp] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").first() res2: org.apache.spark.sql.Row = [(),Hello!,(),0001-12-31 16:00:00.0] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").select("long_col").first() res3: org.apache.spark.sql.Row = [10] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").select("long_col", "date_col").first() res4: org.apache.spark.sql.Row = [(),0001-12-31 16:00:00.0] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").select("date_col").first() res5: org.apache.spark.sql.Row = [0001-12-31 16:00:00.0] {code} was (Author: davies): [~yhuai] It's a bug in SQL or Parquet library: [[code]] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 
res1: org.apache.spark.sql.DataFrame = [long_col: decimal(18,0), str_col: string, int_col: decimal(18,0), date_col: timestamp] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").first() res2: org.apache.spark.sql.Row = [(),Hello!,(),0001-12-31 16:00:00.0] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").select("long_col").first() res3: org.apache.spark.sql.Row = [10] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").select("long_col", "date_col").first() res4: org.apache.spark.sql.Row = [(),0001-12-31 16:00:00.0] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").select("date_col").first() res5: org.apache.spark.sql.Row = [0001-12-31 16:00:00.0] [[code]] > Broken data returned to PySpark dataframe if any large numbers used in Scala > land > - > > Key: SPARK-6917 > URL: https://issues.apache.org/jira/browse/SPARK-6917 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0 > Environment: Spark 1.3, Python 2.7.6, Scala 2.10 >Reporter: Harry Brundage >Assignee: Davies Liu >Priority: Critical > Attachments: part-r-1.parquet > > > When trying to access data stored in a Parquet file with an INT96 column > (read: TimestampType() encoded for Impala), if the INT96 column is included > in the fetched data, other, smaller numeric types come back broken. 
> {code} > In [1]: > sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").select('int_col', > 'long_col').first() > Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10')) > In [2]: > sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").first() > Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, > str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, > date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo=<DstTzInfo 'America/Toronto' EDT-1 day, 19:00:00 DST>)) > {code} > Note the {{\{u'__class__': u'scala.runtime.BoxedUnit'}}} values being > returned for the {{int_col}} and {{long_col}} columns in the second loop > above. This only happens if I select the {{date_col}} which is stored as > {{INT96}}. > I don't know much about Scala boxing, but I assume that somehow by including > numeric columns that are bigger than a machine word I trigger some different, > slower execution path somewhere that boxes stuff and causes this problem. > If anyone could give me any pointers on where to get started fixing this I'd > be happy to dive in! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6917) Broken data returned to PySpark dataframe if any large numbers used in Scala land
[ https://issues.apache.org/jira/browse/SPARK-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546160#comment-14546160 ] Davies Liu commented on SPARK-6917: --- [~yhuai] It's a bug in SQL or Parquet library: {code} scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. res1: org.apache.spark.sql.DataFrame = [long_col: decimal(18,0), str_col: string, int_col: decimal(18,0), date_col: timestamp] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").first() res2: org.apache.spark.sql.Row = [(),Hello!,(),0001-12-31 16:00:00.0] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").select("long_col").first() res3: org.apache.spark.sql.Row = [10] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").select("long_col", "date_col").first() res4: org.apache.spark.sql.Row = [(),0001-12-31 16:00:00.0] scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-1.parquet").select("date_col").first() res5: org.apache.spark.sql.Row = [0001-12-31 16:00:00.0] {code} > Broken data returned to PySpark dataframe if any large numbers used in Scala > land > - > > Key: SPARK-6917 > URL: https://issues.apache.org/jira/browse/SPARK-6917 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0 > Environment: Spark 1.3, Python 2.7.6, Scala 2.10 >Reporter: Harry Brundage >Assignee: Davies Liu >Priority: Critical > Attachments: part-r-1.parquet > > > When trying to access data stored in a Parquet file with an INT96 column > (read: TimestampType() encoded for Impala), if the INT96 column is included > in the fetched data, other, smaller numeric types come back broken. 
> {code} > In [1]: > sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").select('int_col', > 'long_col').first() > Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10')) > In [2]: > sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").first() > Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, > str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, > date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo=<DstTzInfo 'America/Toronto' EDT-1 day, 19:00:00 DST>)) > {code} > Note the {{\{u'__class__': u'scala.runtime.BoxedUnit'}}} values being > returned for the {{int_col}} and {{long_col}} columns in the second loop > above. This only happens if I select the {{date_col}} which is stored as > {{INT96}}. > I don't know much about Scala boxing, but I assume that somehow by including > numeric columns that are bigger than a machine word I trigger some different, > slower execution path somewhere that boxes stuff and causes this problem. > If anyone could give me any pointers on where to get started fixing this I'd > be happy to dive in!
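Until the underlying bug is fixed, a caller can at least detect the broken values. The sketch below (plain Python, illustrative only — `scrub_boxed_unit` is a hypothetical helper, not part of PySpark) replaces the `{u'__class__': u'scala.runtime.BoxedUnit'}` markers shown in the repro above with None, so downstream code fails at the access site instead of silently carrying broken values:

```python
# The marker dict that shows up in place of numeric values when the
# INT96 column is included in the fetched row (see the repro above).
BOXED_UNIT = {u'__class__': u'scala.runtime.BoxedUnit'}

def scrub_boxed_unit(row_dict):
    """Return a copy of row_dict with BoxedUnit markers replaced by None."""
    return {k: (None if v == BOXED_UNIT else v) for k, v in row_dict.items()}

# Shape of a broken row from the issue description:
broken = {u'long_col': {u'__class__': u'scala.runtime.BoxedUnit'},
          u'str_col': u'Hello!'}
print(scrub_boxed_unit(broken))  # long_col becomes None, str_col is kept
```

This is a client-side guard only; the comment above shows the real fix has to happen in the SQL/Parquet read path, since selecting the numeric columns without `date_col` already returns correct values.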
[jira] [Created] (SPARK-7675) PySpark spark.ml Params type conversions
Joseph K. Bradley created SPARK-7675: Summary: PySpark spark.ml Params type conversions Key: SPARK-7675 URL: https://issues.apache.org/jira/browse/SPARK-7675 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Joseph K. Bradley Priority: Minor Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method. A possible fix would be to add a method "_checkType" to PySpark's Param class which checks the type, reports an error if needed, and converts types when relevant (e.g., int to float, or a scipy matrix to an array). The Java wrapper method which copies params to Scala can call this method when available.
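The proposed "_checkType" idea could look roughly like the sketch below. This is plain Python and purely illustrative — the class and method names mirror the issue's wording, not PySpark's actual API — showing a Param that carries an expected type and coerces compatible values (e.g. int to float) before they would be forwarded to the JVM setter:

```python
class Param(object):
    """Illustrative stand-in for a PySpark ML Param with type checking."""

    def __init__(self, name, expected_type):
        self.name = name
        self.expected_type = expected_type

    def _check_type(self, value):
        """Return value coerced to expected_type, or raise TypeError."""
        if isinstance(value, self.expected_type):
            return value
        # Coerce only when the conversion is safe, e.g. int -> float,
        # so Normalizer's "p" param could accept 2 as well as 2.0.
        if self.expected_type is float and isinstance(value, int):
            return float(value)
        raise TypeError("Param %s expects %s, got %s"
                        % (self.name, self.expected_type.__name__,
                           type(value).__name__))

p = Param("p", float)
print(p._check_type(2))    # int is accepted and converted to 2.0
print(p._check_type(2.0))  # float passes through unchanged
```

A string like "2" would still raise TypeError here, which matches the issue's intent: convert when relevant, report an error otherwise.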
[jira] [Updated] (SPARK-6917) Broken data returned to PySpark dataframe if any large numbers used in Scala land
[ https://issues.apache.org/jira/browse/SPARK-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-6917: -- Priority: Critical (was: Major) > Broken data returned to PySpark dataframe if any large numbers used in Scala > land > - > > Key: SPARK-6917 > URL: https://issues.apache.org/jira/browse/SPARK-6917 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0 > Environment: Spark 1.3, Python 2.7.6, Scala 2.10 >Reporter: Harry Brundage >Assignee: Davies Liu >Priority: Critical > Attachments: part-r-1.parquet > > > When trying to access data stored in a Parquet file with an INT96 column > (read: TimestampType() encoded for Impala), if the INT96 column is included > in the fetched data, other, smaller numeric types come back broken. > {code} > In [1]: > sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").select('int_col', > 'long_col').first() > Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10')) > In [2]: > sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").first() > Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, > str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, > date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo=<DstTzInfo 'America/Toronto' EDT-1 day, 19:00:00 DST>)) > {code} > Note the {{\{u'__class__': u'scala.runtime.BoxedUnit'}}} values being > returned for the {{int_col}} and {{long_col}} columns in the second loop > above. This only happens if I select the {{date_col}} which is stored as > {{INT96}}. > I don't know much about Scala boxing, but I assume that somehow by including > numeric columns that are bigger than a machine word I trigger some different, > slower execution path somewhere that boxes stuff and causes this problem. > If anyone could give me any pointers on where to get started fixing this I'd > be happy to dive in! 
[jira] [Updated] (SPARK-7671) Fix wrong URLs in MLlib Data Types Documentation
[ https://issues.apache.org/jira/browse/SPARK-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7671: - Component/s: MLlib > Fix wrong URLs in MLlib Data Types Documentation > > > Key: SPARK-7671 > URL: https://issues.apache.org/jira/browse/SPARK-7671 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib > Environment: Ubuntu 14.04. Apache Mesos in cluster mode with HDFS > from cloudera 2.6.0-cdh5.4.0. >Reporter: Favio Vázquez >Priority: Trivial > Labels: Documentation, Fix, MLlib, URL > > There is a mistake in the URL of Matrices in the MLlib Data Types > documentation (Local matrix Scala section): the URL points to > https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices > which is a mistake, since Matrices is an object that implements factory > methods for Matrix and has no companion class. The correct link > should point to > https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$ > There is another mistake, in the Local Vector sections for Scala, Java and > Python. > In the Scala section the URL of Vectors points to the trait Vector > (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vector) > and not to the factory methods implemented in Vectors. 
> The correct link should be: > https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$ > In the Java section the URL of Vectors points to the interface Vector > (https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vector.html) > and not to the class Vectors. > The correct link should be: > https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vectors.html > In the Python section the URL of Vectors points to the class Vector > (https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vector) > and not to the class Vectors. > The correct link should be: > https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors
[jira] [Created] (SPARK-7674) R-like stats for ML models
Joseph K. Bradley created SPARK-7674: Summary: R-like stats for ML models Key: SPARK-7674 URL: https://issues.apache.org/jira/browse/SPARK-7674 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical This is an umbrella JIRA for supporting ML model summaries and statistics, following the example of R's summary() and plot() functions. [Design doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing] From the design doc: {quote} R and its well-established packages provide extensive functionality for inspecting a model and its results. This inspection is critical to interpreting, debugging and improving models. R is arguably a gold standard for a statistics/ML library, so this doc largely attempts to imitate it. The challenge we face is supporting similar functionality, but on big (distributed) data. Data size makes both efficient computation and meaningful displays/summaries difficult. R model and result summaries generally take 2 forms: * summary(model): Display text with information about the model and results on data * plot(model): Display plots about the model and results We aim to provide both of these types of information. Visualization for the plottable results will not be supported in MLlib itself, but we can provide results in a form which can be plotted easily with other tools. {quote}
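As a rough illustration of the "summary(model)" shape the design doc describes, here is a minimal plain-Python sketch of a summary object computed once from labels and predictions, exposing R-style statistics (residuals and R², as in R's summary.lm). All names here (RegressionSummary, r2, residuals) are hypothetical; the real MLlib API is defined by the linked design doc, not by this sketch:

```python
class RegressionSummary(object):
    """Illustrative summary object: precompute R-style fit statistics."""

    def __init__(self, labels, predictions):
        # Residuals: observed minus predicted, as reported by R's summary().
        self.residuals = [y - p for y, p in zip(labels, predictions)]
        mean_y = sum(labels) / float(len(labels))
        ss_tot = sum((y - mean_y) ** 2 for y in labels)
        ss_res = sum(r ** 2 for r in self.residuals)
        # R^2 = 1 - SS_res / SS_tot
        self.r2 = 1.0 - ss_res / ss_tot if ss_tot else float('nan')

s = RegressionSummary([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(s.r2)  # perfect fit -> 1.0
```

Precomputing such statistics once, then exposing them as fields, fits the doc's note that plottable results (here, the residuals) should be returned in a form that other tools can visualize, rather than plotted by MLlib itself.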
[jira] [Updated] (SPARK-6806) SparkR examples in programming guide
[ https://issues.apache.org/jira/browse/SPARK-6806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-6806: -- Priority: Critical (was: Blocker) > SparkR examples in programming guide > > > Key: SPARK-6806 > URL: https://issues.apache.org/jira/browse/SPARK-6806 > Project: Spark > Issue Type: New Feature > Components: Documentation, SparkR >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > Add R examples for Spark Core and DataFrame programming guide