[jira] [Commented] (SPARK-8503) SizeEstimator returns negative value for recursive data structures

2015-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595014#comment-14595014
 ] 

Sean Owen commented on SPARK-8503:
--

That looks close to Long.MIN_VALUE. It could be overflow, but I suspect 
something in the code returns Long.MIN_VALUE as a size to signal some kind of 
error condition, and that value is then added into a sum.

This doesn't really contain any information that helps reproduce it, though. If 
you can't describe your exact data structure, what does debugging reveal about 
when and where this happens?
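
(For reference, a quick arithmetic check of the hypothesis above; this is a sketch added here, not taken from the thread. The reported size sits only 84424 above Long.MIN_VALUE, which fits a MIN_VALUE sentinel being added into a running sum.)
{code}
// Sketch only: how far is the reported size from Long.MIN_VALUE?
val reported = -9223372036854691384L
val delta = reported - Long.MinValue   // 84424
// A small positive size added onto a Long.MinValue sentinel reproduces it:
assert(Long.MinValue + delta == reported)
{code}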

 SizeEstimator returns negative value for recursive data structures
 --

 Key: SPARK-8503
 URL: https://issues.apache.org/jira/browse/SPARK-8503
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Ilya Rakitsin

 When estimating the size of recursive data structures such as graphs, with 
 transient fields referencing one another, SizeEstimator may return a negative 
 value if the structure is big enough.
 This then affects the logic of other components, e.g. 
 SizeTracker#takeSample(), and may lead to incorrect behavior and exceptions 
 like:
 java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -9223372036854691384
   at scala.Predef$.require(Predef.scala:233)
   at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:810)
   at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637)
   at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:991)
   at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
   at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
   at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
   at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
   at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)






[jira] [Updated] (SPARK-8511) Modify a test to remove a saved model in `regression.py`

2015-06-21 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-8511:
---
Description: 
According to the Python reference, {{os.removedirs}} doesn't work if there 
are any files in the directory we want to remove.
https://docs.python.org/3/library/os.html#os.removedirs

Instead, using {{shutil.rmtree()}} would be better for removing the temporary 
directory used when testing model saving.
https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137

  was:
According to the reference of python, {{os.removedirs}} doesn't work if there 
are any files in the directory we want to remove.
https://docs.python.org/3/library/os.html#os.removedirs

Instead of that, using `shutil.rmtree()` would be better to remove a temporary 
directory to test for saving model.
https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137


 Modify a test to remove a saved model in `regression.py`
 

 Key: SPARK-8511
 URL: https://issues.apache.org/jira/browse/SPARK-8511
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Yu Ishikawa

 According to the Python reference, {{os.removedirs}} doesn't work if there 
 are any files in the directory we want to remove.
 https://docs.python.org/3/library/os.html#os.removedirs
 Instead, using {{shutil.rmtree()}} would be better for removing the temporary 
 directory used when testing model saving.
 https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137






[jira] [Commented] (SPARK-6112) Provide external block store support through HDFS RAM_DISK

2015-06-21 Thread Bogdan Ghit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595034#comment-14595034
 ] 

Bogdan Ghit commented on SPARK-6112:


This is my configuration:

1. My tmpfs is mounted on /dev/shm.
2. dfs.datanode.data.dir = /local/bghit/myhdfs,[RAM_DISK]/dev/shm/ramdisk
3. dfs.datanode.max.locked.memory=10

The amount of memory I can lock is set to unlimited in /etc/security/limits.conf, 
so ulimit -l outputs unlimited. However, I get the exception "Cannot start 
datanode because the configured max locked memory size 
(dfs.datanode.max.locked.memory) is greater than zero and native code is not 
available". Any ideas why?

Regarding my previous comment, the documentation still has offHeap instead of 
externalBlock.







 Provide external block store support through HDFS RAM_DISK
 --

 Key: SPARK-6112
 URL: https://issues.apache.org/jira/browse/SPARK-6112
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager
Reporter: Zhan Zhang
 Attachments: SparkOffheapsupportbyHDFS.pdf


 The HDFS Lazy_Persist policy provides the possibility of caching RDDs off-heap 
 in HDFS. We may want to provide a capability similar to Tachyon by leveraging 
 the HDFS RAM_DISK feature, for user environments that do not have Tachyon 
 deployed. With this feature, it potentially becomes possible to share RDDs in 
 memory across different jobs, and even with jobs other than Spark, and to 
 avoid RDD recomputation if executors crash. 
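
For illustration, a minimal sketch of how off-heap caching is requested from user 
code today (assuming {{sc}} is an existing SparkContext, as in spark-shell, and 
that in 1.4 the OFF_HEAP level goes through the Tachyon-backed external block 
store); the proposal would let HDFS RAM_DISK back the same storage level instead:
{code}
import org.apache.spark.storage.StorageLevel

// Sketch: cache an RDD off-heap via the external block store.
val data = sc.parallelize(1 to 1000000)
data.persist(StorageLevel.OFF_HEAP)
data.count()
{code}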






[jira] [Created] (SPARK-8511) Modify a test to remove a saved model in `regression.py`

2015-06-21 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-8511:
--

 Summary: Modify a test to remove a saved model in `regression.py`
 Key: SPARK-8511
 URL: https://issues.apache.org/jira/browse/SPARK-8511
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Yu Ishikawa


According to the Python reference, {{os.removedirs}} doesn't work if there 
are any files in the directory we want to remove.
https://docs.python.org/3/library/os.html#os.removedirs

Instead, using `shutil.rmtree()` would be better for removing the temporary 
directory used when testing model saving.
https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137






[jira] [Assigned] (SPARK-8511) Modify a test to remove a saved model in `regression.py`

2015-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8511:
---

Assignee: (was: Apache Spark)

 Modify a test to remove a saved model in `regression.py`
 

 Key: SPARK-8511
 URL: https://issues.apache.org/jira/browse/SPARK-8511
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Yu Ishikawa

 According to the Python reference, {{os.removedirs}} doesn't work if there 
 are any files in the directory we want to remove.
 https://docs.python.org/3/library/os.html#os.removedirs
 Instead, using {{shutil.rmtree()}} would be better for removing the temporary 
 directory used when testing model saving.
 https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137






[jira] [Commented] (SPARK-8472) Python API for DCT

2015-06-21 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595125#comment-14595125
 ] 

Yu Ishikawa commented on SPARK-8472:


Alright. Thanks!

 Python API for DCT
 --

 Key: SPARK-8472
 URL: https://issues.apache.org/jira/browse/SPARK-8472
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Feynman Liang
Priority: Minor

 We need to implement a wrapper for enabling the DCT feature transformer to be 
 used from the Python API






[jira] [Assigned] (SPARK-8511) Modify a test to remove a saved model in `regression.py`

2015-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8511:
---

Assignee: Apache Spark

 Modify a test to remove a saved model in `regression.py`
 

 Key: SPARK-8511
 URL: https://issues.apache.org/jira/browse/SPARK-8511
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Yu Ishikawa
Assignee: Apache Spark

 According to the Python reference, {{os.removedirs}} doesn't work if there 
 are any files in the directory we want to remove.
 https://docs.python.org/3/library/os.html#os.removedirs
 Instead, using {{shutil.rmtree()}} would be better for removing the temporary 
 directory used when testing model saving.
 https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137






[jira] [Commented] (SPARK-8511) Modify a test to remove a saved model in `regression.py`

2015-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595124#comment-14595124
 ] 

Apache Spark commented on SPARK-8511:
-

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/6926

 Modify a test to remove a saved model in `regression.py`
 

 Key: SPARK-8511
 URL: https://issues.apache.org/jira/browse/SPARK-8511
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Yu Ishikawa

 According to the Python reference, {{os.removedirs}} doesn't work if there 
 are any files in the directory we want to remove.
 https://docs.python.org/3/library/os.html#os.removedirs
 Instead, using {{shutil.rmtree()}} would be better for removing the temporary 
 directory used when testing model saving.
 https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137






[jira] [Commented] (SPARK-6813) SparkR style guide

2015-06-21 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595150#comment-14595150
 ] 

Shivaram Venkataraman commented on SPARK-6813:
--

Yeah, we could disable those two checks, but that would also disable them for 
other things like `if-else` blocks, where we want to encourage the style of 
ending the line with `{` instead of having long one-liners. I guess the line 
length limit will prevent some of this. Let's see if there are any other 
opinions on this.

 SparkR style guide
 --

 Key: SPARK-6813
 URL: https://issues.apache.org/jira/browse/SPARK-6813
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman

 We should develop a SparkR style guide document based on some of the 
 guidelines we use and some of the best practices in R.
 Some examples of R style guides are:
 http://r-pkgs.had.co.nz/r.html#style 
 http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
 A related issue is to work on an automatic style checking tool. 
 https://github.com/jimhester/lintr seems promising
 We could have an R style guide based on the one from Google [1], and adjust 
 some of its rules based on the discussion in Spark:
 1. Line length: maximum 100 characters
 2. No limit on function name length (the API should be similar to other languages)
 3. Allow S4 objects/methods






[jira] [Created] (SPARK-8510) Store and read NumPy arrays and matrices as values in sequence files

2015-06-21 Thread Peter Aberline (JIRA)
Peter Aberline created SPARK-8510:
-

 Summary: Store and read NumPy arrays and matrices as values in 
sequence files
 Key: SPARK-8510
 URL: https://issues.apache.org/jira/browse/SPARK-8510
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Peter Aberline
Priority: Minor


I have extended the provided DoubleArrayWritable example code to store NumPy 
double-type arrays and matrices as arrays of doubles and nested arrays of 
doubles.

Pandas DataFrames can be easily converted to NumPy matrices, so I've also added 
the ability to store the schema-less data from DataFrames and Series that 
contain double data. 

Other than my own use there seems to be demand for this functionality:

http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E

I'll be issuing a PR for this shortly.






[jira] [Commented] (SPARK-8419) Statistics.colStats could avoid an extra count()

2015-06-21 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595048#comment-14595048
 ] 

Kai Sasaki commented on SPARK-8419:
---

In {{Statistics#colStats}}, the number of rows seems to be updated in 
{{computeColumnSummaryStatistics}} with {{updateNumRows}}. This is computed in a 
distributed fashion inside {{RDD#treeAggregate}}. So I think there is no extra 
{{count()}} when only creating the {{RowMatrix}}. Is this assumption correct?
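
For illustration, a minimal usage sketch (assuming {{sc}} is an existing 
SparkContext): the row count is available from the summary produced by that 
single pass, without a separate {{count()}}:
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val summary = Statistics.colStats(rows)
// The row count is accumulated in the same aggregation pass as the other stats.
println(summary.count)   // 2
println(summary.mean)    // [2.0,3.0]
{code}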

 Statistics.colStats could avoid an extra count()
 

 Key: SPARK-8419
 URL: https://issues.apache.org/jira/browse/SPARK-8419
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter

 Statistics.colStats goes through RowMatrix to compute the stats.  But 
 RowMatrix.computeColumnSummaryStatistics does an extra count() which could be 
 avoided.  Not going through RowMatrix would skip this extra pass over the 
 data.






[jira] [Updated] (SPARK-8510) Support NumPy arrays and matrices as values in sequence files

2015-06-21 Thread Peter Aberline (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Aberline updated SPARK-8510:
--
Summary: Support NumPy arrays and matrices as values in sequence files  
(was: Store and read NumPy arrays and matrices as values in sequence files)

 Support NumPy arrays and matrices as values in sequence files
 -

 Key: SPARK-8510
 URL: https://issues.apache.org/jira/browse/SPARK-8510
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Peter Aberline
Priority: Minor

 I have extended the provided DoubleArrayWritable example code to store NumPy 
 double-type arrays and matrices as arrays of doubles and nested arrays of 
 doubles.
 Pandas DataFrames can be easily converted to NumPy matrices, so I've also 
 added the ability to store the schema-less data from DataFrames and Series 
 that contain double data. 
 Other than my own use there seems to be demand for this functionality:
 http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E
 I'll be issuing a PR for this shortly.






[jira] [Commented] (SPARK-8472) Python API for DCT

2015-06-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595119#comment-14595119
 ] 

Joseph K. Bradley commented on SPARK-8472:
--

Let's wait until its dependency is complete.

 Python API for DCT
 --

 Key: SPARK-8472
 URL: https://issues.apache.org/jira/browse/SPARK-8472
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Feynman Liang
Priority: Minor

 We need to implement a wrapper for enabling the DCT feature transformer to be 
 used from the Python API






[jira] [Commented] (SPARK-8115) Remove TestData

2015-06-21 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595052#comment-14595052
 ] 

Benjamin Fradet commented on SPARK-8115:


I've started working on this.

 Remove TestData
 ---

 Key: SPARK-8115
 URL: https://issues.apache.org/jira/browse/SPARK-8115
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Minor

 TestData dates from the era when we didn't have easy ways to generate test 
 datasets. Now that we have implicits on Seq + toDF, it would make more sense 
 to put the test datasets closer to the test suites.
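
For example, with the SQL implicits in scope, a small test dataset can be 
declared right next to the suite that uses it (a sketch, assuming a 
{{SQLContext}} named {{sqlContext}}):
{code}
import sqlContext.implicits._

// Sketch: an inline test dataset instead of a shared TestData object.
val testData = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "value")
testData.registerTempTable("testData")
{code}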






[jira] [Commented] (SPARK-8503) SizeEstimator returns negative value for recursive data structures

2015-06-21 Thread Ilya Rakitsin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595060#comment-14595060
 ] 

Ilya Rakitsin commented on SPARK-8503:
--

The structure is a simple cyclic graph, like you would imagine it:

public abstract class Edge implements Serializable {
private static final long serialVersionUID = MavenVersion.VERSION.getUID();
private int id;
protected Vertex fromv;
protected Vertex tov;
...
}

public abstract class Vertex implements Serializable, Cloneable {
private String name;
private transient Edge[] incoming = new Edge[0];
private transient Edge[] outgoing = new Edge[0];
...
}

So, as you can see, the edges in a vertex are transient, so they are serialized 
correctly (basically, not serialized) when using Kryo or regular serialization. 
But when broadcasting, the size is computed in an endless loop until it goes 
negative (at least it seems that way) due to cycles in the graph and transient 
edges not being handled.

Does this help?

Another issue is that in SizeTracker#takeSample() the negative value returned by 
the estimator is not handled either. Do you think this should be a separate 
issue, or could you investigate it as well? Hope this helps.



 SizeEstimator returns negative value for recursive data structures
 --

 Key: SPARK-8503
 URL: https://issues.apache.org/jira/browse/SPARK-8503
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Ilya Rakitsin

 When estimating the size of recursive data structures such as graphs, with 
 transient fields referencing one another, SizeEstimator may return a negative 
 value if the structure is big enough.
 This then affects the logic of other components, e.g. 
 SizeTracker#takeSample(), and may lead to incorrect behavior and exceptions 
 like:
 java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -9223372036854691384
   at scala.Predef$.require(Predef.scala:233)
   at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:810)
   at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637)
   at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:991)
   at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
   at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
   at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
   at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
   at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)






[jira] [Comment Edited] (SPARK-8503) SizeEstimator returns negative value for recursive data structures

2015-06-21 Thread Ilya Rakitsin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595060#comment-14595060
 ] 

Ilya Rakitsin edited comment on SPARK-8503 at 6/21/15 2:28 PM:
---

The structure is a simple cyclic graph, like you would imagine it:
{code}
public abstract class Edge implements Serializable {
private static final long serialVersionUID = MavenVersion.VERSION.getUID();
private int id;
protected Vertex fromv;
protected Vertex tov;
...
}
{code}

{code}
public abstract class Vertex implements Serializable, Cloneable {
private String name;
private transient Edge[] incoming = new Edge[0];
private transient Edge[] outgoing = new Edge[0];
...
}
{code}
So, as you can see, the edges in a vertex are transient, so they are serialized 
correctly (basically, not serialized) when using Kryo or regular serialization. 
But when broadcasting, the size is computed in an endless loop until it goes 
negative (at least it seems that way) due to cycles in the graph and transient 
edges not being handled.

Does this help?

Another issue is that in SizeTracker#takeSample() the negative value returned by 
the estimator is not handled either. Do you think this should be a separate 
issue, or could you investigate it as well? Hope this helps.




was (Author: irakitin):
The structure is a simple cycled graph, like you would imagine it:
{code}
public abstract class Edge implements Serializable {
private static final long serialVersionUID = MavenVersion.VERSION.getUID();
private int id;
protected Vertex fromv;
protected Vertex tov;
...
}
{code}

public abstract class Vertex implements Serializable, Cloneable {
private String name;
private transient Edge[] incoming = new Edge[0];
private transient Edge[] outgoing = new Edge[0];
...
}

So, as you can see, edges in vertex are transient, so are serialized correctly 
(basically, not serialized) when using kryo or regular serialization. But when 
broadcasting, size is computed in a eternal loop until it's negative (at least 
it seems that way) due to cycles in the graph and transient edges not being 
handled.

Does this help?

Another issue is that in SizeTracker#takeSample() negative value returned by 
the estimator is not handled as well. Do you think this could be a separate 
issue, or could you investigate it as well? Hope this helps.



 SizeEstimator returns negative value for recursive data structures
 --

 Key: SPARK-8503
 URL: https://issues.apache.org/jira/browse/SPARK-8503
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Ilya Rakitsin

 When estimating the size of recursive data structures such as graphs, with 
 transient fields referencing one another, SizeEstimator may return a negative 
 value if the structure is big enough.
 This then affects the logic of other components, e.g. 
 SizeTracker#takeSample(), and may lead to incorrect behavior and exceptions 
 like:
 java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -9223372036854691384
   at scala.Predef$.require(Predef.scala:233)
   at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:810)
   at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637)
   at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:991)
   at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
   at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
   at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
   at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
   at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)






[jira] [Comment Edited] (SPARK-8503) SizeEstimator returns negative value for recursive data structures

2015-06-21 Thread Ilya Rakitsin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595060#comment-14595060
 ] 

Ilya Rakitsin edited comment on SPARK-8503 at 6/21/15 2:28 PM:
---

The structure is a simple cyclic graph, like you would imagine it:
{code}
public abstract class Edge implements Serializable {
private static final long serialVersionUID = MavenVersion.VERSION.getUID();
private int id;
protected Vertex fromv;
protected Vertex tov;
...
}
{code}

public abstract class Vertex implements Serializable, Cloneable {
private String name;
private transient Edge[] incoming = new Edge[0];
private transient Edge[] outgoing = new Edge[0];
...
}

So, as you can see, the edges in a vertex are transient, so they are serialized 
correctly (basically, not serialized) when using Kryo or regular serialization. 
But when broadcasting, the size is computed in an endless loop until it goes 
negative (at least it seems that way) due to cycles in the graph and transient 
edges not being handled.

Does this help?

Another issue is that in SizeTracker#takeSample() the negative value returned by 
the estimator is not handled either. Do you think this should be a separate 
issue, or could you investigate it as well? Hope this helps.




was (Author: irakitin):
The structure is a simple cycled graph, like you would imagine it:

public abstract class Edge implements Serializable {
private static final long serialVersionUID = MavenVersion.VERSION.getUID();
private int id;
protected Vertex fromv;
protected Vertex tov;
...
}

public abstract class Vertex implements Serializable, Cloneable {
private String name;
private transient Edge[] incoming = new Edge[0];
private transient Edge[] outgoing = new Edge[0];
...
}

So, as you can see, edges in vertex are transient, so are serialized correctly 
(basically, not serialized) when using kryo or regular serialization. But when 
broadcasting, size is computed in a eternal loop until it's negative (at least 
it seems that way) due to cycles in the graph and transient edges not being 
handled.

Does this help?

Another issue is that in SizeTracker#takeSample() negative value returned by 
the estimator is not handled as well. Do you think this could be a separate 
issue, or could you investigate it as well? Hope this helps.



 SizeEstimator returns negative value for recursive data structures
 --

 Key: SPARK-8503
 URL: https://issues.apache.org/jira/browse/SPARK-8503
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Ilya Rakitsin

 When estimating the size of recursive data structures such as graphs, with 
 transient fields referencing one another, SizeEstimator may return a negative 
 value if the structure is big enough.
 This then affects the logic of other components, e.g. 
 SizeTracker#takeSample(), and may lead to incorrect behavior and exceptions 
 like:
 java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -9223372036854691384
   at scala.Predef$.require(Predef.scala:233)
   at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:810)
   at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637)
   at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:991)
   at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
   at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
   at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
   at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
   at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)






[jira] [Closed] (SPARK-8419) Statistics.colStats could avoid an extra count()

2015-06-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-8419.

Resolution: Not A Problem

 Statistics.colStats could avoid an extra count()
 

 Key: SPARK-8419
 URL: https://issues.apache.org/jira/browse/SPARK-8419
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter

 Statistics.colStats goes through RowMatrix to compute the stats.  But 
 RowMatrix.computeColumnSummaryStatistics does an extra count() which could be 
 avoided.  Not going through RowMatrix would skip this extra pass over the 
 data.






[jira] [Commented] (SPARK-8419) Statistics.colStats could avoid an extra count()

2015-06-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595120#comment-14595120
 ] 

Joseph K. Bradley commented on SPARK-8419:
--

Oops, I misread what the count() was being called on.  Thanks!  I'll close this.

 Statistics.colStats could avoid an extra count()
 

 Key: SPARK-8419
 URL: https://issues.apache.org/jira/browse/SPARK-8419
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter

 Statistics.colStats goes through RowMatrix to compute the stats.  But 
 RowMatrix.computeColumnSummaryStatistics does an extra count() which could be 
 avoided.  Not going through RowMatrix would skip this extra pass over the 
 data.






[jira] [Created] (SPARK-8514) LU factorization on BlockMatrix

2015-06-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8514:


 Summary: LU factorization on BlockMatrix
 Key: SPARK-8514
 URL: https://issues.apache.org/jira/browse/SPARK-8514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng


LU is the most common method to solve a general linear system or invert a 
general matrix. A distributed version could be implemented block-wise with 
pipelining. A reference implementation is provided in ScaLAPACK:

http://netlib.org/scalapack/slug/node178.html
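
For reference, the standard block LU recurrence that such a distributed, pipelined implementation would follow (stated here as a sketch, not taken from the proposal above): partition the matrix into blocks and factor

$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix},$$

where $A_{11} = L_{11} U_{11}$ is a local factorization, $U_{12} = L_{11}^{-1} A_{12}$ and $L_{21} = A_{21} U_{11}^{-1}$ are triangular solves, and the trailing block $A_{22} - L_{21} U_{12} = L_{22} U_{22}$ is factored recursively; pipelining overlaps the panel factorization with the trailing update.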






[jira] [Assigned] (SPARK-8506) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R

2015-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8506:
---

Assignee: Apache Spark

 SparkR does not provide an easy way to depend on Spark Packages when 
 performing init from inside of R
 -

 Key: SPARK-8506
 URL: https://issues.apache.org/jira/browse/SPARK-8506
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.0
Reporter: holdenk
Assignee: Apache Spark
Priority: Minor

 While packages can be specified when using the sparkR or sparkSubmit scripts, 
 the programming guide tells people to create their spark context using the R 
 shell + init. The init does have a parameter for jars but no parameter for 
 packages. Setting the SPARKR_SUBMIT_ARGS overwrites some necessary 
 information. I think a good solution would just be adding another field to 
 the init function to allow people to specify packages in the same way as jars.






[jira] [Commented] (SPARK-6412) Add Char support in dataTypes.

2015-06-21 Thread Naden Franciscus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595280#comment-14595280
 ] 

Naden Franciscus commented on SPARK-6412:
-

This needs to be reopened.

The use case couldn't be more clear. Querying from a table that has a CHAR data 
type is unsupported.

 Add Char support in dataTypes.
 --

 Key: SPARK-6412
 URL: https://issues.apache.org/jira/browse/SPARK-6412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Chen Song

 We can't get the schema of the case class PrimitiveData, because 
 ScalaReflection.schemaFor and dataTypes don't support CharType.
 case class PrimitiveData(
 charField: Char, // Can't get the char schema info
 intField: Int,
 longField: Long,
 doubleField: Double,
 floatField: Float,
 shortField: Short,
 byteField: Byte,
 booleanField: Boolean)
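
A minimal reproduction sketch of the failure described above (field names follow the report; the exact error may differ by version):
{code}
import org.apache.spark.sql.catalyst.ScalaReflection

case class CharData(charField: Char, intField: Int)

// There is no CharType among the SQL data types, so schema inference has no
// mapping for the Char field and this call fails.
ScalaReflection.schemaFor[CharData]
{code}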






[jira] [Issue Comment Deleted] (SPARK-6412) Add Char support in dataTypes.

2015-06-21 Thread Naden Franciscus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naden Franciscus updated SPARK-6412:

Comment: was deleted

(was: This needs to be reopened.

The use case couldn't be more clear. Querying from a table that has a CHAR data 
type is unsupported.)

 Add Char support in dataTypes.
 --

 Key: SPARK-6412
 URL: https://issues.apache.org/jira/browse/SPARK-6412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Chen Song

 We can't get the schema of the case class PrimitiveData, because 
 ScalaReflection.schemaFor and dataTypes don't support CharType.
 case class PrimitiveData(
 charField: Char, // Can't get the char schema info
 intField: Int,
 longField: Long,
 doubleField: Double,
 floatField: Float,
 shortField: Short,
 byteField: Byte,
 booleanField: Boolean)






[jira] [Resolved] (SPARK-7426) spark.ml AttributeFactory.fromStructField should allow other NumericTypes

2015-06-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7426.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6540
[https://github.com/apache/spark/pull/6540]

 spark.ml AttributeFactory.fromStructField should allow other NumericTypes
 -

 Key: SPARK-7426
 URL: https://issues.apache.org/jira/browse/SPARK-7426
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Mike Dusenberry
Priority: Minor
 Fix For: 1.5.0


 It currently only supports DoubleType, but it should support others, at least 
 for fromStructField (importing into ML attribute format, rather than 
 exporting).






[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-06-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8513:
--
Description: 
To reproduce this issue, we need a node with a relatively large number of cores, 
say 32 (e.g., the Spark Jenkins builder is a good candidate).  With such a node, 
the following code should reproduce the issue relatively easily:
{code}
sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode("overwrite").parquet("file:///tmp/foo")
{code}
You may observe log lines similar to the following:
{noformat}
01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite 
WARN FileUtil: Failed to delete file or dir 
[/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]:
 it still exists.
{noformat}
The reason is that, for a Spark job with multiple tasks, when a task fails 
after multiple retries, the job gets canceled on the driver side.  At the same 
time, all child tasks of this job also get canceled.  However, task cancellation 
is asynchronous.  This means some tasks may still be running when the job has 
already been killed on the driver side.

With this in mind, the following execution order may cause the log line 
mentioned above:

# Job {{A}} spawns 32 tasks to write the Parquet file.
  Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputCommitter}}, a 
temporary directory {{D1}} is created to hold the output files of the different 
task attempts.
# Task {{a1}} fails first, after several retries, because of the division-by-zero 
error.
# Task {{a1}} aborts the Parquet write task and tries to remove its task 
attempt output directory {{d1}} (a sub-directory of {{D1}}).
# Job {{A}} gets canceled on the driver side, and all the other 31 tasks also get 
canceled *asynchronously*.
# {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first 
removing all of its child files/directories.
  Note that when testing with a local directory, {{RawLocalFileSystem}} simply 
calls {{java.io.File.delete()}} to do the deletion, and only empty directories 
can be deleted.
# Because tasks are canceled asynchronously, some other task, say {{a2}}, may 
just get scheduled and create its own task attempt directory {{d2}} under {{D1}}.
# Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} 
itself, but fails because {{d2}} makes {{D1}} non-empty again.

Notice that this bug affects all Spark jobs that write files with 
{{FileOutputCommitter}} and its subclasses, which create temporary directories.

One possible way to fix this issue is to make task cancellation synchronous, 
but that would also increase latency.

 _temporary may be left undeleted when a write job committed with 
 FileOutputCommitter fails due to a race condition
 --

 Key: SPARK-8513
 URL: https://issues.apache.org/jira/browse/SPARK-8513
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.2.2, 1.3.1, 1.4.0
Reporter: Cheng Lian

 To reproduce this issue, we need a node with relatively more cores, say 32 
 (e.g., Spark Jenkins builder is a good candidate).  With such a node, the 
 following code should be relatively easy to reproduce this issue:
 {code}
 sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode("overwrite").parquet("file:///tmp/foo")
 {code}
 You may observe similar log lines as below:
 {noformat}
 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite 
 WARN FileUtil: Failed to delete file or dir 
 [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]:
  it still exists.
 {noformat}
 The reason is that, for a Spark job with multiple tasks, when a task fails 
 after multiple retries, the job gets canceled on driver side.  At the same 
 time, all child tasks of this job also get canceled.  However, task 
 cancelation is asynchronous.  This means, some tasks may still be running 
 when the job is already killed on driver side.
 With this in mind, the following execution order may cause the log line 
 mentioned above:
 # Job {{A}} spawns 32 tasks to write the Parquet file
   Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputCommitter}}, a 
 temporary directory {{D1}} is created to hold output files of different task 
 attempts.
 # Task {{a1}} fails after several retries first because of the division by 
 zero error
 # Task {{a1}} aborts the Parquet write task and tries to remove its task 
 attempt output directory {{d1}} (a sub-directory of {{D1}})
 # Job {{A}} gets canceled on driver side, all the other 31 tasks also get 
 canceled *asynchronously*
 # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first 
 removing all its child 

[jira] [Updated] (SPARK-8514) LU factorization on BlockMatrix

2015-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8514:
-
Labels: advanced  (was: )

 LU factorization on BlockMatrix
 ---

 Key: SPARK-8514
 URL: https://issues.apache.org/jira/browse/SPARK-8514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
  Labels: advanced

 LU is the most common method to solve a general linear system or invert a 
 general matrix. A distributed version could be implemented block-wise with 
 pipelining. A reference implementation is provided in ScaLAPACK:
 http://netlib.org/scalapack/slug/node178.html






[jira] [Created] (SPARK-8515) Improve ML attribute API

2015-06-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8515:


 Summary: Improve ML attribute API
 Key: SPARK-8515
 URL: https://issues.apache.org/jira/browse/SPARK-8515
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng


In 1.4.0, we introduced the ML attribute API to embed feature/label attribute 
info inside a DataFrame's schema. However, the API is not very friendly to use. 
In 1.5, we should revisit this API and see how we can improve it.






[jira] [Updated] (SPARK-8515) Improve ML attribute API

2015-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8515:
-
Labels: advanced  (was: )

 Improve ML attribute API
 

 Key: SPARK-8515
 URL: https://issues.apache.org/jira/browse/SPARK-8515
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
  Labels: advanced

 In 1.4.0, we introduced the ML attribute API to embed feature/label attribute 
 info inside a DataFrame's schema. However, the API is not very friendly to use. 
 In 1.5, we should revisit this API and see how we can improve it.






[jira] [Commented] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package

2015-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595251#comment-14595251
 ] 

Apache Spark commented on SPARK-7888:
-

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/6927

 Be able to disable intercept in Linear Regression in ML package
 ---

 Key: SPARK-7888
 URL: https://issues.apache.org/jira/browse/SPARK-7888
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: DB Tsai
Assignee: holdenk








[jira] [Assigned] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package

2015-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7888:
---

Assignee: Apache Spark  (was: holdenk)

 Be able to disable intercept in Linear Regression in ML package
 ---

 Key: SPARK-7888
 URL: https://issues.apache.org/jira/browse/SPARK-7888
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: DB Tsai
Assignee: Apache Spark








[jira] [Assigned] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package

2015-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7888:
---

Assignee: holdenk  (was: Apache Spark)

 Be able to disable intercept in Linear Regression in ML package
 ---

 Key: SPARK-7888
 URL: https://issues.apache.org/jira/browse/SPARK-7888
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: DB Tsai
Assignee: holdenk








[jira] [Resolved] (SPARK-7443) MLlib 1.4 QA plan

2015-06-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7443.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

I'm closing this and marking it Fixed.  I've checked to make sure the 
still-unresolved JIRAs linked from this umbrella are marked for appropriate 
targets.  The remaining ones are mainly perf tests (necessary for the next 
release) or documentation (which we can add to the Spark user guide when those 
docs are ready).

 MLlib 1.4 QA plan
 -

 Key: SPARK-7443
 URL: https://issues.apache.org/jira/browse/SPARK-7443
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley
Priority: Critical
 Fix For: 1.4.0


 h2. API
 * Check API compliance using java-compliance-checker (SPARK-7458)
 * Audit new public APIs (from the generated html doc)
 ** Scala (do not forget to check the object doc) (SPARK-7537)
 ** Java compatibility (SPARK-7529)
 ** Python API coverage (SPARK-7536)
 * audit Pipeline APIs (SPARK-7535)
 * graduate spark.ml from alpha (SPARK-7748)
 ** remove AlphaComponent annotations
 ** remove mima excludes for spark.ml
 ** mark concrete classes final wherever reasonable
 h2. Algorithms and performance
 *Performance*
 * _List any other missing performance tests from spark-perf here_
 * LDA online/EM (SPARK-7455)
 * ElasticNet for linear regression and logistic regression (SPARK-7456)
 * Bernoulli naive Bayes (SPARK-7453)
 * PIC (SPARK-7454)
 * ALS.recommendAll (SPARK-7457)
 * perf-tests in Python (SPARK-7539)
 *Correctness*
 * PMML
 ** scoring using PMML evaluator vs. MLlib models (SPARK-7540)
 * model save/load (SPARK-7541)
 h2. Documentation and example code
 * Create JIRAs for the user guide for each new algorithm and assign them to 
 the corresponding author.  Link here as required.
 ** Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed.  We can 
 follow the structure of the spark.mllib user guide.
 *** The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 *** We should not duplicate info in the spark.ml guides.  Since spark.mllib 
 is still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 * Create example code for major components.  Link here as required.
 ** cross validation in python (SPARK-7387)
 ** pipeline with complex feature transformations (scala/java/python) 
 (SPARK-7546)
 ** elastic-net (possibly with cross validation) (SPARK-7547)
 ** kernel density (SPARK-7707)
 * Update Programming Guide for 1.4 (towards end of QA) (SPARK-7715)






[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-06-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8513:
--
Description: 
To reproduce this issue, we need a node with a relatively large number of cores, 
say 32 (e.g., the Spark Jenkins builder is a good candidate).  With such a node, 
the following code should reproduce the issue relatively easily:
{code}
sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode("overwrite").parquet("file:///tmp/foo")
{code}
You may observe log lines similar to the following:
{noformat}
01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite 
WARN FileUtil: Failed to delete file or dir 
[/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]:
 it still exists.
{noformat}
The reason is that, for a Spark job with multiple tasks, when a task fails 
after multiple retries, the job gets canceled on the driver side.  At the same 
time, all child tasks of this job also get canceled.  However, task cancellation 
is asynchronous.  This means some tasks may still be running when the job has 
already been killed on the driver side.

With this in mind, the following execution order may cause the log line 
mentioned above:

# Job {{A}} spawns 32 tasks to write the Parquet file.
  Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputCommitter}}, a 
temporary directory {{D1}} is created to hold the output files of the different 
task attempts.
# Task {{a1}} fails first, after several retries, because of the division-by-zero 
error.
# Task {{a1}} aborts the Parquet write task and tries to remove its task 
attempt output directory {{d1}} (a sub-directory of {{D1}}).
# Job {{A}} gets canceled on the driver side, and all the other 31 tasks also get 
canceled *asynchronously*.
# {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first 
removing all of its child files/directories.
  Note that when testing with a local directory, {{RawLocalFileSystem}} simply 
calls {{java.io.File.delete()}} to do the deletion, and only empty directories 
can be deleted.
# Because tasks are canceled asynchronously, some other task, say {{a2}}, may 
just get scheduled and create its own task attempt directory {{d2}} under {{D1}}.
# Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} 
itself, but fails because {{d2}} makes {{D1}} non-empty again.

Notice that this bug affects all Spark jobs that write files with 
{{FileOutputCommitter}} and its subclasses, which create temporary directories.

One possible way to fix this issue is to make task cancellation synchronous, 
but that would also increase latency.

  was:
To reproduce this issue, we need a node with relatively more cores, say 32 
(e.g., Spark Jenkins builder is a good candidate).  With such a node, the 
following code should be relatively easy to reproduce this issue:
{code}
sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode("overwrite").parquet("file:///tmp/foo")
{code}
You may observe similar log lines as below:
{noformat}
01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite 
WARN FileUtil: Failed to delete file or dir 
[/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]:
 it still exists.
{noformat}
The reason is that, for a Spark job with multiple tasks, when a task fails 
after multiple retries, the job gets canceled on driver side.  At the same 
time, all child tasks of this job also get canceled.  However, task cancelation 
is asynchronous.  This means, some tasks may still be running when the job is 
already killed on driver side.

With this in mind, the following execution order may cause the log line 
mentioned above:

# Job {{A}} spawns 32 tasks to write the Parquet file

  Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputClass}}, a 
temporary directory {{D1}} is created to hold output files of different task 
attempts.

# Task {{a1}} fails after several retries first because of the division by zero 
error
# Task {{a1}} aborts the Parquet write task and tries to remove its task 
attempt output directory {{d1}} (a sub-directory of {{D1}})
# Job {{A}} gets canceled on driver side, all the other 31 tasks also get 
canceled *asynchronously*
# {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first 
removing all its child files/directories first

  Note that when testing with local directory, {{RawLocalFileSystem}} simply 
calls {{java.io.File.delete()}} to deletion, and only empty directories can be 
deleted.

# Because tasks are canceled asynchronously, some other task, say {{a2}} may 
just get scheduled and create its own task attempt directory {{d2}} under {{D2}}
# Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} 
itself, but fails because {{d2}} makes {{D1}} non-empty again

Notice that this bug affects all Spark jobs 

[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-06-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8513:
--
Description: 
To reproduce this issue, we need a node with a relatively large number of 
cores, say 32 (e.g., the Spark Jenkins builder is a good candidate).  With such 
a node, the following code should reproduce the issue fairly easily:
{code}
sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode("overwrite").parquet("file:///tmp/foo")
{code}
You may observe log lines similar to the following:
{noformat}
01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite 
WARN FileUtil: Failed to delete file or dir 
[/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]:
 it still exists.
{noformat}
The reason is that, for a Spark job with multiple tasks, when a task fails 
after multiple retries, the job gets canceled on the driver side.  At the same 
time, all child tasks of this job also get canceled.  However, task cancelation 
is asynchronous.  This means some tasks may still be running when the job has 
already been killed on the driver side.

With this in mind, the following execution order may cause the log line 
mentioned above:

# Job {{A}} spawns 32 tasks to write the Parquet file
  Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputCommitter}}, a 
temporary directory {{D1}} is created to hold the output files of different task 
attempts.
# Task {{a1}} is the first to fail, after several retries, because of the 
division-by-zero error
# Task {{a1}} aborts the Parquet write task and tries to remove its task 
attempt output directory {{d1}} (a sub-directory of {{D1}})
# Job {{A}} gets canceled on the driver side; all the other 31 tasks also get 
canceled *asynchronously*
# {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first 
removing all of its child files/directories
  Note that when testing with a local directory, {{RawLocalFileSystem}} simply 
calls {{java.io.File.delete()}} to perform the deletion, and only empty 
directories can be deleted.
# Because tasks are canceled asynchronously, some other task, say {{a2}}, may 
just get scheduled and create its own task attempt directory {{d2}} under {{D1}}
# Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} 
itself, but fails because {{d2}} makes {{D1}} non-empty again

Notice that this bug affects all Spark jobs that write files with 
{{FileOutputCommitter}} and its subclasses, which create temporary directories.

One possible way to fix this issue is to make task cancellation synchronous, 
but that would also increase latency.

  was:
To reproduce this issue, we need a node with relatively more cores, say 32 
(e.g., Spark Jenkins builder is a good candidate).  With such a node, the 
following code should be relatively easy to reproduce this issue:
{code}
sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode("overwrite").parquet("file:///tmp/foo")
{code}
You may observe similar log lines as below:
{noformat}
01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite 
WARN FileUtil: Failed to delete file or dir 
[/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]:
 it still exists.
{noformat}
The reason is that, for a Spark job with multiple tasks, when a task fails 
after multiple retries, the job gets canceled on driver side.  At the same 
time, all child tasks of this job also get canceled.  However, task cancelation 
is asynchronous.  This means, some tasks may still be running when the job is 
already killed on driver side.

With this in mind, the following execution order may cause the log line 
mentioned above:

# Job {{A}} spawns 32 tasks to write the Parquet file
  Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputClass}}, a 
temporary directory {{D1}} is created to hold output files of different task 
attempts.
# Task {{a1}} fails after several retries first because of the division by zero 
error
# Task {{a1}} aborts the Parquet write task and tries to remove its task 
attempt output directory {{d1}} (a sub-directory of {{D1}})
# Job {{A}} gets canceled on driver side, all the other 31 tasks also get 
canceled *asynchronously*
# {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first 
removing all its child files/directories first
  Note that when testing with local directory, {{RawLocalFileSystem}} simply 
calls {{java.io.File.delete()}} to deletion, and only empty directories can be 
deleted.
# Because tasks are canceled asynchronously, some other task, say {{a2}} may 
just get scheduled and create its own task attempt directory {{d2}} under {{D2}}
# Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} 
itself, but fails because {{d2}} makes {{D1}} non-empty again

Notice that this bug affects all Spark jobs 

[jira] [Commented] (SPARK-8475) SparkSubmit with Ivy jars is very slow to load with no internet access

2015-06-21 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595363#comment-14595363
 ] 

Burak Yavuz commented on SPARK-8475:


Me too. I prefer option 1 as well.

 SparkSubmit with Ivy jars is very slow to load with no internet access
 --

 Key: SPARK-8475
 URL: https://issues.apache.org/jira/browse/SPARK-8475
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.4.0
Reporter: Nathan McCarthy
Priority: Minor

 Spark Submit adds maven central & spark bintray to the ChainResolver before 
 it adds any external resolvers. 
 https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L821
 When running on a cluster without internet access, this means the spark shell 
 takes forever to launch as it tries these two remote repos before the ones 
 specified in the --repositories list. In our case we have a proxy which the 
 cluster can access, and we supply it via --repositories.
 This is also a problem for users who maintain a proxy for maven/ivy repos 
 with something like Nexus/Artifactory. Having a repo proxy is popular at many 
 organisations so I'd say this would be a useful change for these users as 
 well. In the current state, even if a maven central proxy is supplied, it will 
 still try and hit central. 
 I see two options for a fix:
 * Change the order repos are added to the ChainResolver, making the 
 --repositories supplied repos come before anything else. 
 https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L843
  
 * Have a config option (like spark.jars.ivy.useDefaultRemoteRepos, default 
 true) which, when false, won't add maven central & bintray to the 
 ChainResolver. 
 Happy to do a PR for this fix. 
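
For illustration only, a rough sketch of what option 1 might look like using 
Ivy's {{ChainResolver}} / {{IBiblioResolver}} API directly; this is not the 
actual SparkSubmit code, and the resolver names are placeholders:
{code}
import org.apache.ivy.plugins.resolver.{ChainResolver, IBiblioResolver}

// Option 1 sketch: user-supplied --repositories are added to the chain
// *before* the default resolvers, so an internal mirror is consulted first.
def buildResolverChain(userRepos: Seq[String]): ChainResolver = {
  val chain = new ChainResolver
  chain.setName("spark-package-resolvers")

  // 1. user-supplied repositories (e.g. an internal Nexus/Artifactory proxy)
  userRepos.zipWithIndex.foreach { case (repoUrl, i) =>
    val resolver = new IBiblioResolver
    resolver.setM2compatible(true)
    resolver.setUsepoms(true)
    resolver.setRoot(repoUrl)
    resolver.setName(s"repo-${i + 1}")
    chain.add(resolver)
  }

  // 2. the defaults (Maven Central, Spark bintray) are appended afterwards
  val central = new IBiblioResolver
  central.setM2compatible(true)
  central.setUsepoms(true)
  central.setName("central")
  chain.add(central)

  chain
}
{code}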



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7398) Add back-pressure to Spark Streaming

2015-06-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-7398:
-
Priority: Critical  (was: Major)
Target Version/s: 1.5.0

 Add back-pressure to Spark Streaming
 

 Key: SPARK-7398
 URL: https://issues.apache.org/jira/browse/SPARK-7398
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.1
Reporter: François Garillot
Priority: Critical
  Labels: streams

 Spark Streaming has trouble dealing with situations where 
  batch processing time > batch interval
 Meaning a high throughput of input data w.r.t. Spark's ability to remove data 
 from the queue.
 If this throughput is sustained for long enough, it leads to an unstable 
 situation where the memory of the Receiver's Executor is overflowed.
 This aims at transmitting a back-pressure signal back to data ingestion to 
 help with dealing with that high throughput, in a backwards-compatible way.
 The design doc can be found here:
 https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7398) Add back-pressure to Spark Streaming

2015-06-21 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595374#comment-14595374
 ] 

Tathagata Das commented on SPARK-7398:
--

I took a look at the whole design doc. It's very well composed, but the actual 
details of how the code changes will be made are a little unclear. Now that you 
have a working branch, I strongly recommend writing an additional design doc 
which skips all the intro and background, and just focuses on the code changes. 

Here is a design doc for inspiration. This is original design doc for the Write 
Ahead Log. 
https://docs.google.com/document/d/1vTCB5qVfyxQPlHuv8rit9-zjdttlgaSrMgfCDQlCJIM/edit#heading=h.9xoxtbgz551y
See the architecture and proposed implementation section. Accordingly you 
should have the following two sections 

1. Use diagrams to explain the high-level control flow in the architecture with 
the new classes in the picture and how they interoperate/interface with existing 
classes (BTW, high-level = not as detailed as the control flow that you have in 
the earlier design doc).
2. The details of every class and interface that needs to be introduced or 
modified. Especially focus on the interfaces for - (1) the heuristic algorithm, 
(2) the congestion control.

This will allow me and others to evaluate the architecture more critically. 

Then if needed we can break up the task into smaller sub-tasks (as done 
in the case of the WAL JIRA - 
https://issues.apache.org/jira/browse/SPARK-3129). 






 Add back-pressure to Spark Streaming
 

 Key: SPARK-7398
 URL: https://issues.apache.org/jira/browse/SPARK-7398
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.1
Reporter: François Garillot
  Labels: streams

 Spark Streaming has trouble dealing with situations where 
  batch processing time > batch interval
 Meaning a high throughput of input data w.r.t. Spark's ability to remove data 
 from the queue.
 If this throughput is sustained for long enough, it leads to an unstable 
 situation where the memory of the Receiver's Executor is overflowed.
 This aims at transmitting a back-pressure signal back to data ingestion to 
 help with dealing with that high throughput, in a backwards-compatible way.
 The design doc can be found here:
 https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-06-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8513:
--
Description: 
To reproduce this issue, we need a node with a relatively large number of 
cores, say 32 (e.g., the Spark Jenkins builder is a good candidate).  With such 
a node, the following code should reproduce the issue fairly easily:
{code}
sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode("overwrite").parquet("file:///tmp/foo")
{code}
You may observe log lines similar to the following:
{noformat}
01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite 
WARN FileUtil: Failed to delete file or dir 
[/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]:
 it still exists.
{noformat}
The reason is that, for a Spark job with multiple tasks, when a task fails 
after multiple retries, the job gets canceled on the driver side.  At the same 
time, all child tasks of this job also get canceled.  However, task cancelation 
is asynchronous.  This means some tasks may still be running when the job has 
already been killed on the driver side.

With this in mind, the following execution order may cause the log line 
mentioned above:

# Job {{A}} spawns 32 tasks to write the Parquet file
  Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputCommitter}}, a 
temporary directory {{D1}} is created to hold the output files of different task 
attempts.
# Task {{a1}} is the first to fail, after several retries, because of the 
division-by-zero error
# Task {{a1}} aborts the Parquet write task and tries to remove its task 
attempt output directory {{d1}} (a sub-directory of {{D1}})
# Job {{A}} gets canceled on the driver side; all the other 31 tasks also get 
canceled *asynchronously*
# {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first 
removing all of its child files/directories
  Note that when testing with a local directory, {{RawLocalFileSystem}} simply 
calls {{java.io.File.delete()}} to perform the deletion, and only empty 
directories can be deleted.
# Because tasks are canceled asynchronously, some other task, say {{a2}}, may 
just get scheduled and create its own task attempt directory {{d2}} under {{D1}}
# Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} 
itself, but fails because {{d2}} makes {{D1}} non-empty again

Notice that this bug affects all Spark jobs that write files with 
{{FileOutputCommitter}} and its subclasses, which create and delete temporary 
directories.

One possible way to fix this issue is to make task cancellation synchronous, 
but that would also increase latency.

  was:
To reproduce this issue, we need a node with relatively more cores, say 32 
(e.g., Spark Jenkins builder is a good candidate).  With such a node, the 
following code should be relatively easy to reproduce this issue:
{code}
sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode("overwrite").parquet("file:///tmp/foo")
{code}
You may observe similar log lines as below:
{noformat}
01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite 
WARN FileUtil: Failed to delete file or dir 
[/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]:
 it still exists.
{noformat}
The reason is that, for a Spark job with multiple tasks, when a task fails 
after multiple retries, the job gets canceled on driver side.  At the same 
time, all child tasks of this job also get canceled.  However, task cancelation 
is asynchronous.  This means, some tasks may still be running when the job is 
already killed on driver side.

With this in mind, the following execution order may cause the log line 
mentioned above:

# Job {{A}} spawns 32 tasks to write the Parquet file
  Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputClass}}, a 
temporary directory {{D1}} is created to hold output files of different task 
attempts.
# Task {{a1}} fails after several retries first because of the division by zero 
error
# Task {{a1}} aborts the Parquet write task and tries to remove its task 
attempt output directory {{d1}} (a sub-directory of {{D1}})
# Job {{A}} gets canceled on driver side, all the other 31 tasks also get 
canceled *asynchronously*
# {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first 
removing all its child files/directories first
  Note that when testing with local directory, {{RawLocalFileSystem}} simply 
calls {{java.io.File.delete()}} to deletion, and only empty directories can be 
deleted.
# Because tasks are canceled asynchronously, some other task, say {{a2}}, may 
just get scheduled and create its own task attempt directory {{d2}} under {{D1}}
# Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} 
itself, but fails because {{d2}} makes {{D1}} non-empty again

Notice that this bug affects all 

[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-06-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8513:
--
Component/s: SQL

 _temporary may be left undeleted when a write job committed with 
 FileOutputCommitter fails due to a race condition
 --

 Key: SPARK-8513
 URL: https://issues.apache.org/jira/browse/SPARK-8513
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.2.2, 1.3.1, 1.4.0
Reporter: Cheng Lian





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8506) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R

2015-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8506:
---

Assignee: (was: Apache Spark)

 SparkR does not provide an easy way to depend on Spark Packages when 
 performing init from inside of R
 -

 Key: SPARK-8506
 URL: https://issues.apache.org/jira/browse/SPARK-8506
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.0
Reporter: holdenk
Priority: Minor

 While packages can be specified when using the sparkR or sparkSubmit scripts, 
 the programming guide tells people to create their spark context using the R 
 shell + init. The init does have a parameter for jars but no parameter for 
 packages. Setting the SPARKR_SUBMIT_ARGS overwrites some necessary 
 information. I think a good solution would just be adding another field to 
 the init function to allow people to specify packages in the same way as jars.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8506) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R

2015-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595275#comment-14595275
 ] 

Apache Spark commented on SPARK-8506:
-

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/6928

 SparkR does not provide an easy way to depend on Spark Packages when 
 performing init from inside of R
 -

 Key: SPARK-8506
 URL: https://issues.apache.org/jira/browse/SPARK-8506
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.0
Reporter: holdenk
Priority: Minor

 While packages can be specified when using the sparkR or sparkSubmit scripts, 
 the programming guide tells people to create their spark context using the R 
 shell + init. The init does have a parameter for jars but no parameter for 
 packages. Setting the SPARKR_SUBMIT_ARGS overwrites some necessary 
 information. I think a good solution would just be adding another field to 
 the init function to allow people to specify packages in the same way as jars.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-06-21 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-8513:
-

 Summary: _temporary may be left undeleted when a write job 
committed with FileOutputCommitter fails due to a race condition
 Key: SPARK-8513
 URL: https://issues.apache.org/jira/browse/SPARK-8513
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0, 1.3.1, 1.2.2
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8510) NumPy arrays and matrices as values in sequence files

2015-06-21 Thread Peter Aberline (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Aberline updated SPARK-8510:
--
Description: 
Using the DoubleArrayWritable example, I have added support for storing NumPy 
double arrays and matrices as arrays of doubles and nested arrays of doubles as 
value elements of Sequence Files.

Each value element is a discrete matrix or array. This is useful where you have 
many matrices that you don't want to join into a single Spark Data Frame to 
store in a Parquet file.

Pandas DataFrames can be easily converted to and from NumPy matrices, so I've 
also added the ability to store the schema-less data from DataFrames and Series 
that contain double data. 

There seems to be demand for this functionality:

http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E

I'll be issuing a PR for this shortly.

  was:
I have extended the provided example code DoubleArrayWritable example to store 
NumPy double type arrays and matrices as arrays of doubles and nested arrays of 
doubles.

Pandas DataFrames can be easily converted to NumPy matrices, so I've also added 
the ability to store the schema-less data from DataFrames and Series that 
contain double data. 

Other than my own use there seems to be demand for this functionality:

http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E

I'll be issuing a PR for this shortly.


 NumPy arrays and matrices as values in sequence files
 -

 Key: SPARK-8510
 URL: https://issues.apache.org/jira/browse/SPARK-8510
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Peter Aberline
Priority: Minor

 Using the DoubleArrayWritable example, I have added support for storing NumPy 
 double arrays and matrices as arrays of doubles and nested arrays of doubles 
 as value elements of Sequence Files.
 Each value element is a discrete matrix or array. This is useful where you 
 have many matrices that you don't want to join into a single Spark Data Frame 
 to store in a Parquet file.
 Pandas DataFrames can be easily converted to and from NumPy matrices, so I've 
 also added the ability to store the schema-less data from DataFrames and 
 Series that contain double data. 
 There seems to be demand for this functionality:
 http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E
 I'll be issuing a PR for this shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7075) Project Tungsten: Improving Physical Execution

2015-06-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7075:
---
Summary: Project Tungsten: Improving Physical Execution  (was: Project 
Tungsten: Improving Physical Execution and Memory Management)

 Project Tungsten: Improving Physical Execution
 --

 Key: SPARK-7075
 URL: https://issues.apache.org/jira/browse/SPARK-7075
 Project: Spark
  Issue Type: Epic
  Components: Block Manager, Shuffle, Spark Core, SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 Based on our observation, majority of Spark workloads are not bottlenecked by 
 I/O or network, but rather CPU and memory. This project focuses on 3 areas to 
 improve the efficiency of memory and CPU for Spark applications, to push 
 performance closer to the limits of the underlying hardware.
 *Memory Management and Binary Processing*
 - Avoiding non-transient Java objects (store them in binary format), which 
 reduces GC overhead.
 - Minimizing memory usage through denser in-memory data format, which means 
 we spill less.
 - Better memory accounting (size of bytes) rather than relying on heuristics
 - For operators that understand data types (in the case of DataFrames and 
 SQL), work directly against binary format in memory, i.e. have no 
 serialization/deserialization
 *Cache-aware Computation*
 - Faster sorting and hashing for aggregations, joins, and shuffle
 *Code Generation*
 - Faster expression evaluation and DataFrame/SQL operators
 - Faster serializer
 Several parts of project Tungsten leverage the DataFrame model, which gives 
 us more semantics about the application. We will also retrofit the 
 improvements onto Spark’s RDD API whenever possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2015-06-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595226#comment-14595226
 ] 

Joseph K. Bradley commented on SPARK-1503:
--

[~staple] [~lewuathe] Can you please coordinate about how convergence is 
measured, as in [SPARK-3382]?  The current implementation for [SPARK-3382] uses 
relative convergence w.r.t. the weight vector, where it measures relative to 
the weight vector from the previous iteration.  I figure we should use the same 
convergence criterion for both accelerated and non-accelerated gradient descent.
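
For reference, a minimal sketch of a relative-convergence test of that kind 
(the exact norm and tolerance handling used in [SPARK-3382] may differ; this is 
only to pin down the idea):
{code}
import breeze.linalg.{DenseVector, norm}

// Relative convergence w.r.t. the previous weight vector: stop when the step
// is small compared to the previous weights' magnitude (guarded by 1.0 so the
// test behaves near zero).  The default tolerance is an arbitrary assumption.
def converged(previous: DenseVector[Double],
              current: DenseVector[Double],
              tolerance: Double = 1e-3): Boolean = {
  norm(current - previous) < tolerance * math.max(norm(previous), 1.0)
}
{code}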

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple
 Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png


 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595239#comment-14595239
 ] 

Reynold Xin commented on SPARK-8072:


I think it's best to say the name of the duplicate columns.
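
A small sketch of the kind of pre-write check that could produce such a message 
(this is not the actual analyzer/writer code, and the case-insensitive grouping 
is an assumption):
{code}
// Given the output column names, report any duplicates explicitly instead of
// letting resolution fail later with an ambiguous-reference error.
def duplicateColumns(columnNames: Seq[String]): Seq[String] =
  columnNames
    .groupBy(_.toLowerCase)
    .collect { case (_, names) if names.size > 1 => names.head }
    .toSeq

// e.g. duplicateColumns(Seq("age", "name", "age")) == Seq("age"); the writer
// would then raise an AnalysisException naming these columns.
{code}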

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350  df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 

[jira] [Closed] (SPARK-8509) Failed to JOIN in pyspark

2015-06-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-8509.

Resolution: Not A Problem

 Failed to JOIN in pyspark
 -

 Key: SPARK-8509
 URL: https://issues.apache.org/jira/browse/SPARK-8509
 Project: Spark
  Issue Type: Bug
Reporter: afancy

 Hi,
 I am writing a pyspark streaming program. I have the training data set to 
 compute the regression model. I want to use the stream data set to test the 
 model. So, I join the RDD with the StreamRDD, but I got an exception. Following 
 is my source code, and the exception I got. Any help is appreciated. Thanks
 Regards,
 Afancy
 
 {code}
 from __future__ import print_function
 import sys,os,datetime
 from pyspark import SparkContext
 from pyspark.streaming import StreamingContext
 from pyspark.sql.context import SQLContext
 from pyspark.resultiterable import ResultIterable
 from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
 import numpy as np
 import statsmodels.api as sm
 def splitLine(line, delimiter='|'):
 values = line.split(delimiter)
 st = datetime.datetime.strptime(values[1], '%Y-%m-%d %H:%M:%S')
 return (values[0],st.hour), values[2:]
 def reg_m(y, x):
 ones = np.ones(len(x[0]))
 X = sm.add_constant(np.column_stack((x[0], ones)))
 for ele in x[1:]:
 X = sm.add_constant(np.column_stack((ele, X)))
 results = sm.OLS(y, X).fit()
 return results
 def train(line):
 y,x = [],[]
 y, x = [],[[],[],[],[],[],[]]
 reading_tmp,temp_tmp = [],[]
 i = 0
 for reading, temperature in line[1]:
 if i%4==0 and len(reading_tmp)==4:
 y.append(reading_tmp.pop())
 x[0].append(reading_tmp.pop())
 x[1].append(reading_tmp.pop())
 x[2].append(reading_tmp.pop())
 temp = float(temp_tmp[0])
 del temp_tmp[:]
 x[3].append(temp-20.0 if temp > 20.0 else 0.0)
 x[4].append(16.0-temp if temp < 16.0 else 0.0)
 x[5].append(5.0-temp if temp < 5.0 else 0.0)
 reading_tmp.append(float(reading))
 temp_tmp.append(float(temperature))
 i = i + 1
 return str(line[0]),reg_m(y, x).params.tolist()
 if __name__ == "__main__":
 if len(sys.argv) != 4:
 print("Usage: regression.py checkpointDir trainingDataDir "
 "streamDataDir", file=sys.stderr)
 exit(-1)
 checkpoint, trainingInput, streamInput = sys.argv[1:]
 sc = SparkContext("local[2]", appName="BenchmarkSparkStreaming")
 trainingLines = sc.textFile(trainingInput)
 modelRDD = trainingLines.map(lambda line: splitLine(line, "|"))\
 .groupByKey()\
 .map(lambda line: train(line))\
 .cache()
 ssc = StreamingContext(sc, 2)
 ssc.checkpoint(checkpoint)
 lines = ssc.textFileStream(streamInput).map(lambda line: splitLine(line, 
 "|"))
 testRDD = lines.groupByKeyAndWindow(4,2).map(lambda line:(str(line[0]), 
 line[1])).transform(lambda rdd:  rdd.leftOuterJoin(modelRDD))
 testRDD.pprint(20)
 ssc.start()
 ssc.awaitTermination()
 {code}
 
 {code}
 15/06/18 12:25:37 INFO FileInputDStream: Duration for remembering RDDs set to 
 6 ms for org.apache.spark.streaming.dstream.FileInputDStream@15b81ee6
 Traceback (most recent call last):
   File 
 /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/util.py,
  line 90, in dumps
 return bytearray(self.serializer.dumps((func.func, func.deserializers)))
   File 
 /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py,
  line 427, in dumps
 return cloudpickle.dumps(obj, 2)
   File 
 /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py,
  line 622, in dumps
 cp.dump(obj)
   File 
 /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py,
  line 107, in dump
 return Pickler.dump(self, obj)
   File /usr/lib/python2.7/pickle.py, line 224, in dump
 self.save(obj)
   File /usr/lib/python2.7/pickle.py, line 286, in save
 f(self, obj) # Call unbound method with explicit self
   File /usr/lib/python2.7/pickle.py, line 548, in save_tuple
 save(element)
   File /usr/lib/python2.7/pickle.py, line 286, in save
 f(self, obj) # Call unbound method with explicit self
   File 
 /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py,
  line 193, in save_function
 self.save_function_tuple(obj)
   File 
 /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py,
  line 236, in save_function_tuple
 save((code, closure, base_globals))
   File /usr/lib/python2.7/pickle.py, line 286, in save
 f(self, obj) # Call unbound method with explicit self
   File 

[jira] [Commented] (SPARK-8509) Failed to JOIN in pyspark

2015-06-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595163#comment-14595163
 ] 

Joseph K. Bradley commented on SPARK-8509:
--

I believe the bug is here, where you'll need to be careful about what is a 
DStream vs RDD vs local data:
{code}
testRDD = lines.groupByKeyAndWindow(4,2).map(lambda line:(str(line[0]), 
line[1])).transform(lambda rdd:  rdd.leftOuterJoin(modelRDD))
testRDD.pprint(20)
{code}

I believe this is a bug in your code, not in Spark, so please post on the user 
list instead of JIRA.  I'll close this for now.

 Failed to JOIN in pyspark
 -

 Key: SPARK-8509
 URL: https://issues.apache.org/jira/browse/SPARK-8509
 Project: Spark
  Issue Type: Bug
Reporter: afancy

 Hi,
 I am writing a pyspark streaming program. I have the training data set to 
 compute the regression model. I want to use the stream data set to test the 
 model. So, I join the RDD with the StreamRDD, but I got an exception. Following 
 is my source code, and the exception I got. Any help is appreciated. Thanks
 Regards,
 Afancy
 
 {code}
 from __future__ import print_function
 import sys,os,datetime
 from pyspark import SparkContext
 from pyspark.streaming import StreamingContext
 from pyspark.sql.context import SQLContext
 from pyspark.resultiterable import ResultIterable
 from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
 import numpy as np
 import statsmodels.api as sm
 def splitLine(line, delimiter='|'):
 values = line.split(delimiter)
 st = datetime.datetime.strptime(values[1], '%Y-%m-%d %H:%M:%S')
 return (values[0],st.hour), values[2:]
 def reg_m(y, x):
 ones = np.ones(len(x[0]))
 X = sm.add_constant(np.column_stack((x[0], ones)))
 for ele in x[1:]:
 X = sm.add_constant(np.column_stack((ele, X)))
 results = sm.OLS(y, X).fit()
 return results
 def train(line):
 y,x = [],[]
 y, x = [],[[],[],[],[],[],[]]
 reading_tmp,temp_tmp = [],[]
 i = 0
 for reading, temperature in line[1]:
 if i%4==0 and len(reading_tmp)==4:
 y.append(reading_tmp.pop())
 x[0].append(reading_tmp.pop())
 x[1].append(reading_tmp.pop())
 x[2].append(reading_tmp.pop())
 temp = float(temp_tmp[0])
 del temp_tmp[:]
 x[3].append(temp-20.0 if temp > 20.0 else 0.0)
 x[4].append(16.0-temp if temp < 16.0 else 0.0)
 x[5].append(5.0-temp if temp < 5.0 else 0.0)
 reading_tmp.append(float(reading))
 temp_tmp.append(float(temperature))
 i = i + 1
 return str(line[0]),reg_m(y, x).params.tolist()
 if __name__ == "__main__":
 if len(sys.argv) != 4:
 print("Usage: regression.py checkpointDir trainingDataDir "
 "streamDataDir", file=sys.stderr)
 exit(-1)
 checkpoint, trainingInput, streamInput = sys.argv[1:]
 sc = SparkContext("local[2]", appName="BenchmarkSparkStreaming")
 trainingLines = sc.textFile(trainingInput)
 modelRDD = trainingLines.map(lambda line: splitLine(line, "|"))\
 .groupByKey()\
 .map(lambda line: train(line))\
 .cache()
 ssc = StreamingContext(sc, 2)
 ssc.checkpoint(checkpoint)
 lines = ssc.textFileStream(streamInput).map(lambda line: splitLine(line, 
 "|"))
 testRDD = lines.groupByKeyAndWindow(4,2).map(lambda line:(str(line[0]), 
 line[1])).transform(lambda rdd:  rdd.leftOuterJoin(modelRDD))
 testRDD.pprint(20)
 ssc.start()
 ssc.awaitTermination()
 {code}
 
 {code}
 15/06/18 12:25:37 INFO FileInputDStream: Duration for remembering RDDs set to 
 6 ms for org.apache.spark.streaming.dstream.FileInputDStream@15b81ee6
 Traceback (most recent call last):
   File 
 /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/util.py,
  line 90, in dumps
 return bytearray(self.serializer.dumps((func.func, func.deserializers)))
   File 
 /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py,
  line 427, in dumps
 return cloudpickle.dumps(obj, 2)
   File 
 /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py,
  line 622, in dumps
 cp.dump(obj)
   File 
 /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py,
  line 107, in dump
 return Pickler.dump(self, obj)
   File /usr/lib/python2.7/pickle.py, line 224, in dump
 self.save(obj)
   File /usr/lib/python2.7/pickle.py, line 286, in save
 f(self, obj) # Call unbound method with explicit self
   File /usr/lib/python2.7/pickle.py, line 548, in save_tuple
 save(element)
   File /usr/lib/python2.7/pickle.py, line 286, in save
 f(self, obj) # Call unbound method with explicit self
   File 

[jira] [Created] (SPARK-8512) Web UI Inconsistent with History Server in Standalone Mode

2015-06-21 Thread Jonathon Cai (JIRA)
Jonathon Cai created SPARK-8512:
---

 Summary: Web UI Inconsistent with History Server in Standalone Mode
 Key: SPARK-8512
 URL: https://issues.apache.org/jira/browse/SPARK-8512
 Project: Spark
  Issue Type: Bug
Reporter: Jonathon Cai


I find that when I examine the web UI, a couple bugs arise:

1. There is a discrepancy between the number denoting the duration of the
application when I run the history server and the number given by the web UI
(default address is master:8080). I checked more specific details, including
task and stage durations (when clicking on the application), and these
appear to be the same for both avenues.

2. Sometimes the web UI on master:8080 is unable to display more specific
information for an application that has finished (when clicking on the
application), even when there is a log file in the appropriate directory.
But when the history server is opened, it is able to read this file and
output information.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8512) Web UI Inconsistent with History Server in Standalone Mode

2015-06-21 Thread Jonathon Cai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathon Cai updated SPARK-8512:

Affects Version/s: 1.4.0

 Web UI Inconsistent with History Server in Standalone Mode
 --

 Key: SPARK-8512
 URL: https://issues.apache.org/jira/browse/SPARK-8512
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.0
Reporter: Jonathon Cai

 I find that when I examine the web UI, a couple bugs arise:
 1. There is a discrepancy between the number denoting the duration of the
 application when I run the history server and the number given by the web UI
 (default address is master:8080). I checked more specific details, including
 task and stage durations (when clicking on the application), and these
 appear to be the same for both avenues.
 2. Sometimes the web UI on master:8080 is unable to display more specific
 information for an application that has finished (when clicking on the
 application), even when there is a log file in the appropriate directory.
 But when the history server is opened, it is able to read this file and
 output information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8510) NumPy arrays and matrices as values in sequence files

2015-06-21 Thread Peter Aberline (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Aberline updated SPARK-8510:
--
Summary: NumPy arrays and matrices as values in sequence files  (was: 
Support NumPy arrays and matrices as values in sequence files)

 NumPy arrays and matrices as values in sequence files
 -

 Key: SPARK-8510
 URL: https://issues.apache.org/jira/browse/SPARK-8510
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Peter Aberline
Priority: Minor

 I have extended the provided example code DoubleArrayWritable example to 
 store NumPy double type arrays and matrices as arrays of doubles and nested 
 arrays of doubles.
 Pandas DataFrames can be easily converted to NumPy matrices, so I've also 
 added the ability to store the schema-less data from DataFrames and Series 
 that contain double data. 
 Other than my own use there seems to be demand for this functionality:
 http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E
 I'll be issuing a PR for this shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5

2015-06-21 Thread Bolke de Bruin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595205#comment-14595205
 ] 

Bolke de Bruin commented on SPARK-5111:
---

This patch does not work on a Hadoop 2.6 + Kerberos + YARN cluster. The 
SparkContext cannot be set up unless the patch is applied earlier in the process 
(in SparkContext.init). However, just moving the patch makes the connection fail 
in HiveContext. Keeping it (i.e. duplicating it) in two places does not work, as 
the class became stale if I remember correctly.

The latest Spark build (1.5.0-SNAPSHOT) from git does not seem to go beyond Hive 
0.13 yet, even if something like

spark.sql.hive.metastore.version    0.14
spark.sql.hive.metastore.jars   
/usr/hdp/2.2.0.0-2041/hadoop/conf:/usr/hdp/2.2.0.0-2041/hadoop/lib/*:/usr/hdp/2.2.0.0-2041/hadoop/.//*:/usr/hdp/2.2.0.0-2041/hadoop-hdfs/./:/usr/hdp/2.2.0.0-2041/hadoop-hdfs/lib/*:/usr/hdp/2.2.0.0-2041/hadoop-hdfs/.//*:/usr/hdp/2.2.0.0-2041/hadoop-yarn/lib/*:/usr/hdp/2.2.0.0-2041/hadoop-yarn/.//*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/lib/*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/.//*::/usr/share/java/mysql-connector-java-5.1.17.jar:/usr/share/java/mysql-connector-java-5.1.34-bin.jar:/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-mapreduce-client/*:/usr/hdp/current/tez-client/*:/usr/hdp/current/tez-client/lib/*:/etc/tez/conf/:/usr/hdp/2.2.0.0-2041/tez/*:/usr/hdp/2.2.0.0-2041/tez/lib/*:/etc/tez/conf:/usr/hdp/2.2.0.0-2041/hive/lib/*

is configured (an out-of-memory error is then thrown, which I could not get rid of).



 HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
 ---

 Key: SPARK-5111
 URL: https://issues.apache.org/jira/browse/SPARK-5111
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhan Zhang

 Due to java.lang.NoSuchFieldError: SASL_PROPS error. Need to backport some 
 hive-0.14 fix into spark, since there is no effort to upgrade hive to 0.14 
 support in spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7715) Update MLlib Programming Guide for 1.4

2015-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7715.
--
   Resolution: Fixed
Fix Version/s: (was: 1.4.0)
   1.4.1
   1.5.0

Issue resolved by pull request 6897
[https://github.com/apache/spark/pull/6897]

 Update MLlib Programming Guide for 1.4
 --

 Key: SPARK-7715
 URL: https://issues.apache.org/jira/browse/SPARK-7715
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
 Fix For: 1.5.0, 1.4.1


 Before the release, we need to update the MLlib Programming Guide.  Updates 
 will include:
 * Add migration guide subsection.
 ** Use the results of the QA audit JIRAs.
 * Check phrasing, especially in main sections (for outdated items such as "In 
 this release, ...")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6813) SparkR style guide

2015-06-21 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595159#comment-14595159
 ] 

Yu Ishikawa commented on SPARK-6813:


I got it. As you mentioned, disabling those two checks affects {{if-else}} 
blocks. And according to the Google R guide, an opening curly brace should 
never go on its own line, as you know.

Alright, I think we should go with your suggestion. If {{lintr}} supports any 
linters for {{if}} clauses and {{function}} definitions in the future, we will 
discuss that again. What do you think about that?

I asked the creator of lintr about the curly-brace linter in detail by email.

 SparkR style guide
 --

 Key: SPARK-6813
 URL: https://issues.apache.org/jira/browse/SPARK-6813
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman

 We should develop a SparkR style guide document based on the some of the 
 guidelines we use and some of the best practices in R.
 Some examples of R style guide are:
 http://r-pkgs.had.co.nz/r.html#style 
 http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
 A related issue is to work on an automatic style-checking tool. 
 https://github.com/jimhester/lintr seems promising
 We could have a R style guide based on the one from google [1], and adjust 
 some of them with the conversation in Spark:
 1. Line Length: maximum 100 characters
 2. no limit on function name (API should be similar as in other languages)
 3. Allow S4 objects/methods



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7604) Python API for PCA and PCAModel

2015-06-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7604.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6315
[https://github.com/apache/spark/pull/6315]

 Python API for PCA and PCAModel
 ---

 Key: SPARK-7604
 URL: https://issues.apache.org/jira/browse/SPARK-7604
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
Assignee: Yanbo Liang
 Fix For: 1.5.0


 Python API for org.apache.spark.mllib.feature.PCA and 
 org.apache.spark.mllib.feature.PCAModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7499) Investigate how to specify columns in SparkR without $ or strings

2015-06-21 Thread Ben Sully (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595171#comment-14595171
 ] 

Ben Sully commented on SPARK-7499:
--

I just had a quick go and it seems quite possible:

```
name_to_col <- function(colname, data) {
  eval(parse(text = paste0(substitute(data), "$", colname)))
}
```

The colname and then column could be extracted using lazyeval::all_dots like 
this:

```
select_.DataFrame <- function(.data, ..., .dots) {
  dots <- lazyeval::all_dots(..., .dots)
  colnames <- sapply(dots, function(x) x$expr)
  columns <- sapply(colnames, name_to_col, data = .data)

  do_something_with_columns()
}

select(df, name, age)
```

The method I've used is quite primitive, in that it just uses `substitute` to 
find out the name of the DataFrame object being passed, then uses `eval` to 
resolve the {tablename}${column} into the actual object which then gets passed 
to SparkR. It feels quite hacky though, by relying on the `$` method of 
DataFrames.

This is just adding a method to dplyr which allows its select, mutate, filter, 
group_by and summarise functions to work with DataFrames. I think the part 
which allows it to work without strings or df$age is the lazyeval package.

 Investigate how to specify columns in SparkR without $ or strings
 -

 Key: SPARK-7499
 URL: https://issues.apache.org/jira/browse/SPARK-7499
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman

 Right now in SparkR we need to specify the columns used using `$` or strings. 
 For example to run select we would do
 {code}
 df1 <- select(df, df$age > 10)
 {code}
 It would be good to infer the set of columns in a dataframe automatically and 
 resolve symbols for column names. For example
 {code} 
 df1 <- select(df, age > 10)
 {code}
 One way to do this is to build an environment with all the column names to 
 column handles and then use `substitute(arg, env = columnNameEnv)`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8508) Test case SQLQuerySuite.test script transform for stderr generates super long output

2015-06-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-8508.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6925
[https://github.com/apache/spark/pull/6925]

 Test case SQLQuerySuite.test script transform for stderr generates super 
 long output
 --

 Key: SPARK-8508
 URL: https://issues.apache.org/jira/browse/SPARK-8508
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor
 Fix For: 1.5.0


 This test case writes 100,000 lines of integer triples to stderr, and makes 
 Jenkins build output unnecessarily large and hard to debug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7499) Investigate how to specify columns in SparkR without $ or strings

2015-06-21 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595153#comment-14595153
 ] 

Shivaram Venkataraman commented on SPARK-7499:
--

[~sd2k] Thanks for taking a shot at this -- One question I have about this 
approach is whether we can avoid implementing new versions of all these 
functions, i.e. is there a common sub-routine we can create that, given a set of 
identifiers / columns, will resolve them to SparkR DataFrame columns? 

Also, just to help me understand this better, what is the specific function in 
dplyr we are using here?

 Investigate how to specify columns in SparkR without $ or strings
 -

 Key: SPARK-7499
 URL: https://issues.apache.org/jira/browse/SPARK-7499
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman

 Right now in SparkR we need to specify the columns used using `$` or strings. 
 For example to run select we would do
 {code}
 df1 <- select(df, df$age > 10)
 {code}
 It would be good to infer the set of columns in a dataframe automatically and 
 resolve symbols for column names. For example
 {code} 
 df1 <- select(df, age > 10)
 {code}
 One way to do this is to build an environment with all the column names to 
 column handles and then use `substitute(arg, env = columnNameEnv)`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7426) spark.ml AttributeFactory.fromStructField should allow other NumericTypes

2015-06-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7426:
-
Assignee: Mike Dusenberry

 spark.ml AttributeFactory.fromStructField should allow other NumericTypes
 -

 Key: SPARK-7426
 URL: https://issues.apache.org/jira/browse/SPARK-7426
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Mike Dusenberry
Priority: Minor

 It currently only supports DoubleType, but it should support others, at least 
 for fromStructField (importing into ML attribute format, rather than 
 exporting).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8509) Failed to JOIN in pyspark

2015-06-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8509:
-
Description: 
Hi,

I am writing a pyspark streaming program. I have a training data set for computing 
the regression model, and I want to use the stream data set to test the model. So 
I join the RDD with the stream RDD, but I get an exception. Below are my source 
code and the exception I got. Any help is appreciated. Thanks

Regards,
Afancy


{code}
from __future__ import print_function
import sys, os, datetime

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql.context import SQLContext
from pyspark.resultiterable import ResultIterable
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
import numpy as np
import statsmodels.api as sm


def splitLine(line, delimiter='|'):
    values = line.split(delimiter)
    st = datetime.datetime.strptime(values[1], '%Y-%m-%d %H:%M:%S')
    return (values[0], st.hour), values[2:]

def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

def train(line):
    y, x = [], []
    y, x = [], [[], [], [], [], [], []]
    reading_tmp, temp_tmp = [], []
    i = 0
    for reading, temperature in line[1]:
        if i % 4 == 0 and len(reading_tmp) == 4:
            y.append(reading_tmp.pop())
            x[0].append(reading_tmp.pop())
            x[1].append(reading_tmp.pop())
            x[2].append(reading_tmp.pop())
            temp = float(temp_tmp[0])
            del temp_tmp[:]
            x[3].append(temp - 20.0 if temp > 20.0 else 0.0)
            x[4].append(16.0 - temp if temp < 16.0 else 0.0)
            x[5].append(5.0 - temp if temp < 5.0 else 0.0)
        reading_tmp.append(float(reading))
        temp_tmp.append(float(temperature))
        i = i + 1
    return str(line[0]), reg_m(y, x).params.tolist()

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: regression.py checkpointDir trainingDataDir streamDataDir",
              file=sys.stderr)
        exit(-1)

    checkpoint, trainingInput, streamInput = sys.argv[1:]
    sc = SparkContext("local[2]", appName="BenchmarkSparkStreaming")

    trainingLines = sc.textFile(trainingInput)
    modelRDD = trainingLines.map(lambda line: splitLine(line, "|"))\
        .groupByKey()\
        .map(lambda line: train(line))\
        .cache()

    ssc = StreamingContext(sc, 2)
    ssc.checkpoint(checkpoint)
    lines = ssc.textFileStream(streamInput).map(lambda line: splitLine(line, "|"))

    testRDD = lines.groupByKeyAndWindow(4, 2)\
        .map(lambda line: (str(line[0]), line[1]))\
        .transform(lambda rdd: rdd.leftOuterJoin(modelRDD))
    testRDD.pprint(20)

    ssc.start()
    ssc.awaitTermination()


15/06/18 12:25:37 INFO FileInputDStream: Duration for remembering RDDs set to 
6 ms for org.apache.spark.streaming.dstream.FileInputDStream@15b81ee6
Traceback (most recent call last):
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/util.py,
 line 90, in dumps
return bytearray(self.serializer.dumps((func.func, func.deserializers)))
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py, 
line 427, in dumps
return cloudpickle.dumps(obj, 2)
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 622, in dumps
cp.dump(obj)
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 107, in dump
return Pickler.dump(self, obj)
  File /usr/lib/python2.7/pickle.py, line 224, in dump
self.save(obj)
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.7/pickle.py, line 548, in save_tuple
save(element)
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 193, in save_function
self.save_function_tuple(obj)
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 236, in save_function_tuple
save((code, closure, base_globals))
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.7/pickle.py, line 548, in save_tuple
save(element)
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.7/pickle.py, line 600, in save_list
self._batch_appends(iter(obj))
  File /usr/lib/python2.7/pickle.py, line 633, in _batch_appends
save(x)
  File 

[jira] [Updated] (SPARK-8509) Failed to JOIN in pyspark

2015-06-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8509:
-
Description: 
Hi,

I am writing a pyspark streaming program. I have a training data set for computing 
the regression model, and I want to use the stream data set to test the model. So 
I join the RDD with the stream RDD, but I get an exception. Below are my source 
code and the exception I got. Any help is appreciated. Thanks

Regards,
Afancy


{code}
from __future__ import print_function
import sys, os, datetime

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql.context import SQLContext
from pyspark.resultiterable import ResultIterable
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
import numpy as np
import statsmodels.api as sm


def splitLine(line, delimiter='|'):
    values = line.split(delimiter)
    st = datetime.datetime.strptime(values[1], '%Y-%m-%d %H:%M:%S')
    return (values[0], st.hour), values[2:]

def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

def train(line):
    y, x = [], []
    y, x = [], [[], [], [], [], [], []]
    reading_tmp, temp_tmp = [], []
    i = 0
    for reading, temperature in line[1]:
        if i % 4 == 0 and len(reading_tmp) == 4:
            y.append(reading_tmp.pop())
            x[0].append(reading_tmp.pop())
            x[1].append(reading_tmp.pop())
            x[2].append(reading_tmp.pop())
            temp = float(temp_tmp[0])
            del temp_tmp[:]
            x[3].append(temp - 20.0 if temp > 20.0 else 0.0)
            x[4].append(16.0 - temp if temp < 16.0 else 0.0)
            x[5].append(5.0 - temp if temp < 5.0 else 0.0)
        reading_tmp.append(float(reading))
        temp_tmp.append(float(temperature))
        i = i + 1
    return str(line[0]), reg_m(y, x).params.tolist()

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: regression.py checkpointDir trainingDataDir streamDataDir",
              file=sys.stderr)
        exit(-1)

    checkpoint, trainingInput, streamInput = sys.argv[1:]
    sc = SparkContext("local[2]", appName="BenchmarkSparkStreaming")

    trainingLines = sc.textFile(trainingInput)
    modelRDD = trainingLines.map(lambda line: splitLine(line, "|"))\
        .groupByKey()\
        .map(lambda line: train(line))\
        .cache()

    ssc = StreamingContext(sc, 2)
    ssc.checkpoint(checkpoint)
    lines = ssc.textFileStream(streamInput).map(lambda line: splitLine(line, "|"))

    testRDD = lines.groupByKeyAndWindow(4, 2)\
        .map(lambda line: (str(line[0]), line[1]))\
        .transform(lambda rdd: rdd.leftOuterJoin(modelRDD))
    testRDD.pprint(20)

    ssc.start()
    ssc.awaitTermination()
{code}



{code}
15/06/18 12:25:37 INFO FileInputDStream: Duration for remembering RDDs set to 
6 ms for org.apache.spark.streaming.dstream.FileInputDStream@15b81ee6
Traceback (most recent call last):
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/util.py,
 line 90, in dumps
return bytearray(self.serializer.dumps((func.func, func.deserializers)))
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py, 
line 427, in dumps
return cloudpickle.dumps(obj, 2)
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 622, in dumps
cp.dump(obj)
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 107, in dump
return Pickler.dump(self, obj)
  File /usr/lib/python2.7/pickle.py, line 224, in dump
self.save(obj)
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.7/pickle.py, line 548, in save_tuple
save(element)
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 193, in save_function
self.save_function_tuple(obj)
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 236, in save_function_tuple
save((code, closure, base_globals))
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.7/pickle.py, line 548, in save_tuple
save(element)
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.7/pickle.py, line 600, in save_list
self._batch_appends(iter(obj))
  File /usr/lib/python2.7/pickle.py, line 633, in _batch_appends

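A possible workaround for the SPARK-8509 failure above (a sketch only, not the 
reporter's code): the pickling error likely comes from the transform() closure 
capturing modelRDD, since RDDs cannot be serialized inside another closure. One 
can instead collect the trained per-key parameters to the driver, ship them as a 
broadcast variable, and attach them in a plain map. The sketch below assumes the 
same splitLine/train helpers and the sc, lines and modelRDD variables defined in 
the program above.

{code}
# Hedged sketch (hypothetical workaround, reusing sc, lines and modelRDD from
# the program above): avoid referencing an RDD from inside the DStream closure
# by broadcasting the collected model parameters instead.
modelParams = dict(modelRDD.collect())        # {str((id, hour)): params}
modelBroadcast = sc.broadcast(modelParams)

def attachModel(kv):
    key, readings = kv
    # Look the key up in the broadcast copy instead of leftOuterJoin-ing RDDs.
    return key, (readings, modelBroadcast.value.get(key))

testRDD = lines.groupByKeyAndWindow(4, 2)\
    .map(lambda line: (str(line[0]), line[1]))\
    .map(attachModel)
testRDD.pprint(20)
{code}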
[jira] [Commented] (SPARK-8472) Python API for DCT

2015-06-21 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594977#comment-14594977
 ] 

Yu Ishikawa commented on SPARK-8472:


Please assign this issue to me.

 Python API for DCT
 --

 Key: SPARK-8472
 URL: https://issues.apache.org/jira/browse/SPARK-8472
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Feynman Liang
Priority: Minor

 We need to implement a wrapper for enabling the DCT feature transformer to be 
 used from the Python API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution

2015-06-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-8379.
---
   Resolution: Fixed
Fix Version/s: 1.4.1
   1.5.0

Issue resolved by pull request 6833
[https://github.com/apache/spark/pull/6833]

 LeaseExpiredException when using dynamic partition with speculative execution
 -

 Key: SPARK-8379
 URL: https://issues.apache.org/jira/browse/SPARK-8379
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: jeanlyn
Assignee: jeanlyn
 Fix For: 1.5.0, 1.4.1


 When inserting into a table using dynamic partitions with 
 *spark.speculation=true*, skewed data in some partitions triggers speculative 
 tasks, which can throw an exception like
 {code}
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
  Lease mismatch on 
 /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo
  owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but 
 is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57
 {code}
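
A hedged mitigation sketch (an assumption on my part, not the fix merged in pull 
request 6833): until running a release that contains the fix, the competing task 
attempts can be avoided by turning speculative execution off for jobs that write 
dynamic partitions, e.g. from PySpark:

{code}
# Sketch only: disabling speculation removes the second attempt that races
# for the same dynamic-partition output file (workaround, not the real fix).
from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setAppName("dynamic-partition-insert") \
    .set("spark.speculation", "false")
sc = SparkContext(conf=conf)
{code}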



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8508) Test case SQLQuerySuite.test script transform for stderr generates super long output

2015-06-21 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-8508:
-

 Summary: Test case SQLQuerySuite.test script transform for 
stderr generates super long output
 Key: SPARK-8508
 URL: https://issues.apache.org/jira/browse/SPARK-8508
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


This test case writes 100,000 lines of integer triples to stderr, and makes 
Jenkins build output unnecessarily large and hard to debug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8508) Test case SQLQuerySuite.test script transform for stderr generates super long output

2015-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594985#comment-14594985
 ] 

Apache Spark commented on SPARK-8508:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/6925

 Test case SQLQuerySuite.test script transform for stderr generates super 
 long output
 --

 Key: SPARK-8508
 URL: https://issues.apache.org/jira/browse/SPARK-8508
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor

 This test case writes 100,000 lines of integer triples to stderr, and makes 
 Jenkins build output unnecessarily large and hard to debug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8508) Test case SQLQuerySuite.test script transform for stderr generates super long output

2015-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8508:
---

Assignee: Cheng Lian  (was: Apache Spark)

 Test case SQLQuerySuite.test script transform for stderr generates super 
 long output
 --

 Key: SPARK-8508
 URL: https://issues.apache.org/jira/browse/SPARK-8508
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor

 This test case writes 100,000 lines of integer triples to stderr, and makes 
 Jenkins build output unnecessarily large and hard to debug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8508) Test case SQLQuerySuite.test script transform for stderr generates super long output

2015-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8508:
---

Assignee: Apache Spark  (was: Cheng Lian)

 Test case SQLQuerySuite.test script transform for stderr generates super 
 long output
 --

 Key: SPARK-8508
 URL: https://issues.apache.org/jira/browse/SPARK-8508
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Apache Spark
Priority: Minor

 This test case writes 100,000 lines of integer triples to stderr, and makes 
 Jenkins build output unnecessarily large and hard to debug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8509) Failed to JOIN in pyspark

2015-06-21 Thread afancy (JIRA)
afancy created SPARK-8509:
-

 Summary: Failed to JOIN in pyspark
 Key: SPARK-8509
 URL: https://issues.apache.org/jira/browse/SPARK-8509
 Project: Spark
  Issue Type: Bug
Reporter: afancy


Hi,

I am writing a pyspark streaming program. I have a training data set for computing 
the regression model, and I want to use the stream data set to test the model. So 
I join the RDD with the stream RDD, but I get an exception. Below are my source 
code and the exception I got. Any help is appreciated. Thanks

Regards,
Afancy


from __future__ import print_function
import sys, os, datetime

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql.context import SQLContext
from pyspark.resultiterable import ResultIterable
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
import numpy as np
import statsmodels.api as sm


def splitLine(line, delimiter='|'):
    values = line.split(delimiter)
    st = datetime.datetime.strptime(values[1], '%Y-%m-%d %H:%M:%S')
    return (values[0], st.hour), values[2:]

def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

def train(line):
    y, x = [], []
    y, x = [], [[], [], [], [], [], []]
    reading_tmp, temp_tmp = [], []
    i = 0
    for reading, temperature in line[1]:
        if i % 4 == 0 and len(reading_tmp) == 4:
            y.append(reading_tmp.pop())
            x[0].append(reading_tmp.pop())
            x[1].append(reading_tmp.pop())
            x[2].append(reading_tmp.pop())
            temp = float(temp_tmp[0])
            del temp_tmp[:]
            x[3].append(temp - 20.0 if temp > 20.0 else 0.0)
            x[4].append(16.0 - temp if temp < 16.0 else 0.0)
            x[5].append(5.0 - temp if temp < 5.0 else 0.0)
        reading_tmp.append(float(reading))
        temp_tmp.append(float(temperature))
        i = i + 1
    return str(line[0]), reg_m(y, x).params.tolist()

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: regression.py checkpointDir trainingDataDir streamDataDir",
              file=sys.stderr)
        exit(-1)

    checkpoint, trainingInput, streamInput = sys.argv[1:]
    sc = SparkContext("local[2]", appName="BenchmarkSparkStreaming")

    trainingLines = sc.textFile(trainingInput)
    modelRDD = trainingLines.map(lambda line: splitLine(line, "|"))\
        .groupByKey()\
        .map(lambda line: train(line))\
        .cache()

    ssc = StreamingContext(sc, 2)
    ssc.checkpoint(checkpoint)
    lines = ssc.textFileStream(streamInput).map(lambda line: splitLine(line, "|"))

    testRDD = lines.groupByKeyAndWindow(4, 2)\
        .map(lambda line: (str(line[0]), line[1]))\
        .transform(lambda rdd: rdd.leftOuterJoin(modelRDD))
    testRDD.pprint(20)

    ssc.start()
    ssc.awaitTermination()


15/06/18 12:25:37 INFO FileInputDStream: Duration for remembering RDDs set to 
6 ms for org.apache.spark.streaming.dstream.FileInputDStream@15b81ee6
Traceback (most recent call last):
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/util.py,
 line 90, in dumps
return bytearray(self.serializer.dumps((func.func, func.deserializers)))
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py, 
line 427, in dumps
return cloudpickle.dumps(obj, 2)
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 622, in dumps
cp.dump(obj)
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 107, in dump
return Pickler.dump(self, obj)
  File /usr/lib/python2.7/pickle.py, line 224, in dump
self.save(obj)
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.7/pickle.py, line 548, in save_tuple
save(element)
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 193, in save_function
self.save_function_tuple(obj)
  File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, 
line 236, in save_function_tuple
save((code, closure, base_globals))
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.7/pickle.py, line 548, in save_tuple
save(element)
  File /usr/lib/python2.7/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.7/pickle.py, line 600, in save_list
self._batch_appends(iter(obj))
  File